Towards coordinated and robust real-time control: a decentralized approach for combined sewer overflow and urban flooding reduction based on multi-agent reinforcement learning

Zhiyu Zhang, Wenchong Tian, Zhenliang Liao · Water Research · 2023


Tags: multi-agent-reinforcement-learning, real-time-control, urban-drainage, combined-sewer-overflow, communication-robustness, decentralized-control

TL;DR

Proposes a Value Decomposition Network (VDN) multi-agent reinforcement learning (MARL) framework for decentralized real-time control of urban drainage systems, trained centrally but executed locally without inter-agent communication. Across two SWMM case studies, VDN achieves near-centralized CSO/flooding reduction while outperforming a fully centralized DQN in robustness to both observation-channel and action-channel communication failures.

First pass — the five C's

Category. Research prototype — algorithm design and simulation-based experimental evaluation.

Context. Urban drainage real-time control (RTC) subfield. Builds directly on: Mullapudi et al. (2020) — centralized DQN for stormwater control (the DQN baseline here); Sun et al. (2020) — MPC on the same Astlingen benchmark; Sunehag et al. (2018) — VDN cooperative MARL algorithm; Tian et al. (2022a,b) — RL for CSO/flooding in combined sewers.

Correctness. Load-bearing assumptions: (1) SWMM faithfully substitutes for a real system; (2) communication failure is i.i.d. Bernoulli per device per time step; (3) VDN's additive Q decomposition (Q_tot = ΣQ_i) adequately captures agent coordination; (4) 50 historical or 20 synthetic training events span the relevant rainfall distribution. All are plausible but untested within the paper.
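Assumption (3), the additive decomposition, can be made concrete with a minimal tabular sketch (illustrative only; the paper uses deep Q-networks, and all sizes and names here are hypothetical): each agent acts greedily on its own local Q-values, while training backs a single team TD error through the sum Q_tot = ΣQ_i.

```python
import numpy as np

# Hypothetical toy setting: 2 agents, 3 local observations each, 2 actions each.
N_AGENTS, N_OBS, N_ACT = 2, 3, 2
GAMMA, LR = 0.95, 0.1

# One local Q-table per agent; no agent ever sees another agent's observation.
q = [np.zeros((N_OBS, N_ACT)) for _ in range(N_AGENTS)]

def local_actions(obs):
    """Decentralized execution: each agent acts greedily on its own Q only."""
    return [int(np.argmax(q[i][obs[i]])) for i in range(N_AGENTS)]

def vdn_update(obs, acts, reward, next_obs):
    """Centralized training: the TD error is computed on the additive joint
    value Q_tot = sum_i Q_i and shared by every agent's local table."""
    q_tot = sum(q[i][obs[i], acts[i]] for i in range(N_AGENTS))
    next_q_tot = sum(q[i][next_obs[i]].max() for i in range(N_AGENTS))
    td_error = reward + GAMMA * next_q_tot - q_tot
    for i in range(N_AGENTS):
        q[i][obs[i], acts[i]] += LR * td_error
    return td_error

# One illustrative transition with a shared (team) reward.
err = vdn_update(obs=[0, 1], acts=[1, 0], reward=1.0, next_obs=[2, 2])
```

Note how the team reward credits every participating agent's local table through the same TD error; this is the mechanism that can mask unequal contributions when the additivity assumption fails.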

Contributions.
- First MARL application to decentralized RTC of urban drainage systems.
- Demonstration that VDN's centralized training enables decentralized agents to achieve performance close to a fully centralized DQN without requiring real-time inter-agent communication.
- Systematic quantification of both observation-channel and action-channel communication-failure impacts across three RL control architectures.
- An integrated control structure (centralized DQN primary + decentralized VDN/IQL backup) shown to limit degradation under action-communication failures.

Clarity. Well-organized with clear methodology; reward function weights in Case 2 are stated without justification or sensitivity analysis, and hyperparameter details are deferred entirely to supplementary materials.

Second pass — content

Main thrust: VDN's centralized-training/decentralized-execution paradigm narrows the performance gap to fully centralized control while providing structural redundancy that limits performance loss when sensor observations or actuator commands are lost.

Supporting evidence:
- Case 1 (Astlingen benchmark, 101 test rainfalls, 2007–2009): DQN 9.30%, VDN 8.23%, IQL 5.62% total CSO reduction vs. the static baseline (BC); the system's maximum potential (MaxRed) is 18.02%.
- Case 2 (Chaohu real-world model, 100 synthetic test rainfalls, return periods 1–5 yr): DQN 21.36%, VDN 18.49%, IQL 14.39% combined CSO+flooding reduction vs. heuristic control (HC); MaxRed is 23.44%.
- Observation-failure robustness (Case 1): accumulated CSO for all three strategies stays below BC even at 90% per-step failure probability; DQN shows larger performance-loss rates than VDN/IQL at equal failure probabilities.
- Action-failure robustness (Case 1): "DQN & None" (no backup) produces wide, degrading performance distributions; the "DQN & VDN" backup asymptotically approaches standalone VDN performance as failure probability rises toward 100%.
- No strategy reaches MaxRed, confirming the inherent limitation of reactive (non-predictive) control.
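The reduction and performance-loss rates quoted above follow straightforward definitions; a minimal sketch (function names assumed, not taken from the paper):

```python
def reduction_rate(v_strategy, v_baseline):
    """Percent reduction in accumulated CSO/flooding volume relative to a
    baseline (e.g. static BC or heuristic HC), matching the paper's style."""
    return 100.0 * (v_baseline - v_strategy) / v_baseline

def performance_loss_rate(red_nominal, red_under_failure):
    """Share of the nominal reduction that is lost when communication fails."""
    return 100.0 * (red_nominal - red_under_failure) / red_nominal

# Example with Case 1 scale: a strategy leaving 90.70 volume units vs. a
# baseline of 100 achieves a 9.30% reduction; if a failure scenario halves
# that reduction to 4.65%, the performance-loss rate is 50%.
r_nominal = reduction_rate(90.70, 100.0)
loss = performance_loss_rate(r_nominal, 4.65)
```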

Figures & tables:
- Figs. 7–8 (training curves): reward and validation CSO/flooding convergence; axes labeled, but no confidence bands around the curves.
- Fig. 9 (Case 1 testing): bar-style aggregates across 101 events, without confidence intervals or statistical significance.
- Fig. 10 (Case 2): bar charts for 4 representative events plus 100-event averages; no error bars.
- Figs. 12–13 (communication-failure experiments): violin plots over 50 parallel runs each, which appropriately convey distributional spread; dashed reference lines are included.
- Table 4: summarizes reduction rates cleanly but without uncertainty estimates.
- No formal statistical significance tests appear anywhere in the paper.

Follow-up references:
- Sun et al. (2020) — Astlingen MPC benchmark; direct performance-comparison reference for Case 1.
- Mullapudi et al. (2020) — foundational centralized DQN for stormwater; the DQN baseline this paper extends.
- Sunehag et al. (2018) — VDN algorithm; theoretical basis for the cooperative training mechanism.
- van der Werf et al. (2022) — review of real-world RTC implementation barriers; contextualizes the communication-robustness motivation.

Third pass — critique

Implicit assumptions:
- SWMM simulation = reality: no sim-to-real gap analysis; results may not transfer to physical deployment (load-bearing — failure invalidates all quantitative claims).
- i.i.d. Bernoulli communication failure: real network outages tend to be spatially and temporally correlated; the i.i.d. model likely underestimates worst-case degradation for centralized control.
- Linear additive Q decomposition (VDN): assumes agent value contributions are separable; interaction effects between agents sharing a hydraulic reach could violate this.
- Reward weights in Case 2 (5, 2, 1, 2, 0.1) treat the flooding vs. CSO trade-off as fixed; results are sensitive to these weights, which are not analyzed.
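The i.i.d. Bernoulli concern can be illustrated by contrasting it with a two-state Markov (Gilbert-Elliott-style) burst model matched to the same marginal failure probability; a sketch under assumed parameters, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(42)

def iid_failures(p, steps):
    """Paper's assumption: each device/time step fails independently w.p. p."""
    return rng.random(steps) < p

def bursty_failures(p, steps, mean_burst=10):
    """Two-state Markov outages with the same stationary failure probability p,
    but temporally correlated failure runs of ~mean_burst steps."""
    p_recover = 1.0 / mean_burst              # leave the failed state
    p_fail = p * p_recover / (1.0 - p)        # enter it; stationary prob = p
    failed, out = False, np.empty(steps, dtype=bool)
    for t in range(steps):
        failed = (rng.random() > p_recover) if failed else (rng.random() < p_fail)
        out[t] = failed
    return out

steps = 200_000
a, b = iid_failures(0.1, steps), bursty_failures(0.1, steps)
# Same marginal failure rate, but b concentrates failures into long outages,
# the regime in which a centralized controller loses whole sub-catchments.
```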

Missing context or citations:
- No comparison with distributed MPC, which explicitly coordinates across subsystems and is the natural benchmark for coordinated decentralized control.
- QMIX (Rashid et al., 2018) and MADDPG (Lowe et al., 2017), more expressive MARL algorithms, are not discussed or tested; VDN is among the simplest cooperative MARL methods.
- Tian et al. (2022b), which combined RL+MPC on a similar system, is cited but not used as a performance comparison.
- No engagement with sim-to-real transfer literature relevant to deploying learned policies on physical drainage infrastructure.

Possible experimental / analytical issues:
- Case 2 training uses only 20 synthetic events drawn from the same 1–5 year return-period distribution as the 100 test events; generalization is therefore tested within-distribution, not out-of-distribution.
- Robustness experiments are conducted only in Case 1; whether the communication-robustness advantage of VDN holds for the pump-control problem in Case 2 is unknown.
- No statistical significance tests on performance differences between strategies; the DQN–VDN gap (9.30% vs. 8.23% in Case 1; 21.36% vs. 18.49% in Case 2) may not be significant given the event-to-event variability visible in the figures.
- MaxRed is computed via offline trajectory optimization with an unspecified solver and horizon; its tightness as an upper bound is not validated.
- The observation that VDN performance can improve under low failure probabilities (noise acting as implicit regularization) is noted anecdotally but not analyzed.
- The hyperparameter selection procedure and its sensitivity are relegated to supplementary materials with no summary in the main text.
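The missing significance check could be run as a paired bootstrap over per-event reduction rates; a sketch on synthetic data (illustrative only, these numbers are not the paper's):

```python
import numpy as np

rng = np.random.default_rng(7)

def paired_bootstrap_pvalue(red_a, red_b, n_boot=10_000):
    """Two-sided paired bootstrap test for the mean per-event difference in
    reduction rate between two strategies (e.g. DQN vs. VDN over 101 events)."""
    diff = np.asarray(red_a) - np.asarray(red_b)
    observed = diff.mean()
    centered = diff - observed          # enforce the null: mean difference = 0
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = centered[idx].mean(axis=1)
    return float(np.mean(np.abs(boot_means) >= abs(observed)))

# Synthetic per-event reductions: a ~1-point mean gap can be washed out by
# large event-to-event spread, which is the concern raised above.
dqn = rng.normal(9.3, 6.0, size=101)
vdn = dqn - rng.normal(1.1, 6.0, size=101)
p = paired_bootstrap_pvalue(dqn, vdn)
```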

Ideas for future work:
1. Test correlated failure models (e.g., simultaneous outage of all sensors in a sub-catchment) to stress-test the robustness advantage of decentralized control more realistically.
2. Substitute QMIX or MADDPG for VDN to determine whether the additive decomposition constraint limits coordination quality, particularly for the pump-control (discrete, combinatorial action) problem.
3. Conduct out-of-distribution evaluation using rainfall events with return periods greater than 5 years or from different climate scenarios to bound generalization failure.
4. Pilot a hardware-in-the-loop or field trial to quantify the sim-to-real gap, currently the largest unaddressed risk for practical deployment.

Methods

  • multi-agent reinforcement learning
  • value decomposition network (VDN)
  • deep Q-network (DQN)
  • independent Q-learning (IQL)
  • centralized training with decentralized execution
  • epsilon-greedy exploration
  • experience replay
  • dueling network architecture
  • SWMM simulation

Datasets

  • Astlingen benchmark combined sewer SWMM model
  • Chaohu City real-world drainage SWMM model
  • 10-year measured rainfall series (Astlingen)
  • Chicago Hyetograph synthetic rainfalls (Chaohu)

Claims

  • VDN with centralized training achieves similar CSO reduction performance to the fully centralized DQN agent while operating in a decentralized manner.
  • All three RL strategies reduce CSO volume by 5.62-9.30% over a static baseline in the benchmark case and reduce CSO and flooding by 14.39-21.36% over rule-based control in the real-world case.
  • Decentralized agents exhibit smaller performance loss than centralized agents under observation communication failures.
  • Well-trained decentralized agents serve as effective local backups when action commands from a centralized agent are lost, outperforming static fallback settings.
  • Independent Q-learning (IQL) suffers from the 'lazy agent' problem and training instability, while VDN's joint training resolves inter-agent coordination issues.