Distributed and vectorized method of characteristics for fast transient simulations in water distribution systems

Gerardo Riaño-Briceño, Lina Sela, Ben R. Hodges · Computer-Aided Civil and Infrastructure Engineering · 2021



Tags: water-distribution-systems, method-of-characteristics, parallel-computing, transient-flow, vectorization, water-hammer

TL;DR

Proposes DV-MOC, a novel parallel implementation of the Method of Characteristics (MOC) for transient-flow (water hammer) simulation in large water distribution systems (WDS), combining SIMD vectorization via aligned memory allocation with distributed-memory parallelization via MPI. Applied to the BWSN-II benchmark (14,824 pipes, ~6×10¹¹ solution points), the method achieves speedups on the order of thousands over sequential MOC for sufficiently dense numerical grids.

First pass — the five C's

Category. Research prototype — novel algorithm with computational experiments on benchmark network.

Context. Transient flow modeling in WDS; builds on: Wylie et al. (1993) foundational MOC theory for pipe networks; Boyd et al. (2014) OpenMOC SIMD + shared-memory parallelism for neutron transport; Meng et al. (2019a/b) GPU-MOC for single pipes; Ginting & Mundani (2019) vectorized Riemann solvers for shallow-water equations showing vectorization outperforms shared-memory parallelism alone.

Correctness. Three load-bearing assumptions: (1) steady friction model is adequate for large-scale WDS transients (authors cite Duan et al. and Xing & Sela but do not bound error introduced); (2) wave speed adjustment (rather than interpolation) preserves wave-front sharpness without quantified error bounds; (3) a non-optimal, even-split graph partitioning is sufficient to demonstrate distributed speedup — authors explicitly acknowledge this is suboptimal.

Contributions.

  • Reformulates all MOC equations (interior points and five boundary condition types: junctions, sources, valves, pumps, dead ends/leaks/bursts) into compact vector forms enabling contiguous memory allocation and SIMD execution on commodity CPUs.
  • Introduces point-order and lexicographic-order formalisms to handle arbitrary WDS network topologies while maintaining the memory alignment needed for vectorization.
  • Develops DV-MOC: distributes the vectorized MOC across processors using MPI neighbor all-to-all communication, with barrier synchronization at each time step.
  • Demonstrates scalability on BWSN-II (14,824 pipes) with three numerical-grid resolutions on the Stampede2 supercomputer (Intel Xeon Phi 7250 nodes).

Clarity. Notation is dense and carefully defined but demanding; the sequential derivation of vector boundary conditions is thorough; the paper is well-structured, though Section 4's results are cut off, so the primary DV-MOC performance evidence is absent from the provided text.

Second pass — content

Main thrust: DV-MOC reformulates MOC into SIMD-executable vector equations with aligned memory, then partitions the networked numerical grid across MPI processes, achieving scalable speedups unavailable to GPU-MOC (memory-bound) or sequential approaches for large WDS with complex topologies.
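Concretely, the interior-point update that gets vectorized is the classical C⁺/C⁻ characteristic pair from the Wylie et al. MOC formulation. Below is a minimal NumPy sketch for a single pipe with steady friction; array names and structure are illustrative, not the paper's implementation:

```python
import numpy as np

def moc_interior_step(H, Q, B, R):
    """One MOC time step for the interior grid points of a single pipe,
    vectorized over the numerical grid (steady friction).
    H, Q : heads and flows at all grid points at the current time level.
    B    : characteristic impedance, a / (g * A).
    R    : steady-friction resistance coefficient.
    """
    # C+ characteristic carried from the upstream neighbor (i - 1):
    #   Cp = H[i-1] + B*Q[i-1] - R*Q[i-1]*|Q[i-1]|
    Cp = H[:-2] + Q[:-2] * (B - R * np.abs(Q[:-2]))
    # C- characteristic carried from the downstream neighbor (i + 1):
    #   Cm = H[i+1] - B*Q[i+1] + R*Q[i+1]*|Q[i+1]|
    Cm = H[2:] - Q[2:] * (B - R * np.abs(Q[2:]))
    H_new, Q_new = H.copy(), Q.copy()
    H_new[1:-1] = 0.5 * (Cp + Cm)        # head from intersecting characteristics
    Q_new[1:-1] = (Cp - Cm) / (2.0 * B)  # flow from the same pair
    return H_new, Q_new
```

Because the update touches each point through contiguous slices rather than a per-point loop, NumPy can dispatch it to SIMD units; the paper's contribution is arranging an entire multi-pipe network's solution points into aligned arrays so the same trick applies network-wide.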

Supporting evidence:

  • V-MOC (single processor, vectorized) vs. sequential MOC on Example 2.1 (two-pipe, Intel Core i7-6600U, Python/NumPy): 172× speedup at 4,634 solution points; 25× speedup at 521 solution points; speedup grows with problem size due to cache efficiency.
  • Wave speed adjustment impact: with τ = 0.4 s, adjusted wave speeds in Example 2.1 drop to 362.5 m/s and 435 m/s from original values of 1,000 m/s and 800 m/s, demonstrating the large numerical error coarse grids introduce.
  • BWSN-II (Example 4.1): 12,526 nodes, 14,824 pipes, single pump, six inline valves; transient generated by a 10-s valve closure; all pipes assigned a wave speed of 1,000 m/s; simulation duration 80 s; three grid resolutions tested on Stampede2.
  • DV-MOC speedup over V-MOC "in the order of thousands" for BWSN-II at high resolution, stated in the abstract and introduction; the quantitative table/figure results fall in the portion of Section 4 not included in the provided text.
  • Accuracy validated against Bentley HAMMER (details deferred to SI Section 4 of Riaño-Briceño et al., 2021).
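The wave-speed adjustment itself is simple to sketch: round each pipe's number of characteristic reaches to an integer, then recompute the wave speed so the characteristics land exactly on the grid. The pipe lengths below are assumed values chosen only to reproduce the adjusted speeds quoted above for τ = 0.4 s; they are not taken from the paper:

```python
def adjust_wave_speed(length, a, tau):
    """Adjust wave speed a so the pipe of given length divides into an
    integer number of reaches of size a_adj * tau (grid-aligned MOC)."""
    n = max(1, round(length / (a * tau)))  # integer number of reaches
    return length / (n * tau)

# Assumed lengths (145 m, 174 m): short pipes force n = 1 at tau = 0.4 s,
# collapsing 1,000 m/s and 800 m/s to 362.5 m/s and 435 m/s respectively.
a1 = adjust_wave_speed(145.0, 1000.0, 0.4)  # ~362.5 m/s
a2 = adjust_wave_speed(174.0, 800.0, 0.4)   # ~435.0 m/s
```

This is why Figure 4's trade-off matters: shrinking τ lets n grow, which pulls the adjusted speeds back toward their physical values, at the cost of a much larger numerical grid.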

Figures & tables:

  • Figure 1 compares head at node N₂ for τ = 0.145 s vs. τ = 0.4 s; axes appear labeled (time in s, head in m); no error bars.
  • Figure 2a plots simulation time vs. problem size (nm) for sequential vs. V-MOC; axes labeled; no error bars, confidence intervals, or statistical replication reported.
  • Figure 2b illustrates the two-way partitioned numerical grid for Example 2.1.
  • Figure 3 maps BWSN-II with two measurement zones.
  • Figure 4 shows the trade-off between problem size and wave speed adjustment across time step values; no uncertainty quantification.
  • No error bars or statistical significance reporting anywhere in the provided text.

Follow-up references:

  • Wylie et al. (1993): foundational MOC derivation and boundary conditions; essential background.
  • Meng et al. (2019a/b): GPU-MOC, the closest prior parallel transient solver; understanding its limitations contextualizes DV-MOC's claims.
  • Boyd et al. (2014), OpenMOC: SIMD + shared-memory MOC for neutron transport; the direct algorithmic antecedent.
  • Ostfeld et al. (2008): BWSN-II benchmark description needed to interpret the Example 4.1 network properties.

Third pass — critique

Implicit assumptions:

  • All 14,824 pipes in BWSN-II are assigned a uniform wave speed of 1,000 m/s, an artificial homogeneity that eliminates realistic wave speed variability and understates wave speed adjustment errors in practice; the accuracy claims would weaken if wave speeds varied widely across pipes.
  • Steady friction model sufficiency: the claim that unsteady/quasi-steady effects are "negligible" for large-scale WDS is cited but not bounded; transient accuracy for short, fast events may be materially affected.
  • The non-optimal, even-split partitioning does not account for communication cost; speedup results may be pessimistic for topologies with heavy inter-processor communication and optimistic for topologies with few cut edges.
  • Cache-efficiency scaling (V-MOC speedup grows with problem size) assumes the problem fits in available DRAM and that NumPy SIMD dispatch is consistent; neither is verified across hardware configurations.
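To make the partitioning concern concrete, here is a toy sketch (not the paper's code) of an even split over a pipe list, counting the shared junction nodes that force inter-process communication at every time step; the example line network is made up:

```python
def even_split(pipes, n_parts):
    """Naive even split: assign pipes to parts in contiguous chunks,
    mirroring the non-optimal partitioning the authors acknowledge."""
    k, r = divmod(len(pipes), n_parts)
    parts, start = [], 0
    for p in range(n_parts):
        size = k + (1 if p < r else 0)
        parts.append(pipes[start:start + size])
        start += size
    return parts

def count_interface_nodes(parts):
    """Junction nodes touched by pipes assigned to different parts;
    each one implies an MPI exchange every time step."""
    owners = {}
    for p, chunk in enumerate(parts):
        for u, v in chunk:
            owners.setdefault(u, set()).add(p)
            owners.setdefault(v, set()).add(p)
    return sum(1 for procs in owners.values() if len(procs) > 1)

# Made-up 4-pipe line network split across 2 processes: one interface node.
pipes = [(0, 1), (1, 2), (2, 3), (3, 4)]
parts = even_split(pipes, 2)
```

A topology-aware partitioner (e.g., METIS) minimizes exactly this interface count, which is why the critique flags the untested sensitivity to partitioning strategy.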

Missing context or citations:

  • No runtime comparison against GPU-MOC (Meng et al.) despite direct relevance; only algorithmic limitations are argued.
  • No speed comparison against commercial solvers (e.g., Bentley HAMMER, Innovyze InfoSurge); only accuracy is checked, against HAMMER.
  • WCM (Wood et al.) and GCM/AWH (Nault et al., 2016, 2018) are described as sacrificing accuracy for speed, but no numerical accuracy-vs.-speed Pareto comparison is provided to establish where DV-MOC sits.
  • Skeletonization is mentioned as a competing approach but not benchmarked.
  • Finite-volume methods with inherent parallelism (Cao et al., 2020; Zhao & Ghidaoui, 2004) are listed in the introduction but never compared.

Possible experimental / analytical issues:

  • The primary DV-MOC performance results for BWSN-II (Section 4) are cut off in the provided text; the "order of thousands" speedup claim cannot be independently assessed from the available material.
  • Python/NumPy implementation ceiling: an interpreted language with NumPy SIMD dispatch likely leaves substantial performance on the table relative to C/Fortran, so speedup ratios relative to sequential Python MOC may not generalize to compiled-language baselines.
  • No replication across runs; wall-clock times on Stampede2 are subject to node-contention variability, and no standard deviations or confidence intervals are reported.
  • The uniform wave speed assumption in Example 4.1 suppresses the wave speed adjustment problem that the paper identifies as the chief accuracy concern; a heterogeneous network would be more informative.
  • The paper acknowledges the partitioning is non-optimal but does not quantify the performance penalty; sensitivity to partitioning strategy is untested.
  • The memory footprint and scalability ceiling for very fine grids (~6×10¹¹ points) are referenced, but peak memory usage per node is not reported.

Ideas for future work:

  • A GPU implementation of DV-MOC to directly compare distributed-CPU vs. GPU throughput for the same network, resolving the unaddressed GPU-MOC comparison.
  • Optimal or topology-aware graph partitioning (e.g., METIS/KaHyPar) and measurement of its speedup benefit relative to the naive even split used here.
  • Extension to unsteady or quasi-steady friction models within the vectorized framework, with quantified accuracy improvement for high-velocity transients.
  • Application to a realistic network with heterogeneous pipe materials and wave speeds (spanning, e.g., 200–1,400 m/s) to stress-test wave speed adjustment minimization and the accuracy claims under realistic conditions.

Methods

  • Method of Characteristics (MOC)
  • SIMD vectorization
  • distributed-memory parallelization
  • wave speed adjustment
  • MPI (Message Passing Interface) via OpenMPI
  • NumPy vector operations
  • EPANET steady-state initialization
  • graph partitioning for load balancing
  • neighbor all-to-all collective communication

Datasets

  • Battle of the Water Sensor Networks II (BWSN-II) — 12,526 nodes, 14,824 pipes
  • Two-pipe illustrative example

Claims

  • DV-MOC achieves speedup factors on the order of thousands over sequential MOC for sufficiently dense numerical grids in large-scale water distribution systems.
  • Vectorization alone (V-MOC) speeds up MOC by up to hundreds of times on a single processor compared to sequential MOC.
  • The proposed aligned memory allocation scheme enables SIMD vectorization to concurrently compute MOC equations for arbitrary networked topologies.
  • Distributed-memory parallelization ensures memory scalability for large-scale WDS with tens of thousands of pipes and nearly 6×10^11 solution points.
  • High-resolution simulations, which provide more accurate results via minimized wave speed adjustment, benefit more from DV-MOC speedups than low-resolution ones.