One script to solve it all – an open-source-based framework for a digital workflow based on WWTP data
Authors: Markus Ahnert, Stefan Hurzlmeier
Year: 2022
Tags: wastewater-treatment, digital-workflow, open-source, environmental-data-science, activated-sludge-modelling, wwtp-design
TL;DR
An R/RMarkdown-based open-source workflow automates import, plausibility checking, and German DWA-standard design calculations (A198 + A131) for routine WWTP operational data, then feeds parameterization of an activated sludge model (Simba#). The paper argues this replaces largely manual, project-by-project engineering practice with a reproducible, auto-documented pipeline applicable to ~9,100 German WWTPs.
First pass — the five C's
Category. Research prototype / methodology paper with a position-paper component.
Context. Environmental data science (EDS) subfield. Builds on Gibert et al. (2018) for the EDS challenge taxonomy that structures the paper's framing; Blair et al. (2019) for characterizing heterogeneous environmental data complexity; German standards DWA-A131 (2016) and ATV-DVWK-A198E (2003) as the encoded design rules; Newhart et al. (2019) for context on WWTP data quality and exploratory analysis gaps.
Correctness. Load-bearing assumptions: (1) DWA regulations are sufficiently algorithmic to encode in scripts — the paper itself notes ambiguities (e.g., sliding-window boundary treatment) that require design choices not resolved by the standard; (2) daily mean values from operating journals are adequate resolution for all design tasks; (3) open-source R packages remain stable across project lifetimes. These appear broadly defensible but are not tested.
Contributions.
- First described end-to-end digital workflow covering WWTP data import through activated sludge model parameterization within a single scripting environment.
- Explicit mapping of WWTP routine data challenges onto the Gibert et al. (2018) EDS taxonomy.
- Open-source R/RMarkdown implementation coupling calculation and report generation, with a public GitHub repository.
- Modular architecture enabling API-based control of commercial simulation software (Simba#) from within the R workflow.
Clarity. Generally readable, but the paper doubles as an advocacy piece for open-source and reproducibility, causing the methodology and the argument to blur; quantitative evidence is absent where assertions of benefit are strongest.
Second pass — content
Main thrust: Replacing ad-hoc, spreadsheet-based WWTP design practice with a modular R pipeline that encodes DWA regulations, enforces plausibility checks, auto-generates standardized reports, and initializes activated sludge models — saving engineering time and improving reproducibility.
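The staged pipeline described above (import, plausibility check, design calculation, report) can be sketched in miniature. This is an illustrative outline only: all function names, the filtering rules, and the 0.85 percentile are hypothetical stand-ins, and the paper's actual implementation is in R/RMarkdown, not Python.

```python
# Hypothetical sketch of the modular pipeline idea; not the paper's code.

def import_journal(records):
    """Parse operating-journal rows into (day, flow, concentration) tuples."""
    return [(r["day"], float(r["flow"]), float(r["cod"])) for r in records]

def plausibility_check(data, flow_max=1e6):
    """Drop records with non-physical values (negative or implausibly large)."""
    return [d for d in data if 0 < d[1] < flow_max and d[2] >= 0]

def design_load(data, pct=0.85):
    """Percentile-based design load, loosely analogous to A198-style load statistics."""
    loads = sorted(d[1] * d[2] for d in data)          # flow * concentration
    idx = min(int(pct * len(loads)), len(loads) - 1)   # simple percentile pick
    return loads[idx]

def run_pipeline(records):
    data = plausibility_check(import_journal(records))
    return {"n_valid": len(data), "design_load": design_load(data)}

journal = [
    {"day": 1, "flow": 1000, "cod": 0.5},
    {"day": 2, "flow": -5,   "cod": 0.4},   # implausible record, filtered out
    {"day": 3, "flow": 1200, "cod": 0.6},
]
result = run_pipeline(journal)
```

The design point the paper stresses is that each stage is a separate module behind a stable interface, so individual steps can be swapped (e.g., a different plausibility rule) without touching the rest.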
Supporting evidence:
- Applied to operating journals from WWTPs of "various sizes" covering "several years" as daily mean values; specific plant count, sizes, and validation metrics are not stated.
- German regulatory scope: DWA design rules cover approximately 9,100 WWTPs in Germany (Federal Statistical Office, as of 2016).
- Three dry-weather-day classification methods implemented (weather key, precipitation records, 21-day moving minimum); results differ across methods, requiring manual selection, a noted limit on full automation.
- A131 module outputs: denitrification and nitrification tank volumes, secondary sedimentation sizing, required oxygen input, and internal recirculation rates.
- Data density context: German self-monitoring sampling occurs only on a few days per week, creating large gaps; up to 100 data points per daily record; 365 records per year.
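The 21-day moving-minimum method for dry-weather classification can be illustrated in outline. The sketch below assumes a centered window truncated at the series edges and a hypothetical 1.1 tolerance factor; the boundary treatment is exactly the detail the paper flags as underspecified in the standard, so this is one possible design choice, not the paper's.

```python
# Sketch of a moving-minimum dry-weather classifier. The tolerance factor
# and the truncated-window boundary handling are assumptions, not taken
# from the paper or the DWA standard.

def moving_minimum(flows, window=21):
    """Centered moving minimum; windows are truncated at the series edges."""
    half = window // 2
    out = []
    for i in range(len(flows)):
        lo, hi = max(0, i - half), min(len(flows), i + half + 1)
        out.append(min(flows[lo:hi]))
    return out

def dry_weather_days(flows, factor=1.1, window=21):
    """Flag days whose flow stays within `factor` of the local minimum baseline."""
    baseline = moving_minimum(flows, window)
    return [f <= factor * b for f, b in zip(flows, baseline)]

flows = [100, 102, 250, 101, 99, 300, 98]  # two rain spikes at 250 and 300
flags = dry_weather_days(flows, window=5)  # short window for the toy series
```

Because the standard does not fix how the window behaves near the start and end of the record, two correct implementations can classify the same boundary days differently, which is the ambiguity the paper reports.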
Figures & tables: Figure 3 (workflow scheme) is the central architectural figure — not quantitative, no axes. Figure 4 (A–H: scatterplots, histograms, time series for plausibility checks) is purely illustrative; axes are labeled but no statistical significance, confidence intervals, or error bars are shown, and no ground-truth comparison is provided. Figure 2 (German WWTP size distribution) is descriptive with labeled axes, no uncertainty. Table 1 lists workflow advantages as bullet categories without supporting numbers. Table 2 compares usual vs. digital workflow steps qualitatively. No figure carries a quantitative result.
Follow-up references:
- Newhart et al. (2019): most directly relevant review of data-driven WWTP performance analysis and data-quality practice.
- Gibert et al. (2018): foundational EDS challenge taxonomy the paper's argument is structured around.
- Corominas et al. (2018): critical review of methods for transforming WWTP data into operational knowledge; context for why descriptive statistics are underused.
- Blischak et al. (2019): open-source framework for reproducible research that motivated the workflow design philosophy here.
Third pass — critique
Implicit assumptions:
- DWA regulations are unambiguous enough to encode algorithmically; this is contradicted by the paper's own admission that boundary conditions in moving-window calculations are underspecified, and if regulations change, the entire encoded workflow must be re-validated.
- Manual intervention at multiple decision points (dry-weather method choice, data cleaning) preserves rather than degrades reproducibility; this is asserted but not demonstrated, and different engineers making different manual choices could produce divergent outputs from the same data.
- Daily mean values contain sufficient information for all downstream design and modelling steps; higher-frequency data gaps are handled by interpolation or exclusion, but the sensitivity of design outputs to these choices is not evaluated.
- The R package ecosystem stays stable over multi-year project timelines; the paper's own "keep it simple" heuristic acknowledges this risk without quantifying it.
Missing context or citations:
- DATAR (Gabaldón et al. 1998) and DESASS (Ferrer et al. 2008) are mentioned but never systematically compared against the proposed workflow in terms of capability, time, or accuracy.
- No engagement with international equivalents (US EPA-based design workflows, ISO standards beyond BIM) to assess transferability outside the DACH countries.
- Cybersecurity and data governance for web deployment receive a single citation (Ooms 2013) with no substantive analysis, despite the paper raising web-based use as a target application.
- The claim of being the "first description of a complete digital workflow" is unverified; the literature search method and scope are not described.
Possible experimental / analytical issues:
- Zero quantitative validation: no measured time savings, no error-rate comparison against manual calculation, no design-output accuracy check against an independent reference.
- Case study plants are entirely anonymous (size class, country, and data period are only vaguely described); results cannot be reproduced from the paper alone.
- Figure 4 plausibility-check plots are cherry-picked illustrations, not a systematic performance evaluation; false-positive and false-negative rates for automated anomaly detection are not reported.
- The GitHub repository is referenced, but the paper does not describe code version, test coverage, completeness, or whether the example data are real or synthetic.
- Rejecting full automation on engineering-judgement grounds is reasonable but introduces uncontrolled inter-analyst variability that undermines the reproducibility argument; this tension is not resolved.
Ideas for future work:
- Conduct a controlled time-and-accuracy study: have multiple engineers process the same dataset with the old manual workflow and the new pipeline, measuring wall-clock time, inter-analyst variance in design outputs, and error incidence.
- Extend and test the regulatory encoding for non-DWA standards (e.g., the US Ten States Standards, EU-aligned national regulations) to determine how much of the framework generalizes versus being Germany-specific.
- Replace the spreadsheet import interface with a direct SCADA/database connector and evaluate whether higher-resolution input (hourly or sub-hourly) improves design-parameter stability or activated sludge model calibration quality.
- Formalize the manual intervention points into structured decision logs within the workflow, so that inter-analyst reproducibility can be audited and the sensitivity of final design outputs to each manual choice can be quantified.
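The decision-log idea above could take a shape like the following; the class, schema, and field names are hypothetical, sketched only to show that each manual choice can be captured as an auditable record alongside its rejected alternatives.

```python
import json
from datetime import datetime, timezone

# Hypothetical decision log: every manual intervention (method choice,
# excluded records, cleaning rule) is recorded with its alternatives and
# rationale, so inter-analyst differences become auditable.

class DecisionLog:
    def __init__(self):
        self.entries = []

    def record(self, step, choice, alternatives, rationale):
        """Append one timestamped manual decision."""
        self.entries.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "step": step,
            "choice": choice,
            "alternatives": alternatives,
            "rationale": rationale,
        })

    def to_json(self):
        """Serialize the log for archiving next to the generated report."""
        return json.dumps(self.entries, indent=2)

log = DecisionLog()
log.record(
    step="dry_weather_classification",
    choice="21-day moving minimum",
    alternatives=["weather key", "precipitation records"],
    rationale="precipitation gauge data incomplete for study period",
)
```

Archiving such a log with each RMarkdown report would let a reviewer replay exactly which manual choices separated two analysts' results, which is the sensitivity audit the bullet proposes.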
Methods
- exploratory data analysis
- plausibility checks
- time series analysis
- percentile calculation
- 21-day moving minimum
- activated sludge modelling
- dynamic simulation
- RMarkdown automated documentation
- API interface to Simba#
Datasets
- WWTP operating journal routine data
- daily mean influent flow and concentration records from multiple German wastewater treatment plants
Claims
- A modular open-source R-based digital workflow can automate data import, plausibility checking, design parameter calculation, and WWTP design according to DWA regulations A198 and A131, saving significant engineering time.
- Complete automation of the WWTP design workflow is computationally feasible but not advisable, as manual engineering judgement remains essential at key decision points.
- Coupling R scripting with RMarkdown enables transparent, reproducible, and automatically generated documentation of calculations and results.
- Routine WWTP operational data can serve as a sufficient basis for both static design and dynamic activated sludge model parameterisation without requiring additional data collection.
- The described workflow addresses key environmental data science challenges including data quality, reproducibility, and cross-disciplinary integration in the wastewater sector.