acados – a modular open-source framework for fast embedded optimal control

Robin Verschueren, Gianluca Frison, Dimitris Kouzoupis, Jonathan Frey, Niels van Duijkeren, Andrea Zanelli, Branimir Novoselnik, Thivaharan Albin, Rien Quirynen, Moritz Diehl · 2020

[arxiv] embedded-optimal-control, model-predictive-control, nonlinear-mpc, open-source-software, real-time-optimization, sequential-quadratic-programming

Authors: Robin Verschueren, Gianluca Frison, Dimitris Kouzoupis, Jonathan Frey, Niels van Duijkeren, Andrea Zanelli, Branimir Novoselnik, Thivaharan Albin, Rien Quirynen, Moritz Diehl
Year: 2020
Tags: embedded-optimal-control, nonlinear-mpc, sequential-quadratic-programming, open-source-software, real-time-iteration, linear-algebra

TL;DR

acados is a modular, open-source (BSD 2-Clause) C library for fast embedded NMPC and MHE that avoids problem-specific solver code generation by building on the BLASFEO small-matrix linear algebra library and the HPIPM QP solver. It exposes Python and Matlab interfaces with CasADi integration, allowing rapid prototyping while achieving ≈2× speedup over ACADO at identical solution quality on tested benchmarks.

First pass — the five C's

Category. Research prototype / software paper describing a new framework with numerical benchmarks.

Context. Embedded optimal control / real-time NMPC subfield. Directly builds on: Frison et al. BLASFEO (small-matrix high-performance linear algebra); Frison et al. HPIPM (structured interior-point QP solver); Andersson et al. CasADi (expression-graph AD and code generation); Diehl et al. RTI scheme (single-iteration real-time SQP). Competes with ACADO Code Generation tool, GRAMPC, FalcOpt, VIATOC, and FORCES NLP.

Correctness. Load-bearing assumptions: (1) problem state/control dimensions fall in the 10–300 range where BLASFEO outperforms code-generated and standard BLAS routines; (2) RTI without globalization is adequate because initializations stay near the solution; (3) the benchmarked problems (chain of masses, cart-pole, engine DAE) are representative of typical embedded NMPC workloads. All three appear reasonable given established literature cited in the paper but are not re-derived here.

Contributions.
- Modular C framework for embedded NMPC/MHE that swaps solvers, integrators, and QP back-ends without problem-specific code regeneration.
- Novel implementation of Sequential Convex Quadratic Programming (SCQP) and a structure-exploiting Hessian convexification method within an NMPC solver (claimed first in any NMPC package).
- Integration of GNSF-IRK structure-exploiting integrators with automatic transcription for CasADi models.
- High-level Python/Matlab interfaces that auto-generate self-contained deployable C projects via CasADi.

Clarity. Well-organized and readable; the notation in Section 2 is dense but self-contained; the numerical results section is truncated in the available manuscript text, leaving the hardware-in-the-loop case study incomplete.
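Correctness assumption (2) above, that a single iteration per sample suffices when the initialization stays near the solution, can be illustrated on a toy problem. The sketch below is not the acados RTI implementation: it replaces the NLP with a hypothetical scalar equation r(w; p) = w^3 + w - p = 0 whose parameter p drifts slowly, and applies exactly one Newton-type step per sample. All names (`residual`, `track`, the drift rate of p) are assumptions made for illustration.

```python
def residual(w, p):
    """Toy stand-in for the optimality conditions of a parametric NLP."""
    return w**3 + w - p


def residual_jac(w):
    """Derivative of the residual with respect to w (always >= 1)."""
    return 3.0 * w**2 + 1.0


def newton_step(w, p):
    """One Newton-type iteration, the only work done per sampling instant."""
    return w - residual(w, p) / residual_jac(w)


def solve_to_convergence(w, p, iters=50):
    """Reference solution: iterate until converged (not real-time feasible)."""
    for _ in range(iters):
        w = newton_step(w, p)
    return w


def track(p_values, w0):
    """Apply one iteration per sample, warm-started from the previous iterate."""
    w = w0
    errors = []
    for p in p_values:
        w = newton_step(w, p)
        errors.append(abs(w - solve_to_convergence(w, p)))
    return errors


# The parameter drifts slowly, as the current state does between MPC samples.
p_values = [2.0 + 0.05 * k for k in range(1, 21)]

warm_errors = track(p_values, w0=1.0)   # w = 1 solves the problem for p = 2
cold_errors = track(p_values, w0=10.0)  # initialization far from the solution

print("max warm-start error:", max(warm_errors))
print("first cold-start error:", cold_errors[0])
```

With a warm start the single-step iterate stays within a small tolerance of the converged solution at every sample, while the cold start is far off for the first several samples. In this toy the cold start still recovers because the problem is well behaved; the paper's reliance on good initialization matters precisely where that cannot be guaranteed.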

Second pass — content

Main thrust: acados achieves the same SQP solution quality as ACADO (relative cumulative suboptimality, RCSO, of 1.01×10⁻⁴ on the chain-of-masses benchmark) at roughly half the computation time (1.05 ms vs. 1.97 ms median) by replacing problem-specific code generation with the BLASFEO-backed HPIPM stack, while a modular architecture enables solver components to be freely interchanged.

Supporting evidence:
- Chain of masses (M=5 masses, N=40 horizon): acados median 1.05 ms, min 0.87 ms, max 2.23 ms per RTI iteration vs. ACADO median 1.97 ms, min 1.90 ms, max 3.45 ms; GRAMPC median 1.06 ms (similar speed but RCSO = 7.17×10⁻² vs. acados 1.01×10⁻⁴).
- RCSO ranking: IPOPT 0 (reference), acados = ACADO = 1.01×10⁻⁴, VIATOC 4.74×10⁻³, GRAMPC 7.17×10⁻², FalcOpt 3.17×10⁻¹.
- Convexification regularization (cart-pole swingup, open-loop): total solve time 66.034 ms vs. projection 103.65 ms vs. mirroring 174.32 ms; per-iteration cost is similar (2.540 ms / 2.303 ms / 2.264 ms), so the advantage is fewer iterations, not cheaper iterations.
- BLASFEO reported (from [32]) to deliver up to 10× speedup over code-generated triple-loop linear algebra for medium matrix sizes, cited as primary driver of acados speed advantage over ACADO.
- Hardware-in-the-loop experiment: dSPACE MicroAutoboxII (900 MHz IBM PPC 750GL, 16 MB RAM), Gauss-Legendre order-6 IRK, N=20, sampling time 0.05 s for a turbocharged engine DAE model; quantitative results are not fully reproduced in the available text.
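The ratios behind these figures are easy to check. The short script below only re-uses numbers quoted in this summary; the iteration counts are an inference from dividing total solve time by per-iteration time, not values reported in these notes.

```python
# Times in milliseconds, copied from the supporting evidence above.
median_ms = {"acados": 1.05, "ACADO": 1.97, "GRAMPC": 1.06}
rcso = {"acados": 1.01e-4, "GRAMPC": 7.17e-2}

# (total solve time, time per SQP iteration) for each regularization method.
regularization_ms = {
    "convexification": (66.034, 2.540),
    "projection": (103.65, 2.303),
    "mirroring": (174.32, 2.264),
}

# Median speedup of acados over ACADO on the chain-of-masses benchmark.
speedup_vs_acado = median_ms["ACADO"] / median_ms["acados"]

# How much less suboptimal acados is than GRAMPC at a similar median time.
rcso_ratio = rcso["GRAMPC"] / rcso["acados"]

# Inferred SQP iteration counts (assumption: total = iterations * per-iteration).
implied_iterations = {
    name: round(total / per_iter)
    for name, (total, per_iter) in regularization_ms.items()
}

print(f"speedup vs ACADO: {speedup_vs_acado:.2f}x")
print(f"GRAMPC / acados RCSO: {rcso_ratio:.0f}x")
print("implied iterations:", implied_iterations)
```

The implied counts (26, 45, and 77 iterations) are consistent with the reading that convexification wins by needing fewer SQP iterations while each iteration costs about the same.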

Figures & tables:
- Figure 2 (computation time per RTI vs. closed-loop step): axes labeled (step index, time in ms); 10-run averages plotted; no error bars or confidence intervals.
- Figure 3 (Pareto plot: RCSO vs. worst-case compute time): both axes labeled; acados and GRAMPC identified as Pareto-optimal; no uncertainty shown.
- Figure 4 (SQP convergence with three regularizations, KKT residual vs. iteration, log scale): axes labeled; deterministic convergence so no error bars appropriate; convexification clearly fastest.
- Tables 4–6: clean numerical summaries; Table 5 gives median/min/max but no standard deviation across 10 runs.
- Table 1: comprehensive 40+ package reference table; useful context but no performance data.
- No statistical significance tests anywhere; timing variance implied by min/max spread only.

Follow-up references:
- [32] Frison et al., BLASFEO: performance claims for all acados speed results rest on this.
- [31] Frison et al., HPIPM: the QP layer underpinning every acados NLP solve.
- [4] Andersson et al., CasADi: required to understand the AD/code-generation workflow.
- [18] Diehl et al., RTI scheme: theoretical basis for single-iteration real-time SQP without globalization.

Third pass — critique

Implicit assumptions:
- Matrix sizes fall in the BLASFEO sweet spot (~10–300); for very small systems (<4×4), code-generated kernels win per the paper's own text, so the speed advantage shrinks or reverses.
- Warm-started RTI always starts near the solution; no analysis of cold-start or large-disturbance scenarios where globalization would be needed.
- CasADi-generated C code is deployable as-is on embedded targets; no analysis of code size, stack usage, or heap allocation on memory-constrained MCUs.
- The chain-of-masses problem is representative; scaling behavior with larger state dimensions or longer horizons is shown only implicitly.

Missing context or citations:
- No direct comparison to FORCES NLP [91], the closest proprietary competitor in the same algorithmic class (interior-point + OCP structure), despite it appearing in Table 1.
- No engagement with OpEn / PANOC [80] or ProxALM-type methods that were emerging simultaneously and target similar embedded applications.
- Memory footprint analysis (RAM, flash) is entirely absent despite embedded suitability being a stated goal.
- No comparison of generated C code size between the acados templated approach and the ACADO standalone generated approach.

Possible experimental / analytical issues:
- Only 10 timing runs reported with min/max; no standard deviation or distribution shape; outliers could significantly affect embedded real-time guarantees.
- RCSO uses IPOPT as ground truth, but IPOPT may converge to different local optima on nonconvex problems, making the metric ambiguous.
- Solver tuning (Table 3) appears set to favor ACADO/acados (RTI, full/partial condensing); FalcOpt and GRAMPC parameters are described minimally and may not be at their best operating point.
- Case Study 2 (Hessian regularization) is a single open-loop solve with one fixed initialization; closed-loop or multi-initialization statistics would better characterize robustness.
- Hardware-in-the-loop results are truncated in the available text, so the embedded-platform claims cannot be fully evaluated.
- No ablation separating BLASFEO's contribution from HPIPM's contribution to the speedup over ACADO.

Ideas for future work:
- Systematic scaling study varying nx, nu, and N independently to map where acados is faster or slower than ACADO and GRAMPC.
- Formalize memory footprint benchmarks (stack + heap in bytes) on a representative Cortex-M or similar resource-constrained MCU to back the embedded deployment claim.
- Add optional globalization (filter line search or trust region) as an opt-in module for offline/batch applications, as the paper itself identifies this as future work.
- Evaluate the GNSF-IRK integrator speedup in isolation on a suite of stiff DAE systems to quantify when the structure-exploitation pays off versus standard IRK.

Methods

Sequential Quadratic Programming (SQP)
Real-Time Iteration (RTI) scheme
Gauss-Newton Hessian approximation
Sequential Convex Quadratic Programming (SCQP)
structure-preserving Hessian convexification
multiple shooting discretization
explicit Runge-Kutta integration
implicit Runge-Kutta integration
GNSF-IRK (Generalized Nonlinear Static Feedback IRK)
lifted collocation integrators
partial condensing
primal-dual interior-point QP solving
algorithmic differentiation via CasADi
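
Two of the methods listed above, explicit Runge-Kutta integration and multiple shooting discretization, combine in a simple way that a short sketch can make concrete. The code below is a generic textbook illustration, not acados code: `rk4_step` is the classical fourth-order scheme, `shooting_defects` computes the continuity gaps that a multiple-shooting NLP constrains to zero, and the scalar system x' = -x + u is a hypothetical test model chosen only because its exact solution is known.

```python
import math


def rk4_step(f, x, u, h):
    """One classical fourth-order Runge-Kutta step for x' = f(x, u)."""
    k1 = f(x, u)
    k2 = f(x + 0.5 * h * k1, u)
    k3 = f(x + 0.5 * h * k2, u)
    k4 = f(x + h * k3, u)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def shooting_defects(f, xs, us, h):
    """Continuity gaps F(x_k, u_k) - x_{k+1}, one per shooting interval."""
    return [rk4_step(f, xs[k], us[k], h) - xs[k + 1] for k in range(len(us))]


def dynamics(x, u):
    """Hypothetical scalar test system (assumption, not from the paper)."""
    return -x + u


h = 0.1
us = [0.0, 0.5, -0.2]

# Forward simulation yields a trajectory with zero defects by construction.
xs = [1.0]
for u in us:
    xs.append(rk4_step(dynamics, xs[-1], u, h))
consistent = shooting_defects(dynamics, xs, us, h)

# Perturbing one shooting node opens gaps on the two adjacent intervals.
perturbed = list(xs)
perturbed[2] += 0.1
gaps = shooting_defects(dynamics, perturbed, us, h)

# Accuracy check against the exact solution exp(-h) for u = 0, x(0) = 1.
rk4_error = abs(rk4_step(dynamics, 1.0, 0.0, h) - math.exp(-h))

print("consistent defects:", consistent)
print("gaps after perturbation:", gaps)
print("RK4 one-step error:", rk4_error)
```

In a multiple-shooting solver the node states are decision variables, so intermediate iterates may carry nonzero gaps like `gaps` here; the solver drives them to zero at convergence.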

Datasets

chain-of-masses benchmark
cart-pole swingup
engine control (dSPACE MicroAutoboxII hardware-in-the-loop)

Claims

acados achieves competitive computational performance compared to ACADO and GRAMPC while offering greater flexibility through a modular architecture without relying on automatic code generation for algorithmic logic.
The BLASFEO linear algebra library provides up to 10× speedup over code-generated linear algebra routines for matrix sizes typical in embedded optimization.
The structure-exploiting Hessian convexification method reaches the solution about 1.6× faster than projection regularization and about 2.6× faster than mirroring regularization in exact-Hessian SQP (total solve time 66.034 ms vs. 103.65 ms and 174.32 ms on the cart-pole swingup).
acados and GRAMPC lie on the Pareto-optimal front of sub-optimality versus computation time, with acados being nearly three orders of magnitude less suboptimal than GRAMPC (RCSO 1.01×10⁻⁴ vs. 7.17×10⁻², a factor of about 700) at comparable median computation times.
acados supports deployment on multiple CPU architectures including x86, x86_64, ARMv7A, ARMv8A, and PowerPC without sacrificing performance.
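
The regularization claim above compares convexification against two eigenvalue-based baselines. As a minimal sketch of those baselines only, assuming the usual definitions (projection clips eigenvalues below a small threshold up to it, mirroring replaces each eigenvalue by its absolute value, floored at the same threshold), the code below regularizes a 2×2 symmetric Hessian in closed form. The structure-exploiting convexification itself is not shown, and the helper names and the threshold `eps` are hypothetical.

```python
import math


def eig_sym_2x2(a, b, c):
    """Eigenvalues (lam1 >= lam2) and rotation angle of [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    return mean + radius, mean - radius, theta


def rebuild(lam1, lam2, theta):
    """Reassemble V * diag(lam1, lam2) * V^T for the rotation V(theta)."""
    cs, sn = math.cos(theta), math.sin(theta)
    h11 = lam1 * cs * cs + lam2 * sn * sn
    h12 = (lam1 - lam2) * cs * sn
    h22 = lam1 * sn * sn + lam2 * cs * cs
    return [[h11, h12], [h12, h22]]


def regularize(hessian, method, eps=1e-4):
    """Return a positive definite modification of a symmetric 2x2 Hessian."""
    (a, b), (_, c) = hessian
    lam1, lam2, theta = eig_sym_2x2(a, b, c)
    if method == "project":
        fix = lambda lam: max(lam, eps)       # clip negative curvature to eps
    elif method == "mirror":
        fix = lambda lam: max(abs(lam), eps)  # flip the sign of negative curvature
    else:
        raise ValueError(f"unknown method: {method}")
    return rebuild(fix(lam1), fix(lam2), theta)


# Indefinite example with eigenvalues 3 and -1.
H = [[1.0, 2.0], [2.0, 1.0]]
projected = regularize(H, "project")
mirrored = regularize(H, "mirror")

print("projected:", projected)
print("mirrored:", mirrored)
```

Both variants need an eigenvalue decomposition of each Hessian block, and they differ only in how negative curvature is replaced: projection nearly discards it, mirroring keeps its magnitude.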