Parallelism (n_jobs)
The L2 YAML config exposes three independent parallelism knobs, plus an implicit fourth layer that runs at expansion time. This page explains which knob to use, the mutual-exclusion rules that prevent nested process pools, and the BLAS thread-oversubscription pitfall that bites every multi-process Python numerics pipeline.
The four parallelism layers
| Layer | YAML knob | What gets parallelised |
|---|---|---|
| L1 — C-table builder | propagators.n_jobs | Grid points |
| L2 — diagram-term integration | expand.n_jobs | Per-diagram QMC integration inside |
| L3 — sweep grid points | sweep.n_jobs | Cartesian product of |
| L4 — topology canonicalisation | (no YAML key yet) | Inside |
All four layers use joblib.Parallel(backend='loky'). Each defaults
to 1 (sequential); a value of -1 uses all CPU cores.
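As a rough illustration of how an n_jobs value maps to a worker count, the sketch below mirrors joblib's documented convention for negative values (-1 is all cores, -2 is all but one). The helper name resolve_n_jobs and its cpu_count parameter are hypothetical and not part of sft-wick or joblib.

```python
import os


def resolve_n_jobs(n_jobs=1, cpu_count=None):
    """Return the number of workers implied by an n_jobs value.

    Hypothetical helper: follows joblib's convention that negative
    values count back from the number of CPU cores.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    if n_jobs == 0:
        # joblib also rejects n_jobs == 0.
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all cores, -2 -> all but one, never fewer than one worker.
        return max(cpu_count + 1 + n_jobs, 1)
    return n_jobs


# On an 8-core machine:
default_workers = resolve_n_jobs(1, cpu_count=8)    # 1 (sequential)
all_core_workers = resolve_n_jobs(-1, cpu_count=8)  # 8 (all cores)
```

Positive values are returned unchanged, so asking for more workers than cores is allowed in this sketch.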
Choosing where to parallelise
- If your sweep grid is large (many positions × t_final × component pairs), set sweep.n_jobs: -1. This is the most common case.
- If your expansion has many diagrams per grid point (typical at perturbative order ≥ 2), set expand.n_jobs: -1 and leave sweep.n_jobs: 1. This is useful when the sweep grid is small but individual evaluations are expensive.
- If the dominant cost is the C-table pre-build (slow user c_closed_form_module, dense r_max/n_grid_r grid), set propagators.n_jobs: -1. The pre-build finishes before any sweep starts, so it composes safely with either L2 or L3 parallelism.
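The guidance above can be condensed into a small lookup. This is only a sketch: the function suggest_n_jobs and its dominant_cost labels ("sweep", "expand", "ctable") are hypothetical names chosen for illustration, and the returned dictionary uses the dotted knob names from this page rather than any real sft-wick API.

```python
def suggest_n_jobs(dominant_cost):
    """Map the dominant cost of a run to the three n_jobs knobs.

    Hypothetical helper, not part of sft-wick. Exactly one knob is
    set to -1 (all cores); the others stay at 1 (sequential).
    """
    knob_for_cost = {
        "sweep": "sweep.n_jobs",         # large sweep grid
        "expand": "expand.n_jobs",       # many diagrams per grid point
        "ctable": "propagators.n_jobs",  # slow C-table pre-build
    }
    if dominant_cost not in knob_for_cost:
        raise ValueError(f"unknown dominant cost: {dominant_cost!r}")

    knobs = {
        "propagators.n_jobs": 1,
        "expand.n_jobs": 1,
        "sweep.n_jobs": 1,
    }
    knobs[knob_for_cost[dominant_cost]] = -1
    return knobs


large_grid = suggest_n_jobs("sweep")
```

Because the C-table pre-build composes with either of the other two layers, a real configuration could also combine propagators.n_jobs: -1 with one of them; the sketch keeps to a single knob for simplicity.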
Mutual exclusion (no nested pools)
joblib’s loky backend cannot nest: a worker process inside one pool
cannot itself open another pool. The library enforces this by raising
ValueError whenever you would request both:
expand.n_jobs > 1 AND sweep.n_jobs > 1 -> ValueError
For the L1 × {L2, L3} combination there is no error because the
guard is automatic: when the C-table goes into lazy mode (r_max /
n_grid_r etc. unset, the default), the library pins the lazy
spline cache’s inner n_jobs to 1 after
Propagators.build() completes. That removes the only path along
which an L2 or L3 worker could trigger a nested pool. Full-grid mode
runs to completion before any sweep starts, so there is no nesting
risk there either.
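The expand/sweep exclusion rule can be sketched as a pre-flight check on a config before a run is launched. The function name check_no_nested_pools is hypothetical, and the sketch assumes that negative values such as -1 also count as parallel, which the rule as written above does not spell out.

```python
def check_no_nested_pools(expand_n_jobs=1, sweep_n_jobs=1):
    """Raise ValueError if both expand and sweep request parallelism.

    Hypothetical pre-flight check, not the library's own guard.
    Assumption: negative n_jobs (e.g. -1) is treated as parallel.
    """
    def is_parallel(n_jobs):
        return n_jobs > 1 or n_jobs < 0

    if is_parallel(expand_n_jobs) and is_parallel(sweep_n_jobs):
        raise ValueError(
            "expand.n_jobs and sweep.n_jobs cannot both be parallel; "
            "set one of them to 1"
        )


# Either knob alone is fine; the combination is rejected.
check_no_nested_pools(expand_n_jobs=-1, sweep_n_jobs=1)
check_no_nested_pools(expand_n_jobs=1, sweep_n_jobs=4)
```

propagators.n_jobs is deliberately left out of the check, since the L1 layer composes with either L2 or L3 as described above.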
BLAS thread oversubscription
NumPy and SciPy operations inside each loky worker – matrix products,
scipy.integrate.dblquad, RectBivariateSpline.__call__ – are
themselves multi-threaded by default through OpenBLAS or MKL. Without
intervention, N worker processes each running N BLAS threads
yield N**2 total threads, and the resulting oversubscription and lock
contention can make the run slower than running serially.
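To make the arithmetic concrete, the sketch below counts threads for a hypothetical 8-core machine with 8 workers, first with BLAS left at its default of one thread per core and then with BLAS pinned to one thread per worker. The function names and the 8-core figure are illustrative assumptions.

```python
def total_threads(n_workers, blas_threads_per_worker):
    """Total compute threads across all worker processes."""
    return n_workers * blas_threads_per_worker


def oversubscription_factor(n_workers, blas_threads_per_worker, n_cores):
    """Threads per physical core; values above 1 mean oversubscription."""
    return total_threads(n_workers, blas_threads_per_worker) / n_cores


n_cores = 8  # assumed machine size for this example

# Default: each of 8 workers spawns 8 BLAS threads.
unpinned = total_threads(8, 8)                         # 64 threads
unpinned_factor = oversubscription_factor(8, 8, n_cores)  # 8.0

# Pinned: each of 8 workers runs a single BLAS thread.
pinned = total_threads(8, 1)                           # 8 threads
pinned_factor = oversubscription_factor(8, 1, n_cores)    # 1.0
```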
Always set the following before invoking sft-wick with any n_jobs > 1:
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=1
The CLI prints a one-line tip when it detects n_jobs > 1 without
those variables set. To silence the tip, set any one of them
explicitly; any value works (1 is enough), because the tip suppresses
itself once it sees a value.
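When a run is launched from a Python script rather than a shell, the same three variables can be set from Python. The helper below is a hypothetical sketch, not part of sft-wick; it keeps any value the user has already chosen. BLAS libraries generally read these variables when they are loaded, so the call would need to happen before NumPy or SciPy is first imported.

```python
import os

BLAS_THREAD_VARS = (
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OMP_NUM_THREADS",
)


def pin_blas_threads(env=None, n_threads=1):
    """Set each BLAS thread variable unless it is already defined.

    Hypothetical helper. `env` defaults to os.environ; passing a plain
    dict makes the function easy to try out without side effects.
    Returns the resulting value of each variable.
    """
    env = os.environ if env is None else env
    for name in BLAS_THREAD_VARS:
        # setdefault leaves an existing user-provided value untouched.
        env.setdefault(name, str(n_threads))
    return {name: env[name] for name in BLAS_THREAD_VARS}


# Demo on a plain dict: an existing MKL setting is preserved.
demo_env = {"MKL_NUM_THREADS": "4"}
demo_result = pin_blas_threads(demo_env)
```

In a real script, calling pin_blas_threads() with no arguments at the very top, before any numerical imports, would have the same effect as the three export lines above for the current process and the workers it spawns.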