Parallelism (`n_jobs`)

The L2 YAML config exposes three independent parallelism knobs, plus an implicit fourth layer that runs at expansion time. This page explains which knob to use, the mutual-exclusion rules that prevent nested process pools, and the BLAS thread-oversubscription pitfall that bites every multi-process Python numerics pipeline.

The four parallelism layers

Layer	YAML knob	What gets parallelised
L1 — C-table builder	`propagators.n_jobs`	Grid points `(t1, t2[, r/cos/x])` inside `precompute_C_table_*`
L2 — diagram-term integration	`expand.n_jobs`	Per-diagram QMC integration inside `integrate_diagrams`
L3 — sweep grid points	`sweep.n_jobs`	Cartesian product of `positions × t_final × component_pairs`
L4 — topology canonicalisation	(no YAML key yet)	Inside `compute_moment_numerical` when the topology list is long

All four layers use `joblib.Parallel(backend='loky')`. Each defaults to `1` (sequential); `-1` means all CPU cores.

Choosing where to parallelise

If your sweep grid is large (many positions × t_final × component pairs), set `sweep.n_jobs: -1`. This is the most common case.
If your expansion has many diagrams per grid point (typical at perturbative order ≥ 2), set `expand.n_jobs: -1` and leave `sweep.n_jobs: 1`. This is useful when the sweep grid is small but individual evaluations are expensive.
If the dominant cost is the C-table pre-build (slow user `c_closed_form_module`, dense `r_max` / `n_grid_r` grid), set `propagators.n_jobs: -1`. The pre-build finishes before any sweep starts, so it composes safely with either L2 or L3 parallelism.

Mutual exclusion (no nested pools)

joblib’s loky backend cannot nest: a worker process inside one pool cannot itself open another pool. The library enforces this by raising `ValueError` whenever both of the following are requested:

expand.n_jobs > 1   AND   sweep.n_jobs > 1     -> ValueError
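
The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the library's actual guard: the function names `effective_n_jobs` and `check_no_nested_pools` and the nested-dict config layout are assumptions made for the example. It also assumes that negative values are resolved with joblib's convention (`-1` is all cores, `-2` is all but one) before the comparison, so that `-1` counts as parallel.

```python
import os


def effective_n_jobs(n_jobs, cpu_count=None):
    """Resolve a joblib-style n_jobs value to a concrete worker count."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    if n_jobs == 0:
        raise ValueError("n_jobs must be a non-zero integer")
    if n_jobs < 0:
        # joblib convention: -1 -> all cores, -2 -> all but one, ...
        return max(1, cpus + 1 + n_jobs)
    return n_jobs


def check_no_nested_pools(config, cpu_count=None):
    """Raise ValueError if both expand and sweep would open a pool.

    `config` is a hypothetical nested dict mirroring the YAML sections.
    """
    expand = effective_n_jobs(config.get("expand", {}).get("n_jobs", 1), cpu_count)
    sweep = effective_n_jobs(config.get("sweep", {}).get("n_jobs", 1), cpu_count)
    if expand > 1 and sweep > 1:
        raise ValueError(
            f"expand.n_jobs={expand} and sweep.n_jobs={sweep} would nest "
            "process pools; set one of them to 1"
        )


# propagators.n_jobs is deliberately not checked: per the text, L1 composes
# safely with either L2 or L3.
check_no_nested_pools(
    {"propagators": {"n_jobs": -1}, "sweep": {"n_jobs": -1}}, cpu_count=8
)

try:
    check_no_nested_pools(
        {"expand": {"n_jobs": 4}, "sweep": {"n_jobs": -1}}, cpu_count=8
    )
except ValueError as exc:
    print(f"rejected: {exc}")
```

Passing `cpu_count` explicitly keeps the check deterministic in tests; in normal use it would be left at `None` so the machine's core count is used.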

For the L1 × {L2, L3} combination there is no error because the guard is automatic: when the C-table is in lazy mode (`r_max` / `n_grid_r` etc. unset, the default), the library pins the lazy spline cache’s inner `n_jobs` to 1 after `Propagators.build()` completes. That removes the only path along which an L2 or L3 worker could trigger a nested pool. Full-grid mode runs to completion before any sweep starts, so there is no nesting risk there either.

BLAS thread oversubscription

NumPy and SciPy operations inside each loky worker – matrix products, `scipy.integrate.dblquad`, `RectBivariateSpline.__call__` – are themselves multi-threaded by default through OpenBLAS or MKL. Without intervention, N worker processes each running N BLAS threads yield N**2 total threads, which leads to lock contention and can make the run slower than a serial one.

Always set the following before invoking `sft-wick` with any `n_jobs > 1`:

export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=1

The CLI prints a one-line tip when it detects `n_jobs > 1` without those variables set. Setting any one of them explicitly suppresses the tip, whatever the value: the check only looks for whether a value is present.
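
If you launch the pipeline from a Python script rather than a shell, the same limits can be applied through `os.environ`. The sketch below is only an illustration: `pin_blas_threads` is a hypothetical helper, not part of the library. The variables need to be set before NumPy or SciPy is first imported, because the BLAS runtime typically reads them when it loads; worker processes then inherit them from the parent environment.

```python
import os

BLAS_THREAD_VARS = ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS", "OMP_NUM_THREADS")


def pin_blas_threads(env=None, value="1"):
    """Set each BLAS/OpenMP thread variable that is not already set.

    Returns the names that were newly set; values the user exported
    explicitly are left untouched.
    """
    env = os.environ if env is None else env
    newly_set = []
    for name in BLAS_THREAD_VARS:
        if name not in env:
            env[name] = value
            newly_set.append(name)
    return newly_set


# Call this before the first `import numpy` / `import scipy` in the process.
changed = pin_blas_threads()
print("pinned:", changed)
```

Leaving existing values alone means a deliberate choice such as `MKL_NUM_THREADS=2` on a large machine is respected.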