Mutation Testing
Mutation testing is a technique for measuring the depth of a test suite. The tool systematically corrupts the source code — flipping operators, replacing return values, deleting statements — and then re-runs the tests. If the tests still pass after a mutation, that surviving mutant marks untested behaviour: a place where the code could silently change without any test noticing. Survivors are the signal; killed mutants confirm the suite already covers that logic.
Why tensogram uses it
Tensogram’s critical path is a small set of Rust modules that handle wire-format layout, bit-flag dispatch, frame boundaries, and hash verification. These modules share three properties that make them ideal targets for mutation testing:
- Bit-flag handling. Per-frame flags like
HASH_PRESENTrely on bitwise operators (&,|,^). Swapping one for another is the exact mutation classcargo-mutantsgenerates, and a missed swap means silent data corruption. - Off-by-one frame-bound bugs. Boundary arithmetic in the decoder
(
<vs<=,+1vs-1) is another mutation sweet spot. The Pass-7 frame-bound bug caught during PR #111 review is a textbook example. - Wire-format byte-equality. Tensogram guarantees that the same input produces bit-identical output across Rust, Python, C++, and TypeScript. Golden-file tests enforce this, but mutation testing reveals whether those golden files actually exercise every branch in the encoder and decoder.
This page is the reference for the process. Remaining rollout work
(weekly-sweep triage, finishing critical-path coverage, and flipping
the PR-time gate to required-for-merge) is tracked in
plans/TODO.md under Code Quality.
The two regimes
Mutation testing runs in two complementary modes:
PR-time (--in-diff)
Every pull request runs cargo mutants --in-diff origin/main..HEAD as
a non-blocking job inside
.github/workflows/ci.yml. This
restricts mutation to lines the PR actually touches, keeping wall-clock
time reasonable (typically under five minutes). The job uploads
mutants.out/ as a CI artifact so reviewers can inspect survivors
without running locally. The plan is to flip this job to
required-for-merge once Phase 1 stabilises.
Weekly full sweep (sharded)
A scheduled workflow in
.github/workflows/mutants-weekly.yml
runs the full mutation sweep across all configured modules, split into
sixteen shards (--shard 1/16 through --shard 16/16) for parallelism.
Schedule: Saturday + Sunday at 09:00 UTC (weekend daytime, when the ECMWF builder fleet is least contended). Two runs per weekend catches commits made on Friday (Saturday’s run) and on Saturday (Sunday’s run).
Runner: ECMWF self-hosted infrastructure
(platform-builder-docker-xl fleet) inside the
eccr.ecmwf.int/tensogram/ci:1.3.0 container — the same runner the
rest of the test matrix uses. Zero GitHub-Actions minutes are
consumed by the sweep itself; only the small “open triage issue”
job at the end runs on ubuntu-latest (one short API call per
failed weekend).
Failure detection: surviving mutants surface through three independent channels:
- GitHub issue auto-created with labels
mutation-testingandtriage, containing a status table, the fullmissed.txtdump, and a link to the run artifact. Repo watchers receive an email. - Workflow run summary (
$GITHUB_STEP_SUMMARY) shows the caught/missed/timeout/unviable totals + the first 50 surviving mutants directly on the GitHub Actions run page — no artifact download needed for triage. - Red badge on the Actions tab + commit status.
The triage SLA is seven days from the auto-issue date.
Timeout safety: each shard has an 8-hour hard cap via
GH-Actions timeout-minutes: 480, with a timeout 7h wrapper
around cargo mutants so that if a shard hits the cap, the
artifact-upload step still runs with whatever was tested before
the kill. Typical shard wallclock is ~30 minutes, so the cap is
belt-and-braces only.
Running mutation testing locally
Install the pinned version of cargo-mutants:
cargo install cargo-mutants --version 27.0.0 --locked
Concurrency — seven nested layers
cargo-mutants concurrency is not a single number. Seven
independent layers each fan out by default to one thread per logical
CPU; multiplied together they saturate a 12-core laptop even at
CARGO_MUTANTS_JOBS=2. All seven must be clamped for a
sustainable load profile:
| # | Layer | Knob | Default | Recommended |
|---|---|---|---|---|
| 1 | Mutant workers | CARGO_MUTANTS_JOBS env | NCPUS | 2 |
| 2 | cargo + rustc invocation count | --jobserver-tasks flag | NCPUS | 2 |
| 3 | rustc codegen threads inside each rustc | [profile].codegen-units | 16 in release | 1 (pinned in [profile.release-mutants]) |
| 4 | Test-binary threads | cargo test -- --test-threads=N | NCPUS | 2 (pinned in .cargo/mutants.toml) |
| 5 | cmake build.rs (libaec-sys, blosc2-sys, zfp-sys-cc, tensogram-sz3-sys) | CMAKE_BUILD_PARALLEL_LEVEL | NCPUS | 1 |
| 6 | nested make from cmake | MAKEFLAGS=-j1 | NCPUS | -j1 |
| 7 | tensogram-internal rayon | TENSOGRAM_THREADS env | NCPUS | 1 |
Layer 3 is the most pernicious. Each rustc in release mode
defaults to 16 codegen threads, and cargo’s jobserver only counts the
rustc process as one slot. Without codegen-units = 1, even at
--jobserver-tasks 2 peak parallelism is 32 codegen threads — drawing
120W on M-series silicon for 5–15-second bursts that the 1-minute load average smooths away but a laptop power supply does not. The
[profile.release-mutants]profile (in workspaceCargo.toml) inherits release and pinscodegen-units = 1;.cargo/mutants.tomlselects this profile via--profile release-mutantsso every cargo build during a sweep uses it automatically.
Canonical local invocation for a Phase-1 file sweep:
CARGO_MUTANTS_JOBS=2 \
TENSOGRAM_THREADS=1 \
CMAKE_BUILD_PARALLEL_LEVEL=1 \
MAKEFLAGS=-j1 \
cargo mutants -p tensogram --file rust/tensogram/src/hash.rs --jobserver-tasks 2
The test-thread cap (--test-threads=2) and the
--profile release-mutants selection are pinned in
.cargo/mutants.toml so they don’t need to appear on every command
line.
macOS escape hatch — taskpolicy -b
On Apple Silicon laptops, application-level clamps can still miss
something — an undocumented threadpool in a future dependency, a code
path that bypasses TENSOGRAM_THREADS, etc. For unattended runs where
a power event is unacceptable, prefix the canonical invocation with
taskpolicy -b:
taskpolicy -b -- env \
CARGO_MUTANTS_JOBS=2 \
TENSOGRAM_THREADS=1 \
CMAKE_BUILD_PARALLEL_LEVEL=1 \
MAKEFLAGS=-j1 \
cargo mutants -p tensogram --file rust/tensogram/src/framing.rs --jobserver-tasks 2
taskpolicy -b runs the entire process tree under background QoS:
restricted to E-cores only (4 efficiency cores out of 12 on M-series),
with total package power capped at ~30% of system maximum. Applies to
all child processes — rustc, cc-rs, cmake, make, ld64, rayon, zstdmt
workers, blosc2 workers — without their cooperation. Kernel-level
enforcement, not application-level.
Cost: 2-3x wallclock (E-cores are slower than P-cores). Acceptable for unattended overnight runs; unsuitable for interactive iteration. Use the bare canonical invocation when you’re driving the sweep from a desk and watching it.
Per-context defaults
| Use case | Jobs | Jobserver tasks | TENSOGRAM_THREADS | CMAKE / MAKEFLAGS | taskpolicy |
|---|---|---|---|---|---|
| Default (12-core laptop, attended) | 2 | 2 | 1 | 1 / -j1 | optional |
| Unattended overnight on a laptop | 2 | 2 | 1 | 1 / -j1 | required |
| Workstation (≥16 cores, good cooling) | 4 | 4 | 1 | 2 / -j2 | not needed |
| CI runner | 2 | 2 | 1 | 1 / -j1 | not applicable |
Common invocations
Full sweep of a single file (useful when closing out a Phase-1 step):
CARGO_MUTANTS_JOBS=2 TENSOGRAM_THREADS=1 \
CMAKE_BUILD_PARALLEL_LEVEL=1 MAKEFLAGS=-j1 \
cargo mutants -p tensogram --file rust/tensogram/src/hash.rs --jobserver-tasks 2
Diff-only (the most common workflow for PR authors — mutates only lines you changed):
CARGO_MUTANTS_JOBS=2 TENSOGRAM_THREADS=1 \
CMAKE_BUILD_PARALLEL_LEVEL=1 MAKEFLAGS=-j1 \
cargo mutants --in-diff origin/main..HEAD --jobserver-tasks 2
Both commands write results to mutants.out/ in the current directory.
Resuming an interrupted sweep
Long sweeps (notably the framing-module sweep at ~12 hours with the
default concurrency) can be interrupted by power events, OOM kills, or
laptop sleep. cargo-mutants does not have a native resume flag, but
--shard N/K partitions the mutant list deterministically and lets you
re-run a subset:
# Run quarter 1 of 4 — first ~138 mutants of framing.rs
cargo mutants -p tensogram --file rust/tensogram/src/framing.rs --shard 1/4
# Then quarter 2, 3, 4 in subsequent sessions
cargo mutants -p tensogram --file rust/tensogram/src/framing.rs --shard 2/4
cargo mutants -p tensogram --file rust/tensogram/src/framing.rs --shard 3/4
cargo mutants -p tensogram --file rust/tensogram/src/framing.rs --shard 4/4
Shard assignment is stable across runs, so re-running a single shard
re-tests the same mutants. mutants.out/ is overwritten per
invocation; copy or rename between shard runs if you want to preserve
per-shard results.
Reading mutants.out/
After a run completes, cargo-mutants writes four result files into the
mutants.out/ directory:
-
caught.txt— Mutants that were killed by the test suite. This is the happy path: the tests detected the corruption and failed. A longcaught.txtand an emptymissed.txtis the goal. -
missed.txt— Surviving mutants. Each entry describes a source location and the mutation that was applied. Survivors indicate either a genuine coverage gap (write a test) or an equivalent mutant (the mutation does not change observable behaviour — add an exemption). This is the file you triage. -
timeout.txt— Mutants that caused the test suite to hang until the timeout expired. These are usually equivalent or dead-code mutations (e.g. removing a loop-break that the test harness never reaches). Worth a quick look, but rarely actionable. -
unviable.txt— Mutants that did not compile. Expected for type-constrained mutations (e.g. replacing au32return withString). These are noise and can be ignored.
When to add a test vs. an exemption
When you encounter a surviving mutant, follow this decision tree:
-
Does the mutation change observable behaviour? Read the diff
cargo-mutantsprints. If flipping that operator or deleting that statement would produce incorrect output, a wrong error, or a panic in production, the answer is yes — write a test. -
Is the mutant equivalent? Some mutations produce code that is functionally identical to the original (e.g. replacing
x > 0withx >= 1whenxis always a positive integer). If you can convince yourself (and a reviewer) that no input distinguishes the original from the mutant, add anexclude_reentry to.cargo/mutants.toml. -
Is the code cosmetic or logging-only? Display implementations,
fmt::Debugoverrides, and log-line formatting are legitimate exemption targets. Add anexclude_reentry.
Two strict rules apply in all cases:
- No mass suppression by category. Do not add broad patterns like
"impl Display"or"fn fmt"that suppress entire classes of mutants across the codebase. Each exemption must be scoped to a specific function or pattern. - Every
exclude_reentry needs a rationale. Add a one-line comment above the entry in.cargo/mutants.tomlexplaining why the mutant is equivalent or cosmetic.
Bumping cargo-mutants
The version is pinned at 27.0.0 across local installs, CI workflows, and this documentation. Version bumps land in dedicated PRs that:
- Update the version in
.cargo/mutants.tomlcomments, CI workflow files, and this page. - Re-run the full sweep on
rust/tensogram/src/hash.rsas a smoke test to confirm the new version produces comparable results. - Document any changed mutant operators or output-format differences in the PR description.
References
plans/TODO.md— remaining mutation-testing rollout work (Code Quality section).plans/TEST.md— test plan covering the full suite shape, including the mutation testing layer.- mutants.rs — upstream
cargo-mutantsdocumentation.