Create hypothesis_engine/scripts/backtest_filter_axes.js for axis retrospective validation

filter rejected TOOL reversible: simple 5h proposed 21 May 2026

What is the proposed change?

Create hypothesis_engine/scripts/backtest_filter_axes.js. The hypothesis_engine/scripts/ directory does not exist yet, so create it as well.

CLI: node backtest_filter_axes.js --axis=v2_a11 [--cohort=ROBUST|MIXED|FRAGILE|ALL] [--labels=path/to/labels.csv]

The script reads hypothesis records from engine.db via the existing db.js connection utility, imports the named axis scoring function directly from filter_score.js (via require; it must not be reimplemented), and runs each hypothesis text through that function. If --labels is provided, the script reads a CSV of {hypothesis_id, nbj_class} and uses it for cohort classification; this flag is required when NBJ labels are not stored in engine.db.

Output is JSONL, one record per hypothesis: {hypothesis_id, nbj_class, axis_score}. It is written to stdout and also to meta_engine/data/axis_backtest_{axis}_{YYYY-MM-DD}.jsonl. The script exits with code 1 if zero hypotheses are processed and with code 2 if the named axis function is not exported from filter_score.js.
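The CLI contract above could be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: the helper names (parseCliArgs, resolveAxis, exitCodeFor) are hypothetical, the default cohort of ALL is an assumption, and the sketch assumes filter_score.js exports each axis function under its axis name (for example v2_a11). Database access and scoring are left out.

```javascript
// Sketch of the CLI contract only; db.js access and scoring are omitted.
const COHORTS = new Set(['ROBUST', 'MIXED', 'FRAGILE', 'ALL']);

// Parse --key=value flags into an options object.
// Assumption: --cohort defaults to ALL when it is not given.
function parseCliArgs(argv) {
  const opts = { axis: null, cohort: 'ALL', labels: null };
  for (const arg of argv) {
    const match = /^--(axis|cohort|labels)=(.+)$/.exec(arg);
    if (!match) throw new Error(`Unrecognized argument: ${arg}`);
    opts[match[1]] = match[2];
  }
  if (!opts.axis) throw new Error('--axis is required');
  if (!COHORTS.has(opts.cohort)) throw new Error(`Invalid cohort: ${opts.cohort}`);
  return opts;
}

// Look up the axis function on the exports object of filter_score.js.
// Assumption: axis functions are exported under their axis name.
// Returns null when the export is missing, which maps to exit code 2.
function resolveAxis(filterScoreExports, axisName) {
  const fn = filterScoreExports[axisName];
  return typeof fn === 'function' ? fn : null;
}

// Map the outcome of a run to the exit codes described in the proposal.
function exitCodeFor(axisFn, processedCount) {
  if (axisFn === null) return 2; // axis not exported from filter_score.js
  if (processedCount === 0) return 1; // zero hypotheses processed
  return 0;
}
```

In a real script, the result of exitCodeFor would be assigned to process.exitCode after the JSONL output has been flushed, so that buffered stdout is not truncated.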

Target files

hypothesis_engine/scripts/backtest_filter_axes.js

Expected effect

Running node hypothesis_engine/scripts/backtest_filter_axes.js --axis=v2_a11 --labels=s157_labels.csv against the 43 S157 candidates completes in under 30 seconds and produces a JSONL file that supports a Mann-Whitney U test between the ROBUST and FRAGILE axis score distributions. This replaces the entirely manual S157 retrospective process with a repeatable, diffable, auditable command that can be used for every future AXIS proposal.
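For reference, the Mann-Whitney U statistic mentioned above can be computed from two arrays of axis scores as in the sketch below. The function name mannWhitneyU is hypothetical, and the sketch covers only the U statistic with midranks for ties; it does not compute a p-value, which would normally come from a statistics package.

```javascript
// Compute the Mann-Whitney U statistics for two samples of numeric scores.
// Tied values receive the average (mid) rank of the positions they occupy.
function mannWhitneyU(a, b) {
  const pooled = [
    ...a.map((v) => ({ v, group: 0 })),
    ...b.map((v) => ({ v, group: 1 })),
  ].sort((x, y) => x.v - y.v);

  let rankSumA = 0;
  let i = 0;
  while (i < pooled.length) {
    // Find the end of the run of values tied with pooled[i].
    let j = i;
    while (j + 1 < pooled.length && pooled[j + 1].v === pooled[i].v) j++;
    const midrank = (i + j) / 2 + 1; // mean of the 1-based ranks i+1 .. j+1
    for (let k = i; k <= j; k++) {
      if (pooled[k].group === 0) rankSumA += midrank;
    }
    i = j + 1;
  }

  const uA = rankSumA - (a.length * (a.length + 1)) / 2;
  const uB = a.length * b.length - uA;
  return { uA, uB, u: Math.min(uA, uB) };
}
```

Here a would hold the axis_score values for ROBUST records and b those for FRAGILE records; a value of u near zero indicates that the two distributions barely overlap.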

Falsifier — what would prove this wrong?

If the script runs but outputs identical axis scores for all 43 candidates (zero variance), the axis scoring function is not differentiating on real features of the hypothesis text; in that case the axis function itself is broken, not this tool. If the script cannot import the axis functions from filter_score.js because of CommonJS/ESM boundary issues, filter_score.js must export the individual axis functions via a thin export shim. That failure mode is deterministic and fixable within the same 5-hour estimate.
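The zero-variance condition in the falsifier can be checked mechanically on the script's output. The sketch below assumes the JSONL record shape given in the proposal ({hypothesis_id, nbj_class, axis_score}); the helper names parseJsonl and hasZeroVariance are hypothetical.

```javascript
// Parse JSONL text (one JSON object per line) into an array of records.
function parseJsonl(text) {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

// Return true when every record carries the same axis_score.
// An empty input returns false; that case is covered by exit code 1.
function hasZeroVariance(records) {
  if (records.length === 0) return false;
  const first = records[0].axis_score;
  return records.every((r) => r.axis_score === first);
}
```

A true result on the 43-candidate run would point at the axis function rather than at the backtest script, as described above.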

Evidence that triggered the proposal

Corpus D: brain/S158_SHADOW_P4_SPEC.md — graduation criteria require a '43-candidate retrospective' with Wilson false-kill upper bound ≤2.5%; no tooling exists to run this check (all 43 evaluated manually in S157)
Corpus D: brain/S157_NBJ_DESCRIBABILITY_TEST.md — 43-candidate sweep conducted entirely manually in S157; no repeatable command exists for axis calibration as additional v2 axes are added
Corpus D: hypothesis_engine/scripts/ — directory confirmed absent from codebase; no backtest, retrospective, or axis calibration tooling exists anywhere in hypothesis_engine/
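The Wilson upper bound cited in the graduation criteria can be computed from a false-kill count and a sample size as sketched below. The function name wilsonUpperBound is hypothetical, and the critical value z is left as a parameter because the confidence level used by S158_SHADOW_P4_SPEC.md is not stated here.

```javascript
// Upper limit of the Wilson score interval for a binomial proportion.
// events: number of observed events (for example, false kills)
// n: number of trials (for example, candidates evaluated)
// z: normal critical value for the chosen confidence level (assumption: caller supplies it)
function wilsonUpperBound(events, n, z) {
  if (n <= 0) throw new RangeError('n must be positive');
  const p = events / n;
  const z2 = z * z;
  const centre = p + z2 / (2 * n);
  const margin = z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return (centre + margin) / (1 + z2 / n);
}
```

The result is a proportion between 0 and 1, so it would be compared against 0.025 for the 2.5% threshold.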

Proposer self-score

The proposer scored its own draft on these axes (0-3 each) before submitting.

Axis	Score
specificity	3
falsifier	3
solo feasible	3
blast radius	3
composability	3
reversibility	3

Disposition

Rejected by filter_score. The proposal did not meet the bar for specificity, falsifiability, or solo-feasibility.

Evaluation history

When	Move
2026-05-21 04:18	meta_filter_score
2026-05-21 04:15	meta_genesis