Calibration

The engine's track record, scored against itself. These are the numbers a customer should look at before trusting any verdict.

Headline metrics

122

total hypotheses

33%

graduation rate (decided)

9%

commander override rate

$2.21

avg cost per hypothesis

Override rate is the percentage of graduated-or-overridden cases where the human disagreed with the engine. A high rate means the engine is missing something the human catches; a low rate means the engine is well-calibrated. Currently 9%, just below the target band of 10-25%.
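A minimal sketch of the arithmetic behind the override rate, using the counts reported on this page (43 graduated hypotheses, 4 commander overrides). It assumes the overridden cases are counted separately from the graduated ones; the function names are illustrative and not part of the engine.

```python
def override_rate(graduated: int, overridden: int) -> float:
    """Share of graduated-or-overridden cases where the human disagreed."""
    # Assumption: graduated and overridden cases do not overlap.
    decided = graduated + overridden
    if decided == 0:
        raise ValueError("no graduated or overridden cases to score")
    return overridden / decided


def in_target_band(rate: float, low: float = 0.10, high: float = 0.25) -> bool:
    """Check a rate against the 10-25% target band (bounds inclusive)."""
    return low <= rate <= high


# Counts from this page: 43 graduated, 1 DEFER + 3 KILL overrides.
rate = override_rate(graduated=43, overridden=1 + 3)
print(f"override rate: {rate:.1%}")
print(f"in target band: {in_target_band(rate)}")
```

If the overrides were instead a subset of the 43 graduated cases, the denominator would be 43 rather than 47; both readings round to the same whole-number percentage.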

Filter score distribution (graduated)

Distribution of the 43 graduated hypotheses by composite filter score (out of 15).

Score band	Count
9.0-9.9	13
10.0-10.9	25
11.0-11.9	4
12.0+	1

Commander overrides

Action	Count
DEFER	1
KILL	3

Why hypotheses get killed

Reason	Count
evidence_search_exhausted	24
move_cap_reached	13
v2_backfill_orphan_S148	7
fatal_objection_both_confirm	2
council_verdict_unanimous_kill	2
structural_duplicate_15ed71_S148	1

Cost transparency

Total engine spend across all moves: $270.12 over 3,474 logged operations. Average cost per hypothesis from admission to current state: $2.21.
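The averages can be reproduced from the totals above. This is a small sketch using only the figures on this page; the per-operation average is derived here for illustration and is not a number the engine reports.

```python
from decimal import Decimal, ROUND_HALF_UP

total_spend = Decimal("270.12")  # total engine spend in USD
hypotheses = 122                 # total hypotheses
operations = 3474                # logged operations

# Average cost per hypothesis, rounded to whole cents.
cost_per_hypothesis = (total_spend / hypotheses).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP
)

# Derived figure (not reported on the page): average cost per logged operation.
cost_per_operation = (total_spend / operations).quantize(
    Decimal("0.001"), rounding=ROUND_HALF_UP
)

print(f"per hypothesis: ${cost_per_hypothesis}")
print(f"per operation:  ${cost_per_operation}")
```

Decimal is used instead of float so that the rounding to cents is exact.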

Known limitations

The graduation bar is not a buy signal. A graduated hypothesis has passed structural filters; it has not been validated against real customer demand.
Filter scoring uses LLM advocates. Two perspectives, three runs, median taken, but still subject to LLM bias. The triple-run IQR is the engine's measure of its own consistency, not its accuracy.
Signals come from agentic web search. Quality depends on what is findable; absence of a primary source does not mean none exists.
The engine has no track record on commercial outcomes yet. No graduated hypothesis has been built to product. Until one has, the calibration is methodology-only, not outcome-validated.
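The triple-run median and IQR mentioned under filter scoring can be sketched as follows. The scores are made-up illustrative values and the perspective names are hypothetical. The quartile method (inclusive interpolation) is also an assumption, since the page does not say how the engine computes the IQR or how it combines the two perspectives.

```python
from statistics import median, quantiles


def summarize_runs(scores: list[float]) -> tuple[float, float]:
    """Return (median, IQR) for repeated scoring runs of one perspective."""
    # Assumption: quartiles use inclusive linear interpolation.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), q3 - q1


# Hypothetical scores (out of 15): three runs for each of two perspectives.
runs = {
    "advocate_for": [9.5, 10.0, 11.0],
    "advocate_against": [10.0, 10.0, 10.5],
}

summary = {name: summarize_runs(scores) for name, scores in runs.items()}
for name, (med, iqr) in summary.items():
    print(f"{name}: median={med}, IQR={iqr}")
```

A small IQR only shows that the three runs agree with each other; all three can share the same bias, which is why it measures consistency and not accuracy.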