Calibration
The engine's track record, scored against itself. These are the numbers a customer should look at before trusting any verdict.
Headline metrics
| Metric | Value |
|---|---|
| Graduation rate (decided) | 33% |
| Commander override rate | 9% |
| Avg cost per hypothesis | $2.21 |
Override rate is the percentage of graduated-or-overridden cases where the human disagreed with the engine. A high rate means the engine is missing something the human catches; a low rate means the engine is well-calibrated. Currently 9%, just below the target band of 10-25%.
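The override rate definition above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's own code: the function names `override_rate` and `band_position` are hypothetical, the 10-25% band is taken from the text, and the counts in the usage lines are made-up inputs, not the engine's data.

```python
def override_rate(disagreements: int, reviewed_cases: int) -> float:
    """Percentage of graduated-or-overridden cases where the human disagreed."""
    if reviewed_cases <= 0:
        raise ValueError("reviewed_cases must be positive")
    return 100 * disagreements / reviewed_cases


def band_position(rate_pct: float, low: float = 10.0, high: float = 25.0) -> str:
    """Report where a rate sits relative to the target band (inclusive bounds assumed)."""
    if rate_pct < low:
        return "below"
    if rate_pct > high:
        return "above"
    return "within"


# Illustrative inputs only: 3 disagreements out of 20 reviewed cases.
rate = override_rate(disagreements=3, reviewed_cases=20)
print(f"{rate:.1f}% is {band_position(rate)} the target band")
```

Treating the band bounds as inclusive is an assumption; the report does not say how a rate of exactly 10% or 25% is classified.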
Filter score distribution (graduated)
Distribution of the 43 graduated hypotheses by composite filter score (out of 15).
| Score band | Count |
|---|---|
| 9.0-9.9 | 13 |
| 10.0-10.9 | 25 |
| 11.0-11.9 | 4 |
| 12.0+ | 1 |
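As a rough sketch of how such a banded tally could be produced, the snippet below assigns a composite score to a band and checks that the reported counts add up to the 43 graduated hypotheses. The helper `score_band` is hypothetical, and the band edges are assumed to be half-open (for example, 9.0 up to but not including 10.0), which the table does not state explicitly.

```python
import math
from collections import Counter


def score_band(score: float) -> str:
    """Map a composite filter score (out of 15) to a band label."""
    if score >= 12.0:
        return "12.0+"
    if score < 9.0:
        return "<9.0"  # assumption: no graduated hypothesis appears below 9.0 in the table
    lower = math.floor(score)
    return f"{lower}.0-{lower}.9"


# Counts as reported in the table above.
REPORTED_COUNTS = {"9.0-9.9": 13, "10.0-10.9": 25, "11.0-11.9": 4, "12.0+": 1}
assert sum(REPORTED_COUNTS.values()) == 43

# Illustrative scores only, not engine data.
sample_scores = [9.3, 10.0, 10.7, 11.5, 12.4]
tally = Counter(score_band(s) for s in sample_scores)
print(dict(tally))
```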
Commander overrides
Why hypotheses get killed
| Reason | Count |
|---|---|
| evidence_search_exhausted | 24 |
| move_cap_reached | 13 |
| v2_backfill_orphan_S148 | 7 |
| fatal_objection_both_confirm | 2 |
| council_verdict_unanimous_kill | 2 |
| structural_duplicate_15ed71_S148 | 1 |
Cost transparency
Total engine spend across all moves: $270.12 over 3,474 logged operations. Average cost per hypothesis from admission to current state: $2.21.
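The figures above allow two quick derived checks, sketched here in Python. The per-operation cost follows directly from the reported totals. The implied hypothesis count assumes the $2.21 average is simply total spend divided by the number of hypotheses, which the report does not confirm.

```python
TOTAL_SPEND_USD = 270.12
LOGGED_OPERATIONS = 3474
AVG_COST_PER_HYPOTHESIS_USD = 2.21

# Average cost of a single logged operation.
cost_per_operation = TOTAL_SPEND_USD / LOGGED_OPERATIONS

# Assumption: average = total spend / hypothesis count, so invert it.
implied_hypotheses = TOTAL_SPEND_USD / AVG_COST_PER_HYPOTHESIS_USD

print(f"cost per operation: ${cost_per_operation:.3f}")  # $0.078
print(f"implied hypotheses: {implied_hypotheses:.0f}")   # 122
```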
Known limitations
- The graduation bar is not a buy signal. A graduated hypothesis has passed structural filters; it has not been validated against real customer demand.
- Filter scoring uses LLM advocates. Two perspectives, three runs, median taken, but the result is still subject to LLM bias. The triple-run IQR is the engine's measure of its own consistency, not its accuracy.
- Signals come from agentic web search. Quality depends on what is findable; absence of a primary source does not mean none exists.
- The engine has no track record on commercial outcomes yet. No graduated hypothesis has been built into a product. Until one has, the calibration is methodology-only, not outcome-validated.
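The "three runs, median taken" scoring and the triple-run IQR mentioned in the limitations can be illustrated with Python's standard `statistics` module. This is a sketch under stated assumptions: the function `summarize_runs` is hypothetical, it covers one advocate's three runs only (how the two perspectives are combined is not described), and the quartile method is assumed to be the "inclusive" one, since the report does not say how the IQR is computed.

```python
import statistics


def summarize_runs(run_scores: list[float]) -> tuple[float, float]:
    """Return (median, IQR) for one advocate's three filter-score runs."""
    if len(run_scores) != 3:
        raise ValueError("expected exactly three runs")
    median = statistics.median(run_scores)
    # Assumption: inclusive quartiles; with three points Q1 and Q3 fall
    # halfway between the median and the lowest/highest run.
    q1, _, q3 = statistics.quantiles(run_scores, n=4, method="inclusive")
    return median, q3 - q1


# Illustrative scores only, not engine data.
median, iqr = summarize_runs([11.0, 9.0, 10.0])
print(median, iqr)  # 10.0 1.0
```

A small IQR here says only that the three runs agreed with each other; as the limitation notes, agreement between runs is a consistency signal, not evidence that the score is correct.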