Interviewer Scorecard Calibration Ledger for Heads of Talent

ranked [TRIANGULATED] filter 8.0/15 spread ±2.0 signals: 3 independent

What is this?

A per-interviewer calibration ledger for the head of talent at a 50-300-person founder-led SaaS company running weekly engineering/product/GTM loops in Greenhouse / Lever / Ashby. Every interview produces a structured scorecard claim (e.g. 'strong hire, 8/10 technical, will ramp in 60 days'). AE ingests the scorecard at submission, tags 1-3 claim types (technical depth / ramp speed / culture-add / hire recommendation), and resolves each against ATS outcomes (offer extended, accepted, 90-day retention, manager 6-month rating).

Each interviewer accumulates a persistent claim-type ledger; AE's 6-pattern autopsy maps to interviewer overclaim modes (Cosmetic Confidence on charismatic candidates, Concession Laundering on weak signals re-framed as 'coachable'). The head of talent sees calibrated weights per interviewer per claim type before finalising loops and committee debriefs, with promotion/demotion of interviewer panel slots driven by claim calibration rather than tenure.

Volume is 200-1000+ scorecards per year per company, well above the statistical floor the critic correctly flagged.
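A minimal sketch of the per-interviewer, per-claim-type ledger described above. Everything here is an assumption made for illustration: the `Claim` fields, the encoding of a scorecard claim as a stated probability, and the use of a Brier score (mean squared error between stated confidence and the resolved outcome) as the calibration measure. The source does not say how AE scores calibration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """One tagged claim from a scorecard (hypothetical shape)."""
    interviewer: str
    claim_type: str    # e.g. "technical_depth", "ramp_speed", "culture_add"
    confidence: float  # stated probability the claim holds, 0.0 to 1.0 (assumed encoding)


class CalibrationLedger:
    """Accumulates resolved claims per (interviewer, claim_type)."""

    def __init__(self) -> None:
        # (interviewer, claim_type) -> list of squared errors
        self._errors = defaultdict(list)

    def resolve(self, claim: Claim, outcome: bool) -> None:
        """Record a claim once its ATS outcome is known."""
        actual = 1.0 if outcome else 0.0
        key = (claim.interviewer, claim.claim_type)
        self._errors[key].append((claim.confidence - actual) ** 2)

    def brier(self, interviewer: str, claim_type: str) -> Optional[float]:
        """Mean squared error; lower is better. None if nothing resolved yet."""
        errs = self._errors.get((interviewer, claim_type))
        if not errs:
            return None
        return sum(errs) / len(errs)


# Hypothetical usage: two ramp-speed claims at 80% confidence, one right, one wrong.
ledger = CalibrationLedger()
ledger.resolve(Claim("interviewer_a", "ramp_speed", 0.8), outcome=True)
ledger.resolve(Claim("interviewer_a", "ramp_speed", 0.8), outcome=False)
score = ledger.brier("interviewer_a", "ramp_speed")
```

Keying the ledger on the (interviewer, claim type) pair rather than on the interviewer alone is what lets an interviewer be well calibrated on technical depth and poorly calibrated on ramp speed at the same time.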

Why did we consider it?

Interviewer scorecards map natively onto AE's graded-prediction engine, the calibration incumbents miss this niche, and the buyer/economics fit a solo UK commander targeting £100–300K within 18 months.

What breaks?

Violates the <24h feedback-loop constraint by relying on 90-day and 6-month lagging indicators for ground truth.
Per-interviewer volume fails mathematically: 1,000 scorecards across 40 interviewers is ~25 per interviewer per year, making the 508-prediction threshold impossible per user.
Ground-truth pollution: 6-month manager ratings measure subjective perception, not the objective reality AE requires.
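The volume objection can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the 508-prediction threshold applies per interviewer (the source says "per user"), scorecards are spread evenly across interviewers, and each tagged claim counts as one prediction. The function names are hypothetical.

```python
def per_interviewer_per_year(scorecards_per_year: float, interviewers: int) -> float:
    """Scorecards each interviewer submits per year, assuming an even spread."""
    return scorecards_per_year / interviewers


def years_to_threshold(
    scorecards_per_year: float,
    interviewers: int,
    claims_per_scorecard: int,
    threshold: int = 508,
) -> float:
    """Years until one interviewer has `threshold` resolved predictions."""
    yearly = per_interviewer_per_year(scorecards_per_year, interviewers)
    return threshold / (yearly * claims_per_scorecard)


# Figures from the objection: 1,000 scorecards across 40 interviewers.
cards = per_interviewer_per_year(1000, 40)   # 25.0 scorecards per interviewer per year
worst = years_to_threshold(1000, 40, 1)      # one claim per scorecard: 20.32 years
best = years_to_threshold(1000, 40, 3)       # three claims per scorecard: about 6.77 years
```

Even at the upper end of 1-3 tagged claims per scorecard, a single interviewer would need several years to reach the threshold under these assumptions, and longer still if the threshold applies per claim type.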

What did we learn?

Still in evaluation (phase: ranked). No verdict yet.

Filter scores

Five axes, each scored 0-3. Three independent runs by different model perspectives. Median shown.

Axis	What it measures
data moat	Does this product accumulate proprietary data that compounds?
10x model test	Does a better model make this more valuable, or redundant?
fast feedback loops	Can outputs be graded against reality in <30 days?
solo founder feasible	Can a solo operator build and run this without a team?
AI providers can't eat it	Do hyperscalers have structural reasons NOT to build this?

Composite median: 8.0 / 15. Graduation threshold: 9.0. IQR across runs: 2.0.
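A sketch of how a composite median, IQR, and graduation check could be computed from three runs. The per-run axis scores below are hypothetical placeholders, not the actual scores behind this entry, and the axis keys are shortened names invented for the example. Two further assumptions: the composite is the median of per-run totals (rather than the sum of per-axis medians), and graduation means the median is at or above the threshold.

```python
from statistics import median, quantiles

# Hypothetical short keys for the five axes.
AXES = ["data_moat", "ten_x_model", "fast_feedback", "solo_feasible", "provider_proof"]
GRADUATION_THRESHOLD = 9.0  # from the text above


def composite(run: dict) -> int:
    """Sum of the five axis scores for one run; each axis must be 0-3."""
    for axis in AXES:
        if not 0 <= run[axis] <= 3:
            raise ValueError(f"{axis} score out of range: {run[axis]}")
    return sum(run[axis] for axis in AXES)


def summarise(runs: list) -> dict:
    """Median and interquartile range of per-run totals, plus a graduation flag."""
    totals = [composite(run) for run in runs]
    q1, _, q3 = quantiles(totals, n=4)  # default 'exclusive' method
    med = median(totals)
    return {"median": med, "iqr": q3 - q1, "graduates": med >= GRADUATION_THRESHOLD}


# Placeholder scores for illustration only.
HYPOTHETICAL_RUNS = [
    {"data_moat": 1, "ten_x_model": 2, "fast_feedback": 1, "solo_feasible": 2, "provider_proof": 1},
    {"data_moat": 2, "ten_x_model": 2, "fast_feedback": 1, "solo_feasible": 2, "provider_proof": 1},
    {"data_moat": 2, "ten_x_model": 2, "fast_feedback": 1, "solo_feasible": 2, "provider_proof": 2},
]
summary = summarise(HYPOTHETICAL_RUNS)
```

With only three runs the IQR is a coarse measure and depends on the quantile method chosen, so it is best read as a rough spread indicator.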

Evidence

Signal A — Primary source

https://arxiv.org/pdf/1504.03425 credibility: medium

We present a computational framework for automatically quantifying verbal and nonverbal behaviors in the context of job interviews.

Signal B — Competitor with documented gap

https://www.hackerearth.com/blog/how-to-use-ai-for-recruiting

HackerEarth offers Interviewer Benchmarking that compares interviewer performance and scoring patterns to identify calibration gaps, but the snippet shows no resolution of interviewer claims against longitudinal post-hire outcomes (90-day retention, manager 6-month ratings) nor per-claim-type ledger tracking (technical depth vs. ramp speed vs. culture-add). It identifies pattern discrepancies, not predictive accuracy per claim type.

Signal D — Demand proxy

https://hbr.org/2016/02/a-scorecard-for-making-better-hiring-decisions
https://www.linkedin.com/pulse/building-interviewer-…

HBR published a dedicated framework for quantitative interview scorecards comparing predictions to outcomes. Ashby (ATS vendor) advocates starting with a data audit of existing interview notes before scorecard design. HN threads surface developer frustration with inconsistent and uncalibrated interview processes, and Triplebyte's now-sunsetting standardized assessment model shows market appetite for objectified interview evaluation.

Evaluation history

When	Stage	Phase
2026-05-14 01:07	evidence_search	ranked
2026-05-13 21:54	evidence_search	ranked
2026-05-13 21:01	evidence_search	ranked
2026-05-10 15:54	evidence_search	ranked
2026-05-10 15:12	evidence_search	ranked
2026-05-10 14:30	evidence_search	ranked
2026-05-10 13:42	evidence_search	ranked
2026-05-10 13:00	evidence_search	ranked
2026-05-10 12:18	evidence_search	ranked
2026-05-10 11:31	evidence_search	ranked
2026-05-10 11:18	evidence_search	ranked
2026-05-10 11:13	evidence_search	ranked
2026-05-10 11:07	evidence_search	ranked
2026-05-10 11:00	filter_score	scored
2026-05-10 10:54	filter_score	scored
2026-05-10 10:48	filter_score	scored
2026-05-10 10:43	evidence_search	argument
2026-05-10 10:36	audience_simulation	argument
2026-05-10 10:30	red_team_kill	argument
2026-05-10 10:24	steelman	argument
2026-05-10 10:20	genesis	argument