
Interviewer Scorecard Calibration Ledger for Heads of Talent

ranked [TRIANGULATED] | filter 8.0/15 | spread ±2.0 | signals: 3 independent
What is this?
A per-interviewer calibration ledger for the head of talent at a 50-300 person, founder-led SaaS company running weekly engineering, product and GTM interview loops in Greenhouse, Lever or Ashby. Every interview produces a structured scorecard claim (e.g. 'strong hire, 8/10 technical, will ramp in 60 days'). AE ingests the scorecard at submission, tags it with 1-3 claim types (technical depth, ramp speed, culture-add, hire recommendation), and resolves each claim against ATS outcomes (offer extended, offer accepted, 90-day retention, manager 6-month rating). Each interviewer accumulates a persistent ledger per claim type, and AE's 6-pattern autopsy maps onto interviewer overclaim modes (Cosmetic Confidence on charismatic candidates, Concession Laundering when weak signals are re-framed as 'coachable'). Before finalising loops and committee debriefs, the head of talent sees calibrated weights per interviewer per claim type, so interviewer panel slots are promoted or demoted on claim calibration rather than tenure. Company-wide volume is 200-1000+ scorecards per year, well above the statistical floor the critic correctly flagged.
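A minimal sketch of the ledger described above, to make the data shape concrete. Everything here is an assumption for illustration: the class and field names (Claim, CalibrationLedger, predicted_positive, outcome_positive) are hypothetical, each claim is reduced to a yes/no prediction, and the weight is a Laplace-smoothed hit rate rather than whatever grading AE actually applies.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Claim types named in the description; the identifiers are hypothetical.
CLAIM_TYPES = {"technical_depth", "ramp_speed", "culture_add", "hire_recommendation"}


@dataclass
class Claim:
    interviewer: str
    claim_type: str
    predicted_positive: bool                  # e.g. "will ramp in 60 days"
    outcome_positive: Optional[bool] = None   # None until the ATS outcome is known


class CalibrationLedger:
    """Per-interviewer, per-claim-type record of graded predictions."""

    def __init__(self) -> None:
        self._claims = defaultdict(list)  # (interviewer, claim_type) -> [Claim]

    def record(self, claim: Claim) -> None:
        if claim.claim_type not in CLAIM_TYPES:
            raise ValueError(f"unknown claim type: {claim.claim_type}")
        self._claims[(claim.interviewer, claim.claim_type)].append(claim)

    def weight(self, interviewer: str, claim_type: str) -> float:
        """Laplace-smoothed hit rate over resolved claims (assumed metric)."""
        claims = self._claims.get((interviewer, claim_type), [])
        resolved = [c for c in claims if c.outcome_positive is not None]
        hits = sum(c.predicted_positive == c.outcome_positive for c in resolved)
        return (hits + 1) / (len(resolved) + 2)


ledger = CalibrationLedger()
ledger.record(Claim("alice", "ramp_speed", True, True))
ledger.record(Claim("alice", "ramp_speed", True, False))
ledger.record(Claim("alice", "ramp_speed", False, False))
ledger.record(Claim("alice", "ramp_speed", True))  # unresolved, ignored for now
print(ledger.weight("alice", "ramp_speed"))  # 0.6
```

The smoothing means an interviewer with no resolved claims starts at 0.5 instead of an undefined or extreme weight, which matters when per-interviewer volume is small.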
Why did we consider it?
Interviewer scorecards map natively onto AE's graded-prediction engine, the calibration incumbents miss this niche, and the buyer/economics fit a solo UK commander targeting £100–300K within 18 months.
What breaks?
  • Violates the <24h feedback loop constraint by relying on 90-day and 6-month lagging indicators for ground truth.
  • Per-interviewer volume is too low: 1,000 scorecards across 40 interviewers is ~25 per interviewer per year, which puts the 508-prediction threshold out of reach for any individual interviewer.
  • Ground truth pollution: 6-month manager ratings measure subjective perception, not the objective reality required by the AE.
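The volume objection above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the function name is hypothetical, scorecards are assumed to be spread evenly across interviewers, and the claims_per_scorecard parameter is an assumption, since the page does not say whether the 508 threshold counts scorecards or individual tagged claims.

```python
def years_to_threshold(scorecards_per_year: int, interviewers: int,
                       threshold: int, claims_per_scorecard: int = 1):
    """Return (predictions per interviewer per year, years to reach threshold).

    Assumes an even split of scorecards across interviewers.
    """
    per_interviewer = scorecards_per_year / interviewers * claims_per_scorecard
    return per_interviewer, threshold / per_interviewer


# Figures from the objection: 1,000 scorecards, 40 interviewers, 508 predictions.
per_year, years = years_to_threshold(1000, 40, 508)
print(per_year, round(years, 1))  # 25.0 20.3

# Most generous reading: every scorecard carries 3 graded claims (assumption).
per_year_3, years_3 = years_to_threshold(1000, 40, 508, claims_per_scorecard=3)
print(per_year_3, round(years_3, 1))  # 75.0 6.8
```

Even on the generous reading, a single interviewer needs several years of loops before the threshold is met.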
What did we learn?
Still in evaluation (phase: ranked). No verdict yet.

Filter scores

Five axes, each scored 0-3. Three independent runs by different model perspectives. Median shown.

Axis                       What it measures
data moat                  Does this product accumulate proprietary data that compounds?
10x model test             Does a better model make this more valuable, or redundant?
fast feedback loops        Can outputs be graded against reality in <30 days?
solo founder feasible      Can a solo operator build and run this without a team?
AI providers cant eat it   Do hyperscalers have structural reasons NOT to build this?
Composite median: 8.0 / 15. Graduation threshold: 9.0. IQR across runs: 2.0.
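For readers who want to see how a composite like this could be derived, here is a sketch. The per-run axis scores below are hypothetical placeholders, not the actual run outputs (which this page does not show). The aggregation is also an assumption: each run's composite is taken as the sum of its five axis scores, the headline figure as the median of those composites, and the IQR from the default method of statistics.quantiles.

```python
import statistics

GRADUATION_THRESHOLD = 9.0  # from the page

# HYPOTHETICAL per-run scores (0-3 per axis), for illustration only.
runs = [
    {"data moat": 2, "10x model test": 1, "fast feedback loops": 1,
     "solo founder feasible": 2, "AI providers cant eat it": 1},
    {"data moat": 2, "10x model test": 2, "fast feedback loops": 1,
     "solo founder feasible": 2, "AI providers cant eat it": 1},
    {"data moat": 2, "10x model test": 2, "fast feedback loops": 1,
     "solo founder feasible": 2, "AI providers cant eat it": 2},
]

# One composite per run: sum of the five axis scores (max 15).
composites = sorted(sum(run.values()) for run in runs)

composite_median = statistics.median(composites)
q1, _, q3 = statistics.quantiles(composites, n=4)  # default 'exclusive' method
iqr = q3 - q1
graduates = composite_median >= GRADUATION_THRESHOLD

print(composites, composite_median, iqr, graduates)
```

With only three runs, the IQR is very sensitive to the quantile method chosen, so the spread figure is better read as a rough disagreement indicator than a precise statistic.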

Evidence

Signal A — Primary source

We present a computational framework for automatically quantifying verbal and nonverbal behaviors in the context of job interviews.

Signal B — Competitor with documented gap

HackerEarth offers Interviewer Benchmarking that compares interviewer performance and scoring patterns to identify calibration gaps, but the snippet shows no resolution of interviewer claims against longitudinal post-hire outcomes (90-day retention, manager 6-month ratings) nor per-claim-type ledger tracking (technical depth vs. ramp speed vs. culture-add). It identifies pattern discrepancies, not predictive accuracy per claim type.

Signal D — Demand proxy

Found: true.
Summary: HBR published a dedicated framework for quantitative interview scorecards comparing predictions to outcomes; Ashby (ATS vendor) advocates starting with a data audit of existing interview notes before scorecard design; HN threads surface developer frustration with inconsistent and uncalibrated interview processes, and Triplebyte's now-sunsetting standardized assessment model shows market appetite for objectified interview evaluation.
Sources:
  • https://hbr.org/2016/02/a-scorecard-for-making-better-hiring-decisions
  • https://www.linkedin.com/pulse/building-interviewer-…

Evaluation history

When               Stage                 Phase
2026-05-14 01:07   evidence_search       ranked
2026-05-13 21:54   evidence_search       ranked
2026-05-13 21:01   evidence_search       ranked
2026-05-10 15:54   evidence_search       ranked
2026-05-10 15:12   evidence_search       ranked
2026-05-10 14:30   evidence_search       ranked
2026-05-10 13:42   evidence_search       ranked
2026-05-10 13:00   evidence_search       ranked
2026-05-10 12:18   evidence_search       ranked
2026-05-10 11:31   evidence_search       ranked
2026-05-10 11:18   evidence_search       ranked
2026-05-10 11:13   evidence_search       ranked
2026-05-10 11:07   evidence_search       ranked
2026-05-10 11:00   filter_score          scored
2026-05-10 10:54   filter_score          scored
2026-05-10 10:48   filter_score          scored
2026-05-10 10:43   evidence_search       argument
2026-05-10 10:36   audience_simulation   argument
2026-05-10 10:30   red_team_kill         argument
2026-05-10 10:24   steelman              argument
2026-05-10 10:20   genesis               argument