Bot-Promise Slip Triage for B2B Support Operations

graduated [TRIANGULATED] filter 9.0/15 spread ±1.0 signals: 2 independent
What is this?
A daily morning ledger that surfaces every customer-facing commitment a bot agent (Intercom Fin / Zendesk AI / Decagon / Ada) made in the last 24-72 hours that is now at risk of breach, before the customer escalates. The buyer is the support ops lead at a 50-300 person B2B SaaS company running one of these bot agents on inbound tickets. The bot routinely promises resolution dates, refunds, escalations, or engineering involvement that it cannot guarantee, and the human team finds out only when the customer comes back angry. The product consumes the bot platform's structured event log (deadline_promised, refund_offered, escalated_to_human; tagged manually once at setup as commitment categories), runs AE's adversarial multi-model debate to challenge each event against current ticket state and historical resolution patterns for similar tickets, then ranks tickets by breach probability. The lead works the top 10 each morning, salvaging the customer relationship before NPS damage. Outcome ground truth arrives within 3-14 days as tickets close, which closes AE's grading loop against real outcomes rather than a vendor's self-rating of its own bot.
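The ranking step described above can be sketched roughly as follows. This is a minimal illustration, not the product's implementation: the `BotEvent` type, the `morning_ledger` function, and the `breach_probability` field are hypothetical names, and the probability is assumed to arrive already computed by the upstream debate step. Only the three event categories and the 72-hour window and top-10 cut come from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Event types tagged as commitment categories at setup (names from the description).
COMMITMENT_CATEGORIES = {"deadline_promised", "refund_offered", "escalated_to_human"}


@dataclass(frozen=True)
class BotEvent:
    """Hypothetical shape of one row in the bot platform's event log."""
    ticket_id: str
    event_type: str
    occurred_at: datetime
    # Assumption: produced upstream by the debate step, not computed here.
    breach_probability: float


def morning_ledger(events, now, window=timedelta(hours=72), top_n=10):
    """Return the riskiest commitment per ticket, highest breach probability first."""
    cutoff = now - window
    riskiest = {}
    for event in events:
        if event.event_type not in COMMITMENT_CATEGORIES:
            continue  # not a commitment, ignore
        if not (cutoff <= event.occurred_at <= now):
            continue  # outside the look-back window
        current = riskiest.get(event.ticket_id)
        if current is None or event.breach_probability > current.breach_probability:
            riskiest[event.ticket_id] = event
    ranked = sorted(riskiest.values(), key=lambda e: e.breach_probability, reverse=True)
    return ranked[:top_n]
```

Keeping one entry per ticket (its riskiest commitment) is a design choice made for this sketch so that the top 10 are ten distinct tickets rather than ten events on the same ticket.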
Why did we consider it?
Bot agents make promises their human teams cannot keep; AE's adversarial debate and reality-graded scoring are uniquely suited to ranking breach risk before NPS damage, and the buyer, integration, and price all fit a solo UK operator working evenings and weekends.
What breaks?
  • The Band-Aid Fallacy: Buyers will disable or restrict rogue bots making false financial/timeline promises rather than paying for a third-party triage tool to monitor them.
  • API Reality & Rate Limits: Bot platforms don't emit structured logs for hallucinated promises; parsing raw transcripts at scale will crush a solo dev with rate limits (per dblock's 'AI Slop' warning).
  • The HITL Bottleneck: Forcing Support Ops to manually salvage bot mistakes daily creates an unscalable human bottleneck (per Tian Pan's 'Human Review Queue' analysis).
What did we learn?
Engine verdict: GATHER_MORE_SIGNAL (WORTH_SKIMMING). Real structural pain, but the extraction premise is unvalidated and the GTM fit is hostile to an introverted solo founder; needs a 7-day signal check before commit.

Filter scores

Five axes, each scored 0-3. Three independent runs by different model perspectives. Median shown.

Axis                      What it measures
data moat                 Does this product accumulate proprietary data that compounds?
10x model test            Does a better model make this more valuable, or redundant?
fast feedback loops       Can outputs be graded against reality in <30 days?
solo founder feasible     Can a solo operator build and run this without a team?
AI providers cant eat it  Do hyperscalers have structural reasons NOT to build this?
Composite median: 9.0 / 15. Graduation threshold: 9.0. IQR across runs: 1.0.
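As a rough illustration of how a composite like this could be derived, the sketch below sums the five axis scores per run, then takes the median and interquartile range across runs. It assumes the composite is the median of per-run totals and that the IQR uses inclusive quartiles; neither is confirmed by the page. The per-run scores in the usage note are invented for the example and are not the actual run scores.

```python
from statistics import median, quantiles

AXES = (
    "data moat",
    "10x model test",
    "fast feedback loops",
    "solo founder feasible",
    "AI providers cant eat it",
)
GRADUATION_THRESHOLD = 9.0  # from the page


def composite(runs):
    """Summarise per-run totals; each run maps axis name -> score in 0..3."""
    totals = sorted(sum(run[axis] for axis in AXES) for run in runs)
    # Assumption: inclusive quartiles; other conventions give a different IQR.
    q1, _, q3 = quantiles(totals, n=4, method="inclusive")
    mid = float(median(totals))
    return {"median": mid, "iqr": q3 - q1, "graduated": mid >= GRADUATION_THRESHOLD}


# Hypothetical scores for three runs (totals 8, 9, 10), for illustration only.
example_runs = [
    dict(zip(AXES, (1, 2, 2, 1, 2))),
    dict(zip(AXES, (2, 2, 2, 1, 2))),
    dict(zip(AXES, (2, 2, 3, 1, 2))),
]
summary = composite(example_runs)
```

With these made-up inputs the summary has a median equal to the graduation threshold, so the hypothesis would graduate on the boundary.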

Evidence

Signal B — Competitor with documented gap

Fini and similar AI triage tools (Wizr, LiveChatAI) focus on routing and resolving inbound tickets but do not retroactively audit commitments the bot itself made, detect at-risk promises, or rank tickets by breach probability before customer escalation. The gap is post-promise monitoring: no existing tool treats the bot's own outputs as liabilities to be triaged.

Signal D — Demand proxy

Found: yes

Summary: Multiple content signals confirm the problem space: chatbot mistakes in customer support (broken escalations, poor handoffs) are widely discussed, B2B support teams struggle with reactive ticket-chasing, and the gap between bot automation and operational control is a recognized theme. However, no forum threads or GitHub issues specifically discuss bot-promise breach detection.

Sources:
  • https://www.nurix.ai/resources/chatbot-mistakes-customer-support
  • https://front.com/blog/customer-service-automation
  • https://front.com/blog/b2b-customer-service
  • https://livechat…

Evaluation history

When              Stage                 Phase
2026-05-13 03:37  deep_council_verdict  graduated
2026-05-13 03:36  deep_claude_take      graduated
2026-05-13 03:34  deep_90day_plan       graduated
2026-05-13 03:33  deep_risk             graduated
2026-05-13 03:32  deep_distribution     graduated
2026-05-13 03:30  deep_pricing          graduated
2026-05-13 03:29  deep_moat             graduated
2026-05-13 03:28  deep_buyer_sim        graduated
2026-05-13 03:27  deep_icp              graduated
2026-05-13 03:25  deep_competitor       graduated
2026-05-13 03:25  deep_market_reality   graduated
2026-05-13 03:18  filter_score          scored
2026-05-13 03:12  filter_score          scored
2026-05-13 03:06  filter_score          scored
2026-05-13 03:00  evidence_search       argument
2026-05-13 00:24  evidence_search       argument
2026-05-12 22:36  evidence_search       argument
2026-05-12 20:48  evidence_search       argument
2026-05-12 18:54  evidence_search       argument
2026-05-12 17:00  evidence_search       argument
2026-05-12 15:12  evidence_search       argument
2026-05-12 13:24  evidence_search       argument
2026-05-12 11:36  evidence_search       argument
2026-05-12 09:48  evidence_search       argument
2026-05-12 08:06  evidence_search       argument
2026-05-12 06:18  evidence_search       argument
2026-05-11 20:30  evidence_search       argument
2026-05-11 18:48  evidence_search       argument
2026-05-11 17:18  evidence_search       argument
2026-05-11 15:48  evidence_search       argument
2026-05-11 14:24  evidence_search       argument
2026-05-11 12:54  evidence_search       argument
2026-05-11 12:24  evidence_search       argument
2026-05-11 11:54  evidence_search       argument
2026-05-11 11:30  evidence_search       argument
2026-05-11 11:18  evidence_search       argument
2026-05-11 11:06  evidence_search       argument
2026-05-11 10:54  evidence_search       argument
2026-05-11 10:42  evidence_search       argument
2026-05-11 10:36  evidence_search       argument
2026-05-11 10:24  evidence_search       argument
2026-05-11 10:18  audience_simulation   argument
2026-05-11 10:12  red_team_kill         argument
2026-05-11 10:06  steelman              argument
2026-05-11 10:02  genesis               argument