Proving ROI in 10 days: a pilot template for AI in Growth & Ops (metrics, cut-offs, handover)

You don't need a six-month project to prove AI value. Run a 10-day pilot that is cheap, instrumented, and easy to roll back. This template gives the schedule, metrics, cut-offs, and handover materials you can reuse for Growth & Ops use cases (support triage, order status, invoice QA, pricing rules, etc.).
The 10-day schedule (repeatable)
Days 0–1 - Scope & baseline
- Define one narrow user journey (e.g., "Where is my order?", "AP invoice check", "price update on 200 SKUs").
- Freeze a baseline window (last 14–30 days).
- Agree the primary KPI and target threshold (example targets below).
- Set a hard cost ceiling and a canary cohort (e.g., 10–20% of traffic).
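One way to carve out a stable canary cohort is to hash each user into a bucket, so the same user always sees the same variant. This is a minimal sketch, not a prescribed method: the function name `in_canary` and the `feature:user_id` key format are assumptions made for illustration.

```python
import hashlib


def in_canary(user_id: str, feature: str, percent: int = 10) -> bool:
    """Return True if this user falls in the canary cohort for the feature.

    Hypothetical helper: hashing feature + user keeps assignment stable
    across requests and independent between features.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in 0..99
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    share = sum(in_canary(u, "support_whatsapp", 10) for u in users) / len(users)
    print(f"canary share: {share:.1%}")
```

Because assignment depends only on the hash, raising `percent` from 10 to 50 keeps every existing canary user in the cohort and only adds new ones.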
Days 2–3 - Wire data & guardrails
- Connect the minimum data sources (orders, tickets, POs/price lists).
- Add tracing (inputs, outputs, latency, tokens/costs).
- Add guardrails (schema validation, escalation path, rate limits).
- Create an eval set (10–30 realistic cases) to check quality daily.
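The schema-validation and escalation guardrails can be as simple as a function that refuses to send anything it cannot parse or trust. The sketch below assumes the model returns JSON; the field names in `REQUIRED_FIELDS` and the 0.7 confidence floor are hypothetical and should be replaced with whatever your journey needs.

```python
import json

# Hypothetical output schema for an order-status answer.
REQUIRED_FIELDS = {"answer": str, "order_id": str, "confidence": float}


def validate_output(raw: str, min_confidence: float = 0.7) -> dict:
    """Decide whether a model output is sent automatically or escalated."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return {"action": "escalate", "reason": "invalid_json", "payload": None}
    if not isinstance(payload, dict):
        return {"action": "escalate", "reason": "not_an_object", "payload": None}
    # Every required field must be present with the expected type.
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(name), expected_type):
            return {"action": "escalate", "reason": f"bad_field:{name}", "payload": payload}
    if payload["confidence"] < min_confidence:
        return {"action": "escalate", "reason": "low_confidence", "payload": payload}
    return {"action": "send", "reason": "ok", "payload": payload}
```

Logging the `reason` alongside each trace makes it easy to see which guardrail fires most often during tuning.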
Days 4–5 - Ship a thin slice
- Enable the feature for the canary cohort only.
- Log all decisions with evidence (citations, diffs, audit trail).
- Start a daily scoreboard (see KPI table template below).
Days 6–7 - Tune or stop
- Review KPI movement vs baseline; fix obvious misses.
- If cut-offs are hit (quality, spend, or error thresholds), stop and document why.
Days 8–9 - Document & handover prep
- Finalise runbooks, dashboards, and rollback steps.
- Collect stakeholder feedback (support, finance, ops).
Day 10 - Decision
- If KPIs pass and budget holds, expand to 50–100% of traffic with monitoring.
- Otherwise, roll back and keep the artefacts for the next attempt.
Metrics that prove value
Core financials (pick the ones that fit your pilot)
- £ Saved / week from invoice variance & duplicates blocked.
- Tickets deflected % and median first response time.
- Gross margin % / £ movement after pricing rule changes.
- Conversion rate and refund rate where applicable.
- Cost per resolved item (tokens + infra + minutes of human time).
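# Simple ROI helpers used in the daily scoreboard

def pilot_roi(savings_per_week_gbp, revenue_uplift_gbp, pilot_cost_gbp):
    # Net return per £ spent; savings, uplift, and cost should cover the same period.
    net = savings_per_week_gbp + revenue_uplift_gbp - pilot_cost_gbp
    return 0 if pilot_cost_gbp == 0 else net / pilot_cost_gbp

def deflection_rate(total, handled_by_bot):
    # Share of items handled without a human; max() guards against empty days.
    return handled_by_bot / max(total, 1)

def margin_delta_pct(gm_after, gm_before):
    # Relative change in gross margin as a fraction (0.02 means +2%).
    return (gm_after - gm_before) / max(gm_before, 1e-6)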
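Cost per resolved item, as listed in the financials above, adds tokens, infra, and human minutes and divides by the number of items resolved. A minimal sketch follows; the `hourly_rate_gbp` parameter is an assumption introduced here to convert human minutes into pounds.

```python
def cost_per_resolved_item(token_cost_gbp, infra_cost_gbp, human_minutes,
                           hourly_rate_gbp, resolved_items):
    """Return GBP per resolved item, or None if nothing was resolved."""
    if resolved_items <= 0:
        return None
    # Convert human review time into money at the assumed hourly rate.
    human_cost_gbp = human_minutes / 60 * hourly_rate_gbp
    total_gbp = token_cost_gbp + infra_cost_gbp + human_cost_gbp
    return total_gbp / resolved_items


if __name__ == "__main__":
    # £20 tokens + £10 infra + 60 human minutes at £30/hour over 600 items.
    print(cost_per_resolved_item(20.0, 10.0, 60, 30.0, 600))
```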
Cut-offs (stop/go rules)
Stop immediately if
- Output quality fails: eval score or human QA < 0.85 on the chosen metric.
- Budget exceeded: pilot spend > £X/day or £Y/conversation.
- Latency p95 > Z seconds for two consecutive days.
Continue/expand if
- KPI moves by ≥ target (see defaults below) and cost per unit is stable or improving for 3 days.
- No critical incidents, and no escalations that lack audit evidence.
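The stop rules are easiest to enforce when they run as code against the daily numbers. The sketch below is one possible encoding: `DayStats` and `stop_reasons` are hypothetical names, the budget defaults borrow the £50/day and £0.07/conversation figures from the example decision memo, and the 5-second p95 limit is a placeholder for Z.

```python
from dataclasses import dataclass


@dataclass
class DayStats:
    quality: float            # eval score or human QA pass-rate, 0..1
    spend_gbp: float          # total pilot spend for the day
    cost_per_conv_gbp: float  # average spend per conversation
    p95_latency_s: float      # 95th percentile latency in seconds


def stop_reasons(days, min_quality=0.85, max_daily_gbp=50.0,
                 max_conv_gbp=0.07, max_p95_s=5.0):
    """Return the stop rules hit by the latest day (empty list means continue).

    `days` is a list of DayStats, oldest first.
    """
    if not days:
        return []
    reasons = []
    today = days[-1]
    if today.quality < min_quality:
        reasons.append("quality")
    if today.spend_gbp > max_daily_gbp or today.cost_per_conv_gbp > max_conv_gbp:
        reasons.append("budget")
    # Latency only trips after two consecutive days over the limit.
    if len(days) >= 2 and all(d.p95_latency_s > max_p95_s for d in days[-2:]):
        reasons.append("latency")
    return reasons
```

Returning the list of reasons, rather than a bare boolean, gives you the "document why" text for free when a pilot is stopped.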
Default KPI targets (sane starting points)
Ops / Support triage
- Deflection: +15–30% vs baseline.
- FRT: median first response time down 30–50%.
- QA pass-rate: ≥ 90% of sampled answers acceptable.
AP / Invoice QA
- Overcharge detection: at least 0.5–1.5% of AP value flagged, with < 5% false positives.
- Duplicate invoices caught: ≥ 80% of known dupes in backtests.
Pricing rules
- Gross margin: +1–3% on canary SKUs, with conversion falling by no more than 5% relative.
- Rollback time: < 15 minutes from trigger to revert.
Daily scoreboard (CSV you can paste into Sheets)
date,kpi_name,baseline_value,pilot_value,delta,budget_spend_gbp,unit_cost_gbp,incidents,notes
2025-09-08,deflection_rate,0.32,0.45,0.13,38.40,0.06,0,ok
2025-09-09,median_frt_seconds,720,410,-310,35.10,0.05,0,tuned escalation
2025-09-10,qa_pass_rate,0.86,0.91,0.05,33.00,0.05,0,stable
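Hand-maintained scoreboards drift, so it can help to recompute the delta column before anyone reads it. A small sketch using only the standard library is shown here; `check_deltas` is a hypothetical helper and the sample rows are the ones from the template.

```python
import csv
import io

SCOREBOARD = """date,kpi_name,baseline_value,pilot_value,delta,budget_spend_gbp,unit_cost_gbp,incidents,notes
2025-09-08,deflection_rate,0.32,0.45,0.13,38.40,0.06,0,ok
2025-09-09,median_frt_seconds,720,410,-310,35.10,0.05,0,tuned escalation
2025-09-10,qa_pass_rate,0.86,0.91,0.05,33.00,0.05,0,stable
"""


def check_deltas(csv_text, tol=1e-6):
    """Return (date, kpi_name) for rows where delta != pilot - baseline."""
    mismatches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        expected = float(row["pilot_value"]) - float(row["baseline_value"])
        if abs(expected - float(row["delta"])) > tol:
            mismatches.append((row["date"], row["kpi_name"]))
    return mismatches


if __name__ == "__main__":
    print(check_deltas(SCOREBOARD))
```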
Handover pack (what the team receives on Day 10)
Documents
- Decision memo (see template) with KPI results and a clear verdict.
- Runbook: on-call, escalation, rollback, and weekly checks.
- Data map: tables, joins, and owners; privacy notes.
Assets
- Dashboards: KPIs, spend, latency, evals.
- Tracing: searchable logs with request→answer→cost linkage.
- Eval set: frozen cases + procedure for updates.
# decision-memo.yml
pilot: "Support Triage WhatsApp"
owner: "Ops"
period: "2025-09-01..2025-09-10"
baseline_window: "2025-08-15..2025-08-31"
kpis:
  - name: deflection_rate
    baseline: 0.32
    pilot: 0.46
    target: 0.15 absolute increase
  - name: qa_pass_rate
    baseline: 0.86
    pilot: 0.91
budgets:
  daily_gbp: 50
  per_conv_gbp: 0.07
result: "GO - expand to 50% traffic with same guardrails"
risks:
  - "Peak-time latency near threshold; monitor p95"
Example pilot slices (pick one)
Support: "Where is my order?"
- Authenticate customer; look up order; answer with ETA, policy, and handoff button.
- KPIs: deflection %, FRT, cost/conversation, QA pass-rate.
AP: "Invoice QA on top 5 suppliers"
- Parse PDF; compare to PO/price list; flag variance/duplicates; open ticket.
- KPIs: £ overcharge detected, false-positive rate, minutes saved/invoice.
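The flagging step in this slice can start from two plain rules before any model is involved. The sketch below is illustrative only: the dictionary keys, the duplicate rule (same supplier, normalised invoice number, and amount), and the 1% price tolerance are all assumptions, not a description of any particular AP system.

```python
import re


def normalise_invoice_no(invoice_no: str) -> str:
    """Strip punctuation and spacing and upper-case, so 'inv 0042' == 'INV-0042'."""
    return re.sub(r"[^A-Za-z0-9]", "", invoice_no).upper()


def find_duplicates(invoices):
    """Return (first_id, duplicate_id) pairs for invoices that look identical."""
    seen = {}
    duplicates = []
    for inv in invoices:
        key = (
            inv["supplier"].strip().lower(),
            normalise_invoice_no(inv["invoice_no"]),
            round(inv["amount_gbp"], 2),
        )
        if key in seen:
            duplicates.append((seen[key], inv["id"]))
        else:
            seen[key] = inv["id"]
    return duplicates


def overcharge_gbp(billed_unit_price, agreed_unit_price, quantity, tolerance_pct=1.0):
    """Return the overcharge in GBP if the billed price exceeds the agreed
    price by more than the tolerance, otherwise 0.0."""
    limit = agreed_unit_price * (1 + tolerance_pct / 100)
    if billed_unit_price <= limit:
        return 0.0
    return (billed_unit_price - agreed_unit_price) * quantity
```

Summing `overcharge_gbp` across flagged lines gives the "£ overcharge detected" KPI, and reviewing a sample of flags by hand gives the false-positive rate.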
Pricing: "Guardrailed price update for 200 SKUs"
- Apply formula price with caps; canary to 10%; auto-rollback if conversion drops.
- KPIs: GM% delta, conversion delta, rollback MTTR.
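A capped formula price and an auto-rollback trigger might look like the sketch below. The 30% target margin and the 5% cap per update are hypothetical defaults; the 5% relative conversion drop matches the default target for pricing rules given earlier.

```python
def capped_price(unit_cost, current_price, target_margin=0.30, max_change_pct=0.05):
    """Formula price for the target gross margin, clamped to a cap around
    the current price so no single update moves a SKU too far."""
    formula_price = unit_cost / (1 - target_margin)
    floor = current_price * (1 - max_change_pct)
    ceiling = current_price * (1 + max_change_pct)
    return round(min(max(formula_price, floor), ceiling), 2)


def should_rollback(baseline_conversion, canary_conversion, max_relative_drop=0.05):
    """True if canary conversion fell by more than the allowed relative drop."""
    if baseline_conversion <= 0:
        return False  # no baseline signal, nothing to compare against
    drop = (baseline_conversion - canary_conversion) / baseline_conversion
    return drop > max_relative_drop


if __name__ == "__main__":
    print(capped_price(9.0, 10.0))        # formula wants ~12.86, cap holds it at 10.5
    print(should_rollback(0.040, 0.037))  # 7.5% relative drop
```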
Minimal instrumentation (don't skip)
Capture on every request
- Route/feature, user/session, prompt template, inputs (redacted), outputs, citations/evidence, latency, tokens, cost, decision taken (auto, escalated, rolled back).
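The fields listed above map naturally onto one record per request, written as a JSON line. This is a sketch under assumptions: the class name `TraceRecord`, the field names, and the JSON-lines format are illustrative choices, and redaction is expected to happen before the record is built.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

DECISIONS = {"auto", "escalated", "rolled_back"}


@dataclass
class TraceRecord:
    route: str
    session_id: str
    prompt_template: str
    inputs_redacted: str   # redact before constructing the record
    output: str
    evidence: list         # citations, diffs, or other audit links
    latency_ms: float
    tokens: int
    cost_gbp: float
    decision: str          # one of DECISIONS
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    ts: float = field(default_factory=time.time)


def to_json_line(record: TraceRecord) -> str:
    """Serialise one trace as a single JSON line for a searchable log."""
    if record.decision not in DECISIONS:
        raise ValueError(f"unknown decision: {record.decision}")
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping `cost_gbp` and `decision` on the same record as the output is what makes the request→answer→cost linkage searchable later.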
Dashboards to publish
- Spend vs budget; p50/p95 latency; eval trend; KPI trend vs baseline.
Risk checklist
Before enabling traffic
- Legal/privacy reviewed the data map and redaction.
- Escalation path staffed; SLAs agreed.
- Canary cohort and kill switch verified.
During the pilot
- Daily QA sample reviewed by the business owner.
- Incidents labelled and linked to traces.
- Costs monitored, with an alert at 80% of budget and a hard stop at 100%.
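The 80% alert and 100% hard stop reduce to a three-state check that can run after every request or on a schedule. A minimal sketch, with `budget_status` as a hypothetical helper name:

```python
def budget_status(spend_gbp: float, budget_gbp: float, alert_at: float = 0.80) -> str:
    """Return "ok", "alert" (at or above the alert share), or "stop" (budget used up)."""
    if budget_gbp <= 0:
        raise ValueError("budget_gbp must be positive")
    used = spend_gbp / budget_gbp
    if used >= 1.0:
        return "stop"
    if used >= alert_at:
        return "alert"
    return "ok"


if __name__ == "__main__":
    for spend in (38.40, 40.00, 50.00):
        print(spend, budget_status(spend, 50.0))
```

Wiring the "stop" state to the same kill switch verified before launch keeps the hard stop from depending on someone reading a dashboard.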
Rollback play (one command)
# example CLI
pilot rollback --feature support_whatsapp --reason "qa_regression" --to 2025-09-03T10:00Z
This template is intentionally small. It proves or disproves value in ten days, keeps spend contained, and leaves you with artefacts the team can run without me.