Proving ROI in 10 days: a pilot template for AI in Growth & Ops (metrics, cut-offs, handover)

Proving ROI in 10 days: a pilot template for AI in Growth & Ops (metrics, cut-offs, handover)
Viewing the AI-enhanced version of the article I wrote.
Contact me  for the prompt used to generate the AI formatted version.

You don't need a six-month project to prove AI value. Run a 10-day pilot that is cheap, instrumented, and easy to roll back. This template gives the schedule, metrics, cut-offs, and handover materials you can reuse for Growth & Ops use-cases (support triage, order status, invoice QA, pricing rules, etc.).

The 10-day schedule (repeatable)

Days 0–1 - Scope & baseline

  • Define one narrow user journey (e.g., "Where is my order?", "AP invoice check", "price update on 200 SKUs").
  • Freeze a baseline window (last 14–30 days).
  • Agree the primary KPI and target threshold (example targets below).
  • Set a hard cost ceiling and a canary cohort (e.g., 10–20% of traffic).

Days 2–3 - Wire data & guardrails

  • Connect the minimum data sources (orders, tickets, POs/price lists).
  • Add tracing (inputs, outputs, latency, tokens/costs).
  • Add guardrails (schema validation, escalation path, rate limits).
  • Create an eval set (10–30 realistic cases) to check quality daily.

Days 4–5 - Ship a thin slice

  • Enable the feature for the canary cohort only.
  • Log all decisions with evidence (citations, diffs, audit trail).
  • Start a daily scoreboard (see KPI table template below).

Days 6–7 - Tune or stop

  • Review KPI movement vs baseline; fix obvious misses.
  • If cut-offs are hit (quality, spend, or error thresholds), stop and document why.

Days 8–9 - Document & handover prep

  • Finalise runbooks, dashboards, and rollback steps.
  • Collect stakeholder feedback (support, finance, ops).

Day 10 - Decision

  • If KPIs pass and budget holds, expand to 50–100% with monitoring.
  • Otherwise, roll back and keep the artefacts for the next attempt.

Metrics that prove value

Core financials (pick the ones that fit your pilot)

  • £ Saved / week from invoice variance & duplicates blocked.
  • Tickets deflected % and median first response time.
  • Gross margin % / £ movement after pricing rule changes.
  • Conversion rate and refund rate where applicable.
  • Cost per resolved item (tokens + infra + minutes of human time).
# Simple ROI helpers used in the daily scoreboard def pilot_roi(savings_per_week_gbp, revenue_uplift_gbp, pilot_cost_gbp): net = savings_per_week_gbp + revenue_uplift_gbp - pilot_cost_gbp return 0 if pilot_cost_gbp == 0 else net / pilot_cost_gbp def deflection_rate(total, handled_by_bot): return handled_by_bot / max(total, 1) def margin_delta_pct(gm_after, gm_before): return (gm_after - gm_before) / max(gm_before, 1e-6)

Cut-offs (stop/go rules)

Stop immediately if

  • Output quality fails: eval score or human QA < 0.85 on the chosen metric.
  • Budget exceeded: pilot spend > £X/day or £Y/conversation.
  • Latency p95 > Z seconds for two consecutive days.

Continue/expand if

  • KPI moves by ≥ target (see defaults below) and cost per unit is stable or improving for 3 days.
  • No critical incidents or escalations without audit evidence.

Default KPI targets (sane starting points)

Ops / Support triage

  • Deflection: +15–30% vs baseline.
  • FRT: −30–50% median first response time.
  • QA pass-rate: ≥ 90% of sampled answers acceptable.

AP / Invoice QA

  • Overcharge detection: >= 0.5 to 1.5% of AP value flagged with < 5% false positives.
  • Duplicate invoices caught: >= 80% of known dupes in backtests.

Pricing rules

  • Gross margin: +1–3% on canary SKUs with no conversion drop beyond −5% relative.
  • Rollback time: < 15 minutes from trigger to revert.

Daily scoreboard (CSV you can paste into Sheets)

date,kpi_name,baseline_value,pilot_value,delta,budget_spend_gbp,unit_cost_gbp,incidents,notes 2025-09-08,deflection_rate,0.32,0.45,0.13,38.40,0.06,0,ok 2025-09-09,median_frt_seconds,720,410,-310,35.10,0.05,0,tuned escalation 2025-09-10,qa_pass_rate,0.86,0.91,0.05,33.00,0.05,0,stable

Handover pack (what the team receives on Day 10)

Documents

  • Decision memo (see template) with KPI results and a clear verdict.
  • Runbook: on-call, escalation, rollback, and weekly checks.
  • Data map: tables, joins, and owners; privacy notes.

Assets

  • Dashboards: KPIs, spend, latency, evals.
  • Tracing: searchable logs with request→answer→cost linkage.
  • Eval set: frozen cases + procedure for updates.
# decision-memo.yml pilot: "Support Triage WhatsApp" owner: "Ops" period: "2025-09-01..2025-09-10" baseline_window: "2025-08-15..2025-08-31" kpis: - name: deflection_rate baseline: 0.32 pilot: 0.46 target: 0.15 absolute increase - name: qa_pass_rate baseline: 0.86 pilot: 0.91 budgets: daily_gbp: 50 per_conv_gbp: 0.07 result: "GO - expand to 50% traffic with same guardrails" risks: - "Peak-time latency near threshold; monitor p95"

Example pilot slices (pick one)

Support: "Where is my order?"

  • Authenticate customer; look up order; answer with ETA, policy, and handoff button.
  • KPIs: deflection %, FRT, cost/conversation, QA pass-rate.

AP: "Invoice QA on top 5 suppliers"

  • Parse PDF; compare to PO/price list; flag variance/duplicates; open ticket.
  • KPIs: £ overcharge detected, false-positive rate, minutes saved/invoice.

Pricing: "Guardrailed price update for 200 SKUs"

  • Apply formula price with caps; canary to 10%; auto-rollback if conversion drops.
  • KPIs: GM% delta, conversion delta, rollback MTTR.

Minimal instrumentation (don't skip)

Capture on every request

  • Route/feature, user/session, prompt template, inputs (redacted), outputs, citations/evidence, latency, tokens, cost, decision taken (auto, escalated, rolled back).

Dashboards to publish

  • Spend vs budget; p50/p95 latency; eval trend; KPI trend vs baseline.

Risk checklist

Before enabling traffic

  • Legal/privacy reviewed the data map and redaction.
  • Escalation path staffed; SLAs agreed.
  • Canary cohort and kill switch verified.

During the pilot

  • Daily QA sample reviewed by the business owner.
  • Incidents labelled and linked to traces.
  • Costs monitored with 80% alert and hard stop at 100%.

Rollback play (one command)

# example CLI pilot rollback --feature support_whatsapp --reason "qa_regression" --to 2025-09-03T10:00Z

This template is intentionally small. It proves or disproves value in ten days, keeps spend contained, and leaves you with artefacts the team can run without me.


Stay up to date

Get notified when I publish something new, and unsubscribe at any time.