Proving ROI in 10 days: a pilot template for AI in Growth & Ops (metrics, cut-offs, handover)

You don't need a six-month project to prove AI value. Run a 10-day pilot that is cheap, instrumented, and easy to roll back. This template gives the schedule, metrics, cut-offs, and handover materials you can reuse for Growth & Ops use-cases (support triage, order status, invoice QA, pricing rules, etc.).

The 10-day schedule (repeatable)

Days 0–1 - Scope & baseline

Define one narrow user journey (e.g., "Where is my order?", "AP invoice check", "price update on 200 SKUs").
Freeze a baseline window (last 14–30 days).
Agree the primary KPI and target threshold (example targets below).
Set a hard cost ceiling and a canary cohort (e.g., 10–20% of traffic).

Days 2–3 - Wire data & guardrails

Connect the minimum data sources (orders, tickets, POs/price lists).
Add tracing (inputs, outputs, latency, tokens/costs).
Add guardrails (schema validation, escalation path, rate limits).
Create an eval set (10–30 realistic cases) to check quality daily.

Days 4–5 - Ship a thin slice

Enable the feature for the canary cohort only.
Log all decisions with evidence (citations, diffs, audit trail).
Start a daily scoreboard (see KPI table template below).

Days 6–7 - Tune or stop

Review KPI movement vs baseline; fix obvious misses.
If cut-offs are hit (quality, spend, or error thresholds), stop and document why.

Days 8–9 - Document & handover prep

Finalise runbooks, dashboards, and rollback steps.
Collect stakeholder feedback (support, finance, ops).

Day 10 - Decision

If KPIs pass and budget holds, expand to 50–100% with monitoring.
Otherwise, roll back and keep the artefacts for the next attempt.

Metrics that prove value

Core financials (pick the ones that fit your pilot)

£ Saved / week from invoice variance & duplicates blocked.
Tickets deflected % and median first response time.
Gross margin % / £ movement after pricing rule changes.
Conversion rate and refund rate where applicable.
Cost per resolved item (tokens + infra + minutes of human time).

# Simple ROI helpers used in the daily scoreboard
def pilot_roi(savings_per_week_gbp, revenue_uplift_gbp, pilot_cost_gbp):
    net = savings_per_week_gbp + revenue_uplift_gbp - pilot_cost_gbp
    return 0 if pilot_cost_gbp == 0 else net / pilot_cost_gbp

def deflection_rate(total, handled_by_bot):
    return handled_by_bot / max(total, 1)

def margin_delta_pct(gm_after, gm_before):
    return (gm_after - gm_before) / max(gm_before, 1e-6)

Cut-offs (stop/go rules)

Stop immediately if

Output quality fails: eval score or human QA < 0.85 on the chosen metric.
Budget exceeded: pilot spend > £X/day or £Y/conversation.
Latency p95 > Z seconds for two consecutive days.

Continue/expand if

KPI moves by ≥ target (see defaults below) and cost per unit is stable or improving for 3 days.
No critical incidents or escalations without audit evidence.

Default KPI targets (sane starting points)

Ops / Support triage

Deflection: +15–30% vs baseline.
FRT: −30–50% median first response time.
QA pass-rate: ≥ 90% of sampled answers acceptable.

AP / Invoice QA

Overcharge detection: >= 0.5 to 1.5% of AP value flagged with < 5% false positives.
Duplicate invoices caught: >= 80% of known dupes in backtests.

Pricing rules

Gross margin: +1–3% on canary SKUs with no conversion drop beyond −5% relative.
Rollback time: < 15 minutes from trigger to revert.

Daily scoreboard (CSV you can paste into Sheets)

date,kpi_name,baseline_value,pilot_value,delta,budget_spend_gbp,unit_cost_gbp,incidents,notes
2025-09-08,deflection_rate,0.32,0.45,0.13,38.40,0.06,0,ok
2025-09-09,median_frt_seconds,720,410,-310,35.10,0.05,0,tuned escalation
2025-09-10,qa_pass_rate,0.86,0.91,0.05,33.00,0.05,0,stable

Handover pack (what the team receives on Day 10)

Documents

Decision memo (see template) with KPI results and a clear verdict.
Runbook: on-call, escalation, rollback, and weekly checks.
Data map: tables, joins, and owners; privacy notes.

Assets

Dashboards: KPIs, spend, latency, evals.
Tracing: searchable logs with request→answer→cost linkage.
Eval set: frozen cases + procedure for updates.

# decision-memo.yml
pilot: "Support Triage WhatsApp"
owner: "Ops"
period: "2025-09-01..2025-09-10"
baseline_window: "2025-08-15..2025-08-31"
kpis:
  - name: deflection_rate
    baseline: 0.32
    pilot: 0.46
    target: 0.15 absolute increase
  - name: qa_pass_rate
    baseline: 0.86
    pilot: 0.91
budgets:
  daily_gbp: 50
  per_conv_gbp: 0.07
result: "GO - expand to 50% traffic with same guardrails"
risks:
  - "Peak-time latency near threshold; monitor p95"

Example pilot slices (pick one)

Support: "Where is my order?"

Authenticate customer; look up order; answer with ETA, policy, and handoff button.
KPIs: deflection %, FRT, cost/conversation, QA pass-rate.

AP: "Invoice QA on top 5 suppliers"

Parse PDF; compare to PO/price list; flag variance/duplicates; open ticket.
KPIs: £ overcharge detected, false-positive rate, minutes saved/invoice.

Pricing: "Guardrailed price update for 200 SKUs"

Apply formula price with caps; canary to 10%; auto-rollback if conversion drops.
KPIs: GM% delta, conversion delta, rollback MTTR.

Minimal instrumentation (don't skip)

Capture on every request

Route/feature, user/session, prompt template, inputs (redacted), outputs, citations/evidence, latency, tokens, cost, decision taken (auto, escalated, rolled back).

Dashboards to publish

Spend vs budget; p50/p95 latency; eval trend; KPI trend vs baseline.

Risk checklist

Before enabling traffic

Legal/privacy reviewed the data map and redaction.
Escalation path staffed; SLAs agreed.
Canary cohort and kill switch verified.

During the pilot

Daily QA sample reviewed by the business owner.
Incidents labelled and linked to traces.
Costs monitored with 80% alert and hard stop at 100%.

Rollback play (one command)

# example CLI
pilot rollback --feature support_whatsapp --reason "qa_regression" --to 2025-09-03T10:00Z

This template is intentionally small. It proves or disproves value in ten days, keeps spend contained, and leaves you with artefacts the team can run without me.