The standard measurement sequence in AI deployment goes like this. A tool is deployed. After three to six months, the team is asked whether it has made a difference. The team reports that things feel faster or easier. Leadership records this as a positive outcome. The deployment is deemed successful.
This sequence produces confident-sounding evidence that is almost entirely unverifiable. "Things feel faster" is a comparison against a recalled state that nobody measured. The pre-AI condition exists only in memory, and memory is optimistic about change. People who wanted the tool to succeed tend to report that it succeeded.
The consequence is that organisations accumulate AI deployments with reported success rates that bear no relationship to actual impact. Tools that are making genuine differences produce similar reports to tools that are consuming budget and changing very little. Leadership cannot distinguish between them because the measurement framework treats both the same way.
When boards or investors ask for evidence of AI value, the organisation reaches for its collection of positive anecdotes. These are credible to people who want to believe them and immediately suspicious to people who are sceptical. Neither group is getting accurate information.
A baseline is a documented measurement of current performance, taken before the AI deployment begins, against the specific dimensions that the deployment is intended to improve. It creates the reference point that makes post-deployment measurement meaningful.
The baseline does not need to be complex. It needs to be specific and it needs to be captured before the deployment changes the conditions it is measuring. Four dimensions cover the majority of what AI deployments are intended to affect:
- Time: how long does the workflow currently take, end to end? Include handoffs, review stages and wait times, not just active working time.
- Error rate: how frequently does the current workflow produce errors, rework or corrections? Measure this over at least four weeks to account for variation.
- Output volume: how much does the workflow currently produce per person per week? This matters most for content creation, analysis and reporting workflows.
- Decision quality: for decision-support workflows, what is the current rate of decisions that are subsequently reversed, escalated or identified as poor? This is harder to measure, but it is the most important dimension for strategic AI deployments.
The baseline should take no more than two weeks to establish for most workflows. It requires the team running the workflow to track what they are already doing more deliberately than usual for a defined period. The investment is small relative to the value of having credible comparison data after deployment.
The measurement window after deployment should match the baseline period. If the baseline covers four weeks of error rate data, the post-deployment measurement should also cover four weeks before a comparison is made.
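To make the comparison concrete, here is a minimal sketch of how a team might record the four dimensions for a measurement window and compare a baseline against a post-deployment window of matching length. The structure, field names and figures are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class MeasurementWindow:
    """One measurement period for a single workflow (field names are illustrative)."""
    weeks: int                     # window length; baseline and post-deployment must match
    cycle_time_days: float         # end-to-end time, including handoffs, review and wait
    error_rate: float              # errors or rework per item produced over the window
    items_per_person_week: float   # output volume
    reversed_decision_rate: float  # decisions later reversed, escalated or flagged as poor

def compare(baseline: MeasurementWindow, post: MeasurementWindow) -> dict:
    """Return percentage change per dimension; refuse mismatched window lengths."""
    if baseline.weeks != post.weeks:
        raise ValueError("Post-deployment window must match the baseline period")

    def pct_change(before: float, after: float) -> float:
        return round(100 * (after - before) / before, 1)

    return {
        "cycle_time_days": pct_change(baseline.cycle_time_days, post.cycle_time_days),
        "error_rate": pct_change(baseline.error_rate, post.error_rate),
        "items_per_person_week": pct_change(
            baseline.items_per_person_week, post.items_per_person_week),
        "reversed_decision_rate": pct_change(
            baseline.reversed_decision_rate, post.reversed_decision_rate),
    }

# Hypothetical figures: a four-week baseline against a four-week post-deployment window.
baseline = MeasurementWindow(4, cycle_time_days=5.0, error_rate=0.08,
                             items_per_person_week=12, reversed_decision_rate=0.05)
post = MeasurementWindow(4, cycle_time_days=2.0, error_rate=0.06,
                         items_per_person_week=15, reversed_decision_rate=0.04)
print(compare(baseline, post))  # e.g. {'cycle_time_days': -60.0, 'error_rate': -25.0, ...}
```

The point of the length check is the same as the point of the matched windows: a two-week post-deployment sample compared against a four-week baseline is not a like-for-like comparison.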
The moment a deployment begins, the baseline window closes. AI changes the behaviour of the people using it immediately. They work differently. They spend time differently. They make different decisions about what to check and what to skip. Capturing a baseline during deployment is capturing a hybrid state that reflects neither the pre-AI condition nor the AI-enabled condition accurately.
For the next AI deployment your organisation is planning, build baseline capture into the pre-deployment process as a standard step. The sequence is as follows (a sketch of the resulting baseline record appears after the list):
- Define the dimensions the deployment is intended to affect. Be specific. Not "improve efficiency" but "reduce review cycle time from five days to two."
- Identify how each dimension is currently measured or can be measured over a defined period.
- Capture baseline data for three to four weeks before deployment begins.
- Document the baseline formally: who measured it, what period it covers, what methodology was used.
- Set a post-deployment measurement window and a review date before the deployment goes live.
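One way to make the documentation step concrete is to keep the baseline as a small structured record rather than a paragraph in a status update. The sketch below is one possible shape for that record, covering the five steps above; every field name, workflow and value is a placeholder assumption.

```python
from datetime import date

# A minimal baseline record covering the five steps above (illustrative fields only).
baseline_record = {
    "workflow": "contract review",  # hypothetical workflow
    # Step 1: specific targets, not "improve efficiency".
    "dimensions": {
        "review_cycle_time_days": {"baseline": 5.0, "target": 2.0},
        "error_rate": {"baseline": 0.08, "target": 0.05},
    },
    # Steps 2-4: how, when and by whom the baseline was measured.
    "methodology": "manual tracking sheet, every item logged by the reviewing team",
    "baseline_period": {"start": date(2024, 3, 4), "end": date(2024, 3, 29)},
    "measured_by": "workflow lead",
    # Step 5: fixed before go-live, not decided afterwards.
    "post_deployment_window_weeks": 4,
    "review_date": date(2024, 6, 28),
}
```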
For deployments already live without a baseline, a retrospective baseline is still possible in some cases. If the workflow produces documented outputs with timestamps (reports, decisions, processed items), historical data may provide a usable reference point. This is imperfect but substantially better than reporting against memory.
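Where timestamped records do exist, a rough retrospective baseline can often be assembled from a simple export. A minimal sketch, assuming a hypothetical CSV of historical items with submitted and completed timestamps (the file name and column names are assumptions; substitute whatever the workflow's own systems can produce):

```python
import csv
from datetime import datetime

# Rebuild a rough baseline from historical, timestamped outputs.
rows = []
with open("completed_items.csv", newline="") as f:
    for row in csv.DictReader(f):
        rows.append((datetime.fromisoformat(row["submitted_at"]),
                     datetime.fromisoformat(row["completed_at"])))

cycle_times = [(done - start).total_seconds() / 86400 for start, done in rows]
span_days = (max(done for _, done in rows) - min(start for start, _ in rows)).days

print(f"Items covered:       {len(rows)}")
print(f"Period covered:      {span_days} days")
print(f"Average cycle time:  {sum(cycle_times) / len(cycle_times):.1f} days")
print(f"Throughput per week: {len(rows) / (span_days / 7):.1f} items")
```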
The organisations that can demonstrate AI value credibly are the ones that made the measurement decision before the deployment decision. That sequence takes discipline to maintain when deployment pressure is high. It is also the only way to know whether the deployments are working.