Problem
The assistant could answer generally, but there was no objective way to know when it was wrong.
Case study
`Helpy` is a fictional SaaS support assistant used to demonstrate how an AI eval workflow can establish a baseline, catch failure modes, compare versions, and produce a practical recommendation.
Results
| Metric | v1 | v2 |
| --- | --- | --- |
| Faithfulness | 0.07 | 0.88 |
| Answer relevancy | 0.08 | 0.73 |
| Context precision | 0.00 | 0.95 |
A fixed question set and a small set of eval metrics were used to compare the baseline (v1) against a retrieval-backed version (v2).
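A minimal sketch of how such a harness could look. The `ask()` callable, the `EvalCase`/`Response` types, and the three scoring functions are assumptions for illustration; in practice an eval library's metric implementations would replace the placeholders.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Hypothetical types -- stand-ins for the real assistant interface.
@dataclass
class EvalCase:
    question: str
    reference_answer: str

@dataclass
class Response:
    answer: str
    retrieved_contexts: list[str]  # the non-retrieval baseline may return []

# Placeholder metric functions; a real workflow would use an eval
# library's implementations rather than these stubs.
def score_faithfulness(answer: str, contexts: list[str]) -> float:
    """Fraction of claims in the answer supported by the retrieved contexts."""
    raise NotImplementedError

def score_answer_relevancy(answer: str, question: str) -> float:
    """How directly the answer addresses the question."""
    raise NotImplementedError

def score_context_precision(contexts: list[str], reference: str) -> float:
    """How many retrieved chunks are actually relevant to the reference answer."""
    raise NotImplementedError

def run_eval(ask: Callable[[str], Response], cases: list[EvalCase]) -> dict[str, float]:
    """Run the fixed question set through one assistant version and
    return the mean score per metric."""
    per_metric: dict[str, list[float]] = {
        "faithfulness": [], "answer_relevancy": [], "context_precision": []
    }
    for case in cases:
        resp = ask(case.question)
        per_metric["faithfulness"].append(
            score_faithfulness(resp.answer, resp.retrieved_contexts))
        per_metric["answer_relevancy"].append(
            score_answer_relevancy(resp.answer, case.question))
        per_metric["context_precision"].append(
            score_context_precision(resp.retrieved_contexts, case.reference_answer))
    return {name: mean(scores) for name, scores in per_metric.items()}

# Usage: run the same cases against both versions, then compare the dicts.
# v1_scores = run_eval(helpy_v1.ask, cases)
# v2_scores = run_eval(helpy_v2.ask, cases)
```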
This gives the team a clear view of what improved, what still fails, and what should block release.
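One way to turn the comparison into a release call is a simple threshold gate. The thresholds below are illustrative assumptions, not the case study's actual criteria; the example reuses the v1/v2 numbers from the table above.

```python
# Illustrative release gate: block if any metric regresses against the
# baseline or falls below a minimum bar. Thresholds are assumptions.
MIN_SCORES = {"faithfulness": 0.8, "answer_relevancy": 0.7, "context_precision": 0.8}

def release_decision(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Return a list of blocking reasons; an empty list means the candidate can ship."""
    blockers = []
    for metric, minimum in MIN_SCORES.items():
        if candidate[metric] < minimum:
            blockers.append(f"{metric} {candidate[metric]:.2f} below minimum {minimum:.2f}")
        if candidate[metric] < baseline[metric]:
            blockers.append(f"{metric} regressed vs baseline")
    return blockers

# Scores from the results table above.
v1 = {"faithfulness": 0.07, "answer_relevancy": 0.08, "context_precision": 0.00}
v2 = {"faithfulness": 0.88, "answer_relevancy": 0.73, "context_precision": 0.95}
print(release_decision(v1, v2))  # [] -- no blockers, v2 clears the bar
```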