Evaluate LLM and agent applications on your own data — context completeness, answer correctness, retrieval metrics, and memory benchmarks (LoCoMo, LongMemEval).
Evaluating an LLM application means measuring whether it produces correct, grounded, and useful outputs on your data — not whether the underlying model scores well on generic leaderboards. A good evaluation framework defines what “good” means for your task, builds a representative dataset, scores outputs with the right mix of automated and human methods, and runs continuously as prompts, models, and data change. This guide gives you that framework, then extends it to the hardest case: evaluating agent memory.
| Layer | What you measure | Methods |
|---|---|---|
| Output quality | Accuracy, relevance, faithfulness, format | Reference-based metrics, LLM-as-judge, human review |
| Retrieval / grounding | Did the system fetch the right context? | Context precision/recall, completeness |
| Agent memory | Does the agent recall the right facts over time? | LoCoMo, LongMemEval, custom multi-session sets |
| System | Latency, token cost, reliability | Tracing, load tests, regression suites |
Before metrics, define the task and its failure modes. A support agent's failures differ from a coding assistant's. Write down the dimensions that matter — typically faithfulness (is the answer supported by the provided context?), relevance (does it address the question?), correctness (does it match ground truth?), and format/safety (does it follow constraints?). These dimensions become your scoring rubric.
Evaluation is only as good as the data it runs on. Assemble a set that mirrors real usage: actual queries (anonymized), edge cases, adversarial inputs, and — critically for agents — multi-turn, multi-session scenarios where facts change. Include “should abstain” cases where the right answer is “I don't know.” Keep a frozen holdout so results are comparable over time.
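One way to keep the holdout frozen is to assign cases by hashing a stable ID rather than by random sampling, so that adding new cases never reshuffles old ones. The sketch below is illustrative only: `EvalCase`, `is_holdout`, the field names, and the 20% split are assumptions, not part of any particular framework.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    question: str
    golden_answer: str
    should_abstain: bool = False  # True when the right answer is "I don't know"


def is_holdout(case_id: str, holdout_percent: int = 20) -> bool:
    """Assign a case to the holdout by hashing its ID, so the split is stable."""
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < holdout_percent


# Made-up example cases, including one "should abstain" case.
cases = [
    EvalCase("c1", "Which plan is the customer on?", "Pro"),
    EvalCase("c2", "What is the customer's shoe size?", "I don't know",
             should_abstain=True),
]
holdout = [c for c in cases if is_holdout(c.case_id)]
dev = [c for c in cases if not is_holdout(c.case_id)]
```

Because the assignment depends only on `case_id`, a case stays on the same side of the split for as long as its ID is unchanged.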
Each test case runs the same four steps: retrieve context, judge whether it's sufficient, generate an answer, then grade the answer against a golden reference.

    for case in test_cases:                              # each: a question + golden_answer
        context = retrieve(case.question)                # your retrieval / memory layer
        completeness = judge_context(                    # COMPLETE / PARTIAL / INSUFFICIENT
            case.question, context, case.golden_answer)
        answer = generate(case.question, context)        # the model under test
        correct = grade(answer, case.golden_answer)      # CORRECT / WRONG (LLM-as-judge)
        record(case, completeness, correct, context, answer)
    report(completeness_rate, accuracy, by_case=True)

Score completeness first: if the right context wasn't retrieved, the answer was never going to be right, and that's a retrieval fix, not a prompt fix. Zep's evaluation harness runs exactly this loop on your data; see Evaluate Zep for your use case.
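For readers who want to run the loop end to end, here is a minimal self-contained sketch. Every function in it is a hypothetical stand-in: `retrieve` reads from a toy dictionary, and `judge_context` and `grade` use substring matching where a real harness would call an LLM judge. It is not Zep's harness, only an illustration of the four steps and the two summary rates.

```python
from dataclasses import dataclass


@dataclass
class Case:
    question: str
    golden_answer: str


# Toy knowledge base standing in for a retrieval / memory layer (assumption).
KNOWLEDGE = {"Which plan is Ana on?": "Ana upgraded to the Pro plan in March."}


def retrieve(question: str) -> str:
    return KNOWLEDGE.get(question, "")


def judge_context(question: str, context: str, golden: str) -> str:
    # Toy rule in place of an LLM judge: COMPLETE if the golden answer appears.
    if not context:
        return "INSUFFICIENT"
    return "COMPLETE" if golden.lower() in context.lower() else "PARTIAL"


def generate(question: str, context: str) -> str:
    # Stand-in for the model under test: echo the context or abstain.
    return context or "I don't know"


def grade(answer: str, golden: str) -> str:
    # Toy rule in place of an LLM judge.
    return "CORRECT" if golden.lower() in answer.lower() else "WRONG"


def run_eval(test_cases):
    records = []
    for case in test_cases:
        context = retrieve(case.question)
        completeness = judge_context(case.question, context, case.golden_answer)
        answer = generate(case.question, context)
        correct = grade(answer, case.golden_answer)
        records.append({"question": case.question, "completeness": completeness,
                        "correct": correct, "context": context, "answer": answer})
    n = max(len(records), 1)  # avoid dividing by zero on an empty suite
    completeness_rate = sum(r["completeness"] == "COMPLETE" for r in records) / n
    accuracy = sum(r["correct"] == "CORRECT" for r in records) / n
    return completeness_rate, accuracy, records


test_cases = [Case("Which plan is Ana on?", "Pro"),
              Case("Where does Ana live?", "Lisbon")]
completeness_rate, accuracy, records = run_eval(test_cases)
```

Keeping the per-case records next to the aggregate rates makes it possible to see which wrong answers coincide with incomplete context.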
Most “LLM” failures are actually context failures: the right information wasn't retrieved, so the model guessed. Measure retrieval on its own — did the system surface the facts needed to answer? Completeness (did you get all the needed context?), not just precision, is what predicts whether retrieval holds up. If retrieval is the bottleneck, no amount of prompt tuning fixes it.
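If each test case is labeled with the IDs of the facts needed to answer it, retrieval can be scored without generating an answer at all. The sketch below uses simple set-based definitions of precision, recall, and completeness. These definitions and the name `retrieval_metrics` are assumptions for illustration; some evaluation libraries define context precision in a rank-aware way instead.

```python
def retrieval_metrics(retrieved_ids, needed_ids):
    """Set-based retrieval scores for one test case."""
    retrieved, needed = set(retrieved_ids), set(needed_ids)
    hits = retrieved & needed
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(needed) if needed else 1.0
    complete = needed <= retrieved  # every needed fact was surfaced
    return {"precision": precision, "recall": recall, "complete": complete}


# Tidy but incomplete: one needed fact (f3) is missing.
partial = retrieval_metrics(["f1", "f2", "f9"], ["f1", "f2", "f3"])

# Noisy but complete: extra facts lower precision, yet nothing needed is missing.
noisy = retrieval_metrics(["f1", "f2", "f3", "f7", "f8"], ["f1", "f2", "f3"])
```

The second case has lower precision than the first but is the only one of the two that can support a fully correct answer, which is why completeness is tracked separately.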
Agent memory needs its own evaluation because the failure mode is temporal: the agent must recall the right fact, at the right time, across many sessions, as facts change. Single-turn benchmarks miss this entirely. Two industry-standard benchmarks target it: LoCoMo and LongMemEval.
When you evaluate a memory system, measure accuracy together with retrieval latency and token consumption — memory systems often trade one for another, and a system that's accurate but slow or token-hungry won't survive production. As a reference point, Zep's temporal context graph approach reports 94.7% accuracy on LoCoMo (155ms retrieval, 5,760 tokens) and 90.2% on LongMemEval (162ms, 4,408 tokens) — leading on all three dimensions at once. Build a custom multi-session set on your data too: seed facts, change them, and test whether the agent answers “what's true now?” and “what was true then?” correctly.
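For the custom multi-session set, it helps to keep a ground-truth timeline of the facts you seed and change, so the expected answers to "what's true now?" and "what was true then?" are derived rather than hand-written. The class below is a hypothetical helper for building test cases, not a memory system and not any library's API. The names and the integer session numbering are assumptions.

```python
from bisect import bisect_right
from collections import defaultdict


class FactTimeline:
    """Ground-truth record of how seeded facts change across sessions."""

    def __init__(self):
        self._history = defaultdict(list)  # key -> list of (session, value)

    def set_fact(self, key, value, session):
        self._history[key].append((session, value))
        self._history[key].sort(key=lambda item: item[0])  # keep session order

    def value_at(self, key, session):
        """Value in effect at `session`, or None if the fact was not yet stated."""
        history = self._history.get(key, [])
        sessions = [s for s, _ in history]
        idx = bisect_right(sessions, session)
        return history[idx - 1][1] if idx else None

    def current(self, key):
        """Most recent value, or None if the fact was never stated."""
        history = self._history.get(key, [])
        return history[-1][1] if history else None


timeline = FactTimeline()
timeline.set_fact("employer", "Acme", session=1)    # seed the fact
timeline.set_fact("employer", "Globex", session=4)  # change it later

expected_now = timeline.current("employer")               # "what's true now?"
expected_then = timeline.value_at("employer", session=2)  # "what was true then?"
```

Each expected value can then become the golden answer for a test case that is asked after the final session.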
Evaluation isn't a one-time gate. Wire it into the workflow: a frozen regression suite that runs on every prompt/model change, tracing in production to catch drift, and dashboards for quality, latency, and cost. Zep ships an evaluation framework for testing against your own data and zepctl for administering projects, so memory quality is measured continuously rather than assumed.
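A regression suite needs a pass/fail rule to be useful in a workflow. One simple option, sketched here, is to compare each metric against a stored baseline and flag any that got worse by more than a relative tolerance. The metric names, the 2% tolerance, and all numbers are made up for illustration.

```python
# Whether a larger value is an improvement, per metric (assumed metric names).
HIGHER_IS_BETTER = {
    "accuracy": True,
    "completeness_rate": True,
    "p95_latency_ms": False,
    "avg_tokens": False,
}


def find_regressions(baseline, current, tolerance=0.02):
    """Return a description of every metric that worsened by more than `tolerance`."""
    failures = []
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        old, new = baseline[metric], current[metric]
        if old == 0:
            continue  # relative change is undefined for a zero baseline
        # Signed so that a positive change always means "got worse".
        change = (old - new) / old if higher_is_better else (new - old) / old
        if change > tolerance:
            failures.append(f"{metric}: {old} -> {new}")
    return failures


baseline = {"accuracy": 0.90, "completeness_rate": 0.95,
            "p95_latency_ms": 150, "avg_tokens": 5000}
current = {"accuracy": 0.85, "completeness_rate": 0.95,
           "p95_latency_ms": 152, "avg_tokens": 5050}
failures = find_regressions(baseline, current)
```

In a CI job, a non-empty `failures` list would typically fail the build and print the offending metrics.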
Model evals (MMLU, etc.) measure general capability. Application evals measure whether your system — prompts, retrieval, memory, guardrails — produces correct, grounded answers on your data. The latter is what determines production reliability.
Use multi-session benchmarks like LoCoMo and LongMemEval, plus a custom set on your data where facts change over time. Measure accuracy alongside retrieval latency and token cost.
LLM-as-judge scales well and correlates with human judgment when calibrated against human labels and a clear rubric, but validate it periodically and watch for systematic bias.
Measure retrieval first. Context completeness predicts whether the answer can be right at all, so measuring it before answer correctness lets you fix the actual bottleneck instead of tuning prompts around a retrieval gap.
For long, multi-session memory, use LoCoMo and LongMemEval, and always report accuracy alongside retrieval latency and token cost, since memory systems often trade one for another.
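Calibrating an LLM judge against human labels, as discussed above, can be made concrete with a chance-corrected agreement score. The sketch below computes Cohen's kappa over paired labels. The function name and the sample labels are illustrative assumptions, and kappa is only one of several agreement measures you could use.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label lists."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for eight graded answers.
human = ["CORRECT", "CORRECT", "WRONG", "WRONG",
         "CORRECT", "WRONG", "CORRECT", "CORRECT"]
judge = ["CORRECT", "CORRECT", "WRONG", "CORRECT",
         "CORRECT", "WRONG", "CORRECT", "WRONG"]
kappa = cohens_kappa(human, judge)
```

Raw agreement here is 6 of 8, but kappa is lower because two raters who both say "CORRECT" most of the time would agree fairly often by chance alone.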