LLM Evaluation Framework: Test LLM & Agent Apps

Key takeaways

Evaluate the application, not the model: measure correctness, faithfulness, and relevance on your own data and queries.
Most “LLM” failures are retrieval failures — measure context completeness first, then answer correctness, latency, and token cost.
Evaluate agent memory with multi-session benchmarks (LoCoMo, LongMemEval); Zep reports 94.7% accuracy on LoCoMo and 90.2% on LongMemEval (benchmark results).

Evaluating an LLM application means measuring whether it produces correct, grounded, and useful outputs on your data — not whether the underlying model scores well on generic leaderboards. A good evaluation framework defines what “good” means for your task, builds a representative dataset, scores outputs with the right mix of automated and human methods, and runs continuously as prompts, models, and data change. This guide gives you that framework, then extends it to the hardest case: evaluating agent memory.

Layer	What you measure	Methods
Output quality	Accuracy, relevance, faithfulness, format	Reference-based metrics, LLM-as-judge, human review
Retrieval / grounding	Did the system fetch the right context?	Context precision/recall, completeness
Agent memory	Does the agent recall the right facts over time?	LoCoMo, LongMemEval, custom multi-session sets
System	Latency, token cost, reliability	Tracing, load tests, regression suites

Start with what “good” means

Before metrics, define the task and its failure modes. A support agent's failures differ from a coding assistant's. Write down the dimensions that matter — typically faithfulness (is the answer supported by the provided context?), relevance (does it address the question?), correctness (does it match ground truth?), and format/safety (does it follow constraints?). These dimensions become your scoring rubric.

Build a representative dataset

Evaluation is only as good as the data it runs on. Assemble a set that mirrors real usage: actual queries (anonymized), edge cases, adversarial inputs, and — critically for agents — multi-turn, multi-session scenarios where facts change. Include “should abstain” cases where the right answer is “I don't know.” Keep a frozen holdout so results are comparable over time.
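One way to keep the holdout frozen is to derive the split from a stable hash of each case ID, so a case never drifts between sets as the dataset grows. The sketch below is illustrative only: `EvalCase`, its fields, the sample cases, and the 20% holdout share are assumptions, not part of any particular framework.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One evaluation case (hypothetical schema)."""
    case_id: str
    question: str
    golden_answer: str
    should_abstain: bool = False  # True when the right answer is "I don't know"
    tags: tuple = ()


def in_holdout(case_id: str, holdout_pct: int = 20) -> bool:
    """Stable split: hash the ID so the assignment never changes between runs."""
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < holdout_pct


cases = [
    EvalCase("billing-001", "What plan am I on?", "Pro"),
    EvalCase("abstain-001", "What is my manager's birthday?", "I don't know",
             should_abstain=True, tags=("abstain",)),
]

holdout = [c for c in cases if in_holdout(c.case_id)]
dev = [c for c in cases if not in_holdout(c.case_id)]
```

Because the split depends only on the ID, adding new cases later leaves every existing case in the set it started in.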

Choose scoring methods

Reference-based metrics compare output to a known answer (exact match, F1, semantic similarity). Cheap and deterministic; best for tasks with clear ground truth.
LLM-as-judge uses a strong model to score outputs against a rubric (faithfulness, relevance). Scales human-like judgment; calibrate it against human labels and watch for bias.
Human evaluation remains the gold standard for nuanced quality. Use it to validate your automated judges and on high-stakes samples.
Retrieval metrics (context precision, recall, completeness) tell you whether failures are retrieval problems or generation problems — often the most actionable signal.
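For the reference-based metrics above, exact match and token-level F1 can be sketched in a few lines. The normalization used here (lowercasing, stripping punctuation) is one common choice and an assumption on our part; adjust it to your task.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings normalize to the same token sequence."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Exact match is strict and suits short factual answers; token F1 gives partial credit when the answer is right but worded differently.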

A minimal evaluation loop

Each test case runs the same four steps: retrieve context, judge whether it's sufficient, generate an answer, then grade the answer against a golden reference.

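results = []
for case in test_cases:                        # each: a question + golden_answer
    context = retrieve(case.question)          # your retrieval / memory layer
    completeness = judge_context(              # COMPLETE / PARTIAL / INSUFFICIENT
        case.question, context, case.golden_answer)
    answer = generate(case.question, context)  # the model under test
    correct = grade(answer, case.golden_answer)  # CORRECT / WRONG (LLM-as-judge)
    results.append((case, completeness, correct, context, answer))

# Aggregate: share of cases with complete context, and share graded correct
completeness_rate = sum(r[1] == "COMPLETE" for r in results) / len(results)
accuracy = sum(r[2] == "CORRECT" for r in results) / len(results)
report(completeness_rate, accuracy, by_case=True)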

Score completeness first: if the right context wasn't retrieved, the answer was never going to be right, and that calls for a retrieval fix, not a prompt fix. Zep's evaluation harness runs exactly this loop on your data; see Evaluate Zep for your use case.
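To act on that ordering, it can help to cross-tabulate the two labels per case and name the likely cause of each failure. The following is a minimal sketch; the bucket names and the sample records are hypothetical, and the label strings simply mirror the comments in the loop above.

```python
from collections import Counter


def attribute_failure(completeness: str, correct: str) -> str:
    """Map a (completeness, correctness) pair to a likely cause (illustrative buckets)."""
    if correct == "CORRECT":
        # A correct answer without full context may come from the model's own knowledge.
        return "pass" if completeness == "COMPLETE" else "pass_without_full_context"
    # Wrong answer: blame generation only if the needed context was all there.
    return "generation_failure" if completeness == "COMPLETE" else "retrieval_failure"


# Hypothetical per-case labels as produced by the evaluation loop.
records = [
    ("COMPLETE", "CORRECT"),
    ("PARTIAL", "WRONG"),
    ("INSUFFICIENT", "WRONG"),
    ("COMPLETE", "WRONG"),
]

summary = Counter(attribute_failure(c, g) for c, g in records)
```

In this made-up sample, two of the three wrong answers trace back to retrieval, which is where the fix would start.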

Evaluate the retrieval layer separately

Most “LLM” failures are actually context failures: the right information wasn't retrieved, so the model guessed. Measure retrieval on its own: did the system surface the facts needed to answer? Completeness (did you get all the needed context?), not just precision, is what predicts whether the answer can be right at all. If retrieval is the bottleneck, no amount of prompt tuning fixes it.
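If each test case is labeled with the facts needed to answer it, precision, recall, and completeness fall out of simple set arithmetic. This sketch assumes you can assign stable IDs to facts or chunks (the `f1`, `f2` IDs below are invented); in practice that labeling is the expensive part.

```python
def retrieval_scores(retrieved: list[str], needed: set[str]) -> dict:
    """Score one retrieval result against the set of fact IDs the answer requires."""
    retrieved_set = set(retrieved)
    hits = retrieved_set & needed
    return {
        # Share of retrieved items that were actually needed.
        "precision": len(hits) / len(retrieved_set) if retrieved_set else 0.0,
        # Share of needed items that were retrieved.
        "recall": len(hits) / len(needed) if needed else 1.0,
        # Complete only when every needed fact is present.
        "complete": needed <= retrieved_set,
    }


needed = {"f1", "f2", "f3"}
tight_but_partial = retrieval_scores(["f1", "f2"], needed)
noisy_but_complete = retrieval_scores(["f1", "f2", "f3", "f7", "f8", "f9"], needed)
```

The first result has perfect precision but misses a required fact; the second has only 50% precision yet contains everything the answer needs, which is why precision alone can mislead.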

Evaluating agent memory (the hard part)

Agent memory needs its own evaluation because the failure mode is temporal: the agent must recall the right fact, at the right time, across many sessions, as facts change. Single-turn benchmarks miss this entirely. Two industry-standard benchmarks target it:

LoCoMo — long conversational memory across very long, multi-session dialogues.
LongMemEval — long-term memory recall and reasoning over extended histories.

Figure: LoCoMo scores (bar chart, 0% to 100% scale)

System	LoCoMo score
Zep	75.14%
gpt-4o-mini baseline	72.9%
Mem0 (Graph)	68.44%
Mem0 (Base)	66.88%

When you evaluate a memory system, measure accuracy together with retrieval latency and token consumption: memory systems often trade one for another, and a system that's accurate but slow or token-hungry won't survive production. As a reference point, Zep reports that its temporal context graph approach reaches 94.7% accuracy on LoCoMo (155 ms retrieval, 5,760 tokens) and 90.2% on LongMemEval (162 ms retrieval, 4,408 tokens), which it presents as leading on accuracy, latency, and token use at once. Build a custom multi-session set on your data too: seed facts, change them, and test whether the agent answers “what's true now?” and “what was true then?” correctly.
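A custom temporal test can be quite small. The sketch below seeds a fact, changes it in a later session, and then asks both the "now" and "then" questions. `ToyMemory` is a hypothetical stand-in for whatever memory layer you are testing (it is not Zep's API), and the dates and values are made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Fact:
    key: str
    value: str
    valid_from: date


class ToyMemory:
    """Stand-in for the memory system under test (hypothetical)."""

    def __init__(self):
        self.facts: list[Fact] = []

    def add(self, key: str, value: str, valid_from: date) -> None:
        self.facts.append(Fact(key, value, valid_from))

    def lookup(self, key: str, as_of: date):
        """Return the value that was valid on `as_of`, or None to abstain."""
        candidates = [f for f in self.facts if f.key == key and f.valid_from <= as_of]
        if not candidates:
            return None
        return max(candidates, key=lambda f: f.valid_from).value


memory = ToyMemory()
memory.add("plan", "Starter", date(2024, 1, 10))  # early session: seed the fact
memory.add("plan", "Pro", date(2024, 6, 2))       # later session: the fact changes

temporal_cases = [
    ("plan", date(2024, 3, 1), "Starter"),  # what was true then?
    ("plan", date(2024, 9, 1), "Pro"),      # what's true now?
    ("plan", date(2023, 12, 1), None),      # before any fact exists: should abstain
]

results = [memory.lookup(k, d) == expected for k, d, expected in temporal_cases]
accuracy = sum(results) / len(results)
```

Swapping `ToyMemory` for a thin adapter around your real memory layer keeps the cases unchanged while testing the actual system.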

Operationalize it

Evaluation isn't a one-time gate. Wire it into the workflow: a frozen regression suite that runs on every prompt/model change, tracing in production to catch drift, and dashboards for quality, latency, and cost. Zep ships an evaluation framework for testing against your own data and zepctl for administering projects, so memory quality is measured continuously rather than assumed.
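A regression suite needs a pass/fail rule. One simple option, sketched here under assumed metric names and tolerances, is to compare each run against a stored baseline and fail the change when any metric degrades by more than an allowed margin.

```python
def regression_gate(baseline: dict, current: dict, tolerances: dict) -> list[str]:
    """Return readable failures; an empty list means no metric regressed too far."""
    failures = []
    for metric, (direction, tolerance) in tolerances.items():
        delta = current[metric] - baseline[metric]
        # Express the change as "how much worse", whatever the metric's direction.
        worse = -delta if direction == "higher_is_better" else delta
        if worse > tolerance:
            failures.append(f"{metric}: {baseline[metric]} -> {current[metric]}")
    return failures


# Hypothetical numbers for illustration only.
baseline = {"accuracy": 0.90, "completeness_rate": 0.93, "p95_latency_ms": 180.0}
current = {"accuracy": 0.86, "completeness_rate": 0.94, "p95_latency_ms": 175.0}
tolerances = {
    "accuracy": ("higher_is_better", 0.02),
    "completeness_rate": ("higher_is_better", 0.02),
    "p95_latency_ms": ("lower_is_better", 25.0),
}

failures = regression_gate(baseline, current, tolerances)
```

Here only accuracy fails the gate: it dropped by four points against a two-point tolerance, while completeness and latency both improved.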

Frequently asked questions

What's the difference between evaluating a model and evaluating an application?

Model evals (MMLU, etc.) measure general capability. Application evals measure whether your system — prompts, retrieval, memory, guardrails — produces correct, grounded answers on your data. The latter is what determines production reliability.

How do I evaluate agent memory specifically?

Use multi-session benchmarks like LoCoMo and LongMemEval, plus a custom set on your data where facts change over time. Measure accuracy alongside retrieval latency and token cost.
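As a rough sketch of recording latency and token cost next to accuracy: time each retrieval call and count the context it returns. `fake_retrieve` is a hypothetical placeholder for your memory layer, and whitespace word counts are only a crude proxy for tokens; use your model's tokenizer for real numbers.

```python
import statistics
import time


def measure(retrieve, questions: list[str]) -> dict:
    """Time each retrieval and count the size of the returned context."""
    latencies_ms, context_tokens = [], []
    for question in questions:
        start = time.perf_counter()
        context = retrieve(question)
        latencies_ms.append((time.perf_counter() - start) * 1000)
        context_tokens.append(len(context.split()))  # crude proxy for tokens
    return {
        "p50_ms": statistics.median(latencies_ms),
        "max_ms": max(latencies_ms),
        "mean_tokens": statistics.mean(context_tokens),
    }


def fake_retrieve(question: str) -> str:
    """Placeholder retrieval function (hypothetical)."""
    return "User is on the Pro plan since June"


stats = measure(fake_retrieve, ["What plan am I on?", "When did I upgrade?"])
```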

Is LLM-as-judge reliable?

It scales well and correlates with human judgment when calibrated against human labels and a clear rubric — but validate it periodically and watch for systematic bias.
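One way to put a number on that calibration is Cohen's kappa, which measures agreement between the judge and human labels after correcting for chance. The sketch below is a plain-Python version with invented labels; what counts as an acceptable kappa depends on your task and is not specified here.

```python
from collections import Counter


def cohens_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq, human_freq = Counter(judge_labels), Counter(human_labels)
    # Agreement expected if both raters labeled at random with their own label rates.
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels on the same six answers.
judge = ["CORRECT", "CORRECT", "WRONG", "CORRECT", "WRONG", "CORRECT"]
human = ["CORRECT", "WRONG", "WRONG", "CORRECT", "WRONG", "CORRECT"]

kappa = cohens_kappa(judge, human)
```

Raw agreement in this sample is 5 of 6, but kappa is lower because half of that agreement would be expected by chance alone. Tracking kappa over time is one way to notice a judge drifting from human judgment.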

What should I measure first — retrieval or generation?

Retrieval. Context completeness predicts whether the answer can be right at all, so measure it before answer correctness and you'll fix the actual bottleneck instead of tuning prompts around a retrieval gap.

Which benchmarks should I use for agent memory?

LoCoMo and LongMemEval for long, multi-session memory — and always report accuracy alongside retrieval latency and token cost, since memory systems often trade one for another.