MemPalace Benchmark Analysis

An honest breakdown of every published score — what's credible, what's been questioned, and how MemPalace stacks up against every competitor.

  • 96.6% — LongMemEval R@5, raw mode
  • 100% — LongMemEval R@5, hybrid mode
  • 500/500 — questions tested
  • $0 — API cost, raw mode

What Is LongMemEval?

LongMemEval is the standard academic benchmark for AI memory systems, introduced in the paper "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory" (arXiv 2410.10813). It presents a system with a long conversation history, then asks questions that require retrieving specific facts from that history.

R@5 (Recall at 5) means: given a question, does the correct answer appear in the top 5 retrieved results? MemPalace raw mode scores 96.6% across 500 test questions — the highest published result for any system that requires no API key and no external service.
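As a rough illustration (this is not the project's benchmark code — that lives in benchmarks/longmemeval_bench.py — and all names and data here are made up), R@5 over a question set can be computed like this:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of questions whose correct evidence appears in the top-k results."""
    hits = sum(
        1 for retrieved, relevant in zip(retrieved_ids, relevant_ids)
        if set(retrieved[:k]) & set(relevant)  # any relevant doc in the top k?
    )
    return hits / len(relevant_ids)

# Toy example: 2 of 3 questions have a relevant doc in their top 5.
retrieved = [["a", "b", "c", "d", "e"],
             ["x", "y", "z", "q", "r"],
             ["m", "n", "o", "p", "s"]]
relevant = [["c"], ["w"], ["p"]]
print(round(recall_at_k(retrieved, relevant), 3))  # 0.667
```

A 96.6% R@5 therefore means that for 483 of the 500 questions, the needed evidence surfaced among the top 5 results.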

✅ What's Credible

  • 96.6% raw R@5 is independently reproducible. @gizmax reproduced it on an M2 Ultra in under 5 minutes. Benchmark runner is in benchmarks/longmemeval_bench.py.
  • Verbatim storage is architecturally sound — you cannot lose information you never discard. The approach is valid.
  • Held-out score of 98.4% on unseen questions demonstrates genuine generalization, not just benchmark overfitting.
  • Team openly disclosed the 3 targeted fixes that took the score from 99.4% to 100%, and that the rerank pipeline was not yet in the public benchmark scripts.
  • The +34-point retrieval improvement from wing+room metadata filtering vs unfiltered search (60.9% → 94.8% R@10) is real and documented across 22,000+ memories.
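The wing+room filtering behind that last point can be sketched in plain Python. The record fields and helper below are hypothetical, but the principle is the same as ChromaDB's `where` metadata filter: shrink the candidate pool with cheap metadata checks before vector scoring.

```python
# Hypothetical memory records; "wing" and "room" stand in for
# MemPalace's spatial metadata.
memories = [
    {"id": 1, "text": "bought oat milk", "wing": "kitchen", "room": "pantry"},
    {"id": 2, "text": "project deadline Friday", "wing": "office", "room": "desk"},
    {"id": 3, "text": "oat milk expires soon", "wing": "kitchen", "room": "fridge"},
]

def filtered_candidates(memories, wing=None, room=None):
    """Narrow the candidate pool by metadata before similarity scoring —
    the same idea as passing a `where` clause to a ChromaDB query."""
    return [
        m for m in memories
        if (wing is None or m["wing"] == wing)
        and (room is None or m["room"] == room)
    ]

print([m["id"] for m in filtered_candidates(memories, wing="kitchen")])  # [1, 3]
```

Filtering helps because irrelevant-but-similar memories from other wings never compete for the top-k slots.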

⚠️ What's Been Questioned

  • 100% hybrid score uses Haiku LLM reranking — it's not purely local. The 96.6% raw score is the fairer headline number.
  • AAAK compression regresses accuracy — 84.2% R@5 vs 96.6% in raw mode. A 12.4-point regression. The 96.6% headline uses raw verbatim storage, not AAAK.
  • LoCoMo test used top_k=50, which may exceed the candidate pool size. The 60.3% LoCoMo score is less certain than the LongMemEval score.
  • "+34% palace boost" compares metadata-filtered search vs unfiltered search — metadata filtering is a standard ChromaDB feature, not a novel mechanism.
  • Contradiction detection (fact_checker.py) exists but is not yet wired into knowledge graph operations. Being fixed in Issue #27.
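The raw-vs-hybrid distinction in the first point above is a standard two-stage pipeline: a local retriever supplies candidates, then a cloud LLM reranks them. A minimal sketch, with a word-overlap stub standing in for the Haiku call (the function names and data are illustrative, not MemPalace's actual API):

```python
def hybrid_retrieve(query, candidates, rerank_fn, top_k=5):
    """Stage 1 is assumed to have produced `candidates` locally;
    stage 2 reorders them with `rerank_fn` — in hybrid mode this
    would be an LLM call, which is why it costs ~1 API call per question."""
    scored = sorted(candidates, key=lambda c: rerank_fn(query, c), reverse=True)
    return scored[:top_k]

def overlap_score(query, text):
    """Stub reranker: shared-word count instead of a real LLM judgment."""
    return len(set(query.lower().split()) & set(text.lower().split()))

docs = ["the cat sat on the mat", "stock prices fell", "a cat chased the mouse"]
print(hybrid_retrieve("where is the cat", docs, overlap_score, top_k=2))
```

This structure is why the ~500 API calls in the hybrid row below scale with the question count: one rerank pass per question.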

All Published Benchmark Scores

| Benchmark | Mode | Score | API Calls | Status |
|---|---|---|---|---|
| LongMemEval R@5 | Raw (ChromaDB, zero API) | 96.6% | Zero | ✅ Independently reproduced |
| LongMemEval R@5 | Hybrid + Haiku rerank | 100% (500/500) | ~500 | ✅ Real · uses cloud LLM |
| LongMemEval R@5 | AAAK compression mode | 84.2% | Zero | ⚠️ Regresses vs raw mode |
| LongMemEval held-out | Raw, unseen questions | 98.4% | Zero | ✅ Shows generalization |
| LoCoMo R@10 | Raw, session level | 60.3% | Zero | ⚠️ top_k=50 methodology debated |
| Personal palace R@10 | Heuristic bench (internal) | 85% | Zero | 📊 Internal benchmark |
| Unfiltered search | All closets | 60.9% R@10 | Zero | 📊 Baseline for +34% claim |
| Wing+room filtered search | Metadata filtering | 94.8% R@10 | Zero | ✅ +34% over unfiltered |

MemPalace vs All Competitors

| System | LongMemEval R@5 | API Required | Monthly Cost | Local | Open Source |
|---|---|---|---|---|---|
| ⭐ MemPalace (hybrid) | 100% | Optional | Free | Always | MIT |
| ⭐ MemPalace (raw) | 96.6% | None | Free | Always | MIT |
| Supermemory ASMR | ~99% | Yes | Paid | No | No |
| Mastra | 94.87% | Yes (GPT) | API costs | No | Partial |
| Mem0 | ~85% | Yes | $19–249/mo | No | No |
| Zep (Graphiti) | ~85% | Yes | $25/mo+ | Enterprise only | No |

Our verdict: The 96.6% raw score is the most credible and impressive number — it's genuinely the highest published LongMemEval result requiring no API key, no cloud, and no LLM at any stage. The 100% hybrid score is real but comes with legitimate caveats. MemPalace is a real project with real innovations, and the verbatim-storage approach is architecturally correct.