MemPalace Benchmark Analysis

An honest breakdown of every published score — what's credible, what's been questioned, and how MemPalace stacks up against every competitor.

  • 96.6% — LongMemEval R@5, raw mode
  • 100% — LongMemEval R@5, hybrid mode
  • 500/500 — questions tested
  • $0 — API cost, raw mode

What Is LongMemEval?

LongMemEval is the standard academic benchmark for AI memory systems, introduced in the paper "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory" (arXiv 2410.10813). It presents a system with a long conversation history, then asks questions that require retrieving specific facts from that history.

R@5 (Recall at 5) means: given a question, does the correct answer appear in the top 5 retrieved results? MemPalace raw mode scores 96.6% across 500 test questions — the highest published result for any system that requires no API key and no external service.
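As a rough illustration (this is not the project's benchmark code — that lives in benchmarks/longmemeval_bench.py — and all names and data here are made up), R@5 over a question set can be computed like this:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of questions whose correct evidence appears in the top-k results."""
    hits = sum(
        1 for retrieved, relevant in zip(retrieved_ids, relevant_ids)
        if set(retrieved[:k]) & set(relevant)  # any relevant doc in the top k?
    )
    return hits / len(relevant_ids)

# Toy example: 2 of 3 questions have a relevant doc in their top 5.
retrieved = [["a", "b", "c", "d", "e"],
             ["x", "y", "z", "q", "r"],
             ["m", "n", "o", "p", "s"]]
relevant = [["c"], ["w"], ["p"]]
print(round(recall_at_k(retrieved, relevant), 3))  # 0.667
```

A 96.6% R@5 therefore means that for 483 of the 500 questions, the needed evidence surfaced among the top 5 results.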

✅ What's Credible

  • 96.6% raw R@5 is independently reproducible. @gizmax reproduced it on an M2 Ultra in under 5 minutes. Benchmark runner is in benchmarks/longmemeval_bench.py.
  • Verbatim storage is architecturally sound — you cannot lose information you never discard. The approach is valid.
  • Held-out score of 98.4% on unseen questions demonstrates genuine generalization, not just benchmark overfitting.
  • Team openly disclosed the 3 targeted fixes that took the score from 99.4% to 100%, and that the rerank pipeline was not yet in the public benchmark scripts.
  • The +34-point retrieval improvement from wing+room metadata filtering vs unfiltered search (60.9% → 94.8% R@10) is real and documented across 22,000+ memories.
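The wing+room filtering behind that last point can be sketched in plain Python. The record fields and helper below are hypothetical, but the principle is the same as ChromaDB's `where` metadata filter: shrink the candidate pool with cheap metadata checks before vector scoring.

```python
# Hypothetical memory records; "wing" and "room" stand in for
# MemPalace's spatial metadata.
memories = [
    {"id": 1, "text": "bought oat milk", "wing": "kitchen", "room": "pantry"},
    {"id": 2, "text": "project deadline Friday", "wing": "office", "room": "desk"},
    {"id": 3, "text": "oat milk expires soon", "wing": "kitchen", "room": "fridge"},
]

def filtered_candidates(memories, wing=None, room=None):
    """Narrow the candidate pool by metadata before similarity scoring —
    the same idea as passing a `where` clause to a ChromaDB query."""
    return [
        m for m in memories
        if (wing is None or m["wing"] == wing)
        and (room is None or m["room"] == room)
    ]

print([m["id"] for m in filtered_candidates(memories, wing="kitchen")])  # [1, 3]
```

Filtering helps because irrelevant-but-similar memories from other wings never compete for the top-k slots.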

⚠️ What's Been Questioned

  • 100% hybrid score uses Haiku LLM reranking — it's not purely local. The 96.6% raw score is the fairer headline number.
  • AAAK compression regresses accuracy — 84.2% R@5 vs 96.6% in raw mode. A 12.4-point regression. The 96.6% headline uses raw verbatim storage, not AAAK.
  • LoCoMo test used top_k=50, which may exceed the candidate pool size. The 60.3% LoCoMo score is less certain than the LongMemEval score.
  • "+34% palace boost" compares metadata-filtered search vs unfiltered search — metadata filtering is a standard ChromaDB feature, not a novel mechanism.
  • Contradiction detection (fact_checker.py) exists but is not yet wired into knowledge graph operations. Being fixed in Issue #27.
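The raw-vs-hybrid distinction in the first point above is a standard two-stage pipeline: a local retriever supplies candidates, then a cloud LLM reranks them. A minimal sketch, with a word-overlap stub standing in for the Haiku call (the function names and data are illustrative, not MemPalace's actual API):

```python
def hybrid_retrieve(query, candidates, rerank_fn, top_k=5):
    """Stage 1 is assumed to have produced `candidates` locally;
    stage 2 reorders them with `rerank_fn` — in hybrid mode this
    would be an LLM call, which is why it costs ~1 API call per question."""
    scored = sorted(candidates, key=lambda c: rerank_fn(query, c), reverse=True)
    return scored[:top_k]

def overlap_score(query, text):
    """Stub reranker: shared-word count instead of a real LLM judgment."""
    return len(set(query.lower().split()) & set(text.lower().split()))

docs = ["the cat sat on the mat", "stock prices fell", "a cat chased the mouse"]
print(hybrid_retrieve("where is the cat", docs, overlap_score, top_k=2))
```

This structure is why the ~500 API calls in the hybrid row below scale with the question count: one rerank pass per question.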

All Published Benchmark Scores

| Benchmark | Mode | Score | API Calls | Status |
|---|---|---|---|---|
| LongMemEval R@5 | Raw (ChromaDB, zero API) | 96.6% | Zero | ✅ Independently reproduced |
| LongMemEval R@5 | Hybrid + Haiku rerank | 100% (500/500) | ~500 | ✅ Real · uses cloud LLM |
| LongMemEval R@5 | AAAK compression mode | 84.2% | Zero | ⚠️ Regresses vs raw mode |
| LongMemEval held-out | Raw, unseen questions | 98.4% | Zero | ✅ Shows generalization |
| LoCoMo R@10 | Raw, session level | 60.3% | Zero | ⚠️ top_k=50 methodology debated |
| Personal palace R@10 | Heuristic bench (internal) | 85% | Zero | 📊 Internal benchmark |
| Unfiltered search | All closets | 60.9% R@10 | Zero | 📊 Baseline for +34% claim |
| Wing+room filtered search | Metadata filtering | 94.8% R@10 | Zero | ✅ +34% over unfiltered |

MemPalace vs All Competitors

| System | LongMemEval R@5 | API Required | Monthly Cost | Local | Open Source |
|---|---|---|---|---|---|
| ⭐ MemPalace (hybrid) | 100% | Optional | Free | Always | MIT |
| ⭐ MemPalace (raw) | 96.6% | None | Free | Always | MIT |
| Supermemory ASMR | ~99% | Yes | Paid | No | No |
| Mastra | 94.87% | Yes (GPT) | API costs | No | Partial |
| Mem0 | ~85% | Yes | $19–249/mo | No | No |
| Zep (Graphiti) | ~85% | Yes | $25/mo+ | Enterprise only | No |

Our verdict: The 96.6% raw score is the most credible and impressive number — it's genuinely the highest published LongMemEval result requiring no API key, no cloud, and no LLM at any stage. The 100% hybrid score is real but comes with legitimate caveats. MemPalace is a real project with real innovations, and the verbatim-storage approach is architecturally correct.