LongMemEval Explained
LongMemEval is the standard academic benchmark for measuring AI memory system performance. Published in the paper "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory" (arXiv:2410.10813), the benchmark presents the system under test with a pre-loaded conversation history, then asks questions that require retrieving specific facts from that history or reasoning over them.
The R@5 metric (Recall at 5) measures whether the correct answer appears anywhere in the top 5 retrieved results. A score of 96.6% across 500 questions means MemPalace surfaced the correct answer in its top 5 results for 483 of those questions — using only local vector search with no API calls.
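The metric itself is straightforward to compute. A minimal sketch, assuming each question's results arrive as an ordered list of retrieved passages and treating a substring match as a hit (the real benchmark uses an LLM judge for answer matching):

```python
def recall_at_k(all_results, gold_answers, k=5):
    """Fraction of questions whose gold answer appears among the top-k
    retrieved passages. Substring matching is a simplification here."""
    hits = sum(
        any(answer in passage for passage in retrieved[:k])
        for retrieved, answer in zip(all_results, gold_answers)
    )
    return hits / len(gold_answers)
```

At the reported numbers, 483 hits out of 500 questions gives 483 / 500 = 0.966.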
Why Verbatim Storage Wins on Recall Tasks
Competing systems such as Mem0, Zep, and most commercial alternatives use a large language model to process incoming conversations and extract "key facts." The LLM decides what is worth keeping: a decision, a preference, a name. The rest of the conversation is discarded. This produces compact, structured memory that is easy to query but fundamentally lossy.
The problem is specific to what LongMemEval tests. The benchmark frequently asks about reasoning context: not just what was decided, but why; not just which tool was chosen, but what alternatives were rejected and on what grounds. An extraction-based system that saved "team chose Postgres" loses the supporting context: "chosen over SQLite because the expected dataset size will exceed 10GB and we need concurrent write throughput — evaluated on November 3rd after benchmarking both options on a sample workload." That supporting context is exactly what LongMemEval questions probe.
MemPalace takes the opposite approach: store every word verbatim, let semantic vector search find the relevant passage at query time. No information is discarded at ingestion. The retrieval challenge is precision — finding the right passage from a large corpus — which is what the palace structure addresses.
The Retrieval Pipeline
When your AI (or a direct `mempalace search` call) issues a query, the following steps execute in sequence:
- Embedding — the query text is converted into a semantic vector using the same embedding model used during ingestion.
- Metadata pre-filter — if wing and/or room are specified, ChromaDB applies these as hard filters before computing any similarity scores. Only passages matching the metadata enter the candidate pool.
- Approximate nearest-neighbour search — ChromaDB finds the closest vectors in the filtered candidate pool using HNSW (Hierarchical Navigable Small World graphs), a fast approximate search algorithm.
- Top-k selection — the top-k most similar passages are returned as verbatim text. For R@5, k=5.
- Optional reranking — in hybrid mode, a Claude Haiku call re-scores the top-k candidates based on semantic relevance to the original query. This adds ~$0.001/query but improves edge-case accuracy.
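The pre-filter step can be sketched as follows. ChromaDB requires multiple equality clauses to be combined under a `$and` operator; the `wing` and `room` metadata keys follow the palace schema described here, but the collection and values are illustrative:

```python
def build_where(wing=None, room=None):
    """Compose a ChromaDB metadata filter from optional wing/room tags.
    Returns None when no filter applies (search the whole corpus)."""
    clauses = []
    if wing is not None:
        clauses.append({"wing": wing})
    if room is not None:
        clauses.append({"room": room})
    if not clauses:
        return None
    # ChromaDB accepts a single clause directly; multiple clauses must
    # be wrapped in an explicit $and.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

# At query time, against an existing ChromaDB collection:
# collection.query(
#     query_texts=["why did we pick Postgres?"],
#     n_results=5,
#     where=build_where(wing="engineering", room="databases"),
# )
```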
Wing+Room Filtering: Why Structure Produces a +34-Point Gain
The most important architectural contribution to recall accuracy is the palace's metadata structure. Every stored passage carries wing and room tags. When a search applies these tags as pre-filters, the candidate pool shrinks from the full corpus to a topically coherent subset, and vector similarity becomes far more precise.
Across a test corpus of 22,000+ real conversation memories, the measured impact of progressively narrower filtering is striking:
| Search Configuration | R@10 | Gain vs Unfiltered |
|---|---|---|
| Unfiltered — search all stored passages | 60.9% | — |
| Filter by wing only | 73.1% | +12.2 points |
| Filter by wing + hall type | 84.8% | +23.9 points |
| Filter by wing + specific room | 94.8% | +33.9 points |
The gain is not from a novel algorithm — metadata filtering is standard ChromaDB. The gain comes from having a metadata schema (wings and rooms) that is meaningful enough to narrow the candidate pool usefully. A flat storage system with arbitrary tags would not produce these numbers. The palace structure was designed specifically to create this schema.
The 4-Layer Memory Stack
MemPalace does not load its full corpus into context on every session. It uses a tiered approach that minimises token consumption while ensuring the AI always has the critical context it needs:
Layer 0 — Identity (~50 tokens, loaded every session)
The plain-text contents of `~/.mempalace/identity.txt`. This layer defines who the AI is, what project is currently active, and what role the AI plays; it never changes during normal operation.
Layer 1 — Critical Facts (~120 tokens, loaded every session)
A condensed summary of the most important facts across the entire palace: who is on the team, what the active projects are, what key decisions have been made, what preferences are established. This is generated by `mempalace wake-up` and can optionally use AAAK encoding to compress the token count further.
Layer 2 — Room Recall (loaded on demand)
When a topic arises that matches a known room, the relevant closets from that room are loaded. This is triggered automatically by the MCP server when the AI's query matches a room topic.
Layer 3 — Deep Search (loaded on demand)
A full semantic vector search against all stored closets. Triggered when the AI explicitly asks to search for something or when Layer 2 does not surface sufficient context. Returns verbatim passages, typically 200–500 tokens each.
L0 + L1 total approximately 170 tokens. The AI begins every session knowing the full landscape of your palace. L2 and L3 only add tokens when they are genuinely needed.
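The tiering policy above can be sketched as a small class. The method names (`on_topic`, `deep_search`) mirror the layers, but the trigger conditions are simplified assumptions rather than the actual MCP server logic:

```python
class LayeredContext:
    """Minimal sketch of the 4-layer loading policy."""

    def __init__(self, identity, critical_facts):
        # Layers 0 and 1: always loaded at session start (~170 tokens).
        self.context = [identity, critical_facts]

    def on_topic(self, topic, room_index):
        # Layer 2: load a room's closets only when the topic matches one.
        if topic in room_index:
            self.context.extend(room_index[topic])
            return True
        return False

    def deep_search(self, query, search_fn, k=5):
        # Layer 3: full vector search, run explicitly or as a fallback
        # when Layer 2 surfaces nothing.
        passages = search_fn(query, k)
        self.context.extend(passages)
        return passages
```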
How the 100% Hybrid Mode Works
The raw mode's 96.6% score reflects failures on questions where vector similarity is misleading — cases where the most relevant passage is not the most semantically similar to the query wording. The hybrid mode adds a Claude Haiku reranking pass to catch these edge cases.
After the standard top-k retrieval, Haiku reads all k candidates alongside the original query and re-scores them based on actual semantic relevance rather than embedding similarity. The reranked top-5 achieves 100% R@5 across all 500 LongMemEval questions. The team disclosed that 3 targeted fixes moved the score from 99.4% to 100%, and that a held-out test set (never seen during development) scores 98.4% — confirming genuine generalisation rather than benchmark overfitting.
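Structurally, the reranking pass reduces to sorting the retrieved candidates by an LLM-assigned relevance score. In the sketch below the `score` argument stands in for the Claude Haiku call, so the control flow can be shown without a network dependency; `overlap_score` is a toy stand-in scorer, not part of MemPalace:

```python
def rerank(query, candidates, score, k=5):
    """Re-order candidates by judged relevance to the query and keep the
    top k. In production, `score(query, passage)` would be the Claude
    Haiku call; any callable returning a comparable value works here."""
    ranked = sorted(candidates, key=lambda passage: score(query, passage), reverse=True)
    return ranked[:k]

def overlap_score(query, passage):
    # Toy scorer for illustration: word overlap with the query.
    return len(set(query.lower().split()) & set(passage.lower().split()))
```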