Transparency

Milla & Ben's Public Response to Community Criticism of MemPalace

MemPalace launched to enormous attention on April 6, 2026. Within hours, technical communities on HackerNews and GitHub identified real problems in the README. What happened next became almost as notable as the launch itself.

The Launch Backdrop

MemPalace v3.0.0 went public on April 6, 2026. Within the first 24 hours it accumulated 5,400 GitHub stars — by 72 hours that number had reached 26,900. The combination of a recognisable co-creator (actress Milla Jovovich) and a headline benchmark claim (highest-ever LongMemEval score) made the launch unusually visible for an open-source developer tool.

That visibility also brought scrutiny. Technical readers on HackerNews, in GitHub Issues, and on X began examining the README closely. Several of the claims did not survive close inspection. On April 7 — less than 24 hours after the launch post — Milla Jovovich and Ben Sigman published a detailed written response that addressed every criticism by name, acknowledged what was wrong, and committed to specific fixes.

Five Real Problems in the Original README

Problem 1 — The AAAK Token Count Used a Heuristic

The README included a token comparison claiming AAAK produced fewer tokens than plain English. The numbers were generated using len(text) // 3 — a rough approximation — rather than running the text through an actual tokenizer. When @panuhorsmalahti and others ran the same text through OpenAI's tokenizer, the result reversed: the English example tokenised to 66 tokens and the AAAK example to 73. AAAK adds structural overhead — entity codes, separators, truncation markers — that costs more tokens than it saves on short text. The intended use case (repeated entities across thousands of sessions) was never demonstrated by the example chosen.
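The failure mode is easy to see in a few lines. The sketch below reproduces the flawed heuristic; the two sample strings are illustrative stand-ins of my own, not the README's actual examples, and the 66-vs-73 figures are the community's measurements quoted above.

```python
# The README's token estimate: a crude characters-per-token heuristic.
def heuristic_tokens(text: str) -> int:
    # Assumes ~3 characters per token, which undercounts dense
    # symbolic text like AAAK entity codes and separators.
    return len(text) // 3

# Illustrative stand-ins, not the README's actual strings.
english = "Milla asked Ben to benchmark the new retrieval pipeline on Tuesday."
aaak = "E1|ask|E2|bench|new-retr-pipe|T:tue"

# The heuristic rewards the shorter character count of the AAAK form...
print(heuristic_tokens(english), heuristic_tokens(aaak))
# ...but BPE tokenizers split codes like "E1|" into several tokens each,
# which is how the community measured 66 tokens for the README's English
# example and 73 for its AAAK version.
```

Running the same strings through a real tokenizer (the community used OpenAI's) is the only reliable way to compare; character-count heuristics systematically flatter symbol-dense encodings.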

Problem 2 — "30x Lossless Compression" Was Wrong on Two Counts

The README called AAAK "30x lossless compression." Both parts of that phrase were inaccurate. AAAK is a lossy abbreviation system — it truncates sentences and replaces proper nouns with short codes, which means information is discarded, not preserved. And the 30x ratio was derived from the same flawed character-count heuristic as the token comparison. On the LongMemEval benchmark, AAAK mode scores 84.2% versus raw mode's 96.6% — a 12.4-point regression that directly quantifies the cost of the lossy compression.
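Why truncation plus entity-code substitution can never be lossless is worth making concrete. The toy below is my own illustration, not MemPalace's actual AAAK code: two distinct inputs collapse to the same stored string, so no decoder could recover the originals.

```python
# Toy abbreviation in the spirit of AAAK (illustrative only, not the
# project's real implementation): replace known entities with codes,
# then truncate the sentence.
ENTITY_CODES = {"Milla Jovovich": "E1", "Ben Sigman": "E2"}

def abbreviate(sentence: str, max_len: int = 20) -> str:
    for name, code in ENTITY_CODES.items():
        sentence = sentence.replace(name, code)
    return sentence[:max_len]  # truncation discards the tail outright

a = abbreviate("Milla Jovovich shipped the parser and wrote all the docs.")
b = abbreviate("Milla Jovovich shipped the parser and fixed every test.")
# Two different facts collapse to one stored string — the scheme is
# lossy by construction, whatever its compression ratio.
print(a == b)  # True
```

A lossless scheme must be injective: distinct inputs must map to distinct outputs. Truncation violates that immediately, which is why "lossless" was the wrong word regardless of the ratio.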

Problem 3 — "+34% Palace Boost" Described a Standard ChromaDB Feature

The README presented a 34-point recall improvement from wing+room metadata filtering as though it were a unique capability of the palace architecture. In practice, filtering a vector search by metadata before computing similarity is standard ChromaDB functionality. The palace structure contributes meaningfully — it generates and enforces the metadata schema that makes the filtering useful — but the filtering itself is not novel. The framing implied a proprietary retrieval advancement that did not exist.
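The technique itself — restricting the candidate pool by a metadata predicate before ranking by similarity — fits in a few lines. In ChromaDB it is exposed through the `where=` argument to `collection.query`; the sketch below is a pure-Python stand-in with toy two-dimensional vectors, and the `wing`/`room` field names mirror the palace schema described above.

```python
from math import sqrt

# Toy store of (vector, metadata, document) entries. In ChromaDB the
# same effect comes from collection.query(..., where={"wing": "work"}):
# metadata filtering is applied before similarity ranking.
STORE = [
    ([1.0, 0.0], {"wing": "work", "room": "standup"}, "sprint notes"),
    ([0.9, 0.1], {"wing": "home", "room": "kitchen"}, "grocery list"),
    ([0.8, 0.2], {"wing": "work", "room": "retro"}, "retro actions"),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def query(vec, where=None, k=2):
    # Pre-filter by metadata, then rank only the survivors.
    pool = [e for e in STORE if not where
            or all(e[1].get(f) == v for f, v in where.items())]
    pool.sort(key=lambda e: cosine(vec, e[0]), reverse=True)
    return [e[2] for e in pool[:k]]

print(query([1.0, 0.0], where={"wing": "work"}))
```

The palace's real contribution, as the response notes, is generating a metadata schema consistent enough that a filter like `{"wing": "work"}` reliably narrows the pool — the filtering mechanism itself is the database's.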

Problem 4 — Contradiction Detection Was Described as Active When It Was Not

The README implied that the knowledge graph automatically checked new entries against existing facts and flagged conflicts. The underlying utility (fact_checker.py) does exist and does work — but it is not called by the KG operations. It must be invoked explicitly. The discrepancy between the README's description and the actual codebase was real and was raised in Issue #27.
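The fix the issue asks for is a wiring change: call the checker at write time instead of leaving it as a separate explicit step. The sketch below is hypothetical — the class and method names are mine, and the real `fact_checker.py` API may differ.

```python
# Hypothetical sketch of wiring a contradiction check into KG writes.
# Names (KnowledgeGraph, check_contradiction) are illustrative; the
# actual fact_checker.py interface may look quite different.
class KnowledgeGraph:
    def __init__(self):
        self.facts = {}  # (subject, predicate) -> object

    def check_contradiction(self, subject, predicate, obj):
        existing = self.facts.get((subject, predicate))
        return existing is not None and existing != obj

    def add_fact(self, subject, predicate, obj):
        # The change tracked in the open issue: run the checker
        # automatically on every write, not only when invoked by hand.
        if self.check_contradiction(subject, predicate, obj):
            raise ValueError(
                f"conflict: ({subject}, {predicate}) already = "
                f"{self.facts[(subject, predicate)]!r}")
        self.facts[(subject, predicate)] = obj

kg = KnowledgeGraph()
kg.add_fact("MemPalace", "license", "MIT")
# kg.add_fact("MemPalace", "license", "GPL")  # would raise ValueError
```

Until that wiring lands, the honest description is the one in the response: the utility works, but only when a caller remembers to invoke it.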

Problem 5 — The 100% Hybrid Score Lacked a Public Reproduction Path

The hybrid mode's 100% score was real — the result files existed and were genuine. But the reranking pipeline that produced it was not included in the public benchmark scripts at launch. Users who wanted to reproduce the result independently could not do so from the published code alone. The team committed to publishing the pipeline.

What Held Up Under Scrutiny

The 96.6% R@5 LongMemEval score in raw mode was independently reproduced by community member @gizmax on an Apple M2 Ultra in under five minutes, using the public benchmark runner at benchmarks/longmemeval_bench.py. Zero API calls, zero cloud dependency.
  • The 96.6% raw score is the highest published LongMemEval result requiring no external service. That claim held.
  • The verbatim storage approach is architecturally sound. Storing full conversations prevents the recall gaps that summarisation creates.
  • The palace structure — wings, rooms, halls, tunnels, closets, drawers — is a real and useful organisation scheme. The 34-point retrieval improvement is real even if the mechanism is standard filtering.
  • The project is genuinely free, genuinely MIT licensed, and genuinely local. None of those claims were challenged.

Why the Community Response Was Different

Open-source launch announcements rarely include sentences like "what we got wrong." The April 7 response listed each error by name, explained the mechanism that produced it, and described the corrective action without minimising the scale of the mistake. Critics who had filed sharp GitHub issues were acknowledged by name. The tone throughout was direct rather than defensive.

"Thank you to everyone who poked holes in this. Brutal honest criticism is exactly what makes open source work, and it's what we asked for. Special thanks to @panuhorsmalahti, @lhl, @gizmax, and everyone who filed an issue or a PR in the first 48 hours."

— Milla Jovovich & Ben Sigman, April 7, 2026

Several of the people who filed the most critical issues became contributors within days. The project's current contributor list credits the community members who found the problems alongside the founders who wrote the original code.

The Fix List

  • AAAK token-count example replaced with real tokenizer output and a scenario where compression is demonstrated at genuine scale
  • Benchmark documentation updated to show raw, AAAK, and rooms modes clearly with explicit trade-off notes
  • Haiku reranking pipeline added to the public benchmark scripts
  • ChromaDB pinned to a tested version range to prevent import conflicts (Issue #100)
  • Shell injection vulnerability in the save hooks patched (Issue #110)
  • macOS ARM64 segfault investigated and mitigated (Issue #74)
  • fact_checker.py wiring into KG operations tracked as an open issue with planned completion
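The save-hook patch (Issue #110) addresses a well-known bug class: interpolating user-controlled text into a shell command string. The sketch below shows the standard fix pattern, assuming the hook shells out with note text — it is illustrative, not the project's actual patch.

```python
import shlex
import subprocess

note = 'standup notes"; rm -rf ~ #'  # attacker-controlled text

# Vulnerable pattern: interpolating user text into a shell string.
#   subprocess.run(f'echo "{note}" >> palace.log', shell=True)
# The embedded quote closes the echo argument, and the rest of the
# string runs as a fresh shell command.

# Fix 1: pass an argument list and skip the shell entirely.
subprocess.run(["echo", note], check=True)

# Fix 2: if a shell is unavoidable, quote every interpolated value.
safe = f"echo {shlex.quote(note)}"
subprocess.run(safe, shell=True, check=True)
```

With either fix, the injected `rm -rf` is printed as literal text rather than executed, which is the whole point of the patch.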