Audit Finds Issues with LoCoMo Long-Term Memory Benchmark
The article discusses an audit of the LoCoMo long-term memory benchmark, which found 6.4% of the answer key is wrong and the LLM judge accepts up to 63% of intentionally wrong answers. It also examines issues with the LongMemEval-S benchmark, which measures context window management rather than long-term memory.
Why it matters
These issues with leading long-term memory benchmarks call into question the validity of results and the ability to accurately measure progress in this important area of AI research.
Key Points
- LoCoMo has 99 score-corrupting errors in 1,540 questions (6.4%), including hallucinated facts, incorrect temporal reasoning, and speaker attribution errors.
- The LLM judge used to score LoCoMo accepts 62.81% of intentionally wrong but topically adjacent answers.
- LongMemEval-S tests context window management rather than long-term memory, as the entire test corpus fits in a single context window for most current models.
- LoCoMo-Plus inherits the original LoCoMo questions with the documented errors and uses the same broken ground truth.
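The 6.4% error rate and the resulting score ceiling follow directly from the counts above. The snippet below is a minimal sketch of that arithmetic; the function names are illustrative, and it assumes that a perfect system is marked wrong on every question whose answer key is corrupted.

```python
def error_rate(num_errors: int, num_questions: int) -> float:
    """Fraction of the answer key that is wrong."""
    if num_questions <= 0:
        raise ValueError("num_questions must be positive")
    return num_errors / num_questions


def score_ceiling(num_errors: int, num_questions: int) -> float:
    """Best achievable score, assuming a correct answer is graded
    as wrong wherever the key itself is wrong."""
    return 1.0 - error_rate(num_errors, num_questions)


# Counts reported by the audit: 99 corrupted items out of 1,540 questions.
rate = error_rate(99, 1540)
ceiling = score_ceiling(99, 1540)

print(f"answer-key error rate: {rate:.1%}")   # 6.4%
print(f"perfect-system ceiling: {ceiling:.1%}")  # 93.6%
```

Under that assumption, any reported LoCoMo score above the ceiling would mean a system is agreeing with at least some of the wrong key entries.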
Details
LoCoMo is one of the most widely cited long-term memory benchmarks. The audit found 99 score-corrupting errors in its 1,540 questions, so 6.4% of the answer key is wrong. The errors include hallucinated facts, incorrect temporal reasoning, and speaker attribution mistakes. Because a correct answer is marked wrong wherever the key is wrong, even a perfect system can score at most 93.6%.

The LLM judge used to score LoCoMo answers is also unreliable: it accepts 62.81% of intentionally wrong but topically adjacent answers. This suggests the benchmark rewards weak retrieval that identifies the right topic but misses specific details.

LongMemEval-S is often raised as an alternative. However, its entire test corpus fits within the context window of most current models, which makes it more a test of context window management than of long-term memory.

LoCoMo-Plus adds a new 'cognitive' question category that tests implicit inference, but it inherits the original LoCoMo questions, documented errors included, and reuses the same broken ground truth without revalidation.
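The judge finding can be read as a false-acceptance rate: the share of deliberately wrong answers that the judge still marks correct. The snippet below is a rough sketch of how such a rate could be measured. Everything in it is an assumption for illustration: `lenient_topic_judge` is a toy keyword-overlap stand-in, not the LLM judge used by LoCoMo, and the two answer pairs are invented, not taken from the benchmark.

```python
from typing import Callable, Iterable, Tuple

# A judge takes (gold_answer, candidate_answer) and returns True if it accepts.
Judge = Callable[[str, str], bool]


def false_acceptance_rate(judge: Judge, pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of intentionally wrong candidates that the judge accepts."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (gold, wrong_answer) pair")
    accepted = sum(1 for gold, wrong in pairs if judge(gold, wrong))
    return accepted / len(pairs)


def _content_words(text: str) -> set:
    """Lowercased words longer than three characters, punctuation stripped."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def lenient_topic_judge(gold: str, candidate: str) -> bool:
    """Hypothetical judge: accepts any candidate sharing a content word
    with the gold answer, i.e. it checks topic rather than detail."""
    return bool(_content_words(gold) & _content_words(candidate))


# Invented examples: the first wrong answer is on-topic, the second is not.
pairs = [
    ("She adopted a dog in May 2023", "She adopted a dog in March 2022"),
    ("He moved to Boston", "He plays the violin"),
]

rate = false_acceptance_rate(lenient_topic_judge, pairs)
print(f"false-acceptance rate: {rate:.0%}")  # 50%
```

The on-topic wrong answer passes because it shares the word "adopted" with the gold answer even though the date is wrong, which mirrors the failure mode the audit describes: the right topic with the wrong details.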