Reddit Machine Learning | 9h ago | Research & Papers, Products & Services

Audit Finds Issues with LoCoMo Long-Term Memory Benchmark

An audit of the LoCoMo long-term memory benchmark found that 6.4% of the answer key is wrong and that the LLM judge accepts up to 63% of intentionally wrong answers. The audit also argues that the LongMemEval-S benchmark measures context window management rather than long-term memory.

Why it matters

These issues with leading long-term memory benchmarks call into question the validity of results and the ability to accurately measure progress in this important area of AI research.

Key Points

1. LoCoMo has 99 score-corrupting errors in 1,540 questions (6.4%), including hallucinated facts, incorrect temporal reasoning, and speaker attribution errors
2. The LLM judge used to score LoCoMo accepts 62.81% of intentionally wrong but topically adjacent answers
3. LongMemEval-S tests context window management rather than long-term memory, as the entire test corpus fits in a single context window for most current models
4. LoCoMo-Plus inherits the original LoCoMo questions with the documented errors and uses the same broken ground truth
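
The headline figures follow directly from the two counts in the first point. A minimal sketch of that arithmetic, assuming every flawed answer-key entry causes a correct response to be marked wrong (the function names are illustrative, not taken from the audit):

```python
# Counts reported in the audit summary above.
TOTAL_QUESTIONS = 1540
FLAWED_KEY_ENTRIES = 99


def error_rate(flawed: int, total: int) -> float:
    """Fraction of answer-key entries that are wrong."""
    return flawed / total


def score_ceiling(flawed: int, total: int) -> float:
    """Best score a perfect system can get.

    Assumption: each flawed key entry marks a truly correct answer as wrong.
    """
    return 1.0 - error_rate(flawed, total)


rate = error_rate(FLAWED_KEY_ENTRIES, TOTAL_QUESTIONS)
ceiling = score_ceiling(FLAWED_KEY_ENTRIES, TOTAL_QUESTIONS)
print(f"answer-key error rate: {rate:.1%}")    # 6.4%
print(f"perfect-system ceiling: {ceiling:.1%}")  # 93.6%
```

Any reported LoCoMo score above that ceiling would, under this assumption, mean the system is matching some of the flawed key entries rather than the underlying conversations.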

Details

LoCoMo is one of the most widely cited long-term memory benchmarks. The audit found 99 score-corrupting errors among its 1,540 questions, so 6.4% of the answer key is wrong. The errors include hallucinated facts, incorrect temporal reasoning, and speaker attribution mistakes. Because a correct answer to any of those questions is marked wrong, a perfect system can score at most 93.6%.

The audit also tested the LLM judge used to score LoCoMo answers. The judge accepted 62.81% of intentionally wrong but topically adjacent answers, which suggests the benchmark rewards weak retrieval that finds the right topic but misses the specific details.

LongMemEval-S is often raised as an alternative. However, its entire test corpus fits within the context window of most current models, so it measures context window management more than long-term memory.

LoCoMo-Plus adds a new 'cognitive' question category that tests implicit inference, but it inherits the original LoCoMo questions, documented errors included, and reuses the same flawed ground truth without revalidation.
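
The judge test described above amounts to measuring a false-acceptance rate: feed the judge answers known to be wrong and count how many it accepts. A rough sketch of that measurement follows. The `lenient_topic_judge` function is a hypothetical keyword-overlap stand-in, not the LLM judge the audit examined, and the three cases are made-up toy data rather than LoCoMo questions.

```python
from typing import Callable, Iterable, Tuple

# (question, gold_answer, candidate_answer) -> True if the judge accepts.
Judge = Callable[[str, str, str], bool]

STOP_WORDS = {"the", "a", "an", "in", "on", "to", "of", "and", "her", "his"}


def content_words(text: str) -> set:
    """Lowercased whitespace tokens with common stop words removed."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def lenient_topic_judge(question: str, gold: str, candidate: str) -> bool:
    """Hypothetical stand-in judge: accepts any candidate that shares a
    content word with the gold answer, i.e. it only checks the topic."""
    return bool(content_words(gold) & content_words(candidate))


def false_acceptance_rate(
    judge: Judge, wrong_cases: Iterable[Tuple[str, str, str]]
) -> float:
    """Share of known-wrong candidate answers that the judge accepts."""
    cases = list(wrong_cases)
    if not cases:
        raise ValueError("need at least one wrong-answer case")
    accepted = sum(1 for q, gold, wrong in cases if judge(q, gold, wrong))
    return accepted / len(cases)


# Toy data: every candidate is wrong; the first two are topically adjacent.
toy_cases = [
    ("When did Alice adopt her dog?",
     "Alice adopted her dog in May 2023",
     "Alice adopted her dog in March 2022"),
    ("Where did Bob move?", "Bob moved to Lisbon", "Bob moved to Madrid"),
    ("What instrument does Carol play?", "Carol plays cello",
     "She enjoys painting"),
]

far = false_acceptance_rate(lenient_topic_judge, toy_cases)
print(f"false acceptance rate: {far:.2%}")
```

A strict judge should score close to zero on a set like this; a judge that only checks the topic lets the adjacent-but-wrong answers through, which is the failure mode the audit reports.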
