Benchmarking Semantic vs. Lexical Deduplication on the Banking77 Dataset
Why it matters
This research highlights the significant amount of semantic redundancy in real-world NLP datasets, which can degrade the performance of downstream tasks like Retrieval-Augmented Generation (RAG).
Key Points
- Lexical deduplication removed less than 1% of rows, as the dataset contains many variations of the same intent
- Semantic deduplication using sentence embeddings and FAISS identified that 50.4% of the dataset consisted of semantic duplicates (a sketch of this approach follows the list)
- The author built a scalable pipeline using Polars LazyFrame and quantized FAISS indices, and packaged it into an open-source CLI tool called EntropyGuard (a pipeline sketch appears under Details)
- The author hypothesizes that clearing the context window of duplicates could improve RAG retrieval accuracy, and would like to see research on this
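The article does not include the pipeline's code, but a minimal sketch of threshold-based semantic deduplication with sentence embeddings and a FAISS inner-product index could look like the following. The model name, the 0.9 cosine threshold, and the greedy keep-first policy are illustrative assumptions, not settings confirmed by the article.

```python
# Hedged sketch: greedy semantic deduplication with sentence
# embeddings + FAISS. Model, threshold, and policy are assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_dedup(texts: list[str], threshold: float = 0.9) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
    # normalize_embeddings=True makes inner product equal cosine similarity
    emb = model.encode(texts, convert_to_numpy=True,
                       normalize_embeddings=True).astype(np.float32)

    index = faiss.IndexFlatIP(emb.shape[1])  # exact inner-product search
    kept: list[str] = []
    for vec, text in zip(emb, texts):
        vec = vec.reshape(1, -1)
        if index.ntotal > 0:
            sims, _ = index.search(vec, 1)  # nearest already-kept row
            if sims[0, 0] >= threshold:
                continue  # close enough to a kept row: semantic duplicate
        index.add(vec)
        kept.append(text)
    return kept
```

Each row is compared only against rows already kept, so a 50.4% duplicate rate would mean roughly half the rows never enter the index; the threshold choice directly controls how aggressive the pass is.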
Details
The author conducted an experiment to quantify the amount of semantic redundancy in the Banking77 intent-classification dataset, comparing a simple lexical deduplication pass against embedding-based semantic deduplication with FAISS.
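The article names Polars LazyFrame and quantized FAISS indices as the scalability ingredients without showing how they connect. One plausible arrangement is sketched below under stated assumptions: the file path, column name, and IVF-PQ parameters (nlist=256, m=8) are hypothetical, and the random embeddings stand in for a real encoder; none of this is EntropyGuard's confirmed implementation.

```python
# Hedged sketch: lazy lexical pass in Polars feeding a quantized
# (IVF-PQ) FAISS index. Paths, columns, and parameters are assumptions.
import faiss
import numpy as np
import polars as pl

# Lazy scan: nothing is read until .collect(), so normalization and
# the exact-duplicate drop run inside Polars' query engine.
lf = (
    pl.scan_csv("banking77.csv")  # hypothetical input file
    .with_columns(
        pl.col("text").str.to_lowercase().str.strip_chars().alias("norm")
    )
    .unique(subset="norm")  # lexical pass: drop exact duplicates
)
texts = lf.collect()["text"].to_list()

# IVF-PQ compresses each vector to m 8-bit codes inside nlist cells,
# trading some recall for a much smaller memory footprint.
d = 384  # e.g. the all-MiniLM-L6-v2 embedding dimension
emb = np.random.rand(len(texts), d).astype(np.float32)  # stand-in vectors
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFPQ(quantizer, d, 256, 8, 8)  # nlist=256, m=8, 8 bits
index.train(emb)  # IVF-PQ must be trained before vectors are added
index.add(emb)
```

A quantized index like this is what makes the semantic pass tractable at scale, since an exact flat index grows linearly in both memory and search cost.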