Document Retrieval
Nov 4, 2024

- BM25 Recall@10 = 0.77
- TF-IDF Recall@10 = 0.37
- Dense+FAISS Recall@10 = 0.47
- Grade: 5.75 / 6
TL;DR
- Implemented TF-IDF, BM25, and dense retrieval on a long, multilingual document corpus.
- BM25 clearly outperformed the others with Recall@10 = 0.77; dense retrieval struggled (0.47) and TF-IDF performed worst (0.37).
- Chunking and language-specific indices improved runtime but didn’t close the accuracy gap.
- Confirmed that BM25 remains a robust, resource-efficient method for long documents.
At a glance
- Role: IR engineer (team of 2)
- Timeline: Sep–Nov 2024 (6 weeks)
- Context: EPFL Distributed Information Systems course project
- Users/Stakeholders: Search engines, multilingual document archives, resource-constrained retrieval tasks
- Scope: Retrieval pipeline design, TF-IDF, BM25, Dense embedding + FAISS implementation, Evaluation
Problem
Information retrieval in large, multilingual corpora faces three challenges:
- Document length: long documents bias sparse term statistics and push dense models past token limits and runtime budgets.
- Multilinguality: queries and documents in multiple languages complicate indexing.
- Compute constraints: dense retrieval requires GPU and memory resources that may not be available.
The challenge: can we build a retrieval system that balances accuracy, efficiency, and scalability under these constraints?
Solution overview
We compared three families of retrieval approaches:
- TF-IDF: classical bag-of-words baseline.
- BM25: improved bag-of-words with length normalization and tunable parameters.
- Dense retrieval: Sentence-Transformers embeddings with FAISS similarity search, combined with document chunking for token limit handling.
Architecture
- Input query → preprocessing.
- TF-IDF / BM25: sparse vectorization + similarity ranking.
- Dense retrieval: Sentence-Transformer embeddings → FAISS index → nearest neighbor search.
- Language-based sharding: separate indices per language for improved runtime.
Data
- Corpus: Long-form, multilingual documents.
- Splits: Train/validation/test with held-out queries.
- Chunking: For dense retrieval, documents split into smaller segments to fit transformer token limits.
- Language split: Seven subsets (one per language) to optimize retrieval speed.
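The chunking step above can be sketched as a fixed-size token window. This is a minimal sketch: the actual tokenizer and the ~512-token limit come from the embedding model, `chunk_tokens` is a hypothetical helper name, and the overlap parameter is an assumption (the writeup only specifies ~512-token segments).

```python
def chunk_tokens(tokens, max_len=512, overlap=50):
    """Split a token list into overlapping windows of at most max_len tokens."""
    if max_len <= overlap:
        raise ValueError("max_len must exceed overlap")
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += max_len - overlap  # slide the window, keeping some context
    return chunks

# A 1200-token document fits in three overlapping 512-token windows.
doc = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(doc, max_len=512, overlap=50)
print(len(chunks))  # 3
```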
Method
TF-IDF
- Computed sparse vectors.
- Scoring: cosine similarity.
- Precomputation of IDF across full corpus.
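The three TF-IDF bullets above can be sketched end to end in NumPy. This is a minimal sketch on a toy corpus: vocabulary handling and tokenization are simplified, and the project used its own preprocessing rather than this exact code.

```python
import math
from collections import Counter

import numpy as np

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "information retrieval with sparse vectors",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
idx = {t: i for i, t in enumerate(vocab)}

# Precompute IDF once across the full corpus (smoothed).
N = len(docs)
df = Counter(t for doc in tokenized for t in set(doc))
idf = np.array([math.log(N / (1 + df[t])) + 1 for t in vocab])

def tfidf_vector(tokens):
    """Sparse-style TF-IDF vector (dense here for simplicity)."""
    tf = np.zeros(len(vocab))
    for t in tokens:
        if t in idx:
            tf[idx[t]] += 1
    return tf * idf

doc_vecs = np.stack([tfidf_vector(d) for d in tokenized])
query_vec = tfidf_vector("cat on a mat".split())

# Score by cosine similarity and rank.
sims = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12
)
ranking = np.argsort(-sims)
print(ranking[0])  # 0: that doc mentions both "cat" and "mat"
```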
BM25
- Bag-of-words with length normalization.
- Parameters tuned via grid search:
  - k1 ∈ {1.4, 1.5, 1.6}
  - b ∈ {0.35, 0.45, 0.55}
- Implemented using custom notebooks for precomputation + inference.
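The Okapi BM25 scoring described above can be sketched as follows. A minimal sketch on a toy corpus: in the project, each (k1, b) pair in the grid was evaluated by Recall@10 on validation queries, not by the single-document score shown here.

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat around the yard".split(),
    "sparse retrieval scales to long documents".split(),
]
N = len(corpus)
avgdl = sum(len(d) for d in corpus) / N
df = Counter(t for d in corpus for t in set(d))  # document frequencies

def bm25_score(query, doc, k1=1.5, b=0.45):
    """Standard Okapi BM25 with length normalization controlled by b."""
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if t not in tf:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

query = "cat on the mat".split()
scores = [bm25_score(query, d, k1=1.5, b=0.45) for d in corpus]
print(scores.index(max(scores)))  # 0: matches the most query terms

# Tuning loops over the same grid as the project, scoring Recall@10 per pair:
# for k1 in (1.4, 1.5, 1.6):
#     for b in (0.35, 0.45, 0.55):
#         ...evaluate Recall@10 on validation queries...
```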
Dense Retrieval
- Sentence-Transformers embeddings (multilingual models).
- FAISS index for similarity search.
- Document chunking into ~512-token segments.
- Retrieval at chunk level → aggregate top candidates for final ranking.
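The chunk-level retrieval and aggregation above can be sketched as follows. In this minimal sketch, random unit vectors stand in for Sentence-Transformers embeddings and brute-force NumPy inner products stand in for a FAISS index (e.g. `IndexFlatIP`); the max-over-chunks aggregation rule is an assumption about the "aggregate top candidates" step.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32

def embed(n):
    """Random unit vectors standing in for Sentence-Transformers output."""
    v = rng.normal(size=(n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# Each document was split into chunks; one embedding per chunk.
chunks_per_doc = [3, 5, 2]  # doc i contributes this many chunks
doc_ids = np.repeat(np.arange(len(chunks_per_doc)), chunks_per_doc)
chunk_emb = embed(doc_ids.size)  # in practice: stored in a FAISS index

# Make one chunk of doc 1 identical to the query, so doc 1 should win.
query = embed(1)[0]
chunk_emb[4] = query  # chunk 4 belongs to doc 1

# Chunk-level nearest-neighbour search
# (inner product equals cosine similarity on unit vectors).
sims = chunk_emb @ query

# Aggregate: a document scores as its best-matching chunk.
doc_scores = np.full(len(chunks_per_doc), -np.inf)
np.maximum.at(doc_scores, doc_ids, sims)
print(int(np.argmax(doc_scores)))  # 1
```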
Experiments & Results
Benchmarks
| Model | Recall@10 (val) |
|---|---|
| BM25 (k1=1.5, b=0.45) | 0.7735 |
| Dense (Sentence-Transformers + FAISS) | 0.4715 |
| TF-IDF | 0.3738 |
BM25 tuning
| k1 | b=0.35 | b=0.45 | b=0.55 |
|---|---|---|---|
| 1.4 | 0.7735 | 0.7698 | 0.7723 |
| 1.5 | 0.7735 | 0.7735 | 0.7698 |
| 1.6 | 0.7698 | 0.7710 | 0.7686 |
Evaluation protocol
- Metric: Recall@10.
- Document chunking applied for dense models.
- Language-specific indices used for all methods to improve efficiency.
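The Recall@10 metric used throughout can be computed as below. This is the standard definition (fraction of a query's relevant documents found in the top k), averaged over queries; the toy query runs are illustrative.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents appearing in the top-k ranking."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Average over validation queries: (ranked results, relevant set) per query.
runs = [
    (["d3", "d1", "d7"], {"d1"}),        # the one relevant doc is found
    (["d9", "d4", "d2"], {"d5", "d4"}),  # one of two relevant docs is found
]
mean_recall = sum(recall_at_k(r, rel, k=10) for r, rel in runs) / len(runs)
print(mean_recall)  # 0.75
```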
Analysis
- BM25 was the most effective: robust to long documents, resource-efficient, and high accuracy.
- Dense retrieval underperformed due to chunking + compute limits, but showed potential for capturing semantic similarity with more resources.
- TF-IDF was lightweight but struggled with document length normalization.
- Splitting by language improved runtime significantly but not accuracy.
Impact
- Demonstrated that traditional sparse methods (BM25) remain highly competitive under resource constraints.
- Showed limits of dense retrieval for long, multilingual documents without large-scale GPU compute.
- Provided a reproducible retrieval benchmark for future hybrid or neural approaches.
What I learned
- Why BM25 continues to be a strong method in IR.
- Practical trade-offs of dense retrieval: semantic power vs. compute overhead.
- Importance of chunking strategies for handling long documents in embedding models.
- Value of systematic parameter tuning (k1, b) for IR performance.
Future Work
- Explore hybrid methods: BM25 candidates reranked by dense embeddings.
- Experiment with larger multilingual embedding models (XLM-R, mBERT).
- Investigate efficient tokenization strategies for long docs.
References
- Wikipedia contributors: TF-IDF, BM25 (2024).
- Hugging Face: Sentence-Transformers docs.
- FAISS: Johnson, Douze & Jégou (2017), Billion-scale similarity search with GPUs.
- Schwaber-Cohen (2023): Chunking strategies for LLM applications.