Document Retrieval
Nov 4, 2024

- BM25 Recall@10 = 0.77
- TF-IDF Recall@10 = 0.37
- Dense+FAISS Recall@10 = 0.47
- Grade: 5.75 / 6
TL;DR
- Implemented TF-IDF, BM25, and dense retrieval on a long, multilingual document corpus.
- BM25 clearly outperformed the others with Recall@10 = 0.77; dense retrieval struggled (0.47) and TF-IDF performed worst (0.37).
- Chunking and language-specific indices improved runtime but didn’t close the accuracy gap.
- Confirmed that BM25 remains a robust, resource-efficient method for long documents.
At a glance
- Role: IR engineer (team of 2)
- Timeline: Sep–Nov 2024 (6 weeks)
- Context: EPFL Distributed Information Systems course project
- Users/Stakeholders: Search engines, multilingual document archives, resource-constrained retrieval tasks
- Scope: Retrieval pipeline design, TF-IDF, BM25, Dense embedding + FAISS implementation, Evaluation
Problem
Information retrieval in large, multilingual corpora faces three challenges:
- Document length: long documents bias sparse term statistics and push dense models past token limits and runtime budgets.
- Multilinguality: queries and documents in multiple languages complicate indexing.
- Compute constraints: dense retrieval requires GPU and memory resources that may not be available.
The challenge: can we build a retrieval system that balances accuracy, efficiency, and scalability under these constraints?
Solution overview
We compared three families of retrieval approaches:
- TF-IDF: classical bag-of-words baseline.
- BM25: improved bag-of-words with length normalization and tunable parameters.
- Dense retrieval: Sentence-Transformers embeddings with FAISS similarity search, combined with document chunking for token limit handling.
Architecture
- Input query → preprocessing.
- TF-IDF / BM25: sparse vectorization + similarity ranking.
- Dense retrieval: Sentence-Transformer embeddings → FAISS index → nearest neighbor search.
- Language-based sharding: separate indices per language for improved runtime.
Data
- Corpus: Long-form, multilingual documents.
- Splits: Train/validation/test with held-out queries.
- Chunking: For dense retrieval, documents split into smaller segments to fit transformer token limits.
- Language split: Seven subsets (one per language) to optimize retrieval speed.
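The chunking step above can be sketched as a fixed-size token window. This is a minimal sketch: the actual tokenizer and the ~512-token limit come from the embedding model, `chunk_tokens` is a hypothetical helper name, and the overlap parameter is an assumption (the writeup only specifies ~512-token segments).

```python
def chunk_tokens(tokens, max_len=512, overlap=50):
    """Split a token list into overlapping windows of at most max_len tokens."""
    if max_len <= overlap:
        raise ValueError("max_len must exceed overlap")
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += max_len - overlap  # slide the window, keeping some context
    return chunks

# A 1200-token document fits in three overlapping 512-token windows.
doc = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(doc, max_len=512, overlap=50)
print(len(chunks))  # 3
```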
Method
TF-IDF
- Computed sparse vectors.
- Scoring: cosine similarity.
- Precomputation of IDF across full corpus.
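The three TF-IDF bullets above can be sketched end to end in NumPy. This is a minimal sketch on a toy corpus: vocabulary handling and tokenization are simplified, and the project used its own preprocessing rather than this exact code.

```python
import math
from collections import Counter

import numpy as np

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "information retrieval with sparse vectors",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
idx = {t: i for i, t in enumerate(vocab)}

# Precompute IDF once across the full corpus (smoothed).
N = len(docs)
df = Counter(t for doc in tokenized for t in set(doc))
idf = np.array([math.log(N / (1 + df[t])) + 1 for t in vocab])

def tfidf_vector(tokens):
    """Sparse-style TF-IDF vector (dense here for simplicity)."""
    tf = np.zeros(len(vocab))
    for t in tokens:
        if t in idx:
            tf[idx[t]] += 1
    return tf * idf

doc_vecs = np.stack([tfidf_vector(d) for d in tokenized])
query_vec = tfidf_vector("cat on a mat".split())

# Score by cosine similarity and rank.
sims = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12
)
ranking = np.argsort(-sims)
print(ranking[0])  # 0: that doc mentions both "cat" and "mat"
```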
BM25
- Bag-of-words with length normalization.
- Parameters tuned via grid search:
  - k1 ∈ {1.4, 1.5, 1.6}
  - b ∈ {0.35, 0.45, 0.55}
- Implemented using custom notebooks for precomputation + inference.
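The Okapi BM25 scoring described above can be sketched as follows. A minimal sketch on a toy corpus: in the project, each (k1, b) pair in the grid was evaluated by Recall@10 on validation queries, not by the single-document score shown here.

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat around the yard".split(),
    "sparse retrieval scales to long documents".split(),
]
N = len(corpus)
avgdl = sum(len(d) for d in corpus) / N
df = Counter(t for d in corpus for t in set(d))  # document frequencies

def bm25_score(query, doc, k1=1.5, b=0.45):
    """Standard Okapi BM25 with length normalization controlled by b."""
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if t not in tf:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

query = "cat on the mat".split()
scores = [bm25_score(query, d, k1=1.5, b=0.45) for d in corpus]
print(scores.index(max(scores)))  # 0: matches the most query terms

# Tuning loops over the same grid as the project, scoring Recall@10 per pair:
# for k1 in (1.4, 1.5, 1.6):
#     for b in (0.35, 0.45, 0.55):
#         ...evaluate Recall@10 on validation queries...
```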
Dense Retrieval
- Sentence-Transformers embeddings (multilingual models).
- FAISS index for similarity search.
- Document chunking into ~512-token segments.
- Retrieval at chunk level → aggregate top candidates for final ranking.
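The chunk-level retrieval and aggregation above can be sketched as follows. In this minimal sketch, random unit vectors stand in for Sentence-Transformers embeddings and brute-force NumPy inner products stand in for a FAISS index (e.g. `IndexFlatIP`); the max-over-chunks aggregation rule is an assumption about the "aggregate top candidates" step.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32

def embed(n):
    """Random unit vectors standing in for Sentence-Transformers output."""
    v = rng.normal(size=(n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# Each document was split into chunks; one embedding per chunk.
chunks_per_doc = [3, 5, 2]  # doc i contributes this many chunks
doc_ids = np.repeat(np.arange(len(chunks_per_doc)), chunks_per_doc)
chunk_emb = embed(doc_ids.size)  # in practice: stored in a FAISS index

# Make one chunk of doc 1 identical to the query, so doc 1 should win.
query = embed(1)[0]
chunk_emb[4] = query  # chunk 4 belongs to doc 1

# Chunk-level nearest-neighbour search
# (inner product equals cosine similarity on unit vectors).
sims = chunk_emb @ query

# Aggregate: a document scores as its best-matching chunk.
doc_scores = np.full(len(chunks_per_doc), -np.inf)
np.maximum.at(doc_scores, doc_ids, sims)
print(int(np.argmax(doc_scores)))  # 1
```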
Experiments & Results
Benchmarks
| Model | Recall@10 (val) |
|---|---|
| BM25 (k1=1.5, b=0.45) | 0.7735 |
| Dense (Sentence-Transformers + FAISS) | 0.4715 |
| TF-IDF | 0.3738 |
BM25 tuning
| k1 | b=0.35 | b=0.45 | b=0.55 |
|---|---|---|---|
| 1.4 | 0.7735 | 0.7698 | 0.7723 |
| 1.5 | 0.7735 | 0.7735 | 0.7698 |
| 1.6 | 0.7698 | 0.7710 | 0.7686 |
Evaluation protocol
- Metric: Recall@10.
- Document chunking applied for dense models.
- Language-specific indices used for all methods to improve efficiency.
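The Recall@10 metric used throughout can be computed as below. This is the standard definition (fraction of a query's relevant documents found in the top k), averaged over queries; the toy query runs are illustrative.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents appearing in the top-k ranking."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Average over validation queries: (ranked results, relevant set) per query.
runs = [
    (["d3", "d1", "d7"], {"d1"}),        # the one relevant doc is found
    (["d9", "d4", "d2"], {"d5", "d4"}),  # one of two relevant docs is found
]
mean_recall = sum(recall_at_k(r, rel, k=10) for r, rel in runs) / len(runs)
print(mean_recall)  # 0.75
```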
Analysis
- BM25 was the most effective: robust to long documents, resource-efficient, and high accuracy.
- Dense retrieval underperformed due to chunking + compute limits, but showed potential for capturing semantic similarity with more resources.
- TF-IDF was lightweight but struggled with document length normalization.
- Splitting by language improved runtime significantly but not accuracy.
Impact
- Demonstrated that traditional sparse methods (BM25) remain highly competitive under resource constraints.
- Showed limits of dense retrieval for long, multilingual documents without large-scale GPU compute.
- Provided a reproducible retrieval benchmark for future hybrid or neural approaches.
What I learned
- Why BM25 continues to be a strong method in IR.
- Practical trade-offs of dense retrieval: semantic power vs. compute overhead.
- Importance of chunking strategies for handling long documents in embedding models.
- Value of systematic parameter tuning (k1, b) for IR performance.
Future Work
- Explore hybrid methods: BM25 candidates reranked by dense embeddings.
- Experiment with larger multilingual embedding models (XLM-R, mBERT).
- Investigate efficient tokenization strategies for long docs.
References
- Wikipedia contributors: TF-IDF, BM25 (2024).
- Hugging Face: Sentence-Transformers docs.
- FAISS: Johnson, Douze & Jégou (2017), Billion-scale similarity search with GPUs.
- Schwaber-Cohen (2023): Chunking strategies for LLM applications.