Luca Engel

Document Retrieval

Nov 4, 2024

TL;DR

At a glance

Problem

Information retrieval in large, multilingual corpora faces three challenges:

  1. Document length: long documents bias sparse methods, while dense methods run into token limits and time constraints.
  2. Multilinguality: queries and documents in multiple languages complicate indexing.
  3. Compute constraints: dense retrieval requires GPU and memory resources that may not be available.

The challenge: can we build a retrieval system that balances accuracy, efficiency, and scalability under these constraints?

Solution overview

We compared three families of retrieval approaches: TF-IDF, BM25, and dense retrieval (Sentence-Transformers + FAISS).

Architecture

Data

Method

TF-IDF

BM25

Dense Retrieval

Experiments & Results

Benchmarks

Model                                    Recall@10 (val)
BM25 (k1=1.5, b=0.45)                    0.7735
Dense (Sentence-Transformers + FAISS)    0.4715
TF-IDF                                   0.3738
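
For readers unfamiliar with the metric, here is a minimal sketch of how Recall@10 can be computed. The exact definition used for the numbers above is not stated here, so this assumes the common form: for each query, the fraction of its relevant documents that appear in the top 10 results, averaged over queries. The function name `recall_at_k` and the toy IDs are illustrative only.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean over queries of the fraction of relevant docs found in the top k."""
    per_query = []
    for qid, rel in relevant.items():
        if not rel:
            continue  # skip queries with no labelled relevant documents
        top_k = set(ranked.get(qid, [])[:k])
        per_query.append(len(top_k & rel) / len(rel))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy data (hypothetical): q1's relevant doc is retrieved, q2's is not.
ranked = {"q1": ["d3", "d7", "d1"], "q2": ["d9", "d2", "d4"]}
relevant = {"q1": {"d7"}, "q2": {"d5"}}
score = recall_at_k(ranked, relevant, k=10)
```

When every query has exactly one relevant document, this reduces to the share of queries whose target document lands in the top 10.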

BM25 tuning

k1     b=0.35    b=0.45    b=0.55
1.4    0.7735    0.7698    0.7723
1.5    0.7735    0.7735    0.7698
1.6    0.7698    0.7710    0.7686
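
A grid like this can be produced with a small sweep over `k1` and `b`. The sketch below is not the code behind the numbers above: it is a self-contained Okapi BM25 scorer with a Lucene-style (non-negative) IDF, run on a toy corpus with whitespace tokenization, and it scores each setting by top-1 hit rate rather than Recall@10. The class `BM25`, the helper `hit_rate`, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


class BM25:
    """Minimal Okapi BM25 over pre-tokenized documents (non-empty corpus assumed)."""

    def __init__(self, corpus: List[List[str]], k1: float = 1.5, b: float = 0.45):
        self.k1, self.b = k1, b
        self.tfs = [Counter(doc) for doc in corpus]
        self.lengths = [len(doc) for doc in corpus]
        self.avgdl = sum(self.lengths) / len(corpus)
        n = len(corpus)
        # Document frequency: number of documents containing each term.
        df = Counter(term for tf in self.tfs for term in tf)
        # Lucene-style IDF variant, which is never negative.
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

    def score(self, query: List[str], idx: int) -> float:
        tf, dl = self.tfs[idx], self.lengths[idx]
        # b controls how strongly long documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in query
            if t in tf
        )

    def rank(self, query: List[str], k: int = 10) -> List[int]:
        scored = [(self.score(query, i), i) for i in range(len(self.tfs))]
        # Sort by descending score, breaking ties by document index.
        return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))[:k]]


# Toy corpus and (query, gold document index) pairs, purely illustrative.
corpus = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),
    "retrieval with sparse lexical models".split(),
]
queries = [("cat mat".split(), 0), ("sparse retrieval".split(), 2)]


def hit_rate(k1: float, b: float, k: int = 1) -> float:
    """Share of queries whose gold document appears in the top k."""
    index = BM25(corpus, k1=k1, b=b)
    return sum(gold in index.rank(q, k) for q, gold in queries) / len(queries)


# Sweep the same k1 and b values as the table above.
grid = {(k1, b): hit_rate(k1, b) for k1 in (1.4, 1.5, 1.6) for b in (0.35, 0.45, 0.55)}
best = max(grid, key=grid.get)
```

On a real validation set, `hit_rate` would be swapped for the Recall@10 computation, and the index would be rebuilt once per `(k1, b)` pair, since both parameters only affect scoring and not the term statistics.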

Evaluation protocol.

Analysis


Impact

What I learned

Future Work

References