Evaluating the Novelty of Knowledge in Texts with LLMs
Jan 10, 2025

Highlights
- Knowledge Gain (Venice): 0.386 (Llama 3 70B), 0.309 (GPT-4o)
- Knowledge Gain (Wikipedia): 0.267 (Llama 3 70B), 0.208 (GPT-4o)
- Bias mitigation: correct answer rotated through positions A–D (4 runs per MCQ)
- Project Grade: 6 / 6
TL;DR
- Built an MCQ pipeline to assess whether texts add new factual knowledge to an LLM.
- Introduced similarity-based filtering (Jaccard, ROUGE-L, cosine similarity) to control MCQ quality and difficulty.
- Mitigated positional bias by rotating correct answer slots across runs.
- Found the highest Knowledge Gain on Venice books and the lowest on Wikipedia (presumed to be in the training data).
At a glance
- Role: Sole author (semester project)
- Timeline: Sep 2024 – Jan 2025 (4 months)
- Context: EPFL Master Semester Project
- Users/Stakeholders: NLP researchers, dataset curators, retrieval pipeline designers
- My scope: Pipeline design → Implementation → Experiments → Report
Problem
Large Language Models contain vast factual knowledge, but it is unclear which texts add new knowledge rather than merely confirming what the model already knows.
Key challenges:
- Measuring novelty: How to test whether a fact in a text is already encoded in the model?
- Evaluation design: Avoiding bias in generated MCQs (e.g. positional bias).
- Filtering for quality: Ensuring questions test knowledge gaps and not just wording tricks.
The broader question: Can we systematically detect and prioritize texts that provide knowledge gain for an LLM?
Solution overview
I developed a three-stage pipeline:
- MCQ Generation — Split text into chunks; generate multiple-choice questions (MCQs) with GPT-4o.
- MCQ Filtering — Apply overlap-based metrics (Jaccard, ROUGE-L) and cosine similarity (NV-Embed-v2) to filter out trivial or faulty questions.
- Evaluation — Test target LLMs (Llama 3 70B, GPT-4o) twice per MCQ: once with and once without the source chunk as context. Compute Knowledge Gain (KG) as the accuracy lift from adding context.
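As a concrete illustration of the KG definition above, here is a minimal sketch of the computation; the data layout and names are illustrative, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass
class MCQResult:
    """Outcome of one MCQ for one target model, evaluated twice."""
    correct_with_context: bool     # answered correctly when shown the source chunk
    correct_without_context: bool  # answered correctly from parametric knowledge alone


def knowledge_gain(results: list[MCQResult]) -> float:
    """KG = accuracy with context minus accuracy without context.

    A high KG means the text supplied facts the model did not already encode."""
    if not results:
        return 0.0
    n = len(results)
    acc_with = sum(r.correct_with_context for r in results) / n
    acc_without = sum(r.correct_without_context for r in results) / n
    return acc_with - acc_without
```

With this definition, the 0.386 KG for Llama 3 70B on Venice means the with-context accuracy was 38.6 percentage points above the closed-book accuracy.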
Architecture
- Bias mitigation: Each MCQ is evaluated 4×, with the correct answer rotated through positions A–D (see the sketch after this list).
- Datasets:
  - Venice books (private, presumed novel knowledge)
  - Synthetic baseline (LLM-generated)
  - Wikipedia (pre-cutoff, presumed known knowledge)
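The rotation itself is simple; here is a minimal sketch (function and field names are illustrative), where each MCQ yields four evaluation items with the correct answer placed in slot A, B, C, and D:

```python
def rotate_answer_positions(question: str, correct: str, distractors: list[str]) -> list[dict]:
    """Produce 4 variants of one MCQ, moving the correct answer through slots A-D.

    Distractors keep their relative order; only the correct answer's slot changes,
    so any positional preference of the evaluated model averages out over the 4 runs."""
    assert len(distractors) == 3, "expected exactly 3 distractors for an A-D question"
    variants = []
    for slot in range(4):
        options = distractors[:slot] + [correct] + distractors[slot:]
        variants.append({
            "question": question,
            "options": dict(zip("ABCD", options)),
            "correct_letter": "ABCD"[slot],
        })
    return variants
```

Per-MCQ accuracy is then averaged over the four variants, so a model that blindly favours one letter cannot inflate its score.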
Data
- Venice books: Digitized historical sources, unlikely in pretraining corpora.
- Synthetic baseline: Books generated by Llama 3 70B itself to simulate "already-known" text.
- Wikipedia (pre-cutoff): Scraped pre-2023 articles, maximizing overlap with training data.
- Scale: ~1.3k MCQs per dataset.
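For a concrete picture of how each corpus is turned into MCQs (stage 1 of the pipeline), here is a minimal sketch of the chunk-and-generate step; the chunk size, prompt wording, and JSON schema are illustrative assumptions, not the project's actual ones.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MCQ_PROMPT = """From the passage below, write one multiple-choice question about a fact stated
in the passage. Respond as JSON with keys "question", "correct", and "distractors" (a list of 3).

Passage:
{chunk}"""


def chunk_text(text: str, max_words: int = 300) -> list[str]:
    """Split a document into fixed-size word windows (illustrative chunking strategy)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def generate_mcq(chunk: str) -> dict:
    """Ask GPT-4o to produce one MCQ grounded in the given chunk."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": MCQ_PROMPT.format(chunk=chunk)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

Each corpus is processed the same way, yielding the ~1.3k MCQs per dataset noted above.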
Method
- MCQ pipeline: Chunk → generate questions → filter → rotate answers → evaluate twice.
- Similarity filtering (see the sketch after this list):
  - Jaccard & ROUGE-L overlap between the source chunk and the correct answer, to reduce faulty MCQ generations.
  - Cosine similarity (NV-Embed-v2 embeddings) between the correct answer and the distractors, to increase MCQ difficulty.
- Evaluation: Compute performance with/without context → derive Knowledge Gain.
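A minimal sketch of the two filters follows. The thresholds, their direction, and the abstract `embed` callable (standing in for an NV-Embed-v2 wrapper) are assumptions for illustration; ROUGE-L would be applied analogously to the Jaccard check.

```python
import numpy as np


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def keep_mcq(chunk: str, correct: str, distractors: list[str], embed,
             min_grounding: float = 0.1, min_distractor_sim: float = 0.5) -> bool:
    """Illustrative two-stage filter.

    1. Quality: require some lexical overlap between the source chunk and the correct
       answer, so answers not grounded in the chunk (faulty generations) are dropped.
    2. Difficulty: require distractors to be semantically close to the correct answer,
       so the question cannot be solved by spotting the only on-topic option.
    `embed` is any callable mapping a string to a vector (e.g. an NV-Embed-v2 wrapper)."""
    if jaccard(chunk, correct) < min_grounding:
        return False
    c_vec = embed(correct)
    sims = [cosine(c_vec, embed(d)) for d in distractors]
    return min(sims) >= min_distractor_sim
```

The overlap check targets quality, while the distractor-similarity check raises difficulty, which is what drove down the no-context scores reported below.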
Experiments & Results
| Dataset | Llama 3 70B KG | GPT-4o KG |
|---|---|---|
| Venice | 0.386 | 0.309 |
| Baseline | 0.360 | – |
| Wikipedia | 0.267 | 0.208 |
Findings:
- Venice > Baseline > Wikipedia: As expected, the Venice books add the most new knowledge and Wikipedia the least.
- Bias mitigation balanced evaluation across answer positions, avoiding inflated scores.
- Cosine filtering increased difficulty in the no-context runs while leaving with-context performance largely stable.
Impact
- Validated that Knowledge Gain is measurable and tracks true novelty.
- Provides a tool for dataset curation: prioritize texts with high KG for fine-tuning.
- Supports RAG ingestion decisions: ingest texts that models demonstrably lack.
- Contributes to LLM evaluation research, helping disentangle memorization from reasoning.
What I learned
- MCQ pipelines need careful bias controls; positional effects can dominate results.
- Similarity thresholds are powerful levers for dataset quality and difficulty.
- KG is not only a metric but a strategy for selecting texts in training pipelines.
- Handling hallucinated baselines is non-trivial—synthetic texts can inflate apparent novelty.
Future Work
- Extend to other domains beyond Venice (e.g. science, law, medicine).
- Refine baseline dataset creation to reduce hallucination artifacts.
- Explore fine-tuning + KG reduction as a measure of effective learning.
- Integrate into RAG frameworks to dynamically decide what to ingest.
References
- Hartmann et al., SoK: Memorization in LLMs (2023)
- Farquhar et al., Detecting hallucinations using semantic entropy (2024)
- Lin et al., Judging the Judges: Positional Bias in LLMs (2024)
- Lee et al., NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models (2024)