Evaluating the Novelty of Knowledge in Texts with LLMs
Jan 10, 2025

Highlights
- Knowledge Gain (Venice): 0.386 (Llama 3 70B), 0.309 (GPT-4o)
- Knowledge Gain (Wikipedia): 0.267 (Llama 3 70B), 0.208 (GPT-4o)
- Bias mitigation: correct answer rotated through positions A–D (4 runs per MCQ)
- Project Grade: 6 / 6
TL;DR
- Built an MCQ pipeline to assess whether texts add new factual knowledge to an LLM.
- Introduced similarity-based filtering (Jaccard, ROUGE-L, cosine similarity) to control MCQ quality and difficulty.
- Mitigated positional bias by rotating correct answer slots across runs.
- Found the highest Knowledge Gain on Venice books and the lowest on Wikipedia (presumed to be in the training data).
At a glance
- Role: Sole author (semester project)
- Timeline: Sep 2024 – Jan 2025 (4 months)
- Context: EPFL Master Semester Project
- Users/Stakeholders: NLP researchers, dataset curators, retrieval pipeline designers
- My scope: Pipeline design → Implementation → Experiments → Report
Problem
Large Language Models contain vast factual knowledge, but it is unclear which texts add new knowledge rather than merely confirming what the model already knows.
Key challenges:
- Measuring novelty: How to test whether a fact in a text is already encoded in the model?
- Evaluation design: Avoiding bias in generated MCQs (e.g. positional bias).
- Filtering for quality: Ensuring questions test knowledge gaps and not just wording tricks.
The broader question: Can we systematically detect and prioritize texts that provide knowledge gain for an LLM?
Solution overview
I developed a three-stage pipeline:
- MCQ Generation — Split text into chunks; generate multiple-choice questions (MCQs) with GPT-4o.
- MCQ Filtering — Apply overlap-based metrics (Jaccard, ROUGE-L) and cosine similarity (NV-Embed-v2) to filter out trivial or faulty questions.
- Evaluation — Test target LLMs (Llama 3 70B, GPT-4o) twice per MCQ: once with and once without the source chunk as context. Compute Knowledge Gain (KG) as the accuracy lift from adding context.
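As a concrete illustration of the KG definition above, here is a minimal sketch of the computation; the data layout and names are illustrative, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass
class MCQResult:
    """Outcome of one MCQ for one target model, evaluated twice."""
    correct_with_context: bool     # answered correctly when shown the source chunk
    correct_without_context: bool  # answered correctly from parametric knowledge alone


def knowledge_gain(results: list[MCQResult]) -> float:
    """KG = accuracy with context minus accuracy without context.

    A high KG means the text supplied facts the model did not already encode."""
    if not results:
        return 0.0
    n = len(results)
    acc_with = sum(r.correct_with_context for r in results) / n
    acc_without = sum(r.correct_without_context for r in results) / n
    return acc_with - acc_without
```

With this definition, the 0.386 KG for Llama 3 70B on Venice means the with-context accuracy was 38.6 percentage points above the closed-book accuracy.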
Architecture
- Bias mitigation: Each MCQ is evaluated 4×, with the correct answer rotated through positions A–D (see the sketch after this list).
- Datasets:
  - Venice books (private, presumed novel knowledge)
  - Synthetic baseline (LLM-generated)
  - Wikipedia (pre-cutoff, presumed known knowledge)
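The rotation itself is simple; here is a minimal sketch (function and field names are illustrative), where each MCQ yields four evaluation items with the correct answer placed in slot A, B, C, and D:

```python
def rotate_answer_positions(question: str, correct: str, distractors: list[str]) -> list[dict]:
    """Produce 4 variants of one MCQ, moving the correct answer through slots A-D.

    Distractors keep their relative order; only the correct answer's slot changes,
    so any positional preference of the evaluated model averages out over the 4 runs."""
    assert len(distractors) == 3, "expected exactly 3 distractors for an A-D question"
    variants = []
    for slot in range(4):
        options = distractors[:slot] + [correct] + distractors[slot:]
        variants.append({
            "question": question,
            "options": dict(zip("ABCD", options)),
            "correct_letter": "ABCD"[slot],
        })
    return variants
```

Per-MCQ accuracy is then averaged over the four variants, so a model that blindly favours one letter cannot inflate its score.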
Data
- Venice books: Digitized historical sources, unlikely in pretraining corpora.
- Synthetic baseline: Books generated by Llama 3 70B itself to simulate "already-known" text.
- Wikipedia (pre-cutoff): Scraped pre-2023 articles, maximizing overlap with training data.
- Scale: ~1.3k MCQs per dataset.
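For a concrete picture of how each corpus is turned into MCQs (stage 1 of the pipeline), here is a minimal sketch of the chunk-and-generate step; the chunk size, prompt wording, and JSON schema are illustrative assumptions, not the project's actual ones.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MCQ_PROMPT = """From the passage below, write one multiple-choice question about a fact stated
in the passage. Respond as JSON with keys "question", "correct", and "distractors" (a list of 3).

Passage:
{chunk}"""


def chunk_text(text: str, max_words: int = 300) -> list[str]:
    """Split a document into fixed-size word windows (illustrative chunking strategy)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def generate_mcq(chunk: str) -> dict:
    """Ask GPT-4o to produce one MCQ grounded in the given chunk."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": MCQ_PROMPT.format(chunk=chunk)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

Each corpus is processed the same way, yielding the ~1.3k MCQs per dataset noted above.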
Method
- MCQ pipeline: Chunk → generate questions → filter → rotate answers → evaluate twice.
- Similarity filtering (see the sketch after this list):
  - Jaccard & ROUGE-L overlap between the source chunk and the correct answer, to reduce faulty MCQ generations.
  - Cosine similarity (NV-Embed-v2 embeddings) between the correct answer and the distractors, to increase MCQ difficulty.
- Evaluation: Compute performance with/without context → derive Knowledge Gain.
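A minimal sketch of the two filters follows. The thresholds, their direction, and the abstract `embed` callable (standing in for an NV-Embed-v2 wrapper) are assumptions for illustration; ROUGE-L would be applied analogously to the Jaccard check.

```python
import numpy as np


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def keep_mcq(chunk: str, correct: str, distractors: list[str], embed,
             min_grounding: float = 0.1, min_distractor_sim: float = 0.5) -> bool:
    """Illustrative two-stage filter.

    1. Quality: require some lexical overlap between the source chunk and the correct
       answer, so answers not grounded in the chunk (faulty generations) are dropped.
    2. Difficulty: require distractors to be semantically close to the correct answer,
       so the question cannot be solved by spotting the only on-topic option.
    `embed` is any callable mapping a string to a vector (e.g. an NV-Embed-v2 wrapper)."""
    if jaccard(chunk, correct) < min_grounding:
        return False
    c_vec = embed(correct)
    sims = [cosine(c_vec, embed(d)) for d in distractors]
    return min(sims) >= min_distractor_sim
```

The overlap check targets quality, while the distractor-similarity check raises difficulty, which is what drove down the no-context scores reported below.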
Experiments & Results
| Dataset | Llama 3 70B KG | GPT-4o KG |
|---|---|---|
| Venice | 0.386 | 0.309 |
| Baseline | 0.360 | – |
| Wikipedia | 0.267 | 0.208 |
Findings:
- Venice > Baseline > Wikipedia: As expected, the Venice books add the most new knowledge and Wikipedia the least.
- Bias mitigation balanced evaluation across answer positions, avoiding inflated scores.
- Cosine filtering increased difficulty in the no-context runs while leaving with-context performance largely stable.
Impact
- Validated that Knowledge Gain is measurable and tracks true novelty.
- Provides a tool for dataset curation: prioritize texts with high KG for fine-tuning.
- Supports RAG ingestion decisions: ingest texts that models demonstrably lack.
- Contributes to LLM evaluation research, helping disentangle memorization from reasoning.
What I learned
- MCQ pipelines need careful bias controls; positional effects can dominate results.
- Similarity thresholds are powerful levers for dataset quality and difficulty.
- KG is not only a metric but a strategy for selecting texts in training pipelines.
- Handling hallucinated baselines is non-trivial—synthetic texts can inflate apparent novelty.
Future Work
- Extend to other domains beyond Venice (e.g. science, law, medicine).
- Refine baseline dataset creation to reduce hallucination artifacts.
- Explore fine-tuning + KG reduction as a measure of effective learning.
- Integrate into RAG frameworks to dynamically decide what to ingest.
References
- Hartmann et al., SoK: Memorization in LLMs (2023)
- Farquhar et al., Detecting hallucinations using semantic entropy (2024)
- Lin et al., Judging the Judges: Positional Bias in LLMs (2024)
- Lee et al., NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models (2024)