Luca Engel

Evaluating the Novelty of Knowledge in Texts with LLMs

Jan 10, 2025

TL;DR

At a glance

Problem

Large Language Models contain vast factual knowledge, but it is unclear which texts add genuinely new knowledge and which merely confirm what the model already knows.

Key challenges:

  1. Measuring novelty: How can we test whether a fact in a text is already encoded in the model?
  2. Evaluation design: Avoiding bias in generated MCQs (e.g., positional bias).
  3. Filtering for quality: Ensuring questions probe genuine knowledge gaps rather than wording tricks.

The broader question: Can we systematically detect and prioritize texts that provide knowledge gain for an LLM?

Solution overview

I developed a three-stage pipeline (each stage is sketched below the list):

  1. MCQ Generation — Split text into chunks; generate multiple-choice questions (MCQs) with GPT-4o.
  2. MCQ Filtering — Apply overlap-based metrics (Jaccard, ROUGE-L) and cosine similarity (NV-Embed-v2) to filter out trivial or faulty questions.
  3. Evaluation — Test target LLMs (Llama 3 70B, GPT-4o) twice, with and without the source text in context, and compute Knowledge Gain (KG) as the performance lift from adding the context.
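
To make the pipeline concrete, the sketch below illustrates stage 1: splitting a document into chunks and prompting GPT-4o for MCQs. The chunk size, prompt wording, and JSON schema are assumptions for illustration, not the exact setup used in the project.

```python
# Stage 1 (sketch): chunk a text and ask GPT-4o to generate MCQs.
# Chunk size, prompt wording, and the JSON schema are illustrative assumptions.
import json

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def chunk_text(text: str, max_words: int = 300) -> list[str]:
    """Split a document into roughly fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def generate_mcqs(chunk: str, n_questions: int = 3) -> list[dict]:
    """Ask GPT-4o for MCQs grounded in the chunk, returned as a list of dicts."""
    prompt = (
        f"Write {n_questions} multiple-choice questions that can only be answered "
        "using facts stated in the text below. Respond with a JSON object of the "
        'form {"questions": [{"question": ..., "options": [4 strings], '
        '"answer": <the correct option>}]}.\n\n'
        f"Text:\n{chunk}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)["questions"]
```

Requesting a fixed JSON schema keeps the downstream stages simple, since each generated question arrives together with its options and intended answer.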
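
Stage 2 can be sketched with self-contained versions of the listed metrics: token-level Jaccard overlap, an LCS-based ROUGE-L F1, and cosine similarity over embedding vectors (which the pipeline would obtain from NV-Embed-v2; the embedding step is not shown). The `keep_mcq` rule and its thresholds are illustrative assumptions.

```python
# Stage 2 (sketch): filter generated MCQs with overlap and similarity metrics.
# What exactly gets compared and the threshold values are assumptions.
import math


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 via the longest common subsequence of tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, tc in enumerate(c, 1):
        for j, tr in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if tc == tr else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors (e.g. NV-Embed-v2 outputs)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return dot / norm if norm else 0.0


def keep_mcq(question: str, answer: str, answer_emb: list[float],
             distractor_embs: list[list[float]],
             max_leak: float = 0.6, min_distractor_sim: float = 0.3) -> bool:
    """Illustrative filter: drop a question if it lexically leaks its answer,
    or if its distractors are semantically unrelated to the correct option."""
    leak = max(jaccard(question, answer), rouge_l_f1(answer, question))
    plausible = all(cosine(answer_emb, e) >= min_distractor_sim for e in distractor_embs)
    return leak <= max_leak and plausible
```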
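
Stage 3 then poses each retained MCQ to a target model twice, once with the source chunk in context and once without, and reports KG as the accuracy difference. In this sketch, option order is shuffled per query as an assumed guard against positional bias; the prompt format and answer parsing are simplified, and querying Llama 3 70B this way assumes an OpenAI-compatible endpoint.

```python
# Stage 3 (sketch): evaluate a target model on each MCQ with and without the
# source chunk in context, then compute Knowledge Gain (KG) as the accuracy
# difference. Prompt format and answer parsing are simplified; shuffling the
# option order is an assumed guard against positional bias.
import random
import string

from openai import OpenAI

client = OpenAI()


def answers_correctly(model: str, mcq: dict, context: str | None = None) -> bool:
    """Pose one MCQ to the model and check whether it picks the correct option."""
    options = list(mcq["options"])
    random.shuffle(options)  # counter positional bias
    letters = string.ascii_uppercase[: len(options)]
    correct_letter = letters[options.index(mcq["answer"])]

    prompt = f"Context:\n{context}\n\n" if context else ""
    prompt += mcq["question"] + "\n"
    prompt += "\n".join(f"{letter}. {opt}" for letter, opt in zip(letters, options))
    prompt += "\nAnswer with a single letter."

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    reply = response.choices[0].message.content.strip().upper()
    return reply.startswith(correct_letter)


def knowledge_gain(model: str, mcqs: list[dict], chunk: str) -> float:
    """KG = accuracy with the chunk in context minus accuracy without it."""
    with_ctx = sum(answers_correctly(model, q, context=chunk) for q in mcqs)
    without_ctx = sum(answers_correctly(model, q) for q in mcqs)
    return (with_ctx - without_ctx) / len(mcqs)
```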

Architecture

Data

Method

Experiments & Results

Dataset      Llama 3 70B KG   GPT-4o KG
Venice       0.386            0.309
Baseline     0.360
Wikipedia    0.267            0.208

Findings:

Impact

What I learned

Future Work

References