Optimizing LLMs for Education: GPT-2 with Fine-Tuning, DPO, RAG & Quantization
Apr 20, 2024
Cover image generated with OpenAI's GPT-5 model.
Key results
- Accuracy (SciQ MCQA): 33.4% with fine-tuning + RAG
- Model size reduction: 510 MB → 179 MB (−65%)
- Trade-off: RAG improved accuracy but slowed generation by ~16%
- Grade: 5.25 / 6
TL;DR
- Built an AI tutor for EPFL courses by optimizing GPT-2 with fine-tuning, DPO, RAG, and quantization.
- Achieved 33.4% accuracy on SciQ MCQA with fine-tuning + RAG, surpassing the baseline (29.1%).
- Quantized model achieved 65% size reduction with minimal accuracy loss.
- Highlighted trade-offs: RAG boosts accuracy but slows inference.
At a glance
- Role: Team project (4 students)
- Timeline: Feb–Apr 2024 (10 weeks)
- Context: Master’s modern NLP course project (CS-552, EPFL)
- Users/Stakeholders: Students seeking automated scientific Q&A support
- Scope: Data prep, Model fine-tuning, DPO alignment, RAG integration, Quantization, Evaluation
Problem
Students often lack immediate personalized help outside class. Traditional tutoring and office hours do not scale to diverse needs. Can a compact, optimized LLM act as a reliable AI tutor, answering scientific multiple-choice questions with accuracy, efficiency, and explainability?
Solution overview
We optimized GPT-2 to answer scientific questions by combining:
- Fine-tuning on domain-specific datasets.
- Direct Preference Optimization (DPO) to align responses with user preferences.
- Retrieval-Augmented Generation (RAG) to incorporate external context.
- Quantization (GPTQ) for efficient deployment.
Architecture
The system extends GPT-2 with four modular enhancements:
- Fine-tuning: SciQ (MCQA) + ELI5_Category (open QA).
- DPO: Alignment with >26k preference pairs from EPFL course questions.
- RAG: Neural retriever fetches relevant docs → prepended to prompt.
- Quantization: GPTQ reduces weights to 8-bit for deployment efficiency.
Data
- ELI5_Category (open QA): ~92k training samples; long-form scientific answers.
- SciQ (MCQA): ~11.7k training questions; reformatted for multiple-choice evaluation.
- Preference pairs (DPO): 26k+ preference pairs from EPFL students, comparing ChatGPT-generated answers.
- Preprocessing: SciQ reformatted to match ELI5 schema; preference data converted to chosen/rejected pairs.
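For illustration, a minimal sketch of the chosen/rejected conversion, assuming a simple annotation schema (the field names `question`, `answer_a`, `answer_b`, and `preference` are hypothetical; the actual EPFL annotation format differs in detail):

```python
# Map one annotated comparison to the prompt/chosen/rejected format
# expected by Hugging Face's DPOTrainer. Field names are illustrative.
def to_dpo_pair(record: dict) -> dict:
    preferred = record["answer_a"] if record["preference"] == "A" else record["answer_b"]
    rejected = record["answer_b"] if record["preference"] == "A" else record["answer_a"]
    return {
        "prompt": record["question"],
        "chosen": preferred,
        "rejected": rejected,
    }
```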
Method
1. Fine-Tuning
- GPT-2 fine-tuned in two phases:
- Mixed ELI5 + SciQ (general science QA).
- SciQ only (multiple-choice focus).
- Challenge: the model often generated full answers instead of just the answer letter, so we added a BERTScore postprocessing step to map each generation to the closest option (see the sketch below).
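A minimal sketch of that postprocessing step, using the `bert-score` package; the helper name and argument choices are illustrative, not the original codebase:

```python
from bert_score import score

def pick_option(generated_answer: str, options: list[str]) -> int:
    """Return the index of the MCQA option most similar to the generation."""
    # Compare the generated text against every candidate option and keep
    # the option with the highest BERTScore F1.
    _, _, f1 = score(
        cands=[generated_answer] * len(options),
        refs=options,
        lang="en",
        verbose=False,
    )
    return int(f1.argmax())
```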
2. Direct Preference Optimization (DPO)
- Trained on student-generated preference pairs using Hugging Face’s DPOTrainer.
- Alignment improved answer relevance and style, but did not surpass RAG in accuracy.
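A compact sketch of the alignment step with TRL's `DPOTrainer`, assuming prompt/chosen/rejected records like those produced above; hyperparameters are illustrative, and argument names may differ across `trl` versions:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")      # policy to align
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")  # frozen reference
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Toy preference data in the prompt/chosen/rejected format.
pairs = Dataset.from_list([
    {"prompt": "Why is the sky blue?",
     "chosen": "Because shorter wavelengths scatter more (Rayleigh scattering).",
     "rejected": "Because of the ozone layer."},
])

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    beta=0.1,                       # strength of the KL penalty toward the reference
    train_dataset=pairs,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="gpt2-dpo",
        per_device_train_batch_size=2,
        remove_unused_columns=False,
    ),
)
trainer.train()
```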
3. Retrieval-Augmented Generation (RAG)
- Integrated the retriever from Lewis et al. (2021).
- Prepended relevant docs to MCQA prompts.
- Boosted accuracy from 22.7% (fine-tuned only) → 33.4% (fine-tuned + RAG).
- Trade-off: generation slowed by ~16% due to retrieval overhead.
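A sketch of the prompt construction, with `retrieve_passages` standing in for the neural retriever (a hypothetical helper here, not a library call):

```python
def build_rag_prompt(question: str, options: list[str], retrieve_passages) -> str:
    """Prepend top-k retrieved passages to an MCQA prompt."""
    passages = retrieve_passages(question, k=3)               # top-k supporting docs
    context = "\n".join(f"- {p}" for p in passages)
    choices = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n{choices}\n"
        "Answer:"
    )
```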
4. Quantization (GPTQ)
- Post-training quantization reduced model size: 510 MB → 179 MB (−65%).
- Accuracy was unchanged for the fine-tuned model (22.7% vs 22.7%) and only slightly lower when combined with RAG (32.9% vs 33.4%).
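A sketch of the quantization step via the `GPTQConfig` integration in `transformers` (backed by `auto-gptq`); the checkpoint path and calibration dataset are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# "gpt2-finetuned" stands in for our fine-tuned checkpoint (path assumed).
tokenizer = AutoTokenizer.from_pretrained("gpt2-finetuned")
gptq_config = GPTQConfig(bits=8, dataset="c4", tokenizer=tokenizer)

# Loading with a GPTQConfig triggers calibration and weight quantization.
model = AutoModelForCausalLM.from_pretrained(
    "gpt2-finetuned",
    quantization_config=gptq_config,
    device_map="auto",
)
model.save_pretrained("gpt2-finetuned-gptq")  # quantized weights (~179 MB in our case)
```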
Experiments & Results
Benchmarks
| Model Variant | Accuracy (SciQ MCQA) |
|---|---|
| Baseline GPT-2 | 0.291 |
| Baseline + RAG | 0.319 |
| Fine-tuned | 0.227 |
| Fine-tuned + RAG | 0.334 |
| Fine-tuned + DPO | 0.319 |
| Fine-tuned + DPO + RAG | 0.311 |
| Fine-tuned + Quantized | 0.227 |
| Fine-tuned + Quantized + RAG | 0.329 |
Evaluation protocol. Accuracy was compared on the SciQ test split (n = 1000); postprocessing ensured consistent single-choice outputs.
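For reference, a minimal sketch of the evaluation loop; `generate_answer` is a hypothetical wrapper around `model.generate`, and `pick_option` is the BERTScore helper sketched earlier:

```python
from datasets import load_dataset

def evaluate_mcqa(generate_answer, pick_option) -> float:
    sciq_test = load_dataset("sciq", split="test")   # 1,000 questions
    correct = 0
    for ex in sciq_test:
        options = [ex["correct_answer"], ex["distractor1"],
                   ex["distractor2"], ex["distractor3"]]
        generation = generate_answer(ex["question"], options)
        # Map the free-form generation back to one option.
        if options[pick_option(generation, options)] == ex["correct_answer"]:
            correct += 1
    return correct / len(sciq_test)
```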
Speed Trade-off
The following table summarizes per-token generation time (ms/token) on a Google Colab T4 GPU; higher is slower:
| Model | Mean (ms/token) | Std | Slowdown vs baseline |
|---|---|---|---|
| GPT-2 Baseline | 12.41 | 1.65 | — |
| Fine-tuned | 12.68 | 1.77 | +2% |
| Fine-tuned + RAG | 14.53 | 1.79 | +16% |
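A rough sketch of how ms/token can be measured; this is illustrative, not the exact benchmarking script:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda").eval()

prompt = "Question: Why is the sky blue?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{elapsed_ms / new_tokens:.2f} ms/token")
```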
Impact
- Demonstrated feasibility of small optimized LLMs as educational assistants.
- Provided accuracy improvements over baseline with modest compute.
- Highlighted trade-offs (accuracy vs speed, quantization vs performance).
What I learned
- Combining retrieval with small models can yield outsized gains.
- Fine-tuning small models without retrieval may degrade performance.
- Postprocessing (BERTScore) was essential for stable MCQA evaluation.
- Quantization is powerful but must be tested with each augmentation.
- Larger base models are needed to raise the accuracy ceiling.
Future Work
- Extend to multilingual support for EPFL’s diverse student body.
- Explore multimodal models.
- Test larger base models (GPT-NeoX, LLaMA) for higher ceilings.
- Deploy prototype in controlled student environments.
References
- Lewis et al. (2021) — Retrieval-Augmented Generation.
- Rafailov et al. (2023) — Direct Preference Optimization.
- Frantar et al. (2023) — GPTQ Quantization.
- Welbl et al. (2017) — SciQ dataset.
- Gao et al. (2021) — ELI5_Category dataset.