Optimizing LLMs for Education: GPT-2 with Fine-Tuning, DPO, RAG & Quantization
Apr 20, 2024
Cover image generated with OpenAI's GPT-5 model.
Key results
- Accuracy (SciQ MCQA): 33.4% with fine-tuning + RAG
- Model size reduction: 510 MB → 179 MB (−65%)
- Trade-off: RAG improved accuracy but slowed generation by ~16%
- Grade: 5.25 / 6
TL;DR
- Built an AI tutor for EPFL courses by optimizing GPT-2 with fine-tuning, DPO, RAG, and quantization.
- Achieved 33.4% accuracy on SciQ MCQA with fine-tuning + RAG, surpassing the baseline (29.1%).
- Quantized model achieved 65% size reduction with minimal accuracy loss.
- Highlighted trade-offs: RAG boosts accuracy but slows inference.
At a glance
- Role: Team project (4 students)
- Timeline: Feb–Apr 2024 (10 weeks)
- Context: Master’s modern NLP course project (CS-552, EPFL)
- Users/Stakeholders: Students seeking automated scientific Q&A support
- Scope: Data prep, Model fine-tuning, DPO alignment, RAG integration, Quantization, Evaluation
Problem
Students often lack immediate personalized help outside class. Traditional tutoring and office hours do not scale to diverse needs. Can a compact, optimized LLM act as a reliable AI tutor, answering scientific multiple-choice questions with accuracy, efficiency, and explainability?
Solution overview
We optimized GPT-2 to answer scientific questions by combining:
- Fine-tuning on domain-specific datasets.
- Direct Preference Optimization (DPO) to align responses with user preferences.
- Retrieval-Augmented Generation (RAG) to incorporate external context.
- Quantization (GPTQ) for efficient deployment.
Architecture
The system extends GPT-2 with four modular enhancements:
- Fine-tuning: SciQ (MCQA) + ELI5_Category (open QA).
- DPO: Alignment with >26k preference pairs from EPFL course questions.
- RAG: Neural retriever fetches relevant docs → prepended to prompt.
- Quantization: GPTQ reduces weights to 8-bit for deployment efficiency.
Data
- ELI5_Category (open QA): ~92k training samples; long-form scientific answers.
- SciQ (MCQA): ~11.7k training questions; reformatted for multiple-choice evaluation.
- Preference pairs (DPO): 26k+ preference pairs from EPFL students, comparing ChatGPT-generated answers.
- Preprocessing: SciQ reformatted to match ELI5 schema; preference data converted to chosen/rejected pairs.
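For illustration, a minimal sketch of the chosen/rejected conversion, assuming a simple annotation schema (the field names `question`, `answer_a`, `answer_b`, and `preference` are hypothetical; the actual EPFL annotation format differs in detail):

```python
# Map one annotated comparison to the prompt/chosen/rejected format
# expected by Hugging Face's DPOTrainer. Field names are illustrative.
def to_dpo_pair(record: dict) -> dict:
    preferred = record["answer_a"] if record["preference"] == "A" else record["answer_b"]
    rejected = record["answer_b"] if record["preference"] == "A" else record["answer_a"]
    return {
        "prompt": record["question"],
        "chosen": preferred,
        "rejected": rejected,
    }
```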
Method
1. Fine-Tuning
- GPT-2 fine-tuned in two phases:
- Mixed ELI5 + SciQ (general science QA).
- SciQ only (multiple-choice focus).
- Challenge: the model often generated full answers instead of just the answer letter, so we added a BERTScore postprocessing step to map each generation to the closest option (see the sketch below).
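A minimal sketch of that postprocessing step, using the `bert-score` package; the helper name and argument choices are illustrative, not the original codebase:

```python
from bert_score import score

def pick_option(generated_answer: str, options: list[str]) -> int:
    """Return the index of the MCQA option most similar to the generation."""
    # Compare the generated text against every candidate option and keep
    # the option with the highest BERTScore F1.
    _, _, f1 = score(
        cands=[generated_answer] * len(options),
        refs=options,
        lang="en",
        verbose=False,
    )
    return int(f1.argmax())
```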
2. Direct Preference Optimization (DPO)
- Trained on student-generated preference pairs using Hugging Face’s DPOTrainer.
- Alignment improved answer relevance and style, but did not surpass RAG in accuracy.
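A compact sketch of the alignment step with TRL's `DPOTrainer`, assuming prompt/chosen/rejected records like those produced above; hyperparameters are illustrative, and argument names may differ across `trl` versions:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")      # policy to align
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")  # frozen reference
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Toy preference data in the prompt/chosen/rejected format.
pairs = Dataset.from_list([
    {"prompt": "Why is the sky blue?",
     "chosen": "Because shorter wavelengths scatter more (Rayleigh scattering).",
     "rejected": "Because of the ozone layer."},
])

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    beta=0.1,                       # strength of the KL penalty toward the reference
    train_dataset=pairs,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="gpt2-dpo",
        per_device_train_batch_size=2,
        remove_unused_columns=False,
    ),
)
trainer.train()
```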
3. Retrieval-Augmented Generation (RAG)
- Integrated the retriever from Lewis et al. (2021).
- Prepended relevant docs to MCQA prompts.
- Boosted accuracy from 22.7% (fine-tuned only) → 33.4% (fine-tuned + RAG).
- Trade-off: generation slowed by ~16% due to retrieval overhead.
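A sketch of the prompt construction, with `retrieve_passages` standing in for the neural retriever (a hypothetical helper here, not a library call):

```python
def build_rag_prompt(question: str, options: list[str], retrieve_passages) -> str:
    """Prepend top-k retrieved passages to an MCQA prompt."""
    passages = retrieve_passages(question, k=3)               # top-k supporting docs
    context = "\n".join(f"- {p}" for p in passages)
    choices = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n{choices}\n"
        "Answer:"
    )
```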
4. Quantization (GPTQ)
- Post-training quantization reduced model size: 510 MB → 179 MB (−65%).
- Accuracy was unchanged for the fine-tuned model (22.7% vs 22.7%) and only slightly lower when combined with RAG (32.9% vs 33.4%).
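A sketch of the quantization step via the `GPTQConfig` integration in `transformers` (backed by `auto-gptq`); the checkpoint path and calibration dataset are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# "gpt2-finetuned" stands in for our fine-tuned checkpoint (path assumed).
tokenizer = AutoTokenizer.from_pretrained("gpt2-finetuned")
gptq_config = GPTQConfig(bits=8, dataset="c4", tokenizer=tokenizer)

# Loading with a GPTQConfig triggers calibration and weight quantization.
model = AutoModelForCausalLM.from_pretrained(
    "gpt2-finetuned",
    quantization_config=gptq_config,
    device_map="auto",
)
model.save_pretrained("gpt2-finetuned-gptq")  # quantized weights (~179 MB in our case)
```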
Experiments & Results
Benchmarks
| Model Variant | Accuracy (SciQ MCQA) |
|---|---|
| Baseline GPT-2 | 0.291 |
| Baseline + RAG | 0.319 |
| Fine-tuned | 0.227 |
| Fine-tuned + RAG | 0.334 |
| Fine-tuned + DPO | 0.319 |
| Fine-tuned + DPO + RAG | 0.311 |
| Fine-tuned + Quantized | 0.227 |
| Fine-tuned + Quantized + RAG | 0.329 |
Evaluation protocol. Accuracy was compared on the SciQ test split (n = 1000); postprocessing ensured consistent single-choice outputs.
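For reference, a minimal sketch of the evaluation loop; `generate_answer` is a hypothetical wrapper around `model.generate`, and `pick_option` is the BERTScore helper sketched earlier:

```python
from datasets import load_dataset

def evaluate_mcqa(generate_answer, pick_option) -> float:
    sciq_test = load_dataset("sciq", split="test")   # 1,000 questions
    correct = 0
    for ex in sciq_test:
        options = [ex["correct_answer"], ex["distractor1"],
                   ex["distractor2"], ex["distractor3"]]
        generation = generate_answer(ex["question"], options)
        # Map the free-form generation back to one option.
        if options[pick_option(generation, options)] == ex["correct_answer"]:
            correct += 1
    return correct / len(sciq_test)
```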
Speed Trade-off
The following table summarizes per-token generation time (ms/token) on a Google Colab T4 GPU; higher is slower:
| Model | Mean (ms/token) | Std | Slowdown vs baseline |
|---|---|---|---|
| GPT-2 Baseline | 12.41 | 1.65 | — |
| Fine-tuned | 12.68 | 1.77 | +2% |
| Fine-tuned + RAG | 14.53 | 1.79 | +16% |
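A rough sketch of how ms/token can be measured; this is illustrative, not the exact benchmarking script:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda").eval()

prompt = "Question: Why is the sky blue?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{elapsed_ms / new_tokens:.2f} ms/token")
```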
Impact
- Demonstrated feasibility of small optimized LLMs as educational assistants.
- Provided accuracy improvements over baseline with modest compute.
- Highlighted trade-offs (accuracy vs speed, quantization vs performance).
What I learned
- Combining retrieval with small models can yield outsized gains.
- Fine-tuning small models without retrieval may degrade performance.
- Postprocessing (BERTScore) was essential for stable MCQA evaluation.
- Quantization is powerful but must be tested with each augmentation.
- Larger base models are needed to raise the accuracy ceiling.
Future Work
- Extend to multilingual support for EPFL’s diverse student body.
- Explore multimodal models.
- Test larger base models (GPT-NeoX, LLaMA) for higher ceilings.
- Deploy prototype in controlled student environments.
References
- Lewis et al. (2021) — Retrieval-Augmented Generation.
- Rafailov et al. (2023) — Direct Preference Optimization.
- Frantar et al. (2023) — GPTQ Quantization.
- Welbl et al. (2017) — SciQ dataset.
- Gao et al. (2021) — ELI5_Category dataset.