Recommender System for Books
Dec 13, 2024

- Best model RMSE: 0.82 (held-out ratings)
- Compared ALS, user-kNN, item-kNN, and metadata-enhanced hybrid CF
- Grade: 5.25 / 6
TL;DR
- Tackled the book recommendation problem on an extremely sparse user–item ratings dataset.
- Implemented ALS, user-based CF, item-based CF (with metadata), and a hybrid CF approach.
- Achieved best performance with the hybrid CF + metadata model: RMSE 0.82.
- Showed that metadata can add valuable information to improve recommendations.
At a glance
- Role: ML engineer (team of 2)
- Timeline: Oct–Dec 2024 (7 weeks)
- Context: EPFL Distributed Information Systems course project
- Users/Stakeholders: Students, researchers, and book recommendation engines
- Scope: Modeling, Hyperparameter tuning, Metadata integration, Evaluation
Problem
Recommender systems must predict user preferences despite sparse data: most users rate only one or two items, and new books suffer from cold-start issues. Traditional collaborative filtering (CF) struggles in such settings. The challenge: can we build a scalable recommender that performs well under sparsity, while still personalizing recommendations?
Solution overview
We implemented and compared multiple CF models:
- ALS matrix factorization — baseline latent factor model.
- User-based kNN CF — leverages user similarity for predictions.
- Item-based kNN CF with metadata — enriches similarity with book metadata (subjects, summaries, language).
- Hybrid CF — averages predictions from user- and item-based CF to balance their strengths.
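To make the ALS baseline concrete, here is a minimal dense sketch on a toy matrix. This is illustrative code, not the project implementation (the tuned hyperparameters are listed under Method); the toy sizes and values are made up.

```python
import numpy as np

def als(R, mask, k=2, lam=0.1, iters=20, seed=0):
    """Alternating least squares on a dense ratings matrix.
    R: ratings (0 where missing); mask: 1 where a rating is observed."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    I = lam * np.eye(k)
    for _ in range(iters):
        # Fix V; each user's factors are a ridge-regression solve.
        for u in range(n_users):
            obs = mask[u] == 1
            Vo = V[obs]
            U[u] = np.linalg.solve(Vo.T @ Vo + I, Vo.T @ R[u, obs])
        # Fix U; solve symmetrically for each item's factors.
        for i in range(n_items):
            obs = mask[:, i] == 1
            Uo = U[obs]
            V[i] = np.linalg.solve(Uo.T @ Uo + I, Uo.T @ R[obs, i])
    return U, V

# Toy 3x3 ratings matrix; 0 marks a missing rating.
R = np.array([[5., 3., 0.],
              [4., 0., 1.],
              [1., 1., 5.]])
mask = (R > 0).astype(float)
U, V = als(R, mask, k=2)
pred = U @ V.T  # reconstructed ratings, including the missing cells
```

The alternating structure is what makes ALS scale: each inner solve is a small k×k linear system, independent across users (or items), so the updates parallelize naturally.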
Architecture
- Ratings matrix → split into train/val/test.
- ALS learns latent factors.
- kNN finds nearest neighbors (users or items) via cosine similarity.
- Metadata (ISBN-based features clustered with Sentence-BERT + k-means) enriches item similarity.
- Hybrid CF combines user- and item-CF predictions.
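The neighbor-based prediction step can be sketched with plain cosine similarity over rating columns. This is a toy example (the project used scikit-learn's NearestNeighbors, and the metadata enrichment is omitted here):

```python
import numpy as np

def cosine_sim(M, eps=1e-9):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.maximum(np.linalg.norm(M, axis=1, keepdims=True), eps)
    Mn = M / norms
    return Mn @ Mn.T

def predict_item_knn(R, user, item, k=5):
    """Predict R[user, item] as the similarity-weighted average of
    the user's ratings on the k most similar items they rated."""
    S = cosine_sim(R.T)                  # item-item similarity matrix
    rated = np.flatnonzero(R[user])      # items this user has rated
    rated = rated[rated != item]
    if rated.size == 0:
        return 0.0                       # no signal for this user
    top = rated[np.argsort(S[item, rated])[::-1][:k]]
    w = S[item, top]
    denom = np.abs(w).sum()
    if denom < 1e-9:
        return float(R[user, rated].mean())
    return float(w @ R[user, top] / denom)

# Toy ratings matrix: rows = users, columns = books, 0 = unrated.
R = np.array([[5., 3., 0.],
              [4., 2., 1.],
              [1., 1., 5.]])
pred = predict_item_knn(R, user=0, item=2, k=2)
```

Transposing R before the similarity computation is the only difference between the item-based and user-based variants.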
Data
- Dataset: Sparse user–book ratings matrix (explicit feedback).
- Sparsity: ~99.97% missing values (~100k ratings across ~19k users × ~15k books).
- Metadata: Subjects, summaries, and language from ISBNs. Encoded as categorical features and via clustering on Sentence-BERT embeddings.
- Splits: Train/val/test; validation and test are restricted to users and items that also appear in train.
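For scale, the sparsity figure follows directly from the approximate dataset dimensions quoted above:

```python
# Approximate figures for this dataset.
n_ratings = 100_000
n_users, n_items = 19_000, 15_000

sparsity = 1 - n_ratings / (n_users * n_items)
print(f"{sparsity:.4%} of the ratings matrix is missing")
```

At this density, most user and item rating vectors share almost no co-rated entries, which is exactly the regime where similarity-based CF becomes noisy.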
Method
- ALS: Latent factor decomposition with tuned hyperparameters: 50 latent factors, regularization λ = 0.8, 20 iterations.
- User-based CF: Cosine similarity over user rating vectors; neighborhood size tuned via grid search.
- Item-based CF: Cosine similarity over item rating vectors; metadata clusters added to similarity computation.
- Hybrid CF: Weighted average of user- and item-based predictions; optimized over validation set.
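The hybrid step reduces to choosing a mixing weight between the two CF predictions on the validation set. A minimal sketch of that search, with toy arrays standing in for the real model outputs:

```python
import numpy as np

def tune_hybrid_weight(p_user, p_item, y_val, grid=None):
    """Pick the weight alpha minimising validation RMSE of
    alpha * user-CF + (1 - alpha) * item-CF predictions."""
    grid = np.linspace(0.0, 1.0, 21) if grid is None else grid
    best_alpha, best_err = None, np.inf
    for a in grid:
        blended = a * p_user + (1 - a) * p_item
        err = float(np.sqrt(np.mean((y_val - blended) ** 2)))
        if err < best_err:
            best_alpha, best_err = float(a), err
    return best_alpha, best_err

# Toy validation ratings and the two models' predictions for them.
y_val = np.array([4.0, 3.0, 5.0, 2.0])
p_user = np.array([4.2, 2.7, 4.6, 2.5])
p_item = np.array([3.6, 3.3, 4.9, 1.8])
alpha, err = tune_hybrid_weight(p_user, p_item, y_val)
```

Because the grid includes 0 and 1, the blended model can never do worse on validation than either component alone.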
Experiments & Results
Benchmarks
A value of k = 5 was chosen for kNN models based on validation RMSEs.
| Model Variant | RMSE (test) |
|---|---|
| Hybrid CF + metadata | 0.8242 |
| User-based CF | 0.8252 |
| Item-based CF | 0.8260 |
| Item-based CF + metadata | 0.8256 |
| ALS | 1.1318 |
Evaluation protocol
- Metric: RMSE on held-out ratings.
- Hyperparameter tuning:
  - ALS: number of latent factors, regularization strength, number of iterations.
  - kNN CF: neighborhood size k.
  - Metadata: number of k-means clusters.
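The metric itself is standard RMSE over held-out (user, item, rating) triples; for completeness:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over held-out ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

err = rmse([4, 3, 5], [3.5, 3.0, 4.5])  # per-rating errors: 0.5, 0.0, 0.5
```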
Analysis
- ALS struggled with sparsity (RMSE > 1.1).
- User-based CF outperformed item-based CF slightly.
- Metadata improved item-based similarity but only marginally.
- Hybrid CF consistently yielded the best balance, with RMSE 0.824.
Impact
- Showed that classic CF methods remain strong baselines, even under severe sparsity.
- Demonstrated that metadata integration improves item-based CF, though the gains are small.
- Provided a scalable hybrid method with the strongest performance for this dataset.
What I learned
- Importance of hyperparameter tuning for CF stability.
- How sparsity impacts user- and item-based methods.
- Metadata helps but doesn’t always translate into large RMSE gains.
- Hybrid models are often more robust than single approaches.
Future Work
- Extend hybrid approach with weighted ensembles instead of simple averages.
- Incorporate implicit feedback (clicks, views) in addition to explicit ratings.
- Experiment with neural recommenders (Autoencoders, LightGCN) on this dataset.
- Explore real-world deployment with scalability and latency constraints.
References
- Scikit-learn documentation: NearestNeighbors.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.
- Rafailov et al. (2023): Hybrid CF strategies.