Recommender System for Books
Dec 13, 2024

- Best model RMSE: 0.82 (held-out ratings)
- Compared ALS, user-kNN, item-kNN, and metadata-enhanced hybrid CF
- Grade: 5.25 / 6
TL;DR
- Tackled the book recommendation problem on an extremely sparse user–item ratings dataset.
- Implemented ALS, user-based CF, item-based CF (with metadata), and a hybrid CF approach.
- Achieved best performance with the hybrid CF + metadata model: RMSE 0.82.
- Showed that metadata can add valuable information to improve recommendations.
At a glance
- Role: ML engineer (team of 2)
- Timeline: Oct–Dec 2024 (7 weeks)
- Context: EPFL Distributed Information Systems course project
- Users/Stakeholders: Students, researchers, and book recommendation engines
- Scope: Modeling, Hyperparameter tuning, Metadata integration, Evaluation
Problem
Recommender systems must predict user preferences despite sparse data: most users rate only one or two items, and new books suffer from cold-start issues. Traditional collaborative filtering (CF) struggles in such settings. The challenge: can we build a scalable recommender that performs well under sparsity, while still personalizing recommendations?
Solution overview
We implemented and compared multiple CF models:
- ALS matrix factorization — baseline latent factor model.
- User-based kNN CF — leverages user similarity for predictions.
- Item-based kNN CF with metadata — enriches similarity with book metadata (subjects, summaries, language).
- Hybrid CF — averages predictions from user- and item-based CF to balance their strengths.
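To make the ALS baseline concrete, here is a minimal dense sketch on a toy matrix. This is illustrative code, not the project implementation (the tuned hyperparameters are listed under Method); the toy sizes and values are made up.

```python
import numpy as np

def als(R, mask, k=2, lam=0.1, iters=20, seed=0):
    """Alternating least squares on a dense ratings matrix.
    R: ratings (0 where missing); mask: 1 where a rating is observed."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    I = lam * np.eye(k)
    for _ in range(iters):
        # Fix V; each user's factors are a ridge-regression solve.
        for u in range(n_users):
            obs = mask[u] == 1
            Vo = V[obs]
            U[u] = np.linalg.solve(Vo.T @ Vo + I, Vo.T @ R[u, obs])
        # Fix U; solve symmetrically for each item's factors.
        for i in range(n_items):
            obs = mask[:, i] == 1
            Uo = U[obs]
            V[i] = np.linalg.solve(Uo.T @ Uo + I, Uo.T @ R[obs, i])
    return U, V

# Toy 3x3 ratings matrix; 0 marks a missing rating.
R = np.array([[5., 3., 0.],
              [4., 0., 1.],
              [1., 1., 5.]])
mask = (R > 0).astype(float)
U, V = als(R, mask, k=2)
pred = U @ V.T  # reconstructed ratings, including the missing cells
```

The alternating structure is what makes ALS scale: each inner solve is a small k×k linear system, independent across users (or items), so the updates parallelize naturally.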
Architecture
- Ratings matrix → split into train/val/test.
- ALS learns latent factors.
- kNN finds nearest neighbors (users or items) via cosine similarity.
- Metadata (ISBN-based features clustered with Sentence-BERT + k-means) enriches item similarity.
- Hybrid CF combines user- and item-CF predictions.
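The neighbor-based prediction step can be sketched with plain cosine similarity over rating columns. This is a toy example (the project used scikit-learn's NearestNeighbors, and the metadata enrichment is omitted here):

```python
import numpy as np

def cosine_sim(M, eps=1e-9):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.maximum(np.linalg.norm(M, axis=1, keepdims=True), eps)
    Mn = M / norms
    return Mn @ Mn.T

def predict_item_knn(R, user, item, k=5):
    """Predict R[user, item] as the similarity-weighted average of
    the user's ratings on the k most similar items they rated."""
    S = cosine_sim(R.T)                  # item-item similarity matrix
    rated = np.flatnonzero(R[user])      # items this user has rated
    rated = rated[rated != item]
    if rated.size == 0:
        return 0.0                       # no signal for this user
    top = rated[np.argsort(S[item, rated])[::-1][:k]]
    w = S[item, top]
    denom = np.abs(w).sum()
    if denom < 1e-9:
        return float(R[user, rated].mean())
    return float(w @ R[user, top] / denom)

# Toy ratings matrix: rows = users, columns = books, 0 = unrated.
R = np.array([[5., 3., 0.],
              [4., 2., 1.],
              [1., 1., 5.]])
pred = predict_item_knn(R, user=0, item=2, k=2)
```

Transposing R before the similarity computation is the only difference between the item-based and user-based variants.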
Data
- Dataset: Sparse user–book ratings matrix (explicit feedback).
- Sparsity: ~99.97% missing values (~100k ratings across ~19k users × ~15k books).
- Metadata: Subjects, summaries, and language from ISBNs. Encoded as categorical features and via clustering on Sentence-BERT embeddings.
- Splits: Train/val/test; validation and test are restricted to users and items that also appear in train.
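For scale, the sparsity figure follows directly from the approximate dataset dimensions quoted above:

```python
# Approximate figures for this dataset.
n_ratings = 100_000
n_users, n_items = 19_000, 15_000

sparsity = 1 - n_ratings / (n_users * n_items)
print(f"{sparsity:.4%} of the ratings matrix is missing")
```

At this density, most user and item rating vectors share almost no co-rated entries, which is exactly the regime where similarity-based CF becomes noisy.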
Method
- ALS: Latent factor decomposition with tuned hyperparameters: 50 latent factors, regularization λ = 0.8, 20 iterations.
- User-based CF: Cosine similarity over user rating vectors; neighborhood size tuned via grid search.
- Item-based CF: Cosine similarity over item rating vectors; metadata clusters added to similarity computation.
- Hybrid CF: Weighted average of user- and item-based predictions; optimized over validation set.
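The hybrid step reduces to choosing a mixing weight between the two CF predictions on the validation set. A minimal sketch of that search, with toy arrays standing in for the real model outputs:

```python
import numpy as np

def tune_hybrid_weight(p_user, p_item, y_val, grid=None):
    """Pick the weight alpha minimising validation RMSE of
    alpha * user-CF + (1 - alpha) * item-CF predictions."""
    grid = np.linspace(0.0, 1.0, 21) if grid is None else grid
    best_alpha, best_err = None, np.inf
    for a in grid:
        blended = a * p_user + (1 - a) * p_item
        err = float(np.sqrt(np.mean((y_val - blended) ** 2)))
        if err < best_err:
            best_alpha, best_err = float(a), err
    return best_alpha, best_err

# Toy validation ratings and the two models' predictions for them.
y_val = np.array([4.0, 3.0, 5.0, 2.0])
p_user = np.array([4.2, 2.7, 4.6, 2.5])
p_item = np.array([3.6, 3.3, 4.9, 1.8])
alpha, err = tune_hybrid_weight(p_user, p_item, y_val)
```

Because the grid includes 0 and 1, the blended model can never do worse on validation than either component alone.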
Experiments & Results
Benchmarks
A value of k = 5 was chosen for kNN models based on validation RMSEs.
| Model Variant | RMSE (test) |
|---|---|
| Hybrid CF + metadata | 0.8242 |
| User-based CF | 0.8252 |
| Item-based CF | 0.8260 |
| Item-based CF + metadata | 0.8256 |
| ALS | 1.1318 |
Evaluation protocol
- Metric: RMSE on held-out ratings.
- Hyperparameter tuning:
  - ALS: number of latent factors, regularization strength, number of iterations.
  - kNN CF: neighborhood size k.
  - Metadata: number of k-means clusters.
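The metric itself is standard RMSE over held-out (user, item, rating) triples; for completeness:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over held-out ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

err = rmse([4, 3, 5], [3.5, 3.0, 4.5])  # per-rating errors: 0.5, 0.0, 0.5
```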
Analysis
- ALS struggled with sparsity (RMSE > 1.1).
- User-based CF outperformed item-based CF slightly.
- Metadata improved item-based similarity but only marginally.
- Hybrid CF consistently yielded the best balance, with RMSE 0.824.
Impact
- Showed that classic CF methods remain strong baselines, even under severe sparsity.
- Demonstrated that metadata integration improves item-based CF, though the gains are small.
- Provided a scalable hybrid method with the strongest performance for this dataset.
What I learned
- Importance of hyperparameter tuning for CF stability.
- How sparsity impacts user- and item-based methods.
- Metadata helps but doesn’t always translate into large RMSE gains.
- Hybrid models are often more robust than single approaches.
Future Work
- Extend hybrid approach with weighted ensembles instead of simple averages.
- Incorporate implicit feedback (clicks, views) in addition to explicit ratings.
- Experiment with neural recommenders (Autoencoders, LightGCN) on this dataset.
- Explore real-world deployment with scalability and latency constraints.
References
- Scikit-learn documentation: NearestNeighbors.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.
- Rafailov et al. (2023): Hybrid CF strategies.