Books to Blockbusters: Data-Driven Advice for Adaptations
Dec 20, 2024
Cover image generated with OpenAI's GPT-5 model.
- Matched uplift: +$10.7M revenue (median pairwise diff) compared to non-adaptations
- Win rate: 58% of book-based films beat matched non-adaptations
- Dataset: ~7.6k films, 659 book adaptations; 392 book–movie summary pairs
- Grade: 5.25 / 6
TL;DR
- Compared book-based films (“Bobs”) with non-book films (“Nobs”) via propensity score matching to control for confounders.
- Found Bobs win 58% of matched pairs with +$10.7M median revenue uplift.
- Key drivers: vote count, budget, popularity, runtime; adventure/thriller genres help.
- For choosing & adapting: prefer series and books with a high 5★ share, and stay close to the book (plot similarity matters).
At a glance
- Role: Data scientist (team project)
- Timeline: Oct–Dec 2024 (8 weeks)
- Context: EPFL Applied Data Analysis course project
- Users/Stakeholders: Film producers, directors, story analysts
- Scope: Matching design, modeling (LR/RF), SHAP explainability, interactive storytelling
Problem
Studios must decide whether to adapt a book—and how. Headline revenue is influenced by many confounders (budget, popularity, genres, release year). A naïve Bob vs Nob average is misleading. We need a like-for-like comparison and actionable heuristics for book selection and adaptation fidelity.
Solution overview
A two-part analysis + interactive data story:
- Do adaptations outperform? Use propensity score matching to compare Bobs to similar Nobs while controlling for key confounders.
- Which choices matter? Model revenue with linear regression & random forests; interpret with SHAP. Add book-movie plot similarity (NLP) to quantify fidelity effects.
Architecture
High level flow: data cleaning & joins (films + books + summaries) → feature engineering (inflation adjustment, genres, popularity, ratings mix) → matching (PSM) for Bob–Nob pairs → modeling (LR/RF + SHAP) → plot-similarity analysis (book vs movie summaries) → interactive story.
Data
- Films: ~7,694 movies with inflation-adjusted revenue/budget, genres, runtime, popularity, vote count/average.
- Books: Goodreads metadata (ratings mix, series flag, genres).
- Adaptation links: Which books map to which films.
- Summaries: 392 book–movie pairs with Wikipedia/IMDb summaries for plot-similarity (text overlap / embedding-based similarity).
- Cleaning: Keep entries with reliable budget & revenue; align time/price via CPI.
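The cleaning step can be illustrated with a minimal sketch. The CPI values, the field names (`budget`, `revenue`, `year`), and the rule that "reliable" means present and positive are all assumptions made for illustration; the project's actual filters and CPI series are not described here.

```python
# Hypothetical CPI index values (illustrative only, not real CPI figures).
CPI_BY_YEAR = {1990: 50.0, 2005: 80.0, 2024: 100.0}


def adjust_for_inflation(amount, year, target_year=2024, cpi=CPI_BY_YEAR):
    """Convert a nominal amount from `year` into `target_year` dollars."""
    if year not in cpi or target_year not in cpi:
        raise KeyError(f"No CPI value for {year} or {target_year}")
    return amount * cpi[target_year] / cpi[year]


def clean_films(films, target_year=2024, cpi=CPI_BY_YEAR):
    """Keep films with positive budget and revenue; add inflation-adjusted fields.

    Assumption: "reliable" is approximated as "present and positive".
    """
    cleaned = []
    for film in films:
        budget, revenue = film.get("budget"), film.get("revenue")
        if not budget or not revenue or budget <= 0 or revenue <= 0:
            continue  # drop rows with missing or non-positive money fields
        row = dict(film)
        row["budget_adj"] = adjust_for_inflation(budget, film["year"], target_year, cpi)
        row["revenue_adj"] = adjust_for_inflation(revenue, film["year"], target_year, cpi)
        cleaned.append(row)
    return cleaned
```

Dividing by the CPI of the release year and multiplying by the CPI of the target year puts every film's budget and revenue in the same dollars, so that older and newer releases can be compared directly.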
Method
Propensity Score Matching (Bobs vs Nobs)
- Goal: Like-for-like revenue comparison.
- Confounders in score: vote count, release year, budget, runtime, popularity, vote average, adventure genre, genre count.
- Matching: Nearest-neighbor within caliper; evaluate pairwise revenue difference.
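As a rough sketch of the matching step only: the function below pairs each Bob with its closest unmatched Nob by propensity score and discards pairs farther apart than the caliper. It assumes the scores have already been estimated (for example with a logistic regression on the confounders above). The function name, the greedy 1:1 strategy without replacement, and the caliper value of 0.05 are illustrative assumptions, not necessarily what the project used.

```python
def match_within_caliper(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on propensity score, without replacement.

    treated, control: dicts mapping film id -> propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)
    pairs = []
    # Process treated units in a fixed order so the result is deterministic.
    for t_id in sorted(treated):
        if not available:
            break
        t_score = treated[t_id]
        # Closest remaining control by absolute score distance (ties broken by id).
        c_id = min(available, key=lambda c: (abs(available[c] - t_score), c))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is used at most once
    return pairs


# Toy scores (made up): bob_c has no control within the caliper and stays unmatched.
bobs = {"bob_a": 0.30, "bob_b": 0.70, "bob_c": 0.95}
nobs = {"nob_x": 0.31, "nob_y": 0.68, "nob_z": 0.50}
pairs = match_within_caliper(bobs, nobs)
```

The caliper trades sample size for comparability: a tighter caliper drops more Bobs but keeps the remaining pairs closer to like-for-like.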
Modeling & Explainability
- Linear Regression (global coefficients) and Random Forests (nonlinearities + interactions).
- SHAP for feature importance & directionality: budget↑, vote count↑, popularity↑, runtime↑; certain genres (adventure/thriller) positive.
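To make the idea behind SHAP concrete, here is a small sketch that computes exact Shapley values for a toy revenue function by enumerating feature coalitions. It does not use the `shap` library, which relies on faster model-specific algorithms. The toy model, its coefficients, the feature names, and the all-zero baseline are invented for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance `x` relative to `baseline`.

    Features outside a coalition take their baseline value.
    Cost grows exponentially with the number of features, so this is for toy cases only.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        point = {f: (x[f] if f in coalition else baseline[f]) for f in names}
        return predict(point)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for k in range(len(others) + 1):
            # Shapley weight for coalitions of size k that exclude f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_revenue(p):
    """Hypothetical model with an interaction between budget and the adventure flag."""
    return 2.0 * p["budget"] + 0.5 * p["vote_count"] + p["budget"] * p["adventure"]


x = {"budget": 10.0, "vote_count": 4.0, "adventure": 1.0}
baseline = {"budget": 0.0, "vote_count": 0.0, "adventure": 0.0}
phi = shapley_values(toy_revenue, x, baseline)
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split between `budget` and `adventure`. That splitting of interaction effects is something a single linear coefficient cannot express.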
Adaptation Fidelity (Books ↔ Movies)
- Similarity between book & movie summaries as a predictor alongside book/movie features.
- Finding: Higher plot similarity is associated with higher revenue, and similarity is typically a stronger predictor than most standalone book features.
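A simple text-overlap similarity can be sketched as cosine similarity over word counts. This is a stand-in for illustration: the tokenizer and the bag-of-words representation are assumptions, and an embedding-based similarity would replace the count vectors with dense vectors while keeping the same cosine formula.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors, in [0, 1]."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty summary has no defined direction
    return dot / (norm_a * norm_b)


# Made-up summaries for illustration.
book = "A young wizard discovers a hidden school and an old enemy."
movie = "A young wizard enrolls in a hidden school and faces an old enemy."
score = cosine_similarity(book, movie)
```

The resulting score can then be added as one more feature next to the book and movie features in the revenue models.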
Experiments & Results
Matched uplift (Bobs vs Nobs)
| Metric | Result |
|---|---|
| Win rate (Bobs > Nobs) | 58% |
| Median uplift (revenue diff) | +$10.7M |
| Significance | p < 0.001 |
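The metrics in the table can be computed from matched pairs roughly as follows. The write-up does not state which test produced the p-value, so the two-sided sign test used here is an assumption, and the revenue numbers are made up.

```python
from math import comb
from statistics import median


def summarize_pairs(pairs):
    """Summarize matched (bob_revenue, nob_revenue) pairs.

    Returns the win rate, the median pairwise difference, and a two-sided
    sign-test p-value (assumed test; ties are excluded from the test).
    """
    if not pairs:
        raise ValueError("need at least one matched pair")
    diffs = [bob - nob for bob, nob in pairs]
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    n = wins + losses
    k = max(wins, losses)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return {
        "win_rate": wins / len(diffs),
        "median_diff": median(diffs),
        "sign_test_p": min(1.0, 2 * tail),
    }


# Toy revenues in millions of USD (illustrative only).
summary = summarize_pairs([(120, 100), (90, 95), (200, 150), (80, 60)])
```

Using the median of pairwise differences rather than the difference of group means keeps a handful of blockbusters from dominating the estimate.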
Drivers of revenue (modeling snapshot)
- Top drivers: vote count, budget, popularity, runtime; adventure/thriller genres.
- Books → film: a higher 5★ share, a lower 3★ share, and being part of a series are predictive of success.
- Adaptation: Adaptations with higher plot similarity tend to outperform those that diverge creatively.
Evaluation protocol
- Inflation-adjusted USD; holdout validation for models; SHAP summary & dependence plots on top features.
- Matching diagnostics: balance checks on confounders pre/post matching.
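One common way to run the balance checks is the standardized mean difference (SMD) for each confounder, computed before and after matching. The sketch below assumes that diagnostic; the write-up does not say which statistic was used, and the sample values are made up.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """SMD = (mean_t - mean_c) / sqrt((var_t + var_c) / 2).

    Each group needs at least two values so the sample variance is defined.
    """
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        return 0.0  # both groups are constant; treat as balanced
    return (mean(treated) - mean(control)) / pooled_sd


# Toy budgets in millions of USD (illustrative only).
smd_before = standardized_mean_difference([60, 80, 100], [20, 40, 60])
smd_after = standardized_mean_difference([60, 80, 100], [58, 82, 101])
```

A widely used rule of thumb treats an absolute SMD below about 0.1 as adequately balanced, so the check is that each confounder moves toward zero after matching.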
Product & UX
- Interactive data story with quizzes, searchable plots, and matched-pair explorer.
- Explorables: revenue/time, budget vs revenue, SHAP visualizations, similarity comparisons.
- Direct link: Explore the live site.
System & Operations
- Stack: Python (pandas, scikit-learn), SHAP, D3/Plotly for interactive visuals.
- Reproducibility: Deterministic matching seeds; CPI normalization for money.
- Deployment: Static site hosting (GitHub Pages) with bundled assets/plots.
Impact
- Provides evidence-based guidance for producers: when to adapt and how to adapt.
- Reconciles business metrics (revenue) with creative choices (fidelity).
- Offers a reusable template for other IP-to-film analyses (games, comics).
What I learned
- Causal framing (matching) is critical to avoid misleading averages.
- Model + SHAP tells a more complete story than coefficients alone.
- Text similarity is a practical proxy for adaptation fidelity at scale.
- Communicating results via an interactive narrative improves adoption.
Future Work
- Add profitability (ROI) and marketing spend where available.
- Extend beyond books: games → film, comics → film comparisons.
References
- Rosenbaum & Rubin — Propensity score matching.
- Lundberg & Lee — SHAP explanations.
- IMDb, Goodreads, and Wikipedia/IMDb summaries (data sources).