Books to Blockbusters: Data-Driven Advice for Adaptations

Dec 20, 2024

Matched uplift: +$10.7M revenue (median pairwise diff) compared to non-adaptations
Win rate: 58% of book-based films beat matched non-adaptations
Dataset: ~7.6k films, 659 book adaptations; 392 book–movie summary pairs
Grade: 5.25 / 6

TL;DR

Compared book-based films (“Bobs”) vs non-book films (“Nobs”) via propensity score matching to control confounders.
Found Bobs win 58% of matched pairs with +$10.7M median revenue uplift.
Key drivers: vote count, budget, popularity, runtime; adventure/thriller genres help.
For choosing & adapting: prefer series, high 5★ share, and stay close to the book (plot similarity matters).

At a glance

Role: Data scientist (team project)
Timeline: Oct–Dec 2024 (8 weeks)
Context: EPFL Applied Data Analysis course project
Users/Stakeholders: Film producers, directors, story analysts
Scope: Matching design, modeling (LR/RF), SHAP explainability, interactive storytelling

Studios must decide whether to adapt a book, and how. Revenue depends on many confounders (budget, popularity, genres, release year), so a naïve Bob vs Nob average is misleading. We need a like-for-like comparison and actionable heuristics for book selection and adaptation fidelity.

Solution overview

A two-part analysis + interactive data story:

Do adaptations outperform? Use propensity score matching to compare Bobs to similar Nobs controlling for key confounders.
Which choices matter? Model revenue with linear regression & random forests; interpret with SHAP. Add book-movie plot similarity (NLP) to quantify fidelity effects.

Architecture

High-level flow: data cleaning & joins (films + books + summaries) → feature engineering (inflation adjustment, genres, popularity, ratings mix) → matching (PSM) for Bob–Nob pairs → modeling (LR/RF + SHAP) → plot-similarity analysis (book vs movie summaries) → interactive story.

Data

Films: ~7,694 movies with inflation-adjusted revenue/budget, genres, runtime, popularity, vote count/average.
Books: Goodreads metadata (ratings mix, series flag, genres).
Adaptation links: Which books map to which films.
Summaries: 392 book–movie pairs with Wikipedia/IMDb summaries for plot-similarity (text overlap / embedding-based similarity).
Cleaning: Keep entries with reliable budget & revenue; align time/price via CPI.
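
The CPI alignment step can be sketched as follows. This is a minimal illustration, not the project's code: the CPI values, the base year, the field names, and the helper `adjust_for_inflation` are all assumptions made for the example.

```python
# Hypothetical CPI index values, for illustration only (not real CPI data).
CPI_BY_YEAR = {1990: 50.0, 2005: 80.0, 2024: 100.0}


def adjust_for_inflation(amount, year, base_year=2024, cpi=CPI_BY_YEAR):
    """Convert a nominal amount from `year` into `base_year` dollars."""
    if year not in cpi or base_year not in cpi:
        raise KeyError(f"No CPI value for {year} or {base_year}")
    return amount * cpi[base_year] / cpi[year]


# Hypothetical films with nominal budget and revenue in USD.
films = [
    {"title": "Film A", "year": 1990, "budget": 10e6, "revenue": 40e6},
    {"title": "Film B", "year": 2005, "budget": 20e6, "revenue": 60e6},
]

# Add inflation-adjusted columns so films from different years are comparable.
adjusted = [
    {
        **f,
        "budget_adj": adjust_for_inflation(f["budget"], f["year"]),
        "revenue_adj": adjust_for_inflation(f["revenue"], f["year"]),
    }
    for f in films
]
```

Scaling by the ratio of base-year CPI to release-year CPI expresses every amount in the same year's dollars, which is what makes revenue differences across decades meaningful.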

Method

Propensity Score Matching (Bobs vs Nobs)

Goal: Like-for-like revenue comparison.
Confounders in score: vote count, release year, budget, runtime, popularity, vote average, adventure genre, genre count.
Matching: Nearest-neighbor within caliper; evaluate pairwise revenue difference.
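
A minimal sketch of nearest-neighbor matching within a caliper, assuming propensity scores have already been estimated (for example by a logistic regression of "is book adaptation" on the confounders above). The function name `match_within_caliper`, the greedy 1:1 strategy, matching without replacement, and the caliper of 0.05 are illustrative assumptions; the project may have used different settings.

```python
def match_within_caliper(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on propensity scores.

    treated, control: dicts mapping film id -> propensity score.
    Returns a list of (treated_id, control_id) pairs. Treated films with no
    control inside the caliper are left unmatched.
    """
    available = dict(control)
    pairs = []
    # Iterate in a fixed (sorted) order so the result is deterministic.
    for t_id, t_score in sorted(treated.items()):
        if not available:
            break
        # Closest remaining control; ties are broken by id for determinism.
        c_id = min(available, key=lambda c: (abs(available[c] - t_score), c))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # match without replacement
    return pairs


# Hypothetical propensity scores (not project data).
bobs = {"bob_1": 0.62, "bob_2": 0.30, "bob_3": 0.95}
nobs = {"nob_1": 0.60, "nob_2": 0.33, "nob_3": 0.10, "nob_4": 0.58}
pairs = match_within_caliper(bobs, nobs, caliper=0.05)
```

In this toy input, `bob_3` has no Nob within the caliper and is dropped, which illustrates the trade-off: a tighter caliper gives closer pairs but fewer of them.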

Modeling & Explainability

Linear Regression (global coefficients) and Random Forests (nonlinearities + interactions).
SHAP for feature importance & directionality: budget↑, vote count↑, popularity↑, runtime↑; certain genres (adventure/thriller) positive.
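
To make the idea of SHAP directionality concrete, here is a sketch using the closed form for linear models: when features are treated as independent, the SHAP value of feature j for a given film is the coefficient times the feature's deviation from its mean. This is a swapped-in simplification, not the project's pipeline, which applied SHAP to the fitted models. The data is synthetic and the feature names are an illustrative subset.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["budget", "vote_count", "runtime"]  # illustrative subset
X = rng.normal(size=(200, 3))
# Synthetic target with known positive effects, for illustration only.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, beta = coef[0], coef[1:]

# Linear SHAP (features treated as independent):
# phi_ij = beta_j * (x_ij - mean_j).
shap_values = beta * (X - X.mean(axis=0))

# Global importance: mean absolute SHAP value per feature.
importance = np.abs(shap_values).mean(axis=0)
ranking = [feature_names[j] for j in np.argsort(-importance)]

# Local accuracy: contributions plus the mean prediction recover each prediction.
predictions = A @ coef
assert np.allclose(shap_values.sum(axis=1) + predictions.mean(), predictions)
```

The sign of each SHAP value gives direction (a film with above-average budget gets a positive budget contribution when the coefficient is positive), and the mean absolute value gives a global ranking. For random forests there is no such closed form, which is where a tree-specific explainer comes in.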

Adaptation Fidelity (Books ↔ Movies)

Similarity between book & movie summaries as a predictor alongside book/movie features.
Finding: Higher plot similarity is typically associated with higher revenue, and it is a stronger predictor than most standalone book features.
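
One simple way to score the similarity between two summaries is cosine similarity over word counts, a text-overlap measure. The sketch below is an assumption about how such a score could be computed, not the project's implementation: the tokenizer, the function names, and the example summaries are all hypothetical, and an embedding-based score would replace the count vectors with dense vectors.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between term-count vectors of two texts (0 to 1)."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


# Hypothetical summaries (not project data).
book = "A young wizard discovers a hidden school and confronts a dark lord."
faithful = "A young wizard finds a hidden school and fights a dark lord."
loose = "Two detectives chase a jewel thief across the city."

sim_faithful = cosine_similarity(book, faithful)
sim_loose = cosine_similarity(book, loose)
```

Raw counts let common words such as "a" inflate the score; TF-IDF weighting or stop-word removal is a usual refinement before using the score as a model feature.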

Experiments & Results

Matched uplift (Bobs vs Nobs)

Metric	Result
Win rate (Bobs > Nobs)	58%
Median uplift (revenue diff)	+$10.7M
Significance	p < 0.001
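
The three metrics in the table can be computed from the per-pair revenue differences. The sketch below uses an exact two-sided sign test for the p-value; the write-up does not say which test produced p < 0.001, so the choice of test is an assumption, as are the function name `summarize_pairs` and the example numbers.

```python
import math
import statistics


def summarize_pairs(diffs):
    """Win rate, median difference, and a two-sided sign-test p-value.

    diffs: revenue(Bob) - revenue(Nob) for each matched pair.
    The win rate is computed over non-tied pairs.
    """
    nonzero = [d for d in diffs if d != 0]  # ties carry no sign information
    n = len(nonzero)
    if n == 0:
        raise ValueError("need at least one non-tied pair")
    wins = sum(d > 0 for d in nonzero)
    # Exact binomial tail under H0: P(win) = 0.5, doubled for two sides.
    k = max(wins, n - wins)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return {
        "win_rate": wins / n,
        "median_diff": statistics.median(diffs),
        "p_value": min(1.0, 2 * tail),
    }


# Hypothetical pairwise differences in millions of USD (not project data).
diffs = [12.0, -3.5, 8.1, 20.4, -1.2, 5.0, 15.3, -7.8, 9.9, 2.2]
summary = summarize_pairs(diffs)
```

With only ten hypothetical pairs, a 70% win rate is not significant under this test; the same win rate over hundreds of pairs would be, which is why the number of matched pairs matters when reading the headline figures.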

Drivers of revenue (modeling snapshot)

Top drivers: vote count, budget, popularity, runtime; adventure/thriller genres.
Books → film: a higher 5★ share, a lower 3★ share, and belonging to a series are predictive of success.
Adaptation: Films whose plots stay close to the book tend to outperform those that diverge creatively.

Evaluation protocol

Inflation-adjusted USD; holdout validation for models; SHAP summary & dependence plots on top features.
Matching diagnostics: balance checks on confounders pre/post matching.
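
A common way to run the balance check is the standardized mean difference (SMD) of each confounder between Bobs and Nobs, before and after matching. The sketch below assumes that diagnostic; the write-up does not name the statistic used, and the budgets shown are hypothetical.

```python
import math
import statistics


def standardized_mean_difference(treated, control):
    """SMD = (mean_t - mean_c) / sqrt((var_t + var_c) / 2)."""
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    if pooled_sd == 0:
        return 0.0
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd


# Hypothetical budgets in millions of USD (not project data).
bob_budget = [40.0, 55.0, 60.0, 80.0]
nob_budget_all = [5.0, 10.0, 20.0, 30.0, 60.0, 90.0]   # all Nobs, pre-matching
nob_budget_matched = [42.0, 50.0, 63.0, 78.0]          # matched Nobs only

smd_before = standardized_mean_difference(bob_budget, nob_budget_all)
smd_after = standardized_mean_difference(bob_budget, nob_budget_matched)
```

An absolute SMD below roughly 0.1 is a widely used rule of thumb for acceptable balance; running the same check on every confounder in the propensity model shows whether matching did its job.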

Product & UX

Interactive data story with quizzes, searchable plots, and matched-pair explorer.
Explorables: revenue/time, budget vs revenue, SHAP visualizations, similarity comparisons.
Direct link: Explore the live site.

System & Operations

Stack: Python (pandas, scikit-learn), SHAP, D3/Plotly for interactive visuals.
Reproducibility: Deterministic matching seeds; CPI normalization for money.
Deployment: Static site hosting (GitHub Pages) with bundled assets/plots.

Impact

Provides evidence-based guidance for producers: when to adapt and how to adapt.
Reconciles business metrics (revenue) with creative choices (fidelity).
Offers a reusable template for other IP-to-film analyses (games, comics).

What I learned

Causal framing (matching) is critical to avoid misleading averages.
Model + SHAP tells a more complete story than coefficients alone.
Text similarity is a practical proxy for adaptation fidelity at scale.
Communicating results via an interactive narrative improves adoption.

Future Work

Add profitability (ROI) and marketing spend where available.
Extend beyond books: games → film, comics → film comparisons.

References

Rosenbaum & Rubin — Propensity score matching.
Lundberg & Lee — SHAP explanations.
IMDb, Goodreads, and Wikipedia/IMDb summaries (data sources).
