Quantifying Three Years of Reading
Since December 2014, I have tracked the books I read in a Google spreadsheet. It recently occurred to me to use this data to quantify how my reading habits have changed over time. This post will use PyMC3 to model my reading habits.
%matplotlib inline
from itertools import product
from matplotlib import dates as mdates
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import scipy as sp
import seaborn as sns
from theano import shared, tensor as tt
sns.set()

SEED = 27432  # from random.org, for reproducibility
First we load the data from the Google Spreadsheet. Conveniently, pandas
can load CSVs from a web link.
GDOC_URI = 'https://docs.google.com/spreadsheets/d/1wNbJv1Zf4Oichj3-dEQXE_lXVCwuYQjaoyU1gGQQqk4/export?gid=0&format=csv'

raw_df = (pd.read_csv(
        GDOC_URI,
        usecols=[
            'Year', 'Month', 'Day',
            'End Year', 'End Month', 'End Day',
            'Headline', 'Text'
        ]
    )
    .dropna(axis=1, how='all')
    .dropna(axis=0))
raw_df.head()
 | Year | Month | Day | End Year | End Month | End Day | Headline | Text |
---|---|---|---|---|---|---|---|---|
0 | 2014 | 12 | 13 | 2014.0 | 12.0 | 23.0 | The Bloody Chamber | Angela Carter, 126 pages |
1 | 2014 | 12 | 23 | 2015.0 | 1.0 | 4.0 | The Last Place on Earth | Roland Huntford, 564 pages |
2 | 2015 | 1 | 24 | 2015.0 | 2.0 | 13.0 | Empire Falls | Richard Russo, 483 pages |
3 | 2015 | 2 | 14 | 2015.0 | 2.0 | 20.0 | Wonder Boys | Michael Chabon, 368 pages |
4 | 2015 | 2 | 25 | 2015.0 | 3.0 | 4.0 | Red State, Blue State, Rich State, Poor State:… | Andrew Gelman, 196 pages |
raw_df.tail()
 | Year | Month | Day | End Year | End Month | End Day | Headline | Text |
---|---|---|---|---|---|---|---|---|
58 | 2017 | 9 | 16 | 2017.0 | 10.0 | 14.0 | Civilization of the Middle Ages | Norman F. Cantor, 566 pages |
59 | 2017 | 10 | 14 | 2017.0 | 10.0 | 16.0 | The Bloody Chamber | Angela Carter, 126 pages |
60 | 2017 | 10 | 16 | 2017.0 | 10.0 | 27.0 | Big Data Baseball | Travis Sawchik, 233 pages |
61 | 2017 | 10 | 27 | 2017.0 | 12.0 | 7.0 | The History of Statistics: The Measurement of … | Stephen M. Stigler, 361 pages |
62 | 2017 | 12 | 8 | 2017.0 | 12.0 | 21.0 | An Arsonist’s Guide to Writers’ Homes in New E… | Brock Clarke, 303 pages |
The spreadsheet is formatted for use with Knight Lab’s excellent TimelineJS package. We transform the data to a more useful format for our purposes.
df = pd.DataFrame({
    'start_date': raw_df.apply(
        lambda s: pd.datetime(
            s['Year'],
            s['Month'],
            s['Day']
        ),
        axis=1
    ),
    'end_date': raw_df.apply(
        lambda s: pd.datetime(
            int(s['End Year']),
            int(s['End Month']),
            int(s['End Day'])
        ),
        axis=1
    ),
    'title': raw_df['Headline'],
    'author': (raw_df['Text']
                     .str.extract('(.*),.*', expand=True)
                     .iloc[:, 0]),
    'pages': (raw_df['Text']
                    .str.extract(r'.*, (\d+) pages', expand=False)
                    .astype(np.int64))
})

df['days'] = (df['end_date']
                .sub(df['start_date'])
                .dt.days)

df = df[[
    'author', 'title',
    'start_date', 'end_date', 'days',
    'pages'
]]
Each row of the dataframe corresponds to a book I have read, and the columns are

- `author`, the book’s author,
- `title`, the book’s title,
- `start_date`, the date I started reading the book,
- `end_date`, the date I finished reading the book,
- `days`, the number of days it took me to read the book, and
- `pages`, the number of pages in the book.
df.head()
 | author | title | start_date | end_date | days | pages |
---|---|---|---|---|---|---|
0 | Angela Carter | The Bloody Chamber | 2014-12-13 | 2014-12-23 | 10 | 126 |
1 | Roland Huntford | The Last Place on Earth | 2014-12-23 | 2015-01-04 | 12 | 564 |
2 | Richard Russo | Empire Falls | 2015-01-24 | 2015-02-13 | 20 | 483 |
3 | Michael Chabon | Wonder Boys | 2015-02-14 | 2015-02-20 | 6 | 368 |
4 | Andrew Gelman | Red State, Blue State, Rich State, Poor State:… | 2015-02-25 | 2015-03-04 | 7 | 196 |
df.tail()
 | author | title | start_date | end_date | days | pages |
---|---|---|---|---|---|---|
58 | Norman F. Cantor | Civilization of the Middle Ages | 2017-09-16 | 2017-10-14 | 28 | 566 |
59 | Angela Carter | The Bloody Chamber | 2017-10-14 | 2017-10-16 | 2 | 126 |
60 | Travis Sawchik | Big Data Baseball | 2017-10-16 | 2017-10-27 | 11 | 233 |
61 | Stephen M. Stigler | The History of Statistics: The Measurement of … | 2017-10-27 | 2017-12-07 | 41 | 361 |
62 | Brock Clarke | An Arsonist’s Guide to Writers’ Homes in New E… | 2017-12-08 | 2017-12-21 | 13 | 303 |
Modeling
We will model the number of days it takes me to read a book using count regression models based on the number of pages. It would also be reasonable to analyze this data using survival models.
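For concreteness, here is a minimal sketch of what that alternative might look like (an illustration under my own assumptions, not part of the analysis below): a Weibull "time to finish" model with the standardized log page count shifting the scale, and no censoring, since every book in the data was eventually finished.

# Hypothetical sketch of the survival-model alternative (not used below).
# Assumes every duration is at least one day.
x_pages = np.log(df['pages'].values)
x_pages = (x_pages - x_pages.mean()) / x_pages.std()

with pm.Model() as survival_model:
    β0 = pm.Normal('β0', 0., 5.)
    β_pages = pm.Normal('β_pages', 0., 5.)
    η = tt.exp(β0 + β_pages * x_pages)  # per-book Weibull scale

    k = pm.Lognormal('k', 0., 5.)       # shared Weibull shape

    days_obs = pm.Weibull('days_obs', alpha=k, beta=η,
                          observed=df['days'].values)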
Negative binomial regression
While Poisson regression is perhaps the simplest count regression model, we see that these data are fairly overdispersed
df['days'].var() / df['days'].mean()
14.466643655077199
so negative binomial regression is more appropriate. We further verify that negative binomial regression is appropriate by plotting the logarithm of the number of pages versus the logarithm of the number of days it took me to read the book, since the logarithm is the link function for the negative binomial GLM.
fig, ax = plt.subplots(figsize=(8, 6))

df.plot(
    'pages', 'days',
    s=40, kind='scatter',
    ax=ax
);

ax.set_xscale('log');
ax.set_xlabel("Number of pages");

ax.set_yscale('log');
ax.set_ylim(top=1.1e2);
ax.set_ylabel("Number of days to read");
This approximately linear relationship confirms the suitability of a negative binomial model.
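As a rough numerical companion to the plot (a quick check, not in the original analysis), we can estimate the slope and correlation of the log-log relationship directly:

# Least-squares slope and Pearson correlation on the log-log scale.
log_p, log_d = np.log(df['pages']), np.log(df['days'])
np.polyfit(log_p, log_d, 1)[0], np.corrcoef(log_p, log_d)[0, 1]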
Now we introduce some notation. Let \(y_i\) be the number of days it took me to read the \(i\)-th book and \(x^{\textrm{pages}}_i\) be the (standardized) logarithm of the number of pages in the \(i\)-th book. Our first model is
\[ \begin{align*} \beta^0, \beta^{\textrm{pages}} & \sim N(0, 5^2) \\ \theta_i & = \beta^0 + \beta^{\textrm{pages}} \cdot x^{\textrm{pages}}_i \\ \mu_i & = \exp({\theta_i}) \\ \alpha & \sim \operatorname{Lognormal}(0, 5^2) \\ y_i - 1 & \sim \operatorname{NegativeBinomial}(\mu_i, \alpha). \end{align*} \]
This model is expressed in PyMC3 below.
days = df['days'].values

pages = df['pages'].values
pages_ = shared(pages)

with pm.Model() as nb_model:
    β0 = pm.Normal('β0', 0., 5.)
    β_pages = pm.Normal('β_pages', 0., 5.)

    log_pages = tt.log(pages_)
    log_pages_std = (log_pages - log_pages.mean()) / log_pages.std()

    θ = β0 + β_pages * log_pages_std
    μ = tt.exp(θ)

    α = pm.Lognormal('α', 0., 5.)

    days_obs = pm.NegativeBinomial('days_obs', μ, α, observed=days - 1)
We now sample from the model’s posterior distribution.
NJOBS = 3
SAMPLE_KWARGS = {
    'njobs': NJOBS,
    'random_seed': [SEED + i for i in range(NJOBS)]
}

with nb_model:
    nb_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [α_log__, β_pages, β0]
100%|██████████| 1000/1000 [00:05<00:00, 190.41it/s]
We check a few convergence diagnostics. The BFMI and energy distributions for our samples show no cause for concern.
ax = pm.energyplot(nb_trace)

bfmi = pm.bfmi(nb_trace)
ax.set_title(f"BFMI = {bfmi:.2f}");
The Gelman-Rubin statistics indicate that the chains have converged.
max(np.max(gr_stats) for gr_stats in pm.gelman_rubin(nb_trace).values())
1.0003248725601743
We use the posterior samples to make predictions so that we can examine residuals.
with nb_model:
    nb_pred_trace = pm.sample_ppc(nb_trace)

nb_pred_days = nb_pred_trace['days_obs'].mean(axis=0)
100%|██████████| 500/500 [00:00<00:00, 1114.32it/s]
Since the mean and variance of the negative binomial distribution are related, we use standardized residuals to untangle this relationship.
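Specifically, under the \((\mu, \alpha)\) parametrization used by PyMC3’s negative binomial distribution, the variance grows with the mean,

\[ \operatorname{Var}(y_i - 1) = \mu_i + \frac{\mu_i^2}{\alpha}, \]

so raw residuals are naturally larger for books with longer predicted reading times.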
nb_std_resid = (days - nb_pred_days) / nb_pred_trace['days_obs'].std(axis=0)
We visualize these standardized residuals below.
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(nb_pred_days, nb_std_resid);

ax.hlines(
    sp.stats.norm.isf([0.975, 0.025]),
    0, 75,
    linestyles='--', label="95% confidence band"
);

ax.set_xlim(0, 75);
ax.set_xlabel("Predicted number of days to read");

ax.set_ylabel("Standardized residual");

ax.legend(loc=1);
If the model is correct, approximately 95% of the residuals should lie between the dotted horizontal lines, and indeed most residuals are in this band.
We also plot the standardized residuals against the number of pages in the book, and notice no troubling patterns.
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(df['pages'], nb_std_resid);

ax.hlines(
    sp.stats.norm.isf([0.975, 0.025]),
    0, 900,
    linestyles='--', label="95% confidence band"
);

ax.set_xlim(0, 900);
ax.set_xlabel("Number of pages");

ax.set_ylabel("Standardized residual");

ax.legend(loc=1);
We now examine this model’s predictions directly by sampling from the posterior predictive distribution.
PP_PAGES = np.linspace(1, 1000, 300, dtype=np.int64)

pages_.set_value(PP_PAGES)

with nb_model:
    pp_nb_trace = pm.sample_ppc(nb_trace, samples=5000)
100%|██████████| 5000/5000 [00:06<00:00, 825.67it/s]
fig, ax = plt.subplots(figsize=(8, 6))

ALPHA = 0.05

low, high = np.percentile(
    pp_nb_trace['days_obs'],
    [100 * ALPHA / 2, 100 * (1 - ALPHA / 2)],
    axis=0
)

ax.fill_between(
    PP_PAGES, low, high,
    alpha=0.35,
    label=f"{1 - ALPHA:.0%} credible interval"
);
ax.plot(
    PP_PAGES, pp_nb_trace['days_obs'].mean(axis=0),
    label="Posterior expected value"
);

df.plot(
    'pages', 'days',
    s=40, c='k',
    kind='scatter',
    label="Observed",
    ax=ax
);

ax.set_xlabel("Number of pages");
ax.set_ylabel("Number of days to read");

ax.legend(loc=2);
We see that most of the observations fall within the 95% credible interval. An important feature of negative binomial regression is that the credible intervals expand as the predictions get larger. This feature is reflected in the fact that the predictions are less accurate for longer books.
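To put rough numbers on this widening (a quick check, not part of the original analysis), we can compare the width of the posterior predictive interval at a few page counts:

# Width of the 95% posterior predictive interval at roughly 100, 500, and 1000 pages.
for i in [29, 149, 299]:
    print(f"{PP_PAGES[i]} pages: interval width ≈ {high[i] - low[i]:.0f} days")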
Book effects
One advantage to working with such a personal data set is that I can explain the factors that led to certain outliers. Below are the four books that I read at the slowest average rate of pages per day.
(df.assign(pages_per_day=df['pages'] / df['days'])
   .nsmallest(4, 'pages_per_day')
   [['title', 'author', 'start_date', 'pages_per_day']])
 | title | author | start_date | pages_per_day |
---|---|---|---|---|
41 | The Handmaid’s Tale | Margaret Atwood | 2016-10-16 | 4.382353 |
48 | The Shadow of the Torturer | Gene Wolf | 2017-03-11 | 7.046512 |
24 | The Song of the Lark | Willa Cather | 2016-02-14 | 7.446429 |
61 | The History of Statistics: The Measurement of … | Stephen M. Stigler | 2017-10-27 | 8.804878 |
Several of these books make sense; I found The Shadow of the Torturer to be an unpleasant slog and The History of Statistics was quite technical and dense. On the other hand, The Handmaid’s Tale and The Song of the Lark were both quite enjoyable, but my time reading them coincided with other notable life events. I was reading The Handmaid’s Tale when certain unfortunate American political developments distracted me for several weeks in November 2016, and I was reading The Song of the Lark when a family member passed away in March 2016.
We modify the negative binomial regression model to include special factors for The Handmaid’s Tale and The Song of the Lark, in order to mitigate the influence of these unusual circumstances on our parameter estimates.
We let
\[ x^{\textrm{handmaid}}_i = \begin{cases} 1 & \textrm{if the } i\textrm{-th book is }\textit{The Handmaid's Tale} \\ 0 & \textrm{if the } i\textrm{-th book is not }\textit{The Handmaid's Tale} \end{cases}, \]
and similarly for \(x^{\textrm{lark}}_i\). We add the terms
\[ \begin{align*} \beta^{\textrm{handmaid}}, \beta^{\textrm{lark}} & \sim N(0, 5^2) \\ \beta^{\textrm{book}}_i & = \beta^{\textrm{handmaid}} \cdot x^{\textrm{handmaid}}_i + \beta^{\textrm{lark}} \cdot x^{\textrm{lark}}_i \\ \theta_i & = \beta^0 + \beta^{\textrm{book}}_i + \beta^{\textrm{pages}} \cdot x^{\textrm{pages}}_i \end{align*} \]
to the model below.
is_lark = (df['title']
             .eq("The Song of the Lark")
             .mul(1.)
             .values)
is_lark_ = shared(is_lark)

is_handmaid = (df['title']
                 .eq("The Handmaid's Tale")
                 .mul(1.)
                 .values)
is_handmaid_ = shared(is_handmaid)
pages_.set_value(pages)

with pm.Model() as book_model:
    β0 = pm.Normal('β0', 0., 5.)

    β_lark = pm.Normal('β_lark', 0., 5.)
    β_handmaid = pm.Normal('β_handmaid', 0., 5.)
    β_book = β_lark * is_lark_ + β_handmaid * is_handmaid_

    β_pages = pm.Normal('β_pages', 0., 5.)

    log_pages = tt.log(pages_)
    log_pages_std = (log_pages - log_pages.mean()) / log_pages.std()

    θ = β0 + β_book + β_pages * log_pages_std
    μ = tt.exp(θ)

    α = pm.Lognormal('α', 0., 5.)

    days_obs = pm.NegativeBinomial('days_obs', μ, α, observed=days - 1)
We now sample from the model’s posterior distribution.
with book_model:
    book_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [α_log__, β_pages, β_handmaid, β_lark, β0]
100%|██████████| 1000/1000 [00:12<00:00, 77.79it/s]
Again, the BFMI, energy plots and Gelman-Rubin statistics indicate convergence.
ax = pm.energyplot(book_trace)

bfmi = pm.bfmi(book_trace)
ax.set_title(f"BFMI = {bfmi:.2f}");
max(np.max(gr_stats) for gr_stats in pm.gelman_rubin(book_trace).values())
1.0012914523533143
We see that the special factors for The Handmaid’s Tale and The Song of the Lark were indeed notable.
pm.forestplot(
    book_trace, varnames=['β_handmaid', 'β_lark'],
    chain_spacing=0.025,
    rhat=False
);
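As a numerical complement to the forest plot (a small addition, not in the original post), we could also tabulate the posteriors of these coefficients:

# Posterior summaries of the two book-specific coefficients.
pm.summary(book_trace, varnames=['β_handmaid', 'β_lark'])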
Again, we calculate the model’s predictions in order to examine standardized residuals.
with book_model:
    book_pred_trace = pm.sample_ppc(book_trace)

book_pred_days = book_pred_trace['days_obs'].mean(axis=0)
100%|██████████| 500/500 [00:01<00:00, 463.09it/s]
book_std_resid = (days - book_pred_days) / book_pred_trace['days_obs'].std(axis=0)
Both standardized residual plots show no cause for concern.
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(book_pred_days, book_std_resid);

ax.hlines(
    sp.stats.norm.isf([0.975, 0.025]),
    0, 120,
    linestyles='--', label="95% confidence band"
);

ax.set_xlim(0, 120);
ax.set_xlabel("Predicted number of days to read");

ax.set_ylabel("Standardized residual");
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(df['pages'], book_std_resid);

ax.hlines(
    sp.stats.norm.isf([0.975, 0.025]),
    0, 900,
    linestyles='--', label="95% confidence band"
);

ax.set_xlim(0, 900);
ax.set_xlabel("Number of pages");

ax.set_ylabel("Standardized residual");
Since we now have two models, we use WAIC to compare them.
compare_df = (pm.compare(
        [nb_trace, book_trace],
        [nb_model, book_model]
    )
    .rename(index={0: 'NB', 1: 'Book'}))
fig, ax = plt.subplots(figsize=(8, 6))

pm.compareplot(
    compare_df,
    insample_dev=False, dse=False,
    ax=ax
);

ax.set_xlabel("WAIC");
ax.set_ylabel("Model");
Since lower WAIC is better, we prefer the model with book effects, although not conclusively.
Again, we examine this model’s predictions directly by sampling from the posterior predictive distribution.
pages_.set_value(PP_PAGES)
is_handmaid_.set_value(np.zeros_like(PP_PAGES))
is_lark_.set_value(np.zeros_like(PP_PAGES))
with book_model:
    pp_book_trace = pm.sample_ppc(book_trace, samples=5000)
100%|██████████| 5000/5000 [00:07<00:00, 640.09it/s]
fig, ax = plt.subplots(figsize=(8, 6))

low, high = np.percentile(
    pp_book_trace['days_obs'],
    [100 * ALPHA / 2, 100 * (1 - ALPHA / 2)],
    axis=0
)

ax.fill_between(
    PP_PAGES, low, high,
    alpha=0.35,
    label=f"{1 - ALPHA:.0%} credible interval"
);
ax.plot(
    PP_PAGES, pp_book_trace['days_obs'].mean(axis=0),
    label="Posterior expected value"
);

df.plot(
    'pages', 'days',
    s=40, c='k',
    kind='scatter',
    label="Observed",
    ax=ax
);

ax.set_xlabel("Number of pages");
ax.set_ylabel("Number of days to read");

ax.legend(loc=2);
The predictions are visually similar to those of the previous model. The plot below compares the two models’ predictions directly.
fig, ax = plt.subplots(figsize=(8, 6))

ax.plot(
    PP_PAGES, pp_nb_trace['days_obs'].mean(axis=0),
    label="Negative binomial model"
);
ax.plot(
    PP_PAGES, pp_book_trace['days_obs'].mean(axis=0),
    label="Book effect model"
);

ax.set_xlabel("Number of pages");
ax.set_ylabel("Number of days to read");

ax.legend(title="Posterior expected value", loc=2);
The predictions are quite similar, with the book effect model predicting slightly shorter durations, which makes sense, as that model explicitly accounts for two books that took me an unusually long time to read.
Timeseries model
We now turn to the goal of this post, quantifying how my reading habits have changed over time. For computational simplicity, we operate on a time scale of weeks. Therefore, for each book, we calculate the number of weeks from the beginning of the observation period to when I started reading it.
t_week = (df['start_date']
            .sub(df['start_date'].min())
            .dt.days
            .floordiv(7)
            .values)
t_week_ = shared(t_week)

n_week = t_week.max() + 1
We let the intercept \(\beta^0\) and the (log standardized) pages coefficient \(\beta^{\textrm{pages}}\) vary over time. We give these time-varying coefficients Gaussian random walk priors,
\[ \begin{align*} \beta^0_t \sim N(\beta^0_{t - 1}, 10^{-2}), \\ \beta^{\textrm{pages}}_t \sim N(\beta^{\textrm{pages}}_{t - 1}, 10^{-2}). \end{align*} \]
The small drift scale of \(10^{-1}\) is justified by the intuition that reading habits should change gradually.
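As a back-of-the-envelope check (my addition, not in the original post), a random walk with step standard deviation \(0.1\) drifts from its starting value by

\[ \operatorname{sd}\left(\beta^0_T - \beta^0_0\right) = 0.1 \sqrt{T} \approx 0.1 \sqrt{160} \approx 1.3 \]

over the roughly 160 weeks of observations, so the coefficients can change appreciably on the log scale over three years while moving only slowly from week to week.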
pages_.set_value(pages)
is_lark_.set_value(is_lark)
is_handmaid_.set_value(is_handmaid)
with pm.Model() as time_model:
    β0 = pm.GaussianRandomWalk(
        'β0', sd=0.1, shape=n_week
    )

    β_lark = pm.Normal('β_lark', 0., 5.)
    β_handmaid = pm.Normal('β_handmaid', 0., 5.)
    β_book = β_lark * is_lark_ + β_handmaid * is_handmaid_

    β_pages = pm.GaussianRandomWalk(
        'β_pages', sd=0.1, shape=n_week
    )

    log_pages = tt.log(pages_)
    log_pages_std = (log_pages - log_pages.mean()) / log_pages.std()

    θ = β0[t_week_] + β_book + β_pages[t_week_] * log_pages_std
    μ = tt.exp(θ)

    α = pm.Lognormal('α', 0., 5.)

    days_obs = pm.NegativeBinomial('days_obs', μ, α, observed=days - 1)
Again, we sample from the model’s posterior distribution.
with time_model:
    time_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [α_log__, β_pages, β_handmaid, β_lark, β0]
100%|██████████| 1000/1000 [03:26<00:00, 4.85it/s]
Again, the BFMI, energy plots, and Gelman-Rubin statistics indicate convergence.
ax = pm.energyplot(time_trace)

bfmi = pm.bfmi(time_trace)
ax.set_title(f"BFMI = {bfmi:.2f}");
max(np.max(gr_stats) for gr_stats in pm.gelman_rubin(time_trace).values())
1.0038733152191415
Once more, we examine the model’s standardized residuals.
with time_model:
    time_pred_trace = pm.sample_ppc(time_trace)

time_pred_days = time_pred_trace['days_obs'].mean(axis=0)
100%|██████████| 500/500 [00:00<00:00, 913.23it/s]
time_std_resid = (days - time_pred_days) / time_pred_trace['days_obs'].std(axis=0)
In general, the standardized residuals are now smaller and fewer are outside of the 95% confidence band.
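One quick way to quantify this claim (a small check, not in the original analysis) is to count the share of standardized residuals that fall outside the band for each model:

# Fraction of standardized residuals beyond the central 95% band, per model.
band = sp.stats.norm.isf(0.025)

for name, resid in [('NB', nb_std_resid),
                    ('Book', book_std_resid),
                    ('Time', time_std_resid)]:
    print(f"{name}: {(np.abs(resid) > band).mean():.1%} outside the band")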
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(time_pred_days, time_std_resid);

ax.hlines(
    sp.stats.norm.isf([0.975, 0.025]),
    0, 120,
    linestyles='--', label="95% confidence band"
);

ax.set_xlim(0, 120);
ax.set_xlabel("Predicted number of days to read");

ax.set_ylabel("Standardized residual");

ax.legend(loc=1);
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(df['pages'], time_std_resid);

ax.hlines(
    sp.stats.norm.isf([0.975, 0.025]),
    0, 900,
    linestyles='--', label="95% confidence band"
);

ax.set_xlim(0, 900);
ax.set_xlabel("Number of pages");

ax.set_ylabel("Standardized residual");

ax.legend(loc=1);
Again, we use WAIC to compare the three models.
compare_df = (pm.compare(
        [nb_trace, book_trace, time_trace],
        [nb_model, book_model, time_model]
    )
    .rename(index={0: 'NB', 1: 'Book', 2: 'Time'}))
fig, ax = plt.subplots(figsize=(8, 6))

pm.compareplot(
    compare_df,
    insample_dev=False, dse=False,
    ax=ax
);

ax.set_xlabel("WAIC");
ax.set_ylabel("Model");
The timeseries model performs marginally worse than the previous model. We proceed since only the timeseries model answers our original question.
We now use the timeseries model to show how the amount of time it takes me to read a book has changed over time, conditioned on the length of the book.
t_grid = np.linspace(
    mdates.date2num(df['start_date'].min()),
    mdates.date2num(df['start_date'].max()),
    n_week
)
PP_TIME_PAGES = np.array([100, 200, 300, 400, 500])
pp_df = (pd.DataFrame(
        list(product(
            np.arange(n_week),
            PP_TIME_PAGES
        )),
        columns=['t_week', 'pages']
    )
    .assign(
        is_handmaid=0,
        is_lark=0
    ))
is_handmaid_.set_value(pp_df['is_handmaid'].values)
is_lark_.set_value(pp_df['is_lark'].values)

t_week_.set_value(pp_df['t_week'].values)
pages_.set_value(pp_df['pages'].values)
with time_model:
    pp_time_trace = pm.sample_ppc(time_trace, samples=10000)
100%|██████████| 10000/10000 [00:12<00:00, 791.11it/s]
pp_df['pp_days'] = pp_time_trace['days_obs'].mean(axis=0)
pp_df['t_plot'] = np.repeat(t_grid, 5)
fig, ax = plt.subplots(figsize=(8, 6))

for grp_pages, grp_df in pp_df.groupby('pages'):
    grp_df.plot(
        't_plot', 'pp_days',
        label=f"{grp_pages} pages",
        ax=ax
    );

ax.set_xlim(t_grid.min(), t_grid.max());
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %Y'));
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=6));
ax.xaxis.label.set_visible(False);

ax.set_ylabel("Predicted number of days to read");
The plot above exhibits a fascinating pattern: according to the timeseries model, I now read shorter books (fewer than approximately 300 pages) slightly faster than I did in 2015, but it takes me twice as long as before to read longer books. The trend for longer books is easier to explain; in the last 12-18 months, I have been doing much more public speaking and blogging than before, which naturally takes time away from reading. The trend for shorter books is a bit harder to explain, but on reflection, I tend to read more purposefully as I approach the end of a book, looking forward to starting a new one. This effect occurs much earlier in shorter books than in longer ones, so it is a plausible explanation for the trend in shorter books.
This post is available as a Jupyter notebook here.