Bayesian Age/Period/Cohort Models in Python with PyMC
For my day job, I spend a lot of time thinking about e-commerce analytics and cohort analysis in particular. Statistical age-period-cohort (APC) models are important in many fields such as epidemiology, demography, marketing, and many more. These models also pose some interesting inferential challenges for the unwary. This post shows how to use pymc
to build Bayesian APC models in Python and presents a series of increasingly sophistocated systems of priors to resolve the inferential challenges these models pose.
First we make the necessary Python imports and do some light housekeeping.
%matplotlib inline
%config InlineBackend.figure_format = "retina"
from warnings import filterwarnings
from aesara import tensor as at
import arviz as az
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import scipy as sp
import seaborn as sns
filterwarnings("ignore", category=UserWarning, module="pymc")
plt.rc("figure", figsize=(8, 6))
sns.set(color_codes=True)
Load the data¶
For illustrative purposes, we use data from the US General Social Survey 1974-2002 in this post. This dataset is available from Vincent Arel-Bundock's excellent Rdatasets
repository and is originally from the AER (Applied Econometrics with R) R package.
DATA_URL = "https://vincentarelbundock.github.io/Rdatasets/csv/AER/GSS7402.csv"
df = (pd.read_csv(DATA_URL, usecols=["kids", "age", "year"])
[["age", "year", "kids"]])
df
Exploratory data analysis¶
From the Rdatasets
site, this data set contains
[c]ross-section data for 9120 women taken from every fourth year of the US General Social Survey between 1974 and 2002 to investigate the determinants of fertility.
The columns of this data frame are
-
age
: the age at which the woman was surveyed, -
year
: the year the woman was surveyed, and -
kids
: the number of children the woman had at the time she was surveyed.
These columns contain all the information we need to build an APC model of how the average number of children has varied as a function of the woman's age, the year, and when she was born. Obviously the age
column corresponds to the "age" component of the APC model, and the year
column corresponds to the "period" component of the APC model. For this data set, the "cohort" component corresponds to they year in which the woman was born, which is simply calculated as the difference between the woman's age and the current year.
df["cohort"] = df["year"] - df["age"]
df
It is this relationship between age
, year
, and cohort
that will make a naively parametrized APC model unidentified, causing inferential challenges.
The following plots show how the average number of childen women varies with age, period (year), and cohort.
fig, axes = plt.subplots(ncols=3, nrows=2, figsize=(12, 4),
gridspec_kw={"height_ratios": (1, 5)})
axes[0, 0].sharex(axes[1, 0]);
axes[0, 1].sharex(axes[1, 1]);
axes[0, 2].sharex(axes[1, 2]);
axes[0, 0].sharey(axes[0, 1]);
axes[0, 1].sharey(axes[0, 2]);
axes[1, 0].sharey(axes[1, 1]);
axes[1, 1].sharey(axes[1, 2]);
# age
axes[0, 0].hist(
df["age"],
bins=np.arange(df["age"].min(), df["age"].max() - 1, 3),
lw=0
);
axes[0, 0].set_ylabel("Number\nof women");
(df.groupby("age")
["kids"]
.mean()
.plot(ax=axes[1, 0]));
axes[1, 0].set_xlabel("Age");
axes[1, 0].set_ylabel("Average number\nof children");
# year
axes[0, 1].hist(
df["year"],
bins=np.arange(df["year"].min(), df["year"].max() - 1),
lw=0
);
(df.groupby("year")
["kids"]
.mean()
.plot(ax=axes[1, 1]));
axes[1, 1].set_xlabel("Year");
# cohort
axes[0, 2].hist(
df["cohort"],
bins=np.arange(df["cohort"].min(), df["cohort"].max() - 1, 5),
lw=0
);
(df.groupby("cohort")
["kids"]
.mean()
.plot(ax=axes[1, 2]));
axes[1, 2].set_xlabel("Cohort (birth year)");
axes[1, 2].set_ylim(0, 4);
fig.tight_layout();
The relationship with age is not surprising: throughout the twenties and thirties the average number of children increases at a relatively stable rate and then levels out from the forties onward. The most notable feature of the period relationship is that it is relatively small compared to the age and cohort relationship. The relationship with cohort is perhaps the most interesting. The year a woman is born determines when she enters the prime chilbearing years of twenty to forty, so the peaks and troughs in the cohort relationship roughly corresponding to economic and societal conditions when the women in that cohort are in prime childbearing years.
GREAT_DEPRESSION = np.array([1929, 1939])
BABY_BOOM = np.array([1946, 1964])
fig, ax = plt.subplots()
ax.axvspan(GREAT_DEPRESSION[0] - 30, GREAT_DEPRESSION[1] - 20,
color="C3", alpha=0.5, label="Great Depression");
ax.axvspan(BABY_BOOM[0] - 30, BABY_BOOM[1] - 20,
color="C2", alpha=0.5, label="Post-WWII Baby Boom");
(df.groupby("cohort")
["kids"]
.mean()
.plot(label="_nolegend_", ax=ax));
ax.set_xlabel("Cohort (birth year)");
ax.set_ylim(0, 4);
ax.set_ylabel("Average number of children");
ax.legend(loc="upper right", title="Chilbearing age during");
Taking a closer look at the cohort relationship, we see that the first trough corresponds to women that would be having most of their children during the Great Depression, and the subsequent peak corresponds to women having children during the post-World War II baby boom.
Taken together, it is reasonable to hypothesize that biological factors related to age and socioeconomic factors related to cohort are sufficient to explain the variation in the number of children without considering period (year). The APC models we build in this post will allow us to evaluate the credibility of this hypothesis.
It is important to note at this point that I am not a demographer and even if I was, it is best not to draw deep demographic conclusions from any analysis conducted on this data set. The purpose of this post is to use this data set as a simple example to illustrate some of the subtleties involved in inference using APC models.
Dispersion¶
Since the number of children is discrete, we being by checking its index of dispersion in order to determine which probability distribution is best suited to model it.
def index_of_dispersion(x):
return x.var() / x.mean()
index_of_dispersion(df["kids"])
We see that the number of children is overdispersed across all responses.
fig, axes = plt.subplots(ncols=3, sharey=True, figsize=(12, 4))
index_of_dispersion(df.groupby("age")["kids"]).plot(label="_nolegend_", ax=axes[0]);
axes[0].axhline(1, c="k", ls="--", label="Unit dispersion (Poisson)");
axes[0].set_xlabel("Age");
axes[0].set_ylabel("Index of dispersion");
axes[0].legend(loc="upper left");
index_of_dispersion(df.groupby("year")["kids"]).plot(ax=axes[1]);
axes[1].axhline(1, c="k", ls="--");
axes[1].set_xlabel("Year");
index_of_dispersion(df.groupby("cohort")["kids"]).plot(ax=axes[2]);
axes[2].axhline(1, c="k", ls="--");
axes[2].set_xlabel("Year");
fig.tight_layout();
The number of children continues to be overdispersed (without about the same index) across each of the APC dimensions, so we will use the overdispersed negative binomial distribution in our models.
Modeling¶
We now start to build APC models of the number of children reported. Recall that if $X \sim \operatorname{NB}(\mu_X, \nu)$ and $Y \sim \operatorname{NB}(\mu_Y, \nu)$ then $X + Y \sim \operatorname{NB}(\mu_X + \mu_Y, \nu).$ Using this fact, we reduce the size of the observed data in our model by grouping by the age/period/cohort combination of each observation, counting the number of women with that combination, and summing their number of children.
apc_df = (df.groupby(["age", "year", "cohort"])
["kids"]
.agg(("size", "sum")))
apc_df
N = apc_df["size"].values
kids = apc_df["sum"].values
Here apc_df["size"]
and N
are the number of women surveyed at that specific age/period/cohort combination and apc_df["sum"]
and kids
are their total number of children.
For convenience we encode age
, year
, and cohort
as ordinal factors.
i, age_map = apc_df.index.get_level_values("age").factorize(sort=True)
j, year_map = apc_df.index.get_level_values("year").factorize(sort=True)
k, cohort_map = apc_df.index.get_level_values("cohort").factorize(sort=True)
Flat priors¶
We are now ready to build our first APC model, which will use flat priors for the age, period, and cohort effects. Throughout this post, let $i$ denote a sample's age, $j$ denote its period, and $k$ denote its cohort. All of our models will take the form
$$\begin{align*} \eta_{ijk} & = \eta_0 + \alpha_i + \beta_j + \gamma_k, \\ y_{ijk} & \sim \operatorname{NB}(N_{ijk} \cdot \exp(\eta_{ijk}), \nu). \end{align*}$$
Here $N_{ijk}$ is the number of women in the relevant age/period/cohort combination, $y_{ijk}$ is the total number of children they have, and $\nu$ controls the amount of overdispersion. Note that this is a negative binomial regression model with an offset term.
Using pymc
we specify the flat prior distributions
$$\pi(\eta_0), \pi(\alpha_i), \pi(\beta_i), \pi(\gamma_i) \propto 1.$$
coords = {
"age": age_map,
"year": year_map,
"cohort": cohort_map
}
SEED = 1234567890 # for reproducibility
with pm.Model(coords=coords, rng_seeder=SEED) as flat_model:
η0 = pm.Flat("η0")
α = pm.Flat("α", dims="age")
β = pm.Flat("β", dims="year")
γ = pm.Flat("γ", dims="cohort")
Using the prior $\nu \sim \operatorname{Half-}N(2.5^2)$, we specify $\eta_{ijk}$ and the observational likelihood as above.
with flat_model:
η = η0 + α[i] + β[j] + γ[k]
ν = pm.HalfNormal("ν", 2.5)
pm.NegativeBinomial("kids", N * at.exp(η), ν, observed=kids)
We are now ready to sample from the posterior distribution of this model.
CHAINS = 6
SAMPLE_KWARGS = {
"cores": CHAINS,
"random_seed": [SEED + i for i in range(CHAINS)]
}
with flat_model:
flat_trace = pm.sample(**SAMPLE_KWARGS)
We see quite a few warnings as a result of sampling, most notably over half of the post-tuning samples resulted in divergences, and the $\hat{R}$ statistics are well above the acceptable diagnostic threshold of 1.01. For simplicty, we focus on the $\hat{R}$ statistics.
flat_rhat = az.rhat(flat_trace)
ax = (flat_rhat.max()
.to_array()
.to_series()
.plot(kind="barh"))
ax.axvline(1, c="k", ls="--");
ax.set_xlabel(r"$\hat{R}$");
ax.invert_yaxis();
ax.set_ylabel(None);
We see that all of the parameters have $\hat{R}$ statistics significantly above one (two are above five) that are indicative of severe problems with inference. Focusing on the component of $\alpha$ with the highest $\hat{R}$ statistic we see that the chains are not very well mixed; they have explored completely different portions of the parameter space.
α_worst = flat_rhat["α"].idxmax()
az.plot_trace(flat_trace, var_names="α", coords={"age": α_worst}, divergences=False);
To quote the folk theorem of statistical computing:
When you have computational problems, often there’s a problem with your model.
With this wisdom in mind, experience indicates that this behavior in the sampler often arises when our model is not identified, so it is time to return to our observation above that the relationship between age
, year
, and cohort
would lead to inference problems.
Lack of identification¶
Briefly, a model is not identified if two different sets of parameters lead to the same (log) likelihood of the observed data. To illustrate the fact that this flat model is not identified, first we evaluate its log likelihood function when $\eta_0 = \alpha_i = \beta_j = \gamma_k = 0$ and $\nu = 1$.
zero_pt = {
"η0": 0,
"α": np.zeros_like(age_map),
"β": np.zeros_like(year_map),
"γ": np.zeros_like(cohort_map),
"ν_log__": 0
}
flat_logp = flat_model.compile_logp()
flat_logp(zero_pt)
Since a woman's cohort (birth year) is the year she was surveyed minus her age at the time of survey, we have the following constraint between age, period and cohort.
all((df["age"] - df["year"] + df["cohort"]) == 0)
Due to this constraint and the linearity of our model,
$$\eta_{ijk} = \eta_0 + \alpha_i + \beta_j + \gamma_k,$$
the following function produces parameters has the same likelihood as those above, given the observed data.
def get_unident_pt(c=1):
return {
"η0": 0,
"α": c * age_map,
"β": -c * year_map,
"γ": c * cohort_map,
"ν_log__": 0
}
flat_logp(get_unident_pt())
We test this assertion for 100 random values of c
in the interval $[-10, 10]$.
rng = np.random.default_rng(SEED)
for c in rng.uniform(-10, 10, size=100):
np.testing.assert_allclose(flat_logp(zero_pt), flat_logp(get_unident_pt(c=c)))
This experiment confirms there are infinitely many parameter values that lead to the same likelihood.
Unidentified models such as the APC model with flat priors lead to both theoretical and practical challenges. From a theoretical perspective, we should be wary of a interpreting a model that produces infinitely many sets of parameters that are equally compatible with our data. From a computational perspective, many inference algorithms struggle with unidentified models (recall the folk theorem of statistical computing). It is important to note that these computational issues are not unique to the Bayesian approach to inference using this APC model. With flat priors, the maximum a posteriori (MAP) estimator for this model is the same as the maximum likelihood estimator (MLE).
with flat_model:
flat_mle, opt_result = pm.find_MAP(return_raw=True)
opt_result.success
We see that scipy
's numerical optimizer has failed to find the MLE/MAP due to the unidentified nature of this model.
Normal priors¶
Regularization) is a common approach to resolving model identification problems. From a Bayesian perspective, regularization is equivalent to using non-flat priors on our parameters $\eta_0$, $\alpha_i$, $\beta_j$, and $\gamma_k$. It is reasonable to start with normally-distributed priors on these parameters. These normal priors are, in fact, equivalent to Tikhonov regularization/ridge regression.
For this model, we let $\eta_0, \alpha_i, \beta_j, \gamma_k \sim N(0, 2.5^2)$.
with pm.Model(coords=coords, rng_seeder=SEED) as norm_model:
η0 = pm.Normal("η0", 0, 2.5)
α = pm.Normal("α", 0, 2.5, dims="age")
β = pm.Normal("β", 0, 2.5, dims="year")
γ = pm.Normal("γ", 0, 2.5, dims="cohort")
Since these normal priors assign the higest probability density to zero, they favor smaller parameter values if all other factors are equal, just as in Tikhonov regularization/ridge regression.
The rest of the model is defined as before.
with norm_model:
η = η0 + α[i] + β[j] + γ[k]
ν = pm.HalfNormal("ν", 2.5)
pm.NegativeBinomial("kids", N * at.exp(η), ν, observed=kids)
We now sample from the posterior distribution of this model.
with norm_model:
norm_trace = pm.sample(**SAMPLE_KWARGS)
We see that overall there are fewer warnings than when we sampled from the flat model, but there are still hundreds of divergences. The $\hat{R}$ statistics are much closer to their ideal value of one, but still larger than 1.01, so we should continue to be a skeptical of the quality of these samples.
norm_rhat = az.rhat(norm_trace)
ax = (norm_rhat.max()
.to_array()
.to_series()
.plot(kind="barh"))
ax.axvline(1, c="k", ls="--");
ax.set_xlim(0.99, 1.08);
ax.set_xlabel(r"$\hat{R}$");
ax.invert_yaxis();
ax.set_ylabel(None);
To demonstrate the impact of these elevated $\hat{R}$ statistics, we visualize the samples from the $\alpha_i$ component with the highest $\hat{R}$ statistic below.
α_worst = norm_rhat["α"].idxmax()
az.plot_trace(norm_trace, var_names="α", coords={"age": α_worst}, divergences=False);
In the kernel density estimates on the left, we see at least one chain that is meandering around the parameter space.
fig, axes = plt.subplots(nrows=CHAINS, sharex=True, sharey=True)
for chain, ax in enumerate(axes):
(norm_trace.posterior["α"]
.sel(age=α_worst, chain=chain)
.plot(ax=ax));
ax.set_xlabel(None);
ax.set_yticks([]);
ax.set_ylabel(f"Chain {chain + 1}");
ax.set_title(None);
axes[-1].set_xlabel("Draw");
fig.suptitle("α");
fig.tight_layout();
Plotting each chain's trace on its own for clarity, we see that the second chain demonstrates particularly concerning behavior, producing relatively constant samples for long periods of time and then jumping to completely different portions of the parameter space.
Based on these results, we see that introducing regularization through normal priors has reduced but not eliminated the inference issues with this model.
Noncentered normal priors¶
The APC model with normal priors shows that adding some regularization helps, so for our next model we increase the amount of regularization. We do so by placing hierarchical normal priors on the APC parameters,
$$\alpha_i \sim N(0, \sigma_{\alpha}^2),\ \sigma_{\alpha} \sim \operatorname{Half}-N(2.5^2),$$ $$\beta_i \sim N(0, \sigma_{\beta}^2),\ \sigma_{\beta} \sim \operatorname{Half}-N(2.5^2),$$ $$\gamma_i \sim N(0, \sigma_{\gamma}^2),\ \sigma_{\gamma} \sim \operatorname{Half}-N(2.5^2).$$
In reality, we use an equivalent noncentered parameterization of this prior that often leads to more efficient sampling.
# the scale necessary to make a halfnormal distribution have unit variance
HALFNORMAL_SCALE = 1 / np.sqrt(1 - 2 / np.pi)
def noncentered_normal(name, *, dims):
Δ = pm.Normal(f"Δ_{name}", 0, 1, dims=dims)
σ = pm.HalfNormal(f"σ_{name}", 2.5 * HALFNORMAL_SCALE)
return pm.Deterministic(name, Δ * σ, dims=dims)
with pm.Model(coords=coords, rng_seeder=SEED) as nc_model:
η0 = pm.Normal("η0", 0, 2.5)
α = noncentered_normal("α", dims="age")
β = noncentered_normal("β", dims="year")
γ = noncentered_normal("γ", dims="cohort")
The rest of the model is defined as before, with the addition of a MutableData
container for the offset. This container will be useful when we visualize the posterior predictive distribution.
with nc_model:
η = η0 + α[i] + β[j] + γ[k]
N_ = pm.MutableData("N", N)
ν = pm.HalfNormal("ν", 2.5)
pm.NegativeBinomial("kids", N_ * at.exp(η), ν, observed=kids)
We now sample from the posterior distribution of this model.
with nc_model:
nc_trace = pm.sample(target_accept=0.9, **SAMPLE_KWARGS)
No divergences occured during sampling, though we do see a low effective sample size warning.
nc_rhat = az.rhat(nc_trace)
ax = (nc_rhat.max()
.to_array()
.to_series()
.plot(kind="barh"))
ax.axvline(1, c="k", ls="--");
ax.set_xlim(0.99, 1.01);
ax.set_xlabel(r"$\hat{R}$");
ax.invert_yaxis();
ax.set_ylabel(None);
The $\hat{R}$ statistics are comfortably below 1.01, a sign of significantly improved sampling. The trace plot for the component of $\alpha$ with the highest $\hat{R}$ statistic also shows quite good mixing.
az.plot_trace(nc_trace, var_names="α", coords={"age": nc_rhat["α"].idxmax()}, divergences=False);
We now sample from the posterior predictive distribution of this model and visualize the results.
N_PP = 10_000
def scale_pp(x):
return x / N_PP
with nc_model:
pm.set_data({"N": np.full_like(N, N_PP)})
nc_trace.extend(pm.sample_posterior_predictive(nc_trace))
CI_WIDTH = 0.95
fig, axes = plt.subplots(ncols=3, sharey=True, figsize=(12, 4))
# age
(df.groupby("age")
["kids"]
.mean()
.plot(c="k", ls="--", label="Observed", ax=axes[0]))
nc_pp_by_age = (
nc_trace.posterior_predictive
.assign_coords(kids_dim_0=apc_df.index.get_level_values("age").values)
["kids"]
.pipe(scale_pp)
.groupby("kids_dim_0")
)
axes[0].fill_between(age_map,
nc_pp_by_age.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
nc_pp_by_age.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
alpha=0.5);
axes[0].plot(age_map, nc_pp_by_age.mean(dim=("chain", "draw", "kids_dim_0")));
axes[0].set_xlabel("Age");
axes[0].set_ylabel("Children\n(posterior predictive)");
axes[0].legend(loc="upper left");
# year
(df.groupby("year")
["kids"]
.mean()
.plot(c="k", ls="--", ax=axes[1]));
nc_pp_by_year = (
nc_trace.posterior_predictive
.assign_coords(kids_dim_0=apc_df.index.get_level_values("year").values)
["kids"]
.pipe(scale_pp)
.groupby("kids_dim_0")
)
axes[1].fill_between(year_map,
nc_pp_by_year.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
nc_pp_by_year.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
alpha=0.5);
axes[1].plot(year_map, nc_pp_by_year.mean(dim=("chain", "draw", "kids_dim_0")));
axes[1].set_xlabel("Year");
# cohort
(df.groupby("cohort")
["kids"]
.mean()
.plot(c="k", ls="--", ax=axes[2]));
nc_pp_by_cohort = (
nc_trace.posterior_predictive
.assign_coords(kids_dim_0=apc_df.index.get_level_values("cohort").values)
["kids"]
.pipe(scale_pp)
.groupby("kids_dim_0")
)
axes[2].fill_between(cohort_map,
nc_pp_by_cohort.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
nc_pp_by_cohort.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
alpha=0.5);
axes[2].plot(cohort_map, nc_pp_by_cohort.mean(dim=("chain", "draw", "kids_dim_0")));
axes[2].set_xlabel("Cohort (birth year)");
fig.tight_layout();
Here we see solid agreement between the model and the observed data, perhaps even a bit too much agreement, as our APC effects seem to mimic the yearly noise in the observced data. It would be reasonable to expect that the true APC effects are smoother than what we see here. Our next two models how how including this behavior in our model can result in better sampling and more effective predictions.
Noncentered normal random walk priors¶
We can incorporate the expectation that the number of children only changes a bit from year-to-year (and therefore the relationship is smoother than above) by a placing noncentered Gaussian random walk priors on $\alpha_i$, $\beta_j$, and $\gamma_k$.
def noncentered_normal_rw(name, *, dims):
innov = pm.Normal(f"innov_{name}", 0, 1, dims=dims)
σ = pm.HalfNormal(f"σ_{name}", 2.5 * HALFNORMAL_SCALE)
return pm.Deterministic(name, innov.cumsum() * σ, dims=dims)
with pm.Model(coords=coords, rng_seeder=SEED) as rw_model:
η0 = pm.Normal("η0", 0, 2.5)
α = noncentered_normal_rw("α", dims="age")
β = noncentered_normal_rw("β", dims="year")
γ = noncentered_normal_rw("γ", dims="cohort")
The rest of the model is defined as before.
with rw_model:
η = η0 + α[i] + β[j] + γ[k]
N_ = pm.MutableData("N", N)
ν = pm.HalfNormal("ν", 2.5)
pm.NegativeBinomial("kids", N_ * at.exp(η), ν, observed=kids)
We now sample from the posterior distribution of this model.
with rw_model:
rw_trace = pm.sample(target_accept=0.95, **SAMPLE_KWARGS)
There are no divergences, no $\hat{R}$ warnings, and no effective sample size warnings.
rw_rhat = az.rhat(rw_trace)
ax = (rw_rhat.max()
.to_array()
.to_series()
.plot(kind="barh"))
ax.axvline(1, c="k", ls="--");
ax.set_xlim(0.99, 1.01);
ax.set_xlabel(r"$\hat{R}$");
ax.invert_yaxis();
ax.set_ylabel(None);
We now sample from and visualize the posterior predictive distribution.
with rw_model:
pm.set_data({"N": np.full_like(N, N_PP)})
rw_trace.extend(pm.sample_posterior_predictive(rw_trace))
fig, axes = plt.subplots(ncols=3, sharey=True, figsize=(12, 4))
# age
(df.groupby("age")
["kids"]
.mean()
.plot(c="k", ls="--", label="Observed", ax=axes[0]))
rw_pp_by_age = (
rw_trace.posterior_predictive
.assign_coords(kids_dim_0=apc_df.index.get_level_values("age").values)
["kids"]
.pipe(scale_pp)
.groupby("kids_dim_0")
)
axes[0].fill_between(age_map,
rw_pp_by_age.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
rw_pp_by_age.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
alpha=0.5, color="C1");
axes[0].plot(age_map, nc_pp_by_age.mean(dim=("chain", "draw", "kids_dim_0")),
label="Noncentered");
axes[0].plot(age_map, rw_pp_by_age.mean(dim=("chain", "draw", "kids_dim_0")),
c="C1", label="Random walk");
axes[0].set_xlabel("Age");
axes[0].set_ylabel("Children\n(posterior predictive)");
axes[0].legend(loc="upper left");
# year
(df.groupby("year")
["kids"]
.mean()
.plot(c="k", ls="--", ax=axes[1]));
rw_pp_by_year = (
rw_trace.posterior_predictive
.assign_coords(kids_dim_0=apc_df.index.get_level_values("year").values)
["kids"]
.pipe(scale_pp)
.groupby("kids_dim_0")
)
axes[1].fill_between(year_map,
rw_pp_by_year.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
rw_pp_by_year.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
alpha=0.5, color="C1");
axes[1].plot(year_map, nc_pp_by_year.mean(dim=("chain", "draw", "kids_dim_0")));
axes[1].plot(year_map, rw_pp_by_year.mean(dim=("chain", "draw", "kids_dim_0")),
c="C1");
axes[1].set_xlabel("Year");
# cohort
(df.groupby("cohort")
["kids"]
.mean()
.plot(c="k", ls="--", ax=axes[2]));
rw_pp_by_cohort = (
rw_trace.posterior_predictive
.assign_coords(kids_dim_0=apc_df.index.get_level_values("cohort").values)
["kids"]
.pipe(scale_pp)
.groupby("kids_dim_0")
)
axes[2].fill_between(cohort_map,
rw_pp_by_cohort.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
rw_pp_by_cohort.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
alpha=0.5, color="C1");
axes[2].plot(cohort_map, nc_pp_by_cohort.mean(dim=("chain", "draw", "kids_dim_0")));
axes[2].plot(cohort_map, rw_pp_by_cohort.mean(dim=("chain", "draw", "kids_dim_0")),
c="C1");
axes[2].set_xlabel("Cohort (birth year)");
fig.tight_layout();
We see that the posterior predictive mean for the random walk model is considerably smoother year-over-year than that of the noncentered normal model; this smoothness is particularly evident in the cohort plot.
We use Pareto-smoothed importance sampling leave-one-out cross validation (PSIS-LOO) to quantify the extent to which the random walk model is an improvement over the noncentered normal model.
traces = {
"Noncentered": nc_trace,
"Random walk": rw_trace
}
comp_df = pm.compare(traces)
comp_df
fig, ax = plt.subplots()
az.plot_compare(comp_df, insample_dev=False, plot_ic_diff=False,
textsize=9, ax=ax);
We see that the random walk model is not just a visual improvement over the noncentered normal model, but shows an appreciable quantitative improvement in predictive accuracy (as judged by PSIS-LOO).
It is particularly gratifying to see a model that better encodes our intuitions about the smoothness of APC effects perform better both from a computational and predictive perspective.
Smoothing splines¶
Smoothing splines are an even more powerful way to encode smoothness assumptions into our APC model which we will now compare to the noncentered normal random walk model. We will not include a full introduction to smoothing splines (or their Bayesian treatment) here, but we follow Milad Kharratzadeh’s excellent short paper Splines in Stan. For more details about splines see two previous posts, Bayesian Splines with Heteroskedastic Noise in Python with PyMC3 and Fitting a Simple Additive Model in Python .
We only use smoothing splines to model the age and cohort effects here, since they have many more observed values than the period (year
) in which the survey was conducted.
ax = (df[["age", "year", "cohort"]]
.nunique()
.plot(kind="barh"))
ax.set_xlabel("Number of unique values observed");
ax.invert_yaxis();
We define ten knots each for age and cohort, and use scipy
to form the cubic B-spline design matrix.
N_KNOT = 10
age_knots = np.linspace(15, 96, N_KNOT)
age_dmat = sp.interpolate.BSpline(age_knots, np.eye(N_KNOT), 3)(
apc_df.index.get_level_values("age").values
)
cohort_knots = np.linspace(1885, 1986, N_KNOT)
cohort_dmat = sp.interpolate.BSpline(cohort_knots, np.eye(N_KNOT), 3)(
apc_df.index.get_level_values("cohort").values
)
The spline model uses different coordinates than the previous models.
spline_coords = {
"age_knot": np.arange(N_KNOT),
"year": year_map,
"cohort_knot": np.arange(N_KNOT)
}
We use the priors on the spline coefficients $\alpha_i$ and $\gamma_k$ suggested in Splines in Stan (rescaled for computational efficiency).
with pm.Model(coords=spline_coords, rng_seeder=SEED) as spline_model:
η0 = pm.Normal("η0", 0, 2.5)
α = noncentered_normal_rw("α", dims="age_knot") / np.sqrt(N_KNOT)
β = noncentered_normal_rw("β", dims="year")
γ = noncentered_normal_rw("γ", dims="cohort_knot") / np.sqrt(N_KNOT)
The rest of the model is defined the similarly to before.
with spline_model:
η = η0 + at.dot(age_dmat, α) + β[j] + at.dot(cohort_dmat, γ)
N_ = pm.MutableData("N", N)
ν = pm.HalfNormal("ν", 2.5)
pm.NegativeBinomial("kids", N_ * at.exp(η), ν, observed=kids)
We now sample from the posterior distribution of this model.
with spline_model:
spline_trace = pm.sample(target_accept=0.99, **SAMPLE_KWARGS)
There are no divergences, no $\hat{R}$ warnings, and no effective sample size warnings.
spline_rhat = az.rhat(spline_trace)
ax = (spline_rhat.max()
.to_array()
.to_series()
.plot(kind="barh"))
ax.axvline(1, c="k", ls="--");
ax.set_xlim(0.99, 1.01);
ax.set_xlabel(r"$\hat{R}$");
ax.invert_yaxis();
ax.set_ylabel(None);
We now sample from and visualize the posterior predictive distribution.
with spline_model:
pm.set_data({"N": np.full_like(N, N_PP)})
spline_trace.extend(pm.sample_posterior_predictive(spline_trace))
fig, axes = plt.subplots(ncols=3, sharey=True, figsize=(12, 4))
# age
(df.groupby("age")
["kids"]
.mean()
.plot(c="k", ls="--", label="Observed", ax=axes[0]))
spline_pp_by_age = (
spline_trace.posterior_predictive
.assign_coords(kids_dim_0=apc_df.index.get_level_values("age").values)
["kids"]
.pipe(scale_pp)
.groupby("kids_dim_0")
)
axes[0].fill_between(age_map,
spline_pp_by_age.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
spline_pp_by_age.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
alpha=0.5, color="C2");
axes[0].plot(age_map, rw_pp_by_age.mean(dim=("chain", "draw", "kids_dim_0")),
c="C1", label="Random walk");
axes[0].plot(age_map, spline_pp_by_age.mean(dim=("chain", "draw", "kids_dim_0")),
c="C2", label="Spline");
axes[0].set_xlabel("Age");
axes[0].set_ylabel("Children\n(posterior predictive)");
axes[0].legend(loc="upper left");
# year
(df.groupby("year")
["kids"]
.mean()
.plot(c="k", ls="--", ax=axes[1]));
spline_pp_by_year = (
spline_trace.posterior_predictive
.assign_coords(kids_dim_0=apc_df.index.get_level_values("year").values)
["kids"]
.pipe(scale_pp)
.groupby("kids_dim_0")
)
axes[1].fill_between(year_map,
spline_pp_by_year.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
spline_pp_by_year.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
alpha=0.5, color="C2");
axes[1].plot(year_map, rw_pp_by_year.mean(dim=("chain", "draw", "kids_dim_0")),
c="C1");
axes[1].plot(year_map, spline_pp_by_year.mean(dim=("chain", "draw", "kids_dim_0")),
c="C2");
axes[1].set_xlabel("Year");
# cohort
(df.groupby("cohort")
["kids"]
.mean()
.plot(c="k", ls="--", ax=axes[2]));
spline_pp_by_cohort = (
spline_trace.posterior_predictive
.assign_coords(kids_dim_0=apc_df.index.get_level_values("cohort").values)
["kids"]
.pipe(scale_pp)
.groupby("kids_dim_0")
)
axes[2].fill_between(cohort_map,
spline_pp_by_cohort.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
spline_pp_by_cohort.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
alpha=0.5, color="C2");
axes[2].plot(cohort_map, rw_pp_by_cohort.mean(dim=("chain", "draw", "kids_dim_0")),
c="C1");
axes[2].plot(cohort_map, spline_pp_by_cohort.mean(dim=("chain", "draw", "kids_dim_0")),
c="C2");
axes[2].set_xlabel("Cohort (birth year)");
fig.tight_layout();
We see that the posterior predictive mean for the spline model is considerably smoother year-over-year than that of the noncentered normal random walk model model; this smoothness is particularly evident in the cohort plot.
We compare this model with the two previous ones.
traces["Spline"] = spline_trace
comp_df = pm.compare(traces)
comp_df
fig, ax = plt.subplots()
az.plot_compare(comp_df, insample_dev=False, plot_ic_diff=False,
textsize=9, ax=ax);
We see that the spline model is a slight improvement over the noncentered normal random walk model from the perspective of PSIS-LOO.
Age/cohort submodel¶
With these five models we have iterated from an unidentified model, to models that were less degenerate but still sampled poorly, to smoothly varying models that sample well. Having overcome these issues with our initial models, we now turn to the question of whether including a period (year) effect is actually beneficial.
fig, ax = plt.subplots()
(df.groupby("year")
["kids"]
.mean()
.plot(c="k", ls="--", label="Observed", ax=ax));
ax.fill_between(year_map,
spline_pp_by_year.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
spline_pp_by_year.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
alpha=0.5, color="C2");
ax.plot(year_map, spline_pp_by_year.mean(dim=("chain", "draw", "kids_dim_0")),
c="C2", label="Spline");
ax.set_xlabel("Year");
ax.set_ylabel("Children\n(posterior predictive)");
ax.legend();
The posterior predictive effect of period on the number of children from the previous model appears to be rather flat, giving us reason to suspect that it may not contribut much to the model's predictive accuracy. We now quantify the impact of removing period effects on the predictive accuracy of our model.
For $\alpha_i$ and $\gamma_k$, we use the same priors as the APC spline model, but we set $\beta_k \equiv 0$.
with pm.Model(coords=spline_coords, rng_seeder=SEED) as ac_model:
η0 = pm.Normal("η0", 0, 2.5)
α = noncentered_normal_rw("α", dims="age_knot") / np.sqrt(age_knots.size)
γ = noncentered_normal_rw("γ", dims="cohort_knot") / np.sqrt(cohort_knots.size)
η = η0 + at.dot(age_dmat, α) + at.dot(cohort_dmat, γ)
N_ = pm.MutableData("N", N)
ν = pm.HalfNormal("ν", 2.5)
pm.NegativeBinomial("kids", N_ * at.exp(η), ν, observed=kids)
We now sample from the posterior distribution of this model.
with ac_model:
ac_trace = pm.sample(target_accept=0.99, **SAMPLE_KWARGS)
There are no divergences, no $\hat{R}$ warnings, and no effective sample size warnings.
ac_rhat = az.rhat(ac_trace)
ax = (ac_rhat.max()
.to_array()
.to_series()
.plot(kind="barh"))
ax.axvline(1, c="k", ls="--");
ax.set_xlim(0.99, 1.01);
ax.set_xlabel(r"$\hat{R}$");
ax.invert_yaxis();
ax.set_ylabel(None);
We now sample from and visualize the posterior predictive distribution.
with ac_model:
pm.set_data({"N": np.full_like(N, N_PP)})
ac_trace.extend(pm.sample_posterior_predictive(ac_trace))
fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(12, 5))
# age
(df.groupby("age")
["kids"]
.mean()
.plot(c="k", ls="--", label="Observed", ax=axes[0]))
ac_pp_by_age = (
ac_trace.posterior_predictive
.assign_coords(kids_dim_0=apc_df.index.get_level_values("age").values)
["kids"]
.pipe(scale_pp)
.groupby("kids_dim_0")
)
axes[0].fill_between(age_map,
ac_pp_by_age.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
ac_pp_by_age.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
alpha=0.5, color="C3");
axes[0].plot(age_map, spline_pp_by_age.mean(dim=("chain", "draw", "kids_dim_0")),
c="C2", label="APC");
axes[0].plot(age_map, ac_pp_by_age.mean(dim=("chain", "draw", "kids_dim_0")),
c="C3", ls="--", label="AC");
axes[0].set_xlabel("Age");
axes[0].set_ylabel("Children\n(posterior predictive)");
axes[0].legend(loc="upper left");
# cohort
(df.groupby("cohort")
["kids"]
.mean()
.plot(c="k", ls="--", ax=axes[1]));
ac_pp_by_cohort = (
ac_trace.posterior_predictive
.assign_coords(kids_dim_0=apc_df.index.get_level_values("cohort").values)
["kids"]
.pipe(scale_pp)
.groupby("kids_dim_0")
)
axes[1].fill_between(cohort_map,
ac_pp_by_cohort.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
ac_pp_by_cohort.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
alpha=0.5, color="C3");
axes[1].plot(cohort_map, spline_pp_by_cohort.mean(dim=("chain", "draw", "kids_dim_0")),
c="C2");
axes[1].plot(cohort_map, ac_pp_by_cohort.mean(dim=("chain", "draw", "kids_dim_0")),
c="C3", ls="--");
axes[1].set_xlabel("Cohort (birth year)");
fig.suptitle("Spline models");
fig.tight_layout();
We see that the posterior predictions from the APC and AC spline models are visually indistinguishable.
spline_traces = {
"APC": spline_trace,
"AC": ac_trace
}
spline_comp_df = pm.compare(spline_traces)
spline_comp_df
fig, ax = plt.subplots()
az.plot_compare(spline_comp_df, insample_dev=False, plot_ic_diff=False,
textsize=9, ax=ax);
Comparing the APC and AC spline models using PSIS-LOO shows that they are practically indistinguishable, so we prefer the more parismonious AC model. (Models that omit period have the added advantage that they are easier to use in forecasting as we need not hypothesize a specific functional form for future period effects.)
This post is available as a Jupyter notebook here.
%load_ext watermark
%watermark -n -u -v -iv