Austin Rochford

Revisiting Bayesian Survival Analysis in Python with PyMC

Austin Rochford — Tue, 10 Jan 2023 05:00:00 GMT

Recently during an interview to promote the upcoming PyMCon Web Series (for which I am mentoring one of the first round of presenters), my friend Ravin Kumar kindly mentioned that a 2015 post of mine, Bayesian Survival Analysis in Python with PyMC3, drove a lot of his early interest in PyMC. Ravin's comment caused me to revisit the post and realize how out of date it is. In the seven years since it was published, there have been advances along many fronts:

PyMC is now on version 5; in 2015 I was using version 3 for that post. Since that time Theano has been deprecated, PyMC considered alternate backends for some time, Aesara was forked from Theano, and finally PyTensor was forked from Aesara. Not only has the backend for tensor computation evolved, the PyMC API has been improved and expanded significantly over the past seven years.
ArviZ development technically started in 2015, but didn't take off until 2018. ArviZ has kept pace with theoretical developments in Bayesian model criticism and grown to be an integral part of my Bayesian modeling workflow.
nutpie emerged in the second half of 2022 as an extremely fast Rust implementation of adaptive HMC that can sample from the posteriors of Bayesian models specified with either PyMC or Stan.
seaborn recently released the seaborn.objects interface, an exciting development in bringing grammar of graphics-style visualization to Python.
I have improved at least somewhat as a developer, modeler, and writer since 2015.

For all of these reasons, I thought it would be fun for me and perhaps helpful to others to revisit this post and rework the example with updated techniques.

First we make the necessary Python imports and do some light configuration.

In [1]:

%matplotlib inline

In [2]:

from multipledispatch.dispatcher import AmbiguityWarning
from warnings import filterwarnings

In [3]:

import arviz as az
import lifelines as ll
from matplotlib import pyplot as plt
from matplotlib.ticker import NullLocator
import numpy as np
import nutpie
import pandas as pd
import pymc as pm
from pytensor import tensor as pt
import seaborn as sns
from seaborn import objects as so
from statsmodels.datasets import get_rdataset

In [4]:

filterwarnings("ignore", category=AmbiguityWarning)
filterwarnings("ignore", module="pymc", category=FutureWarning)

In [5]:

plt.rc("figure", figsize=(8, 6))
sns.set(color_codes=True)

We begin by loading the mastectomy dataset from the HSAUR R package.

In [6]:

df = get_rdataset("mastectomy", package="HSAUR", cache=True).data
df["metastized"] = df["metastized"] == "yes"

In [7]:

df.head()

Out[7]:

	time	event	metastized
0	23	True	False
1	47	True	False
2	69	True	False
3	70	False	False
4	100	False	False

In [8]:

df.tail()

Out[8]:

	time	event	metastized
39	162	False	True
40	188	False	True
41	212	False	True
42	217	False	True
43	225	False	True

From the HSAUR documentation, this dataset represents

[s]urvival times in months after mastectomy of women with breast cancer. The cancers are classified as having metastized or not based on a histochemical marker.

In this data frame,

time is the number of months since mastectomy,
event indicates whether the woman died at the corresponding time (if True) or the observation was censored (if False). In the context of survival analysis, censoring means that the woman survived past the corresponding time, but that her death was not observed. Censoring (and its counterpart truncation) represents a fundamental challenge in survival analysis, and
metastized indicates whether the woman's cancer had metastized.

The following plot shows the time-to-event for each patient, along with whether or not the cancer had metastized and whether or not the patient's death was observed.

In [9]:

MONTHS_LABEL = "Months since mastectomy"

In [10]:

sorted_df = df.sort_values("time")

(so.Plot(sorted_df, x=0, y=np.arange(sorted_df.shape[0]),
         color="event", linestyle="metastized")
   .add(so.Dash(), so.Shift(x=sorted_df["time"] / 2), width="time")
   .scale(y=so.Continuous().tick(locator=NullLocator()),
          color=so.Nominal(), linestyle=so.Nominal())
   .limit(x=(0, None))
   .label(x=MONTHS_LABEL, y=None,
          color=str.capitalize, linestyle=str.capitalize))

Out[10]:

We see that metastization is highly correlated with both short lifetime post-mastectomy and actual death. The following contingency table confirms this observation.

In [11]:

df.pivot_table(values="time", index="metastized", columns="event",
               aggfunc=np.size, margins=True)

Out[11]:

event	False	True	All
metastized
False	7	5	12
True	11	21	32
All	18	26	44

An improved crash course in survival analysis

When studying time-to-event data, especially in the presence of censoring (and/or truncation), survival analysis is the appropriate modeling framework. As indicated by the name, survival analysis focuses on estimating the survival function. If $T$ is the time-to-event in question, the survival function is

$$S(t) = \mathbb{P}(T \geq t).$$

This focus on the survival function is important because for censored observations we only know that the time-to-event exceeds the recorded time. The naive approach of treating censored observations as if the event occurred at the time of censoring risks systematically underestimating the true average survival time, as we will illustrate later in this post.
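To get a feel for the size of this bias, here is a minimal simulation sketch. It is not part of the mastectomy analysis: the geometric lifetimes, the 5% monthly hazard, and the uniform censoring times are all assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed lifetimes: geometric with a 5% per-month hazard (true mean of 20 months)
true_time = rng.geometric(0.05, size=10_000)
# Assumed censoring times: uniform on 1..39 months, independent of the lifetimes
censor_time = rng.integers(1, 40, size=true_time.size)

# What we would actually record for each simulated patient
observed_time = np.minimum(true_time, censor_time)
event = true_time <= censor_time  # True if the death was observed

print(f"true mean lifetime:              {true_time.mean():.1f}")
print(f"naive mean, all patients:        {observed_time.mean():.1f}")
print(f"naive mean, observed deaths only: {observed_time[event].mean():.1f}")
```

Both naive averages fall below the true mean because every censored record is, by construction, shorter than the lifetime it stands in for.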

Instead of working directly with the survival function, it is convenient to phrase our models in terms of the cumulative hazard function.

$$\Lambda(t) = -\log S(t).$$

From this definition we see that $\Lambda(t) \geq 0$, $\Lambda(0) = 0$, and $\Lambda$ is nondecreasing. Since we are working on a discrete timescale (we only know how many months the patient survived for, not exactly when they died or were censored), it is further convenient to decompose this cumulative hazard function into a sum of per-period hazards,

$$\Lambda(t) = \sum_{s \leq t} \lambda(s).$$

In the continuous-time case, this sum is replaced with an appropriate integral.
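The relationship between per-period hazards, the cumulative hazard, and the survival function can be sketched numerically. The hazard values below are made up for illustration, and the convention $\Lambda(0) = 0$ is imposed by prepending a zero.

```python
import numpy as np

# Assumed per-period hazards λ(1), ..., λ(4)
hazard = np.array([0.1, 0.2, 0.4, 0.3])

# Λ(0) = 0, then Λ(t) = Σ_{s ≤ t} λ(s) for t = 1, ..., 4
cum_hazard = np.concatenate([[0.0], hazard.cumsum()])

# Invert Λ(t) = -log S(t) to recover the survival function
survival = np.exp(-cum_hazard)

print(cum_hazard)
print(survival.round(3))
```

Because every hazard is nonnegative, the cumulative hazard can only grow and the survival function can only shrink, starting from $S(0) = 1$.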

The Cox proportional hazards model

With the cumulative hazard function decomposed as above, we can introduce the Cox proportional hazards model. Given predictors $\mathbf{x}$, this model treats $\lambda(t)$ as a log-linear model,

$$\log \lambda(t\ |\ \mathbf{x}) = \alpha(t) + \beta \cdot \mathbf{x}.$$

This model is often expressed as

$$\lambda(t\ |\ \mathbf{x}) = \lambda_0(t) \cdot \exp(\beta \cdot \mathbf{x}),$$

where $\lambda_0(t) = \exp(\alpha(t)).$ This model has "proportional hazards" because, if $\mathbf{y}$ are the predictors for another patient,

$$\frac{\lambda(t\ |\ \mathbf{x})}{\lambda(t\ |\ \mathbf{y})} = \frac{\lambda_0(t) \cdot \exp(\beta \cdot \mathbf{x})}{\lambda_0(t) \cdot \exp(\beta \cdot \mathbf{y})} = \exp(\beta \cdot (\mathbf{x} - \mathbf{y}))$$

is independent of $t$.
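A quick numerical check of this property, using made-up values for $\alpha(t)$ and $\beta$ (these are assumptions, not estimates from the data):

```python
import numpy as np

alpha = np.array([-3.0, -2.5, -2.0, -2.2])  # assumed log baseline hazard α(t)
beta = 0.8                                   # assumed coefficient
x, y = 1.0, 0.0                              # predictors for two patients

hazard_x = np.exp(alpha + beta * x)
hazard_y = np.exp(alpha + beta * y)

# The baseline hazard cancels, so the ratio is the same in every period
ratio = hazard_x / hazard_y
print(ratio)
```

Every entry of `ratio` equals $\exp(\beta \cdot (x - y))$, no matter how the baseline hazard moves over time.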

The Bayesian Cox model

With the mathematical form of our model specified, we can begin to implement it in Python using PyMC. First we extract the relevant columns from the data frame. In our case,

$$x_i = \begin{cases} 0 & \text{if the }i\text{-th patient's cancer has not metastized} \\ 1 & \text{if the }i\text{-th patient's cancer has metastized} \end{cases}.$$

In [12]:

t = df["time"].values
event = df["event"].values
x = df["metastized"].values

We place a hierarchical normal prior on $\alpha_t$ and a normal prior on $\beta$.

In [13]:

# the scale necessary to make a halfnormal distribution have unit variance
HALFNORMAL_SCALE = 1 / np.sqrt(1 - 2 / np.pi)
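This constant comes from the fact that a half-normal distribution with scale $\sigma$ has variance $\sigma^2 (1 - 2 / \pi)$. As a sanity check, the following sketch simulates half-normal draws by taking absolute values of normal draws; the sample size and seed are arbitrary choices.

```python
import numpy as np

HALFNORMAL_SCALE = 1 / np.sqrt(1 - 2 / np.pi)

rng = np.random.default_rng(0)

# |Z| with Z ~ Normal(0, scale) is half-normal with that scale
draws = np.abs(rng.normal(0.0, HALFNORMAL_SCALE, size=1_000_000))

analytic_var = HALFNORMAL_SCALE**2 * (1 - 2 / np.pi)
print(analytic_var, draws.var())
```

Multiplying the scale by 2.5, as the prior on σ does, therefore yields a half-normal with standard deviation 2.5.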

In [14]:

def noncentered_normal(name, *, dims, μ=None):
    if μ is None:
        μ = pm.Normal(f"μ_{name}", 0, 2.5)
    
    Δ = pm.Normal(f"Δ_{name}", 0, 1, dims=dims)
    σ = pm.HalfNormal(f"σ_{name}", 2.5 * HALFNORMAL_SCALE)
    
    return pm.Deterministic(name, μ + Δ * σ, dims=dims)

In [15]:

coords = {"metastized": np.array([False, True]), "time": np.arange(t.max() + 1)}

In [16]:

with pm.Model(coords=coords) as model:
    α = noncentered_normal("α", dims="time")
    β = pm.Normal("β", 0, 2.5)

We then define $\lambda(t) = \exp(\alpha(t) + \beta \cdot x).$

In [17]:

with model:
    λ = pt.exp(α[np.newaxis] + β * x[:, np.newaxis])

We could proceed by building

$$S(t\ |\ x) = \exp\left(-\sum_{s \leq t} \lambda(s\ |\ x)\right),$$

and using the fact that

$$\mathbb{P}(\text{died at }t\ |\ x) = S(t\ |\ x) - S(t + 1\ |\ x)$$

to tie the hazard function we have specified to our observations. A simpler approach uses a classic result from the early 1980s (see §7.4.3 of the reference), which is that this model is equivalent to the following Poisson regression model.

Let

$$d_{i, t} = \begin{cases} 1 & \text{if the }i\text{-th patient died in period }t \\ 0 & \text{otherwise} \end{cases}$$

indicate if the patient in question died in the $t$-th period and

$$e_{i, t} = \begin{cases} 1 & \text{if the }i\text{-th patient was alive at the beginning of period }t \\ 0 & \text{otherwise} \end{cases}$$

indicate if the patient was alive (exposed) at the beginning of the $t$-th period. The Cox model is equivalent to a Poisson model for $d_{i, t}$ with mean $e_{i, t} \cdot \lambda(t\ |\ x_i).$
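To see why the Poisson formulation works, consider one patient's contribution to the log likelihood. The sketch below uses made-up hazards and a made-up death period, and follows the definitions of $d_{i, t}$ and $e_{i, t}$ above. Summing the Poisson log-probabilities over periods collapses to the log hazard at death minus the cumulative hazard over the exposed periods, which is the familiar hazard-times-survival form.

```python
import numpy as np

hazard = np.array([0.05, 0.10, 0.20, 0.15, 0.30])  # assumed λ(t | x) for periods 0..4
death_period = 2                                   # assumed: the patient died in period 2

periods = np.arange(hazard.size)
exposed = periods <= death_period  # e_t: alive at the beginning of period t
died = periods == death_period     # d_t: died in period t

# Poisson log-pmf with mean e_t * λ(t) evaluated at d_t in {0, 1}.
# log(d_t!) is zero, and unexposed periods (mean 0, observation 0) contribute 0.
poisson_loglik = np.sum(np.where(exposed, died * np.log(hazard) - hazard, 0.0))

# log of: (hazard at death) * exp(-cumulative hazard through the death period)
survival_loglik = np.log(hazard[death_period]) - hazard[: death_period + 1].sum()

print(poisson_loglik, survival_loglik)
```

For a censored patient the `died` indicator is zero everywhere, and only the cumulative-hazard term remains.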

We now construct the arrays measuring exposure and indicating death.

In [18]:

exposed = np.full((df.shape[0], t.max() + 1), True, dtype=np.bool_)
np.put_along_axis(exposed, t[:, np.newaxis], False, axis=1)
exposed = np.minimum.accumulate(exposed, axis=1)

In [19]:

event_ = np.full_like(exposed, False, dtype=np.bool_)
np.put_along_axis(event_, t[:, np.newaxis] - 1, event[:, np.newaxis], axis=1)
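To make the indexing concrete, here is the same construction applied to two hypothetical patients: one who died at month 2 and one who was censored at month 4. The toy values of `t` and `event` are assumptions; the array operations mirror the cells above.

```python
import numpy as np

t = np.array([2, 4])             # assumed survival times in months
event = np.array([True, False])  # first patient died, second was censored

# Mark column t as False, then propagate that False to every later column
exposed = np.full((t.size, t.max() + 1), True, dtype=np.bool_)
np.put_along_axis(exposed, t[:, np.newaxis], False, axis=1)
exposed = np.minimum.accumulate(exposed, axis=1)

# Place each patient's event indicator in column t - 1, the last exposed period
event_ = np.full_like(exposed, False, dtype=np.bool_)
np.put_along_axis(event_, t[:, np.newaxis] - 1, event[:, np.newaxis], axis=1)

print(exposed.astype(int))
# [[1 1 0 0 0]
#  [1 1 1 1 0]]
print(event_.astype(int))
# [[0 1 0 0 0]
#  [0 0 0 0 0]]
```

Each row of `exposed` sums to that patient's recorded time, and a death always lands in the final exposed column.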

We now add the observed Poisson-distributed event indicator to the model.

In [20]:

with model:
    pm.Poisson("event", exposed * λ, observed=event_)

Before we sample from the model's posterior distribution, we define a series of quantities that will allow us to easily visualize the posterior predictive survival functions for patients whose cancer had or had not metastized. We do not use PyMC's built-in posterior predictive sampling method here due to the way we constructed the observed variable using the equivalent Poisson model. Adding these auxiliary quantities is more straightforward in this case.

In [21]:

with model:
    λ_pred = pt.exp(α[np.newaxis] + β * np.array([[0, 1]]).T)
    Λ_pred = λ_pred.cumsum(axis=1)
    sf_pred = pm.Deterministic("sf_pred", pt.exp(-Λ_pred), dims=("metastized", "time"))

Now we are ready to use nutpie to sample from the model's posterior distribution.

In [22]:

SEED = 123456789

In [23]:

trace = nutpie.sample(
    nutpie.compile_pymc_model(model),
    seed=SEED
)

100.00% [7800/7800 00:06<00:00 Chains in warmup: 0, Divergences: 6]

With only a few divergences, we check the Gelman-Rubin $\hat{R}$ statistics to see if there were any obvious issues with sampling.

All of the $\hat{R}$ statistics are below 1.01, so we see no obvious issues with sampling.
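For intuition about what $\hat{R}$ measures, here is a minimal NumPy sketch of the classic (non-split) Gelman-Rubin statistic, which compares between-chain and within-chain variance. The function name `gelman_rubin` and the simulated chains are hypothetical, and this is a simplification: ArviZ's `az.rhat` computes a more refined rank-normalized split version by default.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic R-hat for an array of draws with shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]

    within = chains.var(axis=1, ddof=1).mean()            # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)   # B: scaled variance of chain means
    pooled = (n_draws - 1) / n_draws * within + between / n_draws

    return np.sqrt(pooled / within)

rng = np.random.default_rng(0)

# Four chains drawing from the same distribution: R-hat should be close to 1
mixed = rng.normal(size=(4, 1000))
# Four chains stuck around different locations: R-hat should be well above 1
stuck = mixed + np.arange(4)[:, np.newaxis]

print(gelman_rubin(mixed), gelman_rubin(stuck))
```

Values near one indicate that the chains agree with each other about as well as each chain agrees with itself.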

We now use seaborn to visualize the posterior predictive survival functions for patients in either state of metastization.

In [25]:

ALPHA = 0.05

so_ci = so.Perc([100 * ALPHA / 2, 100 * (1 - ALPHA / 2)])

In [26]:

PP_SF_LABEL = "Posterior predictive\nsurvival function"

In [27]:

(so.Plot(trace.posterior["sf_pred"].to_dataframe(),
         x="time", y="sf_pred", color="metastized")
   .add(so.Line(), so.Agg())
   .add(so.Band(), so_ci)
   .scale(color=so.Nominal())
   .limit(x=(0, t.max()), y=(0, 1))
   .label(x=MONTHS_LABEL, y=PP_SF_LABEL, color=str.capitalize))

Out[27]:

These posterior predictive survival functions look plausible; to confirm their correctness we compare them to predictions from Cox models fit using the lifelines package.

In [28]:

ll_model = ll.CoxPHFitter().fit(df, duration_col="time", event_col="event")

In [29]:

pred_df = pd.merge(
    trace.posterior["sf_pred"].to_dataframe(),
    ll_model.predict_survival_function(np.array([[False, True]]).T, times=np.arange(t.max() + 1))
            .rename(columns=bool)
            .rename_axis("time", axis=0)
            .rename_axis("metastized", axis=1)
            .stack()
            .rename("ll_pred"),
    left_index=True, right_index=True
)

The following plot adds the predictions from lifelines to the posterior predictions from our model.

In [30]:

(so.Plot(pred_df,
         x="time", y="sf_pred", color="metastized")
   .add(so.Line(), so.Agg())
   .add(so.Band(), so_ci)
   .add(so.Line(linestyle="--"), so.Agg(func=lambda x: x[0]),
        y="ll_pred")
   .scale(color=so.Nominal())
   .limit(x=(0, t.max()), y=(0, 1))
   .label(x=MONTHS_LABEL, y=PP_SF_LABEL, color=str.capitalize))

Out[30]:

The predictions are reasonably close (our Bayesian predictions are naturally more regularized towards the population mean than lifelines'), which is reassuring.

We now return to a point briefly mentioned above, that naively estimating average lifetimes by ignoring censoring significantly underestimates the true average lifetime. First we use the helpful fact that

$$\mathbb{E}(T) = \sum_t S(t)$$

to calculate the posterior expected lifetime.

In [31]:

post_exp_life = trace.posterior["sf_pred"].sum(dim="time")
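The identity behind this calculation is easy to verify on a small example. The probability mass function below is made up, and the survival function uses the definition $S(t) = \mathbb{P}(T \geq t)$ from earlier in the post.

```python
import numpy as np

times = np.array([1, 2, 3])
pmf = np.array([0.2, 0.5, 0.3])  # assumed P(T = t) for t = 1, 2, 3

# S(t) = P(T >= t) is a reversed cumulative sum of the pmf
survival = pmf[::-1].cumsum()[::-1]

expected_direct = (times * pmf).sum()  # Σ t * P(T = t)
expected_from_sf = survival.sum()      # Σ S(t)

print(expected_direct, expected_from_sf)
```

Note that in the model the sum only runs over the months covered by the time coordinate, so any survival mass beyond the last month is not counted.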

Now we naively estimate the expected lifetime two different ways. First we take the average lifetime ignoring censoring (grouped according to metastization).

In [32]:

naive_life_df = (df.groupby("metastized")
                   ["time"]
                   .mean()
                   .rename("naive"))

Second we take the average lifetime of patients that did, in fact, die (grouped according to metastization).

In [33]:

naive_event_life_df = (df[df["event"]]
                         .groupby("metastized")
                         ["time"]
                         .mean()
                         .rename("naive_event"))

In [34]:

HEIGHT = 0.015

(so.Plot(post_exp_life.to_dataframe()
                      .join(naive_life_df)
                      .join(naive_event_life_df),
         x="sf_pred", color="metastized")
   .add(so.Area(), so.KDE())
   .add(so.Dash(width=HEIGHT, linestyle=(0, (5, 5))),
        x="naive", y=HEIGHT / 2, orient="y")
   .add(so.Dash(width=HEIGHT, linestyle=(0, (0.25, 5))),
        x="naive_event", y=HEIGHT / 2, orient="y")
   .scale(y=so.Continuous().tick(locator=NullLocator()),
          color=so.Nominal())
   .limit(y=(0, HEIGHT))
   .label(x=MONTHS_LABEL, title="Posterior expected lifetime",
          color=str.capitalize))

Out[34]:

The first naive estimate is shown by the dashed horizontal lines and the second naive estimate is shown by the dotted lines. (I am still learning my way around seaborn's objects interface and haven't figured out how to annotate those linestyles properly yet.)

We see that these estimates are significantly lower than the posterior expected lifetimes.

Extensions of the Bayesian Cox model

There are many extensions of the Cox model that allow deviation from the proportional hazards assumption. Two standard extensions are time-varying features and/or time-varying coefficients. In this post we will build a model with time-varying coefficients, although our model is equally straightforward to modify to accommodate time-varying features. To see that such a model no longer has proportional hazards, note that when $\beta(t)$ is a function of time,

$$\frac{\lambda(t\ |\ \mathbf{x})}{\lambda(t\ |\ \mathbf{y})} = \frac{\lambda_0(t) \cdot \exp(\beta(t) \cdot \mathbf{x})}{\lambda_0(t) \cdot \exp(\beta(t) \cdot \mathbf{y})} = \exp(\beta(t) \cdot (\mathbf{x} - \mathbf{y})),$$

which still depends on $t$. In our case, time-varying effects might correspond to the hypothesis that metastization causes most patients to die quickly, but those patients who live longer have diminished extra risk due to metastization.

Thanks to the modularity of PyMC, it is easy to modify our model to include time-varying coefficients. We place a random-walk prior with appropriately scaled hierarchical normal increments on $\beta(t)$.

In [35]:

with pm.Model(coords=coords) as tv_model:
    μ_β = pm.Normal("μ_β", 0, 2.5)
    β_inc = noncentered_normal("β_inc", μ=0, dims="time")
    β = pm.Deterministic("β", μ_β + (β_inc / np.sqrt(t.max() + 1)).cumsum(axis=0),
                         dims="time")
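The division by the square root of the number of time points keeps the total drift of the random walk on the same scale as a single increment. The sketch below checks this by simulation; the number of periods, the increment scale, and the number of simulated walks are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_time = 200    # assumed number of periods
n_sim = 20_000  # number of simulated walks
sigma = 1.0     # assumed scale of the unscaled increments

increments = rng.normal(0.0, sigma, size=(n_sim, n_time))

# Scaling each increment by 1 / sqrt(n_time) makes the variance of the sum equal sigma**2
walk = (increments / np.sqrt(n_time)).cumsum(axis=1)
final_std = walk[:, -1].std()

print(final_std)
```

Without the scaling, the standard deviation at the end of the walk would grow like the square root of the number of periods.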

The rest of the model is specified almost exactly as before.

In [36]:

with tv_model:
    α = noncentered_normal("α", dims="time")
    
    λ = pt.exp(α[np.newaxis] + β[np.newaxis] * x[:, np.newaxis])
    pm.Poisson("event", exposed * λ, observed=event_)
    
    λ_pred = pt.exp(α[np.newaxis] + β * np.array([[0, 1]]).T)
    Λ_pred = λ_pred.cumsum(axis=1)
    sf_pred = pm.Deterministic("sf_pred", pt.exp(-Λ_pred), dims=("metastized", "time"))

We now sample from this model's posterior distribution.

In [37]:

tv_trace = nutpie.sample(
    nutpie.compile_pymc_model(tv_model),
    seed=SEED
)

100.00% [7800/7800 00:10<00:00 Chains in warmup: 0, Divergences: 29]

There are a few more divergences, but they have not yet reached concerning levels. The Gelman-Rubin $\hat{R}$ statistics also indicate no obvious sampling issues.

The following plot of $\beta(t)$ somewhat bears out our hypothesis, as the additional hazard due to metastization decreases somewhat over time (the credible interval is quite wide though).

In [39]:

(so.Plot(tv_trace.posterior["β"].to_dataframe(),
         x="time", y="β")
   .add(so.Line(), so.Agg())
   .add(so.Band(), so_ci)
   .label(x=MONTHS_LABEL, y=r"$\beta(t)$"))

Out[39]:

We see the effect of these time-varying coefficients on the posterior predictive survival functions below (the dashed lines and credible intervals are for the time-varying model).

In [40]:

(so.Plot(trace.posterior["sf_pred"].to_dataframe(),
         x="time", y="sf_pred", color="metastized")
   .add(so.Line(), so.Agg())
   .add(so.Line(linestyle="--"), so.Agg(), 
        data=tv_trace.posterior["sf_pred"].to_dataframe())
   .add(so.Band(), so_ci,
        data=tv_trace.posterior["sf_pred"].to_dataframe())
   .scale(color=so.Nominal())
   .limit(x=(0, t.max()), y=(0, 1))
   .label(x=MONTHS_LABEL, y=PP_SF_LABEL, color=str.capitalize))

Out[40]:

This post is available as a Jupyter notebook here.

In [41]:

%load_ext watermark
%watermark -n -u -v -iv

Last updated: Tue Jan 10 2023

Python implementation: CPython
Python version       : 3.10.8
IPython version      : 8.7.0

numpy     : 1.23.5
pytensor  : 2.8.11
pymc      : 5.0.1
lifelines : 0.27.4
arviz     : 0.14.0
seaborn   : 0.12.2
pandas    : 1.5.2
matplotlib: 3.6.2
nutpie    : 0.5.1

Learning to Be Thoughtless in Python with Mesa

Austin Rochford — Fri, 07 Oct 2022 04:00:00 GMT

I recently read Who's Counting? Uniting Numbers and Narratives with Stories from Pop Culture, Puzzles, Politics, and More by John Allen Paulos. In the Chapter 6 section titled "Voting Blocs: Red States, Blue State, and a Model for Thoughtless Voting," Paulos mentions an interesting paper by Joshua Epstein, Learning To Be Thoughtless: Social Norms And Individual Computation. In this paper, Epstein proposes a simple but compelling agent-based model for how social norms evolve in a community. This model is interesting because it captures two fundamental forces that motivate human behavior quite elegantly:

the social influence the behavior of others exerts on our own, and
laziness.

Paulos uses Epstein's model as a potential explanation for the increasing polarization of American politics, which, while plausible, is not the focus of this post. After following the citation to Epstein's paper, I thought it would be fun and instructive to implement this model in Python and reproduce the main results of the paper.

Epstein's paper uses the decision to drive on the left- or right-hand side of the road as the social norm under analysis. For the purpose of this post, we will use a boolean flag norm to indicate whether or not an agent conforms to the social norm (defined, for instance, by driving on the right-hand side of the road in the United States) or does not (driving on the left-hand side of the road in the United States).

Epstein names his agents "lazy statisticians." These lazy statisticians determine whether or not they follow the social norm based on the norm-following behavior of the other agents in their vicinity. The size of the neighborhood that influences a statistician's behavior is their radius. With this terminology, we can begin to implement lazy statisticians in Python. Throughout this post we use Mesa, an agent-based modeling framework that

allows users to quickly create agent-based models using built-in core components (such as spatial grids and agent schedulers) or customized implementations; visualize them using a browser-based interface; and analyze their results using Python’s data analysis tools.

In [1]:

%matplotlib inline
%config InlineBackend.figure_format = "retina"

In [2]:

from fastprogress.fastprogress import progress_bar
from matplotlib import pyplot as plt
import mesa

In [3]:

import seaborn as sns

In [4]:

plt.rc("figure", figsize=(8, 6))
plt.rc("font", size=14)
sns.set(color_codes=True)

CMAP = "viridis"

Each lazy statistician is a Mesa Agent. Following the paper, if no initial value for norm is passed, we choose one with a coin flip, and if no initial radius is passed, we choose one uniformly between one and fifty.

In [5]:

class LazyStatistician(mesa.Agent):
    def __init__(self, unique_id, model, norm=None, radius=None, tol=0.05):
        super().__init__(unique_id, model)
        
        self.norm = self.random.choice([False, True]) if norm is None else norm
        # random.randint includes both endpoints, so this draws from 1, ..., 50
        self.radius = self.random.randint(1, 50) if radius is None else radius
        self.tol = tol

When a lazy statistician updates their norm-following behavior, they do so based on the proportion of norm-following agents in their vicinity as determined by their radius (optionally offset by Δ).

In [6]:

class LazyStatistician(LazyStatistician):
    def get_norm_pct(self, *, Δ=0):
        radius = self.radius + Δ
        nhbrs = self.model.grid.iter_neighbors(self.pos, moore=True, radius=radius)

        return sum(nhbr.norm for nhbr in nhbrs) / (2 * radius + 1)

When a lazy statistician updates their norm-following behavior, they do so in two steps. First they update the radius of neighboring agents that influence them, then they update their norm.

When updating their radius, a lazy statistician first compares the proportion of norm-followers in their current neighborhood, defined by their radius, to the proportion of norm-followers in the neighborhood defined by radius + 1. If these are not close (for the purposes of this post, we mostly follow the guideline of the paper and define "close" to be within 5% of each other), the lazy statistician increases their radius by one, since more data is better. If they are close, the lazy statistician compares the proportion of norm-followers in their current neighborhood to the proportion of norm-followers in the neighborhood defined by radius - 1. If these are close, the lazy statistician decreases their radius by one (due to their laziness).

In [7]:

class LazyStatistician(LazyStatistician):
    def update_radius(self):
        norm_pct = self.get_norm_pct()
        more_pct = self.get_norm_pct(Δ=1)
        
        if abs(norm_pct - more_pct) > self.tol:
            self.radius += 1
        elif self.radius > 1:
            less_pct = self.get_norm_pct(Δ=-1)
            
            if abs(norm_pct - less_pct) <= self.tol:
                self.radius -= 1

After updating their radius, a lazy statistician updates their norm to agree with the majority of the agents in their neighborhood (including themselves). Note that including themselves means that there is an odd number (2 * radius + 1) of agents in their neighborhood, so there will never be an ambiguous case where exactly 50% of agents in the neighborhood are norm-followers.

In [8]:

class LazyStatistician(LazyStatistician):
    def update_norm(self):
        self.norm = self.get_norm_pct() > 0.5

    def step(self):
        self.update_radius()
        self.update_norm()
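The two update rules can also be sketched without Mesa, which makes them easy to test by hand. The version below is a simplified stand-in, not the class above: it stores norms in a plain list treated as a ring, the function names are hypothetical, and the window explicitly includes the agent itself.

```python
def norm_pct(norms, i, radius):
    """Share of norm-followers within `radius` of agent i on a ring, agent i included."""
    n = len(norms)
    window = [norms[(i + d) % n] for d in range(-radius, radius + 1)]
    return sum(window) / len(window)

def update_radius(norms, i, radius, tol=0.05):
    """Grow the radius if a wider window disagrees; shrink it if a narrower one agrees."""
    here = norm_pct(norms, i, radius)
    if abs(here - norm_pct(norms, i, radius + 1)) > tol:
        return radius + 1
    if radius > 1 and abs(here - norm_pct(norms, i, radius - 1)) <= tol:
        return radius - 1
    return radius

def update_norm(norms, i, radius):
    """Adopt the majority norm in the current window."""
    return norm_pct(norms, i, radius) > 0.5

# Deep inside a uniform region, the agent gets lazier
uniform = [False] * 20
print(update_radius(uniform, 10, radius=5))  # 4

# At a boundary between regions, the agent looks further afield
boundary = [True] * 10 + [False] * 10
print(update_radius(boundary, 9, radius=1))  # 2
```

The boundary agent sees 2/3 norm-followers at radius one and 3/5 at radius two, a gap of more than 5%, so it widens its search.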

If there are $N$ agents, at each time step $N$ agents are drawn randomly with replacement and their status is updated as described above. Mesa does not include this sort of bootstrap scheduler out-of-the-box, but it is simple to implement.

In [9]:

class BootstrapActivation(mesa.time.BaseScheduler):
    def step(self):
        for _ in range(self.get_agent_count()):
            self.model.random.choice(self.agents).step()
            
        self.steps += 1
        self.time += 1
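One consequence of sampling with replacement is that not every agent is updated in every step. A rough standard-library sketch, assuming 1,000 agents identified by integer ids, shows that about 63% (roughly $1 - 1/e$) of agents are drawn at least once per step.

```python
import random

rng = random.Random(0)
n_agent = 1000  # assumed number of agents

# One scheduler step: n_agent draws with replacement from the agent ids
drawn = [rng.randrange(n_agent) for _ in range(n_agent)]

# Fraction of distinct agents that were updated at least once
frac_updated = len(set(drawn)) / n_agent
print(frac_updated)
```

The remaining agents keep their previous norm and radius until a later step, while some agents are updated more than once.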

With LazyStatistician and BootstrapActivation defined, we are ready to implement our model.

In [10]:

class ThoughtlessModel(mesa.Model):
    def __init__(self, n_agent, norm=None, seed=None,
                 agent_cls=LazyStatistician, **agent_kwargs):
        self.grid = mesa.space.SingleGrid(n_agent, 1, torus=True)
        self.schedule = BootstrapActivation(self)
        
        for unique_id in range(n_agent):
            agent = agent_cls(unique_id, self, norm=norm, **agent_kwargs)
            
            self.schedule.add(agent)
            self.grid.place_agent(agent, (unique_id, 0))
            
        self.datacollector = mesa.DataCollector(agent_reporters={"norm": "norm", "radius": "radius"})
            
    def step(self):
        self.datacollector.collect(self)
        self.schedule.step()

The most important feature of this model is that the agents are placed on a circle (as indicated by the torus=True keyword argument to SingleGrid). Another point of note is that the constructor takes the agent class as a parameter. This flexibility will be useful in reconstructing examples from the paper that add noise and/or temporal shocks to the lazy statistician's update rules.

With ThoughtlessModel in hand, we can start reproducing Epstein's computational results. The paper presents six runs of the model, exploring

stylized facts regarding the evolution of norms: [l]ocal conformity, global diversity, and punctuated equilibria.

Following Epstein, for the first simulation we set each lazy statistician's initial norm to False. In this case, we anticipate that norm should remain False for all agents for all time and each agent's radius should shrink to the minimum value of one.

We choose a seed value for reproducibility.

In [11]:

SEED = 1234567890

Instead of simulating 190 lazy statisticians as in the paper, we simulate 500. We find that this yields interesting outcomes in more complex situations far more often (as the seed varies). As in the paper, we simulate 275 time steps.

In [12]:

N_AGENT = 500
N_STEP = 275

In [13]:

def simulate(model, n_step):
    for _ in progress_bar(range(n_step)):
        model.step()
        
    return model

In [14]:

model_1a = simulate(ThoughtlessModel(N_AGENT, norm=False, seed=SEED), N_STEP)

100.00% [275/275 00:21<00:00]

Visualizing the evolution of each agent's norm and radius over time confirms our expectations.

In [15]:

def plot_results(model, cmap=CMAP, figsize=(16, 6)):
    agent_df = model.datacollector.get_agent_vars_dataframe()
    
    fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True,
                             figsize=figsize, gridspec_kw={"width_ratios": (0.85, 1)})
    norm_ax, radius_ax = axes
    
    sns.heatmap(agent_df["norm"].unstack(),
                cmap=cmap, vmin=0, vmax=1, cbar=False,
                ax=norm_ax)
    
    norm_ax.set_xticks([])
    norm_ax.set_xlabel("Agent")
    
    norm_ax.set_yticks([])
    norm_ax.set_ylabel("Step")
    
    norm_ax.set_title("norm")
    
    rhm = sns.heatmap(agent_df["radius"].unstack(),
                     cmap=cmap, vmin=1, vmax=10,
                     ax=radius_ax)
    
    radius_ax.set_xticks([])
    radius_ax.set_xlabel("Agent")
    
    radius_ax.set_yticks([])
    radius_ax.set_ylabel(None)
    
    radius_ax.set_title("radius")

    fig.tight_layout()
    
    return fig, axes

In [16]:

plot_results(model_1a);

Though the paper does not show this scenario, we get an analogous result when all of the initial norms are True.

In [17]:

model_1b = simulate(ThoughtlessModel(N_AGENT, norm=True, seed=SEED), N_STEP)

100.00% [275/275 00:21<00:00]

In [18]:

plot_results(model_1b);

In each of these cases, "individual computing dies out" is reflected by the fact that every agent's radius eventually decreases to one, the smallest possible.

Run 2. Random Initial Norms, Individual Computing At Norm Boundaries

Run one verified our intuition in the basic but unrealistic case of a uniform ("monolithic") initial value of norm. In run two we explore the more interesting and realistic situation where the initial values of norm are random.

In [19]:

model_2a = simulate(ThoughtlessModel(N_AGENT, seed=SEED), N_STEP)

100.00% [275/275 00:20<00:00]

In [20]:

plot_results(model_2a);

We see that there are two disjoint regions where norm is True and two where it is False. (Recall that the agents are on a circle so the regions at the left and right ends of this plot are the same from their perspective.) "Individual computing at norm boundaries" corresponds to the fact that all agents sufficiently far in the interior of the constant-norm regions eventually have radius one, and it is only agents near the borders of these regions that "compute" in the sense of consulting more than their direct neighbors in their norm update decisions.

In order to understand the factors influencing the width of these edges that compute, we reduce the tolerance used when updating an agent's radius from 5% to 2.5%.

In [21]:

model_2b = simulate(ThoughtlessModel(N_AGENT, tol=0.025, seed=SEED), N_STEP)

100.00% [275/275 00:20<00:00]

In [22]:

plot_results(model_2b);

We see that a lower tolerance causes an agent to expand their radius more often (remember the agent checks a larger radius before considering a smaller radius) and therefore we end up with more agents with nontrivial radius. In the opposite direction, doubling the tolerance to 10% causes agents to increase their radius less often and therefore the boundary where computing occurs shrinks.

In [23]:

model_2c = simulate(ThoughtlessModel(N_AGENT, tol=0.1, seed=SEED), N_STEP)

100.00% [275/275 00:19<00:00]

In [24]:

plot_results(model_2c);

Run 4. Modest Noise Level and Endogenous Neighborhood Norms

For implementation reasons we will replicate run three in the paper last and skip to run four now. The first two scenarios treated a noiseless case where agents never spontaneously change their norm regardless of the input from their neighbors. This noiselessness is unrealistic, so we introduce a NoisyLazyStatistician subclass of LazyStatistician that randomly draws a new norm occasionally according to its noise parameter.

In [25]:

class NoisyLazyStatistician(LazyStatistician):
    def __init__(self, unique_id, model, norm=None, radius=None, noise=0):
        super().__init__(unique_id, model, norm=norm, radius=radius)
        
        self.noise = noise
        
    def update_norm(self, noise=None):
        if self.random.random() < (self.noise if noise is None else noise):
            self.norm = self.random.choice([False, True])
        else:
            super().update_norm()

We simulate behavior when agents randomize their norm due to noise 15% of the time and visualize the results below.

In [26]:

model_4 = simulate(
    ThoughtlessModel(N_AGENT, seed=SEED, noise=0.15,
                     agent_cls=NoisyLazyStatistician),
    N_STEP
)

100.00% [275/275 00:19<00:00]

In [27]:

plot_results(model_4);

These results are visually quite interesting but not hugely different from those in previous runs. For the most part we get two blocks for each value of norm. Interestingly, in this case small endogenous dissenting islands occasionally appear inside these blocks and manage to persist for a few timesteps before disappearing.

Run 5. Higher Noise and Endogenous Neighborhood Norms

The next simulation doubles the noise level from 15% to 30%.

In [28]:

model_5 = simulate(
    ThoughtlessModel(N_AGENT, seed=SEED, noise=0.3,
                     agent_cls=NoisyLazyStatistician),
    N_STEP
)

100.00% [275/275 00:19<00:00]

In [29]:

plot_results(model_5);

At this higher noise level, we see the blocks of consistent norm erode over time, and the endogenous islands that appear persist for much longer than in the previous simulation.

Run 6. Maximum Noise Does Not Induce Maximum Search

In the previous two noisy scenarios, the average agent's radius was consistently higher than in the noiseless scenario, as shown below.

In [30]:

def get_average_radius(model, steps=None):
    return (model.datacollector
                 .get_agent_vars_dataframe()
                 ["radius"]
                 .groupby(level="Step")
                 .mean())

In [31]:

fig, ax = plt.subplots()

get_average_radius(model_2a).plot(label="Model 2a", ax=ax);
get_average_radius(model_4).plot(label="Model 4", ax=ax);
get_average_radius(model_5).plot(label="Model 5", ax=ax);

ax.set_yscale("log");
ax.set_ylabel("Average agent radius");

ax.legend(loc="upper right");

This observation prompts the questions: how much does the average radius increase as the noise increases? Does it increase without bound in the 100% noise case? We answer these questions in the next simulation.

In [32]:

model_6 = simulate(
    ThoughtlessModel(N_AGENT, seed=SEED, noise=1,
                     agent_cls=NoisyLazyStatistician),
    N_STEP
)

100.00% [275/275 00:19<00:00]

In [33]:

plot_results(model_6);

In [34]:

fig, ax = plt.subplots()

get_average_radius(model_2a).plot(label="Model 2a", ax=ax);
get_average_radius(model_4).plot(label="Model 4", ax=ax);
get_average_radius(model_5).plot(label="Model 5", ax=ax);
get_average_radius(model_6).plot(label="Model 6", ax=ax);

ax.set_yscale("log");
ax.set_ylabel("Average agent radius");

ax.legend(loc="upper right");

We see that the average radius does not increase without bound even with maximum noise, which is interesting.

Run 3. Complacency in New Norm

Finally we return to run three from the paper, which we have deferred until after the noisy scenarios for implementation reasons. In this run, most of the time our agents behave noiselessly, but during the ten time steps $130 \leq t < 140$ the system experiences a "shock" which causes each agent to behave completely randomly. We first implement this behavior in the ShockedLazyStatistician subclass of NoisyLazyStatistician.

In [35]:

class ShockedLazyStatistician(NoisyLazyStatistician):
    def __init__(self, unique_id, model, norm=None, radius=None, noise=0, is_shock=None):
        super().__init__(unique_id, model, norm=norm, radius=radius, noise=noise)

        self.is_shock = (lambda _: False) if is_shock is None else is_shock

    def update_norm(self):
        super().update_norm(noise=1 if self.is_shock(self.model.schedule.time) else None)

We now simulate this scenario and visualize the results.

In [36]:

model_3a = simulate(
    ThoughtlessModel(N_AGENT, seed=SEED, noise=0,
                     is_shock=lambda step: 130 <= step < 140,
                     agent_cls=ShockedLazyStatistician),
    N_STEP
)

100.00% [275/275 00:19<00:00]

In [37]:

plot_results(model_3a);

Quite interestingly, many new blocks of consistent norm appear almost immediately after the shock subsides. This behavior is likely due to the fact that an agent's radius cannot increase all that much during the ten time step shock (as was shown in run six). This explanation is consistent with the following plot of average agent radius over time.

In [38]:

ax = get_average_radius(model_3a).plot()

ax.set_yscale("log");
ax.set_ylabel("Average agent radius");

Though not present in the paper, we consider two additional scenarios. The first is adding low-level noise (15%) to all time steps in the above scenario.

In [39]:

model_3b = simulate(
    ThoughtlessModel(N_AGENT, seed=SEED, noise=0.15,
                     is_shock=lambda step: 130 <= step < 140,
                     agent_cls=ShockedLazyStatistician),
    N_STEP
)

100.00% [275/275 00:19<00:00]

In [40]:

plot_results(model_3b);

The results are not surprising given the outcomes of models 3a and 4.

For our last scenario we maintain this noise level, but increase the number of steps by a factor of ten and shock the system whenever $130 \leq t \operatorname{mod} 275 < 140.$

In [41]:

model_3c = simulate(
    ThoughtlessModel(N_AGENT, seed=SEED, noise=0.15,
                     is_shock=lambda step: 130 <= step % N_STEP < 140,
                     agent_cls=ShockedLazyStatistician),
    10 * N_STEP
)

100.00% [2750/2750 00:47<00:00]

In [42]:

plot_results(model_3c, figsize=(16, 10));

The results are quite fascinating but not unexpected.
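The periodic shock schedule used in this last scenario can be checked in isolation. The sketch below is a standalone illustration; the default arguments `period`, `start`, and `stop` are hypothetical names introduced here, and `N_STEP = 275` is taken from the modulus quoted in the text.

```python
N_STEP = 275  # number of steps per period, as quoted in the text


def is_shock(step, period=N_STEP, start=130, stop=140):
    """Return True when `step` falls inside the shock window of its period."""
    return start <= step % period < stop


# Collect every shocked step over ten periods.
shocked = [t for t in range(10 * N_STEP) if is_shock(t)]
n_shocked = len(shocked)

print(n_shocked)
print(shocked[:10])
print(shocked[10:20])
```

Each period contributes ten shocked steps, so ten periods give one hundred in total, with the second window starting 275 steps after the first.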

We have now reproduced all of the scenarios from Epstein's Learning To Be Thoughtless: Social Norms And Individual Computation. The behavior of these models is quite interesting and I certainly learned quite a bit about agent-based modeling using Mesa.

This post is available as a Jupyter notebook here.

In [43]:

%load_ext watermark
%watermark -n -u -v -iv

Last updated: Fri Oct 07 2022

Python implementation: CPython
Python version       : 3.10.6
IPython version      : 8.4.0

seaborn   : 0.11.2
matplotlib: 3.5.3
mesa      : 1.0.0

A Closed-form Solution for the Cholesky Decomposition of the Covariance Matrix of Exchangeable Normal Variables

Austin Rochford — Sun, 25 Sep 2022 04:00:00 GMT

In a previous post I compared several parameterizations for estimating the covariance parameter of latent exchangeable normal random variables using PyMC. These parameterizations were necessary because for quite some time before the official release of PyMC version 4.0, the beta lacked support for multivariate random variables. Support for these distributions has since been added, rendering the workaround parameterizations in that post superfluous. Even so, after writing that post I remained curious about the closed-form solution for the Cholesky decomposition of the covariance matrix for exchangeable normal random variables, and eventually was able to derive it. This post presents that closed-form solution along with a numerical verification of its correctness.

Recall that a vector of random variables is exchangeable if every permutation of its entries has the same joint distribution. A set of exchangeable normal random variables, $X_1, \ldots, X_T \sim N(\mu, \sigma^2)$ is parameterized by their marginal mean, $\mu$, marginal scale, $\sigma$, and the Pearson correlation coefficient

$$\rho = \frac{\mathbb{Cov}(X_i, X_j)}{\sigma^2}.$$

In this post we will assume $\mu = 0$ and $\sigma = 1$ for simplicity; the results are straightforward to generalize. With this assumption, the covariance matrix is

$$ \Sigma_{\rho} = \begin{pmatrix} 1 & \rho & \cdots & \rho & \rho \\ \rho & 1 & \cdots & \rho & \rho \\ \rho & \rho & \cdots & 1 & \rho \\ \rho & \rho & \cdots & \rho & 1 \end{pmatrix} \in \mathbb{R}^{T \times T}. $$

The previous post used SymPy to calculate the Cholesky decomposition of this matrix for small values of $T$. It was obvious that there was a pattern, but we did not pursue it at the time, as the SymPy factorization could be lambdified and the result applied to Aesara tensors to facilitate inference in PyMC. This post will prove and numerically verify the following closed-form solution for the Cholesky decomposition of $\Sigma_{\rho}.$

Let $-\frac{1}{T - 1} \leq \rho < 1$. Define $d_1 = 1$ and $\ell_1 = \rho$. For $1 < j \leq T$, let $$d_j = \sqrt{d_{j - 1}^2 - \ell_{j - 1}^2}$$ and $$\ell_j = \frac{\rho - 1}{d_j} + d_j.$$

Finally, let $$ L_T = \begin{pmatrix} d_1 & 0 & 0 & \cdots & 0 \\ \ell_1 & d_2 & 0 & \cdots & 0 \\ \ell_1 & \ell_2 & d_3 & \cdots & 0 \\ & & & \ddots & \\ \ell_1 & \ell_2 & \ell_3 & \cdots & d_T \end{pmatrix} \in \mathbb{R}^{T \times T}. $$

Claim: $\Sigma_T = L_T L_T^{\top}$, where $\Sigma_T$ denotes the $T \times T$ matrix $\Sigma_{\rho}$ above.

Proof (by induction):

When $T = 1$ or $T = 2$, the claim is easily verified manually.
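For instance, the $T = 2$ case can be written out directly from the definitions above (a worked check added for illustration). Since $d_1 = 1$, $\ell_1 = \rho$, and $d_2 = \sqrt{d_1^2 - \ell_1^2} = \sqrt{1 - \rho^2}$,

$$L_2 L_2^{\top} = \begin{pmatrix} 1 & 0 \\ \rho & \sqrt{1 - \rho^2} \end{pmatrix} \begin{pmatrix} 1 & \rho \\ 0 & \sqrt{1 - \rho^2} \end{pmatrix} = \begin{pmatrix} 1 & \rho \\ \rho & \rho^2 + (1 - \rho^2) \end{pmatrix} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$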

Assume, for all $1 \leq n \leq T$, that $\Sigma_n = L_n L_n^{\top}$.

It will be useful to block partition $L_n$ as $$L_n = \begin{pmatrix} L_{n - 1} & 0 \\ v_n & d_n \end{pmatrix},$$ where $v_n = \begin{pmatrix}\ell_1 & \ell_2 & \cdots & \ell_{n - 1}\end{pmatrix}$ is the row vector of below-diagonal elements in the last row of $L_n$. The inductive hypothesis then becomes $$\begin{align*} \Sigma_n & = L_n L_n^{\top} \\ & = \begin{pmatrix} L_{n - 1} & 0 \\ v_n & d_n \end{pmatrix} \begin{pmatrix} L_{n - 1}^{\top} & v_n^{\top} \\ 0 & d_n \end{pmatrix} \\ & = \begin{pmatrix} \Sigma_{n - 1} & L_{n - 1} v_n^{\top} \\ v_n L_{n - 1}^{\top} & v_n v_n^{\top} + d_n^2 \end{pmatrix}. \end{align*}$$ Equating the bottom right element of these two matrices gives $$1 = (\Sigma_n)_{n, n} = v_n v_n^{\top} + d_n^2.$$ Restated, we have that $v_n v_n^{\top} = 1 - d_n^2$. Equating the off-diagonal elements in the final column gives, for $1 \leq i < n$, $$\rho = (\Sigma_n)_{i, n} = (L_{n - 1} v_n^{\top})_i,$$ so $L_{n - 1} v_n^{\top} = \rho \vec{1}$.

Now, $$L_{T + 1} L_{T + 1}^{\top} = \begin{pmatrix} \Sigma_T & L_T v_{T + 1}^{\top} \\ v_{T + 1} L_T^{\top} & v_{T + 1} v_{T + 1}^{\top} + d_{T + 1}^2 \end{pmatrix}, $$ so it suffices to show that $$L_T v_{T + 1}^{\top} = \rho \vec{1}$$ and $$v_{T + 1} v_{T + 1}^{\top} + d_{T + 1}^2 = 1.$$

First, we have that $$\begin{align*} L_T v_{T + 1}^{\top} & = \begin{pmatrix} L_{T - 1} & 0\\ v_T & d_T \end{pmatrix} \begin{pmatrix} v_T^{\top} \\ \ell_T \end{pmatrix} \\ & = \begin{pmatrix} L_{T - 1} v_T^{\top} \\ v_T v_T^{\top} + d_T \cdot \ell_T \end{pmatrix}. \end{align*}$$ From the inductive hypothesis, $L_{T - 1} v_T^{\top} = \rho \vec{1}$. For the final entry, $$\begin{align*} v_T v_T^{\top} + d_T \cdot \ell_T & = 1 - d_T^2 + d_T \left(\frac{\rho - 1}{d_T} + d_T\right) \\ & = \rho. \end{align*}$$

Second, we have that $$\begin{align*} v_{T + 1} v_{T + 1}^{\top} + d_{T + 1}^2 & = v_T v_T^{\top} + \ell_T^2 + d_{T + 1}^2 \\ & = 1 - d_T^2 + \ell_T^2 + \left(\sqrt{d_T^2 - \ell_T^2}\right)^2 \\ & = 1. \end{align*}$$

QED

To validate these calculations numerically we will use Hypothesis, a Python library for generative testing. First we make the necessary Python imports.

In [1]:

from hypothesis import assume, given
from hypothesis.strategies import integers, floats

In [2]:

import numpy as np

The following function generates the covariance matrix, $\Sigma_{\rho}$, for given values of $\rho$ and $T$.

In [3]:

def get_cov_mat(ρ, T):
    return np.eye(T) + ρ * (1 - np.eye(T))

The next function implements the closed-form solution for the Cholesky decomposition of $\Sigma_{\rho}$ presented above.

In [4]:

def get_cov_mat_chol(ρ, T):
    diag = np.empty(T)
    tril = np.empty(T)
    
    diag[0] = 1
    tril[0] = ρ

    for j in range(1, T):
        diag[j] = np.sqrt(diag[j - 1]**2 - tril[j - 1]**2)
        tril[j] = (ρ - 1) / diag[j] + diag[j] if j < T - 1 else 0
            
    return np.diag(diag) + np.tril(tril, k=-1)

This final function is a Hypothesis test that verifies that we have, in fact, computed the Cholesky decomposition of $\Sigma_{\rho}$.

In [5]:

@given(floats(-0.5, 1), integers(2, 1000))
def test_cov_mat_chol(ρ, T):
    assume(-1 / (T - 1) <= ρ < 1)
    
    Σ = get_cov_mat(ρ, T)
    chol = get_cov_mat_chol(ρ, T)

    # check that chol is lower triangular
    assert (chol == np.tril(chol)).all()
    
    # check that chol is a factorization of Σ
    assert np.allclose(Σ, chol @ chol.T)

The given decorator tells Hypothesis to generate floating point values in the interval $\left[-\frac{1}{2}, 1\right]$ for $\rho$ and integers between 2 and 1,000 for $T$. The upper bound on values generated for $T$ is not necessary but ensures the test runs in a reasonable amount of time. The assume call ensures that $\rho$ is in the appropriate interval based on the size of the random vector. Given this guidance, Hypothesis will generate random values of $\rho$ and $T$ to run the test with.

In [6]:

test_cov_mat_chol()

The test passes, numerically validating our closed-form solution for the Cholesky decomposition.
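As a complementary, dependency-free check for a single choice of $\rho$ and $T$, the sketch below compares the closed-form factor with a textbook Cholesky-Banachiewicz factorization. The helper names `closed_form_chol` and `generic_chol` are hypothetical and not part of the code above, and the sketch assumes $\rho$ lies strictly inside the admissible interval so that every $d_j$ is positive.

```python
import math


def closed_form_chol(rho, T):
    """Closed-form factor: column j holds d_j on the diagonal and l_j below it."""
    d, ell = [1.0], [rho]
    for j in range(1, T):
        d_next = math.sqrt(d[-1] ** 2 - ell[-1] ** 2)
        d.append(d_next)
        if j < T - 1:  # the last column has no entries below the diagonal
            ell.append((rho - 1) / d_next + d_next)
    return [
        [d[c] if r == c else ell[c] if r > c else 0.0 for c in range(T)]
        for r in range(T)
    ]


def generic_chol(A):
    """Textbook Cholesky-Banachiewicz factorization, used only for comparison."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = sum(L[r][k] * L[c][k] for k in range(c))
            if r == c:
                L[r][c] = math.sqrt(A[r][r] - s)
            else:
                L[r][c] = (A[r][c] - s) / L[c][c]
    return L


rho, T = 0.3, 5
sigma = [[1.0 if r == c else rho for c in range(T)] for r in range(T)]

closed = closed_form_chol(rho, T)
generic = generic_chol(sigma)
max_abs_diff = max(
    abs(closed[r][c] - generic[r][c]) for r in range(T) for c in range(T)
)
print(max_abs_diff)
```

Since the Cholesky factor of a positive definite matrix with positive diagonal is unique, the two factors should agree up to floating point error.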

This post is available as a Jupyter notebook here.

In [7]:

%load_ext watermark
%watermark -n -u -v -iv -p hypothesis

Last updated: Sun Sep 25 2022

Python implementation: CPython
Python version       : 3.10.6
IPython version      : 8.4.0

hypothesis: 6.54.6

numpy: 1.23.2

Bayesian Age/Period/Cohort Models in Python with PyMC

Austin Rochford — Mon, 22 Aug 2022 04:00:00 GMT

For my day job, I spend a lot of time thinking about e-commerce analytics and cohort analysis in particular. Statistical age-period-cohort (APC) models are important in many fields such as epidemiology, demography, marketing, and many more. These models also pose some interesting inferential challenges for the unwary. This post shows how to use pymc to build Bayesian APC models in Python and presents a series of increasingly sophisticated systems of priors to resolve the inferential challenges these models pose.

First we make the necessary Python imports and do some light housekeeping.

In [1]:

%matplotlib inline
%config InlineBackend.figure_format = "retina"

In [2]:

from warnings import filterwarnings

In [3]:

from aesara import tensor as at
import arviz as az
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import scipy as sp
import seaborn as sns

In [4]:

filterwarnings("ignore", category=UserWarning, module="pymc")

In [5]:

plt.rc("figure", figsize=(8, 6))
sns.set(color_codes=True)

Load the data

For illustrative purposes, we use data from the US General Social Survey 1974-2002 in this post. This dataset is available from Vincent Arel-Bundock's excellent Rdatasets repository and is originally from the AER (Applied Econometrics with R) R package.

In [6]:

DATA_URL = "https://vincentarelbundock.github.io/Rdatasets/csv/AER/GSS7402.csv"

In [7]:

df = (pd.read_csv(DATA_URL, usecols=["kids", "age", "year"])
        [["age", "year", "kids"]])

In [8]:

df

Out[8]:

	age	year	kids
0	25	2002	0
1	30	2002	1
2	55	2002	1
3	57	2002	2
4	71	2002	2
...	...	...	...
9115	30	1998	3
9116	37	1998	2
9117	59	1998	3
9118	73	1998	2
9119	40	1998	0

9120 rows × 3 columns

Exploratory data analysis

From the Rdatasets site, this data set contains

[c]ross-section data for 9120 women taken from every fourth year of the US General Social Survey between 1974 and 2002 to investigate the determinants of fertility.

The columns of this data frame are

age: the age at which the woman was surveyed,
year: the year the woman was surveyed, and
kids: the number of children the woman had at the time she was surveyed.

These columns contain all the information we need to build an APC model of how the average number of children has varied as a function of the woman's age, the year, and when she was born. Obviously the age column corresponds to the "age" component of the APC model, and the year column corresponds to the "period" component of the APC model. For this data set, the "cohort" component corresponds to the year in which the woman was born, which is simply calculated as the survey year minus the woman's age at the time of the survey.

In [9]:

df["cohort"] = df["year"] - df["age"]

In [10]:

df

Out[10]:

	age	year	kids	cohort
0	25	2002	0	1977
1	30	2002	1	1972
2	55	2002	1	1947
3	57	2002	2	1945
4	71	2002	2	1931
...	...	...	...	...
9115	30	1998	3	1968
9116	37	1998	2	1961
9117	59	1998	3	1939
9118	73	1998	2	1925
9119	40	1998	0	1958

9120 rows × 4 columns

It is this relationship between age, year, and cohort that will make a naively parametrized APC model unidentified, causing inferential challenges.

The following plots show how the average number of children varies with age, period (year), and cohort.

In [11]:

fig, axes = plt.subplots(ncols=3, nrows=2, figsize=(12, 4),
                         gridspec_kw={"height_ratios": (1, 5)})

axes[0, 0].sharex(axes[1, 0]);
axes[0, 1].sharex(axes[1, 1]);
axes[0, 2].sharex(axes[1, 2]);

axes[0, 0].sharey(axes[0, 1]);
axes[0, 1].sharey(axes[0, 2]);

axes[1, 0].sharey(axes[1, 1]);
axes[1, 1].sharey(axes[1, 2]);

# age

axes[0, 0].hist(
    df["age"],
    bins=np.arange(df["age"].min(), df["age"].max() - 1, 3),
    lw=0
);

axes[0, 0].set_ylabel("Number\nof women");

(df.groupby("age")
        ["kids"]
        .mean()
        .plot(ax=axes[1, 0]));

axes[1, 0].set_xlabel("Age");
axes[1, 0].set_ylabel("Average number\nof children");

# year

axes[0, 1].hist(
    df["year"],
    bins=np.arange(df["year"].min(), df["year"].max() - 1),
    lw=0
);

(df.groupby("year")
        ["kids"]
        .mean()
        .plot(ax=axes[1, 1]));

axes[1, 1].set_xlabel("Year");

# cohort

axes[0, 2].hist(
    df["cohort"],
    bins=np.arange(df["cohort"].min(), df["cohort"].max() - 1, 5),
    lw=0
);

(df.groupby("cohort")
        ["kids"]
        .mean()
        .plot(ax=axes[1, 2]));

axes[1, 2].set_xlabel("Cohort (birth year)");
axes[1, 2].set_ylim(0, 4);

fig.tight_layout();

The relationship with age is not surprising: throughout the twenties and thirties the average number of children increases at a relatively stable rate and then levels out from the forties onward. The most notable feature of the period relationship is that it is relatively small compared to the age and cohort relationships. The relationship with cohort is perhaps the most interesting. The year a woman is born determines when she enters the prime childbearing years of twenty to forty, so the peaks and troughs in the cohort relationship roughly correspond to economic and societal conditions when the women in that cohort are in prime childbearing years.

In [12]:

GREAT_DEPRESSION = np.array([1929, 1939])
BABY_BOOM = np.array([1946, 1964])

In [13]:

fig, ax = plt.subplots()

ax.axvspan(GREAT_DEPRESSION[0] - 30, GREAT_DEPRESSION[1] - 20,
           color="C3", alpha=0.5, label="Great Depression");
ax.axvspan(BABY_BOOM[0] - 30, BABY_BOOM[1] - 20,
           color="C2", alpha=0.5, label="Post-WWII Baby Boom");

(df.groupby("cohort")
   ["kids"]
   .mean()
   .plot(label="_nolegend_", ax=ax));

ax.set_xlabel("Cohort (birth year)");

ax.set_ylim(0, 4);
ax.set_ylabel("Average number of children");

ax.legend(loc="upper right", title="Childbearing age during");

Taking a closer look at the cohort relationship, we see that the first trough corresponds to women that would be having most of their children during the Great Depression, and the subsequent peak corresponds to women having children during the post-World War II baby boom.

Taken together, it is reasonable to hypothesize that biological factors related to age and socioeconomic factors related to cohort are sufficient to explain the variation in the number of children without considering period (year). The APC models we build in this post will allow us to evaluate the credibility of this hypothesis.

It is important to note at this point that I am not a demographer and even if I was, it is best not to draw deep demographic conclusions from any analysis conducted on this data set. The purpose of this post is to use this data set as a simple example to illustrate some of the subtleties involved in inference using APC models.

Dispersion

Since the number of children is discrete, we begin by checking its index of dispersion in order to determine which probability distribution is best suited to model it.

In [14]:

def index_of_dispersion(x):
    return x.var() / x.mean()

In [15]:

index_of_dispersion(df["kids"])

Out[15]:

1.5694754102397794

We see that the number of children is overdispersed across all responses.
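For intuition about what this index measures, here is a small standalone sketch using only the standard library. The two toy samples are made up for illustration; like pandas' `var`, `statistics.variance` computes the sample variance.

```python
from statistics import mean, variance


def index_of_dispersion(xs):
    """Sample variance divided by sample mean (one for an ideal Poisson sample)."""
    return variance(xs) / mean(xs)


# Both toy samples have mean 2, but very different spreads.
under = index_of_dispersion([1, 2, 3])  # variance 1, so the index is below one
over = index_of_dispersion([0, 2, 4])   # variance 4, so the index is above one

print(under, over)
```

An index above one, as for the second sample and for the kids column, indicates more variability than a Poisson distribution with the same mean would allow.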

In [16]:

fig, axes = plt.subplots(ncols=3, sharey=True, figsize=(12, 4))

index_of_dispersion(df.groupby("age")["kids"]).plot(label="_nolegend_", ax=axes[0]);
axes[0].axhline(1, c="k", ls="--", label="Unit dispersion (Poisson)");

axes[0].set_xlabel("Age");
axes[0].set_ylabel("Index of dispersion");
axes[0].legend(loc="upper left");

index_of_dispersion(df.groupby("year")["kids"]).plot(ax=axes[1]);
axes[1].axhline(1, c="k", ls="--");

axes[1].set_xlabel("Year");

index_of_dispersion(df.groupby("cohort")["kids"]).plot(ax=axes[2]);
axes[2].axhline(1, c="k", ls="--");

axes[2].set_xlabel("Cohort (birth year)");

fig.tight_layout();

The number of children continues to be overdispersed (with about the same index) across each of the APC dimensions, so we will use the overdispersed negative binomial distribution in our models.

Modeling

We now start to build APC models of the number of children reported. Recall that if $X \sim \operatorname{NB}(\mu_X, \nu)$ and $Y \sim \operatorname{NB}(\mu_Y, \nu)$ then $X + Y \sim \operatorname{NB}(\mu_X + \mu_Y, \nu).$ Using this fact, we reduce the size of the observed data in our model by grouping by the age/period/cohort combination of each observation, counting the number of women with that combination, and summing their number of children.

In [17]:

apc_df = (df.groupby(["age", "year", "cohort"])
            ["kids"]
            .agg(("size", "sum")))

In [18]:

apc_df

Out[18]:

			size	sum
age	year	cohort
18	1974	1956	3	2
	1978	1960	7	3
	1982	1964	2	0
	1986	1968	1	0
	1990	1972	2	0
...	...	...	...	...
89	1986	1897	5	10
	1990	1901	8	11
	1994	1905	14	35
	1998	1909	10	31
	2002	1913	9	11

571 rows × 2 columns

In [19]:

N = apc_df["size"].values
kids = apc_df["sum"].values

Here apc_df["size"] and N are the number of women surveyed at that specific age/period/cohort combination and apc_df["sum"] and kids are their total number of children.

For convenience we encode age, year, and cohort as ordinal factors.

In [20]:

i, age_map = apc_df.index.get_level_values("age").factorize(sort=True)
j, year_map = apc_df.index.get_level_values("year").factorize(sort=True)
k, cohort_map = apc_df.index.get_level_values("cohort").factorize(sort=True)
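As a rough picture of what this encoding produces, the following pure-Python sketch mimics the shape of the result of `factorize(sort=True)` on a toy list of years. The helper `factorize_sorted` is a hypothetical stand-in; pandas returns a NumPy array of codes and an Index of unique values rather than lists, and handles missing values, which this sketch does not.

```python
def factorize_sorted(values):
    """Return (codes, uniques) with uniques sorted in ascending order."""
    uniques = sorted(set(values))
    position = {value: code for code, value in enumerate(uniques)}
    return [position[value] for value in values], uniques


years = [2002, 1974, 1998, 1974, 2002]
codes, uniques = factorize_sorted(years)

print(codes)
print(uniques)
```

The codes play the role of `i`, `j`, and `k` (integer positions used to index the effect vectors), while the unique values play the role of `age_map`, `year_map`, and `cohort_map`.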

Flat priors

We are now ready to build our first APC model, which will use flat priors for the age, period, and cohort effects. Throughout this post, let $i$ denote a sample's age, $j$ denote its period, and $k$ denote its cohort. All of our models will take the form

$$\begin{align*} \eta_{ijk} & = \eta_0 + \alpha_i + \beta_j + \gamma_k, \\ y_{ijk} & \sim \operatorname{NB}(N_{ijk} \cdot \exp(\eta_{ijk}), \nu). \end{align*}$$

Here $N_{ijk}$ is the number of women in the relevant age/period/cohort combination, $y_{ijk}$ is the total number of children they have, and $\nu$ controls the amount of overdispersion. Note that this is a negative binomial regression model with an offset term.
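To spell out the offset (a short restatement, assuming the first argument of $\operatorname{NB}$ is the mean, as with the `mu` argument of `pm.NegativeBinomial`):

$$\log \mathbb{E}(y_{ijk}) = \log\left(N_{ijk} \cdot \exp(\eta_{ijk})\right) = \log N_{ijk} + \eta_{ijk}.$$

In other words, $\log N_{ijk}$ enters the linear predictor with its coefficient fixed at one, and $\exp(\eta_{ijk})$ is the expected number of children per woman in that age/period/cohort combination.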

Using pymc we specify the flat prior distributions

$$\pi(\eta_0), \pi(\alpha_i), \pi(\beta_j), \pi(\gamma_k) \propto 1.$$

In [21]:

coords = {
    "age": age_map,
    "year": year_map,
    "cohort": cohort_map
}

In [22]:

SEED = 1234567890 # for reproducibility

In [23]:

with pm.Model(coords=coords, rng_seeder=SEED) as flat_model:
    η0 = pm.Flat("η0")
    α = pm.Flat("α", dims="age")
    β = pm.Flat("β", dims="year")
    γ = pm.Flat("γ", dims="cohort")

Using the prior $\nu \sim \operatorname{Half-}N(2.5^2)$, we specify $\eta_{ijk}$ and the observational likelihood as above.

In [24]:

with flat_model:
    η = η0 + α[i] + β[j] + γ[k]
    
    ν = pm.HalfNormal("ν", 2.5)
    pm.NegativeBinomial("kids", N * at.exp(η), ν, observed=kids)

We are now ready to sample from the posterior distribution of this model.

In [25]:

CHAINS = 6

SAMPLE_KWARGS = {
    "cores": CHAINS,
    "random_seed": [SEED + i for i in range(CHAINS)]
}

In [26]:

with flat_model:
    flat_trace = pm.sample(**SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (6 chains in 6 jobs)
NUTS: [η0, α, β, γ, ν]

100.00% [12000/12000 06:47<00:00 Sampling 6 chains, 3,014 divergences]

Sampling 6 chains for 1_000 tune and 1_000 draw iterations (6_000 + 6_000 draws total) took 428 seconds.
The chain contains only diverging samples. The model is probably misspecified.
The acceptance probability does not match the target. It is 0.3571, but should be close to 0.8. Try to increase the number of tuning steps.
The chain contains only diverging samples. The model is probably misspecified.
The acceptance probability does not match the target. It is 0.1821, but should be close to 0.8. Try to increase the number of tuning steps.
The acceptance probability does not match the target. It is 0.9528, but should be close to 0.8. Try to increase the number of tuning steps.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There were 14 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The chain contains only diverging samples. The model is probably misspecified.
The acceptance probability does not match the target. It is 0.2558, but should be close to 0.8. Try to increase the number of tuning steps.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters. A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

We see quite a few warnings as a result of sampling. Most notably, over half of the post-tuning samples resulted in divergences, and the $\hat{R}$ statistics are well above the acceptable diagnostic threshold of 1.01. For simplicity, we focus on the $\hat{R}$ statistics.
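As a rough illustration of what $\hat{R}$ measures, the sketch below computes the classic Gelman-Rubin statistic on simulated chains. This is a simplification: ArviZ's default is a rank-normalized split version, and the function `classic_rhat` and the simulated chains are hypothetical, introduced only for this example.

```python
import random
from statistics import mean, variance


def classic_rhat(chains):
    """Classic (non-split, non-rank-normalized) potential scale reduction."""
    n = len(chains[0])                                          # draws per chain
    within = mean(variance(chain) for chain in chains)          # W
    between = n * variance([mean(chain) for chain in chains])   # B
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(0)

# Four chains sampling the same distribution: should be well mixed.
mixed = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]

# Two chains stuck near zero and two stuck near five: poorly mixed.
stuck = [[rng.gauss(loc, 1) for _ in range(1000)] for loc in (0, 0, 5, 5)]

rhat_mixed = classic_rhat(mixed)
rhat_stuck = classic_rhat(stuck)
print(rhat_mixed, rhat_stuck)
```

When chains explore different regions, the between-chain variance dominates the within-chain variance and the statistic rises well above one.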

In [27]:

flat_rhat = az.rhat(flat_trace)

In [28]:

ax = (flat_rhat.max()
               .to_array()
               .to_series()
               .plot(kind="barh"))
ax.axvline(1, c="k", ls="--");

ax.set_xlabel(r"$\hat{R}$");

ax.invert_yaxis();
ax.set_ylabel(None);

We see that all of the parameters have $\hat{R}$ statistics significantly above one (two are above five), which indicates severe problems with inference. Focusing on the component of $\alpha$ with the highest $\hat{R}$ statistic, we see that the chains are not well mixed; they have explored completely different portions of the parameter space.

In [29]:

α_worst = flat_rhat["α"].idxmax()

az.plot_trace(flat_trace, var_names="α", coords={"age": α_worst}, divergences=False);

To quote the folk theorem of statistical computing:

When you have computational problems, often there’s a problem with your model.

With this wisdom in mind, experience indicates that this behavior in the sampler often arises when our model is not identified, so it is time to return to our observation above that the relationship between age, year, and cohort would lead to inference problems.

Lack of identification

Briefly, a model is not identified if two different sets of parameters lead to the same (log) likelihood of the observed data. To illustrate the fact that this flat model is not identified, first we evaluate its log likelihood function when $\eta_0 = \alpha_i = \beta_j = \gamma_k = 0$ and $\nu = 1$.

In [30]:

zero_pt = {
    "η0": 0,
    "α": np.zeros_like(age_map),
    "β": np.zeros_like(year_map),
    "γ": np.zeros_like(cohort_map),
    "ν_log__": 0
}

In [31]:

flat_logp = flat_model.compile_logp()
flat_logp(zero_pt)

Out[31]:

array(-2731.94778354)

Since a woman's cohort (birth year) is the year she was surveyed minus her age at the time of survey, we have the following constraint between age, period and cohort.

In [32]:

all((df["age"] - df["year"] + df["cohort"]) == 0)

Out[32]:

True

Due to this constraint and the linearity of our model,

$$\eta_{ijk} = \eta_0 + \alpha_i + \beta_j + \gamma_k,$$

the following function produces parameters that have the same likelihood as those above, given the observed data.

In [33]:

def get_unident_pt(c=1):
    return {
        "η0": 0,
        "α": c * age_map,
        "β": -c * year_map,
        "γ": c * cohort_map,
        "ν_log__": 0
    }

In [34]:

flat_logp(get_unident_pt())

Out[34]:

array(-2731.94778354)

We test this assertion for 100 random values of c in the interval $[-10, 10]$.

In [35]:

rng = np.random.default_rng(SEED)

for c in rng.uniform(-10, 10, size=100):
    np.testing.assert_allclose(flat_logp(zero_pt), flat_logp(get_unident_pt(c=c)))

This experiment confirms there are infinitely many parameter values that lead to the same likelihood.
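The same cancellation can be seen without PyMC by working directly with the linear predictor. The sketch below is a standalone illustration under stated assumptions: the survey cells and baseline effects are randomly generated rather than taken from the data, and the helpers `eta` and `shift` are hypothetical names.

```python
import random

rng = random.Random(0)

# Hypothetical survey cells (age, year, cohort) with cohort = year - age.
cells = []
for _ in range(20):
    age = rng.randint(18, 89)
    year = rng.choice(range(1974, 2003, 4))
    cells.append((age, year, year - age))

# Arbitrary baseline effects, keyed by the raw age/year/cohort values.
alpha = {age: rng.gauss(0, 1) for age, _, _ in cells}
beta = {year: rng.gauss(0, 1) for _, year, _ in cells}
gamma = {cohort: rng.gauss(0, 1) for _, _, cohort in cells}


def eta(cell, alpha, beta, gamma, eta0=0.0):
    """Linear predictor for one age/period/cohort cell."""
    age, year, cohort = cell
    return eta0 + alpha[age] + beta[year] + gamma[cohort]


def shift(alpha, beta, gamma, c):
    """Add c * age, -c * year, and c * cohort to the respective effects."""
    return (
        {a: v + c * a for a, v in alpha.items()},
        {y: v - c * y for y, v in beta.items()},
        {k: v + c * k for k, v in gamma.items()},
    )


shifted = shift(alpha, beta, gamma, c=3.7)
max_change = max(
    abs(eta(cell, alpha, beta, gamma) - eta(cell, *shifted)) for cell in cells
)
print(max_change)
```

Because age - year + cohort is zero for every cell, the shift leaves each linear predictor, and therefore the likelihood, unchanged up to floating point error, whatever the baseline effects are.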

Unidentified models such as the APC model with flat priors lead to both theoretical and practical challenges. From a theoretical perspective, we should be wary of interpreting a model that produces infinitely many sets of parameters that are equally compatible with our data. From a computational perspective, many inference algorithms struggle with unidentified models (recall the folk theorem of statistical computing). It is important to note that these computational issues are not unique to the Bayesian approach to inference using this APC model. With flat priors, the maximum a posteriori (MAP) estimator for this model is the same as the maximum likelihood estimator (MLE).

In [36]:

with flat_model:
    flat_mle, opt_result = pm.find_MAP(return_raw=True)

100.00% [21/21 00:00<00:00 logp = -2,867, ||grad|| = 1,577.9]

In [37]:

opt_result.success

Out[37]:

False

We see that scipy's numerical optimizer has failed to find the MLE/MAP due to the unidentified nature of this model.

Normal priors

Regularization is a common approach to resolving model identification problems. From a Bayesian perspective, regularization is equivalent to using non-flat priors on our parameters $\eta_0$, $\alpha_i$, $\beta_j$, and $\gamma_k$. It is reasonable to start with normally-distributed priors on these parameters. These normal priors are, in fact, equivalent to Tikhonov regularization/ridge regression.

For this model, we let $\eta_0, \alpha_i, \beta_j, \gamma_k \sim N(0, 2.5^2)$.

In [38]:

with pm.Model(coords=coords, rng_seeder=SEED) as norm_model:
    η0 = pm.Normal("η0", 0, 2.5)
    α = pm.Normal("α", 0, 2.5, dims="age")
    β = pm.Normal("β", 0, 2.5, dims="year")
    γ = pm.Normal("γ", 0, 2.5, dims="cohort")

Since these normal priors assign the highest probability density to zero, they favor smaller parameter values if all other factors are equal, just as in Tikhonov regularization/ridge regression.
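To make the correspondence explicit (a standard identity, written here with $\theta$ standing for the collection of $\eta_0$, $\alpha_i$, $\beta_j$, and $\gamma_k$), the log density of independent $N(0, 2.5^2)$ priors is

$$\log \pi(\theta) = -\frac{1}{2 \cdot 2.5^2} \sum_m \theta_m^2 + \text{const},$$

so the log posterior is

$$\log p(y\ |\ \theta, \nu) - \lambda \sum_m \theta_m^2 + \log \pi(\nu) + \text{const}, \quad \lambda = \frac{1}{2 \cdot 2.5^2} = 0.08.$$

Apart from the prior term for $\nu$, this is the log likelihood with a ridge penalty of strength $\lambda$ on the age, period, and cohort effects.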

The rest of the model is defined as before.

In [39]:

with norm_model:
    η = η0 + α[i] + β[j] + γ[k]
    
    ν = pm.HalfNormal("ν", 2.5)
    pm.NegativeBinomial("kids", N * at.exp(η), ν, observed=kids)

We now sample from the posterior distribution of this model.

In [40]:

with norm_model:
    norm_trace = pm.sample(**SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (6 chains in 6 jobs)
NUTS: [η0, α, β, γ, ν]

100.00% [12000/12000 03:05<00:00 Sampling 6 chains, 801 divergences]

Sampling 6 chains for 1_000 tune and 1_000 draw iterations (6_000 + 6_000 draws total) took 206 seconds.
The acceptance probability does not match the target. It is 0.9187, but should be close to 0.8. Try to increase the number of tuning steps.
There were 800 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.6221, but should be close to 0.8. Try to increase the number of tuning steps.
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

We see that overall there are fewer warnings than when we sampled from the flat model, but there are still hundreds of divergences. The $\hat{R}$ statistics are much closer to their ideal value of one, but still larger than 1.01, so we should continue to be skeptical of the quality of these samples.

In [41]:

norm_rhat = az.rhat(norm_trace)

In [42]:

ax = (norm_rhat.max()
               .to_array()
               .to_series()
               .plot(kind="barh"))
ax.axvline(1, c="k", ls="--");

ax.set_xlim(0.99, 1.08);
ax.set_xlabel(r"$\hat{R}$");

ax.invert_yaxis();
ax.set_ylabel(None);

To demonstrate the impact of these elevated $\hat{R}$ statistics, we visualize the samples from the $\alpha_i$ component with the highest $\hat{R}$ statistic below.

In [43]:

α_worst = norm_rhat["α"].idxmax()

az.plot_trace(norm_trace, var_names="α", coords={"age": α_worst}, divergences=False);

In the kernel density estimates on the left, we see at least one chain that is meandering around the parameter space.

In [44]:

fig, axes = plt.subplots(nrows=CHAINS, sharex=True, sharey=True)

for chain, ax in enumerate(axes):
    (norm_trace.posterior["α"]
               .sel(age=α_worst, chain=chain)
               .plot(ax=ax));
    
    ax.set_xlabel(None);
    
    ax.set_yticks([]);
    ax.set_ylabel(f"Chain {chain + 1}");
    
    ax.set_title(None);

axes[-1].set_xlabel("Draw");

fig.suptitle("α");
fig.tight_layout();

Plotting each chain's trace on its own for clarity, we see that the second chain demonstrates particularly concerning behavior, producing relatively constant samples for long periods of time and then jumping to completely different portions of the parameter space.

Based on these results, we see that introducing regularization through normal priors has reduced but not eliminated the inference issues with this model.

Noncentered normal priors¶

The APC model with normal priors shows that adding some regularization helps, so for our next model we increase the amount of regularization. We do so by placing hierarchical normal priors on the APC parameters,

$$\alpha_i \sim N(0, \sigma_{\alpha}^2),\ \sigma_{\alpha} \sim \operatorname{Half}-N(2.5^2),$$
$$\beta_j \sim N(0, \sigma_{\beta}^2),\ \sigma_{\beta} \sim \operatorname{Half}-N(2.5^2),$$
$$\gamma_k \sim N(0, \sigma_{\gamma}^2),\ \sigma_{\gamma} \sim \operatorname{Half}-N(2.5^2).$$

In reality, we use an equivalent noncentered parameterization of this prior that often leads to more efficient sampling.
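The equivalence of the two parameterizations can be checked by simulation. The sketch below, which uses plain numpy rather than PyMC and a hypothetical number of draws, samples from the prior both ways and compares quantiles. The noncentered version samples a standard normal variable and rescales it, which removes the direct prior dependence between the effect and its scale; this is commonly cited as the reason it tends to sample more efficiently.

```python
import numpy as np

rng = np.random.default_rng(0)
N_DRAW = 200_000

# sigma ~ HalfNormal(2.5), drawn as the absolute value of a normal variate
sigma = np.abs(rng.normal(0, 2.5, size=N_DRAW))

# centered: draw the effect directly from N(0, sigma^2)
centered = rng.normal(0, sigma)

# noncentered: draw a standard normal variate and rescale it by sigma
delta = rng.normal(0, 1, size=N_DRAW)
noncentered = delta * sigma

quantiles = [0.05, 0.25, 0.5, 0.75, 0.95]
centered_q = np.quantile(centered, quantiles)
noncentered_q = np.quantile(noncentered, quantiles)
```

Up to Monte Carlo error, `centered_q` and `noncentered_q` agree, since both constructions define the same prior distribution.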

In [45]:

# the scale necessary to make a halfnormal distribution have unit variance
HALFNORMAL_SCALE = 1 / np.sqrt(1 - 2 / np.pi)

In [46]:

def noncentered_normal(name, *, dims):
    Δ = pm.Normal(f"Δ_{name}", 0, 1, dims=dims)
    σ = pm.HalfNormal(f"σ_{name}", 2.5 * HALFNORMAL_SCALE)
    
    return pm.Deterministic(name, Δ * σ, dims=dims)

In [47]:

with pm.Model(coords=coords, rng_seeder=SEED) as nc_model:
    η0 = pm.Normal("η0", 0, 2.5)

    α = noncentered_normal("α", dims="age")
    β = noncentered_normal("β", dims="year")
    γ = noncentered_normal("γ", dims="cohort")

The rest of the model is defined as before, with the addition of a MutableData container for the offset. This container will be useful when we visualize the posterior predictive distribution.

In [48]:

with nc_model:
    η = η0 + α[i] + β[j] + γ[k]
    
    N_ = pm.MutableData("N", N)
    ν = pm.HalfNormal("ν", 2.5)
    pm.NegativeBinomial("kids", N_ * at.exp(η), ν, observed=kids)

We now sample from the posterior distribution of this model.

In [49]:

with nc_model:
    nc_trace = pm.sample(target_accept=0.9, **SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (6 chains in 6 jobs)
NUTS: [η0, Δ_α, σ_α, Δ_β, σ_β, Δ_γ, σ_γ, ν]

100.00% [12000/12000 00:29<00:00 Sampling 6 chains, 0 divergences]

Sampling 6 chains for 1_000 tune and 1_000 draw iterations (6_000 + 6_000 draws total) took 50 seconds.
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

No divergences occurred during sampling, though we do see a low effective sample size warning.

In [50]:

nc_rhat = az.rhat(nc_trace)

In [51]:

ax = (nc_rhat.max()
             .to_array()
             .to_series()
             .plot(kind="barh"))
ax.axvline(1, c="k", ls="--");

ax.set_xlim(0.99, 1.01);
ax.set_xlabel(r"$\hat{R}$");

ax.invert_yaxis();
ax.set_ylabel(None);

The $\hat{R}$ statistics are comfortably below 1.01, a sign of significantly improved sampling. The trace plot for the component of $\alpha$ with the highest $\hat{R}$ statistic also shows quite good mixing.

In [52]:

az.plot_trace(nc_trace, var_names="α", coords={"age": nc_rhat["α"].idxmax()}, divergences=False);

We now sample from the posterior predictive distribution of this model and visualize the results.
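The role of the common offset can be illustrated in isolation. In the sketch below we assume, as the model definition suggests, that the likelihood mean is the offset times $\exp(\eta)$, so replacing every offset with the same large value and dividing the draws by it puts predictions on a per-respondent scale. The values of `eta` and `nu` are hypothetical, and numpy's `(n, p)` parameterization is converted from the mean/dispersion parameterization by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

N_PP = 10_000      # common offset, matching the value used in this post
eta = np.log(1.9)  # hypothetical linear predictor (log expected children per respondent)
nu = 2.0           # hypothetical dispersion parameter

mu = N_PP * np.exp(eta)

# with n = nu and p = nu / (nu + mu), numpy's negative binomial has
# mean mu and variance mu + mu^2 / nu
draws = rng.negative_binomial(nu, nu / (nu + mu), size=200_000)

# dividing by the offset recovers a per-respondent quantity
per_respondent = draws / N_PP
```

The mean of `per_respondent` is close to $\exp(\eta)$, which is why the plots below can be compared directly with the observed average number of children.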

In [53]:

N_PP = 10_000 

def scale_pp(x):
    return x / N_PP

In [54]:

with nc_model:
    pm.set_data({"N": np.full_like(N, N_PP)})
    nc_trace.extend(pm.sample_posterior_predictive(nc_trace))

100.00% [6000/6000 00:01<00:00]

In [55]:

CI_WIDTH = 0.95

In [56]:

fig, axes = plt.subplots(ncols=3, sharey=True, figsize=(12, 4))

# age

(df.groupby("age")
        ["kids"]
        .mean()
        .plot(c="k", ls="--", label="Observed", ax=axes[0]))

nc_pp_by_age = (
    nc_trace.posterior_predictive
            .assign_coords(kids_dim_0=apc_df.index.get_level_values("age").values)
            ["kids"]
            .pipe(scale_pp)
            .groupby("kids_dim_0")
)

axes[0].fill_between(age_map,
                     nc_pp_by_age.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     nc_pp_by_age.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     alpha=0.5);
axes[0].plot(age_map, nc_pp_by_age.mean(dim=("chain", "draw", "kids_dim_0")));


axes[0].set_xlabel("Age");
axes[0].set_ylabel("Children\n(posterior predictive)");

axes[0].legend(loc="upper left");

# year

(df.groupby("year")
        ["kids"]
        .mean()
        .plot(c="k", ls="--", ax=axes[1]));

nc_pp_by_year = (
    nc_trace.posterior_predictive
            .assign_coords(kids_dim_0=apc_df.index.get_level_values("year").values)
            ["kids"]
            .pipe(scale_pp)
            .groupby("kids_dim_0")
)

axes[1].fill_between(year_map,
                     nc_pp_by_year.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     nc_pp_by_year.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     alpha=0.5);
axes[1].plot(year_map, nc_pp_by_year.mean(dim=("chain", "draw", "kids_dim_0")));

axes[1].set_xlabel("Year");

# cohort

(df.groupby("cohort")
        ["kids"]
        .mean()
        .plot(c="k", ls="--", ax=axes[2]));

nc_pp_by_cohort = (
    nc_trace.posterior_predictive
            .assign_coords(kids_dim_0=apc_df.index.get_level_values("cohort").values)
            ["kids"]
            .pipe(scale_pp)
            .groupby("kids_dim_0")
)

axes[2].fill_between(cohort_map,
                     nc_pp_by_cohort.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     nc_pp_by_cohort.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     alpha=0.5);
axes[2].plot(cohort_map, nc_pp_by_cohort.mean(dim=("chain", "draw", "kids_dim_0")));


axes[2].set_xlabel("Cohort (birth year)");

fig.tight_layout();

Here we see solid agreement between the model and the observed data, perhaps even a bit too much agreement, as our APC effects seem to mimic the yearly noise in the observed data. It would be reasonable to expect that the true APC effects are smoother than what we see here. Our next two models show how including this behavior in our model can result in better sampling and more effective predictions.

Noncentered normal random walk priors¶

We can incorporate the expectation that the number of children only changes a bit from year-to-year (and therefore the relationship is smoother than above) by placing noncentered Gaussian random walk priors on $\alpha_i$, $\beta_j$, and $\gamma_k$.
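A noncentered Gaussian random walk is a scaled cumulative sum of standard normal innovations. The following numpy sketch, with a hypothetical innovation scale and walk length, shows the two properties that matter here: neighboring values differ by small normal steps, and the marginal variance grows linearly along the walk.

```python
import numpy as np

rng = np.random.default_rng(0)

N_DRAW, N_STEP = 50_000, 20
sigma = 0.3  # hypothetical innovation scale

# each row is one prior draw of a random walk with N_STEP values
innov = rng.normal(0, 1, size=(N_DRAW, N_STEP))
walk = sigma * innov.cumsum(axis=1)

# successive differences recover the scaled innovations, so adjacent
# effects differ by N(0, sigma^2) steps
step_sd = np.diff(walk, axis=1).std()

# the variance of the t-th value is t * sigma^2
marginal_var = walk.var(axis=0)
```

Because adjacent effects are tied together in this way, the prior discourages the jagged year-to-year pattern that independent normal priors allow.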

In [57]:

def noncentered_normal_rw(name, *, dims):    
    innov = pm.Normal(f"innov_{name}", 0, 1, dims=dims)
    σ = pm.HalfNormal(f"σ_{name}", 2.5 * HALFNORMAL_SCALE)
    
    return pm.Deterministic(name, innov.cumsum() * σ, dims=dims)

In [58]:

with pm.Model(coords=coords, rng_seeder=SEED) as rw_model:
    η0 = pm.Normal("η0", 0, 2.5)
    
    α = noncentered_normal_rw("α", dims="age")
    β = noncentered_normal_rw("β", dims="year")
    γ = noncentered_normal_rw("γ", dims="cohort")

The rest of the model is defined as before.

In [59]:

with rw_model:
    η = η0 + α[i] + β[j] + γ[k]
    
    N_ = pm.MutableData("N", N)
    ν = pm.HalfNormal("ν", 2.5)
    pm.NegativeBinomial("kids", N_ * at.exp(η), ν, observed=kids)

We now sample from the posterior distribution of this model.

In [60]:

with rw_model:
    rw_trace = pm.sample(target_accept=0.95, **SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (6 chains in 6 jobs)
NUTS: [η0, innov_α, σ_α, innov_β, σ_β, innov_γ, σ_γ, ν]

100.00% [12000/12000 04:15<00:00 Sampling 6 chains, 0 divergences]

Sampling 6 chains for 1_000 tune and 1_000 draw iterations (6_000 + 6_000 draws total) took 277 seconds.

There are no divergences, no $\hat{R}$ warnings, and no effective sample size warnings.

In [61]:

rw_rhat = az.rhat(rw_trace)

In [62]:

ax = (rw_rhat.max()
             .to_array()
             .to_series()
             .plot(kind="barh"))
ax.axvline(1, c="k", ls="--");

ax.set_xlim(0.99, 1.01);
ax.set_xlabel(r"$\hat{R}$");

ax.invert_yaxis();
ax.set_ylabel(None);

We now sample from and visualize the posterior predictive distribution.

In [63]:

with rw_model:
    pm.set_data({"N": np.full_like(N, N_PP)})
    rw_trace.extend(pm.sample_posterior_predictive(rw_trace))

100.00% [6000/6000 00:01<00:00]

In [64]:

fig, axes = plt.subplots(ncols=3, sharey=True, figsize=(12, 4))

# age

(df.groupby("age")
        ["kids"]
        .mean()
        .plot(c="k", ls="--", label="Observed", ax=axes[0]))

rw_pp_by_age = (
    rw_trace.posterior_predictive
            .assign_coords(kids_dim_0=apc_df.index.get_level_values("age").values)
            ["kids"]
            .pipe(scale_pp)
            .groupby("kids_dim_0")
)

axes[0].fill_between(age_map,
                     rw_pp_by_age.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     rw_pp_by_age.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     alpha=0.5, color="C1");
axes[0].plot(age_map, nc_pp_by_age.mean(dim=("chain", "draw", "kids_dim_0")),
             label="Noncentered");
axes[0].plot(age_map, rw_pp_by_age.mean(dim=("chain", "draw", "kids_dim_0")),
             c="C1", label="Random walk");


axes[0].set_xlabel("Age");
axes[0].set_ylabel("Children\n(posterior predictive)");

axes[0].legend(loc="upper left");

# year

(df.groupby("year")
        ["kids"]
        .mean()
        .plot(c="k", ls="--", ax=axes[1]));

rw_pp_by_year = (
    rw_trace.posterior_predictive
            .assign_coords(kids_dim_0=apc_df.index.get_level_values("year").values)
            ["kids"]
            .pipe(scale_pp)
            .groupby("kids_dim_0")
)

axes[1].fill_between(year_map,
                     rw_pp_by_year.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     rw_pp_by_year.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     alpha=0.5, color="C1");
axes[1].plot(year_map, nc_pp_by_year.mean(dim=("chain", "draw", "kids_dim_0")));
axes[1].plot(year_map, rw_pp_by_year.mean(dim=("chain", "draw", "kids_dim_0")),
             c="C1");

axes[1].set_xlabel("Year");

# cohort

(df.groupby("cohort")
        ["kids"]
        .mean()
        .plot(c="k", ls="--", ax=axes[2]));

rw_pp_by_cohort = (
    rw_trace.posterior_predictive
            .assign_coords(kids_dim_0=apc_df.index.get_level_values("cohort").values)
            ["kids"]
            .pipe(scale_pp)
            .groupby("kids_dim_0")
)

axes[2].fill_between(cohort_map,
                     rw_pp_by_cohort.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     rw_pp_by_cohort.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     alpha=0.5, color="C1");
axes[2].plot(cohort_map, nc_pp_by_cohort.mean(dim=("chain", "draw", "kids_dim_0")));
axes[2].plot(cohort_map, rw_pp_by_cohort.mean(dim=("chain", "draw", "kids_dim_0")),
             c="C1");

axes[2].set_xlabel("Cohort (birth year)");

fig.tight_layout();

We see that the posterior predictive mean for the random walk model is considerably smoother year-over-year than that of the noncentered normal model; this smoothness is particularly evident in the cohort plot.

We use Pareto-smoothed importance sampling leave-one-out cross validation (PSIS-LOO) to quantify the extent to which the random walk model is an improvement over the noncentered normal model.

In [65]:

traces = {
    "Noncentered": nc_trace,
    "Random walk": rw_trace
}

In [66]:

comp_df = pm.compare(traces)

In [67]:

comp_df

Out[67]:

	rank	loo	p_loo	d_loo	weight	se	dse	warning	loo_scale
Random walk	0	-1903.937438	29.843949	0.000000	1.0	16.355066	0.00000	False	log
Noncentered	1	-1962.129697	74.518856	58.192259	0.0	16.478794	5.32143	False	log

In [68]:

fig, ax = plt.subplots()

az.plot_compare(comp_df, insample_dev=False, plot_ic_diff=False,
                textsize=9, ax=ax);

We see that the random walk model is not just a visual improvement over the noncentered normal model, but shows an appreciable quantitative improvement in predictive accuracy (as judged by PSIS-LOO).
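One rough way to read the comparison table is to express the difference in PSIS-LOO between the two models in units of its standard error. The arithmetic below uses the `d_loo` and `dse` values reported in the table; treating the ratio like a z-score is a common heuristic rather than a formal test.

```python
# values copied from the "Noncentered" row of the comparison table
d_loo = 58.192259  # difference in PSIS-LOO relative to the random walk model
dse = 5.32143      # standard error of that difference

# the difference expressed in standard errors
z = d_loo / dse
```

Here the difference is roughly eleven standard errors, far larger than would be expected from noise in the estimate alone.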

It is particularly gratifying to see a model that better encodes our intuitions about the smoothness of APC effects perform better both from a computational and predictive perspective.

Smoothing splines¶

Smoothing splines are an even more powerful way to encode smoothness assumptions into our APC model; we now compare a spline model to the noncentered normal random walk model. We will not include a full introduction to smoothing splines (or their Bayesian treatment) here, but we follow Milad Kharratzadeh’s excellent short paper Splines in Stan. For more details about splines see two previous posts, Bayesian Splines with Heteroskedastic Noise in Python with PyMC3 and Fitting a Simple Additive Model in Python.

We only use smoothing splines to model the age and cohort effects here, since they have many more observed values than the period (year) in which the survey was conducted.

In [69]:

ax = (df[["age", "year", "cohort"]]
         .nunique()
         .plot(kind="barh"))

ax.set_xlabel("Number of unique values observed");
ax.invert_yaxis();

We define ten knots each for age and cohort, and use scipy to form the cubic B-spline design matrix.
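For readers curious about what a B-spline design matrix contains, here is a minimal sketch of the Cox-de Boor recursion in plain numpy. It illustrates the standard textbook definition and is not a reimplementation of the scipy call used in this post; `bspline_basis` is a hypothetical helper. Under this definition, `K` distinct knots and degree `d` give `K - d - 1` basis functions, which sum to one between knot `d` and knot `K - d - 1`.

```python
import numpy as np


def bspline_basis(x, knots, degree):
    """Evaluate every B-spline basis function of the given degree at x."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)

    # degree zero: indicator functions of the half-open knot intervals
    basis = ((knots[:-1] <= x[:, None]) & (x[:, None] < knots[1:])).astype(float)

    # raise the degree one step at a time with the Cox-de Boor recursion
    for d in range(1, degree + 1):
        n_basis = knots.size - d - 1
        raised = np.zeros((x.size, n_basis))

        for i in range(n_basis):
            left = (x - knots[i]) / (knots[i + d] - knots[i])
            right = (knots[i + d + 1] - x) / (knots[i + d + 1] - knots[i + 1])
            raised[:, i] = left * basis[:, i] + right * basis[:, i + 1]

        basis = raised

    return basis


knots = np.linspace(15, 96, 10)  # ten evenly spaced knots over the age range
x = np.linspace(knots[3], knots[6], 50, endpoint=False)

dmat = bspline_basis(x, knots, 3)  # one row per point, one column per basis function
```

Each row of `dmat` holds the weights that turn a vector of spline coefficients into the value of the smooth curve at one point, which is how a design matrix enters the linear predictor.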

In [70]:

N_KNOT = 10

In [71]:

age_knots = np.linspace(15, 96, N_KNOT)
age_dmat = sp.interpolate.BSpline(age_knots, np.eye(N_KNOT), 3)(
    apc_df.index.get_level_values("age").values
)

In [72]:

cohort_knots = np.linspace(1885, 1986, N_KNOT)
cohort_dmat = sp.interpolate.BSpline(cohort_knots, np.eye(N_KNOT), 3)(
    apc_df.index.get_level_values("cohort").values
)

The spline model uses different coordinates than the previous models.

In [73]:

spline_coords = {
    "age_knot": np.arange(N_KNOT),
    "year": year_map,
    "cohort_knot": np.arange(N_KNOT)
}

We use the priors on the spline coefficients $\alpha_i$ and $\gamma_k$ suggested in Splines in Stan (rescaled for computational efficiency).

In [74]:

with pm.Model(coords=spline_coords, rng_seeder=SEED) as spline_model:
    η0 = pm.Normal("η0", 0, 2.5)
    
    α = noncentered_normal_rw("α", dims="age_knot") / np.sqrt(N_KNOT)
    β = noncentered_normal_rw("β", dims="year")
    γ = noncentered_normal_rw("γ", dims="cohort_knot") / np.sqrt(N_KNOT)

The rest of the model is defined similarly to before.

In [75]:

with spline_model:
    η = η0 + at.dot(age_dmat, α) + β[j] + at.dot(cohort_dmat, γ)
    
    N_ = pm.MutableData("N", N)
    ν = pm.HalfNormal("ν", 2.5)
    pm.NegativeBinomial("kids", N_ * at.exp(η), ν, observed=kids)

We now sample from the posterior distribution of this model.

In [76]:

with spline_model:
    spline_trace = pm.sample(target_accept=0.99, **SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (6 chains in 6 jobs)
NUTS: [η0, innov_α, σ_α, innov_β, σ_β, innov_γ, σ_γ, ν]

100.00% [12000/12000 07:21<00:00 Sampling 6 chains, 0 divergences]

Sampling 6 chains for 1_000 tune and 1_000 draw iterations (6_000 + 6_000 draws total) took 462 seconds.

There are no divergences, no $\hat{R}$ warnings, and no effective sample size warnings.

In [77]:

spline_rhat = az.rhat(spline_trace)

In [78]:

ax = (spline_rhat.max()
             .to_array()
             .to_series()
             .plot(kind="barh"))
ax.axvline(1, c="k", ls="--");

ax.set_xlim(0.99, 1.01);
ax.set_xlabel(r"$\hat{R}$");

ax.invert_yaxis();
ax.set_ylabel(None);

We now sample from and visualize the posterior predictive distribution.

In [79]:

with spline_model:
    pm.set_data({"N": np.full_like(N, N_PP)})
    spline_trace.extend(pm.sample_posterior_predictive(spline_trace))

100.00% [6000/6000 00:01<00:00]

In [80]:

fig, axes = plt.subplots(ncols=3, sharey=True, figsize=(12, 4))

# age

(df.groupby("age")
        ["kids"]
        .mean()
        .plot(c="k", ls="--", label="Observed", ax=axes[0]))

spline_pp_by_age = (
    spline_trace.posterior_predictive
                .assign_coords(kids_dim_0=apc_df.index.get_level_values("age").values)
                ["kids"]
                .pipe(scale_pp)
                .groupby("kids_dim_0")
)

axes[0].fill_between(age_map,
                     spline_pp_by_age.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     spline_pp_by_age.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     alpha=0.5, color="C2");
axes[0].plot(age_map, rw_pp_by_age.mean(dim=("chain", "draw", "kids_dim_0")),
             c="C1", label="Random walk");
axes[0].plot(age_map, spline_pp_by_age.mean(dim=("chain", "draw", "kids_dim_0")),
             c="C2", label="Spline");


axes[0].set_xlabel("Age");
axes[0].set_ylabel("Children\n(posterior predictive)");

axes[0].legend(loc="upper left");

# year

(df.groupby("year")
        ["kids"]
        .mean()
        .plot(c="k", ls="--", ax=axes[1]));

spline_pp_by_year = (
    spline_trace.posterior_predictive
                .assign_coords(kids_dim_0=apc_df.index.get_level_values("year").values)
                ["kids"]
                .pipe(scale_pp)
                .groupby("kids_dim_0")
)

axes[1].fill_between(year_map,
                     spline_pp_by_year.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     spline_pp_by_year.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     alpha=0.5, color="C2");
axes[1].plot(year_map, rw_pp_by_year.mean(dim=("chain", "draw", "kids_dim_0")),
             c="C1");
axes[1].plot(year_map, spline_pp_by_year.mean(dim=("chain", "draw", "kids_dim_0")),
             c="C2");

axes[1].set_xlabel("Year");

# cohort

(df.groupby("cohort")
        ["kids"]
        .mean()
        .plot(c="k", ls="--", ax=axes[2]));

spline_pp_by_cohort = (
    spline_trace.posterior_predictive
                .assign_coords(kids_dim_0=apc_df.index.get_level_values("cohort").values)
                ["kids"]
                .pipe(scale_pp)
                .groupby("kids_dim_0")
)

axes[2].fill_between(cohort_map,
                     spline_pp_by_cohort.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     spline_pp_by_cohort.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     alpha=0.5, color="C2");
axes[2].plot(cohort_map, rw_pp_by_cohort.mean(dim=("chain", "draw", "kids_dim_0")),
             c="C1");
axes[2].plot(cohort_map, spline_pp_by_cohort.mean(dim=("chain", "draw", "kids_dim_0")),
             c="C2");

axes[2].set_xlabel("Cohort (birth year)");

fig.tight_layout();

We see that the posterior predictive mean for the spline model is considerably smoother year-over-year than that of the noncentered normal random walk model; this smoothness is particularly evident in the cohort plot.

We compare this model with the two previous ones.

In [81]:

traces["Spline"] = spline_trace

In [82]:

comp_df = pm.compare(traces)

In [83]:

comp_df

Out[83]:

	rank	loo	p_loo	d_loo	weight	se	dse	warning	loo_scale
Spline	0	-1900.014235	9.728448	0.000000	0.657634	16.891238	0.000000	False	log
Random walk	1	-1903.937438	29.843949	3.923203	0.342366	16.355066	5.064813	False	log
Noncentered	2	-1962.129697	74.518856	62.115462	0.000000	16.478794	7.580858	False	log

In [84]:

fig, ax = plt.subplots()

az.plot_compare(comp_df, insample_dev=False, plot_ic_diff=False,
                textsize=9, ax=ax);

We see that the spline model is a slight improvement over the noncentered normal random walk model from the perspective of PSIS-LOO.

Age/cohort submodel¶

With these five models we have iterated from an unidentified model, to models that were less degenerate but still sampled poorly, to smoothly varying models that sample well. Having overcome these issues with our initial models, we now turn to the question of whether including a period (year) effect is actually beneficial.

In [85]:

fig, ax = plt.subplots()

(df.groupby("year")
        ["kids"]
        .mean()
        .plot(c="k", ls="--", label="Observed", ax=ax));

ax.fill_between(year_map,
                spline_pp_by_year.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                spline_pp_by_year.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                alpha=0.5, color="C2");
ax.plot(year_map, spline_pp_by_year.mean(dim=("chain", "draw", "kids_dim_0")),
             c="C2", label="Spline");

ax.set_xlabel("Year");
ax.set_ylabel("Children\n(posterior predictive)");
ax.legend();

The posterior predictive effect of period on the number of children from the previous model appears to be rather flat, giving us reason to suspect that it may not contribute much to the model's predictive accuracy. We now quantify the impact of removing period effects on the predictive accuracy of our model.

For $\alpha_i$ and $\gamma_k$, we use the same priors as the APC spline model, but we set $\beta_j \equiv 0$.

In [86]:

with pm.Model(coords=spline_coords, rng_seeder=SEED) as ac_model:
    η0 = pm.Normal("η0", 0, 2.5)
    
    α = noncentered_normal_rw("α", dims="age_knot") / np.sqrt(age_knots.size)
    γ = noncentered_normal_rw("γ", dims="cohort_knot") / np.sqrt(cohort_knots.size)
    
    η = η0 + at.dot(age_dmat, α) + at.dot(cohort_dmat, γ)
    
    N_ = pm.MutableData("N", N)
    ν = pm.HalfNormal("ν", 2.5)
    pm.NegativeBinomial("kids", N_ * at.exp(η), ν, observed=kids)

We now sample from the posterior distribution of this model.

In [87]:

with ac_model:
    ac_trace = pm.sample(target_accept=0.99, **SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (6 chains in 6 jobs)
NUTS: [η0, innov_α, σ_α, innov_γ, σ_γ, ν]

100.00% [12000/12000 05:47<00:00 Sampling 6 chains, 0 divergences]

Sampling 6 chains for 1_000 tune and 1_000 draw iterations (6_000 + 6_000 draws total) took 369 seconds.

There are no divergences, no $\hat{R}$ warnings, and no effective sample size warnings.

In [88]:

ac_rhat = az.rhat(ac_trace)

In [89]:

ax = (ac_rhat.max()
             .to_array()
             .to_series()
             .plot(kind="barh"))
ax.axvline(1, c="k", ls="--");

ax.set_xlim(0.99, 1.01);
ax.set_xlabel(r"$\hat{R}$");

ax.invert_yaxis();
ax.set_ylabel(None);

We now sample from and visualize the posterior predictive distribution.

In [90]:

with ac_model:
    pm.set_data({"N": np.full_like(N, N_PP)})
    ac_trace.extend(pm.sample_posterior_predictive(ac_trace))

100.00% [6000/6000 00:01<00:00]

In [91]:

fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(12, 5))

# age

(df.groupby("age")
   ["kids"]
   .mean()
   .plot(c="k", ls="--", label="Observed", ax=axes[0]))

ac_pp_by_age = (
    ac_trace.posterior_predictive
            .assign_coords(kids_dim_0=apc_df.index.get_level_values("age").values)
            ["kids"]
            .pipe(scale_pp)
            .groupby("kids_dim_0")
)

axes[0].fill_between(age_map,
                     ac_pp_by_age.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     ac_pp_by_age.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     alpha=0.5, color="C3");
axes[0].plot(age_map, spline_pp_by_age.mean(dim=("chain", "draw", "kids_dim_0")),
             c="C2", label="APC");
axes[0].plot(age_map, ac_pp_by_age.mean(dim=("chain", "draw", "kids_dim_0")),
             c="C3", ls="--", label="AC");


axes[0].set_xlabel("Age");
axes[0].set_ylabel("Children\n(posterior predictive)");

axes[0].legend(loc="upper left");

# cohort

(df.groupby("cohort")
        ["kids"]
        .mean()
        .plot(c="k", ls="--", ax=axes[1]));

ac_pp_by_cohort = (
    ac_trace.posterior_predictive
            .assign_coords(kids_dim_0=apc_df.index.get_level_values("cohort").values)
            ["kids"]
            .pipe(scale_pp)
            .groupby("kids_dim_0")
)

axes[1].fill_between(cohort_map,
                     ac_pp_by_cohort.quantile((1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     ac_pp_by_cohort.quantile(1 - (1 - CI_WIDTH) / 2, dim=("chain", "draw", "kids_dim_0")),
                     alpha=0.5, color="C3");
axes[1].plot(cohort_map, spline_pp_by_cohort.mean(dim=("chain", "draw", "kids_dim_0")),
             c="C2");
axes[1].plot(cohort_map, ac_pp_by_cohort.mean(dim=("chain", "draw", "kids_dim_0")),
             c="C3", ls="--");

axes[1].set_xlabel("Cohort (birth year)");

fig.suptitle("Spline models");
fig.tight_layout();

We see that the posterior predictions from the APC and AC spline models are visually indistinguishable.

In [92]:

spline_traces = {
    "APC": spline_trace,
    "AC": ac_trace
}

In [93]:

spline_comp_df = pm.compare(spline_traces)

In [94]:

spline_comp_df

Out[94]:

	rank	loo	p_loo	d_loo	weight	se	dse	warning	loo_scale
APC	0	-1900.014235	9.728448	0.000000	1.0	16.891238	0.000000	False	log
AC	1	-1900.521989	8.158803	0.507754	0.0	16.921409	0.913015	False	log

In [95]:

fig, ax = plt.subplots()

az.plot_compare(spline_comp_df, insample_dev=False, plot_ic_diff=False,
                textsize=9, ax=ax);

Comparing the APC and AC spline models using PSIS-LOO shows that they are practically indistinguishable, so we prefer the more parsimonious AC model. (Models that omit period have the added advantage that they are easier to use in forecasting, as we need not hypothesize a specific functional form for future period effects.)

This post is available as a Jupyter notebook here.

In [96]:

%load_ext watermark
%watermark -n -u -v -iv

Last updated: Mon Aug 22 2022

Python implementation: CPython
Python version       : 3.10.4
IPython version      : 8.4.0

scipy     : 1.8.0
numpy     : 1.22.2
seaborn   : 0.11.2
aesara    : 2.5.1
arviz     : 0.12.1
pymc      : 4.0.0b6
pandas    : 1.4.1
matplotlib: 3.5.1

A Modern Introduction to Probabilistic Programming with PyMC

Austin Rochford — Sat, 29 Jan 2022 05:00:00 GMT

I first started working with probabilistic programming about ten years ago (in late 2012 or early 2013) using PyMC2. At the time I was preparing to leave a PhD program in pure math for a data science career in industry. I found that the Bayesian approach to statistics and machine learning appealed to my mathematical sensibilities. I soon found PyMC3 and loved how it provided a fast path to translate the mathematical models I had in my head into executable code. A lot has changed in the past ten years; I have grown professionally and technically, the theory behind applied Bayesian inference (in particular adaptive Hamiltonian Monte Carlo algorithms) has blossomed, and PyMC and its related libraries have matured. This post presents an introduction to what I think of as modern probabilistic programming with PyMC from a perspective that my experience over the last decade has shaped. It is modern in two respects:

It features modern Bayesian inference algorithms and best practices, preferring adaptive Hamiltonian Monte Carlo samplers over Metropolis-Hastings-style ones. It introduces recently developed/enhanced diagnostic tools to diagnose the convergence (or not) of these algorithms.
It features the cutting-edge beta version of PyMC v4 and relies heavily on Aesara and ArviZ, both of which did not exist ten years ago.

This post is an extended version of an introductory talk I gave in January 2022 for the Data Umbrella and PyMC sprint meant to introduce potential new contributors to PyMC. You can find a video of this talk on YouTube.

First we make the necessary Python imports and do some light housekeeping.

In [1]:

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:

from warnings import filterwarnings

In [3]:

from aesara import pprint
from matplotlib import pyplot as plt, ticker
import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns
from sklearn.preprocessing import StandardScaler

In [4]:

filterwarnings('ignore', category=RuntimeWarning, message="overflow encountered in exp")
filterwarnings('ignore', category=UserWarning, module='pymc',
               message="Unable to validate shapes: Cannot sample from flat variable")

In [5]:

FIG_SIZE = np.array([8, 6])
plt.rc('figure', figsize=FIG_SIZE)

dollar_formatter = ticker.StrMethodFormatter("${x:,.2f}")
pct_formatter = ticker.StrMethodFormatter("{x:.1%}")

sns.set(color_codes=True)

Probabilistic programming from three perspectives¶

In this section we motivate probabilistic programming from three perspectives: theoretical, statistical, and computational.

Theoretical¶

There is a pervasive perspective (MIT Sloan, Forbes, storytellingwithdata.com provide just a few examples) that data science and analytics involves collecting, storing, and analyzing data to produce a story about the world that is compelling enough to change the behavior of individuals, businesses, or systems. As with all sufficiently popular perspectives this is not incorrect, but the relationship between the data and storytelling warrants closer examination. Consider the following diagram, Charles Joseph Minard's famous map of Napoleon's Russian campaign.

Edward Tufte, one of the foremost authorities on the subject of data visualization, considers that "[this] may well be the best statistical graphic ever drawn" in his classic book The Visual Display of Quantitative Information (p 40). This figure certainly uses data (the size of Napoleon's army at various points in time, distance covered, temperature, and more) to tell a compelling story about the perils of invading Russia. In this case, the data leave very little room for differences of interpretation (although they leave plenty of opportunity for beautiful design). Almost no data set we encounter in our daily work will ever tell a story as obviously as this one does. For me, the central theoretical insight of probabilistic programming is that, given the uncertainty and ambiguity inherent in most data sets, it is more productive to start with stories of how the data might have been generated, then use the observed data to reason about those stories.

Statistical¶

This discussion is quite abstract; we now use the language of statistics to make it more concrete. Rephrased in terms of conditional probability, the popular perspective that a story flows naturally from the data is analogous to searching for a story such that the conditional probability $P(\text{Story}\ |\ \text{Data})$ (the probability of the story given the data) is quite high. The central insight of probabilistic programming above says that this search is much easier if instead we begin by telling stories of how the data may have been generated, which is analogous to specifying $P(\text{Data}\ |\ \text{Story})$. Having specified $P(\text{Data}\ |\ \text{Story})$, how can we then reverse this conditional probability to arrive at $P(\text{Story}\ |\ \text{Data})$? The most straightforward answer is to use Bayes' theorem.

Recast in our notation, Bayes' theorem becomes

$$P(\text{Story}\ |\ \text{Data}) = \frac{P(\text{Data}\ |\ \text{Story}) \cdot P(\text{Story})}{P(\text{Data})}.$$

Allowing $\mathcal{D}$ to denote our data and $\theta$ to denote the unknown parameters of our data generating process we get a form of Bayes' theorem that is familiar to statisticians:

$$P(\theta\ |\ \mathcal{D}) = \frac{P(\mathcal{D}\ |\ \theta) \cdot P(\theta)}{P(\mathcal{D})}.$$

The denominator of this expression, the marginal probability of the data, is calculated as

$$P(\mathcal{D}) = \int P(\mathcal{D}\ |\ \theta) \cdot P(\theta)\ d\theta,$$

which leads to our third perspective on probabilistic programming.

Computational¶

For many realistic models (as we will call stories about how our data was generated from now on), this integral for $P(\mathcal{D})$ is analytically intractable and must be approximated. The most common approach taken to approximate this integral in probabilistic programming is through Monte Carlo methods. Monte Carlo methods use well-crafted sequences of random numbers to approximate integrals. The following simple example illustrates the basic idea behind Monte Carlo methods.

A Monte Carlo approximation of $\pi$¶

Generate 5,000 random points uniformly distributed in the square $-1 \leq x, y \leq 1$.

In [6]:

SEED = 123456789 # for reproducibility

rng = np.random.default_rng(SEED)

In [7]:

N = 5_000

x, y = rng.uniform(-1, 1, size=(2, N))

In [8]:

fig, ax = plt.subplots(subplot_kw={'aspect': 'equal'})

ax.scatter(x, y, alpha=0.5);

ax.set_xticks([-1, 0, 1]);
ax.set_xlim(-1.01, 1.01);

ax.set_yticks([-1, 0, 1]);
ax.set_ylim(-1.01, 1.01);

Consider the points that fall inside the unit circle centered at the origin, $x^2 + y^2 < 1$.

In [9]:

in_circle = x**2 + y**2 < 1

In [10]:

fig, ax = plt.subplots(subplot_kw={'aspect': 'equal'})

ax.scatter(x[~in_circle], y[~in_circle], c='C1', alpha=0.5);
ax.scatter(x[in_circle], y[in_circle], c='C2', alpha=0.5);

ax.add_artist(plt.Circle((0, 0), 1, fill=False, edgecolor='k'));

ax.set_xticks([-1, 0, 1]);
ax.set_xlim(-1.01, 1.01);

ax.set_yticks([-1, 0, 1]);
ax.set_ylim(-1.01, 1.01);

What fraction of these 5,000 points lie inside the unit circle? Intuitively it should be the ratio of the area of the circle, $\pi \cdot 1^2 = \pi$, to the area of the square, $2^2 = 4$. We see that four times the proportion of random points that lie in the circle gives a decent approximation of $\pi$.

In [11]:

4 * in_circle.mean()

Out[11]:

3.1488

Here we have used geometric reasoning to arrive at a Monte Carlo approximation of the integral

$$\pi = 4 \int_0^1 \sqrt{1 - x^2}\ dx.$$

The integral corresponds to a quarter of the area of the circle, so four times its value is $\pi$.
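The same integral can also be approximated directly, without the geometric argument, by averaging the integrand over uniform draws on $[0, 1]$. The following is a minimal sketch of that idea; the names `u`, `pi_hat`, and `pi_se` and the sample size of 100,000 are choices made for illustration here.

```python
import numpy as np

rng = np.random.default_rng(123456789)  # same seed as above, for reproducibility

# The integral of sqrt(1 - x^2) over [0, 1] is the expected value of
# sqrt(1 - U^2) for U uniform on [0, 1], so a sample mean approximates it.
u = rng.uniform(0, 1, size=100_000)
integrand = np.sqrt(1 - u**2)

pi_hat = 4 * integrand.mean()
# Monte Carlo standard error of the estimate
pi_se = 4 * integrand.std(ddof=1) / np.sqrt(u.size)

print(f"estimate: {pi_hat:.4f} +/- {pi_se:.4f}")
```

The standard error shrinks like one over the square root of the number of draws, which is why Monte Carlo estimates improve slowly but steadily as more samples are added.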

Probabilistic programming with PyMC¶

With these three motivating perspectives in hand, we are ready to put theory into action by using PyMC to solve a problem that is nontrivial, but that we can still answer with pen and paper. In my experience, recreating known results with a new technique is vital to building the confidence to apply that technique to novel situations.

The Monty Hall problem¶

Let's Make a Deal is an American game show that had its first broadcast run on NBC and ABC from 1963 to 1976. In (an idealized version of) one segment the show's host, Monty Hall, presented the contestant with three doors, two of which led to goats and one of which led to a sports car. The contestant would win whichever "prize" was behind the door they eventually opened. Monty would ask the contestant to choose a door, but before revealing the prize behind it, he would open one of the other two doors to reveal a goat. After showing the contestant the goat, he would offer the contestant the chance to switch their choice of door.

The famed Monty Hall problem asks whether the contestant should switch their choice of door in order to maximize the probability that they win the sports car and not one of the goats. To the befuddlement of many probability and statistics students, the answer is yes, the contestant should switch their choice of doors after Monty opens one. This result is counterintuitive at first, but arises from the fact that Monty knows which prize is behind each door. He will always show the contestant a goat after they make their initial choice (otherwise there would be no drama for the television viewers), and this fact transfers enough of his knowledge to the contestant to make switching the correct choice.

We will first solve the Monty Hall problem through exact calculation, then show how to derive the same answer with PyMC using probabilistic programming techniques. For the sake of this example, suppose the contestant initially chose the first door and Monty opened the third door to reveal a goat (this situation is shown above). Initially we have no information about which door the car is behind so

$$P(\text{Car behind 1}) = P(\text{Car behind 2}) = P(\text{Car behind 3}) = \frac{1}{3}.$$

The following table shows which door Monty can open after the contestant's initial choice based on the true location of the car, which he knows.

	Monty can open
Given car behind	Door 1	Door 2	Door 3
Door 1	No	Yes	Yes
Door 2	No	No	Yes
Door 3	No	Yes	No

In two of three cases (when the car is behind the second or third door), Monty has no choice of which door to open if he wants to maintain suspense for the viewing audience. These are the cases where his choice gives us enough information to make switching doors the correct choice.

The following table translates the previous one into conditional probabilities.

	Probability Monty opens
Given car behind	Door 1	Door 2	Door 3
Door 1	0	$\frac{1}{2}$	$\frac{1}{2}$
Door 2	0	0	1
Door 3	0	1	0

The second entry in the first row corresponds to

$$P(\text{Monty opens 2}\ |\ \text{Car behind 1}) = \frac{1}{2},$$

and other entries can be interpreted similarly.

Using Bayes' theorem,

$$P(\text{Car behind 1}\ |\ \text{Monty opens 3}) = \frac{P(\text{Monty opens 3}\ |\ \text{Car behind 1}) \cdot P(\text{Car behind 1})}{P(\text{Monty opens 3})}.$$

The terms in the numerator are easy to calculate given our prior ignorance of the location of the prize and the table above,

$$ \begin{align*} P(\text{Car behind 1}) & = \frac{1}{3}\text{ and} \\ P(\text{Monty opens 3}\ |\ \text{Car behind 1}) & = \frac{1}{2}. \end{align*} $$

Using the law of total probability, the denominator is

$$ \begin{align*} P(\text{Monty opens 3}) & = P(\text{Monty opens 3}\ |\ \text{Car behind 1}) \cdot P(\text{Car behind 1}) \\ &\ \ \ \ \ \ + P(\text{Monty opens 3}\ |\ \text{Car behind 2}) \cdot P(\text{Car behind 2}) \\ &\ \ \ \ \ \ + P(\text{Monty opens 3}\ |\ \text{Car behind 3}) \cdot P(\text{Car behind 3}) \\ & = \frac{1}{2} \cdot \frac{1}{3} + 1 \cdot \frac{1}{3} + 0 \cdot \frac{1}{3} \\ & = \frac{1}{2}. \end{align*} $$

Finally,

$$ \begin{align*} P(\text{Car behind 1}\ |\ \text{Monty opens 3}) & = \frac{P(\text{Monty opens 3}\ |\ \text{Car behind 1}) \cdot P(\text{Car behind 1})}{P(\text{Monty opens 3})} \\ & = \frac{\frac{1}{2} \cdot \frac{1}{3}}{\frac{1}{2}} = \frac{1}{3}. \end{align*} $$

Since the car is behind the first or second door with probability one,

$$P(\text{Car behind 2}\ |\ \text{Monty opens 3}) = \frac{2}{3},$$

and the contestant doubles their chances of winning the sports car by switching doors.
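The arithmetic above can be double-checked in a few lines of NumPy. This is a minimal sketch of the same Bayes' theorem calculation applied to all three doors at once; the variable names `prior`, `likelihood`, `evidence`, and `posterior` are introduced here for illustration.

```python
import numpy as np

# Prior: the car is equally likely to be behind each of the three doors.
prior = np.full(3, 1 / 3)

# Likelihood: P(Monty opens door 3 | car behind door k), i.e. the "Door 3"
# column of the conditional probability table above.
likelihood = np.array([1 / 2, 1, 0])

# The law of total probability gives the denominator, P(Monty opens 3).
evidence = (likelihood * prior).sum()

# Bayes' theorem, applied to all three doors at once.
posterior = likelihood * prior / evidence

print(evidence)
print(posterior)
```

The three entries of `posterior` correspond to the first, second, and third doors, so the second entry is the probability of winning by switching.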

These calculations were tedious but not particularly hard. Given that computers are great at automating the boring stuff, we'll now see how to arrive at the same answer by telling the story of how Monty chooses which door to open in code and then allowing PyMC to (approximately) calculate the probability the car is behind each door.

We begin by importing PyMC.

In [12]:

import pymc as pm

Initially we have no idea which door the car is behind, so we treat each possibility as equally likely. (Note that we will use zero-indexed doors in our code; the first door is door number zero, the second door is door number one, and the third door is door number two in this code.)

In [13]:

with pm.Model(rng_seeder=SEED) as monty_model:
    car = pm.DiscreteUniform("car", 0, 2)

Here we have created a PyMC model using the Model context manager. It is helpful to think of a PyMC Model as a container for the story about how our data was generated. The DiscreteUniform distribution tells PyMC that there is a one-third chance the car is behind each of the doors.

We now express the conditional probability that Monty opens each door given the location of the car in code.

	Probability Monty opens
Given car behind	Door 1	Door 2	Door 3
Door 1	0	$\frac{1}{2}$	$\frac{1}{2}$
Door 2	0	0	1
Door 3	0	1	0

In [14]:

from aesara import tensor as at

In [15]:

p_open = at.switch(
    at.eq(car, 0),
    np.array([0, 0.5, 0.5]), # it is behind the first door
    at.switch(
        at.eq(car, 1),
        np.array([0, 0, 1]), # it is behind the second door
        np.array([0, 1, 0])  # it is behind the third door
    )
)

This code looks a bit odd at first, but is easy to interpret with a bit of guidance. The Aesara function switch acts like an if statement, evaluating the condition in its first argument, returning its second argument if that condition is true and its third argument otherwise. The Aesara function eq is a simple test of equality. We will return to the question of why we choose to define p_open using Aesara instead of directly using Python's built-in if/else and == constructs after this example.
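For intuition, the nested switch calls simply select one row of the conditional probability table. The sketch below mimics that selection with NumPy's where for a concrete car location. It is an analogy only: NumPy evaluates eagerly on actual numbers, while Aesara builds a symbolic expression that is evaluated later, and the helper name `p_open_numpy` is hypothetical.

```python
import numpy as np


def p_open_numpy(car):
    """Eager NumPy analogue of the nested switch expression above."""
    return np.where(
        car == 0,
        np.array([0, 0.5, 0.5]),   # the car is behind the first door
        np.where(
            car == 1,
            np.array([0, 0, 1]),   # the car is behind the second door
            np.array([0, 1, 0]),   # the car is behind the third door
        ),
    )


# Each zero-indexed car location picks out one row of the table.
for car_location in range(3):
    print(car_location, p_open_numpy(car_location))
```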

So far we have expressed our initial ignorance about the location of the car and Monty's process for deciding which door to open in code. All that remains is to specify that he did, in fact, open the third door.

In [16]:

with monty_model:
    opened = pm.Categorical("opened", p_open, observed=2)

The most notable part of this code is the observed=2 argument passed to the Categorical distribution, which tells PyMC that Monty opened the third door.

With opened specified, we are ready to have PyMC perform inference and get a Monte Carlo approximation of the probability that the car is behind each door given that the contestant initially chose the first one and Monty opened the third one.

In [17]:

with monty_model:
    monty_trace = pm.sample()

Multiprocess sampling (2 chains in 2 jobs)
Metropolis: [car]

100.00% [4000/4000 00:00

Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 2 seconds.
The number of effective samples is smaller than 25% for some parameters.

After performing Monte Carlo inference using this model (colloquially referred to as "sampling"), we are left with 2,000 draws of the location of the car.

By counting the proportion of samples that are equal to zero we can see whether or not the contestant should switch their choice of door.

In [19]:

(monty_trace.posterior["car"] == 0).mean()

Out[19]:

<xarray.DataArray 'car' ()>
array(0.314)

We see that our Monte Carlo approximation is quite close to the true probability that the car is behind the first door we calculated above, $\frac{1}{3}$. The relevant posterior probabilities are shown graphically below.

In [20]:

ax = (monty_trace.posterior["car"]
                 .to_dataframe()
                 .value_counts(normalize=True, sort=False)
                 .plot.bar(rot=0))

ax.set_xticklabels(["First", "Second"]);
ax.set_xlabel("Door");

ax.set_yticks(np.linspace(0, 1, 4));
ax.yaxis.set_major_formatter(pct_formatter);
ax.set_ylabel("Probability car is behind the door");
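As an independent check on the sampled result, the story can also be told forwards many times in plain NumPy, keeping only the simulated games in which Monty opened the third door. This is a sketch of simple rejection-style conditioning, which is a different technique from the Metropolis sampler PyMC used above; the names `P_OPEN`, `car_sim`, and `opened_sim` and the number of simulated games are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(123456789)
N_SIM = 200_000

# Rows: true (zero-indexed) car location; columns: probability Monty opens
# each door, given that the contestant initially chose the first door.
P_OPEN = np.array([
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])

# Tell the story forwards: place the car, then let Monty open a door.
car_sim = rng.integers(0, 3, size=N_SIM)
u = rng.uniform(size=N_SIM)
# Inverse-CDF sampling from the row of P_OPEN for each simulated car location
opened_sim = (u[:, None] >= P_OPEN.cumsum(axis=1)[car_sim]).sum(axis=1)

# Condition on the observation: keep games where Monty opened the third door.
kept = opened_sim == 2
p_first = (car_sim[kept] == 0).mean()
p_second = (car_sim[kept] == 1).mean()

print(p_first, p_second)
```

This brute-force approach works here because the observed event is common; in models with many parameters or continuous data almost every simulated data set would be rejected, which is one reason more sophisticated samplers are needed.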

Recreating the known solution to the Monty Hall problem gives us confidence to proceed to more complex problems with probabilistic programming. Before we do so we highlight two important components of PyMC that we have used to solve this problem.

PyMC distributions¶

Both the DiscreteUniform and Categorical objects above are examples of PyMC distributions. PyMC provides implementations of dozens of probability distributions out of the box. These distributions are the building blocks from which we build our models. In the Monty Hall problem, the DiscreteUniform distribution expressed the fact that we had no reason to believe any door was more likely to contain the car than the others before Monty opened one of them. The Categorical distribution expressed that Monty opened a single door, with probabilities given by p_open.

The probability distributions PyMC provides range from the commonly used normal and Poisson distributions

to the less frequently used zero-inflated binomial and Kumaraswamy distributions.

It is straightforward to implement new distributions following the example of existing distributions. Implementing a new distribution is often a good first contribution to PyMC because the effort can be relatively small and self-contained.

Aesara¶

When defining p_open for the Monty Hall problem we briefly mentioned Aesara. Aesara is the tensor computation library that PyMC uses for mathematical calculations. It fills a niche in the data science software ecosystem that is similar to TensorFlow or PyTorch.

From the Aesara documentation:

Aesara is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Aesara features:

Tight integration with NumPy – Use numpy.ndarray in Aesara-compiled functions.

Efficient symbolic differentiation – Aesara does your derivatives for functions with one or many inputs.

Speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.

Dynamic C/JAX/Numba code generation – Evaluate expressions faster.

Aesara is based on Theano, which has been powering large-scale computationally intensive scientific investigations since 2007.

PyMC relies on each of these features of Aesara, to varying degrees, in the implementation of its Monte Carlo samplers. The second point is most important for the purposes of this introduction. Stated more plainly, Aesara automates calculus for us so our Monte Carlo approximation algorithms can use information about the gradient of the probability distributions from the model in order to sample efficiently. It is this increased efficiency from Aesara-generated gradients that makes us willing to tolerate the definition of p_open above, which is somewhat awkward compared to the equivalent Python if/else and == implementation.

Automating calculus¶

Recall from calculus that

$$\frac{d}{dx} \left(x^3\right) = 3 x^2.$$

We can derive this result using Aesara as follows. First we define a scalar variable $x$,

In [21]:

x = at.scalar("x")

Then we let $y = x^3$,

In [22]:

y = x**3

We now ask Aesara to differentiate y with respect to x.

In [23]:

pprint(at.grad(y, x))

Out[23]:

'((fill((x ** 3), 1.0) * 3) * (x ** (3 - 1)))'

Squinting a bit, fill((x ** 3), 1.0) is a tensor with the same shape as x ** 3 filled with the value 1.0, which for our scalar is just 1, so the result simplifies to $3 x^2$ as expected. This example is obviously simple, but Aesara truly shines when the function to be differentiated is complex and involves many inputs. Such complex functions arise naturally from modern models of complex phenomena.
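As a sanity check on the symbolic result, the derivative can also be approximated numerically. The sketch below uses central finite differences in NumPy, which is a different and less accurate technique than Aesara's symbolic differentiation; the step size `h` and the helper names `f` and `central_difference` are illustrative choices.

```python
import numpy as np


def f(x):
    return x**3


def central_difference(func, x, h=1e-5):
    """Approximate the derivative of func at x with a central difference."""
    return (func(x + h) - func(x - h)) / (2 * h)


xs = np.linspace(-2, 2, 9)
approx = central_difference(f, xs)
exact = 3 * xs**2

# Largest absolute disagreement between the numerical and exact derivatives
print(np.max(np.abs(approx - exact)))
```

Finite differences need extra function evaluations for every input and are sensitive to the choice of step size, which is part of why exact, automatically derived gradients are so valuable for models with many parameters.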

Hamiltonian Monte Carlo, the curse of dimensionality, and differential geometry¶

PyMC uses Aesara for automatic gradient computation in order to implement a class of Monte Carlo algorithms known as Hamiltonian Monte Carlo algorithms. These algorithms take their inspiration from the field of Hamiltonian mechanics in physics. They are essential for performing inference on realistic models with many parameters. To understand why, we consider the curse of dimensionality. From Wikipedia,

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.

One geometric interpretation of the curse of dimensionality is that as the number of dimensions grows, the volume of the hypersphere of radius one goes to zero quickly, as shown below.

In [24]:

ndim = np.linspace(1, 1_000)
vol = 2. * np.power(np.pi, ndim / 2.) / ndim / sp.special.gamma(ndim / 2)

In [25]:

fig, ax = plt.subplots()

ax.plot(ndim, vol);

ax.set_xscale('log');
ax.set_xlabel("Dimensions");

ax.set_yscale('log');
ax.set_ylabel("Volume of the unit hypersphere");
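The formula used above, $2 \pi^{n/2} / (n\ \Gamma(n/2))$, equals $\pi^{n/2} / \Gamma(n/2 + 1)$. In double precision the gamma function itself overflows once its argument is a little above 171, so for the largest values of ndim the directly computed volume should come out as zero, which a logarithmic axis cannot display. A minimal sketch of one way around this is to work with the logarithm of the volume via the standard library's math.lgamma; the helper name `log_unit_ball_volume` is introduced here for illustration.

```python
import math


def log_unit_ball_volume(n):
    """Natural log of the volume of the n-dimensional unit ball,
    pi^(n/2) / Gamma(n/2 + 1), computed without overflow."""
    return 0.5 * n * math.log(math.pi) - math.lgamma(0.5 * n + 1)


for n in (1, 2, 3, 10, 100, 1_000):
    print(n, log_unit_ball_volume(n))
```

In one, two, and three dimensions the exponentiated values recover the familiar length of an interval, area of a disk, and volume of a ball of radius one.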

In probabilistic programming, the number of dimensions corresponds to the number of unknown parameters in our model. Though the Monty Hall problem has only one unknown parameter (the location of the prize), models used in applied work can easily have hundreds or thousands of parameters. The final example of this post will be a relatively straightforward model that still contains hundreds of parameters.

Rephrased in terms of probability, the curse of dimensionality says that if we draw a point uniformly at random from the hypercube $[-1, 1]^n$ that encloses the unit hypersphere, the probability that it lies inside the hypersphere quickly goes to zero as the number of dimensions $n$ increases.
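This probability can be estimated by simulation in low dimensions, in the same spirit as the approximation of $\pi$ above. The following sketch draws points uniformly from the cube $[-1, 1]^n$ and compares the fraction that land inside the unit hypersphere with the exact ratio of volumes; the helper `exact_fraction`, the dimensions tried, and the number of points are illustrative choices.

```python
import math

import numpy as np

rng = np.random.default_rng(123456789)
N_POINTS = 200_000


def exact_fraction(n):
    """Volume of the unit ball divided by the volume of the cube [-1, 1]^n."""
    log_ball = 0.5 * n * math.log(math.pi) - math.lgamma(0.5 * n + 1)
    return math.exp(log_ball - n * math.log(2))


fractions = {}
for n in (2, 5, 10):
    points = rng.uniform(-1, 1, size=(N_POINTS, n))
    # A point is inside the unit hypersphere if its squared norm is below one.
    fractions[n] = (np.square(points).sum(axis=1) < 1).mean()
    print(n, fractions[n], exact_fraction(n))
```

Already in ten dimensions only a small fraction of a percent of the points land inside the hypersphere, so naive uniform sampling wastes almost all of its effort.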

In [26]:

cod_fig, cod_ax = plt.subplots()

cod_ax.plot(ndim, vol / 2**ndim);

cod_ax.set_xscale('log');
cod_ax.set_xlabel("Dimensions");

cod_ax.set_yscale('log');
cod_ax.set_ylabel("Probability a point randomly drawn in the\n"
                  "unit (hyper)cube is in the unit (hyper)sphere");

The Monte Carlo approximations used in probabilistic programming work by identifying regions of the parameter space that contain most of the posterior probability and spending more time sampling in those regions than in others. In high dimensions, the curse of dimensionality makes these high posterior probability regions much more difficult to locate.

Using information about the curvature of the likelihood function of a model makes it easier to locate these regions of high posterior probability. Differential geometry provides the mathematical machinery to quantify the curvature of a hypersurface in terms of derivatives.

The excellent paper A Conceptual Introduction to Hamiltonian Monte Carlo provides a thorough and clear introduction to the concepts of differential geometry that are most relevant to Hamiltonian Monte Carlo algorithms. The author, Michael Betancourt, is a key Stan contributor and his online writings are also excellent resources for learning more about probabilistic programming in general.

The final example in this post will illustrate the importance of Hamiltonian Monte Carlo algorithms in modern probabilistic programming.

Robust regression¶

The Monty Hall problem provided an excellent case study for making the theory of probabilistic programming tangible. We now turn to a slightly more realistic example to gain more experience doing probabilistic programming with PyMC.

Anscombe's quartet¶

Anscombe's quartet is an interesting group of four data sets that share nearly identical descriptive statistics while in fact having very different underlying relationships between the independent and dependent variables. The following code for visualizing Anscombe's quartet is adapted from the Matplotlib documentation.

In [27]:

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=np.float64)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8])
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

datasets = {
    'I': (x, y1),
    'II': (x, y2),
    'III': (x, y3),
    'IV': (x4, y4)
}

In [28]:

fig, axs = plt.subplots(2, 2, sharex=True, sharey=True)
axs[0, 0].set(xlim=(0, 20), ylim=(2, 14))
axs[0, 0].set(xticks=(0, 10, 20), yticks=(4, 8, 12))

for ax, (label, (x_, y)) in zip(axs.flat, datasets.items()):
    ax.text(0.1, 0.9, label, fontsize=20, transform=ax.transAxes, va='top')
    ax.tick_params(direction='in', top=True, right=True)
    ax.plot(x_, y, 'o')

    # linear regression
    p1, p0 = np.polyfit(x_, y, deg=1)  # slope, intercept
    ax.axline(xy1=(0, p0), slope=p1, color='r', lw=2)

    # add text box for the statistics
    stats = (f'$\\mu$ = {np.mean(y):.2f}\n'
             f'$\\sigma$ = {np.std(y):.2f}\n'
             f'$r$ = {np.corrcoef(x_, y)[0][1]:.2f}')
    bbox = dict(boxstyle='round', fc='blanchedalmond', ec='orange', alpha=0.5)
    ax.text(0.95, 0.07, stats, fontsize=9, bbox=bbox,
            transform=ax.transAxes, horizontalalignment='right')
    
axs[1, 0].add_artist(plt.Rectangle((0, 2), 20, 12, fill=False, edgecolor='r', lw=5));

fig.tight_layout();

Anscombe's quartet is an excellent teaching tool; for the purposes of implementing robust regression in PyMC we will focus on the third data set highlighted above and plotted on its own below, together with the ordinary least squares (OLS) line fit to these data using NumPy's polyfit.

In [29]:

m_ols, b_ols = np.polyfit(x, y3, deg=1)

In [30]:

def plot_line(m, b, *, ax, **kwargs):
    ax.axline((0, b), slope=m, **kwargs)
    
    return ax

In [31]:

fig, ax = plt.subplots()

ax.scatter(x, y3);
plot_line(m_ols, b_ols, c='r', ls='--', label="NumPy OLS", ax=ax);

ax.set_xlim(3.5, 14.5);

ax.legend();

This data set is useful for our purposes because all the points lie exactly on a line except for one outlier. That "robust" line is shown below.

In [32]:

is_outlier = x == 13

m_robust, b_robust = np.polyfit(x[~is_outlier], y3[~is_outlier], deg=1)

In [33]:

fig, ax = plt.subplots()

ax.scatter(x, y3);
plot_line(m_ols, b_ols, c='r', ls='--', label="NumPy OLS", ax=ax);

plot_line(m_robust, b_robust, ls='--', label="Robust", ax=ax);

ax.set_xlim(3.5, 14.5);

ax.legend();

Our goal is to recover the robust line using PyMC. We will iterate through a few models before we do so.

Ordinary least squares¶

We begin by recovering not the robust line, but the NumPy OLS line for simplicity. Though the OLS estimator is often derived as the linear transformation that minimizes the mean squared residuals of the observed data, an equivalent definition is that it is the (linear) maximum likelihood estimator when the residuals are assumed to be normally distributed. This equivalent definition is the appropriate one for use with PyMC.
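This equivalence can be checked numerically. For a fixed noise scale, the normal negative log-likelihood is a constant plus a multiple of the sum of squared residuals, so the least squares line also maximizes the likelihood. A minimal sketch, in which `normal_nll` is a hypothetical helper and the perturbations of the slope and intercept are arbitrary:

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=np.float64)
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])


def normal_nll(m, b, sigma):
    """Negative log-likelihood of y3 under y ~ N(m * x + b, sigma^2)."""
    resid = y3 - (m * x + b)
    return (0.5 * np.log(2 * np.pi * sigma**2) + 0.5 * (resid / sigma) ** 2).sum()


m_ols, b_ols = np.polyfit(x, y3, deg=1)
# For fixed m and b, the likelihood is maximized at the root mean squared residual.
sigma_mle = np.sqrt(np.mean((y3 - (m_ols * x + b_ols)) ** 2))

nll_ols = normal_nll(m_ols, b_ols, sigma_mle)
# Moving the slope or intercept away from the OLS values increases the NLL.
nll_perturbed = [
    normal_nll(m_ols + dm, b_ols + db, sigma_mle)
    for dm, db in [(0.05, 0.0), (-0.05, 0.0), (0.0, 0.5), (0.0, -0.5)]
]

print(nll_ols, nll_perturbed)
```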

We assume all valid values $m, b \in \mathbb{R}$ and $\sigma > 0$ are equally likely.

In [34]:

with pm.Model(rng_seeder=SEED) as ols_model:
    m = pm.Flat("m")
    b = pm.Flat("b")
    
    σ = pm.HalfFlat("σ")

Here Flat assigns all real numbers a log probability of zero, and similarly HalfFlat assigns all positive real numbers a log probability of zero. (These are not proper probability distributions. There is no uniform distribution on all real numbers or on all positive real numbers. Still, these improper distributions work computationally in certain cases.)

The fact that our model is linear means that

$$y = m \cdot x + b + \varepsilon,$$

where $\varepsilon \sim N(0, \sigma^2)$. This form is mathematically equivalent to

$$y \sim N(m \cdot x + b, \sigma^2).$$

In PyMC, the distribution of the observed data becomes

In [35]:

with ols_model:
    y_obs = pm.Normal("y_obs", m * x + b, σ, observed=y3)

Unlike typical mathematical notation, $N\left(\mu, \sigma^2\right)$, which specifies the variance $\sigma^2$, PyMC's Normal expects the scale (standard deviation), $\sigma$, as its second argument. This small difference aside, one of the aspects of PyMC that I find particularly compelling is that its syntax is quite close to the mathematical specification of a model.

Finally we are ready to sample from this model's posterior distribution and check it against NumPy's OLS result.

In [36]:

CORES = 3

SAMPLE_KWARGS = {
    'cores': CORES,
    'random_seed': (SEED + np.arange(CORES)).tolist()
}

In [37]:

with ols_model:
    ols_trace = pm.sample(**SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [m, b, σ]

100.00% [6000/6000 00:09

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 10 seconds.
There were 155 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.4399, but should be close to 0.8. Try to increase the number of tuning steps.
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
The number of effective samples is smaller than 10% for some parameters.

Immediately PyMC provides quite a bit of information. We see a few warnings about divergences, acceptance probabilities, and effective samples. We will discuss a few of these ideas in the following section, and provide resources to learn more about those we do not discuss at the end of this post.

First, however, we visualize the results of inference with this model.

In [38]:

fig, ax = plt.subplots()

ax.scatter(x, y3);
plot_line(m_ols, b_ols, c='r', ls='--', label="NumPy OLS", ax=ax);
plot_line(m_robust, b_robust, c='C0', ls='--', label="Robust", ax=ax);

plot_line(
    ols_trace.posterior["m"].mean(dim=("chain", "draw")),
    ols_trace.posterior["b"].mean(dim=("chain", "draw")),
    c='C1', label="PyMC OLS (posterior expected value)", ax=ax
);

ax.set_xlim(3.5, 14.5);

ax.legend();

We see that PyMC's OLS result agrees quite closely with NumPy's OLS result. One of the advantages of Bayesian inference with PyMC is that we get built-in estimates of uncertainty for our results. Below we plot not only the posterior expected value of PyMC's OLS estimate, but also a few individual realizations of the posterior distribution.

In [39]:

THIN = 50

In [40]:

fig, ax = plt.subplots()

ax.scatter(x, y3);
plot_line(m_ols, b_ols, c='r', ls='--', label="NumPy OLS", ax=ax);
plot_line(m_robust, b_robust, c='C0', ls='--', label="Robust", ax=ax);

plot_line(
    ols_trace.posterior["m"].mean(dim=("chain", "draw")),
    ols_trace.posterior["b"].mean(dim=("chain", "draw")),
    c='C1', label="PyMC OLS (posterior expected value)", ax=ax
);

for m, b in (ols_trace.posterior[["m", "b"]]
                      .sel(chain=0)
                      .thin(THIN)
                      .to_array().T):
    plot_line(m.values, b.values, c='C1', alpha=0.25, ax=ax);

ax.set_xlim(3.5, 14.5);

ax.legend();

These samples show some interesting behavior. Most of the posterior samples are fairly close to the posterior expected value, but one or two are significantly lower and closer to the true behavior of the non-outliers.

We visualize the posterior distributions of $m$ and $b$ and see where the true robust values fall in those distributions.

In [41]:

import arviz as az

In [42]:

az.plot_posterior(ols_trace, var_names=["m", "b"], ref_val=[m_robust, b_robust]);

The posterior distribution of the slope tends to overestimate the true robust value and that of the intercept tends to underestimate the truth. This behavior makes sense in the presence of the outlier. Since its $y$-value is significantly higher than the robust trend would indicate, its presence causes the slope to be overestimated, which causes the intercept to be underestimated to compensate.

Before we iterate on this model to improve its robustness, we pause to highlight two more of the key components of the PyMC ecosystem that we have just encountered.

ArviZ¶

ArviZ is the package we used to produce the posterior visualization above for our OLS model in one line of code.

From the ArviZ documentation:

ArviZ is a Python package for exploratory analysis of Bayesian models. Includes functions for posterior analysis, data storage, sample diagnostics, model checking, and comparison.

The goal is to provide backend-agnostic tools for diagnostics and visualizations of Bayesian inference in Python, by first converting inference data into xarray objects. See here for more on xarray and ArviZ usage and here for more on InferenceData structure and specification.

Because probabilistic programming is so tightly intertwined with Bayesian inference, many of the visualizations and statistical calculations are useful across probabilistic programming libraries and languages, not just one. ArviZ defines a standardized format in which probabilistic programming libraries can store the results of inference. For any results in this standardized format, ArviZ provides many functions to perform statistical calculations and produce visualizations. Examining the type of ols_trace, we see that it is an ArviZ InferenceData object.

In [43]:

type(ols_trace)

Out[43]:

arviz.data.inference_data.InferenceData

In addition to the posterior visualization above, ArviZ also provides many diagnostic visualizations. Below we use ArviZ to produce a parallel plot that highlights the divergences PyMC experienced during sampling. These divergences are indicative of potential issues with convergence of our Monte Carlo approximations.

In [44]:

az.plot_parallel(ols_trace);

We will use several more ArviZ statistical functions and visualizations throughout the rest of this post.

Xarray¶

As the above quote from the ArviZ documentation emphasizes, xarray is the core data structure underlying ArviZ's InferenceData object. The posterior component of ols_trace is in fact an xarray Dataset.

In [45]:

type(ols_trace.posterior)

Out[45]:

xarray.core.dataset.Dataset

From the xarray documentation:

xarray (formerly xray) is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun!

[...]

Xarray is inspired by and borrows heavily from pandas, the popular data analysis package focused on labelled tabular data. It is particularly tailored to working with netCDF files, which were the source of xarray’s data model, and integrates tightly with dask for parallel computing.

We will see the real power that xarray lends to ArviZ and PyMC while analyzing Lego set prices in the final, most complex example in this post.

Bayesian Ridge Regression¶

Now that we have become acquainted with ArviZ and xarray, we return to the third data set in Anscombe's quartet and robust regression. One way to introduce robustness in a statistical model is to add regularization. In our OLS model, the priors on $m$, $b$, and $\sigma$ assign the same prior density to all valid values of these parameters, so extremely large values are just as plausible as small values. We can change the prior distributions on these parameters to ones that assign higher probability to smaller (absolute) values as a form of regularization. In fact, normal priors on $m$ and $b$ are equivalent to ridge regression. The regularization parameter in ridge regression is related to the scale of the normal prior distributions on $m$ and $b$.
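To make the correspondence concrete: if the noise scale $\sigma$ is held fixed, then maximizing the posterior under independent priors $m \sim N(0, \tau_m^2)$ and $b \sim N(0, \tau_b^2)$ is the same as minimizing the sum of squared residuals plus the penalty $\frac{\sigma^2}{\tau_m^2} m^2 + \frac{\sigma^2}{\tau_b^2} b^2$. The sketch below solves that penalized least squares problem in closed form. Fixing $\sigma = 1$ is a simplifying assumption made only for illustration (the models in this post treat $\sigma$ as unknown), and `ridge_map` and the symbols $\tau_m$ and $\tau_b$ are names introduced here.

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=np.float64)
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])


def ridge_map(tau_m, tau_b, sigma=1.0):
    """MAP estimate of (m, b) under N(0, tau^2) priors with a fixed noise scale."""
    X = np.column_stack([x, np.ones_like(x)])
    # Each coefficient's ridge penalty is sigma^2 divided by its prior variance.
    penalty = np.diag([sigma**2 / tau_m**2, sigma**2 / tau_b**2])
    return np.linalg.solve(X.T @ X + penalty, X.T @ y3)


m_ols, b_ols = np.polyfit(x, y3, deg=1)

# Nearly flat priors give essentially no penalty and recover the OLS fit.
m_weak, b_weak = ridge_map(tau_m=1e6, tau_b=1e6)
# Prior scales of 2.5 for the slope and 10 for the intercept
m_ridge, b_ridge = ridge_map(tau_m=2.5, tau_b=10.0)

print(m_ols, b_ols)
print(m_weak, b_weak)
print(m_ridge, b_ridge)
```

Smaller prior scales correspond to larger penalties and therefore stronger regularization; with scales this wide relative to the data, the penalized estimate should stay close to the OLS estimate.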

We now implement this Bayesian ridge regression model. Let $m \sim N(0, 2.5^2)$ and $b \sim N(0, 10^2)$ (recall that in ridge regression it is common to regularize the intercept much less than the other coefficients, if at all; a larger prior scale corresponds to weaker regularization).

In [46]:

with pm.Model(rng_seeder=SEED) as ridge_model:
    m = pm.Normal("m", 0, 2.5)
    b = pm.Normal("b", 0., 10)

We let $\sigma$ have a half-normal distribution, $\sigma \sim \text{Half-}N(2.5^2)$.

In [47]:

with ridge_model:
    σ = pm.HalfNormal("σ", 2.5)

The ease with which we can change the prior distributions on $m$, $b$, and $\sigma$ from the OLS model to get the ridge model is one of the strengths of probabilistic programming and PyMC.

The likelihood of the observed data is the same as in the OLS model.

In [48]:

with ridge_model:
    y_obs = pm.Normal("y_obs", m * x + b, σ, observed=y3)

We now sample from the ridge model's posterior distribution.

In [49]:

with ridge_model:
    ridge_trace = pm.sample(**SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [m, b, σ]

100.00% [6000/6000 00:08

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 9 seconds.
There were 3 divergences after tuning. Increase `target_accept` or reparameterize.
There were 18 divergences after tuning. Increase `target_accept` or reparameterize.
The number of effective samples is smaller than 25% for some parameters.

We have reduced the number of divergences during inference, but have not eliminated them.

In [50]:

fig, ax = plt.subplots()

ax.scatter(x, y3);
plot_line(m_ols, b_ols, c='r', ls='--', label="NumPy OLS", ax=ax);
plot_line(m_robust, b_robust, c='C0', ls='--', label="Robust", ax=ax);
plot_line(
    ols_trace.posterior["m"].mean(dim=("chain", "draw")),
    ols_trace.posterior["b"].mean(dim=("chain", "draw")),
    c='C1', label="PyMC OLS (posterior expected value)", ax=ax
);

plot_line(
    ridge_trace.posterior["m"].mean(dim=("chain", "draw")),
    ridge_trace.posterior["b"].mean(dim=("chain", "draw")),
    c='C2', label="PyMC Ridge (posterior expected value)", ax=ax
);

for m, b in (ridge_trace.posterior[["m", "b"]]
                        .sel(chain=0)
                        .thin(THIN)
                        .to_array().T):
    plot_line(m.values, b.values, c='C2', alpha=0.25, ax=ax);

ax.set_xlim(3.5, 14.5);

ax.legend();

We see that the ridge regression result is not too different from the OLS result, so we have not effectively introduced robustness. Upon reflection, this is because regularization of this type is more effective when outliers appear in the $x$-values than in the $y$-values, and the outlier in this data set is in its $y$-value.

The posterior visualization of $m$ and $b$ compared to their true robust values remains largely unchanged as well.

In [51]:

az.plot_posterior(ridge_trace, var_names=["m", "b"], ref_val=[m_robust, b_robust]);

Robust regression¶

To add robustness against outlier $y$-values, we change the distribution of the observations from a normal distribution to a fatter-tailed distribution. Student's t-distribution has a shape similar to that of the normal distribution, but with fatter tails.

In [52]:

fig, ax = plt.subplots()

x_plot = np.linspace(-3, 3)
ax.plot(x_plot, sp.stats.norm.pdf(x_plot),
        label="Standard normal");

DF = 2
ax.plot(x_plot, sp.stats.t.pdf(x_plot, DF),
        label=f"Student t, $\\nu = {DF}$");

ax.set_yticks([]);
ax.set_ylabel("Probability density");

ax.legend();

These fatter tails make a Student t-likelihood more robust against outliers than a normal likelihood.
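As a rough numerical illustration of what "fatter tails" means, the sketch below compares the two-sided tail probability beyond an arbitrary cutoff under the standard normal and Student t ($\nu = 2$) distributions. The cutoff of four is an assumption chosen only for illustration.

```python
from scipy import stats

THRESHOLD = 4.0  # arbitrary cutoff, chosen only for illustration

# Two-sided tail probabilities P(|X| > THRESHOLD)
normal_tail = 2 * stats.norm.sf(THRESHOLD)
student_tail = 2 * stats.t.sf(THRESHOLD, df=2)

print(f"Standard normal:   {normal_tail:.2e}")
print(f"Student t (ν = 2): {student_tail:.2e}")
```

An observation this far from the mean is orders of magnitude more probable under the Student t-distribution, so a single outlier pulls the fitted line far less than it does under a normal likelihood.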

This model uses the same prior distributions on $m$, $b$, and $\sigma$ as the ridge model.

In [53]:

with pm.Model(rng_seeder=SEED) as robust_model:
    m = pm.Normal("m", 0, 2.5)
    b = pm.Normal("b", 0., 10)
    
    σ = pm.HalfNormal("σ", 2.5)

We place a uniform prior on the number of degrees of freedom of the Student t-likelihood, $\nu \sim U(1, 10)$.

In [54]:

with robust_model:
    ν = pm.Uniform("ν", 1, 10)

We restrict $\nu$ to be smaller than ten because as $\nu \to \infty$ the Student t-distribution converges to a normal distribution, negating the robustness we are hoping to introduce.
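The convergence to the normal distribution can be checked numerically. This small sketch (the grid and the particular values of $\nu$ are arbitrary choices, not taken from the model above) measures the largest gap between the Student t and standard normal densities as $\nu$ grows.

```python
import numpy as np
from scipy import stats

grid = np.linspace(-5, 5, 201)
normal_pdf = stats.norm.pdf(grid)

# Largest absolute difference between the two densities on the grid
max_gap = {
    nu: np.abs(stats.t.pdf(grid, df=nu) - normal_pdf).max()
    for nu in (1, 10, 100)
}

for nu, gap in max_gap.items():
    print(f"ν = {nu:>3}: max density gap = {gap:.4f}")
```

The gap shrinks as $\nu$ increases, which is why large values of $\nu$ would give up the robustness we are after.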

We now specify the likelihood of the observations and sample from this robust model.

In [55]:

with robust_model:
    y_obs = pm.StudentT("y_obs", nu=ν, mu=m * x + b, sigma=σ, observed=y3)
    
    robust_trace = pm.sample(**SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [m, b, σ, ν]

100.00% [6000/6000 00:20

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 21 seconds.
The acceptance probability does not match the target. It is 0.6854, but should be close to 0.8. Try to increase the number of tuning steps.
The number of effective samples is smaller than 25% for some parameters.

There were no divergences while sampling from this model, which is promising. (Mismatched acceptance probabilities are easier to remedy than divergences.)

We see that the Student t-likelihood has enabled us to recover the true, robust trend, ignoring the single outlier.

In [56]:

fig, ax = plt.subplots()

ax.scatter(x, y3);
plot_line(m_ols, b_ols, c='r', ls='--', label="NumPy OLS", ax=ax);
plot_line(m_robust, b_robust, c='C0', ls='--', label="Robust", ax=ax);
plot_line(
    ols_trace.posterior["m"].mean(dim=("chain", "draw")),
    ols_trace.posterior["b"].mean(dim=("chain", "draw")),
    c='C1', label="PyMC OLS (posterior expected value)", ax=ax
);
plot_line(
    ridge_trace.posterior["m"].mean(dim=("chain", "draw")),
    ridge_trace.posterior["b"].mean(dim=("chain", "draw")),
    c='C2', label="PyMC Ridge (posterior expected value)", ax=ax
);

plot_line(
    robust_trace.posterior["m"].mean(dim=("chain", "draw")),
    robust_trace.posterior["b"].mean(dim=("chain", "draw")),
    c='C4', label="PyMC Robust (posterior expected value)", ax=ax
);


for m, b in (robust_trace.posterior[["m", "b"]]
                         .sel(chain=0)
                         .thin(THIN)
                         .to_array().T):
    plot_line(m.values, b.values, c='C4', alpha=0.25, ax=ax);

ax.set_xlim(3.5, 14.5);

ax.legend();

The posterior plots for this model show the true (robust) $m$ and $b$ comfortably in the middle of their highest posterior density intervals.

In [57]:

az.plot_posterior(robust_trace, var_names=["m", "b"], ref_val=[m_robust, b_robust]);

Robust regression provides an excellent example of the flexibility of probabilistic programming in PyMC. By changing the distributions that are the building blocks of our model, we were able to iterate from an OLS model that was quite sensitive to the presence of an outlier to a robust model that completely ignored the outlier.

A Bayesian analysis of Lego set prices¶

Now that we have solved the Monty Hall problem and built a robust regression model using PyMC, we turn our attention to a more realistic example.

I am an AFOL (adult fan of Lego); part of my collection is shown below.

My collection is mostly Star Wars and NASA sets, with a few miscellaneous others (Birds of Paradise (10289) is a particularly nice set, in my opinion). I have enjoyed the trend of Lego releasing detailed display sets for adults in recent years, so when the company announced Darth Vader Meditation Chamber (75296), I was intrigued.

The set contains 663 pieces and was priced at $69.99 in the US, which felt a bit expensive to me. As a data nerd, I was determined to see how this price compared to other sets historically before I decided whether or not to order it. The result of this effort was five blog posts:

For the final example of this post we will recreate a simplified version of the PyMC model of Lego set prices developed in these posts.

The data used for this example was scraped from Brickset, an online reference for Lego enthusiasts. The data includes most sets released between 1980 and the end of 2021, with some light filtering discussed in the above posts.

We now load the Lego data.

In [58]:

LEGO_DATA_URL = "https://austinrochford.com/resources/talks/data_umbrella_brickset_19800101_20211098.csv"

In [59]:

lego_df = pd.read_csv(LEGO_DATA_URL, parse_dates=["Year released"], index_col="Set number")
lego_df["Year released"] = lego_df["Year released"].dt.year

In [60]:

lego_df

Out[60]:

	Name	Set type	Theme	Year released	Pieces	Subtheme	RRP$	RRP2021
Set number
1041-2	Educational Duplo Building Set	Normal	Dacta	1980	68.0	NaN	36.50	122.721632
1075-1	LEGO People Supplementary Set	Normal	Dacta	1980	304.0	NaN	14.50	48.752429
5233-1	Bedroom	Normal	Homemaker	1980	26.0	NaN	4.50	15.130064
6305-1	Trees and Flowers	Normal	Town	1980	12.0	Accessories	3.75	12.608387
6306-1	Road Signs	Normal	Town	1980	12.0	Accessories	2.50	8.405591
...	...	...	...	...	...	...	...	...
80025-1	Sandy's Power Loader Mech	Normal	Monkie Kid	2021	520.0	Season 2	54.99	54.990000
80026-1	Pigsy's Noodle Tank	Normal	Monkie Kid	2021	662.0	Season 2	59.99	59.990000
80028-1	The Bone Demon	Normal	Monkie Kid	2021	1375.0	Season 2	119.99	119.990000
80106-1	Story of Nian	Normal	Seasonal	2021	1067.0	Chinese Traditional Festivals	79.99	79.990000
80107-1	Spring Lantern Festival	Normal	Seasonal	2021	1793.0	Chinese Traditional Festivals	119.99	119.990000

6423 rows × 8 columns

Most of the columns of this data frame are self-explanatory. RRP is the "recommended retail price," the price Lego charges for the set through Lego.com and its brick-and-mortar retail stores. RRP2021 is this recommended retail price adjusted to 2021 dollars. (For details about the adjustment, see this post.)

Exploratory data analysis¶

Naturally, sets with more pieces tend to cost more, which we confirm by visualizing the relationship between Pieces and RRP2021.

In [61]:

VADER_MEDITATION = "75296-1"

vader_label = f"{lego_df.loc[VADER_MEDITATION, 'Name']} ({VADER_MEDITATION.split('-')[0]})"

In [62]:

ax = sns.scatterplot(x="Pieces", y="RRP2021", data=lego_df,
                     alpha=0.5)
ax.scatter(lego_df.loc[VADER_MEDITATION, "Pieces"],
           lego_df.loc[VADER_MEDITATION, "RRP2021"],
           c='r', label=vader_label);

ax.set_ylabel("Recommended retail price\n(2021 $)");
ax.legend();

Since the number of pieces in a set and the corresponding prices vary widely, we log-transform both below.

In [63]:

ax = sns.scatterplot(x="Pieces", y="RRP2021", data=lego_df,
                     alpha=0.5)
ax.scatter(lego_df.loc[VADER_MEDITATION, "Pieces"],
           lego_df.loc[VADER_MEDITATION, "RRP2021"],
           c='r', label=vader_label);

ax.set_xscale('log');

ax.set_yscale('log');
ax.set_ylabel("Recommended retail price\n(2021 $)");

ax.legend();

These plots also highlight the location of Darth Vader Meditation Chamber (75296). It's difficult to tell whether or not the set is fairly priced for its size, hence the need for a statistical model.

Another consideration is that Lego has likely improved its manufacturing processes over time, reducing the average cost of producing a piece. An interesting question is whether some of these presumed savings have been passed on to the consumer, or if Lego has chosen to retain them all as increased margin.

Below we visualize how the log price per log piece has changed over time since 1980. (LLPPP2021 stands for log-log price per piece in 2021 dollars, which is a mouthful, so we will try not to spell it out too often.)

In [64]:

lego_df["LLPPP2021"] = (
    lego_df["RRP2021"]
           .pipe(np.log)
           .div(lego_df["Pieces"]
                       .pipe(np.log))
)

In [65]:

fig, ax = plt.subplots(figsize=1.5 * FIG_SIZE)

sns.stripplot(x="Year released", y="LLPPP2021", data=lego_df,
              jitter=0.25, color='C0', alpha=0.5, ax=ax)
ax.scatter(lego_df.loc[VADER_MEDITATION, "Year released"] - lego_df["Year released"].min(),
           lego_df.loc[VADER_MEDITATION, "LLPPP2021"],
           c='r', zorder=10, label=vader_label);

ax.xaxis.set_major_locator(ticker.MultipleLocator(5));

ax.set_ylim(0.25, 1.5);
ax.set_ylabel("Recommended retail log price\n(2021 $) per log piece");

ax.legend();

It certainly appears that the (log) price Lego charges per (log) piece has decreased over time. Darth Vader Meditation Chamber (75296) seems to be a bit above average in this respect for sets released in 2021.

Another factor likely to influence the price of a set is its theme. It seems plausible that themes primarily targeted at children might have a slightly smaller cost per piece than those targeted at adults, and that themes that involve licensed intellectual property (Star Wars, Disney, Marvel, etc.) would have a higher price per piece to maintain Lego's margins while covering licensing costs.

In [66]:

PLOT_THEMES = ["Star Wars", "Disney", "Marvel Super Heroes", "Creator", "City", "Friends"]

lego_df["Plot theme"] = (lego_df["Theme"]
                                .where(lego_df["Theme"].isin(PLOT_THEMES),
                                       "Other"))
theme_plot_min_year = (lego_df[lego_df["Theme"].isin(PLOT_THEMES)]
                             ["Year released"]
                             .sub(1)
                             .min())

In [67]:

fig, ax = plt.subplots(figsize=1.5 * FIG_SIZE)

sns.stripplot(
    x="Year released", y="LLPPP2021", hue="Plot theme",
    data=lego_df[lego_df["Year released"] >= theme_plot_min_year],
    dodge=True, alpha=0.5, ax=ax
);

ax.xaxis.set_major_locator(ticker.MultipleLocator(5));

ax.set_ylim(0.25, 1.5);
ax.set_ylabel("Recommended retail log price\n(2021 $) per log piece");

ax.legend();

The above plot shows that this narrative is plausible but not obviously correct, so we will use our model to evaluate it further.

Price model¶

The exploratory data analysis above has highlighted three factors that we would like to include in our model:

the log-log relationship between piece count and price,
the decreasing log price per log piece over time, and
the variation of these relationships by theme.

With these factors in mind, our model takes the following general form.

$$\log \text{Price} \approx (\text{Year intercept}) + (\text{Theme intercept}) + \left((\text{Year slope}) + (\text{Theme slope})\right) \cdot \log \text{Pieces}$$

There are many ways we could choose to encode the time-varying and theme-varying components of this model. For simplicity in this introductory post we model the time-varying components as Gaussian random walks and the theme-varying components as hierarchical normal distributions.

In [68]:

def gaussian_random_walk(name, *, dims, innov_scale=1.):
    # Independent normal increments; their cumulative sum is the walk
    Δ = pm.Normal(f"Δ_{name}", 0., innov_scale, dims=dims)

    return pm.Deterministic(name, Δ.cumsum(), dims=dims)

def noncentered_normal(name, *, dims, μ=None):
    # Only create a location prior when the caller does not supply one
    if μ is None:
        μ = pm.Normal(f"μ_{name}", 0., 2.5)

    Δ = pm.Normal(f"Δ_{name}", 0., 1., dims=dims)
    σ = pm.HalfNormal(f"σ_{name}", 2.5)
    
    return pm.Deterministic(name, μ + Δ * σ, dims=dims)

We will not dwell on the mathematical details of these components in this post, instead providing links to further reading above. What is important about these components is that they are relatively simple combinations of our model building blocks, PyMC distributions.
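For readers who would like a feel for what these two building blocks do, here is a plain NumPy sketch of the same constructions outside of PyMC. All of the numbers (number of steps, scales, location) are arbitrary values chosen for illustration, not estimates from the Lego model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Gaussian random walk: the cumulative sum of independent normal increments
n_steps, innov_scale = 42, 0.1
increments = rng.normal(0.0, innov_scale, size=n_steps)
walk = increments.cumsum()

# Non-centered normal: shift and scale standard normal offsets, so that
# values ~ N(mu, sigma^2) even though only N(0, 1) draws are sampled
mu, sigma = 1.0, 2.5
offsets = rng.normal(0.0, 1.0, size=100_000)
values = mu + sigma * offsets

print(walk[:5])
print(values.mean(), values.std())
```

The non-centered form describes the same distribution as sampling from a normal with location `mu` and scale `sigma` directly, but it often gives the sampler a better-behaved geometry in hierarchical models.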

We now do some very light feature engineering (standardizing the log piece count, encoding themes numerically, transforming Year released to number of years after 1980, etc.).

In [69]:

log_pieces = (lego_df["Pieces"]
                     .pipe(np.log)
                     .values)

scaler = StandardScaler().fit(log_pieces[:, np.newaxis])

def scale_log_pieces(log_pieces, scaler=scaler):
    return scaler.transform(log_pieces[:, np.newaxis])[:, 0]

std_log_pieces = scale_log_pieces(log_pieces)

In [70]:

log_rrp2021 = (lego_df["RRP2021"]
                      .pipe(np.log)
                      .values)

In [71]:

theme_id, theme_map = lego_df["Theme"].factorize(sort=True)

In [72]:

t, years = lego_df["Year released"].factorize(sort=True)

We are now ready to specify our model. First we define the coordinates xarray will use to label our results. These coordinates allow ArviZ to produce semantically meaningful visualizations of the posterior distribution. For more detailed information about how PyMC and ArviZ benefit from using xarray coordinates, see this excellent post from PyMC and ArviZ contributor Oriol Abril.

In [73]:

coords = {
    "set": lego_df.index,
    "theme": theme_map,
    "year": years
}

We first define the time- and theme-varying components of the intercept. The coords argument passed to the model's constructor tells PyMC and ArviZ how to label the results of inference.

In [74]:

with pm.Model(coords=coords, rng_seeder=SEED) as lego_model:
    β0_t = gaussian_random_walk("β0_t", dims="year", innov_scale=0.1)
    β0_theme = noncentered_normal("β0_theme", dims="theme")

Similarly, we define the time- and theme-varying components of the slope, the log price per (standardized) log piece.

In [75]:

with lego_model:
    β_pieces_t = gaussian_random_walk("β_pieces_t", dims="year", innov_scale=0.1)
    β_pieces_theme = noncentered_normal("β_pieces_theme", dims="theme")

We now define the scale of the observational noise and the predicted log price of each set.

In [76]:

with lego_model:
    σ = pm.HalfNormal("σ", 5.)
    μ = β0_t[t] + β0_theme[theme_id] \
        + (β_pieces_t[t] + β_pieces_theme[theme_id]) * std_log_pieces \
        - 0.5 * σ**2

The term - 0.5 * σ**2 in the definition of μ comes from the fact that this linear model of log price in terms of log piece count is equivalent to a multiplicative model of price in terms of piece count. (See this post for details.)
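One way to see why such a correction is useful is the standard fact that if $\log Y \sim N(\mu, \sigma^2)$, then $\mathbb{E}(Y) = \exp(\mu + \sigma^2 / 2)$. The simulation below is a sketch with made-up numbers (a target price of 70 and a noise scale of 0.5 are assumptions for illustration only) showing that subtracting $\sigma^2 / 2$ on the log scale keeps the expected price at its intended value.

```python
import numpy as np

rng = np.random.default_rng(1)

target_price = 70.0  # hypothetical expected price on the dollar scale
sigma = 0.5          # hypothetical noise scale on the log scale
n_draws = 1_000_000

# With the correction, exponentiated draws average to the target price
corrected = np.exp(
    rng.normal(np.log(target_price) - 0.5 * sigma**2, sigma, size=n_draws)
)

# Without it, the average is inflated by a factor of exp(sigma^2 / 2)
uncorrected = np.exp(rng.normal(np.log(target_price), sigma, size=n_draws))

print(corrected.mean(), uncorrected.mean())
```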

Finally we specify the likelihood of the observed log prices.

In [77]:

with lego_model:
    obs = pm.Normal("obs", μ, σ, dims="set", observed=log_rrp2021)

Why Hamiltonian Monte Carlo?¶

To illustrate the power of the Hamiltonian Monte Carlo inference algorithms that PyMC uses Aesara to implement, we first perform inference on this model using the Metropolis-Hastings algorithm. Metropolis-Hastings is a relatively simple Markov chain Monte Carlo method that does not use gradient information to account for the shape of the posterior distribution. We choose this sampler by specifying step=pm.Metropolis().
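To give a sense of how simple the algorithm is, here is a minimal random-walk Metropolis sketch in NumPy targeting a one-dimensional standard normal distribution. It is a simplified illustration, not PyMC's implementation: the function name `random_walk_metropolis`, the fixed `step_size`, and the target density are all assumptions made for this example (PyMC's sampler also tunes its proposal scale).

```python
import numpy as np

def random_walk_metropolis(log_density, init, n_draws, step_size, rng):
    draws = np.empty(n_draws)
    current = init
    current_logp = log_density(current)

    for i in range(n_draws):
        # Propose a symmetric random step; no gradient information is used
        proposal = current + step_size * rng.normal()
        proposal_logp = log_density(proposal)

        # Accept with probability min(1, p(proposal) / p(current))
        if np.log(rng.uniform()) < proposal_logp - current_logp:
            current, current_logp = proposal, proposal_logp

        draws[i] = current

    return draws

rng = np.random.default_rng(0)
draws = random_walk_metropolis(lambda x: -0.5 * x**2, 0.0, 20_000, 1.0, rng)

print(draws.mean(), draws.std())
```

In one dimension this works well enough. The trouble arises when blind proposals like these have to explore hundreds of dimensions at once.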

In [78]:

with lego_model:
    mh_trace = pm.sample(**SAMPLE_KWARGS, step=pm.Metropolis(), draws=10_000)

Multiprocess sampling (3 chains in 3 jobs)
CompoundStep
>Metropolis: [Δ_β0_t]
>Metropolis: [μ_β0_theme]
>Metropolis: [Δ_β0_theme]
>Metropolis: [σ_β0_theme]
>Metropolis: [Δ_β_pieces_t]
>Metropolis: [μ_β_pieces_theme]
>Metropolis: [Δ_β_pieces_theme]
>Metropolis: [σ_β_pieces_theme]
>Metropolis: [σ]

100.00% [33000/33000 01:14

Sampling 3 chains for 1_000 tune and 10_000 draw iterations (3_000 + 30_000 draws total) took 75 seconds.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

PyMC and ArviZ warn us that this sampler has not converged. Indeed, when we visualize the $\hat{R}$ statistics for the resulting samples we see that they are significantly larger than one.

In [79]:

fig, ax = plt.subplots()

max_rhat = (az.rhat(mh_trace)
               .max()
               .to_array())
nvar, = max_rhat.shape

ax.barh(np.arange(nvar), max_rhat);
ax.axvline(1, c='k', ls='--', label="Convergence");

ax.set_xlim(left=0.8);
ax.set_xlabel(r"$\hat{R}$");

ax.set_yticks(np.arange(nvar));
ax.set_yticklabels(max_rhat.coords["variable"].to_numpy());

ax.legend();

Since convergent samples should have $\hat{R}$ statistics quite close to one, we do not trust these results. To further reinforce this point, we plot the densities and trajectories of the samples for $\sigma$ from each of the three chains below.

In [80]:

az.plot_trace(mh_trace, var_names="σ");

The complex, multimodal distributions on the left and the noticeable differences between the chain trajectories on the right are symptomatic of sampling issues. It is these symptoms that are quantified in the $\hat{R}$ statistic for $\sigma$. For more information on $\hat{R}$ statistics, consult the excellent paper Rank-normalization, folding, and localization: An improved $\hat{R}$ for assessing convergence of MCMC.

When considered in the context of the curse of dimensionality, these convergence issues are not surprising. This model has

In [81]:

n_lego_param = sum([
    coords["year"].size,  # time intercept increments
    coords["theme"].size, # theme intercept offsets
    2,                    # theme intercept location and scale
    coords["year"].size,  # time slope increments
    coords["theme"].size, # theme slope offsets
    2,                    # theme slope location and scale
    1                     # scale of observational noise
])

In [82]:

n_lego_param

Out[82]:

351

parameters. Since the volume of the unit sphere in 351-dimensional space is quite small, the Metropolis-Hastings sampler struggles to approximate this model's posterior distribution well.
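To put a number on "quite small," the sketch below evaluates the standard formula for the volume of the unit ball in $n$ dimensions, $V_n = \pi^{n / 2} / \Gamma(n / 2 + 1)$, on the log scale to avoid underflow. The helper name `log_unit_ball_volume` is introduced here purely for illustration.

```python
import numpy as np
from scipy.special import gammaln

def log_unit_ball_volume(n):
    # log of pi^(n / 2) / Gamma(n / 2 + 1), the volume of the unit ball in R^n
    return 0.5 * n * np.log(np.pi) - gammaln(0.5 * n + 1)

# Sanity checks against familiar low-dimensional cases
print(np.exp(log_unit_ball_volume(2)))  # area of the unit disk, pi
print(np.exp(log_unit_ball_volume(3)))  # volume of the unit ball, 4 pi / 3

# Order of magnitude of the volume in 351 dimensions
log10_volume = log_unit_ball_volume(351) / np.log(10)
print(log10_volume)
```

In 351 dimensions the volume is more than one hundred orders of magnitude smaller than one, which gives some intuition for how little of the space a blind random-walk proposal can usefully explore.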

In [83]:

cod_ax.axvline(n_lego_param, c='k', ls='--', label=f"{n_lego_param} dimensions");
cod_ax.legend();

cod_fig

Out[83]:

We now sample from the posterior distribution of this model using adaptive HMC (NUTS).

In [84]:

with lego_model:
    lego_trace = pm.sample(**SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [Δ_β0_t, μ_β0_theme, Δ_β0_theme, σ_β0_theme, Δ_β_pieces_t, μ_β_pieces_theme, Δ_β_pieces_theme, σ_β_pieces_theme, σ]

100.00% [6000/6000 10:09

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 610 seconds.
The number of effective samples is smaller than 10% for some parameters.

There are no $\hat{R}$ warnings here, only a rather mild warning about the number of effective samples. We see below that the $\hat{R}$ statistics are quite close to one, indicating no obvious issues with convergence.

In [85]:

fig, ax = plt.subplots()

max_rhat = (az.rhat(lego_trace)
               .max()
               .to_array())
nvar, = max_rhat.shape

ax.barh(np.arange(nvar), max_rhat);
ax.axvline(1, c='k', ls='--', label="Convergence");

ax.set_xlim(0.99, 1.015);
ax.set_xlabel(r"$\hat{R}$");

ax.set_yticks(np.arange(nvar));
ax.set_yticklabels(max_rhat.coords["variable"].to_numpy());

ax.legend();

The per-chain posterior distributions of $\sigma$ look reasonable for these samples, and the chain trajectories appear well-mixed.

In [86]:

az.plot_trace(lego_trace, var_names="σ");

Judging only by the sampling (clock) time, we might be tempted to prefer inference with Metropolis-Hastings sampling, as it takes significantly less time than with adaptive HMC sampling.

In [87]:

sampling_time = np.array([
    mh_trace.sample_stats.sampling_time,
    lego_trace.sample_stats.sampling_time
])

In [88]:

fig, ax = plt.subplots()

ax.bar([0, 1], sampling_time);

ax.set_xticks([0, 1]);
ax.set_xticklabels(["Metropolis-Hastings", "Adaptive HMC (NUTS)"]);

ax.set_ylabel("Sampling time");

This speed is, however, deceptive. In Markov chain Monte Carlo inference it is important to consider the effective sample size, which takes into account the quality of each sample. Effective sample size is intimately connected to the $\hat{R}$ statistic we checked above. Below we plot the sampling rate in terms of effective samples per second for $\sigma$ (the conclusion is similar for the other parameters).
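To build some intuition for effective sample size, the following sketch compares an independent chain with a strongly autocorrelated AR(1) chain of the same length. The estimator `crude_ess` is a deliberately simple textbook-style approximation written for this illustration (the chain length divided by an integrated autocorrelation time truncated at the first negative lag); it is not the rank-normalized estimator that ArviZ implements, and the AR(1) coefficient of 0.9 is an arbitrary choice.

```python
import numpy as np

def crude_ess(chain, max_lag=500):
    centered = chain - chain.mean()
    variance = centered @ centered / centered.size

    # Integrated autocorrelation time, truncated at the first negative lag
    tau = 1.0
    for lag in range(1, max_lag):
        rho = (centered[:-lag] @ centered[lag:]) / (centered.size * variance)
        if rho < 0:
            break
        tau += 2 * rho

    return centered.size / tau

rng = np.random.default_rng(3)
n_draws, phi = 20_000, 0.9

# Independent draws: every sample carries new information
iid_chain = rng.normal(size=n_draws)

# AR(1) chain: each draw is strongly correlated with the previous one
noise = rng.normal(size=n_draws)
ar_chain = np.empty(n_draws)
ar_chain[0] = noise[0]
for i in range(1, n_draws):
    ar_chain[i] = phi * ar_chain[i - 1] + noise[i]

ess_iid = crude_ess(iid_chain)
ess_ar = crude_ess(ar_chain)

print(ess_iid, ess_ar)
```

Both chains contain the same number of draws, but the autocorrelated chain is worth far fewer independent samples. A fast sampler that produces highly correlated draws can therefore be less efficient than a slower one that mixes well.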

In [89]:

σ_ess = np.array([
    az.ess(mh_trace, var_names="σ")["σ"],
    az.ess(lego_trace, var_names="σ")["σ"]
])

σ_esps = σ_ess / sampling_time

In [90]:

fig, ax = plt.subplots()

ax.bar([0, 1], σ_esps);

ax.set_xticks([0, 1]);
ax.set_xticklabels(["Metropolis-Hastings", "Adaptive HMC (NUTS)"]);

ax.set_ylabel("Effective samples per second");

We see that despite being much slower according to wall-clock time, adaptive HMC is significantly more efficient in terms of effective samples per second.

Under some extremely generous assumptions it would take the Metropolis-Hastings sampler approximately

In [91]:

σ_ess[1] / σ_esps[0] // 60 // 60

Out[91]:

28.0

hours to produce the same effective sample size (for $\sigma$) as the adaptive HMC sampler has achieved in about ten minutes.

The vastly superior sampling efficiency of adaptive HMC is well worth the small amount of extra work involved in building our model from Aesara tensor operations.

In these days of billion-parameter neural networks, even this 351-parameter model seems rather tame. No matter what probabilistic programming language is used, adaptive HMC is crucial to scaling Bayesian inference to modern, complex models of subtle phenomena.

Darth Vader Meditation Chamber revisited¶

We now return to the question that prompted us to build this model: is Darth Vader Meditation Chamber (75296) fairly priced? To answer this question, we sample from the model's posterior predictive distribution.

In [92]:

with lego_model:
    pp_trace = pm.sample_posterior_predictive(lego_trace)

100.00% [3000/3000 00:01

These posterior predictive samples give us a distribution of plausible prices for a set with the characteristics of Darth Vader Meditation Chamber (75296).

In [93]:

def format_posterior_artist(artist, formatter):
    text = artist.get_text()
    x, _ = artist.get_position()

    if text.startswith(" ") or text.endswith(" "):
        fmtd_text = formatter(x)
        artist.set_text(
            " " + fmtd_text if text.startswith(" ") else fmtd_text + " "
        )
    elif "=" in text:
        before, _ = text.split("=")
        artist.set_text("=".join((before, formatter(x))))
    elif "<" in text:
        below, ref_val_str, above = text.split("<")
        artist.set_text("<".join((
            below,
            " " + formatter(float(ref_val_str)) + " ",
            above
        )))

def format_posterior_text(formatter, ax=None):
    if ax is None:
        ax = plt.gca()
    
    artists = [artist for artist in ax.get_children() if isinstance(artist, plt.Text)]
    
    for artist in artists:
        format_posterior_artist(artist, formatter)

In [94]:

ax = az.plot_posterior(
    pp_trace, group='posterior_predictive', coords={"set": VADER_MEDITATION},
    transform=np.exp, ref_val=lego_df.loc[VADER_MEDITATION, "RRP2021"]
)

format_posterior_text(dollar_formatter, ax=ax);

ax.set_xlabel("Posterior predicted RRP (2021 $)");
ax.set_title(vader_label);

Using plot_posterior with just a few more keyword arguments than before, we see that the recommended retail price of $69.99 is comfortably in the middle of the posterior predictive distribution of this set's price. After seeing this result, I promptly begged my Lego overlords for forgiveness, ordered the set for myself, and can wholeheartedly recommend it to any Star Wars Lego fans.

Here we have also taken advantage of the fact that pp_trace.posterior_predictive is an xarray Dataset, so by passing coords={"set": VADER_MEDITATION} to plot_posterior ArviZ knows to only plot the posterior predictive distribution for Darth Vader Meditation Chamber (75296). The decal of Admiral Ozzel and Captain Piett is a very nice touch.

The following plots show how straightforward it is to use ArviZ to visualize whether the themes we highlighted during EDA have a higher or lower than average baseline price and whether the price increases more or less quickly than average as the number of pieces increases.

In [95]:

ax, = az.plot_forest(
    lego_trace, var_names="β0_theme",
    coords={"theme": PLOT_THEMES},
    kind='ridgeplot', ridgeplot_alpha=0.5,
    combined=True, hdi_prob=1
)

ax.axvline(
    lego_trace.posterior["μ_β0_theme"].mean(),
    c='k', ls='--', label="Average theme"
);

ax.set_xticks([]);
ax.set_xlabel(r"$\beta_{0, \mathrm{theme}}$");

ax.set_yticklabels(PLOT_THEMES[::-1]);

ax.legend();

In [96]:

ax, = az.plot_forest(
    lego_trace, var_names="β_pieces_theme",
    coords={"theme": PLOT_THEMES},
    kind='ridgeplot', ridgeplot_alpha=0.5,
    combined=True, hdi_prob=1
)

ax.axvline(
    lego_trace.posterior["μ_β_pieces_theme"].mean(),
    c='k', ls='--', label="Average theme"
);

ax.set_xticks([]);
ax.set_xlabel(r"$\beta_{\mathrm{pieces}, \mathrm{theme}}$");

ax.set_yticklabels(PLOT_THEMES[::-1]);

ax.legend();

Resources¶

This post has presented how I think about probabilistic programming, from theoretical principles, to toy problems, to realistic applications. My perspective is certainly not the only one and probably not the best one (if such a thing exists), but I certainly find it compelling. If this introduction has sparked your interest in probabilistic programming with PyMC, here are some resources I recommend to learn more.

References¶

Probabilistic Programming and Bayesian Methods for Hackers is an excellent open-source introduction to probabilistic programming and Bayesian statistics targeted at those coming from a programming background. Each chapter is executable as a Jupyter notebook, so readers can modify and extend the examples while learning. This book introduced me to probabilistic programming almost ten years ago, so I am quite fond of it. Note that the notebooks currently use PyMC3, the previous version of PyMC, while this introduction has used PyMC v4. The code will run under PyMC v4 with at most very slight modifications.

Bayesian Modeling and Computation in Python is an excellent recently released book by Osvaldo A. Martin, Ravin Kumar, and Junpeng Lao. All three of the authors are core contributors to PyMC. They have generously released the contents of the book for free at the above link, but please consider supporting their excellent work by purchasing a copy if you can afford to do so. The same caveat applies in that this book targets PyMC3.

Statistical Rethinking: A Bayesian Course with Examples in R and Stan is an excellent book by Richard McElreath targeted at those coming from a social science background. Though it presents its examples in R using the Stan probabilistic programming language, it is nonetheless an excellent introduction to Bayesian thinking and model building. Because the book's content is so good, a number of PyMC enthusiasts have ported the examples and solved the exercises in Jupyter notebooks using PyMC3.

As mentioned above, Stan is another probabilistic programming language that fills a niche similar to that of PyMC. The Stan model specification language is a domain-specific language embedded in C++, and there are Stan interfaces implemented in many languages (Python, R, command line). The Stan User's Guide is an excellent reference for probabilistic programming and Bayesian methods in general. There is a very friendly relationship between the Stan and PyMC communities; both are NumFOCUS-sponsored projects. As with PyMC, the results of inference conducted using PyStan are compatible with ArviZ.

Community¶

PyMC maintains an active discussion board that is an excellent place to ask questions and get help debugging model construction and performance issues.

As an open source project, PyMC is always looking for new contributors to all parts of the stack, from sampling algorithm improvements to new probability distributions to documentation. Issues tagged "beginner friendly" on GitHub are a great place to look for ways that new contributors can make an impact.

In the nearly ten years that I have been using PyMC and seven years that I have been contributing, I have found the community to be nothing but supportive and welcoming. Please do consider joining us if you find yourself drawn to probabilistic programming! As NumFOCUS-sponsored projects, PyMC and ArviZ both regularly participate in Google Summer of Code, which is another excellent way to get involved with these projects.

Thank you!¶

First and foremost, I would like to thank you, the reader, for making it this far through a long post where I share the perspective I have developed on probabilistic programming with PyMC over the last ten years. I hope you enjoyed reading this post almost as much as I enjoyed writing it!

I would also like to thank my friends Meenal Jhajharia for inviting me to give the Data Umbrella talk that I have expanded into this blog post and Thomas Wiecki for encouraging me to expand that talk into this post. I would also like to thank Reshama Shaikh and the entire Data Umbrella team for organizing the PyMC/Data Umbrella sprint that my talk was a part of and all the other important work that they do.

This post is available as a Jupyter notebook here.

In [97]:

%load_ext watermark
%watermark -n -u -v -iv

Last updated: Sat Jan 29 2022

Python implementation: CPython
Python version       : 3.7.12
IPython version      : 7.30.1

scipy     : 1.7.3
aesara    : 2.3.2
matplotlib: 3.5.1
pandas    : 1.3.5
seaborn   : 0.11.2
numpy     : 1.19.5
arviz     : 0.11.4
pymc      : 4.0.0b1

Efficiency of Various Parameterizations for Sampling from Latent Exchangeable Normal Random Variables in PyMC

Austin Rochford — Fri, 05 Nov 2021 04:00:00 GMT

Author's Note (2022-09-05): As of the official release of PyMC version 4.0, the current stable version of PyMC now supports multivariate distributions. Inference using these is generally much more efficient than any of the workaround parameterizations presented in this post. For those still interested in a closed-form solution for the Cholesky decomposition this post computes using SymPy, consult this subsequent post.

For an upcoming analysis of the fighting skills of hockey players, I've fallen down a rabbit hole comparing several different parametrizations for estimating the covariance parameter of latent exchangeable normal random variables using PyMC. After working out which parameterization was most efficient for my model, I've decided that this derivation is interesting enough to merit its own post.


Alice in Wonderland, statistics, and hockey: three of my favorite things

Exchangeable random variables¶

Recall that a sequence of random variables $X_1, \ldots, X_T$ is exchangeable if its joint distribution is invariant under permutations of the indices. More concretely, this sequence is exchangeable if for every permutation of the indices (a bijective function $\tau: \{1, \ldots, T\} \to \{1, \ldots, T\}$), $X_1, \ldots, X_T$ and $X_{\tau(1)}, \ldots, X_{\tau(T)}$ have the same joint distribution.

Exchangeable variables arise in many situations, and most importantly for our situation, lead to interesting covariance/correlation structures. It is evident from the definition that exchangeable random variables must have the same mean and variance, so we define $\mu = \mathbb{E}(X_t)$ and $\sigma^2 = \mathbb{Var}(X_t)$. It is also evident that each pair has the same covariance, so we define the Pearson correlation coefficient

$$\rho = \frac{\mathbb{Cov}(X_s, X_t)}{\sigma^2}.$$

This post will focus on the sampling performance for estimating the posterior distribution of $\rho$ using various mathematically equivalent parameterizations.

Exchangeability imposes a restriction on how anticorrelated these random variables can be, as the following calculation shows. We know that variance must be nonnegative, so by the bilinearity of covariance

$$ \begin{align} 0 & \leq \mathbb{Var}\left(\sum_{t = 1}^T X_t\right) \\ & = \sum_{t = 1}^T \mathbb{Var}(X_t) + 2 \sum_{t = 2}^T \sum_{s = 1}^{t - 1} \mathbb{Cov}(X_s, X_t) \\ & = T \sigma^2 + T (T - 1) \rho \sigma^2 \\ & = (1 + (T - 1) \rho) \cdot T \sigma^2. \end{align} $$

Since $T \sigma^2 > 0$ (we assume the variables are not almost surely constant), we must have

$$1 + (T - 1) \rho \geq 0.$$

Solving this inequality gives

$$\rho \geq -\frac{1}{T - 1},$$

so exchangeable random variables cannot be too anticorrelated, and how anticorrelated they can be depends on the length of the sequence. In the limit $T \to \infty$, we see that an infinite sequence of exchangeable random variables must have nonnegative Pearson correlation.
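As a quick numerical check of this bound, the following sketch evaluates the variance of the sum directly from the pairwise covariances and compares it with the closed form $(1 + (T - 1) \rho) \cdot T \sigma^2$. The helper names are illustrative and are not used elsewhere in this post.

```python
from itertools import product


def sum_variance(T, rho, sigma=1.0):
    """Variance of X_1 + ... + X_T, summing every entry of the covariance matrix."""
    return sum(
        sigma**2 if s == t else rho * sigma**2
        for s, t in product(range(T), repeat=2)
    )


def min_correlation(T):
    """Smallest correlation an exchangeable sequence of length T can have."""
    return -1.0 / (T - 1)


T = 4
for rho in (min_correlation(T), -0.5, 0.0, 0.5):
    variance = sum_variance(T, rho)
    # A negative variance means no such exchangeable sequence can exist
    status = "valid" if variance >= -1e-12 else "impossible"
    print(f"rho = {rho:+.3f}: Var(sum) = {variance:+.3f} ({status})")
```

At the boundary value $\rho = -\frac{1}{T - 1}$ the variance of the sum is zero, which is exactly the degenerate case where the variables are constrained to sum to a constant.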

This constraint on the range of possible Pearson correlations for an exchangeable sequence is of particular interest for my hockey application as I want to examine the evidence that the latent fighting "skill" of players is uncorrelated across seasons. (The word "skill" appears in quotes here because if there is no season-over-season correlation, we are probably not modeling a true latent skill.)

We will keep this constraint in mind as we explore the sampling efficiency of various parametrizations of latent exchangeable normal random variables in the following sections.

Generating the data¶

For benchmarking purposes, we will simulate data sets drawn from a population based on latent normal parameters that are exchangeable. In order to focus on estimating the Pearson correlation, $\rho$, we will assume all random variables have unit scale, $\sigma = 1$. These calculations extend easily to the case of an arbitrary scale.

First we make the necessary Python imports and do some light housekeeping.

In [1]:

from contextlib import contextmanager
from enum import Enum
from fastprogress.fastprogress import progress_bar
import json
import logging
from pathlib import Path
from toolz import compose, valmap

In [2]:

from aesara import tensor as at
import arviz as az
import matplotlib as mpl
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import seaborn as sns
import sympy as sym

You are running the v4 development version of PyMC which currently still lacks key features. You probably want to use the stable v3 instead which you can either install via conda or find on the v3 GitHub branch: https://github.com/pymc-devs/pymc/tree/v3

In [3]:

FIG_WIDTH = 8
FIG_HEIGHT = 6
mpl.rcParams['figure.figsize'] = (FIG_WIDTH, FIG_HEIGHT)

sns.set(color_codes=True)

For the purposes of these simulations we will set $T = 4$, small enough so we can visualize covariance matrices using SymPy, but large enough that I don't want to do the necessary calculations by hand.

In [4]:

T = 4

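We generate $K = 100$ random latent $T$-vectors from an exchangeable normal distribution with mean zero, unit scale, and Pearson correlation $\rho$, so the covariance matrix is

$$ \Sigma_{\rho} = \begin{pmatrix} 1 & \rho & \rho & \rho \\ \rho & 1 & \rho & \rho \\ \rho & \rho & 1 & \rho \\ \rho & \rho & \rho & 1 \end{pmatrix} $$

and $\mathbf{\mu}_1, \ldots, \mathbf{\mu}_K \sim N\left(0, \Sigma_{\rho}\right)$. For the moment we will use $\rho = 0.5$.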

In [5]:

RHO = 0.5
K = 100

In [6]:

SEED = 123456789

rng = np.random.default_rng(SEED)

In [7]:

Sigma_rho = np.eye(T) + RHO * (1 - np.eye(T))
mu = rng.multivariate_normal(np.zeros(T), Sigma_rho, size=K)

We see that these samples have the desired covariance structure (allowing for sampling noise on the order of $\frac{1}{\sqrt{K}} = 0.1$).

In [8]:

np.corrcoef(mu, rowvar=False)

Out[8]:

array([[1.        , 0.43203461, 0.42611009, 0.52428711],
       [0.43203461, 1.        , 0.49784241, 0.52027976],
       [0.42611009, 0.49784241, 1.        , 0.48151274],
       [0.52428711, 0.52027976, 0.48151274, 1.        ]])

We now generate observations. Each of these $\mathbf{\mu}_k$ represents the latent mean of a group. We will generate $n_K = 20$ observations per group, $\mathbf{x}_1, \ldots, \mathbf{x}_{K \cdot n_K} \in \mathbb{R}^T$, with distribution $\mathbf{x}_i \sim N\left(\mathbf{\mu}_{k(i)}, \mathbb{I}_T\right)$, where $k(i) = \left\lceil \frac{i}{n_K} \right\rceil$ is the group of the $i$-th observation and $\mathbb{I}_T$ is the $T$-dimensional identity matrix. (The code uses the equivalent zero-based indexing, $\left\lfloor \frac{i}{n_K} \right\rfloor$ for $i = 0, \ldots, K \cdot n_K - 1$.)

In [9]:

n_K = 20
k_i = np.arange(K * n_K) // n_K

In [10]:

x = mu[k_i] + rng.normal(size=(K * n_K, T))

Calculating the covariance of $x_{i, s} = \mu_{i, s} + \varepsilon_{i, s}$ and $x_{i, t} = \mu_{i, t} + \varepsilon_{i, t}$, with $\varepsilon_{i, t} \sim N(0, 1)$ i.i.d. as above and independent of the latent means (so the cross terms vanish), we get

$$ \begin{align} \mathbb{Cov}(x_{i, s}, x_{i, t}) & = \mathbb{Cov}(\mu_{i, s} + \varepsilon_{i, s}, \mu_{i, t} + \varepsilon_{i, t}) \\ & = \mathbb{Cov}(\mu_{i, s}, \mu_{i, t}) + \mathbb{Cov}(\varepsilon_{i, s}, \varepsilon_{i, t}) \\ & = \begin{cases} \sigma^2 + 1 & \text{if } s = t \\ \rho \sigma^2 & \text{if } s \neq t \end{cases} \\ & = \begin{cases} 2 & \text{if } s = t \\ 0.5 & \text{if } s \neq t \end{cases}. \end{align} $$

Therefore we get a Pearson correlation of

$$ \begin{align} \mathbb{Corr}(x_{i, s}, x_{i, t}) & = \frac{\mathbb{Cov}(x_{i, s}, x_{i, t})}{\sqrt{\mathbb{Cov}(x_{i, s}, x_{i, s})} \cdot \sqrt{\mathbb{Cov}(x_{i, t}, x_{i, t})}} \\ & = \frac{0.5}{\sqrt{2} \cdot \sqrt{2}} \\ & = 0.25 \end{align} $$

for $s \neq t$.
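The same arithmetic can be wrapped in a small helper, which makes it easy to see how the observation noise attenuates the latent correlation for other values of $\rho$. This is only a sketch; the function name and its `noise_var` parameter are assumptions introduced for illustration.

```python
def observed_correlation(rho, sigma=1.0, noise_var=1.0):
    """Correlation between x_{i,s} and x_{i,t} (s != t) when independent noise
    with variance noise_var is added to latent means with correlation rho."""
    cov = rho * sigma**2          # the off-diagonal covariance is unchanged by the noise
    var = sigma**2 + noise_var    # the noise inflates each variance
    return cov / var


for rho in (-1 / 3, 0.0, 0.5, 0.9):
    print(f"latent rho = {rho:+.3f} -> observed correlation = {observed_correlation(rho):+.3f}")
```

With unit scale and unit noise variance the observed correlation is always half of the latent one, which is why the simulated observations are noticeably less correlated than the latent means.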

This correlation is borne out in our simulated data.

In [11]:

np.corrcoef(x, rowvar=False)

Out[11]:

array([[1.        , 0.2343107 , 0.19670175, 0.24605147],
       [0.2343107 , 1.        , 0.22697497, 0.23203603],
       [0.19670175, 0.22697497, 1.        , 0.24862999],
       [0.24605147, 0.23203603, 0.24862999, 1.        ]])

In order to study the sampling performance of different parametrizations for a variety of values of $\rho$, the following function that generates data as above given $\rho$ will be useful.

In [12]:

def generate_data(rho, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    
    Sigma_rho = np.eye(T) + rho * (1 - np.eye(T))
    mu = rng.multivariate_normal(np.zeros(T), Sigma_rho, size=K)
    
    return mu[k_i] + rng.normal(size=(K * n_K, T))

Parametrizations for inference with exchangeable multivariate normal random variables¶

For the rest of this post we will focus exclusively on the

When we imported pymc we got the following warning:

You are running the v4 development version of PyMC which currently still lacks key features. You probably want to use the stable v3 instead which you can either install via conda or find on the v3 GitHub branch: https://github.com/pymc-devs/pymc/tree/v3

One of the key features the development version of PyMC v4 lacks as of the commit that was installed in the Docker container I am composing this post on,

In [13]:

try:
    pm_path = Path(pm.__path__[0])
    dist_info_path, = pm_path.parent.glob("pymc-*.dist-info")

    with open(dist_info_path / "direct_url.json", 'r') as src:
        direct_url_data = json.load(src)

    print(direct_url_data["vcs_info"]["commit_id"])
except Exception:
    print("Nothing to see here")

598dd9de2b818a58480071720a9f3da63177be89

is a working multivariate normal implementation. This lack is not a true problem as we can quickly implement a workaround sufficient for our situation that will be roughly in line with the eventual PyMC v4 implementation.

The Cholesky decomposition¶

Recall that if $\mathbf{\mu} \in \mathbb{R}^T$ and $\Sigma \in \mathbb{R}^{T \times T}$ is a positive definite covariance matrix and $\mathbf{x} \sim N(\mathbf{\mu}, \Sigma)$ then the density of $\mathbf{x}$ is

$$ f\left(\mathbf{x}\ | \mathbf{\mu}, \Sigma\right) = (2 \pi)^{-\frac{T}{2}} \cdot \left|\Sigma\right|^{-\frac{1}{2}} \cdot \exp \left(-\frac{(\mathbf{x} - \mathbf{\mu})^{\top} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu})}{2}\right). $$

The quadratic form $(\mathbf{x} - \mathbf{\mu})^{\top} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu})$ is suggestive. If we can decompose the covariance matrix $\Sigma = L L^{\top}$ where $L$ is sufficiently nice, we can rewrite the quadratic form as

$$ (\mathbf{x} - \mathbf{\mu})^{\top} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu}) = (L^{-1} (\mathbf{x} - \mathbf{\mu}))^{\top} (L^{-1} (\mathbf{x} - \mathbf{\mu})). $$

If, in particular, $L$ is lower triangular, it is relatively easy to solve the linear system $L \mathbf{z} = \mathbf{x} - \mathbf{\mu}$ using forward substitution. The determinant in the density is also easy to calculate because $|\Sigma|^{-\frac{1}{2}} = |L|^{-1}$, and the determinant of a triangular matrix is just the product of its diagonal entries. Fortunately such a decomposition, the Cholesky decomposition, is well-known and has been extensively studied. Together with the Cholesky decomposition, the following fact gives an easy way to generate samples from a multivariate normal distribution with arbitrary covariance matrix.

Theorem Let $\mathbf{z} \in \mathbb{R}^n$ have components drawn from a standard normal distribution, $z_1, \ldots, z_n \overset{\text{i.i.d.}}{\sim} N(0, 1)$. Let $A \in \mathbb{R}^{m \times n}$. Then $A \mathbf{z} \in \mathbb{R}^m$ has distribution $A \mathbf{z} \sim N(0, A A^{\top})$.

Therefore if we want to generate samples from an $N(0, \Sigma)$ distribution, we can calculate the Cholesky decomposition of the covariance matrix, $\Sigma = L L^{\top}$, generate samples $\mathbf{z}_i \sim N(0, \mathbb{I})$ and multiply to get $L \mathbf{z}_i \sim N(0, \Sigma)$.
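To make this recipe concrete without depending on a particular tensor library, here is a minimal pure-Python sketch of the textbook row-by-row Cholesky algorithm, a triangular solve, the resulting log density, and the sampling step. It is intended only as an illustration for small matrices; the function names are illustrative, and in practice one would use a library routine such as `np.linalg.cholesky`.

```python
import math
import random


def cholesky(Sigma):
    """Lower triangular L with L L^T = Sigma, for a positive definite Sigma."""
    n = len(Sigma)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(Sigma[i][i] - partial)
            else:
                L[i][j] = (Sigma[i][j] - partial) / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L z = b for lower triangular L, working from the first row down."""
    z = []
    for i, row in enumerate(L):
        z.append((b[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z


def mvn_logpdf(x, mu, L):
    """Log density of N(mu, L L^T) at x, using only the Cholesky factor."""
    z = solve_lower(L, [xi - mi for xi, mi in zip(x, mu)])
    log_det_L = sum(math.log(L[i][i]) for i in range(len(L)))
    return -0.5 * len(x) * math.log(2 * math.pi) - log_det_L - 0.5 * sum(zi**2 for zi in z)


def sample_mvn(L, rng):
    """Draw one sample from N(0, L L^T) as L z with z standard normal."""
    z = [rng.gauss(0.0, 1.0) for _ in L]
    return [sum(row[k] * z[k] for k in range(i + 1)) for i, row in enumerate(L)]


rho, T = 0.5, 4
Sigma = [[1.0 if s == t else rho for t in range(T)] for s in range(T)]
L = cholesky(Sigma)

rng = random.Random(123456789)
draws = [sample_mvn(L, rng) for _ in range(20_000)]
# The means are zero, so this second moment estimates the covariance of the
# first two coordinates, which should be close to rho
emp_cov = sum(d[0] * d[1] for d in draws) / len(draws)
print(f"empirical covariance of the first two coordinates: {emp_cov:.3f}")
print(f"log density at the origin: {mvn_logpdf([0.0] * T, [0.0] * T, L):.3f}")
```

Note that the density evaluation never forms $\Sigma^{-1}$ explicitly; the triangular solve and the product of the diagonal entries of $L$ are all that is needed.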

This post compares two ways of calculating the Cholesky factorization of $\Sigma$ when the normal random variables are exchangeable, along with a third parametrization that applies when $\rho \geq 0$, through the lens of sampling efficiency using Hamiltonian Monte Carlo sampling as implemented in PyMC.

General Cholesky decomposition¶

PyMC uses Aesara as its underlying tensor computation/differentiation engine. Conveniently, Aesara implements a Cholesky decomposition that PyMC then wraps in the function pm.distributions.multivariate.cholesky, which can be applied to matrices of random variables.

We begin with a uniform prior on our correlation coefficient in the permissible range $-\frac{1}{T - 1} \leq \rho < 1$.

In [14]:

with pm.Model(rng_seeder=SEED) as gen_model:
    ρ = pm.Uniform("ρ", -1. / (T - 1), 1.)

We then build the associated covariance matrix $\Sigma_{\rho}$, with ones on the diagonal and $\rho$ in every off-diagonal entry, and compute its Cholesky decomposition.

In [15]:

def get_corr_mat(rho, n):
    return np.eye(n) + rho * (1 - np.eye(n))

In [16]:

with gen_model:
    Σ_ρ = get_corr_mat(ρ, T)
    L = pm.distributions.multivariate.cholesky(Σ_ρ)

Following the plan laid out in the theorem above, we then sample $\mathbf{z}_1, \ldots, \mathbf{z}_K \in \mathbb{R}^T$ having standard normally distributed entries, and transform them to $\mathbf{\mu}_k = L \mathbf{z}_k \sim N\left(0, \Sigma_{\rho}\right)$.

In [17]:

with gen_model:
    z = pm.Normal("z", 0., 1., shape=(K, T))
    μ = z.dot(L.T)

Finally we specify the likelihood of the observed data.

In [18]:

with gen_model:
    pm.Normal("obs", μ[k_i], 1., observed=pm.Data("x", x))

We now sample from the model's posterior distribution.

In [19]:

CORES = 3

SAMPLE_KWARGS = {
    'cores': CORES,
    'random_seed': [SEED + i for i in range(CORES)]
}

In [20]:

with gen_model:
    gen_trace = pm.sample(**SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [ρ, z]

100.00% [6000/6000 04:03

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 245 seconds.
The number of effective samples is smaller than 25% for some parameters.

Note the warning about small effective sample size. Effective samples per second will be the key quantity to consider when we benchmark our parametrizations against each other.

Plotting the posterior distribution of $\rho$, we see that the true value of 0.5 is comfortably inside the highest posterior density interval.

In [21]:

az.plot_posterior(gen_trace, var_names="ρ", ref_val=0.5);

Cholesky decomposition for exchangeable normal random variables¶

One advantage of the previous parametrization is its generality; nowhere did we rely on the fact that the random variables were exchangeable. While pursuing my hockey fight research, I was curious if we can improve upon the sampling performance by taking advantage of the fact that exchangeable normal random variables lead to the highly structured covariance matrix

$$ \begin{align} \Sigma_{\rho} = \begin{pmatrix} 1 & \rho & \cdots & \rho & \rho \\ \rho & 1 & \cdots & \rho & \rho \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \rho & \rho & \cdots & 1 & \rho \\ \rho & \rho & \cdots & \rho & 1 \end{pmatrix}. \end{align} $$

Therefore the question is, can we exploit this structure to come up with a simple (enough) closed form expression to facilitate faster sampling? The answer was not obvious to me, but I was hopeful. Prior to considering this question I had never calculated a Cholesky decomposition by hand, and I still have not. To avoid the tedious work of manual calculations, we can lean on SymPy.

Below we generate the $T \times T$ covariance matrix for our situation (exchangeable normal random variables with unit scale and $T = 4$).

In [22]:

def get_corr_mat_sympy(rho, n):
    return sym.eye(n) + rho * (sym.ones(n, n) - sym.eye(n))

In [23]:

rho = sym.var(r"\rho")

In [24]:

Sigma_rho_sym = get_corr_mat_sympy(rho, T)

In [25]:

Sigma_rho_sym

Out[25]:

$\displaystyle \left[\begin{matrix}1 & \rho & \rho & \rho\\\rho & 1 & \rho & \rho\\\rho & \rho & 1 & \rho\\\rho & \rho & \rho & 1\end{matrix}\right]$

We can ask SymPy for this matrix's Cholesky decomposition and try to spot useful patterns.

In [26]:

L_sym = Sigma_rho_sym.cholesky(hermitian=False)

In [27]:

L_sym

Out[27]:

$\displaystyle \left[\begin{matrix}1 & 0 & 0 & 0\\\rho & \sqrt{1 - \rho^{2}} & 0 & 0\\\rho & \frac{- \rho^{2} + \rho}{\sqrt{1 - \rho^{2}}} & \sqrt{- \rho^{2} + 1 - \frac{\left(- \rho^{2} + \rho\right)^{2}}{1 - \rho^{2}}} & 0\\\rho & \frac{- \rho^{2} + \rho}{\sqrt{1 - \rho^{2}}} & \frac{- \rho^{2} + \rho - \frac{\left(- \rho^{2} + \rho\right)^{2}}{1 - \rho^{2}}}{\sqrt{- \rho^{2} + 1 - \frac{\left(- \rho^{2} + \rho\right)^{2}}{1 - \rho^{2}}}} & \sqrt{- \rho^{2} + 1 - \frac{\left(- \rho^{2} + \rho - \frac{\left(- \rho^{2} + \rho\right)^{2}}{1 - \rho^{2}}\right)^{2}}{- \rho^{2} + 1 - \frac{\left(- \rho^{2} + \rho\right)^{2}}{1 - \rho^{2}}} - \frac{\left(- \rho^{2} + \rho\right)^{2}}{1 - \rho^{2}}}\end{matrix}\right]$

One thing I notice immediately is that I am glad I did not calculate this by hand. Another important point to notice is that each column has at most two unique (nonzero) entries:

the diagonal entry, and
the entry below the diagonal that is repeated to fill the rest of the column.

For a $T \times T$ covariance matrix, that means that there are only $2 T - 1$ unique entries to calculate, rather than the $\frac{T (T + 1)}{2}$ entries on and below the diagonal of a general lower triangular matrix, a significant reduction as $T$ grows. I suspect, but have not rigorously shown, that these values can be calculated in less than the cubic time required for the Cholesky decomposition of an arbitrary positive definite matrix.
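The two-values-per-column pattern suggests a simple recurrence. Writing $d_j$ for the diagonal entry of column $j$ and $b_j$ for the repeated entry below it, matching the entries of $L L^{\top}$ against $\Sigma_{\rho}$ gives $d_j = \sqrt{1 - s_j}$ and $b_j = (\rho - s_j) / d_j$, where $s_j = \sum_{i < j} b_i^2$. The sketch below computes these $2 T - 1$ values in a single pass, assuming $-\frac{1}{T - 1} < \rho < 1$. It is an illustrative derivation rather than the implementation used in the rest of this post, and the function name is hypothetical.

```python
import math


def exchangeable_cholesky(rho, T):
    """Cholesky factor of the T x T unit-scale exchangeable correlation matrix.

    Column j is determined by its diagonal entry d_j and the single value b_j
    that fills every position below the diagonal.
    """
    L = [[0.0] * T for _ in range(T)]
    s = 0.0  # running sum of b_i**2 over the columns already processed
    for j in range(T):
        d = math.sqrt(1.0 - s)
        L[j][j] = d
        if j < T - 1:  # the last column has no entries below the diagonal
            b = (rho - s) / d
            for r in range(j + 1, T):
                L[r][j] = b
            s += b**2
    return L


L = exchangeable_cholesky(0.5, 4)
for row in L:
    print("  ".join(f"{entry:.4f}" for entry in row))
```

Computing the unique values takes a number of operations linear in $T$, and filling in the full matrix is quadratic, which is consistent with the suspicion that the general cubic cost can be avoided in this structured case.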

While there is a pattern here that we could manually implement, we can be lazy and use SymPy's lambdify and a small Aesara bookkeeping function to convert the SymPy computational graph for the above decomposition into an Aesara function that can be used in a PyMC model.

In [28]:

def to_aesara(mat):
    return at.stack(
        [at.stack(row, axis=0) for row in mat],
        axis=0
    )

In [29]:

exch_chol = compose(to_aesara, sym.lambdify(rho, L_sym))

Except for the calculation of the Cholesky decomposition, every aspect of this model is the same as in the previous one.

In [30]:

with pm.Model(rng_seeder=SEED) as exch_model:
    ρ = pm.Uniform("ρ", -1. / (T - 1), 1.)
    L = exch_chol(ρ)
    
    z = pm.Normal("z", 0., 1., shape=(K, T))
    μ = z.dot(L.T)
    
    pm.Normal("obs", μ[k_i], 1., observed=pm.Data("x", x))

We now sample from this model's posterior distribution.

In [31]:

with exch_model:
    exch_trace = pm.sample(**SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [ρ, z]

100.00% [6000/6000 00:25

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 26 seconds.
The number of effective samples is smaller than 25% for some parameters.

This model sampled much more quickly than the previous one, but we still see the effective sample size warning.

This model has done a very similar job of recovering the true correlation coefficient.

In [32]:

az.plot_posterior(exch_trace, var_names="ρ", ref_val=0.5);

Now that we have two models that produce roughly equivalent estimates, we can compare their sampling efficiency.

In [33]:

class ModelType(Enum):
    GEN_CHOL = "General Cholesky"
    EXCH_CHOL = "Exchangeable Cholesky"
    TWO_STAGE = "Two-stage"

In [34]:

half_traces = {
    ModelType.GEN_CHOL: gen_trace,
    ModelType.EXCH_CHOL: exch_trace
}

In [35]:

def get_ess_value(trace):
    return (az.ess(trace, var_names="ρ")
                   .to_array()
                   .values[0])

def get_sampling_time(trace):
    return (trace.sample_stats
                 .attrs
                 ["sampling_time"])

def get_bench_df(traces):
    df = pd.DataFrame.from_dict({
        "Effective sample size": valmap(get_ess_value, traces),
        "Sampling time": valmap(get_sampling_time, traces)
    })
    df["Effective samples per second"] = df["Effective sample size"] \
        / df["Sampling time"]
    df.index = df.index.map(lambda e: e.value)
    
    return df

Here we have a benchmark comparing the effective sample size, sampling time, and effective samples per second for these two models.

In [36]:

half_df = get_bench_df(half_traces)

In [37]:

half_df

Out[37]:

	Effective sample size	Sampling time	Effective samples per second
General Cholesky	518.860706	244.747165	2.119987
Exchangeable Cholesky	530.852329	25.752860	20.613335

We see that these two models have produced essentially the same number of effective samples, but that the exchangeable model has produced these samples in a much shorter period of time than the general model, leading to significantly more effective samples per second.
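The last column of the table is just the ratio of the first two. As a check on the arithmetic, the following snippet recomputes it from the values reported above and shows the relative speedup; the dictionary layout is an illustrative choice.

```python
# Effective sample size and sampling time (seconds), copied from the table above
benchmarks = {
    "General Cholesky": (518.860706, 244.747165),
    "Exchangeable Cholesky": (530.852329, 25.752860),
}

ess_per_second = {
    name: ess / seconds for name, (ess, seconds) in benchmarks.items()
}

for name, value in ess_per_second.items():
    print(f"{name}: {value:.2f} effective samples per second")

speedup = ess_per_second["Exchangeable Cholesky"] / ess_per_second["General Cholesky"]
print(f"Speedup of the exchangeable parameterization: {speedup:.1f}x")
```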

In [38]:

fig, axes = plt.subplots(nrows=3, sharex=True,
                         figsize=(0.8 * FIG_WIDTH, 1.2 * FIG_HEIGHT))


(half_df["Effective sample size"]
        .plot.bar(ax=axes[0]));
(half_df["Sampling time"]
        .plot.bar(ax=axes[1]));
(half_df["Effective samples per second"]
        .plot.bar(ax=axes[2], rot=False));

axes[1].set_ylabel("Seconds");

axes[0].set_title("Effective sample size");
axes[1].set_title("Sampling time");
axes[2].set_title("Effective samples per second");

fig.tight_layout()

Effective samples per second is the metric by which we will ultimately judge our parametrizations.

Two-stage sampling for nonnegative correlations¶

There is one more approach that is worth including in this comparison. If the correlation coefficient $\rho \geq 0$ we can produce samples with the desired covariance structure according to the following scheme.

Sample $w_1, \ldots, w_K \sim N(0, 1)$.
Sample $z_{k, t} \sim N(0, 1)$ for $k = 1, \ldots, K$ and $t = 1, \ldots T$.
Set $\mu_{k, t} = \sqrt{\rho} \cdot w_k + \sqrt{1 - \rho} \cdot z_{k, t}$.

To see that these samples have the appropriate structure we calculate

$$ \begin{align} \mathbb{Cov}(\mu_{k, s}, \mu_{k, t}) & = \mathbb{Cov}(\sqrt{\rho} \cdot w_k + \sqrt{1 - \rho} \cdot z_{k, s}, \sqrt{\rho} \cdot w_k + \sqrt{1 - \rho} \cdot z_{k, t}) \\ & = \rho + (1 - \rho) \cdot \mathbb{Cov}(z_{k, s}, z_{k, t}) \end{align} $$

by independence. We have

$$ \mathbb{Cov}(z_{k, s}, z_{k, t}) = \begin{cases} 1 & \text{if } s = t \\ 0 & \text{if } s \neq t \end{cases}, $$

so

$$ \mathbb{Cov}(\mu_{k, s}, \mu_{k, t}) = \begin{cases} 1 & \text{if } s = t \\ \rho & \text{if } s \neq t \end{cases}. $$

It is clear from the $\sqrt{\rho}$ and $\sqrt{1 - \rho}$ terms why this sampling scheme requires $0 \leq \rho < 1$.
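A quick simulation confirms this calculation. The sketch below uses only the standard library and follows the three steps above, with $w_k$ shared across the $T$ components of a vector and $z_{k, t}$ drawn separately for each component; the helper name is hypothetical.

```python
import math
import random


def two_stage_sample(rho, T, rng):
    """One exchangeable T-vector with unit variance and correlation rho >= 0."""
    w = rng.gauss(0.0, 1.0)                       # component shared by all t
    z = [rng.gauss(0.0, 1.0) for _ in range(T)]   # component specific to each t
    return [math.sqrt(rho) * w + math.sqrt(1.0 - rho) * z_t for z_t in z]


rho, T, n_draws = 0.5, 4, 20_000
rng = random.Random(123456789)
draws = [two_stage_sample(rho, T, rng) for _ in range(n_draws)]

# The means are zero by construction, so second moments estimate (co)variances
emp_var = sum(d[0] ** 2 for d in draws) / n_draws
emp_cov = sum(d[0] * d[1] for d in draws) / n_draws
print(f"empirical variance: {emp_var:.3f}, empirical covariance: {emp_cov:.3f}")
```

Passing a negative `rho` makes `math.sqrt` raise a `ValueError`, which mirrors the restriction to nonnegative correlations.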

In [39]:

with pm.Model(rng_seeder=SEED) as two_stage_model:
    ρ = pm.Uniform("ρ", 0., 1.)

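    # Note: relative to the text above, the roles of z and w are swapped here;
    # z is the component shared within each group and w varies with t.
    z = pm.Normal("z", 0., 1., shape=(K, 1))
    w = pm.Normal("w", 0., 1., shape=(K, T))
    μ = at.sqrt(ρ) * z + at.sqrt(1 - ρ) * w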
    
    pm.Normal("y_obs", μ[k_i], 1., observed=pm.Data("x", x))

We now sample from this model's posterior distribution.

In [40]:

with two_stage_model:
    half_traces[ModelType.TWO_STAGE] = pm.sample(**SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [ρ, z, w]

100.00% [6000/6000 01:03

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 64 seconds.

Once again this model recovers the true correlation coefficient roughly as well as the others.

In [41]:

az.plot_posterior(half_traces[ModelType.TWO_STAGE],
                  var_names="ρ", ref_val=0.5);

Reevaluating our benchmarks with this model, we see that it produces slightly fewer effective samples per second than the exchangeable model, but is still significantly more efficient than the general model.

In [42]:

half_df = get_bench_df(half_traces)

In [43]:

fig, axes = plt.subplots(nrows=3, sharex=True,
                         figsize=(0.8 * FIG_WIDTH, 1.2 * FIG_HEIGHT))

(half_df["Effective sample size"]
        .plot.bar(ax=axes[0]));
(half_df["Sampling time"]
        .plot.bar(ax=axes[1]));
(half_df["Effective samples per second"]
        .plot.bar(ax=axes[2], rot=False));

axes[1].set_ylabel("Seconds");

axes[0].set_title("Effective sample size");
axes[1].set_title("Sampling time");
axes[2].set_title("Effective samples per second");

fig.tight_layout()

Since the exchangeable model produces slightly more effective samples per second and can accommodate a wider range of correlation coefficients ($-\frac{1}{T - 1} \leq \rho < 1$ versus $0 \leq \rho < 1$) when compared to the two-stage model, we are inclined to prefer this parameterization.

Benchmarking¶

Before concluding that the exchangeable Cholesky parametrization is best for the hockey fight application, we will repeat the above exercise across a range of values of $\rho$ of interest. We choose the left endpoint of the interval $-\frac{1}{T - 1} \leq \rho < 1$, a few values around zero to see how these parametrizations behave in the neighborhood of zero, and a few larger positive values.

In [44]:

RHOS = [-1 / (T - 1), -0.05, 0., 0.05, 0.5, 0.9]

In [45]:

data = {RHO: x}

for rho in RHOS:
    if rho not in data:
        data[rho] = generate_data(rho, rng=rng)

In [46]:

traces = {RHO: half_traces}

In [47]:

def can_use_model(rho, name):
    return rho >= 0 or name != ModelType.TWO_STAGE

def sample_models(rho, x, models, *,
                  sample_kwargs=SAMPLE_KWARGS,
                  progressbar=True):
    traces = {}
    
    for name, model in models.items():
        if can_use_model(rho, name):
            with model:
                pm.set_data({"x": x})
                traces[name] = pm.sample(progressbar=progressbar,
                                         **sample_kwargs)
            
    return traces

In [48]:

@contextmanager
def set_log_level(logger, level):
    orig_level = logger.getEffectiveLevel()
    
    try:
        logger.setLevel(level)
        yield
    finally:
        logger.setLevel(orig_level)

In [49]:

models = {
    ModelType.GEN_CHOL: gen_model,
    ModelType.EXCH_CHOL: exch_model,
    ModelType.TWO_STAGE: two_stage_model
}

In [50]:

pm_logger = logging.getLogger("pymc")

RHOS_prog = progress_bar(RHOS)

with set_log_level(pm_logger, logging.CRITICAL):
    for rho in RHOS_prog:
        RHOS_prog.comment = f"Sampling models for $\\rho = {rho:.2f}$"

        if rho not in traces:
            
            traces[rho] = sample_models(rho, data[rho], models,
                                        progressbar=False)

100.00% [6/6 22:56

From these traces, we can get a combined benchmark data frame.

In [51]:

def add_rho_level(rho, df):
    return (df.assign(rho=rho)
              .rename_axis("model")
              .set_index("rho", append=True))

In [52]:

bench_dfs = valmap(get_bench_df, traces)
bench_df = (
    pd.concat((
        add_rho_level(rho, rho_df) for rho, rho_df in bench_dfs.items()
      ))
      .sort_index(level="rho")
)

In [53]:

bench_df.head()

Out[53]:

model	rho	Effective sample size	Sampling time	Effective samples per second
Exchangeable Cholesky	-0.333333	63.184842	24.983521	2.529061
General Cholesky	-0.333333	95.288883	243.715941	0.390983
Exchangeable Cholesky	-0.050000	364.477740	23.676677	15.393957
General Cholesky	-0.050000	341.109837	180.427612	1.890563
Exchangeable Cholesky	0.000000	371.375624	21.831124	17.011292

This analysis shows that, in every situation we have tested, the exchangeable Cholesky parameterization is preferable in terms of effective samples per second.

In [54]:

fig, axes = plt.subplots(nrows=3, sharex=True,
                         figsize=(0.8 * FIG_WIDTH, 1.2 * FIG_HEIGHT))

sns.barplot(
    x="rho",  y="Effective sample size",
    data=bench_df.reset_index(),
    hue="model", ax=axes[0]
);
sns.barplot(
    x="rho",  y="Sampling time",
    data=bench_df.reset_index(),
    hue="model", ax=axes[1]
);
sns.barplot(
    x="rho",  y="Effective samples per second",
    data=bench_df.reset_index(),
    hue="model", ax=axes[2]
);

axes[0].set_xlabel(None);
axes[1].set_xlabel(None);

axes[2].set_xticklabels([f"{rho:.2f}" for rho in RHOS]);
axes[2].set_xlabel(r"$\rho$");

axes[0].set_ylabel(None);

axes[1].set_yscale('log');
axes[1].set_ylim(bottom=0.75);
axes[1].set_yticks([1, 10, 100]);
axes[1].set_ylabel("Seconds");

axes[2].set_ylabel(None);

axes[0].set_title("Effective sample size");
axes[1].set_title("Sampling time");
axes[2].set_title("Effective samples per second");

axes[0].legend(loc="upper left", title="Parameterization");
axes[1].legend_.set(visible=False);
axes[2].legend_.set(visible=False);

fig.tight_layout()

Interestingly, the exchangeable parameterization outperforms the two-stage sampling approach by a wider margin in the limit $\rho \searrow 0$.

Of course if we wanted a really rigorous benchmark we would run inference in each of these situations many times to get a reliable average value and standard deviation for each of these metrics. I am unfortunately impatient, and collecting this data has already taken quite a while, so this evidence is sufficient for me to prefer the exchangeable Cholesky parameterization in my hockey fight analysis.
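For readers with more patience, repeating a measurement and summarizing it takes only a few lines. The sketch below times an arbitrary callable several times and reports the mean and standard deviation of the wall time. The `benchmark` helper is hypothetical, and the workload is a stand-in; in the setting of this post the callable would wrap a call to `pm.sample` for one model and one value of $\rho$.

```python
import statistics
import time


def benchmark(fn, n_runs=5):
    """Run fn n_runs times (n_runs >= 2) and summarize its wall time in seconds."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)


# Stand-in workload used only to exercise the helper
mean_time, sd_time = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"{mean_time:.4f} s +/- {sd_time:.4f} s over 5 runs")
```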

This post is available as a Jupyter notebook here.

In [55]:

%load_ext watermark
%watermark -n -u -v -iv

Last updated: Fri Nov 05 2021

Python implementation: CPython
Python version       : 3.7.10
IPython version      : 7.28.0

json      : 2.0.9
logging   : 0.5.1.2
pymc      : 4.0.0
sympy     : 1.9
numpy     : 1.19.5
aesara    : 2.2.2
seaborn   : 0.11.2
pandas    : 1.3.3
matplotlib: 3.4.3
arviz     : 0.11.4

A Bayesian Model of Lego Set Ratings

Austin Rochford — Mon, 25 Oct 2021 04:00:00 GMT

Over the last few months I have developed an interest in analyzing data about Lego sets, primarily scraped from the excellent community resource Brickset. Until now I have focused on price, with one empirical analysis and two Bayesian analyses of the factors that drive Lego set prices, fairness of the pricing of certain notable sets (75296 and 75309), and whether I tend to collect over- or under-priced sets.

In this post I will change that focus to understanding what factors affect the ratings given to each set by Brickset members. Given my particular interest in Star Wars sets (they comprise the overwhelming majority of my collection), I am specifically curious if there is any relationship between the critical and popular reception of individual Star Wars films and television series and the Lego sets associated with them.

First we make the necessary Python imports and do some light housekeeping.

In [1]:

%matplotlib inline

In [2]:

import datetime
from functools import reduce
from warnings import filterwarnings

In [3]:

import arviz as az
from aesara import shared, tensor as at
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.ticker import MultipleLocator, StrMethodFormatter
import networkx as nx
import numpy as np
import pandas as pd
import pymc as pm
import scipy as sp
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import xarray as xr

You are running the v4 development version of PyMC which currently still lacks key features. You probably want to use the stable v3 instead which you can either install via conda or find on the v3 GitHub branch: https://github.com/pymc-devs/pymc/tree/v3

In [4]:

filterwarnings('ignore', category=RuntimeWarning, module='aesara')
filterwarnings('ignore', category=UserWarning, module='arviz')
filterwarnings('ignore', category=FutureWarning, module='pymc')

In [5]:

FIG_WIDTH = 8
FIG_HEIGHT = 6
mpl.rcParams['figure.figsize'] = (FIG_WIDTH, FIG_HEIGHT)

sns.set(color_codes=True)

pct_formatter = StrMethodFormatter('{x:.1%}')

Load the data¶

We begin the real work by loading the data scraped from Brickset. See the first post in this series for more background on the data. Note that in addition to the data present in previous scrapes we have added the count of star ratings '✭', '✭✭','✭✭✭', '✭✭✭✭', and '✭✭✭✭✭' for each set that has ratings available.

In [6]:

DATA_URI = 'https://austinrochford.com/resources/lego/brickset_19800101_20210922.csv.gz'

In [7]:

def to_year(x):
    return np.int64(round(x))

In [8]:

ATTR_COLS = [
    'Set number', 'Name', 'Set type', 'Year released',
    'Theme', 'Subtheme', 'Pieces', 'RRP$'
]
RATING_COLS = ['✭', '✭✭','✭✭✭', '✭✭✭✭', '✭✭✭✭✭']

ALL_COLS = ATTR_COLS + RATING_COLS

In [9]:

full_df = (pd.read_csv(DATA_URI, usecols=ALL_COLS)
             [ALL_COLS]
             .dropna(subset=[col for col in ATTR_COLS if col != "Subtheme"]))
full_df["Year released"] = full_df["Year released"].apply(to_year)
full_df["Subtheme"] = full_df["Subtheme"].fillna("None")
full_df = (full_df.sort_values(["Year released", "Set number"])
                  .set_index("Set number"))

We see that the data set contains information on approximately 8,600 Lego sets produced between 1980 and September 2021.

In [10]:

full_df

Out[10]:

	Name	Set type	Year released	Theme	Subtheme	Pieces	RRP$	✭	✭✭	✭✭✭	✭✭✭✭	✭✭✭✭✭
Set number
1041-2	Educational Duplo Building Set	Normal	1980	Dacta	None	68.0	36.50	NaN	NaN	NaN	NaN	NaN
1075-1	LEGO People Supplementary Set	Normal	1980	Dacta	None	304.0	14.50	NaN	NaN	NaN	NaN	NaN
1101-1	Replacement 4.5V Motor	Normal	1980	Service Packs	None	1.0	5.65	NaN	NaN	NaN	NaN	NaN
1123-1	Ball and Socket Couplings & One Articulated Joint	Normal	1980	Service Packs	None	8.0	16.00	NaN	NaN	NaN	NaN	NaN
1130-1	Plastic Folder for Building Instructions	Normal	1980	Service Packs	None	1.0	14.00	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...
80025-1	Sandy's Power Loader Mech	Normal	2021	Monkie Kid	Season 2	520.0	54.99	NaN	NaN	NaN	NaN	NaN
80026-1	Pigsy's Noodle Tank	Normal	2021	Monkie Kid	Season 2	662.0	59.99	NaN	NaN	NaN	NaN	NaN
80028-1	The Bone Demon	Normal	2021	Monkie Kid	Season 2	1375.0	119.99	NaN	NaN	NaN	NaN	NaN
80106-1	Story of Nian	Normal	2021	Seasonal	Chinese Traditional Festivals	1067.0	79.99	NaN	2.0	2.0	11.0	14.0
80107-1	Spring Lantern Festival	Normal	2021	Seasonal	Chinese Traditional Festivals	1793.0	119.99	1.0	1.0	2.0	14.0	82.0

8632 rows × 12 columns

In [11]:

full_df.describe()

Out[11]:

	Year released	Pieces	RRP$	✭	✭✭	✭✭✭	✭✭✭✭	✭✭✭✭✭
count	8632.000000	8632.000000	8632.000000	5025.000000	5476.000000	5667.000000	5694.000000	5693.000000
mean	2009.286376	263.574490	31.667443	4.649154	8.555698	19.518087	30.423077	45.170912
std	9.030748	489.757624	45.678743	4.126446	7.975121	18.917916	32.934498	62.309390
min	1980.000000	0.000000	0.000000	1.000000	1.000000	1.000000	1.000000	1.000000
25%	2003.000000	32.000000	7.000000	2.000000	3.000000	7.000000	11.000000	13.000000
50%	2012.000000	100.000000	18.000000	3.000000	6.000000	14.000000	20.000000	27.000000
75%	2017.000000	305.000000	39.990000	6.000000	11.000000	26.000000	37.750000	52.000000
max	2021.000000	11695.000000	799.990000	46.000000	96.000000	228.000000	361.000000	1080.000000

As in previous posts, we adjust RRP (recommended retail price) to 2021 dollars.

In [12]:

CPI_URL = 'https://austinrochford.com/resources/lego/CPIAUCNS202100401.csv'

In [13]:

years = pd.date_range('1979-01-01', '2021-01-01', freq='Y') \
        + datetime.timedelta(days=1)
cpi_df = (pd.read_csv(CPI_URL, index_col="DATE", parse_dates=["DATE"])
            .loc[years])
cpi_df["to2021"] = cpi_df.loc["2021-01-01"] / cpi_df
cpi_df["year"] = cpi_df.index.year

In [14]:

cpi_df.head()

Out[14]:

	CPIAUCNS	to2021	year
DATE
1980-01-01	77.8	3.362237	1980
1981-01-01	87.0	3.006690	1981
1982-01-01	94.3	2.773934	1982
1983-01-01	97.8	2.674663	1983
1984-01-01	101.9	2.567046	1984

In [15]:

pd.merge(full_df, cpi_df,
         left_on="Year released",
         right_on="year")

Out[15]:

	Name	Set type	Year released	Theme	Subtheme	Pieces	RRP$	✭	✭✭	✭✭✭	✭✭✭✭	✭✭✭✭✭	CPIAUCNS	to2021	year
0	Educational Duplo Building Set	Normal	1980	Dacta	None	68.0	36.50	NaN	NaN	NaN	NaN	NaN	77.800	3.362237	1980
1	LEGO People Supplementary Set	Normal	1980	Dacta	None	304.0	14.50	NaN	NaN	NaN	NaN	NaN	77.800	3.362237	1980
2	Replacement 4.5V Motor	Normal	1980	Service Packs	None	1.0	5.65	NaN	NaN	NaN	NaN	NaN	77.800	3.362237	1980
3	Ball and Socket Couplings & One Articulated Joint	Normal	1980	Service Packs	None	8.0	16.00	NaN	NaN	NaN	NaN	NaN	77.800	3.362237	1980
4	Plastic Folder for Building Instructions	Normal	1980	Service Packs	None	1.0	14.00	NaN	NaN	NaN	NaN	NaN	77.800	3.362237	1980
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8627	Sandy's Power Loader Mech	Normal	2021	Monkie Kid	Season 2	520.0	54.99	NaN	NaN	NaN	NaN	NaN	261.582	1.000000	2021
8628	Pigsy's Noodle Tank	Normal	2021	Monkie Kid	Season 2	662.0	59.99	NaN	NaN	NaN	NaN	NaN	261.582	1.000000	2021
8629	The Bone Demon	Normal	2021	Monkie Kid	Season 2	1375.0	119.99	NaN	NaN	NaN	NaN	NaN	261.582	1.000000	2021
8630	Story of Nian	Normal	2021	Seasonal	Chinese Traditional Festivals	1067.0	79.99	NaN	2.0	2.0	11.0	14.0	261.582	1.000000	2021
8631	Spring Lantern Festival	Normal	2021	Seasonal	Chinese Traditional Festivals	1793.0	119.99	1.0	1.0	2.0	14.0	82.0	261.582	1.000000	2021

8632 rows × 15 columns

In [16]:

full_df.insert(len(ATTR_COLS) - 1,
               "RRP2021",
               (pd.merge(full_df, cpi_df,
                         left_on="Year released",
                         right_on="year")
                   .apply(lambda df: (df["RRP$"] * df["to2021"]),
                          axis=1))
                   .values)

In [17]:

full_df.head()

Out[17]:

	Name	Set type	Year released	Theme	Subtheme	Pieces	RRP$	RRP2021	✭	✭✭	✭✭✭	✭✭✭✭	✭✭✭✭✭
Set number
1041-2	Educational Duplo Building Set	Normal	1980	Dacta	None	68.0	36.50	122.721632	NaN	NaN	NaN	NaN	NaN
1075-1	LEGO People Supplementary Set	Normal	1980	Dacta	None	304.0	14.50	48.752429	NaN	NaN	NaN	NaN	NaN
1101-1	Replacement 4.5V Motor	Normal	1980	Service Packs	None	1.0	5.65	18.996636	NaN	NaN	NaN	NaN	NaN
1123-1	Ball and Socket Couplings & One Articulated Joint	Normal	1980	Service Packs	None	8.0	16.00	53.795784	NaN	NaN	NaN	NaN	NaN
1130-1	Plastic Folder for Building Instructions	Normal	1980	Service Packs	None	1.0	14.00	47.071311	NaN	NaN	NaN	NaN	NaN
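
As a quick sanity check on the adjustment, here is a minimal sketch that reproduces one RRP2021 value by hand. It assumes the adjustment is simply the nominal price times the ratio of the 2021 CPI to the release-year CPI, with the CPI values copied from the tables above. The function name to_2021_dollars is hypothetical and does not appear in the notebook.

```python
# CPI values copied from the 1980-01-01 and 2021 rows of the tables above
CPI_1980 = 77.8
CPI_2021 = 261.582


def to_2021_dollars(price, cpi_release_year, cpi_2021=CPI_2021):
    """Scale a nominal price by the ratio of 2021 CPI to release-year CPI."""
    return price * cpi_2021 / cpi_release_year


# Set 1041-2 had an RRP of $36.50 in 1980
adjusted = to_2021_dollars(36.50, CPI_1980)
print(round(adjusted, 2))
```

The result agrees with the RRP2021 entry of about 122.72 shown for set 1041-2.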

We also add a column indicating whether or not I own each set.

In [18]:

AUSTIN_URI = 'https://austinrochford.com/resources/lego/Brickset-MySets-owned-20211002.csv'

In [19]:

austin_sets = set(
    pd.read_csv(AUSTIN_URI, usecols=["Number"])
      .values
      .squeeze()
)

In [20]:

full_df["Austin owns"] = (full_df.index
                                 .get_level_values("Set number")
                                 .isin(austin_sets))

In [21]:

full_df.head()

Out[21]:

	Name	Set type	Year released	Theme	Subtheme	Pieces	RRP$	RRP2021	✭	✭✭	✭✭✭	✭✭✭✭	✭✭✭✭✭	Austin owns
Set number
1041-2	Educational Duplo Building Set	Normal	1980	Dacta	None	68.0	36.50	122.721632	NaN	NaN	NaN	NaN	NaN	False
1075-1	LEGO People Supplementary Set	Normal	1980	Dacta	None	304.0	14.50	48.752429	NaN	NaN	NaN	NaN	NaN	False
1101-1	Replacement 4.5V Motor	Normal	1980	Service Packs	None	1.0	5.65	18.996636	NaN	NaN	NaN	NaN	NaN	False
1123-1	Ball and Socket Couplings & One Articulated Joint	Normal	1980	Service Packs	None	8.0	16.00	53.795784	NaN	NaN	NaN	NaN	NaN	False
1130-1	Plastic Folder for Building Instructions	Normal	1980	Service Packs	None	1.0	14.00	47.071311	NaN	NaN	NaN	NaN	NaN	False

Based on the exploratory data analysis in the first post in this series, we filter full_df down to approximately 6,400 sets to be considered for analysis.

In [22]:

FILTERS = [
    full_df["Set type"] == "Normal",
    full_df["Pieces"] > 10,
    full_df["Theme"] != "Duplo",
    full_df["Theme"] != "Service Packs",
    full_df["Theme"] != "Bulk Bricks",
    full_df["RRP2021"] > 0
]

In [23]:

df = full_df[reduce(np.logical_and, FILTERS)].copy()

In [24]:

df

Out[24]:

	Name	Set type	Year released	Theme	Subtheme	Pieces	RRP$	RRP2021	✭	✭✭	✭✭✭	✭✭✭✭	✭✭✭✭✭	Austin owns
Set number
1041-2	Educational Duplo Building Set	Normal	1980	Dacta	None	68.0	36.50	122.721632	NaN	NaN	NaN	NaN	NaN	False
1075-1	LEGO People Supplementary Set	Normal	1980	Dacta	None	304.0	14.50	48.752429	NaN	NaN	NaN	NaN	NaN	False
5233-1	Bedroom	Normal	1980	Homemaker	None	26.0	4.50	15.130064	NaN	NaN	NaN	NaN	NaN	False
6305-1	Trees and Flowers	Normal	1980	Town	Accessories	12.0	3.75	12.608387	3.0	4.0	9.0	6.0	13.0	False
6306-1	Road Signs	Normal	1980	Town	Accessories	12.0	2.50	8.405591	4.0	2.0	8.0	13.0	17.0	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
80025-1	Sandy's Power Loader Mech	Normal	2021	Monkie Kid	Season 2	520.0	54.99	54.990000	NaN	NaN	NaN	NaN	NaN	False
80026-1	Pigsy's Noodle Tank	Normal	2021	Monkie Kid	Season 2	662.0	59.99	59.990000	NaN	NaN	NaN	NaN	NaN	False
80028-1	The Bone Demon	Normal	2021	Monkie Kid	Season 2	1375.0	119.99	119.990000	NaN	NaN	NaN	NaN	NaN	False
80106-1	Story of Nian	Normal	2021	Seasonal	Chinese Traditional Festivals	1067.0	79.99	79.990000	NaN	2.0	2.0	11.0	14.0	False
80107-1	Spring Lantern Festival	Normal	2021	Seasonal	Chinese Traditional Festivals	1793.0	119.99	119.990000	1.0	1.0	2.0	14.0	82.0	False

6420 rows × 14 columns

In [25]:

df.describe()

Out[25]:

	Year released	Pieces	RRP$	RRP2021	✭	✭✭	✭✭✭	✭✭✭✭	✭✭✭✭✭
count	6420.000000	6420.000000	6420.000000	6420.000000	4307.000000	4712.000000	4883.000000	4907.000000	4906.000000
mean	2009.714642	343.184112	37.552062	46.171183	4.636638	8.516978	19.449928	31.193397	46.018956
std	8.939369	544.179590	50.379534	59.373029	4.273900	8.313056	19.843605	34.925100	66.244232
min	1980.000000	11.000000	0.600000	0.971220	1.000000	1.000000	1.000000	1.000000	1.000000
25%	2003.000000	69.000000	9.990000	11.870620	2.000000	3.000000	7.000000	10.000000	13.000000
50%	2012.000000	181.000000	19.990000	27.347055	3.000000	6.000000	13.000000	19.000000	25.000000
75%	2017.000000	404.000000	49.990000	56.497192	6.000000	11.000000	25.000000	38.000000	52.000000
max	2021.000000	11695.000000	799.990000	897.373477	46.000000	96.000000	228.000000	361.000000	1080.000000

We now reduce df to sets that have had at least one rating.

In [26]:

rated_df = df.dropna(how='all', subset=RATING_COLS).copy()
rated_df[RATING_COLS] = rated_df[RATING_COLS].fillna(0.)
rated_df["Ratings"] = rated_df[RATING_COLS].sum(axis=1)

In [27]:

rated_df.head()

Out[27]:

	Name	Set type	Year released	Theme	Subtheme	Pieces	RRP$	RRP2021	✭	✭✭	✭✭✭	✭✭✭✭	✭✭✭✭✭	Austin owns	Ratings
Set number
6305-1	Trees and Flowers	Normal	1980	Town	Accessories	12.0	3.75	12.608387	3.0	4.0	9.0	6.0	13.0	False	35.0
6306-1	Road Signs	Normal	1980	Town	Accessories	12.0	2.50	8.405591	4.0	2.0	8.0	13.0	17.0	False	44.0
6375-2	Exxon Gas Station	Normal	1980	Town	Shops and Services	267.0	20.00	67.244730	0.0	0.0	1.0	5.0	12.0	False	18.0
6390-1	Main Street	Normal	1980	Town	Shops and Services	591.0	40.00	134.489460	0.0	1.0	1.0	3.0	14.0	False	19.0
6861-1	X1 Patrol Craft	Normal	1980	Space	Classic	55.0	4.00	13.448946	1.0	3.0	6.0	14.0	19.0	False	43.0

We see that slightly more than 75% of the sets under consideration have at least one rating, which is (pleasantly) more than I was expecting.

In [28]:

rated_df.shape[0] / df.shape[0]

Out[28]:

0.7643302180685358

Exploratory Data Analysis

We begin with a descriptive exploration of the Brickset ratings data. Immediately we see that the vast majority of sets have been rated a few dozen to a few hundred times.

In [29]:

fig, axes = plt.subplots(ncols=2, sharex=True,
                         figsize=(1.75 * FIG_WIDTH, FIG_HEIGHT))

sns.histplot(rated_df[RATING_COLS].sum(axis=1),
             lw=0., ax=axes[0])

axes[0].set_xlim(left=0);
axes[0].set_xlabel("Number of ratings");

axes[0].set_ylabel("Number of sets");

sns.histplot(rated_df[RATING_COLS].sum(axis=1),
             cumulative=True, lw=0., ax=axes[1])
axes[1].axhline(rated_df.shape[0], c='k', ls='--');

axes[1].set_xlim(left=0);
axes[1].set_xlabel("Number of ratings");

axes[1].set_ylabel("Cumulative number of sets");

fig.tight_layout();

We now explore the distribution of average ratings.

In [30]:

def average_rating(ratings):
    if isinstance(ratings, (xr.DataArray, xr.Dataset)):
        stars = ratings.coords["rating"].str.len()

        return (ratings * stars).sum(dim="rating") / ratings.sum(dim="rating")
    else:
        ratings = np.asanyarray(ratings)
        stars = 1 + np.arange(ratings.shape[-1])

        return (ratings * stars).sum(axis=-1) / ratings.sum(axis=-1)

In [31]:

rated_df["Average rating"] = (rated_df[RATING_COLS]
                                      .apply(average_rating, axis=1))
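
To make the calculation concrete, here is a small worked example of the same weighted mean for a single set, using the star counts shown above for set 6305-1 (Trees and Flowers). This is only an illustration of the arithmetic, not a replacement for average_rating.

```python
import numpy as np

# Star counts for set 6305-1, ordered from one star to five stars
counts = np.array([3., 4., 9., 6., 13.])
stars = 1 + np.arange(counts.size)

# Weighted mean of the star values, weighted by how often each was given
average = (counts * stars).sum() / counts.sum()
print(round(average, 3))
```

Here the 35 ratings contribute a total of 127 stars, so the average rating is 127 / 35, or a bit above 3.6 stars.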

We see that the average rating has decreased over time (our data set covers sets released between 1980 and September 2021), from just under 4.5 stars for the earliest sets to just below 4 stars for the most recent ones. Unsurprisingly, the percentage of ratings in each category between one and five stars has varied similarly over time.

In [32]:

time_fig, axes = plt.subplots(
    ncols=2, sharex=True,
    figsize=(1.75 * FIG_WIDTH, FIG_HEIGHT)
)

(rated_df["Average rating"]
         .groupby(rated_df["Year released"])
         .mean()
         .rolling(5, min_periods=1)
         .mean()
         .plot(ax=axes[0]));

axes[0].set_ylim(3.4, 4.6);
axes[0].set_yticks([3.5, 4, 4.5]);
axes[0].set_yticklabels(["✭✭✭ ½", "✭✭✭✭", "✭✭✭✭ ½"]);
axes[0].set_ylabel("Average set rating");

(rated_df[RATING_COLS]
         .div(rated_df[RATING_COLS].sum(axis=1),
              axis=0)
         .groupby(rated_df["Year released"])
         .mean()
         .rolling(5, min_periods=1)
         .mean()
         .iloc[:, ::-1]
         .plot(ax=axes[1]));

axes[1].yaxis.set_major_formatter(pct_formatter);
axes[1].set_ylabel("Average percentage of ratings");

It is interesting to try to interpret this trend. Since the year is the year the set was released and not necessarily reviewed (Brickset did not exist until 1997), there is certainly some selection bias in early reviews, because they will only exist for those Lego collectors sufficiently passionate to own older sets, join Brickset, and rate them there once the site became available. In fact, for any period of time it is important to remember that these ratings do not represent the sentiment of the general Lego purchasing or collecting public towards each set, but the sentiments of the subset of those purchasers and collectors that are motivated not only to visit and join Brickset, but also to rate their sets there. For reference, I visit Brickset frequently for both research purposes and to track my collection, but I have never rated a set there. It seems reasonable to assume that these biases will shrink, but never truly vanish, for more recently released sets. These biases and caveats are important to keep in the back of our minds as we perform our analysis.

We now turn to the specific (sub)themes that I tend to collect, namely Star Wars and NASA sets.

In [33]:

fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True,
                         figsize=(1.75 * FIG_WIDTH, FIG_HEIGHT))

# Theme
sns.kdeplot(
    rated_df.groupby("Theme")
            ["Average rating"]
            .mean(),
    ax=axes[0]
);
sns.rugplot(
    rated_df.groupby("Theme")
            ["Average rating"]
            .mean(),
    height=0.05, c='C0', ax=axes[0]
);
axes[0].axvline(
    rated_df["Average rating"]
            .mean(),
    ls='--', label="Average (all themes)"
);
axes[0].axvline(
    rated_df[rated_df["Theme"] == "Star Wars"]
            ["Average rating"]
            .mean(),
    color='C1', label="Star Wars sets"
);

axes[0].set_yticks([]);
axes[0].legend(loc='upper left');
axes[0].set_title("Average rating by theme");

# Subtheme
sns.kdeplot(
    rated_df.groupby("Subtheme")
            ["Average rating"]
            .mean(),
    ax=axes[1]
);
sns.rugplot(
    rated_df.groupby("Subtheme")
            ["Average rating"]
            .mean(),
    height=0.05, c='C0', ax=axes[1]
);
axes[1].axvline(
    rated_df["Average rating"]
            .mean(),
    ls='--', label="Average (all subthemes)"
);
axes[1].axvline(
    rated_df[rated_df["Subtheme"] == "NASA"]
            ["Average rating"]
            .mean(),
    color='C2', label="NASA sets"
);

axes[1].legend(loc='upper left');
axes[1].set_title("Average rating by subtheme");

fig.tight_layout();

We see that the average rating for Star Wars sets is close to the average rating of all sets, whereas the average rating for NASA sets is significantly higher than the average rating for all sets. I am not surprised that NASA sets score so highly (21309, 10266, 10283, and 21312 are some of my favorite sets), but I am a bit surprised to find that Star Wars, as an overall theme, is as middle-of-the-road as it is. Even restricting to the period after 1999 (when Star Wars sets started to be released), this phenomenon persists.

In [34]:

star_wars_min_year = (rated_df[rated_df["Theme"] == "Star Wars"]
                              ["Year released"]
                              .min())

In [35]:

star_wars_min_year

Out[35]:

In [36]:

ax = sns.kdeplot(
    rated_df[rated_df["Year released"] >= star_wars_min_year]
            .groupby("Theme")
            ["Average rating"]
            .mean(),
)
sns.rugplot(
    rated_df[rated_df["Year released"] >= star_wars_min_year]
            .groupby("Theme")
            ["Average rating"]
            .mean(),
    height=0.05, c='C0', ax=ax
);
ax.axvline(
    rated_df[rated_df["Year released"] >= star_wars_min_year]
            ["Average rating"]
            .mean(),
    ls='--', label="Average (all themes)"
);
ax.axvline(
    rated_df[rated_df["Theme"] == "Star Wars"]
            ["Average rating"]
            .mean(),
    color='C1', label="Star Wars sets"
);

ax.set_yticks([]);
ax.legend(loc='upper left');
ax.set_title(f"Average rating by theme\n(Sets released after {star_wars_min_year})");

Our subsequent analysis will attempt to control for both the year a set was released and other factors (piece count and price) more rigorously in order to see if this phenomenon persists.

Turning to piece count and price, we see a positive, fairly linear relationship between the logarithm of each of these characteristics and average rating.

In [37]:

pieces_price_grid = sns.pairplot(rated_df,
                    x_vars=["Pieces", "RRP2021"],
                    y_vars=["Average rating"],
                    plot_kws={'alpha': 0.25},
                    height=FIG_HEIGHT)

for ax in pieces_price_grid.axes.flat:
    ax.set_xscale('log');
    
pieces_price_grid.axes[0, 1].set_xlabel("RRP (2021 $)");

It is interesting to consider the causal nature of the relationship between piece count, price, and ratings. Clearly larger sets requiring more pieces have higher production costs and therefore higher retail prices. Higher priced sets probably also have to be of a better overall quality in order to justify the significant expense. I personally occasionally buy cheaper sets that I don't love if they have interesting components. I purchased 75299 just to get Mando and the Child riding a speeder together (the rest of the bags went straight into my spare parts bin).

Based on this discussion, the causal DAG for the relationship between piece count, price, and rating is as follows.

In [38]:

graph = nx.DiGraph()
graph.add_edges_from([
    ("Piece count", "Price"),
    ("Piece count", "Rating"),
    ("Price", "Rating")
])

In [39]:

fig, ax = plt.subplots()

POS = {
    "Piece count": (1, 1),
    "Price": (0, 0.5),
    "Rating": (1, 0)
}
LABEL_POS = {
    "Piece count": (0.97, 1.075),
    "Price": (0, 0.575),
    "Rating": (1, -0.075)
}

nx.draw_networkx_nodes(graph, pos=POS, ax=ax);
nx.draw_networkx_edges(graph, pos=POS, ax=ax);
nx.draw_networkx_labels(graph, pos=LABEL_POS, ax=ax);

ax.set_xticks([]);
ax.set_yticks([]);
ax.set_aspect('equal');

This discussion of causality is certainly interesting and may be worth future exploration, but we will not take the causal perspective for the rest of this post. Rather, our models will be interpreted descriptively, as tools that help us summarize the ratings of various groups of Lego sets.

We conclude our exploratory data analysis by seeing where the ratings of the sets I own fall in the distribution of all rated sets. It appears that most of my sets are rated above average, because, of course, I have excellent taste.

In [40]:

ax = sns.kdeplot(rated_df["Average rating"])

sns.rugplot(rated_df["Average rating"],
            height=0.05, c='C0', ax=ax);
ax.axvline(rated_df["Average rating"].mean(),
           c='C0', ls='--', label="Average (all sets)");

sns.rugplot(
    rated_df[rated_df["Austin owns"]]
            ["Average rating"],
    height=0.05, c='C1', ax=ax
);
ax.axvline(
    rated_df[rated_df["Austin owns"]]
            ["Average rating"]
            .mean(),
    c='C1', label="Average (my sets)"
);

ax.set_yticks([]);
ax.legend();

Modeling Ratings

We will now build a few Bayesian models, gradually incorporating the relationships we found during exploratory data analysis. The appropriate model for this type of data (ratings on a one-to-five star scale) is an ordinal regression model. While the ratings are ostensibly numerical, and we may be tempted to model them using, for example, a normal likelihood, doing so would treat the gaps between adjacent ratings as equal. While it is fairly safe to assume that I like a set that I give four stars better than a set that I give three stars, it is a much bigger assumption to say that the difference in how much I like these two sets is exactly the same as the difference between two sets that I rate with four and five stars just because $4 - 3 = 5 - 4$. Ordinal regression is a flexible model that acknowledges that the responses are ordered, but allows the distance between the categories of response to vary.

Mathematically, an ordinal regression model can be specified as follows. If we have $K$ response classes (in our case $K = 5$), we define $K - 1$ ascending cut points $c_1 < c_2 < \ldots < c_{K - 1}$. Let $g: (0, 1) \to \mathbb{R}$ be a link function with $\displaystyle{\lim_{p \to 0}}\ g(p) = -\infty$ and $\displaystyle{\lim_{p \to 1}}\ g(p) = \infty$. The logit and probit link functions are common choices, leading to ordinal regression models with slightly different interpretations. If $\eta_i$ is a latent quantity related to the $i$-th observed rating (specifying the form of $\eta_i$ is the most important part of our model and will occupy much of the rest of this post), then the probability that the rating $R_i$ is at most $k \in \{1, 2, \ldots, K\}$ is

$$ P(R_i \leq k\ |\ \eta_i) = g^{-1}(c_k - \eta_i). $$

If we define $c_0 = -\infty$, $c_K = \infty$, $g^{-1}(-\infty) = 0$ and $g^{-1}(\infty) = 1$, we get

$$P(R_i = k\ |\ \eta_i) = P(R_i \leq k\ |\ \eta_i) - P(R_i \leq k - 1\ |\ \eta_i) = g^{-1}(c_k - \eta_i) - g^{-1}(c_{k - 1} - \eta_i)$$

To concretely illustrate the ordinal regression model, suppose there are $K = 3$ possible ratings, $c_1 = 0$, $c_2 = 2.3$, and we use the logit link, whose inverse is the logistic function

$$g^{-1}(\eta) = \frac{1}{1 + \exp(-\eta)}.$$

The resulting rating probabilities as a function of $\eta$ are shown below.

In [41]:

C = np.array([0., 2.3])

In [42]:

η_plot = np.linspace(-2, 4, 100)
η_diff = (-np.subtract.outer(η_plot, C))
p_plot = np.diff(sp.special.expit(η_diff), prepend=0., append=1.)

In [43]:

fig, ax = plt.subplots()

for k, c_plot in enumerate(C, start=1):
    ax.axvline(c_plot,
               c='k', ls='--', zorder=5,
               label=f"$c_{{{k}}} = {c_plot}$");

for k, p_plot_ in enumerate(p_plot.T, start=1):
    ax.plot(η_plot, p_plot_, label=f"$k = {k}$");

ax.set_xlim(-2, 4);
ax.set_xlabel(r"$\eta$");

ax.set_ylim(0, 1);
ax.yaxis.set_major_formatter(pct_formatter);
ax.set_ylabel(r"$P(R = k\ |\ \eta)$");

ax.legend();
ax.set_title("Ordinal Regression Probabilities");
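
For readers who prefer a single number to a curve, here is a minimal sketch that evaluates the same three rating probabilities at one value of $\eta$. It assumes the logistic inverse link and the cut points $c_1 = 0$ and $c_2 = 2.3$ from the example above. The helper names inverse_logit and rating_probs are hypothetical and are not part of the notebook.

```python
import numpy as np


def inverse_logit(x):
    """Logistic function, the inverse of the logit link."""
    return 1. / (1. + np.exp(-x))


def rating_probs(eta, cutpoints):
    """P(R = k | eta) for k = 1, ..., K given K - 1 ascending cut points."""
    # P(R <= k | eta) for k = 1, ..., K - 1
    cdf = inverse_logit(np.asarray(cutpoints) - eta)
    # pad with P(R <= 0) = 0 and P(R <= K) = 1, then difference
    cdf = np.concatenate([[0.], cdf, [1.]])
    return np.diff(cdf)


probs = rating_probs(1., [0., 2.3])
print(probs.round(3))
```

At $\eta = 1$, which lies between the two cut points, the middle rating is the most likely one (roughly 52%), with the remaining probability split between the lowest rating (roughly 27%) and the highest (roughly 21%).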

Those familiar with common applications of ordinal regression and/or psychometrics will recognize that we are quite close to having specified an ordered item-response model. In a typical ordinal item-response model, the cut points $c_k$ are allowed to vary according to the item being rated and the latent quantity $\eta$ is allowed to vary according to the item being rated and the person doing the rating. Unfortunately with our Brickset data, we have no information about the individual raters so we cannot form an item-response model. Instead, we will infer a fixed set of cutpoints and allow $\eta$ to vary based on the set being rated.

Intriguingly, Brickset also allows members to leave reviews of specific sets which contain ratings along several dimensions. I have in fact scraped this review data, but exploratory data analysis shows that very few members rate more than one set, so I am not yet confident that I can build a useful item-response model using this data. Building such a model may be the topic of a future post, but for the moment we restrict our attention to the anonymous ratings we have explored above.

Time

Our first model attempts to capture the temporal dynamics of ratings based on the year in which the set was released, as shown above.

In [44]:

time_fig

Out[44]:

This focus on time (and set) effects will allow us to start by building a somewhat simple ordinal regression model in PyMC.

This model will use smoothing splines to model the effect of year on ratings and set-level random effects to capture the popularity of a set relative to the baseline popularity of all sets in the year it was released.

We now establish some notation. Throughout, $i$ will denote the index of a set in rated_df. Let $N_i$ be the number of times that set was rated and $\mathbf{R}_i = (R_{i, 1}, R_{i, 2}, R_{i, 3}, R_{i, 4}, R_{i, 5})$ be the number of one-, two-, three-, four-, and five-star ratings for the $i$-th set, respectively.

In [45]:

n_rating = len(RATING_COLS)
n_cut = n_rating - 1

ratings_ct = rated_df[RATING_COLS].values
ratings = rated_df["Ratings"].values

Let $t(i)$ be (the index of) the year that the $i$-th set was released.

In [46]:

t, year_map = rated_df["Year released"].factorize(sort=True)
n_year = year_map.size
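
As a small illustration of how factorize produces the index $t(i)$, here is a sketch on a handful of made-up release years (the values are hypothetical and are not taken from rated_df).

```python
import pandas as pd

# Hypothetical release years for four sets
years = pd.Series([1999, 1980, 1999, 2021])
codes, uniques = years.factorize(sort=True)

print(codes)          # position of each year among the sorted unique years
print(list(uniques))  # the sorted unique years themselves
```

With sort=True the codes follow chronological order, so the earliest year gets index 0, and uniques[codes] recovers the original years.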

In our first model, we have

$$\eta_i = f_{t(i)} + \beta_{\text{set}, i}.$$

Here $f_t$ is a smoothing spline meant to capture the average rating of sets released in the $t$-th year. This post will not cover in depth the process for specifying a Bayesian smoothing spline. Interested readers should consult a previous post of mine for details.

In [47]:

N_KNOT = 10

knots = np.linspace(0, n_year, N_KNOT)
bf = sp.interpolate.BSpline(knots, np.eye(N_KNOT), 3)
t_dmat = bf(np.arange(n_year))
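
To give a feel for what a spline design matrix contains, here is a hedged, self-contained sketch. Note that it uses a clamped knot vector (boundary knots repeated), a common textbook construction that I am assuming here for illustration, rather than the exact knots used in the cell above. N_YEAR and N_BREAK are hypothetical values.

```python
import numpy as np
from scipy.interpolate import BSpline

N_YEAR = 42    # hypothetical number of distinct release years
DEGREE = 3     # cubic splines
N_BREAK = 8    # hypothetical number of breakpoints, including both ends

# Repeat each boundary knot DEGREE extra times (a "clamped" knot vector)
breaks = np.linspace(0, N_YEAR, N_BREAK)
knots = np.concatenate([
    np.repeat(0., DEGREE), breaks, np.repeat(float(N_YEAR), DEGREE)
])
n_basis = knots.size - DEGREE - 1

# Identity coefficients make the spline return every basis function at once
basis = BSpline(knots, np.eye(n_basis), DEGREE)
design = basis(np.arange(N_YEAR))

print(design.shape)
```

Each row of design corresponds to a year and each column to a basis function. With a clamped knot vector the rows sum to one, so the dot product of design with a coefficient vector is a smooth weighted average of nearby coefficients.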

In [48]:

coords = {
    "cut": np.arange(n_cut),
    "rating": RATING_COLS,
    "knot": np.arange(N_KNOT),
    "set": rated_df.index.values,
    "year": year_map
}

In [49]:

SEED = 123456789 # for reproducibility

In [50]:

with pm.Model(coords=coords, rng_seeder=SEED) as model:
    β_t_inc = pm.Normal("β_t_inc", 0., 0.1, dims="knot")
    β_t = β_t_inc.cumsum()
    f_t = pm.Deterministic("f_t", at.dot(t_dmat, β_t), dims="year")

The set-level random effects $\beta_{\text{set}, i}$ follow a hierarchical normal distribution equivalent to

$$ \begin{align*} \sigma_{\beta_{\text{set}}} & \sim \text{Half}-N\left(2.5^2\right) \\ \beta_{\text{set}, i} & \sim N\left(0, \sigma_{\beta_{\text{set}}}^2\right). \end{align*} $$

In practice, we use an equivalent non-centered parametrization that samples better than this more mathematically elegant one.

Note that it is important that the prior expected value of $\eta_i$ be fixed (in our case it is zero), otherwise the model will not be identified. If the prior expected value is not fixed, adding the same constant to the cutpoints and to $\eta$ will produce the same likelihood.
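
The identification issue is easy to check numerically. The following sketch, which assumes a logistic inverse link and uses hypothetical cutpoints and a hypothetical helper rating_probs, shows that shifting both the cutpoints and $\eta$ by the same constant leaves the rating probabilities unchanged.

```python
import numpy as np


def rating_probs(eta, cutpoints):
    """P(R = k | eta) under a logistic inverse link."""
    cdf = 1. / (1. + np.exp(-(np.asarray(cutpoints) - eta)))
    return np.diff(np.concatenate([[0.], cdf, [1.]]))


cutpoints = np.array([-2., -1., 0., 1.])  # hypothetical cutpoints
eta = 0.4                                 # hypothetical latent value
shift = 0.7                               # arbitrary constant

p = rating_probs(eta, cutpoints)
p_shifted = rating_probs(eta + shift, cutpoints + shift)

print(np.allclose(p, p_shifted))
```

Because only the differences $c_k - \eta$ enter the likelihood, the data alone cannot distinguish between the two parameter settings; pinning down the prior expected value of $\eta$ removes this ambiguity.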

In [51]:

# the scale necessary to make a halfnormal distribution
# have unit variance
HALFNORMAL_SCALE = 1. / np.sqrt(1. - 2. / np.pi)

def noncentered_normal(name, *, dims, μ=None, σ=2.5):
    if μ is None:
        μ = pm.Normal(f"μ_{name}", 0., 2.5)

    Δ = pm.Normal(f"Δ_{name}", 0., 1., dims=dims)
    σ = pm.HalfNormal(f"σ_{name}", σ * HALFNORMAL_SCALE)

    return pm.Deterministic(name, μ + Δ * σ, dims=dims)
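
Both ingredients of this helper can be checked by simulation. The sketch below uses plain NumPy rather than PyMC, with hypothetical values for the mean and scale. It confirms that a half-normal distribution with scale HALFNORMAL_SCALE has approximately unit variance, and that shifting and scaling standard normal offsets yields draws with the intended mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# A half-normal with scale s has variance s^2 * (1 - 2 / pi)
HALFNORMAL_SCALE = 1. / np.sqrt(1. - 2. / np.pi)
half_normal = np.abs(rng.normal(0., HALFNORMAL_SCALE, size=N))

# Non-centered draws: shift and scale standard normal offsets
mu, sigma = 1.5, 0.5                 # hypothetical values
delta = rng.normal(0., 1., size=N)
noncentered = mu + delta * sigma

print(half_normal.var(), noncentered.mean(), noncentered.std())
```

The appeal of the non-centered form is that the sampler explores the standard normal offsets, whose geometry does not depend on the scale parameter, which tends to help in hierarchical models like this one.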

In [52]:

with model:
    β_set = noncentered_normal("β_set", dims="set", μ=0.)

    η = f_t[t] + β_set

We place a $N\left(0, 2.5^2\right)$ prior on the cutpoints, constraining them to be ordered with the keyword argument transform=pm.transforms.ordered.

In [53]:

with model:
    c = pm.Normal("c", 0., 2.5, dims="cut",
                  transform=pm.transforms.ordered,
                  initval=coords["cut"])
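
One common way to enforce an ordering constraint is to let the sampler work with an unconstrained vector whose first entry is the smallest cutpoint and whose remaining entries are the logarithms of the gaps between consecutive cutpoints. The sketch below illustrates that idea in NumPy. It is my own illustration under that assumption, and I am not claiming it matches PyMC's implementation line for line. The function name to_ordered is hypothetical.

```python
import numpy as np


def to_ordered(y):
    """Map an unconstrained vector to a strictly increasing one."""
    y = np.asarray(y, dtype=float)
    # first entry is kept as is; the rest become positive gaps
    increments = np.concatenate([y[:1], np.exp(y[1:])])
    return np.cumsum(increments)


c = to_ordered([-1.2, 0.3, -2., 0.])
print(c)
```

Since every gap is the exponential of a real number, the output is strictly increasing no matter what unconstrained values go in.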

It now remains to specify the likelihood of the observed ratings. Since the data we have is the count of ratings for each number of stars, an ordered multinomial model is appropriate.

In [54]:

with model:
    ratings_obs = pm.OrderedMultinomial(
        "ratings_obs", η, c, ratings,
        dims=("set", "rating"),
        observed=ratings_ct
    )
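
To show what this likelihood computes for a single set, here is a rough NumPy sketch. It assumes a logistic inverse link, hypothetical cutpoints, and a hypothetical value of $\eta$; the star counts are those shown earlier for set 6305-1. The helper names rating_probs and multinomial_logp are my own and are not PyMC functions.

```python
from math import lgamma

import numpy as np


def rating_probs(eta, cutpoints):
    """P(R = k | eta) under a logistic inverse link."""
    cdf = 1. / (1. + np.exp(-(np.asarray(cutpoints) - eta)))
    return np.diff(np.concatenate([[0.], cdf, [1.]]))


def multinomial_logp(counts, probs):
    """Log-probability of the counts under a multinomial distribution."""
    counts = np.asarray(counts, dtype=float)
    # log of the multinomial coefficient n! / (x_1! ... x_K!)
    log_coef = lgamma(counts.sum() + 1.) - sum(lgamma(ct + 1.) for ct in counts)
    return log_coef + (counts * np.log(probs)).sum()


counts = [3, 4, 9, 6, 13]                 # star counts for set 6305-1
cutpoints = np.array([-2., -1., 0., 1.])  # hypothetical cutpoints
probs = rating_probs(0.5, cutpoints)      # hypothetical eta = 0.5

logp = multinomial_logp(counts, probs)
print(logp)
```

The model's log-likelihood is the sum of terms like this over all sets, with each set's $\eta_i$ determining its own vector of rating probabilities.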

We are now ready to sample from this model's posterior distribution.

In [55]:

CORES = 3

SAMPLE_KWARGS = {
    'cores': CORES,
    'random_seed': [SEED + i for i in range(CORES)]
}

In [56]:

with model:
    trace = pm.sample(**SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [β_t_inc, Δ_β_set, σ_β_set, c]

100.00% [6000/6000 20:39

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 1242 seconds.
The number of effective samples is smaller than 10% for some parameters.

Standard sampling diagnostics show no cause for concern.

In [57]:

az.plot_energy(trace);

The following plots show that this model has captured the temporal dynamics of set ratings fairly well.

In [59]:

set_xr = (rated_df[["Year released", "Theme", "Subtheme"]]
                  .rename_axis("set")
                  .rename(columns={
                      "Year released": "year",
                      "Theme": "theme",
                      "Subtheme": "sub"
                  })
                  .to_xarray())

In [60]:

fig, axes = plt.subplots(
    ncols=2, sharex=True,
    figsize=(1.75 * FIG_WIDTH, FIG_HEIGHT)
)

(rated_df["Average rating"]
         .groupby(rated_df["Year released"])
         .mean()
         .rolling(5, min_periods=1)
         .mean()
         .plot(ax=axes[0]));
(trace.posterior["ratings_obs_probs"]
      .pipe(average_rating)
      .mean(dim=("chain", "draw"))
      .groupby(set_xr["year"])
      .mean()
      .to_dataframe(name="Posterior\nexpected value")
      .rolling(5, min_periods=1)
      .mean()
      .plot(c='k', ls='--', ax=axes[0]));

axes[0].set_ylim(3.4, 4.6);
axes[0].set_yticks([3.5, 4, 4.5]);
axes[0].set_yticklabels(["✭✭✭ ½", "✭✭✭✭", "✭✭✭✭ ½"]);
axes[0].set_ylabel("Average set rating");

(rated_df[RATING_COLS]
         .div(rated_df[RATING_COLS].sum(axis=1),
              axis=0)
         .groupby(rated_df["Year released"])
         .mean()
         .rolling(5, min_periods=1)
         .mean()
         .iloc[:, ::-1]
         .plot(ax=axes[1]));
(trace.posterior["ratings_obs_probs"]
      .mean(dim=("chain", "draw"))
      .groupby(set_xr["year"])
      .mean(dim="set")
      .to_dataframe()
      .unstack(level="rating")
      ["ratings_obs_probs"]
      .rolling(5, min_periods=1)
      .mean()
      .plot(c='k', ls='--', legend=False,
            ax=axes[1]));

axes[1].yaxis.set_major_formatter(pct_formatter);
axes[1].set_ylabel("Average percentage of ratings");

We'll refine this model before we start interpreting set-level effects, but it is instructive to consider, even in this simple model, the effect of hierarchical shrinkage on sets with the fewest and the most reviews.

In [61]:

def make_row_set_label(row):
    return f"{row['Name']} ({row.name})"

def make_set_label(set_df):
    return set_df.apply(make_row_set_label, axis=1)

In [62]:

total_ratings_argsorted = ratings_ct.sum(axis=1).argsort()
fewest_total_ratings = total_ratings_argsorted[:10]
most_total_ratings = total_ratings_argsorted[-10:]

In [63]:

fig, axes = plt.subplots(
    ncols=2, sharex=True,
    figsize=(1.75 * FIG_WIDTH, FIG_HEIGHT)
)

az.plot_forest(
    trace, var_names=["ratings_obs_probs"],
    transform=average_rating,
    coords={"set": coords["set"][fewest_total_ratings]},
    combined=True, ax=axes[0]
);
axes[0].scatter(
    rated_df["Average rating"]
            .iloc[fewest_total_ratings]
            .iloc[::-1]
            .values,
    axes[0].get_yticks(),
    c='k', zorder=5,
    label="Actual"
);

axes[0].set_xlabel("Average rating");
axes[0].set_yticklabels(make_set_label(
    rated_df.iloc[fewest_total_ratings]
            .iloc[::-1]
));

axes[0].legend();
axes[0].set_title("Sets with the fewest ratings");

az.plot_forest(
    trace, var_names=["ratings_obs_probs"],
    transform=average_rating,
    coords={"set": coords["set"][most_total_ratings]},
    combined=True, ax=axes[1]
);
axes[1].scatter(
    rated_df["Average rating"]
            .iloc[most_total_ratings]
            .iloc[::-1],
    axes[1].get_yticks(),
    c='k', zorder=5,
    label="Actual"
);

axes[1].set_xlabel("Average rating");

axes[1].yaxis.tick_right();
axes[1].set_yticklabels(make_set_label(
    rated_df.iloc[most_total_ratings]
            .iloc[::-1]
));

axes[1].set_title("Sets with the most ratings");

fig.tight_layout();

There are two interesting elements in these plots. First, the posterior credible intervals are much wider for the sets with the fewest ratings than for those with the most ratings. This makes perfect sense, as we have much more information about the sets on the right. Second, the posterior expected ratings for the sets with the fewest ratings are significantly further from their observed values than those for the sets with the most ratings. This reflects the fact that our hierarchical model shrinks the observed ratings towards the average rating for the year in which the set was released. This shrinkage applies more strongly to sets with fewer reviews, resulting in the behavior shown by the plot on the left.
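
The way shrinkage depends on the number of reviews can be illustrated with a much simpler conjugate normal model. This is only an analogy, not the ordinal model fit in this post: the helper name `shrunken_mean`, the prior standard deviation, and the noise standard deviation are all assumptions chosen for illustration.

```python
def shrunken_mean(obs_mean, n_obs, group_mean, prior_sd=0.25, noise_sd=1.0):
    """Posterior mean of a set's rating under a normal-normal model.

    Hypothetical helper: the set's true mean has a N(group_mean, prior_sd^2)
    prior, and each rating is a noisy N(true mean, noise_sd^2) observation.
    """
    prior_precision = 1.0 / prior_sd**2
    data_precision = n_obs / noise_sd**2
    # Weight on the observed mean grows with the number of ratings
    weight = data_precision / (data_precision + prior_precision)

    return weight * obs_mean + (1.0 - weight) * group_mean


# Two sets with the same observed average but very different review counts
few = shrunken_mean(obs_mean=3.0, n_obs=5, group_mean=4.0)
many = shrunken_mean(obs_mean=3.0, n_obs=500, group_mean=4.0)

print(f"5 ratings:   {few:.2f}")
print(f"500 ratings: {many:.2f}")
```

In this toy setting the set with five ratings is pulled most of the way towards the group mean, while the set with 500 ratings barely moves from its observed average.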

Full model

The full model we will use to interpret set ratings takes into account not only the year in which the set was released but also its theme and subtheme (taxonomic categories of related sets) and its piece count and price.

This model includes year- and set-effects in the same way as the previous one.

In [64]:

theme_id, theme_map = rated_df["Theme"].factorize(sort=True)

In [65]:

sub_id, sub_map = (rated_df["Subtheme"]
                           .fillna("None")
                           .factorize(sort=True))
n_sub = sub_map.size

In [66]:

coords["sub"] = sub_map
coords["theme"] = theme_map

In [67]:

with pm.Model(coords=coords, rng_seeder=SEED) as full_model:
    β_t_inc = pm.Normal("β_t_inc", 0., 0.1, dims="knot")
    β_t = β_t_inc.cumsum()
    f_t = pm.Deterministic("f_t", at.dot(t_dmat, β_t), dims="year")
    
    β_set = noncentered_normal("β_set", dims="set", μ=0.)
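
The sampler output further down lists `Δ_β_set` and `σ_β_set` as free variables, which suggests that `noncentered_normal` builds each effect as a location plus a scaled standard normal offset. A NumPy sketch of that construction under this assumption follows; the half-normal scale of 2.5 and the array sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_sets = 4_000, 3

# Non-centered form: sample a standard normal offset and a scale separately...
Δ = rng.normal(0.0, 1.0, size=(n_draws, n_sets))
σ = np.abs(rng.normal(0.0, 2.5, size=(n_draws, 1)))  # half-normal scale (assumed)
μ = 0.0

# ...then build the effect deterministically. Conditional on σ this is the
# same as drawing β ~ N(μ, σ²) directly, but the sampler only works with Δ
# and σ, which are independent a priori.
β = μ + Δ * σ

print(β.shape)
print(np.abs(β).mean())
```

This parametrization is commonly preferred for hierarchical models when individual groups have little data, since it tends to avoid the funnel-shaped posteriors that slow down NUTS.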

As shown in our exploratory data analysis, both log piece count and log RRP have a fairly linear relationship with a set's average rating.

In [68]:

pieces_price_grid.fig

Out[68]:

Incorporating log piece count and log RRP as predictors in our model is straightforward. Let $x_{\text{pieces}, i}$ denote the standardized log piece count of the $i$-th set, and $x_{\text{price}, i}$ its standardized log RRP (in 2021 dollars).

In [69]:

def make_scaler(x):
    return StandardScaler().fit(x[:, np.newaxis])

def scale(x, scaler):
    x = np.asarray(x)

    return scaler.transform(x[:, np.newaxis])[:, 0]

In [70]:

log_pieces = np.log(rated_df["Pieces"].values)
piece_scaler = make_scaler(log_pieces)
x_pieces = scale(log_pieces, piece_scaler)

In [71]:

log_price = np.log(rated_df["RRP2021"].values)
price_scaler = make_scaler(log_price)
x_price = scale(log_price, price_scaler)
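
For readers without scikit-learn at hand, the two helpers above amount to subtracting a fitted mean and dividing by a fitted (population) standard deviation. The following NumPy-only sketch uses made-up piece counts, and `standardize` is a hypothetical name.

```python
import numpy as np

def standardize(x, mean, sd):
    """Apply a previously fitted centering and scaling to x."""
    return (np.asarray(x, dtype=float) - mean) / sd


# Made-up piece counts, for illustration only
pieces = np.array([55.0, 120.0, 449.0, 863.0, 7541.0])
log_pieces_demo = np.log(pieces)

# StandardScaler fits the mean and the population (ddof=0) standard deviation
fit_mean = log_pieces_demo.mean()
fit_sd = log_pieces_demo.std(ddof=0)
x_std = standardize(log_pieces_demo, fit_mean, fit_sd)

print(x_std.round(2))
```

Keeping the fitted mean and standard deviation around separately is what allows the same transformation to be reused later on group averages.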

We use normal priors for the coefficients $\beta_{\text{pieces}}, \beta_{\text{price}} \sim N\left(0, 2.5^2\right)$.

In [72]:

with full_model:
    β_pieces = pm.Normal("β_pieces", 0., 2.5)
    β_price = pm.Normal("β_price", 0., 2.5)

From one of my recent posts we know that the theme and subtheme of a set have a significant impact on its price. We therefore include these as random effects in our final model of set ratings.

The theme of a set is the broadest taxonomic category to which the set belongs.

In [73]:

ax = (rated_df["Theme"]
              .value_counts()
              .nlargest(10)
              .plot.barh())

ax.set_xlabel("Number of sets");

ax.invert_yaxis();
ax.set_ylabel("Theme");

ax.set_title("Top 10 themes by number of sets");

The subtheme of a set is a finer-grained taxonomic category.

In [74]:

ax = (rated_df["Subtheme"]
              .value_counts()
              .nlargest(10)
              .plot.barh())

ax.set_xlabel("Number of sets");

ax.invert_yaxis();
ax.set_ylabel("Subtheme");

ax.set_title("Top 10 subthemes by number of sets");

As an example, for Star Wars-themed sets, the subtheme largely corresponds to the movie or television show to which the set is related.

In [75]:

ax = (rated_df[rated_df["Theme"] == "Star Wars"]
              ["Subtheme"]
              .value_counts()
              .nlargest(10)
              .plot.barh())

ax.set_xlabel("Number of sets");

ax.invert_yaxis();
ax.set_ylabel("Subtheme");

ax.set_title("Top 10 subthemes\namong Star Wars-themed sets\nby number of sets");

The relationship between theme and subtheme is analogous for Marvel-themed sets.

In [76]:

ax = (rated_df[rated_df["Theme"] == "Marvel Super Heroes"]
              ["Subtheme"]
              .value_counts()
              .nlargest(10)
              .plot.barh())

ax.set_xlabel("Number of sets");

ax.invert_yaxis();
ax.set_ylabel("Subtheme");

ax.set_title("Top 10 subthemes\namong Marvel-themed sets\nby number of sets");

Let $j(i)$ denote the theme of the $i$-th set. The theme-level random effects use a hierarchical normal prior,

$$ \begin{align*} \sigma_{\beta_{\text{theme}}} & \sim \text{Half}-N\left(2.5^2\right) \\ \beta_{\text{theme}, j} & \sim N\left(0, \sigma_{\beta_{\text{theme}}}^2\right) \\ \gamma_{\text{theme},\ \text{pieces}}, \gamma_{\text{theme},\ \text{price}} & \sim N\left(0, 2.5^2\right) \\ \beta_{\text{theme}, j}^{\text{BG}} & = \beta_{\text{theme}, j} + \gamma_{\text{theme},\ \text{pieces}} \cdot \bar{x}_{\text{pieces}, j} + \gamma_{\text{theme},\ \text{price}} \cdot \bar{x}_{\text{price}, j}. \end{align*} $$

Here $\bar{x}_{\text{pieces}, j}$ is the average standardized log number of pieces for all sets in the $j$-th theme, and $\bar{x}_{\text{price}, j}$ is the average standardized log price for those sets. These terms are included following the guidance of Bafumi and Gelman (2006) to account for the fact that the sets in certain themes will have more pieces/higher prices on average than other themes. (The superscript ${}^{\text{BG}}$ stands for Bafumi-Gelman.) These terms will make the theme-level random effects $\beta_{\text{theme}, j}$ more interpretable.

In [77]:

x_pieces_theme_bar = (rated_df["Pieces"]
                              .pipe(np.log)
                              .groupby(theme_id)
                              .mean()
                              .pipe(scale, piece_scaler))
x_price_theme_bar = (rated_df["RRP2021"]
                             .pipe(np.log)
                             .groupby(theme_id)
                             .mean()
                             .pipe(scale, price_scaler))
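
The cell above averages the log values within each theme first and only then applies the scaler. Because standardization is an affine map, this should give the same result as averaging the already standardized values, which is the quantity $\bar{x}$ defined in the text. A small check with made-up numbers (the group labels, values, and the `group_means` helper are hypothetical):

```python
import numpy as np

log_x = np.log(np.array([50.0, 200.0, 800.0, 1_500.0, 3_000.0, 90.0]))
group = np.array([0, 0, 1, 1, 1, 2])  # hypothetical theme index per set

mean, sd = log_x.mean(), log_x.std()

def group_means(values, group):
    """Average `values` within each integer-coded group."""
    return np.bincount(group, weights=values) / np.bincount(group)

# Standardize each set, then average within the group...
scale_then_average = group_means((log_x - mean) / sd, group)
# ...versus average within the group, then standardize the average
average_then_scale = (group_means(log_x, group) - mean) / sd

print(np.allclose(scale_then_average, average_then_scale))
```

The two orders of operation agree, so either can be used to compute the Bafumi-Gelman group averages.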

In [78]:

with full_model:
    β_theme = noncentered_normal("β_theme", dims="theme", μ=0.)
    γ_theme_pieces = pm.Normal("γ_theme_pieces", 0., 2.5)
    γ_theme_price = pm.Normal("γ_theme_price", 0., 2.5)
    β_theme_bg = pm.Deterministic(
        "β_theme_bg",
        β_theme \
            + γ_theme_pieces * x_pieces_theme_bar \
            + γ_theme_price * x_price_theme_bar,
        dims="theme"
    )

Let $\ell(i)$ denote the subtheme of the $i$-th set. The subtheme-level random effects for sets with an associated subtheme follow a similar hierarchical normal prior (with the Bafumi-Gelman terms),

$$ \begin{align*} \sigma_{\beta_{\text{sub}}} & \sim \text{Half}-N\left(2.5^2\right) \\ \beta_{\text{sub}, \ell} & \sim N\left(0, \sigma_{\beta_{\text{sub}}}^2\right) \\ \gamma_{\text{sub},\ \text{pieces}}, \gamma_{\text{sub},\ \text{price}} & \sim N\left(0, 2.5^2\right) \\ \beta_{\text{sub}, \ell}^{\text{BG}} & = \beta_{\text{sub}, \ell} + \gamma_{\text{sub},\ \text{pieces}} \cdot \check{x}_{\text{pieces}, \ell} + \gamma_{\text{sub},\ \text{price}} \cdot \check{x}_{\text{price}, \ell}. \end{align*} $$

Here $\check{x}_{\text{pieces}, \ell}$ is the average standardized log number of pieces for all sets in the $\ell$-th subtheme, and $\check{x}_{\text{price}, \ell}$ is the average standardized log price for those sets.

In [79]:

x_pieces_sub_bar = (rated_df["Pieces"]
                            .pipe(np.log)
                            .groupby(sub_id)
                            .mean()
                            .pipe(scale, piece_scaler))
x_price_sub_bar = (rated_df["RRP2021"]
                           .pipe(np.log)
                           .groupby(sub_id)
                           .mean()
                           .pipe(scale, price_scaler))

In [80]:

with full_model:
    β_sub = noncentered_normal("β_sub", dims="sub", μ=0.)
    γ_sub_pieces = pm.Normal("γ_sub_pieces", 0., 2.5)
    γ_sub_price = pm.Normal("γ_sub_price", 0., 2.5)
    β_sub_bg = pm.Deterministic(
        "β_sub_bg",
        β_sub \
            + γ_sub_pieces * x_pieces_sub_bar \
            + γ_sub_price * x_price_sub_bar,
        dims="sub"
    )

We then let

$$\eta_i = f_{t(i)} + \beta_{\text{theme}, j(i)}^{\text{BG}} + \beta_{\text{sub}, \ell(i)}^{\text{BG}} + \beta_{\text{pieces}} \cdot x_{\text{pieces}, i} + \beta_{\text{price}} \cdot x_{\text{price}, i} + \beta_{\text{set}, i}.$$

In [81]:

with full_model:
    η = f_t[t] \
            + β_set + β_theme_bg[theme_id] + β_sub_bg[sub_id] \
            + β_pieces * x_pieces + β_price * x_price

With this definition of $\eta_i$, the cutpoints and likelihoods are specified similarly to the previous two models.

In [82]:

with full_model:    
    c = pm.Normal("c", 0., 2.5, dims="cut",
                  transform=pm.transforms.ordered,
                  initval=coords["cut"])
    
    ratings_obs = pm.OrderedMultinomial(
        "ratings_obs", η, c, ratings,
        dims=("set", "rating"),
        observed=ratings_ct
    )
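
To make the role of $\eta_i$ and the cutpoints concrete, here is a NumPy sketch of how an ordered-logistic likelihood converts them into five rating probabilities. It follows the same convention as the `get_cum_prob` helper later in this post (probabilities built from the logistic function of $\eta - c$). The values of η and `c` are made up, and `ordered_logistic_probs` is a hypothetical name, not a PyMC function. The average rating here is simply the probability-weighted star count, which is an assumption about how such a summary is computed rather than a copy of `average_rating`.

```python
import numpy as np

def expit(x):
    """Logistic function, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def ordered_logistic_probs(η, c):
    """Category probabilities for latent score η and increasing cutpoints c."""
    # Probability that the rating falls above each cutpoint
    exceed = expit(η - np.asarray(c, dtype=float))
    # Pad with 1 (above no cutpoint) and 0 (above every cutpoint),
    # then difference adjacent exceedance probabilities
    padded = np.concatenate(([1.0], exceed, [0.0]))

    return padded[:-1] - padded[1:]


c_demo = np.array([-3.0, -2.0, -1.0, 0.5])  # made-up: 4 cutpoints, 5 ratings
stars = np.arange(1, 6)

for η_demo in (0.0, 1.0):
    p = ordered_logistic_probs(η_demo, c_demo)
    print(f"η = {η_demo}: probs = {p.round(3)}, average rating = {p @ stars:.2f}")
```

Increasing η moves probability mass towards the higher ratings, which is why every additive component of $\eta_i$ can be read as pushing a set's ratings up or down.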

We sample from the posterior and posterior predictive distributions of this model.

In [83]:

with full_model:
    full_trace = pm.sample(**SAMPLE_KWARGS)
    pp_trace = pm.sample_posterior_predictive(full_trace)
    
full_trace.extend(pp_trace)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [β_t_inc, Δ_β_set, σ_β_set, β_pieces, β_price, Δ_β_theme, σ_β_theme, γ_theme_pieces, γ_theme_price, Δ_β_sub, σ_β_sub, γ_sub_pieces, γ_sub_price, c]

100.00% [6000/6000 39:56

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 2400 seconds.
The number of effective samples is smaller than 25% for some parameters.

100.00% [3000/3000 06:10

Once again the sampling diagnostics show no cause for concern.

In [84]:

az.plot_energy(full_trace);

We use Pareto-smoothed importance sampling leave-one-out cross validation (PSIS-LOO) to compare this model to the previous one.

In [86]:

traces = {
    "Time": trace,
    "Full": full_trace
}

In [87]:

%%time
comp_df = az.compare(traces)
comp_df.loc[:, :"dse"]

CPU times: user 55.2 s, sys: 5.01 ms, total: 55.2 s
Wall time: 55.2 s

Out[87]:

	rank	loo	p_loo	d_loo	weight	se	dse
Full	0	-50609.673077	2830.614967	0.000000	0.92172	229.793944	0.000000
Time	1	-51275.468893	3515.020761	665.795816	0.07828	224.322963	38.636911

In [88]:

fig, ax = plt.subplots()

az.plot_compare(comp_df,
                plot_ic_diff=False, insample_dev=False,
                ax=ax);

It is fascinating that despite adding random effects for themes and subthemes, resulting in over 500 additional formal parameters, this final model has fewer effective parameters by a reasonable margin.
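
A quick piece of arithmetic on the comparison table helps put these numbers in context. The values below are copied from the table; treating a difference of many standard errors as strong evidence is a common heuristic, not a formal test.

```python
# Values copied from the comparison table above
loo_full, p_loo_full = -50609.673077, 2830.614967
loo_time, p_loo_time = -51275.468893, 3515.020761
dse = 38.636911

d_loo = loo_full - loo_time          # improvement in expected log predictive density
p_loo_drop = p_loo_time - p_loo_full  # reduction in effective number of parameters

print(f"ELPD difference:           {d_loo:.1f}")
print(f"Difference in units of SE: {d_loo / dse:.1f}")
print(f"Drop in effective params:  {p_loo_drop:.1f}")
```

The ELPD difference is many times larger than its standard error, so the preference for the full model is not marginal.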

In [89]:

sum([
    theme_map.size + 1, # one parameter per theme, plus the theme scale
    sub_map.size + 1, # one parameter per subtheme, plus the subtheme scale
    2 # regression coefficients
])

Out[89]:

In [90]:

ax = (comp_df["p_loo"]
             .plot.barh())

ax.set_xlabel("Effective number of parameters");
ax.set_ylabel("Model");

Analysis

First we sanity check the model's predictions by calculating the posterior predicted average rating and the associated residual.

In [91]:

pp_avg_rating = (full_trace.posterior_predictive
                           ["ratings_obs"]
                           .pipe(average_rating)
                           .mean(dim=("chain", "draw")))
rated_df["Residual average rating"] = rated_df["Average rating"] - pp_avg_rating

The following plot shows that the residuals are fairly well-centered around zero and reasonably small.

In [92]:

grid = sns.jointplot(
    x="Average rating", y="Residual average rating",
    data=rated_df,
    alpha=0.05
)

grid.ax_marg_x.set_visible(False);

We do see that our model tends to predict higher-than-observed average ratings (resulting in negative residuals) for poorly rated sets and lower-than-observed average ratings (resulting in positive residuals) for highly rated sets. This behavior is due to the shrinkage caused by our hierarchical model for set-level effects. As shown above, this model shrinks the predicted ratings of sets with relatively few ratings towards the overall average rating.

In [93]:

grid = sns.jointplot(x="Average rating", y="Ratings", data=rated_df,
                     alpha=0.15)

grid.ax_joint.set_yscale('log');
grid.ax_marg_x.set_yscale('log');
grid.ax_marg_y.set_visible(False);

We see that sets with extremely low ratings (less than roughly 3.3) or extremely high ratings (above roughly 4.6) tend to have fewer overall ratings than other sets. The predicted ratings for these sets are therefore shrunk towards the mean, resulting in the behavior of the residuals shown above.

Plotting residuals versus our regression features (pieces and price) shows perhaps a slight negative trend, with the model systematically (slightly) overestimating the ratings of very large/expensive sets.

In [94]:

grid = sns.pairplot(
    rated_df,
    x_vars=["Pieces", "RRP2021"],
    y_vars=["Residual average rating"],
    plot_kws={'alpha': 0.25},
    height=FIG_HEIGHT
)

axes = grid.axes.squeeze()

axes[0].set_xscale('log');

axes[1].set_xscale('log');
axes[1].set_xlabel("RRP (2021 $)");

We could certainly improve the model by making the relationship between (log) pieces and price and ratings nonlinear. For simplicity we do not pursue these changes since this discussion applies to relatively few sets. (Addressing this issue may be an interesting future post, but this one is already long enough.)

Recall that our model decomposes ratings into a sum of time, theme, subtheme, piece count, price, and set effects. The effect of time was addressed in the first model. We start our analysis with piece count and price effects, then proceed to analyze theme effects, subtheme effects, set effects, and the relationship between Star Wars set ratings and the reception of the related film/TV series in turn.

Piece count and price

Unsurprisingly the posterior distributions of the regression coefficients for (the standardized logarithm of) the set's piece count and price are positive. It is important to recall from previous posts that piece count and price are highly correlated (of course Lego charges more for sets that have more pieces and therefore higher production cost).

In [95]:

az.plot_posterior(full_trace, var_names=["β_pieces", "β_price"]);

It is also intuitive that these posterior distributions are concentrated above zero. Larger, more expensive sets are naturally a more considered purchase and therefore the underlying quality of the design must be higher to justify the purchase. (We are of course assuming that most of the raters purchased the set for themselves. Outside of review sets distributed by Lego for publicity and gifted sets this assumption seems reasonable.)

We also examine the coefficients of the Bafumi-Gelman correction terms for the average (standardized logarithm of the) piece count and price for theme and subtheme effects.

In [96]:

az.plot_posterior(full_trace, var_names="γ", filter_vars="like", grid=(2, 2));

Interestingly, the corrections for theme effects for both piece count and price are hard to distinguish from zero, while the same corrections for subtheme effects are well-separated from zero. The sign difference in the posterior distributions of $\gamma_{\text{sub}, \text{pieces}}$ and $\gamma_{\text{sub}, \text{price}}$ is interesting.

In [97]:

fig, axes = plt.subplots(ncols=2, figsize=(2 * FIG_WIDTH, FIG_HEIGHT))
axes[0].set_aspect('equal');

axes[0].scatter(x_pieces_sub_bar, x_price_sub_bar, alpha=0.5);
axes[0].axline(
    (0, 0), slope=1, ls='--', c='k',
    label="$y = x$"
);

axes[0].set_xlabel("Average standardized log piece count");
axes[0].set_ylabel("Average standardized log price");

axes[0].legend();
axes[0].set_title("Subtheme");

az.plot_pair(
    full_trace, var_names="γ_sub", filter_vars="like",
    scatter_kwargs={'alpha': 0.25},
    ax=axes[1]
);

axes[1].set_xlabel(r"$\gamma_{\mathrm{sub}, \mathrm{pieces}}$");
axes[1].set_ylabel(r"$\gamma_{\mathrm{sub}, \mathrm{price}}$");
axes[1].set_title("Posterior distribution");

fig.tight_layout();

The plot on the left shows that $\check{x}_{\text{pieces}, \ell}$ and $\check{x}_{\text{price}, \ell}$ are highly correlated (in fact they are quite close to the line $y = x$ in many cases). Unsurprisingly, the plot on the right shows that the posterior samples for the Bafumi-Gelman correction coefficients are similarly highly correlated.

In [98]:

ax = az.plot_posterior(
    full_trace.posterior["γ_sub_pieces"] \
        + full_trace.posterior["γ_sub_price"]
)
ax.set_title(None);

This plot shows the posterior distribution of $\gamma_{\text{sub},\ \text{pieces}} + \gamma_{\text{sub},\ \text{price}}$. This distribution, together with the relationship between $\check{x}_{\text{pieces}, \ell}$ and $\check{x}_{\text{price}, \ell}$, shows that as the average number of pieces in sets in a subtheme increases (and therefore their average price), the effect of subtheme on average rating is slightly negative. Stated another way, sets from subthemes with higher average (standardized log) piece counts have to have higher piece counts to reap the benefits from $\beta_{\text{pieces}}$ and $\beta_{\text{price}}$.
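
Why look at the sum rather than at each coefficient separately? A linear-regression analogy (not the ordinal model above) gives some intuition: when two predictors are nearly collinear, their individual coefficients are poorly identified, but their sum is pinned down much more tightly. The sketch below uses simulated predictors with an assumed correlation of about 0.95.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Two standardized predictors with correlation of about 0.95 (made up)
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)
X = np.column_stack([x1, x2])

# Covariance of the least-squares coefficients, up to the noise variance
coef_cov = np.linalg.inv(X.T @ X)

coef_corr = coef_cov[0, 1] / np.sqrt(coef_cov[0, 0] * coef_cov[1, 1])
var_single = coef_cov[0, 0]
var_sum = coef_cov[0, 0] + coef_cov[1, 1] + 2 * coef_cov[0, 1]

print(f"correlation between coefficients: {coef_corr:.2f}")
print(f"variance of one coefficient:      {var_single:.4f}")
print(f"variance of their sum:            {var_sum:.4f}")
```

In this simplified analogy the two coefficient estimates are strongly negatively correlated, while the variance of their sum is far smaller than the variance of either one alone.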

Theme

The following plot shows the themes with the largest and smallest impact on rating as quantified by $\beta_{\text{theme}, j}$.

In [99]:

def get_extreme_coords(x, coord, n=20):
    argsorted_coord = (x.isel({coord: x.argsort().values})
                        .coords[coord]
                        .values)

    return np.concatenate((
        argsorted_coord[:n // 2], argsorted_coord[-(n // 2):]
    ))
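
The helper above sorts the coordinate labels by the values they index and keeps the first and last `n // 2` of them. The same idea in plain NumPy, with made-up effects and labels (`extreme_labels` is a hypothetical name):

```python
import numpy as np

def extreme_labels(values, labels, n=4):
    """Labels of the n // 2 smallest and n // 2 largest entries of values."""
    order = np.argsort(values)
    sorted_labels = np.asarray(labels)[order]

    return np.concatenate((sorted_labels[:n // 2], sorted_labels[-(n // 2):]))


# Made-up effects, for illustration only
effects = np.array([0.3, -1.2, 0.0, 2.1, -0.4, 0.9])
names = np.array(["a", "b", "c", "d", "e", "f"])

print(extreme_labels(effects, names))
```

The result lists the most negative effects first and the most positive last, which is why the forest plots below reverse the order before plotting.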

In [100]:

β_theme_post_mean = (full_trace.posterior["β_theme"]
                               .mean(dim=("chain", "draw")))

β_theme_post_mean_ext = get_extreme_coords(β_theme_post_mean, "theme")

ax, = az.plot_forest(
    full_trace, var_names=["β_theme"],
    coords={"theme": β_theme_post_mean_ext[::-1]},
    combined=True
);

ax.set_xlabel(r"Posterior $\beta_{\mathrm{theme}, j}$");
ax.set_yticklabels(β_theme_post_mean_ext);
ax.set_title("Top and bottom themes");

A few of these themes jump out at me. It makes sense that the "Ideas" theme would have a high theme effect, because these are sets that were produced after being proposed and voted on by Lego fans. It seems reasonable that the sentiments of those that vote for sets on Lego Ideas and those that rate sets on Brickset would be aligned. I also thoroughly enjoy my collection of Star Wars BrickHeadz.

The themes with lower effects are a bit harder to interpret; I'm not particularly familiar with any of these themes. I do not, however, find it hard to believe that SpongeBob SquarePants and Angry Birds Movie sets would not be particularly well-received, especially by the sort of person that is likely to be rating sets on Brickset. One notable trend here is that there are many child-focused themes among the themes with the most negative effects. I would not, in general, expect children to be logging into Brickset to rate sets, so perhaps parents are more inclined to log on and rate the sets that have disappointed their children than to rate ones that satisfied them.

The Star Wars theme effect is slightly negative. We'll analyze the factors driving the ratings of Star Wars sets in more detail shortly.

In [101]:

ax = sns.kdeplot(β_theme_post_mean)
sns.rugplot(β_theme_post_mean.to_numpy(),
            height=0.05, c='C0', ax=ax);
ax.axvline(β_theme_post_mean[theme_map == "Star Wars"],
           c='C1', label="Star Wars sets");

ax.set_xlabel(r"Posterior expected $\beta_{\mathrm{theme}, j}$");
ax.legend();

Subtheme effects

The following plot shows the subthemes with the largest and smallest impact on rating as quantified by $\beta_{\text{sub}, \ell}$.

In [102]:

β_sub_post_mean = (full_trace.posterior["β_sub"]
                             .mean(dim=("chain", "draw")))

β_sub_post_mean_ext = get_extreme_coords(β_sub_post_mean, "sub")

ax, = az.plot_forest(
    full_trace, var_names=["β_sub"],
    coords={"sub": β_sub_post_mean_ext[::-1]},
    combined=True,
);

ax.set_xlabel(r"Posterior $\beta_{\mathrm{sub}, \ell}$");
ax.set_yticklabels(β_sub_post_mean_ext);
ax.set_title("Top and bottom subthemes");

I don't see much of a narrative here, as I'm not familiar with many of these subthemes. It is interesting that, out of all the Star Wars subthemes, Rogue One makes an appearance as one of the subthemes with the largest impact on ratings.

Interestingly, Star Wars subthemes as a whole don't really stand out among all subthemes. NASA sets do, however, show an above-average subtheme effect.

In [103]:

ax = sns.kdeplot(β_sub_post_mean.to_numpy(),
                 label="All subthemes")
sns.rugplot(β_sub_post_mean.to_numpy(),
            height=0.05, c='C0',
            ax=ax);

star_wars_subs = (rated_df[rated_df["Theme"] == "Star Wars"]
                          ["Subtheme"]
                          .unique())
sns.kdeplot(β_sub_post_mean.sel({"sub": star_wars_subs})
                           .to_numpy(),
            c='C1', label="Star Wars subthemes",
            ax=ax);
sns.rugplot(β_sub_post_mean.sel({"sub": star_wars_subs})
                           .to_numpy(),
            height=0.075, c='C1', ax=ax);

ax.axvline(β_sub_post_mean.sel({"sub": "NASA"}),
           c='C2', ls='--', label="NASA subtheme");

ax.set_xlabel(r"Posterior expected $\beta_{\mathrm{sub}, \ell}$");
ax.legend();

Returning to Rogue One, there are relatively few sets in this subtheme, and I don't own any of them.

In [104]:

rated_df[rated_df["Subtheme"] == "Rogue One"]

Out[104]:

	Name	Set type	Year released	Theme	Subtheme	Pieces	RRP$	RRP2021	✭	✭✭	✭✭✭	✭✭✭✭	✭✭✭✭✭	Austin owns	Ratings	Average rating	Residual average rating
Set number
75152-1	Imperial Assault Hovertank	Normal	2016	Star Wars	Rogue One	385.0	29.99	33.112344	15.0	32.0	93.0	249.0	319.0	False	708.0	4.165254	0.029373
75153-1	AT-ST Walker	Normal	2016	Star Wars	Rogue One	449.0	39.99	44.153473	20.0	34.0	74.0	278.0	546.0	False	952.0	4.361345	0.014891
75154-1	TIE Striker	Normal	2016	Star Wars	Rogue One	543.0	69.99	77.276858	5.0	20.0	76.0	196.0	262.0	False	559.0	4.234347	0.048867
75155-1	Rebel U-wing Fighter	Normal	2016	Star Wars	Rogue One	659.0	79.99	88.317987	20.0	25.0	68.0	238.0	420.0	False	771.0	4.313878	0.011525
75156-1	Krennic's Imperial Shuttle	Normal	2016	Star Wars	Rogue One	863.0	89.99	99.359115	15.0	29.0	59.0	168.0	434.0	False	705.0	4.385816	-0.013055
30496-1	U-Wing Fighter	Normal	2017	Star Wars	Rogue One	55.0	3.99	4.297959	12.0	22.0	59.0	53.0	58.0	False	204.0	3.602941	-0.003324
75164-1	Rebel Trooper Battle Pack	Normal	2017	Star Wars	Rogue One	120.0	14.99	16.146971	14.0	34.0	87.0	150.0	228.0	False	513.0	4.060429	0.007587
75165-1	Imperial Trooper Battle Pack	Normal	2017	Star Wars	Rogue One	112.0	14.99	16.146971	19.0	36.0	106.0	191.0	283.0	False	635.0	4.075591	0.009082
75171-1	Battle on Scarif	Normal	2017	Star Wars	Rogue One	419.0	49.99	53.848369	7.0	26.0	70.0	157.0	121.0	False	381.0	3.942257	0.025589
75172-1	Y-wing Starfighter	Normal	2017	Star Wars	Rogue One	691.0	59.99	64.620198	17.0	30.0	62.0	195.0	499.0	False	803.0	4.405978	-0.005729

In [105]:

((rated_df["Subtheme"] == "Rogue One") & rated_df["Austin owns"]).sum()

Out[105]:

I had expected Darth Vader's Castle (75251) to be a Rogue One set (to my knowledge this is the movie in which the castle first appeared). It was a build that I particularly enjoyed, but it is filed under the Miscellaneous subtheme for reasons not entirely clear to me.

In [106]:

rated_df.loc["75251-1", :"RRP2021"]

Out[106]:

Name             Darth Vader's Castle
Set type                       Normal
Year released                    2019
Theme                       Star Wars
Subtheme                Miscellaneous
Pieces                         1060.0
RRP$                           129.99
RRP2021                      135.0871
Name: 75251-1, dtype: object

The Darth Vader exhibit in my collection

We'll return to Star Wars subtheme effects and how they are related to the (critical and public) reception of various entries in the Star Wars media franchise shortly.

Set-level

The following plot shows the sets with the largest and smallest impact on rating as quantified by $\beta_{\text{set}, i}$.

In [107]:

β_set_post_mean = (full_trace.posterior["β_set"]
                             .mean(dim=("chain", "draw")))
β_set_post_mean_ext = get_extreme_coords(β_set_post_mean, "set")

ax, = az.plot_forest(
    full_trace, var_names=["β_set"],
    coords={"set": β_set_post_mean_ext[::-1]},
    combined=True,
);

ax.set_xlabel(r"Posterior $\beta_{\mathrm{set}, i}$");
ax.set_yticklabels(make_set_label(rated_df.loc[β_set_post_mean_ext]));
ax.set_title("Top and bottom sets");

Here we finally see many Star Wars sets emerge, which makes sense. Since the Star Wars theme effect is close to average (zero) and the distribution of Star Wars subtheme effects does not stand out from the overall distribution of subtheme effects, the set effect is left to account for Star Wars sets with particularly high or low observed ratings (after adjusting for piece count and price). I am particularly glad to see the 501st Legion Clone Troopers (75280) listed in the top ten for set effects; it's a thoroughly enjoyable small set. The Star Wars sets in the bottom ten for set effects suggest a trend that we will elaborate on shortly. Outside of the Imperial Shuttle (75302) and Assault on Hoth (75098) these sets are generally associated with the Prequel Trilogy (Episodes I, II, and III) or the Sequel Trilogy (Episodes VII, VIII, and IX). These entries in the Star Wars media franchise are (on average) less popular than the Original Trilogy (Episodes IV, V, and VI).

I am personally not terribly surprised by most of these sets with extreme negative effects. The prevalence of First Order sets here is telling; I have found very few First Order sets compelling enough to purchase (Kylo Ren's Shuttle (75264) and First Order Heavy Assault Walker (30497) notwithstanding; I am a sucker for MicroFighters and small sets in general). I am somewhat surprised to see Imperial Shuttle (75302) on this list. I didn't think this set was amazing, but it was not particularly bad either. Having missed out on Imperial Shuttle Tydirium (75094) I consider this replacement a solid addition to my collection.

Rating components

With a bit of work we can visualize the cumulative influence of each component in our ratings model (year released, theme, subtheme, piece count, price, and set) on a set's overall rating.

In [108]:

def get_set_η_comp(trace, set_data):
    η_comp = (trace.posterior[["f_t", "β_theme_bg", "β_sub_bg"]]
                   .sel(set_data[["year", "theme", "sub"]]))
    η_comp["pieces"] = trace.posterior["β_pieces"] * set_data["x_pieces"]
    η_comp["price"] = trace.posterior["β_price"] * set_data["x_price"]
    η_comp["β_set"] = trace.posterior["β_set"].sel({"set": set_data["set"]})
    
    return η_comp

In [109]:

def get_comp_cumsum(η_comp):
    return (η_comp.to_array(dim="comp")
                  .cumsum(dim="comp")
                  .to_dataset("comp"))

In [110]:

def get_cum_prob(trace, set_data):
    η_comp = get_set_η_comp(trace, set_data)
    cum_η_comp = get_comp_cumsum(η_comp)
    cum_cdf_comp = ((cum_η_comp - trace.posterior["c"])
                    .pipe(sp.special.expit))
    
    cum_fst_prob = 1. - (
        cum_cdf_comp.sel(cut=0)
                    .assign(rating=xr.DataArray(["✭"], {"cut": [0]}))
                    .set_index(cut="rating")
                    .rename(cut="rating")
    )
    cum_mid_prob = -1. * (
        cum_cdf_comp.diff(dim="cut")
                    .assign(rating=xr.DataArray(["✭✭", "✭✭✭", "✭✭✭✭"],
                            {"cut": coords["cut"][1:]}))
                    .set_index(cut="rating")
                    .rename(cut="rating")
    )
    cum_last_prob = (
        cum_cdf_comp.isel(cut=-1)
                    .assign(rating=xr.DataArray(["✭✭✭✭✭"], {"cut": [3]}))
                    .set_index(cut="rating")
                    .rename(cut="rating")
    )
    
    return xr.concat(
        (cum_fst_prob, cum_mid_prob, cum_last_prob),
        dim="rating"
    )

In [111]:

def star_formatter(stars, _):
    full_stars = int(stars)
    label = full_stars * "✭"
    
    if stars != full_stars:
        label += " ½"
    
    return label

def plot_cum_avg_rating(trace, set_data, ax=None):
    if ax is None:
        _, ax = plt.subplots()
        
    cum_prob = get_cum_prob(trace, set_data)
    
    az.plot_forest(
        cum_prob.pipe(average_rating),
        combined=True, hdi_prob=1.,
        kind='ridgeplot', ridgeplot_truncate=False,
        ridgeplot_quantiles=[0.5],
        ridgeplot_alpha=0.5, ridgeplot_overlap=3.,
        ax=ax
    )
    
    x_min, x_max = ax.get_xlim()
    ax.set_xlim(min(x_min, 3.5), max(x_max, 4.5))
    ax.xaxis.set_major_locator(MultipleLocator(0.5))
    ax.xaxis.set_major_formatter(star_formatter)
    ax.set_xlabel("Posterior average set rating")

    ax.set_yticklabels([
        "Set",
        f"Price\n${rated_df['RRP2021'].loc[set_data['set'].values]:,.2f}",
        f"Pieces\n{rated_df['Pieces'].loc[set_data['set'].values]:,.0f}",
        f"Subtheme (BG)\n({set_data['sub'].values})",
        f"Theme (BG)\n({set_data['theme'].values})",
        f"Year released\n({set_data['year'].values})"
    ]);
    ax.set_ylabel("After cumulative effects of\n"
                  + r"$\longleftarrow$");
    
    ax.set_title(
        make_row_set_label(rated_df.loc[set_data["set"].values])
    );
    
    return ax

In [112]:

set_xr["x_pieces"] = xr.DataArray(x_pieces, {"set": rated_df.index.values})
set_xr["x_price"] = xr.DataArray(x_price, {"set": rated_df.index.values})

These plots take a bit of explanation to interpret, so we begin by considering one of the most iconic Star Wars sets of all time (it's at the top of my wish list), the Millennium Falcon (75192).

In [113]:

UCS_FALCON = "75192-1"

In [114]:

plot_cum_avg_rating(full_trace, set_xr.sel(set=UCS_FALCON));

This plot is best read top to bottom. The topmost distribution is the posterior distribution of the average rating of a set released in 2017 that is average in all other respects. The second distribution is the posterior distribution of the average rating of a Star Wars set released in 2017 that is average in all other respects, and so on. (Note that the ordering here has been chosen for conceptual, not mathematical reasons as there is no ordering inherent in our decomposition. In future work we may use some tools from causal inference to induce such a mathematical ordering.)
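
The mechanics behind each row of the plot can be sketched with point values instead of full posterior distributions: accumulate the components of $\eta$ in the chosen order and convert each running total to an expected rating. All component values and cutpoints below are made up, and `average_rating_from_eta` is a hypothetical helper that assumes an ordered-logistic link.

```python
import numpy as np

def average_rating_from_eta(η, c):
    """Expected star rating under an ordered-logistic model (a sketch)."""
    exceed = 1.0 / (1.0 + np.exp(-(η - np.asarray(c, dtype=float))))
    padded = np.concatenate(([1.0], exceed, [0.0]))
    probs = padded[:-1] - padded[1:]

    return probs @ np.arange(1, probs.size + 1)


# Made-up component values and cutpoints, for illustration only
components = {
    "year": 0.10, "theme": -0.05, "subtheme": 0.02,
    "pieces": 0.80, "price": 0.30, "set": 0.25,
}
c_sketch = np.array([-3.0, -2.0, -1.0, 0.5])

# Running total of η after each component is added, in plot order
cumulative_η = np.cumsum(list(components.values()))

for name, η_k in zip(components, cumulative_η):
    print(f"after {name:<8}: {average_rating_from_eta(η_k, c_sketch):.2f}")
```

Because the rating is a nonlinear function of $\eta$, the size of the jump attributed to a component depends on where it falls in the ordering, which is another reason to treat the ordering as a presentational choice.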

This plot shows that 75192 gets its most significant ratings boost due to the extremely large number of pieces, but the price and set effects are not to be ignored. In short, this is an outstanding set in almost every respect.

The following plots show the ratings decomposition for three of my favorite non-Star Wars sets, NASA Apollo Saturn V (92176), NASA Apollo 11 Lunar Lander (10266), and NASA Space Shuttle Discovery (10283).

In [115]:

SATURN_V = "92176-1"
LUNAR_LANDER = "10266-1"
SPACE_SHUTTLE = "10283-1"

In [116]:

fig, axes = plt.subplots(nrows=3, sharex=True, sharey=True,
                         figsize=(FIG_WIDTH, 3 * FIG_HEIGHT))

plot_cum_avg_rating(full_trace, set_xr.sel(set=SATURN_V),
                    ax=axes[0]);
plot_cum_avg_rating(full_trace, set_xr.sel(set=LUNAR_LANDER),
                    ax=axes[1]);
plot_cum_avg_rating(full_trace, set_xr.sel(set=SPACE_SHUTTLE),
                    ax=axes[2]);

axes[1].set_xlabel(None);
axes[2].set_xlabel(None);

fig.tight_layout();

These sets get healthy ratings boosts from almost every factor, in contrast to most Star Wars sets.

It is interesting to contrast these decompositions with that for another one of my favorite NASA/space sets, Women of NASA (21312).

In [117]:

WOMEN_OF_NASA = "21312-1"

In [118]:

plot_cum_avg_rating(full_trace, set_xr.sel(set=WOMEN_OF_NASA));

This set gets its most significant boost from the Ideas theme, and its small size and modest price don't contribute much. I am a bit disappointed to see that the set effect is slightly negative here.

So far we have focused on sets with average-to-high ratings. We return to the Imperial Shuttle (75302) as an example of a set with below average ratings.

In [119]:

IMPERIAL_SHUTTLE = "75302-1"

In [120]:

plot_cum_avg_rating(full_trace, set_xr.sel(set=IMPERIAL_SHUTTLE));

We see that most factors (year released, theme, subtheme, piece count, and price) point to an average (or slightly above average) rating. The final posterior average rating drops dramatically when set effects are factored in, showing that this set is well below the quality that Brickset's raters expected given its other attributes. It is also interesting to see that the uncertainty in the posterior distribution is much larger after the set effect is included than before.

In [121]:

ax = (rated_df[RATING_COLS]
              .loc[IMPERIAL_SHUTTLE]
              .plot.bar(rot=0))

ax.set_xlabel("Rating");
ax.set_ylabel("Number of ratings\n(observed)");
ax.set_title(make_row_set_label(rated_df.loc[IMPERIAL_SHUTTLE]));

In [122]:

rating_pct = (rated_df[RATING_COLS]
                      .div(rated_df["Ratings"], axis=0))

ax = (rating_pct.le(rating_pct.loc[IMPERIAL_SHUTTLE])
                .mean()
                .plot.bar(rot=0))

ax.set_xlabel("Rating");

ax.yaxis.set_major_formatter(pct_formatter);
ax.set_ylabel("Percentile of percent of\ntotal ratings in rating");

ax.set_title(make_row_set_label(rated_df.loc[IMPERIAL_SHUTTLE]));

We see that this set has a disproportionate number of ratings in the one to four star range and very few in the five star range (relative to all sets). This relatively large number of low ratings is partially responsible for the width of the final posterior once set effects have been included.
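
The percentile calculation in the cell above can be illustrated on a toy table. The following is a minimal sketch with made-up counts (the set labels and numbers are hypothetical, not Brickset data): each row is converted to the share of its ratings in each star bucket, and then for one set we compute the fraction of all sets whose share in that bucket is no larger.

```python
import pandas as pd

# Hypothetical rating counts for four sets (columns are star ratings).
counts = pd.DataFrame(
    {"1": [1, 0, 2, 0], "3": [3, 2, 6, 1], "5": [6, 8, 2, 9]},
    index=["A", "B", "C", "D"],
)

# Share of each set's ratings that fall in each star bucket.
share = counts.div(counts.sum(axis=1), axis=0)

# For set "C": the fraction of sets whose share in each bucket is
# less than or equal to C's share (an empirical percentile).
percentile = share.le(share.loc["C"]).mean()
print(percentile)
```

In this toy example set "C" sits at the top percentile for one and three star shares but only at the 25th percentile for five star shares, which is the same pattern the bar chart above shows for the Imperial Shuttle.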

An interesting case to consider is the International Space Station (21321).

In [123]:

ISS = "21321-1"

In [124]:

plot_cum_avg_rating(full_trace, set_xr.sel(set=ISS));

This set gets a boost from its theme (Ideas) and number of pieces, but the set effect reduces the overall posterior average rating appreciably. This decomposition is in line with my experience with this set. It is an interesting idea, but the full build is quite flimsy and prone to breaking (mine has fallen apart in various ways a few times). Of all my NASA/space sets it is my least favorite.

For fun, we look at the decomposition of the set with the smallest posterior expected average rating.

In [125]:

WORST_SET_DATA = set_xr.isel(set=pp_avg_rating.argmin())

In [126]:

plot_cum_avg_rating(full_trace, WORST_SET_DATA);

This set has the deck stacked against it ratings-wise: it's from a poorly rated theme and it's small and cheap.

These decompositions are really fascinating to get lost in. For any interested readers, I have posted the decomposition plots for every set in the data set here (warning: this is a 109MB gzipped tarball).

My collection¶

As in previous posts, we can use this model to shed light on my collecting behavior.

In [127]:

austin_themes = (rated_df[rated_df["Austin owns"]]
                         ["Theme"]
                         .unique())
austin_subs = (rated_df[rated_df["Austin owns"]]
                       ["Subtheme"]
                       .dropna()
                       .unique())

In [128]:

fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True,
                         figsize=(1.75 * FIG_WIDTH, FIG_HEIGHT))

sns.kdeplot(β_theme_post_mean.to_numpy(),
            label="All themes", ax=axes[0]);
sns.rugplot(β_theme_post_mean.to_numpy(),
            height=0.05, c='C0', ax=axes[0]);

sns.kdeplot(β_theme_post_mean.sel({"theme": austin_themes})
                             .to_numpy(),
            c='C1', label="My themes", ax=axes[0]);
sns.rugplot(β_theme_post_mean.sel({"theme": austin_themes})
                             .to_numpy(),
            height=0.075, c='C1', ax=axes[0]);

axes[0].set_xlabel(r"Posterior expected $\beta_{\mathrm{theme}, j}$");
axes[0].legend();

sns.kdeplot(β_sub_post_mean.to_numpy(),
            label="All subthemes", ax=axes[1]);
sns.rugplot(β_sub_post_mean.to_numpy(),
            height=0.05, c='C0', ax=axes[1]);

sns.kdeplot(β_sub_post_mean.sel({"sub": austin_subs})
                           .to_numpy(),
            c='C1', label="My subthemes",
            ax=axes[1]);
sns.rugplot(β_sub_post_mean.sel({"sub": austin_subs})
                           .to_numpy(),
            height=0.075, c='C1', ax=axes[1]);

axes[1].set_xlabel(r"Posterior expected $\beta_{\mathrm{sub}, \ell}$");
axes[1].legend();

fig.tight_layout();

From these plots, it appears that I do tend to collect sets from themes and subthemes that tend to be (slightly) more highly reviewed than average.

The following plots show the specific posterior distributions for the themes and subthemes of which I own at least one set.

In [129]:

fig, axes = plt.subplots(ncols=2, sharey=True,
                         figsize=(1.75 * FIG_WIDTH, FIG_HEIGHT))

az.plot_forest(
    full_trace, var_names=["β_theme"],
    coords={"theme": austin_themes},
    combined=True, ax=axes[0]
);

axes[0].set_xlabel(r"Posterior expected $\beta_{\mathrm{theme}, j}$");

axes[0].set_yticklabels(austin_themes[::-1]);
axes[0].set_ylabel("Theme");

axes[0].set_title(None);

axes[1].barh(
    axes[0].get_yticks(),
    rated_df[rated_df["Austin owns"]]
            ["Theme"]
            .value_counts()
            .loc[austin_themes[::-1]]
);

axes[1].set_xlabel("Number of sets");

fig.suptitle("My themes");
fig.tight_layout();

In [130]:

fig, axes = plt.subplots(ncols=2, sharey=True,
                         figsize=(1.75 * FIG_WIDTH, 1.5 * FIG_HEIGHT))


az.plot_forest(
    full_trace, var_names=["β_sub"],
    coords={"sub": austin_subs},
    combined=True, ax=axes[0]
);

axes[0].set_xlabel(r"Posterior expected $\beta_{\mathrm{sub}, j}$");

axes[0].set_yticklabels(austin_subs[::-1]);
axes[0].set_ylabel("Subtheme");

axes[0].set_title(None);

axes[1].barh(
    axes[0].get_yticks(),
    rated_df[rated_df["Austin owns"]]
            ["Subtheme"]
            .value_counts()
            .loc[austin_subs[::-1]]
);

axes[1].set_xlabel("Number of sets");

fig.suptitle("My subthemes");
fig.tight_layout();

These results would seem to justify a moderate amount of pride in my collecting habits.

Star Wars set ratings vs. media reception¶

Finally, we turn to our last question: whether the ratings of Star Wars sets are related to the critical and/or public reception of the entries in the Star Wars media franchise they are drawn from.

These critic and audience scores were retrieved from Rotten Tomatoes on October 24, 2021. There are many sources for film rating data online that could be used for this analysis; Rotten Tomatoes was chosen primarily for convenience.

In [131]:

movie_df = pd.DataFrame.from_records(
    [("Episode I", 0.52, 0.59),
     ("Episode II", 0.65, 0.56),
     ("Episode III", 0.8, 0.66),
     ("Episode IV", 0.92, 0.96),
     ("Episode V", 0.94, 0.97),
     ("Episode VI", 0.82, 0.95),
     ("The Force Awakens", 0.93, 0.85),
     ("Rogue One", 0.84, 0.86),
     ("The Last Jedi", 0.91, 0.42),
     ("Solo", 0.7, 0.64),
     ("The Rise of Skywalker", 0.52, 0.86)],
    columns=["Subtheme", "Critics", "Audiences"]
)

movie_df["post_mean"] = β_sub_post_mean.sel(sub=movie_df["Subtheme"].values)
movie_df["bg_post_mean"] = (full_trace.posterior["β_sub_bg"]
                                      .sel(sub=movie_df["Subtheme"].values)
                                      .mean(dim=("chain", "draw")))

movie_df["Critics (logit)"] = sp.special.logit(movie_df["Critics"])
movie_df["Audiences (logit)"] = sp.special.logit(movie_df["Audiences"])

In [132]:

movie_df

Out[132]:

	Subtheme	Critics	Audiences	post_mean	bg_post_mean	Critics (logit)	Audiences (logit)
0	Episode I	0.52	0.59	-0.172209	-0.185344	0.080043	0.363965
1	Episode II	0.65	0.56	0.104313	0.075953	0.619039	0.241162
2	Episode III	0.80	0.66	0.077215	0.047762	1.386294	0.663294
3	Episode IV	0.92	0.96	-0.034464	-0.059945	2.442347	3.178054
4	Episode V	0.94	0.97	0.057337	0.015157	2.751535	3.476099
5	Episode VI	0.82	0.95	0.000920	0.006223	1.516347	2.944439
6	The Force Awakens	0.93	0.85	-0.194143	-0.238966	2.586689	1.734601
7	Rogue One	0.84	0.86	0.281671	0.218561	1.658228	1.815290
8	The Last Jedi	0.91	0.42	-0.220747	-0.300822	2.313635	-0.322773
9	Solo	0.70	0.64	-0.029700	-0.108130	0.847298	0.575364
10	The Rise of Skywalker	0.52	0.86	-0.121238	-0.190297	0.080043	1.815290

Here we have both the raw and log-odds transformed critics and audiences scores from Rotten Tomatoes along with the posterior expected values of both subtheme effects, $\beta_{\text{sub}, j}$ and $\beta_{\text{sub}, j}^{\text{BG}}$.
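
For readers unfamiliar with the log-odds transform, here is a minimal sketch using only the standard library (the helper names `logit` and `expit` are ours for illustration; the cell above uses `sp.special.logit`). It maps a score in $(0, 1)$ to the whole real line, which stretches out differences near 0 and 1.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def expit(x):
    """Inverse of logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Episode III critics score from the table above.
print(round(logit(0.80), 6))  # 1.386294
```

The printed value agrees with the `Critics (logit)` entry for Episode III in the table.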

In [133]:

REVIEW_SCORES = ["Critics", "Audiences", "Critics (logit)", "Audiences (logit)"]
POST_EFFECTS = ["post_mean", "bg_post_mean"]

In [134]:

def add_corr(x, y, **_):
    ax = plt.gca()
    
    ρ, ρ_p_val = sp.stats.pearsonr(x, y)
    r, r_p_val = sp.stats.spearmanr(x, y)
    
    ax.set_title(
        r"$\rho = " + f"{ρ:.2f}$" + f" ($p = {ρ_p_val:.3f}$)\n"
            + rf"$r = {r:.2f}\ (p = {r_p_val:.3f})$"
    )

In [135]:

grid = sns.pairplot(
    movie_df,
    x_vars=REVIEW_SCORES, y_vars=POST_EFFECTS
)
grid.map(add_corr);

grid.axes[0, 0].set_ylabel("Posterior expected\n" + r"$\beta_{\mathrm{sub}, j}$");
grid.axes[1, 0].set_ylabel("Posterior expected\n" + r"$\beta_{\mathrm{sub}, j}^{\mathrm{BG}}$");

grid.fig.suptitle("Star Wars Movies");
grid.tight_layout();

This plot shows both the Pearson correlation coefficient ($\rho$) and the Spearman rank correlation coefficient ($r$) of the Critic and Audience scores (both standard and log-odds) and the posterior expected subtheme effects (both with and without the Bafumi-Gelman correction). We see that these correlations tend to be weak and insignificant.
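
The difference between the two coefficients is easiest to see on a toy example. The sketch below is an illustration with made-up data, not part of the analysis: it computes the Pearson correlation directly and the Spearman correlation as the Pearson correlation of the ranks (this simple ranking is only valid when there are no ties, which is an assumption of the sketch).

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two 1-d arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman rank correlation, assuming no tied values."""
    # argsort of argsort gives the 0-based rank of each element.
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)

x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
y = np.exp(10 * x)  # monotone in x, but strongly nonlinear

print(pearson(x, y), spearman(x, y))
```

Because `y` is an increasing function of `x`, the rank correlation is one even though the linear correlation is noticeably smaller. This is why it is worth reporting both when the relationship may be monotone but not linear.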

We produce a similar plot for Star Wars television shows below.

In [136]:

tv_df = pd.DataFrame.from_records(
    [("The Clone Wars", 0.93, 0.92),
     ("Rebels", 0.98, 0.82),
     ("The Mandalorian", 0.93, 0.91),
     ("Rebels", 0.92, 0.6),
     ("The Bad Batch", 0.88, 0.81)],
    columns=["Subtheme", "Critics", "Audiences"]
)

tv_df["post_mean"] = β_sub_post_mean.sel(sub=tv_df["Subtheme"].values)
tv_df["bg_post_mean"] = (full_trace.posterior["β_sub_bg"]
                                      .sel(sub=tv_df["Subtheme"].values)
                                      .mean(dim=("chain", "draw")))

tv_df["Critics (logit)"] = sp.special.logit(tv_df["Critics"])
tv_df["Audiences (logit)"] = sp.special.logit(tv_df["Audiences"])

In [137]:

grid = sns.pairplot(
    tv_df,
    x_vars=REVIEW_SCORES, y_vars=POST_EFFECTS
)
grid.map(add_corr);

grid.axes[0, 0].set_ylabel("Posterior expected\n" + r"$\beta_{\mathrm{sub}, j}$");
grid.axes[1, 0].set_ylabel("Posterior expected\n" + r"$\beta_{\mathrm{sub}, j}^{\mathrm{BG}}$");

grid.fig.suptitle("Star Wars TV Shows");
grid.tight_layout();

It is interesting that the rank correlations here are much higher than for the movies above, but still not significant. (Of course as good Bayesians we take these p-values with a heap of salt.)

In the end we conclude that this model shows no evidence that the ratings of Star Wars sets are correlated with the critical or audience reception of the entries in the Star Wars media franchise from which they are drawn.

This post is available as a Jupyter notebook here.

In [138]:

%load_ext watermark
%watermark -n -u -v -iv

Last updated: Mon Oct 25 2021

Python implementation: CPython
Python version       : 3.7.10
IPython version      : 7.28.0

xarray    : 0.19.0
seaborn   : 0.11.2
scipy     : 1.7.1
pandas    : 1.3.3
pymc      : 4.0.0
arviz     : 0.11.4
numpy     : 1.19.5
aesara    : 2.2.2
networkx  : 2.5
matplotlib: 3.4.3

A Fair Price for the Titanic? A Bayesian Analysis of the Price of Large Lego Sets

Austin Rochford — Tue, 12 Oct 2021 04:00:00 GMT

Lego recently announced the release of an extremely large (more than 9,000 pieces, roughly 4.5 ft long) model of the Titanic (10294) priced at $629.99.

9,090 pieces. 1.3 meters long (4 ft. 5 in). One LEGO Titanic building project! https://t.co/guhn2isu17 pic.twitter.com/jszY6C4MtC
— LEGO (@LEGO_Group) October 7, 2021

Readers of this blog will know that I have written a series of posts (1 2 3) analyzing the factors that drive Lego prices, from the year a set was released to its piece count, theme, and subtheme. The impending release of the Titanic offers an excellent opportunity to use the model from the most recent post in this series to examine the prices of the sets with the largest number of pieces.

First we make the necessary Python imports and do a bit of housekeeping.

In [1]:

%matplotlib inline

In [2]:

import datetime
from functools import reduce

In [3]:

from aesara import shared, tensor as at
import arviz as az
from matplotlib import MatplotlibDeprecationWarning, pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import numpy as np
import pandas as pd
import pymc3 as pm
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import xarray as xr

You are running the v4 development version of PyMC3 which currently still lacks key features. You probably want to use the stable v3 instead which you can either install via conda or find on the v3 GitHub branch: https://github.com/pymc-devs/pymc3/tree/v3

In [4]:

FIGSIZE = (8, 6)

plt.rcParams['figure.figsize'] = FIGSIZE
sns.set(color_codes=True)

dollar_formatter = StrMethodFormatter("${x:,.0f}")

Load the Data¶

We begin the real work by loading the data scraped from Brickset. See the first post in this series for more background on the data.

In [5]:

DATA_URL = 'https://austinrochford.com/resources/lego/brickset_19800101_20211098.csv.gz'

In [6]:

def to_datetime(year):
    return np.datetime64(f"{round(year)}-01-01")

In [7]:

COLS = [
    "Year released", "Set number",
    "Name", "Set type", "Theme", "Subtheme",
    "Pieces", "RRP$"
]

In [8]:

full_df = (pd.read_csv(DATA_URL, usecols=COLS)
             .dropna(subset=set(COLS) - {"Subtheme"}))
full_df["Year released"] = full_df["Year released"].apply(to_datetime)
full_df = (full_df.sort_values(["Year released", "Set number"])
                  .set_index("Set number"))

In [9]:

full_df.head()

Out[9]:

	Name	Set type	Theme	Year released	Pieces	Subtheme	RRP$
Set number
1041-2	Educational Duplo Building Set	Normal	Dacta	1980-01-01	68.0	NaN	36.50
1075-1	LEGO People Supplementary Set	Normal	Dacta	1980-01-01	304.0	NaN	14.50
1101-1	Replacement 4.5V Motor	Normal	Service Packs	1980-01-01	1.0	NaN	5.65
1123-1	Ball and Socket Couplings & One Articulated Joint	Normal	Service Packs	1980-01-01	8.0	NaN	16.00
1130-1	Plastic Folder for Building Instructions	Normal	Service Packs	1980-01-01	1.0	NaN	14.00

In [10]:

full_df.tail()

Out[10]:

	Name	Set type	Theme	Year released	Pieces	Subtheme	RRP$
Set number
80025-1	Sandy's Power Loader Mech	Normal	Monkie Kid	2021-01-01	520.0	Season 2	54.99
80026-1	Pigsy's Noodle Tank	Normal	Monkie Kid	2021-01-01	662.0	Season 2	59.99
80028-1	The Bone Demon	Normal	Monkie Kid	2021-01-01	1375.0	Season 2	119.99
80106-1	Story of Nian	Normal	Seasonal	2021-01-01	1067.0	Chinese Traditional Festivals	79.99
80107-1	Spring Lantern Festival	Normal	Seasonal	2021-01-01	1793.0	Chinese Traditional Festivals	119.99

In [11]:

full_df.describe()

Out[11]:

	Pieces	RRP$
count	8636.000000	8636.000000
mean	265.029180	31.741934
std	500.511219	46.129177
min	0.000000	0.000000
25%	32.000000	7.000000
50%	100.000000	18.000000
75%	305.000000	39.990000
max	11695.000000	799.990000

We now add a column RRP2021, which is RRP$ adjusted to 2021 dollars.

In [12]:

CPI_URL = 'https://austinrochford.com/resources/lego/CPIAUCNS202100401.csv'

In [13]:

years = pd.date_range('1979-01-01', '2021-01-01', freq='Y') \
            + datetime.timedelta(days=1)
cpi_df = (pd.read_csv(CPI_URL, index_col="DATE", parse_dates=["DATE"])
            .loc[years])
cpi_df["to2021"] = cpi_df.loc["2021-01-01"] / cpi_df

In [14]:

full_df["RRP2021"] = (pd.merge(full_df, cpi_df,
                               left_on=["Year released"],
                               right_index=True)
                        .apply(lambda df: df["RRP$"] * df["to2021"],
                               axis=1))
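
The adjustment itself is just a ratio of price indices. Here is a minimal sketch, where the CPI values are hypothetical round numbers chosen for illustration (they are not the values in the CPI file above) and `to_2021_dollars` is a helper name of our own.

```python
# Hypothetical CPI index values, for illustration only.
cpi = {1980: 80.0, 2000: 170.0, 2021: 260.0}

def to_2021_dollars(price, year, cpi=cpi):
    """Scale a nominal price by the ratio of the 2021 CPI to that year's CPI."""
    return price * cpi[2021] / cpi[year]

print(to_2021_dollars(10.0, 1980))  # 32.5
```

A set released in 2021 is left unchanged by this scaling, which matches the equal `RRP$` and `RRP2021` columns for 2021 sets in the data frame below.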

Based on the exploratory data analysis in the first post in this series, we filter full_df down to approximately 6,400 sets to be included in our analysis.

In [15]:

FILTERS = [
    full_df["Set type"] == "Normal",
    full_df["Pieces"] > 10,
    full_df["Theme"] != "Duplo",
    full_df["Theme"] != "Service Packs",
    full_df["Theme"] != "Bulk Bricks",
    full_df["RRP2021"] > 0
]

In [16]:

df = full_df[reduce(np.logical_and, FILTERS)].copy()
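
The `reduce(np.logical_and, FILTERS)` idiom combines a list of boolean masks into a single mask that is true only where every filter passes. A small sketch with made-up arrays (the values are hypothetical):

```python
from functools import reduce

import numpy as np

pieces = np.array([5, 50, 500, 5000])
price = np.array([1.0, 0.0, 40.0, 400.0])

# Each entry is a boolean mask of the same length as the data.
filters = [pieces > 10, price > 0]

# Fold the masks together pairwise with elementwise "and".
keep = reduce(np.logical_and, filters)
print(keep)  # [False False  True  True]
```

Keeping the filters in a list makes it easy to add or remove a condition without rewriting a long chained expression.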

In [17]:

df.head()

Out[17]:

	Name	Set type	Theme	Year released	Pieces	Subtheme	RRP$	RRP2021
Set number
1041-2	Educational Duplo Building Set	Normal	Dacta	1980-01-01	68.0	NaN	36.50	122.721632
1075-1	LEGO People Supplementary Set	Normal	Dacta	1980-01-01	304.0	NaN	14.50	48.752429
5233-1	Bedroom	Normal	Homemaker	1980-01-01	26.0	NaN	4.50	15.130064
6305-1	Trees and Flowers	Normal	Town	1980-01-01	12.0	Accessories	3.75	12.608387
6306-1	Road Signs	Normal	Town	1980-01-01	12.0	Accessories	2.50	8.405591

In [18]:

df.tail()

Out[18]:

	Name	Set type	Theme	Year released	Pieces	Subtheme	RRP$	RRP2021
Set number
80025-1	Sandy's Power Loader Mech	Normal	Monkie Kid	2021-01-01	520.0	Season 2	54.99	54.99
80026-1	Pigsy's Noodle Tank	Normal	Monkie Kid	2021-01-01	662.0	Season 2	59.99	59.99
80028-1	The Bone Demon	Normal	Monkie Kid	2021-01-01	1375.0	Season 2	119.99	119.99
80106-1	Story of Nian	Normal	Seasonal	2021-01-01	1067.0	Chinese Traditional Festivals	79.99	79.99
80107-1	Spring Lantern Festival	Normal	Seasonal	2021-01-01	1793.0	Chinese Traditional Festivals	119.99	119.99

In [19]:

df.describe()

Out[19]:

	Pieces	RRP$	RRP2021
count	6423.000000	6423.000000	6423.000000
mean	345.121283	37.652064	46.267159
std	556.907975	50.917343	59.812083
min	11.000000	0.600000	0.971220
25%	69.000000	9.990000	11.896044
50%	181.000000	19.990000	27.420158
75%	404.000000	49.990000	56.497192
max	11695.000000	799.990000	897.373477

Modeling¶

We will reuse the final model from the previous post. It includes a time-varying component along with random slopes and intercepts that vary according to the theme and subtheme of the set in question.

First we create our feature vector (standardized log piece count) and our target vector (log RRP in 2021 dollars).

In [20]:

pieces = df["Pieces"].values
log_pieces = np.log(df["Pieces"].values)

scaler = StandardScaler().fit(log_pieces[:, np.newaxis])

def scale_log_pieces(log_pieces, scaler=scaler):
    return scaler.transform(log_pieces[:, np.newaxis])[:, 0]

std_log_pieces = scale_log_pieces(log_pieces)

In [21]:

rrp2021 = df["RRP2021"].values
log_rrp2021 = np.log(rrp2021)
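
For reference, the standardization performed by `StandardScaler` above amounts to subtracting the mean and dividing by the (population) standard deviation. A minimal NumPy sketch on a handful of made-up piece counts:

```python
import numpy as np

# Hypothetical piece counts, for illustration only.
pieces = np.array([12.0, 68.0, 304.0, 1793.0, 9090.0])
log_pieces = np.log(pieces)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default.
mean, std = log_pieces.mean(), log_pieces.std()
std_log_pieces = (log_pieces - mean) / std

# Inverting the transform recovers the original piece counts.
recovered = np.exp(std_log_pieces * std + mean)
```

Working with standardized log piece counts puts the feature on a unit scale, which makes the weakly informative normal priors used below easier to reason about.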

Next we build indicator variables for the themes and subthemes.

In [22]:

theme_id, theme_map = df["Theme"].factorize(sort=True)

theme_mean_std_log_pieces = (pd.Series(std_log_pieces, index=df.index)
                               .groupby(df["Theme"])
                               .mean())

In [23]:

sub_id, sub_map = df["Subtheme"].factorize(sort=True)
n_sub = sub_map.size

sub_isnull = sub_id == -1

sub_mean_std_log_pieces = (pd.Series(std_log_pieces, index=df.index)
                             .groupby(df["Subtheme"])
                             .mean())
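
The role of `sub_isnull` is easier to see on a tiny example. In this sketch (with a made-up series), `factorize` assigns an integer code to each distinct value in sorted order and uses `-1` as the code for missing values.

```python
import pandas as pd

# Hypothetical subtheme column with one missing entry.
subtheme = pd.Series(["Season 2", None, "Accessories", "Season 2"])

codes, uniques = subtheme.factorize(sort=True)
print(codes)          # [ 1 -1  0  1]
print(list(uniques))  # ['Accessories', 'Season 2']

# Missing subthemes are exactly the rows coded as -1.
is_null = codes == -1
```

Sets without a subtheme therefore need separate handling in the model, since `-1` is not a meaningful index into the vector of subtheme effects.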

Finally, we transform the year each set was released into the number of years since 1980, when our data set begins.

In [24]:

year = (df["Year released"]
          .dt.year
          .values)
t = year - year.min()

We now recreate the model as in the previous post.

In [25]:

def hierarchical_normal(name, *, dims, μ=None):
    if μ is None:
        μ = pm.Normal(f"μ_{name}", 0., 2.5)

    Δ = pm.Normal(f"Δ_{name}", 0., 1., dims=dims)
    σ = pm.HalfNormal(f"σ_{name}", 2.5)
    
    return pm.Deterministic(name, μ + Δ * σ, dims=dims)

def hierarchical_normal_with_mean(name, x_mean, *, dims,
                                  μ=None, mean_name="γ"):
    mean_coef = pm.Normal(f"{mean_name}_{name}", 0., 2.5)
        
    return pm.Deterministic(
        name,
        hierarchical_normal(f"_{name}", dims=dims, μ=μ) \
            + mean_coef * x_mean,
        dims=dims
    )
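
The construction `μ + Δ * σ` in `hierarchical_normal` is commonly called a non-centered parameterization: instead of sampling the effects directly from a normal distribution with unknown scale, we sample standard normal offsets and rescale them. The following NumPy sketch (with fixed, made-up values of `mu` and `sigma`) checks that the two are equivalent in distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.3  # hypothetical values, fixed for illustration

# Standard normal offsets, shifted and rescaled.
delta = rng.normal(0.0, 1.0, size=100_000)
x = mu + delta * sigma

print(x.mean(), x.std())  # close to 1.5 and 0.3
```

In hierarchical models this form often samples more efficiently with NUTS, because the offsets and the scale are a priori independent.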

In [26]:

def gaussian_random_walk(name, *, dims, innov_scale=1.):
    Δ = pm.Normal(f"Δ_{name}", 0., innov_scale,  dims=dims)

    return pm.Deterministic(name, Δ.cumsum(), dims=dims)
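
Outside of a PyMC model, the same random walk construction can be sketched in NumPy: draw independent normal innovations and take their cumulative sum. The length of 42 below is an assumption corresponding to one step per year from 1980 through 2021.

```python
import numpy as np

rng = np.random.default_rng(1)

# One innovation per year, with the same scale as innov_scale=0.1.
innov = rng.normal(0.0, 0.1, size=42)

# The walk at each year is the running total of the innovations.
walk = innov.cumsum()
```

Each step of `walk` differs from the previous one by exactly one innovation, so a small innovation scale yields a slowly varying trend over time.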

In [27]:

coords = {
    "set": df.index.values,
    "sub": sub_map,
    "theme": theme_map,
    "year": np.unique(year)
}

First we specify the components of the intercept.

In [28]:

with pm.Model(coords=coords) as model:
    β0_0 = pm.Normal("β0_0", 0., 2.5)
    β0_t = gaussian_random_walk("β0_t", dims="year",
                                innov_scale=0.1)
    
    β0_theme = hierarchical_normal_with_mean(
        "β0_theme",
        theme_mean_std_log_pieces.values,
        dims="theme", μ=0.
    )
    
    β0_sub_null = pm.Normal("β0_sub_null", 0., 2.5)
    β0_sub_nn = hierarchical_normal_with_mean(
        "β0_sub_nn",
        sub_mean_std_log_pieces.values,
        dims="sub", μ=0., mean_name="λ"
    )
    β0_sub_i = at.switch(
        sub_isnull,
        β0_sub_null,
        β0_sub_nn[at.clip(sub_id, 0, n_sub)]
    )
    
    β0_i = β0_0 + β0_t[t] + β0_theme[theme_id] + β0_sub_i
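
The `at.switch` / `at.clip` combination above selects a shared effect for sets with no subtheme and a subtheme-specific effect otherwise. A NumPy analogue with made-up numbers may make the indexing clearer (here we clip to `n_sub - 1`, the largest valid index; all values are hypothetical).

```python
import numpy as np

sub_id = np.array([0, -1, 2, 1, -1])    # -1 marks a missing subtheme
effect_nn = np.array([0.2, -0.1, 0.4])  # one effect per known subtheme
effect_null = -0.3                      # shared effect for missing subthemes

n_sub = effect_nn.size

# Both branches of the switch are evaluated for every row, so we clip
# the codes to avoid looking anything up with the -1 sentinel.
safe_id = np.clip(sub_id, 0, n_sub - 1)
effect_i = np.where(sub_id == -1, effect_null, effect_nn[safe_id])
print(effect_i)  # [ 0.2 -0.3  0.4 -0.1 -0.3]
```

The value looked up for a clipped index is discarded by the switch, so clipping only serves to keep the lookup well defined.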

We specify the components of the slope for the (standardized log) number of pieces in the set analogously.

In [29]:

with model:
    β_pieces_0 = pm.Normal("β_pieces_0", 0., 2.5)
    
    β_pieces_theme = hierarchical_normal_with_mean(
        "β_pieces_theme",
        theme_mean_std_log_pieces.values,
        dims="theme", μ=0.
    )
    
    β_pieces_sub_null = pm.Normal("β_pieces_sub_null", 0., 2.5)
    β_pieces_sub_nn = hierarchical_normal_with_mean(
        "β_pieces_sub_nn",
        sub_mean_std_log_pieces.values,
        dims="sub", μ=0., mean_name="λ"
    )
    β_pieces_sub_i = at.switch(
        sub_isnull,
        β_pieces_sub_null,
        β_pieces_sub_nn[at.clip(sub_id, 0, n_sub)]
    )

    β_pieces_i = β_pieces_0 + β_pieces_theme[theme_id] + β_pieces_sub_i

Finally we specify the observational likelihood.

In [30]:

with model:
    σ = pm.HalfNormal("σ", 5.)
    μ = β0_i + β_pieces_i * std_log_pieces - 0.5 * σ**2
    
    obs = pm.Normal("obs", μ, σ, dims="set",
                    observed=log_rrp2021)
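
One way to read the `- 0.5 * σ**2` term in `μ` (this is our interpretation, based on the standard formula for the mean of a log-normal distribution): if $\log y \sim N(m - \sigma^2 / 2, \sigma^2)$, then $E[y] = \exp(m)$, so the linear predictor describes the log of the expected price rather than the log of the median price. A quick simulation with made-up values illustrates this.

```python
import numpy as np

rng = np.random.default_rng(2)
m, sigma = np.log(100.0), 0.5  # hypothetical values

# Log price centered at m - sigma^2 / 2, as in the likelihood above.
log_price = rng.normal(m - 0.5 * sigma**2, sigma, size=200_000)
price = np.exp(log_price)

print(price.mean())      # close to exp(m) = 100
print(np.median(price))  # close to exp(m - sigma^2 / 2), about 88.25
```

Without the correction the mean price on the dollar scale would be inflated by a factor of $\exp(\sigma^2 / 2)$ relative to $\exp(m)$.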

We are now ready to sample from the posterior distribution of this model.

In [31]:

CHAINS = 3
SEED = 123456789

SAMPLE_KWARGS = {
    'cores': CHAINS,
    'random_seed': [SEED + i for i in range(CHAINS)],
    'return_inferencedata': True
}

In [32]:

with model:
    trace = pm.sample(**SAMPLE_KWARGS)
    pp_trace = pm.to_inference_data(
        posterior_predictive=pm.sample_posterior_predictive(trace)
    )

trace.extend(pp_trace)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [β0_0, Δ_β0_t, γ_β0_theme, Δ__β0_theme, σ__β0_theme, β0_sub_null, λ_β0_sub_nn, Δ__β0_sub_nn, σ__β0_sub_nn, β_pieces_0, γ_β_pieces_theme, Δ__β_pieces_theme, σ__β_pieces_theme, β_pieces_sub_null, λ_β_pieces_sub_nn, Δ__β_pieces_sub_nn, σ__β_pieces_sub_nn, σ]

100.00% [6000/6000 17:15

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 1036 seconds.
The number of effective samples is smaller than 10% for some parameters.

100.00% [3000/3000 00:02

Standard sampling diagnostics show no cause for concern.

In [33]:

az.plot_energy(trace);

In [34]:

print(az.rhat(trace)
        .max()
        .to_array()
        .max())

<xarray.DataArray ()>
array(1.01564337)

Analysis¶

Now that we have sampled from the posterior distribution of the model, we are finally prepared to analyze and interpret the pricing of the sets with the largest number of pieces. First we select the twenty sets with the largest piece count for closer inspection.

In [35]:

N_TOP = 20

top_by_pieces = df.nlargest(N_TOP, "Pieces")

We examine the posterior predictive distribution of each of these sets' prices compared to its actual recommended retail price below.

In [36]:

def make_set_label(row):
    name = row["Name"]
    set_number, _ = row.name.split('-', 1)
    pieces = int(row["Pieces"])
    
    return f"{name} ({set_number})\n{pieces:,} pieces"

In [37]:

fig, ax = plt.subplots(figsize=(FIGSIZE[0], 2 * FIGSIZE[1]))

az.plot_forest(
    trace.posterior_predictive,
    var_names=["obs"],
    coords={"set": top_by_pieces.index},
    transform=np.exp,
    ax=ax
);
ax.scatter(
    top_by_pieces["RRP2021"][::-1],
    ax.get_yticks(),
    c='k', label="Actual", zorder=5
);

ax.set_yticklabels(
    top_by_pieces.apply(make_set_label, axis=1)[::-1]
);

ax.xaxis.set_major_formatter(dollar_formatter);
ax.set_xlabel("Posterior predicted RRP (2021 $)");

ax.legend();
ax.set_title("Sets with the most pieces");

The first time I drew this plot I was a bit surprised at how well the model predicts the prices for these large sets. I expected to see larger deviations for sets with an extreme number of pieces (more than about 3,000), so this is a pleasant surprise. It is worth noting that while the model's predictions are accurate for most of these large sets, there are a few sets where the actual price is further from the posterior expected value. These sets are worth focusing on.

In [38]:

FOCUS_SETS = [
    "31203-1",
    "10294-1",
    "71043-1",
    "2000409-1",
    "75252-1",
    "10214-1"
]

In [39]:

fig, ax = plt.subplots()

az.plot_forest(
    trace.posterior_predictive,
    var_names=["obs"],
    coords={"set": FOCUS_SETS},
    transform=np.exp,
    ax=ax
);
ax.scatter(
    df["RRP2021"]
      .loc[FOCUS_SETS]
      [::-1],
    ax.get_yticks(),
    c='k', label="Actual", zorder=5
);

ax.set_yticklabels(
    df.loc[FOCUS_SETS]
      .apply(make_set_label, axis=1)
      [::-1]
);

ax.xaxis.set_major_formatter(dollar_formatter);
ax.set_xlabel("Posterior predicted RRP (2021 $)");

ax.legend();
ax.set_title(None);

Sets that are priced significantly lower than their posterior expected value tend to have lots of similar small and/or standard pieces.

The World Map (31203) contains thousands of small dots.

The Window Exploration Bag (2000409) contains thousands of fairly standard bricks. (I had never heard of Lego's "Serious Play" method before, and it is interesting and may merit a post of its own.)

Tower Bridge (10214) conforms to this explanation slightly less well, but there are still large repetitive portions (turrets, bridge suspension cables) that require many fairly standard pieces.

In light of these similarities for large sets that are underpriced relative to our predictions, it makes sense to consider including information about the pieces included in the set (and the quantity of each piece that is included) in a future model. I have not scraped this data about the piece composition of each set yet, but it is available from Brickset and probably also from BrickLink.

Sets that are priced significantly higher than their posterior expected value include the Titanic (10294), Hogwarts Castle (71043), and the Ultimate Collector Series Imperial Star Destroyer (75252). Each of these sets is iconic either on its own (the Titanic) or within the associated media franchises (Harry Potter and Star Wars). It seems reasonable that Lego may be able to charge a premium on such sets. Note that this interpretation does not hold true for all sets, as the Ultimate Collector Series Millennium Falcon (75192) is perhaps even more iconic than the Imperial Star Destroyer. In this instance, Lego may believe that the recommended retail price of $799.99 is near the maximum that the market will reasonably bear for a Lego set.

We can examine the posterior predictive distribution of some individual sets in more detail.

In [40]:

def format_posterior_artist(artist, formatter):
    text = artist.get_text()
    x, _ = artist.get_position()

    if text.startswith(" ") or text.endswith(" "):
        fmtd_text = formatter(x)
        artist.set_text(
            " " + fmtd_text if text.startswith(" ") else fmtd_text + " "
        )
    elif "=" in text:
        before, _ = text.split("=")
        artist.set_text("=".join((before, formatter(x))))
    elif "<" in text:
        below, ref_val_str, above = text.split("<")
        artist.set_text("<".join((
            below,
            " " + formatter(float(ref_val_str)) + " ",
            above
        )))

def format_posterior_text(formatter, ax=None):
    if ax is None:
        ax = plt.gca()
    
    artists = [artist for artist in ax.get_children() if isinstance(artist, plt.Text)]
    
    for artist in artists:
        format_posterior_artist(artist, formatter)

In [41]:

def plot_set_posterior(set_number, df, ax=None):
    if ax is None:
        fig, ax = plt.subplots()
    
    az.plot_posterior(
        trace, var_names=["obs"],
        group="posterior_predictive",
        coords={"set": f"{set_number}-1"},
        transform=np.exp,
        ref_val=df["RRP2021"].loc[f"{set_number}-1"],
        ax=ax
    )

    format_posterior_text(dollar_formatter, ax=ax)

    ax.set_xlabel("Posterior predicted RRP (2021 $)")
    ax.set_title(f"{df['Name'].loc[set_number + '-1']} ({set_number})")
    
    return ax

As seen above, the Titanic (10294) is a bit overpriced according to our model, but not unreasonably so.

In [42]:

TITANIC_SET_NUMBER = "10294"
plot_set_posterior(TITANIC_SET_NUMBER, df);

We see that the Millennium Falcon (75192) is priced right about in line with the model's expectations.

In [43]:

MF_SET_NUMBER = "75192"
plot_set_posterior(MF_SET_NUMBER, df);

I hope this analysis has been as interesting to you as it has to me. I am pleasantly surprised at how well the model predicts set prices even when the number of pieces is extremely large. While this analysis has not sold me on the Titanic, it adds to my desire to get the Ultimate Collector Series Millennium Falcon and Imperial Star Destroyer some day. (Unfortunately?) I have to finish my half-built Super Star Destroyer (10221) first.

This post is available as a Jupyter notebook here.

In [44]:

%load_ext watermark
%watermark -n -u -v -iv

Last updated: Tue Oct 12 2021

Python implementation: CPython
Python version       : 3.7.10
IPython version      : 7.25.0

arviz     : 0.11.2
seaborn   : 0.11.1
matplotlib: 3.4.2
aesara    : 2.1.3
xarray    : 0.19.0
pymc3     : 4.0
numpy     : 1.19.5
pandas    : 1.3.0

Playing Battleship with Bayesian Search Theory, Thompson Sampling, and Approximate Bayesian Computation

Austin Rochford — Thu, 02 Sep 2021 04:00:00 GMT

As a child, I spent many hours playing Battleship against my Mom or brother to pass the time.

Battleship is a classic game of incomplete information that involves placing ships of different lengths on a square grid ($10 \times 10$ in the most popular version) and trying to guess the location of all of the opponent’s ships (sinking them) before the opponent does the same.

Many years after I stopped playing Battleship with my family, I developed an interest in Bayesian search theory after learning about a firm that specializes in search consulting for the US government at a jobs-in-industry panel in graduate school. I had already happily lined up a job in e-commerce optimization, but the idea of using statistics to guide potentially expensive searches piqued my intellectual curiosity. I e-mailed the presenter afterward for more information and he recommended [Theory of Optimal Search]^(Stone, Lawrence D. Theory of optimal search. Elsevier, 1976.) as the essential reference in the field. The book is interesting if a bit dry (it is about mathematical optimization, after all), and I was always curious about the connection between the search theory it develops and Battleship, which has always stood out in my mind as a basic search problem. Though most of the theory in this book is not directly applicable to Battleship for various reasons, a continuing theme is allocating search in areas with the largest posterior probability of containing the target object given the search results so far. A 1985 result of [Assaf and Zamir]^(Assaf, David, and Shmuel Zamir. “Optimal sequential search: a Bayesian approach.” The Annals of Statistics 13, no. 3 (1985): 1213-1221.) shows the optimality of this strategy in a situation much closer to that of Battleship.

Inspired by these forays into search theory I have thought idly for many years about constructing a near-optimal Bayesian approach to playing Battleship. This post is the culmination of these thoughts, showing how to use ideas from Bayesian search theory, approximate Bayesian computation (ABC), and Thompson sampling to construct an easily tractable, near optimal strategy for Battleship that only requires simulating random Battleship boards and not reasoning through hard-coded special cases to achieve strong play.

Playing Battleship

First we make the necessary Python imports and do a bit of housekeeping.

%matplotlib inline

from abc import ABC, abstractmethod
import datetime
from IPython.display import HTML
from itertools import product
import multiprocessing as mp
from tqdm import tqdm, trange

from empiricaldist import Pmf
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.animation import FuncAnimation
from matplotlib.colors import LinearSegmentedColormap
from matplotlib.ticker import IndexLocator, StrMethodFormatter
import numpy as np
from scipy import linalg, stats
import seaborn as sns

FIG_WIDTH = 8
FIG_HEIGHT = 6
mpl.rcParams['figure.figsize'] = (FIG_WIDTH, FIG_HEIGHT)

CMAP = LinearSegmentedColormap.from_list(
    'battleship',
    [
        (0.0, 'blue'),
        (0.2, 'orange'),
        (1., 'red')
    ]
)
CMAP.set_bad('w')

sns.set(color_codes=True)

pct_formatter = StrMethodFormatter('{x:.1%}')

SEED = 123456789 # for reproducibility

rng = np.random.default_rng(SEED)

We represent a Battleship board as a sequence of rectangular grids, one per ship. For standard Battleship, that means a game consists of five $10 \times 10$ grids, the first grid containing a ship of length five, the second containing a ship of length four, and so on.

The following constants are useful for defining a standard game of Battleship. (Note that we are using Hasbro’s names for the ships, revised in 2002.)

GRID_LENGTH = 10
SHIP_NAMES = [
    "Carrier",
    "Battleship",
    "Destroyer",
    "Submarine",
    "Patrol boat"
]
SHIP_SIZES = np.array([
    5, 
    4,
    3,
    3,
    2
])

n_ships = SHIP_SIZES.size

The following plots show how a Battleship board is represented as a sequence of five grids, one per ship.

ships = np.zeros((n_ships, GRID_LENGTH, GRID_LENGTH))
ships[0, 0, :5] = 1
ships[1, 4:8, 7] = 1
ships[2, 9, 3:6] = 1
ships[3, 3:6, 1] = 1
ships[4, 5, 8:10] = 1

def plot_board(board, cmap=CMAP,
               cbar=False, cbar_kwargs=None,
               ax=None, **heatmap_kwargs):
    if ax is None:
        _, ax = plt.subplots(figsize=(FIG_WIDTH, FIG_WIDTH))
        
    heatmap_kwargs.setdefault('vmin', 0)
    heatmap_kwargs.setdefault('vmax', 1)
    
    board_ = np.atleast_2d(board)
    mask = board_.mask if np.ma.is_masked(board_) else None
    
    if cbar_kwargs is None:
        cbar_kwargs = {}
        
    cbar_kwargs.setdefault('format', pct_formatter)
    cbar_kwargs.setdefault('label', "Posterior probability of a hit")
    
    sns.heatmap(board_, mask=mask, cmap=cmap,
                cbar=cbar, square=True,
                linewidths=1.5, linecolor='k',
                cbar_kws=cbar_kwargs,
                ax=ax, **heatmap_kwargs)
    
    if board_.shape[0] == 1:
        ax.set_yticklabels([])
        
    if cbar:
        cbar = ax.figure.axes[-1]
        cbar.set_ylabel(cbar.get_ylabel(), rotation=270)
        
    return ax

fig, axes = plt.subplots(ncols=n_ships, sharex=True, sharey=True,
                         figsize=(n_ships * 2.5, 2.5))

for (ship, name, ax) in zip(ships, SHIP_NAMES, axes):
    plot_board(ship, ax=ax);
    
    ax.set_xticklabels([]);
    ax.set_yticklabels([]);
    ax.set_title(name);
    
fig.tight_layout();

Adding these ship grids gives us the two-dimensional board familiar to anyone who has played Battleship.

def to_board(ships, ship_axis=0):
    return ships.sum(axis=ship_axis)

plot_board(to_board(ships));

The following class implements most of the basics necessary to track the state of a game of Battleship, make guesses, and reveal the contents of guessed spaces.

class Battleship:
    def __init__(self, ships):
        self._ships = ships
        self._turn_revealed = []
        
    @property
    def _board(self):
        return to_board(self._ships)
        
    @property
    def grid_length(self):
        return self._board.shape[0]
    
    @property
    def is_solved(self):
        return self.revealed.sum() == self._board.sum()
    
    @property
    def revealed(self):
        return to_board(self._revealed_ships)
        
    @property
    def _revealed_ships(self):
        if self.turns > 0:
            return self._turn_revealed[-1]
        else:
            return np.ma.masked_all_like(self._ships)
    
    @property
    def ship_sizes(self):
        return self._ships.sum(axis=(1, 2))
    
    @property
    def sunk(self):
        ship_sizes = self._ships.sum(axis=(1, 2))
        revealed_sizes = (self._revealed_ships
                              .sum(axis=(1, 2))
                              .filled(0))
        
        return ship_sizes == revealed_sizes
    
    @property
    def turn_revealed(self):
        return [np.ma.masked_all_like(self._board)] \
                + [to_board(revealed) for revealed in self._turn_revealed]
    
    @property
    def turns(self):
        return len(self._turn_revealed)

These bookkeeping methods are fairly self-explanatory. An instance of Battleship tracks which squares have been revealed to the opponent using numpy’s masked arrays. The revealed grid is masked wherever the opponent has not yet guessed.
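For readers who have not used masked arrays before, here is a minimal sketch of the bookkeeping described above. The tiny `board` and the variable names are made up for illustration and are not part of the `Battleship` class.

```python
import numpy as np

# A hypothetical 2x2 board with a ship in the right-hand column
board = np.array([[0., 1.],
                  [0., 1.]])

# Nothing has been guessed yet, so every cell starts out masked
revealed = np.ma.masked_all_like(board)

# Guessing cell (0, 1) copies the true value in, which unmasks that cell
revealed[0, 1] = board[0, 1]

n_hits = revealed.sum()        # reductions skip masked (unguessed) cells
still_unknown = revealed.mask  # True wherever no guess has been made
```

Because reductions such as `sum` ignore masked cells, comparing the revealed total with the board total is enough to tell when every ship cell has been found.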

The heart of our Battleship class is the guess method, which takes the coordinates to guess, and returns a tuple containing whether or not the guess resulted in a hit or miss, and the index of the ship sunk as a result of that guess (None if no ship was sunk). (Here we use a dirty trick to add a method to the Battleship class after it has already been defined for expository clarity. Don’t do this in practice!)

class Battleship(Battleship):
    def guess(self, i, j):
        if not self.revealed.mask[i, j]:
            raise ValueError(f"{i}, {j} already guessed")
        else:
            prev_sunk = self.sunk

            next_ships = self._revealed_ships.copy()
            next_ships[:, i, j] = self._ships[:, i, j]
            self._turn_revealed.append(next_ships)
            
            curr_sunk = self.sunk
            
            if (curr_sunk == prev_sunk).all():
                sunk = None
            else:
                sunk = (curr_sunk & ~prev_sunk).argmax()
            
            return self._board[i, j], sunk

In addition to this class representing game state and history, we define an abstract class Strategy that generates the guesses necessary to play a game.

class Strategy(ABC):
    @abstractmethod
    def next_guess(self, revealed):
        pass

    def reveal(self, i, j, hit_or_miss, sunk):
        pass

The next_guess method takes a masked array of the board as revealed so far and should return the coordinates of the next spot to guess. The reveal method takes guessed coordinates, an indicator of whether the guess resulted in a hit or miss, and the index of the ship sunk as a result of that guess (or None, as with guess above). This method allows a Strategy to update internal state based on the result of guesses. While some simple strategies can infer the next guess purely based on the state of the revealed board, we will see that it is useful when taking a Bayesian approach to maintain state inside a strategy.
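As a sketch of how small a concrete strategy can be, the following hypothetical `FirstUnknownStrategy` (not used elsewhere in this post) guesses the first unrevealed cell in row-major order. The `Strategy` base class is restated so the example runs on its own.

```python
from abc import ABC, abstractmethod

import numpy as np


class Strategy(ABC):
    # Restated from above so this sketch is self-contained
    @abstractmethod
    def next_guess(self, revealed):
        pass

    def reveal(self, i, j, hit_or_miss, sunk):
        pass


class FirstUnknownStrategy(Strategy):
    """Hypothetical baseline: guess the first unrevealed cell, row by row."""

    def next_guess(self, revealed):
        unknown = np.ma.getmaskarray(revealed)
        # argmax on a boolean array gives the flat index of the first True
        flat_index = unknown.argmax()
        i, j = np.unravel_index(flat_index, unknown.shape)

        return int(i), int(j)


revealed = np.ma.masked_all((2, 3))
revealed[0, 0] = 0  # pretend the first cell was guessed and was a miss
first_guess = FirstUnknownStrategy().next_guess(revealed)
```

This strategy needs no internal state, so it leaves `reveal` as the default no-op.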

The play function takes a configuration of ships and a Strategy and plays the corresponding game of Battleship.

def play(ships, strategy, progress_bar=False):
    game = Battleship(ships)
    
    if progress_bar:
        pbar = tqdm(total=int(ships.sum()))
    
    while not game.is_solved:
        i, j = strategy.next_guess(game.revealed)
        hit_or_miss, sunk = game.guess(i, j)
        strategy.reveal(i, j, hit_or_miss, sunk)
        
        if progress_bar and hit_or_miss == 1:
            pbar.update()
            
    if progress_bar:
        pbar.close()
        
    return game

The rest of this post progresses from strategies for simplified versions of Battleship to finally showing how to use ABC and Thompson sampling to achieve near optimal play.

Manual guessing

This framework allows us to manually play Battleship by inputting user guesses and watching how the board evolves.

class ManualStrategy(Strategy):
    def next_guess(self, revealed):
        ax = plot_board(revealed)
        ax.set_title("Currently revealed")
        plt.show()

        i = int(input("Enter row to guess: "))
        j = int(input("Enter column to guess: "))
        
        return i, j

try:
    play(ships, ManualStrategy())
except KeyboardInterrupt:
    print("Game ended")

Enter row to guess:  5
Enter column to guess:  4

Enter row to guess:  0
Enter column to guess:  3

Enter row to guess:  1
Enter column to guess:  3

Enter row to guess:  0
Enter column to guess:  2

Enter row to guess:  0
Enter column to guess:  4

Game ended

This strategy is not really useful for our purposes, but it provides a good illustration of how the game is usually played. The white squares here indicate the cells whose contents are unknown because we have not yet guessed them.

Row with one ship

We begin by simplifying the idea of Battleship to the most basic possible setting of a single row containing a single ship.

The following function generates all possible rows of a given length containing a ship of a given size.

def get_all_ship_rows(grid_length, ship_size):
    i, j = np.indices((grid_length, grid_length))[:, :-ship_size + 1]
    
    return 1 * (i <= j) & (j < i + ship_size)

For a row of length three and a ship of size two (3/2), there are only two such rows.

all_rows_3_2 = get_all_ship_rows(3, 2)

ax = plot_board(all_rows_3_2)
ax.set_title("All rows 3/2");

For a row of length ten and a ship of size four (10/4), there are seven such rows.

all_rows_10_4 = get_all_ship_rows(10, 4)

ax = plot_board(all_rows_10_4)
ax.set_title("All rows 10/4");

We will compare our Bayesian strategies to several benchmark strategies, starting with the simplest possible strategy: random guessing.

Random guessing

The random guessing approach to Battleship is equivalent to the following urn problem from classic probability theory. Imagine an urn containing 100 balls (corresponding to cells on the board), 17 of which are red (corresponding to cells covered by ships), and 83 of which are blue. The number of turns required to solve Battleship through random guessing has the same distribution as the number of balls of any color that are drawn before all 17 red balls have been drawn. It is well known (in the right circles, at least) that this corresponds to the negative hypergeometric distribution.

Original image credit John D. Cook

The following function returns the probability mass function (pmf) of the appropriate negative hypergeometric distribution given a grid and ship size.

def get_random_guess_dist(grid_size, ship_size, n_hit=None):
    if n_hit is None:
        n_hit = ship_size

    support = np.arange(ship_size, grid_size + 1)
    nhg = stats.nhypergeom(grid_size,
                           grid_size - ship_size,
                           n_hit)
    
    return Pmf(nhg.pmf(support - ship_size), support)

We see, unsurprisingly, that random guessing will usually take near the maximum number of turns to solve the game.

std_random_pmf = get_random_guess_dist(GRID_LENGTH**2, SHIP_SIZES.sum())

def plot_turn_dist(pmf, kind='bar', mean=False, ax=None, mean_kwargs=None, **kwargs):
    if ax is None:
        _, ax = plt.subplots()
        
    if kind == 'bar':
        # pandas barplot uses odd indexing, making it hard to
        # mix bar and line plots if we use pandas's versions
        kwargs.setdefault('width', 1)
        kwargs.setdefault('alpha', 0.75)

        ax.bar(pmf.index, pmf, **kwargs)
    elif kind == 'line':
        pmf.plot(ax=ax, **kwargs)
    else:
        raise ValueError("kind must be one of 'bar' or 'line'")    
        
    if mean:
        if mean_kwargs is None:
            mean_kwargs = {}
            
        mean_kwargs.setdefault('ls', '--')
        
        ax.axvline(pmf.mean(), **mean_kwargs);
    
    ax.xaxis.grid(False)
    ax.set_xlabel("Turns")
    
    return ax

def make_pct_yaxis(ax):
    ax.yaxis.set_major_formatter(pct_formatter)
    ax.set_ylabel("Probability")
    
    return ax

std_turn_ax = plot_turn_dist(std_random_pmf, kind='line', mean=True)
    
make_pct_yaxis(std_turn_ax);
std_turn_ax.set_title("Random guessing\nStandard Battleship");

The dashed vertical line shows the expected number of turns to solve Battleship by randomly guessing,

std_random_pmf.mean()

95.38888888888883
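This value can be sanity-checked without scipy. The turn on which the last of $r$ ship cells is found in a uniformly random ordering of $N$ cells has expectation $r (N + 1) / (r + 1)$. The sketch below compares that closed form with a quick simulation; the simulation size and seed are arbitrary choices, and by symmetry cells 0 through 16 stand in for the ship cells.

```python
import numpy as np

N_CELLS = 100      # cells on a standard board
N_SHIP_CELLS = 17  # cells covered by ships (5 + 4 + 3 + 3 + 2)

# Expected position of the last of r marked items among N shuffled items
closed_form = N_SHIP_CELLS * (N_CELLS + 1) / (N_SHIP_CELLS + 1)

sim_rng = np.random.default_rng(0)
n_sims = 20_000

# argsort of i.i.d. uniforms gives an independent random guessing order per row
guess_order = sim_rng.random((n_sims, N_CELLS)).argsort(axis=1)
is_ship = guess_order < N_SHIP_CELLS

# Scanning each game backwards, the first ship cell seen is the last one guessed
last_hit_turn = N_CELLS - is_ship[:, ::-1].argmax(axis=1)
simulated_mean = last_hit_turn.mean()
```

Both `closed_form` and `simulated_mean` should land close to the mean of the pmf above.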

Returning to the case of a row with a single ship, we visualize the turn distribution of random guessing strategies for various row lengths and a ship of size two.

RANDOM_ROW_LENGTHS = [4, 5, 8, 10]

fig, axes = plt.subplots(nrows=len(RANDOM_ROW_LENGTHS) // 2, ncols=2,
                         sharey=True,
                         figsize=(FIG_WIDTH, 1.25 * FIG_HEIGHT))

for (row_length, ax) in zip(RANDOM_ROW_LENGTHS, axes.flat):
    pmf = get_random_guess_dist(row_length, 2)
    plot_turn_dist(pmf, ax=ax);
    
    make_pct_yaxis(ax);
    ax.set_title(f"{row_length}/2");

fig.suptitle("Random guessing");
fig.tight_layout();

We do the same for a ship of size three.

fig, axes = plt.subplots(nrows=len(RANDOM_ROW_LENGTHS) // 2, ncols=2,
                         sharey=True,
                         figsize=(FIG_WIDTH, 1.25 * FIG_HEIGHT))

for (row_length, ax) in zip(RANDOM_ROW_LENGTHS, axes.flat):
    pmf = get_random_guess_dist(row_length, 3)
    plot_turn_dist(pmf, ax=ax);
    
    make_pct_yaxis(ax);
    ax.set_title(f"{row_length}/3");

fig.suptitle("Random guessing");
fig.tight_layout();

As with the case of standard Battleship, the most likely outcome is to use all the turns, and the expected value will be slightly below that.

Optimal search

In the simple setting of a row with a single ship, it is not too hard to work out the optimal search strategy, which consists of two phases. First we must locate the ship by getting at least one hit. Once the ship has been located, we must sink it by attacking spaces next to known hits until all of the ship has been found.

After at least one hit

We first implement optimal search after at least one hit has occurred. In this case, we guess the spot immediately to the left of the leftmost hit, unless that hit is in the first cell, or the cell to its left has already been shown to contain water. In those two cases, we guess the cell to the right of the rightmost hit until all cells containing the ship have been hit. The following function implements this strategy.

def next_guess_1d_with_hit(row):
    first = row.argmax()

    # found the left edge, fill out the ship to the right
    if first == 0 or row[first - 1] == 0:
        return first + row[first:].mask.argmax()
    # find the left edge
    else:
        return first - 1

We can show how this strategy works through a simple animated example. Suppose the position of a length three ship on a length ten row is as shown below.

board = np.zeros(10)
board[6:9] = 1

plot_board(board);

Also suppose that after three guesses, the following spots have been revealed.

revealed = np.ma.masked_all_like(board)
revealed[[1, 4, 7]] = board[[1, 4, 7]]

plot_board(revealed);

The above strategy will take three guesses to sink the ship, as shown in the animation below.

turn_revealed = [revealed]

# apply the strategy three times, keeping a snapshot of the row after each guess
for _ in range(3):
    next_j = next_guess_1d_with_hit(revealed)
    revealed = revealed.copy()
    revealed[next_j] = board[next_j]
    turn_revealed.append(revealed)

def animate_boards(boards,
                   cmap=CMAP, cbar=False, ax=None,
                   heatmap_kwargs=None, **ani_kwargs):
    if ax is None:
        fig, ax = plt.subplots(figsize=(FIG_WIDTH, FIG_WIDTH))
    else:
        fig = ax.figure
        
    if heatmap_kwargs is None:
        heatmap_kwargs = {}
        
    plot_board(boards[0],
               cmap=cmap, cbar=cbar, ax=ax,
               **heatmap_kwargs)
    
    quadmesh, *_ = ax.get_children()
    
    def ani_func(i):
        quadmesh.set_array(boards[i])
        
        return quadmesh,
    
    ani_kwargs.setdefault('blit', True)
    ani_kwargs.setdefault('frames', len(boards))
    
    return FuncAnimation(fig, ani_func, **ani_kwargs)

%%capture
ani = animate_boards(turn_revealed, interval=300)

HTML(ani.to_html5_video())

Getting the first hit

Immediately it is clear that we only have to search one of the congruence classes of the cell index mod the ship size to guarantee a hit. In the following plot of a row of length ten, each congruence class mod three is colored differently.

plot_board(np.arange(10) % 3, cmap='plasma', vmin=0, vmax=2);

A moment’s consideration shows that the congruence class of -1 modulo the ship size will always have the fewest elements of any of the congruence classes (possibly tied with others), so we choose to search along these cells. The following plot shows the cells to be searched until a hit is found in red.

def get_search_cells_1d(grid_length, ship_size):
    i = np.arange(grid_length)
    
    return i[i % ship_size == -1 % ship_size]

row = np.zeros(10)
row[get_search_cells_1d(10, 3)] = 1

plot_board(row);
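The claim that the congruence class of -1 is never larger than any other class can be checked by brute force. The helper `class_sizes` below is a hypothetical name introduced only for this check, and the ranges of row lengths and ship sizes are arbitrary.

```python
import numpy as np


def class_sizes(grid_length, ship_size):
    # Number of cells in each congruence class 0, 1, ..., ship_size - 1
    return np.bincount(np.arange(grid_length) % ship_size,
                       minlength=ship_size)


sizes_10_3 = class_sizes(10, 3)

# Check that the class of -1 is (tied for) smallest across many configurations
smallest_everywhere = all(
    class_sizes(n, k)[-1 % k] == class_sizes(n, k).min()
    for k in range(1, 6)
    for n in range(k, 21)
)
```

For the 10/3 configuration the classes 0, 1, and 2 contain four, three, and three cells respectively, so searching the class of -1 (that is, 2) costs at most three guesses.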

The following subclass of Strategy implements the optimal strategy of searching along this grid until a hit is found, then greedily filling out the ship.

class Single1DOptimalStrategy(Strategy):
    def __init__(self, grid_length, ship_size):
        self._grid_length = grid_length
        self._ship_size = ship_size
        
    def next_guess(self, revealed):
        if revealed.mask.all() or revealed.sum() == 0:
            search_j = get_search_cells_1d(self._grid_length,
                                           self._ship_size)            
            next_j = search_j[revealed.mask[0, search_j].argmax()]
        else:
            return 0, next_guess_1d_with_hit(revealed[0])
                
        return 0, next_j

We now use this strategy to play all three games with a length three ship on a length five row.

all_rows_5_3 = get_all_ship_rows(5, 3)

opt_row_games_5_3 = [
    play(ship_row[np.newaxis, np.newaxis],
         Single1DOptimalStrategy(5, 3))
        for ship_row in all_rows_5_3
]
opt_row_turns_5_3 = Pmf.from_seq([game.turns for game in opt_row_games_5_3])

ax = plot_turn_dist(opt_row_turns_5_3)

make_pct_yaxis(ax);
ax.set_title("Optimal play\nShip size 3, row length 5");

Two of the games take four turns and one takes three turns. Notably, no games take five turns. This is because we know that the third cell is a guaranteed hit in this configuration.

We can repeat this exercise for a row of length ten.

all_rows_10_3 = get_all_ship_rows(10, 3)

opt_row_games_10_3 = [
    play(ship_row[np.newaxis, np.newaxis],
         Single1DOptimalStrategy(10, 3))
        for ship_row in all_rows_10_3
]
opt_row_turns_pmf_10_3 = Pmf.from_seq([game.turns for game in opt_row_games_10_3])

ax = plot_turn_dist(opt_row_turns_pmf_10_3)

make_pct_yaxis(ax);
ax.set_title("Optimal play\nShip size 3, row length 10");

We see that no games take more than six turns with this configuration. Since the equivalence class of -1 modulo 3 contains three members (2, 5, and 8), we are guaranteed to get our first hit in at most three guesses, and once we have a first hit it will take at most three guesses to sink the ship (two hits and potentially one miss if the first hit was not on the rightmost cell of the ship).

Bayesian (Thompson sampling)

The above thought exercise is interesting, but it involves several domain-specific insights (the optimal search grid, how to optimally sink a ship that has been located) that do not generalize immediately to two-dimensional grids that allow ships to be oriented in either rows or columns. (Note that if we restrict our ship in an $m \times n$ grid to be oriented along a row, we’re essentially playing a one-dimensional game on a row of length $m \cdot n$.) In this section we will show how to reproduce the above results using a Bayesian approach to this game of simplified Battleship. This approach will generalize to a near-optimal strategy for standard Battleship with far fewer hard-coded rules than generalizing the optimal one-dimensional strategy above would require.

We begin by enumerating all possible boards for a given row/ship configuration (in this case 5/3).

ax = plot_board(all_rows_5_3)
ax.set_yticklabels([]);

We consider this set of ships as a uniform prior distribution on the set of possible board layouts. With this perspective, when we take the average in each column, we get the prior probability that each cell contains a ship.
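As a minimal sketch of this averaging, here are the three 5/3 rows written out by hand (rather than generated with get_all_ship_rows) so the example runs on its own.

```python
import numpy as np

# The three possible placements of a size-three ship on a length-five row
rows_5_3 = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
])

# Under a uniform prior over layouts, the column means are the
# prior probabilities that each cell contains part of the ship
prior_hit_prob = rows_5_3.mean(axis=0)
```

The resulting probabilities are 1/3, 2/3, 1, 2/3, and 1/3 for the five cells.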

We already see that this Bayesian approach lets us conclude that the third cell must contain a ship without doing any modular arithmetic.

For a configuration with a longer row (10/3), we see that the middle six cells have the highest prior probability of containing a ship. This makes sense as there are three configurations in which each of these cells contains a ship, but only one configuration in which the first cell contains a ship, and similarly for the other cells near the edges.

all_rows_10_3 = get_all_ship_rows(10, 3)

plot_board(all_rows_10_3.mean(axis=0), cbar=True);

Looking at these probabilities, a reasonable strategy would seem to be to guess the next cell with the highest probability of containing a hit. In both the 5/3 and 10/3 cases, this would result in guessing the third cell next. (By convention when many cells have the same hit probability, we will choose the one with the smallest coordinates.)

Let’s focus on the 5/3 configuration for the moment and return to 10/3 after. Suppose the true 5/3 board is as follows.

board_5_3 = np.zeros(5)
board_5_3[1:4] = 1

plot_board(board_5_3);

Guessing the third cell must result in a hit in the 5/3 configuration, so after one guess the following board has been revealed.

revealed_5_3 = np.ma.masked_all_like(board_5_3)
revealed_5_3[2] = board_5_3[2]

plot_board(revealed_5_3);

With this information, what should our next guess be? For the first guess we chose the cell with the highest prior probability of yielding a hit, so it seems intuitive that we would choose the cell with the highest posterior probability of yielding a hit, given the observation from our first guess.

We can get this posterior distribution by eliminating the boards that do not match the cells that have been revealed to us.

def is_compat(ships, revealed, board_axis=(-2, -1)):
    return (ships == revealed).all(axis=board_axis)

compat_5_3 = all_rows_5_3[
    is_compat(all_rows_5_3, revealed_5_3, board_axis=-1)
]

ax = plot_board(compat_5_3)
ax.set_yticklabels([]);

Since the third cell is guaranteed to be a hit in the 5/3 configuration, all ship layouts are compatible with the result of our first guess, and the posterior is the same as the prior restricted to unknown cells. With this restriction our posterior is as follows.

plot_board(
    np.ma.masked_array(compat_5_3.mean(axis=0),
                       mask=~revealed_5_3.mask),
    cbar=True
);

We see that out of all the unknown cells, the second and fourth are tied for the highest posterior probability of being a hit. By our convention, we will choose the second cell as the next guess, which we know will reveal a hit.

revealed_5_3[1] = board_5_3[1]

plot_board(revealed_5_3);

With this observation we are able to eliminate one of the potential boards.

compat_5_3 = all_rows_5_3[
    is_compat(all_rows_5_3, revealed_5_3, board_axis=-1)
]

ax = plot_board(compat_5_3)
ax.set_yticklabels([]);

Combining these two compatible boards to form the posterior distribution, we get the following.

ax = plot_board(
    np.ma.masked_array(compat_5_3.mean(axis=0),
                       mask=~revealed_5_3.mask),
    cbar=True
)
ax.set_yticklabels([]);

We see that it is impossible for the fifth cell to contain a hit, and there’s a 50% chance that each of the first and fourth cells will be a hit. By our convention above, we guess the first cell, resulting in a miss.

revealed_5_3[0] = board_5_3[0]

plot_board(revealed_5_3);

Finally we see that there is only one board compatible with the information that has been revealed, so we guess the fourth cell to end the game.

compat_5_3 = all_rows_5_3[
    is_compat(all_rows_5_3, revealed_5_3, board_axis=-1)
]

ax = plot_board(compat_5_3)
ax.set_yticklabels([]);

We now return to the 10/3 case. Suppose we are working with the following board.

board_10_3 = np.zeros(10)
board_10_3[5:8] = 1

plot_board(board_10_3);

Recalling the prior distribution on hits in the 10/3 configuration, we first guess the third cell, revealing a miss.

plot_board(all_rows_10_3.mean(axis=0), cbar=True);

revealed_10_3 = np.ma.masked_all_like(board_10_3)
revealed_10_3[2] = board_10_3[2]

plot_board(revealed_10_3);

Excluding boards that are incompatible with a miss in the third cell, we see that we have already ruled out a ship in the first two cells as well.

compat_10_3 = all_rows_10_3[
    is_compat(all_rows_10_3, revealed_10_3, board_axis=-1)
]

ax = plot_board(compat_10_3)
ax.set_yticklabels([]);

Unsurprisingly, the sixth through eighth cells are tied for the highest probability of yielding a hit, so we guess the sixth cell, revealing a hit.

plot_board(compat_10_3.mean(axis=0), cbar=True);

revealed_10_3[5] = board_10_3[5]

plot_board(revealed_10_3);

Hitting the rest of the ship will proceed similarly to the 5/3 case, so we stop playing here.

This approach of calculating the posterior distribution by enumerating the possible grids and eliminating those that are incompatible with the observed hits and misses is a very simple type of approximate Bayesian computation (ABC). ABC is a fascinating and deep field of research in its own right, and we are really only scratching its surface here.
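The whole enumerate-and-eliminate update fits in a few lines. The following is a self-contained sketch for a single row; `posterior_hit_prob` is a hypothetical helper, not one of the functions defined above, and it assumes at least one board is compatible with the revealed cells.

```python
import numpy as np


def posterior_hit_prob(boards, revealed):
    """Posterior hit probability per cell under a uniform prior on boards.

    boards is an (n_boards, n_cells) array of 0/1 layouts and revealed is a
    masked array that is unmasked only at cells that have been guessed.
    """
    observed = ~np.ma.getmaskarray(revealed)
    # Keep only the boards that agree with every revealed cell
    compatible = (boards[:, observed] == revealed.data[observed]).all(axis=1)

    return boards[compatible].mean(axis=0)


boards_5_3 = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
])

# Hits revealed in the second and third cells, as in the 5/3 walkthrough
revealed = np.ma.masked_all(5)
revealed[1] = 1
revealed[2] = 1

post = posterior_hit_prob(boards_5_3, revealed)
```

Here `post` puts probability 0.5 on the first and fourth cells and zero on the fifth, matching the walkthrough above.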

Choosing to guess the cell with the maximum a posteriori (MAP) probability of a hit is a simplified (greedy) form of Thompson sampling, a Bayesian approach to playing games of incomplete information that chooses the next action according to the probability that it maximizes the expected reward. In the simple situation of a single ship (in both one and two dimensions) we can calculate exactly which cell(s) have the highest posterior probability of yielding a hit (the expected reward), so this strategy becomes greedy. (I have a soft spot for Thompson sampling in general as we use it in a number of our machine-learning based products at Kibo, where I work. For a discussion of the applications of Thompson sampling to e-commerce optimization, take a look at my talks.)

def argmax_2d(x):
    max_i = x.max(axis=1).argmax()
    max_j = x[max_i].argmax()
    
    return max_i, max_j

class SingleThompsonStrategy(Strategy):
    def __init__(self, poss_ships):
        self._poss_ships = poss_ships
    
    def compat_ships(self, revealed):
        if revealed.mask.all():
            return self._poss_ships
        else:
            is_compat_ = is_compat(self._poss_ships, revealed)

            return self._poss_ships[is_compat_]
    
    def next_guess(self, revealed):
        post = np.ma.masked_array(
            self.compat_ships(revealed).mean(axis=0),
            mask=~revealed.mask
        )
        
        return argmax_2d(post)
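To make the distinction between the greedy rule and randomized Thompson sampling concrete, here is a hedged one-dimensional sketch. Both function names are hypothetical. The randomized version draws one board from the posterior (uniform over the compatible boards) and acts as if that board were the truth.

```python
import numpy as np


def greedy_guess(compat_boards, unknown):
    """Pick the unknown cell with the highest posterior hit probability."""
    post = compat_boards.mean(axis=0)

    return int(np.where(unknown, post, -np.inf).argmax())


def thompson_guess(compat_boards, unknown, rng):
    """Sample one compatible board and guess an unknown cell it marks as a hit."""
    sampled = compat_boards[rng.integers(len(compat_boards))]
    candidates = np.flatnonzero((sampled == 1) & unknown)

    if candidates.size == 0:
        # The sampled board is already fully revealed; fall back to greedy
        return greedy_guess(compat_boards, unknown)

    return int(candidates[0])


# The 5/3 row after hits in the second and third cells
compat_boards = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
])
unknown = np.array([True, False, False, True, True])

greedy = greedy_guess(compat_boards, unknown)
sampled_guesses = {
    thompson_guess(compat_boards, unknown, np.random.default_rng(seed))
    for seed in range(20)
}
```

The greedy rule always picks the first cell here, while the sampled rule picks either the first or the fourth cell depending on which board is drawn. It never picks the fifth cell, which has zero posterior probability.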

We watch this strategy play the game we started above.

ts = SingleThompsonStrategy(all_rows_10_3[:, np.newaxis, :])
game_10_3 = play(board_10_3[np.newaxis, np.newaxis, :], ts,
                 progress_bar=True)

100%|██████████| 3/3 [00:00<00:00, 391.81it/s]

%%capture
ani = animate_boards(game_10_3.turn_revealed)

HTML(ani.to_html5_video())

As with our optimal row strategy, we can play every possible 10/3 game with the Thompson sampling strategy and compare the distributions of the number of turns.

ts_row_games_10_3 = [
    play(row[np.newaxis, np.newaxis],
         SingleThompsonStrategy(all_rows_10_3[:, np.newaxis, :]))
        for row in all_rows_10_3
]
ts_row_turns_pmf_10_3 = Pmf.from_seq([
    game.turns for game in ts_row_games_10_3
])

fig, (opt_ax, ts_ax) = plt.subplots(nrows=2, sharex=True, sharey=True,
                                    figsize=(FIG_WIDTH, 2 * FIG_HEIGHT))

plot_turn_dist(opt_row_turns_pmf_10_3, color='C0', ax=opt_ax);

make_pct_yaxis(opt_ax);
opt_ax.set_title("Optimal play");

plot_turn_dist(ts_row_turns_pmf_10_3, color='C1', ax=ts_ax);

make_pct_yaxis(ts_ax);
ts_ax.set_title("Thompson sampling");

fig.suptitle("Ship size 3, row length 10");
fig.tight_layout();

It turns out that for a single row with a single ship, the Thompson sampling strategy is optimal!

Image credit SINMANTYX

It is extremely cool to me that we can reproduce the optimal strategy without too much domain-specific thought, but just by enumerating all possible boards and sequentially eliminating the ones that are no longer compatible with the observed (revealed) data! This result is particularly exciting because, as we will see, the generalization of the row-optimal strategy to a two-dimensional grid is not straightforward (even for just one ship), but the Thompson sampling approach generalizes much more readily.

Square, one ship

We now turn to the case of a square grid with one ship on it to see how these strategies generalize to something closer to standard Battleship.

Random guessing

Once again the number of turns it takes to sink the ship follows a negative hypergeometric distribution. For the 5/5/3 configuration, this distribution is shown below.

random_5_5_3_pmf = get_random_guess_dist(5**2, 3)

ax = plot_turn_dist(random_5_5_3_pmf, kind='line', mean=True)

make_pct_yaxis(ax);
ax.set_title(r"5/5/3");

There are not really any surprises here; the geometry of the board and the ability to have ships orient along rows or columns doesn’t affect the random guessing strategy at all.

Near optimal

We now construct a near optimal strategy for one ship on a two-dimensional grid based on trying to reduce the problem to search on a one-dimensional row/column as quickly as possible. We call this strategy “near optimal” because we will see that it takes slightly more turns, on average, than a Thompson sampling strategy to sink the ship.

It is tempting to treat an $m \times n$ two-dimensional grid as an $m \cdot n$ one-dimensional grid and use the search grid from the above strategy, but this grid will only minimize the turns-to-first-hit when $m$ and $n$ are relatively prime. To illustrate this fact, we show the results of wrapping the search grid for the 100/2 configuration to the 10/10/2 configuration below.

bad_search_j_1d = get_search_cells_1d(100, 2)
bad_search_i, bad_search_j = np.divmod(bad_search_j_1d, 10)
bad_search_grid = np.zeros((10, 10))
bad_search_grid[bad_search_i, bad_search_j] = 1

ax = plot_board(bad_search_grid)
ax.set_title(r"Don't do this!", fontweight='bold');

Clearly this search strategy can fail to locate a column-oriented ship in an even-numbered column even after fifty guesses!

Instead, the optimal search grid for the initial hit (at least for a square grid) cycles through the equivalence classes modulo the ship size in each row, as shown below for both the 10/10/2 and the 5/5/3 configurations.

def get_search_cells_2d(grid_shape, ship_size):
    i, j = np.indices(grid_shape)
    in_grid = (i - 1) % ship_size == j % ship_size
    
    return i[in_grid], j[in_grid]

search_i, search_j = get_search_cells_2d((10, 10), 2)
search_grid = np.zeros((10, 10))
search_grid[search_i, search_j] = 1

ax = plot_board(search_grid)
ax.set_title(r"Ship size 2, $10 \times 10$ grid");

search_i, search_j = get_search_cells_2d((5, 5), 3)
search_grid = np.zeros((5, 5))
search_grid[search_i, search_j] = 1

ax = plot_board(search_grid)
ax.set_title(r"Ship size 3, $5 \times 5$ grid");

We can tell that these search grids are optimal because each cell highlighted in red is one cell less than the ship size away from any other highlighted cell.
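One half of this claim, that no ship can hide from the search grid, is easy to check by brute force. The sketch below is an illustration rather than part of the strategy: mod_search_mask repeats the rule from get_search_cells_2d as a boolean mask, covers_all_placements is a hypothetical helper, and the wrapped grid assumes that the one-dimensional search for the 100/2 configuration guesses every second cell (the odd indices).

```python
import numpy as np


def mod_search_mask(grid_shape, ship_size):
    # Same rule as get_search_cells_2d, returned as a boolean mask.
    i, j = np.indices(grid_shape)

    return (i - 1) % ship_size == j % ship_size


def covers_all_placements(search_mask, ship_size):
    """True if every row- or column-oriented ship touches a search cell."""
    n_rows, n_cols = search_mask.shape

    for i in range(n_rows):
        for j in range(n_cols - ship_size + 1):
            if not search_mask[i, j:j + ship_size].any():
                return False

    for j in range(n_cols):
        for i in range(n_rows - ship_size + 1):
            if not search_mask[i:i + ship_size, j].any():
                return False

    return True


# Assumption: the wrapped 1D grid guesses every second cell (odd indices).
wrapped = np.zeros(100, dtype=bool)
wrapped[1::2] = True
wrapped = wrapped.reshape(10, 10)

print(covers_all_placements(mod_search_mask((10, 10), 2), 2))  # True
print(covers_all_placements(mod_search_mask((5, 5), 3), 3))    # True
print(covers_all_placements(wrapped, 2))                       # False
```

This only verifies coverage (every possible ship is hit by some search cell); it does not by itself show that no smaller set of cells would also work.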

Once a hit has been found by searching these cells, we have to determine if the ship is oriented along the row or column passing through that cell. For hits near a boundary (in cells less than a ship’s length away from either the top/bottom or left/right edge) we only have to test one cell to decide which way the ship is oriented. For hits in the interior we have to, in general, test two adjacent cells (either left and right or above and below) to determine the orientation of the ship.

The following plot highlights the cells that only require one test guess to determine the row/column orientation following a hit for the 10/10/3 configuration.

near_edge = np.zeros((10, 10))
i, j = np.indices((10, 10))
near_edge[(i < 2) | (7 < i) | (j < 2) | (7 < j)] = 1

plot_board(near_edge);

The strategy outlined here is implemented below.

class SingleGridStrategy(Strategy):
    def __init__(self, ship_size):
        self._ship_size = ship_size

    def next_guess(self, revealed):
        if revealed.mask.all() or revealed.sum() == 0:
            next_unmasked = (revealed.mask
                                     [search_i, search_j]
                                     .argmax())

            return search_i[next_unmasked], search_j[next_unmasked]
        else:
            grid_length, _ = revealed.shape
            hit_i, hit_j = argmax_2d(revealed)
            ship_size = self._ship_size
            
            if revealed.sum() == 1:                
                if hit_i < ship_size - 1 or grid_length - ship_size < hit_i:
                    next_i = hit_i + (1 if hit_i < ship_size - 1 else -1)
                    
                    if revealed.mask[next_i, hit_j]:
                        return next_i, hit_j
                    else:
                        is_row = revealed[next_i, hit_j] == 0
                elif hit_j < ship_size - 1 or grid_length - ship_size < hit_j:
                    next_j = hit_j + (1 if hit_j < ship_size - 1 else -1)
                    
                    if revealed.mask[hit_i, next_j]:
                        return hit_i, next_j
                    else:
                        is_row = revealed[hit_i, next_j] == 1
                elif revealed.mask[hit_i, hit_j - 1]:
                    return hit_i, hit_j - 1
                elif revealed.mask[hit_i, hit_j + 1]:
                    return hit_i, hit_j + 1
                else:
                    is_row = False
            else:                
                is_row = revealed[hit_i].sum() > 1
                
            if is_row:
                return hit_i, next_guess_1d_with_hit(revealed[hit_i])
            else:
                return next_guess_1d_with_hit(revealed[:, hit_j]), hit_j

This strategy involves quite a bit of branching logic and more than a few magic numbers. While I do not doubt that this implementation can be simplified, avoiding complex branching logic is one of the motivations for a simulation-based Bayesian approach to Battleship.

To understand the distribution of turns required by this strategy to sink the ship, we define a function that generates all square grids of a given shape with a single ship of a given size.

def get_all_ships(grid_length, ship_size):
    rows = get_all_ship_rows(grid_length, ship_size)
    n_rows, _ = rows.shape
    
    i = np.arange(grid_length)

    boards = np.zeros((
        n_rows, grid_length,
        grid_length, grid_length,
    ))
    boards[:, i, i, :] = rows[:, np.newaxis]
    long_boards = boards.reshape((-1, grid_length, grid_length))
    
    return np.concatenate((
        long_boards,
        long_boards.transpose(0, 2, 1)
    ))

We generate all possible boards for the 5/5/3 configuration and visualize a few of them.

all_ships_5_5_3 = get_all_ships(5, 3)

fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True,
                         figsize=(FIG_WIDTH, FIG_WIDTH))

plt_boards = rng.choice(all_ships_5_5_3, size=axes.size, replace=False)

for (board, ax) in zip(plt_boards, axes.flat):
    plot_board(board, ax=ax);

fig.suptitle("Four boards, 5/5/3");
fig.tight_layout();

We now play all 5/5/3 games using this strategy.

grid_games_5_5_3 = [
    play(ship[np.newaxis], SingleGridStrategy(3))
        for ship in all_ships_5_5_3
]
grid_turns_pmf_5_5_3 = Pmf.from_seq([
    game.turns for game in grid_games_5_5_3
])

ax = plot_turn_dist(grid_turns_pmf_5_5_3,
                    color='C1', label="Near optimal",
                    mean=True, mean_kwargs={'c': 'C1'});
plot_turn_dist(random_5_5_3_pmf,
               kind='line', c='C0',
               mean=True, mean_kwargs={'c': 'C0'},
               label="Random",
               ax=ax)

make_pct_yaxis(ax);

ax.legend();

We see that this near optimal strategy is a significant improvement over random guessing (which is expected). Intuitively, it seems odd that the distribution is bimodal. In fact, we will see after implementing Thompson sampling in the two-dimensional case that this strategy is definitely not optimal.

Bayesian (Thompson sampling)

The generalization of one-dimensional Thompson sampling to a two-dimensional grid is much simpler than the above generalization of the one-dimensional optimal strategy. The implementation of Thompson sampling for a single ship in two dimensions is shown below.

class SingleThompsonStrategy(Strategy):
    def __init__(self, poss_ships):
        self._poss_ships = poss_ships
    
    def compat_ships(self, revealed):
        if revealed.mask.all():
            return self._poss_ships
        else:
            is_compat_ = is_compat(self._poss_ships, revealed)

            return self._poss_ships[is_compat_]
    
    def next_guess(self, revealed):
        post = np.ma.masked_array(
            self.compat_ships(revealed).mean(axis=0),
            mask=~revealed.mask
        )
        
        return argmax_2d(post)
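To make the posterior calculation in next_guess concrete, here is a minimal, self-contained sketch on a one-row board. The helpers compatible and next_guess are hypothetical stand-ins for is_compat and argmax_2d (defined earlier in the post), which may be implemented differently; the sketch assumes that revealed is a masked array whose masked cells are still unknown.

```python
import numpy as np


def compatible(boards, revealed):
    """Boolean index of the boards that agree with every revealed cell."""
    known = ~np.ma.getmaskarray(revealed)

    return (boards[:, known] == revealed.data[known]).all(axis=1)


def next_guess(boards, revealed):
    """Unknown cell with the highest posterior probability of a hit."""
    known = ~np.ma.getmaskarray(revealed)
    post = boards[compatible(boards, revealed)].mean(axis=0)
    scores = np.where(known, -1.0, post)  # never guess a revealed cell

    return tuple(int(k) for k in
                 np.unravel_index(scores.argmax(), scores.shape))


# All placements of a size-three ship on a single row of five cells.
boards = np.array([
    [[1, 1, 1, 0, 0]],
    [[0, 1, 1, 1, 0]],
    [[0, 0, 1, 1, 1]],
])

revealed = np.ma.masked_all((1, 5))
first = next_guess(boards, revealed)   # the middle cell is always occupied

revealed[0, 2] = 1                     # hit
second = next_guess(boards, revealed)

revealed[0, 1] = 0                     # miss, so only one board remains
third = next_guess(boards, revealed)

print(first, second, third)  # (0, 2) (0, 1) (0, 3)
```

Each reveal only filters the list of candidate boards; no rule about rows, columns, or edges is needed.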

We now play all 5/5/3 games with this strategy to understand its turn distribution.

ts_games_5_5_3 = [
    play(ship[np.newaxis], SingleThompsonStrategy(all_ships_5_5_3))
        for ship in all_ships_5_5_3
]
ts_turns_pmf_5_5_3 = Pmf.from_seq([
    game.turns for game in ts_games_5_5_3
])

As alluded to above, we see that the turn distribution for Thompson sampling is concentrated to the left of the turn distribution for our near optimal strategy.

ax = plot_turn_dist(grid_turns_pmf_5_5_3,
                    color='C1', label="Near optimal",
                    mean=True, mean_kwargs={'c': 'C1'});
plot_turn_dist(random_5_5_3_pmf,
               kind='line', c='C0',
               mean=True, mean_kwargs={'c': 'C0'},
               label="Random",
               ax=ax)
plot_turn_dist(ts_turns_pmf_5_5_3,
               color='C2', label="Thompson sampling",
               mean=True, mean_kwargs={'c': 'C2'},
               ax=ax);

make_pct_yaxis(ax);

ax.legend(loc='upper right');

Note that while the turn distribution for Thompson sampling is concentrated further to the left, the expected numbers of turns for Thompson sampling and our near optimal strategy, indicated by the dashed vertical lines, are quite close.

No doubt we could improve the logic of our near optimal strategy to match (or perhaps even beat) Thompson sampling, but the generalization of Thompson sampling to the two-dimensional grid is so much more elegant that it seems hardly worthwhile.

Standard Battleship

Now that we understand these approaches to simplified versions of Battleship, we are ready to tackle the Thompson sampling approach to standard Battleship.

Random

From above we recall the turn distribution for random guessing at standard Battleship.

std_turn_ax.figure

Optimistic expected values

Generalizing the optimal one-dimensional strategy to two dimensions was sufficiently difficult that we do not attempt to define even a near optimal strategy for standard Battleship. Configurations where several ships are adjacent, such as the one shown below, will certainly lead to complex and bug-prone branching logic.

ships = np.zeros((n_ships, GRID_LENGTH, GRID_LENGTH))
ships[0, 0, :5] = 1
ships[1, 1, :4] = 1
ships[2, 2, :3] = 1
ships[3, 3, :3] = 1
ships[4, 4, :2] = 1

fig, axes = plt.subplots(ncols=n_ships, sharex=True, sharey=True,
                         figsize=(n_ships * 2.5, 2.5))

for (ship, name, ax) in zip(ships, SHIP_NAMES, axes):
    plot_board(ship, ax=ax);
    
    ax.set_xticklabels([]);
    ax.set_yticklabels([]);
    ax.set_title(name);
    
fig.tight_layout();

plot_board(to_board(ships));

In spite of these challenges, we will want to at least be able to benchmark the average number of turns Thompson sampling requires against other approaches. Fortunately we can propose several scenarios for which the expected number of turns is straightforward to calculate.

Extremely optimistic

We can make two assumptions to get a reasonable floor on the expected number of turns.

The first five hits we get are each from a different ship.
As soon as we hit a ship, we know the complete location of that ship.

Together with a search strategy for the first five hits, we can calculate the expected value in this unrealistic but instructive scenario. Weakening these assumptions also leads to more realistic expected values to which we can compare the performance of Thompson sampling.

A fairly efficient search strategy for the first five hits in standard Battleship is to search a mod-two grid (since the smallest ship has size two).

search_i, search_j = get_search_cells_2d((10, 10), 2)
search_grid = np.zeros((10, 10))
search_grid[search_i, search_j] = 1

ax = plot_board(search_grid)
ax.set_title(r"Standard Battleship");

The distribution of the number of turns it takes to get five hits in these fifty cells is given below.

extreme_hit_pmf = get_random_guess_dist(search_grid.sum(),
                                        np.ceil(SHIP_SIZES / 2).sum(),
                                        n_hit=SHIP_SIZES.size)

ax = plot_turn_dist(extreme_hit_pmf, mean=True)

make_pct_yaxis(ax);
ax.set_title("Finding five ships on a\n"
             r"$10 \times 10$ grid (optimistically)");

Under the second assumption, it will take us 12 more guesses to sink all the ships, so the expected number of turns to solve the game (optimistically) is

extreme_ev = extreme_hit_pmf.mean() + (SHIP_SIZES - 1).sum()

extreme_ev

40.18181818181814

Under the slightly more conservative (but still optimistic) assumption that it takes on average one and a half misses to determine the location and orientation of a ship, the expected value becomes

optim_ev = extreme_hit_pmf.mean() + 1.5 * SHIP_SIZES.size + (SHIP_SIZES - 2).sum()

optim_ev

42.68181818181814
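The two benchmarks differ only in how many guesses are spent after the five initial hits. The arithmetic can be spelled out as follows; this sketch assumes the standard fleet sizes 5, 4, 3, 3, 2 for SHIP_SIZES and simply mirrors the two expressions above rather than recomputing the search term.

```python
import numpy as np

SHIP_SIZES = np.array([5, 4, 3, 3, 2])  # assumed standard fleet

# Extremely optimistic: after the first hit on a ship, each of its
# remaining cells costs exactly one guess.
follow_up_extreme = int((SHIP_SIZES - 1).sum())

# Optimistic: mirrors the optim_ev expression, with 1.5 guesses per ship
# to pin it down followed by its remaining size - 2 cells.
follow_up_optim = 1.5 * SHIP_SIZES.size + int((SHIP_SIZES - 2).sum())

print(follow_up_extreme)                    # 12
print(follow_up_optim)                      # 14.5
print(follow_up_optim - follow_up_extreme)  # 2.5
```

The difference of 2.5 turns matches the gap between the two expected values reported above (42.68 versus 40.18).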

We will use these two figures to benchmark the performance of Thompson sampling for standard Battleship in the next section.

Bayesian (Thompson sampling)

Generalizing the Thompson sampling strategy from one ship on a two-dimensional grid to standard Battleship is slightly more conceptually involved than generalizing from one- to two-dimensional grids with one ship, but not too much more.

The generalization also requires some computational optimizations, as we can no longer expect to enumerate and store all possible configurations of the board on reasonably sized hardware. To illustrate this and facilitate our eventual solution to this issue, we generate all possible configurations of each ship.

all_ships = [get_all_ships(GRID_LENGTH, ship_size) for ship_size in SHIP_SIZES]

Even though we can enumerate the possibilities for each ship easily,

n_ship_grids = np.array([len(ships) for ships in all_ships])

n_ship_grids

array([120, 140, 160, 160, 180])

there are more than $10^{10}$ possible combinations in the Cartesian product of these configurations

np.log10(n_ship_grids).sum()

10.88882175214102
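To put this number in perspective, the following back-of-the-envelope sketch multiplies out the per-ship counts reported above and estimates the storage a full enumeration would need. The storage figure is a rough assumption of ours (one byte per cell, five $10 \times 10$ grids per combination), not a measurement.

```python
from math import log10, prod

n_ship_grids = [120, 140, 160, 160, 180]  # per-ship counts reported above

n_combinations = prod(n_ship_grids)
print(n_combinations)         # 77414400000
print(log10(n_combinations))  # about 10.89

# Assumption: one byte per cell, five 10 x 10 grids per combination.
bytes_needed = n_combinations * 5 * 10 * 10
print(bytes_needed / 1e12)    # roughly 38.7 (terabytes)
```

Even before discarding overlapping combinations, this is far more than fits in memory on typical hardware.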

Now, not all elements of the Cartesian product produce valid boards (the ships may overlap). An example of such an element is given below.

ships = np.zeros((n_ships, GRID_LENGTH, GRID_LENGTH))
ships[0, 0, :5] = 1
ships[1, 0, :4] = 1
ships[2, 0, :3] = 1
ships[3, 0, :3] = 1
ships[4, 0, :2] = 1

fig, axes = plt.subplots(ncols=n_ships, sharex=True, sharey=True,
                         figsize=(n_ships * 2.5, 2.5))

for (ship, name, ax) in zip(ships, SHIP_NAMES, axes):
    plot_board(ship, ax=ax);
    
    ax.set_xticklabels([]);
    ax.set_yticklabels([]);
    ax.set_title(name);
    
fig.tight_layout();

plot_board(to_board(ships), vmax=5.,
           cbar=True,
           cbar_kwargs={
               'format': '%d',
               'label': "Number of ships"
           });

The following function tests to see if an array of ship grids constitutes a valid board (has no overlap).

def has_no_overlap(ships, ship_axis=0, board_axis=(-2, -1)):
    return (to_board(ships, ship_axis=ship_axis)
                .max(axis=board_axis) \
                == 1)

has_no_overlap(ships)

False

Even eliminating sets of ships that do overlap, there are still too many possible boards to enumerate. To address this challenge we will switch from Thompson enumeration, which is a more accurate description of what we’ve been doing so far, to Thompson sampling. Instead of enumerating all boards compatible with the currently revealed information and choosing the unknown cell with the highest posterior probability of yielding a hit, we will draw a sample from the posterior distribution of compatible boards and use the cell most likely to yield a hit, based on this sample, as the next guess.

This sampling approach means we only need to be able to simulate random games of Battleship and check if they are compatible with the currently revealed information in order to generalize Thompson sampling to standard Battleship. Simulating random games of Battleship is not particularly hard, but simulating games compatible with boards that have many spots revealed can take a prohibitively long time if done naively by sampling from each ship’s possible configurations and then testing to see if the product of the samples is compatible. (I have tried this and had it take over an hour to generate a compatible sample with approximately 30 cells revealed.)

Fortunately we can address this issue by intelligently propagating hits and misses from the aggregate board to each ship’s grid and restricting to a subset of each ship’s position that is more likely to produce a compatible board when combined with all the other ship samples.

Before walking through this process, we introduce the machinery necessary to sample random (compatible) boards.

def _sample_compat_ships_block(poss_ships, revealed,
                               rng=None, block_size=10_000):
    if rng is None:
        rng = np.random.default_rng()

    samples = np.stack([
            rng.choice(ships, size=block_size, shuffle=False)
                for ships in poss_ships
        ],
        axis=1
    )
    
    has_no_overlap_ = has_no_overlap(samples, ship_axis=1)
    valid_samples = samples[has_no_overlap_]
    
    if revealed.mask.all():
        return valid_samples
    else:
        is_compat_ = is_compat(to_board(valid_samples, ship_axis=1), revealed)
    
        return valid_samples[is_compat_]
    
def sample_compat_ships_block(poss_ships, revealed,
                              rng=None, block_size=10_000):
    block = []
    
    while len(block) == 0:
        block = _sample_compat_ships_block(
            poss_ships, revealed,
            rng=rng, block_size=block_size
        )
        
    return block


def generate_compat_ships(poss_ships, revealed,
                          rng=None, block_size=10_000):
    while True:
        yield from sample_compat_ships_block(
            poss_ships, revealed,
            rng=rng, block_size=block_size
        )

def take(n, gen, progress_bar=False):
    n_range = trange(n) if progress_bar else range(n)
    
    return [x for _, x in zip(n_range, gen)]

def sample_compat_ships(poss_ships, revealed, n,
                        rng=None, block_size=10_000,
                        progress_bar=False):
    compat_gen = generate_compat_ships(poss_ships, revealed,
                                       rng=rng, block_size=block_size)
    
    return np.array(take(n, compat_gen,
                         progress_bar=progress_bar))

def sample_ships(poss_ships, n,
                 rng=None, block_size=10_000,
                 progress_bar=False):
    empty = np.ma.masked_all_like(poss_ships[0][0])
    
    return sample_compat_ships(poss_ships, empty, n,
                               rng=rng, block_size=block_size,
                               progress_bar=progress_bar)
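
As a sanity check on the idea of replacing enumeration with sampling, the following sketch compares the two on a board small enough to enumerate: a $4 \times 4$ grid with one ship of size two and one of size three. It is a simplified illustration under our own assumptions (enumerate_ships is a hypothetical helper, boards are combined by summing the ship grids, and nothing has been revealed yet), not a replacement for the machinery above.

```python
import numpy as np


def enumerate_ships(grid_length, ship_size):
    """All placements of one ship on a square grid, as 0/1 arrays."""
    boards = []

    for line in range(grid_length):
        for start in range(grid_length - ship_size + 1):
            board = np.zeros((grid_length, grid_length), dtype=int)
            board[line, start:start + ship_size] = 1
            boards.append(board)    # ship along row `line`
            boards.append(board.T)  # ship along column `line`

    return np.array(boards)


small, large = enumerate_ships(4, 2), enumerate_ships(4, 3)

# Enumeration: every (small, large) pair, keeping the non-overlapping ones.
pairs = small[:, None] + large[None, :]
exact = pairs[pairs.max(axis=(-2, -1)) == 1].mean(axis=0)

# Sampling: draw each ship independently and reject overlapping draws.
rng = np.random.default_rng(0)
n_draw = 20_000
draws = (small[rng.integers(len(small), size=n_draw)]
         + large[rng.integers(len(large), size=n_draw)])
approx = draws[draws.max(axis=(1, 2)) == 1].mean(axis=0)

print(np.abs(approx - exact).max())  # small sampling error
```

Because rejecting overlapping draws leaves a uniform sample of valid boards, the sampled cell probabilities approach the enumerated ones as the number of draws grows.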

We plot four randomly sampled boards below.

ships = sample_ships(all_ships, 4, rng=rng)

fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True,
                         figsize=(FIG_WIDTH, FIG_WIDTH))

for (ships_, ax) in zip(ships, axes.flat):
    plot_board(to_board(ships_), ax=ax);

fig.suptitle("Four boards, standard Battleship");
fig.tight_layout();

To illustrate how long it can take to sample compatible boards from the full Cartesian product, we randomly mask one third of the cells in the first board above and attempt to sample a compatible board.

masked_i, masked_j = rng.choice(
    np.indices((GRID_LENGTH, GRID_LENGTH)).reshape(2, -1),
    size=GRID_LENGTH**2 // 3, axis=1,
    replace=False, shuffle=False
)
mask = np.full((GRID_LENGTH, GRID_LENGTH), False)
mask[masked_i, masked_j] = True

revealed = np.ma.masked_array(to_board(ships[0]), mask=mask)

plot_board(revealed);

start = datetime.datetime.now()

try:
    sample_compat_ships(all_ships, revealed, 1, rng=rng)
except KeyboardInterrupt:
    end = datetime.datetime.now()
    
    print("You gave up trying to sample a compatible board "
          f"after {(end - start).total_seconds():.1f} seconds")

You gave up trying to sample a compatible board after 59.4 seconds

In order to implement the strategy for sampling compatible boards outlined above, we need to think about how to propagate hit/miss information about the combined board to each ship. For a miss, this propagation is easy, as all ships must have been missed. Propagating the information from a hit is much trickier, since in general we don’t know which ship was hit. We do, however, know which ship was hit when our opponent announces that we sank one of their ships, because they have to tell us which one.

To implement Thompson sampling for standard Battleship then, we track not just what has been revealed, but what we can conclude has been revealed per ship. We use this per-ship revealed information to reduce the possibilities for each ship to its compatible grids before sampling, then sample from the Cartesian product of these reduced sets, which produces compatible samples in at most seconds, even when many cells have been revealed.
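The effect of propagating misses to each ship can be seen in a small, self-contained sketch. The helper names enumerate_ships and drop_placements_on_misses are hypothetical, the fleet sizes 5, 4, 3, 3, 2 are assumed, and only misses are propagated here (sunk-ship information is left out for brevity).

```python
import numpy as np


def enumerate_ships(grid_length, ship_size):
    """All placements of one ship on a square grid, as 0/1 arrays."""
    boards = []

    for line in range(grid_length):
        for start in range(grid_length - ship_size + 1):
            board = np.zeros((grid_length, grid_length), dtype=int)
            board[line, start:start + ship_size] = 1
            boards.append(board)    # ship along row `line`
            boards.append(board.T)  # ship along column `line`

    return np.array(boards)


def drop_placements_on_misses(ship_boards, miss_mask):
    """Keep only the placements that occupy none of the missed cells."""
    touches_miss = (ship_boards.astype(bool) & miss_mask).any(axis=(1, 2))

    return ship_boards[~touches_miss]


sizes = (5, 4, 3, 3, 2)  # assumed standard fleet

misses = np.zeros((10, 10), dtype=bool)
misses[0, :] = True  # suppose the whole first row was revealed as misses

before = [len(enumerate_ships(10, s)) for s in sizes]
after = [len(drop_placements_on_misses(enumerate_ships(10, s), misses))
         for s in sizes]

print(before)  # [120, 140, 160, 160, 180]
print(after)   # [104, 123, 142, 142, 161]
```

Every miss shrinks each ship's candidate set independently, so the Cartesian product that the sampler draws from shrinks multiplicatively.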

Of course there are many situations where we can infer which ship has been hit. Consider the following board.

revealed = np.ma.masked_all((10, 10))
revealed[0, :2] = 0
revealed[1, 2] = 0
revealed[2, :2] = 0

plot_board(revealed);

If guessing (1, 0) results in a hit, we can obviously conclude that the patrol boat was hit. We do not account for situations such as this when propagating hit information from the combined board to individual ships to avoid cumbersome and error-prone conditional logic in our strategy. (In fact, Thompson sampling accounts for this automatically without a special rule, as there will be only one possible position for the patrol boat compatible with this revealed information.)
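The parenthetical claim can be checked directly by counting, for each ship size, the placements that cover the hit at (1, 0) while avoiding every revealed miss. This is a pure-Python sketch with hypothetical helper names, assuming the standard ship sizes.

```python
def ship_cells(grid_length, ship_size):
    """Yield each placement as a frozenset of (row, column) cells."""
    for line in range(grid_length):
        for start in range(grid_length - ship_size + 1):
            span = range(start, start + ship_size)
            yield frozenset((line, k) for k in span)  # along a row
            yield frozenset((k, line) for k in span)  # along a column


# The misses from the board above, plus the hypothetical hit at (1, 0).
misses = {(0, 0), (0, 1), (1, 2), (2, 0), (2, 1)}
hit = (1, 0)

n_compatible = {
    size: sum(
        1 for cells in ship_cells(10, size)
        if hit in cells and not (cells & misses)
    )
    for size in (5, 4, 3, 2)
}

print(n_compatible)  # {5: 0, 4: 0, 3: 0, 2: 1}
```

Only a size-two ship lying on (1, 0) and (1, 1) fits, so every sampled board compatible with this hit places the patrol boat there.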

We implement Thompson sampling for standard Battleship as described above.

class ThompsonStrategy(Strategy):
    def __init__(self, all_ships, rng=None, block_size=10_000):
        self._ship_strats = [
            SingleThompsonStrategy(ships) for ships in all_ships
        ]
        
        n_ship = len(all_ships)
        grid_shape = all_ships[0][0].shape
        self._ships_revealed = np.ma.masked_all((n_ship,) + grid_shape)

        self._rng = rng if rng is not None else np.random.default_rng()
        self._block_size = block_size
    
    def next_guess(self, revealed, n=None):
        post = np.ma.masked_array(
            self.sample_post(revealed, n=n),
            mask=~revealed.mask
        )
        
        return argmax_2d(post)
    
    def sample_post(self, revealed, n=None):
        all_compat_ships = [
            ship_strat.compat_ships(ship_revealed)
                for (ship_strat, ship_revealed)
                    in zip(self._ship_strats, self._ships_revealed)
        ]
        
        if n is None:
            compat_samples = sample_compat_ships_block(
                all_compat_ships, revealed,
                rng=self._rng, block_size=self._block_size
            )
        else:
            compat_samples = sample_compat_ships(
                all_compat_ships, revealed, n=n,
                rng=self._rng, block_size=self._block_size
            )
        
        return to_board(compat_samples, ship_axis=1).mean(axis=0)

    def reveal(self, i, j, hit_or_miss, sunk):
        if hit_or_miss == 0 or sunk is not None:
            self._ships_revealed[:, i, j] = 0
            
            if sunk is not None:
                self._ships_revealed[sunk, i, j] = 1

We use this strategy to play a game and visualize the results.

strat = ThompsonStrategy(all_ships, rng=rng)
game = play(ships[0], strat,
            progress_bar=True)

100%|██████████| 17/17 [00:02<00:00, 7.06it/s]

game.turns

%%capture
ani = animate_boards(game.turn_revealed)

HTML(ani.to_html5_video())


With a bit of extra bookkeeping, we can also watch the posterior distribution used for Thompson sampling evolve after each turn.

game = Battleship(ships[0])
strat = ThompsonStrategy(all_ships, rng=rng)

posts = []

with tqdm(total=SHIP_SIZES.sum()) as pbar:
    while not game.is_solved:
        post = strat.sample_post(game.revealed, n=1_000)
        i, j = argmax_2d(np.ma.masked_array(
            post, mask=~game.revealed.mask
        ))

        hit_or_miss, sunk = game.guess(i, j)
        strat.reveal(i, j, hit_or_miss, sunk)
        
        posts.append(post)
        
        if hit_or_miss == 1:
            pbar.update()

100%|██████████| 17/17 [00:09<00:00, 1.83it/s]

%%capture
ani = animate_boards(posts, cbar=True)

HTML(ani.to_html5_video())


I’m not really sure how interpretable this is, but it is kind of mesmerizing.

Finally, we play 1,000 random games of standard Battleship using this strategy to understand the distribution of the number of turns required.

N_GAME = 1_000

def pickleable_play(args):
    i, ships = args

    return play(
        ships,
        ThompsonStrategy(all_ships,
                         rng=np.random.default_rng(SEED + i))
    )

with mp.Pool(mp.cpu_count()) as pool:
    ts_games = [game for game in tqdm(
        pool.imap_unordered(
            pickleable_play,
            zip(range(N_GAME),
                sample_ships(all_ships, N_GAME, rng=rng))
        ),
        total=N_GAME
    )]

100%|██████████| 1000/1000 [15:32<00:00, 1.07it/s]

ts_turns = np.array([game.turns for game in ts_games])
ts_turns_pmf = Pmf.from_seq(ts_turns)

ts_turns_pmf.to_csv('./ts_turns_pmf.csv')

ax = plot_turn_dist(std_random_pmf, kind='line', mean=True,
                    label="Random guessing")
plot_turn_dist(ts_turns_pmf, color='C1',
               mean=True, mean_kwargs={'c': 'C1'},
               label="Thompson sampling", ax=ax);
ax.axvline(extreme_ev, c='C2', ls='--',
           label="Extremely optimistic\nexpected value");
ax.axvline(optim_ev, c='C3', ls='--',
           label="Optimistic\nexpected value");

ax.xaxis.set_major_locator(IndexLocator(base=5, offset=-2));

ax.set_ylim(0, 0.0825);
make_pct_yaxis(ax);

ax.legend(loc='upper right', bbox_to_anchor=(0.825, 1));
ax.set_title("Standard Battleship");

print(f"""
Extremely optimistic expected value = {extreme_ev:.2f}
Optimistic expected value = {optim_ev:.2f}
Thompson sampling expected value = {ts_turns_pmf.mean():.2f}
""")

Extremely optimistic expected value = 40.18
Optimistic expected value = 42.68
Thompson sampling expected value = 45.89

It’s very exciting to see that Thompson sampling is not too much less efficient (on average) than our (extremely) optimistic baseline expected values!

For fun, we visualize the simulated games with the fewest and most turns, to see what it looks like when we get very lucky or unlucky.

min_ts_game = ts_games[(ts_turns == ts_turns.min()).argmax()]
max_ts_game = ts_games[(ts_turns == ts_turns.max()).argmax()]

%%capture
min_ani = animate_boards(min_ts_game.turn_revealed)
max_ani = animate_boards(max_ts_game.turn_revealed)

HTML(min_ani.to_html5_video())


HTML(max_ani.to_html5_video())


Interestingly, we see that the fastest game had three ships adjacent to each other. We likely got a bit of a speedup from finding two of those while attempting to sink the first one hit.

These simulated games naturally invite the question of whether Thompson sampling is optimal for standard Battleship (in the appropriate sense) as we have implemented it. I strongly suspect that it is near-optimal, but not quite optimal. Likely Thompson sampling combined with a few simple rules (like a restricted initial search grid) will produce slightly better (and perhaps even optimal) results. Searching the internet for prior work on near optimal Battleship strategies yields a few articles about strategies that take, on average, 42-44 guesses to complete the game. Our simple Thompson sampling approach is quite close to that range, and with a few heuristics bolted on would likely be able to shave off a few turns to reach that level of performance. We can also improve the performance of Thompson sampling at the cost of computational time by increasing the block_size used for sampling, resulting in a more accurate approximation to the posterior distribution. Another interesting feature of the strategies in those articles is that they try to balance searching (exploration) and sinking known ships (exploitation) with rules, which we do not have to do explicitly when using Thompson sampling.

Two other interesting extensions of this work would be to change the process of randomly generating ships (which is equivalent to changing the prior used in ABC) and to explore which strategies for initially placing ships take the longest for Thompson sampling to solve.

This framework is generalizable to variations of Battleship such as salvo or variants where limited lying is allowed, and we may explore those generalizations in future posts.

Many thanks to Kiril Zvezdarov and Meenal Jhajharia for providing helpful feedback on early drafts of this post.

This post is available as a Jupyter notebook here.

Bayesian Splines with Heteroskedastic Noise in Python with PyMC3

Austin Rochford — Wed, 11 Aug 2021 04:00:00 GMT

Splines are a powerful tool when modeling nonlinear relationships. This post shows how to include splines in a Bayesian model in Python using pymc3. In addition, we will show how to use a second spline component to handle heteroskedastic data, that is, data where the noise scale is not constant.

Image credit Wikipedia

To illustrate these concepts, we will use Lidar data from Larry Wasserman’s excellent book All of Nonparametric Statistics.

Load the Data

First we make the necessary Python imports and do some light housekeeping.

%matplotlib inline

from warnings import filterwarnings

from aesara import shared, tensor as at
import arviz as az
import matplotlib as mpl
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import scipy as sp
import seaborn as sns

You are running the v4 development version of PyMC3 which currently still lacks key features. You probably want to use the stable v3 instead which you can either install via conda or find on the v3 GitHub branch: https://github.com/pymc-devs/pymc3/tree/v3

filterwarnings('ignore', category=UserWarning, module='arviz')

mpl.rcParams['figure.figsize'] = (8, 6)

sns.set(color_codes=True)

We are now ready to load the data.

DATA_URL = 'http://www.stat.cmu.edu/~larry/all-of-nonpar/=data/lidar.dat'

df = pd.read_csv(DATA_URL, sep=' +', engine='python')

df.head()

	range	logratio
0	390	-0.050356
1	391	-0.060097
2	393	-0.041901
3	394	-0.050985
4	396	-0.059913

df.describe()

	range	logratio
count	221.000000	221.000000
mean	554.751131	-0.291156
std	95.912396	0.282475
min	390.000000	-0.949554
25%	472.000000	-0.542305
50%	555.000000	-0.108043
75%	637.000000	-0.054825
max	720.000000	0.026907

We standardize both range and logratio to make it easier to specify priors once we begin building our spline models.

std_df = (df - df.mean()) / df.std()

std_df.head()

	range	logratio
0	-1.717725	0.852467
1	-1.707299	0.817981
2	-1.686447	0.882398
3	-1.676020	0.850240
4	-1.655168	0.818631

Exploratory Data Analysis

The task at hand is to model (standardized) logratio as a function of (standardized) range.

fig, (std_ax, joint_ax) = plt.subplots(
    nrows=2, sharex=True, sharey=False,
    gridspec_kw={'height_ratios': [1, 4]}
)

sns.scatterplot(x="range", y="logratio", data=std_df,
                alpha=0.5, ax=joint_ax);

(std_df.groupby(std_df["range"].round(1))
       ["logratio"]
       .std()
       .rolling(5)
       .mean()
       .plot(ax=std_ax));

std_ax.set_ylabel("Standard\ndeviation\n(binned)");

fig.tight_layout();

The scatter plot shows that the relationship is definitely nonlinear, and there is no obvious (to me at least) transform of logratio that will make the relationship linear. The top plot shows how the (binned, smoothed) standard deviation of logratio varies with range. As is evident from both plots, as range increases, so does the scale of variation of logratio.

Introduction to Splines

Regression splines are a type of generalized additive model (GAM) that use linear combinations of (generally low-degree) polynomials to introduce nonlinear relationships between covariates and responses.

To begin constructing our spline model, we must choose a number of knots (also known as anchors or control points) in the domain of our covariate. In this post we will use twenty knots in the interval $[-1.75, 1.75]$, which comfortably contains the observed values of range.

N_KNOT = 20

knots = np.linspace(-1.75, 1.75, N_KNOT)

The following plot shows the location of the knots in the Lidar data.

fig, (std_ax, joint_ax) = plt.subplots(
    nrows=2, sharex=True, sharey=False,
    gridspec_kw={'height_ratios': [1, 4]}
)

sns.scatterplot(x="range", y="logratio", data=std_df,
                alpha=0.5, ax=joint_ax);
sns.rugplot(knots, height=0.075,
            c='k', label="Knots",
            ax=joint_ax);

(std_df.groupby(std_df["range"].round(1))
       ["logratio"]
       .std()
       .rolling(5)
       .mean()
       .plot(ax=std_ax));
sns.rugplot(knots, height=0.15,
            c='k', label="Knots",
            ax=std_ax);

std_ax.set_ylabel("Standard\ndeviation\n(binned)");

joint_ax.legend();
fig.tight_layout();

Let $x^*_j$, $j = 1, 2, \ldots, 20$, be the location of the $j$-th knot. The spline model we will use is given by

\[E(Y\ |\ X) = \sum_{j = 1}^{20} \beta_j \cdot B_{j, k; \mathbf{x}^*}(X).\]

(If one applies a link function to the conditional expectation on the left-hand side, this becomes a generalized additive model.) For spline regression, $B_{j, k; \mathbf{x}^*}(\cdot)$ is a piecewise $k$-th-degree polynomial in $x$ and $x^*$. There are many possible choices for these functions. We will use scipy’s cubic B-spline implementation. For more information on splines, consult Simon Wood’s excellent book Generalized Additive Models.

basis = sp.interpolate.BSpline(knots, np.eye(N_KNOT), 3)

We see that basis is a callable that will give the design matrix for spline regression at a given set of points.

hasattr(basis, '__call__')

True
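The reason identity coefficients produce a design matrix is linearity: each column of the output is one basis function evaluated at the input points, so multiplying by a coefficient vector gives the same curve as a spline built directly from those coefficients. The sketch below checks this on a small grid. The grid x and the random coefficients beta are made up for illustration.

```python
import numpy as np
from scipy.interpolate import BSpline

N_KNOT = 20
DEGREE = 3
knots = np.linspace(-1.75, 1.75, N_KNOT)

# Identity coefficients: column j of the output is the j-th basis function.
basis = BSpline(knots, np.eye(N_KNOT), DEGREE)

# Illustrative evaluation grid, chosen to lie between knots[DEGREE]
# and knots[N_KNOT - DEGREE - 1].
x = np.linspace(-1.0, 1.0, 50)
dmat = basis(x)  # one row per point, one column per coefficient

# Hypothetical coefficients, for illustration only.
rng = np.random.default_rng(0)
beta = rng.normal(size=N_KNOT)

# By linearity, dmat @ beta matches a spline built directly from beta.
direct = BSpline(knots, beta, DEGREE)(x)
via_dmat = dmat @ beta

print(dmat.shape)
print(np.abs(direct - via_dmat).max())
```

On this interior grid the rows of dmat also sum to one, which is the partition-of-unity property of B-splines.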

We build this design matrix at the (standardized) value of range.

dmat = shared(basis(std_df["range"]))

With dmat in hand, we are ready to build our model with pymc3. We follow the model specified in Milad Kharratzadeh’s excellent short paper Splines in Stan.

The model for the conditional mean is given above,

\[\mu\ |\ X = \sum_{j = 1}^{20} \beta_j \cdot B_{j, k; \mathbf{x}^*}(X).\]

We put a Gaussian random walk prior (GRW) on these coefficients $\beta_j$, under the intuition that the coefficients for adjacent knots should be similar. We parameterize our GRW as follows:

\[ \begin{align*} \mu_{\beta} & \sim N(0, 2.5^2) \\ \Delta_{\beta, j} & \sim N(0, 1) \\ \sigma_{\beta} & \sim \mathrm{Half-}N(2.5^2) \\ \beta_j & = \mu_{\beta} + \sigma_{\beta} \cdot \sum_{i = 1}^j \Delta_{\beta, i}. \end{align*} \]
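To make the parameterization concrete, here is a minimal NumPy sketch that draws one set of coefficients from this prior. It is only a forward simulation of the equations above, not the PyMC model itself, and the seed and lowercase variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
N_KNOT = 20

# Scale that gives a half-normal distribution unit variance.
halfnormal_scale = 1.0 / np.sqrt(1.0 - 2.0 / np.pi)

mu_beta = rng.normal(0.0, 2.5)
# The absolute value of a N(0, s^2) draw is half-normal with scale s.
sigma_beta = abs(rng.normal(0.0, 2.5 * halfnormal_scale))
delta = rng.normal(0.0, 1.0, size=N_KNOT)

# Non-centered random walk: scaled cumulative sum of standard normal steps.
beta = mu_beta + sigma_beta * np.cumsum(delta)

# Adjacent coefficients differ by exactly sigma_beta * delta_j.
steps = np.diff(beta)
print(np.allclose(steps, sigma_beta * delta[1:]))
```

Because neighboring coefficients differ only by a single scaled step, a small $\sigma_{\beta}$ pulls the fitted curve toward a smooth function, which is the intuition behind the prior.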

# the scale necessary to make a halfnormal distribution
# have unit variance
HALFNORMAL_SCALE = 1. / np.sqrt(1. - 2. / np.pi)

coords = {"knot": np.arange(N_KNOT)}

with pm.Model(coords=coords) as model:
    μ_β = pm.Normal("μ_β", 0., 2.5)
    Δ_β = pm.Normal("Δ_β", 0., 1., dims="knot")
    σ_β = pm.HalfNormal("σ_β", 2.5 * HALFNORMAL_SCALE)
    β = pm.Deterministic("β", μ_β + σ_β * Δ_β.cumsum(),
                         dims="knot")
    μ = at.dot(dmat, β)
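The HALFNORMAL_SCALE constant defined above relies on the fact that a half-normal distribution with scale $s$ has variance $s^2 (1 - 2 / \pi)$. The following sketch checks this with scipy.stats.halfnorm. It is a standalone verification and is not needed to run the model.

```python
import numpy as np
from scipy import stats

HALFNORMAL_SCALE = 1.0 / np.sqrt(1.0 - 2.0 / np.pi)

# With this scale the half-normal distribution has unit variance.
unit = stats.halfnorm(scale=HALFNORMAL_SCALE)

# Multiplying the scale by 2.5 gives a standard deviation of 2.5,
# matching the Half-N(2.5^2) notation used for the priors.
prior = stats.halfnorm(scale=2.5 * HALFNORMAL_SCALE)

print(unit.var())
print(prior.std())
```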

Our observational model here is normal, with unknown standard deviation $\sigma \sim \mathrm{Half-}N(2.5^2)$.

with model:
    σ = pm.HalfNormal("σ", 2.5 * HALFNORMAL_SCALE)
    obs = pm.Normal("obs", μ, σ, observed=std_df["logratio"])

We now sample from the posterior distribution of this model.

SEED = 123456789
CORES = 3

SAMPLE_KWARGS = {
    'cores': CORES,
    'random_seed': [SEED + i for i in range(CORES)],
    'return_inferencedata': True,
    'target_accept': 0.95
}

with model:
    trace = pm.sample(**SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [μ_β, Δ_β, σ_β, σ]

100.00% [6000/6000 02:21<00:00 Sampling 3 chains, 0 divergences]

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 142 seconds.

None of the standard sampling diagnostics show cause for concern.

az.plot_energy(trace);

az.rhat(trace).max()

<xarray.Dataset>
Dimensions:  ()
Data variables:
    μ_β      float64 1.002
    Δ_β      float64 1.003
    σ_β      float64 1.003
    σ        float64 1.001
    β        float64 1.002
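For readers unfamiliar with $\hat{R}$, the sketch below implements the classic split version on simulated chains. This is a simplification: as far as I know, ArviZ defaults to a rank-normalized variant, so the numbers it reports can differ slightly. The function split_rhat and the simulated chains are illustrative and not part of the original analysis.

```python
import numpy as np


def split_rhat(chains):
    """Classic split R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    half = chains.shape[1] // 2
    # Split each chain in half so that within-chain drift also inflates R-hat.
    splits = np.concatenate(
        [chains[:, :half], chains[:, half:2 * half]], axis=0
    )
    n = splits.shape[1]
    within = splits.var(axis=1, ddof=1).mean()
    between = n * splits.mean(axis=1).var(ddof=1)
    var_plus = (n - 1) / n * within + between / n
    return np.sqrt(var_plus / within)


rng = np.random.default_rng(0)
# Three well-mixed chains targeting the same distribution.
good = rng.normal(size=(3, 1000))
# Three chains stuck at different locations.
bad = good + np.array([[0.0], [1.0], [2.0]])

rhat_good = split_rhat(good)
rhat_bad = split_rhat(bad)
print(rhat_good, rhat_bad)
```

Values close to one, as in the output above, indicate that the chains agree with each other. Values well above one indicate that they do not.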

To visualize our predictions, we sample from the posterior predictive distribution along a grid of reasonable values for range.

pp_range = np.linspace(-1.75, 1.75, 100)
dmat.set_value(basis(pp_range))

with model:
    pp_trace = pm.sample_posterior_predictive(trace)

100.00% [3000/3000 00:00<00:00]

We now plot the posterior predictions.

fig, (std_ax, joint_ax) = plt.subplots(
    nrows=2, sharex=True, sharey=False,
    gridspec_kw={'height_ratios': [1, 4]}
)

low, high = np.percentile(pp_trace["obs"], [2.5, 97.5], axis=0)
joint_ax.fill_between(pp_range, low, high,
                      color='k', alpha=0.25,
                      label="95% credible interval");

sns.scatterplot(x="range", y="logratio", data=std_df,
                alpha=0.5, ax=joint_ax);

(std_df.groupby(std_df["range"].round(1))
       ["logratio"]
       .std()
       .rolling(5)
       .mean()
       .plot(ax=std_ax, label="Actual"));

std_ax.plot(pp_range, pp_trace["obs"].std(axis=0),
            c='k', label="Posterior predictive");

joint_ax.plot(pp_range, pp_trace["obs"].mean(axis=0),
              c='k', label="Posterior expected value");

std_ax.set_ylabel("Standard\ndeviation\n(binned)");

std_ax.legend(loc='upper left', bbox_to_anchor=(0., 1.65));
joint_ax.legend(loc='lower left');
fig.tight_layout();

Visually, we appear to have captured the relationship between range and the expected value of logratio reasonably well. The credible interval and the standard deviation above are a bit odd though. We have built a homoskedastic (same-variance) observational model, so the credible interval has roughly the same width, even though the data show a small variance for small values of range, and variance increases as range does.
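The constant width is a direct consequence of using a single $\sigma$. The sketch below simulates predictive draws around a made-up mean curve with a fixed scale and computes the same percentile band as the plotting code. It ignores posterior uncertainty in the parameters, so it is only an idealization of the homoskedastic model.

```python
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(-1.75, 1.75, 50)
n_draws = 4000

# Hypothetical mean curve and noise scale, for illustration only.
mean_curve = -np.tanh(2.0 * grid)
sigma = 0.3

# Predictive draws: one row per draw, one column per grid point.
draws = rng.normal(mean_curve, sigma, size=(n_draws, grid.size))

low, high = np.percentile(draws, [2.5, 97.5], axis=0)
width = high - low

# The band is about 2 * 1.96 * sigma wide at every grid point.
print(width.min(), width.max(), 2 * 1.96 * sigma)
```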

Accounting for heteroskedasticity

In order to remedy this issue, we will build a heteroskedastic model that allows the variance of logratio to vary with range. In fact, we will use a spline to model the changing variance as well.

Let $\gamma_j$ come from a GRW similar to $\beta_j$. We define

\[ \begin{align*} \eta_{\sigma}\ |\ X & = \sum_{j = 1}^{20} \gamma_j \cdot B_{j, k; \mathbf{x}^*}(X) \\ \sigma\ |\ X & = 0.05 + \exp(\eta_{\sigma}). \end{align*} \]

Note that the additive $0.05$ term in the definition of $\sigma\ |\ X$ sets a lower bound on the standard deviation, which is necessary for computational stability.
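A small numerical sketch of this link function shows the lower bound in action. The function name sigma_from_eta and the test values of $\eta_{\sigma}$ are illustrative. The conversion to original units uses the standard deviation of logratio from the df.describe() output.

```python
import numpy as np


def sigma_from_eta(eta, floor=0.05):
    """Map the linear predictor to a scale bounded below by floor."""
    return floor + np.exp(eta)


# Illustrative values of the linear predictor, from very negative to positive.
eta = np.array([-50.0, -3.0, 0.0, 1.0])
sigma = sigma_from_eta(eta)
print(sigma)

# The floor is in standardized units; convert it to raw logratio units
# with the standard deviation of logratio from df.describe().
LOGRATIO_STD = 0.282475
floor_raw = 0.05 * LOGRATIO_STD
print(floor_raw)
```

However negative $\eta_{\sigma}$ becomes, the scale never drops below the floor, so the normal likelihood cannot collapse onto a single observation.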

The model is a straightforward adaptation of the homoskedastic one.

dmat.set_value(basis(std_df["range"]))

with pm.Model(coords=coords) as var_model:
    β0 = pm.Normal("β0", 0., 2.5)
    Δ_β = pm.Normal("Δ_β", 0., 1., dims="knot")
    σ_β = pm.HalfNormal("σ_β", 2.5 * HALFNORMAL_SCALE)
    β = pm.Deterministic("β", β0 + Δ_β.cumsum() * σ_β,
                         dims="knot")
    μ = at.dot(dmat, β)
    
    γ0 = pm.Normal("γ0", 0., 2.5)
    Δ_γ = pm.Normal("Δ_γ", 0., 1., dims="knot")
    σ_γ = pm.HalfNormal("σ_γ", 2.5 * HALFNORMAL_SCALE)
    γ = pm.Deterministic("γ", γ0 + Δ_γ.cumsum() * σ_γ,
                         dims="knot")
    η_σ = at.dot(dmat, γ)
    σ = 0.05 + at.exp(η_σ)

    obs = pm.Normal("obs", μ, σ, observed=std_df["logratio"])

We now sample from this model.

with var_model:
    var_trace = pm.sample(**SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [β0, Δ_β, σ_β, γ0, Δ_γ, σ_γ]

100.00% [6000/6000 04:42<00:00 Sampling 3 chains, 0 divergences]

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 282 seconds.

Again the sampling diagnostics show no cause for concern.

az.plot_energy(var_trace);

az.rhat(var_trace).max()

<xarray.Dataset>
Dimensions:  ()
Data variables:
    β0       float64 0.9998
    Δ_β      float64 1.006
    σ_β      float64 1.0
    γ0       float64 1.001
    Δ_γ      float64 1.005
    σ_γ      float64 1.0
    β        float64 1.002
    γ        float64 1.002

We do see that the coefficients $\gamma_j$ that correspond to small values of range are small, and that they grow as range gets larger.

ax, = az.plot_forest(var_trace, var_names=["γ"])

ax.set_xlabel(r"$\gamma_j$");

ax.set_yticklabels(np.arange(N_KNOT)[::-1]);
ax.set_ylabel("$j$");

Again we sample from the posterior predictive distribution of this model.

dmat.set_value(basis(pp_range))

with var_model:
    pp_var_trace = pm.sample_posterior_predictive(var_trace)

100.00% [3000/3000 00:00<00:00]

We plot these predictions in order to compare them to those of the homoskedastic model.

fig, (std_ax, joint_ax) = plt.subplots(
    nrows=2, sharex=True, sharey=False,
    gridspec_kw={'height_ratios': [1, 4]}
)

low, high = np.percentile(pp_var_trace["obs"], [2.5, 97.5], axis=0)
joint_ax.fill_between(pp_range, low, high,
                      color='k', alpha=0.25,
                      label="95% credible interval");

sns.scatterplot(x="range", y="logratio", data=std_df,
                alpha=0.5, ax=joint_ax);

(std_df.groupby(std_df["range"].round(1))
       ["logratio"]
       .std()
       .rolling(5)
       .mean()
       .plot(ax=std_ax, label="Actual"));

std_ax.plot(pp_range, pp_trace["obs"].std(axis=0),
            c='k', label="Posterior predictive\n(homoskedastic)");
std_ax.plot(pp_range, pp_var_trace["obs"].std(axis=0),
            c='r', ls='--',
            label="Posterior predictive\n(heteroskedastic)");

joint_ax.plot(pp_range, pp_trace["obs"].mean(axis=0),
              c='k', label="Posterior predictive\n(homoskedastic)");
joint_ax.plot(pp_range, pp_var_trace["obs"].mean(axis=0),
              c='r', ls='--',
              label="Posterior predictive\n(heteroskedastic)");

std_ax.set_ylabel("Standard\ndeviation\n(binned)");

std_ax.legend(loc='upper left', ncol=3,
              bbox_to_anchor=(0., 1.6));
joint_ax.legend(loc='lower left');
fig.tight_layout();

We see that the homo- and heteroskedastic models produce essentially the same estimate of the expected value of logratio, but that the heteroskedastic model comes closer to capturing the true change in the variance.

We now compare these two models using Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO).

traces = {
    'Homoskedastic': trace,
    'Heteroskedastic': var_trace
}

comp_df = az.compare(traces)

comp_df.loc[:, :"weight"]

	rank	loo	p_loo	d_loo	weight
Heteroskedastic	0	41.369658	17.759399	0.000000	1.000000e+00
Homoskedastic	1	-41.703944	14.188435	83.073602	8.058976e-11

fig, ax = plt.subplots()

az.plot_compare(comp_df, plot_ic_diff=False, ax=ax);

Interestingly, the PSIS-LOO score for the heteroskedastic model is significantly higher than that of the homoskedastic model (roughly 41.4 versus -41.7, a difference of about 83), even though these two models predict essentially the same conditional mean.
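One way to see why the scores can differ so much when the means agree is that PSIS-LOO estimates a log predictive density, which depends on the predicted scale as well as the predicted mean. The sketch below is not PSIS-LOO: it compares in-sample log densities on simulated data with a known mean of zero, and the noise profile true_sigma is invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 2000

# Hypothetical noise scale that grows along the sample, as in the Lidar data.
true_sigma = np.linspace(0.1, 1.0, n)
y = rng.normal(0.0, true_sigma)  # the conditional mean is zero everywhere

# Homoskedastic candidate: correct mean, one best-fitting constant scale.
const_sigma = np.sqrt(np.mean(y ** 2))
homo_lpd = stats.norm.logpdf(y, loc=0.0, scale=const_sigma).sum()

# Heteroskedastic candidate: same mean, correct varying scale.
hetero_lpd = stats.norm.logpdf(y, loc=0.0, scale=true_sigma).sum()

print(homo_lpd, hetero_lpd)
```

Both candidates use the same mean, yet the one with the correct scale assigns the data a much higher log density, which mirrors the ordering in the comparison table.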

This post is available as a Jupyter notebook here.
