This post is the second in a series of posts analyzing Lego pricing data scraped from brickset.com. The first post gave an empirical analysis of the fairness of the price for Darth Vader’s Meditation Chamber (75296), disproving my initial impression that the set was overpriced for its size and prompting me to preorder it. In this post we will build several more formal models of the price of Lego sets based on their size and use them to answer a few related questions.


The newest addition to my collection (75244).

First we make the necessary Python imports and do some light housekeeping.

%matplotlib inline
import datetime
from functools import reduce
import itertools as itl
from warnings import filterwarnings
from aesara import shared, tensor as at
import arviz as az
from matplotlib import MatplotlibDeprecationWarning, pyplot as plt
from matplotlib.lines import Line2D
from matplotlib.ticker import FuncFormatter, MultipleLocator, StrMethodFormatter
import numpy as np
import pandas as pd
import pymc3 as pm
import scipy as sp
from sklearn.preprocessing import StandardScaler
import seaborn as sns
filterwarnings('ignore', category=UserWarning, message=".*FixedFormatter.*")
filterwarnings('ignore', category=MatplotlibDeprecationWarning, module='pandas')
filterwarnings('ignore', category=UserWarning, module='aesara')
filterwarnings('ignore', category=UserWarning, module='arviz')
filterwarnings('ignore', category=UserWarning, module='pandas')
plt.rcParams['figure.figsize'] = (8, 6)

sns.set(color_codes=True)

dollar_formatter = StrMethodFormatter("${x:,.2f}")

Load the Data

We begin the real work by loading the data scraped from Brickset. See the first post in this series for more background on the data.

DATA_URL = 'https://austinrochford.com/resources/lego/brickset_01011980_06012021.csv.gz'
def to_datetime(year):
    return np.datetime64(f"{round(year)}-01-01")
full_df = (pd.read_csv(DATA_URL,
                       usecols=[
                           "Year released", "Set number",
                           "Name", "Set type", "Theme", "Subtheme",
                           "Pieces", "RRP"
                       ])
             .dropna(subset=[
                 "Year released", "Set number",
                 "Name", "Set type", "Theme",
                 "Pieces", "RRP"
             ]))
full_df["Year released"] = full_df["Year released"].apply(to_datetime)
full_df = (full_df.set_index(["Year released", "Set number"])
                  .sort_index())

We see that the data set contains information on approximately 8,500 Lego sets produced between 1980 and June 2021.

full_df.head()
Name Set type Theme Pieces RRP Subtheme
Year released Set number
1980-01-01 1041-2 Educational Duplo Building Set Normal Dacta 68.0 36.50 NaN
1075-1 LEGO People Supplementary Set Normal Dacta 304.0 14.50 NaN
1101-1 Replacement 4.5V Motor Normal Service Packs 1.0 5.65 NaN
1123-1 Ball and Socket Couplings & One Articulated Joint Normal Service Packs 8.0 16.00 NaN
1130-1 Plastic Folder for Building Instructions Normal Service Packs 1.0 14.00 NaN


full_df.tail()
Name Set type Theme Pieces RRP Subtheme
Year released Set number
2021-01-01 80022-1 Spider Queen’s Arachnoid Base Normal Monkie Kid 1170.0 119.99 Season 2
80023-1 Monkie Kid’s Team Dronecopter Normal Monkie Kid 1462.0 149.99 Season 2
80024-1 The Legendary Flower Fruit Mountain Normal Monkie Kid 1949.0 169.99 Season 2
80106-1 Story of Nian Normal Seasonal 1067.0 79.99 Chinese Traditional Festivals
80107-1 Spring Lantern Festival Normal Seasonal 1793.0 119.99 Chinese Traditional Festivals


full_df.describe()
Pieces RRP
count 8502.000000 8502.000000
mean 258.739591 31.230506
std 481.627846 44.993559
min 0.000000 0.000000
25% 32.000000 7.000000
50% 98.000000 17.000000
75% 300.000000 39.990000
max 11695.000000 799.990000


We also load data about my personal collection and add a column to full_df indicating whether or not I own each set.

MY_COLLECTION_URL = 'https://austinrochford.com/resources/lego/Brickset-MySets-owned-20210602.csv'
my_df = pd.read_csv(MY_COLLECTION_URL)
my_df.index
Index(['8092-1', '10221-1', '10266-1', '10281-1', '10283-1', '21309-1',
       '21312-1', '21320-1', '21321-1', '31091-1', '40174-1', '40268-1',
       '40391-1', '40431-1', '40440-1', '41602-1', '41608-1', '41609-1',
       '41628-1', '75030-1', '75049-1', '75074-1', '75075-1', '75093-1',
       '75099-1', '75136-1', '75137-1', '75138-1', '75162-1', '75176-1',
       '75187-1', '75229-1', '75243-1', '75244-1', '75248-1', '75254-1',
       '75255-1', '75263-1', '75264-1', '75266-1', '75267-1', '75269-1',
       '75273-1', '75277-1', '75278-1', '75283-1', '75292-1', '75297-1',
       '75302-1', '75306-1', '75308-1', '75317-1', '75318-1'],
      dtype='object')
full_df["austin"] = (full_df.index
                            .get_level_values("Set number")
                            .isin(my_df.index))

Since the data spans more than 40 years, it is important to adjust for inflation. We use the Consumer Price Index for All Urban Consumers: All Items in U.S. City Average from the U.S. Federal Reserve to adjust for inflation.

CPI_URL = 'https://austinrochford.com/resources/lego/CPIAUCNS202100401.csv'
years = pd.date_range('1979-01-01', '2021-01-01', freq='Y') \
            + datetime.timedelta(days=1)
cpi_df = (pd.read_csv(CPI_URL, index_col="DATE", parse_dates=["DATE"])
            .loc[years])
cpi_df["to2021"] = cpi_df.loc["2021-01-01"] / cpi_df

We now add a column RRP2021, which is RRP adjusted to 2021 dollars.

full_df["RRP2021"] = (pd.merge(full_df, cpi_df,
                               left_on=["Year released"],
                               right_index=True)
                        .apply(lambda df: df["RRP"] * df["to2021"],
                               axis=1))

Based on the exploratory data analysis in the previous post, we filter full_df down to approximately 6,300 sets to be included in our analysis.

FILTERS = [
    full_df["Set type"] == "Normal",
    full_df["Pieces"] > 10,
    full_df["Theme"] != "Duplo",
    full_df["Theme"] != "Service Packs",
    full_df["Theme"] != "Bulk Bricks",
    full_df["RRP"] > 0
]
df = full_df[reduce(np.logical_and, FILTERS)].copy()
df.head()
Name Set type Theme Pieces RRP Subtheme austin RRP2021
Year released Set number
1980-01-01 1041-2 Educational Duplo Building Set Normal Dacta 68.0 36.50 NaN False 122.721632
1075-1 LEGO People Supplementary Set Normal Dacta 304.0 14.50 NaN False 48.752429
5233-1 Bedroom Normal Homemaker 26.0 4.50 NaN False 15.130064
6305-1 Trees and Flowers Normal Town 12.0 3.75 Accessories False 12.608387
6306-1 Road Signs Normal Town 12.0 2.50 Accessories False 8.405591


df.describe()
Pieces RRP RRP2021
count 6312.000000 6312.000000 6312.000000
mean 337.177440 37.038246 45.795998
std 535.619271 49.657704 58.952563
min 11.000000 0.600000 0.971220
25% 68.000000 9.990000 11.866173
50% 177.000000 19.990000 26.712319
75% 400.000000 49.500000 55.952471
max 11695.000000 799.990000 897.373477


Modeling

Upon plotting the data, a linear relationship between the logarithm of pieces and the logarithm of reccomended retail price (in 2021 dollars) becomes immediately apparent.

ax = sns.scatterplot(x="Pieces", y="RRP2021", data=df,
                alpha=0.05);
sns.scatterplot(x="Pieces", y="RRP2021",
                data=df[df["austin"] == True],
                label="My sets", ax=ax); 

ax.set_xscale('log');

ax.set_yscale('log');
ax.set_ylabel("Retail price (2021 $)");

ax.legend(loc='upper left');
png

Pooled model

We start with a very simple pooled linear model of log price versus log pieces. Let \(Y_i\) denote the recommended retail price of the \(i\)-th set and \(y_i = \log Y_i\). Similarly, let \(X_i\) denote the number of pieces in the \(i\)-th set and \(x_i = \log X_i\). Due to the log-log linear relationship between \(Y_i\) and \(X_i\), our models will take the form

\[Y_i = f_i(X_i) \cdot \varepsilon_i,\]

where \(\varepsilon_i\) accounts for the unmeasured noise in the price of the \(i\)-th set. We place a log-normal prior on \(\varepsilon_i\) so that $ (_i) = 1$ and \(\textrm{Var}\left(\log(\varepsilon_j)\right) = \sigma_i^2.\) standard distributional math shows that

\[y_i \sim N\left(\log\left(f_i(X_i))\right) - \frac{\sigma_i^2}{2}, \sigma_i^2\right).\]

Each model is of the form

\[f_i(X_i) = b_{0, i} \cdot X_i^{\beta_i},\]

so

\[\log\left(f_i(X_i)\right) = \beta_{0, i} + \beta_i x_i,\]

where \(\beta_{0, i} = \log b_{0, i}\).

We define log_pieces and log_rrp2021 which correspond to \(x_i\) and \(y_i\) respectively.

pieces = df["Pieces"].values
log_pieces = np.log(df["Pieces"].values)

rrp2021 = df["RRP2021"].values
log_rrp2021 = np.log(rrp2021)

For computational efficiency, we will standardize \(x_i\) to produce \(\tilde{x}_i\) which has mean zero and variance one.

scaler = StandardScaler().fit(log_pieces[:, np.newaxis])

def scale_log_pieces(log_pieces, scaler=scaler):
    return scaler.transform(log_pieces[:, np.newaxis])[:, 0]

std_log_pieces = scale_log_pieces(log_pieces)
std_log_pieces_ = shared(std_log_pieces)

Here definining std_log_pieces_ as an aesara shared variable will allow us to change its value later for posterior predictive sampling.

Our models are therefore given by

\[\log\left(f_i(X_i)\right) = \beta_{0, i} + \beta_i \tilde{x}_i.\]

For the pooled model, \(\beta_{0, i} = \beta_0\), \(\beta_i = \beta\), and \(\sigma_i = \sigma\) are constant across all sets. We give them the priors

\[ \begin{align*} \beta_0 & \sim N(0, 2.5^2) \\ \beta & \sim N(0, 2.5^2) \\ \sigma & \sim \textrm{HalfNormal}(5^2). \end{align*} \]

with pm.Model() as pooled_model:
    β0 = pm.Normal("β0", 0., 2.5)
    β_pieces = pm.Normal("β_pieces", 0., 2.5)
    
    σ = pm.HalfNormal("σ", 5.)

We define the mean and likelihood of the observed data as above.

with pooled_model:
    μ = β0 + β_pieces * std_log_pieces_ - 0.5 * σ**2
    obs = pm.Normal("obs", μ, σ, observed=log_rrp2021)

We now sample from posterior distribution of this model.

CHAINS = 3
SEED = 123456789

SAMPLE_KWARGS = {
    'cores': CHAINS,
    'random_seed': [SEED + i for i in range(CHAINS)],
    'return_inferencedata': True
}
with pooled_model:
    pooled_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [β0, β_pieces, σ]

100.00% [6000/6000 00:15<00:00 Sampling 3 chains, 0 divergences]

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 18 seconds.

The energy plot, BFMI, and \(\hat{R}\) statistics for these samples show no cause for concern.

def make_diagnostic_plots(trace, axes=None, min_mult=0.995, max_mult=1.005):
    if axes is None:
        fig, axes = plt.subplots(ncols=2, sharex=False, sharey=False,
                                 figsize=(16, 6))
        
    az.plot_energy(trace, ax=axes[0])
    
    
    rhat = az.rhat(trace).max()
    axes[1].barh(np.arange(len(rhat.variables)), rhat.to_array(),
                 tick_label=list(rhat.variables.keys()))
    axes[1].axvline(1, c='k', ls='--')

    axes[1].set_xlim(
        min_mult * min(rhat.min().to_array().min(), 1),
        max_mult * max(rhat.max().to_array().max(), 1)
    )
    axes[1].set_xlabel(r"$\hat{R}$")

    axes[1].set_ylabel("Variable")
    
    return fig, axes
make_diagnostic_plots(pooled_trace);
png

We sample from the posterior predictive distribution at the observed data points and plot the residuals.

with pooled_model:
    pp_trace_data = pm.sample_posterior_predictive(pooled_trace)
    
df["pooled_resid"] = (log_rrp2021 - pp_trace_data["obs"]).mean(axis=0)

100.00% [3000/3000 00:02<00:00]

ax = sns.scatterplot(x="Pieces", y="pooled_resid", data=df,
                     alpha=0.1)

ax.set_xscale('log');
ax.set_xlabel("Pieces");

ax.set_ylabel("Log RRP residual (2021 $)");

ax.set_title("Pooled model");
png

We will investigate these residuals further shortly. First we sample from the posterior predictive distribution along a grid spanning reasonable values of pieces and plot the resulting predictions.

pp_pieces = np.logspace(1, np.log10(pieces.max()))
pp_log_pieces = np.log(pp_pieces)
pp_std_log_pieces = scale_log_pieces(pp_log_pieces)
std_log_pieces_.set_value(pp_std_log_pieces)
with pooled_model:
    pooled_pp_trace = pm.sample_posterior_predictive(pooled_trace)

100.00% [3000/3000 00:01<00:00]

ax = sns.scatterplot(x="Pieces", y="RRP2021", data=df,
                alpha=0.05);
sns.scatterplot(x="Pieces", y="RRP2021",
                data=df[df["austin"] == True],
                alpha=0.5, label="My sets",
                ax=ax);

pp_low, pp_high = np.percentile(
    np.exp(pooled_pp_trace["obs"]),
    [2.5, 97.4],
    axis=0
)
ax.plot(pp_pieces, np.exp(pooled_pp_trace["obs"]).mean(axis=0),
        c='k', label="Posterior predicted mean");
ax.fill_between(pp_pieces, pp_low, pp_high,
                color='k', alpha=0.15, zorder=-1,
                label="95% posterior interval");

ax.set_xscale('log');

ax.set_yscale('log');
ax.set_ylabel("Retail price (2021 $)");

ax.legend(loc='upper left');
png

The predictions seem reasonable. We now plot the posterior distribution of the regression coefficients.

az.plot_posterior(pooled_trace, var_names=["β0", "β_pieces"]);
png

When we break down the residuals for a few notable themes, we see that Star Wars and Disney sets have larger residuals than Creator sets with the corresponding number of pieces.

ax = sns.scatterplot(x="Pieces", y="pooled_resid", data=df,
                     alpha=0.1)

for theme in ["Star Wars", "Creator", "Disney"]:
    sns.scatterplot(x="Pieces", y="pooled_resid",
                    data=df[df["Theme"] == theme],
                    alpha=0.5, label=f"{theme} sets",
                    ax=ax);

ax.set_xscale('log');
ax.set_xlabel("Pieces");

ax.set_ylabel("Log RRP residual (2021 $)");

ax.set_title("Pooled model");
png

This difference in the distribution of residuals across themes suggests that we introduce theme-specific factors into our model.

Unpooled model

In order to introduce theme-specific factors, we create indicator variables for each theme.

theme_id, theme_map = df["Theme"].factorize(sort=True)
n_theme = theme_map.size

theme_id_ = shared(theme_id)

The simplest model that includes theme-specific factors is an unpooled model that consists of independent regressions per theme. With \(j(i)\) denoting the index of the theme of the \(i\)-th set, we have the model

\[\log\left(f_i(X_i)\right) = \beta_{0, j(i)} + \beta_{j(i)} \tilde{x}_i\]

with the i.i.d. priors

\[ \begin{align*} \beta_{0, 1}, \ldots \beta_{0, n_j} & \sim N(0, 2.5^2) \\ \beta_1, \ldots \beta_{n_j} & \sim N(0, 2.5^2) \\ \sigma_1, \ldots, \sigma_{n_j} & \sim \textrm{HalfNormal}(5^2). \end{align*} \]

std_log_pieces_.set_value(std_log_pieces)
with pm.Model() as unpooled_model:
    β0 = pm.Normal("β0", 0., 2.5, shape=n_theme)
    β_pieces = pm.Normal("β_pieces", 0., 2.5, shape=n_theme)
    
    σ = pm.HalfNormal("σ", 5., shape=n_theme)
    μ = β0[theme_id_] + β_pieces[theme_id_] * std_log_pieces_ \
            - 0.5 * σ[theme_id_]**2
    
    obs = pm.Normal("obs", μ, σ[theme_id_], observed=log_rrp2021)

We don’t actually sample from this model because it is underdetermined for many themes that have few sets associated with them, as the following plot shows.

ax = (df["Theme"]
        .value_counts()
        .nsmallest(n=20)
        .plot.barh())

ax.set_xlabel("Number of sets");
ax.set_ylabel("Theme");
png

There are five themes with only a single set, and a number of themes with fewer than five sets total. Regression models with only a single data point are underdetermined, and those with few observations are likely to have wide error bars. Instead of imposing a minimum number of sets per theme and then using an unpooled model, we will use a partially pooled model to share some information across themes/regularize while allowing themes with many sets to deviate from the average if the data indicate that they should.

Partially pooled model

One model that shares information across themes is called “partially pooled” (it may also be called multilevel or hierarchical). In the partially pooled model, the intercept and piece coefficient for each theme are drawn not from i.i.d normal distributions but from a normal distribution with a common mean and variance. This shared hyperprior distribution regularizes the theme intercepts and coefficients, shrinking those for themes with fewer sets towards the pooled values.

Conceptually, we use the priors

\[ \begin{align*} \mu_{\beta_0} & \sim N(0, 2.5^2) \\ \sigma_{\beta_0} & \sim \textrm{HalfNormal}(2.5^2) \\ \beta_{0, 1}, \ldots, \beta_{0, n_j} & \sim N\left(\mu_{\beta_0}, \sigma_{\beta_0}^2\right) \end{align*} \]

and similarly for \(\beta\). In reality, we use a non-centered parametrization parameterization that is mathematically equivalent but often more computationally efficient.

def hierarchical_normal(name, shape, μ=None):
    if μ is None:
        μ = pm.Normal(f"μ_{name}", 0., 2.5)

    Δ = pm.Normal(f"Δ_{name}", 0., 1., shape=shape)
    σ = pm.HalfNormal(f"σ_{name}", 2.5)
    
    return pm.Deterministic(name, μ + Δ * σ)
with pm.Model() as partial_model:
    β0 = hierarchical_normal("β0", n_theme)
    β_pieces = hierarchical_normal("β_pieces", n_theme)
    
    σ = pm.HalfNormal("σ", 5.)
    μ = β0[theme_id_] + β_pieces[theme_id_] * std_log_pieces_ \
            - 0.5 * σ**2
    
    obs = pm.Normal("obs", μ, σ, observed=log_rrp2021)
with partial_model:
    partial_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [μ_β0, Δ_β0, σ_β0, μ_β_pieces, Δ_β_pieces, σ_β_pieces, σ]

100.00% [6000/6000 03:15<00:00 Sampling 3 chains, 0 divergences]

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 196 seconds.
The estimated number of effective samples is smaller than 200 for some parameters.

Again our diagnostics show no cause for concern.

make_diagnostic_plots(partial_trace);
png

After calculating the residuals for the partially pooled model we see that they tend to be closer to zero and the diagonal bands are less pronounced.

with partial_model:
    pp_trace_data = pm.sample_posterior_predictive(partial_trace)
    
df["partial_resid"] = (log_rrp2021 - pp_trace_data["obs"]).mean(axis=0)

100.00% [3000/3000 00:02<00:00]

fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True,
                         figsize=(14, 6))

sns.scatterplot(x="Pieces", y="pooled_resid", data=df,
                alpha=0.1, ax=axes[0]);

axes[0].set_xscale('log');
axes[0].set_xlabel("Pieces");

axes[0].set_ylabel("Log RRP residual (2021 $)");

axes[0].set_title("Pooled model");

sns.scatterplot(x="Pieces", y="partial_resid", data=df,
                alpha=0.1, ax=axes[1]);

axes[1].set_xscale('log');
axes[1].set_xlabel("Pieces");

axes[1].set_ylabel("Log RRP residual (2021 $)");

axes[1].set_title("Partially pooled model");

fig.tight_layout();
png

We also see that the residuals for the example themes (Star Wars, Creator, and Disney) are more comingled than for the pooled model.

fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True,
                         figsize=(14, 6))

sns.scatterplot(x="Pieces", y="pooled_resid", data=df,
                alpha=0.1, ax=axes[0]);

for theme in ["Star Wars", "Creator", "Disney"]:
    sns.scatterplot(x="Pieces", y="pooled_resid",
                    data=df[df["Theme"] == theme],
                    alpha=0.5, label=f"{theme} sets",
                    ax=axes[0]);

axes[0].set_xscale('log');
axes[0].set_xlabel("Pieces");

axes[0].set_ylabel("Log RRP residual (2021 $)");

axes[0].set_title("Pooled model");

sns.scatterplot(x="Pieces", y="partial_resid", data=df,
                alpha=0.1, ax=axes[1]);

for theme in ["Star Wars", "Creator", "Disney"]:
    sns.scatterplot(x="Pieces", y="partial_resid",
                    data=df[df["Theme"] == theme],
                    alpha=0.5, label=f"{theme} sets",
                    ax=axes[1]);

axes[1].set_xscale('log');
axes[1].set_xlabel("Pieces");

axes[1].set_ylabel("Log RRP residual (2021 $)");

axes[1].set_title("Partially pooloed model");

fig.tight_layout();
png

Now that we have posterior samples from two models, we use Pareto-smoothed importance sampling leave-one-out cross validation1

traces = {
    "Unpooled": pooled_trace,
    "Partially pooled": partial_trace
}
comp_df = az.compare(traces)
comp_df.loc[:, :"weight"]
rank loo p_loo d_loo weight
Partially pooled 0 -2315.578245 190.594364 0.000000 0.99951
Unpooled 1 -5133.311893 4.871287 2817.733648 0.00049


fig, ax = plt.subplots(figsize=(12, 6))

az.plot_compare(comp_df, plot_ic_diff=False, ax=ax);
png

The table and chart show that by sharing some information across themes while also incorporating their varying effects the partially pooled model has substantially better predictive accuracy than the pooled model (higher values of loo correspond to better predictive ability).

In order to compare the posterior predictions of the pooled and partially pooled model directly, we need to define a grid over theme/standardized log piece combinations.

pp_df = pd.DataFrame.from_records(
    itl.product(range(n_theme), pp_std_log_pieces),
    columns=["theme_id", "std_log_pieces"]
)
pp_df["theme"] = theme_map[pp_df["theme_id"].values]
pp_df["pieces"] = np.exp(
    scaler.inverse_transform(pp_df["std_log_pieces"]
                                  .values[:, np.newaxis])
          .ravel()
)
pp_df.head()
theme_id std_log_pieces theme pieces
0 0 -2.373432 4 Juniors 10.000000
1 0 -2.252260 4 Juniors 11.550805
2 0 -2.131088 4 Juniors 13.342109
3 0 -2.009916 4 Juniors 15.411210
4 0 -1.888744 4 Juniors 17.801188


pp_df.tail()
theme_id std_log_pieces theme pieces
6545 130 3.079305 Znap 6569.786072
6546 130 3.200477 Znap 7588.631734
6547 130 3.321649 Znap 8765.480484
6548 130 3.442820 Znap 10124.835517
6549 130 3.563992 Znap 11695.000000
std_log_pieces_.set_value(pp_df["std_log_pieces"].values)
theme_id_.set_value(pp_df["theme_id"].values)

We now sample from the posterior predictive distribution and compare the predictions to those from the pooled model.

with partial_model:
    partial_pp_trace = pm.sample_posterior_predictive(partial_trace)

100.00% [3000/3000 00:02<00:00]

pp_df["pp_partial_mean"] = np.exp(partial_pp_trace["obs"]).mean(axis=0)
pp_df["pp_partial_low"], pp_df["pp_partial_high"] = np.percentile(
    np.exp(partial_pp_trace["obs"]),
    [2.5, 97.5],
    axis=0
)
pp_df.head()
theme_id std_log_pieces theme pieces pp_partial_mean pp_partial_low pp_partial_high
0 0 -2.373432 4 Juniors 10.000000 5.214650 2.380664 10.484612
1 0 -2.252260 4 Juniors 11.550805 5.917360 2.668467 11.413480
2 0 -2.131088 4 Juniors 13.342109 6.864103 3.045283 13.610948
3 0 -2.009916 4 Juniors 15.411210 7.653877 3.515344 14.966362
4 0 -1.888744 4 Juniors 17.801188 8.679962 4.009006 16.935274


ax = sns.scatterplot(x="Pieces", y="RRP2021", data=df,
                alpha=0.05);

(pp_df.groupby("theme_id")
      .plot("pieces", "pp_partial_mean",
            c='k', alpha=0.1, label='_nolegend_',
            ax=ax));
ax.plot(pp_pieces, np.exp(pooled_pp_trace["obs"]).mean(axis=0),
        c='r', lw=2, label="Pooled");

ax.set_xscale('log');

ax.set_yscale('log');
ax.set_ylabel("Retail price (2021 $)");

handles, _ = ax.get_legend_handles_labels()
partial_line = Line2D([0], [0], color='k',
                      label="Partially pooled")
handles.append(partial_line)

ax.legend(handles=handles,
          title="Posterior predicted mean",
          loc='upper left');
png

Note that the partial pooling model produces one line per theme, with varying slopes and intercepts. We can see that while the predictions for some themes are quite far away from the pooled predictions most are close to them, as shown by the stacking of the semi-transparent lines. Below we highlight the partially pooled predictions for a few themes of interest among all partially pooled predictions.

PLOT_THEMES = [
    "Creator Expert",
    "Disney",
    "Star Wars",
    "Harry Potter",
    "Marvel Super Heroes",
    "Ninjago",
    "City",
    "Space",
    "Jurassic World"
]
ax = sns.scatterplot(x="Pieces", y="RRP2021", data=df,
                alpha=0.05);

(pp_df.groupby("theme_id")
      .plot("pieces", "pp_partial_mean",
            c='k', alpha=0.1, label='_nolegend_',
            ax=ax));
(pp_df[pp_df["theme"].isin(PLOT_THEMES)]
      .groupby("theme_id")
      .plot("pieces", "pp_partial_mean",
            c='r', label='_nolegend_',
            ax=ax));

ax.set_xscale('log');

ax.set_yscale('log');
ax.set_ylabel("Retail price (2021 $)");

title = "Highlighted: " \
            + ", ".join(PLOT_THEMES[:4]) + ",\n" \
            + ", ".join(PLOT_THEMES[4:-1]) \
            + ", and " + PLOT_THEMES[-1]
ax.set_title(title);
png

These notable themes have preditions close to the global average, which makes sense since they all have a fair number of sets, so the global mean is pulled towards their theme effects.

It has long been known in the quantitative social sciences that when naively modeled (as we have) that group-level effects (in this case theme-level effects), can strongly correlate with item-level covariates (in this case standardized log pieces). We now investigate whether our partially pooled model results in such a correlation.

First we calculate the average value of std_log_pieces within each theme.

theme_mean_std_log_pieces = (pd.Series(std_log_pieces, index=df.index)
                               .groupby(df["Theme"])
                               .mean())
ax = theme_mean_std_log_pieces.plot.hist(bins=20, lw=0., alpha=0.5)

ax.set_xlabel("Average standardized log pieces\nper set within theme");
png

We see that these theme averages vary fairly widely. The following plots show the relationship between each theme’s average standardized log pieces and the posterior expected values of \(\beta_{0, j}\) and \(\beta_j\) respectively.

post_β0 = partial_trace["posterior"]["β0"].mean(dim=('chain', 'draw'))
post_β_pieces = partial_trace["posterior"]["β_pieces"].mean(dim=('chain', 'draw'))
fig, (β0_ax, β_pieces_ax) = plt.subplots( 
    ncols=2, sharex=True, sharey=False,
    figsize=(15, 6)
)

β0_ax.scatter(theme_mean_std_log_pieces, post_β0);

β0_ax.set_xlabel("Average standardized log pieces\nper set within theme");
β0_ax.set_ylabel("Posterior expected\nvalue of $\\beta_0$");
 
β0_corr, β0_corr_p_val = sp.stats.pearsonr(
    theme_mean_std_log_pieces.values,
    post_β0
)
β0_ax.set_title(f"Correlation coefficient = {β0_corr:0.2f}\n"
                f"Two-tailed p-value {β0_corr_p_val:.2e}");

β_pieces_ax.scatter(theme_mean_std_log_pieces, post_β_pieces);

β_pieces_ax.set_xlabel("Average standardized log pieces\nper set within theme");
β_pieces_ax.set_ylabel("Posterior expected\nvalue of $\\beta_{pieces}$");

β_pieces_corr, β_pieces_corr_p_val = sp.stats.pearsonr(
    theme_mean_std_log_pieces.values,
    post_β_pieces
)
β_pieces_ax.set_title(f"Correlation coefficient = _pieces_corr:0.2f}\n"
                      f"Two-tailed p-value _pieces_corr_p_val:.2e}");
png

Both plots show highly significant correlations, with a stronger correlation between the theme average and the posterior expected average of \(\beta_{0, j}\). These correlations lead to violations of the Gauss-Markov assumptions in our regression design, so we remedy them with an expanded model in the next section.

Partially pooled mean model

Fortunately Bafumi and Gelman2 show how to remedy this violation by including the theme-level average standardized log pieces as a covariate in \(\beta_{0, j}\) and \(\beta_j\). if we denote the average standardized log pieces for sets in the \(j\)-th theme as \(\bar{\tilde{x}}_j\), in this extended model, \(\beta_{0, j}\) is defined as

\[ \begin{align*} \mu_{\beta_0}, \gamma_{\beta, 0} & \sim N(0, 2.5^2) \\ \sigma_{\beta_0} & \sim \textrm{HalfNormal}(2.5^2) \\ \beta_{0, j} & \sim N\left(\mu_{\beta_0} + \gamma_{\beta, 0} \cdot \bar{\tilde{x}}_j, \sigma_{\beta_0}^2\right), \end{align*} \]

and similarly for \(\beta_j\).

def hierarchical_normal_with_mean(name, x_mean, shape, μ=None):
    γ = pm.Normal(f"γ_{name}", 0., 2.5)
        
    return hierarchical_normal(name, shape, μ=μ) + γ * x_mean
with pm.Model() as mean_model:
    σ = pm.HalfNormal("σ", 5.)
    
    β0 = hierarchical_normal_with_mean(
        "β0", at.constant(theme_mean_std_log_pieces),
        n_theme
    )
    β_pieces = hierarchical_normal_with_mean(
        "β_pieces", at.constant(theme_mean_std_log_pieces),
        n_theme
    )
    
    μ = β0[theme_id_] + β_pieces[theme_id_] * std_log_pieces_ \
            - 0.5 * σ**2
    
    obs = pm.Normal("obs", μ, σ, observed=log_rrp2021)

We now sample from this model.

std_log_pieces_.set_value(std_log_pieces)
theme_id_.set_value(theme_id)
with mean_model:
    mean_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [σ, γ_β0, μ_β0, Δ_β0, σ_β0, γ_β_pieces, μ_β_pieces, Δ_β_pieces, σ_β_pieces]

100.00% [6000/6000 03:57<00:00 Sampling 3 chains, 0 divergences]

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 239 seconds.
The number of effective samples is smaller than 10% for some parameters.

Once again the sampling diagnostics show no cause for concern.

make_diagnostic_plots(mean_trace);
png

The correlations from above have disappeared in this model.

post_β0 = mean_trace["posterior"]["β0"].mean(dim=('chain', 'draw'))
post_β_pieces = mean_trace["posterior"]["β_pieces"].mean(dim=('chain', 'draw'))
fig, (β0_ax, β_pieces_ax) = plt.subplots( 
    ncols=2, sharex=True, sharey=False,
    figsize=(15, 6)
)

β0_ax.scatter(theme_mean_std_log_pieces, post_β0);

β0_ax.set_xlabel("Average standardized log pieces\nper set within theme");
β0_ax.set_ylabel("Posterior expected\nvalue of $\\beta_0$");
 
β0_corr, β0_corr_p_val = sp.stats.pearsonr(
    theme_mean_std_log_pieces.values,
    post_β0
)
β0_ax.set_title(f"Correlation coefficient = {β0_corr:0.2f}\n"
                f"Two-tailed p-value {β0_corr_p_val:.2f}");

β_pieces_ax.scatter(theme_mean_std_log_pieces, post_β_pieces);

β_pieces_ax.set_xlabel("Average standardized log pieces\nper set within theme");
β_pieces_ax.set_ylabel("Posterior expected\nvalue of $\\beta_{pieces}$");

β_pieces_corr, β_pieces_corr_p_val = sp.stats.pearsonr(
    theme_mean_std_log_pieces.values,
    post_β_pieces
)
β_pieces_ax.set_title(f"Correlation coefficient = _pieces_corr:0.2f}\n"
                      f"Two-tailed p-value _pieces_corr_p_val:.2f}");
png

We add this model to our comparison.

traces["Partially pooled with mean"] = mean_trace
comp_df = az.compare(traces)
comp_df.loc[:, :"weight"]
rank loo p_loo d_loo weight
Partially pooled 0 -2315.578245 190.594364 0.000000 0.958583
Partially pooled with mean 1 -2316.840059 187.512592 1.261814 0.000000
Unpooled 2 -5133.311893 4.871287 2817.733648 0.041417


fig, ax = plt.subplots(figsize=(12, 6))

az.plot_compare(comp_df, plot_ic_diff=False, ax=ax);
png

We see that the the PSIS-LOO scores of both partially pooled model are better than the pooled model and statistically indistinguishable from each other. Given these scores, we will work with the model that accounts for the theme’s average standardized log pieces going forward, since it satisfies the Gauss-Markov assumptions better.

The residuals from the two partially pooled models are also visually quite similar, though the second partial pooling model does have less pronounced diagonal bands.

with mean_model:
    pp_trace_data = pm.sample_posterior_predictive(mean_trace)

df["mean_resid"] = (log_rrp2021 - pp_trace_data["obs"]).mean(axis=0)

100.00% [3000/3000 00:01<00:00]

fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True,
                         figsize=(14, 6))

sns.scatterplot(x="Pieces", y="pooled_resid", data=df,
                alpha=0.1, ax=axes[0]);

axes[0].set_xscale('log');
axes[0].set_xlabel("Pieces");

axes[0].set_ylabel("Log RRP residual (2021 $)");

axes[0].set_title("Partially pooled model");

sns.scatterplot(x="Pieces", y="mean_resid", data=df,
                alpha=0.1, ax=axes[1]);

axes[1].set_xscale('log');
axes[1].set_xlabel("Pieces");

axes[1].set_ylabel("Log RRP residual (2021 $)");

axes[1].set_title("Partially pooled mean model");

fig.tight_layout();
png

Interestingly, there is more variation in the varying intercepts (\(\beta_{0, j}\)) than the varying slopes (\(\beta_j\)), since \(\sigma_{\beta_0}\) has a larger posterior expected value than \(\sigma_{\beta}\).

az.plot_posterior(mean_trace,
                  var_names=[
                      "μ_β0", "μ_β_pieces",
                      "σ_β0", "σ_β_pieces"
                  ],
                  grid=(2, 2));
png

Plotting the residuals of log RRP in 2021 dollars versus time shows that the residuals for this model are significantly larger for sets between 1980 and 1995. This phenomenon makes sense as Lego has released many more sets per year after 1995 than in the initial period, so the estimates are dominated by the data in this later period.

ax = (df.groupby(level="Year released")
        ["mean_resid"]
        .mean()
        .plot(c='k', label="Year average"));

strip_ax = ax.twiny()
sns.stripplot(x="Year released", y="mean_resid",
              data=df.reset_index(),
              jitter=1.5,
              color='C0', alpha=0.1,
              ax=strip_ax);

strip_ax.set_axis_off();

ax.set_ylim(-2, 2);
ax.set_ylabel("Log RRP residual (2021 $)");

ax.legend(loc='upper left');
png

Time-varying model

We address this issue with the residuals by adding a time-varying component to \(\beta_0\). First we build t, which indicates the number of years after the data begins in 1980 that the set was released.

year = df.index.get_level_values("Year released").year.values
t = year - year.min()
t_ = shared(t)
n_year = int(t.max() + 1)

The prior on \(\beta_j\) stays the same as in the mean model, but we add a Gaussian random walk term, \(\beta_{0, t}\) to the intercept, \(\beta_0\). We use the priors

\[ \begin{align*} \mu_{\beta_{0, pieces}}, \gamma_{\beta_{0, pieces}} & \sim N(0, 2.5^2) \\ \sigma_{\beta_{0, pieces}} & \sim \textrm{HalfNormal}(2.5^2) \\ \Delta_{\beta_{0, i}} & \sim N(0, 0.25^2) \\ \beta_{0, t} & \sim \sum_{i = 1}^t \Delta_{\beta_{0, i}} \\ \beta_{0, j, t} & \sim N\left(\mu_{\beta_0} + \beta_{0, t} + \gamma_{\beta, 0} \cdot \bar{\tilde{x}}_j, \sigma_{\beta_0}^2\right). \end{align*} \]

We see that the time-varying factor \(\beta_{0, t}\) is constant across themes in order to reduce the computational burden when sampling from this model.

with pm.Model() as time_model:
    Δ_β0_t = pm.Normal("Δ_β0_t", 0., 0.25, shape=n_year)
    β0_t = pm.Deterministic("β0_t", Δ_β0_t.cumsum())
    
    β0_theme = hierarchical_normal_with_mean(
        "β0_theme", at.constant(theme_mean_std_log_pieces),
        n_theme
    )
    β0_i = β0_t[t_] + β0_theme[theme_id_]

    β_pieces = hierarchical_normal_with_mean(
        "β_pieces", at.constant(theme_mean_std_log_pieces),
        n_theme
    )
    
    σ = pm.HalfNormal("σ", 5.)
    μ = β0_i + β_pieces[theme_id_] * std_log_pieces_ - 0.5 * σ**2
    
    obs = pm.Normal("obs", μ, σ, observed=log_rrp2021)

We now sample from this model.

with time_model:
    time_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [Δ_β0_t, γ_β0_theme, μ_β0_theme, Δ_β0_theme, σ_β0_theme, γ_β_pieces, μ_β_pieces, Δ_β_pieces, σ_β_pieces, σ]

100.00% [6000/6000 17:36<00:00 Sampling 3 chains, 0 divergences]

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 1056 seconds.
The number of effective samples is smaller than 25% for some parameters.

Once again the sampling diagnostics show no cause for concern.

make_diagnostic_plots(time_trace);
png

We see that the time-varying model performs noticeably better than the other models in terms of PSIS-LOO score.

traces["Partially pooled with mean and time"] = time_trace
comp_df = az.compare(traces)
comp_df.loc[:, :"weight"]
rank loo p_loo d_loo weight
Partially pooled with mean and time 0 -2058.373203 245.239325 0.000000 9.205189e-01
Partially pooled 1 -2315.578245 190.594364 257.205042 4.565283e-02
Partially pooled with mean 2 -2316.840059 187.512592 258.466855 1.857221e-12
Unpooled 3 -5133.311893 4.871287 3074.938690 3.382830e-02


fig, ax = plt.subplots(figsize=(12, 6))

az.plot_compare(comp_df, plot_ic_diff=False, ax=ax);
png

The residuals for this model are similar to those of the previous one when plotted against the number of pieces in the set.

with time_model:
    pp_time_data = pm.sample_posterior_predictive(time_trace)

df["time_resid"] = (log_rrp2021 - pp_time_data["obs"]).mean(axis=0)

100.00% [3000/3000 00:02<00:00]

fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True,
                         figsize=(14, 6))

sns.scatterplot(x="Pieces", y="mean_resid", data=df,
                alpha=0.1, ax=axes[0]);

axes[0].set_xscale('log');
axes[0].set_xlabel("Pieces");

axes[0].set_ylabel("Log RRP residual (2021 $)");

axes[0].set_title("Partially pooled mean model");

sns.scatterplot(x="Pieces", y="time_resid", data=df,
                alpha=0.1, ax=axes[1]);

axes[1].set_xscale('log');
axes[1].set_xlabel("Pieces");

axes[1].set_ylabel("Log RRP residual (2021 $)");

axes[1].set_title("Partially pooled mean model with time");

fig.tight_layout();
png

The following plot shows that this model has significantly reduced the amount of time-dependence in the residuals.

ax = (df.groupby(level="Year released")
        ["time_resid"]
        .mean()
        .plot(c='k', label="Year average"));

strip_ax = ax.twiny()
sns.stripplot(x="Year released", y="time_resid",
              data=df.reset_index(),
              jitter=1.5,
              color='C0', alpha=0.1,
              ax=strip_ax);

strip_ax.set_axis_off();

ax.set_ylim(-2, 2);
ax.set_ylabel("Log RRP residual (2021 $)");

ax.legend(loc='upper left');
png

Again, there is more variation in the varying intercepts (\(\beta_{0, j}\)) than the varying slopes (\(\beta_j\)), since \(\sigma_{\beta_0}\) has a larger posterior expected value than \(\sigma_{\beta}\).

az.plot_posterior(time_trace,
                  var_names=[
                      "μ_β0_theme", "μ_β_pieces",
                      "σ_β0_theme", "σ_β_pieces"
                  ],
                  grid=(2, 2));
png

We now revisit the question of whether or not Darth Vader’s Meditation Chamber is fairly priced. In the previous post in this series, we compared the price per piece of this set to all sets, all Star Wars sets, and all sets currently in my collection. We found that it was in or near the first quartile of price per piece among all sets and Star Wars sets, and near the median priece per piece in my collection. With this model in hand, we can ask where the adjusted recommended retail price falls in the predicted distribution of posterior predicted prices for a Star Wars set with 663 pieces.

VADER_MEDITATION = "75296-1"

vader_info = df.xs(VADER_MEDITATION, level="Set number")
vader_iloc = (df.index
                .get_level_values("Set number")
                .isin([VADER_MEDITATION])
                .argmax())
(vader_info["RRP2021"].values[0] \
    >= np.exp(pp_trace_data["obs"][:, vader_iloc])).mean()
0.2816666666666667

The set falls in towards the bottom of the second quartile of predicted prices for such a set, which is consistent with the empirical price per piece analysis of the previous post. This model can also answer some counterfactual questions about my collection. We see that my sets are highly concentrated in the Star Wars theme, with a smattering of other themes represented as well.

(df[df["austin"] == True]
   ["Theme"]
   .value_counts(ascending=True)
   .plot.barh());
png

This preponderance of Star Wars sets is even stronger than it initially appears because all but one of my BrickHeadz sets are Star Wars characters. The single exception are the adorable German Shepherds (40440).

In particular, we can compare the actual cost of my collection to the predicted expected cost of my collection conditioned on the piece count and theme of the sets.

coll_value = (df[df["austin"] == True]
                ["RRP2021"]
                .sum())
coll_value
3206.3933684401236
(np.exp(pp_trace_data["obs"])
                     [:, df["austin"] == True]
                     .mean(axis=0)
                     .sum())
3826.537853977332

So my collection cost about 20% less than expected conditioned on piece count and theme. (This figure actually understates my frugality as I have received several sets as gifts, notably the massive Super Star Destroyer (10221).) We can easily check where my collection’s actual value falls in the posterior predictive distribution of its possible values.

pp_coll_value = (np.exp(pp_trace_data["obs"])
                   [:, df["austin"] == True]
                   .sum(axis=1))
ax, = (sns.displot(pp_coll_value, kind='kde',
                   height=6, aspect=8 / 6,
                   label="Posterior predictive distribution")
          .axes.flatten())

ax.axvline(coll_value, c='k', ls='--',
           label="Actual value");

ax.xaxis.set_major_formatter(dollar_formatter);
ax.set_xlabel("Collection value (2021 $)");

ax.set_yticks([]);

ax.legend(loc='upper right');
png
(coll_value > pp_coll_value).mean()
0.007

My collection’s actual value is in the first percentile of the posterior predictive distribution of possible collection values; according to this model, I really am quite cheap!

It is worth noting that in this post we have modeled log price directly, not price per piece as in the last model. Those familiar with offset terms in offsets in generalizedl linear models will notice that the model we have specified can be equivalent to an model of (log) price per piece if we fix \(\beta_j = 1\), since \(\log\left(\frac{x}{y}\right) = \log x - \log y.\) The following plot shows how far outside the high density area of the posterior distribution of \(\mu_{\beta_0}\) the value one is.

az.plot_posterior(time_trace, var_names=["μ_β0_theme"], ref_val=1.);
png

Due to this fact, we prefer to model (log) price directly over using an offset model that includes (log) piece count on both sides of the equal sign.

Future posts will explore further inferences from this model and develop extended models using other attributes of each set available via Brickset.

This post is available as a Jupyter notebook here.

%load_ext watermark
%watermark -n -u -v -iv
Last updated: Thu Jun 10 2021

Python implementation: CPython
Python version       : 3.8.8
IPython version      : 7.22.0

pandas    : 1.2.3
aesara    : 2.0.10
pymc3     : 4.0
scipy     : 1.6.2
seaborn   : 0.11.1
matplotlib: 3.4.1
arviz     : 0.11.2
numpy     : 1.20.2

  1. Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. (PSIS-LOO) to compare them.) Statistics and computing, 27(5), 1413-1432.

  2. Bafumi, J., & Gelman, A. E. (2006). Fitting multilevel models when predictors and group effects correlate.