A Fair Price for the Republic Gunship? A Bayesian Analysis of Ultimate Collector Series Prices in Python with PyMC3
Since my last post on modeling Lego prices, Lego has announced the release of 75309, an Ultimate Collector Series Republic Gunship from the Battle of Geonosis.
As with 75296, Darth Vader Meditation Chamber, we can ask whether the $349.99 recommended retail price of this set is fair given its number of pieces, Star Wars theme, and Ultimate Collector Series status. This post will update the model from the previous one to include random effects for subtheme (in Lego collector jargon, the theme of 75309 is “Star Wars” and the subtheme is “Ultimate Collector Series”) to answer this question. In addition, we will explore the effect of different themes and subthemes on set prices.
First we make some Python imports and do a bit of housekeeping.
%matplotlib inline
import datetime
from functools import reduce
from warnings import filterwarnings
from aesara import shared, tensor as at
import arviz as az
from matplotlib import MatplotlibDeprecationWarning, pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import numpy as np
import pandas as pd
import pymc3 as pm
from sklearn.preprocessing import StandardScaler
import seaborn as sns
You are running the v4 development version of PyMC3 which currently still lacks key features. You probably want to use the stable v3 instead which you can either install via conda or find on the v3 GitHub branch: https://github.com/pymc-devs/pymc3/tree/v3
filterwarnings('ignore', category=MatplotlibDeprecationWarning, module='pandas')
filterwarnings('ignore', category=UserWarning, module='aesara')
filterwarnings('ignore', category=UserWarning, module='arviz')
filterwarnings('ignore', category=UserWarning, module='pandas')

FIGSIZE = (8, 6)
plt.rcParams['figure.figsize'] = FIGSIZE

sns.set(color_codes=True)

dollar_formatter = StrMethodFormatter("${x:,.2f}")
Load the Data
We begin the real work by loading the data scraped from Brickset. See the first post in this series for more background on the data.
DATA_URL = 'https://austinrochford.com/resources/lego/brickset_01011980_06012021.csv.gz'

def to_datetime(year):
    return np.datetime64(f"{round(year)}-01-01")
full_df = (pd.read_csv(DATA_URL,
                       usecols=[
                           "Year released", "Set number",
                           "Name", "Set type", "Theme", "Subtheme",
                           "Pieces", "RRP"
                       ])
             .dropna(subset=[
                 "Year released", "Set number",
                 "Name", "Set type", "Theme",
                 "Pieces", "RRP"
             ]))
full_df["Year released"] = full_df["Year released"].apply(to_datetime)
full_df = (full_df.set_index(["Year released", "Set number"])
                  .sort_index())
We see that the data set contains information on approximately 8,500 Lego sets produced between 1980 and June 2021.
full_df.head()
| Year released | Set number | Name | Set type | Theme | Pieces | RRP | Subtheme |
|---|---|---|---|---|---|---|---|
| 1980-01-01 | 1041-2 | Educational Duplo Building Set | Normal | Dacta | 68.0 | 36.50 | NaN |
| 1980-01-01 | 1075-1 | LEGO People Supplementary Set | Normal | Dacta | 304.0 | 14.50 | NaN |
| 1980-01-01 | 1101-1 | Replacement 4.5V Motor | Normal | Service Packs | 1.0 | 5.65 | NaN |
| 1980-01-01 | 1123-1 | Ball and Socket Couplings & One Articulated Joint | Normal | Service Packs | 8.0 | 16.00 | NaN |
| 1980-01-01 | 1130-1 | Plastic Folder for Building Instructions | Normal | Service Packs | 1.0 | 14.00 | NaN |
full_df.tail()
| Year released | Set number | Name | Set type | Theme | Pieces | RRP | Subtheme |
|---|---|---|---|---|---|---|---|
| 2021-01-01 | 80022-1 | Spider Queen’s Arachnoid Base | Normal | Monkie Kid | 1170.0 | 119.99 | Season 2 |
| 2021-01-01 | 80023-1 | Monkie Kid’s Team Dronecopter | Normal | Monkie Kid | 1462.0 | 149.99 | Season 2 |
| 2021-01-01 | 80024-1 | The Legendary Flower Fruit Mountain | Normal | Monkie Kid | 1949.0 | 169.99 | Season 2 |
| 2021-01-01 | 80106-1 | Story of Nian | Normal | Seasonal | 1067.0 | 79.99 | Chinese Traditional Festivals |
| 2021-01-01 | 80107-1 | Spring Lantern Festival | Normal | Seasonal | 1793.0 | 119.99 | Chinese Traditional Festivals |
full_df.describe()
| | Pieces | RRP |
|---|---|---|
| count | 8502.000000 | 8502.000000 |
| mean | 258.739591 | 31.230506 |
| std | 481.627846 | 44.993559 |
| min | 0.000000 | 0.000000 |
| 25% | 32.000000 | 7.000000 |
| 50% | 98.000000 | 17.000000 |
| 75% | 300.000000 | 39.990000 |
| max | 11695.000000 | 799.990000 |
CPI_URL = 'https://austinrochford.com/resources/lego/CPIAUCNS202100401.csv'

years = pd.date_range('1979-01-01', '2021-01-01', freq='Y') \
            + datetime.timedelta(days=1)
cpi_df = (pd.read_csv(CPI_URL, index_col="DATE", parse_dates=["DATE"])
            .loc[years])
cpi_df["to2021"] = cpi_df.loc["2021-01-01"] / cpi_df
We now add a column RRP2021, which is RRP adjusted to 2021 dollars.
"RRP2021"] = (pd.merge(full_df, cpi_df,
full_df[=["Year released"],
left_on=True)
right_indexapply(lambda df: df["RRP"] * df["to2021"],
.=1)) axis
Based on the exploratory data analysis in the first post in this series, we filter full_df down to approximately 6,300 sets to be included in our analysis.
FILTERS = [
    full_df["Set type"] == "Normal",
    full_df["Pieces"] > 10,
    full_df["Theme"] != "Duplo",
    full_df["Theme"] != "Service Packs",
    full_df["Theme"] != "Bulk Bricks",
    full_df["RRP"] > 0
]

df = full_df[reduce(np.logical_and, FILTERS)].copy()
df.head()
| Year released | Set number | Name | Set type | Theme | Pieces | RRP | Subtheme | RRP2021 |
|---|---|---|---|---|---|---|---|---|
| 1980-01-01 | 1041-2 | Educational Duplo Building Set | Normal | Dacta | 68.0 | 36.50 | NaN | 122.721632 |
| 1980-01-01 | 1075-1 | LEGO People Supplementary Set | Normal | Dacta | 304.0 | 14.50 | NaN | 48.752429 |
| 1980-01-01 | 5233-1 | Bedroom | Normal | Homemaker | 26.0 | 4.50 | NaN | 15.130064 |
| 1980-01-01 | 6305-1 | Trees and Flowers | Normal | Town | 12.0 | 3.75 | Accessories | 12.608387 |
| 1980-01-01 | 6306-1 | Road Signs | Normal | Town | 12.0 | 2.50 | Accessories | 8.405591 |
df.describe()
| | Pieces | RRP | RRP2021 |
|---|---|---|---|
| count | 6312.000000 | 6312.000000 | 6312.000000 |
| mean | 337.177440 | 37.038246 | 45.795998 |
| std | 535.619271 | 49.657704 | 58.952563 |
| min | 11.000000 | 0.600000 | 0.971220 |
| 25% | 68.000000 | 9.990000 | 11.866173 |
| 50% | 177.000000 | 19.990000 | 26.712319 |
| 75% | 400.000000 | 49.500000 | 55.952471 |
| max | 11695.000000 | 799.990000 | 897.373477 |
Exploratory Data Analysis
The first post in this series does quite a lot of exploratory data analysis (EDA) on this data which we will not repeat here. For this post our EDA will focus on subtheme.
We see that there are a bit less than 500 subthemes and that just under a quarter of sets have no subtheme associated with them on Brickset.
"Subtheme"].describe() df[
count 4896
unique 482
top 3 in 1
freq 189
Name: Subtheme, dtype: object
"Subtheme"].isnull().mean() df[
0.22433460076045628
We see that there are many Technic sets with no subtheme, and a few other themes also stand out for having many sets with no subtheme.
= (df[df["Subtheme"].isnull()]
ax "Theme"]
[
.value_counts()=(8, 16)))
.plot.barh(figsize
'log');
ax.set_xscale("Number of sets with no subtheme");
ax.set_xlabel(
;
ax.invert_yaxis()"Theme"); ax.set_ylabel(
When we introduce subtheme-based random effects we will treat these sets with null subtheme values carefully.
Now we check whether or not each subtheme is only ever paired with one theme.
= (df.groupby("Subtheme")
sub_n_themes "Theme"]
[ .nunique())
= (sub_n_themes.value_counts()
ax =0))
.plot.bar(rot
"Number of themes associated\n"
ax.set_xlabel(("with the subtheme"));
'log');
ax.set_yscale("Number of subthemes"); ax.set_ylabel(
(sub_n_themes == 1).mean()
0.9190871369294605
This plot and calculation show that the vast majority (about 92%) of subthemes are associated with only one theme. We now examine the subthemes that are associated with more than one theme.
ax = (sub_n_themes[sub_n_themes > 1]
        .sort_values()
        .plot.barh(figsize=(8, 9)))

ax.set_xlabel(("Number of themes associated\n"
               "with the subtheme"));
The subthemes associated with the most themes make sense: many themes will have “Miscellaneous,” “Promotional,” “Seasonal,” and “Accessories” sets. Of the remaining such subthemes, I am personally curious about which themes the “Star Wars” subtheme is associated with.
= "Star Wars" STAR_WARS
= (df[df["Subtheme"] == STAR_WARS]
ax "Theme"]
[
.value_counts()
.plot.barh())
"Number of sets with\n"
ax.set_xlabel(("\"Star Wars\" subtheme"))
;
ax.invert_yaxis()"Theme"); ax.set_ylabel(
These themes make sense: there are “BrickHeadz” of many sorts (in addition to Star Wars, the Disney, Harry Potter, and Pets lines come to mind immediately), and the same reasoning applies to “Brick Sketches” and “Mindstorms.”
In order to truly remain faithful to the source data, we will treat theme and subtheme as if they are not nested, even though in reality they are.
Modeling
Time-varying intercepts, theme-based random effects and slopes
We begin by rebuilding the final model from the previous post, which models the relationship between log RRP in 2021 dollars and log piece count using time-varying intercepts and time-constant slopes with theme-based random effects. For full details of this model please consult the previous post.
First we build the feature vector \(x\) (standardized log piece count) and the target vector \(y\) (log RRP in 2021 dollars).
= df["Pieces"].values
pieces = np.log(df["Pieces"].values)
log_pieces
= df["RRP2021"].values
rrp2021 = np.log(rrp2021) log_rrp2021
scaler = StandardScaler().fit(log_pieces[:, np.newaxis])

def scale_log_pieces(log_pieces, scaler=scaler):
    return scaler.transform(log_pieces[:, np.newaxis])[:, 0]

std_log_pieces = scale_log_pieces(log_pieces)
std_log_pieces_ = shared(std_log_pieces)
We also encode the themes as integer indices and calculate the (standardized) mean log piece count per theme in order to make an adjustment to satisfy the Gauss-Markov criteria.
= df["Theme"].factorize(sort=True)
theme_id, theme_map = theme_map.size
n_theme
= shared(theme_id) theme_id_
= (pd.Series(std_log_pieces, index=df.index)
theme_mean_std_log_pieces "Theme"])
.groupby(df[ .mean())
Finally we encode the year (in years since 1980) that the set was released for the time-varying component of the intercept.
year = df.index.get_level_values("Year released").year.values
t = year - year.min()
n_year = int(t.max() + 1)

t_ = shared(t)
We now define the final model of the previous post in PyMC3.
def hierarchical_normal(name, shape, μ=None):
    if μ is None:
        μ = pm.Normal(f"μ_{name}", 0., 2.5)

    Δ = pm.Normal(f"Δ_{name}", 0., 1., shape=shape)
    σ = pm.HalfNormal(f"σ_{name}", 2.5)

    return pm.Deterministic(name, μ + Δ * σ)

def hierarchical_normal_with_mean(name, x_mean, shape,
                                  μ=None, mean_name="γ"):
    mean_coef = pm.Normal(f"{mean_name}_{name}", 0., 2.5)

    return pm.Deterministic(
        name,
        hierarchical_normal(f"_{name}", shape, μ=μ)
            + mean_coef * x_mean
    )

def gaussian_random_walk(name, shape, innov_scale=1.):
    Δ = pm.Normal(f"Δ_{name}", 0., innov_scale, shape=shape)

    return pm.Deterministic(name, Δ.cumsum())
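For reference, hierarchical_normal uses the standard non-centered parameterization: rather than sampling the group-level coefficients directly, it samples standardized offsets and rescales them, which generally improves the sampling geometry of hierarchical models. Concretely,

\[\Delta_j \sim N(0, 1), \quad \sigma \sim \textrm{HalfNormal}(2.5^2), \quad \beta_j = \mu + \sigma \cdot \Delta_j \quad \Longrightarrow \quad \beta_j \mid \mu, \sigma \sim N\left(\mu, \sigma^2\right),\]

and hierarchical_normal_with_mean shifts each \(\beta_j\) by a learned multiple of that group's mean standardized log piece count.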
with pm.Model() as model:
    β0_0 = pm.Normal("β0_0", 0., 2.5)

    β0_t = gaussian_random_walk("β0_t", n_year, innov_scale=0.1)

    β0_theme = hierarchical_normal_with_mean(
        "β0_theme",
        at.constant(theme_mean_std_log_pieces),
        n_theme, μ=0.
    )
    β0_i = β0_0 + β0_t[t_] + β0_theme[theme_id_]

    β_pieces_0 = pm.Normal("β_pieces_0", 0., 2.5)

    β_pieces_theme = hierarchical_normal_with_mean(
        "β_pieces_theme",
        at.constant(theme_mean_std_log_pieces),
        n_theme, μ=0.
    )

    β_pieces = β_pieces_0 + β_pieces_theme
= pm.HalfNormal("σ", 5.)
σ = β0_i + β_pieces_theme[theme_id_] * std_log_pieces_ - 0.5 * σ**2
μ
= pm.Normal("obs", μ, σ, observed=log_rrp2021) obs
We are finally ready to sample from the posterior distribution of this model given the Brickset data.
CHAINS = 3
SEED = 123456789

SAMPLE_KWARGS = {
    'cores': CHAINS,
    'random_seed': [SEED + i for i in range(CHAINS)],
    'return_inferencedata': True
}
with model:
    trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [β0_0, Δ_β0_t, γ_β0_theme, Δ__β0_theme, σ__β0_theme, β_pieces_0, γ_β_pieces_theme, Δ__β_pieces_theme, σ__β_pieces_theme, σ]
100.00% [6000/6000 09:51<00:00 Sampling 3 chains, 0 divergences]
Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 592 seconds.
The estimated number of effective samples is smaller than 200 for some parameters.
The standard sampling diagnostics show no cause for concern.
def plot_diagnostics(trace, axes=None, min_mult=0.995, max_mult=1.005):
    if axes is None:
        fig, axes = plt.subplots(ncols=2, sharex=False, sharey=False,
                                 figsize=(16, 6))

    az.plot_energy(trace, ax=axes[0])

    rhat = az.rhat(trace).max()
    axes[1].barh(np.arange(len(rhat.variables)), rhat.to_array(),
                 tick_label=list(rhat.variables.keys()))
    axes[1].axvline(1, c='k', ls='--')

    axes[1].set_xlim(
        min_mult * min(rhat.min().to_array().min(), 1),
        max_mult * max(rhat.max().to_array().max(), 1)
    )
    axes[1].set_xlabel(r"$\hat{R}$")
    axes[1].set_ylabel("Variable")

    return fig, axes

plot_diagnostics(trace);
We now sample from the posterior predictive distribution of the model and visualize the resulting residuals.
with model:
    pp_data = pm.sample_posterior_predictive(trace)

df["pred"] = pp_data["obs"].mean(axis=0)
df["resid"] = log_rrp2021 - df["pred"]
100.00% [3000/3000 00:01<00:00]
= sns.scatterplot(x="Pieces", y="resid",
ax =df.reset_index(),
data='C0', alpha=0.1);
color"Pieces"].round(-2))
(df.groupby(df["resid"]
[
.mean()='k', label="Bucketed average",
.plot(c=ax));
ax
'log');
ax.set_xscale(
-2, 2);
ax.set_ylim("Log RRP residual (2021 $)");
ax.set_ylabel(
='upper left'); ax.legend(loc
= (df.groupby(level="Year released")
ax "resid"]
[
.mean()='k', label="Year average"));
.plot(c
= ax.twiny()
strip_ax ="Year released", y="resid",
sns.stripplot(x=df.reset_index(),
data=1.5,
jitter='C0', alpha=0.1,
color=strip_ax);
ax
;
strip_ax.set_axis_off()
-2, 2);
ax.set_ylim("Log RRP residual (2021 $)");
ax.set_ylabel(
='upper left'); ax.legend(loc
The residuals are reasonably well-centered around zero when grouped by piece and year as expected from the previous post.
Adding subtheme-based random effects and slopes
We now encode subthemes as integer identifiers.
= df["Subtheme"].factorize(sort=True)
sub_id, sub_map = sub_map.size
n_sub
= shared(sub_id) sub_id_
Note that factorize encodes null values as -1 by default, so we see that the number of such entries in sub_id matches the number of sets with a null subtheme.
(sub_id == -1).sum()
1416
"Subtheme"]
(df[
.isnull()sum()) .
1416
We build a mask indicating which sets have a null subtheme.
sub_isnull = sub_id == -1
sub_isnull_ = shared(sub_isnull)
We also calculate the average standardized log piece count of the sets in each subtheme.
sub_mean_std_log_pieces = (pd.Series(std_log_pieces, index=df.index)
                             .groupby(df["Subtheme"])
                             .mean())
Recall from the previous post that \(j(i)\) represents the index of the theme of the \(i\)-th set. In similar fashion, let \(k(i)\) denote the index of the subtheme of the \(i\)-th set.
Our model with additional random effects and slopes for subthemes is as follows.
As before, the constant portion of the intercept is \(\beta_{0, 0} \sim N\left(0, 2.5^2\right)\).
with pm.Model() as sub_model:
    β0_0 = pm.Normal("β0_0", 0., 2.5)
The time-varying component of the intercept is
\[\beta_{0, t} \sim \text{GaussianRandomWalk}\left(0, (0.1)^2\right).\]
with sub_model:
    β0_t = gaussian_random_walk("β0_t", n_year, innov_scale=0.1)
As in the previous post, the theme-based component of the intercept is

\[ \begin{align*} \gamma_0 & \sim N(0, 2.5^2) \\ \sigma_0^{\text{theme}} & \sim \textrm{HalfNormal}(2.5^2) \\ \beta_{0, j}^{\text{theme}} & \sim N\left(\gamma_0 \cdot \bar{\tilde{x}}_j, \left(\sigma_0^{\text{theme}}\right)^2\right). \end{align*} \]
Recall that here \(\tilde{x}_i\) is the standardized log piece count of the \(i\)-th set and \(\bar{\tilde{x}}_j\) is the average standardized log piece count of all sets in the \(j\)-th theme.
with sub_model:
    β0_theme = hierarchical_normal_with_mean(
        "β0_theme",
        at.constant(theme_mean_std_log_pieces),
        n_theme, μ=0.
    )
We now include the subtheme-based random intercepts. All sets with a null subtheme are given the same subtheme effect, \(\beta_{0, -1}^{\text{sub}}\). We use
\[ \begin{align*} \beta_{0, -1}^{\text{sub}}, \lambda_0^{\text{pieces}} & \sim N(0, 2.5^2) \\ \sigma_0^{\text{sub}} & \sim \text{HalfNormal}(2.5^2) \\ \beta_{0, k}^{\text{sub}} & \sim N\left(\lambda_0^{\text{pieces}} \cdot \check{\tilde{x}}_k, \left(\sigma_0^{\text{sub}}\right)^2\right). \end{align*} \]
Here \(\check{\tilde{x}}_k\) is the average standardized log piece count of all sets in the \(k\)-th subtheme.
with sub_model:
    β0_sub_null = pm.Normal("β0_sub_null", 0., 2.5)
    β0_sub_nn = hierarchical_normal_with_mean(
        "β0_sub_nn",
        at.constant(sub_mean_std_log_pieces),
        n_sub, μ=0., mean_name="λ"
    )
    β0_sub_i = at.switch(
        sub_isnull_,
        β0_sub_null,
        β0_sub_nn[at.clip(sub_id_, 0, n_sub)]
    )
Assembling the intercept terms, we get
\[\beta_{0, i} = \beta_{0, 0} + \beta_{0, t(i)} + \beta_{0, j(i)}^{\text{theme}} + \beta_{0, k(i)}^{\text{sub}}.\]
with sub_model:
    β0_i = β0_0 + β0_t[t_] + β0_theme[theme_id_] + β0_sub_i
The varying slopes, \(\beta_{\text{pieces}, i}\), are defined similarly, just without a time-varying component.
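Concretely, the slope for the \(i\)-th set combines global, theme-based, and subtheme-based components, with the same handling of null subthemes as for the intercepts,

\[\beta_{\textrm{pieces}, i} = \beta_{\textrm{pieces}, 0} + \beta_{\textrm{pieces}, j(i)}^{\text{theme}} + \beta_{\textrm{pieces}, k(i)}^{\text{sub}}.\]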
with sub_model:
    β_pieces_0 = pm.Normal("β_pieces_0", 0., 2.5)

    β_pieces_theme = hierarchical_normal_with_mean(
        "β_pieces_theme",
        at.constant(theme_mean_std_log_pieces),
        n_theme, μ=0.
    )

    β_pieces_sub_null = pm.Normal("β_pieces_sub_null", 0., 2.5)
    β_pieces_sub_nn = hierarchical_normal_with_mean(
        "β_pieces_sub_nn",
        at.constant(sub_mean_std_log_pieces),
        n_sub, μ=0., mean_name="λ"
    )
    β_pieces_sub_i = at.switch(
        sub_isnull_,
        β_pieces_sub_null,
        β_pieces_sub_nn[at.clip(sub_id_, 0, n_sub)]
    )

    β_pieces_i = β_pieces_0 + β_pieces_theme[theme_id_] + β_pieces_sub_i
Finally, we specify the likelihood of this model and sample from its posterior distribution.
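As in the baseline model, the observation model is a normal likelihood on log RRP (in 2021 dollars) with the lognormal mean adjustment,

\[\sigma \sim \textrm{HalfNormal}(5^2), \quad \log \textrm{RRP}_i \sim N\left(\beta_{0, i} + \beta_{\textrm{pieces}, i} \cdot \tilde{x}_i - \frac{\sigma^2}{2}, \sigma^2\right).\]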
with sub_model:
    σ = pm.HalfNormal("σ", 5.)
    μ = β0_i + β_pieces_i * std_log_pieces_ - 0.5 * σ**2

    obs = pm.Normal("obs", μ, σ, observed=log_rrp2021)
with sub_model:
    sub_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [β0_0, Δ_β0_t, γ_β0_theme, Δ__β0_theme, σ__β0_theme, β0_sub_null, λ_β0_sub_nn, Δ__β0_sub_nn, σ__β0_sub_nn, β_pieces_0, γ_β_pieces_theme, Δ__β_pieces_theme, σ__β_pieces_theme, β_pieces_sub_null, λ_β_pieces_sub_nn, Δ__β_pieces_sub_nn, σ__β_pieces_sub_nn, σ]
100.00% [6000/6000 17:10<00:00 Sampling 3 chains, 0 divergences]
Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 1031 seconds.
The number of effective samples is smaller than 25% for some parameters.
The sampling diagnostics show no cause for concern.
plot_diagnostics(sub_trace);
We now sample from the posterior predictive distribution of the model and compare the resulting residuals to those of the baseline model.
with sub_model:
    pp_sub_data = pm.sample_posterior_predictive(sub_trace)

df["sub_pred"] = pp_sub_data["obs"].mean(axis=0)
df["sub_resid"] = log_rrp2021 - df["sub_pred"]
100.00% [3000/3000 00:02<00:00]
fig, (ax, sub_ax) = plt.subplots(
    ncols=2, sharex=True, sharey=True,
    figsize=(2 * FIGSIZE[0], FIGSIZE[1])
)

sns.scatterplot(x="Pieces", y="resid",
                data=df.reset_index(),
                color='C0', alpha=0.1,
                ax=ax);
(df.groupby(df["Pieces"].round(-2))
   ["resid"]
   .mean()
   .plot(c='k', label="Bucketed average",
         ax=ax));

ax.set_xscale('log');

ax.set_ylim(-2, 2);
ax.set_ylabel("Log RRP residual (2021 $)");

ax.legend(loc='upper left');
ax.set_title("Baseline model");

sns.scatterplot(x="Pieces", y="sub_resid",
                data=df.reset_index(),
                color='C0', alpha=0.1,
                ax=sub_ax);
(df.groupby(df["Pieces"].round(-2))
   ["sub_resid"]
   .mean()
   .plot(c='k', ax=sub_ax));

sub_ax.set_title("Subtheme model");

fig.tight_layout();
fig, (ax, sub_ax) = plt.subplots(
    ncols=2, sharex=True, sharey=True,
    figsize=(2 * FIGSIZE[0], FIGSIZE[1])
)

(df.groupby(level="Year released")
   ["resid"]
   .mean()
   .plot(c='k', label="Year average",
         ax=ax));

strip_ax = ax.twiny()
sns.stripplot(x="Year released", y="resid",
              data=df.reset_index(),
              jitter=1.5,
              color='C0', alpha=0.1,
              ax=strip_ax);

strip_ax.set_axis_off();

ax.set_ylim(-2, 2);
ax.set_ylabel("Log RRP residual (2021 $)");

ax.legend(loc='upper left');
ax.set_title("Baseline model");

(df.groupby(level="Year released")
   ["sub_resid"]
   .mean()
   .plot(c='k', label="Year average",
         ax=sub_ax));

sub_strip_ax = sub_ax.twiny()
sns.stripplot(x="Year released", y="sub_resid",
              data=df.reset_index(),
              jitter=1.5,
              color='C0', alpha=0.1,
              ax=sub_strip_ax);

sub_strip_ax.set_axis_off();

sub_ax.set_title("Subtheme model");

fig.tight_layout();
The residuals look quite similar, with perhaps slightly less variation for the subtheme model. Since the residuals are so visually similar, we turn to Pareto-smoothed importance sampling leave-one-out cross-validation¹ (PSIS-LOO) to compare the models.
traces = {
    "Baseline": trace,
    "Subtheme": sub_trace
}

%%time
comp_df = az.compare(traces)
CPU times: user 20.7 s, sys: 397 ms, total: 21.1 s
Wall time: 21.1 s
"weight"] comp_df.loc[:, :
| | rank | loo | p_loo | d_loo | weight |
|---|---|---|---|---|---|
| Subtheme | 0 | -1142.927746 | 639.913733 | 0.000000 | 0.911548 |
| Baseline | 1 | -2062.767732 | 251.538394 | 919.839986 | 0.088452 |
az.plot_compare(comp_df, plot_ic_diff=False);
We see that the subtheme model has approximately 2.5 times as many effective parameters as the baseline model and produces a significantly higher PSIS-LOO score.
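As a quick check on the effective-parameter claim (a small follow-up computation, not part of the original analysis), we can take the ratio of the p_loo column from the comparison above.

comp_df.loc["Subtheme", "p_loo"] / comp_df.loc["Baseline", "p_loo"]  # ≈ 2.5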
Model interpretation
We now interpret the posterior distributions of various parameters and the posterior predictions of these two models.
LARGE_LABEL_FS = 18
grid = az.plot_posterior(
    trace,
    var_names=[
        "σ__β0_theme",
        "σ__β_pieces_theme"
    ],
    hdi_prob='hide',
    color='C0', label="Baseline model"
)
az.plot_posterior(
    sub_trace,
    var_names=[
        "σ__β0_theme",
        "σ__β_pieces_theme"
    ],
    hdi_prob='hide',
    color='C1', label="Subtheme model",
    ax=grid
);

grid[0].set_xlabel(r"$\sigma_{\beta_0^{\mathrm{theme}}}$",
                   fontsize=LARGE_LABEL_FS);
grid[0].set_title(None);
grid[0].get_legend().remove()

grid[1].set_xlabel(r"$\sigma_{\beta_{\mathrm{pieces}}^{\mathrm{theme}}}$",
                   fontsize=LARGE_LABEL_FS);
grid[1].set_title(None);
We see that after adding subtheme-based varying intercepts and slopes, the scale of the theme-based varying intercepts remains roughly the same, but the scale of the theme-based varying slopes is greatly reduced.
grid = az.plot_posterior(sub_trace,
                         var_names=[
                             "σ__β0_sub_nn",
                             "σ__β_pieces_sub_nn"
                         ])

grid[0].set_xlabel(r"$\sigma_{\beta_0^{\mathrm{sub}}}$",
                   fontsize=18);
grid[0].set_title(None);

grid[1].set_xlabel(r"$\sigma_{\beta_{\mathrm{pieces}}^{\mathrm{sub}}}$",
                   fontsize=18);
grid[1].set_title(None);
Interestingly, the scale of the subtheme-based variation is significantly smaller than that of the theme-based variation in the baseline model.
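To put rough numbers on this comparison (a quick follow-up check, not in the original post), we can tabulate posterior summaries of the relevant scale parameters from each model's trace.

pd.concat([
    az.summary(trace, var_names=["σ__β0_theme", "σ__β_pieces_theme"], kind="stats"),
    az.summary(sub_trace, var_names=["σ__β0_sub_nn", "σ__β_pieces_sub_nn"], kind="stats")
])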
β0_theme_hat = trace["posterior"]["β0_theme"].mean(dim=("chain", "draw"))
β_pieces_theme_hat = trace["posterior"]["β_pieces_theme"].mean(dim=("chain", "draw"))

β0_theme_hat_sub = sub_trace["posterior"]["β0_theme"].mean(dim=("chain", "draw"))
β_pieces_theme_hat_sub = sub_trace["posterior"]["β_pieces_theme"].mean(dim=("chain", "draw"))
fig, (int_ax, slope_ax) = plt.subplots(
    ncols=2,
    figsize=(2 * FIGSIZE[1], FIGSIZE[1])
)

int_ax.set_aspect('equal');
slope_ax.set_aspect('equal');

sns.scatterplot(
    x=β0_theme_hat, y=β0_theme_hat_sub,
    alpha=0.5, ax=int_ax
);

int_ax.axline((0, 0), slope=1, c='k', ls='--');

int_ax.set_xlabel("Baseline model");
int_ax.set_ylabel("Subtheme model");
int_ax.set_title(r"$\beta_0^{\mathrm{theme}}$");

sns.scatterplot(
    x=β_pieces_theme_hat, y=β_pieces_theme_hat_sub,
    alpha=0.5, ax=slope_ax
);

slope_ax.axline((0, 0), slope=1, c='k', ls='--');

slope_ax.set_xlabel("Baseline model");
slope_ax.set_ylabel("Subtheme model");
slope_ax.set_title(r"$\beta_{\mathrm{pieces}}^{\mathrm{theme}}$");

fig.tight_layout();
These relationships are shown clearly when we scatter plot the posterior expected value of the theme-based varying slopes and intercepts against each other from each model. We see that the theme-based varying intercepts are largely concentrated around the line \(y = x\) while the theme-based varying slopes almost all have \(y < x\).
β0_sub_nn_hat = sub_trace["posterior"]["β0_sub_nn"].mean(dim=("chain", "draw"))
β_pieces_sub_nn_hat = sub_trace["posterior"]["β_pieces_sub_nn"].mean(dim=("chain", "draw"))
ax = sns.scatterplot(
    x=β0_sub_nn_hat, y=β_pieces_sub_nn_hat,
    alpha=0.5
)

ax.set_xlabel(r"$\beta_0^{\mathrm{sub}}$");
ax.set_ylabel(r"$\beta_{\mathrm{pieces}}^{\mathrm{sub}}$");
There is a negative relationship between the subtheme-based varying intercepts and slopes: as the subtheme-based intercept increases, the subtheme-based slope decreases.
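We can quantify this relationship with the correlation between the posterior expected values plotted above (a rough follow-up check, not part of the original analysis).

np.corrcoef(β0_sub_nn_hat, β_pieces_sub_nn_hat)[0, 1]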
fig, (int_ax, slope_ax) = plt.subplots(
    ncols=2,
    figsize=(2 * FIGSIZE[1], FIGSIZE[1])
)

sns.scatterplot(
    x=β0_theme_hat, y=β_pieces_theme_hat,
    alpha=0.5, ax=int_ax
);

int_ax.set_xlabel(r"$\beta_0^{\mathrm{theme}}$");
int_ax.set_ylabel(r"$\beta_{\mathrm{pieces}}^{\mathrm{theme}}$");
int_ax.set_title("Baseline model");

sns.scatterplot(
    x=β0_theme_hat_sub, y=β_pieces_theme_hat_sub,
    alpha=0.5, ax=slope_ax
);

slope_ax.set_xlabel(r"$\beta_0^{\mathrm{theme}}$");
slope_ax.set_ylabel(r"$\beta_{\mathrm{pieces}}^{\mathrm{theme}}$");
slope_ax.set_title("Subtheme model");

fig.tight_layout();
These negative relationships exist for the theme-based varying intercepts and slopes in both models as well, but are less pronounced. This tighter relationship between subtheme-based varying intercepts and slopes is partially responsible for the smaller scale of the theme-based varying slopes in the subtheme model.
We now turn to examining the effects of particular subthemes. First we plot the subthemes with the most extreme intercept and slopes.
N_EXTREME = 10

def get_extreme_ix(x, n_extreme=N_EXTREME):
    argsorted = x.argsort()

    return np.concatenate((argsorted[:n_extreme // 2], argsorted[-(n_extreme // 2):]))
β0_sub_ext_ix = get_extreme_ix(β0_sub_nn_hat)

ax, = az.plot_forest(sub_trace, var_names=["β0_sub_nn"],
                     coords={
                         "β0_sub_nn_dim_0": β0_sub_ext_ix
                     },
                     hdi_prob=0.95, combined=True)

ax.set_xlabel(r"$\beta_0^{\mathrm{sub}}$");
ax.set_yticklabels(sub_map[β0_sub_ext_ix][::-1]);
ax.set_title(None);
β_pieces_sub_ext_ix = get_extreme_ix(β_pieces_sub_nn_hat)

ax, = az.plot_forest(sub_trace, var_names=["β_pieces_sub_nn"],
                     coords={
                         "β_pieces_sub_nn_dim_0": β_pieces_sub_ext_ix
                     },
                     hdi_prob=0.95, combined=True)

ax.set_xlabel(r"$\beta_{\mathrm{pieces}}^{\mathrm{sub}}$");
ax.set_yticklabels(sub_map[β_pieces_sub_ext_ix][::-1]);
ax.set_title(None);
That subthemes like “Extra Dots,” “Bulk Set,” and “Mini Building Set” have some of the smallest intercepts and “Outdoor RC” has the largest is intuitive. Due to the negative correlation between subtheme intercepts and slopes, we then see that “Outdoor RC” is near the bottom in terms of slope. It also makes sense that “Technic” has one of the highest slopes, as Technic sets tend to include specialized pieces more often than most subthemes.
The Republic Gunship
We now return to the original question in this post, whether the Ultimate Collector Series Republic Gunship is fairly priced. To answer this question, first we examine the varying slopes and intercepts for Star Wars-themed and Ultimate Collector Series-subthemed sets in both models.
star_wars_id = (theme_map == STAR_WARS).argmax()
grid = az.plot_posterior(
    trace,
    var_names=[
        "β0_theme",
        "β_pieces_theme"
    ],
    coords={
        "β0_theme_dim_0": [star_wars_id],
        "β_pieces_theme_dim_0": [star_wars_id]
    },
    hdi_prob='hide',
    color='C0', label="Baseline model"
)
az.plot_posterior(
    sub_trace,
    var_names=[
        "β0_theme",
        "β_pieces_theme"
    ],
    coords={
        "β0_theme_dim_0": [star_wars_id],
        "β_pieces_theme_dim_0": [star_wars_id]
    },
    hdi_prob='hide',
    color='C1', label="Subtheme model",
    ax=grid
);

grid[0].set_xlabel(r"$\beta_0^{\mathrm{theme}}$",
                   fontsize=LARGE_LABEL_FS);
grid[0].set_title(None);
grid[0].get_legend().remove()

grid[1].set_xlabel(r"$\beta_{\mathrm{pieces}}^{\mathrm{theme}}$",
                   fontsize=LARGE_LABEL_FS);
grid[1].set_title(None);

plt.gcf().suptitle("Star Wars");
As with the general trend, we see that the posterior distributions of the intercepts for Star Wars-themed sets are similar in both models, but the slopes are quite different.
= "Ultimate Collector Series"
UCS = (sub_map == UCS).argmax() ucs_id
We see that the intercept for Ultimate Collector Series sets is quite average and the slope for such sets is perhaps slightly positive, but not large.
int_ax, slope_ax = az.plot_posterior(
    sub_trace,
    var_names=[
        "β0_sub_nn",
        "β_pieces_sub_nn"
    ],
    coords={
        "β0_sub_nn_dim_0": [ucs_id],
        "β_pieces_sub_nn_dim_0": [ucs_id]
    },
    ref_val=0.
)

int_ax.set_xlabel(r"$\beta_0^{\mathrm{sub}}$",
                  fontsize=LARGE_LABEL_FS);
int_ax.set_title(None);

slope_ax.set_xlabel(r"$\beta_{\mathrm{pieces}}^{\mathrm{sub}}$",
                    fontsize=LARGE_LABEL_FS);
slope_ax.set_title(None);

plt.gcf().suptitle("Ultimate Collector Series");
We see that Ultimate Collector Series sets are not really set apart by their subtheme intercept or slope, but rather by the fact that they have many more pieces than most Star Wars sets (on average).
= sns.kdeplot(df["Pieces"][df["Theme"] == STAR_WARS],
ax =(0., np.inf),
clip="All Star Wars sets")
label"Pieces"][df["Subtheme"] == UCS],
sns.kdeplot(df[=(0., np.inf),
clip="Ultimate Collector Series sets",
label=ax);
ax"Pieces"][df["Subtheme"] == UCS],
sns.rugplot(df[=0.05, c='C1', ax=ax);
height
=-100.);
ax.set_xlim(left;
ax.set_yticklabels([]); ax.legend()
Finally we examine the posterior predictive distribution of the price of the Ultimate Collector Series Republic Gunship.
= "73509"
GUNSHIP_SET_NUM = 3292
GUNSHIP_PIECES
= 2021
GUNSHIP_YEAR = 2021 - year.min() gunship_t
std_log_pieces_.set_value(scale_log_pieces(np.log([GUNSHIP_PIECES])))
theme_id_.set_value([star_wars_id])False])
sub_isnull_.set_value([
sub_id_.set_value([ucs_id]) t_.set_value([gunship_t])
with model:
    pp_gunship_data = pm.sample_posterior_predictive(trace)
100.00% [3000/3000 00:00<00:00]
with sub_model:
    pp_gunship_sub_data = pm.sample_posterior_predictive(sub_trace)
100.00% [3000/3000 00:00<00:00]
GUNSHIP_RRP = 349.99
def format_posterior_artist(artist, formatter):
    text = artist.get_text()
    x, _ = artist.get_position()

    if text.startswith(" ") or text.endswith(" "):
        fmtd_text = formatter(x)
        artist.set_text(" " + fmtd_text if text.startswith(" ")
                        else fmtd_text + " ")
    elif "=" in text:
        before, _ = text.split("=")
        artist.set_text("=".join((before, formatter(x))))
    elif "<" in text:
        below, ref_val_str, above = text.split("<")
        artist.set_text("<".join((
            below,
            " " + formatter(float(ref_val_str)) + " ",
            above
        )))

def format_posterior_text(formatter, ax=None):
    if ax is None:
        ax = plt.gca()

    artists = [artist for artist in ax.get_children() if isinstance(artist, plt.Text)]

    for artist in artists:
        format_posterior_artist(artist, formatter)
fig, (ax, sub_ax) = plt.subplots(
    ncols=2, sharex=True, sharey=True,
    figsize=(2 * FIGSIZE[0], FIGSIZE[1]))

az.plot_posterior(np.exp(pp_gunship_data["obs"]),
                  ref_val=GUNSHIP_RRP, ax=ax);
format_posterior_text(dollar_formatter, ax=ax)

ax.set_xlabel(f"Posterior predicted RRP for {GUNSHIP_SET_NUM}\n(2021 $)");
ax.set_title("Baseline model");

az.plot_posterior(np.exp(pp_gunship_sub_data["obs"]),
                  ref_val=GUNSHIP_RRP, ax=sub_ax);
format_posterior_text(dollar_formatter, ax=sub_ax)

sub_ax.set_xlabel(f"Posterior predicted RRP for {GUNSHIP_SET_NUM}\n(2021 $)");
sub_ax.set_title("Subtheme model");

fig.tight_layout();
The plot on the left shows that the baseline model considers the Ultimate Collector Series Republic Gunship slightly overpriced, while the plot on the right shows that the subtheme model considers it slightly underpriced. Since these deviations from the posterior expected value are small in either case, it is safe to say that both of these models consider the Ultimate Collector Series Republic Gunship to be priced in line with Lego’s historic prices. (I’ll be ordering it come August 1.)
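One way to make “slightly overpriced” and “slightly underpriced” concrete (a small follow-up calculation, not in the original post) is the posterior predictive probability that a set with these characteristics carries an RRP above $349.99.

pd.Series({
    "Baseline": (np.exp(pp_gunship_data["obs"]) > GUNSHIP_RRP).mean(),
    "Subtheme": (np.exp(pp_gunship_sub_data["obs"]) > GUNSHIP_RRP).mean()
})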
This post is available as a Jupyter notebook here.
%load_ext watermark
%watermark -n -u -v -iv
Last updated: Tue Jul 20 2021
Python implementation: CPython
Python version : 3.7.10
IPython version : 7.24.1
aesara : 2.0.12
seaborn : 0.11.1
numpy : 1.19.5
pymc3 : 4.0
matplotlib: 3.4.2
arviz : 0.11.2
pandas : 1.2.4
Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and computing, 27(5), 1413-1432.↩︎