A Fair Price for Darth Vader's Meditation Chamber? A Lego Price Analysis
Earlier this week I was taking a break from work and browsing lego.com (as one does), and came across Darth Vader’s Meditation Chamber (75296), which is newly available for preorder.
My first thought was “this looks nice, but $69.99 feels a bit steep for 663 pieces.” Being both a Lego and data nerd, I decided to see how good or bad this price actually is compared to other sets.
I searched for a freely available source of data on Lego set prices, but didn’t find any that were suitable to answer my question. After a few hours of frustration, I wrote a small Python script using Beautiful Soup to scrape brickset.com’s historical data on Lego sets dating back to 1980. Out of respect for the hard and excellent work of the Brickset team, I won’t be sharing the scraper code, but I have made the data set (scraped on June 1, 2021) publicly available. This is the first post in a series analyzing this data.
DATA_URL = 'https://austinrochford.com/resources/lego/brickset_01011980_06012021.csv.gz'
First we make some standard Python imports and load the data.
%matplotlib inline
import datetime
from functools import reduce
from warnings import filterwarnings
from matplotlib import MatplotlibDeprecationWarning, pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
filterwarnings('ignore', category=MatplotlibDeprecationWarning)
plt.rcParams['figure.figsize'] = (8, 6)
sns.set(color_codes=True)
def to_datetime(year):
    return np.datetime64(f"{round(year)}-01-01")
full_df = (pd.read_csv(DATA_URL,
                       usecols=[
                           "Year released", "Set number",
                           "Name", "Set type", "Theme", "Subtheme",
                           "Pieces", "RRP"
                       ])
             .dropna(subset=[
                 "Year released", "Set number",
                 "Name", "Set type", "Theme",
                 "Pieces", "RRP"
             ]))
full_df["Year released"] = full_df["Year released"].apply(to_datetime)
full_df = (full_df.set_index(["Year released", "Set number"])
                  .sort_index())
full_df.head()
Year released | Set number | Name | Set type | Theme | Pieces | RRP | Subtheme
---|---|---|---|---|---|---|---
1980-01-01 | 1041-2 | Educational Duplo Building Set | Normal | Dacta | 68.0 | 36.50 | NaN
1980-01-01 | 1075-1 | LEGO People Supplementary Set | Normal | Dacta | 304.0 | 14.50 | NaN
1980-01-01 | 1101-1 | Replacement 4.5V Motor | Normal | Service Packs | 1.0 | 5.65 | NaN
1980-01-01 | 1123-1 | Ball and Socket Couplings & One Articulated Joint | Normal | Service Packs | 8.0 | 16.00 | NaN
1980-01-01 | 1130-1 | Plastic Folder for Building Instructions | Normal | Service Packs | 1.0 | 14.00 | NaN
full_df.tail()
Year released | Set number | Name | Set type | Theme | Pieces | RRP | Subtheme
---|---|---|---|---|---|---|---
2021-01-01 | 80022-1 | Spider Queen’s Arachnoid Base | Normal | Monkie Kid | 1170.0 | 119.99 | Season 2
2021-01-01 | 80023-1 | Monkie Kid’s Team Dronecopter | Normal | Monkie Kid | 1462.0 | 149.99 | Season 2
2021-01-01 | 80024-1 | The Legendary Flower Fruit Mountain | Normal | Monkie Kid | 1949.0 | 169.99 | Season 2
2021-01-01 | 80106-1 | Story of Nian | Normal | Seasonal | 1067.0 | 79.99 | Chinese Traditional Festivals
2021-01-01 | 80107-1 | Spring Lantern Festival | Normal | Seasonal | 1793.0 | 119.99 | Chinese Traditional Festivals
Most of the fields are fairly self-explanatory. RRP is the recommended retail price of the set in dollars.
For fun, I have also exported my Lego collection from Brickset and load it now.
MY_COLLECTION_URL = 'https://austinrochford.com/resources/lego/Brickset-MySets-owned-20210602.csv'
my_df = pd.read_csv(MY_COLLECTION_URL)
my_df.index
Index(['8092-1', '10221-1', '10266-1', '10281-1', '10283-1', '21309-1',
'21312-1', '21320-1', '21321-1', '31091-1', '40174-1', '40268-1',
'40391-1', '40431-1', '40440-1', '41602-1', '41608-1', '41609-1',
'41628-1', '75030-1', '75049-1', '75074-1', '75075-1', '75093-1',
'75099-1', '75136-1', '75137-1', '75138-1', '75162-1', '75176-1',
'75187-1', '75229-1', '75243-1', '75244-1', '75248-1', '75254-1',
'75255-1', '75263-1', '75264-1', '75266-1', '75267-1', '75269-1',
'75273-1', '75277-1', '75278-1', '75283-1', '75292-1', '75297-1',
'75302-1', '75306-1', '75308-1', '75317-1', '75318-1'],
dtype='object')
We add a column to full_df indicating whether or not I own the set represented by each row.
full_df["austin"] = (full_df.index
                            .get_level_values("Set number")
                            .isin(my_df.index))
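As a minimal sketch of how this membership test behaves, here is the same pattern on a toy frame. The names toy_df and owned are stand-ins I made up for illustration; only the set numbers are taken from this post.

```python
import pandas as pd

# Toy stand-in for full_df: an empty frame with the same two index levels.
toy_index = pd.MultiIndex.from_tuples(
    [(2021, "31203-1"), (2021, "75296-1"), (2021, "75308-1")],
    names=["Year released", "Set number"],
)
toy_df = pd.DataFrame(index=toy_index)

# Toy stand-in for my_df.index: set numbers in a collection.
owned = pd.Index(["75308-1", "10281-1"])

# Pull out one level of the MultiIndex and test membership element-wise.
toy_df["owned"] = toy_df.index.get_level_values("Set number").isin(owned)
print(toy_df)
```

Only the row for 75308-1 is flagged, since it is the only set number present in both the frame and the collection.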
Exploratory Data Analysis
First we check for any missing data.
full_df.isnull().mean()
Name 0.000000
Set type 0.000000
Theme 0.000000
Pieces 0.000000
RRP 0.000000
Subtheme 0.241825
austin 0.000000
dtype: float64
About a quarter of the sets do not have a Subtheme, but each set has data for every other column. We see below that most sets are classified as “normal” building sets, but there are some books and other types of sets present in the data.
ax = (full_df["Set type"]
             .value_counts(ascending=True)
             .plot(kind='barh'))

ax.set_xscale('log');
ax.set_xlabel("Number of sets");

ax.set_ylabel("Set type");
For simplicity, we will focus only on “normal” sets.
FILTERS = [full_df["Set type"] == "Normal"]
df = full_df[reduce(np.logical_and, FILTERS)]
We still have information on over 8,000 sets.
df["Pieces"].describe()
count 8163.000000
mean 265.848095
std 489.269642
min 1.000000
25% 34.000000
50% 102.000000
75% 310.000000
max 11695.000000
Name: Pieces, dtype: float64
The set with the most pieces is the recently released World Map (31203-1).
df.loc[df["Pieces"].idxmax()]
Name World Map
Set type Normal
Theme Art
Pieces 11695.0
RRP 249.99
Subtheme Miscellaneous
austin False
Name: (2021-01-01 00:00:00, 31203-1), dtype: object
I love the idea of a Lego world map, but I’m not in love with the ocean color, so I’ll probably pass on this beast.
We see below that there are many sets with very few pieces (presumably replacement parts, minifigures, and promotional sets).
max_pieces = df["Pieces"].max()
plt_max_pieces = 1.1 * max_pieces

ax = sns.kdeplot(data=full_df, x="Pieces",
                 label="All sets")
sns.rugplot(data=full_df, x="Pieces",
            c='k', alpha=0.1, ax=ax);

THRESHES = [1, 10, 25, 50, 100]

for thresh in THRESHES:
    sns.kdeplot(data=full_df[full_df["Pieces"] > thresh],
                x="Pieces",
                clip=(thresh, plt_max_pieces),
                label=f"Sets with more\nthan {thresh} pieces",
                ax=ax);

ax.set_xscale('log');
ax.set_xticks(THRESHES + [10**3, 10**4])
ax.set_xlim(0.9, plt_max_pieces);

ax.set_yticks([]);
ax.legend();
We filter our analysis to sets with more than 10 pieces.
FILTERS.append(full_df["Pieces"] > 10)
df = full_df[reduce(np.logical_and, FILTERS)]
Note that by using FILTERS.append the order of execution of cells in this notebook becomes very important (notebooks are bad, etc., but I love them anyway).
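For readers who want to soften that dependence on cell order, one possible alternative (not what this notebook does) is to key the masks by name. The toy frame and the filters dictionary below are my own illustrative assumptions.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Toy data with made-up rows, standing in for full_df.
toy_df = pd.DataFrame({
    "Set type": ["Normal", "Normal", "Book", "Normal"],
    "Pieces": [663, 8, 120, 34],
})

# Keying the masks by name means re-running a cell overwrites a filter
# instead of appending a second copy of it.
filters = {}
filters["normal"] = toy_df["Set type"] == "Normal"
filters["min_pieces"] = toy_df["Pieces"] > 10

# reduce(np.logical_and, ...) ANDs the boolean masks together pairwise.
mask = reduce(np.logical_and, filters.values())
filtered = toy_df[mask]
print(filtered)
```

The result still depends on which filters have been defined so far, so this only removes the duplicate-append problem, not the ordering problem as a whole.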
We now examine the distribution of sets across themes.
(df["Theme"]
   .value_counts()
   .describe())
count 134.000000
mean 52.992537
std 94.157163
min 1.000000
25% 7.000000
50% 19.000000
75% 52.500000
max 556.000000
Name: Theme, dtype: float64
n_theme = df["Theme"].nunique()

N_THEME_PLOTS = 12
N_THEME_COLS = 2
n_theme_rows = N_THEME_PLOTS // N_THEME_COLS

n_themes_per_plot = int(np.ceil(n_theme / N_THEME_PLOTS))
fig, axes = plt.subplots(nrows=n_theme_rows, ncols=N_THEME_COLS,
                         sharex=True, sharey=False,
                         figsize=(16, n_theme_rows * 6))

theme_ct = df["Theme"].value_counts()

for i, ax in zip(range(0, n_theme, n_themes_per_plot),
                 axes.flatten()):
    (theme_ct.iloc[i:i + n_themes_per_plot]
             .plot(kind='barh', ax=ax));

    ax.set_xscale('log');
    ax.set_xlabel("Number of sets");

    ax.invert_yaxis();
    ax.set_ylabel("Theme");

fig.tight_layout();
Unsurprisingly, the Star Wars theme has the most sets historically. That Duplo comes in second is interesting. We filter out Duplo, service pack, and bulk brick sets from our analysis.
FILTERS.append(full_df["Theme"] != "Duplo")
FILTERS.append(full_df["Theme"] != "Service Packs")
FILTERS.append(full_df["Theme"] != "Bulk Bricks")
df = full_df[reduce(np.logical_and, FILTERS)]
Set Price
We now turn to the question that prompted this work, whether or not Darth Vader’s Meditation Chamber is overpriced. Our set data spans 1980-2021, and we see that the number of sets released has been increasing fairly steadily over the years.
ax = (df.index
        .get_level_values("Year released")
        .value_counts()
        .sort_index()
        .plot())

ax.set_xlabel("Year released");
ax.set_ylabel("Number of sets");
Since the data spans more than 40 years, it is important to adjust for inflation. We use the Consumer Price Index for All Urban Consumers: All Items in U.S. City Average from the U.S. Federal Reserve to adjust for inflation.
CPI_URL = 'https://austinrochford.com/resources/lego/CPIAUCNS202100401.csv'
years = pd.date_range('1979-01-01', '2021-01-01', freq='Y') \
    + datetime.timedelta(days=1)
cpi_df = (pd.read_csv(CPI_URL, index_col="DATE", parse_dates=["DATE"])
            .loc[years])
cpi_df["to2021"] = cpi_df.loc["2021-01-01"] / cpi_df
fig, (cpi_ax, factor_ax) = plt.subplots(ncols=2, sharex=True, sharey=False,
                                        figsize=(16, 6))

cpi_df["CPIAUCNS"].plot(ax=cpi_ax);

cpi_ax.set_xlabel("Year");
cpi_ax.set_ylabel("CPIAUCNS");

cpi_df["to2021"].plot(ax=factor_ax);

factor_ax.set_xlabel("Year");
factor_ax.set_ylabel("Inflation multiple to 2021 dollars");

fig.tight_layout();
We now add a column RRP2021, which is RRP adjusted to 2021 dollars.
full_df["RRP2021"] = (pd.merge(full_df, cpi_df,
                               left_on=["Year released"],
                               right_index=True)
                        .apply(lambda df: df["RRP"] * df["to2021"],
                               axis=1))
df = full_df[reduce(np.logical_and, FILTERS)]
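To make the arithmetic behind this adjustment concrete, here is a minimal sketch in plain Python. The CPI readings are hypothetical round numbers chosen for readability, not values from the CPIAUCNS series, and to_2021_dollars is a helper name I introduce only for this example.

```python
# Hypothetical CPI readings, chosen only to make the arithmetic easy to follow.
cpi = {1980: 80.0, 2000: 170.0, 2021: 260.0}


def to_2021_dollars(price, year, cpi_by_year=cpi, base_year=2021):
    """Scale a price by the ratio of base-year CPI to release-year CPI."""
    return price * cpi_by_year[base_year] / cpi_by_year[year]


# A $20.00 set from 1980 under these made-up CPI values: 20 * 260 / 80.
adjusted = to_2021_dollars(20.00, 1980)
print(adjusted)

# A set released in the base year has a multiple of 1, so its price is unchanged.
unchanged = to_2021_dollars(69.99, 2021)
print(round(unchanged, 2))
```

The inflation multiple is simply the base-year CPI divided by the release-year CPI, which is the quantity stored in the to2021 column above.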
Here we plot the most obvious relationship pertinent to my initial question about Darth Vader’s Meditation Chamber, price versus number of pieces.
ax = sns.scatterplot(x="Pieces", y="RRP2021", data=df, alpha=0.1)
sns.scatterplot(x="Pieces", y="RRP2021",
                data=df[df["austin"] == True],
                alpha=0.5, label="My sets");

ax.set_xscale('log');

ax.set_yscale('log');
ax.set_ylabel("Retail price (2021 $)");
The relationship is fairly linear on a log-log scale, which will be important in subsequent posts when we introduce more complex statistical models. The sets in my collection are highlighted in this plot.
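A straight line on a log-log plot corresponds to a power law, price = a * pieces**b. As a rough sketch of what that means (on synthetic data only, with arbitrary a and b that are not estimates from the Brickset data, and not the model used later in this series), a least-squares line in log space recovers the exponent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data only: price = a * pieces**b with multiplicative noise.
# true_a and true_b are arbitrary choices for illustration.
true_a, true_b = 0.5, 0.8
pieces = rng.integers(10, 5000, size=500)
price = true_a * pieces**true_b * np.exp(rng.normal(0.0, 0.1, size=500))

# A straight line in log-log space: log(price) = log(a) + b * log(pieces).
slope, intercept = np.polyfit(np.log(pieces), np.log(price), deg=1)
fitted_a, fitted_b = np.exp(intercept), slope

print(f"fitted exponent b: {fitted_b:.2f}")
```

The slope of the fitted line is the exponent b, and the exponential of the intercept is the scale a.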
We can also highlight certain themes of interest to see where the sets from those themes fall.
PLOT_THEMES = [
    "Creator Expert",
    "Disney",
    "Star Wars",
    "Harry Potter",
    "Marvel Super Heroes",
    "Ninjago",
    "City",
    "Space",
    "Jurassic World"
]
grid = sns.relplot(x="Pieces", y="RRP2021", col="Theme",
                   data=df[df["Theme"].isin(PLOT_THEMES)],
                   color='C1', alpha=0.5, col_wrap=3, zorder=5)

for ax in grid.axes.flatten():
    sns.scatterplot(x="Pieces", y="RRP2021", data=df,
                    color='C0', alpha=0.05,
                    ax=ax);

    ax.set_xscale('log');

    ax.set_yscale('log');
    ax.set_ylabel("Retail price (2021 $)");
There are some interesting relationships here.
- Space sets seem to be more expensive per piece than average.
- Star Wars, City, Harry Potter, and Jurassic World sets seem to be fairly uniformly distributed around the average price per piece.
- Marvel Super Heroes and Ninjago seem to be less expensive per piece than average.
- Creator Expert sets have many pieces, which makes perfect sense.
We’ll dig deeper into these observations in a subsequent post.
We now add a DPP2021 column to the data frame which represents the set’s price per piece in 2021 dollars.
full_df["DPP2021"] = (df["RRP2021"]
                        .div(df["Pieces"])
                        .rename("DPP2021"))
df = full_df[reduce(np.logical_and, FILTERS)]
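As a quick worked example, here is the same calculation by hand for Darth Vader’s Meditation Chamber, using the price and piece count quoted at the top of this post. The variable names are mine, and I assume an inflation multiple of exactly 1 because the set was released in the base year.

```python
rrp = 69.99      # recommended retail price from the post, in dollars
pieces = 663     # piece count from the post

# The set was released in 2021, so the multiple to 2021 dollars is taken as 1.
to2021 = 1.0

dpp2021 = rrp * to2021 / pieces
print(f"{dpp2021:.4f}")  # prints 0.1056
```

That works out to roughly 10.6 cents per piece.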
In order to decide whether or not Darth Vader’s Meditation Chamber is overpriced, we look at the percentile of its adjusted price per piece in various populations of sets.
VADER_MEDITATION = "75296-1"
We see that this set is in the 25th percentile of adjusted price per piece among all sets, and the 13th percentile among Star Wars sets.
(df["DPP2021"]
   .lt(df["DPP2021"]
         .xs(VADER_MEDITATION, level="Set number")
         .values[0])
   .mean())
0.25019669551534224
(df["DPP2021"]
   [df["Theme"] == "Star Wars"]
   .lt(df["DPP2021"]
         .xs(VADER_MEDITATION, level="Set number")
         .values[0])
   .mean())
0.13309352517985612
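The percentile calculations above rely on the fact that the mean of a boolean Series is the share of True values. A minimal sketch on made-up numbers (the five prices per piece and their labels are hypothetical):

```python
import pandas as pd

# Made-up prices per piece for five hypothetical sets.
dpp = pd.Series([0.05, 0.08, 0.10, 0.20, 0.40],
                index=["a", "b", "c", "d", "e"])

target = dpp.loc["c"]

# .lt() gives a boolean Series; its mean is the share of sets that are
# strictly cheaper per piece than the target.
share_cheaper = dpp.lt(target).mean()
print(share_cheaper)
```

Two of the five toy sets are strictly cheaper per piece than set "c", so it sits at the 40th percentile of this toy population.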
Its adjusted price per piece is near the median for my collection (apparently I’m relatively cheap), but still nowhere near a bad value by this metric.
(df[df["austin"] == True]
   ["DPP2021"]
   .lt(df["DPP2021"]
         .xs(VADER_MEDITATION, level="Set number")
         .values[0])
   .mean())
0.46153846153846156
ax = sns.scatterplot(x="Pieces", y="RRP2021", data=df,
                     color='C0', alpha=0.05,
                     label="All sets");

sns.scatterplot(x="Pieces", y="RRP2021",
                data=df[df["Theme"] == "Star Wars"],
                color='C1', alpha=0.5,
                label="Star wars sets",
                ax=ax);
sns.scatterplot(x="Pieces", y="RRP2021",
                data=df[df["austin"] == True],
                color='C2', alpha=0.5,
                label="My sets",
                ax=ax);
ax.scatter(df["Pieces"].xs(VADER_MEDITATION, level="Set number"),
           df["RRP2021"].xs(VADER_MEDITATION, level="Set number"),
           color="C3", s=150, marker='D', alpha=0.9,
           label=f"Darth Vader's Meditation Chamber ({VADER_MEDITATION})");

ax.set_xscale('log');

ax.set_yscale('log');
ax.set_ylabel("Retail price (2021 $)");

ax.legend(loc='upper left');
As an apology to my Lego overlords, I preordered Darth Vader’s Meditation Chamber after doing this analysis.
It is also interesting to examine how the adjusted price per piece has changed over time.
ax = sns.lineplot(x="Year released", y="DPP2021", data=df,
                  color='k')

ax.set_ylim(10**-1, 2);
ax.set_yscale('log');
ax.set_ylabel("Price per piece (2021 $)");
It seems that the adjusted price per piece has been steadily decreasing over time. Presumably Lego has been optimizing their manufacturing processes to be more efficient over the years and has passed some of the reduction in costs on to us consumers. Lego has also likely improved set pricing over time as they have accumulated historical data on sales of past sets.
When we break this timeseries down by theme we see a general convergence of the price per piece to the average across our themes of interest.
ax = sns.lineplot(x="Year released", y="DPP2021", data=df,
                  color='k', label="All")
sns.lineplot(x="Year released", y="DPP2021", hue="Theme",
             data=df[df["Theme"].isin(PLOT_THEMES)],
             ax=ax);

ax.set_ylim(10**-1, 2);
ax.set_yscale('log');
ax.set_ylabel("Price per piece (2021 $)");
Future posts in this series will introduce more complex statistical models for this data and use them to explore these trends and more structure in the data.
This post is available as a Jupyter notebook here. The Docker environment that this post was written in is available here.
%load_ext watermark
%watermark -n -u -v -iv
Last updated: Thu Jun 03 2021
Python implementation: CPython
Python version : 3.8.8
IPython version : 7.22.0
matplotlib: 3.4.1
numpy : 1.20.2
pandas : 1.2.3
seaborn : 0.11.1