Saving Memory by Counting Combinations of Features
For all the hype about big data, much value resides in the world’s medium and small data. Especially when we consider the length of the feedback loop and total analyst time invested, insights from small and medium data are quite attractive and economical. Personally, I find analyzing data that fits into memory quite convenient, and therefore, when I am confronted with a data set that does not fit in memory as-is, I am willing to spend a bit of time to try to manipulate it to fit into memory.
The first technique I usually turn to is to only store distinct rows of a data set, along with the count of the number of times that row appears in the data set. This technique is fairly simple to implement, especially when the data set is generated by a SQL query. If the initial query that generates the data set is
SELECT u, v, w FROM t;
we would modify it to become
SELECT u, v, w, COUNT(1) FROM t GROUP BY u, v, w;
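If the summarized query runs against a database, its result can be pulled straight into pandas. The sketch below is illustrative only: it assumes a SQLite file named example.db containing a table t with columns u, v, and w, and summ_query_df is just an illustrative name.

import sqlite3

import pandas as pd

# illustrative only: example.db and the table t are assumed, not provided
con = sqlite3.connect('example.db')
summ_query_df = pd.read_sql_query(
    'SELECT u, v, w, COUNT(1) AS n FROM t GROUP BY u, v, w', con)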
We now generate a sample data set with both discrete and continuous features.
%matplotlib inline
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from patsy import dmatrices, dmatrix
import scipy as sp
import seaborn as sns
from statsmodels import api as sm
from statsmodels.base.model import GenericLikelihoodModel
np.random.seed(1545721) # from random.org

N = 100001

u_min, u_max = 0, 100
v_p = 0.6

n_ws = 50
ws = sp.stats.norm.rvs(0, 1, size=n_ws)
w_min, w_max = ws.min(), ws.max()

df = pd.DataFrame({
    'u': np.random.randint(u_min, u_max, size=N),
    'v': sp.stats.bernoulli.rvs(v_p, size=N),
    'w': np.random.choice(ws, size=N, replace=True)
})
df.head()
|   | u  | v | w         |
|---|----|---|-----------|
| 0 | 97 | 0 | 0.537397  |
| 1 | 79 | 1 | 1.536383  |
| 2 | 44 | 1 | 1.074817  |
| 3 | 51 | 0 | -0.491210 |
| 4 | 47 | 1 | 1.592646  |
We see that this data frame has just over 100,000 rows, but only about 10,000 distinct rows.
df.shape[0]

100001

df.drop_duplicates().shape[0]

9997
We now use pandas’ groupby method to produce a data frame that contains the count of each unique combination of u, v, and w.
count_df = df.groupby(list(df.columns)).size()
count_df.name = 'count'
count_df = count_df.reset_index()
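As an aside, more recent versions of pandas (1.1 and later, newer than the imports used in this post) provide DataFrame.value_counts, which performs the same reduction in a single call. A minimal sketch, with count_df_alt as an illustrative name:

# equivalent one-liner in pandas >= 1.1; count_df_alt is an illustrative name
count_df_alt = df.value_counts().rename('count').reset_index()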
In order to make later examples more interesting, we shuffle the rows of the reduced data frame, because pandas automatically sorts the values we grouped on.
shuffled_ixs = count_df.index.values
np.random.shuffle(shuffled_ixs)
count_df = count_df.iloc[shuffled_ixs].copy().reset_index(drop=True)
count_df.head()
|   | u  | v | w         | count |
|---|----|---|-----------|-------|
| 0 | 0  | 0 | 0.425597  | 14    |
| 1 | 48 | 1 | -0.993981 | 7     |
| 2 | 35 | 0 | 0.358156  | 9     |
| 3 | 19 | 1 | -0.760298 | 17    |
| 4 | 40 | 1 | -0.688514 | 13    |
Again, we see that we are storing 90% fewer rows. Although this data set has been artificially generated, I have seen space savings of up to 98% when applying this technique to real-world data sets.
count_df.shape[0] / N
0.0999690003099969
This space savings allows me to analyze data sets which initially appear too large to fit in memory. For example, the computer I am writing this on has 16 GB of RAM. At a 90% space savings, I can comfortably analyze a data set that might otherwise be 80 GB in memory while leaving a healthy amount of memory for other processes. To me, the convenience and tight feedback loop that come with fitting a data set entirely in memory are hard to overstate.
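The row count is only a proxy for the true memory footprint, since the summarized frame carries an extra count column; pandas’ memory_usage method measures the actual savings, and the exact ratio depends on the column dtypes.

# actual memory footprints of the full and summarized data frames, in bytes
full_bytes = df.memory_usage(index=True, deep=True).sum()
summ_bytes = count_df.memory_usage(index=True, deep=True).sum()

# fraction of the original footprint used by the summarized data frame
summ_bytes / full_bytes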
As nice as it is to fit a data set into memory, it’s not very useful unless we can still analyze it. The rest of this post will show how we can perform standard operations on these summary data sets.
For convenience, we will separate the feature columns from the count columns.
summ_df = count_df[['u', 'v', 'w']]
n = count_df['count']
Mean
Suppose we have a group of numbers \(x_1, x_2, \ldots, x_n\). Let the unique values among these numbers be denoted \(z_1, z_2, \ldots, z_m\) and let \(n_j\) be the number of times \(z_j\) appears in the original group. The mean of the \(x_i\)s is therefore
\[ \begin{align*} \bar{x} & = \frac{1}{n} \sum_{i = 1}^n x_i = \frac{1}{n} \sum_{j = 1}^m n_j z_j, \end{align*} \]
since we may group identical \(x_i\)s into a single summand. Since \(n = \sum_{j = 1}^m n_j\), we can calculate the mean using the following function.
def mean(df, count):
    return df.mul(count, axis=0).sum() / count.sum()
mean(summ_df, n)
u 49.308067
v 0.598704
w 0.170815
dtype: float64
We see that the means calculated by our function agree with the means of the original data frame.
df.mean(axis=0)
u 49.308067
v 0.598704
w 0.170815
dtype: float64
np.allclose(mean(summ_df, n), df.mean(axis=0))
True
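For a single column, NumPy’s np.average gives the same weighted mean directly, which makes for a handy cross-check.

# weighted mean of w computed directly by NumPy
np.average(summ_df.w, weights=n)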
Variance
We can calculate the variance as
\[ \begin{align*} \sigma_x^2 & = \frac{1}{n - 1} \sum_{i = 1}^n \left(x_i - \bar{x}\right)^2 = \frac{1}{n - 1} \sum_{j = 1}^m n_j \left(z_j - \bar{x}\right)^2 \end{align*} \]
using the same trick of combining identical terms in the original sum. Again, this calculation is easy to implement in Python.
def var(df, count):
    mu = mean(df, count)

    return np.power(df - mu, 2).mul(count, axis=0).sum() / (count.sum() - 1)
var(summ_df, n)
u 830.025064
v 0.240260
w 1.099191
dtype: float64
We see that the variances calculated by our function agree with the variances of the original data frame.
df.var()
u 830.025064
v 0.240260
w 1.099191
dtype: float64
np.allclose(var(summ_df, n), df.var(axis=0))
True
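The same shortcut works for a single column with np.average, as long as we apply Bessel’s correction by hand, since np.average computes the population-style weighted variance.

# weighted sample variance of w via NumPy, with Bessel's correction applied manually
mu_w = np.average(summ_df.w, weights=n)
np.average((summ_df.w - mu_w)**2, weights=n) * n.sum() / (n.sum() - 1)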
Histogram
Histograms are fundamental tools for exploratory data analysis. Fortunately, pyplot’s hist function easily accommodates summarized data through its optional weights argument.
fig, (full_ax, summ_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))

nbins = 20
blue, green = sns.color_palette()[:2]

full_ax.hist(df.w, bins=nbins, color=blue, alpha=0.5, lw=0);
full_ax.set_xlabel('$w$');
full_ax.set_ylabel('Count');
full_ax.set_title('Full data frame');

summ_ax.hist(summ_df.w, bins=nbins, weights=n, color=green, alpha=0.5, lw=0);
summ_ax.set_xlabel('$w$');
summ_ax.set_title('Summarized data frame');
We see that the histograms for \(w\) produced from the full and summarized data frames are identical.
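The agreement can also be checked numerically with np.histogram, which accepts weights as well; passing the bin edges from the full-data histogram to the weighted call guarantees the two sets of counts are directly comparable.

# bin counts from the full column and the weighted summarized column should match
full_counts, edges = np.histogram(df.w, bins=nbins)
summ_counts, _ = np.histogram(summ_df.w, bins=edges, weights=n)

np.allclose(full_counts, summ_counts)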
Quantiles
Calculating the mean and variance of our summarized data frames was not too difficult. Calculating quantiles from this data frame is slightly more involved, though still not terribly hard.

Our implementation will rely on sorting the data frame. Though this implementation is not optimal from a computational complexity point of view, it is in keeping with the spirit of pandas’ implementation of quantiles. I have given some thought to how to implement linear-time selection on the summarized data frame, but have not yet worked out the details.
Before writing a function to calculate quantiles of a data frame with several columns, we will walk through the simpler case of computing the quartiles of a single series.
u = summ_df.u
u.head()
0 0
1 48
2 35
3 19
4 40
Name: u, dtype: int64
First we argsort the series.

sorted_ilocs = u.argsort()
We see that u.iloc[sorted_ilocs] will now be in ascending order.

sorted_u = u.iloc[sorted_ilocs]

(sorted_u.values[:-1] <= sorted_u.values[1:]).all()
True
More importantly, n.iloc[sorted_ilocs] will have the count of the smallest element of u first, the count of the second smallest element second, etc.
sorted_n = n.iloc[sorted_ilocs]
sorted_cumsum = sorted_n.cumsum()
cdf = (sorted_cumsum / n.sum()).values
Now, the \(i\)-th location of sorted_cumsum will contain the number of elements of u less than or equal to the \(i\)-th smallest element, and therefore cdf is the empirical cumulative distribution function of u. The following plot shows that this interpretation is correct.
fig, ax = plt.subplots(figsize=(8, 6))

blue, _, red = sns.color_palette()[:3]

ax.plot(sorted_u, cdf, c=blue, label='Empirical CDF');

plot_u = np.arange(100)
ax.plot(plot_u, sp.stats.randint.cdf(plot_u, u_min, u_max), '--', c=red, label='Population CDF');

ax.set_xlabel('$u$');
ax.legend(loc=2);
If, for example, we wish to find the median of u, we want to find the first location in cdf which is greater than or equal to 0.5.

median_iloc_in_sorted = (cdf < 0.5).argmin()
The index of the median in u is therefore sorted_ilocs.iloc[median_iloc_in_sorted], so the median of u is
u.iloc[sorted_ilocs.iloc[median_iloc_in_sorted]]
49
df.u.quantile(0.5)
49.0
We can generalize this method to calculate multiple quantiles simultaneously as follows.
q = np.array([0.25, 0.5, 0.75])

u.iloc[sorted_ilocs.iloc[np.less.outer(cdf, q).argmin(axis=0)]]
2299 24
9079 49
1211 74
Name: u, dtype: int64
df.u.quantile(q)
0.25 24
0.50 49
0.75 74
dtype: float64
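To see what the indexing expression above is doing, it may help to apply np.less.outer to a tiny, made-up cumulative distribution; the values below are purely illustrative.

# illustrative values only, not derived from the data set above
toy_cdf = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
toy_q = np.array([0.25, 0.5, 0.75])

# one boolean column per quantile, comparing toy_cdf to that quantile
np.less.outer(toy_cdf, toy_q)

# argmin along axis 0 picks the first False in each column, i.e. the first
# position at which the cumulative proportion reaches the quantile
np.less.outer(toy_cdf, toy_q).argmin(axis=0)  # array([1, 2, 3])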
The array np.less.outer(cdf, q) contains one boolean column per element of q, each the result of comparing cdf to that quantile, and argmin(axis=0) returns, for each quantile, the first location at which cdf reaches it. The following function generalizes this approach from series to data frames.
def quantile(df, count, q=0.5):
    q = np.ravel(q)

    sorted_ilocs = df.apply(pd.Series.argsort)
    sorted_counts = sorted_ilocs.apply(lambda s: count.iloc[s].values)
    cdf = sorted_counts.cumsum() / sorted_counts.sum()

    q_ilocs_in_sorted_ilocs = pd.DataFrame(np.less.outer(cdf.values, q).argmin(axis=0).T,
                                           columns=df.columns)
    q_ilocs = sorted_ilocs.apply(lambda s: s[q_ilocs_in_sorted_ilocs[s.name]].reset_index(drop=True))

    q_df = df.apply(lambda s: s.iloc[q_ilocs[s.name]].reset_index(drop=True))
    q_df.index = q

    return q_df
quantile(summ_df, n, q=q)
|      | u  | v | w         |
|------|----|---|-----------|
| 0.25 | 24 | 0 | -0.688514 |
| 0.50 | 49 | 1 | 0.040036  |
| 0.75 | 74 | 1 | 1.074817  |
df.quantile(q=q)
|      | u  | v | w         |
|------|----|---|-----------|
| 0.25 | 24 | 0 | -0.688514 |
| 0.50 | 49 | 1 | 0.040036  |
| 0.75 | 74 | 1 | 1.074817  |
np.allclose(quantile(summ_df, n, q=q), df.quantile(q=q))
True
Bootstrapping
Another important operation is bootstrapping. We will see two ways to perform bootstrapping on the summary data set.
n_boot = 10000
Key to both approaches to the bootstrap is knowing the proportion of the data set that each distinct combination of features comprises.
weights = n / n.sum()
The two approaches differ in the type of data frame they produce. The first produces a non-summarized data frame with non-unique rows, while the second produces a summarized data frame. Each of these approaches to bootstrapping is useful in different situations.
Non-summarized bootstrap
To produce a non-summarized data frame, we generate a list of locations in summ_df based on weights using numpy.random.choice.
boot_ixs = np.random.choice(summ_df.shape[0], size=n_boot, replace=True,
                            p=weights)
boot_df = summ_df.iloc[boot_ixs]
boot_df.head()
|      | u  | v | w         |
|------|----|---|-----------|
| 1171 | 47 | 1 | -1.392235 |
| 9681 | 3  | 1 | 0.018521  |
| 6664 | 13 | 1 | 1.941207  |
| 8343 | 13 | 0 | 0.655181  |
| 3595 | 95 | 1 | 0.972592  |
We can verify that our bootstrapped data frame has (approximately) the same distribution as the original data frame using Q-Q plots.
ps = np.linspace(0, 1, 100)

boot_qs = boot_df[['u', 'w']].quantile(q=ps)
qs = df[['u', 'w']].quantile(q=ps)
fig, ax = plt.subplots(figsize=(8, 6))

blue = sns.color_palette()[0]

ax.plot((u_min, u_max), (u_min, u_max), '--', c='k', lw=0.75,
        label='Perfect agreement');
ax.scatter(qs.u, boot_qs.u, c=blue, alpha=0.5);

ax.set_xlim(u_min, u_max);
ax.set_xlabel('Original quantiles');

ax.set_ylim(u_min, u_max);
ax.set_ylabel('Resampled quantiles');

ax.set_title('Q-Q plot for $u$');
ax.legend(loc=2);
fig, ax = plt.subplots(figsize=(8, 6))

blue = sns.color_palette()[0]

ax.plot((w_min, w_max), (w_min, w_max), '--', c='k', lw=0.75,
        label='Perfect agreement');
ax.scatter(qs.w, boot_qs.w, c=blue, alpha=0.5);

ax.set_xlim(w_min, w_max);
ax.set_xlabel('Original quantiles');

ax.set_ylim(w_min, w_max);
ax.set_ylabel('Resampled quantiles');

ax.set_title('Q-Q plot for $w$');
ax.legend(loc=2);
We see that both of the resampled distributions agree quite closely with the original distributions. We have only produced Q-Q plots for \(u\) and \(w\) because \(v\) is binary-valued.
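A simple numeric companion to the Q-Q plots is the largest absolute difference between corresponding quantiles; the exact values depend on the particular resample, but they should be small relative to the ranges of u and w.

# largest absolute difference between original and resampled quantiles, per column
(boot_qs - qs).abs().max()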
While at first non-summarized bootstrap resampling may appear to counteract the benefits of summarizing the original data frame, it can be quite useful when training and evaluating online learning algorithms, where iterating through the locations of the bootstrapped data in the original summarized data frame is efficient.
Summarized bootstrap
To produce a summarized data frame, the counts of the resampled data frame are sampled from a multinomial distribution with event probabilities given by weights.

boot_counts = pd.Series(np.random.multinomial(n_boot, weights), name='count')
Again, we compare the distribution of our bootstrapped data frame to that of the original with Q-Q plots. Here our summarized quantile function is quite useful.
boot_count_qs = quantile(summ_df, boot_counts, q=ps)
fig, ax = plt.subplots(figsize=(8, 6))

ax.plot((u_min, u_max), (u_min, u_max), '--', c='k', lw=0.75,
        label='Perfect agreement');
ax.scatter(qs.u, boot_count_qs.u, c=blue, alpha=0.5);

ax.set_xlim(u_min, u_max);
ax.set_xlabel('Original quantiles');

ax.set_ylim(u_min, u_max);
ax.set_ylabel('Resampled quantiles');

ax.set_title('Q-Q plot for $u$');
ax.legend(loc=2);
fig, ax = plt.subplots(figsize=(8, 6))

ax.plot((w_min, w_max), (w_min, w_max), '--', c='k', lw=0.75,
        label='Perfect agreement');
ax.scatter(qs.w, boot_count_qs.w, c=blue, alpha=0.5);

ax.set_xlim(w_min, w_max);
ax.set_xlabel('Original quantiles');

ax.set_ylim(w_min, w_max);
ax.set_ylabel('Resampled quantiles');

ax.set_title('Q-Q plot for $w$');
ax.legend(loc=2);
Again, we see that both of the resampled distributions agree quite closely with the original distributions.
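To sketch one way these resampled counts might be used, the snippet below (n_reps and boot_means are illustrative names, not from the original analysis) repeats the multinomial draw to build an approximate bootstrap distribution for the mean of w, working entirely from the summarized data frame.

# a minimal sketch of a bootstrap interval computed from the summarized data
n_reps = 1000
boot_means = np.empty(n_reps)

for i in range(n_reps):
    rep_counts = np.random.multinomial(n_boot, weights)
    # weighted mean of w under the resampled counts
    boot_means[i] = np.dot(summ_df.w, rep_counts) / n_boot

# approximate 95% bootstrap interval for the mean of w
np.percentile(boot_means, [2.5, 97.5])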
Linear Regression
Linear regression is among the most frequently used types of statistical inference, and it plays nicely with summarized data. Typically, we have a response variable \(y\) that we wish to model as a linear combination of \(u\), \(v\), and \(w\) as
\[ \begin{align*} y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 w_i + \varepsilon_i, \end{align*} \]
where \(\varepsilon_i \sim N(0, \sigma^2)\) is noise. We generate such a data set below (with \(\sigma = 0.1\)).
beta = np.array([-3., 0.1, -4., 2.])
noise_std = 0.1

X = dmatrix('u + v + w', data=df)

y = pd.Series(np.dot(X, beta), name='y') + sp.stats.norm.rvs(scale=noise_std, size=N)
y.head()
0 7.862559
1 3.830585
2 -0.388246
3 1.047091
4 0.992082
Name: y, dtype: float64
Each element of the series y corresponds to one row in the uncompressed data frame df. The OLS class from statsmodels comes quite close to recovering the true regression coefficients.
full_ols = sm.OLS(y, X).fit()
full_ols.params
const -2.999658
x1 0.099986
x2 -3.998997
x3 2.000317
dtype: float64
To show how we can perform linear regression on the summarized data frame, we recall that the ordinary least squares estimator minimizes the residual sum of squares. The residual sum of squares is given by
\[ \begin{align*} RSS & = \sum_{i = 1}^n \left(y_i - \mathbf{x}_i \mathbf{\beta}^{\intercal}\right)^2. \end{align*} \]
Here \(\mathbf{x}_i = [1\ u_i\ v_i\ w_i]\) is the \(i\)-th row of the original data frame (with a constant added for the intercept) and \(\mathbf{\beta} = [\beta_0\ \beta_1\ \beta_2\ \beta_3]\) is the row vector of regression coefficients. It would be tempting to rewrite \(RSS\) by grouping the terms based on the row their features map to in the compressed data frame, but this approach would lead to incorrect results. Due to the stochastic noise term \(\varepsilon_i\), identical values of \(u\), \(v\), and \(w\) can (and will almost certainly) map to different values of \(y\). We can see this phenomenon by calculating the range of \(y\) grouped on \(u\), \(v\), and \(w\).
reg_df = pd.concat((y, df), axis=1)

reg_df.groupby(['u', 'v', 'w']).y.apply(np.ptp).describe()
count 9997.000000
mean 0.297891
std 0.091815
min 0.000000
25% 0.237491
50% 0.296838
75% 0.358015
max 0.703418
Name: y, dtype: float64
If \(y\) were uniquely determined by \(u\), \(v\), and \(w\), we would expect the mean and quartiles of these ranges to be zero, which they are not. Fortunately, we can account for this difficulty with a bit of care.
Let \(S_j = \{i\ |\ \mathbf{x}_i = \mathbf{z}_j\}\), the set of row indices in the original data frame that correspond to the \(j\)-th row in the summary data frame. Define \(\bar{y}_{(j)} = \frac{1}{n_j} \sum_{i \in S_j} y_i\), which is the mean of the response variables that correspond to \(\mathbf{z}_j\). Intuitively, since \(\varepsilon_i\) has mean zero, \(\bar{y}_{(j)}\) is our best unbiased estimate of \(\mathbf{z}_j \mathbf{\beta}^{\intercal}\). We will now show that regressing \(\sqrt{n_j} \bar{y}_{(j)}\) on \(\sqrt{n_j} \mathbf{z}_j\) gives the same results as the full regression. We use the standard trick of adding and subtracting the mean and get
\[ \begin{align*} RSS & = \sum_{j = 1}^m \sum_{i \in S_j} \left(y_i - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2 \\ & = \sum_{j = 1}^m \sum_{i \in S_j} \left(\left(y_i - \bar{y}_{(j)}\right) + \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)\right)^2 \\ & = \sum_{j = 1}^m \sum_{i \in S_j} \left(\left(y_i - \bar{y}_{(j)}\right)^2 + 2 \left(y_i - \bar{y}_{(j)}\right) \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right) + \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2\right). \end{align*} \]
As is usual in these situations, the cross term vanishes, since
\[ \begin{align*} \sum_{i \in S_j} \left(y_i - \bar{y}_{(j)}\right) \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right) & = \sum_{i \in S_j} \left(y_i \bar{y}_{(j)} - y_i \mathbf{z}_j \mathbf{\beta}^{\intercal} - \bar{y}_{(j)}^2 + \bar{y}_{(j)} \mathbf{z}_j \mathbf{\beta}^{\intercal}\right) \\ & = \bar{y}_{(j)} \sum_{i \in S_j} y_i - \mathbf{z}_j \mathbf{\beta}^{\intercal} \sum_{i \in S_j} y_i - n_j \bar{y}_{(j)}^2 + n_j \bar{y}_{(j)} \mathbf{z}_j \mathbf{\beta}^{\intercal} \\ & = n_j \bar{y}_{(j)}^2 - n_j \bar{y}_{(j)} \mathbf{z}_j \mathbf{\beta}^{\intercal} - n_j \bar{y}_{(j)}^2 + n_j \bar{y}_{(j)} \mathbf{z}_j \mathbf{\beta}^{\intercal} \\ & = 0. \end{align*} \]
Therefore we may decompose the residual sum of squares as
\[ \begin{align*} RSS & = \sum_{j = 1}^m \sum_{i \in S_j} \left(y_i - \bar{y}_{(j)}\right)^2 + \sum_{j = 1}^m \sum_{i \in S_j} \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2 \\ & = \sum_{j = 1}^m \sum_{i \in S_j} \left(y_i - \bar{y}_{(j)}\right)^2 + \sum_{j = 1}^m n_j \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2. \end{align*} \]
The important property of this decomposition is that the first sum does not depend on \(\mathbf{\beta}\), so minimizing \(RSS\) with respect to \(\mathbf{\beta}\) is equivalent to minimizing the second sum. We see that this second sum can be written as
\[ \begin{align*} \sum_{j = 1}^m n_j \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2 & = \sum_{j = 1}^m \left(\sqrt{n_j} \bar{y}_{(j)} - \sqrt{n_j} \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2 \end{align*}, \]
which is exactly the residual sum of squares for regressing \(\sqrt{n_j} \bar{y}_{(j)}\) on \(\sqrt{n_j} \mathbf{z}_j\).
summ_reg_df = reg_df.groupby(['u', 'v', 'w']).y.mean().reset_index().iloc[shuffled_ixs].reset_index(drop=True).copy()
summ_reg_df['n'] = n
summ_reg_df.head()
|   | u  | v | w         | y         | n  |
|---|----|---|-----------|-----------|----|
| 0 | 0  | 0 | 0.425597  | -2.173984 | 14 |
| 1 | 48 | 1 | -0.993981 | -4.174895 | 7  |
| 2 | 35 | 0 | 0.358156  | 1.252848  | 9  |
| 3 | 19 | 1 | -0.760298 | -6.612355 | 17 |
| 4 | 40 | 1 | -0.688514 | -4.379063 | 13 |
The design matrices for this summarized model are easy to construct using patsy.
y_summ, X_summ = dmatrices("""
    I(np.sqrt(n) * y) ~
        np.sqrt(n) + I(np.sqrt(n) * u) + I(np.sqrt(n) * v) + I(np.sqrt(n) * w) - 1
    """,
    data=summ_reg_df)
Note that we must remove patsy’s constant column for the intercept and replace it with np.sqrt(n).
summ_ols = sm.OLS(y_summ, X_summ).fit()
summ_ols.params
array([-2.99965783, 0.09998571, -3.99899718, 2.00031673])
We see that the summarized regression produces the same parameter estimates as the full regression.
np.allclose(full_ols.params, summ_ols.params)
True
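As an aside, not part of the original derivation: minimizing \(\sum_{j} n_j \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2\) is a weighted least squares problem, so statsmodels’ WLS class with weights n should produce the same coefficient estimates (standard errors need more care, since the summarized fit sees far fewer rows). A minimal sketch, with X_wls and summ_wls as illustrative names:

# weighted least squares on the unscaled summarized design matrix;
# the coefficient estimates should agree with full_ols.params
X_wls = dmatrix('u + v + w', data=summ_reg_df)
summ_wls = sm.WLS(summ_reg_df.y, X_wls, weights=summ_reg_df.n).fit()

summ_wls.params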
Logistic Regression
As a final example of adapting common methods to summarized data frames, we will show how to fit a logistic regression model on a summarized data set by maximum likelihood.
We will use the model
\[P(s = 1\ |\ \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{x} \gamma^{\intercal})}.\]
As above, \(\mathbf{x}_i = [1\ u_i\ v_i\ w_i]\). The true value of \(\gamma\) is
gamma = np.array([1., 0.01, -1., -2.])
We now generate samples from this model.
X = dmatrix('u + v + w', data=df)

p = pd.Series(sp.special.expit(np.dot(X, gamma)), name='p')
s = pd.Series(sp.stats.bernoulli.rvs(p), name='s')

logit_df = pd.concat((s, p, df), axis=1)
logit_df.head()
|   | s | p        | u  | v | w         |
|---|---|----------|----|---|-----------|
| 0 | 1 | 0.709963 | 97 | 0 | 0.537397  |
| 1 | 0 | 0.092560 | 79 | 1 | 1.536383  |
| 2 | 0 | 0.153211 | 44 | 1 | 1.074817  |
| 3 | 1 | 0.923609 | 51 | 0 | -0.491210 |
| 4 | 0 | 0.062077 | 47 | 1 | 1.592646  |
We first fit the logistic regression model to the full data frame.
full_logit = sm.Logit(s, X).fit()
Optimization terminated successfully.
Current function value: 0.414221
Iterations 7
full_logit.params
const 0.965283
x1 0.009944
x2 -0.966797
x3 -1.990506
dtype: float64
We see that the estimates are quite close to the true parameters.
The technique used to adapt maximum likelihood estimation of logistic regression to the summarized data frame is quite elegant. The likelihood for the full data set is given by the fact that (given \(u\), \(v\), and \(w\)) \(s\) is Bernoulli distributed with
\[s_i\ |\ \mathbf{x}_i \sim \operatorname{Ber}\left(\frac{1}{1 + \exp(-\mathbf{x}_i \gamma^{\intercal})}\right).\]
To derive the likelihood for the summarized data set, we count the number of successes (where \(s = 1\)) for each unique combination of features \(\mathbf{z}_j\), and denote this quantity \(k_j\).
summ_logit_df = logit_df.groupby(['u', 'v', 'w']).s.sum().reset_index().iloc[shuffled_ixs].reset_index(drop=True).copy()
summ_logit_df = summ_logit_df.rename(columns={'s': 'k'})
summ_logit_df['n'] = n
summ_logit_df.head()
|   | u  | v | w         | k  | n  |
|---|----|---|-----------|----|----|
| 0 | 0  | 0 | 0.425597  | 5  | 14 |
| 1 | 48 | 1 | -0.993981 | 7  | 7  |
| 2 | 35 | 0 | 0.358156  | 8  | 9  |
| 3 | 19 | 1 | -0.760298 | 12 | 17 |
| 4 | 40 | 1 | -0.688514 | 9  | 13 |
Now, instead of each row representing a single Bernoulli trial (as in the full data frame), each row represents \(n_j\) trials, so we have that \(k_j\) is (conditionally) Binomially distributed with
\[k_j\ |\ \mathbf{z}_j \sim \operatorname{Bin}\left(n_j, \frac{1}{1 + \exp(-\mathbf{z}_j \gamma^{\intercal})}\right).\]
summ_logit_X = dmatrix('u + v + w', data=summ_logit_df)
As I have shown in a previous post, we can use statsmodels’ GenericLikelihoodModel class to fit custom probability models by maximum likelihood. The model is implemented as follows.
class SummaryLogit(GenericLikelihoodModel):
    def __init__(self, endog, exog, n, **qwargs):
        """
        endog is the number of successes
        exog are the features
        n are the number of trials
        """
        self.n = n
        super(SummaryLogit, self).__init__(endog, exog, **qwargs)

    def nloglikeobs(self, gamma):
        """
        gamma is the vector of regression coefficients

        returns the negative log likelihood of each of the observations for
        the coefficients in gamma
        """
        p = sp.special.expit(np.dot(self.exog, gamma))

        return -sp.stats.binom.logpmf(self.endog, self.n, p)

    def fit(self, start_params=None, maxiter=10000, maxfun=5000, **qwargs):
        # wraps the GenericLikelihoodModel's fit method to set default start parameters
        if start_params is None:
            start_params = np.zeros(self.exog.shape[1])

        return super(SummaryLogit, self).fit(start_params=start_params,
                                             maxiter=maxiter, maxfun=maxfun, **qwargs)
summ_logit = SummaryLogit(summ_logit_df.k, summ_logit_X, summ_logit_df.n).fit()
Optimization terminated successfully.
Current function value: 1.317583
Iterations: 357
Function evaluations: 599
Again, we get reasonable estimates of the regression coefficients, which are close to those obtained from the full data set.
summ_logit.params
array([ 0.96527992, 0.00994322, -0.96680904, -1.99051485])
np.allclose(summ_logit.params, full_logit.params, rtol=10**-4)
True
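As a side note (an equivalent formulation rather than the approach used above): statsmodels can also fit this summarized model without a custom likelihood, since its GLM class with a Binomial family accepts a two-column response of successes and failures. A minimal sketch, with endog_counts and summ_glm as illustrative names:

# the same binomial model via a GLM with a (successes, failures) response
endog_counts = np.column_stack((summ_logit_df.k, summ_logit_df.n - summ_logit_df.k))
summ_glm = sm.GLM(endog_counts, summ_logit_X, family=sm.families.Binomial()).fit()

summ_glm.params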
Conclusion
Hopefully this introduction to the technique of summarizing data sets has proved useful and will allow you to explore medium data more easily in the future. We have only scratched the surface on the types of statistical techniques that can be adapted to work on summarized data sets, but with a bit of ingenuity, many of the ideas in this post can apply to other models.