My last post showed how to use Dirichlet processes and `pymc3` to perform Bayesian nonparametric density estimation. This post expands on the previous one, illustrating dependent density regression with `pymc3`.

Just as Dirichlet process mixtures can be thought of as infinite mixture models that select the number of active components as part of inference, dependent density regression can be thought of as an infinite mixture of experts that selects the active experts as part of inference. Their flexibility and modularity make them powerful tools for performing nonparametric Bayesian data analysis.

```
%matplotlib inline
from IPython.display import HTML
```

```
from matplotlib import animation as ani, pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
from theano import shared, tensor as tt
```

```
plt.rc('animation', writer='avconv')
blue, *_ = sns.color_palette()
```

```
SEED = 972915 # from random.org; for reproducibility
np.random.seed(SEED)
```

Throughout this post, we will use the LIDAR data set from Larry Wasserman’s excellent book, *All of Nonparametric Statistics*. We standardize the data set to improve the rate of convergence of our samples.

```
DATA_URI = 'http://www.stat.cmu.edu/~larry/all-of-nonpar/=data/lidar.dat'

def standardize(x):
    return (x - x.mean()) / x.std()

df = (pd.read_csv(DATA_URI, sep=' *', engine='python')
        .assign(std_range=lambda df: standardize(df.range),
                std_logratio=lambda df: standardize(df.logratio)))
```

`df.head()`

| | range | logratio | std_logratio | std_range |
|---|---|---|---|---|
| 0 | 390 | -0.050356 | 0.852467 | -1.717725 |
| 1 | 391 | -0.060097 | 0.817981 | -1.707299 |
| 2 | 393 | -0.041901 | 0.882398 | -1.686447 |
| 3 | 394 | -0.050985 | 0.850240 | -1.676020 |
| 4 | 396 | -0.059913 | 0.818631 | -1.655168 |

We plot the LIDAR data below.

```
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(df.std_range, df.std_logratio,
           c=blue);

ax.set_xticklabels([]);
ax.set_xlabel("Standardized range");
ax.set_yticklabels([]);
ax.set_ylabel("Standardized log ratio");
```

This data set has two interesting properties that make it useful for illustrating dependent density regression.

- The relationship between range and log ratio is nonlinear, but has locally linear components.
- The observation noise is heteroskedastic; that is, the magnitude of the variance varies with the range.

The intuitive idea behind dependent density regression is to reduce the problem to many (related) density estimates, conditioned on fixed values of the predictors. The following animation illustrates this intuition.

```
fig, (scatter_ax, hist_ax) = plt.subplots(ncols=2, figsize=(16, 6))

scatter_ax.scatter(df.std_range, df.std_logratio,
                   c=blue, zorder=2);
scatter_ax.set_xticklabels([]);
scatter_ax.set_xlabel("Standardized range");
scatter_ax.set_yticklabels([]);
scatter_ax.set_ylabel("Standardized log ratio");

bins = np.linspace(df.std_range.min(), df.std_range.max(), 25)
hist_ax.hist(df.std_logratio, bins=bins,
             color='k', lw=0, alpha=0.25,
             label="All data");
hist_ax.set_xticklabels([]);
hist_ax.set_xlabel("Standardized log ratio");
hist_ax.set_yticklabels([]);
hist_ax.set_ylabel("Frequency");
hist_ax.legend(loc=2);

endpoints = np.linspace(1.05 * df.std_range.min(), 1.05 * df.std_range.max(), 15)
frame_artists = []

for low, high in zip(endpoints[:-1], endpoints[2:]):
    interval = scatter_ax.axvspan(low, high,
                                  color='k', alpha=0.5, lw=0, zorder=1);
    *_, bars = hist_ax.hist(df[df.std_range.between(low, high)].std_logratio,
                            bins=bins,
                            color='k', lw=0, alpha=0.5);
    frame_artists.append((interval,) + tuple(bars))

animation = ani.ArtistAnimation(fig, frame_artists,
                                interval=500, repeat_delay=3000, blit=True)
plt.close();  # prevent the intermediate figure from showing
```

`HTML(animation.to_html5_video())`

As we slice the data with a window sliding along the x-axis in the left plot, the empirical distribution of the y-values of the points in the window varies in the right plot. An important aspect of this approach is that the density estimates that correspond to close values of the predictor are similar.

In the previous post, we saw that a Dirichlet process estimates a probability density as a mixture model with infinitely many components. In the case of normal component distributions,

\[ y \sim \sum_{i = 1}^{\infty} w_i \cdot N(\mu_i, \tau_i^{-1}), \]

where the mixture weights, \(w_1, w_2, \ldots\), are generated by a stick-breaking process.
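
As a quick refresher from that post, such weights can be generated by drawing stick-breaking proportions from a \(\textrm{Beta}(1, \alpha)\) distribution. A minimal sketch, truncated at an illustrative number of components \(K\) with an arbitrary \(\alpha\):

```python
import numpy as np

# Truncated stick-breaking sketch; alpha and K are illustrative choices
rng = np.random.RandomState(0)
alpha, K = 1., 20

beta = rng.beta(1., alpha, size=K)                      # stick-breaking proportions
remaining = np.concatenate([[1.], np.cumprod(1. - beta)[:-1]])
w = beta * remaining                                    # w_i = beta_i * prod_{j<i} (1 - beta_j)
```

The truncated weights are nonnegative and have total mass at most one, with the leftover mass belonging to the components we discarded.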

Dependent density regression generalizes this representation of the Dirichlet process mixture model by allowing the mixture weights and component means to vary conditioned on the value of the predictor, \(x\). That is,

\[ y\ |\ x \sim \sum_{i = 1}^{\infty} w_i\ |\ x \cdot N(\mu_i\ |\ x, \tau_i^{-1}). \]

In this post, we will follow Chapter 23 of *Bayesian Data Analysis* and use a probit stick-breaking process to determine the conditional mixture weights, \(w_i\ |\ x\). The probit stick-breaking process starts by defining

\[ v_i\ |\ x = \Phi(\alpha_i + \beta_i x), \]

where \(\Phi\) is the cumulative distribution function of the standard normal distribution. We then obtain \(w_i\ |\ x\) by applying the stick breaking process to \(v_i\ |\ x\). That is,

\[ w_i\ |\ x = v_i\ |\ x \cdot \prod_{j = 1}^{i - 1} (1 - v_j\ |\ x). \]
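
As a numerical sketch of these two steps (the intercepts and slopes below are made-up values, not fitted quantities), the conditional weights at a given \(x\) can be computed directly with `scipy`:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical intercepts alpha_i and slopes beta_i for a 5-component truncation
alphas = np.array([0.0, -1.0, 0.5, 1.0, -0.5])
betas = np.array([1.0, 0.5, -1.0, 0.0, 2.0])

def conditional_weights(x):
    """Probit stick-breaking weights w_i | x at a single predictor value x."""
    v = norm.cdf(alphas + betas * x)                    # v_i | x = Phi(alpha_i + beta_i x)
    remaining = np.concatenate([[1.], np.cumprod(1. - v)[:-1]])
    return v * remaining                                # w_i | x
```

Because the proportions depend on \(x\), the mixture weights vary smoothly as the predictor changes, which is exactly what lets nearby densities resemble each other.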

For the LIDAR data set, we use independent normal priors \(\alpha_i \sim N(0, 5^2)\) and \(\beta_i \sim N(0, 5^2)\). We now express this model for the conditional mixture weights using `pymc3`.

```
def norm_cdf(z):
    return 0.5 * (1 + tt.erf(z / np.sqrt(2)))

def stick_breaking(v):
    return v * tt.concatenate([tt.ones_like(v[:, :1]),
                               tt.extra_ops.cumprod(1 - v, axis=1)[:, :-1]],
                              axis=1)
```

```
N, _ = df.shape
K = 20

std_range = df.std_range.values[:, np.newaxis]
std_logratio = df.std_logratio.values[:, np.newaxis]

x_lidar = shared(std_range, broadcastable=(False, True))

with pm.Model() as model:
    alpha = pm.Normal('alpha', 0., 5., shape=K)
    beta = pm.Normal('beta', 0., 5., shape=K)

    v = norm_cdf(alpha + beta * x_lidar)
    w = pm.Deterministic('w', stick_breaking(v))
```

We have defined `x_lidar` as a `theano` `shared` variable in order to use `pymc3`’s posterior prediction capabilities later.

While the dependent density regression model theoretically has infinitely many components, we must truncate the model to finitely many components (in this case, twenty) in order to express it using `pymc3`. After sampling from the model, we will verify that truncation did not unduly influence our results.

Since the LIDAR data seems to have several linear components, we use the linear models

\[ \begin{align*} \mu_i\ |\ x & = \gamma_i + \delta_i x \\ \gamma_i & \sim N(0, 10^2) \\ \delta_i & \sim N(0, 10^2) \end{align*} \]

for the conditional component means.

```
with model:
    gamma = pm.Normal('gamma', 0., 10., shape=K)
    delta = pm.Normal('delta', 0., 10., shape=K)

    mu = pm.Deterministic('mu', gamma + delta * x_lidar)
```

Finally, we place the prior \(\tau_i \sim \textrm{Gamma}(1, 1)\) on the component precisions.

```
with model:
    tau = pm.Gamma('tau', 1., 1., shape=K)
    obs = pm.NormalMixture('obs', w, mu, tau=tau, observed=std_logratio)
```

We now draw samples from the dependent density regression model.

```
SAMPLES = 20000
BURN = 10000
THIN = 10

with model:
    step = pm.Metropolis()
    trace_ = pm.sample(SAMPLES, step, random_seed=SEED)

trace = trace_[BURN::THIN]
```

`100%|██████████| 20000/20000 [01:30<00:00, 204.48it/s]`

To verify that truncation did not unduly influence our results, we plot the largest posterior expected mixture weight for each component. (In this model, each point has a mixture weight for each component, so we plot the maximum mixture weight for each component across all data points in order to judge if the component exerts any influence on the posterior.)

```
fig, ax = plt.subplots(figsize=(8, 6))

ax.bar(np.arange(K) + 1 - 0.4,
       trace['w'].mean(axis=0).max(axis=0));

ax.set_xlim(1 - 0.5, K + 0.5);
ax.set_xlabel('Mixture component');
ax.set_ylabel('Largest posterior expected\nmixture weight');
```

Since only three mixture components have appreciable posterior expected weight for any data point, we can be fairly certain that truncation did not unduly influence our results. (If most components had appreciable posterior expected weight, truncation may have influenced the results, and we would have increased the number of components and sampled again.)

Visually, it is reasonable that the LIDAR data has three linear components, so these posterior expected weights seem to have identified the structure of the data well. We now sample from the posterior predictive distribution to get a better understanding of the model’s performance.

```
PP_SAMPLES = 5000

lidar_pp_x = np.linspace(std_range.min() - 0.05, std_range.max() + 0.05, 100)
x_lidar.set_value(lidar_pp_x[:, np.newaxis])

with model:
    pp_trace = pm.sample_ppc(trace, PP_SAMPLES, random_seed=SEED)
```

`100%|██████████| 5000/5000 [01:18<00:00, 66.54it/s]`

Below we plot the posterior expected value and the 95% posterior credible interval.

```
fig, ax = plt.subplots()

ax.scatter(df.std_range, df.std_logratio,
           c=blue, zorder=10,
           label=None);

low, high = np.percentile(pp_trace['obs'], [2.5, 97.5], axis=0)
ax.fill_between(lidar_pp_x, low, high,
                color='k', alpha=0.35, zorder=5,
                label='95% posterior credible interval');
ax.plot(lidar_pp_x, pp_trace['obs'].mean(axis=0),
        c='k', zorder=6,
        label='Posterior expected value');

ax.set_xticklabels([]);
ax.set_xlabel('Standardized range');
ax.set_yticklabels([]);
ax.set_ylabel('Standardized log ratio');
ax.legend(loc=1);
ax.set_title('LIDAR Data');
```

The model has fit the linear components of the data well, and also accommodated its heteroskedasticity. This flexibility, along with the ability to modularly specify the conditional mixture weights and conditional component densities, makes dependent density regression an extremely useful nonparametric Bayesian model.

To learn more about dependent density regression and related models, consult *Bayesian Data Analysis*, *Bayesian Nonparametric Data Analysis*, or *Bayesian Nonparametrics*.

This post is available as a Jupyter notebook here.

I have been intrigued by the flexibility of nonparametric statistics for many years. As I have developed an understanding and appreciation of Bayesian modeling both personally and professionally over the last two or three years, I naturally developed an interest in Bayesian nonparametric statistics. I am pleased to begin a planned series of posts on Bayesian nonparametrics with this post on Dirichlet process mixtures for density estimation.

The Dirichlet process is a flexible probability distribution over the space of distributions. Most generally, a probability distribution, \(P\), on a set \(\Omega\) is a measure that assigns measure one to the entire space (\(P(\Omega) = 1\)). A Dirichlet process \(P \sim \textrm{DP}(\alpha, P_0)\) is a measure that has the property that, for every finite disjoint partition \(S_1, \ldots, S_n\) of \(\Omega\),

\[(P(S_1), \ldots, P(S_n)) \sim \textrm{Dir}(\alpha P_0(S_1), \ldots, \alpha P_0(S_n)).\]

Here \(P_0\) is the base probability measure on the space \(\Omega\). The precision parameter \(\alpha > 0\) controls how close samples from the Dirichlet process are to the base measure, \(P_0\). As \(\alpha \to \infty\), samples from the Dirichlet process approach the base measure \(P_0\).

Dirichlet processes have several properties that make them quite suitable for MCMC simulation.

The posterior given i.i.d. observations \(\omega_1, \ldots, \omega_n\) from a Dirichlet process \(P \sim \textrm{DP}(\alpha, P_0)\) is also a Dirichlet process with

\[P\ |\ \omega_1, \ldots, \omega_n \sim \textrm{DP}\left(\alpha + n, \frac{\alpha}{\alpha + n} P_0 + \frac{1}{\alpha + n} \sum_{i = 1}^n \delta_{\omega_i}\right),\]

where \(\delta\) is the Dirac delta measure

\[\begin{align*} \delta_{\omega}(S) & = \begin{cases} 1 & \textrm{if } \omega \in S \\ 0 & \textrm{if } \omega \not \in S \end{cases} \end{align*}.\]

The posterior predictive distribution of a new observation is a compromise between the base measure and the observations,

\[\omega\ |\ \omega_1, \ldots, \omega_n \sim \frac{\alpha}{\alpha + n} P_0 + \frac{1}{\alpha + n} \sum_{i = 1}^n \delta_{\omega_i}.\]

We see that the prior precision \(\alpha\) can naturally be interpreted as a prior sample size. The form of this posterior predictive distribution also lends itself to Gibbs sampling.
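
A draw from this posterior predictive distribution is easy to sketch: with probability \(\frac{\alpha}{\alpha + n}\) we draw a fresh sample from the base measure, and otherwise we resample one of the existing observations. The values of \(\alpha\), the observations, and the base measure \(P_0 = N(0, 1)\) below are purely illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
alpha = 2.                                   # illustrative prior precision
observed = np.array([-0.3, 0.1, 1.2])        # hypothetical observations

def posterior_predictive_draw():
    """With probability alpha / (alpha + n), draw from the base measure N(0, 1);
    otherwise resample one of the existing observations uniformly."""
    n = len(observed)
    if rng.uniform() < alpha / (alpha + n):
        return rng.normal(0., 1.)            # fresh draw from P0
    return rng.choice(observed)              # point mass on a past observation

draws = np.array([posterior_predictive_draw() for _ in range(5000)])
# With n = 3, a fraction of roughly alpha / (alpha + n) = 0.4 of the draws
# should come from the base measure, and about 0.6 from the point masses
frac_resampled = np.isin(draws, observed).mean()
```

This "rich get richer" resampling step is exactly the mechanism exploited by Gibbs samplers for Dirichlet process models.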

Samples, \(P \sim \textrm{DP}(\alpha, P_0)\), from a Dirichlet process are discrete with probability one. That is, there are elements \(\omega_1, \omega_2, \ldots\) in \(\Omega\) and weights \(w_1, w_2, \ldots\) with \(\sum_{i = 1}^{\infty} w_i = 1\) such that

\[P = \sum_{i = 1}^\infty w_i \delta_{\omega_i}.\]

- The stick-breaking process gives an explicit construction of the weights \(w_i\) and samples \(\omega_i\) above that is straightforward to sample from. If \(\beta_1, \beta_2, \ldots \sim \textrm{Beta}(1, \alpha)\), then \(w_i = \beta_i \prod_{j = 1}^{i - 1} (1 - \beta_j)\). The relationship between this representation and stick breaking may be illustrated as follows:
  - Start with a stick of length one.
  - Break the stick into two portions, the first of proportion \(w_1 = \beta_1\) and the second of proportion \(1 - w_1\).
  - Further break the second portion into two portions, the first of proportion \(\beta_2\) and the second of proportion \(1 - \beta_2\). The length of the first portion of this stick is \(\beta_2 (1 - \beta_1)\); the length of the second portion is \((1 - \beta_1) (1 - \beta_2)\).
  - Continue breaking the second portion from the previous break in this manner forever.

If \(\omega_1, \omega_2, \ldots \sim P_0\), then

\[P = \sum_{i = 1}^\infty w_i \delta_{\omega_i} \sim \textrm{DP}(\alpha, P_0).\]

We can use the stick-breaking process above to easily sample from a Dirichlet process in Python. For this example, \(\alpha = 2\) and the base distribution is \(N(0, 1)\).

`%matplotlib inline`

`from __future__ import division`

```
from matplotlib import pyplot as plt
import numpy as np
import pymc3 as pm
import scipy as sp
import seaborn as sns
from statsmodels.datasets import get_rdataset
from theano import tensor as T
```

`Couldn't import dot_parser, loading of dot files will not be possible.`

`blue = sns.color_palette()[0]`

`np.random.seed(462233) # from random.org`

```
N = 20
K = 30
alpha = 2.
P0 = sp.stats.norm
```

We draw and plot samples from the stick-breaking process.

```
beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)
omega = P0.rvs(size=(N, K))
x_plot = np.linspace(-3, 3, 200)
sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot)).sum(axis=1)
```

```
fig, ax = plt.subplots(figsize=(8, 6))

ax.plot(x_plot, sample_cdfs[0], c='gray', alpha=0.75,
        label='DP sample CDFs');
ax.plot(x_plot, sample_cdfs[1:].T, c='gray', alpha=0.75);
ax.plot(x_plot, P0.cdf(x_plot), c='k', label='Base CDF');

ax.set_title(r'$\alpha = {}$'.format(alpha));
ax.legend(loc=2);
```

As stated above, as \(\alpha \to \infty\), samples from the Dirichlet process converge to the base distribution.

```
fig, (l_ax, r_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))

K = 50
alpha = 10.

beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)

omega = P0.rvs(size=(N, K))

sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot)).sum(axis=1)

l_ax.plot(x_plot, sample_cdfs[0], c='gray', alpha=0.75,
          label='DP sample CDFs');
l_ax.plot(x_plot, sample_cdfs[1:].T, c='gray', alpha=0.75);
l_ax.plot(x_plot, P0.cdf(x_plot), c='k', label='Base CDF');

l_ax.set_title(r'$\alpha = {}$'.format(alpha));
l_ax.legend(loc=2);

K = 200
alpha = 50.

beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)

omega = P0.rvs(size=(N, K))

sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot)).sum(axis=1)

r_ax.plot(x_plot, sample_cdfs[0], c='gray', alpha=0.75,
          label='DP sample CDFs');
r_ax.plot(x_plot, sample_cdfs[1:].T, c='gray', alpha=0.75);
r_ax.plot(x_plot, P0.cdf(x_plot), c='k', label='Base CDF');

r_ax.set_title(r'$\alpha = {}$'.format(alpha));
r_ax.legend(loc=2);
```

For the task of density estimation, the (almost sure) discreteness of samples from the Dirichlet process is a significant drawback. This problem can be solved with another level of indirection by using Dirichlet process mixtures for density estimation. A Dirichlet process mixture uses component densities from a parametric family \(\mathcal{F} = \{f_{\theta}\ |\ \theta \in \Theta\}\) and represents the mixture weights as a Dirichlet process. If \(P_0\) is a probability measure on the parameter space \(\Theta\), a Dirichlet process mixture is the hierarchical model

\[ \begin{align*} x_i\ |\ \theta_i & \sim f_{\theta_i} \\ \theta_1, \ldots, \theta_n & \sim P \\ P & \sim \textrm{DP}(\alpha, P_0). \end{align*} \]

To illustrate this model, we simulate draws from a Dirichlet process mixture with \(\alpha = 2\), \(\theta \sim N(0, 1)\), \(x\ |\ \theta \sim N(\theta, (0.3)^2)\).

```
N = 5
K = 30
alpha = 2
P0 = sp.stats.norm
f = lambda x, theta: sp.stats.norm.pdf(x, theta, 0.3)
```

```
beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)
theta = P0.rvs(size=(N, K))
dpm_pdf_components = f(x_plot[np.newaxis, np.newaxis, :], theta[..., np.newaxis])
dpm_pdfs = (w[..., np.newaxis] * dpm_pdf_components).sum(axis=1)
```

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x_plot, dpm_pdfs.T, c='gray');
ax.set_yticklabels([]);
```

We now focus on a single mixture and decompose it into its individual (weighted) mixture components.

```
fig, ax = plt.subplots(figsize=(8, 6))

ix = 1

ax.plot(x_plot, dpm_pdfs[ix], c='k', label='Density');
ax.plot(x_plot, (w[..., np.newaxis] * dpm_pdf_components)[ix, 0],
        '--', c='k', label='Mixture components (weighted)');
ax.plot(x_plot, (w[..., np.newaxis] * dpm_pdf_components)[ix].T,
        '--', c='k');

ax.set_yticklabels([]);
ax.legend(loc=1);
```

Sampling from these stochastic processes is fun, but these ideas become truly useful when we fit them to data. The discreteness of samples and the stick-breaking representation of the Dirichlet process lend themselves nicely to Markov chain Monte Carlo simulation of posterior distributions. We will perform this sampling using `pymc3`.

Our first example uses a Dirichlet process mixture to estimate the density of waiting times between eruptions of the Old Faithful geyser in Yellowstone National Park.

`old_faithful_df = get_rdataset('faithful', cache=True).data[['waiting']]`

For convenience in specifying the prior, we standardize the waiting time between eruptions.

`old_faithful_df['std_waiting'] = (old_faithful_df.waiting - old_faithful_df.waiting.mean()) / old_faithful_df.waiting.std()`

`old_faithful_df.head()`

| | waiting | std_waiting |
|---|---|---|
| 0 | 79 | 0.596025 |
| 1 | 54 | -1.242890 |
| 2 | 74 | 0.228242 |
| 3 | 62 | -0.654437 |
| 4 | 85 | 1.037364 |

```
fig, ax = plt.subplots(figsize=(8, 6))
n_bins = 20
ax.hist(old_faithful_df.std_waiting, bins=n_bins, color=blue, lw=0, alpha=0.5);
ax.set_xlabel('Standardized waiting time between eruptions');
ax.set_ylabel('Number of eruptions');
```

Observant readers will have noted that we have not been continuing the stick-breaking process indefinitely as indicated by its definition, but rather have been truncating this process after a finite number of breaks. Obviously, when computing with Dirichlet processes, it is necessary to only store a finite number of its point masses and weights in memory. This restriction is not terribly onerous, since with a finite number of observations, it seems quite likely that the number of mixture components that contribute non-negligible mass to the mixture will grow more slowly than the number of samples. This intuition can be formalized to show that the (expected) number of components that contribute non-negligible mass to the mixture approaches \(\alpha \log N\), where \(N\) is the sample size.
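
For a rough sense of scale, we can plug in an illustrative precision \(\alpha = 1\) and a sample of a few hundred observations:

```python
import numpy as np

# Expected number of non-negligible components grows like alpha * log(N)
alpha = 1.          # illustrative precision
N = 272             # a sample of a few hundred observations
expected_components = alpha * np.log(N)   # about 5.6
```

Even for moderately large data sets, only a handful of components are expected to matter, which is what makes truncation practical.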

There are various clever Gibbs sampling techniques for Dirichlet processes that allow the number of components stored to grow as needed. Stochastic memoization is another powerful technique for simulating Dirichlet processes while only storing finitely many components in memory. In this introductory post, we take the much less sophisticated approach of simply truncating the Dirichlet process after a fixed number, \(K\), of components. Importantly, this approach is compatible with some of `pymc3`’s (current) technical limitations. Ohlssen, et al. provide justification for truncation, showing that \(K > 5\alpha + 2\) is most likely sufficient to capture almost all of the mixture weight (\(\sum_{i = 1}^{K} w_i > 0.99\)). We can practically verify the suitability of our truncated approximation to the Dirichlet process by checking the number of components that contribute non-negligible mass to the mixture. If, in our simulations, all components contribute non-negligible mass to the mixture, we have truncated our Dirichlet process too early.
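
We can sanity-check this rule of thumb in expectation: since \(\mathbb{E}[1 - \beta_i] = \frac{\alpha}{1 + \alpha}\) for \(\beta_i \sim \textrm{Beta}(1, \alpha)\), the expected stick mass left over after \(K\) breaks is \(\left(\frac{\alpha}{1 + \alpha}\right)^K\). A minimal check with \(\alpha = 1\):

```python
# Expected mass captured by the first K stick-breaking weights
alpha = 1.
K = 8                                        # just above 5 * alpha + 2

expected_mass = 1. - (alpha / (1. + alpha)) ** K
# 1 - 0.5 ** 8 = 0.99609..., comfortably above the 0.99 threshold
```

This heuristic only bounds the weights in expectation; checking the number of components actually used by the sampler, as we do below, remains the practical test.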

Our Dirichlet process mixture model for the standardized waiting times is

\[ \begin{align*} x_i\ |\ \mu_i, \lambda_i, \tau_i & \sim N\left(\mu_i, (\lambda_i \tau_i)^{-1}\right) \\ \mu_i\ |\ \lambda_i, \tau_i & \sim N\left(0, (\lambda_i \tau_i)^{-1}\right) \\ (\lambda_1, \tau_1), (\lambda_2, \tau_2), \ldots & \sim P \\ P & \sim \textrm{DP}(\alpha, U(0, 5) \times \textrm{Gamma}(1, 1)) \\ \alpha & \sim \textrm{Gamma}(1, 1). \end{align*} \]

Note that instead of fixing a value of \(\alpha\), as in our previous simulations, we specify a prior on \(\alpha\), so that we may learn its posterior distribution from the observations. This model is therefore actually a mixture of Dirichlet process mixtures, since each fixed value of \(\alpha\) results in a Dirichlet process mixture.

We now construct this model using `pymc3`.

```
N = old_faithful_df.shape[0]
K = 30
```

```
with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1., 1.)
    beta = pm.Beta('beta', 1., alpha, shape=K)
    w = pm.Deterministic('w', beta * T.concatenate([[1], T.extra_ops.cumprod(1 - beta)[:-1]]))

    component = pm.Categorical('component', w, shape=N)

    tau = pm.Gamma('tau', 1., 1., shape=K)
    lambda_ = pm.Uniform('lambda', 0, 5, shape=K)
    mu = pm.Normal('mu', 0, lambda_ * tau, shape=K)
    obs = pm.Normal('obs', mu[component], lambda_[component] * tau[component],
                    observed=old_faithful_df.std_waiting.values)
```

```
Applied log-transform to alpha and added transformed alpha_log to model.
Applied logodds-transform to beta and added transformed beta_logodds to model.
Applied log-transform to tau and added transformed tau_log to model.
Applied interval-transform to lambda and added transformed lambda_interval to model.
```

We sample from the posterior distribution 20,000 times, burn the first 10,000 samples, and thin to every tenth sample.

```
with model:
    step1 = pm.Metropolis(vars=[alpha, beta, w, lambda_, tau, mu, obs])
    step2 = pm.ElemwiseCategoricalStep([component], np.arange(K))

    trace_ = pm.sample(20000, [step1, step2])

trace = trace_[10000::10]
```

` [-----------------100%-----------------] 20000 of 20000 complete in 139.3 sec`

The posterior distribution of \(\alpha\) is concentrated between 0.4 and 1.75.

`pm.traceplot(trace, varnames=['alpha']);`

To verify that our truncation point is not biasing our results, we plot the distribution of the number of mixture components used.

`n_components_used = np.apply_along_axis(lambda x: np.unique(x).size, 1, trace['component'])`

```
fig, ax = plt.subplots(figsize=(8, 6))
bins = np.arange(n_components_used.min(), n_components_used.max() + 1)
ax.hist(n_components_used + 1, bins=bins, normed=True, lw=0, alpha=0.75);
ax.set_xticks(bins + 0.5);
ax.set_xticklabels(bins);
ax.set_xlim(bins.min(), bins.max() + 1);
ax.set_xlabel('Number of mixture components used');
ax.set_ylabel('Posterior probability');
```

We see that the vast majority of samples use five mixture components, and the largest number of mixture components used by any sample is eight. Since we truncated our Dirichlet process mixture at thirty components, we can be quite sure that truncation did not bias our results.

We now compute and plot our posterior density estimate.

```
post_pdf_contribs = sp.stats.norm.pdf(np.atleast_3d(x_plot),
                                      trace['mu'][:, np.newaxis, :],
                                      1. / np.sqrt(trace['lambda'] * trace['tau'])[:, np.newaxis, :])
post_pdfs = (trace['w'][:, np.newaxis, :] * post_pdf_contribs).sum(axis=-1)

post_pdf_low, post_pdf_high = np.percentile(post_pdfs, [2.5, 97.5], axis=0)
```

```
fig, ax = plt.subplots(figsize=(8, 6))

n_bins = 20
ax.hist(old_faithful_df.std_waiting.values, bins=n_bins, normed=True,
        color=blue, lw=0, alpha=0.5);

ax.fill_between(x_plot, post_pdf_low, post_pdf_high,
                color='gray', alpha=0.45);
ax.plot(x_plot, post_pdfs[0],
        c='gray', label='Posterior sample densities');
ax.plot(x_plot, post_pdfs[::100].T, c='gray');
ax.plot(x_plot, post_pdfs.mean(axis=0),
        c='k', label='Posterior expected density');

ax.set_xlabel('Standardized waiting time between eruptions');
ax.set_yticklabels([]);
ax.set_ylabel('Density');
ax.legend(loc=2);
```

As above, we can decompose this density estimate into its (weighted) mixture components.

```
fig, ax = plt.subplots(figsize=(8, 6))

n_bins = 20
ax.hist(old_faithful_df.std_waiting.values, bins=n_bins, normed=True,
        color=blue, lw=0, alpha=0.5);

ax.plot(x_plot, post_pdfs.mean(axis=0),
        c='k', label='Posterior expected density');
ax.plot(x_plot, (trace['w'][:, np.newaxis, :] * post_pdf_contribs).mean(axis=0)[:, 0],
        '--', c='k', label='Posterior expected mixture\ncomponents\n(weighted)');
ax.plot(x_plot, (trace['w'][:, np.newaxis, :] * post_pdf_contribs).mean(axis=0),
        '--', c='k');

ax.set_xlabel('Standardized waiting time between eruptions');
ax.set_yticklabels([]);
ax.set_ylabel('Density');
ax.legend(loc=2);
```

The Dirichlet process mixture model is incredibly flexible in terms of the family of parametric component distributions \(\{f_{\theta}\ |\ \theta \in \Theta\}\). We illustrate this flexibility below by using Poisson component distributions to estimate the density of sunspots per year.

`sunspot_df = get_rdataset('sunspot.year', cache=True).data`

`sunspot_df.head()`

| | time | sunspot.year |
|---|---|---|
| 0 | 1700 | 5 |
| 1 | 1701 | 11 |
| 2 | 1702 | 16 |
| 3 | 1703 | 23 |
| 4 | 1704 | 36 |

For this problem, the model is

\[ \begin{align*} x_i\ |\ \lambda_i & \sim \textrm{Poisson}(\lambda_i) \\ \lambda_1, \lambda_2, \ldots & \sim P \\ P & \sim \textrm{DP}(\alpha, U(0, 300)) \\ \alpha & \sim \textrm{Gamma}(1, 1). \end{align*} \]

```
N = sunspot_df.shape[0]
K = 30
```

```
with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1., 1.)
    beta = pm.Beta('beta', 1, alpha, shape=K)
    w = pm.Deterministic('w', beta * T.concatenate([[1], T.extra_ops.cumprod(1 - beta[:-1])]))

    component = pm.Categorical('component', w, shape=N)

    mu = pm.Uniform('mu', 0., 300., shape=K)
    obs = pm.Poisson('obs', mu[component], observed=sunspot_df['sunspot.year'])
```

```
Applied log-transform to alpha and added transformed alpha_log to model.
Applied logodds-transform to beta and added transformed beta_logodds to model.
Applied interval-transform to mu and added transformed mu_interval to model.
```

```
with model:
    step1 = pm.Metropolis(vars=[alpha, beta, w, mu, obs])
    step2 = pm.ElemwiseCategoricalStep([component], np.arange(K))

    trace_ = pm.sample(20000, [step1, step2])
```

` [-----------------100%-----------------] 20000 of 20000 complete in 111.9 sec`

`trace = trace_[10000::10]`

For the sunspot model, the posterior distribution of \(\alpha\) is concentrated between one and three, indicating that we should expect more components to contribute non-negligible amounts to the mixture than for the Old Faithful waiting time model.

`pm.traceplot(trace, varnames=['alpha']);`

Indeed, we see that there are (on average) about ten to fifteen components used by this model.

`n_components_used = np.apply_along_axis(lambda x: np.unique(x).size, 1, trace['component'])`

```
fig, ax = plt.subplots(figsize=(8, 6))
bins = np.arange(n_components_used.min(), n_components_used.max() + 1)
ax.hist(n_components_used + 1, bins=bins, normed=True, lw=0, alpha=0.75);
ax.set_xticks(bins + 0.5);
ax.set_xticklabels(bins);
ax.set_xlim(bins.min(), bins.max() + 1);
ax.set_xlabel('Number of mixture components used');
ax.set_ylabel('Posterior probability');
```

We now calculate and plot the fitted density estimate.

`x_plot = np.arange(250)`

```
post_pmf_contribs = sp.stats.poisson.pmf(np.atleast_3d(x_plot),
                                         trace['mu'][:, np.newaxis, :])
post_pmfs = (trace['w'][:, np.newaxis, :] * post_pmf_contribs).sum(axis=-1)

post_pmf_low, post_pmf_high = np.percentile(post_pmfs, [2.5, 97.5], axis=0)
```

```
fig, ax = plt.subplots(figsize=(8, 6))

ax.hist(sunspot_df['sunspot.year'].values, bins=40, normed=True, lw=0, alpha=0.75);

ax.fill_between(x_plot, post_pmf_low, post_pmf_high,
                color='gray', alpha=0.45)
ax.plot(x_plot, post_pmfs[0],
        c='gray', label='Posterior sample densities');
ax.plot(x_plot, post_pmfs[::200].T, c='gray');
ax.plot(x_plot, post_pmfs.mean(axis=0),
        c='k', label='Posterior expected density');

ax.legend(loc=1);
```

Again, we can decompose the posterior expected density into weighted mixture densities.

```
fig, ax = plt.subplots(figsize=(8, 6))

ax.hist(sunspot_df['sunspot.year'].values, bins=40, normed=True, lw=0, alpha=0.75);

ax.plot(x_plot, post_pmfs.mean(axis=0),
        c='k', label='Posterior expected density');
ax.plot(x_plot, (trace['w'][:, np.newaxis, :] * post_pmf_contribs).mean(axis=0)[:, 0],
        '--', c='k', label='Posterior expected\nmixture components\n(weighted)');
ax.plot(x_plot, (trace['w'][:, np.newaxis, :] * post_pmf_contribs).mean(axis=0),
        '--', c='k');

ax.legend(loc=1);
```
```

We have only scratched the surface in terms of applications of the Dirichlet process and Bayesian nonparametric statistics in general. This post is the first in a series I have planned on Bayesian nonparametrics, so stay tuned.

Survival analysis studies the distribution of the time to an event. Its applications span many fields across medicine, biology, engineering, and social science. This post shows how to fit and analyze a Bayesian survival model in Python using `pymc3`.

We illustrate these concepts by analyzing a mastectomy data set from `R`’s `HSAUR` package.

`%matplotlib inline`

```
from matplotlib import pyplot as plt
import numpy as np
import pymc3 as pm
from pymc3.distributions.timeseries import GaussianRandomWalk
import seaborn as sns
from statsmodels import datasets
from theano import tensor as T
```

`Couldn't import dot_parser, loading of dot files will not be possible.`

Fortunately, `statsmodels.datasets` makes it quite easy to load a number of data sets from `R`.

```
df = datasets.get_rdataset('mastectomy', 'HSAUR', cache=True).data
df.event = df.event.astype(np.int64)
df.metastized = (df.metastized == 'yes').astype(np.int64)
n_patients = df.shape[0]
patients = np.arange(n_patients)
```

`df.head()`

| | time | event | metastized |
|---|---|---|---|
| 0 | 23 | 1 | 0 |
| 1 | 47 | 1 | 0 |
| 2 | 69 | 1 | 0 |
| 3 | 70 | 0 | 0 |
| 4 | 100 | 0 | 0 |

`n_patients`

`44`

Each row represents observations from a woman diagnosed with breast cancer that underwent a mastectomy. The column `time` represents the time (in months) post-surgery that the woman was observed. The column `event` indicates whether or not the woman died during the observation period. The column `metastized` represents whether the cancer had metastized prior to surgery.

This post analyzes the relationship between survival time post-mastectomy and whether or not the cancer had metastized.

First we introduce a (very little) bit of theory. If the random variable \(T\) is the time to the event we are studying, survival analysis is primarily concerned with the survival function

\[S(t) = P(T > t) = 1 - F(t),\]

where \(F\) is the CDF of \(T\). It is mathematically convenient to express the survival function in terms of the hazard rate, \(\lambda(t)\). The hazard rate is the instantaneous probability that the event occurs at time \(t\) given that it has not yet occurred. That is,

\[\begin{align*} \lambda(t) & = \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t\ |\ T > t)}{\Delta t} \\ & = \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t)}{\Delta t \cdot P(T > t)} \\ & = \frac{1}{S(t)} \cdot \lim_{\Delta t \to 0} \frac{S(t + \Delta t) - S(t)}{\Delta t} = -\frac{S'(t)}{S(t)}. \end{align*}\]

Solving this differential equation for the survival function shows that

\[S(t) = \exp\left(-\int_0^t \lambda(s)\ ds\right).\]

This representation of the survival function shows that the cumulative hazard function

\[\Lambda(t) = \int_0^t \lambda(s)\ ds\]

is an important quantity in survival analysis, since we may concisely write \(S(t) = \exp(-\Lambda(t)).\)
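As a quick numerical check of the identity \(S(t) = \exp(-\Lambda(t))\), we can compare it to a known case. For a constant hazard \(\lambda(t) = \lambda\), the cumulative hazard is \(\Lambda(t) = \lambda t\), and \(S(t)\) should match the survival function of an exponential distribution. The sketch below uses an arbitrary illustrative rate.

```python
import numpy as np
from scipy.stats import expon

# Illustrative constant hazard rate (arbitrary value, not from the data)
lam = 0.5
t = np.linspace(0., 10., 101)

cum_hazard = lam * t            # Lambda(t) = integral of lam from 0 to t
survival = np.exp(-cum_hazard)  # S(t) = exp(-Lambda(t))

# Matches the exponential survival function 1 - F(t)
assert np.allclose(survival, expon.sf(t, scale=1. / lam))
```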

An important, but subtle, point in survival analysis is censoring. Even though the quantity we are interested in estimating is the time between surgery and death, we do not observe the death of every subject. At the point in time that we perform our analysis, some of our subjects will thankfully still be alive. In the case of our mastectomy study, `df.event`

is one if the subject’s death was observed (the observation is not censored) and is zero if the death was not observed (the observation is censored).

`df.event.mean()`

`0.59090909090909094`

Just over 40% of our observations are censored. We visualize the observed durations and indicate which observations are censored below.

```
fig, ax = plt.subplots(figsize=(8, 6))
blue, _, red = sns.color_palette()[:3]
ax.hlines(patients[df.event.values == 0], 0, df[df.event.values == 0].time,
color=blue, label='Censored');
ax.hlines(patients[df.event.values == 1], 0, df[df.event.values == 1].time,
color=red, label='Uncensored');
ax.scatter(df[df.metastized.values == 1].time, patients[df.metastized.values == 1],
color='k', zorder=10, label='Metastized');
ax.set_xlim(left=0);
ax.set_xlabel('Months since mastectomy');
ax.set_ylim(-0.25, n_patients + 0.25);
ax.legend(loc='center right');
```

When an observation is censored (`df.event`

is zero), `df.time`

is not the subject’s survival time. All we can conclude from such a censored observation is that the subject’s true survival time exceeds `df.time`

.

This is enough basic survival analysis theory for the purposes of this post; for a more extensive introduction, consult Aalen et al.^{1}

The two most basic estimators in survival analysis are the Kaplan-Meier estimator of the survival function and the Nelson-Aalen estimator of the cumulative hazard function. However, since we want to understand the impact of metastization on survival time, a risk regression model is more appropriate. Perhaps the most commonly used risk regression model is Cox’s proportional hazards model. In this model, if we have covariates \(\mathbf{x}\) and regression coefficients \(\beta\), the hazard rate is modeled as

\[\lambda(t) = \lambda_0(t) \exp(\mathbf{x} \beta).\]

Here \(\lambda_0(t)\) is the baseline hazard, which is independent of the covariates \(\mathbf{x}\). In this example, the covariates are the one-dimensional vector `df.metastized`

.

Unlike in many regression situations, \(\mathbf{x}\) should not include a constant term corresponding to an intercept. If \(\mathbf{x}\) includes a constant term corresponding to an intercept, the model becomes unidentifiable. To illustrate this unidentifiability, suppose that

\[\lambda(t) = \lambda_0(t) \exp(\beta_0 + \mathbf{x} \beta) = \lambda_0(t) \exp(\beta_0) \exp(\mathbf{x} \beta).\]

If \(\tilde{\beta}_0 = \beta_0 + \delta\) and \(\tilde{\lambda}_0(t) = \lambda_0(t) \exp(-\delta)\), then \(\lambda(t) = \tilde{\lambda}_0(t) \exp(\tilde{\beta}_0 + \mathbf{x} \beta)\) as well, making the model with \(\beta_0\) unidentifiable.
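This unidentifiability is easy to verify numerically; the sketch below uses arbitrary illustrative values for the baseline hazard, coefficients, covariate, and shift \(\delta\).

```python
import numpy as np

# All values here are arbitrary, chosen only to illustrate the algebra
lambda0 = 2.0
beta0, beta, x = 0.3, 1.2, 1.0
delta = 0.7

hazard = lambda0 * np.exp(beta0 + x * beta)

# Shift the intercept by delta and rescale the baseline hazard to compensate
lambda0_tilde = lambda0 * np.exp(-delta)
beta0_tilde = beta0 + delta
hazard_tilde = lambda0_tilde * np.exp(beta0_tilde + x * beta)

# Two different parameterizations yield identical hazards
assert np.isclose(hazard, hazard_tilde)
```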

In order to perform Bayesian inference with the Cox model, we must specify priors on \(\beta\) and \(\lambda_0(t)\). We place a normal prior on \(\beta\), \(\beta \sim N(\mu_{\beta}, \sigma_{\beta}^2),\) where \(\mu_{\beta} \sim N(0, 10^2)\) and \(\sigma_{\beta} \sim U(0, 10)\).

A suitable prior on \(\lambda_0(t)\) is less obvious. We choose a semiparametric prior, where \(\lambda_0(t)\) is a piecewise constant function. This prior requires us to partition the time range in question into intervals with endpoints \(0 \leq s_1 < s_2 < \cdots < s_N\). With this partition, \(\lambda_0 (t) = \lambda_j\) if \(s_j \leq t < s_{j + 1}\). With \(\lambda_0(t)\) constrained to have this form, all we need to do is choose priors for the \(N - 1\) values \(\lambda_j\). We use independent vague priors \(\lambda_j \sim \operatorname{Gamma}(10^{-2}, 10^{-2}).\) For our mastectomy example, we make each interval three months long.

```
interval_length = 3
interval_bounds = np.arange(0, df.time.max() + interval_length + 1, interval_length)
n_intervals = interval_bounds.size - 1
intervals = np.arange(n_intervals)
```

We see how deaths and censored observations are distributed in these intervals.

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.hist(df[df.event == 1].time.values, bins=interval_bounds,
color=red, alpha=0.5, lw=0,
label='Uncensored');
ax.hist(df[df.event == 0].time.values, bins=interval_bounds,
color=blue, alpha=0.5, lw=0,
label='Censored');
ax.set_xlim(0, interval_bounds[-1]);
ax.set_xlabel('Months since mastectomy');
ax.set_yticks([0, 1, 2, 3]);
ax.set_ylabel('Number of observations');
ax.legend();
```

With the prior distributions on \(\beta\) and \(\lambda_0(t)\) chosen, we now show how the model may be fit using MCMC simulation with `pymc3`

. The key observation is that the piecewise-constant proportional hazard model is closely related to a Poisson regression model. (The models are not identical, but their likelihoods differ by a factor that depends only on the observed data and not the parameters \(\beta\) and \(\lambda_j\). For details, see Germán Rodríguez’s WWS 509 course notes.)

We define indicator variables based on whether or not the \(i\)-th subject died in the \(j\)-th interval,

\[d_{i, j} = \begin{cases} 1 & \textrm{if subject } i \textrm{ died in interval } j \\ 0 & \textrm{otherwise} \end{cases}.\]

```
last_period = np.floor((df.time - 0.01) / interval_length).astype(np.int64)
death = np.zeros((n_patients, n_intervals))
death[patients, last_period] = df.event
```

We also define \(t_{i, j}\) to be the amount of time the \(i\)-th subject was at risk in the \(j\)-th interval.

```
exposure = np.greater_equal.outer(df.time, interval_bounds[:-1]) * interval_length
exposure[patients, last_period] = df.time - interval_bounds[last_period]
```
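A useful sanity check on this construction is that each subject's exposure, summed over intervals, recovers their observed time. The toy version below uses invented times (chosen to avoid interval boundaries) rather than the LIDAR-style data frame above.

```python
import numpy as np

# Toy version of the exposure computation; times are invented and avoid
# falling exactly on interval boundaries
interval_length = 3
time = np.array([4, 7, 8])
interval_bounds = np.arange(0, 12, interval_length)  # 0, 3, 6, 9
last_period = np.floor((time - 0.01) / interval_length).astype(np.int64)

exposure = np.greater_equal.outer(time, interval_bounds) * interval_length
exposure[np.arange(3), last_period] = time - interval_bounds[last_period]

# Each subject's total time at risk equals their observed time
assert np.allclose(exposure.sum(axis=1), time)
```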

Finally, denote the risk incurred by the \(i\)-th subject in the \(j\)-th interval as \(\lambda_{i, j} = \lambda_j \exp(\mathbf{x}_i \beta)\).

We may approximate \(d_{i, j}\) with a Poisson random variable with mean \(t_{i, j}\ \lambda_{i, j}\). This approximation leads to the following `pymc3`

model.

`SEED = 5078864 # from random.org`

```
with pm.Model() as model:
    lambda0 = pm.Gamma('lambda0', 0.01, 0.01, shape=n_intervals)
    sigma = pm.Uniform('sigma', 0., 10.)
    tau = pm.Deterministic('tau', sigma**-2)
    mu_beta = pm.Normal('mu_beta', 0., 10**-2)
    beta = pm.Normal('beta', mu_beta, tau)
    lambda_ = pm.Deterministic('lambda_', T.outer(T.exp(beta * df.metastized), lambda0))
    mu = pm.Deterministic('mu', exposure * lambda_)
    obs = pm.Poisson('obs', mu, observed=death)
```

We now sample from the model.

```
n_samples = 40000
burn = 20000
thin = 20
```

```
with model:
    step = pm.Metropolis()
    trace_ = pm.sample(n_samples, step, random_seed=SEED)
```

` [-----------------100%-----------------] 40000 of 40000 complete in 39.0 sec`

`trace = trace_[burn::thin]`

We see that the hazard rate for subjects whose cancer has metastized is about 1.65 times the rate of those whose cancer has not metastized.

`np.exp(trace['beta'].mean())`

`1.645592148084472`
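Beyond the posterior mean, we can summarize uncertainty about the hazard ratio \(\exp(\beta)\) directly from the posterior samples. The sketch below uses simulated draws as a stand-in for `trace['beta']`, with an invented mean and spread.

```python
import numpy as np

# Simulated posterior draws standing in for trace['beta'];
# the location and scale here are purely illustrative
rng = np.random.RandomState(0)
beta_samples = rng.normal(0.5, 0.2, size=2000)

hazard_ratio = np.exp(beta_samples)
hr_mean = hazard_ratio.mean()                       # posterior mean hazard ratio
hr_interval = np.percentile(hazard_ratio, [2.5, 97.5])  # central 95% interval
```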

`pm.traceplot(trace, vars=['beta']);`

`pm.autocorrplot(trace, vars=['beta']);`

We now examine the effect of metastization on both the cumulative hazard and on the survival function.

```
base_hazard = trace['lambda0']
met_hazard = trace['lambda0'] * np.exp(np.atleast_2d(trace['beta']).T)
```

```
def cum_hazard(hazard):
    return (interval_length * hazard).cumsum(axis=-1)

def survival(hazard):
    return np.exp(-cum_hazard(hazard))
```

```
def plot_with_hpd(x, hazard, f, ax, color=None, label=None, alpha=0.05):
    mean = f(hazard.mean(axis=0))
    percentiles = 100 * np.array([alpha / 2., 1. - alpha / 2.])
    hpd = np.percentile(f(hazard), percentiles, axis=0)
    ax.fill_between(x, hpd[0], hpd[1], color=color, alpha=0.25)
    ax.step(x, mean, color=color, label=label);
```

```
fig, (hazard_ax, surv_ax) = plt.subplots(ncols=2, sharex=True, sharey=False, figsize=(16, 6))
plot_with_hpd(interval_bounds[:-1], base_hazard, cum_hazard,
hazard_ax, color=blue, label='Had not metastized')
plot_with_hpd(interval_bounds[:-1], met_hazard, cum_hazard,
hazard_ax, color=red, label='Metastized')
hazard_ax.set_xlim(0, df.time.max());
hazard_ax.set_xlabel('Months since mastectomy');
hazard_ax.set_ylabel(r'Cumulative hazard $\Lambda(t)$');
hazard_ax.legend(loc=2);
plot_with_hpd(interval_bounds[:-1], base_hazard, survival,
surv_ax, color=blue)
plot_with_hpd(interval_bounds[:-1], met_hazard, survival,
surv_ax, color=red)
surv_ax.set_xlim(0, df.time.max());
surv_ax.set_xlabel('Months since mastectomy');
surv_ax.set_ylabel('Survival function $S(t)$');
fig.suptitle('Bayesian survival model');
```

We see that the cumulative hazard for metastized subjects increases more rapidly initially (through about seventy months), after which it increases roughly in parallel with the baseline cumulative hazard.

These plots also show the pointwise 95% high posterior density interval for each function. One of the distinct advantages of the Bayesian model fit with `pymc3`

is the inherent quantification of uncertainty in our estimates.

Another of the advantages of the model we have built is its flexibility. From the plots above, we may reasonably believe that the additional hazard due to metastization varies over time; it seems plausible that cancer that has metastized increases the hazard rate immediately after the mastectomy, but that the risk due to metastization decreases over time. We can accommodate this mechanism in our model by allowing the regression coefficients to vary over time. In the time-varying coefficient model, if \(s_j \leq t < s_{j + 1}\), we let \(\lambda(t) = \lambda_j \exp(\mathbf{x} \beta_j).\) The sequence of regression coefficients \(\beta_1, \beta_2, \ldots, \beta_{N - 1}\) forms a normal random walk with \(\beta_1 \sim N(0, 1)\), \(\beta_j\ |\ \beta_{j - 1} \sim N(\beta_{j - 1}, 1)\).

We implement this model in `pymc3`

as follows.

```
with pm.Model() as time_varying_model:
    lambda0 = pm.Gamma('lambda0', 0.01, 0.01, shape=n_intervals)
    beta = GaussianRandomWalk('beta', tau=1., shape=n_intervals)
    lambda_ = pm.Deterministic('h', lambda0 * T.exp(T.outer(T.constant(df.metastized), beta)))
    mu = pm.Deterministic('mu', exposure * lambda_)
    obs = pm.Poisson('obs', mu, observed=death)
```

We proceed to sample from this model.

```
with time_varying_model:
    step = pm.Metropolis()
    time_varying_trace_ = pm.sample(n_samples, step, random_seed=SEED)
```

` [-----------------100%-----------------] 40000 of 40000 complete in 56.7 sec`

`time_varying_trace = time_varying_trace_[burn::thin]`

We see from the plot of \(\beta_j\) over time below that initially \(\beta_j > 0\), indicating an elevated hazard rate due to metastization, but that this risk declines over time, with \(\beta_j\) eventually becoming negative.

```
fig, ax = plt.subplots(figsize=(8, 6))
beta_hpd = np.percentile(time_varying_trace['beta'], [2.5, 97.5], axis=0)
beta_low = beta_hpd[0]
beta_high = beta_hpd[1]
ax.fill_between(interval_bounds[:-1], beta_low, beta_high,
color=blue, alpha=0.25);
beta_hat = time_varying_trace['beta'].mean(axis=0)
ax.step(interval_bounds[:-1], beta_hat, color=blue);
ax.scatter(interval_bounds[last_period[(df.event.values == 1) & (df.metastized == 1)]],
beta_hat[last_period[(df.event.values == 1) & (df.metastized == 1)]],
c=red, zorder=10, label='Died, cancer metastized');
ax.scatter(interval_bounds[last_period[(df.event.values == 0) & (df.metastized == 1)]],
beta_hat[last_period[(df.event.values == 0) & (df.metastized == 1)]],
c=blue, zorder=10, label='Censored, cancer metastized');
ax.set_xlim(0, df.time.max());
ax.set_xlabel('Months since mastectomy');
ax.set_ylabel(r'$\beta_j$');
ax.legend();
```

The coefficients \(\beta_j\) begin declining rapidly around one hundred months post-mastectomy, which seems reasonable, given that only three of the twelve subjects whose cancer had metastized lived past this point.

The change in our estimate of the cumulative hazard and survival functions due to time-varying effects is also quite apparent in the following plots.

```
tv_base_hazard = time_varying_trace['lambda0']
tv_met_hazard = time_varying_trace['lambda0'] * np.exp(np.atleast_2d(time_varying_trace['beta']))
```

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.step(interval_bounds[:-1], cum_hazard(base_hazard.mean(axis=0)),
color=blue, label='Had not metastized');
ax.step(interval_bounds[:-1], cum_hazard(met_hazard.mean(axis=0)),
color=red, label='Metastized');
ax.step(interval_bounds[:-1], cum_hazard(tv_base_hazard.mean(axis=0)),
color=blue, linestyle='--', label='Had not metastized (time varying effect)');
ax.step(interval_bounds[:-1], cum_hazard(tv_met_hazard.mean(axis=0)),
color=red, linestyle='--', label='Metastized (time varying effect)');
ax.set_xlim(0, df.time.max() - 4);
ax.set_xlabel('Months since mastectomy');
ax.set_ylim(0, 2);
ax.set_ylabel(r'Cumulative hazard $\Lambda(t)$');
ax.legend(loc=2);
```

```
fig, (hazard_ax, surv_ax) = plt.subplots(ncols=2, sharex=True, sharey=False, figsize=(16, 6))
plot_with_hpd(interval_bounds[:-1], tv_base_hazard, cum_hazard,
hazard_ax, color=blue, label='Had not metastized')
plot_with_hpd(interval_bounds[:-1], tv_met_hazard, cum_hazard,
hazard_ax, color=red, label='Metastized')
hazard_ax.set_xlim(0, df.time.max());
hazard_ax.set_xlabel('Months since mastectomy');
hazard_ax.set_ylim(0, 2);
hazard_ax.set_ylabel(r'Cumulative hazard $\Lambda(t)$');
hazard_ax.legend(loc=2);
plot_with_hpd(interval_bounds[:-1], tv_base_hazard, survival,
surv_ax, color=blue)
plot_with_hpd(interval_bounds[:-1], tv_met_hazard, survival,
surv_ax, color=red)
surv_ax.set_xlim(0, df.time.max());
surv_ax.set_xlabel('Months since mastectomy');
surv_ax.set_ylabel('Survival function $S(t)$');
fig.suptitle('Bayesian survival model with time varying effects');
```

We have really only scratched the surface of both survival analysis and the Bayesian approach to survival analysis. More information on Bayesian survival analysis is available in Ibrahim et al.^{2} (For example, we may want to account for individual frailty in either our original or time-varying models.)

This post is available as an IPython notebook here.

Tags: Bayesian Statistics, PyMC3

Outside of the beta-binomial model, the multivariate normal model is likely the most studied Bayesian model in history. Unfortunately, as this issue shows, `pymc3`

cannot (yet) sample from the standard conjugate normal-Wishart model. Fortunately, `pymc3`

*does* support sampling from the LKJ distribution. This post will show how to fit a simple multivariate normal model using `pymc3`

with a normal-LKJ prior.

The normal-Wishart prior is conjugate for the multivariate normal model, so we can find the posterior distribution in closed form. Even with this closed form solution, sampling from a multivariate normal model in `pymc3`

is important as a building block for more complex models that will be discussed in future posts.

First, we generate some two-dimensional sample data.

`%matplotlib inline`

```
from matplotlib.patches import Ellipse
from matplotlib import pyplot as plt
import numpy as np
import pymc3 as pm
import scipy as sp
import seaborn as sns
from theano import tensor as T
```

`Couldn't import dot_parser, loading of dot files will not be possible.`

`np.random.seed(3264602) # from random.org`

```
N = 100
mu_actual = sp.stats.uniform.rvs(-5, 10, size=2)
cov_actual_sqrt = sp.stats.uniform.rvs(0, 2, size=(2, 2))
cov_actual = np.dot(cov_actual_sqrt.T, cov_actual_sqrt)
x = sp.stats.multivariate_normal.rvs(mu_actual, cov_actual, size=N)
```

```
var, U = np.linalg.eig(cov_actual)
angle = 180. / np.pi * np.arccos(np.abs(U[0, 0]))
```

```
fig, ax = plt.subplots(figsize=(8, 6))
blue = sns.color_palette()[0]
e = Ellipse(mu_actual, 2 * np.sqrt(5.991 * var[0]), 2 * np.sqrt(5.991 * var[1]), angle=-angle)
e.set_alpha(0.5)
e.set_facecolor('gray')
e.set_zorder(10);
ax.add_artist(e);
ax.scatter(x[:, 0], x[:, 1], c='k', alpha=0.5, zorder=11);
rect = plt.Rectangle((0, 0), 1, 1, fc='gray', alpha=0.5)
ax.legend([rect], ['95% true credible region'], loc=2);
```

The sampling distribution for our model is \(x_i \sim N(\mu, \Lambda)\), where \(\Lambda\) is the precision matrix of the distribution. The precision matrix is the inverse of the covariance matrix. The support of the LKJ distribution is the set of correlation matrices, not covariance matrices. We will use the separation strategy from Barnard et al. to combine an LKJ prior on the correlation matrix with a prior on the standard deviations of each dimension to produce a prior on the covariance matrix.

Let \(\sigma\) be the vector of standard deviations of each component of our normal distribution, and \(\mathbf{C}\) be the correlation matrix. The relationship

\[\Sigma = \operatorname{diag}(\sigma)\ \mathbf{C} \operatorname{diag}(\sigma)\]

shows that priors on \(\sigma\) and \(\mathbf{C}\) will induce a prior on \(\Sigma\). Following Barnard et al., we place a standard lognormal prior on each of the elements of \(\sigma\), and an LKJ prior on the correlation matrix \(\mathbf{C}\). The LKJ distribution requires a shape parameter \(\nu > 0\). If \(\mathbf{C} \sim LKJ(\nu)\), then \(f(\mathbf{C}) \propto |\mathbf{C}|^{\nu - 1}\) (here \(|\cdot|\) is the determinant).
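The separation strategy itself is straightforward to verify numerically; the sketch below builds \(\Sigma = \operatorname{diag}(\sigma)\ \mathbf{C} \operatorname{diag}(\sigma)\) from illustrative (not inferred) values of \(\sigma\) and \(\mathbf{C}\).

```python
import numpy as np

# Illustrative standard deviations and correlation matrix
sigma = np.array([1.0, 2.0])
C = np.array([[1.0, 0.5],
              [0.5, 1.0]])

# Sigma = diag(sigma) C diag(sigma)
Sigma = np.dot(np.dot(np.diag(sigma), C), np.diag(sigma))

# The diagonal recovers the variances, and the off-diagonal entries are
# the correlations scaled by the standard deviations
assert np.allclose(np.diag(Sigma), sigma**2)
assert np.isclose(Sigma[0, 1], C[0, 1] * sigma[0] * sigma[1])
```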

We can now begin to build this model in `pymc3`

.

```
with pm.Model() as model:
    sigma = pm.Lognormal('sigma', np.zeros(2), np.ones(2), shape=2)
    nu = pm.Uniform('nu', 0, 5)
    C_triu = pm.LKJCorr('C_triu', nu, 2)
```

There is a slight complication in `pymc3`

’s handling of the `LKJCorr`

distribution; `pymc3`

represents the support of this distribution as a one-dimensional vector of the upper triangular elements of the correlation matrix.

`C_triu.tag.test_value.shape`

`(1,)`

In order to build the full correlation matrix \(\mathbf{C}\), we first build a \(2 \times 2\) tensor whose values are all `C_triu`

and then set the diagonal entries to one. (Recall that a correlation matrix must be symmetric and positive definite with all diagonal entries equal to one.) We can then proceed to build the covariance matrix \(\Sigma\) and the precision matrix \(\Lambda\).

```
with model:
    C = pm.Deterministic('C', T.fill_diagonal(C_triu[np.zeros((2, 2), dtype=np.int64)], 1.))
    sigma_diag = pm.Deterministic('sigma_mat', T.nlinalg.diag(sigma))
    cov = pm.Deterministic('cov', T.nlinalg.matrix_dot(sigma_diag, C, sigma_diag))
    tau = pm.Deterministic('tau', T.nlinalg.matrix_inverse(cov))
```

While defining `C`

in terms of `C_triu`

was simple in this case because our sampling distribution is two-dimensional, the example from this StackOverflow question shows how to generalize this transformation to arbitrarily many dimensions.

Finally, we define the prior on \(\mu\) and the sampling distribution.

```
with model:
    mu = pm.MvNormal('mu', 0, tau, shape=2)
    x_ = pm.MvNormal('x', mu, tau, observed=x)
```

We are now ready to fit this model using `pymc3`

.

```
n_samples = 4000
n_burn = 2000
n_thin = 2
```

```
with model:
    step = pm.Metropolis()
    trace_ = pm.sample(n_samples, step)
```

` [-----------------100%-----------------] 4000 of 4000 complete in 5.8 sec`

`trace = trace_[n_burn::n_thin]`

We see that the posterior estimate of \(\mu\) is reasonably accurate.

`pm.traceplot(trace, vars=['mu']);`

`trace['mu'].mean(axis=0)`

`array([-1.41086412, -4.6853101 ])`

`mu_actual`

`array([-1.41866859, -4.8018335 ])`

The estimates of the standard deviations are certainly biased.

`pm.traceplot(trace, vars=['sigma']);`

`trace['sigma'].mean(axis=0)`

`array([ 0.75736536, 1.49451149])`

`np.sqrt(var)`

`array([ 0.3522422 , 1.58192855])`

However, the 95% posterior credible region is visually quite close to the true credible region, so we can be fairly satisfied with our model.

```
post_cov = trace['cov'].mean(axis=0)
post_sigma, post_U = np.linalg.eig(post_cov)
post_angle = 180. / np.pi * np.arccos(np.abs(post_U[0, 0]))
```

```
fig, ax = plt.subplots(figsize=(8, 6))
blue = sns.color_palette()[0]
e = Ellipse(mu_actual, 2 * np.sqrt(5.991 * post_sigma[0]), 2 * np.sqrt(5.991 * post_sigma[1]), angle=-post_angle)
e.set_alpha(0.5)
e.set_facecolor(blue)
e.set_zorder(9);
ax.add_artist(e);
e = Ellipse(mu_actual, 2 * np.sqrt(5.991 * var[0]), 2 * np.sqrt(5.991 * var[1]), angle=-angle)
e.set_alpha(0.5)
e.set_facecolor('gray')
e.set_zorder(10);
ax.add_artist(e);
ax.scatter(x[:, 0], x[:, 1], c='k', alpha=0.5, zorder=11);
rect = plt.Rectangle((0, 0), 1, 1, fc='gray', alpha=0.5)
post_rect = plt.Rectangle((0, 0), 1, 1, fc=blue, alpha=0.5)
ax.legend([rect, post_rect],
['95% true credible region',
'95% posterior credible region'],
loc=2);
```

Again, this model is quite simple, but will be an important component of more complex models that I will blog about in the future.

This post is available as an IPython notebook here.

Tags: Bayesian Statistics, PyMC3

Recently, I have been learning about (generalized) additive models by working through Simon Wood’s book. I have previously posted an IPython notebook implementing the models from Chapter 3 of the book. In this post, I will show how to fit a simple additive model in Python in a bit more detail.

We will use a LIDAR dataset that is available on the website for Larry Wasserman’s book *All of Nonparametric Statistics*.

`%matplotlib inline`

```
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import patsy
import scipy as sp
import seaborn as sns
from statsmodels import api as sm
```

```
df = pd.read_csv('http://www.stat.cmu.edu/~larry/all-of-nonpar/=data/lidar.dat',
sep=' *', engine='python')
df['std_range'] = (df.range - df.range.min()) / df.range.ptp()
n = df.shape[0]
```

`df.head()`

range | logratio | std_range | |
---|---|---|---|

0 | 390 | -0.050356 | 0.000000 |

1 | 391 | -0.060097 | 0.003030 |

2 | 393 | -0.041901 | 0.009091 |

3 | 394 | -0.050985 | 0.012121 |

4 | 396 | -0.059913 | 0.018182 |

This data set is well-suited to additive modeling because the relationship between the variables is highly non-linear.

```
fig, ax = plt.subplots(figsize=(8, 6))
blue = sns.color_palette()[0]
ax.scatter(df.std_range, df.logratio, c=blue, alpha=0.5);
ax.set_xlim(-0.01, 1.01);
ax.set_xlabel('Scaled range');
ax.set_ylabel('Log ratio');
```

An additive model represents the relationship between explanatory variables \(\mathbf{x}\) and a response variable \(y\) as a sum of smooth functions of the explanatory variables

\[y = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_k(x_k) + \varepsilon.\]

The smooth functions \(f_i\) can be estimated using a variety of nonparametric techniques. Following Chapter 3 of Wood’s book, we will fit our additive model using penalized regression splines.

Since our LIDAR data set has only one explanatory variable, our additive model takes the form

\[y = \beta_0 + f(x) + \varepsilon.\]

We fit this model by minimizing the penalized residual sum of squares

\[PRSS = \sum_{i = 1}^n \left(y_i - \beta_0 - f(x_i)\right)^2 + \lambda \int_0^1 \left(f''(x)\right)^2\ dx.\]

The penalty term

\[\int_0^1 \left(f''(x)\right)^2\ dx\]

causes us to choose less smooth functions only if they fit the data much better. The smoothing parameter \(\lambda\) controls the rate at which decreased smoothness is traded for a better fit.

In the penalized regression splines model, we must also choose basis functions \(\varphi_1, \varphi_2, \ldots, \varphi_k\), which we then use to express the smooth function \(f\) as

\[f(x) = \beta_1 \varphi_1(x) + \beta_2 \varphi_2(x) + \cdots + \beta_k \varphi_k(x).\]

With these basis functions in place, if we define \(\mathbf{x}_i = [1\ x_i\ \varphi_2(x_i)\ \cdots \varphi_k(x_i)]\) and

\[\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_n \end{bmatrix},\]

the model \(y_i = \beta_0 + f(x_i) + \varepsilon\) can be rewritten as \(\mathbf{y} = \mathbf{X} \beta + \varepsilon\). It is tedious but not difficult to show that when \(f\) is expressed as a linear combination of basis functions, there is always a positive semidefinite matrix \(\mathbf{S}\) such that

\[\int_0^1 \left(f''(x)\right)^2\ dx = \beta^{\intercal} \mathbf{S} \beta.\]

Since \(\mathbf{S}\) is positive semidefinite, it has a square root \(\mathbf{B}\) such that \(\mathbf{B}^{\intercal} \mathbf{B} = \mathbf{S}\). The penalized residual sum of squares objective function can then be written as

\[ \begin{align*} PRSS & = (\mathbf{y} - \mathbf{X} \beta)^{\intercal} (\mathbf{y} - \mathbf{X} \beta) + \lambda \beta^{\intercal} \mathbf{B}^{\intercal} \mathbf{B} \beta = (\mathbf{\tilde{y}} - \mathbf{\tilde{X}} \beta)^{\intercal} (\mathbf{\tilde{y}} - \mathbf{\tilde{X}} \beta), \end{align*} \]

where

\[\mathbf{\tilde{y}} = \begin{bmatrix} \mathbf{y} \\ \mathbf{0}_{k + 1} \end{bmatrix} \]

and

\[\mathbf{\tilde{X}} = \begin{bmatrix} \mathbf{X} \\ \sqrt{\lambda}\ \mathbf{B} \end{bmatrix}. \]

Therefore the augmented data matrices \(\mathbf{\tilde{y}}\) and \(\mathbf{\tilde{X}}\) allow us to express the penalized residual sum of squares for the original model as the residual sum of squares of the OLS model \(\mathbf{\tilde{y}} = \mathbf{\tilde{X}} \beta + \tilde{\varepsilon}\). This augmented model allows us to use widely available machinery for fitting OLS models to fit the additive model as well.
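The equivalence between the penalized residual sum of squares and the augmented OLS residual sum of squares can be checked numerically. In the sketch below, \(\mathbf{X}\), \(\mathbf{y}\), \(\mathbf{S}\), and \(\beta\) are all arbitrary illustrative values, not the LIDAR quantities.

```python
import numpy as np
import scipy.linalg as la

rng = np.random.RandomState(0)
n, k = 10, 3
X = rng.randn(n, k)
y = rng.randn(n, 1)
A = rng.randn(k, k)
S = np.dot(A.T, A)        # an arbitrary positive semidefinite penalty matrix
B = la.sqrtm(S).real      # square root with B^T B = S
lam = 0.1
beta = rng.randn(k, 1)

# Penalized residual sum of squares, computed directly
prss = np.sum((y - X.dot(beta))**2) + lam * float(beta.T.dot(S).dot(beta))

# The same quantity via the augmented data matrices
y_aug = np.vstack((y, np.zeros((k, 1))))
X_aug = np.vstack((X, np.sqrt(lam) * B))
rss_aug = np.sum((y_aug - X_aug.dot(beta))**2)

assert np.isclose(prss, rss_aug)
```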

The last step before we can fit the model in Python is to choose the basis functions \(\varphi_i\). Again, following Chapter 3 of Wood’s book, we let

\[R(x, z) = \frac{1}{4} \left(\left(z - \frac{1}{2}\right)^2 - \frac{1}{12}\right) \left(\left(x - \frac{1}{2}\right)^2 - \frac{1}{12}\right) - \frac{1}{24} \left(\left(\left|x - z\right| - \frac{1}{2}\right)^4 - \frac{1}{2} \left(\left|x - z\right| - \frac{1}{2}\right)^2 + \frac{7}{240}\right).\]

```
def R(x, z):
    return ((z - 0.5)**2 - 1 / 12) * ((x - 0.5)**2 - 1 / 12) / 4 - ((np.abs(x - z) - 0.5)**4 - 0.5 * (np.abs(x - z) - 0.5)**2 + 7 / 240) / 24

R = np.frompyfunc(R, 2, 1)

def R_(x):
    return R.outer(x, knots).astype(np.float64)
```

Though this function is quite complicated, we will see that it has some very convenient properties. We must also choose a set of knots \(z_i\) in \([0, 1]\), \(i = 1, 2, \ldots, q\).

```
q = 20
knots = df.std_range.quantile(np.linspace(0, 1, q))
```

Here we have used twenty knots situated at percentiles of `std_range`

.

Now we define our basis functions as \(\varphi_1(x) = x\), \(\varphi_{i}(x) = R(x, z_{i - 1})\) for \(i = 2, 3, \ldots, q + 1\).

Our model matrices \(\mathbf{y}\) and \(\mathbf{X}\) are therefore

`y, X = patsy.dmatrices('logratio ~ std_range + R_(std_range)', data=df)`

Note that, by default, `patsy`

always includes an intercept column in `X`

.

The advantage of the function \(R\) is that the penalty matrix \(\mathbf{S}\) has the form

\[S = \begin{bmatrix} \mathbf{0}_{2 \times 2} & \mathbf{0}_{2 \times q} \\ \mathbf{0}_{q \times 2} & \mathbf{\tilde{S}} \end{bmatrix},\]

where \(\mathbf{\tilde{S}}_{ij} = R(z_i, z_j)\). We now calculate \(\mathbf{S}\) and its square root \(\mathbf{B}\).

```
S = np.zeros((q + 2, q + 2))
S[2:, 2:] = R_(knots)
```

```
B = np.zeros_like(S)
B[2:, 2:] = np.real_if_close(sp.linalg.sqrtm(S[2:, 2:]), tol=10**8)
```

We now have all the ingredients necessary to fit some additive models to the LIDAR data set.

```
def fit(y, X, B, lambda_=1.0):
    # build the augmented matrices
    y_ = np.vstack((y, np.zeros((q + 2, 1))))
    X_ = np.vstack((X, np.sqrt(lambda_) * B))
    return sm.OLS(y_, X_).fit()
```

We have not yet discussed how to choose the smoothing parameter \(\lambda\), so we will fit several models with different values of \(\lambda\) to see how it affects the results.

```
fig, axes = plt.subplots(nrows=3, ncols=2, sharex=True, sharey=True, squeeze=True, figsize=(12, 13.5))
plot_lambdas = np.array([1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001])
plot_x = np.linspace(0, 1, 100)
plot_X = patsy.dmatrix('std_range + R_(std_range)', {'std_range': plot_x})
for lambda_, ax in zip(plot_lambdas, np.ravel(axes)):
    ax.scatter(df.std_range, df.logratio, c=blue, alpha=0.5);
    results = fit(y, X, B, lambda_=lambda_)
    ax.plot(plot_x, results.predict(plot_X));
    ax.set_xlim(-0.01, 1.01);
    ax.set_xlabel('Scaled range');
    ax.set_ylabel('Log ratio');
    ax.set_title(r'$\lambda = {}$'.format(lambda_));
fig.tight_layout();
```

We can see that as \(\lambda\) decreases, the model becomes less smooth. Visually, it seems that the optimal value of \(\lambda\) lies somewhere between \(10^{-2}\) and \(10^{-4}\). We need a rigorous way to choose the optimal value of \(\lambda\). As is often the case in such situations, we turn to cross-validation. Specifically, we will use generalized cross-validation to choose the optimal value of \(\lambda\). The GCV score is given by

\[\operatorname{GCV}(\lambda) = \frac{n \sum_{i = 1}^n \left(y_i - \hat{y}_i\right)^2}{\left(n - \operatorname{tr} \mathbf{H}\right)^2}.\]

Here, \(\hat{y}_i\) is the \(i\)-th predicted value, and \(\mathbf{H}\) is the upper left \(n \times n\) submatrix of the influence matrix for the OLS model \(\mathbf{\tilde{y}} = \mathbf{\tilde{X}} \beta + \tilde{\varepsilon}\).

```
def gcv_score(results):
    X = results.model.exog[:-(q + 2), :]
    n = X.shape[0]
    y = results.model.endog[:n]
    y_hat = results.predict(X)
    hat_matrix_trace = results.get_influence().hat_matrix_diag[:n].sum()
    return n * np.power(y - y_hat, 2).sum() / np.power(n - hat_matrix_trace, 2)
```

Now we evaluate the GCV score of the model over a range of \(\lambda\) values.

`lambdas = np.logspace(0, 50, 100, base=1.5) * 1e-8`

`gcv_scores = np.array([gcv_score(fit(y, X, B, lambda_=lambda_)) for lambda_ in lambdas])`

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(lambdas, gcv_scores);
ax.set_xscale('log');
ax.set_xlabel(r'$\lambda$');
ax.set_ylabel(r'$\operatorname{GCV}(\lambda)$');
```

The GCV-optimal value of \(\lambda\) is therefore

```
lambda_best = lambdas[gcv_scores.argmin()]
lambda_best
```

`0.00063458365729550153`

This value of \(\lambda\) produces a visually reasonable fit.

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df.std_range, df.logratio, c=blue, alpha=0.5);
results = fit(y, X, B, lambda_=lambda_best)
ax.plot(plot_x, results.predict(plot_X), label=r'$\lambda = {}$'.format(lambda_best));
ax.set_xlim(-0.01, 1.01);
ax.legend();
```

We have only scratched the surface of additive models, fitting a simple model of one variable with penalized regression splines. In general, additive models are quite powerful and flexible, while remaining quite interpretable.

This post is available as an IPython notebook here.

Tags: Statistics, Python

For all the hype about big data, much value resides in the world’s medium and small data. Especially when we consider the length of the feedback loop and total analyst time invested, insights from small and medium data are quite attractive and economical. Personally, I find analyzing data that fits into memory quite convenient, and therefore, when I am confronted with a data set that does not fit in memory as-is, I am willing to spend a bit of time to try to manipulate it to fit into memory.

The first technique I usually turn to is to only store distinct rows of a data set, along with the count of the number of times that row appears in the data set. This technique is fairly simple to implement, especially when the data set is generated by a SQL query. If the initial query that generates the data set is

`SELECT u, v, w FROM t;`

we would modify it to become

`SELECT u, v, w, COUNT(1) FROM t GROUP BY u, v, w;`
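The reverse operation — expanding a summarized frame back to one row per original observation — is occasionally useful as well. A minimal sketch (the two-row frame below is made up for illustration), using `np.repeat`:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature summarized frame: two distinct rows with counts.
summ = pd.DataFrame({'u': [1, 2], 'v': [0, 1], 'count': [3, 2]})

# Repeat each row's positional index according to its count, then drop the
# count column to recover one row per original observation.
full = (summ.loc[np.repeat(summ.index.values, summ['count'])]
            .drop('count', axis=1)
            .reset_index(drop=True))

print(len(full))  # 5: three copies of the first row, two of the second
```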

We now generate a sample data set with both discrete and continuous features.

`%matplotlib inline`

`from __future__ import division`

```
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from patsy import dmatrices, dmatrix
import scipy as sp
import seaborn as sns
from statsmodels import api as sm
from statsmodels.base.model import GenericLikelihoodModel
```

`np.random.seed(1545721) # from random.org`

`N = 100001`

`u_min, u_max = 0, 100`

`v_p = 0.6`

```
n_ws = 50
ws = sp.stats.norm.rvs(0, 1, size=n_ws)
w_min, w_max = ws.min(), ws.max()
```

```
df = pd.DataFrame({
    'u': np.random.randint(u_min, u_max, size=N),
    'v': sp.stats.bernoulli.rvs(v_p, size=N),
    'w': np.random.choice(ws, size=N, replace=True)
})
```

`df.head()`

| | u | v | w |
|---|---|---|---|
| 0 | 97 | 0 | 0.537397 |
| 1 | 79 | 1 | 1.536383 |
| 2 | 44 | 1 | 1.074817 |
| 3 | 51 | 0 | -0.491210 |
| 4 | 47 | 1 | 1.592646 |

We see that this data frame has just over 100,000 rows, but only about 10,000 distinct rows.

`df.shape[0]`

`100001`

`df.drop_duplicates().shape[0]`

`9997`

We now use `pandas`' `groupby` method to produce a data frame that contains the count of each unique combination of `u`, `v`, and `w`.

```
count_df = df.groupby(list(df.columns)).size()
count_df.name = 'count'
count_df = count_df.reset_index()
```

In order to make later examples interesting, we shuffle the rows of the reduced data frame, because `pandas` automatically sorts the values we grouped on in the reduced data frame.

```
shuffled_ixs = count_df.index.values
np.random.shuffle(shuffled_ixs)
count_df = count_df.iloc[shuffled_ixs].copy().reset_index(drop=True)
```

`count_df.head()`

| | u | v | w | count |
|---|---|---|---|---|
| 0 | 0 | 0 | 0.425597 | 14 |
| 1 | 48 | 1 | -0.993981 | 7 |
| 2 | 35 | 0 | 0.358156 | 9 |
| 3 | 19 | 1 | -0.760298 | 17 |
| 4 | 40 | 1 | -0.688514 | 13 |

Again, we see that we are storing 90% fewer rows. Although this data set has been artificially generated, I have seen space savings of up to 98% when applying this technique to real-world data sets.

`count_df.shape[0] / N`

`0.0999690003099969`

This space savings allows me to analyze data sets which initially appear too large to fit in memory. For example, the computer I am writing this on has 16 GB of RAM. At a 90% space savings, I can comfortably analyze a data set that might otherwise be 80 GB in memory while leaving a healthy amount of memory for other processes. To me, the convenience and tight feedback loop that come with fitting a data set entirely in memory are hard to overstate.

As nice as it is to fit a data set into memory, it’s not very useful unless we can still analyze it. The rest of this post will show how we can perform standard operations on these summary data sets.

For convenience, we will separate the feature columns from the count columns.

```
summ_df = count_df[['u', 'v', 'w']]
n = count_df['count']
```

Suppose we have a group of numbers \(x_1, x_2, \ldots, x_n\). Let the unique values among these numbers be denoted \(z_1, z_2, \ldots, z_m\), and let \(n_j\) be the number of times \(z_j\) appears in the original group. The mean of the \(x_i\)s is therefore

\[ \begin{align*} \bar{x} & = \frac{1}{n} \sum_{i = 1}^n x_i = \frac{1}{n} \sum_{j = 1}^m n_j z_j, \end{align*} \]

since we may group identical \(x_i\)s into a single summand. Since \(n = \sum_{j = 1}^m n_j\), we can calculate the mean using the following function.

```
def mean(df, count):
    return df.mul(count, axis=0).sum() / count.sum()
```

`mean(summ_df, n)`

```
u 49.308067
v 0.598704
w 0.170815
dtype: float64
```

We see that the means calculated by our function agree with the means of the original data frame.

`df.mean(axis=0)`

```
u 49.308067
v 0.598704
w 0.170815
dtype: float64
```

`np.allclose(mean(summ_df, n), df.mean(axis=0))`

`True`
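As an aside, `np.average` accepts frequency weights directly, so for a single column the same grouped mean can be computed without the helper; a tiny hypothetical example:

```python
import numpy as np

z = np.array([1.0, 2.0, 4.0])   # unique values
n_j = np.array([2, 1, 1])       # multiplicities

# Weighted mean: (2 * 1 + 1 * 2 + 1 * 4) / 4 = 2.0
print(np.average(z, weights=n_j))
```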

We can calculate the variance as

\[ \begin{align*} \sigma_x^2 & = \frac{1}{n - 1} \sum_{i = 1}^n \left(x_i - \bar{x}\right)^2 = \frac{1}{n - 1} \sum_{j = 1}^m n_j \left(z_j - \bar{x}\right)^2 \end{align*} \]

using the same trick of combining identical terms in the original sum. Again, this calculation is easy to implement in Python.

```
def var(df, count):
    mu = mean(df, count)

    return np.power(df - mu, 2).mul(count, axis=0).sum() / (count.sum() - 1)
```

`var(summ_df, n)`

```
u 830.025064
v 0.240260
w 1.099191
dtype: float64
```

We see that the variances calculated by our function agree with the variances of the original data frame.

`df.var()`

```
u 830.025064
v 0.240260
w 1.099191
dtype: float64
```

`np.allclose(var(summ_df, n), df.var(axis=0))`

`True`

Histograms are fundamental tools for exploratory data analysis. Fortunately, `pyplot`'s `hist` function easily accommodates summarized data using the `weights` optional argument.

```
fig, (full_ax, summ_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))
nbins = 20
blue, green = sns.color_palette()[:2]
full_ax.hist(df.w, bins=nbins, color=blue, alpha=0.5, lw=0);
full_ax.set_xlabel('$w$');
full_ax.set_ylabel('Count');
full_ax.set_title('Full data frame');
summ_ax.hist(summ_df.w, bins=nbins, weights=n, color=green, alpha=0.5, lw=0);
summ_ax.set_xlabel('$w$');
summ_ax.set_title('Summarized data frame');
```

We see that the histograms for \(w\) produced from the full and summarized data frames are identical.

Calculating the mean and variance of our summarized data frames was not too difficult. Calculating quantiles from this data frame is slightly more involved, though still not terribly hard.

Our implementation will rely on sorting the data frame. Though this implementation is not optimal from a computational complexity point of view, it is in keeping with the spirit of `pandas`' implementation of quantiles. I have given some thought to how to implement linear-time selection on the summarized data frame, but have not yet worked out the details.

Before writing a function to calculate quantiles of a data frame with several columns, we will walk through the simpler case of computing the quartiles of a single series.

`u = summ_df.u`

`u.head()`

```
0 0
1 48
2 35
3 19
4 40
Name: u, dtype: int64
```

First we `argsort` the series.

`sorted_ilocs = u.argsort()`

We see that `u.iloc[sorted_ilocs]` will now be in ascending order.

`sorted_u = u.iloc[sorted_ilocs]`

`(sorted_u[:-1] <= sorted_u[1:]).all()`

`True`

More importantly, `n.iloc[sorted_ilocs]` will have the count of the smallest element of `u` first, the count of the second smallest element second, and so on.

```
sorted_n = n.iloc[sorted_ilocs]
sorted_cumsum = sorted_n.cumsum()
cdf = (sorted_cumsum / n.sum()).values
```

Now, the \(i\)-th location of `sorted_cumsum` will contain the number of elements of `u` less than or equal to the \(i\)-th smallest element, and therefore `cdf` is the empirical cumulative distribution function of `u`. The following plot shows that this interpretation is correct.

```
fig, ax = plt.subplots(figsize=(8, 6))
blue, _, red = sns.color_palette()[:3]
ax.plot(sorted_u, cdf, c=blue, label='Empirical CDF');
plot_u = np.arange(100)
ax.plot(plot_u, sp.stats.randint.cdf(plot_u, u_min, u_max), '--', c=red, label='Population CDF');
ax.set_xlabel('$u$');
ax.legend(loc=2);
```

If, for example, we wish to find the median of `u`, we want to find the first location in `cdf` which is greater than or equal to 0.5.

`median_iloc_in_sorted = (cdf < 0.5).argmin()`

The index of the median in `u` is therefore `sorted_ilocs.iloc[median_iloc_in_sorted]`, so the median of `u` is

`u.iloc[sorted_ilocs.iloc[median_iloc_in_sorted]]`

`49`

`df.u.quantile(0.5)`

`49.0`

We can generalize this method to calculate multiple quantiles simultaneously as follows.

`q = np.array([0.25, 0.5, 0.75])`

`u.iloc[sorted_ilocs.iloc[np.less.outer(cdf, q).argmin(axis=0)]]`

```
2299 24
9079 49
1211 74
Name: u, dtype: int64
```

`df.u.quantile(q)`

```
0.25 24
0.50 49
0.75 74
dtype: float64
```

The array `np.less.outer(cdf, q).argmin(axis=0)` contains three entries, one for each element of `q`, giving the first location where `cdf` is greater than or equal to that quantile. The following function generalizes this approach from series to data frames.

```
def quantile(df, count, q=0.5):
    q = np.ravel(q)

    sorted_ilocs = df.apply(pd.Series.argsort)
    sorted_counts = sorted_ilocs.apply(lambda s: count.iloc[s].values)
    cdf = sorted_counts.cumsum() / sorted_counts.sum()

    q_ilocs_in_sorted_ilocs = pd.DataFrame(np.less.outer(cdf.values, q).argmin(axis=0).T,
                                           columns=df.columns)
    q_ilocs = sorted_ilocs.apply(lambda s: s[q_ilocs_in_sorted_ilocs[s.name]].reset_index(drop=True))

    q_df = df.apply(lambda s: s.iloc[q_ilocs[s.name]].reset_index(drop=True))
    q_df.index = q

    return q_df
```

`quantile(summ_df, n, q=q)`

| | u | v | w |
|---|---|---|---|
| 0.25 | 24 | 0 | -0.688514 |
| 0.50 | 49 | 1 | 0.040036 |
| 0.75 | 74 | 1 | 1.074817 |

`df.quantile(q=q)`

| | u | v | w |
|---|---|---|---|
| 0.25 | 24 | 0 | -0.688514 |
| 0.50 | 49 | 1 | 0.040036 |
| 0.75 | 74 | 1 | 1.074817 |

`np.allclose(quantile(summ_df, n, q=q), df.quantile(q=q))`

`True`

Another important operation is bootstrapping. We will see two ways to perform bootstrapping on the summarized data set.

`n_boot = 10000`

Key to both approaches to the bootstrap is knowing the proportion of the data set that each distinct combination of features comprises.

`weights = n / n.sum()`

The two approaches differ in the type of data frame they produce. The first we will discuss produces a non-summarized data frame with non-unique rows, while the second produces a summarized data frame. Each of these approaches to bootstrapping is useful in different situations.

To produce a non-summarized data frame, we generate a list of locations in `summ_df` based on `weights` using `numpy.random.choice`.

```
boot_ixs = np.random.choice(summ_df.shape[0], size=n_boot, replace=True,
                            p=weights)
```

`boot_df = summ_df.iloc[boot_ixs]`

`boot_df.head()`

| | u | v | w |
|---|---|---|---|
| 1171 | 47 | 1 | -1.392235 |
| 9681 | 3 | 1 | 0.018521 |
| 6664 | 13 | 1 | 1.941207 |
| 8343 | 13 | 0 | 0.655181 |
| 3595 | 95 | 1 | 0.972592 |

We can verify that our bootstrapped data frame has (approximately) the same distribution as the original data frame using Q-Q plots.

`ps = np.linspace(0, 1, 100)`

`boot_qs = boot_df[['u', 'w']].quantile(q=ps)`

`qs = df[['u', 'w']].quantile(q=ps)`

```
fig, ax = plt.subplots(figsize=(8, 6))
blue = sns.color_palette()[0]
ax.plot((u_min, u_max), (u_min, u_max), '--', c='k', lw=0.75,
label='Perfect agreement');
ax.scatter(qs.u, boot_qs.u, c=blue, alpha=0.5);
ax.set_xlim(u_min, u_max);
ax.set_xlabel('Original quantiles');
ax.set_ylim(u_min, u_max);
ax.set_ylabel('Resampled quantiles');
ax.set_title('Q-Q plot for $u$');
ax.legend(loc=2);
```

```
fig, ax = plt.subplots(figsize=(8, 6))
blue = sns.color_palette()[0]
ax.plot((w_min, w_max), (w_min, w_max), '--', c='k', lw=0.75,
label='Perfect agreement');
ax.scatter(qs.w, boot_qs.w, c=blue, alpha=0.5);
ax.set_xlim(w_min, w_max);
ax.set_xlabel('Original quantiles');
ax.set_ylim(w_min, w_max);
ax.set_ylabel('Resampled quantiles');
ax.set_title('Q-Q plot for $w$');
ax.legend(loc=2);
```

We see that both of the resampled distributions agree quite closely with the original distributions. We have only produced Q-Q plots for \(u\) and \(w\) because \(v\) is binary-valued.

While at first non-summarized bootstrap resampling may appear to counteract the benefits of summarizing the original data frame, it can be quite useful when training and evaluating online learning algorithms, where iterating through the locations of the bootstrapped data in the original summarized data frame is efficient.

To produce a summarized data frame, the counts of the resampled data frame are sampled from a multinomial distribution with event probabilities given by `weights`.

`boot_counts = pd.Series(np.random.multinomial(n_boot, weights), name='count')`

Again, we compare the distribution of our bootstrapped data frame to that of the original with Q-Q plots. Here our summarized quantile function is quite useful.

`boot_count_qs = quantile(summ_df, boot_counts, q=ps)`

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot((u_min, u_max), (u_min, u_max), '--', c='k', lw=0.75,
label='Perfect agreement');
ax.scatter(qs.u, boot_count_qs.u, c=blue, alpha=0.5);
ax.set_xlim(u_min, u_max);
ax.set_xlabel('Original quantiles');
ax.set_ylim(u_min, u_max);
ax.set_ylabel('Resampled quantiles');
ax.set_title('Q-Q plot for $u$');
ax.legend(loc=2);
```

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot((w_min, w_max), (w_min, w_max), '--', c='k', lw=0.75,
label='Perfect agreement');
ax.scatter(qs.w, boot_count_qs.w, c=blue, alpha=0.5);
ax.set_xlim(w_min, w_max);
ax.set_xlabel('Original quantiles');
ax.set_ylim(w_min, w_max);
ax.set_ylabel('Resampled quantiles');
ax.set_title('Q-Q plot for $w$');
ax.legend(loc=2);
```

Again, we see that both of the resampled distributions agree quite closely with the original distributions.

Linear regression is among the most frequently used types of statistical inference, and it plays nicely with summarized data. Typically, we have a response variable \(y\) that we wish to model as a linear combination of \(u\), \(v\), and \(w\) as

\[ \begin{align*} y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 w_i + \varepsilon_i, \end{align*} \]

where \(\varepsilon_i \sim N(0, \sigma^2)\) is noise. We generate such a data set below (with \(\sigma = 0.1\)).

`beta = np.array([-3., 0.1, -4., 2.])`

`noise_std = 0.1`

`X = dmatrix('u + v + w', data=df)`

`y = pd.Series(np.dot(X, beta), name='y') + sp.stats.norm.rvs(scale=noise_std, size=N)`

`y.head()`

```
0 7.862559
1 3.830585
2 -0.388246
3 1.047091
4 0.992082
Name: y, dtype: float64
```

Each element of the series `y` corresponds to one row in the uncompressed data frame `df`. The `OLS` class from `statsmodels` comes quite close to recovering the true regression coefficients.

`full_ols = sm.OLS(y, X).fit()`

`full_ols.params`

```
const -2.999658
x1 0.099986
x2 -3.998997
x3 2.000317
dtype: float64
```

To show how we can perform linear regression on the summarized data frame, we recall that the ordinary least squares estimator minimizes the residual sum of squares. The residual sum of squares is given by

\[ \begin{align*} RSS & = \sum_{i = 1}^n \left(y_i - \mathbf{x}_i \mathbf{\beta}^{\intercal}\right)^2. \end{align*} \]

Here \(\mathbf{x}_i = [1\ u_i\ v_i\ w_i]\) is the \(i\)-th row of the original data frame (with a constant added for the intercept) and \(\mathbf{\beta} = [\beta_0\ \beta_1\ \beta_2\ \beta_3]\) is the row vector of regression coefficients. It would be tempting to rewrite \(RSS\) by grouping the terms based on the row their features map to in the compressed data frame, but this approach would lead to incorrect results. Due to the stochastic noise term \(\varepsilon_i\), identical values of \(u\), \(v\), and \(w\) can (and will almost certainly) map to different values of \(y\). We can see this phenomenon by calculating the range of \(y\) grouped on \(u\), \(v\), and \(w\).

`reg_df = pd.concat((y, df), axis=1)`

`reg_df.groupby(('u', 'v', 'w')).y.apply(np.ptp).describe()`

```
count 9997.000000
mean 0.297891
std 0.091815
min 0.000000
25% 0.237491
50% 0.296838
75% 0.358015
max 0.703418
Name: y, dtype: float64
```

If \(y\) were uniquely determined by \(u\), \(v\), and \(w\), we would expect the mean and quartiles of these ranges to be zero, which they are not. Fortunately, we can account for this difficulty with a bit of care.

Let \(S_j = \{i\ |\ \mathbf{x}_i = \mathbf{z}_j\}\), the set of row indices in the original data frame that correspond to the \(j\)-th row in the summary data frame. Define \(\bar{y}_{(j)} = \frac{1}{n_j} \sum_{i \in S_j} y_i\), which is the mean of the response variables that correspond to \(\mathbf{z}_j\). Intuitively, since \(\varepsilon_i\) has mean zero, \(\bar{y}_{(j)}\) is our best unbiased estimate of \(\mathbf{z}_j \mathbf{\beta}^{\intercal}\). We will now show that regressing \(\sqrt{n_j} \bar{y}_{(j)}\) on \(\sqrt{n_j} \mathbf{z}_j\) gives the same results as the full regression. We use the standard trick of adding and subtracting the mean and get

\[ \begin{align*} RSS & = \sum_{j = 1}^m \sum_{i \in S_j} \left(y_i - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2 \\ & = \sum_{j = 1}^m \sum_{i \in S_j} \left(\left(y_i - \bar{y}_{(j)}\right) + \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)\right)^2 \\ & = \sum_{j = 1}^m \sum_{i \in S_j} \left(\left(y_i - \bar{y}_{(j)}\right)^2 + 2 \left(y_i - \bar{y}_{(j)}\right) \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right) + \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2\right). \end{align*} \]

As is usual in these situations, the cross term vanishes, since

\[ \begin{align*} \sum_{i \in S_j} \left(y_i - \bar{y}_{(j)}\right) \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right) & = \sum_{i \in S_j} \left(y_i \bar{y}_{(j)} - y_i \mathbf{z}_j \mathbf{\beta}^{\intercal} - \bar{y}_{(j)}^2 + \bar{y}_{(j)} \mathbf{z}_j \mathbf{\beta}^{\intercal}\right) \\ & = \bar{y}_{(j)} \sum_{i \in S_j} y_i - \mathbf{z}_j \mathbf{\beta}^{\intercal} \sum_{i \in S_j} y_i - n_j \bar{y}_{(j)}^2 + n_j \bar{y}_{(j)} \mathbf{z}_j \mathbf{\beta}^{\intercal} \\ & = n_j \bar{y}_{(j)}^2 - n_j \bar{y}_{(j)} \mathbf{z}_j \mathbf{\beta}^{\intercal} - n_j \bar{y}_{(j)}^2 + n_j \bar{y}_{(j)} \mathbf{z}_j \mathbf{\beta}^{\intercal} \\ & = 0. \end{align*} \]

Therefore we may decompose the residual sum of squares as

\[ \begin{align*} RSS & = \sum_{j = 1}^m \sum_{i \in S_j} \left(y_i - \bar{y}_{(j)}\right)^2 + \sum_{j = 1}^m \sum_{i \in S_j} \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2 \\ & = \sum_{j = 1}^m \sum_{i \in S_j} \left(y_i - \bar{y}_{(j)}\right)^2 + \sum_{j = 1}^m n_j \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2. \end{align*} \]

The important property of this decomposition is that the first sum does not depend on \(\mathbf{\beta}\), so minimizing \(RSS\) with respect to \(\mathbf{\beta}\) is equivalent to minimizing the second sum. We see that this second sum can be written as

\[ \begin{align*} \sum_{j = 1}^m n_j \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2 & = \sum_{j = 1}^m \left(\sqrt{n_j} \bar{y}_{(j)} - \sqrt{n_j} \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2 \end{align*}, \]

which is exactly the residual sum of squares for regressing \(\sqrt{n_j} \bar{y}_{(j)}\) on \(\sqrt{n_j} \mathbf{z}_j\).
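This equivalence is easy to check numerically. The following sketch uses a small synthetic data set (made up for illustration, not the data frame above) to verify that the \(\sqrt{n_j}\)-weighted regression on the unique rows reproduces the full OLS coefficients:

```python
import numpy as np

rng = np.random.RandomState(0)

# Unique feature rows (with an intercept column) and their multiplicities.
z = np.array([[1., 0.], [1., 1.], [1., 2.]])
n_j = np.array([3, 2, 4])

# Full design: each unique row repeated n_j times, with noisy responses.
X_full = np.repeat(z, n_j, axis=0)
y_full = X_full.dot(np.array([1., 2.])) + rng.normal(scale=0.1, size=X_full.shape[0])

# Group means of the response for each unique row.
splits = np.cumsum(n_j)[:-1]
ybar = np.array([grp.mean() for grp in np.split(y_full, splits)])

# Weighted (summarized) regression versus the full regression.
w = np.sqrt(n_j)
beta_summ, *_ = np.linalg.lstsq(z * w[:, None], ybar * w, rcond=None)
beta_full, *_ = np.linalg.lstsq(X_full, y_full, rcond=None)

print(np.allclose(beta_summ, beta_full))  # True
```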

```
summ_reg_df = reg_df.groupby(('u', 'v', 'w')).y.mean().reset_index().iloc[shuffled_ixs].reset_index(drop=True).copy()
summ_reg_df['n'] = n
```

`summ_reg_df.head()`

| | u | v | w | y | n |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.425597 | -2.173984 | 14 |
| 1 | 48 | 1 | -0.993981 | -4.174895 | 7 |
| 2 | 35 | 0 | 0.358156 | 1.252848 | 9 |
| 3 | 19 | 1 | -0.760298 | -6.612355 | 17 |
| 4 | 40 | 1 | -0.688514 | -4.379063 | 13 |

The design matrices for this summarized model are easy to construct using `patsy`.

```
y_summ, X_summ = dmatrices("""
    I(np.sqrt(n) * y) ~
        np.sqrt(n) + I(np.sqrt(n) * u) + I(np.sqrt(n) * v) + I(np.sqrt(n) * w) - 1
""", data=summ_reg_df)
```

Note that we must remove `patsy`'s constant column for the intercept and replace it with `np.sqrt(n)`.

`summ_ols = sm.OLS(y_summ, X_summ).fit()`

`summ_ols.params`

`array([-2.99965783, 0.09998571, -3.99899718, 2.00031673])`

We see that the summarized regression produces the same parameter estimates as the full regression.

`np.allclose(full_ols.params, summ_ols.params)`

`True`

As a final example of adapting common methods to summarized data frames, we will show how to fit a logistic regression model on a summarized data set by maximum likelihood.

We will use the model

\[P(s_i = 1\ |\ \mathbf{x}_i) = \frac{1}{1 + \exp(-\mathbf{x}_i \gamma^{\intercal})}.\]

As above, \(\mathbf{x}_i = [1\ u_i\ v_i\ w_i]\). The true value of \(\gamma\) is

`gamma = np.array([1., 0.01, -1., -2.])`

We now generate samples from this model.

`X = dmatrix('u + v + w', data=df)`

```
p = pd.Series(sp.special.expit(np.dot(X, gamma)), name='p')
s = pd.Series(sp.stats.bernoulli.rvs(p), name='s')
```

`logit_df = pd.concat((s, p, df), axis=1)`

`logit_df.head()`

| | s | p | u | v | w |
|---|---|---|---|---|---|
| 0 | 1 | 0.709963 | 97 | 0 | 0.537397 |
| 1 | 0 | 0.092560 | 79 | 1 | 1.536383 |
| 2 | 0 | 0.153211 | 44 | 1 | 1.074817 |
| 3 | 1 | 0.923609 | 51 | 0 | -0.491210 |
| 4 | 0 | 0.062077 | 47 | 1 | 1.592646 |

We first fit the logistic regression model to the full data frame.

`full_logit = sm.Logit(s, X).fit()`

```
Optimization terminated successfully.
Current function value: 0.414221
Iterations 7
```

`full_logit.params`

```
const 0.965283
x1 0.009944
x2 -0.966797
x3 -1.990506
dtype: float64
```

We see that the estimates are quite close to the true parameters.

The technique used to adapt maximum likelihood estimation of logistic regression to the summarized data frame is quite elegant. The likelihood for the full data set is given by the fact that (given \(u\), \(v\), and \(w\)) \(s\) is Bernoulli distributed with

\[s_i\ |\ \mathbf{x}_i \sim \operatorname{Ber}\left(\frac{1}{1 + \exp(-\mathbf{x}_i \gamma^{\intercal})}\right).\]

To derive the likelihood for the summarized data set, we count the number of successes (where \(s = 1\)) for each unique combination of features \(\mathbf{z}_j\), and denote this quantity \(k_j\).

```
summ_logit_df = logit_df.groupby(('u', 'v', 'w')).s.sum().reset_index().iloc[shuffled_ixs].reset_index(drop=True).copy()
summ_logit_df = summ_logit_df.rename(columns={'s': 'k'})
summ_logit_df['n'] = n
```

`summ_logit_df.head()`

| | u | v | w | k | n |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.425597 | 5 | 14 |
| 1 | 48 | 1 | -0.993981 | 7 | 7 |
| 2 | 35 | 0 | 0.358156 | 8 | 9 |
| 3 | 19 | 1 | -0.760298 | 12 | 17 |
| 4 | 40 | 1 | -0.688514 | 9 | 13 |

Now, instead of each row representing a single Bernoulli trial (as in the full data frame), each row represents \(n_j\) trials, so we have that \(k_j\) is (conditionally) Binomially distributed with

\[k_j\ |\ \mathbf{z}_j \sim \operatorname{Bin}\left(n_j, \frac{1}{1 + \exp(-\mathbf{z}_j \gamma^{\intercal})}\right).\]

`summ_logit_X = dmatrix('u + v + w', data=summ_logit_df)`

As I have shown in a previous post, we can use `statsmodels`' `GenericLikelihoodModel` class to fit custom probability models by maximum likelihood. The model is implemented as follows.

```
class SummaryLogit(GenericLikelihoodModel):
    def __init__(self, endog, exog, n, **qwargs):
        """
        endog is the number of successes
        exog are the features
        n are the number of trials
        """
        self.n = n
        super(SummaryLogit, self).__init__(endog, exog, **qwargs)

    def nloglikeobs(self, gamma):
        """
        gamma is the vector of regression coefficients

        returns the negative log likelihood of each of the observations for
        the coefficients in gamma
        """
        p = sp.special.expit(np.dot(self.exog, gamma))

        return -sp.stats.binom.logpmf(self.endog, self.n, p)

    def fit(self, start_params=None, maxiter=10000, maxfun=5000, **qwargs):
        # wraps the GenericLikelihoodModel's fit method to set default start parameters
        if start_params is None:
            start_params = np.zeros(self.exog.shape[1])

        return super(SummaryLogit, self).fit(start_params=start_params,
                                             maxiter=maxiter, maxfun=maxfun, **qwargs)
```

`summ_logit = SummaryLogit(summ_logit_df.k, summ_logit_X, summ_logit_df.n).fit()`

```
Optimization terminated successfully.
Current function value: 1.317583
Iterations: 357
Function evaluations: 599
```

Again, we get reasonable estimates of the regression coefficients, which are close to those obtained from the full data set.

`summ_logit.params`

`array([ 0.96527992, 0.00994322, -0.96680904, -1.99051485])`

`np.allclose(summ_logit.params, full_logit.params, rtol=10**-4)`

`True`

Hopefully this introduction to the technique of summarizing data sets has proved useful and will allow you to explore medium data more easily in the future. We have only scratched the surface on the types of statistical techniques that can be adapted to work on summarized data sets, but with a bit of ingenuity, many of the ideas in this post can apply to other models.

This post is available as an IPython notebook here.

Tags: Statistics, Python

Ordinary least squares (OLS) is, without a doubt, the best-known linear regression model. Despite its wide applicability, it often gives undesirable results when the data deviate from its underlying normal model. In particular, it is quite sensitive to outliers in the data. In this post, we will illustrate this sensitivity and then show that changing the error distribution results in a more robust regression model.

We will use one of the data sets from Anscombe’s quartet to illustrate these concepts. Anscombe’s quartet is a well-known group of four data sets that illustrates the importance of exploratory data analysis and visualization. In particular, we will use the third dataset from Anscombe’s quartet. This data set is shown below.

```
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import pymc3
from scipy import stats
import seaborn as sns
from statsmodels import api as sm
```

```
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
```

It is quite clear that this data set exhibits a highly linear relationship, with one outlier (when `x = 13`). Below, we show the results of two OLS models fit to this data. One is fit to all of the data, and the other is fit to the data with the outlier point removed.

`X = sm.add_constant(x)`

`ols_result = sm.OLS(y, X).fit()`

```
x_no_outlier = x[x != 13]
X_no_outlier = X[x != 13]
y_no_outlier = y[x != 13]
```

`no_outlier_result = sm.OLS(y_no_outlier, X_no_outlier).fit()`

One of the ways the OLS estimator can be derived is by minimizing the mean squared error (MSE) of the model on the training data. Below we show the MSE of both of these models on both the full data set and the data set without the outlier.

```
def mse(actual, predicted):
    return ((actual - predicted)**2).mean()
```

```
mse_df = pd.DataFrame({
    'Full data set': [
        mse(y, ols_result.predict(X)),
        mse(y, no_outlier_result.predict(X))
    ],
    'Data set without outlier': [
        mse(y_no_outlier, ols_result.predict(X_no_outlier)),
        mse(y_no_outlier, no_outlier_result.predict(X_no_outlier))
    ]
},
    index=['Full model', 'No outlier model']
)

mse_df = mse_df[mse_df.columns[::-1]]
```

`mse_df`

| | Full data set | Data set without outlier |
|---|---|---|
| Full model | 1.250563 | 0.325152 |
| No outlier model | 1.637640 | 0.000008 |

By simple visual inspection, we suspect that without the outlier (`x = 13`), the relationship between `x` and `y` is (almost) perfectly linear. This suspicion is confirmed by this model's very small MSE on the reduced data set.

Unfortunately, we will usually have many more than eleven points in our data set. In reality, the outliers may be more difficult to detect visually, and they may be harder to exclude manually. We would like a model that performs reasonably well, even in the presence of outliers.

Before we define such a robust regression model, it will be helpful to consider the OLS model mathematically. In the OLS model, we have that

\[y_i = \vec{\beta} \cdot \vec{x}_i + \varepsilon_i.\]

Here, \(y_i\) is the observation corresponding to the feature vector \(\vec{x}_i\), \(\vec{\beta}\) is the vector of regression coefficients, and \(\varepsilon_i\) is noise. In the OLS model, the noise terms are independent with identical normal distributions. It is the properties of these normally distributed errors that make OLS susceptible to outliers.

The normal distribution is well-known to have thin tails. That is, it assigns relatively little probability to observations far away from the mean. Students of basic statistics are quite familiar with the fact that approximately 95% of the mass of the normal distribution lies within two standard deviations of the mean.

We find a robust regression model by choosing an error distribution with fatter tails; a common choice is Student’s t-distribution.
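To make the difference in tail behavior concrete, a quick check with `scipy` (not part of the original analysis) compares the probability of an observation falling more than three standard units from the center:

```python
from scipy import stats

# Two-sided tail mass beyond three standard units.
normal_tail = 2 * stats.norm.sf(3)   # roughly 0.0027
t_tail = 2 * stats.t.sf(3, df=5)     # roughly 0.03, an order of magnitude more

print(t_tail > normal_tail)  # True
```

It is this extra tail mass that lets the t model treat the outlier as an unsurprising observation rather than dragging the fitted line toward it.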

Below we define this model using `pymc3`.

```
with pymc3.Model() as model:
    # Regression coefficients
    alpha = pymc3.Uniform('alpha', -100, 100)
    beta = pymc3.Uniform('beta', -100, 100)

    # Expected value
    y_hat = alpha + beta * x

    # Observations with t-distributed error
    y_obs = pymc3.T('y_obs', nu=5, mu=y_hat, observed=y)
```

Here we have given our t-distributed residuals five degrees of freedom. Customarily, these models will use four, five, or six degrees of freedom. It is important that the number of degrees of freedom, \(\nu\), be relatively small, because as \(\nu \to \infty\), the t-distribution converges to the normal distribution.
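This convergence is easy to see numerically; in the illustrative check below, the maximum gap between the t density and the standard normal density shrinks as \(\nu\) grows:

```python
import numpy as np
from scipy import stats

# Maximum pointwise gap between the t density and the standard normal
# density over a grid, for increasing degrees of freedom.
x = np.linspace(-4, 4, 101)
gaps = {nu: np.abs(stats.t.pdf(x, nu) - stats.norm.pdf(x)).max()
        for nu in (5, 50, 500)}

print(gaps[5] > gaps[50] > gaps[500])  # True: the densities converge
```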

We now fit this model using no-U-turn sampling.

```
with model:
    step = pymc3.NUTS()
    trace_ = pymc3.sample(3000, step)

burn = 1000
thin = 2
trace = trace_[burn::thin]
```

` [-----------------100%-----------------] 3000 of 3000 complete in 8.7 sec`

The following plots summarize the posterior distribution of the regression intercept (\(\alpha\)) and slope (\(\beta\)).

We now plot the robust model along with the previous models.

```
alpha = trace['alpha'].mean()
beta = trace['beta'].mean()
```

We see right away that, although the robust model has not completely captured the linear relationship between the non-outlier points, it is much less biased by the outlier than the OLS model on the full data set. Below we compare the MSE of this model to the previous models.

```
robust_mse = pd.Series([mse(y, alpha + beta * x), mse(y_no_outlier, alpha + beta * x_no_outlier)], index=mse_df.columns)
robust_mse.name = 'Robust model'
mse_df = mse_df.append(robust_mse)
```

`mse_df`

| | Full data set | Data set without outlier |
|---|---|---|
| Full model | 1.250563 | 0.325152 |
| No outlier model | 1.637640 | 0.000008 |
| Robust model | 1.432198 | 0.032294 |

On the data set without the outlier, the robust model has a significantly larger MSE than the no outlier model, but, importantly, its MSE is an order of magnitude smaller than that of the full model.

This post is available as an IPython notebook here.

Tags: Statistics, PyMC

Maximum likelihood estimation is a common method for fitting statistical models. In Python, it is quite possible to fit maximum likelihood models using just `scipy.optimize`. Over time, however, I have come to prefer the convenience provided by `statsmodels`' `GenericLikelihoodModel`. In this post, I will show how easy it is to subclass `GenericLikelihoodModel` and take advantage of much of `statsmodels`' well-developed machinery for maximum likelihood estimation of custom models.

The model we use for this demonstration is a zero-inflated Poisson model. This is a model for count data that generalizes the Poisson model by allowing for an overabundance of zero observations.

The model has two parameters, \(\pi\), the proportion of excess zero observations, and \(\lambda\), the mean of the Poisson distribution. We assume that observations from this model are generated as follows. First, a weighted coin with probability \(\pi\) of landing on heads is flipped. If the result is heads, the observation is zero. If the result is tails, the observation is generated from a Poisson distribution with mean \(\lambda\). Note that there are two ways for an observation to be zero under this model:

- the coin lands on heads, or
- the coin lands on tails, and the sample from the Poisson distribution is zero.

If \(X\) has a zero-inflated Poisson distribution with parameters \(\pi\) and \(\lambda\), its probability mass function is given by

\[\begin{align*} P(X = 0) & = \pi + (1 - \pi)\ e^{-\lambda} \\ P(X = x) & = (1 - \pi)\ e^{-\lambda}\ \frac{\lambda^x}{x!} \textrm{ for } x > 0. \end{align*}\]

In this post, we will use the parameter values \(\pi = 0.3\) and \(\lambda = 2\). The probability mass function of the zero-inflated Poisson distribution is shown below, next to a normal Poisson distribution, for comparison.

```
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns
from statsmodels.base.model import GenericLikelihoodModel
np.random.seed(123456789)
```

```
pi = 0.3
lambda_ = 2.
```

```
def zip_pmf(x, pi=pi, lambda_=lambda_):
    if pi < 0 or pi > 1 or lambda_ <= 0:
        return np.zeros_like(x)
    else:
        return (x == 0) * pi + (1 - pi) * stats.poisson.pmf(x, lambda_)
```
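The code that produced the figure is not shown above; a minimal sketch of it (the grid of \(x\) values and the bar layout are my assumptions, and `zip_pmf` is repeated so the sketch is self-contained) might look like the following.

```python
import numpy as np
from matplotlib import pyplot as plt
from scipy import stats

pi = 0.3
lambda_ = 2.

# Repeated from above so this sketch runs on its own.
def zip_pmf(x, pi=pi, lambda_=lambda_):
    if pi < 0 or pi > 1 or lambda_ <= 0:
        return np.zeros_like(x)
    else:
        return (x == 0) * pi + (1 - pi) * stats.poisson.pmf(x, lambda_)

xs = np.arange(0, 11)  # assumed plotting range

fig, ax = plt.subplots(figsize=(8, 6))
ax.bar(xs - 0.2, zip_pmf(xs), width=0.4, label='Zero-inflated Poisson')
ax.bar(xs + 0.2, stats.poisson.pmf(xs, lambda_), width=0.4, label='Poisson')
ax.set_xlabel('$x$')
ax.set_ylabel('$P(X = x)$')
ax.legend()
```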

First we generate 1,000 observations from the zero-inflated model.

```
N = 1000
inflated_zero = stats.bernoulli.rvs(pi, size=N)
x = (1 - inflated_zero) * stats.poisson.rvs(lambda_, size=N)
```

We are now ready to estimate \(\pi\) and \(\lambda\) by maximum likelihood. To do so, we define a class that inherits from `statsmodels`

’ `GenericLikelihoodModel`

as follows.

```
class ZeroInflatedPoisson(GenericLikelihoodModel):
    def __init__(self, endog, exog=None, **kwds):
        if exog is None:
            exog = np.zeros_like(endog)

        super(ZeroInflatedPoisson, self).__init__(endog, exog, **kwds)

    def nloglikeobs(self, params):
        pi = params[0]
        lambda_ = params[1]

        return -np.log(zip_pmf(self.endog, pi=pi, lambda_=lambda_))

    def fit(self, start_params=None, maxiter=10000, maxfun=5000, **kwds):
        if start_params is None:
            lambda_start = self.endog.mean()
            excess_zeros = (self.endog == 0).mean() - stats.poisson.pmf(0, lambda_start)
            start_params = np.array([excess_zeros, lambda_start])

        return super(ZeroInflatedPoisson, self).fit(start_params=start_params,
                                                    maxiter=maxiter, maxfun=maxfun,
                                                    **kwds)
```

The key component of this class is the method `nloglikeobs`

, which returns the negative log likelihood of each observed value in `endog`

. Secondarily, we must also supply reasonable initial guesses of the parameters in `fit`

. Obtaining the maximum likelihood estimate is now simple.

```
model = ZeroInflatedPoisson(x)
results = model.fit()
```

```
Optimization terminated successfully.
Current function value: 1.586641
Iterations: 37
Function evaluations: 70
```

We see that we have estimated the parameters fairly well.

```
pi_mle, lambda_mle = results.params
pi_mle, lambda_mle
```

`(0.31542487710071976, 2.0451304204850853)`

There are many advantages to buying into the `statsmodels`

ecosystem and subclassing `GenericLikelihoodModel`

. The already-written `statsmodels`

code handles storing the observations and the interaction with `scipy.optimize`

for us. (It is possible to control the use of `scipy.optimize`

through keyword arguments to `fit`

.)

We also gain access to many of `statsmodels`

’ built in model analysis tools. For example, we can use bootstrap resampling to estimate the variation in our parameter estimates.

```
boot_mean, boot_std, boot_samples = results.bootstrap(nrep=500, store=True)
boot_pis = boot_samples[:, 0]
boot_lambdas = boot_samples[:, 1]
```
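For example, approximate percentile confidence intervals can be read directly off the bootstrap replicates. The sketch below is generic; it uses a synthetic stand-in array in place of the `boot_samples` produced above, so the numbers are purely illustrative.

```python
import numpy as np

def percentile_ci(samples, alpha=0.05):
    """Componentwise (1 - alpha) percentile interval from bootstrap replicates."""
    lower = np.percentile(samples, 100 * alpha / 2, axis=0)
    upper = np.percentile(samples, 100 * (1 - alpha / 2), axis=0)
    return lower, upper

# Synthetic stand-in for boot_samples: 500 replicates of (pi, lambda) estimates.
rng = np.random.RandomState(0)
fake_boot_samples = rng.normal([0.3, 2.], [0.02, 0.1], size=(500, 2))

lo, hi = percentile_ci(fake_boot_samples)
```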

The next time you are fitting a model using maximum likelihood, try integrating with `statsmodels`

to take advantage of the significant amount of work that has gone into its ecosystem.

This post is available as an IPython notebook here.

Tags: Statistics, Python

Logistic regression is perhaps one of the most commonly used tools in all of statistics. While I have mathematically understood and used logistic regression for quite some time, it took me much longer to develop intuition for it. It was only a few months ago that I learned about the connection between logistic regression and utility theory.

Suppose a person must decide between two options, \(Y = 0\) or \(Y = 1\). Utility theory posits that each option has an associated utility, \(U_0\) and \(U_1\), respectively. The person chooses the option with the larger utility. My intuition for logistic regression comes from the fact that it arises from a specific utility model.

In logistic regression, we are interested in modeling the effect of a covariate, \(X\), on the person’s choice. In this post, we will work with the logistic regression model

\[\begin{align*} P(Y = 1 | X = x) & = \frac{e^x}{1 + e^x}. \end{align*}\]

This model is shown graphically below.

```
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
from scipy.stats import gumbel_r
def logistic_probability(x):
    exp = np.exp(x)
    return exp / (1 + exp)
```
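The plotting code itself does not appear above; a minimal sketch of the figure (the grid of \(x\) values and the labels are my own, and `logistic_probability` is repeated for self-containment) might be:

```python
import numpy as np
from matplotlib import pyplot as plt

# Repeated from above so this sketch runs on its own.
def logistic_probability(x):
    exp = np.exp(x)
    return exp / (1 + exp)

xs = np.linspace(-5, 5, 100)  # assumed plotting grid

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(xs, logistic_probability(xs), c='k')
ax.set_xlabel('$x$')
ax.set_ylabel('$P(Y = 1 | X = x)$')
```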

In terms of utility theory, we assume that each person’s utilities are related to \(X\) by

\[\begin{align*} U_0 | X & = \beta_0 X + \varepsilon_0 \\ U_1 | X & = \beta_1 X + \varepsilon_1 \end{align*}.\]

Here, \(\beta_i X\) represents the observed utility of option \(Y = i\) (\(i = 0, 1\)), and \(\varepsilon_i\) represents the random fluctuation in the option’s utility. Logistic regression arises from the assumption that \(\varepsilon_0\) and \(\varepsilon_1\) are independent with Gumbel distributions.

The Gumbel distribution has the density function

\[\begin{align*} f(\varepsilon) & = e^{-\varepsilon}\ \exp(-e^{-\varepsilon}). \end{align*}\]

To show that this utility model gives rise to logistic regression, we see that

\[\begin{align*} P(Y = 1 | X = x) & = P(U_1 > U_0 | X = x) \\ & = P(\beta_1 x + \varepsilon_1 > \beta_0 x + \varepsilon_0) \\ & = P((\beta_1 - \beta_0) x > \varepsilon_0 - \varepsilon_1). \end{align*}\]

It is useful to note that the difference of independent identically distributed Gumbel random variables follows a logistic distribution, so

\[f(\Delta) = \frac{e^{-\Delta}}{(1 + e^{-\Delta})^2},\]

where \(\Delta = \varepsilon_1 - \varepsilon_0\).
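This fact is easy to check empirically by comparing quantiles of simulated Gumbel differences against `scipy.stats.logistic` (a quick verification, not part of the derivation; the sample size is arbitrary):

```python
import numpy as np
from scipy.stats import gumbel_r, logistic

rng = np.random.RandomState(42)
n = 100000

# Difference of two independent standard Gumbel samples.
deltas = (gumbel_r.rvs(size=n, random_state=rng)
          - gumbel_r.rvs(size=n, random_state=rng))

# Empirical quantiles should match those of the standard logistic distribution.
qs = np.linspace(0.1, 0.9, 9)
empirical = np.quantile(deltas, qs)
theoretical = logistic.ppf(qs)
```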

Therefore

\[\begin{align*} P(Y = 1 | X = x) & = P((\beta_1 - \beta_0) x > -\Delta) \\ & = \int_{-(\beta_1 - \beta_0) x}^\infty \frac{e^{-\Delta}}{(1 + e^{-\Delta})^2}\ d\Delta. \end{align*}\]

Making the substitution \(t = 1 + e^{-\Delta}\), with \(dt = -e^{-\Delta}\ d\Delta\), we get that

\[\begin{align*} P(Y = 1 | X = x) & = \int_1^{1 + \exp((\beta_1 - \beta_0) x)} t^{-2}\ dt \\ & = \left.t^{-1}\right|_{1 + \exp((\beta_1 - \beta_0) x)}^1 \\ & = 1 - \frac{1}{{1 + \exp((\beta_1 - \beta_0) x)}} \\ & = \frac{{\exp((\beta_1 - \beta_0) x)}}{{1 + \exp((\beta_1 - \beta_0) x)}}. \end{align*}\]

Recalling that

\[\begin{align*} P(Y = 1 | X = x) & = \frac{e^{x}}{1 + e^{x}}, \end{align*}\]

we must have that \(\beta_1 - \beta_0 = 1\).

The fact that this equation has infinitely many solutions reflects a subtle but important point of utility theory: the comparison of utilities is both location and scale invariant. We will prove this statement in two parts.

- For any \(\mu\), let \(\tilde{U}_i = U_i + \mu\) for \(i = 0, 1\). Then \[\begin{align*} \tilde{U_1} - \tilde{U_0} & = U_1 + \mu - (U_0 + \mu) = U_1 - U_0, \end{align*}\] so \(\tilde{U_1} - \tilde{U_0} > 0\) if and only if \(U_1 - U_0 > 0\), and therefore the difference in utility is location invariant.
- For any \(\sigma > 0\), let \(\tilde{U}_i = \frac{1}{\sigma} U_i\) for \(i = 0, 1\). Then \[\begin{align*} \tilde{U_1} - \tilde{U_0} & = \frac{1}{\sigma}\left(U_1 - U_0\right), \end{align*}\] so \(\tilde{U_1} - \tilde{U_0} > 0\) if and only if \(U_1 - U_0 > 0\), and therefore the difference in utility is scale invariant.

Together, these invariances show that the units of utility are irrelevant, and the only quantity that matters is the difference in utilities. Due to the location invariance in utilities, we may as well set \(\beta_0 = 0\), so \(\beta_1 = 1\) for convenience. Our utility model is therefore

\[\begin{align*} U_0 & = \varepsilon_0 \\ U_1 & = x + \varepsilon_1 \end{align*}.\]

To verify that this utility model is equivalent to logistic regression in a second way, we will simulate \(P(Y = 1 | X = x)\) by generating Gumbel random variables.

```
xs = np.linspace(-5, 5, 100)  # covariate grid; not defined in the original snippet
N = 500

eps0 = gumbel_r.rvs(size=(N, xs.size))
eps1 = gumbel_r.rvs(size=(N, xs.size))

U0 = eps0
U1 = xs + eps1

simulated_ps = (U1 > U0).mean(axis=0)
```

The red simulated curve is reasonably close to the black actual curve. (Increasing `N`

would cause the red curve to converge to the black one.)

Aside from being an important tool in econometrics, utility theory helps shed light on logistic regression. The perspective it provides on logistic regression opens the door to generalizations and related theories. If the random portions of utility, \(\varepsilon_0\) and \(\varepsilon_1\), are normally distributed instead of Gumbel distributed, the utility model gives rise to probit regression. For a thorough introduction to utility/choice theory, consult the excellent book *Discrete Choice Methods with Simulation* by Kenneth Train, which is freely available online.

Tags: Statistics, Econometrics

I have been interested in streaming data algorithms for some time. These algorithms assume that data arrive sequentially over time and/or that the data set is too large to fit into memory for random access. Perhaps the most widely-known streaming algorithm is HyperLogLog, which calculates the approximate number of distinct values in a stream, with fixed memory use. In this post, I will discuss a simple algorithm for randomly sampling from a data stream.

Let the values of the stream be \(x_1, x_2, x_3, \ldots\); we do not need to assume that the stream ever terminates, although it may. The reservoir algorithm samples \(k\) random items from the stream (without replacement) in the sense that after seeing \(n\) data points, the probability that any individual data point is in the sample is \(\frac{k}{n}\). This algorithm only requires one pass through the stream, and uses storage proportional to \(k\) (not the total, possibly infinite, size of the stream).

The reservoir algorithm is so named because at each step it updates a “reservoir” of candidate samples. We will denote the reservoir of candidate samples by \(R\). We will use \(R_t\) to denote the state of \(R\) after observing the first \(t\) data points. We think of \(R\) as a vector of length \(k\), so \(R_t[0]\) is the first candidate sample after \(t\) data points have been seen, \(R_t[1]\) is the second, and \(R_t[k - 1]\) is the last. It is important that \(k\) is small enough that the reservoir vectors can be stored in memory (or at least accessed reasonably quickly on disk).

We initialize the first reservoir, \(R_k\), with the first \(k\) data points we see. At this point, we have a random sample (without replacement) of the first \(k\) data points from the stream.

Suppose now that we have seen \(t - 1\) elements and have a reservoir of sample candidates \(R_{t - 1}\). When we receive \(x_t\), we generate an integer \(i\) uniformly distributed in the interval \([1, t]\). If \(i \leq k\), we set \(R_t[i - 1] = x_t\); otherwise, we wait for the next data point.

Intuitively, this algorithm seems reasonable, because \(P(x_t \in R_t) = P(i \leq k) = \frac{k}{t}\), as we expect from a uniform random sample. What is less clear at this point is that for any \(s < t\), \(P(x_{s} \in R_t) = \frac{k}{t}\) as well. We will now prove this fact.

First, we calculate the probability that a candidate sample in the reservoir remains after another data point is received. We let \(x_s \in R_t\), and suppose we have observed \(x_{t + 1}\). The candidate sample \(x_s\) will be in \(R_{t + 1}\) if and only if the random integer \(i\) generated for \(x_{t + 1}\) is not the index of \(x_s\) in \(R_t\). Since \(i\) is uniformly distributed in the interval \([1, t + 1]\), we have that

\[P(x_s \in R_{t + 1}\ |\ x_s \in R_t) = \frac{t}{t + 1}.\]

Now, suppose that we have received \(n\) data points. First consider the case where \(k < s < n\). Then

\[P(x_s \in R_n) = P(x_s \in R_n\ |\ x_s \in R_s) \cdot P(x_s \in R_s).\]

The second term, \(P(x_s \in R_s)\), is the probability that \(x_s\) entered the reservoir when it was first observed, so

\[P(x_s \in R_s) = \frac{k}{s}.\]

To calculate the first term, \(P(x_s \in R_n\ |\ x_s \in R_s)\), we multiply the probability that \(x_s\) remains in the reservoir after each subsequent observation, yielding

\[ P(x_s \in R_n\ |\ x_s \in R_s) = \prod_{t = s}^{n - 1} P(x_s \in R_{t + 1}\ |\ x_s \in R_t) = \frac{s}{s + 1} \cdot \frac{s + 1}{s + 2} \cdot \cdots \cdot \frac{n - 1}{n} = \frac{s}{n}. \]

Therefore

\[P(x_s \in R_n) = \frac{s}{n} \cdot \frac{k}{s} = \frac{k}{n},\]

as desired.

Now consider the case where \(s \leq k\), so that \(P(x_s \in R_k) = 1\). In this case,

\[ P(x_s \in R_n) = P(x_s \in R_n\ |\ x_s \in R_k) = \prod_{t = k}^{n - 1} P(x_s \in R_{t + 1}\ |\ x_s \in R_t) = \frac{k}{k + 1} \cdot \frac{k + 1}{k + 2} \cdot \cdots \cdot \frac{n - 1}{n} = \frac{k}{n}, \]

as desired.

Below we give an implementation of this algorithm in Python.

```
import itertools as itl
import numpy as np

def sample(stream, k):
    """
    Return a random sample of k elements drawn without replacement from stream.

    This function is designed to be used when the elements of stream cannot
    fit into memory.
    """
    stream = iter(stream)  # ensure islice and the loop below share one iterator
    r = np.array(list(itl.islice(stream, k)))

    for t, x in enumerate(stream, k + 1):
        i = np.random.randint(1, t + 1)
        if i <= k:
            r[i - 1] = x

    return r

sample(xrange(1000000000), 10)
```

```
> array([950266975, 378108182, 637777154, 915372867, 298742970, 629846773,
749074581, 893637541, 328486342, 685539979])
```

Vitter^{1} gives three generalizations of this simple reservoir sampling algorithm, all based on the following idea.

Instead of generating a random integer for each data point, we generate the number of data points to skip before the next candidate data point. Let \(S_t\) be the number of data points to advance from \(x_t\) before adding a candidate to the reservoir. For example, if \(S_t = 3\), we would ignore \(x_{t + 1}\) and \(x_{t + 2}\) and add \(x_{t + 3}\) to the reservoir.

The simple reservoir algorithm leads to the following distribution on \(S_t\):

\[P(S_t = 1) = \frac{k}{t + 1}\]

\[P(S_t = 2) = \left(1 - \frac{k}{t + 1}\right) \frac{k}{t + 2} = \frac{t - k + 1}{t + 1} \cdot \frac{k}{t + 2}\]

\[P(S_t = 3) = \left(1 - \frac{k}{t + 2}\right) \left(1 - \frac{k}{t + 1}\right) \frac{k}{t + 3} = \frac{t - k + 2}{t + 2} \cdot \frac{t - k + 1}{t + 1} \cdot \frac{k}{t + 3}\]

In general,

\[P(S_t = s) = k \cdot \frac{t! (t + s - (k + 1))!}{(t + s)! (t - k)!}.\]
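We can verify that this closed form agrees with the product of the individual probabilities above (the parameter values in this check are arbitrary):

```python
from math import factorial

def skip_pmf_closed(s, t, k):
    """The closed-form expression for P(S_t = s)."""
    return (k * factorial(t) * factorial(t + s - (k + 1))
            / (factorial(t + s) * factorial(t - k)))

def skip_pmf_product(s, t, k):
    """P(S_t = s) built up as a product, as in the examples above."""
    p = k / (t + s)
    for j in range(1, s):
        p *= 1 - k / (t + j)
    return p
```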

Vitter gives three generalizations of the reservoir algorithm, each based on a different way of sampling from the distribution of \(S_t\). These generalizations have the advantage of requiring the generation of fewer random integers than the simple reservoir algorithm given above.

This blog post is available as an IPython notebook here.

Vitter, Jeffrey S. “Random sampling with a reservoir.”

*ACM Transactions on Mathematical Software (TOMS)* 11.1 (1985): 37-57.↩