<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Austin Rochford (Posts about Sampling)</title><link>https://austinrochford.com/</link><description></description><atom:link href="https://austinrochford.com/tags/sampling.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2022 &lt;a href="mailto:austin.rochford@gmail.com"&gt;Austin Rochford&lt;/a&gt; </copyright><lastBuildDate>Mon, 17 Jan 2022 11:51:48 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Reservoir Sampling for Streaming Data</title><link>https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html</link><dc:creator>Austin Rochford</dc:creator><description>&lt;p&gt;I have been interested in streaming data algorithms for some time. These algorithms assume that data arrive sequentially over time and/or that the data set is too large to fit into memory for random access. Perhaps the most widely-known streaming algorithm is &lt;a href="http://en.wikipedia.org/wiki/HyperLogLog"&gt;HyperLogLog&lt;/a&gt;, which calculates the approximate number of distinct values in a stream, with fixed memory use. In this post, I will discuss a simple algorithm for randomly sampling from a data stream.&lt;/p&gt;
&lt;p&gt;Let the values of the stream be &lt;span class="math inline"&gt;\(x_1, x_2, x_3, \ldots\)&lt;/span&gt;; we do not need to assume that the stream ever terminates, although it may. The &lt;a href="http://en.wikipedia.org/wiki/Reservoir_sampling"&gt;reservoir algorithm&lt;/a&gt; samples &lt;span class="math inline"&gt;\(k\)&lt;/span&gt; random items from the stream (without replacement) in the sense that after seeing &lt;span class="math inline"&gt;\(n\)&lt;/span&gt; data points, the probability that any individual data point is in the sample is &lt;span class="math inline"&gt;\(\frac{k}{n}\)&lt;/span&gt;. This algorithm only requires one pass through the stream, and uses storage proportional to &lt;span class="math inline"&gt;\(k\)&lt;/span&gt; (not the total, possibly infinite, size of the stream).&lt;/p&gt;
&lt;p&gt;The reservoir algorithm is so named because at each step it updates a “reservoir” of candidate samples. We will denote the reservoir of candidate samples by &lt;span class="math inline"&gt;\(R\)&lt;/span&gt;. We will use &lt;span class="math inline"&gt;\(R_t\)&lt;/span&gt; to denote the state of &lt;span class="math inline"&gt;\(R\)&lt;/span&gt; after observing the first &lt;span class="math inline"&gt;\(t\)&lt;/span&gt; data points. We think of &lt;span class="math inline"&gt;\(R\)&lt;/span&gt; as a vector of length &lt;span class="math inline"&gt;\(k\)&lt;/span&gt;, so &lt;span class="math inline"&gt;\(R_t[0]\)&lt;/span&gt; is the first candidate sample after &lt;span class="math inline"&gt;\(t\)&lt;/span&gt; data points have been seen, &lt;span class="math inline"&gt;\(R_t[1]\)&lt;/span&gt; is the second, &lt;span class="math inline"&gt;\(R_t[k - 1]\)&lt;/span&gt; is the last, etc. It is important that &lt;span class="math inline"&gt;\(k\)&lt;/span&gt; is small enough that the reservoir vectors can be stored in memory (or at least accessed reasonably quickly on disk).&lt;/p&gt;
&lt;p&gt;We initialize the first reservoir, &lt;span class="math inline"&gt;\(R_k\)&lt;/span&gt;, with the first &lt;span class="math inline"&gt;\(k\)&lt;/span&gt; data points we see. At this point, we have a random sample (without replacement) of the first &lt;span class="math inline"&gt;\(k\)&lt;/span&gt; data points from the stream.&lt;/p&gt;
&lt;p&gt;Suppose now that we have seen &lt;span class="math inline"&gt;\(t - 1\)&lt;/span&gt; elements and have a reservoir of sample candidates &lt;span class="math inline"&gt;\(R_{t - 1}\)&lt;/span&gt;. When we receive &lt;span class="math inline"&gt;\(x_t\)&lt;/span&gt;, we generate an integer &lt;span class="math inline"&gt;\(i\)&lt;/span&gt; uniformly distributed in the interval &lt;span class="math inline"&gt;\([1, t]\)&lt;/span&gt;. If &lt;span class="math inline"&gt;\(i \leq k\)&lt;/span&gt;, we set &lt;span class="math inline"&gt;\(R_t[i - 1] = x_t\)&lt;/span&gt;; otherwise, we leave the reservoir unchanged (so &lt;span class="math inline"&gt;\(R_t = R_{t - 1}\)&lt;/span&gt;) and wait for the next data point.&lt;/p&gt;
&lt;p&gt;Intuitively, this algorithm seems reasonable, because &lt;span class="math inline"&gt;\(P(x_t \in R_t) = P(i \leq k) = \frac{k}{t}\)&lt;/span&gt;, as we expect from a uniform random sample. What is less clear at this point is that for any &lt;span class="math inline"&gt;\(s &amp;lt; t\)&lt;/span&gt;, &lt;span class="math inline"&gt;\(P(x_{s} \in R_t) = \frac{k}{t}\)&lt;/span&gt; as well. We will now prove this fact.&lt;/p&gt;
&lt;p&gt;First, we calculate the probability that a candidate sample in the reservoir remains after another data point is received. We let &lt;span class="math inline"&gt;\(x_s \in R_t\)&lt;/span&gt;, and suppose we have observed &lt;span class="math inline"&gt;\(x_{t + 1}\)&lt;/span&gt;. The candidate sample &lt;span class="math inline"&gt;\(x_s\)&lt;/span&gt; will be in &lt;span class="math inline"&gt;\(R_{t + 1}\)&lt;/span&gt; if and only if the random integer &lt;span class="math inline"&gt;\(i\)&lt;/span&gt; generated for &lt;span class="math inline"&gt;\(x_{t + 1}\)&lt;/span&gt; is not the index of &lt;span class="math inline"&gt;\(x_s\)&lt;/span&gt; in &lt;span class="math inline"&gt;\(R_t\)&lt;/span&gt;. Since &lt;span class="math inline"&gt;\(i\)&lt;/span&gt; is uniformly distributed in the interval &lt;span class="math inline"&gt;\([1, t + 1]\)&lt;/span&gt;, we have that&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[P(x_s \in R_{t + 1}\ |\ x_s \in R_t) = \frac{t}{t + 1}.\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Now, suppose that we have received &lt;span class="math inline"&gt;\(n\)&lt;/span&gt; data points. First consider the case where &lt;span class="math inline"&gt;\(k &amp;lt; s &amp;lt; n\)&lt;/span&gt;. Then&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[P(x_s \in R_n) = P(x_s \in R_n\ |\ x_s \in R_s) \cdot P(x_s \in R_s).\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The second term, &lt;span class="math inline"&gt;\(P(x_s \in R_s)\)&lt;/span&gt;, is the probability that &lt;span class="math inline"&gt;\(x_s\)&lt;/span&gt; entered the reservoir when it was first observed, so&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[P(x_s \in R_s) = \frac{k}{s}.\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;To calculate the first term, &lt;span class="math inline"&gt;\(P(x_s \in R_n\ |\ x_s \in R_s)\)&lt;/span&gt;, we multiply the probability that &lt;span class="math inline"&gt;\(x_s\)&lt;/span&gt; remains in the reservoir after each subsequent observation, yielding&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[
P(x_s \in R_n\ |\ x_s \in R_s)
    = \prod_{t = s}^{n - 1} P(x_s \in R_{t + 1}\ |\ x_s \in R_t)
    = \frac{s}{s + 1} \cdot \frac{s + 1}{s + 2} \cdots \frac{n - 1}{n}
    = \frac{s}{n}.
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Therefore&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[P(x_s \in R_n) = \frac{s}{n} \cdot \frac{k}{s} = \frac{k}{n},\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;as desired.&lt;/p&gt;
&lt;p&gt;Now consider the case where &lt;span class="math inline"&gt;\(s \leq k\)&lt;/span&gt;, so that &lt;span class="math inline"&gt;\(P(x_s \in R_k) = 1\)&lt;/span&gt;. In this case,&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[
P(x_s \in R_n)
    = P(x_s \in R_n\ |\ x_s \in R_k)
    = \prod_{t = k}^{n - 1} P(x_s \in R_{t + 1}\ |\ x_s \in R_t)
    = \frac{k}{k + 1} \cdot \frac{k + 1}{k + 2} \cdots \frac{n - 1}{n}
    = \frac{k}{n},
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;as desired.&lt;/p&gt;
&lt;p&gt;Below we give an implementation of this algorithm in Python.&lt;/p&gt;
&lt;div class="sourceCode" id="cb1"&gt;&lt;pre class="sourceCode python"&gt;&lt;code class="sourceCode python"&gt;&lt;span id="cb1-1"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="im"&gt;import&lt;/span&gt; itertools &lt;span class="im"&gt;as&lt;/span&gt; itl&lt;/span&gt;
&lt;span id="cb1-2"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;/span&gt;
&lt;span id="cb1-3"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="im"&gt;import&lt;/span&gt; numpy &lt;span class="im"&gt;as&lt;/span&gt; np&lt;/span&gt;
&lt;span id="cb1-4"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-4" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;/span&gt;
&lt;span id="cb1-5"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-5" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="kw"&gt;def&lt;/span&gt; sample_after(stream, k):&lt;/span&gt;
&lt;span id="cb1-6"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-6" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    &lt;span class="co"&gt;"""&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-7"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-7" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="co"&gt;    Return a random sample ok k elements drawn without replacement from stream.&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-8"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-8" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="co"&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-9"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-9" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="co"&gt;    This function is designed to be used when the elements of stream cannot&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-10"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-10" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="co"&gt;    fit into memory.&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-11"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-11" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="co"&gt;    """&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-12"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-12" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    r &lt;span class="op"&gt;=&lt;/span&gt; np.array(&lt;span class="bu"&gt;list&lt;/span&gt;(itl.islice(stream, k)))&lt;/span&gt;
&lt;span id="cb1-13"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-13" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    &lt;/span&gt;
&lt;span id="cb1-14"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-14" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    &lt;span class="cf"&gt;for&lt;/span&gt; t, x &lt;span class="kw"&gt;in&lt;/span&gt; &lt;span class="bu"&gt;enumerate&lt;/span&gt;(stream, k &lt;span class="op"&gt;+&lt;/span&gt; &lt;span class="dv"&gt;1&lt;/span&gt;):&lt;/span&gt;
&lt;span id="cb1-15"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-15" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;        i &lt;span class="op"&gt;=&lt;/span&gt; np.random.randint(&lt;span class="dv"&gt;1&lt;/span&gt;, t &lt;span class="op"&gt;+&lt;/span&gt; &lt;span class="dv"&gt;1&lt;/span&gt;)&lt;/span&gt;
&lt;span id="cb1-16"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-16" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;/span&gt;
&lt;span id="cb1-17"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-17" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;        &lt;span class="cf"&gt;if&lt;/span&gt; i &lt;span class="op"&gt;&amp;lt;=&lt;/span&gt; k:&lt;/span&gt;
&lt;span id="cb1-18"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-18" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;            r[i &lt;span class="op"&gt;-&lt;/span&gt; &lt;span class="dv"&gt;1&lt;/span&gt;] &lt;span class="op"&gt;=&lt;/span&gt; x&lt;/span&gt;
&lt;span id="cb1-19"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-19" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    &lt;/span&gt;
&lt;span id="cb1-20"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-20" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    &lt;span class="cf"&gt;return&lt;/span&gt; r&lt;/span&gt;
&lt;span id="cb1-21"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-21" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;/span&gt;
&lt;span id="cb1-22"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb1-22" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;sample(&lt;span class="bu"&gt;xrange&lt;/span&gt;(&lt;span class="dv"&gt;1000000000&lt;/span&gt;), &lt;span class="dv"&gt;10&lt;/span&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="sourceCode" id="cb2"&gt;&lt;pre class="sourceCode python"&gt;&lt;code class="sourceCode python"&gt;&lt;span id="cb2-1"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb2-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="op"&gt;&amp;gt;&lt;/span&gt; array([&lt;span class="dv"&gt;950266975&lt;/span&gt;, &lt;span class="dv"&gt;378108182&lt;/span&gt;, &lt;span class="dv"&gt;637777154&lt;/span&gt;, &lt;span class="dv"&gt;915372867&lt;/span&gt;, &lt;span class="dv"&gt;298742970&lt;/span&gt;, &lt;span class="dv"&gt;629846773&lt;/span&gt;,&lt;/span&gt;
&lt;span id="cb2-2"&gt;&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#cb2-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;         &lt;span class="dv"&gt;749074581&lt;/span&gt;, &lt;span class="dv"&gt;893637541&lt;/span&gt;, &lt;span class="dv"&gt;328486342&lt;/span&gt;, &lt;span class="dv"&gt;685539979&lt;/span&gt;])&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="generalizations"&gt;Generalizations&lt;/h4&gt;
&lt;p&gt;Vitter&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; gives three generalizations of this simple reservoir sampling algorithm, all based on the following idea.&lt;/p&gt;
&lt;p&gt;Instead of generating a random integer for each data point, we generate the number of data points to skip before the next candidate data point. Let &lt;span class="math inline"&gt;\(S_t\)&lt;/span&gt; be the number of data points to advance from &lt;span class="math inline"&gt;\(x_t\)&lt;/span&gt; before adding a candidate to the reservoir. For example, if &lt;span class="math inline"&gt;\(S_t = 3\)&lt;/span&gt;, we would ignore &lt;span class="math inline"&gt;\(x_{t + 1}\)&lt;/span&gt; and &lt;span class="math inline"&gt;\(x_{t + 2}\)&lt;/span&gt; and add &lt;span class="math inline"&gt;\(x_{t + 3}\)&lt;/span&gt; to the reservoir.&lt;/p&gt;
&lt;p&gt;The simple reservoir algorithm leads to the following distribution on &lt;span class="math inline"&gt;\(S_t\)&lt;/span&gt;:&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[P(S_t = 1) = \frac{k}{t + 1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[P(S_t = 2) = \left(1 - \frac{k}{t + 1}\right) \frac{k}{t + 2} = \frac{t - k +
1}{t + 1} \cdot \frac{k}{t + 2}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[P(S_t = 3) = \left(1 - \frac{k}{t + 2}\right) \left(1 - \frac{k}{t + 1}\right)
\frac{k}{t + 3} = \frac{t - k + 2}{t + 2} \cdot \frac{t - k + 1}{t + 1} \cdot
\frac{k}{t + 3}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;In general,&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[P(S_t = s) = k \cdot \frac{t! (t + s - (k + 1))!}{(t + s)! (t - k)!}.\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Vitter gives three generalizations of the reservoir algorithm, each based on a different way of sampling from the distribution of &lt;span class="math inline"&gt;\(S_t\)&lt;/span&gt;. These generalizations have the advantage of requiring the generation of fewer random numbers than the simple reservoir algorithm given above.&lt;/p&gt;
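&lt;p&gt;One way to sketch this idea (our own illustrative implementation, not one of Vitter's optimized algorithms) is to draw &lt;span class="math inline"&gt;\(S_t\)&lt;/span&gt; by inverse-transform sampling, walking its cumulative distribution via the ratio &lt;span class="math inline"&gt;\(P(S_t = s + 1) / P(S_t = s) = (t + s - k) / (t + s + 1)\)&lt;/span&gt;, and then advancing the stream by that many data points at once.&lt;/p&gt;

```python
import itertools
import random


def draw_skip(t, k, rng):
    # Inverse-transform sample of S_t by sequential search of its CDF.
    u = rng.random()
    s, p = 1, k / (t + 1)  # P(S_t = 1)
    cum = p
    while u > cum:
        # P(S_t = s + 1) / P(S_t = s) = (t + s - k) / (t + s + 1)
        p *= (t + s - k) / (t + s + 1)
        s += 1
        cum += p
    return s


def reservoir_skip(stream, k, seed=None):
    rng = random.Random(seed)
    it = iter(stream)
    r = list(itertools.islice(it, k))
    t = k
    end = object()
    while True:
        s = draw_skip(t, k, rng)
        # Skip s - 1 data points; the s-th becomes the next candidate.
        x = next(itertools.islice(it, s - 1, s), end)
        if x is end:
            return r
        t += s
        # Given that a replacement occurs, the evicted slot is uniform on k.
        r[rng.randrange(k)] = x


print(sorted(reservoir_skip(range(1000), 10, seed=1)))
```

&lt;p&gt;This draws one uniform variate per &lt;em&gt;accepted&lt;/em&gt; candidate rather than one per data point, which is the source of the savings; Vitter's algorithms sample &lt;span class="math inline"&gt;\(S_t\)&lt;/span&gt; more cleverly still.&lt;/p&gt;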
&lt;p&gt;This blog post is available as an &lt;a href="http://ipython.org/"&gt;IPython&lt;/a&gt; notebook &lt;a href="http://nbviewer.ipython.org/gist/AustinRochford/6be7cb4d9f38b9419f94"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;section class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn1" role="doc-endnote"&gt;&lt;p&gt;Vitter, Jeffrey S. “Random sampling with a reservoir.” &lt;em&gt;ACM Transactions on Mathematical Software (TOMS)&lt;/em&gt; 11.1 (1985): 37-57.&lt;a href="https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html#fnref1" class="footnote-back" role="doc-backlink"&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;</description><category>Algorithms</category><category>Probability</category><category>Sampling</category><guid>https://austinrochford.com/posts/2014-11-30-reservoir-sampling.html</guid><pubDate>Sun, 30 Nov 2014 05:00:00 GMT</pubDate></item></channel></rss>