index.html


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Data Mechanics</title>
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-90403446-1"></script>
<script>
  window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-90403446-1');
</script>
<link rel="stylesheet" type="text/css" href="https://fonts.googleapis.com/css?family=Alex+Brush">
<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.1.0/styles/vs.min.css">
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.1.0/highlight.min.js"></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.1.0/languages/haskell.min.js"></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.1.0/languages/javascript.min.js"></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.1.0/languages/python.min.js"></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.1.0/languages/sql.min.js"></script>
<script src="https://d3js.org/d3.v3.min.js"></script>
<script type="text/javascript" src="sheaf/protoql.js"></script>
<link rel="stylesheet" href="sheaf/sheaf.css">
<script type="text/javascript" src="sheaf/sheaf.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
</head>
<body>
<div id="sheaf"><h1>Data Mechanics<span>for Pervasive Systems and Urban Applications</span></h1><div id="toc"><ul> <li>1. <a href="#1">Introduction, Background, and Motivation</a>
  <ul>  <li>1.1. <a href="#1.1">Overview</a></li>  <li>1.2. <a href="#1.2">Data Mechanics Repository and Platform</a></li>  <li>1.3. <a href="#1.3">Mathematical Modeling, Analysis Algorithms, and Optimization Techniques</a></li>
  </ul>
 </li> <li>2. <a href="#2">Modeling Data and Data Transformations</a>
  <ul>  <li>2.1. <a href="#2.1">Relational Data and the MapReduce Paradigm</a></li>  <li>2.2. <a href="#2.2">Composing Transformations into Algorithms</a></li>  <li>2.3. <a href="#2.3">Data Provenance</a></li>
  </ul>
 </li> <li>3. <a href="#3">Systems, Models, and Algorithms</a>
  <ul>  <li>3.1. <a href="#3.1">Systems, Models, and Metrics</a></li>  <li>3.2. <a href="#3.2">Linear systems, satisfiability modulo theories, and linear programming</a></li>  <li>3.3. <a href="#3.3">Graph and Spatial Problems as Constraint Satisfaction and Optimization Problems</a></li>  <li>3.4. <a href="#3.4">Decomposition Techniques</a></li>
  </ul>
 </li> <li>4. <a href="#4">Statistical Analysis</a>
  <ul>  <li>4.1. <a href="#4.1">Review of Facts about Projections from Linear Algebra</a></li>  <li>4.2. <a href="#4.2">Defining Mean and Standard Deviation using Concepts in Linear Algebra</a></li>  <li>4.3. <a href="#4.3">Covariance and Correlation</a></li>  <li>4.4. <a href="#4.4">Observations, Hypothesis Testing, and Significance</a></li>  <li>4.5. <a href="#4.5">Sampling and Inference</a></li>
  </ul>
 </li> <li>5. <a href="#5">Visualizations and Web Services</a>
  <ul>  <li>5.1. <a href="#5.1">Web Services</a></li>
  </ul>
 </li> <li><a href="#bib">References</a></li> <li>Appendix A. <a href="#A">Other Resources</a>
  <ul>  <li>A.1. <a href="#A.1">MongoDB and Related Resources</a></li>  <li>A.2. <a href="#A.2">Installation Resources for Other Software Packages and Libraries</a></li>
  </ul>
 </li></ul></div>
<a id="1"></a>
<div class="section"><hr /><h2 class="linked heading"><span class="link-title">[<a href="#1">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">1.</span> Introduction, Background, and Motivation</div></h2>
  <a id="1.1"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#1.1">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">1.1.</span> Overview</div></h3>
<div class="text top">With over half of the world's population living in cities [<a href="#625662870761">#</a>], and given the possibility that cities are a more efficient [<a href="#aiid:1855401">#</a>] way to organize and distribute resources and activities, the rigorous study of cities as systems that can be modeled mathematically is a compelling proposition. With the advent of pervasive infrastructures of sensors and computational resources in urban environments (<a href="https://en.wikipedia.org/wiki/Smart_city"><i>smart cities</i></a>, <a href="https://www.sdxcentral.com/articles/news/englands-bristol-is-building-the-first-software-defined-city/2015/03/"><i>software-defined cities</i></a>, and the <a href="https://en.wikipedia.org/wiki/Internet_of_Things"><i>Internet of Things</i></a>), there is a potential to inform and validate such models using actual data, and to exploit them using the interoperability of the pervasive computational infrastructure that is available.</div><div class="paragraph">
In this course, we introduce and define the novel term <i>data mechanics</i> to refer specifically to the study of a particular aspect of large, instrumented systems such as cities: how data can flow through institutions and computational infrastructures, how these flows can interact, integrate, and cleave, and how they can inform decisions and operations (in both online and offline regimes). We choose the term <i>mechanics</i> specifically because it connotes the study of mechanics (e.g., classical mechanics) in physics, where universal laws that govern the behavior and interactions between many forms of matter are studied. We also choose this term to emphasize that, often, the data involved will be tied to the physical environment: geospatial and temporal information is a common and integral part of data sets produced and consumed by instrumented environments.
      </div></div>
  <a id="1.2"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#1.2">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">1.2.</span> Data Mechanics Repository and Platform</div></h3>
<div class="text top">This course is somewhat unusual in that there are specific, concrete software development goals towards which the staff and students are working. In particular, one goal of the course is to construct a new general-purpose service platform for collecting, integrating, analyzing, and employing (both in real time and offline) real data sets derived from urban sensors and services (particularly those in the city of Boston). We will consider several real data sets and data feeds, including:
<ul>
  <li><a href="https://data.boston.gov/">Analyze Boston</a>;</li>
  <li><a href="http://www.cambridgema.gov/departments/opendata">City of Cambridge Open Data Portal</a>;</li>
  <li><a href="http://data.brooklinema.gov/">Brookline OpenData</a>;</li>
  <li><a href="http://bostonopendata.boston.opendata.arcgis.com/">BostonMaps: Open Data</a>;</li>
  <li><a href="https://dataverse.harvard.edu/dataverse/BARI">Boston Area Research Initiative Dataverse</a>;</li>
  <li><a href="https://www.massdot.state.ma.us/DevelopersData.aspx">MassDOT Developers Resources</a>;</li>
  <li><a href="http://www.mass.gov/opendata/#/">MassData</a>.</li>
</ul>
We will also secure additional data sets from the above sources as well as a few others. In particular, some nationwide data repositories may be worth considering:
<ul>
  <li><a href="https://chronicdata.cdc.gov/500-Cities/500-Cities-Local-Data-for-Better-Health/6vp6-wxuq">500 Cities: Local Data for Better Health</a>.</li>
</ul>
For the purposes of student projects, there are no limits on other sources of data or computation (e.g., Twitter, Mechanical Turk, and so on) that can be employed in conjunction with some of the above.</div><div class="paragraph">
Because this course involves the construction of a relatively large software application infrastructure, we will have the opportunity to introduce and practice a variety of standard software development and software engineering concepts and techniques. This includes source control, collaboration, documentation, modularity and encapsulation, testing and validation, inversion of control, and others. Students will also need to become familiar with how to use web service APIs used by government organizations (e.g., <a href="https://www.socrata.com/">Socrata</a>) to make queries and retrieve data.
      </div><div class="paragraph">
The overall architecture of the service platform will have at least the following:
<ul>
  <li>a database/storage backend (<b>"repository"</b>) that houses:
    <ul>
      <li>original and derived data sets, with annotations that include:
        <ul>
          <li>from where, when, and by what algorithm it was retrieved</li>
          <li>using what integration algorithms it was derived</li>
        </ul>
      </li>
      <li>algorithms for data retrieval or integration, with references that include:
        <ul>
          <li>when it was written and by whom</li>
          <li>in what data sets it is derived</li>
          <li>from what component algorithms it is composed (if it is such)</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>a web service (<b>"platform"</b>) with an <i>application program interface</i> (API) for running analysis and optimization algorithms:
    <ul>
      <li>a defined language for defining analysis and optimization algorithms over the data stored in the repository</li>
      <li>an interface for submitting and running algorithms</li>
    </ul>
  </li>
  <li>other features for data and result visualization, simulation using partial data, etc.</li>
</ul>
      </div></div>
  <a id="1.3"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#1.3">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">1.3.</span> Mathematical Modeling, Analysis Algorithms, and Optimization Techniques</div></h3>
<div class="text top">There are a variety of problems that it may be possible to address using the data in the repository and the capabilities of the platform. This course will cover a variety of useful online and offline optimization topics in a mathematically rigorous but potentially application-specific way, including:
<ul>
  <li>a defined language for defining analysis and optimization algorithms over the data stored in the repository,</li>
  <li>dual decomposition,</li>
  <li>online optimization.</li>
</ul>
The goal is to apply some of these techniques to the data sets(including integrated or derived data sets) and solve practical problems. Some of the problems raised in discussions with the City of Boston DoIT and MassDOT teams are:
<ul>
  <li>characterizing intersections and coming up with a metric that incorporates:
    <ul>
      <li>intersection throughput (people per unit time),</li>
      <li>modes of transportation (public transport, biking, walking),</li>
      <li>intersection safety (vehicle speed, accidents, and so on),</li>
      <li>intersection organization (no left turns, and so on);
    </ul>
  </li>
  <li>characterizing streets and deriving metrics for:
    <ul>
      <li>number of parking spaces,</li>
      <li>probability of an accident (e.g., using different modes of transportation),</li>
      <li>senior and handicapped mobility;</li>
    </ul>
  </li>
  <li>characterizing neighborhoods:
    <ul>
      <li>economic condition and gentrification,</li>
      <li>senior and handicapped accessibility;</li>
    </ul>
  </li>
  <li>how to allocate resources to optimize some of the metrics above:
    <ul>
      <li>where to perform repairs,</li>
      <li>how to improve housing affordability,</li>
      <li>where to place bike racks of ride sharing stations,</li>
      <li>senior and handicapped accessibility;</li>
    </ul>
  </li>
  <li>answering immediate questions relevant to an individual (e.g., in an app):
    <ul>
      <li>is a building accessible,</li>
      <li>is there a place to park (or will there likely be at some other time),</li>
      <li>is a neighborhood safe (or is it becoming less or more safe),</li>
      <li>is a neighborhood affordable (or is it becoming less or more affordable),</li>
      <li>are healthcare services nearby.</li>
    </ul>
  </li>
  <li>integrating public transportation data with other data (e.g., Waze):
    <ul>
      <li>why are buses on a certain route late,</li>
      <li>performance and problems during unexpected events (e.g., snow).</li>
    </ul>
  </li>
</ul>
The above list is far from complete, and we will update it as the course progresses. Students are encouraged to discuss project ideas with faculty and one another.</div></div>
</div>
<a id="2"></a>
<div class="section"><hr /><h2 class="linked heading"><span class="link-title">[<a href="#2">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">2.</span> Modeling Data and Data Transformations</div></h2>
<div class="text top">To study rigorously the ways data can behave and interact within an infrastructure that generates, transforms, and consumes data (e.g., to make decisions), it is necessary to define formally what data and data transformations are. One traditional, widely used, and widely accepted model for data is the <i>relational model</i>: any data set is a relation (i.e., a subset of a product of sets), and transformations on data are functions between relations. While the relational model is sufficient to define any transformation on data sets, the MapReduce paradigm is one modern framework for defining transformations between data sets.</div><div class="paragraph">
In modern contexts and paradigms, these models can be useful when studying relatively small collections of individual, curated data sets that do not change dramatically in the short term. However, these alone are not sufficient in a context in which data sets are overwhelmingly multitudinous, varying in structure, and continuously being generated, integrated, and transformed. One complementary discipline that can provide useful tools for dealing with numerous interdependent data sets is that of <i>data provenance</i> or <i>data lineage</i>. The <i>provenance</i> of a data set (or subset) is a formal record of its origin, which can include how the data was generated, from what data or other sources it was created or derived, what process was used or was responsible for creating or deriving it, and other such information. This information can be invaluable for a variety of reasons beyond the obvious ones (i.e., the origin of the data), such as:
<ul>
  <li>the same data be generated again or reproduced if an error occurs,</li>
  <li>a data set can be updated from the original source if the source has been updated or changed,</li>
  <li>the source of an inconsistency or aberration of data can be investigated,</li>
  <li>any of the above could be applied to a subset because recomputing or investigating the entire data set would be prohibitively costly.</li>
</ul>
    </div>
  <a id="2.1"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#2.1">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">2.1.</span> Relational Data and the MapReduce Paradigm</div></h3>
<div class="text top">The relational model for data can be expressed in a variety of ways: a data set is a relation on sets, a logical predicate governing terms, a collection of tuples or records with fields and values, a table of rows with labelled columns, and so on. Mathematically, they are all equivalent. In this course, we will adopt a particular model because it is well-suited for the tools and paradigms we will employ, and because it allows for one fairly clean mathematical integration of the study of relational data and data provenance.</div>
<a id="e601deb568ed46a1a1d741907a6dcfa9"></a><div class="linked block"><div class="link-block">[<a href="#e601deb568ed46a1a1d741907a6dcfa9">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition:</span> 
        
<div class="text">A <i>data set</i> (also known as a <i>store</i> or <i>database</i>) is a multiset <i>R</i>: a collection (possibly with duplicates) of tuples of the form (<i>x</i><sub>1</sub>,...,<i>x</i><sub><i>n</i></sub>) taken from the set product <i>X</i><sub>1</sub> &#215; ... &#215; <i>X</i><sub><i>n</i></sub>. Typically, some distinguished set (e.g., the left-most in the set product) will be a set of <i>keys</i>, so that every tuple contains a key. Whether a set is a key or not often depends on the particular paradigm and context, however.</div>
      </div></div></div>
<a id="a123deb568ed46a1a1d436907a6dcfa9"></a><div class="linked block"><div class="link-block">[<a href="#a123deb568ed46a1a1d436907a6dcfa9">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition:</span> 
        
<div class="text">A <i>data transformation</i> <i>T</i>: <i>A</i> <span style="font-size:12px;">&#8594;</span> <i>B</i> is a mapping from one space of data sets <i>A</i> to another space of data sets <i>B</i>. Notice that an individual data set <i>S</i> (i.e., a relation or, equivalently, a set of tuples) is just an <i>element</i> of <i>A</i>.</div>
      </div></div></div>
<div class="text top">Some of the typical building blocks for data transformations in the relational model are:
<ul>
  <li>union and difference (intersection can also be defined in terms of these),</li>
  <li>projection (sometimes generalized into extended projection),</li>
  <li>selection (filtering),</li>
  <li>renaming,</li>
  <li>Cartesian product,</li>
  <li>variants of join operations (many can be constructed using the above, but other variants have been added as extensions),</li>
  <li>aggregation (an extension).</li>
</ul>
One common operation on relations that is not possible to express in traditional formulations but found in some relational database systems is the transitive closure of a data set. Normally, this requires an iterative process consisting of a sequence of join operations.</div>
<a id="9da373c4cc654556bf2fa3fed6d56995"></a><div class="linked block"><div class="link-block">[<a href="#9da373c4cc654556bf2fa3fed6d56995">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">We can model and illustrate the transformations that constitute the MapReduce paradigm using Python. Note that selection and projection can be implemented directly using Python comprehensions, but we define wrappers below for purposes of illustration.</div>
        
<div class="code"><div class="source"><pre><code class="py">
def union(R, S):
    return R + S

def difference(R, S):
    return [t for t in R if t not in S]

def intersect(R, S):
    return [t for t in R if t in S]

def project(R, p):
    return [p(t) for t in R]

def select(R, s):
    return [t for t in R if s(t)]
 
def product(R, S):
    return [(t,u) for t in R for u in S]

def aggregate(R, f):
    keys = {r[0] for r in R}
    return [(key, f([v for (k,v) in R if k == key])) for key in keys]
        </code></pre></div></div>
      </div></div></div>
<a id="ebd9fe9c61014bc9a2d743e069dc9d44"></a><div class="linked block"><div class="link-block">[<a href="#ebd9fe9c61014bc9a2d743e069dc9d44">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">We consider a few simple examples that illustrate how transformations can be constructed within the relational model. We start by showing how <code>select</code> can be used with a predicate to filter a data set.</div>
        
<div class="code"><div class="source"><pre><code class="py">
>>> def red(t): return t == 'tomato'
>>> select(['banana', 'tomato'], red)
['tomato']
        </code></pre></div></div>
        
<div class="text">Suppose we have two data sets and want to join them on a common field. The below sequence illustrates how that can be accomplished by building up the necessary expression out of simple parts.</div>
        
<div class="code"><div class="source"><pre><code class="py">
>>> X = [('Alice', 22), ('Bob', 19)]
>>> Y = [('Alice', 'F'), ('Bob', 'M')]

>>> product(X,Y)
[(('Alice', 'F'), ('Alice', 22)), (('Alice', 'F'), ('Bob', 19)), (('Bob', 'M'), ('Alice', 22)), (('Bob', 'M'), ('Bob', 19))]

>>> select(product(X,Y), lambda t: t[0][0] == t[1][0])
[(('Alice', 'F'), ('Alice', 22)), (('Bob', 'M'), ('Bob', 19))]

>>> project(select(product(X,Y), lambda t: t[0][0] == t[1][0]), lambda t: (t[0][0], t[0][1], t[1][1]))
[('Alice', 'F', 22), ('Bob', 'M', 19)]
        </code></pre></div></div>
        
<div class="text">Finally, the sequence below illustrates how we can compute an aggregate value for each unique key in a data set (such as computing the total age by gender).</div>
        
<div class="code"><div class="source"><pre><code class="py">
>>> X = [('Alice', 'F', 22), ('Bob', 'M', 19), ('Carl', 'M', 25), ('Eve', 'F', 27)]

>>> project(X, lambda t: (t[1], t[2]))
[('F', 22), ('M', 19), ('M', 25), ('F', 27)]

>>> aggregate(project(X, lambda t: (t[1], t[2])), sum)
[('F', 49), ('M', 44)]
        </code></pre></div></div>
        
<div class="text">The following sequence explains what is happening inside the <code>aggregate</code> function for a particular key.</div>
        
<div class="code"><div class="source"><pre><code class="py">
>>> Y = project(X, lambda t: (t[1], t[2]))
>>> keys = {t[0] for t in Y}
>>> keys
{'F', 'M'}
>>> [v for (k,v) in Y if k == 'F']
[22, 27]
>>> sum([v for (k,v) in Y if k == 'F'])
49
>>> ('F', sum([v for (k,v) in Y if k == 'F']))
('F', 49)
        </code></pre></div></div>
      </div></div></div>
<a id="ffbbae67647e4daf838a79fb814e733a"></a><div class="linked block"><div class="link-block">[<a href="#ffbbae67647e4daf838a79fb814e733a">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Consider the following line of Python code that operates on a data set <code>D</code> containing some voting results broken down by voter, state where the voter participated in voting, and the candidate they chose:</div>
        
<div class="code"><div class="source"><pre><code class="py">
R = sum([1 for (person, state, candidate) in D if state == "Massachusetts" and candidate == "Trump"])
        </code></pre></div></div>
        
<div class="text">We can identify which building blocks for transformations available in the relational model are being used in the above code; we can also draw a corresponding flow diagram that shows how they fit together.</div>
        
<div class="text"><div class="pql" style="border:0px solid #000000; height:80px; display:inline-block; width:100%;">
!table([
  ['r:select``D', 'r:project``...', 'r:agg sum``...', 'R']
])
</div></div>
        
<div class="text">We can also rewrite the above using the <a href="#9da373c4cc654556bf2fa3fed6d56995">building blocks for transformations</a> in the relational model.</div>
        
<div class="code"><div class="source"><pre><code class="py">
R = aggregate(project(select(D, lambda psc: if psc[1] == "Massachusetts" and psc[2] == "Trump"), lambda psc: 1), sum)
        </code></pre></div></div>
        
<div class="text">Notice that the order of the operations is important. If we tried to do the projection before the selection, the information that we are using to perform the selection would be gone after the projection. On the other hand, two selection (one immediately followed by another) can be done in any order. This is one reason why it makes sense to call the paradigm a relational <i>algebra</i>: there are algebraic laws that govern how the operations can be arranged.</div>
      </div></div></div><div class="paragraph">
In the MapReduce paradigm, a smaller set of building blocks inspired by the functional programming paradigm (supported by languages such as ML and Haskell) exist for defining transformations between data sets. Beyond adapting (with some modification) the map and reduce (a.k.a., "fold") functions from functional programming, the contribution of the MapReduce paradigm is the improvement in the performance of these operations on very large distributed data sets. Because of the elegance of the small set of building blocks, and because of the scalability advantages under appropriate circumstances, it is worth studying the paradigm's two building blocks for data transformations: <i>map</i> and <i>reduce</i>:
<ul>
  <li>a map operation will apply some user-specified computation to every tuple in the data set, producing one or more new tuples,</li>
  <li>a reduce operation will apply some user-specified aggregation computation to every set of tuples having the same key, producing a single result.</li>
</ul>
Notice that there is little restriction on the user-specified code other than the requirement that it be stateless in the sense that communication and coordination between parallel executions of the code is impossible. It is possible to express all the building blocks of the relational model using the building blocks of the MapReduce paradigm.
      </div>
<a id="ebd9fe9c61014bc9a2d743e069dc9d5b"></a><div class="linked block"><div class="link-block">[<a href="#ebd9fe9c61014bc9a2d743e069dc9d5b">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">We can model and illustrate the two basic transformations that constitute the MapReduce paradigm in a concise way using Python.</div>
        
<div class="code"><div class="source"><pre><code class="py">
def map(f, R):
    return [t for (k,v) in R for t in f(k,v)]
    
def reduce(f, R):
    keys = {k for (k,v) in R}
    return [f(k1, [v for (k2,v) in R if k1 == k2]) for k1 in keys]
        </code></pre></div></div>
        
<div class="text">A <code>map</code> operation applies some function <code>f</code> to every key-value tuple and produces zero or more new key-value tuples. A <code>reduce</code> operation collects all values under the same key and performs an aggregate operation <code>f</code> on those values. Notice that the operation can be applied to any subset of the tuples in any order, so it is often necessary to use an operation that is associated and commutative.</div>
      </div></div></div><div class="paragraph">
At the language design level, the relational model and the MapReduce paradigm are arguably complementary and simply represent different trade-offs; they can be used in conjunction on data sets represented as relations. Likewise, implementations of the two represent different performance trade-offs. Still, contexts in which they are used can also differ due to historical reasons or due to conventions and community standards.
      </div>
<a id="ebd9fe9c61014bc9a2d743e069dc9d5a"></a><div class="linked block"><div class="link-block">[<a href="#ebd9fe9c61014bc9a2d743e069dc9d5a">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">We can use the Python implementations of the map and reduce operations in an <a href="#ebd9fe9c61014bc9a2d743e069dc9d5b">example above</a> to implement some common transformations in the relational model. For this example, we assume that the first field in each tuple in the data set is a unique key (in general, we assume there is a unique key field if we are working with the MapReduce paradigm). We illustrate how projection and aggregation can be implemented in the code below.</div>
        
<div class="code"><div class="source"><pre><code class="py">
R = [('Alice', ('F', 23)), ('Bob', ('M', 19)), ('Carl', ('M', 22))]

# Projection keeps only gender and age.
# The original key (the name) is discarded.
X = map(lambda k,v: [(v[0], v[1])], R)

# Aggregation by the new key (i.e., gender).
Y = reduce(lambda k,vs: (k, sum(vs)), X) 
        </code></pre></div></div>
        
<div class="text">Selection is not straightforward to implement because this is not a common use case for MapReduce workflows. However, it is possible by treating each tuple in the input data set as its own key and then keeping on the keys at the end.</div>
        
<div class="code"><div class="source"><pre><code class="py">
R = [('Alice', 23), ('Bob', 19), ('Carl', 22)]

X = map(lambda k,v: [((k,v), (k,v))] if v > 20 else [], R) # Selection.
Y = reduce(lambda k,vs: k, X) # Keep same tuples (use tuples as unique keys).
        </code></pre></div></div>
        
<div class="text">We can also perform a simple join operation, although we also need to "combine" the two collections of data. This particular join operation is also simple because each tuple in the input is only joined with one other tuple. Join operations were also not originally envisioned as a common use case for MapReduce workflows.</div>
        
<div class="code"><div class="source"><pre><code class="py">
R = [('Alice', 23), ('Bob', 19), ('Carl', 22)]
S = [('Alice', 'F'), ('Bob', 'M'), ('Carl', 'M')]

X =   map(lambda k,v: [(k, ('Age', v))], R)\
    + map(lambda k,v: [(k, ('Gender', v))], S)
Y = reduce(\
        lambda k,vs:\
          (k,(vs[0][1], vs[1][1]) if vs[0][0] == 'Age' else (vs[1][1],vs[0][1])),\
        X\
      )
        </code></pre></div></div>
      </div></div></div>
<a id="fa12ae67647e4daf838a79fb814e733b"></a><div class="linked block"><div class="link-block">[<a href="#fa12ae67647e4daf838a79fb814e733b">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Suppose you have a data set containing tuples of the form (<i>name</i>, <i>gender</i>, <i>age</i>). You want to produce a result of the form (<i>gender</i>, <i>total</i>) where <i>total</i> is the sum of the age values of all tuples with the corresponding <i>gender</i>. The code below illustrates using Python how this can be done in the MapReduce paradigm.</div>
        
<div class="code"><div class="source"><pre><code class="py">
INPUT = [('Alice', ('F', 19)),\
         ('Bob', ('M', 23)),\
         ('Carl', ('M', 20)),\
         ('Eve', ('F', 27))]
TEMP = map(lambda k,v: [(v[0], v[1])], INPUT)
OUTPUT = reduce(lambda k,vs: (k, sum(vs)), TEMP)
        </code></pre></div></div>
        
<div class="text">We provide an equivalent MapReduce paradigm implementation of the algorithm using MongoDB.</div>
        
<div class="code"><div class="source"><pre><code class="js">
db.INPUT.insert({_id:"Alice", gender:"F", age:19});
db.INPUT.insert({_id:"Bob", gender:"M", age:23});
db.INPUT.insert({_id:"Carl", gender:"M", age:20});
db.INPUT.insert({_id:"Eve", gender:"F", age:27});

db.INPUT.mapReduce(
  function() {
    emit(this.gender, {age:this.age});
  },
  function(k, vs) {
    var total = 0;
    for (var i = 0; i < vs.length; i++)
        total += vs[i].age;
    return {age:total};
  },
  {out: "OUTPUT"}
);
        </code></pre></div></div>
      </div></div></div>
<a id="ffbbae67647e4daf838a79fb814e733b"></a><div class="linked block"><div class="link-block">[<a href="#ffbbae67647e4daf838a79fb814e733b">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Suppose we have a data set containing tuples of the form (<i>name</i>, <i>city</i>, <i>age</i>). We want to produce a result of the form (<i>city</i>, <i>range</i>) where <i>range</i> is defined as the difference between the oldest and youngest person in each city.
<ul>
  <li>We can construct a sequence of transformations that will yield this result using the basic building blocks available in the relational model (projections, selections, aggregations, and so on).</li>
  <li>Alternatively, we can use the MapReduce paradigm to construct a single-pass (one map operation and one reduce operation) MapReduce computation that computes the desired result. <b>Hint:</b> emit two copies of each entry (one for the computation of the maximum and one for the computation of the minimum).</li>
</ul></div>
        <div class="paragraph">
One approach in the relational paradigm is to aggregate the minimum and maximum age for each city, negate the minimum ages, and aggregate once more to get the ranges.
        </div>
        
<div class="code"><div class="source"><pre><code class="py">
NCA = [('Alice', 'Boston', 23), ('Bob', 'Boston', 19), ('Carl', 'Seattle', 25)]
CA = [(c,a) for (n,c,a) in NCA]
MIN = aggregate(CA, min)
MIN_NEG = [(c,-1*a) for (c,a) in MIN]
MAX = aggregate(CA, max)
RESULT = aggregate(union(MIN_NEG, MAX), sum)
        </code></pre></div></div>
        
<div class="text">Below is a flow diagram that represents the above transformations.</div>
        
<div class="text"><div class="pql" style="border:0px solid #000000; height:280px; display:inline-block; width:100%;">
!table([
  [null , null, 'r:proj``MIN', 'rd:union``MIN_NEG', null , null],
  ['r:proj``NCA', 'ru:aggmin`rd:agg max``CA', null , null     , 'r:agg sum``...', 'RESULT'],
  [null , null, 'rru:union``MAX', null     , null , null]
])
</div></div>
        
<div class="text">Using the MapReduce paradigm, we may prefer to follow the paradigm's two-stage breakdown of a computation by first converting every single entry in the input data set into an <i>approximation</i> of the range. For example, given only the record <code>('Alice', ('Boston', 23))</code>, in the mapping stage we might estimate the range as <code>('Boston', (23, 23, 0))</code> where the second and third entries are the minimum and maximum "known so far" given the limited information (a single data point). Then, in the reduce stage, we would combine these estimates.</div>
        
<div class="code"><div class="source"><pre><code class="py">
NCA = [('Alice', ('Boston', 23)), ('Bob', ('Boston', 19)), ('Carl', ('Seattle', 25))]
I = map(lambda k, v: [(v[0], (v[1], v[1], 0))], NCA)

def reducer(k, vs):
    age_lo = min([l for (l,h,r) in vs])
    age_hi = max([h for (l,h,r) in vs])
    age_ran = age_hi - age_lo 
    return (k, (age_lo, age_hi, age_ran))
RESULT = reduce(reducer, I)
        </code></pre></div></div>
      </div></div></div>
<a id="cbb3ae67647e4daf838a79fb914e114a"></a><div class="linked block"><div class="link-block">[<a href="#cbb3ae67647e4daf838a79fb914e114a">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Consider the following algorithm implemented as a collection of transformations drawn from the relational paradigm. The input data set consists of tuples of the form (<i>intersection</i>, <i>date</i>, <i>time</i>, <i>cars</i>). Each tuple represents a single accident took place at the specified intersection, occurred on the specified date and time, and involved the specified number of cars.</div>
        
<div class="code"><div class="source"><pre><code class="py">
D = [('Commonwealth Ave./Mass Ave.', '2016-11-02', '11:34:00', 3), ...]
M = project(D, lambda idtc: [(idtc[0], idtc[3])])
R = aggregate(M, max)
H = [(i,d,t) for ((i,m), (j,d,t,c)) in product(R, D) if i==j and m==c]
        </code></pre></div></div>
        
<div class="text">A data flow diagram representing the transformations is provided below.</div>
        
<div class="text"><div class="pql" style="border:0px solid #000000; height:200px; display:inline-block; width:100%;">
!table([
  ['d:prod`r:proj``D', 'r:agg max``M', 'lld:prod``R'],
  ['r:select``...', 'r:proj``...', 'H'] 
])
</div></div>
        
<div class="text">Complete the following tasks. To simplify the exercise, you may assume that there is at most one accident involving each possible quantity of cars (e.g., there is only one accident that involved five cars).
<ul>
  <li>Write a description of what the algorithm computes. What does the output data set represent?</li>
  <li>Provide an implementation of this algorithm using the MapReduce paradigm. You only need one map and one reduce operation.</li>
</ul></div>
        
<div class="button"><button class="solution_toggle">show solution</button></div><div class="solution_container" style="display:none;"><div class="solution">
          
<div class="text">The output represents the dates and times of the largest accidents for each intersection. Note that the transformation first computes the number of cars involved in the largest accident at each intersection (this is the data set <code>R</code>), and then <i>joins</i> that result with the original data to extract the exact dates and times of the accidents that had that number of cars. As an example, we provide an implementation of this algorithms using the MapReduce feature provided by MongoDB. We assume that each tuple is represented using a JSON document of the form <code>{'intersection':..., 'date':..., 'time':..., 'cars':...}</code>.</div>
          
<div class="code"><div class="source"><pre><code class="js">
db.D.mapReduce(
  function() {
    emit(this.intersection, {'date':this.date, 'time':this.time, 'cars':this.cars});
  },
  function(intersection, dtcs) {
    var index_of_largest = 0;
    dtcs.forEach(function(dtc, i) {
      if (dtc.cars > dtcs[index_of_largest].cars)
        index_of_largest = i;
    });
    // The intersection will already be the key.
    return {'date':dtcs[index_of_largest].date, 'time':dtcs[index_of_largest].time};
  },
  { out: "R" }
);
          </code></pre></div></div>
        </div></div><div class="solution_spacer"></div>
      </div></div></div></div>
  <a id="2.2"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#2.2">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">2.2.</span> Composing Transformations into Algorithms</div></h3>
<div class="text top">Whether we are using the relation model or the MapReduce paradigm, the available building blocks can be used to assemble fairly complex transformations on data sets. Each transformation can be written either using the concrete syntax of a particular programming language that implements the paradigm, or as a data flow diagram that describes how starting and intermediate data sets are combined to derive new data sets over the course of the algorithm's operation.</div>
<a id="cba5543907854ed28dbd3eeb874ebd54"></a><div class="linked block"><div class="link-block">[<a href="#cba5543907854ed28dbd3eeb874ebd54">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">We can use building blocks drawn from the relational model (defined for Python in <a href="#9da373c4cc654556bf2fa3fed6d56995">an example above</a>) to construct an implementation of the <i>k</i>-means clustering algorithm.</div>
        
<div class="code"><div class="source"><pre><code class="py">
def dist(p, q):
    (x1,y1) = p
    (x2,y2) = q
    return (x1-x2)**2 + (y1-y2)**2

def plus(args):
    p = [0,0]
    for (x,y) in args:
        p[0] += x
        p[1] += y
    return tuple(p)

def scale(p, c):
    (x,y) = p
    return (x/c, y/c)

M = [(13,1), (2,12)]
P = [(1,2),(4,5),(1,3),(10,12),(13,14),(13,9),(11,11)]

OLD = []
while OLD != M:
    OLD = M

    MPD = [(m, p, dist(m,p)) for (m, p) in product(M, P)]
    PDs = [(p, dist(m,p)) for (m, p, d) in MPD]
    PD = aggregate(PDs, min)
    MP = [(m, p) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2]
    MT = aggregate(MP, plus)

    M1 = [(m, 1) for (m, _) in MP]
    MC = aggregate(M1, sum)

    M = [scale(t,c) for ((m,t),(m2,c)) in product(MT, MC) if m == m2]
    print(sorted(M))
        </code></pre></div></div>
        
<div class="text">Below is a flow diagram describing the overall organization of the computation (nodes are data sets and edges are transformations). Note that the nodes labeled "..." are intermediate results that are implicit in the comprehension notation. For example, <code>[(m, p) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2]</code> first filters <code>product(MPD, PD)</code> using a selection criteria <code>if p==p2 and d==d2</code> and then performs a projection from tuples of the form <code>((m,p,d), (p2,d2))</code> to tuples of the form <code>(m, p)</code> to obtain the result.</div>
        
<div class="text"><div class="pql" style="border:0px solid #000000; height:400px; display:inline-block; width:100%;">
!table([
  ['rd:prod``P', null, null, 'd:agg min``PDs'],
  [null, 'r:proj``...', 'dr:prod`ru:proj``MPD', 'd:prod``PD'],
  ['ru:prod``M', null,  null,     'd:selection``...'],
  ['u:proj``...',    'ld:prod``MC',  'l:agg sum``M1',     'd:proj``...'],
  ['u:selection``...',    'l:prod``MT',  null,     'lu:proj`ll:agg plus``MP']
])
</div></div>
      </div></div></div>
<a id="bca743938aa04d9ea43464f941bd70bc"></a><div class="linked block"><div class="link-block">[<a href="#bca743938aa04d9ea43464f941bd70bc">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">We can use building blocks drawn from the MapReduce paradigm that are available in MongoDB to construct an implementation of the <i>k</i>-means clustering algorithm. This implementation illustrates many of the idiosyncrasies of the MapReduce abstraction made available by MongoDB.</div>
        
<div class="code"><div class="source"><pre><code class="js">
db.system.js.save({ _id:"dist" , value:function(u, v) {
  return Math.pow(u.x - v.x, 2) + Math.pow(u.y - v.y, 2);
}});

function flatten(A) {
  db[A].find().forEach(function(a) { db[A].update({_id: a._id}, a.value); });
}

function prod(A, B, AB) {
  db[AB].remove({});
  db.createCollection(AB);
  db[A].find().forEach(function(a) {
    db[B].find().forEach(function(b) {
      db[AB].insert({left:a, right:b});
    });
  });
}

function union(A, B, AB) {
  db[AB].remove({});
  db.createCollection(AB);
  db[A].find().forEach(function(a) {
    db[AB].insert(a);
  });
  db[B].find().forEach(function(b) {
    db[AB].insert(b);
  });
}

function hash_means(M, HASH) {
  db[M].mapReduce(
    function() { emit("hash", {hash: this.x + this.y}); },
    function(k, vs) {
      var hash = 0;
      vs.forEach(function(v) {
        hash += v.hash;
      });
      return {hash: hash};
    },
    {out: HASH}
  );
}

// We'll only perform a single product operation. Using map-reduce, we can perform
// argmax and argmin more easily. We can also use map-reduce to compare progress.

db.M.remove({});
db.M.insert([{x:13,y:1},{x:2,y:12}]);
db.P.remove({});
db.P.insert([{x:1,y:2},{x:4,y:5},{x:1,y:3},{x:10,y:12},{x:13,y:14},{x:13,y:9},{x:11,y:11}]);

var iterations = 0;
do {
  // Compute an initial hash of the means in order to have a baseline
  // against which to compare when deciding whether to loop again.
  hash_means("M", "HASHOLD");

  prod("M", "P", "MP");
  // At this point, entries in MP have the form
  // {_id:..., left:{x:13,y:1}, right:{x:4,y:5}}.

  // For each point, find the distance to the closest mean. The output after
  // flattening has entries of the form {_id:{x:?, y:?}, m:{x:?, y:?}, d:?}
  // where the identifier is the point.
  db.MPs.remove({});
  db.MP.mapReduce(
    function() {
      var point = {x:this.right.x, y:this.right.y};
      var mean = {x:this.left.x, y:this.left.y};
      emit(point, {m:mean, d:dist(point, mean)});
    },
    function(point, vs) {
      // Each entry in vs is of the form {m:{x:?, y:?}, d:?}.
      // We return the one that is closest to point.
      var j = 0;
      vs.forEach(function(v, i) {
        if (v.d < vs[j].d)
          j = i;
      });
      return vs[j]; // Has form {m:{x:?, y:?}, d:?}.
    },
    {out: "MPs"}
  );
  // At this point, entries in MPs have the form
  // {_id:{x:2, y:3}, value:{m:{x:4, y:5}, d:1}}.

  flatten("MPs");
  // At this point, entries in MPs have the form
  // {_id:{x:2, y:3}, m:{x:4, y:5}, d:1}.

  // For each mean (i.e., key), compute the average of all the points that were
  // "assigned" to that mean (because it was the closest mean to that point).
  db.MPs.mapReduce(
    function() {
      // The key is the mean and the value is the point together with its counter.
      var point = this._id;
      var point_with_count = {x:point.x, y:point.y, c:1};
      var mean = this.m;
      emit(mean, point_with_count);
    },
    function(key, vs) {
      // Remember that the reduce operations will be applied to the values for each key
      // in some arbitrary order, so our aggregation operation must be commutative (in
      // this case, it is vector addition).
      var x = 0, y = 0, c = 0;
      vs.forEach(function(v, i) {
        x += v.x;
        y += v.y;
        c += v.c;
      });
      return {x:x, y:y, c:c};
    },
    { finalize: function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; },
      out: "M"
    }
  );
  // At this point, entries in M have the form
  // {_id:{x:2, y:3}, value:{x:4, y:5}}.

  flatten("M");
  // At this point, entries in MPs have the form
  // {_id:{x:2, y:3}, x:4, y:5}. The identifier
  // value does not matter as long as it is unique.

  // Compute the hash of the new set of means.
  hash_means("M", "HASHNEW");

  // Extract the two hashes in order to compare them in the loop condition.
  var hashold = db.HASHOLD.find({}).limit(1).toArray()[0].value.hash;
  var hashnew = db.HASHNEW.find({}).limit(1).toArray()[0].value.hash;
  print(hashold);
  print(hashnew);
  print(iterations);
  iterations++;
} while (hashold != hashnew);
        </code></pre></div></div>
      </div></div></div>
<a id="bfa741938aa04d9ea43464f951bd72bc"></a><div class="linked block"><div class="link-block">[<a href="#bfa741938aa04d9ea43464f951bd72bc">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">If we are certain that our <i>k</i>-means algorithm will have a relatively small (e.g., constant) number of means, we can take advantage of this by only tracking the means in a local variable and using <code>.updateMany()</code> to distribute the means to all the points at the beginning of each iteration. This leads to a much more concise (and, for a small number of means, efficient) implementation of the algorithm than what is presented in a <a href="#bca743938aa04d9ea43464f941bd70bc">previous example</a>. In particular, it is no longer necessary to encode a production operation within MongoDB.</div>
        
<div class="code"><div class="source"><pre><code class="js">
db.system.js.save({ _id:"dist" , value:function(u, v) {
  return Math.pow(u.x - v.x, 2) + Math.pow(u.y - v.y, 2);
}});

db.P.insert([{x:1,y:2},{x:4,y:5},{x:1,y:3},{x:10,y:12},{x:13,y:14},{x:13,y:9},{x:11,y:11}]);

var means = [{x:13,y:1}, {x:2,y:12}];
do {
  db.P.updateMany({}, {$set: {means: means}}); // Add a field to every object.
  db.P.mapReduce(
    function() {
      var closest = this.means[0];
      for (var i = 0; i < this.means.length; i++)
        if (dist(this.means[i], this) < dist(closest, this))
          closest = this.means[i];
      emit(closest, {x:this.x, y:this.y, c:1});
    },
    function(key, vs) {
      var x = 0, y = 0, c = 0;
      vs.forEach(function(v, i) {
        x += v.x;
        y += v.y;
        c += v.c;
      });
      return {x:x, y:y, c:c};
    },
    { finalize: function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; },
      out: "M"
    }
  );
  means = db.M.find().toArray().map(function(r) { return {x:r.value.x, y:r.value.y}; });
  printjson(means);
} while (true);
        </code></pre></div></div>
        
<div class="text">We do not deal with the issue of convergence in the above example; an equality function on JSON/BSON objects (i.e., the list of means) would need to be defined to implement the loop termination condition. Below, we illustrate how the above implementation can be written within Python using PyMongo.</div>
        
<div class="code"><div class="source"><pre><code class="py">
import pymongo
import bson.code

client = pymongo.MongoClient()
db = client.local

db.system.js.save({'_id':'dist', 'value': bson.code.Code("""
    function(u, v) {
        return Math.pow(u.x - v.x, 2) + Math.pow(u.y - v.y, 2);
    }
    """)})

db.P.insert_many([{'x':1,'y':2},{'x':4,'y':5},{'x':1,'y':3},{'x':10,'y':12},\
                  {'x':13,'y':14},{'x':13,'y':9},{'x':11,'y':11}])

means = [{'x':13,'y':1}, {'x':2,'y':12}]

while True:
    db.P.update_many({}, {'$set': {'means': means}})

    mapper = bson.code.Code("""
        function() {
            var closest = this.means[0];
            for (var i = 0; i < this.means.length; i++)
                if (dist(this.means[i], this) < dist(closest, this))
                    closest = this.means[i];
            emit(closest, {x:this.x, y:this.y, c:1});
        }
        """)
    reducer = bson.code.Code("""
        function(key, vs) {
            var x = 0, y = 0, c = 0;
            vs.forEach(function(v, i) {
                x += v.x;
                y += v.y;
                c += v.c;
            });
            return {x:x, y:y, c:c};
        }
        """)
    finalizer = bson.code.Code("""
        function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; }
        """)
    db.P.map_reduce(mapper, reducer, "M", finalize = finalizer)

    means = [{'x':t['value']['x'], 'y':t['value']['y']} for t in db.M.find()]
    print(means)
        </code></pre></div></div>
      </div></div></div>
<a id="7eee633a65814aacb951b667e38092ec"></a><div class="linked block"><div class="link-block">[<a href="#7eee633a65814aacb951b667e38092ec">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">We can also use building blocks drawn from the relational model (defined for Python in <a href="#9da373c4cc654556bf2fa3fed6d56995">an example above</a>) to construct an implementation of the Floyd-Warshall shortest path algorithm.</div>
        
<div class="code"><div class="source"><pre><code class="py">
N = ['a','b','c','d','e','f']
E = [('a','b'),('b','c'),('a','c'),('c','d'),('d','e'),('e','f'),('b','f')]

oo = float('inf') # This represents infinite distance.

P = product(N,N)
I = [((x,y),oo if x != y else 0) for (x,y) in P] # Zero distance to self, infinite distance to others.
D = [((x,y),1) for (x,y) in E] # Edge-connected nodes are one apart.

OUTPUT = aggregate(union(I,D), min)
STEP = []
while sorted(STEP) != sorted(OUTPUT):
    STEP = OUTPUT
    P = product(STEP, STEP) # All pairs of edges.
    NEW = union(STEP,[((x,v),k+m) for (((x,y),k),((u,v),m)) in P if u == y]) # Add distances of connected edge pairs.
    OUTPUT = aggregate(NEW, min) # Keep only shortest node-node distance entries.

SHORTEST = OUTPUT
        </code></pre></div></div>
      </div></div></div></div>
  <a id="2.3"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#2.3">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">2.3.</span> Data Provenance</div></h3>
<div class="text top"><i>Data provenance</i> is an overloaded term that refers, in various contexts and communities, to the source, origin, or lifecycle of a particular unit of data (which could be an individual data point, a subset of a data set, or an entire data set). In this course, we will use the term primarily to refer to dependency relationships between data sets (or relationships between individual entries in those data sets) that may be derived from one another (usually over time). As an area of study within computer science, data provenance (also called <i>data lineage</i>) is arguably still being delineated and developed by the community. However, some community standards for general-purpose representations of data provenance have been established (e.g., PROV [<a href="#PROV-Primer">#</a>]).</div><div class="paragraph">
While the research literature explores various ways of categorizing approaches to data provenance, there are two dimensions that can be used to classify provenance techniques (surveyed in more detail in the literature [<a href="#ilprints918">#</a>]):
<ul>
  <li>from <i>where</i> the data was generated (e.g., what data sets or individual data entries) and <i>how</i> it was generated (e.g., what algorithms were used);</li>
  <li>whether the lineage is tracked at a fine granularity (e.g., per data entry such as a row in a data set) or a coarse granularity (e.g., per data set).</li>
</ul>
      </div>
<a id="5a30782285aa11e4af63feff819cdc9f"></a><div class="linked block"><div class="link-block">[<a href="#5a30782285aa11e4af63feff819cdc9f">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="table other">
<table class="fig_table">
 <tr>
  <td></td>
  <td><b>coarse</b></td>
  <td><b>fine</b></td>
 </tr>
 <tr> 
  <td><b>how</b></td>
  <td>transformation</td>
  <td>execution path through transformation algorithm</td>
 </tr>
 <tr> 
  <td><b>from where</b></td>
  <td>source data sets</td>
  <td>specific entries in source data sets</td>
 </tr>
</table>
      </div></div></div><div class="paragraph">
When data lineage is being tracked at a fine granularity, there are at least two approaches one can use to determine provenance of a single entry within a data set produced using a specific transformation. One approach is to track the provenance of every entry within the transformation itself (sometimes called <i>tracing</i>); another approach is to determine the provenance of an entry after the transformation has already been performed (e.g., by another party and/or at some point in the past). The latter scenario is more likely if storage of per-entry provenance meta-data is an issue, or if the transformations are black boxes that cannot be inspected, modified, or extended before they are applied to input data sets.
      </div><div class="paragraph">
For a transformation that may combine input data set entries in some way, a large number of entries in the input data set can influence an individual entry in the output data set. For such transformations, finding the per-entry provenance for an entry in the output data set can be non-trivial. Without additional information about the transformation, the conservative assumption to make is that all entries in the input data set may contribute to every entry in the output data set.
      </div><div class="paragraph">
In some cases, we may know more about a transformation. Perhaps its definition is broken down into <a href="#2.1">building blocks found in the relational model</a> (e.g., selections and projections), or it can be defined using map and reduce operations in the MapReduce paradigm. In such cases, studying the data lineage of individual data items it may be possible to derive standard techniques [<a href="#Cui:2003:LTG:775452.775456">#</a>] for determining data lineage given the components that make up the transformation.
      </div>
<a id="f3951db0b6c94dd4b409e9ebb28bd2cd"></a><div class="linked block"><div class="link-block">[<a href="#f3951db0b6c94dd4b409e9ebb28bd2cd">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact:</span> 
        
<div class="text">For relational transformations that simply add or remove data set entries without changing their internal structure (i.e., schema) or data content (i.e., values), computing the per-entry provenance is relatively straightforward. In particular: for union, difference, intersection, selection, and product transformations, the per-entry provenance describing which input data set entries influenced an individual entry in the output data set (i.e., the fine-grained "where" provenance of an individual entry in the output) can be determined using a linear search through the input data set (or sets) for that entry.</div>
      </div></div></div>
<a id="a1151db0b6c94dd4b409e9ebb28bd2cd"></a><div class="linked block"><div class="link-block">[<a href="#a1151db0b6c94dd4b409e9ebb28bd2cd">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact:</span> 
        
<div class="text">For projection transformations, the per-entry provenance describing which input data set entries influenced an individual entry in the output data set can be determined by applying the transformation to each entry in the input data set. Whenever this yields an output equivalent to the target, we know that the provenance for that output entry includes that input entry.</div>
      </div></div></div><div class="paragraph">
The above facts imply that for several relational transformation building blocks, per-entry provenance of output data set entries can be reconstructed efficiently. Note that by induction this also implies that any transformation composed of these building blocks is also amenable to efficient reconstruction of the provenance of any output data set entry.
      </div><div class="paragraph">
If a transformation might <i>combine</i> subsets of the input entries to compute individual output entries (but we have no additional information about the transformation), then with regard to per-entry provenance a worst case scenario may apply.
      </div>
<a id="f3951db0b6c94dd4b409e9ebb28bd2ca"></a><div class="linked block"><div class="link-block">[<a href="#f3951db0b6c94dd4b409e9ebb28bd2ca">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact:</span> 
        
<div class="text">Suppose we know nothing about the internal structure of a transformation, but we want to determine what entries in the input data set influence a particular entry in the output data set. In this case, even running the entire transformation a second time (knowing the target entry) may provide no additional information about which subset of input data set entries produced that output data set entry. In this worst-case scenario, it would be necessary to run the transformation on all 2<sup><i>n</i></sup> subcollections of the input data set (for an input data set of size <i>n</i>), and to check for each one of those 2<sup><i>n</i></sup> outputs whether the particular entry of interest from the output data set was generated.</div>
      </div></div></div><div class="paragraph">
However, it is possible that a transformation that combines input entries into output entries falls into one of a small number of categories that make it possible to compute per-entry provenance more efficiently.
      </div>
<a id="f3951db0b6c94dd4b409e9ebb28bd2cc"></a><div class="linked block"><div class="link-block">[<a href="#f3951db0b6c94dd4b409e9ebb28bd2cc">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact:</span> 
        
<div class="text">A transformation is <i>context-free</i> if any two entries in the input data set that contribute to the same output entry always do so (regardless of what other input entries are present in the input data set). If a transformation is context-free, then the provenance information for a given output data set entry can be computed in quadratic time. First, we can partition the input data set using the entries of the output data set (i.e., create one bucket of input data set entries for each output data set entry). Then, we can determine which partition contributed to the output data set entry of interest using linear search.</div>
      </div></div></div>
<a id="f3951db0b6c94dd4b409e9ebb28bd2cb"></a><div class="linked block"><div class="link-block">[<a href="#f3951db0b6c94dd4b409e9ebb28bd2cb">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact:</span> 
        
<div class="text">A transformation is <i>key-preserving</i> if any two entries under the same keys in the input data set always contribute to an entry with the same output key in the output data set. If a transformation is key-preserving, then the provenance information for a given entry is easy to determine by building up a map from input data set keys to output data set keys, and then performing a linear search over the input data set.</div>
      </div></div></div>
<a id="f3951db0b6c23dd4b409e9ebb28bd2fd"></a><div class="linked block"><div class="link-block">[<a href="#f3951db0b6c23dd4b409e9ebb28bd2fd">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Any <i>relational aggregation</i> (that is, a key-based aggregation operation as defined in the <a href="#2.1">relational model</a> that produces an output data set in which all keys are unique) is always key-preserving and context-free.</div>
      </div></div></div><div class="paragraph">
The above example completes our understanding of the complexity of reconstructing per-entry provenance for all the <a href="#9da373c4cc654556bf2fa3fed6d56995">building blocks in the relational model</a>.
      </div>
<a id="5a307822b5aa11e4af63feff819cdc9a"></a><div class="linked block"><div class="link-block">[<a href="#5a307822b5aa11e4af63feff819cdc9a">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="table other">
<table class="fig_table">
 <tr>
  <td><b>transformation <i>f</i></b></td>
  <td><b>per-entry provenance<br/>reconstruction approach</b></td>
  <td><b>complexity for input data<br/>set with <i>n</i> entries</b></td>
 </tr>
 <tr> 
  <td>union<br/>intersection<br/>difference<br/>selection<br/>product</td>
  <td>linear search over input data set<br/>to find entry</td>
  <td>O(<i>n</i>) entry equality checks</td>
 </tr>
 <tr> 
  <td>projection</td>
  <td>application of transformation once<br/>to each input data set entry</td>
  <td>O(<i>n</i>) executions of <i>f</i> and<br/>O(<i>n</i>) entry equality checks</td>
 </tr>
 <tr> 
  <td>aggregation</td>
  <td>application of transformation once<br/>to each input data set entry and<br/>construction of key-to-key map</td>
  <td>O(<i>n</i>) executions of <i>f</i> and<br/>O(<i>n</i> log <i>n</i>) to build and use<br/>key-to-key map</td>
 </tr>
</table>
      </div></div></div>
<a id="f3951db0b5c23dd4b109e9ebb28bd2ac"></a><div class="linked block"><div class="link-block">[<a href="#f3951db0b5c23dd4b109e9ebb28bd2ac">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">A transformation <i>f</i> that takes a data set of points as its input, runs the <i>k</i>-means algorithm on those points, and returns a data set of means is <i>not</i> context-free. To see why, consider its behavior on following inputs for <i>k</i> = 2:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;"> <i>f</i>([(0,0), (2,2)]) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> [(0,0), (2,2)] </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
 <i>f</i>([(0,0), (2,2), (100,100)]) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> [(1,1), (100,100)]
</td></tr></table></td></tr></table>
Notice that in the first case, the input entries (0,0) and (2,2) each produce their own output entry. However, the introduction of a new point (100,100) results in (0,0) and (2,2) both contributing to the same output entry (1,1).</div>
      </div></div></div>
<a id="f3951db0b5c23dd4b109e9ebb28bd2ab"></a><div class="linked block"><div class="link-block">[<a href="#f3951db0b5c23dd4b109e9ebb28bd2ab">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">A transformation can be context-free but not key-preserving. For example, suppose a transformation aggregates some vectors by key but then discards the key via a projection:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;"> [(<i>j</i>, (0,2)), (<i>j</i>, (2,0)), (<i>k</i>, (0,3)), (<i>k</i>, (3,0))] <td></tr></table></td><td style="text-align:center;"> &#x21A6; </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> [(<i>j</i>, (2,2)), (<i>k</i>, (3,3))] <td></tr></table></td><td style="text-align:center;"> &#x21A6; </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> [(2,2), (3,3)]
</td></tr></table></td></tr></table>
The above is context-free because there is no way that other input entries can have any influence over the way existing entries aggregate by key. However, they key <i>j</i> might map to any numeric value key in the output (depending on what the input entries are):
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;"> [(<i>j</i>, (0,0))] <td></tr></table></td><td style="text-align:center;"> &#x21A6; </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> [(0,0)] </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
 [(<i>j</i>, (0,0)), (<i>j</i>, (1,1))] <td></tr></table></td><td style="text-align:center;"> &#x21A6; </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> [(1,1)]
</td></tr></table></td></tr></table></div>
      </div></div></div>
<a id="f3951db0b1c94dd4bbbbe9ebb28bd2cb"></a><div class="linked block"><div class="link-block">[<a href="#f3951db0b1c94dd4bbbbe9ebb28bd2cb">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="theorem true"><span class="block_label">Theorem:</span> 
        
<div class="text">Any key-preserving transformation is context-free. To see this, suppose that there is some transformation <i>f</i> that is key-preserving but not context-free. This would imply that there is some set of inputs (<i>i</i>, <i>a</i>), (<i>j</i>, <i>b</i>), and (<i>k</i>, <i>c</i>) for which the definition of context-free is not satisfied, such as a case in which two input entries each map to their own distinct output entries when on their own but map to the same output entry when the third input entry (<i>k</i>, <i>c</i>) is introduced:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;"> <i>f</i>([(<i>i</i>, <i>a</i>), (<i>j</i>, <i>b</i>)]) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> [(<i>l</i>, <i>d</i>)] </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
 <i>f</i>([(<i>i</i>, <i>a</i>), (<i>j</i>, <i>b</i>), (<i>k</i>, <i>c</i>)]) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> [(<i>l</i>, <i>d</i>), (<i>m</i>, <i>e</i>), (<i>n</i>, <i>f</i>)]
</td></tr></table></td></tr></table>
However, the above would imply that under different conditions, the keys <i>i</i> and <i>j</i> might either both map to a key <i>l</i> or might map to two different keys <i>l</i> and <i>m</i>. But this would imply that entries with the key <i>j</i> map to entries with either key <i>l</i> or key <i>m</i>. This is not key-preserving and contradicts our assumptions, so our premise must have been impossible.</div>
      </div></div></div>
<div class="text top">The diagram below illustrates the relationships between the properties discussed above. Notice that whether a transformation is context-free and key-preserving is used to determine whether a transformation that combines input entries might have properties that allow us to compute per-entry provenance more efficiently. There is no need to check whether other transformations (such as projections and products) have these properties because per-entry provenance is <a href="#f3951db0b6c94dd4b409e9ebb28bd2cd">already efficiently computable in those cases</a>.</div>
<a id="123bf7b584394a8bb3a62e9be3fae8dc"></a><div class="linked block"><div class="link-block">[<a href="#123bf7b584394a8bb3a62e9be3fae8dc">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="diagram other">
<table class="container">
  <tr>
    <td class="box" style="background-color:lightgrey;">
      data set transformations
      <table class="container">
        <tr>
          <td class="box" style="background-color:powderblue;">
            transformations that combine multiple<br/>input entries into output entries
            <br/><span style="font-weight:normal; font-style:italic;">(example: <i>k</i>-means)</span>
            <table class="container">
              <tr>
                <td class="box" style="background-color:lightgreen;">
                  context-free transformations
                  <br/><span style="font-weight:normal; font-style:italic;">(example: <a href="#f3951db0b5c23dd4b109e9ebb28bd2ab">aggregation by key that drops the key</a>)</span>
                  <table class="container">
                    <tr>
                      <td class="box" style="background-color:lightyellow; padding-bottom:4px;">
                        key-preserving transformations
                        <br/><span style="font-weight:normal; font-style:italic;">(example: relational aggregation by key)</span>
                      </td>
                    </tr>
                  </table>
                </td>
              </tr>
            </table>
          </td>
        </tr>
        <tr>
          <td class="box" style="background-color:powderblue;">
            other transformations
            <br/><span style="font-weight:normal; font-style:italic;">(examples: selections, projections)</span>
          </td>
        </tr>
      </table>
    </td>
  </tr>
</table>
      </div></div></div><div class="paragraph">
We provide explicit algorithms for computing per-entry provenance in three of the cases outlined above.
      </div>
<a id="f3951db0b6c94dd4b409e9ebb28bddaa"></a><div class="linked block"><div class="link-block">[<a href="#f3951db0b6c94dd4b409e9ebb28bddaa">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">If we introduce a function for computing powersets of entries in a data set (e.g., using a <a href="https://docs.python.org/3/library/itertools.html#itertools-recipes">Python recipe</a> for such a function), we can define an inefficient Python algorithm that can correctly determine per-entry provenance in the <a href="#f3951db0b6c94dd4b409e9ebb28bd2ca">general case</a> (i.e., not knowing any additional information about a transformation <code>f</code>).</div>
        
<div class="code"><div class="source"><pre><code class="py">
from itertools import combinations, chain

def powerset(iterable):
    "powerset([1,2,3]) -> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

def general_prov(f, X, y):
    for Y in reversed(powerset(X)): # From largest to smallest subset.
        if list(f(list(Y))) == [y]:
            return Y
        </code></pre></div></div>
      </div></div></div>
<a id="f3951db0b6c94dd4b409e9ebb28bd216"></a><div class="linked block"><div class="link-block">[<a href="#f3951db0b6c94dd4b409e9ebb28bd216">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">The function below determines the per-entry provenance for a <a href="#f3951db0b6c94dd4b409e9ebb28bd2cc">context-free transformation</a>. For this example, we assume that all data set entries are of the form (<i>key</i>, <i>value</i>).</div>
        
<div class="code"><div class="source"><pre><code class="py">
def context_free_prov(f, X, y):

    # Build the partitions of the input data set.
    partitions = []
    for (x_key, x_val) in X:
        found = False
        for i in range(0,len(partitions)):
            if len(f(partitions[i] + [(x_key, x_val)])) == 1:
                partitions[i].append((x_key, x_val))
                found = True
                break
        # Create a new partition if adding to any other
        # partition increases the size of the output data set.
        if found == False:
            partitions.append([(x_key, x_val)])

    # Find the corresponding partition.
    for partition in partitions:
        if y in f(partition):
             return partition
        </code></pre></div></div>
      </div></div></div>
<a id="f3951db0b6c94dd4b409e9ebb28bd11b"></a><div class="linked block"><div class="link-block">[<a href="#f3951db0b6c94dd4b409e9ebb28bd11b">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">The function below determines the per-entry provenance for a <a href="#f3951db0b6c94dd4b409e9ebb28bd2cb">key-preserving transformation</a>. For this example, we assume that all data set entries are of the form (<i>key</i>, <i>value</i>).</div>
        
<div class="code"><div class="source"><pre><code class="py">
def key_preserving_prov(f, X, y):
    keymap = {}
    
    # Build up the map from input data set
    # keys to output data set keys.
    for (x_key, x_val) in X:
        (y_key, y_val) = f([(x_key, x_val)])[0]
        keymap[x_key] = y_key
    (y_key, y_val) = y

    # Collect all the tuples that contribute
    # to the target result.
    pre_image = set()
    for (x_key, x_val) in X:
        if keymap[x_key] == y_key:
             pre_image.add((x_key, x_val))
    return pre_image
        </code></pre></div></div>
      </div></div></div>
<a id="ffbbae67647e4daf838a79fb814e733d"></a><div class="linked block"><div class="link-block">[<a href="#ffbbae67647e4daf838a79fb814e733d">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="exercise task_required"><span class="block_label">Exercise:</span> 
        
<div class="text">For each of the following informal descriptions of data transformations, determine whether computing the per-entry provenance of a row in the output data set is easy (e.g., linear, O(<i>n</i> log <i>n</i>), or quadratic time) or difficult (i.e., cannot be done in polynomial time), assuming that the transformation's implementation is a black box (except for the provided description).
<ul>
  <li>Given a data set containing entries of the form (<i>name</i>, <i>age</i>), the transformation produces a list of all entries corresponding to individuals who are of a legal voting age.</li>
  <li>Given a data set containing entries of the form (<i>product</i>, (<i>department</i>, <i>price</i>, <i>date</i>)) describing sales that take place in a department store, the transformation computes the total revenue for each department to produce entries of the form (<i>department</i>, <i>revenue</i>). Assume that each product appears in exactly one department.</li>
  <li>Given a data set containing entries of the form (<i>product</i>, (<i>department</i>, <i>price</i>, <i>date</i>)) describing sales that take place in a department store, the transformation computes the total revenue from each department for each possible date to produce entries of the form ((<i>date</i>, <i>department</i>), <i>revenue</i>).</li>
  <li>Given a data set specifying the geographical locations of a collection of radio receivers in a bounded region, determine an optimal placement of 100 radio towers within that region that minimizes the aggregate of the distances between each tower-receiver pair.</li>
</ul></div>
        
<div class="button"><button class="solution_toggle">show solution</button></div><div class="solution_container" style="display:none;"><div class="solution">
The complexities are as follows.
<ul>
  <li>Selections simply produce a subset of the original input data set, and per-entry provenance can be computed in linear time for such transformations using a linear search through the input data set. It is not even necessary to compute the original transformation because tuples can be compared directly using equality.</li>
  <li>We can compute the total revenue for each <i>department</i> using a relational aggregation by key, and all such aggregation transformations are key-preserving. There exists an <a href="#f3951db0b6c94dd4b409e9ebb28bd11b">efficient algorithm</a> to compute per-entry provenance for such transformations.</li>
  <li>Since a product may be sold on different days, this transformation is not key-preserving. However, since it can be built up using a projection followed by a relational aggregation, it is context-free. Thus, per-entry provenance can be computed efficiently.</li>
  <li>Computing an optimal placement for the radio towers would require the <i>k</i>-means algorithm, which is <a href="#f3951db0b5c23dd4b109e9ebb28bd2ab">not context-free</a>. Thus, per-entry provenance cannot be computed efficiently (i.e., it would require exponential time).</li>
</ul>
        </div></div><div class="solution_spacer"></div>
      </div></div></div>
<a id="a4bbae67647e4daf838a79fb814e733d"></a><div class="linked block"><div class="link-block">[<a href="#a4bbae67647e4daf838a79fb814e733d">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="exercise task_required"><span class="block_label">Exercise:</span> 
        
<div class="text">Given a data set containing entries of the form (<i>restaurant</i>, <i>popularity</i>, <i>location</i>) that describe restaurants within the various neighborhoods in a city, the transformation uses an unknown machine learning algorithm to group the restaurants by similarity of cuisine into categories (each restaurant is assigned to exactly one cuisine category) and produces a data set with entries of the form (<i>cuisine</i>, <i>count</i>) specifying the number of restaurants that specialize in each type of cuisine. If you have no additional information, how difficult will it be to determine the provenance of an individual entry in the output?</div>
        
<div class="button"><button class="solution_toggle">show solution</button></div><div class="solution_container" style="display:none;"><div class="solution">
Since this transformation combines multiple input entries to produce its output entries and we have no additional information about its machine learning algorithm (for instance, it might not be context free), we must assume the worst-case difficulty: exponential time.
        </div></div><div class="solution_spacer"></div>
      </div></div></div><div class="paragraph">
Tracking provenance at a coarse granularity (per data set) is relatively inexpensive in most applications (since the number of data sets is usually orders of magnitude less than the number of individual data set entries), and standards such as PROV provide authors of transformations a way to represent provenance information.
      </div>
<a id="832a44dd32a640bb981e5fc8902cd348"></a><div class="linked block"><div class="link-block">[<a href="#832a44dd32a640bb981e5fc8902cd348">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Suppose Alice and Bob want to assemble a script that retrieves some information from an online resource and stores it in a database. We can use the <a href="https://pypi.python.org/pypi/prov"><code>prov</code></a> Python package to programmatically assemble a course granularity provenance record that conforms to the PROV standard and describes a particular execution of this script.</div>
        <div class="paragraph">
We first create the document object and define the namespaces we will use for symbols in the document.
        </div>
        
<div class="code"><div class="source"><pre><code class="py">
doc = prov.model.ProvDocument()
doc.add_namespace('alg', 'http://datamechanics.io/algorithm/alice_bob/') # The scripts in <folder>/<filename> format.
doc.add_namespace('dat', 'http://datamechanics.io/data/alice_bob/') # The data sets in <user>/<collection> format.
doc.add_namespace('ont', 'http://datamechanics.io/ontology#')
doc.add_namespace('log', 'http://datamechanics.io/log#') # The event log.
doc.add_namespace('bdp', 'https://data.cityofboston.gov/resource/')
        </code></pre></div></div>
        
<div class="text">We then use an <i>agent</i> to represent the script, en <i>entity</i> to represent a resource, and an <i>activity</i> to repsent this invocation of the script. We also <i>associate</i> this current activity with the script, and we and indicate that his activity <i>used</i> the entity representing the resource.</div>
        
<div class="code"><div class="source"><pre><code class="py">
this_script = doc.agent('alg:example', {prov.model.PROV_TYPE:prov.model.PROV['SoftwareAgent'], 'ont:Extension':'py'})
resource = doc.entity('bdp:wc8w-nujj', 
    {'prov:label':'311, Service Requests', 
    prov.model.PROV_TYPE:'ont:DataResource', 'ont:Extension':'json'}
  )
this_run = doc.activity(
    'log:a'+str(uuid.uuid4()), startTime, endTime, 
    {prov.model.PROV_TYPE:'ont:Retrieval', 'ont:Query':'?type=Animal+Found&$select=type,latitude,longitude,OPEN_DT'}
  )
doc.wasAssociatedWith(this_run, this_script)
doc.used(this_run, resource, startTime)
        </code></pre></div></div>
        
<div class="text">Finally, we define an entity for the data obtained and indicate to what agent it was attributed, by what actvitity it was generated, and from what entity it was derived.</div>
        
<div class="code"><div class="source"><pre><code class="py">
found = doc.entity('dat:found', {prov.model.PROV_LABEL:'Animals Found', prov.model.PROV_TYPE:'ont:DataSet'})
doc.wasAttributedTo(found, this_script)
doc.wasGeneratedBy(found, this_run, endTime)
doc.wasDerivedFrom(found, resource, this_run, this_run, this_run)
        </code></pre></div></div>
        
<div class="text">We can view our completed provenance record in, for example, JSON form using <code>doc.serialize()</code>.</div>
      </div></div></div></div>
</div>
<a id="3"></a>
<div class="section"><hr /><h2 class="linked heading"><span class="link-title">[<a href="#3">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">3.</span> Systems, Models, and Algorithms</div></h2>
<div class="text top">When using mathematics to solve problems involving real-world entities and phenomenon, we can introduce the notions of a <i>system</i>  (i.e., an organized collection of interdependent components and their possible states) that has distinct <i>system states</i> (which can capture spatial and temporal information about the system components, their behaviors, and their relationships). A mathematical <i>model</i> captures only some parts of the system and the possible system states, abstracting the other details away, and it does so using some particular symbolic language or representation.</div><div class="paragraph">
Four common abstract mathematical objects used to represent models of systems are particularly relevant given our application domain of interest:
<ul>
  <li>graphs consisting of a collection of nodes and edges (possibly weighted and directed) between those nodes,</li>
  <li>vector spaces consisting of sets of vectors,</li>
  <li>systems of equations and inequalities (or, equivalently, sets of <i>constraints</i> or set of logical formulas), and</li>
  <li>functions (whether continuous or discrete).</li>
</ul>
In many cases, these mathematical objects can be used to represent each other (e.g., a graph can be represented as a function or as a system of equations), but each has its own characteristics that make it well-suited for modeling certain kinds of systems. Very often, two or more of these are combined when modeling a system.
    </div><div class="paragraph">
Most data sets describing a real-world system are using one or more of the above languages for modeling that system. Identifying which of these is being used can inform our decisions about what techniques we can use to answer questions about that system or to solve problems involving that system. It can also tell us how we can convert to a different language or representation in order to convert, combine, or derive new data sets.
    </div>
  <a id="3.1"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#3.1">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">3.1.</span> Systems, Models, and Metrics</div></h3>
<div class="text top">We can provide abstract definitions for systems, models, and metrics. These then allow us to characterize satisfaction, search, and optimization problems in a general and consistent way.</div>
<a id="f3951db0b6c94dd4b409e9ebb28bd2aa"></a><div class="linked block"><div class="link-block">[<a href="#f3951db0b6c94dd4b409e9ebb28bd2aa">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition:</span> 
        
<div class="text">A <i>system</i> is a set <i>S</i> of distinct <i>system states</i> (possibly infinite). Often, each possible system state <i>s</i> <span style="font-size:12px;">&#8712;</span> <i>S</i> is called a <i>model</i> of the system.</div>
      </div></div></div>
<div class="text top">One mathematical object that is commonly used to represent systems is a vector space where individual vectors are the system states (i.e., models) of that system.</div>
<a id="f3951db0b6c94dd4b409e9ebb28bd2ab"></a><div class="linked block"><div class="link-block">[<a href="#f3951db0b6c94dd4b409e9ebb28bd2ab">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition:</span> 
        
<div class="text">A <i>constraint set</i> is a set of logical formulas (written using a formal language that may contain terms such as integers, vector, matrices, sets, and so on), usually used to identify a subset of a system (i.e., a subset of the possible systems states or models). A system state or model <i>satisfies</i> a constraint or constraint set if the logical formulas are true for that model. The subset of all states that satisfy the constraints is sometimes called the <i>feasible region</i>.</div>
      </div></div></div>
<div class="text top">When using vector spaces, it is often possible to represent constraints using a matrix equation, e.g.:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;"> <i>M</i> &sdot; <i>x</i> <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> <i>b</i>
</td></tr></table></td></tr></table>
In the above, <i>M</i> might be a matrix from &#8477;<sup><i>n</i> &#215; <i>m</i></sup>, <i>x</i> might be a variable vector in &#8477;<sup><i>m</i></sup>, and <i>b</i> might be a vector in &#8477;<sup><i>n</i></sup>.</div>
<a id="4cc8b03c0ccd4318adb78496bcc730e3"></a><div class="linked block"><div class="link-block">[<a href="#4cc8b03c0ccd4318adb78496bcc730e3">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition:</span> 
        
<div class="text">Given a system <i>S</i> and a set of constraints, a <i>constraint satisfaction problem</i> is the problem of finding a model (or set of models) in <i>S</i> that satisfies that set of constraints.</div>
      </div></div></div>
<a id="4cc8b03c0ccd4318adb78496bcc730e2"></a><div class="linked block"><div class="link-block">[<a href="#4cc8b03c0ccd4318adb78496bcc730e2">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition:</span> 
        
<div class="text">A <i>metric</i> over a system <i>S</i> is a function <i>f</i>: <i>S</i> <span style="font-size:12px;">&#8594;</span> &#8477; that maps system states to real numbers.</div>
      </div></div></div>
<a id="f5cea5a026964477a864a079fefbf6e7"></a><div class="linked block"><div class="link-block">[<a href="#f5cea5a026964477a864a079fefbf6e7">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition:</span> 
        
<div class="text">Given a system <i>S</i> and metric <i>f</i>, an <i>optimization problem</i> is the problem of finding <i>s</i> <span style="font-size:12px;">&#8712;</span> <i>S</i> that maximizes or minimizes <i>f</i>:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;"> argmax<sub><i>s</i> <span style="font-size:12px;">&#8712;</span> <i>S</i></sub> &nbsp;<i>f</i>(<i>s</i>) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
 argmin<sub><i>s</i> <span style="font-size:12px;">&#8712;</span> <i>S</i></sub> &nbsp;<i>f</i>(<i>s</i>)
</td></tr></table></td></tr></table></div>
      </div></div></div>
<a id="abc51db0b6c94dd4b409e9ebb28bddaa"></a><div class="linked block"><div class="link-block">[<a href="#abc51db0b6c94dd4b409e9ebb28bddaa">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Computing the argmax or argmin operation in the relational or MapReduce paradigms can be non-trivial, but it is still possible. For example, in the relational paradigm it is possible to compute it by combining two projections, an aggregation, and a selection.</div>
        
<div class="code"><div class="source"><pre><code class="py">
def argmax(X, f):
    Y = [f(x) for x in X] # Projection.
    y_max = aggregate(Y, max)
    XF = [(x, f(x)) for x in X] # Projection.
    xs = [x for (x,y) in XF if y == y_max] # Selection.
    return xs[0]
        </code></pre></div></div>
      </div></div></div>
<a id="abc51ab0bec94dd4b409e9ebb28bddaa"></a><div class="linked block"><div class="link-block">[<a href="#abc51ab0bec94dd4b409e9ebb28bddaa">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="exercise task_required"><span class="block_label">Exercise:</span> 
        
<div class="text">Consider a state space <i>S</i> = {0,1}<sup>4</sup> with states of the form <i>x</i> = (<i>x</i><sub>1</sub>, <i>x</i><sub>2</sub>, <i>x</i><sub>3</sub>, <i>x</i><sub>4</sub>) and the following set of constraints:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  <i>x</i><sub>1</sub> > <i>x</i><sub>2</sub> </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
  <i>x</i><sub>3</sub> &ge; <i>x</i><sub>4</sub>
</td></tr></table></td></tr></table>
<ul>
  <li>What is the size of the state space?</li>
  <li>Find a state in <i>S</i> that solves this constraint satisfaction problem.</li>
  <li>Suppose that this problem involves maximizing the objective function <i>f</i>(<i>x</i>) = <i>x</i><sub>1</sub> + <i>x</i><sub>2</sub> + <i>x</i><sub>3</sub> + <i>x</i><sub>4</sub>; what state in <i>S</i> solves this optimization problem?</li>
</ul></div>
        
<div class="button"><button class="solution_toggle">show solution</button></div><div class="solution_container" style="display:none;"><div class="solution">
          
<div class="text">In order to demonstrate how such problems can be solved programmatically, we will answer the posed questions using Python. We define the various parts of the problem below.</div>
          
<div class="code"><div class="source"><pre><code class="py">
from itertools import product
S = list(product(*[{0,1}]*4)) # State space.
R = [(x1,x2,x3,x4) for (x1,x2,x3,x4) in S if x1 > x2 and x3 >= x4] # Feasible region.
metric = lambda s: sum(s) # Metric.
o = max(R, key=metric) # State that maximizes metric.
          </code></pre></div></div>
          
<div class="text">We can now answer each question.</div>
          
<div class="code"><div class="source"><pre><code class="py">
>>> len(S) # Size of state space.
16
>>> R[0] # A state that solves the constraint satisfaction problem.
(1, 0, 0, 0)
>>> o # # The state that solves the optimization problem.
(1, 0, 1, 1)
          </code></pre></div></div>
        </div></div><div class="solution_spacer"></div>
      </div></div></div>
<a id="a2c51db0b6494dd4b409e9ebb28bddaa"></a><div class="linked block"><div class="link-block">[<a href="#a2c51db0b6494dd4b409e9ebb28bddaa">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Suppose we want to create an approximate, discrete version of the <i>k</i>-means algorithm. We can do so by treating the outcome of the <i>k</i>-means algorithm as the solution to an optimization problem. In order to make this concrete, we will work with the following set of points and Euclidean distances (as in a <a href="#cba5543907854ed28dbd3eeb874ebd54">previous example</a>).</div>
        
<div class="code"><div class="source"><pre><code class="py">
P = [(1,2),(4,5),(1,3),(10,12),(13,14),(13,9),(11,11)]

def dist(p, q):
    (x1,y1) = p
    (x2,y2) = q
    return (x1-x2)**2 + (y1-y2)**2
        </code></pre></div></div>
        
<div class="text">For <i>k</i> = 1, we can create a state space <code>S</code> that consists of all possible positions for the mean within a particular bounded region. The metric would be the sum of the distances from all the points in <code>P</code> to the mean <code>m</code>. The solution would be the one that minimizes the metric.</div>
        
<div class="code"><div class="source"><pre><code class="py">
S = list(product(range(0,20), range(0,20))) # Possible locations for one mean.

def metric(m):
    return sum([dist(m, p) for p in P])

o = min(S, key=metric)
        </code></pre></div></div>
        
<div class="text">For <i>k</i> = 2, we would need to create a state space that represents all possible pairs of means. We would also need to adjust the metric function to consider in the sum only the distances from each point to the <i>closest</i> (i.e., minimum-distance) mean to that point.</div>
        
<div class="code"><div class="source"><pre><code class="py">
V = list(product(range(0,20), range(0,20))) # Possible locations for one mean.
S = list(product(V, V)) # Possible pairs of means.

def metric(M):
    return sum([min([dist(m, p) for m in M]) for p in P])

o = min(S, key=metric)
        </code></pre></div></div>
      </div></div></div>
<a id="f5cea5a026964477a864a079fefbf6ea"></a><div class="linked block"><div class="link-block">[<a href="#f5cea5a026964477a864a079fefbf6ea">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="exercise task_required"><span class="block_label">Exercise:</span> 
        
<div class="text">Consider the following problem: given a data set describing a set of 10,000 spectators and their heights, determine if it is possible to seat them in a stadium of 10,000 seats so that no one blocks anyone else's view. Determine whether this is a constraint satisfaction problem or an optimization problem, and identify the state space, constraint set, and (if applicable) metric.</div>
        
<div class="button"><button class="solution_toggle">show solution</button></div><div class="solution_container" style="display:none;"><div class="solution">
          
<div class="text">One approach is to use a variable that ranges over the real numbers to represent the height of the person in each seat. This leads to a state space of &#8477;<sup>10,000</sup>. Any individual system state (x<sub>1</sub>, ..., x<sub>10,000</sub>) is then a vector with 10,000 entries; each entry represents the height of the person in that particular seat.</div>
          <div class="paragraph">
The constraints must encode two requirements: (1) that no two seats that are positioned with one in front of the other have a taller person in the front seat, and (2) for every possible height <i>h</i>, that the number of seats occupied by someone having height <i>h</i> matches the number of spectators of height <i>h</i>.
          </div>
          <div class="paragraph">
For the first requirement, we can introduce a constraint for every pair of such seats. It might look something like <i>x</i><sub><i>i</i></sub> + 50 &le; <i>x</i><sub><i>j</i></sub> for every pair of seats <i>x</i><sub><i>i</i></sub> and <i>x</i><sub><i>j</i></sub> where <i>x</i><sub><i>i</i></sub> is in front of <i>x</i><sub><i>j</i></sub>. Given that the layout of the stadium is roughly planar, we would expect that this leads to a number of constraints that is within a constant factor of 10,000.
          </div>
          <div class="paragraph">
For the second requirement, suppose that the number of spectators of height <i>h</i> is given by <i>N</i><sub><i>h</i></sub>. Then we would need to generate a somewhat complex constraint for every possible height <i>h</i>. We can do so using the building blocks for relational transformations: <code>[(h, N_h)] == select(aggregate([(seat_hgt, 1) for seat_hgt in x], sum), lambda t: t[0] == h)</code>. Note that while this is a fairly complex expression, it is still a logical formula that must be either true or false. In principle, we could expand it into a long expression consisting of basic logical and arithmetic operators.
          </div>
        </div></div><div class="solution_spacer"></div>
      </div></div></div>
<a id="f5cea5a026964477a864a079fefbf6cd"></a><div class="linked block"><div class="link-block">[<a href="#f5cea5a026964477a864a079fefbf6cd">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="exercise task_required"><span class="block_label">Exercise:</span> 
        
<div class="text">For each of the following scenarios, determine whether it is an instance of a constraint satisfaction problem or an instance of an optimization problem. If it is an optimization problem, identify the objective function.
<ul>
  <li>Find the lowest-cost route between two destinations on a transportation network.</li>
  <li>Given a social network where any pair of individual members can be "connected", determine if the whole network is connected (i.e., there exists a path from every member to every other member).</li>
  <li>Given a social network where any pair of individual members can be "connected", find the smallest group of individuals that must be removed in order to make the network disconnected.</li>
</ul></div>
        
<div class="button"><button class="solution_toggle">show solution</button></div><div class="solution_container" style="display:none;"><div class="solution">
          
<div class="text"><ul>
  <li>
    For the first example, the solution could be as follows.
    <ul>
      <li>
        Each possible state is a path in the graph. A path is just a subset of the edges, so the state space can be the collection of all subsets of the edges. If the set of edges is <i>E</i>, then the state space could be {0,1}<sup>|E|</sup> where each vector of the form (<i>x</i><sub>1</sub>, ..., <i>x</i><sub>|E|</sub>) where <i>x</i><sub><i>i</i></sub> <span style="font-size:12px;">&#8712;</span> {0,1} represents whether a particular edge is included in the path.
      </li>
      <li>
        Constraints would specify connectivity constraints (e.g., two edges <i>i</i> and <i>j</i> are connected then one constraint could be <i>x</i><sub><i>i</i></sub> &le; <i>x</i><sub><i>j</i></sub> to ensure that if edge <i>i</i> is included in the path then so is edge <i>j</i>).
      </li>
      <li>
        The number of edges in the path specifies the path length. Thus, the metric could be <i>f</i>(<i>x</i><sub>1</sub>, ..., <i>x</i><sub>|E|</sub>) = <i>x</i><sub>1</sub> + ... +  <i>x</i><sub>|E|</sub>. The state (i.e., path) that minimizes this metric while satisfying the constraints would then be a shortest path.
      </li>
    </ul>
  </li>
</ul></div>
        </div></div><div class="solution_spacer"></div>
      </div></div></div></div>
  <a id="3.2"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#3.2">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">3.2.</span> Linear systems, satisfiability modulo theories, and linear programming</div></h3>
<div class="text top">We often say a system or subset of a system is <i>linear</i> if it can be modeled using a Euclidean vector space and can be defined using a collection of linear constraints of the form:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;"> <i>a</i><sub>1</sub> &sdot; <i>x</i><sub>1</sub> + ... + <i>a</i><sub><i>n</i></sub> &sdot; <i>x</i><sub><i>n</i></sub> &nbsp;&#9830;&nbsp; <i>c</i>
</td></tr></table></td></tr></table>
Where (<i>x</i><sub>1</sub>, ..., <i>x</i><sub><i>n</i></sub>) <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup> is a vector, the <i>a</i><sub><i>i</i></sub> are coefficients in &#8477;, <i>c</i> in &#8477; is a constant, and &#9830; is a relation operator such as =, &le;, &lt;, &ge;, or &gt;. Note that the constraints contain no higher-order <i>x</i><sub><i>i</i></sub> terms such as <i>x</i><sub><i>i</i></sub><sup>2</sup>.</div>
<div class="text top">There are three common techniques for solving constraint satisfaction and optimization problems involving linear systems:
<ul>
  <li>finding an exact or approximate solution (e.g., using some combination of matrix inversion, QR factorization, and <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares">ordinary least squares</a>),</li>
  <li>checking if a solution exists (and finding a witness system state or model proving this), and</li>
  <li>finding a solution that minimizes or maximizes some (linear) metric over the vector space.</li>
</ul>
The third item (solving an optimization problem) is arguably a generalization of the first approach if we remove the requirement that the metric must be linear (since the first approach minimizes the Euclidean distance metric, which is quadratic and thus non-linear).</div>
<a id="0c4124b538154b5794b64fababbdd3e9"></a><div class="linked block"><div class="link-block">[<a href="#0c4124b538154b5794b64fababbdd3e9">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Suppose we want to use an SMT solver to solve a flow maximization problem for a flow graph (the variable on each represents the flow amount, and the parenthesized number is the maximum possible flow on that edge):</div>
        
<div class="text"><div class="pql" style="border:0px solid #000000; height:200px; display:inline-block; width:100%;">
!table([
  [null,                  null,                            ('dd:x4 (6)`dr:x5 (3)``&bull;'), null, null],
  [('r:x1 (?)``&bull;'), ('ur:x2 (7)`dr:x3 (8)``&bull;'), null,       ('r:x7 (5)``&bull;'), ('&bull;')],
  [null,                  null,                            ('ur:x6 (4)``&bull;'), null, null],
])
</div></div>
        
<div class="text">Here, the system is the set of possible states of the graph in terms of the amount of material flowing along each edge (we model the flow that goes along each edge in the graph as a variable ranging over &#8477;, and the flow in the overall graph containing seven edges as a vector in &#8477;<sup>7</sup>).</div>
        
<div class="code"><div class="source"><pre><code class="py">
import z3

(x1,x2,x3,x4,x5,x6,x7) = [z3.Real('x'+str(i)) for i in range(1,8)]

S = z3.Solver()

# Only allow non-negative flows.
for x in (x1,x2,x3,x4,x5,x6,x7):
    S.add(x >= 0)
    
# Edge capacity constraints.
S.add(x2 <= 7, x3 <= 8, x4 <= 6)
S.add(x5 <= 3, x6 <= 4, x7 <= 5)

# Constraints derived from graph topology.
S.add(x1 == x2+x3, x2 == x4+x5, x3+x4 == x6, x5+x6 == x7)

S.add(x1 > 0) # We want a positive flow.

print(S.check())
print(S.model())
        </code></pre></div></div>
        
<div class="text">If we are allowed to vary only the variable for incoming edge, we can use the SMT solver together with binary search to find the maximum flow through the graph.</div>
        
<div class="code"><div class="source"><pre><code class="py">
flow = 0
for i in range(5, -1, -1):
    S.push()
    S.add(x1 >= (2**i + flow))
    if str(S.check()) != 'unsat':
        flow += 2**i
    S.pop()

S.add(x1 >= flow)
print(S.model())
        </code></pre></div></div>
        
<div class="text">We could alternatively build up the constraint set from a starting matrix.</div>
        
<div class="code"><div class="source"><pre><code class="py">
def dot(xs, ys):
    return sum([x*y for (x,y) in zip(xs, ys)])

x = [x1,x2,x3,x4,x5,x6,x7]

M = [
  [ 0,-1, 0, 0, 0, 0, 0],
  [ 0, 0,-1, 0, 0, 0, 0],
  [ 0, 0, 0,-1, 0, 0, 0],
  [ 0, 0, 0, 0,-1, 0, 0],
  [ 0, 0, 0, 0, 0,-1, 0],
  [ 0, 0, 0, 0, 0, 0,-1],

  [ 1, 0, 0, 0, 0, 0, 0],
  [ 0, 1, 0, 0, 0, 0, 0],
  [ 0, 0, 1, 0, 0, 0, 0],
  [ 0, 0, 0, 1, 0, 0, 0],
  [ 0, 0, 0, 0, 1, 0, 0],
  [ 0, 0, 0, 0, 0, 1, 0],
  [ 0, 0, 0, 0, 0, 0, 1],

  [-1, 1, 1, 0, 0, 0, 0],
  [ 0,-1, 0, 1, 1, 0, 0],
  [ 0, 0,-1,-1, 0, 1, 0],
  [ 0, 0, 0, 0,-1,-1, 1]
  ]

b = [-7,-8,-6,-3,-4,-5,
      0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0]

S = z3.Solver()

for i in range(len(M)):
    S.add(b[i] <= dot(M[i], x))

print(S.check())
print(S.model())
        </code></pre></div></div>
      </div></div></div>
<a id="0c4124b538154b5794b64fababbdd3e3"></a><div class="linked block"><div class="link-block">[<a href="#0c4124b538154b5794b64fababbdd3e3">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Suppose we want to use linear programming to solve the problem described in the <a href="#0c4124b538154b5794b64fababbdd3e9">example above</a>.</div>
        
<div class="code"><div class="source"><pre><code class="py">
from scipy.optimize import linprog

M = [
  [ 1,-1,-1, 0, 0, 0, 0],
  [ 0, 1, 0,-1,-1, 0, 0],
  [ 0, 0, 1, 1, 0,-1, 0],
  [ 0, 0, 0, 0, 1, 1,-1],
  ]
b = [0, 0, 0, 0]
c = [-1, 0, 0, 0, 0, 0, 0]
bounds = [(0, None), (0, 7), (0, 8), (0, 6), (0, 3), (0, 4), (0, 5)]

result = linprog(c, A_ub = M, b_ub = b, bounds=bounds, options={"disp": True})
print(result)
        </code></pre></div></div>
        
<div class="text">Note that this technique produces a potentially different solution from the one in the <a href="#0c4124b538154b5794b64fababbdd3e9">example above</a> (though both are optimal).</div>
      </div></div></div>
<a id="27c51ab0bec94dd4b409e9ebb28bddaa"></a><div class="linked block"><div class="link-block">[<a href="#27c51ab0bec94dd4b409e9ebb28bddaa">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Suppose that you want to distribute two types of infrastructure improvements across the 20 neighborhoods of a city: road repairs and electric grid updates. Road repairs cost $100,000 per mile and electric grid updates cost $50,000 per mile. You have access to a data set <i>I</i> that contains tuples of the form (<i>neighborhood</i>, <i>road-miles</i>, <i>grid-miles</i>) describing the quantity of each type of infrastructure in each neighborhood.</div>
        <div class="paragraph">
You want to distribute the cost of improvements equitably such that every neighborhood receives exactly the same amount of funding, and every neighborhood is able to use all that funding it receives. If your overall budget is capped at $10,000,000, is this possible?
        </div>
        
<div class="text">One approach is to fix some constant <i>c</i> and create two variables <i>x</i><sub><i>i</i></sub> and <i>y</i><sub><i>i</i></sub> and three constraints for every entry (<i>i</i>, <i>r</i><sub><i>i</i></sub>, <i>g</i><sub><i>i</i></sub>) in <i>I</i>:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  <i>x</i><sub><i>i</i></sub> &sdot; 100,000 + <i>y</i><sub><i>i</i></sub> &sdot; 50,000 &ge; <i>c</i> </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
  <i>x</i><sub><i>i</i></sub> &le; <i>r</i><sub><i>i</i></sub> </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
  <i>y</i><sub><i>i</i></sub> &le; <i>g</i><sub><i>i</i></sub>
</td></tr></table></td></tr></table>
The above constraints ensure that at least <i>c</i> funding is used in the neighborhood, but also that the amount of infrastructure repairs does not exceed the potential needs of the neighborhood. Thus, the state space consists of 20 &sdot; 2 dimensions (i.e., the two vectors <i>x</i> and <i>y</i> having one component for every neighborhood).</div>
        <div class="paragraph">
Note that if we wanted to check whether it is possible to disburse <i>all</i> the available funding equitably, we could set <i>c</i> = 10,000,000/20. Alternatively, if we wanted to maximize <i>c</i>, we could try larger and larger values of <i>c</i> until it cannot be increased without causing the problem to become unsolvable. Alternatively, we could treat <i>c</i> as another variable and define the objective function that must be maximized as <i>f</i>(<i>x</i>, <i>y</i>, <i>c</i>) = <i>c</i>.
        </div>
      </div></div></div></div>
  <a id="3.3"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#3.3">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">3.3.</span> Graph and Spatial Problems as Constraint Satisfaction and Optimization Problems</div></h3>
<div class="text top">We have already seen how we can define some basic algorithms that have natural spatial interpretations (basic clustering and shortest) using data transformations assembled from building blocks drawn from the relational and the MapReduce paradigms. These algorithms employed two different representations: a vector space (clustering) and a graph (shortest paths). In this section, we review how algorithms such as these can be viewed as solutions to constraint satisfaction and/or optimization problems.</div>
<a id="22551db0b6c94dd4b409e9ebb28bddaa"></a><div class="linked block"><div class="link-block">[<a href="#22551db0b6c94dd4b409e9ebb28bddaa">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition:</span> 
        
<div class="text">Suppose we have a graph consisting of a set of nodes <i>N</i> = {<i>n</i><sub>1</sub>, <i>n</i><sub>2</sub>, ...} and a set of edges <i>E</i> of the form (<i>n</i><sub>1</sub>, <i>n</i><sub>2</sub>). The <i>minimum spanning tree</i> of this graph is the smallest subset <i>T</i> &sub; <i>E</i> such that in the subgraph consisting of the edges in <i>T</i>, every node can be reached from every other node (i.e., by traversing a path along the edges in <i>T</i>).</div>
      </div></div></div>
<a id="22551db0b6c94344b409e9ebb28bddaa"></a><div class="linked block"><div class="link-block">[<a href="#22551db0b6c94344b409e9ebb28bddaa">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">The problem of finding a minimum spanning tree can be broken down into a constraint satisfaction problem:
<ul>
  <li>the set of system states is the collection of edge subsets (i.e., all 2<sup>|<i>E</i>|</sup> possible subsets);</li>
  <li>the set of constraints is the collection of reachability requirements (one for every pair nodes stating that those two nodes must be reachable from one another) and the requirement that the subset <i>T</i> must be a tree (or, equivalent, that it must have at most |<i>N</i>| &#8722; 1 edges).</li>
</ul>
This problem can also be broken down into an optimization problem that minimizes a metric:
<ul>
  <li>the set of system states is the collection of edge subsets (i.e., all 2<sup>|<i>E</i>|</sup> possible subsets);</li>
  <li>the set of constraints is the collection of reachability requirements (one for every pair nodes stating that those two nodes must be reachable from one another);</li>
  <li>the metric is a function that computes the size of <i>T</i> (i.e., the number of edges in the spanning subtree).</li>
</ul></div>
      </div></div></div></div>
  <a id="3.4"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#3.4">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">3.4.</span> Decomposition Techniques</div></h3>
<div class="text top">General-purpose techniques for solving constraint satisfaction and optimization problems can sometimes be inefficient (potentially having a running time that is exponential in the number of state space dimensions and/or the number of constraints). This can be especially costly when state space dimensions or constraints are derived from large data sets having hundreds, thousands, or hundreds of thousands of entries.</div><div class="paragraph">
In such scenarios, it may be possible to apply general-purpose decomposition techniques for solving constraint satisfaction and optimization problems in a scalable way. Usually, these rely on certain assumptions about the properties of the constraints and/or the objective function.
      </div>
<a id="0c4124b538154b5794b64fababbdda54"></a><div class="linked block"><div class="link-block">[<a href="#0c4124b538154b5794b64fababbdda54">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition:</span> 
        
<div class="text">Suppose we have a constraint satisfaction problem with an <i>n</i>-dimensional state space <i>S</i> containing system states of the form (<i>x</i><sub>1</sub>, ...,<i>x</i><sub><i>n</i></sub>) and a set of constraints <i>C</i>. If the set of constraints <i>C</i> can be separated into two disjoint subsets such that <i>C</i> = <i>C</i><sub>1</sub> &cup; <i>C</i><sub>2</sub> where <i>C</i><sub>1</sub> and <i>C</i><sub>2</sub> share no variables, then the problem can be <i>decomposed</i> into two <i>independent subproblems</i>.</div>
      </div></div></div>
<a id="0c4124b538154b5794b64fababbdda55"></a><div class="linked block"><div class="link-block">[<a href="#0c4124b538154b5794b64fababbdda55">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact:</span> 
        
<div class="text">Suppose that a constraint satisfaction problem with state space <i>S</i> and constraint set <i>C</i> can be decomposed into two independent subproblems where <i>C</i> = <i>C</i><sub>1</sub> &cup; <i>C</i><sub>2</sub>. If <i>s</i><sub>1</sub> <span style="font-size:12px;">&#8712;</span> <i>S</i> is a solution to the subproblem corresponding to <i>C</i><sub>1</sub> and <i>s</i><sub>2</sub> <span style="font-size:12px;">&#8712;</span> <i>S</i> is a solution to the subproblem corresponding to <i>C</i><sub>2</sub>, then we can combine <i>s</i><sub>1</sub> and <i>s</i><sub>2</sub> to obtain a state that satisfies all the constraints in <i>C</i> (using a linear-time scan through the representation of <i>s</i><sub>1</sub> and <i>s</i><sub>2</sub>).</div>
      </div></div></div>
<a id="0c4124b538154b5794b64fababbdda57"></a><div class="linked block"><div class="link-block">[<a href="#0c4124b538154b5794b64fababbdda57">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition:</span> 
        
<div class="text">Suppose we have a constraint satisfaction problem with an (<i>n</i>+<i>m</i>)-dimensional state space <i>S</i> containing system states of the form (<i>x</i><sub>1</sub>, ...,<i>x</i><sub><i>n</i></sub>, <i>y</i><sub>1</sub>, ..., <i>y</i><sub><i>m</i></sub>) and a set of constraints <i>C</i> = <i>C</i><sub><b>x</b></sub> &cup; <i>C</i><sub><b>xy</b></sub> &cup; <i>C</i><sub><b>y</b></sub> where:
<ul>
  <li>all constraints in <i>C</i><sub><b>x</b></sub> only contain variables from the list <i>x</i><sub>1</sub>, ...,<i>x</i><sub><i>n</i></sub>;</li>
  <li>all constraints in <i>C</i><sub><b>y</b></sub> only contain variables from the list <i>y</i><sub>1</sub>, ...,<i>y</i><sub><i>m</i></sub>;</li>
  <li>every constraint in <i>C</i><sub><b>xy</b></sub> has at least one variable from <i>x</i><sub>1</sub>, ...,<i>x</i><sub><i>n</i></sub> and at least one from <i>y</i><sub>1</sub>, ...,<i>y</i><sub><i>n</i></sub>.</li>
</ul>
Then the constraints contained in <i>C</i><sub><b>xy</b></sub> are called the <i>complicating constraints</i> and the variables found in those constraints are called the <i>complicating variables</i>.</div>
      </div></div></div>
<a id="0c4b2425381a4b5794b64fababbdda57"></a><div class="linked block"><div class="link-block">[<a href="#0c4b2425381a4b5794b64fababbdda57">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Suppose we have a constraint satisfaction problem with states of the form (<i>x</i><sub>1</sub>, ...,<i>x</i><sub><i>n</i></sub>, <i>y</i><sub>1</sub>, ..., <i>y</i><sub><i>m</i></sub>) <span style="font-size:12px;">&#8712;</span> {0,1}<sup><i>n</i>+<i>m</i></sup> and a constraint set <i>C</i> = <i>C</i><sub><b>x</b></sub> &cup; <i>C</i><sub><b>xy</b></sub> &cup; <i>C</i><sub><b>y</b></sub>. Suppose also that <i>x</i><sub>1</sub> and <i>y</i><sub>1</sub> are the only complicating variables and <i>x</i><sub>1</sub> &ge; <i>y</i><sub>1</sub> is the only complicating constraint. Then we can consider four different subsets of the state space:
<ul>
  <li>all states in which <i>x</i><sub>1</sub> = 0 and <i>y</i><sub>1</sub> = 0;</li>
  <li>all states in which <i>x</i><sub>1</sub> = 1 and <i>y</i><sub>1</sub> = 0;</li>
  <li>all states in which <i>x</i><sub>1</sub> = 0 and <i>y</i><sub>1</sub> = 1;</li>
  <li>all states in which <i>x</i><sub>1</sub> = 1 and <i>y</i><sub>1</sub> = 1.</li>
</ul>
Note that for each of the four scenarios, we can easily extend the two subproblems corresponding to <i>C</i><sub><b>x</b></sub> and <i>C</i><sub><b>y</b></sub> with one additional constraint (such as <i>x</i><sub>1</sub> = 1 for <i>C</i><sub><b>x</b></sub> and <i>y</i><sub>1</sub> = 0 for <i>C</i><sub><b>y</b></sub>). We can then eliminate the complicating constraint (which is redundant at that point), allowing us to decompose the new problem into two independent subproblems. If we find a solution to both subproblems in any of these four cases, then we have found a solution to the overall original problem corresponding to <i>C</i>.</div>
        <div class="paragraph">
This approach can lead to significant performance gains if the algorithmic complexity of solving the constraint satisfaction problem is worse than linear. For example, suppose that there are <i>k</i> constraints and solving the problem takes exponential time in the number of constraints (such as 2<sup><i>k</i></sup>). If we can split the problem into two independent subproblems (each of size <i>k</i>/2), then we have can compute the overall running time needed to solve <i>all four</i> pairs of independent subproblems:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  4 &sdot; 2 &sdot; 2<sup><i>k</i>/2</sup> <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 2<sup><i>k</i>/2 + 3</sup>
</td></tr></table></td></tr></table>
The running time 2<sup><i>k</i>/2 + 3</sup> is substantially faster than 2<sup><i>k</i></sup> for large <i>k</i>. In fact, 2<sup><i>k</i>/2 + 3</sup> &approx; &radic;(2<sup><i>k</i></sup>).
        </div>
        <div class="paragraph">
In addition, note that the third scenario where <i>x</i><sub>1</sub> = 0 and <i>y</i><sub>1</sub> = 1 does not satisfy the complicated constraint <i>x</i><sub>1</sub> &ge; <i>y</i><sub>1</sub>, so we need not consider it at all. Thus, the running time would actually be 3 &sdot; 2 &sdot; 2<sup><i>k</i>/2</sup> if we avoid this case.
        </div>
      </div></div></div>
<a id="0c4124b538154b5794b64fababbdda89"></a><div class="linked block"><div class="link-block">[<a href="#0c4124b538154b5794b64fababbdda89">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact:</span> 
        
<div class="text">Suppose we have an optimization problem with an (<i>n</i>+<i>m</i>)-dimensional state space <i>S</i> containing system states of the form (<i>x</i><sub>1</sub>, ...,<i>x</i><sub><i>n</i></sub>, <i>y</i><sub>1</sub>, ..., <i>y</i><sub><i>m</i></sub>), a set of constraints <i>C</i> = <i>C</i><sub><b>x</b></sub> &cup; <i>C</i><sub><b>y</b></sub>, and an objective function <i>f</i>(<i>x</i><sub>1</sub>, ...,<i>x</i><sub><i>n</i></sub>, <i>y</i><sub>1</sub>, ..., <i>y</i><sub><i>m</i></sub>) = <i>g</i>(<i>x</i><sub>1</sub>, ...,<i>x</i><sub><i>n</i></sub>) + <i>h</i>(<i>y</i><sub>1</sub>, ...,<i>y</i><sub><i>m</i></sub>). Note that there are no complicating constraints or variables.</div>
        <div class="paragraph">
In this scenario, we can decompose the problem into two independent optimization problems (with objective functions <i>g</i> and <i>h</i>, respectively). We can then find a solution <i>s</i><sub>1</sub> that optimizes <i>g</i> under constraints <i>C</i><sub><b>x</b></sub> and a solution <i>s</i><sub>2</sub> that optimizes <i>h</i> under constraints <i>C</i><sub><b>y</b></sub>. Finally, we can combine these two solutions <i>s</i><sub>1</sub> and <i>s</i><sub>2</sub> into a solution for the overall problem that optimizes <i>f</i>. As in the corresponding <a href="#0c4124b538154b5794b64fababbdda55">fact</a> about constraint satisfaction problems, we can then use a linear-time scan through the representations of <i>s</i><sub>1</sub> and <i>s</i><sub>2</sub> to build this overall solution.
        </div>
      </div></div></div></div>
</div>
<a id="4"></a>
<div class="section"><hr /><h2 class="linked heading"><span class="link-title">[<a href="#4">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">4.</span> Statistical Analysis</div></h2>
<div class="text top">Given a data set some of the optimization techniques we have studied can be used to find a best-fit model for that data. For example, given a data set of points in &#8477;<sup>2</sup> representing some collection of observations measured along two dimensions, it is possible to find an  ordinary least squares "best fit" linear function (using algebra, optimization, gradient descent, and so on). However, just because we have found a "best fit" model for a data set from some given space of functions does not mean that the model is actually a good one for that particular data set. Perhaps the relationship between the two dimensions is not linear, or even completely non-existent. How can we determine if this is the case?</div><div class="paragraph">
In this section, we will present some statistical methods that can help us decide whether a linear model is actually appropriate for a given two-dimensional data set, and to quantify how appropriate it might be. We will review the mathematical foundations underlying the methods, how they can be applied, and how they can be realized as data transformations in the paradigms we have studied in earlier sections.
    </div>
  <a id="4.1"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#4.1">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">4.1.</span> Review of Facts about Projections from Linear Algebra</div></h3>
<div class="text top">We review a few useful definitions and facts about vectors in &#8477;<sup><i>n</i></sup> from linear algebra. These provide a mathematical foundation for defining the statistical constructs discussed in this section.</div>
<a id="aabb49ea963b4301995bcd19c5442807"></a><div class="linked block"><div class="link-block">[<a href="#aabb49ea963b4301995bcd19c5442807">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition:</span> 
        
<div class="text">A <i>vector</i> is an ordered, finite tuple of real numbers (<i>x</i><sub>1</sub>, ..., <i>x</i><sub><i>n</i></sub>) <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup>. Note that we will use individual variables to denote an entire vector (e.g., <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup>), and that same variable with a subscript index (e.g., <i>x</i><sub><i>i</i></sub>) to denote individual components of that vector; thus, we have that <i>x</i> = (<i>x</i><sub>1</sub>, ..., <i>x</i><sub><i>n</i></sub>).</div>
      </div></div></div>
<a id="ff3b49ea963b4301995bcd19c5442807"></a><div class="linked block"><div class="link-block">[<a href="#ff3b49ea963b4301995bcd19c5442807">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact:</span> 
        
<div class="text">For a vector <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup> and a scalar <i>s</i> <span style="font-size:12px;">&#8712;</span> &#8477;, we can scale <i>x</i> by computing <i>s</i> &sdot; <i>x</i> = (<i>s</i> &sdot; <i>x</i><sub>1</sub>, ..., <i>s</i> &sdot; <i>x</i><sub><i>n</i></sub>). Note that for a scalar <i>s</i> <span style="font-size:12px;">&#8712;</span> &#8477;, we often use the shorthand notation <i>x</i>/<i>s</i> to mean (1/<i>s</i>) &sdot; <i>x</i>.</div>
      </div></div></div>
<a id="ff3249ea963b4301995bcd19c5442807"></a><div class="linked block"><div class="link-block">[<a href="#ff3249ea963b4301995bcd19c5442807">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition:</span> 
        
<div class="text">For a vector <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup> the Euclidean length or <i>norm</i> of <i>x</i> is ||<i>x</i>|| = &radic;(<i>x</i><sub>1</sub><sup>2</sup> + ... + <i>x</i><sub><i>n</i></sub><sup>2</sup>). We can also obtain a normalized unit vector of length one that points in the same direction as <i>x</i> by computing <i>x</i>/||<i>x</i>||.</div>
      </div></div></div>
<a id="ff3249ea963b4301995bcd19c5442803"></a><div class="linked block"><div class="link-block">[<a href="#ff3249ea963b4301995bcd19c5442803">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition:</span> 
        
<div class="text">Given two vectors <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup> and <i>y</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup>, the <i>dot product</i> of the two vectors is <i>x</i> &sdot; <i>y</i> = <i>x</i><sub>1</sub> &sdot; <i>y</i><sub>1</sub> + ... + <i>x</i><sub><i>n</i></sub> &sdot; <i>y</i><sub><i>n</i></sub>.</div>
      </div></div></div>
<a id="bca249ea963b4301995bcd19c5442803"></a><div class="linked block"><div class="link-block">[<a href="#bca249ea963b4301995bcd19c5442803">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact:</span> 
        
<div class="text">We can compute the projection of a vector <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup> onto another vector <i>y</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup> by computing (<i>x</i> &sdot; (<i>y</i>/||<i>y</i>||)) &sdot; (<i>y</i>/||<i>y</i>||).</div>
      </div></div></div>
<a id="ff3249ea963b4301995bcd19c5442801"></a><div class="linked block"><div class="link-block">[<a href="#ff3249ea963b4301995bcd19c5442801">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Given two vectors (5,7) <span style="font-size:12px;">&#8712;</span> &#8477;<sup>2</sup> and (3,0) <span style="font-size:12px;">&#8712;</span> &#8477;<sup>2</sup>, the normalized version of (3,0) is (1/3) &sdot; (3,0) = (1,0). The projection of (5,7) onto (3,0) is then:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  ((5,7) &sdot; ((1/3) &sdot; (3,0))) &sdot; ((1/3) &sdot; (3,0) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> ((5,7) &sdot; (1,0)) &sdot; (1,0) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
                                  <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (5 &sdot; 1 + 7 &sdot; 0) &sdot; (1,0) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
                                  <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 5 &sdot; (1,0) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
                                  <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (5,0).
</td></tr></table></td></tr></table></div>
      </div></div></div>
<a id="ff3b49ea963b4301995bcd19c544280a"></a><div class="linked block"><div class="link-block">[<a href="#ff3b49ea963b4301995bcd19c544280a">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="theorem true_required"><span class="block_label">Theorem (Cauchy-Schwarz inequality):</span> 
        
<div class="text">Given two vectors <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup> and <i>y</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup>, the following inequality holds:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  |<i>x</i> &sdot; <i>y</i>| / (||<i>x</i>|| &sdot; ||<i>y</i>||) <td></tr></table></td><td style="text-align:center;"> &le; </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 1
</td></tr></table></td></tr></table>
Another way of writing the above is:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  |&nbsp; (<i>x</i>/||<i>x</i>||) &sdot; (<i>y</i>/||<i>y</i>||) &nbsp;|  <td></tr></table></td><td style="text-align:center;"> &le; </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 1
</td></tr></table></td></tr></table>
The intuitive interpretation is as follows: if we try to find the projection of a unit vector <i>x</i>/||<i>x</i>|| onto another unit vector <i>y</i>/||<i>y</i>||, the resulting vector has length at most 1.</div>
      </div></div></div></div>
  <a id="4.2"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#4.2">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">4.2.</span> Defining Mean and Standard Deviation using Concepts in Linear Algebra</div></h3>
<div class="text top">In this section, we will assume that we are working with a data set of the form [(<i>x</i><sub>1</sub>, <i>y</i><sub>1</sub>), ..., (<i>x</i><sub><i>n</i></sub>, <i>y</i><sub><i>n</i></sub>)] that contains <i>n</i> (not necessarily distinct) tuples and where for 1 &le; <i>i</i> &le; <i>n</i> we have (<i>x</i><sub><i>i</i></sub>, <i>y</i><sub><i>i</i></sub>) <span style="font-size:12px;">&#8712;</span> &#8477;<sup>2</sup>. For the purposes of analysis, we will also often use two projections of this data set as vectors in their own right: (<i>x</i><sub>1</sub>, ...,<i>x</i><sub><i>n</i></sub>) <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup> and (<i>y</i><sub>1</sub>, ...,<i>y</i><sub><i>n</i></sub>) <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup>. We will also use the shorthand notation <i>x</i> = (<i>x</i><sub>1</sub>, ...,<i>x</i><sub><i>n</i></sub>) and <i>y</i> = (<i>y</i><sub>1</sub>, ...,<i>y</i><sub><i>n</i></sub>).</div>
<a id="2a91b48a2e1040808d9538ff45aede2b"></a><div class="linked block"><div class="link-block">[<a href="#2a91b48a2e1040808d9538ff45aede2b">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition (arithmetic mean):</span> 
        
<div class="text">We define the arithmetic <i>mean</i> &mu;(<i>x</i>) of a one-dimensional or one-column data set <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup> where <i>x</i> = (<i>x</i><sub>1</sub>, ..., <i>x</i><sub><i>n</i></sub>) as:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  &mu;(<i>x</i>) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (1/<i>n</i>) &sdot; (<i>x</i><sub>1</sub> + ... + <i>x</i><sub><i>n</i></sub>)
</td></tr></table></td></tr></table></div>
      </div></div></div>
<a id="2a91b48a2e1040808d9538ff45aede2a"></a><div class="linked block"><div class="link-block">[<a href="#2a91b48a2e1040808d9538ff45aede2a">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact:</span> 
        
<div class="text">For <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup>, &mu;(<i>x</i>) is the quantity that minimizes the following objective function:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  &mu;(<i>x</i>) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> argmin<sub><i>a</i> <span style="font-size:12px;">&#8712;</span> &#8477;</sub> (&Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> (<i>x</i><sub><i>i</i></sub> - <i>a</i>)<sup>2</sup>)
</td></tr></table></td></tr></table>
To prove the above, assume that the true mean is some specific value <i>m</i>. Next, we choose some arbitrary <i>a</i> in the formula and proceed algebraically from there by multiplying the terms under the summation and then using the distributive property to factor out the terms without an index subscript:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  &Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> (<i>x</i><sub><i>i</i></sub> - <i>a</i>)<sup>2</sup> <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> &Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> ((<i>x</i><sub><i>i</i></sub> &#8722; <i>m</i>) + (<i>m</i> &#8722; <i>a</i>))<sup>2</sup> </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
                                                      <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> &Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> ( 
                                                            (<i>x</i><sub><i>i</i></sub> &#8722; <i>m</i>)<sup>2</sup> +
                                                            2 &sdot; ((<i>x</i><sub><i>i</i></sub> &#8722; <i>m</i>) &sdot; (<i>m</i> &#8722; <i>a</i>))
                                                            + (<i>m</i> &#8722; <i>a</i>)<sup>2</sup> )</td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
                                                      <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> &Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> (<i>x</i><sub><i>i</i></sub> &#8722; <i>m</i>)<sup>2</sup> +
                                                            2 &sdot; (<i>m</i> &#8722; <i>a</i>) &sdot; (&Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> (<i>x</i><sub><i>i</i></sub> &#8722; <i>m</i>))
                                                            + &Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> (<i>m</i> &#8722; <i>a</i>)<sup>2</sup> </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
                                                      <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> &Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> (<i>x</i><sub><i>i</i></sub> &#8722; <i>m</i>)<sup>2</sup> +
                                                            2 &sdot; (<i>m</i> &#8722; <i>a</i>) &sdot; (&Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> <i>x</i><sub><i>i</i></sub> &#8722; &Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> <i>m</i>))
                                                            + &Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> (<i>m</i> &#8722; <i>a</i>)<sup>2</sup>
</td></tr></table></td></tr></table>
In the above, notice that &Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> <i>x</i><sub><i>i</i></sub> = &Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> <i>m</i> because <i>m</i> is the true mean, so the second term is just 2 &sdot; (<i>m</i> &#8722; <i>a</i>) &sdot; 0 = 0. Thus, we can eliminate it. Furthermore, we have &Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> (<i>m</i> &#8722; <i>a</i>)<sup>2</sup> = <i>n</i> &sdot; (<i>m</i> &#8722; <i>a</i>)<sup>2</sup>. Applying all these substitutions, we have:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;"> &Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> (<i>x</i><sub><i>i</i></sub> - <i>a</i>)<sup>2</sup> <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (&Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> (<i>x</i><sub><i>i</i></sub> &#8722; <i>m</i>)<sup>2</sup> ) + <i>n</i> &sdot; (<i>m</i> &#8722; <i>a</i>)<sup>2</sup>
</td></tr></table></td></tr></table>
Notice that setting <i>a</i> = <i>m</i> minimizes the above because the first term does not contain <i>a</i> at all, and the second term becomes 0 exactly when <i>a</i> = <i>m</i>. Thus, the <i>a</i> that minimizes the objective &Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> (<i>x</i><sub><i>i</i></sub> - <i>a</i>)<sup>2</sup> is exactly the mean.</div>
      </div></div></div>
<a id="dfa1b48a2e1040808d9538ff45aede2a"></a><div class="linked block"><div class="link-block">[<a href="#dfa1b48a2e1040808d9538ff45aede2a">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition:</span> 
        
<div class="text">For a given vector <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup>, suppose we use &mu;(<i>x</i>) to create a vector with <i>n</i> entries (&mu;(<i>x</i>), &mu;(<i>x</i>), ..., &mu;(<i>x</i>)) in which every entry is &mu;(<i>x</i>). We can call this vector <i>x</i><sub>&mu;</sub>.</div>
      </div></div></div>
<a id="bd91b48a2e1040808d9538ff45aede2a"></a><div class="linked block"><div class="link-block">[<a href="#bd91b48a2e1040808d9538ff45aede2a">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact:</span> 
        
<div class="text">The vector <i>x</i><sub>&mu;</sub> minimizes a particular objective function:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  <i>x</i><sub>&mu;</sub> <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> argmin<sub><i>v</i> is on the diagonal in &#8477;<sup><i>n</i></sup></sub> ||<i>x</i> - <i>v</i>||
</td></tr></table></td></tr></table>
Another way of thinking about this is that <i>x</i><sub>&mu;</sub> as defined above is the projection of the vector <i>x</i> onto the diagonal line in &#8477;<sup><i>n</i></sup>. This means that <i>x</i><sub>&mu;</sub> is also the point on the diagonal in &#8477;<sup><i>n</i></sup> that is <i>closest</i> to <i>x</i>.</div>
      </div></div></div>
<div class="text top">Notice that the geometric, vector-based interpretation of the mean preserves an important property of the mean of a data set <i>x</i>: rearranging the elements of <i>x</i> should not change the mean. In fact, this is guaranteed because the dimensions are interchangeable: for any permuted version of <i>x</i> (let us call it <i>x</i>'), the distance of <i>x</i>' from the diagonal will always be the same as the distance of <i>x</i> from the diagonal because we can simply relabel the dimensions to bring <i>x</i>' back to its original form <i>x</i>.</div>
<a id="2a91b48aaa1040808d9538ff45aede2c"></a><div class="linked block"><div class="link-block">[<a href="#2a91b48aaa1040808d9538ff45aede2c">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Suppose that our data consists of only two points in one dimension: <i>x</i><sub>1</sub> = 10 and <i>x</i><sub>2</sub> = 2. We can compute:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  &mu;(<i>x</i>) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (1/2) &sdot; (10 + 2) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 6 </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
  <i>x</i><sub>&mu;</sub> <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (6, 6)
</td></tr></table></td></tr></table>
Notice that (6, 6) is on the diagonal in &#8477;<sup>2</sup>. Notice also that the vector <i>x</i> &#8722; <i>x</i><sub>&mu;</sub> = (10, 2) - (6, 6) = (4, -4) has a slope of &nbsp;&#8722; 1 (it is orthogonal to the diagonal). This is consistent with <i>x</i><sub>&mu;</sub> being a projection of <i>x</i> onto the diagonal.</div>
      </div></div></div>
<a id="2a91b48a2e1040808d9538ff45aede2c"></a><div class="linked block"><div class="link-block">[<a href="#2a91b48a2e1040808d9538ff45aede2c">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition (standard deviation):</span> 
        
<div class="text">We define the <i>standard deviation</i> &sigma;(<i>x</i>) of a one-dimensional or one-column data set <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup> where <i>x</i> = (<i>x</i><sub>1</sub>, ..., <i>x</i><sub><i>n</i></sub>) as:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  &sigma;(<i>x</i>) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> &radic;( &nbsp; (1/<i>n</i>) &sdot; ((<i>x</i><sub>1</sub> &#8722; &mu;(<i>x</i>))<sup>2</sup> + ... + (<i>x</i><sub><i>n</i></sub> &#8722; &mu;(<i>x</i>))<sup>2</sup>) &nbsp; )
</td></tr></table></td></tr></table></div>
      </div></div></div>
<a id="2a91b48a2e1040808d9538ff45aede2d"></a><div class="linked block"><div class="link-block">[<a href="#2a91b48a2e1040808d9538ff45aede2d">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact:</span> 
        
<div class="text">For <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup>, &sigma;(<i>x</i>) can be rewritten in terms of the norm (i.e., length) ||<i>x</i> &#8722; &mu;(<i>x</i>)|| of the vector <i>x</i> &#8722; &mu;(<i>x</i>):
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  &sigma;(<i>x</i>) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (1/&radic;(<i>n</i>)) &sdot; &radic;( &nbsp; ((<i>x</i><sub>1</sub> &#8722; &mu;(<i>x</i>))<sup>2</sup> + ... + (<i>x</i><sub><i>n</i></sub> &#8722; &mu;(<i>x</i>))<sup>2</sup>) &nbsp; ) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
             <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (1/&radic;(<i>n</i>)) &sdot; ||<i>x</i> &#8722; <i>x</i><sub>&mu;</sub>||
</td></tr></table></td></tr></table>
In other words, we can view the standard deviation of a data vector <i>x</i> as a normalized distance between <i>x</i> and the diagonal in &#8477;<sup><i>n</i></sup>.</div>
      </div></div></div>
<div class="text top">We might ask why the extra scalar term 1/&radic;(<i>n</i>) is necessary in the above. After all, the term ||<i>x</i> &#8722; <i>x</i><sub>&mu;</sub>|| already measures the distance between the vector <i>x</i> and the diagonal. Furthermore, the individual <i>x</i><sub><i>i</i></sub> terms must represent some units specific to the domain from which the data is drawn (e.g., temperature, height, velocity, and so on), while <i>n</i> and &radic;(<i>n</i>) have no units relevant to the domain of the data (they are just quantities associated with how many data points were collected).</div><div class="paragraph">
The reason is that when the number of components in the vector <i>x</i> changes, the number of dimensions in the space from which the vector is drawn also changes. But the number of dimensions influences the lengths of vectors (if the individual vector components are about the same): the more dimensions there are, the longer a vector can become because the distances from all dimensions contribute to the length of a vector.
      </div>
<a id="ff91b48a2e10408add9538ff45aede2d"></a><div class="linked block"><div class="link-block">[<a href="#ff91b48a2e10408add9538ff45aede2d">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Consider the lengths of the following vectors (each drawn from a vector space having its own distinct number of dimensions):
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  ||(1, 1)|| <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> &radic;(2) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
  ||(1, 1, 1)|| <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> &radic;(3) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
  ||(1, 1, 1, 1)|| <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> &radic;(4) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
  ||(1, 1, 1, 1, 1)|| <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> &radic;(5) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
   <td></tr></table></td><td style="text-align:center;"> &#8942; </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
  ||(1, 1, 1, 1, 1, ..., 1)|| <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> &radic;(<i>n</i>)
</td></tr></table></td></tr></table>
In general, we can see that the following holds for a vector of length <i>n</i> in which components are all the same value <i>s</i>:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  ||(<i>s</i>, <i>s</i>, <i>s</i>, ..., <i>s</i>)|| <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> &radic;(<i>n</i> &sdot; <i>s</i><sup>2</sup>) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
                            <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> <i>s</i> &sdot; &radic;(<i>n</i>)
</td></tr></table></td></tr></table></div>
      </div></div></div>
<div class="text top">The above example shows that if we want to compare the standard deviations of two distinct data sets that are of different sizes, we need to account for the fact that the vectors are drawn from different vector spaces (in terms of the number of dimensions). The scalar term 1/&radic;(<i>n</i>) in the standard deviation formula accounts for this difference and makes comparisons possible.</div><div class="paragraph">
One practical issue that can be a concern when defining statistical analysis workflows (e.g., using the relational or MapReduce paradigms) over large data sets is how optimizations based on algebraic laws may interact with the representation size of numerical values within the language, library, or framework being used.
      </div>
<a id="ff91b48a2e10408ad29538ff45aedebd"></a><div class="linked block"><div class="link-block">[<a href="#ff91b48a2e10408ad29538ff45aedebd">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Suppose that we have a large data set <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup>1000</sup>consisting of 1000 entries: <i>x</i> = (<i>x</i><sub>1</sub>, ..., <i>x</i><sub>1000</sub>). Assume that 995 of these entries are approximately 0.1 while the other five are all larger (i.e., let us assume they are all 2). Furthermore, suppose that we have a representation size for floating point numbers such that the smallest floating point number we can represent is 0.001 (anything else is cut off and thus rounded to 0).</div>
        <div class="paragraph">
If we want to compute the mean, we might decide to first scale every entry in the vector <i>x</i> by dividing it by <i>n</i> to obtain the vector (<i>x</i><sub>1</sub>/<i>n</i>, ..., <i>x</i><sub>1000</sub>/<i>n</i>). However, notice that this will lead to 0.1/1000 = 0.0001 as the true mathematical answer for most of the data. Due to rounding errors, 995 of the data points will be treated as if they were 0. Thus, we will have:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  &mu;(<i>x</i>) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (2 + 2 + 2 + 2 + 2) / 1000 </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
          <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 0.01
</td></tr></table></td></tr></table>
But notice that the true mean is quite different; the error we see above is an entire order of magnitude (and grows as the number of data points grows):
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  &mu;(<i>x</i>) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (2 + 2 + 2 + 2 + 2 + 0.1 &sdot; 995) / 1000 </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
          <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (10 + 99.5) / 1000 </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
          <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 0.1095
</td></tr></table></td></tr></table>
One intuitive response might be to avoid the optimization and perform the summation of all components of <i>x</i> before performing division by <i>n</i>. But this may again lead to an issue when there are too many data points, as the total may exceed the maximum representation size of a number!
        </div>
      </div></div></div>
<div class="text top">Issues such as the above are less common in contemporary tool chains because many libraries use representations for numerical data structures that allow arbitrary precision and size. If such a library is not available, however, we must ensure that our optimizations do not introduce such errors. In the specific scenario above, the solution might be to split the data up into smaller chunks, perform the optimization over those small chunks, and then aggregate the results of those individual computations (though this would only work for associative operations such as summation).</div></div>
  <a id="4.3"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#4.3">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">4.3.</span> Covariance and Correlation</div></h3>
<div class="text top">Given a data set that has two-dimensional entries of the form (<i>x</i><sub><i>i</i></sub>, <i>y</i><sub><i>i</i></sub>), it is possible to compare the two dimensions by comparing their standard deviation vectors.</div>
<a id="2a91b48a2e1040808d9538ff45aede2e"></a><div class="linked block"><div class="link-block">[<a href="#2a91b48a2e1040808d9538ff45aede2e">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition (covariance):</span> 
        
<div class="text">For <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup> and <i>y</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup>, we define the <i>covariance</i> between <i>x</i> and <i>y</i> as:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  cov(<i>x</i>, <i>y</i>) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (1/<i>n</i>) &sdot; ((<i>x</i><sub>1</sub> - &mu;(<i>x</i>)) &sdot; (<i>y</i><sub>1</sub> - &mu;(<i>y</i>)) + ... + (<i>x</i><sub><i>n</i></sub> - &mu;(<i>x</i>)) &sdot; (<i>y</i><sub><i>n</i></sub> - &mu;(<i>y</i>))) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
              <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (1/<i>n</i>) &sdot; (&Sigma;<sub><i>i</i>=1</sub><sup><i>n</i></sup> (<i>x</i><sub><i>i</i></sub> - &mu;(<i>x</i>)) &sdot; (<i>y</i><sub><i>i</i></sub> - &mu;(<i>y</i>))) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
              <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (1/<i>n</i>) &sdot; (<i>x</i> &#8722; <i>x</i><sub>&mu;</sub>) &sdot; (<i>y</i> &#8722; <i>y</i><sub>&mu;</sub>) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
              <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> ((1/&radic;(<i>n</i>)) &sdot; (<i>x</i> &#8722; <i>x</i><sub>&mu;</sub>)) &sdot; ((1/&radic;(<i>n</i>)) &sdot; (<i>y</i> &#8722; <i>y</i><sub>&mu;</sub>))
</td></tr></table></td></tr></table>
Notice that the last line in the above indicates that we can view the covariance as the dot product of the vectors (<i>x</i> &#8722; <i>x</i><sub>&mu;</sub>) and (<i>y</i> &#8722; <i>y</i><sub>&mu;</sub>) (each of which represent how <i>x</i> and <i>y</i> deviate from their respective means).</div>
      </div></div></div>
<a id="2a91b48a2e1040808d9538ff45aede2f"></a><div class="linked block"><div class="link-block">[<a href="#2a91b48a2e1040808d9538ff45aede2f">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition (correlation coefficient):</span> 
        
<div class="text">For <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup> and <i>y</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup>, we define the <i>correlation coefficient</i> (also known as the <i>Pearson product-moment correlation coefficient</i>) between <i>x</i> and <i>y</i> as:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  &rho;(<i>x</i>, <i>y</i>) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> cov(<i>x</i>, <i>y</i>) / (&sigma;(<i>x</i>) &sdot; &sigma;(<i>y</i>))
</td></tr></table></td></tr></table></div>
      </div></div></div>
<a id="2a91b48a2e1040808d9538ff45aeae2f"></a><div class="linked block"><div class="link-block">[<a href="#2a91b48a2e1040808d9538ff45aeae2f">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact:</span> 
        
<div class="text">For <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup> and <i>y</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup>, it must be that -1 &le; &rho;(<i>x</i>, <i>y</i>) &le; 1. We can see this by noticing that the formula for &rho;(<i>x</i>, <i>y</i>) appears in the <a href="#ff3b49ea963b4301995bcd19c544280a">Cauchy-Schwarz inequality</a>:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  &rho;(<i>x</i>, <i>y</i>) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> cov(<i>x</i>, <i>y</i>) / (&sigma;(<i>x</i>) &sdot; &sigma;(<i>y</i>)) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
                <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> ((1/<i>n</i>) &sdot; (<i>x</i> &#8722; <i>x</i><sub>&mu;</sub>) &sdot; (<i>y</i> &#8722; <i>y</i><sub>&mu;</sub>)) / ((1/&radic;(<i>n</i>)) &sdot; ||<i>x</i> &#8722; <i>x</i><sub>&mu;</sub>|| &sdot; (1/&radic;(<i>n</i>)) &sdot; ||<i>y</i> &#8722; <i>y</i><sub>&mu;</sub>||) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
                <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> ((1/<i>n</i>) &sdot; (<i>x</i> &#8722; <i>x</i><sub>&mu;</sub>) &sdot; (<i>y</i> &#8722; <i>y</i><sub>&mu;</sub>)) / ((1/<i>n</i>) &sdot; ||<i>x</i> &#8722; <i>x</i><sub>&mu;</sub>|| &sdot; ||<i>y</i> &#8722; <i>y</i><sub>&mu;</sub>||) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
                <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> ((<i>x</i> &#8722; <i>x</i><sub>&mu;</sub>) &sdot; (<i>y</i> &#8722; <i>y</i><sub>&mu;</sub>)) / (||<i>x</i> &#8722; <i>x</i><sub>&mu;</sub>|| &sdot; ||<i>y</i> &#8722; <i>y</i><sub>&mu;</sub>||)
</td></tr></table></td></tr></table>
Notice that the only difference between the above and the Cauchy-Schwarz inequality is that in the above, the numerator is not guaranteed to be positive. Thus, we have:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  &#8722;1 <td></tr></table></td><td style="text-align:center;"> &le; </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> cov(<i>x</i>, <i>y</i>) / (&sigma;(<i>x</i>) &sdot; &sigma;(<i>y</i>)) <td></tr></table></td><td style="text-align:center;"> &le; </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 1
</td></tr></table></td></tr></table>
In addition to being bounded by &#8722;1 and 1, the correlation coefficient has some other useful properties. If <i>x</i> and <i>y</i> have a positive linear correlation, then &rho;(<i>x</i>, <i>y</i>) is positive. If they have a negative linear correlation, it is negative. Given two pairs of data sets <i>x</i>, <i>y</i> and <i>x</i>', <i>y</i>', if |&rho;(<i>x</i>, y)| > |&rho;(<i>x</i>', y')|, then <i>x</i> and <i>y</i> have a stronger correlation with each other than do <i>x</i>' and <i>y</i>'.</div>
      </div></div></div>
<div class="text top">While the correlation coefficient is useful for comparing correlations, the quantity itself should not be used to infer anything about the extent of the correlation. To do that with some rigor, it is necessary to introduce a few additional concepts.</div>
<a id="2a91b48aaa1040808d9538bb45aede2c"></a><div class="linked block"><div class="link-block">[<a href="#2a91b48aaa1040808d9538bb45aede2c">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Suppose that our data consists of only a pair of two-dimensional tuples: <i>x</i> = (2, 6) and <i>y</i> = (20, 60). In other words, the data set is [(2,20), (6,60)]. Notice that the slopes of the orthogonal projections of <i>x</i> and of <i>y</i> onto the diagonal are the same, and if we normalize these projections, they become the same vector. That means (<i>x</i> &#8722; <i>x</i><sub>&mu;</sub>) / ||<i>x</i> &#8722; <i>x</i><sub>&mu;</sub>|| = (<i>y</i> &#8722; <i>y</i><sub>&mu;</sub>) / ||<i>y</i> &#8722; <i>y</i><sub>&mu;</sub>||. But the dot product of a vector with itself is the square of its length: <i>v</i> &sdot; <i>v</i> = ||<i>v</i>||<sup>2</sup>. Thus, we have:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  cov(<i>x</i>, <i>y</i>) / (&sigma;(<i>x</i>) &sdot; &sigma;(<i>y</i>)) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (<i>x</i> &#8722; <i>x</i><sub>&mu;</sub>) &sdot; (<i>y</i> &#8722; <i>y</i><sub>&mu;</sub>) / (||<i>x</i> &#8722; <i>x</i><sub>&mu;</sub>|| &sdot; ||<i>y</i> &#8722; <i>y</i><sub>&mu;</sub>||) <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 1
</td></tr></table></td></tr></table>
Thus, because the two dimensions have a positive linear correlation in this example, the correlation coefficient is 1.</div>
      </div></div></div></div>
  <a id="4.4"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#4.4">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">4.4.</span> Observations, Hypothesis Testing, and Significance</div></h3>
<div class="text top">The correlation coefficient gives us a way to quantify that there may exist a correlation between two dimensions in a data set. However, this correlation may be an artifact of the way we happened to obtain the data. For example, we may have obtained a biased sample. We can account for this by modeling the likelihood of obtaining a biased sample given the assumption that no correlation exists. If it is very unlikely that we would obtain a sample with the observed correlation if no correlation actually exists, then we might conclude that it is likely that a correlation does exist.</div><div class="paragraph">
To take this approach, we need to compare the correlation we observe against something (i.e., to decide how likely or unlikely it is if there is no correlation). This means we need to define mathematically what it means for there to be no correlation.
      </div>
<a id="3331b48a2e1040808d9538ff45aede3a"></a><div class="linked block"><div class="link-block">[<a href="#3331b48a2e1040808d9538ff45aede3a">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition (distribution):</span> 
        
<div class="text">A <i>distribution</i> over a set <i>S</i> is a function <i>f</i>: <i>S</i> <span style="font-size:12px;">&#8594;</span> &#8477;. If <i>S</i> represents a set of possible observations of a system (e.g., a collection of system states), then a distribution <i>f</i> might represent the probability that the system is in that state, or the frequency with which that system state has been or can be observed.</div>
      </div></div></div>
<a id="3331b48a2e1040808d9538ff45aede2f"></a><div class="linked block"><div class="link-block">[<a href="#3331b48a2e1040808d9538ff45aede2f">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition (p-value):</span> 
        
<div class="text">Let a subset of the real number line <i>S</i> &sub; &#8477; represent all possible system states that can be observed, and let <i>f</i> be a distribution over <i>S</i> that represents the probability of observing each of the possible states in <i>S</i>. Given some state <i>s</i> <span style="font-size:12px;">&#8712;</span> <i>S</i>, the probability of observing <i>s</i> is then <i>f</i>(<i>s</i>). We define the <i>p-value</i> at the observed state <i>s</i> to be the probability of observing any state that is equal to or more extreme than the state <i>s</i>.</div>
      </div></div></div>
<div class="text top">One way to define which states are more extreme is to consider all the states from <i>s</i> to the nearest endpoint. Suppose <i>S</i> is an interval [<i>a</i>, <i>b</i>]. If <i>s</i> > (<i>a</i> &#8722; <i>b</i>) / 2, then the <i>p</i>-value would be computed by taking the definite integral of <i>f</i> from <i>s</i> to <i>b</i>. Otherwise, it would be computed by taking the definite integral of <i>f</i> from <i>s</i> to <i>a</i>.</div><div class="paragraph">
In order to use the <i>p</i>-value concept to reason about correlation coefficients, it is necessary to model how the correlation coefficient should behave if no correlation exists. One way to do this using the data alone is to consider every permutation of the data.
      </div>
<a id="2bb1b48aaa1040808d9538bb45aede2c"></a><div class="linked block"><div class="link-block">[<a href="#2bb1b48aaa1040808d9538bb45aede2c">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Assume we have a two-dimensional data set [(<i>x</i><sub>1</sub>, <i>y</i><sub>1</sub>), ..., (<i>x</i><sub><i>n</i></sub>, <i>y</i><sub><i>n</i></sub>)], also represented by <i>x</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup> and <i>y</i> <span style="font-size:12px;">&#8712;</span> &#8477;<sup><i>n</i></sup>. One way to obtain a range of possible correlation coefficients is to compute the correlation coefficient for all <i>n</i>! possible permutations of the second dimension. Intuitively, if the data has no correlation, then performing this permutation should not change the correlation coefficient much. On the other hand, if the data does have a correlation, we should get a distribution of correlation coefficients with respect to which the true correlation coefficient is "extreme".</div>
        
<div class="code"><div class="source"><pre><code class="py">
from random import shuffle
from math import sqrt

data = [(18, 28), (24, 18), (27, 31), (14, 15), (46, 23),
        (36, 19), (27, 10), (34, 25), (19, 15), (13, 13),
        (4, 2), (17, 20), (28, 12), (36, 11), (26, 14),
        (19, 19), (24, 13), (25, 6), (20, 8), (17, 22),
        (18, 8), (25, 12), (28, 27), (31, 28), (35, 22),
        (17, 8), (19, 19), (23, 23), (22, 11)]
x = [xi for (xi, yi) in data]
y = [yi for (xi, yi) in data]

def permute(x):
    shuffled = [xi for xi in x]
    shuffle(shuffled)
    return shuffled

def avg(x): # Average
    return sum(x)/len(x)

def stddev(x): # Standard deviation.
    m = avg(x)
    return sqrt(sum([(xi-m)**2 for xi in x])/len(x))

def cov(x, y): # Covariance.
    return sum([(xi-avg(x))*(yi-avg(y)) for (xi,yi) in zip(x,y)])/len(x)

def corr(x, y): # Correlation coefficient.
    if stddev(x)*stddev(y) != 0:
        return cov(x, y)/(stddev(x)*stddev(y))

def p(x, y):
    c0 = corr(x, y)
    corrs = []
    for k in range(0, 2000):
        y_permuted = permute(y)
        corrs.append(corr(x, y_permuted))
    return len([c for c in corrs if abs(c) >= abs(c0)])/len(corrs)
        </code></pre></div></div>
        
<div class="text">In the above, we compute the correlation coefficient for a large number of permuted data sets. We then compute the fraction of these data sets that have a correlation coefficient with a higher absolute value than the correlation coefficient of the original data. We can call this fraction the <i>p</i>-value. Ideally, we would try all possible permutations; since this is infeasible, we can use this technique to get an approximation.</div>
        <div class="paragraph">
Alternatively, we can use a library to compute the same information. The <code>scipy.stats.pearsonr()</code> function will return the correlation coefficient and the <i>p</i>-value.
        </div>
        
<div class="code"><div class="source"><pre><code class="py">
import scipy.stats
print(scipy.stats.pearsonr(x, y))
        </code></pre></div></div>
      </div></div></div>
<div class="text top">Another approach is to mathematically derive a distribution of correlation coefficients based on some assumptions about how the observation vectors <i>x</i> and <i>y</i> are distributed if they are independent of one another (i.e., not correlated) and each follow some distribution (e.g., the normal distribution). Then, computing a definite integral of this distribution from some specific observation to the nearest boundary would yield the <i>p</i>-value for that observation. Taking this approach usually requires parameterizing the distribution using the size of the data set (i.e., the number of samples) because this can influence how the distribution of correlation coefficients should be computed.</div>
<a id="084e12ab2226470ebe2d13d00f0d1103"></a><div class="linked block"><div class="link-block">[<a href="#084e12ab2226470ebe2d13d00f0d1103">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Suppose that you have a data set of <i>n</i> observations in which each tuple is of the form (<i>w</i><sub><i>i</i></sub>, <i>x</i><sub><i>i</i></sub>, <i>y</i><sub><i>i</i></sub>, <i>z</i><sub><i>i</i></sub>), where <i>w</i> = (<i>w</i><sub>1</sub>, ..., <i>w</i><sub><i>n</i></sub>), <i>x</i> = (<i>x</i><sub>1</sub>, ..., <i>x</i><sub><i>n</i></sub>), <i>y</i> = (<i>y</i><sub>1</sub>, ..., <i>y</i><sub><i>n</i></sub>) and <i>z</i> = (<i>z</i><sub>1</sub>, ..., <i>z</i><sub><i>n</i></sub>). You compute some covariance and correlation coefficient quantities and find the following relationships:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  cov(<i>x</i>, <i>y</i>) <td></tr></table></td><td style="text-align:center;"> &gt; </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> cov(<i>w</i>, <i>z</i>) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
  &rho;(<i>y</i>, <i>z</i>) <td></tr></table></td><td style="text-align:center;"> &gt; </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> &rho;(<i>w</i>, <i>x</i>) </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
  &rho;(<i>w</i>, <i>x</i>) <td></tr></table></td><td style="text-align:center;"> &gt; </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 0
</td></tr></table></td></tr></table>
<ul>
  <li>Can you conclude that the correlation between <i>x</i> and <i>y</i> is stronger than the correlation between <i>w</i> and <i>z</i>?</li>
  <li>Can you conclude that the correlation between <i>y</i> and <i>z</i> is stronger than the correlation between <i>w</i> and <i>x</i>?</li>
  <li>Suppose that we change the last inequality to &rho;(<i>y</i>, <i>z</i>) < 0. Does this change your answers to any of the above?</li>
</ul></div>
      </div></div></div>
<a id="084e12ab2226470ebe2d13d00f0d1aaa"></a><div class="linked block"><div class="link-block">[<a href="#084e12ab2226470ebe2d13d00f0d1aaa">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="exercise task_required"><span class="block_label">Exercise:</span> 
        
<div class="text">Suppose you have a data set <i>D</i> consisting of tuples of the form (<i>x</i><sub><i>i</i></sub>, <i>y</i><sub><i>i</i></sub>) and you want to define the necessary transformations using the MapReduce paradigm to compute the covariance for this data set. Provide the definitions of the map and reduce operations necessary to accomplish this.</div>
        
<div class="button"><button class="solution_toggle">show solution</button></div><div class="solution_container" style="display:none;"><div class="solution">
          
<div class="text">Computing the average would require that we emit tuples of the form (*, <i>x</i><sub><i>i</i></sub>, <i>y</i><sub><i>i</i></sub>, 1) that all have the same key * and then reduce them with a component-wise addition operation. We would then perform another mapping from the data set of totals that contains only one entry of the form (*, <i>t</i><sub><i>x</i></sub>, <i>t</i><sub><i>y</i></sub>, <i>c</i>) to a data set containing a single tuple (<i>t</i><sub><i>x</i></sub>/<i>c</i>, <i>t</i><sub><i>y</i></sub>/<i>c</i>) = (&mu;(<i>x</i>), &mu;(y)). We will call this data set containing a single entry <i>M</i>.</div>
          <div class="paragraph">
Next, we would need to take the product <i>M</i> and <i>D</i> that has tuples of the form (<i>x</i><sub><i>i</i></sub>, <i>y</i><sub><i>i</i></sub>, &mu;(<i>x</i>), &mu;(<i>y</i>)), and would compute the covariance using a map operation that produces (*, (<i>x</i><sub><i>i</i></sub> &#8722; &mu;(<i>x</i>)) &sdot; (<i>y</i><sub><i>i</i></sub> &#8722; &mu;(<i>y</i>)), 1) and a reduce operation that uses component-wise addition. A final map operation would convert the resulting single tuple by dividing both totals by the count, as before.
          </div>
          <div class="paragraph">
If we also wanted to compute the standard deviations, we would use an approach similar to the one used for computing the averages. We would then take the product of the two singleton data sets to compute the final result.
          </div>
          <div class="paragraph">
The Python implementation below illustrates the techniques described above.
          </div>
          
<div class="code"><div class="source"><pre><code class="py">
D = [(1,2), (3,4), (5,6)]

def reducer(k, xycs):
    (tx, ty, n) = (0, 0, 0)
    for (x, y, c) in xycs:
        (tx, ty, n) = (tx + x, ty + y, n + c)
    return (tx/n, ty/n)

[(mx, my)] = reduce(reducer, map(lambda x,y: [("*", (x, y, 1))], D))

def reducer(k, ps):
    (tp, n) = (0, 0)
    for (p, c) in ps:
        (tp, n) = (tp + p, n + c)
    return (tp/n,)

[(cov,)] = reduce(reducer, map(lambda x,y: [("*", ((x-mx) * (y-my), 1))], D))
          </code></pre></div></div>
        </div></div><div class="solution_spacer"></div>
      </div></div></div>
<a id="084e12ab2226470ebe2d13d00f0d1aab"></a><div class="linked block"><div class="link-block">[<a href="#084e12ab2226470ebe2d13d00f0d1aab">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Suppose you have a data set <i>D</i> consisting of tuples of the form (<i>x</i><sub><i>i</i></sub>, <i>y</i><sub><i>i</i></sub>) and you want to compute the correlation coefficient for this data set with a transformation sequence that uses building blocks from the relational paradigm. The flow diagram for this sequence of transformations is provided below.</div>
        
<div class="text"><div class="pql" style="border:0px solid #000000; height:400px; display:inline-block; width:100%;">
!table([
  [['dr:prod`r:proj``(xi,yi)'],                     ['r:agg plus``(xi,yi,1)'], ['d:proj``(sx,sy,n)']],
  [['d:agg plus``(((xi-&mu;(x))^2)/n, ((y-&mu;(y))^2)/n)'],  ['rd:proj`l:proj``(xi,yi,&mu;(x), &mu;(y), n)'],   ['l:prod``(&mu;(x) = sx/n, &mu;(y) = sy/n, n)']],
  [['d:proj``(&sigma;(x)^2, &sigma;(y)^2)'], ['d:prod``(cov(x, y)'], ['l:agg plus``(1/n)*(x-&mu;(x))*(y-&mu;(y))']],
  [['r:prod``(&sigma;(x), &sigma;(y))'], ['r:proj``(cov(x,y), &sigma;(x), &sigma;(y))'], ['(&rho;(x, y))']]
])
</div></div>
      </div></div></div></div>
  <a id="4.5"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#4.5">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">4.5.</span> Sampling and Inference</div></h3>
<div class="text top">Often, data sets are partial observations of the overall system. Despite this, we may still be able to infer things about the system's properties from those partial observations. In this subsection we go over just a few of the many methods available for making inferences based on data obtained via sampling.</div>
<a id="3331b48a2e1040808d9538ff45ae342f"></a><div class="linked block"><div class="link-block">[<a href="#3331b48a2e1040808d9538ff45ae342f">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition (sampling):</span> 
        
<div class="text">Given some data set <i>S</i> (which may not be known in its entirely and may only exist in theory, e.g., the data set of all humans, or the data set of all Earth-size planets in the galaxy), we call any subset <i>T</i> &sub; <i>S</i> a <i>sample</i> of <i>S</i>.</div>
      </div></div></div>
<div class="text top">Typically, the data set being sampled will contain records or tuples, each corresponding to an item or an individual. Each tuple provides some information about the properties or measurements of that item or individual along certain dimensions (e.g., size, location, and so on).</div>
<a id="3331b48a2e1040808d9538ff15ae342f"></a><div class="linked block"><div class="link-block">[<a href="#3331b48a2e1040808d9538ff15ae342f">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="definition true_required"><span class="block_label">Definition (probability sampling):</span> 
        
<div class="text">A <i>probability sample</i> of a data set <i>S</i> is any sample <i>T</i> &sub; <i>S</i> that is collected under the condition that for every tuple, the probabiility of selecting that tuple is known in advance.</div>
      </div></div></div>
<a id="1121b48a2e1040808d9538ff15aeaaaf"></a><div class="linked block"><div class="link-block">[<a href="#1121b48a2e1040808d9538ff15aeaaaf">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact (proportional reasoning between independent variables):</span> 
        
<div class="text">Assume a data set <i>S</i> has information that can be used to derive two relevant binary characteristics describing the individuals or items in that data set (e.g., whether an individual is male or female, whether an individual resides in a particular neighborhood, and so on). We can define two (possibly overlapping) subsets <i>A</i> &sub; <i>S</i> and <i>B</i> &sub; <i>S</i> corresponding to those subsets that have each of those characteristics.</div>
        <div class="paragraph">
If the two characteristics are independent we can conclude that:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  |<i>A</i> &cap; <i>B</i>|/|<i>B</i>| <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> |<i>A</i> &cap; <i>S</i>|/|<i>S</i>| <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> |<i>A</i>|/|<i>S</i>|
</td></tr></table></td></tr></table>
In other words, if we treat the tuples from <i>B</i> as a sample of <i>S</i> and we find the fraction of those tuples that are also in <i>A</i>, we can conclude that the same would have occurred if we had sampled the entirety of <i>S</i>.
        </div>
      </div></div></div>
<a id="2bb1b48aaa1040808d9538bb4baedb2c"></a><div class="linked block"><div class="link-block">[<a href="#2bb1b48aaa1040808d9538bb4baedb2c">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Assume we know that within a certain geographical region, one out of every 10 people is a member of an online social networking mobile application that collects the geographical locations of its users. Suppose we take a sample of how many unique users were present in a particular neighborhood within that region over the course of a week, and we get a total of 1200 users. Assuming that there is no correlation between that particular neighborhood and the popularity of that mobile application (i.e., residency in that neighborhood and membership in the social network are independent variables), we can then infer the total population in that neighborhood by solving the following equation:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  1/10 <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 1200/<i>n</i>
</td></tr></table></td></tr></table>
The above yields <i>n</i> = 12,000, so we can conclude that there are about 12,000 residents in that particular neighborhood based on our observations.</div>
      </div></div></div>
<div class="text top">In the <a href="#2bb1b48aaa1040808d9538bb4baedb2c">above example</a>, we could have also worked in the other direction if we only knew the overall population in the region, the population of the neighborhood, and the popularity of the application within the population in that neighborhood. This would allow us to derive the total number of users of the application in the region.</div>
<a id="1121b48a2e1040808d9538ff15ae342f"></a><div class="linked block"><div class="link-block">[<a href="#1121b48a2e1040808d9538ff15ae342f">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact (capture-recapture):</span> 
        
<div class="text">Suppose that we can use probability sampling of a data set of size <i>N</i>, we know that the probability of selecting any individual tuple is equivalent to 1/<i>N</i>, and we do not know the size <i>N</i> of the data set. In addition, suppose we are allowed to "mark" any tuple that appears in a sample we take so that we can identify it (with complete certainty) as a previously marked tuple if we see it in subsequent samples.</div>
        <div class="paragraph">
Under these conditions, we can estimate the size of the entire data set (an unknown quantity) in the following way. First, we take a sample of size <i>n</i> of the data set, and we mark every tuple in that sample. Next, we take another (independent) sample of size <i>K</i> and count the number of marked tuples in that sample (suppose it is <i>k</i> &le; <i>K</i>). We can then solve the following equation for <i>N</i>:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  <i>k</i>/<i>K</i> <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> <i>n</i>/<i>N</i> </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
  <i>N</i> <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> (<i>n</i> &sdot; <i>K</i>) / <i>k</i>
</td></tr></table></td></tr></table>
        </div>
      </div></div></div>
<a id="24b1b48aaa1040808d9538bb4baedb2c"></a><div class="linked block"><div class="link-block">[<a href="#24b1b48aaa1040808d9538bb4baedb2c">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">Suppose that we want to know how many distinct people regularly visit a particular grocery store, and we can identify visitors in some way (e.g., by recording their unique identifiers). We can fix a particular time period and collect the identifiers of all the people who visit that store (suppose we collect <i>n</i> of them). We can then fix another time period, again record the identifiers of all <i>K</i> visitors during that time period, and then determine what fraction <i>k</i>/<i>K</i> of the identifiers from the first collection appears in the second collection. We could then estimate that the total number of individuals who visit the store regularly is around (<i>n</i> &sdot; <i>K</i>) / <i>k</i>.</div>
      </div></div></div>
<a id="084e12ab2226470ebe2d13d00f0d1aac"></a><div class="linked block"><div class="link-block">[<a href="#084e12ab2226470ebe2d13d00f0d1aac">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="exercise task_required"><span class="block_label">Exercise:</span> 
        
<div class="text">You are obtaining a probability sample of a population such that the probability of selecting every individual is the same. You take a sample of 1000 individuals and find that 200 of them are members of a particular social network. Answer the following questions.
<ul>
  <li>Assuming that membership in the social network is completely independent of age, and that there are 20,000 individuals who are 18-24 years old in the overall population, provide one possible estimate <i>k</i> for the number of individuals in the 18-24 age range who are members of the social network.</li>
  <li>Suppose you determine that the social network has 10,000 unique members. What is the size <i>n</i> of the population from which your sample was taken?</li>
</ul></div>
        
<div class="button"><button class="solution_toggle">show solution</button></div><div class="solution_container" style="display:none;"><div class="solution">
          
<div class="text">Since membership in the 18-24 demographic and membership in the social network are assumed to be independent, we can use proportional reasoning to obtain <i>k</i> = 4000:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  200/1000 <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> <i>k</i>/20,000 </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
  <i>k</i> <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 4000
</td></tr></table></td></tr></table>
If the social network has 10,000 members and 4000 of them are age 18-24, then we can use the fact that there are 20,000 individuals in the 18-24 demographic in the overall population to compute <i>n</i> = 50,000:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  4000/10,000 <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 20,000/<i>n</i> </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
  <i>n</i> <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 50,000
</td></tr></table></td></tr></table></div>
        </div></div><div class="solution_spacer"></div>
      </div></div></div>
<a id="3aa1b48a2e1040808d9538ff45ae342f"></a><div class="linked block"><div class="link-block">[<a href="#3aa1b48a2e1040808d9538ff45ae342f">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="fact true_required"><span class="block_label">Fact (Algorithm R and reservoir sampling):</span> 
        
<div class="text">Using <a href="https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R">Algorithm R</a>, it is possible to take a probability sample of size <i>k</i> from an incoming sequence of data set entries (e.g., tuples) without knowing the length of the sequence in advance. Furthermore, it is guaranteed that every item in the sequence had an equal probability of being added to the sample.</div>
      </div></div></div>
<a id="084e12ab2226470ebe2d1342ff0d1aac"></a><div class="linked block"><div class="link-block">[<a href="#084e12ab2226470ebe2d1342ff0d1aac">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="exercise task_required"><span class="block_label">Exercise:</span> 
        
<div class="text">Assume we have enough storage in our database to store <i>n</i> + <i>k</i> tuples, and it currently holds a data set <i>X</i> that has <i>n</i> tuples. However, we expect that tomorrow, we will obtain a new data set <i>Y</i> that also has <i>n</i> tuples, so we cannot store it unless we delete <i>X</i> completely. We need to take a probability sample of size <i>k</i> from the collection <i>X</i> &cup; <i>Y</i> such that any item has an equal likelihood of being added to the sample. Is this possible? Explain your answer. You may assume that every tuple in <i>X</i> and <i>Y</i> has a unique identifier and there is no overlap between <i>X</i> and <i>Y</i>.</div>
        
<div class="button"><button class="solution_toggle">show solution</button></div><div class="solution_container" style="display:none;"><div class="solution">
          
<div class="text">We can start by using the <i>k</i> storage slots to perform Algorithm R over the data set <i>X</i>. We can then discard <i>X</i> in its entirety. Once we obtain <i>Y</i> on the next day, we can store it in the <i>n</i> available slots and contrinue running Algorithm R on the values of <i>Y</i> from where we last stopped (i.e., when we had finished processing <i>X</i>). In the end, Algorithm R guarantees that any of the |<i>X</i> &cup; <i>Y</i>| elements that were encountered during its operation had an equal probability of 1/|<i>X</i> &cup; <i>Y</i>| of being chosen.</div>
        </div></div><div class="solution_spacer"></div>
      </div></div></div>
<a id="084e12ab2226470ebe2d1342ff0d1dfc"></a><div class="linked block"><div class="link-block">[<a href="#084e12ab2226470ebe2d1342ff0d1dfc">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="exercise task_required"><span class="block_label">Exercise:</span> 
        
<div class="text">You were using Algorithm R to perform reservoir sampling of packets as they were arriving, with a sample size of 100. You were also recording the number of packets there were corrupted, but you did not record the total number of packets that arrived.</div>
        <div class="paragraph">
Later on, someone asks you to compute the total number of packets that had arrived during the sampling period. You look at your sample and you find that out of 100 packets, 7 were corrupted. You also recorded that a total of 1400 packets had been corrupted during the sample collection period. How many packets arrived during the sample collection period?
        </div>
        
<div class="button"><button class="solution_toggle">show solution</button></div><div class="solution_container" style="display:none;"><div class="solution">
          
<div class="text">Since the sample obtained using Algorithm R is a probability sample such that every packet encountered had an equal probability of being in the sample, whether a packet is in the sample is independent of whether it is corrupted. Thus, since 7/100 packets were corrupted in the sample and 1400 in total were corrupted, the number of packets that arrived is 20,000:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  7/100 <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 1400/<i>n</i> </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
  <i>n</i> <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 20,000
</td></tr></table></td></tr></table></div>
        </div></div><div class="solution_spacer"></div>
      </div></div></div>
<a id="084e12abcdc6470ebe2d1342ff0d1dfc"></a><div class="linked block"><div class="link-block">[<a href="#084e12abcdc6470ebe2d1342ff0d1dfc">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="exercise task_required"><span class="block_label">Exercise:</span> 
        
<div class="text">Suppose there are exactly <i>n</i> entries in a data set of all neighborhoods and that these entries can be divided into two dimensions: <i>x</i> = (<i>x</i><sub>1</sub>, ..., <i>x</i><sub><i>n</i></sub>) and <i>y</i> = (<i>y</i><sub>1</sub>, ..., <i>y</i><sub><i>n</i></sub>). Furthermore, for all 1 &le; <i>i</i> &le; <i>n</i>, we have that <i>x</i><sub><i>i</i></sub> <span style="font-size:12px;">&#8712;</span> {0,1} and <i>y</i><sub><i>i</i></sub> <span style="font-size:12px;">&#8712;</span> {0,1}, where each observation of the form (<i>x</i><sub><i>i</i></sub>, <i>y</i><sub><i>i</i></sub>) specifies whether the <i>i</i>th neighborhood has a hospital (the <i>x</i><sub><i>i</i></sub> component) and whether it has a school (the <i>y</i><sub><i>i</i></sub> component).</div>
        <div class="paragraph">
We compute the correlation coefficient and determine that &rho;(<i>x</i>, <i>y</i>) = 0. If there are 12 neighborhoods with a school, there are 6 neighborhoods that have both a school and a hospital, and there are 8 hospitals in total, what can these facts reasonably imply about the total number of neighborhoods <i>n</i>?
        </div>
        
<div class="button"><button class="solution_toggle">show solution</button></div><div class="solution_container" style="display:none;"><div class="solution">
          
<div class="text">Since the correlation coefficient is 0, it is likely that there is no correlation between the two dimensions being measured (i.e., the presence of a school in the neighborhood and the presence of a hospital in the neighborhood). This allows us to use <a href="#1121b48a2e1040808d9538ff15aeaaaf">proportional reasoning</a> between these independent variables. Of the 12 neighborhoods that have schools, 6 have hospitals. Since the sample of neighborhoods with schools is independent of the property of having a hospital, this 6/12 relationship must also hold for the set of all neighborhoods:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  # with school and hospital / # with school <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> # with hospital / # neighborhoods
</td></tr></table></td></tr></table>
Thus, there are 16 neighborhoods:
<table style="padding-left:20px; margin:4px 0px 4px 0px;"><tr><td style="text-align:right; white-space:nowrap;"><table style="width:100%;"><tr><td style="text-align:right;">  6/12 <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 8/<i>n</i> </td></tr></table></td></tr><tr><td style="text-align:right;"><table style="width:100%;"><tr><td style="text-align:right;">
  <i>n</i> <td></tr></table></td><td style="text-align:center;"> = </td><td><table style="white-space:nowrap;"><tr><td style="white-space:nowrap;"> 16
</td></tr></table></td></tr></table></div>
        </div></div><div class="solution_spacer"></div>
      </div></div></div></div>
</div>
<a id="5"></a>
<div class="section"><hr /><h2 class="linked heading"><span class="link-title">[<a href="#5">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">5.</span> Visualizations and Web Services</div></h2>
  <a id="5.1"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#5.1">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">5.1.</span> Web Services</div></h3>
<div class="text top">In this section, we present examples of how a few existing tools can be used to implement and test a basic web service. More generally, there are innumerable frameworks for building server-side and client-side application components.</div>
<a id="084e12abcdc6470ebe2d1342ff0d1bdd"></a><div class="linked block"><div class="link-block">[<a href="#084e12abcdc6470ebe2d1342ff0d1bdd">link</a>]&nbsp;&nbsp;</div><div style="width:100%; display:inline-block;"><div style="width:auto;" class="example task_required"><span class="block_label">Example:</span> 
        
<div class="text">The following presents one possible web service implementation using the Flask framework.</div>
        
<div class="code"><div class="source"><pre><code class="py">
import jsonschema
from flask import Flask, jsonify, abort, make_response, request
from flask.ext.httpauth import HTTPBasicAuth

app = Flask(__name__)
auth = HTTPBasicAuth()

users = [
  { 'id': 1, 'username': u'alice' },
  { 'id': 2, 'username': u'bob' }
]

schema = {
  "type": "object", 
  "properties": {"username" : {"type": "string"}},
  "required": ["username"]
}

&commat;app.route('/client', methods=['OPTIONS'])
def show_api():
    return jsonify(schema)

&commat;app.route('/client', methods=['GET'])
&commat;auth.login_required
def show_client():
    return open('client.html','r').read()

&commat;app.route('/app/api/v0.1/users', methods=['GET'])
def get_users(): # Server-side reusable name for function.
    print("I'm responding.")
    return jsonify({'users': users})

&commat;app.route('/app/api/v0.1/users/<int:user_id>', methods=['GET'])
def get_user(user_id):
    user = [user for user in users if user['id'] == user_id]
    if len(user) == 0:
        abort(404)
    return jsonify({'user': user[0]})

&commat;app.errorhandler(404)
def not_found(error):
    return make_response(jsonify({'error': 'Not found foo.'}), 404)

&commat;app.route('/app/api/v0.1/users', methods=['POST'])
def create_user():
    print(request.json)
    if not request.json:
        print('Request not valid JSON.')
        abort(400)

    try:
        jsonschema.validate(request.json, schema)
        user = { 'id': users[-1]['id'] + 1, 'username': request.json['username'] }
        users.append(user)
        print(users)
        return jsonify({'user': user}), 201
    except:
        print('Request does not follow schema.')
        abort(400)

&commat;auth.get_password
def foo(username):
    if username == 'alice':
        return 'ecila'
    return None

&commat;auth.error_handler
def unauthorized():
    return make_response(jsonify({'error': 'Unauthorized access.'}), 401)

if __name__ == '__main__':
    app.run(debug=True)
        </code></pre></div></div>
        
<div class="text">The HTML page below can be used to test the POST request handling capabilities of the application.</div>
        
<div class="code"><div class="source"><pre><code class="js">
&lt;script&gt;
  function postSomething() {
    var http = new XMLHttpRequest();
    var url = "http://localhost:5000/app/api/v0.1/users";
    var json = JSON.stringify({'name':'carl'});
    http.open("POST", url, true);
    http.setRequestHeader("Content-type", "application/json");
    http.onreadystatechange = function() {
      if (http.readyState == 4 && http.status == 200) {
        console.log(http.responseText);
      }
    }
    http.send(json);
  }
&lt;/script&gt;
&lt;button onclick="postSomething();"&gt;Post new user&lt;/button&gt;
        </code></pre></div></div>        
      </div></div></div></div>
</div><a id="bib"></a><div class="references"><hr /><h2 class="linked heading"><span class="link-title">[<a href="#bib">link</a>]&nbsp;&nbsp;</span><span class="header_numeral">References</span></h2><table><tr><td class="cite"><a id="625662870761"></a>[1]</td><td> "<b>World's population increasingly urban with more than half living in urban areas</b>". <a href="https://www.un.org/development/desa/en/news/population/world-urbanization-prospects.html">https://www.un.org/development/desa/en/news/population/world-urbanization-prospects.html</a></td></tr><tr><td class="cite"><a id="aiid:1855401"></a>[2]</td><td> Luís M. A. Bettencourt, José Lobo, Dirk Helbing, Christian Kühnert, and Geoffrey B. West. "<b>Growth, innovation, scaling, and the pace of life in cities</b>". <i>Proceedings of the National Academy of Sciences of the United States of America</i> 2007;104(17):7301-7306. <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1852329/">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1852329/</a></td></tr><tr><td class="cite"><a id="Albright:2008:SSL:1376616.1376739"></a>[3]</td><td> Robert Albright, and Alan Demers, Johannes Gehrke, Nitin Gupta,  Hooyeon Lee, Rick Keilty, Gregory Sadowski, Ben Sowell, and Walker White. "<b>SGL: A Scalable Language for Data-driven Games</b>". <i>Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data</i> 2008. <a href="http://www.cs.cornell.edu/~sowell/2008-sigmod-games-demo.pdf">http://www.cs.cornell.edu/~sowell/2008-sigmod-games-demo.pdf</a></td></tr><tr><td class="cite"><a id="White:2008:DPC:1401843.1401847"></a>[4]</td><td> Walker White, Benjamin Sowell, Johannes Gehrke, and Alan Demers. "<b>Declarative Processing for Computer Games</b>". <i>Proceedings of the 2008 ACM SIGGRAPH Symposium on Video Games</i> 2008. <a href="http://www.cs.cornell.edu/~sowell/2008-sandbox-declarative.pdf">http://www.cs.cornell.edu/~sowell/2008-sandbox-declarative.pdf</a></td></tr><tr><td class="cite"><a id="PROV-Primer"></a>[5]</td><td> W3C Working Group. "<b>PROV Model Primer</b>". <a href="https://www.w3.org/TR/prov-primer/">https://www.w3.org/TR/prov-primer/</a></td></tr><tr><td class="cite"><a id="ilprints918"></a>[6]</td><td> Robert Ikeda and Jennifer Widom. "<b>Data Lineage: A Survey</b>". <a href="http://ilpubs.stanford.edu:8090/918/1/lin_final.pdf">http://ilpubs.stanford.edu:8090/918/1/lin_final.pdf</a></td></tr><tr><td class="cite"><a id="Cui:2003:LTG:775452.775456"></a>[7]</td><td> Y. Cui and J. Widom. "<b>Lineage Tracing for General Data Warehouse Transformations</b>". <i>The VLDB Journal</i> 2003;12(1):41-58. <a href="http://ilpubs.stanford.edu:8090/525/1/2001-5.pdf">http://ilpubs.stanford.edu:8090/525/1/2001-5.pdf</a></td></tr><tr><td class="cite"><a id="RePEc:eee:soceco:v:33:y:2004:i:5:p:587-606"></a>[8]</td><td> Gerd Gigerenzer. "<b>Mindless statistics</b>". <i>The Journal of Socio-Economics</i> 2004;33(5):587-606. <a href="http://www.unh.edu/halelab/BIOL933/papers/2004_Gigerenzer_JSE.pdf">http://www.unh.edu/halelab/BIOL933/papers/2004_Gigerenzer_JSE.pdf</a></td></tr><tr><td class="cite"><a id="DBLP:journals-corr-ShahZ14"></a>[9]</td><td> Nihar B. Shah and Dengyong Zhou. "<b>Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing</b>". <i>CoRR</i> 2014. <a href="http://www.eecs.berkeley.edu/~nihar/publications/double_or_nothing.pdf">http://www.eecs.berkeley.edu/~nihar/publications/double_or_nothing.pdf</a></td></tr></table></div><a id="A"></a><div class="appendix"><hr /><h2 class="linked heading"><span class="link-title">[<a href="#A">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">Appendix A.</span> Other Resources</div></h2>
  <a id="A.1"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#A.1">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">A.1.</span> MongoDB and Related Resources</div></h3>
<div class="text top">In this course, staff will use <a href="https://www.mongodb.org/">MongoDB</a> to store and work with large data sets. This section lists some resources that you may find useful.</div><div class="paragraph">
Some references, tutorials, and other materials of interest mentioned during lecture include the following.
<ul>
  <li>
    The official documentation (<a href="https://docs.mongodb.org/manual/">https://docs.mongodb.org/manual/</a>), and in particular:
    <ul>
      <li>how to assemble relational queries (<a href="https://docs.mongodb.org/manual/core/aggregation-pipeline/">https://docs.mongodb.org/manual/core/aggregation-pipeline/</a>) and</li>
      <li>how to assemble map-reduce queries (<a href="https://docs.mongodb.org/manual/core/map-reduce/">https://docs.mongodb.org/manual/core/map-reduce/</a>).</li>
    </ul>
  </li>
  <li>
    Issues and feature requests mentioned during lecture include:
    <ul>
      <li>flattening the results generated by <code>mapReduce</code> operations: <a href="https://jira.mongodb.org/browse/SERVER-2517">https://jira.mongodb.org/browse/SERVER-2517</a> and</li>
      <li>deprecation and future removal of server-side <code>eval()</code> and stored procedure capabilities: <a href="https://jira.mongodb.org/browse/SERVER-17453">https://jira.mongodb.org/browse/SERVER-17453</a>.</li>
    </ul>
  </li>
  <li>A tutorial suggested by a student: <a href="http://openmymind.net/mongly/">http://openmymind.net/mongly/</a>.</li>
  <li>PyMongo Python library documentation: <a href="http://api.mongodb.org/python/current/">http://api.mongodb.org/python/current/</a>.</li>
  <li>
    Some resources dealing with using and building geospatial data set indices include:
    <ul>
      <li><a href="http://geojson.org/">the GeoJSON format</a>,</li>
      <li><a href="https://docs.mongodb.org/manual/applications/geospatial-indexes/">using geospatial indices and queries in MongoDB</a>,</li>
      <li><a href="http://api.mongodb.org/python/current/examples/geo.html">example</a> illustrating how to create a <code>GEO2D</code> index using PyMongo, and</li>
      <li><a href="https://www.openstreetmap.org/">OpenStreetMap</a>.</li>
    </ul>
  </li>
</ul>
      </div></div>
  <a id="A.2"></a><div class="subsection"><h3 class="linked heading"><span class="link-title">[<a href="#A.2">link</a>]&nbsp;&nbsp;</span><div><span class="header_numeral">A.2.</span> Installation Resources for Other Software Packages and Libraries</div></h3>
<div class="text top">This subsection contains links to other useful resources that may help with installing packages and libraries we use in this course.</div><div class="paragraph">
Some resources that may be useful when working with the <a href="https://pypi.python.org/pypi/prov"><code>prov</code></a> package:
<ul>
  <li>when installing the <a href="https://pypi.python.org/pypi/prov"><code>prov</code></a> library for Python on Windows, you may have some trouble installing the <a href="https://pypi.python.org/pypi/lxml"><code>lxml</code></a> package. Try obtaining the precompiled package <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">here</a>, instead, and installing it using <code>pip install *.whl</code>;</li>
  <li>a short tutorial is available <a href="http://nbviewer.jupyter.org/github/trungdong/notebooks/blob/master/PROV%20Tutorial.ipynb">here</a>.</li>
</ul>
      </div><div class="paragraph">
As we use the JSON format for storing authentication, configuration, and provenance information, the <a href="http://json-schema.org/">JSON Schema</a> is useful and there is a corresponding <a href="https://pypi.python.org/pypi/jsonschema">jsonschema</a> Python package that you will need to install.
      </div><div class="paragraph">
Two useful kinds of tools for solving problems that can be represented as systems of equations and inequalities are SMT solvers and linear optimization packages.
<ul>
  <li>
    The <a href="http://www.scipy.org/">SciPy</a> library collection includes the <a href="http://docs.scipy.org/doc/scipy-0.17.0/reference/optimize.linprog-simplex.html"><code>optimize.linprog</code></a> module for solving linear optimization problems, and the library is relatively straightforward to install:
    <ul>
      <li>using Windows, it may be easiest to obtain and install precompiled binaries for <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy">numpy</a> and <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/#scipy">scipy</a> (in that order);</li>
      <li>using Mac OS X or Linux, running <code>pip install numpy</code> and <code>pip install scipy</code>should be sufficient (you may need to download and install <a href="https://gcc.gnu.org/wiki/GFortranBinaries">a Fortran compiler</a> first under some configurations).</li>
    </ul>
  </li>
  <li>The <a href="https://github.com/Z3Prover/z3/wiki">Z3</a> SMT solver is now open source and very easy to use with Python; precompiled binaries are also <a href="https://github.com/Z3Prover/bin/tree/master/releases">available</a>.
  </li>
</ul>
      </div></div></div>
</div><div id="footer"><div id="author"><a href="http://lapets.io">Andrei Lapets</a></div><div id="sheaflink"><a href="http://sheaf.io">sheaf</a></div></div></body>
</html>
<!--eof-->