chapter5-graphid.tex

\begin{comment}
fixtex --fpaths chapter5-graphid.tex --outline --asmarkdown --numlines=999 --shortcite

fixtex --fpaths chapter5-graphid.tex --outline --asmarkdown --numlines=999 --shortcite -w && ./checklang.py outline_chapter5-graphid.md

fixtex --fpaths chapter1-intro.tex --outline --asmarkdown --numlines=999 --shortcite -w && ./checklang.py outline_chapter1-intro.md
\end{comment}

% TODO: relate back to GZC, and talk about non-rally scenarios


\chapter{IDENTIFICATION USING CONNECTIVITY IN A DECISION GRAPH}\label{chap:graphid}
\newcommand{\nT}{N}

In this chapter we frame the problem of animal identification in terms of constructing a %
\glossterm{decision graph}.
In this graph, each node is an annotation, and each edge represents a decision made between two annotations.
Edges determine if two annotations are the same (positive) or different (negative) individuals or if they cannot
  be compared (incomparable).
Using the connectivity of the decision graph, we naturally address the problem of animal identification.
Assuming no edges are incorrectly labeled, each connected component of positive edges are all annotations from
  the same individual animal.
We will refer to these as \glossterm{positive connected components} (PCCs).
We call a graph \glossterm{id-complete} if there is at least one negative edge between all pairs of PCCs.
In this case all labeled individuals must be distinct, and we have therefore completed identification.
Alternatively, if every pair of PCCs either has a negative edge between them except for pairs of PCCs where all
  possible edges between them are incomparable, then we know that the data cannot be used to determine if those
  incomparable PCCs are the same or different.
Removing the assumption that all edges are correctly labeled, we call a decision graph \glossterm{inconsistent}
  if any PCC contains a negative edge because it implies that an edge has been mislabeled.
Therefore, stated abstractly, goal of graph identification is to determine a correct, consistent, and id-complete
  set of edges in the decision graph.


In the most general case, the decision graph is initialized with an empty set of edges.
This captures the challenge posed by events like the \GZC{} from~\cref{sec:introgzc}, where we are given a set of
  annotations without name labels.
In essence, each annotation starts by itself as an individual animal, but because there are no negative edges, we
  cannot be sure that this is the correct name labeling.
In order to refine the name labeling, we add edges to the decision graph, gradually moving from a state of zero
  confidence that the name labelings are correct to a state of high confidence.
However, it is important to note that the graph algorithm does not require that we start in an empty state.
Given a known set of annotations with name labels and one or more new annotations with unknown name labels, we
  can add these new annotations to the existing database simply by labeling the graph with edges that captures our
  current knowledge.
Simply put, this means each known individual is a PCC, there is one negative edge between each pair of PCCs, and
  the new annotations are added as nodes without any edges.
Regardless of the initial state, graph identification proceeds to complete the decision graph.
For the remainder of this chapter, without loss of generality, we can assume that the graph starts in an empty
  state.

%The name labelings --- \ie{} the set of PCCs --- are refined as new edges are added to the graph.

%Akin to a segmentation algorithm~\cite{fulkerson_class_2009} that starts with an over-segmentation of an image,
%  the identification graph starts with an empty set of edges, $G = (V, \{ \})$, so in essence each annotation
%  starts by itself as an individual animal.
%we must discard all of those PCCs except those in an independent
%set of a meta-graph where the PCCs are nodes and edges are formed between incomparable PCCs.

To construct the decision graph, we develop a semi-automatic review procedure that combines the ranking and
  verification algorithms presented in \cref{chap:ranking,chap:pairclf}.
The ranking algorithm will be used to suggest candidate edges to be placed in the graph, and the verification
  algorithm will be used to automatically review as many edges as possible.
The key reason for combining these algorithms with a decision graph is to take advantage of its connectivity
  information.
Connectivity not only identifies the individuals, but it can also be used to develop graph measures of
  \emph{redundancy}, \emph{completeness}, \emph{consistency}, and \emph{convergence}.
By combining these graph measures with the ranking and verification algorithms we can prioritize edges for review
  based on both their pairwise probabilities and their ability to affect the consistency of the graph, which in
  turn allows us to:
\begin{enumin}
\item increase confidence that the identifications are correct, %
\item reduce the number of manual reviews,  % 
\item detect and recover from review errors, and %
\item determine when identification is complete. %
\end{enumin}

An important property of the graph identification framework is that it is agnostic to the underlying computer
  vision procedures, which are abstracted into three components:
\begin{enumin}
\item a ranking algorithm used to search for candidate positive edges, %
\item a verification algorithm used to automatically review edges, and %
\item a probability algorithm used assign probabilities to edges (note this is typically a by-product of the
  ranking or verification algorithm).
\end{enumin}
In this thesis we use ranking algorithm from \cref{chap:ranking}, and the verification algorithm from
  \cref{chap:pairclf} to define these components because these are suitable for identifying textured species.
However, while graph identification benefits from accurate computer vision subroutines, it can stand alone
  without them.
This means that existing identification algorithms that only define a subset of these procedures (\eg{}
  contour-based rank-only identification of humpback whales and bottlenose dolphins) could be seamlessly
  incorporated into our framework and realize the benefits of graph identification (\eg{} a reduced number of
  manual reviews and error recovery mechanisms).
Furthermore, because pairwise decisions are gathered and maintained by this framework, verification algorithms
  can be retrained and improved, moving closer to a fully-automatic algorithm.


The first section (\Cref{sec:decisiongraph}) of this chapter formalizes the decision graph and summarizes the
  priority-based review procedure used to construct it.
This provides an overview of each component of the processes.
Each of these components is then discussed in further detail
  in~\cref{sec:redun,sec:incon,sec:cand,sec:decision,sec:converge}.
\Cref{sec:graphexpt} experimentally demonstrates the ability of the graph identification algorithm to reduce the
  number of manual reviews and recover from decision errors.
\Cref{sec:graphconclusion} concludes and summarizes the chapter.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\FloatBarrier{}
\section{THE DECISION GRAPH}\label{sec:decisiongraph}

The graph identification algorithm is a review procedure formalized around the notion of a \glossterm{decision
  graph} $G = (V, E)$ whose nodes are annotations and whose edges are suggested by a ranking algorithm (LNBNN in
  our case) and decided upon by a combination of the probabilities output by a verification algorithm and by manual
  review.
The edge set $E = E_p \cup E_n \cup E_i$ is composed of three disjoint sets.
Throughout this chapter we will refer to the set that an edge belongs to as the label of that edge.
Each edge in $E_p$ is \emph{positive}, meaning that it connects two annotations determined to be from the same
  individual.
Each edge in $E_n$ is \emph{negative}, meaning that it connects annotations determined to be from different
  individuals.
Finally, each edge in $E_i$ is \emph{incomparable}, meaning that it connects two annotations where it has been
  determined that there is not enough information to tell if they are from the same individual (\eg{} when one
  annotation shows the left side of an animal and another other shows the right side).
An example of a decision graph with all three edge types is illustrated in \cref{fig:decisiongraph}.
The goal of graph identification is to construct these edges.

\decisiongraph{}

The most important task is to determine the positive edges $E_p$.
This is because each connected component in the subgraph $G_p = (V, E_p)$ corresponds to a unique individual.
Producing an accurate set of these \glossterm{positive connected components} (PCCs) addresses the larger problem
  of animal identification.
However, an algorithm that only determines positive edges is not enough.
This is because the algorithm may have failed to find all positive edges, resulting in two unconnected PCCs that
  should be \emph{merged} into one.
To ensure that this is not the case we must turn towards negative edges.

We can gain confidence that all positive edges have been found by using negative edges $E_n$, which provide
  direct evidence that two annotations are different individuals.
A correctly labeled negative edge between two PCCs means that no other unreviewed edge between those PCCs can be
  positive.
Another important case is when a negative edge is contained within a PCC.
When this happens, the PCC is \emph{inconsistent}, and it implies that the PCC contains at least one mislabeled
  edge.
Whenever an inconsistency is detected, we resolve it using the algorithm that we will define in~\cref{sec:incon}.

Lastly, incomparable edges, $E_i$, simply signify that a positive or negative decision cannot be made.
Whenever an edge is not labeled as positive, it is critically important to the construction of the decision graph
  that this non-positive edge is distinguished as either negative or incomparable.
Negative edges restrict what new edges can be added because they carry information about the completeness and
  consistency of the graph.
In contrast, incomparable edges do not.
Incomparable edges can exist internally in a PCC without causing inconsistencies or between two PCCS without
  precluding them from being matched at a later point.
%and , whereas incomparable edges do not.
%To be able to use this negative information, we must recognize the distinction between non-positive and negative
%  edges.
%Recall, that this was the basis of our $3$-state classification algorithm in \cref{chap:pairclf}.
%Algorithms like the ranking algorithm from~\cref{chap:matching} do not make this distinction, and therefore
%  cannot make use of the negative information carried by negative edges.
In the case where all edges between two PCCs are incomparable and those PCCs are complete, then we know the
  current data is not enough to determine if those PCCs are the same or different.
In our datasets most images were taken to reduce the number of incomparable PCCs in order to simplify the
  sight-resight analysis, where all PCCs must be comparable to each other, but incomparable cases do exist.
Thus, in our context incomparable edges play a minor but necessary role.
%Note that sight resight analysis can be accomplished on sets with incomparable edges using the algorithm
%  in~\cref{app:markrecapincomp}.


%datasets where 
%play a minor but necessary role by 

Using the connectivity of these edges, we can reduce the number of manual reviews needed.
We now make several observations assuming that each edge is reviewed correctly.
To reduce the number of potential reviews, notice, that once a group of nodes is connected by (a tree of)
  positive edges, all those nodes in that PCC can be inferred to belong to the same individual, and it is not
  necessary to consider any other edge internal to the PCC for review.
Likewise, once a negative edge has been placed between two PCCs, all edges between those PCCs can be ignored.
By ignoring these redundant edges we can reduce the number of reviews.
Furthermore, we can construct a \glossterm{deterministic termination criterion}; if a negative edge is placed
  between every pair of PCCs, then all individuals must have been discovered and identification has converged.
We call such a graph id-complete because when all vertices in a PCC are collapsed into one, the resulting
  meta-graph is complete.

Unfortunately, there are two issues with these observations.
First, they depend on the condition that each edge was correctly reviewed.
In fact, we know that both the verification algorithm and human reviewers are sometimes wrong.
We therefore will introduce redundancy into our graph that allows the algorithm to detect and correct errors,
  trading off the level of redundancy with the sophistication of the errors that may be caught.
A small amount of redundancy is desirable because:
\begin{itemln}
    \item PCCs with redundant edges are less likely to contain errors.
    \item Redundancy can potentially introduce inconsistency into a PCC, which signifies that an error has
      occurred.
\end{itemln}
Therefore, we will define a redundancy criterion in~\cref{sec:redun} which ignores edges within and between PCCs,
  but only after they meet a minimum level of redundancy.

Additionally, the deterministic termination criterion would require that each pair of PCCs has a redundant set of
  negative edges between them before the algorithm stops.
However, the number of edges that need review grows quadratically.
Therefore, unless the automatic algorithm is perfect or the dataset is small, the number of negative reviews
  needed to converge will be too large for a human to handle.
We address this concern in~\cref{sec:converge} using a probabilistic termination criterion.
%While this criterion will not guarantee ,
%that the graph is complete

%it will ensure that the algorithm terminates 


\FloatBarrier{}
\subsection{The review algorithm}\label{sub:graphalgo}


% Algorithm overview
The review algorithm that produces the edges of a decision graph is outlined in Algorithm~\ref{alg:AlgoOverview}.
Akin to a segmentation algorithm~\cite{fulkerson_class_2009} that starts with an over-segmentation of an image,
  the identification graph starts with an empty set of edges, $G = (V, \{ \})$, so in essence each annotation
  starts by itself as an individual animal, but because there are no negative edges we cannot be confident that any
  pair of annotations should indeed be given different labels.
Therefore, the algorithm proceeds to prioritize and add positive edges that merge multiple annotations into the
  same PCC, negative edges that indicate that two annotations are different individuals, and incomparable edges
  that prevent two annotations from being labeled as positive or negative.
Throughout the main algorithm, the graph is maintained in a \emph{consistent} state, which means that each PCC
  has no internal negative edges.
Note, that while we describe the algorithm starting from an empty graph, without loss of generality, the edges in
  the graph can be initialized to reflect a previously known labeling.
Thus, the algorithm can address the case where nothing is known, new images are being added to an existing
  dataset, or multiple datasets are being combined.


\begin{algorithm}
        While the graph has not converged:
        \begin{enumln}
            \item Generate and prioritize candidate edges 
            \item Insert candidate edges into a priority queue 
            \item While the priority queue is not empty:
            \begin{enumln}
                \item Pop an edge from the priority queue
                \item Make a decision and add the edge to the graph
                \item If the edge causes an inconsistency drop into inconsistency recovery mode
                \item Update the priority queue based on the new edge
                \item If candidate edges require refresh, break
            \end{enumln}
        \end{enumln}
\caption[Algorithm Overview]{Overview of the graph identification review procedure}
\label{alg:AlgoOverview}
\end{algorithm}

The first step of the algorithm is to generate candidate edges and predict probability measures (positive,
  negative, or incomparable) for each candidate edge.
In the next step each edge is entered into a priority queue with a priority based first on its ability to be
  automatically reviewed and then on its positive probability.
Next, the algorithm enters a loop where the next candidate edge is selected, a decision is made about this edge
  --- either automatically (as much as possible) or by the user --- and it is added to the graph.
The algorithm proceeds toward convergence by removing candidate edges from the priority queue, either directly
  from the top of the queue or indirectly by eliminating candidate edges that are no longer needed.
A candidate edge is no longer needed when there are sufficient redundancies in the edge set within or between its
  PCCs.
A pair of PCCs is \emph{complete} when there are enough negative edges between them.

Each new edge addition could trigger two important events:
\begin{enumin}
    \item a \emph{merge} --- addition of a positive edge between different
      PCCs combines them into one PCC, and

    \item an \emph{inconsistency} --- addition of either a negative edge within a PCC or a positive edge between
      PCCs that already have a negative edge between them creates an inconsistent PCC.
\end{enumin}
Handling a merge is largely a matter of bookkeeping and can be done efficiently using a data structure that can
  dynamically maintain connected components~\cite{jacob_holm_poly_logarithmic_2001}.
Finding an inconsistency, however, drops the user into inconsistency recovery mode which alternates between
  hypothesizing one or more edges to fix and manually verifying these edges with the user until consistency is
  restored.

Finally, the outer loop of the overall algorithm allows the ranking algorithm to generate additional candidate
  edges --- this allows the ranking algorithm to take advantage of more subtle matches as the PCCs begin to form.
The priority queue will gradually be emptied as each PCC obtains a sufficiently redundant set of positive edges
  and enough negative edges to be complete.
%Ensuring completeness requires examining $O(|V|^2)$ edges, so in practice we
%  develop a learned probabilistic completeness measure.
%If sufficient training data is not available simple heuristics can be used to
%  terminate.

Details of each step in the review algorithm are described in the following sections.
First we describe the redundancy criterion in~\cref{sec:redun}.
Then we will define candidate edge generation in~\cref{sec:cand} and decision-making in~\cref{sec:decision}.
Inconsistency recovery is covered in~\cref{sec:incon}.
Finally, we describe the refresh and termination criteria in~\cref{sec:converge}.

% will be emptied when each PCC is sufficiently re
% Obtaining sufficient redundancy and completeness in order to empty the priority queue can,
% Ensuring all PCCs are complete leads to the need to ,
%  so to prevent this we develop a probabilistic measure (\cref{sec:converge})
%  that triggers much earlier convergence when positive edges are no longer
%  likely to be found.

\FloatBarrier{}
\section{POSITIVE AND NEGATIVE REDUNDANCY}\label{sec:redun}

In this section we define criteria that
(1) increases our confidence that PCCs are correct by enforcing a minimum level of redundancy and
(2) prevents edges that exceed this redundancy from being reviewed.
At a minimum each PCC must be a tree of positive edges, but when errors can occur, it's difficult to be confident
  that all nodes in the PCC are really annotations from the same individual.
By adding a redundant edge we either increase the confidence that other edges are correct or detect an
  inconsistency which can be resolved with the algorithm in ~\cref{sec:incon}.
However, the gains in confidence from adding each additional edge are diminishing.
Therefore, it is desirable to achieve a minimum level of redundancy, but once this has been achieved we should
  prevent additional redundant edges from being reviewed.
We formalize this minimum level of redundancy in two forms.
The first is for positive edges within PCCs and the second is for negative edges between PCCs.
\begin{enumln}

    \item positive-redundancy --- % 
        A PCC is $k$-positive-redundant if its positive subgraph is $k$-edge-connected (contains no cut-sets
          involving fewer than $k$ positive edges~\cite{eswaran_augmentation_1976}), or if the PCC has $k$ or fewer
          nodes and the union of positive and incomparable edges is a complete graph.

    \item negative-redundancy --- % 
        A pair of PCCs $C$ and $D$ is $k$-negative-redundant if there are $k$ negative edges between $C$ and $D$,
          or if there are $|C| \cdot |D|$ negative or incomparable edges between them.
          %or if both PCCs have fewer than $k$ nodes and there are at least $\mathop{max}(|C|, |D|)$ negative edges
          %between them.
        
    %which can be determined in $O(n_1 n_2)$ time.
    %(by looping over adjacency sets of nodes
    %in $C$ and performing set intersection with nodes in $D$ to get the edges
    %between $C$ and $D$).

    %$k$-negative-redundant if there are $k$ negative edges between them.
\end{enumln}
\kredun{}

The example in \cref{fig:kredun} illustrates different levels of redundancy.
To understand these criteria better, consider what it means for a PCC that has been determined to be
  $k$-positive-redundant to have an undiscovered error.
The error means that the PCC really should be split into (at least) two separate PCCs.
Suppose these PCCs correspond to animals $C$ and $D$.
If the combined PCC is $k$-positive-redundant then are $k$ separate undiscovered mistakes connecting $C$ and $D$,
  and there must also be no negative edges connecting $C$ and $D$.
This may be plausible if $C$ were identical twins, but these tend not to occur for species where the
  distinguishing markings (\eg{} hip and shoulder of zebras) are mostly random.
In other words, an error only becomes undiscoverable if a reviewer makes the same mistake with different
  annotations from the same individual $k$ times.
Note that $k$ can be different for positive and negative redundancy, but in our current implementation we use
  $k=2$ for both positive and negative redundancy.


\subsection{Checking redundancy}

We now describe how to check if a consistent PCC with $n$ nodes and $m$ edges is $k$-positive-redundant.
An PCC that contains a negative edge is inconsistent and is never considered as positive-redundant.
In the case where $n \leq k$, the PCC is positive redundant if all possible edges are either positive or
  incomparable.
This can be determined in $O(m)$ by checking if the sum of the number of positive and incomparable edges between
  the nodes in the PCC is equal to $\binom{n}{2}$.
Otherwise, in the more interesting case, when $n > k$, a PCC is positive-redundant iff it is $k$-edge-connected
  --- \ie{} it is impossible to disconnect the nodes in the PCC by removing any set of $k - 1$ edges.
In practice, we are primarily concerned with the case when $k=2$.
This special case of $2$-edge-connectivity is also called bridge-connectivity, and can be tested for in %
$O(n + m)$~\cite{eswaran_augmentation_1976,wang_simple_2015}.
The basic idea is to trace cycles encountered during a depth-first-search of the graph and mark all edges that
  are part of a cycle.
Any unmarked edge is not part of any cycle, and is called a bridge.
Removing any bridge edge would disconnect the PCC.
Thus, if no bridge exists then the PCC is $2$-edge-connected.

In the general case when $k>2$, edge-connectivity can be determined in $O(m n)$ amortized
  time~\cite{esfahanian_connectivity_2017}.
This involves first computing a small (not necessarily the smallest) dominating set.
A small dominating set can be computed using a greedy algorithm that starts with an empty set and iteratively
  adds an arbitrary node that is not in or adjacent to the current set until all nodes are in or adjacent to that
  set.
Then an arbitrary node in the dominating set is chosen.
The local-edge-connectivity is computed between the chosen node and every other node in the dominating set.
The local-edge-connectivity between two nodes is simply the maximum flow between those nodes, if all edge
  capacities are equal to $1$.
The edge-connectivity of the entire PCC is the minimum of
(1) the minimum degree of the PCC and
(2) the minimum computed local-edge-connectivity.


We now describe how to check if two PCCs $C$ and $D$ with sizes $n_1$ and $n_2$ are $k$-negative-redundant.
This can be done in $O(n_1 n_2)$ time using adjacency lists and set intersections to check if the number of
  negative edges between the nodes in $C$ and $D$ is greater than $k$.
For each node in $C$ we simply check if any node in $D$ is in the adjacency list (stored as a set) of that node.
The number of times this is true is the number of negative edges between $C$ and $D$.

%Whenever a positive edge is added inside a PCC, we can check positive-redundancy, and if it is, we can remove all
%  other edges inside that PCC from the priority queue.
%Similarly, when a negative edge is placed between two PCCs, we can check negative redundancy and, if it is
%  satisfied, remove all other edges between those PCCs.

Using this redundancy criterion we are able to find and remove edges from the priority queue that are no longer
  needed.
When a positive edge is added within a single PCC, we check for positive-redundancy.
If this passes, all remaining internal edges for that PCC may be removed from the priority queue.
When a negative edge is added between a pair of PCCs, we run the negative-redundancy check on the pair, and if
  this passes, all remaining edges between the PCCs may be removed from the priority queue.
When a positive edge is added between a pair of PCCs, the two PCCs are merged into a single new PCC $C'$, and the
  above negative-redundancy check must be run between $C'$ and all other PCCs having a negative edge connecting to
  $C'$.
It can be shown that if the graph is in a consistent state, that these are the only updates required.

As a final note, consider the case where a PCC is composed of two positive $k$-edge-connected subgraphs joined by
  a single positive edge.
While the entire PCC is not positive redundant, much of it is.
In this case, we do not need to review any edge within any $k$-edge-connected component of the PCC.
We can dynamically remove these from the priority queue by checking when a new edge popped off of the priority
  queue is within an existing PCC.
We can check if the local-edge-connectivity~\cite{esfahanian_connectivity_2017} --- \ie{} the maximum flow ---
  between the edge's endpoints is at least $k$.
If the local-edge-connectivity between those nodes is at least $k$, then that edge is part of a
  $k$-edge-connected component and can be ignored.


\subsection{Redundancy augmentation}\label{subsec:augredun}

In addition to determining if existing edges are redundant, it would be useful determine a small set of edges
  that would make an existing PCC positive-redundant or two PCCs negative-redundant.
Reviewing these edges would help expose any undiscovered errors in the graph.

To find edges that would complete positive-redundancy for a PCC, we compute a $k$-positive-augmentation.
This is equivalent to the problem of finding a minimum $k$-edge-augmentation.
When $k=2$ the problem is called bridge augmentation and can be computed $O(m)$~\cite{eswaran_augmentation_1976}
  as long as all edges in the complement of the PCC can be used in the augmentation.
However, if the PCC contains incomparable edges, then these edges cannot be used.
We can address this by using a weighted variant of the problem, setting the weights of the incomparable edges to
  $\infty$, and searching for a minimum cost augmentation.
However, the weighted variant of this problem is NP-hard, even for $k=2$, but can be approximated within a factor
  of $2$~\cite{khuller_approximation_1993}.
We make use of this algorithm later in~\cref{sec:cand}

To find a set of edges that would complete negative-redundancy for a pair of PCCs, we compute a
  $k$-negative-augmentation.
Computing this set of augmenting edges is trivial.
Initialize an empty augmenting set.
Iterate through all combinations of nodes between the two PCCs.
For each combination of nodes, if there is no existing edge between those nodes in the graph, add it to the
  augmenting set.
Once $k$ edges have been added to the augmenting set or there are no more node combinations, stop and return the
  augmenting set.
Note, that we do not use this algorithm in practice because we use a probabilistic termination criteria instead
  of enforcing that all pairs of PCCs are negative-redundant.

%If the PCC only contains positive edges, edge augmentation with the fewest edges can be computed in linear time
%  using~\cite{eswaran_augmentation_1976}.
%However, if some edges in the complement of the positive PCC edges are incomparable then we must compute a
%  minimum weight edge augmentation (which is NP-hard) using a $2$-approximation algorithm
% ~\cite{khuller_approximation_1993}.


\section{CANDIDATE EDGE GENERATION AND PRIORITIES}\label{sec:cand}

In this section we describe the first step of the algorithm where candidate edges are generated and then
  prioritized for review.
There many ways that candidate edges can be generated.
Different sets and orderings of candidate edges will impact different properties of the graph at different rates.
Therefore, we choose candidate edges to depend on what properties of the graph we want to manipulate.
In the context of identifying individual animals, the most obvious manipulation is to reduce the number of PCCs
  in the graph by adding positive edges between existing PCCs, thus merging them into one.
A less obvious property that we can manipulate is the fraction of PCCs that are positive redundant.
By increasing this fraction to $1$, we discover mistakes that have been made, which can be fixed using the
  algorithm in~\cref{sec:incon}.
This ultimately decreases the likelihood that any PCC contains a mislabeled positive edge.
In each iteration outer loop of our main algorithm, we alternate between first generating candidate edges to
  merge PCCs, and then generating candidate edges to ensure that all PCCs are positive redundant.

To generate candidate edges that may merge existing PCCs we use the LNBNN ranking algorithm
  from~\cref{chap:ranking}.
We issue each annotation as a query to the ranking algorithm, and form edges from the top results of the ranked
  lists.
We then assign a priority to each new candidate edge.
We use the pairwise algorithm from~\cref{chap:pairclf} to estimate the positive, negative, and incomparable
  probabilities of each edge.
Any edge whose maximum positive, negative, or incomparable probability is above the threshold for automatic
  decision-making is ranked according to this probability.
All other edges are ordered by their positive probability.
This ensures automatic decision-making is first, followed by an ordering of the edges needed for manual review
  that are most likely to be positive and therefore add the most information to the graph.
It is desirable to add positive decisions to the graph first because
(1) they are the most important edges with respect to determining the animal identities, and
(2) larger PCCs increase the number of edges that can be skipped using the redundancy criterion.

The first iteration of the outer loop the algorithm generates edges using the ranking algorithm.
Review of these edges continues until the priority queue is empty, or we determine that the candidate edges
  should be refreshed using the algorithm we will describe in~\cref{sec:converge}.
This ends the current inner loop, and as long as the algorithm has not converged, it proceeds to the next
  iteration of the outer loop.
In this next iteration, we generate candidate edges to ensure that all PCCs are positive redundant.

To generate candidate edges that will make PCCs positive redundant, we use the positive-augmentation algorithm
  from~\cref{subsec:augredun} to generate edges.
These edges are assigned priorities in the same way, however in this iteration the refresh criterion is disabled.
This enforces that if each edges is reviewed as positive, then all PCCs will be positive-redundant at the end of
  this inner loop.
However, sometimes an edge will not be reviewed as positive.
If it is reviewed as negative, then it will introduce an inconsistency and be handled by the algorithm
  in~\cref{sec:incon}.
If it is reviewed as incomparable, then the PCC will only be positive-redundant if all edges between the nodes in
  the PCC have been reviewed.
Because we want to ensure that all PCCs are positive-redundant, we delay alternating back to the ranking
  algorithm.
Instead, we simply regenerate candidate edges using the positive-augmentation algorithm in the next iteration of
  the outer loop until all PCCs are positive-redundant.
Therefore, at the end of this process all PCCs are positive-redundant by construction.

After we have ensured positive-redundancy, the outer loop alternates back to generating candidate edges using the
  ranking algorithm.
In this way our algorithm alternates between two modes, first searching for merges, and then ensuring redundancy.
This continues until the termination criterion that we will describe in~\cref{sec:converge} is satisfied.
At the end of this process it is guaranteed that all PCCs are consistent and positive-redundant.


\section{MAKING DECISIONS}\label{sec:decision}

Now that we have generated and prioritized a set of candidate edges we come to the core of the inner loop ---
  decision-making.
Because of the surrounding structure of the graph framework, this step is quite simple.
Given a popped edge from the priority queue, we check if any of the positive, negative, and incomparable state
  probabilities produced by the pairwise algorithm are above their automatic decision threshold (set externally as
  a hyperparameter).
If the edge cannot be automatically reviewed we issue a request for user feedback.
Once we have obtained feedback for an edge --- either automatically or manually --- the edge is added to the
  appropriate edge set.
After the new edge is added, we update candidate edge priorities discussed in~\cref{sec:redun}.
If the new edge causes an inconsistency, then we drop into inconsistency mode, which we will discuss
  in~\cref{sec:incon}.

For each decision we record a user-id to identify the reviewer or algorithm making the decision.
We also follow the approach of~\cite{branson_visual_2010} and store a user-specified categorical confidence value
  of unspecified, guessing, not-sure, pretty-sure, and absolutely-sure (with associated integer values $0$, $1$,
  $2$, $3$, and $4$).
The user-id allows us to differentiate between edges that were automatically reviewed from those that were
  manually reviewed.
While this is not directly used in this algorithm description, it enables a variety of possible post-processing
  techniques, \eg{} manually reviewing automatically reviewed edges between PCCs containing only two annotations.
However, the user confidence contributes to the edges weights in the error detection and recovery algorithm
  from~\cref{sec:incon}.


\section{RECOVERING FROM INCONSISTENCIES}\label{sec:incon}

In~\cref{sec:redun} we described a redundancy criterion that exposes errors by introducing inconsistencies.
In this section we describe an algorithm for fixing these errors and recovering from these inconsistencies.

Whenever a decision is made that either adds a negative edge within a PCC or adds a positive edge between two
  PCCs with at least one negative edge between them, the graph becomes inconsistent.
In both cases the result is a single PCC $C$ with internal negative edges.
The goal of inconsistency recovery mode is to change the labels of edges in order to make the subgraph formed by
  the nodes of $C$ and all of their edges consistent.
An inconsistency implies that a mistake was made, but does not necessarily determine which edge has the wrong
  label.
Therefore, we develop an algorithm to hypothesize the edge(s) most likely to contain the mistake(s) using a
  minimum cut.
If the hypothesis is correct, and all the labels on these edges were changed, then we show that the PCC would
  either become consistent or be split into multiple consistent PCCs.
Because the hypothesis might not be correct, we present these edges to a user for manual review, and if the user
  agrees with the hypothesis, the algorithm completes.
Otherwise, the new information received by the user it taken into account, and the hypothesis is recomputed.
An example of an inconsistent PCC with hypothesized edges is illustrated in \cref{fig:inconpcc}.

\inconpcc{}

\subsection{Hypothesis generation}
The procedure alternates between steps of generating ``mistake hypothesis'' edges, and presenting these to the
  user for review.
The ``hypothesis generation algorithm'' returns a set of negative edges or a set of positive edges, which if
  re-labeled as positive or negative respectively would cause $C$ to become consistent.
For simplicity, we focus first the case where $C$ contains exactly one negative edge.
It will not be hard to extend to the general case.


First consider a cut of $C$ that disconnects the endpoints of the negative edge.
If the labels on these edges were changed from positive to negative or incomparable, then the inconsistent PCC
  would be split into multiple consistent PCCs.
Alternatively, if the label on the negative edge was changed to positive or incomparable, the PCC would no longer
  be inconsistent.
Therefore, the algorithm starts by creating a minimum $s$-$t$-cut using the subgraph of $C$ containing only
  positive edges.
The endpoints of the single negative edge are the terminal nodes $s$ and $t$.
Unlike in our preceding discussion where the edges only have positive/negative/incomparable labels the edges will
  now have weights that reflect a measure of confidence in their current label.
In the case where the negative edge is correct, this will encourage the minimum cut to return the positive edges
  that are most likely to be mislabeled.

The weight of an edge in this cut problem is the sum of three values:
\begin{enumln}
    \item its positive probability previously computed using the pairwise classifier,

    \item the number of consecutive times that edge was manually reviewed with
        its current label, and

    \item an integer ranging from $0$ to $4$ indicating the confidence of the most recent review (see
      \cref{sec:decision}).
\end{enumln}
Because we are using a minimum cut, the algorithm will find the lowest confidence cut set of these edges.
The higher each of these values is, the more likely that the edge is correctly labeled.
The positive probability offers a baseline estimate of this confidence.
Using the number of manual reviews means that if a user disagrees with a hypothesis, the weight on that edge will
  be increased and the algorithm will be encouraged to select a different edge.
The confidence performs a similar role, speeding up the pace at which the algorithm tries different edges.

Using this weighting scheme we find the minimum $s$-$t$-cut, which returns a cut set of positive edges.
We now need to decide if it is more likely that the positive cut-edges should be relabeled, or the negative edge
  should be relabeled.
We compare the total weight of cut positive edges with the weight of the negative edge (weighted using the same
  scheme).
If the positive weight is smaller, the algorithm suggests that the cut positive edges should be relabeled as
  negative.
Otherwise, it suggests that the negative edge should become positive.
%The minimum cut returns a set of edges that disconnects the terminal nodes.
%In other words, if the label on these edges is changed from positive to negative or incomparable, then $s$ and
%  $t$ would no longer be part of the same PCC and the inconsistency would be removed.

\subsection{Generalization}
So far, we have only considered the case when only one negative edge exists in a PCC.
%In a restricted scenario where each edge is reviewed exactly to the specifications of the algorithm, then it is
%  only possible for one negative edge to exist in a PCC when $k=2$.
However, in practice it is possible for multiple negative edges to exist within a PCC.
In this case we can slightly modify the inconsistency recovery algorithm.
%to apply to the case when there are any number of
%  negative edges between the nodes of a PCC.
The modification is simple.
Instead of using a minimum $s$-$t$-cut, we use a multicut to disconnect the endpoints in all pairs of negative
  edges.
Multicut is NP-hard, but a simple approximation algorithm is to perform a minimum $s$-$t$-cut for each set of
  terminal nodes, and then take the union of the resulting cut~\cite[203--208]{vazirani_approximation_2013}.
When considering which edges to return we consider all negative edges together, comparing the sum of their
  weights to the sum of the positive edge weights.

\subsection{Hypothesis review}
The user iterates through each hypothesis edge and chooses
(1) to agree with the hypothesis and change the label of the edge, or
(2) to disagree with the hypothesis and keep the edge label.
In the case where the user agrees that a positive label should be changed, it does not matter if the label is
  changed to negative or incomparable, it will still remove the positive connection.
A similar argument is true when the user agrees to change a negative label to either positive or incomparable;
  either way, the negative edge between the nodes in the PCC is removed.
As long as the user agrees with the hypothesis and the hypothesis set is non-empty, the iteration continues.
If there are no more hypothesis edges, then either all inconsistencies were removed or all PCCs were split into
  multiple consistent PCCs.
In the case where the reviewer disagrees with the algorithm, the new review is recorded and we generate a new set
  of hypothesis errors.
Because reviewing the edge increases its weight, the algorithm will be forced to look elsewhere for a cut.
This alternation between hypothesis generation and user review repeats until the reviewer agrees with all
  hypothesis edges, which means that all inconsistencies have been eliminated

Fixing inconsistencies can result in splitting $C$ into multiple PCCs.
This may invalidate implicit reviews inferred from redundancy either within or incident to this subgraph.
Therefore, we recompute positive-redundancy within each new PCC, ignoring edges where the criterion is satisfied
  and re-prioritizing unreviewed edges where it is no longer valid.
A similar process happens for negative-redundancy between each pair of new PCCs as well as between each new PCC
  and all other PCCs previously negative-redundant with $C$.

\subsection{Implementation details}
While it might be conceptually simpler to think of inconsistency recovery as a separate mode, in practice it is
  actually integrated as part of the algorithm's inner loop.
The algorithm dynamically maintains a list of all PCCs that contains errors.
We disable redundancy checks for any PCCs that contain errors.
Hypothesis edges are computed whenever a new inconsistency is created, and these edges are entered into the queue
  with their normal priority plus $10$ to ensure they are reviewed first.
It can be shown that as long as the user agrees with the current hypothesis edges computed so far, recomputing
  the new hypothesis edges will always be the same as the remaining hypothesis edges.
This implementation is an important first step for generalizing the graph algorithm into a distributed setting
  with multiple reviewers.


%\section{Refreshing candidate edges}\label{sec:refresh}


%The goal is to refresh if there has been a significant number of positive reviews, but new results are consistently
%negative. If we have not found any positive edges then we do not want to refresh. We keep track of the fraction of
%positive review decisions as a moving average of manual decisions. We also maintain the total number of positive reviews
%made since the last candidate edge generation. Thus the candidate edges are refreshed whenever the number of positive
%reviews is above a threshold and the positive review fraction is below a threshold

%As the last outer iteration of the overall algorithm before convergence, triggered when the LNBNN ranking algorithm
%fails to produce positive edges, candidate edges between untested pairs of annotations are added within PCCs that are
%not positive-redundant and between PCCs that are not negative-redundant. This is because the ranking algorithm itself is
%imperfect and the missed matches tend to affect small PCCs disproportionately, which are the last to satisfy redundancy
%tests.


%\subsection{Probabilistic convergence}
%The goal of probabilistic convergence is to determine if a PCC $C$ is negative-redundant with all other components with
%high probability. When all components are positive-redundant and satisfy this, then all edges will be removed from the
%priority queue and the algorithm will converge. We consider the probability $\Pr{E_c \given \nT_C}$ that an undiscovered
%positive edge exists ($E_C$) given $C$'s existing set of outgoing negative edges ($\nT_C$). Under mild conditions (if we
%assume that $\Pr{E_c \given \nT_C} < 0.5$), we can show that the probability $\Pr{\nT_C \given E_C}$ of observing the
%negative edges bounds this p given that an undiscovered match exists can be used as a surrogate. We can learn this
%probability offline by measuring the frequency that correct results are at a given rank in a PCC's ranked list
%(constructed by aggregating the ranked lists of all annotations in the PCC).

%To predict $\Pr{\nT_C \given E_C}$ we issue all queries as a single LNBNN query to obtain a single ranked list for the
%entire PCC. This can be done by treating all query descriptors as if they were the from the same annotation except
%during the spatial verification stage. Let $R_C$ denote the ranks of every PCC marked as negative with $C$. In an
%offline step we learn a probability mass function $\phi$ that predicts the probability that a correct match appears at a
%given rank for the PCC $C$. The predicted probability is %
%$\Pr{\nT_C \given E_C} = 1 - \sum_{r \in R_C} \phi(r)$.

%To learn $\phi$, we measure the probability that a correct match appears at a given rank, given a correct match exists.
%To do this initialize an histogram. For each $C$ in the training set, divide it into a query $C_q$ and target $C_t$. The
%target and the rest of the PCCs in the training set become database PCCs. Use LNBNN to score each annotation in $C_q$
%against the database PCCs. Determine the best rank that $C_t$ appears in each ranked list, and increment the
%corresponding index in the histogram. Repeat this process for all PCCs in the training set and for multiple partitions
%of each PCC. Normalizing the histogram array results in the PMF $\phi$. In order to prevent marginalization across
%important attributes (such as the number of exemplars in a PCC), construct multiple PMFs for different numbers of
%exemplars in a query.


\section{REFRESH AND TERMINATION CRITERIA}\label{sec:converge}


In this section we discuss the criteria we use for both determining when to refresh candidate edges and when to
  stop the algorithm altogether.
We first consider the need for a refresh criterion.
As the review algorithm proceeds, we should only continue to manually review edges as long as the algorithm is
  consistently generating edges that --- once reviewed --- change the PCCs.
This happens whenever a review merges two PCCs into one or splits one PCC into two.
Because splits only happen during inconsistency recovery, we are primarily concerned with searching for new
  positive edges.
However, at some point the candidate edges may no longer contain positive results, but undiscovered positive
  matches may still exist.
This is because LNBNN, working initially with each annotation having a separate label, can miss more subtle but
  correct matches, especially when there are several annotations for an individual with multiple viewpoints.
As the labeling improves, so does the reliability of LNBNN.
Recall that the verification algorithm is only run after candidate edges are generated.
If the edges required to complete the underlying real PCCs are missing from the candidate set, then nothing can
  be done.
We therefore must develop a ``refresh criterion'' to determine when to stop and replace the current priority
  queue with new LNBNN matches.
This will be done by predicting if none of the remaining edges would change the PCCs, and then recomputing
  candidate edges when this happens.
However, before we describe this process, we consider the problem of termination, which will turn out to have a
  similar solution.

Similar to the refresh criterion, we must be able to determine when the algorithm should stop.
To ensure that the identification is perfect, the algorithm would need to use all $\binom{|V|}{2}$ edges as
  candidates and then terminate using the deterministic convergence criterion explained in the introduction of this
  chapter.
Recall that the deterministic convergence criterion only stops the algorithm once each PCC is positive redundant
  and each pair of PCCs is negative-redundant.
This essentially results in a brute-force search, that requires $O(|V|^2)$ reviews, and is only feasible if
(1) the number of annotations is very small, or
(2) all edges can be automatically reviewed.
However, even if all edges can be automatically reviewed the quadratic computation required to run the
  verification algorithm on all pairs of annotations might be too computationally expensive for very large
  databases.
Therefore, in practical circumstances, we turn towards probabilistic methods to determine when to stop.
Like, the refresh criterion, this can be determined --- in part --- by predicting if new reviews will change the
  PCCs.

\subsection{Convergence as a Poisson process}

%\newcommand{\meaningful}{meaningful}
%\newcommand{\meaningful}{label-changing}

Both the review and termination criteria can be addressed by considering the question:
``Will there be a label-changing review anytime soon?''.
A \emph{label-changing review} is one that changes the name labeling of the annotations, \ie{} it is either
  a positive edge that merges two PCCs or a non-positive edge that splits one.
Correct and label-changing reviews improve the accuracy of the identification by increasing the similarity --- in
  terms of the name labeling --- between the predicted PCCs and the real underlying PCCs.
However, producing a label-changing review has a cost:
the number of manual reviews since the last label-changing review.
During the inner loop of the algorithm, when edges popped from the priority queue no longer consistently result
  in label-changing reviews, the marginal gains in identification accuracy that could be made from continuing are
  outweighed by the cost of manual review.
In this circumstance it is best to break out of the inner loop.
Note that if any label-changing reviews were made during that loop, we should refresh candidate edges and start a
  new loop because a refresh could result in new high priority label-changing edges.
On the other hand, if no label-changing reviews were made in the loop, then refreshing will have no benefit, and
  the algorithm should terminate.

Thus, the task is to construct a criterion that determines when edges on the top of the priority queue are no
  longer label-changing.
In this way we directly address the refresh criterion and indirectly address termination criterion.
This is direct in the case of the refresh criterion because when the name labeling is more accurate, the LNBNN
  ranking will improve.
When we directly measure that the next reviews in the current priority queue are unlikely to be label-changing, and
  the name labeling of the decision graph has changed, then it is more likely that we would review a label-changing
  edge on the top of a new set of candidate edges sorted by priority.

This task indirectly addresses the termination criterion because instead of stopping once the probability that
  identification is complete is high, the algorithm simply stops when the cost in terms of manual labor is too
  high.
However, if we were to directly address the termination criterion, we would have to estimate the probability that
  undiscovered merge and split cases exist.
This probability depends on the effectiveness of the ranking algorithm.
Even if our estimate of this probability was perfect, once the ranking algorithm starting producing label-changing
  reviews at a rate no better than random edge generation, it would take an enormous amount of manual effort to
  push this probability passed a desired threshold.
Thus, instead of providing guarantees about identification accuracy our termination criterion achieves a
  trade-off between the number of manual reviews and the cost of identification.

Based on these observations we estimate the probability that ``there will be a label-changing review soon''.
We define ``soon'' using a patience parameter $a$, defined as the maximum number of consecutive reviews that a
  manual reviewer is willing to do between label-changing reviews.
Let $L\teq1$ be the event that a review is label-changing and $L\teq0$ otherwise.
Because reviews are ordered, denote if the $i\th$ review is label-changing as $L_i$, and denote the index of the
  next review as $n$.
Let  %
%$C\teq1 \equiv (\M_i\teq1, \exists i \in [n, n + a])$ 
$C = \Or_{i=n}^{n+a} L_i$
%
be a binary random variable that takes the value $C\teq1$ in the event that any of the next $a$ reviews will be
  label-changing.
Thus, the aforementioned question can be addressed by measuring the probability of the event $C\teq1$.
%The goal is to measure the probability the event $C\teq1$.
We can periodically check if $\Pr{C\teq1}$ is less than a threshold, and if so, we stop the current loop and
  either refresh or terminate.

We model the event $C\teq1$ as a Poisson processes, but for this to be appropriate, $L_i$ must follow a uniform
  distribution.
This would be true if the edges were reviewed in a random order.
However, the priority queue orders edges more likely to be label-changing first, causing $L_i$ to follow a right
  skewed long tail distribution and violate Poisson assumptions.
Even so, the use of a Poisson model can be justified by considering a sliding window along the distribution of
  $L_i$.
Recall that we only need to make predictions about the next $a$ reviews in the future, thus we are only concerned
  with a small window to the right on the distribution.
Assuming the long-tailed distribution is monotonic decreasing, we can use a small window in the past to estimate
  an upper bound on probability of $C\teq1$ in the future.
As the window moves to the right, the interval on the distribution becomes increasingly approximately uniform and
  the tightness of the bound improves and eventually becomes tight.
This is because the order of the remaining reviews becomes random once the prioritization algorithm cannot
  distinguish positive from negative cases.
In this case the Poisson model becomes appropriate.
Thus, the use of a Poisson model with a sliding window allows us to approximate an upper bound on $\Pr{C\teq1}$,
  and the smaller $\Pr{C\teq1}$ is, the more accurate our estimate will be.


\subsection{Details of Poisson convergence}

Having justified its use, we model $C$ as a Poisson process, which is determined by a rate parameter $\mu$ and an
  interval length parameter $a$.
We can measure $\mu$ as the fraction of recently observed manual reviews that were label-changing using an
  exponentially weighted moving window.
We initialize $\mu_0=1$ to denote that it is likely that the first review will be label-changing and because it
  will ensure our estimated probability is an upper bound.
Then, after each new review we update the parameter as %
$\mu_{i + 1} \leftarrow \ell_i \alpha + (1 - \alpha) \mu_{i}$, where $\ell_i\teq1$ if the $i\th$ review was
  label-changing and $0$ otherwise.
The exponential decay $\alpha = 2 / (s + 1)$ is determined by a span parameter $s$, which roughly represents the
  number of previous reviews that are significant.
Using this model, the desired probability that any of the next $a$ reviews will be label-changing is %
$\Pr{C\teq1} = 1 - \exp{-\mu  a}$.
The example in \cref{fig:poisson} illustrates the behavior of the criterion using a synthetic dataset.
In this example we use a window span of $s=20$, a patience of $a=20$, and a threshold of $\tau=0.135$. \footnote{
    % ---
    A better choice would be to set these parameters such that % 
    $\tau = 1 - \exp{-a \left(s - 1\right)^{a} \left(s + 1\right)^{- a}}$ is satisfied.
    This will guarantee convergence after a maximum of $a$ consecutive non-label-changing reviews.
    Unfortunately, this was discovered after the completion of this \thesis{}, so the parameters in our
      experiments do not satisfy this property.
    Suggested values for future work are $s=20$, $a=72$, and $\tau=0.052$.
    % ---
    %$\tau = 1 - e^{- a \left(1 - \frac{2}{s + 1}\right)^{a}}$
}


%To achieve a certain threshold with a specified span the number of iterations required is: 
%$\lceil{\frac{\log{\left (t \right )}}{\log{\left (\frac{s - 1}{s + 1} \right )}}}\rceil$

%If we set the threshold to $\exp{-2}\approx0.135$, then the window span parameter also roughly represents how
%  many successive consecutive non-label-changing decisions are necessary before reaching the threshold if starting from
%  initial state.

\begin{comment}
Another alternative might be measuring once the advantages from the ranking
algorithm become indistinguishable from a brute force search.


Let $D = \binom{|V|}{2}$ be the number of edges in the dataset and let $N=\sum_{C \in \set{C}} (|C| - 1)$ the
  number of edges in the MSTs of all PCCs.
We can measure the density of meaninful reviews in the current labeled dataset as $N/D$.

After our estimated mu gets close enough to $N/D$, we can terminate.

\end{comment}

\poisson{}

%As the last outer iteration of the overall algorithm before convergence, triggered when the LNBNN ranking algorithm
%fails to produce positive edges, candidate edges between untested pairs of annotations are added within PCCs that are
%not positive-redundant and between PCCs that are not negative-redundant. This is because the ranking algorithm itself is
%imperfect and the missed matches tend to affect small PCCs disproportionately, which are the last to satisfy redundancy
%tests.

%Re-estimating the $k\mu$ at each time-step should


\section{EXPERIMENTS}\label{sec:graphexpt}

    In this section we design an end-to-end experiment where we start with a set of annotations without any name
      labels, and then we proceed to construct all the names.
    This simulates identification events like the \GZC{} and provides insight into how the algorithms behave in
      practice.
    Because our algorithms are semi-automatic, we simulate a noisy user response using \groundtruth{} data.
    Given a pair of annotations, the simulated user returns the \groundtruth{} classification $99\percent$ of the
      time, making errors $1\percent$ of the time uniformly at random.
    These assumptions may be simple and too inaccurate to model an expert reviewer, but it will serve to
      demonstrate our graph algorithm's ability to recover from errors.
    We will measure the simulation's accuracy and error as a function of the number of manual reviews.
    Using these measures we will compare our graph algorithm to alternative techniques to demonstrate that it
      produces accurate identifications with fewer errors using significantly fewer manual reviews.

    For this experiment, we will use the plains and Grévy's zebras datasets previously described in
      \cref{sub:datasets}.
    Because some of our algorithms will require training, we split each dataset into a training set and a testing
      set each containing half of the names.
    The training set will be used to learn the pairwise classifiers and determine thresholds for automatic
      classification.
    The details of training and testing sets are summarized in \cref{tbl:TestTrainDBStats}.
    \TestTrainDBStats{}
    \FloatBarrier{}

    The experiment will compare the three different methods of determining the identities of annotations based on
      the algorithms defined in this \thesis{}.
    These algorithms are:
    \begin{enumin}

    \item \pvar{ranking} --- the ranking algorithm from \cref{chap:ranking}, 

    \item \pvar{rank+clf} --- the same
        ranking algorithm but augmented with the automatic classification algorithm from \cref{chap:pairclf}, and
        finally

    \item \pvar{graph} --- the graph identification algorithm introduced in this chapter.

    \end{enumin}
    In the case of \pvar{graph}, the procedure from~\cref{sub:graphalgo} can be directly applied.
    However, in order to compare \pvar{graph} to \pvar{ranking} and \pvar{rank+clf} we must define baseline
      methods to determine the ordering of the reviews and when to stop.
    Details of these identification procedures are given in the next subsection.

    \FloatBarrier{}
    \subsection{Identification procedures}

    We design the procedure for \pvar{ranking} to be similar to the approach described in \cref{sec:introgzc}
      that was used in the \GZC{}.
    This algorithm does not require a pre-training phase and thus only the testing set is used.
    Given the unlabeled annotations, we index the database, issue each annotation as a query, and collect the top
      $5$ results from each ranked list.
    The resulting pairs of query and database annotations are stacked and sorted by LNBNN score.
    The user reviews each pair in the list sequentially.
    As was done in the \GZC{}, all pairs are all manually considered and verified, and all inconsistencies are
      ignored.
    In the \GZC{} the choice to re-run the ranking algorithm was made manually, making it difficult to reproduce.
    Therefore, we simplify the experiment by choosing to run the ranking algorithm once.
    Consistency checks are not applied because developing these is the point of the graph algorithm.
    %When designing this protocol we considered alternatives

    %This is primarily due to simplicity.
    %--- in part for simplicity and in part because it
    %  does not change how we interpret the results.
    %We also do not apply the consistency checks used in during the \GZC{} because these were mainly based on
    %  heuristics that are difficult to reproduce and splitting names is difficult without connectivity
    %information.

    The procedure for \pvar{rank+clf} is similar to the one used for \pvar{ranking}.
    The main difference is that we use the automatic classification algorithm to predict pairwise probabilities
      for each pair of top ranked query and database annotations.
    If the probabilities of a class are above a threshold, then the pair is automatically reviewed.
    This has the effect of reducing the total number of manual reviews.
    The rest of the procedure is unchanged.
    The pairwise classifier is trained on the training set, and the algorithm is evaluated on the test set.
    We choose automatic thresholds by finding the thresholds that result in a specified false positive rate on a
      validation dataset.
    Because \pvar{rank+clf} has no mechanisms for error recovery, we choose conservative automatic thresholds by
      specifying an acceptable false positive rate of $0.001$.
    This results in positive, negative, and incomparable thresholds of $0.976$, $0.991$, and $0.5$ respectively
      for plains zebras, and $0.997$, $0.998$, and $1.0$ for Grévy's.
    %When choosing thresholds we enforce a warm-up period of $200$ examples, where the threshold is set to $1$.
    %This ensures that we do not automatically classify categories (\eg{} incomparable) with little supporting
    %  data.
    % GRAPH THRESH
    %We choose a false positive rate of $0.001$.

    %Because this approach has no mechanism for error recovery, we choose conservative classification thresholds
    %  that achieve a $0\percent$ false positive rate on a validation dataset.

    Finally, we quickly recap the procedure for \pvar{graph} which was defined in~\cref{sub:graphalgo}.
    The algorithm begins by using LNBNN to search for candidate edges, which are then assigned probabilities and
      inserted into a priority queue based on these probabilities.
    As candidate edges are removed from the queue they are either automatically or manually classified based on
      probability thresholds.
    Connectivity information is used to enforce a minimum level of redundancy, to prevent extraneous redundancy,
      and to ensure consistency.
    The convergence criterion determines when candidate edges should be refreshed and when the algorithm should
      terminate.
    We use the same pairwise classifier from \pvar{rank+clf}, but due to the graph algorithm's error recovery
      mechanisms we can choose more aggressive thresholds.
    However, we have found that the algorithm is sensitive to this parameter, therefore we only slightly increase
      the acceptable false positive rate from $0.001$ to $0.0014$.
    % GRAPH THRESH
    This results in positive, negative, and incomparable thresholds of $0.969$, $0.986$, and $0.5$ respectively
      for plains zebras, and $0.989$, $0.992$, and $1.0$ for Grévy's.
    For the convergence criterion we use a patience of $a=20$, a window span of $s=20$, and a termination
      threshold of $\tau=0.135$.
      %of $\exp{-2}\approx0.135$, which causes the criterion to trigger after at most $20$ consecutive reviews that
    %are not label-changing.

    \subsection{Results}

    We run the simulation for all combinations of datasets and algorithms.
    During the simulation, after each review decision is made we record two measurements pertaining to accuracy
      and error.
    The accuracy measurement is the number of merges remaining --- \ie{} the number of PCCs that must be merged
      --- before all individuals have been identified.
    This is the number of edges in a spanning forest of the \groundtruth{} positive subgraph minus the same
      measurement but applied to the subgraph of all correctly predicted positive edges.
    % better error measure?
    The error measurement is the total number of edges with a predicted label that differs from the
      \groundtruth{} match-state.
    Additionally, for \pvar{graph}, after each review decision we record the probability of convergence estimated
      by the convergence criterion.
    These measurements are plotted against the number of manual reviews.
    \Cref{fig:Simulation} shows these accuracy and error plots, and \cref{fig:Refresh} shows the convergence
      criterion.

    \Simulation{}

    \Refresh{}

    The results illustrated in \cref{fig:Simulation} demonstrate that \pvar{graph} achieves the fewest manual
      reviews with the fewest errors while still correctly identifying almost all individuals.
    We first focus on the left part of this figure.
    The number of remaining merges slowly decreases for \pvar{ranking}, which is the only algorithm that requires
      manual review of each pair.
    For \pvar{graph} and \pvar{rank+clf} the initial decrease appears instantaneous due to the automatic
      classification algorithm, which does not cost manual reviews.
    Notice, once manual reviews begin the slope of \pvar{graph} is steeper than \pvar{rank+clf} due to the
      redundancy mechanisms, which removes extraneous reviews.
    Furthermore, while \pvar{rank+clf} and \pvar{ranking} continue until their candidate edges are exhausted,
      \pvar{graph} uses the convergence criterion to terminate shortly after the curve flattens.
    It is noteworthy that \pvar{graph} completes identification using less than $25\percent$ of the manual
      reviews needed by $\pvar{ranking}$, which models the way that we counted individuals in the \GZC{}.

    We now turn our attention to the right of \cref{fig:Simulation}.
    The \pvar{ranking} and \pvar{rank+clf} algorithms do not have mechanisms for error recovery, and thus their
      error steadily increases over time.
    However, \pvar{graph} is able to recover from many of these errors and achieve a low error rate despite
      starting with more errors due to an aggressive auto-classification threshold.
      
    \subsection{Error cases}

    To further analyze the predictions of the graph the simulation, we compare the node groupings of the
      predicted PCCs to the real \groundtruth{} PCCs.
    For convenience, we will still refer to a group of nodes as a PCC.
    %Even though we are only concerned with the grouping of nodes
    In this analysis we categorize groups of predicted PCCs and their corresponding real PCCs as one of three
      types:
    correct, split, or merge.
    Consider an example where we are given a graph with $8$ nodes and the real PCCs are %
    $\curly{\curly{a, b, c}, \curly{d}, \curly{e, f}, \curly{g}}$, and we predict the PCCs %
    $\curly{\curly{a, b}, \curly{c}, \curly{d, e, f}, \curly{g}}$.
    Using this example we define the three group types:
    %correct, split We will categorize predictions as one of three types.
      %in terms of the nodes that belong to each group.
    %In this analysis we are only interested in the grouping of nodes, and we ignore the details of the edge
    %  labeling.
    %To analyze our predictions we define $3$ types of groups.
    \begin{enumln}
        \item A ``correct'' group is a predicted PCCs that contains all nodes in a real PCC.
        In the example the PCC $\curly{g}$ is a real group.
        %For a predicted PCC to be correct there must be a real PCC with the same nodes.
        
        \item A ``split'' error group consists of multiple real PCCs that are labeled as belonging to the same
          predicted PCC.
        Each split group contains at least one edge incorrectly labeled as positive.
        In the example, the real PCCs $\curly{\curly{d}, \curly{e, f}}$ are a split group because in the
          predicted graph $\curly{d, e, f}$ is a PCC.
        %Thus split groups are created when an edge is incorrectly labeled as positive.
        %A split cases is caused by when two real PCCs are connected.
        %This results in two real PCCs that are connected.
        %because that edge connects two real PCCs.
        %two nodes in two real PCCs are connected by
      
        \item A ``merge'' error group consists of multiple predicted PCCs that must be merged into a single real
          PCC.
        Each pair of predicted PCCs in a merge group is missing a positive edge.
        This can happen if an edge is incorrectly labeled, if the ranking algorithm never generates this edge as
          a candidate, or the termination criterion stops the algorithm before the edge can be reviewed.
        In the example, the predicted PCCs $\curly{\curly{a, b}, \curly{c}}$ are a merge group because in the
          real graph $\curly{a, b, c}$ is a PCC.
        
        %that would have connected 
        %A merge cases is caused by the absence of a positive edge that would have connected two nodes in a real
        %  PCC.
    \end{enumln}
    As defined, the split and merge error groups do not cover all cases.
    We call the previously described cases ``pure'' split/merge error groups because there is a mapping between
      sets of real/predicted PCCs to exactly one predicted/real PCC.
    However, consider the real PCCs $\curly{\curly{x, y}, \curly{z}}$ and predicted PCCs $\curly{\curly{x},
      \curly{y, z}}$.
    This case cannot be cleanly categorized as a pure split or a merge group, but it contains elements of both.
    Thus, we call this a ``hybrid'' case.
    A slight modification to the above groups can incorporate hybrid cases.
    %First, we change the definition of a split group to be a maximal set of at least two non-empty maximal
    %  subsets of real PCCs where all annotations in those subsets are labeled as belonging to the same predicted
    %  PCC.
    First, we modify the definition of a split group.
    For a predicted PCC containing annotations in more than one real PCC, the split group consists of the subsets
      of each of these real PCCs that intersect with the predicted PCC.
    In essence this means we treat $\curly{\curly{y}, \curly{z}}$ as a split group, even though $\curly{y}$ is
      only a subset of the real PCC $\curly{x, y}$.
    Second, for merges, we simply change predicted PCCs to be the predicted PCCs after they have been split.
    This lets us treat $\curly{\curly{x}, \curly{y}}$ as a merge group.
    %In or graph simulation almost all error groups are either pure splits or pure merges.

    %In our analysis we will treat hybrids similar to split and merge cases. 
    %However, for the simplicity of analysis we will group hybrid cases with 
    %We can address a hybrid case by first splitting hybrid groups into maximal subsets of real PCCs and then
    %  merging these subsets into real PCCs.

    %We analyze statistics of these groups and present the results in two tables.
    To analyze statistics of these groups, we gather the set of real PCCs and the set of predicted PCCs for each
      simulation.
    Then, we group them into correct, split, and merge groups.
    For each group type we count the number of predicted PCCs and the number of real PCCs they correspond to.
    We also measure the average number of annotations in each PCC.
    These number are reported in~\cref{tbl:ErrorSizeDetails}.

    From this table we observe that \pvar{graph} predicts the most correct PCCs and fewest splits in all cases.
    This is not surprising due to its error recovery mechanism.
    However, we also see that \pvar{graph} also has the most merge cases.
    The reason for this is that the other algorithms do not have termination criteria.
    They continued to review pairs, even long after the rankings were no longer consistently label-changing, and
      the simulated reviewer essentially began to brute-force search the database.
    In the thousands of extra iterations, the other algorithms managed to find ${\sim}20$ extra merge cases that
      were not discovered by \pvar{graph}.

    We continue our analysis focusing only on the results of \pvar{graph}.
    We measure the average number of PCCs in each error group.
    Additionally, we consider the smallest and largest PCC in each error group, and we measure the average number
      of annotations in each.
    These numbers are given in ~\cref{tbl:ErrorGroupDetails}.

    Here, we notice that almost all merge cases for \pvar{graph} are the result of failing to match a single
      annotation to a larger group of annotations.
    Similarly, most split cases involve splitting just one singleton annotation off of a larger PCC.
    To gain further insight into what is causing these errors we will visualize individual cases.
    %Mistakes involving individuals with only one annotation are more difficult to detect and highlight the
    %  importance of obtaining multiple images of each individua.
    %These error measures are reported
    %in~\cref{tbl:ErrorSizeDetails,tbl:ErrorGroupDetails}.

    %\SimDetails{}

    %\SimErrorSize{}
    \ErrorSizeDetails{}

    \ErrorGroupDetails{}

    \FloatBarrier{}

    We illustrate several individual error cases from \pvar{graph}
    in~\cref{fig:SplitErrorsPZ,fig:SplitErrorsGZ,fig:MergeErrorPZA,fig:MergeErrorPZB,fig:MergeErrorGZA,fig:MergeErrorGZB}.
    On the top of each figure we will show the subgraph corresponding to an error group.
    For a split case this will be the single PCC that must be broken apart, and for a merge case this will be
      multiple PCCs that should be connected.
    For merge cases, if no edge connecting the PCCs was ever added to the priority queue as a candidate, then we
      will insert a dashed edge between two arbitrary annotations.
    We will highlight the edges with labels that differ from their \groundtruth{}.
    On the bottom we show the annotation pair corresponding to a selected highlighted edge.

    Upon inspection, we discover that the split cases in~\cref{fig:SplitErrorsPZ,fig:SplitErrorsGZ} are caused by
      \groundtruth{} errors.
    In each of these cases the automatic verification algorithm made the PCCs positive redundant, thus the
      simulated reviewer --- which is driven by the \groundtruth{} --- was unable to split the PCC.
    In fact, we discovered that all split cases measured for \pvar{graph} are due to \groundtruth{} errors.
    This means that \pvar{graph} did not predict any PCCs that were split cases.
    %This is encouraging because it means that

    When inspecting merge cases, we also found several caused by \groundtruth{} errors, but most were due to
      challenging image conditions such as viewpoint, occlusion, and pose.
    These factors either prevented an edge from being generated as a candidate or caused the classifier to
      produce a low positive probability.
    Two merge cases for plains zebras are illustrated in ~\cref{fig:MergeErrorPZA,fig:MergeErrorPZB}, and two for
      Grévy's zebras are illustrated in~\cref{fig:MergeErrorGZA,fig:MergeErrorGZB}.

    %all split
    %  cases for the graph algorithm are due to \groundtruth{} errors.
    % %$5$ split cases reported for plains zebras 
    %Two examples of these cases are shown in \cref{fig:SplitErrorsPZ,fig:SplitErrorsGZ}

    \SplitErrorsPZ{}

    \SplitErrorsGZ{}

    \MergeErrorPZA{}

    \MergeErrorPZB{}

    \MergeErrorGZA{}

    \MergeErrorGZB{}

 
\FloatBarrier{}
\section{SUMMARY OF GRAPH IDENTIFICATION}\label{sec:graphconclusion}

In this chapter we have described the final component in our approach to animal identification.
The graph-based framework takes advantage of the algorithms previously developed in
  \cref{chap:ranking,chap:pairclf} and uses them in a principled way to address the identification problem.
The ranking algorithm quickly identifies candidate edges that are likely to be label-changing, and the pairwise
  classifier automatically verifies pairs of annotations, reducing the amount of required manual interaction.
While these algorithms can be applied to address identification directly, our experiments have demonstrated that
  there is a considerable advantage to placing them in the context of the graph identification framework.
Specifically, we have simulated an identification event similar to the \GZC{} using our ranking and graph
  algorithms.
We have shown that using the graph framework can reduce the number of manual reviews needed to complete
  identification by a factor of $4$.
Furthermore, the graph framework reduces the number of errors and can recover from them when they occur.
The only false positives (split cases) produced using the graph framework were the result of errors in the
  \groundtruth{}.

%and shown that identification events like the $\GCZ$ can
%  be performed using less than $25\percent$ of the number of manual reviews required by our original algorithm.


Our graph framework uses the connectivity of positive, negative, and incomparable edges in order to quickly
  converge on a final identification.
Edge connectivity is used to enforce that the graph contains a small exact amount of redundancy, which reduces
  the number of manual decisions that must be made while providing the means to detect inconsistencies.
When inconsistencies are detected, edge-connectivity is again used to quickly find and fix errors in edge
  labeling.
Our framework also includes a probabilistic convergence criterion based on a Poisson process.
This criterion stops identification once the ranking and verification algorithm are no longer able to find
  label-changing edges.
This means that the algorithm will stop the identification process before it devolves into a brute force search.
The point at which this happens will depend on the power of the ranking and verification algorithms.

This brings us to our last point.
The graph framework is general.
It is not restricted to animal identification.
It could be applied to any instance recognition problem where one annotation corresponds to one individual
  object.
It does not depend on our specific ranking and verification algorithms and could easily incorporate other
  algorithms.
It is even possible to use it without any algorithms.
Even though this does result in a brute force search, the redundancy criterion still reduces the number of
  reviews, and error recovery still ensures that the graph is always consistent.
This case is not just theoretical.
To extend to new species and domains it is important to be able to construct a small labeled database from which
  learned ranking and verification algorithms can be bootstrapped.
As more pairwise training data is gathered and maintained using this framework, more sophisticated pairwise
  classifiers trained using deep learning could be applied, perhaps removing the need for manual interaction and
  resulting in a fully automatic identification algorithm.