-
Notifications
You must be signed in to change notification settings - Fork 0
/
chapter5-graphid.tex
1254 lines (1078 loc) · 80.1 KB
/
chapter5-graphid.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\begin{comment}
fixtex --fpaths chapter5-graphid.tex --outline --asmarkdown --numlines=999 --shortcite
fixtex --fpaths chapter5-graphid.tex --outline --asmarkdown --numlines=999 --shortcite -w && ./checklang.py outline_chapter5-graphid.md
fixtex --fpaths chapter1-intro.tex --outline --asmarkdown --numlines=999 --shortcite -w && ./checklang.py outline_chapter1-intro.md
\end{comment}
% TODO: relate back to GZC, and talk about non-rally scenarios
\chapter{IDENTIFICATION USING CONNECTIVITY IN A DECISION GRAPH}\label{chap:graphid}
\newcommand{\nT}{N}
In this chapter we frame the problem of animal identification in terms of constructing a %
\glossterm{decision graph}.
In this graph, each node is an annotation, and each edge represents a decision made between two annotations.
Edges determine if two annotations are the same (positive) or different (negative) individuals or if they cannot
be compared (incomparable).
Using the connectivity of the decision graph, we naturally address the problem of animal identification.
Assuming no edges are incorrectly labeled, each connected component of positive edges are all annotations from
the same individual animal.
We will refer to these as \glossterm{positive connected components} (PCCs).
We call a graph \glossterm{id-complete} if there is at least one negative edge between all pairs of PCCs.
In this case all labeled individuals must be distinct, and we have therefore completed identification.
Alternatively, if every pair of PCCs either has a negative edge between them except for pairs of PCCs where all
possible edges between them are incomparable, then we know that the data cannot be used to determine if those
incomparable PCCs are the same or different.
Removing the assumption that all edges are correctly labeled, we call a decision graph \glossterm{inconsistent}
if any PCC contains a negative edge because it implies that an edge has been mislabeled.
Therefore, stated abstractly, goal of graph identification is to determine a correct, consistent, and id-complete
set of edges in the decision graph.
In the most general case, the decision graph is initialized with an empty set of edges.
This captures the challenge posed by events like the \GZC{} from~\cref{sec:introgzc}, where we are given a set of
annotations without name labels.
In essence, each annotation starts by itself as an individual animal, but because there are no negative edges, we
cannot be sure that this is the correct name labeling.
In order to refine the name labeling, we add edges to the decision graph, gradually moving from a state of zero
confidence that the name labelings are correct to a state of high confidence.
However, it is important to note that the graph algorithm does not require that we start in an empty state.
Given a known set of annotations with name labels and one or more new annotations with unknown name labels, we
can add these new annotations to the existing database simply by labeling the graph with edges that captures our
current knowledge.
Simply put, this means each known individual is a PCC, there is one negative edge between each pair of PCCs, and
the new annotations are added as nodes without any edges.
Regardless of the initial state, graph identification proceeds to complete the decision graph.
For the remainder of this chapter, without loss of generality, we can assume that the graph starts in an empty
state.
%The name labelings --- \ie{} the set of PCCs --- are refined as new edges are added to the graph.
%Akin to a segmentation algorithm~\cite{fulkerson_class_2009} that starts with an over-segmentation of an image,
% the identification graph starts with an empty set of edges, $G = (V, \{ \})$, so in essence each annotation
% starts by itself as an individual animal.
%we must discard all of those PCCs except those in an independent
%set of a meta-graph where the PCCs are nodes and edges are formed between incomparable PCCs.
To construct the decision graph, we develop a semi-automatic review procedure that combines the ranking and
verification algorithms presented in \cref{chap:ranking,chap:pairclf}.
The ranking algorithm will be used to suggest candidate edges to be placed in the graph, and the verification
algorithm will be used to automatically review as many edges as possible.
The key reason for combining these algorithms with a decision graph is to take advantage of its connectivity
information.
Connectivity not only identifies the individuals, but it can also be used to develop graph measures of
\emph{redundancy}, \emph{completeness}, \emph{consistency}, and \emph{convergence}.
By combining these graph measures with the ranking and verification algorithms we can prioritize edges for review
based on both their pairwise probabilities and their ability to affect the consistency of the graph, which in
turn allows us to:
\begin{enumin}
\item increase confidence that the identifications are correct, %
\item reduce the number of manual reviews, %
\item detect and recover from review errors, and %
\item determine when identification is complete. %
\end{enumin}
An important property of the graph identification framework is that it is agnostic to the underlying computer
vision procedures, which are abstracted into three components:
\begin{enumin}
\item a ranking algorithm used to search for candidate positive edges, %
\item a verification algorithm used to automatically review edges, and %
\item a probability algorithm used assign probabilities to edges (note this is typically a by-product of the
ranking or verification algorithm).
\end{enumin}
In this thesis we use ranking algorithm from \cref{chap:ranking}, and the verification algorithm from
\cref{chap:pairclf} to define these components because these are suitable for identifying textured species.
However, while graph identification benefits from accurate computer vision subroutines, it can stand alone
without them.
This means that existing identification algorithms that only define a subset of these procedures (\eg{}
contour-based rank-only identification of humpback whales and bottlenose dolphins) could be seamlessly
incorporated into our framework and realize the benefits of graph identification (\eg{} a reduced number of
manual reviews and error recovery mechanisms).
Furthermore, because pairwise decisions are gathered and maintained by this framework, verification algorithms
can be retrained and improved, moving closer to a fully-automatic algorithm.
The first section (\Cref{sec:decisiongraph}) of this chapter formalizes the decision graph and summarizes the
priority-based review procedure used to construct it.
This provides an overview of each component of the processes.
Each of these components is then discussed in further detail
in~\cref{sec:redun,sec:incon,sec:cand,sec:decision,sec:converge}.
\Cref{sec:graphexpt} experimentally demonstrates the ability of the graph identification algorithm to reduce the
number of manual reviews and recover from decision errors.
\Cref{sec:graphconclusion} concludes and summarizes the chapter.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\FloatBarrier{}
\section{THE DECISION GRAPH}\label{sec:decisiongraph}
The graph identification algorithm is a review procedure formalized around the notion of a \glossterm{decision
graph} $G = (V, E)$ whose nodes are annotations and whose edges are suggested by a ranking algorithm (LNBNN in
our case) and decided upon by a combination of the probabilities output by a verification algorithm and by manual
review.
The edge set $E = E_p \cup E_n \cup E_i$ is composed of three disjoint sets.
Throughout this chapter we will refer to the set that an edge belongs to as the label of that edge.
Each edge in $E_p$ is \emph{positive}, meaning that it connects two annotations determined to be from the same
individual.
Each edge in $E_n$ is \emph{negative}, meaning that it connects annotations determined to be from different
individuals.
Finally, each edge in $E_i$ is \emph{incomparable}, meaning that it connects two annotations where it has been
determined that there is not enough information to tell if they are from the same individual (\eg{} when one
annotation shows the left side of an animal and another other shows the right side).
An example of a decision graph with all three edge types is illustrated in \cref{fig:decisiongraph}.
The goal of graph identification is to construct these edges.
\decisiongraph{}
The most important task is to determine the positive edges $E_p$.
This is because each connected component in the subgraph $G_p = (V, E_p)$ corresponds to a unique individual.
Producing an accurate set of these \glossterm{positive connected components} (PCCs) addresses the larger problem
of animal identification.
However, an algorithm that only determines positive edges is not enough.
This is because the algorithm may have failed to find all positive edges, resulting in two unconnected PCCs that
should be \emph{merged} into one.
To ensure that this is not the case we must turn towards negative edges.
We can gain confidence that all positive edges have been found by using negative edges $E_n$, which provide
direct evidence that two annotations are different individuals.
A correctly labeled negative edge between two PCCs means that no other unreviewed edge between those PCCs can be
positive.
Another important case is when a negative edge is contained within a PCC.
When this happens, the PCC is \emph{inconsistent}, and it implies that the PCC contains at least one mislabeled
edge.
Whenever an inconsistency is detected, we resolve it using the algorithm that we will define in~\cref{sec:incon}.
Lastly, incomparable edges, $E_i$, simply signify that a positive or negative decision cannot be made.
Whenever an edge is not labeled as positive, it is critically important to the construction of the decision graph
that this non-positive edge is distinguished as either negative or incomparable.
Negative edges restrict what new edges can be added because they carry information about the completeness and
consistency of the graph.
In contrast, incomparable edges do not.
Incomparable edges can exist internally in a PCC without causing inconsistencies or between two PCCS without
precluding them from being matched at a later point.
%and , whereas incomparable edges do not.
%To be able to use this negative information, we must recognize the distinction between non-positive and negative
% edges.
%Recall, that this was the basis of our $3$-state classification algorithm in \cref{chap:pairclf}.
%Algorithms like the ranking algorithm from~\cref{chap:matching} do not make this distinction, and therefore
% cannot make use of the negative information carried by negative edges.
In the case where all edges between two PCCs are incomparable and those PCCs are complete, then we know the
current data is not enough to determine if those PCCs are the same or different.
In our datasets most images were taken to reduce the number of incomparable PCCs in order to simplify the
sight-resight analysis, where all PCCs must be comparable to each other, but incomparable cases do exist.
Thus, in our context incomparable edges play a minor but necessary role.
%Note that sight resight analysis can be accomplished on sets with incomparable edges using the algorithm
% in~\cref{app:markrecapincomp}.
%datasets where
%play a minor but necessary role by
Using the connectivity of these edges, we can reduce the number of manual reviews needed.
We now make several observations assuming that each edge is reviewed correctly.
To reduce the number of potential reviews, notice, that once a group of nodes is connected by (a tree of)
positive edges, all those nodes in that PCC can be inferred to belong to the same individual, and it is not
necessary to consider any other edge internal to the PCC for review.
Likewise, once a negative edge has been placed between two PCCs, all edges between those PCCs can be ignored.
By ignoring these redundant edges we can reduce the number of reviews.
Furthermore, we can construct a \glossterm{deterministic termination criterion}; if a negative edge is placed
between every pair of PCCs, then all individuals must have been discovered and identification has converged.
We call such a graph id-complete because when all vertices in a PCC are collapsed into one, the resulting
meta-graph is complete.
Unfortunately, there are two issues with these observations.
First, they depend on the condition that each edge was correctly reviewed.
In fact, we know that both the verification algorithm and human reviewers are sometimes wrong.
We therefore will introduce redundancy into our graph that allows the algorithm to detect and correct errors,
trading off the level of redundancy with the sophistication of the errors that may be caught.
A small amount of redundancy is desirable because:
\begin{itemln}
\item PCCs with redundant edges are less likely to contain errors.
\item Redundancy can potentially introduce inconsistency into a PCC, which signifies that an error has
occurred.
\end{itemln}
Therefore, we will define a redundancy criterion in~\cref{sec:redun} which ignores edges within and between PCCs,
but only after they meet a minimum level of redundancy.
Additionally, the deterministic termination criterion would require that each pair of PCCs has a redundant set of
negative edges between them before the algorithm stops.
However, the number of edges that need review grows quadratically.
Therefore, unless the automatic algorithm is perfect or the dataset is small, the number of negative reviews
needed to converge will be too large for a human to handle.
We address this concern in~\cref{sec:converge} using a probabilistic termination criterion.
%While this criterion will not guarantee ,
%that the graph is complete
%it will ensure that the algorithm terminates
\FloatBarrier{}
\subsection{The review algorithm}\label{sub:graphalgo}
% Algorithm overview
The review algorithm that produces the edges of a decision graph is outlined in Algorithm~\ref{alg:AlgoOverview}.
Akin to a segmentation algorithm~\cite{fulkerson_class_2009} that starts with an over-segmentation of an image,
the identification graph starts with an empty set of edges, $G = (V, \{ \})$, so in essence each annotation
starts by itself as an individual animal, but because there are no negative edges we cannot be confident that any
pair of annotations should indeed be given different labels.
Therefore, the algorithm proceeds to prioritize and add positive edges that merge multiple annotations into the
same PCC, negative edges that indicate that two annotations are different individuals, and incomparable edges
that prevent two annotations from being labeled as positive or negative.
Throughout the main algorithm, the graph is maintained in a \emph{consistent} state, which means that each PCC
has no internal negative edges.
Note, that while we describe the algorithm starting from an empty graph, without loss of generality, the edges in
the graph can be initialized to reflect a previously known labeling.
Thus, the algorithm can address the case where nothing is known, new images are being added to an existing
dataset, or multiple datasets are being combined.
\begin{algorithm}
While the graph has not converged:
\begin{enumln}
\item Generate and prioritize candidate edges
\item Insert candidate edges into a priority queue
\item While the priority queue is not empty:
\begin{enumln}
\item Pop an edge from the priority queue
\item Make a decision and add the edge to the graph
\item If the edge causes an inconsistency drop into inconsistency recovery mode
\item Update the priority queue based on the new edge
\item If candidate edges require refresh, break
\end{enumln}
\end{enumln}
\caption[Algorithm Overview]{Overview of the graph identification review procedure}
\label{alg:AlgoOverview}
\end{algorithm}
The first step of the algorithm is to generate candidate edges and predict probability measures (positive,
negative, or incomparable) for each candidate edge.
In the next step each edge is entered into a priority queue with a priority based first on its ability to be
automatically reviewed and then on its positive probability.
Next, the algorithm enters a loop where the next candidate edge is selected, a decision is made about this edge
--- either automatically (as much as possible) or by the user --- and it is added to the graph.
The algorithm proceeds toward convergence by removing candidate edges from the priority queue, either directly
from the top of the queue or indirectly by eliminating candidate edges that are no longer needed.
A candidate edge is no longer needed when there are sufficient redundancies in the edge set within or between its
PCCs.
A pair of PCCs is \emph{complete} when there are enough negative edges between them.
Each new edge addition could trigger two important events:
\begin{enumin}
\item a \emph{merge} --- addition of a positive edge between different
PCCs combines them into one PCC, and
\item an \emph{inconsistency} --- addition of either a negative edge within a PCC or a positive edge between
PCCs that already have a negative edge between them creates an inconsistent PCC.
\end{enumin}
Handling a merge is largely a matter of bookkeeping and can be done efficiently using a data structure that can
dynamically maintain connected components~\cite{jacob_holm_poly_logarithmic_2001}.
Finding an inconsistency, however, drops the user into inconsistency recovery mode which alternates between
hypothesizing one or more edges to fix and manually verifying these edges with the user until consistency is
restored.
Finally, the outer loop of the overall algorithm allows the ranking algorithm to generate additional candidate
edges --- this allows the ranking algorithm to take advantage of more subtle matches as the PCCs begin to form.
The priority queue will gradually be emptied as each PCC obtains a sufficiently redundant set of positive edges
and enough negative edges to be complete.
%Ensuring completeness requires examining $O(|V|^2)$ edges, so in practice we
% develop a learned probabilistic completeness measure.
%If sufficient training data is not available simple heuristics can be used to
% terminate.
Details of each step in the review algorithm are described in the following sections.
First we describe the redundancy criterion in~\cref{sec:redun}.
Then we will define candidate edge generation in~\cref{sec:cand} and decision-making in~\cref{sec:decision}.
Inconsistency recovery is covered in~\cref{sec:incon}.
Finally, we describe the refresh and termination criteria in~\cref{sec:converge}.
% will be emptied when each PCC is sufficiently re
% Obtaining sufficient redundancy and completeness in order to empty the priority queue can,
% Ensuring all PCCs are complete leads to the need to ,
% so to prevent this we develop a probabilistic measure (\cref{sec:converge})
% that triggers much earlier convergence when positive edges are no longer
% likely to be found.
\FloatBarrier{}
\section{POSITIVE AND NEGATIVE REDUNDANCY}\label{sec:redun}
In this section we define criteria that
(1) increases our confidence that PCCs are correct by enforcing a minimum level of redundancy and
(2) prevents edges that exceed this redundancy from being reviewed.
At a minimum each PCC must be a tree of positive edges, but when errors can occur, it's difficult to be confident
that all nodes in the PCC are really annotations from the same individual.
By adding a redundant edge we either increase the confidence that other edges are correct or detect an
inconsistency which can be resolved with the algorithm in ~\cref{sec:incon}.
However, the gains in confidence from adding each additional edge are diminishing.
Therefore, it is desirable to achieve a minimum level of redundancy, but once this has been achieved we should
prevent additional redundant edges from being reviewed.
We formalize this minimum level of redundancy in two forms.
The first is for positive edges within PCCs and the second is for negative edges between PCCs.
\begin{enumln}
\item positive-redundancy --- %
A PCC is $k$-positive-redundant if its positive subgraph is $k$-edge-connected (contains no cut-sets
involving fewer than $k$ positive edges~\cite{eswaran_augmentation_1976}), or if the PCC has $k$ or fewer
nodes and the union of positive and incomparable edges is a complete graph.
\item negative-redundancy --- %
A pair of PCCs $C$ and $D$ is $k$-negative-redundant if there are $k$ negative edges between $C$ and $D$,
or if there are $|C| \cdot |D|$ negative or incomparable edges between them.
%or if both PCCs have fewer than $k$ nodes and there are at least $\mathop{max}(|C|, |D|)$ negative edges
%between them.
%which can be determined in $O(n_1 n_2)$ time.
%(by looping over adjacency sets of nodes
%in $C$ and performing set intersection with nodes in $D$ to get the edges
%between $C$ and $D$).
%$k$-negative-redundant if there are $k$ negative edges between them.
\end{enumln}
\kredun{}
The example in \cref{fig:kredun} illustrates different levels of redundancy.
To understand these criteria better, consider what it means for a PCC that has been determined to be
$k$-positive-redundant to have an undiscovered error.
The error means that the PCC really should be split into (at least) two separate PCCs.
Suppose these PCCs correspond to animals $C$ and $D$.
If the combined PCC is $k$-positive-redundant then are $k$ separate undiscovered mistakes connecting $C$ and $D$,
and there must also be no negative edges connecting $C$ and $D$.
This may be plausible if $C$ were identical twins, but these tend not to occur for species where the
distinguishing markings (\eg{} hip and shoulder of zebras) are mostly random.
In other words, an error only becomes undiscoverable if a reviewer makes the same mistake with different
annotations from the same individual $k$ times.
Note that $k$ can be different for positive and negative redundancy, but in our current implementation we use
$k=2$ for both positive and negative redundancy.
\subsection{Checking redundancy}
We now describe how to check if a consistent PCC with $n$ nodes and $m$ edges is $k$-positive-redundant.
An PCC that contains a negative edge is inconsistent and is never considered as positive-redundant.
In the case where $n \leq k$, the PCC is positive redundant if all possible edges are either positive or
incomparable.
This can be determined in $O(m)$ by checking if the sum of the number of positive and incomparable edges between
the nodes in the PCC is equal to $\binom{n}{2}$.
Otherwise, in the more interesting case, when $n > k$, a PCC is positive-redundant iff it is $k$-edge-connected
--- \ie{} it is impossible to disconnect the nodes in the PCC by removing any set of $k - 1$ edges.
In practice, we are primarily concerned with the case when $k=2$.
This special case of $2$-edge-connectivity is also called bridge-connectivity, and can be tested for in %
$O(n + m)$~\cite{eswaran_augmentation_1976,wang_simple_2015}.
The basic idea is to trace cycles encountered during a depth-first-search of the graph and mark all edges that
are part of a cycle.
Any unmarked edge is not part of any cycle, and is called a bridge.
Removing any bridge edge would disconnect the PCC.
Thus, if no bridge exists then the PCC is $2$-edge-connected.
In the general case when $k>2$, edge-connectivity can be determined in $O(m n)$ amortized
time~\cite{esfahanian_connectivity_2017}.
This involves first computing a small (not necessarily the smallest) dominating set.
A small dominating set can be computed using a greedy algorithm that starts with an empty set and iteratively
adds an arbitrary node that is not in or adjacent to the current set until all nodes are in or adjacent to that
set.
Then an arbitrary node in the dominating set is chosen.
The local-edge-connectivity is computed between the chosen node and every other node in the dominating set.
The local-edge-connectivity between two nodes is simply the maximum flow between those nodes, if all edge
capacities are equal to $1$.
The edge-connectivity of the entire PCC is the minimum of
(1) the minimum degree of the PCC and
(2) the minimum computed local-edge-connectivity.
We now describe how to check if two PCCs $C$ and $D$ with sizes $n_1$ and $n_2$ are $k$-negative-redundant.
This can be done in $O(n_1 n_2)$ time using adjacency lists and set intersections to check if the number of
negative edges between the nodes in $C$ and $D$ is greater than $k$.
For each node in $C$ we simply check if any node in $D$ is in the adjacency list (stored as a set) of that node.
The number of times this is true is the number of negative edges between $C$ and $D$.
%Whenever a positive edge is added inside a PCC, we can check positive-redundancy, and if it is, we can remove all
% other edges inside that PCC from the priority queue.
%Similarly, when a negative edge is placed between two PCCs, we can check negative redundancy and, if it is
% satisfied, remove all other edges between those PCCs.
Using this redundancy criterion we are able to find and remove edges from the priority queue that are no longer
needed.
When a positive edge is added within a single PCC, we check for positive-redundancy.
If this passes, all remaining internal edges for that PCC may be removed from the priority queue.
When a negative edge is added between a pair of PCCs, we run the negative-redundancy check on the pair, and if
this passes, all remaining edges between the PCCs may be removed from the priority queue.
When a positive edge is added between a pair of PCCs, the two PCCs are merged into a single new PCC $C'$, and the
above negative-redundancy check must be run between $C'$ and all other PCCs having a negative edge connecting to
$C'$.
It can be shown that if the graph is in a consistent state, that these are the only updates required.
As a final note, consider the case where a PCC is composed of two positive $k$-edge-connected subgraphs joined by
a single positive edge.
While the entire PCC is not positive redundant, much of it is.
In this case, we do not need to review any edge within any $k$-edge-connected component of the PCC.
We can dynamically remove these from the priority queue by checking when a new edge popped off of the priority
queue is within an existing PCC.
We can check if the local-edge-connectivity~\cite{esfahanian_connectivity_2017} --- \ie{} the maximum flow ---
between the edge's endpoints is at least $k$.
If the local-edge-connectivity between those nodes is at least $k$, then that edge is part of a
$k$-edge-connected component and can be ignored.
\subsection{Redundancy augmentation}\label{subsec:augredun}
In addition to determining if existing edges are redundant, it would be useful determine a small set of edges
that would make an existing PCC positive-redundant or two PCCs negative-redundant.
Reviewing these edges would help expose any undiscovered errors in the graph.
To find edges that would complete positive-redundancy for a PCC, we compute a $k$-positive-augmentation.
This is equivalent to the problem of finding a minimum $k$-edge-augmentation.
When $k=2$ the problem is called bridge augmentation and can be computed $O(m)$~\cite{eswaran_augmentation_1976}
as long as all edges in the complement of the PCC can be used in the augmentation.
However, if the PCC contains incomparable edges, then these edges cannot be used.
We can address this by using a weighted variant of the problem, setting the weights of the incomparable edges to
$\infty$, and searching for a minimum cost augmentation.
However, the weighted variant of this problem is NP-hard, even for $k=2$, but can be approximated within a factor
of $2$~\cite{khuller_approximation_1993}.
We make use of this algorithm later in~\cref{sec:cand}
To find a set of edges that would complete negative-redundancy for a pair of PCCs, we compute a
$k$-negative-augmentation.
Computing this set of augmenting edges is trivial.
Initialize an empty augmenting set.
Iterate through all combinations of nodes between the two PCCs.
For each combination of nodes, if there is no existing edge between those nodes in the graph, add it to the
augmenting set.
Once $k$ edges have been added to the augmenting set or there are no more node combinations, stop and return the
augmenting set.
Note, that we do not use this algorithm in practice because we use a probabilistic termination criteria instead
of enforcing that all pairs of PCCs are negative-redundant.
%If the PCC only contains positive edges, edge augmentation with the fewest edges can be computed in linear time
% using~\cite{eswaran_augmentation_1976}.
%However, if some edges in the complement of the positive PCC edges are incomparable then we must compute a
% minimum weight edge augmentation (which is NP-hard) using a $2$-approximation algorithm
% ~\cite{khuller_approximation_1993}.
\section{CANDIDATE EDGE GENERATION AND PRIORITIES}\label{sec:cand}
In this section we describe the first step of the algorithm where candidate edges are generated and then
prioritized for review.
There many ways that candidate edges can be generated.
Different sets and orderings of candidate edges will impact different properties of the graph at different rates.
Therefore, we choose candidate edges to depend on what properties of the graph we want to manipulate.
In the context of identifying individual animals, the most obvious manipulation is to reduce the number of PCCs
in the graph by adding positive edges between existing PCCs, thus merging them into one.
A less obvious property that we can manipulate is the fraction of PCCs that are positive redundant.
By increasing this fraction to $1$, we discover mistakes that have been made, which can be fixed using the
algorithm in~\cref{sec:incon}.
This ultimately decreases the likelihood that any PCC contains a mislabeled positive edge.
In each iteration outer loop of our main algorithm, we alternate between first generating candidate edges to
merge PCCs, and then generating candidate edges to ensure that all PCCs are positive redundant.
To generate candidate edges that may merge existing PCCs we use the LNBNN ranking algorithm
from~\cref{chap:ranking}.
We issue each annotation as a query to the ranking algorithm, and form edges from the top results of the ranked
lists.
We then assign a priority to each new candidate edge.
We use the pairwise algorithm from~\cref{chap:pairclf} to estimate the positive, negative, and incomparable
probabilities of each edge.
Any edge whose maximum positive, negative, or incomparable probability is above the threshold for automatic
decision-making is ranked according to this probability.
All other edges are ordered by their positive probability.
This ensures automatic decision-making is first, followed by an ordering of the edges needed for manual review
that are most likely to be positive and therefore add the most information to the graph.
It is desirable to add positive decisions to the graph first because
(1) they are the most important edges with respect to determining the animal identities, and
(2) larger PCCs increase the number of edges that can be skipped using the redundancy criterion.
The first iteration of the outer loop the algorithm generates edges using the ranking algorithm.
Review of these edges continues until the priority queue is empty, or we determine that the candidate edges
should be refreshed using the algorithm we will describe in~\cref{sec:converge}.
This ends the current inner loop, and as long as the algorithm has not converged, it proceeds to the next
iteration of the outer loop.
In this next iteration, we generate candidate edges to ensure that all PCCs are positive redundant.
To generate candidate edges that will make PCCs positive redundant, we use the positive-augmentation algorithm
from~\cref{subsec:augredun} to generate edges.
These edges are assigned priorities in the same way, however in this iteration the refresh criterion is disabled.
This enforces that if each edges is reviewed as positive, then all PCCs will be positive-redundant at the end of
this inner loop.
However, sometimes an edge will not be reviewed as positive.
If it is reviewed as negative, then it will introduce an inconsistency and be handled by the algorithm
in~\cref{sec:incon}.
If it is reviewed as incomparable, then the PCC will only be positive-redundant if all edges between the nodes in
the PCC have been reviewed.
Because we want to ensure that all PCCs are positive-redundant, we delay alternating back to the ranking
algorithm.
Instead, we simply regenerate candidate edges using the positive-augmentation algorithm in the next iteration of
the outer loop until all PCCs are positive-redundant.
Therefore, at the end of this process all PCCs are positive-redundant by construction.
After we have ensured positive-redundancy, the outer loop alternates back to generating candidate edges using the
ranking algorithm.
In this way our algorithm alternates between two modes, first searching for merges, and then ensuring redundancy.
This continues until the termination criterion that we will describe in~\cref{sec:converge} is satisfied.
At the end of this process it is guaranteed that all PCCs are consistent and positive-redundant.
\section{MAKING DECISIONS}\label{sec:decision}
Now that we have generated and prioritized a set of candidate edges we come to the core of the inner loop ---
decision-making.
Because of the surrounding structure of the graph framework, this step is quite simple.
Given a popped edge from the priority queue, we check if any of the positive, negative, and incomparable state
probabilities produced by the pairwise algorithm are above their automatic decision threshold (set externally as
a hyperparameter).
If the edge cannot be automatically reviewed we issue a request for user feedback.
Once we have obtained feedback for an edge --- either automatically or manually --- the edge is added to the
appropriate edge set.
After the new edge is added, we update candidate edge priorities discussed in~\cref{sec:redun}.
If the new edge causes an inconsistency, then we drop into inconsistency mode, which we will discuss
in~\cref{sec:incon}.
For each decision we record a user-id to identify the reviewer or algorithm making the decision.
We also follow the approach of~\cite{branson_visual_2010} and store a user-specified categorical confidence value
of unspecified, guessing, not-sure, pretty-sure, and absolutely-sure (with associated integer values $0$, $1$,
$2$, $3$, and $4$).
The user-id allows us to differentiate between edges that were automatically reviewed from those that were
manually reviewed.
While this is not directly used in this algorithm description, it enables a variety of possible post-processing
techniques, \eg{} manually reviewing automatically reviewed edges between PCCs containing only two annotations.
However, the user confidence contributes to the edges weights in the error detection and recovery algorithm
from~\cref{sec:incon}.
\section{RECOVERING FROM INCONSISTENCIES}\label{sec:incon}
In~\cref{sec:redun} we described a redundancy criterion that exposes errors by introducing inconsistencies.
In this section we describe an algorithm for fixing these errors and recovering from these inconsistencies.
Whenever a decision is made that either adds a negative edge within a PCC or adds a positive edge between two
PCCs with at least one negative edge between them, the graph becomes inconsistent.
In both cases the result is a single PCC $C$ with internal negative edges.
The goal of inconsistency recovery mode is to change the labels of edges in order to make the subgraph formed by
the nodes of $C$ and all of their edges consistent.
An inconsistency implies that a mistake was made, but does not necessarily determine which edge has the wrong
label.
Therefore, we develop an algorithm to hypothesize the edge(s) most likely to contain the mistake(s) using a
minimum cut.
If the hypothesis is correct, and all the labels on these edges were changed, then we show that the PCC would
either become consistent or be split into multiple consistent PCCs.
Because the hypothesis might not be correct, we present these edges to a user for manual review, and if the user
agrees with the hypothesis, the algorithm completes.
Otherwise, the new information received by the user it taken into account, and the hypothesis is recomputed.
An example of an inconsistent PCC with hypothesized edges is illustrated in \cref{fig:inconpcc}.
\inconpcc{}
\subsection{Hypothesis generation}
The procedure alternates between steps of generating ``mistake hypothesis'' edges, and presenting these to the
user for review.
The ``hypothesis generation algorithm'' returns a set of negative edges or a set of positive edges, which if
re-labeled as positive or negative respectively would cause $C$ to become consistent.
For simplicity, we focus first the case where $C$ contains exactly one negative edge.
It will not be hard to extend to the general case.
First consider a cut of $C$ that disconnects the endpoints of the negative edge.
If the labels on these edges were changed from positive to negative or incomparable, then the inconsistent PCC
would be split into multiple consistent PCCs.
Alternatively, if the label on the negative edge was changed to positive or incomparable, the PCC would no longer
be inconsistent.
Therefore, the algorithm starts by creating a minimum $s$-$t$-cut using the subgraph of $C$ containing only
positive edges.
The endpoints of the single negative edge are the terminal nodes $s$ and $t$.
Unlike in our preceding discussion where the edges only have positive/negative/incomparable labels the edges will
now have weights that reflect a measure of confidence in their current label.
In the case where the negative edge is correct, this will encourage the minimum cut to return the positive edges
that are most likely to be mislabeled.
The weight of an edge in this cut problem is the sum of three values:
\begin{enumln}
\item its positive probability previously computed using the pairwise classifier,
\item the number of consecutive times that edge was manually reviewed with
its current label, and
\item an integer ranging from $0$ to $4$ indicating the confidence of the most recent review (see
\cref{sec:decision}).
\end{enumln}
Because we are using a minimum cut, the algorithm will find the lowest confidence cut set of these edges.
The higher each of these values is, the more likely that the edge is correctly labeled.
The positive probability offers a baseline estimate of this confidence.
Using the number of manual reviews means that if a user disagrees with a hypothesis, the weight on that edge will
be increased and the algorithm will be encouraged to select a different edge.
The confidence performs a similar role, speeding up the pace at which the algorithm tries different edges.
Using this weighting scheme we find the minimum $s$-$t$-cut, which returns a cut set of positive edges.
We now need to decide if it is more likely that the positive cut-edges should be relabeled, or the negative edge
should be relabeled.
We compare the total weight of cut positive edges with the weight of the negative edge (weighted using the same
scheme).
If the positive weight is smaller, the algorithm suggests that the cut positive edges should be relabeled as
negative.
Otherwise, it suggests that the negative edge should become positive.
%The minimum cut returns a set of edges that disconnects the terminal nodes.
%In other words, if the label on these edges is changed from positive to negative or incomparable, then $s$ and
% $t$ would no longer be part of the same PCC and the inconsistency would be removed.
\subsection{Generalization}
So far, we have only considered the case when only one negative edge exists in a PCC.
%In a restricted scenario where each edge is reviewed exactly to the specifications of the algorithm, then it is
% only possible for one negative edge to exist in a PCC when $k=2$.
However, in practice it is possible for multiple negative edges to exist within a PCC.
In this case we can slightly modify the inconsistency recovery algorithm.
%to apply to the case when there are any number of
% negative edges between the nodes of a PCC.
The modification is simple.
Instead of using a minimum $s$-$t$-cut, we use a multicut to disconnect the endpoints in all pairs of negative
edges.
Multicut is NP-hard, but a simple approximation algorithm is to perform a minimum $s$-$t$-cut for each set of
terminal nodes, and then take the union of the resulting cut~\cite[203--208]{vazirani_approximation_2013}.
When considering which edges to return we consider all negative edges together, comparing the sum of their
weights to the sum of the positive edge weights.
\subsection{Hypothesis review}
The user iterates through each hypothesis edge and chooses
(1) to agree with the hypothesis and change the label of the edge, or
(2) to disagree with the hypothesis and keep the edge label.
In the case where the user agrees that a positive label should be changed, it does not matter if the label is
changed to negative or incomparable, it will still remove the positive connection.
A similar argument is true when the user agrees to change a negative label to either positive or incomparable;
either way, the negative edge between the nodes in the PCC is removed.
As long as the user agrees with the hypothesis and the hypothesis set is non-empty, the iteration continues.
If there are no more hypothesis edges, then either all inconsistencies were removed or all PCCs were split into
multiple consistent PCCs.
In the case where the reviewer disagrees with the algorithm, the new review is recorded and we generate a new set
of hypothesis errors.
Because reviewing the edge increases its weight, the algorithm will be forced to look elsewhere for a cut.
This alternation between hypothesis generation and user review repeats until the reviewer agrees with all
hypothesis edges, which means that all inconsistencies have been eliminated
Fixing inconsistencies can result in splitting $C$ into multiple PCCs.
This may invalidate implicit reviews inferred from redundancy either within or incident to this subgraph.
Therefore, we recompute positive-redundancy within each new PCC, ignoring edges where the criterion is satisfied
and re-prioritizing unreviewed edges where it is no longer valid.
A similar process happens for negative-redundancy between each pair of new PCCs as well as between each new PCC
and all other PCCs previously negative-redundant with $C$.
\subsection{Implementation details}
While it might be conceptually simpler to think of inconsistency recovery as a separate mode, in practice it is
actually integrated as part of the algorithm's inner loop.
The algorithm dynamically maintains a list of all PCCs that contains errors.
We disable redundancy checks for any PCCs that contain errors.
Hypothesis edges are computed whenever a new inconsistency is created, and these edges are entered into the queue
with their normal priority plus $10$ to ensure they are reviewed first.
It can be shown that as long as the user agrees with the current hypothesis edges computed so far, recomputing
the new hypothesis edges will always be the same as the remaining hypothesis edges.
This implementation is an important first step for generalizing the graph algorithm into a distributed setting
with multiple reviewers.
%\section{Refreshing candidate edges}\label{sec:refresh}
%The goal is to refresh if there has been a significant number of positive reviews, but new results are consistently
%negative. If we have not found any positive edges then we do not want to refresh. We keep track of the fraction of
%positive review decisions as a moving average of manual decisions. We also maintain the total number of positive reviews
%made since the last candidate edge generation. Thus the candidate edges are refreshed whenever the number of positive
%reviews is above a threshold and the positive review fraction is below a threshold
%As the last outer iteration of the overall algorithm before convergence, triggered when the LNBNN ranking algorithm
%fails to produce positive edges, candidate edges between untested pairs of annotations are added within PCCs that are
%not positive-redundant and between PCCs that are not negative-redundant. This is because the ranking algorithm itself is
%imperfect and the missed matches tend to affect small PCCs disproportionately, which are the last to satisfy redundancy
%tests.
%\subsection{Probabilistic convergence}
%The goal of probabilistic convergence is to determine if a PCC $C$ is negative-redundant with all other components with
%high probability. When all components are positive-redundant and satisfy this, then all edges will be removed from the
%priority queue and the algorithm will converge. We consider the probability $\Pr{E_c \given \nT_C}$ that an undiscovered
%positive edge exists ($E_C$) given $C$'s existing set of outgoing negative edges ($\nT_C$). Under mild conditions (if we
%assume that $\Pr{E_c \given \nT_C} < 0.5$), we can show that the probability $\Pr{\nT_C \given E_C}$ of observing the
%negative edges bounds this p given that an undiscovered match exists can be used as a surrogate. We can learn this
%probability offline by measuring the frequency that correct results are at a given rank in a PCC's ranked list
%(constructed by aggregating the ranked lists of all annotations in the PCC).
%To predict $\Pr{\nT_C \given E_C}$ we issue all queries as a single LNBNN query to obtain a single ranked list for the
%entire PCC. This can be done by treating all query descriptors as if they were the from the same annotation except
%during the spatial verification stage. Let $R_C$ denote the ranks of every PCC marked as negative with $C$. In an
%offline step we learn a probability mass function $\phi$ that predicts the probability that a correct match appears at a
%given rank for the PCC $C$. The predicted probability is %
%$\Pr{\nT_C \given E_C} = 1 - \sum_{r \in R_C} \phi(r)$.
%To learn $\phi$, we measure the probability that a correct match appears at a given rank, given a correct match exists.
%To do this initialize an histogram. For each $C$ in the training set, divide it into a query $C_q$ and target $C_t$. The
%target and the rest of the PCCs in the training set become database PCCs. Use LNBNN to score each annotation in $C_q$
%against the database PCCs. Determine the best rank that $C_t$ appears in each ranked list, and increment the
%corresponding index in the histogram. Repeat this process for all PCCs in the training set and for multiple partitions
%of each PCC. Normalizing the histogram array results in the PMF $\phi$. In order to prevent marginalization across
%important attributes (such as the number of exemplars in a PCC), construct multiple PMFs for different numbers of
%exemplars in a query.
\section{REFRESH AND TERMINATION CRITERIA}\label{sec:converge}
In this section we discuss the criteria we use for both determining when to refresh candidate edges and when to
stop the algorithm altogether.
We first consider the need for a refresh criterion.
As the review algorithm proceeds, we should only continue to manually review edges as long as the algorithm is
consistently generating edges that --- once reviewed --- change the PCCs.
This happens whenever a review merges two PCCs into one or splits one PCC into two.
Because splits only happen during inconsistency recovery, we are primarily concerned with searching for new
positive edges.
However, at some point the candidate edges may no longer contain positive results, but undiscovered positive
matches may still exist.
This is because LNBNN, working initially with each annotation having a separate label, can miss more subtle but
correct matches, especially when there are several annotations for an individual with multiple viewpoints.
As the labeling improves, so does the reliability of LNBNN.
Recall that the verification algorithm is only run after candidate edges are generated.
If the edges required to complete the underlying real PCCs are missing from the candidate set, then nothing can
be done.
We therefore must develop a ``refresh criterion'' to determine when to stop and replace the current priority
queue with new LNBNN matches.
This will be done by predicting if none of the remaining edges would change the PCCs, and then recomputing
candidate edges when this happens.
However, before we describe this process, we consider the problem of termination, which will turn out to have a
similar solution.
Similar to the refresh criterion, we must be able to determine when the algorithm should stop.
To ensure that the identification is perfect, the algorithm would need to use all $\binom{|V|}{2}$ edges as
candidates and then terminate using the deterministic convergence criterion explained in the introduction of this
chapter.
Recall that the deterministic convergence criterion only stops the algorithm once each PCC is positive redundant
and each pair of PCCs is negative-redundant.
This essentially results in a brute-force search, that requires $O(|V|^2)$ reviews, and is only feasible if
(1) the number of annotations is very small, or
(2) all edges can be automatically reviewed.
However, even if all edges can be automatically reviewed the quadratic computation required to run the
verification algorithm on all pairs of annotations might be too computationally expensive for very large
databases.
Therefore, in practical circumstances, we turn towards probabilistic methods to determine when to stop.
Like, the refresh criterion, this can be determined --- in part --- by predicting if new reviews will change the
PCCs.
\subsection{Convergence as a Poisson process}
%\newcommand{\meaningful}{meaningful}
%\newcommand{\meaningful}{label-changing}
Both the review and termination criteria can be addressed by considering the question:
``Will there be a label-changing review anytime soon?''.
A \emph{label-changing review} is one that changes the name labeling of the annotations, \ie{} it is either
a positive edge that merges two PCCs or a non-positive edge that splits one.
Correct and label-changing reviews improve the accuracy of the identification by increasing the similarity --- in
terms of the name labeling --- between the predicted PCCs and the real underlying PCCs.
However, producing a label-changing review has a cost:
the number of manual reviews since the last label-changing review.
During the inner loop of the algorithm, when edges popped from the priority queue no longer consistently result
in label-changing reviews, the marginal gains in identification accuracy that could be made from continuing are
outweighed by the cost of manual review.
In this circumstance it is best to break out of the inner loop.
Note that if any label-changing reviews were made during that loop, we should refresh candidate edges and start a
new loop because a refresh could result in new high priority label-changing edges.
On the other hand, if no label-changing reviews were made in the loop, then refreshing will have no benefit, and
the algorithm should terminate.
Thus, the task is to construct a criterion that determines when edges on the top of the priority queue are no
longer label-changing.
In this way we directly address the refresh criterion and indirectly address termination criterion.
This is direct in the case of the refresh criterion because when the name labeling is more accurate, the LNBNN
ranking will improve.
When we directly measure that the next reviews in the current priority queue are unlikely to be label-changing, and
the name labeling of the decision graph has changed, then it is more likely that we would review a label-changing
edge on the top of a new set of candidate edges sorted by priority.
This task indirectly addresses the termination criterion because instead of stopping once the probability that
identification is complete is high, the algorithm simply stops when the cost in terms of manual labor is too
high.
However, if we were to directly address the termination criterion, we would have to estimate the probability that
undiscovered merge and split cases exist.
This probability depends on the effectiveness of the ranking algorithm.
Even if our estimate of this probability was perfect, once the ranking algorithm starting producing label-changing
reviews at a rate no better than random edge generation, it would take an enormous amount of manual effort to
push this probability passed a desired threshold.
Thus, instead of providing guarantees about identification accuracy our termination criterion achieves a
trade-off between the number of manual reviews and the cost of identification.
Based on these observations we estimate the probability that ``there will be a label-changing review soon''.
We define ``soon'' using a patience parameter $a$, defined as the maximum number of consecutive reviews that a
manual reviewer is willing to do between label-changing reviews.
Let $L\teq1$ be the event that a review is label-changing and $L\teq0$ otherwise.
Because reviews are ordered, denote if the $i\th$ review is label-changing as $L_i$, and denote the index of the
next review as $n$.
Let %
%$C\teq1 \equiv (\M_i\teq1, \exists i \in [n, n + a])$
$C = \Or_{i=n}^{n+a} L_i$
%
be a binary random variable that takes the value $C\teq1$ in the event that any of the next $a$ reviews will be
label-changing.
Thus, the aforementioned question can be addressed by measuring the probability of the event $C\teq1$.
%The goal is to measure the probability the event $C\teq1$.
We can periodically check if $\Pr{C\teq1}$ is less than a threshold, and if so, we stop the current loop and
either refresh or terminate.
We model the event $C\teq1$ as a Poisson processes, but for this to be appropriate, $L_i$ must follow a uniform
distribution.
This would be true if the edges were reviewed in a random order.
However, the priority queue orders edges more likely to be label-changing first, causing $L_i$ to follow a right
skewed long tail distribution and violate Poisson assumptions.
Even so, the use of a Poisson model can be justified by considering a sliding window along the distribution of
$L_i$.
Recall that we only need to make predictions about the next $a$ reviews in the future, thus we are only concerned
with a small window to the right on the distribution.
Assuming the long-tailed distribution is monotonic decreasing, we can use a small window in the past to estimate
an upper bound on probability of $C\teq1$ in the future.
As the window moves to the right, the interval on the distribution becomes increasingly approximately uniform and
the tightness of the bound improves and eventually becomes tight.
This is because the order of the remaining reviews becomes random once the prioritization algorithm cannot
distinguish positive from negative cases.
In this case the Poisson model becomes appropriate.
Thus, the use of a Poisson model with a sliding window allows us to approximate an upper bound on $\Pr{C\teq1}$,
and the smaller $\Pr{C\teq1}$ is, the more accurate our estimate will be.
\subsection{Details of Poisson convergence}
Having justified its use, we model $C$ as a Poisson process, which is determined by a rate parameter $\mu$ and an
interval length parameter $a$.
We can measure $\mu$ as the fraction of recently observed manual reviews that were label-changing using an
exponentially weighted moving window.
We initialize $\mu_0=1$ to denote that it is likely that the first review will be label-changing and because it
will ensure our estimated probability is an upper bound.
Then, after each new review we update the parameter as %
$\mu_{i + 1} \leftarrow \ell_i \alpha + (1 - \alpha) \mu_{i}$, where $\ell_i\teq1$ if the $i\th$ review was
label-changing and $0$ otherwise.
The exponential decay $\alpha = 2 / (s + 1)$ is determined by a span parameter $s$, which roughly represents the
number of previous reviews that are significant.
Using this model, the desired probability that any of the next $a$ reviews will be label-changing is %
$\Pr{C\teq1} = 1 - \exp{-\mu a}$.
The example in \cref{fig:poisson} illustrates the behavior of the criterion using a synthetic dataset.
In this example we use a window span of $s=20$, a patience of $a=20$, and a threshold of $\tau=0.135$. \footnote{
% ---
A better choice would be to set these parameters such that %
$\tau = 1 - \exp{-a \left(s - 1\right)^{a} \left(s + 1\right)^{- a}}$ is satisfied.
This will guarantee convergence after a maximum of $a$ consecutive non-label-changing reviews.
Unfortunately, this was discovered after the completion of this \thesis{}, so the parameters in our
experiments do not satisfy this property.
Suggested values for future work are $s=20$, $a=72$, and $\tau=0.052$.
% ---
%$\tau = 1 - e^{- a \left(1 - \frac{2}{s + 1}\right)^{a}}$
}
%To achieve a certain threshold with a specified span the number of iterations required is:
%$\lceil{\frac{\log{\left (t \right )}}{\log{\left (\frac{s - 1}{s + 1} \right )}}}\rceil$
%If we set the threshold to $\exp{-2}\approx0.135$, then the window span parameter also roughly represents how
% many successive consecutive non-label-changing decisions are necessary before reaching the threshold if starting from
% initial state.
\begin{comment}
Another alternative might be measuring once the advantages from the ranking
algorithm become indistinguishable from a brute force search.
Let $D = \binom{|V|}{2}$ be the number of edges in the dataset and let $N=\sum_{C \in \set{C}} (|C| - 1)$ the
number of edges in the MSTs of all PCCs.
We can measure the density of meaninful reviews in the current labeled dataset as $N/D$.
After our estimated mu gets close enough to $N/D$, we can terminate.
\end{comment}
\poisson{}
%As the last outer iteration of the overall algorithm before convergence, triggered when the LNBNN ranking algorithm
%fails to produce positive edges, candidate edges between untested pairs of annotations are added within PCCs that are
%not positive-redundant and between PCCs that are not negative-redundant. This is because the ranking algorithm itself is
%imperfect and the missed matches tend to affect small PCCs disproportionately, which are the last to satisfy redundancy
%tests.
%Re-estimating the $k\mu$ at each time-step should
\section{EXPERIMENTS}\label{sec:graphexpt}
In this section we design an end-to-end experiment where we start with a set of annotations without any name
labels, and then we proceed to construct all the names.
This simulates identification events like the \GZC{} and provides insight into how the algorithms behave in
practice.
Because our algorithms are semi-automatic, we simulate a noisy user response using \groundtruth{} data.
Given a pair of annotations, the simulated user returns the \groundtruth{} classification $99\percent$ of the
time, making errors $1\percent$ of the time uniformly at random.
These assumptions may be simple and too inaccurate to model an expert reviewer, but it will serve to
demonstrate our graph algorithm's ability to recover from errors.
We will measure the simulation's accuracy and error as a function of the number of manual reviews.
Using these measures we will compare our graph algorithm to alternative techniques to demonstrate that it
produces accurate identifications with fewer errors using significantly fewer manual reviews.
For this experiment, we will use the plains and Grévy's zebras datasets previously described in
\cref{sub:datasets}.
Because some of our algorithms will require training, we split each dataset into a training set and a testing
set each containing half of the names.
The training set will be used to learn the pairwise classifiers and determine thresholds for automatic
classification.
The details of training and testing sets are summarized in \cref{tbl:TestTrainDBStats}.
\TestTrainDBStats{}
\FloatBarrier{}
The experiment will compare the three different methods of determining the identities of annotations based on
the algorithms defined in this \thesis{}.
These algorithms are:
\begin{enumin}
\item \pvar{ranking} --- the ranking algorithm from \cref{chap:ranking},
\item \pvar{rank+clf} --- the same
ranking algorithm but augmented with the automatic classification algorithm from \cref{chap:pairclf}, and
finally
\item \pvar{graph} --- the graph identification algorithm introduced in this chapter.
\end{enumin}
In the case of \pvar{graph}, the procedure from~\cref{sub:graphalgo} can be directly applied.
However, in order to compare \pvar{graph} to \pvar{ranking} and \pvar{rank+clf} we must define baseline
methods to determine the ordering of the reviews and when to stop.
Details of these identification procedures are given in the next subsection.
\FloatBarrier{}
\subsection{Identification procedures}
We design the procedure for \pvar{ranking} to be similar to the approach described in \cref{sec:introgzc}
that was used in the \GZC{}.
This algorithm does not require a pre-training phase and thus only the testing set is used.
Given the unlabeled annotations, we index the database, issue each annotation as a query, and collect the top
$5$ results from each ranked list.
The resulting pairs of query and database annotations are stacked and sorted by LNBNN score.
The user reviews each pair in the list sequentially.
As was done in the \GZC{}, all pairs are all manually considered and verified, and all inconsistencies are
ignored.
In the \GZC{} the choice to re-run the ranking algorithm was made manually, making it difficult to reproduce.
Therefore, we simplify the experiment by choosing to run the ranking algorithm once.
Consistency checks are not applied because developing these is the point of the graph algorithm.
%When designing this protocol we considered alternatives
%This is primarily due to simplicity.
%--- in part for simplicity and in part because it
% does not change how we interpret the results.
%We also do not apply the consistency checks used in during the \GZC{} because these were mainly based on
% heuristics that are difficult to reproduce and splitting names is difficult without connectivity
%information.
The procedure for \pvar{rank+clf} is similar to the one used for \pvar{ranking}.
The main difference is that we use the automatic classification algorithm to predict pairwise probabilities
for each pair of top ranked query and database annotations.
If the probabilities of a class are above a threshold, then the pair is automatically reviewed.
This has the effect of reducing the total number of manual reviews.
The rest of the procedure is unchanged.
The pairwise classifier is trained on the training set, and the algorithm is evaluated on the test set.
We choose automatic thresholds by finding the thresholds that result in a specified false positive rate on a
validation dataset.
Because \pvar{rank+clf} has no mechanisms for error recovery, we choose conservative automatic thresholds by
specifying an acceptable false positive rate of $0.001$.
This results in positive, negative, and incomparable thresholds of $0.976$, $0.991$, and $0.5$ respectively
for plains zebras, and $0.997$, $0.998$, and $1.0$ for Grévy's.
%When choosing thresholds we enforce a warm-up period of $200$ examples, where the threshold is set to $1$.
%This ensures that we do not automatically classify categories (\eg{} incomparable) with little supporting
% data.
% GRAPH THRESH
%We choose a false positive rate of $0.001$.
%Because this approach has no mechanism for error recovery, we choose conservative classification thresholds
% that achieve a $0\percent$ false positive rate on a validation dataset.
Finally, we quickly recap the procedure for \pvar{graph} which was defined in~\cref{sub:graphalgo}.
The algorithm begins by using LNBNN to search for candidate edges, which are then assigned probabilities and
inserted into a priority queue based on these probabilities.
As candidate edges are removed from the queue they are either automatically or manually classified based on
probability thresholds.
Connectivity information is used to enforce a minimum level of redundancy, to prevent extraneous redundancy,
and to ensure consistency.
The convergence criterion determines when candidate edges should be refreshed and when the algorithm should
terminate.
We use the same pairwise classifier from \pvar{rank+clf}, but due to the graph algorithm's error recovery
mechanisms we can choose more aggressive thresholds.
However, we have found that the algorithm is sensitive to this parameter, therefore we only slightly increase
the acceptable false positive rate from $0.001$ to $0.0014$.
% GRAPH THRESH
This results in positive, negative, and incomparable thresholds of $0.969$, $0.986$, and $0.5$ respectively