\begin{comment}
./texfix.py --fpaths chapter3-matching.tex --outline --asmarkdown --numlines=999 -w
./texfix.py --fpaths chapter3-matching.tex --outline --asmarkdown --numlines=999 -w
./texfix.py --fpaths chapter3-matching.tex --reformat
# http://jaxedit.com/mark/
fixtex --fpaths chapter3-matching.tex --outline --asmarkdown --numlines=999 --shortcite -w && ./checklang.py outline_chapter3-matching.md
\end{comment}
\chapter{IDENTIFICATION USING A RANKING ALGORITHM}\label{chap:ranking}
This chapter addresses the problem of computer-assisted animal identification, where an algorithm suggests
likely possibilities, but a human reviewer always makes the final decision.
Given a single annotation depicting an unknown animal and a database of previously identified annotations,
the task is to determine a ranking of database individuals most likely to match the unknown animal.
A human reviewer manually determines which --- if any --- of the top ranked results are correct.
An \glossterm{annotation} is a rectangular region of interest around a specific animal within an image.
Each annotation is given a \name{} label denoting its individual identity.
A \glossterm{\name{}} refers to a group of annotations known to be the same individual.
The identification process assigns a \name{} label to an unknown annotation either as
(1) the \name{} label of a matched database annotation or
(2) a new \name{} label if no matches are found.
The ranking algorithm is based on the feature correspondences between a query annotation and a set of
database annotations.
In each annotation a set of patch-based features is detected at keypoint locations.
Then the visual appearance of each patch is described using SIFT~\cite{lowe_distinctive_2004}.
A nearest neighbor algorithm establishes a set of feature correspondences between a query and the annotations
in the database.
A scoring mechanism based on local \naive{} Bayes nearest neighbor (LNBNN)~\cite{mccann_local_2012} produces
a score for each feature correspondence.
These scores are then aggregated into a single score for each \name{} in the database, producing a ranked
list of \names{}.
If the top ranked \name{} has a ``high'' score, then it is likely to be the same individual depicted in the
query annotation.
If the top ranked \name{} has a ``low'' score, then it is likely that the query individual is not depicted in
any database annotation.
An example ranked list returned by the algorithm is illustrated in~\cref{fig:rankedmatches}.
For each result in the ranked list we decide if it is a correct or incorrect match.
In the baseline algorithm this identification decision is left to a user.
However, in the next chapter we will consider using an algorithm to automatically verify results returned in
this ranked list.
The outline of this chapter is as follows:
\Cref{sec:annotrepr} discusses the initial processing of an annotation which involves image normalization,
feature extraction, and feature weighting.
\Cref{sec:baselineranking,sec:sver} describe the baseline ranking and scoring algorithm.
The first of these sections focuses on establishing feature correspondences, and the second focuses on
filtering the correspondences.
\Cref{sec:exempselect} describes the process for selecting \exemplars{}.
\Cref{sec:rankexpt} provides an experimental evaluation of the ranking algorithm.
\Cref{sec:staticsum} summarizes this chapter.
\rankedmatches{}
\section{ANNOTATION REPRESENTATION}\label{sec:annotrepr}
For each annotation in the database we (1) normalize the image geometry and intensity, (2) compute features,
and (3) weight the features.
% Chip Extract
Image normalization rotates, crops, and resizes an annotation from its image. This helps to remove background
clutter and roughly align the annotations in pose and scale. The extracted and normalized region is referred to
as a \glossterm{chip}.
% Feat Detect
Then, a set of features --- a keypoint and descriptor pair --- is computed. Keypoints are detected at multiple
locations and scales within the chip, and a texture-based descriptor vector is extracted at each keypoint.
% Featweight
Finally, each feature is assigned a probabilistic weight using a foregroundness classifier. This helps remove
the influence of background features.
\subsection{Chip extraction}
% Bounding box + orientation
Each annotation has a bounding box and an orientation specified in a previous detection step. For zebras
and giraffes, the orientation of the chip is chosen such that the top of the bounding box is roughly
parallel to the back of the animal.
A chip is extracted by jointly rotating, scaling, and cropping an annotation's parent image using Lanczos
resampling~\cite{lanczos_applied_1988}. The scaling resizes the image such that the cropped chip has
approximately $450^2$ pixels while maintaining the aspect ratio of the bounding box. If specified in the
pipeline configuration, adaptive histogram equalization~\cite{pizer_adaptive_1987} is applied to the chip;
however, this is not used in the experimental evaluation presented later in this chapter.
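To make the chip geometry concrete, the following is a minimal sketch (not the implementation used in this work) of chip extraction with OpenCV, assuming an axis-aligned bounding box and an orientation in radians: the chip dimensions are chosen to give roughly $450^2$ pixels at the bounding box aspect ratio, and Lanczos interpolation is used for resampling. All names are illustrative.
\begin{verbatim}
import cv2
import numpy as np

def extract_chip(image, bbox, theta, target_area=450 ** 2):
    """Sketch: rotate, scale, and crop an annotation into a chip."""
    x, y, w, h = bbox
    # Pick chip dimensions with ~target_area pixels and the bbox aspect ratio.
    scale = np.sqrt(target_area / float(w * h))
    chip_w, chip_h = int(round(w * scale)), int(round(h * scale))
    # Affine transform: rotate about the bbox center, scale, then translate
    # so the bbox center lands on the chip center.
    center = (x + w / 2.0, y + h / 2.0)
    M = cv2.getRotationMatrix2D(center, np.degrees(theta), scale)
    M[0, 2] += chip_w / 2.0 - center[0]
    M[1, 2] += chip_h / 2.0 - center[1]
    return cv2.warpAffine(image, M, (chip_w, chip_h),
                          flags=cv2.INTER_LANCZOS4)
\end{verbatim}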
\subsection{Keypoint detection and description}
Keypoints are detected within each annotation's chip using a modified implementation of the Hessian
detector described in~\cite{perdoch_efficient_2009} and reviewed in~\cref{sec:featuredetect}.
This produces a set of elliptical features localized in space, scale, shape, and orientation.
Each keypoint is described using the SIFT~\cite{lowe_distinctive_2004} descriptor that was reviewed
in~\cref{sec:featuredescribe}.
The resulting unordered set of keypoint-descriptor pairs constitutes an annotation's features.
Further details about the keypoint structure are given in~\cref{sec:kpstructure}.
We choose a baseline feature detection algorithm that produces affine invariant keypoints with the gravity
vector. Affine invariant (\ie{} shape adapted) keypoints detect elliptical patches instead of circular
ones. We choose affine invariant keypoints because the animals we identify will be seen from many
viewpoints. Because all chips have been rotated into an upright position, we assign all keypoints a
constant orientation --- this is the gravity vector assumption~\cite{perdoch_efficient_2009}. However,
these baseline settings may not be appropriate for all species.
It is important to select the appropriate level of invariance for each species we identify.
% Shape
In our experiments in~\cref{sub:exptinvar}, we will vary several parameters related to invariance in
keypoint detection.
To determine if affine invariance is appropriate for animal identification we experiment with both
circular and elliptical keypoints.
% Orientation
We also experiment with different levels of orientation invariance.
The gravity vector assumption holds in the case of rigid non-poseable objects (\eg{} buildings), if the
image is upright.
Clearly, for highly poseable animals, this assumption is more questionable.
However, full rotation invariance (using dominant gradient orientations) has intuitive problems.
Patterns (like ``V'' and ``$\Lambda$'') that might contribute distinguishing evidence that two
annotations match would always appear identical under full rotation invariance.
Ideally orientation selection would be made based on the pose of the animal.
We introduce a simple orientation heuristic to help match keypoints from the same animal in the presence of
small pose variations. Instead of extracting a single keypoint in the direction of gravity or multiple
keypoints in the directions of the dominant gradient orientations, we extract three descriptors at every
keypoint: one in the direction of gravity, and the other two offset at $\pm15\degrees$ from the gravity
direction. This provides a middle ground between rotation invariance and strict use of the gravity vector.
Using this heuristic, it will be more likely to extract similar descriptors from two annotations of the
same animal seen from slightly different poses.
\subsection{Feature weighting}
In animal identification, there will often be many annotations containing the same background.
Photographers may take many photos in a single place and camera traps will contribute many images with the
same background. Without accurate background masking, regions of an annotation from different images
containing the same background may strongly match and outscore matches to correct individuals. An example
illustrating two different individuals seen in front of the same distinctive background is shown
in~\cref{fig:SceneryMatch}. To account for this, each feature is given a weight based on its probability of
belonging to the foreground --- its ``foregroundness''. This weight is used to indicate the importance of a
feature in scoring and spatial verification.
Foregroundness is derived from a species detection algorithm developed by
Parham~\cite{parham_photographic_2015}. The input to the species detection algorithm is the annotation's
chip, and the output is an intensity image. Each pixel in the intensity image represents the likelihood
that it is part of a foreground object.
A foregroundness weight is computed for each keypoint in an annotation as follows: The
region around the keypoint in the intensity image is warped into a normalized reference frame. Each pixel
in the normalized intensity patch is weighted using a Gaussian falloff based on the pixel's distance from
the center of the patch. The sum of these weighted intensities is the feature's foregroundness weight. The
steps of feature weight computation are illustrated in~\cref{fig:fgweight}.
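As a concrete illustration of this weighting, the sketch below aggregates a normalized foregroundness patch into a single scalar using a Gaussian falloff from the patch center; the falloff width is an assumption, not a value taken from the text, and the names are illustrative.
\begin{verbatim}
import numpy as np

def foregroundness_weight(fg_patch, sigma_frac=0.3):
    """Aggregate a normalized foregroundness patch (values in [0, 1]) into a
    single weight using a Gaussian falloff from the patch center (sketch)."""
    h, w = fg_patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    sigma = sigma_frac * max(h, w)      # assumed falloff width
    gauss = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    gauss /= gauss.sum()                # weights sum to one
    return float((gauss * fg_patch).sum())
\end{verbatim}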
\SceneryMatch{}
\fgweight{}
\subsection{Keypoint structure overview}\label{sec:kpstructure}
%Before we discuss the computation of the we review the structure of
% a keypoint.
The keypoint of a feature is represented as $\kp\tighteq(\pt, \vmat, \ori)$. %
The vector $\pt\tighteq\ptcolvec$ is the feature's $xy$-location. The scalar $\theta$ is the keypoint
orientation. The lower triangular matrix $\vmat\tighteq\VMatII$ encodes the keypoint's shape and scale.
This matrix skews and scales a keypoint's elliptical shape into a unit circle. A keypoint is circular when
$a\tighteq{}d$ and $c\tighteq0$.
%If $c\tighteq0$, there is no skew and if $a\tighteq{}d$ the keypoint
% is circular.
The keypoint scale is related to the determinant of this matrix and can be extracted as: %
$\sigma = \frac{1}{\sqrt{\detfn{\vmat}}} = \frac{1}{\sqrt{ad}}$. All of this information can be encoded in
a single affine matrix.
\subsubsection{Encoding keypoint parameters in an affine matrix}
It will be useful to construct two transformations that encode all keypoint information in a single matrix.
The first, $\rvmat$, maps a keypoint in an annotation into a normalized reference frame --- the unit
circle. The second transformation, $\inv{\rvmat}$ is the inverse, which warps the normalized reference
frame back onto the keypoint. To construct $\rvmat$, the keypoint is centered at the origin $(0, 0)$ using
translation matrix, $\mat{T}$. Then $\vmat$ is used to skew and scale the keypoint into a unit circle.
Finally, the keypoint's orientation is normalized by rotating $-\theta$ radians using a rotation matrix
$\mat{R}$.
\begin{equation}\label{eqn:RVTConstruct}
%\rvmat=\paren{\invrotMatIII{\paren{-\ori}} \VMatIII \transMATIII{-x}{-y}}
\rvmat=\mat{R} \vmat \mat{T} = \rotBigMatIII{\paren{-\ori}} \VBigMatIII \transBigMatIII{-x}{-y}
\end{equation}
The construction of $\inv{\rvmat}$ is performed similarly.
% see vtool.keypoint.get_invVR_mats_oris
\begin{equation}\label{eqn:invTVRConstruct}
\inv{\rvmat} = \inv{\mat{T}} \inv{\vmat} \inv{\mat{R}} =
\transBigMatIII{x}{y}
\BIGMAT{
\frac{1}{a} & 0 & 0\\
-\frac{c}{a d} & \frac{1}{d} & 0\\
0 & 0 & 1
}
\rotBigMatIII{\paren{\ori}}
\end{equation}
\subsubsection{Extracting keypoint parameters from an affine matrix}
During the spatial verification step, described in~\cref{sec:sver}, keypoints are warped from one image
into the space of another. It will be useful to extract the keypoint parameters from an arbitrary keypoint
matrix. This will allow us to directly compare properties of corresponding features. Given an arbitrary
affine matrix $\inv{\rvmat}$ representing keypoint $\kp$, we show how the individual parameters $(\pt,
\scale, \ori)$ can be extracted. First consider the components of $\inv{\rvmat}$ by simplifying the right
side of~\cref{eqn:invTVRConstruct}.
\begin{equation}\label{eqn:ArbInvRVTMat}
\inv{\rvmat} =
\BIGMAT{
e & f & x\\
g & h & y\\
0 & 0 & 1
} =
\BIGMAT{
\frac{1}{a} \cos{(\theta )} & -\frac{1}{a} \sin{(\theta )} & x\\
\frac{1}{d} \sin{(\theta )} - \frac{c}{a d} \cos{(\theta )} & \frac{1}{d} \cos{(\theta )} + \frac{c}{a d} \sin{(\theta )} & y\\
0 & 0 & 1
}
\end{equation}
%%---------
The position, scale, and orientation can be extracted from an arbitrary affine keypoint shape matrix
$\invvrmat$ as follows:
\begin{equation}\label{eqn:affinewarp}
\begin{aligned}
\pt &= \VEC{x\\y} \\
\scale &= \sqrt{\detfn{\invvrmat}}\\
\ori &= \modfn{\paren{-\atantwo{f, e}}}{\TAU}
\end{aligned}
\end{equation}
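The construction and decomposition above amount to a few lines of linear algebra. The following numpy sketch (illustrative names, not the library code) builds $\inv{\rvmat}$ from the keypoint parameters and recovers the position, scale, and orientation as in~\cref{eqn:affinewarp}.
\begin{verbatim}
import numpy as np

def build_invVR(x, y, a, c, d, theta):
    """Construct inv(RVT) = T^-1 V^-1 R^-1 for one keypoint (sketch)."""
    T_inv = np.array([[1.0, 0.0, x], [0.0, 1.0, y], [0.0, 0.0, 1.0]])
    V_inv = np.array([[1.0 / a, 0.0, 0.0],
                      [-c / (a * d), 1.0 / d, 0.0],
                      [0.0, 0.0, 1.0]])
    R_inv = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                      [np.sin(theta),  np.cos(theta), 0.0],
                      [0.0, 0.0, 1.0]])
    return T_inv @ V_inv @ R_inv

def decompose_invVR(invVR):
    """Recover (x, y), scale, and orientation from an affine keypoint matrix."""
    e, f = invVR[0, 0], invVR[0, 1]
    x, y = invVR[0, 2], invVR[1, 2]
    scale = np.sqrt(np.linalg.det(invVR[0:2, 0:2]))
    ori = (-np.arctan2(f, e)) % (2 * np.pi)
    return (x, y), scale, ori
\end{verbatim}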
%\newcommand{\annotscoreop}{\opname{annot\_score}}
%\newcommand{\namescoreop}{\opname{name\_score}}
\newcommand{\annotscoreop}{\opname{K_{\tt annot}}}
\newcommand{\amechscoreop}{\opname{K_{\csum}}}
\newcommand{\fmechscoreop}{\opname{K_{\nsum}}}
\section{MATCHING AGAINST A DATABASE OF INDIVIDUAL ANIMALS}\label{sec:baselineranking}
To identify a query annotation, it is matched against a database of known \names{}. \Aan{\name{}} is a set of
annotations known to depict the same animal. The basic matching pipeline can be summarized in \three{} steps:
(1) establish feature correspondences, %
(2) score feature correspondences, and %
(3) aggregate feature correspondence scores across the \names{}. Correspondences between a query annotation's
features and \emph{all} database annotation features are established using an approximate nearest neighbor
algorithm. This step also establishes a normalizing feature, which is used to measure the distinctiveness of a
query feature. Each feature correspondence is scored based on the feature weights established in the previous
section and a measure of the distinctiveness of the query feature. The feature correspondence scores are then
aggregated into a \glossterm{\namescore{}} for each \name{} in the database. The \namescores{} induce a ranking
on \names{} in the database where database \names{} with higher ranks are more likely to be correct matches.
\subsection{Establishing initial feature correspondence}\label{sub:featmatch}
\subsubsection{Offline indexing}
Before feature correspondences can be established, an offline algorithm indexes descriptors from all
database annotations for fast approximate nearest neighbor search. All database descriptor vectors are
stacked into a single array of vectors, %
$\AnyDB$, % should this be removed?
%
and these descriptors are indexed by an inverted index. The inverted index maps each descriptor in the
stacked array back to its original annotation and feature. This database array is indexed for nearest
neighbor search using a forest of kd-trees~\cite{silpa_anan_optimised_2008} using the FLANN
library~\cite{muja_fast_2009}, both of which were reviewed in~\cref{sec:ann}. This allows for the efficient
implementation of a \codeobj{neighbor index function} %
$\NN(\AnyDB, \desc, \K)$ %
%$\NN(\desc, \K)$ %
that returns the indices in $\AnyDB$ of the $\K$ approximate nearest neighbors of a query feature's
descriptor $\desc$.
%that returns the database indices of the $\K$ approximate
% nearest neighbors of a query feature's descriptor
%$\desc$.
\subsubsection{Approximate nearest neighbor search}
Matching begins by establishing multiple feature correspondences between each query feature and several
visually similar database features. For each query descriptor vector $\desc_i \in \X$ the $\K + \Knorm$
approximate nearest neighbors are found using the \coderef{neighbor index function}. These neighbors
sorted by ascending distance are:
\begin{equation}
\NN(\AnyDB, \desc_i, \K + \Knorm) \eqv \dotarrIII{j}{\K}{\K + \Knorm}
%\NN(\desc_i, \K + \Knorm) \eqv \dotarrIII{j}{\K}{\K + \Knorm}
\end{equation}
The $\K$ nearest neighbors, $\dotsubarr{\desc}{{j_1}}{{j_\K}}$, are the initial feature correspondences
to the $i\th{}$ query feature. The remaining $\Knorm$ neighbors,%
$\dotsubarr{\desc}{{j_{\K + 1}}}{j_{\K + \Knorm}}$, are candidate normalizers for use in LNBNN scoring.
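The neighbor query and its split into correspondences and normalizer candidates can be sketched as follows; a SciPy kd-tree stands in for the FLANN kd-tree forest used in practice, and all names are illustrative.
\begin{verbatim}
import numpy as np
from scipy.spatial import cKDTree

def query_neighbors(db_descriptors, query_descriptor, K=4, Knorm=1):
    """Return the K correspondence candidates and the Knorm normalizer
    candidates for one query descriptor (sketch)."""
    tree = cKDTree(db_descriptors)        # built offline in practice
    dists, idxs = tree.query(query_descriptor, k=K + Knorm)
    # Neighbors come back sorted by ascending distance.
    match_idxs = idxs[:K]                 # initial feature correspondences
    norm_idxs = idxs[K:]                  # candidate normalizers for LNBNN
    return match_idxs, norm_idxs, dists
\end{verbatim}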
\subsubsection{Normalizer selection}
A single descriptor $\descnorm_{i}$ is selected from the $\Knorm{}$ candidate normalizers and used in
computing the LNBNN score for all (up to $\K$) of the $i\th{}$ query descriptor's correspondences in
the database. The purpose of a normalizing descriptor is to estimate the local density of descriptor
space, which can be interpreted as a measure of the query descriptor's distinctiveness \wrt{} the
database. The normalizing descriptor is chosen as the most visually similar descriptor to the query
that is not a correct match. In other words, the query descriptor's normalizer should be from an
individual different from the query. The intuition is there will not be any features in the database
that are close to distinctive features in the query except for the features that belong to the correct
match.
The selection process described in the original formulation of LNBNN is to simply choose the %
$\K + 1\th{}$ nearest neighbor, which amounts to setting $\Knorm=1$. The authors of LNBNN find that there is
no benefit to using a higher value of $\Knorm$~\cite{mccann_local_2012}. However, this does not account
for the case when the $\K + 1\th{}$ nearest neighbor belongs to the same class as one of
the nearest $\K$ neighbors. Therefore, we employ a slightly different selection process. To motivate
our selection process, consider the case when there are more than $\K$ images of the same individual
from the same viewpoint in the database and a distinctive feature from a new annotation of that
individual is being scored. In this case $\K$ correspondences will be correctly established between the
distinctive query feature and $\K$ database features. However, if the normalizer is chosen as the
$\K + 1\th{}$ neighbor, then these correspondences will be inappropriately downweighted.
% probably need to reword.
To gain an intuition of the LNBNN scoring mechanism, consider the example in~\cref{fig:knorm}.
The figure shows the case where $\K=3$ and $\Knorm=1$.
In this case there are two examples \Cref{sub:knormb,sub:knormc} of the query individual in the
database.
Even though there is an incorrect match, the LNBNN scores of the correct matches are an order of
magnitude higher than the score for the incorrect match.
Now, consider the case where the number of correct matches in the database is greater than $\K$ by
setting $\K=1$.
In this case the normalizing descriptor is the ``same'' feature as the query feature and the nearest
match drops from $0.066$ to $0.007$.
\knorm{}
To avoid this case, a normalizing feature is carefully chosen to reduce the possibility that it belongs
to a potentially correct match. More formally, the normalizing descriptor is chosen to be the
descriptor with the smallest distance to the query descriptor that is not from the same \name{} as any
of the chosen correspondences. Let $\nid_j$ be the \name{} associated with the annotation containing
descriptor $\desc_j$. Let %
$\multiset{N}_i \eqv \curly{\nid_j \where j \in \NN(\AnyDB, \desc_i, \K)}$
be the set of \names{} matched by the $i\th{}$ query feature. The descriptor that normalizes all
matches of query descriptor $\desc_i$ is:
\begin{equation}
\descnorm_{i} \eqv
\argmin{\desc_j \in \dotsubarr{\desc}{{j_{\K + 1}}}{j_{\K + \Knorm}}}
\elltwosqrd{\desc_j - \desc_i} \where \nid_j \notin \multiset{N}_i
\end{equation}
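A minimal sketch of this selection rule is shown below, assuming arrays of candidate-normalizer indices and distances, and a lookup from database feature index to \name{} label; the names are hypothetical.
\begin{verbatim}
import numpy as np

def select_normalizer(cand_idxs, cand_dists, feat_to_name, matched_names):
    """Choose the closest candidate normalizer whose name is not among the
    names already matched by this query feature (sketch)."""
    for k in np.argsort(cand_dists):      # closest candidates first
        j = cand_idxs[k]
        if feat_to_name[j] not in matched_names:
            return j                      # database index of the normalizer
    return None                           # no valid normalizer found
\end{verbatim}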
\subsection{Feature correspondence scoring}
Each feature correspondence is given a score representing how likely it is to be a correct match. While the
L2-distance between query and database descriptors is useful for ranking feature correspondences based on
visual similarity, the distinctiveness of the match is more useful for ranking the query annotation's
similarity to a database annotation~\cite{lowe_distinctive_2004, arandjelovic_dislocation_2014,
mccann_local_2012}. However, highly distinctive matches from other objects --- like background matches ---
do not provide relevant information about a query annotation's identity and should not contribute to the
final score. Therefore, each feature correspondence is scored using a mechanism that combines both
distinctiveness and likelihood that the object belongs to the foreground. For each feature correspondence
$m = (i, j)$ with query descriptor $\desc_i$ and matching database descriptor $\desc_j$, several scores are
computed which are then combined into a single feature correspondence score $s_{i,j}$.
\subsubsection{LNBNN score}\label{sec:lnbnnscore}
The LNBNN scoring mechanism measures the difference in similarity between the query descriptor and
(1) its match and
(2) its normalizer $\descnorm_{i}$.
This serves as an estimate of local feature density and measures the distinctiveness of the feature
correspondence.
A match is distinctive when the query-to-match distance is much smaller than the query-to-normalizer
distance, \ie{} the local density of descriptor space around the query is sparse.
The LNBNN score of a feature match is computed as follows:
\begin{equation}\label{eqn:lnbnn}
\fs_{\LNBNN} \eqv \frac{\elltwo{\desc_i - \descnorm_{i}} - \elltwo{\desc_i - \desc_j}}{Z}
\end{equation}
All descriptors used in this calculation are L2-normalized to unit length --- \ie{} to sit on the
surface of a unit hypersphere.
The $Z$ term normalizes the score to ensure that it is in the range $\rangeinin{0, 1}$.
If descriptor vectors have only non-negative components (as in the case of
SIFT~\cite{lowe_distinctive_2004}), then the maximum distance between any two L2-normalized
descriptors is $Z\tighteq\sqrt{2}$.
If descriptor vectors have negative components (like those that might be extracted from a deep
convolutional neural network~\cite{zagoruyko_learning_2015}), then the maximum distance between them
is $Z\tighteq2$.
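In code, the LNBNN score of one correspondence is a difference of two L2 distances scaled by $Z$; the sketch below assumes unit-length descriptors with non-negative components, so $Z = \sqrt{2}$.
\begin{verbatim}
import numpy as np

def lnbnn_score(query_desc, match_desc, norm_desc, Z=np.sqrt(2)):
    """(query-to-normalizer distance) - (query-to-match distance), scaled by Z.
    Descriptors are assumed L2-normalized to unit length (sketch)."""
    d_match = np.linalg.norm(query_desc - match_desc)
    d_norm = np.linalg.norm(query_desc - norm_desc)
    return (d_norm - d_match) / Z
\end{verbatim}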
\subsubsection{Foregroundness score}
To reduce the influence of background matches, each feature correspondence is assigned a score based on
the foregroundness of both the query and database features. The geometric mean of the foregroundness of
query feature, $w_i$, and database feature, $w_j$, drives the score to $0$ if either is certain to be
background. % show python -m ibeis --db testdb3 --query 325 -y
\begin{equation}
\fs_{{\tt fg}} \eqv \sqrt{w_i w_j}
\end{equation}
\subsubsection{Final feature correspondence score}
The final score of the correspondence $(i, j)$ captures both the distinctiveness of the match and the
likelihood that the match is part of the foreground.
\begin{equation}\label{eqn:featscore}
\fs_{i,j} \eqv \fs_{{\tt fg}} \fs_{\LNBNN}
\end{equation}
\subsection{Feature score aggregation}\label{subsec:namescore}
So far, each feature in a query annotation has been matched to several features in the database and a score
has been assigned to each of these correspondences based on its distinctiveness and foregroundness. The
next step in the identification process is to aggregate the scores from these patch-based correspondences
into a single \glossterm{\namescore} for each \name{} in the database. Note that this \name-based
definition of scoring is a key difference between animal identification and image retrieval, where a score
is assigned to each image in the database. In animal identification the analogous concept is an
\glossterm{\annotscore} --- a score assigned to each annotation in the database~\cite{philbin_object_2007}.
This distinction between a score from a query annotation to a database annotation is important because the
goal of the application is to classify a new query annotation as either a known \name{} or as a new
\name{}, not to determine which annotations are most similar.
This subsection presents two mechanisms to compute \namescores{}.
The first mechanism is \csumprefix{} and computes a \namescore{} in two steps.
This mechanism aggregates feature correspondence scores into an \annotscore{} for each annotation in the
database.
Then the \annotscores{} are aggregated into a score for each \name{} in the database.
The second mechanism is \nsumprefix{}.
This mechanism aggregates feature correspondence scores matching multiple database annotations directly
into a \namescore{}.
These mechanisms are respectively similar to the image-to-image distance and the image-to-class distance
described in~\cite{boiman_defense_2008}.
\subsubsection{The set of all feature correspondences}
All scoring mechanisms presented in this subsection are based on aggregating scores from features
correspondences. The set of all feature correspondences for a query annotation $\X$ is expressed as:
\begin{equation}
\Matches \eqv \{(i, j) \where \desc_i \in \X \AND j \in \NN(\AnyDB, \desc_i, \K)\}
\end{equation}
% SAME AS IMAGE-TO-IMAGE
\subsubsection{Annotation scoring}
An \annotscore{} is a measure of similarity between two annotations.
An \annotscore{} between a query annotation and a database annotation is defined as the sum of the
feature correspondence scores matching to the features from that database annotation.
Let $\daid_j$ be the database annotation containing feature $j$.
Let
%
$\Matches_{\daid} \eqv \curly{(i, j) \in \Matches \where \daid_j = \daid}$
%
denote all the correspondences to a particular database annotation.
The \annotscore{} between the query annotation $\qaid$ and database annotation $\daid$ is:
\begin{equation}
\annotscoreop(\qaid, \daid) \eqv \sum_{(i, j) \in \Matches_{\daid}} \fs_{i, j}
\end{equation}
% SAME AS IMAGE-TO-CLASS
\subsubsection{Name scoring (1) --- \csumprefix{}} %
The \cscoring{} mechanism aggregates \annotscores{} into \namescores{} by simply taking the maximum
\annotscore{} over all annotations belonging to a \name{}.
In our experiments we refer to this version of \namescoring{} as \csum{}.
Let $\nid$ be the set of database annotations with the same \name{}.
The \cscore{} between a query annotation and a database \name{} is:
%the maximum over all annotations scores in that \name{}:
\begin{equation}
\amechscoreop(\qaid, \nid)
\eqv
\max_{\daid \in \nid}
\paren{
\annotscoreop\paren{\qaid, \daid}
}
\end{equation}
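Both the \annotscore{} and the \cscore{} reduce to simple aggregations of the correspondence scores; the sketch below uses hypothetical data structures to show the sum per database annotation followed by the maximum over a \name{}'s annotations.
\begin{verbatim}
from collections import defaultdict

def annot_scores(matches):
    """matches: iterable of (query_feat, db_feat, db_annot, score).
    Returns the summed correspondence score per database annotation (sketch)."""
    scores = defaultdict(float)
    for _, _, daid, fs in matches:
        scores[daid] += fs
    return scores

def csum_name_score(matches, annots_of_name):
    """csum: the maximum annot score over the annotations of one name."""
    scores = annot_scores(matches)
    return max(scores.get(daid, 0.0) for daid in annots_of_name)
\end{verbatim}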
\subsubsection{Name scoring (2) --- \nsumprefix{}} %
The \cscoring{} mechanism accounts for the fact that animals will be seen multiple times, but it does
not take advantage of complementary information available when \aan{\name{}} has multiple
annotations.
The following aggregation mechanism combines scores on a feature level to correct for this.
It allows each query feature at a specific location to vote for a given \name{} at most once.
Thus, when a query feature (or multiple query features at the same location) correspond(s) to
database features from multiple views of the same animal, only the best correspondence for that
feature will contribute to the score.
In our experiments we refer to this version of \namescoring{} as \nsum{}.
The first step of computing \aan{\namescore{}} for a specific \name{} is grouping the feature
correspondences.
Two feature correspondences are in the same group if the query features have the same location and
the database features belong to the same \name{}.
The next step is to choose the highest scoring correspondence within each group.
The sum of the chosen scores is the score for \aan{\name{}}.
This procedure is illustrated in~\cref{fig:namematch}.
\newcommand{\MatchesGroup}{\Matches^{G}}
Formally, consider two feature correspondences $\mI\tighteq(\iI, \jI)$ and $\mII\tighteq(\iII,
\jII)$.
Let $\pt_{\iI}$ and $\pt_{\iII}$ be the $xy$-location of the query feature in each correspondence.
Let $\nid_{\jI}$ and $\nid_{\jII}$ be the \name{} of the database annotations containing the matched
features.
The group that contains feature correspondence $\mI$ is defined as:
\begin{equation}
\MatchesGroup_{\mI} \eqv \curly{\mII \in \Matches \where
\paren{
\paren{\pt_{\iI} \eq \pt_{\iII}} \AND
\paren{\nid_{\jI} \eq \nid_{\jII}}
}
}
\end{equation}
The correspondence with the highest score in each group is flagged as chosen. Ties are
broken arbitrarily.
\begin{equation}
\ischosen(\mI) \eqv
\bincase{
\paren{
\fs_{\mI} > \fs_{\mII}
\quad \forall \mII \in \MatchesGroup_{\mI} \setminus \curly{\mI}
}
\OR
\card{\MatchesGroup_{\mI}} \eq 1
}
\end{equation}
Let $\Matches_{\nid} \eqv \{(i, j) \in \Matches \where
\nid_j = \nid\}$ denote all the correspondences to a particular
\name{}.
The \nscore{} of \aan{\name{}} is:
\begin{equation}
\fmechscoreop(\qaid, \nid)
\eqv
\sum_{m \in \Matches_{\nid}} \ischosen(m) \; \fs_m
\end{equation}
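The \nsumprefix{} mechanism can be sketched as a group-by on the query feature location followed by a per-group maximum and a sum over groups; names below are illustrative.
\begin{verbatim}
from collections import defaultdict

def nsum_name_score(matches, nid):
    """matches: iterable of (query_xy, db_name, score) correspondences.
    nsum: each query location votes for a name at most once (sketch)."""
    best_per_group = defaultdict(float)
    for query_xy, db_name, fs in matches:
        if db_name != nid:
            continue
        key = tuple(query_xy)             # group: same query feature location
        best_per_group[key] = max(best_per_group[key], fs)
    return sum(best_per_group.values())   # sum of the chosen correspondences
\end{verbatim}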
\namematch{}
\FloatBarrier{}
\section{SPATIAL VERIFICATION}\label{sec:sver}
The basic ranking algorithm treats each annotation as an orderless set of feature descriptors (with the
exception of name scoring, which uses a small amount of spatial information). This means that many of the
initial feature correspondences will be spatially inconsistent. Spatial verification removes these spatially
inconsistent feature correspondences. Determining which features are inconsistent is done by first estimating
an affine transform between the two annotations. Then a projective transform is estimated using the inliers to
the affine transform. Finally, any correspondences that do not agree with the projective transformation
are removed~\cite{fischler_random_1981, philbin_object_2007}. This process is illustrated in
\cref{fig:sver}. We have reviewed related work in spatial verification in~\cref{subsec:sverreview}.
%In our problem, the animals are seen in a wide variety of poses,
% and projective transforms may not always be sufficient to capture
% all correctly corresponding features.
%Yet, without strong spatial constraints on matching, many
% background features will be spatially verified.
%For now, we proceed with standard techniques for spatial
% verification and evaluate if more sophisticated methods are needed.
\sver{}
\subsection{Shortlist selection}
Standard methods for spatial verification are defined on the feature correspondences between two
annotations (images). Normally, a shortlist of the top ranked annotations is passed onto spatial
verification. However, in our application we rank \names{}, which may have multiple annotations. In our
baseline approach we simply apply spatial verification to the top $N_{\tt{nameSL}}=40$ \names{} and the top
$N_{\tt{annotSL}}=3$ annotations within those \names{}.
\subsection{Affine hypothesis estimation}
Here, we will compute an affine transformation that will remove a majority of the spatially inconsistent
feature correspondences.
Instead of using random sampling of the feature correspondences as in the original RANSAC
algorithm~\cite{hartley_multiple_2003}, we estimate affine hypotheses using a deterministic method
similar to~\cite{philbin_object_2007, chum_homography_2012}.
Given a set of matching features between annotations $\annotI$ and $\annotII$, the shape, scale,
orientation, and position of each pair of matching keypoints are used to estimate a hypothesis affine
transformation.
Each hypothesis transformation warps keypoints from annotation $\annotI$ into the space of $\annotII$.
Inliers are estimated by using the error in position, scale, and orientation between each warped keypoint
and its correspondence.
Each inlier is weighted by its feature correspondence score.
The final affine transform is chosen to be the one that produces the maximum weight set of inliers.
% vmat = V here
% V maps from ellipse to u-circle
\newcommand{\AffMat}{\mat{A}}
\newcommand{\HypothSet}{\set{A}}
\newcommand{\AffMatij}{\mat{A}_{i, j}}
\newcommand{\HypothAffMat}{\hat{\mat{A}}}
\subsubsection{Enumeration of affine hypotheses}
%The deterministic set of hypothesis transformations mapping from
% query annotation $\annotI$ to database annotation $\annotII$ is
% computed for each feature correspondence in a match from to an
% annotation.
Let $\Matches_{\annotII}$ be the set of all correspondences between features from query annotation
$\annotI$ to database annotation $\annotII$.
For each feature correspondence $(i, j) \in \Matches_{\annotII}$, we construct a hypothesis
transformation, $\AffMatij$, using the matrices $\rvmat_{i}$ and $\inv{\rvmat_{j}}$, which were
defined in~\cref{eqn:RVTConstruct,eqn:invTVRConstruct}.
The first transformation $\rvmat_{i}$, warps points from $\annotI$-space into a normalized reference
frame.
Then the second transformation, $\inv{\rvmat_{j}}$, warps points in the normalized reference frame
into $\annotII$-space.
Formally, the hypothesis transformation is defined as $\AffMatij \eqv \inv{\rvmat_{j}}\rvmat_{i}$,
and the set of hypothesis transformations is:
\begin{equation}
\HypothSet \eqv \curly{ \AffMatij \where (i, j) \in \Matches_{\annotII} }
\end{equation}
\subsubsection{Measuring the affine transformation error}
The transformation $\AffMatij$ perfectly aligns the corresponding $i\th{}$ query feature with the
$j\th{}$ database feature in the space of the database annotation ($\annotII$).
If the correspondence is indeed correct, then we can expect that other corresponding features will be
well aligned by the transformation.
The next step is to determine how close the other transformed features from the query annotation
($\annotI$) are to their corresponding features in database annotation ($\annotII$).
This can be measured using the error in distance, scale, and orientation for every correspondence.
The following procedure is repeated for each hypothesis transform %
$\AffMatij \in \HypothSet$.
Note that the following description is in the context of the correspondence between the $i\th{}$
query feature and the $j\th{}$ database feature, and the $i,j$ suffix is omitted for clarity.
In this context, the suffixes $\idxI$ and $\idxII$ will be used to index into feature
correspondences.
Let $\set{B}_{\idxI} = \curly{\invvrmatI \where (\idxI, \idxII) \in \Matches_{\annotII}}$ be the set of
keypoint matrices in the query annotation with correspondences to database annotation $\annotII$. Given
a hypothesis transform $\AffMat$, each query keypoint in the set of matches
%(mapping from the normalized reference frame to feature space)
$\invvrmatI \in \set{B}_{\idxI}$, is warped into $\annotII$-space:
\begin{equation}
\warp{\invvrmatI} = \AffMat \invvrmatI
\end{equation}
%---
The warped position $\warp{\ptI}$, scale $\warp{\scaleI}$, and orientation $\warp{\oriI}$, can be
extracted from $\warp{\invvrmatI}$ using~\cref{eqn:affinewarp}. The warped query keypoint properties are
now in $\annotII$-space and can be directly compared to the keypoint properties of their database
correspondences.
%Each warped point is checked for consistency with its
% correspondence's $\ptII$, scale $\scaleII$, and orientation
% $\oriII$, in $\annotII$.
The absolute distance in position, scale, and orientation between the $\idxI\th{}$ query feature and
the $\idxII\th{}$ database feature with respect to hypothesis transformation $\AffMat$ is measured as
follows:
\begin{equation}\label{eqn:inlierdelta}
\begin{aligned}
\Delta \pt_{\idxI, \idxII} & \eqv \elltwo{\warp{\ptI} - \ptII}\\
\Delta \scale_{\idxI, \idxII} & \eqv \max(
\frac{\warp{\scaleI}}{\scaleII},
\frac{\scaleII}{\warp{\scaleI}}) \\
\Delta \ori_{\idxI, \idxII} & \eqv \min(
\modfn{\abs{\warp{\oriI} - \oriII}}{\TAU},\quad
\TAU - \modfn{\abs{\warp{\oriI} - \oriII}}{\TAU})
\end{aligned}
\end{equation}
\subsubsection{Selecting affine inliers}
Any keypoint correspondence $(\idxI, \idxII) \in \Matches_{\annotII}$ is considered an inlier \wrt{}
$\AffMat$ if its absolute differences in position, scale, and orientation are all within a spatial
distance threshold $\xythresh$, scale threshold $\scalethresh$, and orientation threshold
$\orithresh$.
This is expressed using the function $\isinlierop$, which determines if a match is an inlier:
%\begin{equation}
%\label{eqn:inlierchecks}
% \begin{gathered}
% %\begin{aligned}
% \txt{isinlier}(\kp_1, \kp_2) \rightarrow \elltwo{\pt_1' - \pt_2} < \xythresh \AND \\
% %-----
% {\frac{\scale_1'}{\scale_2} < \scalethresh \txt{ if }
% \paren{\scale_1' > \scale_2} \txt{ else }
% \frac{\scale_2}{\scale_1'} < \scalethresh} \AND\\
% %-----
% \txt{minimum}(
% \modfn{\abs{\ori_1' - \ori_2}}{\TAU},
% \TAU - \modfn{\abs{\ori_1' - \ori_2}}{\TAU}) < \orithresh
% %\end{aligned}
% \end{gathered}
%\end{equation}
\begin{equation}\label{eqn:inlierchecks}
\begin{gathered}
\isinlierop((\idxI, \idxII), \AffMat) \eqv
\Delta \pt_{\idxI, \idxII} < \xythresh \AND
\Delta \scale_{\idxI, \idxII} < \scalethresh \AND
\Delta \ori_{\idxI, \idxII} < \orithresh
\end{gathered}
\end{equation}
The set of inlier matches for a hypothesis transformation $\AffMat$ can then be written as:
\begin{equation}\label{eqn:affinliers}
\Matches_{\AffMat} \eqv \curly{m \in \Matches_{\annotII} \where \isinlierop(m, \AffMat)}
\end{equation}
The best affine hypothesis transformation, $\HypothAffMat$, maximizes the weighted sum of inlier scores.
\begin{equation}
\HypothAffMat \eqv \argmax{\AffMat \in \HypothSet}
\sum_{(\idxI, \idxII) \in \Matches_{\AffMat}} \fs_{\idxI, \idxII}
\end{equation}
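The three steps above can be combined into a short procedure: enumerate the hypothesis transforms $\AffMatij$, measure the position, scale, and orientation errors of~\cref{eqn:inlierdelta}, and keep the hypothesis with the largest weighted inlier score. The sketch below uses illustrative names and thresholds.
\begin{verbatim}
import numpy as np
TAU = 2 * np.pi

def decompose(invVR):
    # position, scale, and orientation of an affine keypoint matrix
    pt = np.array([invVR[0, 2], invVR[1, 2]])
    scale = np.sqrt(np.linalg.det(invVR[0:2, 0:2]))
    ori = (-np.arctan2(invVR[0, 1], invVR[0, 0])) % TAU
    return pt, scale, ori

def best_affine_hypothesis(matches, xy_thresh, scale_thresh, ori_thresh):
    """matches: list of (invVR_i, invVR_j, score) affine keypoint matrices for
    the correspondences between a query and one database annotation (sketch)."""
    best_A, best_weight = None, -np.inf
    for invVR_i, invVR_j, _ in matches:
        A = invVR_j @ np.linalg.inv(invVR_i)      # hypothesis transform A_ij
        weight = 0.0
        for invVR_q, invVR_d, fs in matches:
            pt1, s1, o1 = decompose(A @ invVR_q)  # warp query keypoint
            pt2, s2, o2 = decompose(invVR_d)
            d_pt = np.linalg.norm(pt1 - pt2)
            d_scale = max(s1 / s2, s2 / s1)
            d_ori = min(abs(o1 - o2) % TAU, TAU - abs(o1 - o2) % TAU)
            if (d_pt < xy_thresh and d_scale < scale_thresh
                    and d_ori < ori_thresh):
                weight += fs                      # inlier weighted by its score
        if weight > best_weight:
            best_A, best_weight = A, weight
    return best_A
\end{verbatim}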
\subsection{Homography refinement}
Feature correspondences that are inliers to the best hypothesis affine transformation, $\HypothAffMat$,
are used in the least squares refinement step.
This step is only executed if there are at least $4$ inliers to $\HypothAffMat$, otherwise all
correspondences between features in query annotation $\annotI$ to features in database annotation
$\annotII$ are removed.
The refinement step estimates a projective transform from $\annotI$ to $\annotII$.
To avoid numerical errors the $xy$-locations of the correspondence are normalized to have a mean of $0$
and a standard deviation of $1$ prior to estimation.
A more comprehensive explanation of estimating projective transformations using point correspondences can
be found in~\cite[311--320]{szeliski_computer_2010}.
Unlike in the affine hypothesis estimation where many transformations are generated, only one homography
transformation is computed. Given a set of inliers to the affine hypothesis transform
$\Matches_{\HypothAffMat}$, the least-squares estimate of the projective homography transform is:
\begin{equation}
\HmgMatBest \eqv \argmin{\HmgMat} \sum_{(i, j) \in
\Matches_{\HypothAffMat}} \elltwosqrd{\HmgMat \pt_{i} - \pt_{j}}
\end{equation}
Similar to affine error estimation, we identify the subset of inlier feature correspondences
$\Matches_{\HmgMatBest} \subseteq \Matches_{\annotII}$.
A correspondence is an inlier if the query feature is transformed to within a certain spatial distance
threshold $\xythresh$, orientation threshold $\orithresh$, and scale threshold $\scalethresh$ of its
corresponding database feature.
For convenience, let $\tohmg{\cdot}$ and $\unhmg{\cdot}$ transform points into and out of homogeneous
form (recall homogeneous form augments an $xy$-point with a $z$ coordinate).
For each feature correspondence $(\idxI, \idxII) \in \Matches_{\annotII}$, the query feature position,
$\ptI$, is warped from $\annotI$-space into $\annotII$-space.
\begin{equation}
\warp{\ptI} = \unhmg{\HmgMatBest \tohmg{\ptI}}
\end{equation}
However, because projective transformations are not guaranteed to preserve the structure of the affine
keypoints, warped scales and orientations cannot be estimated using~\cref{eqn:affinewarp}.
Therefore, we estimate values that will serve a similar purpose relative to a reference point.
\subsubsection{Estimation of warped shape parameters}
Because we cannot warp the shape of an affine keypoint using a projective transformation, we instead
estimate the warped scale and orientation for the $\idxI\th{}$ query feature using a reference point.
Given a single feature correspondence $(\idxI, \idxII) \in \Matches_{\annotII}$, we associate a reference
point $\refptI$ with the query location $\ptI$, scale $\scaleI$ and orientation $\oriI$.
The reference point is defined to be $\scaleI$ distance away from the feature center at an angle of
$\oriI$ radians in $\annotI$-space.
\begin{equation}
\refptI = \ptI + \scaleI \BVEC{\sin{\oriI} \\ -\cos{\oriI}}
\end{equation}
To estimate the warped scale and orientation, first the reference
point is warped from $\annotI$-space into $\annotII$-space.
\begin{equation}
\warp{\refptI} = \unhmg{\HmgMatBest \tohmg{\refptI}}
\end{equation}
%-----------
Then we compute the residual vector $\ptres$ between the warped point and the warped reference point:
\begin{equation}
%\Delta \warp{\refptI} = \BVEC{\Delta \warp{\inI{x}} \\ \Delta \warp{\inI{y}}} = \warp{\ptI}- \warp{\refptI}.
\ptres = \BVEC{\xres \\ \yres} = \warp{\ptI}- \warp{\refptI}.
\end{equation}
The warped scale and orientation are estimated using the length and angle of the residual vector.
Recall, the warped location can be computed exactly.
In summary, the warped location, scale, and orientation of the $\idxI\th{}$ query feature is:
\begin{equation}\label{eqn:homogwarp}
\begin{aligned}
\warp{\ptI} &\eqv \unhmg{\HmgMatBest \tohmg{\ptI}} \\
\warp{\scaleI} &\eqv \elltwo{\ptres}\\
%\warp{\oriI} &= \atantwo{\yres, \xres} - \frac{\TAU}{4}.
%\warp{\oriI} &= \atantwo{\yres, \xres} - \frac{\pi}{2}. % is this adjustment right?
\warp{\oriI} &\eqv \atantwo{\yres, \xres}
\end{aligned}
\end{equation}
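The reference-point construction amounts to warping one extra point per feature; the following sketch (illustrative names, with the homography given as a $3\times3$ array) recovers the warped position, scale, and orientation of a query keypoint.
\begin{verbatim}
import numpy as np

def warp_keypoint_with_homography(H, pt, scale, ori):
    """Warp a query keypoint (pt, scale, ori) with homography H, estimating
    the warped scale and orientation via a reference point (sketch)."""
    def apply_h(p):
        xh, yh, zh = H @ np.array([p[0], p[1], 1.0])
        return np.array([xh / zh, yh / zh])    # back out of homogeneous form
    # Reference point: `scale` away from the center in the direction `ori`.
    ref = pt + scale * np.array([np.sin(ori), -np.cos(ori)])
    warped_pt = apply_h(pt)
    residual = warped_pt - apply_h(ref)
    warped_scale = np.linalg.norm(residual)
    warped_ori = np.arctan2(residual[1], residual[0])
    return warped_pt, warped_scale, warped_ori
\end{verbatim}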
\subsubsection{Homography inliers}
The rest of homography inlier estimation is no different from affine inlier estimation.
\Cref{eqn:inlierdelta} is used to compute the errors $( %
\Delta \pt_{\idxI, \idxII}, %
\Delta \scale_{\idxI, \idxII}, %
\Delta \ori_{\idxI, \idxII})$
%
between the warped query location, scale, and orientation, $(\warp{\ptI}, \warp{\scaleI}, \warp{\oriI})$, %
and the corresponding database location, scale, and orientation, %
$({\ptII}, {\scaleII}, {\oriII})$.
The final set of homography inliers is given as:
\begin{equation}\label{eqn:homoginliers}
\Matches_{\HmgMatBest} \eqv \curly{m \in \Matches_{\annotII} \where \isinlierop(m, \HmgMatBest)}
\end{equation}
Spatial verification results in a reduced set of inlier feature correspondences from the query annotation
to the database annotations. The \namescoring{} mechanism from~\cref{subsec:namescore} is then applied to
these inlier feature correspondences. This final per-name score is the output of the identification
algorithm and used to form a ranked list that is presented to a user for review.
\FloatBarrier{}
\section{EXEMPLAR SELECTION}\label{sec:exempselect}
To scale one-vs-many matching to larger databases and to allow the LNBNN mechanism to find appropriate
normalizers we restrict the number of examples of each individual in the database to a set of exemplars.
Exemplars that represent a wide range of viewpoints and poses are automatically chosen by using a modified
version of the technique presented in~\cite{oddone_mobile_2016}. The idea is to treat exemplar selection as a
maximum weight set cover problem. For each individual, the input is a set of annotations. A similarity score is
computed between pairs of annotations. To compute covering sets, we first choose a threshold; each annotation's
covering set is defined as itself together with the other annotations it matches with a similarity score above
that threshold. The maximum number of exemplars is restricted by setting a maximum weight. Searching for the
optimal set cover is NP-hard, therefore we use the greedy %
$(1 - \frac{1}{e})$-approximation algorithm~\cite{michael_guide_1979}. The algorithm is run for several
iterations in order to find a good threshold that minimizes the difference between the weight of the set cover
and the maximum weight limit. The similarity score between annotations can be computed using the LNBNN scores,
but a better choice is the output of the algorithm described later in \Cref{chap:pairclf}, which produces the
probability that a pair of annotations correctly matches.
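A minimal sketch of the greedy covering step follows; it assumes a pairwise similarity matrix for one individual and a chosen threshold, and it simplifies the maximum weight constraint to a cap on the number of exemplars. The names are illustrative rather than the exact procedure of~\cite{oddone_mobile_2016}.
\begin{verbatim}
import numpy as np

def greedy_exemplar_cover(similarity, thresh, max_exemplars):
    """Greedy set cover over one individual's annotations (sketch).
    similarity: NxN matrix of pairwise annotation similarity scores."""
    n = similarity.shape[0]
    # Each annotation covers itself plus everything it matches above thresh.
    cover_sets = [set(np.flatnonzero(similarity[i] >= thresh)) | {i}
                  for i in range(n)]
    uncovered, exemplars = set(range(n)), []
    while uncovered and len(exemplars) < max_exemplars:
        # Choose the annotation covering the most still-uncovered annotations.
        best = max(range(n), key=lambda i: len(cover_sets[i] & uncovered))
        if not cover_sets[best] & uncovered:
            break
        exemplars.append(best)
        uncovered -= cover_sets[best]
    return exemplars
\end{verbatim}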
\section{EXPERIMENTS}\label{sec:rankexpt}
This section presents an experimental evaluation of the identification algorithm using annotated images of
plains zebras, Grévy's zebras, Masai giraffes, and humpback whales.
The input to each experiment is
(1) a dataset,
(2) a subset of query and database annotations from that dataset, and
(3) a pipeline configuration.
The datasets are described in~\cref{sub:datasets}.
The subsets of query and database annotations are carefully chosen to measure the accuracy of the algorithm
under different conditions and to control for time, quality, and viewpoint.
The pipeline configuration is a set of parameters --- \eg{} the level of feature invariance, the number of
the nearest neighbors, and the \namescoring{} mechanism --- given to the identification algorithm.
We will vary these pipeline parameters in order to measure their effect on the accuracy of the ranking
algorithm.
For each query annotation, the identification algorithm returns a ranked list of \names{} with a score for each
name. The accuracy of identification is measured using the cumulative match
characteristic~\cite{decann_relating_2013}, which can be understood as the probability that a query correctly
finds a match at a specified rank under the assumption that a correct match exists in the database. We are
primarily concerned with only the first point in this curve --- the fraction of queries with a correct result at
rank $1$ --- because often a user of the system will only review the first result of a query.
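For reference, the CMC values reported below can be computed directly from the rank of the first correct \name{} in each query's ranked list; the short sketch below (illustrative names) returns the fraction of queries whose correct match appears at rank $r$ or better.
\begin{verbatim}
import numpy as np

def cmc_curve(correct_ranks, max_rank=20):
    """correct_ranks: 1-based rank of the first correct name for each query
    that has a correct match in the database. Returns the CMC values."""
    ranks = np.asarray(correct_ranks)
    return np.array([(ranks <= r).mean() for r in range(1, max_rank + 1)])

# Example: rank-1 accuracy is the first point of the curve.
# cmc_curve([1, 1, 3, 2, 1])[0]  ->  0.6
\end{verbatim}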
The outline of this section is as follows. First, \cref{sub:datasets} introduces and describes each dataset.
Our first experiment in \cref{sub:exptbase} establishes the accuracy of our ranking algorithm on several
datasets using a default pipeline configuration. We then compare our approach to an alternative SMK approach in
\cref{sub:exptsmk}. The next subsections perform in depth experiments on the parameter settings of our
algorithm. \Cref{sub:exptfg} tests the effect of the foregroundness weight on identification accuracy.
\Cref{sub:exptinvar} investigates the effect of the level of feature invariance and viewpoint.
\Cref{sub:exptscoremech} compares the \csumprefix{} and the \nsumprefix{} \namescoring{} mechanism.
\Cref{sub:exptk} varies the $\K$ parameter (the number of the nearest neighbors used in establishing feature
correspondences) and investigates the relationship between $\K$ and database size in terms of both the number
of annotations and the number of exemplars per name. \Cref{sub:exptfail} discusses the failure cases of our
ranking algorithm. Finally,~\cref{sub:exptsum} summarizes this section.
\subsection{Datasets}\label{sub:datasets}
All the images in the datasets used in these experiments were taken by photographers in the field. Each
dataset is labeled with \groundtruth{} in the form of annotations with name labels. Annotations (bounding
boxes) have been drawn to localize animals within the image. A unique \name{} label has been assigned to
all annotations with the same identity. Some of this \groundtruth{} labeling was generated independently.
However, large portions of the datasets were labeled with assistance from the ranking algorithm. While
this may introduce some bias in the results, there was no alternative because independently and completely
labeling a large dataset by hand is prohibitively time consuming and error prone.
There are two important things to note before we describe each dataset. First, in order to control for
challenging factors in the images such as quality and viewpoint, some experiments sample subsets of the
datasets we describe here. Second, labeling errors exist in some datasets.
\DatabaseInfo{}
\timedist{}
The number of names, annotations, and their distribution within each database are summarized in the
following tables. In these tables we distinguish between \glossterm{singleton} and \glossterm{resighted}
names. Singleton names are individuals sighted only once, \ie{} they contain only a single encounter. Resighted
names contain more than one encounter. We make this distinction because resighted names have correct
matches across a significant time delta. Note that singleton names may still have more than one
annotation, but those annotations are all from the same encounter. We have pre-filtered each database to
remove annotations that are unidentified, are missing timestamps, or are labeled as ``junk'' quality.
\Cref{tbl:DatabaseStatistics} summarizes the number of annotations and individuals in each database as
well as the number of times (distinct encounters) each individual was sighted.
\Cref{tbl:AnnotationsPerQuality} summarizes the quality labels of the annotations.
\Cref{tbl:AnnotationsPerViewpoint} summarizes the viewpoint labels of the annotations.
Distributions of when images in each dataset were taken are illustrated in \cref{fig:timedist}.
The name and a short description of each dataset is given in the following list.
\FloatBarrier{}
\begin{itemln}
\item \textbf{Plains zebras}.
Our plains zebras dataset is an aggregate of several smaller datasets.
There is variation in how the data was collected and preprocessed.
Some images are cropped to the flank of the animal, while others are cropped to encompass the entire
body.
The constituent datasets were collected in Kenya at several locations including Nairobi National
Park, Sweetwaters, and Ol Pejeta.
More than $90\percent$ of the \groundtruth{} generated for this dataset was assisted using the
matching algorithm.
This dataset contains many imaging challenges including occlusion, viewpoint, pose, quality, and time
variation.
There are some annotations in this dataset without quality or viewpoint labelings and some images
contain undetected animals.
This data was collected between $2006$ and $2015$, but the majority of the data was collected in
$2012$--$2015$.
\item \textbf{Grévy's zebras}.
This is another aggregated dataset.
The original \groundtruth{} for this dataset was generated independently of the ranking algorithm,
however the ranking algorithm revealed several \groundtruth{} errors that have since been corrected.
The Grévy's dataset was collected in Mpala, Kenya.
Most of the annotations in this database have been cropped to the animal's flank.
This dataset contains a moderate degree of pose and viewpoint variation and occlusion.
This data was collected between $2003$ and $2012$, but the majority was collected in $2011$ and $2012$.
\item \textbf{Masai giraffes}.
These images of Masai giraffes were all taken in Nairobi National Park during the \GZC{} between
February $20$, $2015$ and March $2$, $2015$.
All \groundtruth{} was established using the ranking algorithm followed by manual verification.
This dataset contains a high degree of pose and viewpoint variation, and occlusion.
Because of their long necks, it is difficult to ensure that only a single giraffe appears in each
annotation.
This results in many \glossterm{photobombs} --- pairs of annotations where a background animal in one
annotation matches the foreground animal in the other --- when matching.
\item \textbf{Humpbacks}.
The humpback dataset was collected by FlukeBook~\cite{levenson_flukebook_2015} over nearly 15 years.
Images were contributed by both marine and citizen scientists.
The original \groundtruth{} was established using both manual and automated methods that are disjoint
from the techniques considered here; however, our software was used to correct mistakes.
The annotations in this dataset have not been manually reviewed.
Some are cropped to the fluke while others encompass the entire image.
Quality and viewpoint labels do not exist for this dataset.
\end{itemln}
\FloatBarrier{}
\subsection{Baseline experiment}\label{sub:exptbase}
This first experiment determines the accuracy of the identification algorithm using the baseline pipeline
configuration.
The baseline pipeline configuration uses affine invariant features oriented using the gravity vector,
$\K\tighteq4$ as the number of feature correspondences assigned to each query feature, and \nscoring{}
(\nsum{}).
In this test we control for several biases that may be introduced by carefully selecting a subset of our
datasets.
We only use annotations that
(1) are known (\ie{} have been assigned a name),
(2) are comparable to the species' primary viewpoint (\eg{} left, front-left, and back-left for plains
zebras), and
(3) have not been assigned a quality of ``junk''.
Furthermore, to account for the fact that some \names{} contain more annotations than others, we
constrain our data selection such that there is only one correct exemplar in the database for each query
annotation.
Of these annotations, we group them into encounters. For each encounter we sample one annotation with the
highest quality. Names with only one encounter are added to the database as distractors. For the other
names, we randomly sample two encounters --- regardless of quality --- one for the database and one to use
as a query. This defines a set of query and database annotations that are separated by time, testing the
ability of our system to match animals across gaps of time using only a single image per individual. The
CMC curves for this baseline test are illustrated in~\cref{fig:BaselineExpt}.
The results of this baseline experiment demonstrate that our algorithm is able to reliably find matching
annotations in a database with many other images. The accuracy is over $60\percent$ for all species
considered. Subsequent experiments will restrict our focus to Grévy's and plains zebras in order to
investigate detailed parameter choices of the ranking algorithm and alternative ranking algorithms.
\BaselineExpt{}
\FloatBarrier{}
\subsection{SMK as an alternative}\label{sub:exptsmk}
Before we investigate the parameter choices of the LNBNN ranking algorithm, we briefly evaluate the
performance of an alternative ranking algorithm, namely VLAD-flavored SMK. The SMK algorithm is a
vocabulary-based algorithm that is representative of more traditional approaches to instance recognition
problems. In contrast to the raw descriptors used in LNBNN, SMK assigns each descriptor to a visual word
and builds a weighted histogram of visual words (accumulated residual vectors in the case of VLAD) to
represent each annotation as a sparse fixed length vector. We have experimented with several configurations
of VLAD and report our best results here.
In our SMK implementation, we pre-trained an $8000$ word vocabulary using mini-batch k-means on the stacked
descriptors from all database annotations. Note that typically the vocabulary is trained using a disjoint
external dataset in order to prevent overfitting. However, we \naively{} train using the database
annotations to be indexed, understanding that this will inflate the accuracy measurements. Each word in the
vocabulary is weighted with its inverse document frequency. We use the vocabulary to compute an inverted
index that maps each visual word to the database annotations containing that word. Initial feature
correspondences for a descriptor are computed by assigning it to a single visual word and then creating a
correspondence to every database feature in that word's inverted index. We use spatial verification to filter
spatially invalid correspondences, and re-score the remaining matches.
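As a rough sketch of the vocabulary step only (not the full SMK scoring), the snippet below trains a visual vocabulary with mini-batch k-means, computes inverse document frequency weights, and builds the word-to-annotation inverted index; scikit-learn's MiniBatchKMeans is a stand-in for our implementation and all names are illustrative.
\begin{verbatim}
import numpy as np
from collections import defaultdict
from sklearn.cluster import MiniBatchKMeans

def build_vocab_index(annot_descriptors, n_words=8000):
    """annot_descriptors: dict of annotation id -> (num_feats, dim) array.
    Returns the vocabulary, idf weights, and inverted index (sketch)."""
    stacked = np.vstack(list(annot_descriptors.values()))
    vocab = MiniBatchKMeans(n_clusters=n_words).fit(stacked)
    inverted = defaultdict(set)
    for aid, descs in annot_descriptors.items():
        for word in vocab.predict(descs):   # single assignment per descriptor
            inverted[word].add(aid)
    n_annots = len(annot_descriptors)
    idf = {w: np.log(n_annots / len(aids)) for w, aids in inverted.items()}
    return vocab, idf, inverted
\end{verbatim}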
The results of the SMK experiment are illustrated in \cref{fig:SMKExpt}. The query and database annotations
are the same in each experiment. Despite the bias in the SMK vocabulary, our measurements show that LNBNN
provides more accurate rankings. For plains zebras there is a difference of $8\percent$ in the number of
correct matches at rank $1$, and for Grévy's zebras the difference is $6\percent$.
\SMKExpt{}
\FloatBarrier{}
\subsection{Foregroundness experiment}\label{sub:exptfg}
In this experiment we test the effect of our foregroundness weights --- weighting the score of each
feature correspondence with a foregroundness weight --- on identification accuracy. When foregroundness is
enabled (\pvar{fg=T}), each feature correspondence is weighted using a foregroundness measure learned using
a deep convolutional neural network~\cite{parham_photographic_2015}. When disabled (\pvar{fg=F}), the
weight of every correspondence effectively becomes $1$.
Running this experiment using the query / database sample outlined in the baseline experiment does
not result in a noticeable difference in scores because the purpose of the foregroundness measure is to
downweight matches between scenery objects (\eg{} trees, grass, bushes) that appear in multiple
annotations. The baseline database sample contains only a single image from each encounter and only two
encounters per individual. This means that it will be unlikely for an annotation in the query set and
another annotation in the database set to have a similar background.
To more clearly illustrate the effect of the foregroundness measure we use a different sampling strategy.
We group all encounters by which occurrence they belong to. Annotations within the same occurrence are more
likely to share background. We sample query and database annotations from within occurrences to simulate
matching annotations within an encounter. We do not limit the number of exemplars in this test to ensure
that annotation pairs that share common scenery exist. We perform this test over multiple occurrences and
aggregate the results. Therefore, the reported database size will be an average, and the query size is the
sum of all unique query annotations.
The effect of foregroundness weighting on accuracy is illustrated in~\cref{fig:FGIntraExpt}. The results show that using
foregroundness weights improves the number of correct results at rank $1$ by a significant margin for both
species. At higher ranks the \pvar{fg=T} line occasionally dips below the \pvar{fg=F} line
because sometimes the foregroundness mask covers distinguishing keypoints, but this is neither significant
nor common. Therefore, we find it beneficial to always include foregroundness when a trained estimator is
available.
\FGIntraExpt{}
\FloatBarrier{}
\subsection{Invariance experiment}\label{sub:exptinvar} %
In this experiment we vary the feature invariance configuration. This influences the location, shape, and
orientation of keypoints detected in each annotation, which in turn influences which regions in each
annotation are matchable using SIFT descriptors extracted at each keypoint. The best invariance settings
will depend on properties of the data.
In our experiments we test different settings by enabling (denoted as T) or disabling (denoted as F) the
parameters affine invariance (AI) and our query-side rotation heuristic (QRH). Initially we also tested
rotation invariance, but found that it provided the poorest results for all datasets by a significant
margin, likely because the gravity vector assumption is mostly satisfied in all images. Therefore, we
exclude rotation invariance from our experiments.
In configurations where \pvar{AI=F}, keypoints are circular with a radius defined by the scale at which
they were detected.