---
layout: post
title: Smooth Loss Functions for Deep Top-k Classification
comments: True
excerpt:
tags: ['2018', 'ICLR 2018', 'Loss Function', AI, ICLR, Loss]
---

## Introduction

* For top-k classification tasks, cross-entropy is widely used as the learning objective, even though it is the optimal metric only in the limit of infinite data.

* The paper introduces a family of smoothed loss functions that are specifically designed for top-k optimization.

* [Paper](https://arxiv.org/abs/1802.07595)

* [Code](https://github.com/oval-group/smooth-topk)

## Idea

* Inspired by multi-class SVMs, a surrogate loss (l<sub>k</sub>) is introduced that enforces a margin between the ground-truth score and the kth largest score. A minimal sketch of this idea follows the notation below.

![Equation 1](https://github.com/shagunsodhani/papers-I-read/raw/master/assets/topk/eq1.png)

* Here **s** denotes the vector of scores output by the classifier model to be learnt, *y* is the ground-truth label, *s[k]* denotes the kth largest element of **s**, and **s\p** denotes the vector **s** with its *p*th element removed.
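
As a rough illustration of the hard surrogate described above, the sketch below penalizes the model unless the ground-truth score beats the kth largest competing score by a margin. The margin value of 1 and the exact form are assumptions for illustration only; the precise definition of l<sub>k</sub> is the one in Equation 1.

```python
import numpy as np

def topk_hinge_sketch(s, y, k, margin=1.0):
    """Illustrative hard top-k surrogate: zero loss once the ground-truth score
    exceeds the k-th largest competing score by the margin.
    (The margin of 1.0 and this exact form are assumptions, not the paper's Eq. 1.)"""
    s = np.asarray(s, dtype=float)
    s_rest = np.delete(s, y)                      # scores without the ground-truth entry
    kth_largest = np.sort(s_rest)[-k]             # k-th largest competing score
    return max(0.0, margin + kth_largest - s[y])

print(topk_hinge_sketch([2.0, 1.8, 1.3, 0.1], y=0, k=2))  # 0.3: margin not yet satisfied
```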

* This l<sub>k</sub> loss has two limitations:

    * It is continuous but not differentiable in **s**.

    * Its weak derivatives have at most two non-zero elements.

* The loss can be reformulated by adding and subtracting the k-1 largest scores of **s\y** as well as *s<sub>y</sub>*, and by introducing a temperature parameter τ.

![Equation 2](https://github.com/shagunsodhani/papers-I-read/raw/master/assets/topk/eq2.png)

## Properties of L<sub>kτ</sub>

* For any τ > 0, L<sub>kτ</sub> is infinitely differentiable and has non-sparse gradients.

* Under mild conditions, L<sub>kτ</sub> approaches l<sub>k</sub> (in a pointwise sense) as τ approaches 0<sup>+</sup>; a small numeric illustration of the smoothing mechanism follows this list.

* It is an upper bound on the actual loss (up to a constant factor).

* It is a generalization of the cross-entropy loss to different values of k and τ, and to larger margins.
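
The smoothing replaces a maximum with a temperature-scaled log-sum-exp. The snippet below is a generic illustration of that mechanism, not the paper's exact L<sub>kτ</sub>: as τ → 0<sup>+</sup>, τ·log Σ exp(a/τ) approaches max(a).

```python
import numpy as np

def smooth_max(a, tau):
    """Temperature-scaled log-sum-exp: a smooth, differentiable stand-in for max(a).
    The shift by max(a) is the usual log-sum-exp trick for numerical stability."""
    a = np.asarray(a, dtype=float)
    m = a.max()
    return m + tau * np.log(np.sum(np.exp((a - m) / tau)))

a = [1.0, 2.0, 3.5]
for tau in [1.0, 0.1, 0.01]:
    print(tau, smooth_max(a, tau))   # tends to max(a) = 3.5 as tau -> 0+
```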

## Computational Challenges

* Computing the loss for a single sample requires evaluating C(n, k) terms, where n is the number of classes; a naive implementation of this combinatorial sum is sketched below.

* The loss L<sub>kτ</sub> can be expressed in terms of elementary symmetric polynomials σ<sub>i</sub>(**e**) (the sum of all products of i distinct elements of the vector **e**). The challenge is therefore to compute σ<sub>k</sub> efficiently.
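
For reference, a direct implementation of the definition of σ<sub>k</sub> enumerates all k-element subsets, which is exactly the C(n, k) blow-up described above (this sketch only illustrates the definition):

```python
import itertools
import math

def sigma_naive(e, k):
    """Elementary symmetric polynomial sigma_k(e): the sum, over all k-element
    subsets of e, of the product of the chosen entries. Costs C(n, k) terms."""
    return sum(math.prod(subset) for subset in itertools.combinations(e, k))

e = [2.0, 3.0, 5.0]
print(sigma_naive(e, 2))   # 2*3 + 2*5 + 3*5 = 31
```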

### Forward Computation

* The task is to compute σ<sub>k</sub>(**e**), where **e** is an n-dimensional vector with k << n and e[i] != 0 for all i.

* σ<sub>i</sub>(**e**) can be read off from the coefficients of the polynomial (X+e<sub>1</sub>)(X+e<sub>2</sub>)...(X+e<sub>n</sub>), which can be computed by a divide-and-conquer approach based on polynomial multiplication; a sketch follows this list.

* With further optimizations (eg log(n) levels of recursion, with each level parallelized on a GPU), the resulting algorithm scales well with n on a GPU.

* Operations are performed in log-space using the log-sum-exp trick to achieve numerical stability in single-precision floating point.
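
A minimal CPU sketch of the divide-and-conquer idea, using plain NumPy convolution for the polynomial multiplication (the paper's implementation additionally works in log-space and parallelizes each recursion level on the GPU):

```python
import numpy as np

def elementary_symmetric(e):
    """Coefficients of prod_i (X + e_i), computed by divide and conquer.
    The returned array c satisfies c[k] == sigma_k(e) for k = 0..n."""
    def product(lo, hi):
        if hi - lo == 1:
            return np.array([1.0, e[lo]])                        # coefficients of (X + e_lo)
        mid = (lo + hi) // 2
        return np.convolve(product(lo, mid), product(mid, hi))   # polynomial multiplication
    return product(0, len(e))

e = np.array([2.0, 3.0, 5.0])
print(elementary_symmetric(e))   # [ 1. 10. 31. 30.] -> sigma_1=10, sigma_2=31, sigma_3=30
```

Because each recursion level halves the number of factors, there are about log(n) levels of convolutions, which is what makes the per-level GPU parallelization mentioned above effective.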

### Backward Computation

* The backward pass uses optimizations such as computing the derivative of σ<sub>j</sub> with respect to e<sub>i</sub> in a recursive manner; the identity behind this is sketched below.

* The appendix of the paper describes these techniques in detail.
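
The derivative has a simple closed form: since σ<sub>j</sub>(**e**) = e<sub>i</sub>·σ<sub>j-1</sub>(**e** without e<sub>i</sub>) + σ<sub>j</sub>(**e** without e<sub>i</sub>), the derivative of σ<sub>j</sub> with respect to e<sub>i</sub> equals σ<sub>j-1</sub> of **e** with the ith entry removed. The direct sketch below only states that identity; it is not the paper's optimized recursive algorithm.

```python
import numpy as np

def d_sigma(e, j, i):
    """d sigma_j(e) / d e_i == sigma_{j-1}(e with the i-th entry removed)."""
    reduced = np.delete(np.asarray(e, dtype=float), i)
    coeffs = np.poly(-reduced)        # coefficients of prod_{m != i} (X + e_m)
    return coeffs[j - 1]              # the (j-1)-th coefficient is sigma_{j-1}

print(d_sigma([2.0, 3.0, 5.0], j=2, i=0))   # d(sigma_2)/d e_0 = 3 + 5 = 8
```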

## Experiments

* Experiments are performed on CIFAR-100 (with label noise) and ImageNet.

* For CIFAR-100 with noise, each label is randomized with probability p (replaced by a label from the same top-level class, i.e. superclass); a small sketch of this corruption procedure follows.
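
A possible implementation of that corruption, assuming a mapping from each fine label to the list of fine labels sharing its CIFAR-100 superclass (the exact sampling details used in the paper are an assumption here):

```python
import random

def corrupt_labels(labels, superclass_members, p, seed=0):
    """With probability p, replace each label by a uniformly random label
    drawn from the same superclass (possibly the original label)."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < p:
            noisy.append(rng.choice(superclass_members[y]))  # same superclass
        else:
            noisy.append(y)
    return noisy

# toy example: two superclasses {0, 1, 2} and {3, 4, 5}
superclass_members = {0: [0, 1, 2], 1: [0, 1, 2], 2: [0, 1, 2],
                      3: [3, 4, 5], 4: [3, 4, 5], 5: [3, 4, 5]}
print(corrupt_labels([0, 3, 5, 2], superclass_members, p=0.5))
```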

* Compared to the cross-entropy loss, the proposed loss function is much more robust to label noise and to reductions in the amount of training data, for both top-k and top-1 performance.