- k-Nearest Neighbor on images: never used in practice (pixel-space distances are not very informative, and prediction is slow)
- L1 (Manhattan) distance: $d_1(I_1,I_2)=\sum_p{|I_1^p-I_2^p|}$
- L2 (Euclidean) distance: $d_2(I_1,I_2)=\sqrt{\sum_p{(I_1^p-I_2^p)^2}}$
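A minimal NumPy sketch of the two distance metrics (the example images are random stand-ins):

```python
import numpy as np

def l1_distance(i1, i2):
    # Manhattan distance: sum of absolute pixel-wise differences
    return np.sum(np.abs(i1 - i2))

def l2_distance(i1, i2):
    # Euclidean distance: square root of the summed squared differences
    return np.sqrt(np.sum((i1 - i2) ** 2))

# Two random "images" flattened to vectors
I1 = np.random.rand(32 * 32 * 3)
I2 = np.random.rand(32 * 32 * 3)
print(l1_distance(I1, I2), l2_distance(I1, I2))
```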
- Setting Hyperparameters
- Split data into train, validation, and test sets (use the test set only once)
- Cross-Validation: split the training data into folds, use each fold as the validation set in turn, and average the results (see the sketch below)
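A small NumPy sketch of splitting data into folds for cross-validation; the fold count and data size are illustrative:

```python
import numpy as np

def kfold_indices(n_samples, n_folds=5, seed=0):
    # Shuffle indices once, then carve them into roughly equal folds
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    return np.array_split(idx, n_folds)

folds = kfold_indices(n_samples=1000, n_folds=5)
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Train on train_idx, evaluate the hyperparameter setting on val_idx,
    # then average the validation accuracies over the folds
    print(i, len(train_idx), len(val_idx))
```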
- Linear Classification
$s=f(x,W) = Wx + b$
- Multiclass SVM
- $L_i=\sum_{j\not =y_i}\max(0,f(x_i;W)_j-f(x_i;W)_{y_i}+1)$
- Loss Function: $$L=\frac{1}{N}\sum_{i=1}^NL_i=\frac{1}{N}\sum_{i=1}^N\sum_{j\not =y_i}\max(0,f(x_i;W)_j-f(x_i;W)_{y_i}+1)$$
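A vectorized NumPy sketch of the multiclass SVM (hinge) loss; `scores` and `y` are assumed inputs of shape (N, C) and (N,):

```python
import numpy as np

def svm_loss(scores, y):
    # scores: (N, C) class scores, y: (N,) integer labels
    N = scores.shape[0]
    correct = scores[np.arange(N), y][:, None]        # (N, 1) true-class scores
    margins = np.maximum(0, scores - correct + 1.0)   # hinge with margin 1
    margins[np.arange(N), y] = 0                      # do not count j == y_i
    return margins.sum() / N

scores = np.random.randn(5, 10)
y = np.random.randint(10, size=5)
print(svm_loss(scores, y))
```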
- Regularization
$$L(W)=\frac{1}{N}\sum_{i=1}^NL_i(f(x_i,W),y_i)+\lambda{R(W)}$$
- L2 regularization: $R(W)=\sum_k\sum_lW_{k,l}^2$
- L1 regularization: $R(W)=\sum_k\sum_l|W_{k,l}|$
- Elastic net (L1 + L2): $R(W)=\sum_k\sum_l\beta{W_{k,l}^2}+|W_{k,l}|$
- Also: Dropout, Batch Normalization, Stochastic Depth, fractional pooling, etc.
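The three penalties written out in NumPy; the elastic-net `beta` and the `lam` value are placeholders:

```python
import numpy as np

def l2_reg(W):
    return np.sum(W ** 2)

def l1_reg(W):
    return np.sum(np.abs(W))

def elastic_net(W, beta=0.5):
    return np.sum(beta * W ** 2 + np.abs(W))

W = np.random.randn(10, 3073)
lam = 1e-3
# Full objective: data loss + lambda * R(W)
reg_term = lam * l2_reg(W)
```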
- Softmax Classifier (Multinomial Logistic Regression)
- Softmax Function: $P(Y=k|X=x_i)=\frac{e^{s_k}}{\sum_j{e^{s_j}}}$
- $L_i=-\log\left(\frac{e^{s_{y_i}}}{\sum_j{e^{s_j}}}\right)$
- Loss Function: $$L=\frac{1}{N}\sum_{i=1}^NL_i=-\frac{1}{N}\sum_{i=1}^N\log\left(\frac{e^{s_{y_i}}}{\sum_j{e^{s_j}}}\right)$$
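A NumPy sketch of the softmax loss; shifting by the max score for numerical stability is a standard trick, not something stated in the notes:

```python
import numpy as np

def softmax_loss(scores, y):
    # scores: (N, C) class scores, y: (N,) integer labels
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    N = scores.shape[0]
    return -log_probs[np.arange(N), y].mean()

scores = np.random.randn(5, 10)
y = np.random.randint(10, size=5)
print(softmax_loss(scores, y))
```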
- Stochastic Gradient Descent (SGD) and related variants, by batch size:
- On-line Gradient Descent (one example per update)
- Minibatch Gradient Descent (MGD)
- Batch Gradient Descent (BGD, the full dataset per update)
- Image Features
- Gradient
- Numerical gradient: slow :(, approximate :(, easy to write :)
- Analytic gradient: fast :), exact :), error-prone :(
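A minimal sketch of a numerical gradient check using centered differences; the test function and step size `h` are illustrative choices:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # Centered difference: slow and approximate, but easy to write
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = x[i]
        x[i] = old + h; fp = f(x)
        x[i] = old - h; fm = f(x)
        x[i] = old
        grad[i] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

f = lambda x: np.sum(x ** 2)      # analytic gradient is 2x
x = np.random.randn(3, 4)
print(np.max(np.abs(numerical_gradient(f, x) - 2 * x)))   # should be tiny
```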
- Computational graphs
- How to compute gradients?
- Computational graphs + Backpropagation
- backprop with scalars
- vector-valued functions
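As a concrete example of scalar backprop on a tiny computational graph, here is f(x, y, z) = (x + y) * z with the chain rule applied node by node (the numeric values are just for illustration):

```python
# Forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (chain rule, starting from df/df = 1)
df_dq = z          # d(q*z)/dq
df_dz = q          # d(q*z)/dz
df_dx = df_dq * 1  # d(x+y)/dx = 1
df_dy = df_dq * 1  # d(x+y)/dy = 1
print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```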
- Fully Connected Layer
- Convolution Layer
- Accepts a volume of size $W_1 \times H_1 \times D_1$
- Hyperparameters: $K$ (number of filters), $F$ (spatial extent of the filters), $S$ (stride), $P$ (amount of zero padding)
- Produces a volume of size $W_2 \times H_2 \times D_2$, where $W_2=(W_1-F+2P)/S+1$, $H_2=(H_1-F+2P)/S+1$, $D_2=K$
- Number of parameters: $(F\cdot F\cdot D_1)\cdot K+K$ (weights plus biases)
- Pooling: accepts $W_1 \times H_1 \times D_1$, produces $W_2=(W_1-F)/S+1$, $H_2=(H_1-F)/S+1$, $D_2=D_1$
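A small helper (illustrative, not from the notes) that evaluates the output-size and parameter-count formulas above:

```python
def conv_output(W, H, D, K, F, S, P):
    # Output volume and parameter count for a conv layer
    W2 = (W - F + 2 * P) // S + 1
    H2 = (H - F + 2 * P) // S + 1
    D2 = K
    params = (F * F * D) * K + K     # weights + biases
    return (W2, H2, D2), params

def pool_output(W, H, D, F, S):
    return ((W - F) // S + 1, (H - F) // S + 1, D)

# e.g. a 32x32x3 input, 10 filters of size 5x5, stride 1, pad 2
print(conv_output(32, 32, 3, K=10, F=5, S=1, P=2))   # ((32, 32, 10), 760)
print(pool_output(32, 32, 10, F=2, S=2))             # (16, 16, 10)
```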
Mini-batch SGD Loop:
- Sample a batch of data
- Forward prop it through the graph (network), get loss
- Backprop to calculate the gradients
- Update the parameters using the gradient
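The same loop as a runnable PyTorch sketch; the tiny model, random data, and hyperparameters are stand-ins, not values from the notes:

```python
import torch
import torch.nn as nn

# Tiny stand-in model and random data, just to make the loop concrete
model = nn.Sequential(nn.Linear(3072, 100), nn.ReLU(), nn.Linear(100, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(100):
    # Sample a batch of data (random here; a real loader would yield images/labels)
    x = torch.randn(64, 3072)
    y = torch.randint(0, 10, (64,))

    scores = model(x)            # forward prop through the network
    loss = loss_fn(scores, y)    # get loss
    optimizer.zero_grad()
    loss.backward()              # backprop to calculate the gradients
    optimizer.step()             # update the parameters using the gradient
```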
- Sigmoid problems
- Saturated neurons “kill” the gradients
- Sigmoid outputs are not zero-centered
- exp() is a bit compute expensive
- tanh problems
- still kills gradients when saturated
- ReLU problems
- Not zero-centered output
- Leaky ReLU
- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
- will not “die”
- Parametric Rectifier (PReLU)
- Exponential Linear Units (ELU)
- All benefits of ReLU
- Closer to zero mean outputs
- Negative saturation regime compared with Leaky ReLU adds some robustness to noise
- Maxout “Neuron”
Use ReLU, and be careful with your learning rates. Try out Leaky ReLU / Maxout / ELU. Try out tanh but don’t expect much. Don’t use sigmoid.
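For reference, the activations above as elementwise NumPy functions; the leaky slope and ELU alpha are common defaults rather than values from the notes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # saturates, not zero-centered

def tanh(x):
    return np.tanh(x)                     # zero-centered, still saturates

def relu(x):
    return np.maximum(0, x)               # cheap, but can "die" for x < 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small negative slope, will not die

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def maxout(x1, x2):
    # Maxout takes the max over two separate linear responses
    return np.maximum(x1, x2)
```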
- zero-centered data (for images this is often the only preprocessing applied)
- normalized data
- PCA
- Whitening
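A NumPy sketch of these preprocessing steps on an assumed N x D data matrix (PCA/whitening shown for completeness; for images usually only mean subtraction is applied):

```python
import numpy as np

X = np.random.randn(500, 3072)               # assumed data matrix: N x D

X_centered = X - X.mean(axis=0)              # zero-center
X_normalized = X_centered / X_centered.std(axis=0)  # normalize each dimension

# PCA: rotate into the eigenbasis of the covariance matrix
cov = X_centered.T @ X_centered / X.shape[0]
U, S, _ = np.linalg.svd(cov)
X_pca = X_centered @ U                       # decorrelated data

# Whitening: additionally scale each dimension by its eigenvalue
X_white = X_pca / np.sqrt(S + 1e-5)
```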
- tanh + “Xavier” Initialization: std = 1/sqrt(Din)
- ReLU + Kaiming / MSRA (He et al.) Initialization: std = sqrt(2 / Din)
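The two initialization rules as NumPy one-liners, where `Din`/`Dout` are the fan-in/fan-out of an assumed layer:

```python
import numpy as np

Din, Dout = 4096, 1024   # assumed layer sizes

# "Xavier" init, reasonable with tanh
W_xavier = np.random.randn(Din, Dout) / np.sqrt(Din)

# Kaiming / MSRA (He et al.) init, reasonable with ReLU
W_kaiming = np.random.randn(Din, Dout) * np.sqrt(2.0 / Din)
```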
- Double check that the initial loss is reasonable
- learning rate: 1e-5 ~ 1e-3
- Only run a few epochs first
- If the cost is ever > 3 * original cost, break out
- Random Search Hyperparameter > Grid Search Hyperparameter
- Adam is a good default choice in many cases; it often works ok even with constant learning rate
- SGD + Momentum can outperform Adam but may require more tuning of LR and schedule
- Try cosine schedule, very few hyperparameters!
- If you can afford to do full batch updates then try out L-BFGS (and don’t forget to disable all sources of noise)
- Problems with SGD (gradient direction?)
- Loss function has high condition number
- ratio of largest to smallest singular value of the Hessian matrix is large
- Local minima / Saddle points
- Our gradients come from minibatches, so they can be noisy
- SGD: $x_{t+1}=x_t-\alpha{\triangledown f(x_t)}$
- SGD + Momentum: $x_{t+1}=x_t-\alpha{(\rho{v_t}+\triangledown f(x_t))}$
- SGD + Nesterov Momentum: $x_{t+1}=x_t+\rho{v_t}-\alpha{\triangledown f(x_t+\rho{v_t})}$
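A runnable sketch of the momentum update on a toy quadratic loss; the learning rate and rho are typical values, not from the notes:

```python
import numpy as np

def grad(x):
    # Gradient of a toy quadratic f(x) = 0.5 * ||x||^2, standing in for a real loss gradient
    return x

x = np.random.randn(10)
v = np.zeros_like(x)
learning_rate, rho = 1e-1, 0.9   # typical values

for _ in range(100):
    # SGD + Momentum: accumulate a velocity, then step along it
    v = rho * v + grad(x)
    x = x - learning_rate * v

# Nesterov variant: v = rho * v - learning_rate * grad(x + rho * v); x = x + v
print(np.linalg.norm(x))   # should be much smaller than at the start
```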
- Learning Rate Decay (common with momentum, less common with Adam)
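A sketch of two common decay schedules, including the cosine schedule recommended above; the base LR and epoch counts are illustrative:

```python
import math

def cosine_lr(epoch, total_epochs, lr0=1e-3):
    # Cosine schedule: decays from lr0 to 0 over training, no extra hyperparameters
    return 0.5 * lr0 * (1 + math.cos(math.pi * epoch / total_epochs))

def step_lr(epoch, lr0=1e-3, drop=0.1, every=30):
    # Step decay: multiply the LR by `drop` every `every` epochs
    return lr0 * (drop ** (epoch // every))

print([round(cosine_lr(e, 100), 6) for e in (0, 50, 100)])
```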
- Second-Order Optimization (no learning rate hyperparameter)
- First-Order Optimization
- Use gradient to form a linear approximation
- Step to minimize the approximation
- Second-Order Optimization: use gradient and Hessian to form a quadratic approximation, then step to the minima of the approximation
- Model Ensembles: Tips and Tricks
- Regularization: add a term to the loss
- Dropout: In each forward pass, randomly set some neurons to zero (see the sketch after this list)
- Probability of dropping is a hyperparameter; 0.5 is common
- Batch Normalization
- Data Augmentation
- Horizontal Flips
- Random crops and scales
- Color Jitter
- Simple: Randomize contrast and brightness
- Complex: Apply PCA to all [R, G, B]
- Random mix/combinations of:
- translation, rotation, stretching, shearing, lens distortions
- DropConnect (set weights to 0)
- Fractional Max Pooling
- Stochastic Depth
- Cutout
- Mixup
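A minimal inverted-dropout sketch for the Dropout bullet above, assuming a keep probability of 0.5; scaling at train time keeps the test-time forward pass unchanged:

```python
import numpy as np

p_keep = 0.5   # probability of keeping a unit (dropping with probability 0.5)

def dropout_forward(h, train=True):
    if not train:
        return h                       # test time: no dropout, no scaling needed
    mask = (np.random.rand(*h.shape) < p_keep) / p_keep   # "inverted" dropout
    return h * mask

h = np.random.randn(4, 100)
print(dropout_forward(h).mean(), dropout_forward(h, train=False).mean())
```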
| | very similar dataset | very different dataset |
|---|---|---|
| very little data | Use Linear Classifier on top layer | You’re in trouble… Try linear classifier from different stages |
| quite a lot of data | Finetune a few layers | Finetune a larger number of layers |
Have some dataset of interest but it has < ~1M images?
- Find a very large dataset that has similar data, train a big ConvNet there
- Transfer learn to your dataset
Deep learning frameworks provide a “Model Zoo” of pretrained models so you don’t need to train your own
- Caffe: https://github.com/BVLC/caffe/wiki/Model-Zoo
- TensorFlow: https://github.com/tensorflow/models
- PyTorch: https://github.com/pytorch/vision
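A hedged torchvision sketch of the table's advice: load a pretrained model, freeze the backbone, and retrain only the final layer (the model choice, class count, and hyperparameters are assumptions; newer torchvision versions use a weights= argument instead of pretrained=):

```python
import torch
import torch.nn as nn
import torchvision

# Load a pretrained ConvNet from the model zoo
model = torchvision.models.resnet18(pretrained=True)

# Freeze all pretrained weights ("linear classifier on top layer" regime)
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for the new dataset (e.g. 10 classes)
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the new layer's parameters are optimized; with more data,
# unfreeze and finetune some of the later layers as well
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```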
- Check initial loss
- Overfit a small sample
- Find LR that makes loss go down
- Coarse grid, train for ~1-5 epochs
- Refine grid, train longer
- Look at loss curves