Rethinking Atrous Convolutions for Semantic Image Segmentation

Key ideas

Image Pyramid: same model, typically with shared weights, applied to different scales of the image
Encoder-decoder: encoder where spatial dimension of feature maps is gradually reduced, decoder where object details + dimension are recovered
Context module: extra modules laid out in cascade to encode long-range context, e.g. DenseCRF
Spatial pyramid pooling: captures context at several ranges; some variants use LSTMs to aggregate global context
Atrous convolutions: experiments with different atrous rates to capture long-range information

Max-pooling + striding at consecutive layers reduce the spatial resolution of feature maps in DCNNs
The atrous rate lets us control how densely features are computed in CNNs, without adding parameters
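
A minimal sketch of what the atrous rate does, in 1-D and in pure Python, assuming the usual definition y[i] = sum_k x[i + rate*k] * w[k]. The function name and the toy signal are illustrative, not from the paper. With rate=1 this is an ordinary convolution; a larger rate spreads the same filter taps further apart, so the field of view grows with no extra weights.

```python
def atrous_conv1d(x, w, rate=1):
    """1-D atrous convolution: y[i] = sum_k x[i + rate * k] * w[k].

    Only positions where every filter tap falls inside x are computed.
    """
    span = rate * (len(w) - 1)  # distance between the first and last tap
    return [
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]   # toy input (assumption)
kernel = [1, 0, -1]              # same 3 weights for both calls

dense = atrous_conv1d(signal, kernel, rate=1)    # taps cover 3 samples
dilated = atrous_conv1d(signal, kernel, rate=2)  # taps cover 5 samples
```

Both calls use the same three weights; only the spacing between the taps changes.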

Duplicate several copies of the last ResNet block (block4) and arrange them in cascade
Each of these blocks has three 3x3 convolutions; the last convolution in each block has stride 2, except in the final block
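
A hedged sketch of why cascading strided blocks calls for atrous convolution: every stride-2 layer halves the resolution, so one common bookkeeping scheme caps the total stride at a target value and converts any further stride into a larger atrous rate. The function name, the `output_stride` parameter, the example stride list, and the exact convention (the rate grows for the layers after a removed stride) are assumptions for illustration, not the paper's specification.

```python
def cascade_plan(nominal_strides, output_stride):
    """Return one (stride, rate) pair per layer so that the accumulated
    stride stops growing once it reaches output_stride."""
    plan = []
    current_stride = 1  # accumulated downsampling so far
    rate = 1            # atrous rate to apply to the next layer
    for s in nominal_strides:
        if current_stride == output_stride:
            # Target reached: keep the resolution (stride 1), use the current
            # rate, and grow the rate by the stride that was removed.
            plan.append((1, rate))
            rate *= s
        else:
            plan.append((s, 1))
            current_stride *= s
            if current_stride > output_stride:
                raise ValueError("output_stride not reachable with these strides")
    return plan


# Hypothetical cascade of six stride-2 layers, capped at a total stride of 8.
plan = cascade_plan([2, 2, 2, 2, 2, 2], output_stride=8)
```

Without the cap the six layers would shrink the feature map by a factor of 64; with it, the last three layers keep the resolution and use growing rates instead.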

Four parallel atrous convolutions with different atrous rates applied on top of the feature map (Atrous Spatial Pyramid Pooling, ASPP)
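
A toy 1-D sketch of the parallel layout, assuming zero padding so every branch keeps the input length and the branch outputs can be stacked per position. The function names, the all-ones kernels, and the small rates (1, 2, 3, 4) are illustrative assumptions chosen to fit a 9-sample signal; they are not the rates used in the paper.

```python
def atrous_conv1d_same(x, w, rate):
    """Odd-length atrous convolution, zero-padded so len(output) == len(x)."""
    half = len(w) // 2
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(len(w)):
            j = i + rate * (k - half)  # tap position, centred on i
            if 0 <= j < n:             # taps outside x read as zero
                acc += x[j] * w[k]
        out.append(acc)
    return out


def aspp_1d(x, kernels, rates):
    """Apply one atrous branch per (kernel, rate) pair to the same input,
    then concatenate the branch outputs into one feature vector per position."""
    branches = [atrous_conv1d_same(x, w, r) for w, r in zip(kernels, rates)]
    return [list(per_position) for per_position in zip(*branches)]


impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]            # toy feature map (assumption)
features = aspp_1d(impulse, [[1, 1, 1]] * 4, rates=(1, 2, 3, 4))
```

Each position ends up with four values, one per rate; the branch with the largest rate is the only one that sees the central impulse from the edge of the signal.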