Cover image source: A guide to convolution arithmetic for deep learning - Vincent Dumoulin, Francesco Visin - ArXiv
In ZF Net, the transposed convolution is used to approximate the inverse of convolution, so that the input patterns that most strongly activate a particular layer can be reconstructed from its feature maps. However, strictly speaking, the transposed convolution is not really the inverse of convolution. Then why can it be used for the reconstruction?
An ordinary convolution can be written in the form of a matrix multiplication $\mathbf{y} = \mathbf{C}\mathbf{x}$, where $\mathbf{C}$ is a sparse matrix whose non-zero entries are the kernel weights. Then the corresponding transposed convolution is defined as:

$$\hat{\mathbf{x}} = \mathbf{C}^{T}\mathbf{y}$$
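To make this concrete, consider a toy example (chosen here purely for illustration): a 1-D input of length 5 convolved with a 3-tap kernel $(k_1, k_2, k_3)$ at stride 1. The sparse matrix $\mathbf{C}$ is then:

$$\mathbf{C} = \begin{pmatrix} k_1 & k_2 & k_3 & 0 & 0 \\ 0 & k_1 & k_2 & k_3 & 0 \\ 0 & 0 & k_1 & k_2 & k_3 \end{pmatrix}$$

and multiplying by $\mathbf{C}^{T}$ scatters each output element back over the 3 input positions it was computed from.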
The convolution can be transformed to the vectorized form:

$$y_i = \mathbf{w}_i^{T}\mathbf{x}$$

where $\mathbf{x}$ is the flattened input feature map, $y_i$ is the $i$-th element of the flattened output feature map $\mathbf{y}$, and $\mathbf{w}_i$ (the $i$-th row of $\mathbf{C}$) holds the actual weights that are applied to $\mathbf{x}$ to yield $y_i$.
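The toy example above can be checked directly in NumPy. A minimal sketch (all names are illustrative):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])   # flattened input feature map
k = np.array([1., 0., -1.])          # 3-tap kernel

# Build the sparse matrix C: row w_i holds the kernel weights placed
# at the window position that produces output element y_i.
n_out = len(x) - len(k) + 1          # valid convolution, stride 1
C = np.zeros((n_out, len(x)))
for i in range(n_out):
    C[i, i:i + len(k)] = k

y = C @ x                            # y = Cx, the ordinary convolution
print(y)                             # [-2. -2. -2.]
print(C[0] @ x == y[0])              # True: y_i is the dot product w_i . x

# The transposed convolution simply applies C^T instead of C.
print((C.T @ y).shape)               # (5,), back to the input size
```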
What $\mathbf{x}$ makes $y_i$ generate the largest response? For example, $\mathbf{x} = \mathbf{w}_i$ is a good heuristic when the norm of $\mathbf{x}$ is bounded, say $\|\mathbf{x}\| \leq \|\mathbf{w}_i\|$, since after the multiplication all negative values in $\mathbf{w}_i$ become positive values. Therefore:

$$y_i = \mathbf{w}_i^{T}\mathbf{w}_i = \sum_j w_{ij}^{2} = \|\mathbf{w}_i\|^{2}$$
If $\|\mathbf{x}\|$ cannot be arbitrarily big, then setting $\mathbf{x} = \mathbf{w}_i$ yields the largest value of $y_i$; this is exactly the equality condition of the Cauchy–Schwarz inequality $\mathbf{w}_i^{T}\mathbf{x} \leq \|\mathbf{w}_i\|\|\mathbf{x}\|$. Or we can say, setting $\mathbf{x} = \mathbf{w}_i$ yields the largest response in the $i$-th item of the flattened output feature map $\mathbf{y}$.
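This maximality is easy to verify numerically. A small sketch (the candidate vectors are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(9)           # some row w_i of C

# Rescale random candidates to the same norm as w_i and compare responses.
best = max(
    float(w @ (v / np.linalg.norm(v) * np.linalg.norm(w)))
    for v in rng.standard_normal((10000, 9))
)
print(best <= w @ w)                 # True: no candidate beats x = w_i
```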
Now let's return to ZF Net's scenario: we know the response feature map $\mathbf{y}$; what kind of input feature map $\mathbf{x}$ would be the most activating, i.e., the most likely to generate $\mathbf{y}$?
We have known that $\mathbf{x} = \mathbf{w}_i$ yields the largest value of $y_i$. However, the values (i.e., the activation levels) of the $y_i$'s are different. So an intuitive idea is to use each $y_i$ as the weight of its $\mathbf{w}_i$ when constituting the desirable $\mathbf{x}$:

$$\hat{\mathbf{x}} = \sum_i y_i \mathbf{w}_i$$
Since the $\mathbf{w}_i$ are the rows of $\mathbf{C}$, this weighted sum equals $\mathbf{C}^{T}\mathbf{y}$. That's exactly a transposed convolution of $\mathbf{y}$!
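The same NumPy toy setup as before confirms that the weighted sum of rows and $\mathbf{C}^{T}\mathbf{y}$ are literally the same computation:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
k = np.array([1., 0., -1.])
C = np.zeros((3, 5))
for i in range(3):
    C[i, i:i + 3] = k                # rows of C are the vectors w_i
y = C @ x                            # the known response feature map

# Reconstruction as a sum of rows w_i, each weighted by its response y_i ...
x_hat = sum(y[i] * C[i] for i in range(len(y)))

# ... is exactly the transposed convolution C^T y.
print(np.allclose(x_hat, C.T @ y))   # True
```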
In the previous discussion, the transposed convolution was expressed in the flattened, vectorized form, which may be a little confusing. However, the vectorized form can always be stacked back into the matrix form, which leads to the conclusion that a transposed convolution is actually an ordinary convolution applied to the original input with zeros padded both inside (between elements, according to the stride) and outside (around the border), as shown below:
Image source: A guide to convolution arithmetic for deep learning - Vincent Dumoulin, Francesco Visin - ArXiv
This makes the implementation of transposed convolution very simple, as in the sketch below.
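A minimal sketch in PyTorch (assuming a single-channel input, a 2×2 kernel, stride 2, and no output padding; the built-in `F.conv_transpose2d` serves as the reference):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 3, 3)              # (batch, channel, H, W) input
w = torch.randn(1, 1, 2, 2)              # 2x2 kernel, single channel
s = 2                                    # stride of the transposed conv

# 1. "Inside" padding: insert (s - 1) zeros between neighbouring pixels.
h, wd = x.shape[2], x.shape[3]
xz = torch.zeros(1, 1, (h - 1) * s + 1, (wd - 1) * s + 1)
xz[:, :, ::s, ::s] = x

# 2. "Outside" padding: pad (k - 1) zeros around the border.
k = w.shape[-1]
xp = F.pad(xz, (k - 1, k - 1, k - 1, k - 1))

# 3. An ordinary stride-1 convolution with the spatially flipped kernel
#    (channel dims swapped to match conv2d's weight layout).
y_manual = F.conv2d(xp, torch.flip(w, dims=(2, 3)).transpose(0, 1))

# The built-in transposed convolution gives the same result.
y = F.conv_transpose2d(x, w, stride=s)
print(torch.allclose(y, y_manual, atol=1e-6))   # True
```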