Why transposed convolution can be used to reconstruct activating features from the input images

Cover image source: A guide to convolution arithmetic for deep learning - Vincent Dumoulin, Francesco Visin - ArXiv

In ZF Net, transposed convolution is used to approximate the inverse of convolution, so as to reconstruct, from a layer's activations, the features in the input image that activate that layer. Strictly speaking, however, transposed convolution is not really the inverse of convolution. Why, then, can it be used for this reconstruction?

What is transposed convolution

An ordinary convolution can be written in the form $Y = C * X$. The corresponding transposed convolution is then defined as:

$X' = C^T * Y$
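
To make this concrete, here is a small numeric sketch (assuming a single-channel 1D convolution with stride 1, no padding, and made-up toy sizes) that builds the convolution matrix $C$ explicitly and checks both definitions against PyTorch:

```python
import torch
import torch.nn.functional as F

# Toy setup (hypothetical sizes): single channel, stride 1, no padding.
n, k = 5, 3                           # input length and kernel size
x = torch.randn(n)                    # flattened input feature map X
w = torch.randn(k)                    # convolution kernel

# Build the (n-k+1) x n convolution matrix C: row i holds the kernel
# shifted to output position i. (PyTorch's conv is cross-correlation,
# so no kernel flip is needed here.)
m = n - k + 1
C = torch.zeros(m, n)
for i in range(m):
    C[i, i:i + k] = w

# Ordinary convolution: Y = C * X as a matrix-vector product.
y = C @ x
y_ref = F.conv1d(x.view(1, 1, -1), w.view(1, 1, -1)).flatten()
print(torch.allclose(y, y_ref, atol=1e-6))        # True

# Transposed convolution: X' = C^T * Y.
x_prime = C.T @ y
x_ref = F.conv_transpose1d(y.view(1, 1, -1), w.view(1, 1, -1)).flatten()
print(torch.allclose(x_prime, x_ref, atol=1e-6))  # True
```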

What yields the largest response

The convolution $Y = C * X$ can be transformed into the vectorized form:

$y_i = c_i^T x$
where $x$ is the flattened input feature map $X$, $y_i$ is the $i$-th element of the flattened output feature map $Y$, and $c_i$ is the vector of weights that is applied to $x$ to yield $y_i$.

What makes $y_i$ produce the largest response? For example, $x = (1, 2, -1)$ is a good choice when $c_i = (1, 2, -1)$, since after the element-wise multiplication even the entries corresponding to negative values in $c_i$ contribute positively to the sum. Therefore:

If $x$ cannot be arbitrarily large (say, its norm is fixed), then setting $x = c_i$ yields the largest value of $y_i$: by the Cauchy-Schwarz inequality, $c_i^T x$ is maximized exactly when $x$ is aligned with $c_i$. In other words, setting $x = c_i$ yields the largest response in the $i$-th element of the flattened output feature map $Y$.
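
As a quick sanity check (a sketch with made-up numbers), we can compare the response $c_i^T x$ for a few candidate $x$ that all share the same norm; the candidate aligned with $c_i$ wins:

```python
import torch

c_i = torch.tensor([1.0, 2.0, -1.0])

def scaled_to_norm_of(c, x):
    """Rescale x so that ||x|| == ||c||, making the comparison fair."""
    return x * c.norm() / x.norm()

candidates = {
    "x = c_i":  c_i.clone(),
    "all ones": scaled_to_norm_of(c_i, torch.ones(3)),
    "random":   scaled_to_norm_of(c_i, torch.randn(3)),
}

for name, x in candidates.items():
    print(f"{name:>10}: response = {torch.dot(c_i, x).item():.3f}")
# "x = c_i" always gives the largest response (Cauchy-Schwarz).
```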

Back-propagating the feature map

Now let's return to ZF Net's scenario: given the response feature map $Y$, what kind of input feature map $X$ would be most activating, i.e., most strongly generate $Y$?

We already know that $x = c_i$ yields the largest value of $y_i$. However, the values (i.e., the activation levels) of the different $y_i$ are not equal, so an intuitive idea is to use each $y_i$ as a weight and combine the corresponding $c_i$ to construct the desired $x$:

$x' = \sum_i y_i c_i = C^T y$

which, stacked back into feature-map form, reads:

$X' = C^T * Y$

That's exactly a transposed convolution of $C$!
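
We can also verify this numerically. The sketch below (assuming a single-channel 2D kernel with stride 1, no padding, and arbitrary toy sizes) computes $\sum_i y_i c_i$ by pasting the kernel at every response location, weighted by the response value there, and compares the result with PyTorch's conv_transpose2d:

```python
import torch
import torch.nn.functional as F

k, H, W = 3, 4, 4                       # kernel size and response-map size (toy values)
y = torch.randn(1, 1, H, W)             # response feature map Y
w = torch.randn(1, 1, k, k)             # convolution kernel

# x' = sum_i y_i * c_i: paste the kernel at every response position,
# scaled by the response value there.
x_sum = torch.zeros(1, 1, H + k - 1, W + k - 1)
for i in range(H):
    for j in range(W):
        x_sum[0, 0, i:i + k, j:j + k] += y[0, 0, i, j] * w[0, 0]

# The same quantity computed as a transposed convolution of C.
x_tc = F.conv_transpose2d(y, w)
print(torch.allclose(x_sum, x_tc, atol=1e-5))   # True
```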

Visualization of transposed convolution

In the discussion above, the transposed convolution was expressed in the flattened, vectorized form, which may be a little confusing. However, the vectorized form can always be stacked back into the spatial (matrix) form, which leads to the conclusion that a transposed convolution is actually an ordinary convolution applied to the original $X$ after padding it both inside (inserting zeros between elements according to the stride) and outside (around the border), as shown below:

[Figure: a transposed convolution performed as an ordinary convolution over an input padded with zeros between its elements and around its border]

Image source: A guide to convolution arithmetic for deep learning - Vincent Dumoulin, Francesco Visin - ArXiv
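
The figure's claim can be checked numerically as well. Below is a small sketch (assuming a single-channel 1D case with a made-up stride of 2 and no padding): insert stride−1 zeros between the elements of $Y$, pad $k-1$ zeros on each border, and run an ordinary convolution with the flipped kernel; the result matches the transposed convolution exactly.

```python
import torch
import torch.nn.functional as F

stride = 2
y = torch.randn(1, 1, 5)                 # response feature map Y (toy size)
w = torch.randn(1, 1, 3)                 # convolution kernel
k = w.shape[-1]

# Direct transposed convolution.
out_t = F.conv_transpose1d(y, w, stride=stride)

# Equivalent ordinary convolution:
# 1) insert (stride - 1) zeros between the elements of Y ("inside" padding);
n = y.shape[-1]
y_dilated = torch.zeros(1, 1, (n - 1) * stride + 1)
y_dilated[..., ::stride] = y
# 2) pad (k - 1) zeros on both borders ("outside" padding);
y_padded = F.pad(y_dilated, (k - 1, k - 1))
# 3) convolve with the flipped kernel at stride 1.
out_c = F.conv1d(y_padded, torch.flip(w, dims=[-1]), stride=1)

print(torch.allclose(out_t, out_c, atol=1e-6))   # True
```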

This makes the implementation of transposed convolution very simple, as in:

ConvTranspose2d - PyTorch 2.2 documentation
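
For reference, a minimal usage sketch (with hypothetical channel counts and sizes): a ConvTranspose2d with kernel 4, stride 2, and padding 1 doubles the spatial resolution of a response map, one common setting when reconstructing toward the input.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: map an 8-channel 4x4 response map to a
# 3-channel 8x8 map, doubling the spatial resolution.
deconv = nn.ConvTranspose2d(in_channels=8, out_channels=3,
                            kernel_size=4, stride=2, padding=1)
y = torch.randn(1, 8, 4, 4)
x_reconstructed = deconv(y)
print(x_reconstructed.shape)   # torch.Size([1, 3, 8, 8])
```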

Last updated: 2024-03-14

Do you have any ideas or comments? Please join the discussion on X 👇