Existing deep-learning-based image inpainting methods apply a standard convolutional network to the corrupted image, so the convolutional filter responses are conditioned on both the valid pixels and the substitute values in the masked holes (typically the mean value). This often leads to artifacts such as color discrepancy and blurriness. Post-processing is usually applied to reduce such artifacts, but it is expensive and may fail. We propose the use of partial convolutions, where the convolution is masked and renormalized to be conditioned on only valid pixels. We further include a mechanism to automatically generate an updated mask for the next layer as part of the forward pass. Our model outperforms other methods for irregular masks. We show qualitative and quantitative comparisons with other methods to validate our approach.
Media Coverage (Selected)
Fortune, Forbes, Fast Company, Engadget, SlashGear, Digital Trends, TNW, eTeknix, Game Debate, Alphr, Gizbot, Fossbytes, Techradar, Beeborn, Bit-tech, Hexus, HotHardWare, BleepingComputer, hardocp, boingboing, PetaPixel, Sohu (搜狐), Sina (新浪), QbitAI on Zhihu (量子位, 知乎)
- How are Equations (1) and (2) implemented?
I implemented them by extending the existing convolution layer provided by PyTorch.
The general idea is as follows:
- We have a convolution operator C that does the basic convolution we want; it has W and b as shown in the equations. To compute sum(M), we use another convolution operator D, whose kernel size and stride are the same as C's, but whose weights are all 1 and whose bias is 0.
- Note: M has the same number of channels, height and width as the feature/image. M is multi-channel, not single-channel.
- Thus C(X) = W^T * X + b, C(0) = b, and D(M) = 1 * M + 0 = sum(M), so W^T * (M .* X) / sum(M) + b = [C(M .* X) - C(0)] / D(M) + C(0).
- The value of W^T * (M .* X) / sum(M) + b may be very small.
- If you feel the value of W^T * (M .* X) / sum(M) is too small, an alternative to W^T * (M .* X) / sum(M) + b is W^T * (M .* X) * sum(I) / sum(M) + b, where I is a tensor filled with 1s that has the same number of channels, height and width as M. The two forms differ only by the constant factor sum(I), which can be absorbed into W, so they are mathematically equivalent. However, for some network initialization schemes, the latter may be easier to train.
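The C/D construction above can be sketched in a few lines. This is a minimal NumPy illustration rather than the PyTorch layer itself: `conv2d`, `partial_conv2d` and the `rescale` flag are hypothetical names, stride is fixed at 1 with no padding, and the zero output and mask update for windows with sum(M) = 0 follow Equations (1) and (2) as I read them.

```python
import numpy as np

def conv2d(x, w, b):
    """Plain stride-1 convolution without padding.
    x: (C_in, H, W), w: (C_out, C_in, k, k), b: (C_out,)."""
    c_out, _, k, _ = w.shape
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((c_out, h_out, w_out))
    for i in range(k):
        for j in range(k):
            patch = x[:, i:i + h_out, j:j + w_out]  # (C_in, h_out, w_out)
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], patch)
    return out + b[:, None, None]

def partial_conv2d(x, m, w, b, rescale=False):
    """[C(M .* X) - C(0)] / D(M) + C(0), plus the updated mask.
    m has the same shape as x (1 = valid, 0 = hole)."""
    c_masked = conv2d(m * x, w, b)                      # C(M .* X)
    c_zero = b[:, None, None]                           # C(0) = b
    d_m = conv2d(m, np.ones_like(w), np.zeros_like(b))  # D(M) = sum(M)
    valid = d_m > 0
    safe = np.where(valid, d_m, 1.0)                    # avoid division by zero
    # Optional sum(I) factor: number of elements in one window (C_in * k * k).
    scale = float(w[0].size) if rescale else 1.0
    out = (c_masked - c_zero) * scale / safe + c_zero
    out = np.where(valid, out, 0.0)                     # no valid input: output 0
    return out, valid.astype(x.dtype)                   # mask is 1 where sum(M) > 0

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 5))
m = np.ones((2, 5, 5))
m[:, 1:4, 1:4] = 0.0                                    # a 3x3 hole in every channel
w = rng.normal(size=(3, 2, 3, 3))
b = rng.normal(size=3)
out, new_mask = partial_conv2d(x, m, w, b)
```

With an all-ones mask and `rescale=True`, sum(I) / sum(M) is 1 and the result reduces to the ordinary convolution C(X), which is a convenient sanity check.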
- How is the mask dataset generated?
- The mask dataset is generated using the forward-backward optical flow consistency checking described in this paper. The first step is to compute the forward and backward flow using code such as deepflow or flownet2; the second step is to use the consistency checking code to generate the masks. Afterwards, we use random dilation, rotation and cropping to augment the mask dataset (if the generated holes are too small, you may try videos with larger motions).
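To make the consistency-checking step concrete, here is a rough NumPy sketch of a generic forward-backward check. It is not the consistency checking code referred to above: the function name `inconsistency_mask`, the fixed threshold `tol`, the nearest-pixel lookup and the (dx, dy) flow layout are all assumptions made for illustration, and the flow fields would come from a separate optical flow tool.

```python
import numpy as np

def inconsistency_mask(flow_fw, flow_bw, tol=1.0):
    """flow_fw, flow_bw: (H, W, 2) arrays holding (dx, dy) per pixel.
    Returns a boolean (H, W) array that is True where the two flows disagree."""
    h, w = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each pixel lands in the second frame, rounded to the nearest pixel.
    xt = np.rint(xs + flow_fw[..., 0]).astype(int)
    yt = np.rint(ys + flow_fw[..., 1]).astype(int)
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    xt_c = np.clip(xt, 0, w - 1)
    yt_c = np.clip(yt, 0, h - 1)
    # For a consistent pixel p: flow_fw(p) + flow_bw(p + flow_fw(p)) is close to 0.
    round_trip = flow_fw + flow_bw[yt_c, xt_c]
    err = np.linalg.norm(round_trip, axis=-1)
    # Pixels that leave the frame are treated as inconsistent too (an assumption).
    return (err > tol) | ~inside

# Toy flows: everything moves one pixel right, and the backward flow undoes it.
flow_fw = np.zeros((4, 6, 2))
flow_fw[..., 0] = 1.0
flow_bw = -flow_fw
mask_consistent = inconsistency_mask(flow_fw, flow_bw)

# Corrupt one backward vector so that one round trip no longer cancels.
flow_bw_bad = flow_bw.copy()
flow_bw_bad[2, 3] = (3.0, 0.0)
mask_bad = inconsistency_mask(flow_fw, flow_bw_bad)
```

In this sketch the True pixels would become the hole region of a mask before the dilation, rotation and cropping augmentation is applied.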
- What is the scale of the VGG features and their losses?
- Be careful about scale differences. The VGG model pretrained in PyTorch divides the image values by 255 before feeding them into the network, like this; PyTorch’s pretrained VGG model was also trained this way. This is what we are currently using. However, other frameworks (TensorFlow, Chainer) may not do that. This has a big impact on the scale of the perceptual loss and style loss.
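The effect of the input scale on the two losses can be illustrated with a toy experiment. The sketch below does not use VGG: `toy_features` is a hypothetical bias-free linear map followed by a ReLU, standing in for a feature extractor, and the Gram normalization is an arbitrary choice. Under those assumptions, feeding [0, 255] inputs instead of [0, 1] inputs multiplies a feature-space L1 loss by 255 and a Gram-matrix loss by 255^2; a real VGG has biases, so the factors there are only roughly of this order.

```python
import numpy as np

def toy_features(x, proj):
    """Stand-in for a feature extractor: bias-free linear map + ReLU.
    x: (3, H, W) -> (C_feat, H*W)."""
    return np.maximum(proj @ x.reshape(x.shape[0], -1), 0.0)

def perceptual_l1(fa, fb):
    # L1 distance between feature maps, averaged over all elements.
    return np.abs(fa - fb).mean()

def gram(f):
    # Feature auto-correlation; the 1/(H*W) normalization is arbitrary here.
    return f @ f.T / f.shape[1]

def style_l1(fa, fb):
    # L1 distance between Gram matrices.
    return np.abs(gram(fa) - gram(fb)).mean()

rng = np.random.default_rng(0)
proj = rng.normal(size=(8, 3))
img_a = rng.integers(0, 256, size=(3, 16, 16)).astype(np.float64)
img_b = rng.integers(0, 256, size=(3, 16, 16)).astype(np.float64)

raw = [toy_features(im, proj) for im in (img_a, img_b)]           # inputs in [0, 255]
unit = [toy_features(im / 255.0, proj) for im in (img_a, img_b)]  # inputs in [0, 1]

perceptual_ratio = perceptual_l1(*raw) / perceptual_l1(*unit)
style_ratio = style_l1(*raw) / style_l1(*unit)
```

Loss weights tuned for one input convention therefore do not transfer directly to a framework that uses the other.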
- How is padding done?
- Note that we didn’t directly use an existing padding scheme such as zero/reflection/repetition padding; instead, we use partial convolution as padding by assuming the regions outside the image (the border) are holes. This helps to reduce border artifacts. An easy way to implement this is to first zero-pad both the features and the masks and then apply the partial convolution operation and mask updating. Details can be found here: Partial Convolution based Padding
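The "zero-pad both, then apply partial convolution" recipe can be sketched for a single-channel feature as follows. This is an illustrative NumPy version, not the released implementation: `correlate_valid` and `partial_conv_pad` are hypothetical names, and the sum(I) / sum(M) form of the renormalization is assumed so that interior outputs match an ordinary convolution.

```python
import numpy as np

def correlate_valid(a, w):
    """Stride-1 sliding-window weighted sum without padding. a: (H, W), w: (k, k)."""
    k = w.shape[0]
    h_out, w_out = a.shape[0] - k + 1, a.shape[1] - k + 1
    out = np.zeros((h_out, w_out))
    for i in range(k):
        for j in range(k):
            out += w[i, j] * a[i:i + h_out, j:j + w_out]
    return out

def partial_conv_pad(x, w, b=0.0, m=None, pad=1):
    """Partial convolution where the padded border is treated as a hole."""
    k = w.shape[0]
    m = np.ones_like(x) if m is None else m
    xp = np.pad(m * x, pad)          # zero padding for the (masked) feature
    mp = np.pad(m, pad)              # zero padding for the mask: border = hole
    num = correlate_valid(xp, w)                   # W^T (M .* X)
    den = correlate_valid(mp, np.ones_like(w))     # sum(M)
    valid = den > 0
    out = np.where(valid, num * (k * k) / np.where(valid, den, 1.0) + b, 0.0)
    return out, valid.astype(x.dtype)              # output and updated mask

x = np.ones((4, 4))
box = np.full((3, 3), 1.0 / 9.0)                   # 3x3 averaging filter
zero_padded = correlate_valid(np.pad(x, 1), box)   # ordinary zero padding
out, new_mask = partial_conv_pad(x, box)
```

On this constant image the zero-padded result darkens toward the border (4/9 at the corners), while the partial-convolution version stays at 1 everywhere, which is the border artifact the renormalization is meant to remove.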
- How do skip links work?
- For skip links, we concatenate features and masks separately. Assume we have feature F and mask K output from the decoder stage, and feature I and mask M from the encoder stage. We concatenate F with I, and K with M. The concatenation outputs concat(F, I) and concat(K, M) will be the feature input and mask input for the next layer.
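As a small illustration of the two separate concatenations, the NumPy sketch below joins features and masks along the channel axis. The helper name `concat_skip`, the channels-first layout and the channel counts are assumptions, and the decoder tensors are taken to already have the same spatial size as the encoder tensors.

```python
import numpy as np

def concat_skip(f_dec, k_dec, i_enc, m_enc, channel_axis=0):
    """Return concat(F, I) and concat(K, M) along the channel axis."""
    feat = np.concatenate([f_dec, i_enc], axis=channel_axis)
    mask = np.concatenate([k_dec, m_enc], axis=channel_axis)
    return feat, mask

# Hypothetical shapes: 64 decoder channels and 32 encoder channels at 8x8.
f_dec = np.zeros((64, 8, 8))   # decoder feature F
k_dec = np.ones((64, 8, 8))    # decoder mask K
i_enc = np.zeros((32, 8, 8))   # encoder feature I
m_enc = np.ones((32, 8, 8))    # encoder mask M
m_enc[:, 2:4, 2:4] = 0.0       # the encoder mask may still contain holes

feat_in, mask_in = concat_skip(f_dec, k_dec, i_enc, m_enc)
```

Because the mask is multi-channel, the concatenated mask keeps one mask channel per feature channel, so the next partial convolution can sum over all of them.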
- L1 loss
- The L1 losses in the paper are all size-averaged.
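One way to read "size-averaged" is that each L1 term is a mean over every element of the tensor (the behaviour of a mean-reduced L1 loss) rather than over the hole or valid pixels only. The sketch below follows that reading, which is an assumption; `l1_hole` and `l1_valid` are hypothetical helper names, and the mask is 1 for valid pixels and 0 for holes.

```python
import numpy as np

def l1_hole(out, gt, m):
    """L1 error inside the holes, averaged over all elements of the tensor."""
    return np.abs((1.0 - m) * (out - gt)).mean()

def l1_valid(out, gt, m):
    """L1 error on the valid pixels, averaged over all elements of the tensor."""
    return np.abs(m * (out - gt)).mean()

gt = np.zeros((3, 4, 4))
out = np.full((3, 4, 4), 2.0)  # every element is off by 2
m = np.ones((3, 4, 4))
m[:, :2, :2] = 0.0             # 4 hole pixels per channel, 12 of 48 elements

hole_loss = l1_hole(out, gt, m)
valid_loss = l1_valid(out, gt, m)
```

Under this reading the two terms add up to the plain mean absolute error, and the hole term shrinks as the hole gets smaller even if the per-pixel error stays the same.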