MaskGIT: Masked Generative Image Transformer
CVPR 2022

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, Bill Freeman
Google Research


Class-conditional Image Editing by MaskGIT

Abstract

Image generative transformers typically treat an image as a sequence of tokens and decode it sequentially in raster-scan order, i.e., line by line.

This paper proposes a novel image synthesis paradigm using a bidirectional transformer decoder, which we term MaskGIT. During training, MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions. At inference time, the model begins by generating all tokens of an image simultaneously, and then refines the image iteratively, conditioned on the previous generation.
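The iterative decoding loop can be sketched in a few lines. Below is a minimal illustration, assuming a model callable that returns per-position softmax probabilities over the visual-token codebook, a 16x16 token grid, and a hypothetical MASK_ID; the cosine mask schedule follows the scheduling function the paper adopts, but the rest is an illustrative sketch, not the released implementation.

import math
import numpy as np

MASK_ID = 8192          # hypothetical id for the special [MASK] token
NUM_TOKENS = 16 * 16    # e.g., a 16x16 grid of visual tokens

def cosine_schedule(t, T):
    """Fraction of tokens left masked after step t of T (cosine decay)."""
    return math.cos(math.pi / 2 * t / T)

def decode(model, tokens=None, T=8, rng=np.random.default_rng(0)):
    """Iterative parallel decoding: predict all masked tokens at once,
    keep the most confident predictions, re-mask the rest, repeat."""
    if tokens is None:
        tokens = np.full(NUM_TOKENS, MASK_ID, dtype=np.int64)
    tokens = tokens.copy()
    n_maskable = int((tokens == MASK_ID).sum())   # schedule over the masked region
    for t in range(1, T + 1):
        probs = model(tokens)                     # (len(tokens), vocab) probabilities
        sampled = np.array([rng.choice(len(p), p=p) for p in probs])
        conf = probs[np.arange(len(tokens)), sampled]
        keep = tokens != MASK_ID                  # positions decoded in earlier steps
        sampled[keep] = tokens[keep]              # never resample fixed tokens
        conf[keep] = np.inf                       # and never re-mask them
        n_mask = int(cosine_schedule(t, T) * n_maskable)
        tokens = sampled
        if n_mask > 0:
            tokens[np.argsort(conf)[:n_mask]] = MASK_ID  # re-mask least confident
    return tokens

With T=8 steps, the loop decodes a 256-token image in 8 forward passes, versus 256 sequential passes for raster-scan autoregressive decoding.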

Our experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset and accelerates autoregressive decoding by up to 64x. Moreover, MaskGIT extends readily to various image editing tasks, such as inpainting, extrapolation, and image manipulation.
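Under the same assumptions as the sketch above, these editing tasks reduce to the choice of initial mask: tokens encoded from the input image serve as fixed bidirectional context, and only the tokens in the region to be regenerated start out masked. A hypothetical usage sketch (tokenizer_encode, tokenizer_decode, and region_to_edit are illustrative names, not the paper's API):

tokens = tokenizer_encode(image)        # assumed VQ encoder: image -> flat token array
tokens[region_to_edit] = MASK_ID        # boolean index over positions to regenerate
edited = decode(model, tokens)          # same iterative loop as above
image_out = tokenizer_decode(edited)    # assumed VQ decoder: tokens -> image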


Autoregressive decoding vs MaskGIT's parallel decoding visualized at the same playback speed (0.1s per step).

Paper

Applications

Horizontal image extrapolation:
Inpainting results on 512x512 Places2 images:

Class-conditional image editing results:

BibTeX

@InProceedings{chang2022maskgit,
  title     = {MaskGIT: Masked Generative Image Transformer},
  author    = {Huiwen Chang and Han Zhang and Lu Jiang and Ce Liu and William T. Freeman},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2022}
}

Acknowledgement

Webpage template from Richard Tucker.