VTP: Teaching image tokenizers to carry meaning, not just pixels.

VTP pre-trains a visual tokenizer with multiple learning signals so generators improve when you scale data and compute.

Topics: Visual Tokenizer Pre-training · Image Generation

Paper facts

  • Title: Towards Scalable Pre-training of Visual Tokenizers for Generation
  • Authors: Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang
  • Submitted: 15 Dec 2025 (v1)
  • Institutions: Huazhong University of Science and Technology, MiniMax
  • Venue: arXiv (cs.CV)
  • DOI: 10.48550/arXiv.2512.13687

Key numbers

  • Zero-shot accuracy: 78.2%
  • rFID: 0.36
  • FID gain: 65.8%
  • Speedup: 4.1×

Problem setup (plain language)

  • Image generators first compress an image into discrete tokens using a visual tokenizer (see the sketch after this list).
  • Most tokenizers focus only on pixel reconstruction, so the tokens miss high-level meaning.
  • When you scale training compute, generation quality stops improving because the tokens are not semantic enough.
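
To make the first bullet concrete, here is a minimal, self-contained sketch of a generic VQ-style tokenizer round trip: encode an image into a grid of token ids, then decode the ids back to pixels. This is illustrative only; the toy convolutional encoder/decoder, codebook size, and tensor shapes are assumptions, not VTP's ViT-based architecture.

```python
# Toy VQ-style visual tokenizer (not the paper's code): shows the
# encode -> discrete tokens -> decode round trip that generators rely on.
import torch
import torch.nn as nn

class ToyVisualTokenizer(nn.Module):
    def __init__(self, codebook_size=1024, dim=64):
        super().__init__()
        # Encoder: downsample a 3x256x256 image to a 16x16 grid of features.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=8, stride=8),    # 256 -> 32
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=2, stride=2),  # 32 -> 16
        )
        # Codebook: each grid cell is snapped to its nearest code (a token id).
        self.codebook = nn.Embedding(codebook_size, dim)
        # Decoder: map quantized features back to pixels.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim, 3, kernel_size=8, stride=8),
        )

    def encode(self, images):
        feats = self.encoder(images)                       # (B, D, 16, 16)
        b, d, h, w = feats.shape
        flat = feats.permute(0, 2, 3, 1).reshape(-1, d)    # (B*H*W, D)
        dists = torch.cdist(flat, self.codebook.weight)    # distance to every code
        return dists.argmin(dim=1).view(b, h, w)           # discrete token ids

    def decode(self, token_ids):
        feats = self.codebook(token_ids).permute(0, 3, 1, 2)  # (B, D, H, W)
        return self.decoder(feats)

tokenizer = ToyVisualTokenizer()
images = torch.randn(2, 3, 256, 256)   # stand-in for a real batch
tokens = tokenizer.encode(images)      # (2, 16, 16) integer grid
recon = tokenizer.decode(tokens)       # (2, 3, 256, 256)
print(tokens.shape, recon.shape)
```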

Method in one page

VTP is a ViT-based auto-encoder trained with three signals at once:

  • Contrastive: connect images with their text so tokens capture semantics.
  • Self-supervised: mask parts of an image and learn to predict them (plus a self-distillation step).
  • Reconstruction: rebuild pixels, then refine with a GAN stage for sharper details.

The mix balances meaning and pixel fidelity so downstream generators scale better.
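
Below is a hedged sketch of how three such signals could be combined into one objective. The individual losses follow standard recipes (CLIP-style InfoNCE for the contrastive term, masked-patch regression for the self-supervised term, pixel MSE for reconstruction); the placeholder tensors and loss weights are assumptions, and the self-distillation step and GAN refinement stage are omitted, so this is not the paper's implementation.

```python
# Hedged sketch (not the authors' code): combining three training signals
# like VTP's into one loss. Random tensors stand in for encoder outputs.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style InfoNCE: matching image/text pairs should score highest."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarities
    labels = torch.arange(img_emb.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def masked_prediction_loss(pred_patches, target_patches, mask):
    """Self-supervised signal: predict features of the masked patches only."""
    return F.mse_loss(pred_patches[mask], target_patches[mask])

def reconstruction_loss(recon_pixels, real_pixels):
    """Pixel-fidelity signal (a GAN term would be added on top for sharpness)."""
    return F.mse_loss(recon_pixels, real_pixels)

# Stand-in tensors: batch of 8 image-text pairs, 196 patches, feature dim 256.
B, N, D = 8, 196, 256
img_emb, txt_emb = torch.randn(B, D), torch.randn(B, D)
pred, target = torch.randn(B, N, D), torch.randn(B, N, D)
mask = torch.rand(B, N) < 0.5                              # ~50% patches masked
recon, real = torch.randn(B, 3, 224, 224), torch.randn(B, 3, 224, 224)

# Hypothetical weights; the paper's actual balancing may differ.
w_con, w_ssl, w_rec = 1.0, 1.0, 1.0
total = (w_con * contrastive_loss(img_emb, txt_emb) +
         w_ssl * masked_prediction_loss(pred, target, mask) +
         w_rec * reconstruction_loss(recon, real))
print(float(total))
```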

Experiments & results

  • Pre-trained on DataComp-1B (277M image-text pairs) and evaluated on ImageNet.
  • Zero-shot accuracy on ImageNet reaches 78.2%, and reconstruction FID (rFID) is 0.36.
  • Scaling pre-training FLOPs yields a 65.8% FID improvement for DiT training, while baseline tokenizers plateau (the note after this list shows how such a relative gain is read).
  • Converges 4.1× faster than a distillation-based tokenizer.
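
As noted in the FLOPs bullet, a relative FID improvement like the reported 65.8% is usually read as (baseline FID − new FID) / baseline FID. The snippet below only illustrates that formula; the FID values in it are hypothetical placeholders, not numbers from the paper.

```python
# Relative FID improvement: (baseline FID - new FID) / baseline FID.
# The values below are hypothetical placeholders, not numbers from the paper.
def relative_fid_gain(fid_baseline: float, fid_new: float) -> float:
    return (fid_baseline - fid_new) / fid_baseline

print(f"{relative_fid_gain(10.0, 3.42):.1%}")  # -> 65.8% with these made-up values
```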

Limitations (as stated or implied)

The paper does not include a dedicated limitations section. Based on the setup, VTP likely depends on large-scale data and compute, and it is mainly validated on ImageNet with DiT-style generators; generalization to other domains or generator families still needs evidence.

Suggested citation

Yao, J., Song, Y., Zhou, Y., & Wang, X. (2025). Towards Scalable Pre-training of Visual Tokenizers for Generation. arXiv:2512.13687.