VTP: Teaching image tokenizers to carry meaning, not just pixels.

VTP pre-trains a visual tokenizer with multiple learning signals so generators improve when you scale data and compute.

Topics: Visual Tokenizer Pre-training · Image Generation

Paper facts

  • Title: Towards Scalable Pre-training of Visual Tokenizers for Generation
  • Authors: Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang
  • Submitted: 15 Dec 2025 (v1)
  • Institutions: Huazhong University of Science and Technology, MiniMax
  • Venue: arXiv (cs.CV)
  • DOI: 10.48550/arXiv.2512.13687

Key numbers

  • Zero-shot accuracy: 78.2%
  • rFID: 0.36
  • FID gain: 65.8%
  • Speedup: 4.1×

Problem setup (plain language)

  • Image generators first compress an image into discrete tokens using a visual tokenizer (see the sketch after this list).
  • Most tokenizers focus only on pixel reconstruction, so the tokens miss high-level meaning.
  • When you scale training compute, generation quality stops improving because the tokens are not semantic enough.
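
To make the first bullet concrete, here is a minimal, self-contained sketch of a generic VQ-style tokenizer round trip: encode an image into a grid of token ids, then decode the ids back to pixels. This is illustrative only; the toy convolutional encoder/decoder, codebook size, and tensor shapes are assumptions, not VTP's ViT-based architecture.

```python
# Toy VQ-style visual tokenizer (not the paper's code): shows the
# encode -> discrete tokens -> decode round trip that generators rely on.
import torch
import torch.nn as nn

class ToyVisualTokenizer(nn.Module):
    def __init__(self, codebook_size=1024, dim=64):
        super().__init__()
        # Encoder: downsample a 3x256x256 image to a 16x16 grid of features.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=8, stride=8),    # 256 -> 32
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=2, stride=2),  # 32 -> 16
        )
        # Codebook: each grid cell is snapped to its nearest code (a token id).
        self.codebook = nn.Embedding(codebook_size, dim)
        # Decoder: map quantized features back to pixels.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim, 3, kernel_size=8, stride=8),
        )

    def encode(self, images):
        feats = self.encoder(images)                       # (B, D, 16, 16)
        b, d, h, w = feats.shape
        flat = feats.permute(0, 2, 3, 1).reshape(-1, d)    # (B*H*W, D)
        dists = torch.cdist(flat, self.codebook.weight)    # distance to every code
        return dists.argmin(dim=1).view(b, h, w)           # discrete token ids

    def decode(self, token_ids):
        feats = self.codebook(token_ids).permute(0, 3, 1, 2)  # (B, D, H, W)
        return self.decoder(feats)

tokenizer = ToyVisualTokenizer()
images = torch.randn(2, 3, 256, 256)   # stand-in for a real batch
tokens = tokenizer.encode(images)      # (2, 16, 16) integer grid
recon = tokenizer.decode(tokens)       # (2, 3, 256, 256)
print(tokens.shape, recon.shape)
```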

Method in one page

VTP is a ViT-based auto-encoder trained with three signals at once:

  • Contrastive: connect images with their text so tokens capture semantics.
  • Self-supervised: mask parts of an image and learn to predict them (plus a self-distillation step).
  • Reconstruction: rebuild pixels, then refine with a GAN stage for sharper details.

The mix balances meaning and pixel fidelity so downstream generators scale better.
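
Below is a hedged sketch of how three such signals could be combined into one objective. The individual losses follow standard recipes (CLIP-style InfoNCE for the contrastive term, masked-patch regression for the self-supervised term, pixel MSE for reconstruction); the placeholder tensors and loss weights are assumptions, and the self-distillation step and GAN refinement stage are omitted, so this is not the paper's implementation.

```python
# Hedged sketch (not the authors' code): combining three training signals
# like VTP's into one loss. Random tensors stand in for encoder outputs.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style InfoNCE: matching image/text pairs should score highest."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarities
    labels = torch.arange(img_emb.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def masked_prediction_loss(pred_patches, target_patches, mask):
    """Self-supervised signal: predict features of the masked patches only."""
    return F.mse_loss(pred_patches[mask], target_patches[mask])

def reconstruction_loss(recon_pixels, real_pixels):
    """Pixel-fidelity signal (a GAN term would be added on top for sharpness)."""
    return F.mse_loss(recon_pixels, real_pixels)

# Stand-in tensors: batch of 8 image-text pairs, 196 patches, feature dim 256.
B, N, D = 8, 196, 256
img_emb, txt_emb = torch.randn(B, D), torch.randn(B, D)
pred, target = torch.randn(B, N, D), torch.randn(B, N, D)
mask = torch.rand(B, N) < 0.5                              # ~50% patches masked
recon, real = torch.randn(B, 3, 224, 224), torch.randn(B, 3, 224, 224)

# Hypothetical weights; the paper's actual balancing may differ.
w_con, w_ssl, w_rec = 1.0, 1.0, 1.0
total = (w_con * contrastive_loss(img_emb, txt_emb) +
         w_ssl * masked_prediction_loss(pred, target, mask) +
         w_rec * reconstruction_loss(recon, real))
print(float(total))
```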

Experiments & results

  • Pre-trained on DataComp-1B (277M image-text pairs) and evaluated on ImageNet.
  • Zero-shot accuracy on ImageNet reaches 78.2%, and reconstruction FID (rFID) is 0.36.
  • Scaling pre-training FLOPs yields a 65.8% FID improvement for DiT training, while baseline tokenizers plateau (the note after this list shows how such a relative gain is read).
  • Converges 4.1× faster than a distillation-based tokenizer.
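
As noted in the FLOPs bullet, a relative FID improvement like the reported 65.8% is usually read as (baseline FID − new FID) / baseline FID. The snippet below only illustrates that formula; the FID values in it are hypothetical placeholders, not numbers from the paper.

```python
# Relative FID improvement: (baseline FID - new FID) / baseline FID.
# The values below are hypothetical placeholders, not numbers from the paper.
def relative_fid_gain(fid_baseline: float, fid_new: float) -> float:
    return (fid_baseline - fid_new) / fid_baseline

print(f"{relative_fid_gain(10.0, 3.42):.1%}")  # -> 65.8% with these made-up values
```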

Limitations (as stated or implied)

The paper does not include a dedicated limitations section. Based on the setup, VTP likely depends on large-scale data and compute, and it is mainly validated on ImageNet with DiT-style generators; generalization to other domains or generator families still needs evidence.

Suggested citation

Yao, J., Song, Y., Zhou, Y., & Wang, X. (2025). Towards Scalable Pre-training of Visual Tokenizers for Generation. arXiv:2512.13687.