VTP: Teaching image tokenizers to carry meaning, not just pixels.
VTP pre-trains a visual tokenizer with multiple learning signals so generators improve when you scale data and compute.
Tags: Visual Tokenizer · Pre-training · Image Generation
Paper facts
- Title: Towards Scalable Pre-training of Visual Tokenizers for Generation
- Authors: Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang
- Submitted: 15 Dec 2025 (v1)
- Institutions: Huazhong University of Science and Technology, MiniMax
- Venue: arXiv (cs.CV)
- DOI: 10.48550/arXiv.2512.13687
Key numbers
- Zero-shot accuracy: 78.2%
- rFID: 0.36
- FID gain: 65.8%
- Speedup: 4.1×
Problem setup (plain language)
- Image generators first compress an image into discrete tokens using a visual tokenizer (see the sketch after this list).
- Most tokenizers focus only on pixel reconstruction, so their tokens miss high-level meaning.
- When you scale training compute, generation quality stops improving because the tokens are not semantic enough.
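To make that two-stage pipeline concrete, here is a toy sketch assuming PyTorch. The module sizes are made up, random noise stands in for the generator, and the quantization step that would make the tokens discrete is omitted for brevity; none of this is the paper's actual architecture.

```python
# Toy sketch of the tokenize -> generate -> decode pipeline described above.
# All sizes, modules, and the stand-in "generator" are illustrative assumptions.
import torch
import torch.nn as nn

class ToyTokenizer(nn.Module):
    """Tiny auto-encoder: image -> compact latent tokens -> reconstructed image."""
    def __init__(self, patch: int = 8, dim: int = 64):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)        # patchify + embed
        self.decoder = nn.ConvTranspose2d(dim, 3, kernel_size=patch, stride=patch)

    def encode(self, images):   # (B, 3, H, W) -> (B, dim, H/patch, W/patch)
        return self.encoder(images)

    def decode(self, tokens):   # (B, dim, H/patch, W/patch) -> (B, 3, H, W)
        return self.decoder(tokens)

tokenizer = ToyTokenizer()
images = torch.randn(2, 3, 64, 64)

# Stage 1: the tokenizer compresses images into a small grid of latent tokens.
tokens = tokenizer.encode(images)

# Stage 2: a generator (e.g. a diffusion transformer) is trained to model these
# tokens; random noise stands in for its samples here.
generated_tokens = torch.randn_like(tokens)

# The tokenizer's decoder maps generated tokens back into pixels.
samples = tokenizer.decode(generated_tokens)
print(tokens.shape, samples.shape)  # torch.Size([2, 64, 8, 8]) torch.Size([2, 3, 64, 64])
```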
Method in one page
VTP is a ViT-based auto-encoder trained with three signals at once:
- Contrastive: connect images with their text so tokens capture semantics.
- Self-supervised: mask parts of an image and learn to predict them (plus a self-distillation step).
- Reconstruction: rebuild pixels, then refine with a GAN stage for sharper details.
The mix balances meaning and pixel fidelity so downstream generators scale better.
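As a loose illustration of how such signals can be combined, here is a minimal sketch of a weighted multi-objective loss. The function names, loss weights, dictionary keys, and the CLIP-style contrastive head are assumptions for illustration, not the paper's exact formulation, and the GAN refinement stage is omitted.

```python
# Minimal sketch: combine contrastive, masked-prediction, and reconstruction
# losses with scalar weights. All names and weights are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style InfoNCE between L2-normalized image and text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def masked_prediction_loss(pred_feats, target_feats, mask):
    """Predict features of hidden patches; `mask` is 1.0 where a patch was masked."""
    diff = (pred_feats - target_feats) ** 2          # (B, N, D)
    return (diff.mean(dim=-1) * mask).sum() / mask.sum().clamp(min=1)

def reconstruction_loss(recon, images):
    """Plain pixel L1; a GAN/perceptual term would be added in a later stage."""
    return F.l1_loss(recon, images)

def vtp_style_loss(outputs, batch, w_con=1.0, w_mask=1.0, w_rec=1.0):
    # `outputs` and `batch` are dicts from the tokenizer and data loader (assumed keys).
    return (w_con * contrastive_loss(outputs["img_emb"], batch["txt_emb"])
            + w_mask * masked_prediction_loss(outputs["pred_feats"],
                                              outputs["target_feats"],
                                              outputs["mask"])
            + w_rec * reconstruction_loss(outputs["recon"], batch["images"]))
```

The weights let you trade off semantics (contrastive, masked prediction) against pixel fidelity (reconstruction), which is the balance the summary above describes.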
Experiments & results
- Pre-trained on DataComp-1B (277M image-text pairs) and evaluated on ImageNet.
- Zero-shot ImageNet accuracy reaches 78.2% and reconstruction FID (rFID) is 0.36.
- Scaling pre-training FLOPs yields a 65.8% FID improvement for DiT training, while baselines plateau (see the note after this list).
- Converges 4.1× faster than a distillation-based tokenizer.
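If the 65.8% figure is a relative reduction in FID (the natural reading, since lower FID is better), it would be computed as below. This formula is an assumption about how the percentage is defined, not taken from the paper.

```latex
% Relative FID improvement (assumed definition); a 65.8% gain would mean the
% DiT trained on VTP tokens ends at roughly one third of the baseline's FID.
\Delta_{\mathrm{FID}} = \frac{\mathrm{FID}_{\mathrm{baseline}} - \mathrm{FID}_{\mathrm{VTP}}}{\mathrm{FID}_{\mathrm{baseline}}} \times 100\%
```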
Limitations (as stated or implied)
The paper does not include a dedicated limitations section. Based on the setup, VTP likely depends on large-scale data and compute, and it is mainly validated on ImageNet with DiT-style generators; generalization to other domains or generator families still needs evidence.
Suggested citation
Yao, J., Song, Y., Zhou, Y., & Wang, X. (2025). Towards Scalable Pre-training of Visual Tokenizers for Generation. arXiv:2512.13687.