VTP: Multi-objective pre-training for scalable visual tokenizers.
A ViT-based tokenizer trained jointly with contrastive, self-supervised, and reconstruction objectives to make downstream image generators scale with additional compute.
Paper facts
- Title: Towards Scalable Pre-training of Visual Tokenizers for Generation
- Authors: Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang
- Submitted: 15 Dec 2025 (v1)
- Institutions: Huazhong University of Science and Technology, MiniMax
- Venue: arXiv (cs.CV)
- DOI: 10.48550/arXiv.2512.13687
Problem setup
Visual tokenizers (e.g., VQ-VAEs) compress images into discrete codes that autoregressive or diffusion-based generators then model. The paper identifies a scaling gap: pre-training the tokenizer on pixel reconstruction alone yields tokens that stop improving downstream generation as pre-training compute grows, creating a pre-training scaling bottleneck.
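To make the tokenizer's role concrete, here is a minimal, hypothetical sketch of the encode/quantize/decode interface a VQ-style tokenizer exposes to a downstream generator; the class, dimensions, and codebook size are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class ToyVQTokenizer(nn.Module):
    """Minimal VQ tokenizer sketch: image patches -> discrete codes -> patches."""

    def __init__(self, codebook_size=8192, latent_dim=64, patch_dim=3 * 16 * 16):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.encoder = nn.Linear(patch_dim, latent_dim)   # stand-in for a ViT encoder
        self.decoder = nn.Linear(latent_dim, patch_dim)   # stand-in for a pixel decoder

    def encode(self, patches):                            # patches: (B, N, patch_dim)
        z = self.encoder(patches)                         # (B, N, latent_dim)
        # nearest-neighbour quantization against the codebook
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        return dist.argmin(dim=-1)                        # (B, N) discrete token ids

    def decode(self, ids):                                # ids: (B, N)
        z_q = self.codebook(ids)                          # look up quantized latents
        return self.decoder(z_q)                          # reconstructed patch pixels

# A downstream autoregressive or diffusion generator models the ids produced by
# encode() and renders samples through decode().
```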
Method
Architecture. A ViT-based auto-encoder with vector quantization, paired with a 12-layer text encoder (dim 768) and a 4-layer ViT-L pixel decoder; the latent dimension is 64, with a 256-dim ablation. QK-Norm is applied in attention for training stability.
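As a reading aid, the stated hyperparameters can be gathered into a config, and QK-Norm can be sketched as normalizing queries and keys before the dot product; the field names and the L2-normalization variant shown here are assumptions, not the paper's code.

```python
import torch.nn.functional as F
from dataclasses import dataclass

@dataclass
class VTPConfig:
    # Numbers from the summary above; field names are illustrative.
    text_encoder_layers: int = 12
    text_encoder_dim: int = 768
    pixel_decoder_layers: int = 4   # 4-layer ViT-L pixel decoder
    latent_dim: int = 64            # ablation also uses 256
    use_qk_norm: bool = True        # QK-Norm for attention stability

def qk_norm_attention(q, k, v, scale=1.0):
    # One common QK-Norm variant: L2-normalize queries and keys so the
    # attention logits stay bounded, which stabilizes training.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    attn = (q @ k.transpose(-2, -1)) * scale
    return attn.softmax(dim=-1) @ v
```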
Objectives. Joint training on three signals (a combined-loss sketch follows the list):
- Self-supervised: DINOv2-style masked image modeling (MIM) plus self-distillation.
- Contrastive: CLIP-style image-text alignment, distilling OpenCLIP text embeddings from noisy images.
- Reconstruction: MSE stage followed by GAN fine-tuning with LPIPS + adversarial loss.
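A minimal sketch of how the three signals could be combined into a single training loss; the weights, temperature, and helper tensors are assumptions for illustration, not the paper's implementation, and the GAN/LPIPS terms of the fine-tuning stage are omitted.

```python
import torch
import torch.nn.functional as F

def vtp_loss(recon, target, z_img, z_txt, student_feat, teacher_feat,
             w_rec=1.0, w_con=1.0, w_ssl=1.0, temperature=0.07):
    """Joint objective: reconstruction + contrastive + self-supervised.
    Weights and temperature are illustrative, not the paper's values."""
    # 1) Reconstruction: pixel MSE (GAN/LPIPS fine-tuning would follow later).
    loss_rec = F.mse_loss(recon, target)

    # 2) Contrastive: CLIP-style InfoNCE between image latents and
    #    (distilled) text embeddings.
    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    logits = z_img @ z_txt.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_con = 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

    # 3) Self-supervised: DINOv2-style distillation, reduced here to a cosine
    #    loss between student features and (stop-gradient) teacher features.
    loss_ssl = 1 - F.cosine_similarity(student_feat, teacher_feat.detach(), dim=-1).mean()

    return w_rec * loss_rec + w_con * loss_con + w_ssl * loss_ssl
```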
Experiments & results
- Pre-training uses DataComp-1B (277M image-text samples) and follows DINOv2/OpenCLIP settings; evaluation is on ImageNet-1K.
- Zero-shot ImageNet classification accuracy reaches 78.2%, and reconstruction FID (rFID) is 0.36.
- Scaling tokenizer pre-training FLOPs improves downstream DiT generation FID by 65.8%, whereas existing tokenizers saturate.
- Tokenizer convergence is 4.1× faster than distillation-based training.
The paper also reports a strong correlation between visual understanding metrics and downstream generation quality.
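This kind of claim is easy to probe on one's own checkpoints; the sketch below simply computes the correlation between an understanding metric and generation FID (the function name and setup are illustrative, and no numbers from the paper are used).

```python
import numpy as np

def understanding_vs_generation_corr(understanding_scores, generation_fids):
    """Pearson correlation between an understanding metric (higher is better)
    and a generation FID (lower is better), measured across tokenizer
    checkpoints. A strongly negative r supports the reported trend."""
    u = np.asarray(understanding_scores, dtype=float)
    g = np.asarray(generation_fids, dtype=float)
    return float(np.corrcoef(u, g)[0, 1])
```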
Limitations (as stated or implied)
The paper does not provide a dedicated limitations section. From the reported setup, VTP depends on large-scale image-text data and compute, and results are centered on ImageNet/DiT pipelines. Generalization to other domains and generators remains to be tested.
Suggested citation
Yao, J., Song, Y., Zhou, Y., & Wang, X. (2025). Towards Scalable Pre-training of Visual Tokenizers for Generation. arXiv:2512.13687.