
VTP: Multi-objective pre-training for scalable visual tokenizers.

A ViT-based tokenizer trained jointly with contrastive, self-supervised, and reconstruction objectives to make downstream image generators scale with additional compute.

Tags: VQ Tokenizer · Contrastive + SSL + Recon · DiT Scaling

Paper facts

  • Title: Towards Scalable Pre-training of Visual Tokenizers for Generation
  • Authors: Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang
  • Submitted: 15 Dec 2025 (v1)
  • Institutions: Huazhong University of Science and Technology, MiniMax
  • Venue: arXiv (cs.CV)
  • DOI: 10.48550/arXiv.2512.13687
  • Zero-shot accuracy: 78.2%
  • Reconstruction rFID: 0.36
  • DiT FID gain: 65.8%
  • Convergence: 4.1× faster

Problem setup

Visual tokenizers (e.g., VQ-VAEs) compress images into discrete codes for autoregressive or diffusion-based generation. The paper identifies a pre-training scaling bottleneck: tokenizers trained with pixel reconstruction alone produce tokens that do not improve semantic generation as pre-training compute scales, so downstream generators stop benefiting from additional tokenizer compute.
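For context on what the tokenizer produces, here is a minimal, generic sketch of the vector-quantization lookup a VQ tokenizer performs (PyTorch, assuming a plain VQ-VAE-style codebook; not the paper's implementation, and the codebook size below is illustrative):

```python
import torch

def vector_quantize(z, codebook):
    """Map continuous patch latents z (N, D) to their nearest codebook entries.

    codebook: (K, D) learnable embedding table.
    Returns quantized latents (with straight-through gradients) and token ids.
    """
    # Squared L2 distance between every latent and every codebook vector: (N, K).
    d = (z.pow(2).sum(1, keepdim=True)
         - 2 * z @ codebook.t()
         + codebook.pow(2).sum(1))
    idx = d.argmin(dim=1)              # discrete token ids, shape (N,)
    z_q = codebook[idx]                # quantized latents, shape (N, D)
    # Straight-through estimator: forward pass uses z_q, gradients flow back to z.
    z_q = z + (z_q - z).detach()
    return z_q, idx

# Toy usage: 256 patch latents of dimension 64, codebook of 8192 entries (sizes illustrative).
codebook = torch.nn.Parameter(torch.randn(8192, 64))
z = torch.randn(256, 64)
z_q, tokens = vector_quantize(z, codebook)
```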

Method

Architecture. A ViT-based auto-encoder with vector quantization, paired with a 12-layer text encoder (dim 768) and a 4-layer ViT-L pixel decoder; the latent dimension is 64 (with a 256-dim ablation). QKNorm is used to stabilize attention.
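The paper only states that QKNorm is used for attention stability; the sketch below shows one common way QK-normalization is inserted into a ViT attention block (the dimensions and the choice of LayerNorm are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

class QKNormAttention(torch.nn.Module):
    """Multi-head self-attention with per-head normalization of queries and keys."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)
        # Normalizing q and k keeps attention logits bounded, which helps training stability.
        self.q_norm = torch.nn.LayerNorm(self.head_dim)
        self.k_norm = torch.nn.LayerNorm(self.head_dim)

    def forward(self, x):                                  # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = self.q_norm(q.reshape(B, N, self.heads, self.head_dim)).transpose(1, 2)
        k = self.k_norm(k.reshape(B, N, self.heads, self.head_dim)).transpose(1, 2)
        v = v.reshape(B, N, self.heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)      # (B, heads, N, head_dim)
        return self.proj(out.transpose(1, 2).reshape(B, N, -1))
```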

Objectives. Joint training on three signals (a combined-loss sketch follows the list):

  • Self-supervised: DINOv2-style MIM plus self-distillation.
  • Contrastive: CLIP-style image-text alignment, distilling OpenCLIP text embeddings from noisy images.
  • Reconstruction: MSE stage followed by GAN fine-tuning with LPIPS + adversarial loss.
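
A minimal sketch of how the three signals might be combined in a single training step, assuming equal loss weights and simplified stand-ins for each term (the paper's exact losses, weights, and two-stage schedule are not reproduced here; LPIPS and the adversarial loss only enter the later GAN fine-tuning stage):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE between image and (distilled) text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def total_loss(masked_pred, teacher_feat, img_emb, txt_emb, recon, target_img,
               w_ssl=1.0, w_con=1.0, w_rec=1.0):
    # Self-supervised term: regress teacher features at masked positions (MIM / self-distillation stand-in).
    ssl = F.mse_loss(masked_pred, teacher_feat)
    # Contrastive term: align image latents with text embeddings.
    con = contrastive_loss(img_emb, txt_emb)
    # Reconstruction term: plain MSE in the first stage (LPIPS + adversarial loss are added during GAN fine-tuning).
    rec = F.mse_loss(recon, target_img)
    return w_ssl * ssl + w_con * con + w_rec * rec
```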

Experiments & results

  • Pre-training uses DataComp-1B (277M image-text samples) and follows DINOv2/OpenCLIP settings; evaluation is on ImageNet-1K.
  • Zero-shot accuracy reaches 78.2%, and reconstruction rFID is 0.36.
  • Scaling tokenizer pre-training FLOPs improves DiT FID by 65.8%, while existing tokenizers saturate.
  • Tokenizer convergence is 4.1× faster than distillation-based training.

The paper also reports a strong correlation between visual understanding metrics and downstream generation quality.
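As a hedged illustration only (not the paper's protocol), such a relationship can be quantified by correlating each tokenizer's understanding score with the FID of a generator trained on its tokens; a strong negative Pearson r means better understanding tracks lower, i.e. better, FID:

```python
import numpy as np

def understanding_generation_corr(understanding_scores, generation_fids):
    """Pearson correlation between understanding metrics (e.g., probe or
    zero-shot accuracy) and generation FID across tokenizer checkpoints."""
    scores = np.asarray(understanding_scores, dtype=float)
    fids = np.asarray(generation_fids, dtype=float)
    return np.corrcoef(scores, fids)[0, 1]
```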

Limitations (as stated or implied)

The paper does not provide a dedicated limitations section. From the reported setup, VTP depends on large-scale image-text data and compute, and results are centered on ImageNet/DiT pipelines. Generalization to other domains and generators remains to be tested.

Resources

  • arXiv: https://arxiv.org/abs/2512.13687

Suggested citation

Yao, J., Song, Y., Zhou, Y., & Wang, X. (2025). Towards Scalable Pre-training of Visual Tokenizers for Generation. arXiv:2512.13687.