arXiv:2512.13687 · 2025-12-15
VTP: Towards Scalable Pre-training of Visual Tokenizers for Generation
VTP pre-trains image tokenizers with multi-objective signals so downstream generators scale with more data and compute.
Visual Tokenizer
Pre-training
Image Generation
Quick orientation
What problem does it solve?
Image generators rely on a tokenizer (often a VAE) to compress pixels into compact latent tokens. Existing tokenizers are trained purely for reconstruction, so their tokens carry little semantic structure, and downstream generation quality does not scale well as data and compute grow.
What is VTP?
A visual tokenizer pre-trained with contrastive, self-supervised, and reconstruction objectives, so its tokens carry both semantic and pixel-level signals and downstream generation scales with data and compute.
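To make the multi-objective idea concrete, here is a minimal PyTorch sketch of combining the three kinds of losses on one tokenizer encoder. It assumes an encoder that outputs a grid of latent tokens, a pixel decoder, a text encoder for the contrastive signal, and an EMA teacher copy of the encoder for the self-supervised signal; all module names and loss weights are hypothetical illustrations, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def vtp_style_losses(encoder, decoder, text_encoder, ema_encoder,
                     images, aug_images, text_tokens, temperature=0.07):
    """Sketch of a multi-objective tokenizer pre-training step (assumed API)."""
    latents = encoder(images)          # [B, N, D] latent tokens
    recon = decoder(latents)           # [B, 3, H, W] reconstructed pixels

    # 1) Reconstruction: keep pixel-level detail in the tokens.
    loss_rec = F.mse_loss(recon, images)

    # 2) Contrastive (CLIP-style): align pooled image tokens with text embeddings.
    img_emb = F.normalize(latents.mean(dim=1), dim=-1)        # [B, D]
    txt_emb = F.normalize(text_encoder(text_tokens), dim=-1)  # [B, D]
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(images.size(0), device=images.device)
    loss_con = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # 3) Self-supervised: match an EMA teacher's embedding of an augmented view.
    with torch.no_grad():
        teacher = F.normalize(ema_encoder(aug_images).mean(dim=1), dim=-1)
    student = F.normalize(encoder(aug_images).mean(dim=1), dim=-1)
    loss_ssl = (1.0 - (student * teacher).sum(dim=-1)).mean()

    # Placeholder weights; the real balance between objectives is a design choice.
    return loss_rec + 1.0 * loss_con + 1.0 * loss_ssl
```

The point of the sketch is that the same latent tokens receive gradients from semantic objectives (contrastive, self-supervised) and a pixel objective (reconstruction), which is what lets them serve a downstream generator better than reconstruction-only tokens.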