arXiv:2512.13687 · 2025-12-15

VTP: Towards Scalable Pre-training of Visual Tokenizers for Generation

VTP pre-trains image tokenizers with multi-objective signals so downstream generators scale with more data and compute.

Visual Tokenizer Pre-training · Image Generation

Quick orientation

What problem does it solve?

Image generators rely on a tokenizer (often a VAE or a VQ-style variant) to compress pixels into a compact token representation that the generator models instead of raw pixels. Existing tokenizers are optimized almost entirely for reconstruction, so their tokens carry little semantic structure, and downstream generation quality does not keep improving as data and compute grow. A toy sketch of the tokenization step follows.
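The sketch below illustrates the "compress pixels into tokens" step under the common VQ-VAE-style formulation: encoder features are snapped to their nearest codebook entry, and the resulting indices are the discrete tokens a generator would model. The function name, codebook size, and use of PyTorch are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of VQ-style tokenization (nearest-codebook lookup);
# illustrative only, not the paper's actual tokenizer.
import torch

def quantize(latents: torch.Tensor, codebook: torch.Tensor):
    """latents: (B, N, D) encoder features; codebook: (K, D) learned code vectors."""
    # Squared distance from every latent vector to every codebook entry.
    d = (latents.unsqueeze(-2) - codebook).pow(2).sum(-1)   # (B, N, K)
    tokens = d.argmin(-1)                                    # (B, N) discrete token ids
    quantized = codebook[tokens]                             # (B, N, D) vectors fed to the decoder
    return tokens, quantized

# Toy usage: a 16x16 latent grid quantized against a 1024-entry codebook.
codebook = torch.randn(1024, 64)
latents = torch.randn(2, 256, 64)
tokens, quantized = quantize(latents, codebook)
print(tokens.shape, quantized.shape)   # torch.Size([2, 256]) torch.Size([2, 256, 64])
```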

What is VTP?

A visual tokenizer pre-trained jointly with contrastive, self-supervised, and reconstruction objectives, so its tokens carry both semantic and pixel-level information and downstream generation quality scales with data and compute.
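As a hedged illustration of how such a multi-objective loss could be combined, the PyTorch sketch below sums a pixel reconstruction term, a CLIP-style contrastive term, and a self-supervised feature-prediction term. The specific loss forms, pairings (image/text embeddings, masked-feature targets), and weights are assumptions made for this sketch; the paper's exact objectives and weighting may differ.

```python
# Hypothetical multi-objective tokenizer pre-training loss; names and loss
# choices are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def vtp_style_loss(pixel_recon, pixels,        # reconstruction pair
                   img_emb, txt_emb,           # contrastive pair (e.g. image/text embeddings)
                   pred_feat, target_feat,     # self-supervised pair (e.g. masked-feature targets)
                   w_rec=1.0, w_con=1.0, w_ssl=1.0,
                   temperature=0.07):
    # 1) Pixel-level reconstruction keeps tokens decodable back to images.
    rec = F.mse_loss(pixel_recon, pixels)

    # 2) Contrastive (InfoNCE-style) term aligns token features with a semantic target.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    con = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    # 3) Self-supervised feature prediction adds dense semantic supervision.
    ssl = F.mse_loss(pred_feat, target_feat)

    return w_rec * rec + w_con * con + w_ssl * ssl

# Toy usage with random tensors (batch of 4).
B, D = 4, 256
loss = vtp_style_loss(
    pixel_recon=torch.randn(B, 3, 64, 64), pixels=torch.randn(B, 3, 64, 64),
    img_emb=torch.randn(B, D), txt_emb=torch.randn(B, D),
    pred_feat=torch.randn(B, 196, D), target_feat=torch.randn(B, 196, D),
)
print(loss.item())
```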