arXiv:2512.13687 · 2025-12-15

VTP: Towards Scalable Pre-training of Visual Tokenizers for Generation

VTP pre-trains image tokenizers with multi-objective signals so downstream generators scale with more data and compute.

Visual Tokenizer Pre-training · Image Generation

Quick orientation

What problem does it solve?

Image generators rely on a tokenizer (often a VAE or a VQ-style variant) to compress pixels into a compact token representation that the generator models instead of raw pixels. Existing tokenizers are optimized almost entirely for reconstruction, so their tokens carry little semantic structure, and downstream generation quality does not keep improving as data and compute grow. A toy sketch of the tokenization step follows.
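The sketch below illustrates the "compress pixels into tokens" step under the common VQ-VAE-style formulation: encoder features are snapped to their nearest codebook entry, and the resulting indices are the discrete tokens a generator would model. The function name, codebook size, and use of PyTorch are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of VQ-style tokenization (nearest-codebook lookup);
# illustrative only, not the paper's actual tokenizer.
import torch

def quantize(latents: torch.Tensor, codebook: torch.Tensor):
    """latents: (B, N, D) encoder features; codebook: (K, D) learned code vectors."""
    # Squared distance from every latent vector to every codebook entry.
    d = (latents.unsqueeze(-2) - codebook).pow(2).sum(-1)   # (B, N, K)
    tokens = d.argmin(-1)                                    # (B, N) discrete token ids
    quantized = codebook[tokens]                             # (B, N, D) vectors fed to the decoder
    return tokens, quantized

# Toy usage: a 16x16 latent grid quantized against a 1024-entry codebook.
codebook = torch.randn(1024, 64)
latents = torch.randn(2, 256, 64)
tokens, quantized = quantize(latents, codebook)
print(tokens.shape, quantized.shape)   # torch.Size([2, 256]) torch.Size([2, 256, 64])
```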

What is VTP?

A visual tokenizer pre-trained jointly with contrastive, self-supervised, and reconstruction objectives, so its tokens carry both semantic and pixel-level information and downstream generation quality scales with data and compute.
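As a hedged illustration of how such a multi-objective loss could be combined, the PyTorch sketch below sums a pixel reconstruction term, a CLIP-style contrastive term, and a self-supervised feature-prediction term. The specific loss forms, pairings (image/text embeddings, masked-feature targets), and weights are assumptions made for this sketch; the paper's exact objectives and weighting may differ.

```python
# Hypothetical multi-objective tokenizer pre-training loss; names and loss
# choices are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def vtp_style_loss(pixel_recon, pixels,        # reconstruction pair
                   img_emb, txt_emb,           # contrastive pair (e.g. image/text embeddings)
                   pred_feat, target_feat,     # self-supervised pair (e.g. masked-feature targets)
                   w_rec=1.0, w_con=1.0, w_ssl=1.0,
                   temperature=0.07):
    # 1) Pixel-level reconstruction keeps tokens decodable back to images.
    rec = F.mse_loss(pixel_recon, pixels)

    # 2) Contrastive (InfoNCE-style) term aligns token features with a semantic target.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    con = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    # 3) Self-supervised feature prediction adds dense semantic supervision.
    ssl = F.mse_loss(pred_feat, target_feat)

    return w_rec * rec + w_con * con + w_ssl * ssl

# Toy usage with random tensors (batch of 4).
B, D = 4, 256
loss = vtp_style_loss(
    pixel_recon=torch.randn(B, 3, 64, 64), pixels=torch.randn(B, 3, 64, 64),
    img_emb=torch.randn(B, D), txt_emb=torch.randn(B, D),
    pred_feat=torch.randn(B, 196, D), target_feat=torch.randn(B, 196, D),
)
print(loss.item())
```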