ViQ: Text-Aligned Visual Quantized Representations at Any Resolution (ECCV 2026)
Tencent Hunyuan
ViQ is a discrete visual representation framework built on a SigLIP2 vision tower with position-aware, head-wise Finite Scalar Quantization (FSQ). It converts images at any native resolution into compact discrete codes that serve both multimodal LLMs (for understanding) and decoders (for high-fidelity reconstruction). Training has two stages: text-aligned semantic pre-training, then feature discretization via proximal representation learning. The paper reports that ViQ matches continuous-feature encoders on multimodal benchmarks while speeding up inference by 20-70%. Accepted to ECCV 2026.
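To make the head-wise FSQ idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes odd level counts per dimension, a tanh bound, and a mixed-radix code index per head; the names `fsq_quantize` and `headwise_fsq`, the level choice `[5, 5, 3, 3]`, and the head count are all hypothetical. The position-aware component and the SigLIP2 tower are omitted.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Round each channel onto a small fixed grid (plain FSQ, odd levels only).

    z:      float array of shape (..., d)
    levels: d odd integers, the number of grid points per channel (assumed)
    Returns integer grid coordinates in [0, L-1] per channel.
    """
    levels = np.asarray(levels, dtype=np.int64)
    half = (levels - 1) // 2                 # e.g. 5 levels -> coordinates -2..2
    bounded = np.tanh(z) * half              # squash into (-half, half)
    return np.rint(bounded).astype(np.int64) + half  # shift to 0..L-1

def headwise_fsq(features, num_heads, levels):
    """Split each token's feature vector into heads and emit one code per head.

    features: (num_tokens, num_heads * len(levels)) continuous features
    Returns an integer array of shape (num_tokens, num_heads).
    """
    levels = np.asarray(levels, dtype=np.int64)
    n, dim = features.shape
    d = len(levels)
    if dim != num_heads * d:
        raise ValueError("feature dim must equal num_heads * len(levels)")
    digits = fsq_quantize(features.reshape(n, num_heads, d), levels)
    # Mixed-radix basis turns the d grid coordinates into a single index.
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (digits * basis).sum(axis=-1)

# The token count follows the image's patch grid, so any resolution works:
rng = np.random.default_rng(0)
levels = [5, 5, 3, 3]                        # 225 codes per head (assumed)
small = headwise_fsq(rng.normal(size=(4 * 4, 8)), num_heads=2, levels=levels)
large = headwise_fsq(rng.normal(size=(16 * 12, 8)), num_heads=2, levels=levels)
print(small.shape, large.shape)
```

Because the grid is fixed rather than learned, there is no codebook to collapse, and the per-head vocabulary size is simply the product of the level counts.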
Why it matters
Discrete visual tokens are a key bottleneck for unified image-language models: prior methods traded reconstruction quality for semantics, or semantics for reconstruction quality. ViQ's resolution-agnostic, text-aligned quantization aims to close that gap. 80 upvotes on HF Daily Papers.
Importance: 3/5
Top-voted HF Daily paper (80 upvotes, +1 bump); ECCV 2026 acceptance; addresses resolution-agnostic discrete visual tokenization