ViQ: Text-Aligned Visual Quantized Representations at Any Resolution (ECCV 2026)

Tencent Hunyuan

ViQ introduces a discrete visual representation framework built on a SigLIP2 vision tower with position-aware, head-wise Finite Scalar Quantization (FSQ). It converts images at any native resolution into compact discrete codes usable by both multimodal LLMs for understanding and decoders for high-fidelity reconstruction. Training uses two stages: text-aligned semantic pre-training and feature discretization via proximal representation learning. ViQ matches continuous-feature encoders on multimodal benchmarks while delivering 20-70% inference acceleration. Accepted to ECCV 2026.
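As a rough illustration of the head-wise FSQ idea mentioned above, the following is a minimal sketch of the forward pass only. It is not the ViQ implementation. The level counts, the function names, and the way the feature vector is split into heads are all assumptions made for the example. The position-aware component and the training procedure (including any straight-through gradient) are not modeled here.

```python
import math

# Hypothetical per-channel level counts; odd values keep the grid symmetric around 0.
LEVELS = [5, 5, 5]

def fsq_quantize(vec, levels=LEVELS):
    """Bound each channel with tanh, then round to the nearest integer grid point."""
    codes = []
    for x, n in zip(vec, levels):
        half = (n - 1) / 2  # the grid for this channel spans -half..half
        codes.append(round(math.tanh(x) * half))
    return codes

def codes_to_index(codes, levels=LEVELS):
    """Pack per-channel codes into a single integer token id (mixed-radix encoding)."""
    index, base = 0, 1
    for c, n in zip(codes, levels):
        index += (c + (n - 1) // 2) * base  # shift the code to 0..n-1 before packing
        base *= n
    return index

def headwise_fsq(feature, num_heads, levels=LEVELS):
    """Split one feature vector into equal heads and quantize each head on its own.

    Assumption: each head has exactly len(levels) channels.
    """
    head_dim = len(feature) // num_heads
    if head_dim != len(levels) or head_dim * num_heads != len(feature):
        raise ValueError("feature length must equal num_heads * len(levels)")
    tokens = []
    for h in range(num_heads):
        head = feature[h * head_dim:(h + 1) * head_dim]
        tokens.append(codes_to_index(fsq_quantize(head, levels), levels))
    return tokens
```

With these assumed settings each head maps to one of 5 × 5 × 5 = 125 codes, so a feature vector with two heads becomes two integer tokens. Because the codebook is an implicit grid rather than a learned table, no nearest-neighbor lookup is needed.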

Why it matters

Discrete visual tokens are a key bottleneck for unified image-language models: prior methods either sacrificed reconstruction quality for semantics or vice versa. ViQ's resolution-agnostic, text-aligned quantization bridges that gap. 80 upvotes on HF Daily Papers.

Importance: 3/5

Top-voted paper on HF Daily Papers (80 upvotes, +1 bump); accepted to ECCV 2026; addresses resolution-agnostic discrete visual tokenization

multimodal visual-tokenization quantization representation-learning eccv-2026

Sources

official ViQ: Text-Aligned Visual Quantized Representations at Any Resolution | arXiv

media ViQ | HuggingFace Daily Papers (80 upvotes)