Dev.to Machine Learning | Research & Papers, Products & Services

VHS: Latent Verifier Cuts Diffusion Model Verification Cost by 63.3%, Boosts GenEval by 2.7%

Researchers propose Verifier on Hidden States (VHS), a lightweight verifier that operates directly on the generator's latent features, eliminating costly pixel-space decoding. VHS reduces joint generation-and-verification time by 63.3% and improves GenEval performance by 2.7% compared to MLLM verifiers.

Why it matters

VHS makes inference-time scaling more practical for diffusion-based text-to-image generation: verification runs on the generator's latent features instead of decoded pixels, which cuts cost while slightly improving GenEval scores.

Key Points

  • VHS is a verifier that operates on the generator's hidden states, avoiding expensive pixel-space decoding
  • VHS reduces joint generation-and-verification time by 63.3% and compute FLOPs by 51%
  • VHS improves GenEval performance by 2.7% compared to MLLM-based verifiers
  • VHS is a lightweight MLP head that can be trained once and used efficiently for inference
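
To put the reported percentages in perspective, the short sketch below converts a relative reduction into the fraction of baseline cost that remains and the equivalent speedup factor. It assumes the figures are relative reductions against the MLLM-verifier baseline; the function names are illustrative and not from the paper.

```python
def remaining_fraction(reduction_pct: float) -> float:
    """Fraction of the baseline cost left after a relative reduction."""
    return 1.0 - reduction_pct / 100.0


def speedup_factor(reduction_pct: float) -> float:
    """Baseline cost divided by the reduced cost."""
    return 1.0 / remaining_fraction(reduction_pct)


# Reductions quoted in the summary (assumed relative to the MLLM baseline).
reported = {"generation+verification time": 63.3, "compute FLOPs": 51.0}

for name, pct in reported.items():
    print(f"{name}: {remaining_fraction(pct):.3f} of baseline, "
          f"about {speedup_factor(pct):.2f}x")
```

Under that assumption, a 63.3% time reduction corresponds to roughly 2.7 times the throughput, and a 51% FLOPs reduction to roughly half the compute per verified image.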

Details

The paper introduces Verifier on Hidden States (VHS), a method that reduces the computational overhead of using verifiers to improve text-to-image generation. Inference-time scaling, in which a model generates multiple candidates and a separate verifier selects the best one, is effective but inefficient for diffusion-based generators. These models generate images in a compressed latent space, yet a multimodal LLM (MLLM) verifier works on pixels, so each latent image must first be decoded to full pixel space and then re-encoded by the verifier, a redundant and expensive round trip.

VHS avoids this by operating directly on the generator's hidden representations: it reads the features produced during denoising, before they are projected to the final latent space. Architecturally, VHS is a simple MLP head that takes the generator's final hidden state as input and outputs a quality score. It is trained with a contrastive loss so that higher-quality candidates receive higher scores.

According to the reported results, VHS reduces joint generation-and-verification time by 63.3%, compute FLOPs by 51%, and VRAM usage by 14.5%, while improving GenEval performance by 2.7% compared to MLLM verifiers. These efficiency gains make inference-time scaling viable for real-time or high-throughput applications where the verification cost was previously prohibitive.
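
The scoring-and-selection idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the `MLPHead` class, its layer sizes, and the random "hidden states" are hypothetical, the weights are untrained, and a pairwise logistic ranking loss stands in for the contrastive loss, whose exact form the summary does not give.

```python
import math
import random
from typing import List


class MLPHead:
    """Hypothetical two-layer MLP mapping a hidden-state vector to a scalar score."""

    def __init__(self, in_dim: int, hidden_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
        self.b2 = 0.0

    def score(self, h: List[float]) -> float:
        # ReLU hidden layer followed by a linear readout.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, h)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return sum(w * a for w, a in zip(self.w2, hidden)) + self.b2


def pairwise_ranking_loss(score_better: float, score_worse: float) -> float:
    """Stand-in loss: -log(sigmoid(s_better - s_worse)), computed stably."""
    margin = score_better - score_worse
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def select_best(head: MLPHead, hidden_states: List[List[float]]) -> int:
    """Best-of-N: index of the candidate with the highest verifier score."""
    scores = [head.score(h) for h in hidden_states]
    return max(range(len(scores)), key=scores.__getitem__)


# Four fake candidates, each an 8-dimensional stand-in for a generator hidden state.
rng = random.Random(1)
candidates = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
head = MLPHead(in_dim=8, hidden_dim=16)
best = select_best(head, candidates)
print("selected candidate:", best)
```

The point of the design is that `select_best` never touches pixels: every candidate is scored from features the generator has already computed, so only the winning candidate would need to be decoded.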
