VOID: Video Object and Interaction Deletion (physically-consistent video inpainting)
VOID is a model for video object removal that aims to handle physical interactions, unlike existing video inpainting methods that fail to account for the dynamic effects of removed objects.
Why it matters
Standard inpainting can fill the pixels behind a mask but leaves the scene's motion unchanged; VOID targets the harder problem of also removing the object's physical effects on the scene, which matters for realistic video editing and content creation.
Key Points
- VOID models counterfactual scene evolution to predict what the video would look like if the object had never been there
- Uses counterfactual training data, VLM-guided masks, and a two-pass generation process to achieve physically-consistent results
- Outperformed baselines like Runway (Aleph), Generative Omnimatte, and ProPainter in a human preference study
Details
VOID addresses the limitations of existing video inpainting methods that can fill in pixels behind an object but fail to handle cases where the removed object affects the dynamics of the scene, such as a domino chain falling or two cars about to crash. VOID models the counterfactual scene evolution to predict what the video would look like if the object had never been there. Key ideas include using counterfactual training data (paired videos with and without objects), VLM-guided masks to identify affected regions, and a two-pass generation process to first predict the new motion and then refine with flow-warped noise for temporal consistency. In a human preference study on real-world videos, VOID was selected 64.8% of the time over baseline methods.
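To make the flow-warped noise idea concrete, here is a minimal NumPy sketch of backward-warping a per-frame noise map along an optical-flow field, so that the refinement pass starts from noise that is aligned across frames. The function name, nearest-neighbor warping, and flow conventions are illustrative assumptions, not VOID's actual implementation.

```python
import numpy as np

def warp_noise(noise: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp a (H, W) noise map by a (H, W, 2) optical-flow field.

    flow[..., 0] is the horizontal displacement (dx) and flow[..., 1] the
    vertical displacement (dy) from frame t to frame t+1. Each output pixel
    pulls the noise value from its source location in the previous frame
    (nearest-neighbor, clipped at the image border), so correlated noise
    follows the scene motion and the diffusion pass stays temporally coherent.
    """
    h, w = noise.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, h - 1)
    return noise[src_y, src_x]
```

With zero flow the noise is returned unchanged; a uniform rightward flow shifts the noise map one pixel to the right, which is exactly the alignment property the second generation pass relies on.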