We introduce MoCA-Video (Motion Concept Alignment in Video), a training-free framework that bridges the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object in the video while preserving the original motion and visual context. Our approach leverages a diagonal denoising schedule and class-agnostic segmentation to detect and track objects in the latent space and to precisely control the spatial location of the blended objects. To ensure temporal coherence, we incorporate momentum-based semantic corrections and gamma residual noise stabilization for smooth frame transitions. We evaluate MoCA-Video's performance using standard SSIM, image-level LPIPS, and temporal LPIPS, and introduce a novel metric, CASS (Conceptual Alignment Shift Score), to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames. On a self-constructed dataset, MoCA-Video outperforms current baselines, achieving superior spatial consistency, coherent motion, and a significantly higher CASS score, despite requiring no training or fine-tuning. MoCA-Video demonstrates that structured manipulation of the diffusion noise trajectory enables controllable, high-quality video synthesis.
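To give a concrete feel for the momentum-based semantic correction mentioned above, here is a minimal sketch of the idea: an exponential-moving-average correction that nudges each frame's latent toward the reference-image latent across denoising steps. The function name, hyperparameters (`beta`, `step_size`), and tensor shapes are illustrative assumptions, not the released implementation.

```python
import torch

def momentum_semantic_correction(latent, target_latent, state, beta=0.9, step_size=0.1):
    """Illustrative sketch: apply a momentum-smoothed correction that pulls the
    current video latent toward the reference-image latent, so the semantic
    injection changes gradually rather than jumping between steps/frames."""
    # Direction from the current latent toward the semantic target.
    direction = target_latent - latent
    # Exponential moving average of the correction (the momentum term).
    state = beta * state + (1.0 - beta) * direction
    # Take a small, smoothed step along the accumulated direction.
    return latent + step_size * state, state

# Toy usage with random tensors standing in for real diffusion latents.
latent = torch.randn(4, 64, 64)    # current frame latent (C, H, W)
target = torch.randn(4, 64, 64)    # latent encoding of the reference image
state = torch.zeros_like(latent)   # momentum buffer
for _ in range(5):                 # a few denoising steps
    latent, state = momentum_semantic_correction(latent, target, state)
```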
NotebookLM-generated AI podcast for a simple and clear overview of MoCA-Video.
The CASS (Conceptual Alignment Shift Score) metric measures how much a generated video shifts semantically toward the conditioned image and away from the original text prompt, quantifying the effectiveness of video semantic mixing.
$$\text{CLIP-T}_{\text{orig}} = \text{sim}(E(V_{\text{orig}}), T_{\text{orig}}), \quad \text{CLIP-T}_{\text{fused}} = \text{sim}(E(V_{\text{fused}}), T_{\text{orig}})$$
$$\text{CLIP-I}_{\text{orig}} = \text{sim}(E(V_{\text{orig}}), E(I_{\text{cond}})), \quad \text{CLIP-I}_{\text{fused}} = \text{sim}(E(V_{\text{fused}}), E(I_{\text{cond}}))$$
$$\text{CASS} = (\text{CLIP-I}_{\text{fused}} - \text{CLIP-I}_{\text{orig}}) - (\text{CLIP-T}_{\text{fused}} - \text{CLIP-T}_{\text{orig}})$$
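As an illustration of how CASS could be computed in practice, below is a minimal sketch using Hugging Face's CLIP. Averaging similarities over sampled frames and the ViT-B/32 backbone are assumptions chosen for illustration, not necessarily the paper's exact evaluation protocol.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(frames, prompt, cond_image):
    """Return (CLIP-T, CLIP-I): average similarity of video frames to the
    original text prompt and to the conditioning image, respectively.
    `frames` is a list of PIL images sampled from the video."""
    inputs = processor(text=[prompt], images=frames + [cond_image],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize projected embeddings so dot products are cosine similarities.
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    frame_emb, cond_emb = img_emb[:-1], img_emb[-1:]
    clip_t = (frame_emb @ txt_emb.T).mean().item()   # CLIP-T
    clip_i = (frame_emb @ cond_emb.T).mean().item()  # CLIP-I
    return clip_t, clip_i

# CASS: gain in image alignment minus gain in text alignment.
# orig_frames / fused_frames are lists of PIL frames from the two videos.
# t_orig, i_orig = clip_scores(orig_frames, prompt, cond_image)
# t_fused, i_fused = clip_scores(fused_frames, prompt, cond_image)
# cass = (i_fused - i_orig) - (t_fused - t_orig)
```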
relCASS is a normalized form of CASS that removes inherent task difficulty and bias, yielding a more balanced evaluation.
$$\text{relCLIP-I} = \frac{\text{CLIP-I}_{\text{fused}} - \text{CLIP-I}_{\text{orig}}}{\text{CLIP-I}_{\text{orig}}}, \quad \text{relCLIP-T} = \frac{\text{CLIP-T}_{\text{fused}} - \text{CLIP-T}_{\text{orig}}}{\text{CLIP-T}_{\text{orig}}}$$
$$\text{relCASS} = \text{relCLIP-I} - \text{relCLIP-T}$$
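Continuing the hypothetical sketch above, relCASS follows directly by normalizing each shift by its original score before differencing, so tasks with different baseline similarities are comparable.

```python
def rel_cass(t_orig, i_orig, t_fused, i_fused):
    """Relative CASS: normalize each CLIP shift by its original score."""
    rel_clip_i = (i_fused - i_orig) / i_orig
    rel_clip_t = (t_fused - t_orig) / t_orig
    return rel_clip_i - rel_clip_t
```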
We collected 100 prompts based on the super-categories from Freeblend and extended them with DAVIS-16 categories to enrich their semantic context.
@misc{zhang2025motionawareconceptalignmentconsistent,
title={Motion-Aware Concept Alignment for Consistent Video Editing},
author={Tong Zhang and Juan C Leon Alcazar and Bernard Ghanem},
year={2025},
eprint={2506.01004},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.01004},
}