Motion-Aware Concept Alignment for Consistent Video Editing


Tong Zhang, Juan C Leon Alcazar, Bernard Ghanem

KAUST

Abstract

We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video while preserving the original motion and visual context. Our approach leverages a diagonal denoising schedule and class-agnostic segmentation to detect and track objects in the latent space and to precisely control the spatial location of the blended objects. To ensure temporal coherence, we incorporate momentum-based semantic corrections and gamma residual noise stabilization for smooth frame transitions. We evaluate MoCA-Video's performance using the standard SSIM, image-level LPIPS, and temporal LPIPS, and introduce a novel metric, CASS (Conceptual Alignment Shift Score), to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames. On a self-constructed dataset, MoCA-Video outperforms current baselines, achieving superior spatial consistency, coherent motion, and a significantly higher CASS score, despite requiring no training or fine-tuning. MoCA-Video demonstrates that structured manipulation of the diffusion noise trajectory enables controllable, high-quality video synthesis.
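The two temporal-coherence components mentioned above can be illustrated with a short PyTorch sketch. This is a hypothetical illustration only: the function names, the momentum coefficient beta, and the residual weight gamma are placeholders and do not reproduce the authors' implementation; the paper defines the actual latent update rules.

import torch

# Hypothetical sketch (not the authors' code): momentum-based semantic
# correction smooths the per-frame pull toward the reference concept,
# and gamma residual stabilization blends in a small residual of the
# previous frame's latent to suppress flicker.
def momentum_semantic_correction(latent, concept_target, velocity, beta=0.9, step=0.1):
    # Direction from the current latent toward the injected concept features.
    correction = concept_target - latent
    # Exponential moving average of the correction across frames.
    velocity = beta * velocity + (1.0 - beta) * correction
    return latent + step * velocity, velocity

def gamma_residual_stabilization(latent, prev_latent, gamma=0.2):
    # Carry over a small residual of the previous frame's latent.
    return (1.0 - gamma) * latent + gamma * prev_latent

# Usage sketch over a sequence of frame latents (torch.Tensors of equal shape):
# velocity, prev = torch.zeros_like(frame_latents[0]), None
# for z in frame_latents:
#     z, velocity = momentum_semantic_correction(z, concept_target, velocity)
#     if prev is not None:
#         z = gamma_residual_stabilization(z, prev)
#     prev = z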

AI Podcast

An AI podcast created with NotebookLM for a simple and clear understanding of MoCA-Video.

Methodology

Results Gallery

Baseline Comparison Gallery

The CASS (Conceptual Alignment Shift Score) metric measures how much a generated video shifts semantically toward the conditioning image and away from the original text prompt, quantifying the effectiveness of video semantic mixing.

$$\text{CLIP-T}_{\text{orig}} = \text{sim}(E(V_{\text{orig}}), T_{\text{orig}}), \quad \text{CLIP-T}_{\text{fused}} = \text{sim}(E(V_{\text{fused}}), T_{\text{orig}})$$

$$\text{CLIP-I}_{\text{orig}} = \text{sim}(E(V_{\text{orig}}), E(I_{\text{cond}})), \quad \text{CLIP-I}_{\text{fused}} = \text{sim}(E(V_{\text{fused}}), E(I_{\text{cond}}))$$

$$\text{CASS} = (\text{CLIP-I}_{\text{fused}} - \text{CLIP-I}_{\text{orig}}) - (\text{CLIP-T}_{\text{fused}} - \text{CLIP-T}_{\text{orig}})$$

relCASS is a normalized form of CASS that removes inherent task difficulty and bias, yielding a more balanced evaluation.

$$\text{relCLIP-I} = \frac{\text{CLIP-I}_{\text{fused}} - \text{CLIP-I}_{\text{orig}}}{\text{CLIP-I}_{\text{orig}}}, \quad \text{relCLIP-T} = \frac{\text{CLIP-T}_{\text{fused}} - \text{CLIP-T}_{\text{orig}}}{\text{CLIP-T}_{\text{orig}}}$$

$$\text{relCASS} = \text{relCLIP-I} - \text{relCLIP-T}$$
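The sketch below shows how CASS and relCASS can be computed from the formulas above. It is a minimal sketch that assumes frame-averaged CLIP cosine similarities and the Hugging Face transformers CLIP model openai/clip-vit-base-patch32; the paper's exact CLIP backbone and frame aggregation may differ.

import torch
from transformers import CLIPModel, CLIPProcessor

# Assumed CLIP backbone; swap in the checkpoint used in your evaluation.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(frames):
    # frames: list of PIL.Image frames (or a single conditioning image in a list).
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

@torch.no_grad()
def embed_text(prompt):
    inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def clip_sim(frame_feats, ref_feat):
    # Mean cosine similarity between every frame and one reference embedding.
    return (frame_feats @ ref_feat.T).mean().item()

def cass_scores(orig_frames, fused_frames, cond_image, orig_prompt):
    v_orig, v_fused = embed_images(orig_frames), embed_images(fused_frames)
    t_orig, i_cond = embed_text(orig_prompt), embed_images([cond_image])
    clip_t_orig, clip_t_fused = clip_sim(v_orig, t_orig), clip_sim(v_fused, t_orig)
    clip_i_orig, clip_i_fused = clip_sim(v_orig, i_cond), clip_sim(v_fused, i_cond)
    # CASS: gain in image alignment minus gain in text alignment.
    cass = (clip_i_fused - clip_i_orig) - (clip_t_fused - clip_t_orig)
    # relCASS: the same gains, each normalized by the original similarity.
    rel_cass = ((clip_i_fused - clip_i_orig) / clip_i_orig
                - (clip_t_fused - clip_t_orig) / clip_t_orig)
    return cass, rel_cass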

Prompt Illustration

We collected 100 prompts based on the super categories from Freeblend and extended them with DAVIS-16 categories to enrich their semantic context.

Baseline Comparison

User Study Results

BibTeX

@misc{zhang2025motionawareconceptalignmentconsistent,
      title={Motion-Aware Concept Alignment for Consistent Video Editing}, 
      author={Tong Zhang and Juan C Leon Alcazar and Bernard Ghanem},
      year={2025},
      eprint={2506.01004},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.01004}, 
}