We introduce MoCA-Video (Motion Concept Alignment in Video), a training-free framework that bridges the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object in the video while preserving the original motion and visual context. Our approach leverages a diagonal denoising schedule and class-agnostic segmentation to detect and track objects in the latent space and to precisely control the spatial location of the blended objects. To ensure temporal coherence, we incorporate momentum-based semantic corrections and gamma residual noise stabilization for smooth frame transitions. We evaluate MoCA-Video's performance using standard SSIM, image-level LPIPS, and temporal LPIPS, and introduce a novel metric, CASS (Conceptual Alignment Shift Score), to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames. On a self-constructed dataset, MoCA-Video outperforms current baselines, achieving superior spatial consistency, coherent motion, and a significantly higher CASS score, despite requiring no training or fine-tuning. MoCA-Video demonstrates that structured manipulation of the diffusion noise trajectory enables controllable, high-quality video synthesis.
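To give a concrete feel for the momentum-based semantic correction mentioned above, here is a minimal sketch of the idea: an exponential-moving-average correction that nudges each frame's latent toward the reference-image latent across denoising steps. The function name, hyperparameters (`beta`, `step_size`), and tensor shapes are illustrative assumptions, not the released implementation.

```python
import torch

def momentum_semantic_correction(latent, target_latent, state, beta=0.9, step_size=0.1):
    """Illustrative sketch: apply a momentum-smoothed correction that pulls the
    current video latent toward the reference-image latent, so the semantic
    injection changes gradually rather than jumping between steps/frames."""
    # Direction from the current latent toward the semantic target.
    direction = target_latent - latent
    # Exponential moving average of the correction (the momentum term).
    state = beta * state + (1.0 - beta) * direction
    # Take a small, smoothed step along the accumulated direction.
    return latent + step_size * state, state

# Toy usage with random tensors standing in for real diffusion latents.
latent = torch.randn(4, 64, 64)    # current frame latent (C, H, W)
target = torch.randn(4, 64, 64)    # latent encoding of the reference image
state = torch.zeros_like(latent)   # momentum buffer
for _ in range(5):                 # a few denoising steps
    latent, state = momentum_semantic_correction(latent, target, state)
```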
NotebookLM-generated AI podcast for a simple and clear overview of MoCA-Video.
The CASS (Conceptual Alignment Shift Score) metric measures how much a generated video shifts semantically toward the conditioned image and away from the original text prompt, quantifying the effectiveness of video semantic mixing.
$$\text{CLIP-T}_{\text{orig}} = \text{sim}(E(V_{\text{orig}}), T_{\text{orig}}), \quad \text{CLIP-T}_{\text{fused}} = \text{sim}(E(V_{\text{fused}}), T_{\text{orig}})$$
$$\text{CLIP-I}_{\text{orig}} = \text{sim}(E(V_{\text{orig}}), E(I_{\text{cond}})), \quad \text{CLIP-I}_{\text{fused}} = \text{sim}(E(V_{\text{fused}}), E(I_{\text{cond}}))$$
$$\text{CASS} = (\text{CLIP-I}_{\text{fused}} - \text{CLIP-I}_{\text{orig}}) - (\text{CLIP-T}_{\text{fused}} - \text{CLIP-T}_{\text{orig}})$$
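As an illustration of how CASS could be computed in practice, below is a minimal sketch using Hugging Face's CLIP. Averaging similarities over sampled frames and the ViT-B/32 backbone are assumptions chosen for illustration, not necessarily the paper's exact evaluation protocol.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(frames, prompt, cond_image):
    """Return (CLIP-T, CLIP-I): average similarity of video frames to the
    original text prompt and to the conditioning image, respectively.
    `frames` is a list of PIL images sampled from the video."""
    inputs = processor(text=[prompt], images=frames + [cond_image],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize projected embeddings so dot products are cosine similarities.
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    frame_emb, cond_emb = img_emb[:-1], img_emb[-1:]
    clip_t = (frame_emb @ txt_emb.T).mean().item()   # CLIP-T
    clip_i = (frame_emb @ cond_emb.T).mean().item()  # CLIP-I
    return clip_t, clip_i

# CASS: gain in image alignment minus gain in text alignment.
# orig_frames / fused_frames are lists of PIL frames from the two videos.
# t_orig, i_orig = clip_scores(orig_frames, prompt, cond_image)
# t_fused, i_fused = clip_scores(fused_frames, prompt, cond_image)
# cass = (i_fused - i_orig) - (t_fused - t_orig)
```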
relCASS is a normalized form of CASS that removes inherent task difficulty and bias, yielding a more balanced evaluation.
$$\text{relCLIP-I} = \frac{\text{CLIP-I}_{\text{fused}} - \text{CLIP-I}_{\text{orig}}}{\text{CLIP-I}_{\text{orig}}}, \quad \text{relCLIP-T} = \frac{\text{CLIP-T}_{\text{fused}} - \text{CLIP-T}_{\text{orig}}}{\text{CLIP-T}_{\text{orig}}}$$
$$\text{relCASS} = \text{relCLIP-I} - \text{relCLIP-T}$$
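Continuing the hypothetical sketch above, relCASS follows directly by normalizing each shift by its original score before differencing, so tasks with different baseline similarities are comparable.

```python
def rel_cass(t_orig, i_orig, t_fused, i_fused):
    """Relative CASS: normalize each CLIP shift by its original score."""
    rel_clip_i = (i_fused - i_orig) / i_orig
    rel_clip_t = (t_fused - t_orig) / t_orig
    return rel_clip_i - rel_clip_t
```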
We collected 100 prompts based on the super-categories from Freeblend and extended them with DAVIS-16 categories to enrich their semantic context.
@misc{zhang2025motionawareconceptalignmentconsistent,
title={Motion-Aware Concept Alignment for Consistent Video Editing},
author={Tong Zhang and Juan C Leon Alcazar and Bernard Ghanem},
year={2025},
eprint={2506.01004},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.01004},
}