Methods for image-to-video generation have achieved impressive, photo-realistic quality. However, adjusting specific elements in generated videos, such as object motion or camera movement, often requires tedious trial and error, e.g., re-generating videos with different random seeds. Recent techniques address this issue by fine-tuning a pre-trained model to follow conditioning signals such as bounding boxes or point trajectories. Yet this fine-tuning procedure can be computationally expensive, and it requires datasets with annotated object motion, which can be difficult to procure. In this work, we introduce SG-I2V, a framework for controllable image-to-video generation that is self-guided: it offers zero-shot control by relying solely on the knowledge present in a pre-trained image-to-video diffusion model, without fine-tuning or external knowledge. Our zero-shot method outperforms unsupervised baselines and is competitive with supervised models in visual quality and motion fidelity.
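To make the self-guidance idea concrete, below is a toy sketch of the underlying principle: optimize the latent so that features along the user-specified trajectory stay aligned with the features at the source location. This is a minimal illustration only, not the actual SG-I2V implementation; the identity "feature extractor" (using the latent itself as the feature map) and the function name `align_features` are placeholders standing in for the semantically aligned diffusion feature maps the paper operates on.

```python
import numpy as np

def align_features(latent, src, tgt, steps=100, lr=0.1):
    """Toy self-guidance step: nudge the latent so that the feature
    vector at the target trajectory location matches the one at the
    source location. Here the 'feature map' is the latent itself;
    SG-I2V instead aligns feature maps extracted from a pre-trained
    image-to-video diffusion model."""
    z = latent.copy()
    for _ in range(steps):
        # Gradient of the L2 alignment loss ||z[tgt] - z[src]||^2
        # with respect to z[tgt]; descend only on the target features.
        grad = 2.0 * (z[tgt] - z[src])
        z[tgt] -= lr * grad
    return z

rng = np.random.default_rng(0)
latent = rng.normal(size=(8, 8, 4))      # toy (H, W, C) latent
out = align_features(latent, src=(1, 1), tgt=(5, 5))
```

After optimization, the target location's features have converged onto the source's, which is the mechanism by which a trajectory constraint can steer generation without any fine-tuning.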
Figure: qualitative comparison, given an input trajectory — our generated video vs. DragNUWA vs. DragAnything.
Figure: feature-map ablation, given an input trajectory — modified self-attention output (proposed) vs. unmodified self-attention output vs. upsample block output.
Figure: effect of post-processing, given an input trajectory — with post-processing (proposed) vs. without.
Figure: ablation over which blocks are used, given an input trajectory — bottom 2 vs. mid 2 vs. top 1.
Figure: denoising-timestep ablation, given an input trajectory — timesteps 50, 40, 30, 20, and 10.
@article{namekata2024sgi2v,
author = {Namekata, Koichi and Bahmani, Sherwin and Wu, Ziyi and Kant, Yash and Gilitschenski, Igor and Lindell, David B.},
title = {SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation},
journal = {arXiv preprint arXiv:2411.04989},
year = {2024},
}