Video Generation with Mochi#
This tutorial demonstrates how to run the Mochi 1 text-to-video generation model by Genmo on Union.
Overview#
Mochi 1 is an open-source 10-billion-parameter diffusion model built on the Asymmetric Diffusion Transformer (AsymmDiT) architecture. The model can run on both single- and multi-GPU setups. An H100 GPU is recommended, but quantized variants supported by Hugging Face Diffusers can run with as little as 22 GB of VRAM.
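If you first want to experiment on a smaller GPU outside of Union, a quantized pipeline is one option. The snippet below is a minimal sketch using the Diffusers quantization API; it assumes a Diffusers release with bitsandbytes support and the bitsandbytes package installed, and it is illustrative only, not part of the workflow built in this tutorial:
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers import MochiPipeline, MochiTransformer3DModel
from transformers import BitsAndBytesConfig, T5EncoderModel

# Quantize the T5 text encoder and the video transformer to 8-bit to reduce VRAM usage.
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "genmo/mochi-1-preview",
    subfolder="text_encoder",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
transformer_8bit = MochiTransformer3DModel.from_pretrained(
    "genmo/mochi-1-preview",
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(load_in_8bit=True),
)

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keep only the active component on the GPU
frames = pipe("A calico kitten chasing a butterfly in a garden.", num_frames=19).frames[0]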
Run on Union Serverless
Once you have a Union account, install union:
pip install union
Then run the following commands to execute the workflow:
git clone https://github.com/unionai/unionai-examples
cd mochi_video_generation
union run --remote mochi_video_generation.py genmo_video_generation_with_actor
The source code for this tutorial can be found here.
Let’s begin by importing the necessary dependencies:
from dataclasses import dataclass
from pathlib import Path
import flytekit as fl
from dataclasses_json import dataclass_json
from flytekit import FlyteContextManager
from flytekit.extras.accelerators import A100
from flytekit.types.directory import FlyteDirectory
from flytekit.types.file import FlyteFile
from union.actor import ActorEnvironment
We also define a dataclass that holds the prompt and the parameters used when generating each video.
@dataclass_json
@dataclass
class VideoGen:
    prompt: str
    negative_prompt: str = ""
    num_frames: int = 19
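For instance, a parameter set for a short clip might look like this (the prompt is just an illustrative placeholder):
# Hypothetical example: a short clip that uses the default 19 frames.
params = VideoGen(
    prompt="Slow-motion ocean waves breaking on a rocky shore at golden hour.",
    negative_prompt="blurry, low quality",
)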
Defining image specifications#
Here, we define two image specifications for the workflow:
The first image installs CUDA and is used for video generation. We use a pre-release version of Diffusers, since Mochi support is only available in that version.
The second image is used to download the model and run the dynamic workflow that processes the prompts.
image = fl.ImageSpec(
    name="genmo",
    packages=[
        "torch==2.5.1",
        "git+https://github.com/huggingface/diffusers.git@805aa93789fe9c95dd8d5a3ceac100d33f584ec7",
        "git+https://github.com/flyteorg/flytekit.git@650efe4425c799eaf66384575cc0e67521e9a851",  # PR: https://github.com/flyteorg/flytekit/pull/2931
        "transformers==4.46.3",
        "accelerate==1.1.1",
        "sentencepiece==0.2.0",
        "opencv-python==4.10.0.84",
    ],
    conda_channels=["nvidia"],
    conda_packages=[
        "cuda=12.1.0",
        "cuda-nvcc",
        "cuda-version=12.1.0",
        "cuda-command-line-tools=12.1.0",
    ],
    apt_packages=["git", "libglib2.0-0", "libsm6", "libxrender1", "libxext6"],
)
image_with_no_cuda = fl.ImageSpec(
    name="genmo-no-cuda",
    packages=[
        "huggingface-hub==0.26.2",
        "git+https://github.com/flyteorg/flytekit.git@650efe4425c799eaf66384575cc0e67521e9a851",  # PR: https://github.com/flyteorg/flytekit/pull/2931
        "diffusers==0.31.0",
    ],
    apt_packages=["git"],
)
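If your cluster pulls images from your own container registry, ImageSpec also accepts a registry argument. A hypothetical variant (with a placeholder registry URL and a trimmed package list) might look like this:
# Hypothetical variant: push the built image to your own registry.
# "ghcr.io/your-org" is a placeholder; reuse the full package list from above.
image = fl.ImageSpec(
    name="genmo",
    registry="ghcr.io/your-org",
    packages=["torch==2.5.1"],
)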
Defining an actor environment#
The actor environment is used to retain the downloaded model across all actor executions. We set the accelerator to A100 and the replica count to 1 to avoid downloading the model multiple times.
actor = ActorEnvironment(
    name="genmo-video-generation",
    replica_count=1,
    ttl_seconds=900,
    requests=fl.Resources(gpu="1", mem="100Gi"),
    container_image=image,
    accelerator=A100,
)
Downloading the model#
The download step ensures that the model is cached and doesn’t need to be re-downloaded from the Hugging Face Hub every time the workflow runs.
@fl.task(
    cache=True,
    cache_version="0.1",
    requests=fl.Resources(cpu="5", mem="45Gi"),
    container_image=image_with_no_cuda,
)
def download_model(repo_id: str) -> FlyteDirectory:
    from huggingface_hub import snapshot_download

    ctx = fl.current_context()
    working_dir = Path(ctx.working_directory)
    cached_model_dir = working_dir / "cached_model"

    snapshot_download(repo_id=repo_id, local_dir=cached_model_dir)
    return FlyteDirectory(path=cached_model_dir)
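Because the task is cached, re-running the workflow with the same repo_id skips the download entirely; bump cache_version if you need to force a fresh snapshot. As a quick sanity check, you can also call the task as a plain Python function, which runs it locally. This is a hypothetical smoke test, not part of the workflow:
# Hypothetical local smoke test: Flyte tasks can be invoked like regular functions,
# which downloads the snapshot into a local working directory.
if __name__ == "__main__":
    model_dir = download_model(repo_id="genmo/mochi-1-preview")
    print(f"Model snapshot stored at: {model_dir.path}")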
Defining an actor task#
We define an actor task to generate a video using the Mochi 1 model. The model is downloaded once to a hard-coded path and reused for every prompt. In the future, we plan to make it possible to skip re-initializing the model and loading it onto the GPU on every execution.
enable_model_cpu_offload offloads the model to the CPU using Accelerate, reducing memory usage with minimal impact on performance. enable_vae_tiling saves a large amount of memory and allows processing larger images.
@actor.task
def genmo_video_generation(model_dir: FlyteDirectory, param_set: VideoGen) -> FlyteFile:
    import torch
    from diffusers import MochiPipeline
    from diffusers.utils import export_to_video

    # Copy the model to a fixed local path only if it isn't already present on this replica.
    local_path = Path("/tmp/genmo_mochi_model")
    if not local_path.exists():
        print("Model doesn't exist")
        ctx = FlyteContextManager.current_context()
        ctx.file_access.get_data(
            remote_path=model_dir.remote_source,
            local_path=local_path,
            is_multipart=True,
        )

    pipe = MochiPipeline.from_pretrained(
        local_path, variant="bf16", torch_dtype=torch.bfloat16
    )

    # Reduce GPU memory pressure.
    pipe.enable_model_cpu_offload()
    pipe.enable_vae_tiling()

    frames = pipe(
        param_set.prompt,
        negative_prompt=param_set.negative_prompt,
        num_frames=param_set.num_frames,
    ).frames[0]

    ctx = fl.current_context()
    working_dir = Path(ctx.working_directory)
    video_file = working_dir / "video.mp4"
    export_to_video(frames, video_file, fps=30)

    return FlyteFile(path=video_file)
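Since an actor replica is a long-lived process, one way to approximate the planned optimization mentioned above is to cache the pipeline in a module-level variable so that subsequent executions on the same replica skip re-initialization. The helper below is a hypothetical sketch, not part of the tutorial’s code:
from pathlib import Path

import torch
from diffusers import MochiPipeline

_PIPE = None  # module state survives across task executions on the same actor replica


def get_pipeline(local_path: Path) -> MochiPipeline:
    """Load the Mochi pipeline once per replica and reuse it afterwards."""
    global _PIPE
    if _PIPE is None:
        _PIPE = MochiPipeline.from_pretrained(
            local_path, variant="bf16", torch_dtype=torch.bfloat16
        )
        _PIPE.enable_model_cpu_offload()
        _PIPE.enable_vae_tiling()
    return _PIPE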
Defining a dynamic workflow#
We define a dynamic workflow to loop through the prompts and parameters. It invokes the actor task once per parameter set to generate the videos.
@fl.dynamic(container_image=image_with_no_cuda)
def generate_videos(
    model_dir: FlyteDirectory, video_gen_params: list[VideoGen]
) -> list[FlyteFile]:
    videos = []
    for param_set in video_gen_params:
        videos.append(genmo_video_generation(model_dir=model_dir, param_set=param_set))
    return videos
Defining a workflow#
With all tasks in place, we define a workflow to generate videos. We initialize VideoGen objects to specify the prompt, the number of frames, and an optional negative prompt.
@fl.workflow
def genmo_video_generation_with_actor(
    repo_id: str = "genmo/mochi-1-preview",
    video_gen_params: list[VideoGen] = [
        VideoGen(
            prompt="A hand with delicate fingers picks up a bright yellow lemon from a wooden bowl filled with lemons and sprigs of mint against a peach-colored background. The hand gently tosses the lemon up and catches it, showcasing its smooth texture. A beige string bag sits beside the bowl, adding a rustic touch to the scene. Additional lemons, one halved, are scattered around the base of the bowl. The even lighting enhances the vibrant colors and creates a fresh, inviting atmosphere.",
        ),
        VideoGen(
            prompt="Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k.",
            num_frames=84,
        ),
    ],
) -> list[FlyteFile]:
    model_dir = download_model(repo_id=repo_id)
    return generate_videos(model_dir=model_dir, video_gen_params=video_gen_params)
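To launch the workflow with your own prompts instead of the defaults, one option is flytekit’s FlyteRemote. The sketch below assumes your local config already points at your Union tenant; the project and domain names are placeholders:
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

# Hypothetical remote launch; "flytesnacks"/"development" are placeholder project/domain names.
remote = FlyteRemote(
    Config.auto(), default_project="flytesnacks", default_domain="development"
)
execution = remote.execute(
    genmo_video_generation_with_actor,
    inputs={
        "video_gen_params": [
            VideoGen(
                prompt="A paper boat drifting down a rain-soaked city street at night.",
                num_frames=31,
            )
        ]
    },
)
print(remote.generate_console_url(execution))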