A model for generating videos from text, where the prompt can change over the course of the video and the generated videos can be several minutes long.