Research

LLM-grounded video diffusion models

Adam Yala, a researcher affiliated with the MIT Abdul Latif Jameel Clinic for Machine Learning in Health (MIT Jameel Clinic), co-authors research that uses large language model (LLM)-grounded video diffusion (LVD) to improve how neural video generation handles intricate spatiotemporal prompts.

From the research team's abstract: 'Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion (e.g., even lacking the ability to be prompted for objects moving from left to right). To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language model (LLM) to generate dynamic scene layouts based on the text inputs and subsequently uses the generated layouts to guide a diffusion model for video generation. We show that LLMs are able to understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. We then propose to guide video diffusion models with these layouts by adjusting the attention maps. Our approach is training-free and can be integrated into any video diffusion model that admits classifier guidance. Our results demonstrate that LVD significantly outperforms its base video diffusion model and several strong baseline methods in faithfully generating videos with the desired attributes and motion patterns.'
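
To make the two stages described in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical layout format (one normalised bounding box per frame) and uses simple linear interpolation as a stand-in for the LLM-generated layout. The `layout_energy` function is likewise an illustrative assumption: it scores how much of an attention map falls outside the target box, which is the kind of quantity a guidance step could try to reduce. The names `Box`, `interpolate_layout` and `layout_energy` do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalised [0, 1] image coordinates (hypothetical format)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def interpolate_layout(start: Box, end: Box, num_frames: int) -> list[Box]:
    """Stand-in for the LLM step: one box per frame, moved linearly from start to end."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    boxes = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 on the first frame, 1.0 on the last
        boxes.append(Box(
            start.x + t * (end.x - start.x),
            start.y + t * (end.y - start.y),
            start.w + t * (end.w - start.w),
            start.h + t * (end.h - start.h),
        ))
    return boxes


def layout_energy(attn: list[list[float]], box: Box) -> float:
    """Fraction of attention mass outside the box (0.0 means fully inside).

    Illustrative assumption only; the paper's actual guidance objective may differ.
    """
    rows, cols = len(attn), len(attn[0])
    total = 0.0
    inside = 0.0
    for r in range(rows):
        for c in range(cols):
            # Use the cell centre to decide whether the cell lies in the box.
            cx, cy = (c + 0.5) / cols, (r + 0.5) / rows
            total += attn[r][c]
            if box.x <= cx < box.x + box.w and box.y <= cy < box.y + box.h:
                inside += attn[r][c]
    return 1.0 - inside / total if total > 0 else 1.0


# An object that should move from the left half to the right half over 3 frames.
layout = interpolate_layout(Box(0.0, 0.0, 0.5, 1.0), Box(0.5, 0.0, 0.5, 1.0), 3)

# A toy 4x4 attention map that stays on the left half in every frame.
attn_left = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]

# The energy grows as the target box moves away from where attention sits.
print([layout_energy(attn_left, box) for box in layout])
```

In this toy setting the energy is lowest on the first frame, where attention and the target box coincide, and highest on the last frame. In a training-free guidance scheme of the kind the abstract describes, a score like this would be reduced by adjusting the attention maps during sampling rather than by retraining the model.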

Details

author(s)

Adam Yala

publication date

2 October 2023

source

arXiv

related programme

MIT Jameel Clinic

Generative AI in the era of 'alternative facts'

27 March 2024

MIT Open Publishing Services

External data and AI are making each other more valuable

26 February 2024

Harvard Business Review Press

Rethinking patch dependence for masked autoencoders

25 January 2024

arXiv

Removing biases from molecular representations via information maximisation

1 December 2023

arXiv

Effective human-AI teams via learned natural language rules and onboarding

7 November 2023

arXiv

A deep dive into single-cell RNA sequencing foundation models

23 October 2023

bioRxiv

Antibiotic identified by AI

11 October 2023

Nature