TUNA: Taming Unified Visual Representations for
Native Unified Multimodal Models

1Meta BizAI · 2HKU · 3University of Waterloo · 4KAUST
Joint first authors (listed alphabetically by last name) · Core contributors · *Joint project lead

Introducing TUNA, a family of native unified multimodal models

  • TUNA leverages unified visual representations to enable image/video understanding, image/video generation, and image editing within a single framework.
  • Our extensive experiments show that TUNA's unified visual representation is highly effective, achieving state-of-the-art performance across multiple multimodal understanding and generation tasks.
  • Our comprehensive ablation studies show that our unified visual representation design outperforms both existing unified-representation methods and models that employ decoupled representations.

Text-to-Image Generation

Image Editing

Image and Video Understanding

Text-to-Video Generation




Citation

If you find our work helpful, please cite our paper: