OmniGen2: Exploration to Advanced Multimodal Generation

Teaser Image

Overview

OmniGen2 is a unified multimodal generation model that combines strong visual understanding, text-to-image synthesis, instruction-based image editing, and subject-driven in-context generation within a single framework. Built on a decoupled architecture, it preserves high-quality language modeling while enabling fine-grained and consistent visual outputs. Beyond generation, OmniGen2 incorporates a multimodal reflection mechanism that allows it to analyze, critique, and iteratively refine its outputs—bringing reasoning and self-correction into the image generation process. With competitive performance across both generation and understanding tasks, it sets a new benchmark among lightweight open-source models.

Model Architecture

OmniGen2 adopts a dual-path architecture with separate autoregressive and diffusion Transformers for text and image generation, respectively. It leverages a decoupled design where a ViT encoder feeds visual information into the multimodal large language model (MLLM) for understanding tasks, while a VAE encoder supplies fine-grained visual features exclusively to the diffusion decoder. This separation preserves the strong language modeling capabilities of the MLLM while enabling high-fidelity and consistent image generation, making the architecture both efficient and versatile across tasks like text-to-image synthesis, image editing, and in-context generation.

Figure 1: Architecture of OmniGen2.
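
The decoupled routing can be summarized in a few lines. The following PyTorch-style sketch is purely illustrative: the module names and call signatures (`vit_encoder`, `mllm`, `vae_encoder`, `diffusion_decoder`) are placeholders rather than the repository's actual classes.

```python
import torch.nn as nn

class DecoupledOmniGen2(nn.Module):
    """Illustrative sketch of the dual-path design (not the real implementation)."""

    def __init__(self, vit_encoder, mllm, vae_encoder, diffusion_decoder):
        super().__init__()
        self.vit_encoder = vit_encoder              # semantic features for understanding
        self.mllm = mllm                            # autoregressive Transformer (text path)
        self.vae_encoder = vae_encoder              # fine-grained latents, generation only
        self.diffusion_decoder = diffusion_decoder  # diffusion Transformer (image path)

    def understand(self, text_tokens, images):
        # Understanding path: ViT features are fed to the MLLM together with text.
        vit_feats = self.vit_encoder(images)
        return self.mllm(text_tokens, vision_features=vit_feats)

    def generate(self, text_tokens, ref_images=None):
        # Generation path: MLLM hidden states condition the diffusion decoder,
        # while VAE latents of any reference images bypass the MLLM entirely.
        cond = self.mllm(text_tokens, output_hidden_states=True)
        ref_latents = self.vae_encoder(ref_images) if ref_images is not None else None
        return self.diffusion_decoder(condition=cond, ref_latents=ref_latents)
```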

Multimodal Rotary Position Embedding: We introduce a novel Omni-RoPE specifically designed to meet the demands of our diverse and complex tasks, particularly image editing and in-context generation, as illustrated in Figure 2.

Figure 2: An illustration of our proposed Omni-RoPE.

It decomposes position information into three components:

  1. A Sequence and Modality Identifier \(id_{seq}\) that is constant for all tokens within a single image (treating it as a semantic unit) but unique across different images.
  2. A 2D Spatial Height Coordinate \(h\) that represents the normalized vertical position for image tokens.
  3. A 2D Spatial Width Coordinate \(w\) that represents the normalized horizontal position for image tokens. For all non-image tokens, both spatial coordinates \(h,w\) are set to zero.

This dual mechanism enables the model to unambiguously distinguish different images via their unique \(id_{seq}\), while the shared local spatial coordinates enhance consistency for tasks like image editing, where corresponding positions in the source and target image carry identical \((h, w)\) values.
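
To make the scheme concrete, the sketch below assigns the three Omni-RoPE components to a mixed text-and-image token sequence. It is a simplified, hypothetical helper: how text tokens advance the sequence index and how spatial coordinates are normalized are assumptions here, not the exact rule used in the model.

```python
def omni_rope_ids(segments):
    """Assign (id_seq, h, w) to each token.

    `segments` is a list of ("text", num_tokens) or ("image", height, width)
    entries, a simplified stand-in for the model's real tokenized input.
    """
    positions = []
    id_seq = 0
    for seg in segments:
        if seg[0] == "text":
            _, n = seg
            for _ in range(n):
                positions.append((id_seq, 0.0, 0.0))  # text tokens: (h, w) are zero
                id_seq += 1                           # assumption: text advances per token
        else:
            _, height, width = seg
            for i in range(height):
                for j in range(width):
                    # All tokens of one image share the same id_seq,
                    # but keep their own normalized (h, w) coordinates.
                    positions.append((id_seq, i / height, j / width))
            id_seq += 1  # the whole image counts as one semantic unit
    return positions

# Example: a 5-token prompt, a source image, and an edited target image.
# The two images get different id_seq values but identical local (h, w) grids.
ids = omni_rope_ids([("text", 5), ("image", 2, 2), ("image", 2, 2)])
```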

Model Capabilities

🧠 Visual Understanding

OmniGen2 leverages a powerful multimodal large language model (MLLM) to perform robust visual understanding across diverse image types. By using a ViT encoder for image representation and keeping the MLLM largely frozen, it achieves strong performance on standard benchmarks while preserving semantic alignment, object recognition, and reasoning capabilities across text and vision inputs.

🎨 Text-to-Image Generation

OmniGen2 supports high-quality text-to-image generation with strong compositional reasoning and long prompt following. By conditioning a diffusion-based image decoder on hidden states from the language model and fine-grained visual features from a VAE, it generates faithful and coherent images that align closely with complex natural language descriptions.

Figure 3: Text-to-image generation results.
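
Conceptually, generation is a conditional denoising loop driven by the MLLM's hidden states. The sketch below shows that control flow only; the actual sampler, noise schedule, guidance setup, and latent shapes in OmniGen2 may differ, and the `mllm`, `diffusion_decoder`, and `vae` callables are placeholders.

```python
import torch

@torch.no_grad()
def sample_image(mllm, diffusion_decoder, vae, prompt_tokens, steps=50):
    """Generic conditional denoising loop (illustrative only)."""
    cond = mllm(prompt_tokens, output_hidden_states=True)  # text-side conditioning
    latents = torch.randn(1, 4, 64, 64)                    # assumed latent shape
    for t in reversed(range(steps)):
        timestep = torch.full((1,), t / steps)
        pred = diffusion_decoder(latents, timestep, condition=cond)
        latents = latents - pred / steps                    # simple Euler-style update
    return vae.decode(latents)                              # decode latents to pixels
```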

✏️ Instruction-Based Image Editing

The model enables precise and localized image editing based on natural language instructions. With dedicated editing datasets and a dual-path architecture, OmniGen2 can make fine-grained modifications—such as object manipulation, style changes, or motion edits—while preserving unedited regions and maintaining visual realism and consistency.

Figure 4: Instruction-based image editing results.
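
A typical edit call might look like the snippet below. The pipeline class, import path, checkpoint id, and argument names are hypothetical stand-ins for whatever inference entry point the repository actually exposes; consult the repo's usage instructions for the real API.

```python
from PIL import Image
from omnigen2_pipeline import OmniGen2Pipeline  # placeholder import, not the real module path

pipe = OmniGen2Pipeline.from_pretrained("OmniGen2/OmniGen2")  # placeholder checkpoint id
source = Image.open("living_room.png")

edited = pipe(
    prompt="Replace the red sofa with a blue one; keep everything else unchanged.",
    input_images=[source],
    num_inference_steps=50,
)
edited.save("living_room_edited.png")
```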

🧍 In-Context Generation (Subject-Driven)

OmniGen2 excels at subject-driven generation, where it extracts subjects from reference images and re-renders them in new scenes as guided by text prompts. Through a specially designed training pipeline based on video data, the model demonstrates superior subject consistency and contextual integration, outperforming prior open-source models in this emerging domain.

Figure 5: In-context generation results.
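
In code, the difference from editing is simply that several reference images are passed alongside a prompt that composes their subjects into a new scene. As above, the pipeline interface shown here is hypothetical.

```python
from PIL import Image
from omnigen2_pipeline import OmniGen2Pipeline  # placeholder import, not the real module path

pipe = OmniGen2Pipeline.from_pretrained("OmniGen2/OmniGen2")  # placeholder checkpoint id
person = Image.open("person.png")
dog = Image.open("dog.png")

# Subjects from the reference images are re-rendered together in a new scene.
result = pipe(
    prompt="The person from image 1 walks the dog from image 2 along a beach at sunset.",
    input_images=[person, dog],
)
result.save("beach_walk.png")
```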

🔁 Multimodal Reflection

A distinctive feature of OmniGen2 is its built-in reflection mechanism, allowing it to evaluate its own outputs, identify shortcomings, and generate improved results through iterative refinement. This capability, powered by a combination of image-text analysis and self-correction training, brings a form of multimodal reasoning to generation—enhancing controllability, reliability, and output quality over time.

Figure 6: Multimodal reflection results.
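
The reflection mechanism can be pictured as a generate, critique, regenerate loop. The sketch below captures only that control flow; the `generate` and `critique` methods and the feedback format are hypothetical, not the repository's actual inference code.

```python
def generate_with_reflection(model, prompt, max_rounds=3):
    """Schematic self-correction loop (illustrative only)."""
    image = model.generate(prompt)                    # first attempt
    for _ in range(max_rounds):
        feedback = model.critique(prompt, image)      # the MLLM inspects its own output
        if feedback["satisfactory"]:
            break
        # Regenerate, conditioning on the prompt, the critique, and the previous image.
        image = model.generate(prompt, feedback=feedback["issues"], prev_image=image)
    return image
```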

❤️ Citation

If you find this repository or our work useful, please consider giving it a star ⭐ and a citation 🦖; it would be greatly appreciated. The OmniGen2 report will be released as soon as possible; in the meantime, you can cite the original OmniGen paper:


@misc{omnigen2024,
    title={OmniGen: Unified Image Generation},
    author={Shitao Xiao and Yueze Wang and Junjie Zhou and Huaying Yuan and Xingrun Xing and Ruiran Yan and Shuting Wang and Tiejun Huang and Zheng Liu},
    year={2024},
    eprint={2409.11340},
    archivePrefix={arXiv}
}