OmniGen2: Exploration to Advanced Multimodal Generation

Teaser Image

Overview

OmniGen2 is a unified multimodal generation model that combines strong visual understanding, text-to-image synthesis, instruction-based image editing, and subject-driven in-context generation within a single framework. Built on a decoupled architecture, it preserves high-quality language modeling while enabling fine-grained and consistent visual outputs. Beyond generation, OmniGen2 incorporates a multimodal reflection mechanism that allows it to analyze, critique, and iteratively refine its outputs—bringing reasoning and self-correction into the image generation process. With competitive performance across both generation and understanding tasks, it sets a new benchmark among lightweight open-source models.

Model Architecture

OmniGen2 adopts a dual-path architecture with separate autoregressive and diffusion Transformers for text and image generation, respectively. It leverages a decoupled design where a ViT encoder feeds visual information into the multimodal large language model (MLLM) for understanding tasks, while a VAE encoder supplies fine-grained visual features exclusively to the diffusion decoder. This separation preserves the strong language modeling capabilities of the MLLM while enabling high-fidelity and consistent image generation, making the architecture both efficient and versatile across tasks like text-to-image synthesis, image editing, and in-context generation.

Figure 1: Architecture of OmniGen2.
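
The decoupled routing can be summarized in a few lines. The following PyTorch-style sketch is purely illustrative: the module names and call signatures (`vit_encoder`, `mllm`, `vae_encoder`, `diffusion_decoder`) are placeholders rather than the repository's actual classes.

```python
import torch.nn as nn

class DecoupledOmniGen2(nn.Module):
    """Illustrative sketch of the dual-path design (not the real implementation)."""

    def __init__(self, vit_encoder, mllm, vae_encoder, diffusion_decoder):
        super().__init__()
        self.vit_encoder = vit_encoder              # semantic features for understanding
        self.mllm = mllm                            # autoregressive Transformer (text path)
        self.vae_encoder = vae_encoder              # fine-grained latents, generation only
        self.diffusion_decoder = diffusion_decoder  # diffusion Transformer (image path)

    def understand(self, text_tokens, images):
        # Understanding path: ViT features are fed to the MLLM together with text.
        vit_feats = self.vit_encoder(images)
        return self.mllm(text_tokens, vision_features=vit_feats)

    def generate(self, text_tokens, ref_images=None):
        # Generation path: MLLM hidden states condition the diffusion decoder,
        # while VAE latents of any reference images bypass the MLLM entirely.
        cond = self.mllm(text_tokens, output_hidden_states=True)
        ref_latents = self.vae_encoder(ref_images) if ref_images is not None else None
        return self.diffusion_decoder(condition=cond, ref_latents=ref_latents)
```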

Multimodal Rotary Position Embedding: We introduce a novel Omni-RoPE specifically designed to meet the demands of our diverse and complex tasks, particularly image editing and in-context generation, as illustrated in Figure 2.

Figure 2: An illustration of our proposed Omni-RoPE.

It decomposes position information into three components:

  1. A Sequence and Modality Identifier \(id_{seq}\) that is constant for all tokens within a single image (treating it as a semantic unit) but unique across different images.
  2. A 2D Spatial Height Coordinate \(h\) that represents the normalized vertical position for image tokens.
  3. A 2D Spatial Width Coordinate \(w\) that represents the normalized horizontal position for image tokens. For all non-image tokens, both spatial coordinates \(h,w\) are set to zero.

This dual mechanism enables the model to unambiguously distinguish different images via their unique \(id_{seq}\), while the shared local spatial coordinates enhance consistency for tasks like image editing, where corresponding positions in the source and target image carry identical \((h, w)\) values.
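
To make the scheme concrete, the sketch below assigns the three Omni-RoPE components to a mixed text-and-image token sequence. It is a simplified, hypothetical helper: how text tokens advance the sequence index and how spatial coordinates are normalized are assumptions here, not the exact rule used in the model.

```python
def omni_rope_ids(segments):
    """Assign (id_seq, h, w) to each token.

    `segments` is a list of ("text", num_tokens) or ("image", height, width)
    entries, a simplified stand-in for the model's real tokenized input.
    """
    positions = []
    id_seq = 0
    for seg in segments:
        if seg[0] == "text":
            _, n = seg
            for _ in range(n):
                positions.append((id_seq, 0.0, 0.0))  # text tokens: (h, w) are zero
                id_seq += 1                           # assumption: text advances per token
        else:
            _, height, width = seg
            for i in range(height):
                for j in range(width):
                    # All tokens of one image share the same id_seq,
                    # but keep their own normalized (h, w) coordinates.
                    positions.append((id_seq, i / height, j / width))
            id_seq += 1  # the whole image counts as one semantic unit
    return positions

# Example: a 5-token prompt, a source image, and an edited target image.
# The two images get different id_seq values but identical local (h, w) grids.
ids = omni_rope_ids([("text", 5), ("image", 2, 2), ("image", 2, 2)])
```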

Model Capabilities

🧠 Visual Understanding

OmniGen2 leverages a powerful multimodal large language model (MLLM) to perform robust visual understanding across diverse image types. By using a ViT encoder for image representation and keeping the MLLM largely frozen, it achieves strong performance on standard benchmarks while preserving semantic alignment, object recognition, and reasoning capabilities across text and vision inputs.

🎨 Text-to-Image Generation

OmniGen2 supports high-quality text-to-image generation with strong compositional reasoning and long prompt following. By conditioning a diffusion-based image decoder on hidden states from the language model and fine-grained visual features from a VAE, it generates faithful and coherent images that align closely with complex natural language descriptions.

Figure 3: Text-to-image generation results.
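
Conceptually, generation is a conditional denoising loop driven by the MLLM's hidden states. The sketch below shows that control flow only; the actual sampler, noise schedule, guidance setup, and latent shapes in OmniGen2 may differ, and the `mllm`, `diffusion_decoder`, and `vae` callables are placeholders.

```python
import torch

@torch.no_grad()
def sample_image(mllm, diffusion_decoder, vae, prompt_tokens, steps=50):
    """Generic conditional denoising loop (illustrative only)."""
    cond = mllm(prompt_tokens, output_hidden_states=True)  # text-side conditioning
    latents = torch.randn(1, 4, 64, 64)                    # assumed latent shape
    for t in reversed(range(steps)):
        timestep = torch.full((1,), t / steps)
        pred = diffusion_decoder(latents, timestep, condition=cond)
        latents = latents - pred / steps                    # simple Euler-style update
    return vae.decode(latents)                              # decode latents to pixels
```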

✏️ Instruction-Based Image Editing

The model enables precise and localized image editing based on natural language instructions. With dedicated editing datasets and a dual-path architecture, OmniGen2 can make fine-grained modifications—such as object manipulation, style changes, or motion edits—while preserving unedited regions and maintaining visual realism and consistency.

Figure 4: Instruction-based image editing results.
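
A typical edit call might look like the snippet below. The pipeline class, import path, checkpoint id, and argument names are hypothetical stand-ins for whatever inference entry point the repository actually exposes; consult the repo's usage instructions for the real API.

```python
from PIL import Image
from omnigen2_pipeline import OmniGen2Pipeline  # placeholder import, not the real module path

pipe = OmniGen2Pipeline.from_pretrained("OmniGen2/OmniGen2")  # placeholder checkpoint id
source = Image.open("living_room.png")

edited = pipe(
    prompt="Replace the red sofa with a blue one; keep everything else unchanged.",
    input_images=[source],
    num_inference_steps=50,
)
edited.save("living_room_edited.png")
```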

🧍 In-Context Generation (Subject-Driven)

OmniGen2 excels at subject-driven generation, where it extracts subjects from reference images and re-renders them in new scenes as guided by text prompts. Through a specially designed training pipeline based on video data, the model demonstrates superior subject consistency and contextual integration, outperforming prior open-source models in this emerging domain.

Figure 5: In-context generation results.
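
In code, the difference from editing is simply that several reference images are passed alongside a prompt that composes their subjects into a new scene. As above, the pipeline interface shown here is hypothetical.

```python
from PIL import Image
from omnigen2_pipeline import OmniGen2Pipeline  # placeholder import, not the real module path

pipe = OmniGen2Pipeline.from_pretrained("OmniGen2/OmniGen2")  # placeholder checkpoint id
person = Image.open("person.png")
dog = Image.open("dog.png")

# Subjects from the reference images are re-rendered together in a new scene.
result = pipe(
    prompt="The person from image 1 walks the dog from image 2 along a beach at sunset.",
    input_images=[person, dog],
)
result.save("beach_walk.png")
```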

🔁 Multimodal Reflection

A distinctive feature of OmniGen2 is its built-in reflection mechanism, allowing it to evaluate its own outputs, identify shortcomings, and generate improved results through iterative refinement. This capability, powered by a combination of image-text analysis and self-correction training, brings a form of multimodal reasoning to generation—enhancing controllability, reliability, and output quality over time.

Figure 6: Multimodal reflection results.
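
The reflection mechanism can be pictured as a generate, critique, regenerate loop. The sketch below captures only that control flow; the `generate` and `critique` methods and the feedback format are hypothetical, not the repository's actual inference code.

```python
def generate_with_reflection(model, prompt, max_rounds=3):
    """Schematic self-correction loop (illustrative only)."""
    image = model.generate(prompt)                    # first attempt
    for _ in range(max_rounds):
        feedback = model.critique(prompt, image)      # the MLLM inspects its own output
        if feedback["satisfactory"]:
            break
        # Regenerate, conditioning on the prompt, the critique, and the previous image.
        image = model.generate(prompt, feedback=feedback["issues"], prev_image=image)
    return image
```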

❤️ Citation

If you find this repository or our work useful, please consider giving it a star ⭐ and a citation 🦖; it would be greatly appreciated. The OmniGen2 report will be released as soon as possible; in the meantime, you can cite the original OmniGen paper:


@misc{omnigen2024,
    title={OmniGen: Unified Image Generation},
    author={Shitao Xiao and Yueze Wang and Junjie Zhou and Huaying Yuan and Xingrun Xing and Ruiran Yan and Shuting Wang and Tiejun Huang and Zheng Liu},
    year={2024},
    eprint={2409.11340},
    archivePrefix={arXiv}
}