OmniGen: Unified Image Generation

Shitao Xiao*, Yueze Wang*, Junjie Zhou*, Huaying Yuan*, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, Zheng Liu+

*Equal Contribution +Corresponding authors

🔥 OmniGen is the first diffusion model for unified image generation. It unifies various tasks into a single framework and simplifies the architecture.

Abstract

The emergence of Large Language Models (LLMs) has unified language generation tasks and revolutionized human-machine interaction. However, in the realm of image generation, a unified model capable of handling various tasks within a single framework remains largely unexplored. In this work, we introduce OmniGen, a new diffusion model for unified image generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires additional modules such as ControlNet or IP-Adapter to process diverse control conditions. OmniGen is characterized by the following features:

  • Unification. OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports various downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. Additionally, OmniGen can handle classic computer vision tasks, such as edge detection and human pose recognition, by transforming them into image generation tasks.
  • Simplicity. The architecture of OmniGen is highly simplified, eliminating the need for additional text encoders. Moreover, it is more user-friendly than existing diffusion models: complex tasks can be accomplished through instructions alone, without the cost of extra preprocessing steps (e.g., human pose estimation), thereby significantly simplifying the image generation workflow.
  • Knowledge Transfer. Benefiting from learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities.
We also explore the model's reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and several unresolved issues remain. We will open-source the related resources at https://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.
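To make the intended usage concrete, below is a minimal text-to-image sketch. It assumes the repository exposes an OmniGenPipeline class with a from_pretrained() loader and a callable interface; the checkpoint name and parameters here are assumptions, so consult the repository above for the authoritative API.

    # Minimal sketch, assuming the repository exposes an OmniGenPipeline with a
    # from_pretrained() loader; the checkpoint name below is an assumption.
    from OmniGen import OmniGenPipeline

    pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

    # Plain text-to-image generation: no extra text encoder or adapter modules.
    images = pipe(
        prompt="A curly-haired man in a red shirt is drinking tea.",
        height=1024,
        width=1024,
        guidance_scale=2.5,
    )
    images[0].save("output.png")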

    What can OmniGen do?

    Flexible and Controllable Generation

    Based on OmniGen's general capabilities, more flexible image generation pipelines can be composed. The following showcases a simple workflow, sketched in code below: generating an image from text, editing parts of the generated image, redrawing the scene according to the human pose in the generated image, and extracting a desired object from another image to integrate into the new one.
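    A hedged sketch of such a pipeline, reusing the assumed OmniGenPipeline interface from the quickstart above; the <img><|image_1|></img> placeholder syntax and the img_guidance_scale parameter are likewise assumptions about the repository's API.

        # Step 1: text-to-image.
        base = pipe(prompt="A woman walking a dog in a park.", height=1024, width=1024)[0]
        base.save("step1.png")

        # Step 2: edit part of the generated image via a plain-language instruction.
        edited = pipe(
            prompt="<img><|image_1|></img> Change the dog into a corgi.",
            input_images=["step1.png"],
            img_guidance_scale=1.6,  # assumed image-guidance parameter
        )[0]
        edited.save("step2.png")

        # Step 3: redraw following the human pose in the edited image; no separate
        # pose-estimation preprocessor is required.
        redrawn = pipe(
            prompt="Following the human pose of this image <img><|image_1|></img>, "
                   "generate a new photo: an astronaut walking on the moon.",
            input_images=["step2.png"],
        )[0]
        redrawn.save("step3.png")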

    Referring Expression Generation

    You can input multiple images and use simple, general language to refer to the objects within those images. OmniGen can automatically recognize the necessary objects in each image and generate new images based on them. No additional operations, such as image cropping or face detection, are required.
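    For instance, a hedged sketch under the same assumed interface (file names are hypothetical):

        # Refer to objects across multiple input images in plain language; OmniGen
        # locates the relevant subjects itself, with no cropping or face detection.
        image = pipe(
            prompt="The woman in <img><|image_1|></img> and the man wearing glasses in "
                   "<img><|image_2|></img> are sitting together in a cafe.",
            input_images=["woman.jpg", "group_photo.jpg"],  # hypothetical inputs
        )[0]
        image.save("referring.png")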

    Common Image Generation Tasks

    OmniGen can handle various image generation tasks, including image editing, image-conditional generation (ControlNet-style), etc.
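    A hedged sketch of both task families under the same assumed interface (prompt wording is illustrative):

        # Image editing via instruction.
        edited = pipe(
            prompt="<img><|image_1|></img> Remove the hat and make the sky sunny.",
            input_images=["photo.jpg"],
        )[0]

        # Visual-conditional generation (ControlNet-style), conditioning directly on
        # an existing image instead of attaching a separate control network.
        conditioned = pipe(
            prompt="Following the depth map of this image <img><|image_1|></img>, "
                   "generate a new photo: a cozy living room at dusk.",
            input_images=["depth.png"],
        )[0]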

    Classical Vision Tasks

    OmniGen is also able to handle some classical computer vision tasks, e.g., low-level tasks such as deblurring, deraining, and inpainting, and high-level tasks such as human pose estimation and depth estimation.
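    Because these tasks are reformulated as image generation, they can be invoked with plain instructions; a hedged sketch follows (prompt wording is illustrative, not the actual trained templates):

        # High-level tasks: the "answer" is returned as an image.
        depth = pipe(
            prompt="<img><|image_1|></img> Estimate the depth map of this image.",
            input_images=["scene.jpg"],
        )[0]

        # Low-level restoration, e.g. deblurring.
        sharp = pipe(
            prompt="<img><|image_1|></img> Remove the blur from this image.",
            input_images=["blurry.jpg"],
        )[0]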

    Further Analysis

    OmniGen exhibits preliminary reasoning capabilities and a certain degree of in-context learning (ICL) ability.

    Step by step

    The Chain-of-Thought (CoT) method can significantly boost the performance of LLMs by decomposing a task into multiple steps and solving each step sequentially to obtain an accurate final answer. We consider whether a similar approach can be applied to image generation. Inspired by the basic way humans draw, we aim to mimic the step-by-step drawing process, iteratively refining the image from a blank canvas. We fine-tune OmniGen to handle this task. Based on findings from previous work on LLMs, which indicate that process supervision significantly outperforms outcome supervision, we posit that supervising the drawing process of images is a promising direction that may help the model handle more complex and diverse scenes.
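    Purely as a conceptual illustration of this step-by-step idea (the fine-tuned checkpoint, prompt template, and step count below are all hypothetical):

        from PIL import Image

        # Start from a blank canvas and iteratively refine it, feeding each
        # intermediate canvas back to the model.
        canvas = Image.new("RGB", (1024, 1024), "white")
        canvas.save("canvas_0.png")

        n_steps = 4  # hypothetical number of drawing stages
        for step in range(1, n_steps + 1):
            canvas = pipe(
                prompt="<img><|image_1|></img> Continue drawing a castle on a hill, "
                       f"step {step} of {n_steps}.",
                input_images=[f"canvas_{step - 1}.png"],
            )[0]
            canvas.save(f"canvas_{step}.png")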

    Architecture

    Current diffusion models are typically limited to common text-to-image tasks and cannot perform a broader range of downstream image generation tasks. For real-world applications, users often need to design and integrate additional network structures to extend the capabilities of diffusion models, making them highly cumbersome. Worse, these additional parameter networks are usually task-specific and cannot be reused for other tasks unless more networks are designed and trained for different functions. To circumvent these issues, the design principles of OmniGen are as follows: (1) universality: accepting any form of image and text inputs for various tasks; (2) conciseness: avoiding overly complex structural designs and numerous additional components.
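    The essence of this design can be shown with a toy, self-contained sketch: one transformer consumes a single interleaved sequence of text tokens, condition-image latents, a timestep embedding, and the noisy output latent, so no ControlNet- or IP-Adapter-style branch is needed. All modules and shapes below are illustrative stand-ins, not OmniGen's actual components.

        import torch
        import torch.nn as nn

        d = 64  # toy hidden size
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        transformer = nn.TransformerEncoder(layer, num_layers=2)

        text_tokens  = torch.randn(1, 12, d)  # stand-in for embedded instruction text
        cond_img     = torch.randn(1, 16, d)  # stand-in for VAE-encoded, patchified condition image
        time_emb     = torch.randn(1, 1, d)   # stand-in for the timestep embedding
        noisy_latent = torch.randn(1, 16, d)  # stand-in for the noisy output-image latent

        # One transformer processes the whole interleaved sequence; predictions at
        # the noisy-latent positions serve as the denoising target.
        sequence = torch.cat([text_tokens, cond_img, time_emb, noisy_latent], dim=1)
        out = transformer(sequence)
        denoise_pred = out[:, -noisy_latent.size(1):]
        print(denoise_pred.shape)  # torch.Size([1, 16, 64])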

    Dataset

    To achieve robust multi-task processing capabilities, it is essential to train models on large-scale and diverse datasets. However, in the field of image generation, a readily available large-scale and diverse dataset has yet to emerge. In this work, we construct, for the first time, a large-scale unified image generation dataset, which we refer to as the X2I dataset, meaning "anything to image". We convert these data into a unified format, and the following figure presents some examples from the X2I dataset. The entire dataset comprises approximately 100 million images. We provide a detailed description of its composition in the following sections.
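    As a hypothetical illustration of what a unified record might look like (the actual X2I field names are not specified here, so this schema is an assumption):

        # One "anything to image" training record: an instruction with image
        # placeholders, optional input images, and a target output image.
        example = {
            "instruction": "<img><|image_1|></img> Make the sky look like sunset.",
            "input_images": ["images/000123_src.jpg"],
            "output_image": "images/000123_tgt.jpg",
            "task": "image_editing",  # hypothetical task tag
        }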

    Citation

     
    @misc{omnigen2024,
        author={Xiao, Shitao and Wang, Yueze and Zhou, Junjie and Yuan, Huaying and Xing, Xingrun and Yan, Ruiran and Wang, Shuting and Huang, Tiejun and Liu, Zheng},
        title={OmniGen: Unified Image Generation},
        eprint={2409.11340},
        archivePrefix={arXiv},
        year={2024}
    }