The emergence of Large Language Models (LLMs) has unified language generation tasks and revolutionized human-machine interaction. However, in the realm of image generation, a unified model capable of handling various tasks within a single framework remains largely unexplored. In this work, we introduce OmniGen, a new diffusion model for unified image generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires additional modules such as ControlNet or IP-Adapter to process diverse control conditions. OmniGen is characterized by the following features:
OmniGen's general capabilities enable more flexible image-generation workflows. The following showcases a simple pipeline: generating an image from text, editing parts of the generated image, redrawing the scene based on the human pose in the generated image, and extracting a desired object from another image to integrate into the new one.
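The chained workflow above can be sketched as a sequence of requests to one model, where each step's output image feeds the next step. The `Request` structure and `run_step` stub below are illustrative assumptions, not the actual OmniGen API:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    """One step in the chained pipeline; `images` may be outputs of earlier steps."""
    prompt: str
    images: List[str] = field(default_factory=list)  # paths or handles


def run_step(request: Request) -> str:
    """Stub standing in for a single OmniGen forward pass (hypothetical)."""
    return f"image_from({request.prompt!r}, inputs={len(request.images)})"


def chained_pipeline() -> List[str]:
    """Text-to-image, then editing, then pose-based redraw, then subject insertion."""
    # 1. Generate an image from text alone.
    img1 = run_step(Request("a woman reading in a cafe"))
    # 2. Edit part of the generated image.
    img2 = run_step(Request("replace the book with a laptop", [img1]))
    # 3. Redraw the scene following the human pose in the edited image.
    img3 = run_step(Request("redraw the scene following the pose in the image", [img2]))
    # 4. Pull a subject from a second image into the new scene.
    img4 = run_step(Request("add the dog from the second image", [img3, "dog_photo.png"]))
    return [img1, img2, img3, img4]
```

Because every step goes through the same model, the chain needs no task-specific modules between stages.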
You can input multiple images and use simple, general language to refer to the objects within those images. OmniGen can automatically recognize the necessary objects in each image and generate new images based on them. No additional operations, such as image cropping or face detection, are required.
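The public OmniGen examples reference input images inside the text prompt with placeholder tags of the form `<img><|image_1|></img>`. The small helper below, an illustrative sketch restated from memory rather than the official API, builds such a prompt from a plain template:

```python
def build_prompt(template: str, num_images: int) -> str:
    """Expand {image_i} markers into OmniGen-style image placeholders.

    `{image_1}` becomes `<img><|image_1|></img>`; this placeholder format
    follows the public OmniGen examples but is an assumption here.
    """
    prompt = template
    for i in range(1, num_images + 1):
        prompt = prompt.replace(f"{{image_{i}}}", f"<img><|image_{i}|></img>")
    return prompt


# Two input images are referenced in natural language, with no cropping
# or face detection required on the user's side.
prompt = build_prompt(
    "The woman in {image_1} waves at the man in {image_2}.", num_images=2
)
```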
OmniGen can handle various image generation tasks, including image editing, image-conditional generation (ControlNet-style conditions), and more.
OmniGen is also able to perform some classical computer vision tasks, e.g., low-level tasks such as deblurring, deraining, and inpainting, as well as high-level tasks such as human pose estimation and depth estimation.
OmniGen exhibits latent reasoning capabilities and a certain degree of in-context learning (ICL) ability.
The Chain-of-Thought (CoT) method can significantly boost the performance of LLMs by decomposing a task into multiple steps and solving each step sequentially to obtain an accurate final answer. We consider whether a similar approach can be applied to image generation. Inspired by the way humans draw, we aim to mimic the step-by-step drawing process, iteratively refining the image from a blank canvas. We fine-tune OmniGen to perform this task. Based on findings from previous work on LLMs, which indicate that process supervision significantly outperforms outcome supervision, we posit that supervising the drawing process is a promising direction that may help the model handle more complex and diverse scenes.
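The step-by-step drawing idea reduces to an iterative loop in which each pass conditions on the previous canvas, and every intermediate canvas is kept so the process itself can be supervised. The `refine` callback below is a hypothetical stand-in for one call to the fine-tuned model, not OmniGen's actual interface:

```python
from typing import Callable, List


def iterative_drawing(
    prompt: str,
    steps: int,
    refine: Callable[[str, str], str],
) -> List[str]:
    """Start from a blank canvas and refine it `steps` times.

    `refine(canvas, prompt)` stands in for one drawing step; the full
    history is returned so each intermediate result can be supervised,
    mirroring process supervision rather than outcome supervision.
    """
    canvas = "blank"
    history = [canvas]
    for _ in range(steps):
        canvas = refine(canvas, prompt)
        history.append(canvas)
    return history


# A toy refine function: each step "adds" one more stroke to the canvas.
history = iterative_drawing(
    "a mountain lake",
    steps=3,
    refine=lambda canvas, prompt: canvas + f" +stroke({prompt})",
)
```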
Current diffusion models are typically limited to common text-to-image tasks and cannot perform a broader range of downstream image-generation tasks. For real-world applications, users often need to design and integrate additional network structures to extend the capabilities of diffusion models, making them highly cumbersome. Even worse, these additional parameter networks are usually task-specific and cannot be reused for other tasks, so more networks must be designed and trained for each new function. To circumvent these issues, the design principles of OmniGen are as follows: 1) Universality: accepting any form of image and text input for various tasks; 2) Conciseness: avoiding overly complex structural designs and numerous additional components.
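In practice, the universality principle means feeding a single model one interleaved sequence of text and image segments, with no task-specific adapters. The tokenizer below is a toy illustration of that interleaving, assuming `<|image_i|>`-style placeholders; it is not OmniGen's actual preprocessing:

```python
import re
from typing import List, Tuple, Union

# A unified input is a flat sequence of ("text", str) and ("image", index) segments.
Segment = Tuple[str, Union[str, int]]


def interleave(prompt: str, num_images: int) -> List[Segment]:
    """Split a prompt containing <|image_i|> markers into an interleaved sequence."""
    segments: List[Segment] = []
    pos = 0
    for m in re.finditer(r"<\|image_(\d+)\|>", prompt):
        if m.start() > pos:
            segments.append(("text", prompt[pos:m.start()]))
        idx = int(m.group(1))
        assert 1 <= idx <= num_images, "placeholder refers to a missing image"
        segments.append(("image", idx))
        pos = m.end()
    if pos < len(prompt):
        segments.append(("text", prompt[pos:]))
    return segments
```

The same flat sequence covers text-to-image (text only), editing (text plus one image), and multi-image composition, which is what removes the need for per-task modules.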
To achieve robust multi-task processing capabilities, it is essential to train models on large-scale, diverse datasets. However, in the field of image generation, no readily available large-scale and diverse dataset has yet emerged. In this work, we construct the first large-scale unified image-generation dataset, which we refer to as the X2I dataset, meaning "anything to image". We converted the data into a unified format; the following figure presents some examples from the X2I dataset. The entire dataset comprises approximately 100 million images. We provide a detailed description of its composition in the following sections.
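A unified "anything to image" format boils down to one record schema shared by all tasks: an instruction, zero or more input images, and one target image. The dataclass below is a guess at such a schema for illustration only, not the released X2I layout:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class X2IRecord:
    """One training example in a unified anything-to-image format (illustrative)."""
    task: str                  # e.g. "text2img", "edit", "pose", "deblur"
    instruction: str           # natural-language instruction, may reference inputs
    input_images: List[str] = field(default_factory=list)  # condition image paths
    target_image: str = ""     # path of the ground-truth output image


# The same schema covers text-to-image (no inputs) and editing (one input).
records = [
    X2IRecord("text2img", "a red bicycle", [], "out/001.png"),
    X2IRecord("edit", "make the bicycle blue", ["out/001.png"], "out/002.png"),
]
```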
@misc{omnigen2024,
      author={Shitao Xiao and Yueze Wang and Junjie Zhou and Huaying Yuan and Xingrun Xing and Ruiran Yan and Shuting Wang and Tiejun Huang and Zheng Liu},
      title={OmniGen: Unified Image Generation},
      year={2024},
      eprint={2409.11340},
      archivePrefix={arXiv},
}