On the night of June 27, Alibaba launched Qwen VLo, a unified multimodal understanding and generation model. The model can not only "understand" the world but also re-create content at high quality based on that understanding. It has three key highlights: more accurate understanding and generation, support for open-ended image-editing instructions, and multilingual instruction support.
Users can now try the model in Qwen Chat (preview version), for example by sending a prompt such as "generate a cute cat image" to create an image, or by uploading a cat image and asking it to "add a hat on the cat's head" to modify that image.
Qwen VLo generates images progressively. During generation, the model continuously adjusts and refines the content it has already predicted so that the final result is more harmonious and consistent. This improves visual quality and gives users a more flexible, controllable creative experience.
More Accurate Understanding and Generation, Supporting Open Instruction Image Modification
User: Generate a desktop filled with fruits
Qwen VLo:
User: Predict edge detection image
Qwen VLo:
User: Segment the banana's edges with a red mask
Qwen VLo:
5. Qwen VLo can take multiple images as input, understand them together, and generate a result from them. (Multi-image input has not been officially launched yet.)
User: Put these bathroom items into this red basket
Qwen VLo:
6. In addition to simultaneous text and image input, Qwen VLo also supports direct text-to-image generation, including general images and Chinese and English posters.
User: Epic anime artwork of a wizard atop a mountain at night casting a cosmic spell into the dark sky that says "Qwen VLo!" made out of colorful energy
Qwen VLo:
7. Qwen VLo supports image generation with dynamic aspect ratios and can handle very elongated formats, with aspect ratios as extreme as 4:1 or 1:3. (Generation at extreme aspect ratios has not been officially launched yet.)
User: Anime illustration; watercolor hand-drawn; foreground is a grass slope with a person running dynamically, then thick white clouds; blue background; multiple color gradient layers; smooth and harmonious transition
Qwen VLo:
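To make the 4:1 and 1:3 figures concrete, the short sketch below works out what pixel dimensions such ratios imply. It is illustrative arithmetic only: the function name `dims_for_aspect`, the one-megapixel budget, and the rounding to multiples of 16 are assumptions chosen for the example, not documented properties of Qwen VLo.

```python
import math


def dims_for_aspect(ratio_w: int, ratio_h: int,
                    pixel_budget: int = 1024 * 1024,
                    multiple: int = 16) -> tuple[int, int]:
    """Return (width, height) approximating ratio_w:ratio_h within a pixel budget.

    pixel_budget and multiple are hypothetical parameters for illustration;
    they are not taken from Qwen VLo.
    """
    if ratio_w <= 0 or ratio_h <= 0:
        raise ValueError("aspect ratio terms must be positive")
    # Solve width * height = pixel_budget with width / height = ratio_w / ratio_h.
    height = math.sqrt(pixel_budget * ratio_h / ratio_w)
    width = height * ratio_w / ratio_h
    # Round down to the nearest multiple so the result never exceeds the budget.
    w = max(multiple, int(width) // multiple * multiple)
    h = max(multiple, int(height) // multiple * multiple)
    return w, h


if __name__ == "__main__":
    for rw, rh in [(4, 1), (1, 3), (1, 1)]:
        print(f"{rw}:{rh} ->", dims_for_aspect(rw, rh))
```

Under these assumptions, a 4:1 image is four times as wide as it is tall for the same total pixel count as a square image, which is why such formats suit banners and panoramic illustrations like the one requested above.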
8. As a unified understanding and generation model, Qwen VLo can also re-analyze and understand the content it generates, for example by identifying the breeds of the dog and cat in a generated image.
User: Generate a puppy and a kitten.
Qwen VLo:
User: What breed of cat and dog is this?
Qwen VLo:
Conclusion: Unified Understanding and Generation, Upgrading Image Description
Qwen VLo introduces a progressive generation mechanism that not only improves generation efficiency but also suits tasks that require fine control, such as generating long paragraphs of text.
Qwen VLo is still in the preview stage, however, and its output may show issues such as factual inaccuracies, results that are not fully consistent with the original image, failure to follow instructions, and unstable recognition and understanding of generated images.
In the future, models will not only be able to answer questions with text but also convey ideas and meanings through images. For example, functions such as generating diagrams, adding auxiliary lines, and marking key areas will provide users with more diverse communication methods.
Meanwhile, multimodal models that can produce output also give researchers new forms of supervision: generation tasks can help models understand the world better.
This article is from the WeChat public account "Zhidongxi" (ID: zhidxcom), authored by Li Shuiqing, and published with authorization from 36Kr.