Alibaba's answer to GPT-4o is now available: edit images precisely with a single sentence, for free

On the evening of June 27, Alibaba launched Qwen VLo, a unified multimodal understanding and generation model. The model can not only "understand" the world but also re-create content at high quality based on that understanding. It has three key highlights: more accurate understanding and generation, support for open-ended image-editing instructions, and support for instructions in multiple languages.

The model is now available as a preview in Qwen Chat. Users can send a prompt such as "generate a cute cat image" to create an image, or upload a picture of a cat and ask to "add a hat on the cat's head" to modify it.

Qwen VLo generates images progressively. During generation, the model continuously adjusts and refines the content it has predicted so that the final result is more harmonious and consistent. This improves visual quality and gives users a more flexible and controllable creative experience.

More Accurate Understanding and Generation, Supporting Open Instruction Image Modification

User: Generate a tabletop covered with fruit

Qwen VLo:

User: Predict the edge-detection image

Qwen VLo:

User: Segment the banana's edges with a red mask

Qwen VLo:

5. Qwen VLo can take multiple images as input, understand them, and generate from them. (Multi-image input has not been officially launched yet.)

User: Put these bathroom items into this red basket

Qwen VLo:

6. Besides combined text and image input, Qwen VLo also supports direct text-to-image generation, covering both general images and posters in Chinese and English.

User: Epic anime artwork of a wizard atop a mountain at night casting a cosmic spell into the dark sky that says "Qwen VLo!" made out of colorful energy

Qwen VLo:

7. Qwen VLo supports image generation with dynamic aspect ratios and can handle extremely elongated images, with aspect ratios up to 4:1 or 1:3. (Generation at extreme aspect ratios has not been officially launched yet.)

User: Anime illustration; watercolor hand-drawn; foreground is a grass slope with a person running dynamically, then thick white clouds; blue background; multiple color gradient layers; smooth and harmonious transition

Qwen VLo:

8. As a unified understanding and generation model, Qwen VLo can also re-analyze and understand the content it generates, for example by identifying the breeds of the dogs and cats in a generated image.

User: Generate a puppy and a kitten.

Qwen VLo:

User: What breeds are the cat and dog?

Qwen VLo:

Conclusion: Unified Understanding and Generation, Upgrading Image Description

Qwen VLo introduces a new progressive generation mechanism that improves generation efficiency and also suits long-paragraph text generation tasks that require fine control.

Qwen VLo is still in preview, however. Its output may contain factual errors, may not be fully consistent with the original image, and may not follow instructions. Its recognition and understanding of generated images can also be unstable.

In the future, models will be able to answer questions not only with text but also with images that convey ideas and meaning. Functions such as generating diagrams, adding auxiliary lines, and marking key areas will give users more varied ways to communicate.

Multimodal models that can produce output also give researchers new forms of supervision: generation tasks can help models better understand the world.

This article is from the WeChat public account "Zhidongxi" (ID: zhidxcom), authored by Li Shuiqing, and published with authorization from 36Kr.
