Apple’s model combines vision understanding and image creation.

Apple researchers have published a paper on Manzano, a multimodal model that combines visual understanding and text-to-image generation while significantly reducing the trade-off current implementations face between understanding performance and generation quality. Here are the details.

An interesting approach to a modern problem.

In a study titled MANZANO: A Simple and Scalable Unified Multimodal Model with Hybrid Vision Tokenizer, a team of nearly 30 Apple researchers detailed a new unified approach that provides image understanding and text-to-image generation in a single multimodal model.

This is important because current unified multimodal models that support image generation often face trade-offs: they either sacrifice visual understanding to prioritize autoregressive image generation, or prioritize understanding at the expense of generative accuracy. In other words, it is often difficult for them to excel at both things at the same time.

Here’s why this happens, according to researchers:

One of the main reasons for this gap is the conflicting nature of visual tokenization. Autoregressive image generation typically favors discrete image tokens, while understanding typically benefits from continuous embeddings. Many models adopt a dual-tokenizer strategy, using a semantic encoder for rich continuous features while a separate quantized tokenizer, such as a VQ-VAE, handles generation. However, this forces the language model to process two different kinds of image tokens, one from a high-level semantic space and the other from a low-level spatial space, creating significant task conflict. While some solutions, such as Mixture-of-Transformers (MoT), can mitigate this by assigning separate pathways to each task, they are parameter-inefficient and often incompatible with existing Mixture-of-Experts (MoE) architectures. An alternative line of work avoids the conflict by freezing a pretrained multimodal LLM and attaching it to a diffusion decoder. While this preserves understanding, it decouples generation, losing potential mutual benefits and limiting the gains generation could see from scaling the multimodal LLM.

Simply put, current multimodal architectures are ill-suited to both tasks simultaneously because they rely on conflicting visual representations for comprehension and generation that the language model itself struggles to reconcile.
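Manzano's answer to this conflict is a hybrid tokenizer: one shared vision encoder whose output is routed through two lightweight adapters, so both tasks see representations from the same semantic space. The toy sketch below illustrates that routing idea only; all names, the averaging "encoder," and the one-dimensional codebook are made up for illustration and are not Apple's actual implementation.

```python
# Illustrative sketch of a hybrid-tokenizer layout (hypothetical names,
# not Apple's code): a shared encoder feeds a continuous adapter for
# understanding and a quantizing adapter for generation.

def shared_encoder(image_patches):
    """Stand-in for a pretrained vision encoder: one scalar feature per patch."""
    return [sum(patch) / len(patch) for patch in image_patches]

def continuous_adapter(features):
    """Understanding path: hand rich continuous embeddings to the LLM as-is."""
    return features

def discrete_adapter(features, codebook):
    """Generation path: quantize each feature to its nearest codebook entry,
    yielding discrete image-token ids from a shared vocabulary."""
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - f))
            for f in features]

codebook = [0.0, 0.25, 0.5, 0.75, 1.0]   # toy "vocabulary" of image tokens
patches = [[0.1, 0.2], [0.8, 0.9]]       # two fake image patches

features = shared_encoder(patches)
print(continuous_adapter(features))           # continuous embeddings
print(discrete_adapter(features, codebook))   # discrete token ids: [1, 3]
```

The key point the sketch captures is that both paths start from the same encoder output, so the language model never has to reconcile tokens from two unrelated representation spaces.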

This is where Manzano comes in. It unifies understanding and generation by using an autoregressive LLM to predict what an image should semantically contain, then feeding those predictions to a diffusion decoder (a denoising process we explain here) that renders the actual pixels.

As the researchers explain, Manzano combines three components in its architecture:

  1. A hybrid vision tokenizer that produces both continuous and discrete visual representations;
  2. An LLM decoder that accepts text tokens and/or continuous image embeddings and autoregressively predicts the next discrete image or text token from a shared vocabulary;
  3. An image decoder that renders image pixels from the predicted image tokens.
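The generation side of the three components above can be sketched as a two-stage pipeline: the LLM decoder emits discrete image tokens one at a time, and an image decoder turns the finished token sequence into pixels. Everything in the sketch below is hypothetical stand-in code; the deterministic "predictor" and flat-color "decoder" replace the real learned model and diffusion decoder purely to show the data flow.

```python
# Toy sketch of the token-then-decode pipeline (illustrative names, not
# Apple's implementation): predict discrete image-token ids from a shared
# vocabulary, then decode the ids into pixel values.

SHARED_VOCAB = ["<red>", "<green>", "<blue>", "<sky>", "<grass>"]

def llm_predict_image_tokens(prompt, n_tokens):
    """Stand-in for the autoregressive LLM decoder: a real model samples each
    token from a learned distribution; here we derive ids from the prompt."""
    seed = sum(ord(c) for c in prompt)
    return [(seed + i) % len(SHARED_VOCAB) for i in range(n_tokens)]

def image_decoder(token_ids):
    """Stand-in for the diffusion decoder: map each token id to a flat RGB
    color instead of running an actual denoising process."""
    palette = {0: (255, 0, 0), 1: (0, 255, 0), 2: (0, 0, 255),
               3: (135, 206, 235), 4: (124, 252, 0)}
    return [palette[t] for t in token_ids]

tokens = llm_predict_image_tokens("a bird flying under an elephant", 4)
pixels = image_decoder(tokens)
print(tokens, pixels)
```

Because the LLM only ever predicts tokens from the shared vocabulary, pixel rendering is fully delegated to the decoder stage, which is what lets the same LLM serve both understanding and generation.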

As a result of this approach, “Manzano handles nonsensical and physics-defying prompts (e.g., ‘bird flies under elephant’) comparably to GPT-4o and Nano Banana,” the researchers say.

The researchers also note that across various benchmarks, “the Manzano 3B and 30B models achieve superior or competitive performance compared to other state-of-the-art unified multimodal LLMs.”

Apple researchers tested Manzano at various sizes, from a 300M-parameter model up to a 30B version. This allowed them to measure how much unified multimodal performance improves as the model is scaled:

Here’s another comparison between Manzano and other state-of-the-art models, including Google’s Nano Banana and OpenAI’s GPT-4o:

Finally, Manzano also handles image editing tasks well, including instruction-guided editing, style transfer, inpainting/outpainting, and depth estimation.

To read the full study, with technical details on Manzano’s hybrid tokenizer training, diffusion decoder design, scaling experiments, and human evaluations, follow this link.

And if you’re interested in this topic, don’t forget to check out our explainer on UniGen, another promising image generation model recently described by Apple researchers. Although neither model is available on Apple devices, they point to Apple’s continued work toward better image generation in Image Playground and beyond.

FTC: We use income-earning auto affiliate links. More.
