From Marcus Mendes on 9to5Mac:
In the study titled MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer, a team of nearly 30 Apple researchers details a novel unified approach that enables both image understanding and text-to-image generation in a single multimodal model.
Mendes walks through the paper's results in detail; for example, Apple's image generation is “comparable” to GPT-4o in some tests:
As a result of this approach, “Manzano handles counterintuitive, physics-defying prompts (e.g., ‘The bird is flying below the elephant’) comparably to GPT-4o and Nano Banana,” the researchers say.
The picture that emerges is that Apple's models are roughly a year behind where the company wants them to be, but potentially catching up thanks to new research.
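The “hybrid vision tokenizer” is the interesting bit. As I understand the design from the paper, a single vision encoder feeds two adapters: a continuous one whose embeddings the LLM reads for image understanding, and a discrete one that quantizes features into tokens the model can predict for generation. Here is a minimal, hypothetical sketch of that idea in PyTorch; this is not Apple's code, and every name and dimension below is a placeholder:

```python
# A minimal sketch (not Apple's code) of the hybrid-tokenizer idea: one
# shared vision encoder feeds two adapters, a continuous one for image
# understanding and a discrete (quantized) one for image generation.
import torch
import torch.nn as nn

class HybridVisionTokenizer(nn.Module):
    def __init__(self, patch_dim=768, embed_dim=512, codebook_size=8192):
        super().__init__()
        # Stand-in for the full vision transformer backbone.
        self.encoder = nn.Linear(patch_dim, embed_dim)
        # Continuous adapter: embeddings the LLM consumes for understanding.
        self.continuous_adapter = nn.Linear(embed_dim, embed_dim)
        # Codebook for the discrete path: features are snapped to their
        # nearest code, yielding token ids the model can learn to predict.
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim)
        features = self.encoder(patches)
        continuous = self.continuous_adapter(features)
        # Nearest-neighbor quantization against the codebook.
        dists = torch.cdist(features, self.codebook.weight.unsqueeze(0))
        token_ids = dists.argmin(dim=-1)  # (batch, num_patches)
        return continuous, token_ids

tok = HybridVisionTokenizer()
cont, ids = tok(torch.randn(1, 16, 768))  # one image as 16 dummy patches
print(cont.shape, ids.shape)  # torch.Size([1, 16, 512]) torch.Size([1, 16])
```

The appeal of sharing one encoder, presumably, is that understanding and generation stop fighting over the tokenizer: the same features serve both paths, and only the lightweight adapters diverge.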