Apple model combines vision understanding and image generation »

Apple researchers have published a study on a new model, "Manzano," that promises better quality and performance than current models.

From Marcus Mendes on 9to5Mac:

In the study titled MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer, a team of nearly 30 Apple researchers details a novel unified approach that enables both image understanding and text-to-image generation in a single multimodal model.

Mendes gives a detailed explanation of the paper’s results – Apple’s image generation is “comparable” to GPT-4o in some tests, for example:

As a result of this approach, “Manzano handles counterintuitive, physics-defying prompts (e.g., ‘The bird is flying below the elephant’) comparably to GPT-4o and Nano Banana,” the researchers say.

We can basically see here that Apple’s models are a year behind where they want to be – but potentially catching up thanks to new research.

View the original.
