Apple model combines vision understanding and image generation »

Apple researchers have published a study on a new model, "Manzano", that shows improved quality and performance compared to current versions.

From Marcus Mendes on 9to5Mac:

In the study titled MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer, a team of nearly 30 Apple researchers details a novel unified approach that enables both image understanding and text-to-image generation in a single multimodal model.

Mendes gives a detailed explanation of the paper’s results – Apple’s image generation is “comparable” to GPT-4o in some tests, for example:

As a result of this approach, “Manzano handles counterintuitive, physics-defying prompts (e.g., ‘The bird is flying below the elephant’) comparably to GPT-4o and Nano Banana,” the researchers say.

This suggests Apple’s models are about a year behind where the company wants them to be – but potentially catching up thanks to new research.

