Apple model combines vision understanding and image generation »

Apple researchers have published a study on a new model called "Manzano" that promises improved quality and performance over current versions.

From Marcus Mendes on 9to5Mac:

In the study titled MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer, a team of nearly 30 Apple researchers details a novel unified approach that enables both image understanding and text-to-image generation in a single multimodal model.

Mendes gives a detailed explanation of the paper’s results – Apple’s image generation is “comparable” to GPT-4o in some tests, for example:

As a result of this approach, “Manzano handles counterintuitive, physics-defying prompts (e.g., ‘The bird is flying below the elephant’) comparably to GPT-4o and Nano Banana,” the researchers say.

This suggests that Apple’s models are a year behind where the company wants them to be – but potentially catching up thanks to new research.

View the original.
