Over the past two years, we have witnessed an explosion of Unified Multimodal Models (UMMs). From Chameleon[1] and Show-o[2] to Transfusion[3] and Emu[4], the field flourished with diverse architectures; with the emergence of Bagel[5], architectural exploration gradually converged. In this phase, benefiting from the integration of understanding, we observed remarkable gains on the generation side: UMMs exhibit enhanced reasoning capabilities and reflect world knowledge in their generated outputs. "VLM-as-Encoder" is becoming a new generative paradigm. This flourishing era constitutes the "First Half" of UMM development, in which the community focused primarily on architecture, while attention on the generation side was largely directed toward capabilities such as static world knowledge, safety, and instruction adherence.
However, as the marginal returns of architectural exploration diminish, we confront a pivotal inflection point. With the construction of UMMs demystified, the research focus inevitably pivots toward their effective utilization and the identification of their intrinsic strengths, signaling our readiness to enter the "Second Half." In contrast to the First Half's emphasis on implementation and benchmark maximization, this new phase calls for insightful evaluation frameworks that prioritize capabilities unique to UMMs and their adaptability to complex, real-world demands. Despite the proliferation of models and soaring metrics in 2024–2025, a qualitative paradigm shift in practical creation and interaction remains elusive. (The remainder of the Second Half is elaborated in subsequent sections.)
This rapid advancement invites a natural question: how far are current UMMs from achieving true general intelligence regarding visual generation?
Drawing on the literature[6], general intelligence can be decoupled into Crystallized Intelligence (CI)[7] and Fluid Intelligence (FI)[8]. CI relies on recalling accumulated knowledge and learned schemas, while FI emphasizes the capacity to reason and solve problems in novel situations. The former has been the core focus of the "First Half" of UMM development: by fitting massive datasets, models have acquired astonishing CI. For instance, a model's ability to generate a flawless cat often stems from exposure to billions of instances during training, followed by probabilistic reproduction at inference. Real-world demands, however, are diverse and often require models to adapt to new contexts on the fly, which poses a significant challenge to their FI. Coincidentally, our work aligns with the research from Shunyu Yao's team on "true" in-context learning[9], which is also the foundation of FI; ours can be viewed as the generative extension of that line of work.
Existing benchmarks such as the classic ARC-Bench[10] are grounded almost entirely in understanding: fluid intelligence is typically discussed only in comprehension contexts. Yet visual generation is approaching a similar inflection point. The historical fixation on pixel-level fidelity may itself be a bias; in the long term, understanding and generation should arguably not be treated as separate tasks.
Guided by the definition of FI, we do not assess whether a model can render a "more realistic dog" (an indicator of CI). Instead, drawing inspiration from tasks humans perform effortlessly, we translate these capabilities into three generative challenges: implicit pattern induction, ad-hoc constraint execution, and contextual knowledge adaptation.
We evaluated twelve models and, surprisingly, found that even the state-of-the-art Nano Banana Pro failed to achieve a passing score. (Please refer to the paper for the specific metrics and prompts.) Our evaluation focuses on three primary dimensions: logical correctness (RC), preservation of reference information (VC), and aesthetic quality (AQ). To ensure fairness and reproducibility, we developed manually annotated rubrics for each test case and used both open-source and proprietary models as judges; a sketch of this judging loop follows the results table below.
*Sub-tasks are grouped by challenge: Implicit Pattern Induction (Implicit Pattern); Ad-hoc Constraint Execution (Symbolic Constraint, Visual Constraint); Contextual Knowledge Adaptation (Prior-Conflicting, Multi-Semantic). RC, VC, and AQ are the three evaluation dimensions defined above.*

| Method | Interleaved | Overall | Implicit Pattern RC | Implicit Pattern VC | Implicit Pattern AQ | Symbolic Constraint RC | Symbolic Constraint VC | Symbolic Constraint AQ | Visual Constraint RC | Visual Constraint VC | Visual Constraint AQ | Prior-Conflicting RC | Prior-Conflicting VC | Prior-Conflicting AQ | Multi-Semantic RC | Multi-Semantic VC | Multi-Semantic AQ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Nano Banana Pro | ✅ | 57.19 | 66.86 | 44.59 | 96.51 | 71.38 | 50.00 | 92.11 | 76.67 | 66.67 | 96.67 | 52.97 | 41.38 | 90.59 | 35.45 | - | 95.00 |
| Nano Banana | ✅ | 50.66 | 56.47 | 39.04 | 94.12 | 60.46 | 51.91 | 90.20 | 68.33 | 79.17 | 93.33 | 35.50 | 39.47 | 91.00 | 30.28 | - | 93.12 |
| GPT-Image | ❌ | 47.15 | 58.14 | 41.92 | 93.60 | 58.82 | 32.82 | 93.79 | 49.17 | 62.50 | 92.50 | 43.50 | 33.33 | 90.00 | 28.64 | - | 85.45 |
| SeeDream 4.0 | ❌ | 21.26 | 12.05 | 0.70 | 96.39 | 21.57 | 3.44 | 84.64 | 40.00 | 4.17 | 76.67 | 30.69 | 10.34 | 82.67 | 30.73 | - | 80.00 |
| SeeDream 4.5 | ❌ | 52.84 | 70.00 | 59.59 | 97.06 | 62.91 | 41.09 | 94.37 | 58.33 | 62.50 | 86.67 | 40.10 | 41.38 | 92.57 | 35.00 | - | 86.82 |
| **Open-Source Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Qwen-Image | ❌ | 30.58 | 36.18 | 27.69 | 71.05 | 36.18 | 27.69 | 71.05 | 26.67 | 45.83 | 55.83 | 27.72 | 20.69 | 71.78 | 25.91 | - | 69.55 |
| GLM-Image | ❌ | 24.71 | 32.94 | 19.86 | 93.53 | 22.37 | 21.15 | 87.50 | 27.50 | 12.50 | 70.83 | 20.30 | 15.52 | 71.29 | 17.73 | - | 70.91 |
| FLUX.2-dev | ❌ | 34.39 | 34.30 | 27.70 | 88.95 | 35.76 | 31.01 | 87.09 | 39.17 | 50.00 | 59.17 | 25.25 | 30.17 | 84.16 | 29.82 | - | 79.82 |
| NextStep-1 | ❌ | 10.44 | 10.74 | 0.40 | 25.12 | 11.33 | 2.54 | 21.67 | 21.50 | 4.20 | 29.17 | 15.49 | 7.55 | 28.71 | 12.80 | - | 20.28 |
| Emu3.5-Image | ❌ | 36.67 | 41.86 | 35.81 | 83.72 | 34.97 | 39.31 | 86.93 | 24.17 | 29.17 | 42.50 | 26.24 | 37.93 | 82.18 | 32.87 | - | 75.46 |
| OmniGen2 | ❌ | 27.87 | 29.07 | 26.35 | 76.16 | 25.33 | 30.38 | 77.96 | 11.67 | 41.67 | 52.50 | 23.76 | 34.48 | 69.80 | 19.27 | - | 63.76 |
| Bagel | ✅ | 26.74 | 26.74 | 27.03 | 84.30 | 29.61 | 16.03 | 76.32 | 22.50 | 12.50 | 49.17 | 17.24 | 22.28 | 74.75 | 33.49 | - | 53.67 |
| Ours | ✅ | 32.92 | 39.54 | 44.92 | 66.71 | 36.54 | 26.73 | 67.11 | 30.45 | 35.11 | 47.84 | 23.67 | 36.75 | 57.78 | 34.22 | - | 52.75 |
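As a concrete illustration of the judging protocol described above, below is a minimal sketch of how a manually annotated rubric could be scored with a VLM judge. The rubric schema and the `query_judge` callable are hypothetical placeholders; the actual prompts, metrics, and judge models are specified in the paper.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RubricItem:
    dimension: str      # "RC" (logical correctness), "VC" (reference preservation), or "AQ" (aesthetics)
    question: str       # binary check written by an annotator for this specific test case
    weight: float = 1.0

def score_case(generated_image, reference_images, rubric, query_judge):
    """Score one generated image against its manually annotated rubric.

    `query_judge` is any judge backend (open-source or proprietary VLM) that
    answers a rubric question given the images and returns True/False.
    """
    per_dim = {}
    for item in rubric:
        passed = query_judge(images=[generated_image, *reference_images],
                             question=item.question)
        per_dim.setdefault(item.dimension, []).append(item.weight * float(passed))
    # Average within each dimension and report scores on a 0-100 scale.
    return {dim: 100.0 * mean(vals) for dim, vals in per_dim.items()}
```

Per-category numbers like those in the table could then be obtained by averaging these per-case scores over the test cases of each sub-task.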
Beyond the quantitative results, our diagnostic analysis reveals several critical observations regarding the current limitations of UMMs.
Building upon these findings, we conducted a deeper investigation:
We visualized the attention mechanism by employing the generated image tokens as Queries against the interleaved multimodal context (serving as Keys). Observations reveal that Bagel's attention distribution is erratic, characterized by irregular noise and stochastic spikes. This raises a critical question: To what extent does attention dictate context comprehension, and can a simple modulation of these weights enhance model performance?
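For readers who want to reproduce this kind of diagnostic, here is a minimal sketch of how such a query-to-context attention map can be extracted, assuming a Hugging Face-style forward pass that returns per-layer attention tensors and known index ranges for the generated image tokens and the interleaved context; the exact hooks for Bagel will differ.

```python
import torch

@torch.no_grad()
def context_attention_map(attentions, gen_token_idx, ctx_token_idx):
    """Aggregate attention from generated-image-token queries to interleaved-context keys.

    attentions: tuple of per-layer [batch, heads, seq, seq] tensors, e.g. from a
        forward pass called with `output_attentions=True`.
    gen_token_idx / ctx_token_idx: 1-D LongTensors indexing the generated image
        tokens (queries) and the interleaved multimodal context (keys).
    """
    attn = torch.stack(attentions).mean(dim=(0, 2))    # average layers and heads -> [batch, seq, seq]
    attn = attn[0][gen_token_idx][:, ctx_token_idx]    # [num_gen_tokens, num_ctx_tokens]
    # Re-normalize each query row over the context so maps are comparable across models.
    return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)

# The resulting [num_gen_tokens, num_ctx_tokens] matrix can be rendered as a heatmap
# (generated tokens on the y-axis, context tokens on the x-axis) to inspect spikes and noise.
```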
Drawing on theoretical frameworks from [11] that interpret In-Context Learning (ICL) as implicit gradient descent, we extended this analysis to the Bagel architecture (full derivation provided in the paper). Theoretically, high-quality context acts as a signal for a more precise and definitive "gradient descent" trajectory. However, Bagel's diffuse attention maps suggest a failure to focus on critical task-relevant features. Consequently, the implicit gradient update lacks a coherent descent direction, trapping the model within its pre-training priors.
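For intuition only, a schematic version of this argument for a simplified linear self-attention layer (not the derivation in the paper, which handles the full Bagel architecture): with context pairs $(x_i, y_i)$ and a query $x_q$,

$$
f(x_q) \;=\; W_0 x_q + \sum_i \big(W_V y_i\big)\big(W_K x_i\big)^{\top}\big(W_Q x_q\big)
\;=\; \big(W_0 + \Delta W\big)\, x_q,
\qquad
\Delta W \;=\; \sum_i \big(W_V y_i\big)\big(W_K x_i\big)^{\top} W_Q .
$$

Each context pair contributes a rank-one term to $\Delta W$, playing the role of one implicit gradient step on the frozen weights $W_0$; if attention over the context is diffuse and noisy, these terms fail to accumulate into a coherent update direction, which is exactly the failure mode suggested by the attention maps above.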
To address this, we propose a training-free attention adjustment mechanism (detailed in the paper) that redirects focus toward semantically relevant regions. Both qualitative and quantitative experiments demonstrate consistent improvements, establishing a Strong Baseline for future research.
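As a rough illustration only (the actual mechanism, and where in the network it is applied, are detailed in the paper), one training-free way to redirect attention is to add a bias to the attention logits of context positions deemed task-relevant before the softmax. The mask construction and the `boost` factor below are assumptions, not the paper's settings.

```python
import math
import torch
import torch.nn.functional as F

def reweighted_attention(q, k, v, relevant_mask, boost=1.5):
    """Scaled dot-product attention with a training-free boost on selected context keys.

    q, k, v: [batch, heads, q_len/k_len, dim] tensors taken from an existing attention layer.
    relevant_mask: bool tensor of shape [k_len] marking context positions judged
        semantically relevant to the task (how this mask is obtained is an assumption here).
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])   # [batch, heads, q_len, k_len]
    # Adding log(boost) on relevant positions multiplies their unnormalized attention weight by `boost`.
    scores = scores + relevant_mask.float() * math.log(boost)
    return F.softmax(scores, dim=-1) @ v
```

Operating on the logits keeps the adjustment training-free and leaves the softmax normalization intact.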
As we enter the "Second Half" of UMM research, we offer several critical questions and insights that may define the path forward:
The first half of the UMM era was a carnival of memorization and fitting. The second half will be an arduous journey toward reasoning, adaptation, and real-world application. To discuss fluid intelligence is to discuss the evolution of UMMs from skilled "memorizers" into genuine "thinkers." We must aim to redefine the intellectual boundaries of UMMs, moving beyond the simple superposition of understanding and generation tasks.
[1] Chameleon: Mixed-modal early-fusion foundation models
[2] Show-o: One single transformer to unify multimodal understanding and generation
[3] Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
[4] Emu: Generative Pretraining in Multimodality
[5] Emerging Properties in Unified Multimodal Pretraining
[6] Theory of fluid and crystallized intelligence: A critical experiment
[7] On the nature of crystallized intelligence: The relationship between verbal ability and factual knowledge
[8] Fluid intelligence: A brief history
[9] CL-bench: A Benchmark for Context Learning
[10] ARC Prize 2024: Technical Report
[11] Learning without training: The implicit dynamics of in-context learning
[12] UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens