The Second Half of Unified Models

Over the past two years, we have witnessed an explosion in Unified Multimodal Models (UMMs). From Chameleon^[1] and Show-o^[2] to Transfusion^[3] and Emu^[4], the field flourished with diverse architectures. With the emergence of Bagel^[5], architectural exploration gradually converged. In this phase, thanks to the integration of understanding, we have observed remarkable gains on the generation side: UMMs exhibit enhanced reasoning capabilities and reflect world knowledge in their generated outputs. "VLM-as-Encoder" is becoming a new generative paradigm. This flourishing era constitutes the "First Half" of UMM development, in which the community has primarily focused on architecture. On the generation side, attention has been largely directed toward capabilities such as static world knowledge, safety, and instruction adherence.

However, as the marginal returns of architectural exploration diminish, we confront a pivotal inflection point. With the construction of UMMs demystified, the research focus inevitably pivots toward their effective utilization and the identification of their intrinsic strengths, signaling our readiness to enter the "Second Half." In contrast to the First Half's emphasis on implementation and benchmark maximization, this new phase calls for insightful evaluation frameworks that prioritize capabilities unique to UMMs and their adaptability to complex, real-world demands. Despite the proliferation of models and soaring metrics witnessed in 2024–2025, a qualitative paradigm shift in practical creation and interaction remains elusive. (The rest of the Second Half is elaborated in the sections that follow.)

This rapid advancement invites a natural question: How far are current UMMs from achieving true general intelligence regarding visual generation?

Drawing from the literature^[6], general intelligence can be decoupled into Crystallized Intelligence (CI)^[7] and Fluid Intelligence (FI)^[8]. CI relies on recalling accumulated knowledge and learned schemas, while FI emphasizes the capacity to reason and solve problems in novel situations. The former has been the core focus of the "First Half" of UMM development: by fitting massive datasets, models have acquired astonishing CI. For instance, a model's ability to generate a flawless cat often stems from exposure to billions of instances during training, followed by probabilistic reproduction during inference. However, real-world demands are diverse and often require models to adapt to contexts on the fly, which poses a significant challenge to their FI. Coincidentally, our work aligns with the initial research from Shunyu Yao's team on "true" in-context learning^[9], which is also the foundation of FI. Our work can be considered a generative extension of theirs.

Existing benchmarks, such as the classic ARC-Bench^[10], are predominantly grounded in understanding: fluid intelligence is typically discussed in comprehension contexts. However, visual generation is approaching a similar inflection point. The historical fixation on pixel-level fidelity may represent a bias; in the long term, understanding and generation should arguably not be treated as separate tasks.

Guided by the definition of FI, we do not assess whether a model can render a "more realistic dog"—an indicator of CI. Instead, drawing inspiration from tasks humans perform effortlessly, we translate these capabilities into generative challenges:

Implicit Pattern Induction: Humans are adept at distilling implicit information from context, such as distinguishing preferences in text or stylistic textures in images.
Ad-hoc Constraint Execution: Humans adeptly reason under provisional, out-of-distribution constraints. This includes defining semantic visual operators or interpreting abstract symbols with context-specific meanings (e.g., performing arithmetic operations where objects symbolically represent numerical values).
Contextual Knowledge Adaptation: This entails overcoming established priors to internalize counter-intuitive rules. For instance, given a counterfactual premise where "gravity is dictated by color", humans can effortlessly modulate their reasoning to imagine and generate content within this novel framework.
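The symbolic-arithmetic example under Ad-hoc Constraint Execution can be made concrete with a small sketch. The object-to-number bindings, the expression format, and the function name are illustrative assumptions, not items from the benchmark. The point is only that the correct answer is fixed by the context rather than by any pre-trained prior.

```python
# Hypothetical example: the context assigns ad-hoc numeric meanings to objects.
# These bindings are invented for illustration and are not benchmark items.

def evaluate_symbolic_expression(expression, bindings):
    """Evaluate a space-separated '+'/'-' expression whose operands are object names."""
    tokens = expression.split()
    total = bindings[tokens[0]]
    # Walk the remaining tokens as (operator, operand) pairs.
    for op, name in zip(tokens[1::2], tokens[2::2]):
        if op == "+":
            total += bindings[name]
        elif op == "-":
            total -= bindings[name]
        else:
            raise ValueError(f"unsupported operator: {op}")
    return total


context_bindings = {"apple": 3, "banana": 2, "cherry": 4}
target_count = evaluate_symbolic_expression("apple + cherry - banana", context_bindings)
print(target_count)  # 3 + 4 - 2 = 5
```

A generated image would then be judged on whether it depicts `target_count` objects. Rebinding "apple" to another value in the context changes the correct output, which is what makes the constraint provisional.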

We evaluated twelve models on our benchmark, GENIUS, and surprisingly found that even the state-of-the-art Nano Banana Pro failed to achieve a passing score. (Please refer to the paper for specific metrics and prompts.) Our evaluation focuses on three primary dimensions: logical correctness, preservation of reference information, and aesthetic quality. To ensure fairness and reproducibility, we developed manually annotated rubrics for each test case and used both open-source and proprietary models for evaluation.
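As a rough illustration of how per-case rubrics could be turned into per-dimension scores, here is a minimal sketch. Everything in it is an assumption made for the example: the `RubricItem` structure, the 0 to 2 point scale, the dimension labels, and the normalization to 0-100. The actual metrics and prompts are in the paper.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    dimension: str      # assumed labels: "logic", "reference", "aesthetic"
    description: str    # a manually written check for one test case
    max_points: int = 2  # assumed point scale, not taken from the paper


def score_case(rubric, judge_points):
    """Return a 0-100 score per dimension from per-item judge points."""
    earned, possible = {}, {}
    for item, points in zip(rubric, judge_points):
        if not 0 <= points <= item.max_points:
            raise ValueError("judge points out of range")
        earned[item.dimension] = earned.get(item.dimension, 0) + points
        possible[item.dimension] = possible.get(item.dimension, 0) + item.max_points
    return {dim: 100.0 * earned[dim] / possible[dim] for dim in possible}


# Hypothetical rubric for a single test case.
rubric = [
    RubricItem("logic", "object count matches the rule implied by the context"),
    RubricItem("logic", "the counter-intuitive rule is applied instead of the prior"),
    RubricItem("reference", "subject identity from the reference image is kept"),
    RubricItem("aesthetic", "image is free of obvious artifacts"),
]
scores = score_case(rubric, judge_points=[1, 0, 2, 2])
print(scores)
```

Keeping the rubric items explicit per test case is what allows different judge models to be swapped in while the criteria stay fixed.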

Method	Interleaved	Overall	Implicit Pattern Induction			Ad-hoc Constraint Execution						Contextual Knowledge Adaptation
			Implicit Pattern			Symbolic Constraint			Visual Constraint			Prior-Conflicting			Multi-Semantic
			RC	VC	AQ	RC	VC	AQ	RC	VC	AQ	RC	VC	AQ	RC	VC	AQ
Proprietary Models
Nano Banana Pro	✅	57.19	66.86	44.59	96.51	71.38	50.00	92.11	76.67	66.67	96.67	52.97	41.38	90.59	35.45	-	95.00
Nano Banana	✅	50.66	56.47	39.04	94.12	60.46	51.91	90.20	68.33	79.17	93.33	35.50	39.47	91.00	30.28	-	93.12
GPT-Image	❌	47.15	58.14	41.92	93.60	58.82	32.82	93.79	49.17	62.50	92.50	43.50	33.33	90.00	28.64	-	85.45
SeeDream 4.0	❌	21.26	12.05	0.70	96.39	21.57	3.44	84.64	40.00	4.17	76.67	30.69	10.34	82.67	30.73	-	80.00
SeeDream 4.5	❌	52.84	70.00	59.59	97.06	62.91	41.09	94.37	58.33	62.50	86.67	40.10	41.38	92.57	35.00	-	86.82
Open-Source Models
Qwen-Image	❌	30.58	36.18	27.69	71.05	36.18	27.69	71.05	26.67	45.83	55.83	27.72	20.69	71.78	25.91	-	69.55
GLM-Image	❌	24.71	32.94	19.86	93.53	22.37	21.15	87.50	27.50	12.50	70.83	20.30	15.52	71.29	17.73	-	70.91
FLUX.2-dev	❌	34.39	34.30	27.70	88.95	35.76	31.01	87.09	39.17	50.00	59.17	25.25	30.17	84.16	29.82	-	79.82
NextStep-1	❌	10.44	10.74	0.40	25.12	11.33	2.54	21.67	21.50	4.20	29.17	15.49	7.55	28.71	12.80	-	20.28
Emu3.5-Image	❌	36.67	41.86	35.81	83.72	34.97	39.31	86.93	24.17	29.17	42.50	26.24	37.93	82.18	32.87	-	75.46
Omini-Gen2	❌	27.87	29.07	26.35	76.16	25.33	30.38	77.96	11.67	41.67	52.50	23.76	34.48	69.80	19.27	-	63.76
Bagel	✅	26.74	26.74	27.03	84.30	29.61	16.03	76.32	22.50	12.50	49.17	17.24	22.28	74.75	33.49	-	53.67
Ours	✅	32.92	39.54	44.92	66.71	36.54	26.73	67.11	30.45	35.11	47.84	23.67	36.75	57.78	34.22	-	52.75

Beyond the quantitative results, our diagnostic analysis reveals several critical observations regarding the current limitations of UMMs:

Generative Fluid Intelligence (GFI) remains a significant bottleneck for current models. Even state-of-the-art models like Nano Banana Pro fail to achieve a passing grade on the GENIUS benchmark, highlighting a critical gap in reasoning capabilities. Crucially, GENIUS demonstrates strong discriminative power, revealing distinct stratification across model capabilities. Overall scores range from 57.19 for the strongest proprietary model (Nano Banana Pro) down to 26.74 for the open-source Bagel. This spread highlights the benchmark's sensitivity and supports its use as an instrument for tracking the progress of UMMs.
Current models fail to effectively arbitrate the conflict between pre-trained priors and the given context. Performance significantly degrades in Contextual Knowledge Adaptation tasks, where the context explicitly contradicts pre-trained priors. This reveals a rigidity in current AI: models lack the cognitive flexibility to override the pre-trained priors and adapt to counter-intuitive rules, unlike human intelligence which demonstrates far more robust plasticity in updating beliefs.
Aesthetic fidelity masks deep logical deficiencies. Aesthetic scores consistently outstrip reasoning metrics, reflecting a historical bias towards pixel-level fidelity (e.g., FID, IAA) over logical correctness. Consequently, while models generate visually stunning images, they frequently "lose the plot" when prompts require complex reasoning, producing high-quality but semantically misaligned outputs.
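The gap between aesthetic and reasoning scores can be checked directly against the results table. The sketch below copies the RC and AQ columns for three models and compares their unweighted means. Two assumptions are mine: AQ is read as the aesthetic-quality column and RC as a correctness-oriented column, and the plain mean is a convenience here, not the benchmark's Overall formula.

```python
# Values copied from the results table; each tuple lists the five task columns
# in table order: Implicit Pattern, Symbolic Constraint, Visual Constraint,
# Prior-Conflicting, Multi-Semantic.
rows = {
    "Nano Banana Pro": {"RC": (66.86, 71.38, 76.67, 52.97, 35.45),
                        "AQ": (96.51, 92.11, 96.67, 90.59, 95.00)},
    "GPT-Image":       {"RC": (58.14, 58.82, 49.17, 43.50, 28.64),
                        "AQ": (93.60, 93.79, 92.50, 90.00, 85.45)},
    "Bagel":           {"RC": (26.74, 29.61, 22.50, 17.24, 33.49),
                        "AQ": (84.30, 76.32, 49.17, 74.75, 53.67)},
}


def mean(values):
    return sum(values) / len(values)


# Unweighted mean per column, then the aesthetic-minus-correctness gap.
gaps = {}
for model, cols in rows.items():
    rc, aq = mean(cols["RC"]), mean(cols["AQ"])
    gaps[model] = aq - rc
    print(f"{model:16s} mean RC = {rc:5.2f}  mean AQ = {aq:5.2f}  gap = {aq - rc:5.2f}")
```

For all three models the mean AQ exceeds the mean RC by a wide margin, which is the pattern the third observation describes.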

Pre-Planning and Post-Reflection yield marginal gains. As illustrated in (a), we explored inference-time strategies such as "Pre-Planning" and "Post-Reflection" (feeding evaluated generations back as context). However, these yielded only negligible improvements, suggesting that current generic reasoning paradigms are insufficient to address the specific complexities of this task.
Context comprehension is the key to solving GFI problems. Injecting human-curated rubrics/hints into the context resulted in substantial performance surges. Since these rubrics distill human understanding, this implies that if models could autonomously achieve similar comprehension, the problem would be largely solvable. However, results remain bounded by intrinsic model capabilities; for instance, Bagel's performance degradation even with multimodal rubrics highlights existing weaknesses in processing complex, interleaved multimodal inputs.
Generative failure primarily stems from an execution gap rather than comprehension deficits. To test comprehension, we reformulated the task into a VQA format (b), using the rubric as the Ground Truth. Surprisingly, models achieved high accuracy, indicating a solid grasp of the context. We attribute the persistent generation failure—a "know-but-cannot-draw" phenomenon—to two factors: the difficulty of fully articulating fine-grained visual nuances within dense interleaved contexts, and structural inefficiencies in current UMMs where rich semantic understanding from the encoder fails to effectively propagate to the generative decoder.
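The three context-side interventions discussed above (Pre-Planning, Post-Reflection, and rubric/hint injection) differ only in what is appended to the interleaved context before generation. The following sketch shows that structure with stub callables. The function name, the bracketed tags, and the stubs are hypothetical and stand in for real model calls; this is not the paper's pipeline.

```python
def run_with_strategy(context, generate, evaluate=None, plan=None, hint=None):
    """Assemble the context for one generation under optional strategies.

    context  : list of interleaved items (strings stand in for text and images)
    generate : callable(list) -> output, a stub for the UMM
    plan     : optional callable(list) -> str, used for pre-planning
    evaluate : optional callable(output) -> str, feedback for post-reflection
    hint     : optional str, a human-curated rubric or hint
    """
    working = list(context)
    if hint is not None:
        working.append(f"[hint] {hint}")
    if plan is not None:
        working.append(f"[plan] {plan(working)}")
    output = generate(working)
    if evaluate is not None:
        # Post-reflection: feed the evaluated generation back as context.
        working += [f"[previous] {output}", f"[feedback] {evaluate(output)}"]
        output = generate(working)
    return output, working


# Stub model calls, for illustration only.
def stub_generate(ctx):
    return f"image<{len(ctx)} context items>"


def stub_plan(ctx):
    return "identify the rule, then apply it to the query"


def stub_evaluate(output):
    return "the rule was not applied"


output, final_context = run_with_strategy(
    ["<reference image>", "apply the shown rule to the new object"],
    stub_generate, evaluate=stub_evaluate, plan=stub_plan, hint="the rule is a color swap",
)
print(output)
```

Framed this way, the reported results compare which appended item helps: plans and feedback produced by the model itself, or a hint that already encodes human understanding of the context.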

Building upon these findings, we conducted a deeper investigation:

We visualized the attention mechanism by employing the generated image tokens as Queries against the interleaved multimodal context (serving as Keys). Observations reveal that Bagel's attention distribution is erratic, characterized by irregular noise and stochastic spikes. This raises a critical question: To what extent does attention dictate context comprehension, and can a simple modulation of these weights enhance model performance?
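One simple way to quantify how "erratic" or diffuse such an attention row is would be its normalized entropy. The sketch below computes scaled dot-product attention for one generated-token query against a handful of context keys and reports that entropy. The toy vectors and the choice of entropy as a diagnostic are assumptions for illustration, not the analysis used in the paper.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention of one query over a list of key vectors."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(logits)


def normalized_entropy(weights):
    """Entropy divided by log(n): near 0 is sharply focused, 1 is uniform."""
    h = -sum(w * math.log(w) for w in weights if w > 0)
    return h / math.log(len(weights))


query = [1.0, 0.0]  # toy stand-in for a generated image token
focused_keys = [[8.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
diffuse_keys = [[0.1, 0.0], [0.0, 1.0], [0.2, 0.0], [0.0, 0.0]]

focused = attention_weights(query, focused_keys)
diffuse = attention_weights(query, diffuse_keys)
focused_entropy = normalized_entropy(focused)
diffuse_entropy = normalized_entropy(diffuse)
print(f"focused: {focused_entropy:.3f}  diffuse: {diffuse_entropy:.3f}")
```

Entropy only captures how spread out the weights are. It does not say whether the mass lands on the task-relevant tokens, so it complements visual inspection of the maps and does not replace it.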

Drawing on the theoretical framework of prior work^[11], which interprets In-Context Learning (ICL) as implicit gradient descent, we extended this analysis to the Bagel architecture (full derivation provided in the paper). Theoretically, high-quality context acts as a signal for a more precise and definitive "gradient descent" trajectory. However, Bagel's diffuse attention maps suggest a failure to focus on critical task-relevant features. Consequently, the implicit gradient update lacks a coherent descent direction, trapping the model within its pre-training priors.
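The core algebra behind the implicit-update view can be illustrated with a single linear layer. This is a simplified reading of mine and not the Bagel derivation from the paper. If context shifts the layer's input from a to a + Δa, the output W(a + Δa) equals (W + ΔW)a for the rank-one update ΔW = (WΔa)aᵀ / ‖a‖². The toy matrix and vectors below are arbitrary.

```python
def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rank_one_update(W, a, delta_a):
    """Return dW = (W @ delta_a) a^T / ||a||^2 as a list of rows."""
    w_delta = matvec(W, delta_a)
    norm_sq = sum(x * x for x in a)
    return [[w_delta[i] * a[j] / norm_sq for j in range(len(a))]
            for i in range(len(W))]


# Arbitrary toy values: W is a 2x3 layer, a is the context-free input,
# delta_a is the shift that attending to the context adds to that input.
W = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
a = [1.0, 2.0, -1.0]
delta_a = [0.3, -0.2, 0.1]

# Output when the context is actually present in the input.
with_context = matvec(W, [x + d for x, d in zip(a, delta_a)])

# Output when the context is absorbed into the weights instead.
dW = rank_one_update(W, a, delta_a)
W_updated = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, dW)]
implicit = matvec(W_updated, a)

print(with_context, implicit)
```

Under this reading, Δa is whatever attention extracts from the context. If the attention is diffuse, Δa mixes in many task-irrelevant tokens, and the resulting update has no consistent direction, which matches the intuition stated above.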

To address this, we propose a training-free attention adjustment mechanism (detailed in the paper) that redirects focus toward semantically relevant regions. Both qualitative and quantitative experiments demonstrate consistent improvements, establishing a Strong Baseline for future research.
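The mechanism itself is detailed in the paper and is not reproduced here. As a generic illustration of what a training-free attention adjustment can look like, the sketch below adds a fixed bias to the attention logits of context positions marked as relevant and then renormalizes. The `relevant_mask`, the `boost` value, and the function name are hypothetical, and how relevance would be determined is out of scope.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def adjust_attention(logits, relevant_mask, boost=2.0):
    """Add `boost` to the logits of relevant positions, then renormalize.

    No weights are changed: only the attention logits at inference time.
    """
    adjusted = [l + boost if is_relevant else l
                for l, is_relevant in zip(logits, relevant_mask)]
    return softmax(adjusted)


# Toy logits over four context positions; position 1 is assumed to be relevant.
logits = [0.2, 0.1, 0.3, 0.0]
relevant_mask = [False, True, False, False]

before = softmax(logits)
after = adjust_attention(logits, relevant_mask)
print(f"mass on relevant position: {before[1]:.3f} -> {after[1]:.3f}")
```

Adding a bias in logit space keeps the output a valid distribution and leaves the relative ordering among non-boosted positions unchanged, which is one reason this family of adjustments needs no retraining.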

Returning to the second half of UMM research, I want to offer several critical questions and insights that may define the path forward:

What is the ideal task? What capabilities should an ideal UMM possess that distinguish it from standard generative models? GENIUS has made a good start, and hopefully it will become a milestone on this side road.
The agentic challenge. Researchers exploring "thinking with generated images" using UMMs must critically examine a fundamental question: Given the robust generation capabilities of models like Nano 2, why must the images within a reasoning trajectory originate from the UMM itself? I term this the "Existential Crisis" of UMMs.
Mutual enhancement between understanding and generation. Achieving mutual reinforcement between understanding and generation represents the ultimate aspiration for UMM researchers. While it is widely accepted that understanding enhances generation, whether generation conversely aids understanding remains an open question. From first principles, however, generation should theoretically facilitate understanding. As discussed in our NeurIPS 2025 paper^[12], this aligns with human cognition: establishing a preliminary concept aids drawing, and the act of drawing, in turn, deepens conceptual understanding. This challenge mirrors the previous point: the field currently lacks appropriate scenarios and tasks to demonstrate this effect.
Architectural convergence may be illusory. Despite exhaustive permutations of UMM components that seem to cover every possibility, our evaluation reveals that few models can genuinely process multi-modal interleaved inputs. UMM development may be misguided by a path dependency on VLM architectures. As with LLaVA, does established practice imply correctness? Consequently, 2025 has witnessed a growing shift toward native multimodal research.
Visual presentation of understanding. The prevailing "First Half" adapts generation to understanding, constraining generation as a byproduct—exemplified by using LLMs for denoising/AR or utilizing diffusion as a post-head. Conversely, adapting understanding to generation may be more rational. This aligns with human perceptual logic, as seen in DeepSeek-OCR's exploration of using visual presentation for understanding.
The object of unification. Text, with its sequential and causal nature, shares more with video than with static images. Unifying static image generation with language understanding might be a suboptimal local minimum; a shift toward video generation/understanding could resolve many current architectural dissonances.
Latent vs. Pixel-Level. For internal reasoning (excluding human interaction), generating at the latent level might suffice to aid understanding, rendering pixel-level reconstruction redundant and noisy. The critical question for the "second half" is not whether modalities can share a latent space, but specifically what kind of shared latent space best supports reasoning.
The evolution of generative models. I boldly predict the end of standalone diffusion models. The future likely belongs to integrated systems combining the Model Context Protocol (MCP) and agents.
The story is to be continued...

The first half of the UMM era was a carnival of memorization and fitting. The second half will be an arduous journey toward reasoning, adaptation, and real-world application. To discuss fluid intelligence is to discuss the evolution of UMMs from skilled "memorizers" into genuine "thinkers." We must aim to redefine the intellectual boundaries of UMMs, moving beyond the simple superposition of understanding and generation tasks.

References

[1] Chameleon: Mixed-modal early-fusion foundation models

[2] Show-o: One single transformer to unify multimodal understanding and generation

[3] Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

[4] Emu: Generative Pretraining in Multimodality

[5] Emerging Properties in Unified Multimodal Pretraining

[6] Theory of fluid and crystallized intelligence: A critical experiment

[7] On the nature of crystallized intelligence: The relationship between verbal ability and factual knowledge

[8] Fluid intelligence: A brief history

[9] CL-bench: A Benchmark for Context Learning

[10] ARC Prize 2024: Technical Report

[11] Learning without training: The implicit dynamics of in-context learning

[12] UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens