✨ UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens ✨

Peking University, Xi'an Jiaotong University, CUHK, Intel Labs, Nanjing University, University of Wisconsin-Madison, Tsinghua University
*Equal Contribution     Project Leader     Corresponding Author
The icon of UniCTokens

📍 Abstract

Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This limits the generation of images from complex prompts: for example, given the concept ⟨bo⟩, generating "⟨bo⟩ wearing its hat" without any additional textual description of the hat. We call this kind of generation personalized knowledge-driven generation. To address this limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision-language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens that leverage complementary semantics to boost both personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation, which enhances the mutual benefits between the two tasks. To quantitatively evaluate unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and knowledge-driven generation. Experimental results on UnifyBench show that UniCTokens achieves performance competitive with leading methods in concept understanding and concept generation, and state-of-the-art results in personalized knowledge-driven generation. Our research demonstrates that enhanced understanding improves generation, and that the generation process can in turn yield valuable insights for understanding.

The capability overview of UniCTokens

The capability overview of UniCTokens.

📰 Introduction

UniCTokens is a framework that effectively integrates personalized information into a unified vision-language model (VLM) for both understanding and generation tasks. Existing methods typically treat the two tasks separately, which limits the model's ability to generate images from complex prompts: for example, given the concept ⟨bo⟩, generating "⟨bo⟩ wearing its hat" without any additional description of what the hat looks like. We call this personalized knowledge-driven generation.

UniCTokens addresses this limitation through a three-stage progressive training strategy (see the sketch after this list):

  • Understanding warm-up.
  • Bootstrapping generation from understanding.
  • Deepening understanding from generation (Generation as Perception).
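
A minimal sketch of how such a schedule could be wired up in PyTorch is shown below. All names here (train_stage, und_loss, gen_loss, gen_head, the data iterators) are illustrative assumptions, not the repository's actual API.

```python
import torch

def train_stage(params, batches, loss_fn, steps, lr=1e-4):
    """Generic inner loop: optimize only `params` against `loss_fn`."""
    opt = torch.optim.AdamW(params, lr=lr)
    for _, batch in zip(range(steps), batches):
        loss = loss_fn(batch)
        opt.zero_grad()
        loss.backward()
        opt.step()

def progressive_training(model, concept_tokens, data, steps=500):
    # Stage 1: understanding warm-up -- fit the unified concept tokens
    # on understanding objectives (text-only QA, VQA) alone.
    train_stage([concept_tokens], data.understanding, model.und_loss, steps)

    # Stage 2: bootstrap generation from understanding -- the warmed-up
    # tokens now condition the generation branch as well.
    train_stage([concept_tokens, *model.gen_head.parameters()],
                data.generation, model.gen_loss, steps)

    # Stage 3: deepen understanding from generation ("generation as
    # perception") -- generated images are fed back as extra training
    # signal for the understanding objectives.
    train_stage([concept_tokens], data.gen_as_perception, model.und_loss, steps)
```

The point of the sketch is the design choice rather than the details: the same concept tokens are optimized in every stage, so semantics learned for understanding directly condition generation, and vice versa.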

Our research demonstrates that enhanced understanding improves generation, and the generation process can yield valuable insights into understanding.

The overview of UniCTokens

The overview of UniCTokens.

🔥 Key Features

🔄 Unified Concept Tokens

Unifying personalized understanding and generation tasks in a single model.

🧠 Personalized Knowledge-Driven Generation

Leveraging external personalized knowledge for complex image generation.

📈 Mutual Enhancement

Three-stage strategy promoting mutual enhancement of understanding and generation, achieving cross-task information transfer.

📊 UnifyBench

The first benchmark for assessing personalized understanding, generation, and personalized knowledge-driven generation all in one.

UnifyBench Benchmark Tasks

Multi-Modal Understanding (MMU)

| Sub-task | Source files | Evaluation focus |
| --- | --- | --- |
| Text-Only QA | `test/<concept>/text_only.json` | Checks whether the model remembers concept knowledge (no image) |
| VQA | `test/<concept>/vqa.json` + image | Visual question answering about the concept image |
| Rec | `test/*.png` | Pure visual recognition capability |
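
For orientation, a hypothetical loader for this split might look as follows; the directory layout follows the table above, but the concept name in the example path and the JSON field names are assumptions about the file schema.

```python
import json
from pathlib import Path

def load_mmu(concept_dir):
    """Load the MMU test files for one concept directory."""
    concept_dir = Path(concept_dir)
    text_only = json.loads((concept_dir / "text_only.json").read_text())
    vqa = json.loads((concept_dir / "vqa.json").read_text())  # each item also references an image
    rec_images = sorted(concept_dir.parent.glob("*.png"))     # Rec uses test/*.png
    return text_only, vqa, rec_images

# Hypothetical path; substitute the actual benchmark root and concept name.
text_qa, vqa, rec = load_mmu("UnifyBench/test/bo")
print(len(text_qa), len(vqa), len(rec))
```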

Text-to-Image Generation (T2I)

| Mode | Input | Metrics |
| --- | --- | --- |
| Vanilla generation | Prompts from the DreamBooth dataset → target-concept images | CLIP-I / CLIP-T · ArcFace similarity |
| Personalized knowledge-driven | `t2i_conditions.json` | Combined T2I-Score: must satisfy both visual and textual attributes |
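
CLIP-I (image-image) and CLIP-T (image-text) are standard similarity scores; the sketch below computes them with the Hugging Face transformers CLIP implementation. This is a generic reimplementation for illustration, not necessarily the exact evaluation code behind UnifyBench.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(generated_path, reference_path, prompt):
    images = [Image.open(generated_path), Image.open(reference_path)]
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(text=[prompt], return_tensors="pt", padding=True))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    clip_i = (img_emb[0] @ img_emb[1]).item()  # generated vs. reference image
    clip_t = (img_emb[0] @ txt_emb[0]).item()  # generated image vs. prompt
    return clip_i, clip_t
```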

📦 UniCTokens Dataset

Data Overview

| Item | Description |
| --- | --- |
| Total concepts | 20 (Human × 10 · Animal × 5 · Object × 5) |
| Images per concept | ≈ 10–15 (already split into train / test) |
| Negative samples | `random_images/` (100 random irrelevant images) + `negative_example/` (hard negatives) |
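
Assuming the folder layout implied by the table (a per-concept split under train/ and test/, plus the two negative-sample folders named above), a quick sanity-check walk over the dataset could look like this; every path except random_images/ and negative_example/ is an assumption.

```python
from pathlib import Path

root = Path("UniCTokens_Dataset")  # hypothetical dataset root

for concept_dir in sorted((root / "train").iterdir()):
    images = [p for p in concept_dir.iterdir() if p.suffix in {".png", ".jpg"}]
    print(f"{concept_dir.name}: {len(images)} training images")

print("random negatives:", len(list((root / "random_images").iterdir())))
print("hard negatives:", len(list((root / "negative_example").iterdir())))
```
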
UnifyBench Dataset

UnifyBench Dataset.

📝 BibTeX

        @article{an2025unictokens,
          title={UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens},
          author={An, Ruichuan and Yang, Sihan and Zhang, Renrui and Shen, Zijun and Lu, Ming and Dai, Gaole and Liang, Hao and Guo, Ziyu and Yan, Shilin and Luo, Yulin and others},
          journal={arXiv preprint arXiv:2505.14671},
          year={2025}
        }