✨ UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens ✨

Peking University, Xi'an Jiaotong University, CUHK, Intel Labs, Nanjing University, University of Wisconsin-Madison, Tsinghua University
*Equal Contribution     Project Leader     Corresponding Author
The icon of UniCTokens

📍 Abstract

Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This limits the generation of images from complex prompts: for example, given the concept ⟨bo⟩, generating "⟨bo⟩ wearing its hat" without any additional textual description of the hat. We call this kind of generation personalized knowledge-driven generation. To address this limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision-language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens that leverage complementary semantics to boost both personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation, which enhances the mutual benefits between the two tasks. To quantitatively evaluate unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and knowledge-driven generation. Experimental results on UnifyBench show that UniCTokens achieves performance competitive with leading methods in concept understanding and concept generation, and state-of-the-art results in personalized knowledge-driven generation. Our research demonstrates that enhanced understanding improves generation, and that the generation process can in turn yield valuable insights for understanding.

The capability overview of UniCTokens

The capability overview of UniCTokens.

📰 Introduction

UniCTokens is a framework that effectively integrates personalized information into a unified vision-language model (VLM) for both understanding and generation tasks. Existing methods typically treat the two tasks separately, which limits the model's ability to generate images from complex prompts: for example, given the concept ⟨bo⟩, generating "⟨bo⟩ wearing its hat" without any additional description of what the hat looks like. We call this personalized knowledge-driven generation.

UniCTokens addresses this limitation through a three-stage progressive training strategy (see the sketch after this list):

  • Understanding warm-up.
  • Bootstrapping generation from understanding.
  • Deepening understanding from generation (Generation as Perception).
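
A minimal sketch of how such a schedule could be wired up in PyTorch is shown below. All names here (train_stage, und_loss, gen_loss, gen_head, the data iterators) are illustrative assumptions, not the repository's actual API.

```python
import torch

def train_stage(params, batches, loss_fn, steps, lr=1e-4):
    """Generic inner loop: optimize only `params` against `loss_fn`."""
    opt = torch.optim.AdamW(params, lr=lr)
    for _, batch in zip(range(steps), batches):
        loss = loss_fn(batch)
        opt.zero_grad()
        loss.backward()
        opt.step()

def progressive_training(model, concept_tokens, data, steps=500):
    # Stage 1: understanding warm-up -- fit the unified concept tokens
    # on understanding objectives (text-only QA, VQA) alone.
    train_stage([concept_tokens], data.understanding, model.und_loss, steps)

    # Stage 2: bootstrap generation from understanding -- the warmed-up
    # tokens now condition the generation branch as well.
    train_stage([concept_tokens, *model.gen_head.parameters()],
                data.generation, model.gen_loss, steps)

    # Stage 3: deepen understanding from generation ("generation as
    # perception") -- generated images are fed back as extra training
    # signal for the understanding objectives.
    train_stage([concept_tokens], data.gen_as_perception, model.und_loss, steps)
```

The point of the sketch is the design choice rather than the details: the same concept tokens are optimized in every stage, so semantics learned for understanding directly condition generation, and vice versa.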

Our research demonstrates that enhanced understanding improves generation, and the generation process can yield valuable insights into understanding.

The overview of UniCTokens

The overview of UniCTokens.

🔥 Key Features

🔄 Unified Concept Tokens

Unifying personalized understanding and generation tasks in a single model.

🧠 Personalized Knowledge-Driven Generation

Leveraging external personalized knowledge for complex image generation.

📈 Mutual Enhancement

Three-stage strategy promoting mutual enhancement of understanding and generation, achieving cross-task information transfer.

📊 UnifyBench

The first benchmark for assessing personalized understanding, generation, and personalized knowledge-driven generation all in one.

UnifyBench Benchmark Tasks

Multi-Modal Understanding (MMU)

| Sub-task | Source files | Evaluation focus |
| --- | --- | --- |
| Text-Only QA | `test/<concept>/text_only.json` | Checks whether the model remembers concept knowledge (no image) |
| VQA | `test/<concept>/vqa.json` + image | Visual question answering about the concept image |
| Rec | `test/*.png` | Pure visual recognition capability |
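
For orientation, a hypothetical loader for this split might look as follows; the directory layout follows the table above, but the concept name in the example path and the JSON field names are assumptions about the file schema.

```python
import json
from pathlib import Path

def load_mmu(concept_dir):
    """Load the MMU test files for one concept directory."""
    concept_dir = Path(concept_dir)
    text_only = json.loads((concept_dir / "text_only.json").read_text())
    vqa = json.loads((concept_dir / "vqa.json").read_text())  # each item also references an image
    rec_images = sorted(concept_dir.parent.glob("*.png"))     # Rec uses test/*.png
    return text_only, vqa, rec_images

# Hypothetical path; substitute the actual benchmark root and concept name.
text_qa, vqa, rec = load_mmu("UnifyBench/test/bo")
print(len(text_qa), len(vqa), len(rec))
```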

Text-to-Image Generation (T2I)

| Mode | Input | Metrics |
| --- | --- | --- |
| Vanilla generation | Prompts from the DreamBooth dataset → target-concept images | CLIP-I / CLIP-T · ArcFace similarity |
| Personalized knowledge-driven | `t2i_conditions.json` | Combined T2I-Score: must satisfy both visual and textual attributes |
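
CLIP-I (image-image) and CLIP-T (image-text) are standard similarity scores; the sketch below computes them with the Hugging Face transformers CLIP implementation. This is a generic reimplementation for illustration, not necessarily the exact evaluation code behind UnifyBench.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(generated_path, reference_path, prompt):
    images = [Image.open(generated_path), Image.open(reference_path)]
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(text=[prompt], return_tensors="pt", padding=True))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    clip_i = (img_emb[0] @ img_emb[1]).item()  # generated vs. reference image
    clip_t = (img_emb[0] @ txt_emb[0]).item()  # generated image vs. prompt
    return clip_i, clip_t
```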

📦 UniCTokens Dataset

Data Overview

| Item | Description |
| --- | --- |
| Total concepts | 20 (Human × 10 · Animal × 5 · Object × 5) |
| Images per concept | ≈ 10–15 (already split into train / test) |
| Negative samples | `random_images/` (100 random irrelevant images) + `negative_example/` (hard negatives) |
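
Assuming the folder layout implied by the table (a per-concept split under train/ and test/, plus the two negative-sample folders named above), a quick sanity-check walk over the dataset could look like this; every path except random_images/ and negative_example/ is an assumption.

```python
from pathlib import Path

root = Path("UniCTokens_Dataset")  # hypothetical dataset root

for concept_dir in sorted((root / "train").iterdir()):
    images = [p for p in concept_dir.iterdir() if p.suffix in {".png", ".jpg"}]
    print(f"{concept_dir.name}: {len(images)} training images")

print("random negatives:", len(list((root / "random_images").iterdir())))
print("hard negatives:", len(list((root / "negative_example").iterdir())))
```
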
UnifyBench Dataset

UnifyBench Dataset.

📝 BibTeX

        @article{an2025unictokens,
          title={UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens},
          author={An, Ruichuan and Yang, Sihan and Zhang, Renrui and Shen, Zijun and Lu, Ming and Dai, Gaole and Liang, Hao and Guo, Ziyu and Yan, Shilin and Luo, Yulin and others},
          journal={arXiv preprint arXiv:2505.14671},
          year={2025}
        }