✨ UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens ✨
📍 Abstract
Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This limits generation from complex prompts: for example, given the concept ⟨bo⟩, a model should be able to generate "⟨bo⟩ wearing its hat" without any additional textual description of the hat. We call this kind of generation personalized knowledge-driven generation. To address this limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision-language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens that leverage complementary semantics to boost both personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation, which enhances the mutual benefits between the two tasks. To quantitatively evaluate unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and knowledge-driven generation. Experimental results on UnifyBench indicate that UniCTokens is competitive with leading methods in concept understanding and concept generation, and achieves state-of-the-art results in personalized knowledge-driven generation. Our research demonstrates that enhanced understanding improves generation, and the generation process can in turn yield valuable insights into understanding.
The capability overview of UniCTokens.
📰 Introduction
UniCTokens is a framework that effectively integrates personalized information into a unified vision-language model (VLM) for both understanding and generation tasks. Existing methods typically treat understanding and generation separately, which limits the model's ability to generate images from complex prompts: for example, given the concept ⟨bo⟩, generating "⟨bo⟩ wearing its hat" without any additional description of what the hat looks like. We call this personalized knowledge-driven generation.
UniCTokens addresses this limitation through a three-stage progressive training strategy:
- Understanding warm-up.
- Bootstrapping generation from understanding.
- Deepening understanding from generation (Generation as Perception).
Our research demonstrates that enhanced understanding improves generation, and the generation process can yield valuable insights into understanding.
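Conceptually, the three stages share one set of learnable concept-token embeddings and differ mainly in which objective is active. Below is a minimal PyTorch sketch of that structure; the class, the token count, and the placeholder `stage_loss` are illustrative assumptions, not the released UniCTokens training code.

```python
import torch
import torch.nn as nn

# One shared set of unified concept tokens, reused by the understanding and
# generation branches across all three stages. Names and sizes are
# illustrative assumptions, not the actual UniCTokens implementation.
class UnifiedConceptTokens(nn.Module):
    def __init__(self, num_tokens: int = 8, dim: int = 4096):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

def stage_loss(stage: str, tokens: torch.Tensor) -> torch.Tensor:
    # Placeholder objective. A real run would plug in an understanding loss
    # (stage 1), a generation loss conditioned on the warmed-up tokens
    # (stage 2), and a feedback loss from generated samples (stage 3).
    return tokens.pow(2).mean()

concept = UnifiedConceptTokens()
optimizer = torch.optim.AdamW(concept.parameters(), lr=1e-3)

# The progressive schedule: same tokens, different objective per stage.
for stage in ["understanding_warmup", "bootstrap_generation", "deepen_understanding"]:
    for _ in range(100):
        optimizer.zero_grad()
        loss = stage_loss(stage, concept.tokens)
        loss.backward()
        optimizer.step()
```

Because the tokens persist across stages, whatever the understanding warm-up encodes is available to condition generation, and signals from generated samples can flow back into the same embeddings.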
The overview of UniCTokens.
🔥 Key Features
🔄 Unified Concept Tokens
Unifying personalized understanding and generation tasks in a single model.
🧠 Personalized Knowledge-Driven Generation
Leveraging external personalized knowledge for complex image generation.
📈 Mutual Enhancement
Three-stage strategy promoting mutual enhancement of understanding and generation, achieving cross-task information transfer.
📊 UnifyBench
The first benchmark to assess personalized understanding, personalized generation, and personalized knowledge-driven generation in a single suite.
UnifyBench Benchmark Tasks
Multi-Modal Understanding (MMU)
| Sub-task | Source files | Evaluation focus |
|---|---|---|
| Text-Only QA | test/<concept>/text_only.json | Whether the model remembers concept knowledge without any image input |
| VQA | test/<concept>/vqa.json + image | Visual question answering about the concept image |
| Rec (recognition) | test/*.png | Pure visual recognition of the concept |
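A hypothetical loader for these sub-tasks might look like the sketch below; the file paths come from the table, while the helper name and the assumption that the JSON files parse directly into QA lists are ours, not a verified schema.

```python
import json
from pathlib import Path

# Illustrative loader for the three MMU sub-tasks. Paths follow the table
# above; the JSON structure is an assumption about the released data.
def load_mmu_items(test_root: str, concept: str):
    concept_dir = Path(test_root) / concept
    # Text-Only QA: probes remembered concept knowledge with no image input.
    text_only = json.loads((concept_dir / "text_only.json").read_text())
    # VQA: question-answer pairs grounded in a concept image.
    vqa = json.loads((concept_dir / "vqa.json").read_text())
    # Rec: raw test images for pure visual recognition.
    rec_images = sorted(Path(test_root).glob("*.png"))
    return text_only, vqa, rec_images
```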
Text-to-Image Generation (T2I)
| Mode | Input | Metrics |
|---|---|---|
| Vanilla generation | Prompts from the DreamBooth Dataset → target-concept images | CLIP-I / CLIP-T · ArcFace similarity |
| Personalized knowledge-driven | t2i_conditions.json | Combined T2I score: outputs must satisfy both visual and textual attributes |
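As a reference for the metrics column, CLIP-I and CLIP-T can be computed with an off-the-shelf CLIP checkpoint as sketched below; the checkpoint choice and function names are assumptions, since the benchmark's exact evaluation code is not reproduced here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Sketch of CLIP-I / CLIP-T using a standard public checkpoint; the exact
# checkpoint used by UnifyBench is an assumption on our part.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_i(generated: Image.Image, reference: Image.Image) -> float:
    # CLIP-I: cosine similarity between generated- and reference-image embeddings.
    inputs = processor(images=[generated, reference], return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])

@torch.no_grad()
def clip_t(generated: Image.Image, prompt: str) -> float:
    # CLIP-T: cosine similarity between the generated image and its prompt.
    img = model.get_image_features(**processor(images=generated, return_tensors="pt"))
    txt = model.get_text_features(**processor(text=prompt, return_tensors="pt", padding=True))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float(img @ txt.T)
```

ArcFace similarity would follow the same pattern, with a face-embedding model in place of CLIP.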
Qualitative comparisons among UniCTokens, Yo'Chameleon, and GPT-4o.
More qualitative results generated by our model.
📦 UniCTokens Dataset
Data Overview
| Item | Description |
|---|---|
| Total concepts | 20 (Human × 10 · Animal × 5 · Object × 5) |
| Images per concept | ≈ 10–15 (already split into train / test) |
| Negative samples | random_images/ (100 random irrelevant images) + negative_example/ (hard negatives) |
UnifyBench Dataset.
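Given this description, a quick sanity check over the expected layout might look like the sketch below; the train/, test/, and negatives folder names are inferred from the tables above and are assumptions, not a verified spec of the release.

```python
from pathlib import Path

# Illustrative sanity check for the dataset layout implied by the tables:
# per-concept folders under train/ and test/, plus random_images/ for the
# 100 random negatives. Folder names are inferred, not verified.
def summarize_dataset(root: str) -> None:
    base = Path(root)
    concepts = sorted(p.name for p in (base / "train").iterdir() if p.is_dir())
    print(f"{len(concepts)} concepts (expected 20)")
    for name in concepts:
        n_train = len(list((base / "train" / name).iterdir()))
        n_test = len(list((base / "test" / name).iterdir()))
        print(f"  {name}: {n_train} train / {n_test} test files")
    n_random = len(list((base / "random_images").iterdir()))
    print(f"random negatives: {n_random} (expected 100)")
```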
📝 BibTeX
```bibtex
@article{an2025unictokens,
  title={UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens},
  author={An, Ruichuan and Yang, Sihan and Zhang, Renrui and Shen, Zijun and Lu, Ming and Dai, Gaole and Liang, Hao and Guo, Ziyu and Yan, Shilin and Luo, Yulin and others},
  journal={arXiv preprint arXiv:2505.14671},
  year={2025}
}
```
📬 Contact
- GitHub Issues: https://github.com/arctanxarc/UniCTokens/issues
- Email: arctanxarc@gmail.com