NExT-GPT’s release offers developers a powerful multi-modal language model that can handle diverse inputs and outputs, paving the way for more sophisticated AI applications across different media types.
The NExT Research Center at the National University of Singapore (NUS) has unveiled NExT-GPT, an open-source multi-modal large language model (LLM) designed to process text, images, videos, and audio interchangeably. The model can accept various types of input and generate responses in different formats, making it a versatile AI agent.
Multi-modal capabilities
NExT-GPT offers a chat-based interface that lets users submit text, images, videos, or audio files. The model understands these inputs and responds accordingly, answering questions or generating content. Under the hood, the system combines pre-trained components, including the Vicuna LLM and the Stable Diffusion decoder, with trainable neural network layers in between. These intermediary layers are trained using a novel technique developed by the NExT team called Modality-switching Instruction Tuning (MosIT).
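As a rough illustration of that bridging idea, the sketch below projects features from a stand-in frozen image encoder into an LLM's embedding space so they can sit alongside the text tokens. The dimensions, module names, and dummy tensors are assumptions made for illustration; this is not NExT-GPT's actual code.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration: 1024-d vision features, a 4096-d LLM embedding space.
ENC_DIM, LLM_DIM, NUM_PATCHES = 1024, 4096, 32

# Stand-in for the output of a frozen, pre-trained image encoder: one feature per patch.
image_features = torch.randn(1, NUM_PATCHES, ENC_DIM)

# The trainable piece between the frozen encoder and the frozen LLM:
# a linear projection that maps encoder features into the LLM's embedding space.
input_projection = nn.Linear(ENC_DIM, LLM_DIM)
image_embeds = input_projection(image_features)             # shape (1, 32, 4096)

# Stand-in for the embedded text tokens of the user's prompt.
text_embeds = torch.randn(1, 16, LLM_DIM)

# The LLM core then consumes the concatenated multi-modal sequence.
llm_inputs = torch.cat([image_embeds, text_embeds], dim=1)  # shape (1, 48, 4096)
print(llm_inputs.shape)
```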
Architecture and training
NExT-GPT’s architecture has three tiers: an encoding stage with linear projections, a Vicuna LLM core responsible for generating tokens (including signals for output modalities), and a decoding stage with modality-specific transformer layers and decoders. Notably, most of the model’s parameters, including encoders, decoders, and the Vicuna model, remain frozen during training, with only about 1% being updated. This approach helps reduce training costs while maintaining performance.
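The freezing recipe is simple to reproduce in outline. The snippet below is a minimal sketch with placeholder layer sizes rather than NExT-GPT's real components: it freezes the stand-in pre-trained parts and reports what share of parameters remains trainable. With billion-parameter encoders and a full Vicuna core, that share drops to roughly the 1% figure cited above.

```python
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Mark a pre-trained component as frozen so it receives no gradient updates."""
    for p in module.parameters():
        p.requires_grad = False
    return module

def trainable_fraction(model: nn.Module) -> float:
    """Share of parameters the optimizer will actually update."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total

# Hypothetical composition mirroring the three tiers: frozen encoder, frozen LLM core,
# and small trainable projections on the encoding and decoding sides.
model = nn.ModuleDict({
    "encoder":  freeze(nn.Linear(1024, 4096)),   # stand-in for a pre-trained encoder
    "llm":      freeze(nn.Linear(4096, 4096)),   # stand-in for the Vicuna core
    "in_proj":  nn.Linear(1024, 4096),           # trainable encoding-side projection
    "out_proj": nn.Linear(4096, 768),            # trainable decoding-side projection
})

# With toy layer sizes the fraction prints much higher than 1%; in the real model the
# frozen components are billions of parameters, which pushes the trainable share down.
print(f"trainable parameters: {trainable_fraction(model):.1%}")
```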
The model was instruction-tuned on a dataset of example dialogues between human users and a chatbot. These dialogues, roughly 5,000 in total, cover scenarios involving multiple modalities in both the input and the output.
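A single record in such a dataset might look roughly like the following. The field names, placeholder tokens, and file paths are invented for illustration and are not the actual MosIT schema.

```python
# Hypothetical shape of one modality-switching dialogue; the schema is invented
# for illustration and is not the actual MosIT format.
example_dialogue = {
    "turns": [
        {
            "role": "user",
            "text": "Here is a photo of my garden. <image> Can you turn it into a short clip of it at sunset?",
            "inputs": [{"modality": "image", "path": "garden.jpg"}],
        },
        {
            "role": "assistant",
            "text": "Sure, here is your garden at sunset as a short video. <video>",
            "outputs": [{"modality": "video", "path": "garden_sunset.mp4"}],
        },
    ]
}
print(len(example_dialogue["turns"]), "turns")
```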
Performance and evaluation
NExT-GPT was evaluated on various multi-modal generation benchmarks, demonstrating competitive results compared to baseline models. Human judges also rated the model’s output in different scenarios, with image generation scenarios receiving higher scores than video and audio.
A distinctive feature of the model is its ability to generate modality-signaling tokens when users request specific types of content, such as images, videos, or sounds. These tokens are pre-defined and included in the LLM's vocabulary during training.
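Adding such signal tokens to an existing LLM vocabulary follows a standard pattern, sketched below with the Hugging Face transformers API. The token strings are illustrative placeholders rather than NExT-GPT's actual signal tokens, and the Vicuna checkpoint is used only as an example.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative placeholder names; NExT-GPT's actual signal tokens may be named differently.
signal_tokens = ["<IMG_GEN>", "<VID_GEN>", "<AUD_GEN>"]

# Any causal-LM checkpoint works for the pattern; a public Vicuna release is used here
# only because NExT-GPT builds on Vicuna.
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

# Register the new tokens and grow the embedding matrix so the LLM can emit them.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": signal_tokens})
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} modality-signaling tokens to the vocabulary")
```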
NExT-GPT's open-source availability is a significant contribution to multi-modal AI, enabling researchers and developers to build applications that seamlessly integrate text, images, videos, and audio. The model has potential use cases across many domains, from content generation and multimedia analysis to virtual assistants that can understand and respond to user requests in their preferred formats.