
M3GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation

  • Mingshuang Luo
  • Ruibing Hou*
  • Zhuo Li
  • Hong Chang
  • Zimo Liu
  • Yaowei Wang
  • Shiguang Shan

*Corresponding author for this work

Affiliations:
  • CAS - Institute of Computing Technology
  • Peng Cheng Laboratory
  • University of Chinese Academy of Sciences
  • Tencent
  • Harbin Institute of Technology Shenzhen

Research output: Contribution to journal › Conference article › peer-review

Abstract

This paper presents M3GPT, an advanced Multimodal, Multitask framework for Motion comprehension and generation. M3GPT rests on three fundamental principles. The first is a unified representation space for motion-relevant modalities: we apply discrete vector quantization to multimodal conditional signals, such as text, music, and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. The second is modeling motion generation directly in the raw motion space, which avoids the information loss of a discrete tokenizer and yields more detailed, comprehensive motion generation. The third is learning to model the connections and synergies among motion-relevant tasks: text, the modality most familiar and well understood by LLMs, serves as a bridge between different motion tasks, allowing them to reinforce one another. To our knowledge, M3GPT is the first model capable of comprehending and generating motion based on multiple signals. Extensive experiments demonstrate M3GPT's superior performance across a range of motion-relevant tasks and its strong zero-shot generalization on extremely challenging tasks. Project page: https://luomingshuang.github.io/M3GPT/.
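The abstract's first principle, folding several modalities into one LLM vocabulary via discrete vector quantization, can be illustrated with a minimal sketch. This is not the authors' code: the codebook sizes, feature dimension, and vocabulary offsets below are hypothetical, and only the nearest-codebook-entry lookup that defines vector quantization is shown.

```python
import numpy as np

class VQTokenizer:
    """Maps continuous features to discrete token ids via a codebook.

    Hypothetical illustration: each modality (motion, music, ...) gets
    its own codebook, and the resulting ids are offset into disjoint
    ranges of a single shared LLM vocabulary.
    """

    def __init__(self, codebook: np.ndarray, vocab_offset: int):
        self.codebook = codebook          # (K, D) learned code vectors
        self.vocab_offset = vocab_offset  # start of this modality's id range

    def encode(self, features: np.ndarray) -> np.ndarray:
        # features: (T, D) frame-wise embeddings; pick the nearest
        # code vector for each frame, then shift into this modality's range.
        dists = ((features[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1) + self.vocab_offset

# Illustrative sizes only -- not taken from the paper.
rng = np.random.default_rng(0)
TEXT_VOCAB = 1000  # pretend text token ids occupy [0, 1000)
motion_tok = VQTokenizer(rng.normal(size=(256, 8)), vocab_offset=TEXT_VOCAB)
music_tok = VQTokenizer(rng.normal(size=(128, 8)), vocab_offset=TEXT_VOCAB + 256)

motion_ids = motion_tok.encode(rng.normal(size=(5, 8)))  # ids in [1000, 1256)
music_ids = music_tok.encode(rng.normal(size=(4, 8)))    # ids in [1256, 1384)
```

Because every modality's ids land in disjoint ranges of one vocabulary, a single autoregressive LLM can mix text, music, and motion tokens in one sequence, which is what enables the cross-task bridging the abstract describes.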

Original language: English
Journal: Advances in Neural Information Processing Systems
Volume: 37
State: Published - 2024
Externally published: Yes
Event: 38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada
Duration: 9 Dec 2024 - 15 Dec 2024
