A unified multimodal understanding and generation model.
Separate audio into vocals, bass, drums, and other stems.