3 Multi-modal models & papers from NeurIPS 2023 — Try them out now at VESSL Hub

With the release of GPT-4, multimodal AI has become the biggest trend in generative AI, and it was also one of the highlights we chose from NeurIPS 2023.
It’s also an area that leading generative AI and LLM companies, including our customer Scatter Lab, are pursuing as they move beyond text-only processing to handle additional input types such as images and sound.

“The interfaces of the world are multimodal. We want our models to see what we see and hear what we hear, and we want them to also generate content that appeals to more than one of our senses.” - Mark Chen, Head of Frontiers Research, OpenAI

We imported the original code from the authors’ GitHub repos and created simple playgrounds for InstructBLIP, LLaVA, and AudioCraft.
You can try them out with a single click at VESSL Hub.

Run multi-modal models & papers from NeurIPS 2023

LLaVA: Large Language and Vision Assistant
MusicGen from Meta AI: Simple & Controllable Music Generation
MotionGPT: LLM-powered text-to-motion model

MusicGen

MusicGen is a text-to-music model that generates high-quality music samples from text descriptions or audio prompts. Unlike previous approaches that chain several models to handle multiple token streams, MusicGen uses a single-stage transformer LM with an efficient token interleaving pattern, so one model can generate multiple parallel streams. Besides text-only prompts, MusicGen can also condition on text plus a melody, generating a sound clip that “follows” the given melody.
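To make the idea of a token interleaving pattern more concrete, here is a minimal sketch of a delay-style layout, in which stream k is shifted right by k steps so that a single model can emit one token per stream at every step. This is an illustration under that assumption, not MusicGen's actual code: the function names and the `PAD` marker are hypothetical.

```python
PAD = -1  # hypothetical placeholder for "no token yet"; not a MusicGen constant


def delay_interleave(streams, pad=PAD):
    """Shift stream k right by k steps and pad so every row has equal length."""
    num_streams = len(streams)
    length = len(streams[0])
    if any(len(s) != length for s in streams):
        raise ValueError("all streams must have the same length")
    return [
        [pad] * k + list(stream) + [pad] * (num_streams - 1 - k)
        for k, stream in enumerate(streams)
    ]


def undo_delay(grid):
    """Recover the original aligned streams from a delayed grid."""
    num_streams = len(grid)
    length = len(grid[0]) - (num_streams - 1)
    return [row[k:k + length] for k, row in enumerate(grid)]


# Three toy token streams of length 3 (values are arbitrary token ids).
codebooks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
delayed = delay_interleave(codebooks)
```

Each column of `delayed` is what one decoding step would produce across all streams, and `undo_delay` restores the aligned streams afterwards.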

LLaVA

LLaVA is an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. It combines a language instruction embedding with image features extracted by CLIP, then processes them with an LLM such as Vicuna or Llama, giving the model visual reasoning and image-based chat capabilities. With this Run, you can deploy a Streamlit demo space that runs LLaVA inference: upload a photo and ask questions about the image.
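The way image features and instruction embeddings are combined can be sketched roughly as follows. This toy example assumes a single linear projection maps each image feature to the LLM's embedding size before the two sequences are concatenated; the connector, the dimensions, and the function names are assumptions for illustration, not LLaVA's implementation.

```python
def project(features, weight):
    """Multiply each feature vector (length d_in) by a d_in x d_out matrix."""
    d_out = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(d_out)]
        for f in features
    ]


def build_llm_input(image_features, text_embeddings, weight):
    """Project image features into the text embedding space, then prepend them."""
    visual_tokens = project(image_features, weight)
    return visual_tokens + text_embeddings


# Toy sizes: 2 image patches with 2-dim features, 3-dim LLM embeddings.
image_features = [[1.0, 2.0], [0.0, 1.0]]
projection = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # hypothetical learned weights
text_embeddings = [[0.5, 0.5, 0.5]]  # one embedded instruction token

llm_input = build_llm_input(image_features, text_embeddings, projection)
```

The resulting `llm_input` is a single sequence of same-sized vectors, which is what lets a text-only LLM attend over image content and instruction tokens together.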

MotionGPT

MotionGPT is a unified and versatile motion-language model that combines motion data with a large language model. It uses a motion-specific vector quantized variational autoencoder (VQ-VAE) to construct a motion vocabulary: the VQ-VAE encodes input motion features into discrete motion tokens, which are mixed with text tokens and fed to the LLM. Thanks to the power of the LLM, MotionGPT achieves state-of-the-art performance on multiple motion tasks, including text-based motion generation, motion captioning, and motion prediction.
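The quantization step can be pictured as a nearest-neighbor lookup into a codebook, followed by mixing the resulting ids into a text token sequence. The sketch below covers only that lookup and mixing (no encoder or training), and the codebook values, token spellings, and function names are hypothetical.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    ids = []
    for f in features:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        ids.append(dists.index(min(dists)))
    return ids


def mix_tokens(text_tokens, motion_ids):
    """Append motion ids to a text sequence as special vocabulary tokens."""
    motion_tokens = [f"<motion_{i}>" for i in motion_ids]
    return text_tokens + ["<motion_start>"] + motion_tokens + ["<motion_end>"]


# Toy 2-dim codebook with three entries and three motion feature frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
motion_features = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]

motion_ids = quantize(motion_features, codebook)
sequence = mix_tokens(["describe", "this", "motion"], motion_ids)
```

Because the motion ids become ordinary vocabulary entries, the LLM can read and generate them the same way it handles words.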
