Topic 28: What is Mixture-of-Mamba?
🔳 We discuss how modality-aware sparsity, borrowed from the Mixture-of-Transformers concept, enables the Mamba Selective State Space Model (SSM) to handle multimodal data
At Turing Post, we are particularly excited about exploring LLM architectures that differ from widespread approaches like transformers. One of them is the Mamba Selective State Space Model (SSM), which we covered in one of our first AI 101 episodes. It is one of the main competitors to transformers thanks to its efficient handling of long sequences, high speed, and reduced memory use. What’s most intriguing about AI is watching how different architectures receive upgrades to align with emerging trends. Mamba, for example, is not an efficient option for processing multimodal data, and this is where Mixture-of-Mamba (MoM) comes in. It brings the Mixture-of-Experts (MoE) concept, which has proven so useful in transformers, to SSMs for multimodal tasks. MoM’s main feature, modality-aware sparsity, transforms the Mamba core into a new, powerful architecture that meets the need for multimodality. Let’s explore how MoM changes Mamba and how this fascinating system works.
📨 Click follow! If you want to receive our articles straight to your inbox, please subscribe here
In today’s episode, we will cover:
- Mixture-of-Mamba (MoM): what’s the idea?
- How Does MoM Work?
- How Good is MoM?
- MoM’s Advantages
- Not Without Limitations
- Conclusion: Why Does Mixture-of-Mamba Stand Out?
- Bonus: Resources to Dive Deeper
Mixture-of-Mamba: what’s the idea?
Mamba is one of the most powerful Selective State Space Models (SSMs). At their core, SSMs are a type of AI model that can efficiently process sequences of data, such as sentences or videos. They have been explored as a competitive alternative to transformers, which are powerful but computationally expensive. Mamba is especially efficient and has the following advantages over transformers (a simplified sketch of the underlying recurrence follows this list):
- Efficient handling of long sequences: Mamba achieves linear scaling with sequence length compared to transformers, which scale quadratically.
- Faster inference: Due to its linear-time processing, Mamba can perform inference up to five times faster than transformers.
- Reduced memory use: It avoids the extensive memory requirements of the attention mechanisms in transformers.
- Parallelizable training: By representing the SSM as a convolution, Mamba enables parallel training similar to Convolutional Neural Networks (CNNs), leading to faster training times.
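To make the linear-scaling point concrete, here is a minimal, illustrative sketch of the kind of recurrence a selective SSM runs. The function and variable names are ours, and the real Mamba adds discretization, gating, and a hardware-aware parallel scan that are omitted here.

```python
import numpy as np

def selective_ssm_scan(x, A, B_proj, C_proj):
    """Toy selective-SSM recurrence: a fixed-size state h is updated once per
    token, so cost grows linearly with sequence length (no attention matrix).

    x: (seq_len, d_model) input sequence
    A: (d_state, d_state) state-transition matrix (kept fixed in this toy)
    B_proj, C_proj: (d_model, d_state) maps that make B and C input-dependent,
                    i.e. the "selective" part of the model
    """
    seq_len, _ = x.shape
    d_state = A.shape[0]
    h = np.zeros(d_state)            # running state acts as compressed memory
    outputs = []
    for t in range(seq_len):
        B_t = x[t] @ B_proj          # input-dependent input map
        C_t = x[t] @ C_proj          # input-dependent readout map
        h = A @ h + B_t              # state update: h_t = A h_{t-1} + B_t
        outputs.append(C_t @ h)      # readout: y_t = <C_t, h_t>
    return np.array(outputs)
```

Because the state is a fixed-size summary updated once per token, compute and memory grow linearly with sequence length, instead of quadratically as with attention.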
However, there’s one big problem: Mamba doesn’t distinguish between different types of data and treats all input, whether text, images, or speech, the same way. This limits Mamba’s effectiveness for multimodal tasks.
The question arises: How can we expand Mamba’s benefits to multimodal data and make it an even more powerful architecture?
Researchers from Stanford University, Carnegie Mellon University, and FAIR at Meta found a solution. They turned to the idea of Mixture-of-Experts (MoE), which allows models to use only parts of their structure for specific inputs. In particular, they were inspired by Mixture-of-Transformers (MoT), which selectively activates different processing components based on the input type. Building on it, they created a new SSM architecture, Mixture-of-Mamba (MoM), which makes the model more "aware" of different data types while keeping it computationally efficient. Let's explore how exactly MoM makes Mamba multimodal.
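Before diving in, here is a tiny, illustrative sketch of the MoE idea itself: a learned gate sends each token to one of several experts, so only a fraction of the parameters is active per token. All names here are hypothetical rather than taken from any MoE library; note the contrast with MoM below, where routing is decided by the token’s modality instead of being learned.

```python
import torch
import torch.nn as nn

class TinyTop1MoE(nn.Module):
    """Minimal top-1 Mixture-of-Experts layer: a learned gate picks one
    expert (here just a linear layer) per token, so only part of the
    model's parameters is used for any given token."""

    def __init__(self, d_model, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        expert_ids = self.gate(x).argmax(dim=-1)      # chosen expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_ids == e
            if mask.any():
                out[mask] = expert(x[mask])           # run only the selected expert
        return out
```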
How does MoM work?
“MoM introduces modality-aware sparsity through modality-specific parameterization of the Mamba block” (the original paper). Let's break this statement down step by step.
MoM integrates modality-aware sparsity into the core of Mamba. This means that instead of applying the same parameters to all data types, MoM dynamically selects the right processing method for each type of input (text, images, or speech). Here's how it works from the inside.
MoM is built using Mixture-of-Mamba blocks which apply separate processing rules for different types of input data while still sharing common components where it makes sense.
Image Credit: The original paper
The modality-aware sparsity mechanism works like a dynamic routing system (see the sketch after this list):
- MoM uses a modality mask to distinguish between different types of input tokens (text, image, or speech).
- It activates the right set of weights based on the modality. This process is called modality-specific parameterization.
- MoM processes different modality tokens simultaneously, ensuring efficient training and inference.
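As a rough illustration of this routing, here is a minimal sketch of a modality-routed projection layer, assuming each token arrives with an integer modality id (0 = text, 1 = image, 2 = speech). The class and argument names are hypothetical and not taken from the MoM codebase.

```python
import torch
import torch.nn as nn

class ModalityRoutedLinear(nn.Module):
    """Toy modality-aware projection: one weight matrix per modality,
    selected per token by a modality mask."""

    def __init__(self, d_in, d_out, num_modalities=3):
        super().__init__()
        # One independent projection per modality (modality-specific parameterization)
        self.projs = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(num_modalities))

    def forward(self, x, modality_ids):
        # x: (batch, seq_len, d_in); modality_ids: (batch, seq_len) integer tensor
        out = torch.zeros(*x.shape[:-1], self.projs[0].out_features,
                          dtype=x.dtype, device=x.device)
        for m, proj in enumerate(self.projs):
            mask = modality_ids == m          # tokens belonging to modality m
            if mask.any():
                out[mask] = proj(x[mask])     # apply only that modality's weights
        return out

# Usage: a mixed batch of text (0) and image (1) tokens, processed in one pass
layer = ModalityRoutedLinear(d_in=8, d_out=8)
x = torch.randn(2, 5, 8)
modality_ids = torch.tensor([[0, 0, 1, 1, 0], [1, 1, 1, 0, 0]])
y = layer(x, modality_ids)                    # (2, 5, 8)
```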
To be modality-aware, MoM modifies key projection layers in the Mamba model:
- Input projection layer: Converts raw data into an initial representation specific to its modality.
- Intermediate processing layers: Apply modality-specific transformations to refine the data representation inside the model.
- Output projection layer: Converts the processed data into its final output format.
By doing this, the model applies the most effective processing method for each type of data, making training faster and more efficient.
While projection layers are separated in the model, some parts of it remain shared. What are they?
State transitions and convolutions remain shared because they don’t depend on the type of data, and keeping them shared preserves computational efficiency (a combined sketch follows the list below).
- State transitions track and update sequential information over time, acting like memory units that help capture long-term dependencies across all modalities.
- 1D Convolutional Layers capture local patterns in sequences, helping the model process time-dependent information across different modalities, including text, images, speech, and video.
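Putting the pieces together, here is a highly simplified, hypothetical sketch of a MoM-style block in which the input and output projections are modality-specific while the 1D convolution and the state transition are shared. The real Mamba block also includes gating, input-dependent (selective) parameters, and an efficient scan, all omitted here for clarity.

```python
import torch
import torch.nn as nn

class ToyMoMBlock(nn.Module):
    """Simplified Mixture-of-Mamba-style block: decoupled per-modality
    input/output projections, shared 1D convolution and state transition."""

    def __init__(self, d_model, d_state=16, num_modalities=3, kernel_size=4):
        super().__init__()
        # Modality-specific (decoupled) projections
        self.in_projs = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_modalities))
        self.out_projs = nn.ModuleList(nn.Linear(d_state, d_model) for _ in range(num_modalities))
        # Shared components: causal depthwise 1D conv and state-transition parameters
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)
        self.A = nn.Parameter(torch.eye(d_state) * 0.9)             # shared state transition
        self.B = nn.Parameter(torch.randn(d_model, d_state) * 0.02)

    def forward(self, x, modality_ids):
        # x: (batch, seq_len, d_model); modality_ids: (batch, seq_len)
        batch, seq_len, _ = x.shape
        # 1) Modality-specific input projection
        h = torch.zeros_like(x)
        for m, proj in enumerate(self.in_projs):
            mask = modality_ids == m
            if mask.any():
                h[mask] = proj(x[mask])
        # 2) Shared causal 1D convolution over the sequence dimension
        h = self.conv(h.transpose(1, 2))[..., :seq_len].transpose(1, 2)
        # 3) Shared (toy) state recurrence
        state = torch.zeros(batch, self.A.shape[0], device=x.device)
        states = []
        for t in range(seq_len):
            state = state @ self.A + h[:, t] @ self.B
            states.append(state)
        states = torch.stack(states, dim=1)                         # (batch, seq_len, d_state)
        # 4) Modality-specific output projection
        out = torch.zeros_like(x)
        for m, proj in enumerate(self.out_projs):
            mask = modality_ids == m
            if mask.any():
                out[mask] = proj(states[mask])
        return out
```

The point of the sketch is the split: modality-specific weights sit where data types differ most (the projections), while the sequence machinery that tracks context is shared, which is what keeps the block efficient.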
This design allows MoM to be both specialized in handling different modalities and computationally efficient by sharing key operations. Now let’s move to MoM performance results.
How good is MoM?
To demonstrate the strength of the MoM model, the researchers tested it on three different training setups:
- Transfusion – a setup that blends text and continuous image data.
- Chameleon – A setup that mixes text and discrete image data.
- Three-modality framework – A more complex setup that also includes speech as a third modality alongside text and images, treating all as discrete tokens.
The results were quite impressive.
- Transfusion setting
  - Image performance:
    - At the 1.4B scale, MoM reduces loss by 2.2% while using only 34.76% of the FLOPs (floating-point operations, a measure of compute).
    - Smaller models (760M, 163M) show similar trends, with 2.37% lower loss and 60% fewer FLOPs.
  - Text performance: MoM improves validation loss and generalizes better while using fewer FLOPs. It reaches the same accuracy faster, boosting training efficiency.
  - Overall efficiency: MoM lowers training loss across tasks while cutting FLOPs by up to 86.11%.
Image Credit: The original paper
- Chameleon setting: both images and text are treated as discrete tokens.
  - Image performance: MoM reduces image loss by up to 3.46%, using just 25.9–42.5% of the FLOPs.
  - Text performance: At the 1.5B scale, MoM lowers text loss by 3.01%, using 65.4% of the FLOPs.
Image Credit: The original paper
- Three-modality setting: image, text, and speech
  - Speech performance:
    - At the 443M scale: 7.14% lower training loss, using 19.2% of the FLOPs.
    - At the 1.5B scale: 5.75% lower loss, with only 24.8% of the FLOPs.
  - Overall metrics:
    - Up to 3.67% training-loss improvement while cutting FLOPs by up to 56.2%.
    - MoM consistently improves performance across all three modalities.
Image Credit: The original paper
Across these results, MoM consistently achieves better accuracy while using fewer FLOPs, making it more efficient than both dense Mamba and transformer baselines. Below, we gather all the benefits of MoM to clarify exactly how it improves on the Mamba Selective SSM.
MoM’s advantages
- In MoM, only the relevant parameters are activated for each token, which optimizes computation, training speed, and cost.
- Modality-aware sparsity lets MoM specialize in text, image, and speech processing, handling each modality in a way that suits its unique structure. Applying modality-aware sparsity in every part of the model gives better results than applying it only selectively.
- Scalable and flexible: It works across different training strategies like diffusion-based image learning and token-based processing.
- MoM consistently outperforms traditional dense models in three multimodal settings: text + continuous images; text + discrete images; and text + images + speech.
- MoM improves computational efficiency, cutting costs by up to 65% while maintaining or improving performance.
- MoM reaches the same accuracy with fewer training steps, making it faster to train compared to Mamba Dense and Flex-Attention Transformer.
- Achieves significant loss reduction compared to baseline models, leading to better generalization to unseen data.
- It reduces energy consumption, making AI more accessible and environmentally friendly.
- MoM can be combined with MoE techniques, opening up possibilities for further efficiency improvements in multimodal AI.
Not without limitations
While MoM introduces significant improvements in multimodal AI, it also has several limitations:
- MoM’s modality-aware sparsity requires specialized parameterization, making it more complex to implement and optimize compared to standard SSMs or Transformers.
- The need to decouple multiple projection components in Mamba blocks can make debugging and fine-tuning more difficult.
- Training MoM requires careful balancing between text, image, and speech representations, which may introduce additional training overhead.
- Speech processing still requires specialized tokenization methods.
- The architecture’s efficiency is highly dependent on joint optimization rather than independent component improvements.
- MoM’s generalization to real-world applications has not been fully explored.
- Larger-scale transformer models, like GPT and LLaMA, still outperform MoM in certain NLP benchmarks, suggesting potential upper limits in its performance.
These limitations highlight areas for future improvement, but despite them, MoM takes Mamba to a new level as an even stronger alternative to transformers.
Conclusion: Why does Mixture-of-Mamba stand out?
We discussed **Mixture-of-Mamba (MoM)**, an improved State-Space Model (SSM) that applies modality-aware sparsity to efficiently handle different data types. MoM is a faster, more efficient, and scalable model for multimodal AI, offering significant gains in computational efficiency and training effectiveness across text, image, and speech tasks.
It’s important to highlight how MoM differs from other Mamba-based approaches and sparse transformers to emphasize its unique architecture:
- Sparse transformers integrate sparsity into attention mechanisms but remain focused on specific tasks, such as text-to-image generation.
- Other Mamba-based sparse models, such as MoE-Mamba and BlackMamba, introduce sparsity by adding MoE-style MLP layers. However, they leave Mamba’s core structure unchanged.
MoM takes a different approach by applying modality-aware sparsity inside the Mamba block, allowing specialized processing across different data types and extending sparse architectures beyond Transformers. This marks a significant advancement for SSMs, demonstrating their huge potential in the multimodal AI world.
We are especially curious about the potential of blending MoM and MoE—a combination that could be complex but highly effective, potentially creating a powerful hybrid model.
Author: Alyona Vert Editor: Ksenia Se
Bonus: Resources to dive deeper
- Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity by Stanford University, [Machine Learning Department, Carnegie Mellon University](https://huggingface.co/CMU-SCS), FAIR at Meta
- Mixture-of-Mamba code on GitHub
- BlackMamba: Mixture of Experts for State-Space Models by Zyphra
- MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts by Maciej Pióro et al.
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces by Albert Gu, Tri Dao
- Jamba-1.5: Hybrid Transformer-Mamba Models at Scale by Avshalom Manevich et al.
- Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference by Han Zhao, Siteng Huang
- MambaVision: A Hybrid Mamba-Transformer Vision Backbone by Ali Hatamizadeh, Jan Kautz
- MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection by Ali Behrouz et al.
Sources from Turing Post
- Topic 1: What is Mixture-of-Experts (MoE)?
- Topic 2: What is Mamba?
- 15 Researches about Mamba Architecture
📨 If you want to receive our articles straight to your inbox, please subscribe here