---
license: apache-2.0
library_name: transformers
tags:
- pytorch
- mixture-of-experts
- lora
- adapter
- causal-lm
- text-generation
language:
- en
pipeline_tag: text-generation
---

# MoLA-LM: Mixture of LoRA Adapters LLM

MoLA-LM combines multiple LoRA adapters with an intelligent router to automatically select the best adapter for each input prompt. This approach enables specialized performance across different tasks while maintaining efficiency.

[**Click for evals**](https://github.com/alkinun/MoLA/blob/main/README.md)

**Important Note**: *v0.5 had issues in the LoRA-applying part of the custom LM class, and its router was too small to generalize well. These issues are resolved in v0.6 and will stay fixed in future models.*

**TLDR:** *Don't use v0.5; use v0.6 and above.*

## Model Details

- **Model Type**: Mixture of LoRA Adapters Language Model
- **Base Model**: Qwen/Qwen3-4B-Thinking-2507
- **Total Adapters**: 9
- **Architecture**: Custom `MoLAForCausalLM` with automatic adapter routing

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model (trust_remote_code=True is required for the custom architecture)
model = AutoModelForCausalLM.from_pretrained(
    "MoLA-LLM/MoLA-v0.6-9x4b",
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("MoLA-LLM/MoLA-v0.6-9x4b", trust_remote_code=True)

# Use it like any other language model - adapter selection is automatic
prompt = "Write a Python function to calculate fibonacci numbers"
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=8192, temperature=0.6, do_sample=True)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

print(f"Selected LoRA: {model.get_current_lora()}")
print(response)
```
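
Because routing happens per prompt, different kinds of requests can land on different adapters. The loop below is a small illustrative sketch that reuses the `model` and `tokenizer` from above; the adapter names it prints depend on the model's own adapter set.

```python
# Illustrative only: send prompts from different domains through the same model
# and observe which adapter the router picks for each one.
prompts = [
    "Prove that the sum of two even integers is even.",
    "Write a SQL query returning the top 5 customers by revenue.",
    "Summarize the plot of Romeo and Juliet in two sentences.",
]

for p in prompts:
    batch = tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    _ = model.generate(**batch, max_new_tokens=256, temperature=0.6, do_sample=True)
    # get_current_lora() reports the adapter selected for the most recent prompt
    print(f"{p[:45]:<45} -> {model.get_current_lora()}")
```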

*You can also use `load_in_4bit` and `load_in_8bit` directly when loading!*
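
As a sketch of quantized loading, the snippet below goes through `BitsAndBytesConfig`, the config-based route transformers documents for bitsandbytes quantization; it assumes the `bitsandbytes` package is installed and should be equivalent to the shortcut flags above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization; use load_in_8bit=True instead for 8-bit weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "MoLA-LLM/MoLA-v0.6-9x4b",
    trust_remote_code=True,
    device_map="auto",
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained("MoLA-LLM/MoLA-v0.6-9x4b", trust_remote_code=True)
```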

## Architecture

The MoLA-LM architecture consists of:

1. **Base Model**: Qwen/Qwen3-4B-Thinking-2507
2. **Router Network**: a frozen sentence-transformer encoder feeding an MLP decoder that selects the adapter (sketched below)
3. **LoRA Adapters**: 9 task-specific fine-tuned adapters
4. **Dynamic Switching**: the chosen adapter is applied automatically based on the input
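
To make the routing flow concrete, here is a minimal, hypothetical sketch of the router described above. The class, method, and helper names (`AdapterRouter`, `route`, `apply_lora`) are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class AdapterRouter(nn.Module):
    """Illustrative router: frozen sentence encoder -> small MLP -> adapter index."""

    def __init__(self, encoder_name: str, embed_dim: int, num_adapters: int = 9):
        super().__init__()
        self.encoder = SentenceTransformer(encoder_name)  # frozen prompt encoder
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.mlp = nn.Sequential(                         # trained decoder head
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_adapters),
        )

    @torch.no_grad()
    def route(self, prompt: str) -> int:
        emb = self.encoder.encode(prompt, convert_to_tensor=True)  # (embed_dim,)
        logits = self.mlp(emb)                                     # (num_adapters,)
        return int(logits.argmax().item())

# Conceptually, generation then looks like:
#   idx = router.route(prompt)              # 1. score adapters and pick one
#   apply_lora(base_model, adapters[idx])   # 2. activate the chosen LoRA weights
#   base_model.generate(...)                # 3. generate with the specialized weights
```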

---

## *Paper coming soon™*