---
license: apache-2.0
library_name: transformers
tags:
- pytorch
- mixture-of-experts
- lora
- adapter
- causal-lm
- text-generation
language:
- en
pipeline_tag: text-generation
---

# MoLA-LM: Mixture of LoRA Adapters LLM

MoLA-LM combines multiple LoRA adapters with an intelligent router to automatically select the best adapter for each input prompt. This approach enables specialized performance across different tasks while maintaining efficiency.

[**Click for evals**](https://github.com/alkinun/MoLA/blob/main/README.md)

**Important Note**: *v0.5 had issues in the LoRA-applying part of the custom LM class, and its router was too small to generalize well. These issues are resolved in v0.6 and will stay fixed in future models.*

**TLDR:** *Don't use v0.5; use v0.6 and above.*

## Model Details

- **Model Type**: Mixture of LoRA Adapters Language Model
- **Base Model**: Qwen/Qwen3-4B-Thinking-2507
- **Total Adapters**: 9
- **Architecture**: Custom `MoLAForCausalLM` with automatic adapter routing

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model (trust_remote_code=True is required for the custom architecture)
model = AutoModelForCausalLM.from_pretrained(
    "MoLA-LLM/MoLA-v0.6-9x4b",
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("MoLA-LLM/MoLA-v0.6-9x4b", trust_remote_code=True)

# Use it like any other language model - adapter selection is automatic
prompt = "Write a Python function to calculate fibonacci numbers"
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=8192, temperature=0.6, do_sample=True)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

print(f"Selected LoRA: {model.get_current_lora()}")
print(response)
```
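
Because routing happens per prompt, different kinds of requests can land on different adapters. The loop below is a small illustrative sketch that reuses the `model` and `tokenizer` from above; the adapter names it prints depend on the model's own adapter set.

```python
# Illustrative only: send prompts from different domains through the same model
# and observe which adapter the router picks for each one.
prompts = [
    "Prove that the sum of two even integers is even.",
    "Write a SQL query returning the top 5 customers by revenue.",
    "Summarize the plot of Romeo and Juliet in two sentences.",
]

for p in prompts:
    batch = tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    _ = model.generate(**batch, max_new_tokens=256, temperature=0.6, do_sample=True)
    # get_current_lora() reports the adapter selected for the most recent prompt
    print(f"{p[:45]:<45} -> {model.get_current_lora()}")
```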

*You can also use `load_in_4bit` and `load_in_8bit` directly when loading!*
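
As a sketch of quantized loading, the snippet below goes through `BitsAndBytesConfig`, the config-based route transformers documents for bitsandbytes quantization; it assumes the `bitsandbytes` package is installed and should be equivalent to the shortcut flags above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization; use load_in_8bit=True instead for 8-bit weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "MoLA-LLM/MoLA-v0.6-9x4b",
    trust_remote_code=True,
    device_map="auto",
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained("MoLA-LLM/MoLA-v0.6-9x4b", trust_remote_code=True)
```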

## Architecture

The MoLA-LM architecture consists of:

1. **Base Model**: Qwen/Qwen3-4B-Thinking-2507
2. **Router Network**: a frozen sentence-transformer encoder feeding an MLP decoder that selects the adapter (sketched below)
3. **LoRA Adapters**: 9 task-specific fine-tuned adapters
4. **Dynamic Switching**: the chosen adapter is applied automatically based on the input
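
To make the routing flow concrete, here is a minimal, hypothetical sketch of the router described above. The class, method, and helper names (`AdapterRouter`, `route`, `apply_lora`) are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class AdapterRouter(nn.Module):
    """Illustrative router: frozen sentence encoder -> small MLP -> adapter index."""

    def __init__(self, encoder_name: str, embed_dim: int, num_adapters: int = 9):
        super().__init__()
        self.encoder = SentenceTransformer(encoder_name)  # frozen prompt encoder
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.mlp = nn.Sequential(                         # trained decoder head
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_adapters),
        )

    @torch.no_grad()
    def route(self, prompt: str) -> int:
        emb = self.encoder.encode(prompt, convert_to_tensor=True)  # (embed_dim,)
        logits = self.mlp(emb)                                     # (num_adapters,)
        return int(logits.argmax().item())

# Conceptually, generation then looks like:
#   idx = router.route(prompt)              # 1. score adapters and pick one
#   apply_lora(base_model, adapters[idx])   # 2. activate the chosen LoRA weights
#   base_model.generate(...)                # 3. generate with the specialized weights
```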

---

## *Paper coming soon™*