Thai HanumanSLM

Hanuman is a Small Language Model (HanumanSLM) specifically designed for Thai text generation. This model uses a Mixture of Experts (MoE) architecture and ships with a custom fast tokenizer optimized for Thai whitespace/newline preservation.
Tokenizer advisor: Koichi Yasuoka
Important Notes
- Context Window: Supports up to 4,096 tokens (via RoPE scaling)
- Tokenizer: Fast tokenizer with full whitespace/newline/tab preservation; no remote code
- Serialization: Model weights provided in safetensors format for security (no pickle)
- Device: Model supports both CPU and GPU inference
- Torch Version: Compatible with PyTorch 1.9+ and transformers 4.20+
Model Details
Model Architecture
- Model Type: SLMForCausalLM (Small Language Model with Mixture of Experts)
- Language: Thai (th)
- License: cc-by-nc-4.0
How the Mixture of Experts (MoE) Model Works
The HanumanSLM MoE model uses a Mixture of Experts architecture to improve flexibility and capacity for Thai text generation. Here’s how it works:
- Embedding Layer: Input tokens are converted to dense vectors using an embedding layer.
- MoE Layer: Multiple expert networks (each a small neural network) process the token embeddings. A gating network decides, for each token, which experts to use and how much to weight their outputs. The top-k experts (as set in the config) are selected for each token, and their outputs are combined using the gating probabilities.
- Output Layer: The combined expert output is passed through a final linear layer to produce logits for each vocabulary token.
- Expert Usage Logging: The gating probabilities for each token are stored and can be analyzed to see which experts are used most often.
This design allows the model to dynamically route different tokens or contexts to different experts, improving generation quality and model efficiency.
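The routing described above can be illustrated with a small PyTorch sketch. This is not the actual SLMForCausalLM implementation (which also includes attention with GQA and RoPE scaling); the class and variable names are illustrative, while the sizes (hidden_size=512, intermediate_size=2048, num_experts=4, experts_per_token=2) are taken from the configuration reported in this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Illustrative top-k expert routing; not the actual HanumanSLM code."""

    def __init__(self, hidden_size=512, intermediate_size=2048,
                 num_experts=4, experts_per_token=2):
        super().__init__()
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.GELU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(hidden_size, num_experts)  # gating network
        self.top_k = experts_per_token
        self.last_gate_probs = None  # kept for expert-usage logging

    def forward(self, hidden_states):  # (batch, seq_len, hidden_size)
        gate_probs = F.softmax(self.gate(hidden_states), dim=-1)
        self.last_gate_probs = gate_probs.detach()  # inspect which experts fire most often
        top_probs, top_idx = gate_probs.topk(self.top_k, dim=-1)
        top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)  # renormalize over selected experts
        output = torch.zeros_like(hidden_states)
        # Dense (inefficient) routing: every expert runs on every token,
        # and the mask keeps only the tokens actually routed to it.
        for e, expert in enumerate(self.experts):
            expert_out = expert(hidden_states)
            for k in range(self.top_k):
                mask = (top_idx[..., k] == e).unsqueeze(-1).type_as(top_probs)
                output = output + mask * top_probs[..., k:k + 1] * expert_out
        return output

# Token embeddings in, combined expert output out (an LM head would then produce logits).
moe = ToyMoELayer()
dummy_embeddings = torch.randn(1, 16, 512)
print(moe(dummy_embeddings).shape)  # torch.Size([1, 16, 512])
```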
Configuration
- Vocabulary Size: 249,261 tokens
- Hidden Size: 512
- Number of Layers: 8
- Attention Heads: 8 query heads with 4 key/value heads (GQA, num_kv_heads=4)
- Intermediate Size: 2,048
- Max Position Embeddings: 4,096 (long-context enabled)
- Tokenizer model_max_length: 4,096
- Architecture: Mixture of Experts (MoE)
Training Details
- Dataset: ZombitX64/Wikipedia-Thai
- Training Method: Causal Language Modeling from scratch
- Optimizer: AdamW
- Learning Rate Scheduler: CosineAnnealingLR (see the sketch after this list)
- Epochs: 2
- Training Steps: 100
- Hardware: CPU-optimized training
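A minimal sketch of the optimizer/scheduler pairing listed above. The model here is a stand-in, and the values (lr=5e-5, T_max=100) mirror the learning rate and step count reported in this card rather than verified training code.

```python
import torch

model = torch.nn.Linear(512, 512)  # placeholder for the actual SLMForCausalLM
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for step in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(2, 512)).pow(2).mean()  # dummy loss standing in for the LM loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # cosine-annealed learning rate, one step per optimizer update
```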
Performance Metrics
| Metric | Value |
|---|---|
| Training Loss | 5.515 |
| Evaluation Loss | 4.722 |
| Perplexity | 112.37 |
| Final Learning Rate | 1.01e-05 |

Perplexity is the exponential of the evaluation loss: exp(4.722) ≈ 112.4, matching the reported value.
Recent Training Run (2025-08-16)
- Epochs: 3
- Global Steps: 189
- Batch Size: 4
- Logging Steps: 50
- Eval/Save Steps: 500
- Total FLOPs: 71,913,893,616,384
- Training Stopped: Yes
- Best Model Checkpoint: None
- Best Metric: None
Log History:
| Step | Epoch | Loss | Grad Norm | Learning Rate |
|---|---|---|---|---|
| 50 | 0.8 | 17.472 | 16.52 | 2.45e-05 |
| 100 | 1.592 | 4.819 | 9.08 | 4.95e-05 |
| 150 | 2.384 | 0.562 | 3.70 | 2.25e-05 |
Intended Use
Primary Use Cases
- Thai Text Generation: Generate coherent Thai text for various applications
- Content Creation: Assist in creating Thai content for blogs, articles, and social media
- Educational Tools: Support Thai language learning and teaching applications
- Research: Academic research in Thai NLP and language modeling
Limitations
- Training Scale: Model was trained for only 2 epochs on a subset of data
- Hardware Constraints: Optimized for CPU training, may benefit from GPU fine-tuning
- Domain: Primarily trained on Wikipedia data, may need domain-specific fine-tuning
- Quality: Initial model - consider further fine-tuning for production use
Technical Implementation
Tokenizer & Context
- Fast tokenizer preserves all whitespace, newlines, and tabs (no remote code required)
- Special tokens are handled via app-level pre/post-processing
- Round-trip encode/decode accuracy confirmed (see the check below)
- Context window extended to 4,096 tokens using RoPE scaling (linear, factor 8)
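The whitespace preservation and round-trip behavior described above can be spot-checked with a short script; the sample string is arbitrary, and the assertion simply encodes the behavior this card reports.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("JonusNattapong/Hanuman")

# Mixed Thai text with a tab, a newline, and a double space.
sample = "สวัสดี\tครับ\nบรรทัดใหม่  เว้นวรรคสองช่อง"
ids = tokenizer(sample, add_special_tokens=False)["input_ids"]
decoded = tokenizer.decode(ids)

assert decoded == sample  # whitespace, tabs, and newlines survive the round trip
```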
Text Normalization
The training pipeline includes the following normalization steps (sketched below):
- Unicode NFC normalization
- Thai-Latin script spacing optimization
- Consistent encoding/decoding for round-trip accuracy
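A minimal sketch of these steps, assuming a simple rule for Thai-Latin spacing (insert a space at Thai/Latin boundaries); the actual pipeline's rule may differ.

```python
import re
import unicodedata

THAI = r"\u0E00-\u0E7F"  # Thai Unicode block

def normalize_thai(text: str) -> str:
    text = unicodedata.normalize("NFC", text)                   # Unicode NFC normalization
    text = re.sub(rf"([{THAI}])([A-Za-z0-9])", r"\1 \2", text)  # Thai followed by Latin/digit
    text = re.sub(rf"([A-Za-z0-9])([{THAI}])", r"\1 \2", text)  # Latin/digit followed by Thai
    return text

print(normalize_thai("ภาษาPythonใช้งานง่าย"))  # -> "ภาษา Python ใช้งานง่าย"
```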
Usage Examples
Basic Text Generation
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("JonusNattapong/Hanuman")
model = AutoModelForCausalLM.from_pretrained("JonusNattapong/Hanuman", trust_remote_code=False)

def generate_thai_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
result = generate_thai_text("เทคโนโลยีปัญญาประดิษฐ์")
print(result)
```
Batch Processing
```python
prompts = [
    "สวัสดีครับ",
    "ประเทศไทยมีพื้นที่",
    "การศึกษาในยุคดิจิทัล"
]

for prompt in prompts:
    result = generate_thai_text(prompt, max_length=80)
    print(f"Input: {prompt}")
    print(f"Output: {result}")
    print("-" * 50)
```
Training Process
Dataset Preparation
- Source: ZombitX64/Wikipedia-Thai (streaming mode)
- Preprocessing: Text cleaning and tokenization with the custom tokenizer (see the sketch below)
- Normalization: Unicode NFC + Thai-Latin spacing
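A sketch of the streaming preprocessing described above, reusing the normalize_thai helper sketched in the Text Normalization section. The "text" column name and the "train" split are assumptions about the dataset schema, not verified details of the original pipeline.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("JonusNattapong/Hanuman")
dataset = load_dataset("ZombitX64/Wikipedia-Thai", split="train", streaming=True)

def preprocess(example):
    text = normalize_thai(example["text"])  # normalization helper sketched above
    return tokenizer(text, truncation=True, max_length=4096)

tokenized_dataset = dataset.map(preprocess)
next(iter(tokenized_dataset))  # streaming: examples are fetched and processed lazily
```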
Training Configuration
```python
training_args = {
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 2,
    "learning_rate": 5e-5,
    "warmup_steps": 10,
    "logging_steps": 10,
    "eval_steps": 50,
    "save_steps": 50,
    "fp16": False,  # CPU training
    "dataloader_num_workers": 0
}
```
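These settings map onto transformers.TrainingArguments; the sketch below shows one way to wire them into the Trainer. The output_dir value is a placeholder, and model, train_dataset, and eval_dataset stand for the objects prepared in the preceding sections rather than variables defined here.

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(output_dir="./hanuman-checkpoints", **training_args)
trainer = Trainer(
    model=model,                  # the SLMForCausalLM instance
    args=args,
    train_dataset=train_dataset,  # tokenized training split
    eval_dataset=eval_dataset,    # tokenized evaluation split
)
trainer.train()
```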
Model Architecture
```python
config = SLMConfig(
    vocab_size=249261,
    hidden_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    num_kv_heads=4,  # GQA
    intermediate_size=2048,
    max_position_embeddings=4096,  # long-context
    rope_scaling={"type": "linear", "factor": 8.0},
    # MoE-specific parameters
    num_experts=4,
    experts_per_token=2
)
```
Evaluation
Text Quality Assessment
The model demonstrates:
- Coherent Thai text generation
- Proper tokenization without mojibake
- Reasonable perplexity for initial training
Note that the limited training run may affect long-form generation quality.
Comparison with Base Models
- Tokenization: Significant improvement over ByteLevelBPE
- Whitespace Handling: Full preservation of spaces, tabs, and newlines
- Thai Script Handling: Better Unicode normalization
- Round-trip Accuracy: Improved encode/decode consistency
Fine-tuning Recommendations
For production use, consider:
- Extended Training: Increase epochs and training data
- Domain Adaptation: Fine-tune on domain-specific Thai corpora
- Hardware Optimization: Use GPU training for larger batch sizes
- Hyperparameter Tuning: Optimize learning rate and architecture
- Evaluation: Implement comprehensive Thai language benchmarks
- Tokenizer Customization: For special whitespace or formatting needs, use app-level pre/post-processing
References
- Tokenizer Advisor: Koichi Yasuoka
- Training Dataset: ZombitX64/Wikipedia-Thai
- Architecture: Custom SLMForCausalLM with Mixture of Experts
Contributing
This model is part of ongoing research in Thai language processing. Contributions, feedback, and collaborations are welcome!
📄 Citation
```bibtex
@misc{Hanuman,
  title={Thai HanumanSLM},
  author={JonusNattapong and Koichi Yasuoka},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/JonusNattapong/Hanuman},
  note={Tokenizer advisor: Koichi Yasuoka}
}
```
Note: This is an initial model trained for research purposes. For production applications, additional fine-tuning and evaluation are recommended.
Changelog (2025-08-16):
- Context window increased to 4,096 tokens (RoPE scaling)
- Tokenizer upgraded for full whitespace/newline/tab preservation
- Model weights now provided in safetensors format (no pickle)
- All configs and generation defaults synchronized