Thai HanumanSLM

Hanuman is a Small Language Model (HanumanSLM) specifically designed for Thai text generation. This model uses a Mixture of Experts (MoE) architecture and ships with a custom fast tokenizer optimized for Thai whitespace/newline preservation.
Tokenizer advisor: Koichi Yasuoka
Important Notes
- Context Window: Supports up to 4,096 tokens (via RoPE scaling)
- Tokenizer: Fast tokenizer with full whitespace/newline/tab preservation; no remote code
- Serialization: Model weights provided in safetensors format for security (no pickle)
- Device: Model supports both CPU and GPU inference
- Torch Version: Compatible with PyTorch 1.9+ and transformers 4.20+
Model Details
Model Architecture
- Model Type: SLMForCausalLM (Small Language Model with Mixture of Experts)
- Language: Thai (th)
- License: cc-by-nc-4.0
How the Mixture of Experts (MoE) Model Works
The HanumanSLM MoE model uses a Mixture of Experts architecture to improve flexibility and capacity for Thai text generation. Here’s how it works:
- Embedding Layer: Input tokens are converted to dense vectors using an embedding layer.
- MoE Layer: Multiple expert networks (each a small neural network) process the token embeddings. A gating network decides, for each token, which experts to use and how much to weight their outputs. The top-k experts (as set in the config) are selected for each token, and their outputs are combined using the gating probabilities.
- Output Layer: The combined expert output is passed through a final linear layer to produce logits for each vocabulary token.
- Expert Usage Logging: The gating probabilities for each token are stored and can be analyzed to see which experts are used most often.
This design allows the model to dynamically route different tokens or contexts to different experts, improving generation quality and model efficiency.
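The routing described above can be illustrated with a small PyTorch sketch. This is not the actual SLMForCausalLM implementation (which also includes attention with GQA and RoPE scaling); the class and variable names are illustrative, while the sizes (hidden_size=512, intermediate_size=2048, num_experts=4, experts_per_token=2) are taken from the configuration reported in this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Illustrative top-k expert routing; not the actual HanumanSLM code."""

    def __init__(self, hidden_size=512, intermediate_size=2048,
                 num_experts=4, experts_per_token=2):
        super().__init__()
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.GELU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(hidden_size, num_experts)  # gating network
        self.top_k = experts_per_token
        self.last_gate_probs = None  # kept for expert-usage logging

    def forward(self, hidden_states):  # (batch, seq_len, hidden_size)
        gate_probs = F.softmax(self.gate(hidden_states), dim=-1)
        self.last_gate_probs = gate_probs.detach()  # inspect which experts fire most often
        top_probs, top_idx = gate_probs.topk(self.top_k, dim=-1)
        top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)  # renormalize over selected experts
        output = torch.zeros_like(hidden_states)
        # Dense (inefficient) routing: every expert runs on every token,
        # and the mask keeps only the tokens actually routed to it.
        for e, expert in enumerate(self.experts):
            expert_out = expert(hidden_states)
            for k in range(self.top_k):
                mask = (top_idx[..., k] == e).unsqueeze(-1).type_as(top_probs)
                output = output + mask * top_probs[..., k:k + 1] * expert_out
        return output

# Token embeddings in, combined expert output out (an LM head would then produce logits).
moe = ToyMoELayer()
dummy_embeddings = torch.randn(1, 16, 512)
print(moe(dummy_embeddings).shape)  # torch.Size([1, 16, 512])
```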
Configuration
- Vocabulary Size: 249,261 tokens
- Hidden Size: 512
- Number of Layers: 8
- Attention Heads: 8 query heads with 4 key/value heads (GQA, num_kv_heads=4)
- Intermediate Size: 2,048
- Max Position Embeddings: 4,096 (long-context enabled)
- Tokenizer model_max_length: 4,096
- Architecture: Mixture of Experts (MoE)
Training Details
- Dataset: ZombitX64/Wikipedia-Thai
- Training Method: Causal Language Modeling from scratch
- Optimizer: AdamW
- Learning Rate Scheduler: CosineAnnealingLR (see the sketch after this list)
- Epochs: 2
- Training Steps: 100
- Hardware: CPU-optimized training
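A minimal sketch of the optimizer/scheduler pairing listed above. The model here is a stand-in, and the values (lr=5e-5, T_max=100) mirror the learning rate and step count reported in this card rather than verified training code.

```python
import torch

model = torch.nn.Linear(512, 512)  # placeholder for the actual SLMForCausalLM
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for step in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(2, 512)).pow(2).mean()  # dummy loss standing in for the LM loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # cosine-annealed learning rate, one step per optimizer update
```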
Performance Metrics
| Metric | Value |
|---|---|
| Training Loss | 5.515 |
| Evaluation Loss | 4.722 |
| Perplexity | 112.37 |
| Final Learning Rate | 1.01e-05 |

Perplexity is the exponential of the evaluation loss: exp(4.722) ≈ 112.4, matching the reported value.
Recent Training Run (2025-08-16)
- Epochs: 3
- Global Steps: 189
- Batch Size: 4
- Logging Steps: 50
- Eval/Save Steps: 500
- Total FLOPs: 71,913,893,616,384
- Training Stopped: Yes
- Best Model Checkpoint: None
- Best Metric: None
Log History:
| Step | Epoch | Loss | Grad Norm | Learning Rate |
|---|---|---|---|---|
| 50 | 0.8 | 17.472 | 16.52 | 2.45e-05 |
| 100 | 1.592 | 4.819 | 9.08 | 4.95e-05 |
| 150 | 2.384 | 0.562 | 3.70 | 2.25e-05 |
Intended Use
Primary Use Cases
- Thai Text Generation: Generate coherent Thai text for various applications
- Content Creation: Assist in creating Thai content for blogs, articles, and social media
- Educational Tools: Support Thai language learning and teaching applications
- Research: Academic research in Thai NLP and language modeling
Limitations
- Training Scale: Model was trained for only 2 epochs on a subset of data
- Hardware Constraints: Optimized for CPU training, may benefit from GPU fine-tuning
- Domain: Primarily trained on Wikipedia data, may need domain-specific fine-tuning
- Quality: Initial model - consider further fine-tuning for production use
Technical Implementation
Tokenizer & Context
- Fast tokenizer preserves all whitespace, newlines, and tabs (no remote code required)
- Special tokens are handled via app-level pre/post-processing
- Round-trip encode/decode accuracy confirmed (see the check below)
- Context window extended to 4,096 tokens using RoPE scaling (linear, factor 8)
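The whitespace preservation and round-trip behavior described above can be spot-checked with a short script; the sample string is arbitrary, and the assertion simply encodes the behavior this card reports.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("JonusNattapong/Hanuman")

# Mixed Thai text with a tab, a newline, and a double space.
sample = "สวัสดี\tครับ\nบรรทัดใหม่  เว้นวรรคสองช่อง"
ids = tokenizer(sample, add_special_tokens=False)["input_ids"]
decoded = tokenizer.decode(ids)

assert decoded == sample  # whitespace, tabs, and newlines survive the round trip
```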
Text Normalization
The training pipeline includes the following normalization steps (sketched below):
- Unicode NFC normalization
- Thai-Latin script spacing optimization
- Consistent encoding/decoding for round-trip accuracy
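A minimal sketch of these steps, assuming a simple rule for Thai-Latin spacing (insert a space at Thai/Latin boundaries); the actual pipeline's rule may differ.

```python
import re
import unicodedata

THAI = r"\u0E00-\u0E7F"  # Thai Unicode block

def normalize_thai(text: str) -> str:
    text = unicodedata.normalize("NFC", text)                   # Unicode NFC normalization
    text = re.sub(rf"([{THAI}])([A-Za-z0-9])", r"\1 \2", text)  # Thai followed by Latin/digit
    text = re.sub(rf"([A-Za-z0-9])([{THAI}])", r"\1 \2", text)  # Latin/digit followed by Thai
    return text

print(normalize_thai("ภาษาPythonใช้งานง่าย"))  # -> "ภาษา Python ใช้งานง่าย"
```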
Usage Examples
Basic Text Generation
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("JonusNattapong/Hanuman")
model = AutoModelForCausalLM.from_pretrained("JonusNattapong/Hanuman", trust_remote_code=False)

def generate_thai_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
result = generate_thai_text("เทคโนโลยีปัญญาประดิษฐ์")
print(result)
```
Batch Processing
```python
prompts = [
    "สวัสดีครับ",
    "ประเทศไทยมีพื้นที่",
    "การศึกษาในยุคดิจิทัล"
]

for prompt in prompts:
    result = generate_thai_text(prompt, max_length=80)
    print(f"Input: {prompt}")
    print(f"Output: {result}")
    print("-" * 50)
```
Training Process
Dataset Preparation
- Source: ZombitX64/Wikipedia-Thai (streaming mode)
- Preprocessing: Text cleaning and tokenization with the custom tokenizer (see the sketch below)
- Normalization: Unicode NFC + Thai-Latin spacing
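A sketch of the streaming preprocessing described above, reusing the normalize_thai helper sketched in the Text Normalization section. The "text" column name and the "train" split are assumptions about the dataset schema, not verified details of the original pipeline.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("JonusNattapong/Hanuman")
dataset = load_dataset("ZombitX64/Wikipedia-Thai", split="train", streaming=True)

def preprocess(example):
    text = normalize_thai(example["text"])  # normalization helper sketched above
    return tokenizer(text, truncation=True, max_length=4096)

tokenized_dataset = dataset.map(preprocess)
next(iter(tokenized_dataset))  # streaming: examples are fetched and processed lazily
```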
Training Configuration
```python
training_args = {
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 2,
    "learning_rate": 5e-5,
    "warmup_steps": 10,
    "logging_steps": 10,
    "eval_steps": 50,
    "save_steps": 50,
    "fp16": False,  # CPU training
    "dataloader_num_workers": 0
}
```
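These settings map onto transformers.TrainingArguments; the sketch below shows one way to wire them into the Trainer. The output_dir value is a placeholder, and model, train_dataset, and eval_dataset stand for the objects prepared in the preceding sections rather than variables defined here.

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(output_dir="./hanuman-checkpoints", **training_args)
trainer = Trainer(
    model=model,                  # the SLMForCausalLM instance
    args=args,
    train_dataset=train_dataset,  # tokenized training split
    eval_dataset=eval_dataset,    # tokenized evaluation split
)
trainer.train()
```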
Model Architecture
```python
config = SLMConfig(
    vocab_size=249261,
    hidden_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    num_kv_heads=4,  # GQA
    intermediate_size=2048,
    max_position_embeddings=4096,  # long-context
    rope_scaling={"type": "linear", "factor": 8.0},
    # MoE-specific parameters
    num_experts=4,
    experts_per_token=2
)
```
Evaluation
Text Quality Assessment
The model demonstrates:
- Coherent Thai text generation
- Proper tokenization without mojibake
- Reasonable perplexity for initial training
Note that the limited training run may affect long-form generation quality.
Comparison with Base Models
- Tokenization: Significant improvement over ByteLevelBPE
- Whitespace Handling: Full preservation of spaces, tabs, and newlines
- Thai Script Handling: Better Unicode normalization
- Round-trip Accuracy: Improved encode/decode consistency
Fine-tuning Recommendations
For production use, consider:
- Extended Training: Increase epochs and training data
- Domain Adaptation: Fine-tune on domain-specific Thai corpora
- Hardware Optimization: Use GPU training for larger batch sizes
- Hyperparameter Tuning: Optimize learning rate and architecture
- Evaluation: Implement comprehensive Thai language benchmarks
- Tokenizer Customization: For special whitespace or formatting needs, use app-level pre/post-processing
References
- Tokenizer Advisor: Koichi Yasuoka
- Training Dataset: ZombitX64/Wikipedia-Thai
- Architecture: Custom SLMForCausalLM with Mixture of Experts
Contributing
This model is part of ongoing research in Thai language processing. Contributions, feedback, and collaborations are welcome!
📄 Citation
```bibtex
@misc{Hanuman,
  title={Thai HanumanSLM},
  author={JonusNattapong and Koichi Yasuoka},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/JonusNattapong/Hanuman},
  note={Tokenizer advisor: Koichi Yasuoka}
}
```
Note: This is an initial model trained for research purposes. For production applications, additional fine-tuning and evaluation are recommended.
Changelog (2025-08-16):
- Context window increased to 4,096 tokens (RoPE scaling)
- Tokenizer upgraded for full whitespace/newline/tab preservation
- Model weights now provided in safetensors format (no pickle)
- All configs and generation defaults synchronized