Shoonya v0.1 - Lightweight CPU-Friendly Language Model

Model Description

Shoonya is a lightweight transformer-based language model designed specifically for CPU inference. Its compact architecture keeps memory and compute requirements low while maintaining coherent text generation.

Key Features

  • CPU-Optimized: Designed to run efficiently on CPU-only environments
  • Lightweight: Only 4 transformer layers with 128 hidden dimensions
  • Memory Efficient: ~15MB model size (quantized version ~4MB)
  • Fast Inference: Suitable for real-time text generation on consumer hardware

Technical Details

  • Architecture: Transformer-based language model
    • 4 transformer layers
    • 4 attention heads per layer
    • 128 hidden dimensions
    • 256 intermediate (feed-forward) size
    • 128-token maximum sequence length
  • Vocabulary: GPT-2 tokenizer (50,257 tokens)
  • Training: Fine-tuned on TinyStories dataset (1,000 examples)
  • Quantization: 8-bit dynamic quantization available for further size reduction
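
For reference, these hyperparameters describe a model along the lines of the sketch below. This is an illustrative stand-in assuming a standard decoder-only design; the actual TransformerLM in model/transformer.py may differ in detail.

# Illustrative sketch of the architecture implied by the hyperparameters
# above; NOT the actual TransformerLM implementation.
import torch
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    def __init__(self, vocab_size=50257, hidden=128, n_layers=4,
                 n_heads=4, ffn=256, max_seq=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)   # token embeddings
        self.pos_emb = nn.Embedding(max_seq, hidden)      # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, dim_feedforward=ffn,
            batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(hidden, vocab_size)      # next-token logits

    def forward(self, ids):
        # ids: (batch, seq) token IDs with seq <= 128
        positions = torch.arange(ids.size(1), device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(positions)
        # Causal mask so each position attends only to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        x = self.blocks(x, mask=mask)
        return self.lm_head(x)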

Usage

from transformers import AutoTokenizer
from model.transformer import TransformerLM  # custom implementation shipped with this repo

# Load the pretrained weights from the Hugging Face Hub
model = TransformerLM.from_pretrained("vaidhyamegha/shoonya-v0.1")

# GPT-2 tokenizer, matching the model's 50,257-token vocabulary
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Generate a short continuation (the context window is 128 tokens)
prompt = "Once upon a time"
generated = model.generate(prompt, max_length=50)
print(generated)

Performance Characteristics

  • Memory Usage: <2GB RAM during inference
  • Model Size:
    • Full model: ~15MB
    • Quantized version: ~4MB
  • Speed: ~100 ms per inference on a standard CPU (see the timing sketch below)
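
The latency figure above can be sanity-checked with a simple wall-clock measurement. The sketch below assumes the generate API shown under Usage and is illustrative rather than a rigorous benchmark.

# Hypothetical timing sketch; assumes the Usage API above
import time
from model.transformer import TransformerLM

model = TransformerLM.from_pretrained("vaidhyamegha/shoonya-v0.1")

start = time.perf_counter()
model.generate("Once upon a time", max_length=50)
print(f"Generation took {(time.perf_counter() - start) * 1000:.1f} ms")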

Limitations

  • Limited context window (128 tokens)
  • Trained on a small subset of data
  • Best suited for short-form creative writing
  • May produce repetitive text on longer generations

Training

Shoonya was fine-tuned on a curated 1,000-example subset of the TinyStories dataset, focusing on short, coherent narratives. The model uses a custom implementation of the transformer architecture (model/transformer.py) with optimizations for CPU inference.
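
As a rough illustration, a training subset like this could be loaded and tokenized as follows. The dataset ID roneneldan/TinyStories and the text field name are assumptions based on the public TinyStories release, not taken from this repository.

# Hypothetical data-loading sketch; dataset ID and field name are assumed
from datasets import load_dataset
from transformers import AutoTokenizer

stories = load_dataset("roneneldan/TinyStories", split="train[:1000]")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(batch):
    # Truncate to the model's 128-token context window
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = stories.map(tokenize, batched=True, remove_columns=["text"])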

License

[Add your chosen license]

Citation

@misc{shoonya2025,
  author = {VaidhyaMegha},
  title = {Shoonya: A Lightweight CPU-Friendly Language Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
}

Intended Use

This model is designed for:

  • Prototyping and experimentation
  • Educational purposes
  • CPU-only environments
  • Resource-constrained settings
  • Short-form text generation

Quantization

The model comes in two variants:

  1. Full precision (shoonya_model_v0_1.pt)
  2. 8-bit quantized (shoonya_model_v0_1_quantized.pt)

The quantized version reduces the on-disk size from ~15MB to ~4MB while maintaining reasonable output quality.
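
A quantized checkpoint like this can be produced with PyTorch's dynamic quantization. The sketch below is a minimal example, assuming TransformerLM is a standard torch.nn.Module whose linear layers are torch.nn.Linear:

# Sketch of 8-bit dynamic quantization; assumes standard nn.Linear layers
import torch
from model.transformer import TransformerLM

model = TransformerLM.from_pretrained("vaidhyamegha/shoonya-v0.1")
model.eval()

# Replace nn.Linear weights with int8 equivalents; activations stay float
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

torch.save(quantized.state_dict(), "shoonya_model_v0_1_quantized.pt")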
