Developed by: Kesara Sunayana
Funded by: SAWiT AI Hackathon
Shared by: Kesara Sunayana
Model type: Fine-tuned Transformer-based LLM
Language(s) (NLP): Telugu (colloquial; the same recipe can be adapted to other languages)
License: Apache 2.0
Finetuned from model: mistralai/Mistral-7B-v0.1
Uses
Direct Use
The model is designed for applications that require colloquial text generation and understanding, such as:
Chatbots & Virtual Assistants
Social Media Analytics
Informal Text Generation
Regional Language Processing
Downstream Use
This model can be further fine-tuned for tasks such as (see the sketch after this list):
Sentiment Analysis in colloquial language
Voice Assistants
Informal Question Answering Systems
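As an illustration of the sentiment-analysis case, the checkpoint can be given a fresh classification head and fine-tuned on labeled colloquial text. The snippet below is a minimal sketch, not part of the released code: the repository id is the placeholder used elsewhere in this card, and the three-label scheme is an assumption.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "your_username/finetuned_colloquial_model"  # placeholder repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # causal LM tokenizers often lack a pad token

# Attach a new 3-way head (negative / neutral / positive is an assumed label set).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)
model.config.pad_token_id = tokenizer.pad_token_id

From here, training proceeds with a standard Trainer loop over labeled examples; the hyperparameters listed under Training Procedure are a reasonable starting point.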
Out-of-Scope Use
Not suitable for formal language processing.
Should not be used for generating harmful, offensive, or misleading content.
Bias, Risks, and Limitations
May reflect biases present in the training data.
Performance may degrade on formal text inputs.
Could misinterpret ambiguous or code-mixed language.
Recommendations
Users should verify outputs for accuracy, especially in sensitive applications. Further fine-tuning with broader datasets is recommended.
How to Get Started with the Model
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("your_username/finetuned_colloquial_model")
model = AutoModelForCausalLM.from_pretrained("your_username/finetuned_colloquial_model")

def generate_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_length=50)
    return tokenizer.decode(output[0], skip_special_tokens=True)
print(generate_response("Hello, how are you?"))
Training Details
Training Data
Dataset: https://huggingface.co/datasets/Sunayanajagadesh/Translation_English_to_Telugu/blob/main/EI_CI.xlsx
Preprocessing: Tokenization, Data Augmentation (a loading and tokenization sketch follows)
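Since the dataset is distributed as an Excel sheet, one plausible way to load and tokenize it is shown below. This is a sketch, not the original preprocessing script; the column name "text" and the sequence length are assumptions.

import pandas as pd
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# Download the spreadsheet from the dataset repository.
path = hf_hub_download(
    repo_id="Sunayanajagadesh/Translation_English_to_Telugu",
    filename="EI_CI.xlsx",
    repo_type="dataset",
)
df = pd.read_excel(path)  # requires openpyxl

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token

# "text" is an assumed column name; adjust to the actual sheet headers.
encodings = tokenizer(df["text"].astype(str).tolist(), truncation=True, padding=True, max_length=128)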
Training Procedure
Preprocessing: Tokenization using AutoTokenizer
Training Regime: Mixed-precision (fp16)
Hyperparameters (reflected in the Trainer sketch after this list):
Batch Size: 2
Epochs: 3
Learning Rate: 2e-5
Optimizer: AdamW
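Put together, the settings above correspond roughly to the following Trainer configuration. This is a reconstruction for illustration, not the exact training script; the output directory and train_dataset are placeholders.

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

args = TrainingArguments(
    output_dir="colloquial-finetune",  # placeholder path
    per_device_train_batch_size=2,     # Batch Size: 2
    num_train_epochs=3,                # Epochs: 3
    learning_rate=2e-5,                # Learning Rate: 2e-5
    optim="adamw_torch",               # Optimizer: AdamW
    fp16=True,                         # Mixed-precision (fp16)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # tokenized dataset prepared as in Training Data
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()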
Speeds, Sizes, Times
Training Time: [Number of Hours]
Model Size: [Size in GB]
Throughput: [Tokens per second]
Evaluation
Testing Data, Factors & Metrics
Testing Data
Dataset: https://huggingface.co/datasets/Sunayanajagadesh/Translation_English_to_Telugu/blob/main/EI_CI.xlsx
Factors
Colloquial Phrase Understanding
Accuracy on Slang and Regional Texts
Metrics
BLEU Score: 89.2
ROUGE Score: 81.5
F1 Score: 85.3
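This card does not record how the scores were computed; a common way to reproduce comparable numbers is the Hugging Face evaluate library, sketched below with placeholder strings in place of real model generations and references.

import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Placeholder texts; in practice these come from generations on the test set.
predictions = ["model output text"]
references = ["reference text"]

print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))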
Results
High performance on colloquial conversations.
Better understanding of regional slang than generic models.
May need fine-tuning for formal contexts.
Environmental Impact
Hardware Type: [GPU Used]
Hours used: [Training Time]
Cloud Provider: [AWS/GCP/etc.]
Model Architecture and Objective
Decoder-only Transformer (Mistral-7B) fine-tuned on informal text data with a causal language modeling objective.
Citation
@misc{kesarasunayana_2025,
  title={Colloquial Language Model},
  author={Kesara Sunayana},
  year={2025},
  howpublished={\url{https://huggingface.co/your_username/finetuned_colloquial_model}}
}
Glossary
LLM: Large Language Model
BLEU Score: Measures n-gram overlap between generated and reference text (precision-oriented).
ROUGE Score: Measures overlap between generated and reference text (recall-oriented).
Model Card Authors
Kesara Sunayana
Model Card Contact