Developed by: Kesara Sunayana

Funded by: SAWiT AI Hackathon

Shared by: Kesara Sunayana

Model type: Fine-tuned Transformer-based LLM

Language(s) (NLP): Hindi (the language can be changed depending on the data selected for fine-tuning)

License: Apache 2.0

Finetuned from model: mistralai/Mistral-7B-v0.1

Uses

Direct Use

The model is designed for applications that require colloquial text generation and understanding, such as:

Chatbots & Virtual Assistants

Social Media Analytics

Informal Text Generation

Regional Language Processing

Downstream Use

This model can be further fine-tuned for:

Sentiment Analysis in colloquial language (see the sketch after this list)

Voice Assistants

Informal Question Answering Systems
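For example, a downstream sentiment classifier could reuse this checkpoint as its backbone. The snippet below is only a minimal sketch of that idea; the repository name and the two-label setup are assumptions for illustration, not part of this release.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical: load the fine-tuned checkpoint with a classification head for sentiment analysis
checkpoint = "your_username/finetuned_colloquial_model"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,  # e.g. positive / negative; adjust to your label set
)
# The resulting model can then be trained on labeled colloquial sentences, e.g. with the Trainer API.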

Out-of-Scope Use

Not suitable for formal language processing.

Should not be used for generating harmful, offensive, or misleading content.

Bias, Risks, and Limitations

May reflect biases present in the training data.

Performance may degrade on formal text inputs.

Could misinterpret ambiguous or code-mixed language.

Recommendations

Users should verify outputs for accuracy, especially in sensitive applications. Further fine-tuning with broader datasets is recommended.

How to Get Started with the Model

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("your_username/finetuned_colloquial_model")
model = AutoModelForCausalLM.from_pretrained("your_username/finetuned_colloquial_model")

def generate_response(prompt):
    # Tokenize the prompt and generate a short completion
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_length=50)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(generate_response("Hello, how are you?"))
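For more varied colloquial output, sampling options can be passed to generate. The values below are illustrative defaults, not settings taken from the original training or evaluation runs:

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
output = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.7,   # illustrative value; tune per use case
    top_p=0.9,         # nucleus sampling
)
print(tokenizer.decode(output[0], skip_special_tokens=True))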

Training Details

Training Data

Dataset: https://huggingface.co/datasets/Sunayanajagadesh/Translation_English_to_Telugu/blob/main/EI_CI.xlsx

Preprocessing: Tokenization, Data Augmentation

Training Procedure

Preprocessing: Tokenization using AutoTokenizer

Training Regime: Mixed-precision (fp16)

Hyperparameters (see the training sketch after this list):

Batch Size: 2

Epochs: 3

Learning Rate: 2e-5

Optimizer: AdamW
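A minimal outline of this configuration with the Hugging Face Trainer API is shown below. It is a sketch, not the exact training script used for this model; the toy dataset, output directory name, and padding setup are assumptions for illustration.

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "mistralai/Mistral-7B-v0.1"  # base checkpoint listed above
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # assumed: reuse EOS as pad token for batching
model = AutoModelForCausalLM.from_pretrained(base_model)

# Toy stand-in for the colloquial dataset; replace with the real training data
raw = Dataset.from_dict({"text": ["Namaste, kya haal hai?", "Ela unnaru?"]})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

training_args = TrainingArguments(
    output_dir="finetuned_colloquial_model",  # assumed output path
    per_device_train_batch_size=2,            # Batch Size: 2
    num_train_epochs=3,                       # Epochs: 3
    learning_rate=2e-5,                       # Learning Rate: 2e-5
    optim="adamw_torch",                      # Optimizer: AdamW
    fp16=True,                                # Mixed-precision (fp16)
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()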

Speeds, Sizes, Times

Training Time: [Number of Hours]

Model Size: [Size in GB]

Throughput: [Tokens per second]

Evaluation

Testing Data, Factors & Metrics

Testing Data

Dataset: https://huggingface.co/datasets/Sunayanajagadesh/Translation_English_to_Telugu/blob/main/EI_CI.xlsx

Factors

Colloquial Phrase Understanding

Accuracy on Slang and Regional Texts

Metrics

BLEU Score: 89.2

ROUGE Score: 81.5

F1 Score: 85.3
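For reference, scores of this kind can be computed with the evaluate library. The predictions and references below are hypothetical placeholders, not samples from the actual test set:

import evaluate

predictions = ["adhi chala bagundi"]    # hypothetical model output
references = [["adhi chala bagundi"]]   # hypothetical gold reference(s)

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=[r[0] for r in references]))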

Results

High performance on colloquial conversations.

Better understanding of regional slang than generic models.

May need fine-tuning for formal contexts.

Environmental Impact

Hardware Type: [GPU Used]

Hours used: [Training Time]

Cloud Provider: [AWS/GCP/etc.]

Model Architecture and Objective

Transformer-based decoder model fine-tuned on informal text data.

Citation

@misc{kesarasunayana_2025,
  title={Colloquial Language Model},
  author={Kesara Sunayana},
  year={2025},
  howpublished={\url{https://huggingface.co/your_username/finetuned_colloquial_model}}
}

Glossary

LLM: Large Language Model

BLEU Score: Evaluates text generation quality.

ROUGE Score: Measures recall-oriented understanding.

Model Card Authors

Kesara Sunayana

Model Card Contact

[email protected]

https://github.com/Sunnu15
