|
--- |
|
license: llama3.2 |
|
base_model: meta-llama/Llama-3.2-8B-Instruct |
|
tags: |
|
- text-generation |
|
- instruction |
|
- datafusion |
|
- rust |
|
- code |
|
--- |
|
|
|
 |
|
|
|
|
|
**Author:** yarenty |
|
**Model type:** Llama 3.2 (fine-tuned) |
|
**Task:** Instruction-following, code Q/A, DataFusion expert assistant |
|
**License:** Apache 2.0 |
|
**Visibility:** Public |
|
|
|
--- |
|
|
|
|
|
# Llama 3.2 DataFusion Instruct |
|
|
|
This model is a fine-tuned version of **meta-llama/Llama-3.2-8B-Instruct**, specialized for the [Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) ecosystem. It's designed to be a helpful assistant for developers, answering technical questions, generating code, and explaining concepts related to DataFusion, Arrow.rs, Ballista, and the broader Rust data engineering landscape. |
|
|
|
**GGUF Version:** For quantized, low-resource deployment, you can find the GGUF version [here](<https://huggingface.co/yarenty/llama32-datafusion-instruct-gguf>). |
|
|
|
## Model Description |
|
|
|
This model was fine-tuned on a curated dataset of high-quality question-answer pairs and instruction-following examples sourced from the official DataFusion documentation, source code, mailing lists, and community discussions. |
|
|
|
- **Model Type:** Instruction-following Large Language Model (LLM) |
|
- **Base Model:** `meta-llama/Llama-3.2-8B-Instruct` |
|
- **Primary Use:** Developer assistant for the DataFusion ecosystem. |
|
|
|
## Prompt Template |
|
|
|
To get the best results, format your prompts using the following instruction template. |
|
|
|
``` |
|
### Instruction: |
|
{Your question or instruction here} |
|
|
|
### Response: |
|
``` |
|
|
|
## Example Usage |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
model_id = "yarenty/llama32-datafusion-instruct" |
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto") |
|
|
|
# The model was trained with a specific instruction template. |
|
# For optimal performance, your prompt should follow this structure. |
|
prompt_template = """### Instruction: |
|
How do I register a Parquet file in DataFusion? |
|
|
|
### Response:""" |
|
|
|
inputs = tokenizer(prompt_template, return_tensors="pt").to(model.device) |
|
outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=tokenizer.eos_token_id) |
|
|
|
# Decode the output, skipping special tokens and the prompt |
|
prompt_length = inputs["input_ids"].shape[1] |
|
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)) |
|
``` |
|
|
|
## Training Procedure |
|
|
|
- **Hardware:** Trained on 1x NVIDIA A100 GPU. |
|
- **Training Script:** Custom script using `transformers.SFTTrainer`. |
|
- **Key Hyperparameters:** |
|
- Epochs: 3 |
|
- Learning Rate: 2e-5 |
|
- Batch Size: 4 |
|
- **Dataset:** A curated dataset of ~5,000 high-quality QA pairs and instructions related to DataFusion. Data was cleaned and deduplicated as per the notes in `pitfalls.md`. |
|
|
|
## Intended Use & Limitations |
|
|
|
- **Intended Use:** This model is intended for developers and data engineers working with DataFusion. It can be used for code generation, debugging assistance, and learning the library. It can also serve as a strong base for further fine-tuning on more specialized data. |
|
- **Limitations:** The model's knowledge is limited to the data it was trained on. It may produce inaccurate or outdated information for rapidly evolving parts of the library. It is not a substitute for official documentation or expert human review. |
|
|
|
## Citation |
|
|
|
If you find this model useful in your work, please cite: |
|
``` |
|
@misc{yarenty_2025_llama32_datafusion_instruct, |
|
author = {yarenty}, |
|
title = {Llama 3.2 DataFusion Instruct}, |
|
year = {2025}, |
|
publisher = {Hugging Face}, |
|
journal = {Hugging Face repository}, |
|
howpublished = {\url{https://huggingface.co/yarenty/llama32-datafusion-instruct}} |
|
} |
|
``` |
|
|
|
## Contact |
|
For questions or feedback, please open an issue on the Hugging Face repository or the [source GitHub repository](https://github.com/yarenty/trainer). |