	---
	license: llama3.2
	base_model: meta-llama/Llama-3.2-8B-Instruct
	tags:
	- text-generation
	- instruction
	- datafusion
	- rust
	- code
	---

	![transformers](https://img.shields.io/badge/transformers-yes-green)


	Author: yarenty
	Model type: Llama 3.2 (fine-tuned)
	Task: Instruction-following, code Q/A, DataFusion expert assistant
	License: Llama 3.2 Community License
	Visibility: Public

	---


	# Llama 3.2 DataFusion Instruct

	This model is a fine-tuned version of meta-llama/Llama-3.2-8B-Instruct, specialized for the [Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) ecosystem. It is designed as an assistant for developers: it answers technical questions, generates code, and explains concepts related to DataFusion, arrow-rs, Ballista, and the broader Rust data engineering landscape.

	GGUF Version: A quantized GGUF build for low-resource deployment is available [here](https://huggingface.co/yarenty/llama32-datafusion-instruct-gguf).

	## Model Description

	This model was fine-tuned on a curated dataset of high-quality question-answer pairs and instruction-following examples sourced from the official DataFusion documentation, source code, mailing lists, and community discussions.

	- Model Type: Instruction-following Large Language Model (LLM)
	- Base Model: `meta-llama/Llama-3.2-8B-Instruct`
	- Primary Use: Developer assistant for the DataFusion ecosystem.

	## Prompt Template

	To get the best results, format your prompts using the following instruction template.

	```
	### Instruction:
	{Your question or instruction here}

	### Response:
	```
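
The template can also be filled in programmatically. The helper below is a minimal sketch: `build_prompt` is a hypothetical name, not part of the model repository, and it assumes exactly the layout shown above, with the response section left empty for the model to complete.

```python
def build_prompt(instruction: str) -> str:
    """Wrap a question in the instruction template (hypothetical helper)."""
    # Strip surrounding whitespace so the section headers stay on their own lines.
    return f"### Instruction:\n{instruction.strip()}\n\n### Response:"


prompt = build_prompt("How do I register a Parquet file in DataFusion?")
print(prompt)
```

Keeping the template in one function avoids small formatting drift (extra spaces, missing blank line) between prompts.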

	## Example Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = "yarenty/llama32-datafusion-instruct"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

	# The model was trained with a specific instruction template.
	# For optimal performance, your prompt should follow this structure.
	prompt_template = """### Instruction:
	How do I register a Parquet file in DataFusion?

	### Response:"""

	inputs = tokenizer(prompt_template, return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=tokenizer.eos_token_id)

	# Decode the output, skipping special tokens and the prompt
	prompt_length = inputs["input_ids"].shape[1]
	print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
	```
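
Models trained on a fixed template sometimes keep generating past the answer and begin a new `### Instruction:` block. Whether this model does so is an assumption, not something this card documents. If it happens, the decoded text can be cut at the next template header. `trim_response` is a hypothetical helper name, and the sample string stands in for real model output.

```python
def trim_response(generated: str, stop_markers=("### Instruction:", "### Response:")) -> str:
    """Cut decoded text at the first template header that appears in it."""
    cut = len(generated)
    for marker in stop_markers:
        idx = generated.find(marker)
        if idx != -1:
            # Keep the earliest marker position seen so far.
            cut = min(cut, idx)
    return generated[:cut].strip()


# Placeholder text standing in for decoded model output.
decoded = "Example answer text.\n\n### Instruction:\nAnother question"
print(trim_response(decoded))
```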

	## Training Procedure

	- Hardware: Trained on 1x NVIDIA A100 GPU.
	- Training Script: Custom script using `SFTTrainer` from the `trl` library.
	- Key Hyperparameters:
	  - Epochs: 3
	  - Learning Rate: 2e-5
	  - Batch Size: 4
	- Dataset: A curated dataset of ~5,000 high-quality QA pairs and instructions related to DataFusion. Data was cleaned and deduplicated as per the notes in `pitfalls.md`.
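
As a rough sanity check on the training budget, the hyperparameters above imply a few thousand optimizer steps. The calculation below assumes exactly 5,000 examples, that the batch size of 4 is the effective batch size (one GPU, no gradient accumulation), and that a final partial batch would be kept. None of these details is stated in this card.

```python
import math

num_examples = 5_000   # assumption: the card says "~5,000"
batch_size = 4         # assumption: effective batch size, no gradient accumulation
epochs = 3

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 1250 3750
```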

	## Intended Use & Limitations

	- Intended Use: This model is intended for developers and data engineers working with DataFusion. It can be used for code generation, debugging assistance, and learning the library. It can also serve as a strong base for further fine-tuning on more specialized data.
	- Limitations: The model's knowledge is limited to the data it was trained on. It may produce inaccurate or outdated information for rapidly evolving parts of the library. It is not a substitute for official documentation or expert human review.

	## Citation

	If you find this model useful in your work, please cite:
	```bibtex
	@misc{yarenty_2025_llama32_datafusion_instruct,
	  author = {yarenty},
	  title = {Llama 3.2 DataFusion Instruct},
	  year = {2025},
	  publisher = {Hugging Face},
	  journal = {Hugging Face repository},
	  howpublished = {\url{https://huggingface.co/yarenty/llama32-datafusion-instruct}}
	}
	```

	## Contact
	For questions or feedback, please open an issue on the Hugging Face repository or the [source GitHub repository](https://github.com/yarenty/trainer).