# BSG CyLLama Setup and Usage Guide

This guide explains how to set up and use the BSG CyLLama scientific summarization model.

## Overview

BSG CyLLama is a LoRA-adapted Llama-3.2-1B-Instruct model fine-tuned for scientific text summarization. The model excels at generating high-quality abstracts and summaries from scientific papers and research content.

## Files Structure

```
bsg_cyllama/
├── scientific_model_production_v2/         # Trained model files
│   ├── config.json                         # Model configuration
│   ├── prompt_generator.pt                 # Prompt generation utilities
│   └── model/                              # LoRA adapter files
│       ├── adapter_config.json
│       ├── adapter_model.safetensors
│       ├── tokenizer.json
│       └── ...
├── bsg_training_data_complete_aligned.tsv  # Complete training dataset (19,174 records)
├── bsg_cyllama_trainer_v2.py               # Training script
├── scientific_model_inference2.py          # Inference utilities
├── bsg_training_data_gen.py                # Data generation pipeline
├── compile_complete_training_data.py       # Data compilation script
├── upload_to_huggingface.py                # HF upload utilities
└── run_upload.py                           # Simple upload runner
```

## Prerequisites

1. **Python Environment**:

   ```
   python >= 3.8
   torch >= 2.0
   transformers >= 4.30.0
   peft >= 0.4.0
   huggingface_hub
   pandas
   numpy
   ```

2. **Hardware Requirements**:
   - GPU with at least 8 GB VRAM (recommended)
   - 16 GB+ system RAM
   - CUDA support for optimal performance

## Installation

1. **Clone/download the repository**:

   ```bash
   git clone <repository-url>
   cd bsg_cyllama
   ```

2. **Install dependencies**:

   ```bash
   pip install torch transformers peft huggingface_hub pandas numpy sentence-transformers
   ```

3. **Activate the environment** (if using a virtual environment):

   ```bash
   source ~/myenv/bin/activate
   ```

## Usage

### 1. Basic Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")

def generate_summary(text, max_new_tokens=200):
    prompt = f"Summarize the following scientific text:\n\n{text}\n\nSummary:"
    # Tokenize and move the inputs to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # budget for generated tokens only
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary.split("Summary:")[-1].strip()
```
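With the adapter loaded, the helper can be called directly. A minimal usage sketch, assuming the block above has already been run; the input text below is an illustrative placeholder, not drawn from the dataset:

```python
# Illustrative placeholder abstract (not from the training data)
paper_text = (
    "Single-cell RNA sequencing enables gene-expression measurements at "
    "individual-cell resolution, but downstream analysis remains sensitive "
    "to technical noise and batch effects."
)

# Calls the generate_summary() helper defined above
print(generate_summary(paper_text, max_new_tokens=150))
```

Because `do_sample=True`, repeated calls produce different summaries; set `do_sample=False` for deterministic (greedy) output.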
### 2. Using the Inference Script

```bash
python scientific_model_inference2.py
```

### 3. Training from Scratch

```bash
python bsg_cyllama_trainer_v2.py
```

## Dataset Information

The complete training dataset contains **19,174 records** with the following structure:

- **AbstractSummary**: Detailed scientific summary
- **ShortSummary**: Concise version
- **Title**: Research paper title
- **OriginalText**: Source abstract
- **OriginalKeywords**: Topic keywords
- **Clustering information**: For data organization

### Loading the Dataset

```python
import pandas as pd

# Load the complete training data
df = pd.read_csv("bsg_training_data_complete_aligned.tsv", sep="\t")
print(f"Dataset size: {len(df)} records")
print(f"Columns: {df.columns.tolist()}")

# Inspect an example training pair
sample = df.iloc[0]
print(f"Original: {sample['OriginalText'][:200]}...")
print(f"Summary: {sample['AbstractSummary'][:200]}...")
```

## Model Configuration

- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- **LoRA Rank**: 128
- **LoRA Alpha**: 256
- **Target Modules**: v_proj, o_proj, k_proj, gate_proj, q_proj, up_proj, down_proj
- **Training Samples**: 19,174

## Uploading to Hugging Face

To upload your model and dataset to Hugging Face:

1. **Set up your token**: the Hugging Face access token is already configured in the upload script.

2. **Run the upload**:

   ```bash
   python run_upload.py
   ```

3. **Enter your HF username** when prompted.

This will create two repositories:

- `{username}/bsg-cyllama` (model)
- `{username}/bsg-cyllama-training-data` (dataset)

## Performance Tips

1. **For better performance**:
   - Use GPU inference
   - Adjust the temperature (0.5-0.8 for more focused summaries)
   - Experiment with `max_new_tokens` based on your needs

2. **Memory optimization**:
   - Use `torch.float16` for inference
   - Enable gradient checkpointing for training
   - Use smaller batch sizes if needed

## Troubleshooting

1. **CUDA out of memory**:
   - Reduce the batch size
   - Use CPU inference
   - Enable gradient checkpointing

2. **Import errors**:
   - Check the transformers version: `pip install "transformers>=4.30.0"`
   - Install missing dependencies: `pip install peft sentence-transformers`

3. **Model loading issues**:
   - Verify file paths
   - Check model file integrity
   - Ensure proper permissions

## Example Applications

1. **Scientific Paper Summarization**
2. **Abstract Generation**
3. **Research Literature Review**
4. **Technical Documentation Condensation**

## Citation

```bibtex
@misc{bsg-cyllama-2025,
  title={BSG CyLLama: Scientific Summarization with LoRA-tuned Llama},
  author={BSG Research Team},
  year={2025},
  url={https://huggingface.co/bsg-cyllama}
}
```

## Support

For questions, issues, or collaboration:

1. Check this guide first
2. Review the error messages
3. Open an issue in the repository
4. Contact the development team

---

**Last Updated**: January 2025
**Model Version**: v2
**Dataset Version**: Complete Aligned (19,174 records)