Upload SETUP_GUIDE.md with huggingface_hub

SETUP_GUIDE.md +234 -0

	@@ -0,0 +1,234 @@

+# BSG CyLLama Setup and Usage Guide
+This guide explains how to set up and use the BSG CyLLama scientific summarization model.
+## Overview
+BSG CyLLama is a LoRA-adapted Llama-3.2-1B-Instruct model fine-tuned for scientific text summarization. It is intended for generating abstracts and summaries from scientific papers and research content.
+## File Structure
+```
+bsg_cyllama/
+├── scientific_model_production_v2/     # Trained model files
+│   ├── config.json                     # Model configuration
+│   ├── prompt_generator.pt             # Prompt generation utilities
+│   └── model/                          # LoRA adapter files
+│       ├── adapter_config.json
+│       ├── adapter_model.safetensors
+│       ├── tokenizer.json
+│       └── ...
+├── bsg_training_data_complete_aligned.tsv  # Complete training dataset (19,174 records)
+├── bsg_cyllama_trainer_v2.py          # Training script
+├── scientific_model_inference2.py     # Inference utilities
+├── bsg_training_data_gen.py           # Data generation pipeline
+├── compile_complete_training_data.py  # Data compilation script
+├── upload_to_huggingface.py           # HF upload utilities
+└── run_upload.py                      # Simple upload runner
+```
+## Prerequisites
+1. **Python Environment**:
+   ```
+   python >= 3.8
+   torch >= 2.0
+   transformers >= 4.30.0
+   peft >= 0.4.0
+   accelerate
+   huggingface_hub
+   pandas
+   numpy
+   sentence-transformers
+   ```
+2. **Hardware Requirements**:
+   - GPU with at least 8GB VRAM (recommended)
+   - 16GB+ system RAM
+   - CUDA support for optimal performance
+## Installation
+1. **Clone/Download the repository**:
+   ```bash
+   git clone <your-repo-url>
+   cd bsg_cyllama
+   ```
+2. **Activate your environment** (if using a virtual environment):
+   ```bash
+   source ~/myenv/bin/activate
+   ```
+3. **Install dependencies** (`accelerate` is needed for `device_map="auto"` in the inference example):
+   ```bash
+   pip install "torch>=2.0" "transformers>=4.30.0" "peft>=0.4.0" accelerate huggingface_hub pandas numpy sentence-transformers
+   ```
+## Usage
+### 1. Basic Inference
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from peft import PeftModel
+import torch
+# Load the base model (device_map="auto" requires the accelerate package)
+base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
+tokenizer = AutoTokenizer.from_pretrained(base_model_name)
+base_model = AutoModelForCausalLM.from_pretrained(
+    base_model_name,
+    torch_dtype=torch.float16,
+    device_map="auto"
+)
+# Load the LoRA adapter on top of the base model and switch to eval mode
+model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")
+model.eval()
+def generate_summary(text, max_new_tokens=200):
+    prompt = f"Summarize the following scientific text:\n\n{text}\n\nSummary:"
+    # Tokenize with an attention mask and move the tensors to the model's device
+    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+    with torch.no_grad():
+        outputs = model.generate(
+            **inputs,
+            max_new_tokens=max_new_tokens,  # limits generated tokens only, not prompt + output
+            num_return_sequences=1,
+            temperature=0.7,
+            pad_token_id=tokenizer.eos_token_id,
+            do_sample=True
+        )
+    # Decode only the newly generated tokens, skipping the echoed prompt
+    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
+    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
+```
+### 2. Using the Inference Script
+```bash
+python scientific_model_inference2.py
+```
+### 3. Training the LoRA Adapter
+```bash
+python bsg_cyllama_trainer_v2.py
+```
+## Dataset Information
+The complete training dataset contains **19,174 records** with the following structure:
+- **AbstractSummary**: Detailed scientific summary
+- **ShortSummary**: Concise version
+- **Title**: Research paper title
+- **OriginalText**: Source abstract
+- **OriginalKeywords**: Topic keywords
+- **Clustering information**: For data organization
+### Loading the Dataset
+```python
+import pandas as pd
+# Load complete training data
+df = pd.read_csv("bsg_training_data_complete_aligned.tsv", sep="\t")
+print(f"Dataset size: {len(df)} records")
+print(f"Columns: {df.columns.tolist()}")
+# Example training pair
+sample = df.iloc[0]
+print(f"Original: {sample['OriginalText'][:200]}...")
+print(f"Summary: {sample['AbstractSummary'][:200]}...")
+```
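To illustrate how the columns listed above might be turned into prompt/target pairs for fine-tuning, here is a minimal sketch. It reuses the prompt wording from the Basic Inference example; whether `bsg_cyllama_trainer_v2.py` formats its examples this way is an assumption, and `build_training_pairs` and `PROMPT_TEMPLATE` are hypothetical names. It uses only the standard library `csv` module, and a made-up two-row sample stands in for the real TSV file.

```python
import csv
import io

# Assumed prompt format, copied from the Basic Inference example
PROMPT_TEMPLATE = "Summarize the following scientific text:\n\n{text}\n\nSummary:"

def build_training_pairs(tsv_file, target_column="AbstractSummary"):
    """Yield (prompt, target) tuples from a tab-separated file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        source = (row.get("OriginalText") or "").strip()
        target = (row.get(target_column) or "").strip()
        if not source or not target:
            continue  # skip rows with a missing source text or summary
        yield PROMPT_TEMPLATE.format(text=source), target

# Made-up sample rows standing in for the real dataset
sample_tsv = io.StringIO(
    "Title\tOriginalText\tAbstractSummary\tShortSummary\n"
    "Paper A\tCells divide.\tA study of cell division.\tCell division.\n"
    "Paper B\t\tNo source text here.\tEmpty.\n"
)
pairs = list(build_training_pairs(sample_tsv))
print(len(pairs))   # the row without OriginalText is skipped
print(pairs[0][1])
```

For the real file, pass an open file handle instead of the `StringIO` object, for example `open("bsg_training_data_complete_aligned.tsv", newline="", encoding="utf-8")` (the encoding is an assumption). Setting `target_column="ShortSummary"` would pair each source text with the concise summary instead.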
+## Model Configuration
+- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
+- **LoRA Rank**: 128
+- **LoRA Alpha**: 256
+- **Target Modules**: v_proj, o_proj, k_proj, gate_proj, q_proj, up_proj, down_proj
+- **Training Samples**: 19,174
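To give a feel for what rank 128 across all seven target modules means in practice, the sketch below estimates the number of trainable adapter parameters. The layer dimensions (hidden size 2048, MLP size 8192, 8 key/value heads of dimension 64, 16 decoder layers) are assumptions based on the published Llama-3.2-1B configuration and should be checked against the base model's `config.json`. The `alpha / rank` scaling is the standard LoRA convention; `lora_parameter_count` is a hypothetical helper.

```python
# Assumed projection shapes (in_features, out_features) for Llama-3.2-1B
HIDDEN, MLP, KV = 2048, 8192, 512
NUM_LAYERS = 16  # assumed number of decoder layers

PROJECTION_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_parameter_count(rank, shapes=PROJECTION_SHAPES, num_layers=NUM_LAYERS):
    """LoRA adds A (rank x in) and B (out x rank) for each adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

rank, alpha = 128, 256
trainable = lora_parameter_count(rank)
scaling = alpha / rank  # factor applied to the low-rank update B @ A
print(f"{trainable:,} adapter parameters, scaling {scaling}")
```

Under these assumptions the adapter holds roughly 90 million parameters, which is sizeable relative to a 1B-parameter base model and explains why the adapter file is not small.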
+## Uploading to Hugging Face
+To upload your model and dataset to Hugging Face:
+1. **Set up your token**:
+   ```bash
+   # Your token is already configured in the script
+   ```
+2. **Run the upload**:
+   ```bash
+   python run_upload.py
+   ```
+3. **Enter your HF username** when prompted
+This will create two repositories:
+- `{username}/bsg-cyllama` (model)
+- `{username}/bsg-cyllama-training-data` (dataset)
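The internals of `run_upload.py` are not shown in this guide, so the following is only a sketch of what an equivalent upload could look like with `huggingface_hub.HfApi`. The function names `repo_ids` and `upload_all` are hypothetical, and reading the token from an `HF_TOKEN` environment variable is an assumption, not a description of how the script handles credentials.

```python
import os

def repo_ids(username):
    """Return the (model, dataset) repository ids named in this guide."""
    return f"{username}/bsg-cyllama", f"{username}/bsg-cyllama-training-data"

def upload_all(username, model_dir, dataset_path):
    """Create both repositories if needed and upload the local files."""
    # Imported lazily so repo_ids() works without huggingface_hub installed
    from huggingface_hub import HfApi

    # Assumption: the access token is supplied via the HF_TOKEN variable
    api = HfApi(token=os.environ["HF_TOKEN"])
    model_repo, dataset_repo = repo_ids(username)

    # Model repository: upload the whole adapter directory
    api.create_repo(model_repo, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=model_dir, repo_id=model_repo, repo_type="model")

    # Dataset repository: upload the single TSV file
    api.create_repo(dataset_repo, repo_type="dataset", exist_ok=True)
    api.upload_file(
        path_or_fileobj=dataset_path,
        path_in_repo=os.path.basename(dataset_path),
        repo_id=dataset_repo,
        repo_type="dataset",
    )
    return model_repo, dataset_repo
```

A call such as `upload_all("your-username", "./scientific_model_production_v2/model", "bsg_training_data_complete_aligned.tsv")` would then mirror the two repositories listed above.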
+## Performance Tips
+1. **For better performance**:
+   - Use GPU inference
+   - Adjust temperature (0.5-0.8 for more focused summaries)
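+   - Experiment with the output length limit based on your needs; note that `max_length` counts the prompt tokens as well, while `max_new_tokens` limits only the generated text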
+2. **Memory optimization**:
+   - Use torch.float16 for inference
+   - Enable gradient checkpointing for training
+   - Use smaller batch sizes if needed
+## Troubleshooting
+1. **CUDA out of memory**:
+   - Reduce batch size
+   - Use CPU inference
+   - Enable gradient checkpointing
+2. **Import errors**:
+   - Upgrade transformers if needed: `pip install "transformers>=4.30.0"` (quote the requirement so the shell does not treat `>` as a redirect)
+   - Install missing dependencies: `pip install peft sentence-transformers`
+3. **Model loading issues**:
+   - Verify file paths
+   - Check model file integrity
+   - Ensure proper permissions
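When chasing import errors, it can help to check the installed package versions against the minimums from the Prerequisites list before anything else. The script below is a small sketch that does this with the standard library only. `check_environment`, `parse_version` and `meets_minimum` are hypothetical helpers, and the version comparison is deliberately simple: it looks only at the leading numeric components and ignores pre-release tags.

```python
from importlib import metadata
import re

# Minimum versions taken from the Prerequisites list; None means "any version"
REQUIRED = {
    "torch": "2.0",
    "transformers": "4.30.0",
    "peft": "0.4.0",
    "huggingface_hub": None,
    "pandas": None,
    "numpy": None,
}

def parse_version(version):
    """Turn '2.1.0+cu118' into (2, 1, 0); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Compare numeric version tuples, padding so that 2.0 equals 2.0.0."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

def check_environment(required=REQUIRED):
    """Return a dict mapping package name to a description of the problem."""
    problems = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if minimum is not None and not meets_minimum(installed, minimum):
            problems[name] = f"found {installed}, need >= {minimum}"
    return problems

for package, problem in check_environment().items():
    print(f"{package}: {problem}")
```

An empty output means every listed package is present and meets its minimum version.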
+## Example Applications
+1. **Scientific Paper Summarization**
+2. **Abstract Generation**
+3. **Research Literature Review**
+4. **Technical Documentation Condensation**
+## Citation
+```bibtex
+@misc{bsg-cyllama-2025,
+  title={BSG CyLLama: Scientific Summarization with LoRA-tuned Llama},
+  author={BSG Research Team},
+  year={2025},
+  url={https://huggingface.co/bsg-cyllama}
+}
+```
+## Support
+For questions, issues, or collaboration:
+1. Check this guide first
+2. Review the error messages
+3. Open an issue in the repository
+4. Contact the development team
+---
+**Last Updated**: January 2025
+**Model Version**: v2
+**Dataset Version**: Complete Aligned (19,174 records)