jimnoneill committed · Commit 8bb63a1 · verified · 1 Parent(s): e6a3f19

Upload SETUP_GUIDE.md with huggingface_hub

Files changed (1): SETUP_GUIDE.md ADDED (+234, -0)

# BSG CyLLama Setup and Usage Guide

This guide explains how to set up and use the BSG CyLLama scientific summarization model.

## Overview

BSG CyLLama is a LoRA-adapted Llama-3.2-1B-Instruct model fine-tuned for scientific text summarization. The model excels at generating high-quality abstracts and summaries from scientific papers and research content.

## File Structure

```
bsg_cyllama/
├── scientific_model_production_v2/        # Trained model files
│   ├── config.json                        # Model configuration
│   ├── prompt_generator.pt                # Prompt generation utilities
│   └── model/                             # LoRA adapter files
│       ├── adapter_config.json
│       ├── adapter_model.safetensors
│       ├── tokenizer.json
│       └── ...
├── bsg_training_data_complete_aligned.tsv # Complete training dataset (19,174 records)
├── bsg_cyllama_trainer_v2.py              # Training script
├── scientific_model_inference2.py         # Inference utilities
├── bsg_training_data_gen.py               # Data generation pipeline
├── compile_complete_training_data.py      # Data compilation script
├── upload_to_huggingface.py               # HF upload utilities
└── run_upload.py                          # Simple upload runner
```

## Prerequisites

1. **Python Environment**:
   ```bash
   python >= 3.8
   torch >= 2.0
   transformers >= 4.30.0
   peft >= 0.4.0
   huggingface_hub
   pandas
   numpy
   ```

2. **Hardware Requirements**:
   - GPU with at least 8GB VRAM (recommended)
   - 16GB+ system RAM
   - CUDA support for optimal performance

You can sanity-check this setup with the snippet below.

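A quick way to confirm that the core packages import cleanly and that PyTorch can see your GPU (a minimal check, nothing model-specific):

```python
# Quick environment sanity check for the prerequisites above
import torch
import transformers
import peft

print(f"torch {torch.__version__}, transformers {transformers.__version__}, peft {peft.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    # An 8GB+ card is recommended for comfortable inference
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```
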
## Installation

1. **Clone/download the repository**:
   ```bash
   git clone <your-repo-url>
   cd bsg_cyllama
   ```

2. **Activate your environment** (if using a virtual environment, do this before installing):
   ```bash
   source ~/myenv/bin/activate
   ```

3. **Install dependencies**:
   ```bash
   pip install torch transformers peft huggingface_hub pandas numpy sentence-transformers
   ```

## Usage

### 1. Basic Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")
model.eval()

def generate_summary(text, max_new_tokens=200):
    prompt = f"Summarize the following scientific text:\n\n{text}\n\nSummary:"

    # Tokenize with an attention mask and move tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # budget for the summary itself, not the prompt
            num_return_sequences=1,
            do_sample=True,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id
        )

    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary.split("Summary:")[-1].strip()
```

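Once the block above has run, `generate_summary` can be called directly; for example, with a short placeholder passage:

```python
# Assumes the model, tokenizer, and generate_summary from the block above.
# The text is an illustrative placeholder, not real data.
sample_text = (
    "We present a transformer-based pipeline for large-scale protein annotation "
    "that combines sequence embeddings with graph clustering."
)
print(generate_summary(sample_text))
```
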
### 2. Using the Inference Script

```bash
python scientific_model_inference2.py
```

### 3. Training from Scratch

```bash
python bsg_cyllama_trainer_v2.py
```

## Dataset Information

The complete training dataset contains **19,174 records** with the following structure:

- **AbstractSummary**: Detailed scientific summary
- **ShortSummary**: Concise version
- **Title**: Research paper title
- **OriginalText**: Source abstract
- **OriginalKeywords**: Topic keywords
- **Clustering information**: For data organization

### Loading the Dataset

```python
import pandas as pd

# Load the complete training data
df = pd.read_csv("bsg_training_data_complete_aligned.tsv", sep="\t")

print(f"Dataset size: {len(df)} records")
print(f"Columns: {df.columns.tolist()}")

# Example training pair
sample = df.iloc[0]
print(f"Original: {sample['OriginalText'][:200]}...")
print(f"Summary: {sample['AbstractSummary'][:200]}...")
```

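If you want to feed the TSV into your own fine-tuning loop, one way to turn rows into prompt/target pairs is sketched below. The prompt template here is an assumption; the authoritative format lives in `bsg_cyllama_trainer_v2.py`:

```python
# Sketch: build (prompt, target) pairs from the dataframe loaded above.
# The template mirrors the inference prompt; check bsg_cyllama_trainer_v2.py for the real one.
def to_training_pair(row):
    prompt = f"Summarize the following scientific text:\n\n{row['OriginalText']}\n\nSummary:"
    return {"prompt": prompt, "target": row["AbstractSummary"]}

pairs = df.apply(to_training_pair, axis=1).tolist()
print(pairs[0]["prompt"][:200])
```
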
## Model Configuration

- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- **LoRA Rank**: 128
- **LoRA Alpha**: 256
- **Target Modules**: v_proj, o_proj, k_proj, gate_proj, q_proj, up_proj, down_proj
- **Training Samples**: 19,174

These map onto a PEFT `LoraConfig` as sketched below.

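Expressed as a PEFT `LoraConfig`, the hyperparameters above look roughly like this; dropout is an assumption here, and the authoritative values are in `scientific_model_production_v2/model/adapter_config.json`:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,              # LoRA rank, as listed above
    lora_alpha=256,     # LoRA alpha, as listed above
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,  # assumption -- not stated in this guide
    task_type="CAUSAL_LM",
)
```
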
## Uploading to Hugging Face

To upload your model and dataset to Hugging Face:

1. **Set up your token**. The upload script ships with a token already configured; if you would rather not hard-code one, log in once with the CLI instead:
   ```bash
   huggingface-cli login
   ```

2. **Run the upload**:
   ```bash
   python run_upload.py
   ```

3. **Enter your HF username** when prompted.

This will create two repositories (see the sketch below for the underlying API calls):
- `{username}/bsg-cyllama` (model)
- `{username}/bsg-cyllama-training-data` (dataset)

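`upload_to_huggingface.py` presumably wraps calls like the following; this is a minimal sketch using the public `huggingface_hub` API, with `your-username` as a placeholder:

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` or the HF_TOKEN env var
username = "your-username"  # placeholder

# Model repository
api.create_repo(f"{username}/bsg-cyllama", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./scientific_model_production_v2/model",
    repo_id=f"{username}/bsg-cyllama",
    repo_type="model",
)

# Dataset repository
api.create_repo(f"{username}/bsg-cyllama-training-data", repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="bsg_training_data_complete_aligned.tsv",
    path_in_repo="bsg_training_data_complete_aligned.tsv",
    repo_id=f"{username}/bsg-cyllama-training-data",
    repo_type="dataset",
)
```
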
## Performance Tips

1. **For better performance**:
   - Use GPU inference
   - Adjust the sampling temperature (0.5-0.8 for more focused summaries)
   - Experiment with `max_new_tokens` based on your needs

2. **Memory optimization** (see the sketch below):
   - Use torch.float16 for inference
   - Enable gradient checkpointing for training
   - Use smaller batch sizes if needed

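A minimal sketch of those memory settings in code, assuming the same base model as above:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.float16,  # half precision roughly halves inference memory
    device_map="auto",
)

# During training, trade extra compute for lower activation memory
model.gradient_checkpointing_enable()
```
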
## Troubleshooting

1. **CUDA out of memory**:
   - Reduce the batch size
   - Fall back to CPU inference
   - Enable gradient checkpointing

2. **Import errors**:
   - Check the transformers version: `pip install "transformers>=4.30.0"` (quote the specifier so the shell does not treat `>` as a redirect)
   - Install missing dependencies: `pip install peft sentence-transformers`

3. **Model loading issues**:
   - Verify file paths
   - Check model file integrity
   - Ensure proper permissions

## Example Applications

1. **Scientific Paper Summarization**
2. **Abstract Generation**
3. **Research Literature Review**
4. **Technical Documentation Condensation**

## Citation

```bibtex
@misc{bsg-cyllama-2025,
  title={BSG CyLLama: Scientific Summarization with LoRA-tuned Llama},
  author={BSG Research Team},
  year={2025},
  url={https://huggingface.co/bsg-cyllama}
}
```

## Support

For questions, issues, or collaboration:

1. Check this guide first
2. Review the error messages
3. Open an issue in the repository
4. Contact the development team

---

**Last Updated**: January 2025
**Model Version**: v2
**Dataset Version**: Complete Aligned (19,174 records)