# BSG CyLLama - Scientific Summarization Model

BSG CyLLama is a fine-tuned Llama-3.2-1B-Instruct model specialized for scientific text summarization. It is trained to generate high-quality abstracts and summaries from scientific papers and research content.

## Model Details

- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Training Samples**: 19,174 scientific abstracts and summaries
- **Task**: Scientific Text Summarization
- **Language**: English

## Training Configuration

- **LoRA Rank**: 128
- **LoRA Alpha**: 256
- **LoRA Dropout**: 0.05
- **Target Modules**: v_proj, o_proj, k_proj, gate_proj, q_proj, up_proj, down_proj
- **Embedding Dimension**: 1024
- **Hidden Dimension**: 2048

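For reference, these hyperparameters correspond roughly to the following `peft` `LoraConfig`. This is a minimal sketch, not the exact training invocation; the authoritative settings are recorded in `adapter_config.json` and `bsg_cyllama_trainer_v2.py`.

```python
from peft import LoraConfig, TaskType, get_peft_model

# Sketch of the LoRA setup described above; exact values used in training
# live in adapter_config.json and bsg_cyllama_trainer_v2.py.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=128,                 # LoRA rank
    lora_alpha=256,        # scaling factor
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

# During training this config would be applied to the base model, e.g.:
# peft_model = get_peft_model(base_model, lora_config)
```
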
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "path/to/bsg-cyllama")
model.eval()

def generate_summary(text, max_new_tokens=200):
    prompt = f"Summarize the following scientific text:\n\n{text}\n\nSummary:"

    # Tokenize and move the prompt to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            num_return_sequences=1,
            do_sample=True,          # required for temperature to take effect
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode and keep only the text after the "Summary:" marker
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary.split("Summary:")[-1].strip()

# Example
scientific_text = "Your scientific paper content here..."
summary = generate_summary(scientific_text)
print(summary)
```

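Because the base model is instruction-tuned, you can also prompt it through the tokenizer's chat template. The snippet below continues from the code above and is only a sketch: it assumes the adapter also responds well to chat-formatted prompts, which is not separately documented here.

```python
# Continues from the snippet above (tokenizer, model, scientific_text defined).
# Chat-template prompting is an assumption, not the documented prompt format.
messages = [
    {"role": "system", "content": "You summarize scientific papers accurately and concisely."},
    {"role": "user", "content": f"Summarize the following scientific text:\n\n{scientific_text}"},
]

chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    chat_outputs = model.generate(
        chat_inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )

# Strip the prompt tokens and decode only the generated continuation
print(tokenizer.decode(chat_outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True))
```
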
## Training Data

The model was trained on a comprehensive dataset of scientific abstracts and summaries:

- **Total Records**: 19,174
- **Sources**: Scientific literature including biomedical, computational, and interdisciplinary research
- **Format**: Abstract → Summary pairs with metadata
- **Quality**: Curated and clustered data with quality filtering

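The exact record schema is produced by the scripts listed under Training Scripts and is not spelled out here. As a purely hypothetical illustration, one abstract-summary pair might be rendered into the prompt format used in the Usage section like this:

```python
# Hypothetical record layout (field names are illustrative, not the real schema)
record = {
    "abstract": "Full abstract or paper text ...",
    "summary": "Condensed summary ...",
    "metadata": {"domain": "biomedical"},
}

# Rendered into the same prompt format shown in the Usage section
training_text = (
    "Summarize the following scientific text:\n\n"
    f"{record['abstract']}\n\nSummary: {record['summary']}"
)
```
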
## Files Included

- `adapter_config.json`: LoRA adapter configuration
- `adapter_model.safetensors`: LoRA adapter weights
- `config.json`: Model configuration
- `prompt_generator.pt`: Prompt generation utilities
- `tokenizer.*`: Tokenizer files
- Training scripts and data processing utilities

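To confirm that a downloaded adapter matches the configuration described above, you can read `adapter_config.json` through `peft`. This is a small sketch; replace the path with wherever the adapter lives locally or on the Hub.

```python
from peft import PeftConfig

# Load adapter_config.json and print the key LoRA settings
peft_config = PeftConfig.from_pretrained("path/to/bsg-cyllama")
print(peft_config.r, peft_config.lora_alpha, peft_config.target_modules)
```
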
## Training Scripts

- `bsg_cyllama_trainer_v2.py`: Main training script
- `scientific_model_inference2.py`: Inference utilities
- `bsg_training_data_gen.py`: Data generation pipeline
- `compile_complete_training_data.py`: Data compilation script

## Performance

The model demonstrates strong performance in:

- Scientific abstract summarization
- Research paper summarization
- Technical content condensation
- Maintaining scientific accuracy and terminology

## Limitations

- Specialized for scientific text; may not perform well on general-domain text
- Built on the 1B-parameter Llama-3.2 base, so capacity is limited compared with larger models
- English language only
- May require further domain-specific fine-tuning for highly specialized fields

## Citation

```bibtex
@misc{bsg-cyllama-2025,
  title={BSG CyLLama: Scientific Summarization with LoRA-tuned Llama},
  author={BSG Research Team},
  year={2025},
  url={https://huggingface.co/bsg-cyllama}
}
```

## License

Use of this adapter is subject to the license terms of the base model; see the Llama 3.2 Community License for details.

## Contact

For questions or collaboration opportunities, please open an issue in this repository.