
BSG CyLLama - Scientific Summarization Model

BSG CyLLama is a fine-tuned Llama-3.2-1B-Instruct model specialized for scientific text summarization. It is trained to generate concise, high-quality abstracts and summaries from scientific papers and research content.

Model Details

  • Base Model: meta-llama/Llama-3.2-1B-Instruct
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Training Samples: 19,174 scientific abstracts and summaries
  • Task: Scientific Text Summarization
  • Language: English

Training Configuration

  • LoRA Rank: 128
  • LoRA Alpha: 256
  • LoRA Dropout: 0.05
  • Target Modules: v_proj, o_proj, k_proj, gate_proj, q_proj, up_proj, down_proj
  • Embedding Dimension: 1024
  • Hidden Dimension: 2048
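
The hyperparameters listed above map onto a PEFT LoraConfig roughly as sketched below. This is an illustrative reconstruction, not the exact setup from bsg_cyllama_trainer_v2.py; values not in the list (bias, task_type) are assumptions.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# LoRA hyperparameters taken from the list above; bias and task_type are assumed defaults
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # sanity check: only the LoRA weights are trainable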

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the LoRA adapter on top of the base model (local path or Hub repo id)
model = PeftModel.from_pretrained(base_model, "path/to/bsg-cyllama")
model.eval()

# Example usage
def generate_summary(text, max_new_tokens=200):
    prompt = f"Summarize the following scientific text:\n\n{text}\n\nSummary:"

    # Tokenize and move inputs to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            num_return_sequences=1,
            do_sample=True,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id
        )

    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary.split("Summary:")[-1].strip()

# Example
scientific_text = "Your scientific paper content here..."
summary = generate_summary(scientific_text)
print(summary)
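
For deployment, the adapter can optionally be merged into the base weights so inference no longer requires peft at runtime. This is the standard PEFT merge pattern, shown here as a minimal sketch; the output directory name is only an example.

# Optional: merge the LoRA weights into the base model for standalone inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("bsg-cyllama-merged")  # example output directory
tokenizer.save_pretrained("bsg-cyllama-merged")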

Training Data

The model was trained on a curated dataset of scientific abstracts and summaries:

  • Total Records: 19,174
  • Sources: Scientific literature including biomedical, computational, and interdisciplinary research
  • Format: Abstract → Summary pairs with metadata
  • Curation: Clustered and quality-filtered records
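
The exact record schema is not reproduced here; as an illustration only, an Abstract → Summary pair with metadata might look like the following (all field names are assumptions and may differ from the output of compile_complete_training_data.py).

# Hypothetical example of one training record
example_record = {
    "abstract": "Full scientific abstract text...",
    "summary": "Condensed summary curated for training...",
    "metadata": {"domain": "biomedical", "cluster_id": 42},
}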

Files Included

  • adapter_config.json: LoRA adapter configuration
  • adapter_model.safetensors: LoRA adapter weights
  • config.json: Model configuration
  • prompt_generator.pt: Prompt generation utilities
  • tokenizer.*: Tokenizer files
  • Training scripts and data processing utilities

Training Scripts

  • bsg_cyllama_trainer_v2.py: Main training script
  • scientific_model_inference2.py: Inference utilities
  • bsg_training_data_gen.py: Data generation pipeline
  • compile_complete_training_data.py: Data compilation script

Performance

The model demonstrates strong performance in:

  • Scientific abstract summarization
  • Research paper summarization
  • Technical content condensation
  • Maintaining scientific accuracy and terminology

Limitations

  • Specialized for scientific text; may not perform optimally on general text
  • Built on a 1B-parameter base model (Llama-3.2-1B), so capacity is limited compared with larger models
  • English language only
  • May require domain-specific fine-tuning for highly specialized fields

Citation

@misc{bsg-cyllama-2025,
  title={BSG CyLLama: Scientific Summarization with LoRA-tuned Llama},
  author={BSG Research Team},
  year={2025},
  url={https://huggingface.co/bsg-cyllama}
}

License

This adapter inherits the license of its base model; please refer to the Llama 3.2 Community License terms for usage guidelines.

Contact

For questions or collaboration opportunities, please open an issue in this repository.