---
license: mit
pipeline_tag: any-to-any
library_name: transformers
tags:
- biomolecules
- proteins
- molecules
- multimodal
- language-model
- instruction-tuned
- llama
---

<div align="center">

<h3>InstructBioMol: A Multimodal LLM for Biomolecule Understanding and Design</h3>

<p align="center">
  <a href="https://arxiv.org/abs/2410.07919">Paper</a> •
  <a href="https://github.com/HICAI-ZJU/InstructBioMol">Project</a> •
  <a href="#quickstart">Quickstart</a> •
  <a href="#citation">Citation</a>
</p>

</div>

### Model Description

InstructBioMol is a multimodal large language model that bridges natural language with biomolecules (proteins and small molecules). It achieves any-to-any alignment between natural language, molecules, and proteins through comprehensive instruction tuning.

*For detailed information, please refer to our [paper](https://arxiv.org/abs/2410.07919) and [code repository](https://github.com/HICAI-ZJU/InstructBioMol).*

### Released Variants

| Model Name | Stage | Multimodal | Description |
|------------|-------|------------|-------------|
| [InstructBioMol-base](https://huggingface.co/hicai-zju/InstructBioMol-base) | Pretraining | ❎ | Continually pretrained on molecular sequences, protein sequences, and scientific literature. |
| [InstructBioMol-instruct-stage1](https://huggingface.co/hicai-zju/InstructBioMol-instruct-stage1) (*This Model*) | Instruction tuning (stage 1) | ✅ | Stage-1 instruction-tuned model with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |
| [InstructBioMol-instruct](https://huggingface.co/hicai-zju/InstructBioMol-instruct) | Instruction tuning (stages 1 and 2) | ✅ | Fully instruction-tuned model (stages 1 and 2) with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |

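All variants load the same way; to try a different one, swap in the repository ID from the table above. The minimal sketch below assumes each variant exposes the same `AutoModel`/`AutoTokenizer` remote-code interface; the full generation walkthrough is in the Quickstart below.

```python
from transformers import AutoModel, AutoTokenizer

# Any repository ID from the table above can be substituted here
# (assumes all variants share the same remote-code loading interface).
repo_id = "hicai-zju/InstructBioMol-instruct"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
```
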
### Training Details

**Base Architecture**: InstructBioMol-base

**Training Data**:

1. Molecule - Natural Language Alignment:
   - 60 million records from PubChem and ChEBI
2. Protein - Natural Language Alignment:
   - 35 million records from UniProt (Swiss-Prot and TrEMBL)
3. Molecule - Protein Alignment:
   - 1 million records from BindingDB and Rhea

**Training Objective**: Instruction tuning

### Quickstart

You can use InstructBioMol with the `transformers` library by setting `trust_remote_code=True`. The model handles multimodal inputs, specifically proteins and molecules, as demonstrated below.

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load the model and tokenizer (trust_remote_code is required for the custom model code)
model_name = "hicai-zju/InstructBioMol-instruct-stage1"  # or "hicai-zju/InstructBioMol-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Example 1: generate a description for a protein sequence.
# The amino-acid sequence below is illustrative; replace it with your own.
protein_sequence = "MSKKLVSGTDVAEYLLSVQKEELGDLTLEIDELKTVTLTRIAQLKDFGSGSIPVEAVKLINQENILFLLGTLGIGKTTTTLLKRIISDKDFGFYSSADKLYDYKGYVVFGESVAGAEADWTSKIDVVVAPFTSIDETAKLLAKLTPDVSVLGQAVAVKGALRILGMDDAAQRVADIVGLAVTGQIVKLAANAGADLLEALKLPEVVVVGNGVAYALDGRLKAEFSLDTAVADGASEVAGKLIARNGADGSLKGVLLEELGAAKLKVIAPLTGLAKELKAFESLLAEKKD"

# <PROT>...</PROT> marks the protein input in the prompt; see the project repository
# for the complete multimodal prompt format (e.g., 3D structure inputs).
prompt = f"Please describe this protein: <PROT>{protein_sequence}</PROT>"

input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=100, do_sample=False)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)

# Example 2: generate a SMILES string from a natural-language molecule description.
mol_description = "A molecule with anti-cancer activity and a molecular weight around 300."
prompt = f"Generate a SMILES string for a molecule with the following properties: <MOL>{mol_description}</MOL>"

input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=100, do_sample=False)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)
```

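For more varied outputs, you can replace greedy decoding with sampling. This is a minimal sketch assuming the remote-code `generate` method forwards the standard Hugging Face sampling arguments (`do_sample`, `temperature`, `top_p`); the values shown are illustrative.

```python
# Minimal sampling sketch (assumes standard `generate` kwargs are supported by the remote code).
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,   # stochastic sampling instead of greedy decoding
    temperature=0.7,  # lower values -> more deterministic outputs
    top_p=0.9,        # nucleus sampling threshold
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
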
### Citation

```bibtex
@article{DBLP:journals/corr/abs-2410-07919,
  author  = {Xiang Zhuang and
             Keyan Ding and
             Tianwen Lyu and
             Yinuo Jiang and
             Xiaotong Li and
             Zhuoyi Xiang and
             Zeyuan Wang and
             Ming Qin and
             Kehua Feng and
             Jike Wang and
             Qiang Zhang and
             Huajun Chen},
  title   = {InstructBioMol: Advancing Biomolecule Understanding and Design Following
             Human Instructions},
  journal = {CoRR},
  volume  = {abs/2410.07919},
  year    = {2024}
}
```