Some instructions regarding fine-tuning & classification with ESM++ model
Dear contributors and developers, @lhallee
Thank you for the important and helpful work you are doing by making protein LLMs more accessible to the community!
I am trying to follow the code in modeling_esm_plusplus.py in order to perform fine-tuning and downstream protein-level classification for my specific use case. I am using the ESMplusplus_600M() function for embeddings (I also tried AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_large')) and ESMplusplusForSequenceClassification.from_pretrained_esm("600") for the classification head, with LoRA in between. I have already tried several ways to embed my dataset correctly (e.g. model.embed_dataset()), or to pass just the sequences + labels (input_ids, ...) via a Dataset object directly to the Trainer, together with a tokenizer object.
But nothing seems to work: the training either does not start at all or crashes after 1 epoch due to inconsistencies in the shapes/dimensions and batching of the model's inputs or outputs.
I would be very grateful for any help/advice/guides; of course I can provide the specific code I used or the errors.
Many thanks,
Dani
No problem. Probably GitHub is better.
Sure, thank you!
Here is the link to the notebook:
https://github.com/VadimDu/Protein_LLM_modeling/blob/main/clean_ver_Modeling_ESM_plusplus.ipynb
I basically copied all the code from modeling_esm_plusplus.py there, and added my data and the steps towards fine-tuning the classification model on top of it. The part I added starts from the cell named "My protein input data".
In the current trial I commented out data_collator and tokenizer from the Trainer, and used the default collator and the tokenizer implemented in the ESMplusplus_600M() function (class ESMplusplusForMaskedLM(), self.tokenizer = EsmSequenceTokenizer()).
Any help will be much appreciated!
Dani
Hi @lhallee ,
Thanks again for your reply. Could you please clarify regarding AutoModelForSequenceClassification? I could not find such a class/method in the code I am using.
Regarding the errors:
- If I run the preprocessing and fine-tuning steps exactly as in the notebook (model_embedding = ESMplusplus_600M(num_labels=3), .embed_dataset(), model_classification = ESMplusplusForSequenceClassification.from_pretrained_esm("600"), and no explicit custom data_collator or tokenizer given to Trainer()), this is the error:
ValueError Traceback (most recent call last)
<ipython-input-31-3435b262f1ae> in <cell line: 0>()
----> 1 trainer.train()
13 frames
<ipython-input-2-bf6a22e3339a> in forward(self, x, attention_mask, output_hidden_states, output_attentions)
441 TransformerOutput containing last hidden state and optionally all hidden states and attention weights
442 """
--> 443 batch_size, seq_len, _ = x.shape
444 hidden_states = () if output_hidden_states else None
445 attentions = () if output_attentions else None
ValueError: not enough values to unpack (expected 3, got 2)
- If I add a class CustomDataCollator to define a data_collator that converts my input_embeds from a 2-dimensional tensor to shape torch.Size([num_of_sequences, 1, 1152]), then 1 training epoch finishes OK and training crashes at the start of epoch 2:
Could not estimate the number of tokens of the input, floating-point operations will not be computed
[ 4/20 00:00 < 00:03, 4.01 it/s, Epoch 1/10]
Epoch Training Loss Validation Loss
[2/2 00:00]
Downloading builder script: 100% 4.20k/4.20k [00:00<00:00, 506kB/s]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-48-3435b262f1ae> in <cell line: 0>()
----> 1 trainer.train()
9 frames
/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
43 except AttributeError:
44 wrap = None
---> 45 result = getattr(asarray(obj), method)(*args, **kwds)
46 if wrap:
47 if not isinstance(result, mu.ndarray):
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 3 dimensions. The detected shape was (2, 20, 1) + inhomogeneous part.
I used 20 sequences just as an example for training.
- If I use the commented-out cell (#@title ESM++ for protein embeddings using a pre-trained model from Synthyra) for sequence Dataset creation and the tokenizer with AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_large'), and supply Trainer() with a tokenizer, I get this error:
ValueError Traceback (most recent call last)
<ipython-input-28-5075ee0329cb> in <cell line: 0>()
11
12 # Train the model
---> 13 trainer.train()
14 frames
/usr/local/lib/python3.11/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
3477 if size_average is not None or reduce is not None:
3478 reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3479 return torch._C._nn.cross_entropy_loss(
3480 input,
3481 target,
ValueError: Expected input batch_size (2200) to match target batch_size (4).
I hope this information may be helpful, many thanks again for your efforts.
Dani
Gotcha. So, a few things. If you want to fine-tune a model for sequence classification you do not need to pre-embed the sequences; you just need to feed the input_ids and attention_mask via the data collator. You can load the model without copying the implementation anywhere by doing this:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True)
From here you can apply LoRA if you'd like, for example:
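A rough PEFT sketch (the target_modules here are only illustrative; check the module names your model actually prints and adjust):
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    r=8,                          # rank of the LoRA update matrices
    lora_alpha=16,                # scaling factor
    lora_dropout=0.05,
    task_type=TaskType.SEQ_CLS,   # sequence classification task
    target_modules=["out_proj"],  # illustrative; pick the attention/FFN projections you want to adapt
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()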
If you just want to train a model on the base model's vector embeddings instead, you can embed the sequences like you had and train a small neural network on top, as in the sketch below.
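A minimal sketch of that route (assuming one 1152-dim embedding per protein, as in your 600M shapes, 3 classes, and placeholder tensors you would replace with your own data):
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

embeddings = torch.randn(20, 1152)   # placeholder: your per-protein embeddings
labels = torch.randint(0, 3, (20,))  # placeholder: your integer class labels

head = nn.Sequential(
    nn.Linear(1152, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 3),               # 3 classes in this example
)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
loader = DataLoader(TensorDataset(embeddings, labels), batch_size=4, shuffle=True)

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(head(x), y)
        loss.backward()
        optimizer.step()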
Does that make sense?
Here's an example of a collator we use for input_ids and labels. Trainer automatically unpacks a dictionary sent to the model, so everything in "batch" here will go to the right place
import torch

def string_labels_collator_builder(tokenizer, **kwargs):
    def _collate_fn(batch):
        seqs = [ex[0] for ex in batch]
        labels = torch.stack([torch.tensor(ex[1]) for ex in batch])
        batch = tokenizer(seqs,
                          padding='longest',
                          pad_to_multiple_of=8,
                          truncation=False,
                          return_tensors='pt',
                          add_special_tokens=True)
        batch['labels'] = labels
        return batch
    return _collate_fn
tokenizer = model.tokenizer
data_collator = string_labels_collator_builder(tokenizer)
This expects a PyTorch dataset class that will output a tuple of sequences and the labels you are interested in. A class that might link up with your current workflow looks something like this
from torch.utils.data import Dataset as TorchDataset

class StringLabelDatasetFromHF(TorchDataset):
    def __init__(self, hf_dataset, col_name='seqs', label_col='labels', **kwargs):
        self.seqs = hf_dataset[col_name]
        self.labels = hf_dataset[label_col]
        self.lengths = [len(seq) for seq in self.seqs]

    def avg(self):
        return sum(self.lengths) / len(self.lengths)

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        seq = self.seqs[idx]
        label = self.labels[idx]
        return seq, label
Does this help? If you try something new and get a new error please send it along.
Hi @lhallee ,
Many thanks again for your help!
I have implemented the collator and PyTorch dataset as you suggested and used AutoModelForSequenceClassification, but unfortunately, after the 1st training epoch finished, it crashed with a similar error to the one I had before. Below I will paste all the relevant code from the start up to the training step; maybe you can spot some inconsistency there:
import re
import torch
import numpy as np
from evaluate import load
from peft import LoraConfig, TaskType, get_peft_model
from torch.utils.data import Dataset as TorchDataset
from transformers import AutoModelForSequenceClassification, AutoConfig, Trainer, TrainingArguments

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

config = AutoConfig.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, num_labels=3)
model_classification = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, config=config)
tokenizer = model_classification.tokenizer

# Move the model to GPU and keep it in float32
model_classification = model_classification.to(device)  # Remove .half()
def string_labels_collator_builder(tokenizer, **kwargs):
    def _collate_fn(batch):
        seqs = [ex[0] for ex in batch]
        labels = torch.stack([torch.tensor(ex[1]) for ex in batch])
        batch = tokenizer(seqs,
                          padding='longest',
                          truncation=False,
                          return_tensors='pt',
                          add_special_tokens=True)
        batch['labels'] = labels
        return batch
    return _collate_fn
class StringLabelDatasetFromHF(TorchDataset):
    '''The design pattern of the code uses the PyTorch Dataset class for accessing the sequences and labels during the training loop.'''
    def __init__(self, hf_dataset, col_name='sequence', label_col='label', **kwargs):
        self.seqs = hf_dataset[col_name].to_numpy()    # Convert to NumPy array
        self.labels = hf_dataset[label_col].to_numpy() # Convert to NumPy array
        self.lengths = [len(seq) for seq in self.seqs]

    def avg(self):
        return sum(self.lengths) / len(self.lengths)

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        seq = self.seqs[idx]
        label = self.labels[idx]
        return seq, label
torchdataset_my_train = StringLabelDatasetFromHF(my_train)
torchdataset_my_valid = StringLabelDatasetFromHF(my_valid)
torchdataset_my_test = StringLabelDatasetFromHF(my_test)
data_collator = string_labels_collator_builder(tokenizer)
# LORA fine-tuning
# Define the regex pattern to match desired layers (excluding LayerNorm - ffn.0)
pattern = r"transformer\.blocks\.\d+\.(attn\.layernorm_qkv\.1|attn\.out_proj|ffn\.[13])"
target_modules = [
    name
    for name, module in model_classification.named_modules()  # iterate through all modules and their names
    if re.fullmatch(pattern, name)
]
print(f'Target modules for LORA: {target_modules}')

lora_config = LoraConfig(
    r=4,                            # Rank of the LoRA update matrices
    lora_alpha=32,                  # Scaling factor for the LoRA update matrices
    lora_dropout=0.05,              # Dropout probability for the LoRA update matrices
    bias="none",                    # Whether to apply bias to the LoRA update matrices
    task_type=TaskType.SEQ_CLS,     # Task type for sequence classification
    target_modules=target_modules,  # Modules which the LoRA method should target and modify
)
model = get_peft_model(model_classification, lora_config)

# Print the number of trainable parameters in the LoRA-adapted model
model.print_trainable_parameters()
# Define Huggingface Trainer arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-4,
    # effective training batch size is batch * accum
    # we recommend an effective training batch size of 8
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    #deepspeed=ds_config if deepspeed else None,
    fp16=False,
    gradient_checkpointing=False,
)
# Metric definition for validation data
def compute_metrics(eval_pred, num_labels=3):
    if num_labels > 1:  # for classification
        metric = load("accuracy")
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
    else:               # for regression
        metric = load("spearmanr")
        predictions, labels = eval_pred
    return metric.compute(predictions=predictions, references=labels)
# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=torchdataset_my_train,
    eval_dataset=torchdataset_my_valid,
    data_collator=data_collator,  # the custom data collator
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()
This is the error I got:
ValueError Traceback (most recent call last)
<ipython-input-28-3435b262f1ae> in <cell line: 0>()
----> 1 trainer.train()
9 frames
/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
43 except AttributeError:
44 wrap = None
---> 45 result = getattr(asarray(obj), method)(*args, **kwds)
46 if wrap:
47 if not isinstance(result, mu.ndarray):
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (2, 501) + inhomogeneous part.
I am sorry that we still couldn't resolve the issue! Maybe I am missing something basic or critical; I'm still new to LLMs and the Hugging Face API in general.
I can send you a small sample of my data so that you can try yourself if that's OK with you.
Thank you
Dani
Is it happening after exactly 1 epoch? This could be an error from the evaluation, likely happening in compute_metrics. I would write a separate one for regression or for classification based on your needs, and pass the correct one when needed (examples of both are below). The only argument for compute_metrics should be an EvalPrediction. You can type hint it like this:
from transformers import EvalPrediction

def compute_metrics(p: EvalPrediction):
    preds, labels = p.predictions, p.label_ids
    # if preds or labels is a tuple you usually need to take the 0th index, I usually add an if statement for this
    # etc.
# For example
import numpy as np
from scipy.stats import spearmanr, pearsonr
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def compute_metrics_regression(p: EvalPrediction):
    """
    Compute various regression metrics for model evaluation.

    Args:
        p (EvalPrediction): An object containing predictions and label ids.

    Returns:
        dict: A dictionary containing the following metrics:
            - r_squared: Coefficient of determination
            - spearman_rho: Spearman's rank correlation coefficient
            - spear_pval: p-value for Spearman's correlation
            - pearson_rho: Pearson correlation coefficient
            - pear_pval: p-value for Pearson's correlation
            - mse: Mean Squared Error
            - mae: Mean Absolute Error
            - rmse: Root Mean Squared Error
    """
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    labels = p.label_ids[0] if isinstance(p.label_ids, tuple) else p.label_ids
    logits = np.array(preds).flatten()
    labels = np.array(labels).flatten()
    r2 = r2_score(labels, logits)
    spearman_rho, spear_pval = spearmanr(logits, labels)
    pearson_rho, pear_pval = pearsonr(logits, labels)
    mse = mean_squared_error(labels, logits)
    mae = mean_absolute_error(labels, logits)
    rmse = np.sqrt(mse)
    return {
        'r_squared': round(r2, 5),
        'spearman_rho': round(spearman_rho, 5),
        'spear_pval': round(spear_pval, 5),
        'pearson_rho': round(pearson_rho, 5),
        'pear_pval': round(pear_pval, 5),
        'mse': round(mse, 5),
        'mae': round(mae, 5),
        'rmse': round(rmse, 5),
    }
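And a classification counterpart along the same lines could look like this (a sketch, assuming integer class labels and logits of shape [batch_size, num_labels]):
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import EvalPrediction

def compute_metrics_classification(p: EvalPrediction):
    # take the 0th element if the Trainer hands back tuples
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    labels = p.label_ids[0] if isinstance(p.label_ids, tuple) else p.label_ids
    preds = np.argmax(preds, axis=-1)  # logits -> predicted class ids
    return {
        'accuracy': round(accuracy_score(labels, preds), 5),
        'f1_macro': round(f1_score(labels, preds, average='macro'), 5),
    }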
I also don't think you need the .to_numpy()
in your dataset class. That shouldn't be able to run for a list of strings.
I would be happy to look at a small sample of your data, one or a couple example lines is fine if it is sensitive (you can change the column names too). I can just copy what you send several times if I need more samples. Also, if you could send the full traceback I may be able to debug a bit better. Sometimes an IDE will not show you the whole thing, I don't think it did here. Not sure how to fix that though.
It's great that you are new to LLMs and Huggingface! Welcome to the ecosystem. There is definitely a learning curve but once it clicks it is a fantastic resource for research. Don't get discouraged!
Best,
Logan