Some instructions regarding fine-tuning & classification with the ESM++ model

#3
by DaniDubi - opened

Dear contributors and developers, @lhallee

Thank you for the important and helpful work you are doing by making protein LLMs more accessible to the community!

I am trying to follow the code in modeling_esm_plusplus.py in order to perform fine-tuning and downstream protein-level classification for my specific use case.
I am using the ESMplusplus_600M() function for embeddings (I also tried AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_large')) and ESMplusplusForSequenceClassification.from_pretrained_esm("600") for the classification head, with LoRA in between.
I have already tried several ways to correctly embed my dataset (e.g. model.embed_dataset()), or to pass just the sequences + labels (input_ids, ...) via a Dataset object directly to the Trainer together with a tokenizer object.
But nothing seems to work: the training either does not start at all, or crashes after 1 epoch due to inconsistencies in shapes/dimensions and batches between the model inputs and outputs.

I would be very grateful for any help/advice/guidance; of course I can provide the specific code I used or the errors.
Many thanks,
Dani

Synthyra org

Hi @DaniDubi ,

Thanks for your interest in the models. Please paste your code or a GitHub link and I'll be happy to take a look 🙂
Best,
Logan

Hi @lhallee ,

Thank you for the fast response! Can you clarify how you'd prefer it: should I paste the relevant code here, or send you my GitHub repo?

Synthyra org

No problem. Probably GitHub is better.

Sure, thank you!

Here is the link to the notebook:
https://github.com/VadimDu/Protein_LLM_modeling/blob/main/clean_ver_Modeling_ESM_plusplus.ipynb

I basically copied all the code from modeling_esm_plusplus.py there, and added my data and the steps towards fine-tuning the classification model on top of it.

The part I added starts from the cell named "My protein input data".
In the current trial I commented out the data_collator and tokenizer from the Trainer, and used the default collator plus the tokenizer implemented in the ESMplusplus_600M() function (class ESMplusplusForMaskedLM(), self.tokenizer = EsmSequenceTokenizer()).

Any help will be much appreciated!
Dani

Synthyra org

Hey @DaniDubi ,

Sorry for the delay. Keep in mind you can get this model and the implementation by using AutoModelForSequenceClassification.

Upon initially looking through your code I don't see anything inherently wrong. Could you share what error you are getting?

Hi @lhallee ,

Thanks again for your reply. Could you please clarify regarding AutoModelForSequenceClassification? I could not find such a class/method in your code that I am using.

Regarding the errors:

  • If I run the preprocessing and fine-tuning steps exactly as in the notebook (model_embedding = ESMplusplus_600M(num_labels=3), .embed_dataset(), model_classification = ESMplusplusForSequenceClassification.from_pretrained_esm("600"), and no explicit custom data_collator or tokenizer given to Trainer()), this is the error:
ValueError                                Traceback (most recent call last)
<ipython-input-31-3435b262f1ae> in <cell line: 0>()
----> 1 trainer.train()

13 frames
<ipython-input-2-bf6a22e3339a> in forward(self, x, attention_mask, output_hidden_states, output_attentions)
    441             TransformerOutput containing last hidden state and optionally all hidden states and attention weights
    442         """
--> 443         batch_size, seq_len, _ = x.shape
    444         hidden_states = () if output_hidden_states else None
    445         attentions = () if output_attentions else None

ValueError: not enough values to unpack (expected 3, got 2)
  • If I add a class CustomDataCollator to define a data_collator that converts my input_embeds from a 2-dimensional tensor to shape torch.Size([num_of_sequences, 1, 1152]), then 1 training epoch finishes OK, and it then crashes at the start of epoch 2:
Could not estimate the number of tokens of the input, floating-point operations will not be computed
 [ 4/20 00:00 < 00:03, 4.01 it/s, Epoch 1/10]
Epoch	Training Loss	Validation Loss
 [2/2 00:00]
Downloading builder script: 100%
 4.20k/4.20k [00:00<00:00, 506kB/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-48-3435b262f1ae> in <cell line: 0>()
----> 1 trainer.train()

9 frames
/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
     43     except AttributeError:
     44         wrap = None
---> 45     result = getattr(asarray(obj), method)(*args, **kwds)
     46     if wrap:
     47         if not isinstance(result, mu.ndarray):

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 3 dimensions. The detected shape was (2, 20, 1) + inhomogeneous part.

I used 20 sequences just as an example for training.

  • If I use the commented-out cell (#@title ESM++ for protein embeddings using a pre-trained model from Synthyra) for sequence Dataset creation and tokenization with AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_large'), and supply Trainer() with a tokenizer, I get this error:
ValueError                                Traceback (most recent call last)
<ipython-input-28-5075ee0329cb> in <cell line: 0>()
     11 
     12 # Train the model
---> 13 trainer.train()

14 frames
/usr/local/lib/python3.11/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
   3477     if size_average is not None or reduce is not None:
   3478         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3479     return torch._C._nn.cross_entropy_loss(
   3480         input,
   3481         target,

ValueError: Expected input batch_size (2200) to match target batch_size (4).

I hope this information is helpful; many thanks again for your efforts.
Dani

Gotcha. So, a few things. If you want to fine-tune a model for sequence classification, you do not need to pre-embed the sequences; you just need to feed the input_ids and attention_mask through the data collator. You can load the model, without copying the implementation anywhere, by doing this:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True)

From here you can apply lora if you'd like.
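
A minimal sketch of that step with peft might look like this (the target_modules suffixes are an assumption here; confirm the real module names with model.named_modules() before applying LoRA):

from peft import LoraConfig, TaskType, get_peft_model

# NOTE: the target_modules suffixes below are an assumption; check
# model.named_modules() for the exact names in your checkpoint.
lora_config = LoraConfig(
    r=4,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_CLS,
    target_modules=["attn.layernorm_qkv.1", "attn.out_proj", "ffn.1", "ffn.3"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only the LoRA adapters should be trainable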

If you just want to train a model on the vector embeddings, you can embed the sequences like you did and train a small neural network on top.
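
As a rough sketch of that second option, assuming you already have pooled per-protein embeddings and integer class labels (the tensors below are placeholders):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# placeholder data: replace with your own pooled embeddings and labels
embeddings = torch.randn(20, 1152)    # 1152 matches the hidden size you saw for the 600M model
labels = torch.randint(0, 3, (20,))   # 3 classes, as in your setup

head = nn.Sequential(nn.Linear(1152, 256), nn.ReLU(), nn.Linear(256, 3))
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

loader = DataLoader(TensorDataset(embeddings, labels), batch_size=8, shuffle=True)
for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(head(x), y)
        loss.backward()
        optimizer.step()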

Does that make sense?

Here's an example of a collator we use for input_ids and labels. Trainer automatically unpacks a dictionary sent to the model, so everything in "batch" here will go to the right place:

import torch

def string_labels_collator_builder(tokenizer, **kwargs):
    def _collate_fn(batch):
        # each dataset item is a (sequence, label) tuple
        seqs = [ex[0] for ex in batch]
        labels = torch.stack([torch.tensor(ex[1]) for ex in batch])
        batch = tokenizer(seqs,
                          padding='longest',
                          pad_to_multiple_of=8,
                          truncation=False,
                          return_tensors='pt',
                          add_special_tokens=True)
        batch['labels'] = labels
        return batch
    return _collate_fn

tokenizer = model.tokenizer
data_collator = string_labels_collator_builder(tokenizer)

This expects a PyTorch dataset class that outputs a tuple of the sequence and the label you are interested in. A class that might link up with your current workflow looks something like this:

from torch.utils.data import Dataset as TorchDataset

class StringLabelDatasetFromHF(TorchDataset):    
    def __init__(self, hf_dataset, col_name='seqs', label_col='labels', **kwargs):
        self.seqs = hf_dataset[col_name]
        self.labels = hf_dataset[label_col]
        self.lengths = [len(seq) for seq in self.seqs]

    def avg(self):
        return sum(self.lengths) / len(self.lengths)

    def __len__(self):
        return len(self.seqs)
    
    def __getitem__(self, idx):
        seq = self.seqs[idx]
        label = self.labels[idx]
        return seq, label
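
Before handing these to Trainer, a quick sanity check of the collated batch shapes can help catch dimension issues early (hf_dataset below is a placeholder for your own dataset with 'seqs' and 'labels' columns):

from torch.utils.data import DataLoader

# hypothetical hf_dataset; the batch should contain 2-D input_ids/attention_mask
# and a 1-D labels tensor
train_ds = StringLabelDatasetFromHF(hf_dataset)
loader = DataLoader(train_ds, batch_size=2, collate_fn=data_collator)
batch = next(iter(loader))
print({k: v.shape for k, v in batch.items()})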

Does this help? If you try something new and get a new error, please send it along.

Hi @lhallee ,

Many thanks again for your help!

I have implemented the collator and PyTorch dataset as you suggested and used AutoModelForSequenceClassification, but unfortunately, after the 1st training epoch finished, it crashed with a similar error to the one I had before.

Below I will paste all the relevant code from the start up to the training; maybe you can spot some inconsistency there:

import re
import numpy as np
import torch
from torch.utils.data import Dataset as TorchDataset
from transformers import AutoModelForSequenceClassification, AutoConfig, TrainingArguments, Trainer
from peft import LoraConfig, TaskType, get_peft_model
from evaluate import load

config = AutoConfig.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, num_labels=3)
model_classification = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, config=config)
tokenizer = model_classification.tokenizer

# Move models to GPU and keep them in float32
model_classification = model_classification.to(device)  # Remove .half()


def string_labels_collator_builder(tokenizer, **kwargs):
    def _collate_fn(batch):
        seqs = [ex[0] for ex in batch]
        labels = torch.stack([torch.tensor(ex[1]) for ex in batch])
        batch = tokenizer(seqs,
                          padding='longest',
                          truncation=False,
                          return_tensors='pt',
                          add_special_tokens=True)
        batch['labels'] = labels
        return batch
    return _collate_fn


class StringLabelDatasetFromHF(TorchDataset):
    '''The design pattern of the code uses the PyTorch Dataset class for accessing the sequences and labels during the training loop.'''
    def __init__(self, hf_dataset, col_name='sequence', label_col='label', **kwargs):
        self.seqs = hf_dataset[col_name].to_numpy() # Convert to NumPy array
        self.labels = hf_dataset[label_col].to_numpy() # Convert to NumPy array
        self.lengths = [len(seq) for seq in self.seqs]

    def avg(self):
        return sum(self.lengths) / len(self.lengths)

    def __len__(self):
        return len(self.seqs)
    
    def __getitem__(self, idx):
        seq = self.seqs[idx]
        label = self.labels[idx]
        return seq, label


torchdataset_my_train = StringLabelDatasetFromHF(my_train)
torchdataset_my_valid = StringLabelDatasetFromHF(my_valid)
torchdataset_my_test = StringLabelDatasetFromHF(my_test)
data_collator = string_labels_collator_builder(tokenizer)


# LORA fine-tuning
# Define the regex pattern to match desired layers (excluding LayerNorm - ffn.0)
pattern = r"transformer\.blocks\.\d+\.(attn\.layernorm_qkv\.1|attn\.out_proj|ffn\.[13])"

target_modules = [
    name
    for name, module in model_classification.named_modules() # iterate through all modules and their names.
    if re.fullmatch(pattern, name)
]
print(f'Target modules for LORA: {target_modules}')

lora_config = LoraConfig(
    r=4,  # Rank of the LoRA update matrices
    lora_alpha=32,  # Scaling factor for the LoRA update matrices
    lora_dropout=0.05,  # Dropout probability for the LoRA update matrices
    bias="none",  # Whether to apply bias to the LoRA update matrices
    task_type=TaskType.SEQ_CLS,  # Task type for sequence classification
    target_modules=target_modules,  # Modules which LORA method should target and modify their weights
)

model = get_peft_model(model_classification, lora_config)

# Prints the number of trainable parameters in the LoRA-adapted model
model.print_trainable_parameters()

# Define Huggingface Trainer arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy = "epoch",
    logging_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=3e-4,
    # effective training batch size is batch * accum
    # we recommend an effective training batch size of 8
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    #deepspeed= ds_config if deepspeed else None,
    fp16 = False,
    gradient_checkpointing=False,
)

# Metric definition for validation data
def compute_metrics(eval_pred, num_labels=3):
  if num_labels>1:  # for classification
    metric = load("accuracy")
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
  else:  # for regression
    metric = load("spearmanr")
    predictions, labels = eval_pred

  return metric.compute(predictions=predictions, references=labels)


# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=torchdataset_my_train,
    eval_dataset=torchdataset_my_valid,
    data_collator=data_collator,  # the custom data collator
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

This is the error I got:

ValueError                                Traceback (most recent call last)
<ipython-input-28-3435b262f1ae> in <cell line: 0>()
----> 1 trainer.train()

9 frames
/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
     43     except AttributeError:
     44         wrap = None
---> 45     result = getattr(asarray(obj), method)(*args, **kwds)
     46     if wrap:
     47         if not isinstance(result, mu.ndarray):

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (2, 501) + inhomogeneous part.

I am sorry that we still couldn't resolve the issue! Maybe I am missing something basic or critical; I'm still new to LLMs and the Hugging Face API in general.
I can send you a small sample of my data so that you can try it yourself, if that's OK with you.

Thank you
Dani

Synthyra org

Is it happening after exactly 1 epoch? This could be an error from the evaluation, likely happening in compute_metrics. I would write separate ones for regression and classification based on your needs, and pass the correct one when needed. The only argument for compute_metrics should be an EvalPrediction. You can type-hint it like this:

from transformers import EvalPrediction

def compute_metrics(p: EvalPrediction):
    preds, labels = p.predictions, p.label_ids
    # if preds or labels is a tuple you usually need to take the 0th index, I usually add an if statement for this   
    # etc.

# For example

import numpy as np
from scipy.stats import spearmanr, pearsonr
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def compute_metrics_regression(p: EvalPrediction):
    """
    Compute various regression metrics for model evaluation.

    Args:
        (p: EvalPrediction): An object containing predictions and label ids.

    Returns:
        dict: A dictionary containing the following metrics:
            - r_squared: Coefficient of determination
            - spearman_rho: Spearman's rank correlation coefficient
            - spear_pval: p-value for Spearman's correlation
            - pearson_rho: Pearson correlation coefficient
            - pear_pval: p-value for Pearson's correlation
            - mse: Mean Squared Error
            - mae: Mean Absolute Error
            - rmse: Root Mean Squared Error
    """
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    labels = p.label_ids[0] if isinstance(p.label_ids, tuple) else p.label_ids

    logits = np.array(preds).flatten()
    labels = np.array(labels).flatten()

    r2 = r2_score(labels, logits)
    spearman_rho, spear_pval = spearmanr(logits, labels)
    pearson_rho, pear_pval = pearsonr(logits, labels)
    mse = mean_squared_error(labels, logits)
    mae = mean_absolute_error(labels, logits)
    rmse = np.sqrt(mse)

    return {
        'r_squared': round(r2, 5),
        'spearman_rho': round(spearman_rho, 5),
        'spear_pval': round(spear_pval, 5),
        'pearson_rho': round(pearson_rho, 5),
        'pear_pval': round(pear_pval, 5),
        'mse': round(mse, 5),
        'mae': round(mae, 5),
        'rmse': round(rmse, 5),
    }
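
For the classification case (three labels in your setup), a sketch along the same lines might be:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import EvalPrediction

def compute_metrics_classification(p: EvalPrediction):
    # take the 0th entry if predictions/label_ids arrive as tuples
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    labels = p.label_ids[0] if isinstance(p.label_ids, tuple) else p.label_ids
    preds = np.argmax(np.array(preds), axis=-1)   # logits -> predicted class ids
    labels = np.array(labels).flatten()
    return {
        'accuracy': round(accuracy_score(labels, preds), 5),
        'f1_macro': round(f1_score(labels, preds, average='macro'), 5),
    }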

I also don't think you need the .to_numpy() in your dataset class. That shouldn't be able to run for a list of strings.

I would be happy to look at a small sample of your data, one or a couple example lines is fine if it is sensitive (you can change the column names too). I can just copy what you send several times if I need more samples. Also, if you could send the full traceback I may be able to debug a bit better. Sometimes an IDE will not show you the whole thing, I don't think it did here. Not sure how to fix that though.

It's great that you are new to LLMs and Huggingface! Welcome to the ecosystem. There is definitely a learning curve but once it clicks it is a fantastic resource for research. Don't get discouraged!

Best,
Logan
