ImportError: cannot import name 'DataCollatorSpeechSeq2SeqWithPadding' from 'transformers'

#78
by SudheenK - opened

I tried to import this as suggested by some LLM, but it does not work. I checked the transformers repository, and the class is not present there. In a blog post I found a custom data collator for Whisper.

import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths
        # and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [
            {"input_features": feature["input_features"][0]} for feature in features
        ]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 so padded positions are ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # if a bos token was prepended in the previous tokenization step,
        # cut it here since it is appended again during training
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch
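
This is how I am sanity-checking the collator on two dummy examples (a minimal sketch; openai/whisper-small and the random audio are just placeholders I picked, not part of the blog):

import numpy as np
from transformers import WhisperProcessor

# sketch: batch two dummy examples through the collator
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

# each feature mimics a prepared dataset row: log-mel input_features plus tokenized labels
features = [
    {
        "input_features": processor.feature_extractor(
            np.random.randn(16000), sampling_rate=16000
        ).input_features,
        "labels": processor.tokenizer(text).input_ids,
    }
    for text in ["hello", "a longer transcript"]
]

batch = collator(features)
print(batch["input_features"].shape)  # torch.Size([2, 80, 3000]) for whisper-small
print(batch["labels"].shape)          # labels padded to the longest sequence, pads set to -100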

Will this work for any language? I am trying to fine-tune Whisper on Korean.
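
From what I can tell, the collator itself is language-agnostic (it only pads inputs and labels), and the target language is chosen on the processor and the model's generation config. For Korean I am setting it up like this (a sketch assuming openai/whisper-small):

from transformers import WhisperForConditionalGeneration, WhisperProcessor

# sketch: point the tokenizer and generation at Korean transcription
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="korean", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# recent transformers versions set language/task via the generation config
model.generation_config.language = "korean"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None  # legacy attribute; cleared so the settings above take effect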
