bert-base-cased-DA-ChemTok-ZN1540K-V1

This model is a domain-adapted version of bert-base-cased, trained on the cafierom/ZN1540K dataset of drugs and drug-like molecules.

Model description

This domain adaptation of bert-base-cased was trained on ~41K molecular SMILES strings, with the following tokens added to the tokenizer:

new_tokens = ["[C@H]","[C@@H]","(F)","(Cl)","c1","c2","(O)","N#C","(=O)","([N+]([O-])=O)","[O-]"]
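A minimal sketch of how such tokens could be added before masked-language-model training; the use of add_tokens and resize_token_embeddings here is an assumption for illustration, not the author's verbatim script:

```python
# Assumed setup (not the author's verbatim script): add the chemistry
# tokens to the tokenizer and resize the embedding matrix so the new
# rows can be learned during domain adaptation.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

new_tokens = ["[C@H]", "[C@@H]", "(F)", "(Cl)", "c1", "c2",
              "(O)", "N#C", "(=O)", "([N+]([O-])=O)", "[O-]"]
tokenizer.add_tokens(new_tokens)

# Newly added tokens start with randomly initialized embeddings.
model.resize_token_embeddings(len(tokenizer))
```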

It is intended as a base for fine-tuning classification models on drug-related tasks, as in the sketch below.
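A minimal loading sketch, assuming a binary classification task; num_labels=2 and the example SMILES input are illustrative, not from the original card:

```python
# Hedged example: load this checkpoint as the backbone of a sequence
# classifier. The classification head is freshly initialized and must
# be fine-tuned; num_labels=2 and the input SMILES are illustrative.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "cafierom/bert-base-cased-DA-ChemTok-ZN1540K-V1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("CC(=O)Oc1ccccc1C(=O)O", return_tensors="pt")  # aspirin
logits = model(**inputs).logits
```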

Intended uses & limitations

More information needed

Training and evaluation data


Training procedure

Training hyperparameters

The following hyperparameters were used during training (a hedged TrainingArguments sketch follows the list):

  • learning_rate: 2e-05
  • train_batch_size: 64
  • eval_batch_size: 64
  • seed: 42
  • optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 20
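These settings map onto transformers TrainingArguments roughly as follows; output_dir and the per-epoch evaluation strategy are assumptions (the latter inferred from the per-epoch results table), not taken from the original run:

```python
# Hedged reconstruction of the reported hyperparameters; output_dir and
# eval_strategy are assumptions, not taken from the original run.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-base-cased-DA-ChemTok-ZN1540K-V1",  # assumed name
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=20,
    eval_strategy="epoch",  # assumed; matches the per-epoch table below
)
```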

Training results

| Training Loss | Epoch | Step  | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 1.6227        | 1.0   | 546   | 0.7740          |
| 0.6775        | 2.0   | 1092  | 0.5304          |
| 0.5299        | 3.0   | 1638  | 0.4411          |
| 0.4596        | 4.0   | 2184  | 0.3954          |
| 0.4160        | 5.0   | 2730  | 0.3580          |
| 0.3896        | 6.0   | 3276  | 0.3340          |
| 0.3615        | 7.0   | 3822  | 0.3132          |
| 0.3461        | 8.0   | 4368  | 0.3083          |
| 0.3288        | 9.0   | 4914  | 0.2921          |
| 0.3172        | 10.0  | 5460  | 0.2714          |
| 0.3069        | 11.0  | 6006  | 0.2713          |
| 0.2962        | 12.0  | 6552  | 0.2574          |
| 0.2901        | 13.0  | 7098  | 0.2587          |
| 0.2862        | 14.0  | 7644  | 0.2556          |
| 0.2734        | 15.0  | 8190  | 0.2471          |
| 0.2731        | 16.0  | 8736  | 0.2433          |
| 0.2687        | 17.0  | 9282  | 0.2288          |
| 0.2657        | 18.0  | 9828  | 0.2407          |
| 0.2651        | 19.0  | 10374 | 0.2326          |
| 0.2606        | 20.0  | 10920 | 0.2348          |
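For reference, the final validation loss of 0.2348 corresponds to a masked-LM perplexity of exp(0.2348) ≈ 1.26. The 546 optimizer steps per epoch at batch size 64 imply roughly 35K training examples, which is consistent with the ~41K strings above once a validation split is held out.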

Framework versions

  • Transformers 4.48.3
  • Pytorch 2.5.1+cu124
  • Datasets 3.3.1
  • Tokenizers 0.21.0