metadata
library_name: biomed-multi-omic
license: apache-2.0
tags:
  - Biology
  - DNA

ibm-research/biomed.dna.ref.modernbert.113m.v1

Biomedical foundation models for omics data. This package supports the development of foundation models for scRNA and DNA data.

biomed-multi-omic enables development and testing of foundation models for DNA sequences and for RNA expression, with modular model and training methods for pretraining and fine-tuning, controllable via a declarative no-code interface. biomed-multi-omic leverages anndata, HuggingFace Transformers, PyTorch Lightning and Hydra.

  • 🧬 A single package for DNA and RNA foundation models. scRNA pretraining on h5ad files or TileDB (e.g. CellXGene); DNA pretraining on the human reference genome (GRCh38/hg38) and on a variant-imputed genome based on common SNPs from the GWAS Catalog and ClinVar datasets.
  • 🚀 Leverages the latest open source tools: anndata, HuggingFace Transformers and PyTorch Lightning.
  • 📈 Zero-shot and fine-tuning support for diverse downstream tasks: cell type annotation and perturbation prediction for scRNA; promoter prediction and regulatory region prediction using massively parallel reporter assays (MPRAs) for DNA sequences.
  • Novel pretraining strategies for scRNA and DNA implemented alongside existing methods to enable experimentation and comparison.

For details on how the models were trained, please refer to the BMFM-DNA preprint.

Checkpoint

BMFM-DNA-REF

The pre-training samples were prepared by extracting DNA sequences of random lengths (between 1kb and 10kb) consecutively from the human reference genome. Sequences were excluded if all of their nucleotides were “N”. To further enrich the diversity of the training set, we repeated the whole-genome random sampling 10 times. For each DNA sequence sample, we also created the reverse complement sequence as its counterpart, leading to a total of 9,982,678 samples that cover the human genome roughly 20 times (about 60 billion nucleotides).

For full details see section 3.1.1 of the BMFM-DNA manuscript.
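The sampling procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used to build the released training set: the function names (`reverse_complement`, `sample_chunks`, `build_samples`), the fixed seed, the handling of the final short chunk at the end of a chromosome, and the treatment of lowercase (soft-masked) bases are all assumptions that the description above does not specify.

```python
import random

# Complement table; "N" (unknown base) maps to itself.
# Lowercase handling is an assumption, not taken from the manuscript.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sample_chunks(chrom_seq: str, rng: random.Random,
                  min_len: int = 1_000, max_len: int = 10_000):
    """Walk a sequence left to right, cutting consecutive chunks whose
    lengths are drawn uniformly from [min_len, max_len]."""
    pos = 0
    while pos < len(chrom_seq):
        length = rng.randint(min_len, max_len)
        chunk = chrom_seq[pos:pos + length]  # last chunk may be shorter (assumption)
        pos += length
        # Skip chunks that consist only of "N".
        if set(chunk.upper()) == {"N"}:
            continue
        yield chunk


def build_samples(chrom_seq: str, n_passes: int = 10, seed: int = 0) -> list[str]:
    """Repeat the random chunking n_passes times and pair every chunk
    with its reverse complement."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_passes):
        for chunk in sample_chunks(chrom_seq, rng):
            samples.append(chunk)
            samples.append(reverse_complement(chunk))
    return samples
```

With 10 passes and a reverse complement for every chunk, each position of the genome is seen about 20 times, which for a genome of roughly 3 billion bases matches the figure of about 60 billion nucleotides given above.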

Usage

Using biomed.dna.ref.modernbert.113m.v1 requires the codebase https://github.com/BiomedSciAI/biomed-multi-omic.

For installation, please follow the instructions on GitHub.

DNA Inference

To get embeddings for DNA sequences, run:

export INPUT_DIRECTORY=... # path to your DNA sequences files
bmfm-targets-run -cn dna_predict input_directory=$INPUT_DIRECTORY working_dir=/tmp checkpoint=ibm-research/biomed.dna.ref.modernbert.113m.v1

For more details, see the DNA tutorials on GitHub.

Citation

To cite the tool for both RNA and DNA, please cite both the following articles:

@misc{li2025bmfmdnasnpawarednafoundation,
      title={BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects},
      author={Hongyang Li and Sanjoy Dey and Bum Chul Kwon and Michael Danziger and Michal Rosen-Tzvi and Jianying Hu and James Kozloski and Ching-Huei Tsou and Bharath Dandala and Pablo Meyer},
      year={2025},
      eprint={2507.05265},
      archivePrefix={arXiv},
      primaryClass={q-bio.GN},
      url={https://arxiv.org/abs/2507.05265},
}