|
# TReconLM
|
|
|
TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences. It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.
|
|
|
## Model Variants
|
|
|
We provide pretrained and fine-tuned model checkpoints for the following ground-truth sequence lengths:
|
|
|
- L = 60

- L = 110

- L = 180
|
|
|
Each model supports reconstruction from clusters of 2 to 10 traces.
|
|
|
## How to Use
|
|
|
A Colab notebook, `trace_reconstruction.ipynb`, is available in our [GitHub repository](https://github.com/MLI-lab/TReconLM); it demonstrates how to load the model and run inference on our benchmark datasets. The test datasets used in the notebook can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
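If you just want the benchmark data locally, here is a minimal sketch using the `huggingface_hub` client. This is one possible way to obtain the files, not necessarily how the notebook loads them:

```python
# Sketch: fetch the TReconLM benchmark datasets from Hugging Face.
# Assumes `pip install huggingface_hub`; the Colab notebook may load
# the data differently, so treat this as one option, not the official one.
from huggingface_hub import snapshot_download

# Download every file in the dataset repo and return the local path.
local_dir = snapshot_download(
    repo_id="mli-lab/TReconLM_datasets",
    repo_type="dataset",
)
print("Datasets downloaded to:", local_dir)
```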
|
|
|
|
|
## Training Details
|
|
|
- Models are pretrained on synthetic data generated by sampling ground-truth sequences of length L uniformly at random over the quaternary alphabet, and independently introducing insertions, deletions, and substitutions at each position.

- Error probabilities for insertions, deletions, and substitutions are drawn uniformly from the interval [0.01, 0.1], and cluster sizes are sampled uniformly from the integers 2 to 10; see the sketch after this list.

- Models are fine-tuned on real-world sequencing data (Noisy-DNA and Microsoft datasets).
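For concreteness, the following is a minimal sketch of this synthetic data generation. It illustrates the channel described above and is not the paper's exact implementation: details such as insertions at the very end of a sequence, or whether error rates are resampled per trace rather than per cluster, may differ.

```python
import random

ALPHABET = "ACGT"

def corrupt(seq: str, p_ins: float, p_del: float, p_sub: float) -> str:
    """Pass a sequence through an insertion/deletion/substitution channel,
    applying each error type independently at every position."""
    out = []
    for base in seq:
        if random.random() < p_ins:        # insert a uniform random base
            out.append(random.choice(ALPHABET))
        r = random.random()
        if r < p_del:                      # delete the current base
            continue
        if r < p_del + p_sub:              # substitute with a different base
            out.append(random.choice([b for b in ALPHABET if b != base]))
        else:                              # keep the base unchanged
            out.append(base)
    return "".join(out)

def sample_cluster(L: int = 110) -> tuple[str, list[str]]:
    """Draw one training example: a uniform ground-truth sequence of
    length L and a cluster of 2-10 noisy traces of it."""
    truth = "".join(random.choices(ALPHABET, k=L))
    p_ins, p_del, p_sub = (random.uniform(0.01, 0.1) for _ in range(3))
    n_traces = random.randint(2, 10)
    traces = [corrupt(truth, p_ins, p_del, p_sub) for _ in range(n_traces)]
    return truth, traces

truth, traces = sample_cluster(L=60)
print(truth)
print(traces)
```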
|
|
|
For full experimental details, see [our paper](http://arxiv.org/abs/2507.12927).
|
|
|
## Limitations
|
|
|
Models are trained for fixed sequence lengths and may perform worse on sequences of other lengths, or when the test data distribution differs significantly from the training distribution.