Commit 618de05 by monsoon-nlp · verified · 1 Parent(s): 476c3bd

Update README.md

Files changed (1)
  1. README.md +71 -3
README.md CHANGED
@@ -1,3 +1,71 @@
- ---
- license: apache-2.0
- ---
+ ---
+ library_name: transformers
+ license: apache-2.0
+ base_model: monsoon-nlp/dna-blockdiff-2
+ ---
+
+ # DNA and Block Diffusion
+
+ This model uses the [Block Diffusion](https://github.com/kuleshov-group/bd3lms) architecture and
+ the six-nucleotide tokens of [AgroNT](https://huggingface.co/InstaDeepAI/agro-nucleotide-transformer-1b).
+
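To make "six-nucleotide tokens" concrete, here is a minimal sketch that splits a DNA string into consecutive, non-overlapping 6-mers. The function name `split_into_kmers` and the treatment of leftover bases (emitted one nucleotide at a time) are assumptions for illustration; the actual AgroNT tokenizer has its own vocabulary and may handle ambiguous bases and sequence ends differently.

```python
from typing import List


def split_into_kmers(sequence: str, k: int = 6) -> List[str]:
    """Split a DNA string into consecutive, non-overlapping k-mers.

    Illustrative assumption only: bases left over after the last full
    k-mer are emitted as single-nucleotide tokens.
    """
    sequence = sequence.upper()
    # Length of the prefix that divides evenly into k-mers
    full = len(sequence) - len(sequence) % k
    tokens = [sequence[i:i + k] for i in range(0, full, k)]
    # Remaining bases, one token per nucleotide
    tokens.extend(sequence[full:])
    return tokens


tokens = split_into_kmers("ATGCGTACCGTTAG")
print(tokens)
```

With tokens of this size, a 256-token context covers roughly 1,500 nucleotides.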
+ Starting from the [dna-blockdiff-2](https://huggingface.co/monsoon-nlp/dna-blockdiff-2) weights,
+ the model was trained on the [Papaya genome](https://huggingface.co/datasets/monsoon-nlp/wheat-bees) for one epoch.
+
+ Training loss fluctuated, but the validation curve (on the [human genome](https://huggingface.co/datasets/dnagpt/human_genome_GCF_009914755.1)) improved consistently.
+
+ ### Loading model
+
+ ```python
+ from transformers import AutoModelForMaskedLM
+
+ # trust_remote_code=True lets transformers run the custom model code shipped with the checkpoint
+ m = AutoModelForMaskedLM.from_pretrained(
+     "monsoon-nlp/dna-blockdiff-papaya",
+     trust_remote_code=True,
+ )
+ ```
+
+ ### Perplexity of a sequence
+
+ ```bash
+ cd bd3lms && python -u main.py \
+     loader.eval_batch_size=1 \
+     model=small \
+     algo=bd3lm \
+     algo.T=5000 \
+     algo.backbone=hf_dit \
+     data=instadeep \
+     model.length=256 \
+     block_size=4 \
+     wandb=null \
+     mode=ppl_eval \
+     eval.checkpoint_path="monsoon-nlp/dna-blockdiff-papaya" \
+     model.attn_backend=sdpa \
+     sampling.nucleus_p=0.9 \
+     sampling.kv_cache=true \
+     sampling.logdir=$PWD/sample_logs/samples_genlen_bd3lm_blocksize4 \
+     data.tokenizer_name_or_path="monsoon-nlp/dna-blockdiff-papaya"
+ ```
+
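For readers unfamiliar with the metric, the sketch below shows the generic definition of perplexity: the exponential of the mean negative log-likelihood per token. The function name `perplexity` and the example values are hypothetical. It does not reproduce what `mode=ppl_eval` computes internally; for a diffusion model the reported figure is likely a bound derived from the training objective rather than an exact likelihood.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical input: a model that assigns probability 0.25 to every token
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```

A model that spreads probability evenly over four choices has a perplexity of about 4, so lower values mean the model is less "surprised" by the sequence.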
+ ### Generating text
+
+ ```bash
+ cd bd3lms && python -u main.py \
+     loader.eval_batch_size=1 \
+     model=small \
+     algo=bd3lm \
+     algo.T=5000 \
+     algo.backbone=hf_dit \
+     data=instadeep \
+     model.length=256 \
+     block_size=4 \
+     wandb=null \
+     mode=sample_eval \
+     eval.checkpoint_path="monsoon-nlp/dna-blockdiff-papaya" \
+     model.attn_backend=sdpa \
+     sampling.nucleus_p=0.9 \
+     sampling.kv_cache=true \
+     sampling.logdir=$PWD/sample_logs/samples_genlen_bd3lm_blocksize4 \
+     data.tokenizer_name_or_path="monsoon-nlp/dna-blockdiff-papaya"
+ ```
+
+ Currently this generates `<cls> N N N N N...`, but the output could be improved by guiding decoding.
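One possible form of guided decoding, sketched under assumptions: before picking a token, set the logit of every token that contains the ambiguous base `N` to negative infinity so it can never be selected. The toy vocabulary, the logits, and the helper names `mask_banned_tokens` and `greedy_pick` are hypothetical; wiring such a mask into the bd3lms sampler is not shown and would depend on that codebase.

```python
import math
from typing import Dict, List


def mask_banned_tokens(
    logits: List[float],
    id_to_token: Dict[int, str],
    banned_chars: str = "N",
) -> List[float]:
    """Return a copy of logits with -inf for tokens containing a banned character."""
    masked = list(logits)  # copy so the caller's logits are not modified
    for token_id, token in id_to_token.items():
        if token.startswith("<"):
            continue  # leave special tokens such as <cls> untouched
        if any(ch in token for ch in banned_chars):
            masked[token_id] = -math.inf
    return masked


def greedy_pick(logits: List[float]) -> int:
    """Index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical toy vocabulary and logits, where "N" is the unmasked favourite
vocab = {0: "<cls>", 1: "N", 2: "ACGTAC", 3: "ACGTNN", 4: "TTGACA"}
logits = [0.1, 5.0, 1.2, 3.0, 2.5]

masked = mask_banned_tokens(logits, vocab)
best = greedy_pick(masked)
print(vocab[best])
```

The same mask could be applied before nucleus sampling instead of a greedy pick; it only removes candidates and leaves the relative order of the remaining tokens unchanged.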