benjamin committed
Commit 19d951e · verified · 1 Parent(s): 2b29236

Update README.md

Files changed (1)
  1. README.md (+5 -1)
README.md CHANGED
@@ -88,4 +88,8 @@ Training took ~26 hours on a TPU v4-32.

 The current version of this model is trained for 20k steps with 32*2048 bytes per batch (= 1.3B bytes ≈ 328M subword tokens total). It was unexpected that it performs as well as it does with this very short training procedure. We plan to train a new version for more steps (you can also do so yourself using [`tokenkit`](https://github.com/bminixhofer/tokenkit)).

- To preserve efficiency, we would have to add (a combination of) [BLT-style hierarchical processing](https://arxiv.org/abs/2412.09871), [attention approximations](https://hkunlp.github.io/blog/2025/evabyte/), and [self-speculative decoding](https://arxiv.org/abs/2309.08168).
+ To preserve efficiency, we would have to add (a combination of) [BLT-style hierarchical processing](https://arxiv.org/abs/2412.09871), [attention approximations](https://hkunlp.github.io/blog/2025/evabyte/), and [self-speculative decoding](https://arxiv.org/abs/2309.08168).
+
+ ## Acknowledgments
+
+ Training was enabled by Cloud TPUs from Google’s TPU Research Cloud (TRC).
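
As a quick sanity check on the training-volume figures quoted in the diff above: the total byte count follows directly from the stated step count and batch size, and backing out the quoted ≈ 328M subword-token equivalent implies roughly 4 bytes per subword token. That ratio is an assumption on my part, not a figure stated in the README. A minimal sketch:

```python
# Sanity check of the training-volume figures quoted in the README diff above.
# The bytes-per-subword-token ratio is an assumption used to back out the
# quoted ~328M-token equivalent; it is not stated explicitly in the README.

steps = 20_000                       # training steps
bytes_per_batch = 32 * 2048          # sequences per batch * bytes per sequence

total_bytes = steps * bytes_per_batch
print(f"total bytes: {total_bytes:,}")              # 1,310,720,000 ≈ 1.3B bytes

assumed_bytes_per_subword_token = 4                 # assumption, not from the README
approx_subword_tokens = total_bytes // assumed_bytes_per_subword_token
print(f"approx subword tokens: {approx_subword_tokens:,}")  # 327,680,000 ≈ 328M
```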