The Ultra-Scale Playbook
The ultimate guide to training LLMs on large GPU clusters
Loved it, thanks for doing the work, explaining it, and releasing the new BERT model.
I was wondering how much you would gain over ModernBERT-Base by distilling ModernBERT-Large back into a ModernBERT-Base-sized model. Any idea whether that is worth doing?
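For context, "distilling Large back into a Base-sized model" usually means training the smaller student to match the teacher's softened output distribution rather than only the hard labels. A minimal sketch of the standard temperature-scaled KL distillation loss (the function names here are illustrative, not part of any ModernBERT codebase):

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T produces softer distributions.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Mean KL(teacher || student) at temperature T, scaled by T^2.

    The T^2 factor keeps the gradient magnitude comparable across
    temperatures (as in Hinton et al.'s original distillation setup).
    """
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)  # student's softened predictions
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return (T ** 2) * kl.mean()

# A perfectly matching student incurs (near-)zero loss:
logits = np.array([[2.0, 0.5, -1.0]])
print(distillation_loss(logits, logits))  # ~0.0
```

In practice this term is combined with the ordinary cross-entropy on the true labels, weighted by a mixing coefficient; whether the distilled Base-sized student beats a from-scratch ModernBERT-Base is exactly the empirical question the comment raises.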