Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? • arXiv:2502.12215 • Published Feb 2025
You Do Not Fully Utilize Transformer's Representation Capacity • arXiv:2502.09245 • Published Feb 2025
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity • arXiv:2502.13063 • Published Feb 2025
REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation • arXiv:2502.13270 • Published Feb 2025
LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization • arXiv:2502.13922 • Published Feb 2025
Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering • arXiv:2502.13962 • Published Feb 2025
MoM: Linear Sequence Modeling with Mixture-of-Memories • arXiv:2502.13685 • Published Feb 2025
S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning • arXiv:2502.12853 • Published Feb 2025
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning • arXiv:2502.14768 • Published Feb 2025
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? • arXiv:2502.14502 • Published Feb 2025
FoNE: Precise Single-Token Number Embeddings via Fourier Features • arXiv:2502.09741 • Published Feb 2025
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU • arXiv:2502.08910 • Published Feb 2025
TransMLA: Multi-head Latent Attention Is All You Need • arXiv:2502.07864 • Published Feb 2025
MiniMax-01: Scaling Foundation Models with Lightning Attention • arXiv:2501.08313 • Published Jan 14, 2025
Retrofitting (Large) Language Models with Dynamic Tokenization • arXiv:2411.18553 • Published Nov 27, 2024
Byte Latent Transformer: Patches Scale Better Than Tokens • arXiv:2412.09871 • Published Dec 13, 2024