Collections
Collections including paper arxiv:2502.11089

- MiniMax-01: Scaling Foundation Models with Lightning Attention
  Paper • 2501.08313 • Published • 273
- rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
  Paper • 2501.04519 • Published • 257
- Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
  Paper • 2412.13663 • Published • 134
- Apollo: An Exploration of Video Understanding in Large Multimodal Models
  Paper • 2412.10360 • Published • 140

- deepseek-ai/DeepSeek-V3-Base
  Updated • 383k • 1.57k
- TransMLA: Multi-head Latent Attention Is All You Need
  Paper • 2502.07864 • Published • 43
- Qwen2.5 Bakeneko 32b Instruct Awq
  ⚡ Generate text-based responses for chat interactions • 2
- Deepseek R1 Distill Qwen2.5 Bakeneko 32b Awq
  ⚡ Generate detailed responses based on user queries • 2

- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  Paper • 2405.04434 • Published • 17
- Titans: Learning to Memorize at Test Time
  Paper • 2501.00663 • Published • 19
- Transformer^2: Self-adaptive LLMs
  Paper • 2501.06252 • Published • 53
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
  Paper • 2502.11089 • Published • 133

- Qwen2.5 Technical Report
  Paper • 2412.15115 • Published • 346
- Qwen2.5-Coder Technical Report
  Paper • 2409.12186 • Published • 141
- Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
  Paper • 2409.12122 • Published • 3
- Qwen2.5-VL Technical Report
  Paper • 2502.13923 • Published • 136

- LLM Pruning and Distillation in Practice: The Minitron Approach
  Paper • 2408.11796 • Published • 58
- TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
  Paper • 2408.09174 • Published • 52
- To Code, or Not To Code? Exploring Impact of Code in Pre-training
  Paper • 2408.10914 • Published • 42
- Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
  Paper • 2408.11878 • Published • 56

- Addition is All You Need for Energy-efficient Language Models
  Paper • 2410.00907 • Published • 145
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 609
- LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
  Paper • 2404.16710 • Published • 77
- Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
  Paper • 2405.08707 • Published • 31

- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
  Paper • 2404.15653 • Published • 27
- MoDE: CLIP Data Experts via Clustering
  Paper • 2404.16030 • Published • 13
- MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
  Paper • 2405.12130 • Published • 48
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
  Paper • 2405.12981 • Published • 30