Collections
Collections including paper arxiv:2502.11089

- MiniMax-01: Scaling Foundation Models with Lightning Attention
  Paper • 2501.08313 • Published • 273
- rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
  Paper • 2501.04519 • Published • 257
- Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
  Paper • 2412.13663 • Published • 134
- Apollo: An Exploration of Video Understanding in Large Multimodal Models
  Paper • 2412.10360 • Published • 140

- deepseek-ai/DeepSeek-V3-Base
  Updated • 383k • 1.57k
- TransMLA: Multi-head Latent Attention Is All You Need
  Paper • 2502.07864 • Published • 43
- Qwen2.5 Bakeneko 32b Instruct Awq
  ⚡ Generate text-based responses for chat interactions • 2
- Deepseek R1 Distill Qwen2.5 Bakeneko 32b Awq
  ⚡ Generate detailed responses based on user queries • 2

- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  Paper • 2405.04434 • Published • 17
- Titans: Learning to Memorize at Test Time
  Paper • 2501.00663 • Published • 19
- Transformer^2: Self-adaptive LLMs
  Paper • 2501.06252 • Published • 53
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
  Paper • 2502.11089 • Published • 133

- Qwen2.5 Technical Report
  Paper • 2412.15115 • Published • 346
- Qwen2.5-Coder Technical Report
  Paper • 2409.12186 • Published • 141
- Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
  Paper • 2409.12122 • Published • 3
- Qwen2.5-VL Technical Report
  Paper • 2502.13923 • Published • 136

- LLM Pruning and Distillation in Practice: The Minitron Approach
  Paper • 2408.11796 • Published • 58
- TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
  Paper • 2408.09174 • Published • 52
- To Code, or Not To Code? Exploring Impact of Code in Pre-training
  Paper • 2408.10914 • Published • 42
- Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
  Paper • 2408.11878 • Published • 56

- Addition is All You Need for Energy-efficient Language Models
  Paper • 2410.00907 • Published • 145
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 609
- LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
  Paper • 2404.16710 • Published • 77
- Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
  Paper • 2405.08707 • Published • 31

- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
  Paper • 2404.15653 • Published • 27
- MoDE: CLIP Data Experts via Clustering
  Paper • 2404.16030 • Published • 13
- MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
  Paper • 2405.12130 • Published • 48
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
  Paper • 2405.12981 • Published • 30