- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
  Paper • 2403.09636 • Published • 2
- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
  Paper • 2404.11912 • Published • 17
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
  Paper • 2401.02669 • Published • 16
- LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
  Paper • 2404.16710 • Published • 77
Collections
Collections including paper arxiv:2406.15319
- Generative Representational Instruction Tuning
  Paper • 2402.09906 • Published • 54
- LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
  Paper • 2406.15319 • Published • 64
- BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval
  Paper • 2407.12883 • Published • 9
- mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval
  Paper • 2407.19669 • Published • 23
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
  Paper • 2310.08491 • Published • 54
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
  Paper • 2311.04934 • Published • 32
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 50
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
  Paper • 2406.12624 • Published • 37
- MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
  Paper • 2309.04662 • Published • 23
- Neurons in Large Language Models: Dead, N-gram, Positional
  Paper • 2309.04827 • Published • 17
- Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
  Paper • 2309.05516 • Published • 10
- DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs
  Paper • 2309.03907 • Published • 12