I found that if we apply the reasoning system prompt published on the NousResearch/DeepHermes-3-Llama-3-8B-Preview model card, other models also react to it and start mimicking reasoning, some better, some worse. I've seen internal monologue and self-questioning.
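If you want to try this yourself, here is a minimal sketch using the transformers chat-template API. The model name is just an example, and the system prompt is a placeholder: paste the actual reasoning prompt from the DeepHermes-3 model card.

```python
# Minimal sketch: apply a reasoning-style system prompt to another chat model.
# The model id below is only an example; the prompt string is a placeholder for
# the system prompt published on the DeepHermes-3-Llama-3-8B-Preview model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # any instruct-tuned model to test
reasoning_system_prompt = "<paste the DeepHermes-3 reasoning system prompt here>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": reasoning_system_prompt},
    {"role": "user", "content": "How many weighings do I need to find the odd coin among 12?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Leave room for the (often long) internal monologue before the final answer.
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```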
Evaluating Long Context #2: SCROLLS and ZeroSCROLLS
In this series of posts tracing the history of long context evaluation, we started with Long Range Arena (LRA). Introduced in 2020, LRA is one of the earliest benchmarks designed to tackle the challenge of long context evaluation, but it wasn't built to evaluate LLMs; rather, it targeted the transformer architecture in general.
The SCROLLS benchmark, introduced in 2022, addresses this gap in NLP/LLM research. SCROLLS challenges models with tasks that require reasoning over extended sequences (by 2022 standards). So, what does it offer?
1️⃣ Long Text Focus: Unlike LRA, SCROLLS focuses mainly on text and contains inputs with thousands of words, testing models' ability to synthesize information across lengthy documents.
2️⃣ Diverse Tasks: Includes summarization, question answering, and natural language inference across domains like literature, science, and business.
3️⃣ Unified Format: All datasets are available in a text-to-text format, facilitating easy evaluation and comparison of models (see the loading sketch below).
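As an illustration of that unified text-to-text format, here is a rough sketch of loading one SCROLLS task with the Hugging Face datasets library. The dataset path, config name, and field names ("tau/scrolls", "qasper", "input"/"output") are my assumptions from the public hub release, so double-check them against the SCROLLS repository.

```python
# Sketch: peek at the unified text-to-text format of a SCROLLS task.
# Dataset path, config, and field names are assumptions; verify against the SCROLLS release.
from datasets import load_dataset

ds = load_dataset("tau/scrolls", "qasper", split="validation")

example = ds[0]
print(example["input"][:500])   # long document + question, a single input string
print(example["output"])        # the reference answer, a single target string
```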
Building on SCROLLS, ZeroSCROLLS takes long text evaluation to the next level by focusing on zero-shot learning. Other features include:
1️⃣ New Tasks: Introduces tasks like sentiment aggregation and sorting book chapter summaries (see the loading sketch below).
2️⃣ Leaderboard: A live leaderboard encourages continuous improvement and competition among researchers.
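To see what one of those new zero-shot tasks looks like, here is a hedged sketch along the same lines as above. The dataset path, config name, split, and field name ("tau/zero_scrolls", "space_digest", "test", "input") are assumptions; check them against the ZeroSCROLLS paper and leaderboard page.

```python
# Sketch: inspect one of the new ZeroSCROLLS tasks in its zero-shot format.
# Dataset path, config, split, and field names are assumptions.
from datasets import load_dataset

ds = load_dataset("tau/zero_scrolls", "space_digest", split="test")

example = ds[0]
# Each example should already be a fully formatted zero-shot prompt: the task
# instructions, the long input (here, a batch of reviews to aggregate), and the question.
print(example["input"][:800])
```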
What are some other landmark benchmarks in the history of long context evaluation? Feel free to share your thoughts and suggestions in the comments.
'Can it run DeepSeek V3 671B?' is the new 'Can it run Doom?'.
How minimalist can I go with on-device AI and behemoth models? Here I'm running the DeepSeek V3 MoE on a single A6000 GPU.
Not great, not terrible for such a minimalist setup. I love Mixture-of-Experts architectures. Typically I run my core LLM distributed across four GPUs.
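For the curious, here is a rough sketch of the kind of setup I mean, using llama-cpp-python with a heavily quantized GGUF and only a handful of layers offloaded to the GPU. The model path is a placeholder, and the exact layer count that fits in 48 GB of VRAM depends on the quantization you pick.

```python
# Rough sketch of a minimalist single-GPU setup with llama-cpp-python.
# The GGUF path is a placeholder: you need a heavily quantized DeepSeek V3 GGUF,
# and most of the 671B weights will stay memory-mapped in system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/deepseek-v3-q2_k.gguf",  # placeholder filename
    n_gpu_layers=8,    # offload only what fits in the A6000's 48 GB of VRAM
    n_ctx=4096,        # keep the context small; the KV cache eats VRAM fast
    n_threads=32,      # MoE layers running on CPU benefit from many threads
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the Mixture-of-Experts idea in two sentences."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```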
Make sure you own your AI. AI in the cloud is not aligned with you; it's aligned with the company that owns it.