Mohammed Hamdy

mmhamdy

AI & ML interests

TechBio | AI4Sci | NLP | Reinforcement Learning

Recent Activity

upvoted a collection 1 day ago

Reasoning Datasets

posted an update 2 days ago

🎉 We're excited to introduce MemoryCode, a novel synthetic dataset designed to rigorously evaluate LLMs' ability to track and execute coding instructions across multiple sessions. MemoryCode simulates realistic workplace scenarios where a mentee (the LLM) receives coding instructions from a mentor amidst a stream of both relevant and irrelevant information.

💡 But what makes MemoryCode unique?! The combination of the following:

✅ Multi-Session Dialogue Histories: MemoryCode consists of chronological sequences of dialogues between a mentor and a mentee, mirroring real-world interactions between coworkers.

✅ Interspersed Irrelevant Information: Critical instructions are deliberately interspersed with unrelated content, replicating the information overload common in office environments.

✅ Instruction Updates: Coding rules and conventions can be updated multiple times throughout the dialogue history, requiring LLMs to track and apply the most recent information.

✅ Prospective Memory: Unlike previous datasets that cue information retrieval, MemoryCode requires LLMs to spontaneously recall and apply relevant instructions without explicit prompts.

✅ Practical Task Execution: LLMs are evaluated on their ability to use the retrieved information to perform practical coding tasks, bridging the gap between information recall and real-world application.

📌 Our Findings

1️⃣ While even small models can handle isolated coding instructions, the performance of top-tier models like GPT-4o dramatically deteriorates when instructions are spread across multiple sessions.

2️⃣ This performance drop isn't simply due to the length of the context. Our analysis indicates that LLMs struggle to reason compositionally over sequences of instructions and updates. They have difficulty keeping track of which instructions are current and how to apply them.

🔗 Paper: https://huggingface.co/papers/2502.13791

📦 Code: https://github.com/for-ai/MemoryCode
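To make the "Instruction Updates" idea concrete, here is a minimal sketch of what tracking the most recent instruction across sessions could look like. Everything in it is an assumption for illustration: the `Turn` structure, the rule keys, and the sample dialogue are hypothetical and do not reflect the actual MemoryCode data format or evaluation code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    """One mentor message (hypothetical structure, not the MemoryCode schema)."""
    session: int                      # chronological session index
    text: str                         # what the mentor said
    rule_key: Optional[str] = None    # None marks irrelevant chatter
    rule_value: Optional[str] = None


def current_rules(history):
    """Return the latest value for each rule, skipping irrelevant turns."""
    rules = {}
    # sorted() is stable, so turns within one session keep their order.
    for turn in sorted(history, key=lambda t: t.session):
        if turn.rule_key is not None:
            rules[turn.rule_key] = turn.rule_value  # later turns overwrite earlier ones
    return rules


def follows_rules(function_name, rules):
    """Check a generated function name against the current prefix rule."""
    return function_name.startswith(rules.get("function_prefix", ""))


history = [
    Turn(1, "Always start function names with 'x_'.", "function_prefix", "x_"),
    Turn(2, "The cafeteria menu changes on Friday."),
    Turn(2, "End variable names with '_v'.", "variable_suffix", "_v"),
    Turn(3, "From now on, start function names with 'y_'.", "function_prefix", "y_"),
]

rules = current_rules(history)
print(rules)
print(follows_rules("y_parse", rules), follows_rules("x_parse", rules))
```

In this toy setup a model that remembers only the first session would still emit `x_` names and fail the check, which is the kind of stale-instruction error the post describes.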

mmhamdy's activity

liked a model 1 day ago

microsoft/wham

Updated 3 days ago • 173

liked a Space 3 days ago

1.37k

The Ultra-Scale Playbook

🌌

The ultimate guide to training LLM on large GPU Clusters

liked a model about 1 month ago

hexgrad/Kokoro-82M

Text-to-Speech • Updated 22 days ago • 1.09M • 3.36k

liked a dataset about 2 months ago

HuggingFaceH4/MATH-500

Viewer • Updated Nov 15, 2024 • 500 • 29.1k • 99

liked a model 2 months ago

answerdotai/ModernBERT-base

Fill-Mask • Updated Jan 15 • 10M • 764

liked a Space 2 months ago

518

Scaling test-time compute

📈

Enhance math problem solving by scaling test-time compute

liked a model 2 months ago

CohereForAI/c4ai-command-r7b-12-2024

Text Generation • Updated 3 days ago • 10.8k • 360

liked a Space 3 months ago

Discussion Forum

💬

liked a dataset 3 months ago

CohereForAI/Global-MMLU

Viewer • Updated Dec 12, 2024 • 602k • 13.2k • 106

liked a Space 3 months ago

Language Leads Dashboard

🏃

View and search languages by lead status

liked 3 datasets 3 months ago

liked a dataset 5 months ago

KbsdJames/Omni-MATH

Viewer • Updated Oct 12, 2024 • 4.43k • 3.78k • 79

liked a model 6 months ago

HuggingFaceTB/SmolLM-135M-Instruct

Text Generation • Updated Sep 4, 2024 • 1.05M • 106

liked a model 8 months ago

fireworks-ai/llama-3-firefunction-v2

Text Generation • Updated Jun 18, 2024 • 151 • 143

liked 2 Spaces 9 months ago

772

FineWeb: decanting the web for the finest text data at scale

🍷

Generate high-quality web text data for LLM training

MMLU Collaborative Evaluation

🌐

liked a dataset 9 months ago

TIGER-Lab/MMLU-Pro

Viewer • Updated Nov 27, 2024 • 12.1k • 41k • 322

liked a dataset 10 months ago

InstaDeepAI/nucleotide_transformer_downstream_tasks

Updated Sep 16, 2024 • 2.16k • 15