Add/improve model card (#1)
- Add/improve model card (cb1ccb7c9c86e9d4836e723d08c93ba50909df3f)
Co-authored-by: Niels Rogge <[email protected]>
README.md CHANGED

@@ -1,3 +1,9 @@
+---
+license: mit
+library_name: transformers
+pipeline_tag: question-answering
+---
+
<div align="center">
<h1>
<b>m1</b>: Unleash the Potential of Test-Time Scaling for Medical Reasoning in Large Language Models

@@ -9,13 +15,57 @@ A simple test-time scaling strategy, with minimal fine-tuning, can unlock strong

## ⚡ Introduction

+

-Hi! Welcome to the
+Hi! Welcome to the repository for **m1** (📃 [Paper](https://arxiv.org/abs/2504.00869))!

**m1** is a medical LLM designed to enhance reasoning through efficient test-time scaling. It enables lightweight models to match or exceed the performance of much larger counterparts by extending inference-time “thinking.” Unlike methods that rely on complex RL or expert supervision, m1 achieves strong results through:

-- **Fine-tuning on a small, high-quality set of verified medical reasoning examples**, showing that even with just 1K–23K examples, m1-7B *surpasses* models like HuatuoGPT-o1-7B and UltraMedical-8B, and m1-32B *rivals* 70B-scale models.
+- **Fine-tuning on a small, high-quality set of verified medical reasoning examples**, showing that even with just 1K–23K examples, m1-7B *surpasses* previous SOTA models like HuatuoGPT-o1-7B and UltraMedical-8B, and m1-32B *rivals* 70B-scale models.

-- **Scaling reasoning at inference using token budgets**, which consistently improves performance across medical QA tasks
+- **Scaling reasoning at inference using token budgets**, which consistently improves performance across medical QA tasks, up to an optimal ~4K token budget, beyond which performance may degrade due to overthinking.

- **Identifying medical knowledge as the key bottleneck**, revealing that additional reasoning alone cannot overcome knowledge gaps; instead, improvements require better data quality and increased model capacity.
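
As a rough illustration of the token-budget idea in the bullets above, here is a minimal sketch of budget-capped generation using the standard `transformers` API. The checkpoint name comes from the Models and Data table below; the example question, the 4K budget, and the `"Final answer:"` forcing string are illustrative assumptions, not the repository's actual inference code (see the Inference section for that).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCSC-VLAA/m1-7B-23K"  # listed in the Models and Data table below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = (
    "A 54-year-old man presents with crushing chest pain radiating to the left arm. "
    "What is the most likely diagnosis?"
)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Test-time scaling knob: cap how many "thinking" tokens the model may spend.
thinking_budget = 4096  # roughly the optimal budget reported in the card
reasoning_ids = model.generate(**inputs, max_new_tokens=thinking_budget, do_sample=False)
reasoning_text = tokenizer.decode(reasoning_ids[0], skip_special_tokens=True)

# Once the budget is exhausted, force a short final answer.
# ("Final answer:" is an assumed cue, not necessarily the one m1 was trained with.)
answer_inputs = tokenizer(reasoning_text + "\nFinal answer:", return_tensors="pt").to(model.device)
answer_ids = model.generate(**answer_inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(answer_ids[0][answer_inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```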
+
+We open-sourced our models, data, and code here.
+
+****************************************************************
+
+**Updates:**
+
+* 2025-03: We release our code, data, models, and paper!
+
+****************************************************************
+
+### 🌍 Environment
+
+Please refer to [docs/ENV.md](docs/ENV.md).
+
+### 👨‍⚕️ Models and Data
+
+| Model         | Backbone             | Training Data                                                    | Link                                                  |
+| ------------- | -------------------- | ---------------------------------------------------------------- | ----------------------------------------------------- |
+| **m1-32b-1k** | Qwen2.5-32B-Instruct | [m1k](https://huggingface.co/datasets/UCSC-VLAA/m1k-tokenized)   | [HF Link](https://huggingface.co/UCSC-VLAA/m1-32B-1K) |
+| **m1-7b-1k**  | Qwen2.5-7B-Instruct  | [m1k](https://huggingface.co/datasets/UCSC-VLAA/m1k-tokenized)   | [HF Link](https://huggingface.co/UCSC-VLAA/m1-7B-1K)  |
+| **m1-7b-23k** | Qwen2.5-7B-Instruct  | [m23k](https://huggingface.co/datasets/UCSC-VLAA/m23k-tokenized) | [HF Link](https://huggingface.co/UCSC-VLAA/m1-7B-23K) |
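
The tokenized training sets linked in the table can be pulled with the `datasets` library; a minimal sketch (the `train` split name and the record fields are assumptions, so inspect a row before relying on specific columns):

```python
from datasets import load_dataset

# m23k: the ~23K verified medical reasoning examples referenced above
m23k = load_dataset("UCSC-VLAA/m23k-tokenized", split="train")

print(len(m23k))       # number of examples
print(m23k[0].keys())  # check which fields (question, reasoning trace, answer, ...) exist
```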
+
+### 🏃 Inference
+
+(... same content as original README ...)
+
+### 📖 Citation
+
+```
+@misc{huang2025m1UnleashPotential,
+      title={m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning in Large Language Models},
+      author={Xiaoke Huang and Juncheng Wu and Hui Liu and Xianfeng Tang and Yuyin Zhou},
+      year={2025},
+      eprint={2504.00869},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2504.00869},
+}
+```