Add/improve model card (#1)
- Add/improve model card (cb1ccb7c9c86e9d4836e723d08c93ba50909df3f)
Co-authored-by: Niels Rogge <[email protected]>
README.md CHANGED

@@ -1,3 +1,9 @@
+---
+license: mit
+library_name: transformers
+pipeline_tag: question-answering
+---
+
<div align="center">
<h1>
<b>m1</b>: Unleash the Potential of Test-Time Scaling for Medical Reasoning in Large Language Models

@@ -9,13 +15,57 @@ A simple test-time scaling strategy, with minimal fine-tuning, can unlock strong

## ⚡ Introduction

+

-Hi! Welcome to the
+Hi! Welcome to the repository for **m1** (📃 [Paper](https://arxiv.org/abs/2504.00869))!

**m1** is a medical LLM designed to enhance reasoning through efficient test-time scaling. It enables lightweight models to match or exceed the performance of much larger counterparts by extending inference-time “thinking.” Unlike methods that rely on complex RL or expert supervision, m1 achieves strong results through:

-- **Fine-tuning on a small, high-quality set of verified medical reasoning examples**, showing that even with just 1K–23K examples, m1-7B *surpasses* models like HuatuoGPT-o1-7B and UltraMedical-8B, and m1-32B *rivals* 70B-scale models.
+- **Fine-tuning on a small, high-quality set of verified medical reasoning examples**, showing that even with just 1K–23K examples, m1-7B *surpasses* previous SOTA models like HuatuoGPT-o1-7B and UltraMedical-8B, and m1-32B *rivals* 70B-scale models.

-- **Scaling reasoning at inference using token budgets**, which consistently improves performance across medical QA tasks
+- **Scaling reasoning at inference using token budgets**, which consistently improves performance across medical QA tasks, up to an optimal ~4K token budget, beyond which performance may degrade due to overthinking.

- **Identifying medical knowledge as the key bottleneck**, revealing that additional reasoning alone cannot overcome knowledge gaps; instead, improvements require better data quality and increased model capacity.
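
As a rough illustration of the token-budget idea in the bullets above, here is a minimal sketch of budget-capped generation using the standard `transformers` API. The checkpoint name comes from the Models and Data table below; the example question, the 4K budget, and the `"Final answer:"` forcing string are illustrative assumptions, not the repository's actual inference code (see the Inference section for that).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCSC-VLAA/m1-7B-23K"  # listed in the Models and Data table below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = (
    "A 54-year-old man presents with crushing chest pain radiating to the left arm. "
    "What is the most likely diagnosis?"
)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Test-time scaling knob: cap how many "thinking" tokens the model may spend.
thinking_budget = 4096  # roughly the optimal budget reported in the card
reasoning_ids = model.generate(**inputs, max_new_tokens=thinking_budget, do_sample=False)
reasoning_text = tokenizer.decode(reasoning_ids[0], skip_special_tokens=True)

# Once the budget is exhausted, force a short final answer.
# ("Final answer:" is an assumed cue, not necessarily the one m1 was trained with.)
answer_inputs = tokenizer(reasoning_text + "\nFinal answer:", return_tensors="pt").to(model.device)
answer_ids = model.generate(**answer_inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(answer_ids[0][answer_inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```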
+
+We open-sourced our models, data, and code here.
+
+****************************************************************
+
+**Updates:**
+
+* 2025-03: We release our code, data, models, and paper!
+
+****************************************************************
+
+### 🌍 Environment
+
+Please refer to [docs/ENV.md](docs/ENV.md).
+
+### 👨‍⚕️ Models and Data
+
+| Model         | Backbone             | Training Data                                                    | Link                                                  |
+| ------------- | -------------------- | ---------------------------------------------------------------- | ----------------------------------------------------- |
+| **m1-32b-1k** | Qwen2.5-32B-Instruct | [m1k](https://huggingface.co/datasets/UCSC-VLAA/m1k-tokenized)   | [HF Link](https://huggingface.co/UCSC-VLAA/m1-32B-1K) |
+| **m1-7b-1k**  | Qwen2.5-7B-Instruct  | [m1k](https://huggingface.co/datasets/UCSC-VLAA/m1k-tokenized)   | [HF Link](https://huggingface.co/UCSC-VLAA/m1-7B-1K)  |
+| **m1-7b-23k** | Qwen2.5-7B-Instruct  | [m23k](https://huggingface.co/datasets/UCSC-VLAA/m23k-tokenized) | [HF Link](https://huggingface.co/UCSC-VLAA/m1-7B-23K) |
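
The tokenized training sets linked in the table can be pulled with the `datasets` library; a minimal sketch (the `train` split name and the record fields are assumptions, so inspect a row before relying on specific columns):

```python
from datasets import load_dataset

# m23k: the ~23K verified medical reasoning examples referenced above
m23k = load_dataset("UCSC-VLAA/m23k-tokenized", split="train")

print(len(m23k))       # number of examples
print(m23k[0].keys())  # check which fields (question, reasoning trace, answer, ...) exist
```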
+
+### 🏃 Inference
+
+(... same content as original README ...)
+
+### 📖 Citation
+
+```
+@misc{huang2025m1UnleashPotential,
+      title={m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning in Large Language Models},
+      author={Xiaoke Huang and Juncheng Wu and Hui Liu and Xianfeng Tang and Yuyin Zhou},
+      year={2025},
+      eprint={2504.00869},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2504.00869},
+}
+```