UCSC-VLAA
/

m1-7B-23K

Question Answering

text-generation

text-generation-inference

Model card Files Files and versions

m1-7B-23K / README.md

cihangxie's picture

Add model card metadata and link to code (#1)

e3668e6 verified about 1 month ago

|

history blame contribute delete

1.69 kB

	---
	license: mit
	library_name: transformers
	pipeline_tag: question-answering
	---

	```markdown
	<div align="center">
	<h1>
	<b>m1</b>: Unleash the Potential of Test-Time Scaling for Medical Reasoning in Large Language Models
	</h1>
	<p>
	A simple test-time scaling strategy, with minimal fine-tuning, can unlock strong medical reasoning within large language models.
	</p>
	</div>

	This repository contains the model presented in the paper [m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning in Large Language Models](https://huggingface.co/papers/2504.00869).

	Code: https://github.com/UCSC-VLAA/m1

	## ⚡ Introduction

	Hi! Welcome to the huggingface repository for m1!

	m1 is a medical LLM designed to enhance reasoning through efficient test-time scaling. It enables lightweight models to match or exceed the performance of much larger counterparts by extending inference-time “thinking.” Unlike methods that rely on complex RL or expert supervision, m1 achieves strong results through:

	- Fine-tuning on a small, high-quality set of verified medical reasoning examples, showing that even with just 1K–23K examples, m1-7B surpasses models like HuatuoGPT-o1-7B and UltraMedical-8B, and m1-32B rivals 70B-scale models.

	- Scaling reasoning at inference using token budgets, which consistently improves performance across medical QA tasks—up to an optimal ~4K token budget, beyond which performance may degrade due to overthinking.

	- Identifying medical knowledge as the key bottleneck, revealing that additional reasoning alone cannot overcome knowledge gaps; instead, improvements require better data quality and increased model capacity.
	```