
IntrinSight: A Large Vision-Language Model for Medical Insight

IntrinSight is a cutting-edge Large Vision-Language Model (LVLM), fine-tuned for advanced reasoning and analysis within the medical domain. It is designed to act as a "wisdom mirror," capable of directly interpreting medical images (such as X-rays, CT scans, and MRIs) and synthesizing this visual information with associated textual data (like clinical notes or questions) to assist healthcare professionals in making more precise judgments.

Unlike traditional language models that only process text, IntrinSight can "see." It grounds its reasoning in visual evidence, making it a powerful tool for tasks like anomaly detection in scans, image-based diagnosis assistance, and generating descriptive reports from visual data.

Model Overview

Base Model: Gemma-3-4B-IT

Training Dataset: GMAI-Reasoning10K. This is a high-quality medical image reasoning dataset containing 10,000 carefully selected samples. The data was collected from 95 medical datasets from reliable sources such as Kaggle, GrandChallenge, and Open-Release, covering 12 imaging modalities including X-ray, CT, and MRI. Data preprocessing followed the standardization methods from SAMed-20M: individual slices were extracted from 3D data (CT/MRI) with pixel values normalized to the 0-255 range, while key frames were extracted from video data. For each sample, key metadata was used with GPT to construct an informative multiple-choice question with a single correct answer. Strict quality control and rejection sampling strategies were employed to ensure the high quality and reliability of the final dataset.
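The slice extraction and 0-255 normalization step described above can be sketched as follows. This is a minimal illustration of min-max normalizing one axial slice of a 3D scan, not the actual SAMed-20M preprocessing code; the function name is ours:

```python
import numpy as np

def slice_to_uint8(volume: np.ndarray, index: int) -> np.ndarray:
    """Extract one axial slice from a 3D scan (e.g. CT/MRI) and
    min-max normalize its pixel values to the 0-255 uint8 range."""
    sl = volume[index].astype(np.float64)
    lo, hi = sl.min(), sl.max()
    if hi > lo:
        sl = (sl - lo) / (hi - lo)  # scale to [0, 1]
    else:
        sl = np.zeros_like(sl)      # constant slice maps to all zeros
    return (sl * 255.0).round().astype(np.uint8)
```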

Training Framework: VeRL

Training Process

The model was trained for 3 epochs using the DrGRPO algorithm, a variant of Group Relative Policy Optimization (GRPO). The core of the training was to teach the model to anchor its textual reasoning in the visual evidence from the images.

We use three reward functions: Format Reward, Accuracy Reward, and Repetition Penalty.
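The card does not show the reward code, but a minimal sketch of how such reward functions are commonly implemented looks like this (function names, the letter-choice answer format, and the n-gram penalty are our assumptions, not the repository's actual implementation):

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>...</think><answer>...</answer> format."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the letter inside <answer> matches the gold multiple-choice answer."""
    m = re.search(r"<answer>\s*([A-D])", completion)
    return 1.0 if m and m.group(1) == gold else 0.0

def repetition_penalty(completion: str, n: int = 4) -> float:
    """Negative reward proportional to the fraction of repeated n-grams."""
    words = completion.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return -(1.0 - len(set(ngrams)) / len(ngrams))
```

In GRPO-style training these per-sample scores are summed into a scalar reward for each sampled completion in a group.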

The entire training pipeline was constructed and managed using the VeRL framework, which provides a robust and efficient environment for reinforcement learning-based model training.

How to use

We recommend using a system prompt to enable the reasoning mode. The following prompt is one example:

SYSTEM_PROMPT = (
    "A conversation between user and assistant. The user asks a question, and the assistant solves it. The assistant "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process solves the problem step by step, so think about it sincerely. "
    "The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>."
)

A larger token budget may improve the model's performance. Try a larger max_tokens, for example 16384.
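Putting the prompt to work involves two small pieces around the model call: assembling a multimodal chat message and parsing the tagged output. A hedged sketch (the message schema follows the format commonly accepted by transformers' AutoProcessor.apply_chat_template for Gemma-3-style models; the helper names are ours):

```python
import re
from typing import Optional

SYSTEM_PROMPT = "..."  # the reasoning prompt shown above

def build_messages(question: str, image_path: str) -> list:
    """Assemble a system + user multimodal chat turn with one image."""
    return [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
        {"role": "user", "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": question},
        ]},
    ]

def extract_answer(completion: str) -> Optional[str]:
    """Pull the final answer out of the <answer>...</answer> tags."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return m.group(1).strip() if m else None
```

The messages list would then be passed through the processor's chat template and on to the model's generate call with the enlarged max_tokens budget.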

Disclaimer

For Research and Assisting Purposes Only. This model is an experimental tool developed for academic and research purposes. It is not a medical device and is not intended to replace the professional judgment of a qualified healthcare provider. Any output from IntrinSight, including its interpretation of images, should be carefully reviewed and verified by a medical professional before being used for clinical decision-making. The developers assume no liability for any actions taken based on the model's output.

Model size: 4.97B parameters (BF16, Safetensors)