RenalCLIP: A Disease-Centric Vision-Language Foundation Model for Kidney Cancer

RenalCLIP is a 3D Vision-Language Model (VLM) for computed tomography that leverages a novel two-stage knowledge-enhancement pre-training strategy for comprehensive assessment of renal masses. This repository provides the official pre-trained model weights for the image and text encoders used in our study.

For the full implementation, usage examples, and downstream task evaluation scripts, please visit our official GitHub repository.

GitHub: https://github.com/dt-yuhui/RenalCLIP

Paper (arXiv): https://arxiv.org/abs/2508.16569

Model Description

The RenalCLIP model consists of two main components: a specialized image encoder and a text encoder.

Image Encoder

The image encoder is designed to process 3D kidney CT volumes. Its architecture consists of a 3D ResNet-18 backbone followed by a projection layer to create embeddings suitable for cross-modal alignment.

  • Checkpoint: RenalCLIP-image-encoder-model-best-acc.pt
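The model class that consumes this checkpoint is defined in the GitHub repository. As a quick sanity check, the file can be loaded and inspected with plain PyTorch; the snippet below is a minimal sketch, not the official loading code, and the unwrapping step is an assumption about how the checkpoint may be packaged.

```python
# Minimal sketch: inspect the image-encoder checkpoint with PyTorch.
import torch

ckpt_path = "RenalCLIP-image-encoder-model-best-acc.pt"  # filename from this card
state = torch.load(ckpt_path, map_location="cpu")

# Some checkpoints wrap the weights, e.g. {"state_dict": ...}; unwrap if needed
# before loading them into the 3D ResNet-18 encoder from the GitHub repository.
state_dict = state.get("state_dict", state) if isinstance(state, dict) else state
print(f"{len(state_dict)} tensors; first keys: {list(state_dict)[:5]}")
```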

Text Encoder (LLM2Vec)

The text encoder is built upon the Meta-Llama-3-8B-Instruct model and adapted into a powerful medical language expert using the LLM2Vec methodology. We provide two sets of LoRA weights corresponding to the two final stages of its pre-training:

  1. MNTP (Masked Next Token Prediction) Stage:

    • LoRA Weights: LLM2Vec/Meta-Llama-3-8B-Instruct-radiology-ext-long/
    • This stage fine-tunes the base model on a large corpus of medical text (MIMIC-CXR) to enhance its general medical domain comprehension.
  2. SimCSE (Contrastive Learning) Stage:

    • LoRA Weights: LLM2Vec/Meta-Llama-3-8B-Instruct-radiology-simcse/
    • This stage further refines the text encoder's understanding of kidney cancer-related terminology using our in-house pre-training corpus.

Important: The provided text encoder checkpoints are LoRA weights only. To use them, you must download the base Llama-3-8B-Instruct model and place its files in the same directory as these LoRA weights. You may also need to update base_model_name_or_path in adapter_config.json so that the base model resolves correctly; a loading sketch is shown below.
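One way to load the SimCSE-stage adapter is with the open-source llm2vec package, which accepts a base model path and a LoRA adapter path separately. This is a hedged sketch rather than the repository's official loading script; the base-model identifier below assumes you have access to meta-llama/Meta-Llama-3-8B-Instruct (swap in your local path if you downloaded it manually).

```python
# Sketch: load the radiology SimCSE LoRA adapter on top of Llama-3-8B-Instruct
# with the llm2vec package, then encode a sentence into an embedding.
import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # base model (hub id or local path)
    peft_model_name_or_path="LLM2Vec/Meta-Llama-3-8B-Instruct-radiology-simcse",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)

# Encode kidney-cancer-related text into an embedding for cross-modal alignment.
reps = l2v.encode(["Right renal mass with heterogeneous enhancement."])
print(reps.shape)
```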

How to Use

For detailed instructions on how to load the model weights, prepare data, and run the pre-training, fine-tuning, and inference scripts, please refer to our official GitHub repository:

https://github.com/dt-yuhui/RenalCLIP
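At a high level, inference reduces to the standard CLIP-style matching step: embed the CT volume with the image encoder, embed the report or prompt with the text encoder, L2-normalize both, and compare them by cosine similarity. The sketch below is generic and uses random placeholder tensors with an assumed shared embedding dimension of 512; the actual embedding dimension and encoder calls come from the repository code.

```python
# Generic CLIP-style matching step with placeholder embeddings.
import torch
import torch.nn.functional as F

image_emb = torch.randn(4, 512)  # placeholder: 4 CT volume embeddings
text_emb = torch.randn(3, 512)   # placeholder: 3 report/prompt embeddings

# After L2-normalization, cosine similarity is a plain matrix product.
image_emb = F.normalize(image_emb, dim=-1)
text_emb = F.normalize(text_emb, dim=-1)
similarity = image_emb @ text_emb.T  # shape: (4, 3)

# Best-matching text for each image:
print(similarity.argmax(dim=-1))
```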

Citation

If you find our work useful in your research, please consider citing our paper:

@article{Tao2025RenalCLIP,
  title={A Disease-Centric Vision-Language Foundation Model for Precision Oncology in Kidney Cancer},
  author={Yuhui Tao and Zhongwei Zhao and Zilong Wang and Xufang Luo and Feng Chen and others},
  journal={arXiv preprint arXiv:2508.16569},
  year={2025}
}