This model is fine-tuned on the FreedomIntelligence/medical-o1-reasoning-SFT dataset and quantized with AWQ (GEMM). The RLAIF stage is not completed. You can read the dataset authors' paper: https://arxiv.org/pdf/2412.18925
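
For reference, an AWQ GEMM export can be produced with AutoAWQ roughly as below. This is a sketch of the general recipe, not the exact script used for this checkpoint; the paths and quantization config values are assumptions.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Illustrative paths/settings; the actual values used for this checkpoint may differ.
model_path = "path/to/finetuned-phi-4-health-cot"
quant_path = "Phi-4-Health-CoT-1.1-AWQ"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrate and quantize the weights to 4-bit AWQ in the GEMM kernel layout.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```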

Trained with Unsloth for faster fine-tuning and the ability to train more parameters on the same hardware.
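
A minimal sketch of the kind of Unsloth LoRA setup this implies is shown below. The hyperparameters, target modules, and prompt formatting are assumptions, not the training configuration actually used, and the trainer call follows the older TRL `SFTTrainer` signature used in the Unsloth notebooks (newer TRL versions move these arguments into `SFTConfig`).

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load phi-4 through Unsloth's patched loader (4-bit base for memory-efficient LoRA).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; rank, alpha, and target modules are illustrative values.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Column names follow the dataset card (Question / Complex_CoT / Response);
# double-check them against the split you load.
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train")

def format_example(example):
    text = (f"Question: {example['Question']}\n"
            f"Reasoning: {example['Complex_CoT']}\n"
            f"Answer: {example['Response']}{tokenizer.eos_token}")
    return {"text": text}

dataset = dataset.map(format_example)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```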

To use it with vLLM:

```bash
docker network create vllm

docker run --runtime=nvidia --gpus all --network vllm --name vllm \
  -v vllm_cache:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=..." \
  --env "HF_HUB_ENABLE_HF_TRANSFER=0" \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model Yujivus/Phi-4-Health-CoT-1.1-AWQ \
  --quantization awq_marlin \
  --dtype float16 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 2500
```
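
Once the server reports it is ready, you can sanity-check it with a single non-streaming request before running the benchmark below. A minimal sketch using the OpenAI Python client; from the host, the server is reachable on localhost:8000 via the `-p 8000:8000` mapping.

```python
from openai import OpenAI

# From the host, use the published port; inside the "vllm" docker network,
# the base_url would be http://vllm:8000/v1 instead.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Yujivus/Phi-4-Health-CoT-1.1-AWQ",
    messages=[{"role": "user", "content": "What are the symptoms of diabetes?"}],
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].message.content)
```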

You can test vLLM's speed with the following script:

```python
import asyncio

from openai import AsyncOpenAI


async def get_chat_response_streaming(prompt, index):
    # Point the client at the vLLM OpenAI-compatible endpoint.
    # The hostname "vllm" resolves on the docker network created above;
    # from the host, use http://localhost:8000/v1 instead.
    client = AsyncOpenAI(
        base_url="http://vllm:8000/v1",
        api_key="EMPTY",
    )

    messages = [
        {"role": "user", "content": prompt},
    ]

    print(f"Request {index+1}: Starting", flush=True)

    stream = await client.chat.completions.create(
        model="Yujivus/Phi-4-Health-CoT-1.1-AWQ",
        messages=messages,
        max_tokens=200,
        temperature=0.7,
        stream=True,
    )

    # Accumulate the streamed tokens as they arrive.
    accumulated_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            delta_content = chunk.choices[0].delta.content
            accumulated_response += delta_content
            print(delta_content, end="", flush=True)

    print(f"\nRequest {index+1}: Finished", flush=True)

    # Stagger the final printouts so the results stay readable.
    await asyncio.sleep(index * 0.5)
    print(f"\nResult {index + 1}: {accumulated_response}\n", flush=True)

    return accumulated_response


async def main():
    prompts = [
        "What are the symptoms of diabetes?",
        "How is diabetes diagnosed?",
        "What are the complications of hypertension?",
        "How is pneumonia treated?",
        "What are the symptoms of diabetes?",
        "How is diabetes diagnosed?",
        "What are the complications of hypertension?",
        "How is pneumonia treated?",
    ]

    # Fire all requests concurrently so vLLM can batch them on the GPU.
    tasks = [get_chat_response_streaming(prompt, i) for i, prompt in enumerate(prompts)]

    for future in asyncio.as_completed(tasks):
        await future


if __name__ == "__main__":
    asyncio.run(main())
```

Since the model is quantized with AWQ (GEMM), you should see maximum throughput with 8 concurrent requests.
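
If you want a rough number rather than just watching the streams, you can wrap the same coroutine in a timer. This is a sketch meant to be pasted into the script above (it reuses `get_chat_response_streaming`); note that the staggered sleeps in that coroutine add a little extra wall-clock time, so treat the figure as approximate.

```python
import asyncio
import time

async def timed_benchmark(prompts):
    start = time.perf_counter()
    # Launch all requests at once so vLLM can batch them.
    results = await asyncio.gather(
        *(get_chat_response_streaming(p, i) for i, p in enumerate(prompts))
    )
    elapsed = time.perf_counter() - start
    total_chars = sum(len(r) for r in results)
    print(f"\n{len(prompts)} requests in {elapsed:.1f}s "
          f"(~{total_chars / elapsed:.0f} generated chars/s across all streams)")
    return results

# Usage: inside main(), replace the as_completed loop with
#     await timed_benchmark(prompts)
```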

To use it with TGI:

```bash
docker network create tgi

docker run --name tgi-server --gpus all --network tgi \
  -p 80:80 \
  -v volume:/data \
  --env HUGGING_FACE_HUB_TOKEN=... \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Yujivus/Phi-4-Health-CoT-1.1-AWQ \
  --quantize awq
```
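
Once the container is up, a quick way to check it from the host is TGI's `/generate` endpoint. A minimal sketch assuming the port mapping above (host port 80) and the `requests` library:

```python
import requests

# Assumes the TGI container from the command above, published on host port 80.
resp = requests.post(
    "http://localhost:80/generate",
    json={
        "inputs": "How is pneumonia treated?",
        "parameters": {"max_new_tokens": 200},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```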

To use it with llama.cpp or Ollama, use the GGUF conversion: mradermacher/Phi-4-Health-CoT-1.1-GGUF
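
For a quick local test of the GGUF conversion without running a server, one option is llama-cpp-python, which can pull a file straight from that repo. This is a sketch; the quant filename pattern below is an assumption, so check the GGUF repo for the file names that actually exist.

```python
from llama_cpp import Llama

# Downloads the chosen quant from the GGUF repo via the Hugging Face Hub.
# "*Q4_K_M.gguf" is a glob pattern and assumes such a quant exists in the repo.
llm = Llama.from_pretrained(
    repo_id="mradermacher/Phi-4-Health-CoT-1.1-GGUF",
    filename="*Q4_K_M.gguf",
    n_ctx=2048,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What are the symptoms of diabetes?"}],
    max_tokens=200,
)
print(out["choices"][0]["message"]["content"])
```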

Thanks to my company, Istechsoft Software Technologies, for their support.


Model tree for Yujivus/Phi-4-Health-CoT-1.1-AWQ: base model microsoft/phi-4 (loaded as unsloth/phi-4), fine-tuned and then quantized into this model.

Dataset used to train Yujivus/Phi-4-Health-CoT-1.1-AWQ: FreedomIntelligence/medical-o1-reasoning-SFT