This model is fine-tuned on the FreedomIntelligence/medical-o1-reasoning-SFT dataset and quantized with AWQ (GEMM). The RLAIF stage is not yet completed. You can read the dataset authors' paper: https://arxiv.org/pdf/2412.18925
Training was done with Unsloth, which enables faster fine-tuning and training more parameters on the same hardware.
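A minimal sketch of how such a model can be produced with Unsloth (LoRA fine-tuning) followed by AWQ GEMM export with AutoAWQ. This is not the exact recipe used here: the base-model name, hyperparameters, dataset config ("en"), column names and prompt formatting are assumptions for illustration, and the exact SFTTrainer arguments vary across trl versions.

# Sketch only: LoRA fine-tune with Unsloth, then AWQ (GEMM) quantization with AutoAWQ.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments, AutoTokenizer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="microsoft/phi-4",      # assumed base model
    max_seq_length=2500,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train")
dataset = dataset.map(lambda ex: {
    # Assumed column names; check the dataset card for the exact schema.
    "text": f"Question: {ex['Question']}\nReasoning: {ex['Complex_CoT']}\nAnswer: {ex['Response']}"
})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2500,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()

# Merge the LoRA adapters, then quantize the merged weights to AWQ GEMM.
model.save_pretrained_merged("phi-4-health-cot-merged", tokenizer, save_method="merged_16bit")

from awq import AutoAWQForCausalLM
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
awq_model = AutoAWQForCausalLM.from_pretrained("phi-4-health-cot-merged")
awq_tokenizer = AutoTokenizer.from_pretrained("phi-4-health-cot-merged")
awq_model.quantize(awq_tokenizer, quant_config=quant_config)
awq_model.save_quantized("Phi-4-Health-CoT-1.1-AWQ")
awq_tokenizer.save_pretrained("Phi-4-Health-CoT-1.1-AWQ")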
To use it with vLLM:

docker network create vllm

docker run --runtime=nvidia --gpus all --network vllm --name vllm \
  -v vllm_cache:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=..." \
  --env "HF_HUB_ENABLE_HF_TRANSFER=0" \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model Yujivus/Phi-4-Health-CoT-1.1-AWQ \
  --quantization awq_marlin \
  --dtype float16 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 2500
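Once the container is up, you can sanity-check the OpenAI-compatible endpoint from the host with a minimal non-streaming request. This sketch assumes you are calling from the host machine, so it uses localhost:8000, matching the -p 8000:8000 mapping above; the model name and sampling parameters mirror the speed-test script below.

# Quick sanity check against vLLM's OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Yujivus/Phi-4-Health-CoT-1.1-AWQ",
    messages=[{"role": "user", "content": "What are the symptoms of diabetes?"}],
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].message.content)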
You can test vLLM's throughput with the following script:
import asyncio

from openai import AsyncOpenAI


async def get_chat_response_streaming(prompt, index):
    # "vllm" resolves inside the docker network created above;
    # use "http://localhost:8000/v1" if you run this script from the host.
    client = AsyncOpenAI(
        base_url="http://vllm:8000/v1",
        api_key="EMPTY",
    )
    messages = [
        {"role": "user", "content": prompt},
    ]
    print(f"Request {index+1}: Starting", flush=True)
    stream = await client.chat.completions.create(
        model="Yujivus/Phi-4-Health-CoT-1.1-AWQ",
        messages=messages,
        max_tokens=200,
        temperature=0.7,
        stream=True,
    )
    accumulated_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            delta_content = chunk.choices[0].delta.content
            accumulated_response += delta_content
            print(delta_content, end="", flush=True)
    print(f"\nRequest {index+1}: Finished", flush=True)
    await asyncio.sleep(index * 0.5)
    print(f"\nResult {index + 1}: {accumulated_response}\n", flush=True)
    return accumulated_response


async def main():
    prompts = [
        "What are the symptoms of diabetes?",
        "How is diabetes diagnosed?",
        "What are the complications of hypertension?",
        "How is pneumonia treated?",
        "What are the symptoms of diabetes?",
        "How is diabetes diagnosed?",
        "What are the complications of hypertension?",
        "How is pneumonia treated?",
    ]
    tasks = [get_chat_response_streaming(prompt, i) for i, prompt in enumerate(prompts)]
    for future in asyncio.as_completed(tasks):
        await future


if __name__ == "__main__":
    asyncio.run(main())
Since the model is quantized with AWQ (GEMM), you should see maximum throughput at around 8 concurrent requests.
To use it with TGI:

docker network create tgi

docker run --name tgi-server --gpus all --network tgi \
  -p 80:80 \
  -v volume:/data \
  --env HUGGING_FACE_HUB_TOKEN=... \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Yujivus/Phi-4-Health-CoT-1.1-AWQ \
  --quantize awq
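A minimal way to check the TGI server is a request to its /generate endpoint. This sketch assumes you are calling from the host and that the container's port 80 is published on host port 80 as in the mapping above; adjust the URL if you change the mapping.

# Quick check against TGI's /generate endpoint.
import requests

resp = requests.post(
    "http://localhost:80/generate",
    json={
        "inputs": "What are the symptoms of diabetes?",
        "parameters": {"max_new_tokens": 200, "temperature": 0.7},
    },
    headers={"Content-Type": "application/json"},
    timeout=120,
)
print(resp.json()["generated_text"])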
To use it with llama.cpp or Ollama, use the GGUF conversion: mradermacher/Phi-4-Health-CoT-1.1-GGUF
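For example, Ollama can pull GGUF repos directly from the Hugging Face Hub; the quantization tag below (Q4_K_M) is only an example, so pick one that actually exists in the GGUF repo:

ollama run hf.co/mradermacher/Phi-4-Health-CoT-1.1-GGUF:Q4_K_M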
Thanks to my company, Istechsoft Software Technologies, for their support.