This model is fine-tuned on the FreedomIntelligence/medical-o1-reasoning-SFT dataset and quantized with AWQ (GEMM). The RLAIF stage is not yet completed. You can read the dataset authors' paper: https://arxiv.org/pdf/2412.18925
Training was done with Unsloth, which enables faster fine-tuning and training more parameters on the same hardware.
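A minimal sketch of how such a model can be produced with Unsloth (LoRA fine-tuning) followed by AWQ GEMM export with AutoAWQ. This is not the exact recipe used here: the base-model name, hyperparameters, dataset config ("en"), column names and prompt formatting are assumptions for illustration, and the exact SFTTrainer arguments vary across trl versions.

# Sketch only: LoRA fine-tune with Unsloth, then AWQ (GEMM) quantization with AutoAWQ.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments, AutoTokenizer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="microsoft/phi-4",      # assumed base model
    max_seq_length=2500,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train")
dataset = dataset.map(lambda ex: {
    # Assumed column names; check the dataset card for the exact schema.
    "text": f"Question: {ex['Question']}\nReasoning: {ex['Complex_CoT']}\nAnswer: {ex['Response']}"
})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2500,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()

# Merge the LoRA adapters, then quantize the merged weights to AWQ GEMM.
model.save_pretrained_merged("phi-4-health-cot-merged", tokenizer, save_method="merged_16bit")

from awq import AutoAWQForCausalLM
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
awq_model = AutoAWQForCausalLM.from_pretrained("phi-4-health-cot-merged")
awq_tokenizer = AutoTokenizer.from_pretrained("phi-4-health-cot-merged")
awq_model.quantize(awq_tokenizer, quant_config=quant_config)
awq_model.save_quantized("Phi-4-Health-CoT-1.1-AWQ")
awq_tokenizer.save_pretrained("Phi-4-Health-CoT-1.1-AWQ")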
To use it with vLLM:

docker network create vllm

docker run --runtime=nvidia --gpus all --network vllm --name vllm \
  -v vllm_cache:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=..." \
  --env "HF_HUB_ENABLE_HF_TRANSFER=0" \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model Yujivus/Phi-4-Health-CoT-1.1-AWQ \
  --quantization awq_marlin \
  --dtype float16 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 2500
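Once the container is up, you can sanity-check the OpenAI-compatible endpoint from the host with a minimal non-streaming request. This sketch assumes you are calling from the host machine, so it uses localhost:8000, matching the -p 8000:8000 mapping above; the model name and sampling parameters mirror the speed-test script below.

# Quick sanity check against vLLM's OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Yujivus/Phi-4-Health-CoT-1.1-AWQ",
    messages=[{"role": "user", "content": "What are the symptoms of diabetes?"}],
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].message.content)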
You can test vLLM's throughput with the following script:
import asyncio

from openai import AsyncOpenAI


async def get_chat_response_streaming(prompt, index):
    # "vllm" resolves inside the docker network created above;
    # use "http://localhost:8000/v1" if you run this script from the host.
    client = AsyncOpenAI(
        base_url="http://vllm:8000/v1",
        api_key="EMPTY",
    )
    messages = [
        {"role": "user", "content": prompt},
    ]
    print(f"Request {index+1}: Starting", flush=True)
    stream = await client.chat.completions.create(
        model="Yujivus/Phi-4-Health-CoT-1.1-AWQ",
        messages=messages,
        max_tokens=200,
        temperature=0.7,
        stream=True,
    )
    accumulated_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            delta_content = chunk.choices[0].delta.content
            accumulated_response += delta_content
            print(delta_content, end="", flush=True)
    print(f"\nRequest {index+1}: Finished", flush=True)
    await asyncio.sleep(index * 0.5)
    print(f"\nResult {index + 1}: {accumulated_response}\n", flush=True)
    return accumulated_response


async def main():
    prompts = [
        "What are the symptoms of diabetes?",
        "How is diabetes diagnosed?",
        "What are the complications of hypertension?",
        "How is pneumonia treated?",
        "What are the symptoms of diabetes?",
        "How is diabetes diagnosed?",
        "What are the complications of hypertension?",
        "How is pneumonia treated?",
    ]
    tasks = [get_chat_response_streaming(prompt, i) for i, prompt in enumerate(prompts)]
    for future in asyncio.as_completed(tasks):
        await future


if __name__ == "__main__":
    asyncio.run(main())
Since the model is quantized with AWQ (GEMM), you should see maximum throughput at around 8 concurrent requests.
To use it with TGI:

docker network create tgi

docker run --name tgi-server --gpus all --network tgi \
  -p 80:80 \
  -v volume:/data \
  --env HUGGING_FACE_HUB_TOKEN=... \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Yujivus/Phi-4-Health-CoT-1.1-AWQ \
  --quantize awq
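A minimal way to check the TGI server is a request to its /generate endpoint. This sketch assumes you are calling from the host and that the container's port 80 is published on host port 80 as in the mapping above; adjust the URL if you change the mapping.

# Quick check against TGI's /generate endpoint.
import requests

resp = requests.post(
    "http://localhost:80/generate",
    json={
        "inputs": "What are the symptoms of diabetes?",
        "parameters": {"max_new_tokens": 200, "temperature": 0.7},
    },
    headers={"Content-Type": "application/json"},
    timeout=120,
)
print(resp.json()["generated_text"])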
To use it with llama.cpp or Ollama, use the GGUF conversion: mradermacher/Phi-4-Health-CoT-1.1-GGUF
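For example, Ollama can pull GGUF repos directly from the Hugging Face Hub; the quantization tag below (Q4_K_M) is only an example, so pick one that actually exists in the GGUF repo:

ollama run hf.co/mradermacher/Phi-4-Health-CoT-1.1-GGUF:Q4_K_M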
Thanks to my company, Istechsoft Software Technologies, for their support.