Simple question
Installing and running this on an M1 Mac
I am a physician with some knowledge of Python, the terminal, and some exposure to Hugging Face. One of my laptops has an Intel chip with integrated Intel UHD graphics as well as an NVIDIA MX150 GPU, and 16 GB of RAM. My other laptop is an M1 Mac with 16 GB of RAM. I have installed models in GGUF format before.
How can I get this model installed? Is there a step-by-step instruction sheet? My limited experience suggests doing this:
1) download llamafile from https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.9.3
2) rename it to llamafile.exe
3) go to huggingface.co and select GGUF as the filter
4) search for medreason -> finds nothing
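(For what it's worth, I assume the same search could be done from Python with the huggingface_hub library; an untested sketch, just to show what I was looking for:)

# Hypothetical check for GGUF versions of MedReason on the Hugging Face Hub
from huggingface_hub import list_models

for m in list_models(search="medreason"):
    if "gguf" in (m.tags or []):
        print(m.id)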
The README on the MedReason GitHub says:
"MedReason-8B can be deployed with tools like vLLM or SGLang." How do I do this?
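From skimming the vLLM docs, I'm guessing that means something roughly like this in Python (untested on my machines; the model id UCSC-VLAA/MedReason-8B is my guess from the repo name, and I don't know whether vLLM even runs on an M1 Mac):

# Hypothetical sketch of the vLLM Python API; model id and hardware support are my assumptions
from vllm import LLM, SamplingParams

llm = LLM(model="UCSC-VLAA/MedReason-8B")  # model id is my guess
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["How does aspirin reduce the risk of stroke?"], params)
print(outputs[0].outputs[0].text)

Is that the intended workflow, or is vLLM only meant for GPU servers?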
ChatGPT tells me to do this:
1) Install the following libraries:
torch: PyTorch supports Apple Silicon via the MPS backend (Metal Performance Shaders).
transformers: Hugging Face's transformers library, which allows you to easily work with pre-trained models.
Run the following in your terminal:
pip install torch  # the standard macOS wheel includes MPS support on Apple Silicon
pip install transformers
pip install accelerate
2) Download the Model
You mentioned the Hugging Face page for the model: UCSC-VLAA/MedReason-Llama.
To download the model:
Go to the page above and click on the "Use in transformers" button.
You’ll see a code snippet like this:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "UCSC-VLAA/MedReason-Llama"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
You can also use git to clone the model or directly load it from Hugging Face. In this case, the transformers library will handle downloading and caching the model for you.
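If you want to fetch the files explicitly ahead of time, something like this with the huggingface_hub library should work (optional; it populates the same local cache that transformers reads from):

# Optional: pre-download the model files into the local Hugging Face cache
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="UCSC-VLAA/MedReason-Llama")
print("Model files cached at:", local_path)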
3) Set Up the Model in Code
Here's a Python script to load the model and run an inference on your M1 Mac:
Open a text editor and create a Python script, say run_medreason.py.
Copy the following code into your script:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from Hugging Face
# (half precision keeps the 8B model closer to fitting in 16 GB of RAM)
model_name = "UCSC-VLAA/MedReason-Llama"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set device to MPS (Metal Performance Shaders) if available, else fall back to CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device}")

# Move the model to the device
model.to(device)

# Define the prompt
prompt = "What is the latest development in cancer research?"

# Tokenize the input prompt
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate output
outputs = model.generate(inputs["input_ids"], max_length=150, num_return_sequences=1)

# Decode and print the generated response
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Text: ", output_text)
4) Run the Script
Once you’ve saved the script as run_medreason.py, follow these steps to run it:
Open Terminal.
Navigate to the directory where your script is located.
Run the script:
python3 run_medreason.py
This should load the MedReason-Llama model, generate a response based on your input prompt, and print it to the terminal.
This approach requires me to hard-code the query. Isn't it possible to have a GGUF model that allows an easy install and a simple prompt where I can type in any question?
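What I'm imagining is something like the sketch below using the llama-cpp-python package; this is untested and assumes a GGUF conversion of MedReason exists (or that I can make one myself), so the file name is made up:

# Hypothetical interactive loop with llama-cpp-python; the GGUF file name is made up
from llama_cpp import Llama

llm = Llama(
    model_path="medreason-8b-Q4_K_M.gguf",  # assumes a GGUF conversion is available
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to the M1 GPU via Metal
)

while True:
    question = input("Question (blank line to quit): ").strip()
    if not question:
        break
    reply = llm.create_chat_completion(
        messages=[{"role": "user", "content": question}],
        max_tokens=512,
    )
    print(reply["choices"][0]["message"]["content"])

Or is the simplest route just to wrap the transformers script above in an input() loop?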