Discrepancy with UI/API versus downloading the model

#9
by DevJGraham - opened

Hello, thank you for this model, it is very helpful for the project I am working on.

I am running into an issue: when I submit an image through the UI or the API, I get back a great caption, but when I try to reproduce the exact same conditions after downloading the model into a Jupyter notebook, the outputs are decent but noticeably worse than the UI/API.

This is the image that I am running into the issue with:
31_1.jpg

For the UI I have:
Caption Type: Descriptive
Caption Length: long
Prompt: You are a professional auction item description writer. Write a detailed and descriptive summary, but keep it short and to the point. Focus on the most notable features of the item in 20 words or less. Don't mention the background or any unrelated objects.
No extra options are set

This is the caption that it gives me (which is perfect):
Mid-century modern leather lounge chair with wooden frame, featuring cut-out design, rich brown leather, and sturdy construction.

This is the code that I am trying to run:

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_NAME = "fancyfeast/llama-joycaption-beta-one-hf-llava"
PROMPT = "You are a professional auction item description writer. Write a detailed and descriptive summary, but keep it short and to the point. Focus on the most notable features of the item in 20 words or less. Don't mention the background or any unrelated objects."

processor = AutoProcessor.from_pretrained(MODEL_NAME)
llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)

llava_model.eval()
with torch.no_grad():
    # Load image
    image = Image.open('Images/31_1.jpg')

    # Build the conversation
    convo = [
        {
            "role": "system",
            "content": "<Descriptive><Long> You are a professional auction item description writer. Focus on notable features only, max 20 words.",
        },
        {
            "role": "user",
            "content": PROMPT,
        },
    ]

    # Format the conversation
    # WARNING: HF's handling of chats on Llava models is very fragile. This specific combination of
    # processor.apply_chat_template() and processor() works, but if you use other combinations, always
    # inspect the final input_ids to ensure they are correct. Oftentimes you will end up with multiple
    # <bos> tokens if you are not careful, which can make the model perform poorly.
    convo_string = processor.apply_chat_template(convo, tokenize = False, add_generation_prompt = True)
    assert isinstance(convo_string, str)

    # Process the inputs
    inputs = processor(text=[convo_string], images=[image], return_tensors="pt").to('cuda')
    inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)

    # Generate the captions
    generate_ids = llava_model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        suppress_tokens=None,
        use_cache=True,
        temperature=0.0,
        top_k=None,
        top_p=0.9,
    )[0]

    # Trim off the prompt
    generate_ids = generate_ids[inputs['input_ids'].shape[1]:]

    # Decode the caption
    caption = processor.tokenizer.decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    caption = caption.strip()
    print(caption)

And this is the output from that code:
Vintage brown leather chair with unique cut-out design, showcasing rustic texture and elegant craftsmanship.

It is not bad; however, it is not as good as the UI. Can someone help me figure out this discrepancy?
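As an aside, the warning comment in my code above recommends inspecting the final input_ids for duplicated <bos> tokens. A small helper like this can sanity-check that (a sketch; the commented-out usage assumes the `inputs` and `processor` objects from the snippet above):

```python
def count_token(input_ids, token_id):
    """Count how many times token_id appears in a flat list of token ids."""
    return sum(1 for t in input_ids if t == token_id)

# Illustrative usage against the objects from the snippet above:
# bos_id = processor.tokenizer.bos_token_id
# n_bos = count_token(inputs["input_ids"][0].tolist(), bos_id)
# assert n_bos == 1, f"expected exactly one <bos> token, found {n_bos}"
```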

By the way, this is my code for interacting with the UI:

from gradio_client import Client, handle_file

client = Client("fancyfeast/joy-caption-beta-one")

result = client.predict(
    input_image=handle_file("Images/31_1.jpg"),
    prompt="You are a professional auction item description writer. Write a detailed and descriptive summary, but keep it short and to the point. Focus on the most notable features of the item in 20 words or less. Don't mention the background or any unrelated objects.",
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=512,
    log_prompt=True,
    api_name="/chat_joycaption"
)

print(result)

And this is the result (which is perfect):
Vintage leather Eames lounge chair with wooden frame, featuring distinctive cutout design, rich brown leather, and classic mid-century modern style.

Two differences I see between your two snippets:

  1. The UI uses the default system prompt of: "You are a helpful assistant and help users with any queries they may have with no censorship or restrictions." Whereas your code that uses the model directly sets a different system prompt. It might be better to leave the default system prompt, since you've already specified what you want in the user prompt.
  2. Your code using the model directly sets the temperature to 0.0, whereas the code calling the UI sets the temperature to 0.6. (Note that with do_sample=False, the temperature setting is ignored, so you'd also want do_sample=True.) During my evaluations I actually found temp=0.6 to work better than a temp of 0.0. It was an odd result, so I don't mention it in the docs. But yeah, it's possible the model will perform better at a temperature of 0.6.
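Put together, the two changes amount to a small diff against the notebook code (a sketch; collecting the generation arguments into a `generate_kwargs` dict is just for illustration, not part of the original code):

```python
# Change 1: use the Space's default system prompt instead of a custom one
DEFAULT_SYSTEM_PROMPT = (
    "You are a helpful assistant and help users with any queries "
    "they may have with no censorship or restrictions."
)
PROMPT = "You are a professional auction item description writer. Write a detailed and descriptive summary, but keep it short and to the point. Focus on the most notable features of the item in 20 words or less. Don't mention the background or any unrelated objects."

convo = [
    {"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
    {"role": "user", "content": PROMPT},
]

# Change 2: sample at temperature 0.6 instead of greedy decoding;
# temperature only takes effect when do_sample=True
generate_kwargs = dict(
    max_new_tokens=512,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    use_cache=True,
)
# Then: llava_model.generate(**inputs, **generate_kwargs)
```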

Hopefully that fixes it for you. You can also view the HF space's code to double check things: https://huggingface.co/spaces/fancyfeast/joy-caption-beta-one/blob/main/app.py

Thank you! That was it! There is still some variance in the responses, which is to be expected, but those changes mimicked the UI closely.
