NCSOFT/Llama-3-OffsetBias-RM-8B · About the output of the sample code

Jan 6

The sample code gives an output of reward = [-6.5], what is the numerical significance of this. What is the range of outputs for this model?

syhuggingface

NCSOFT org 3 days ago

•

edited 3 days ago

Hello,
Thank you for the inquiry.

The reward model produces unbounded output values. It generates logits based on the input chat and calculates the reward score from the last token of the provided input, which may consist of multiple turns.

Significance of Reward Values

A larger numerical value typically indicates a highly desirable or correct response based on the context. Conversely, a lower reward suggests a less optimal or poor response.

# For clarity, I have appended an additional assistant response to the chat in the code snippet.
chat_good_response = [
 {"role": "user", "content": "Hello, how are you?"},
 {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
 {"role": "user", "content": "I'd like to show how chat templating works!"},
 {"role": "assistant", "content": "That sounds great! Chat templating can really help structure conversations effectively. How would you like to demonstrate it?"},
]
# This response gets a higher score than the one below.
# rewards: [-5.4375]

chat_bad_response = [
 {"role": "user", "content": "Hello, how are you?"},
 {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
 {"role": "user", "content": "I'd like to show how chat templating works!"},
 {"role": "assistant", "content": "The elephant was eating grass."}
]
# rewards: [-6.53125]

However, it is important to note that this value is not absolute but relative to the given context.
Therefore, you should compare different responses within the same context instead of interpreting the score in isolation.
For example, the sentence "The elephant was eating grass." may receive different reward scores depending on the context; it will score higher in a context where this sentence is relevant.

# The same sentence may receive different scores depending on the context.
chat_different_context = [
 {"role": "user", "content": "Have you ever been to a safari?"},
 {"role": "assistant", "content": "Yeah, I saw a lot of animals, and there were tons of elephants."},
 {"role": "user", "content": "What were the elephants usually doing?"},
 {"role": "assistant", "content": "The elephants were eating grass."}
]
# In this context, the score is higher than the one above because the context and the last response are more natural.
# rewards: [-6.15625]

Please feel free to leave a comment if you have any further questions.

Best regards,
Seungyeon

Tomie0506

3 days ago

Thank you very much for your answer!