tokyotech-llm/Swallow-MS-7b-instruct-v0.1

stjohn2007 committed on Apr 26, 2024

Commit efe0174 · verified · 1 Parent(s): e3dc340

Update README.md

Update the explanation of MTBench

Files changed (1)

README.md +3 -1

@@ -44,12 +44,14 @@ This repository provides large language models developed by [TokyoTech-LLM](http
 ### MT-Bench JA
 We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the instruction-following capabilities of models.
-We utilized the following artifacts:
 - Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
 - Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
 - Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
 - Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
 ## Usage

 ### MT-Bench JA
 We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the instruction-following capabilities of models.
+We utilized the following settings:
 - Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
 - Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
 - Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
 - Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
+- Judge: `gpt-4-1106-preview`
+- Scoring: Absolute scale normalized to a 0-1 range, averaged over five runs.
 ## Usage
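
Reviewer note on the "Scoring" line in this change: the README says only that scores are on an absolute scale, normalized to a 0-1 range, and averaged over five runs. The sketch below is one plausible reading, not the authors' evaluation code. It assumes the judge returns scores on MT-Bench's usual 1-10 scale and that normalization means dividing by 10. The function names (`normalize_run`, `mtbench_ja_score`) and the `scale_max` parameter are hypothetical.

```python
from statistics import mean


def normalize_run(judge_scores, scale_max=10.0):
    """Average one run's judge scores and map the result into the 0-1 range.

    Assumption: judge scores lie on a 1-10 scale, so dividing by
    scale_max=10 gives the normalized value.
    """
    if not judge_scores:
        raise ValueError("a run must contain at least one judge score")
    return mean(judge_scores) / scale_max


def mtbench_ja_score(runs, scale_max=10.0):
    """Average the normalized per-run scores over all runs (five in the README)."""
    if not runs:
        raise ValueError("at least one run is required")
    return mean(normalize_run(run, scale_max) for run in runs)


if __name__ == "__main__":
    # Made-up judge scores for five runs, for illustration only.
    example_runs = [[8, 6], [7, 7], [9, 5], [6, 8], [10, 4]]
    print(f"normalized score: {mtbench_ja_score(example_runs):.3f}")
```

Averaging within each run first and then across runs gives every run equal weight, which matters only if runs contain different numbers of scored answers.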