aiplanet/LuxLlama

@@ -129,40 +129,40 @@ The fine-tuning dataset was compiled from the following sources:
 | Category                           | Score |
 | :--------------------------------- | :---- |
-| translation_to_english             | 4.19  |
-| reading_comprehension              | 4.14  |
-| verb_conjugation                   | 4.11  |
-| multiple_choice                    | 4.08  |
-| translation_from_english           | 4.07  |
-| translation                        | 4.00  |
-| listening_comprehension_simulation | 3.98  |
-| conversation                       | 3.79  |
-| word_order                         | 3.79  |
-| cultural_knowledge                 | 3.76  |
-| writing_prompt                     | 3.74  |
-| grammar                            | 3.44  |
-| idioms_and_expressions             | 3.37  |
-| sentence_completion                | 3.11  |
-| spelling_and_pronunciation         | 3.04  |
-| vocabulary                         | 3.00  |
 **Scores by Difficulty:**
 | Difficulty   | Score |
 | :----------- | :---- |
-| beginner     | 3.93  |
-| intermediate | 3.69  |
-| advanced     | 3.68  |
-| native       | 3.57  |
 **Comparative Performance:**
 | Model                     | Overall Score (LUXELLA) |
 | :------------------------ | :---------------------- |
-| **LuxLlama (Ours)**       | **3.73 / 5.0**          |
-| gemma2-9b-it              | 3.07 / 5.0              |
-| llama-3.1-8b-instant      | 2.46 / 5.0              |
-| mixtral-8x7b-32768        | 2.44 / 5.0              |
 **Summary:** LuxLlama achieves the highest overall score on the LUXELLA benchmark among the four models tested, ahead of gemma2-9b-it, llama-3.1-8b-instant, and mixtral-8x7b-32768. Its strongest categories are translation, reading comprehension, and verb conjugation. Vocabulary, spelling and pronunciation, sentence completion, and idioms score lowest, indicating room for improvement in capturing finer linguistic nuances. Scores are highest on beginner-level tasks and decline as difficulty increases, which is consistent with the benchmark being sensitive to difficulty. Sample high-performing questions show correct handling of cultural knowledge, spelling, and advanced verb conjugation, while low-performing samples highlight challenges with specific grammar rules (Konjunktiv II usage), subtle distinctions in vocabulary (Niess vs Kusinn), and standard word order conventions.

 | Category                           | Score |
 | :--------------------------------- | :---- |
+| translation_to_english             | 83.8  |
+| reading_comprehension              | 82.8  |
+| verb_conjugation                   | 82.2  |
+| multiple_choice                    | 81.6  |
+| translation_from_english           | 81.4  |
+| translation                        | 80.0  |
+| listening_comprehension_simulation | 79.6  |
+| conversation                       | 75.8  |
+| word_order                         | 75.8  |
+| cultural_knowledge                 | 75.2  |
+| writing_prompt                     | 74.8  |
+| grammar                            | 68.8  |
+| idioms_and_expressions             | 67.4  |
+| sentence_completion                | 62.2  |
+| spelling_and_pronunciation         | 60.8  |
+| vocabulary                         | 60.0  |
 **Scores by Difficulty:**
 | Difficulty   | Score |
 | :----------- | :---- |
+| beginner     | 78.6  |
+| intermediate | 73.8  |
+| advanced     | 73.6  |
+| native       | 71.4  |
 **Comparative Performance:**
 | Model                     | Overall Score (LUXELLA) |
 | :------------------------ | :---------------------- |
+| **LuxLlama (Ours)**       | **74.6**                |
+| gemma2-9b-it              | 61.4                    |
+| llama-3.1-8b-instant      | 49.2                    |
+| mixtral-8x7b-32768        | 48.8                    |
 **Summary:** LuxLlama achieves the highest overall score on the LUXELLA benchmark among the four models tested, ahead of gemma2-9b-it, llama-3.1-8b-instant, and mixtral-8x7b-32768. Its strongest categories are translation, reading comprehension, and verb conjugation. Vocabulary, spelling and pronunciation, sentence completion, and idioms score lowest, indicating room for improvement in capturing finer linguistic nuances. Scores are highest on beginner-level tasks and decline as difficulty increases, which is consistent with the benchmark being sensitive to difficulty. Sample high-performing questions show correct handling of cultural knowledge, spelling, and advanced verb conjugation, while low-performing samples highlight challenges with specific grammar rules (Konjunktiv II usage), subtle distinctions in vocabulary (Niess vs Kusinn), and standard word order conventions.
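
The change above replaces scores reported on a 5-point scale with values on what appears to be a 100-point scale. The diff does not state the conversion, but every new value equals the old value multiplied by 20. The following is a minimal sketch of that rescaling, assuming a linear mapping from 0-5 to 0-100; the function name `rescale` and the constants are hypothetical and not taken from the repository.

```python
# Assumption: the updated tables are the old 0-5 scores mapped linearly to 0-100.
# `rescale`, OLD_MAX and NEW_MAX are illustrative names, not part of LuxLlama.
OLD_MAX = 5.0
NEW_MAX = 100.0


def rescale(score, old_max=OLD_MAX, new_max=NEW_MAX):
    """Map a score from [0, old_max] to [0, new_max], rounded to one decimal."""
    return round(score * new_max / old_max, 1)


# Overall scores from the removed (old) comparative table.
old_scores = {
    "LuxLlama": 3.73,
    "gemma2-9b-it": 3.07,
    "llama-3.1-8b-instant": 2.46,
    "mixtral-8x7b-32768": 2.44,
}

new_scores = {name: rescale(value) for name, value in old_scores.items()}

for name, value in new_scores.items():
    print(f"{name}: {value}")
```

Rounding to one decimal keeps the output in the same format as the table entries and avoids floating-point noise in the product.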