metadata

license: cc-by-4.0
datasets:
  - allenai/c4
language:
  - en
metrics:
  - accuracy
base_model:
  - deepseek-ai/DeepSeek-R1-Distill-Llama-70B
pipeline_tag: text-generation

Overview

This document presents evaluation results for a 4-bit GPTQ-quantized version of DeepSeek-R1-Distill-Llama-70B, measured with the Language Model Evaluation Harness on the ARC-Challenge benchmark. MMLU results are reported at the end.

📊 Evaluation Summary

Metric	Value (4-bit)	Description	Value (8-bit)
Accuracy (acc,none)	`21.2%`	Raw accuracy - percentage of correct answers.	`21.2%`
Standard Error (acc_stderr,none)	`1.19%`	Uncertainty in the accuracy estimate.	`1.2%`
Normalized Accuracy (acc_norm,none)	`25.4%`	Accuracy when each answer choice's log-likelihood is normalized by its length before the best choice is picked.	`25.2%`
Standard Error (acc_norm_stderr,none)	`1.27%`	Uncertainty for normalized accuracy.	`1.3%`
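
To make the difference between acc and acc_norm concrete, here is a minimal sketch of how a multiple-choice harness can pick an answer with and without length normalization. It assumes normalization divides each choice's total log-likelihood by its character length, which is how the Language Model Evaluation Harness is commonly described. The function name, the answer strings, and the scores are hypothetical.

```python
def pick_answers(loglikelihoods, choices):
    """Return (raw_pick, normalized_pick) as indices into `choices`."""
    indices = range(len(choices))
    # acc: choose the answer with the highest total log-likelihood.
    raw_pick = max(indices, key=lambda i: loglikelihoods[i])
    # acc_norm (assumed): divide by answer length so long answers are not penalized.
    norm_pick = max(indices, key=lambda i: loglikelihoods[i] / len(choices[i]))
    return raw_pick, norm_pick


# Hypothetical scores: the longer answer has a lower total log-likelihood
# but a higher log-likelihood per character.
choices = ["gravity", "the gravitational pull of the Moon"]
loglikelihoods = [-7.0, -17.0]
raw_pick, norm_pick = pick_answers(loglikelihoods, choices)
print(raw_pick, norm_pick)
```

When answer lengths vary widely, the two picks can disagree, which is one reason acc and acc_norm differ.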

📌 Interpretation:

The model correctly answered 21.2% of the questions.
After normalization, the accuracy rises to 25.4%.
The standard errors (about 1.2 to 1.3 percentage points) correspond to a 95% confidence interval of roughly ±2.3 to ±2.5 points around each estimate.
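
The reported standard errors can be sanity-checked from the accuracy and the sample count alone. The sketch below assumes each question is an independent right-or-wrong trial and uses the 1,172 evaluated samples listed under Dataset Information. The function names are illustrative, and the n - 1 denominator is an assumption about how the harness estimates the sample variance.

```python
import math


def binomial_stderr(acc, n):
    """Standard error of a mean of n 0/1 outcomes (sample variance with n - 1)."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


def confidence_interval(acc, stderr, z=1.96):
    """Normal-approximation interval; z=1.96 gives about 95% coverage."""
    return acc - z * stderr, acc + z * stderr


n_samples = 1172
for name, acc in [("acc", 0.212), ("acc_norm", 0.254)]:
    se = binomial_stderr(acc, n_samples)
    low, high = confidence_interval(acc, se)
    print(f"{name}: stderr={se * 100:.2f}%  95% CI=[{low * 100:.1f}%, {high * 100:.1f}%]")
```

Under these assumptions the computed standard errors agree with the table (1.19% and 1.27%).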

⚙️ Model Configuration

Model: DeepSeek-R1-Distill-Llama-70B
Parameters: 70 billion
Quantization: 4-bit GPTQ
Source: Hugging Face (hf)
Precision: torch.float16
Hardware: NVIDIA A100 80GB PCIe
CUDA Version: 12.4
PyTorch Version: 2.6.0+cu124
Batch Size: 1
Evaluation Time: 365.89 seconds (~6 minutes)

📌 Interpretation:

The evaluation was performed on a high-performance GPU (A100 80GB).
The model is significantly larger than the previous 8B version; 4-bit GPTQ quantization reduces its memory footprint.
A batch size of 1 was used, which likely slowed the evaluation.
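
As a rough illustration of the memory and speed figures above, the following back-of-the-envelope sketch estimates weight storage at 16-bit versus 4-bit precision and the evaluation throughput. It deliberately ignores GPTQ overhead (scales, zero points, unquantized layers), activations, and the KV cache, so real usage will be higher. It also assumes the reported evaluation time covers exactly the 1,172 ARC-Challenge samples.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 70e9  # 70 billion parameters, as listed in the configuration
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)

# Assumption: the 365.89 s wall-clock time covers the 1,172 evaluated samples.
samples_per_second = 1172 / 365.89

print(f"fp16 weights:  {fp16_gb:.0f} GB")
print(f"4-bit weights: {int4_gb:.0f} GB")
print(f"throughput:    {samples_per_second:.2f} samples/s")
```

On this estimate, only the 4-bit weights fit within a single 80 GB A100.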

📂 Dataset Information

Dataset: AI2 ARC-Challenge
Task Type: Multiple Choice
Number of Samples Evaluated: 1,172
Few-shot Examples Used: 0 (Zero-shot setting)

📌 Interpretation:

This benchmark assesses grade-school-level scientific reasoning.
Since no few-shot examples were provided, the model was evaluated in a pure zero-shot setting.
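
For readers who want to reproduce a comparable zero-shot run, here is a hedged sketch using the Python entry point of the Language Model Evaluation Harness (`lm_eval.simple_evaluate`). The checkpoint ID is a hypothetical placeholder, since the quantized repository name is not given in this document. Extra loader options for GPTQ checkpoints depend on the installed harness version and are not shown.

```python
def build_eval_kwargs(model_id, task="arc_challenge", num_fewshot=0, batch_size=1):
    """Collect arguments mirroring the settings reported in this document."""
    return {
        "model": "hf",  # Hugging Face backend
        "model_args": f"pretrained={model_id},dtype=float16",
        "tasks": [task],
        "num_fewshot": num_fewshot,  # 0 = zero-shot
        "batch_size": batch_size,
    }


def run_evaluation(model_id):
    """Run the harness; needs `pip install lm-eval` and a GPU large enough for the model."""
    import lm_eval  # imported lazily so the helper above works without the package

    results = lm_eval.simple_evaluate(**build_eval_kwargs(model_id))
    return results["results"]["arc_challenge"]


# "your-org/your-gptq-checkpoint" is a placeholder, not a real repository.
print(build_eval_kwargs("your-org/your-gptq-checkpoint"))
```

Calling `run_evaluation(...)` with a real checkpoint ID should return a dictionary of metrics that includes keys such as `acc,none` and `acc_norm,none`.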

📈 Performance Insights

The "higher_is_better" flag confirms that higher accuracy is preferred.
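The model's raw accuracy (21.2%) is well below that of state-of-the-art models (60–80% on ARC-Challenge) and also below the roughly 25% expected from random guessing on four-option questions.
Quantization Impact: The 4-bit GPTQ quantization reduces memory usage and may affect accuracy, although the 8-bit results in the summary table (21.2% raw, 25.2% normalized) differ from the 4-bit results by less than one standard error.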
Zero-shot Limitation: Performance could improve with few-shot prompting (providing examples before testing).
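
Two quick checks help put these numbers in context: how far the raw accuracy sits from chance, and whether the 4-bit and 8-bit results differ meaningfully. The sketch below uses simple z-scores. It assumes a 25% chance baseline (most ARC questions have four options) and treats the two quantized runs as independent. Since both runs saw the same questions, the independence assumption is conservative.

```python
import math


def z_score(value, reference, stderr):
    """Number of standard errors between a measured value and a fixed reference."""
    return (value - reference) / stderr


def z_difference(a, se_a, b, se_b):
    """z-score for the difference of two estimates, assuming independence."""
    return (a - b) / math.sqrt(se_a**2 + se_b**2)


# Raw accuracy versus an assumed 25% chance baseline (values in percent).
z_chance = z_score(21.2, 25.0, 1.19)

# Normalized accuracy: 4-bit (25.4 ± 1.27) versus 8-bit (25.2 ± 1.3).
z_quant = z_difference(25.4, 1.27, 25.2, 1.3)

print(f"raw accuracy vs chance: z = {z_chance:.2f}")
print(f"4-bit vs 8-bit:         z = {z_quant:.2f}")
```

A z-score beyond about ±2 is usually read as a real difference. By that rule of thumb the raw accuracy is below chance, while the 4-bit and 8-bit normalized accuracies are statistically indistinguishable.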

📊 Detailed Evaluation on MMLU Challenges

Metric	Value	Description
MMLU	`37.88%`	Averaged over MMLU-Stem, MMLU-Social-Sciences, MMLU-Humanities, MMLU-Other
MMLU-Humanities	`31.83%`	Averaged over MMLU-Formal-Logic, MMLU-Prehistory, MMLU-World-Religions, MMLU-Philosophy, MMLU-High-School-World-History, MMLU-Professional-Law, MMLU-High-School-US-History, MMLU-Logical-Fallacies, MMLU-International-Law, MMLU-High-School-European-History, MMLU-Moral-Disputes, MMLU-Moral-Scenarios, MMLU-Jurisprudence
MMLU-Social-Sciences	`45.43%`	Averaged over MMLU-Public-Relations, MMLU-Sociology, MMLU-Security-Studies, MMLU-High-School-Government-and-Politics, MMLU-High-School-Psychology, MMLU-Human-Sexuality, MMLU-US-Foreign-Policy, MMLU-High-School-Microeconomics, MMLU-Econometrics, MMLU-High-School-Macroeconomics, MMLU-High-School-Geography, MMLU-Professional-Psychology
MMLU-Stem	`33.01%`	Averaged over MMLU-Conceptual-Physics, MMLU-High-School-Chemistry, MMLU-College-Biology, MMLU-College-Chemistry, MMLU-Machine-Learning, MMLU-Elementary-Mathematics, MMLU-Abstract-Algebra, MMLU-Astronomy, MMLU-High-School-Statistics, MMLU-Anatomy, MMLU-College-Mathematics, MMLU-Computer-Security, MMLU-College-Computer-Science, MMLU-Electrical-Engineering, MMLU-College-Physics, MMLU-High-School-Computer-Science, MMLU-High-School-Physics, MMLU-High-School-Biology, MMLU-High-School-Mathematics
MMLU-Other	`44.48%`	Averaged over MMLU-Medical-Genetics, MMLU-Global-Facts, MMLU-Marketing, MMLU-College-Medicine, MMLU-Human-Aging, MMLU-Virology, MMLU-Business-Ethics, MMLU-Clinical-Knowledge, MMLU-Professional-Medicine, MMLU-Nutrition, MMLU-Miscellaneous, MMLU-Professional-Accounting, MMLU-Management
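
The overall MMLU score is not the simple mean of the four group scores, which suggests the average is weighted by the number of questions in each group. The sketch below compares both averages. The group sizes are an assumption taken from the standard MMLU test split (14,042 questions in total) and do not appear in this document, so verify them against your own harness output.

```python
# Group accuracies (percent) from the table above.
group_acc = {
    "humanities": 31.83,
    "social_sciences": 45.43,
    "stem": 33.01,
    "other": 44.48,
}

# Assumed question counts per group in the standard MMLU test split.
group_size = {
    "humanities": 4705,
    "social_sciences": 3077,
    "stem": 3153,
    "other": 3107,
}

# Unweighted (macro) mean: every group counts equally.
macro_avg = sum(group_acc.values()) / len(group_acc)

# Size-weighted mean: every question counts equally.
total_questions = sum(group_size.values())
weighted_avg = sum(group_acc[g] * group_size[g] for g in group_acc) / total_questions

print(f"macro average:    {macro_avg:.2f}%")
print(f"weighted average: {weighted_avg:.2f}%")
```

Under these assumed sizes, the weighted average lands within about 0.01 points of the reported 37.88%, while the macro average is closer to 38.7%.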
