Issues faced in reproducing the paper's experiments
Very interesting work! I am currently trying to reproduce the experimental results from your paper. However, I have encountered two issues:
- The generated text tends to have severe repetition.
- The model's accuracy on MATH problems (GSM8K dataset) is significantly lower than the reported results in the paper.
I would like to ask whether this discrepancy might be due to the checkpoint used or specific hyperparameter settings (e.g., temperature). Would it be possible to share the exact hyperparameter configurations used in the paper? Thanks!
Hi, are your issues with MATH or with GSM8k? Some more details on GSM8k can be found here: https://huggingface.co/tomg-group-umd/huginn-0125/discussions/7#67b59e08b24bf87803b701b6
Regarding repetition, this has not been a big problem for me, are you using the model as a text completion model, or with the chat template?
Thank you for your response and reminder! I realized that I was using text completion instead of chat templating, which resulted in a lot of repetition. I will try using the lm-eval harness for evaluation to see if I can reproduce the results successfully. Thanks again!
Sure! let me know how it goes, or if there are followup questions.