Hello, and thanks for sharing your model. We have some questions we'd like answered.
We are curious about the difference between your work and HuatuoGPT-o1. It appears that you simply repeated their method, or merely added TTT on top of it.
On ANESBENCH, FineMedLM achieved nearly the worst performance, which makes us question the evaluation results in your paper. Since you adopted HuatuoGPT-o1's method and report outperforming it, such performance is puzzling.
We strongly doubt the authenticity of your training process and evaluation results. It looks as if you generated data similar to the test set and used it primarily for SFT.
We wonder whether other peers have the same questions.
If you have any records of the training process, could you share them?
Looking forward to your reply.
Best wishes.
Thank you very much for your interest in our work!
Firstly, FineMedLM-o1 does not simply repeat HuatuoGPT-o1's method.
HuatuoGPT-o1 uses MedQA and MedMCQA as seed data and applies MCTS to obtain synthetic data; it is then trained through SFT and PPO on reasoning data.
In contrast, the seed data for FineMedLM-o1 comes from Common Crawl (CC). We use our proposed synthesis method to convert text data into QA pairs while maintaining both quality and complexity. FineMedLM-o1 is then trained through SFT on non-reasoning data to acquire medical chat ability, and through DPO on reasoning data to incentivize reasoning ability. TTT is used to further enhance reasoning ability.
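For readers less familiar with the DPO stage mentioned above, the per-pair preference objective can be sketched as follows. This is a minimal illustration in plain Python, not our actual training code; the log-probability values and the β setting below are made-up placeholders.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed token log-probability of the chosen or
    rejected response under the policy model or the frozen reference model.
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) prefers the chosen response over the rejected one.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): shrinks as the policy learns the preference.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative numbers only: here the policy already favors the chosen
# answer, so the loss falls below log(2), its value at zero margin.
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0, beta=0.1)
```

At zero margin the loss equals log(2); training pushes the margin up, driving the loss toward zero on reasoning preference pairs.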
FineMedLM-o1 focuses on how to obtain medical synthetic data from knowledge-rich CC to incentivize the model's reasoning ability, and investigates how curriculum learning and TTT influence an LLM's medical reasoning.
As for the performance issues on ANESBENCH, we are also investigating the reasons for the unsatisfactory results. If we have any useful findings, we will share them with you. Please look forward to our follow-up work!
Thank you again for your questions. We believe that only through mutual discussion can we make progress together.
Best wishes.