DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning 🚀

Welcome to the GitHub repository for DeepTheorem 🎉, a comprehensive framework for enhancing large language model (LLM) mathematical reasoning through informal, natural language-based theorem proving. This project introduces a novel approach to automated theorem proving (ATP) by leveraging the informal reasoning strengths of LLMs, moving beyond traditional formal proof systems 🌟.

Overview 📖

Theorem proving is a critical benchmark for evaluating complex reasoning in LLMs 🧠. However, formal proof systems often misalign with the informal, natural language knowledge LLMs acquire during pre-training. DeepTheorem addresses this gap by introducing:

  1. A Large-Scale Theorem Proving Dataset 📊:

    • Contains high-quality, IMO-level informal theorems and proofs across diverse mathematical domains 📚.
    • Rigorously annotated for correctness, difficulty, and topic categories ✅.
    • Includes systematically constructed verifiable theorem variants for robust evaluation 🔍.
  2. RL-Zero 🤖:

    • A novel reinforcement learning strategy tailored for informal theorem proving ⚙️.
    • Utilizes verified theorem variants to incentivize robust mathematical inference 💡.

Deeptheorem demonstrates its superior performance over both opensource and commercial models (i.e Gemini and o1)

Performance 🚀

Deeptheorem achieves the #Rank 5 position along all the commerical models and open source models.

Model FIMO HMMT Putnam Avg.(#Rank)
out. proc. out. proc. out. proc. out. proc.
Gemini2.5-Pro 57.14 54.06 57.63 49.82 64.58 58.75 59.78(#2) 54.21(#3)
o1-mini 60.32 55.23 35.59 30.90 61.46 52.88 52.46(#4) 46.34(#4)
o1 66.67 61.00 47.46 47.30 62.50 57.55 58.88(#3) 55.28(#2)
o3-mini 80.95 77.61 45.76 43.47 78.12 75.12 68.28(#1) 65.40(#1)
*DeepTheorem-RL-7B 55.56 39.07 28.81 20.85 57.29 42.20 47.22(#5) 34.04(#5)
*DeepTheorem-RL-3B 38.10 23.39 25.42 13.56 52.08 33.84 38.53 23.60
*DeepTheorem-RL-1.5B 31.75 15.23 23.73 10.15 52.08 22.79 35.85 16.06

DeepTheorem Dataset 📊

The DeepTheorem dataset comprises 121K IMO-level informal theorems and proofs spanning diverse mathematical domains 🌐. Each theorem-proof pair is rigorously annotated for:

  • o3-mini Proofs 🖋️: Ensuring mathematical accuracy through proofs generated or verified by the o3-mini model ✅.
  • Truth Value 🔍: The truth value of the theorem extracted from the o3-mini proofs, indicating whether the theorem is true or false ✔️.
  • Difficulty 🎚️: Categorized by complexity to suit various LLM capabilities 🧩.
  • Topic Categories 🗂️: Covering algebra, geometry, number theory, and more 📘.
  • Variants 🔄: Positive or negative variants of the theorem that share the same or inverse truth value of the original theorem 🔀.
Downloads last month
10
Safetensors
Model size
1.78B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Jiahao004/DeepTheorem-qwen-1.5b-rl

Quantizations
2 models

Collection including Jiahao004/DeepTheorem-qwen-1.5b-rl