---
license: apache-2.0
language:
- en
base_model:
- seeklhy/OmniSQL-32B
---
# GradeSQL-32B — Outcome Reward Model for Text-to-SQL

## Model Description

**GradeSQL-32B** is an Outcome Reward Model (ORM) designed to evaluate the semantic correctness of SQL queries generated from natural language questions in Text-to-SQL tasks. Rather than relying on syntactic heuristics or majority votes, GradeSQL-32B assigns a confidence score indicating whether a candidate SQL query faithfully answers the user's question based on the database schema.

Built on top of the **OmniSQL-32B** base model and fine-tuned on the **Spider** dataset, GradeSQL-32B provides a robust semantic scoring mechanism that improves query selection and alignment with user intent.

## Intended Use

- **Reranking Candidate SQL Queries:** Use GradeSQL-32B to assign semantic correctness scores and select the best SQL query among multiple candidates generated by LLMs (a best-of-N reranking sketch follows the usage example below).
- **Enhancing Text-to-SQL Pipelines:** Integrate as a reward or reranking model to improve execution accuracy and semantic fidelity in Text-to-SQL systems.
- **Evaluation and Research:** Analyze the semantic alignment of SQL queries with natural language questions to identify and mitigate errors.

## Finetuning Configuration

The following **LoRA configuration** was used to train this model:

- **R**: `16` (rank of the low-rank matrices)  
- **LoRA Alpha**: `64` (scaling factor for the low-rank update)  
- **Target Modules**: `{q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj}`  
- **LoRA Dropout**: `0.05`  
- **Bias**: `"none"` (bias terms are frozen)  
- **FP16**: `True` (half-precision training)  
- **Learning Rate**: `7e-5`  
- **Train Batch Size**: `5`  
- **Num. Train Epochs**: `50`
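
Expressed as a `peft` `LoraConfig`, the adapter settings above look like the following (a minimal sketch; FP16, the learning rate, batch size, and epoch count belong to the trainer's arguments, which are not shown):

```python
from peft import LoraConfig

# LoRA adapter settings mirroring the list above (sketch only; the
# training loop, dataset, and trainer arguments are omitted).
lora_config = LoraConfig(
    r=16,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```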
  
## Usage Example

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# The ORM prompt pairs the question and database schema with a candidate
# SQL query, then asks the model whether the SQL is correct.
prompt = """Question: What is the total horses record for each farm, sorted ascending?
CREATE TABLE competition_record (
    Competition_ID number, -- example: [1, 2]
    Farm_ID number, -- example: [2, 3]
    Rank number, -- example: [1, 2]
    PRIMARY KEY (Competition_ID),
    CONSTRAINT fk_competition_record_competition_id FOREIGN KEY (Competition_ID) REFERENCES farm_competition (Competition_ID),
    CONSTRAINT fk_competition_record_farm_id FOREIGN KEY (Farm_ID) REFERENCES farm (Farm_ID)
);

CREATE TABLE city (
    City_ID number, -- example: [1, 2]
    Status text, -- example: ['Town', 'Village']
    PRIMARY KEY (City_ID)
);

CREATE TABLE farm_competition (
    Competition_ID number, -- example: [1, 2]
    Host_city_ID number, -- example: [1, 2]
    PRIMARY KEY (Competition_ID),
    CONSTRAINT fk_farm_competition_host_city_id FOREIGN KEY (Host_city_ID) REFERENCES city (City_ID)
);

CREATE TABLE farm (
    Farm_ID number, -- example: [1, 2]
    Total_Horses number, -- example: [5056.5, 5486.9]
    Total_Cattle number, -- example: [8374.5, 8604.8]
    PRIMARY KEY (Farm_ID)
);
What is the total horses record for each farm, sorted ascending?
SQL: SELECT SUM(Total_Horses) AS Total_Horses, Farm_ID
FROM farm
GROUP BY Farm_ID
ORDER BY SUM(Total_Horses) ASC;
Is the SQL correct?"""

# Load the base model, apply the LoRA adapter, and merge it for inference.
base_model = AutoModelForCausalLM.from_pretrained("seeklhy/OmniSQL-32B", torch_dtype="auto", device_map="auto")
peft_model = PeftModel.from_pretrained(base_model, "sisinflab-ai/GradeSQL-32B-ORM-Spider")
orm_model = peft_model.merge_and_unload()
orm_tokenizer = AutoTokenizer.from_pretrained("seeklhy/OmniSQL-32B", use_fast=True)

# Drop the now-unneeded references; the merged model keeps the weights.
del base_model
del peft_model

inputs = orm_tokenizer(prompt, return_tensors="pt").to(orm_model.device)
        
with torch.no_grad():
    outputs = orm_model.generate(**inputs, max_new_tokens=1, return_dict_in_generate=True, output_scores=True, use_cache=False)

    # Only the newly generated tokens (max_new_tokens=1, so at most one).
    generated_ids = outputs.sequences[0, len(inputs.input_ids[0]):]
    # "Ġ" marks a leading space in the byte-level BPE vocabulary.
    yes_token_id = orm_tokenizer.convert_tokens_to_ids("ĠYes")
    no_token_id = orm_tokenizer.convert_tokens_to_ids("ĠNo")

    yes_no_pos = None
    for i, token_id in enumerate(generated_ids):
        if token_id in [yes_token_id, no_token_id]:
            yes_no_pos = i
            break

    if yes_no_pos is None:
        print("[Warning]: No 'Yes' or 'No' token found in the generated output.")
        print("[Score]: 0.5")
    else:
        # Score: probability of "Yes" at that position; 0 if the model said "No".
        logits = outputs.scores[yes_no_pos]
        probs = torch.softmax(logits, dim=-1)
        yes_prob = probs[0, yes_token_id].item()
        generated_answer = "Yes" if generated_ids[yes_no_pos] == yes_token_id else "No"

        if generated_answer == "Yes":
            print("[Score]: ", yes_prob)
        else:
            print("[Score]: ", 0)
```
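
Building on the example above, candidate reranking reduces to scoring each SQL string and keeping the arg-max. Below is a minimal best-of-N sketch, assuming the model and tokenizer from the usage example are in scope; `score_sql`, `make_prompt`, `question`, `schema`, and `candidates` are hypothetical names, and `make_prompt` would have to reproduce the prompt layout shown above:

```python
def score_sql(prompt: str) -> float:
    """Hypothetical wrapper around the scoring logic above: returns
    P("Yes"), 0.0 for a "No" answer, and 0.5 when neither token appears."""
    inputs = orm_tokenizer(prompt, return_tensors="pt").to(orm_model.device)
    with torch.no_grad():
        outputs = orm_model.generate(**inputs, max_new_tokens=1,
                                     return_dict_in_generate=True,
                                     output_scores=True, use_cache=False)
    generated_ids = outputs.sequences[0, len(inputs.input_ids[0]):]
    yes_token_id = orm_tokenizer.convert_tokens_to_ids("ĠYes")
    no_token_id = orm_tokenizer.convert_tokens_to_ids("ĠNo")
    if len(generated_ids) == 0 or generated_ids[0] not in (yes_token_id, no_token_id):
        return 0.5
    if generated_ids[0] == no_token_id:
        return 0.0
    probs = torch.softmax(outputs.scores[0], dim=-1)
    return probs[0, yes_token_id].item()

# `candidates` is a list of SQL strings from any Text-to-SQL generator, and
# `make_prompt(question, schema, sql)` is a hypothetical formatter that
# produces the prompt layout shown in the usage example.
best_sql = max(candidates, key=lambda sql: score_sql(make_prompt(question, schema, sql)))
```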

## Citation

If you use **GradeSQL** in your research, please cite the following paper:

```bibtex
@misc{gradesqloutcomerewardmodels2025,
      title={GradeSQL: Outcome Reward Models for Ranking SQL Queries from Large Language Models}, 
      author={Mattia Tritto and Giuseppe Farano and Dario Di Palma and Gaetano Rossiello and Fedelucio Narducci and Dharmashankar Subramanian and Tommaso Di Noia},
      year={2025},
      eprint={2509.01308},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.01308}, 
}
```