---
license: apache-2.0
language:
- en
base_model:
- seeklhy/OmniSQL-32B
---
# GradeSQL-32B — Outcome Reward Model for Text-to-SQL

## Model Description

**GradeSQL-32B** is an Outcome Reward Model (ORM) designed to evaluate the semantic correctness of SQL queries generated from natural language questions in Text-to-SQL tasks. Rather than relying on syntactic heuristics or majority votes, GradeSQL-32B assigns a confidence score indicating whether a candidate SQL query faithfully answers the user's question based on the database schema.

Built on top of the **OmniSQL-32B** base model and fine-tuned on the **Spider** dataset, GradeSQL-32B provides a robust semantic scoring mechanism that improves query selection and alignment with user intent.

## Intended Use

- **Reranking Candidate SQL Queries:** Use GradeSQL-32B to assign semantic correctness scores and select the best SQL query among multiple candidates generated by LLMs (a best-of-N reranking sketch follows the usage example below).
- **Enhancing Text-to-SQL Pipelines:** Integrate as a reward or reranking model to improve execution accuracy and semantic fidelity in Text-to-SQL systems.
- **Evaluation and Research:** Analyze the semantic alignment of SQL queries with natural language questions to identify and mitigate errors.

## Finetuning Configuration

The following **LoRA configuration** was used to train this model:

- **R**: `16` (rank of the low-rank matrices)  
- **LoRA Alpha**: `64` (scaling factor for the low-rank update)  
- **Target Modules**: `{q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj}`  
- **LoRA Dropout**: `0.05`  
- **Bias**: `"none"` (bias terms are frozen)  
- **FP16**: `True` (half-precision training)  
- **Learning Rate**: `7e-5`  
- **Train Batch Size**: `5`  
- **Num. Train Epochs**: `50`
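
Expressed as a `peft` `LoraConfig`, the adapter settings above look like the following (a minimal sketch; FP16, the learning rate, batch size, and epoch count belong to the trainer's arguments, which are not shown):

```python
from peft import LoraConfig

# LoRA adapter settings mirroring the list above (sketch only; the
# training loop, dataset, and trainer arguments are omitted).
lora_config = LoraConfig(
    r=16,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```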
  
## Usage Example

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# The ORM prompt pairs the question and database schema with a candidate
# SQL query, then asks the model whether the SQL is correct.
prompt = """Question: What is the total horses record for each farm, sorted ascending?
CREATE TABLE competition_record (
    Competition_ID number, -- example: [1, 2]
    Farm_ID number, -- example: [2, 3]
    Rank number, -- example: [1, 2]
    PRIMARY KEY (Competition_ID),
    CONSTRAINT fk_competition_record_competition_id FOREIGN KEY (Competition_ID) REFERENCES farm_competition (Competition_ID),
    CONSTRAINT fk_competition_record_farm_id FOREIGN KEY (Farm_ID) REFERENCES farm (Farm_ID)
);

CREATE TABLE city (
    City_ID number, -- example: [1, 2]
    Status text, -- example: ['Town', 'Village']
    PRIMARY KEY (City_ID)
);

CREATE TABLE farm_competition (
    Competition_ID number, -- example: [1, 2]
    Host_city_ID number, -- example: [1, 2]
    PRIMARY KEY (Competition_ID),
    CONSTRAINT fk_farm_competition_host_city_id FOREIGN KEY (Host_city_ID) REFERENCES city (City_ID)
);

CREATE TABLE farm (
    Farm_ID number, -- example: [1, 2]
    Total_Horses number, -- example: [5056.5, 5486.9]
    Total_Cattle number, -- example: [8374.5, 8604.8]
    PRIMARY KEY (Farm_ID)
);
What is the total horses record for each farm, sorted ascending?
SQL: SELECT SUM(Total_Horses) AS Total_Horses, Farm_ID
FROM farm
GROUP BY Farm_ID
ORDER BY SUM(Total_Horses) ASC;
Is the SQL correct?"""

# Load the base model, apply the LoRA adapter, and merge it for inference.
base_model = AutoModelForCausalLM.from_pretrained("seeklhy/OmniSQL-32B", torch_dtype="auto", device_map="auto")
peft_model = PeftModel.from_pretrained(base_model, "sisinflab-ai/GradeSQL-32B-ORM-Spider")
orm_model = peft_model.merge_and_unload()
orm_tokenizer = AutoTokenizer.from_pretrained("seeklhy/OmniSQL-32B", use_fast=True)

# Drop the now-unneeded references; the merged model keeps the weights.
del base_model
del peft_model

inputs = orm_tokenizer(prompt, return_tensors="pt").to(orm_model.device)
        
with torch.no_grad():
    outputs = orm_model.generate(**inputs, max_new_tokens=1, return_dict_in_generate=True, output_scores=True, use_cache=False)

    # Only the newly generated tokens (max_new_tokens=1, so at most one).
    generated_ids = outputs.sequences[0, len(inputs.input_ids[0]):]
    # "Ġ" marks a leading space in the byte-level BPE vocabulary.
    yes_token_id = orm_tokenizer.convert_tokens_to_ids("ĠYes")
    no_token_id = orm_tokenizer.convert_tokens_to_ids("ĠNo")

    yes_no_pos = None
    for i, token_id in enumerate(generated_ids):
        if token_id in [yes_token_id, no_token_id]:
            yes_no_pos = i
            break

    if yes_no_pos is None:
        print("[Warning]: No 'Yes' or 'No' token found in the generated output.")
        print("[Score]: 0.5")
    else:
        # Score: probability of "Yes" at that position; 0 if the model said "No".
        logits = outputs.scores[yes_no_pos]
        probs = torch.softmax(logits, dim=-1)
        yes_prob = probs[0, yes_token_id].item()
        generated_answer = "Yes" if generated_ids[yes_no_pos] == yes_token_id else "No"

        if generated_answer == "Yes":
            print("[Score]: ", yes_prob)
        else:
            print("[Score]: ", 0)
```
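
Building on the example above, candidate reranking reduces to scoring each SQL string and keeping the arg-max. Below is a minimal best-of-N sketch, assuming the model and tokenizer from the usage example are in scope; `score_sql`, `make_prompt`, `question`, `schema`, and `candidates` are hypothetical names, and `make_prompt` would have to reproduce the prompt layout shown above:

```python
def score_sql(prompt: str) -> float:
    """Hypothetical wrapper around the scoring logic above: returns
    P("Yes"), 0.0 for a "No" answer, and 0.5 when neither token appears."""
    inputs = orm_tokenizer(prompt, return_tensors="pt").to(orm_model.device)
    with torch.no_grad():
        outputs = orm_model.generate(**inputs, max_new_tokens=1,
                                     return_dict_in_generate=True,
                                     output_scores=True, use_cache=False)
    generated_ids = outputs.sequences[0, len(inputs.input_ids[0]):]
    yes_token_id = orm_tokenizer.convert_tokens_to_ids("ĠYes")
    no_token_id = orm_tokenizer.convert_tokens_to_ids("ĠNo")
    if len(generated_ids) == 0 or generated_ids[0] not in (yes_token_id, no_token_id):
        return 0.5
    if generated_ids[0] == no_token_id:
        return 0.0
    probs = torch.softmax(outputs.scores[0], dim=-1)
    return probs[0, yes_token_id].item()

# `candidates` is a list of SQL strings from any Text-to-SQL generator, and
# `make_prompt(question, schema, sql)` is a hypothetical formatter that
# produces the prompt layout shown in the usage example.
best_sql = max(candidates, key=lambda sql: score_sql(make_prompt(question, schema, sql)))
```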

## Citation

If you use **GradeSQL** in your research, please cite the following paper:

```bibtex
@misc{gradesqloutcomerewardmodels2025,
      title={GradeSQL: Outcome Reward Models for Ranking SQL Queries from Large Language Models}, 
      author={Mattia Tritto and Giuseppe Farano and Dario Di Palma and Gaetano Rossiello and Fedelucio Narducci and Dharmashankar Subramanian and Tommaso Di Noia},
      year={2025},
      eprint={2509.01308},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.01308}, 
}
```