---
license: apache-2.0
language:
- en
base_model:
- seeklhy/OmniSQL-32B
---
# GradeSQL-32B — Outcome Reward Model for Text-to-SQL
## Model Description
**GradeSQL-32B** is an Outcome Reward Model (ORM) designed to evaluate the semantic correctness of SQL queries generated from natural language questions in Text-to-SQL tasks. Rather than relying on syntactic heuristics or majority votes, GradeSQL-32B assigns a confidence score indicating whether a candidate SQL query faithfully answers the user's question based on the database schema.
Built on top of the **OmniSQL-32B** base model and fine-tuned on the **Spider** dataset, GradeSQL-32B provides a robust semantic scoring mechanism that improves query selection and alignment with user intent.
## Intended Use
- **Reranking Candidate SQL Queries:** Use GradeSQL-32B to assign semantic correctness scores and select the best SQL query among multiple candidates generated by LLMs (a minimal sketch follows this list).
- **Enhancing Text-to-SQL Pipelines:** Integrate as a reward or reranking model to improve execution accuracy and semantic fidelity in Text-to-SQL systems.
- **Evaluation and Research:** Analyze the semantic alignment of SQL queries with natural language questions to identify and mitigate errors.
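For the reranking use case, selection can be as simple as scoring each candidate and keeping the argmax. A minimal sketch, assuming the prompt layout of the usage example below; the `orm_score` and `rerank` helpers are illustrative, not part of any released API:
```python
import torch

def orm_score(model, tokenizer, prompt: str) -> float:
    """Probability that the ORM's first generated token is 'Yes'.
    Mirrors the scoring logic of the full usage example below."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=1,
                             return_dict_in_generate=True, output_scores=True)
    yes_id = tokenizer.convert_tokens_to_ids("ĠYes")
    first_token = out.sequences[0, -1].item()
    probs = torch.softmax(out.scores[0], dim=-1)
    return probs[0, yes_id].item() if first_token == yes_id else 0.0

def rerank(model, tokenizer, question_and_schema: str, candidates: list[str]) -> str:
    """Keep the candidate SQL query with the highest 'Yes' probability."""
    prompts = [f"{question_and_schema}\nSQL: {sql}\nIs the SQL correct?" for sql in candidates]
    scores = [orm_score(model, tokenizer, p) for p in prompts]
    return candidates[scores.index(max(scores))]
```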
## Finetuning Configuration
The following **LoRA configuration** was used to train this model (an equivalent `peft.LoraConfig` sketch appears after the list):
- **R**: `16` (rank of the low-rank matrices)
- **Lora Alpha**: `64` (scaling factor for the low-rank update)
- **Target Modules**: `{q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj}`
- **Lora Dropout**: `0.05`
- **Bias**: `"none"` (bias terms are frozen)
- **FP16**: `True` (half-precision training)
- **Learning Rate**: `7e-5`
- **Train Batch Size**: `5`
- **Num. Train Epochs**: `50`
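For reference, the list above corresponds roughly to this PEFT adapter configuration (a sketch; `task_type` is an assumption, and the learning rate, batch size, and epoch count belong to the training arguments rather than `LoraConfig`):
```python
from peft import LoraConfig

# Sketch of the adapter configuration described above; task_type is assumed.
lora_config = LoraConfig(
    r=16,                          # rank of the low-rank matrices
    lora_alpha=64,                 # scaling factor for the low-rank update
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",                   # bias terms are frozen
    task_type="CAUSAL_LM",         # assumption: causal-LM task
)
```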
## Usage Example
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
prompt = """Question: What is the total horses record for each farm, sorted ascending?
CREATE TABLE competition_record (
Competition_ID number, -- example: [1, 2]
Farm_ID number, -- example: [2, 3]
Rank number, -- example: [1, 2]
PRIMARY KEY (Competition_ID),
CONSTRAINT fk_competition_record_competition_id FOREIGN KEY (Competition_ID) REFERENCES farm_competition (Competition_ID),
CONSTRAINT fk_competition_record_farm_id FOREIGN KEY (Farm_ID) REFERENCES farm (Farm_ID)
);
CREATE TABLE city (
City_ID number, -- example: [1, 2]
Status text, -- example: ['Town', 'Village']
PRIMARY KEY (City_ID)
);
CREATE TABLE farm_competition (
Competition_ID number, -- example: [1, 2]
Host_city_ID number, -- example: [1, 2]
PRIMARY KEY (Competition_ID),
CONSTRAINT fk_farm_competition_host_city_id FOREIGN KEY (Host_city_ID) REFERENCES city (City_ID)
);
CREATE TABLE farm (
Farm_ID number, -- example: [1, 2]
Total_Horses number, -- example: [5056.5, 5486.9]
Total_Cattle number, -- example: [8374.5, 8604.8]
PRIMARY KEY (Farm_ID)
);
What is the total horses record for each farm, sorted ascending?
SQL: SELECT SUM(Total_Horses) AS Total_Horses, Farm_ID
FROM farm
GROUP BY Farm_ID
ORDER BY SUM(Total_Horses) ASC;
Is the SQL correct?"""
# Load the base model, apply the GradeSQL LoRA adapter, and merge it for inference
base_model = AutoModelForCausalLM.from_pretrained("seeklhy/OmniSQL-32B", torch_dtype="auto", device_map="auto")
peft_model = PeftModel.from_pretrained(base_model, "sisinflab-ai/GradeSQL-32B-ORM-Spider")
orm_model = peft_model.merge_and_unload()
orm_tokenizer = AutoTokenizer.from_pretrained("seeklhy/OmniSQL-32B", use_fast=True)
del base_model
del peft_model

inputs = orm_tokenizer(prompt, return_tensors="pt").to(orm_model.device)
with torch.no_grad():
    outputs = orm_model.generate(**inputs, max_new_tokens=1, return_dict_in_generate=True, output_scores=True, use_cache=False)

# Locate the first "Yes"/"No" token among the newly generated tokens
generated_ids = outputs.sequences[0, len(inputs.input_ids[0]):]
yes_token_id = orm_tokenizer.convert_tokens_to_ids("ĠYes")
no_token_id = orm_tokenizer.convert_tokens_to_ids("ĠNo")
yes_no_pos = None
for i, token_id in enumerate(generated_ids):
    if token_id.item() in (yes_token_id, no_token_id):
        yes_no_pos = i
        break

if yes_no_pos is None:
    # Neither token was generated: fall back to a neutral score
    print("[Warning]: No 'Yes' or 'No' token found in the generated output.")
    print("[Score]: 0.5")
else:
    # Score = probability mass on "Yes"; a "No" answer scores 0
    logits = outputs.scores[yes_no_pos]
    probs = torch.softmax(logits, dim=-1)
    yes_prob = probs[0, yes_token_id].item()
    if generated_ids[yes_no_pos].item() == yes_token_id:
        print("[Score]: ", yes_prob)
    else:
        print("[Score]: ", 0)
```
## Citation
If you use **GradeSQL** in your research, please cite the following paper:
```bibtex
@misc{gradesqloutcomerewardmodels2025,
title={GradeSQL: Outcome Reward Models for Ranking SQL Queries from Large Language Models},
author={Mattia Tritto and Giuseppe Farano and Dario Di Palma and Gaetano Rossiello and Fedelucio Narducci and Dharmashankar Subramanian and Tommaso Di Noia},
year={2025},
eprint={2509.01308},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2509.01308},
}
``` |