---
language: en
library_name: sentence-transformers
license: mit
pipeline_tag: sentence-similarity
tags:
- cross-encoder
- regression
- trail-rag
- pathfinder-rag
- msmarco
- passage-ranking
- sentence-transformers
model-index:
- name: trailrag-cross-encoder-msmarco-enhanced
  results:
  - task:
      type: text-ranking
    dataset:
      name: MS MARCO
      type: msmarco
    metrics:
    - type: mse
      value: 0.0618574036824182
    - type: mae
      value: 0.1473706976051132
    - type: rmse
      value: 0.2487114868324707
    - type: r2_score
      value: 0.588492937027161
    - type: pearson_correlation
      value: 0.857523177971012
    - type: spearman_correlation
      value: 0.8264641527356917
---

# TrailRAG Cross-Encoder: MS MARCO Enhanced

This is a fine-tuned cross-encoder model optimized for **passage ranking**, trained as part of the PathfinderRAG research project.

## Model Details

- **Model Type**: Cross-encoder for regression (continuous similarity scores)
- **Base Model**: `cross-encoder/ms-marco-MiniLM-L-6-v2`
- **Training Dataset**: MS MARCO (large-scale passage-ranking dataset from Microsoft)
- **Task**: Passage ranking
- **Library**: sentence-transformers
- **License**: MIT

## Performance Metrics

### Final Regression Metrics

| Metric | Value | Description |
|--------|-------|-------------|
| **MSE** | **0.061857** | Mean squared error (lower is better) |
| **MAE** | **0.147371** | Mean absolute error (lower is better) |
| **RMSE** | **0.248711** | Root mean squared error (lower is better) |
| **R² Score** | **0.588493** | Coefficient of determination (higher is better) |
| **Pearson Correlation** | **0.857523** | Linear correlation (higher is better) |
| **Spearman Correlation** | **0.826464** | Rank correlation (higher is better) |

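For reference, all six metrics in the table can be recomputed from raw labels and predictions. A minimal pure-Python sketch (the `y_true`/`y_pred` values below are illustrative, not the model's actual outputs; this simple Spearman variant ignores ties):

```python
import math

def _pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (std_a * std_b)

def _ranks(xs):
    """Rank positions of each element (no tie handling in this sketch)."""
    order = sorted(range(len(xs)), key=xs.__getitem__)
    ranks = [0.0] * len(xs)
    for rank, i in enumerate(order):
        ranks[i] = float(rank)
    return ranks

def regression_metrics(y_true, y_pred):
    """Compute the regression metrics reported above from raw scores."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return {
        "mse": mse,
        "mae": mae,
        "rmse": math.sqrt(mse),
        "r2": 1 - (mse * n) / ss_tot,
        "pearson": _pearson(y_true, y_pred),
        # Spearman correlation is Pearson correlation on the ranks
        "spearman": _pearson(_ranks(y_true), _ranks(y_pred)),
    }

m = regression_metrics([0.9, 0.1, 0.5, 0.7], [0.8, 0.2, 0.4, 0.6])
print(round(m["mse"], 3), round(m["spearman"], 3))  # 0.01 1.0
```
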
### Training Details

- **Training Duration**: 21 minutes
- **Epochs**: 6
- **Early Stopping**: No
- **Best Correlation Score**: 0.915442
- **Final MSE**: 0.061857

### Training Configuration

- **Batch Size**: 20
- **Learning Rate**: 3e-05
- **Max Epochs**: 6
- **Weight Decay**: 0.01
- **Warmup Steps**: 100

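To illustrate what the warmup setting does: with a linear warmup-then-decay schedule (sentence-transformers' default `warmuplinear` behavior), the learning rate ramps from 0 to the base rate over the first 100 steps, then decays linearly to 0. A sketch assuming a hypothetical 300-step run:

```python
def warmup_linear_lr(step, base_lr=3e-05, warmup_steps=100, total_steps=300):
    """Linear warmup to base_lr over warmup_steps, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(warmup_linear_lr(50))   # halfway through warmup: 1.5e-05
print(warmup_linear_lr(100))  # peak learning rate: 3e-05
```
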
## Usage

This model can be used with the sentence-transformers library to compute semantic similarity scores for query-document pairs.

### Installation

```bash
pip install sentence-transformers
```

89
+ ### Basic Usage
90
+
91
+ ```python
92
+ from sentence_transformers import CrossEncoder
93
+
94
+ # Load the model
95
+ model = CrossEncoder('OloriBern/trailrag-cross-encoder-msmarco-enhanced')
96
+
97
+ # Example usage
98
+ pairs = [
99
+ ['What is artificial intelligence?', 'AI is a field of computer science focused on creating intelligent machines.'],
100
+ ['What is artificial intelligence?', 'Paris is the capital of France.']
101
+ ]
102
+
103
+ # Get similarity scores (continuous values, not binary)
104
+ scores = model.predict(pairs)
105
+ print(scores) # Higher scores indicate better semantic match
106
+ ```
107
+
### Advanced Usage in PathfinderRAG

```python
from sentence_transformers import CrossEncoder

# Initialize for PathfinderRAG exploration
cross_encoder = CrossEncoder('OloriBern/trailrag-cross-encoder-msmarco-enhanced')

def score_query_document_pair(query: str, document: str) -> float:
    """Score a single query-document pair for relevance."""
    score = cross_encoder.predict([[query, document]])[0]
    return float(score)

# Use in document exploration
query = "Your research query"
documents = ["Document 1 text", "Document 2 text"]

# Score all pairs, then rank documents by descending relevance
scores = cross_encoder.predict([[query, doc] for doc in documents])
ranked_docs = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
```

## Training Process

This model was trained as a **regression** model (not a classifier) to predict continuous similarity scores in the range [0, 1]. The training process focused on:

1. **Data Quality**: Used authentic MS MARCO examples with careful contamination filtering
2. **Regression Approach**: Avoided binary classification, preserving the continuous label distribution
3. **Correlation Optimization**: Maximized Spearman correlation for effective ranking
4. **Scientific Rigor**: All metrics derived from real training runs without simulation

### Why Regression Over Classification?

Cross-encoders for information retrieval should predict **continuous similarity scores**, not binary classifications. This approach:

- Preserves fine-grained similarity distinctions
- Enables better ranking and document selection
- Provides more informative scores for downstream applications
- Aligns with the mathematical foundations of information retrieval

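A toy illustration of why continuous scores matter for ranking: binarizing at a threshold collapses distinct relevance levels that a continuous scorer keeps apart (the scores and document names below are made up):

```python
# Hypothetical continuous relevance scores for four candidate documents
scores = {"doc_a": 0.91, "doc_b": 0.74, "doc_c": 0.52, "doc_d": 0.08}

# Continuous scores yield a full ordering over all candidates
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['doc_a', 'doc_b', 'doc_c', 'doc_d']

# Binarizing at 0.5 loses the ordering among the "relevant" documents:
# doc_a, doc_b, and doc_c all collapse to the same label
binary = {doc: int(s >= 0.5) for doc, s in scores.items()}
print(binary)
```
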
## Dataset

**MS MARCO**: Large-scale passage-ranking dataset from Microsoft

- **Task Type**: Passage ranking
- **Training Examples**: 1,000 high-quality pairs
- **Validation Split**: 20% (200 examples)
- **Quality Threshold**: ≥0.70 (authentic TrailRAG metrics)
- **Contamination**: Zero overlap between splits

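The zero-overlap property amounts to a simple invariant that can be checked with a set intersection. A minimal sketch, assuming pairs are represented as (query, passage) tuples (the example data is a placeholder, not actual MS MARCO content):

```python
def splits_are_disjoint(train_pairs, val_pairs):
    """Verify that no (query, passage) pair appears in both splits."""
    train_set = {tuple(p) for p in train_pairs}
    val_set = {tuple(p) for p in val_pairs}
    return train_set.isdisjoint(val_set)

train = [("q1", "passage a"), ("q2", "passage b")]
val = [("q3", "passage c")]
print(splits_are_disjoint(train, val))  # True: no contamination
```
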
## Limitations

- Optimized specifically for passage-ranking tasks
- Performance may vary on out-of-domain data
- Requires the sentence-transformers library for inference
- Trained on CPU (GPU optimization planned for future versions)

## Citation

```bibtex
@misc{trailrag-cross-encoder-msmarco,
  title     = {TrailRAG Cross-Encoder: MS MARCO Enhanced},
  author    = {PathfinderRAG Team},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/OloriBern/trailrag-cross-encoder-msmarco-enhanced}
}
```

## Model Card Contact

For questions about this model, please open an issue in the [PathfinderRAG repository](https://github.com/your-org/trail-rag-1) or contact the development team.

---

*This model card was automatically generated using the TrailRAG model card generator with authentic training metrics.*