---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:80
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: microsoft/mpnet-base
widget:
- source_sentence: How many different active substances were detected in surface water
    across all catchment areas?
  sentences:
  - 'metabolites were not detected in the water bodies.

    2.1.1. Antibiotics/Enzyme-Inhibitors and

    Abacavir in Surface-Water

    Fifty detections were found in all catchment areas in surface water, which corresponds
    to 15 different active substances:

    12 antibiotics, two enzyme inhibitors, and one antiviral. The number of detections
    per sampling station ranged from 0 to 7

    different active substances. The Ave river-Prazins (Santo Tirso) and Serzedelo
    I and II (Guimar ã es) as well as Ria

    Formosa-coastal water (Faro and Olh ã o), each one with two sampling sites, showed
    the most detected compounds in'
  - '2. Results

    2.1. Frequency of Detections:

    Antibiotics/Enzyme-Inhibitors and Abacavir

    in Surface-Groundwater

    During the screening framework beyond the antibiotics/enzyme-inhibitors, the antiviral
    abacavir was detected. Therefore,

    given the relevance of this compound, it was included in the present study. Although
    enzyme inhibitors belong to the

    antibiotic group, their specific pharmacological properties and detection were
    sorted apart. In the present study, antibiotic

    metabolites were not detected in the water bodies.

    2.1.1. Antibiotics/Enzyme-Inhibitors and

    Abacavir in Surface-Water'
  - 'surface water. The relatively higher detection of substances downstream of the
    effluent discharge points compared with a

    low detection in upstream samples could be attributed to the low efficiency in
    urban wastewater treatment plants or

    agricultural pressure. The environmental impact is more critical due to active
    substances in drinking water or premix

    medicated feeds in the veterinary site.

    Furthermore, the detection of substances of exclusive human use (abacavir, tazobactam
    and cilastatin) prove the weak'
- source_sentence: What group of pharmaceuticals was sulfamethazine matched to when
    its quantity was missing?
  sentences:
  - 'ciprofloxacin

    43%

    (3/7), enrofloxacin, norfloxacin, trimethoprim, lincomycin (29% (2/7), abacavir
    and tetracycline

    14% (1/7). The enzyme inhibitors, namely clavulanic acid and cilastatin, were
    detected once in an urban region located

    well. This catchment point showed the most significant

    number of pharmaceuticals. West/Tejo and Centre were the regions with the most
    considerable number of substances in

    groundwater, accounting for 43%. All groundwater

    samples were contaminated by at least one antibiotic. Supplemental Tables S2 and
    S4 contain a detailed description of

    the'
  - 'clarithromycin) were the only ones that demonstrated the potential to concentrate
    in living organisms (log Kow ≥ 3) [14].

    All the remaining antibiotics showed a relatively low log Kow and were expected
    to be present mainly in surface water.

    However, the soil mobility/adsorption detected The detected pharmaceuticals showed
    high to moderate water solubility

    and are small ionisable molecules (MW ≤ 900 g/mol). Regarding the octanol/water
    partitioning coefficient (log Kow) data,'
  - 'missing quantity for sulfamethazine, the sulfonamides group has been matched.

    Consumption (Kg) of the detected pharmaceuticals in Portugal (2017).

    1 Amount from ESVAC Report-2017; 2 Match the sulfonamides amount; NA-not available.

    Amount of detected pharmaceuticals consumption per Portuguese region. Amount of
    detected pharmaceuticals

    consumption per Portuguese region.'
- source_sentence: What directive sets environmental quality standards for substances
    in surface waters?
  sentences:
  - 'As much as the specificities of each member state should be considered this issue
    has become one of the European

    community''s main concerns [8].

    The strategies against water pollution are provided in the Water Framework Directive
    [9] and the Directive on

    Environmental Quality Standards that set environmental quality standards (EQS)
    for the substances in surface waters

    and confirm their designation as priority or priority hazardous substances [10].
    Evidence of potential impacts and'
  - 'seems to undertake a similar fate in the environment.

    Nevertheless, due to stronger adsorption, with higher emergence in sediment, its
    occurrence in the surface water is lower

    [71]. The use of tetracyclines, mainly as medicated premix and oral solution for
    food-producing animals [72], and the very

    low bioavailability (e.g. in pig feed) [43] contribute to increasing its release
    into the environment. Regarding macrolides,

    erythromycin and clarithromycin exhibit a remarkable frequency of detection in
    surface water samples. The most'
  - 'low flows; otherwise, POCIS might be damage. In ground-waters was used one POCIS
    unit/well. Due to the high sorption

    capacity, POCIS was deployed approximately for 30 days, allowing the polar organic
    compounds adsorbed to be in the

    equilibrium stage with the active substances in an aqueous medium. In the laboratory,
    POCIS disks were frozen until

    extraction.

    4.2.2. Qualitative Analysis Method Used

    for the Characterisation of Antibiotics in

    Surface-Groundwater'
- source_sentence: What is the molecular weight range of the detected pharmaceuticals?
  sentences:
  - '2.3. Physicochemical Properties and Key Pharmacokinetic Features of Detected
    Pharmaceuticals 2.3. Physicochemical

    Properties and Key Pharmacokinetic Features of Detected Pharmaceuticals

    The detected pharmaceuticals showed high to moderate water solubility and are
    small ionisable molecules (MW ≤ 900

    g/mol). Regarding the octanol/water partitioning coefficient (log Kow) data, macrolide
    antibiotics (azithromycin and

    clarithromycin) were the only ones that demonstrated the potential to concentrate
    in living organisms (log Kow ≥ 3) [14].'
  - 'As much as the specificities of each member state should be considered this issue
    has become one of the European

    community''s main concerns [8].

    The strategies against water pollution are provided in the Water Framework Directive
    [9] and the Directive on

    Environmental Quality Standards that set environmental quality standards (EQS)
    for the substances in surface waters

    and confirm their designation as priority or priority hazardous substances [10].
    Evidence of potential impacts and'
  - 'passive samplers in groundwater considered the well technical features; the depth
    and groundwater level were previously

    determined since they should be detected at the superficial levels. The passive
    sampler was placed using a water level

    meter, 2 m below the groundwater level. The sampler always remained immersed in
    water, avoiding extractions and the

    regional lowering of the water table [104]. For the sampling stations, sites of
    different environmental pressures were

    considered, specifically urban, agricultural area/animal production, and aquaculture.
    The information regarding the'
- source_sentence: What was the most frequently identified pharmaceutical in the groundwater
    samples?
  sentences:
  - 'Pharmacokinetic characteristics may represent key features in understanding antibiotics
    occurrence [62]. Most antibiotics

    are not completely metabolised in humans and animals; thus, a high percentage
    of the active substance (40-90%) is

    excreted in urine/faeces in the unchanged form. These molecules are discharged
    into water and soil through wastewater,

    animal manure, and sewage sludge, frequently used as fertilisers to agricultural
    lands. Also, it is expected that the

    hospital effluent will contribute partly to the pharmaceutical load in the wastewater
    treatment plant influence [63].'
  - 'many domestic and livestock animals. Several formulations of powder for administration
    in drinking water and medicated

    premix are available for poultry and pigs. The excretion of amoxicillin is predominantly
    renal; more than 80% of the parent

    drug is recovered unchanged in the urine. While bioavailability of 75 to 80% is
    reported in humans, a low value (~30%)

    was observed in pigs, calves, foals, and pigeons [26,52]. Maybe this last group
    of animals contribute more sharply to the'
  - 'from one to five compounds. The most frequently identified pharmaceuticals, in
    decreasing order, were ciprofloxacin 43%

    (3/7), enrofloxacin, norfloxacin, trimethoprim, lincomycin (29% (2/7), abacavir
    and tetracycline 14% (1/7). The enzyme

    inhibitors, namely clavulanic acid and cilastatin, were detected once in an urban
    region located well. This catchment point

    showed the most significant number of pharmaceuticals. West/Tejo and Centre were
    the regions with the most

    considerable number of substances in groundwater, accounting for 43%. All groundwater
    samples were contaminated by'
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy
model-index:
- name: SentenceTransformer based on microsoft/mpnet-base
  results:
  - task:
      type: triplet
      name: Triplet
    dataset:
      name: initial test
      type: initial_test
    metrics:
    - type: cosine_accuracy
      value: 0.7799999713897705
      name: Cosine Accuracy
  - task:
      type: triplet
      name: Triplet
    dataset:
      name: final test
      type: final_test
    metrics:
    - type: cosine_accuracy
      value: 0.8199999928474426
      name: Cosine Accuracy
    - type: cosine_accuracy
      value: 0.8999999761581421
      name: Cosine Accuracy
    - type: cosine_accuracy
      value: 0.8999999761581421
      name: Cosine Accuracy
    - type: cosine_accuracy
      value: 0.9200000166893005
      name: Cosine Accuracy
---

# SentenceTransformer based on microsoft/mpnet-base

This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base) on a JSON dataset of 80 anchor/positive/negative triplets. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base) <!-- at revision 6996ce1e91bd2a9c7d7f61daec37463394f73f09 -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
    - json
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'MPNetModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
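
The `Pooling` module above has only `pooling_mode_mean_tokens` enabled, so the sentence embedding is the average of the token vectors produced by the transformer. The following is a minimal pure-Python sketch of that idea, not the library's implementation; the function name `mean_pool` and the toy 4-dimensional vectors are illustrative assumptions.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over non-padding positions (mask value 1)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [total / count for total in totals]


# Toy input (hypothetical): three token vectors of width 4;
# the last position is padding and must not affect the result.
tokens = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
    [9.0, 9.0, 9.0, 9.0],
]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```

In the real model the token vectors are 768-dimensional and the mask comes from the tokenizer; only the averaging step is shown here.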

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sahithkumar7/mpnet-base-matryoshka-iter02")
# Run inference
sentences = [
    'What was the most frequently identified pharmaceutical in the groundwater samples?',
    'from one to five compounds. The most frequently identified pharmaceuticals, in decreasing order, were ciprofloxacin 43%\n(3/7), enrofloxacin, norfloxacin, trimethoprim, lincomycin (29% (2/7), abacavir and tetracycline 14% (1/7). The enzyme\ninhibitors, namely clavulanic acid and cilastatin, were detected once in an urban region located well. This catchment point\nshowed the most significant number of pharmaceuticals. West/Tejo and Centre were the regions with the most\nconsiderable number of substances in groundwater, accounting for 43%. All groundwater samples were contaminated by',
    'Pharmacokinetic characteristics may represent key features in understanding antibiotics occurrence [62]. Most antibiotics\nare not completely metabolised in humans and animals; thus, a high percentage of the active substance (40-90%) is\nexcreted in urine/faeces in the unchanged form. These molecules are discharged into water and soil through wastewater,\nanimal manure, and sewage sludge, frequently used as fertilisers to agricultural lands. Also, it is expected that the\nhospital effluent will contribute partly to the pharmaceutical load in the wastewater treatment plant influence [63].',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.8234, 0.5626],
#         [0.8234, 1.0000, 0.6069],
#         [0.5626, 0.6069, 1.0000]])
```
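
Because the model was trained with `MatryoshkaLoss` at 768, 512, 256, 128 and 64 dimensions, its embeddings are meant to remain usable when only the leading dimensions are kept. The sketch below illustrates the truncation step in pure Python on toy 8-dimensional vectors; the helper names and the numbers are illustrative assumptions, not outputs of this model. Recent versions of the library also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which is likely the more convenient route in practice.

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical) standing in for 768-dimensional embeddings.
query = [0.9, 0.1, 0.3, 0.2, 0.05, 0.0, 0.1, 0.02]
passage = [0.8, 0.2, 0.25, 0.1, 0.0, 0.1, 0.05, 0.3]

# Compare the similarity at the full size and at two truncated sizes.
for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    p = truncate_and_normalize(passage, dim)
    print(dim, round(cosine(q, p), 4))
```

Smaller dimensions reduce storage and search cost; how much retrieval quality is retained at each size is something to measure on your own data.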

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

## Evaluation

### Metrics

#### Triplet

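* Datasets: `initial_test` and `final_test`. The model-index metadata records four `final_test` results (0.82, 0.90, 0.90 and 0.92); the table below reports 0.92, the last listed value.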
* Evaluated with [<code>TripletEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)

| Metric              | initial_test | final_test |
|:--------------------|:-------------|:-----------|
| **cosine_accuracy** | **0.78**     | **0.92**   |
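
To make the metric concrete: cosine accuracy on triplets is the fraction of (anchor, positive, negative) triplets in which the anchor is more similar to the positive than to the negative. The following pure-Python sketch computes it on hypothetical 2-dimensional embeddings; it assumes a margin of zero and is not the `TripletEvaluator` code itself.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_cosine_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cosine(anchor, positive) > cosine(anchor, negative)
    )
    return correct / len(triplets)


# Hypothetical embeddings: the third triplet is ranked incorrectly
# because its negative points in nearly the same direction as the anchor.
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),
    ([1.0, 1.0], [1.0, 0.0], [0.9, 1.0]),
    ([1.0, 0.0], [1.0, 0.2], [0.5, 0.5]),
]
accuracy = triplet_cosine_accuracy(toy_triplets)
print(accuracy)
```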

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### json

* Dataset: json
* Size: 80 training samples
* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
* Approximate statistics based on the first 80 samples:
  |         | anchor                                                                            | positive                                                                             | negative                                                                             |
  |:--------|:----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
  | type    | string                                                                            | string                                                                               | string                                                                               |
  | details | <ul><li>min: 9 tokens</li><li>mean: 16.14 tokens</li><li>max: 33 tokens</li></ul> | <ul><li>min: 48 tokens</li><li>mean: 125.65 tokens</li><li>max: 218 tokens</li></ul> | <ul><li>min: 48 tokens</li><li>mean: 122.97 tokens</li><li>max: 211 tokens</li></ul> |
* Samples:
  | anchor                                                                                         | positive                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | negative                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
  |:-----------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
  | <code>Which two macrolide antibiotics are frequently detected in surface water samples?</code> | <code>seems to undertake a similar fate in the environment.<br>Nevertheless, due to stronger adsorption, with higher emergence in sediment, its occurrence in the surface water is lower<br>[71]. The use of tetracyclines, mainly as medicated premix and oral solution for food-producing animals [72], and the very<br>low bioavailability (e.g. in pig feed) [43] contribute to increasing its release into the environment. Regarding macrolides,<br>erythromycin and clarithromycin exhibit a remarkable frequency of detection in surface water samples. The most</code>                | <code>Nonetheless, besides the sorption capacity, these antibiotics have high solubility in water. Crucial routes for these<br>substances into the environment are manure from animal production and sewage sludge from wastewater treatment<br>plant (WWTP) used as fertilisers. Therefore, these substances have been evidenced in topsoil samples [68]. These<br>quinolones and other antibiotics, for instance, norfloxacin and tetracycline, have been identified in groundwater samples<br>despite being influenced by sorption processes. They were not readily degraded; instead, the input into groundwater</code> |
  | <code>What antimicrobial drugs were identified in the survey besides macrolides?</code>        | <code>is one of the most frequently pharmaceutical in representative rivers [74,75]. The three macrolides identified in our<br>detection survey are included since 2018 in the first 'watch list' [76].<br>Another group of antimicrobial drugs identified in our survey were sulfamethoxazole/trimethoprim and sulfamethazine.<br>Sulfamethoxazole/trimethoprim are often used combined since the effectiveness of sulfonamides is enhanced. In the<br>present study, the detection of both substances was comparable; however, trimethoprim was detected in groundwater.</code>              | <code>upstream samples obtained in rural locations was demonstrated and could be attributed to a low efficiency in the urban<br>wastewater treatment plants or due to agricultural pressure.<br>The higher frequency of detection for most substances was observed in the Ave river and Ria Formosa, confirming that<br>several effluents impact these water bodies from urban wastewater treatment plants and livestock production.<br>Pharmacokinetic characteristics may represent key features in understanding antibiotics occurrence [62]. Most antibiotics</code>                                                    |
  | <code>How long was the observational period of the antibiotic survey in Portugal?</code>       | <code>of antibiotics and their metabolites in surface- groundwater. It seeks to reflect the current demographic, spatial, drug<br>consumption, and drug profile on an observational period of 3 years in Portugal. The greatest challenge of this survey<br>data will be to promote the ecopharmacovigilance framework development shortly to implement measures for avoiding<br>misuse/overuse of antibiotics and slow down emission and antibiotic resistance.<br>2. Results<br>2.1. Frequency of Detections:<br>Antibiotics/Enzyme-Inhibitors and Abacavir<br>in Surface-Groundwater</code> | <code>despite being influenced by sorption processes. They were not readily degraded; instead, the input into groundwater<br>could be due to livestock farming pressure, namely by spreading manure in the soil or the possible sewage sludge<br>application in the area. High clay and low sand content in soils can decrease the mobility of pharmaceuticals, which is<br>attributed to clay intense exchange capacity. Thus, soil properties (e.g. particle composition) are a significant, influential</code>                                                                                                           |
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
  ```json
  {
      "loss": "MultipleNegativesRankingLoss",
      "matryoshka_dims": [
          768,
          512,
          256,
          128,
          64
      ],
      "matryoshka_weights": [
          1,
          1,
          1,
          1,
          1
      ],
      "n_dims_per_step": -1
  }
  ```
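
The configuration above wraps `MultipleNegativesRankingLoss` in `MatryoshkaLoss`: the inner ranking loss is evaluated on embeddings truncated to each listed dimension, and the results are combined with the listed weights (all 1 here). The pure-Python sketch below illustrates that structure under stated assumptions: the function names are hypothetical, the similarity scale of 20 is assumed rather than taken from this card, and the toy batch uses 4-dimensional vectors with dimensions `[4, 2]` in place of `[768, 512, 256, 128, 64]`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ranking_loss(anchors, positives, negatives, scale=20.0):
    """Each anchor must pick its own positive out of every positive
    and negative in the batch (cross-entropy over scaled cosines)."""
    candidates = positives + negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp; the target is candidate i.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


def matryoshka_loss(anchors, positives, negatives, dims, weights):
    """Weighted sum of the ranking loss over truncated embeddings."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        total += weight * ranking_loss(
            [v[:dim] for v in anchors],
            [v[:dim] for v in positives],
            [v[:dim] for v in negatives],
        )
    return total


# Hypothetical batch of two triplets.
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]
negatives = [[0.2, 0.9, 0.1, 0.1], [0.8, 0.1, 0.0, 0.2]]

loss = matryoshka_loss(anchors, positives, negatives, dims=[4, 2], weights=[1, 1])
print(loss)
```

Summing over several sizes is what pushes the leading dimensions to carry most of the information, which in turn is what makes truncated embeddings useful.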

### Evaluation Dataset

#### json

* Dataset: json
* Size: 20 evaluation samples
* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
* Approximate statistics based on the first 20 samples:
  |         | anchor                                                                            | positive                                                                             | negative                                                                            |
  |:--------|:----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|
  | type    | string                                                                            | string                                                                               | string                                                                              |
  | details | <ul><li>min: 11 tokens</li><li>mean: 16.4 tokens</li><li>max: 25 tokens</li></ul> | <ul><li>min: 76 tokens</li><li>mean: 113.65 tokens</li><li>max: 148 tokens</li></ul> | <ul><li>min: 89 tokens</li><li>mean: 118.8 tokens</li><li>max: 162 tokens</li></ul> |
* Samples:
  | anchor                                                                                                           | positive                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | negative                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
  |:-----------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
  | <code>What percentage of unchanged excretion did the most significant number of detected substances show?</code> | <code>coefficients were not available for lincomycin, clavulanic acid and cilastatin.<br>Physicochemical properties of detected pharmaceuticals.<br>1 Data retrieved from [16]; 2 Data retrieved from [17]; 3 Data retrieved from [18]; 4 Data retrieved from [19]; 5<br>Data retrieved from [20];<br>6 Data retrieved from [21]; 7 Data retrieved from [22]; 8 Data retrieved from [23]; 9 Data retrieved from [24]; 10<br>Data retrieved from [25];<br>NA-not available.<br>The most significant number of detected substances showed a percentage of unchanged excretion higher than 40%.</code>                    | <code>1. Introduction<br>Antibiotics are a critical component of human and veterinary modern medicine, developed to produce desirable or<br>beneficial effects on infections induced by pathogens. Like most pharmaceuticals, antibiotics tend to be small organic<br>polar compounds, generally ionisable, ordinarily subject to a metabolism or biotransformation process by the organism to<br>be eliminated more efficiently [1,2]. The excretion of these compounds and their metabolites occurs mainly through urine,</code>                                                                                                         |
  | <code>How many kilograms of abacavir were detected in Portugal in 2017?</code>                                   | <code>Regarding the different regions, it has been concluded that North and West/Tejo were the regions with the higher<br>consuming values. Both regions presented a significant value (33%) for the abacavir. For the detected antiviral abacavir,<br>an amount of 1458 kg has been observed.<br>Regarding antibiotics used in veterinary medicine, the regional amount was not available. Likewise, due to the reported<br>missing quantity for sulfamethazine, the sulfonamides group has been matched.<br>Consumption (Kg) of the detected pharmaceuticals in Portugal (2017).</code>                              | <code>43%<br>(3/7), enrofloxacin, norfloxacin, trimethoprim, lincomycin (29% (2/7), abacavir and tetracycline<br>14% (1/7). The enzyme inhibitors, namely clavulanic acid and cilastatin, were detected once in an urban region located<br>well. This catchment point showed the most significant<br>number of pharmaceuticals. West/Tejo and Centre were the regions with the most considerable number of substances in<br>groundwater, accounting for 43%. All groundwater<br>samples were contaminated by at least one antibiotic. Supplemental Tables S2 and S4 contain a detailed description of<br>the</code>                        |
  | <code>What must marketing authorisation procedures for medicines include since 2006?</code>                      | <code>substances in passive samplers [7]. Since 2006, marketing authorisation procedures for both human and veterinary<br>medicines must include an environmental risk assessment that comprises a prospective exposure assessment,<br>underestimating the possible impact and the occurrence of antibiotics after years of consumption. Ultimately, the potential<br>risk may not be correctly anticipated. It becomes urgent to generate new data, mainly to refine exposure assessments.<br>As much as the specificities of each member state should be considered this issue has become one of the European</code> | <code>clarithromycin/erythromycin, tetracycline, sulfamethoxazole, and abacavir. In groundwater, enrofloxacin/ciprofloxacin,<br>norfloxacin, trimethoprim, lincomycin, abacavir and tetracycline were recovered. Metabolites were not detected in water<br>bodies. Noticeable was the detection of enzyme inhibitors, tazobactam and cilastatin, which are both for exclusive<br>hospital use. The North region and Algarve (South) were the areas with the most significant frequency of substances in<br>surface water. The relatively higher detection of substances downstream of the effluent discharge points compared with a</code> |
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
  ```json
  {
      "loss": "MultipleNegativesRankingLoss",
      "matryoshka_dims": [
          768,
          512,
          256,
          128,
          64
      ],
      "matryoshka_weights": [
          1,
          1,
          1,
          1,
          1
      ],
      "n_dims_per_step": -1
  }
  ```

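To make the configuration above concrete, the following is a simplified pure-Python sketch of how a Matryoshka-style loss can wrap an in-batch ranking loss: the embeddings are truncated to each value in `matryoshka_dims`, the base loss is computed at every size, and the results are combined using `matryoshka_weights`. It is not the library's implementation. The function names are hypothetical, the similarity `scale` of 20.0 is an assumption, and zero-norm vectors are not handled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch acts as a negative.
    `scale` is an assumed similarity multiplier."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        max_score = max(scores)
        # Numerically stable log-sum-exp over all candidates in the batch.
        log_sum_exp = max_score + math.log(
            sum(math.exp(s - max_score) for s in scores)
        )
        total += log_sum_exp - scores[i]
    return total / len(anchors)


def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss over embeddings truncated to each dim."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        truncated_anchors = [a[:dim] for a in anchors]
        truncated_positives = [p[:dim] for p in positives]
        loss += weight * in_batch_ranking_loss(
            truncated_anchors, truncated_positives
        )
    return loss


# Toy 4-dimensional embeddings; the card itself uses dims 768 down to 64.
toy_anchors = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
toy_positives = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]]
print(matryoshka_loss(toy_anchors, toy_positives, dims=[4, 2], weights=[1, 1]))
```

Because every truncated size contributes to the loss with equal weight, the leading dimensions of the embedding are pushed to remain useful on their own, which is what allows embeddings to be shortened at inference time.
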
### Training Hyperparameters
#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `num_train_epochs`: 1
- `warmup_ratio`: 0.1
- `fp16`: True
- `batch_sampler`: no_duplicates

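As an illustration of what `warmup_ratio: 0.1` means in combination with the `linear` scheduler and the `5e-05` learning rate from the full hyperparameter list, the sketch below computes the learning rate at each optimizer step. It is a minimal re-implementation for illustration only: `linear_schedule_lr` is a hypothetical helper, and the 5-step run length is an assumption taken from the step counts in the training logs.

```python
import math


def linear_schedule_lr(step, total_steps, base_lr=5e-05, warmup_ratio=0.1):
    """Learning rate at optimizer step `step` (0-indexed) for a linear
    warmup followed by a linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed run length of 5 optimizer steps (hypothetical, for illustration).
for step in range(6):
    print(step, linear_schedule_lr(step, total_steps=5))
```

With so few steps, the warmup covers only a single step, so most of the run is spent on the decaying part of the schedule.
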
#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 1
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.1
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: proportional
- `router_mapping`: {}
- `learning_rate_mapping`: {}

</details>

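The `batch_sampler: no_duplicates` setting matters for an in-batch-negatives loss: if the same text appears twice in one batch, a correct positive can be treated as a negative. The sketch below shows the general idea with a hypothetical helper, `no_duplicate_batches`. It is a simplification under stated assumptions: it does not shuffle, it keeps a smaller trailing batch, and it is not the library's sampler.

```python
def no_duplicate_batches(samples, batch_size):
    """Group (anchor, positive) text pairs into batches of indices so that
    no text occurs twice within a batch. Samples that would introduce a
    duplicate, or that do not fit, are deferred to a later batch."""
    remaining = list(range(len(samples)))
    batches = []
    while remaining:
        batch, seen, deferred = [], set(), []
        for idx in remaining:
            texts = set(samples[idx])
            if len(batch) < batch_size and not (texts & seen):
                batch.append(idx)
                seen |= texts
            else:
                deferred.append(idx)
        batches.append(batch)
        remaining = deferred
    return batches


# Hypothetical pairs: samples 0 and 1 share a positive, 0 and 3 share an anchor.
pairs = [("q1", "p1"), ("q2", "p1"), ("q3", "p3"), ("q1", "p4")]
print(no_duplicate_batches(pairs, batch_size=2))
```
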
### Training Logs
| Epoch | Step | Training Loss | initial_test_cosine_accuracy | final_test_cosine_accuracy |
|:-----:|:----:|:-------------:|:----------------------------:|:--------------------------:|
| -1    | -1   | -             | 0.7800                       | -                          |
| 0.2   | 1    | 15.6011       | -                            | -                          |
| 0.4   | 2    | 12.9289       | -                            | -                          |
| 0.6   | 3    | 15.1921       | -                            | -                          |
| 0.8   | 4    | 14.4243       | -                            | -                          |
| 1.0   | 5    | 16.8067       | -                            | -                          |
| -1    | -1   | -             | -                            | 0.8200                     |
| 0.2   | 1    | 14.317        | -                            | -                          |
| 0.4   | 2    | 12.326        | -                            | -                          |
| 0.6   | 3    | 14.0337       | -                            | -                          |
| 0.8   | 4    | 11.1261       | -                            | -                          |
| 1.0   | 5    | 8.9671        | -                            | -                          |
| 1.2   | 6    | 10.716        | -                            | -                          |
| 1.4   | 7    | 9.496         | -                            | -                          |
| 1.6   | 8    | 9.0035        | -                            | -                          |
| 1.8   | 9    | 7.3839        | -                            | -                          |
| 2.0   | 10   | 11.0917       | -                            | -                          |
| -1    | -1   | -             | -                            | 0.9000                     |
| 0.2   | 1    | 11.3791       | -                            | -                          |
| 0.4   | 2    | 5.6417        | -                            | -                          |
| 0.6   | 3    | 5.7289        | -                            | -                          |
| 0.8   | 4    | 3.5917        | -                            | -                          |
| 1.0   | 5    | 2.3028        | -                            | -                          |
| -1    | -1   | -             | -                            | 0.9200                     |

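The `*_cosine_accuracy` columns are read here as a triplet-style metric: the share of (anchor, positive, negative) triplets in which the anchor embedding is closer to the positive than to the negative under cosine similarity. That reading is an assumption based on the metric names, and the helper below is a hypothetical sketch rather than the evaluator that produced the numbers in the table.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_accuracy(triplets):
    """Fraction of (anchor, positive, negative) embedding triplets where the
    anchor is more similar to the positive than to the negative."""
    correct = sum(
        1 for anchor, pos, neg in triplets if cosine(anchor, pos) > cosine(anchor, neg)
    )
    return correct / len(triplets)


# Toy 2-dimensional embeddings (hypothetical values).
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # positive is closer
    ([1.0, 0.0], [0.0, 1.0], [1.0, 0.1]),  # negative is closer
]
print(cosine_accuracy(toy_triplets))
```

Under this reading, a value of 0.9200 would mean that 92% of the evaluation triplets were ranked correctly.
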

### Framework Versions
- Python: 3.11.13
- Sentence Transformers: 5.0.0
- Transformers: 4.52.4
- PyTorch: 2.6.0+cu124
- Accelerate: 1.8.1
- Datasets: 3.6.0
- Tokenizers: 0.21.2

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
