Ngit committed · Commit 7e74878 · verified · 1 Parent(s): 3178e51

Update README.md

Files changed (1): README.md +83 -8
README.md CHANGED
@@ -6,20 +6,95 @@ language:
# Text Classification Toxicity

This model is a fine-tuned version of [nreimers/MiniLMv2-L6-H384-distilled-from-BERT-Large](https://huggingface.co/nreimers/MiniLMv2-L6-H384-distilled-from-BERT-Large) on the [Jigsaw 1st Kaggle competition](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge) dataset, using [unitary/toxic-bert](https://huggingface.co/unitary/toxic-bert) as the teacher model.
- The quantized version in ONNX format can be found [here](https://huggingface.co/minuva/MiniLMv2-toxic-jigsaw-lite-onnx).

The model contains only two labels (toxicity and severe toxicity). For the model with all labels, refer to this [page](https://huggingface.co/minuva/MiniLMv2-toxic-jigsaw).

- # Load the Model
-
- ```py
- from transformers import pipeline
-
- pipe = pipeline(model='minuva/MiniLMv2-toxic-jigsaw-lite', task='text-classification')
- pipe("This is pure trash")
- # [{'label': 'toxic', 'score': 0.9383478164672852}]
```
# Training hyperparameters

The following hyperparameters were used during training:
@@ -36,7 +111,7 @@ The following hyperparameters were used during training:

| Teacher (params) | Student (params) | Set (metric) | Score (teacher) | Score (student) |
|------------------|------------------|--------------|-----------------|-----------------|
- | unitary/toxic-bert (110M) | MiniLMv2-toxic-jigsaw-lite (23M) | Test (ROC_AUC) | 0.982677 | 0.9815 |

# Deployment
 
 
# Text Classification Toxicity

This model is a fine-tuned version of [nreimers/MiniLMv2-L6-H384-distilled-from-BERT-Large](https://huggingface.co/nreimers/MiniLMv2-L6-H384-distilled-from-BERT-Large) on the [Jigsaw 1st Kaggle competition](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge) dataset, using [unitary/toxic-bert](https://huggingface.co/unitary/toxic-bert) as the teacher model.
+ The quantized version in ONNX format can be found [here](https://huggingface.co/minuva/MiniLMv2-toxic-jigsaw-lite).

The model contains only two labels (toxicity and severe toxicity). For the model with all labels, refer to this [page](https://huggingface.co/minuva/MiniLMv2-toxic-jigsaw).

+ # Usage

+ ## Installation

+ ```bash
+ pip install tokenizers
+ pip install onnxruntime
+ git clone https://huggingface.co/minuva/MiniLMv2-toxic-jigsaw-lite-onnx
```
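
After cloning, the label set can be confirmed by inspecting the repository's `config.json`. The snippet below is a minimal sketch that assumes the standard Hugging Face `id2label` field (the same field the inference code below reads); the exact label strings may differ.

```py
import json

# Peek at the label mapping shipped with the cloned repository
# (assumes the standard "id2label" field in config.json).
with open("MiniLMv2-toxic-jigsaw-lite-onnx/config.json") as f:
    print(json.load(f)["id2label"])
```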
 
+ ## Load the Model

+ ```py
+ import os
+ import numpy as np
+ import json
+
+ from tokenizers import Tokenizer
+ from onnxruntime import InferenceSession
+
+ model_name = "minuva/MiniLMv2-toxic-jigsaw-lite-onnx"
+ tokenizer = Tokenizer.from_pretrained(model_name)
+ tokenizer.enable_padding()
+ tokenizer.enable_truncation(max_length=256)
+ batch_size = 16
+
+ texts = ["This is pure trash",]
+ outputs = []
+ model = InferenceSession("MiniLMv2-toxic-jigsaw-lite-onnx/model_optimized_quantized.onnx", providers=['CPUExecutionProvider'])
+
+ with open(os.path.join("MiniLMv2-toxic-jigsaw-lite-onnx", "config.json"), "r") as f:
+     config = json.load(f)
+
+ output_names = [output.name for output in model.get_outputs()]
+ input_names = [input.name for input in model.get_inputs()]
+
+ # Run inference in batches of at most `batch_size` texts.
+ for subtexts in np.array_split(np.array(texts), len(texts) // batch_size + 1):
+     encodings = tokenizer.encode_batch(list(subtexts))
+     inputs = {
+         "input_ids": np.vstack(
+             [encoding.ids for encoding in encodings],
+         ),
+         "attention_mask": np.vstack(
+             [encoding.attention_mask for encoding in encodings],
+         ),
+         "token_type_ids": np.vstack(
+             [encoding.type_ids for encoding in encodings],
+         ),
+     }
+
+     for input_name in input_names:
+         if input_name not in inputs:
+             raise ValueError(f"Input name {input_name} not found in inputs")
+
+     inputs = {input_name: inputs[input_name] for input_name in input_names}
+     output = np.squeeze(
+         np.stack(
+             model.run(output_names=output_names, input_feed=inputs)
+         ),
+         axis=0,
+     )
+     outputs.append(output)
+
+ # Convert logits to independent per-label probabilities with a sigmoid.
+ outputs = np.concatenate(outputs, axis=0)
+ scores = 1 / (1 + np.exp(-outputs))
+
+ results = []
+ for item in scores:
+     labels = []
+     item_scores = []
+     for idx, s in enumerate(item):
+         labels.append(config["id2label"][str(idx)])
+         item_scores.append(float(s))
+     results.append({"labels": labels, "scores": item_scores})
+
+ # Keep only the highest-scoring label for each input text.
+ res = []
+ for result in results:
+     joined = list(zip(result['labels'], result['scores']))
+     max_score = max(joined, key=lambda x: x[1])
+     res.append(max_score)
+
+ res
+ # [('toxic', 0.736885666847229)]
+ ```
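
Because the labels are scored independently, a fixed decision threshold on the probabilities above can be more convenient than keeping only the highest-scoring label. The sketch below is illustrative only; the threshold value is an assumption to be tuned on validation data.

```py
# Illustrative only: flag a text when any label probability exceeds the threshold.
THRESHOLD = 0.5  # assumption, not a recommended value

flagged = [
    any(score >= THRESHOLD for score in result["scores"])
    for result in results
]
print(list(zip(texts, flagged)))
# e.g. [('This is pure trash', True)]
```
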
# Training hyperparameters

The following hyperparameters were used during training:
 
| Teacher (params) | Student (params) | Set (metric) | Score (teacher) | Score (student) |
|------------------|------------------|--------------|-----------------|-----------------|
+ | unitary/toxic-bert (110M) | MiniLMv2-toxic-jigsaw-lite (23M) | Test (ROC_AUC) | 0.982677 | 0.9806 |
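
The ROC_AUC figures above can be computed with scikit-learn once per-label probabilities are available for the Jigsaw test split. The snippet below is only a sketch with made-up stand-in arrays, and the macro averaging choice is an assumption.

```py
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical stand-ins for the binary test labels and the model's probabilities,
# shaped (n_samples, n_labels).
y_true = np.array([[1, 0], [0, 0], [1, 1]])
y_score = np.array([[0.90, 0.20], [0.10, 0.05], [0.80, 0.60]])

print(roc_auc_score(y_true, y_score, average="macro"))
```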
 
# Deployment