Upload 9 files

Files changed (9)

README.md +113 -3
config.json +433 -0
language_detection.onnx +3 -0
model.safetensors +3 -0
special_tokens_map.json +37 -0
to_onnx.py +256 -0
tokenizer.json +0 -0
tokenizer_config.json +56 -0
vocab.txt +0 -0

README.md CHANGED

@@ -1,3 +1,113 @@
----
-license: mit
----

+---
+library_name: transformers
+tags:
+- language
+- detection
+- classification
+license: mit
+datasets:
+- hac541309/open-lid-dataset
+pipeline_tag: text-classification
+---
+# Language Detection Model
+A **BERT-based** language detection model trained on [hac541309/open-lid-dataset](https://huggingface.co/datasets/hac541309/open-lid-dataset), which includes **121 million sentences across 200 languages**. This model is optimized for **fast and accurate** language identification in text classification tasks.
+## Model Details
+- **Architecture**: [BertForSequenceClassification](https://huggingface.co/transformers/model_doc/bert.html)
+- **Hidden Size**: 384
+- **Number of Layers**: 4
+- **Attention Heads**: 6
+- **Max Sequence Length**: 512
+- **Dropout**: 0.1
+- **Vocabulary Size**: 50,257
+## Training Process
+- **Dataset**:
+  - Used the [open-lid-dataset](https://huggingface.co/datasets/hac541309/open-lid-dataset)
+  - Split into train (90%) and test (10%)
+- **Tokenizer**: A custom `BertTokenizerFast` with special tokens for `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`
+- **Hyperparameters**:
+  - Learning Rate: 2e-5
+  - Batch Size: 256 (training) / 512 (testing)
+  - Epochs: 1
+  - Scheduler: Cosine
+- **Trainer**: Leveraged the Hugging Face [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer) with Weights & Biases for logging
+## Evaluation
+The model was evaluated on the test split. Below are the overall metrics:
+- **Accuracy**: 0.969466
+- **Precision**: 0.969586
+- **Recall**: 0.969466
+- **F1 Score**: 0.969417
+Detailed evaluation per script (Size is the number of supported languages written in that script):
+| Script | Support | Precision | Recall | F1 Score | Size |
+|--------|---------|-----------|--------|----------|------|
+| Arab   | 819219  | 0.9038    | 0.9014 | 0.9023   | 21   |
+| Latn   | 7924704 | 0.9678    | 0.9663 | 0.9670   | 125  |
+| Ethi   | 144403  | 0.9967    | 0.9964 | 0.9966   | 2    |
+| Beng   | 163983  | 0.9949    | 0.9935 | 0.9942   | 3    |
+| Deva   | 423895  | 0.9495    | 0.9326 | 0.9405   | 10   |
+| Cyrl   | 831949  | 0.9899    | 0.9883 | 0.9891   | 12   |
+| Tibt   | 35683   | 0.9925    | 0.9930 | 0.9927   | 2    |
+| Grek   | 131155  | 0.9984    | 0.9990 | 0.9987   | 1    |
+| Gujr   | 86912   | 0.99999   | 0.9999 | 0.99995  | 1    |
+| Hebr   | 100530  | 0.9966    | 0.9995 | 0.9981   | 2    |
+| Armn   | 67203   | 0.9999    | 0.9998 | 0.9998   | 1    |
+| Jpan   | 88004   | 0.9983    | 0.9987 | 0.9985   | 1    |
+| Knda   | 67170   | 0.9999    | 0.9998 | 0.9999   | 1    |
+| Geor   | 70769   | 0.99997   | 0.9998 | 0.9999   | 1    |
+| Khmr   | 39708   | 1.0000    | 0.9997 | 0.9999   | 1    |
+| Hang   | 108509  | 0.9997    | 0.9999 | 0.9998   | 1    |
+| Laoo   | 29389   | 0.9999    | 0.9999 | 0.9999   | 1    |
+| Mlym   | 68418   | 0.99996   | 0.9999 | 0.9999   | 1    |
+| Mymr   | 100857  | 0.9999    | 0.9992 | 0.9995   | 2    |
+| Orya   | 44976   | 0.9995    | 0.9998 | 0.9996   | 1    |
+| Guru   | 67106   | 0.99999   | 0.9999 | 0.9999   | 1    |
+| Olck   | 22279   | 1.0000    | 0.9991 | 0.9995   | 1    |
+| Sinh   | 67492   | 1.0000    | 0.9998 | 0.9999   | 1    |
+| Taml   | 76373   | 0.99997   | 0.9999 | 0.9999   | 1    |
+| Tfng   | 41325   | 0.8512    | 0.8246 | 0.8247   | 2    |
+| Telu   | 62387   | 0.99997   | 0.9999 | 0.9999   | 1    |
+| Thai   | 83820   | 0.99995   | 0.9998 | 0.9999   | 1    |
+| Hant   | 152723  | 0.9945    | 0.9954 | 0.9949   | 2    |
+| Hans   | 92689   | 0.9893    | 0.9870 | 0.9882   | 1    |
+A detailed per-script classification report is also provided in the repository for further analysis.
+---
+### How to Use
+You can quickly load and run inference with this model using the [Transformers pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines):
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+tokenizer = AutoTokenizer.from_pretrained("alexneakameni/language_detection")
+model = AutoModelForSequenceClassification.from_pretrained("alexneakameni/language_detection")
+language_detection = pipeline("text-classification", model=model, tokenizer=tokenizer)
+text = "Hello world!"
+predictions = language_detection(text)
+print(predictions)
+```
+The pipeline returns a list of dictionaries, each with a `label` (a language and script code such as `eng_Latn`) and a `score` (the model's confidence in that label).
+---
+**Note**: The model’s performance may vary depending on text length, language variety, and domain-specific vocabulary. Always validate results against your own datasets for critical applications.
+For more information, see the [repository documentation](https://github.com/KameniAlexNea/learning_language).
+Thank you for using this model. Feedback and contributions are welcome!
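Review note (not part of the commit): in the README's overall metrics, recall is identical to accuracy (0.969466) while precision differs slightly. That pattern is what support-weighted averaging produces, so the figures are probably weighted averages over the labels. This is an inference from the numbers, not something the README states. The sketch below uses a hypothetical helper, `weighted_metrics`, and made-up labels to show why weighted recall always equals accuracy.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1 (illustrative helper)."""
    total = len(y_true)
    support = Counter(y_true)            # true examples per label
    predicted = Counter(y_pred)          # predictions per label
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / support[label] if support[label] else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support[label] / total  # labels are weighted by their support
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for illustration only
y_true = ["eng_Latn", "eng_Latn", "fra_Latn", "deu_Latn", "deu_Latn", "deu_Latn"]
y_pred = ["eng_Latn", "fra_Latn", "fra_Latn", "deu_Latn", "deu_Latn", "eng_Latn"]
metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

Each label's recall is multiplied by its share of the data, so the weighted sum collapses to correct predictions divided by total examples, which is accuracy.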

config.json ADDED

	@@ -0,0 +1,433 @@

+{
+  "_name_or_path": "data/results/checkpoint-76000",
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 384,
+  "id2label": {
+    "0": "lit_Latn",
+    "1": "fon_Latn",
+    "2": "kin_Latn",
+    "3": "khm_Khmr",
+    "4": "bjn_Latn",
+    "5": "prs_Arab",
+    "6": "wol_Latn",
+    "7": "run_Latn",
+    "8": "eng_Latn",
+    "9": "gla_Latn",
+    "10": "lvs_Latn",
+    "11": "nya_Latn",
+    "12": "kac_Latn",
+    "13": "lua_Latn",
+    "14": "tuk_Latn",
+    "15": "tpi_Latn",
+    "16": "grn_Latn",
+    "17": "xho_Latn",
+    "18": "bam_Latn",
+    "19": "mri_Latn",
+    "20": "san_Deva",
+    "21": "isl_Latn",
+    "22": "kas_Deva",
+    "23": "bel_Cyrl",
+    "24": "heb_Hebr",
+    "25": "zho_Hant",
+    "26": "bak_Cyrl",
+    "27": "fra_Latn",
+    "28": "por_Latn",
+    "29": "ukr_Cyrl",
+    "30": "umb_Latn",
+    "31": "kan_Knda",
+    "32": "smo_Latn",
+    "33": "als_Latn",
+    "34": "kbp_Latn",
+    "35": "lin_Latn",
+    "36": "urd_Arab",
+    "37": "yor_Latn",
+    "38": "azb_Arab",
+    "39": "ltz_Latn",
+    "40": "twi_Latn",
+    "41": "hin_Deva",
+    "42": "tgl_Latn",
+    "43": "asm_Beng",
+    "44": "gaz_Latn",
+    "45": "ell_Grek",
+    "46": "taq_Tfng",
+    "47": "nso_Latn",
+    "48": "dan_Latn",
+    "49": "pes_Arab",
+    "50": "pan_Guru",
+    "51": "war_Latn",
+    "52": "mar_Deva",
+    "53": "mni_Beng",
+    "54": "acm_Arab",
+    "55": "srd_Latn",
+    "56": "vec_Latn",
+    "57": "ory_Orya",
+    "58": "lug_Latn",
+    "59": "ltg_Latn",
+    "60": "guj_Gujr",
+    "61": "ita_Latn",
+    "62": "swe_Latn",
+    "63": "cjk_Latn",
+    "64": "ace_Latn",
+    "65": "taq_Latn",
+    "66": "cat_Latn",
+    "67": "zsm_Latn",
+    "68": "hun_Latn",
+    "69": "kaz_Cyrl",
+    "70": "pol_Latn",
+    "71": "ban_Latn",
+    "72": "nus_Latn",
+    "73": "acq_Arab",
+    "74": "aeb_Arab",
+    "75": "spa_Latn",
+    "76": "slk_Latn",
+    "77": "hrv_Latn",
+    "78": "crh_Latn",
+    "79": "tur_Latn",
+    "80": "bos_Latn",
+    "81": "ssw_Latn",
+    "82": "kik_Latn",
+    "83": "ydd_Hebr",
+    "84": "snd_Arab",
+    "85": "hau_Latn",
+    "86": "tam_Taml",
+    "87": "plt_Latn",
+    "88": "kmr_Latn",
+    "89": "ace_Arab",
+    "90": "mkd_Cyrl",
+    "91": "lij_Latn",
+    "92": "dyu_Latn",
+    "93": "mos_Latn",
+    "94": "ayr_Latn",
+    "95": "ast_Latn",
+    "96": "fij_Latn",
+    "97": "lmo_Latn",
+    "98": "zho_Hans",
+    "99": "nob_Latn",
+    "100": "hye_Armn",
+    "101": "amh_Ethi",
+    "102": "jav_Latn",
+    "103": "sag_Latn",
+    "104": "mai_Deva",
+    "105": "lao_Laoo",
+    "106": "uzn_Latn",
+    "107": "mya_Mymr",
+    "108": "fin_Latn",
+    "109": "knc_Latn",
+    "110": "tat_Cyrl",
+    "111": "ajp_Arab",
+    "112": "dzo_Tibt",
+    "113": "pag_Latn",
+    "114": "kir_Cyrl",
+    "115": "sna_Latn",
+    "116": "zul_Latn",
+    "117": "kab_Latn",
+    "118": "fur_Latn",
+    "119": "ckb_Arab",
+    "120": "vie_Latn",
+    "121": "mal_Mlym",
+    "122": "bem_Latn",
+    "123": "som_Latn",
+    "124": "ars_Arab",
+    "125": "szl_Latn",
+    "126": "tgk_Cyrl",
+    "127": "tel_Telu",
+    "128": "quy_Latn",
+    "129": "deu_Latn",
+    "130": "bjn_Arab",
+    "131": "azj_Latn",
+    "132": "eus_Latn",
+    "133": "ces_Latn",
+    "134": "nld_Latn",
+    "135": "shn_Mymr",
+    "136": "bul_Cyrl",
+    "137": "kam_Latn",
+    "138": "kmb_Latn",
+    "139": "ron_Latn",
+    "140": "bho_Deva",
+    "141": "glg_Latn",
+    "142": "awa_Deva",
+    "143": "tha_Thai",
+    "144": "lim_Latn",
+    "145": "hat_Latn",
+    "146": "mag_Deva",
+    "147": "kon_Latn",
+    "148": "pbt_Arab",
+    "149": "kat_Geor",
+    "150": "khk_Cyrl",
+    "151": "arb_Arab",
+    "152": "knc_Arab",
+    "153": "kor_Hang",
+    "154": "oci_Latn",
+    "155": "lus_Latn",
+    "156": "ary_Arab",
+    "157": "epo_Latn",
+    "158": "pap_Latn",
+    "159": "ibo_Latn",
+    "160": "fao_Latn",
+    "161": "ben_Beng",
+    "162": "yue_Hant",
+    "163": "ceb_Latn",
+    "164": "luo_Latn",
+    "165": "srp_Cyrl",
+    "166": "ind_Latn",
+    "167": "slv_Latn",
+    "168": "min_Latn",
+    "169": "scn_Latn",
+    "170": "apc_Arab",
+    "171": "sin_Sinh",
+    "172": "mlt_Latn",
+    "173": "kea_Latn",
+    "174": "uig_Arab",
+    "175": "npi_Deva",
+    "176": "kas_Arab",
+    "177": "bug_Latn",
+    "178": "hne_Deva",
+    "179": "sat_Olck",
+    "180": "swh_Latn",
+    "181": "tso_Latn",
+    "182": "nno_Latn",
+    "183": "rus_Cyrl",
+    "184": "dik_Latn",
+    "185": "sun_Latn",
+    "186": "afr_Latn",
+    "187": "arz_Arab",
+    "188": "gle_Latn",
+    "189": "sot_Latn",
+    "190": "ewe_Latn",
+    "191": "fuv_Latn",
+    "192": "tum_Latn",
+    "193": "ilo_Latn",
+    "194": "cym_Latn",
+    "195": "tir_Ethi",
+    "196": "tzm_Tfng",
+    "197": "bod_Tibt",
+    "198": "tsn_Latn",
+    "199": "est_Latn",
+    "200": "jpn_Jpan"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 768,
+  "label2id": {
+    "ace_Arab": 89,
+    "ace_Latn": 64,
+    "acm_Arab": 54,
+    "acq_Arab": 73,
+    "aeb_Arab": 74,
+    "afr_Latn": 186,
+    "ajp_Arab": 111,
+    "als_Latn": 33,
+    "amh_Ethi": 101,
+    "apc_Arab": 170,
+    "arb_Arab": 151,
+    "ars_Arab": 124,
+    "ary_Arab": 156,
+    "arz_Arab": 187,
+    "asm_Beng": 43,
+    "ast_Latn": 95,
+    "awa_Deva": 142,
+    "ayr_Latn": 94,
+    "azb_Arab": 38,
+    "azj_Latn": 131,
+    "bak_Cyrl": 26,
+    "bam_Latn": 18,
+    "ban_Latn": 71,
+    "bel_Cyrl": 23,
+    "bem_Latn": 122,
+    "ben_Beng": 161,
+    "bho_Deva": 140,
+    "bjn_Arab": 130,
+    "bjn_Latn": 4,
+    "bod_Tibt": 197,
+    "bos_Latn": 80,
+    "bug_Latn": 177,
+    "bul_Cyrl": 136,
+    "cat_Latn": 66,
+    "ceb_Latn": 163,
+    "ces_Latn": 133,
+    "cjk_Latn": 63,
+    "ckb_Arab": 119,
+    "crh_Latn": 78,
+    "cym_Latn": 194,
+    "dan_Latn": 48,
+    "deu_Latn": 129,
+    "dik_Latn": 184,
+    "dyu_Latn": 92,
+    "dzo_Tibt": 112,
+    "ell_Grek": 45,
+    "eng_Latn": 8,
+    "epo_Latn": 157,
+    "est_Latn": 199,
+    "eus_Latn": 132,
+    "ewe_Latn": 190,
+    "fao_Latn": 160,
+    "fij_Latn": 96,
+    "fin_Latn": 108,
+    "fon_Latn": 1,
+    "fra_Latn": 27,
+    "fur_Latn": 118,
+    "fuv_Latn": 191,
+    "gaz_Latn": 44,
+    "gla_Latn": 9,
+    "gle_Latn": 188,
+    "glg_Latn": 141,
+    "grn_Latn": 16,
+    "guj_Gujr": 60,
+    "hat_Latn": 145,
+    "hau_Latn": 85,
+    "heb_Hebr": 24,
+    "hin_Deva": 41,
+    "hne_Deva": 178,
+    "hrv_Latn": 77,
+    "hun_Latn": 68,
+    "hye_Armn": 100,
+    "ibo_Latn": 159,
+    "ilo_Latn": 193,
+    "ind_Latn": 166,
+    "isl_Latn": 21,
+    "ita_Latn": 61,
+    "jav_Latn": 102,
+    "jpn_Jpan": 200,
+    "kab_Latn": 117,
+    "kac_Latn": 12,
+    "kam_Latn": 137,
+    "kan_Knda": 31,
+    "kas_Arab": 176,
+    "kas_Deva": 22,
+    "kat_Geor": 149,
+    "kaz_Cyrl": 69,
+    "kbp_Latn": 34,
+    "kea_Latn": 173,
+    "khk_Cyrl": 150,
+    "khm_Khmr": 3,
+    "kik_Latn": 82,
+    "kin_Latn": 2,
+    "kir_Cyrl": 114,
+    "kmb_Latn": 138,
+    "kmr_Latn": 88,
+    "knc_Arab": 152,
+    "knc_Latn": 109,
+    "kon_Latn": 147,
+    "kor_Hang": 153,
+    "lao_Laoo": 105,
+    "lij_Latn": 91,
+    "lim_Latn": 144,
+    "lin_Latn": 35,
+    "lit_Latn": 0,
+    "lmo_Latn": 97,
+    "ltg_Latn": 59,
+    "ltz_Latn": 39,
+    "lua_Latn": 13,
+    "lug_Latn": 58,
+    "luo_Latn": 164,
+    "lus_Latn": 155,
+    "lvs_Latn": 10,
+    "mag_Deva": 146,
+    "mai_Deva": 104,
+    "mal_Mlym": 121,
+    "mar_Deva": 52,
+    "min_Latn": 168,
+    "mkd_Cyrl": 90,
+    "mlt_Latn": 172,
+    "mni_Beng": 53,
+    "mos_Latn": 93,
+    "mri_Latn": 19,
+    "mya_Mymr": 107,
+    "nld_Latn": 134,
+    "nno_Latn": 182,
+    "nob_Latn": 99,
+    "npi_Deva": 175,
+    "nso_Latn": 47,
+    "nus_Latn": 72,
+    "nya_Latn": 11,
+    "oci_Latn": 154,
+    "ory_Orya": 57,
+    "pag_Latn": 113,
+    "pan_Guru": 50,
+    "pap_Latn": 158,
+    "pbt_Arab": 148,
+    "pes_Arab": 49,
+    "plt_Latn": 87,
+    "pol_Latn": 70,
+    "por_Latn": 28,
+    "prs_Arab": 5,
+    "quy_Latn": 128,
+    "ron_Latn": 139,
+    "run_Latn": 7,
+    "rus_Cyrl": 183,
+    "sag_Latn": 103,
+    "san_Deva": 20,
+    "sat_Olck": 179,
+    "scn_Latn": 169,
+    "shn_Mymr": 135,
+    "sin_Sinh": 171,
+    "slk_Latn": 76,
+    "slv_Latn": 167,
+    "smo_Latn": 32,
+    "sna_Latn": 115,
+    "snd_Arab": 84,
+    "som_Latn": 123,
+    "sot_Latn": 189,
+    "spa_Latn": 75,
+    "srd_Latn": 55,
+    "srp_Cyrl": 165,
+    "ssw_Latn": 81,
+    "sun_Latn": 185,
+    "swe_Latn": 62,
+    "swh_Latn": 180,
+    "szl_Latn": 125,
+    "tam_Taml": 86,
+    "taq_Latn": 65,
+    "taq_Tfng": 46,
+    "tat_Cyrl": 110,
+    "tel_Telu": 127,
+    "tgk_Cyrl": 126,
+    "tgl_Latn": 42,
+    "tha_Thai": 143,
+    "tir_Ethi": 195,
+    "tpi_Latn": 15,
+    "tsn_Latn": 198,
+    "tso_Latn": 181,
+    "tuk_Latn": 14,
+    "tum_Latn": 192,
+    "tur_Latn": 79,
+    "twi_Latn": 40,
+    "tzm_Tfng": 196,
+    "uig_Arab": 174,
+    "ukr_Cyrl": 29,
+    "umb_Latn": 30,
+    "urd_Arab": 36,
+    "uzn_Latn": 106,
+    "vec_Latn": 56,
+    "vie_Latn": 120,
+    "war_Latn": 51,
+    "wol_Latn": 6,
+    "xho_Latn": 17,
+    "ydd_Hebr": 83,
+    "yor_Latn": 37,
+    "yue_Hant": 162,
+    "zho_Hans": 98,
+    "zho_Hant": 25,
+    "zsm_Latn": 67,
+    "zul_Latn": 116
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 6,
+  "num_hidden_layers": 4,
+  "pad_token_id": 3,
+  "position_embedding_type": "absolute",
+  "problem_type": "single_label_classification",
+  "torch_dtype": "float32",
+  "transformers_version": "4.48.3",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 50257
+}
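Review note (not part of the commit): every label in `id2label` has the form `<language>_<script>`, and `label2id` is meant to be its exact inverse. The sketch below shows one way to split labels and rebuild the inverse mapping. The helpers `split_label` and `invert_id2label` are hypothetical names, and only a handful of entries from the config are copied in.

```python
def split_label(label):
    """Split a label such as 'eng_Latn' into ('eng', 'Latn')."""
    language, script = label.split("_", 1)
    return language, script


def invert_id2label(id2label):
    """Rebuild label2id from id2label, failing loudly on duplicate labels."""
    label2id = {}
    for idx, label in id2label.items():
        if label in label2id:
            raise ValueError(f"duplicate label: {label}")
        label2id[label] = int(idx)  # config.json stores the ids as string keys
    return label2id


# A few entries copied from config.json above
id2label = {
    "8": "eng_Latn",
    "25": "zho_Hant",
    "98": "zho_Hans",
    "162": "yue_Hant",
    "200": "jpn_Jpan",
}
label2id = invert_id2label(id2label)

# Group language codes by script, as the README's per-script table does
by_script = {}
for label in id2label.values():
    language, script = split_label(label)
    by_script.setdefault(script, []).append(language)

print(label2id["eng_Latn"])  # 8
print(by_script["Hant"])     # ['zho', 'yue']
```

Grouping by script also explains the `Size` column in the README table: `Hant` has size 2 because both `zho_Hant` and `yue_Hant` are labels.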

language_detection.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e4ccd30df1196c19d4b227bd82ca4d79aca9cd0c74c9622e3ca80288ff9bb304
+size 97945176

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ec3137634f58a55ae6127d61d12d4aa05c92380852909c1160e03f82f51a8a68
+size 97838484
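Review note (not part of the commit): the reported file size can be sanity-checked against `config.json`. The sketch below counts parameters assuming the standard `BertForSequenceClassification` layout (word, position and token-type embeddings, four encoder layers, a pooler, and a linear classifier, all with biases) and 4 bytes per float32 weight. The function name is hypothetical, and the layout is an assumption about the checkpoint rather than something the diff confirms.

```python
def bert_classifier_param_count(vocab_size, hidden_size, num_layers,
                                intermediate_size, max_positions,
                                type_vocab_size, num_labels):
    """Parameter count for a standard BERT sequence classifier (assumed layout)."""

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix plus bias

    layer_norm = 2 * hidden_size     # scale and shift vectors
    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden_size + layer_norm
    attention = 4 * linear(hidden_size, hidden_size) + layer_norm  # Q, K, V, output
    feed_forward = (linear(hidden_size, intermediate_size)
                    + linear(intermediate_size, hidden_size)
                    + layer_norm)
    pooler = linear(hidden_size, hidden_size)
    classifier = linear(hidden_size, num_labels)
    return embeddings + num_layers * (attention + feed_forward) + pooler + classifier


# Values taken from config.json; num_labels is the number of id2label entries
params = bert_classifier_param_count(
    vocab_size=50257, hidden_size=384, num_layers=4, intermediate_size=768,
    max_positions=512, type_vocab_size=2, num_labels=201,
)
weight_bytes = params * 4  # float32, per "torch_dtype"
print(params)                   # 24457545
print(weight_bytes)             # 97830180
print(97838484 - weight_bytes)  # 8304 bytes not accounted for by weights
```

Under these assumptions the weights account for 97,830,180 of the 97,838,484 bytes. The remaining 8,304 bytes would be consistent with file header metadata, though that is an inference.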

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

to_onnx.py ADDED

	@@ -0,0 +1,256 @@

+import os
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+from onnxruntime.quantization import quantize_dynamic, quantize_static, QuantType
+from onnxruntime.quantization.calibrate import CalibrationDataReader
+import onnx
+import time
+import numpy as np
+def ensure_directory(path):
+    """Create directory if it doesn't exist"""
+    abs_path = os.path.abspath(path)
+    if not os.path.exists(abs_path):
+        os.makedirs(abs_path)
+        print(f"Created directory: {abs_path}")
+    return abs_path
+def verify_file_exists(file_path, timeout=5):
+    """Verify that a file exists and is not empty"""
+    start_time = time.time()
+    while time.time() - start_time < timeout:
+        if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
+            return True
+        time.sleep(0.1)
+    return False
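+# NOTE: this definition is shadowed by the export_to_onnx defined further down
+# (the one that takes max_length) and is never called by main().
+def export_to_onnx(model, tokenizer, save_path):
+    """Export model to ONNX format"""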
+    try:
+        # Create a dummy input for the model
+        dummy_input = tokenizer("This is a sample input", return_tensors="pt")
+        # Export the model to ONNX
+        torch.onnx.export(
+            model,
+            (dummy_input["input_ids"], dummy_input["attention_mask"]),
+            save_path,
+            opset_version=14,
+            input_names=["input_ids", "attention_mask"],
+            output_names=["output"],
+            dynamic_axes={
+                "input_ids": {0: "batch_size"},
+                "attention_mask": {0: "batch_size"},
+                "output": {0: "batch_size"}
+            }
+        )
+        # Verify the file was created
+        if verify_file_exists(save_path):
+            print(f"Successfully exported ONNX model to {save_path}")
+            return True
+        else:
+            print(f"Failed to verify ONNX model at {save_path}")
+            return False
+    except Exception as e:
+        print(f"Error exporting to ONNX: {str(e)}")
+        return False
+def create_calibration_dataset(tokenizer, max_length=512):
+    """Generate calibration dataset for static quantization with padding"""
+    samples = [
+        "This is an English sentence.",
+        "Dies ist ein deutscher Satz.",
+        "C'est une phrase française.",
+        "Esta es una frase en español.",
+        "这是一个中文句子。",
+        "これは日本語の文章です。"
+    ]
+    # Tokenize with padding and truncation
+    encoded_samples = []
+    for text in samples:
+        encoded = tokenizer(
+            text,
+            padding='max_length',
+            max_length=max_length,
+            truncation=True,
+            return_tensors="pt"
+        )
+        encoded_samples.append({
+            'input_ids': encoded['input_ids'],
+            'attention_mask': encoded['attention_mask']
+        })
+    return encoded_samples
+class CalibrationLoader(CalibrationDataReader):
+    def __init__(self, calibration_data):
+        self.calibration_data = calibration_data
+        self.current_index = 0
+    def get_next(self):
+        if self.current_index >= len(self.calibration_data):
+            return None
+        current_data = self.calibration_data[self.current_index]
+        self.current_index += 1
+        # Ensure we're returning numpy arrays with the correct shape
+        return {
+            'input_ids': current_data['input_ids'].numpy(),
+            'attention_mask': current_data['attention_mask'].numpy()
+        }
+    def rewind(self):
+        self.current_index = 0
+def export_to_onnx(model, tokenizer, save_path, max_length=512):
+    """Export model to ONNX format with fixed dimensions"""
+    try:
+        # Create a dummy input with fixed dimensions
+        dummy_input = tokenizer(
+            "This is a sample input",
+            padding='max_length',
+            max_length=max_length,
+            truncation=True,
+            return_tensors="pt"
+        )
+        # Export the model to ONNX
+        torch.onnx.export(
+            model,
+            (dummy_input["input_ids"], dummy_input["attention_mask"]),
+            save_path,
+            opset_version=14,
+            input_names=["input_ids", "attention_mask"],
+            output_names=["output"],
+            dynamic_axes={
+                "input_ids": {0: "batch_size"},
+                "attention_mask": {0: "batch_size"}
+            }
+        )
+        if verify_file_exists(save_path):
+            print(f"Successfully exported ONNX model to {save_path}")
+            return True
+        else:
+            print(f"Failed to verify ONNX model at {save_path}")
+            return False
+    except Exception as e:
+        print(f"Error exporting to ONNX: {str(e)}")
+        return False
+def quantize_model(base_onnx_path, onnx_dir, config_name, calibration_dataset=None):
+    """
+    Quantize ONNX model using either dynamic or static quantization.
+    Args:
+        base_onnx_path (str): Path to the base ONNX model
+        onnx_dir (str): Directory to save quantized models
+        config_name (str): Type of quantization ('dynamic' or 'static')
+        calibration_dataset (list, optional): Dataset for static quantization calibration
+    """
+    try:
+        quantized_model_path = os.path.join(onnx_dir, f"model_{config_name}_quantized.onnx")
+        if config_name == "dynamic":
+            print(f"\nPerforming dynamic quantization...")
+            quantize_dynamic(
+                model_input=base_onnx_path,
+                model_output=quantized_model_path,
+                weight_type=QuantType.QUInt8
+            )
+        elif config_name == "static" and calibration_dataset is not None:
+            print(f"\nPerforming static quantization...")
+            calibration_loader = CalibrationLoader(calibration_dataset)
+            quantize_static(
+                model_input=base_onnx_path,
+                model_output=quantized_model_path,
+                calibration_data_reader=calibration_loader,
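+                # quant_format expects a QuantFormat value, so it is left at its default;
+                # the QuantType is passed through activation_type / weight_type instead.
+                activation_type=QuantType.QUInt8,
+                weight_type=QuantType.QUInt8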
+            )
+        else:
+            print(f"Invalid quantization configuration: {config_name}")
+            return False
+        # Verify the quantized model exists
+        if verify_file_exists(quantized_model_path):
+            print(f"Successfully created {config_name} quantized model at {quantized_model_path}")
+            # Print file sizes for comparison
+            base_size = os.path.getsize(base_onnx_path) / (1024 * 1024)  # Convert to MB
+            quantized_size = os.path.getsize(quantized_model_path) / (1024 * 1024)  # Convert to MB
+            print(f"Original model size: {base_size:.2f} MB")
+            print(f"Quantized model size: {quantized_size:.2f} MB")
+            print(f"Size reduction: {((base_size - quantized_size) / base_size * 100):.2f}%")
+            return True
+        else:
+            print(f"Failed to verify quantized model at {quantized_model_path}")
+            return False
+    except Exception as e:
+        print(f"Error during {config_name} quantization: {str(e)}")
+        return False
+def main():
+    # Get absolute paths
+    current_dir = os.path.abspath(os.getcwd())
+    onnx_dir = ensure_directory(os.path.join(current_dir, "onnx"))
+    base_onnx_path = os.path.join(onnx_dir, "model.onnx")
+    print(f"Working directory: {current_dir}")
+    print(f"ONNX directory: {onnx_dir}")
+    print(f"Base ONNX model path: {base_onnx_path}")
+    # Step 1: Load model and tokenizer
+    print("\nLoading model and tokenizer...")
+    model_name = "alexneakameni/language_detection"
+    model = AutoModelForSequenceClassification.from_pretrained(model_name)
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    # Get the model's default max_length
+    max_length = tokenizer.model_max_length
+    # Step 2: Export base ONNX model
+    if not export_to_onnx(model, tokenizer, base_onnx_path, max_length):
+        print("Failed to export base ONNX model. Exiting.")
+        return
+    # Verify the ONNX model
+    try:
+        print(f"Verifying ONNX model at: {base_onnx_path}")
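+        onnx_model = onnx.load(base_onnx_path)
+        # Loading alone does not validate the graph; run the ONNX checker as well
+        onnx.checker.check_model(onnx_model)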
+        print("Successfully verified ONNX model")
+    except Exception as e:
+        print(f"Error verifying ONNX model: {str(e)}")
+        return
+    # Step 3: Create calibration dataset
+    calibration_dataset = create_calibration_dataset(tokenizer, max_length)
+    # Step 4: Create quantized versions
+    print("\nCreating quantized versions...")
+    # Dynamic quantization
+    quantize_model(
+        base_onnx_path=base_onnx_path,
+        onnx_dir=onnx_dir,
+        config_name="dynamic"
+    )
+    # Static quantization
+    quantize_model(
+        base_onnx_path=base_onnx_path,
+        onnx_dir=onnx_dir,
+        config_name="static",
+        calibration_dataset=calibration_dataset
+    )
+if __name__ == "__main__":
+    main()
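Review note (not part of the commit): the script exports and quantizes the model but does not show how to run the result. Below is a minimal sketch of inference with ONNX Runtime, assuming the input names `input_ids` / `attention_mask` and the output name `output` used in `torch.onnx.export` above. The helpers `softmax`, `top_prediction` and `run_onnx` are hypothetical. Whether the uploaded `language_detection.onnx` was produced by this exact script is not stated, so check the model's real input and output names before relying on this.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, id2label):
    """Return (label, probability) for the highest-scoring class.

    id2label uses string keys, as in config.json.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[str(best)], probs[best]


def run_onnx(model_path, input_ids, attention_mask):
    """Run one batch through the exported model and return the logits.

    Both inputs are expected to be int64 numpy arrays of shape (batch, seq_len).
    The script exports with inputs padded to max_length and marks only the batch
    axis as dynamic, so seq_len should match the length used at export time.
    """
    import onnxruntime as ort  # imported lazily so the helpers above work without it

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    (logits,) = session.run(
        ["output"],
        {"input_ids": input_ids, "attention_mask": attention_mask},
    )
    return logits


# Post-processing demo on made-up logits for the first three labels in config.json
label, score = top_prediction(
    [0.1, 2.5, -1.0],
    {"0": "lit_Latn", "1": "fon_Latn", "2": "kin_Latn"},
)
print(label, round(score, 3))
```

With a real model, `run_onnx` would be fed the tokenizer's `return_tensors="np"` output, and each row of the returned logits would then go through `top_prediction`.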

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,56 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
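Review note (not part of the commit): the tokenizer config assigns `[PAD]` the id 3, which matches `pad_token_id` in `config.json`, and sets `model_max_length` to 512. The sketch below mimics what `padding='max_length'` with `truncation=True` does to a single sequence, as used in `to_onnx.py`. It is a simplification: the real tokenizer keeps the trailing `[SEP]` when truncating, which this hypothetical `pad_or_truncate` helper does not. The token ids 57 and 912 are invented.

```python
PAD_ID = 3        # id of [PAD] in tokenizer_config.json and config.json
MAX_LENGTH = 512  # model_max_length in tokenizer_config.json


def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]  # naive truncation (simplified)
    mask = [1] * len(ids)               # 1 marks real tokens
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


# 1 = [CLS], 2 = [SEP] per the config; 57 and 912 are made-up token ids
input_ids, attention_mask = pad_or_truncate([1, 57, 912, 2])
print(len(input_ids), input_ids[:6])  # 512 [1, 57, 912, 2, 3, 3]
print(sum(attention_mask))            # 4
```

The attention mask is what lets the model ignore the padded positions, which is why the export script passes both tensors to the ONNX graph.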

vocab.txt ADDED

The diff for this file is too large to render. See raw diff