Macropodus committed (verified) · Commit 75a2d1d · 1 parent: a54e3e6

Update README.md

Files changed (1): README.md (+159 −480)

README.md CHANGED
@@ -2,81 +2,118 @@
  license: apache-2.0
  language:
  - zh
  tags:
  - csc
- - macro-correct
- - pycorrector
  - mdcspell
- - macbert4mdcspell
- - chinese-spelling-correct
  ---
- <p align="center">
- <img src="tet/images/csc_logo.png" width="480">
- </p>
-
- # [macro-correct](https://github.com/yongzhuo/macro-correct)
- [![PyPI](https://img.shields.io/pypi/v/macro-correct)](https://pypi.org/project/macro-correct/)
- [![Build Status](https://travis-ci.com/yongzhuo/macro-correct.svg?branch=master)](https://travis-ci.com/yongzhuo/macro-correct)
- [![PyPI_downloads](https://img.shields.io/pypi/dm/macro-correct)](https://pypi.org/project/macro-correct/)
- [![Stars](https://img.shields.io/github/stars/yongzhuo/macro-correct?style=social)](https://github.com/yongzhuo/macro-correct/stargazers)
- [![Forks](https://img.shields.io/github/forks/yongzhuo/macro-correct.svg?style=social)](https://github.com/yongzhuo/macro-correct/network/members)
- [![Join the chat at https://gitter.im/yongzhuo/macro-correct](https://badges.gitter.im/yongzhuo/macro-correct.svg)](https://gitter.im/yongzhuo/macro-correct?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
- >>> macro-correct is a text-correction toolkit that supports Chinese spelling correction and Chinese punctuation correction (CSC, Chinese Spelling Correct/Check). CSC covers data from many domains (including classical Chinese); the models are trained on large-scale, multi-domain, modern and contemporary corpora and generalize well.
-
- >>> macro-correct is a minimalist NLP toolkit for Chinese text correction (CSC for spelling, Punct for punctuation) that depends only on pytorch, transformers, numpy, and opencc.
- Its confusion sets are built from most of the publicly available open-source datasets, and 10M+ training samples are generated from the People's Daily and xuexi.cn corpora;
- it supports classic models such as MDCSpell, MacBERT, ReLM, SoftBERT, and BertCRF;
- it supports Chinese spelling correction, Chinese punctuation correction, Chinese grammar correction (planned), and standalone detection/recognition models (planned);
- it features light dependencies, concise code, detailed comments, clear debugging, flexible configuration, and easy extension for NLP tasks.
-

  ## Contents
- * [Install](#安装)
- * [Usage](#调用)
- * [Demo](#体验)
- * [Dictionary](#词典)
- * [Details](#详情)
- * [Training](#训练)
- * [Evaluation](#测评)
- * [Changelog](#日志)
- * [References](#参考)
- * [Papers](#论文)
- * [Cite](#Cite)
-

- # Install
- ```bash
- pip install macro-correct
-
- # Tsinghua mirror
- pip install -i https://pypi.tuna.tsinghua.edu.cn/simple macro-correct
-
- # If that fails, install without dependencies and add any missing packages afterwards
- pip install -i https://pypi.tuna.tsinghua.edu.cn/simple macro-correct --no-dependencies
  ```

- # Usage
- More sample code lives in the /tet directory.
- - Usage examples are in /tet/tet: tet_csc_token_zh.py for Chinese spelling correction, tet_csc_punct_zh.py for Chinese punctuation correction; CSC can also be run directly with tet_csc_flag_transformers.py.
- - Training code is in /tet/train, where the local pretrained-model path and other parameters can be configured;
-
- # Demo
- [HF---Space---Macropodus/macbert4csc_v2](https://huggingface.co/spaces/Macropodus/macbert4csc_v2)

- <img src="tet/images/csc_demo.png" width="1024">

- ## 2. Usage - text correction
- ### 2.1 CSC with macro-bert
- ```python
- # !/usr/bin/python
- # -*- coding: utf-8 -*-
- # @time    : 2021/2/29 21:41
- # @author  : Mo
- # @function: text correction with macro-correct

  import os
  os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
  from macro_correct import correct
@@ -101,8 +138,8 @@ print("#" * 128)
  """
  ```

- ### 2.2 CSC with transformers
- ```bash
  # !/usr/bin/python
  # -*- coding: utf-8 -*-
  # @time    : 2021/2/29 21:41
@@ -177,424 +214,67 @@ print(result)
  """
  ```

- ## 3. Usage - punctuation correction
- ```python
- import os
- os.environ["MACRO_CORRECT_FLAG_CSC_PUNCT"] = "1"
- from macro_correct import correct_punct
-
-
- ### 1. default punctuation correction (list input)
- text_list = ["山不在高有仙则名。",
-     "水不在深,有龙则灵",
-     "斯是陋室惟吾德馨",
-     "苔痕上阶绿草,色入帘青。"
-     ]
- text_csc = correct_punct(text_list)
- print("默认标点纠错(list输入):")
- for res_i in text_csc:
-     print(res_i)
- print("#" * 128)
-
- """
- 默认标点纠错(list输入):
- {'index': 0, 'source': '山不在高有仙则名。', 'target': '山不在高,有仙则名。', 'score': 0.9917, 'errors': [['', ',', 4, 0.9917]]}
- {'index': 1, 'source': '水不在深,有龙则灵', 'target': '水不在深,有龙则灵。', 'score': 0.9995, 'errors': [['', '。', 9, 0.9995]]}
- {'index': 2, 'source': '斯是陋室惟吾德馨', 'target': '斯是陋室,惟吾德馨。', 'score': 0.9999, 'errors': [['', ',', 4, 0.9999], ['', '。', 8, 0.9998]]}
- {'index': 3, 'source': '苔痕上阶绿草,色入帘青。', 'target': '苔痕上阶绿,草色入帘青。', 'score': 0.9998, 'errors': [['', ',', 5, 0.9998]]}
- """
- ```
-
- # Dictionary
- ## Default confusion-dictionary path
- * macro_correct/output/confusion_dict.json
- ## Working with the confusion dictionary
- ```python
- ## custom confusion dictionary
- # !/usr/bin/python
- # -*- coding: utf-8 -*-
- # @time    : 2021/2/29 21:41
- # @author  : Mo
- # @function: tet csc of token confusion dict, confusion dictionary
-
-
- import os
- os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
-
- from macro_correct.pytorch_textcorrection.tcTrie import ConfusionCorrect
- from macro_correct import MODEL_CSC_TOKEN
- from macro_correct import correct
-
-
- ### correction with the default confusion dictionary
- user_dict = {
-     "乐而往返": "乐而忘返",
-     "金钢钻": "金刚钻",
-     "藤罗蔓": "藤萝蔓",
- }
- text_list = [
-     "为什么乐而往返?",
-     "没有金钢钻就不揽瓷活!",
-     "你喜欢藤罗蔓吗?",
-     "三周年祭日在哪举行?"
- ]
- text_csc = correct(text_list, flag_confusion=False)
- print("默认纠错(不带混淆词典):")
- for res_i in text_csc:
-     print(res_i)
- print("#" * 128)
-
-
- text_csc = correct(text_list, flag_confusion=True)
- print("默认纠错(-带混淆词典-默认):")
- for res_i in text_csc:
-     print(res_i)
- print("#" * 128)
-
-
- # ---confusion dictionary---
- ### add only: register the user dict (the default confusion dictionary stays active)
- MODEL_CSC_TOKEN.model_csc.model_confusion = ConfusionCorrect(user_dict=user_dict)
- text_csc = correct(text_list, flag_confusion=True)
- print("默认纠错(-带混淆词典-新增):")
- for res_i in text_csc:
-     print(res_i)
- print("#" * 128)
- ### full override: use only the user dict (the default confusion dictionary is discarded)
- MODEL_CSC_TOKEN.model_csc.model_confusion = ConfusionCorrect(confusion_dict=user_dict)
- text_csc = correct(text_list, flag_confusion=True)
- print("默认纠错(-带混淆词典-全覆盖):")
- for res_i in text_csc:
-     print(res_i)
- print("#" * 128)
-
-
- # ---confusion dictionary from a file---
- ### add only: register a user dict file (the default confusion dictionary stays active); path just needs to be non-empty; json file of {wrong_word: correct_word} pairs; see macro-correct/tet/tet/tet_csc_token_confusion.py
- path_user = "./user_confusion_dict.json"
- MODEL_CSC_TOKEN.model_csc.model_confusion = ConfusionCorrect(path="1", path_user=path_user)
- text_csc = correct(text_list, flag_confusion=True)
- print("默认纠错(-带混淆词典文件-新增):")
- for res_i in text_csc:
-     print(res_i)
- print("#" * 128)
- ### full override: use only the user dict file (the default confusion dictionary is discarded); path must be an empty string
- MODEL_CSC_TOKEN.model_csc.model_confusion = ConfusionCorrect(path="", path_user=path_user)
- text_csc = correct(text_list, flag_confusion=True)
- print("默认纠错(-带混淆词典文件-全覆盖):")
- for res_i in text_csc:
-     print(res_i)
- print("#" * 128)
-
- """
- 默认纠错(不带混淆词典):
- {'index': 0, 'source': '为什么乐而往返?', 'target': '为什么乐而往返?', 'errors': []}
- {'index': 1, 'source': '没有金钢钻就不揽瓷活!', 'target': '没有金刚钻就不揽瓷活!', 'errors': [['钢', '刚', 3, 0.6587]]}
- {'index': 2, 'source': '你喜欢藤罗蔓吗?', 'target': '你喜欢藤萝蔓吗?', 'errors': [['罗', '萝', 4, 0.8582]]}
- {'index': 3, 'source': '三周年祭日在哪举行?', 'target': '三周年祭日在哪举行?', 'errors': []}
- ################################################################################################################################
- 默认纠错(-带混淆词典-默认):
- {'index': 0, 'source': '为什么乐而往返?', 'target': '为什么乐而往返?', 'errors': []}
- {'index': 1, 'source': '没有金钢钻就不揽瓷活!', 'target': '没有金刚钻就不揽瓷活!', 'errors': [['钢', '刚', 3, 1.0]]}
- {'index': 2, 'source': '你喜欢藤罗蔓吗?', 'target': '你喜欢藤萝蔓吗?', 'errors': [['罗', '萝', 4, 0.8582]]}
- {'index': 3, 'source': '三周年祭日在哪举行?', 'target': '三周年忌日在哪举行?', 'errors': [['祭', '忌', 3, 1.0]]}
- ################################################################################################################################
- 默认纠错(-带混淆词典-新增):
- {'index': 0, 'source': '为什么乐而往返?', 'target': '为什么乐而忘返?', 'errors': [['往', '忘', 5, 1.0]]}
- {'index': 1, 'source': '没有金钢钻就不揽瓷活!', 'target': '没有金刚钻就不揽瓷活!', 'errors': [['钢', '刚', 3, 1.0]]}
- {'index': 2, 'source': '你喜欢藤罗蔓吗?', 'target': '你喜欢藤萝蔓吗?', 'errors': [['罗', '萝', 4, 1.0]]}
- {'index': 3, 'source': '三周年祭日在哪举行?', 'target': '三周年忌日在哪举行?', 'errors': [['祭', '忌', 3, 1.0]]}
- ################################################################################################################################
- 默认纠错(-带混淆词典-全覆盖):
- {'index': 0, 'source': '为什么乐而往返?', 'target': '为什么乐而忘返?', 'errors': [['往', '忘', 5, 1.0]]}
- {'index': 1, 'source': '没有金钢钻就不揽瓷活!', 'target': '没有金刚钻就不揽瓷活!', 'errors': [['钢', '刚', 3, 1.0]]}
- {'index': 2, 'source': '你喜欢藤罗蔓吗?', 'target': '你喜欢藤萝蔓吗?', 'errors': [['罗', '萝', 4, 1.0]]}
- {'index': 3, 'source': '三周年祭日在哪举行?', 'target': '三周年祭日在哪举行?', 'errors': []}
- ################################################################################################################################
- 默认纠错(-带混淆词典文件-新增):
- {'index': 0, 'source': '为什么乐而往返?', 'target': '为什么乐而忘返?', 'errors': [['往', '忘', 5, 1.0]]}
- {'index': 1, 'source': '没有金钢钻就不揽瓷活!', 'target': '没有金刚钻就不揽瓷活!', 'errors': [['钢', '刚', 3, 1.0]]}
- {'index': 2, 'source': '你喜欢藤罗蔓吗?', 'target': '你喜欢藤萝蔓吗?', 'errors': [['罗', '萝', 4, 1.0]]}
- {'index': 3, 'source': '三周年祭日在哪举行?', 'target': '三周年忌日在哪举行?', 'errors': [['祭', '忌', 3, 1.0]]}
- ################################################################################################################################
- 默认纠错(-带混淆词典文件-全覆盖):
- {'index': 0, 'source': '为什么乐而往返?', 'target': '为什么乐而忘返?', 'errors': [['往', '忘', 5, 1.0]]}
- {'index': 1, 'source': '没有金钢钻就不揽瓷活!', 'target': '没有金刚钻就不揽瓷活!', 'errors': [['钢', '刚', 3, 1.0]]}
- {'index': 2, 'source': '你喜欢藤罗蔓吗?', 'target': '你喜欢藤萝蔓吗?', 'errors': [['罗', '萝', 4, 1.0]]}
- {'index': 3, 'source': '三周年祭日在哪举行?', 'target': '三周年祭日在哪举行?', 'errors': []}
- ################################################################################################################################
- """
- ```
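The snippet above points `path_user` at a JSON file of `{wrong_word: correct_word}` pairs. As a minimal sketch of that file format (the pairs reuse the illustrative `user_dict` from the sample; the filename is the one the sample assumes), writing and reading it with the standard library looks like:

```python
import json

# {wrong_word: correct_word} pairs, the layout ConfusionCorrect(path_user=...) expects
user_dict = {
    "乐而往返": "乐而忘返",
    "金钢钻": "金刚钻",
    "藤罗蔓": "藤萝蔓",
}
path_user = "./user_confusion_dict.json"

# ensure_ascii=False keeps the Chinese characters readable in the file
with open(path_user, "w", encoding="utf-8") as f:
    json.dump(user_dict, f, ensure_ascii=False, indent=2)

# reading it back yields the original mapping
with open(path_user, "r", encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded["金钢钻"])
```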
-
-
- # Details
- ## CSC call (hyperparameters)
- ```python
- import os
- os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
- from macro_correct import correct
- ### default correction (list input)
- text_list = ["真麻烦你了。希望你们好好的跳无",
-     "少先队员因该为老人让坐",
-     "机七学习是人工智能领遇最能体现智能的一个分知",
-     "一只小鱼船浮在平净的河面上"
-     ]
- ### default correction (list input, with configuration)
- params = {
-     "threshold": 0.55,       # token-level probability threshold
-     "batch_size": 32,        # batch size
-     "max_len": 128,          # maximum length; truncated parts are excluded from correction and appended back unchanged afterwards
-     "rounded": 4,            # round probabilities to 4 decimal places
-     "flag_confusion": True,  # whether to use the default confusion dictionary
-     "flag_prob": True,       # whether to return the probability at each corrected token
- }
- text_csc = correct(text_list, **params)
- print("默认纠错(list输入, 参数配置):")
- for res_i in text_csc:
-     print(res_i)
- print("#" * 128)
-
-
- """
- 默认纠错(list输入):
- {'index': 0, 'source': '真麻烦你了。希望你们好好的跳无', 'target': '真麻烦你了。希望你们好好地跳舞', 'errors': [['的', '地', 12, 0.6584], ['无', '舞', 14, 1.0]]}
- {'index': 1, 'source': '少先队员因该为老人让坐', 'target': '少先队员应该为老人让坐', 'errors': [['因', '应', 4, 0.995]]}
- {'index': 2, 'source': '机七学习是人工智能领遇最能体现智能的一个分知', 'target': '机器学习是人工智能领域最能体现智能的一个分支', 'errors': [['七', '器', 1, 0.9998], ['遇', '域', 10, 0.9999], ['知', '支', 21, 1.0]]}
- {'index': 3, 'source': '一只小鱼船浮在平净的河面上', 'target': '一只小鱼船浮在平静的河面上', 'errors': [['净', '静', 8, 0.9961]]}
- """
- ```
- ## PUNCT call (hyperparameters)
- ```python
- import os
- os.environ["MACRO_CORRECT_FLAG_CSC_PUNCT"] = "1"
- from macro_correct import correct_punct
-
-
- ### 1. default punctuation correction (list input)
- text_list = ["山不在高有仙则名。",
-     "水不在深,有龙则灵",
-     "斯是陋室惟吾德馨",
-     "苔痕上阶绿草,色入帘青。"
-     ]
- ### 2. default punctuation correction (list input, full parameter configuration)
- params = {
-     "limit_num_errors": 4,   # maximum corrections per sentence; sentences with more are dropped
-     "limit_len_char": 4,     # minimum number of characters per sentence
-     "threshold_zh": 0.5,     # sentence threshold: minimum share of Chinese characters
-     "threshold": 0.55,       # token-level probability threshold
-     "batch_size": 32,        # batch size
-     "max_len": 128,          # maximum length; truncated parts are excluded from correction and appended back unchanged afterwards
-     "rounded": 4,            # round probabilities to 4 decimal places
-     "flag_prob": True,       # whether to return the probability at each corrected token
- }
- text_csc = correct_punct(text_list, **params)
- print("默认标点纠错(list输入):")
- for res_i in text_csc:
-     print(res_i)
- print("#" * 128)
-
- """
- 默认标点纠错(list输入):
- {'index': 0, 'source': '山不在高有仙则名。', 'target': '山不在高,有仙则名。', 'score': 0.9917, 'errors': [['', ',', 4, 0.9917]]}
- {'index': 1, 'source': '水不在深,有龙则灵', 'target': '水不在深,有龙则灵。', 'score': 0.9995, 'errors': [['', '。', 9, 0.9995]]}
- {'index': 2, 'source': '斯是陋室惟吾德馨', 'target': '斯是陋室,惟吾德馨。', 'score': 0.9999, 'errors': [['', ',', 4, 0.9999], ['', '。', 8, 0.9998]]}
- {'index': 3, 'source': '苔痕上阶绿草,色入帘青。', 'target': '苔痕上阶绿,草色入帘青。', 'score': 0.9998, 'errors': [['', ',', 5, 0.9998]]}
- """
- ```
-
- # Training
- ## CSC task
- ### Paths
- * macbert4mdcspell: macro_correct/pytorch_user_models/csc/macbert4mdcspell/train_yield.py
- * macbert4csc: macro_correct/pytorch_user_models/csc/macbert4csc/train_yield.py
- * relm: macro_correct/pytorch_user_models/csc/relm/train_yield.py
- ### Data preparation
- * espell: a json file holding a list<dict>; only "original_text" and "correct_text" are required, see macro_correct/corpus/text_correction/espell
- ```
- [
-     {
-         "original_text": "遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。",
-         "correct_text": "遇到逆境时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。",
-     }
- ]
- ```
- * sighan: a json file holding a list<dict>; only "source" and "target" are required, see macro_correct/corpus/text_correction/sighan
- ```
- [
-     {
-         "source": "若被告人正在劳动教养,则可以通过劳动教养单位转交",
-         "target": "若被告人正在劳动教养,则可以通过劳动教养单位转交",
-     }
- ]
- ```
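The two training-data formats above differ only in their field names. A small helper (hypothetical, not part of macro-correct) can map espell-style records onto sighan-style records:

```python
# Map espell records ("original_text"/"correct_text") onto
# sighan records ("source"/"target"); the content is unchanged.
def espell_to_sighan(records):
    return [
        {"source": r["original_text"], "target": r["correct_text"]}
        for r in records
    ]

# the espell sample record from the README
espell = [{
    "original_text": "遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。",
    "correct_text": "遇到逆境时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。",
}]
sighan = espell_to_sighan(espell)
print(sighan[0]["target"])
```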
- ### Configure - train - evaluate - predict
- #### Configure
- Set the data paths and hyperparameters; see macro_correct/pytorch_user_models/csc/macbert4mdcspell/config.py
- #### Train - evaluate - predict
- ```
- # train
- nohup python train_yield.py > tc.train_yield.py.log 2>&1 &
- tail -n 1000 -f tc.train_yield.py.log
- # evaluate
- python eval_std.py
- # predict
- python predict.py
- ```
-
- ## PUNCT task
- ### Paths
- * PUNCT: macro_correct/pytorch_sequencelabeling/slRun.py
- ### Data preparation
- * SPAN format: an NER task, using span format (jsonl) by default; see the chinese_symbol.dev.span file under macro_correct/corpus/sequence_labeling/chinese_symbol
- ```
- {'label': [{'type': '0', 'ent': '下', 'pos': [7, 7]}, {'type': '1', 'ent': '林', 'pos': [14, 14]}], 'text': '#桂林山水甲天下阳朔山水甲桂林'}
- {'label': [{'type': '11', 'ent': 'o', 'pos': [5, 5]}, {'type': '0', 'ent': 't', 'pos': [12, 12]}, {'type': '1', 'ent': '包', 'pos': [19, 19]}], 'text': '#macrocorrect文本纠错工具包'}
- ```
- * CONLL format: generate the SPAN format first, then convert it with macro_correct/tet/corpus/pos_to_conll.py
- ```
- 神 O
- 秘 O
- 宝 O
- 藏 B-1
- 在 O
- 旅 O
- 途 O
- 中 B-0
- 他 O
- ```
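The SPAN-to-CoNLL conversion above is done by pos_to_conll.py; as an illustration only (a simplified stand-in, not the project's converter), the mapping from a span record to "char tag" lines can be sketched like this, assuming each labeled span starts at `pos[0]` and untagged characters get `O`:

```python
# Convert one span-format record into CoNLL-style "char tag" lines.
# Simplified sketch: only the span start gets a B-<type> tag, as in the
# chinese_symbol example above; everything else is tagged O.
def span_to_conll(record):
    tags = ["O"] * len(record["text"])
    for ent in record["label"]:
        start = ent["pos"][0]
        tags[start] = "B-" + ent["type"]
    return [f"{ch} {tag}" for ch, tag in zip(record["text"], tags)]

# the first span-format sample record from the README
record = {
    "label": [{"type": "0", "ent": "下", "pos": [7, 7]},
              {"type": "1", "ent": "林", "pos": [14, 14]}],
    "text": "#桂林山水甲天下阳朔山水甲桂林",
}
for line in span_to_conll(record):
    print(line)
```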
- ### Configure - train - evaluate - predict
- #### Configure
- Set the data paths and hyperparameters; see macro_correct/pytorch_user_models/csc/macbert4mdcspell/config.py
- #### Train - evaluate - predict
- ```
- # train
- nohup python train_yield.py > tc.train_yield.py.log 2>&1 &
- tail -n 1000 -f tc.train_yield.py.log
- # evaluate
- python eval_std.py
- # predict
- python predict.py
- ```
-
-
- # Evaluation
- ## Notes
- * All training data comes from the public web or open-source datasets; there are roughly 10 million training samples and a fairly large confusion dictionary;
- * All test data comes from the public web or open-source datasets; the evaluation data is at [Macropodus/csc_eval_public](https://huggingface.co/datasets/Macropodus/csc_eval_public);
- * The main evaluation code is [tcEval.py](https://github.com/yongzhuo/macro-correct/macro_correct/pytorch_textcorrection/tcEval.py); the evaluation code for [qwen25_1-5b_pycorrector]() is in the [eval](https://github.com/yongzhuo/macro-correct/tet/eval) directory
- * Evaluation criteria: over-correction rate (erroneous "corrections" of high-quality correct sentences); sentence-level loose accuracy/precision/recall/F1 (as in [shibing624/pycorrector](https://github.com/shibing624/pycorrector)); sentence-level strict accuracy/precision/recall/F1 (as in [wangwang110/CSC](https://github.com/wangwang110/CSC)); character-level accuracy/precision/recall/F1 (misspelled characters);
- * qwen25_1-5b_pycorrector weights: [shibing624/chinese-text-correction-1.5b](https://huggingface.co/shibing624/chinese-text-correction-1.5b)
- * macbert4csc_pycorrector weights: [shibing624/macbert4csc-base-chinese](https://huggingface.co/shibing624/macbert4csc-base-chinese);
- * macbert4mdcspell_v1 weights: [Macropodus/macbert4mdcspell_v1](https://huggingface.co/Macropodus/macbert4mdcspell_v1);
- * macbert4mdcspell_v2 weights: [Macropodus/macbert4mdcspell_v2](https://huggingface.co/Macropodus/macbert4mdcspell_v2);
- * macbert4csc_v2 weights: [Macropodus/macbert4csc_v2](https://huggingface.co/Macropodus/macbert4csc_v2);
- * macbert4csc_v1 weights: [Macropodus/macbert4csc_v1](https://huggingface.co/Macropodus/macbert4csc_v1);
- * bert4csc_v1 weights: [Macropodus/bert4csc_v1](https://huggingface.co/Macropodus/bert4csc_v1);
-
- ## 3.1 Evaluation data
- ```
- 1.gen_de3.json (5545): "的/地/得" correction, generated manually from high-quality data such as People's Daily, xuexi.cn, and chinese-poetry;
- 2.lemon_v2.tet.json (1053): data from the ReLM paper, a multi-domain spelling-correction dataset covering 7 domains: game (GAM), encyclopedia (ENC), contract (COT), medical care (MEC), car (CAR), novel (NOV), and news (NEW);
- 3.acc_rmrb.tet.json (4636): from NER-199801 (high-quality People's Daily corpus);
- 4.acc_xxqg.tet.json (5000): high-quality corpus from the xuexi.cn website;
- 5.gen_passage.tet.json (10000): source texts are well-formed sentences generated by qwen, corrupted with a confusion dictionary aggregated from nearly all open-source data;
- 6.textproof.tet.json (1447): NLP competition data, TextProofreadingCompetition;
- 7.gen_xxqg.tet.json (5000): source texts are high-quality xuexi.cn corpus, corrupted with a confusion dictionary aggregated from nearly all open-source data;
- 8.faspell.dev.json (1000): video subtitles obtained through OCR; from iQIYI's FASPell paper;
- 9.lomo_tet.json (5000): mainly phonetically similar spelling errors; from Tencent; the manually annotated CSCD-NS dataset;
- 10.mcsc_tet.5000.json (5000): medical spelling correction, from real historical logs of the Tencent Medical Encyclopedia app; note the paper says this dataset only targets corrections of medical entities, not common characters;
- 11.ecspell.dev.json (1500): from the ECSpell paper, covering three domains (law/med/gov);
- 12.sighan2013.dev.json (1000): from the SIGHAN-13 shared task;
- 13.sighan2014.dev.json (1062): from the SIGHAN-14 shared task;
- 14.sighan2015.dev.json (1100): from the SIGHAN-15 shared task;
- ```
-
- ## 3.2 Further notes
- ```
- 1.Preprocessing: all evaluation data went through full-width-to-half-width conversion, traditional-to-simplified conversion, and punctuation normalization;
- 2.Metrics labeled "common" are very loose, matching the evaluation of the open-source pycorrector project;
- 3.Metrics labeled "strict" are very strict, matching the open-source project [wangwang110/CSC](https://github.com/wangwang110/CSC);
- 4.The macbert4mdcspell_v1/v2 models are trained with the MDCSpell architecture plus BERT's MLM loss, but inference uses only bert-mlm;
- 5.The acc_rmrb/acc_xxqg datasets contain no errors and are used to measure the models' over-correction rate;
- 6.qwen25_1-5b_pycorrector is shibing624/chinese-text-correction-1.5b, whose training data includes the dev and test sets of lemon_v2/mcsc_tet/ecspell; the other BERT-style models were not trained on any dev or test sets;
- ```
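To make the sentence-level "common" criterion concrete, here is a simplified sketch of how such metrics can be computed (my reading of the loose criterion, not the actual tcEval.py code: a sentence counts as a true positive when the model output exactly matches the target of an erroneous sentence, and as a false positive when a correct sentence is changed):

```python
# Sentence-level correction metrics in the spirit of the loose "common"
# criterion: a sentence is "positive" if source != target; a prediction
# counts as correct only if it equals the target exactly.
def sentence_metrics(sources, targets, predictions):
    tp = fp = fn = acc = 0
    for src, tgt, pred in zip(sources, targets, predictions):
        if pred == tgt:
            acc += 1
        if src != tgt:          # sentence actually contains errors
            if pred == tgt:
                tp += 1         # fully corrected
            else:
                fn += 1         # missed or wrongly corrected
        elif pred != src:
            fp += 1             # over-correction of an already-correct sentence
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"acc": acc / len(sources), "precision": precision,
            "recall": recall, "f1": f1}

m = sentence_metrics(
    sources=["少先队员因该为老人让坐", "山不在高"],
    targets=["少先队员应该为老人让坐", "山不在高"],
    predictions=["少先队员应该为老人让坐", "山不在高"],
)
print(m)
```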
-
- ## 3.3 Results
- ### 3.3.1 F1 (common_cor_f1)
- | model/common_cor_f1 | avg| gen_de3| lemon_v2| gen_passage| text_proof| gen_xxqg| faspell| lomo_tet| mcsc_tet| ecspell| sighan2013| sighan2014| sighan2015 |
- |:------------------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|
- | macbert4csc_pycorrector | 45.8| 42.44| 42.89| 31.49| 46.31| 26.06| 32.7| 44.83| 27.93| 55.51| 70.89| 61.72| 66.81 |
- | qwen25_1-5b_pycorrector | 45.11| 27.29| 89.48| 14.61| 83.9| 13.84| 18.2| 36.71| 96.29| 88.2| 36.41| 15.64| 20.73 |
- | bert4csc_v1 | 62.28| 93.73| 61.99| 44.79| 68.0| 35.03| 48.28| 61.8| 64.41| 79.11| 77.66| 51.01| 61.54 |
- | macbert4csc_v1 | 68.55| 96.67| 65.63| 48.4| 75.65| 38.43| 51.76| 70.11| 80.63| 85.55| 81.38| 57.63| 70.7 |
- | macbert4csc_v2 | 68.6| 96.74| 66.02| 48.26| 75.78| 38.84| 51.91| 70.17| 80.71| 85.61| 80.97| 58.22| 69.95 |
- | macbert4mdcspell_v1 | 71.1| 96.42| 70.06| 52.55| 79.61| 43.37| 53.85| 70.9| 82.38| 87.46| 84.2| 61.08| 71.32 |
- | macbert4mdcspell_v2 | 71.23| 96.42| 65.8| 52.35| 75.94| 43.5| 53.82| 72.66| 82.28| 88.69| 82.51| 65.59| 75.26 |
-
- ### 3.3.2 acc (common_cor_acc)
- | model/common_cor_acc| avg| gen_de3| lemon_v2| gen_passage| text_proof| gen_xxqg| faspell| lomo_tet| mcsc_tet| ecspell| sighan2013| sighan2014| sighan2015 |
- |:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|
- | macbert4csc_pycorrector| 48.26| 26.96| 28.68| 34.16| 55.29| 28.38| 22.2| 60.96| 57.16| 67.73| 55.9| 68.93| 72.73 |
- | qwen25_1-5b_pycorrector| 46.09| 15.82| 81.29| 22.96| 82.17| 19.04| 12.8| 50.2| 96.4| 89.13| 22.8| 27.87| 32.55 |
- | bert4csc_v1| 60.76| 88.21| 45.96| 43.13| 68.97| 35.0| 34.0| 65.86| 73.26| 81.8| 64.5| 61.11| 67.27 |
- | macbert4csc_v1| 65.34| 93.56| 49.76| 44.98| 74.64| 36.1| 37.0| 73.0| 83.6| 86.87| 69.2| 62.62| 72.73 |
- | macbert4csc_v2| 65.22| 93.69| 50.14| 44.92| 74.64| 36.26| 37.0| 72.72| 83.66| 86.93| 68.5| 62.43| 71.73 |
- | macbert4mdcspell_v1| 67.15| 93.09| 54.8| 47.71| 78.09| 39.52| 38.8| 71.92| 84.78| 88.27| 73.2| 63.28| 72.36 |
- | macbert4mdcspell_v2 | 68.31| 93.09| 50.05| 48.72| 75.74| 40.52| 38.9| 76.9| 84.8| 89.73| 71.0| 71.94| 78.36 |
-
- ### 3.3.3 acc (acc_true, thr=0.75)
- | model/acc | avg| acc_rmrb| acc_xxqg |
- |:------------------------|:-----------------|:-----------------|:-----------------|
- | macbert4csc_pycorrector | 99.24| 99.22| 99.26 |
- | qwen25_1-5b_pycorrector | 82.0| 77.14| 86.86 |
- | bert4csc_v1 | 98.71| 98.36| 99.06 |
- | macbert4csc_v1 | 97.72| 96.72| 98.72 |
- | macbert4csc_v2 | 97.89| 96.98| 98.8 |
- | macbert4mdcspell_v1 | 97.75| 96.51| 98.98 |
- | macbert4mdcspell_v2 | 99.54| 99.22| 99.86 |
-
-
- ### 3.3.4 Conclusion
- ```
- 1.Models such as macbert4csc_v1/macbert4csc_v2/macbert4mdcspell_v1 are trained on data from many domains and are well balanced; they also work as a first-stage pretrained model for further fine-tuning on domain-specific data;
- 2.Comparing macbert4csc_pycorrector/bertbase4csc_v1/macbert4csc_v2/macbert4mdcspell_v1 against table 2.3 shows that more training data raises accuracy but also slightly raises the over-correction rate;
- 3.MFT (Mask-Correct) still helps, though the gain is small when training data is plentiful; it is probably also a major cause of the higher over-correction rate;
- 4.The training data includes classical Chinese, so the trained models also support classical-Chinese correction;
- 5.The trained models have high detection and correction rates for frequent errors such as "地/得/的";
- 6.macbert4mdcspell_v2's MFT applies no-error masking (rate 0.15) only 70% of the time, target-to-target 15% of the time, and no masking for the remaining 15%;
- ```
-
-
- # Changelog
- ```
- 1. v20240129, csc_punct module completed;
- 2. v20241001, csc_token module completed;
- 3. v20250117, csc_eval module completed;
- 4. v20250501, macbert4mdcspell_v2 completed
- ```
-
-
- # References
- This library is inspired by and references the following frameworks and papers.
-
- * Chinese-text-correction-papers: [nghuyong/Chinese-text-correction-papers](https://github.com/nghuyong/Chinese-text-correction-papers)
- * pycorrector: [shibing624/pycorrector](https://github.com/shibing624/pycorrector)
- * CTCResources: [destwang/CTCResources](https://github.com/destwang/CTCResources)
- * CSC: [wangwang110/CSC](https://github.com/wangwang110/CSC)
- * char-similar: [yongzhuo/char-similar](https://github.com/yongzhuo/char-similar)
- * MDCSpell: [iioSnail/MDCSpell_pytorch](https://github.com/iioSnail/MDCSpell_pytorch)
- * CSCD-NS: [nghuyong/cscd-ns](https://github.com/nghuyong/cscd-ns)
- * lemon: [gingasan/lemon](https://github.com/gingasan/lemon)
- * ReLM: [Claude-Liu/ReLM](https://github.com/Claude-Liu/ReLM)
-
-
- # Papers
- ## Chinese Spelling Correction (CSC)
- * 34 papers are collected, together with a short survey; see [README.csc_survey.md](https://github.com/yongzhuo/macro-correct/blob/master/README.csc_survey.md)
-
-
- # Cite
  For citing this work, you can refer to the present GitHub project. For example, with BibTeX:
  ```
  @software{macro-correct,
@@ -602,5 +282,4 @@ For citing this work, you can refer to the present GitHub project. For example,
  author = {Yongzhuo Mo},
  title = {macro-correct},
  year = {2025}
-
  ```
 
  license: apache-2.0
  language:
  - zh
+ base_model:
+ - hfl/chinese-macbert-base
+ pipeline_tag: text-generation
  tags:
  - csc
+ - text-correct
+ - chinese-spelling-correct
+ - chinese-spelling-check
+ - 中文拼写纠错
+ - 文本纠错
  - mdcspell
+ - macro-correct
  ---
+ # macbert4mdcspell
+ ## Overview (macbert4mdcspell)
+ - macro-correct: Chinese spelling-correction (CSC) evaluation and usage of these text-correction weights
+ - project home: [https://github.com/yongzhuo/macro-correct](https://github.com/yongzhuo/macro-correct)
+ - these weights are macbert4mdcspell_v2, built on the MDCSpell architecture, whose distinguishing feature is the interaction between det_label and cor_label;
+ - training additionally uses MacBERT's mlm-loss, while inference drops the layers after MacBERT;
+ - how to use: 1. call it via transformers; 2. call it via the [macro-correct](https://github.com/yongzhuo/macro-correct) project; see ***三、调用(Usage)*** for details;
+ - to reduce over-correction, macbert4mdcspell_v2's MFT applies no-error masking (rate 0.15) only 70% of the time, target-to-target 15% of the time, and no masking for the remaining 15%;
 
 
 
 
 
 
 
 
 
 
 
  ## Contents
+ * [一、测评(Test)](#一、测评(Test))
+ * [二、结论(Conclusion)](#二、结论(Conclusion))
+ * [三、调用(Usage)](#三、调用(Usage))
+ * [四、论文(Paper)](#四、论文(Paper))
+ * [五、参考(Refer)](#五、参考(Refer))
+ * [六、引用(Cite)](#六、引用(Cite))
 
 
 
 
 
 
+ ## 一、Evaluation (Test)
+ ### 1.1 Evaluation data sources
+ The evaluation data is at [Macropodus/csc_eval_public](https://huggingface.co/datasets/Macropodus/csc_eval_public); all training data comes from the public web or open-source datasets, with roughly 10 million training samples and a fairly large confusion dictionary;
+ ```
+ 1.gen_de3.json (5545): "的/地/得" correction, generated manually from high-quality data such as People's Daily, xuexi.cn, and chinese-poetry;
+ 2.lemon_v2.tet.json (1053): data from the ReLM paper, a multi-domain spelling-correction dataset covering 7 domains: game (GAM), encyclopedia (ENC), contract (COT), medical care (MEC), car (CAR), novel (NOV), and news (NEW);
+ 3.acc_rmrb.tet.json (4636): from NER-199801 (high-quality People's Daily corpus);
+ 4.acc_xxqg.tet.json (5000): high-quality corpus from the xuexi.cn website;
+ 5.gen_passage.tet.json (10000): source texts are well-formed sentences generated by qwen, corrupted with a confusion dictionary aggregated from nearly all open-source data;
+ 6.textproof.tet.json (1447): NLP competition data, TextProofreadingCompetition;
+ 7.gen_xxqg.tet.json (5000): source texts are high-quality xuexi.cn corpus, corrupted with a confusion dictionary aggregated from nearly all open-source data;
+ 8.faspell.dev.json (1000): video subtitles obtained through OCR; from iQIYI's FASPell paper;
+ 9.lomo_tet.json (5000): mainly phonetically similar spelling errors; from Tencent; the manually annotated CSCD-NS dataset;
+ 10.mcsc_tet.5000.json (5000): medical spelling correction, from real historical logs of the Tencent Medical Encyclopedia app; note the paper says this dataset only targets corrections of medical entities, not common characters;
+ 11.ecspell.dev.json (1500): from the ECSpell paper, covering three domains (law/med/gov);
+ 12.sighan2013.dev.json (1000): from the SIGHAN-13 shared task;
+ 13.sighan2014.dev.json (1062): from the SIGHAN-14 shared task;
+ 14.sighan2015.dev.json (1100): from the SIGHAN-15 shared task;
+ ```
+ ### 1.2 Evaluation-data preprocessing
+ ```
+ All evaluation data went through full-width-to-half-width conversion, traditional-to-simplified conversion, and punctuation normalization;
  ```

+ ### 1.3 Other notes
+ ```
+ 1.Metrics labeled "common" are very loose, matching the evaluation of the open-source pycorrector project;
+ 2.Metrics labeled "strict" are very strict, matching the open-source project [wangwang110/CSC](https://github.com/wangwang110/CSC);
+ 3.The macbert4mdcspell_v1 model is trained with the MDCSpell architecture plus BERT's MLM loss, but inference uses only bert-mlm;
+ 4.The acc_rmrb/acc_xxqg datasets contain no errors and are used to measure the models' over-correction rate;
+ 5.qwen25_1-5b_pycorrector is shibing624/chinese-text-correction-1.5b, whose training data includes the dev and test sets of lemon_v2/mcsc_tet/ecspell; the other BERT-style models were not trained on any dev or test sets;
+ ```

+ ## 二、Key metrics
+ ### 2.1 F1 (common_cor_f1)
+ | model/common_cor_f1 | avg| gen_de3| lemon_v2| gen_passage| text_proof| gen_xxqg| faspell| lomo_tet| mcsc_tet| ecspell| sighan2013| sighan2014| sighan2015 |
+ |:------------------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|
+ | macbert4csc_pycorrector | 45.8| 42.44| 42.89| 31.49| 46.31| 26.06| 32.7| 44.83| 27.93| 55.51| 70.89| 61.72| 66.81 |
+ | qwen25_1-5b_pycorrector | 45.11| 27.29| 89.48| 14.61| 83.9| 13.84| 18.2| 36.71| 96.29| 88.2| 36.41| 15.64| 20.73 |
+ | bert4csc_v1 | 62.28| 93.73| 61.99| 44.79| 68.0| 35.03| 48.28| 61.8| 64.41| 79.11| 77.66| 51.01| 61.54 |
+ | macbert4csc_v1 | 68.55| 96.67| 65.63| 48.4| 75.65| 38.43| 51.76| 70.11| 80.63| 85.55| 81.38| 57.63| 70.7 |
+ | macbert4csc_v2 | 68.6| 96.74| 66.02| 48.26| 75.78| 38.84| 51.91| 70.17| 80.71| 85.61| 80.97| 58.22| 69.95 |
+ | macbert4mdcspell_v1 | 71.1| 96.42| 70.06| 52.55| 79.61| 43.37| 53.85| 70.9| 82.38| 87.46| 84.2| 61.08| 71.32 |
+ | macbert4mdcspell_v2 | 71.23| 96.42| 65.8| 52.35| 75.94| 43.5| 53.82| 72.66| 82.28| 88.69| 82.51| 65.59| 75.26 |

+ ### 2.2 acc (common_cor_acc)
+ | model/common_cor_acc| avg| gen_de3| lemon_v2| gen_passage| text_proof| gen_xxqg| faspell| lomo_tet| mcsc_tet| ecspell| sighan2013| sighan2014| sighan2015 |
+ |:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|
+ | macbert4csc_pycorrector| 48.26| 26.96| 28.68| 34.16| 55.29| 28.38| 22.2| 60.96| 57.16| 67.73| 55.9| 68.93| 72.73 |
+ | qwen25_1-5b_pycorrector| 46.09| 15.82| 81.29| 22.96| 82.17| 19.04| 12.8| 50.2| 96.4| 89.13| 22.8| 27.87| 32.55 |
+ | bert4csc_v1| 60.76| 88.21| 45.96| 43.13| 68.97| 35.0| 34.0| 65.86| 73.26| 81.8| 64.5| 61.11| 67.27 |
+ | macbert4csc_v1| 65.34| 93.56| 49.76| 44.98| 74.64| 36.1| 37.0| 73.0| 83.6| 86.87| 69.2| 62.62| 72.73 |
+ | macbert4csc_v2| 65.22| 93.69| 50.14| 44.92| 74.64| 36.26| 37.0| 72.72| 83.66| 86.93| 68.5| 62.43| 71.73 |
+ | macbert4mdcspell_v1| 67.15| 93.09| 54.8| 47.71| 78.09| 39.52| 38.8| 71.92| 84.78| 88.27| 73.2| 63.28| 72.36 |
+ | macbert4mdcspell_v2 | 68.31| 93.09| 50.05| 48.72| 75.74| 40.52| 38.9| 76.9| 84.8| 89.73| 71.0| 71.94| 78.36 |

+ ### 2.3 acc (acc_true, thr=0.75)
+ | model/acc | avg| acc_rmrb| acc_xxqg |
+ |:------------------------|:-----------------|:-----------------|:-----------------|
+ | macbert4csc_pycorrector | 99.24| 99.22| 99.26 |
+ | qwen25_1-5b_pycorrector | 82.0| 77.14| 86.86 |
+ | bert4csc_v1 | 98.71| 98.36| 99.06 |
+ | macbert4csc_v1 | 97.72| 96.72| 98.72 |
+ | macbert4csc_v2 | 97.89| 96.98| 98.8 |
+ | macbert4mdcspell_v1 | 97.75| 96.51| 98.98 |
+ | macbert4mdcspell_v2 | 99.54| 99.22| 99.86 |

+ ## 二、Conclusion
+ ```
+ 1.Models such as macbert4csc_v1/macbert4csc_v2/macbert4mdcspell_v1 are trained on data from many domains and are well balanced; they also work as a first-stage pretrained model for further fine-tuning on domain-specific data;
+ 2.Comparing macbert4csc_pycorrector/bertbase4csc_v1/macbert4csc_v2/macbert4mdcspell_v1 against table 2.3 shows that more training data raises accuracy but also slightly raises the over-correction rate;
+ 3.MFT (Mask-Correct) still helps, though the gain is small when training data is plentiful; it is probably also a major cause of the higher over-correction rate;
+ 4.The training data includes classical Chinese, so the trained models also support classical-Chinese correction;
+ 5.The trained models have high detection and correction rates for frequent errors such as "地/得/的";
+ 6.macbert4mdcspell_v2's MFT applies no-error masking (rate 0.15) only 70% of the time, target-to-target 15% of the time, and no masking for the remaining 15%;
+ ```

+ ## 三、Usage
+ ### 3.1 With macro-correct
+ ```
  import os
  os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
  from macro_correct import correct

  """
  ```

+ ### 3.2 With transformers
+ ```
  # !/usr/bin/python
  # -*- coding: utf-8 -*-
  # @time    : 2021/2/29 21:41

  """
  ```

217
## 4. Papers
- 2024-Refining: [Refining Corpora from a Model Calibration Perspective for Chinese Spelling Correction](https://arxiv.org/abs/2407.15498)
- 2024-ReLM: [Chinese Spelling Correction as Rephrasing Language Model](https://arxiv.org/abs/2308.08796)
- 2024-DISC: [DISC: Plug-and-Play Decoding Intervention with Similarity of Characters for Chinese Spelling Check](https://arxiv.org/abs/2412.12863)

- 2023-Bi-DCSpell: A Bi-directional Detector-Corrector Interactive Framework for Chinese Spelling Check
- 2023-BERT-MFT: [Rethinking Masked Language Modeling for Chinese Spelling Correction](https://arxiv.org/abs/2305.17721)
- 2023-PTCSpell: [PTCSpell: Pre-trained Corrector Based on Character Shape and Pinyin for Chinese Spelling Correction](https://arxiv.org/abs/2212.04068)
- 2023-DR-CSC: [A Frustratingly Easy Plug-and-Play Detection-and-Reasoning Module for Chinese Spelling Check](https://aclanthology.org/2023.findings-emnlp.771)
- 2023-DROM: [Disentangled Phonetic Representation for Chinese Spelling Correction](https://arxiv.org/abs/2305.14783)
- 2023-EGCM: [An Error-Guided Correction Model for Chinese Spelling Error Correction](https://arxiv.org/abs/2301.06323)
- 2023-IGPI: [Investigating Glyph-Phonetic Information for Chinese Spell Checking: What Works and What’s Next?](https://arxiv.org/abs/2212.04068)
- 2023-CL: Contextual Similarity is More Valuable than Character Similarity: An Empirical Study for Chinese Spell Checking

- 2022-CRASpell: [CRASpell: A Contextual Typo Robust Approach to Improve Chinese Spelling Correction](https://aclanthology.org/2022.findings-acl.237)
- 2022-MDCSpell: [MDCSpell: A Multi-task Detector-Corrector Framework for Chinese Spelling Correction](https://aclanthology.org/2022.findings-acl.98)
- 2022-SCOPE: [Improving Chinese Spelling Check by Character Pronunciation Prediction: The Effects of Adaptivity and Granularity](https://arxiv.org/abs/2210.10996)
- 2022-ECOPO: [The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking](https://arxiv.org/abs/2203.00991)

- 2021-MLMPhonetics: [Correcting Chinese Spelling Errors with Phonetic Pre-training](https://aclanthology.org/2021.findings-acl.198)
- 2021-ChineseBERT: [ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information](https://aclanthology.org/2021.acl-long.161/)
- 2021-BERTCrsGad: [Global Attention Decoder for Chinese Spelling Error Correction](https://aclanthology.org/2021.findings-acl.122)
- 2021-ThinkTwice: [Think Twice: A Post-Processing Approach for the Chinese Spelling Error Correction](https://www.mdpi.com/2076-3417/11/13/5832)
- 2021-PHMOSpell: [PHMOSpell: Phonological and Morphological Knowledge Guided Chinese Spelling Check](https://aclanthology.org/2021.acl-long.464)
- 2021-SpellBERT: [SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check](https://aclanthology.org/2021.emnlp-main.287)
- 2021-TwoWays: [Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models](https://aclanthology.org/2021.acl-short.56)
- 2021-ReaLiSe: [Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking](https://arxiv.org/abs/2105.12306)
- 2021-DCSpell: [DCSpell: A Detector-Corrector Framework for Chinese Spelling Error Correction](https://dl.acm.org/doi/10.1145/3404835.3463050)
- 2021-PLOME: [PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction](https://aclanthology.org/2021.acl-long.233)
- 2021-DCN: [Dynamic Connected Networks for Chinese Spelling Check](https://aclanthology.org/2021.findings-acl.216/)

- 2020-SoftMaskBERT: [Spelling Error Correction with Soft-Masked BERT](https://arxiv.org/abs/2005.07421)
- 2020-SpellGCN: [SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check](https://arxiv.org/abs/2004.14166)
- 2020-ChunkCSC: [Chunk-based Chinese Spelling Check with Global Optimization](https://aclanthology.org/2020.findings-emnlp.184)
- 2020-MacBERT: [Revisiting Pre-Trained Models for Chinese Natural Language Processing](https://arxiv.org/abs/2004.13922)

- 2019-FASPell: [FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm](https://aclanthology.org/D19-5522)
- 2018-Hybrid: [A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Checking](https://aclanthology.org/D18-1273)

- 2015-Sighan15: [Introduction to SIGHAN 2015 Bake-off for Chinese Spelling Check](https://aclanthology.org/W15-3106/)
- 2014-Sighan14: [Overview of SIGHAN 2014 Bake-off for Chinese Spelling Check](https://aclanthology.org/W14-6820/)
- 2013-Sighan13: [Chinese Spelling Check Evaluation at SIGHAN Bake-off 2013](https://aclanthology.org/W13-4406/)

## 5. References
- [nghuyong/Chinese-text-correction-papers](https://github.com/nghuyong/Chinese-text-correction-papers)
- [destwang/CTCResources](https://github.com/destwang/CTCResources)
- [wangwang110/CSC](https://github.com/wangwang110/CSC)
- [chinese-poetry/chinese-poetry](https://github.com/chinese-poetry/chinese-poetry)
- [chinese-poetry/huajianji](https://github.com/chinese-poetry/huajianji)
- [garychowcmu/daizhigev20](https://github.com/garychowcmu/daizhigev20)
- [yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly)
- [Macropodus/xuexiqiangguo_428w](https://huggingface.co/datasets/Macropodus/xuexiqiangguo_428w)
- [Macropodus/csc_clean_wang271k](https://huggingface.co/datasets/Macropodus/csc_clean_wang271k)
- [Macropodus/csc_eval_public](https://huggingface.co/datasets/Macropodus/csc_eval_public)
- [shibing624/pycorrector](https://github.com/shibing624/pycorrector)
- [iioSnail/MDCSpell_pytorch](https://github.com/iioSnail/MDCSpell_pytorch)
- [gingasan/lemon](https://github.com/gingasan/lemon)
- [Claude-Liu/ReLM](https://github.com/Claude-Liu/ReLM)


## 6. Citation
 
To cite this work, please reference this GitHub project, for example with BibTeX:
```bibtex
@software{macro-correct,

author = {Yongzhuo Mo},
title = {macro-correct},
year = {2025}
}
```