Update README.md
README.md
CHANGED
@@ -2,81 +2,118 @@
license: apache-2.0
language:
- zh
tags:
- csc
- mdcspell
- chinese-spelling-correct
---

[Stars](https://github.com/yongzhuo/macro-correct/stargazers) · [Forks](https://github.com/yongzhuo/macro-correct/network/members) · [Gitter](https://gitter.im/yongzhuo/macro-correct?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

>>> macro-correct, a text-correction toolkit (Text Correct) supporting Chinese spelling correction and punctuation correction (CSC, Chinese Spelling Correct / Check). CSC handles data from all domains (including classical Chinese); the models are trained on large-scale, multi-domain, modern and contemporary corpora, so they generalize well.

>>> macro-correct is a minimalist NLP toolkit for Chinese text correction (CSC, Chinese spelling correction; Punct, Chinese punctuation correction) that depends only on pytorch, transformers, numpy, and opencc.
Its confusion sets are built from most of the open-source datasets on the market, and the models are trained on 10M+ samples generated from the People's Daily and Xuexi Qiangguo corpora;
It supports several classic models, including MDCSpell, Macbert, ReLM, SoftBERT, and BertCRF;
It supports Chinese spelling correction, Chinese punctuation correction, Chinese grammar correction (planned), and standalone detection/recognition models (planned);
It features light dependencies, concise code, detailed comments, easy debugging, flexible configuration, and easy extension for NLP work.

## Contents
* [Installation](#installation)
* [Usage](#usage)
* [Demo](#demo)
* [Details](#details)
* [Training](#training)
* [Evaluation](#evaluation)
* [Changelog](#changelog)
* [References](#references)
* [Papers](#papers)
* [Cite](#cite)

# Installation
```bash
pip install macro-correct
```

# Usage
More sample code can be found in the /tet directory.
- Usage examples live in the /tet/tet directory: tet_csc_token_zh.py for Chinese spelling correction and tet_csc_punct_zh.py for Chinese punctuation correction; for CSC you can also use tet_csc_flag_transformers.py directly.
- Training code lives in the /tet/train directory, where the local pretrained-model path and other hyperparameters can be configured.

# Demo
[HF---Space---Macropodus/macbert4csc_v2](https://huggingface.co/spaces/Macropodus/macbert4csc_v2)

```python
import os
os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
from macro_correct import correct
@@ -101,8 +138,8 @@
"""
```

### With transformers
```
# !/usr/bin/python
# -*- coding: utf-8 -*-
# @time    : 2021/2/29 21:41
@@ -177,424 +214,67 @@
"""
```

## Confusion-dictionary usage
```python
import os
os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
# the imports, text_list and user_dict below are reconstructed from the
# calls and the sample outputs in this block
from macro_correct import ConfusionCorrect
from macro_correct import MODEL_CSC_TOKEN
from macro_correct import correct

text_list = ["为什么乐而往返?",
             "没有金钢钻就不揽瓷活!",
             "你喜欢藤罗蔓吗?",
             "三周年祭日在哪举行?"
             ]
user_dict = {"乐而往返": "乐而忘返",
             "金钢钻": "金刚钻",
             "藤罗蔓": "藤萝蔓",
             }

text_csc = correct(text_list, flag_confusion=False)
print("默认纠错(不带混淆词典):")
for res_i in text_csc:
    print(res_i)
print("#" * 128)


text_csc = correct(text_list, flag_confusion=True)
print("默认纠错(-带混淆词典-默认):")
for res_i in text_csc:
    print(res_i)
print("#" * 128)


# --- confusion dictionary (in memory) ---
### add-only: register a user dict (the default confusion dict stays active)
MODEL_CSC_TOKEN.model_csc.model_confusion = ConfusionCorrect(user_dict=user_dict)
text_csc = correct(text_list, flag_confusion=True)
print("默认纠错(-带混淆词典-新增):")
for res_i in text_csc:
    print(res_i)
print("#" * 128)
### full override: use only the user dict (the default confusion dict is discarded)
MODEL_CSC_TOKEN.model_csc.model_confusion = ConfusionCorrect(confusion_dict=user_dict)
text_csc = correct(text_list, flag_confusion=True)
print("默认纠错(-带混淆词典-全覆盖):")
for res_i in text_csc:
    print(res_i)
print("#" * 128)


# --- confusion-dictionary file ---
### add-only: register a user dict file (the default confusion dict stays active); path just has to be non-empty; a JSON file of {confused phrase: correct phrase} key-value pairs; see macro-correct/tet/tet/tet_csc_token_confusion.py
path_user = "./user_confusion_dict.json"
MODEL_CSC_TOKEN.model_csc.model_confusion = ConfusionCorrect(path="1", path_user=path_user)
text_csc = correct(text_list, flag_confusion=True)
print("默认纠错(-带混淆词典文件-新增):")
for res_i in text_csc:
    print(res_i)
print("#" * 128)
### full override: use only the user dict file (the default confusion dict is discarded); path must be an empty string
MODEL_CSC_TOKEN.model_csc.model_confusion = ConfusionCorrect(path="", path_user=path_user)
text_csc = correct(text_list, flag_confusion=True)
print("默认纠错(-带混淆词典文件-全覆盖):")
for res_i in text_csc:
    print(res_i)
print("#" * 128)


"""
默认纠错(不带混淆词典):
{'index': 0, 'source': '为什么乐而往返?', 'target': '为什么乐而往返?', 'errors': []}
{'index': 1, 'source': '没有金钢钻就不揽瓷活!', 'target': '没有金刚钻就不揽瓷活!', 'errors': [['钢', '刚', 3, 0.6587]]}
{'index': 2, 'source': '你喜欢藤罗蔓吗?', 'target': '你喜欢藤萝蔓吗?', 'errors': [['罗', '萝', 4, 0.8582]]}
{'index': 3, 'source': '三周年祭日在哪举行?', 'target': '三周年祭日在哪举行?', 'errors': []}
################################################################################################################################
默认纠错(-带混淆词典-默认):
{'index': 0, 'source': '为什么乐而往返?', 'target': '为什么乐而往返?', 'errors': []}
{'index': 1, 'source': '没有金钢钻就不揽瓷活!', 'target': '没有金刚钻就不揽瓷活!', 'errors': [['钢', '刚', 3, 1.0]]}
{'index': 2, 'source': '你喜欢藤罗蔓吗?', 'target': '你喜欢藤萝蔓吗?', 'errors': [['罗', '萝', 4, 0.8582]]}
{'index': 3, 'source': '三周年祭日在哪举行?', 'target': '三周年忌日在哪举行?', 'errors': [['祭', '忌', 3, 1.0]]}
################################################################################################################################
默认纠错(-带混淆词典-新增):
{'index': 0, 'source': '为什么乐而往返?', 'target': '为什么乐而忘返?', 'errors': [['往', '忘', 5, 1.0]]}
{'index': 1, 'source': '没有金钢钻就不揽瓷活!', 'target': '没有金刚钻就不揽瓷活!', 'errors': [['钢', '刚', 3, 1.0]]}
{'index': 2, 'source': '你喜欢藤罗蔓吗?', 'target': '你喜欢藤萝蔓吗?', 'errors': [['罗', '萝', 4, 1.0]]}
{'index': 3, 'source': '三周年祭日在哪举行?', 'target': '三周年忌日在哪举行?', 'errors': [['祭', '忌', 3, 1.0]]}
################################################################################################################################
默认纠错(-带混淆词典-全覆盖):
{'index': 0, 'source': '为什么乐而往返?', 'target': '为什么乐而忘返?', 'errors': [['往', '忘', 5, 1.0]]}
{'index': 1, 'source': '没有金钢钻就不揽瓷活!', 'target': '没有金刚钻就不揽瓷活!', 'errors': [['钢', '刚', 3, 1.0]]}
{'index': 2, 'source': '你喜欢藤罗蔓吗?', 'target': '你喜欢藤萝蔓吗?', 'errors': [['罗', '萝', 4, 1.0]]}
{'index': 3, 'source': '三周年祭日在哪举行?', 'target': '三周年祭日在哪举行?', 'errors': []}
################################################################################################################################
默认纠错(-带混淆词典文件-新增):
{'index': 0, 'source': '为什么乐而往返?', 'target': '为什么乐而忘返?', 'errors': [['往', '忘', 5, 1.0]]}
{'index': 1, 'source': '没有金钢钻就不揽瓷活!', 'target': '没有金刚钻就不揽瓷活!', 'errors': [['钢', '刚', 3, 1.0]]}
{'index': 2, 'source': '你喜欢藤罗蔓吗?', 'target': '你喜欢藤萝蔓吗?', 'errors': [['罗', '萝', 4, 1.0]]}
{'index': 3, 'source': '三周年祭日在哪举行?', 'target': '三周年忌日在哪举行?', 'errors': [['祭', '忌', 3, 1.0]]}
################################################################################################################################
默认纠错(-带混淆词典文件-全覆盖):
{'index': 0, 'source': '为什么乐而往返?', 'target': '为什么乐而忘返?', 'errors': [['往', '忘', 5, 1.0]]}
{'index': 1, 'source': '没有金钢钻就不揽瓷活!', 'target': '没有金刚钻就不揽瓷活!', 'errors': [['钢', '刚', 3, 1.0]]}
{'index': 2, 'source': '你喜欢藤罗蔓吗?', 'target': '你喜欢藤萝蔓吗?', 'errors': [['罗', '萝', 4, 1.0]]}
{'index': 3, 'source': '三周年祭日在哪举行?', 'target': '三周年祭日在哪举行?', 'errors': []}
################################################################################################################################
"""
```
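For reference, a user_confusion_dict.json in the {confused phrase: correct phrase} layout described above could look like this (illustrative entries, mirroring the in-memory user_dict):

```json
{
  "乐而往返": "乐而忘返",
  "金钢钻": "金刚钻",
  "藤罗蔓": "藤萝蔓"
}
```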

# Details
## CSC usage (hyperparameters)
```python
import os
os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
from macro_correct import correct
### default correction (list input)
text_list = ["真麻烦你了。希望你们好好的跳无",
             "少先队员因该为老人让坐",
             "机七学习是人工智能领遇最能体现智能的一个分知",
             "一只小鱼船浮在平净的河面上"
             ]
### default correction (list input, with parameters)
params = {
    "threshold": 0.55,       # token-level probability threshold
    "batch_size": 32,        # batch size
    "max_len": 128,          # custom max length; truncated text is excluded from correction and restored verbatim afterwards
    "rounded": 4,            # round probabilities to 4 decimal places
    "flag_confusion": True,  # whether to use the default confusion dictionary
    "flag_prob": True,       # whether to return the probability at each corrected token
}
text_csc = correct(text_list, **params)
print("默认纠错(list输入, 参数配置):")
for res_i in text_csc:
    print(res_i)
print("#" * 128)


"""
默认纠错(list输入):
{'index': 0, 'source': '真麻烦你了。希望你们好好的跳无', 'target': '真麻烦你了。希望你们好好地跳舞', 'errors': [['的', '地', 12, 0.6584], ['无', '舞', 14, 1.0]]}
{'index': 1, 'source': '少先队员因该为老人让坐', 'target': '少先队员应该为老人让坐', 'errors': [['因', '应', 4, 0.995]]}
{'index': 2, 'source': '机七学习是人工智能领遇最能体现智能的一个分知', 'target': '机器学习是人工智能领域最能体现智能的一个分支', 'errors': [['七', '器', 1, 0.9998], ['遇', '域', 10, 0.9999], ['知', '支', 21, 1.0]]}
{'index': 3, 'source': '一只小鱼船浮在平净的河面上', 'target': '一只小鱼船浮在平静的河面上', 'errors': [['净', '静', 8, 0.9961]]}
"""
```
## PUNCT usage (hyperparameters)
```python
import os
os.environ["MACRO_CORRECT_FLAG_CSC_PUNCT"] = "1"
from macro_correct import correct_punct


### 1. default punctuation correction (list input)
text_list = ["山不在高有仙则名。",
             "水不在深,有龙则灵",
             "斯是陋室惟吾德馨",
             "苔痕上阶绿草,色入帘青。"
             ]
### 2. default punctuation correction (list input, with parameters)
params = {
    "limit_num_errors": 4,  # maximum corrections per sentence; sentences with more are skipped
    "limit_len_char": 4,    # minimum number of characters per sentence
    "threshold_zh": 0.5,    # sentence-level threshold: minimum share of Chinese characters
    "threshold": 0.55,      # token-level probability threshold
    "batch_size": 32,       # batch size
    "max_len": 128,         # custom max length; truncated text is excluded from correction and restored verbatim afterwards
    "rounded": 4,           # round probabilities to 4 decimal places
    "flag_prob": True,      # whether to return the probability at each corrected token
}
text_csc = correct_punct(text_list, **params)
print("默认标点纠错(list输入):")
for res_i in text_csc:
    print(res_i)
print("#" * 128)

"""
默认标点纠错(list输入):
{'index': 0, 'source': '山不在高有仙则名。', 'target': '山不在高,有仙则名。', 'score': 0.9917, 'errors': [['', ',', 4, 0.9917]]}
{'index': 1, 'source': '水不在深,有龙则灵', 'target': '水不在深,有龙则灵。', 'score': 0.9995, 'errors': [['', '。', 9, 0.9995]]}
{'index': 2, 'source': '斯是陋室惟吾德馨', 'target': '斯是陋室,惟吾德馨。', 'score': 0.9999, 'errors': [['', ',', 4, 0.9999], ['', '。', 8, 0.9998]]}
{'index': 3, 'source': '苔痕上阶绿草,色入帘青。', 'target': '苔痕上阶绿,草色入帘青。', 'score': 0.9998, 'errors': [['', ',', 5, 0.9998]]}
"""
```

# Training
## CSC task
### Paths
* macbert4mdcspell: macro_correct/pytorch_user_models/csc/macbert4mdcspell/train_yield.py
* macbert4csc: macro_correct/pytorch_user_models/csc/macbert4csc/train_yield.py
* relm: macro_correct/pytorch_user_models/csc/relm/train_yield.py
### Data preparation
* espell: a JSON file holding a list<dict>; each item only needs "original_text" and "correct_text"; see macro_correct/corpus/text_correction/espell
```
[
    {
        "original_text": "遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。",
        "correct_text": "遇到逆境时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。"
    }
]
```
* sighan: a JSON file holding a list<dict>; each item only needs "source" and "target"; see macro_correct/corpus/text_correction/sighan; a conversion sketch between the two layouts follows
```
[
    {
        "source": "若被告人正在劳动教养,则可以通过劳动教养单位转交",
        "target": "若被告人正在劳动教养,则可以通过劳动教养单位转交"
    }
]
```
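The two layouts differ only in their key names; a minimal conversion sketch (the file names here are hypothetical):

```python
import json

# Convert an espell-style file (original_text/correct_text) into the
# sighan-style layout (source/target).
with open("espell.train.json", encoding="utf-8") as f:
    data = json.load(f)

converted = [{"source": d["original_text"], "target": d["correct_text"]}
             for d in data]

with open("sighan_style.train.json", "w", encoding="utf-8") as f:
    json.dump(converted, f, ensure_ascii=False, indent=2)
```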
### Configure, train, evaluate, predict
#### Configuration
Set the data paths and hyperparameters; see macro_correct/pytorch_user_models/csc/macbert4mdcspell/config.py
#### Train / evaluate / predict
```bash
# train
nohup python train_yield.py > tc.train_yield.py.log 2>&1 &
tail -n 1000 -f tc.train_yield.py.log
# evaluate
python eval_std.py
# predict
python predict.py
```

## PUNCT task
### Paths
* PUNCT: macro_correct/pytorch_sequencelabeling/slRun.py
### Data preparation
* SPAN format: an NER task using the span format (jsonl) by default; see the chinese_symbol.dev.span file in macro_correct/corpus/sequence_labeling/chinese_symbol
```
{'label': [{'type': '0', 'ent': '下', 'pos': [7, 7]}, {'type': '1', 'ent': '林', 'pos': [14, 14]}], 'text': '#桂林山水甲天下阳朔山水甲桂林'}
{'label': [{'type': '11', 'ent': 'o', 'pos': [5, 5]}, {'type': '0', 'ent': 't', 'pos': [12, 12]}, {'type': '1', 'ent': '包', 'pos': [19, 19]}], 'text': '#macrocorrect文本纠错工具包'}
```
* CoNLL format: after generating the SPAN format, convert it with macro_correct/tet/corpus/pos_to_conll.py; see the sketch after this block
```
神 O
秘 O
宝 O
藏 B-1
在 O
旅 O
途 O
中 B-0
他 O
```
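A simplified stand-in for pos_to_conll.py, assuming well-formed JSONL input (note the samples above use Python dict repr with single quotes; real files should be JSON):

```python
import json

def span_to_conll(line: str) -> str:
    """Convert one SPAN-format record into CoNLL lines.
    Entities here are single characters, as in the samples above."""
    record = json.loads(line)
    text = record["text"]
    tags = ["O"] * len(text)
    for ent in record["label"]:
        start, _end = ent["pos"]
        tags[start] = "B-" + ent["type"]
    return "\n".join(f"{char} {tag}" for char, tag in zip(text, tags))
```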
### Configure, train, evaluate, predict
#### Configuration
Set the data paths and hyperparameters; see macro_correct/pytorch_user_models/csc/macbert4mdcspell/config.py
#### Train / evaluate / predict
```bash
# train
nohup python train_yield.py > tc.train_yield.py.log 2>&1 &
tail -n 1000 -f tc.train_yield.py.log
# evaluate
python eval_std.py
# predict
python predict.py
```


# Evaluation
## Notes
* All training data come from the public web or open-source datasets; there are roughly 10 million training samples, and the confusion dictionary is fairly large;
* All test data come from the public web or open-source datasets; the evaluation data live at [Macropodus/csc_eval_public](https://huggingface.co/datasets/Macropodus/csc_eval_public);
* The main evaluation code is [tcEval.py](https://github.com/yongzhuo/macro-correct/macro_correct/pytorch_textcorrection/tcEval.py); the evaluation code for [qwen25_1-5b_pycorrector]() lives in the [eval](https://github.com/yongzhuo/macro-correct/tet/eval) directory;
* Metrics: over-correction rate (false corrections on high-quality, already-correct sentences); sentence-level lenient accuracy/precision/recall/F1 (as in [shibing624/pycorrector](https://github.com/shibing624/pycorrector)); sentence-level strict accuracy/precision/recall/F1 (as in [wangwang110/CSC](https://github.com/wangwang110/CSC)); character-level accuracy/precision/recall/F1 over typos; a simplified sketch of the sentence-level scoring follows this list;
* qwen25_1-5b_pycorrector weights: [shibing624/chinese-text-correction-1.5b](https://huggingface.co/shibing624/chinese-text-correction-1.5b);
* macbert4csc_pycorrector weights: [shibing624/macbert4csc-base-chinese](https://huggingface.co/shibing624/macbert4csc-base-chinese);
* macbert4mdcspell_v1 weights: [Macropodus/macbert4mdcspell_v1](https://huggingface.co/Macropodus/macbert4mdcspell_v1);
* macbert4mdcspell_v2 weights: [Macropodus/macbert4mdcspell_v2](https://huggingface.co/Macropodus/macbert4mdcspell_v2);
* macbert4csc_v2 weights: [Macropodus/macbert4csc_v2](https://huggingface.co/Macropodus/macbert4csc_v2);
* macbert4csc_v1 weights: [Macropodus/macbert4csc_v1](https://huggingface.co/Macropodus/macbert4csc_v1);
* bert4csc_v1 weights: [Macropodus/bert4csc_v1](https://huggingface.co/Macropodus/bert4csc_v1);
495 |
-
## 3.1 测评数据
|
496 |
-
```
|
497 |
-
1.gen_de3.json(5545): '的地得'纠错, 由人民日报/学习强国/chinese-poetry等高质量数据人工生成;
|
498 |
-
2.lemon_v2.tet.json(1053): relm论文提出的数据, 多领域拼写纠错数据集(7个领域), ; 包括game(GAM), encyclopedia (ENC), contract (COT), medical care(MEC), car (CAR), novel (NOV), and news (NEW)等领域;
|
499 |
-
3.acc_rmrb.tet.json(4636): 来自NER-199801(人民日报高质量语料);
|
500 |
-
4.acc_xxqg.tet.json(5000): 来自学习强国网站的高质量语料;
|
501 |
-
5.gen_passage.tet.json(10000): 源数据为qwen生成的好词好句, 由几乎所有的开源数据汇总的混淆词典生成;
|
502 |
-
6.textproof.tet.json(1447): NLP竞赛数据, TextProofreadingCompetition;
|
503 |
-
7.gen_xxqg.tet.json(5000): 源数据为学习强国网站的高质量语料, 由几乎所有的开源数据汇总的混淆词典生成;
|
504 |
-
8.faspell.dev.json(1000): 视频字幕通过OCR后获取的数据集; 来自爱奇艺的论文faspell;
|
505 |
-
9.lomo_tet.json(5000): 主要为音似中文拼写纠错数据集; 来自腾讯; 人工标注的数据集CSCD-NS;
|
506 |
-
10.mcsc_tet.5000.json(5000): 医学拼写纠错; 来自腾讯医典APP的真实历史日志; 注意论文说该数据集只关注医学实体的纠错, 常用字等的纠错并不关注;
|
507 |
-
11.ecspell.dev.json(1500): 来自ECSpell论文, 包括(law/med/gov)等三个领域;
|
508 |
-
12.sighan2013.dev.json(1000): 来自sighan13会议;
|
509 |
-
13.sighan2014.dev.json(1062): 来自sighan14会议;
|
510 |
-
14.sighan2015.dev.json(1100): 来自sighan15会议;
|
511 |
-
```
|
512 |
-
|
513 |
-
## 3.2 测评再说明
|
514 |
-
```
|
515 |
-
1.数据预处理, 测评数据都经过 全角转半角,繁简转化,标点符号标准化等操作;
|
516 |
-
2.指标带common的极为宽松指标, 同开源项目pycorrector的评估指标;
|
517 |
-
3.指标带strict的极为严格指标, 同开源项目[wangwang110/CSC](https://github.com/wangwang110/CSC);
|
518 |
-
4.macbert4mdcspell_v1/v2模型为训练使用mdcspell架构+bert的mlm-loss, 但是推理的时候只用bert-mlm;
|
519 |
-
5.acc_rmrb/acc_xxqg数据集没有错误, 用于评估模型的误纠率(过度纠错);
|
520 |
-
6.qwen25_1-5b_pycorrector的模型为shibing624/chinese-text-correction-1.5b, 其训练数据包括了lemon_v2/mcsc_tet/ecspell的验证集和测试集, 其他的bert类模型的训练不包括验证集和测试集;
|
521 |
-
```
|
522 |
-
|
523 |
-
## 3.3 测评结果
|
524 |
-
### 3.3.1 F1(common_cor_f1)
|
525 |
-
| model/common_cor_f1 | avg| gen_de3| lemon_v2| gen_passage| text_proof| gen_xxqg| faspell| lomo_tet| mcsc_tet| ecspell| sighan2013| sighan2014| sighan2015 |
|
526 |
-
|:------------------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|
|
527 |
-
| macbert4csc_pycorrector | 45.8| 42.44| 42.89| 31.49| 46.31| 26.06| 32.7| 44.83| 27.93| 55.51| 70.89| 61.72| 66.81 |
|
528 |
-
| qwen25_1-5b_pycorrector | 45.11| 27.29| 89.48| 14.61| 83.9| 13.84| 18.2| 36.71| 96.29| 88.2| 36.41| 15.64| 20.73 |
|
529 |
-
| bert4csc_v1 | 62.28| 93.73| 61.99| 44.79| 68.0| 35.03| 48.28| 61.8| 64.41| 79.11| 77.66| 51.01| 61.54 |
|
530 |
-
| macbert4csc_v1 | 68.55| 96.67| 65.63| 48.4| 75.65| 38.43| 51.76| 70.11| 80.63| 85.55| 81.38| 57.63| 70.7 |
|
531 |
-
| macbert4csc_v2 | 68.6| 96.74| 66.02| 48.26| 75.78| 38.84| 51.91| 70.17| 80.71| 85.61| 80.97| 58.22| 69.95 |
|
532 |
-
| macbert4mdcspell_v1 | 71.1| 96.42| 70.06| 52.55| 79.61| 43.37| 53.85| 70.9| 82.38| 87.46| 84.2| 61.08| 71.32 |
|
533 |
-
| macbert4mdcspell_v2 | 71.23| 96.42| 65.8| 52.35| 75.94| 43.5| 53.82| 72.66| 82.28| 88.69| 82.51| 65.59| 75.26 |
|
534 |
-
|
535 |
-
### 3.3.2 acc(common_cor_acc)
|
536 |
-
| model/common_cor_acc| avg| gen_de3| lemon_v2| gen_passage| text_proof| gen_xxqg| faspell| lomo_tet| mcsc_tet| ecspell| sighan2013| sighan2014| sighan2015 |
|
537 |
-
|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|
|
538 |
-
| macbert4csc_pycorrector| 48.26| 26.96| 28.68| 34.16| 55.29| 28.38| 22.2| 60.96| 57.16| 67.73| 55.9| 68.93| 72.73 |
|
539 |
-
| qwen25_1-5b_pycorrector| 46.09| 15.82| 81.29| 22.96| 82.17| 19.04| 12.8| 50.2| 96.4| 89.13| 22.8| 27.87| 32.55 |
|
540 |
-
| bert4csc_v1| 60.76| 88.21| 45.96| 43.13| 68.97| 35.0| 34.0| 65.86| 73.26| 81.8| 64.5| 61.11| 67.27 |
|
541 |
-
| macbert4csc_v1| 65.34| 93.56| 49.76| 44.98| 74.64| 36.1| 37.0| 73.0| 83.6| 86.87| 69.2| 62.62| 72.73 |
|
542 |
-
| macbert4csc_v2| 65.22| 93.69| 50.14| 44.92| 74.64| 36.26| 37.0| 72.72| 83.66| 86.93| 68.5| 62.43| 71.73 |
|
543 |
-
| macbert4mdcspell_v1| 67.15| 93.09| 54.8| 47.71| 78.09| 39.52| 38.8| 71.92| 84.78| 88.27| 73.2| 63.28| 72.36 |
|
544 |
-
| macbert4mdcspell_v2 | 68.31| 93.09| 50.05| 48.72| 75.74| 40.52| 38.9| 76.9| 84.8| 89.73| 71.0| 71.94| 78.36 |
|
545 |
-
|
546 |
-
### 3.3.3 acc(acc_true, thr=0.75)
|
547 |
-
| model/acc | avg| acc_rmrb| acc_xxqg |
|
548 |
-
|:------------------------|:-----------------|:-----------------|:-----------------|
|
549 |
-
| macbert4csc_pycorrector | 99.24| 99.22| 99.26 |
|
550 |
-
| qwen25_1-5b_pycorrector | 82.0| 77.14| 86.86 |
|
551 |
-
| bert4csc_v1 | 98.71| 98.36| 99.06 |
|
552 |
-
| macbert4csc_v1 | 97.72| 96.72| 98.72 |
|
553 |
-
| macbert4csc_v2 | 97.89| 96.98| 98.8 |
|
554 |
-
| macbert4mdcspell_v1 | 97.75| 96.51| 98.98 |
|
555 |
-
| macbert4mdcspell_v2 | 99.54| 99.22| 99.86 |
|
556 |
-
|
557 |
-
|
558 |
-
### 3.3.4 结论(Conclusion)
|
559 |
-
```
|
560 |
-
1.macbert4csc_v1/macbert4csc_v2/macbert4mdcspell_v1等模型使用多种领域数据训练, 比较均衡, 也适合作为第一步的预训练模型, 可用于专有领域数据的继续微调;
|
561 |
-
2.比较macbert4csc_pycorrector/bertbase4csc_v1/macbert4csc_v2/macbert4mdcspell_v1, 观察表2.3, 可以发现训练数据越多, 准确率提升的同时, 误纠率也会稍微高一些;
|
562 |
-
3.MFT(Mask-Correct)依旧有效, 不过对于数据量足够的情形提升不明显, 可能也是误纠率升高的一个重要原因;
|
563 |
-
4.训练数据中也存在文言文数据, 训练好的模型也支持文言文纠错;
|
564 |
-
5.训练好的模型对"地得的"等高频错误具有较高的识别率和纠错率;
|
565 |
-
6.macbert4mdcspell_v2的MFT只70%的时间no-error-mask(0.15), 15%的时间target-to-target, 15%的时间不mask;
|
566 |
-
```
|
567 |
-
|
568 |
-
|
569 |
-
# 日志
|
570 |
-
```
|
571 |
-
1. v20240129, 完成csc_punct模块;
|
572 |
-
2. v20241001, 完成csc_token模块;
|
573 |
-
3. v20250117, 完成csc_eval模块;
|
574 |
-
4. v20250501, 完成macbert4mdcspell_v2
|
575 |
-
```
|
576 |
-
|
577 |
-
|
578 |
-
# 参考
|
579 |
-
This library is inspired by and references following frameworks and papers.
|
580 |
-
|
581 |
-
* Chinese-text-correction-papers: [nghuyong/Chinese-text-correction-papers](https://github.com/nghuyong/Chinese-text-correction-papers)
|
582 |
-
* pycorrector: [shibing624/pycorrector](https://github.com/shibing624/pycorrector)
|
583 |
-
* CTCResources: [destwang/CTCResources](https://github.com/destwang/CTCResources)
|
584 |
-
* CSC: [wangwang110/CSC](https://github.com/wangwang110/CSC)
|
585 |
-
* char-similar: [yongzhuo/char-similar](https://github.com/yongzhuo/char-similar)
|
586 |
-
* MDCSpell: [iioSnail/MDCSpell_pytorch](https://github.com/iioSnail/MDCSpell_pytorch)
|
587 |
-
* CSCD-NS: [nghuyong/cscd-ns](https://github.com/nghuyong/cscd-ns)
|
588 |
-
* lemon: [gingasan/lemon](https://github.com/gingasan/lemon)
|
589 |
-
* ReLM: [Claude-Liu/ReLM](https://github.com/Claude-Liu/ReLM)
|
590 |
-
|
591 |
-
|
592 |
-
# 论文
|
593 |
-
## 中文拼写纠错(CSC, Chinese Spelling Correction)
|
594 |
-
* 共收录34篇论文, 写了一个简短的综述. 详见[README.csc_survey.md](https://github.com/yongzhuo/macro-correct/blob/master/README.csc_survey.md)
|
595 |
-
|
596 |
-
|
597 |
-
# Cite
|
598 |
For citing this work, you can refer to the present GitHub project. For example, with BibTeX:
|
599 |
```
|
600 |
@software{macro-correct,
|
@@ -602,5 +282,4 @@
  author = {Yongzhuo Mo},
  title = {macro-correct},
  year = {2025}
```

README.md (updated version):

license: apache-2.0
language:
- zh
base_model:
- hfl/chinese-macbert-base
pipeline_tag: text-generation
tags:
- csc
- text-correct
- chinese-spelling-correct
- chinese-spelling-check
- 中文拼写纠错
- 文本纠错
- mdcspell
- macro-correct
---
# macbert4mdcspell
## Overview (macbert4mdcspell)
- macro-correct: Chinese spelling correction (CSC) evaluation and ready-to-use weights for text correction
- Project home: [https://github.com/yongzhuo/macro-correct](https://github.com/yongzhuo/macro-correct)
- These weights are macbert4mdcspell_v2, built on the MDCSpell architecture, whose distinctive feature is the interaction between det_label and cor_label;
- Training adds MacBERT's MLM loss; at inference, the layers after the MacBERT backbone are dropped;
- How to use: 1. call it via transformers; 2. call it via the [macro-correct](https://github.com/yongzhuo/macro-correct) project; see ***4. Usage*** for details;
- To curb over-correction, macbert4mdcspell_v2's MFT uses no-error masking (rate 0.15) only 70% of the time, target-to-target 15% of the time, and no masking the remaining 15% (sketched below);
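For intuition, the MFT schedule above can be sketched as follows (a simplified illustration, not the project's training code; the function name is hypothetical):

```python
import random

def pick_mft_variant() -> str:
    """Choose the MFT masking variant for one training sample (70/15/15)."""
    r = random.random()
    if r < 0.70:
        return "mask_no_error"      # mask 15% of the error-free positions
    if r < 0.85:
        return "target_to_target"   # feed the target sentence as the input
    return "no_mask"                # leave the input unmasked
```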

## Contents
* [1. Evaluation (Test)](#1-evaluation-test)
* [2. Key metrics](#2-key-metrics)
* [3. Conclusion](#3-conclusion)
* [4. Usage](#4-usage)
* [5. Papers](#5-papers)
* [6. References](#6-references)
* [7. Cite](#7-cite)

## 1. Evaluation (Test)
### 1.1 Evaluation data sources
The data live at [Macropodus/csc_eval_public](https://huggingface.co/datasets/Macropodus/csc_eval_public). All training data come from the public web or open-source datasets; there are roughly 10 million training samples, and the confusion dictionary is fairly large.
```
1. gen_de3.json (5545): "的/地/得" correction; generated manually from high-quality data such as People's Daily, Xuexi Qiangguo, and chinese-poetry;
2. lemon_v2.tet.json (1053): data proposed with the ReLM paper; a multi-domain spelling-correction dataset covering 7 domains: game (GAM), encyclopedia (ENC), contract (COT), medical care (MEC), car (CAR), novel (NOV), and news (NEW);
3. acc_rmrb.tet.json (4636): from NER-199801 (high-quality People's Daily corpus);
4. acc_xxqg.tet.json (5000): high-quality corpus from the Xuexi Qiangguo website;
5. gen_passage.tet.json (10000): source sentences generated by qwen, with errors injected from a confusion dictionary aggregated from nearly all open-source data;
6. textproof.tet.json (1447): NLP competition data, TextProofreadingCompetition;
7. gen_xxqg.tet.json (5000): source sentences from the Xuexi Qiangguo website, with errors injected from the aggregated confusion dictionary;
8. faspell.dev.json (1000): video subtitles extracted via OCR; from iQIYI's FASPell paper;
9. lomo_tet.json (5000): mainly phonetically similar Chinese spelling errors; from Tencent; the manually annotated CSCD-NS dataset;
10. mcsc_tet.5000.json (5000): medical spelling correction, from real historical logs of the Tencent Yidian app; note that the paper says this dataset only covers corrections of medical entities, not of common characters;
11. ecspell.dev.json (1500): from the ECSpell paper, covering three domains (law/med/gov);
12. sighan2013.dev.json (1000): from SIGHAN-13;
13. sighan2014.dev.json (1062): from SIGHAN-14;
14. sighan2015.dev.json (1100): from SIGHAN-15;
```
|
56 |
+
```
|
57 |
+
测评数据都经过 全角转半角,繁简转化,标点符号标准化等操作;
|
58 |
```
|
59 |
|
60 |
+
### 1.3 其他说明
|
61 |
+
```
|
62 |
+
1.指标带common的极为宽松指标, 同开源项目pycorrector的评估指标;
|
63 |
+
2.指标带strict的极为严格指标, 同开源项目[wangwang110/CSC](https://github.com/wangwang110/CSC);
|
64 |
+
3.macbert4mdcspell_v1模型为训练使用mdcspell架构+bert的mlm-loss, 但是推理的时候只用bert-mlm;
|
65 |
+
4.acc_rmrb/acc_xxqg数据集没有错误, 用于评估模型的误纠率(过度纠错);
|
66 |
+
5.qwen25_1-5b_pycorrector的模型为shibing624/chinese-text-correction-1.5b, 其训练数据包括了lemon_v2/mcsc_tet/ecspell的验证集和测试集, 其他的bert类模型的训练不包括验证集和测试集;
|
67 |
+
```
|
68 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
69 |
|
70 |
+
## 2. Key metrics
### 2.1 F1 (common_cor_f1)
| model/common_cor_f1 | avg| gen_de3| lemon_v2| gen_passage| text_proof| gen_xxqg| faspell| lomo_tet| mcsc_tet| ecspell| sighan2013| sighan2014| sighan2015 |
|:------------------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|
| macbert4csc_pycorrector | 45.8| 42.44| 42.89| 31.49| 46.31| 26.06| 32.7| 44.83| 27.93| 55.51| 70.89| 61.72| 66.81 |
| qwen25_1-5b_pycorrector | 45.11| 27.29| 89.48| 14.61| 83.9| 13.84| 18.2| 36.71| 96.29| 88.2| 36.41| 15.64| 20.73 |
| bert4csc_v1 | 62.28| 93.73| 61.99| 44.79| 68.0| 35.03| 48.28| 61.8| 64.41| 79.11| 77.66| 51.01| 61.54 |
| macbert4csc_v1 | 68.55| 96.67| 65.63| 48.4| 75.65| 38.43| 51.76| 70.11| 80.63| 85.55| 81.38| 57.63| 70.7 |
| macbert4csc_v2 | 68.6| 96.74| 66.02| 48.26| 75.78| 38.84| 51.91| 70.17| 80.71| 85.61| 80.97| 58.22| 69.95 |
| macbert4mdcspell_v1 | 71.1| 96.42| 70.06| 52.55| 79.61| 43.37| 53.85| 70.9| 82.38| 87.46| 84.2| 61.08| 71.32 |
| macbert4mdcspell_v2 | 71.23| 96.42| 65.8| 52.35| 75.94| 43.5| 53.82| 72.66| 82.28| 88.69| 82.51| 65.59| 75.26 |

### 2.2 acc (common_cor_acc)
| model/common_cor_acc| avg| gen_de3| lemon_v2| gen_passage| text_proof| gen_xxqg| faspell| lomo_tet| mcsc_tet| ecspell| sighan2013| sighan2014| sighan2015 |
|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|:-----------------|
| macbert4csc_pycorrector| 48.26| 26.96| 28.68| 34.16| 55.29| 28.38| 22.2| 60.96| 57.16| 67.73| 55.9| 68.93| 72.73 |
| qwen25_1-5b_pycorrector| 46.09| 15.82| 81.29| 22.96| 82.17| 19.04| 12.8| 50.2| 96.4| 89.13| 22.8| 27.87| 32.55 |
| bert4csc_v1| 60.76| 88.21| 45.96| 43.13| 68.97| 35.0| 34.0| 65.86| 73.26| 81.8| 64.5| 61.11| 67.27 |
| macbert4csc_v1| 65.34| 93.56| 49.76| 44.98| 74.64| 36.1| 37.0| 73.0| 83.6| 86.87| 69.2| 62.62| 72.73 |
| macbert4csc_v2| 65.22| 93.69| 50.14| 44.92| 74.64| 36.26| 37.0| 72.72| 83.66| 86.93| 68.5| 62.43| 71.73 |
| macbert4mdcspell_v1| 67.15| 93.09| 54.8| 47.71| 78.09| 39.52| 38.8| 71.92| 84.78| 88.27| 73.2| 63.28| 72.36 |
| macbert4mdcspell_v2 | 68.31| 93.09| 50.05| 48.72| 75.74| 40.52| 38.9| 76.9| 84.8| 89.73| 71.0| 71.94| 78.36 |

### 2.3 acc (acc_true, thr=0.75)
| model/acc | avg| acc_rmrb| acc_xxqg |
|:------------------------|:-----------------|:-----------------|:-----------------|
| macbert4csc_pycorrector | 99.24| 99.22| 99.26 |
| qwen25_1-5b_pycorrector | 82.0| 77.14| 86.86 |
| bert4csc_v1 | 98.71| 98.36| 99.06 |
| macbert4csc_v1 | 97.72| 96.72| 98.72 |
| macbert4csc_v2 | 97.89| 96.98| 98.8 |
| macbert4mdcspell_v1 | 97.75| 96.51| 98.98 |
| macbert4mdcspell_v2 | 99.54| 99.22| 99.86 |

## 3. Conclusion
```
1. Models such as macbert4csc_v1/macbert4csc_v2/macbert4mdcspell_v1 are trained on data from many domains and are fairly balanced; they also work well as first-stage pretrained models for further fine-tuning on domain-specific data;
2. Comparing macbert4csc_pycorrector/bertbase4csc_v1/macbert4csc_v2/macbert4mdcspell_v1 in Table 2.3 shows that more training data improves accuracy but also slightly raises the over-correction rate;
3. MFT (Mask-Correct) still helps, though the gain is small when training data are plentiful, which may also be an important cause of the higher over-correction rate;
4. The training data include classical Chinese, so the trained models support correcting classical Chinese as well;
5. The trained models detect and correct high-frequency errors such as "地/得/的" at a high rate;
6. For macbert4mdcspell_v2, MFT uses no-error masking (rate 0.15) only 70% of the time, target-to-target 15% of the time, and no masking the remaining 15%;
```

## 4. Usage
### 4.1 With macro-correct
```python
import os
os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
from macro_correct import correct
```
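The block above is excerpted; a minimal end-to-end call, following the example in the project README:

```python
import os
os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
from macro_correct import correct

text_list = ["真麻烦你了。希望你们好好的跳无",
             "少先队员因该为老人让坐"]
for res_i in correct(text_list):
    print(res_i)
# e.g. {'index': 0, 'source': '真麻烦你了。希望你们好好的跳无',
#       'target': '真麻烦你了。希望你们好好地跳舞', 'errors': [['的', '地', 12, 0.6584], ['无', '舞', 14, 1.0]]}
```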

### 4.2 With transformers
```python
# !/usr/bin/python
# -*- coding: utf-8 -*-
# @time    : 2021/2/29 21:41
```
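The full script is elided above; a minimal sketch of direct transformers inference (assuming the checkpoint loads as a BertForMaskedLM and corrections are read off as the argmax token at each position; see tet_csc_flag_transformers.py in the project for the official version):

```python
import torch
from transformers import AutoTokenizer, BertForMaskedLM

pretrained = "Macropodus/macbert4mdcspell_v2"
tokenizer = AutoTokenizer.from_pretrained(pretrained)
model = BertForMaskedLM.from_pretrained(pretrained)
model.eval()

text = "机七学习是人工智能领遇最能体现智能的一个分知"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# Take the argmax token at every position, drop [CLS]/[SEP], and re-join;
# positions that differ from the input are the proposed corrections.
ids = logits.argmax(dim=-1)[0][1:-1]
print(tokenizer.decode(ids).replace(" ", ""))
```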

## 5. Papers
- 2024-Refining: [Refining Corpora from a Model Calibration Perspective for Chinese Spelling Correction](https://arxiv.org/abs/2407.15498)
- 2024-ReLM: [Chinese Spelling Correction as Rephrasing Language Model](https://arxiv.org/abs/2308.08796)
- 2024-DISC: [DISC: Plug-and-Play Decoding Intervention with Similarity of Characters for Chinese Spelling Check](https://arxiv.org/abs/2412.12863)

- 2023-Bi-DCSpell: [A Bi-directional Detector-Corrector Interactive Framework for Chinese Spelling Check]()
- 2023-BERT-MFT: [Rethinking Masked Language Modeling for Chinese Spelling Correction](https://arxiv.org/abs/2305.17721)
- 2023-PTCSpell: [PTCSpell: Pre-trained Corrector Based on Character Shape and Pinyin for Chinese Spelling Correction](https://arxiv.org/abs/2212.04068)
- 2023-DR-CSC: [A Frustratingly Easy Plug-and-Play Detection-and-Reasoning Module for Chinese Spelling Check](https://aclanthology.org/2023.findings-emnlp.771)
- 2023-DROM: [Disentangled Phonetic Representation for Chinese Spelling Correction](https://arxiv.org/abs/2305.14783)
- 2023-EGCM: [An Error-Guided Correction Model for Chinese Spelling Error Correction](https://arxiv.org/abs/2301.06323)
- 2023-IGPI: [Investigating Glyph-Phonetic Information for Chinese Spell Checking: What Works and What's Next?](https://arxiv.org/abs/2212.04068)
- 2023-CL: [Contextual Similarity is More Valuable than Character Similarity: An Empirical Study for Chinese Spell Checking]()

- 2022-CRASpell: [CRASpell: A Contextual Typo Robust Approach to Improve Chinese Spelling Correction](https://aclanthology.org/2022.findings-acl.237)
- 2022-MDCSpell: [MDCSpell: A Multi-task Detector-Corrector Framework for Chinese Spelling Correction](https://aclanthology.org/2022.findings-acl.98)
- 2022-SCOPE: [Improving Chinese Spelling Check by Character Pronunciation Prediction: The Effects of Adaptivity and Granularity](https://arxiv.org/abs/2210.10996)
- 2022-ECOPO: [The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking](https://arxiv.org/abs/2203.00991)

- 2021-MLMPhonetics: [Correcting Chinese Spelling Errors with Phonetic Pre-training](https://aclanthology.org/2021.findings-acl.198)
- 2021-ChineseBERT: [ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information](https://aclanthology.org/2021.acl-long.161/)
- 2021-BERTCrsGad: [Global Attention Decoder for Chinese Spelling Error Correction](https://aclanthology.org/2021.findings-acl.122)
- 2021-ThinkTwice: [Think Twice: A Post-Processing Approach for the Chinese Spelling Error Correction](https://www.mdpi.com/2076-3417/11/13/5832)
- 2021-PHMOSpell: [PHMOSpell: Phonological and Morphological Knowledge Guided Chinese Spelling Check](https://aclanthology.org/2021.acl-long.464)
- 2021-SpellBERT: [SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check](https://aclanthology.org/2021.emnlp-main.287)
- 2021-TwoWays: [Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models](https://aclanthology.org/2021.acl-short.56)
- 2021-ReaLiSe: [Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking](https://arxiv.org/abs/2105.12306)
- 2021-DCSpell: [DCSpell: A Detector-Corrector Framework for Chinese Spelling Error Correction](https://dl.acm.org/doi/10.1145/3404835.3463050)
- 2021-PLOME: [PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction](https://aclanthology.org/2021.acl-long.233)
- 2021-DCN: [Dynamic Connected Networks for Chinese Spelling Check](https://aclanthology.org/2021.findings-acl.216/)

- 2020-SoftMaskBERT: [Spelling Error Correction with Soft-Masked BERT](https://arxiv.org/abs/2005.07421)
- 2020-SpellGCN: [SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check](https://arxiv.org/abs/2004.14166)
- 2020-ChunkCSC: [Chunk-based Chinese Spelling Check with Global Optimization](https://aclanthology.org/2020.findings-emnlp.184)
- 2020-MacBERT: [Revisiting Pre-Trained Models for Chinese Natural Language Processing](https://arxiv.org/abs/2004.13922)

- 2019-FASPell: [FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm](https://aclanthology.org/D19-5522)
- 2018-Hybrid: [A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Checking](https://aclanthology.org/D18-1273)

- 2015-Sighan15: [Introduction to SIGHAN 2015 Bake-off for Chinese Spelling Check](https://aclanthology.org/W15-3106/)
- 2014-Sighan14: [Overview of SIGHAN 2014 Bake-off for Chinese Spelling Check](https://aclanthology.org/W14-6820/)
- 2013-Sighan13: [Chinese Spelling Check Evaluation at SIGHAN Bake-off 2013](https://aclanthology.org/W13-4406/)

## 6. References
- [nghuyong/Chinese-text-correction-papers](https://github.com/nghuyong/Chinese-text-correction-papers)
- [destwang/CTCResources](https://github.com/destwang/CTCResources)
- [wangwang110/CSC](https://github.com/wangwang110/CSC)
- [chinese-poetry/chinese-poetry](https://github.com/chinese-poetry/chinese-poetry)
- [chinese-poetry/huajianji](https://github.com/chinese-poetry/huajianji)
- [garychowcmu/daizhigev20](https://github.com/garychowcmu/daizhigev20)
- [yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly)
- [Macropodus/xuexiqiangguo_428w](https://huggingface.co/datasets/Macropodus/xuexiqiangguo_428w)
- [Macropodus/csc_clean_wang271k](https://huggingface.co/datasets/Macropodus/csc_clean_wang271k)
- [Macropodus/csc_eval_public](https://huggingface.co/datasets/Macropodus/csc_eval_public)
- [shibing624/pycorrector](https://github.com/shibing624/pycorrector)
- [iioSnail/MDCSpell_pytorch](https://github.com/iioSnail/MDCSpell_pytorch)
- [gingasan/lemon](https://github.com/gingasan/lemon)
- [Claude-Liu/ReLM](https://github.com/Claude-Liu/ReLM)

## 7. Cite
For citing this work, you can refer to the present GitHub project. For example, with BibTeX:
```
@software{macro-correct,
  author = {Yongzhuo Mo},
  title = {macro-correct},
  year = {2025}
}
```