Simon Clematide commited on
Commit
6e17501
·
1 Parent(s): 7ca08a0

Add Bloom Filter files and logs for French and German datasets

Browse files
ocrqa-wp_v1.0.6-de.bloom ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6c310a7a899687b4b719ebf0524d562f2d8dcc06406a7b148d6a6f5a41a7639f
3
+ size 10060958
ocrqa-wp_v1.0.6-de.bloom.log ADDED
@@ -0,0 +1,246 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ 2025-02-14 16:07:47,840 ocrqa_create_bloom_filter.py:425 INFO: Namespace(input_files=['lex/de/realword_ocr_errors.nw.txt', 'lex/de/ocr_errors.nw.txt', 'lex/de/old_spelling.rw.txt', 'lex/de/modern_spelling.rw.txt', 'lex/de/dewiki.unigram.freq.tsv.bz2'], bloom_path='build.d/fp_prob_0.00001/ocrqa-wp_v1.0.6-de.bloom', fp_probability=1e-05, log_level='INFO', log_file='build.d/fp_prob_0.00001/ocrqa-wp_v1.0.6-de.bloom.log', config=None, min_frequency=2, single_char_min_frequency=20, diagnose_bloom=True)
2
+ 2025-02-14 16:07:47,840 ocrqa_create_bloom_filter.py:226 INFO: Starting Bloom Filter creation...
3
+ 2025-02-14 16:07:47,840 ocrqa_create_bloom_filter.py:178 INFO: Processing nonword file: lex/de/realword_ocr_errors.nw.txt
4
+ 2025-02-14 16:07:47,840 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['negierung']
5
+ 2025-02-14 16:07:47,840 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['negierungen']
6
+ 2025-02-14 16:07:47,840 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['nidwaiden']
7
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ölten']
8
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['unterwaiden']
9
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['verlausen']
10
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['vertretet']
11
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:190 INFO: Excluded 7 words that should never be added
12
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:178 INFO: Processing nonword file: lex/de/ocr_errors.nw.txt
13
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['0oo']
14
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['@d']
15
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['@e']
16
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['@i']
17
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['@r']
18
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['@t']
19
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['aargan']
20
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['abbin']
21
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['abgefetzt']
22
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['abgereift']
23
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ahresbesoldung']
24
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['aisbann']
25
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['aneh']
26
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['anmeldungstermln']
27
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ariesheim']
28
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['aueh']
29
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ausgefetzt']
30
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['auslände']
31
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bahmen']
32
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bandesblatt']
33
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bebacht']
34
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bechnung']
35
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['befetzt']
36
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['befetzten']
37
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['begelung']
38
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['begierung']
39
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['begierungen']
40
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['begierungsrat']
41
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['behorde']
42
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['behorden']
43
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['berieht']
44
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bersonen']
45
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['besicht']
46
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bestimm']
47
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['betragt']
48
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['betreifend']
49
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['beutscher']
50
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bevision']
51
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bielleicht']
52
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bingier']
53
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bnndesblatt']
54
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bücksicht']
55
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bundesbehorden']
56
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['diefe']
57
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['diefer']
58
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['dingungen']
59
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['dnrch']
60
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['dnreh']
61
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['dureh']
62
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eahmen']
63
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eappen']
64
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eäte']
65
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eatifikation']
66
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ebenfall']
67
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ebruar']
68
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eechnung']
69
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eecht']
70
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eechte']
71
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eechts']
72
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eegel']
73
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eegelung']
74
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eegierung']
75
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eegierungen']
76
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eegierungsrat']
77
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eeglement']
78
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eeihe']
79
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eente']
80
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eenten']
81
+ 2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eepublik']
82
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eesolution']
83
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eevision']
84
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ehur']
85
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eichter']
86
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eidgenossenschast']
87
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eidgenossische']
88
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eingefetzt']
89
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['einlabung']
90
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eirea']
91
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eiue']
92
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eldg']
93
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['elfaß']
94
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['endlieh']
95
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['engtischen']
96
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eobert']
97
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eolle']
98
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erbalten']
99
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erbbeben']
100
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erfetzt']
101
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erhallen']
102
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erlauft']
103
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erleiben']
104
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erleibet']
105
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erseht']
106
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eücksicht']
107
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['euenburg']
108
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['feiet']
109
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['feinet']
110
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['feite']
111
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['festgefetzt']
112
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fetze']
113
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fetzte']
114
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fetzten']
115
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fetzung']
116
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['feuersbrunft']
117
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fiir']
118
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fipoi']
119
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fllr']
120
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fönst']
121
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fortfetzen']
122
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fortfetzung']
123
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fortgefetzt']
124
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['franzofen']
125
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['franzosischen']
126
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['frauken']
127
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['galleu']
128
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['gefetzt']
129
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['gemass']
130
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['gesellschast']
131
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['gesellsehast']
132
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['gewöhn']
133
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['gierung']
134
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['gischen']
135
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['grossere']
136
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['grossern']
137
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['hallung']
138
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['handelsund']
139
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['hauptfache']
140
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['heuligen']
141
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ì000']
142
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['iaht']
143
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['iahte']
144
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['iahten']
145
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['iiber']
146
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['infofern']
147
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['jnni']
148
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['kauton']
149
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['kautone']
150
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['kautons']
151
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['korden']
152
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['kreispostdirektiou']
153
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['leife']
154
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['liier']
155
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['lnzern']
156
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['locamo']
157
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['lostet']
158
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['luzernburg']
159
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['macbonalb']
160
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['mahnahmen']
161
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['mahregeln']
162
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['mährend']
163
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['matznahmen']
164
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['melben']
165
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['melche']
166
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ministet']
167
+ 2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['mitleib']
168
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['moglich']
169
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['möglid']
170
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['moglieh']
171
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['naeh']
172
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['nieht']
173
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['nikiaus']
174
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['noeh']
175
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['nothig']
176
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['nothigen']
177
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ollem']
178
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['poft']
179
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['prankreich']
180
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['rebakteur']
181
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['reiburg']
182
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['reuenburg']
183
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['roieber']
184
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['rovember']
185
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ruffischen']
186
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['schisse']
187
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['schwierigleiten']
188
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['sehengen']
189
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['seihst']
190
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['siud']
191
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['srühern']
192
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['stanben']
193
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['stobt']
194
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['tatfache']
195
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['tatfachen']
196
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['teten']
197
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['thronrebe']
198
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['uater']
199
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['uicht']
200
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['unbein']
201
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['unfete']
202
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['unier']
203
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['unterschieb']
204
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['urfache']
205
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ürich']
206
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['verfetzt']
207
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['verkau']
208
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['verkauten']
209
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['verlauft']
210
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['verlaus']
211
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['verlehr']
212
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['vorfitz']
213
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['vstrr']
214
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['webet']
215
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['welehe']
216
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['wnrde']
217
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['znm']
218
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['zurlch']
219
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:190 INFO: Excluded 213 words that should never be added
220
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:196 INFO: Processing real-word file: lex/de/old_spelling.rw.txt
221
+ 2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:196 INFO: Processing real-word file: lex/de/modern_spelling.rw.txt
222
+ 2025-02-14 16:07:47,844 ocrqa_create_bloom_filter.py:135 INFO: Processing frequency file: lex/de/dewiki.unigram.freq.tsv.bz2
223
+ 2025-02-14 16:08:00,551 ocrqa_create_bloom_filter.py:240 INFO: low_freq_excluded before removing parts from high-frequency words: 3780824
224
+ 2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:248 INFO: low_freq_excluded after removing parts from high-frequency words: 3288723
225
+ 2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:252 INFO: Lexical processing complete.
226
+ 2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - nonwords_read: 213
227
+ 2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - nonwords_count: 213
228
+ 2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - realwords_read: 488
229
+ 2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - realwords_accepted: 488
230
+ 2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - realwords_nonwords_filtered: 0
231
+ 2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - freq_words_read: 9199714
232
+ 2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - words_accepted: 4143119
233
+ 2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - subwords_accepted: 5029202
234
+ 2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - subwords_filtered: 1
235
+ 2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - low_freq_excluded: 3288723
236
+ 2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - single_char_words_filtered: 6719
237
+ 2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - freq_words_filtered: 5049876
238
+ 2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:259 INFO: Estimated word count: 3358453
239
+ 2025-02-14 16:08:01,663 ocrqa_create_bloom_filter.py:263 INFO: Bloom Filter created and saved to build.d/fp_prob_0.00001/ocrqa-wp_v1.0.6-de.bloom
240
+ 2025-02-14 16:08:02,322 ocrqa_create_bloom_filter.py:285 INFO: Diagnosis Results:
241
+ 2025-02-14 16:08:02,323 ocrqa_create_bloom_filter.py:286 INFO: - Excluded words in bloom filter: 0
242
+ 2025-02-14 16:08:02,323 ocrqa_create_bloom_filter.py:287 INFO: - Known words not in bloom filter: 0
243
+ 2025-02-14 16:08:02,821 ocrqa_create_bloom_filter.py:294 INFO: - Low-frequency words in bloom filter: 25
244
+ 2025-02-14 16:08:02,821 ocrqa_create_bloom_filter.py:300 INFO: - Proportion of excluded words in bloom filter: 0.00000000
245
+ 2025-02-14 16:08:02,821 ocrqa_create_bloom_filter.py:306 INFO: - Proportion of known words not in bloom filter: 0.00000000
246
+ 2025-02-14 16:08:02,821 ocrqa_create_bloom_filter.py:314 INFO: - Proportion of low-frequency words in bloom filter: 0.00000760
ocrqa-wp_v1.0.6-fr.bloom ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b905d07b317efd2349a92ccac06f6befea2d25b555be18a64d34bfefa262c781
3
+ size 5805282
ocrqa-wp_v1.0.6-fr.bloom.log ADDED
@@ -0,0 +1,52 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ 2025-02-14 16:08:03,171 ocrqa_create_bloom_filter.py:425 INFO: Namespace(input_files=['lex/fr/realword_ocr_errors.nw.txt', 'lex/fr/ocr_errors.nw.txt', 'lex/fr/old_spelling.rw.txt', 'lex/fr/modern_spelling.rw.txt', 'lex/fr/morph.rw.txt.bz2', 'lex/fr/single_char.rw.txt.bz2', 'lex/fr/frwiki.unigram.freq.tsv.bz2'], bloom_path='build.d/fp_prob_0.00001/ocrqa-wp_v1.0.6-fr.bloom', fp_probability=1e-05, log_level='INFO', log_file='build.d/fp_prob_0.00001/ocrqa-wp_v1.0.6-fr.bloom.log', config=None, min_frequency=2, single_char_min_frequency=20, diagnose_bloom=True)
2
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:226 INFO: Starting Bloom Filter creation...
3
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:178 INFO: Processing nonword file: lex/fr/realword_ocr_errors.nw.txt
4
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['gazelle']
5
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:190 INFO: Excluded 1 words that should never be added
6
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:178 INFO: Processing nonword file: lex/fr/ocr_errors.nw.txt
7
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bàie']
8
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['chepin']
9
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['dftce']
10
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ì']
11
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['locamo']
12
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['mmmi']
13
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['mmml']
14
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['mmmv']
15
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['neuchat']
16
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['neuchàtel']
17
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ölten']
18
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ponr']
19
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['pribourg']
20
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['remercîments']
21
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['sadresser']
22
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['zuberbuhler']
23
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:190 INFO: Excluded 17 words that should never be added
24
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:196 INFO: Processing real-word file: lex/fr/old_spelling.rw.txt
25
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:196 INFO: Processing real-word file: lex/fr/modern_spelling.rw.txt
26
+ 2025-02-14 16:08:03,172 ocrqa_create_bloom_filter.py:196 INFO: Processing real-word file: lex/fr/morph.rw.txt.bz2
27
+ 2025-02-14 16:08:03,856 ocrqa_create_bloom_filter.py:196 INFO: Processing real-word file: lex/fr/single_char.rw.txt.bz2
28
+ 2025-02-14 16:08:03,857 ocrqa_create_bloom_filter.py:135 INFO: Processing frequency file: lex/fr/frwiki.unigram.freq.tsv.bz2
29
+ 2025-02-14 16:08:08,932 ocrqa_create_bloom_filter.py:240 INFO: low_freq_excluded before removing parts from high-frequency words: 1548627
30
+ 2025-02-14 16:08:09,031 ocrqa_create_bloom_filter.py:248 INFO: low_freq_excluded after removing parts from high-frequency words: 1224019
31
+ 2025-02-14 16:08:09,031 ocrqa_create_bloom_filter.py:252 INFO: Lexical processing complete.
32
+ 2025-02-14 16:08:09,031 ocrqa_create_bloom_filter.py:256 INFO: - nonwords_read: 17
33
+ 2025-02-14 16:08:09,031 ocrqa_create_bloom_filter.py:256 INFO: - nonwords_count: 17
34
+ 2025-02-14 16:08:09,031 ocrqa_create_bloom_filter.py:256 INFO: - realwords_read: 723486
35
+ 2025-02-14 16:08:09,031 ocrqa_create_bloom_filter.py:256 INFO: - realwords_accepted: 737153
36
+ 2025-02-14 16:08:09,031 ocrqa_create_bloom_filter.py:256 INFO: - realwords_nonwords_filtered: 1
37
+ 2025-02-14 16:08:09,031 ocrqa_create_bloom_filter.py:256 INFO: - freq_words_read: 4047539
38
+ 2025-02-14 16:08:09,031 ocrqa_create_bloom_filter.py:256 INFO: - words_accepted: 1992185
39
+ 2025-02-14 16:08:09,032 ocrqa_create_bloom_filter.py:256 INFO: - subwords_accepted: 2385659
40
+ 2025-02-14 16:08:09,032 ocrqa_create_bloom_filter.py:256 INFO: - subwords_filtered: 9
41
+ 2025-02-14 16:08:09,032 ocrqa_create_bloom_filter.py:256 INFO: - low_freq_excluded: 1224019
42
+ 2025-02-14 16:08:09,032 ocrqa_create_bloom_filter.py:256 INFO: - single_char_words_filtered: 6597
43
+ 2025-02-14 16:08:09,032 ocrqa_create_bloom_filter.py:256 INFO: - freq_words_filtered: 2048757
44
+ 2025-02-14 16:08:09,032 ocrqa_create_bloom_filter.py:259 INFO: Estimated word count: 1937683
45
+ 2025-02-14 16:08:09,562 ocrqa_create_bloom_filter.py:263 INFO: Bloom Filter created and saved to build.d/fp_prob_0.00001/ocrqa-wp_v1.0.6-fr.bloom
46
+ 2025-02-14 16:08:09,924 ocrqa_create_bloom_filter.py:285 INFO: Diagnosis Results:
47
+ 2025-02-14 16:08:09,924 ocrqa_create_bloom_filter.py:286 INFO: - Excluded words in bloom filter: 0
48
+ 2025-02-14 16:08:09,924 ocrqa_create_bloom_filter.py:287 INFO: - Known words not in bloom filter: 0
49
+ 2025-02-14 16:08:10,101 ocrqa_create_bloom_filter.py:294 INFO: - Low-frequency words in bloom filter: 15
50
+ 2025-02-14 16:08:10,101 ocrqa_create_bloom_filter.py:300 INFO: - Proportion of excluded words in bloom filter: 0.00000000
51
+ 2025-02-14 16:08:10,101 ocrqa_create_bloom_filter.py:306 INFO: - Proportion of known words not in bloom filter: 0.00000000
52
+ 2025-02-14 16:08:10,101 ocrqa_create_bloom_filter.py:314 INFO: - Proportion of low-frequency words in bloom filter: 0.00001225