OCR Quality Assessment using Unigram Language Data
This HuggingFace model repository contains known word lists (a.k.a. word unigram data) in Bloom filter format, built for efficient and robust OCR quality assessment.
Known Word Lists as Bloom Filters
All model names start with ocrqa-, and the remainder specifies the following metadata:
- Model Name: A short identifier (e.g. wp for Wikipedia)
- Version: A specific model version identifier (e.g. v1.0.0)
- Language: The target language (e.g. fr, de)
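For example, ocrqa-wp_v1.0.6-de.bloom combines the model name wp (Wikipedia), the version v1.0.6, and the language de (German).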
If available, log files from the Bloom filter compilation process contain more details about the word lists that were used.
All words in the Bloom filters are lowercased and normalized using Unicode NFKC normalization. All digits are mapped to 0, and many punctuation characters and other non-alphanumeric symbols are replaced by spaces.
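For illustration, here is a minimal sketch of this lookup normalization (the helper lookup_form is hypothetical; the full translation table, including the punctuation-to-space replacements, is shown in the Usage section below):

import unicodedata

def lookup_form(word: str) -> str:
    # Hypothetical helper for illustration only.
    # Lowercase and apply Unicode NFKC normalization, as the filters expect.
    w = unicodedata.normalize("NFKC", word).lower()
    # Map ASCII digits to "0"; the punctuation-to-space mapping is omitted here.
    return "".join("0" if c in "0123456789" else c for c in w)

print(lookup_form("Anno 1848"))  # -> "anno 0000"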
Installation
Tested with Python 3.11.
pip install cython pybloomfiltermmap3 huggingface_hub
Usage
To use these models in your project for OCR QA, you can use the following code snippet:
import unicodedata
from typing import Optional
from huggingface_hub import hf_hub_download
from pybloomfilter import BloomFilter
# Define normalization table
QUOTES_PUNCT = "„•<>!\"#%&'’"
ASCII_PUNCT = "()*,./:;?"
BRACKETS_SPECIAL = "[]\\~_{}"
UNICODE_PUNCT = "\xa1\xab\xb7\xbb\xbf"
DASH_CARET = "—^`"
SPECIAL_SYMBOLS = "¦§£="
HYPHEN = "-"
DIGITS = "0123456789"

NORMALIZATION_TABLE = str.maketrans(
    {
        char: " "
        for char in (
            QUOTES_PUNCT
            + ASCII_PUNCT
            + BRACKETS_SPECIAL
            + UNICODE_PUNCT
            + DASH_CARET
            + SPECIAL_SYMBOLS
            + HYPHEN
        )
    }
    | {char: "0" for char in DIGITS}
)
def normalize_text(s: str, unicode_normalize: Optional[str] = "NFKC") -> str:
    """Normalize text by replacing punctuation with spaces and digits with '0'."""
    if unicode_normalize:
        s = unicodedata.normalize(unicode_normalize, s).lower()
    return s.translate(NORMALIZATION_TABLE)
def get_bloomfilter(model_id: str, filename: str) -> BloomFilter:
    # Download the .bloom file from the Hugging Face Hub and memory-map it
    return BloomFilter.open(hf_hub_download(repo_id=model_id, filename=filename))
def filter_text(text: str, bloom_filter: BloomFilter) -> dict:
    """Check every token of a text against the Bloom filter.

    Returns a dict with the sets of known and unknown tokens.
    """
    knowns = set()
    unknowns = set()

    # Normalize and tokenize text
    normalized_text = normalize_text(text)
    tokens = normalized_text.split()

    # Check tokens against the bloom filter
    for token in tokens:
        if token in bloom_filter:
            print(f"'{token}' is in the bloom filter.")
            knowns.add(token)
        else:
            print(f"'{token}' is NOT in the bloom filter.")
            unknowns.add(token)

    return {"knowns": knowns, "unknowns": unknowns}
# Example text that deliberately contains OCR/typing errors ("histrische", "änthält");
# German for: "This historical text contains OCR/typing errors, but also some correct words."
DE_TEXT = """Dieser histrische Text änthält OCR-/Tippsfehler, aber auch einige korrekte Wörter."""
# Load the bloom filter
bf = get_bloomfilter(
    "impresso-project/OCR-quality-assessment-unigram", "ocrqa-wp_v1.0.6-de.bloom"
)
print(filter_text(DE_TEXT, bf))
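One simple way to condense the returned sets into a single score is the share of known tokens. This metric is only an illustration and not necessarily the score computed by the impresso pipeline:

def known_token_ratio(result: dict) -> float:
    # Illustrative score: fraction of distinct tokens found in the Bloom filter.
    known, unknown = len(result["knowns"]), len(result["unknowns"])
    total = known + unknown
    return known / total if total else 0.0

print(f"Known-token ratio: {known_token_ratio(filter_text(DE_TEXT, bf)):.2f}")

Note that Bloom filters can yield false positives, so such a score slightly overestimates the share of known words.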
Limitations
- Only French and German are supported so far.
- New Wikipedia dumps should be used to update the word lists.
Release info
- v1.0.6: Added more high-frequency words for German (historical spellings) and a few for French. These models are planned to be used in the impresso webapp and API.
- v1.0.5: Initial release with impresso 1 word lists (only used internally, never available in the public webapp or API), built mostly from Wikipedia dumps from 2019.