OCR Quality Assessment using Unigram Language Data
This HuggingFace model repository contains known word lists (a.k.a. word unigram data) in Bloom filter format, built for efficient and robust OCR quality assessment.
Known Word Lists as Bloom Filters
All model names start with ocrqa-, and the remainder specifies the following metadata:
- Model Name: A short identifier (e.g. wp for Wikipedia)
- Version: A specific model version identifier (e.g. v1.0.0)
- Language: The target language (e.g. fr, de)
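For example, ocrqa-wp_v1.0.6-de.bloom combines the model name wp (Wikipedia), the version v1.0.6, and the language de (German).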
If available, log files from the Bloom filter compilation process contain more details about the word lists that were used.
All words in the Bloom filters are lowercased and normalized using Unicode NFKC normalization. All digits are mapped to 0, and many punctuation characters and other non-alphanumeric symbols are replaced by spaces.
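For illustration, here is a minimal sketch of this lookup normalization (the helper lookup_form is hypothetical; the full translation table, including the punctuation-to-space replacements, is shown in the Usage section below):

import unicodedata

def lookup_form(word: str) -> str:
    # Hypothetical helper for illustration only.
    # Lowercase and apply Unicode NFKC normalization, as the filters expect.
    w = unicodedata.normalize("NFKC", word).lower()
    # Map ASCII digits to "0"; the punctuation-to-space mapping is omitted here.
    return "".join("0" if c in "0123456789" else c for c in w)

print(lookup_form("Anno 1848"))  # -> "anno 0000"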
Installation
Tested with Python 3.11.
pip install cython pybloomfiltermmap3 huggingface_hub
Usage
To use these models in your project for OCR QA, you can use the following code snippet:
import unicodedata
from typing import Optional
from huggingface_hub import hf_hub_download
from pybloomfilter import BloomFilter
# Define normalization table
QUOTES_PUNCT = "„•<>!\"#%&'’"
ASCII_PUNCT = "()*,./:;?"
BRACKETS_SPECIAL = "[]\\~_{}"
UNICODE_PUNCT = "\xa1\xab\xb7\xbb\xbf"
DASH_CARET = "—^`"
SPECIAL_SYMBOLS = "¦§£="
HYPHEN = "-"
DIGITS = "0123456789"

NORMALIZATION_TABLE = str.maketrans(
    {
        char: " "
        for char in (
            QUOTES_PUNCT
            + ASCII_PUNCT
            + BRACKETS_SPECIAL
            + UNICODE_PUNCT
            + DASH_CARET
            + SPECIAL_SYMBOLS
            + HYPHEN
        )
    }
    | {char: "0" for char in DIGITS}
)
def normalize_text(s: str, unicode_normalize: Optional[str] = "NFKC") -> str:
    """Normalize text by replacing punctuation with spaces and digits with '0'."""
    if unicode_normalize:
        s = unicodedata.normalize(unicode_normalize, s).lower()
    return s.translate(NORMALIZATION_TABLE)
def get_bloomfilter(model_id: str, filename: str) -> BloomFilter:
    # Download the .bloom file from the Hugging Face Hub and memory-map it
    return BloomFilter.open(hf_hub_download(repo_id=model_id, filename=filename))
def filter_text(text: str, bloom_filter: BloomFilter) -> dict:
    """Check every token of a text against the Bloom filter.

    Returns a dict with the sets of known and unknown tokens.
    """
    knowns = set()
    unknowns = set()

    # Normalize and tokenize text
    normalized_text = normalize_text(text)
    tokens = normalized_text.split()

    # Check tokens against the bloom filter
    for token in tokens:
        if token in bloom_filter:
            print(f"'{token}' is in the bloom filter.")
            knowns.add(token)
        else:
            print(f"'{token}' is NOT in the bloom filter.")
            unknowns.add(token)

    return {"knowns": knowns, "unknowns": unknowns}
# Example text that deliberately contains OCR/typing errors ("histrische", "änthält");
# German for: "This historical text contains OCR/typing errors, but also some correct words."
DE_TEXT = """Dieser histrische Text änthält OCR-/Tippsfehler, aber auch einige korrekte Wörter."""
# Load the bloom filter
bf = get_bloomfilter(
    "impresso-project/OCR-quality-assessment-unigram", "ocrqa-wp_v1.0.6-de.bloom"
)
print(filter_text(DE_TEXT, bf))
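One simple way to condense the returned sets into a single score is the share of known tokens. This metric is only an illustration and not necessarily the score computed by the impresso pipeline:

def known_token_ratio(result: dict) -> float:
    # Illustrative score: fraction of distinct tokens found in the Bloom filter.
    known, unknown = len(result["knowns"]), len(result["unknowns"])
    total = known + unknown
    return known / total if total else 0.0

print(f"Known-token ratio: {known_token_ratio(filter_text(DE_TEXT, bf)):.2f}")

Note that Bloom filters can yield false positives, so such a score slightly overestimates the share of known words.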
Limitations
- Only French and German are supported so far.
- New Wikipedia dumps should be used to update the word lists.
Release info
- v1.0.6: Added more high-frequency words for German (historical spellings) and a few for French. These models are planned to be used in the impresso webapp and API.
- v1.0.5: Initial release with impresso 1 word lists (only used internally, never available in the public webapp or API), built mostly from Wikipedia dumps from 2019.