Description

German word embedding model trained by Müller with the following parameter configuration:

  • a corpus as large and diverse as possible (while remaining formal), with punctuation and stopwords filtered out
  • forming bigram tokens
  • using skip-gram as the training algorithm with hierarchical softmax
  • window size between 5 and 10
  • dimensionality of feature vectors of 300 or more
  • using negative sampling with 10 samples
  • ignoring all words with total frequency lower than 50
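The filtering and bigram steps above can be sketched in plain Python. The stopword list and the known-bigram set below are illustrative placeholders, not the ones used for the actual model (in practice, bigrams are typically detected statistically, e.g. with gensim's `Phrases`):

```python
import string

# Illustrative mini stopword list; the real model used a full German stopword list
STOPWORDS = {"der", "die", "das", "und", "in", "von"}

def preprocess(sentence):
    """Lowercase, strip punctuation, and drop stopwords from one sentence."""
    tokens = []
    for word in sentence.lower().split():
        word = word.strip(string.punctuation)
        if word and word not in STOPWORDS:
            tokens.append(word)
    return tokens

def form_bigrams(tokens, known_bigrams):
    """Join adjacent tokens into one token when the pair is a known bigram."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in known_bigrams:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

sentence = "Die Technische Universität Berlin liegt in Berlin."
tokens = preprocess(sentence)
bigrams = form_bigrams(tokens, {("technische", "universität")})
```

Joined bigrams such as `technische_universität` then enter the vocabulary as single tokens, so they receive their own embedding vector.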

For more information, see https://devmount.github.io/GermanWordEmbeddings/

How to use?

from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

# Download the binary word2vec file from the Hub, then load it with gensim
model_path = hf_hub_download(repo_id="Word2vec/german_model", filename="german.model")
model = KeyedVectors.load_word2vec_format(model_path, binary=True, unicode_errors="ignore")
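Once loaded, `KeyedVectors` answers similarity queries (e.g. `model.most_similar("Berlin")`); under the hood these are cosine similarities over the word vectors. A self-contained sketch of that computation, using made-up 3-dimensional toy vectors in place of the real 300-dimensional embeddings:

```python
import math

# Toy vectors standing in for the real 300-dimensional embeddings
vectors = {
    "berlin":  [0.9, 0.1, 0.2],
    "hamburg": [0.8, 0.2, 0.3],
    "apfel":   [0.1, 0.9, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def most_similar(word, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]
```

With these toy vectors, `most_similar("berlin")` ranks "hamburg" above "apfel", mirroring how the trained model places semantically related German words close together.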

Citation

@thesis{mueller2015,
  author = {{Müller}, Andreas},
  title  = "{Analyse von Wort-Vektoren deutscher Textkorpora}",
  school = {Technische Universität Berlin},
  year   = 2015,
  month  = jun,
  type   = {Bachelor's Thesis},
  url    = {https://devmount.github.io/GermanWordEmbeddings}
}