Transformers

This model was released on 2019-03-24 and added to Hugging Face Transformers on 2020-11-16.

BertJapanese

Overview

These are BERT models trained on Japanese text.

There are models with two different tokenization methods:

Tokenize with MeCab and WordPiece. This requires an extra dependency, fugashi, which is a wrapper around MeCab.
Tokenize into characters.

To use MecabTokenizer, run pip install transformers["ja"] (or pip install -e .["ja"] if you install from source) to install the dependencies.
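Before loading a MeCab-based tokenizer, it can help to confirm that the extra dependency is importable. The snippet below is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical, and the check covers only the fugashi wrapper, not whether a MeCab dictionary is available.

```python
import importlib.util


def missing_modules(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# fugashi is the MeCab wrapper mentioned above.
missing = missing_modules(["fugashi"])
if missing:
    print("Missing:", ", ".join(missing), '- try: pip install transformers["ja"]')
else:
    print("The fugashi module was found.")
```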

See the cl-tohoku repository for details.

Example of using a model with MeCab and WordPiece tokenization:

>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] 吾輩 は 猫 で ある 。 [SEP]

>>> outputs = bertjapanese(**inputs)
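To illustrate the subword stage of this pipeline, here is a rough sketch of greedy longest-match-first WordPiece splitting applied to a single word that has already been segmented (for example by MeCab). The vocabulary is a toy assumption, the function name wordpiece is hypothetical, and the code is a simplified illustration rather than the library's implementation.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"東京", "##都", "猫"}  # toy vocabulary, assumed for illustration
print(wordpiece("東京都", toy_vocab))
print(wordpiece("犬", toy_vocab))
```

The first call splits the word into a leading piece and a "##"-prefixed continuation; the second falls back to the unknown token because no piece of the word is in the toy vocabulary.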

Example of using a model with Character tokenization:

>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")

>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] 吾 輩 は 猫 で あ る 。 [SEP]

>>> outputs = bertjapanese(**inputs)
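Character tokenization itself is simple enough to sketch in plain Python. The example below assumes a toy character vocabulary and a hypothetical helper char_tokenize; it omits any Unicode normalization the real tokenizer may apply and adds the special tokens by hand.

```python
def char_tokenize(text, vocab, unk_token="[UNK]"):
    """Map each character to itself, or to the unknown token if it is not in the vocabulary."""
    return [ch if ch in vocab else unk_token for ch in text]


toy_vocab = set("吾輩は猫である。")  # toy vocabulary, assumed for illustration
tokens = ["[CLS]"] + char_tokenize("吾輩は猫である。", toy_vocab) + ["[SEP]"]
print(" ".join(tokens))
# [CLS] 吾 輩 は 猫 で あ る 。 [SEP]
```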

This model was contributed by cl-tohoku.

This implementation is the same as BERT, except for the tokenization method. Refer to the BERT documentation for API reference information.

BertJapaneseTokenizer

class transformers.BertJapaneseTokenizer

( vocab_file, spm_file = None, do_lower_case = False, do_word_tokenize = True, do_subword_tokenize = True, word_tokenizer_type = 'basic', subword_tokenizer_type = 'wordpiece', never_split = None, unk_token = '[UNK]', sep_token = '[SEP]', pad_token = '[PAD]', cls_token = '[CLS]', mask_token = '[MASK]', mecab_kwargs = None, sudachi_kwargs = None, jumanpp_kwargs = None, **kwargs )

Parameters

vocab_file (str) — Path to a one-wordpiece-per-line vocabulary file.
spm_file (str, optional) — Path to SentencePiece file (generally has a .spm or .model extension) that contains the vocabulary.
do_lower_case (bool, optional, defaults to False) — Whether to lower case the input. Only has an effect when do_word_tokenize=True.
do_word_tokenize (bool, optional, defaults to True) — Whether to do word tokenization.
do_subword_tokenize (bool, optional, defaults to True) — Whether to do subword tokenization.
word_tokenizer_type (str, optional, defaults to "basic") — Type of word tokenizer. Choose from [“basic”, “mecab”, “sudachi”, “jumanpp”].
subword_tokenizer_type (str, optional, defaults to "wordpiece") — Type of subword tokenizer. Choose from [“wordpiece”, “character”, “sentencepiece”].
mecab_kwargs (dict, optional) — Dictionary passed to the MecabTokenizer constructor.
sudachi_kwargs (dict, optional) — Dictionary passed to the SudachiTokenizer constructor.
jumanpp_kwargs (dict, optional) — Dictionary passed to the JumanppTokenizer constructor.
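
The way do_word_tokenize and do_subword_tokenize combine can be pictured as a two-stage pipeline. The following is a simplified sketch under that assumption, not the library's code; the tokenize function and the two stand-in tokenizers (whitespace_words, characters) are hypothetical and exist only to make the example runnable.

```python
def tokenize(text, word_tokenize, subword_tokenize,
             do_word_tokenize=True, do_subword_tokenize=True):
    """Apply an optional word-level stage, then an optional subword-level stage."""
    # Stage 1: split into words, or keep the whole text as a single token.
    tokens = word_tokenize(text) if do_word_tokenize else [text]
    # Stage 2: split every word into subword pieces, or return the words unchanged.
    if do_subword_tokenize:
        return [piece for token in tokens for piece in subword_tokenize(token)]
    return tokens


def whitespace_words(text):
    """Hypothetical stand-in for a word tokenizer such as MeCab."""
    return text.split()


def characters(token):
    """Hypothetical stand-in for a character-level subword tokenizer."""
    return list(token)


print(tokenize("猫 で ある", whitespace_words, characters))
# ['猫', 'で', 'あ', 'る']
print(tokenize("猫 で ある", whitespace_words, characters, do_subword_tokenize=False))
# ['猫', 'で', 'ある']
```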

Construct a BERT tokenizer for Japanese text.

This tokenizer inherits from PreTrainedTokenizer, which contains most of the main methods. Refer to this superclass for more information regarding those methods.

convert_tokens_to_string

( tokens )

Converts a sequence of tokens (strings) into a single string.
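
As a rough illustration of what this conversion involves for WordPiece tokens, the sketch below joins tokens with spaces and merges BERT-style "##" continuation pieces back into the preceding token. It is an assumption-based approximation written as a standalone function (tokens_to_string is a hypothetical name), and it does not cover SentencePiece vocabularies.

```python
def tokens_to_string(tokens):
    """Join tokens with spaces, then glue '##' continuation pieces onto the previous token."""
    return " ".join(tokens).replace(" ##", "").strip()


print(tokens_to_string(["東京", "##都", "に", "住む"]))
# 東京都 に 住む
```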
