---
library_name: transformers
tags: []
---
# Model Card
**JpharmaBERT (large)** is a continually pre-trained version of the BERT model [tohoku-nlp/bert-large-japanese-v2](https://huggingface.co/tohoku-nlp/bert-large-japanese-v2), further trained on pharmaceutical data, the same dataset used for [EQUES/JPharmatron-7B](https://huggingface.co/EQUES/JPharmatron-7B).
# Example Usage
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Load the model in bfloat16 together with its tokenizer
model = AutoModelForMaskedLM.from_pretrained("EQUES/jpharma-bert-large", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("EQUES/jpharma-bert-large")

# Fill in the masked token ("水は化学式で[MASK]2Oです。" = "The chemical formula of water is [MASK]2O.")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
results = fill_mask("水は化学式で[MASK]2Oです。")
for result in results:
    print(result)
# {'score': 0.49609375, 'token': 55, 'token_str': 'H', 'sequence': '水は化学式でH2Oです。'}
# {'score': 0.11767578125, 'token': 29257, 'token_str': 'Na', 'sequence': '水は化学式でNa2Oです。'}
# {'score': 0.047607421875, 'token': 61, 'token_str': 'N', 'sequence': '水は化学式でN2Oです。'}
# {'score': 0.038330078125, 'token': 16966, 'token_str': 'CH', 'sequence': '水は化学式でCH2Oです。'}
# {'score': 0.0255126953125, 'token': 66, 'token_str': 'S', 'sequence': '水は化学式でS2Oです。'}
```
## Training Details
### Training Data
We used the same dataset as [eques/jpharmatron](https://huggingface.co/EQUES/JPharmatron-7B) for training our JpharmaBERT, which consists of:
- Japanese text data (2B tokens) collected from pharmaceutical documents such as academic papers and package inserts
- English data (8B tokens) obtained from PubMed abstracts
- Pharmaceutical-related data (1.2B tokens) extracted from the multilingual CC100 dataset
After removing duplicate entries across these sources, the final dataset contains approximately 9 billion tokens.
For details, please refer to [our paper on JPharmatron](https://arxiv.org/abs/2505.16661).
### Training Hyperparameters
The model was continually pre-trained with the following settings (a sketch of the corresponding training setup follows the list):
- Mask probability: 15%
- Maximum sequence length: 512 tokens
- Number of training epochs: 6
- Learning rate: 5e-5
- Warm-up steps: 10,000
- Per-device training batch size: 64
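
The snippet below is a minimal sketch of how these settings map onto the Hugging Face `Trainer` API for masked-language-model continual pre-training; it is not the actual training script. The corpus file `pharma_corpus.txt` and the output directory are placeholders, and the real data pipeline (source mixing, deduplication) is not reproduced here.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the base Japanese BERT checkpoint
base = "tohoku-nlp/bert-large-japanese-v2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Placeholder corpus: one document per line in a plain-text file
raw = load_dataset("text", data_files={"train": "pharma_corpus.txt"})

def tokenize(batch):
    # Maximum sequence length of 512 tokens, as listed above
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# 15% dynamic masking for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Hyperparameters from the list above
args = TrainingArguments(
    output_dir="jpharma-bert-large",  # placeholder output path
    num_train_epochs=6,
    learning_rate=5e-5,
    warmup_steps=10_000,
    per_device_train_batch_size=64,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()
```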
## Model Card Authors
Created by Takuro Fujii ([email protected])