	---
	license: other
	language:
	- ko
	- en
	- ja
	- zh
	pipeline_tag: fill-mask
	---
	# Model Card for KEByT5-large (1.23B #params)

	<!-- Provide a quick summary of what the model is/does. -->
	KEByT5: Korean-Enhanced/Enriched Byte-level Text-to-Text Transfer Transformer (T5)

	A cross-modal, multilingual-friendly, token-free encoder-decoder pretrained language model for Korean.

	* This pretrained language model aims to be a token-free model that eases cross-lingual knowledge exchange and integration with non-text modalities such as vision and audio.
	* No separate tokenizer is required, but for convenience you can call AutoTokenizer.from_pretrained() and handle the model like any other tokenizer-based encoder-decoder model. To skip the tokenizer, split the UTF-8 input into bytes and add 3 to each byte value to obtain its token ID (that is, byte value 0 == token ID 3 and byte value 255 == token ID 258).
	* The model is currently at the preview stage and requires fine-tuning before use.
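
The tokenizer-free path described above can be sketched in plain Python. This is a minimal sketch, not the reference implementation: the function names are hypothetical, and the end-of-sequence ID of 1 follows the ByT5 convention and is an assumption here, so appending it is left optional.

```python
# Tokenizer-free encoding sketch: token ID = UTF-8 byte value + 3.
BYTE_OFFSET = 3  # IDs below 3 are not byte values
EOS_ID = 1       # assumption: ByT5-style end-of-sequence ID


def encode_bytes(text, add_eos=False):
    """Map each UTF-8 byte of `text` to a token ID."""
    ids = [b + BYTE_OFFSET for b in text.encode("utf-8")]
    if add_eos:
        ids.append(EOS_ID)
    return ids


def decode_bytes(ids):
    """Invert encode_bytes, skipping IDs outside the byte range."""
    raw = bytes(
        i - BYTE_OFFSET for i in ids if BYTE_OFFSET <= i < BYTE_OFFSET + 256
    )
    return raw.decode("utf-8", errors="replace")


print(encode_bytes("A"))   # one ASCII byte
print(encode_bytes("가"))  # one Hangul syllable, three UTF-8 bytes
```

A single Hangul syllable occupies three UTF-8 bytes, so Korean text yields roughly three token IDs per character.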

	## Acknowledgements
	* This pretrained language model was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) in 2022 (No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training).

	# Model Details

	The pretrained language models are available in the following sizes:
	* kebyt5-small : 330M [link](https://huggingface.co/etri-lirs/kebyt5-small-preview)
	* kebyt5-base : 580M [link](https://huggingface.co/etri-lirs/kebyt5-base-preview)
	* kebyt5-large : 1.23B (this model)

	These models have the same network architecture and size as [google/byt5-small](https://huggingface.co/google/byt5-small), [google/byt5-base](https://huggingface.co/google/byt5-base), and [google/byt5-large](https://huggingface.co/google/byt5-large), respectively. Because the tokenizer (ByT5Tokenizer) and the implementation are identical, the two model families can be swapped without any modification.
	In huggingface transformers, they can likewise be used through T5ForConditionalGeneration.

	## Model Description

	<!-- Provide a longer summary of what this model is. -->

	- Developed by: Language Intelligence Research Section, Electronics and Telecommunications Research Institute(ETRI)
	- Model type: Encoder-Decoder Transformer, specifically, ByT5.
	- Language(s) (NLP): Korean, English(partially for translation task), Chinese(partially for translation task), Japanese(partially for translation task).
	- License: Apache 2.0 License
	- Finetuned from model: kebyt5-small/-base/-large model weights were initialized from google/byt5-* for warm-start pretraining.

	## Model Sources

	- Repository: for downstream task training, https://github.com/etri-crossmodal/llm-downstream-s2s
	- Paper: 신종훈 외, "한국어 중심의 토큰-프리 언어 이해-생성 모델 사전학습 연구", 제35회 한글 및 한국어 정보처리 학술대회 논문집, pp.711-715. 2023.
	(EN=Shin et al., "Towards Korean-Centric Token-free Pretrained Language Model", in Procs. of the 35th Annual Conference on Human and Cognitive Language Technology. pp. 711-715. 2023.)

	# Uses

	Use of this pretrained language model is restricted to research and educational purposes.

	## Direct Use

	The released model was trained only with the corrupted span denoising objective used to train T5, so fine-tuning is required before applying it to real tasks.

	Masked token prediction can be performed with sentinel tokens (token IDs 258, 257, 256, ...), but the predictions may contain inappropriate content.
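
As an illustration of the sentinel IDs mentioned above, the sketch below builds a masked input at the byte level and splits a predicted ID sequence back into spans. The helper names are hypothetical, and the output layout (each sentinel followed by the bytes it stands for, with one extra sentinel closing the sequence) is assumed from the usual T5 span-denoising format rather than confirmed for this model.

```python
# Span-masking sketch at the byte level (hypothetical helper names).
FIRST_SENTINEL = 258  # sentinel IDs count down: 258, 257, 256, ...
BYTE_OFFSET = 3       # token ID = UTF-8 byte value + 3


def mask_byte_spans(text, spans):
    """Replace each (start, end) byte span of `text` with one sentinel ID."""
    data = text.encode("utf-8")
    ids, cursor = [], 0
    for n, (start, end) in enumerate(sorted(spans)):
        ids.extend(b + BYTE_OFFSET for b in data[cursor:start])
        ids.append(FIRST_SENTINEL - n)
        cursor = end
    ids.extend(b + BYTE_OFFSET for b in data[cursor:])
    return ids


def split_prediction(ids, num_spans):
    """Group predicted IDs by the sentinel that precedes them."""
    sentinels = {FIRST_SENTINEL - n: n for n in range(num_spans)}
    closing = FIRST_SENTINEL - num_spans  # assumed end-of-targets sentinel
    pieces = [bytearray() for _ in range(num_spans)]
    current = None
    for i in ids:
        if i == closing:
            break
        if i in sentinels:
            current = sentinels[i]
        elif current is not None and BYTE_OFFSET <= i < BYTE_OFFSET + 256:
            pieces[current].append(i - BYTE_OFFSET)
    return [p.decode("utf-8", errors="replace") for p in pieces]


masked = mask_byte_spans("abcdef", [(1, 3), (4, 5)])
print(masked)
```

Span boundaries are byte offsets, so for Korean text they should fall on character boundaries to avoid cutting a multi-byte character in half. If the tokenizer is bypassed entirely, an end-of-sequence ID may also need to be appended to the input.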

	## Downstream Use

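	Because the model is token-free, it is robust to complex or noisy input and is well suited to generating short sequences (for example, language understanding and dialogue response generation).

	Pretraining used sequences of 1024 bytes, so the model may not be suitable for problems that involve longer sequences.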
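
One way to stay within the 1024-byte pretraining length is to split long inputs on character boundaries before encoding. The helper below is a hypothetical sketch, not part of the released code; whether special tokens count toward the limit is an assumption to verify for your own setup.

```python
# Sketch: split text into pieces of at most `max_bytes` UTF-8 bytes
# without cutting a multi-byte character in half.
def chunk_by_bytes(text, max_bytes=1024):
    chunks, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        # Start a new chunk when this character would exceed the budget.
        if current and size + n > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks


pieces = chunk_by_bytes("가" * 400)  # 400 syllables = 1200 bytes
print([len(p.encode("utf-8")) for p in pieces])
```

Splitting on sentence boundaries instead of a fixed byte budget may preserve more context per chunk; the fixed budget is used here only to keep the sketch short.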

	For problems that require longer sequences, we recommend using the [GBST-based token-free language model](https://huggingface.co/etri-lirs/gbst-kebyt5-base-preview).

	# Bias, Risks, Limitations, and Recommendations

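	Information obtained through masked token prediction may carry the same risks as other generative language models. The training data was not filtered for profanity, obscenity, political content, or other harsh language. The model may therefore generate socially unacceptable tokens or text, and depending on the surrounding context it is hard to predict what it will produce in response to offensive input.

	This language model was trained mainly on Korean text and may suit downstream tasks that can transfer those characteristics, in particular classification, summarization, and short text generation. Out-of-vocabulary items cannot occur at the input or output level, but text sequences unlike those seen in pretraining require additional domain adaptation and downstream fine-tuning.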


	## How to Get Started with the Model
	With Transformers 4.27.0 or later, the following Python code loads the model and tokenizer:

	```
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	# Load the byte-level tokenizer and the encoder-decoder model for this checkpoint.
	tokenizer = AutoTokenizer.from_pretrained("etri-lirs/kebyt5-large-preview")
	model = AutoModelForSeq2SeqLM.from_pretrained("etri-lirs/kebyt5-large-preview")
	```

	# Training Details

	## Training Data
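	The following public datasets were used for pretraining:

	* National Institute of Korean Language, Modu Corpus (모두의 말뭉치). Newspaper v2.0
	* National Institute of Korean Language, Modu Corpus. Spoken corpus v1.2
	* National Institute of Korean Language, Modu Corpus. Written corpus v1.0
	* National Institute of Korean Language, Modu Corpus. Newspaper 2020 v1.0
	* National Institute of Korean Language, Modu Corpus. Newspaper 2021 v1.0
	* Korean Wikipedia dump, [v2020.09.20](https://github.com/lovit/kowikitext)
	* [Namuwiki dump](https://github.com/lovit/namuwikitext)
	* National Information Society Agency, AIHub. Specialized-domain corpus, legal/patent knowledge base, paper/book/dialogue/script summarization, Korean-English/Korean-Japanese/Korean-Chinese translation corpora, call center/order/news article/visual information question answering, and broadcast/meeting/counseling speech recognition data.
	* National Information Society Agency, AIHub. Large-scale web-based Korean corpus data
	* National Information Society Agency, AIHub. Online colloquial corpus data.
	* [KcBERT corpus, v2022.3Q](https://github.com/Beomi/KcBERT)

	In addition, a small amount of in-house data and some synthetic data were used, for a total of approximately 220 GB of training data.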

	# Evaluation

	## Testing Data, Factors & Metrics & Results

	The models were evaluated on the dev set of the [KLUE dataset, v1.1](https://klue-benchmark.com/), a benchmark for Korean language understanding tasks.
	In every case, the output label was generated directly with seq2seq.


	| models | KLUE-TC(YNAT) (F1) | KLUE-NER (Entity, Char F1) | KLUE-DP (UAS, LAS) | KLUE-MRC (EM, ROUGE-W) |
	|-------------|---------------|--------------|-------------------|------------------|
	| google/byt5-large (1.23B) | 78.52 | 48.81, 63.95 | 44.26, 7.805 | _NOT TESTED_ |
	| KEByT5-Base (580M) | 84.99 | 86.75, 91.05 | 88.70, 85.90 | 62.28, 68.38 |
	| KEByT5-Large (1.23B) | 85.68 | 88.09, 92.40 | 87.18, 85.52 | 70.07, 75.81 |
	| GBST-KEByT5-Base (584M) | 85.29 | 87.35, 92.09 | 88.33, 85.00 | 59.69, 66.44 |

	Results on KLUE-WOS-v1.1, a dialogue state tracking (DST) task, are shown below. In every case, the dialogue state was generated directly with seq2seq:

	| models | WOS (JGA, %) | WOS (F1, %) |
	| ------- | ---------- | ----------- |
	| klue/klue-roberta-large | 50.22 | 92.23 |
	| KEByT5-Base (580M) | 77.15 | 96.92 |
	| KEByT5-Large (1.23B) | 78.54 | 97.28 |
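
For readers unfamiliar with the JGA column, joint goal accuracy counts a turn as correct only when the entire predicted dialogue state matches the gold state. The sketch below shows that definition under simple assumptions: states are given as collections of slot-value strings, the function name and example slots are hypothetical, and the official KLUE scorer may normalize values differently.

```python
# Joint goal accuracy sketch: a turn scores only on an exact state match.
def joint_goal_accuracy(predicted_states, gold_states):
    if len(predicted_states) != len(gold_states):
        raise ValueError("prediction and gold turn counts differ")
    if not gold_states:
        return 0.0
    exact = sum(
        set(pred) == set(gold)
        for pred, gold in zip(predicted_states, gold_states)
    )
    return 100.0 * exact / len(gold_states)


# Hypothetical two-turn dialogue; the second turn has one wrong slot value.
pred = [["hotel-price-cheap"], ["hotel-price-cheap", "hotel-area-north"]]
gold = [["hotel-price-cheap"], ["hotel-price-cheap", "hotel-area-south"]]
print(joint_goal_accuracy(pred, gold))
```

Because one wrong slot invalidates the whole turn, JGA is much stricter than slot-level F1, which is why the two columns in the table differ so widely.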

	Results on KLUE-RE-v1.1, a relation extraction (RE) task, are shown below. Scores are micro F1 over the 29 relation classes, excluding no_relation:

	| models | KLUE-RE (F1, %) |
	| ------- | ---------- |
	| klue/klue-roberta-base | 65.90 |
	| KEByT5-Base (580M) | 65.48 |
	| KEByT5-Large (1.23B) | 68.95 |
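
Micro F1 with no_relation excluded can be sketched as follows. This is an illustrative implementation of the metric as described above, not the evaluation script used for the table; the function name and the example labels are hypothetical.

```python
# Micro F1 over relation classes, ignoring the `no_relation` label.
def micro_f1_without_no_relation(gold, pred, ignore="no_relation"):
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # A true positive is a correct prediction of a real relation.
    tp = sum(g == p and g != ignore for g, p in zip(gold, pred))
    predicted = sum(p != ignore for p in pred)  # relations the model asserted
    actual = sum(g != ignore for g in gold)     # relations in the gold data
    if tp == 0:
        return 0.0
    precision = tp / predicted
    recall = tp / actual
    return 100.0 * 2 * precision * recall / (precision + recall)


gold = ["rel_a", "rel_b", "no_relation", "rel_a"]
pred = ["rel_a", "no_relation", "rel_b", "rel_b"]
print(round(micro_f1_without_no_relation(gold, pred), 2))
```

Correctly predicting no_relation earns no credit under this metric, but predicting a relation where none exists still lowers precision.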


	## Compute Infrastructure

	* Trained on 4 × NVIDIA A100 80GB GPUs

	# Citation

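	* 허정 외, "생성형 언어모델을 이용한 관계 추출", 제35회 한글 및 한국어 정보처리 학술대회 논문집. pp.708-710. 2023.
	(EN=Heo et al., "Relation Extraction Using Generative Language Models", in Procs. of the 35th Annual Conference on Human and Cognitive Language Technology. pp. 708-710. 2023.)
	* 이기영 외, "한국어 토큰-프리 사전학습 언어모델 KeByT5를 이용한 한국어 생성 기반 대화 상태 추적", 제35회 한글 및 한국어 정보처리 학술대회 논문집. pp.644-647. 2023.
	(EN=Lee et al., "Korean Generation-based Dialogue State Tracking Using the Korean Token-free Pretrained Language Model KeByT5", in Procs. of the 35th Annual Conference on Human and Cognitive Language Technology. pp. 644-647. 2023.)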

	# Model Card Authors/Contacts

	Jong-hun Shin(ETRI), e-mail=jhshin82 _AT_ etri _DOT_ re _DOT_ kr.