🍊 DPR-KO

1. Intro

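This is a Korean DPR model (Question Encoder).
It was trained with a new codebase written from scratch, entirely separate from Facebook's DPR code.
It can be used for semantic search based on dense vectors.
Questions are encoded with the Question Encoder, and passages with the Context Encoder.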
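
The retrieval step described above can be sketched as follows. This is a minimal illustration, not the repository's code: the vectors are small toy values standing in for encoder outputs, the helper names `dot` and `rank_passages` are hypothetical, and scoring by inner product follows the original DPR paper rather than anything this model card confirms.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def rank_passages(
    question_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs for the top_k highest-scoring passages."""
    scored = [(i, dot(question_vec, p)) for i, p in enumerate(passage_vecs)]
    # Sort by score, highest first; ties keep the original passage order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors (assumption: real vectors come from the
# Question Encoder and the Context Encoder and are much larger).
question = [1.0, 0.0, 2.0]
passages = [
    [0.5, 0.1, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.5],
]
top2 = rank_passages(question, passages, top_k=2)
```

In a real setup the passage vectors would be computed once and stored in a vector index, so that only the question has to be encoded at query time.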

2. Experiment settings

  • Base model: klue/bert-base
  • Dataset: KorQuAD v1
  • Wiki dump: kowiki-latest-pages-articles.xml.bz2 (2024/07/23)
  • Sentences per chunk: 5
  • Total chunks: about 1.6 million
  • BM25 weight: 0.3
  • Hardware: 1 A100 GPU
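
Two of the settings above, the chunk size and the BM25 weight, can be illustrated with a short sketch. Both functions are hypothetical: the model card does not say whether chunks overlap or how the BM25 weight enters the final score, so this assumes non-overlapping chunks and a simple weighted sum `dense + bm25_weight * bm25`. The GitHub repository is the authority on the actual procedure.

```python
from typing import Dict, List


def chunk_sentences(sentences: List[str], sentences_per_chunk: int = 5) -> List[str]:
    """Group consecutive sentences into non-overlapping chunks (assumed layout)."""
    if sentences_per_chunk < 1:
        raise ValueError("sentences_per_chunk must be at least 1")
    return [
        " ".join(sentences[i : i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def hybrid_scores(
    dense: Dict[int, float],
    bm25: Dict[int, float],
    bm25_weight: float = 0.3,
) -> Dict[int, float]:
    """Combine per-chunk scores as dense + bm25_weight * bm25 (assumed formula).

    A chunk missing from one of the two score maps contributes 0.0 for that side.
    """
    chunk_ids = set(dense) | set(bm25)
    return {
        cid: dense.get(cid, 0.0) + bm25_weight * bm25.get(cid, 0.0)
        for cid in chunk_ids
    }


# Twelve placeholder sentences produce chunks of 5, 5, and 2 sentences.
sentences = [f"s{i}" for i in range(12)]
chunks = chunk_sentences(sentences)

# Toy scores keyed by chunk id.
combined = hybrid_scores({0: 1.0, 1: 2.0}, {0: 10.0, 2: 5.0})
```

Because dense and BM25 scores live on different scales, a real pipeline may need to normalize them before mixing; that step is left out here.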

3. Performance

(%)          BM25 (w/o DPR-KO)   DPR-KO (w/o BM25)   DPR-KO (with BM25)
Top1 Acc     36.25               48.98               71.16
Top5 Acc     51.61               71.16               86.75
Top10 Acc    57.34               77.05               90.28
Top20 Acc    62.40               82.09               92.66
Top50 Acc    68.46               87.03               94.86
Top100 Acc   72.48               90.23               96.02

※ The BM25 model was built on the full text of Korean Wikipedia.
※ See the GitHub repository for details on the training and evaluation procedure.
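
For readers unfamiliar with the metric, Top-k accuracy can be computed roughly as in the sketch below. It assumes that a question counts as a hit when at least one gold passage id appears among its top k retrieved ids; the actual evaluation may instead check whether the answer string occurs in a retrieved passage. The function name `top_k_accuracy` and the toy data are hypothetical.

```python
from typing import Sequence, Set


def top_k_accuracy(
    ranked_ids: Sequence[Sequence[int]],
    gold_ids: Sequence[Set[int]],
    k: int,
) -> float:
    """Percentage of questions with at least one gold passage in the top k results."""
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("ranked_ids and gold_ids must have the same length")
    if not ranked_ids:
        raise ValueError("at least one question is required")
    hits = sum(
        1
        for ranked, gold in zip(ranked_ids, gold_ids)
        if any(pid in gold for pid in ranked[:k])
    )
    return 100.0 * hits / len(ranked_ids)


# Four toy questions: retrieved passage ids (best first) and gold passage ids.
rankings = [[3, 1, 2], [5, 4, 0], [7, 8, 9], [2, 0, 1]]
gold = [{1}, {5}, {6}, {1}]

top1 = top_k_accuracy(rankings, gold, k=1)
top2 = top_k_accuracy(rankings, gold, k=2)
top3 = top_k_accuracy(rankings, gold, k=3)
```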

Citing

@article{lim2019korquad1,
  title={KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension},
  author={Lim, Seungyoung and Kim, Myungji and Lee, Jooyoul},
  journal={arXiv preprint arXiv:1909.07005},
  year={2019}
}
@article{karpukhin2020dense,
  title={Dense Passage Retrieval for Open-Domain Question Answering},
  author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih},
  journal={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2020}
}
@misc{park2021klue,
      title={KLUE: Korean Language Understanding Evaluation},
      author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jungwoo Ha and Kyunghyun Cho},
      year={2021},
      eprint={2105.09680},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Dataset used to train snumin44/biencoder-ko-bert-question