opensearch-project
/

opensearch-neural-sparse-encoding-doc-v2-distill

Model card Files Files and versions

zhichao-geng commited on Nov 15, 2024

Commit

a708c2b

·

verified ·

1 Parent(s): 6b4a90d

Update README.md (#2)

- Update README.md (223b4d4af2a94069deeefdd00c12a480aaff2132)

Files changed (1) hide show

README.md +3 -0

README.md CHANGED Viewed

@@ -28,6 +28,9 @@ Overall, the v2 series of models have better search relevance, efficiency and in
 ## Overview
 This is a learned sparse retrieval model. It encodes the documents to 30522 dimensional **sparse vectors**. For queries, it just use a tokenizer and a weight look-up table to generate sparse vectors. The non-zero dimension index means the corresponding token in the vocabulary, and the weight means the importance of the token. And the similarity score is the inner product of query/document sparse vectors. In the real-world use case, the search performance of opensearch-neural-sparse-encoding-v1 is comparable to BM25.
 The training datasets includes MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, yahoo_answers_title_answer.

 ## Overview
+- **Paper**: [Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers](https://arxiv.org/abs/2411.04403)
+- **Fine-tuning sample**: [opensearch-sparse-model-tuning-sample](https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample)
 This is a learned sparse retrieval model. It encodes the documents to 30522 dimensional **sparse vectors**. For queries, it just use a tokenizer and a weight look-up table to generate sparse vectors. The non-zero dimension index means the corresponding token in the vocabulary, and the weight means the importance of the token. And the similarity score is the inner product of query/document sparse vectors. In the real-world use case, the search performance of opensearch-neural-sparse-encoding-v1 is comparable to BM25.
 The training datasets includes MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, yahoo_answers_title_answer.