Uni-SMART committed
Commit e8e2906 · verified · 1 Parent(s): 0a44c05

Update README

---
license: mit
---

# Model Card for SciLitLLM1.5

SciLitLLM1.5 adapts a general large language model for effective scientific literature understanding. Starting from Qwen2.5-7B/14B, SciLitLLM1.5-7B/14B is trained with a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT) to simultaneously infuse scientific domain knowledge and enhance instruction-following capabilities for domain-specific tasks.

In this process, we identify two key challenges: (1) constructing high-quality CPT corpora, and (2) generating diverse SFT instructions. We address these challenges with a meticulous pipeline that includes PDF text extraction, correction of parsing errors, quality filtering, and synthetic instruction creation.
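
Purely as an illustration of how these corpus-construction steps fit together (the helper names and heuristics below are hypothetical, not the released implementation; the actual data-processing code is in the GitHub repository linked below):

```python
# Illustrative toy skeleton of the corpus-construction steps described above.
# The function names and heuristics are hypothetical, not the released pipeline.
def correct_parsing_errors(doc: str) -> str:
    # Placeholder "correction": collapse whitespace artifacts that PDF
    # extraction commonly introduces.
    return " ".join(doc.split())

def quality_filter(docs: list[str], min_chars: int = 200) -> list[str]:
    # Placeholder quality filter: keep only documents above a minimum length.
    return [d for d in docs if len(d) >= min_chars]

def build_cpt_corpus(raw_docs: list[str]) -> list[str]:
    cleaned = [correct_parsing_errors(d) for d in raw_docs]
    return quality_filter(cleaned)
```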

Applying this strategy, we present SciLitLLM-7B and SciLitLLM-14B, models specialized in scientific literature understanding that demonstrate promising performance on the corresponding benchmarks.

We observe promising performance enhancements, **with an average improvement of 4.0% on SciAssess and 10.1% on SciRIFF compared to the leading LLMs under 10B parameters**. Notably, **SciLitLLM-7B even outperforms Llama3.1 and Qwen2.5 with 70B parameters on SciRIFF**. Additionally, SciLitLLM-14B achieves leading results on both benchmarks, surpassing other open-source LLMs. Further ablation studies demonstrate the effectiveness of each module in our pipeline.

See the [paper](https://arxiv.org/abs/2408.15545) for more details and the [GitHub repository](https://github.com/dptech-corp/Uni-SMART) for the data processing code.

## Requirements

Since SciLitLLM is based on Qwen2.5, we advise you to install `transformers>=4.37.0`; otherwise you might encounter the following error:

```
KeyError: 'qwen2'
```
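
If you are unsure which version is installed, here is a minimal sanity check you can run before loading the model (a sketch; it assumes the `packaging` helper is available, which ships with most Python environments):

```python
# Minimal sanity check: the Qwen2 architecture needs transformers >= 4.37.0.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.37.0"), (
    f"transformers {transformers.__version__} is too old; "
    "upgrade with: pip install -U 'transformers>=4.37.0'"
)
```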

## Quickstart

Here is a code snippet showing how to load the tokenizer and model, and how to generate content using `apply_chat_template`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Uni-SMART/SciLitLLM-1.5/7B/",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Uni-SMART/SciLitLLM-1.5/7B/")

prompt = "Can you summarize this article for me?\n <ARTICLE>"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Render the chat messages into the model's expected prompt format.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
# Strip the prompt tokens so only the newly generated answer remains.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
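
For interactive use, you may prefer to stream tokens as they are generated. A minimal sketch using the `TextStreamer` utility from `transformers`, reusing `model`, `tokenizer`, and `model_inputs` from the snippet above:

```python
from transformers import TextStreamer

# Print tokens to stdout as they are generated, omitting the prompt itself.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
    streamer=streamer
)
```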

## Citation

If you find our work helpful, please consider citing it:

```bibtex
@misc{li2024scilitllmadaptllmsscientific,
      title={SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding},
      author={Sihang Li and Jin Huang and Jiaxi Zhuang and Yaorui Shi and Xiaochen Cai and Mingjun Xu and Xiang Wang and Linfeng Zhang and Guolin Ke and Hengxing Cai},
      year={2024},
      eprint={2408.15545},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.15545}
}
```