Text Generation · Transformers · Safetensors · qwen2 · conversational · text-generation-inference
hongzhizhang committed · Commit 4ea9046 · verified · 1 Parent(s): e36c4c5

Update README.md

Give a vLLM inference example.

Files changed (1)
1. README.md +29 -42
README.md CHANGED
@@ -31,60 +31,47 @@ Reinforcement learning (RL) for large language models is an energy-intensive end
 
  ## 🚀 Quick Start (Inference)
 
- You can use the RLEP model for accelerated text generation by leveraging its custom `EaModel` class. Ensure you have the `rlep` package and its `vllm` dependencies installed as per the official repository.
-
- First, install the necessary packages by cloning the repository and installing its dependencies:
  ```bash
- git clone https://github.com/Kwai-Klear/RLEP.git
- cd RLEP
- pip3 install -e .[vllm]
  ```
-
- Then, you can use the model in your Python code:
 
  ```python
- import torch
- from transformers import AutoTokenizer
- from eagle.model.ea_model import EaModel
- from fastchat.model import get_conversation_template
-
- # Define paths for your base model and RLEP model checkpoint
- # This model is based on Qwen2.5-Math-7B.
- base_model_path = "Qwen/Qwen2.5-Math-7B" # Original Qwen2.5 base model
- rlep_model_path = "Kwai-Klear/qwen2.5-math-rlep" # This RLEP checkpoint
-
- # Load the RLEP-enhanced model
- # trust_remote_code=True might be necessary depending on your environment
- model = EaModel.from_pretrained(
-     base_model_path=base_model_path,
-     ea_model_path=rlep_model_path,
-     torch_dtype=torch.float16, # or torch.bfloat16 for Qwen2 models
-     low_cpu_mem_usage=True,
-     device_map="auto",
-     total_token=-1 # -1 allows EAGLE-2 to auto-configure this parameter
  )
- model.eval()
 
- # Example usage for text generation:
- user_message = "What is the capital of France?"
 
- # Get conversation template for your base model.
- # Adjust "qwen2" if your base model uses a different chat format.
- conv = get_conversation_template("qwen2")
- conv.append_message(conv.roles[0], user_message)
- conv.append_message(conv.roles[1], None) # Append None for the assistant's turn
 
- prompt = conv.get_prompt()
- input_ids = model.tokenizer([prompt]).input_ids
- input_ids = torch.as_tensor(input_ids).cuda()
 
- # Generate text using the RLEP-accelerated generation method
- output_ids = model.eagenerate(input_ids, temperature=0.5, max_new_tokens=512)
- output = model.tokenizer.decode(output_ids[0])
 
- print(output)
  ```
 
  ## Evaluation Results
 
  We evaluated the converged RLEP model at 320 training steps and the DAPO-nodyn-bs64 baseline at 400 steps.
 
 
  ## 🚀 Quick Start (Inference)
 
+ Here's a simple example of running inference with vLLM.
+ First, install vLLM (version ≥ 0.7.3):
  ```bash
+ pip3 install "vllm>=0.7.3"
  ```
+ After installation, you can load and run the model in your Python code like this:
 
  ```python
+ from transformers import AutoTokenizer
+ from vllm import LLM, SamplingParams
+
+ model_path = 'Kwai-Klear/qwen2.5-math-rlep'
+ sampling_params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=1024 * 3, n=1)
+ llm = LLM(
+     model=model_path,
+     enforce_eager=False,
+     tensor_parallel_size=1,
+     seed=0,
  )
 
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ question = '''Find the sum of all integer bases $b>9$ for which $17_b$ is a divisor of $97_b.$'''
 
+ prefix = "Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\n"
+ post_fix = '\n\nRemember to put your answer on its own line after "Answer:".'
+ question_with_instruct = prefix + question + post_fix  # the model was trained with this instruction format
+ messages = [{'content': question_with_instruct, 'role': 'user'}]
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
 
+ output = llm.generate([text], sampling_params)[0]
+ answer = output.outputs[0].text
 
+ print(question)
+ print(answer)
  ```
 
+ To evaluate the model on benchmarks such as AIME-2024, AIME-2025, and AMC-2023, please refer to [our repo](http://github.com/Kwai-Klear/RLEP?tab=readme-ov-file#evaluation).
+
  ## Evaluation Results
 
  We evaluated the converged RLEP model at 320 training steps and the DAPO-nodyn-bs64 baseline at 400 steps.
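
The prompt built in the example instructs the model to put its final result on a line of the form `Answer: $Answer`, so the numeric answer can be recovered from the completion with a small parsing step. Below is a minimal, illustrative sketch; it is not part of this commit or the RLEP repository, the `extract_answer` helper is a made-up name, and it assumes the `answer` string from the example above:

```python
# Illustrative helper (not from the RLEP repo): pull the final answer out of a
# completion that follows the "Answer: $Answer" convention enforced by the prompt.
def extract_answer(completion):
    for line in reversed(completion.strip().splitlines()):
        line = line.strip()
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return None  # no "Answer:" line found


# Continuing from the example above, where `answer` holds the generated text.
# For the sample question, the correct value is 70.
print(extract_answer(answer))
```

Since `llm.generate` accepts a list of prompts, the same pattern extends to batched benchmark runs: build one templated prompt per question, generate once, and compare each extracted answer against its reference.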