Streaming Issue
I have deployed gpt-oss-120b on our local server and found a generation issue. When the streaming flag is off, everything works as expected:
>>> from openai import OpenAI
>>> client = OpenAI(
...     base_url="http://localhost:8001/v1",
...     api_key="EMPTY"
... )
>>>
>>> result = client.chat.completions.create(
...     model="dragon",
...     messages=[
...         {"role": "system", "content": "You are a helpful assistant."},
...         {"role": "user", "content": "hi"}
...     ]
... )
>>> result
ChatCompletion(id='chatcmpl-cd37484714ab4ece9a4c332a5f71fa86', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Hello! How can I help you today?', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], reasoning_content='The user says "hi". Simple greeting. Respond friendly.'), stop_reason=None)], created=1754426169, model='dragon', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=31, prompt_tokens=82, total_tokens=113, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None, kv_transfer_params=None)
With the stream flag turned on, however, the stream yields None values interleaved with the actual tokens:
>>> result = client.chat.completions.create(
...     model="dragon",
...     messages=[
...         {"role": "system", "content": "You are a helpful assistant."},
...         {"role": "user", "content": "hi"}
...     ],
...     stream=True
... )
>>> for chunk in result:
...     str(chunk.choices[0].delta.content)
...
''
'None'
'None'
'None'
'None'
'None'
'None'
'None'
'None'
'None'
'None'
'None'
'None'
'None'
'None'
''
'Hello'
'!'
' How'
' can'
' I'
' assist'
' you'
' today'
'?'
'None'
This issue is easily reproducible.
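The leading 'None' values come from chunks whose delta carries no content field (the initial role-only chunk and, presumably, the reasoning deltas, given the reasoning_content field in the non-streamed response above), which str() renders as 'None'. Until the server is fixed, a client-side workaround is to skip those chunks. A minimal sketch, reusing the client from the transcript above:

# Workaround sketch: only print deltas that actually carry text
# (assumes the same `client` and model name as in the transcript above).
stream = client.chat.completions.create(
    model="dragon",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "hi"},
    ],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:  # skip role-only and reasoning-only chunks
        print(delta.content, end="", flush=True)
print()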
Issue Description: The current API does not support the standard OpenAI chat.completions streaming format, which causes None values in the streamed deltas when stream=True. This prevents developers from using the gpt-oss server as a drop-in replacement for OpenAI's API.
Proposed Solution:
a. Create a New Endpoint: Add a new FastAPI endpoint at /v1/chat/completions in the gpt_oss/responses_api/api_server.py file.
b. Add an Adapter for Streaming: Implement a new async generator function that acts as an adapter between the internal StreamResponsesEvents and the OpenAI ChatCompletionChunk format. This function will iterate through the events from StreamResponsesEvents and yield properly formatted ChatCompletionChunk objects (see the sketch after this list).
c. Handle Stream Termination: Ensure that the stream is properly terminated with a [DONE] message as per the SSE protocol.
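A minimal sketch of steps (a)-(c) follows. The stream_events() helper, the Event type, and the event names used here are hypothetical stand-ins for the real StreamResponsesEvents interface in gpt_oss/responses_api/api_server.py, which will differ in detail:

import json
import time
import uuid
from dataclasses import dataclass
from typing import AsyncIterator, Optional

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@dataclass
class Event:
    """Stand-in for an item from the internal StreamResponsesEvents (hypothetical)."""
    type: str
    delta: str = ""

async def stream_events(body: dict) -> AsyncIterator[Event]:
    """Stub for the internal event stream; the real server would drive the model here."""
    for token in ("Hello", "!", " How", " can", " I", " help", "?"):
        yield Event(type="response.output_text.delta", delta=token)
    yield Event(type="response.completed")

def format_chunk(completion_id: str, model: str, delta: dict,
                 finish_reason: Optional[str] = None) -> str:
    """Serialize one ChatCompletionChunk as an SSE data line."""
    chunk = {
        "id": completion_id,
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
    }
    return f"data: {json.dumps(chunk)}\n\n"

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    completion_id = f"chatcmpl-{uuid.uuid4().hex}"
    model = body.get("model", "gpt-oss-120b")

    async def sse() -> AsyncIterator[str]:
        # First chunk carries the assistant role, per the OpenAI chunk format.
        yield format_chunk(completion_id, model, {"role": "assistant", "content": ""})
        async for event in stream_events(body):
            if event.type == "response.output_text.delta":
                # Translate an internal text delta into a content delta.
                yield format_chunk(completion_id, model, {"content": event.delta})
            elif event.type == "response.completed":
                yield format_chunk(completion_id, model, {}, finish_reason="stop")
        # Terminate the stream per the SSE protocol.
        yield "data: [DONE]\n\n"

    return StreamingResponse(sse(), media_type="text/event-stream")

This keeps the internal Responses-API event stream untouched and only translates at the HTTP boundary, so the existing /v1/responses endpoint is unaffected.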
• Affected Files/Directories:
◦ gpt_oss/responses_api/api_server.py
◦ gpt_oss/responses_api/types.py (to add new Pydantic models for the OpenAI-compatible endpoint)
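A rough sketch of the Pydantic models that types.py would need; the field names follow the public OpenAI ChatCompletionChunk schema, but the exact set of optional fields chosen here is an assumption:

# Sketch of OpenAI-compatible chunk models for types.py (the optional
# fields included here are an assumption based on the public chunk schema).
from typing import Literal, Optional
from pydantic import BaseModel

class ChatCompletionChunkDelta(BaseModel):
    role: Optional[Literal["assistant"]] = None
    content: Optional[str] = None

class ChatCompletionChunkChoice(BaseModel):
    index: int = 0
    delta: ChatCompletionChunkDelta
    finish_reason: Optional[Literal["stop", "length"]] = None

class ChatCompletionChunk(BaseModel):
    id: str
    object: Literal["chat.completion.chunk"] = "chat.completion.chunk"
    created: int
    model: str
    choices: list[ChatCompletionChunkChoice]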