Streaming
Streaming lets you display response tokens as they’re generated, creating a more responsive user experience. This is especially valuable for long responses.
Basic Streaming
The `stream()` method prints tokens in real time and returns the final response:
```python
import asyncio

from lm_deluge import LLMClient


async def basic_streaming():
    client = LLMClient("gpt-4o")

    # stream() prints tokens as they arrive
    response = await client.stream("Write a short story about a robot learning to paint.")

    # The response contains the complete text
    print(f"\n\nFinal response length: {len(response.completion)} chars")


asyncio.run(basic_streaming())
```
Streaming with Conversations
Stream responses from multi-turn conversations:
```python
import asyncio

from lm_deluge import LLMClient, Conversation, Message


async def conversation_streaming():
    client = LLMClient("gpt-4o")

    conversation = Conversation()
    conversation.add(Message.system("You are a creative writing assistant."))
    conversation.add(Message.user("Write a haiku about artificial intelligence."))

    response = await client.stream(conversation)

    print(f"\n\nComplete: {response.completion}")


asyncio.run(conversation_streaming())
```
Custom Stream Processing
For more control over how tokens are handled, use the lower-level `stream_chat` function:
```python
import asyncio

from lm_deluge import Conversation
from lm_deluge.api_requests.openai import stream_chat
from lm_deluge.config import SamplingParams


async def custom_streaming():
    conversation = Conversation().user("Explain quantum computing in simple terms.")
    sampling_params = SamplingParams(temperature=0.7, max_new_tokens=500)

    content = ""
    async for item in stream_chat(
        model_name="gpt-4o",
        prompt=conversation,
        sampling_params=sampling_params,
    ):
        if isinstance(item, str):
            # This is a token
            content += item
            print(item, end="", flush=True)
        else:
            # This is the final APIResponse
            print("\n\nStreaming complete!")
            print(f"Total tokens: {item.usage.total_tokens if item.usage else 'Unknown'}")
            return item


asyncio.run(custom_streaming())
```
Streaming in a Chat Loop
Combine streaming with an interactive chat:
```python
import asyncio

from lm_deluge import LLMClient, Conversation, Message


async def streaming_chat():
    conversation = Conversation().system("You are a helpful assistant.")
    client = LLMClient("gpt-4o")

    print("Streaming chat! Type 'quit' to exit.\n")

    while True:
        user_input = input("\nYou: ").strip()

        if user_input.lower() == "quit":
            break

        if not user_input:
            continue

        conversation.add(Message.user(user_input))

        print("Assistant: ", end="", flush=True)

        try:
            response = await client.stream(conversation)
            print()  # Newline after streamed response

            if response and response.completion:
                conversation.add(Message.ai(response.completion))
        except Exception as e:
            print(f"\nError: {e}")
            conversation.messages.pop()  # Drop the failed user turn


asyncio.run(streaming_chat())
```
Collecting Streamed Text
If you need to process the text while streaming:
```python
import asyncio

from lm_deluge import Conversation
from lm_deluge.api_requests.openai import stream_chat
from lm_deluge.config import SamplingParams


async def collect_while_streaming():
    conversation = Conversation().user("List 5 interesting facts about space.")
    sampling_params = SamplingParams(max_new_tokens=500)

    chunks = []

    async for item in stream_chat(
        model_name="gpt-4o-mini",
        prompt=conversation,
        sampling_params=sampling_params,
    ):
        if isinstance(item, str):
            chunks.append(item)
            print(item, end="", flush=True)
        else:
            # Final response
            full_text = "".join(chunks)
            print(f"\n\nCollected {len(chunks)} chunks, {len(full_text)} chars")
            return full_text


asyncio.run(collect_while_streaming())
```
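The same pattern works for any incremental sink. As a sketch (the file path and prompt are illustrative), you can append each chunk to disk as it arrives:

```python
import asyncio

from lm_deluge import Conversation
from lm_deluge.api_requests.openai import stream_chat
from lm_deluge.config import SamplingParams


async def stream_to_file(path: str):
    conversation = Conversation().user("Write a limerick about the moon.")

    # Append each chunk to disk the moment it arrives
    with open(path, "a", encoding="utf-8") as f:
        async for item in stream_chat(
            model_name="gpt-4o-mini",
            prompt=conversation,
            sampling_params=SamplingParams(max_new_tokens=200),
        ):
            if isinstance(item, str):
                f.write(item)
                f.flush()


asyncio.run(stream_to_file("stream_output.txt"))
```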
Supported Models
Streaming works with all OpenAI models:
- `gpt-4o`, `gpt-4o-mini`
- `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`
- `gpt-4-turbo`, `gpt-4`
- `gpt-3.5-turbo`
- `o1`, `o1-mini`, `o1-preview` (reasoning models)
- `o3-mini` (reasoning model)
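The call is the same regardless of model. For instance, a minimal sketch streaming from a reasoning model (assuming `o3-mini` is enabled for your API key):

```python
import asyncio

from lm_deluge import LLMClient


async def stream_reasoning():
    # Reasoning models stream through the same stream() method
    client = LLMClient("o3-mini")
    response = await client.stream("In one paragraph, why is the sky blue?")
    print(f"\n\nFinal response length: {len(response.completion)} chars")


asyncio.run(stream_reasoning())
```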
Limitations
- **OpenAI only**: Anthropic, Google, and other providers don’t yet support streaming in lm-deluge; see the fallback sketch after this list
- **No tool streaming**: When using tools, the agent loop doesn’t stream intermediate responses
- **No structured output streaming**: JSON mode/structured outputs return complete responses
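If a client might be configured with a non-OpenAI model, one option is to catch the error and fall back to a non-streaming call. A minimal sketch, assuming `process_prompts_async` is the client’s non-streaming batch entry point; swap in whichever non-streaming method you actually use:

```python
import asyncio

from lm_deluge import LLMClient


async def stream_or_fallback(client: LLMClient, prompt: str):
    try:
        # Stream when the model supports it
        response = await client.stream(prompt)
        return response.completion
    except ValueError:
        # Assumed fallback: a non-streaming batch call that returns
        # one response per prompt; adjust to your client's actual API
        results = await client.process_prompts_async([prompt])
        return results[0].completion if results[0] else None


# Model name is illustrative
print(asyncio.run(stream_or_fallback(LLMClient("claude-3.5-sonnet"), "Hello!")))
```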
Error Handling
```python
import asyncio

from lm_deluge import LLMClient


async def safe_streaming():
    client = LLMClient("gpt-4o")

    try:
        response = await client.stream("Write something interesting.")
        print(f"\n\nSuccess: {len(response.completion)} chars")
    except ValueError as e:
        # Raised if trying to stream with a non-OpenAI model
        print(f"Streaming not supported: {e}")
    except Exception as e:
        print(f"Streaming error: {e}")


asyncio.run(safe_streaming())
```
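For transient failures (timeouts, dropped connections), you can retry the stream. A sketch with exponential backoff; the `stream_with_retries` helper is illustrative, not part of lm-deluge:

```python
import asyncio

from lm_deluge import LLMClient


async def stream_with_retries(client: LLMClient, prompt: str, attempts: int = 3):
    # Exponential backoff: wait 1s, 2s, 4s, ... between attempts
    for attempt in range(attempts):
        try:
            return await client.stream(prompt)
        except ValueError:
            raise  # Unsupported model: retrying won't help
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt)


asyncio.run(stream_with_retries(LLMClient("gpt-4o"), "Tell me a joke."))
```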