Rate Limiting
Overview
LM Deluge is designed to help you max out your rate limits safely. The scheduler enforces request, token, and concurrency caps, while the StatusTracker keeps everyone honest about progress and spend.
Key Parameters
When creating an LLMClient, you can control throughput with these parameters:
from lm_deluge import LLMClient
client = LLMClient(
    "gpt-4.1-mini",
    max_requests_per_minute=5_000,   # Max API requests per minute
    max_tokens_per_minute=600_000,   # Max tokens per minute
    max_concurrent_requests=500,     # Max simultaneous requests
    max_attempts=5,
    request_timeout=45,
)

How It Works
The scheduler maintains three capacity pools (requests, tokens, and concurrent tasks) and only lets new work start when all three have room. Each task carries its own token count, computed by calling Conversation.count_tokens(max_new_tokens) before dispatch, so every provider request knows how much capacity it will consume. When a request fails, the client decrements attempts_left, waits out any cooling-off period triggered by rate-limit errors, and puts the context back on the retry queue. The next attempt can use the same model or a different one, depending on the error.
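To make the gating concrete, here is an illustrative sketch of the three-pool check. The CapacityPools class and can_start method are invented for this example and are not part of the library's API:

import time

# Illustrative only: three hypothetical pools gating dispatch.
class CapacityPools:
    def __init__(self, max_rpm: int, max_tpm: int, max_concurrent: int):
        self.max_rpm = max_rpm
        self.max_tpm = max_tpm
        self.max_concurrent = max_concurrent
        self.requests_this_minute = 0
        self.tokens_this_minute = 0
        self.in_flight = 0
        self.window_start = time.monotonic()

    def can_start(self, token_cost: int) -> bool:
        # Roll the per-minute window when 60 seconds have passed.
        if time.monotonic() - self.window_start >= 60:
            self.requests_this_minute = 0
            self.tokens_this_minute = 0
            self.window_start = time.monotonic()
        # New work starts only when the request, token, and concurrency
        # pools all have room for it.
        return (
            self.requests_this_minute + 1 <= self.max_rpm
            and self.tokens_this_minute + token_cost <= self.max_tpm
            and self.in_flight + 1 <= self.max_concurrent
        )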
Default Values
If you don't specify limits, the defaults are 1_000 requests/min, 100_000 tokens/min, and 225 concurrent requests. Pick values that match the quotas issued by each provider.
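For example, a client constructed without any of these arguments simply runs with those defaults:

# No limits passed, so the defaults apply:
# 1_000 requests/min, 100_000 tokens/min, 225 concurrent requests.
client = LLMClient("gpt-4.1-mini")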
Multi-Model Rate Limiting
When spraying requests across multiple models, the rate limits apply to the client as a whole, not per model:
client = LLMClient(
    ["gpt-4o-mini", "claude-3-haiku", "gemini-1.5-flash"],
    max_requests_per_minute=10_000,  # Shared across all models
    max_tokens_per_minute=500_000,
)

This allows you to maximize throughput by distributing load across providers while still respecting the combined quota.
Progress Display
LM Deluge shows progress as it processes prompts. You can customize the display:
client = LLMClient("gpt-4.1-mini", progress="tqdm")
# Disable progress per call
results = client.process_prompts_sync(prompts, show_progress=False)

- rich: Beautiful progress bars with detailed stats (default)
- tqdm: Classic tqdm progress bar
- manual: Prints an update every 30 seconds
When you reuse a client across several batches, call client.open(total=len(prompts)) before dispatching work and client.close() afterwards so the same tracker instance keeps counting.
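A minimal sketch of that pattern, assuming batch_a and batch_b are two lists of prompts you want counted under a single tracker:

client = LLMClient("gpt-4.1-mini")

# One tracker keeps counting across both batches.
client.open(total=len(batch_a) + len(batch_b))
results_a = client.process_prompts_sync(batch_a)
results_b = client.process_prompts_sync(batch_b)
client.close()  # reports totals for the combined run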
Timeouts
Set request_timeout to avoid hanging on slow responses:
client = LLMClient(
    "claude-3.5-sonnet",
    request_timeout=60,
)

StatusTracker Tips
- Inspect tracker.total_cost or each APIResponse.cost to understand how much a batch cost.
- Use await client.as_completed() to consume responses as soon as capacity frees up.
- When building a streaming UI, pair client.start_nowait() with client.wait_for_all() so you can start the next job before the current one is finished.
Best Practices
- Start conservative: begin with lower rate limits and increase gradually as you learn what each provider allows.
- Account for tokens: Conversation.count_tokens() lets you estimate usage before scheduling work.
- Monitor costs: the tracker prints totals when you call client.close(). Pipe that to your logging system.
- Use multiple models: distributing load across providers lets you keep the pipeline full even when one API throttles.
- Tune retries: expensive prompts might deserve a larger max_attempts, while real-time workloads can lower the timeout for faster failure (see the sketch below).
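A hedged illustration of that last point; the model name and numbers are placeholders, not recommendations:

# Generous retries and a long timeout for expensive offline batches.
batch_client = LLMClient("gpt-4.1-mini", max_attempts=8, request_timeout=120)

# Fail fast for latency-sensitive, user-facing calls.
realtime_client = LLMClient("gpt-4.1-mini", max_attempts=2, request_timeout=15)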
Next Steps
- Learn about Client Basics to see every configuration option
- Use Local & Provider Caching to avoid repeated calls
- Explore Advanced Workflows for streaming, batch jobs, and embeddings