Rate Limiting

LM Deluge is designed to help you max out your rate limits safely. The scheduler enforces request, token, and concurrency caps while the StatusTracker keeps everyone honest about progress and spend.

When creating an LLMClient, you can control throughput with these parameters:

from lm_deluge import LLMClient
client = LLMClient(
    "gpt-4.1-mini",
    max_requests_per_minute=5_000,   # Max API requests per minute
    max_tokens_per_minute=600_000,   # Max tokens per minute
    max_concurrent_requests=500,     # Max simultaneous requests
    max_attempts=5,                  # Total attempts per prompt before giving up
    request_timeout=45,              # Seconds before a single request is abandoned
)

The scheduler tracks three capacity pools (requests, tokens, and concurrent tasks) and only lets new work start when all three have room. Each task computes its own token count by calling Conversation.count_tokens(max_new_tokens) before dispatch, so every provider request knows how much capacity it will consume. When a request fails, the client decrements attempts_left, waits out any cooling-off period triggered by rate-limit errors, and puts the context back on the retry queue. The next attempt can use the same model or a different one, depending on the error.
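
As a rough illustration, you can run the same estimate yourself before handing work to the client. The sketch below leans on assumptions: it presumes Conversation.user(...) builds a single-turn conversation and that count_tokens(max_new_tokens) returns the prompt estimate plus the output budget; check the Conversation API if your version differs.

from lm_deluge import Conversation, LLMClient

client = LLMClient("gpt-4.1-mini", max_tokens_per_minute=600_000)

# Conversation.user(...) is assumed here as the single-turn constructor.
convos = [Conversation.user(p) for p in ["Summarize the Q3 report", "Translate the memo to French"]]

# Mirror the scheduler's bookkeeping: estimate how much of the per-minute
# token budget this batch will consume before dispatching it.
estimated = sum(conv.count_tokens(max_new_tokens=512) for conv in convos)
print(f"Batch needs roughly {estimated} of the 600_000 tokens/minute budget")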

If you don’t specify limits, defaults of 1_000 requests/min, 100_000 tokens/min, and 225 concurrent requests apply to the client. Pick values that match the quotas issued to you by each provider.

When spraying across multiple models, the rate limits apply to the client as a whole, not per-model:

client = LLMClient(
    ["gpt-4o-mini", "claude-3-haiku", "gemini-1.5-flash"],
    max_requests_per_minute=10_000,  # Shared across all models
    max_tokens_per_minute=500_000,
)

This allows you to maximize throughput by distributing load across providers while still respecting the combined quota.
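
A multi-model client is used just like a single-model one. A minimal sketch, assuming the response object exposes the generated text as .completion:

from lm_deluge import LLMClient

# Shared limits across three providers; the scheduler spreads requests among them.
client = LLMClient(
    ["gpt-4o-mini", "claude-3-haiku", "gemini-1.5-flash"],
    max_requests_per_minute=10_000,
    max_tokens_per_minute=500_000,
)

results = client.process_prompts_sync(["Write a haiku about rate limits."])
print(results[0].completion)  # .completion is assumed as the text accessor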

LM Deluge shows progress as it processes prompts. You can customize the display:

client = LLMClient("gpt-4.1-mini", progress="tqdm")

# Disable progress for a single call
results = client.process_prompts_sync(prompts, show_progress=False)

The progress parameter accepts three display modes:
  • rich: Beautiful progress bars with detailed stats (default)
  • tqdm: Classic tqdm progress bar
  • manual: Prints an update every 30 seconds
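
For long batch jobs running without an interactive terminal (cron jobs, CI logs), the manual mode’s periodic prints are often easier to read than a live bar. A minimal sketch using only the options documented above:

from lm_deluge import LLMClient

# "manual" prints a status update every 30 seconds instead of redrawing a bar,
# which keeps piped logs readable.
client = LLMClient("gpt-4.1-mini", progress="manual")

prompts = [f"Summarize document {i}" for i in range(1_000)]
results = client.process_prompts_sync(prompts)

# Silence output entirely for a one-off call.
quick = client.process_prompts_sync(["ping"], show_progress=False)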

When you reuse a client across several batches, call client.open(total=len(prompts)) before dispatching work and client.close() afterwards so the same tracker instance keeps counting.
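
A minimal sketch of that pattern, assuming process_prompts_sync reuses the already-open tracker rather than creating a fresh one per call:

from lm_deluge import LLMClient

client = LLMClient("gpt-4.1-mini")

batch_one = [f"Summarize article {i}" for i in range(100)]
batch_two = [f"Translate memo {i}" for i in range(100)]

# Open the tracker once so progress and spend accumulate across both batches.
client.open(total=len(batch_one) + len(batch_two))

all_results = []
for batch in (batch_one, batch_two):
    # Assumption: each call draws down the shared tracker opened above.
    all_results.extend(client.process_prompts_sync(batch))

client.close()  # the tracker reports final totals here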

Set request_timeout to avoid hanging on slow responses:

client = LLMClient(
    "claude-3.5-sonnet",
    request_timeout=60,  # give up on any single request after 60 seconds
)

Beyond the rate-limit knobs, a few related tools help you monitor spend and overlap work:
  • Inspect tracker.total_cost or each APIResponse.cost to understand how much a batch cost (see the cost sketch below).
  • Use await client.as_completed() to consume responses as soon as capacity frees up.
  • When building a streaming UI, pair client.start_nowait() with client.wait_for_all() so you can start the next job before the current one has finished.
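
The cost check in the first bullet can be as simple as summing the per-response field. A minimal sketch using APIResponse.cost; the None guards are just precautions for failed requests:

from lm_deluge import LLMClient

client = LLMClient("gpt-4.1-mini")
results = client.process_prompts_sync(["Explain rate limiting in one sentence."] * 10)

# Sum per-response spend; failed requests may carry no cost.
batch_cost = sum((r.cost or 0.0) for r in results if r is not None)
print(f"Batch cost: ${batch_cost:.4f}")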

A few best practices for tuning throughput:
  1. Start conservative: begin with lower rate limits and increase gradually as you learn what each provider allows.
  2. Account for tokens: Conversation.count_tokens() lets you estimate usage before scheduling work.
  3. Monitor costs: the tracker prints totals when you call client.close(). Pipe that output to your logging system.
  4. Use multiple models: distributing load across providers keeps the pipeline full even when one API throttles you.
  5. Tune retries: expensive prompts may deserve a larger max_attempts, while real-time workloads can lower request_timeout for faster failure (see the sketch below).
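
One way to act on the last point is to keep separate client configurations per workload; the numbers below are illustrative, not recommendations:

from lm_deluge import LLMClient

# Expensive offline batches: retry generously and tolerate slow responses.
batch_client = LLMClient(
    "gpt-4.1-mini",
    max_attempts=8,
    request_timeout=120,
)

# Latency-sensitive paths: fail fast and handle fallback in application code.
realtime_client = LLMClient(
    "gpt-4.1-mini",
    max_attempts=2,
    request_timeout=15,
)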