Rate Limiting

LM Deluge is designed to help you max out your rate limits safely. The scheduler enforces request, token, and concurrency caps while the StatusTracker keeps everyone honest about progress and spend.

When creating an LLMClient, you can control throughput with these parameters:

from lm_deluge import LLMClient
client = LLMClient(
    "gpt-4.1-mini",
    max_requests_per_minute=5_000,   # Max API requests per minute
    max_tokens_per_minute=600_000,   # Max tokens per minute
    max_concurrent_requests=500,     # Max simultaneous requests
    max_attempts=5,                  # Total attempts per prompt before giving up
    request_timeout=45,              # Seconds before a single request is abandoned
)

The scheduler tracks three capacity pools (requests, tokens, and concurrent tasks) and only lets new work start when all three have room. Each task computes its own token count by calling Conversation.count_tokens(max_new_tokens) before dispatch, so every provider request knows how much capacity it will consume. When a request fails, the client decrements attempts_left, waits out any cooling-off period triggered by rate-limit errors, and puts the context back on the retry queue. The next attempt can use the same model or a different one, depending on the error.
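
As a rough illustration, you can run the same estimate yourself before handing work to the client. The sketch below leans on assumptions: it presumes Conversation.user(...) builds a single-turn conversation and that count_tokens(max_new_tokens) returns the prompt estimate plus the output budget; check the Conversation API if your version differs.

from lm_deluge import Conversation, LLMClient

client = LLMClient("gpt-4.1-mini", max_tokens_per_minute=600_000)

# Conversation.user(...) is assumed here as the single-turn constructor.
convos = [Conversation.user(p) for p in ["Summarize the Q3 report", "Translate the memo to French"]]

# Mirror the scheduler's bookkeeping: estimate how much of the per-minute
# token budget this batch will consume before dispatching it.
estimated = sum(conv.count_tokens(max_new_tokens=512) for conv in convos)
print(f"Batch needs roughly {estimated} of the 600_000 tokens/minute budget")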

If you don’t specify limits, defaults of 1_000 requests/min, 100_000 tokens/min, and 225 concurrent requests apply to the client. Pick values that match the quotas issued to you by each provider.

When spraying across multiple models, the rate limits apply to the client as a whole, not per-model:

client = LLMClient(
    ["gpt-4o-mini", "claude-3-haiku", "gemini-1.5-flash"],
    max_requests_per_minute=10_000,  # Shared across all models
    max_tokens_per_minute=500_000,
)

This allows you to maximize throughput by distributing load across providers while still respecting the combined quota.
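
A multi-model client is used just like a single-model one. A minimal sketch, assuming the response object exposes the generated text as .completion:

from lm_deluge import LLMClient

# Shared limits across three providers; the scheduler spreads requests among them.
client = LLMClient(
    ["gpt-4o-mini", "claude-3-haiku", "gemini-1.5-flash"],
    max_requests_per_minute=10_000,
    max_tokens_per_minute=500_000,
)

results = client.process_prompts_sync(["Write a haiku about rate limits."])
print(results[0].completion)  # .completion is assumed as the text accessor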

LM Deluge shows progress as it processes prompts. You can customize the display:

client = LLMClient("gpt-4.1-mini", progress="tqdm")

# Disable progress for a single call
results = client.process_prompts_sync(prompts, show_progress=False)

The progress parameter accepts three display modes:
  • rich: Beautiful progress bars with detailed stats (default)
  • tqdm: Classic tqdm progress bar
  • manual: Prints an update every 30 seconds
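
For long batch jobs running without an interactive terminal (cron jobs, CI logs), the manual mode’s periodic prints are often easier to read than a live bar. A minimal sketch using only the options documented above:

from lm_deluge import LLMClient

# "manual" prints a status update every 30 seconds instead of redrawing a bar,
# which keeps piped logs readable.
client = LLMClient("gpt-4.1-mini", progress="manual")

prompts = [f"Summarize document {i}" for i in range(1_000)]
results = client.process_prompts_sync(prompts)

# Silence output entirely for a one-off call.
quick = client.process_prompts_sync(["ping"], show_progress=False)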

When you reuse a client across several batches, call client.open(total=len(prompts)) before dispatching work and client.close() afterwards so the same tracker instance keeps counting.
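
A minimal sketch of that pattern, assuming process_prompts_sync reuses the already-open tracker rather than creating a fresh one per call:

from lm_deluge import LLMClient

client = LLMClient("gpt-4.1-mini")

batch_one = [f"Summarize article {i}" for i in range(100)]
batch_two = [f"Translate memo {i}" for i in range(100)]

# Open the tracker once so progress and spend accumulate across both batches.
client.open(total=len(batch_one) + len(batch_two))

all_results = []
for batch in (batch_one, batch_two):
    # Assumption: each call draws down the shared tracker opened above.
    all_results.extend(client.process_prompts_sync(batch))

client.close()  # the tracker reports final totals here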

Set request_timeout to avoid hanging on slow responses:

client = LLMClient(
    "claude-3.5-sonnet",
    request_timeout=60,  # give up on any single request after 60 seconds
)

Beyond the rate-limit knobs, a few related tools help you monitor spend and overlap work:
  • Inspect tracker.total_cost or each APIResponse.cost to understand how much a batch cost (see the cost sketch below).
  • Use await client.as_completed() to consume responses as soon as capacity frees up.
  • When building a streaming UI, pair client.start_nowait() with client.wait_for_all() so you can start the next job before the current one has finished.
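
The cost check in the first bullet can be as simple as summing the per-response field. A minimal sketch using APIResponse.cost; the None guards are just precautions for failed requests:

from lm_deluge import LLMClient

client = LLMClient("gpt-4.1-mini")
results = client.process_prompts_sync(["Explain rate limiting in one sentence."] * 10)

# Sum per-response spend; failed requests may carry no cost.
batch_cost = sum((r.cost or 0.0) for r in results if r is not None)
print(f"Batch cost: ${batch_cost:.4f}")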

A few best practices for tuning throughput:
  1. Start conservative: begin with lower rate limits and increase gradually as you learn what each provider allows.
  2. Account for tokens: Conversation.count_tokens() lets you estimate usage before scheduling work.
  3. Monitor costs: the tracker prints totals when you call client.close(). Pipe that output to your logging system.
  4. Use multiple models: distributing load across providers keeps the pipeline full even when one API throttles you.
  5. Tune retries: expensive prompts may deserve a larger max_attempts, while real-time workloads can lower request_timeout for faster failure (see the sketch below).
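
One way to act on the last point is to keep separate client configurations per workload; the numbers below are illustrative, not recommendations:

from lm_deluge import LLMClient

# Expensive offline batches: retry generously and tolerate slow responses.
batch_client = LLMClient(
    "gpt-4.1-mini",
    max_attempts=8,
    request_timeout=120,
)

# Latency-sensitive paths: fail fast and handle fallback in application code.
realtime_client = LLMClient(
    "gpt-4.1-mini",
    max_attempts=2,
    request_timeout=15,
)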