Client Basics
The `LLMClient` orchestrates every request. It normalizes conversations, schedules work under your rate limits, handles retries, and collects structured `APIResponse` objects. This page walks through the knobs that control that behavior.
Constructor Overview
```python
from lm_deluge import LLMClient, SamplingParams

client = LLMClient(
    model_names=["gpt-4.1-mini"],
    max_requests_per_minute=1_000,
    max_tokens_per_minute=100_000,
    max_concurrent_requests=225,
    sampling_params=[SamplingParams(temperature=0.75, max_new_tokens=512)],
    max_attempts=5,
    request_timeout=30,
    use_responses_api=False,
    progress="rich",
)
```

- Defaults match the arguments listed above. They are intentionally aggressive so you can saturate provider quotas.
- The factory returns a Pydantic-powered client, so validation happens immediately and you can serialize/deserialize configurations safely.
Loading from Config Files
Load pre-defined settings without writing code by using `LLMClient.from_dict()` or `LLMClient.from_yaml()`:

```python
config = {
    "model_names": ["claude-3.5-sonnet"],
    "sampling_params": {"temperature": 0.4, "max_new_tokens": 300},
    "max_concurrent_requests": 400,
}
client = LLMClient.from_dict(config)
```
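`from_yaml()` takes a path instead of a dict; a minimal sketch, assuming a YAML file whose keys mirror the dict above (the filename is hypothetical):

```python
# "configs/sonnet_batch.yaml" is a placeholder path; its contents are assumed
# to mirror the keys of the dict config shown above.
client = LLMClient.from_yaml("configs/sonnet_batch.yaml")
```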
Using Multiple Models

Pass a list of model IDs to spread traffic across them. Provide `model_weights` when you need deterministic sampling ratios; the weights are normalized automatically, and `"uniform"` (the default) gives every model equal traffic.

```python
multi_client = LLMClient(
    ["gpt-4.1-mini", "claude-3-haiku", "gemini-1.5-flash"],
    model_weights=[0.6, 0.2, 0.2],
    sampling_params=[
        SamplingParams(temperature=0.2, max_new_tokens=200),
        SamplingParams(temperature=0.8, max_new_tokens=150),
        SamplingParams(temperature=1.0, max_new_tokens=300),
    ],
)
```

`LLMClient.with_model()` and `.with_models()` provide a fluent API when you need to swap the model list at runtime, and `_select_model()` ensures retries can hop to a different model whenever `APIResponse.retry_with_different_model` is set.
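A hedged sketch of those fluent helpers; the argument shapes are assumptions based on the method names (a single model ID for `with_model`, a list for `with_models`):

```python
# Assumed usage: derive narrower clients from the pool at runtime.
haiku_only = multi_client.with_model("claude-3-haiku")
fast_pair = multi_client.with_models(["gpt-4.1-mini", "gemini-1.5-flash"])
```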
Sampling Parameters
`SamplingParams` mirrors the arguments used by every provider:

- `temperature`, `top_p`, and `max_new_tokens` feed directly into the request bodies.
- `json_mode=True` places OpenAI and Gemini into JSON-object responses if the model supports it.
- `reasoning_effort` lets you request `"low"`, `"medium"`, `"high"`, `"minimal"`, or `"none"` on reasoning models (`o4`, `gpt-5`, `claude-3.5`, etc.).
- `logprobs` + `top_logprobs` enable token-level probabilities across all models that support it; the client validates that every model in the pool allows logprobs and adjusts each `SamplingParams` instance for you.
- `strict_tools` keeps OpenAI/Anthropic tool definitions in strict mode (removing defaults) unless you explicitly disable it.

You can provide one `SamplingParams` per model in the pool, or a single entry that LM Deluge clones for every model.
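A minimal sketch of the single-entry form, reusing the constructor shape from the first example (model names and values are illustrative):

```python
# One SamplingParams entry is cloned for every model in the pool.
shared = SamplingParams(
    temperature=0.3,
    max_new_tokens=256,
    json_mode=True,  # JSON-object responses where the provider supports it
)

client = LLMClient(
    ["gpt-4.1-mini", "claude-3-haiku"],
    sampling_params=[shared],  # single entry, cloned per model
)
```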
Rate Limits, Retries, and Timeouts
The scheduler enforces three independent limits:

- `max_requests_per_minute`
- `max_tokens_per_minute`
- `max_concurrent_requests`

Use `client.with_limits(max_requests_per_minute=...)` to adjust them on an existing client when you reuse it across jobs. Every request is retried up to `max_attempts` times with a per-attempt timeout of `request_timeout` seconds. Failed tasks are re-queued until attempts run out.
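A short sketch of retuning an existing client between jobs; it assumes `with_limits` accepts all three limit keywords and follows the same fluent pattern as `with_model()` (the numbers are illustrative):

```python
# Retune throughput before reusing the client for a lighter job.
client = client.with_limits(
    max_requests_per_minute=200,
    max_tokens_per_minute=20_000,
    max_concurrent_requests=50,
)
```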
Status Tracker & Progress Output
`StatusTracker` records usage, retries, costs, and queue depth for the current batch. Control the UX with:

- `progress="rich"` (the default), `"tqdm"`, or `"manual"`
- `show_progress=False` per `process_prompts_*` call
- `client.open(total=len(prompts))` / `client.close()` if you want to reuse a single tracker across several batches
- `client.reset_tracker()` to zero out the counters without destroying the progress display
The tracker also exposes cumulative totals through each `APIResponse.usage` so you can build your own dashboards.
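A hedged sketch of reusing one tracker across two batches, combining the bullets above; the prompt lists are placeholders, and it assumes batch calls made between `open()` and `close()` report into the already-open tracker:

```python
prompts_a = ["Translate 'hello' to French.", "Translate 'hello' to German."]
prompts_b = ["Name one prime number.", "Name one even number."]

client.open(total=len(prompts_a) + len(prompts_b))  # one shared tracker
first = client.process_prompts_sync(prompts_a)
second = client.process_prompts_sync(prompts_b)
client.close()                                      # finalize the display
```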
Provider-Specific Features
- OpenAI Responses API: set `use_responses_api=True` to send requests to `/responses`. This is required for Codex models, computer-use previews, typed MCP servers, and background mode.
- Background mode: `background=True` turns each request into a start/poll cycle on the Responses API, freeing slots while OpenAI runs the job.
- Service tiers: pass `service_tier` (`"auto"`, `"default"`, `"flex"`, or `"priority"`) into `process_prompts_*`, `start()`, or `start_nowait()` to opt into OpenAI’s scheduling tiers. `"flex"` automatically downgrades to `"auto"` on models that do not support it.
- Headers & MCP routing: use `extra_headers` to inject provider-specific HTTP headers, and `force_local_mcp=True` to force LM Deluge (instead of OpenAI/Anthropic) to call MCP servers locally when you provide an `MCPServer` tool.
- Tooling: `tools` accepts a list of `Tool` instances, raw built-in tool dictionaries (for computer use), or `MCPServer` descriptors.
- Post-processing: supply `postprocess` if you want to mutate every `APIResponse` before it is returned; perfect for trimming whitespace, redacting secrets, or logging.
- Caching knobs: pass any object with `get(prompt: Conversation)` and `put(prompt, response)` as the `cache=` argument when constructing the client for local caching (see the sketch after this list). Provide the `cache` string (`CachePattern`) on each `process_prompts_*` call to enable provider-side caching (currently Anthropic).
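A minimal sketch of a local cache object, relying only on the duck-typed `get`/`put` interface described above and assuming a miss is signaled by returning `None`. Keying on `str(prompt)` is a naive placeholder; a real cache would derive a stable fingerprint from the `Conversation`.

```python
class InMemoryCache:
    """Anything with get(prompt) and put(prompt, response) can be passed as cache=."""

    def __init__(self):
        self._store = {}

    def get(self, prompt):
        return self._store.get(str(prompt))  # None is assumed to mean "miss"

    def put(self, prompt, response):
        self._store[str(prompt)] = response


cached_client = LLMClient(["gpt-4.1-mini"], cache=InMemoryCache())
```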
Synchronous vs. Asynchronous APIs
- `process_prompts_sync` wraps the async version with `asyncio.run()` for convenience (a minimal call is sketched after this list); use `process_prompts_async` in notebooks or async services.
- `start()` / `start_nowait()` enqueue individual prompts and return task IDs that you can `await` later or multiplex using `wait_for_all()` and `as_completed()`.
- `stream()` streams OpenAI-compatible chat tokens to stdout and returns the final response; call `stream_chat` directly when you need an async generator of chunks.
- `run_agent_loop()` executes tool calls until the model returns a final answer, mutating your `Conversation` along the way.
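A minimal sketch of the synchronous entry point with the client from the constructor example; the prompt is illustrative, and it assumes the call returns a list of `APIResponse` objects in prompt order with the generated text on `.completion`:

```python
responses = client.process_prompts_sync(
    ["Summarize the Apollo 11 mission in one sentence."]
)
print(responses[0].completion)  # assumed attribute for the generated text
```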
See Advanced Workflows for code samples that combine these primitives.