API Reference
This page summarizes the primary classes exposed by lm_deluge.
LLMClient
LLMClient is the entry point for all prompt processing, rate limiting, retries, and tool orchestration.
Constructor
```python
LLMClient(
    model_names: str | list[str] = "gpt-4.1-mini",
    *,
    name: str | None = None,
    max_requests_per_minute: int = 1_000,
    max_tokens_per_minute: int = 100_000,
    max_concurrent_requests: int = 225,
    sampling_params: list[SamplingParams] | None = None,
    model_weights: list[float] | Literal["uniform", "dynamic"] = "uniform",
    max_attempts: int = 5,
    request_timeout: int = 30,
    cache: Any = None,
    extra_headers: dict[str, str] | None = None,
    use_responses_api: bool = False,
    background: bool = False,
    temperature: float = 0.75,
    top_p: float = 1.0,
    json_mode: bool = False,
    max_new_tokens: int = 512,
    reasoning_effort: Literal["low", "medium", "high", "minimal", "none", None] = None,
    logprobs: bool = False,
    top_logprobs: int | None = None,
    force_local_mcp: bool = False,
    progress: Literal["rich", "tqdm", "manual"] = "rich",
    postprocess: Callable[[APIResponse], APIResponse] | None = None,
)
```

Key parameters:

- `sampling_params`: list of `SamplingParams` to apply per model. If omitted, defaults derived from `temperature`, `top_p`, and `max_new_tokens` are used.
- `model_weights`: provide explicit floats or `"uniform"` for equal sampling. The `"dynamic"` literal is reserved for a future auto-balancing mode and currently raises `NotImplementedError` if selected.
- `cache`: any object exposing `get(prompt: Conversation) -> APIResponse | None` and `put(prompt, response) -> None`.
- `use_responses_api`: switch OpenAI models to `/responses` (required for computer-use and Codex models).
- `background`: only valid with `use_responses_api=True`; polls background jobs until completion.
- `extra_headers`: merged into every HTTP request (useful for beta headers or OpenAI organization routing).
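As a rough sketch of how these pieces fit together (the top-level `LLMClient` import and the model names are assumptions; per-model `SamplingParams` is documented above and detailed below):

```python
from lm_deluge import LLMClient              # import path assumed
from lm_deluge.config import SamplingParams

client = LLMClient(
    ["gpt-4.1-mini", "gpt-4.1"],             # illustrative pair of model names
    max_requests_per_minute=500,
    max_tokens_per_minute=50_000,
    sampling_params=[
        SamplingParams(temperature=0.2, max_new_tokens=1024),  # one entry per model, in order
        SamplingParams(temperature=0.7, max_new_tokens=1024),
    ],
    model_weights="uniform",                 # sample both models equally
)
```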
Core Methods
| Method | Description |
|---|---|
| `process_prompts_sync(...)` | Convenience wrapper that runs `process_prompts_async` inside `asyncio.run()`. |
| `process_prompts_async(...)` | Schedule a batch of prompts, respecting rate limits and retries. |
| `start(prompt, **kwargs)` | Equivalent to `start_nowait()` + `wait_for()`. |
| `start_nowait(prompt, *, tools=None, cache=None, service_tier=None)` | Queue a single prompt and return a task ID immediately. |
| `wait_for(task_id)` / `wait_for_all(task_ids=None)` | Await one or many tasks. |
| `as_completed(task_ids=None)` | Async generator yielding `(task_id, APIResponse)` pairs as soon as tasks finish. |
| `stream(prompt, *, tools=None)` | Streams chunks to stdout and resolves to the final `APIResponse` (see `stream_chat` for a generator). |
| `run_agent_loop(conversation, *, tools=None, max_rounds=5)` | Executes tool calls automatically until the model stops asking for tools. |
| `run_agent_loop_sync(...)` | Synchronous wrapper for the agent loop. |
| `submit_batch_job(prompts, *, tools=None, cache=None, batch_size=50_000)` | Submit prompts through the OpenAI or Anthropic batch APIs. |
| `wait_for_batch_job(batch_ids, provider)` | Poll batch jobs until they complete. |
| `open(total=None, show_progress=True)` / `close()` / `reset_tracker()` | Manage the underlying `StatusTracker`. |

`service_tier` can be supplied to `process_prompts_*`, `start()`, and `start_nowait()` for OpenAI models (`"auto"`, `"default"`, `"flex"`, `"priority"`).
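A minimal synchronous batch, assuming `process_prompts_sync` returns one `APIResponse` per prompt in order (import paths are also assumed):

```python
from lm_deluge import LLMClient, Conversation  # import paths assumed

client = LLMClient("gpt-4.1-mini")
prompts = [Conversation.user(q) for q in ("What is 2 + 2?", "Name a prime above 100.")]

# Blocking wrapper around process_prompts_async; service_tier only applies to OpenAI models.
responses = client.process_prompts_sync(prompts, service_tier="default")

for resp in responses:
    print(resp.completion)
```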
SamplingParams
`SamplingParams` encapsulates decoding options. It is defined in `lm_deluge.config` and mirrors the arguments expected by every provider.

```python
SamplingParams(
    temperature: float = 0.0,
    top_p: float = 1.0,
    json_mode: bool = False,
    max_new_tokens: int = 2_048,
    reasoning_effort: Literal["low", "medium", "high", "minimal", "none", None] = None,
    logprobs: bool = False,
    top_logprobs: int | None = None,
    strict_tools: bool = True,
)
```

`strict_tools=True` ensures OpenAI/Anthropic tool definitions stay in strict mode unless you disable it per request. `SamplingParams.to_vllm()` converts the structure to a `vllm.SamplingParams` instance when you want to reuse configurations locally.
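One plausible use of `to_vllm()` is sharing a single decoding configuration between hosted API calls and a local vLLM engine; only `to_vllm()` itself is documented above, so treat this as an illustrative sketch:

```python
from lm_deluge.config import SamplingParams

params = SamplingParams(temperature=0.2, top_p=0.95, max_new_tokens=1024)

# Reuse the same decoding settings locally; returns a vllm.SamplingParams instance.
vllm_params = params.to_vllm()
```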
Conversation & Message
`Conversation` is a dataclass that holds a list of `Message` objects and exposes helpers for building prompts:

- `Conversation.system(text)` and `Conversation.user(text, image=None, file=None)` create new conversations with a single message.
- `.add(message)` / `.with_message(message)` append new messages.
- `.with_tool_result(tool_call_id, result)` appends tool outputs, handling parallel calls automatically.
- `.to_openai()`, `.to_openai_responses()`, `.to_anthropic(cache_pattern=None)` emit provider-specific payloads.
- `.from_openai_chat(messages)` / `.from_anthropic(...)` convert provider transcripts back into LM Deluge objects.
- `.count_tokens(max_new_tokens=0, img_tokens=85)` estimates the number of tokens for scheduling.
Message instances contain rich content blocks:
- `.with_text(str)`, `.with_image(data, detail="auto", max_size=None)`, `.with_file(data, media_type=None, filename=None)`
- `.with_remote_file(data, provider="openai")` (async) uploads files before referencing them
- `.with_tool_call(id, name, arguments)` / `.with_tool_result(call_id, result)`
- `.with_thinking(content)` for explicit reasoning traces

Helper constructors: `Message.user`, `Message.system`, and `Message.ai` (assistant).
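Combining the helpers, a multimodal prompt might be assembled like this; whether `.with_message` and `.with_image` return new objects or mutate in place isn't specified here, so the chaining shown is an assumption:

```python
from lm_deluge import Conversation, Message  # import paths assumed

convo = Conversation.system("You are a terse assistant.")
convo = convo.with_message(
    Message.user("What's in this picture?").with_image("photo.jpg", detail="auto")
)

# Provider-specific payloads, e.g. for manual inspection.
openai_messages = convo.to_openai()
anthropic_payload = convo.to_anthropic()

# Rough token estimate used for scheduling.
print(convo.count_tokens(max_new_tokens=256))
```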
Tool, ToolParams, MCPServer
`Tool` describes a function-call schema plus an optional Python callable:

- `Tool.from_function(func)` – introspects type hints and docstrings.
- `Tool.from_pydantic(name, BaseModel, *, description=None, run=None, **kwargs)`
- `Tool.from_typed_dict(name, TypedDict, *, description=None, run=None, **kwargs)`
- `Tool.from_params(name, ToolParams, *, description=None, run=None)`
- `Tool.from_mcp(...)` / `Tool.from_mcp_config(config)` – async helpers that connect to MCP servers and return lists of tools.

Instances expose `.call(**kwargs)` and `.acall(**kwargs)`, which automatically pick the right execution strategy for sync vs. async callables.
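For instance, a plain function can be turned into a tool and run through the agent loop; apart from the documented `Tool.from_function` and `run_agent_loop_sync(..., tools=..., max_rounds=...)` calls, the details below (imports, return value) are assumptions:

```python
from lm_deluge import LLMClient, Conversation, Tool  # import paths assumed

def get_weather(city: str) -> str:
    """Return a short weather summary for a city."""
    return f"It is sunny in {city}."  # stub implementation

weather_tool = Tool.from_function(get_weather)  # schema built from type hints and docstring

client = LLMClient("gpt-4.1-mini")
result = client.run_agent_loop_sync(
    Conversation.user("What's the weather in Paris?"),
    tools=[weather_tool],
    max_rounds=3,
)
```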
`ToolParams(schema_dict)` lets you build JSON Schemas programmatically (including required keys and nested structures).

`MCPServer(name, url, token=None, configuration=None, headers=None)` wraps an MCP server description. Pass `force_local_mcp=True` to the `LLMClient` to expand the server locally, or rely on provider-native MCP support when available.
File & Image
`File` and `Image` encapsulate binary content.

`File`:

- Accepts local paths, URLs, byte buffers, base64 strings, or existing provider `file_id`s.
- `.as_remote(provider)` uploads the file to OpenAI, Anthropic, or Gemini and returns a new `File` with `file_id` populated.
- `.delete()` removes remote files when you no longer need them.
- `fingerprint` and `size` properties are cached for consistent cache keys.

`Image`:

- Uses the same constructors as `File` and supports `.resize(max_size)` to shrink large images.
- `.from_pdf(path, dpi=200, target_size=1024)` converts PDF pages into JPEG images (requires `pdf2image`).
- Provider-specific methods (`oa_chat`, `oa_resp`, `anthropic`, `gemini`, `mistral`) are invoked internally when building payloads.
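A short sketch of attaching binary content; the constructor taking a path directly and the `image=` keyword on `Conversation.user` follow the descriptions above, but the exact call shapes are assumptions:

```python
from lm_deluge import Conversation, Image  # import paths assumed

img = Image("diagram.png")                 # local path, URL, bytes, or base64 are all accepted
convo = Conversation.user("Describe this diagram.", image=img)

# One JPEG per PDF page (requires pdf2image).
pages = Image.from_pdf("report.pdf", dpi=150)
```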
APIResponse & Usage
`APIResponse` captures the result of every request:

```python
APIResponse(
    id: int,
    model_internal: str,
    prompt: Conversation | dict,
    sampling_params: SamplingParams,
    status_code: int | None,
    is_error: bool | None,
    error_message: str | None,
    usage: Usage | None = None,
    content: Message | None = None,
    thinking: str | None = None,
    model_external: str | None = None,
    region: str | None = None,
    logprobs: list | None = None,
    finish_reason: str | None = None,
    cost: float | None = None,
    cache_hit: bool = False,
    local_cache_hit: bool = False,
    retry_with_different_model: bool | None = False,
    give_up_if_no_other_models: bool | None = False,
    response_id: str | None = None,
    raw_response: dict | None = None,
)
```

Conveniences:

- `.completion` returns the first text part for backward compatibility.
- `.input_tokens`, `.output_tokens`, `.cache_read_tokens`, `.cache_write_tokens` proxy the underlying `Usage` object.
- `.to_dict()` / `.from_dict()` help with persistence (images are replaced with textual placeholders).

`Usage(input_tokens, output_tokens, cache_read_tokens, cache_write_tokens)` tracks provider-reported metrics and exposes `.total_tokens` and `.has_cache_hit` helpers.
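Inspecting a finished request only needs the documented fields and conveniences; `responses` here refers to the batch sketch earlier on this page:

```python
resp = responses[0]  # an APIResponse from the batch example above

if resp.is_error:
    print(f"request {resp.id} failed: {resp.error_message}")
else:
    print(resp.completion)                          # first text part
    print(resp.input_tokens, resp.output_tokens)    # proxied from Usage
    if resp.usage is not None:
        print(resp.usage.total_tokens, resp.usage.has_cache_hit)
    print(resp.cost, resp.finish_reason)

saved = resp.to_dict()  # images are replaced with textual placeholders
```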
Cache Interface
Pass a cache implementation into the client constructor to enable local caching:

```python
class CacheProto:
    def get(self, prompt: Conversation) -> APIResponse | None: ...
    def put(self, prompt: Conversation, response: APIResponse) -> None: ...
```

Built-in caches live in `lm_deluge.cache`:

- `SqliteCache(path, cache_key="default")`
- `LevelDBCache(path=None, cache_key="default")`
- `DistributedDictCache(cache, cache_key="default")`

Each cache fingerprints the entire `Conversation` (including `SamplingParams`) to avoid false positives.
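Wiring the built-in SQLite cache into a client, as a sketch (`lm_deluge.cache` is documented above; the other import paths are assumed):

```python
from lm_deluge import LLMClient, Conversation   # import paths assumed
from lm_deluge.cache import SqliteCache

client = LLMClient("gpt-4.1-mini", cache=SqliteCache("responses.db"))

prompt = Conversation.user("Summarize Hamlet in one sentence.")
first = client.process_prompts_sync([prompt])
second = client.process_prompts_sync([prompt])  # should be served from the local cache
# second[0].local_cache_hit can be checked to confirm.
```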