API Reference

This page summarizes the primary classes exposed by lm_deluge.

LLMClient is the entry point for all prompt processing, rate limiting, retries, and tool orchestration.

LLMClient(
    model_names: str | list[str] = "gpt-4.1-mini",
    *,
    name: str | None = None,
    max_requests_per_minute: int = 1_000,
    max_tokens_per_minute: int = 100_000,
    max_concurrent_requests: int = 225,
    sampling_params: list[SamplingParams] | None = None,
    model_weights: list[float] | Literal["uniform", "dynamic"] = "uniform",
    max_attempts: int = 5,
    request_timeout: int = 30,
    cache: Any = None,
    extra_headers: dict[str, str] | None = None,
    extra_body: dict[str, str] | None = None,
    use_responses_api: bool = False,
    background: bool = False,
    temperature: float = 1.0,
    top_p: float = 1.0,
    json_mode: bool = False,
    max_new_tokens: int = 512,
    reasoning_effort: Literal["low", "medium", "high", "minimal", "none", None] = None,
    logprobs: bool = False,
    top_logprobs: int | None = None,
    force_local_mcp: bool = False,
    progress: Literal["rich", "tqdm", "manual"] = "rich",
    postprocess: Callable[[APIResponse], APIResponse] | None = None,
)

Key parameters:

  • sampling_params: list of SamplingParams to apply per model. If omitted, defaults derived from temperature, top_p, and max_new_tokens are used.
  • model_weights: provide explicit floats or 'uniform' for equal sampling. The 'dynamic' literal is reserved for a future auto-balancing mode and currently raises NotImplementedError if selected.
  • cache: any object exposing get(prompt: Conversation) -> APIResponse | None and put(prompt, response) -> None.
  • use_responses_api: switch OpenAI models to /responses (required for computer-use and Codex models).
  • background: only valid with use_responses_api=True; polls background jobs until completion.
  • extra_headers: merged into every HTTP request (useful for beta headers or OpenAI organization routing).
  • extra_body: merged into every HTTP request body (useful for provider-specific parameters).
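
A minimal construction sketch. The top-level import path, the second model name, and the header value are assumptions; every keyword mirrors the constructor above.

from lm_deluge import LLMClient  # import path assumed

client = LLMClient(
    ["gpt-4.1-mini", "claude-4.5-opus"],          # model_names: a single string also works
    model_weights=[0.7, 0.3],                     # explicit floats, or "uniform" for equal sampling
    max_requests_per_minute=500,
    max_tokens_per_minute=50_000,
    max_concurrent_requests=100,
    temperature=0.2,
    max_new_tokens=1_024,
    extra_headers={"OpenAI-Organization": "org-your-org"},  # merged into every HTTP request
)
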
Key methods:

  • process_prompts_sync(...) – Convenience wrapper that runs process_prompts_async inside asyncio.run().
  • process_prompts_async(...) – Schedule a batch of prompts, respecting rate limits and retries.
  • start(prompt, **kwargs) – Equivalent to start_nowait() + wait_for().
  • start_nowait(prompt, *, tools=None, cache=None, service_tier=None) – Queue a single prompt and return a task ID immediately.
  • wait_for(task_id) / wait_for_all(task_ids=None) – Await one or many tasks.
  • as_completed(task_ids=None) – Async generator yielding (task_id, APIResponse) pairs as soon as tasks finish.
  • stream(prompt, *, tools=None) – Streams chunks to stdout and resolves to the final APIResponse (see stream_chat for a generator).
  • run_agent_loop(conversation, *, tools=None, max_rounds=5) – Executes tool calls automatically until the model stops asking for tools. Equivalent to start_agent_loop_nowait() + wait_for_agent_loop().
  • start_agent_loop_nowait(conversation, *, tools=None, max_rounds=5) – Start an agent loop without waiting. Returns a task ID that can be used with wait_for_agent_loop().
  • wait_for_agent_loop(task_id) – Wait for an agent loop task to complete. Returns (Conversation, APIResponse).
  • run_agent_loop_sync(...) – Synchronous wrapper for the agent loop.
  • submit_batch_job(prompts, *, tools=None, cache=None, batch_size=50_000) – Submit prompts through OpenAI or Anthropic batch APIs.
  • wait_for_batch_job(batch_ids, provider) – Poll batch jobs until they complete.
  • open(total=None, show_progress=True) / close() / reset_tracker() – Manage the underlying StatusTracker.
  • with_limits(max_requests_per_minute=None, max_tokens_per_minute=None, max_concurrent_requests=None) – Update rate limits on an existing client instance. Returns self for chaining.
  • from_dict(config_dict) – Class method to create a client from a dictionary configuration.
  • from_yaml(file_path) – Class method to create a client from a YAML configuration file.

service_tier can be supplied to process_prompts_*, start(), and start_nowait() for OpenAI models ("auto", "default", "flex", "priority").
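
A batch-processing sketch. The import paths and the exact shape of the prompt list are assumptions; service_tier and the listed methods come from the table above.

import asyncio
from lm_deluge import LLMClient, Conversation  # import paths assumed

client = LLMClient("gpt-4.1-mini")
prompts = [Conversation().user(f"Summarize item {i}") for i in range(10)]

# Synchronous batch; service_tier is an OpenAI-only hint ("auto", "default", "flex", "priority")
responses = client.process_prompts_sync(prompts, service_tier="flex")

# Async path, consuming results as soon as each task finishes
async def main():
    task_ids = [client.start_nowait(p) for p in prompts]
    async for task_id, resp in client.as_completed(task_ids):
        print(task_id, resp.completion)

asyncio.run(main())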

SamplingParams encapsulates decoding options. It is defined in lm_deluge.config and mirrors the arguments expected by every provider.

SamplingParams(
    temperature: float = 1.0,
    top_p: float = 1.0,
    json_mode: bool = False,
    max_new_tokens: int = 2_048,
    global_effort: Literal["low", "medium", "high"] = "high",
    reasoning_effort: Literal["low", "medium", "high", "minimal", "none", None] = None,
    thinking_budget: int | None = None,
    logprobs: bool = False,
    top_logprobs: int | None = None,
    strict_tools: bool = True,
)

strict_tools=True ensures OpenAI/Anthropic tool definitions stay in strict mode unless you disable it per request. SamplingParams.to_vllm() converts the structure to a vllm.SamplingParams instance when you want to reuse configurations locally.

global_effort applies to Anthropic’s claude-4.5-opus and maps to the provider’s new effort parameter. thinking_budget lets you pin a token budget for reasoning models (Anthropic or Gemini); when both thinking_budget and reasoning_effort are supplied, the explicit budget wins and a warning is emitted to help spot unexpected overrides.
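
A per-model sampling sketch, assuming one SamplingParams entry pairs with each entry in model_names (the pairing rule is implied but not spelled out here) and that both classes import from the package root.

from lm_deluge import LLMClient, SamplingParams  # import paths assumed

client = LLMClient(
    ["gpt-4.1-mini", "claude-4.5-opus"],
    sampling_params=[
        SamplingParams(temperature=0.2, max_new_tokens=1_024, json_mode=True),
        SamplingParams(reasoning_effort="high", thinking_budget=8_000),  # explicit budget wins; a warning is emitted
    ],
)

# Reuse the same configuration with a local vLLM engine
vllm_params = SamplingParams(temperature=0.2, max_new_tokens=1_024).to_vllm()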

Conversation is a dataclass that holds a list of Message objects and exposes helpers for building prompts:

  • Conversation().system(text) and Conversation().user(text, image=None, file=None) create new conversations with a single message.
  • .add(message) / .with_message(message) append new messages.
  • .with_tool_result(tool_call_id, result) appends tool outputs, handling parallel calls automatically.
  • .to_openai(), .to_openai_responses(), .to_anthropic(cache_pattern=None) emit provider-specific payloads.
  • .from_openai_chat(messages) / .from_anthropic(...) convert provider transcripts back into LM Deluge objects.
  • .count_tokens(max_new_tokens=0, img_tokens=85) estimates the number of tokens for scheduling.
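
A conversation-building sketch. The import path is assumed, as is the convention that with_message returns the updated conversation.

from lm_deluge import Conversation, Message  # import path assumed

conv = Conversation().system("You are a terse assistant.")
conv = conv.with_message(Message.user("What is the capital of France?"))  # assumed to return the conversation

# Provider-specific payloads
openai_messages = conv.to_openai()
anthropic_payload = conv.to_anthropic(cache_pattern=None)

# Rough token estimate used for scheduling
n_tokens = conv.count_tokens(max_new_tokens=256)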

Message instances contain rich content blocks:

  • .with_text(str), .with_image(data, detail="auto", max_size=None), .with_file(data, media_type=None, filename=None)
  • .with_remote_file(data, provider="openai") (async) uploads files before referencing them
  • .with_tool_call(id, name, arguments) / .with_tool_result(call_id, result)
  • .with_thinking(content) for explicit reasoning traces

Helper constructors: Message.user, Message.system, and Message.ai (assistant).
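
A content-block sketch. Constructing an empty Message.user(), chaining the with_* helpers, and passing a bare image path are assumptions based on the names and signatures above.

from lm_deluge import Message  # import path assumed

msg = (
    Message.user()
    .with_text("Describe this chart.")
    .with_image("chart.png", detail="auto", max_size=1024)
)

# Assistant turn carrying an explicit reasoning trace and a tool call
assistant = (
    Message.ai()
    .with_thinking("The user wants weather data.")
    .with_tool_call("call_1", "get_weather", {"city": "Paris"})
)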

Tool describes a function-call schema plus an optional Python callable. Core helpers:

  • Tool.from_function(func, *, include_output_schema_in_description=False) – introspects type hints, docstrings, and Annotated[...] descriptions, and extracts an output_schema from the return type (with optional runtime validation via validate_output=True on .call()/.acall()).
  • Tool(...) – construct manually; parameters accepts JSON Schema dicts, BaseModel subclasses, TypedDict classes (including NotRequired/Required), or simple Python-type mappings like {"city": str, "limit": int} or (type, extras) tuples.
  • Tool.from_mcp(...) / Tool.from_mcp_config(config) – async helpers that connect to MCP servers and return lists of tools.

Instances expose .call(**kwargs) and .acall(**kwargs) which automatically pick the right execution strategy for sync vs. async callables; pass validate_output=True to enforce return-type validation when output_schema is present.
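
A tool-definition sketch; only the import path is assumed.

from typing import Annotated
from lm_deluge import Tool  # import path assumed

def get_weather(
    city: Annotated[str, "City name, e.g. 'Paris'"],
    units: str = "celsius",
) -> dict:
    """Look up current weather for a city."""
    return {"city": city, "temp_c": 18.0, "units": units}

# The schema is introspected from type hints, the docstring, and Annotated descriptions;
# the return annotation becomes output_schema.
weather_tool = Tool.from_function(get_weather)

result = weather_tool.call(city="Paris", validate_output=True)   # sync execution with return-type validation
# await weather_tool.acall(city="Paris")                         # async variant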

MCPServer(name, url, token=None, configuration=None, headers=None) wraps an MCP server description. Pass force_local_mcp=True to the LLMClient to expand the server locally, or rely on provider-native MCP support when available.
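
An MCP sketch. The structure of the config dict and the import paths are assumptions; from_mcp_config returning a list of tools and the agent-loop signature come from the descriptions above.

import asyncio
from lm_deluge import LLMClient, Tool, Conversation  # import paths assumed

async def main():
    # Connects to the server and returns Tool objects; the config shape here is an assumption.
    tools = await Tool.from_mcp_config({"docs": {"url": "https://mcp.example.com"}})

    client = LLMClient("gpt-4.1-mini")
    conv, final = await client.run_agent_loop(
        Conversation().user("Search the docs for rate limits."),
        tools=tools,
        max_rounds=5,
    )
    print(final.completion)

asyncio.run(main())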

Utility managers in lm_deluge.tool.prefab provide ready-made tool suites:

  • FilesystemManager exposes a sandboxed filesystem tool (read_file, write_file, delete_path, list_dir, grep, apply_patch) backed by an in-memory workspace or any custom WorkspaceBackend.
  • TodoManager exposes todowrite/todoread handlers for maintaining a structured todo list during long sessions (see TodoItem, TodoPriority, and TodoStatus for strongly typed entries).
  • SubAgentManager registers start_subagent, check_subagent, and wait_for_subagent tools so the main model can delegate parallel agent loops to cheaper models without manual orchestration.
  • MemoryManager exposes memsearch/memread/memwrite/memupdate/memdelete for long-lived note taking outside the chat transcript.
  • BatchTool bundles multiple tool calls into one request (calls: [{tool, arguments}]) to save roundtrips.
  • ToolSearchTool gives the model a regex-powered discovery + call helper when you have a large toolbelt.
  • ToolComposer (OTC) lets the model write short Python snippets that orchestrate multiple tools in one shot, returning only the final output to the conversation.

extract, extract_async, translate, translate_async, and score_llm now live in lm_deluge.pipelines.

File and Image encapsulate binary content.

File:

  • Accepts local paths, URLs, byte buffers, base64 strings, or existing provider file_ids.
  • .as_remote(provider) uploads the file to OpenAI, Anthropic, or Gemini and returns a new File with file_id populated.
  • .delete() removes remote files when you no longer need them.
  • fingerprint and size properties are cached for consistent cache keys.

Image:

  • Uses the same constructors as File and supports .resize(max_size) to shrink large images.
  • .from_pdf(path, dpi=200, target_size=1024) converts PDF pages into JPEG images (requires pdf2image).
  • Provider-specific methods (oa_chat, oa_resp, anthropic, gemini, mistral) are invoked internally when building payloads.
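
An attachment sketch. The bare-path constructors and the exact provider strings are assumptions; the methods themselves are listed above.

from lm_deluge import File, Image, Conversation  # import paths assumed

report = File("reports/q3.pdf")                  # paths, URLs, bytes, base64, or provider file_ids
remote = report.as_remote("openai")              # uploads and returns a File with file_id populated

chart = Image("charts/revenue.png").resize(max_size=1024)             # shrink before sending
pages = Image.from_pdf("reports/q3.pdf", dpi=200, target_size=1024)   # requires pdf2image

conv = Conversation().user("Summarize the attached report.", file=remote, image=chart)
remote.delete()                                  # remove the uploaded copy when finished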

APIResponse captures the result of every request:

APIResponse(
    id: int,
    model_internal: str,
    prompt: Conversation | dict,
    sampling_params: SamplingParams,
    status_code: int | None,
    is_error: bool | None,
    error_message: str | None,
    usage: Usage | None = None,
    content: Message | None = None,
    thinking: str | None = None,
    model_external: str | None = None,
    region: str | None = None,
    logprobs: list | None = None,
    finish_reason: str | None = None,
    cost: float | None = None,
    cache_hit: bool = False,
    local_cache_hit: bool = False,
    retry_with_different_model: bool | None = False,
    give_up_if_no_other_models: bool | None = False,
    response_id: str | None = None,
    raw_response: dict | None = None,
)

Conveniences:

  • .completion returns the first text part for backward compatibility.
  • .input_tokens, .output_tokens, .cache_read_tokens, .cache_write_tokens proxy the underlying Usage object.
  • .to_dict() / .from_dict() help with persistence (images are replaced with textual placeholders).

Usage(input_tokens, output_tokens, cache_read_tokens, cache_write_tokens) tracks provider-reported metrics and exposes .total_tokens and .has_cache_hit helpers.
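
A result-inspection sketch, continuing from the batch sketch above; the import path and the assumption that process_prompts_sync returns a list of APIResponse objects are not confirmed on this page.

from lm_deluge import APIResponse  # import path assumed

# `responses` is the list returned by process_prompts_sync in the earlier sketch.
for resp in responses:
    if resp.is_error:
        print(f"request {resp.id} failed ({resp.status_code}): {resp.error_message}")
        continue
    print(resp.completion)                                   # first text part of the assistant message
    if resp.usage is not None:
        print(resp.input_tokens, resp.output_tokens, resp.usage.total_tokens, resp.usage.has_cache_hit)
    print(resp.cost, resp.finish_reason)

# Persist and restore (images become textual placeholders)
restored = APIResponse.from_dict(responses[0].to_dict())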

Pass a cache implementation into the client constructor to enable local caching. Built-in caches live in lm_deluge.cache:

  • SqliteCache(path, cache_key="default")
  • LevelDBCache(path=None, cache_key="default")
  • DistributedDictCache(cache, cache_key="default")

Each cache fingerprints only the Conversation content (messages, images, files, etc.) to generate cache keys. Note: Sampling parameters (temperature, max_new_tokens, etc.) are NOT included in the cache key, so changing these parameters will still hit the same cache entry. This is intentional—the cache stores responses based on the prompt content alone.
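
A caching sketch; the database filename is arbitrary and the import path of LLMClient is assumed.

from lm_deluge import LLMClient, Conversation   # import path assumed
from lm_deluge.cache import SqliteCache

client = LLMClient("gpt-4.1-mini", cache=SqliteCache("responses.db"))
prompt = Conversation().user("What is the capital of France?")

first = client.process_prompts_sync([prompt])[0]    # calls the API and stores the response
second = client.process_prompts_sync([prompt])[0]   # same Conversation fingerprint, so served from the cache
assert second.local_cache_hit                       # sampling params are not part of the key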