Batch Processing
Batch processing lets you submit large numbers of prompts at once with significant cost savings (typically 50% off). Both OpenAI and Anthropic support batch APIs with 24-hour completion windows.
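As a rough illustration of what the discount means at scale, here is a back-of-the-envelope calculation; the per-million-token prices below are hypothetical placeholders, not actual provider pricing:

```python
# Hypothetical prices for illustration only; real prices vary by model and provider.
price_per_mtok_input = 2.50    # USD per million input tokens (assumed)
price_per_mtok_output = 10.00  # USD per million output tokens (assumed)
batch_discount = 0.50          # batch APIs typically charge ~50% of standard rates

n_requests = 50_000
avg_input_tokens = 500
avg_output_tokens = 200

standard_cost = n_requests * (
    avg_input_tokens / 1e6 * price_per_mtok_input
    + avg_output_tokens / 1e6 * price_per_mtok_output
)
batch_cost = standard_cost * batch_discount
print(f"standard: ${standard_cost:,.2f}  batch: ${batch_cost:,.2f}")
```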
Overview
Batch processing is ideal for:
- Large-scale data processing tasks
- Non-time-sensitive workloads
- Cost optimization for high-volume API usage
Provider limits:
- OpenAI: Up to 50,000 requests per batch
- Anthropic: Up to 100,000 requests per batch
Basic Batch Processing
Synchronous with Wait
Submit a batch and wait for completion:
```python
from lm_deluge import LLMClient

client = LLMClient(
    model_names=["gpt-4o-mini"],
    max_requests_per_minute=10000,
    max_tokens_per_minute=1000000,
    max_concurrent_requests=100,
)

prompts = [
    "What is the capital of France?",
    "Explain quantum computing in simple terms",
    "Write a haiku about programming",
]

# Submit and wait for completion
results = client.submit_batch_job(
    prompts,
    wait_for_completion=True,
    poll_interval=30,  # Check status every 30 seconds
)

# Process results
for i, result in enumerate(results):
    print(f"Prompt {i}: {prompts[i]}")
    content = result["response"]["body"]["choices"][0]["message"]["content"]
    print(f"Response: {content}\n")
```

Asynchronous Batch Processing
For better performance with multiple batches:
```python
import asyncio

from lm_deluge import LLMClient

async def process_batches():
    client = LLMClient(
        model_names=["claude-4.5-haiku"],
        max_requests_per_minute=10000,
        max_tokens_per_minute=1000000,
        max_concurrent_requests=100,
    )

    batch1_prompts = ["Question 1", "Question 2"]
    batch2_prompts = ["Question 3", "Question 4"]

    # Submit both batches concurrently
    batch_ids = await asyncio.gather(
        client.submit_batch_job_async(batch1_prompts),
        client.submit_batch_job_async(batch2_prompts),
    )

    print(f"Submitted batches: {batch_ids}")

    # Wait for all batches to complete
    all_results = await client.wait_for_batch_completion_async(
        [batch_id for batch in batch_ids for batch_id in batch],
        provider="anthropic",
        poll_interval=60,
    )

    return all_results

results = asyncio.run(process_batches())
```

Submit Now, Retrieve Later
For long-running jobs, submit without waiting and retrieve results later:
```python
import json

from lm_deluge import LLMClient

# === Submit job ===
client = LLMClient(
    model_names=["gpt-4o"],
    max_requests_per_minute=10000,
    max_tokens_per_minute=1000000,
    max_concurrent_requests=100,
)

prompts = ["prompt 1", "prompt 2", "prompt 3"]

# Submit without waiting
batch_ids = client.submit_batch_job(
    prompts,
    wait_for_completion=False,
)

# Save batch IDs for later
with open("batch_ids.json", "w") as f:
    json.dump({"batch_ids": batch_ids, "provider": "openai"}, f)

print(f"Submitted {len(batch_ids)} batch(es): {batch_ids}")
```

```python
# === Retrieve later (separate script/session) ===
import json

from lm_deluge import LLMClient

with open("batch_ids.json", "r") as f:
    batch_info = json.load(f)

client = LLMClient(
    model_names=["gpt-4o"],
    max_requests_per_minute=10000,
    max_tokens_per_minute=1000000,
    max_concurrent_requests=100,
)

# Retrieve completed batches
results = client.retrieve_batch_jobs(
    batch_info["batch_ids"],
    provider=batch_info["provider"],
)

for batch_results in results:
    for result in batch_results:
        print(f"Custom ID: {result['custom_id']}")
        content = result["response"]["body"]["choices"][0]["message"]["content"]
        print(f"Response: {content}\n")
```

Batch with Tools (Anthropic)
Anthropic batch jobs support tool use:
```python
from lm_deluge import LLMClient, Tool, Conversation

def get_weather(location: str) -> str:
    """Get the weather for a location."""
    return f"The weather in {location} is sunny and 72F"

weather_tool = Tool.from_function(get_weather)

client = LLMClient(
    model_names=["claude-4-sonnet"],
    max_requests_per_minute=10000,
    max_tokens_per_minute=1000000,
    max_concurrent_requests=100,
)

# Create conversations that might use tools
conversations = [
    Conversation().user("What's the weather in Paris?"),
    Conversation().user("Tell me about the weather in Tokyo"),
    Conversation().user("Is it nice in London today?"),
]

# Submit batch with tools
results = client.submit_batch_job(
    conversations,
    wait_for_completion=True,
    tools=[weather_tool],
)
```

Batch with Prompt Caching (Anthropic)
Combine batching with prompt caching for maximum savings:
```python
from lm_deluge import LLMClient, Conversation, Message

client = LLMClient(
    model_names=["claude-4-sonnet"],
    max_requests_per_minute=10000,
    max_tokens_per_minute=1000000,
    max_concurrent_requests=100,
)

# Common system prompt across all requests
base_system = "You are a helpful assistant specializing in technical documentation."

conversations = [
    Conversation().system(base_system).add(Message.user("Explain REST APIs")),
    Conversation().system(base_system).add(Message.user("Explain GraphQL")),
    Conversation().system(base_system).add(Message.user("Explain gRPC")),
]

# Submit with caching enabled
results = client.submit_batch_job(
    conversations,
    wait_for_completion=True,
    cache="system_and_tools",
)
```

Large-Scale Data Processing
Process a dataset in chunks:
```python
import asyncio

import pandas as pd

from lm_deluge import LLMClient

async def process_dataset(df):
    client = LLMClient(
        model_names=["gpt-4o-mini"],
        max_requests_per_minute=10000,
        max_tokens_per_minute=1000000,
        max_concurrent_requests=100,
    )

    # Create prompts from dataframe
    prompts = [
        f"Analyze this feedback and rate sentiment (positive/negative/neutral): {row['feedback']}"
        for _, row in df.iterrows()
    ]

    # Process in chunks (under 50k limit)
    chunk_size = 10000
    all_results = []

    for i in range(0, len(prompts), chunk_size):
        chunk = prompts[i : i + chunk_size]
        print(f"Processing chunk {i // chunk_size + 1}")

        results = await client.submit_batch_job_async(
            chunk,
            wait_for_completion=True,
        )
        all_results.extend(results)

    # Extract sentiments
    sentiments = []
    for result in all_results:
        if "error" not in result:
            content = result["response"]["body"]["choices"][0]["message"]["content"]
            sentiments.append(content.strip())
        else:
            sentiments.append("ERROR")

    df["sentiment"] = sentiments
    return df

# Usage
df = pd.read_csv("customer_feedback.csv")
df_with_sentiment = asyncio.run(process_dataset(df))
df_with_sentiment.to_csv("feedback_with_sentiment.csv", index=False)
```

Error Handling
Handle failures at both the batch level and the individual-request level:

```python
from lm_deluge import LLMClient

client = LLMClient(model_names=["gpt-4o-mini"])

prompts = ["prompt 1", "prompt 2", "prompt 3"]

try:
    results = client.submit_batch_job(
        prompts,
        wait_for_completion=True,
    )

    # Check individual request results
    for result in results:
        if "error" in result:
            print(f"Request {result['custom_id']} failed: {result['error']}")
        else:
            content = result["response"]["body"]["choices"][0]["message"]["content"]
            print(f"Success: {content[:100]}...")

except ValueError as e:
    print(f"Batch job failed: {e}")
```

Progress Monitoring
When `wait_for_completion=True`, you'll see live progress:
```
Batch batch_abc123 - processing - 2m 15s - 1250/5000 done
Batch batch_def456 - processing - 2m 15s - 2000/5000 done
```

Best Practices
- Batch similar requests: Group prompts that generate similar-length responses
- Use appropriate models: Smaller models (gpt-4o-mini, claude-4.5-haiku) are ideal for batch processing
- Leverage caching: For Anthropic, use cache patterns on repeated content
- Monitor token usage: Check results to optimize future batches
- Set appropriate poll intervals: 30-60 seconds is usually good
- Test with small batches first: Validate prompts on a small sample before scaling up (see the sketch after this list)
- Always save batch IDs: Persist IDs when not waiting for completion
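For "test with small batches first", one approach is to validate a small sample before committing the full job. A minimal sketch using the same `submit_batch_job` call shown above; the 20-prompt sample size is an arbitrary choice:

```python
from lm_deluge import LLMClient

client = LLMClient(model_names=["gpt-4o-mini"])

prompts = [f"prompt {i}" for i in range(5000)]  # the full workload

# Validate a small sample before submitting everything
sample_results = client.submit_batch_job(prompts[:20], wait_for_completion=True)

errors = [r for r in sample_results if "error" in r]
if errors:
    print(f"{len(errors)} sample request(s) failed; fix prompts before scaling up")
else:
    batch_ids = client.submit_batch_job(prompts, wait_for_completion=False)
    print(f"Full job submitted: {batch_ids}")
```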
Limitations
- 24-hour time limit: Batches must finish within the 24-hour completion window; requests still unfinished after that expire
- No streaming: Batch responses don’t support streaming
- Request size limits: Individual requests still have per-request token limits (see the screening sketch after this list)
- Rate limits still apply: Batch limits are higher than standard rate limits, but overall caps still exist
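For the request size limitation, it can help to screen prompts before submission. A minimal sketch, assuming a rough 4-characters-per-token estimate and a hypothetical 100,000-token per-request ceiling; `split_by_size` is an illustrative helper, not part of lm_deluge:

```python
MAX_TOKENS_PER_REQUEST = 100_000  # hypothetical ceiling; check your model's actual limit
CHARS_PER_TOKEN = 4               # rough heuristic, not an exact tokenizer

def split_by_size(prompts: list[str]) -> tuple[list[str], list[str]]:
    """Separate prompts that likely fit the per-request limit from ones that likely don't."""
    ok, too_long = [], []
    for prompt in prompts:
        estimated_tokens = len(prompt) / CHARS_PER_TOKEN
        (ok if estimated_tokens <= MAX_TOKENS_PER_REQUEST else too_long).append(prompt)
    return ok, too_long

prompts = ["short prompt", "another prompt", "x" * 1_000_000]
ok_prompts, oversized = split_by_size(prompts)
print(f"{len(oversized)} prompt(s) likely exceed the per-request token limit")
```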