Integrating AI APIs into Your Application
A developer guide for integrating large language model (LLM) APIs into production applications. Covers authentication, request handling, streaming responses, error management, and cost optimization.
Overview
AI APIs provide access to language models via HTTP endpoints. This guide covers the patterns and practices needed to build reliable integrations that handle real-world conditions: latency, rate limits, errors, and cost.
Prerequisites
- API key from your AI provider
- HTTPS client library (requests, fetch, axios, etc.)
- Environment variable management for secrets
Authentication
Most AI APIs authenticate using Bearer tokens in the Authorization header.
Authorization: Bearer sk-your-api-key-here
Security Best Practices
- Never expose keys in client-side code. Make API calls from your backend.
- Use environment variables. Load keys from process.env or equivalent.
- Rotate keys regularly. Implement key rotation without downtime.
- Set usage limits. Configure spending caps in your provider dashboard.
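A minimal sketch of backend key loading following the practices above (the AI_API_KEY variable name is an assumption; match it to whatever your deployment defines):

```javascript
// Load the API key from the environment on the backend; never ship it to clients.
// AI_API_KEY is an assumed variable name, not a provider convention.
function loadApiKey(env = process.env) {
  const key = env.AI_API_KEY;
  if (!key) {
    // Fail fast at startup rather than on the first request.
    throw new Error('AI_API_KEY is not set; refusing to start');
  }
  return key;
}
```

Failing fast at startup surfaces a missing key immediately instead of as a 401 from the provider later.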
Basic Request
A typical completion request sends a prompt and receives generated text.
Request
POST /v1/chat/completions
Content-Type: application/json
Authorization: Bearer $API_KEY
{
"model": "gpt-4",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain DNS in one paragraph."}
],
"max_tokens": 150,
"temperature": 0.7
}
Response
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1706140800,
"model": "gpt-4",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "DNS (Domain Name System) translates human-readable domain names..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 87,
"total_tokens": 112
}
}
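In practice the fields you read most are the assistant message, the finish reason, and the usage block. A small helper to pull them out (the shape follows the sample response above; field names can vary by provider):

```javascript
// Extract the generated text, finish reason, and token usage
// from a chat completion response object.
function extractReply(completion) {
  const choice = completion.choices?.[0];
  return {
    content: choice?.message?.content ?? '',
    finishReason: choice?.finish_reason ?? 'unknown',
    totalTokens: completion.usage?.total_tokens ?? 0
  };
}
```

Checking finish_reason matters: a value of "length" means the response was cut off by max_tokens.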
Key Parameters
- model: The model identifier (e.g., gpt-4, claude-3-opus)
- messages: Conversation history as an array of role/content objects
- max_tokens: Maximum tokens to generate (controls response length and cost)
- temperature: Randomness (0 = deterministic, 1 = creative). Use lower values for factual tasks.
- top_p: Nucleus sampling threshold. Alternative to temperature.
- stop: Sequences where the model should stop generating
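The parameters above can be wrapped in a small builder that applies defaults in one place (the default values here are illustrative assumptions, not provider recommendations):

```javascript
// Build a chat completion request body, applying conservative defaults.
function buildRequest(messages, options = {}) {
  return {
    model: options.model ?? 'gpt-4',
    messages,
    max_tokens: options.maxTokens ?? 150,
    // Lower temperature for factual tasks; raise it for creative ones.
    temperature: options.temperature ?? 0.2,
    ...(options.stop ? { stop: options.stop } : {})
  };
}
```

Centralizing defaults makes it easy to audit and change them as pricing or quality requirements shift.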
Streaming Responses
For better user experience, stream responses token-by-token instead of waiting for the complete response.
Request
{
"model": "gpt-4",
"messages": [...],
"stream": true
}
Handling Server-Sent Events
const response = await fetch(endpoint, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apiKey}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ ...payload, stream: true })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Decode incrementally and buffer, since a network chunk can end mid-line.
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // keep any incomplete line for the next chunk
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice('data: '.length);
    if (data === '[DONE]') return;
    const parsed = JSON.parse(data);
    const token = parsed.choices[0]?.delta?.content || '';
    process.stdout.write(token);
  }
}
Error Handling
AI APIs return standard HTTP status codes. Handle these gracefully.
Common Errors
- 400 Bad Request: Invalid parameters. Check your request body.
- 401 Unauthorized: Invalid or missing API key.
- 429 Rate Limited: Too many requests. Implement exponential backoff.
- 500 Server Error: Provider issue. Retry with backoff.
- 503 Service Unavailable: Temporary overload. Retry later.
Retry Logic
async function callWithRetry(request, maxRetries = 3) {
  let lastError;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await fetch(request);
      // Retry on rate limits and server errors with exponential backoff.
      if (response.status === 429 || response.status >= 500) {
        lastError = new Error(`API error: ${response.status}`);
        const delay = Math.pow(2, attempt) * 1000;
        await new Promise(r => setTimeout(r, delay));
        continue;
      }
      // Client errors (4xx other than 429) won't succeed on retry.
      if (!response.ok) {
        throw new Error(`API error: ${response.status}`);
      }
      return await response.json();
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries - 1) break;
    }
  }
  // Surface the last failure instead of silently returning undefined.
  throw lastError;
}
Rate Limiting
Respect rate limits to avoid service disruption.
Response Headers
x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 45
x-ratelimit-reset-requests: 1s
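A sketch of reading this state after each response (the header names follow the example above; check your provider's documentation for the exact names):

```javascript
// Read rate-limit state from response headers.
function parseRateLimit(headers) {
  return {
    limit: Number(headers.get('x-ratelimit-limit-requests')),
    remaining: Number(headers.get('x-ratelimit-remaining-requests')),
    // Reset format is provider-specific; keep the raw string (e.g. "1s").
    reset: headers.get('x-ratelimit-reset-requests')
  };
}
```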
Strategies
- Queue requests. Use a rate-limited queue to space out calls.
- Monitor headers. Track remaining quota and pause before hitting limits.
- Implement circuit breakers. Stop requests temporarily after repeated failures.
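A minimal circuit-breaker sketch for the last strategy (the threshold and cooldown values are illustrative assumptions):

```javascript
// Minimal circuit breaker: open after N consecutive failures,
// allow traffic again after a cooldown period.
class CircuitBreaker {
  constructor(threshold = 5, cooldownMs = 30_000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = null;
  }
  canRequest(now = Date.now()) {
    if (this.openedAt === null) return true;
    // Half-open: let a request through once the cooldown has elapsed.
    return now - this.openedAt >= this.cooldownMs;
  }
  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }
  recordFailure(now = Date.now()) {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = now;
  }
}
```

Call recordSuccess or recordFailure after each request, and check canRequest before sending the next one.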
Cost Optimization
AI API costs scale with token usage. Optimize to control spend.
- Set max_tokens. Limit response length to what you actually need.
- Use appropriate models. Smaller models (GPT-3.5, Claude Haiku) cost less for simple tasks.
- Cache responses. Store results for repeated identical queries.
- Trim context. Only include necessary conversation history in messages array.
- Monitor usage. Track tokens per request and set budget alerts.
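A minimal in-memory cache sketch for repeated identical queries (a Map keyed on the serialized request; a production cache would add TTLs and size bounds):

```javascript
// Cache completions for repeated identical queries.
const cache = new Map();

async function cachedCompletion(payload, callApi) {
  // Key on the full request so model or parameter changes miss the cache.
  const key = JSON.stringify(payload);
  if (cache.has(key)) return cache.get(key);
  const result = await callApi(payload);
  cache.set(key, result);
  return result;
}
```

Note that JSON.stringify key ordering must be stable for this to hit reliably; build payloads the same way every time.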
Token Counting
Estimate tokens before sending requests to predict costs.
// Rough estimate: ~4 characters per token for English text
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// For precise counting, use tiktoken or provider tokenizer libraries
import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4');
const tokens = enc.encode(text).length;
enc.free(); // release the encoder's native memory when done
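Token counts translate directly into spend. A sketch of turning them into a cost estimate (the per-1K-token rates below are placeholders, not real pricing; use your provider's current rate card):

```javascript
// Estimate request cost from token counts and per-1K-token rates.
// The default rates are placeholder values, not real pricing.
function estimateCost(promptTokens, completionTokens,
                      rates = { prompt: 0.03, completion: 0.06 }) {
  return (promptTokens / 1000) * rates.prompt +
         (completionTokens / 1000) * rates.completion;
}
```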
Production Checklist
- Secrets management. API keys loaded from environment, not hardcoded.
- Error handling. Retry logic with exponential backoff implemented.
- Rate limiting. Request queue or throttling in place.
- Timeouts. Set reasonable timeouts (30-60s for completions).
- Logging. Log request IDs, latency, token usage, and errors.
- Monitoring. Track success rate, latency percentiles, and cost.
- Fallbacks. Graceful degradation when the API is unavailable.
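For the timeout item above, a sketch using AbortController (the 60s default is an assumption; tune it to your workload):

```javascript
// Create an abort signal that fires after a deadline, for use with fetch.
function makeTimeoutSignal(ms = 60_000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  // Call cancel() once the request settles to avoid leaking the timer.
  return { signal: controller.signal, cancel: () => clearTimeout(timer) };
}

// Usage (endpoint and payload are hypothetical):
// const { signal, cancel } = makeTimeoutSignal();
// const res = await fetch(endpoint, { method: 'POST', body, signal });
// cancel();
```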