Integrating AI APIs into Your Application
A developer guide for integrating large language model (LLM) APIs into production applications. Covers authentication, request handling, streaming responses, error management, and cost optimization.
Overview
AI APIs provide access to language models via HTTP endpoints. This guide covers the patterns and practices needed to build reliable integrations that handle real-world conditions: latency, rate limits, errors, and cost.
Prerequisites
- API key from your AI provider
- HTTPS client library (requests, fetch, axios, etc.)
- Environment variable management for secrets
Authentication
Most AI APIs authenticate using Bearer tokens in the Authorization header.
Authorization: Bearer sk-your-api-key-here
Security Best Practices
- Never expose keys in client-side code. Make API calls from your backend.
- Use environment variables. Load keys from process.env or equivalent.
- Rotate keys regularly. Implement key rotation without downtime.
- Set usage limits. Configure spending caps in your provider dashboard.
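A minimal sketch of backend key loading following the practices above (the AI_API_KEY variable name is an assumption; match it to whatever your deployment defines):

```javascript
// Load the API key from the environment on the backend; never ship it to clients.
// AI_API_KEY is an assumed variable name, not a provider convention.
function loadApiKey(env = process.env) {
  const key = env.AI_API_KEY;
  if (!key) {
    // Fail fast at startup rather than on the first request.
    throw new Error('AI_API_KEY is not set; refusing to start');
  }
  return key;
}
```

Failing fast at startup surfaces a missing key immediately instead of as a 401 from the provider later.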
Basic Request
A typical completion request sends a prompt and receives generated text.
Request
POST /v1/chat/completions
Content-Type: application/json
Authorization: Bearer $API_KEY
{
"model": "gpt-4",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain DNS in one paragraph."}
],
"max_tokens": 150,
"temperature": 0.7
}
Response
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1706140800,
"model": "gpt-4",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "DNS (Domain Name System) translates human-readable domain names..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 87,
"total_tokens": 112
}
}
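In practice the fields you read most are the assistant message, the finish reason, and the usage block. A small helper to pull them out (the shape follows the sample response above; field names can vary by provider):

```javascript
// Extract the generated text, finish reason, and token usage
// from a chat completion response object.
function extractReply(completion) {
  const choice = completion.choices?.[0];
  return {
    content: choice?.message?.content ?? '',
    finishReason: choice?.finish_reason ?? 'unknown',
    totalTokens: completion.usage?.total_tokens ?? 0
  };
}
```

Checking finish_reason matters: a value of "length" means the response was cut off by max_tokens.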
Key Parameters
- model: The model identifier (e.g., gpt-4, claude-3-opus)
- messages: Conversation history as an array of role/content objects
- max_tokens: Maximum tokens to generate (controls response length and cost)
- temperature: Randomness (0 = deterministic, 1 = creative). Use lower values for factual tasks.
- top_p: Nucleus sampling threshold. Alternative to temperature.
- stop: Sequences where the model should stop generating
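The parameters above can be wrapped in a small builder that applies defaults in one place (the default values here are illustrative assumptions, not provider recommendations):

```javascript
// Build a chat completion request body, applying conservative defaults.
function buildRequest(messages, options = {}) {
  return {
    model: options.model ?? 'gpt-4',
    messages,
    max_tokens: options.maxTokens ?? 150,
    // Lower temperature for factual tasks; raise it for creative ones.
    temperature: options.temperature ?? 0.2,
    ...(options.stop ? { stop: options.stop } : {})
  };
}
```

Centralizing defaults makes it easy to audit and change them as pricing or quality requirements shift.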
Streaming Responses
For better user experience, stream responses token-by-token instead of waiting for the complete response.
Request
{
"model": "gpt-4",
"messages": [...],
"stream": true
}
Handling Server-Sent Events
const response = await fetch(endpoint, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apiKey}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ ...payload, stream: true })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Decode incrementally and buffer, since a network chunk can end mid-line.
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // keep any incomplete line for the next chunk
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice('data: '.length);
    if (data === '[DONE]') return;
    const parsed = JSON.parse(data);
    const token = parsed.choices[0]?.delta?.content || '';
    process.stdout.write(token);
  }
}
Error Handling
AI APIs return standard HTTP status codes. Handle these gracefully.
Common Errors
- 400 Bad Request: Invalid parameters. Check your request body.
- 401 Unauthorized: Invalid or missing API key.
- 429 Rate Limited: Too many requests. Implement exponential backoff.
- 500 Server Error: Provider issue. Retry with backoff.
- 503 Service Unavailable: Temporary overload. Retry later.
Retry Logic
async function callWithRetry(request, maxRetries = 3) {
  let lastError;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await fetch(request);
      // Retry on rate limits and server errors with exponential backoff.
      if (response.status === 429 || response.status >= 500) {
        lastError = new Error(`API error: ${response.status}`);
        const delay = Math.pow(2, attempt) * 1000;
        await new Promise(r => setTimeout(r, delay));
        continue;
      }
      // Client errors (4xx other than 429) won't succeed on retry.
      if (!response.ok) {
        throw new Error(`API error: ${response.status}`);
      }
      return await response.json();
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries - 1) break;
    }
  }
  // Surface the last failure instead of silently returning undefined.
  throw lastError;
}
Rate Limiting
Respect rate limits to avoid service disruption.
Response Headers
x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 45
x-ratelimit-reset-requests: 1s
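A sketch of reading this state after each response (the header names follow the example above; check your provider's documentation for the exact names):

```javascript
// Read rate-limit state from response headers.
function parseRateLimit(headers) {
  return {
    limit: Number(headers.get('x-ratelimit-limit-requests')),
    remaining: Number(headers.get('x-ratelimit-remaining-requests')),
    // Reset format is provider-specific; keep the raw string (e.g. "1s").
    reset: headers.get('x-ratelimit-reset-requests')
  };
}
```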
Strategies
- Queue requests. Use a rate-limited queue to space out calls.
- Monitor headers. Track remaining quota and pause before hitting limits.
- Implement circuit breakers. Stop requests temporarily after repeated failures.
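A minimal circuit-breaker sketch for the last strategy (the threshold and cooldown values are illustrative assumptions):

```javascript
// Minimal circuit breaker: open after N consecutive failures,
// allow traffic again after a cooldown period.
class CircuitBreaker {
  constructor(threshold = 5, cooldownMs = 30_000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = null;
  }
  canRequest(now = Date.now()) {
    if (this.openedAt === null) return true;
    // Half-open: let a request through once the cooldown has elapsed.
    return now - this.openedAt >= this.cooldownMs;
  }
  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }
  recordFailure(now = Date.now()) {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = now;
  }
}
```

Call recordSuccess or recordFailure after each request, and check canRequest before sending the next one.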
Cost Optimization
AI API costs scale with token usage. Optimize to control spend.
- Set max_tokens. Limit response length to what you actually need.
- Use appropriate models. Smaller models (GPT-3.5, Claude Haiku) cost less for simple tasks.
- Cache responses. Store results for repeated identical queries.
- Trim context. Only include necessary conversation history in messages array.
- Monitor usage. Track tokens per request and set budget alerts.
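A minimal in-memory cache sketch for repeated identical queries (a Map keyed on the serialized request; a production cache would add TTLs and size bounds):

```javascript
// Cache completions for repeated identical queries.
const cache = new Map();

async function cachedCompletion(payload, callApi) {
  // Key on the full request so model or parameter changes miss the cache.
  const key = JSON.stringify(payload);
  if (cache.has(key)) return cache.get(key);
  const result = await callApi(payload);
  cache.set(key, result);
  return result;
}
```

Note that JSON.stringify key ordering must be stable for this to hit reliably; build payloads the same way every time.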
Token Counting
Estimate tokens before sending requests to predict costs.
// Rough estimate: ~4 characters per token for English text
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// For precise counting, use tiktoken or provider tokenizer libraries
import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4');
const tokens = enc.encode(text).length;
enc.free(); // release the encoder's native memory when done
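Token counts translate directly into spend. A sketch of turning them into a cost estimate (the per-1K-token rates below are placeholders, not real pricing; use your provider's current rate card):

```javascript
// Estimate request cost from token counts and per-1K-token rates.
// The default rates are placeholder values, not real pricing.
function estimateCost(promptTokens, completionTokens,
                      rates = { prompt: 0.03, completion: 0.06 }) {
  return (promptTokens / 1000) * rates.prompt +
         (completionTokens / 1000) * rates.completion;
}
```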
Production Checklist
- Secrets management. API keys loaded from environment, not hardcoded.
- Error handling. Retry logic with exponential backoff implemented.
- Rate limiting. Request queue or throttling in place.
- Timeouts. Set reasonable timeouts (30-60s for completions).
- Logging. Log request IDs, latency, token usage, and errors.
- Monitoring. Track success rate, latency percentiles, and cost.
- Fallbacks. Graceful degradation when the API is unavailable.
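For the timeout item above, a sketch using AbortController (the 60s default is an assumption; tune it to your workload):

```javascript
// Create an abort signal that fires after a deadline, for use with fetch.
function makeTimeoutSignal(ms = 60_000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  // Call cancel() once the request settles to avoid leaking the timer.
  return { signal: controller.signal, cancel: () => clearTimeout(timer) };
}

// Usage (endpoint and payload are hypothetical):
// const { signal, cancel } = makeTimeoutSignal();
// const res = await fetch(endpoint, { method: 'POST', body, signal });
// cancel();
```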