Multimodal Vision Guide: AI Image Analysis & Text Extraction

Transform your applications with advanced visual AI capabilities. Our Multimodal Vision API enables you to analyze images, extract text, understand visual content, and generate detailed descriptions. Whether you're building OCR systems, content moderation tools, or accessibility features, our vision API provides powerful image understanding across multiple AI providers.
Multimodal Vision Overview
The Multimodal Vision API allows developers to send images alongside text prompts to AI models that support visual understanding. These models can analyze images, extract text (OCR), describe visual content, answer questions about images, and perform complex visual reasoning tasks.
OpenAI Compatible Framework
Our Vision API maintains full compatibility with OpenAI's vision format while extending support to multiple providers. Simply include images in your chat completion requests using either image URLs or base64-encoded data, and the model will process both text and visual information together.
How Multimodal Vision Works
Vision-enabled models process both text prompts and images simultaneously, allowing for sophisticated visual understanding tasks. You can include images in your conversation by adding them to message content using either of two methods:
- Image URLs: direct links to publicly accessible images
- Base64 encoding: image data embedded within the request
A few details to keep in mind:
- Images can be in various formats: JPEG, PNG, GIF, WebP
- Maximum image size varies by provider (typically 4-20 MB)
- Some models support multiple images in a single request
- All vision capabilities use the same chat completions endpoint
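For example, here is a minimal Python sketch of both attachment styles (the file path and URLs are placeholders):
import base64

# Two ways to attach the same image to a user message: a public URL,
# or base64 data embedded as a data URL.
url_part = {
    "type": "image_url",
    "image_url": {"url": "https://example.com/sample-image.jpg"},
}

with open("local-image.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")
b64_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
}

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        url_part,  # or b64_part -- both use the same structure
    ],
}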
Available Vision-Enabled Models
Our API supports vision capabilities across multiple providers:
- OpenAI - GPT-4o, GPT-4 Vision
- Anthropic - Claude 3 series with vision
- Google - Gemini Pro Vision, Gemini Ultra
- Meta - Llama Vision models
Visit our Dashboard to explore all multimodal-capable models and their specific vision features.
Choosing the Right Vision Model
Selecting the optimal vision model depends on your specific use case, performance requirements, and budget considerations:
Performance vs Cost Tiers
- Premium Models: GPT-4o, Claude 3.5 Sonnet, Gemini Pro Vision
  - Best for complex visual reasoning and detailed analysis
  - Higher accuracy at a higher cost
  - Ideal for professional applications requiring high precision
- Balanced Models: GPT-4o-mini, Gemini Flash, Claude Haiku
  - Good performance at moderate cost
  - Suitable for most production applications
  - Excellent for general-purpose vision tasks
- Budget-Friendly Models: smaller vision models and open-source alternatives
  - Cost-effective for high-volume processing
  - Basic vision capabilities
  - Good for simple OCR and image description tasks
Model Strengths by Use Case
- Text Extraction (OCR): GPT-4o, Claude Sonnet models excel at extracting and formatting text from complex documents
- Detailed Image Analysis: Gemini Pro Vision and GPT-4o provide comprehensive scene understanding
- Technical Diagrams: Claude models perform well with charts, graphs, and technical drawings
- Multiple Images: Some models support comparing multiple images in a single request
- Speed-Critical Applications: Gemini Flash and GPT-4o-mini offer faster response times
Evaluation Tips
- Test with your data: Use the Dashboard to test different models with your specific image types, or script a comparison via the API as sketched after this list
- Consider context length: Some models handle longer conversations with images better
- Check language support: Ensure the model supports your required languages for OCR tasks
- Monitor costs: Use smaller models for development and scale up for production
- Leverage routing: Use our intelligent routing to automatically select optimal models
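Because every model shares the same chat completions endpoint, a side-by-side comparison is easy to script. A minimal sketch, assuming the requests library and placeholder model IDs (check your Dashboard for the exact names):
import requests

API_URL = "https://apipie.ai/v1/chat/completions"
API_KEY = "<YOUR_API_KEY>"
MODELS = ["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet"]  # adjust to your Dashboard list

def ask_model(model, image_url, prompt):
    """Send one vision request and return the model's reply text."""
    payload = {
        "model": model,
        "max_tokens": 300,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }
    resp = requests.post(API_URL, headers={"Authorization": f"Bearer {API_KEY}"},
                         json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

for model in MODELS:
    print(f"--- {model} ---")
    print(ask_model(model, "https://example.com/test-image.jpg",
                    "Extract all text from this image."))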
Example Vision API Call with Image URL
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gpt-4o",
  "max_tokens": 300,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What do you see in this image? Describe it in detail."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/sample-image.jpg"
          }
        }
      ]
    }
  ]
}'
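Because the endpoint follows OpenAI's format, the official OpenAI Python SDK should also work when pointed at our base URL. A sketch, assuming https://apipie.ai/v1 as the SDK base_url (confirm the exact value in your account documentation):
from openai import OpenAI

# Point the OpenAI SDK at the OpenAI-compatible endpoint.
client = OpenAI(base_url="https://apipie.ai/v1", api_key="<YOUR_API_KEY>")

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What do you see in this image? Describe it in detail."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample-image.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)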
Example Vision API Call with Base64 Image
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gpt-4o",
  "max_tokens": 300,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Extract all text from this document image."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAYEBQYFBAYGBQYHBwYIChAKCgkJChQODwwQFxQYGBcUFhYaHSUfGhsjHBYWICwgIyYnKSopGR8tMC0oMCUoKSj/2wBDAQcHBwoIChMKChMoGhYaKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCj/wAARCAABAAEDASIAAhEBAxEB/8QAFQABAQAAAAAAAAAAAAAAAAAAAAv/xAAUEAEAAAAAAAAAAAAAAAAAAAAA/8QAFQEBAQAAAAAAAAAAAAAAAAAAAAX/xAAUEQEAAAAAAAAAAAAAAAAAAAAA/9oADAMBAAIRAxEAPwCdABmX/9k="
          }
        }
      ]
    }
  ]
}'
Response Example
The response structure is identical to regular chat completions but includes visual analysis:
{
  "id": "chatcmpl-vision-5fde5f7fffe8d6dc1f18aab4a138d4b7",
  "object": "chat.completion",
  "created": 1729535643,
  "provider": "openai",
  "model": "gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I can see a document containing several paragraphs of text. The document appears to be a business report with the following visible text:\n\n'QUARTERLY SALES REPORT\nQ3 2024 Performance Summary\n\nSales increased by 15% compared to Q2 2024...'\n\nThe document includes charts showing monthly trends and appears to be professionally formatted with headers and structured content."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1150,
    "completion_tokens": 85,
    "total_tokens": 1235,
    "prompt_characters": 45,
    "response_characters": 312,
    "cost": 0.01435,
    "latency_ms": 3420
  }
}
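In Python, the reply text and usage metrics can be pulled from the parsed response like this (a minimal sketch using the requests library):
import requests

payload = {  # any of the vision request bodies in this guide works here
    "model": "gpt-4o",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample-image.jpg"}},
        ],
    }],
}
data = requests.post(
    "https://apipie.ai/v1/chat/completions",
    headers={"Authorization": "Bearer <YOUR_API_KEY>"},
    json=payload,
    timeout=120,
).json()

print(data["choices"][0]["message"]["content"])  # the visual analysis
usage = data["usage"]
print(f"tokens={usage['total_tokens']} cost=${usage['cost']} latency={usage['latency_ms']}ms")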
Vision-Specific Parameters
Image Content Structure
When including images in your messages, use this content structure:
{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Your text prompt here"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image.jpg"
      }
    }
  ]
}
Multiple Images
Some models support analyzing multiple images in a single request:
{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Compare these two images and describe the differences."
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image1.jpg"
      }
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image2.jpg"
      }
    }
  ]
}
Common Vision Use Cases
Text Extraction (OCR)
Extract text from documents, signs, screenshots, or any image containing text. This example sets the OpenAI-style "detail": "high" option, which requests higher-resolution processing and helps with small text:
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Extract all text from this image and format it as clean, readable text."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/document.jpg",
            "detail": "high"
          }
        }
      ]
    }
  ]
}'
Image Description and Analysis
Generate detailed descriptions of images for accessibility or content understanding:
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "claude-3-5-sonnet",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Provide a detailed description of this image for visually impaired users. Include colors, objects, people, activities, and spatial relationships."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/photo.jpg"
          }
        }
      ]
    }
  ]
}'
Visual Question Answering
Ask specific questions about image content:
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gemini-pro-vision",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "How many people are in this image? What are they wearing? What is the setting?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/group-photo.jpg"
          }
        }
      ]
    }
  ]
}'
Document Analysis
Analyze charts, graphs, tables, and structured documents:
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Analyze this chart and provide a summary of the key trends and data points."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg=="
          }
        }
      ]
    }
  ]
}'
Advanced Vision Features
Memory with Images
Use our IMM (Integrated Model Memory) to maintain context across vision conversations:
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gpt-4o",
  "memory": 1,
  "mem_session": "vision_session_123",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Remember this product image for later reference."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/product.jpg"
          }
        }
      ]
    }
  ]
}'
Vision with RAG Integration
Combine visual analysis with knowledge retrieval using RAG:
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gpt-4o",
  "rag_tune": "product_catalog",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Identify this product and find similar items in our catalog."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/unknown-product.jpg"
          }
        }
      ]
    }
  ]
}'
Streaming Vision Responses
Enable streaming for real-time vision analysis:
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gpt-4o",
  "stream": true,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Provide a detailed analysis of this image."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/complex-image.jpg"
          }
        }
      ]
    }
  ]
}'
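From Python, the stream can be consumed incrementally; a sketch using the OpenAI SDK's streaming iterator (again assuming SDK compatibility via base_url):
from openai import OpenAI

client = OpenAI(base_url="https://apipie.ai/v1", api_key="<YOUR_API_KEY>")

# stream=True yields chunks as they are generated; print text as it arrives.
stream = client.chat.completions.create(
    model="gpt-4o",
    stream=True,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Provide a detailed analysis of this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/complex-image.jpg"}},
        ],
    }],
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()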
Image Format Support
Supported Formats (across most models)
- JPEG: Most common format, good compression
- PNG: Supports transparency, lossless compression
- GIF: Animated images (first frame analyzed)
- WebP: Modern format with excellent compression
Size Limitations
- Maximum file size varies by provider (4MB - 20MB)
- Recommended resolution: 2048x2048 pixels or smaller
- Higher resolution images may be automatically resized
Base64 Encoding
For base64 images, use the data URL format:
data:image/jpeg;base64,<base64-encoded-data>
Example Python code to encode an image:
import base64

def encode_image(image_path):
    """Read an image file and return its base64-encoded contents."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Usage
base64_image = encode_image("path/to/your/image.jpg")
data_url = f"data:image/jpeg;base64,{base64_image}"
Usage Metrics and Costs
Vision requests typically use more tokens than text-only requests due to image processing:
Token Usage
- Images are converted to tokens for processing
- Token count depends on image dimensions and, for some providers, the requested detail level; a rough estimator is sketched below
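As an illustration, OpenAI documents a tile-based calculation for its GPT-4-class vision models; other providers count differently, so treat this sketch as an estimate only:
import math

def estimate_gpt4_vision_tokens(width, height, detail="high"):
    """Estimate image tokens using OpenAI's published tile rules:
    low detail is a flat cost; high detail fits the image within
    2048x2048, shrinks the short side to 768, then charges per 512px tile."""
    if detail == "low":
        return 85
    scale = min(1.0, 2048 / max(width, height))   # fit within 2048 x 2048
    width, height = width * scale, height * scale
    scale = min(1.0, 768 / min(width, height))    # short side down to 768
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(estimate_gpt4_vision_tokens(1024, 1024))  # 765 tokens at high detail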
Cost Optimization
- Resize images to optimal dimensions before uploading (a helper is sketched after this list)
- Consider image compression to reduce file size
- Use caching for repeated image analysis
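A small Pillow-based helper for the resize tip, assuming the Pillow package is installed; the 2048px cap matches the recommended resolution above:
import base64
import io

from PIL import Image

def encode_resized(image_path, max_side=2048):
    """Shrink an image so its longest side is at most max_side and
    return it as a JPEG data URL ready for the API."""
    img = Image.open(image_path)
    img.thumbnail((max_side, max_side))  # preserves aspect ratio; only shrinks
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=85)
    return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode("utf-8")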
Example response with vision usage metrics:
{
  "usage": {
    "prompt_tokens": 1150,
    "completion_tokens": 85,
    "total_tokens": 1235,
    "prompt_characters": 45,
    "response_characters": 312,
    "cost": 0.01435,
    "latency_ms": 3420
  }
}
Best Practices
Image Quality
- Use high-resolution images for better text recognition
- Ensure good lighting and contrast in photos
- Avoid blurry or distorted images
- For documents, scan rather than photograph when possible
Prompt Engineering
- Be specific about what you want to extract or analyze
- Use clear, descriptive prompts
- Ask for structured output when needed (JSON, tables, lists)
- Provide context about the image type (document, photo, chart, etc.)
Error Handling
Common vision-specific errors:
- Unsupported image format: Check the image format and convert to JPEG, PNG, GIF, or WebP if needed
- Image too large: Reduce the image dimensions or increase compression
- Invalid image URL: Verify the URL is publicly accessible and well-formed
- Model doesn't support vision: Switch to a vision-capable model listed in the Dashboard
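A defensive request wrapper might look like the sketch below; exact error payloads vary by provider, so the substring checks are illustrative rather than exact strings:
import requests

def vision_request(payload):
    """POST a vision request and map common failures to actionable errors."""
    resp = requests.post(
        "https://apipie.ai/v1/chat/completions",
        headers={"Authorization": "Bearer <YOUR_API_KEY>"},
        json=payload,
        timeout=120,
    )
    if resp.status_code >= 400:
        detail = resp.text.lower()
        if resp.status_code == 413 or "too large" in detail:
            raise ValueError("Image too large: resize or compress before retrying")
        if "format" in detail:
            raise ValueError("Unsupported image format: convert to JPEG, PNG, GIF, or WebP")
        resp.raise_for_status()  # anything else: surface the HTTP error
    return resp.json()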
Security Considerations
- Validate image URLs before processing
- Sanitize extracted text for security
- Be aware of privacy implications when processing images
- Use HTTPS URLs for image references
- Consider data retention policies for processed images
Getting Started
- Choose a vision-enabled model from our Dashboard
- Prepare your images in supported formats (JPEG, PNG, GIF, WebP)
- Structure your API request with both text and image content
- Test with simple use cases like basic image description
- Optimize for your specific needs using appropriate prompts
Our Multimodal Vision API opens up powerful possibilities for AI image to text conversion, visual understanding, and document analysis. Start building your vision-powered applications today!