Multimodal Vision Guide: AI Image Analysis & Text Extraction

Transform your applications with advanced visual AI capabilities. Our Multimodal Vision API enables you to analyze images, extract text, understand visual content, and generate detailed descriptions. Whether you're building OCR systems, content moderation tools, or accessibility features, our vision API provides powerful image understanding across multiple AI providers.
Multimodal Vision Overview
The Multimodal Vision API allows developers to send images alongside text prompts to AI models that support visual understanding. These models can analyze images, extract text (OCR), describe visual content, answer questions about images, and perform complex visual reasoning tasks.
OpenAI Compatible Framework
Our Vision API maintains full compatibility with OpenAI's vision format while extending support to multiple providers. Simply include images in your chat completion requests using either image URLs or base64-encoded data, and the model will process both text and visual information together.
How Multimodal Vision Works
Vision-enabled models process both text prompts and images simultaneously, allowing for sophisticated visual understanding tasks. You can include images in your conversation by adding them to message content using either of two methods:
- Image URLs: direct links to publicly accessible images
- Base64 encoding: image data embedded within the request
A few details to keep in mind:
- Images can be in various formats: JPEG, PNG, GIF, WebP
- Maximum image size varies by provider (typically 4-20 MB)
- Some models support multiple images in a single request
- All vision capabilities use the same chat completions endpoint
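For example, here is a minimal Python sketch of both attachment styles (the file path and URLs are placeholders):
import base64

# Two ways to attach the same image to a user message: a public URL,
# or base64 data embedded as a data URL.
url_part = {
    "type": "image_url",
    "image_url": {"url": "https://example.com/sample-image.jpg"},
}

with open("local-image.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")
b64_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
}

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        url_part,  # or b64_part -- both use the same structure
    ],
}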
Available Vision-Enabled Models
Our API supports vision capabilities across multiple providers:
- OpenAI - GPT-4o, GPT-4 Vision
- Anthropic - Claude 3 series with vision
- Google - Gemini Pro Vision, Gemini Ultra
- Meta - Llama Vision models
Visit our Dashboard to explore all multimodal-capable models and their specific vision features.
Choosing the Right Vision Model
Selecting the optimal vision model depends on your specific use case, performance requirements, and budget considerations:
Performance vs Cost Tiers
- Premium Models: GPT-4o, Claude 3.5 Sonnet, Gemini Pro Vision
  - Best for complex visual reasoning and detailed analysis
  - Higher accuracy at a higher cost
  - Ideal for professional applications requiring high precision
- Balanced Models: GPT-4o-mini, Gemini Flash, Claude Haiku
  - Good performance at moderate cost
  - Suitable for most production applications
  - Excellent for general-purpose vision tasks
- Budget-Friendly Models: smaller vision models and open-source alternatives
  - Cost-effective for high-volume processing
  - Basic vision capabilities
  - Good for simple OCR and image description tasks
Model Strengths by Use Case
- Text Extraction (OCR): GPT-4o, Claude Sonnet models excel at extracting and formatting text from complex documents
- Detailed Image Analysis: Gemini Pro Vision and GPT-4o provide comprehensive scene understanding
- Technical Diagrams: Claude models perform well with charts, graphs, and technical drawings
- Multiple Images: Some models support comparing multiple images in a single request
- Speed-Critical Applications: Gemini Flash and GPT-4o-mini offer faster response times
Evaluation Tips
- Test with your data: Use the Dashboard to test different models with your specific image types, or script a comparison via the API as sketched after this list
- Consider context length: Some models handle longer conversations with images better
- Check language support: Ensure the model supports your required languages for OCR tasks
- Monitor costs: Use smaller models for development and scale up for production
- Leverage routing: Use our intelligent routing to automatically select optimal models
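Because every model shares the same chat completions endpoint, a side-by-side comparison is easy to script. A minimal sketch, assuming the requests library and placeholder model IDs (check your Dashboard for the exact names):
import requests

API_URL = "https://apipie.ai/v1/chat/completions"
API_KEY = "<YOUR_API_KEY>"
MODELS = ["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet"]  # adjust to your Dashboard list

def ask_model(model, image_url, prompt):
    """Send one vision request and return the model's reply text."""
    payload = {
        "model": model,
        "max_tokens": 300,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }
    resp = requests.post(API_URL, headers={"Authorization": f"Bearer {API_KEY}"},
                         json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

for model in MODELS:
    print(f"--- {model} ---")
    print(ask_model(model, "https://example.com/test-image.jpg",
                    "Extract all text from this image."))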
Example Vision API Call with Image URL
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gpt-4o",
  "max_tokens": 300,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What do you see in this image? Describe it in detail."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/sample-image.jpg"
          }
        }
      ]
    }
  ]
}'
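Because the endpoint follows OpenAI's format, the official OpenAI Python SDK should also work when pointed at our base URL. A sketch, assuming https://apipie.ai/v1 as the SDK base_url (confirm the exact value in your account documentation):
from openai import OpenAI

# Point the OpenAI SDK at the OpenAI-compatible endpoint.
client = OpenAI(base_url="https://apipie.ai/v1", api_key="<YOUR_API_KEY>")

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What do you see in this image? Describe it in detail."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample-image.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)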
Example Vision API Call with Base64 Image
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gpt-4o",
  "max_tokens": 300,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Extract all text from this document image."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAYEBQYFBAYGBQYHBwYIChAKCgkJChQODwwQFxQYGBcUFhYaHSUfGhsjHBYWICwgIyYnKSopGR8tMC0oMCUoKSj/2wBDAQcHBwoIChMKChMoGhYaKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCj/wAARCAABAAEDASIAAhEBAxEB/8QAFQABAQAAAAAAAAAAAAAAAAAAAAv/xAAUEAEAAAAAAAAAAAAAAAAAAAAA/8QAFQEBAQAAAAAAAAAAAAAAAAAAAAX/xAAUEQEAAAAAAAAAAAAAAAAAAAAA/9oADAMBAAIRAxEAPwCdABmX/9k="
          }
        }
      ]
    }
  ]
}'
Response Example
The response structure is identical to regular chat completions but includes visual analysis:
{
  "id": "chatcmpl-vision-5fde5f7fffe8d6dc1f18aab4a138d4b7",
  "object": "chat.completion",
  "created": 1729535643,
  "provider": "openai",
  "model": "gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I can see a document containing several paragraphs of text. The document appears to be a business report with the following visible text:\n\n'QUARTERLY SALES REPORT\nQ3 2024 Performance Summary\n\nSales increased by 15% compared to Q2 2024...'\n\nThe document includes charts showing monthly trends and appears to be professionally formatted with headers and structured content."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1150,
    "completion_tokens": 85,
    "total_tokens": 1235,
    "prompt_characters": 45,
    "response_characters": 312,
    "cost": 0.01435,
    "latency_ms": 3420
  }
}
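In Python, the reply text and usage metrics can be pulled from the parsed response like this (a minimal sketch using the requests library):
import requests

payload = {  # any of the vision request bodies in this guide works here
    "model": "gpt-4o",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample-image.jpg"}},
        ],
    }],
}
data = requests.post(
    "https://apipie.ai/v1/chat/completions",
    headers={"Authorization": "Bearer <YOUR_API_KEY>"},
    json=payload,
    timeout=120,
).json()

print(data["choices"][0]["message"]["content"])  # the visual analysis
usage = data["usage"]
print(f"tokens={usage['total_tokens']} cost=${usage['cost']} latency={usage['latency_ms']}ms")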
Vision-Specific Parameters
Image Content Structure
When including images in your messages, use this content structure:
{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Your text prompt here"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image.jpg"
      }
    }
  ]
}
Multiple Images
Some models support analyzing multiple images in a single request:
{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Compare these two images and describe the differences."
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image1.jpg"
      }
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image2.jpg"
      }
    }
  ]
}
Common Vision Use Cases
Text Extraction (OCR)
Extract text from documents, signs, screenshots, or any image containing text. This example sets the OpenAI-style "detail": "high" option, which requests higher-resolution processing and helps with small text:
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Extract all text from this image and format it as clean, readable text."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/document.jpg",
            "detail": "high"
          }
        }
      ]
    }
  ]
}'
Image Description and Analysis
Generate detailed descriptions of images for accessibility or content understanding:
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "claude-3-5-sonnet",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Provide a detailed description of this image for visually impaired users. Include colors, objects, people, activities, and spatial relationships."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/photo.jpg"
          }
        }
      ]
    }
  ]
}'
Visual Question Answering
Ask specific questions about image content:
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gemini-pro-vision",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "How many people are in this image? What are they wearing? What is the setting?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/group-photo.jpg"
          }
        }
      ]
    }
  ]
}'
Document Analysis
Analyze charts, graphs, tables, and structured documents:
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Analyze this chart and provide a summary of the key trends and data points."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg=="
          }
        }
      ]
    }
  ]
}'
Advanced Vision Features
Memory with Images
Use our IMM (Integrated Model Memory) to maintain context across vision conversations:
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gpt-4o",
  "memory": 1,
  "mem_session": "vision_session_123",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Remember this product image for later reference."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/product.jpg"
          }
        }
      ]
    }
  ]
}'
Vision with RAG Integration
Combine visual analysis with knowledge retrieval using RAG:
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gpt-4o",
  "rag_tune": "product_catalog",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Identify this product and find similar items in our catalog."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/unknown-product.jpg"
          }
        }
      ]
    }
  ]
}'
Streaming Vision Responses
Enable streaming for real-time vision analysis:
curl -L -X POST 'https://apipie.ai/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_API_KEY>' \
--data-raw '{
  "model": "gpt-4o",
  "stream": true,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Provide a detailed analysis of this image."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/complex-image.jpg"
          }
        }
      ]
    }
  ]
}'
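From Python, the stream can be consumed incrementally; a sketch using the OpenAI SDK's streaming iterator (again assuming SDK compatibility via base_url):
from openai import OpenAI

client = OpenAI(base_url="https://apipie.ai/v1", api_key="<YOUR_API_KEY>")

# stream=True yields chunks as they are generated; print text as it arrives.
stream = client.chat.completions.create(
    model="gpt-4o",
    stream=True,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Provide a detailed analysis of this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/complex-image.jpg"}},
        ],
    }],
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()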
Image Format Support
Supported Formats (across most models)
- JPEG: Most common format, good compression
- PNG: Supports transparency, lossless compression
- GIF: Animated images (first frame analyzed)
- WebP: Modern format with excellent compression
Size Limitations
- Maximum file size varies by provider (4MB - 20MB)
- Recommended resolution: 2048x2048 pixels or smaller
- Higher resolution images may be automatically resized
Base64 Encoding
For base64 images, use the data URL format:
data:image/jpeg;base64,<base64-encoded-data>
Example Python code to encode an image:
import base64

def encode_image(image_path):
    """Read an image file and return its base64-encoded contents."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Usage
base64_image = encode_image("path/to/your/image.jpg")
data_url = f"data:image/jpeg;base64,{base64_image}"
Usage Metrics and Costs
Vision requests typically use more tokens than text-only requests due to image processing:
Token Usage
- Images are converted to tokens for processing
- Token count depends on image dimensions and, for some providers, the requested detail level; a rough estimator is sketched below
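As an illustration, OpenAI documents a tile-based calculation for its GPT-4-class vision models; other providers count differently, so treat this sketch as an estimate only:
import math

def estimate_gpt4_vision_tokens(width, height, detail="high"):
    """Estimate image tokens using OpenAI's published tile rules:
    low detail is a flat cost; high detail fits the image within
    2048x2048, shrinks the short side to 768, then charges per 512px tile."""
    if detail == "low":
        return 85
    scale = min(1.0, 2048 / max(width, height))   # fit within 2048 x 2048
    width, height = width * scale, height * scale
    scale = min(1.0, 768 / min(width, height))    # short side down to 768
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(estimate_gpt4_vision_tokens(1024, 1024))  # 765 tokens at high detail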
Cost Optimization
- Resize images to optimal dimensions before uploading (a helper is sketched after this list)
- Consider image compression to reduce file size
- Use caching for repeated image analysis
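A small Pillow-based helper for the resize tip, assuming the Pillow package is installed; the 2048px cap matches the recommended resolution above:
import base64
import io

from PIL import Image

def encode_resized(image_path, max_side=2048):
    """Shrink an image so its longest side is at most max_side and
    return it as a JPEG data URL ready for the API."""
    img = Image.open(image_path)
    img.thumbnail((max_side, max_side))  # preserves aspect ratio; only shrinks
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=85)
    return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode("utf-8")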
Example response with vision usage metrics:
{
  "usage": {
    "prompt_tokens": 1150,
    "completion_tokens": 85,
    "total_tokens": 1235,
    "prompt_characters": 45,
    "response_characters": 312,
    "cost": 0.01435,
    "latency_ms": 3420
  }
}
Best Practices
Image Quality
- Use high-resolution images for better text recognition
- Ensure good lighting and contrast in photos
- Avoid blurry or distorted images
- For documents, scan rather than photograph when possible
Prompt Engineering
- Be specific about what you want to extract or analyze
- Use clear, descriptive prompts
- Ask for structured output when needed (JSON, tables, lists)
- Provide context about the image type (document, photo, chart, etc.)
Error Handling
Common vision-specific errors:
- Unsupported image format: Check the image format and convert to JPEG, PNG, GIF, or WebP if needed
- Image too large: Reduce the image dimensions or increase compression
- Invalid image URL: Verify the URL is publicly accessible and well-formed
- Model doesn't support vision: Switch to a vision-capable model listed in the Dashboard
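A defensive request wrapper might look like the sketch below; exact error payloads vary by provider, so the substring checks are illustrative rather than exact strings:
import requests

def vision_request(payload):
    """POST a vision request and map common failures to actionable errors."""
    resp = requests.post(
        "https://apipie.ai/v1/chat/completions",
        headers={"Authorization": "Bearer <YOUR_API_KEY>"},
        json=payload,
        timeout=120,
    )
    if resp.status_code >= 400:
        detail = resp.text.lower()
        if resp.status_code == 413 or "too large" in detail:
            raise ValueError("Image too large: resize or compress before retrying")
        if "format" in detail:
            raise ValueError("Unsupported image format: convert to JPEG, PNG, GIF, or WebP")
        resp.raise_for_status()  # anything else: surface the HTTP error
    return resp.json()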
Security Considerations
- Validate image URLs before processing
- Sanitize extracted text for security
- Be aware of privacy implications when processing images
- Use HTTPS URLs for image references
- Consider data retention policies for processed images
Getting Started
- Choose a vision-enabled model from our Dashboard
- Prepare your images in supported formats (JPEG, PNG, GIF, WebP)
- Structure your API request with both text and image content
- Test with simple use cases like basic image description
- Optimize for your specific needs using appropriate prompts
Our Multimodal Vision API opens up powerful possibilities for AI image to text conversion, visual understanding, and document analysis. Start building your vision-powered applications today!