Vision in Text Models

This article covers vision, a capability of some text models that lets them accept images and video as input and respond in text. With vision support, a model can interpret visual content and return structured or natural-language responses based on what it sees.

Common use cases include describing images, analyzing screenshots, extracting text, understanding charts and documents, identifying objects, summarizing scenes, and processing video frames or clips.

The sections below explain how to work with image and video inputs, along with request examples and supported models.

🏝️ Image analysis

Supported Model List

Request example
import json  # used to pretty-print the response with indentation

import requests

response = requests.post(
    url="https://api.aimlapi.com/v1/chat/completions",
    headers={
        # Replace <YOUR_AIMLAPI_KEY> with your AIML API key:
        "Authorization": "Bearer <YOUR_AIMLAPI_KEY>",
        "Content-Type": "application/json",
    },
    json={
        "model": "alibaba/qwen3.5-omni-flash",
        "messages": [
            {
                "role": "user",
                # A multimodal message: a text prompt plus the image to analyze
                "content": [
                    {
                        "type": "text",
                        "text": "Describe the content of this image.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://raw.githubusercontent.com/aimlapi/api-docs/main/reference-files/handwriting.jpg"
                        },
                    },
                ],
            }
        ],
    },
    timeout=60,  # avoid hanging indefinitely if the server does not answer
)
response.raise_for_status()  # stop early on HTTP errors such as an invalid key

data = response.json()
print(json.dumps(data, indent=2, ensure_ascii=False))
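The request above points `image_url` at a publicly reachable file. For an image stored locally, a common approach with OpenAI-style chat endpoints is to embed the file as a base64 data URL in the same `url` field. The sketch below assumes this endpoint accepts data URLs, which the example above does not confirm; `image_bytes_to_data_url` and `build_image_part` are hypothetical helper names used only for illustration.

```python
import base64
import mimetypes


def image_bytes_to_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URL (RFC 2397).

    The MIME type is guessed from the file extension only; the bytes
    themselves are not inspected.
    """
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_part(url: str) -> dict:
    """Build an 'image_url' content part shaped like the one in the request example."""
    return {"type": "image_url", "image_url": {"url": url}}


# Placeholder bytes stand in for a real file here; in practice you would read
# them with something like pathlib.Path("photo.jpg").read_bytes().
part = build_image_part(image_bytes_to_data_url(b"\xff\xd8\xff", "photo.jpg"))
print(part["image_url"]["url"])
```

The resulting dictionary would replace the `image_url` entry in the `content` list of the request. Base64 inflates the payload by roughly a third, so a hosted URL is usually the better choice for large images.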
Response
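To use the model's answer in code rather than print the whole payload, you can pull the text out of the parsed JSON. This is a minimal sketch that assumes the response follows the OpenAI-style chat completions layout (`choices[0].message.content`), an assumption based on the `/v1/chat/completions` path and not on output shown here. `extract_reply` is a hypothetical helper, and `sample` is an invented stand-in, not real API output.

```python
from typing import Any


def extract_reply(data: dict[str, Any]) -> str:
    """Return the assistant's text from a chat-completions style response.

    ASSUMPTION: the text lives at data["choices"][0]["message"]["content"].
    Raises ValueError (including the raw payload) if that layout is absent.
    """
    try:
        content = data["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f"Unexpected response shape: {data!r}") from exc
    # Some APIs return content as a list of typed parts; join their text fields.
    if isinstance(content, list):
        return "".join(
            part.get("text", "") for part in content if isinstance(part, dict)
        )
    return content or ""


# Invented response used only to exercise the helper.
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "A handwritten note."}}
    ]
}
print(extract_reply(sample))
```

Raising on an unexpected shape, with the raw payload in the message, makes error responses easier to diagnose than a bare `KeyError`.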

🎦 Video analysis

Supported Model List
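A video request can plausibly be built the same way as the image request, with the video passed as one more content part next to the text prompt. The sketch below only constructs the JSON body. It assumes a `video_url` part type modeled on `image_url`, reuses the model name from the image example without confirming that it accepts video through this endpoint, and uses a placeholder clip URL. Check all three against the supported model list before relying on them. `build_video_request` is a hypothetical helper name.

```python
import json


def build_video_request(
    prompt: str,
    video_url: str,
    model: str = "alibaba/qwen3.5-omni-flash",  # ASSUMPTION: taken from the image example
) -> dict:
    """Build a chat-completions request body with a text prompt and one video."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    # ASSUMPTION: the "video_url" part type mirrors "image_url".
                    {"type": "video_url", "video_url": {"url": video_url}},
                ],
            }
        ],
    }


# Placeholder URL for illustration; replace it with a reachable video file.
payload = build_video_request(
    "Summarize what happens in this video.",
    "https://example.com/clip.mp4",
)
print(json.dumps(payload, indent=2))
```

The body would be sent with `requests.post(..., json=payload)` to the same endpoint and with the same headers as an image request.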
Response
