Read Text Aloud and Describe Images: Support People with Visual Impairments


Idea and Step-by-Step Plan

  1. Upload the PDF to extract all the text. Provide a PDF file with text and illustrations to be processed by a text model and converted into an audiobook. The model reads the PDF, extracts all textual content page by page, and describes each illustration it encounters.

  2. Send the text to a TTS model to create an audio version. The extracted text is sent to a TTS (Text-to-Speech) model via a second API call. The model streams the generated audio, and the script saves the audio file locally.

As a result, you will receive an audio version of the original PDF text, saved as a .wav file.
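
In code, this plan boils down to just two API calls. Here is a compressed preview of the approach (the prompt is abbreviated and error handling omitted; the complete, runnable versions appear in the Full Code Example below):

import base64
import requests
from openai import OpenAI

client = OpenAI(base_url="https://api.aimlapi.com", api_key="<YOUR_AIMLAPI_KEY>")

# Step 1: send the PDF (as a base64 data URL) to a chat model and get back
# the full text with inline image descriptions.
with open("What Are Raccoons.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode("utf-8")

text = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "file", "file": {
                "filename": "What Are Raccoons.pdf",
                "file_data": f"data:application/pdf;base64,{pdf_b64}",
            }},
            {"type": "text", "text": "Extract all the text; describe each illustration in place."},
        ],
    }],
).choices[0].message.content

# Step 2: send the extracted text to a TTS model and save the audio to disk.
audio = requests.post(
    "https://api.aimlapi.com/v1/tts",
    headers={"Authorization": "Bearer <YOUR_AIMLAPI_KEY>"},
    json={"model": "#g1_aura-zeus-en", "text": text},
)
with open("What Are Raccoons.pdf.wav", "wb") as f:
    f.write(audio.content)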


Full Walkthrough

  1. Upload the PDF to extract all the text

As a text example, we'll use the following one, which you might already recognize from our use case about illustration animation. You can download the original PDF file here.

PDF Content Preview

What Are Raccoons?

Raccoons are small, furry animals with fluffy striped tails and black “masks” around their eyes. They live in forests, near rivers and lakes—and sometimes even close to people in towns and cities. Raccoons are very clever, curious, and quick with their paws.

One of the raccoon's most famous habits is "washing" its food. But raccoons aren’t really cleaning their meals. They just love to roll and rub things between their paws, especially near water. Scientists believe this helps them understand what they’re holding.

Raccoons eat almost anything: berries, fruits, nuts, insects, fish, and even bird eggs. They're nocturnal, which means they go out at night to look for food and sleep during the day in cozy tree hollows.

Raccoons are very social. Young raccoons love to play—tumbling in the grass, hiding behind trees, and exploring everything around them. And sometimes, if they feel safe, raccoons might even come closer to where people are—especially if there's a snack nearby!

Even though they can be a little mischievous, raccoons play an important role in nature. They help spread seeds and keep insect populations in check.

So next time you see a raccoon, remember: it’s not just a fluffy animal—it’s a real forest explorer!


We use the gpt-4o model to extract text from the document, sending the PDF as base64. Here's the code:

Code Example
import base64
from openai import OpenAI


aimlapi_key = "<YOUR_AIMLAPI_KEY>"

client = OpenAI(
    base_url="https://api.aimlapi.com",
    api_key=aimlapi_key,
)

# Put your filename here. The file must be in the same folder as your Python script.
your_file_name = "What Are Raccoons.pdf"

with open(your_file_name, "rb") as f:
    data = f.read()

# We encode the entire file into a single string to send it to the model
base64_string = base64.b64encode(data).decode("utf-8")


def get_text():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        # Sending our file to the model
                        "type": "file",
                        "file": {
                            "filename": your_file_name,
                            "file_data": f"data:application/pdf;base64,{base64_string}",
                        },
                    },
                    {
                        # Providing the model with detailed instructions for extracting text and adding descriptions for illustrations
                        "type": "text",
                        "text": "Extract all the text from this file. Don't add to text something like /Page 1:/ or /Image Description/. If there's an image, insert a description of it instead, exactly in the place of text where the illustration was. The description is intended for those who cannot see, so describe accurately and vividly, but do not add anything that is not present in the image. 3 sentences per image at least. Before every image description, you can add something like: Here is an illustration. It shows... (but try to vary these announcements)",
                    },
                ],
            },
        ],
    )
    print(response.choices[0].message.content)
    return response.choices[0].message.content

def main():
    # Running text preparation
    our_text = get_text()


if __name__ == "__main__":
    main()
Prepared Text
What Are Raccoons?

Raccoons are small, furry animals with fluffy striped tails and black “masks” around their eyes. They live in forests, near rivers and lakes—and sometimes even close to people in towns and cities. Raccoons are very clever, curious, and quick with their paws.

Here is an illustration. It shows a raccoon by a small stream surrounded by rocks and grass. The raccoon has its paws in the water, seemingly engaged in its typical “washing” behavior. The setting is peaceful with green foliage in the background, creating a sense of the raccoon's natural habitat.

One of the raccoon's most famous habits is "washing" its food. But raccoons aren’t really cleaning their meals. They just love to roll and rub things between their paws, especially near water. Scientists believe this helps them understand what they’re holding. Raccoons eat almost anything: berries, fruits, nuts, insects, fish, and even bird eggs. They're nocturnal, which means they go out at night to look for food and sleep during the day in cozy tree hollows.

Here is another illustration. It depicts a family of raccoons in a grassy area, with three young raccoons playfully interacting. The adult raccoon is sitting nearby, seemingly watching over the young ones. The background is filled with green trees and grass, giving the scene a lively and natural atmosphere.

Raccoons are very social. Young raccoons love to play—tumbling in the grass, hiding behind trees, and exploring everything around them. And sometimes, if they feel safe, raccoons might even come closer to where people are—especially if there's a snack nearby! Even though they can be a little mischievous, raccoons play an important role in nature. They help spread seeds and keep insect populations in check. So next time you see a raccoon, remember: it’s not just a fluffy animal—it’s a real forest explorer!
  2. Send the text to a TTS model to create an audio version

We decided to implement two Text-to-Speech processing options to let our models compete!

For the chat model, we had to tweak the settings — like increasing max_tokens — and come up with a smart prompt that left no room for the model to creatively rephrase the original text: "You are just a speaker. You read text aloud without any distortions or additions. Read from the very beginning, including all the headers".

The TTS model was much easier to use: just pick a voice and send the text.

Below, you'll find the complete Python code for each option (including the text generation part). Under each example, you can listen to the audio output (saved under the name original_pdf_filename.wav).

TTS Response
Audio saved to: c:\Users\user\Documents\Python Scripts\What Are Raccoons.pdf.wav

Full Code Example

Code (TTS model: Aura)
from openai import OpenAI
import base64
import os
import requests

aimlapi_key = "<YOUR_AIMLAPI_KEY>"

client = OpenAI(
    base_url="https://api.aimlapi.com",
    api_key=aimlapi_key,
)

# Put your filename here. The file must be in the same folder as your Python script.
your_file_name = "What Are Raccoons.pdf"

with open(your_file_name, "rb") as f:
    data = f.read()

# We encode the entire file into a single string to send it to the model
base64_string = base64.b64encode(data).decode("utf-8")


def get_text():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        # Sending our file to the model
                        "type": "file",
                        "file": {
                            "filename": your_file_name,
                            "file_data": f"data:application/pdf;base64,{base64_string}",
                        },
                    },
                    {
                        # Providing the chat model with detailed instructions for extracting text and adding descriptions for illustrations
                        "type": "text",
                        "text": "Extract all the text from this file. Don't add to text something like /Page 1:/ or /Image Description/. If there's an image, insert a description of it instead, exactly in the place of text where the illustration was. The description is intended for those who cannot see, so describe accurately and vividly, but do not add anything that is not present in the image. 3 sentences per image at least. Before every image description, you can add something like: Here is an illustration. It shows... (but try to vary these announcements)",
                    },
                ],
            },
        ],
    )
    print(response.choices[0].message.content)
    return response.choices[0].message.content


def read_aloud(text_to_read_aloud):
    url = "https://api.aimlapi.com/v1/tts"
    headers = {
        "Authorization": f"Bearer {aimlapi_key}",
    }
    payload = {
        "model": "#g1_aura-zeus-en",
        "text": text_to_read_aloud,
    }

    response = requests.post(url, headers=headers, json=payload, stream=True)
    
    result = os.path.abspath(f"{your_file_name}.wav")

    with open(result, "wb") as write_stream:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                write_stream.write(chunk)

    print("Audio saved to:", result)


def main():
    # Running text extraction and TTS process
    our_text = get_text()
    read_aloud(our_text)


if __name__ == "__main__":
    main()
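
One optional hardening for the read_aloud function above: it writes whatever the server streams straight to disk, so a failed request would leave behind a .wav file containing an error message instead of audio. A small sketch (not part of the original example) that avoids this by checking the HTTP status right after the requests.post call:

    response = requests.post(url, headers=headers, json=payload, stream=True)
    # Raise an exception on a 4xx/5xx response so a failed call
    # never leaves a truncated or non-audio file on disk.
    response.raise_for_status()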

Code (TTS model: gpt-4o-audio-preview)
from openai import OpenAI
import base64
import os

aimlapi_key = "YOUR_AIMLAPI_KEY"

client = OpenAI(
    base_url = "https://api.aimlapi.com",
    api_key = aimlapi_key, 
)


# Put your filename here. The file must be in the same folder as your Python script
your_file_name = "What Are Raccoons.pdf"

with open(your_file_name, "rb") as f:
    data = f.read()

# We encode the entire file into a single string to send it to the model
base64_string = base64.b64encode(data).decode("utf-8")


def get_text():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        # Sending our file to the model
                        "type": "file",
                        "file": {
                            "filename": your_file_name,
                            "file_data": f"data:application/pdf;base64,{base64_string}",
                        },
                    },
                    {
                        # Providing the chat model with detailed instructions for extracting text and adding descriptions for illustrations
                        "type": "text",
                        "text": "Extract all the text from this file. Don't add to text something like /Page 1:/ or /Image Description/. If there's an image, insert a description of it instead, exactly in the place of text where the illustration was. The description is intended for those who cannot see, so describe accurately and vividly, but do not add anything that is not present in the image. 3 sentences per image at least. Before every image description, you can add something like: Here is an illustration. It shows... (but try to vary these announcements)",
                    },
                ],
            },
        ],
    )
    print(response.choices[0].message.content)
    return response.choices[0].message.content


def read_aloud(text_to_read_aloud):
    response = client.chat.completions.create(
        model="gpt-4o-audio-preview",
        modalities=["text", "audio"],
        audio={"voice": "alloy", "format": "wav"},
        messages=[
            {
                # Providing the TTS model with detailed instructions for reading the text aloud
                "role": "system",
                "content": "You are just a speaker. You read text aloud without any distortions or additions. Read from the very beginning, including all the headers"
            },
            {
                "role": "user",
                "content": text_to_read_aloud
            }
        ],
        max_tokens=6000,  
    )

    wav_bytes = base64.b64decode(response.choices[0].message.audio.data)
    with open(f"{your_file_name}.wav", "wb") as f:
        f.write(wav_bytes)
    dist = os.path.abspath(f"{your_file_name}.wav")
    print("Audio saved to:", dist)


def main():
    # Running text extraction and TTS process
    our_text = get_text()
    read_aloud(our_text)


if __name__ == "__main__":
    main()
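
A practical caveat with this second option: the entire recording must fit into a single model response, which is why max_tokens is raised to 6000. For documents much longer than our raccoon example, one possible workaround (a sketch we did not need here; it reuses client and your_file_name from the code above) is to synthesize paragraph-sized chunks separately and stitch the WAV frames together:

import base64
import io
import wave


def read_aloud_in_chunks(text, chunk_chars=3000):
    # Group paragraphs into chunks of roughly chunk_chars characters,
    # splitting on blank lines so no paragraph is cut mid-sentence.
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > chunk_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)

    out_wav = None
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-4o-audio-preview",
            modalities=["text", "audio"],
            audio={"voice": "alloy", "format": "wav"},
            messages=[
                {
                    "role": "system",
                    "content": "You are just a speaker. You read text aloud without any distortions or additions."
                },
                {"role": "user", "content": chunk},
            ],
            max_tokens=6000,
        )
        wav_bytes = base64.b64decode(response.choices[0].message.audio.data)
        # Append only the audio frames of each part; all parts are assumed
        # to share the same sample rate and channel count.
        with wave.open(io.BytesIO(wav_bytes), "rb") as part:
            if out_wav is None:
                out_wav = wave.open(f"{your_file_name}.wav", "wb")
                out_wav.setparams(part.getparams())
            out_wav.writeframes(part.readframes(part.getnframes()))

    if out_wav is not None:
        out_wav.close()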

Copy the code, insert your AIMLAPI key, specify the path to your document in the code, and give it a try yourself!

We compared a specialized TTS model (Aura by Deepgram) with a chat model that has audio capabilities (gpt-4o-audio-preview by OpenAI).

TTS model: Aura

Advantages of the model: it's more affordable and provides a total of 12 voices, covering both male and female types.
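
Switching voices only requires changing the model string in the request payload. The voice ID below is an assumption extrapolated from the #g1_aura-zeus-en ID used above and Deepgram's Aura voice names; check the All Model IDs page for the list that is actually available:

# Hypothetical voice ID following the "#g1_aura-<voice>-en" pattern used above;
# verify against the All Model IDs page before use.
payload = {
    "model": "#g1_aura-luna-en",  # assumed female voice, instead of the male "zeus"
    "text": "Raccoons are very clever, curious, and quick with their paws.",
}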

Here’s the original audio, generated by the Aura model — you can listen to it at this link.

TTS model: gpt-4o-audio-preview

Advantages of the model: although only a single voice is available, it features much more natural intonation and a slower, more pleasant reading style that suits audiobooks well.

You can listen to the original audio, generated by the GPT-4o Audio Preview model, at this link.
