Read Text Aloud and Describe Images: Support People with Visual Impairments
Last updated
Was this helpful?
Last updated
Was this helpful?
Upload the PDF to extract all the text Provide a PDF file with text and illustrations to be processed by a text model and converted into an audiobook. The model reads the PDF, extracts all textual content page by page, and describes each illustration it encounters.
Send the text to a TTS model to create an audio version The extracted text is sent to a TTS (Text-to-Speech) model via a second API call. The model streams the generated audio, and the script saves the audio file locally.
As a result, you will receive an audio version of the original PDF text, saved as a .wav
file.
Upload the PDF to extract all the text
As a text example, we'll use the following one, which you might already recognize from our . The original PDF file you can download from .
We use model to extract text from the document, sending the PDF as base64. Here's the code:
Send the text to a TTS model to create an audio version
We decided to implement two Text-to-Speech processing options to let our models compete!
For the chat model, we had to tweak the settings β like increasing β and come up with a smart prompt that left no room for the model to creatively rephrase the original text: "You are just a speaker. You read text aloud without any distortions or additions. Read from the very beginning, including all the headers".
The TTS model was much easier to use: just pick a voice and send the text.
Below, you'll find the complete Python code for each option (including the text generation part). Under each example, you can listen to the audio output (saved under the name original_pdf_filename.wav
).
Copy the code, insert your AIMLAPI key, specify the path to your document in the code, and give it a try yourself!
We compared a specialized TTS model ( by Deepgram) with a chat model that has audio capabilities ( by OpenAI).
Advantages of the model: it's more affordable and provides a total of 12 voices, covering both male and female types.
Hereβs the original audio, generated by the Aura model β you can listen to it at .
Advantages of the model: although only a single voice is available, it features much more natural intonation and a slower, more pleasant reading style that suits audiobooks well.
You can listen to the original audio, generated by the GPT-4o Audio Preview model, at .