universal

This documentation is valid for the following models:

  • aai/universal

Model Overview

A new Speech-to-Text model offering exceptional accuracy by leveraging its deep understanding of context and semantics, with the broadest language support.

Set Up Your API Key

If you don’t have an API key for the AI/ML API yet, feel free to use our Quickstart guide.
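
You can hard-code the key as in the example below, or read it from an environment variable. A minimal sketch, assuming the key is stored in a variable named AIMLAPI_API_KEY (the variable name is only an illustration, not a required name):

import os

# Read the AI/ML API key from an environment variable instead of hard-coding it.
# The variable name AIMLAPI_API_KEY is an assumption; use whatever name you set.
api_key = os.environ.get("AIMLAPI_API_KEY")
if not api_key:
    raise RuntimeError("Set the AIMLAPI_API_KEY environment variable to your AI/ML API key.")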

API Schema

Creating and sending a speech-to-text conversion task to the server

POST https://api.aimlapi.com/v1/stt/create

Authorizations: Bearer <YOUR_AIMLAPI_KEY>

Body
model · enum · Required

Possible values: aai/universal

audio_start_from · integer · Optional

The point in time, in milliseconds, in the file at which transcription should start.

audio_end_at · integer · Optional

The point in time, in milliseconds, in the file at which transcription should stop.

language_code · string · Optional

The language of your audio file. Possible values are found in Supported Languages. The default value is 'en_us'.

language_confidence_threshold · number · max: 1 · Optional

The confidence threshold for the automatically detected language. An error will be returned if the language confidence is below this threshold. Defaults to 0.

language_detection · boolean · Optional

Enable Automatic language detection, either true or false. Available for the universal model only.

punctuate · boolean · Optional

Adds punctuation and capitalization to the transcript.

Default: true

format_text · boolean · Optional

Enable Text Formatting, can be true or false.

Default: true

disfluencies · boolean · Optional

Transcribe Filler Words, like "umm", in your media file; can be true or false.

Default: false

multichannel · boolean · Optional

Enable Multichannel transcription, can be true or false.

Default: false

speaker_labels · boolean · Optional

Enable Speaker diarization, can be true or false.

Default: false

speakers_expected · integer · Optional

Tell the speaker label model how many speakers it should attempt to identify. See Speaker diarization for more details.

content_safety · boolean · Optional

Enable Content Moderation, can be true or false.

Default: false

iab_categories · boolean · Optional

Enable Topic Detection, can be true or false.

Default: false

auto_highlights · boolean · Optional

Enable Key Phrases, either true or false.

Default: false

word_boost · string[] · Optional

The list of custom vocabulary to boost transcription probability for.

boost_param · string · enum · Optional

How much to boost the specified words.

Possible values: low, default, high

filter_profanity · boolean · Optional

Filter profanity from the transcribed text, can be true or false.

Default: false

redact_pii · boolean · Optional

Redact PII from the transcribed text using the Redact PII model, can be true or false.

Default: false

redact_pii_audio · boolean · Optional

Generate a copy of the original media file with spoken PII "beeped" out, can be true or false. See PII redaction for more details.

Default: false

redact_pii_audio_quality · string · enum · Optional

Controls the filetype of the audio created by redact_pii_audio. See PII redaction for more details.

Possible values: mp3 (default), wav

redact_pii_sub · string · enum · Optional

The replacement logic for detected PII. See PII redaction for more details.

Possible values: entity_type, hash

sentiment_analysis · boolean · Optional

Enable Sentiment Analysis, can be true or false.

Default: false

entity_detection · boolean · Optional

Enable Entity Detection, can be true or false.

Default: false

summarization · boolean · Optional

Enable Summarization, can be true or false.

Default: false

summary_model · string · enum · Optional

The model to summarize the transcript.

Possible values: informative, conversational, catchy

summary_type · string · enum · Optional

The type of summary.

Possible values: bullets, bullets_verbose, gist, headline, paragraph

auto_chapters · boolean · Optional

Enable Auto Chapters, either true or false.

Default: false

speech_threshold · number · max: 1 · Optional

Reject audio files that contain less than this fraction of speech. Valid values are in the range [0, 1] inclusive.

Responses

201 Success (application/json)
{
  "generation_id": "123e4567-e89b-12d3-a456-426614174000"
}
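
As a quick illustration of the body parameters above, here is a minimal sketch of a create request that sets a few of the optional fields. It reuses the endpoint, headers, and audio URL from the Quick Example below; the chosen parameter values are only illustrative.

import requests

response = requests.post(
    "https://api.aimlapi.com/v1/stt/create",
    headers={"Authorization": "Bearer <YOUR_AIMLAPI_KEY>"},
    json={
        "model": "aai/universal",
        "url": "https://audio-samples.github.io/samples/mp3/blizzard_primed/sample-0.mp3",
        "punctuate": True,           # default: true
        "speaker_labels": True,      # enable Speaker diarization
        "speakers_expected": 2,      # hint for the speaker label model
        "language_detection": True,  # automatic language detection
    },
)
print(response.json())  # e.g. {"generation_id": "123e4567-e89b-12d3-a456-426614174000"}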

Requesting the result of the task from the server using the generation_id

GET https://api.aimlapi.com/v1/stt/{generation_id}

Authorizations: Bearer <YOUR_AIMLAPI_KEY>

Path parameters

generation_id · string · Required

Responses

201 Success (application/json)
{
  "status": "text",
  "result": {
    "metadata": {
      "transaction_key": "text",
      "request_id": "text",
      "sha256": "text",
      "created": "2025-09-04T01:38:49.453Z",
      "duration": 1,
      "channels": 1,
      "models": [
        "text"
      ],
      "model_info": {
        "ANY_ADDITIONAL_PROPERTY": {
          "name": "text",
          "version": "text",
          "arch": "text"
        }
      }
    }
  }
}
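
A minimal sketch of a single status check against this endpoint (the full polling loop is shown in the Quick Example below); replace generation_id with the value returned by the create call:

import requests

generation_id = "123e4567-e89b-12d3-a456-426614174000"  # returned by the create call
response = requests.get(
    f"https://api.aimlapi.com/v1/stt/{generation_id}",
    headers={"Authorization": "Bearer <YOUR_AIMLAPI_KEY>"},
)
data = response.json()
print(data.get("status"))          # e.g. "waiting" or "active" while the task is still processing
if data.get("result"):
    print(data["result"]["text"])  # transcript text, once processing has finished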

Quick Example: Processing a Speech Audio File via URL

Let's transcribe the following audio fragment:

import time
import requests
import json   # for getting a structured output with indentation

base_url = "https://api.aimlapi.com/v1"
# Insert your AIML API Key instead of <YOUR_AIMLAPI_KEY>:
api_key = "<YOUR_AIMLAPI_KEY>"

# Creating and sending a speech-to-text conversion task to the server
def create_stt():
    url = f"{base_url}/stt/create"
    headers = {
        "Authorization": f"Bearer {api_key}", 
    }

    data = {
        "model": "aai/universal",
        "url": "https://audio-samples.github.io/samples/mp3/blizzard_primed/sample-0.mp3"
    }
 
    response = requests.post(url, json=data, headers=headers)
    
    if response.status_code >= 400:
        print(f"Error: {response.status_code} - {response.text}")
    else:
        response_data = response.json()
        print(response_data)
        return response_data

# Requesting the result of the task from the server using the generation_id
def get_stt(gen_id):
    url = f"{base_url}/stt/{gen_id}"
    headers = {
        "Authorization": f"Bearer {api_key}", 
    }
    response = requests.get(url, headers=headers)
    return response.json()
    
# First, start the generation, then repeatedly request the result from the server every 10 seconds.
def main():
    stt_response = create_stt()
    gen_id = stt_response.get("generation_id") if stt_response else None

    if gen_id:
        start_time = time.time()

        timeout = 600
        while time.time() - start_time < timeout:
            response_data = get_stt(gen_id)

            if response_data is None:
                print("Error: No response from API")
                break
        
            status = response_data.get("status")

            if status == "waiting" or status == "active":
                print("Still waiting... Checking again in 10 seconds.")
                time.sleep(10)
            else:
                print("Processing complete:\n", response_data["result"]["text"])

                # Uncomment the line below to print the entire "result" object with all service data
                # print("Processing complete:\n", json.dumps(response_data["result"], indent=2, ensure_ascii=False))
                return response_data

        print("Timeout reached. Stopping.")
        return None


if __name__ == "__main__":
    main()
Response
{'generation_id': '0cff4e24-c1ba-419d-8b62-46f342985881'}
Still waiting... Checking again in 10 seconds.
Processing complete:
 {
  "id": "04d07a4c-9238-4860-ac6f-534d58fdaf9a",
  "language_model": "assemblyai_default",
  "acoustic_model": "assemblyai_default",
  "language_code": "en_us",
  "status": "completed",
  "audio_url": "https://audio-samples.github.io/samples/mp3/blizzard_primed/sample-0.mp3",
  "text": "He doesn't belong to you. And I don't see how you have anything to do with what is be his power yet his he presumably that from this stage to you be fired.",
  "words": [
    {
      "text": "He",
      "start": 400,
      "end": 520,
      "confidence": 0.98876953,
      "speaker": null
    },
    {
      "text": "doesn't",
      "start": 520,
      "end": 880,
      "confidence": 0.9296875,
      "speaker": null
    },
    {
      "text": "belong",
      "start": 880,
      "end": 1320,
      "confidence": 1,
      "speaker": null
    },
    {
      "text": "to",
      "start": 1320,
      "end": 1560,
      "confidence": 0.99853516,
      "speaker": null
    },
    {
      "text": "you.",
      "start": 1560,
      "end": 1840,
      "confidence": 0.99853516,
      "speaker": null
    },
    {
      "text": "And",
      "start": 1840,
      "end": 2120,
      "confidence": 0.99365234,
      "speaker": null
    },
    {
      "text": "I",
      "start": 2120,
      "end": 2280,
      "confidence": 0.99902344,
      "speaker": null
    },
    {
      "text": "don't",
      "start": 2280,
      "end": 2520,
      "confidence": 0.9949544,
      "speaker": null
    },
    {
      "text": "see",
      "start": 2520,
      "end": 2720,
      "confidence": 0.99902344,
      "speaker": null
    },
    {
      "text": "how",
      "start": 2720,
      "end": 3000,
      "confidence": 0.99902344,
      "speaker": null
    },
    {
      "text": "you",
      "start": 3000,
      "end": 3320,
      "confidence": 0.99853516,
      "speaker": null
    },
    {
      "text": "have",
      "start": 3320,
      "end": 3600,
      "confidence": 0.99658203,
      "speaker": null
    },
    {
      "text": "anything",
      "start": 3600,
      "end": 4080,
      "confidence": 0.9968262,
      "speaker": null
    },
    {
      "text": "to",
      "start": 4080,
      "end": 4240,
      "confidence": 0.99902344,
      "speaker": null
    },
    {
      "text": "do",
      "start": 4240,
      "end": 4360,
      "confidence": 0.99902344,
      "speaker": null
    },
    {
      "text": "with",
      "start": 4360,
      "end": 4520,
      "confidence": 0.9902344,
      "speaker": null
    },
    {
      "text": "what",
      "start": 4520,
      "end": 4720,
      "confidence": 0.9941406,
      "speaker": null
    },
    {
      "text": "is",
      "start": 4720,
      "end": 4920,
      "confidence": 0.9819336,
      "speaker": null
    },
    {
      "text": "be",
      "start": 4920,
      "end": 5080,
      "confidence": 0.8720703,
      "speaker": null
    },
    {
      "text": "his",
      "start": 5080,
      "end": 5280,
      "confidence": 0.9951172,
      "speaker": null
    },
    {
      "text": "power",
      "start": 5280,
      "end": 5520,
      "confidence": 0.8588867,
      "speaker": null
    },
    {
      "text": "yet",
      "start": 5520,
      "end": 5840,
      "confidence": 0.5756836,
      "speaker": null
    },
    {
      "text": "his",
      "start": 5840,
      "end": 6160,
      "confidence": 0.5419922,
      "speaker": null
    },
    {
      "text": "he",
      "start": 6160,
      "end": 6360,
      "confidence": 0.96972656,
      "speaker": null
    },
    {
      "text": "presumably",
      "start": 6360,
      "end": 6840,
      "confidence": 0.5012207,
      "speaker": null
    },
    {
      "text": "that",
      "start": 6840,
      "end": 7000,
      "confidence": 0.8901367,
      "speaker": null
    },
    {
      "text": "from",
      "start": 7000,
      "end": 7160,
      "confidence": 0.9951172,
      "speaker": null
    },
    {
      "text": "this",
      "start": 7160,
      "end": 7320,
      "confidence": 0.9926758,
      "speaker": null
    },
    {
      "text": "stage",
      "start": 7320,
      "end": 7680,
      "confidence": 0.9953613,
      "speaker": null
    },
    {
      "text": "to",
      "start": 7680,
      "end": 7960,
      "confidence": 0.9941406,
      "speaker": null
    },
    {
      "text": "you",
      "start": 7960,
      "end": 8320,
      "confidence": 0.9975586,
      "speaker": null
    },
    {
      "text": "be",
      "start": 9440,
      "end": 9720,
      "confidence": 0.4555664,
      "speaker": null
    },
    {
      "text": "fired.",
      "start": 9720,
      "end": 10050,
      "confidence": 0.4534912,
      "speaker": null
    }
  ],
  "utterances": null,
  "confidence": 0.90746206,
  "audio_duration": 11,
  "punctuate": true,
  "format_text": true,
  "dual_channel": null,
  "webhook_url": null,
  "webhook_status_code": null,
  "webhook_auth": false,
  "webhook_auth_header_name": null,
  "speed_boost": false,
  "auto_highlights_result": null,
  "auto_highlights": false,
  "audio_start_from": null,
  "audio_end_at": null,
  "word_boost": [],
  "boost_param": null,
  "prompt": null,
  "keyterms_prompt": [],
  "filter_profanity": false,
  "redact_pii": false,
  "redact_pii_audio": false,
  "redact_pii_audio_quality": null,
  "redact_pii_audio_options": null,
  "redact_pii_policies": null,
  "redact_pii_sub": null,
  "speaker_labels": false,
  "speaker_options": null,
  "content_safety": false,
  "iab_categories": false,
  "content_safety_labels": {
    "status": "unavailable",
    "results": [],
    "summary": {}
  },
  "iab_categories_result": {
    "status": "unavailable",
    "results": [],
    "summary": {}
  },
  "language_detection": false,
  "language_detection_options": null,
  "language_confidence_threshold": null,
  "language_confidence": null,
  "custom_spelling": null,
  "throttled": false,
  "auto_chapters": false,
  "summarization": false,
  "summary_type": null,
  "summary_model": null,
  "custom_topics": false,
  "topics": [],
  "speech_threshold": null,
  "speech_model": "universal",
  "chapters": null,
  "disfluencies": false,
  "entity_detection": false,
  "sentiment_analysis": false,
  "sentiment_analysis_results": null,
  "entities": null,
  "speakers_expected": null,
  "summary": null,
  "custom_topics_results": null,
  "is_deleted": null,
  "multichannel": null,
  "project_id": 675898,
  "token_id": 1245789
}
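
Besides the plain transcript in the text field, the result object above includes per-word timestamps and confidence scores. A small sketch of how you might post-process them, assuming the field names shown in the response above ("words", "text", "start", "end", "confidence"):

def low_confidence_words(result, threshold=0.6):
    """Return (text, start_ms, end_ms, confidence) tuples for low-confidence words."""
    flagged = []
    for word in result.get("words", []):
        if word["confidence"] < threshold:
            # "start" and "end" are offsets into the audio, in milliseconds
            flagged.append((word["text"], word["start"], word["end"], word["confidence"]))
    return flagged

# With the transcript above, this flags words such as "yet", "his", "presumably",
# "be", and "fired.", which are good candidates for manual review.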
