# OpenAI (Audio Transcription)

## 1. Overview

OpenAI speech-to-text models that generate text transcriptions from input audio.

**Model List:**

* `gpt-4o-transcribe`
* `gpt-4o-mini-transcribe`
* `gpt-4o-transcribe-diarize`
* `whisper-1`

## 2. Request Description

* **Request Method**: `POST`
* **Request URL**:

  > `https://gateway.theturbo.ai/v1/audio/transcriptions`

***

## 3. Request Parameters

### 3.1 Header Parameters

| Parameter Name  | Type   | Required | Description                                                         | Example                |
| --------------- | ------ | -------- | ------------------------------------------------------------------- | ---------------------- |
| `Authorization` | string | Yes      | API\_KEY required for authentication, format `Bearer $YOUR_API_KEY` | `Bearer $YOUR_API_KEY` |

***

### 3.2 Body Parameters (multipart/form-data)

| Parameter Name             | Type   | Required | Description                                                                                                                                                                                                                                                                                                                                                                                                                 | Example (Default)   |
| -------------------------- | ------ | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------- |
| **model**                  | string | Yes      | The model ID to use. See the available versions listed in [Overview](#id-1.-overview), e.g., `gpt-4o-transcribe`.                                                                                                                                                                                                                                                                                                           | `gpt-4o-transcribe` |
| **file**                   | file   | Yes      | The audio file to transcribe, uploaded as a form file part. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. |                     |
| chunking\_strategy         | string | No       | Controls how the audio is split into chunks.                                                                                                                                                                                                                                                                                                                                                                                | `auto`              |
| include                    | array  | No       | Additional information to include in the transcription response.                                                                                                                                                                                                                                                                                                                                                            |                     |
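| known\_speaker\_names      | array  | No       | Optional list of speaker names, corresponding to the audio samples in known\_speaker\_references. |                     |
| known\_speaker\_references | array  | No       | Optional list of audio samples of known speakers, matching the entries in known\_speaker\_names. |                     |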
| language                   | string | No       | The language of the input audio. Providing the input language in ISO-639-1 format (e.g., en) will improve accuracy and reduce latency.                                                                                                                                                                                                                                                                                      | `en`                |
| prompt                     | string | No       | Optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. Not supported by gpt-4o-transcribe-diarize. |                     |
| response\_format           | string | No       | Supported formats: json, text, srt, verbose\_json, vtt, or diarized\_json. gpt-4o-transcribe and gpt-4o-mini-transcribe only support json. gpt-4o-transcribe-diarize supports json, text, and diarized\_json.                                                                                                                                                                                                               | `text`              |
| stream                     | bool   | No       | If set to true, the response is streamed to the client as server-sent events while it is generated. Note: Streaming is not supported for whisper-1; the parameter is ignored for that model. |                     |
| temperature                | number | No       | The sampling temperature, ranging from 0 to 1. | `0.8`               |
| timestamp\_granularities   | array  | No       | The timestamp granularities to include in this transcription: word, segment, or both. response\_format must be set to verbose\_json. Note: Segment timestamps add no extra latency, but word timestamps do. This option does not apply to gpt-4o-transcribe-diarize. | `word`              |
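
Several rows in the table constrain each other (model versus `response_format`, `timestamp_granularities` versus `verbose_json`, and so on). As a rough sketch, these rules could be checked on the client before sending a request. The helper `check_params` and the `FORMATS_BY_MODEL` mapping are hypothetical names, not part of the API, and the format set for `whisper-1` is an assumption because the table does not state it explicitly.

```python
# Rules transcribed from the parameter table; check_params is a hypothetical helper.
DIARIZE = "gpt-4o-transcribe-diarize"
FORMATS_BY_MODEL = {
    "gpt-4o-transcribe": {"json"},
    "gpt-4o-mini-transcribe": {"json"},
    DIARIZE: {"json", "text", "diarized_json"},
    # Assumption: whisper-1 accepts every listed format except diarized_json.
    "whisper-1": {"json", "text", "srt", "verbose_json", "vtt"},
}


def check_params(model, response_format, timestamp_granularities=(),
                 prompt=None, temperature=None):
    """Return a list of human-readable problems; an empty list means none found."""
    allowed = FORMATS_BY_MODEL.get(model)
    if allowed is None:
        return [f"unknown model: {model}"]
    problems = []
    if response_format not in allowed:
        problems.append(f"{model} does not support response_format={response_format}")
    if timestamp_granularities:
        unknown = set(timestamp_granularities) - {"word", "segment"}
        if unknown:
            problems.append(f"unknown timestamp_granularities: {sorted(unknown)}")
        if response_format != "verbose_json":
            problems.append("timestamp_granularities requires response_format=verbose_json")
        if model == DIARIZE:
            problems.append(f"timestamp_granularities does not apply to {DIARIZE}")
    if prompt is not None and model == DIARIZE:
        problems.append(f"prompt is not supported by {DIARIZE}")
    if temperature is not None and not 0 <= temperature <= 1:
        problems.append("temperature must be between 0 and 1")
    return problems
```

Calling `check_params("whisper-1", "verbose_json", ["word"])` returns an empty list, while passing a model name that is not in the mapping returns a single "unknown model" entry. The gateway remains the authority on what is accepted; this only catches the combinations the table rules out.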

***

## 4. Request Examples

```bash
curl --request POST 'https://gateway.theturbo.ai/v1/audio/transcriptions' \
--header 'Authorization: Bearer sk-***' \
--form 'file=@"/Users/xiaobo.yang/Documents/study/speech_test/2.mp3";filename="2.mp3"; headers="Content-Type: audio/mpeg"' \
--form 'timestamp_granularities[]=word' \
--form 'model=whisper-1' \
--form 'response_format=verbose_json' \
--form 'language=zh'
```
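
For callers who are not using curl, the same kind of request can be assembled with the Python standard library. The following is a minimal sketch, not an official client: `encode_multipart` and `build_request` are hypothetical helper names, and the field values in the usage comment are illustrative. It builds the multipart body and the `Authorization` header but does not send anything until `urlopen` is called.

```python
import mimetypes
import urllib.request
import uuid

URL = "https://gateway.theturbo.ai/v1/audio/transcriptions"


def encode_multipart(fields, file_name, file_bytes):
    """Return (body, content_type) for a multipart/form-data upload."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields such as model, language, response_format.
    for name, value in fields:
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    # The audio goes in a part named "file", with a guessed MIME type.
    mime = mimetypes.guess_type(file_name)[0] or "application/octet-stream"
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{file_name}"\r\nContent-Type: {mime}\r\n\r\n'.encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(api_key, file_name, file_bytes, fields):
    """Build (but do not send) the POST request for the transcription endpoint."""
    body, content_type = encode_multipart(fields, file_name, file_bytes)
    return urllib.request.Request(
        URL,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
    )


# Usage (performs a real network call, so it is left commented out):
# with open("2.mp3", "rb") as f:
#     req = build_request("sk-***", "2.mp3", f.read(),
#                         [("model", "whisper-1"), ("language", "zh")])
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```

Passing fields as a list of pairs rather than a dict keeps repeated names such as `timestamp_granularities[]` possible.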

## 5. Response Example

```json
{
  "task": "transcribe",
  "language": "zh",
  "duration": 3.3066875,
  "text": "Hello, where are you?",
  "words": [
    {
      "start": 0,
      "end": 0.56,
      "word": "Hello,",
      "probability": 0.7631836
    },
    {
      "start": 0.72,
      "end": 1.26,
      "word": " where",
      "probability": 0.66064453
    },
    {
      "start": 1.26,
      "end": 1.68,
      "word": " are",
      "probability": 0.9970703
    },
    {
      "start": 1.68,
      "end": 2.04,
      "word": " you?",
      "probability": 0.99316406
    }
  ],
  "segments": [
    {
      "id": 1,
      "seek": 330,
      "start": 0,
      "end": 2.04,
      "text": "Hello, where are you?",
      "tokens": [
        50365,
        15947,
        11,
        689,
        366,
        291,
        30,
        50530
      ],
      "temperature": 0,
      "avg_logprob": -0.2973090277777778,
      "compression_ratio": 0.7777777777777778,
      "no_speech_prob": 0.1317138671875,
      "words": [
        {
          "start": 0,
          "end": 0.56,
          "word": "Hello,",
          "probability": 0.7631836
        },
        {
          "start": 0.72,
          "end": 1.26,
          "word": " where",
          "probability": 0.66064453
        },
        {
          "start": 1.26,
          "end": 1.68,
          "word": " are",
          "probability": 0.9970703
        },
        {
          "start": 1.68,
          "end": 2.04,
          "word": " you?",
          "probability": 0.99316406
        }
      ]
    }
  ]
}
```
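
A response in this shape can be consumed with ordinary JSON handling. The sketch below assumes the `verbose_json` layout shown above (a top-level `words` array with `start`, `end`, `word`, and `probability`); the helper names `format_words` and `low_confidence_words` and the 0.7 threshold are illustrative choices, not part of the API.

```python
import json

# Excerpt of the response above, reduced to the fields used here.
SAMPLE = json.loads("""
{
  "text": "Hello, where are you?",
  "words": [
    {"start": 0,    "end": 0.56, "word": "Hello,", "probability": 0.7631836},
    {"start": 0.72, "end": 1.26, "word": " where", "probability": 0.66064453},
    {"start": 1.26, "end": 1.68, "word": " are",   "probability": 0.9970703},
    {"start": 1.68, "end": 2.04, "word": " you?",  "probability": 0.99316406}
  ]
}
""")


def format_words(response):
    """Render each word as 'start-end  word', with times in seconds."""
    return [
        f'{w["start"]:.2f}-{w["end"]:.2f}  {w["word"].strip()}'
        for w in response.get("words", [])
    ]


def low_confidence_words(response, threshold=0.7):
    """Return words whose probability falls below the threshold."""
    return [
        w["word"].strip()
        for w in response.get("words", [])
        if w["probability"] < threshold
    ]


for line in format_words(SAMPLE):
    print(line)
print(low_confidence_words(SAMPLE))
```

Words carry a leading space in the response, so the helpers strip it before display. Using `response.get("words", [])` keeps both functions safe for formats that omit word-level timestamps.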

