# ASR Engines

There are four different ASR engines, and the right one is chosen based on the `asr_type` parameter.
## Engines

- `ASRAsyncService`: the main engine. It handles jobs and defines the execution mode for transcription and diarization (post-processing is always done locally by this engine).
- `ASRLiveService`: the engine that handles live streaming requests.
- `ASRTranscriptionOnly`: the engine to use when you want to deploy a standalone remote transcription server.
- `ASRDiarizationOnly`: the engine to use when you want to deploy a standalone remote diarization server.
> **Warning:** The `ASRTranscriptionOnly` and `ASRDiarizationOnly` engines aren't meant to be used alone. Use them only when you want to deploy each service on a separate server, and they must be used along with the `ASRAsyncService` engine.
## Endpoints

Each engine has its own endpoints, as described below.

### ASRAsyncService

#### Transcription endpoints

These endpoints are the main endpoints for transcribing audio files.

`/audio` [POST] - The audio endpoint for transcribing local files.
```python
@router.post(
    "", response_model=Union[AudioResponse, str], status_code=http_status.HTTP_200_OK
)
async def inference_with_audio(
    background_tasks: BackgroundTasks,
    offset_start: Union[float, None] = Form(None),
    offset_end: Union[float, None] = Form(None),
    num_speakers: int = Form(-1),
    diarization: bool = Form(False),
    multi_channel: bool = Form(False),
    source_lang: str = Form("en"),
    timestamps: str = Form("s"),
    vocab: Union[List[str], None] = Form(None),
    word_timestamps: bool = Form(False),
    internal_vad: bool = Form(False),
    repetition_penalty: float = Form(1.2),
    compression_ratio_threshold: float = Form(2.4),
    log_prob_threshold: float = Form(-1.0),
    no_speech_threshold: float = Form(0.6),
    condition_on_previous_text: bool = Form(True),
    file: UploadFile = File(...),
) -> AudioResponse:
    """Inference endpoint with audio file."""
```
> **Note:** The local `/audio` endpoint expects a `file` parameter; all other parameters are optional and have default values.
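As an illustration, here is a minimal client sketch using the `requests` library. The base URL (`http://localhost:5001/api/v1`) and the file name are assumptions for this example, not values defined by the project.

```python
import requests

# Assumed base URL for a locally running ASRAsyncService instance.
API_URL = "http://localhost:5001/api/v1/audio"

with open("meeting.wav", "rb") as f:  # any local audio file
    response = requests.post(
        API_URL,
        files={"file": f},
        # All form fields are optional; these override a few of the defaults shown above.
        data={"diarization": True, "source_lang": "en", "word_timestamps": True},
    )

response.raise_for_status()
print(response.json()["utterances"])
```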
`/audio-url` [POST] - The audio endpoint for transcribing remote files using a URL.
@router.post("", response_model=AudioResponse, status_code=http_status.HTTP_200_OK)
async def inference_with_audio_url(
background_tasks: BackgroundTasks,
url: str,
data: Optional[AudioRequest] = None,
) -> AudioResponse:
"""Inference endpoint with audio url."""
Here is the `AudioRequest` model, which inherits from the `BaseRequest` model:
```python
class BaseRequest(BaseModel):
    """Base request model for the API."""

    offset_start: Union[float, None] = None
    offset_end: Union[float, None] = None
    num_speakers: int = -1
    diarization: bool = False
    source_lang: str = "en"
    timestamps: Timestamps = Timestamps.seconds
    vocab: Union[List[str], None] = None
    word_timestamps: bool = False
    internal_vad: bool = False
    repetition_penalty: float = 1.2
    compression_ratio_threshold: float = 2.4
    log_prob_threshold: float = -1.0
    no_speech_threshold: float = 0.6
    condition_on_previous_text: bool = True


class AudioRequest(BaseRequest):
    """Request model for the ASR audio file and url endpoint."""

    multi_channel: bool = False
```
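For example, a call to `/audio-url` could look like the sketch below. The base URL and the remote audio URL are assumptions for illustration; `url` is passed as a query parameter and the optional JSON body mirrors the `AudioRequest` model.

```python
import requests

# Assumed base URL for a locally running ASRAsyncService instance.
response = requests.post(
    "http://localhost:5001/api/v1/audio-url",
    params={"url": "https://example.com/recordings/meeting.wav"},  # remote file to transcribe
    json={"diarization": True, "num_speakers": 2, "multi_channel": False},
)

response.raise_for_status()
print(response.json()["audio_duration"])
```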
Here is the `AudioResponse` model, which inherits from the `BaseResponse` model:
```python
class BaseResponse(BaseModel):
    """Base response model, not meant to be used directly."""

    utterances: List[Utterance]
    audio_duration: float
    offset_start: Union[float, None]
    offset_end: Union[float, None]
    num_speakers: int
    diarization: bool
    source_lang: str
    timestamps: str
    vocab: Union[List[str], None]
    word_timestamps: bool
    internal_vad: bool
    repetition_penalty: float
    compression_ratio_threshold: float
    log_prob_threshold: float
    no_speech_threshold: float
    condition_on_previous_text: bool
    process_times: ProcessTimes


class AudioResponse(BaseResponse):
    """Response model for the ASR audio file and url endpoint."""

    multi_channel: bool
```
`/youtube` [POST] - The audio endpoint for transcribing YouTube videos using a YouTube video link.
@router.post("", response_model=YouTubeResponse, status_code=http_status.HTTP_200_OK)
async def inference_with_youtube(
background_tasks: BackgroundTasks,
url: str,
data: Optional[BaseRequest] = None,
) -> YouTubeResponse:
"""Inference endpoint with YouTube url."""
> **Note:** The only difference is that the `/youtube` endpoint uses the `BaseRequest` model, which is the same as the `AudioRequest` model used by `/audio-url` but without the `multi_channel` parameter.
Here is the `YouTubeResponse` model, which inherits from the `BaseResponse` model:
```python
class YouTubeResponse(BaseResponse):
    """Response model for the ASR YouTube endpoint."""

    video_url: str
```
#### Management endpoints

These endpoints are used to manage the remote server URLs when you deploy the `ASRTranscriptionOnly` or `ASRDiarizationOnly` engines on separate servers.

`/url` [GET] - This endpoint allows listing the remote server URLs.
```python
@router.get(
    "",
    response_model=Union[List[HttpUrl], str],
    status_code=http_status.HTTP_200_OK,
)
async def get_url(task: Literal["transcription", "diarization"]) -> List[HttpUrl]:
    """Get Remote URL endpoint for remote transcription or diarization."""
```
`/url/add` [POST] - This endpoint allows adding a remote server URL.
```python
@router.post(
    "/add",
    response_model=Union[UrlSchema, str],
    status_code=http_status.HTTP_200_OK,
)
async def add_url(data: UrlSchema) -> UrlSchema:
    """Add Remote URL endpoint for remote transcription or diarization."""
```
`/url/remove` [POST] - This endpoint allows removing a remote server URL.
```python
@router.post(
    "/remove",
    response_model=Union[UrlSchema, str],
    status_code=http_status.HTTP_200_OK,
)
async def remove_url(data: UrlSchema) -> UrlSchema:
    """Remove Remote URL endpoint for remote transcription or diarization."""
```
Here is the `UrlSchema` model:

```python
class UrlSchema(BaseModel):
    """Request model for the add_url endpoint."""

    task: Literal["transcription", "diarization"]
    url: HttpUrl
```

The `url` parameter needs to be a valid URL (see pydantic's `HttpUrl`), and the `task` parameter needs to be either `transcription` or `diarization`.
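As an illustration, here is a minimal sketch that registers a remote transcription server and then lists the registered URLs. The base URL of the `ASRAsyncService` instance and the remote server address are assumptions for this example.

```python
import requests

# Assumed base URL for the ASRAsyncService instance.
BASE_URL = "http://localhost:5001/api/v1"

# Register a remote transcription server (assumed address).
requests.post(
    f"{BASE_URL}/url/add",
    json={"task": "transcription", "url": "http://transcription-server:5002"},
)

# List the remote transcription servers currently registered.
urls = requests.get(f"{BASE_URL}/url", params={"task": "transcription"}).json()
print(urls)
```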
### ASRLiveService

`/live` [WEBSOCKET] - The live streaming endpoint.

```python
@router.websocket("")
async def websocket_endpoint(source_lang: str, websocket: WebSocket) -> None:
    """Handle WebSocket connections."""
```

This endpoint expects a WebSocket connection and a `source_lang` parameter as a string, and returns the transcription results in real time.
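Here is a minimal client sketch using the `websockets` library. The host, port, endpoint path, and the assumption that the server accepts raw audio bytes and replies with text frames are all illustrative; check the server implementation for the exact protocol and expected audio format.

```python
import asyncio

import websockets  # pip install websockets


async def stream_audio(path: str) -> None:
    # Assumed host, port, and path; source_lang is passed as a query parameter.
    uri = "ws://localhost:5001/api/v1/live?source_lang=en"
    async with websockets.connect(uri) as ws:
        with open(path, "rb") as audio:
            while chunk := audio.read(4096):
                await ws.send(chunk)    # send raw audio bytes
                print(await ws.recv())  # print the partial transcription


asyncio.run(stream_audio("meeting.raw"))
```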
### ASRTranscriptionOnly

> **Warning:** This endpoint is not meant to be used alone. Use it only when you deploy the `ASRTranscriptionOnly` engine on a separate server, and it will need to be used along with the `ASRAsyncService` engine.

`/transcribe` [POST] - The transcription endpoint.
```python
@router.post(
    "",
    response_model=Union[TranscriptionOutput, List[TranscriptionOutput], str],
    status_code=http_status.HTTP_200_OK,
)
async def only_transcription(
    data: TranscribeRequest,
) -> Union[TranscriptionOutput, List[TranscriptionOutput]]:
    """Transcribe endpoint for the `only_transcription` asr type."""
```
This endpoint expects a `TranscribeRequest` and returns the transcription results as a `TranscriptionOutput`, or a list of `TranscriptionOutput` if the task is a `multi_channel` task.

Here is the `TranscribeRequest` model:
```python
class TranscribeRequest(BaseModel):
    """Request model for the transcribe endpoint."""

    audio: Union[TensorShare, List[TensorShare]]
    compression_ratio_threshold: float
    condition_on_previous_text: bool
    internal_vad: bool
    log_prob_threshold: float
    no_speech_threshold: float
    repetition_penalty: float
    source_lang: str
    vocab: Union[List[str], None]
```
Here is the `TranscriptionOutput` model:
```python
class TranscriptionOutput(BaseModel):
    """Transcription output model for the API."""

    segments: List[Segment]
```
### ASRDiarizationOnly

> **Warning:** This endpoint is not meant to be used alone. Use it only when you deploy the `ASRDiarizationOnly` engine on a separate server, and it will need to be used along with the `ASRAsyncService` engine.

`/diarize` [POST] - The diarization endpoint.
```python
@router.post(
    "",
    response_model=Union[DiarizationOutput, str],
    status_code=http_status.HTTP_200_OK,
)
async def remote_diarization(
    data: DiarizationRequest,
) -> DiarizationOutput:
    """Diarize endpoint for the `only_diarization` asr type."""
```
This endpoint expects a `DiarizationRequest` and returns the diarization results as a `DiarizationOutput`.

Here is the `DiarizationRequest` model:
```python
class DiarizationRequest(BaseModel):
    """Request model for the diarize endpoint."""

    audio: TensorShare
    duration: float
    num_speakers: int
```
Here are the `DiarizationSegment` and `DiarizationOutput` models:
```python
class DiarizationSegment(NamedTuple):
    """Diarization segment model for the API."""

    start: float
    end: float
    speaker: int


class DiarizationOutput(BaseModel):
    """Diarization output model for the API."""

    segments: List[DiarizationSegment]
```
## Execution modes

Execution modes define how tasks are executed: either locally or remotely. One execution mode is defined for each task.

There are two execution modes: `LocalExecution` and `RemoteExecution`.

`LocalExecution` is the default execution mode. It executes the pipeline on the local machine, which is useful for testing and debugging.

Local execution looks for any local GPU device. If there is no GPU device, it uses the CPU. If there are multiple GPU devices, it uses all of them alternately, and the `index` parameter keeps track of the GPU index assigned to each task, as sketched below.
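Here is an illustrative sketch of that round-robin device assignment (not the project's actual implementation):

```python
from itertools import cycle

import torch

# Cycle over the available GPU indices, or fall back to the CPU.
if torch.cuda.is_available():
    device_indices = cycle(range(torch.cuda.device_count()))
else:
    device_indices = cycle([-1])  # -1 stands for "CPU" in this sketch


def next_device() -> str:
    """Return the device the next task should run on."""
    index = next(device_indices)
    return "cpu" if index == -1 else f"cuda:{index}"


# Each incoming task is assigned the next device in turn,
# e.g. "cuda:0", "cuda:1", "cuda:0", ... on a two-GPU machine.
print(next_device())
```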
`RemoteExecution` executes the pipeline on a remote machine, which is useful for production and scaling.

> **Note:** The remote execution mode is only available if you have added `transcribe_server_urls` or `diarization_server_urls` in the configuration file, or on the fly via the API. Check the Environment variables section for more information.

The `url` parameter is the URL of the remote machine that will be used to execute a task.