API Docs

Multi-Speaker Text to Speech – Voice Cloning

Generate speaker-separated speech output using cloned voices from multi-speaker text input.

Create Multi-Speaker Text to Speech Voice Cloning Request

POST

/api/ms-tts-w-vc

This endpoint converts multi-speaker text segments into speech using cloned voices. Each speaker is mapped to a specific voice avatar and synthesized independently.

The request is processed asynchronously. Once accepted, the API returns a unique log_id that can be used to track synthesis progress and retrieve the generated audio output.


Request Body

project_title (string, required)

A human-readable title to identify the multi-speaker text to speech project.

Example: "My Project"

segments (array, required)

List of speaker segments. Each segment carries the speaker identifier, the original text, the text to synthesize, and start and end timestamps.

Example: [ { "original_speaker": "SPEAKER_00", "original_text": "regardless of my", "translated_text": "मेरी परवाह किए बिना", "start_time": "00:00:00,070", "end_time": "00:00:00,910" } ]

segments[].original_speaker (string, required)

Identifier of the speaker for the segment (for example: SPEAKER_00).

segments[].original_text (string, required)

Original text spoken by the speaker in this segment.

segments[].translated_text (string, required)

Text that will be synthesized into speech for the speaker.

segments[].start_time (string, required)

Start timestamp of the segment in HH:MM:SS,mmm format (for example, "00:00:00,070").

segments[].end_time (string, required)

End timestamp of the segment in HH:MM:SS,mmm format (for example, "00:00:00,910").
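
The timestamp format above can be checked and converted client-side before a request is sent. The following is a minimal sketch, not part of the API. The helper names are hypothetical, and the three-digit millisecond field is an assumption based on the examples shown here.

```python
import re

# Pattern for the HH:MM:SS,mmm timestamps used by start_time and end_time.
# Assumption: milliseconds always have three digits, as in the examples.
_TIMESTAMP = re.compile(r"(\d{2}):([0-5]\d):([0-5]\d),(\d{3})")


def timestamp_to_ms(value: str) -> int:
    """Convert an 'HH:MM:SS,mmm' string to integer milliseconds."""
    match = _TIMESTAMP.fullmatch(value)
    if match is None:
        raise ValueError(f"not an HH:MM:SS,mmm timestamp: {value!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def ms_to_timestamp(total_ms: int) -> str:
    """Format integer milliseconds as 'HH:MM:SS,mmm'."""
    if total_ms < 0:
        raise ValueError("timestamp cannot be negative")
    seconds, millis = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"
```

Converting to milliseconds makes it easy to confirm that each segment's end_time falls after its start_time before submitting.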

speakers_mapping (object, required)

Mapping between speaker identifiers and their corresponding voice cloning configuration.

Example: { "SPEAKER_00": { "avatar": "daaji", "speed": 1, "pitch": 0 } }

speakers_mapping.{speaker}.avatar (string, required)

Voice cloning avatar to be used for the speaker.

Example: "daaji"

speakers_mapping.{speaker}.speed (number, optional)

Speech speed for the speaker (default: 1).

speakers_mapping.{speaker}.pitch (number, optional)

Speech pitch for the speaker (default: 0).
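
Because every speaker needs an avatar, it can help to confirm that each original_speaker used in segments has an entry in speakers_mapping. The sketch below is a client-side precaution only. The function name is hypothetical, and this page does not say how the API itself treats a segment whose speaker has no mapping.

```python
def unmapped_speakers(segments, speakers_mapping):
    """Return, sorted, the speaker IDs used in segments that have no avatar."""
    used = {segment["original_speaker"] for segment in segments}
    return sorted(
        speaker
        for speaker in used
        # A speaker counts as mapped only if its entry has a non-empty avatar.
        if not speakers_mapping.get(speaker, {}).get("avatar")
    )
```

An empty result means every speaker in the request has a voice avatar assigned.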

language (string, required)

Language of the synthesized speech.

Example: "hindi"

stt_log_id (string, optional)

ID of a Speech-to-Text job. Supply it only when video output is required; omit it for audio-only output.


Response

On successful submission, the API returns a unique log_id.
Use this log_id with the Fetch Multi-Speaker Text to Speech Voice Cloning By ID endpoint to retrieve the generated audio output per speaker.

{
  "log_id": "695036727d5247d58c029ea1"
}

curl

curl --location 'https://api.narris.io/api/ms-tts-w-vc' \
--header 'Content-Type: application/json' \
--header 'x-api-key: YOUR_API_KEY' \
--data '{
  "project_title": "My Project",
  "segments": [
    {
      "original_speaker": "SPEAKER_00",
      "speed": null,
      "pitch": null,
      "original_text": "regardless of my",
      "start_time": "00:00:00,070",
      "end_time": "00:00:00,910",
      "translated_text": "मेरी परवाह किए बिना"
    }
  ],
  "speakers_mapping": {
    "SPEAKER_00": {
      "avatar": "daaji",
      "speed": 1,
      "pitch": 0
    }
  },
  "language": "hindi",
  "stt_log_id": "694b0857bca52cafa9557563"
}'
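
The same request can be assembled in Python with only the standard library. This is a rough equivalent of the curl call above, offered as a sketch. The function names are hypothetical, the base URL and header names are copied from the curl example, and leaving stt_log_id out of the body for audio-only output follows the field description.

```python
import json
import urllib.request

API_BASE = "https://api.narris.io"  # base URL taken from the curl example


def build_create_request(api_key, project_title, segments, speakers_mapping,
                         language, stt_log_id=None):
    """Build (but do not send) the POST /api/ms-tts-w-vc request."""
    payload = {
        "project_title": project_title,
        "segments": segments,
        "speakers_mapping": speakers_mapping,
        "language": language,
    }
    if stt_log_id is not None:  # omit the field entirely for audio-only output
        payload["stt_log_id"] = stt_log_id
    return urllib.request.Request(
        f"{API_BASE}/api/ms-tts-w-vc",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def submit(request):
    """Send the request and return the log_id from the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["log_id"]
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.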

Fetch Multi-Speaker Text to Speech Voice Cloning List

GET

/api/ms-tts-w-vc/logs

This endpoint returns a paginated list of previously created multi-speaker text to speech requests that used cloned voices.

Each entry represents a voice cloning job and includes its current status, speaker-to-avatar mappings, timestamps, and output files (if available).


Query Parameters

page (number, optional)

Page number for pagination.

Example: 1

limit (number, optional)

Number of records to return per page.

Example: 20


Response

On success, the API returns a paginated list of multi-speaker text to speech voice cloning logs.
Each log contains a unique _id, which can be used with the Fetch Multi-Speaker Text to Speech Voice Cloning By ID endpoint to retrieve detailed synthesis results and generated media.

{
  "total": 11,
  "page": 1,
  "limit": 20,
  "logs": [
    {
      "_id": "695036727d5247d58c029ea1",
      "project_title": "My Project",
      "speakers_mapping": {
        "SPEAKER_00": {
          "avatar": "daaji",
          "speed": 1,
          "pitch": 0
        }
      },
      "status": "finished",
      "createdAt": "2025-12-27T19:41:38.320Z",
      "finishedAt": "2025-12-27T19:41:44.816Z",
      "output_audio_file": "https://lingui-dev.s3.amazonaws.com/audio_upload_video_feature/output/20251227194141_20251227194138.wav",
      "output_file": "https://lingui-dev.s3.amazonaws.com/video_features/output/20251227194143_20251227194141.mp4"
    },
    {
      "_id": "694b3c1399f08a2ce975b2c7",
      "project_title": "My Project",
      "speakers_mapping": {
        "SPEAKER_00": {
          "avatar": "daaji",
          "speed": 1,
          "pitch": 0
        }
      },
      "status": "failed",
      "createdAt": "2025-12-24T01:04:19.043Z"
    },
    {
      "_id": "69445abbf3081a910739e18c",
      "project_title": "My Project",
      "status": "pending",
      "createdAt": "2025-12-18T19:49:15.126Z"
    }
  ]
}

curl

curl --location 'https://api.narris.io/api/ms-tts-w-vc/logs?page=1&limit=20' \
--header 'Content-Type: application/json' \
--header 'x-api-key: YOUR_API_KEY'
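
To walk through every page of results, a client can keep requesting pages until it has covered the reported total. The sketch below assumes pages are numbered from 1 and that total counts records across all pages, as the sample response suggests. fetch_page is a hypothetical caller-supplied function that performs the GET request and returns the decoded JSON body.

```python
import math


def iter_all_logs(fetch_page, limit=20):
    """Yield every log by walking pages until `total` records are covered.

    `fetch_page(page, limit)` is supplied by the caller and must return the
    decoded JSON body of GET /api/ms-tts-w-vc/logs for that page.
    """
    page = 1
    while True:
        body = fetch_page(page, limit)
        logs = body.get("logs", [])
        yield from logs
        last_page = math.ceil(body.get("total", 0) / limit)
        # Stop on an empty page as well, in case `total` is stale or missing.
        if not logs or page >= last_page:
            break
        page += 1
```

Passing the fetch function in keeps the paging logic independent of the HTTP client and easy to exercise with canned responses.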

Fetch Multi-Speaker Text to Speech Voice Cloning By ID

GET

/api/ms-tts-w-vc/{log_id}

This endpoint returns the complete details of a multi-speaker text to speech request generated using cloned voice avatars.

The response includes speaker segments, speaker-to-avatar mappings, synthesis parameters, and the current processing status.


Path Parameters

log_id (string, required)

Unique identifier of the multi-speaker text to speech voice cloning request returned during creation or from logs.

Example: "694b0917bca52cafa955757c"


Response

On success, the API returns detailed information about the multi-speaker text to speech voice cloning request.
If synthesis fails, the status field will be set to failed and no audio output will be generated.

{
  "_id": "694b0917bca52cafa955757c",
  "project_title": "My Project",
  "segments": [
    {
      "original_speaker": "SPEAKER_00",
      "speed": null,
      "pitch": null,
      "original_text": "regardless of my",
      "start_time": "00:00:00,070",
      "end_time": "00:00:00,910",
      "translated_text": "मेरी परवाह किए बिना"
    }
  ],
  "speakers_mapping": {
    "SPEAKER_00": {
      "avatar": "daaji",
      "speed": 1,
      "pitch": 0
    }
  },
  "language": "hindi",
  "status": "failed",
  "createdAt": "2025-12-23T21:26:47.031Z",
  "updatedAt": "2025-12-23T21:26:49.661Z"
}

curl

curl --location 'https://api.narris.io/api/ms-tts-w-vc/694b0917bca52cafa955757c' \
--header 'Content-Type: application/json' \
--header 'x-api-key: YOUR_API_KEY'

Notes for Developers

• Multi-speaker text to speech voice cloning requests are processed asynchronously. Always store the returned log_id to track generation status.
• Each speaker segment is synthesized independently while preserving speaker identity and timing alignment.
• Voice cloning jobs may remain in the pending or processing state longer than standard TTS jobs, depending on the number of speakers and audio length.
• Use the Fetch Multi-Speaker Text to Speech Voice Cloning List endpoint to view all jobs and Fetch Multi-Speaker Text to Speech Voice Cloning By ID to retrieve the generated audio output per speaker.
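
Since jobs complete asynchronously, a client typically polls the By ID endpoint until the job settles. The following is a minimal polling sketch under stated assumptions: finished and failed are treated as the only terminal statuses (they are the ones shown in the sample responses), the interval and timeout values are arbitrary, and fetch_log is a hypothetical caller-supplied function that returns the decoded JSON for a given log_id.

```python
import time

# Statuses seen in the examples; treating these two as terminal is an assumption.
TERMINAL_STATUSES = {"finished", "failed"}


def wait_for_job(fetch_log, log_id, interval=5.0, timeout=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = clock() + timeout
    while True:
        log = fetch_log(log_id)
        if log.get("status") in TERMINAL_STATUSES:
            return log
        if clock() >= deadline:
            raise TimeoutError(f"job {log_id} still {log.get('status')!r}")
        sleep(interval)
```

The caller should still inspect the returned status, since a failed job returns normally but carries no audio output.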