API Docs
Generate speaker-separated speech output using cloned voices from multi-speaker text input.
/api/ms-tts-w-vc
This endpoint converts multi-speaker text segments into speech using cloned voices. Each speaker is mapped to a specific voice avatar and synthesized independently.
The request is processed asynchronously. Once accepted, the API returns a unique log_id that can be used to track synthesis progress and retrieve the generated audio output.
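The submit-then-track flow described above could be sketched in Python roughly as follows. This is a minimal illustration, not an official client: the host, path, and headers are copied from the curl example in this document, while the names `BASE_URL`, `build_submit_request`, and `submit_job` are hypothetical helpers introduced here. The payload is assumed to follow the Request Body fields documented for this endpoint.

```python
import json
import urllib.request

# Host taken from the curl examples in this document.
BASE_URL = "https://api.narris.io"


def build_submit_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for /api/ms-tts-w-vc."""
    # ensure_ascii=False keeps non-ASCII text (e.g. Hindi) readable in the body.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/ms-tts-w-vc",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "x-api-key": api_key},
    )


def submit_job(api_key: str, payload: dict, timeout: float = 30.0) -> str:
    """Send the request and return the log_id from the JSON response."""
    request = build_submit_request(api_key, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        result = json.load(response)
    # The documented success response is {"log_id": "..."}.
    return result["log_id"]
```

Store the returned `log_id`; it is the only handle for checking the job later.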
Request Body
project_title (string, required)
A human-readable title to identify the multi-speaker text to speech project.
Example: "My Project"
segments (array, required)
List of speaker segments, each containing the speaker identifier, the original text, the text to synthesize, and timing.
Example: [
{
"original_speaker": "SPEAKER_00",
"original_text": "regardless of my",
"translated_text": "मेरी परवाह किए बिना",
"start_time": "00:00:00,070",
"end_time": "00:00:00,910"
}
]
segments[].original_speaker (string, required)
Identifier of the speaker for the segment (for example: SPEAKER_00).
segments[].original_text (string, required)
Original text spoken by the speaker in this segment.
segments[].translated_text (string, required)
Text that will be synthesized into speech for the speaker.
segments[].start_time (string, required)
Start timestamp of the segment in HH:MM:SS,mmm format (milliseconds after the comma), for example "00:00:00,070".
segments[].end_time (string, required)
End timestamp of the segment in HH:MM:SS,mmm format (milliseconds after the comma), for example "00:00:00,910".
speakers_mapping (object, required)
Mapping between speaker identifiers and their corresponding voice cloning configuration.
Example: {
"SPEAKER_00": {
"avatar": "daaji",
"speed": 1,
"pitch": 0
}
}
speakers_mapping.{speaker}.avatar (string, required)
Voice cloning avatar to be used for the speaker.
Example: "daaji"
speakers_mapping.{speaker}.speed (number, optional)
Speech speed for the speaker (default: 1).
speakers_mapping.{speaker}.pitch (number, optional)
Speech pitch for the speaker (default: 0).
stt_log_id (string, optional)
ID of a Speech-to-Text job. Provide it only when video output is needed; omit it for audio-only output.
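As an illustration of how the fields above fit together, the following Python sketch assembles a request body and checks it locally before submission. The helper names (`format_timestamp`, `build_payload`) are hypothetical. Two checks are assumptions rather than documented API behaviour: that the millisecond part always has three digits (inferred from the examples), and that every segment's speaker must have an entry in speakers_mapping.

```python
import re
from typing import Optional

# Assumed from the documented examples: two-digit fields, three-digit milliseconds.
TIMESTAMP_RE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}$")
REQUIRED_SEGMENT_KEYS = (
    "original_speaker", "original_text", "translated_text", "start_time", "end_time",
)


def format_timestamp(total_ms: int) -> str:
    """Convert a millisecond offset to an HH:MM:SS,mmm string."""
    if total_ms < 0:
        raise ValueError("timestamp must be non-negative")
    seconds, ms = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"


def build_payload(project_title: str, segments: list, speakers_mapping: dict,
                  stt_log_id: Optional[str] = None) -> dict:
    """Assemble the request body, raising ValueError on obvious mistakes."""
    for index, segment in enumerate(segments):
        missing = [key for key in REQUIRED_SEGMENT_KEYS if key not in segment]
        if missing:
            raise ValueError(f"segment {index} is missing {missing}")
        for key in ("start_time", "end_time"):
            if not TIMESTAMP_RE.match(segment[key]):
                raise ValueError(f"segment {index} has a malformed {key}")
        # Assumption: each speaker used in a segment needs a mapping entry.
        if segment["original_speaker"] not in speakers_mapping:
            raise ValueError(f"segment {index} uses an unmapped speaker")
    for speaker, config in speakers_mapping.items():
        if "avatar" not in config:
            raise ValueError(f"speaker {speaker} has no avatar")
    payload = {
        "project_title": project_title,
        "segments": segments,
        "speakers_mapping": speakers_mapping,
    }
    if stt_log_id is not None:  # omit the field entirely for audio-only output
        payload["stt_log_id"] = stt_log_id
    return payload
```

For example, `format_timestamp(70)` gives `"00:00:00,070"`, matching the start_time shown in the segment example.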
Response
On successful submission, the API returns a unique log_id.
Use this log_id with the Fetch Multi-Speaker Text to Speech By ID endpoint to retrieve the generated audio output per speaker.
{
"log_id": "695036727d5247d58c029ea1"
}
curl
curl --location 'https://api.narris.io/api/ms-tts-w-vc' \
--header 'Content-Type: application/json' \
--header 'x-api-key: YOUR_API_KEY' \
--data '{
"project_title": "My Project",
"segments": [
{
"original_speaker": "SPEAKER_00",
"speed": null,
"pitch": null,
"original_text": "regardless of my",
"start_time": "00:00:00,070",
"end_time": "00:00:00,910",
"translated_text": "मेरी परवाह किए बिना"
}
],
"speakers_mapping": {
"SPEAKER_00": {
"avatar": "daaji",
"speed": 1,
"pitch": 0
}
},
"language": "hindi",
"stt_log_id": "694b0857bca52cafa9557563"
}'
/api/ms-tts-w-vc/logs
This endpoint returns a paginated list of previously created multi-speaker text to speech requests generated using cloned voices.
Each entry represents a voice cloning job and includes its current status, speaker-to-avatar mappings, timestamps, and output files (if available).
Query Parameters
page (number, optional)
Page number for pagination.
Example: 1
limit (number, optional)
Number of records to return per page.
Example: 20
Response
On success, the API returns a paginated list of multi-speaker text to speech voice cloning logs.
Each log contains a unique _id which can be used with the Fetch Multi-Speaker Text to Speech By ID endpoint to retrieve detailed synthesis results and generated media.
{
"total": 11,
"page": 1,
"limit": 20,
"logs": [
{
"_id": "695036727d5247d58c029ea1",
"project_title": "My Project",
"speakers_mapping": {
"SPEAKER_00": {
"avatar": "daaji",
"speed": 1,
"pitch": 0
}
},
"status": "finished",
"createdAt": "2025-12-27T19:41:38.320Z",
"finishedAt": "2025-12-27T19:41:44.816Z",
"output_audio_file": "https://lingui-dev.s3.amazonaws.com/audio_upload_video_feature/output/20251227194141_20251227194138.wav",
"output_file": "https://lingui-dev.s3.amazonaws.com/video_features/output/20251227194143_20251227194141.mp4"
},
{
"_id": "694b3c1399f08a2ce975b2c7",
"project_title": "My Project",
"speakers_mapping": {
"SPEAKER_00": {
"avatar": "daaji",
"speed": 1,
"pitch": 0
}
},
"status": "failed",
"createdAt": "2025-12-24T01:04:19.043Z"
},
{
"_id": "69445abbf3081a910739e18c",
"project_title": "My Project",
"status": "pending",
"createdAt": "2025-12-18T19:49:15.126Z"
}
]
}
curl
curl --location 'https://api.narris.io/api/ms-tts-w-vc/logs?page=1&limit=20' \
--header 'Content-Type: application/json' \
--header 'x-api-key: YOUR_API_KEY'
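To walk through every page of the log list, a client can keep requesting pages until it has seen `total` records. The Python sketch below assumes that `total` counts all matching logs across pages (the sample response does not state this explicitly). The names `fetch_logs_page` and `iter_logs` are hypothetical; the URL and headers are copied from the curl example above.

```python
import json
import urllib.request
from typing import Callable, Iterator
from urllib.parse import urlencode

# Host taken from the curl examples in this document.
BASE_URL = "https://api.narris.io"


def fetch_logs_page(api_key: str, page: int, limit: int) -> dict:
    """Fetch one page of logs; page and limit are sent as query parameters."""
    query = urlencode({"page": page, "limit": limit})
    request = urllib.request.Request(
        url=f"{BASE_URL}/api/ms-tts-w-vc/logs?{query}",
        headers={"Content-Type": "application/json", "x-api-key": api_key},
    )
    with urllib.request.urlopen(request, timeout=30.0) as response:
        return json.load(response)


def iter_logs(fetch_page: Callable[[int, int], dict], limit: int = 20) -> Iterator[dict]:
    """Yield every log entry, requesting pages until `total` is covered."""
    page = 1
    while True:
        result = fetch_page(page, limit)
        logs = result.get("logs", [])
        yield from logs
        # Stop on an empty page or once the assumed overall total is reached.
        if not logs or page * limit >= result.get("total", 0):
            break
        page += 1
```

A call such as `iter_logs(lambda page, limit: fetch_logs_page("YOUR_API_KEY", page, limit))` would stream all entries. Passing the fetch function in as an argument keeps the paging logic testable without network access.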
/api/ms-tts-w-vc/{log_id}
This endpoint returns the complete details of a multi-speaker text to speech request generated using cloned voice avatars.
The response includes speaker segments, speaker-to-avatar mappings, synthesis parameters, and the current processing status.
Path Parameters
log_id (string, required)
Unique identifier of the multi-speaker text to speech voice cloning request returned during creation or from logs.
Example: "694b0917bca52cafa955757c"
Response
On success, the API returns detailed information about the multi-speaker text to speech voice cloning request.
If synthesis fails, the status field will be set to failed and no audio output will be generated.
{
"_id": "694b0917bca52cafa955757c",
"project_title": "My Project",
"segments": [
{
"original_speaker": "SPEAKER_00",
"speed": null,
"pitch": null,
"original_text": "regardless of my",
"start_time": "00:00:00,070",
"end_time": "00:00:00,910",
"translated_text": "मेरी परवाह किए बिना"
}
],
"speakers_mapping": {
"SPEAKER_00": {
"avatar": "daaji",
"speed": 1,
"pitch": 0
}
},
"language": "hindi",
"status": "failed",
"createdAt": "2025-12-23T21:26:47.031Z",
"updatedAt": "2025-12-23T21:26:49.661Z"
}
curl
curl --location 'https://api.narris.io/api/ms-tts-w-vc/694b0917bca52cafa955757c' \
--header 'Content-Type: application/json' \
--header 'x-api-key: YOUR_API_KEY'
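A client will typically reduce the detailed record to the few fields it acts on. The Python sketch below builds the by-ID URL and extracts the status and output links. It assumes that a finished job carries the same `output_audio_file` and `output_file` fields shown in the log list sample, and that `output_file` is the video output (the sample value ends in .mp4); neither is confirmed by the by-ID sample, which shows a failed job. The names `job_url` and `summarize_job` are hypothetical.

```python
from urllib.parse import quote

# Host taken from the curl examples in this document.
BASE_URL = "https://api.narris.io"


def job_url(log_id: str) -> str:
    """Build the URL for fetching one job; log_id is part of the path."""
    return f"{BASE_URL}/api/ms-tts-w-vc/{quote(log_id, safe='')}"


def summarize_job(job: dict) -> dict:
    """Pick out the status and any output links from a job record."""
    return {
        "id": job.get("_id"),
        "status": job.get("status"),
        # Assumed field names, copied from the log list sample; absent on failure.
        "audio_url": job.get("output_audio_file"),
        "video_url": job.get("output_file"),
    }
```

For the failed job shown in the sample response, both URL entries in the summary would be `None`.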
• Multi-speaker text to speech voice cloning requests are processed asynchronously. Always store the returned log_id to track generation status.
• Each speaker segment is synthesized independently while preserving speaker identity and timing alignment.
• Voice cloning jobs may remain in the pending or processing state longer than standard TTS jobs, depending on the number of speakers and the audio length.
• Use the Fetch Multi-Speaker Text to Speech List endpoint to view all jobs and Fetch Multi-Speaker Text to Speech By ID to retrieve the generated audio output per speaker.
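Because jobs are asynchronous, tracking one usually means polling the by-ID endpoint until the status settles. The Python sketch below treats `finished` and `failed` as terminal, based on the status values that appear in this document. The polling interval and timeout are arbitrary assumptions, and `wait_for_job` is a hypothetical helper. The `fetch_job` argument stands for any function that returns the by-ID JSON for a given log_id.

```python
import time
from typing import Callable

# Status values seen in this document that indicate the job will not change further.
TERMINAL_STATUSES = {"finished", "failed"}


def wait_for_job(fetch_job: Callable[[str], dict], log_id: str,
                 interval: float = 5.0, timeout: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the job is finished or failed, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(log_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {log_id} still {job.get('status')!r} after {timeout}s")
        # Jobs with many speakers or long audio may need a generous timeout.
        sleep(interval)
```

The caller should still inspect the returned `status`, since a failed job ends the wait just as a finished one does.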