If a word is not being recognized correctly, add it to the table under the selected language so the system can learn it better.
Provide clean, studio-quality audio with a duration of less than 12 seconds.
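The duration limit above can be checked before uploading. The following is a minimal sketch that assumes the reference clip is an uncompressed WAV file; the function names and the `MAX_SECONDS` constant are illustrative and are not part of any Narris API. It checks duration only, not audio quality.

```python
import wave

MAX_SECONDS = 12.0  # limit taken from the tip above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_short_enough(path: str, limit: float = MAX_SECONDS) -> bool:
    """Return True when the clip is strictly shorter than the limit."""
    return wav_duration_seconds(path) < limit
```

For other formats such as MP3, a different decoder would be needed; the standard-library `wave` module reads WAV files only.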
TTT Model:
Narris
Standard text-to-text translation model. Numbers might remain as digits. Works with any language, and the output should contain almost no hallucination.
Narris-hybrid
Text-to-text translation model that writes numbers out in words (for example, 15 becomes fifteen). Works with any language, and the output should contain almost no hallucination.
Narris-turbo
Text-to-text translation model that writes numbers out in words (for example, 15 becomes fifteen). Works with any language. The output may contain slight hallucination, but accuracy is very high.
Narris-super
Expect more accurate output from this model when the target language is Assamese, Bengali, Bodo, Dogri, Gujarati, English, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, or Urdu. Numbers might remain as digits, and the model is prone to hallucination.
Narris-super-hybrid
Expect more accurate output from this model when the target language is Assamese, Bengali, Bodo, Dogri, Gujarati, English, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, or Urdu. Numbers might remain as digits, and the model is prone to hallucination.
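The TTT model descriptions above can be summarized as a small lookup. The following is a sketch only: the class, the function, and the risk labels ("minimal", "slight", "prone") are assumptions introduced for illustration, and the model names are copied as displayed here, which may differ from the identifiers an API expects.

```python
from dataclasses import dataclass

# Target languages listed for the two "super" models in the descriptions above.
SUPER_LANGUAGES = frozenset({
    "assamese", "bengali", "bodo", "dogri", "gujarati", "english", "hindi",
    "kannada", "kashmiri", "konkani", "maithili", "malayalam", "manipuri",
    "marathi", "nepali", "odia", "punjabi", "sanskrit", "santali", "sindhi",
    "tamil", "telugu", "urdu",
})

# Hallucination levels ordered from least to most risk (labels are assumptions).
RISK_ORDER = ("minimal", "slight", "prone")


@dataclass(frozen=True)
class TttModel:
    name: str
    verbose_numbers: bool  # True: numbers are written out (15 -> fifteen)
    risk: str              # one of RISK_ORDER
    listed_only: bool      # True: described as more accurate for SUPER_LANGUAGES


MODELS = (
    TttModel("Narris", False, "minimal", False),
    TttModel("Narris-hybrid", True, "minimal", False),
    TttModel("Narris-turbo", True, "slight", False),
    TttModel("Narris-super", False, "prone", True),
    TttModel("Narris-super-hybrid", False, "prone", True),
)


def candidate_models(target_language: str, verbose_numbers: bool,
                     max_risk: str = "minimal") -> list[str]:
    """Return names of models whose description fits the request."""
    limit = RISK_ORDER.index(max_risk)
    language = target_language.strip().casefold()
    names = []
    for model in MODELS:
        if model.verbose_numbers != verbose_numbers:
            continue  # number formatting does not match
        if RISK_ORDER.index(model.risk) > limit:
            continue  # more hallucination risk than the caller accepts
        if model.listed_only and language not in SUPER_LANGUAGES:
            continue  # "super" models are recommended only for listed languages
        names.append(model.name)
    return names
```

For example, `candidate_models("French", True, "slight")` returns `["Narris-hybrid", "Narris-turbo"]`. Because the descriptions say numbers "might" remain as-is, the `verbose_numbers=False` entries are an approximation.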
Ensure the uploaded document language matches the source language selected in the dropdown.
Lowering the pitch below 0 makes the voice sound deeper, heavier, and more “masculine”. Raising it above 0 makes the voice sound thinner, brighter, and more “feminine”. If the pitch is pushed too low or too high, the audio might become distorted or sound unnatural.
When selecting an avatar from the avatar dropdown, choose the target language that matches the language suffix in the avatar's name; this produces good-quality output.
For instant speakers, the output is more natural and realistic when the target language matches the language in which the voice was cloned.
Select match duration when the input file format is SRT. For DOCX and TXT files, it might not work at all.
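The rule above can be expressed as a small guard. This is a minimal sketch: `match_duration_allowed` is a hypothetical helper name, not part of any Narris API, and it assumes the input format can be inferred from the file extension.

```python
from pathlib import Path


def match_duration_allowed(filename: str) -> bool:
    """Return True only for SRT input, the format the tip above recommends."""
    return Path(filename).suffix.lower() == ".srt"
```

A caller could use it to disable the match duration option for DOCX and TXT uploads.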
Follow the required file format when using multi-speaker text-to-text. If you are unsure of the format, a better approach is to use the subtitle input feature.
Prefer “SRT” as the output file format; it gives more accurate output with less hallucination.
For more accurate speech-to-speech voice cloning generation, follow the flow below:
STT Model:
Narris Edge
More accurate when the source language is Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Konkani, Kashmiri, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, or Urdu.
Narris Fast
Works for all languages, but accuracy might be slightly compromised.
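The choice between the two STT models can be sketched as a lookup on the source language. The function name and the lower-cased language set are assumptions made for illustration; the model names are copied as displayed above and may differ from the identifiers an API expects.

```python
# Source languages listed for Narris Edge in the description above.
EDGE_LANGUAGES = frozenset({
    "assamese", "bengali", "bodo", "dogri", "gujarati", "hindi", "kannada",
    "konkani", "kashmiri", "maithili", "malayalam", "manipuri", "marathi",
    "nepali", "odia", "punjabi", "sanskrit", "santali", "sindhi", "tamil",
    "telugu", "urdu",
})


def pick_stt_model(source_language: str) -> str:
    """Prefer Narris Edge for its listed languages, otherwise Narris Fast."""
    if source_language.strip().casefold() in EDGE_LANGUAGES:
        return "Narris Edge"
    return "Narris Fast"
```

Note that English is not in the Narris Edge list, so `pick_stt_model("English")` falls back to `"Narris Fast"`.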
Narris Noto Sans is the most language-compatible font. Other fonts are experimental, so use them cautiously.
Image-to-video expects a clean image with a single speaker clearly visible. If multiple speakers or persons are present, the generated video might contain hallucinated content, or generation might fail after a long wait.
Image to Video Model:
narris_i2v
Standard output at a lower cost. Lip-sync quality might not be great, but it is acceptable if the image quality is good.
narris_i2v_pro
Good lip-sync quality with no movement other than mild head movement.
narris_i2v_pro_beta
Good lip-sync quality with visible head, body, and hand movement.
Video Translation Model:
narris_a2v
No lip-sync; the translated audio is simply patched onto the video.
narris_v2v
For videos with a single speaker per frame. This model is cheaper and provides decent lip-sync quality.
narris_v2v_pro
For videos with a single speaker or multiple speakers (beta) per frame. Provides good lip-sync quality.
narris_v2v_pro_max
For videos with a single speaker or multiple speakers (beta) per frame. Provides high-quality lip-sync output (highest cost).
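The trade-offs among the video translation models can be sketched as a simple decision function. The function and its parameters are hypothetical and exist only to illustrate the descriptions above; the model names are the ones listed in this section.

```python
def pick_video_model(lip_sync: bool, multi_speaker: bool = False,
                     highest_quality: bool = False) -> str:
    """Map the requirements described above to a video translation model."""
    if not lip_sync:
        # Translated audio is patched onto the video without lip-sync.
        return "narris_a2v"
    if highest_quality:
        # Highest lip-sync quality and highest cost; single or multi (beta) speaker.
        return "narris_v2v_pro_max"
    if multi_speaker:
        # Multi-speaker support is described as beta.
        return "narris_v2v_pro"
    # Cheaper option for a single speaker per frame.
    return "narris_v2v"
```

This sketch always picks the cheapest model that meets the stated requirements, so a single-speaker video that needs better than decent lip-sync would have to pass `highest_quality=True` or select `narris_v2v_pro` manually.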