Tips, Guidelines & Best Practices

Word Dictionary

Add any words that are not recognized correctly to the table, under the relevant language, so that the system can learn them.

Voice Design

Provide clean, studio-quality audio that is shorter than 12 seconds.
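
The length limit can be checked locally before uploading. The following is a minimal sketch, assuming the sample is an uncompressed WAV file; the function names and the `MAX_SECONDS` constant are illustrative and not part of the product. It uses only Python's standard `wave` module.

```python
import wave

MAX_SECONDS = 12.0  # limit taken from the guideline above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_short_enough(path: str, limit: float = MAX_SECONDS) -> bool:
    """Return True when the clip is strictly shorter than the limit."""
    return wav_duration_seconds(path) < limit
```

Other formats such as MP3 would need a different reader; this check says nothing about audio quality or background noise.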

Text

TTT Model:

Narris

Standard text-to-text translation model. Numbers might remain as digits. Works for any language, and the output should have almost no hallucination.

Narris-hybrid

Text-to-text translation model that spells out numbers as words (for example, 15 becomes fifteen). Works for any language, and the output should have almost no hallucination.

Narris-turbo

Text-to-text translation model that spells out numbers as words (for example, 15 becomes fifteen). Works for any language. The output may contain slight hallucination, but accuracy is very high.

Narris-super

Expected to be more accurate when the target language is Assamese, Bengali, Bodo, Dogri, Gujarati, English, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, or Urdu. Numbers might remain as digits, and the model is prone to hallucination.

Narris-super-hybrid

Expected to be more accurate when the target language is Assamese, Bengali, Bodo, Dogri, Gujarati, English, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, or Urdu. Numbers might remain as digits, and the model is prone to hallucination.
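
The model descriptions above can be collected into a small lookup table to narrow the choice. This is a sketch only: the `TTTModel` class, its field names, and `candidate_models` are hypothetical, and the traits are copied from the descriptions on this page rather than from any API. The filter also assumes that the two super models are worth considering only for the listed target languages.

```python
from dataclasses import dataclass

# Target languages listed for Narris-super and Narris-super-hybrid.
SUPER_TARGET_LANGUAGES = frozenset({
    "Assamese", "Bengali", "Bodo", "Dogri", "Gujarati", "English", "Hindi",
    "Kannada", "Kashmiri", "Konkani", "Maithili", "Malayalam", "Manipuri",
    "Marathi", "Nepali", "Odia", "Punjabi", "Sanskrit", "Santali", "Sindhi",
    "Tamil", "Telugu", "Urdu",
})


@dataclass(frozen=True)
class TTTModel:
    name: str
    spells_out_numbers: bool       # True: 15 becomes "fifteen"
    hallucination: str             # wording summarized from this page
    best_for_listed_targets: bool  # True: tuned for SUPER_TARGET_LANGUAGES


# Traits as described on this page (hypothetical structure).
TTT_MODELS = (
    TTTModel("Narris", False, "almost none", False),
    TTTModel("Narris-hybrid", True, "almost none", False),
    TTTModel("Narris-turbo", True, "slight", False),
    TTTModel("Narris-super", False, "prone", True),
    TTTModel("Narris-super-hybrid", False, "prone", True),
)


def candidate_models(target_language: str, spell_out_numbers: bool) -> list[str]:
    """List models whose described traits fit the request."""
    return [
        m.name
        for m in TTT_MODELS
        if m.spells_out_numbers == spell_out_numbers
        and (not m.best_for_listed_targets
             or target_language in SUPER_TARGET_LANGUAGES)
    ]
```

For example, `candidate_models("French", True)` returns `["Narris-hybrid", "Narris-turbo"]`, leaving the trade-off between hallucination risk and accuracy to the user.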


Ensure the uploaded document language matches the source language selected in the dropdown.

Setting the pitch below 0 makes the voice sound deeper, heavier, and more “masculine”. Setting it above 0 makes the voice sound thinner, brighter, and more “feminine”. Pushing the pitch too far in either direction can distort the audio or make it sound unnatural.

When selecting an avatar from the avatar dropdown, check the language suffix in the avatar's name and choose the matching target language to get good-quality output.

For instant speakers, the output sounds more natural and realistic when the target language matches the language in which the voice was cloned.

Use match duration with SRT input files. With DOCX and TXT input, it might not work at all.
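
A plausible reason for this difference (an assumption, not something this page states) is that every SRT cue carries a start and end timestamp, which gives each line a target duration, while DOCX and TXT files carry no timing at all. The sketch below reads those timestamps from SRT text; `cue_durations` is an illustrative helper, not a product function.

```python
import re

# SRT timing line: HH:MM:SS,mmm --> HH:MM:SS,mmm
_TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def _to_ms(hours: int, minutes: int, seconds: int, millis: int) -> int:
    """Convert one timestamp to whole milliseconds."""
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def cue_durations(srt_text: str) -> list[float]:
    """Return the duration in seconds of every cue found in SRT text."""
    durations = []
    for match in _TIMING.finditer(srt_text):
        parts = [int(p) for p in match.groups()]
        start_ms = _to_ms(*parts[:4])
        end_ms = _to_ms(*parts[4:])
        durations.append((end_ms - start_ms) / 1000)
    return durations


SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,500
Hello

2
00:00:04,000 --> 00:00:06,250
World
"""

print(cue_durations(SAMPLE_SRT))  # [2.5, 2.25]
```

Plain text gives an empty list here, because there are no timing lines to match against.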

Follow the required file format when using multi-speaker text-to-text. If the format is unclear, use the subtitle input feature instead.

Speech

Prefer “SRT” as the output file format; it gives more accurate output with less hallucination.

For more accurate speech-to-speech voice cloning generation, follow the flow below:

  • Create an instant voice cloning avatar (under Voice Design) if you do not see the avatar in the avatar dropdown.
  • Perform speech-to-text with SRT selected as the output format.
  • Download the SRT file and pass it to text-to-speech with voice cloning. Enable match duration if the generated audio must match the original duration, though this can sometimes sound unnatural; otherwise, keep it disabled.
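
The flow above can be written as a small checklist builder. This sketch only assembles the steps as text; it does not call any product API, and `speech_to_speech_plan` is a hypothetical name.

```python
def speech_to_speech_plan(avatar_exists: bool,
                          match_duration: bool = False) -> list[str]:
    """Return the ordered steps for speech-to-speech voice cloning."""
    steps = []
    if not avatar_exists:
        # Needed only when the avatar is missing from the dropdown.
        steps.append("Create an instant voice cloning avatar under Voice Design")
    steps.append("Run speech-to-text with SRT as the output format")
    steps.append("Download the SRT file")
    setting = "enabled" if match_duration else "disabled"
    steps.append(
        "Run text-to-speech with voice cloning on the SRT file, "
        f"match duration {setting}"
    )
    return steps


for number, step in enumerate(speech_to_speech_plan(avatar_exists=False), 1):
    print(f"{number}. {step}")
```

Match duration defaults to disabled here because the guideline notes that matching can sound unnatural.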

STT Model:

Narris Edge

More accurate when the source language is Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Konkani, Kashmiri, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, or Urdu.

Narris Fast

Works for all languages, but accuracy might be slightly lower.
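
The choice between the two STT models reduces to a lookup on the source language. A minimal sketch, assuming language names are passed as plain English strings; `suggest_stt_model` is illustrative and not part of the product.

```python
# Source languages listed for Narris Edge on this page.
EDGE_SOURCE_LANGUAGES = frozenset(
    name.casefold()
    for name in (
        "Assamese", "Bengali", "Bodo", "Dogri", "Gujarati", "Hindi",
        "Kannada", "Konkani", "Kashmiri", "Maithili", "Malayalam",
        "Manipuri", "Marathi", "Nepali", "Odia", "Punjabi", "Sanskrit",
        "Santali", "Sindhi", "Tamil", "Telugu", "Urdu",
    )
)


def suggest_stt_model(source_language: str) -> str:
    """Pick Narris Edge for the listed languages, Narris Fast otherwise."""
    if source_language.strip().casefold() in EDGE_SOURCE_LANGUAGES:
        return "Narris Edge"
    return "Narris Fast"
```

Note that English is not in the Narris Edge list, so `suggest_stt_model("English")` falls back to `"Narris Fast"`.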

Video

Narris Noto Sans is the most language-compatible font. Other fonts are experimental, so use them cautiously.

Image-to-video expects a clean image with a single, clearly visible speaker. If multiple people are present, the video might contain hallucinated content or fail after a long wait.

Image to Video Model:

narris_i2v

Standard output at a lower cost. Lip-sync quality might not be great, but it is acceptable when the image quality is good.

narris_i2v_pro

Good lip-sync quality with no movement other than mild head movement.

narris_i2v_pro_beta

Good lip-sync quality with visible head, body, and hand movement.


Video Translation Model:

narris_a2v

No lip-sync; the translated audio is simply patched onto the video.

narris_v2v

For videos with a single speaker per frame. This model is cheaper and provides decent lip-sync quality.

narris_v2v_pro

For videos with a single speaker or multiple speakers (beta) per frame. Provides good lip-sync quality.

narris_v2v_pro_max

For videos with a single speaker or multiple speakers (beta) per frame. Provides high-quality lip-sync output at the highest cost.
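
The four video translation models can be told apart with three questions: whether lip-sync is needed, whether a frame can contain more than one speaker, and whether top quality justifies the highest cost. The sketch below encodes that reading of the descriptions above; the function and its parameters are hypothetical.

```python
def suggest_video_translation_model(lip_sync: bool,
                                    multi_speaker: bool = False,
                                    highest_quality: bool = False) -> str:
    """Map requirements to a model name using this page's descriptions."""
    if not lip_sync:
        # Translated audio is patched onto the video unchanged.
        return "narris_a2v"
    if highest_quality:
        # Best lip-sync, highest cost; single or multi (beta) speaker.
        return "narris_v2v_pro_max"
    if multi_speaker:
        # Multi-speaker support is described as beta.
        return "narris_v2v_pro"
    # Cheaper option, single speaker per frame only.
    return "narris_v2v"
```

For instance, `suggest_video_translation_model(lip_sync=True, multi_speaker=True)` returns `"narris_v2v_pro"`. A single-speaker video can also use the pro models when better lip-sync matters more than cost.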