Speech Synthesis Markup Language (SSML)

SSML is currently supported only for Standard Voices, not for 11Labs voices.

Integrating SSML (Speech Synthesis Markup Language) into the Storyboard API request allows for more customized audio responses, offering control over various aspects such as pauses, pronunciation of acronyms, dates, times, abbreviations, and even censoring specific text.

Here's a basic example of how you might structure an API request with SSML:

//SAMPLE Storyboard Request object.
{
    "videoName": "Santa Claus", //Desired video file name
    "videoDescription": "Santa Claus is coming to town", //*optional*
    "language": "en", //Text language: English only
    "scenes": [
        {
            "text": "This is an example <break time=\"500ms\"/> of SSML integration. <emphasis level=\"strong\">Strong emphasis</emphasis> can be used for important phrases. You can also specify abbreviations and <say-as interpret-as=\"date\">2024-04-09</say-as> dates."
        }
    ]
}

In the above example:

  1. <break> is used to introduce a pause of 500 milliseconds.
  2. <emphasis> is employed for strong emphasis on the phrase "Strong emphasis."
  3. <say-as> is used to interpret the text "2024-04-09" as a date.
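
Escaping the quotes inside SSML attributes by hand is error-prone. The following is a minimal Python sketch, assuming the request body is plain JSON with the field names shown in the sample above. The helper name build_storyboard_request is hypothetical, and no endpoint or authentication is shown because this section does not specify them.

```python
import json


def build_storyboard_request(video_name, scene_texts, description=None, language="en"):
    """Build the request body as a dict; field names follow the sample above."""
    body = {
        "videoName": video_name,
        "language": language,
        "scenes": [{"text": text} for text in scene_texts],
    }
    if description is not None:
        body["videoDescription"] = description  # optional field
    return body


ssml_text = (
    'This is an example <break time="500ms"/> of SSML integration. '
    '<emphasis level="strong">Strong emphasis</emphasis> can be used for important phrases.'
)

body = build_storyboard_request("Santa Claus", [ssml_text], "Santa Claus is coming to town")

# json.dumps escapes the double quotes inside the SSML attributes automatically.
payload = json.dumps(body, indent=4)
print(payload)
```

Writing the SSML with ordinary quotes and leaving the escaping to the JSON serializer keeps the scene text readable and avoids invalid escape sequences in the request.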

Including SSML in the Storyboard API request provides developers with greater flexibility and control over the synthesized speech output, resulting in more natural and tailored audio responses for their applications.

The following SSML tags are supported:

  1. <say-as>
  2. <prosody>
  3. <sub>
  4. <emphasis>
  5. <break>

You can pass any of these tags in the 'text' attribute of the Storyboard API request.
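
Because only these five tags are supported, it can help to check scene text before sending it. The sketch below is one possible pre-flight check in Python; the function name unsupported_tags is hypothetical, and the check is a simple regular-expression scan, not a full XML parser, so it only looks at tag names.

```python
import re

SUPPORTED_TAGS = {"say-as", "prosody", "sub", "emphasis", "break"}

# Captures the tag name of any opening, closing, or self-closing tag.
TAG_NAME = re.compile(r"</?([A-Za-z][\w:-]*)")


def unsupported_tags(text):
    """Return the sorted tag names found in text that are not in the supported list."""
    found = {name.lower() for name in TAG_NAME.findall(text)}
    return sorted(found - SUPPORTED_TAGS)


print(unsupported_tags('Hello <break time="500ms"/> world'))
# []
print(unsupported_tags('<voice name="x">Hi</voice> <sub alias="a">b</sub>'))
# ['voice']
```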

<say-as>

The <say-as> tag is used to specify how a text-to-speech (TTS) engine should pronounce or interpret certain text. It's particularly useful for indicating how numbers, dates, times, or other types of data should be spoken aloud by the TTS engine. The <say-as> element has one required attribute, interpret-as, which determines how the value is spoken.

The interpret-as attribute supports the following values, each shown with an example:

  1. Currency: <say-as interpret-as='currency'>$42.01</say-as>
     Spoken as "forty-two dollars and one cent".
  2. Telephone: <say-as interpret-as='telephone' google:style='zero-as-zero'>1800-202-1212</say-as>
     Spoken as "one eight zero zero two zero two one two one two".
  3. Verbatim or Spell-out: <say-as interpret-as="verbatim">abcdefg</say-as>
     Spelled out letter by letter as "a b c d e f g".
  4. Date: <say-as interpret-as="date" format="yyyymmdd" detail="1">1960-09-10</say-as>
     Spoken as "The tenth of September, nineteen sixty".
  5. Characters: <say-as interpret-as="characters">can</say-as>
     Spoken as "C A N".
  6. Cardinal: <say-as interpret-as="cardinal">12345</say-as>
     Spoken as "Twelve thousand three hundred forty five".
  7. Ordinal: <say-as interpret-as="ordinal">1</say-as>
     Spoken as "First".
  8. Fraction: <say-as interpret-as="fraction">5+1/2</say-as>
     Spoken as "five and a half".
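
When scene text is generated programmatically, a small helper can build <say-as> elements consistently. This is a minimal Python sketch, not part of the Storyboard API: the function name say_as is hypothetical, and it only assembles the markup described above, using the standard library to escape special characters.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(value, interpret_as, **attrs):
    """Wrap value in a <say-as> element; extra keyword arguments become attributes."""
    parts = [f"interpret-as={quoteattr(interpret_as)}"]
    for name, attr_value in attrs.items():
        parts.append(f"{name}={quoteattr(str(attr_value))}")
    return f"<say-as {' '.join(parts)}>{escape(str(value))}</say-as>"


print(say_as("$42.01", "currency"))
# <say-as interpret-as="currency">$42.01</say-as>
print(say_as("1960-09-10", "date", format="yyyymmdd", detail=1))
# <say-as interpret-as="date" format="yyyymmdd" detail="1">1960-09-10</say-as>
```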

<prosody>

The <prosody> tag in SSML (Speech Synthesis Markup Language) allows developers to control the prosody, or rhythm and intonation, of synthesized speech. It enables fine-tuning of parameters such as pitch, rate, and volume to make the speech sound more natural and expressive.

Here's how the <prosody> tag works with its attributes:

Pitch (pitch attribute):

Changes the pitch of the speech. It accepts a percentage change from the default pitch or, as in the example below, a predefined label such as x-high.

Example:

<prosody pitch="x-high">This is spoken with high pitch.</prosody>

Rate (rate attribute):

Adjusts the speaking rate. Values can be a percentage change from the default rate or, as in the example below, a predefined label such as fast.

Example:

<prosody rate="fast">This is spoken quickly.</prosody>

Volume (volume attribute):

Controls the loudness of speech. Values can be expressed as a percentage of the default volume or, as in the example below, as a predefined label such as loud.

Example:

<prosody volume="loud">This is spoken loudly.</prosody>
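
The three attributes can also be set together on a single <prosody> element, as the SSML specification allows. A minimal Python sketch of a helper that emits only the attributes that were provided (the function name prosody is hypothetical, and whether a given voice honors every combination is an assumption to verify against real output):

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, pitch=None, rate=None, volume=None):
    """Wrap text in a <prosody> element, emitting only the attributes that were set."""
    settings = {"pitch": pitch, "rate": rate, "volume": volume}
    attrs = " ".join(
        f"{name}={quoteattr(value)}" for name, value in settings.items() if value is not None
    )
    if not attrs:
        return escape(text)  # nothing to adjust, so no wrapper is needed
    return f"<prosody {attrs}>{escape(text)}</prosody>"


print(prosody("This is spoken quickly and loudly.", rate="fast", volume="loud"))
# <prosody rate="fast" volume="loud">This is spoken quickly and loudly.</prosody>
```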


<sub>

In SSML (Speech Synthesis Markup Language), the <sub> tag is used to specify text that should be spoken as a substitute or replacement for other text. It's typically employed for acronyms, abbreviations, or alternative pronunciations.

Example

The <sub alias="World Wide Web Consortium">W3C</sub> is an international community.

In this example:

<sub> is used to indicate that the abbreviation "W3C" should be spoken as its alias "World Wide Web Consortium."
The alias attribute specifies the replacement text.
When processed by a text-to-speech engine, it would speak the sentence as: "The World Wide Web Consortium is an international community."

The <sub> tag is particularly useful for ensuring correct pronunciation or providing expanded explanations for abbreviations or acronyms that might not be familiar to all listeners.
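
If the same abbreviations recur across many scenes, the substitution can be applied from a lookup table. The sketch below is one way to do this in Python; the function name apply_aliases and the alias dictionary are illustrative assumptions. It is meant to run on plain text before any other SSML tags are added, since it does not skip text that is already inside markup.

```python
import re
from xml.sax.saxutils import quoteattr


def apply_aliases(text, aliases):
    """Wrap each whole-word occurrence of a key in aliases in a <sub> element."""
    if not aliases:
        return text
    # Longest keys first so that a longer abbreviation wins over its prefix.
    keys = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")

    def wrap(match):
        word = match.group(1)
        return f"<sub alias={quoteattr(aliases[word])}>{word}</sub>"

    return pattern.sub(wrap, text)


aliases = {"W3C": "World Wide Web Consortium"}
print(apply_aliases("The W3C is an international community.", aliases))
# The <sub alias="World Wide Web Consortium">W3C</sub> is an international community.
```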


<emphasis>

The <emphasis> tag is used to indicate that a certain word or phrase should be emphasized or spoken with increased prominence by a text-to-speech (TTS) engine.

Example

This is <emphasis level="strong">important</emphasis>.

Supported emphasis levels are strong, moderate, and reduced.
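
Since only three levels are listed, a helper can reject anything else before the request is sent. A short Python sketch (the function name emphasis and the choice of "moderate" as its default argument are assumptions made for illustration):

```python
from xml.sax.saxutils import escape

EMPHASIS_LEVELS = ("strong", "moderate", "reduced")


def emphasis(text, level="moderate"):
    """Wrap text in an <emphasis> element, rejecting levels not listed above."""
    if level not in EMPHASIS_LEVELS:
        raise ValueError(f"unsupported emphasis level: {level!r}")
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


print("This is " + emphasis("important", "strong") + ".")
# This is <emphasis level="strong">important</emphasis>.
```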


<break>

The <break> tag introduces a pause at a chosen point while a sentence is being spoken.

Example

Her name is pronounced as <break time="2000ms"/> Cassy

In this example, when processed by a text-to-speech engine, there will be a pause of 2 seconds before "Cassy" is spoken. When this text is placed inside the JSON request, the double quotes must be escaped as \".
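
The pause can also be generated in code and then serialized into the request. A minimal Python sketch, assuming pause lengths are given in milliseconds as in the examples in this section (the function name pause is hypothetical):

```python
import json


def pause(milliseconds):
    """Return a <break> element for the given pause length in milliseconds."""
    if milliseconds < 0:
        raise ValueError("pause length must not be negative")
    return f'<break time="{int(milliseconds)}ms"/>'


text = f"Her name is pronounced as {pause(2000)} Cassy"
print(text)
# Her name is pronounced as <break time="2000ms"/> Cassy

# Inside a JSON request body, json.dumps escapes the attribute quotes as \".
print(json.dumps({"text": text}))
# {"text": "Her name is pronounced as <break time=\"2000ms\"/> Cassy"}
```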