For Cognitive Services Speech Services, please refer to SPEECH.md.
This guide is for using Web Chat with chat and speech functionality provided by the Direct Line Speech protocol.
We assume you have already set up a Direct Line Speech bot and have Web Chat running on your webpage.
Sample code in this article is optimized for modern browsers. You may need to use a transpiler (e.g. Babel) to target a broader range of browsers.
Direct Line Speech is designed for voice assistant scenarios, such as smart displays, automotive dashboards, and navigation systems with low-latency requirements, built as single-page applications or progressive web apps (PWAs). These apps usually have a highly-customized UI and do not show a conversation transcript.
You can look at our sample 06.recomposing-ui/b.speech-ui and 06.recomposing-ui/c.smart-display for target scenarios.
Direct Line Speech is not recommended for traditional websites where the primary UI is transcript-based.
|  |  | Chrome/Microsoft Edge and Firefox on desktop | Chrome on Android | Safari on macOS/iOS | Web View on Android | Web View on iOS |
| --- | --- | --- | --- | --- | --- | --- |
| STT | Basic recognition | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❌ *1 |
| STT | Custom Speech (Details) | ✔ | ✔ | ✔ | ✔ | ❌ *1 |
| STT | Interims/Partial Recognition | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❌ *1 |
| STT | Select language at initialization | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❌ *1 |
| STT | Input hint | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❌ *1 |
| STT | Select input device | ❌ *3 | ❌ *3 | ❌ *3 | ❌ *3 | ❌ *1 |
| STT | Dynamic priming (Details) | ❌ | ❌ | ❌ | ❌ | ❌ *1 |
| STT | Reference grammar ID (Details) | ❌ | ❌ | ❌ | ❌ | ❌ *1 |
| STT | Select language on-the-fly (Details) | ❌ | ❌ | ❌ | ❌ | ❌ *1 |
| STT | Text normalization options (Details) | ❌ | ❌ | ❌ | ❌ | ❌ *1 |
| STT | Abort recognition (Details) | ❌ | ❌ | ❌ | ❌ | ❌ *1 |
| TTS | Basic synthesis using text | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❓ *2 |
| TTS | Speech Synthesis Markup Language | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❓ *2 |
| TTS | Custom Voice (Details) | ✔ | ✔ | ✔ | ✔ | ❓ *2 |
| TTS | Selecting voice/pitch/rate/volume | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❓ *2 |
| TTS | Override using "speak" property | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❓ *2 |
| TTS | Interrupt synthesis when clicking on microphone button | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❓ *2 |
| TTS | Text-to-speech audio format (Details) | ❌ | ❌ | ❌ | ❌ | ❓ *2 |
| TTS | Stripping text from Markdown (Details) | ❌ | ❌ | ❌ | ❌ | ❓ *2 |
| TTS | Adaptive Cards using "speak" property (Details) | ❌ | ❌ | ❌ | ❌ | ❓ *2 |
| TTS | Synthesize activity with multiple attachments (Details) | ❌ | ❌ | ❌ | ❌ | ❓ *2 |
1. Web View on iOS is not a full browser. It does not have audio recording capabilities, which are required for Cognitive Services.
2. As speech recognition is not working (see above), speech synthesis is not tested.
3. Cognitive Services currently has a bug on selecting a different device for audio recording.
   - Fixed in cognitive-services-speech-sdk-js#96
   - Tracking bug at #2481
   - This is fixed in `microsoft-cognitiveservices-speech-sdk@>=1.10.0`
Direct Line Speech does not support Internet Explorer 11. It requires modern browser media capabilities that are not available in IE11.
Direct Line Speech shares the same requirements as Cognitive Services Speech Services. Please refer to SPEECH.md.
Before you start, please create the corresponding Azure resources. You can follow this tutorial for enabling voice in your bot. You do not need to follow the steps for creating the C# client, as you will replace the client with Web Chat.
Please look at our sample 03.speech/a.direct-line-speech for embedding Web Chat in your web app via the Direct Line Speech channel.
You will need to use Web Chat 4.7 or higher for Direct Line Speech.
After setting up Direct Line Speech on Azure Bot Services, there are two steps for using Direct Line Speech: retrieving your credentials, then rendering Web Chat using the Direct Line Speech adapters.
You should always use an authorization token when authorizing with Direct Line Speech.
To secure the conversation, you will need to set up a REST API to generate the credentials. When called, it will return an authorization token and region for your Direct Line Speech channel.
In the following code snippets, we assume sending an HTTP POST request to https://webchat-mockbot-streaming.azurewebsites.net/speechservices/token will return a JSON object with `authorizationToken` and `region`.
```js
const fetchCredentials = async () => {
  const res = await fetch('https://webchat-mockbot-streaming.azurewebsites.net/speechservices/token', {
    method: 'POST'
  });

  if (!res.ok) {
    throw new Error('Failed to fetch authorization token and region.');
  }

  const { authorizationToken, region } = await res.json();

  return { authorizationToken, region };
};
```
Since the token expires after 10 minutes, it is advised to cache it for 5 minutes. You can use either the `Cache-Control` HTTP header on the REST API, or implement a memoization function in the browser.
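For example, here is a minimal memoization sketch, assuming the same mock token endpoint as above and a 5-minute cache window:

```js
// A minimal sketch: cache the credentials in memory for up to 5 minutes.
let cachedCredentials;
let cachedAt = 0;

const fetchCredentials = async () => {
  // Reuse the cached credentials if they are less than 5 minutes old.
  if (!cachedCredentials || Date.now() - cachedAt > 300000) {
    const res = await fetch('https://webchat-mockbot-streaming.azurewebsites.net/speechservices/token', {
      method: 'POST'
    });

    if (!res.ok) {
      throw new Error('Failed to fetch authorization token and region.');
    }

    cachedCredentials = await res.json();
    cachedAt = Date.now();
  }

  const { authorizationToken, region } = cachedCredentials;

  return { authorizationToken, region };
};
```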
After you have the `fetchCredentials` function set up, you can pass it to the `createDirectLineSpeechAdapters` function. This function will return a set of adapters used by Web Chat, including the DirectLineJS adapter and the Web Speech adapter.
```js
const adapters = await window.WebChat.createDirectLineSpeechAdapters({
  fetchCredentials
});

window.WebChat.renderWebChat(
  {
    ...adapters
  },
  document.getElementById('webchat')
);
```
The code above requires transpilation for browsers that do not support the spread operator.
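If you do not want to transpile, a sketch that avoids the spread operator by using `Object.assign` instead:

```js
// A sketch: pass the adapters without using the object spread operator.
window.WebChat.renderWebChat(Object.assign({}, adapters), document.getElementById('webchat'));
```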
These are the options to pass when calling `createDirectLineSpeechAdapters`.
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `audioConfig` | `AudioConfig` | `fromDefaultMicrophoneInput()` | Audio input object to use in the Speech SDK. |
| `audioContext` | `AudioContext` | `window.AudioContext \|\| window.webkitAudioContext` | `AudioContext` used for constructing the audio graph used for speech synthesis. Can be used to prime the Web Audio engine or as a ponyfill. |
| `audioInputDeviceId` | `string` | `undefined` | Device ID of the audio input device. Ignored if `audioConfig` is specified. |
| `fetchCredentials` | `DirectLineSpeechCredentials` | (Required) | An asynchronous function to fetch credentials, including either hostname or region, and either authorization token or subscription key. |
| `speechRecognitionLanguage` | `string` | `window?.navigator?.language \|\| 'en-US'` | Language used for speech recognition. |
| `userID` | `string` | (A random ID) | User ID for all outgoing activities. |
| `username` | `string` | `undefined` | Username for all outgoing activities. |
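For example, a sketch that passes a few of these options; the device ID, language, and user values below are placeholders for illustration:

```js
// A sketch: pick an audio input device, the recognition language, and the user identity.
// All values shown here are placeholders.
const adapters = await window.WebChat.createDirectLineSpeechAdapters({
  audioInputDeviceId: 'default',
  fetchCredentials,
  speechRecognitionLanguage: 'en-US',
  userID: 'user-12345',
  username: 'John Doe'
});
```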
The `DirectLineSpeechCredentials` type accepted by `fetchCredentials` is one of the following shapes:

```ts
type DirectLineSpeechCredentials = {
  authorizationToken: string,
  region: string
} | {
  authorizationToken: string,
  directLineSpeechHostname: string
} | {
  region: string,
  subscriptionKey: string
} | {
  directLineSpeechHostname: string,
  subscriptionKey: string
}
```
For public clouds, we recommend using the `region` option, such as `"westus2"`. For sovereign clouds, you should specify the hostname as an FQDN through the `directLineSpeechHostname` option, such as `"virginia.convai.speech.azure.us"`.
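For example, a `fetchCredentials` sketch for a sovereign cloud might return the hostname instead of the region; `getAuthorizationToken` is a hypothetical helper standing in for your own token retrieval logic:

```js
// A sketch: return a hostname-based credential for a sovereign cloud.
// getAuthorizationToken() is a hypothetical helper that fetches a token from your own service.
const fetchCredentials = async () => ({
  authorizationToken: await getAuthorizationToken(),
  directLineSpeechHostname: 'virginia.convai.speech.azure.us'
});
```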
Please vote on this bug if this behavior is not desirable.
You can specify a user ID when you instantiate Web Chat.

- If you specify a user ID
  - A `conversationUpdate` activity will be sent on connect and on every reconnect, with your user ID specified in the `membersAdded` field.
  - All `message` activities will be sent with your user ID in the `from.id` field.
- If you do not specify a user ID
  - A `conversationUpdate` activity will be sent on connect and on every reconnect. The `membersAdded` field will have a user ID of empty string.
  - All `message` activities will be sent with a randomized user ID.
  - The user ID is kept the same across reconnections.
Please vote on this bug if this behavior is not desirable.
After idling for 5 minutes, the Web Socket connection will be disconnected. If the client is still active, we will try to reconnect. On every reconnect, a `conversationUpdate` activity will be sent.
Please vote on this bug if this behavior is not desirable.
Currently, there are no options to specify different text normalization options, including inverse text normalization (ITN), masked ITN, lexical, and display.
Please vote on this bug if this behavior is not desirable.
Web Chat does not persist conversation information (conversation ID and connection ID). Thus, on every page refresh, a new conversation will be created.
Direct Line Speech does not target a transcript-based experience, so our servers no longer store conversation history. We do not plan to support this feature.
Please vote on this bug if this behavior is not desirable.
When using the text-based experience, we allow developers to piggyback additional information on outgoing messages. This is demonstrated in sample 15.a "piggyback data on every outgoing activity".
With Direct Line Speech, you can no longer piggyback additional data on speech-based outgoing activities.
Please vote on this bug if this behavior is not desirable.
You can only specify speech recognition language at initialization time. You cannot switch speech recognition language while the conversation is active.
Please vote on this bug if this behavior is not desirable.
Proactive messages are not supported when using Direct Line Speech.
Please vote on this bug if this behavior is not desirable.
After the user clicks the microphone button to start speech recognition, they cannot click the microphone button again to abort the recognition. What they have said will continue to be recognized and sent to the bot.
Custom Speech is a feature for developers to train a custom speech model to improve speech recognition for uncommon words. You can set this up using the Speech SDK or in the Azure portal when configuring the Direct Line Speech channel.
Please vote on this bug if this behavior is not desirable.
Dynamic priming (a.k.a. phrase list) is a feature to improve speech recognition of words with similar pronunciations. It is not supported when using Direct Line Speech.
Please vote on this bug if this behavior is not desirable.
Reference grammar ID is a feature to improve speech recognition accuracy when pairing with LUIS. This is not supported when using Direct Line Speech.
Please vote on this bug if this behavior is not desirable.
Custom Voice is a feature for developers to perform synthesis using a custom voice font. This is not supported when using Direct Line Speech.
Please vote on this bug if this behavior is not desirable.
When using Direct Line Speech, you cannot specify the audio quality and format for synthesizing speech.
When the bot sends activities to the user, it can send both plain text and Markdown. If Markdown is sent, the bot should also provide the `speak` field. The `speak` field will be used for speech synthesis and is not displayed to the end user.
Please vote on this bug if this behavior is not desirable.
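For example, in your bot code, a minimal sketch using `MessageFactory.text` to send Markdown for display together with a plain-text `speak` field for synthesis; the message content is illustrative:

```js
// A sketch: the first argument is displayed as Markdown, the second is used for speech synthesis.
await context.sendActivity(MessageFactory.text('**Hello, World!**', 'Hello, World!'));
```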
Attachments are not synthesized. The bot should provide a `speak` field for speech synthesis.
As attachments are not synthesized, the `speak` property in Adaptive Cards is ignored. The bot should provide a `speak` field for speech synthesis.
Please submit a feature request if this behavior is not desirable.
Voice can only be selected using Speech Synthesis Markup Language (SSML). For example, the following bot code will use a Japanese voice "NanamiNeural" for synthesis.
```js
await context.sendActivity(
  MessageFactory.text(
    `Echo: ${context.activity.text}`,
    `<speak
        version="1.0"
        xmlns="https://www.w3.org/2001/10/synthesis"
        xmlns:mstts="https://www.w3.org/2001/mstts"
        xml:lang="en-US"
     >
        <voice name="ja-JP-NanamiNeural">素晴らしい!</voice>
     </speak>`
  )
);
```
Please refer to this article on SSML support.