For Cognitive Services Speech Services, please refer to SPEECH.md.
This guide is for using Web Chat with chat and speech functionality provided by the Direct Line Speech protocol.
We assume you have already set up a Direct Line Speech bot and have Web Chat running on your webpage.
Sample code in this article is optimized for modern browsers. You may need to use a transpiler (e.g. Babel) to target a broader range of browsers.
Direct Line Speech is designed for voice assistant scenarios, such as smart displays, automotive dashboards, and navigation systems with low-latency requirements, built as single-page applications or progressive web apps (PWAs). These apps usually have a highly-customized UI and do not show a conversation transcript.
You can look at our sample 06.recomposing-ui/b.speech-ui and 06.recomposing-ui/c.smart-display for target scenarios.
Direct Line Speech is not recommended for traditional websites where the primary UI is transcript-based.
|  |  | Chrome/Microsoft Edge and Firefox on desktop | Chrome on Android | Safari on macOS/iOS | Web View on Android | Web View on iOS |
| --- | --- | --- | --- | --- | --- | --- |
| STT | Basic recognition | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❌ *1 |
| STT | Custom Speech (Details) | ✔ | ✔ | ✔ | ✔ | ❌ *1 |
| STT | Interims/Partial Recognition | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❌ *1 |
| STT | Select language at initialization | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❌ *1 |
| STT | Input hint | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❌ *1 |
| STT | Select input device | ❌ *3 | ❌ *3 | ❌ *3 | ❌ *3 | ❌ *1 |
| STT | Dynamic priming (Details) | ❌ | ❌ | ❌ | ❌ | ❌ *1 |
| STT | Reference grammar ID (Details) | ❌ | ❌ | ❌ | ❌ | ❌ *1 |
| STT | Select language on-the-fly (Details) | ❌ | ❌ | ❌ | ❌ | ❌ *1 |
| STT | Text normalization options (Details) | ❌ | ❌ | ❌ | ❌ | ❌ *1 |
| STT | Abort recognition (Details) | ❌ | ❌ | ❌ | ❌ | ❌ *1 |
| TTS | Basic synthesis using text | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❓ *2 |
| TTS | Speech Synthesis Markup Language | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❓ *2 |
| TTS | Custom Voice (Details) | ✔ | ✔ | ✔ | ✔ | ❓ *2 |
| TTS | Selecting voice/pitch/rate/volume | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❓ *2 |
| TTS | Override using "speak" property | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❓ *2 |
| TTS | Interrupt synthesis when clicking on microphone button | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ✔ 4.7 | ❓ *2 |
| TTS | Text-to-speech audio format (Details) | ❌ | ❌ | ❌ | ❌ | ❓ *2 |
| TTS | Stripping text from Markdown (Details) | ❌ | ❌ | ❌ | ❌ | ❓ *2 |
| TTS | Adaptive Cards using "speak" property (Details) | ❌ | ❌ | ❌ | ❌ | ❓ *2 |
| TTS | Synthesize activity with multiple attachments (Details) | ❌ | ❌ | ❌ | ❌ | ❓ *2 |
1. Web View on iOS is not a full browser. It does not have audio recording capabilities, which are required for Cognitive Services.
2. As speech recognition is not working (see above), speech synthesis is not tested.
3. Cognitive Services currently has a bug on selecting a different device for audio recording.
   - Fixed in cognitive-services-speech-sdk-js#96
   - Tracking bug at #2481
   - This is fixed in `microsoft-cognitiveservices-speech-sdk@>=1.10.0`
Direct Line Speech does not support Internet Explorer 11. It requires modern browser media capabilities that are not available in IE11.
Direct Line Speech shares the same requirements as Cognitive Services Speech Services. Please refer to SPEECH.md.
Before you start, please create the corresponding Azure resources. You can follow this tutorial for enabling voice in your bot. You do not need to follow the steps for creating the C# client, as you will replace the client with Web Chat.
Please look at our sample 03.speech/a.direct-line-speech for embedding Web Chat in your web app via the Direct Line Speech channel.
You will need to use Web Chat 4.7 or higher for Direct Line Speech.
After setting up Direct Line Speech on Azure Bot Services, there are two steps for using Direct Line Speech: retrieving your credentials, then rendering Web Chat using the Direct Line Speech adapters.
You should always use an authorization token when authorizing with Direct Line Speech.
To secure the conversation, you will need to set up a REST API to generate the credentials. When called, it will return an authorization token and region for your Direct Line Speech channel.
In the following code snippets, we assume sending an HTTP POST request to https://webchat-mockbot-streaming.azurewebsites.net/speechservices/token will return a JSON object with `authorizationToken` and `region`.
```js
const fetchCredentials = async () => {
  const res = await fetch('https://webchat-mockbot-streaming.azurewebsites.net/speechservices/token', {
    method: 'POST'
  });

  if (!res.ok) {
    throw new Error('Failed to fetch authorization token and region.');
  }

  const { authorizationToken, region } = await res.json();

  return { authorizationToken, region };
};
```
Since the token expires after 10 minutes, it is advised to cache it for 5 minutes. You can use either the `Cache-Control` HTTP header on the REST API, or implement a memoization function in the browser.
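For example, here is a minimal memoization sketch, assuming the same mock token endpoint as above and a 5-minute cache window:

```js
// A minimal sketch: cache the credentials in memory for up to 5 minutes.
let cachedCredentials;
let cachedAt = 0;

const fetchCredentials = async () => {
  // Reuse the cached credentials if they are less than 5 minutes old.
  if (!cachedCredentials || Date.now() - cachedAt > 300000) {
    const res = await fetch('https://webchat-mockbot-streaming.azurewebsites.net/speechservices/token', {
      method: 'POST'
    });

    if (!res.ok) {
      throw new Error('Failed to fetch authorization token and region.');
    }

    cachedCredentials = await res.json();
    cachedAt = Date.now();
  }

  const { authorizationToken, region } = cachedCredentials;

  return { authorizationToken, region };
};
```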
After you have the `fetchCredentials` function set up, you can pass it to the `createDirectLineSpeechAdapters` function. This function will return a set of adapters used by Web Chat, including the DirectLineJS adapter and the Web Speech adapter.
```js
const adapters = await window.WebChat.createDirectLineSpeechAdapters({
  fetchCredentials
});

window.WebChat.renderWebChat(
  {
    ...adapters
  },
  document.getElementById('webchat')
);
```
The code above requires transpilation for browsers that do not support the spread operator.
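If you do not want to transpile, a sketch that avoids the spread operator by using `Object.assign` instead:

```js
// A sketch: pass the adapters without using the object spread operator.
window.WebChat.renderWebChat(Object.assign({}, adapters), document.getElementById('webchat'));
```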
These are the options to pass when calling `createDirectLineSpeechAdapters`.
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `audioConfig` | `AudioConfig` | `fromDefaultMicrophoneInput()` | Audio input object to use in the Speech SDK. |
| `audioContext` | `AudioContext` | `window.AudioContext \|\| window.webkitAudioContext` | `AudioContext` used for constructing the audio graph used for speech synthesis. Can be used to prime the Web Audio engine or as a ponyfill. |
| `audioInputDeviceId` | `string` | `undefined` | Device ID of the audio input device. Ignored if `audioConfig` is specified. |
| `fetchCredentials` | `DirectLineSpeechCredentials` | (Required) | An asynchronous function to fetch credentials, including either hostname or region, and either authorization token or subscription key. |
| `speechRecognitionLanguage` | `string` | `window?.navigator?.language \|\| 'en-US'` | Language used for speech recognition. |
| `userID` | `string` | (A random ID) | User ID for all outgoing activities. |
| `username` | `string` | `undefined` | Username for all outgoing activities. |
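For example, a sketch that passes a few of these options; the device ID, language, and user values below are placeholders for illustration:

```js
// A sketch: pick an audio input device, the recognition language, and the user identity.
// All values shown here are placeholders.
const adapters = await window.WebChat.createDirectLineSpeechAdapters({
  audioInputDeviceId: 'default',
  fetchCredentials,
  speechRecognitionLanguage: 'en-US',
  userID: 'user-12345',
  username: 'John Doe'
});
```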
The `DirectLineSpeechCredentials` type accepted by `fetchCredentials` is one of the following shapes:

```ts
type DirectLineSpeechCredentials = {
  authorizationToken: string,
  region: string
} | {
  authorizationToken: string,
  directLineSpeechHostname: string
} | {
  region: string,
  subscriptionKey: string
} | {
  directLineSpeechHostname: string,
  subscriptionKey: string
}
```
For public clouds, we recommend using the `region` option, such as `"westus2"`. For sovereign clouds, you should specify the hostname as an FQDN through the `directLineSpeechHostname` option, such as `"virginia.convai.speech.azure.us"`.
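For example, a `fetchCredentials` sketch for a sovereign cloud might return the hostname instead of the region; `getAuthorizationToken` is a hypothetical helper standing in for your own token retrieval logic:

```js
// A sketch: return a hostname-based credential for a sovereign cloud.
// getAuthorizationToken() is a hypothetical helper that fetches a token from your own service.
const fetchCredentials = async () => ({
  authorizationToken: await getAuthorizationToken(),
  directLineSpeechHostname: 'virginia.convai.speech.azure.us'
});
```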
Please vote on this bug if this behavior is not desirable.
You can specify a user ID when you instantiate Web Chat.

- If you specify a user ID
  - A `conversationUpdate` activity will be sent on connect and on every reconnect, with your user ID specified in the `membersAdded` field.
  - All `message` activities will be sent with your user ID in the `from.id` field.
- If you do not specify a user ID
  - A `conversationUpdate` activity will be sent on connect and on every reconnect. The `membersAdded` field will have a user ID of empty string.
  - All `message` activities will be sent with a randomized user ID.
  - The user ID is kept the same across reconnections.
Please vote on this bug if this behavior is not desirable.
After idling for 5 minutes, the Web Socket connection will be disconnected. If the client is still active, we will try to reconnect. On every reconnect, a `conversationUpdate` activity will be sent.
Please vote on this bug if this behavior is not desirable.
Currently, there are no options to specify different text normalization options, including inverse text normalization (ITN), masked ITN, lexical, and display.
Please vote on this bug if this behavior is not desirable.
Web Chat does not persist conversation information (conversation ID and connection ID). Thus, on every page refresh, a new conversation will be created.
Direct Line Speech does not target a transcript-based experience, so our servers no longer store conversation history. We do not plan to support this feature.
Please vote on this bug if this behavior is not desirable.
When using the text-based experience, we allow developers to piggyback additional information on outgoing messages. This is demonstrated in sample 15.a "piggyback data on every outgoing activity".
With Direct Line Speech, you can no longer piggyback additional data on speech-based outgoing activities.
Please vote on this bug if this behavior is not desirable.
You can only specify speech recognition language at initialization time. You cannot switch speech recognition language while the conversation is active.
Please vote on this bug if this behavior is not desirable.
Proactive messages are not supported when using Direct Line Speech.
Please vote on this bug if this behavior is not desirable.
After the user clicks the microphone button to start speech recognition, they cannot click the microphone button again to abort the recognition. What they have said will continue to be recognized and sent to the bot.
Custom Speech is a feature for developers to train a custom speech model to improve speech recognition for uncommon words. You can set this up using the Speech SDK or in the Azure portal when configuring the Direct Line Speech channel.
Please vote on this bug if this behavior is not desirable.
Dynamic priming (a.k.a. phrase list) is a feature to improve speech recognition of words with similar pronunciations. It is not supported when using Direct Line Speech.
Please vote on this bug if this behavior is not desirable.
Reference grammar ID is a feature to improve speech recognition accuracy when pairing with LUIS. This is not supported when using Direct Line Speech.
Please vote on this bug if this behavior is not desirable.
Custom Voice is a feature for developers to perform synthesis using a custom voice font. This is not supported when using Direct Line Speech.
Please vote on this bug if this behavior is not desirable.
When using Direct Line Speech, you cannot specify the audio quality and format for synthesizing speech.
When the bot sends activities to the user, it can send both plain text and Markdown. If Markdown is sent, the bot should also provide the `speak` field. The `speak` field will be used for speech synthesis and is not displayed to the end user.
Please vote on this bug if this behavior is not desirable.
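For example, in your bot code, a minimal sketch using `MessageFactory.text` to send Markdown for display together with a plain-text `speak` field for synthesis; the message content is illustrative:

```js
// A sketch: the first argument is displayed as Markdown, the second is used for speech synthesis.
await context.sendActivity(MessageFactory.text('**Hello, World!**', 'Hello, World!'));
```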
Attachments are not synthesized. The bot should provide a `speak` field for speech synthesis.
As attachments are not synthesized, the `speak` property in Adaptive Cards is ignored. The bot should provide a `speak` field for speech synthesis.
Please submit a feature request if this behavior is not desirable.
Voice can only be selected using Speech Synthesis Markup Language (SSML). For example, the following bot code will use a Japanese voice "NanamiNeural" for synthesis.
```js
await context.sendActivity(
  MessageFactory.text(
    `Echo: ${context.activity.text}`,
    `<speak
        version="1.0"
        xmlns="https://www.w3.org/2001/10/synthesis"
        xmlns:mstts="https://www.w3.org/2001/mstts"
        xml:lang="en-US"
     >
        <voice name="ja-JP-NanamiNeural">素晴らしい!</voice>
     </speak>`
  )
);
```
Please refer to this article on SSML support.