-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathindex.html
372 lines (336 loc) · 24.3 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
<!DOCTYPE html>
<html lang="en">
<head>
<title>SSDM: Scalable Speech Dysfluency Modeling</title>
<link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" rel="stylesheet">
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.6.0/jquery.min.js"></script>
<script src="helper-v2.js" defer=""></script>
<style>
body {
background: linear-gradient(to right, #f8f9fa, #e0f7fa);
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
}
h1,
h2 {
font-family: 'Georgia', serif;
color: #007bff;
}
td {
text-align: center;
vertical-align: middle;
padding: 16px;
font-size: 16px;
}
audio {
display: inline-block;
vertical-align: middle;
}
.timestamp-label {
color: gray;
}
table.wide-audio audio {
width: 40vw;
max-width: 40vw;
}
tr:not(:first-child) {
border-top: 1px solid lightgrey;
}
.container {
background: #ffffff;
box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.1);
border-radius: 10px;
padding: 20px;
}
.container h1,
.container h2 {
margin-top: 20px;
}
.pagination .page-link {
color: #007bff;
}
.pagination .page-item.active .page-link {
background-color: #007bff;
border-color: #007bff;
}
.page-link:hover {
background-color: #e9ecef;
}
.fade-in {
animation: fadeIn 2s;
}
@keyframes fadeIn {
0% {
opacity: 0;
}
100% {
opacity: 1;
}
}
.shadow-card {
box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
transition: all 0.3s ease-in-out;
}
.shadow-card:hover {
box-shadow: 0 8px 16px rgba(0, 0, 0, 0.2);
}
th {
background-color: #007bff;
color: white;
padding: 16px;
font-size: 18px;
}
tbody tr:nth-child(even) {
background-color: #f2f2f2;
}
tbody tr:hover {
background-color: #d1ecf1;
}
.header-text {
font-size: 20px;
font-weight: bold;
}
.instruction {
font-size: 14px;
color: #6c757d;
}
.reference-text {
font-size: 12px;
color: #495057;
}
.figure-container {
display: flex;
justify-content: center;
align-items: center;
margin-top: 20px;
}
.figure-container img {
max-width: 100%;
height: auto;
}
@keyframes fadeIn {
from { opacity: 0; }
to { opacity: 1; }
}
</style>
</head>
<body>
<div class="container pt-5 mt-5 text-center fade-in">
<h1>SSDM: Scalable Speech Dysfluency Modeling</h1>
<p>
<b>Abstract. </b> Speech dysfluency modeling is the core module for spoken language learning and speech therapy. However, there are three challenges. First, current state-of-the-art solutions suffer from poor scalability. Second, there is a lack of a large-scale dysfluency corpus. Third, there is not an effective learning framework. In other words, we are at a LeNet moment. In this paper, we propose SSDM: Scalable Speech Dysfluency Modeling, which (1) adopts articulatory gestures as scalable forced alignment; (2) introduces connectionist subsequence aligner (CSA) to achieve dysfluency alignment; (3) introduces a large-scale simulated dysfluency corpus called Libri-Dys; and (4) develops an end-to-end system by leveraging the power of large language models (LLMs). We expect SSDM to serve as a standard in the area of dysfluency modeling.
</p>
<p>
Note that term "dysfluency" and "disfluency" are interchangable.
</p>
</div>
<br><br>
<div class="container pt-5 mt-5 text-center fade-in">
<h2>PPA Speech</h2>
<p>
Primary Progressive Aphasia (PPA) is characterized by progressive impairments in speech and language. PPA can be categorized into three main variants, each with distinct clinical features. (1) Semantic Variant PPA (svPPA). Main Feature: Loss of word meaning. (2) Logopenic Variant PPA (lvPPA). Main Feature: Impairment in word retrieval and sentence repetition. (3) Nonfluent/Agrammatic Variant PPA (nfvPPA). Main Feature: Difficulty in forming grammatically correct sentences and impaired speech production. We provide some examples for nfvPPA.
</p>
</div>
<div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded container-1">
<div class="flex-container">
<div class="table-responsive pt-1">
<table class="table-pagination pt-1 shadow-card" id="rich-captions-1">
<thead>
<tr>
<th>Audio</th>
<th>LTU-AS 13B</th>
<th>SALMONN 13B</th>
<th>ChatGPT</th>
<th>SSDM</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" controlslist="nodownload" class="px-1">
<source src="audio/ppa/segment2.wav" type="audio/wav">Your browser does not support the audio element.
</audio><br><span class="header-text">Instruction:</span> What do you think of the pronunciation?
<br><span class="reference-text">Reference Text (Optional): Well he is nearly 93 years old</span></td>
<td>The pronunciation is clear and understandable, with no apparent errors or difficulties in articulating words.</td>
<td>The articulation in this sentence is clear and easy to understand. The speaker is pronouncing each word clearly and distinctly, which makes it easier for the listener to follow along and understand the message being conveyed. Overall, the articulation in this sentence is excellent.</td>
<td>The pronunciation of “Well, he is nearly 93 years old” sounds clear and easy to understand. There don’t seem to be any issues with it.</td>
<td> The person is saying, "Well, he is nearly 93 years old." There is a stutter on the word "nearly" at 1.20 seconds.</td>
</tr>
<tr>
<td><audio controls="" controlslist="nodownload" class="px-1">
<source src="audio/ppa/segment4.wav" type="audio/wav">Your browser does not support the audio element.
</audio><br><span class="header-text">Instruction:</span> What do you think of the pronunciation?
<br><span class="reference-text">Reference Text (Optional): He dresses himself in an old black frock coat</span></td>
<td>The pronunciation is clear and understandable, with no apparent errors or difficulties in articulating the words correctly. The speaker's voice quality is also good, which helps to convey the intended meaning effectively.</td>
<td>The pronunciation of the words in the sentence is clear and easy to understand. The articulation of the consonants and vowels is accurate, and there are no mispronunciations. Overall, the pronunciation and articulation of the words in the sentence are good.</td>
<td>The pronunciation of “He dresses himself in an old black frock coat” is clear and easy to understand. There don't seem to be any issues with it.</td>
<td>The pronunciation is almost ok. There is repetition of phoneme "b" on word "black" at 3.30 seconds.</td>
</tr>
<tr>
<td><audio controls="" controlslist="nodownload" class="px-1">
<source src="audio/ppa/segment5.wav" type="audio/wav">Your browser does not support the audio element.
</audio><br><span class="header-text">Instruction:</span> What do you think of the pronunciation?
<br><span class="reference-text">Reference Text (Optional): Usually several buttons missing</td>
<td>The speech is clear and easy to understand, but it may be difficult for some people with hearing impairments or accents.</td>
<td>I think the pronunciation/articulation is a bit difficult to understand, especially for non-native speakers. The speaker seems to be emphasizing certain syllables and words, which can be distracting and take away from the overall clarity of the message. Additionally, the speaker seems to be speaking at a fast pace, which can make it difficult for listeners to follow along. Overall, I would say that the pronunciation/articulation could be improved to make the message easier to understand.</td>
<td>It seems the speaker is saying “Usually, several modems may be the same”. </td>
<td>There are multiple problems. The speaker should say "usually several buttons missing". There is an insertion of phoneme on the word "usually" at 1.00 seconds and there is a stutter on the word "buttons" at 3.48 seconds.</td>
</tr>
<tr>
<td><audio controls="" controlslist="nodownload" class="px-1">
<source src="audio/ppa/segment6.wav" type="audio/wav">Your browser does not support the audio element.
</audio><br><span class="header-text">Instruction:</span> What do you think of the pronunciation?
<br><span class="reference-text">Reference Text (Optional): A long beard clings to his chin</td>
<td>The pronunciation is clear and understandable, with no apparent errors or mispronunciations</td>
<td>As an AI language model, I don't have personal opinions or preferences. However, I can tell you that the pronunciation/articulation of the sentence appears to be correct based on the rules of the International Phonetic Alphabet (IPA). </td>
<td>
The pronunciation of “A long beard clings to his chin” is clear and easy to understand. There don’t seem to be any issues with it.</td>
<td>The man is saying "A long beard clings to his chin". There is a stutter of phoneme "ch" on the word "chin" at 3.60 seconds. </td>
</tr>
<tr>
<td><audio controls="" controlslist="nodownload" class="px-1">
<source src="audio/ppa/segment7.wav" type="audio/wav">Your browser does not support the audio element.
</audio><br><span class="header-text">Instruction:</span> What do you think of the pronunciation?
<br><span class="reference-text">Reference Text (Optional): Giving those who observe him a pronounced feeling of the utmost respect</td>
<td>The pronunciation is clear and articulate.</td>
<td>I think the pronunciation/articulation is clear and easy to understand. </td>
<td>The pronunciation of “giving those who observe him a pronounced feeling of the utmost respect” is clear and easy to understand. There don’t seem to be any issues with it.</td>
<td>There are several pronunciation problems. First, there is a repetition of the word "those" at 1.36 seconds. Second, there is a repetition of the word "pronounced" at 6.10 seconds. Try to speak smoothly. </td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
<div class="container pt-5 mt-5 text-center fade-in">
<h2>Dyslexia Speech</h2>
<p>
Dyslexia is a specific learning disability that primarily affects reading and related language-based processing skills. It is characterized by difficulties with accurate and/or fluent word recognition, poor spelling, and decoding abilities. These difficulties typically result from a deficit in the phonological component of language, which is often unexpected in relation to other cognitive abilities and the provision of effective classroom instruction. Dyslexia primarily happens among kids.
</p>
</div>
<div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded container-1">
<div class="flex-container">
<div class="table-responsive pt-1">
<table class="table-pagination pt-1 shadow-card" id="rich-captions-1">
<thead>
<tr>
<th>Audio</th>
<th>LTU-AS 13B</th>
<th>SALMONN 13B</th>
<th>ChatGPT</th>
<th>SSDM</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" controlslist="nodownload" class="px-1">
<source src="audio/dyslexia/spk1_we_have_often_urged_him_1.wav" type="audio/wav">Your browser does not support the audio element.
</audio><br><span class="header-text">Instruction:</span> What do you think of the pronunciation?
<br><span class="reference-text">Reference Text (Optional): We have often urged him</span></td>
<td>The pronunciation is clear and understandable, but it could be improved by practicing more.</td>
<td>The pronunciation is correct.</td>
<td>It sounds like there was a bit of hesitation and the pronunciation of "often" came out as "afton". (Note the understanding is forced to stop due to ChatGPT's silence pause property after "often", which is not reasonable)</td>
<td> There is a stutter at time 1.10s to 4.20s for the word "urged". It seems the speaker is saying "agreed" and "argued".</td>
</tr>
<tr>
<td><audio controls="" controlslist="nodownload" class="px-1">
<source src="audio/dyslexia/spk1_giving_those_who_observe_him.wav" type="audio/wav">Your browser does not support the audio element.
</audio><br><span class="header-text">Instruction:</span> What do you think of the pronunciation?
<br><span class="reference-text">Reference Text (Optional): Giving those who observe him</span></td>
<td>The pronunciation is clear and understandable, which suggests that the speaker has good articulation and delivery.</td>
<td>The pronunciation of "giving those who observe him" is correct. The word "giving" is pronounced with a hard "g" sound, and the word "those" is pronounced with a soft "th" sound. The word "observe" is pronounced with a short "ob" sound and a short "serve" sound. Overall, the pronunciation is clear and easy to understand.</td>
<td>It sounds like the beginning of the phrase is correct, but it's cut off. (Note the understanding is forced to stop due to ChatGPT's silence pause property after "often", which is not reasonable)</td>
<td>There is a problem with the pronunciation of the word "observe" from 2.3 seconds to 8.0 seconds. There are stutters and prolongations.</td>
</tr>
<tr>
<td><audio controls="" controlslist="nodownload" class="px-1">
<source src="audio/dyslexia/spk2_usually_several_buttons_missing.wav" type="audio/wav">Your browser does not support the audio element.
</audio><br><span class="header-text">Instruction:</span> What do you think of the pronunciation?
<br><span class="reference-text">Reference Text (Optional): Usually several buttons missing</td>
<td>The pronunciation is clear and understandable, but it could be improved by adding more emphasis on certain syllables or using a different tone.</td>
<td>I think the pronunciation of "usually several buttons missing" is clear and easy to understand.</td>
<td>It sounds like "several" came out as "swavel". Try saying it slowly first. </td>
<td>There are some phonetic errors for the word "several" at time 1.06s-1.46s. It seems the speaker is saying "swavel". </td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
<div class="container pt-5 mt-5 text-center fade-in">
<h2>Artificial Simulated Speech</h2>
<p>
Samples from Libri-Dys test set.
</p>
</div>
<div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded container-1">
<div class="flex-container">
<div class="table-responsive pt-1">
<table class="table-pagination pt-1 shadow-card" id="rich-captions-1">
<thead>
<tr>
<th>Audio</th>
<th>LTU-AS 13B</th>
<th>SALMONN 13B</th>
<th>ChatGPT</th>
<th>SSDM</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" controlslist="nodownload" class="px-1">
<source src="audio/art-simulated/121_121726_000004_000003_missing.wav" type="audio/wav">Your browser does not support the audio element.
</audio><br><span class="header-text">Instruction:</span> What do you think of the pronunciation?
<br><span class="reference-text">Reference Text (Optional): Also, a popular contrivance whereby love making may be suspended but not stopped during the picnic season.</span></td>
<td>The speech is clear and easy to understand</td>
<td>I think that the problems for pronunciation and articulation for each phoneme and word are significant. It can be challenging for learners to master the sounds and rhythms of the language, especially if they are not native speakers. However, with practice and exposure to the language, learners can improve their pronunciation and articulation over time.</td>
<td>Your pronunciation is quite clear overall! The word "contrivance" might benefit from a bit more emphasis on the second syllable: "con-TRI-vance."</td>
<td>You are attempting to say, 'Also, a popular contrivance whereby lovemaking may be suspended but not stopped during the picnic season.' There is a missing phoneme in the word 'picnic' at 7.8s to 8.0s. </td>
</tr>
<tr>
<td><audio controls="" controlslist="nodownload" class="px-1">
<source src="audio/art-simulated/121_121726_000004_000003_prolong.wav" type="audio/wav">Your browser does not support the audio element.
</audio><br><span class="header-text">Instruction:</span> What do you think of the pronunciation?
<br><span class="reference-text">Reference Text (Optional):Also, a popular contrivance whereby love making may be suspended but not stopped during the picnic season.</span></td>
<td>The speech is clear and easy to understand, with no noticeable accent or pronunciation issues. The speaker's articulation is precise.</td>
<td>The pronunciation is clear and easy to understand.</td>
<td>Your pronunciation is clear and overall flow is natural.</td>
<td>Your articulation is almost perfect. Note that there is a prolongation at the word 'season' at 8.0s. </td>
</tr>
<tr>
<td><audio controls="" controlslist="nodownload" class="px-1">
<source src="audio/art-simulated/121_121726_000034_000001_missing.wav" type="audio/wav">Your browser does not support the audio element.
</audio><br><span class="header-text">Instruction:</span> What do you think of the pronunciation?
<br><span class="reference-text">Reference Text (Optional): Also, a draft on futurity, sometimes honored, but generally extended.</td>
<td>The pronunciation and articulation are clear and understandable in this audio clip.</td>
<td>The pronunciation is perfect.</td>
<td>Your pronunciation is clear. For "TURDI," make sure to emphasize each syllable distinctly if it's an acronym: "TUR-DI. </td>
<td>Your pronunciation is clear. There is a missing syllable in the word 'futurity' at 1.9s to 2.1s. </td>
</tr>
<tr>
<td><audio controls="" controlslist="nodownload" class="px-1">
<source src="audio/art-simulated/121_127105_000000_000000_prolong.wav" type="audio/wav">Your browser does not support the audio element.
</audio><br><span class="header-text">Instruction:</span> What do you think of the pronunciation?
<br><span class="reference-text">Reference Text (Optional): This is a LibriVox recording.</td>
<td>The pronunciation and articulation are clear, with no noticeable errors or hesitations in any of the words</td>
<td>As an AI language model, I don't have personal opinions, but I can tell you that pronunciation and articulation can be challenging for learners of any language.</td>
<td>Your pronunciation is almost perfect. The name "Leigh" should be pronounced with a long "e" sound, like "Lee." </td>
<td>There is a significant prolongation of the word 'LibriVox' at 0.8s to 1.4s. </td>
</tr>
<tr>
<td><audio controls="" controlslist="nodownload" class="px-1">
<source src="audio/art-simulated/121_127105_000017_000001_replace.wav" type="audio/wav">Your browser does not support the audio element.
</audio><br><span class="header-text">Instruction:</span> What do you think of the pronunciation?
<br><span class="reference-text">Reference Text (Optional): You took the thing down?</td>
<td>The pronunciation and articulation are clear and understandable.</td>
<td>The pronunciation is clear and easy to understand.</td>
<td>Perfect! The phrase "took the" should flow smoothly, almost like one word: "tookthe." </td>
<td>It seems for the word 'you,' you are saying 'who' at 0.5s to 0.8s. </td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</body>
</html>