Removing words starting with 0 #20
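Hello @milamarcheva, thank you for making this pull request! I haven't thought about how or whether to handle words annotated with a preceding 0, so this is a good opportunity to reflect on this.
Thank you for indicating the source data (CHILDES -> Biling -> Perez -> Shelia) for an instance of a 0-word. I've taken a look at the CHAT data file (https://sla.talkbank.org/TBB/childes/Biling/Perez/Shelia/021101.cha) to see if there are clues to help decide what to do with these 0-words. I found the occurrence of "*CHI: I 0am done ." (utterance #136 in the data file), but this file has the transcribed utterances only and doesn't have dependent tiers such as %mor and %gra. I spot-checked other CHILDES datasets and found a 0-word instance with %mor and %gra in utterance #2 of https://sla.talkbank.org/TBB/childes/Eng-NA/Brown/Eve/010600a.cha:
*MOT: you 0v more cookies ?
%mor: pro:per|you 0v|v qn|more n|cookie-PL ?
%gra: 1|2|SUBJ 2|0|ROOT 3|4|QUANT 4|2|OBJ 5|2|PUNCT
In this example, the "0v" in the utterance corresponds to "0v|v" in the %mor tier and to "2|0|ROOT" in %gra. This example suggests that although the 0-words aren't part of the produced speech or are inaudible somehow (as you've also pointed out), they still have a role in other annotation tiers. For pylangacq, an important goal is to correctly align the pieces across an utterance and its associated %mor and %gra tiers (if available) to create the parsed tokens. I see you're proposing to remove these 0-words in your commit, but given this example of "0v" from the American English Brown dataset, it would seem that pylangacq should not drop these 0-words, or else it wouldn't be able to align the utterance with the %mor and %gra tiers.
To think out loud a bit more: if a code change is needed within pylangacq, what are the options? I see the following:
1. Drop the 0-words as you've proposed, but the problem is that pylangacq wouldn't correctly align the utterances with %mor and %gra tiers, as explained above.
2. Keep the 0-words, but just remove the "0"? Not good, since there would be no indication that these words either aren't in the actual speech or are inaudible.
3. Keep the 0-words, but remove the "0" and find another way to indicate the non-existence of these words. But what way? Is this the purpose of the "0" in the first place?
4. Do nothing. If these untreated 0-words affect a pylangacq user, then the user has to handle these 0-words on their own. For instance, if a user is interested in word count in general, then the 0-words slightly inflate the word count numbers, in which case the user could detect and subtract these 0-words.
Option 1 is a deal breaker for pylangacq. Options 2 and 3 don't make sense. So I'm leaning towards option 4, with no code change needed. Am I missing something? Let me know what you think, and thank you again for raising the issue!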
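To make the alignment concern concrete, here is a minimal sketch in plain Python. It is not pylangacq's actual parser: the helper name `align_tiers` is hypothetical, and it assumes each tier is whitespace-separated with one item per word or punctuation mark. It pairs the Eve utterance ("you 0v more cookies ?") with its %mor and %gra items by position, then shows that the same pairing fails once the 0-word is dropped from the utterance.

```python
def align_tiers(utterance, mor, gra):
    """Pair each utterance item with its %mor and %gra item by position.

    Hypothetical helper for illustration only; assumes one whitespace-separated
    item per word or punctuation mark on every tier.
    """
    words = utterance.split()
    mor_items = mor.split()
    gra_items = gra.split()
    if not (len(words) == len(mor_items) == len(gra_items)):
        raise ValueError(
            f"tier lengths differ: {len(words)} words, "
            f"{len(mor_items)} %mor items, {len(gra_items)} %gra items"
        )
    return list(zip(words, mor_items, gra_items))


# Utterance #2 of Eng-NA/Brown/Eve/010600a.cha, as quoted in this thread.
utterance = "you 0v more cookies ?"
mor = "pro:per|you 0v|v qn|more n|cookie-PL ?"
gra = "1|2|SUBJ 2|0|ROOT 3|4|QUANT 4|2|OBJ 5|2|PUNCT"

# With the 0-word kept, all three tiers have five items and line up.
aligned = align_tiers(utterance, mor, gra)

# With the 0-word dropped from the utterance only, the lengths no longer match.
without_zero = " ".join(w for w in utterance.split() if not w.startswith("0"))
try:
    align_tiers(without_zero, mor, gra)
    dropped_still_aligns = True
except ValueError:
    dropped_still_aligns = False
```

In this sketch the second item of `aligned` pairs "0v" with "0v|v" and "2|0|ROOT", which is the correspondence described above; removing "0v" leaves four utterance items against five items on each dependent tier.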
Dear Jackson,
Thank you very much for your response. I am a 1st year PhD student and I
only recently discovered pylangacq, even though I have been working with
CHILDES data for 2 years, so I firstly want to thank you for writing the
library, it's very useful to my work.
It's fine if you think that the change I proposed does not fit with the
goals of pylangacq. I found several other issues with the data cleaning from
the annotations:
- +/ -- interruption, the current library leaves the slash in the
processed string. I could attempt to fix that
- [/] -- repetition; [//] -- retracing: the library removes any repeated
words or phrases, but I think an option to leave them behind might be
useful in some cases, when focusing on production
This is my first time attempting to contribute to an open source library
and I was compelled to do it because it's a very useful library for me, so
I wanted to fix some cases of what I thought were issues, but if my ideas
do not comply with the original concept of the library that's fine.
Best,
Mila
On Thu, 11 Jan 2024 at 02:43, Jackson L. Lee ***@***.***> wrote:
Hello @milamarcheva <https://github.com/milamarcheva>, thank you for
making this pull request! I haven't thought about how or whether to handle
words annotated with a preceding 0, so this is a good opportunity to
reflect on this.
Thank you for indicating the source data (CHILDES -> Biling -> Perez ->
Shelia) for an instance of a 0-word. I've taken a look at the CHAT data
file <https://sla.talkbank.org/TBB/childes/Biling/Perez/Shelia/021101.cha>
to see if there are clues to help decide what to do with these 0-words. I
found the occurrence of "*CHI: I 0am done ." (utterance #136 in the data
file), but this file has the transcribed utterances only and doesn't have
dependent tiers such as %mor and %gra. I spot-checked other CHILDES
datasets, and found a 0-word instance with %mor and %gra:
https://sla.talkbank.org/TBB/childes/Eng-NA/Brown/Eve/010600a.cha, in
utterance #2:
*MOT: you 0v more cookies ?
%mor: pro:per|you 0v|v qn|more n|cookie-PL ?
%gra: 1|2|SUBJ 2|0|ROOT 3|4|QUANT 4|2|OBJ 5|2|PUNCT
In this example, the "0v" in the utterance corresponds to "0v|v" in the
%mor tier and to "2|0|ROOT" in %gra. This example suggests that although
the 0-words aren't part of the produced speech or are inaudible somehow (as
you've also pointed out), they still have a role in other annotation tiers.
For pylangacq, an important goal is to correctly align the pieces across an
utterance and its associated %mor and %gra tiers (if available) to create
the parsed tokens <https://pylangacq.org/transcriptions.html#tokens>. I
see you're proposing to remove these 0-words in your commit
<d7c6387>,
but given this example of "0v" from the American English Brown dataset, it
would seem like pylangacq should *not* drop these 0-words, or else it
wouldn't be able to align the utterance with the %mor and %gra tiers.
To think out loud a bit more -- If a code change is needed within
pylangacq, what are the options? I see the following:
1. Drop the 0-words as you've proposed, but the problem is that
pylangacq wouldn't correctly align the utterances with %mor and %gra tiers,
as explained above.
2. Keep the 0-words, but just remove the "0"? Not good, since there
would be no indication that these words either aren't in the actual speech
or are inaudible.
3. Keep the 0-words, but remove the "0" and find another way to
indicate the non-existence of these words. But what way? Is this the
purpose of the "0" in the first place?
4. Do nothing. If these untreated 0-words affect a pylangacq user,
then the user has to handle these 0-words on their own. For instance, if a
user is interested in word count in general, then the 0-words slightly
inflate the word count numbers, in which case the user could detect and
subtract these 0-words.
Option 1 is a deal breaker for pylangacq. Options 2 and 3 don't make
sense. So I'm leaning towards option 4 for no code change needed.
Am I missing something? Let me know what you think, and thank you again
for raising the issue!
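Option 4 could look something like the following on the user's side. This is a minimal sketch in plain Python, not part of pylangacq: the helper names `is_zero_word` and `count_words` are hypothetical, the words are assumed to be already available as a list of strings, and a 0-word is assumed to be any token that starts with "0" and has at least one more character (a bare "0" is deliberately not treated as a 0-word here).

```python
def is_zero_word(token):
    """Return True for tokens such as "0am" or "0v" (assumed 0-word shape)."""
    return len(token) > 1 and token.startswith("0")


def count_words(words, include_zero_words=False):
    """Count words, leaving out 0-words unless asked to include them."""
    if include_zero_words:
        return len(words)
    return sum(1 for word in words if not is_zero_word(word))


# Words from "*CHI: I 0am done ." with the final punctuation left out.
words = ["I", "0am", "done"]

produced = count_words(words)
annotated = count_words(words, include_zero_words=True)
```

Here `annotated - produced` is the number of 0-words, which is the amount a user would subtract from a general word count.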
Thank you, Mila, for using pylangacq and for your interest in contributing to it -- really appreciate it!
It looks like pylangacq doesn't handle
If you're interested in the original, unparsed utterance (with the repeated words retained, among other things), the utterance objects preserve the original tiers from CHAT data. Please let me know if it's not clear how to access the unparsed utterance line. In case email (rather than the public GitHub platform here) is a preferred way to discuss these or any other questions/ideas you may have, I'm reachable at [email protected]
Removing omitted words, annotated as 0word, because they were added by the annotator and are not part of the authentic child-produced speech.
[done] Add a concise title to this pull request on the GitHub web interface.
[done] Add a description in this box to describe what this pull request is about.
If code behavior is being updated (e.g., a bug fix), relevant tests should be added.
The CircleCI builds should pass, including both the code styling checks by black and flake8 as well as the test suite.
Add an entry to CHANGELOG.md at the repository's root level.