Count problematic (unexpected) behaviors as assertion failures #738

Closed · 3 tasks
mcking65 opened this issue Aug 7, 2023 · 7 comments
Labels: enhancement (New feature or request)

mcking65 commented Aug 7, 2023

Problem

Currently, unexpected or undesirable behaviors are tabulated separately from assertion failures. Because they never affect top-line support numbers, their impact is hidden, which reduces the utility of AT support tables in the APG. A pattern could pass 100% of required assertions and look excellent in the support table, yet exhibit problematic behaviors so severe that the pattern is unusable.

Solution

Note: The following solution assumes the system has been changed to use three assertion priorities, as specified in #737.

To address these issues, this proposal suggests the following changes:

  1. Change the description of these behaviors to "Other behaviors that create negative impacts".

  2. For each such behavior, collect two additional data elements:

    • Severity: high or moderate impact
    • A description of how the AT response exhibits the behavior
  3. In the reporting system, use the collected severity data to report on the following two assertions for every command in every test:

    • Other behaviors that create high negative impacts are not exhibited
    • Other behaviors that create moderate negative impacts are not exhibited
  4. When reporting, do not report other negative behaviors in a column separate from assertions as is currently done. Instead, let any other negative behavior trigger a failure of one of the assertions above.
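The mapping proposed in the steps above can be sketched as follows. This is a hypothetical illustration; the function name and data shapes are not taken from the aria-at-app code:

```python
# Hypothetical sketch: recorded problematic behaviors become verdicts for
# per-command "not exhibited" assertions, per the proposal above.

def behavior_assertion_verdicts(recorded_behaviors):
    """recorded_behaviors: list of dicts like
    {"behavior": "Excessively verbose", "impact": "HIGH", "details": "..."}.
    Returns verdicts for the two proposed problematic-behavior assertions."""
    impacts = {b["impact"] for b in recorded_behaviors}
    return [
        {"priority": "MUST",
         "assertion": "Other behaviors that create high negative impacts are not exhibited",
         "verdict": "Failed" if "HIGH" in impacts else "Passed"},
        {"priority": "SHOULD",
         "assertion": "Other behaviors that create moderate negative impacts are not exhibited",
         "verdict": "Failed" if "MODERATE" in impacts else "Passed"},
    ]
```

With no recorded behaviors, both assertions pass; a single recorded high-impact behavior fails the MUST assertion, which is what lets it pull down the top-line number.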

Example: Consider the alert report for Test: Trigger an alert in reading mode.

The current report includes two commands that each have two assertions. So, there are a total of 4 verdicts in the report for this test.

| Test Name | Required Assertions | Optional Assertions | Unexpected Behaviors |
| --- | --- | --- | --- |
| Trigger an alert in reading mode | 4 of 4 passed | 2 of 2 passed | None |

If we were to map the current optional assertion to a MAY-DO priority as described in #737, there would be one MUST-DO assertion and one MAY-DO assertion for each command.

| Priority | Assertion |
| --- | --- |
| MUST | Text 'Hello' is conveyed |
| MAY | Role 'alert' is conveyed |

If we were to add the two problematic behavior assertions described above to this list, each command would have 4 assertions, increasing the total number of verdicts for this test report from 4 to 8.

| Priority | Assertion |
| --- | --- |
| MUST | Text 'Hello' is conveyed |
| MUST | Other behaviors that create high negative impacts are not exhibited |
| SHOULD | Other behaviors that create moderate negative impacts are not exhibited |
| MAY | Role 'alert' is conveyed |

Thus instead of the above report, we would have:

| Test Name | MUST HAVE Behaviors | SHOULD HAVE Behaviors | MAY HAVE Behaviors |
| --- | --- | --- | --- |
| Trigger an alert in reading mode | 4 of 4 passed | 2 of 2 passed | 4 of 4 passed |

Very importantly, if a high-impact negative behavior were present for one command, the MUST-have behavior support would drop from 100% (4/4) to 75% (3/4).

| Test Name | MUST HAVE Behaviors | SHOULD HAVE Behaviors | MAY HAVE Behaviors |
| --- | --- | --- | --- |
| Trigger an alert in reading mode | 3 of 4 passed | 2 of 2 passed | 4 of 4 passed |
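The drop described above is simply the pass count divided by the verdict count. A tiny illustrative helper (hypothetical, not part of the reporting code) makes the arithmetic concrete:

```python
# Illustrative only: how one failed problematic-behavior assertion moves
# the top-line MUST support number from 100% to 75% in the example above.

def support_summary(passed, total):
    """Format a support cell like those in the report tables."""
    return f"{passed} of {total} passed ({passed / total:.0%})"

print(support_summary(4, 4))  # 4 of 4 passed (100%)
print(support_summary(3, 4))  # 3 of 4 passed (75%)
```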

If the high-impact problematic behavior occurred when Enter was pressed, using the results table design from #733, the failure would be reported as:

Enter Results (4 passed, 1 failed)

| Priority | Assertion | Verdict |
| --- | --- | --- |
| MUST | Other behaviors that create high negative impacts are not exhibited | Failed |
| MUST | Text 'Hello' is conveyed | Passed |
| SHOULD | Other behaviors that create medium negative impacts are not exhibited | Passed |
| MAY | Role 'alert' is conveyed | Passed |
| MAY | Other behaviors that create low negative impacts are not exhibited | Passed |

App changes

  • Modify the results collection form so that each behavior in the list of problematic behaviors has the following inputs:
    • A checkbox to indicate if the behavior occurred.
    • A select labeled "Impact" with values Low, Medium, and High. It is disabled if the checkbox is not checked and required if it is checked.
    • A text field labeled "Details" that is disabled if the checkbox is not checked and required if it is checked.
  • Add problematic behavior assertions to the test results tables defined in issue 733.
  • Remove unexpected behavior columns from reports and include the problematic behavior assertion results in the data for the MUST/SHOULD/MAY behavior columns.
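The enable/disable and required constraints on the form inputs listed above can be modeled as in this sketch (a hypothetical Python model of the validation rules; the real app is a web form, and none of these names come from its code):

```python
# Hypothetical model of one row in the problematic-behaviors form:
# "Impact" and "Details" are disabled unless the behavior's checkbox is
# checked, and both are required when it is checked.

def validate_behavior_row(checked, impact, details):
    """Return a list of validation errors for one problematic-behavior row."""
    errors = []
    if not checked:
        # The inputs are disabled, so any values they hold are ignored.
        return errors
    if impact not in ("Low", "Medium", "High"):
        errors.append('"Impact" is required when the behavior is checked')
    if not details or not details.strip():
        errors.append('"Details" is required when the behavior is checked')
    return errors
```

For example, an unchecked row is always valid, while a checked row with no impact selected and an empty details field produces two errors.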
@css-meeting-bot (Member)

The ARIA-AT Community Group just discussed Changes to unexpected behavior data collection and reporting.

The full IRC log of that discussion:

<jugglinmike> Topic: Changes to unexpected behavior data collection and reporting
<jugglinmike> github: https://github.com//issues/738
<jugglinmike> Matt_King: Before we talk about the solution I've proposed, I want to make sure the problem is really well-understood
<jugglinmike> Matt_King: The status quo is: after you answer all the questions in the form about the screen reader conveying the role and the name, etc
<jugglinmike> Matt_King: ...there's a checkbox asking if other unexpected behaviors occurred. Activating that reveals a bunch of inputs for describing the behavior
<jugglinmike> Matt_King: In our reports, we have a column for required assertions, a column for optional assertions, and a column for unexpected behaviors
<jugglinmike> Matt_King: When we surface this data in the APG (but really, any place we want to summarize the data), it lumps "unexpected behavior" in a separate category all on its own
<jugglinmike> Matt_King: I see this as a problem for two reasons
<jugglinmike> Matt_King: Number one: we're going to have too many numbers (four in total) once we move to "MUST"/"SHOULD"/"MAY" assertions
<jugglinmike> Matt_King: Number two: people don't know how to interpret this information
<jugglinmike> Matt_King: to that second point, there's a wide spectrum of "unexpected behaviors" in terms of how negatively they impact the user experience
<jugglinmike> mfairchild: I agree that's a problem
<jugglinmike> Hadi: so we don't have a way to communicate "level of annoyance"?
<jugglinmike> Matt_King: That's right
<jugglinmike> Matt_King: we might consider encoding the "level of annoyance" (even to the extent of killing the utility of the feature)
<jugglinmike> mfairchild: This extends even to the AT crashing
<jugglinmike> Matt_King: As for my proposed solution
<jugglinmike> Matt_King: There are three levels of "annoyance": high, medium, and low.
<jugglinmike> Matt_King: And there are three assertions associated with every test: "there are no high-level annoyances", "there are no medium-level annoyances", and "there are no low-level annoyances"
<jugglinmike> Matt_King: That way, what we're today tracking as "unexpected behaviors" separate from assertions, would in the future be encoded and reported just like assertions
<jugglinmike> Matt_King: Ignoring the considerations for data collection, do folks present today think this is a good approach from a reporting standpoint?
<jugglinmike> Matt_King: Well, let's talk about data collection
<jugglinmike> Matt_King: Let's say you have excess verbosity. Maybe the name on a radio group is repeated
<jugglinmike> Matt_King: The way you'd say that on the form is you'd select "yes, an unexpected behavior occurred", then you choose "excess verbosity,"...
<jugglinmike> Matt_King: ...next you choose how negative the impact is ("high", "medium", or "low"), and finally you write out a text description of the behavior
<jugglinmike> Matt_King: It wouldn't be that each of the negative behaviors always fell into one specific bucket of "high", "medium" or "low". Instead, it's that each occurrence of an unexpected behavior would require that classification
<jugglinmike> James_Scholes: I think it's a positive direction, but I wonder about how Testers will agree on the severity of these excess behaviors
<jugglinmike> James_Scholes: I also wonder about sanitizing the plain-text descriptions
<jugglinmike> Matt_King: I don't think we need to collect a lot of information on descriptions; I expect it'd be pretty short for most cases
<jugglinmike> Matt_King: If you look at our data, these things are relatively rare. In a big-picture sense, anyway
<jugglinmike> Matt_King: If you look at it in the reports--in terms of what we've agreed upon--they're extremely rare.
<jugglinmike> Matt_King: I think that mitigates the impact of the additional work here
<jugglinmike> Matt_King: But I think we need to work toward really solid guidelines, and be sensitive to it during the training of new Testers
<jugglinmike> Hadi: My general concern is that once we start to categorize the so-called "annoyance" level, we might get into rabbit holes that we might not be able to manage
<jugglinmike> Hadi: I think anything more than the required assertion should be considered as "annoyance"
<jugglinmike> Hadi: e.g. if the AT repeats the name twice. Or it repeats even more times. Or it reads through to the end of the page.
<jugglinmike> Hadi: Where do we say it crosses the line?
<jugglinmike> Matt_King: I'm not prepared today to propose specific definitions for "high" "medium" and "low". We could do that on the meeting for September 21st
<jugglinmike> Matt_King: But I also don't think we need that to move forward with development work
<howard-e> https://datatracker.ietf.org/doc/html/rfc2119
<jugglinmike> jugglinmike: I'm having trouble thinking of what "MAY NOT" means. And, checking RFC 2119, it doesn't appear to be defined for normative purposes. I'll think offline about the impact this has on our design, if any
<jugglinmike> Matt_King: We could reduce it to just two: must not and should not. That could be a good simplification. I'm starting to like that, just having two and not having the third
<jugglinmike> mfairchild: I like that, too
<jugglinmike> mfairchild: but does this go beyond the scope of the project?
<jugglinmike> Matt_King: It's really clear from a perspective of interoperability, that one screen reader you can use with a certain pattern, and with another screen reader, you can't
<jugglinmike> mfairchild: I agree. I think where we might get hung up is on the distinction between "minor" and "moderate" or "may" and "should"
<jugglinmike> mfairchild: I think that, from a perspective of interoperability, we're really focused on severe impediments
<jugglinmike> mfairchild: It's still subjective to a degree, but if we set the bar high enough, it won't be too distracting to actually make these determinations
<jugglinmike> Matt_King: There might be some excess verbosity that we don't include in the report
<jugglinmike> Matt_King: The best way for us to define these things is real-world practice
<jugglinmike> Hadi: can you provide an example of "must not"
<jugglinmike> Matt_King: For example, there's "must not crash." Or "must not change reading cursor position"
<jugglinmike> Hadi: For example, imagine you are in a form that has 30 fields. When you tab to each field, it reads the instructions at the top of the page (which is two paragraphs long), that's just as bad as a crash!
<jugglinmike> Matt_King: James_Scholes would you be comfortable moving to just two levels: "MUST NOT" and "SHOULD NOT"
<jugglinmike> James_Scholes: The fewer categories, the less nuance we'll have in the data. That makes the data less valuable, but it also makes it easier for testers to make these determinations. I support it
<jugglinmike> Matt_King: I'm going to make some changes to this issue based on our discussion today and also add more detail
<jugglinmike> Matt_King: By the time of our next meeting (on September 21), I hope we can be having a discussion on how we categorize unexpected behaviors
<jugglinmike> Zakim, end the meeting

@mcking65 (Author)

I didn't finish the to-do discussed in the Sep 7 meeting. I will target having this ready for the Sep 27 meeting.

@css-meeting-bot (Member)

The ARIA-AT Community Group just discussed Issue 738: Changes to unexpected behavior data collection and reporting.

The full IRC log of that discussion:

<jugglinmike> Topic: Issue 738: Changes to unexpected behavior data collection and reporting
<jugglinmike> github: https://github.com//issues/738
<jugglinmike> Matt_King: I'd like to prepare a better mock up so there's no ambiguity in our future discussion
<jugglinmike> Matt_King: But one of the decisions we made last time is that when we record an unexpected behavior, we would assign one of two severities (rather than one of three)
<jugglinmike> Matt_King: I was working on this because I didn't want to change from "high", "medium", and "low" to just "high" and "medium". "medium" and "low" also felt wrong
<jugglinmike> Matt_King: For now, I've settled on "high impact" and "moderate impact"
<jugglinmike> James_Scholes: As you say, "moderate" is almost meaningless because it could apply to any level impact that is not "high"
<jugglinmike> Matt_King: We want to set a fairly high bar for something that is "high impact."
<jugglinmike> Matt_King: The assertion will read something like "The AT must not exhibit unexpected behaviors with high impact"
<jugglinmike> Matt_King: And "The AT should not exhibit unexpected behaviors with moderate impact"
<jugglinmike> Matt_King: I'm going to move forward with those terms for now and bring it back to this meeting next week

@howard-e self-assigned this Nov 27, 2023

@howard-e (Contributor) commented Nov 27, 2023

Based on previous discussions, the bottom of the top comment should also be edited as:

Enter Results (~~4~~ 3 passed, 1 failed)

| Priority | Assertion | Verdict |
| --- | --- | --- |
| MUST | Other behaviors that create high negative impacts are not exhibited | Failed |
| MUST | Text 'Hello' is conveyed | Passed |
| SHOULD | Other behaviors that create medium negative impacts are not exhibited | Passed |
| MAY | Role 'alert' is conveyed | Passed |
| MAY | Other behaviors that create low negative impacts are not exhibited | Passed |

App changes

  • Modify the results collection form so that each behavior in the list of problematic behaviors has the following inputs:
    • A checkbox to indicate if the behavior occurred.
    • A select labeled "Impact" with values ~~Low, Medium, and High~~ Moderate and High. It is disabled if the checkbox is not checked and required if it is checked.
    • A text field labeled "Details" that is disabled if the checkbox is not checked and required if it is checked.
  • Add problematic behavior assertions to the test results tables defined in issue 733.
  • Remove unexpected behavior columns from reports and include the problematic behavior assertion results in the data for the MUST/SHOULD/MAY behavior columns.

@mcking65 (Author)

Testing in staging, this is most of the way there. I found the following problems. I believe these should be fixed before deploy because one affects the understandability of reports and one affects accessibility of the collection form.

  1. On the collection form, keyboard access is broken in the group of additional undesirable behaviors. If you mark the "Yes" radio and then tab, you can only tab to the first checkbox; the subsequent behaviors are not keyboard accessible.
  2. In the meeting where we reviewed the wording, we decided to use the word "Severe" instead of "High".
  3. In the runner after results are submitted, and on the report page, the details of the additional undesirable behaviors are still labeled "Unexpected Behaviors:". We should change that wording so readers can directly tie the content back to the assertions. I think the label should be "Other behaviors that create negative impact:". Instead of a list, the details would be easier to read in a three-column table with columns for "Behavior", "Details", and "Impact".

@mcking65 (Author)

Because the problems are significant enough to block deployment, changing status to in progress.


mcking65 commented Mar 6, 2024

Retested in staging and all the problems listed in my previous comment are resolved. Thank you!!

No branches or pull requests · 4 participants