See also: IRC log
Eric: would like to address (1) scope of "web application" in the context of WCAG-EM and (2) how we address web applications in WCAG-EM
... what information is missing and where do we need to add information
Eric: web application is an application rendered via a browser
Ramon: what about Adobe AIR, which is installed locally
Katie: downloaded using HTTP
Shadi: WCAG 2.0 defines web content as that delivered via HTTP and rendered via a browser
Katie: develop a new version of the methodology after WCAG2ICT work is completed?
Shadi: need to make sure we address what WCAG defines as web content, and make sure we do not break usage for other contexts
Ramon: should add example of web applications
... currently no differentiation for web applications
Vivienne: need to spell out what a web application is, because some people consider that a web application is not a website
<David_MacD_Lenovo> http://www.tbs-sct.gc.ca/pol/doc-eng.aspx?id=23601&section=text Government of Canada Web pages refer to static and dynamic Web pages and Web applications.
Ramon: impact of some success criteria on web applications is different than for more static websites
... for example, screen magnification may impact an application with many widgets more than a document with mostly text
Katie: would not want to single out and separate specific requirements
... screen magnification also impacts other situations
Shadi: maybe makes sense to point out to evaluator particular success criteria that occur more frequently in the context of web applications
Vivienne: use a set of such criteria when evaluating web applications
... more important than sampling because there could be only a few web pages
David: rather than defining web application just use the WCAG approach of calling it content
## Examples of Web Applications ##
calendar widget
online forms
word processor
Ramon: something that performs an action
John: typically over several iterations of interaction
Jason: interaction where the user provides some input and receives a response based on that input
John: the nature of that response is what differentiates them
Jason: sometimes thinks applications as such don't exist
... it is about content and interaction
... more dynamic than traditional static pages
... model not fundamentally different but usability may be different
John: in web applications some of the logic happens on the client side
Jason: discrete set of functionality that serves a specific purpose?
... what about client side scripts that generate a series of pages
David: use "horse power" for cars even though no
horses pull the cars anymore
... maybe want to keep "web page" despite the new paradigms
Ramon: want to avoid that people think WCAG-EM is not applicable to what people perceive as a web application
Vivienne: current descriptions include web applications as part of a website
... maybe only need some more examples
Katie: test all the functionality on a page
Ramon: cannot test all the functionality on an application like Google Docs
... because there are possibly many thousands of individual ones
... usually group the types of functionality
## in a web application lots of functionality and content may be compressed into a single web page
## there may also be lots of repetition of components (blocks)
## requirement for "complete transaction" is also frequently an issue
Vivienne: deciding the parameters of a web application, where it starts and where it ends
### Examples of Web Applications
iTunes homepage is a browser
internet banking part of the bank website
webmail client
hotel or airline booking
(bank may not consider internet banking application as a website in itself)
(booking websites may have distinct search versus booking functionality)
real-time tickers like for scores or the stock market
social networking applications
## discussion about dependency: on a traditional website there may be more easily separable areas (like "the math department") whereas in an application there may be dependencies, like the path to get to a particular part of an application (like an "app" on Facebook)
tax calculator
<Ryladog> scribe:Ryladog
Topic: Revision of the current sampling approach
EV: We do not have a random sample at the moment in our methodology
... We define it in Scope section 3
... 3.3 Step 3: Select a Representative Sample; 3.3.1 Step 3.a: Include Common Web Pages of the Website; 3.3.2 Step 3.b: Include Exemplar Instances of Web Pages; 3.3.3 Step 3.c: Include Other Relevant Web Pages; 3.3.4 Step 3.d: Include Complete Processes in the Sample
3.3 is missing random sampling. We want to add it to 3.3
VC: Performed a test where 25% of the total sample size could be random; 90% of the random pages were already represented in the structured pages
... Had a problem finding random pages that were not already in her structured pages
... That could be very expensive; the random sample would need to change for the next test
DM: Why would you exclude pages because they use a template?
VC: Because they were all so similar
DM: I would not worry about overlap
... 4 things that a random sample will help with:
1. like the tax filing example, they could miss something (to ensure the whole website is covered)
2. inadvertent parts that would otherwise be missed
3. coming....
EV: Why do random pages?
<shadi> http://lists.w3.org/Archives/Public/public-wai-evaltf/2012Sep/0066.html
RC: Newspaper, one news item, it is not really random
EV: The idea is that it should be random
RC: Every three months automatic testing, then every other three months is manual.
DM: Self-evaluation vs. IV&V (independent verification and validation)
SAZ: With an honest evaluator performing a random sampling, have a detective's nose...
... We pick out of a completely structured example, and a random sample should produce the same results... if the choice is truly random
VC: It would be good to compare structured and random results
RC: That could be difficult
... 1 home page, 3 landing pages and 1000 content pages
EV: The random sample should in no way be worse than the structured sample
SAZ: Any random selected pages should never perform worse than structured
VC: I have one example where random won't work.
... Australian Gov says that 10% of the pages need to be tested.
KHS: Should we recommend stating whether random sampling was performed with any methodology testing?
VC: For research, uses a percentage of failures
SAZ: I want to push the sampling responsibility to the evaluator
... The result you provide should prevail
RC: For conformance we need to include sampling
DM: Gov of Canada has example/sample size requirements - 10 or minimum 90%
JK: Department of statistics in Canada put this together. They have gone up to 68 pages for a site of 3,000 pages or more. If the TF is going to go this way, by ±5% or ±10% - that might be reasonable
EV: I am not sure
JK: Standard deviation
VC: Of your sample, 25% should be random
... Gregg V, we think, suggested that number, 25%
... Why random sample? It keeps you honest
EV: Purpose: confirmation of the outcome of your structured sampling approach
... Comparison of 2 or more sites or instances of the same site
VC: Random sampling is better for re-assessments of the same site
SAZ: There are different gradients of sampling: semi-structured selection, under specific criteria - without identifying the pages
Ramon: instructions of how to select the pages including randomness
Shadi: perhaps the term of 'variation'
David: the tax department doesn't give you the opportunity to choose the tax receipts they review - the whole point is not being able to choose; it takes that out of the equation. A lot of people are not as passionate about accessibility - they just want the checkmark. There aren't a lot of people who know
... there should be a component of random sampling - some automated or third-party selected
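## scribe note: a minimal sketch of what automated random selection could look like, assuming a crawled URL list; the URL names, the 25-page structured sample, and k=10 are illustrative only, not WCAG-EM requirements
```python
import random

# Hypothetical inputs: a crawled URL list and the structured sample
# already chosen by the evaluator (all names are illustrative).
all_pages = [f"https://example.org/page{i}" for i in range(1, 1001)]
structured_sample = set(all_pages[:25])

# Draw the random portion from pages not already in the structured
# sample, so the two samples can be compared afterwards.
candidates = [p for p in all_pages if p not in structured_sample]
random_sample = random.sample(candidates, k=10)
```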
VC: What validity does it have for a third party if you do it yourself?
<shadi> http://www.macorr.com/sample-size-calculator.htm
KL: Self assessment can work well
KHS: The results need to be the same - as long as the outcome is the same.
KL: x pages should always be evaluated and x number of pages will be randomly chosen
VC: Australia could use this methodology
RC: We do multi-site evaluations just to compare
<shadi> Number of Pages in the Website / Sample Size
<shadi> 5 / 5
<shadi> 10 / 10
<shadi> 25 / 23
<shadi> 50 / 42
<shadi> 100 / 73
<shadi> 125 / 86
<shadi> 150 / 97
<shadi> 200 / 116
<shadi> 250 / 131
<shadi> 350 / 153
<shadi> 500 / 176
<shadi> 750 / 200
<shadi> 1000 / 214
Not to check conformance - but to see what the people do with the website
<shadi> http://www.macorr.com/sample-size-calculator.htm
<shadi> (85% - 95% confidence)
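## scribe note: the table above appears consistent with Cochran's sample-size formula with finite population correction at roughly 90% confidence, a 5% margin of error, and p = 0.5 - an assumption about the calculator's internals, not something confirmed in the meeting
```python
# Sketch reproducing the table above under the stated assumptions
# (z ~ 1.65 for ~90% confidence, 5% margin of error, p = 0.5).
def sample_size(population: int, z: float = 1.65,
                margin: float = 0.05, p: float = 0.5) -> int:
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)        # finite population correction
    return round(n)

for n_pages in (25, 100, 500, 1000):
    print(n_pages, sample_size(n_pages))        # -> 23, 73, 176, 214
```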
RC: We do random sampling with automated testing
EV: The difference when random sampling is done with automated testing vs. structured testing
... Purposes: confirmation,
DM: Random sampling would be used as a validation of structured testing
SAZ: Current guidance is: use an automated tool and then pick your pages
... How are we going to change that to include 'random'?
... Two things we need to tackle with random sampling: oversight and intentional unearthing of errors
EV: Should we require random sampling
VC: We should require it
JK: We should require it
KHS: We should require it
VC: Over time this matters
RC: Unsure to require it
Sample should be representative
JK: For a small site the entire site is your representative sample
... The whole idea is to avoid the view - in Canada, every site needs to test the home pages, pages with media,
DM: We should require it
<scribe> ACTION: Jason Kiss will check with Canadian Treasury Board Secretariat's web-site statistical audit guidance folks - to help us make a determination on sample size [recorded in http://www.w3.org/2012/10/29-eval-minutes.html#action02]
RC: Small sample size just for PASS/FAIL - not for conformance - then a full evaluation
Group reviews Sampling survey - 29 respondents
JK: Is random sampling used for economical and staffing reasons? Yes, all agree
Eric: Step 5C is the performance score
... only compulsory part is providing documentation
... how do you score? There are a few possibilities - total website, web page, or instance. What would it look like? Do we want to make it mandatory, or if there is a score it must be between 1-10, etc.
Katie: somewhere between green & yellow
Ramon: don't like global scores because they convey whether you have done it well or badly. If you give 90% and you have a disability, even that 90% is bad for those people. It tends to make people complacent - they feel they are pretty good.
Eric: the easiest - fail or not fail
Ramon: we use severity and frequency according to the SC
... the global score tends to be for the visually impaired
Detlev: if the score is based on WCAG, then you measure the score based on the criteria. There could be an argument for a universal score. It may not serve user groups well, because of something like captioning in which case it would fail for them completely.
Ramon: the problem with the approaches in the EM - e.g. keyboard accessibility: it may score 99% accessible, and to most people it seems completely accessible
... but say for an epileptic person some criteria would be a major fail
... very difficult to find out which criteria affect which group
David: that's why we don't use the word priority - it is very political
Ramon: we are part of a foundation for people with disabilities, we cannot discriminate between the different groups. We don't like scoring because of that. Any scoring has to be very clear that it covers all of the possible types of disability.
Detlev: one example - the BITV test - you score each success criterion for each page with a scale. If you have 95% it doesn't mean one SC would fail completely; it means that for 1 criterion you have less than ideal results. E.g. colours could be technically a failure (4.2:1), but it is a near pass. These near passes then add up
... if there are 'accessibility killers', then even when you have a good score and you find a keyboard trap it would be downgraded to inaccessible.
Eric: how did you get to the scale? You are scoring according to severity?
Detlev: we had a number of criteria that are critical. For every SC and every test you can downgrade. We always have 2 testers for a final test - is this vital? Sites which have vital failures will seldom reach 95% anyway.
Ramon: we try to avoid any numbering of the score. We try to give a subjective opinion - you are doing well, badly, almost accessible, terrible. We don't want to put a number, because when the company comes to us and sees they have 10%, this means "our website is so terrible that we can't do anything without spending a lot of money"
Katie: you don't have to use numbers
Ramon: we use 2 columns - severity and frequency - each of them 1-3.
Shadi: what is the severity of an SC?
Ramon: it is a subjective analysis
Katie: these are the 4 critical failure points
Ramon: even if the alternative text is bad, it may not be a problem. But it is subjective.
Katie: we correlate with FQT & QA; we customize - critical/serious/moderate/not as moderate. This has to be fixed now, next build, etc. Because you are tracking, you can use that level. We don't use numbers. Preferably I use critical/serious and moderate. If it is 1 alt text on 1 page, that would be minor.
Eric: do you look at Frequency?
Katie: yes, that comes up in tracking - which level, which SC,
Eric: you don't say green/orange/red/black
Katie: in my world we have laws attached - Section 508 - deals with what you must fix - fix critical first
Ramon: a project asked this specifically for a priority table - we combined severity/frequency with the impact that the fix would have
... we include whether it is easy or hard to fix. The priority may be how great the impact of fixing it would be and how easy it would be to fix.
David: priority, frequency, impact and effort
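## scribe note: a toy illustration of the severity/frequency/effort prioritisation Ramon and David describe; the 1-3 scales come from Ramon's two columns, but the combination rule and the example findings are assumptions, not an agreed formula
```python
# Each finding gets severity and frequency on Ramon's 1-3 scales;
# effort-to-fix is added per David's summary. The product rule below
# is one possible combination, not an established WCAG-EM formula.
findings = [
    {"sc": "1.1.1", "severity": 2, "frequency": 3, "effort": 1},
    {"sc": "2.1.2", "severity": 3, "frequency": 1, "effort": 2},
]
for f in findings:
    # High severity*frequency and low effort float to the top.
    f["priority"] = f["severity"] * f["frequency"] / f["effort"]
findings.sort(key=lambda f: f["priority"], reverse=True)
```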
<Detlev> Vivienne: pass / fail / N.A. / not tested
<Detlev> Vivienne: score of percentages across pages of pass/fail
<Detlev> Vivienne: that was for professional work
<Detlev> Vivienne: for research, create an average score across pages to be able to compare sites
<Detlev> Vivienne: needs quantitative score for libraries, retest to check if repairs have been done over time
<Detlev> Vivienne: so two different worlds
<Detlev> Vivienne: clients love charts
<Detlev> Vivienne: for research, adding up violations (any 4 critical points in conformance criteria) gets extra 5 points to add significance
<Detlev> Vivienne: in reporting, a hint of whether it is a global or individual problem (shared page problem)
<Detlev> Vivienne: for commercial work, per page there will be pass/fail for every SC which then gets aggregated across pages
David: accessibility differs for each client based on their goals. For web applications it does change differently. We had a template with all of the SC, and you had an example the first time you ran into the error; we report that we've found an error - tells the client to go through and fix those issues
... report has an executive summary which summarizes - whether they have the skills to fix things, top/priority issues - if you fix these 5 issues a whole bunch of stuff gets better. Then has a table with 1, 2, 3 level priorities - right away, next - based on effort and impact. Fix these and the site gets better quicker. Don't provide a score - except Government of Canada. They have to report to the courts.
... e.g. 1.3.1 has a huge impact - if it has the same weight it messes up the severity of the impact
Ramon: if they pass most of A, they get a score of 100, but when they try to go to the higher level, it lowers the score
Katie: priority for this methodology should be on compliance level.
Ramon: now the levels are more reflective on the difficulty, the unusualness, ability to comply
David: if you're going to give a score and a percentage, some people are adamant they get a score so they can compare to other organisations. Some points are more important to some people than others. Some people on the WCAG group will question the scoring method.
Shadi: 1.3.1 occurs so frequently. It can come up 100 times on a page - tables/headings etc. Out of those 100 occurrences, how many failed?
David: usually if they just get 1 wrong, they may be doing it all over the site
Shadi: then it will occur systematically so the numbers will go up
Katie: you would still get a fail overall
Shadi: is it realistic that pages are of really good quality and just 1.3.1 is badly marked up?
David: 1.3.1 happens disproportionately in terms of the websites.
Shadi: it occurs so often on a page. The more complexity you put in, the more subjectivity comes in, which actually lowers the value of the outcome. There is the notion that the more complex the scoring system and the more parameters it has, the more inclined it is towards subjectivity - the easier it is to bias.
Katie: maintain simplicity
David: he gave a client a report based on the TBS and they got 95%, and they think it's great. But all of the errors are in 1.3.1.
described adding the errors per page and dividing by the number of pages for an average score per page
David: similar to Government of Canada
Shadi: described the 3 different approaches
... 3rd one - instance: is more like Vivienne's example. For 1.3.1 you could have 100 instances where 70 passed and 30 failed, and you can work out an average of total number of errors over total possible, which gives you a ratio
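## scribe note: Shadi's per-instance example worked through - 100 instances of SC 1.3.1, 70 passing, 30 failing, giving a 0.7 ratio; a sketch of the idea, not a defined WCAG-EM metric
```python
# Per-instance scoring: every occurrence of an SC is counted and the
# score is passed / total. Figures are from Shadi's 1.3.1 example.
instances = {"1.3.1": {"passed": 70, "failed": 30}}
for sc, counts in instances.items():
    total = counts["passed"] + counts["failed"]
    print(sc, counts["passed"] / total)   # 1.3.1 -> 0.7
```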
Detlev: would the last one be a way of differentiating between critical and minor failures?
Shadi: you could have a website that is unusable which could score the same as a website with a lot of decorative images tagged with alt text etc.
... what is the purpose of scoring? How to take to court - depending on score.
... we need to state clearly that the scores are indicative and used to motivate the developer. E.g. you're on orange, you invested $5 and now you're on level x and you can see the value of the money you've invested
Katie: the world I'm in doesn't care about what you do right - they want to know how much trouble they are in
... only clear violations should be counted
Ramon: these approaches have the same problem - if you pass from A to AA, the percentages change and the results look worse
you can state both A and AA scores
Katie: identify which level you are trying for compliance with. You can say 100% for A and 70% for AA
Ramon: what is considered critical - contrast may be critical for me
... you can't say anything is not critical
Detlev: we need to work with WCAG - has to be set according to that
David: there are new tools to enhance the contrast
Shadi: what are you trying to address
Eric: at the moment it is just pass/fail - WCAG already describes it. Do we want to add things like severity and impact?
Shadi: those questions are part of the reporting. First of all you have the conformance - pass/fail. Then you come out with the report - which depends upon you, as the evaluation commissioner, as to how much detail you want - what needs fixing, the frequency of the issues, and as eye-candy, optionally you can get a score; it doesn't mean that the website is 80% accessible.
... this score is just an indicator - this circumstance, on this date, on these pages. So you can compare your own progress. Helps you track your own progress.
Katie: let's add those 4 critical requirements.
Ramon: they interfere with the other content
Katie: doesn't that make them critical then?
Detlev: if we have this kind of separation and keep it simple on the pass/fail basis, then the score is additional and should not be taken as a value of the accessibility of the page. Is there a way of reflecting those imbalances, say with 1.3.1? We have 7 different checkpoints for 1.3.1 - split into several bits so the score has more weight, same with 1.1.1
... they wanted to give a higher weight to some of the checkpoints, say 1.3.1 - tries to give a relative weight. It then aggregates the results.
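## scribe note: a sketch of the BITV-style relative weighting Detlev describes - splitting an SC into weighted checkpoints and aggregating; the checkpoint names, weights, and results are invented for illustration
```python
# Each SC is split into checkpoints with relative weights (BITV-style);
# the page score is the weighted share of points earned. The weights
# and checkpoint results below are illustrative only.
checkpoints = [
    ("1.3.1 headings", 2.0, 1.0),   # (name, weight, result in 0..1)
    ("1.3.1 tables",   2.0, 0.5),
    ("1.1.1 alt text", 1.0, 1.0),
]
earned = sum(weight * result for _, weight, result in checkpoints)
possible = sum(weight for _, weight, _ in checkpoints)
score = earned / possible   # -> 0.8
```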
Katie: we have to make the same assumptions WCAG has made - things change. such as less priority on tables, more on interactive controls. We need to be careful of specific ways things are done.
Shadi: the seriousness can show itself in the number of occurrences - but there are killers, no matter how good the page is, e.g. getting stuck
... the first 2 types - per website and per page - may be too coarse. If you want a score you will have to have a form of evaluation that counts every occurrence.
Detlev: where do you put your time - counting and looking at every image
Eric: we just indicate if there are errors, and give them a few examples.
Katie: you need to decide based on business cost as well
Ramon: we say - global issue - lists without the proper markup. You can go and fix them. If we find 3 wrong in 30 pages, we don't test all of the pages to look for more, we assume that the developers don't know how to do them and they need to go and check them themselves.
David: regarding Shadi's point - you count up the instances and see how many pass or fail. Can you come up with a % number without doing that? Is that where you want to spend your accessibility budget - counting up what is right?
Detlev: it is much more important to be able to hone in on those things that are vital - e.g. a search button image with no alt text.
Shadi: is the conclusion that the first 2 - per website or per page - are too coarse, more misleading than beneficial? The other is per occurrence, but the administrative overhead is too high - you are using budget to count those things that work. Perhaps provide a hybrid - certain checkpoints have more points and have a point system.
David: the 4 points were picked because if you fail them, you can't access other content
Shadi: to drop the scoring completely is the 4th option
Katie: what about per page?
Shadi: at level AA there are 38 possible success criteria per page and you sample 10 pages. On page 1 you fulfilled 14, on page 2 you fulfilled X; you get an average per page.
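## scribe note: Shadi's per-page averaging sketched out; only the 14 for page 1 comes from his example, the fulfilled counts for the other nine pages are made up to complete the illustration
```python
# Per-page scoring at level AA: 38 applicable success criteria per page.
# Page 1 fulfils 14 (Shadi's example); the other counts are hypothetical.
fulfilled = [14, 30, 25, 38, 20, 33, 28, 36, 22, 31]  # one per sampled page
per_page = [n / 38 for n in fulfilled]
average = sum(per_page) / len(per_page)
```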
Detlev: it can severely dilute problems - picking more pages. The more pages you check, the less major the impact of the 1 huge problem on 1 page.
... if it is an issue just on 1 page, then this is an indication of the overall impact
Shadi: we would need to see how sensitive the score is towards changes in the sample.
Detlev: it is an issue if you use a score, because people want to get a seal (90%, or 95% for really good). It gives them an impetus to increase the number of pages tested to water down the results
Shadi: we have to make sure the score is not a measure of conformance and not a measure of accessibility.
Katie: then we have to say very clearly what it is
Shadi: only used for looking at your own performance - we need to be really clear.
Session No. 4: Requirements for uniform accessibility support
Looking at Step 3.1.4, Step 1.d
Eric: example of a range of UA/AT: 5 browsers, 3 types of assistive technology
People may look at different scenarios (UA/AT) to make things work
Ramon: example of using one Screen reader instead of another
Eric: for some websites, you don't have a choice (tax web site)
Ramon: W3C's definition of accessibility support is loose
Katie: WCAG says you choose the technologies and AT to ensure it works across the site
<David_MacD_Lenovo> http://lists.w3.org/Archives/Public/public-wai-evaltf/2012Sep/0020.html my comments are here... do not think consistent support should be required...
Shadi: you could have a site accessible with one set of tools and another with another set of tools .. but it doesn't happen often
Ramon: Explains problem with PDF not being accessible on the Mac - does that constitute sufficient accessibility support (as you may install a virtual machine)?
WCAG was deliberate in not nailing down the required level of support due to fast changing technologies
David: As long as something works it is sufficient (scribe: not sure if that is David's position rendered correctly)
Katie: not offering alternatives for PDF that work on Mac would be a failure
Ramon / David seem to disagree
Vivienne: Australian Government would require an alternative version for PDFs (but this is not the WCAG position)
David: Leaving technologies out was conscious decision because it could otherwise have created disincentives for technology developers
Shadi: Partial lack of support cannot be the benchmark for establishing accessibility support
Vivienne: Australian government has applications within websites that only work in Windows - was a Government decision
Shadi: any other examples with conflicting sets of technologies for different parts of a site?
Vivienne: There were instances where Firefox did things that did not work in other browsers / AT
Detlev: Case of WAI-ARIA not being available to many people at the workplace
David: Makes case for not requiring any technologies
... example of many different web teams at a large department or company; it is difficult to get all teams to do the same things and apply the same set of UA/AT in their tests
Katie: the testing methodology should not only test a site if the level of technology support is even - it is not the role of the methodology to mandate it; results are nevertheless helpful - uniformity not required
Shadi: take back this point to WCAG Working group
Katie: still important to list what has been used for testing
Shadi: Differences in accessibility support that have been discovered should enter reporting (where are the weaknesses)
Create new issue: Develop a concept for reporting accessibility support across the website
Katie: we should mandate a minimum set (number of tools) used
Shadi: Step 1d Define the context of website use - define tools used in testing - that may need to be changed
... you may come to a piece of content that only works in another context - what is the impact of that?
Katie: Multiple operating systems should be considered - may be based on data on the most common OSs, browsers, AT used
Shadi: Step 1d came up to define a baseline for the developer
... Defining techniques is fine, but they may be extended by technologies used in the site as they are discovered
... similar approach for tools - start with the most common, then extend to other tools/AT to see if it works there
Detlev: Does that mean as long as any tool/AT out there supports it, it meets WCAG conformance?
Shadi: yes, technically, though it is not best practice; should be noted in the report
Vivienne: make suggestion for better practice
Detlev: is that the WCAG WG position?
David: at least for one page the same tools should work throughout; seems to be the WCAG WG position
... Discernible sections of a website (sub-sites, etc.) should work together on the same AT
... not swap within tasks
... Uniform AT support across sub-sites / chunks / tasks of a website might be the WG position
Shadi: Maybe WCAG WG can define it more clearly, not do it in Eval TF - ask for WCAG WG opinion here
... agreement that uniform level of AT support is at least per page, and per function / widget, transaction
David: Shadi, write down as bullet points and put it into a WCAG WG survey to clarify this
Ramon: web site owner can bypass the uniformity requirement by commissioning two different evaluations
Vivienne: Library and website may have different levels of AT support; the library could be singled out for testing
Ramon: problem is AT support / tools clash
David: We could make a statement in the Understanding document to explain AT support
... Problem of large organisations where that uniformity is not possible
Shadi: Uniformity may hold back technology development if new parts of a site have to keep in line with the old status
Vivienne: Government agencies purchase parts from others which look and work completely different from the rest
Shadi: We should make sure that a site does not need completely different sets of tools to access the site
<David_MacD_Lenovo> set of Web pages:.. collection of Web pages that share a common purpose and that are created by the same author, group or organization
<David_MacD_Lenovo> Note: Different language versions would be considered different sets of Web pages.
Katie: the market in the end determines what will be used
David: as a proposal to present to WCAG for uniform AT support
Shadi: Question to WCAG WG: What is the WG position on the intent of accessibility support 1) within individual web pages, 2) complete processes, 3) sets of pages, 4) across entire collections of pages (web sites)?
Katie: Examples: a) form on a web site that only works on the Mac, b) a calendar widget that only works in Firefox c) WAI-ARIA roles that are only supported in specific AT
Detlev: for many SC it does not matter which tool/platform has been used
Katie: mandating using more than one tool is still useful
<shadi> issue: Develop a concept for reporting accessibility support across the website
<trackbot> Created ISSUE-10 - Develop a concept for reporting accessibility support across the website ; please complete additional details at http://www.w3.org/WAI/ER/2011/eval/track/issues/10/edit .
<shadi> ACTION: eric to draft question to WCAG WG about the intent of accessibility support [recorded in http://www.w3.org/2012/10/29-eval-minutes.html#action03]
<trackbot> Created ACTION-7 - Draft question to WCAG WG about the intent of accessibility support [on Eric Velleman - due 2012-11-05].
David: operators are mutually talking about APIs so things might improve
Different positions on whether WCAG-EM should be independent of technology change or not
Katie: Developers find new ways of AT support
Eric: Wrap up of today
David: Good process, will feed back into larger group for decisions
Vivienne: Good discussions, more details on how everyone does things and the different views
Katie: Was great, good
... no different expectations for tomorrow; good that we came to formulate clear questions for WCAG WG (about AT support) - a survey of what people use would be interesting
Ramon: Good learning experience, reflects on many things in Technosite, similar to discussions there - only negative point is concern about excluding minorities that have specific requirements - drawing a line may exclude them
Katie: agrees that all disabilities (not just blindness) should be included
Detlev: Lively discussion, can stay that way today
Shadi: was a good discussion in a rather small group, good to have high-level discussion - teleconferencing wouldn't have cut it for that type of discussion - tomorrow we need to focus more on the presentational side of the methodology
... Facilities for CSUN in March
Discussion about CSUN
Eric: Is happy with input and discussions, good food for thought for the editor draft - could have more hands-on work tomorrow
[End of minutes]