The goal of the subjective evaluation was to answer how correct the testing tools really are, how they are operated, and how usable they are. This part should also surface aspects not captured by the objective evaluation, and offer additional, ideally different, perspectives. Here, we provide a summary of the views expressed during the project's first workshop, as well as the reports of three testers who used accessibility checkers with popular real-life sites.
The subjective findings are discussed together with the results from the objective evaluation in the project overview.
The evaluation of accessibility checkers was divided into an objective and a subjective part. The objective evaluation measured the performance of each tool when confronted with a vast number of technical tests.
For the subjective assessment (this part), we let three individuals use three tools each with a selection of five different websites. Each individual was asked to play the role of a tester or developer, and to use the respective tool accordingly, but also in line with their own preferences and expertise. The sites were quite popular according to a recent Alexa ranking; the ranking number for each site on a country basis is indicated in parentheses below.
The objective of this part of the evaluation was to collect impressions and experiences from hands-on use of accessibility tools in their natural habitat: real-life sites and pages.
Read the testers' individual reports below.
During a project-related workshop with testers and developers, accessibility testing was discussed, including its inherent challenges and the informants' needs and wishes. This section summarizes the informants' requirements for The Perfect Accessibility Checker.
All informants use multiple tools, in addition to human inspection. Maybe the ultimate checker could be a meta tool which integrates and collects results from several separate tools? It is also a good idea to make the tools easy to integrate with other testing frameworks, in particular those which allow for semi-automatic testing, such as Selenium (see the sketch after this list of requirements).
The tools should be suitable for developers, testers, designers, and interaction designers alike. They should therefore offer a simple graphical user interface in addition to the command line, and avoid being overly technical.
The tool of choice should enable instantaneous feedback also on minor changes and small iterations. In other words, it should be fast.
The tools should be updated with support for more recent functionality like WAI-ARIA, CSS3, and various HTML5-related APIs and extensions.
The tools should be better documented with regard to what they do and do not support, and they should put more emphasis on versioning.
The tools should become more accurate. In particular, the sheer number of false positives renders them unusable in a tester's or developer's daily life. The tools should further be automatable to the highest possible degree.
The tool of choice should have support for a test suite (or for defining a custom test suite) addressing one or several of Difi's responsibility areas, like the Regulation for universal design of ICT, Kvalitet på nett, or Kvalitet på digitale tjenester.
Ideally, the tool should be open and free to use, and it should be actively developed. However, a small fee for buying or using it is also acceptable.
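As noted above, integration with frameworks such as Selenium was a recurring wish. The following minimal sketch shows one way such an integration could look in Python, assuming a local copy of the open-source axe-core library (axe.min.js); the example URL, file name, and result handling are illustrative only, not a recommendation of a particular setup.

```python
# Sketch: driving an accessibility check from Selenium by injecting axe-core.
# The URL and the path to axe.min.js are placeholders.
from selenium import webdriver

driver = webdriver.Firefox()
driver.set_script_timeout(30)
driver.get("https://example.com/")

# Inject the checker script into the page under test.
with open("axe.min.js", encoding="utf-8") as f:
    driver.execute_script(f.read())

# Run the checks asynchronously and collect the results as JSON.
results = driver.execute_async_script("""
    var done = arguments[arguments.length - 1];
    axe.run(document).then(function (r) { done(r); });
""")

for violation in results["violations"]:
    print(violation["id"], "-", violation["description"])

driver.quit()
```

A setup along these lines lets accessibility checks run as part of an existing semi-automatic test suite rather than as a separate manual step.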
Not surprisingly, testers, developers, and content producers are different, and so are their working methods and choice of tools when it comes to testing accessibility. Also keep in mind that they usually do not agree, as discussed elsewhere. Nevertheless, we have found a number of challenges which the majority of testers and developers typically encounter, as presented below.
The following statement probably wraps it all up:
The perfect tool doesn't exist.
Human inspection shows that the tools cannot be trusted, but the tools themselves also report quite different results in terms of the number of issues flagged, the number of warnings ("to verify"), and actual misses (undetected issues). In part, the tools simply seem to lack support for testing the latest recommendations and APIs. The tools can thus only be used as a supplement to human analysis. As a minimum, it is strongly advisable to deploy multiple tools for cross-verification of results.
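To illustrate what such cross-verification might look like in practice, here is a small sketch that merges issue reports from two tools and separates issues confirmed by both from issues flagged by only one. The report format, file names, and field names are assumptions made for the example, not the output of any specific tool.

```python
# Sketch: cross-verifying two hypothetical checker reports. Each report is
# assumed to be a JSON list of objects like {"page": ..., "rule": ...}.
import json

def load_issues(path):
    with open(path, encoding="utf-8") as f:
        return {(issue["page"], issue["rule"]) for issue in json.load(f)}

tool_a = load_issues("report_tool_a.json")
tool_b = load_issues("report_tool_b.json")

confirmed = tool_a & tool_b   # flagged by both tools
disputed = tool_a ^ tool_b    # flagged by only one tool -> needs human review

print(f"{len(confirmed)} issues confirmed by both tools")
print(f"{len(disputed)} issues require human verification")
```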
To make accessibility testing feasible at all, the tools should be quick enough for frequent, iterative testing, and they should take the burden off testers and developers by automatically ruling out (where possible) as many potential problems as possible. Today's several hundred required manual inspections are simply not manageable.
The testing tools also need to be more explicit about what has been tested and how, for instance by documenting vital parts of, or the entire, tested document, and by detailing exactly which tests have been applied. The former is necessary to address the problem that the content sent to the checker may differ from what is sent to a human, while the latter is crucial in order to compare the tools with each other in a more open fashion.
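One way to make test runs more transparent is to record, for each run, a snapshot (or hash) of the fetched content together with the exact list of applied tests. The following sketch illustrates such a record; all field names and test identifiers are assumptions made for the example, not an existing reporting format.

```python
# Sketch: a minimal machine-readable record of a single test run, capturing
# what was tested (a hash of the fetched markup) and which tests were applied.
# Field names and test identifiers are illustrative only.
import hashlib
import json
from datetime import datetime, timezone

def make_run_record(url, fetched_html, applied_tests, tool, tool_version):
    return {
        "url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "content_sha256": hashlib.sha256(fetched_html.encode("utf-8")).hexdigest(),
        "tool": tool,
        "tool_version": tool_version,
        "applied_tests": applied_tests,
    }

record = make_run_record(
    url="https://example.com/",
    fetched_html="<html>...</html>",
    applied_tests=["WCAG2-1.1.1", "WCAG2-1.3.1", "WCAG2-2.4.4"],
    tool="example-checker",
    tool_version="0.1.0",
)
print(json.dumps(record, indent=2))
```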
In conclusion, we need fast, easily integrated, more open, more complete, better documented, more modern, smarter, and richer accessibility checkers.
In the context of this evaluation, it should also be kept in mind that human judgement is not without challenges either. In one study involving 25 experienced accessibility evaluators and 27 novices, it was found that two experienced evaluators would only agree on slightly more than half of the WCAG 2.0 success criteria. Moreover, experienced evaluators and novices alike typically produce about 26-35% false positives (i.e. reported errors that are not actual accessibility errors), and an equally high number of false negatives (i.e. missed true accessibility errors).