Unwebbable?

Lee Phillips

July 26, 2009

An expert in web accessibility, Joe Clark, recently published an article entitled “Unwebbable” in the excellent A List Apart. His insight is that certain standard document forms convey their structure and semantics through visual layout and formatting, and that they cannot be satisfactorily translated to HTML. His main, well-chosen example is the screenplay, but he offers several others.

The problem, as he points out, is not merely that it is difficult to guarantee that user agents will display the formatting faithfully, unless one resorts to cumbersome presentational HTML. It is that HTML has no tags that convey the semantics of the document forms in question. There are no tags for dialog or slug lines, and even pagination, which does not exist in HTML pages, can carry significance.

Clark mentions the elephant in the room, PDF, but does not consider it further, except to allude briefly to its accessibility issues. This is odd, because in an earlier article he describes in detail how these issues can be overcome, and how they are typically exaggerated. He seems to suggest that one of the ways to make PDFs more accessible, tagging, might not be satisfactory because there is no standard set of PDF tags that convey the semantics of these documents: the same problem faced by HTML. But, as he says, “You could, in theory, write your own PDF tags, since they’re just XML”.

This brings us to the solution that the author does recommend: XML. With XML you can write your own document type definition (DTD), which means your own custom set of tags, to express any document semantics that you want. As he says, “Ideally, we’ll start using custom XML document types—which, finally and at long last, might actually work.” But if this is a good solution for XML, why is it not a good solution for PDF? And what is a browser or other program to do with your XML document using your custom DTD? You also need to supply a stylesheet to specify the visual formatting, and there still will be no pagination (if that is important). There is no way (I think) to gain fine control over fonts and character positions as there is in PDF, so you can’t have fine-looking equations or beautiful typography in general. Clark leaves his article unfinished, presenting XML as a solution without explaining in any detail how it would work or why it is better than the obvious alternative.
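To make the custom-tag idea concrete, here is a minimal sketch of what a screenplay vocabulary might look like and how a program could read it. The tag names (`screenplay`, `scene`, `slugline`, `action`, `dialogue`, `character`, `line`) are invented for illustration and do not come from Clark's article or from any published DTD; the sample scene is made up as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical screenplay markup: every tag name here is an assumption
# made up for this example, not part of any real document type.
SCREENPLAY = """\
<screenplay>
  <scene>
    <slugline>INT. KITCHEN - NIGHT</slugline>
    <action>Anna stares at the kettle.</action>
    <dialogue>
      <character>ANNA</character>
      <line>It never boils when you watch.</line>
    </dialogue>
  </scene>
</screenplay>
"""


def sluglines(xml_text):
    """Return the text of every <slugline> element, in document order."""
    root = ET.fromstring(xml_text)
    return [s.text for s in root.iter("slugline")]


def extract_dialogue(xml_text):
    """Return (character, line) pairs from every <dialogue> element."""
    root = ET.fromstring(xml_text)
    return [
        (d.findtext("character"), d.findtext("line"))
        for d in root.iter("dialogue")
    ]


if __name__ == "__main__":
    print(sluglines(SCREENPLAY))
    print(extract_dialogue(SCREENPLAY))
```

The semantics are now explicit and easy to query, but nothing in this markup says how a slug line or a block of dialogue should look on the page. That part would still have to come from a separate stylesheet.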

PDFs on the web are unpopular, for some good reasons. Although the text contained within can be reliably extracted, PDF is a binary format and cannot be parsed and manipulated as easily as can XML or [X]HTML by anyone with a Python interpreter and Beautiful Soup. Google, however, seems to have little problem in including PDFs in search results, and supplying a usable HTML version.
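As a rough illustration of how little it takes to pull structure out of HTML, the sketch below collects the headings from a page. It uses the standard library's `html.parser` as a stand-in so that it runs without installing anything; Beautiful Soup would make the same task shorter. The class name, function name, and sample markup are all made up for the example.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of every <h1>, <h2> and <h3> element."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # text pieces of the heading being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._current = []

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <em>) is kept as well.
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._current is not None:
            self.headings.append("".join(self._current).strip())
            self._current = None


def extract_headings(html_text):
    """Return the heading texts of an HTML string, in document order."""
    parser = HeadingCollector()
    parser.feed(html_text)
    parser.close()
    return parser.headings


# Made-up sample page for the example.
SAMPLE = (
    "<html><body><h1>Unwebbable?</h1><p>Text.</p>"
    "<h2>PDF on the <em>web</em></h2></body></html>"
)

if __name__ == "__main__":
    print(extract_headings(SAMPLE))
```

Doing the equivalent with a PDF means first decoding its binary object structure, which is why a dedicated library is needed before any such extraction can start.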

PDFs usually do not reflow their text to the window boundaries, and so typically are inconvenient to read on a portable device, requiring constant horizontal scrolling. They can be made to reflow if they are properly tagged, but very few PDFs in the wild are, and the best authoring tools, such as pdf[La]TeX, still do not have support for tagging (although they are working on it). PDFs are also typically much larger than HTML pages, although if the use of the techniques described at http://www.alistapart.com/articles/cssatten becomes widespread, the two formats might be in closer competition in regard to download time.

Despite these shortcomings, I cannot help but view the widespread advice of web designers to avoid PDF with some bemusement. The past decade or so has seen them achieve a series of victories, each one allowing them more precise cross-browser control over layout and typography, and each one getting them a little closer to their ultimate goal, which is to be able to do what PDF could do before they began their quest.

And yet, if they actually cared about aesthetics, especially the typographic kind, as much as they claim to, they would not be nearly as satisfied with recent progress as they seem to be. The rendering of type in modern browsers, such as Firefox and Safari, is vastly improved over the situation just a few years ago. Kerning and ligatures are working, even in text typed into forms. Browsers can use any OpenType or TrueType font on your system or can download a font specified by the designer. Antialiasing works well in Linux and on the Mac (I don’t think we need to consider Windows in an article about aesthetics). Type on the web looks far better than it ever has. But it still looks pretty bad compared with any TeX document from 1994 rendered in PDF using PostScript fonts (or, indeed, from the ’80s when printed), or more recently using OpenType. The reason is that web browsers do not make the aesthetic judgements about how to set type that are built into the [La]TeX system and its derivatives. They do not know how to hyphenate words; they break lines in isolation, just like Microsoft Word; they set tables crudely; and there is no intelligent float placement. Mathematics, with MathML, is usable, but not easy on the eyes. Equations resemble something hastily mocked up in a word processor.
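The difference between breaking lines in isolation and considering the paragraph as a whole can be shown with a toy example. The sketch below compares a first-fit breaker, which fills each line without looking ahead, with a breaker that picks all the breaks together so as to minimise the sum of squared leftover space (the last line is free). This is a much-simplified stand-in for what TeX does: the real algorithm also handles stretchable spaces, penalties, and hyphenation, none of which is modelled here. It assumes monospaced text measured in characters, and the function names and sample words are invented for the example.

```python
def greedy_breaks(words, width):
    """First-fit: add each word to the current line if it still fits."""
    lines, current, length = [], [], 0
    for word in words:
        needed = len(word) if not current else length + 1 + len(word)
        if current and needed > width:
            lines.append(current)
            current, length = [word], len(word)
        else:
            current.append(word)
            length = needed
    if current:
        lines.append(current)
    return lines


def optimal_breaks(words, width):
    """Choose all breaks together to minimise total squared slack."""
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[i]: least cost of setting words[i:]
    nxt = [n] * (n + 1)              # nxt[i]: end of the line starting at i
    best[n] = 0
    for i in range(n - 1, -1, -1):
        length = -1
        for j in range(i + 1, n + 1):
            length += 1 + len(words[j - 1])
            # A single overlong word is still allowed on a line of its own.
            if length > width and j > i + 1:
                break
            cost = 0 if j == n else (width - length) ** 2
            if cost + best[j] < best[i]:
                best[i], nxt[i] = cost + best[j], j
    lines, i = [], 0
    while i < n:
        lines.append(words[i:nxt[i]])
        i = nxt[i]
    return lines


def raggedness(lines, width):
    """Sum of squared leftover space on every line except the last."""
    return sum((width - len(" ".join(line))) ** 2 for line in lines[:-1])


if __name__ == "__main__":
    words = "aaa bb cc ddddd".split()
    for breaker in (greedy_breaks, optimal_breaks):
        lines = breaker(words, 6)
        print(breaker.__name__, lines, raggedness(lines, 6))
```

With these four words and a width of 6, the first-fit breaker fills the first line completely and leaves "cc" stranded on a line of its own, while the whole-paragraph breaker accepts a slightly short first line in exchange for a more even result overall.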

This brings me back to Clark’s articles, one of which contains a list of occasions that might justify resorting to PDF, and the other of which contains a list of formats that do not translate well to HTML. Missing from both lists is “something that you want to look good; anything where typographic quality is important to you.” For example, he admits that you might want to turn to PDF for mathematical documents, since “even MathML cannot render certain notations.” This defines his standard, which is bare functionality; aesthetics are not part of the criteria. But aesthetics help convey meaning (“semantics”), and this is especially important in mathematical typesetting.

There is no particular reason to pick on Mr. Clark here; his concern is mainly accessibility, and I use his articles as springboards for my points because they (the articles, I mean) are so insightful and provocative. The larger phenomenon is the web design community's acceptance of a delusion: that HTML + CSS, in some flavor, will give them cross-browser control, at least if they are diligent enough to keep up with the latest set of browser-specific workaround hacks, and that the biggest hopes for the future lie in areas such as font licensing. This delusion is so widespread that there is no point in trying to pick out one or two examples of it. It ignores the reality that browsers do not implement typographic algorithms, and that therefore, compared with the lovely online documents that PDF + <sophisticated authoring tool> could offer us decades ago, the result of their heroic efforts will always look like the clumsy output of a word processor. Except with nifty rollover effects.
