Structured accessible HTML from PDF

Written By: Peter Abrahams
Content Copyright © 2007 Bloor. All Rights Reserved.

There are hundreds of millions of PDF files on the web. The two main reasons for this popularity are:

  1. The document always looks the same irrespective of the type of printer, browser or device. This is important aesthetically but may also have legal implications.
  2. The document is secure; it cannot be altered.

Unfortunately, PDF files on the web can be problematic for people with disabilities, especially users of screen readers, because:

  • Free screen readers, such as Thunder, do not support PDF documents because the complexity of the file format has made that support too expensive to develop. Commercially available screen readers that do support PDF, such as JAWS, are too expensive for the large number of people who use computers infrequently or access the web via a PC in a library or Internet café.
  • Adobe have defined an extension to the PDF format that gives screen readers more information, such as alternative text for images and heading levels to aid navigation around the document. However, most existing PDF files were not created as accessible PDF, and the task of converting existing documents is complex and not always achievable.
  • Creating new documents as accessible PDF is perfectly possible and straightforward but requires the use of specific tools and the understanding and cooperation of the document originators. So it is inevitable that many new documents will be produced that are not accessible.

PDF documents are an ideal format for downloading from the web and printing out, but for all the above reasons there is a need to provide these documents in an alternative format. The obvious alternative is to make the document available in HTML that is designed for users who are blind or have a vision impairment. Such a user is not interested in the document looking identical to the original, but needs a document that can be read efficiently using a screen reader; to do this the document must:

  • Be linearised; that is, any text set in multiple columns or wrapped around pictures in the original must be presented in the correct reading order.
  • Have alternative text for any images.
  • Mark up tables so that information can be accurately and quickly found in them.
  • Include document structure information such as headings, so that the user can navigate quickly around and find the relevant information.
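As a rough illustration of the structural requirements in this list, the following Python sketch reads an HTML document and reports its heading outline and the number of tables that have no header cells. It is only a sketch under stated assumptions: the names StructureAudit and audit_structure are invented for this example, it uses Python's standard html.parser module, and treating a table without any th element as "not marked up" is a simplification rather than a full accessibility test.

```python
from html.parser import HTMLParser


class StructureAudit(HTMLParser):
    """Collects the heading outline and counts tables without header cells.

    Hypothetical helper for illustration; not part of any product.
    """

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []               # (level, text) pairs in document order
        self.tables_without_headers = 0
        self._heading_level = None      # level of the heading being read, if any
        self._heading_text = []
        self._table_has_th = []         # one flag per currently open table

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._heading_level = int(tag[1])
            self._heading_text = []
        elif tag == "table":
            self._table_has_th.append(False)
        elif tag == "th" and self._table_has_th:
            self._table_has_th[-1] = True

    def handle_data(self, data):
        if self._heading_level is not None:
            self._heading_text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._heading_level is not None:
            # Collapse runs of whitespace inside the heading text.
            text = " ".join("".join(self._heading_text).split())
            self.outline.append((self._heading_level, text))
            self._heading_level = None
        elif tag == "table" and self._table_has_th:
            if not self._table_has_th.pop():
                self.tables_without_headers += 1


def audit_structure(html_text):
    """Return (heading outline, number of tables lacking any <th> cell)."""
    audit = StructureAudit()
    audit.feed(html_text)
    audit.close()
    return audit.outline, audit.tables_without_headers
```

An empty outline, or a non-zero table count, would suggest that a screen-reader user has little to navigate by, even if the page looks fine on screen.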

There are a number of PDF-to-HTML converters available, but I believe that the recently announced RiverDocs Converter is the first aimed specifically at creating structured, accessible HTML documents that are optimised for screen-reader use.

The converter will take any PDF document, analyse it to recognise multi-column pages, headings, tables, images and other formatting, and convert it all into XHTML. Correctly recognising text that wraps around a picture, or the cells in a table, requires sophisticated artificial intelligence algorithms.
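To give a feel for what linearising a multi-column page involves, here is a deliberately naive Python sketch. It is not how RiverDocs Converter works, which is not described in detail here; it simply assumes that each block of text has already been extracted with a left and top position (the Block type, the linearise function and the column_tolerance parameter are all invented for this example). Blocks with a similar left edge are treated as one column, and columns are read left to right, top to bottom.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A block of extracted text with its position on the page (hypothetical)."""
    text: str
    left: float   # distance from the left edge of the page
    top: float    # distance from the top edge of the page


def linearise(blocks, column_tolerance=20.0):
    """Return block texts in reading order: column by column, top to bottom."""
    columns = []  # each column is a list of blocks sharing a similar left edge
    for block in sorted(blocks, key=lambda b: b.left):
        # Compare against the leftmost block of the current column.
        if columns and abs(columns[-1][0].left - block.left) <= column_tolerance:
            columns[-1].append(block)
        else:
            columns.append([block])

    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.top))
    return [b.text for b in ordered]
```

A rule this simple breaks down as soon as text wraps around a picture or a heading spans both columns, which is exactly why a real converter needs something considerably more sophisticated.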

Having completed the conversion, the converter checks the output for accessibility issues that could not be fixed automatically. The most obvious issue is the lack of image descriptions, which are supplied using the alt attribute.
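A check of this kind can be sketched in a few lines of Python. The example below is an assumption-laden illustration, not the product's own checker: the names MissingAltFinder and find_missing_alt are invented, and it relies only on Python's standard html.parser module. It lists every img element that has no alt attribute at all, together with the line on which it appears; an empty alt="" is left alone because that is the accepted way to mark a purely decorative image.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Records images that have no alt attribute (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.issues = []  # (line number, image source) pairs

    def handle_starttag(self, tag, attrs):
        # Self-closing XHTML tags such as <img ... /> also arrive here,
        # because the default handle_startendtag calls handle_starttag.
        if tag != "img":
            return
        attributes = dict(attrs)
        if "alt" not in attributes:
            line, _ = self.getpos()
            self.issues.append((line, attributes.get("src", "")))


def find_missing_alt(html_text):
    """Return a list of (line, src) for every image lacking an alt attribute."""
    finder = MissingAltFinder()
    finder.feed(html_text)
    finder.close()
    return finder.issues
```

The resulting list is the sort of thing a person still has to work through by hand, since software can detect that a description is missing but cannot reliably write a meaningful one.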

The product's user interface lets the user see the list of issues alongside the relevant sections of the original PDF file, the generated XHTML and a preview of the document in a browser. Clicking on an issue moves the preview to the context of that issue, so that the user can fix the problem.

The final output will be a well-structured and annotated document that will give a blind user an excellent experience whilst reading the document.

The UK Disability Equality Duty, which I discussed in a recent blog, has put significant pressure on public authorities and their suppliers to ensure that all the content of their web sites is accessible. Providing structured, accessible XHTML versions of all the PDF files is considered to be the only way to comply with the Duty.

The volume and size of the files that need to be converted have meant that the authorities have outsourced this task to specialist web agencies. RiverDocs Converter automates most of the conversion process, which means that an agency using it should be able to submit a very competitive bid.

RiverDocs Converter should appeal to any organisation that has a large number of existing documents that need to be made accessible, or that publishes new documents that are not created to be accessible and will need conversion.