Thursday, August 14, 2008

Linking Page Images To Transcriptions With SVG

Hugh A. Cayless - Carolina Digital Library and Archives - Paper

Preface: "There is very little practice or theory in this talk."

There is a lot of discussion in certain areas of the blogosphere regarding the problems of text-image linking. Hugh has a huge manuscript digitization project coming up and might be able to use this theory/practice.

Goals:
  • Create an SVG overlay of the manuscript page image
  • Analyse the structure of the SVG document to detect lines, etc.
  • Link the groups so produced to structures in a TEI transcription
  • Display the results in a usable GUI

Inkscape (the SVG drawing tool) has a tracing tool: trace the BMP (or JPG, etc.) and then plug the resulting SVG output into your XML application (freaking cool... I didn't know Inkscape could do that).

He gathered tools from the open source community rather than writing his own monolithic thing.

potrace takes a bitmap and converts it to a vector graphic. The input has to be a bitmap, so the page image is converted with ImageMagick first. The output from potrace was not exactly what he wanted, so he ran Inkscape from the command line to convert the paths to absolute coordinates. Then he used XSLT to do a little cleanup (adding specifically named IDs).
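
To make the first two steps concrete for myself, here is a minimal sketch of how the ImageMagick and potrace calls could be strung together from Python. It is my own guess at the shape of the pipeline, not Hugh's actual script: the function name `build_trace_commands` and the file names are hypothetical, and I am assuming the classic ImageMagick `convert` command is on the path (newer ImageMagick releases call it `magick`). The Inkscape and XSLT steps are left out because the talk did not spell out the exact invocations.

```python
from pathlib import Path


def build_trace_commands(page_image, workdir="."):
    """Return the argv lists for converting a scan to BMP and tracing it.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    page = Path(page_image)
    bitmap = Path(workdir) / (page.stem + ".bmp")
    svg = Path(workdir) / (page.stem + ".svg")
    return [
        # ImageMagick picks the output format from the file extension.
        ["convert", str(page), str(bitmap)],
        # potrace: --svg selects the SVG backend, --output names the result.
        ["potrace", "--svg", "--output", str(svg), str(bitmap)],
    ]
```

Each list can be handed to `subprocess.run(cmd, check=True)` in turn once both tools are installed.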

He then used a Python script built on lxml (through its ElementTree API) and numpy: the script reads in the SVG produced by potrace and filtered through Inkscape, does some filtering, detects the lines, then serializes the results back to SVG and JavaScript.
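
The talk did not go into how the line detection actually works, so the following is only a sketch of one plausible approach: take the vertical extent of each traced path and start a new line wherever there is a big enough vertical gap. Several things here are my assumptions, not Hugh's method: I use the standard library's `xml.etree.ElementTree` in place of lxml (the API is the same for this purpose) and plain Python in place of numpy, I assume the paths use only absolute M, L, C and Z commands, I ignore any `transform` attributes, and the names `group_into_lines` and `max_gap` are made up.

```python
import re
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
NUMBER = re.compile(r"-?\d*\.?\d+(?:[eE][-+]?\d+)?")


def y_extent(d):
    """Smallest and largest y value in a path made of absolute x,y pairs.

    Curve control points are included, so this is only an approximation
    of the true bounding box.
    """
    nums = [float(n) for n in NUMBER.findall(d)]
    ys = nums[1::2]  # every second number is a y coordinate
    return min(ys), max(ys)


def group_into_lines(svg_text, max_gap=5.0):
    """Group path ids into lines, splitting where the vertical gap is large."""
    root = ET.fromstring(svg_text)
    spans = []
    for path in root.iter(SVG_NS + "path"):
        d = path.get("d", "")
        if NUMBER.search(d):
            top, bottom = y_extent(d)
            spans.append((top, bottom, path.get("id")))
    spans.sort()  # top-most paths first

    lines = []
    current_bottom = None
    for top, bottom, path_id in spans:
        if current_bottom is None or top - current_bottom > max_gap:
            lines.append([])  # gap is big enough: start a new line
            current_bottom = bottom
        else:
            current_bottom = max(current_bottom, bottom)
        lines[-1].append(path_id)
    return lines
```

The ids it returns would be the ones added in the XSLT cleanup step, which is presumably what makes it possible to point a transcription at a group of paths later.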

Problems:
  • How much can be automated? Right now, not much is.
  • How deeply can this be analysed?
  • What is the best testing mechanism?

Telling potrace where the black/white cutoff lies is the major sticking point right now. Image pre-processing also isn't clearly defined, and automating the linking and path disposal is pretty important.
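
One way I could imagine attacking the cutoff problem by hand is to trace the same page at several cutoffs and compare the results by eye. potrace has a `--blacklevel` option (a value between 0 and 1, default 0.5) for exactly this cutoff. The sketch below just builds the commands; the function name `blacklevel_sweep`, the chosen levels, and the output file naming are my own invention, and it assumes the input still has grey levels (on a pure 1-bit image the cutoff makes no difference).

```python
from pathlib import Path


def blacklevel_sweep(bitmap, levels=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Build one potrace command per black/white cutoff (hypothetical helper)."""
    stem = Path(bitmap).stem
    commands = []
    for level in levels:
        # Encode the cutoff in the file name, e.g. page1-k040.svg for 0.4.
        out_name = f"{stem}-k{int(round(level * 100)):03d}.svg"
        commands.append(
            ["potrace", "--svg", "--blacklevel", str(level),
             "--output", out_name, str(bitmap)]
        )
    return commands
```

This does not automate anything, it only makes the trial and error less tedious.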

His process is very preliminary but really sounds fascinating to me. I have tons of documents from my grandmothers' homes that I am afraid of losing, either to the elements or to the trashcan. Not everyone in the family finds them useful. But if I could digitize them, at least I could maintain the content, if not the source material. I will have to look into aspects of this process.
