Document Type

Presentation

Publication Date

11-2017

Publication Title

Chicago Colloquium on Digital Humanities and Computer Science (DHCS)

Abstract

As more online versions of Humanities texts are created, it is becoming commonplace to embed elaborate formatting in them, for example through HTML. This formatting can interfere with computerized analyses of the original text. Although stripping the markup may seem straightforward at first, in practice it is not. Many digital texts add material, such as line numbers, that the markup syntax treats as legitimate content, and even apart from this issue, existing tools for stripping markup produce inconsistent results. Short of adopting and enforcing strict conventions for adding markup to texts, the best way to keep a precise record of the original text is to store the markup separately from the text. This talk describes a prototype system for storing text and markup (of any type) separately, with a facility for quickly creating a merged file when desired. We also present a prototype system for separating markup from text in an HTML (or, more generally, XML) file, and we discuss future directions for semi-automatically creating accurate separations across the wide variety of formatted digital texts currently being distributed.
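
The idea of storing markup separately and merging it back on demand can be illustrated with a minimal sketch. This is not the prototype described in the talk; it is an assumption-laden illustration using Python's standard `html.parser` module. The names `MarkupSeparator`, `separate`, and `merge` are hypothetical, and the round trip is exact only for simple input: lowercase tags, no comments or doctype declarations, and no character entities.

```python
from html.parser import HTMLParser


class MarkupSeparator(HTMLParser):
    """Hypothetical helper: split tagged input into plain text and standoff markup."""

    def __init__(self):
        super().__init__()
        self.text_parts = []  # pieces of plain text, in document order
        self.markup = []      # (offset into the plain text, raw tag string)
        self.text_len = 0     # length of the plain text collected so far

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.markup.append((self.text_len, self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are stored as a single record.
        self.markup.append((self.text_len, self.get_starttag_text()))

    def handle_endtag(self, tag):
        # Assumption: end tags are rebuilt from the (lowercased) tag name.
        self.markup.append((self.text_len, f"</{tag}>"))

    def handle_data(self, data):
        self.text_parts.append(data)
        self.text_len += len(data)


def separate(source):
    """Return (plain_text, markup_records) for a tagged string."""
    parser = MarkupSeparator()
    parser.feed(source)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.text_parts), parser.markup


def merge(text, markup):
    """Re-insert each stored tag at its recorded offset in the plain text."""
    out = []
    pos = 0
    for offset, raw in markup:
        out.append(text[pos:offset])
        out.append(raw)
        pos = offset
    out.append(text[pos:])
    return "".join(out)


source = '<p class="verse">Call me <i>Ishmael</i>.</p>'
text, markup = separate(source)
print(text)
print(markup)
print(merge(text, markup) == source)
```

A separation this naive still treats anything between tags as text, so line numbers embedded as element content would end up in the plain text. That is the kind of case the abstract identifies as needing more than simple markup stripping.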

Creative Commons License

Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License
