Computer Science: Faculty Publications and Other Works

Separating Markup from Text

Document Type

Presentation

Publication Date

11-2017

Publication Title

Chicago Colloquium on Digital Humanities and Computer Science (DHCS)

Abstract

As more online versions of Humanities texts are created, it is becoming commonplace to embed elaborate formatting in them, for example through HTML. Such markup can interfere with computerized analyses of the original text. Simply stripping the markup may seem straightforward at first, but in practice it is not. Many digital texts add material, such as line numbers, that the markup syntax treats as legitimate content, and even apart from this issue, existing tools for stripping markup produce inconsistent results. Short of adopting and enforcing strict conventions for adding markup to texts, the best way to keep a precise record of the original text is to store the markup separately from the text. This talk describes a prototype system for storing text and markup (of any type) separately, with a facility for quickly creating a merged file when desired. We also provide a prototype system for separating markup from text in an HTML (or, more generally, XML) file, and we discuss future directions for semi-automatically creating accurate separations across the wide variety of formatted digital texts currently being distributed.

Recommended Citation

Greenberg, Ronald I. and Thiruvathukal, George K. Separating Markup from Text. Chicago Colloquium on Digital Humanities and Computer Science (DHCS), 2017. Retrieved from Loyola eCommons, Computer Science: Faculty Publications and Other Works.

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.

DHCS'17talk_Separating_Markup_from_Text.pptx (224 kB)

Included in

Digital Humanities Commons, Other Computer Sciences Commons
