Last modified: Wednesday, 21-Feb-2024 20:13:22 UTC. Maintained by: Elisa E. Beshero-Bondar (eeb4 at psu.edu). Powered by firebellies.

Regex Exercise: Convert the text of Bram Stoker’s Dracula into XML

Consult the following resources as you work with Regular Expressions:

Our newtFire tutorial on Autotagging with Regular Expressions (Regex)
Regular-Expressions.info Tutorial: a mine of helpful detail on regular expression matching

The task

We begin with a plain-text file, the Project Gutenberg EBook of Dracula, by Bram Stoker. Download the file and save it to your computer, so you can open it locally in <oXygen/> (where it will open as a plain text file). We want to convert this file to XML but we don’t want to type all of the angle brackets manually. So what can we tag automatically, with global find-and-replace operations? For certain kinds of projects we might actually want to wrap tags around every word of the text, but, at a minimum, we can autotag chapters, chapter titles, paragraphs, and quotations using regex tools, and that’s the goal of the present assignment.

The resulting XML: Here is an example of the finished XML we want to create with this exercise.

Preliminaries

Prepare a Step File: open a new, separate text file, in which you will record each step you take in up-converting this document to XML. This needs to be a plain text (*.txt) or markdown (*.md) file and not something you write in a word processor (not a Microsoft Word document) so you do not have to struggle with autocorrections of the regex patterns you are recording. In this file you will record each step you take and paste in the patterns you apply in the Find and Replace windows in <oXygen/>. (Save your Step file following our standard homework filenaming conventions for homework submitted on Canvas.)

Overview of the conversion process

We will try to do this conversion in a set order just to guide you as you are learning. In reality, you can do some of these tasks in a different sequence, but if you want to follow our tutorial guide below, try to stick to this order. Each step in this overview is linked to a detailed explanation of how to approach the task.

Record each of your steps in a Markdown file that you save as .md. Record your steps like this, using backticks to wrap the regular expressions you use:

                  First I looked for blah blah...
                     Find: `&`
                     Replace: `&amp;`
                  
                  Then I tried to blah blah...

Search and replace reserved characters: &, <, >
Set up regex find & replace in oXygen, and remove extra blank lines (Replace with just two blanks in a row so you can see the chapter and paragraph divisions.)
Find and mark all long lines of text as paragraphs inside <p>....</p> tags.
Find all chapter titles and wrap them in <heading> start and end tags.
- You will need a regular expression pattern that isolates just the chapter headings.
- Your replace will need to remove the <p> tags around the chapter lines. (Use capturing groups in your find, and refer to them in your replace.)

Wrap whole chapters with <chapter> start and end tags. (Use our close-open strategy.) After this, the chapter titles and body paragraphs will be nested inside chapter elements in your XML hierarchy. Here is a simple view of the hierarchy you are building:

               <xml> 
                   <chapter>
                       <heading>.....</heading>
                       <p>....</p>
                       <p>.....</p>
                       <p>.....</p>
                       ...
                   </chapter> 
                   <chapter>
                       <heading>....</heading>
                       <p>........</p>
                       <p>....</p>
                       <p>.....</p>
                       <p>.....</p>
                       ...
                   </chapter>
                  ...
  
               </xml>

Find and auto-tag the spoken passages inside single lines. Try setting the ? as a don't-be-greedy check on the .+ pattern. Replace the quotation marks with <q> start and end tags.
Manually clean up the XML you have created: look for extra close tags from the times you used the close-open strategy on paragraphs and chapters for example. (Make sure you have a root element, and do anything that only needs to be marked once by you.)
Check if your work is well-formed XML: Save the file as XML with a .xml file extension. Close it and re-open it in oXygen and make sure it is well-formed.
See if you can autotag dates and times in the novel. Try starting searches for digits in the document with \d and get a sense of the distinct patterns for dates and times.
Submit your work: We need to see your Markdown file where you recorded your steps, and you may also submit the XML file you created on Courseweb.

Step by step

There’s more than one way to accomplish this task, but one way to approach the problem is as follows:

Reserved characters

The plain text file could, at least in principle, contain characters that have special meaning in XML: the ampersand and the angle brackets. You need to search for those and replace them with their corresponding XML entities. (These are those character strings that start with an & character.) You can read some detailed information about entity strings and what they are for on Obdurodon’s Entities and numerical character references section of http://dh.obdurodon.org/what-is-xml.xhtml, but for just a quick list, see Special Reserved Characters at the bottom of our Introducing XML tutorial. Note that you need to process them in the correct order, because of the ampersand (&) in each one! Think about this carefully: You always want to replace the & characters first. (Why? Explain in your homework write-up.)
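The replacement order can be tried outside <oXygen/> as well. The following is a minimal sketch in Python, used here only as a stand-in for three find-and-replace passes; the function name and the sample string are made up for illustration and are not part of the assignment.

```python
def escape_reserved(text: str) -> str:
    """Replace XML reserved characters with entities, ampersand first."""
    text = text.replace("&", "&amp;")  # pass 1: ampersands
    text = text.replace("<", "&lt;")   # pass 2: left angle brackets
    text = text.replace(">", "&gt;")   # pass 3: right angle brackets
    return text


# Invented sample string, not taken from the Dracula file.
sample = "Tom & Jerry <sic>"
escaped = escape_reserved(sample)
print(escaped)
```

Try reordering the three passes on the same sample and compare the results; that comparison is a good starting point for the explanation your write-up asks for.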

Extra blank lines

The blank lines are pseudo-markup that tell us where titles and paragraphs begin and end, but in some cases there are multiple blank lines in a row (for example, there are two blank lines between the title and the word by). Those extra blank lines don’t tell us anything useful, so we’ll start by getting rid of them. We want to retain one blank line (two newline characters) between titles and paragraphs, etc., but not more than that.

To perform regex searching, you need to check the box labeled Regular expression at the bottom of the <oXygen/> find-and-replace dialog box, which you open with Control-f (Windows) or Command-f (Mac). If you don’t check the Regular expression box, <oXygen/> will just search for what you type literally, and it won’t recognize that some characters in regex have special meaning. You don’t have to check anything else yet.

The regex escape code that matches a new line is \n, so you want to search for more than two of those in succession, and you want to replace them with exactly two. You can search for three blank lines and replace them with two and then keep repeating the process until there are no instances of three blank lines left, or, more elegantly and efficiently, you can search for \n{3,}, which matches three or more new line characters in succession (see the Limiting repetition section of http://www.regular-expressions.info/repeat.html) and replace them with \n\n (the quantifiers work only in the Find window, but not in replacements, so you have to write it this way).
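Here is a small sketch of the same find-and-replace in Python's re module, offered as an illustration rather than as a substitute for working in <oXygen/>. The sample string is invented to imitate a title page with too many blank lines.

```python
import re

# Invented sample: runs of 3 or 4 newline characters between short lines.
sample = "DRACULA\n\n\n\nby\n\nBram Stoker\n\n\nCHAPTER I"

# Find: \n{3,}   Replace: two newline characters
collapsed = re.sub(r"\n{3,}", "\n\n", sample)
print(repr(collapsed))
```

Runs of exactly two newline characters are left alone, because the quantifier {3,} requires at least three.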

Note that a transformation that searches for a sequence of two end-of-line characters depends on their being immediately adjacent to each other. If what looks like a blank line to you actually has (invisible) spaces or tabs, the pattern won’t match and the replacement won’t happen because there will be spaces or tabs between the end-of-line characters, which is to say that they won’t be adjacent. If you think that might be the case, you can make those characters visible by going into the <oXygen/> preferences (Preferences → Editor) and checking the boxes labeled Show TAB/NBSP/EOL/EOF marks and Show SPACE marks under Whitespaces. If you do have whitespace characters interfering with your ability to find a blank line (that is, two consecutive new line characters), you can use regex processing to replace them: the pattern \t matches a tab character, a space matches a space, and \s+ matches one or more white-space characters of any sort (including new lines). You can use the Find or Find all options in the find-and-replace dialog to explore the document and make sure that you’re matching what you want to match before you use Replace all to make the changes.
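If invisible spaces or tabs do turn out to be the problem, one possible approach is to empty out the whitespace-only lines first and then collapse the blank lines. This sketch uses Python's re module and an invented sample; the pattern assumes that ^ and $ match at line boundaries (Python needs the MULTILINE flag for that).

```python
import re

# Invented sample: the "blank" lines hide a space and tab characters.
sample = "first paragraph\n \t\nsecond paragraph\n\n\n\t\nthird"

# Pass 1: delete spaces and tabs on lines that contain nothing else.
cleaned = re.sub(r"^[ \t]+$", "", sample, flags=re.MULTILINE)

# Pass 2: the newline characters are now adjacent, so \n{3,} can match.
collapsed = re.sub(r"\n{3,}", "\n\n", cleaned)
print(repr(collapsed))
```

The character class [ \t] is deliberately narrower than \s, which would also match the newline characters you want to keep.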

Paragraphs

We are working from the inside out, starting by wrapping tags around every line of text content. Make sure Dot matches all is turned off, and then search for one or more of any character between the start of a line and the end of a line. Remember the ^ signals the start of a line, and $ signals the end of the line. Hint: You can replace by referring to the whole match and wrapping <p> start and </p> end tags around it.
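As a rough illustration of the hint, here is how the idea looks in Python's re module. The sample lines are invented, and the way Python refers to the whole match (\g<0>) is specific to Python; the syntax in the <oXygen/> Replace field may differ.

```python
import re

# Invented sample: three lines of text separated by blank lines.
sample = "CHAPTER I\n\nFirst line of prose.\n\nSecond line of prose."

# ^.+$ matches every line holding at least one character;
# the blank lines are skipped because .+ needs something to match.
tagged = re.sub(r"^.+$", r"<p>\g<0></p>", sample, flags=re.MULTILINE)
print(tagged)
```

Note that the chapter line is wrapped as a paragraph too, which is why a later step has to retag it.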

Chapter titles

The title of the first chapter within the body looks like:

<p>CHAPTER I</p>

the second looks like:

<p>CHAPTER II</p>

and we can see easily, from the list of chapter titles at the top, that there are 27 chapter titles, each of which begins with the word CHAPTER. If we can write a regex that matches chapter headings and only chapter headings, then, we can replace the paragraph markup with heading markup, retaining the part in the middle.

We’re not going to write that regex for you, but we will tell you the pieces we used. Try building a regex and running Find all to verify that it is matching all of the chapter titles and nothing else. When you can match what you need, then you can think about how to craft the replacement string. Here are the pieces:

First make sure that, under Options, Case sensitive is checked and Dot matches all is unchecked. You want to do case sensitive matching because the Roman numeral characters here are all upper case, so you want to be able to distinguish those from lower case i, v, x, etc. We’ll discuss when to use Dot matches all below, but for now, make sure that it’s unchecked.
A chapter heading is (now) wrapped (misleadingly) in <p> tags and fills a single line. You can take advantage of that fact by searching for lines that begin with <p> and end with </p>. How can you quickly isolate all 27 chapters? What pattern do they all share?
You now need to replace the paragraph tags with <heading> tags. To do that we need to capture the part of the title line that’s between the paragraph tags and write that captured text into the replacement. To capture part of a regex, you wrap it in parentheses; this doesn’t match parenthesis characters, but it does make the part of the regex that’s between the parentheses available for reuse in the replacement string. For example, a(b)c would match the sequence abc and capture the b in the middle, so that it could be written into the replacement. Capturing a single literal character value isn’t very useful because you could have just written the b into the replacement literally, but you can also capture wildcard matches. For example, a(.)c matches a sequence of a literal a character followed by any single character except a new line followed by a literal c character. To get more than a single character, you need a repetition indicator. You can use that information to capture everything between the paragraph tags in the matched string. To write a captured pattern into the replacement, use a backslash followed by a digit, where \1 means the first capture group, \2 means the second, etc. In this case you’re capturing only one group, so you can build a replacement string that starts with <heading>, ends with </heading>, and puts \1 between them. You don’t need to do anything about the line start and line end anchors; since you’ve matched an entire line, the replacement will automatically be an entire line.
Putting this all together, you should be able to retag your headings automatically, removing their <p> tags and replacing them with <heading> tags. Try it.
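The capture-group mechanics described above can be tested in isolation before you build the chapter regex. This sketch uses Python's re module with toy strings and neutral <b> and <i> tags (all invented for illustration); it deliberately does not include the chapter-heading pattern, which is yours to write.

```python
import re

# a(b)c captures the literal b; \1 writes it into the replacement.
single = re.sub(r"a(b)c", r"[\1]", "abc")

# a(.)c captures any single character between a and c.
wild = re.sub(r"a(.)c", r"[\1]", "abc axc a-c")

# A repetition indicator lets the group hold more than one character.
retagged = re.sub(r"^<b>(.+)</b>$", r"<i>\1</i>", "<b>bold words</b>",
                  flags=re.MULTILINE)

print(single, wild, retagged, sep="\n")
```

In each case only the captured part survives into the output; everything else in the match is replaced by what you type around \1.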

Chapters

A book isn’t just a series of paragraphs with titles strewn among them; the book has logical chapters, which must begin with a title, and you want to represent this part of the logical document hierarchy in your markup by inserting <chapter> tags. Much as you used blank lines as milestone delimiters between paragraphs earlier, you can now use your <heading> elements as delimiters between chapters. Use a find-and-replace operation to do this; you’ll have to clean up the markup before the first chapter and after the last one manually, since in those cases the <heading> element doesn’t have the same milestone function as elsewhere.
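One way to picture the close-open strategy is to put a chapter end tag and a chapter start tag in front of every heading. The sketch below does this in Python on an invented two-chapter fragment; it is an illustration of the idea, not necessarily the exact find and replace you will use in <oXygen/>.

```python
import re

# Invented fragment with two headings already tagged.
sample = (
    "<heading>CHAPTER I</heading>\n<p>One.</p>\n"
    "<heading>CHAPTER II</heading>\n<p>Two.</p>"
)

# Close the previous chapter and open a new one before each heading.
chaptered = re.sub(r"<heading>", "</chapter>\n<chapter>\n<heading>", sample)
print(chaptered)
```

The output begins with a stray </chapter> and lacks a final </chapter>, which is exactly the manual cleanup needed before the first chapter and after the last one.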

Quotes

How are quotations represented in the plain text? How would you find the text of a quotation, that is, how would you find where it starts, where it ends, and what goes between the start and the end? Files on the Internet sometimes have errors and inconsistencies; if you’re relying on cues in the text to identify the beginnings and ends of quotations, what can happen if you miss one?

Quotation marks in the Dracula document are all straight quotation marks instead of the curly quotes. Matching and tagging the spoken passages inside quotation marks raises a few concerns:

A line may have more than one quotation. If we write a regex like ".+" (including the quotation marks), will we match each quotation individually, or will we match the first quotation mark on the line and the last, erroneously gobbling up everything between into one spurious quotation? Try it and see.
Some quotations span multiple lines. Since the dot matches any character except a new line, if we write ".+" and the start and end quotation marks are on different lines, we’ll fail to match those quotations, and we may erroneously match material between ending and starting quotation marks, instead of only between starting and ending ones. Try it and see.

Let’s address the first problem. There’s a line in the text that reads:

<p>"But, Count," I said, "you know and speak English thoroughly!" He bowed gravely.</p>

This passage shows two separate quotations on one line. If we write ".+" (with Dot matches all turned OFF), we will match too far: from the start of the first quote to the end of the last quote. Uh oh! This means we have made a greedy match and missed the inner set of quotation marks. We can resolve the problem by specifying that the match should be non-greedy, that is, that we should make the shortest possible match (instead of the longest, which is the default), and we do this by following the repetition indicator (the plus sign) with a question mark. (Note that the question mark you met earlier is a repetition indicator that means zero or one instance of whatever it follows. Here it isn’t a repetition indicator, though; here it means don’t be greedy. So if the same symbol can have two such different meanings, how does a regex processor know which meaning to apply?) In other words ".+?" will correctly treat two full quotations on the same line as separate quotations. Try it. You should now be matching each quotation on a line separately, regardless of the number of quotations on the line. (With Dot matches all turned off, a quotation that spans a new line character will still not match.)
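The difference between the greedy and non-greedy patterns can be seen side by side on the line quoted above. This is a small sketch in Python's re module, where the dot also excludes new lines by default (comparable to having Dot matches all turned off).

```python
import re

line = ('<p>"But, Count," I said, "you know and speak English '
        'thoroughly!" He bowed gravely.</p>')

# Greedy: one match running from the first quotation mark to the last.
greedy = re.findall(r'".+"', line)

# Non-greedy: each quotation is matched on its own.
lazy = re.findall(r'".+?"', line)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

The greedy pattern returns a single long match that swallows the words I said; the non-greedy pattern returns two matches.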

Once you can do that, you can capture the text between the quotation marks and write it into the output between <quote> tags. Don’t include the quotation mark characters themselves in the capture group; those are plain-text pseudo-markup, and now that you’re going to be tagging quotations with real markup, you don’t want the quotation mark characters included.
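A sketch of that capture-and-replace step, again in Python's re module as a stand-in for the <oXygen/> dialog: the parentheses sit inside the quotation marks, so the marks themselves are matched but not captured.

```python
import re

line = ('<p>"But, Count," I said, "you know and speak English '
        'thoroughly!" He bowed gravely.</p>')

# The quotation marks are part of the match but outside the group,
# so they are dropped when \1 is written into the replacement.
tagged = re.sub(r'"(.+?)"', r"<quote>\1</quote>", line)
print(tagged)
```

If one quotation mark is missing somewhere in a line, the pairing is thrown off for the rest of that line, so it is worth spot-checking the results.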

Cleanup

At this point you can fix the title at the top manually, and you need to wrap the entire document in a root element (such as <book>). Check to see if you need to move stray close tags at the top of the document and missing close tags at the bottom from your chapter tagging.

Checking your results

Although you’ve added XML markup to the document, <oXygen/> remembers that you opened it as plain text, which means that you can’t check it for well-formedness. To fix that, save it as XML with File → Save as and give it the extension .xml. Even that doesn’t tell <oXygen/> that you’ve changed the file type, though; you have to close the file and reopen it. When you do that, <oXygen/> now knows that it’s XML, so you can verify that it’s well formed in the usual way: Control+Shift+W on Windows, Command+Shift+W on Mac, or click on the arrow next to the red check mark in the icon bar at the top and choose Check well-formedness. If <oXygen/> signals green for well-formed, go ahead and pretty-print the file to see the hierarchy you created.
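If you would like a second opinion outside <oXygen/>, a well-formedness check can be sketched with Python's standard xml.etree.ElementTree parser. The helper name and the two sample strings are invented for illustration; the second string imitates a stray close tag left over from the close-open strategy.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = "<book><chapter><heading>CHAPTER I</heading><p>Text.</p></chapter></book>"
bad = "</chapter><chapter><heading>CHAPTER I</heading><p>Text.</p>"

print(is_well_formed(good), is_well_formed(bad))
```

This only reports whether the text parses; <oXygen/> gives more helpful messages about where a problem lies.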

Autotagging dates and times

You can continue applying regular expression Find and Replace after you save the document as well-formed XML, since XML is made out of patterned text after all. Since Dracula contains journal entries, we can see dates as well as times of day mentioned throughout the file. Continue practicing your regex skills to see if you can figure out a pattern for matching the dates and/or times. Try searching first for any digit \d to get a quick look at the different ways numbers are formatted, and see if you can identify distinctive patterns for dates vs. times. Even if you do not match all of them, see how many you can find. (You may want to do a few different passes to capture times with and without a colon, for example.)
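To show the kind of exploration meant here, the sketch below runs three searches in Python's re module over an invented sentence. The sample is not quoted from the novel, and the date and time formats in your file may well differ, so treat the patterns as starting points only.

```python
import re

# Invented sample; check the real file for the formats it actually uses.
sample = "3 May. Left at 8:35 in the evening; arrived 6:46 next morning. 25 June."

# First look: every run of digits, whatever it represents.
numbers = re.findall(r"\d+", sample)

# Times written with a colon: one or two digits, a colon, two digits.
times = re.findall(r"\d{1,2}:\d{2}", sample)

# Day-plus-month dates: one or two digits, a space, a capitalized word.
dates = re.findall(r"\d{1,2} [A-Z][a-z]+", sample)

print(numbers, times, dates, sep="\n")
```

The date pattern here would also match any number followed by a capitalized word, so narrowing it (for example, to a list of month names) is a natural next pass.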

What to submit

the original source text file you started with
a step file as a markdown (.md) or plain text (.txt) document (a step-by-step description of what you did), and
your results file (the XML document as .xml)

If you don’t get all the way to a solution, just upload the description of what you did, what the output looked like, and why you were not able to proceed any further. As you are working on this, post any questions on Slack or our class GitHub Issues board!