Regex Exercise: Sonnets

Last modified: Wednesday, 30-Jan-2019 05:08:16 UTC. Authored by: David J. Birnbaum (djbpitt at gmail.com). Edited and maintained by: Elisa E. Beshero-Bondar (ebb8 at pitt.edu). Powered by firebellies.

Get the source text ready in <oXygen/>

The text we’ll be using as input for the first regex homework assignment is a plain-text version of Shakespeare’s sonnets produced by Project Gutenberg, which you can download from our site here: shakeSonnets.txt.

Prepare a Step File

The task

Your goal is to produce an XML version of the Shakespeare sonnets file by using the search-and-replace techniques we discussed in class, and to record each step you take in a plain-text or markdown file so that others can reproduce exactly what you did. (In a real-life project you may need to share the steps you took in up-converting plain-text documents to XML on your GitHub repo. In that case you would write them in GitHub’s markdown, the same that we write on the GitHub Issues board, and save the file with a .md extension, like the instructions files we pushed into Class Examples for Regex_upConversion on the DHClass-Hub.)

Your up-converted XML output should look something like http://dh.obdurodon.org/shakespeare-sonnets.xml. That is, each sonnet should be its own element, each line should be tagged separately, and the roman numerals should be encoded in a useful way (we’ve used attributes, but you could also put them in a child element).

Your Steps file needs to be detailed enough to indicate each step of your process: what regular expression patterns you attempted to find, and what expressions you used to replace them. You might record the number of finds you get and even how you fine-tuned your steps when you were not finding everything you wanted to at first. Note: we strongly recommend copying and pasting your find and replace expressions into your Steps file instead of retyping them, since retyping makes it easy to introduce errors.

How to proceed

There are several ways to get to the target output, but the starting points are standard:

Starting work:

First of all, for any up-conversion of plain text, you must check for the special reserved characters: the ampersand & and the angle brackets < and >. You need to search for those and, if they turn up, replace them with their corresponding XML entities, so that they will not interfere with well-formed XML markup.

Search for:	Replace with:
`&`	`&amp;`
`<`	`&lt;`
`>`	`&gt;`

Note that you need to process the special XML reserved characters in the correct order. Why is it important that you search and replace the & first?
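If you want to experiment with these replacements outside <oXygen/>, the following Python snippet is one way to sketch them. It is only an illustration: the function name escape_reserved and the sample string are our own inventions, not part of the assignment, and the replacements run in the order given in the table above.

```python
def escape_reserved(text: str) -> str:
    """Replace the XML reserved characters with entities, ampersand first."""
    text = text.replace("&", "&amp;")  # runs before the other two replacements
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


# Hypothetical sample string, not taken from the sonnets file
sample = "Tom & Jerry <aside>"
print(escape_reserved(sample))  # Tom &amp; Jerry &lt;aside&gt;
```

Try moving the ampersand line to the end of the function and compare the output; that comparison should suggest an answer to the question above.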

Don’t worry about the title and author at the top of the file just yet. You will eventually tag them by hand, and we recommend just doing that at the end of the up-conversion process. You’ll be using <oXygen/>’s global Find-and-Replace tool to tag the sonnets, and if you leave the title and author in place while you do that, you’ll wind up tagging them incorrectly. That isn’t a problem as long as you remember to fix them manually at the end. Or you could remove them now to another file to paste them back in at the end of the regex autotagging process.

To perform regex searching, you need to check the box labeled Regular expression at the bottom of the <oXygen/> find-and-replace dialog box, which you open with Control-f (Windows) or Command-f (Mac). If you don’t check this box, <oXygen/> will just search for what you type literally, and it won’t recognize that some characters in regex have special meaning. You don’t have to check anything else yet. Be sure that Dot matches all is unchecked, though; we’ll explain why below.

Leading space characters

The non-blank lines all begin with space characters: there are two spaces before most lines (the Roman numerals and the first twelve lines of each sonnet) and four spaces before the last two lines of every sonnet. Those spaces are presentational formatting, and not part of the content of the text, and since we don’t need them in order to tag the text, we’ll start by deleting them. The regex to match a space character is just a space character, and you can match one or more space characters by using the plus sign repetition indicator. To match one or more instances of the letter X, you would use a regex like X+. To match one or more instances of a space character, just replace the X with a space.

You don’t want to remove all space characters, though; you just want to remove the ones at the beginning of a line. You can do that by using the caret metacharacter, which anchors a match so that it succeeds only at the beginning of a line. For example, if the regex X+ matches one or more instances of X, the regex ^X+ matches one or more instances of X only at the beginning of a line. You can use this information to match one or more space characters at the beginning of a line and replace them with nothing, that is, delete them.
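The same idea can be tried in Python’s re module. This is a sketch under assumptions: Python’s regex engine is not the one <oXygen/> uses, so details may differ, and the sample text is a tiny hand-typed stand-in for the real file. In Python the caret matches only at the start of the whole string unless you set re.MULTILINE, so the sketch sets that flag.

```python
import re

# Hand-typed stand-in for the input file (assumption: two-space and
# four-space indents, as described above)
sample = (
    "  I\n"
    "\n"
    "  From fairest creatures we desire increase,\n"
    "    To eat the world's due, by the grave and thee.\n"
)

# ^ anchors the match at the start of each line because of re.MULTILINE;
# " +" then matches one or more space characters, which we replace with nothing
stripped = re.sub(r"^ +", "", sample, flags=re.MULTILINE)
print(stripped)
```

Spaces between words are untouched, because they do not sit at the beginning of a line.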

We can always choose whether to work with blank lines or not. For our purposes in this exercise, we do not need them, so you can delete them if you’d like, or you can leave them in place to enhance the legibility. To delete them, you need to match a blank line, and the easiest way to do that is to match two new line characters in a row and replace them with a single new line character. The regex for a new line character is \n. Try it.
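Here is a minimal Python sketch of that replacement, again with a hand-typed sample rather than the real file. It assumes that lines end with a plain \n and that blank lines occur one at a time.

```python
import re

# Hand-typed sample with single blank lines between blocks
sample = (
    "I\n"
    "\n"
    "From fairest creatures we desire increase,\n"
    "That thereby beauty's rose might never die,\n"
    "\n"
    "II\n"
)

# Two new line characters in a row mean a blank line; collapse them to one
collapsed = re.sub(r"\n\n", "\n", sample)
print(collapsed)
```

If a file has runs of several blank lines, a single pass leaves some of them behind; repeating the replacement until nothing more is found is one way around that.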

Inside out or outside in?

We can create our markup either from the outside in (document, then sonnet, then divide the sonnet into Roman numeral and lines) or from the inside out (lines and Roman numeral, then wrap those in a sonnet, then wrap all of the sonnets in a document). Either strategy can be made to work, but we generally find it easier to work from the inside out. (When we work from outside in, it’s easy to wind up incorrectly wrapping <line> tags around the <sonnet> start and end tags, etc.)

Lines

We’ll start by tagging every line as a <line>. This will erroneously tag the Roman numerals as if they were lines of poetry, which they aren’t, but since we're using the inside-out method, we are just planning to correct those Roman numeral lines later.

We don’t want to tag any blank lines (if we left them in), though, so we need a regex that matches only lines that have characters in them. Check your <oXygen/> Find / Replace setup: make sure that Dot matches all is unchecked! Only with that box unchecked does the dot (.) match any character except a new line, which means that we can use the plus sign repetition indicator to match one or more instances of any character except a new line (that is, .+). By default regex selects the longest possible match, so even though a single character on a line would satisfy the pattern, when we run it, it will always match the entire line. Since the dot matches any character except a new line, the regex will match each line individually, that is, it won’t run over a new line and continue the same match. Try it and examine the results. Now check Dot matches all, run Find all, and look at those results. Notice that the match no longer stops at the end of the line, and since you want to tag each line individually, you need to uncheck that box to revert to the normal, default behavior.
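The difference between the two settings can also be seen in a short Python sketch. We are assuming here that Python’s re.DOTALL flag behaves like the Dot matches all checkbox; the two sample lines are hand-typed.

```python
import re

sample = (
    "When forty winters shall besiege thy brow,\n"
    "And dig deep trenches in thy beauty's field,\n"
)

# Default behavior: the dot stops at a new line, so each line is its own match
per_line = re.findall(r".+", sample)

# With re.DOTALL the dot also matches new lines, so one match swallows everything
whole_text = re.findall(r".+", sample, flags=re.DOTALL)

print(len(per_line))    # 2
print(len(whole_text))  # 1
```

The first result is what we want for tagging lines individually; the second shows why the box has to stay unchecked.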

A human might think of our task as wrap every line in <line> tags, but regex has a find-and-replace view of the world, so a regex way to think about it would be match every line, delete it, and replace it with itself wrapped in <line> tags. That is, regex doesn’t think about leaving the line in place and inserting something before and after it; it thinks about matching the line, deleting it, and then putting it back, but with the addition of the desired tags. The regex selects and matches each full line, but how do we write what we selected into the replacement string? The answer is that the sequence \0 in the replacement pattern means the entire regex match, and you can use that to write the matched line back into the replacement, but wrapped in <line> tags. Try it.
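To see the match-and-write-back idea in isolation, here is a hedged Python sketch. Python spells the entire match as \g<0> in the replacement string, which we are assuming plays the same role as \0 in the <oXygen/> dialog; the sample is hand-typed.

```python
import re

sample = "I\nFrom fairest creatures we desire increase,\n"

# .+ matches each non-empty line; \g<0> writes the matched line back,
# now wrapped in <line> tags
tagged = re.sub(r".+", r"<line>\g<0></line>", sample)
print(tagged)
```

Notice that the Roman numeral is tagged as a line too, which is the error we correct in the next step.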

Roman numerals

The Roman numerals are now erroneously tagged as if they were lines of poetry, and in our sample output at http://dh.obdurodon.org/shakespeare-sonnets.xml we want them to be attribute values. To start that process we need to think about how to distinguish a Roman numeral line from a real line of poetry. Since there are 154 sonnets, a Roman numeral line is a line that contains one or more instances of I, V, X, L, and C in any order and nothing else, and no real line of poetry matches that pattern. That means that we can match that pattern by using a regex character class, which you can read about at http://www.regular-expressions.info/charclass.html. This approach will match sequences that aren’t Roman numerals, like XVX, but those don’t occur, so we don’t have to worry about them. This illustrates a useful strategy: a simple regex that overgeneralizes vacuously may be more useful than a complex one that avoids matching things that won’t occur anyway. You can use the character class (wrapped in square brackets) followed by a plus sign (meaning one or more) to complete your regex so that it matches only <line> elements that contain a Roman numeral and nothing but a Roman numeral. Try it.
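One possible pattern, sketched in Python against a few hand-typed sample lines, is shown here. Treat it as an illustration of the character-class technique rather than as the only correct answer.

```python
import re

# Hand-typed sample of already-tagged text
tagged = (
    "<line>I</line>\n"
    "<line>From fairest creatures we desire increase,</line>\n"
    "<line>CLIV</line>\n"
    "<line>The little Love-god lying once asleep,</line>\n"
)

# [IVXLC]+ means one or more characters drawn from the set I, V, X, L, C;
# the surrounding tags ensure that nothing else is inside the element
numeral_lines = re.findall(r"<line>[IVXLC]+</line>", tagged)
print(numeral_lines)  # ['<line>I</line>', '<line>CLIV</line>']
```

The lines of poetry are not matched, because they contain characters outside the class.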

In this case you want to write the Roman numeral into the replacement string, but you want to get rid of the spurious <line> tags and replace them with other markup. \0 will write the entire match into the replacement, but that would include the original <line> tags that you want to remove. To capture part of a regex match, you wrap it in parentheses; this doesn’t match parenthesis characters, but it does make the part of the regex that’s between the parentheses available for reuse in the replacement string. For example, a(b)c would match the sequence abc and capture the b in the middle, so that it could be written into the replacement. Capturing a single literal character value isn’t very useful because you could have just written the b into the replacement literally, but you can also capture wildcard matches. For example, a(.)c matches a sequence of a literal a character followed by any single character except a new line followed by a literal c character, and you can use that information to capture everything between the <line> tags in the matched string. To write a captured pattern into the replacement, use a backslash followed by a digit, where \1 means the first capture group, \2 means the second, etc. (and in this case you’re capturing only one group). We’d build a replacement string that starts with a </sonnet> end tag, then a new line, and then a <sonnet> start tag, including the @number attribute and using the captured string as its value, etc. Try it.
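A Python sketch of the capture-and-replace step might look like the following. The @number attribute name comes from the description above; the sample is hand-typed, and the replacement syntax is Python’s, which we assume is close to what the <oXygen/> dialog accepts.

```python
import re

tagged = (
    "<line>I</line>\n"
    "<line>From fairest creatures we desire increase,</line>\n"
    "<line>II</line>\n"
    "<line>When forty winters shall besiege thy brow,</line>\n"
)

# The parentheses capture the numeral; \1 writes it into the attribute value,
# and \n in the replacement inserts a new line between the two tags
converted = re.sub(
    r"<line>([IVXLC]+)</line>",
    r'</sonnet>\n<sonnet number="\1">',
    tagged,
)
print(converted)
```

The output begins with a stray </sonnet> end tag and lacks one after the last sonnet, which is the kind of thing you fix by hand during clean up.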

Clean up

You may have to clean up the beginning and end of the document manually, including the title and author, and you’ll also need to add a root element.

Checking your results

Although you’ve added XML markup to the document, <oXygen/> remembers that you opened it as plain text, which means that you can’t check it for well-formedness. To fix that, save it as XML with File → Save as and give it the extension .xml. Even that doesn’t tell <oXygen/> that you’ve changed the file type, though; you have to close the file and reopen it. When you do that, <oXygen/> now knows that it’s XML, so you can verify that it’s well formed in the usual way: Control+Shift+W on Windows, Command+Shift+W on Mac, or click on the arrow next to the red check mark in the icon bar at the top and choose Check well-formedness.
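If you would like a second opinion outside <oXygen/>, a parser in almost any programming language can check well-formedness. The following sketch uses Python’s standard xml.etree.ElementTree module; the function name is_well_formed and the root element name sonnets are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Cleaned-up sample with a (hypothetical) root element
good = (
    '<sonnets><sonnet number="I">'
    "<line>From fairest creatures we desire increase,</line>"
    "</sonnet></sonnets>"
)

# Sample before clean up: stray end tag at the top, no root element
bad = (
    '</sonnet>\n<sonnet number="I">'
    "<line>From fairest creatures we desire increase,</line>"
)

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```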

General

As we mention above, there are several ways to get to the target output, and whatever works is legitimate, as long as you make meaningful use of computational tools, including regular expressions (where appropriate), and don’t just tag everything manually. As you saw in class, there are ways to build your own regular expressions to match whatever patterns you need to identify, but the regex language is complex and often difficult to read. The way we would approach this task is by figuring out what we need to match and then looking up how to match it. In addition to the mini-tutorial above, there is a more comprehensive description in the regex section of Michael Kay’s book and more detailed tutorial information at http://www.regular-expressions.info/tutorialcnt.html. If you decide to look around for alternative reference sites and find something that seems especially useful, please post the URL on the discussion boards, so that your classmates can also consult it.

What to submit

We don’t need to see the XML that you produce as the output of your transformation because we’re going to recreate it ourselves anyway, but you do need to upload a step-by-step description of what you did. Your write-up can be brief and concise, but it should provide enough information to enable us to duplicate the procedure you followed.

If you don’t get all the way to a solution, just upload the description of what you did, what the output looked like, and why you were not able to proceed any further. As always, you’re encouraged to post any questions on our DHClass-Hub Issues board!
