XQuery is one of the XML family of languages. It builds on what you have learned of XPath, and we use it to work with XML databases. XML databases work by storing XML files and building persistent indexes for them, and this indexing capacity makes it speedy and efficient to search for elements and attribute values and to calculate functions (anything you can locate and process with XPath expressions) across collections of files. XML databases can run speedily because they build an index of each file, so that the computer doesn’t have to review the entire file every time you run XQuery code. In essence, the database’s index stores the tree structure of the XML in the database and makes it available for quick retrieval through XQuery.
We are working with a particular XML database called eXist-db, which we have installed on the NewtFire server. Usually when we work on homework exercises and on project development, we will be working on our NewtFire eXist installation, but we can also write and run XQuery locally inside <oXygen/> to work with a batch of files we have saved locally: click on the little XQuery Debugger button right next to the XSLT debugger button in the top right of the <oXygen/> window. For our projects we tend to prefer working with an eXist installation, either on a server or offline on a local computer, because a) eXist has indexing tools that make it more efficient to run queries over multiple collections stored on a remote server, and b) we can connect the XQuery scripts and XML files we have stored in eXist to our project websites. Here’s how to access our NewtFire eXist database:
filename.xql or filename.xquery. Note: You can also write XQuery in <oXygen/> to query a document or collection on your local computer: Open a new XQuery file from File -> New, and you will notice that <oXygen/> uses the .xquery extension.
eXist holds file directories, or collections, in a hierarchical structure, so that you can access and query a collection of XML files all together. You might think of a collection as one giant XML file with subfiles inside, so you can step up and down the file directory structure with XPath just as you step up and down the XML element hierarchy in the parent/ancestor and child/descendant axes within a single XML document. In the eXist database there is a single root directory called db, with subfolders (or collections), which may in turn contain their own subfolders (more collections), and finally files. I’ve installed a copy of our Georg Forster XML file here, in a collection called voyages, inside a directory called pacific, and that means that its address in our database is /db/pacific/voyages/ForsterGeorgComplete.xml, starting from the root db directory.
As we work on project development, you may find that you want to upload your own collection of XML files into eXist, and we’ll walk you through how to do that. This is different from uploading files to publish on your web space, which makes them publicly viewable but doesn’t build index files or let you collect, extract, and remix your coding using XQuery.
XQuery uses XPath expressions to find its way through its index of files. It can work on one file, or on a whole collection, thus:
The doc() function in XQuery finds a single document, and inside the parentheses goes the path to the document within the database, including the filename. To retrieve our Georg Forster XML file, use: doc('/db/pacific/voyages/ForsterGeorgComplete.xml').
The collection() function finds a directory (or collection) that holds XML files. We’ve uploaded two other voyage files to sit in the same collection with our Georg Forster file, and we can run a query of the ENTIRE group of files, instead of looking through each, one by one. To query this collection of files, we use collection('/db/pacific/voyages').
doc('/db/pacific/voyages/ForsterGeorgComplete.xml')/*
Enter this code, click on the Eval button at the top of the eXide interface, and see what happens in the return window. In the return window, notice that you have options for setting and formatting the output: to view the XML code, you need to have XML Output selected. You should see the complete XML text of the Georg Forster Pacific voyage narrative in the output window. (Note: You can also set Live Query to automatically run and update your results as you are typing XQuery. We do not typically use this since it seems to introduce lag in typing in the eXide window, but your experience may vary from ours. Experiment! You can also click to move the output window to the side instead of the bottom of the screen.)
voyages collection space: HwksV2Ch4-8PNum.xml and cookVoy2Pnum.xml. Note: you can also browse through the XML code of all the files in the collection at once with:
collection('/db/pacific/voyages')/*
comment out some lines in XQuery), use
(: comment :)
Actually, both doc() and collection() are XPath expressions (doc() reaches for an XML document node, and the collection() function retrieves a collection of document nodes). We’ll be adding more XPath once you’ve designated the document or collection: you can write XPath expressions, use predicates and functions, and walk up and down axes. Your XPath expressions will locate results from all the files in a collection as long as those files are coded (at least structurally) in the same or similar ways.
Speaking of coding in the same or similar ways, we need to introduce you to the Text Encoding Initiative, or TEI. This is a language of XML with designated rules and tag sets for coding digital versions of literary, linguistic, historical, and cultural texts, and it represents an international standard for coding work consistently for long-term, sustainable archives. TEI is also a community, and people (like me) serve on its Technical Council to make judgment calls on best practices and coding guidelines. We use TEI to build digital archives that can "talk to" each other around the world and follow recognizable, standard patterns. We could make up our own XML tag sets, but when coding cultural resources, it’s a good idea to make your work accessible, so it is easy for others to load it into databases and run XQuery to analyze it, study it, or connect it with other comparable texts in other archives! We’ll talk more about TEI structure and coding, and give you some experience with it. (To read more, here’s the TEI’s home site.) For now, you can quickly tell if one of our files is coded in TEI from its root element: <TEI>.
XQuery requires a namespace declaration when we query TEI files. TEI elements belong to the TEI namespace, which ties them to the TEI’s rules (the schema that determines whether your file is valid as a TEI document), and an XPath step such as //title will not find them unless XQuery knows that namespace. Similarly, we also use a namespace declaration for HTML, whose namespace signals that certain rules govern the relationship of its tags, their organization, etc. When we query our TEI files, we’ll need to include the following namespace declaration at the top of our XQuery, before any expressions:
declare default element namespace "http://www.tei-c.org/ns/1.0";
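To see why the declaration matters, here is a minimal sketch that can be pasted into eXide (or any XQuery 3.1 processor) without touching the database. The tiny inline document and its title are made up for illustration, and the tei: prefix is our own choice here, used only so that both lookups can sit side by side.

```xquery
xquery version "3.1";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Hypothetical mini TEI document, built inline for illustration only. :)
let $doc := document {
    <tei:TEI>
        <tei:teiHeader>
            <tei:fileDesc>
                <tei:titleStmt>
                    <tei:title>A Sample Voyage</tei:title>
                </tei:titleStmt>
            </tei:fileDesc>
        </tei:teiHeader>
    </tei:TEI>
}
return (
    (: No namespace on the name test, so nothing matches. :)
    count($doc//title),
    (: The name test is in the TEI namespace, so the title is found. :)
    count($doc//tei:title)
)
```

This should return 0 and then 1. With a default element namespace declaration for TEI in place instead, an unprefixed //title behaves the way //tei:title does in this sketch.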
Following are examples of some XQuery expressions on collections of TEI files in our eXist database. Try copying them into the eXide window and running them by clicking on the Eval button. Notice the results you return with each.
declare default element namespace "http://www.tei-c.org/ns/1.0";
collection('/db/pacific/literary')//titleStmt/title
The above expression accesses a collection of files, the literary texts associated with our Pacific voyages project. It starts at the root of the eXist directory, always named db, steps down into a collection named pacific, and then into a collection-inside-the-collection called literary. (There are a couple of other collections inside the pacific collection, named voyages and mapping, and you can access these collections by inputting their names in the appropriate directory path step inside the collection() function.) Notice that after the collection() function, we are stepping down the XPath descendant axis with // and peering into two standard TEI elements that sit in a nested relation to one another. In the TEI header there must always be a <titleStmt> element, and it must contain a <title> element that is understood to be the title of the XML document. (You can use the <title> element elsewhere in a TEI document to mark titles of anything, say references in the document to other books, works of art, etc., but the <title> inside the <titleStmt> has the special function of identifying the title of the XML file itself.) So, looking at your output, you should see a list of those special <title> elements, which helps us to view at a glance the contents of a TEI collection like this. Stepping down the tree helps us to isolate just the piece of it that we want to return when we run (or eval) the XQuery code.
declare default element namespace "http://www.tei-c.org/ns/1.0";
collection('/db/pacific/literary')/distinct-values(descendant::body//persName)
This XQuery illustrates the use of an XPath function, distinct-values(), so that we will return a list, gathered from across the entire collection, of the distinctly different names of people (indicated by the TEI <persName> element) referenced within the <body> portion of each document. (In the TEI, much like in HTML, the <body> is a major top-level structure in the text's hierarchy and typically contains a text, like in this case, the full text of a poem, novel, or play. Other parts of the TEI text include a header of metadata, or information about the document, such as its title (up in the <titleStmt>) and publication data.) Here, we want you to notice how we positioned the distinct-values() function: we keep the collection() function outside of distinct-values(), and once we are inside the collection, we take distinct values starting from the descendant:: axis. We could equally write .//body//persName, using the dot (.), which means (as it always does in XPath) the self:: axis. If you try running this query without the dot or an explicit axis, the function will lack the precise context it needs to understand its starting point. The dot (or self axis) refers to the current context, which here is each document in the collection in turn, so a name that appears in more than one file can appear more than once in the results.
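The placement of distinct-values() relative to the path can be tried out without the database. The sketch below is only an illustration: the two inline documents and the names in them are invented stand-ins for files in a collection, and the variable $docs plays the part of collection().

```xquery
xquery version "3.1";
declare default element namespace "http://www.tei-c.org/ns/1.0";

(: Two hypothetical mini-documents standing in for files in a collection. :)
let $docs := (
    document {
        <TEI><text><body>
            <p><persName>Cook</persName> met <persName>Omai</persName>,
               and <persName>Cook</persName> wrote about it.</p>
        </body></text></TEI>
    },
    document {
        <TEI><text><body>
            <p><persName>Cook</persName> sailed with <persName>Forster</persName>.</p>
        </body></text></TEI>
    }
)
return (
    (: distinct-values() runs once per document, so Cook is counted in both files. :)
    count($docs/distinct-values(descendant::body//persName)),
    (: distinct-values() runs once over the names from all documents together. :)
    count(distinct-values($docs//body//persName))
)
```

This should return 4 and then 3: the first form removes duplicates within each document, while the second removes them across the whole set. Which one you want depends on whether repeated names across files are meaningful for your question.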
FLWOR Expressions in XQuery
Flower or FLWOR expressions are a powerful tool in XQuery, letting us work in more complex ways with querying and remixing information in files and collections, sometimes both in the same expression! Here's a primer on FLWOR (or really, LFWOR!):
Let: establishes variables, which may hold a single value or a sequence of multiple values.
For: establishes a range variable that moves step by step from one value to the next in a long list of values, typically defined by a Let statement. The range variable is designed to process and return a single value at a time, and we say that it loops through a list of multiple values.
Where (optional): filtering; analogous to predicates.
Order by (optional): alphabetize, etc. If you use it with a Where statement, Order by always has to appear after Where.
Return: generates output.
let $hamlet := doc('/db/shakespeare/plays/hamlet.xml')
return $hamlet
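To watch all five parts of a FLWOR work together, here is a small sketch that needs no database. The list of names is invented for illustration; in a real query the Let statement would usually reach into a document or collection instead.

```xquery
xquery version "3.1";

(: Let: bind a variable to a (hypothetical) sequence of names. :)
let $names := ('Hamlet', 'Horatio', 'Ophelia', 'Francisco', 'Bernardo')
(: For: loop through the sequence one name at a time. :)
for $name in $names
(: A Let inside the loop is recalculated for each name. :)
let $length := string-length($name)
(: Where: keep only the names ending in "o". :)
where ends-with($name, 'o')
(: Order by: shortest names first, ties broken alphabetically. :)
order by $length, $name
(: Return: output one string per name that survived the filter. :)
return concat($name, ': ', $length)
```

This should return three strings in this order: Horatio: 7, Bernardo: 8, Francisco: 9.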
Here is an example to demonstrate how we can start with a variable defining a collection of files, and reach into it to retrieve information from a particular special file inside. Note: this particular collection, our Pacific voyage collection, is in the TEI namespace, so we require a special namespace declaration line.
declare default element namespace "http://www.tei-c.org/ns/1.0";
let $pacific := collection('/db/pacific/voyages')/*
let $GeorgFile := $pacific[descendant::author[contains(., 'Georg')]]
//titleStmt/title
return $GeorgFile
This returns just one result in the eXide output window:
<title xmlns="http://www.tei-c.org/ns/1.0">A Voyage Round the World in His Majesty's Sloop, Resolution, commanded by Capt. James Cook, during the Years 1772, 3, 4, 5.</title>
Notice how we referenced the descendant:: axis in our XQuery FLWOR. We could also have used .// to indicate the self:: axis, but we must NOT use //. We require the dot or the indication of the descendant:: axis in the variable $GeorgFile to set a starting point, to indicate that we are stepping down from the position defined by the $pacific variable. (If we do not use the dot, we return zero results because the starting position of the XPath in the predicate is unclear to the computer parser! Try it yourself and see what happens.)
Where and For statements
For statement here):
declare default element namespace "http://www.tei-c.org/ns/1.0";
let $cook := doc('/db/pacific/voyages/cookVoy2Pnum.xml')
let $p := $cook//p[geo]
let $geo := $cook//p/geo
let $countlat := count ($geo[@select="lat"])
let $countlon := count ($geo[@select="lon"])
where $countlat gt $countlon
return $p
For statement, with an XQuery comment.
smiley faces like this:
(: your comment here :)
declare default element namespace "http://www.tei-c.org/ns/1.0";
let $cook := doc('/db/pacific/voyages/cookVoy2Pnum.xml')
let $Paras := $cook//p[geo]
let $geo := $cook//p/geo
let $countlat := count ($geo[@select="lat"])
let $countlon := count ($geo[@select="lon"])
for $p in $Paras
where $countlat gt $countlon
return string-join(('paragraph',$p/@n),': ')
(: Note use of the string-join function, with its separator. Also notice which parts of it take the single-quotes ' ', and which parts do not! The single quotes, ' ', allow you to indicate that you want some literal text to be returned here. Without them, the computer thinks you are referring to an XPath expression. :)
O in the FLWOR: Order
The Order statement in the FLWOR is optional, but when you use it, it must come after any Where statement and immediately precede the Return. One of the standard, default uses of Order is to sort a list of results in alphabetical order, so, for example:
order by $a
organizes results in alphabetical order, sorted by whatever is indicated in the variable $a.
There are more complex ways to set up an Order statement to organize results. For example, you can order by descending to get reverse alphabetical order:
order by $a descending
Or you can order a set of results according to their numerical position or count, in ascending or descending order.
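As a sketch of ordering by a count rather than alphabetically, the example below sorts speakers by how many speeches each has. The sequence of names is invented (one entry per speech) so that the query runs without the database.

```xquery
xquery version "3.1";

(: Hypothetical data: one speaker name per speech. :)
let $speeches := ('Hamlet', 'Horatio', 'Hamlet', 'Ophelia', 'Hamlet', 'Horatio')
for $speaker in distinct-values($speeches)
(: Count how many entries match the current speaker. :)
let $count := count($speeches[. = $speaker])
(: Highest count first; alphabetical order breaks any ties. :)
order by $count descending, $speaker ascending
return concat($speaker, ': ', $count)
```

This should return Hamlet: 3, then Horatio: 2, then Ophelia: 1. Because $count holds a number, the sort is numerical, so 10 would sort above 9 rather than below it as it would in an alphabetical sort of strings.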
{ }
To add HTML or XML markup to the XQuery output, add the elements where needed to produce conformant code. However, these elements are passive, or non-functional, when executing XQuery commands. So we use curly braces { } to enclose any XPath or XQuery statements that we want to execute in XQuery, to separate them from the HTML or XML markup elements. Inside HTML elements, when we need to do some calculation or refer to a variable we defined in XQuery, we use the curly braces again. We’ll work on some examples in class. Here is one example that may be helpful as a reference point, showing how to make an HTML page with a table of two columns, listing two related variable results side by side. The resulting HTML file is coded to display a table of the distinct characters (<speaker> elements) in Hamlet from our Shakespeare collection, next to a count of their speeches (<sp>) in the play. Speeches in the play are coded in TEI like this, with speaker names entered as a child element. (Speaker identifiers are also coded as an attribute on the sp element. In the code below, we will simply work with the contents of the speaker element, but you could practice and see if you can adapt our example by changing it to work with the @who attribute instead.)
<sp xmlns="http://www.tei-c.org/ns/1.0" who="Francisco">
    <speaker>Francisco</speaker>
    <l xml:id="sha-ham101002" n="2">Nay, answer me: stand, and unfold yourself.</l>
</sp>
We have highlighted the position of the curly-braces in the example:
xquery version "3.1";
declare default element namespace "http://www.tei-c.org/ns/1.0";
<html>
    <head><title>Speakers and counts of their speeches in Hamlet</title></head>
    <body>
        <table>{
            let $hamlet := doc('/db/apps/shakespeare/data/ham.xml')
            let $speeches := $hamlet//sp
            let $speakers := $hamlet//speaker
            let $distinctsp := distinct-values($speakers)
            for $sp in $distinctsp
            let $count := count($speeches[speaker = $sp])
            order by $count descending
            return
                <tr>
                    <td>{$sp}</td>
                    <td>{$count}</td>
                </tr>
        }</table>
    </body>
</html>
Here’s what’s happening when we apply the curly braces { }. These wrap the portion of our code in which XQuery must be processed. We write the basic structural HTML tags, the html, head, and body elements, to encircle our FLWOR statement, since these do not require any special XQuery processing and just need to be output to create a well-formed and valid HTML document. We then encircle the whole FLWOR statement inside curly braces, and when you write this in the eXide window, you will notice that if you remove those curly braces and hit the Eval button, the XQuery code is simply output as text (and appears all the same color as the HTML markup). When you apply the curly braces, eXide applies color to show you the XQuery code is active. So, why do we need a second set of curly braces inside our return statement, where we output the <td> elements? Try removing them and look at your output! The answer has to do with the use of HTML (or other non-XQuery markup code, such as XML or KML, etc.) in our output: the computer parser requires the curly braces any time you are representing the contents of an angle-bracketed element, so that it can tell whether a string of text inside the angle-bracketed tags is a literal text string (no curly braces) or XQuery code to be processed (nested within curly braces).
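The difference between literal text and evaluated code inside an element can be seen in a few lines. This is a minimal sketch with a made-up variable, runnable without the database:

```xquery
xquery version "3.1";

let $count := 3
return
    <ul>
        <li>$count speeches</li>
        <li>{$count} speeches</li>
        <li>{$count * 2} lines</li>
    </ul>
```

The first list item should come out as the literal text $count speeches, because nothing tells the parser to evaluate it. The second should read 3 speeches and the third 6 lines, because the curly braces hand their contents to XQuery for processing.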
Our model for the next two examples is adapted from Obdurodon’s Generating a list of characters from a collection of Shakespeare plays in alphabetical order. Try testing and exploring the XQuery scripts below with our Shakespeare collection on the NewtFire eXist-db.
concatenated string of results in plain text: This example returns the characters in Hamlet whose names end with the letter “o”, and outputs the number of characters in their names. To follow this example, you should review the string functions in XPath, so see part III on Strings in Obdurodon’s The XPath functions we use the most.
xquery version "3.1";
declare default element namespace "http://www.tei-c.org/ns/1.0";
let $hamlet := doc('/db/apps/shakespeare/data/ham.xml')
let $speakers := distinct-values($hamlet//speaker)
for $speaker in $speakers
let $NameLength := string-length($speaker)
where ends-with($speaker,'o')
(:order by string-length($speaker):) (:commenting out! :)
order by $NameLength
(:return $speaker:) (:commenting out! :)
return concat ($speaker, ' has ', $NameLength , ' characters.')
Notice the positioning of two pairs of curly braces { } in this XQuery code:
<html>
    <head><title>Title</title></head>
    <body>{
        let $hamlet := doc('/db/apps/shakespeare/data/ham.xml')
        let $speakers := distinct-values($hamlet//speaker)
        for $speaker at $pos in $speakers
        (: The above line creates a special variable named $pos that identifies the position number of each speaker in the sequence of all the distinct speakers. We can use that position number in our output. :)
        let $speakerLength := string-length($speaker)
        where ends-with($speaker,'o')
        order by $speakerLength
        return
            <p>{concat ($speaker, '#', $pos, ' has ', $speakerLength , ' characters')}</p>
    }</body>
</html>
While we frequently write XQuery to output plain text or HTML, we can also write it to produce output code in a namespace, such as specialized forms of XML like TEI or KML. Above, when we were processing XQuery on a TEI file for the Pacific project, we used a convenient line of code at the top of the file:
declare default element namespace "http://www.tei-c.org/ns/1.0";
Using this means that the default format of all elements being processed and output will be in TEI, and that was fine for our processing above. It may not be okay, though, when you need to process the special Wordhoard TEI Shakespeare collection to convert its TEI elements into HTML elements. Here we need to declare two namespaces, and we have to decide which one should be the default. The one that isn't marked as the default will have to be distinguished using a namespace prefix, like this: tei:text (for the TEI element <text>). When transforming from TEI to HTML, we recommend setting the output HTML as the default namespace and treating TEI elements with prefixes (and generally speaking we suggest setting the namespace format of the output file as the default namespace in your XQuery code). Here is how to set a default namespace line and a namespace line that requires prefixes:
declare default element namespace "http://www.w3.org/1999/xhtml";
declare namespace tei="http://www.tei-c.org/ns/1.0";
(:Continue writing XQuery here... :)
The top line of our example above is a default element namespace line, which we're setting for our output format, the HTML namespace. (We found it by opening an HTML file in oXygen, and just pasted it in here.) The default element namespace won't require us to set prefixes, but if we want to be processing code from a different namespace, we need to declare it too. The TEI elements being processed will all require the tei: prefix in front for the code to properly distinguish these elements. Note: Attributes are in no namespace at all; it is their parent (hosting) elements that are namespaced. That means you only need to use the namespace prefix on the element names, not the attributes.
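Here is a small sketch of the two declarations at work, converting one TEI speech into HTML. To keep it runnable without the database, the speech is written inline with tei: prefixes rather than pulled from the Shakespeare collection, and the output structure (a div with an h3 and a p) is just one possible design, not a required one.

```xquery
xquery version "3.1";
declare default element namespace "http://www.w3.org/1999/xhtml";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Hypothetical inline TEI input, standing in for a speech from a play file. :)
let $sp :=
    <tei:sp who="Francisco">
        <tei:speaker>Francisco</tei:speaker>
        <tei:l n="2">Nay, answer me: stand, and unfold yourself.</tei:l>
    </tei:sp>
return
    <div class="speech">
        <h3>{string($sp/tei:speaker)}</h3>
        <p>{string($sp/@who)} says: {string($sp/tei:l)}</p>
    </div>
```

The unprefixed div, h3, and p elements land in the HTML namespace because it is the default, while the input elements are reached with tei:speaker and tei:l. The who attribute needs no prefix, since attributes are in no namespace.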