Ngram Frequency and Eighteenth-Century Commonplaces

When Samuel Richardson's Mrs. Jewkes remarks that “Nought can restrain consent of twain,” we confidently conclude she's quoting Harington's translation of Orlando Furioso. When Edmund Burke writes in his Philosophical Enquiry, "Dark with excessive light thy skirts appear," we know he's misquoting Milton. While passages like these make their debts fairly clear, though, in most cases literary influence is notoriously difficult to trace. When Mary Wollstonecraft identifies marriage as a form of “legal prostitution” in her Vindications, for instance, are we meant to reflect on the thrust of that phrase in Defoe's Matrimonial Whoredom? When Ann Radcliffe's Adeline and La Motte stroll “under the shade of 'melancholy boughs'” in The Romance of the Forest, what gives us the warrant to imagine Orlando “under the shade of melancholy boughs” in As You Like It?

In each of the aforementioned cases, both the quoting and the quoted texts include identical (or nearly-identical) sequences of words. If this property is a necessary condition for intertextuality, however, it is clearly not a sufficient one, for while Wollstonecraft's second Vindication and Defoe's Matrimonial Whoredom both use the phrase “legal prostitution,” they also both use the phrase “if it be,” as well as the phrase “a kind of.” Nonetheless, literary scholars don't identify the latter two strings as instances of intertextuality, perhaps because we intuitively sense that “if it be” and “a kind of” are far more common phrases during the period than “legal prostitution,” a thesis to which Google lends some confidence:

Ngram frequencies of three strings that appear in both Defoe's Matrimonial Whoredom and Wollstonecraft's Vindication of the Rights of Woman.

Such queries demonstrate something literary scholars have known for a long time, namely the fact that the passages we classify as instances of intertextuality have (1) common words in a common order, and (2) significantly lower relative frequency rates than other (equally long) strings from the same period. With this insight in mind, I built an API for the Google Ngrams data with which one can pull down the relative frequencies of a list of strings shared by two (or more) works. Given a set of substrings shared by two texts, and given the relative frequencies of each of those strings in the age during which those texts were published, one can eliminate high frequency strings and thereby reduce the number of passages scholars must hand review to identify relevant instances of intertextuality.
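The winnowing step can be sketched in a few lines of Python. The cutoff and the frequency values below are purely illustrative, not figures drawn from the actual Ngram data:

```python
def winnow_candidates(shared_strings, frequencies, cutoff=1e-7):
    """Discard shared strings whose period-wide relative frequency
    exceeds the cutoff; what survives is worth hand review."""
    return [s for s in shared_strings if frequencies.get(s, 0.0) < cutoff]

# Invented frequencies for three strings shared by Defoe and Wollstonecraft
frequencies = {
    "if it be": 2.4e-5,
    "a kind of": 1.1e-5,
    "legal prostitution": 3.0e-9,
}
candidates = winnow_candidates(list(frequencies), frequencies)
```

Only the rare string survives the filter, which is exactly the behavior we want before handing candidates to a human reader.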

Although I developed the Ngram API to eliminate high frequency strings from the output of my sequence alignment routines, it eventually helped me to discover an interesting correlation between the relative frequency of n-grams and instances of intertextuality. This discovery unfolded in the following way. On a whim, I decided to examine the relative frequencies of bigrams across passages from a few canonical works published during the long eighteenth century: Henry Fielding's Joseph Andrews (1742), Edmund Burke's Enquiry (1757), and Maria Edgeworth's Ennui (1809). Each of the selections that I drew from these texts centers on a quotation of another writer—the Fielding passage quotes Virgil's Aeneid, the Burke passage quotes Shakespeare's Henry V, and the Edgeworth passage quotes Voltaire's “La Bégueule.” I broke each of these passages down into a set of sequential bigrams, and submitted each of the bigrams to the Google Ngrams data via the API described above. In the case of Burke, for example, I fired up the API and entered the following data into the input fields:

After identifying these parameters and clicking "Go!", I watched the tool navigate to the Google Ngram site and search for the relative frequency of the first two words in the Burke passage. The API limits the historical scope of this search to the period between 1752 and 1762 (the user-provided publication date of Burke's text plus and minus five years), because the Google Ngram data is a bit noisy, and we don't want anomalies in the data for 1757 to skew our sense of the bigram's relative frequency in the period. The API then calculates the mean value for the bigram's relative frequency across those years, and it writes the bigram, the publication year, and the calculated relative frequency to an output file. It then looks at the next bigram (containing words two and three), and reiterates the process, continuing in this fashion until it has queried all valid ngrams in the input file.
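The bigram chunking and the windowed averaging can be sketched as follows. The yearly figures here are invented for illustration; the real values come from the Google Ngram data:

```python
def sequential_bigrams(passage):
    """Split a passage into its overlapping word bigrams
    (words 1-2, then words 2-3, and so on)."""
    words = passage.split()
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]

def mean_relative_frequency(yearly, pub_year, window=5):
    """Average a bigram's relative frequency over the eleven years
    centered on the publication date, smoothing single-year noise."""
    years = range(pub_year - window, pub_year + window + 1)
    values = [yearly[y] for y in years if y in yearly]
    return sum(values) / len(values) if values else 0.0

bigrams = sequential_bigrams("dark with excessive light")
freq = mean_relative_frequency({1755: 2e-6, 1757: 4e-6, 1760: 6e-6}, 1757)
```

Averaging over the window means a one-year spike or gap in the Ngram counts cannot dominate the frequency we record for a bigram.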

Preliminary analysis suggests that one can then use this output file to identify instances of intertextuality even in cases in which one does not have access to the referenced text. For the aforementioned selections from eighteenth-century texts, I used the method described above to calculate the relative frequencies of the bigrams in each of those selections. I then plotted the bigram frequencies with R's scatter.smooth() function—identifying the first bigram in the selection as bigram number one, the second bigram as bigram number two, and so forth across the x-axis—so that I could better identify the trends in bigram frequency across each passage. I was surprised by the results (click to enlarge):

In each case, the local minimum of the regression line centers on the instance of intertextuality in the queried passage! While this trend is promising, though, it could be due to a number of causes. Chief among these are the differences in language and historical period that divide each of the “quoting” texts cited above from the passage that that work quotes. As we noted above, Henry Fielding quotes Virgil, Burke quotes Shakespeare, and Edgeworth quotes Voltaire, all in the original languages. When we compare the relative frequency of bigrams in Latin, French, and Elizabethan English with bigrams written in colloquial English of the mid- to late-eighteenth century, then, we should perhaps not be surprised that the latter tend to be more common in the Ngram data from that period, ceteris paribus. Nevertheless, these initial results yield new questions: Can the method described above identify instances of poetry in works of prose from a particular period? Can such a method be integrated into an ensemble approach to intertextuality, or do these graphs merely contain a half-told truth, mysterious to descry, which in the womb of distant causes lie? Such are the questions I hope to pursue in subsequent work.

Digital Approaches to Intertextuality: The Case of Eliza Haywood

Over the last year or so, I have been thinking more and more about intertextuality, or the ways in which writers borrow language and ideas from other writers. As a researcher who is particularly interested in the relationship between literary and scientific texts of the Enlightenment, I have been writing plagiarism detection scripts in order to pursue moments in works like Laurence Sterne's Tristram Shandy that borrow language from medical texts such as Burton's Anatomy of Melancholy. Having spent some time thinking about these kinds of questions, I was perhaps unusually provoked by the passage in Eliza Haywood's Betsy Thoughtless (1751) wherein the male hero Mr. Trueworth quotes a quatrain from Shakespeare that no scholars have subsequently been able to locate within the canonical Shakespearean corpus. “How dear,” Trueworth says, “ought a woman to prize her innocence! — as Shakespeare says,

They all are white,—a sheet

Of spotless paper, when they first are born;

But they are to be scrawl'd upon, and blotted

By every goose-quill” (463).

Christine Blouch, the editor of the Broadview edition of Betsy Thoughtless I was reading, attached a footnote to this quatrain that simply stated “Not Shakespeare. Source unidentified” (463). This footnote led my imagination to run wild—I wondered, could this poetic stanza be a fragment from one of Shakespeare's Lost Plays? Might it be the key that will unlock some of the grand mysteries behind the Shakespeare apocrypha?

While I soon learned the answer to these questions is a resounding “No” (the quatrain is not by Shakespeare, but by William Congreve), my interest in Haywood's use of intertextuality only continued to mature. In Betsy Thoughtless alone I found the better part of two dozen “quotations” such as the “Shakespeare” quotation above for which previous scholars were unable to identify sources. Curious to see how well plagiarism detection routines could identify the missing sources, I set out to uncover the materials that informed Haywood's work.

I began by selecting a handful of “quotations” on which to focus my attention. Of the twenty or so instances of intertextuality whose sources were unidentified in my edition of Betsy Thoughtless, I selected the twelve that I thought were most interesting. These were passages like:

"When puzzling doubts the anxious bosom seize,

to know the worst is some degree of ease" (51)

Haywood often introduces such passages with phrases like, “As the poet says,” or “I remember to have read somewhere,” which I took to be indications that the passages were based on extant texts of Haywood's day. I therefore typed up the dozen unsourced quotations I had selected and fed them to the Literature Online API. The API then broke the quotations into sequences of three words and sent each of those three-word chunks to the Literature Online database. This procedure generated the spreadsheet of text data pictured in the following image, which I subsequently scoured for the sources of Haywood's passages:
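The chunking step itself can be sketched like so (a simplification for illustration, not the API's actual code):

```python
def three_word_chunks(quotation):
    """Break a quotation into the overlapping three-word strings
    that get submitted to Literature Online one at a time."""
    words = quotation.split()
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

chunks = three_word_chunks("to know the worst is some degree of ease")
```

A nine-word line yields seven overlapping queries, so even a short couplet generates a healthy number of chances to hit its source.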

Using this data, I first tried to get a feel for the individuals whose prose most closely resembles the unsourced passages. Using some Python scripts, I counted up the number of times each writer in the LION database had a trigram (or series of three words) that matched one of the trigrams in the quotes I had selected from Haywood's novel. Here are the authors whose texts shared the greatest number of trigrams with the selected passages:

The birth and death dates of authors identified herein are taken from the Literature Online database.
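Tallying these matches takes only a few lines with Python's Counter. The rows below are made-up stand-ins for the LION output (the author names are real, the pairings and counts invented):

```python
from collections import Counter

def rank_authors(rows):
    """Count how many matched trigrams each author contributes;
    rows are (trigram, author) pairs like those in the spreadsheet."""
    return Counter(author for _, author in rows).most_common()

rows = [
    ("the anxious bosom", "Joseph Mitchell"),
    ("know the worst", "David Mallet"),
    ("to my ruin", "Edward Ward"),
    ("or else all", "Edward Ward"),
]
ranking = rank_authors(rows)
```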

These numbers were interesting—who would have thought Edward Ward would occupy the pole position? In the end, though, I found that very few of the sources for Haywood's identified literary borrowings appear in this chart. To discover this fact, I began by writing a few scripts that could loop over the text data pictured above. Using those routines, I soon discovered that in many cases, Haywood's references to other literary works were fairly straightforward, which made the computing easy. Take, for instance, the pseudo-Shakespearean passage cited above:

“They all are white,—a sheet

Of spotless paper, when they first are born;

But they are to be scrawl'd upon, and blotted

By every goose-quill” (463)

Much to my chagrin, these lines led me not to a now-forgotten Shakespearean play but to William Congreve's Love for Love (1695), where one reads: "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill." (Of course Congreve could also be quoting a forgotten Shakespearean work, but that is another story.)

Other passages in Betsy Thoughtless had similarly straightforward sources. Using my procedure, I was able to find sources for the following passages in Haywood's text:

The Patriarch, to gain a wife

Chaste, beautiful, and young,

Serv'd fourteen years, a painful life,

And never thought it long.

Oh! were you to reward such cares,

And life so long would stay,

Not fourteen, but four hundred years,

Would seem but as one day (153)

These lines are from "The Perfection," a song published at least as early as 1726 in The Hive: A Collection of the Most Celebrated Songs, and one which Robert Burns quoted with delight some years thereafter. Another passage:

“All saw her spots but few her brightness took” (224)

was adapted from the 1677 play that made Nathaniel Lee's career, namely Alexander the Great, where the titular character boasts “All find my spots, but few observe my brightness.” Next up:

“That faultless form could act no crime,

But heav'n, on looking on it, must forgive” (280)

This passage draws from John Dryden's play The Spanish Friar (1681): “So wondrous fair, you justifie Rebellion: As if that faultless Face could make no Sin, But Heaven, with looking on it, must forgive.” The next unsourced passage,

“There is no wonder, or else all is wonder” (285)

is adapted from a remark of William Congreve's in The Mourning Bride (1697): “There are no wonders, or else all is wonder.” Let's turn to another:

“Young Philander woo'd me long,

I was peevish, and forbad him;

I would not hear his charming song,

But now I wish, I wish I had him” (289)

Here Haywood recites a popular song of the day, one that made its way into Charles Johnson's The Village Opera (1729):

My script also uncovered the fact that George Lillo references the song in his 1731 play Silvia (and subsequent searches helped me find that Purcell set the song to music!). Next:

“Ingratitude's the sin, which, first or last,

Taints the whole sex; the catching court-disease” (322)

The famously mad Nathaniel Lee wrote similar lines in his play Mithridates (1678): “Inconstancy, the Plague that first or last Taints the whole Sex, the catching Court-disease.” The last passage for which I found a straightforward source runs as follows:

“I, like the child, whose folly prov'd its loss,

Refus'd the gold, and did accept the dross” (602)

Here George Etherege's Comical Revenge, or Love in a Tub (1664) appears to be the source: “I, like the child, whose folly proves his loss, Refus'd the gold, and did accept the dross.” Using natural language processing techniques and the wonderful data provided by LION, identifying these sources took little time at all.

While the previous set of intertextual references were closely patterned on a variety of source texts, some passages that Haywood attributes to other writers are much less straightforward. Indeed, it seems she often combined lines from disparate literary works in order to forge her own ideas. Take, for example, the following passage:

Pleas'd with destruction, proud to be undone,

With open arms I to my ruin run,

And sought the mischiefs I was bid to shun;

Tempted that shame a virgin ought to dread,

And had not the excuse of being betray'd (111)

Like other instances of intertextuality in Haywood's writing, this passage seems to derive from multiple sources. The first line appears in the poet and doctor Richard Blackmore's “Advice to the Poets” (1718), where Blackmore writes “Let them this gen'rous Resolution own, / That they are pleas'd and proud to be undone.” The second and third lines of Haywood's aforementioned passage appear to borrow from Mary Wortley Montagu's “The Basset Table” (1716)—where one finds the lines “I know the bite, yet to my ruin run, / And see the folly which I cannot shun”—and posthumously published lines from “The Excursion of Fancy: A Pindaric Ode” (1753) by Aaron Hill (1685-1750): “Let us throw down this load of doubt, with which no race is won: / And, swift, to easier conquests, lighter, run, / The way, which reason is not bid to shun!” Another synthetic creation of Haywood's that I spent some time analyzing runs as follows:

When puzzling doubts the anxious bosom seize,

To know the worst is some degree of ease (51)

The first line of this couplet pulls from a line in Joseph Mitchell's “Poems on Several Grave and Important Subjects”: “When puzling Doubts invade my Breast, And I am cloath'd in Shades of Night . . . ", while the second inverts a line from David Mallet's Eurydice (1731): “When others too are miserable, not to know the worst is some degree of bliss.” In this passage, as in others, Haywood brings a variety of extant literary works to bear on her own project in fascinating and unpredictable ways.

* * *

Tracing the sources of these passages was helpful, not least because it allowed me to get a better sense of the ways writers like Haywood engaged with the texts of their age. For instance, using the data I gathered while tracing the sources of the passages above, I began considering new ways to optimize my plagiarism detection routines. Consider the following chart:

This graph indicates that, of the passages in Haywood's novel for which I was able to find sources, all of those passages shared at least three identical words in identical order with the texts they paraphrase. Roughly 80% of the instances of plagiarism I analyzed had at least four identical words in identical order with their source texts, ~55% had at least five equivalent words in equivalent sequential order, and so on. The lesson embedded in this chart is perhaps predictable: The greater the number of identical words one demands in order to identify one language act as a paraphrase of another, the greater the number of false negatives one can expect in one's analysis. As I noted above, my study of intertextuality in Haywood's writing was carried out using trigrams as the unit of analysis. That is to say, I expected a passage from Haywood's text to share at least three identical words in identical order with the text it paraphrases. While this seemed a relatively low condition for an instance of plagiarism to satisfy, it might have actually been too demanding a condition, because I was only able to find sources for ten of the twelve passages I set out to study. Perhaps a study using bigrams as the unit of analysis—or perhaps one of you—can identify the source of the remaining two quotations:

"Away with this idle, this scrupulous fear,

For a kiss in the dark,

Cry'd the amorous spark,

There is nothing, no nothing, too dear" (311)

* * *

"Unequal lengths, alas! our passions run,

My love was quite worn out, e'er yours begun" (462)
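For anyone who wants to experiment with looser thresholds, the quantity underlying the matching chart above—the longest run of identical words in identical order shared by two passages—can be computed with a brute-force sketch like this:

```python
import re

def longest_shared_run(a, b):
    """Length of the longest contiguous sequence of identical words
    (in identical order) shared by two passages, ignoring case
    and punctuation."""
    wa = re.findall(r"[a-z']+", a.lower())
    wb = re.findall(r"[a-z']+", b.lower())
    best = 0
    for i in range(len(wa)):
        for j in range(len(wb)):
            k = 0
            while i + k < len(wa) and j + k < len(wb) and wa[i + k] == wb[j + k]:
                k += 1
            best = max(best, k)
    return best

run = longest_shared_run(
    "There is no wonder, or else all is wonder",      # Haywood
    "There are no wonders, or else all is wonder",    # Congreve
)
```

Dropping the threshold from three shared words to two widens the net considerably, at the cost of many more false positives to read through.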

Training the Stanford NER Classifier to Study Literary History

Working with Professor Matthew Wilkens, my fellow doctoral student Suen Wong, and undergraduates at Notre Dame, I have spent the last few months using the Stanford Named Entity Recognition (NER) classifier to identify locations in a few thousand works of nineteenth-century American literature. Using the NER classifier—an enormously powerful tool that can identify such “named entities” in texts as people, places, and company names—our mission was to find all of the locations within Professor Wilkens' corpus of nineteenth-century novels. While Stanford's out-of-the-box classifier could be used for such a purpose, we elected to retrain the tool with nineteenth-century text files in order to improve the classifier's performance. In case others are curious about the process involved in retraining and testing a trained classifier, I thought it might be worthwhile to provide a quick summation of our method and findings to date.

In order to train the classifier to correctly identify locations in a text, users essentially provide the classifier with a substantial quantity of annotated text that teaches it what locations look like. More specifically, these “training texts” break a document into a series of words, each of which users must identify as a location or a non-location. The training files look a bit like this:

the 0
Greenland LOC
whale 0
is 0
deposed 0
, 0
- 0
the 0
great 0
sperm 0
whale 0
now 0
reigneth 0
! 0

In this sample, as in all of the training texts, each word (or “token”) is listed on a unique line, followed by a tab and then a “LOC” or a “0” to indicate whether the given token is or is not a location. Users can feed the Stanford parser this data, and the tool can use this information in order to improve its ability to classify locations correctly.

In our training process, we collected hundreds of passages much longer than the sample section above, and we processed those passages in the way described above—with each token on a unique line, followed by a tab and then a “LOC” or a “0”. (Technically, we also identified persons and organizations, but the discussion is simpler if we ignore these other categories for now.) We then used a quick Python script to sort these hundreds of annotated text chunks into ten directories of equal size, and another script to combine all of the chunks within each directory into a single file.
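Both scripts can be condensed into one small sketch. The file names and directory layout below are illustrative, not the ones we actually used:

```python
import os, random, tempfile

def make_folds(chunk_dir, out_dir, k=10):
    """Shuffle annotated chunk files into k folds and concatenate
    each fold into one training file (directory1.tsv ... directoryK.tsv)."""
    chunks = sorted(os.listdir(chunk_dir))
    random.shuffle(chunks)
    os.makedirs(out_dir, exist_ok=True)
    for fold in range(k):
        path = os.path.join(out_dir, f"directory{fold + 1}.tsv")
        with open(path, "w") as out:
            for name in chunks[fold::k]:  # every k-th chunk goes to this fold
                with open(os.path.join(chunk_dir, name)) as f:
                    out.write(f.read())

# Tiny demonstration with 20 fake one-line annotated chunks
src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()
for i in range(20):
    with open(os.path.join(src, f"chunk{i:02d}.tsv"), "w") as f:
        f.write(f"token{i}\t0\n")
make_folds(src, dst)
folds = sorted(os.listdir(dst))
```

Shuffling before splitting matters: it keeps any one author or genre from dominating a single fold.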

Once we had ten unique directories, each containing a single amalgamated file, we trained and tested ten classifiers. To train the first classifier, we combined the annotated texts contained in directories 2-10 into a single text file called “directories2-10combined.tsv”. We then created a .prop file we could use to train the first classifier. This .prop file looked very similar to the default .prop template on the FAQ page for the NER classifier:

# location of the training file
trainFile = directories2-10combined.tsv
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = ner-model.ser.gz

# structure of your training file; this tells the classifier that
# the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1

# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
maxLeft=1

# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC

We saved this file as propforclassifierone.prop, and then built the classifier by executing the following command within a shell:

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop propforclassifierone.prop

This command generated an NER model that one can evoke within a shell using a command such as the following:

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -testFile directoryone.tsv

This command will analyze the file specified by the last flag—namely, “directoryone.tsv”, the only training text that we withheld when training our first classifier. The reason we withheld "directoryone.tsv" was so that we could test our newly-trained classifier on the file. Because we have already hand identified all of the locations in the file, to test the performance of the trained classifier we need only check to see whether and to what extent the trained classifier is able to find those locations. Similarly, after training our second classifier on training texts 1 and 3-10, we can test that classifier's accuracy by seeing how well it identifies locations in directorytwo.tsv. In general, we can train our classifier on all ten of our training texts save for one, and then test the classifier on that one tsv file. This method is called "ten-fold cross validation," because it gives us ten opportunities to measure the performance of our training routine and thereby estimate the future success rate of our classifier.
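The full cross-validation loop amounts to ten train/test command pairs. The sketch below just builds those command strings; the .prop and model file names are illustrative (numbered rather than spelled out, unlike the ones we actually used):

```python
def cross_validation_commands(k=10):
    """Build the train/test command pairs for k-fold cross validation:
    classifier i trains on every directory except i, then is tested
    on the held-out directoryi.tsv."""
    pairs = []
    for i in range(1, k + 1):
        train = ("java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier"
                 f" -prop propforclassifier{i}.prop")
        test = ("java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier"
                f" -loadClassifier ner-model{i}.ser.gz -testFile directory{i}.tsv")
        pairs.append((train, test))
    return pairs

commands = cross_validation_commands()
```

Each pair could then be handed to a shell (or to subprocess) in turn, with each .prop file pointing at the nine combined directories that exclude its test fold.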

Running the last command listed above generates a text file with three columns: the first column contains the tokens in “directoryone.tsv”, the second column contains the "0"s and "LOC"s we used to classify those tokens by hand in our training texts above, and the third column contains the classifier's guess regarding the status of each token. In other words, if the first word in "directoryone.tsv" is "Montauk", and we designated this token as a location, but the trained classifier did not, the first row of the output file will look like this:

Montauk LOC 0

By measuring the degree to which the tool's classifications match our human classifications, we can measure the accuracy of the trained classifier. After training all ten classifiers, we did precisely this, measuring the success rates of each classifier and plotting the resulting figures in R:  

This first plot shows the number of true positive locations that the out-of-the-box Stanford classifier identified in each of our .tsv files, alongside the number of true positives our trained classifiers identified in each .tsv file. A true positive location is a token that the classifier has identified as a location (this is what makes it "positive") and that we have also designated as a location (this is what makes it "true"). If the classifier designates a token as a location but we have identified it as a non-location, that counts as a "false positive." The following graph makes it fairly clear that the out-of-the-box classifier tends to produce many more false positives than our trained classifiers:

While measuring true positives and false positives is important, it's also important to measure false negatives, which are tokens that we have identified as locations that the classifier fails to identify as locations. The following graph illustrates the fact that the trained classifier tended to miss many more locations than did the out-of-the-box classifier:

Aside from true positives, false positives, and false negatives, the only other possibility is "true negatives", which are so numerous as to almost prevent comparison when plotted together:

While the plots of true positives, false positives, and false negatives above speak to some of the strengths and weaknesses of the trained classifiers, those who work in statistics and information retrieval like to combine some of these values in order to offer additional insight into their data. One such combination is called "Precision", which in our case is a measure of the degree to which those tokens identified as locations by a given classifier are indeed locations. More specifically, precision is calculated by taking the total number of true positives and dividing that number by the combined sum of true positives and false positives (P = TP/(TP+FP)). Here are the P values of the trained and untrained classifiers:

Another common measure used by statisticians is "Recall", which is calculated by dividing the number of true positives by the sum total of true positives and false negatives (R = TP/(TP+FN)). In our tests, recall is essentially an indication of the degree to which a given classifier is able to find all of the tokens that we have identified as locations. Clearly the trained classifier did not excel at this task:

Finally, once we have calculated our precision and recall values, we can combine those values into an "F measure," which serves as an abstract index of both. There are many ways to calculate F values, depending on whether precision or recall is more important for one's experiment, but to grant equal weight to both precision and recall, we can use the standard harmonic mean formula: F = 2PR/(P+R). The F values below may serve as an aggregate index of the success of our classifiers:

So what do these charts tell us? In the first place, they tell us that the trained classifiers tend to operate with much greater precision than the out-of-the-box classifier. To state the point slightly differently, we could say that the trained classifier had far fewer false positives than did the untrained classifier. On the other hand, the trained classifier had far more false negatives than did the untrained classifier. This means that the trained classifier incorrectly identified many locations as non-locations. In sum, if our classifier were a baseball player, it would swing at only some of the many beautiful pitches it saw, but if it decided to swing, it would hit the ball pretty darn well.
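In code, the three measures reduce to a few lines. The counts below are illustrative, not our measured values; they mimic a high-precision, low-recall classifier like our trained one:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and their harmonic-mean F score,
    computed from raw classification counts."""
    p = tp / (tp + fp)        # P = TP / (TP + FP)
    r = tp / (tp + fn)        # R = TP / (TP + FN)
    f = 2 * p * r / (p + r)   # F = 2PR / (P + R)
    return p, r, f

p, r, f = precision_recall_f(tp=80, fp=20, fn=80)
```

Note how the F score punishes the imbalance: a classifier that swings rarely but accurately still scores well below its precision value.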

It stands to reason that further training will only continue to improve the classifier's performance. After all, the NER classifier learns from the grammatical structures of the training files it is fed, which allows the classifier to correctly identify locations it has never encountered before. (One can independently prove that this is the case by running a few Python scripts on the data generated by the classifier.) As we continue to feed the classifier additional grammatical constructs that are used in discussions of locations, the classifier should expand its "location vocabulary" and should therefore be more willing to swing at pretty pitches. Once we've finished compiling the last two thirds of our training texts, we will be able to retrain the classifiers and see whether this hypothesis holds any water. Here's looking forward to seeing those results!

A New Tool for Literary Research: Literature Online API

When it comes to literary research, Literature Online is no doubt one of the best digital resources around. The site hosts a third of a million full-length texts, the definitive collection of digitized criticism, and a robust interface that boasts such advanced features as lemmatized and fuzzy spelling search options. When I started looking for a public API that would allow users to mobilize the site's resources in an algorithmic fashion, though, there was no API to be found. So I decided to build my own.

Using Python's Selenium package, I built an API that sends queries to Literature Online in a procedural fashion and generates clean, user-friendly output data. The program runs as follows: After double-clicking the literatureonlineapi.exe file (or the file) linked in the Tools tab of this site, the following GUI appears:


Using this interface, users may select the appropriate checkboxes pictured above to identify whether they would like to employ Literature Online's fuzzy spelling and/or lemmatized search features. Additionally, users can limit potential matches by publication date and author date ranges. Then, users may click the "Select Input File" button to select a file they would like to use to query Literature Online. This file should be a plain text file that contains one or more words or series of words one would like to use to search Literature Online. The program will send the first n words of this file to Literature Online, where n = the value of "window size" (in the image above, n = 3). The program will then record the name, publication date, and author of texts that contain the first n words of your file. This match will be an exact match; i.e., if n = 3 and the first three words of your file are "the king will", the program will find all texts in the Literature Online database that contain the exact string "the king will". Then the script will look at words p through n + p in your plain text file, where p = the window slide interval. In the image above, p = 1 and n = 3, so in its second pass through our hypothetical text file the program would look at words 2 through 4 (inclusive). The program will once again pull down all relevant metadata for the found hits. It will then slide p words forward once again, examining words 3 through 5, and so forth, until it reaches the end of the document. Once it has reached the end, it will go back to the beginning of the document and repeat the process, this time submitting not exact searches but proximity searches. For example, instead of searching "the king will", the program will find all instances of "the near.3 king near.3 will" and then slide its search window forward in the customary fashion. Finally, the program will write its .tsv output to the directory selected with the "Select Output Location" button.
In the case of the sample string discussed in this paragraph, the output file looks like this:

Screen shot of API's output in Windows. Click for enlarged view. 

Users can then use this output to create plots, inform stylometric analysis, or simply to help allocate their readerly attention in a more efficient manner.
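The window arithmetic described above can be sketched in a few lines. The near.3 operator is Literature Online's proximity syntax; everything else here is a simplification of what the program does:

```python
def build_queries(text, n=3, p=1):
    """Slide an n-word window forward p words at a time, producing
    first the exact search strings and then the near.3 proximity
    versions the program submits on its second pass."""
    words = text.split()
    exact, proximity = [], []
    for i in range(0, len(words) - n + 1, p):
        window = words[i:i + n]
        exact.append(" ".join(window))
        proximity.append(" near.3 ".join(window))
    return exact, proximity

exact, proximity = build_queries("the king will come again")
```

With n = 3 and p = 1, a five-word file yields three exact queries and three proximity queries, exactly the passes described above.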

Let's suppose you wanted to find all texts in the Literature Online database that contain the word "king," as well as all of the texts that contain the word "queen." In this case, you could proceed as follows: First, make sure you have Firefox installed on your computer. Then, download the LiteratureOnlineAPI folder, open it up, and double-click the file entitled "literatureonlineapi.exe"; the GUI pictured above should appear. Next, create and save a text file that contains only "king queen". After selecting this file with the "Select File" button, set the "window size" to 1. Doing so will tell the program that you want each of the searches you send to Literature Online to contain exactly one word (where a word is defined as any character or series of characters bounded by whitespace). Next, set the slide interval to 1, so that the program will know to send the first query ("king"), then slide forward 1 word to "queen" and submit that search term. Finally, click Start. If all goes to plan, a Firefox window will open and the program will be off and running. If you need to terminate the program, just close that Firefox window. Doing so, however, will prevent the out.tsv output from documenting any found matches. If this happens, you can simply restart the program.

Building this tool was a blast, not least because doing so allowed me to learn much more about WebDriver, GUIs, and code compilation. Here's hoping the finished tool will help others pursue stimulating literary and historical research!

Submitting Python Scripts to Sun Grid Engine Queues

I have recently begun submitting scripts to my home institution's computer cluster. Although submitting jobs to the cluster is a fairly straightforward task, it took me a bit of time to figure out how to format jobs such that the Sun Grid Engine queuing system employed here at Notre Dame could process my scripts and distribute them over the cluster. In case others would like to be able to distribute jobs over a Unix-based computer cluster that uses an SGE front end, I wanted to briefly type up the protocols I have followed to accomplish this task.

To submit a script to Notre Dame's SGE front end, one needs two files: a hashbanged script that one would like to run, and a .job file. Say the script you want to run is a Python script. In order to prepare this script for the cluster, you will want to add a line at the very start of the script that points to the version of Python you plan to employ. This line goes by a variety of amusing names—it is sometimes called a shebang line, or a hashbang, crunchbang, hashpling, pound bang, etc.—but it usually takes a form such as the following:

#!/usr/bin/env python
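A complete cluster-ready script, then, need only carry that line at the top. Here is a trivial, hypothetical example; the parenthesized print works under both Python 2 and Python 3, so it will run under the python/2.7.3 module loaded below as well as newer interpreters:

```python
#!/usr/bin/env python
# A minimal, hypothetical script for verifying that jobs run on the cluster.
import platform

# Report which interpreter the cluster node actually used.
print("Running under Python " + platform.python_version())
```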

Then, once you have your script ready to go, you only need to make a .job file that can tell the cluster where to find all of the elements required to run the script. My .job files look something like this:

#!/bin/csh
#$ -M
#$ -m abe
#$ -r y
#$ -o tmp.out
#$ -e tmp.err
module load python/2.7.3
echo "Start - `date`"
python yourscript.py
echo "Finish - `date`"

The first line in this .job file tells the cluster to interpret the job's commands with the C shell. The second line specifies the email address to which the cluster will report its output, and the third line asks the cluster to email that address when the job begins, ends, or aborts. The fourth line indicates whether your script is re-runnable (it appears that Gaussian scripts and certain machine learning algorithms are not re-runnable and so should be flagged -r n). The final two lines prefaced with #$ name the expected output files. The line that begins "module load" indicates, as one might expect, the module one would like to load (make sure to specify the version of the software you would like to run). The penultimate line names the script you would like to run, and the echo lines merely ask the C shell to report the start and end times of the job. Using a text editor like Notepad++, users can modify these fields to suit the demands of their job, and then save the file as something like "test.job".

Once you have your hashbanged script and your .job file, upload them to a directory on your cluster. I use FileZilla for this purpose, but one could accomplish the same task at the command line. Then ssh to that directory (if you are working on a Windows machine, you can use PuTTY for this). Once you're in the directory in which your .job and .py files are located, you can submit your job by simply typing "qsub test.job" and hitting enter. If you left the "-m abe" line in your .job script, you should soon receive an email indicating that your job has been submitted.

When I first started submitting scripts, I could tell from the email reports I received that my jobs began and ended almost simultaneously. Eventually I realized that I had not properly formatted my file paths. Once I fixed the file paths, all was well and the scripts ran properly.

I found it helpful to try running my scripts from the command line before submitting them with a job file. To do this with a Python script, you need only ssh into the cluster, navigate to the directory in which your script is located, and type "python" followed by the name of your script and a return. If your script is properly formatted, any print statements in it will print to the terminal; if there is a problem with your script, the terminal will list the error message.

Using the cluster has allowed me to process computationally demanding jobs very rapidly, which has in turn allowed me to continue refining my scripts quickly. I hope employing methods similar to those above can help some readers submit their own scripts to clusters.