Classifying Shakespearean Drama with Sparse Feature Sets

In her fantastic series of lectures on early modern England, Emma Smith identifies an interesting feature that differentiates the tragedies and comedies of Elizabethan drama: "Tragedies tend to have more streamlined plots, or less plot—you know, fewer things happening. Comedies tend to enjoy a multiplication of characters, disguises, and trickeries. I mean, you could partly think about the way [tragedies tend to move] towards the isolation of a single figure on the stage, getting rid of other people, moving towards a kind of solitude, whereas comedies tend to end with a big scene at the end where everybody's on stage" (6:02-6:37). 

The distinction Smith draws between tragedies and comedies is fairly intuitive: tragedies isolate the poor player that struts and frets his hour upon the stage and then is heard no more. Comedies, on the other hand, aggregate characters in order to facilitate comedic trickery and tidy marriage plots. While this discrepancy seemed promising, I couldn't help but wonder whether computational analysis would bear out the hypothesis. Inspired by the recent proliferation of computer-assisted genre classifications of Shakespeare's plays—many of which are founded upon high dimensional data sets like those generated by DocuScope—I was curious to know if paying attention to the number of characters on stage in Shakespearean drama could help provide additional feature sets with which to carry out this task.

To pursue the question, I ran some analysis on the Folger Digital Texts edition of the Bard's plays. This delightful collection uses a custom XML schema to indicate when characters enter and exit the stage, which makes it possible to track the number of characters on stage over the course of a play:


This visualization of The Tempest, for instance, traces the number of characters on stage from the play's opening scene—in which the Shipmaster and his Boatswain are quickly joined by Alonso, Sebastian, Antonio, Ferdinand, and Ganzalo—through Prospero's staff-dashing monologue around the 15,000 word mark to the play's crowded conclusion. Here are the stagings for the other Shakespearean plays in the FDT canon, ordered by their date of first performance according to Alfred Harbage's Annals of English Drama:


These plots afford ample evidence to suggest that Shakespearean comedies tend to end with large scenes in which everybody's on stage. Unfortunately, many of the comedies and tragedies also end with large gatherings of characters. It therefore seems that the number of characters on stage during a play's conclusion might not be an ideal feature with which to classify the genres of Shakespeare's plays.

With these results in hand, I decided to measure how often Shakespeare isolates a single character on stage within plays from each of the three canonical genres. Aggregating the total number of words spoken when only a single character is on stage, as well as the total number of words spoken when only two characters are on stage, and so forth, allows one to measure the degree to which each play distributes its attention between large and small gatherings of characters (click to enlarge):

While these plots reveal some interesting features of the works, such as the fact that Two Gentlemen of Verona truly does revolve around dyadic pairs, they make it difficult to compare the amount of time tragedies and comedies feature only a single character on stage. To make this latter comparison, one can find the average amount of time a single character occupies the stage for each genre:


Surprisingly, the chief difference between the comedies and tragedies has less to do with the way each handles isolated actors on stage than with the way each handles triads and quadrads. It seems tragedies have a greater tendency to revolve around sets of three characters, while comedies are more often organized around sets of four characters. That said, the similarities between the two genres are far more striking than their differences, and far less encouraging for one in search of distinguishing features.

Reflecting on these results, I wondered if tragedies might be better classified by the amount of time their conflicted characters spend addressing the audience. One way to begin measuring the latter, I thought, would be to count the number of words spoken by each character in each play (click to enlarge):

Analyzing these figures, I was struck by what should have been a fairly obvious fact: Shakespeare's most memorable characters (Falstaff, Hal, Prospero, Rosalind, Hamlet...) are each given commanding positions within the plays they lead. Given the strong correlation between these memorable characters and the number of lines each speaks, it's tempting to ask whether we remember these characters most readily simply because Shakespeare allowed them to say the most, or whether Shakespeare allowed them to say the most because he sensed they were his most memorable characters.

Either way, the last trio of plots shows a fairly even distribution of commanding figures among the comedies, histories, and tragedies. But those plots also reveal that the histories include rather few words spoken by women, as well as the fact that the comedies tend to be shorter than the tragedies and histories:


By analyzing only the length of a play and the number of words women speak in that play, one can start to get reasonably good separation between the genres: comedies tend to be shorter and include more female dialogue, histories tend to be longer and include less female dialogue, and tragedies split provocatively between the upper right and lower left. Reviewing these figures, I can't shake the suspicion that a third dimension of data could unite these divided tragedies. But what would that dimension consist of? 

* * *

I would like to thank Mike Poston, co-curator of the Folger Digital Text editions used for this analysis, for discussing many of the finer points of the FDT collection with me. In case you want to replicate any of the analysis or assess the assumptions on which it's founded, the scripts are here.

Pseudo-Cryptography in Jonathan Swift's Tale of a Tub

“I do here humbly propose for an Experiment,” Jonathan Swift writes near the end of his Tale of a Tub, “that every Prince in Christendom will take seven of the deepest Scholars in his Dominions, and shut them up close for seven Years, in seven Chambers, with a Command to write seven ample Commentaries upon [A Tale of a Tub].” In order to promote so useful a Work, Swift informs readers that he has encrypted some hidden messages in the Tale: “I have couched a very profound Mystery in the Number of O's multiply'd by Seven, and divided by Nine.” Not wishing to leave the alchemically inclined out of his game, Swift adds: “Also, if a devout Brother of the Rosy Cross will pray fervently for sixty three Mornings, with a lively Faith, and then transpose certain Letters and Syllables according to Prescription, in the second and fifth Section; they will certainly reveal into a full Receipt of the Opus Magnum.” Swift then completes his triumvirate of puzzles with the following remark: “Lastly, Whoever will be at the Pains to calculate the whole Number of each Letter in this Treatise, and sum up the Difference exactly between the several Numbers, assigning the true natural Cause for every such Difference; the Discoveries in the Product, will plentifully reward his Labour” (Section X).

The probability that Swift actually altered his language so as to conceal occult knowledge in his already-overstuffed Tale may seem rather minimal to many readers. For those familiar with Swift's delight in word games and ciphers, though, the odds might look a bit brighter. In fact, readers who have discovered Paul Childs' essay “Cipher Against Ciphers: Jonathan Swift's Latino-Anglicus Satire of Medicine” in Cryptologia—which analyzes the ways Swift transformed his suffering under Ménière's disease into cleverly encrypted messages—might wonder whether the Dean has in fact buried some treasure in his Tale. With this question in mind, I decided to run some experiments to see whether I might be able to resolve some of Swift's long-overlooked riddles.

Riddle One: The Mystery of the Number of O's in Swift’s Tale. Some simple analysis reveals that there are 16,092 O’s in the 1710 edition of Swift’s Tale of a Tub. If we multiply this value by seven and divide the result by nine—the operations Swift suggests one must perform to uncover his profound Mystery—we get 12,516, a number whose significance I leave to others to determine. While this number might carry significance, the more fundamental question is whether Swift’s use of the letter O appears to be unusual or premeditated in any way. To answer this question, we can compare the relative frequency of the letter O in Swift’s Tale to the relative frequency of the letter in other documents from the eighteenth century:

This plot compares the relative frequency of the letter 'O' in Jonathan Swift's Tale of a Tub to the relative frequency of the same letter in a random sample of documents from the ECCO-TCP corpus.

This plot compares the relative frequency of the letter 'O' in Jonathan Swift's Tale of a Tub to the relative frequency of the same letter in a random sample of documents from the ECCO-TCP corpus.

If Swift's usage of the letter O were premeditated or unusual in any way, we should expect to see the relative frequency of the letter depart from the norm established by his contemporaries. As the plot above indicates, however, his use of the letter is perfectly in keeping with the trend of his times, which suggests Swift's first riddle is a jest.

Riddle Two: “[If readers] transpose certain Letters and Syllables according to Prescription, in the second and fifth Section; they will certainly reveal into a full Receipt of the Opus Magnum”. To analyze this puzzle, we can again look at the distribution of letters in Swift’s Tale, this time investigating the degree to which the distribution of any letter in sections two and five look out of keeping with the distributions of those letters in other sections. More generally, we can look to see if there are any unusual distributions of letters across the sections of the text, and if there are, we can begin considering appropriate methods of transposing those letters to get the syllables with which the Magnum Opus is communicated.

This plot indicates the relative frequency of each letter in each section of Swift's Tale. Each section is given a consistent color, so if any section contains an unusual proportion of any particular letter(s), we should expect a wider distribution of frequencies for that letter or those letters.

This plot indicates the relative frequency of each letter in each section of Swift's Tale. Each section is given a consistent color, so if any section contains an unusual proportion of any particular letter(s), we should expect a wider distribution of frequencies for that letter or those letters.

Each section of the Tale is given a consistent color in this plot, so if any section contains an unusual proportion of any particular letter(s), we should expect to see a wider distribution of frequencies for that letter or those letters. Much to the would-be alchemist’s chagrin, however, the plot above indicates that there are no letters in the Tale that have wildly aberrant distributions, which effectively closes the book on the second riddle.

Riddle Three: “Whoever will . . . calculate the whole Number of each Letter in this Treatise, and sum up the Difference exactly between the several Numbers, assigning the true natural Cause for every such Difference; the Discoveries in the Product, will plentifully reward his Labour.” We can easily calculate the number of times each letter occurs in Swift’s Tale:


Using these frequencies, Swift suggests, one can find “the Discoveries in the Product,” which value “will plentifully reward his Labour.” He leaves it comically unclear what is meant by “the Product,” though: does he mean the product of the difference between each letter and each other letter, or the product between the difference of each letter and the “whole Number of each Letter in this Treatise”, or some other mad metric?

Regardless of the answer to this question, it seems clear that Swift means these riddles to be ludic and satirical, rather than genuine encryptions. My question for readers is: who or what is Swift parodying here? Are there other texts from the period that purport to contain hidden messages in their letter counts? I ask not only because I'm fascinated by early modern ciphers, but because I want to have a fuller understanding of the specific works with which Swift was working in these delightfully flippant passages.

* * *

The analysis above was conducted using the 1710 edition of the Tale digitized by Lehigh University.  The visualizations were produced with Pyplot using these scripts.

Co-citation Networks in the EEBO-TCP Corpus

I recently had the good fortune of attending a conference on computational approaches to early modern literature hosted at the University of Newcastle. During the conference, I not only got to meet some outstanding scholars—including Doug Bruster, John Burrows, Hugh Craig, Mac Jackson, and Glenn Roe, to name only a few—but I also had a chance to present some of my recent work on algorithmic approaches to the study of literary influence. In case it might be of interest to others working in related fields, I thought I would share one of the approaches I discussed in what follows below.

Buried within the EEBO-TCP corpus, I learned some months ago, is a veritable trove of metadata. These metadata features indicate when authors include things like stage directions, tables of data, and alchemical symbols in their writing, all of which is great news for the computationally inclined. For those interested in influence, there are also metadata fields that indicate when an author is quoting or citing another text. By placing <q> tags around quotations and <bibl> tags around citations of authors or works, the Text Creation Partnership made it easy for researchers to begin looking for citational patterns in early modern literature:

Lines 48-52 of this sample from the EEBO-TCP corpus contain <q> and <bibl> tags to designate quotations and bibliographic citations respectively.

Using a simple script, one can easily extract all of the quotations and bibliographic citations from all files in the EEBO-TCP corpus. Because the EEBO-TCP corpus contains roughly one third of the titles from the period recorded in the definitive English Short Title Catalogue, this citational data can serve as a fairly representative archive of intertextual trends in early modern England:


That is, at least, the idea in theory. In practice, the data is quite messy, and in its native form, is all but algorithmically unapproachable. Take, for example, the following references contained within <bibl> brackets:

Iudges, 5 23.
IVDGES 4. 21.
Iudg. 42.
[Judg. 21.]

Human investigators who look into the matter can easily discover that each of these references refers to the Book of Judges. To allow computers to recognize this fact, however, I had to spend several weeks sifting my way through the collection of <bibl> tags, carefully identifying the texts and authors to which those metadata fields referred. In the end, of the roughly 45,000 items that were tagged as references to books or authors in the EEBO-TCP corpus (parts I and II), I found roughly a third to be too cryptic to decipher:

W. ??.
ラ Testimony of a great Divine.

Setting these obscure references aside, I cleaned up the sources about which I was reasonably confident, such as the references to Judges above. After I had aggregated all of these, I was naturally curious to see which texts and authors were most cited in the corpus. Here are the top forty:

The forty most cited authors and works within the EEBO-TCP corpus.

Biblical citations predominate, with the greatest number of references going to the Book of Psalms. (It should be noted that in most translations of the Bible into English and Latin, the Book of Psalms has by far the greatest number of verses and words.) Matthew leads the Evangelists, followed by John, then Luke, then Mark. The highest frequency Greco-Roman writers include Virgil, Ovid, Horace, Martial, Juvenal, and Seneca, in that order. While this data might not revolutionize early modern scholarship, it might be helpful in the classroom. When teaching undergraduates about the works early modern audiences read, heard, and cited, for example, such visualizations can help to bring home the profound religiosity of the age.

Drawing on the same data that underlies this visualization, one can also analyze networks of influence in early modern literature. Rather than conceive of influence as a series of vectors, each measuring the number of times a given work or author is cited in the TCP corpus, one can analyze the kinds of works and authors that are cited together. Such forms of analysis are often referred to as “co-citation networks,” or networks of references that tend to be cited together. For example, say our corpus contained references to only nine different sources:


In that case, we could visualize each of these sources as a “node” in a network graph and could create an “edge,” or connecting line, between any two nodes that were cited within a text in the corpus. The sample graph above would then illustrate the fact that references “A” and "B" are cited within one text, "G" and "Z" in another, and so on. Using this method on the EEBO-TCP corpus, I thought, might help to reveal latent structures embedded within early modern citation networks. With the curated EEBO-TCP citation data in hand, I therefore proceeded in the fashion described above, transforming each cited author or work into a node, and creating edges between any two nodes cited within a single work. Here is one visualization of the results:

This wild web is a visualization of co-citation networks in the EEBO-TCP corpus. Each cited author and text is a point (or "node") on the graph, and any two authors or texts cited within a single text are connected by a line (or "edge"). The plot also uses a modularity clustering algorithm to separate and color code nodes that commonly appear together: the blue cluster contains biblical works, the red cluster contains classical authorities, the yellow cluster contains Reformation-era martyrs and theologians, and the purple cluster contains literary figures.

In this graph, each node has been assigned a color. These colors are determined by an algorithm that identifies clusters of nodes that are commonly cited together, and then colors the different clusters accordingly. Analyzing the nodes that cluster together, one finds four particularly well-defined groups of references: the light blue cluster of nodes, which are biblical books (Leviticus, Corinthians); the red cluster of nodes, which are classical references (Macrobius, Lucretius); the yellow cluster of nodes, which are Reformation-era martyrs and theologians (Theodore Beza, Richard Turner); and the purple cluster of nodes, which are literary writers (Philip Sidney, William Shakespeare). The results of the procedure are intuitively legible: each group represents a fairly homogenous collection of authors and works, and the divisions that split the groups apart represent fairly significant generic differences between the various references contained in the data. If this is right, then such a map could perhaps serve as a useful resource in the classroom. When discussing the so-called "battle of ancients and moderns," for instance, graphs like the present one might help students visualize the different ways early modern citation practices established divisions between these groups.

Using a slightly different visualization, one can identify the points of connection between and among these ostensibly divided groups. This link will take you to an interactive site that contains all of the data included in the previous plot, including the modularity rankings that separated and color-coded the nodes. Unlike the previous plot, however, the visualization at the other end of that link allows users to click on a particular node and see all of the others to which that node is connected. Comparing the networks of each of these nodes, one can produce at a glance one measure of the degree to which the given node was cited with works and authors from the classical inheritance, for instance. The results sometimes lead to new questions. For example, the interactive plot demonstrates that when early modern authors cited Paul the Apostle in the context of classical writers like Apuleius, they cited Paul's given name; when they cited Paul in biblical contexts, by contrast, they cited his works, such as "Romans." Why might that be the case? Other points of interest I've come across in this visualization include the relatively isolation of the alchemists (George Ripley, Geber, etc.), towards the bottom of the plot, and the predominance of biblical citations in works that reference Descartes. Both of these observations strike me as less intuitive than the separation of authors into disparate modularity rankings, for instance, and both seem worthy of further inquiry.

Batch Processing Python Scripts on Sun Grid Engine Queues

Suppose you have a collection of text files and would like to compare each of those files to each of the others. Perhaps you would like to know which characters, locations, or stage directions in each file occur in any of the others. Whatever the task, if your collection is small enough—on the order of a few paragraphs, say—you can of course compare the files manually, reading each of your paragraphs in turn, and comparing the given paragraph to each of the others. If your collection is a bit bigger—on the order of a few hundred novels, say—you might automate these comparisons on your computer. If your collection is really big, however, a single computer might not be powerful enough to finish the job during your lifetime.

Comparing each file in a four text corpus to each of the others, after all, only involves six comparisons. Running the same analysis on a collection of 50,000 files (roughly the size of the Project Gutenberg collection in English, or the EEBO-TCP corpus), however, means running 1,249,975,000 comparisons. If each of those comparisons takes one minute to execute on your computer, it will take 2376 years to run this job on your machine. Thankfully, we can expedite this process tremendously by leveraging the power of distributed computing systems like Sun Grid Engine (SGE) queues. Pursuing the routine described above, for instance, we can use an SGE system to run each of our 1+ billion comparisons in a few minutes.

To get started, we'll want to create an "iteration schedule" in which we identify all of the comparisons we wish to run. Here is a visual representation of an iteration schedule for a four text corpus:


In the table above, each of our iterations-to-be-run is denoted by an "o." Each "o" sits in the cell that joins the row and the column that denote the two texts to be analyzed in the given iteration. Reading across our first row, for instance, we see that text one does not need to be compared to text one, but does need to be compared to texts two, three, and four. The second row denotes that we want to compare text two to texts three and four, and row three denotes that we want to compare text three to text four.

After determining all of the comparisons to be run, we will want to render that information in machine-readable form. More specifically, we want to generate a table that has three columns: iteration_number, first_text, and second_text: where first_text and second_text are the file names of the two texts we wish to compare in the given iteration, and iteration_number is an integer whose value is zero in the first row of the iteration schedule, 1 in the next row, 2 in the next, . . . and n in the last, where n equals the total number of comparisons we wish to make. [In general, the number of comparisons required is (p-1)(p)/2, where p equals the number of files in your corpus.]  Here is a sample iteration table (produced by this script):

0 A00002.txt A00005.txt
1 A00002.txt A00007.txt
2 A00002.txt A00008.txt
3 A00002.txt A00011.txt
4 A00002.txt A00012.txt
5 A00002.txt A00013.txt
6 A00002.txt A00014.txt
7 A00002.txt A00015.txt
8 A00002.txt A00018.txt
9 A00002.txt A00019.txt

If you want to batch process a different kind of routine on an SGE system, you can modify your iteration schedule appropriately. If you only want to calculate the type-token ratios of each of your files, for instance, you'll only need two columns: iteration_number and text_name. Once this iteration schedule is all set, we can turn to the script to be run during each of these iterations. Here's mine: If you want to run a different kind of analysis, just keep lines 1-21 and line 102 of that script, and use the variable “iteration_number” to guide which texts you will analyze in each iteration. Once your routine is all set, save it as “” and upload it—along with your iteration schedule, the files you wish to compare, and these two higher order scripts—to a single directory on your SGE server:

Once all of these files are in the same directory, you are ready to submit your script for batch processing. To do so using the University of Notre Dame's Center for Research Computing system, you can simply type "python your_netid" with no quotation marks:

After you submit this command, the higher-order Python scripts and will create new copies of your script, changing the input files for each iteration according to your iteration schedule. If all has gone well, and you refresh the directory after a few moments, you'll see a few (or more than a few, depending on the number of iterations you are running!) new files in your directory. More specifically, you'll have a collection of new job_ files that give you feedback on the result of each of your iterations. If errors cropped up during your analysis, those errors should be recorded in these job_ files. Provided that there were no exceptions, though, those files will be empty, and you will find in your directory whatever output you requested in your script. Et voila, now you can finish your analysis in a few minutes, rather than a few millennia!

* * *

I want to thank Scott Hampton and Dodi Heryadi of Notre Dame's High Performance Computing Group, who helped me think through the logistics of batch processing, Reid Johnson of Notre Dame's Computer Science Department, who sent me the higher-order SGE scripts on which this analysis is based, and Tim Peters, who wrote many of the Python modules on which my work depends!

Identifying Poetry in Unstructured Corpora

Over the last few months, I've been working with colleagues at Notre Dame to develop computational approaches we can use to identify the genres to which a literary work belongs. Initially, we focused our research on the georgic, a class of agricultural-cum-labour poems that flourished in the seventeenth and eighteenth centuries. Eventually, though, our limited research corpus led us to investigate methods we could use to identify more period poetry, and these investigations helped reveal a fascinating if simple method one can use to identify poetic works in unstructured corpora. 

We began building our corpus of early modern English poetry by identifying the poetry curated by the Text Creation Partnership (TCP). Running a simple Python script over the TCP's selections from the Early English Books (EEBO) corpus—which stretches from “the first book printed in English in 1475 through 1700”—and the Eighteenth Century Collections (ECCO) corpus, we extracted all the lines of text wrapped in < l > tags (the TEI designation for a line of verse). This left us with 16,571 text files, each of which contained only poetry from roughly the sixteenth through the eighteenth centuries. After examining some of these files, we realized that many consisted entirely of poetic epigraphs, so we used another script to remove all of these small files (those smaller than 16 kb) from our research corpus, leaving us with a fairly substantive collection of poetic works from the period of interest:


Because the EEBO-TCP contains 44,255 volumes—roughly one third of all titles recorded in Alain Veylit's ESTC data for the appropriate years—we felt reasonably confident that our holdings for the sixteenth and seventeenth centuries were fairly representative of literary trends during the period. The ECCO-TCP, on the other hand, contains only 2,387 texts, less than one percent of ESTC titles from the eighteenth century. Even if we accept John Feather's argument that only 25,131 literary works were written in English during the eighteenth century—11,789 of which, he claims, were poetic works—we are left to conclude that the 1,698 files in the ECCO-TCP corpus that contain poetry might not be indicative of poetic trends from the period. Given these conclusions, we were eager to supplement our collection of eighteenth-century poetry.

But where on earth can one find enormous quantities of eighteenth-century poetry in digital form? (This isn't meant to be a rhetorical question; if you've got ideas, please let us know!) After considering the issue for some time, we elected to work with Project Gutenberg. Unfortunately, only after we had downloaded and unzipped all of the English files on Project Gutenberg did we realize that the enormous text collection (roughly 45,000 volumes) is all but entirely unstructured. We couldn't find any master list of file names, author names, publication dates, or any other essential metadata fields, so we had to build our own.

In the first place, we wanted to be able to differentiate poetic texts from non-poetic texts. While I imagine it would be possible to complete this task by analyzing the relative frequency of strings from each of these texts in the manner described in the previous post, we didn't have reliable publication dates for the Gutenberg texts, so we needed an alternative method. Operating on the hypothesis that poetic texts have more line breaks and fewer words per line than prose works, we decided to measure the number of words in each line of each file. We then collected a random sample of poetic works to see what their words-per-line profiles looked like:

In these plots—each of which represents a single poetic text—the numbers along the x-axis indicate the number of words in a line of the text file, and the y-axis indicates the relative frequency of lines that contain such-and-such a number of words within the text. In The Poetical Works of James Beattie, for instance, only ~5% of lines had 12 or 13 words in them, whereas almost 20% of the text's lines had 7 or 8 words in them. In other words, The Poetical Works of James Beattie is dominated by lines with seven or eight words in them, a fact that applies to all of the poetic works plotted above. With these figures in hand, we plotted the words-per-line profiles for a random assortment of prose works from roughly the same period:

We were pleased to see that these plots differed from the poetic plots quite dramatically! Comparing the two sets of curves, we see that poetic works contain a preponderance of lines with 7-8 words, while prose works contain a preponderance of lines with 11-12 words. This is naturally due to the fact that lines of text in prose works run across an entire page, while poets break lines strategically (and regularly in eighteenth-century verse). To identify poetry in unstructured corpora, then, we can calculate a text's words-per-line profile, and use the results of those calculations in order to classify each text in our corpus as a work of poetry or a work of prose. Using a rather simple approach to the latter task, we found 3150 poetic works tucked in the Gutenberg corpus, a few hundred of which are from our period and can thus contribute to our study of genre classification.