Mapping the Early English Book Trade

Historians often call attention to the tremendous influence the 1710 Act of Anne had on the early English book trade. Commonly identified as the origin of modern copyright law, the Act laid the statutory foundations for fixed-term copyright in England, extended the ability to hold such copyrights to all individuals, and eventually toppled the monopoly that London booksellers had held on English printing since the incorporation of the Stationers' Company in 1557. Reading scholarship on this legal development over the last few months, I became curious to see how well the English Short Title Catalogue (ESTC) could substantiate some of the claims made in discussions of the Act. The ESTC seemed an ideal resource for this kind of analysis because, as Stephen Tabor has written, it represents “the fullest and most up-to-date bibliographical account of 'English' printing (in the broadest sense) for its first 328 years” (367). The database lists the authors, titles, imprint lines, publication dates, and many other metadata fields for each of the ~470,000 editions known to have been printed in England or its colonies between 1473 and 1800, and can therefore serve as a helpful resource with which to investigate the relationship between copyright law and literary history in the early modern period.

One of the debates surrounding the Act of Anne concerns the degree to which the statute altered the geography of the English book trade. Prior to the passage of the Act, legal historian Diane Zimmerman notes, the Stationers' Company dominated the book industry, and because the company's printers were primarily stationed in London, the book trade was also centered in the metropole. With the passage of the Statute of Anne, however, authors could sell or trade their copyrights to printers outside of London: “Now any printer [or] bookseller, wherever located within the country, could register a copyright with the Company” and “since purchasers of the copies could be located anywhere in the United Kingdom, the Stationers' Company did not regain its monopoly [on the book trade]” (7). Contra Zimmerman, William Patry argues that the Act of Anne failed to undermine London's control of the book trade: “After the Statute of Anne, as before,” he writes, “the only purchasers of authors' works were a small group of London booksellers” (84). To investigate what the ESTC had to say on this question, I compared the geographical distribution of English printers in the half centuries before and after the passage of the Act:

The usual cautions concerning false imprints and varying survival rates notwithstanding, the ESTC clearly demonstrates the decentralization of English printing in the wake of the Act of Anne. London of course remained the primary site of publication throughout the years covered by the ESTC—publishing two-thirds of all records from the period—though its annual share in the trade fell quite dramatically across the eighteenth century:

One can explain some of that decline by examining the growth of printing in major metropolitan areas outside of London, such as Edinburgh (responsible for 6.5% of total editions in the ESTC), Dublin (5.4%), and Boston (3.7%), which claimed the second, third, and fourth overall largest shares of the book trade according to the ESTC:

publishing_beyond_london.png

Among these figures, the explosion of printing in Edinburgh after 1750 is particularly interesting, and appears to be the result of further changes in the legal code. As John Feather notes, “The Copyright Act of 1710 (8 Anne c. 21) implied, but did not state, that it was illegal to import any English-language books into England and Wales if they had been previously printed there” (58). However, he continues, “the legislation in relation to Scotland seems to have lapsed in 1754-1755,” after which one observes tremendous growth in Scottish printing. Between 1750 and 1755, the five-year average of Edinburgh printing as a percent of all printing recorded in the ESTC is 7.5%. This figure only continues to grow after the lapse of Scottish printing regulations noted by Feather: from 1755-1760, Edinburgh printing climbs to 9.0% of all printing for the five-year period; from 1760-1765, the figure rises to 12.3%; and from 1765-1770, it reaches 14.4% of the ESTC totals for the five-year range. These values are significant, because they suggest the real surge in the Scottish reprinting industry did not take place in the aftermath of the Donaldson v. Becket decision, as is commonly supposed, but rather with the lapse of Scottish reprinting regulations in 1755.
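For readers who want to see how a windowed share of this kind might be computed, here is a minimal sketch. It assumes each catalogue record has been reduced to a (year, place) pair, and it pools all records in a half-open window rather than averaging annual percentages; both choices are assumptions of mine, as are the function name and the toy data.

```python
def place_share(records, place, start, end):
    """Share of records dated start <= year < end that were printed in `place`.

    `records` is an iterable of (year, place) pairs; this layout is a
    hypothetical simplification of a catalogue export.
    """
    window = [p for year, p in records if start <= year < end]
    if not window:
        return 0.0
    return sum(1 for p in window if p == place) / len(window)

# Toy data only: 3 Edinburgh imprints out of 20 records in the window.
toy_records = (
    [(1756, "Edinburgh")] * 3
    + [(1757, "London")] * 15
    + [(1758, "Dublin")] * 2
    + [(1761, "Edinburgh")] * 5   # falls outside the 1755-1760 window
)
print(place_share(toy_records, "Edinburgh", 1755, 1760))
```

Pooling the window weights busy years more heavily than sparse ones, so the result can differ slightly from an average of yearly percentages.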

Having plotted the changing geography of early English printing, I was curious to see whether the ESTC could shed new light on the debate concerning anonymous printing in the early modern period. Researchers like Jody Greene have argued that the Statute of Anne was in fact designed to help combat anonymous publishing insofar as it required authors to attach their names to works if they wished to obtain copyright protection for those works (4). Years ago, Michel Foucault pioneered a version of this thesis in his essay “What is an Author?”, where he argued that the Act of Anne and its elaboration in eighteenth-century case law spurred the transition from a literary culture founded on anonymity to one founded on named authorship. More recently, however, Robert Griffin disputed such claims, arguing that “the historical record shows . . . there is no necessary relation between copyright and the appearance of the name of the author on the title page” (879). To map the changing rate of anonymity over time, I aggregated the number of anonymous and pseudonymous publications as percents of annual totals within the ESTC:

anonymous_publications.png

The resulting plot shows great fluctuation in anonymous publications within the fifteenth and early sixteenth centuries, largely because of the tremendously small number of publications for those years. In 1492, for instance, the ESTC lists only 14 publications, all but two of which (S111337 and S120825) had identified authors, which results in an aggregate estimate of anonymity for the year of .143, or 14.3 percent. Despite the year-to-year fluctuations within early records, however, examining anonymity rates in the aggregate leads to legible patterns: one finds a marked decline in anonymous publication rates over the fifteenth and sixteenth centuries, a fairly steady rise across the seventeenth century, and a slow aggregate decline in the wake of the Act of Anne. This data supports some of the findings of Joad Raymond, who examined a small sample of records from the period and found that “anonymity . . . became increasingly frequent over the course of the seventeenth century” (168), while challenging the popular thesis that anonymity thrived with the lapse of the Licensing Act in 1695.
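The yearly aggregation described above can be sketched in a few lines. This is only an illustration: the record layout, the `PSEUDONYMS` set, and the function names are hypothetical, and deciding which records count as anonymous is precisely the judgment call that real catalogue data would require.

```python
from collections import defaultdict

# Hypothetical list of known pseudonyms; a real list would be far longer.
PSEUDONYMS = {"Isaac Bickerstaff"}

def is_anonymous(author):
    """Treat a missing author or a listed pseudonym as anonymous."""
    return author is None or author.strip() == "" or author in PSEUDONYMS

def anonymity_by_year(records):
    """Return {year: fraction of that year's records that are anonymous}.

    `records` is an iterable of (year, author) pairs (an assumed layout).
    """
    totals = defaultdict(int)
    anonymous = defaultdict(int)
    for year, author in records:
        totals[year] += 1
        if is_anonymous(author):
            anonymous[year] += 1
    return {year: anonymous[year] / totals[year] for year in totals}

# Toy data shaped like the 1492 case: 14 records, 2 without an author.
records = [(1492, "Some Author")] * 12 + [(1492, None), (1492, "")]
rates = anonymity_by_year(records)
print(round(rates[1492], 3))
```

With so few records in a year, a single reattribution moves the rate by several percentage points, which is why the early part of the plot swings so widely.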

To plot the history of anonymity, though, is to beg a fundamental question: What exactly counts as an anonymous work? While the plot above treats works as anonymous only if their title pages are attributed to pseudonymous figures like “Isaac Bickerstaff” or to no author at all, there are other cases that one might well wish to classify as anonymous works. Consider the range of works attributed to “corporate” authors like the Royal Society of London or the English Parliament. Are works published by these entities anonymous publications? The way one answers this question will of course greatly affect the way one reads the history of anonymity. As a case in point, we could consult the following plot, which shows monarchical and parliamentary publishing during the seventeenth and eighteenth centuries:

The points here represent yearly values, while the regression lines map the smoothed trends over time. For example, the release of the ESTC to which I had access indicates that James I and Charles I published a combined total of 82 works in 1625 (both served as monarch during the year), the English and Scottish Parliaments published a combined total of 4 works during the year, and the year's total number of publications was 695, which means that monarchical publications account for 11.8 percent of the annual total while parliamentary publications account for only about 0.6 percent of the same. As one can see, treating the high volume of parliamentary publications from the period as “anonymous works” would create a serious spike in anonymity rates during the English Civil Wars, and would steadily inflate anonymity rates across the eighteenth century. On the other hand, refusing to include works of corporate authorship among anonymous publications (as I have done in the plot of anonymity above) makes it more difficult to answer the question: What exactly counts as anonymity in the early modern world? Whether one includes or excludes corporate authorship from the domain of anonymity, this plot of parliamentary and monarchical publications intrigues me because it maps so neatly onto the political history of the English Civil Wars: monarchical publications trump parliamentary output until the critical years of the early 1640s, after which the Parliament assumes a predominance it holds throughout the Interregnum and only loses in the Restoration. Thereafter the monarchical voice triumphs until the Statute of Anne, after which point it rapidly loses ground. Examining this plot, I can't help but wonder: To what extent is monarchical publishing a function of the crown's political power, and to what extent is that political power a function of the monarch's proximity to print?

* * *

I want to thank Benjamin Pauley, Brian Geiger, and Virginia Schilling, each of whom kindly helped me to acquire the ESTC data on which the analysis above was performed, as well as Elliott Visconsi, whose intriguing questions on copyright history continue to motivate my ongoing research.

Classifying Shakespearean Drama with Sparse Feature Sets

In her fantastic series of lectures on early modern England, Emma Smith identifies an interesting feature that differentiates the tragedies and comedies of Elizabethan drama: "Tragedies tend to have more streamlined plots, or less plot—you know, fewer things happening. Comedies tend to enjoy a multiplication of characters, disguises, and trickeries. I mean, you could partly think about the way [tragedies tend to move] towards the isolation of a single figure on the stage, getting rid of other people, moving towards a kind of solitude, whereas comedies tend to end with a big scene at the end where everybody's on stage" (6:02-6:37). 

The distinction Smith draws between tragedies and comedies is fairly intuitive: tragedies isolate the poor player that struts and frets his hour upon the stage and then is heard no more. Comedies, on the other hand, aggregate characters in order to facilitate comedic trickery and tidy marriage plots. While this discrepancy seemed promising, I couldn't help but wonder whether computational analysis would bear out the hypothesis. Inspired by the recent proliferation of computer-assisted genre classifications of Shakespeare's plays—many of which are founded upon high dimensional data sets like those generated by DocuScope—I was curious to know if paying attention to the number of characters on stage in Shakespearean drama could help provide additional feature sets with which to carry out this task.

To pursue the question, I ran some analysis on the Folger Digital Texts edition of the Bard's plays. This delightful collection uses a custom XML schema to indicate when characters enter and exit the stage, which makes it possible to track the number of characters on stage over the course of a play:

tempest.png

This visualization of The Tempest, for instance, traces the number of characters on stage from the play's opening scene—in which the Shipmaster and his Boatswain are quickly joined by Alonso, Sebastian, Antonio, Ferdinand, and Gonzalo—through Prospero's staff-dashing monologue around the 15,000 word mark to the play's crowded conclusion. Here are the stagings for the other Shakespearean plays in the FDT canon, ordered by their date of first performance according to Alfred Harbage's Annals of English Drama:

chars_on_stage_all.png

These plots afford ample evidence to suggest that Shakespearean comedies tend to end with large scenes in which everybody's on stage. Unfortunately, many of the histories and tragedies also end with large gatherings of characters. It therefore seems that the number of characters on stage during a play's conclusion might not be an ideal feature with which to classify the genres of Shakespeare's plays.

With these results in hand, I decided to measure how often Shakespeare isolates a single character on stage within plays from each of the three canonical genres. Aggregating the total number of words spoken when only a single character is on stage, as well as the total number of words spoken when only two characters are on stage, and so forth, allows one to measure the degree to which each play distributes its attention between large and small gatherings of characters:

While these plots reveal some interesting features of the works, such as the fact that Two Gentlemen of Verona truly does revolve around dyadic pairs, they make it difficult to compare the amount of time tragedies and comedies feature only a single character on stage. To make this latter comparison, one can find the average amount of time a single character occupies the stage for each genre:

means_by_genre.png

Surprisingly, the chief difference between the comedies and tragedies has less to do with the way each handles isolated actors on stage than with the way each handles triads and quadrads. It seems tragedies have a greater tendency to revolve around sets of three characters, while comedies are more often organized around sets of four characters. That said, the similarities between the two genres are far more striking than their differences, and far less encouraging for one in search of distinguishing features.
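The tallies behind these comparisons can be sketched as follows. The event list here is a hypothetical simplification of my own; it does not reproduce the Folger Digital Texts schema, which would first have to be parsed into something like this form.

```python
from collections import Counter

def words_by_stage_size(events):
    """Tally words spoken while exactly k characters are on stage.

    `events` is an ordered list of tuples in an assumed format:
    ("enter", name), ("exit", name), or ("speech", name, word_count).
    """
    on_stage = set()
    tally = Counter()
    for event in events:
        kind = event[0]
        if kind == "enter":
            on_stage.add(event[1])
        elif kind == "exit":
            on_stage.discard(event[1])
        elif kind == "speech":
            tally[len(on_stage)] += event[2]
    return tally

def shares(tally):
    """Convert raw word counts into each group size's share of the play."""
    total = sum(tally.values())
    return {size: count / total for size, count in sorted(tally.items())}

# Toy scene: a soliloquy, a two-character exchange, then another soliloquy.
events = [
    ("enter", "Prospero"),
    ("speech", "Prospero", 120),
    ("enter", "Ariel"),
    ("speech", "Ariel", 40),
    ("speech", "Prospero", 40),
    ("exit", "Ariel"),
    ("speech", "Prospero", 50),
]
tally = words_by_stage_size(events)
print(shares(tally))
```

Averaging these per-play shares within each genre would give one value per group size and genre, which is the kind of comparison the plot above makes.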

Reflecting on these results, I wondered if tragedies might be better classified by the amount of time their conflicted characters spend addressing the audience. One way to begin measuring the latter, I thought, would be to count the number of words spoken by each character in each play:

Analyzing these figures, I was struck by what should have been a fairly obvious fact: Shakespeare's most memorable characters (Falstaff, Hal, Prospero, Rosalind, Hamlet...) are each given commanding positions within the plays they lead. Given the strong correlation between these memorable characters and the number of lines each speaks, it's tempting to ask whether we remember these characters most readily simply because Shakespeare allowed them to say the most, or whether Shakespeare allowed them to say the most because he sensed they were his most memorable characters.

Either way, the last trio of plots shows a fairly even distribution of commanding figures among the comedies, histories, and tragedies. But those plots also reveal that the histories include rather few words spoken by women, as well as the fact that the comedies tend to be shorter than the tragedies and histories:

female_presence_vs_length.png

By analyzing only the length of a play and the number of words women speak in that play, one can start to get reasonably good separation between the genres: comedies tend to be shorter and include more female dialogue, histories tend to be longer and include less female dialogue, and tragedies split provocatively between the upper right and lower left. Reviewing these figures, I can't shake the suspicion that a third dimension of data could unite these divided tragedies. But what would that dimension consist of? 

* * *

I would like to thank Mike Poston, co-curator of the Folger Digital Text editions used for this analysis, for discussing many of the finer points of the FDT collection with me. In case you want to replicate any of the analysis or assess the assumptions on which it's founded, the scripts are here.

Pseudo-Cryptography in Jonathan Swift's Tale of a Tub

“I do here humbly propose for an Experiment,” Jonathan Swift writes near the end of his Tale of a Tub, “that every Prince in Christendom will take seven of the deepest Scholars in his Dominions, and shut them up close for seven Years, in seven Chambers, with a Command to write seven ample Commentaries upon [A Tale of a Tub].” In order to promote so useful a Work, Swift informs readers that he has encrypted some hidden messages in the Tale: “I have couched a very profound Mystery in the Number of O's multiply'd by Seven, and divided by Nine.” Not wishing to leave the alchemically inclined out of his game, Swift adds: “Also, if a devout Brother of the Rosy Cross will pray fervently for sixty three Mornings, with a lively Faith, and then transpose certain Letters and Syllables according to Prescription, in the second and fifth Section; they will certainly reveal into a full Receipt of the Opus Magnum.” Swift then completes his triumvirate of puzzles with the following remark: “Lastly, Whoever will be at the Pains to calculate the whole Number of each Letter in this Treatise, and sum up the Difference exactly between the several Numbers, assigning the true natural Cause for every such Difference; the Discoveries in the Product, will plentifully reward his Labour” (Section X).

The probability that Swift actually altered his language so as to conceal occult knowledge in his already-overstuffed Tale may seem rather minimal to many readers. For those familiar with Swift's delight in word games and ciphers, though, the odds might look a bit brighter. In fact, readers who have discovered Paul Childs' essay “Cipher Against Ciphers: Jonathan Swift's Latino-Anglicus Satire of Medicine” in Cryptologia—which analyzes the ways Swift transformed his suffering under Ménière's disease into cleverly encrypted messages—might wonder whether the Dean has in fact buried some treasure in his Tale. With this question in mind, I decided to run some experiments to see whether I might be able to resolve some of Swift's long-overlooked riddles.

Riddle One: The Mystery of the Number of O's in Swift’s Tale. Some simple analysis reveals that there are 16,092 O’s in the 1710 edition of Swift’s Tale of a Tub. If we multiply this value by seven and divide the result by nine—the operations Swift suggests one must perform to uncover his profound Mystery—we get 12,516, a number whose significance I leave to others to determine. While this number might carry significance, the more fundamental question is whether Swift’s use of the letter O appears to be unusual or premeditated in any way. To answer this question, we can compare the relative frequency of the letter O in Swift’s Tale to the relative frequency of the letter in other documents from the eighteenth century:

This plot compares the relative frequency of the letter 'O' in Jonathan Swift's Tale of a Tub to the relative frequency of the same letter in a random sample of documents from the ECCO-TCP corpus.

If Swift's usage of the letter O were premeditated or unusual in any way, we should expect to see the relative frequency of the letter depart from the norm established by his contemporaries. As the plot above indicates, however, his use of the letter is perfectly in keeping with the trend of his times, which suggests Swift's first riddle is a jest.
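For anyone who wants to repeat the experiment on another text, a minimal sketch of the counting involved might look like this. The function names are my own, and the choice to count only alphabetic characters in the denominator is an assumption; a different denominator (all characters, say) would shift every frequency.

```python
from fractions import Fraction

def letter_count(text, letter):
    """Count occurrences of `letter`, ignoring case."""
    target = letter.lower()
    return sum(1 for ch in text.lower() if ch == target)

def relative_frequency(text, letter):
    """Occurrences of `letter` divided by the number of alphabetic characters."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return letter_count(text, letter) / len(letters)

def swift_operation(count):
    """Multiply by seven and divide by nine, as the riddle instructs."""
    return Fraction(count * 7, 9)

print(relative_frequency("Tale of a Tub", "o"))  # 1 'o' among 10 letters
print(swift_operation(16092))                    # 12516
```

Using `Fraction` keeps the riddle's arithmetic exact, which makes it easy to see that 16,092 happens to divide evenly.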

Riddle Two: “[If readers] transpose certain Letters and Syllables according to Prescription, in the second and fifth Section; they will certainly reveal into a full Receipt of the Opus Magnum”. To analyze this puzzle, we can again look at the distribution of letters in Swift’s Tale, this time investigating the degree to which the distribution of any letter in sections two and five looks out of keeping with the distribution of that letter in other sections. More generally, we can look to see if there are any unusual distributions of letters across the sections of the text, and if there are, we can begin considering appropriate methods of transposing those letters to get the syllables with which the Opus Magnum is communicated.

This plot indicates the relative frequency of each letter in each section of Swift's Tale. Each section is given a consistent color, so if any section contains an unusual proportion of any particular letter(s), we should expect a wider distribution of frequencies for that letter or those letters.

Each section of the Tale is given a consistent color in this plot, so if any section contains an unusual proportion of any particular letter(s), we should expect to see a wider distribution of frequencies for that letter or those letters. Much to the would-be alchemist’s chagrin, however, the plot above indicates that there are no letters in the Tale that have wildly aberrant distributions, which effectively closes the book on the second riddle.
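A rough numerical version of the same check is sketched below: for each letter, it measures how far apart the highest and lowest section-level frequencies fall. The section labels, the restriction to the 26 unaccented letters, and the use of a simple max-minus-min spread are all assumptions made for illustration.

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Relative frequency of each of the 26 letters a-z in `text`."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    return {
        ch: (counts[ch] / total if total else 0.0)
        for ch in string.ascii_lowercase
    }

def frequency_spread(sections):
    """For each letter, max minus min relative frequency across sections.

    `sections` maps a section label to that section's text.
    """
    per_section = [letter_frequencies(text) for text in sections.values()]
    return {
        ch: max(f[ch] for f in per_section) - min(f[ch] for f in per_section)
        for ch in string.ascii_lowercase
    }

# Toy input: two tiny "sections" with deliberately different letter mixes.
toy_sections = {"II": "aab", "V": "abb"}
spread = frequency_spread(toy_sections)
print(round(spread["a"], 3), spread["z"])
```

A letter whose spread is much larger than the rest would be the natural place to start looking for Swift's transposed syllables.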

Riddle Three: “Whoever will . . . calculate the whole Number of each Letter in this Treatise, and sum up the Difference exactly between the several Numbers, assigning the true natural Cause for every such Difference; the Discoveries in the Product, will plentifully reward his Labour.” We can easily calculate the number of times each letter occurs in Swift’s Tale:

tale_raw_letter_freqs.jpg

Using these frequencies, Swift suggests, one can find “the Discoveries in the Product,” which value “will plentifully reward his Labour.” He leaves it comically unclear what is meant by “the Product,” though: does he mean the product of the difference between each letter and each other letter, or the product between the difference of each letter and the “whole Number of each Letter in this Treatise”, or some other mad metric?

Regardless of the answer to this question, it seems clear that Swift means these riddles to be ludic and satirical, rather than genuine encryptions. My question for readers is: who or what is Swift parodying here? Are there other texts from the period that purport to contain hidden messages in their letter counts? I ask not only because I'm fascinated by early modern ciphers, but because I want to have a fuller understanding of the specific works with which Swift was working in these delightfully flippant passages.

* * *

The analysis above was conducted using the 1710 edition of the Tale digitized by Lehigh University. The visualizations were produced with Pyplot using these scripts.

Co-citation Networks in the EEBO-TCP Corpus

I recently had the good fortune of attending a conference on computational approaches to early modern literature hosted at the University of Newcastle. During the conference, I not only got to meet some outstanding scholars—including Doug Bruster, John Burrows, Hugh Craig, Mac Jackson, and Glenn Roe, to name only a few—but I also had a chance to present some of my recent work on algorithmic approaches to the study of literary influence. In case it might be of interest to others working in related fields, I thought I would share one of the approaches I discussed in what follows below.

Buried within the EEBO-TCP corpus, I learned some months ago, is a veritable trove of metadata. These metadata features indicate when authors include things like stage directions, tables of data, and alchemical symbols in their writing, all of which is great news for the computationally inclined. For those interested in influence, there are also metadata fields that indicate when an author is quoting or citing another text. By placing <q> tags around quotations and <bibl> tags around citations of authors or works, the Text Creation Partnership made it easy for researchers to begin looking for citational patterns in early modern literature:

Lines 48-52 of this sample from the EEBO-TCP corpus contain <q> and <bibl> tags to designate quotations and bibliographic citations respectively.

Using a simple script, one can easily extract all of the quotations and bibliographic citations from all files in the EEBO-TCP corpus. Because the EEBO-TCP corpus contains roughly one third of the titles from the period recorded in the definitive English Short Title Catalogue, this citational data can serve as a fairly representative archive of intertextual trends in early modern England:

estc_eebo-tcp_titles.jpg

That is, at least, the idea in theory. In practice, the data is quite messy, and in its native form, is all but algorithmically unapproachable. Take, for example, the following references contained within <bibl> brackets:

IUDGES V. XXIII.
Iudges, 5 23.
IVDGES 4. 21.
Iudg. 42.
[Judg. 21.]

Human investigators who look into the matter can easily discover that each of these references refers to the Book of Judges. To allow computers to recognize this fact, however, I had to spend several weeks sifting my way through the collection of <bibl> tags, carefully identifying the texts and authors to which those metadata fields referred. In the end, of the roughly 45,000 items that were tagged as references to books or authors in the EEBO-TCP corpus (parts I and II), I found roughly a third to be too cryptic to decipher:

W. ??.
Qu.
Testimony of a great Divine.

Setting these obscure references aside, I cleaned up the sources about which I was reasonably confident, such as the references to Judges above. After I had aggregated all of these, I was naturally curious to see which texts and authors were most cited in the corpus. Here are the top forty:

The forty most cited authors and works within the EEBO-TCP corpus.

Biblical citations predominate, with the greatest number of references going to the Book of Psalms. (It should be noted that in most translations of the Bible into English and Latin, the Book of Psalms has by far the greatest number of verses and words.) Matthew leads the Evangelists, followed by John, then Luke, then Mark. The highest frequency Greco-Roman writers include Virgil, Ovid, Horace, Martial, Juvenal, and Seneca, in that order. While this data might not revolutionize early modern scholarship, it might be helpful in the classroom. When teaching undergraduates about the works early modern audiences read, heard, and cited, for example, such visualizations can help to bring home the profound religiosity of the age.
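The extraction and tallying steps described above might be sketched as follows. This is not the cleaning procedure I actually followed, which involved weeks of manual review; the regular expressions, the `CANONICAL_PREFIXES` table, and the spelling rules are hypothetical stand-ins that handle only the simplest cases.

```python
import re
from collections import Counter

# Match <bibl>...</bibl> spans regardless of tag case or attributes.
BIBL = re.compile(r"<bibl\b[^>]*>(.*?)</bibl>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")

# Hypothetical, hand-built lookup: normalized prefix -> canonical source.
CANONICAL_PREFIXES = {"judg": "Judges", "psal": "Psalms"}

def extract_citations(markup):
    """Return the text of every <bibl> element, with nested tags stripped."""
    return [TAG.sub("", inner).strip() for inner in BIBL.findall(markup)]

def canonical_source(citation):
    """Map a raw citation to a canonical name, or None if unrecognized.

    Crude early modern spelling rules (an assumption): v -> u everywhere,
    and an initial i -> j, so that "IVDGES" and "Iudg." both become "judg...".
    """
    match = re.search(r"[A-Za-z]+", citation)
    if not match:
        return None
    token = match.group(0).lower().replace("v", "u")
    if token.startswith("i"):
        token = "j" + token[1:]
    for prefix, source in CANONICAL_PREFIXES.items():
        if token.startswith(prefix):
            return source
    return None

sample = (
    "<p>as in <bibl>IUDGES V. XXIII.</bibl> and <bibl>[Judg. 21.]</bibl>; "
    "see also <bibl>IVDGES 4. 21.</bibl> and <bibl>Qu.</bibl></p>"
)
sources = [canonical_source(c) for c in extract_citations(sample)]
counts = Counter(s for s in sources if s is not None)
print(counts.most_common())
```

Anything that returns `None` here would land in the pile of references too cryptic to decipher automatically, and would need a human reader.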

Drawing on the same data that underlies this visualization, one can also analyze networks of influence in early modern literature. Rather than conceive of influence as a series of vectors, each measuring the number of times a given work or author is cited in the TCP corpus, one can analyze the kinds of works and authors that are cited together. Such forms of analysis are often referred to as “co-citation networks,” or networks of references that tend to be cited together. For example, say our corpus contained references to only nine different sources:

simple_sample_network.png

In that case, we could visualize each of these sources as a “node” in a network graph and could create an “edge,” or connecting line, between any two nodes that were cited within a text in the corpus. The sample graph above would then illustrate the fact that references “A” and "B" are cited within one text, "G" and "Z" in another, and so on. Using this method on the EEBO-TCP corpus, I thought, might help to reveal latent structures embedded within early modern citation networks. With the curated EEBO-TCP citation data in hand, I therefore proceeded in the fashion described above, transforming each cited author or work into a node, and creating edges between any two nodes cited within a single work. Here is one visualization of the results:

This wild web is a visualization of co-citation networks in the EEBO-TCP corpus. Each cited author and text is a point (or "node") on the graph, and any two authors or texts cited within a single text are connected by a line (or "edge"). The plot also uses a modularity clustering algorithm to separate and color code nodes that commonly appear together: the blue cluster contains biblical works, the red cluster contains classical authorities, the yellow cluster contains Reformation-era martyrs and theologians, and the purple cluster contains literary figures.

In this graph, each node has been assigned a color. These colors are determined by an algorithm that identifies clusters of nodes that are commonly cited together, and then colors the different clusters accordingly. Analyzing the nodes that cluster together, one finds four particularly well-defined groups of references: the light blue cluster of nodes, which are biblical books (Leviticus, Corinthians); the red cluster of nodes, which are classical references (Macrobius, Lucretius); the yellow cluster of nodes, which are Reformation-era martyrs and theologians (Theodore Beza, Richard Turner); and the purple cluster of nodes, which are literary writers (Philip Sidney, William Shakespeare). The results of the procedure are intuitively legible: each group represents a fairly homogenous collection of authors and works, and the divisions that split the groups apart represent fairly significant generic differences between the various references contained in the data. If this is right, then such a map could perhaps serve as a useful resource in the classroom. When discussing the so-called "battle of ancients and moderns," for instance, graphs like the present one might help students visualize the different ways early modern citation practices established divisions between these groups.
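The edge-building step described above can be sketched with the standard library alone. The input layout (a mapping from each text to the sources it cites) and the names used are assumptions; the clustering and coloring are not shown, and would be left to a graph tool or library that implements modularity-based community detection.

```python
from collections import Counter
from itertools import combinations

def cocitation_edges(citations_by_text):
    """Count how many texts cite each unordered pair of sources together.

    `citations_by_text` maps a text identifier to the list of sources it
    cites. Repeated citations within one text are counted once.
    """
    edges = Counter()
    for sources in citations_by_text.values():
        for a, b in combinations(sorted(set(sources)), 2):
            edges[(a, b)] += 1
    return edges

# Toy corpus: three texts citing a handful of placeholder sources.
toy_corpus = {
    "text1": ["A", "B"],
    "text2": ["G", "Z"],
    "text3": ["A", "B", "G", "A"],
}
edges = cocitation_edges(toy_corpus)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```

Sorting each text's sources before pairing them means that ("A", "B") and ("B", "A") are always stored under the same key, so the edge weights are not split in two.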

Using a slightly different visualization, one can identify the points of connection between and among these ostensibly divided groups. This link will take you to an interactive site that contains all of the data included in the previous plot, including the modularity rankings that separated and color-coded the nodes. Unlike the previous plot, however, the visualization at the other end of that link allows users to click on a particular node and see all of the others to which that node is connected. Comparing the networks of each of these nodes, one can produce at a glance one measure of the degree to which the given node was cited with works and authors from the classical inheritance, for instance. The results sometimes lead to new questions. For example, the interactive plot demonstrates that when early modern authors cited Paul the Apostle in the context of classical writers like Apuleius, they cited Paul's given name; when they cited Paul in biblical contexts, by contrast, they cited his works, such as "Romans." Why might that be the case? Other points of interest I've come across in this visualization include the relative isolation of the alchemists (George Ripley, Geber, etc.) towards the bottom of the plot, and the predominance of biblical citations in works that reference Descartes. Both of these observations strike me as less intuitive than the separation of authors into disparate modularity rankings, for instance, and both seem worthy of further inquiry.

Batch Processing Python Scripts on Sun Grid Engine Queues

Suppose you have a collection of text files and would like to compare each of those files to each of the others. Perhaps you would like to know which characters, locations, or stage directions in each file occur in any of the others. Whatever the task, if your collection is small enough—on the order of a few paragraphs, say—you can of course compare the files manually, reading each of your paragraphs in turn, and comparing the given paragraph to each of the others. If your collection is a bit bigger—on the order of a few hundred novels, say—you might automate these comparisons on your computer. If your collection is really big, however, a single computer might not be powerful enough to finish the job during your lifetime.

Comparing each file in a four-text corpus to each of the others, after all, only involves six comparisons. Running the same analysis on a collection of 50,000 files (roughly the size of the Project Gutenberg collection in English, or the EEBO-TCP corpus), however, means running 1,249,975,000 comparisons. If each of those comparisons takes one minute to execute on your computer, it will take roughly 2,376 years to run this job on your machine. Thankfully, we can expedite this process tremendously by leveraging the power of distributed computing systems like Sun Grid Engine (SGE) queues. Pursuing the routine described above, for instance, we can use an SGE system to run each of our 1+ billion comparisons in a few minutes.
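The arithmetic above can be checked with a short Python sketch. The function name and the 365.25-day year are my own assumptions for illustration; the one-minute cost per comparison comes from the paragraph above.

```python
def pairwise_comparisons(p):
    """Number of unordered pairs among p files: p * (p - 1) / 2."""
    return p * (p - 1) // 2


# Assumption: a year of 365.25 days, expressed in minutes.
MINUTES_PER_YEAR = 60 * 24 * 365.25

small_corpus = pairwise_comparisons(4)
large_corpus = pairwise_comparisons(50000)

# At one minute per comparison, total minutes equals total comparisons.
years_on_one_machine = large_corpus / MINUTES_PER_YEAR

print(small_corpus)
print(large_corpus)
print(years_on_one_machine)
```

Because the count grows with the square of the corpus size, doubling the number of files roughly quadruples the work, which is why the job gets out of hand so quickly.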

To get started, we'll want to create an "iteration schedule" in which we identify all of the comparisons we wish to run. Here is a visual representation of an iteration schedule for a four-text corpus:

[Figure: iteration schedule map for a four-text corpus]

In the table above, each of our iterations-to-be-run is denoted by an "o." Each "o" sits in the cell that joins the row and the column that denote the two texts to be analyzed in the given iteration. Reading across our first row, for instance, we see that text one does not need to be compared to text one, but does need to be compared to texts two, three, and four. The second row denotes that we want to compare text two to texts three and four, and row three denotes that we want to compare text three to text four.

After determining all of the comparisons to be run, we will want to render that information in machine-readable form. More specifically, we want to generate a table that has three columns: iteration_number, first_text, and second_text. Here first_text and second_text are the file names of the two texts we wish to compare in the given iteration, and iteration_number is an integer whose value is 0 in the first row of the iteration schedule, 1 in the next row, 2 in the next, . . . and n - 1 in the last, where n equals the total number of comparisons we wish to make. [In general, the number of comparisons required is p(p-1)/2, where p equals the number of files in your corpus.] Here is a sample iteration table (produced by this script):

0 A00002.txt A00005.txt
1 A00002.txt A00007.txt
2 A00002.txt A00008.txt
3 A00002.txt A00011.txt
4 A00002.txt A00012.txt
5 A00002.txt A00013.txt
6 A00002.txt A00014.txt
7 A00002.txt A00015.txt
8 A00002.txt A00018.txt
9 A00002.txt A00019.txt
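
A table of this shape can be generated in a few lines of Python. What follows is a minimal sketch rather than the script linked above: the function names are hypothetical, and the sample file names are simply borrowed from the table.

```python
from itertools import combinations


def build_schedule(file_names):
    """Return rows of (iteration_number, first_text, second_text).

    Each unordered pair of files appears exactly once, and iteration
    numbers start at zero.
    """
    pairs = combinations(sorted(file_names), 2)
    return [(number, first, second)
            for number, (first, second) in enumerate(pairs)]


def write_schedule(rows, path):
    """Write one space-separated row per line, as in the sample table."""
    with open(path, "w") as out:
        for number, first, second in rows:
            out.write("{} {} {}\n".format(number, first, second))


# A four-text corpus yields six rows, numbered 0 through 5.
sample_files = ["A00002.txt", "A00005.txt", "A00007.txt", "A00008.txt"]
rows = build_schedule(sample_files)
for row in rows:
    print(*row)
```

Calling write_schedule(rows, "schedule.txt") would save the table to disk; the output file name there is also just a placeholder.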

If you want to batch process a different kind of routine on an SGE system, you can modify your iteration schedule appropriately. If you only want to calculate the type-token ratios of each of your files, for instance, you'll only need two columns: iteration_number and text_name. Once this iteration schedule is all set, we can turn to the script to be run during each of these iterations. Here's mine. If you want to run a different kind of analysis, just keep lines 1-21 and line 102 of that script, and use the variable “iteration_number” to guide which texts you will analyze in each iteration. Once your routine is all set, save it as “test.py” and upload it, along with your iteration schedule, the files you wish to compare, and these two higher-order scripts, to a single directory on your SGE server.
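
To give a sense of how an iteration number can guide a routine, here is a hedged sketch of the lookup step. It is not the script linked above: lookup_pair and shared_words are hypothetical names, and the toy comparison stands in for whatever analysis you actually want to run.

```python
def lookup_pair(schedule_lines, iteration_number):
    """Return (first_text, second_text) for the given iteration number.

    schedule_lines can be an open schedule file or any list of lines of
    the form "iteration_number first_text second_text".
    """
    for line in schedule_lines:
        fields = line.split()
        if len(fields) == 3 and int(fields[0]) == iteration_number:
            return fields[1], fields[2]
    raise ValueError("iteration {} not in schedule".format(iteration_number))


def shared_words(first_text, second_text):
    """Toy comparison: the lowercase words that occur in both strings."""
    return set(first_text.lower().split()) & set(second_text.lower().split())


# Hypothetical schedule rows, in the format of the sample table.
schedule = [
    "0 A00002.txt A00005.txt",
    "1 A00002.txt A00007.txt",
    "2 A00002.txt A00008.txt",
]
first_name, second_name = lookup_pair(schedule, 1)
print(first_name, second_name)
print(shared_words("Enter the King", "Exit the Queen"))
```

In a real run, the two file names returned by the lookup would be opened and read, and their contents passed to the comparison.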

Once all of these files are in the same directory, you are ready to submit your script for batch processing. To do so using the University of Notre Dame's Center for Research Computing system, you can simply type "python _run_me.py cmd_run.py your_netid" with no quotation marks.

After you submit this command, the higher-order Python scripts _run_me.py and cmd_run.py will create new copies of your test.py script, changing the input files for each iteration according to your iteration schedule. If all has gone well, and you refresh the directory after a few moments, you'll see a few (or more than a few, depending on the number of iterations you are running!) new files in your directory. More specifically, you'll have a collection of new job_ files that give you feedback on the result of each of your iterations. If errors cropped up during your analysis, those errors should be recorded in these job_ files. Provided that there were no exceptions, though, those files will be empty, and you will find in your directory whatever output you requested in your test.py script. Et voilà, now you can finish your analysis in a few minutes, rather than a few millennia!
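
With thousands of iterations, opening each job_ file by hand gets tedious. The sketch below lists only the non-empty ones, on the assumption (taken from the paragraph above) that an empty job_ file means the iteration ran without errors. The function name and the "job_*" pattern are my own guesses at a convenient form, so adjust them to match the file names your system actually produces.

```python
import glob
import os


def failed_jobs(directory="."):
    """Return the job_ files in `directory` that are not empty.

    Assumption: an empty job_ file signals a clean run, so any file with
    content is worth opening and reading.
    """
    pattern = os.path.join(directory, "job_*")
    return sorted(path for path in glob.glob(pattern)
                  if os.path.getsize(path) > 0)


# Report every job file that recorded some output.
for path in failed_jobs("."):
    print(path)
```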

* * *

I want to thank Scott Hampton and Dodi Heryadi of Notre Dame's High Performance Computing Group, who helped me think through the logistics of batch processing, Reid Johnson of Notre Dame's Computer Science Department, who sent me the higher-order SGE scripts on which this analysis is based, and Tim Peters, who wrote many of the Python modules on which my work depends!