Cross-Lingual Plagiarism Detection with Scikit-Learn

Oliver Goldsmith, one of the great poets, playwrights, and historians of science from the Enlightenment, was many things. He was "an idle, orchard-robbing schoolboy; a tuneful but intractable sizar of Trinity; a lounging, loitering, fair-haunting, flute-playing Irish ‘buckeen.’" He was also a brilliant plagiarist. Goldsmith frequently borrowed whole sentences and paragraphs from French philosophes such as Voltaire and Diderot, closely translating their works into his own voluminous books without offering so much as a word to indicate that the passages were taken from elsewhere. Over the last several months, I have worked with several others to study the ways Goldsmith adapted and freely translated these source texts into his own writing, in order to develop methods that can be used to discover cross-lingual text reuse. By outlining some of the methods that I have found useful within this field of research, the following post attempts to show how automated methods can advance our understanding of the history of authorship.

Sample Training Data

In order to identify the passages within Goldsmith's corpus that were taken from other writers, I decided to train a machine learning algorithm to differentiate between plagiarisms and non-plagiarisms. To distinguish between these classes of writing, John Dillon and I collected a large number of plagiarized and non-plagiarized passages within Goldsmith's writing, and provided annotations to identify whether the target passage had been plagiarized or not. Here are a few sample rows from the training data:

French Source | Goldsmith Text | Plagiarism
Bothwell eut toute l'insolence qui suit les grands crimes. Il assembla les principaux seigneurs, et leur fit signer un écrit, par lequel il était dit expressément que la reine ne se pouvait dispenser de l'éspouser, puisqu'il l'avait enlevée, et qu'il avait couché avec elle. | Bothwell was possessed of all the insolence which attends great crimes: he assembled the principal Lords of the state, and compelled them to sign an instrument, purporting, that they judget it the Queen's interest to marry Bothwell, as he had lain with her against her will. | 1
Histoire c'est le récit des faits donnés pour vrais; au contraire de la fable, qui est le récit des faits donnés pour faux. | In the early part of history a want of real facts hath induced many to spin out the little that was known with conjecture. | 0
La meilleure maniere de connoître l'usage qu'on doit faire de l' esprit, est de lire le petit nombre de bons ouvravrages de génie qu'on a dans les langues savantes & dans la nôtre. | The best method of knowing the true use to be made of wit is, by reading the small number of good works, both in the learned languages, and in our own. | 1
Comme il y a en Peinture différentes écoles, il y en a aussi en Sculpture, en Architecture, en Musique, & en général dans tous les beaux Arts. | A school in the polite arts, properly signifies, that succession of artists which has learned the principles of the art from some eminent master, either by hearing his lessons, or studying his works. | 0
Des étoiles qui tombent, des montagnes qui se fendent, des fleuves qui reculent, le Soleil & la Lune qui se dissolvent, des comparaisons fausses & gigantesques, la nature toûjours outrée, sont le caractere de ces écrivains, parce que dans ces pays où l'on n'a jamais parlé en public. | Falling stars, splitting mountains, rivers flowing to their sources, the sun and moon dissolving, false and unnatural comparisons, and nature everywhere exaggerated, form the character of these writers; and this arises from their never, in these countries, being permitted to speak in public. | 1

Given this training data, the goal was to identify some features that commonly appear in Goldsmith’s plagiarized passages but don’t commonly appear in his non-plagiarized passages. If we could derive a set of features that differentiate between these two classes, we would be ready to search through Goldsmith’s corpus and tease out only those passages that had been borrowed from elsewhere.

Feature Selection: Alzahrani Similarity

Because a plagiarized passage can be expected to have language that is similar but not necessarily identical to the language used within the plagiarized source text, I decided to test some fuzzy string similarity measures. One of the more promising leads on this front was adapted from the work of Salha M. Alzahrani et al. [2012], who have produced a number of great papers on plagiarism detection. The specific similarity measure adapted from Alzahrani calculates the similarity between two passages (call them Passage A and Passage B) in the following way:

def alzahrani_similarity( a_passage, b_passage ):

    # a_passage and b_passage are lists of cleaned words; find_synonyms( word )
    # is a helper (not shown) that returns the thesaurus synonyms of a word

    # Create a similarity counter and set its value to zero
    # (a float, so the final division is never integer division)
    similarity = 0.0

    # For each word in Passage A
    for a_word in a_passage:

        # If that word is in Passage B
        if a_word in b_passage:

            # Add one to the similarity counter
            similarity += 1

        # Otherwise,
        else:

            # For each word in Passage B
            for b_word in b_passage:

                # If the current word from Passage A is a synonym of the current word from Passage B,
                if a_word in find_synonyms( b_word ):

                    # Add one half to the similarity counter
                    similarity += .5

    # Finally, divide the similarity score by the number of words in the longer passage
    return similarity / max( len(a_passage), len(b_passage) )

To prepare the data for this algorithm, I used the Google Translate API to translate French texts into English, the Big Huge Labs Thesaurus API to collect synonyms for each word in Passage B, and the NLTK to clean the resulting texts (dropping stop words, removing punctuation, etc.). Once these resources were prepared, I used an implementation of the algorithm described above to calculate the "similarity" between the paired passages in the training data. As one can see, the similarity value returned by this algorithm discriminates reasonably well between plagiarized and non-plagiarized passages:


The y-axis here is discrete--each data point represents either a plagiarized pair of passages (such as those in the training data discussed above), or a non-plagiarized pair of passages. The x-axis is really the important axis. The further to the right a point falls on this axis, the greater the length-normalized similarity score for the passage pair. As one would expect, plagiarized passages have much higher similarity scores than non-plagiarized passages.
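As a rough illustration of the cleaning step mentioned above (dropping stop words and removing punctuation), the following Python 3 sketch uses only the standard library. The `clean_passage` function and the tiny `STOP_WORDS` set are hypothetical stand-ins for illustration; NLTK ships a much fuller stop word list.

```python
import string

# Hypothetical, deliberately tiny stop word list (an assumption for this
# sketch); NLTK's English stop word list is far more complete.
STOP_WORDS = {"the", "of", "and", "a", "an", "in", "to", "is", "was", "which", "all"}

def clean_passage(text):
    """Lowercase a passage, strip punctuation, and drop stop words."""
    # Build a translation table that deletes every ASCII punctuation character
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return [word for word in words if word not in STOP_WORDS]

tokens = clean_passage(
    "Bothwell was possessed of all the insolence which attends great crimes:"
)
print(tokens)
```

The resulting word lists are the kind of input a word-by-word similarity measure can consume directly.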

In order to investigate how sensitive this similarity method is to passage length, I iterated over all sub-windows of n words within the training data, and used the same similarity method to calculate the similarity of the sub-window within the text. When n is five, for instance, one would compare the first five words of Passage A to the first five from Passage B. After storing that value, one would compare words two through six from Passage A to words one through five of Passage B, then words three through seven from Passage A to words one through five of Passage B, proceeding in this way until all five-word windows had been compared. Once all of these five-word scores are calculated, only the maximum score is retained, and the rest are discarded. The following plot shows that as the number of words in the sub-window increases, the separation between plagiarized and non-plagiarized passages also increases:

Each facet within this plot represents a subwindow length, and each point within that facet represents the maximum observed similarity value for a single pair of passages. As the length of the subwindows increases, the distance between plagiarized and non-plagiarized passages also increases.
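The sub-window procedure can be sketched in a few lines of Python 3. This is a minimal sketch under stated assumptions: it compares every n-word window of Passage A with every n-word window of Passage B and keeps the maximum, and it uses a simplified exact-match `overlap_similarity` as a stand-in for the full similarity measure. All function names here are hypothetical.

```python
def overlap_similarity(a_words, b_words):
    """Simplified stand-in for the similarity measure: exact matches only."""
    if not a_words or not b_words:
        return 0.0
    matches = sum(1 for word in a_words if word in b_words)
    return matches / float(max(len(a_words), len(b_words)))

def windows(words, n):
    """Return every contiguous n-word window (the whole passage if it is shorter than n)."""
    if len(words) <= n:
        return [words]
    return [words[i:i + n] for i in range(len(words) - n + 1)]

def max_window_similarity(a_words, b_words, n, similarity=overlap_similarity):
    """Score every pair of n-word windows and retain only the maximum score."""
    return max(
        similarity(a_window, b_window)
        for a_window in windows(a_words, n)
        for b_window in windows(b_words, n)
    )

a = "falling stars splitting mountains rivers flowing".split()
b = "stars that fall and mountains that split rivers flowing back".split()
best = max_window_similarity(a, b, 3)
print(best)
```

Passing a different `similarity` function swaps in another word-level measure without changing the windowing logic.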

Feature Selection: Word2Vec Similarity

Although the method discussed above provides helpful separation between plagiarized and non-plagiarized passages, it reduces word pairs to one of three states: equivalent, synonymous, or irrelevant. Intuitively, this model feels limited, because one senses that words can have degrees of similarity. Consider the words small, tiny, and humble. The thesaurus discussed above identifies these terms as synonyms, and the algorithm described above treats them as interchangeable. This is slightly unsatisfying because the word small seems more similar to the word tiny than to the word humble.

To capture some of these finer gradations in meaning, I called on Word2Vec, a method that trains a shallow neural network with backpropagation in order to represent words as vectors in a high-dimensional space. Once a word has been mapped into this vector space, one can compare its vector to another word's vector and obtain a measure of the similarity of those words. The following snippet, for instance, uses cosine similarity to measure the degree to which tiny and humble are similar to the word small:

from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

# Load the Google pretrained word vectors
# (recent gensim releases expose this loader on KeyedVectors rather than Word2Vec)
model = KeyedVectors.load_word2vec_format('../google_pretrained_word_vectors/GoogleNews-vectors-negative300.bin.gz', binary=True)

# Obtain the vector representations of three words
v1 = model[ "small" ]
v2 = model[ "tiny" ]
v3 = model[ "humble" ]

# Measure the similarity of "tiny" and "humble" to the word "small"
# (cosine_similarity expects 2D arrays, so each vector is reshaped into a single row)
for v in [v2,v3]:
    print(cosine_similarity(v1.reshape(1, -1), v.reshape(1, -1)))

Running this script returns [[ 0.71879274]] and [[ 0.29307675]] respectively, which is to say Word2Vec can recognize that the word small is more similar to tiny than it is to humble. Because Word2Vec allows one to calculate these fine gradations of word similarity, it does a great job calculating the similarity of passages from the Goldsmith training data. The following plot shows the separation achieved by running a modified version of the "Alzahrani algorithm" described above, using this time Word2Vec to measure word similarity:


As one can see, the Word2Vec similarity measure achieves very promising separation between plagiarized and non-plagiarized passage pairs. By repeating the subwindow method described above, one can identify the critical value wherein separation between plagiarized and non-plagiarized passages is best achieved with a Word2Vec similarity metric:
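For readers who want to experiment, one way a vector-based variant of the passage similarity measure might look is sketched below in Python 3. This is an assumption about the general shape of such a measure, not the exact modification used for the plots: each word in Passage A is credited with its best cosine match in Passage B, and the total is divided by the length of the longer passage. The three-dimensional `TOY_VECTORS` are invented stand-ins for real 300-dimensional embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Invented toy vectors (an assumption for illustration only)
TOY_VECTORS = {
    "small": [0.9, 0.1, 0.0],
    "tiny": [0.8, 0.2, 0.1],
    "humble": [0.2, 0.9, 0.1],
    "house": [0.0, 0.1, 0.9],
    "cottage": [0.1, 0.2, 0.8],
}

def vector_passage_similarity(a_words, b_words, vectors):
    """Credit each word in A with its best match in B; normalize by the longer passage."""
    a_known = [word for word in a_words if word in vectors]
    b_known = [word for word in b_words if word in vectors]
    if not a_known or not b_known:
        return 0.0
    total = sum(
        max(cosine(vectors[a_word], vectors[b_word]) for b_word in b_known)
        for a_word in a_known
    )
    return total / max(len(a_words), len(b_words))

close = vector_passage_similarity(["small", "house"], ["tiny", "cottage"], TOY_VECTORS)
far = vector_passage_similarity(["small", "house"], ["humble"], TOY_VECTORS)
print(close > far)
```

Unlike the three-state thesaurus lookup, this scoring rewards near matches in proportion to how close the word vectors are.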


Feature Selection: Syntactic Similarity

Much like the semantic features discussed above, syntactic similarity can also serve as a clue of plagiarism. While a thoroughgoing pursuit of syntactic features might lead one deep into sophisticated analysis of dependency trees, it turns out one can get reasonable results by simply examining the distribution of part of speech tags within Goldsmith's plagiarisms and their source texts. Using the Stanford Part of Speech (POS) Tagger's French and English models, and a custom mapping I put together to link the French POS tags to the universal tagset, I transformed each of the paired passages in the training data into a POS sequence such as the following: 

[(u'Newton', u'NNP'), (u'appeared', u'VBD'),...,(u'amazing', u'JJ'), (u'.', u'.')]
[(u'Newton', u'NPP'), (u'parut', u'V'),...,(u'nouvelle:', u'CL'),(u'.', u'.')]

Using these sequences, I applied two similarity metrics to each of the paired passages in the training data. The first measure (on the x-axis below) is the cosine similarity between the two passages' POS tag vectors; the second measure (on the y-axis below) is the longest common POS substring between the two passages. As one would expect, plagiarized passages tend to have higher values in both categories:

The x-axis in this plot indicates the value obtained when one measures the cosine similarity between the part of speech vectors of the relevant paired passages; the y-axis indicates the value obtained when one measures the longest common part of speech substring for the relevant paired passages. Points are jittered to allow one to see the full data set, and stat_ellipse is plotted at the 90% level.
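Both syntactic measures are straightforward to compute once each passage has been reduced to a tag sequence. The Python 3 sketch below assumes the tags have already been mapped to a shared tagset; the function names and the two short tag sequences are hypothetical examples, not rows from the training data.

```python
import math
from collections import Counter

def tag_cosine(tags_a, tags_b):
    """Cosine similarity between the tag-frequency vectors of two tag sequences."""
    counts_a, counts_b = Counter(tags_a), Counter(tags_b)
    dot = sum(counts_a[tag] * counts_b[tag] for tag in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def longest_common_run(tags_a, tags_b):
    """Length of the longest contiguous run of tags shared by both sequences."""
    best = 0
    previous = [0] * (len(tags_b) + 1)
    for tag_a in tags_a:
        current = [0] * (len(tags_b) + 1)
        for j, tag_b in enumerate(tags_b, start=1):
            if tag_a == tag_b:
                # Extend the run that ended at the previous position in both sequences
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best

# Hypothetical tag sequences for an English passage and its French source
english = ["NOUN", "VERB", "DET", "ADJ", "NOUN", "."]
french = ["NOUN", "VERB", "DET", "NOUN", "ADJ", "."]

print(tag_cosine(english, french))
print(longest_common_run(english, french))
```

The cosine measure ignores word order entirely, while the common-run measure is sensitive to it, which is one reason the two can be informative together.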

Classifier Results

From the similarity metrics discussed above, I selected a bare-bones set of six features that could be fed to a plagiarism classifier: (1) the aggregate "Alzahrani similarity" score, (2) the maximum six-gram Alzahrani similarity score, (3) the aggregate Word2Vec similarity score, (4) the cosine similarity between the part of speech tag sets, (5) the longest common part of speech string, and (6) the longest contiguous common part of speech string. Those values were all represented in a matrix format with one pair of passages per row and one feature per column. Once this matrix was prepared, a small selection of classifiers from Python's scikit-learn library were chosen for comparison. Cross-classifier comparison is valuable, because different classifiers use very different logic to classify observations. The following plot from the scikit-learn documentation shows that, given a common set of input data (the first column below), the various classifiers in each row classify that data rather differently:


In order to avoid prejudging the best classifier for the current task, half a dozen classifiers were selected and evaluated with hold-one-out (leave-one-out) tests. That is to say, for each observation in the training data, all other rows were used to train the given classifier, and the trained classifier was asked to predict whether the left-out observation was a plagiarism or not. Because this is a two-class prediction task (each observation either is or is not an instance of plagiarism), random guessing yields a baseline success rate of roughly 50%. Any performance below this baseline would be worse than random guessing. Happily, all of the classifiers achieved success rates that greatly exceeded this baseline value:
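The hold-one-out logic itself is independent of any particular classifier, and can be sketched in plain Python 3. In the sketch below a simple nearest-neighbor rule stands in for the scikit-learn classifiers (scikit-learn's `LeaveOneOut` splitter in `sklearn.model_selection` provides the same train/test splits), and the feature rows are made-up values for illustration only, not the Goldsmith training data.

```python
def nearest_neighbor_predict(train_rows, train_labels, row):
    """Predict the label of the training row closest to `row` (squared Euclidean distance)."""
    distances = [sum((x - y) ** 2 for x, y in zip(r, row)) for r in train_rows]
    return train_labels[distances.index(min(distances))]

def leave_one_out(rows, labels):
    """Hold out each row in turn, train on the rest, and collect the predictions."""
    predictions = []
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        predictions.append(nearest_neighbor_predict(train_rows, train_labels, rows[i]))
    return predictions

def precision_recall(labels, predictions):
    """Precision and recall for the positive (plagiarism = 1) class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up rows: two similarity features per passage pair (illustration only)
rows = [[0.9, 0.8], [0.8, 0.7], [0.7, 0.9], [0.2, 0.1], [0.1, 0.3], [0.6, 0.2]]
labels = [1, 1, 1, 0, 0, 1]

predictions = leave_one_out(rows, labels)
precision, recall = precision_recall(labels, predictions)
print(predictions, precision, recall)
```

In this toy data the last row is a "fuzzy" positive that sits closer to the negatives, so it is missed when held out; that lowers recall while leaving precision untouched.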

Generally speaking, precision values were higher than recall, perhaps because some of the plagiarisms in the training data were fuzzier than others. Nevertheless, these accuracy values were high enough to warrant further exploration of Goldsmith's writing. Using the array of features discussed above and others to be discussed in a subsequent post, I tracked down a significant number of plagiarisms that were not part of the training data, including the following outright translations from the Encyclopédie:

French Source | Goldsmith Text
Il n'est point douteux que l' Empire , composé d'un grand nombre de membres très-puissans, ne dût être regardé comme un état très-respectable à toute l'Europe, si tous ceux qui le composent concouroient au bien général de leur pays. Mais cet état est sujet à de très-grands inconvéniens: l'autorité du chef n'est point assez grande pour se faire écouter: la crainte, la défiance, la jalousie, regnent continuellement entre les membres: personne ne veut céder en rien à son voisin: les affaires les plus sérieuses les plus importantes pour tout le corps sont quelquefois négligées pour des disputes particulieres, de préséance, d'étiquette, de droits imaginaires d'autres minuties. | It is not to be doubted but that the empire, composed as it is of several very powerful states, must be considered as a combination that deserves great respect from the other powers of Europe, provided that all the members which compose it would concur in the common good of their country. But the state is subject to very great inconveniences; the authority of the head is not great enough to command obedience; fear, distrust, and jealousy reign continually among the members; none are willing to yield in the least to their neighbours; the most serious and the most important affairs with respect to the community, are often neglected for private disputes, for precedencies, and all the imaginary privileges of misplaced ambition.
L' Eloquence , dit M. de Voltaire, est née avant les regles de la Rhétorique, comme les langues se sont formées avant la Grammaire. | Thus we see, eloquence is born with us before the rules of rhetoric, as languages have been formed before the rules of grammar.
L' empire Germanique, dans l'état où il est aujourd'hui, n'est qu'une portion des états qui étoient soûmis à Charlemagne. Ce prince possédoit la France par droit de succession; il avoit conquis par la force des armes tous les pays situés depuis le Danube jusqu'à la mer Baltique; il y réunit le royaume de Lombardie, la ville de Rome son territoire, ainsi que l'exarchat de Ravennes, qui étoient presque les seuls domaines qui restassent en Occident aux empereurs de Constantinople. | The empire of Germany, in its present state is only a part of those states that were once under the dominion of Charlemagne. This prince was possessed of France by right of succession: he had conquered by force of arms all the countries situated between the Baltic Sea and the Danube. He added to his empire the kingdom of Lombardy, the city of Rome and its territory, together with the exarchate of Ravenna, which were almost the only possessions that remained in the West to the emperors of Constantinople.
Il n'est point de genre de poésie qui n'ait son caractere particulier; cette diversité, que les anciens observerent si religieusement, est fondée sur la nature même des sujets imités par les poëtes. Plus leurs imitations sont vraies, mieux ils ont rendu les caracteres qu'ils avoient à exprimer....Ainsi l'églogue ne quitte pas ses chalumeaux pour entonner la trompette, l' élégie n'emprunte point les sublimes accords de la lyre. | There is no species of poetry that has not its particular character; and this diversity, which the ancients have so religiously observed, is founded in nature itself. The more just their imitations are found, the more perfectly are those characters distinguished. Thus the pastoral never quits his pipe, in order to sound the trumpet; nor does elegy venture to strike the lyre.


Samuel Johnson once observed that Oliver Goldsmith was “at no pains to fill his mind with knowledge. He transplanted it from one place to another; and it did not settle in his mind; so he could not tell what was his in his own books” (Life of Johnson). Reading the borrowed passages above, one can perhaps understand why Goldsmith struggled to recall what he had written in his books: much of his writing was not really his. As scholars continue to advance the art of detecting textual reuse, we will be better equipped to map these borrowed words at larger and more ambitious scales. For the present, writers like Goldsmith offer plenty of data on which to hone those methods.

* * *

This work has benefitted enormously from conversations with a number of others. Antonis Anastasopoulos, David Chiang, Michael Clark, John Dillon, and Kenton Murray of Notre Dame's Text Analysis Group, and Thom Bartold, Dan Hepp, and Jens Wessling of ProQuest offered key analytic insights, and Mark Olsen and Glenn Roe of the University of Chicago's ARTFL group shared essential data. I am grateful for the generous help each of you has provided. Code is available here.

Mapping the Early English Book Trade

Historians often call attention to the tremendous influence the 1710 Act of Anne had on the early English book trade. Commonly identified as the origin of modern copyright law, the Act laid the statutory foundations for fixed-term copyright in England, extended the ability to hold such copyrights to all individuals, and eventually toppled the monopoly that London booksellers had held on English printing since the incorporation of the Stationers' Company in 1557. Reading scholarship on this legal development over the last few months, I became curious to see how well the English Short Title Catalogue (ESTC) could substantiate some of the claims made in discussions of the Act. The ESTC seemed an ideal resource for this kind of analysis because, as Stephen Tabor has written, it represents “the fullest and most up-to-date bibliographical account of 'English' printing (in the broadest sense) for its first 328 years” (367). The database lists the authors, titles, imprint lines, publication dates, and many other metadata fields for each of the ~470,000 editions known to have been printed in England or its colonies between 1473 and 1800, and can therefore serve as a helpful resource with which to investigate the relationship between copyright law and literary history in the early modern period.

One of the debates surrounding the Act of Anne concerns the degree to which the statute altered the geography of the English book trade. Prior to the passage of the Act, legal historian Diane Zimmerman notes, the Stationers' Company dominated the book industry, and because the company's printers were primarily stationed in London, the book trade was also centered in the metropole. With the passage of the Statute of Anne, however, authors could sell or trade their copyrights to printers outside of London: “Now any printer [or] bookseller, wherever located within the country, could register a copyright with the Company” and “since purchasers of the copies could be located anywhere in the United Kingdom, the Stationers' Company did not regain its monopoly [on the book trade]” (7). Contra Zimmerman, William Patry argues that the Act of Anne failed to undermine London's control of the book trade: “After the Statute of Anne, as before,” he writes, “the only purchasers of authors' works were a small group of London booksellers” (84). To investigate what the ESTC had to say on this question, I compared the geographical distribution of English printers in the half centuries before and after the passage of the Act:

The usual cautions concerning false imprints and varying survival rates notwithstanding, the ESTC clearly demonstrates the decentralization of English printing in the wake of the Act of Anne. London of course remained the primary site of publication throughout the years covered by the ESTC—publishing two-thirds of all records from the period—though its annual share in the trade fell quite dramatically across the eighteenth century:

One can explain some of that decline by examining the growth of printing in major metropolitan areas outside of London, such as Edinburgh (responsible for 6.5% of total editions in the ESTC), Dublin (5.4%), and Boston (3.7%), which claimed the second, third, and fourth overall largest shares of the book trade according to the ESTC:


Among these figures, the explosion of printing in Edinburgh after 1750 is particularly interesting, and appears to be the result of further changes in the legal code. As John Feather notes, “The Copyright Act of 1710 (8 Anne c. 21) implied, but did not state, that it was illegal to import any English-language books into England and Wales if they had been previously printed there” (58). However, he continues, “the legislation in relation to Scotland seems to have lapsed in 1754-1755,” after which one observes tremendous growth in Scottish printing. Between 1750 and 1755, the five-year average of Edinburgh printing as a percent of all printing recorded in the ESTC is 7.5%. This figure only continues to grow after the lapse of Scottish printing regulations noted by Feather: from 1755-1760, Edinburgh printing climbs to 9.0% of all printing for the five-year period; from 1760-1765, the figure rises to 12.3%; and from 1765-1770, it reaches 14.4% of the ESTC totals for the five-year range. These values are significant, because they suggest the real surge in the Scottish reprinting industry did not take place in the aftermath of the Donaldson v. Becket decision, as is commonly supposed, but rather with the lapse of Scottish reprinting regulations in 1755.
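Period shares of this kind reduce to a simple grouped count. The Python 3 sketch below shows one way to compute them; it assumes half-open five-year bins (1750-1754, 1755-1759, and so on), which may differ from the binning used for the figures above, and the `(year, place)` records are invented for illustration rather than drawn from the ESTC.

```python
from collections import defaultdict

def share_by_period(records, city, period=5):
    """Return {period_start: share of records printed in `city`} for each period."""
    totals = defaultdict(int)
    city_counts = defaultdict(int)
    for year, place in records:
        # Assign each record to the period that contains its year
        start = year - (year % period)
        totals[start] += 1
        if place == city:
            city_counts[start] += 1
    return {start: city_counts[start] / float(totals[start]) for start in sorted(totals)}

# Invented (year, place) records; not ESTC data
records = [
    (1750, "London"), (1751, "London"), (1752, "Edinburgh"), (1754, "London"),
    (1755, "Edinburgh"), (1756, "London"), (1757, "Edinburgh"), (1759, "London"),
]

shares = share_by_period(records, "Edinburgh")
print(shares)
```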

Having plotted the changing geography of early English printing, I was curious to see whether the ESTC could shed new light on the debate concerning anonymous printing in the early modern period. Researchers like Jody Greene have argued that the Statute of Anne was in fact designed to help combat anonymous publishing insofar as it required authors to attach their names to works if they wished to obtain copyright protection for those works (4). Years ago, Michel Foucault pioneered a version of this thesis in his essay “What is an Author?”, where he argued that the Act of Anne and its elaboration in eighteenth-century case law spurred the transition from a literary culture founded on anonymity to one founded on named authorship. More recently, however, Robert Griffin disputed such claims, arguing that “the historical record shows . . . there is no necessary relation between copyright and the appearance of the name of the author on the title page” (879). To map the changing rate of anonymity over time, I aggregated the number of anonymous and pseudonymous publications as percents of annual totals within the ESTC:


The resulting plot shows great fluctuation in anonymous publications within the fifteenth and early sixteenth centuries, largely because of the tremendously small number of publications for those years. In 1492, for instance, the ESTC lists only 14 publications, all but two of which (S111337 and S120825) had identified authors, which results in an aggregate estimate of anonymity for the year of .143, or 14.3 percent. Despite the year-to-year fluctuations within early records, however, examining anonymity rates in the aggregate leads to legible patterns: one finds a marked decline in anonymous publication rates over the fifteenth and sixteenth centuries, a fairly steady rise across the seventeenth century, and a slow aggregate decline in the wake of the Act of Anne. This data supports some of the findings of Joad Raymond, who examined a small sample of records from the period and found that “anonymity . . . became increasingly frequent over the course of the seventeenth century” (168), while challenging the popular thesis that anonymity thrived with the lapse of the Licensing Act in 1695.
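The effect of small annual totals is easy to see in a short calculation. In the Python 3 sketch below, the 1492 counts (2 anonymous works out of 14) come from the paragraph above, while the counts for the other two years are invented placeholders; pooling years into decades is one assumed way of reading the rates "in the aggregate", not necessarily the smoothing used in the plot.

```python
def rate(anonymous, total):
    """Share of a year's publications that are anonymous or pseudonymous."""
    return anonymous / float(total) if total else 0.0

# year -> (anonymous_or_pseudonymous, total). Only the 1492 counts come from
# the text; 1490 and 1495 are invented placeholders for illustration.
counts = {1490: (1, 3), 1492: (2, 14), 1495: (0, 5)}

annual = {year: rate(a, t) for year, (a, t) in counts.items()}

def decade_rates(counts):
    """Pool counts by decade before dividing, which damps small-sample swings."""
    pooled = {}
    for year, (a, t) in counts.items():
        decade = year - year % 10
        pooled_a, pooled_t = pooled.get(decade, (0, 0))
        pooled[decade] = (pooled_a + a, pooled_t + t)
    return {decade: rate(a, t) for decade, (a, t) in pooled.items()}

print(round(annual[1492], 3))
print(decade_rates(counts))
```

With only a handful of records per year, a single anonymous work can move the annual rate by many percentage points, whereas the pooled rate changes much less.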

To plot the history of anonymity, though, is to beg a fundamental question: What exactly counts as an anonymous work? While the plot above treats works as anonymous only if their title pages are attributed to pseudonymous figures like “Isaac Bickerstaff” or to no author at all, there are other cases that one might well wish to classify as anonymous works. Consider the range of works attributed to “corporate” authors like the Royal Society of London or the English Parliament. Are works published by these entities anonymous publications? The way one answers this question will of course greatly affect the way one reads the history of anonymity. As a case in point, we could consult the following plot, which shows monarchical and parliamentary publishing during the seventeenth and eighteenth centuries:

The points here represent yearly values, while the regression lines map the smoothed trends over time. For example, the release of the ESTC to which I had access indicates that James I and Charles I published a combined total of 82 works in 1625 (both served as monarch during the year), the English and Scottish Parliaments published a combined total of 4 works during the year, and the year's total number of publications was 695, which means that monarchical publications account for 11.8 percent of the annual total while parliamentary publications account for only .6 percent of the same. As one can see, treating the high volume of parliamentary publications from the period as “anonymous works” would create a serious spike in anonymity rates during the English Civil Wars, and would steadily inflate anonymity rates across the eighteenth century. On the other hand, refusing to include works of corporate authorship among anonymous publications (as I have done in the plot of anonymity above) makes it more difficult to answer the question: What exactly counts as anonymity in the early modern world? Whether one includes or excludes corporate authorship from the domain of anonymity, this plot of parliamentary and monarchical publications intrigues me because it maps so neatly onto the political history of the English Civil Wars: monarchical publications trump parliamentary output until the critical years of the early 1640s, after which the Parliament assumes a predominance it holds throughout the Interregnum and only loses in the Restoration. Thereafter the monarchical voice triumphs until the Statute of Anne, after which point it rapidly loses ground. Examining this plot, I can't help but wonder: To what extent is monarchical publishing a function of the crown's political power, and to what extent is that political power a function of the monarch's proximity to print?
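The 1625 shares, and the way the definition of anonymity changes the resulting rate, can be checked with a few lines of Python 3. The counts of 82, 4, and 695 are taken from the paragraph above; `hypothetical_anonymous` is an invented placeholder (the text does not report the number of anonymous works for 1625), and `anonymity_rate` is a hypothetical helper.

```python
# Counts for 1625 as reported in the text
total_1625 = 695
monarchical = 82
parliamentary = 4

monarchical_share = 100.0 * monarchical / total_1625
parliamentary_share = 100.0 * parliamentary / total_1625

def anonymity_rate(anonymous, corporate, total, include_corporate=False):
    """Anonymity rate under a narrow or broad definition of 'anonymous'."""
    counted = anonymous + corporate if include_corporate else anonymous
    return counted / float(total)

# Placeholder value: the text does not give the anonymous count for 1625
hypothetical_anonymous = 100

narrow = anonymity_rate(hypothetical_anonymous, parliamentary, total_1625)
broad = anonymity_rate(hypothetical_anonymous, parliamentary, total_1625,
                       include_corporate=True)

print(round(monarchical_share, 1), round(parliamentary_share, 1))
print(narrow < broad)
```

In a year like 1625 the two definitions barely differ, but in years with heavy parliamentary output the broad definition would raise the rate substantially.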

* * *

I want to thank Benjamin Pauley, Brian Geiger, and Virginia Schilling, each of whom kindly helped me to acquire the ESTC data on which the analysis above was performed, as well as Elliott Visconsi, whose intriguing questions on copyright history continue to motivate my ongoing research.

Classifying Shakespearean Drama with Sparse Feature Sets

In her fantastic series of lectures on early modern England, Emma Smith identifies an interesting feature that differentiates the tragedies and comedies of Elizabethan drama: "Tragedies tend to have more streamlined plots, or less plot—you know, fewer things happening. Comedies tend to enjoy a multiplication of characters, disguises, and trickeries. I mean, you could partly think about the way [tragedies tend to move] towards the isolation of a single figure on the stage, getting rid of other people, moving towards a kind of solitude, whereas comedies tend to end with a big scene at the end where everybody's on stage" (6:02-6:37). 

The distinction Smith draws between tragedies and comedies is fairly intuitive: tragedies isolate the poor player that struts and frets his hour upon the stage and then is heard no more. Comedies, on the other hand, aggregate characters in order to facilitate comedic trickery and tidy marriage plots. While this discrepancy seemed promising, I couldn't help but wonder whether computational analysis would bear out the hypothesis. Inspired by the recent proliferation of computer-assisted genre classifications of Shakespeare's plays—many of which are founded upon high dimensional data sets like those generated by DocuScope—I was curious to know if paying attention to the number of characters on stage in Shakespearean drama could help provide additional feature sets with which to carry out this task.

To pursue the question, I ran some analysis on the Folger Digital Texts edition of the Bard's plays. This delightful collection uses a custom XML schema to indicate when characters enter and exit the stage, which makes it possible to track the number of characters on stage over the course of a play:


This visualization of The Tempest, for instance, traces the number of characters on stage from the play's opening scene (in which the Shipmaster and his Boatswain are quickly joined by Alonso, Sebastian, Antonio, Ferdinand, and Gonzalo) through Prospero's staff-dashing monologue around the 15,000 word mark to the play's crowded conclusion. Here are the stagings for the other Shakespearean plays in the FDT canon, ordered by their date of first performance according to Alfred Harbage's Annals of English Drama:


These plots afford ample evidence to suggest that Shakespearean comedies tend to end with large scenes in which everybody's on stage. Unfortunately, many of the histories and tragedies also end with large gatherings of characters. It therefore seems that the number of characters on stage during a play's conclusion might not be an ideal feature with which to classify the genres of Shakespeare's plays.
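The stage counts behind plots like these can be derived by replaying entrance and exit events in order. The Python 3 sketch below assumes the events have already been extracted from the XML into `(word_offset, action, character)` tuples; that tuple format, the `stage_counts` function, and the sample offsets are all hypothetical and do not reflect the actual Folger schema.

```python
def stage_counts(events):
    """Turn ordered (word_offset, action, character) events into (offset, n_on_stage) points."""
    on_stage = set()
    timeline = []
    for offset, action, character in events:
        if action == "enter":
            on_stage.add(character)
        elif action == "exit":
            # discard() tolerates an exit for a character not recorded as entering
            on_stage.discard(character)
        timeline.append((offset, len(on_stage)))
    return timeline

# Hypothetical events; the offsets are illustrative, not taken from the Folger texts
events = [
    (0, "enter", "Shipmaster"), (0, "enter", "Boatswain"),
    (30, "exit", "Shipmaster"),
    (45, "enter", "Alonso"), (45, "enter", "Sebastian"),
]

timeline = stage_counts(events)
print(timeline)
```

Plotting the second element of each point against the first gives a step-like trace of how crowded the stage is at each moment in the play.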

With these results in hand, I decided to measure how often Shakespeare isolates a single character on stage within plays from each of the three canonical genres. Aggregating the total number of words spoken when only a single character is on stage, as well as the total number of words spoken when only two characters are on stage, and so forth, allows one to measure the degree to which each play distributes its attention between large and small gatherings of characters:

While these plots reveal some interesting features of the works, such as the fact that Two Gentlemen of Verona truly does revolve around dyadic pairs, they make it difficult to compare the amount of time tragedies and comedies feature only a single character on stage. To make this latter comparison, one can find the average amount of time a single character occupies the stage for each genre:


Surprisingly, the chief difference between the comedies and tragedies has less to do with the way each handles isolated actors on stage than with the way each handles triads and quadrads. It seems tragedies have a greater tendency to revolve around sets of three characters, while comedies are more often organized around sets of four characters. That said, the similarities between the two genres are far more striking than their differences, and far less encouraging for one in search of distinguishing features.

Reflecting on these results, I wondered if tragedies might be better classified by the amount of time their conflicted characters spend addressing the audience. One way to begin measuring the latter, I thought, would be to count the number of words spoken by each character in each play:
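A minimal sketch of such a count, assuming the speeches have already been pulled out of the markup as (speaker, speech) pairs. The three pairs below are illustrative only.

```python
from collections import Counter

# Hypothetical (speaker, speech) pairs; a real run would read them from the edition's markup.
speeches = [
    ("Hamlet", "To be or not to be"),
    ("Ophelia", "Good my lord"),
    ("Hamlet", "Get thee to a nunnery"),
]

words_by_character = Counter()
for speaker, speech in speeches:
    # Whitespace tokenisation is a simplification; punctuation handling is ignored here.
    words_by_character[speaker] += len(speech.split())

# Characters ranked from most to fewest words spoken.
ranking = words_by_character.most_common()
```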

Analyzing these figures, I was struck by what should have been a fairly obvious fact: Shakespeare's most memorable characters (Falstaff, Hal, Prospero, Rosalind, Hamlet...) are each given commanding positions within the plays they lead. Given the strong correlation between these memorable characters and the number of lines each speaks, it's tempting to ask whether we remember these characters most readily simply because Shakespeare allowed them to say the most, or whether Shakespeare allowed them to say the most because he sensed they were his most memorable characters.

Either way, the last trio of plots shows a fairly even distribution of commanding figures among the comedies, histories, and tragedies. But those plots also reveal that the histories include rather few words spoken by women, as well as the fact that the comedies tend to be shorter than the tragedies and histories:


By analyzing only the length of a play and the number of words women speak in that play, one can start to get reasonably good separation between the genres: comedies tend to be shorter and include more female dialogue, histories tend to be longer and include less female dialogue, and tragedies split provocatively between the upper right and lower left. Reviewing these figures, I can't shake the suspicion that a third dimension of data could unite these divided tragedies. But what would that dimension consist of? 

* * *

I would like to thank Mike Poston, co-curator of the Folger Digital Text editions used for this analysis, for discussing many of the finer points of the FDT collection with me. In case you want to replicate any of the analysis or assess the assumptions on which it's founded, the scripts are here.

Pseudo-Cryptography in Jonathan Swift's Tale of a Tub

“I do here humbly propose for an Experiment,” Jonathan Swift writes near the end of his Tale of a Tub, “that every Prince in Christendom will take seven of the deepest Scholars in his Dominions, and shut them up close for seven Years, in seven Chambers, with a Command to write seven ample Commentaries upon [A Tale of a Tub].” In order to promote so useful a Work, Swift informs readers that he has encrypted some hidden messages in the Tale: “I have couched a very profound Mystery in the Number of O's multiply'd by Seven, and divided by Nine.” Not wishing to leave the alchemically inclined out of his game, Swift adds: “Also, if a devout Brother of the Rosy Cross will pray fervently for sixty three Mornings, with a lively Faith, and then transpose certain Letters and Syllables according to Prescription, in the second and fifth Section; they will certainly reveal into a full Receipt of the Opus Magnum.” Swift then completes his triumvirate of puzzles with the following remark: “Lastly, Whoever will be at the Pains to calculate the whole Number of each Letter in this Treatise, and sum up the Difference exactly between the several Numbers, assigning the true natural Cause for every such Difference; the Discoveries in the Product, will plentifully reward his Labour” (Section X).

The probability that Swift actually altered his language so as to conceal occult knowledge in his already-overstuffed Tale may seem rather minimal to many readers. For those familiar with Swift's delight in word games and ciphers, though, the odds might look a bit brighter. In fact, readers who have discovered Paul Childs' essay “Cipher Against Ciphers: Jonathan Swift's Latino-Anglicus Satire of Medicine” in Cryptologia—which analyzes the ways Swift transformed his suffering under Ménière's disease into cleverly encrypted messages—might wonder whether the Dean has in fact buried some treasure in his Tale. With this question in mind, I decided to run some experiments to see whether I might be able to resolve some of Swift's long-overlooked riddles.

Riddle One: The Mystery of the Number of O's in Swift’s Tale. Some simple analysis reveals that there are 16,092 O’s in the 1710 edition of Swift’s Tale of a Tub. If we multiply this value by seven and divide the result by nine—the operations Swift suggests one must perform to uncover his profound Mystery—we get 12,516, a number whose significance I leave to others to determine. While this number might carry significance, the more fundamental question is whether Swift’s use of the letter O appears to be unusual or premeditated in any way. To answer this question, we can compare the relative frequency of the letter O in Swift’s Tale to the relative frequency of the letter in other documents from the eighteenth century:

This plot compares the relative frequency of the letter 'O' in Jonathan Swift's Tale of a Tub to the relative frequency of the same letter in a random sample of documents from the ECCO-TCP corpus.

If Swift's usage of the letter O were premeditated or unusual in any way, we should expect to see the relative frequency of the letter depart from the norm established by his contemporaries. As the plot above indicates, however, his use of the letter is perfectly in keeping with the trend of his times, which suggests Swift's first riddle is a jest.
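For readers who want to check the figures, the count, Swift's arithmetic, and the relative frequency used in the comparison can be sketched as follows. The function names are my own, and `sample` is a short stand-in for the full text of the Tale.

```python
def letter_count(text, letter="o"):
    """Number of times a letter occurs, ignoring case."""
    return text.lower().count(letter)

def relative_frequency(text, letter="o"):
    """Share of all alphabetic characters taken up by the given letter."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    return letters.count(letter) / len(letters) if letters else 0.0

# Swift's prescribed arithmetic applied to the count reported above.
mystery = 16092 * 7 / 9

# A short stand-in for the full text of the Tale.
sample = "I do here humbly propose for an Experiment"
sample_count = letter_count(sample)
sample_frequency = relative_frequency(sample)
```

Dividing by the number of letters rather than the number of characters keeps spacing and punctuation habits from skewing comparisons between documents.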

Riddle Two: “[If readers] transpose certain Letters and Syllables according to Prescription, in the second and fifth Section; they will certainly reveal into a full Receipt of the Opus Magnum”. To analyze this puzzle, we can again look at the distribution of letters in Swift’s Tale, this time investigating the degree to which the distribution of any letter in sections two and five looks out of keeping with the distribution of that letter in other sections. More generally, we can look to see if there are any unusual distributions of letters across the sections of the text, and if there are, we can begin considering appropriate methods of transposing those letters to get the syllables with which the Opus Magnum is communicated.

This plot indicates the relative frequency of each letter in each section of Swift's Tale. Each section is given a consistent color, so if any section contains an unusual proportion of any particular letter(s), we should expect a wider distribution of frequencies for that letter or those letters.

Each section of the Tale is given a consistent color in this plot, so if any section contains an unusual proportion of any particular letter(s), we should expect to see a wider distribution of frequencies for that letter or those letters. Much to the would-be alchemist’s chagrin, however, the plot above indicates that there are no letters in the Tale that have wildly aberrant distributions, which effectively closes the book on the second riddle.
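The same check can be made numerically. The sketch below computes each letter's relative frequency within each section and then the gap between its highest and lowest value. The function names and `toy_sections` are assumptions for illustration, and every section is assumed to contain at least one letter.

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Relative frequency of each of the letters a-z in text."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}

def frequency_spread(sections):
    """For each letter, the gap between its highest and lowest frequency across sections."""
    per_section = [letter_frequencies(text) for text in sections]
    return {
        letter: max(f[letter] for f in per_section) - min(f[letter] for f in per_section)
        for letter in string.ascii_lowercase
    }

# Hypothetical stand-ins for the sections of the Tale.
toy_sections = ["a tale of a tub", "the battle of the books", "zzzz zzzz"]
spread = frequency_spread(toy_sections)
widest = max(spread, key=spread.get)
```

A letter with a conspicuously large spread would be the natural place to start hunting for a transposition cipher.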

Riddle Three: “Whoever will . . . calculate the whole Number of each Letter in this Treatise, and sum up the Difference exactly between the several Numbers, assigning the true natural Cause for every such Difference; the Discoveries in the Product, will plentifully reward his Labour.” We can easily calculate the number of times each letter occurs in Swift’s Tale:


Using these frequencies, Swift suggests, one can find “the Discoveries in the Product,” which value “will plentifully reward his Labour.” He leaves it comically unclear what is meant by “the Product,” though: does he mean the product of the difference between each letter and each other letter, or the product between the difference of each letter and the “whole Number of each Letter in this Treatise”, or some other mad metric?
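To make the first of those readings concrete, here is a sketch that multiplies the absolute difference between every pair of letter counts. This is one guess at what "the Product" might mean, not a claim about Swift's intent, and `toy` stands in for the text of the Tale.

```python
from collections import Counter
from itertools import combinations
from math import prod
import string

def letter_counts(text):
    """Count occurrences of each letter a-z, ignoring case."""
    return Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)

def product_of_differences(counts):
    """Product of |count_a - count_b| over every pair of letters that occurs."""
    return prod(abs(counts[a] - counts[b]) for a, b in combinations(sorted(counts), 2))

# A toy text with counts a=4, b=2, c=1.
toy = "aaaabbc"
counts = letter_counts(toy)
product = product_of_differences(counts)
```

Under this reading, any two letters that happen to occur equally often reduce the whole product to zero.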

Regardless of the answer to this question, it seems clear that Swift means these riddles to be ludic and satirical, rather than genuine encryptions. My question for readers is: who or what is Swift parodying here? Are there other texts from the period that purport to contain hidden messages in their letter counts? I ask not only because I'm fascinated by early modern ciphers, but because I want to have a fuller understanding of the specific works with which Swift was working in these delightfully flippant passages.

* * *

The analysis above was conducted using the 1710 edition of the Tale digitized by Lehigh University. The visualizations were produced with Pyplot using these scripts.

Co-citation Networks in the EEBO-TCP Corpus

I recently had the good fortune of attending a conference on computational approaches to early modern literature hosted at the University of Newcastle. During the conference, I not only got to meet some outstanding scholars (including Doug Bruster, John Burrows, Hugh Craig, Mac Jackson, and Glenn Roe, to name only a few) but also had a chance to present some of my recent work on algorithmic approaches to the study of literary influence. In case it might be of interest to others working in related fields, I thought I would share one of the approaches I discussed.

Buried within the EEBO-TCP corpus, I learned some months ago, is a veritable trove of metadata. These metadata features indicate when authors include things like stage directions, tables of data, and alchemical symbols in their writing, all of which is great news for the computationally inclined. For those interested in influence, there are also metadata fields that indicate when an author is quoting or citing another text. By placing <q> tags around quotations and <bibl> tags around citations of authors or works, the Text Creation Partnership made it easy for researchers to begin looking for citational patterns in early modern literature:

Lines 48-52 of this sample from the EEBO-TCP corpus contain <q> and <bibl> tags to designate quotations and bibliographic citations respectively.

Using a simple script, one can easily extract all of the quotations and bibliographic citations from all files in the EEBO-TCP corpus. Because the EEBO-TCP corpus contains roughly one third of the titles from the period recorded in the definitive English Short Title Catalogue, this citational data can serve as a fairly representative archive of intertextual trends in early modern England:
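A script of that kind might look something like the sketch below, which uses Python's standard XML parser on a toy fragment. The fragment is illustrative only: real TCP files are far larger, and depending on the release they may declare XML namespaces, in which case the tag names passed to `extract` would need the namespace prefix.

```python
import xml.etree.ElementTree as ET

# A toy fragment; real TCP files are far larger and may declare XML namespaces.
SAMPLE = """<text>
  <p>As it is written, <q>Curse ye Meroz</q> <bibl>Iudges, 5 23.</bibl></p>
  <p>And again: <q>Blessed above women shall Jael be</q> <bibl>Iudg. 5. 24.</bibl></p>
</text>"""

def extract(xml_text, tag):
    """Return the whitespace-normalised text of every element with the given tag."""
    root = ET.fromstring(xml_text)
    return [" ".join("".join(el.itertext()).split()) for el in root.iter(tag)]

quotations = extract(SAMPLE, "q")
citations = extract(SAMPLE, "bibl")
```

Using `itertext()` rather than `.text` keeps the words of any elements nested inside a quotation or citation.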


That is, at least, the idea in theory. In practice, the data is quite messy and, in its native form, all but algorithmically unapproachable. Take, for example, the following references contained within <bibl> tags:

Iudges, 5 23.
IVDGES 4. 21.
Iudg. 42.
[Judg. 21.]

Human investigators who look into the matter can easily discover that each of these references refers to the Book of Judges. To allow computers to recognize this fact, however, I had to spend several weeks sifting my way through the collection of <bibl> tags, carefully identifying the texts and authors to which those metadata fields referred. In the end, of the roughly 45,000 items that were tagged as references to books or authors in the EEBO-TCP corpus (parts I and II), I found roughly a third to be too cryptic to decipher:

W. ??.
ラ Testimony of a great Divine.
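For the tractable cases, at least part of this clean-up can be automated. The sketch below is not the procedure I followed by hand. It is one possible normaliser, built on a hypothetical lookup table (`CANONICAL`) and a single spelling rule for the interchangeable i/j and u/v of early modern print.

```python
import re

# Hypothetical lookup table; a real one would cover every book and author of interest.
CANONICAL = {"judges": "Judges", "judg": "Judges", "psalms": "Psalms", "psal": "Psalms"}

def normalize(reference):
    """Map a raw <bibl> string to a canonical title, or None if it is not recognised."""
    # Drop digits, punctuation, and spacing, keeping only the letters.
    letters = re.sub(r"[^a-z]", "", reference.lower())
    # Early modern print uses i/j and u/v interchangeably; this sketch only
    # regularises the first two letters, which is enough for "Iudges" and "IVDGES".
    letters = re.sub(r"^[ij][uv]", "ju", letters)
    return CANONICAL.get(letters)

raw = ["Iudges, 5 23.", "IVDGES 4. 21.", "Iudg. 42.", "[Judg. 21.]", "W. ??."]
cleaned = [normalize(r) for r in raw]
```

References that come back as `None` are the ones that would still need a human reader.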

Setting these obscure references aside, I cleaned up the sources about which I was reasonably confident, such as the references to Judges above. After I had aggregated all of these, I was naturally curious to see which texts and authors were most cited in the corpus. Here are the top forty:

The forty most cited authors and works within the EEBO-TCP corpus.

Biblical citations predominate, with the greatest number of references going to the Book of Psalms. (It should be noted that in most translations of the Bible into English and Latin, the Book of Psalms has by far the greatest number of verses and words.) Matthew leads the Evangelists, followed by John, then Luke, then Mark. The most frequently cited Greco-Roman writers include Virgil, Ovid, Horace, Martial, Juvenal, and Seneca, in that order. While this data might not revolutionize early modern scholarship, it might be helpful in the classroom. When teaching undergraduates about the works early modern audiences read, heard, and cited, for example, such visualizations can help to bring home the profound religiosity of the age.

Drawing on the same data that underlies this visualization, one can also analyze networks of influence in early modern literature. Rather than conceive of influence as a series of vectors, each measuring the number of times a given work or author is cited in the TCP corpus, one can analyze the kinds of works and authors that are cited together. Such forms of analysis are often referred to as “co-citation networks,” or networks of references that tend to be cited together. For example, say our corpus contained references to only nine different sources:


In that case, we could visualize each of these sources as a “node” in a network graph and could create an “edge,” or connecting line, between any two nodes that were cited within a text in the corpus. The sample graph above would then illustrate the fact that references “A” and "B" are cited within one text, "G" and "Z" in another, and so on. Using this method on the EEBO-TCP corpus, I thought, might help to reveal latent structures embedded within early modern citation networks. With the curated EEBO-TCP citation data in hand, I therefore proceeded in the fashion described above, transforming each cited author or work into a node, and creating edges between any two nodes cited within a single work. Here is one visualization of the results:
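The edge-building step can be sketched in a few lines. The input format here is an assumption: a dictionary (`citations_by_text`, with made-up keys) mapping each text to the set of sources it cites.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: the set of sources cited in each text.
citations_by_text = {
    "text_1": {"A", "B"},
    "text_2": {"G", "Z"},
    "text_3": {"A", "B", "C"},
}

edge_weights = Counter()
for sources in citations_by_text.values():
    # Every pair of sources cited in the same text gets an edge; sorting the
    # sources means ("A", "B") and ("B", "A") are counted as the same edge.
    for pair in combinations(sorted(sources), 2):
        edge_weights[pair] += 1

nodes = sorted({source for pair in edge_weights for source in pair})
```

Keeping a count for each edge, rather than a bare set of edges, preserves how often two sources are cited together, which can later serve as an edge weight.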

This wild web is a visualization of co-citation networks in the EEBO-TCP corpus. Each cited author and text is a point (or "node") on the graph, and any two authors or texts cited within a single text are connected by a line (or "edge"). The plot also uses a modularity clustering algorithm to separate and color code nodes that commonly appear together: the blue cluster contains biblical works, the red cluster contains classical authorities, the yellow cluster contains Reformation-era martyrs and theologians, and the purple cluster contains literary figures.

In this graph, each node has been assigned a color. These colors are determined by an algorithm that identifies clusters of nodes that are commonly cited together, and then colors the different clusters accordingly. Analyzing the nodes that cluster together, one finds four particularly well-defined groups of references: the light blue cluster of nodes, which are biblical books (Leviticus, Corinthians); the red cluster of nodes, which are classical references (Macrobius, Lucretius); the yellow cluster of nodes, which are Reformation-era martyrs and theologians (Theodore Beza, Richard Turner); and the purple cluster of nodes, which are literary writers (Philip Sidney, William Shakespeare). The results of the procedure are intuitively legible: each group represents a fairly homogeneous collection of authors and works, and the divisions that split the groups apart represent fairly significant generic differences between the various references contained in the data. If this is right, then such a map could perhaps serve as a useful resource in the classroom. When discussing the so-called "battle of ancients and moderns," for instance, graphs like the present one might help students visualize the different ways early modern citation practices established divisions between these groups.
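Modularity clustering algorithms search for the partition of a graph that maximises a score usually written Q. The sketch below does not perform that search. It only computes Q for a partition supplied by hand, on a toy graph whose edges and labels are hypothetical, to show what the score rewards.

```python
def modularity(edges, community_of):
    """Modularity Q of a given partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = {}  # number of edges with both ends inside each community
    degree = {}    # summed degree of the nodes in each community
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree[ca] = degree.get(ca, 0) + 1
        degree[cb] = degree.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    # Q = sum over communities of (share of internal edges) - (share of degree) squared.
    return sum(internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy graph: two triangles joined by a single edge, with hypothetical labels.
edges = [("Leviticus", "Corinthians"), ("Corinthians", "Psalms"), ("Leviticus", "Psalms"),
         ("Macrobius", "Lucretius"), ("Lucretius", "Virgil"), ("Macrobius", "Virgil"),
         ("Psalms", "Virgil")]
community_of = {"Leviticus": "biblical", "Corinthians": "biblical", "Psalms": "biblical",
                "Macrobius": "classical", "Lucretius": "classical", "Virgil": "classical"}
q = modularity(edges, community_of)
```

Q is higher when more edges fall inside communities than chance would predict, and it drops to zero when every node is placed in a single community.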

Using a slightly different visualization, one can identify the points of connection between and among these ostensibly divided groups. This link will take you to an interactive site that contains all of the data included in the previous plot, including the modularity rankings that separated and color-coded the nodes. Unlike the previous plot, however, the visualization at the other end of that link allows users to click on a particular node and see all of the others to which that node is connected. Comparing the networks of each of these nodes, one can produce at a glance one measure of the degree to which the given node was cited with works and authors from the classical inheritance, for instance. The results sometimes lead to new questions. For example, the interactive plot demonstrates that when early modern authors cited Paul the Apostle in the context of classical writers like Apuleius, they cited Paul's given name; when they cited Paul in biblical contexts, by contrast, they cited his works, such as "Romans." Why might that be the case? Other points of interest I've come across in this visualization include the relative isolation of the alchemists (George Ripley, Geber, etc.) towards the bottom of the plot, and the predominance of biblical citations in works that reference Descartes. Both of these observations strike me as less intuitive than the separation of authors into disparate modularity rankings, for instance, and both seem worthy of further inquiry.
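The measure described above (how much of a node's neighbourhood falls within a given cluster) can be sketched as follows. The edges and cluster labels are invented for illustration and are not taken from the EEBO-TCP data.

```python
from collections import Counter, defaultdict

# Hypothetical co-citation edges and cluster labels, for illustration only.
edges = [("Paul", "Apuleius"), ("Paul", "Seneca"), ("Paul", "Matthew"),
         ("Romans", "Psalms"), ("Romans", "Matthew")]
cluster_of = {"Paul": "classical", "Apuleius": "classical", "Seneca": "classical",
              "Romans": "biblical", "Psalms": "biblical", "Matthew": "biblical"}

# Build an adjacency list so each node can be looked up, as when clicking a node.
neighbours = defaultdict(set)
for a, b in edges:
    neighbours[a].add(b)
    neighbours[b].add(a)

def cluster_profile(node):
    """Share of a node's neighbours that fall in each cluster."""
    tally = Counter(cluster_of[n] for n in neighbours[node])
    total = sum(tally.values())
    return {cluster: count / total for cluster, count in tally.items()}

paul_profile = cluster_profile("Paul")
romans_profile = cluster_profile("Romans")
```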