Plagiary Poets

Last week I launched Plagiary Poets, an interactive app that visualizes text reuse within early poetry:


I was inspired to build the app after spending a month studying D3.js with Bob Holt, Mike Pennisi, and Yannick Assogba, three brilliant developers who work for Bocoup. The site was a lot of fun to build, and I look forward to launching related projects in the months to come. If you have any thoughts about this one, feel free to drop me a line below!

Visualizing Shakespearean Characters

Some time ago, I was intrigued to discover that Shakespeare’s Histories have a noticeable lack of female characters [link]. Since then, I’ve been curious to further explore the nuances of Shakespearean characters, paying particular attention to the gender dynamics of the Bard’s plays. This post is a quick sketch of some of the insights to which that curiosity has led.

To get a closer look at Shakespeare’s characters, I ran some analysis on the Folger Shakespeare Library’s gold-standard set of Shakespearean texts [link], all of which are encoded in fantastic XML markup that captures a number of character-level attributes, including gender. Using that markup, I extracted data for each character in Shakespeare’s plays, and then scoured through those features in search of patterns. All of the characters with an identified gender in this dataset are plotted below (mouseover for character name and source play):

Looking at this plot, we can see that the most prominent characters in Shakespearean drama are almost all well-known, titular males. There is also a noticeable inverse relationship between a character’s prominence and the point in the play at which that character is introduced. Looking more closely at the plot, I’ve further noticed that Shakespeare was curiously consistent in his treatment of characters who appear in multiple plays. In both 1 Henry IV and 2 Henry IV, for instance, Falstaff is given ~6,000 words and is introduced only a few hundred words into the work. Looking at the long tail, by contrast, one finds that among the outspoken characters introduced after the ~15,000 word mark—including Westmoreland and Bedford in 2 Henry IV, and Cade, Clifford, and Iden from 2 Henry VI—nearly all hail from Histories.

While the plot above gives one a bird’s-eye view of Shakespeare’s characters, it doesn’t make it particularly easy to differentiate male and female character dynamics. As a step in this direction, the plot below visualizes character entrances by gender for each of Shakespeare's plays:

Examining the distribution along the x-axis, we can see that male characters consistently enter the stage before female characters. An exception to this general rule may be found in the Comedies, as plays like Taming of the Shrew, All’s Well that Ends Well, and A Midsummer Night’s Dream begin with female characters on stage. Looking at the distribution along the y-axis, we can also see that for most plays, male characters continue to be introduced on stage long after the last female characters have been introduced.

Given the plots above, some might conclude that Shakespeare privileged male characters over female characters, as he introduced the former earlier and tended to give them more lines. There is evidence in the plays, however, that points in the opposite direction. Looking at Shakespeare’s minor characters, we see that the smallest and least significant roles in each play were almost universally assigned to males:

Here we see that even important male characters such as Fleance in Macbeth and Cornelius in Hamlet are given very few lines indeed, and the smallest female roles are consistently given more lines than the smallest male roles.

In sum, the plots above show that a number of heretofore undisclosed patterns emerge when we analyze Shakespeare’s characters in the aggregate. However, the plots above don’t show the connections between characters. One way to investigate these interconnections is through a co-occurrence matrix, in which each cell represents the degree to which two characters appear on stage concurrently:

In this visualization, “Frequency” represents the number of times a character appears on stage, “Gender” is indicated by the markup within the Folger Shakespeare Digital Collection XML (red = female, blue = male, green = unspecified), and “Cluster” reflects the subgroup of characters with whom a given character regularly appears, as determined by a fast greedy modularity ranking algorithm. Interacting with this plot allows one to uncover a number of insights. In the first place, we can see that the Histories consistently feature more “clusters” of characters than do Comedies or Tragedies. That is to say, while Comedies tend to be wildly interconnected affairs, Histories tend to include many small, isolated groups of characters that interact rather little with each other.  Looking at the gender dynamics of these groups, we can also see that in Comedies such as Merry Wives of Windsor and Histories such as Richard III and Henry V, female characters tend to appear on stage together, almost creating a coherent collective over the course of the play.
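To make the idea of a co-occurrence matrix concrete, here is a minimal sketch of how such counts could be tallied. The scene groupings below are hypothetical and are not drawn from the Folger XML; the sketch simply assumes one already has, for each stretch of stage time, the set of characters present.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stage groupings (one set of characters per stretch of stage time)
scenes = [
    {"Hamlet", "Horatio", "Marcellus"},
    {"Hamlet", "Gertrude"},
    {"Hamlet", "Horatio"},
]

# Count how often each unordered pair of characters shares the stage
cooccurrence = Counter()
for on_stage in scenes:
    for pair in combinations(sorted(on_stage), 2):
        cooccurrence[pair] += 1

for pair, count in sorted(cooccurrence.items()):
    print(pair, count)
```

Sorting each group before pairing means every pair is stored under a single key, so each count corresponds to one cell of the matrix.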

Finally, a number of female characters—such as Queen Margaret in 2 Henry VI and Adriana in Comedy of Errors—appear on stage more frequently than any other character in their respective plays, despite the fact that they say fewer words than their respective plays' most outspoken characters. That is to say, their visual presence on stage is disproportionate to their verbal presence on stage. This raises a number of questions: To what extent were female characters meant to fulfill the role of a spectacle in Shakespearean drama? It’s difficult to imagine that the male players who acted as females projected authentic feminine voices. Did the limitations of imitative speech help limit the number of lines given to these prominent female characters? These and other questions remain to be explored in future work.

* * *

The Python, JavaScript, HTML and CSS used to generate the data and visualizations above are available here [link]. If you find them useful, feel free to drop me a line!

Clustering Semantic Vectors with Python

Google's Word2Vec and Stanford's GloVe have recently offered two fantastic open source software packages capable of transposing words into a high-dimensional vector space. In both cases, a vector's position within the high-dimensional space gives a good indication of the word's semantic class (among other things), and in both cases these vector positions can be used in a variety of applications. In the post below, I'll discuss one approach you can take to clustering the vectors into coherent semantic groupings.

Both Word2Vec and GloVe can create vector spaces given a large training corpus, but both distribute pretrained vectors as well. To get started with ~1GB of pretrained vectors from GloVe, one need only download the glove.6B.300d.txt.gz archive and decompress it:

gunzip glove.6B.300d.txt.gz

If you unzip and then glance at glove.6B.300d.txt, you'll see that it's organized as follows:

the 0.04656 0.21318 -0.0074364 [...] 0.053913
, -0.25539 -0.25723 0.13169 [...] 0.35499
. -0.12559 0.01363 0.10306 [...] 0.13684
of -0.076947 -0.021211 0.21271 [...] -0.046533
to -0.25756 -0.057132 -0.6719 [...] -0.070621
sandberger 0.429191 -0.296897 0.15011 [...] -0.0590532

Each new line contains a token followed by 300 signed floats, and the tokens appear to be ordered from most to least common. Given this ready format, it's fairly straightforward to get straight to clustering!

There are a variety of methods for clustering vectors, including density-based clustering, hierarchical clustering, and centroid clustering. One of the most intuitive and most commonly used centroid-based methods is K-Means. Given a collection of points in a space, K-Means uses a Hunger Games style random lottery to pick a few lucky points (colored green below), then assigns each of the non-lucky points to the lucky point to which it's closest. Using these preliminary groupings, the next step is to find the "centroid" (or geometric center) of each group, using the same technique one would use to find the center of a square. These centroids become the new lucky points, and each non-lucky point is again assigned to the lucky point to which it's closest. This process continues until the centroids settle down and stop moving, after which the clustering is complete. Here's a nice visual description of K-Means [source]:
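The loop described above can also be sketched in a few lines of plain Python. This is only an illustrative toy and not Scikit-Learn's implementation: the function names, the toy points, and the choice to leave an empty group's centroid where it was are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / float(len(group)) for coords in zip(*group))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # the initial "lucky points"
    for _ in range(max_iter):
        # Assign each point to the centroid to which it's closest
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Recompute each centroid; an empty group keeps its previous centroid
        new_centroids = []
        for j in range(k):
            group = [p for p, label in zip(points, labels) if label == j]
            new_centroids.append(centroid(group) if group else centroids[j])
        if new_centroids == centroids:  # the centroids have stopped moving
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs of toy 2D points
toy = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = kmeans(toy, k=2)
print(labels)
```

Which blob receives label 0 depends on the random starting points, so only the grouping itself is meaningful.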


To cluster the GloVe vectors in a similar fashion, one can use the sklearn package in Python, along with a few other packages:

from __future__ import division
from sklearn.cluster import KMeans 
from numbers import Number
from pandas import DataFrame
import sys, codecs, numpy

It will also be helpful to build a class to mimic the behavior of autovivification in Perl, which is essentially the process of creating new default hash values given a new key. In Python, this behavior is available through collections.defaultdict(), but a defaultdict whose default factory is a lambda or other non-importable callable can't be pickled, so the following class is handy. Given an input key it hasn't seen, the class will create an empty list as the corresponding hash value:

class autovivify_list(dict):
        '''Pickleable class to replicate the functionality of collections.defaultdict'''
        def __missing__(self, key):
                value = self[key] = []
                return value

        def __add__(self, x):
                '''Override addition for numeric types when self is empty'''
                if not self and isinstance(x, Number):
                        return x
                raise ValueError

        def __sub__(self, x):
                '''Also provide subtraction method'''
                if not self and isinstance(x, Number):
                        return -1 * x
                raise ValueError

We also want a method to read in a vector file (e.g. glove.6B.300d.txt) and store each word and the position of that word within the vector space. Because reading in and analyzing some of the larger GloVe files can take a long time, to get going quickly one can limit the number of lines to read from the input file by specifying a global value (n_words), which is defined later on:

def build_word_vector_matrix(vector_file, n_words):
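        '''Read a GloVe array from vector_file and return its vectors and labels as arrays'''
        numpy_arrays = []
        labels_array = []
        with codecs.open(vector_file, 'r', 'utf-8') as f:
                for c, r in enumerate(f):
                        sr = r.split()
                        # The first field is the token; the remaining fields are its vector
                        labels_array.append(sr[0])
                        numpy_arrays.append( numpy.array([float(i) for i in sr[1:]]) )

                        # Stop once n_words lines have been read
                        if c + 1 == n_words:
                                return numpy.array( numpy_arrays ), labels_array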

        return numpy.array( numpy_arrays ), labels_array
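To see the parsing step in isolation, here is a small sketch that reads the same "token followed by floats" layout from an in-memory string rather than from a GloVe file. The three-dimensional sample vectors and the parse_vectors name are made up for illustration, and plain lists stand in for numpy arrays.

```python
import io

def parse_vectors(lines, n_words):
    """Return (vectors, labels) for at most n_words lines of GloVe-style text."""
    vectors, labels = [], []
    for line in lines:
        if len(labels) == n_words:
            break
        fields = line.split()
        labels.append(fields[0])                        # the token
        vectors.append([float(x) for x in fields[1:]])  # its coordinates
    return vectors, labels

# Made-up three-dimensional vectors in the same layout as the GloVe file
sample = io.StringIO(u"the 0.1 0.2 0.3\nof -0.4 0.5 0.6\nto 0.7 -0.8 0.9\n")
vectors, labels = parse_vectors(sample, n_words=2)
print(labels)
print(vectors)
```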

Scikit-Learn's implementation of K-Means returns an object (cluster_labels in these snippets) that indicates the cluster to which each input vector belongs. That object doesn't tell one which word belongs in each cluster, however, so the following method takes care of this. Because all of the words being analyzed are stored in labels_array and the cluster to which each word belongs is stored in cluster_labels, the following method can easily map those two sequences together:

def find_word_clusters(labels_array, cluster_labels):
        '''Read the labels array and clusters label and return the set of words in each cluster'''
        cluster_to_words = autovivify_list()
        for c, i in enumerate(cluster_labels):
                cluster_to_words[ i ].append( labels_array[c] )
        return cluster_to_words
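As a quick illustration of the mapping, the sketch below pairs a handful of words with made-up cluster labels. It uses collections.defaultdict(list) in place of the autovivify_list class purely to keep the example short; the words and labels are hypothetical and are not real K-Means output.

```python
from collections import defaultdict

words = ["chicago", "boston", "home", "house", "houston"]
cluster_labels = [0, 0, 1, 1, 0]  # hypothetical cluster assignments

# Group the words by the cluster to which each was assigned
cluster_to_words = defaultdict(list)
for word, cluster in zip(words, cluster_labels):
    cluster_to_words[cluster].append(word)

for cluster in sorted(cluster_to_words):
    print(cluster, cluster_to_words[cluster])
```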

Finally, we can call the methods above, perform K-Means clustering, and print the contents of each cluster with the following block:

if __name__ == "__main__":
        input_vector_file = sys.argv[1] # The Glove file to analyze (e.g. glove.6B.300d.txt)
        n_words           = int(sys.argv[2]) # The number of lines to read from the input file
        reduction_factor  = float(sys.argv[3]) # The desired amount of dimension reduction 
        clusters_to_make  = int( n_words * reduction_factor ) # The number of clusters to make
        df, labels_array  = build_word_vector_matrix(input_vector_file, n_words)
        kmeans_model      = KMeans(init='k-means++', n_clusters=clusters_to_make, n_init=10)
        kmeans_model.fit(df) # Fit the model so that labels_ and inertia_ are populated

        cluster_labels    = kmeans_model.labels_
        cluster_inertia   = kmeans_model.inertia_
        cluster_to_words  = find_word_clusters(labels_array, cluster_labels)

        for c in cluster_to_words:
                print(cluster_to_words[c])
                print("\n")

The full script is available here. To run it, one needs to specify the vector file to be read in, the number of words one wishes to sample from that file (one can of course read them all, but doing so can take some time), and the "reduction factor", which determines the number of clusters to be made. If one specifies a reduction factor of .1, for instance, the routine will produce n*.1 clusters, where n is the number of words sampled from the file. The following command reads in the first 10,000 words, and produces 1,000 clusters:

python glove.6B.300d.txt 10000 .1

The output of this command is the series of clusters produced by the K-Means clustering:

[u'Chicago', u'Boston', u'Houston', u'Atlanta', u'Dallas', u'Denver', u'Philadelphia', u'Baltimore', u'Cleveland', u'Pittsburgh', u'Buffalo', u'Cincinnati', u'Louisville', u'Milwaukee', u'Memphis', u'Indianapolis', u'Auburn', u'Dame']

[u'Product', u'Products', u'Shipping', u'Brand', u'Customer', u'Items', u'Retail', u'Manufacturer', u'Supply', u'Cart', u'SKU', u'Hardware', u'OEM', u'Warranty', u'Brands']

[u'home', u'house', u'homes', u'houses', u'housing', u'offices', u'household', u'acres', u'residence']


[u'Night', u'Disney', u'Magic', u'Dream', u'Ultimate', u'Fantasy', u'Theme', u'Adventure', u'Cruise', u'Potter', u'Angels', u'Adventures', u'Dreams', u'Wonder', u'Romance', u'Mystery', u'Quest', u'Sonic', u'Nights']

I'm currently using these word clusters for fuzzy plagiarism detection, but they can serve a wide variety of purposes. If you find them helpful for a project you're working on, feel free to drop me a note below!

Cross-Lingual Plagiarism Detection with Scikit-Learn

Oliver Goldsmith, one of the great poets, playwrights, and historians of science from the Enlightenment, was many things. He was "an idle, orchard-robbing schoolboy; a tuneful but intractable sizar of Trinity; a lounging, loitering, fair-haunting, flute-playing Irish ‘buckeen.’" He was also a brilliant plagiarist. Goldsmith frequently borrowed whole sentences and paragraphs from French philosophes such as Voltaire and Diderot, closely translating their works into his own voluminous books without offering so much as a word that the passages were taken from elsewhere. Over the last several months, I have worked with several others to study the ways Goldsmith adapted and freely translated these source texts into his own writing in order to develop methods that can be used to discover crosslingual text reuse. By outlining below some of the methods that I have found useful within this field of research, the following post attempts to show how automated methods can be used to further advance our understanding of the history of authorship.

Sample Training Data

In order to identify the passages within Goldsmith's corpus that were taken from other writers, I decided to train a machine learning algorithm to differentiate between plagiarisms and non-plagiarisms. To distinguish between these classes of writing, John Dillon and I collected a large number of plagiarized and non-plagiarized passages within Goldsmith's writing, and provided annotations to identify whether the target passage had been plagiarized or not. Here are a few sample rows from the training data:

French Source | Goldsmith Text | Plagiarism
Bothwell eut toute l'insolence qui suit les grands crimes. Il assembla les principaux seigneurs, et leur fit signer un écrit, par lequel il était dit expressément que la reine ne se pouvait dispenser de l'éspouser, puisqu'il l'avait enlevée, et qu'il avait couché avec elle. | Bothwell was possessed of all the insolence which attends great crimes: he assembled the principal Lords of the state, and compelled them to sign an instrument, purporting, that they judged it the Queen's interest to marry Bothwell, as he had lain with her against her will. | 1
Histoire c'est le récit des faits donnés pour vrais; au contraire de la fable, qui est le récit des faits donnés pour faux. | In the early part of history a want of real facts hath induced many to spin out the little that was known with conjecture. | 0
La meilleure maniere de connoître l'usage qu'on doit faire de l' esprit, est de lire le petit nombre de bons ouvravrages de génie qu'on a dans les langues savantes & dans la nôtre. | The best method of knowing the true use to be made of wit is, by reading the small number of good works, both in the learned languages, and in our own. | 1
Comme il y a en Peinture différentes écoles, il y en a aussi en Sculpture, en Architecture, en Musique, & en général dans tous les beaux Arts. | A school in the polite arts, properly signifies, that succession of artists which has learned the principles of the art from some eminent master, either by hearing his lessons, or studying his works. | 0
Des étoiles qui tombent, des montagnes qui se fendent, des fleuves qui reculent, le Soleil & la Lune qui se dissolvent, des comparaisons fausses & gigantesques, la nature toûjours outrée, sont le caractere de ces écrivains, parce que dans ces pays où l'on n'a jamais parlé en public. | Falling stars, splitting mountains, rivers flowing to their sources, the sun and moon dissolving, false and unnatural comparisons, and nature everywhere exaggerated, form the character of these writers; and this arises from their never, in these countries, being permitted to speak in public. | 1

Given this training data, the goal was to identify some features that commonly appear in Goldsmith’s plagiarized passages but don’t commonly appear in his non-plagiarized passages. If we could derive a set of features that differentiate between these two classes, we would be ready to search through Goldsmith’s corpus and tease out only those passages that had been borrowed from elsewhere.

Feature Selection: Alzahrani Similarity

Because a plagiarized passage can be expected to have language that is similar but not necessarily identical to the language used within the plagiarized source text, I decided to test some fuzzy string similarity measures. One of the more promising leads on this front was adapted from the work of Salha M. Alzahrani et al. [2012], who have produced a number of great papers on plagiarism detection. The specific similarity measure adapted from Alzahrani calculates the similarity between two passages (call them Passage A and Passage B) in the following way:

def alzahrani_similarity( a_passage, b_passage ):

    # Create a similarity counter and set its value to zero
    similarity = 0

    # For each word in Passage A
    for a_word in a_passage:

        # If that word is in Passage B
        if a_word in b_passage:

            # Add one to the similarity counter
            similarity += 1

        # Otherwise,
        else:

            # For each word in Passage B
            for b_word in b_passage:    

                # If the current word from Passage A is a synonym of the current word from Passage B,
                if a_word in find_synonyms( b_word ):

                    # Add one half to the similarity counter
                    similarity += .5

    # Finally, divide the similarity score by the number of words in the longer passage
    # (float() avoids integer division under Python 2)
    return float(similarity) / max( len(a_passage), len(b_passage) )
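The routine above relies on a find_synonyms helper that is not shown. To make the scoring easy to try out, here is a self-contained sketch in which a tiny hand-written dictionary stands in for the thesaurus. The TOY_THESAURUS entries and the sample passages are invented for illustration and have nothing to do with the Big Huge Labs data.

```python
# Hypothetical stand-in for a thesaurus lookup
TOY_THESAURUS = {
    "great": {"large", "big"},
    "lords": {"nobles", "peers"},
}

def find_synonyms(word):
    """Return the set of synonyms recorded for word (empty if unknown)."""
    return TOY_THESAURUS.get(word, set())

def alzahrani_similarity(a_passage, b_passage):
    similarity = 0.0
    for a_word in a_passage:
        if a_word in b_passage:
            similarity += 1      # exact match
        else:
            for b_word in b_passage:
                if a_word in find_synonyms(b_word):
                    similarity += 0.5  # synonym match
    # Normalize by the length of the longer passage
    return similarity / max(len(a_passage), len(b_passage))

a_passage = ["big", "crimes", "nobles"]
b_passage = ["great", "crimes", "lords", "state"]
score = alzahrani_similarity(a_passage, b_passage)
print(score)
```

Here "crimes" matches exactly, while "big" and "nobles" each match one synonym, giving (0.5 + 1 + 0.5) / 4.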

To prepare the data for this algorithm, I used the Google Translate API to translate French texts into English, the Big Huge Labs Thesaurus API to collect synonyms for each word in Passage B, and the NLTK to clean the resulting texts (dropping stop words, removing punctuation, etc.). Once these resources were prepared, I used an implementation of the algorithm described above to calculate the "similarity" between the paired passages in the training data. As one can see, the similarity value returned by this algorithm discriminates reasonably well between plagiarized and non-plagiarized passages:


The y-axis here is discrete: each data point represents either a plagiarized pair of passages (such as those in the training data discussed above), or a non-plagiarized pair of passages. The x-axis is really the important axis. The further to the right a point falls on this axis, the greater the length-normalized similarity score for the passage pair. As one would expect, plagiarized passages have much higher similarity scores than non-plagiarized passages.

In order to investigate how sensitive this similarity method is to passage length, I iterated over all sub-windows of n words within the training data, and used the same similarity method to calculate the similarity of the sub-window within the text. When n is five, for instance, one would compare the first five words of Passage A to the first five from Passage B. After storing that value, one would compare words two through six from Passage A to words one through five of Passage B, then words three through seven from Passage A to words one through five of Passage B, proceeding in this way until all five-word windows had been compared. Once all of these five-word scores are calculated, only the maximum score is retained, and the rest are discarded. The following plot shows that as the number of words in the sub-window increases, the separation between plagiarized and non-plagiarized passages also increases:

Each facet within this plot represents a subwindow length, and each point within that facet represents the maximum observed similarity value for a single pair of passages. As the length of the subwindows increases, the distance between plagiarized and non-plagiarized passages also increases.
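The sub-window procedure can be sketched as follows. The scoring function here is a simple word-overlap placeholder rather than the similarity measure used for the plot, and the decision to treat a passage shorter than n words as a single window is an assumption made for the example.

```python
def overlap_similarity(a_words, b_words):
    """Placeholder score: share of words in a_words that also occur in b_words."""
    shared = sum(1.0 for word in a_words if word in b_words)
    return shared / max(len(a_words), len(b_words))

def windows(words, n):
    """All contiguous n-word windows; a short passage yields itself once."""
    if len(words) <= n:
        return [words]
    return [words[i:i + n] for i in range(len(words) - n + 1)]

def max_window_similarity(a_words, b_words, n, score=overlap_similarity):
    """Compare every window of A with every window of B and keep the best score."""
    return max(score(a_win, b_win)
               for a_win in windows(a_words, n)
               for b_win in windows(b_words, n))

a_words = "the best method of knowing the true use of wit".split()
b_words = "a good method of knowing the use of wit".split()
best = max_window_similarity(a_words, b_words, n=3)
print(best)
```

Because both passages contain the window "method of knowing", the maximum three-word score for this pair is 1.0.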


Feature Selection: Word2Vec Similarity

Although the method discussed above provides helpful separation between plagiarized and non-plagiarized passages, it reduces word pairs to one of three states: equivalent, synonymous, and irrelevant. Intuitively, this model feels limited, because one senses that words can have degrees of similarity. Consider the words small, tiny, and humble. The thesaurus discussed above identifies these terms as synonyms, and the algorithm described above essentially treats the words as interchangeable synonyms. This is slightly unsatisfying because the word small seems more similar to the word tiny than the word humble.

To capture some of these finer gradations in meaning, I called on Word2Vec, a method that uses backpropagation to represent words in high-dimensional vector spaces. Once a word has been transposed into this vector space, one can compare a word's vector to another word's vector and obtain a measure of the similarity of those words. The following snippet, for instance, uses cosine similarity to measure the degree to which tiny and humble are similar to the word small:

from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

# Load the Google pretrained word vectors
# (older gensim releases exposed this loader as Word2Vec.load_word2vec_format)
model = KeyedVectors.load_word2vec_format('../google_pretrained_word_vectors/GoogleNews-vectors-negative300.bin.gz', binary=True)

# Obtain the vector representations of three words
v1 = model[ "small" ]
v2 = model[ "tiny" ]
v3 = model[ "humble" ]

# Measure the similarity of "tiny" and "humble" to the word "small"
for v in [v2,v3]:
    # cosine_similarity expects 2D input, so reshape each vector into a single row
    print(cosine_similarity(v1.reshape(1, -1), v.reshape(1, -1)))

Running this script returns [[ 0.71879274]] and [[ 0.29307675]] respectively, which is to say Word2Vec can recognize that the word small is more similar to tiny than it is to humble. Because Word2Vec allows one to calculate these fine gradations of word similarity, it does a great job calculating the similarity of passages from the Goldsmith training data. The following plot shows the separation achieved by running a modified version of the "Alzahrani algorithm" described above, using this time Word2Vec to measure word similarity:


As one can see, the Word2Vec similarity measure achieves very promising separation between plagiarized and non-plagiarized passage pairs. By repeating the subwindow method described above, one can identify the critical value wherein separation between plagiarized and non-plagiarized passages is best achieved with a Word2Vec similarity metric:


Feature Selection: Syntactic Similarity

Much like the semantic features discussed above, syntactic similarity can also serve as a clue of plagiarism. While a thoroughgoing pursuit of syntactic features might lead one deep into sophisticated analysis of dependency trees, it turns out one can get reasonable results by simply examining the distribution of part of speech tags within Goldsmith's plagiarisms and their source texts. Using the Stanford Part of Speech (POS) Tagger's French and English models, and a custom mapping I put together to link the French POS tags to the universal tagset, I transformed each of the paired passages in the training data into a POS sequence such as the following: 

[(u'Newton', u'NNP'), (u'appeared', u'VBD'),...,(u'amazing', u'JJ'), (u'.', u'.')]
[(u'Newton', u'NPP'), (u'parut', u'V'),...,(u'nouvelle:', u'CL'),(u'.', u'.')]

Using these sequences, two similarity metrics were used to measure the similarity between each of the paired passages in the training data. The first measure (on the x-axis below) simply measured the cosine similarity between the two POS sequences; the second measure (on the y-axis below) calculated the longest common POS substring between the two passages. As one would expect, plagiarized passages tend to have higher values in both categories:

The x-axis in this plot indicates the value obtained when one measures the cosine similarity between the part of speech vectors of the relevant paired passages; the y-axis indicates the value obtained when one measures the longest common part of speech substring for the relevant paired passages. Points are jittered to allow one to see the full data set, and stat_ellipse is plotted at the 90% level.
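Both measures are easy to sketch with the standard library. The version below assumes each passage has already been reduced to a list of universal POS tags, and it represents a passage as a vector of tag counts before taking the cosine. That vectorization is an assumption, since the exact representation used for the plot is not spelled out above, and the tag sequences are invented.

```python
import math
from collections import Counter

def tag_cosine_similarity(a_tags, b_tags):
    """Cosine similarity between the tag-count vectors of two tag sequences."""
    a_counts, b_counts = Counter(a_tags), Counter(b_tags)
    dot = sum(a_counts[tag] * b_counts[tag] for tag in a_counts)
    a_norm = math.sqrt(sum(v * v for v in a_counts.values()))
    b_norm = math.sqrt(sum(v * v for v in b_counts.values()))
    return dot / (a_norm * b_norm) if a_norm and b_norm else 0.0

def longest_common_substring(a_tags, b_tags):
    """Length of the longest run of tags that occurs contiguously in both sequences."""
    best = 0
    previous = [0] * (len(b_tags) + 1)
    for a_tag in a_tags:
        current = [0] * (len(b_tags) + 1)
        for j, b_tag in enumerate(b_tags, start=1):
            if a_tag == b_tag:
                # Extend the run that ended at the previous tag of each sequence
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best

# Invented tag sequences for an English passage and its French source
english_tags = ["NOUN", "VERB", "DET", "ADJ", "NOUN", "."]
french_tags = ["NOUN", "VERB", "DET", "NOUN", "ADJ", "."]
cosine = tag_cosine_similarity(english_tags, french_tags)
longest = longest_common_substring(english_tags, french_tags)
print(cosine, longest)
```

The two toy sequences contain the same tags in a slightly different order, so the count-based cosine is at its maximum while the longest shared run is only the opening three tags.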

Classifier Results

From the similarity metrics discussed above, I selected a bare-bones set of six features that could be fed to a plagiarism classifier: (1) the aggregate "Alzahrani similarity" score, (2) the maximum six-gram Alzahrani similarity score, (3) the aggregate Word2Vec similarity score, (4) the cosine similarity between the part of speech tag sets, (5) the longest common part of speech string, and (6) the longest contiguous common part of speech string. Those values were all represented in a matrix format with one pair of passages per row and one feature per column. Once this matrix was prepared, a small selection of classifiers hosted within Python's Scikit Learn library were chosen for comparison. Cross-classifier comparison is valuable, because different classifiers use very different logic to classify observations. The following plot from the Scikit Learn documentation shows that using a common set of input data (the first column below), the various classifiers in the given row classify that data rather differently:


In order to avoid prejudging the best classifier for the current task, half a dozen classifiers were selected and evaluated with leave-one-out (hold-one-out) tests. That is to say, for each observation in the training data, all other rows were used to train the given classifier, and the trained classifier was asked to predict whether the left-out observation was a plagiarism or not. Because this is a two-class prediction task (each observation either is or is not an instance of plagiarism), the baseline success rate for uniform random guessing is 50%. Any performance below this baseline would be worse than random guessing. Happily, all of the classifiers achieved success rates that greatly exceeded this baseline value:
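The hold-one-out procedure described above can be sketched without any particular library. In the toy version below, a simple nearest-class-mean rule stands in for the Scikit Learn classifiers, and the feature rows are invented, so the resulting accuracy says nothing about the real data. Scikit Learn offers the same evaluation scheme through sklearn.model_selection.LeaveOneOut.

```python
def nearest_mean_predict(train_rows, train_labels, row):
    """Predict the label whose class mean lies closest to row."""
    best_label, best_distance = None, float("inf")
    for label in sorted(set(train_labels)):
        members = [r for r, l in zip(train_rows, train_labels) if l == label]
        mean = [sum(column) / float(len(members)) for column in zip(*members)]
        distance = sum((x - m) ** 2 for x, m in zip(row, mean))
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label

def leave_one_out_accuracy(rows, labels, predict=nearest_mean_predict):
    """Train on all rows but one, test on the row left out, and repeat for every row."""
    correct = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if predict(train_rows, train_labels, rows[i]) == labels[i]:
            correct += 1
    return correct / float(len(rows))

# Invented feature rows: [aggregate similarity, longest common POS run]
rows = [[0.82, 9], [0.75, 7], [0.90, 11], [0.12, 2], [0.20, 3], [0.08, 1]]
labels = [1, 1, 1, 0, 0, 0]
accuracy = leave_one_out_accuracy(rows, labels)
print(accuracy)
```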

Generally speaking, precision values were higher than recall, perhaps because some of the plagiarisms in the training data were fuzzier than others. Nevertheless, these accuracy values were high enough to warrant further exploration of Goldsmith's writing. Using the array of features discussed above and others to be discussed in a subsequent post, I tracked down a significant number of plagiarisms that were not part of the training data, including the following outright translations from the Encyclopédie:

French Source | Goldsmith Text
Il n'est point douteux que l' Empire , composé d'un grand nombre de membres très-puissans, ne dût être regardé comme un état très-respectable à toute l'Europe, si tous ceux qui le composent concouroient au bien général de leur pays. Mais cet état est sujet à de très-grands inconvéniens: l'autorité du chef n'est point assez grande pour se faire écouter: la crainte, la défiance, la jalousie, regnent continuellement entre les membres: personne ne veut céder en rien à son voisin: les affaires les plus sérieuses les plus importantes pour tout le corps sont quelquefois négligées pour des disputes particulieres, de préséance, d'étiquette, de droits imaginaires d'autres minuties. | It is not to be doubted but that the empire, composed as it is of several very powerful states, must be considered as a combination that deserves great respect from the other powers of Europe, provided that all the members which compose it would concur in the common good of their country. But the state is subject to very great inconveniences; the authority of the head is not great enough to command obedience; fear, distrust, and jealousy reign continually among the members; none are willing to yield in the least to their neighbours; the most serious and the most important affairs with respect to the community, are often neglected for private disputes, for precedencies, and all the imaginary privileges of misplaced ambition.
L' Eloquence , dit M. de Voltaire, est née avant les regles de la Rhétorique, comme les langues se sont formées avant la Grammaire. | Thus we see, eloquence is born with us before the rules of rhetoric, as languages have been formed before the rules of grammar.
L' empire Germanique, dans l'état où il est aujourd'hui, n'est qu'une portion des états qui étoient soûmis à Charlemagne. Ce prince possédoit la France par droit de succession; il avoit conquis par la force des armes tous les pays situés depuis le Danube jusqu'à la mer Baltique; il y réunit le royaume de Lombardie, la ville de Rome son territoire, ainsi que l'exarchat de Ravennes, qui étoient presque les seuls domaines qui restassent en Occident aux empereurs de Constantinople. | The empire of Germany, in its present state is only a part of those states that were once under the dominion of Charlemagne. This prince was possessed of France by right of succession: he had conquered by force of arms all the countries situated between the Baltic Sea and the Danube. He added to his empire the kingdom of Lombardy, the city of Rome and its territory, together with the exarchate of Ravenna, which were almost the only possessions that remained in the West to the emperors of Constantinople.
Il n'est point de genre de poésie qui n'ait son caractere particulier; cette diversité, que les anciens observerent si religieusement, est fondée sur la nature même des sujets imités par les poëtes. Plus leurs imitations sont vraies, mieux ils ont rendu les caracteres qu'ils avoient à exprimer....Ainsi l'églogue ne quitte pas ses chalumeaux pour entonner la trompette, l' élégie n'emprunte point les sublimes accords de la lyre. | There is no species of poetry that has not its particular character; and this diversity, which the ancients have so religiously observed, is founded in nature itself. The more just their imitations are found, the more perfectly are those characters distinguished. Thus the pastoral never quits his pipe, in order to sound the trumpet; nor does elegy venture to strike the lyre.


Samuel Johnson once observed that Oliver Goldsmith was “at no pains to fill his mind with knowledge. He transplanted it from one place to another; and it did not settle in his mind; so he could not tell what was his in his own books” (Life of Johnson). Reading the borrowed passages above, one can perhaps understand why Goldsmith struggled to recall what he had written in his books: much of his writing was not really his. As scholars continue to advance the art of detecting textual reuse, we will be better equipped to map these borrowed words at larger and more ambitious scales. For the present, writers like Goldsmith offer plenty of data on which to hone those methods.

* * *

This work has benefitted enormously from conversations with a number of others. Antonis Anastasopoulos, David Chiang, Michael Clark, John Dillon, and Kenton Murray of Notre Dame's Text Analysis Group, and Thom Bartold, Dan Hepp, and Jens Wessling of ProQuest offered key analytic insights, and Mark Olsen and Glenn Roe of the University of Chicago's ARTFL group shared essential data. I am grateful for the generous help each of you has provided. Code is available here.

Mapping the Early English Book Trade

Historians often call attention to the tremendous influence the 1710 Act of Anne had on the early English book trade. Commonly identified as the origin of modern copyright law, the Act laid the statutory foundations for fixed-term copyright in England, extended the ability to hold such copyrights to all individuals, and eventually toppled the monopoly that London booksellers had held on English printing since the incorporation of the Stationers' Company in 1557. Reading scholarship on this legal development over the last few months, I became curious to see how well the English Short Title Catalogue (ESTC) could substantiate some of the claims made in discussions of the Act. The ESTC seemed an ideal resource for this kind of analysis because, as Stephen Tabor has written, it represents “the fullest and most up-to-date bibliographical account of 'English' printing (in the broadest sense) for its first 328 years” (367). The database lists the authors, titles, imprint lines, publication dates, and many other metadata fields for each of the ~470,000 editions known to have been printed in England or its colonies between 1473 and 1800, and can therefore serve as a helpful resource with which to investigate the relationship between copyright law and literary history in the early modern period.

One of the debates surrounding the Act of Anne concerns the degree to which the statute altered the geography of the English book trade. Prior to the passage of the Act, legal historian Diane Zimmerman notes, the Stationers' Company dominated the book industry, and because the company's printers were primarily stationed in London, the book trade was also centered in the metropole. With the passage of the Statute of Anne, however, authors could sell or trade their copyrights to printers outside of London: “Now any printer [or] bookseller, wherever located within the country, could register a copyright with the Company” and “since purchasers of the copies could be located anywhere in the United Kingdom, the Stationers' Company did not regain its monopoly [on the book trade]” (7). Contra Zimmerman, William Patry argues that the Act of Anne failed to undermine London's control of the book trade: “After the Statute of Anne, as before,” he writes, “the only purchasers of authors' works were a small group of London booksellers” (84). To investigate what the ESTC had to say on this question, I compared the geographical distribution of English printers in the half centuries before and after the passage of the Act (click for full size):

The usual cautions concerning false imprints and varying survival rates notwithstanding, the ESTC clearly demonstrates the decentralization of English printing in the wake of the Act of Anne. London of course remained the primary site of publication throughout the years covered by the ESTC—publishing two-thirds of all records from the period—though its annual share in the trade fell quite dramatically across the eighteenth century:

One can explain some of that decline by examining the growth of printing in major metropolitan areas outside of London, such as Edinburgh (responsible for 6.5% of total editions in the ESTC), Dublin (5.4%), and Boston (3.7%), which claimed the second, third, and fourth overall largest shares of the book trade according to the ESTC:


Among these figures, the explosion of printing in Edinburgh after 1750 is particularly interesting, and appears to be the result of further changes in the legal code. As John Feather notes, “The Copyright Act of 1710 (8 Anne c. 21) implied, but did not state, that it was illegal to import any English-language books into England and Wales if they had been previously printed there” (58). However, he continues, “the legislation in relation to Scotland seems to have lapsed in 1754-1755,” after which one observes tremendous growth in Scottish printing. Between 1750 and 1755, the five-year average of Edinburgh printing as a percent of all printing recorded in the ESTC is 7.5%. This figure only continues to grow after the lapse of Scottish printing regulations noted by Feather: from 1755-1760, Edinburgh printing climbs to 9.0% of all printing for the five-year period; from 1760-1765, the figure rises to 12.3%; and from 1765-1770, it reaches 14.4% of the ESTC totals for the five-year range. These values are significant because they suggest the real surge in the Scottish reprinting industry did not take place in the aftermath of the Donaldson v. Becket decision, as is commonly supposed, but rather with the lapse of Scottish reprinting regulations in 1755.
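For readers who want to try this kind of windowed comparison themselves, here is a minimal sketch in Python. It is not the code behind the figures above. It assumes each record has been reduced to a hypothetical (year, place) pair, it pools counts within each window rather than averaging annual percentages, and it treats windows as half-open (1750 through 1754, 1755 through 1759, and so on). The records in the example are made-up toy values, not ESTC data.

```python
from collections import Counter

def place_share_by_window(records, place, start, end, width=5):
    """Percent of records printed in `place` for each `width`-year window.

    `records` is an iterable of (year, place_of_publication) pairs.
    Windows are half-open: [start, start + width), and so on up to `end`.
    """
    totals = Counter()
    hits = Counter()
    for year, printed_in in records:
        if start <= year < end:
            # Snap the year down to the first year of its window.
            window = start + ((year - start) // width) * width
            totals[window] += 1
            if printed_in == place:
                hits[window] += 1
    return {w: 100.0 * hits[w] / totals[w] for w in sorted(totals)}

# Made-up toy records for illustration only, NOT ESTC data.
toy = [
    (1750, "London"), (1752, "Edinburgh"), (1753, "London"), (1754, "London"),
    (1755, "Edinburgh"), (1756, "London"), (1758, "Edinburgh"), (1759, "London"),
]
shares = place_share_by_window(toy, "Edinburgh", 1750, 1760)
print(shares)
```

Pooling counts weights each window by the number of editions it contains; averaging the yearly percentages instead would give sparse years the same weight as busy ones, so the two approaches can yield slightly different values.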

Having plotted the changing geography of early English printing, I was curious to see whether the ESTC could shed new light on the debate concerning anonymous printing in the early modern period. Researchers like Jody Greene have argued that the Statute of Anne was in fact designed to help combat anonymous publishing insofar as it required authors to attach their names to works if they wished to obtain copyright protection for those works (4). Years ago, Michel Foucault pioneered a version of this thesis in his essay “What is an Author?”, where he argued that the Act of Anne and its elaboration in eighteenth-century case law spurred the transition from a literary culture founded on anonymity to one founded on named authorship. More recently, however, Robert Griffin disputed such claims, arguing that “the historical record shows . . . there is no necessary relation between copyright and the appearance of the name of the author on the title page” (879). To map the changing rate of anonymity over time, I aggregated the number of anonymous and pseudonymous publications as percentages of annual totals within the ESTC:


The resulting plot shows great fluctuation in anonymous publications within the fifteenth and early sixteenth centuries, largely because of the very small number of publications for those years. In 1492, for instance, the ESTC lists only 14 publications, all but two of which (S111337 and S120825) had identified authors, which results in an aggregate estimate of anonymity for the year of roughly 0.143, or 14.3 percent. Despite the year-to-year fluctuations within early records, however, examining anonymity rates in the aggregate leads to legible patterns: one finds a marked decline in anonymous publication rates over the fifteenth and sixteenth centuries, a fairly steady rise across the seventeenth century, and a slow aggregate decline in the wake of the Act of Anne. This data supports some of the findings of Joad Raymond, who examined a small sample of records from the period and found that “anonymity . . . became increasingly frequent over the course of the seventeenth century” (168), while challenging the popular thesis that anonymity thrived with the lapse of the Licensing Act in 1695.
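The yearly rate described above is simply the number of anonymous or pseudonymous records divided by the year's total. A minimal sketch in Python follows, assuming each record has already been reduced to a hypothetical (year, is_anonymous) pair; that representation is an illustration, not the ESTC's own schema. The example reproduces the 1492 case from the text: 14 records, two of them without an identified author.

```python
from collections import defaultdict

def anonymity_rates(records):
    """Return {year: share of that year's records with no named author}.

    `records` is an iterable of (year, is_anonymous) pairs, a hypothetical
    simplification of a catalogue record.
    """
    totals = defaultdict(int)
    anonymous = defaultdict(int)
    for year, is_anonymous in records:
        totals[year] += 1
        if is_anonymous:
            anonymous[year] += 1
    return {year: anonymous[year] / totals[year] for year in totals}

# The 1492 case: 14 records, 2 of them without an identified author.
records_1492 = [(1492, True)] * 2 + [(1492, False)] * 12
rate_1492 = anonymity_rates(records_1492)[1492]
print(f"{rate_1492:.3f}")  # rounded to three decimal places
```

With so few records in a year, a single additional anonymous title moves the rate by about seven percentage points, which is why the early years of the plot swing so widely.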

To plot the history of anonymity, though, is to raise a fundamental question: What exactly counts as an anonymous work? While the plot above treats works as anonymous only if their title pages are attributed to pseudonymous figures like “Isaac Bickerstaff” or to no author at all, there are other cases that one might well wish to classify as anonymous works. Consider the range of works attributed to “corporate” authors like the Royal Society of London or the English Parliament. Are works published by these entities anonymous publications? The way one answers this question will of course greatly affect the way one reads the history of anonymity. As a case in point, we could consult the following plot, which shows monarchical and parliamentary publishing during the seventeenth and eighteenth centuries:

The points here represent yearly values, while the regression lines map the smoothed trends over time. For example, the release of the ESTC to which I had access indicates that James I and Charles I published a combined total of 82 works in 1625 (both served as monarch during the year), the English and Scottish Parliaments published a combined total of 4 works during the year, and the year's total number of publications was 695, which means that monarchical publications account for about 11.8 percent of the annual total while parliamentary publications account for only about 0.6 percent of the same. As one can see, treating the high volume of parliamentary publications from the period as “anonymous works” would create a serious spike in anonymity rates during the English Civil Wars, and would steadily inflate anonymity rates across the eighteenth century. On the other hand, refusing to include works of corporate authorship among anonymous publications (as I have done in the plot of anonymity above) makes it more difficult to answer the question: What exactly counts as anonymity in the early modern world? Whether one includes or excludes corporate authorship from the domain of anonymity, this plot of parliamentary and monarchical publications intrigues me because it maps so neatly onto the political history of the English Civil Wars: monarchical publications trump parliamentary output until the critical years of the early 1640s, after which the Parliament assumes a predominance it holds throughout the Interregnum and only loses in the Restoration. Thereafter the monarchical voice triumphs until the Statute of Anne, after which point it rapidly loses ground. Examining this plot, I can't help but wonder: To what extent is monarchical publishing a function of the crown's political power, and to what extent is that political power a function of the monarch's proximity to print?
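The 1625 shares can be checked with a few lines of Python. The sketch below uses only the three counts reported above (82 monarchical works, 4 parliamentary works, 695 publications in total); the helper name `share_of_total` is my own and purely illustrative.

```python
def share_of_total(count, total):
    """Return `count` as a percentage of `total`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total

# Counts for 1625 as reported in the paragraph above.
TOTAL_1625 = 695
monarch_share = share_of_total(82, TOTAL_1625)
parliament_share = share_of_total(4, TOTAL_1625)

print(f"monarchical: {monarch_share:.2f}%")
print(f"parliamentary: {parliament_share:.2f}%")
```

Rounded to two decimal places, the shares come to 11.80 and 0.58 percent respectively.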

* * *

I want to thank Benjamin Pauley, Brian Geiger, and Virginia Schilling, each of whom kindly helped me to acquire the ESTC data on which the analysis above was performed, as well as Elliott Visconsi, whose intriguing questions on copyright history continue to motivate my ongoing research.