Co-citation Networks in the EEBO-TCP Corpus

I recently had the good fortune of attending a conference on computational approaches to early modern literature hosted at the University of Newcastle. During the conference, I not only got to meet some outstanding scholars—including Doug Bruster, John Burrows, Hugh Craig, Mac Jackson, and Glenn Roe, to name only a few—but I also had a chance to present some of my recent work on algorithmic approaches to the study of literary influence. In case it might be of interest to others working in related fields, I thought I would share one of the approaches I discussed in what follows below.

Buried within the EEBO-TCP corpus, I learned some months ago, is a veritable trove of metadata. These metadata features indicate when authors include things like stage directions, tables of data, and alchemical symbols in their writing, all of which is great news for the computationally inclined. For those interested in influence, there are also metadata fields that indicate when an author is quoting or citing another text. By placing <q> tags around quotations and <bibl> tags around citations of authors or works, the Text Creation Partnership made it easy for researchers to begin looking for citational patterns in early modern literature:

Lines 48-52 of this sample from the EEBO-TCP corpus contain <q> and <bibl> tags to designate quotations and bibliographic citations respectively.

Using a simple script, one can easily extract all of the quotations and bibliographic citations from all files in the EEBO-TCP corpus. Because the EEBO-TCP corpus contains roughly one third of the titles from the period recorded in the definitive English Short Title Catalogue, this citational data can serve as a fairly representative archive of intertextual trends in early modern England:
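For readers who want to try something similar, here is a minimal sketch of what such an extraction script might look like. It is not the script used for this post: the function names and the sample XML string are made up for illustration, and real EEBO-TCP files may need extra handling (character entities, namespaces, nested tags) that this sketch only partly addresses.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Drop any '{namespace}' prefix that ElementTree attaches to a tag."""
    return tag.rsplit("}", 1)[-1]

def extract_citations(xml_text):
    """Return (quotations, citations) found in one TCP-style XML document."""
    root = ET.fromstring(xml_text)
    quotations, citations = [], []
    for element in root.iter():
        name = local_name(element.tag).lower()
        # Gather all text inside the element and collapse runs of whitespace.
        text = " ".join("".join(element.itertext()).split())
        if name == "q":
            quotations.append(text)
        elif name == "bibl":
            citations.append(text)
    return quotations, citations

# Invented sample, shaped like the <q>/<bibl> markup described above.
sample = (
    "<body><p>As it is written, <q>Curse ye Meroz</q> "
    "<bibl>Iudges, 5 23.</bibl></p></body>"
)
quotations, citations = extract_citations(sample)
print(quotations, citations)
```

Running the same function over every file in a directory and writing the results to a spreadsheet would give a raw citation table of the kind discussed next.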


That is, at least, the idea in theory. In practice, the data is quite messy, and in its native form, is all but algorithmically unapproachable. Take, for example, the following references contained within <bibl> tags:

Iudges, 5 23.
IVDGES 4. 21.
Iudg. 42.
[Judg. 21.]

Human investigators who look into the matter can easily discover that each of these references refers to the Book of Judges. To allow computers to recognize this fact, however, I had to spend several weeks sifting my way through the collection of <bibl> tags, carefully identifying the texts and authors to which those metadata fields referred. In the end, of the roughly 45,000 items that were tagged as references to books or authors in the EEBO-TCP corpus (parts I and II), I found roughly a third to be too cryptic to decipher:

W. ??.
ラ Testimony of a great Divine.
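Part of this cleanup can be automated before the hand review begins. The sketch below is only an illustration of the idea, not the procedure actually used here: the stem table is hypothetical and tiny, and the spelling rules (folding "v" into "u", and reading an initial "i" before a vowel as "j") are my own simplifying assumptions about early modern orthography.

```python
import re

# Hypothetical stem table, written in the folded spelling produced below.
STEMS = {"judg": "Judges", "psal": "Psalms"}

def fold_spelling(token):
    """Lowercase a token and fold the i/j and u/v variation of early print."""
    token = token.lower().replace("v", "u")
    # An initial "i" before a vowel usually stands for a modern "j".
    return re.sub(r"^i(?=[aeiou])", "j", token)

def normalize_reference(raw):
    """Map a raw <bibl> string to a canonical title, or None if unrecognized."""
    match = re.search(r"[A-Za-z]+", raw)  # first alphabetic token
    if match is None:
        return None
    token = fold_spelling(match.group(0))
    for stem, title in STEMS.items():
        if token.startswith(stem):
            return title
    return None

for raw in ["Iudges, 5 23.", "IVDGES 4. 21.", "Iudg. 42.", "[Judg. 21.]", "W. ??."]:
    print(repr(raw), "->", normalize_reference(raw))
```

References that come back as None, like the cryptic examples above, would still need a human reader.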

Setting these obscure references aside, I cleaned up the sources about which I was reasonably confident, such as the references to Judges above. After I had aggregated all of these, I was naturally curious to see which texts and authors were most cited in the corpus. Here are the top forty:

The forty most cited authors and works within the EEBO-TCP corpus.

Biblical citations predominate, with the greatest number of references going to the Book of Psalms. (It should be noted that in most translations of the Bible into English and Latin, the Book of Psalms has by far the greatest number of verses and words.) Matthew leads the Evangelists, followed by John, then Luke, then Mark. The highest frequency Greco-Roman writers include Virgil, Ovid, Horace, Martial, Juvenal, and Seneca, in that order. While this data might not revolutionize early modern scholarship, it might be helpful in the classroom. When teaching undergraduates about the works early modern audiences read, heard, and cited, for example, such visualizations can help to bring home the profound religiosity of the age.

Drawing on the same data that underlies this visualization, one can also analyze networks of influence in early modern literature. Rather than conceive of influence as a series of vectors, each measuring the number of times a given work or author is cited in the TCP corpus, one can analyze the kinds of works and authors that are cited together. Such forms of analysis are often referred to as “co-citation networks,” or networks of references that tend to be cited together. For example, say our corpus contained references to only nine different sources:


In that case, we could visualize each of these sources as a “node” in a network graph and could create an “edge,” or connecting line, between any two nodes that were cited within a text in the corpus. The sample graph above would then illustrate the fact that references “A” and "B" are cited within one text, "G" and "Z" in another, and so on. Using this method on the EEBO-TCP corpus, I thought, might help to reveal latent structures embedded within early modern citation networks. With the curated EEBO-TCP citation data in hand, I therefore proceeded in the fashion described above, transforming each cited author or work into a node, and creating edges between any two nodes cited within a single work. Here is one visualization of the results:

This wild web is a visualization of co-citation networks in the EEBO-TCP corpus. Each cited author and text is a point (or "node") on the graph, and any two authors or texts cited within a single text are connected by a line (or "edge"). The plot also uses a modularity clustering algorithm to separate and color code nodes that commonly appear together: the blue cluster contains biblical works, the red cluster contains classical authorities, the yellow cluster contains Reformation-era martyrs and theologians, and the purple cluster contains literary figures.

In this graph, each node has been assigned a color. These colors are determined by an algorithm that identifies clusters of nodes that are commonly cited together, and then colors the different clusters accordingly. Analyzing the nodes that cluster together, one finds four particularly well-defined groups of references: the light blue cluster of nodes, which are biblical books (Leviticus, Corinthians); the red cluster of nodes, which are classical references (Macrobius, Lucretius); the yellow cluster of nodes, which are Reformation-era martyrs and theologians (Theodore Beza, Richard Turner); and the purple cluster of nodes, which are literary writers (Philip Sidney, William Shakespeare). The results of the procedure are intuitively legible: each group represents a fairly homogeneous collection of authors and works, and the divisions that split the groups apart represent fairly significant generic differences between the various references contained in the data. If this is right, then such a map could perhaps serve as a useful resource in the classroom. When discussing the so-called "battle of ancients and moderns," for instance, graphs like the present one might help students visualize the different ways early modern citation practices established divisions between these groups.
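The node-and-edge construction described above can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions rather than the code behind the plot: the input format (a dictionary mapping each text to the sources it cites) and the toy data are invented, and the clustering and coloring steps are left to a separate graph tool.

```python
from collections import Counter
from itertools import combinations

def cocitation_edges(citations_by_text):
    """Count how many texts cite each unordered pair of sources together."""
    edges = Counter()
    for cited in citations_by_text.values():
        # Sort the unique sources so (A, B) and (B, A) become one edge.
        for pair in combinations(sorted(set(cited)), 2):
            edges[pair] += 1
    return edges

# Toy corpus: three texts and the sources each one cites.
toy_corpus = {
    "text_1": ["A", "B"],
    "text_2": ["G", "Z"],
    "text_3": ["A", "B", "G"],
}
edges = cocitation_edges(toy_corpus)
for (left, right), weight in sorted(edges.items()):
    print(left, right, weight)
```

The resulting edge weights (how many texts cite a given pair) are what a modularity algorithm would use to decide which nodes belong together.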

Using a slightly different visualization, one can identify the points of connection between and among these ostensibly divided groups. This link will take you to an interactive site that contains all of the data included in the previous plot, including the modularity rankings that separated and color-coded the nodes. Unlike the previous plot, however, the visualization at the other end of that link allows users to click on a particular node and see all of the others to which that node is connected. Comparing the networks of each of these nodes, one can produce at a glance one measure of the degree to which the given node was cited with works and authors from the classical inheritance, for instance. The results sometimes lead to new questions. For example, the interactive plot demonstrates that when early modern authors cited Paul the Apostle in the context of classical writers like Apuleius, they cited Paul's given name; when they cited Paul in biblical contexts, by contrast, they cited his works, such as "Romans." Why might that be the case? Other points of interest I've come across in this visualization include the relative isolation of the alchemists (George Ripley, Geber, etc.), towards the bottom of the plot, and the predominance of biblical citations in works that reference Descartes. Both of these observations strike me as less intuitive than the separation of authors into disparate modularity rankings, for instance, and both seem worthy of further inquiry.

Batch Processing Python Scripts on Sun Grid Engine Queues

Suppose you have a collection of text files and would like to compare each of those files to each of the others. Perhaps you would like to know which characters, locations, or stage directions in each file occur in any of the others. Whatever the task, if your collection is small enough—on the order of a few paragraphs, say—you can of course compare the files manually, reading each of your paragraphs in turn, and comparing the given paragraph to each of the others. If your collection is a bit bigger—on the order of a few hundred novels, say—you might automate these comparisons on your computer. If your collection is really big, however, a single computer might not be powerful enough to finish the job during your lifetime.

Comparing each file in a four-text corpus to each of the others, after all, only involves six comparisons. Running the same analysis on a collection of 50,000 files (roughly the size of the Project Gutenberg collection in English, or the EEBO-TCP corpus), however, means running 1,249,975,000 comparisons. If each of those comparisons takes one minute to execute on your computer, it will take roughly 2,376 years to run this job on your machine. Thankfully, we can expedite this process tremendously by leveraging the power of distributed computing systems like Sun Grid Engine (SGE) queues. Pursuing the routine described above, for instance, we can use an SGE system to run each of our 1+ billion comparisons in a few minutes.
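The arithmetic behind these figures is easy to check. The short sketch below counts the unordered pairs among p files and converts one-minute comparisons into years; the function name is mine, and the year length of 365.25 days is an assumption.

```python
def comparison_count(p):
    """Number of unordered pairs among p files: p * (p - 1) / 2."""
    return p * (p - 1) // 2

MINUTES_PER_YEAR = 60 * 24 * 365.25  # assumes a 365.25-day year

pairs = comparison_count(50_000)
years = pairs / MINUTES_PER_YEAR
print(f"{comparison_count(4)} comparisons for a four-text corpus")
print(f"{pairs:,} comparisons for 50,000 files")
print(f"{int(years):,} full years at one minute per comparison")
```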

To get started, we'll want to create an "iteration schedule" in which we identify all of the comparisons we wish to run. Here is a visual representation of an iteration schedule for a four-text corpus:

        text 1  text 2  text 3  text 4
text 1            o       o       o
text 2                    o       o
text 3                            o
text 4

In the table above, each of our iterations-to-be-run is denoted by an "o." Each "o" sits in the cell that joins the row and the column that denote the two texts to be analyzed in the given iteration. Reading across our first row, for instance, we see that text one does not need to be compared to text one, but does need to be compared to texts two, three, and four. The second row denotes that we want to compare text two to texts three and four, and row three denotes that we want to compare text three to text four.

After determining all of the comparisons to be run, we will want to render that information in machine-readable form. More specifically, we want to generate a table that has three columns: iteration_number, first_text, and second_text, where first_text and second_text are the file names of the two texts we wish to compare in the given iteration, and iteration_number is an integer whose value is zero in the first row of the iteration schedule, 1 in the next row, 2 in the next, . . . and n-1 in the last, where n equals the total number of comparisons we wish to make. [In general, the number of comparisons required is (p-1)(p)/2, where p equals the number of files in your corpus.] Here is a sample iteration table (produced by this script):

0 A00002.txt A00005.txt
1 A00002.txt A00007.txt
2 A00002.txt A00008.txt
3 A00002.txt A00011.txt
4 A00002.txt A00012.txt
5 A00002.txt A00013.txt
6 A00002.txt A00014.txt
7 A00002.txt A00015.txt
8 A00002.txt A00018.txt
9 A00002.txt A00019.txt
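A table in this format can be generated with Python's standard library. The following is a minimal sketch and not necessarily how the script linked above works; the function names and the output path passed to write_schedule are placeholders of my own.

```python
from itertools import combinations

def build_schedule(file_names):
    """Return rows of (iteration_number, first_text, second_text)."""
    pairs = combinations(sorted(file_names), 2)  # every unordered pair once
    return [(number, first, second) for number, (first, second) in enumerate(pairs)]

def write_schedule(rows, path):
    """Write one space-separated row per line, as in the sample table."""
    with open(path, "w", encoding="utf-8") as handle:
        for number, first, second in rows:
            handle.write(f"{number} {first} {second}\n")

rows = build_schedule(["A00002.txt", "A00005.txt", "A00007.txt", "A00008.txt"])
for row in rows:
    print(*row)
```

With four files this yields six rows numbered 0 through 5, matching the (p-1)(p)/2 count given above.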

If you want to batch process a different kind of routine on an SGE system, you can modify your iteration schedule appropriately. If you only want to calculate the type-token ratios of each of your files, for instance, you'll only need two columns: iteration_number and text_name. Once this iteration schedule is all set, we can turn to the script to be run during each of these iterations. Here's mine. If you want to run a different kind of analysis, just keep lines 1-21 and line 102 of that script, and use the variable “iteration_number” to guide which texts you will analyze in each iteration. Once your routine is all set, save it as “” and upload it—along with your iteration schedule, the files you wish to compare, and these two higher order scripts—to a single directory on your SGE server:
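To make the role of iteration_number concrete, here is a hedged sketch of what the core of a per-iteration script might do: look up its row in the schedule, then compare the two named texts. None of this is taken from the script referenced above. The function names are hypothetical, and the "comparison" is a toy stand-in (shared vocabulary) for whatever analysis you actually want to run.

```python
def lookup_pair(schedule_lines, iteration_number):
    """Return (first_text, second_text) for the given iteration_number."""
    for line in schedule_lines:
        fields = line.split()
        if len(fields) == 3 and int(fields[0]) == iteration_number:
            return fields[1], fields[2]
    raise KeyError(f"iteration {iteration_number} not found in schedule")

def shared_vocabulary(text_a, text_b):
    """Toy comparison: the lowercased words that both texts contain."""
    return set(text_a.lower().split()) & set(text_b.lower().split())

# In-memory stand-in for the schedule file; an open file handle works too.
schedule = ["0 A00002.txt A00005.txt", "1 A00002.txt A00007.txt"]
first_text, second_text = lookup_pair(schedule, 1)
print(first_text, second_text)
print(shared_vocabulary("Enter the King", "Exit the king"))
```

On a real cluster, each job would receive a different iteration_number, read the two files it names, and write its own output file.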

Once all of these files are in the same directory, you are ready to submit your script for batch processing. To do so using the University of Notre Dame's Center for Research Computing system, you can simply type "python your_netid" with no quotation marks:

After you submit this command, the higher-order Python scripts and will create new copies of your script, changing the input files for each iteration according to your iteration schedule. If all has gone well, and you refresh the directory after a few moments, you'll see a few (or more than a few, depending on the number of iterations you are running!) new files in your directory. More specifically, you'll have a collection of new job_ files that give you feedback on the result of each of your iterations. If errors cropped up during your analysis, those errors should be recorded in these job_ files. Provided that there were no exceptions, though, those files will be empty, and you will find in your directory whatever output you requested in your script. Et voila, now you can finish your analysis in a few minutes, rather than a few millennia!

* * *

I want to thank Scott Hampton and Dodi Heryadi of Notre Dame's High Performance Computing Group, who helped me think through the logistics of batch processing, Reid Johnson of Notre Dame's Computer Science Department, who sent me the higher-order SGE scripts on which this analysis is based, and Tim Peters, who wrote many of the Python modules on which my work depends!

Identifying Poetry in Unstructured Corpora

Over the last few months, I've been working with colleagues at Notre Dame to develop computational approaches we can use to identify the genres to which a literary work belongs. Initially, we focused our research on the georgic, a class of agricultural-cum-labour poems that flourished in the seventeenth and eighteenth centuries. Eventually, though, our limited research corpus led us to investigate methods we could use to identify more period poetry, and these investigations helped reveal a fascinating if simple method one can use to identify poetic works in unstructured corpora. 

We began building our corpus of early modern English poetry by identifying the poetry curated by the Text Creation Partnership (TCP). Running a simple Python script over the TCP's selections from the Early English Books (EEBO) corpus—which stretches from “the first book printed in English in 1475 through 1700”—and the Eighteenth Century Collections (ECCO) corpus, we extracted all the lines of text wrapped in <l> tags (the TEI designation for a line of verse). This left us with 16,571 text files, each of which contained only poetry from roughly the sixteenth through the eighteenth centuries. After examining some of these files, we realized that many consisted entirely of poetic epigraphs, so we used another script to remove all of these small files (those smaller than 16 kb) from our research corpus, leaving us with a fairly substantive collection of poetic works from the period of interest:


Because the EEBO-TCP contains 44,255 volumes—roughly one third of all titles recorded in Alain Veylit's ESTC data for the appropriate years—we felt reasonably confident that our holdings for the sixteenth and seventeenth centuries were fairly representative of literary trends during the period. The ECCO-TCP, on the other hand, contains only 2,387 texts, less than one percent of ESTC titles from the eighteenth century. Even if we accept John Feather's argument that only 25,131 literary works were written in English during the eighteenth century—11,789 of which, he claims, were poetic works—we are left to conclude that the 1,698 files in the ECCO-TCP corpus that contain poetry might not be indicative of poetic trends from the period. Given these conclusions, we were eager to supplement our collection of eighteenth-century poetry.

But where on earth can one find enormous quantities of eighteenth-century poetry in digital form? (This isn't meant to be a rhetorical question; if you've got ideas, please let us know!) After considering the issue for some time, we elected to work with Project Gutenberg. Unfortunately, only after we had downloaded and unzipped all of the English files on Project Gutenberg did we realize that the enormous text collection (roughly 45,000 volumes) is all but entirely unstructured. We couldn't find any master list of file names, author names, publication dates, or any other essential metadata fields, so we had to build our own.

In the first place, we wanted to be able to differentiate poetic texts from non-poetic texts. While I imagine it would be possible to complete this task by analyzing the relative frequency of strings from each of these texts in the manner described in the previous post, we didn't have reliable publication dates for the Gutenberg texts, so we needed an alternative method. Operating on the hypothesis that poetic texts have more line breaks and fewer words per line than prose works, we decided to measure the number of words in each line of each file. We then collected a random sample of poetic works to see what their words-per-line profiles looked like:

In these plots—each of which represents a single poetic text—the numbers along the x-axis indicate the number of words in a line of the text file, and the y-axis indicates the relative frequency of lines that contain such-and-such a number of words within the text. In The Poetical Works of James Beattie, for instance, only ~5% of lines had 12 or 13 words in them, whereas almost 20% of the text's lines had 7 or 8 words in them. In other words, The Poetical Works of James Beattie is dominated by lines with seven or eight words in them, a fact that applies to all of the poetic works plotted above. With these figures in hand, we plotted the words-per-line profiles for a random assortment of prose works from roughly the same period:

We were pleased to see that these plots differed from the poetic plots quite dramatically! Comparing the two sets of curves, we see that poetic works contain a preponderance of lines with 7-8 words, while prose works contain a preponderance of lines with 11-12 words. This is naturally due to the fact that lines of text in prose works run across an entire page, while poets break lines strategically (and regularly in eighteenth-century verse). To identify poetry in unstructured corpora, then, we can calculate a text's words-per-line profile, and use the results of those calculations in order to classify each text in our corpus as a work of poetry or a work of prose. Using a rather simple approach to the latter task, we found 3150 poetic works tucked in the Gutenberg corpus, a few hundred of which are from our period and can thus contribute to our study of genre classification.
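For anyone who wants to experiment with this idea, a words-per-line profile and a very crude classifier can be sketched as follows. The decision rule here is my own assumption, not the approach used for the Gutenberg count above: it labels a text as poetry when its most common line length falls below a threshold of 10 words, a value chosen only because it sits between the 7-8 and 11-12 word peaks described above.

```python
from collections import Counter

def words_per_line_profile(text):
    """Map each line length (in words) to its relative frequency in the text."""
    lengths = [len(line.split()) for line in text.splitlines() if line.strip()]
    if not lengths:
        return {}
    tally = Counter(lengths)
    return {n: tally[n] / len(lengths) for n in sorted(tally)}

def looks_like_poetry(text, threshold=10):
    """Crude rule (assumed): poetry if the modal line length is under threshold."""
    profile = words_per_line_profile(text)
    if not profile:
        return False
    modal_length = max(profile, key=profile.get)
    return modal_length < threshold

verse = (
    "The Patriarch, to gain a wife\n"
    "Chaste, beautiful, and young,\n"
    "Serv'd fourteen years, a painful life,\n"
    "And never thought it long.\n"
)
prose = "\n".join(" ".join(["word"] * 12) for _ in range(4))  # filler lines
print(words_per_line_profile(verse))
print(looks_like_poetry(verse), looks_like_poetry(prose))
```

A more careful classifier would compare the whole profile rather than a single peak, and would need to cope with headers, tables of contents, and other non-literary lines in each file.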

Ngram Frequency and Eighteenth-Century Commonplaces

When Samuel Richardson's Mrs. Jewkes remarks that “Nought can restrain consent of twain,” we confidently conclude she's quoting Harington's translation of Orlando Furioso. When Edmund Burke writes in his Philosophical Enquiry, "Dark with excessive light thy skirts appear," we know he's misquoting Milton. While passages like these make their debts fairly clear, though, in most cases literary influence is notoriously difficult to trace. When Mary Wollstonecraft identifies marriage as a form of “legal prostitution” in her Vindications, for instance, are we meant to reflect on the thrust of that phrase in Defoe's Matrimonial Whoredom? When Ann Radcliffe's Adeline and La Motte stroll “under the shade of 'melancholy boughs'” in The Romance of the Forest, what gives us the warrant to imagine Orlando “under the shade of melancholy boughs” in As You Like It?

In each of the aforementioned cases, both the quoting and the quoted texts include identical (or nearly-identical) sequences of words. If this property is a necessary condition for intertextuality, however, it is clearly not a sufficient one, for while Wollstonecraft's second Vindication and Defoe's Matrimonial Whoredom both use the phrase “legal prostitution,” they also both use the phrase “if it be,” as well as the phrase “a kind of.” Nonetheless, literary scholars don't identify the latter two strings as instances of intertextuality, perhaps because we intuitively sense that “if it be” and “a kind of” are far more common phrases during the period than “legal prostitution,” a thesis to which Google lends some confidence:

Ngram frequencies of three strings that appear in both Defoe's Matrimonial Whoredom and Wollstonecraft's Vindication of the Rights of Woman.

Such queries demonstrate something literary scholars have known for a long time, namely the fact that the passages we classify as instances of intertextuality have (1) common words in a common order, and (2) significantly lower relative frequency rates than other (equally long) strings from the same period. With this insight in mind, I built an API for the Google Ngrams data with which one can pull down the relative frequencies of a list of strings shared by two (or more) works. Given a set of substrings shared by two texts, and given the relative frequencies of each of those strings in the age during which those texts were published, one can eliminate high frequency strings and thereby reduce the number of passages scholars must hand review to identify relevant instances of intertextuality.

Although I developed the Ngram API to eliminate high frequency strings from the output of my sequence alignment routines, it eventually helped me to discover an interesting correlation between the relative frequency of n-grams and instances of intertextuality. This discovery unfolded in the following way. On a whim, I decided to examine the relative frequencies of bigrams across passages from a few canonical works published during the long eighteenth century: Henry Fielding's Joseph Andrews (1742), Edmund Burke's Enquiry (1757), and Maria Edgeworth's Ennui (1809). Each of the selections that I drew from these texts centers on a quotation of another writer—the Fielding passage quotes Virgil's Aeneid, the Burke passage quotes Shakespeare's Henry V, and the Edgeworth passage quotes Voltaire's “La Bégueule.” I broke each of these passages down into a set of sequential bigrams, and submitted each of the bigrams to the Google Ngrams data via the API described above. In the case of Burke, for example, I fired up the API and entered the following data into the input fields:

After identifying these parameters and clicking "Go!", I watched the tool navigate to the Google Ngram site and search for the relative frequency of the first two words in the Burke passage. The API limits the historical scope of this search to the period between 1752 and 1762 (the user-provided publication date of Burke's text plus and minus five years), because the Google Ngram data is a bit noisy, and we don't want anomalies in the data for 1757 to skew our sense of the bigram's relative frequency in the period. The API then calculates the mean value for the bigram's relative frequency across those years, and it writes the bigram, the publication year, and the calculated relative frequency to an output file. It then looks at the next bigram (containing words two and three), and reiterates the process, continuing in this fashion until it has queried all valid ngrams in the input file.
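The two calculations at the heart of this routine, splitting a passage into sequential bigrams and averaging a bigram's frequency over the eleven-year window, can be sketched without any network access. This is an illustration only: the function names are mine, the yearly frequencies are invented numbers rather than Google Ngram data, and the sample line is borrowed from the Burke quotation discussed earlier simply because it is short.

```python
def sequential_bigrams(passage):
    """Return overlapping word pairs: words 1-2, words 2-3, and so on."""
    words = passage.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

def mean_frequency(yearly_frequency, publication_year, window=5):
    """Average the frequencies recorded within +/- window years of publication."""
    years = range(publication_year - window, publication_year + window + 1)
    values = [yearly_frequency[year] for year in years if year in yearly_frequency]
    return sum(values) / len(values) if values else None

bigrams = sequential_bigrams("Dark with excessive light thy skirts appear")
print(bigrams)

# Invented frequencies for one bigram; 1800 lies outside the 1752-1762 window.
made_up = {1752: 1.0, 1757: 3.0, 1762: 2.0, 1800: 100.0}
print(mean_frequency(made_up, 1757))
```

Averaging across the window is what keeps a single noisy year from dominating the estimate.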

Preliminary analysis suggests that one can then use this output file to identify instances of intertextuality even in cases in which one does not have access to the referenced text. Using the aforementioned selections from eighteenth-century texts, I used the method described above to calculate the relative frequencies of the bigrams in each of those selections. I then plotted the bigram frequencies with R's scatter.smooth() function—identifying the first bigram in the selection as bigram number one, the second bigram as bigram number two, and so forth across the x-axis—so that I could better identify the trends in bigram frequency across each passage. I was surprised by the results:

In each case, the local minimum of the regression line centers on the instance of intertextuality in the queried passage! While this trend is promising, though, it could be due to a number of causes. Chief among these are the differences in language and historical period that divide each of the “quoting” texts cited above from the passage that that work quotes. As we noted above, Henry Fielding quotes Virgil, Burke quotes Shakespeare, and Edgeworth quotes Voltaire, all in the original languages. When we compare the relative frequency of bigrams in Latin, French, and Elizabethan English with bigrams written in colloquial English of the mid- to late-eighteenth century, then, we should perhaps not be surprised that the latter tend to be more common in the Ngram data from that period, ceteris paribus. Nevertheless, these initial results yield new questions: Can the method described above identify instances of poetry in works of prose from a particular period? Can such a method be integrated into an ensemble approach to intertextuality, or do these graphs merely contain a half-told truth, mysterious to descry, which in the womb of distant causes lie? Such are the questions I hope to pursue in subsequent work.
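One rough way to locate such a dip programmatically is to smooth the sequence of bigram frequencies and find where the smoothed curve is lowest. The sketch below swaps in a simple moving average for the loess fit that scatter.smooth() performs in R, so it will not reproduce those curves exactly; the function names, the window width, and the frequency values are all assumptions made for illustration.

```python
def moving_average(values, width=3):
    """Smooth a sequence with a centered window (shorter at the edges)."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def index_of_minimum(values):
    """Position of the smallest value in the sequence."""
    return min(range(len(values)), key=values.__getitem__)

# Invented bigram frequencies with a trough in the middle of the passage.
frequencies = [9, 8, 9, 2, 1, 2, 9, 8, 9]
smoothed = moving_average(frequencies, width=3)
print(index_of_minimum(smoothed))
```

The bigram at the returned position would then be the first place to look for a possible quotation.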

Digital Approaches to Intertextuality: The Case of Eliza Haywood

Over the last year or so, I have been thinking more and more about intertextuality, or the ways in which writers borrow language and ideas from other writers. As a researcher who is particularly interested in the relationship between literary and scientific texts of the Enlightenment, I have been writing plagiarism detection scripts in order to pursue moments in works like Laurence Sterne's Tristram Shandy that borrow language from medical texts such as Burton's Anatomy of Melancholy. Having spent some time thinking about these kinds of questions, I was perhaps unusually provoked by the passage in Eliza Haywood's Betsy Thoughtless (1751) wherein the male hero Mr. Trueworth quotes a quatrain from Shakespeare that no scholars have subsequently been able to locate within the canonical Shakespearean corpus. “How dear,” Trueworth says, “ought a woman to prize her innocence! — as Shakespeare says,

They all are white,—a sheet
Of spotless paper, when they first are born;
But they are to be scrawl'd upon, and blotted
By every goose-quill” (463).

Christine Blouch, the editor of the Broadview edition of Betsy Thoughtless I was reading, attached a footnote to this quatrain that simply stated “Not Shakespeare. Source unidentified” (463). This footnote led my imagination to run wild—I wondered, could this poetic stanza be a fragment from one of Shakespeare's Lost Plays? Might it be the key that will unlock some of the grand mysteries behind the Shakespeare apocrypha?

While I soon learned the answer to these questions is a resounding “No” (the quatrain is not by Shakespeare, but by William Congreve), my interest in Haywood's use of intertextuality only continued to mature. In Betsy Thoughtless alone I found the better part of two dozen “quotations” such as the “Shakespeare” quotation above for which previous scholars were unable to identify sources. Curious to see how well plagiarism detection routines could identify the missing sources, I set out to uncover the materials that informed Haywood's work.

I began by selecting a handful of “quotations” on which to focus my attention. Of the twenty or so instances of intertextuality whose sources were unidentified in my edition of Betsy Thoughtless, I selected the twelve that I thought were most interesting. These were passages like:

"When puzzling doubts the anxious bosom seize,
to know the worst is some degree of ease" (51)

Haywood often introduces such passages with phrases like, “As the poet says,” or “I remember to have read somewhere,” which I took to be indications that the passages were based on extant texts of Haywood's day. I therefore typed up the dozen unsourced quotations I had selected and fed them to the Literature Online API. The API then broke the quotations into sequences of three words and sent each of those three word chunks to the Literature Online database. This procedure generated a spreadsheet of text data pictured in the following image, which I subsequently scoured for the sources of Haywood's passages:

A sample comma-separated-value file produced by the Literature Online API described below.

Using this data, I first tried to get a feel for the individuals whose prose most closely resembles the unsourced passages. Using some Python scripts, I counted up the number of times each writer in the LION database had a trigram (or series of three words) that matched one of the trigrams in the quotes I had selected from Haywood's novel. Here are the authors whose texts shared the greatest number of trigrams with the selected passages:

Authors whose work most resembled the selected trigrams from Haywood's Betsy Thoughtless. The birth and death dates of authors identified herein are taken from the Literature Online database.
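The trigram matching and per-author tally described above can be sketched locally. This is a simplified stand-in and not the Literature Online workflow itself: it compares a quotation against a small in-memory dictionary of texts, the function names are mine, and the second "author" entry is a made-up placeholder. The two Congreve-related lines are taken from the examples discussed later in this post.

```python
import re
from collections import Counter

def trigrams(text):
    """Return the set of three-word sequences in a text, ignoring punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}

def count_matches(quotations, texts_by_author):
    """Count, per author, the trigrams shared with any of the quotations."""
    targets = set()
    for quotation in quotations:
        targets |= trigrams(quotation)
    counts = Counter()
    for author, texts in texts_by_author.items():
        for text in texts:
            counts[author] += len(targets & trigrams(text))
    return counts

quotations = ["There is no wonder, or else all is wonder"]
texts_by_author = {
    "William Congreve": ["There are no wonders, or else all is wonder."],
    "Placeholder Author": ["An unrelated line with nothing in common."],
}
counts = count_matches(quotations, texts_by_author)
print(counts.most_common())
```

Raw counts like these favor prolific authors, which may be one reason a chart of top matches does not line up neatly with the actual sources.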

These numbers were interesting—who would have thought Edward Ward would occupy the pole position? In the end, though, I found that very few of the sources for Haywood's identified literary borrowings appear in this chart. To discover this fact, I began by writing a few scripts that could loop over the text data pictured above. Using those routines, I soon discovered that in many cases, Haywood's references to other literary works were fairly straightforward, which made the computing easy. We could take for instance the pseudo-Shakespearean passage cited above:

“They all are white,—a sheet
Of spotless paper, when they first are born;
But they are to be scrawl'd upon, and blotted
By every goose-quill” (463)

Much to my chagrin, these lines led me not to a now-forgotten Shakespearean play but to William Congreve's Love for Love (1695), where one reads: "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill." (Of course Congreve could also be quoting a forgotten Shakespearean work, but this is another story.)

Other passages in Betsy Thoughtless had similarly straightforward sources. Using my procedure, I was able to find sources for the following passages in Haywood's text:

The Patriarch, to gain a wife
Chaste, beautiful, and young,
Serv'd fourteen years, a painful life,
And never thought it long.
Oh! were you to reward such cares,
And life so long would stay,
Not fourteen, but four hundred years,
Would seem but as one day (153)

These lines are from "The Perfection," a song published at least as early as 1726 in The Hive: A Collection of the Most Celebrated Songs, and one which Robert Burns quoted with delight some years thereafter. Another passage:

“All saw her spots but few her brightness took” (224)

was adapted from the 1677 play that made Nathaniel Lee's career, namely Alexander the Great, where the titular character boasts “All find my spots, but few observe my brightness.” Next up:

“That faultless form could act no crime,
But heav'n, on looking on it, must forgive” (280)

This passage draws from John Dryden's play The Spanish Friar (1681): “So wondrous fair, you justifie Rebellion: As if that faultless Face could make no Sin, But Heaven, with looking on it, must forgive.” The next unsourced passage,

“There is no wonder, or else all is wonder” (285)

is adapted from a remark of William Congreve's in The Mourning Bride (1697): “There are no wonders, or else all is wonder.” Let's turn to another:

“Young Philander woo'd me long,
I was peevish, and forbad him;
I would not hear his charming song,
But now I wish, I wish I had him” (289)

Here Haywood recites a popular song of the day, one that made its way into Charles Johnson's The Village Opera (1729):

An air from Charles Johnson's "The Village Opera" that Eliza Haywood cites in Betsy Thoughtless.

My script also uncovered the fact that George Lillo references the song in his 1731 play Silvia (and subsequent searches helped me find that Purcell set the song to music!). Next:

“Ingratitude's the sin, which, first or last,
Taints the whole sex; the catching court-disease” (322)

Mad man Nathaniel Lee wrote similar lines in his play Mithridates (1678): “Inconstancy, the Plague that first or last Taints the whole Sex, the catching Court-disease.” The last passage for which I found a straightforward source runs as follows:

“I, like the child, whose folly prov'd its loss,
Refus'd the gold, and did accept the dross” (602)

Here George Etherege's Comical Revenge, or Love in a Tub (1664) appears to be the source: “I, like the child, whose folly proves his loss, Refus'd the gold, and did accept the dross.” Using natural language processing techniques and the wonderful data provided by LION, identifying these sources took little time at all.
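For readers who want to try something similar, here is a minimal sketch of one way such near-verbatim borrowings can be surfaced: by checking which runs of three consecutive words two texts have in common. It is not the script I used, and the function names (tokenize, ngrams, shared_ngrams) are hypothetical; a real search would run each passage against a whole corpus of candidate texts rather than a single known source.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n=3):
    """Return the set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shared_ngrams(passage, candidate, n=3):
    """Return the word n-grams that occur in both texts."""
    return ngrams(tokenize(passage), n) & ngrams(tokenize(candidate), n)

# The Haywood couplet and the Etherege lines quoted above.
haywood = ("I, like the child, whose folly prov'd its loss, "
           "Refus'd the gold, and did accept the dross")
etherege = ("I, like the child, whose folly proves his loss, "
            "Refus'd the gold, and did accept the dross")

matches = shared_ngrams(haywood, etherege)
for gram in sorted(matches):
    print(" ".join(gram))
```

Any candidate text that shares at least one trigram with a passage would then be worth reading by hand, since common phrases can produce matches that have nothing to do with borrowing.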

While the previous set of intertextual references was closely patterned on a variety of source texts, some passages that Haywood attributes to other writers are much less straightforward. Indeed, it seems she often combined lines from disparate literary works in order to forge her own ideas. Take, for example, the following passage:

Pleas'd with destruction, proud to be undone,
With open arms I to my ruin run,
And sought the mischiefs I was bid to shun;
Tempted that shame a virgin ought to dread,
And had not the excuse of being betray'd (111)

Like other instances of intertextuality in Haywood's writing, this passage seems to derive from multiple sources. The first line echoes the poet and doctor Richard Blackmore's “Advice to the Poets” (1718), where Blackmore writes “Let them this gen'rous Resolution own, / That they are pleas'd and proud to be undone.” The second and third lines of Haywood's passage appear to borrow from two sources: Mary Wortley Montagu's “The Basset Table” (1716), where one finds the lines “I know the bite, yet to my ruin run, / And see the folly which I cannot shun,” and posthumously published lines from “The Excursion of Fancy: A Pindaric Ode” (1753) by Aaron Hill (1685-1750): “Let us throw down this load of doubt, with which no race is won: / And, swift, to easier conquests, lighter, run, / The way, which reason is not bid to shun!” Another synthetic creation of Haywood's that I spent some time analyzing runs as follows:

When puzzling doubts the anxious bosom seize,
To know the worst is some degree of ease (51)

The first line of this couplet pulls from a line in Joseph Mitchell's “Poems on Several Grave and Important Subjects”: “When puzling Doubts invade my Breast, And I am cloath'd in Shades of Night . . .”, while the second inverts a line from David Mallet's Eurydice (1731): “When others too are miserable, not to know the worst is some degree of bliss.” In this passage, as in others, Haywood brings a variety of extant literary works to bear on her own project in fascinating and unpredictable ways.

* * *

Tracing the sources of these passages was helpful, not least because it allowed me to get a better sense of the ways writers like Haywood engaged with the texts of their age. For instance, using the data I gathered while tracing the sources of the passages above, I began considering new ways to optimize my plagiarism detection routines. Consider the following chart:

This graph indicates that, of the passages in Haywood's novel for which I was able to find sources, all of those passages shared at least three identical words in identical order with the texts they paraphrase. Roughly 80% of the instances of plagiarism I analyzed had at least four identical words in identical order with their source texts, ~55% had at least five equivalent words in equivalent sequential order, and so on. The lesson embedded in this chart is perhaps predictable: The greater the number of identical words one demands in order to identify one language act as a paraphrase of another, the greater the number of false negatives one can expect in one's analysis. As I noted above, my study of intertextuality in Haywood's writing was carried out using trigrams as the unit of analysis. That is to say, I expected a passage from Haywood's text to share at least three identical words in identical order with the text it paraphrases. While this seemed a relatively low condition for an instance of plagiarism to satisfy, it might have actually been too demanding a condition, because I was only able to find sources for ten of the twelve passages I set out to study. Perhaps a study using bigrams as the unit of analysis—or perhaps one of you—can identify the source of the remaining two quotations:

"Away with this idle, this scrupulous fear,
For a kiss in the dark,
Cry'd the amorous spark,
There is nothing, no nothing, too dear" (311)

* * *

"Unequal lengths, alas! our passions run,
My love was quite worn out, e'er yours begun" (462)
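
The trade-off between the matching threshold and false negatives described above can be checked with a few lines of code. What follows is only a sketch under stated assumptions, not the routine behind my chart: it measures the longest run of identical words in identical order shared by a passage and its source, then reports what fraction of known borrowings a given threshold would catch. The helper names (words, longest_shared_run, share_detected) are hypothetical, and the two pairs below are just two of the ten sourced passages discussed earlier, so the printed fractions do not reproduce the chart's percentages.

```python
import re
from difflib import SequenceMatcher

def words(text):
    """Lowercase the text and return its word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())

def longest_shared_run(passage, source):
    """Length of the longest run of identical words in identical order."""
    a, b = words(passage), words(source)
    # autojunk=False so that frequent words are never discarded as "junk".
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size

def share_detected(run_lengths, n):
    """Fraction of known borrowings that an n-word threshold would catch."""
    return sum(1 for r in run_lengths if r >= n) / len(run_lengths)

# Two passage/source pairs quoted earlier in this post (Lee and Congreve).
pairs = [
    ("All saw her spots but few her brightness took",
     "All find my spots, but few observe my brightness."),
    ("There is no wonder, or else all is wonder",
     "There are no wonders, or else all is wonder."),
]

runs = [longest_shared_run(passage, source) for passage, source in pairs]
for n in range(2, 7):
    print(n, share_detected(runs, n))
```

The Lee pair shares only "spots but few," so it survives a trigram threshold but would be missed at four words, which is exactly the kind of false negative the chart describes.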