Fuzzy String Matching and Intellectual History

Recently, I have become intrigued by a research question posed by a friend and fellow scholar: How can we algorithmically identify the phrases and concepts that a given writer has borrowed from previous writers?

I began considering this question seriously during EMDA, and only grew more fascinated as others expressed interest in the problem. During conversations with Jake Halford, for instance, I started to imagine the things one might do with a tool that could algorithmically identify the moments in the eighteenth-century corpus in which writers borrowed concepts and phrases from the Philosophical Transactions. Later, Michael Whitmore helped me to realize that one could use the same tool to trace the degree to which post-Shakespearean writers borrowed from the Bard. It began to seem like a tool that could machine classify instances of similarity within a large corpus could have a wide range of research applications.

Provoked by these discussions, I wrote a fuzzy string matching script in Python that attempts to carry out this task. In its current form, the script compares every sequence of m words from a single "target" text with every sequence of p words from a directory of "source" texts, looking for instances of "similarity" between the two sequences. Say, for instance, you want to know whether the Declaration of Independence contains passages that resemble passages from extant texts of the period. You could load the Declaration into the target directory, load a collection of other texts into the source directory, and allow the computer to search for instances of similarity. Right now, two sequences of words are identified as similar if they have at least n words in common, where n is an integer threshold set by the user. Say the script were comparing the following two strings:

  1.  "We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness."
  2. "That all men are by nature equally free and independent and have certain inherent rights, of which, when they enter into a state of society, they cannot, by any compact, deprive or divest their posterity; namely, the enjoyment of life and liberty"

Of the 42 words in the second passage, which comes from George Mason's Virginia Declaration of Rights, 19 appear in the passage from the Declaration of Independence. Thus if n ≤ 19, the script will record these two strings as an instance of similarity, and will print them to the output file.
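
For readers who want something concrete, the comparison described above can be sketched roughly as follows. This is not the script itself, only a minimal illustration under a few assumptions: the function names, the regular-expression tokenizer, and the choice to lowercase words and to count repeated words each time they occur are all invented here. The exact number of shared words reported for any pair of passages will depend on choices like these.

```python
import re

def tokenize(text):
    """Lowercase the text and return its words, dropping punctuation.
    (Assumed tokenizer: hyphenated words such as 'self-evident' stay whole.)"""
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())

def windows(words, size):
    """Yield every run of `size` consecutive words.
    If the text is shorter than `size`, yield the whole text once."""
    if len(words) <= size:
        yield words
        return
    for i in range(len(words) - size + 1):
        yield words[i:i + size]

def shared_word_count(a, b):
    """Count the words in `b` that also occur somewhere in `a`.
    Repeated words in `b` are counted each time they appear."""
    vocabulary = set(a)
    return sum(1 for word in b if word in vocabulary)

def find_similar(target_text, source_text, target_len, source_len, threshold):
    """Compare every target window with every source window and return
    (target_window, source_window, count) for pairs sharing at least
    `threshold` words."""
    matches = []
    target_words = tokenize(target_text)
    source_words = tokenize(source_text)
    for t in windows(target_words, target_len):
        for s in windows(source_words, source_len):
            count = shared_word_count(t, s)
            if count >= threshold:
                matches.append((" ".join(t), " ".join(s), count))
    return matches

# Short excerpts from the two passages quoted above.
jefferson = "that all men are created equal"
mason = "That all men are by nature equally free"
print(find_similar(jefferson, mason, target_len=6, source_len=8, threshold=4))
```

With a threshold of 4, the two excerpts are reported as a match, since "that," "all," "men," and "are" occur in both. Raising the threshold to 5 returns nothing, because this exact-match version does not treat "equally" as a form of "equal." Note too that comparing every window against every other window grows quickly with the size of the corpus, so a sketch like this is only practical for small collections.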

As I continue working on the algorithm, I plan to add a number of other features that will allow it to employ even more flexible matching routines. In the first place, I plan to use one of the word stemmers available for Python (the NLTK library, for instance, provides several), so that the script can identify the word "equally" in Mason's document as a form of the word "equal" in Jefferson's. My hope is that stemming will help catch instances of similarity that the current script misses.
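
To illustrate what stemming would buy, here is a small sketch that compares words by their stems rather than by their exact spelling. The crude_stem function is a deliberately toy suffix-stripper invented for this example so that the code runs without any extra libraries. A real stemmer, such as the Porter stemmer that ships with NLTK, would handle far more cases and would be the sensible choice in practice.

```python
def crude_stem(word):
    """Strip one of a few common English suffixes.
    A toy stand-in for a real stemmer; it will mangle many words."""
    word = word.lower()
    for suffix in ("ly", "ing", "ed", "s"):
        # Keep at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def shared_stem_count(a_words, b_words):
    """Count the words in `b_words` whose stem matches the stem of
    some word in `a_words`."""
    stems = {crude_stem(w) for w in a_words}
    return sum(1 for w in b_words if crude_stem(w) in stems)

jefferson = ["created", "equal"]
mason = ["equally", "free"]
print(crude_stem("equally"))                # equal
print(shared_stem_count(jefferson, mason))  # 1
```

Without stemming, these two word lists share nothing. With it, "equally" and "equal" reduce to the same stem and count as one shared word.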

After incorporating the stemmer, I plan to build a synonym lookup function into the routine: Given a word in a string from the current source file, query an API to pull down a list of synonyms of that word. Then stem those synonyms and compare the list of stemmed synonyms to the stemmed words in the current string of the target file. If those two strings have n words in common, flag them as an instance of similarity. With this even fuzzier method of identifying similarity, I'm hoping to be able to begin tracing not just patterns of expression, but patterns of ideas. Whether such a method will be too fuzzy to be useful, though, remains to be seen.
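
The synonym step might be sketched along the following lines. Everything here is hypothetical: no particular thesaurus API is assumed, so a small hand-written dictionary (TOY_THESAURUS, with entries invented purely for illustration) stands in for the network lookup, and the same toy crude_stem suffix-stripper stands in for a real stemmer.

```python
# Hypothetical stand-in for a thesaurus API. The entries are invented
# for illustration; a real lookup would query an external service.
TOY_THESAURUS = {
    "inherent": ["unalienable", "innate"],
}

def crude_stem(word):
    """Toy suffix-stripper standing in for a real stemmer."""
    word = word.lower()
    for suffix in ("ly", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def expand(words, thesaurus):
    """Return the stems of the given words together with the stems
    of every synonym the thesaurus lists for them."""
    stems = set()
    for word in words:
        stems.add(crude_stem(word))
        for synonym in thesaurus.get(word.lower(), []):
            stems.add(crude_stem(synonym))
    return stems

def fuzzy_shared_count(source_words, target_words, thesaurus):
    """Count the target words whose stem matches a source word
    or one of that source word's synonyms."""
    expanded = expand(source_words, thesaurus)
    return sum(1 for w in target_words if crude_stem(w) in expanded)

source = ["certain", "inherent", "rights"]      # from Mason
target = ["certain", "unalienable", "Rights"]   # from the Declaration
print(fuzzy_shared_count(source, target, {}))             # 2
print(fuzzy_shared_count(source, target, TOY_THESAURUS))  # 3
```

With an empty thesaurus, only "certain" and "rights" match. Once "inherent" is expanded to include "unalienable," all three target words count as shared, which is exactly the kind of looser, idea-level match the synonym step is meant to surface. It also shows the risk: the broader the synonym lists, the easier it becomes for unrelated passages to cross the threshold.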