Working notes & digital experiments

http://douglasduhaime.com

Making Chiptunes with Markov Models

Over the last year or so, several curious circumstances sent me down the rabbit hole of algorithmic music composition. First an intriguing question on classifying genuine and fake piano rolls, then a brilliant professor writing an opera on the life of Alan Turing, and finally a gifted graduate student asking probing questions about OpenAI’s VQ-VAE model all made me increasingly interested in generating music with machine learning. After I shared some early results from my explorations, a few friends were interested in learning more. This post is my attempt to share some of the paths I’ve been pursuing, and to lay out some relatively easy ways to get started with automatic music generation.

To keep things as simple as possible, the post below describes how one can use basic Markov models to generate MIDI audio. We’ll first examine how Markov models work by building a simple text generation model in a dozen or so lines of Python. Then we’ll discuss how one can convert MIDI data to text sequences, which will let us use the same Markov model approach to generate MIDI audio. Finally, to spice things up a bit, we’ll convert our generated MIDI files into chiptune waveform audio with a disco dance beat. Let’s dive in!

Building Markov Models

While the term “Markov model” is used to describe a wide range of statistical models, essentially all Markov models follow a simple basic rule: the model generates a sequence of outputs, and each element in the sequence is conditioned only on the prior element in the sequence. Given a single word, a Markov model can predict the next word in the sequence. Given a pixel, a Markov model can predict the next pixel in the sequence. Given an item in a sequence, a Markov model can predict the next item in the sequence. As an example, let’s build a Markov model that can accomplish a simple text generation task.
Our goal will be to train a model using the plays of William Shakespeare, then to use that model to generate new pseudo-Shakespearean play text. We’ll train our model using tiny-shakespeare.txt, a single file that contains raw text from Shakespeare’s plays. Here are the first few lines from the file:

    First Citizen:
    Before we proceed any further, hear me speak.

    All:
    Speak, speak.

    First Citizen:
    You are all resolved rather to die than to famish?

    All:
    Resolved. resolved.

As you can see, the text follows a regular format in which a character’s name immediately precedes their speech. To help our model recognize these character speech boundaries, let’s add START and END tokens before and after each speech, like so:

    from collections import defaultdict

    # read the file
    text = open('tiny-shakespeare.txt').read()

    # add START before each speech and END after
    formatted = text.replace('\n\n', ' END \n\nSTART ')

    # create the full training data string
    training_data = 'START ' + formatted + ' END'

If you print training_data, you’ll see that it includes the word START before each speech and the word END after each speech:

    START First Citizen:
    Before we proceed any further, hear me speak. END

    START All:
    Speak, speak. END

    START First Citizen:
    You are all resolved rather to die than to famish? END

    START All:
    Resolved. resolved. END

Those START and END tokens will help our model know what a proper speech looks like so it can create new speeches that have the same format as our training speeches.

Having prepared the training data, we can now train our model. To do so, we just need to build up a dictionary in which we map each word to the list of words by which it is followed. For example, given the sequence 1 2 1 3, our dictionary would look like this: {1: [2,3], 2: [1]}. This dictionary tells us that the value 1 is followed by 2 and 3, while the value 2 is followed by only 1. The value 3 is not followed by anything, because it’s the last token in our sequence.
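To make the toy example concrete, here is a minimal sketch that builds the dictionary for the sequence 1 2 1 3. The helper name build_transitions is hypothetical and introduced only for illustration; it is not part of the code used in the rest of the post.

```python
from collections import defaultdict

def build_transitions(tokens):
    """Map each token to the list of tokens that follow it (duplicates kept)."""
    transitions = defaultdict(list)
    # pair each token with its successor; the last token has no successor
    for current, following in zip(tokens, tokens[1:]):
        transitions[current].append(following)
    return dict(transitions)

toy = build_transitions([1, 2, 1, 3])
print(toy)  # {1: [2, 3], 2: [1]}
```

The value 3 never appears as a key, which matches the observation that the last token in a sequence is not followed by anything.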
Let’s build this dictionary using our Shakespearean text data:

    from collections import defaultdict

    # split our training data into a list of words; splitting on any
    # whitespace keeps START and END as standalone tokens
    words = training_data.split()

    # next_words will store the list of words that follow a word
    next_words = defaultdict(list)

    # examine each word up to but not including the last
    for word_index, word in enumerate(words[:-1]):
        # indicate that the current word is followed by the next
        next_words[word].append( words[word_index+1] )

That’s all it takes to train a Markov model! If we examine next_words, we’ll find that it maps each key to the list of words by which it is followed. The values of this dictionary contain duplicates by design. If the word “to” is followed by the word “be” often but is only followed by the word “suffer” once, then given the word “to”, our model should be more likely to predict “be” than “suffer”. In slightly fancier parlance, our next_words dictionary represents the weighted probabilities of a particular word following another particular word. To generate new sequences, we’ll simply sample from those weighted probabilities and piece together a sequence of text word by word.

Now for the fun part. Let’s use the model to generate new speeches. To do so, we’ll run the following loop 100 times. First we’ll randomly select a word that follows the START token. Character names always follow the START token, so the first word in each speech will contain a character’s name. Then we’ll use next_words to randomly select one of the words that appears after the selected character’s name. For example, if our selected character is “Claudius”, in this step we’ll randomly select one of the words that immediately follows the word “Claudius” (i.e. one of the words with which Claudius begins one of his speeches). Then we’ll randomly sample a word that follows that last word. We’ll carry on in this way until we hit an END token, at which point we conclude the speech.
We can implement this operation in code as follows:

    import random

    # generate 100 samples from the model
    for i in range(100):
        # initialize a string that will store our output
        output = ''
        # select a random word that follows START
        word = random.choice(next_words['START'])
        # continue selecting the next word until we hit the END marker
        while word != 'END':
            # add the current word to the output
            output += word + ' '
            # get the next word in the sequence
            word = random.choice(next_words[word])
        # display the output
        print(output.strip(), '\n')

That’s all it takes to sample from a Markov model! The output of that block should read like a collection of lunatics muttering Shakespearean nonsense:

    BUCKINGHAM: Upon the way to see. I have vouchsafed, With your promise pass'd: I do you confine yourself desired of your favour, I do foretell of mine.

    MARCIUS: May these men should say you must change this second Grissel, And Roman camp conduct him with your soul

The generated text looks pseudo-Shakespearean! Next let’s see if we can train some Markov models that generate musical expressions.

Making Music with Markov Models

As it turns out, we can use essentially the same strategy we used above to generate music with Markov models. To do so, we just need to convert an audio file into a text file. To accomplish this goal, we can parse a midi file and convert each note in the file into a word. The fantastic music21 library in Python, written by Michael Scott Cuthbert’s lab at MIT, makes this task fairly straightforward. We can install music21 and all the dependencies we’ll use below as follows:

    pip install music21==7.1.0
    pip install nltk==3.6.2
    pip install pretty-midi==0.2.9
    pip install scipy==1.4.0
    pip install https://github.com/duhaime/nesmdb/archive/python-3-support.zip

After installing music21, we can use the function below to convert ambrosia.midi (a charming melody from the 8-bit Nintendo game Ultima III) into a string.
Here’s the midi file, and here’s how we’ll convert it into a string:

    from music21.note import Note
    import music21

    def midi_to_string(midi_path):
        # parse the musical information stored in the midi file
        score = music21.converter.parse(
            midi_path, # set midi file path
            quantizePost=True, # quantize note length
            quarterLengthDivisors=(4,3)) # set allowed note lengths
        # s will store the sequence of notes in string form
        s = ''
        # keep a record of the last time offset seen in the score
        last_offset = 0
        # iterate over each note in the score
        for n in score.flat.notes:
            # measure the time between this note and the previous
            delta = n.offset - last_offset
            # get the duration of this note
            duration = n.duration.components[0].type
            # store the time at which this note started
            last_offset = n.offset
            # if some time elapsed, add a "wait" token
            if delta: s += 'w_{} '.format(delta)
            # add tokens for each note (or each note in a chord)
            notes = [n] if isinstance(n, Note) else n.notes
            for i in notes:
                # add this keypress to the sequence
                s += 'n_{}_{} '.format(i.pitch.midi, duration)
        return s

    s = midi_to_string('ambrosia.midi')

The block above turns ambrosia.midi into the string s. Within that string, each note in ambrosia.midi is represented by a token that begins with “n_” and each pause between notes is represented by a token that begins with “w_”. If we print s we can see the string representation of our MIDI data more clearly:

    >>> print(s)
    w_1.0 n_65_quarter n_38_half w_0.5 n_62_eighth ...

This string indicates that the file begins with a rest that lasts one quarter note (music21 measures offsets in quarter-note lengths). Next we play notes 65 and 38 as a quarter note and a half note respectively, then wait half a quarter-note length, then play note 62 as an eighth note, and so on. In this way, using just two token types (“n_” tokens and “w_” tokens) we can record each keystroke that should be played as well as the durations of time between those keystrokes. We leave wait durations in fractional form (for example w_1/3) to prevent floating point truncation.
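As a rough illustration of why the fractional form is convenient, the sketch below parses the two token types back into numbers. The helper parse_token is hypothetical, and it assumes wait tokens look like w_1.0 or w_1/3, which is how str.format renders a float or a Fraction.

```python
from fractions import Fraction

def parse_token(token):
    """Split an 'n_<pitch>_<duration>' or 'w_<delta>' token into its parts."""
    kind, _, rest = token.partition('_')
    if kind == 'n':
        pitch, duration = rest.split('_')
        return ('note', int(pitch), duration)
    if kind == 'w':
        # Fraction accepts both decimal strings ('0.5') and ratios ('1/3')
        return ('wait', Fraction(rest))
    raise ValueError('unrecognized token: {}'.format(token))

for token in 'w_1.0 n_65_quarter w_1/3 n_62_eighth'.split():
    print(parse_token(token))
```

Parsing the wait value with Fraction keeps a triplet offset such as 1/3 exact until the moment it has to be converted to a float.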
To test if this conversion worked, let’s reverse the process and convert the string s into a new midi file. If both conversions were successful, we should expect that new midi file to sound like the original ambrosia.midi file. Happily music21 makes the conversion from string to midi straightforward as well:

    from fractions import Fraction

    def string_to_midi(s):
        # initialize the sequence into which we'll add notes
        stream = music21.stream.Stream()
        # keep track of the last observed time
        time = 1
        # iterate over each token in our string
        for i in s.split():
            # if the token starts with 'n' it's a note
            if i.startswith('n'):
                # identify the note and its duration
                note, duration = i.lstrip('n_').split('_')
                # create a new note object
                n = music21.note.Note(int(note))
                # specify the note's duration
                n.duration.type = duration
                # add the note to the stream
                stream.insert(time, n)
            # if the token starts with 'w' it's a wait
            elif i.startswith('w'):
                # add the wait duration to the current time
                time += float(Fraction(i.lstrip('w_')))
        # return the stream we created
        return stream

    midi = string_to_midi(s)

As you can see, the block above simply reverses the operations performed in midi_to_string, converting each token into a midi note. The resulting midi file should indeed sound like the midi with which we started.

Now we’re rolling! From here, all we have to do is train a Markov model on the string representation of our MIDI file.
To do so, let’s transform the Markov model we used above into a reusable function:

    from collections import defaultdict
    from nltk import ngrams
    import random

    def markov(s, sequence_length=6, output_length=250):
        # train markov model
        d = defaultdict(list)
        # make a list of tuples, each of which contains a sequence of words
        tokens = list(ngrams(s.split(), sequence_length))
        # store the map from a token to its following tokens
        for idx, i in enumerate(tokens[:-1]):
            d[i].append(tokens[idx+1])
        # sample from the markov model
        l = [random.choice(tokens)]
        while len(l) < output_length:
            l.append(random.choice(d.get(l[-1], tokens)))
        # format the result into a string
        return ' '.join([' '.join(i) for i in l])

    # sample a new string from s then convert that string to midi
    generated_midi = string_to_midi(markov(s))
    # save the midi data in "generated.midi"
    generated_midi.write('midi', 'generated.midi')

If we run the markov function we’ll get a new string that contains a sequence of notes expressed in text form. We can then convert that string to a proper midi file using the string_to_midi function we defined above. The result sounds like a pair of drunken sailors wailing away on a piano.

The good news is if you don’t like that audio you can just rerun the markov function until you get a keeper. Before you banish our sample, though, let’s try pushing it through the chiptune meat grinder we’ll write below.

Markov Models Meet Chiptunes

Chris Donahue, a brilliant postdoc in Computer Science at Stanford University, accomplished the Herculean task of converting the original 8-bit Nintendo synthesizer, or “audio processing unit”, into a simple API exposed in the Python package nesmdb. Nesmdb exports a function midi_to_wav that converts a midi file into nostalgic 8-bit audio that captures the raw energy of the original NES soundtracks. In what follows below, we’ll use that function to convert a midi file to chiptune waveform audio.
    from pretty_midi import Instrument as Tone
    from nesmdb.convert import midi_to_wav
    from music21.note import Note
    import pretty_midi, math, music21
    import scipy.io.wavfile

    def midi_to_nintendo_wav(midi_path, length=None, scalar=0.3):
        # create a list of tones and the time each is free
        tones = [Tone(0, name=n) for n in ['p1', 'p2', 'tr', 'no']]
        for t in tones: t.free = 0
        # get the start and end times of each note in `midi_path`
        score = music21.converter.parse(midi_path)
        for n in score.flat.notes[:length]:
            for i in [n] if isinstance(n, Note) else n.notes:
                start = n.offset * scalar
                end = start + (n.seconds * scalar)
                # identify the index position of the first free tone
                tone_index = None
                for index, t in enumerate(tones[:3]):
                    if t.free <= start:
                        if tone_index is None: tone_index = index
                        t.free = 0
                # skip this note if every melodic track is busy
                if tone_index is None: continue
                tones[tone_index].free = end
                # play the midi note using the selected tone
                tones[tone_index].notes.append(pretty_midi.Note(
                    velocity=10,
                    pitch=i.pitch.midi,
                    start=start,
                    end=end))
        # add drums: 1 = kick, 8 = snare, 16 = high hats
        for i in range(math.ceil(end * 8)):
            tones[3].notes.append(pretty_midi.Note(
                velocity=10,
                pitch=1 if (i%4) == 0 else 8 if (i%4) == 2 else 16,
                start=(i/2 * scalar),
                end=(i/2 * scalar) + 0.1))
        midi = pretty_midi.PrettyMIDI(resolution=22050)
        midi.instruments.extend(tones)
        # store midi length, convert to binary, and then to wav
        time_signature = pretty_midi.TimeSignature(1, 1, end)
        midi.time_signature_changes.append(time_signature)
        midi.write('chiptune.midi')
        return midi_to_wav(open('chiptune.midi', 'rb').read())

    # convert our generated midi sequence to a numpy array
    wav = midi_to_nintendo_wav('generated.midi')
    # save the numpy array as a wav file
    scipy.io.wavfile.write('generated.wav', 44100, wav)

The original NES synthesizer supported five concurrent audio tracks: two pulse-wave tracks (“p1”, “p2”), a triangle-wave track (“tr”), a noise track (“no”), and a sampling track that’s not implemented in nesmdb.
In the function above, we simply assign each note from the input midi file to the first unused track in our synthesizer (excluding the “no” track, which is assigned a dance beat later in the function). There are certainly more clever ways to assign notes to the synthesizer tracks, but we’ll use this approach for the sake of simplicity. Here are some sample results: If you’re curious to try making your own audio, feel free to try this Colab notebook, which will download the ambrosia.midi file and process it using the steps discussed above. There’s a cell in that notebook to make it easier to upload custom MIDI files for processing as well. Going Further with Markov-Generated MIDI The foregoing discussion is meant only to serve as a relatively straightforward way to get started generating audio with Markov models. We’ve barely scratched the surface of what’s possible though. If you get interested in automatic music generation, you might want to experiment with more sophisticated text sampling techniques, such as an LSTM Network or a transformer model like GPT2. It could also be interesting to train your model with a larger collection of data, such as Colin Raffel’s Lakh MIDI Dataset (possibly stripping the drum tracks and transposing each training file to a common relative major to prevent overfitting). If you generate some fun audio using some of these techniques, please feel free to get in touch! I’d love to hear from you. * * * I would like to thank Christine Mcleavey, whose Clara project first introduced me to the idea of transforming MIDI files into text data, and Professor Matthew Suttor in Yale’s School of Drama, whose opera I Am Alan Turing has inspired me to continue pursuing algorithmic music composition. 
Tue, 02 Nov 2021 00:00:00 -1000 http://douglasduhaime.com/posts/making-chiptunes-with-markov-models.html

Visualizing Autoencoders with Tensorflow.js

NB: This page loads several neural network models and large image atlases which may take time to load on a mobile device.

An autoencoder is a type of neural network that is composed of two functions: an encoder that projects data from high to low dimensionality, and a decoder that projects data from low to high dimensionality. To understand how these two functions work, let’s consider the following images:

Since each of these images is 28 pixels by 28 pixels, we can consider each as a 784-dimensional vector (or list of numbers), and can construct an autoencoder in which the encoder projects these 784-dimensional vectors to a two-dimensional vector, just as one might perform dimension reduction using UMAP or TSNE:

The visualization above shows the ways UMAP, TSNE, and the encoder from a vanilla autoencoder reduce the dimensionality of the popular MNIST dataset from 784 to 2 dimensions. Click a button to change the layout, or scroll in to see how images with similar shapes (e.g. 8 and 3) appear proximate to one another in the two-dimensional embedding.

While the encoder reduces the dimensionality of input data, the decoder projects samples from low dimensionality back to higher dimensionality. For example, if one constructs a decoder that projects data from 2 dimensions to 784 dimensions, it becomes possible to project arbitrary positions in a two-dimensional plane into a 784-pixel image. Click around in the figure below to see how a decoder projects from 2 to 784 dimensions.
Note that you can click in areas where there are no samples and the decoder will still generate an image:

The visualization above shows the ways the decoder from a vanilla autoencoder projects data from a two-dimensional embedding to a 784-dimensional image shown in color in the lower-right. Click different positions to see how the decoder translates a 2D vector (or pair of x,y coordinates) into an image.

An autoencoder is a neural network that combines the encoder and decoder discussed above into a single model that projects input data to a lower-dimensional embedding (the encode step), and then projects that lower-dimensional data back to a high-dimensional embedding (the decode step). The goal of the autoencoder is to update its internal weights so that it can project an input vector to a lower dimensionality, then project that low-dimensional vector back to the input vector shape in such a way as to produce an output vector that closely resembles the input vector. One can see a visual diagram of the autoencoder model architecture, and see how the autoencoder’s projections improve with training, by interacting with the figure below:

The figure above shows the model architecture of an autoencoder. Click the "Train" button to improve the autoencoder's reconstruction of input images, and click the "Sample" button to show how the model reconstructs a new sample image.

Having discussed the purpose and basic components of an autoencoder, let’s now discuss how to create autoencoders using the Keras framework in Python.

Building Autoencoders with Keras

To build a custom autoencoder with the Keras framework, we’ll want to start by collecting the data on which the model will be trained. To keep things simple and moderately interesting, we’ll use a collection of images of celebrity faces known as the CelebA dataset.
One can download and prepare to analyze 20,000 images from this dataset with the following: from keras.preprocessing.image import load_img, img_to_array import requests, zipfile, glob import numpy as np # download the images to a directory named "celeba-sample" url = 'http://bit.ly/celeba-sample' data = requests.get(url, allow_redirects=True).content open('celeba-sample.zip', 'wb').write(data) zipfile.ZipFile('celeba-sample.zip').extractall() # combine all images in "celeba-sample" into a numpy array read_img = lambda i: img_to_array(load_img(i, color_mode='grayscale')) files = glob.glob('celeba-sample/*.jpg') X = np.array([read_img(i) for i in files]).squeeze() / 255.0 # scale 0:1 Running these lines will create a directory named celeba-sample that contains a collection of 20,000 images with uniform size (218 pixels tall by 178 pixels wide), and will read all of those images into a numpy array X with shape (20000, 218, 178). With this dataset prepared, we’re now ready to define the autoencoder model. 
Happily, the Keras framework makes it possible to define an autoencoder including the encode and decode steps discussed above in roughly 25 lines of code: from keras.models import Model from keras.layers import Input, Reshape, Dense, Flatten class Autoencoder: def __init__(self, img_shape=(218, 178), latent_dim=2, n_layers=2, n_units=128): if not img_shape: raise Exception('Please provide img_shape (height, width) in px') # create the encoder i = h = Input(img_shape) # the encoder takes as input images h = Flatten()(h) # flatten the image into a 1D vector for _ in range(n_layers): # add the "hidden" layers h = Dense(n_units, activation='relu')(h) # add the units in the ith hidden layer o = Dense(latent_dim)(h) # this layer indicates the lower dimensional size self.encoder = Model(inputs=[i], outputs=[o]) # create the decoder i = h = Input((latent_dim,)) # the decoder takes as input lower dimensional vectors for _ in range(n_layers): # add the "hidden" layers h = Dense(n_units, activation='relu')(h) # add the units in the ith hidden layer h = Dense(img_shape[0] * img_shape[1])(h) # one unit per pixel in inputs o = Reshape(img_shape)(h) # create outputs with the shape of input images self.decoder = Model(inputs=[i], outputs=[o]) # combine the encoder and decoder into a full autoencoder i = Input(img_shape) # take as input image vectors z = self.encoder(i) # push observations into latent space o = self.decoder(z) # project from latent space to feature space self.model = Model(inputs=[i], outputs=[o]) self.model.compile(loss='mse', optimizer='adam') autoencoder = Autoencoder() Let’s step through the code above a little. First, we import the building blocks with which we’ll construct the autoencoder from the keras library. Then we define the encoder, decoder, and “stacked” autoencoder, which combines the encoder and decoder into a single model. 
Each of these models is defined inside a single class that takes as input several named parameters which collectively define the hyperparameters that will be used to define the model. The inline comments above detail how each line contributes to the construction of the encoder, decoder, and stacked autoencoder. Now that the autoencoder is defined, we can “train” it by passing observations from the numpy array X through the model. Note that we hold out some images from X to use as validation data: train = X[:-1000] test = X[-1000:] autoencoder.model.fit(train, train, validation_data=(test, test), batch_size=64, epochs=1000) If you run that line, you should see that the model’s aggregate “loss” (or measure of the difference between model inputs and reconstructed outputs) decreases for a period of time and then eventually levels out. Once the loss starts to level out, you can sometimes try to decrease the model’s learning rate and continue training: import keras.backend as K K.eval(autoencoder.model.optimizer.lr) K.set_value(autoencoder.model.optimizer.lr, 0.0001) Once the model’s loss seems to stop diminishing, we can treat the model as trained and ready for action. Analyzing the Trained Autoencoder After training the model, one can analyze the ways the encoder and decoder transform input images. 
In the first place, we can analyze the way the encoder positions each image in the latent space by plotting the 2D positions of each image in the input dataset: import matplotlib.pyplot as plt # transform each input image into the latent space z = autoencoder.encoder.predict(X) # plot the latent space plt.scatter(z[:,0], z[:,1], marker='o', s=0.1, c='#d53a26') plt.show() Examining the latent space, you should find that input images are roughly normally distributed around some central point, as in the following example (note however that your visualization may look slightly different due to the random initialization of the autoencoder’s weights): The plot above represents each image from the CelebA data sample projected into the two-dimensional latent space discovered by the autoencoder. Each point in this plot represents the position of a single image in the latent space. The plot above shows how each image in the input dataset X can be projected into the two-dimensional space created by the middlemost layer of the autoencoder we defined above. What’s arguably more interesting, however, is the decoder’s ability to take positions from that two-dimensional space and project them back up into full-fledged images. Using this ability, one can create new images that are conditioned by, but non-identical to, the input images on which the autoencoder was trained. Let’s visualize some of the outputs from the decoder next. Sampling from the Latent Space Having trained the autoencoder, we can now pick a random location in the two-dimensional latent space and ask the decoder to transform that two-dimensional value into an image: # sample from the region 10, 50 in the latent space import matplotlib.pyplot as plt y = np.array([[10, 50]]) prediction = autoencoder.decoder.predict(y) plt.imshow(prediction.squeeze(), cmap='gray') This will display a face like the following: Sampling from different regions of the latent space will create rather different faces. 
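To compare several regions at once, one option is to decode a whole grid of latent positions rather than a single point. The sketch below only builds the grid of coordinates; the helper latent_grid and the range of -50 to 50 are assumptions chosen for illustration, and the sensible range for your model depends on where the encoder actually places the training images.

```python
def latent_grid(x_min, x_max, y_min, y_max, steps):
    """Return a list of [x, y] latent positions on an evenly spaced grid."""
    if steps < 2:
        raise ValueError('steps must be at least 2')
    x_step = (x_max - x_min) / (steps - 1)
    y_step = (y_max - y_min) / (steps - 1)
    # walk the grid row by row, left to right
    return [[x_min + col * x_step, y_min + row * y_step]
            for row in range(steps)
            for col in range(steps)]

grid = latent_grid(-50, 50, -50, 50, steps=5)
print(len(grid), grid[0], grid[-1])  # 25 [-50.0, -50.0] [50.0, 50.0]
```

With the trained model from above, passing np.array(grid) to autoencoder.decoder.predict should return one decoded image per grid position, which can then be shown with plt.imshow as before.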
To make this sampling process a little more snappy, let’s use Tensorflow.js to create a realtime, interactive decoder.

Exploring Latent Spaces Dynamically

To explore the autoencoder’s latent space in realtime, we can use Tensorflow.js, a stunning open source project built by the Google Brain team. To get started, install the package with pip install tensorflowjs==3.8.0. That command will install a package that includes the resources needed to save a Keras model to disk in a format with which the Tensorflow.js clientside library can interact. After that package finishes installing, you should have tensorflowjs_converter on your system path. Using that binary, one can save the decoder defined above to disk by running:

    import subprocess, os

    model_name = 'celeba' # string used to define filename of saved model
    autoencoder.decoder.save(model_name + '-decoder.hdf5', include_optimizer=True)
    out_dir = model_name + '-decoder-js'
    if not os.path.exists(out_dir): os.makedirs(out_dir)

    cmd = 'tensorflowjs_converter '
    cmd += '--input_format keras_saved_model '
    cmd += model_name + '-decoder.hdf5 '
    cmd += out_dir
    subprocess.check_output(cmd, shell=True)

This command will create celeba-decoder.hdf5 and celeba-decoder-js, the latter of which is a directory full of files that collectively specify the decoder’s internal parameters.
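If the model name might ever contain spaces or shell metacharacters, one alternative is to pass the converter its arguments as a list instead of a concatenated string. The sketch below uses the same flags and file names as the command above; the helper build_converter_command is hypothetical and exists only to make the argument list easy to inspect.

```python
import shlex

def build_converter_command(model_name):
    """Return the tensorflowjs_converter invocation as an argument list."""
    return [
        'tensorflowjs_converter',
        '--input_format', 'keras_saved_model',
        model_name + '-decoder.hdf5',   # input file saved by decoder.save
        model_name + '-decoder-js',     # output directory for the JS model
    ]

cmd = build_converter_command('celeba')
# print a copy-pasteable version of the command for inspection
print(' '.join(shlex.quote(part) for part in cmd))
# to run it: subprocess.check_output(cmd)  (no shell=True needed for a list)
```

Because each argument is a separate list item, subprocess hands it to the converter untouched, so no quoting is needed when the command is actually executed.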
Once those files are saved to disk, one can load the decoder and sample from the position 10, 50 in the latent space (just as we did above) with the following HTML: <!DOCTYPE html> <html> <head> <meta charset='utf-8'> <title>Visualizing Autoencoders with Tensorflow.js</title> </head> <body> <script src='https://cdnjs.cloudflare.com/ajax/libs/tensorflow/3.8.0/tf.min.js'></script> <script> var modelPath = 'celeba-decoder-js/model.json'; tf.loadLayersModel(modelPath).then(function(model) { // convert 10, 50 into a vector var y = tf.tensor2d([[10, 50]]); // sample from region 10, 50 in latent space var prediction = model.predict(y).dataSync(); // log the prediction to the browser console console.log(prediction); }) </script> </body> </html> To try this snippet out, save the HTML above to a file named index.html and start a local webserver with either python -m http.server 8000 (Python 3) or python -m SimpleHTTPServer 8000 (Python 2). Then open a web browser to the port on which the server you just started is running, namely http://localhost:8000 and inspect the browser console. If you do so, you should see that the lines above log an array of 38,804 values, one value for each pixel in the 178 * 218 pixel image sampled from position 10, 50 in the latent space. If all this came together, you’re ready to create some interactive models with Tensorflow.js! To do so, we just need to put together a few lines of code that can visualize the array of data returned by the model.predict(y).dataSync() call above. 
The following will accomplish this goal fairly succinctly:

    <!DOCTYPE html>
    <html>
      <head>
        <meta charset='utf-8'>
        <title>Visualizing Autoencoders with Tensorflow.js</title>
        <style>
          html, body {margin: 0; height: 100%; width: 100%; overflow: hidden;}
        </style>
      </head>
      <body>
        <script src='https://cdnjs.cloudflare.com/ajax/libs/tensorflow/3.8.0/tf.min.js'></script>
        <script src='https://cdnjs.cloudflare.com/ajax/libs/three.js/97/three.min.js'></script>
        <script src='https://duhaime.s3.amazonaws.com/blog/latent-spaces/Controls2D.js'></script>
        <script src='https://duhaime.s3.amazonaws.com/blog/latent-spaces/ThreeWorld.js'></script>
        <script src='https://threejs.org/examples/js/controls/TrackballControls.js'></script>
        <script>
          // get the point geometry
          function getGeometry(colors) {
            var geometry = new THREE.Geometry();
            for (var i=0, y=218; y>0; y--) {
              for (var x=0; x<178; x++) {
                var color = colors && colors.length ? colors[i++] : Math.random();
                geometry.vertices.push(new THREE.Vector3(x-(182/2), y-(218/2), 0));
                geometry.colors.push(new THREE.Color(color, color, color));
              }
            }
            return geometry;
          }

          // sample from the latent space at obj.x, obj.y
          function sample(obj) {
            obj.x = (obj.x - 0.5) * 500;
            obj.y = (obj.y - 0.5) * 500;
            // convert obj.x, obj.y into a tensor
            var y = tf.tensor2d([[obj.x, obj.y]]);
            // sample from that position in the latent space
            var prediction = window.decoder.predict(y).dataSync();
            // rebuild the point geometry from the prediction
            mesh.geometry = getGeometry(prediction);
          }

          // add the mesh to the scene
          var world = new ThreeWorld();
          var materialConfig = { size: 1.25, vertexColors: THREE.VertexColors, };
          var material = new THREE.PointsMaterial(materialConfig);
          var geometry = getGeometry([]);
          var mesh = new THREE.Points(geometry, material);
          world.scene.add(mesh);

          // load the decoder with tensorflow.js and render the scene
          var modelPath = 'celeba-decoder-js/model.json';
          tf.loadLayersModel(modelPath).then(function(model) {
            window.decoder = model;
            sample({x: 0, y: 0})
            world.render();
            new Controls2D({
              onDrag: sample
            });
          })
        </script>
      </body>
    </html>

The code above creates an interactive widget like the following:

By interacting with the two-dimensional range slider, users can explore the autoencoder’s latent space, sampling from a continuous range of latent space positions and examining the image the decoder generates for each. That’s all it takes to visualize a latent space with Tensorflow.js!

* * *

I want to thank Chase Shimmin, a brilliant physicist and bona-fide machine learning expert, for helping me take a deeper dive into autoencoders. The notes in this post are my humble attempt to circulate some of the insights Chase shared with me among a larger audience.

Sun, 26 May 2019 00:00:00 -1000 http://douglasduhaime.com/posts/visualizing-latent-spaces.html

Constrained Lloyd Iteration

Lloyd iteration is an iterative algorithm that distributes points within a space. During each iteration of the algorithm, Lloyd iteration builds a Voronoi map in which each point is contained within a distinct Voronoi cell, then centers each point within its cell. This operation causes overlapping points to spread out within a distribution, which can be helpful for data visualization purposes:

The illustration above shows the first 20 iterations of unconstrained Lloyd iteration on a sample distribution of points. As the number of iterations increases, points near the convex hull (the outside border) tend toward infinity [gist].

Lloyd iteration does not constrain the expansion of points, which means that as the number of iterations increases, points near the convex hull (the outside border) tend toward infinity. This can be a problem for many use cases, as Lloyd iterations can expand the domain of a set of points quite significantly. It turns out one can solve this problem by adding vertices at the bounding box of the initial point distribution.
These vertices will prevent the Voronoi map from extending beyond the initial point domains, which ensures the resulting point positions remain inside the initial point domains: The illustration above shows the first 20 iterations of constrained Lloyd iteration on a sample distribution of points. As the number of iterations increases, points near the convex hull remain inside the initial points' x and y domains [gist]. As one can see, in just a few iterations, the constrained Lloyd model distributes points so as to prevent overlapping points. This lets one strategically “jitter” points so as to make it easy to interact with each. To make it easier to transform points in this way, I put together a small package named lloyd. One can install the package with pip: pip install lloyd After installing the package, one can transform the positions of points within a two-dimensional numpy array in the following way: from lloyd import Field import numpy as np # generate 2000 observations with 2 dimensions points = np.random.rand(2000, 2) # create a lloyd model on which one can perform iterations field = Field(points) # run one iteration of Lloyd relaxation on the field of points field.relax() # get the resulting point positions new_positions = field.get_points() new_positions will then be a numpy array with the same shape as points, only the positions of each point will be updated by the Lloyd algorithm described above. To further distribute points, one can call the .relax() method on the lloyd model until the distribution is optimal for plotting. Sat, 29 Sep 2018 00:00:00 -1000 http://douglasduhaime.com/posts/lloyd-iteration.html Adding Authentication to Static Sites with AWS Lambda Many websites require authentication to protect private data. 
When working on a website that uses a server, it’s usually not too much trouble to create some server-side middleware that protects certain routes or web pages. When working on a static website served by Apache or Nginx, one can use htpasswd files to challenge users to authenticate. When working on a serverless website hosted from an S3 bucket, however, creating an authentication layer is a little trickier. This post will attempt to make it a little easier for others to create password-protected static sites with S3, CloudFront, and Lambda. Creating a Static File Site on S3 To get started, you’ll want to create a sample web page. Here’s the one I’ll be using: <!DOCTYPE html> <html> <head> <meta charset='UTF-8'> <title>HELLO!</title> <style> body { background-color: #ffcb50; background-image: url(iam.png); background-size: 50px; } </style> </head> <body /> </html> Once you have an HTML page to display, you’ll need to register for an AWS account if you don’t already have one. Then, after signing in, go to your list of S3 buckets, click “Create bucket” and give your bucket a name. I’ll name my bucket lambda-authentication: When prompted to set the permissions for the bucket, under “Manage public permissions” select “Grant public read access to this bucket”. That will display a little orange message confirming that your bucket contents will be public (we’ll change this later): All other defaults are fine to accept. Once the bucket is created, you can upload your HTML file to the bucket by clicking the bucket, then clicking the “Upload” button. Drag your HTML file (which should be named index.html) into the filepicker, click “Next” until you are prompted to “Manage public permissions” for your uploaded file, and select “Grant public read access to this object(s)”: Then keep clicking “Next” until you get to the end, and click “Upload”. 
Next click the “Properties” tab in your bucket, select “Static website hosting”, select “Use this bucket to host a website”, and specify “index.html” as the default and error documents: Finally save your settings. If you want to get fancy later, you can upload a special 404 page and specify that file as the error document, but let’s keep things simple for now. If you click on the “Static website hosting” card again, you should see an “Endpoint” specified. If you visit that web address, you should see your website: Great! You’re now ready to create a user-authentication layer by configuring a CloudFront distribution for your site. Distributing Your S3 Site with CloudFront CloudFront is AWS’s content distribution network, which distributes your S3 site content to servers around the world, getting your content to viewers faster. CloudFront also allows us to add authentication to an S3 site. To get started with CloudFront, return to the AWS console and click the CloudFront link, then click the big blue button that says “Create Distribution”: On the following screen, click the blue button that says “Get Started” under the “Web” section, then select your S3 bucket address under “Origin Domain Name”. Under “Restrict Bucket Access” select “Yes”, set “access-identity-lambda-authentication” as the identity to use, and finally choose “Yes, Update Bucket Policy”: In the text field labelled “Default Root Object” below, type “index.html”, then click “Create Distribution”. From the next page, you should be able to click the “Distributions” link in the left sidebar to see your new distribution’s status. Take a note of the value under “Domain Name” – in just a few moments that value will become the new address of your new website. The “Status” field will say “in progress” for a few minutes, so while it’s generating we can configure the Lambda function that will provide the actual authentication mechanism. 
Creating IAM Credentials In order to configure Lambda to work with an S3 bucket, we’ll need to create an IAM role that has access to the bucket. To do so, navigate back to the AWS console and click the link for the IAM service. Once there, click “Roles” in the left-hand sidebar, then “Create role”. On the next screen, under “Choose the service that will use this role” click “Lambda”, then click “Next: Permissions” at the bottom of the screen. Search for and select the “AWSLambdaExecute” policy: Then click “Next: Review” at the bottom of the page. On the next screen, name your role “lambda-execute-role”, then click “Create role”: On the next page you should see that your Lambda role has been created. Once it’s created, click on the link to the role, then click “Trust relationships”, then click “Edit trust relationship” and replace the contents with the following: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "lambda.amazonaws.com", "edgelambda.amazonaws.com" ] }, "Action": "sts:AssumeRole" } ] } This little update allows the role to be assumed by Lambda@Edge, which is the service that provides the authentication logic. Once that’s all set, you are ready to proceed to using this role in Lambda itself. Creating the Authentication Layer with AWS Lambda With all of the stage-setting in place, we can now create the actual logic that will handle user-authentication. To do so, return once again to the AWS console. Once you’re there, take a look at the black navigational bar at the top of your screen. Off to the right you should be able to select the “region” in which you wish to operate. For this next step you must be in the N. Virginia region (a.k.a. us-east-1). Once you’re in the N. Virginia region, click the link for “Lambda”. Lambda is a piece of AWS’s “serverless” stack that allows one to run serverside code without having to build, run, and maintain a whole server. We’ll use it to run our authentication logic. 
On the Lambda landing page, click the orange button that says “Create a function”: On the next page, keep “Author from scratch” selected. Name the function “authentication”, select Node.js as the runtime, select “Choose an existing role”, and select “lambda-execute-role” as the existing role to use (this is the role we just created in the IAM console): Finally, click “Create function” at the bottom of the page. Scroll down to the code editor and paste the following snippet into the input field: exports.handler = (event, context, callback) => { // Get the request and its headers const request = event.Records[0].cf.request; const headers = request.headers; // Specify the username and password to be used const user = 'user'; const pw = 'password'; // Build a Basic Authentication string const authString = 'Basic ' + Buffer.from(user + ':' + pw).toString('base64'); // Challenge for auth if auth credentials are absent or incorrect if (typeof headers.authorization == 'undefined' || headers.authorization[0].value != authString) { const response = { status: '401', statusDescription: 'Unauthorized', body: 'Unauthorized', headers: { 'www-authenticate': [{key: 'WWW-Authenticate', value:'Basic'}] }, }; // Return here so the request is not also passed through below return callback(null, response); } // User has authenticated callback(null, request); }; This snippet exports a single function that takes as input the three default arguments Lambda provides to Node.js functions [docs]. The function then pulls out the user’s HTTP request and its headers, specifies the correct username and password, and checks to see if the user’s request contained the username and password in its authorization header. If not, it prompts the user to authenticate; if so it allows the user into the site. After defining the function, click the big orange “Save” button in the upper-right of the screen. 
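To see what the header comparison in the handler is actually checking, here is a minimal sketch of how Basic credentials are encoded and decoded in Node.js. The helper names (encodeBasicAuth, decodeBasicAuth, fakeViewerRequest) are hypothetical and are not part of the Lambda or CloudFront APIs; the fake event only mirrors the fields the handler above reads (event.Records[0].cf.request.headers.authorization), so treat it as an assumption rather than a complete CloudFront event:

```javascript
// Hypothetical helper: build the header value a browser sends for Basic auth
function encodeBasicAuth(user, pw) {
  return 'Basic ' + Buffer.from(user + ':' + pw).toString('base64');
}

// Hypothetical helper: recover the username and password from a header value.
// Returns null if the value is not a well-formed Basic auth string.
function decodeBasicAuth(headerValue) {
  if (typeof headerValue !== 'string' || !headerValue.startsWith('Basic ')) {
    return null;
  }
  const decoded = Buffer.from(headerValue.slice(6), 'base64').toString('utf8');
  const sep = decoded.indexOf(':');
  if (sep === -1) return null;
  // Split on the first colon only, since passwords may contain colons
  return { user: decoded.slice(0, sep), pw: decoded.slice(sep + 1) };
}

// Hypothetical helper: a stripped-down event containing only the fields
// the authentication handler reads
function fakeViewerRequest(authValue) {
  const headers = {};
  if (authValue) {
    headers.authorization = [{ key: 'Authorization', value: authValue }];
  }
  return { Records: [{ cf: { request: { uri: '/index.html', headers: headers } } }] };
}

const headerValue = encodeBasicAuth('user', 'password');
console.log(headerValue);
console.log(decodeBasicAuth(headerValue));
```

Passing fakeViewerRequest(headerValue) to the handler locally, along with a callback that logs its second argument, is one way to check both the authenticated and the 401 paths before publishing the function.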
Then, in the list of “Actions” at the top of the screen, click “Publish”, enter a version statement, and click “Publish”: Next, under the “Designer” section toward the top of the page, click “CloudFront”, which will move CloudFront into the triggers portion of the displayed diagram: If you then scroll down a bit, you’ll see a section titled “Configure triggers”. Select your CloudFront distribution’s ID under the “Distribution” selector (this is displayed under the ID column in your CloudFront distribution list), make sure you select “Viewer request” as the CloudFront event that will trigger the function defined above, and click the box that says “Enable trigger and replicate”: Then click “Add”, and click the orange “Save” button in the upper-right hand corner. If you then try to access the address specified under the “Domain Name” column in your CloudFront distribution list [example], you’ll be prompted for a username and password: If you type “user” and “password” as the credentials (or whichever values you set as the username and password in your Lambda function), you’ll see the site itself! Cleaning Up There’s just one problem with the setup we established above. If you request your original S3 bucket address, you’ll be able to access your content without being challenged to authenticate. To fix this, return to the AWS console, delete your content, and reupload your web files. 
This time, don’t add public read permissions to the uploaded files: Thereafter, if you request your bucket address directly, you’ll get a 403 response as expected: If you request the address of your distribution instead, you’ll be able to authenticate and see your website: Fri, 08 Jun 2018 00:00:00 -1000 http://douglasduhaime.com/posts/s3-lambda-auth.html The Statute of Anne and the Geography of English Printing Fri, 15 Dec 2017 00:00:00 -1000 http://douglasduhaime.com/posts/printing-geography.html Visualizing TSNE Maps with Three.js For the last year or so, Yale’s DHLab has undertaken a series of experiments organized around analysis of visual culture. Some of those experiments have involved identifying similar images and visualizing patterns uncovered in this process. In this post, I wanted to discuss how we used the amazing Three.js library to build a WebGL-powered visualization that can display tens of thousands of images in an interactive 3D environment. If you’re interested in creating something similar, feel free to check out the full code. Getting Started with Three.js Three.js is a JavaScript library that generates lower-level code for WebGL, the standard API for 3D rendering in a web browser. Using Three.js, one can build complex 3D environments that would take much more code to build in raw WebGL. For a quick sample of the projects others have built with the library, check out the Three.js homepage. 
To get started with Three.js, one needs to provide a bit of boilerplate code with the three essential elements of a Three.js page: scene: The scene contains all objects to be rendered: // Create the scene that will hold all rendered objects var scene = new THREE.Scene(); camera: The camera determines the position from which viewers see the scene: // Specify the portion of the scene visible at any time (in degrees) var fieldOfView = 75; // Specify the camera's aspect ratio var aspectRatio = window.innerWidth / window.innerHeight; // Specify the near and far clipping planes. Only objects // between those planes will be rendered in the scene // (these values help control the number of items rendered // at any given time) var nearPlane = 0.1; var farPlane = 1000; // Use the values specified above to create a camera var camera = new THREE.PerspectiveCamera( fieldOfView, aspectRatio, nearPlane, farPlane ); // Finally, set the camera's position in the z-dimension camera.position.z = 5; renderer: The renderer renders the scene to a canvas element on an HTML page: // Create the canvas with a renderer and tell the // renderer to clean up jagged aliased lines var renderer = new THREE.WebGLRenderer({antialias: true}); // Specify the size of the canvas renderer.setSize( window.innerWidth, window.innerHeight ); // Add the canvas to the DOM document.body.appendChild( renderer.domElement ); The code above creates the scene, adds a camera, and renders the canvas to the DOM. Now all we need to do is add some objects to the scene. Each item rendered in a Three.js scene has a geometry and a material. Geometries use vertices (points) and faces (polygons described by vertices) to define the shape of an object, and materials use textures and colors to define the appearance of that shape. A geometry and a material can be combined into a mesh, which is a fully composed object ready to be added to a scene. 
The example below uses the high-level BoxGeometry, which comes with pre-built vertices and faces: // Create a cube with width, height, and depth set to 1 var geometry = new THREE.BoxGeometry( 1, 1, 1 ); // Use a simple material with a specified hex color var material = new THREE.MeshBasicMaterial({ color: 0xffff00 }); // Combine the geometry and material into a mesh var cube = new THREE.Mesh( geometry, material ); // Add the mesh to the scene scene.add( cube ); Finally, to render the scene on the page, one must call the render() method, passing in the scene and the camera as arguments: renderer.render( scene, camera ); Combining the snippets above gives the following result: See the Pen The Simplest Three.js Scene by Douglas Duhaime (@duhaime) on CodePen. This is great, but the scene is static. To add some animation to the scene, one can periodically update the cube’s rotation property, then rerender the scene. To do so, one can replace the renderer.render() line above with a render loop that calls itself recursively. Here is a standard render loop in Three.js: function animate() { requestAnimationFrame( animate ); renderer.render( scene, camera ); // Rotate the object a bit each animation frame cube.rotation.y += 0.01; cube.rotation.z += 0.01; } animate(); Adding this block at the bottom of the script makes the cube slowly rotate: See the Pen Animating the Cube by Douglas Duhaime (@duhaime) on CodePen. Adding lights to the scene can make it easier to differentiate the faces of the cube. To add lights to the scene above, we’ll first want to change the cube’s material, because as the documentation says the MeshBasicMaterial is not affected by lights. 
Let’s replace the material defined above with a MeshPhongMaterial: var material = new THREE.MeshPhongMaterial({color: 0xffff00}) Next let’s point a light at the cube so that different faces of the cube catch different amounts of light: // Add a point light with #fff color, .7 intensity, and 0 distance var light = new THREE.PointLight(0xffffff, .7, 0); // Specify the light's position in the x, y, and z dimensions light.position.set(1, 1, 100); // Add the light to the scene scene.add(light) Voila! See the Pen Lighting the Cube by Douglas Duhaime (@duhaime) on CodePen. Adding Images to a Scene The snippets above give a quick overview of the core elements of a Three.js scene. The following section will build upon those ideas to create a TSNE map of images. To build an image viewer, we’ll need to load some image files into some Three.js materials. We can do so by using the TextureLoader: // Create a texture loader so we can load the image file var loader = new THREE.TextureLoader(); // Specify the path to an image var url = 'https://s3.amazonaws.com/duhaime/blog/tsne-webgl/assets/cat.jpg'; // Load an image file into a MeshLambert material var material = new THREE.MeshLambertMaterial({ map: loader.load(url) }); Now that the material is ready, the remaining steps are to generate a geometry from the image, combine the material and geometry into a mesh, and add the mesh to the scene, just like the cube example above. 
Because images are two-dimensional planes, we can use a simple PlaneGeometry for this object’s geometry: // create a plane geometry for the image with a width of 10 // and a height that preserves the image's aspect ratio var geometry = new THREE.PlaneGeometry(10, 10*.75); // combine the image geometry and material into a mesh var mesh = new THREE.Mesh(geometry, material); // set the position of the image mesh in the x,y,z dimensions mesh.position.set(0,0,0) // add the image to the scene scene.add(mesh); The image will now appear in the scene: See the Pen Adding an Image to a Three.js Scene by Douglas Duhaime (@duhaime) on CodePen. It’s worth noting that one can swap out the PlaneGeometry for other geometries and Three.js will automatically wrap the material over the new geometry. The example below, for instance, swaps the PlaneGeometry for a more interesting Icosahedron geometry, and rotates the icosahedron inside the render loop: // use an icosahedron geometry instead of the plane geometry var geometry = new THREE.IcosahedronGeometry(); // spin the icosahedron each animation frame function animate() { requestAnimationFrame( animate ); renderer.render( scene, camera ); mesh.rotation.x += 0.01; mesh.rotation.y += 0.01; mesh.rotation.z += 0.01; } animate(); This produces a strange looking cat indeed: See the Pen Icosahedron Cat by Douglas Duhaime (@duhaime) on CodePen. Building Custom Geometries The examples above use a few different geometries built into Three.js. Those geometries are based on the fundamental THREE.Geometry class, which is a primitive geometry one can use to create custom geometries. THREE.Geometry is lower-level than the prebuilt geometries used above, but it gives performance gains that make it worth the effort. 
Let’s create a custom geometry by calling the THREE.Geometry constructor, which takes no arguments: var geometry = new THREE.Geometry(); This geometry object doesn’t do much yet, because it doesn’t have any vertices with which to articulate a shape. Let’s add four vertices to the geometry, one for each corner of the image. Each vertex takes three arguments, which define the vertex’s x, y, and z positions respectively: // identify the image width and height var imageSize = {width: 10, height: 7.5}; // identify the x, y, z coords where the image should be placed // inside the scene var coords = {x: -5, y: -3.75, z: 0}; // add one vertex for each image corner in this order: // lower left, lower right, upper right, upper left geometry.vertices.push( new THREE.Vector3( coords.x, coords.y, coords.z ), new THREE.Vector3( coords.x+imageSize.width, coords.y, coords.z ), new THREE.Vector3( coords.x+imageSize.width, coords.y+imageSize.height, coords.z ), new THREE.Vector3( coords.x, coords.y+imageSize.height, coords.z ) ); Now that the vertices are in place, we need to add some faces to the geometry. The code below will model an image as two triangle faces, as triangles are performant primitives in the WebGL world. 
The first triangle will combine the lower-left, lower-right, and upper-right vertices of the image, and the second will triangulate the lower-left, upper-right, and upper-left vertices of the image: // add the first face (the lower-right triangle) var faceOne = new THREE.Face3( geometry.vertices.length-4, geometry.vertices.length-3, geometry.vertices.length-2 ) // add the second face (the upper-left triangle) var faceTwo = new THREE.Face3( geometry.vertices.length-4, geometry.vertices.length-2, geometry.vertices.length-1 ) // add those faces to the geometry geometry.faces.push(faceOne, faceTwo); Awesome, we now have a geometry with four vertices that describe the corners of the image, and two faces that describe the lower-right and upper-left-hand triangles of the image. The next step is to describe which portions of the cat image should appear in each of the faces of the geometry. To do so, one must add some faceVertexUvs to the geometry, as faceVertexUvs indicate which portions of a texture should appear in which portions of a geometry. FaceVertexUvs represent a texture as a two-dimensional plane that stretches from 0 to 1 in the x dimension and 0 to 1 in the y dimension. Within this coordinate system, 0,0 represents the bottom-left-most region of the texture, and 1,1 represents the top-right-most region of the texture. 
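To make that coordinate system concrete, here is a minimal sketch (not part of the original demo) that converts a pixel position within an image into uv coordinates. The helper name pixelToUv is hypothetical, and the sketch assumes pixel rows are counted from the top of the image, as in most image editors, while the v axis is counted from the bottom of the texture as described above:

```javascript
// Hypothetical helper: convert a pixel position to uv texture coordinates.
// Assumes px is measured from the left edge and py from the top edge.
function pixelToUv(px, py, width, height) {
  return {
    u: px / width,       // 0 at the left edge, 1 at the right edge
    v: 1 - py / height,  // 0 at the bottom edge, 1 at the top edge
  };
}

// The bottom-left corner of a 640x480 image
console.log(pixelToUv(0, 480, 640, 480));

// The top-right corner of the same image
console.log(pixelToUv(640, 0, 640, 480));
```

The flip in the v term is the detail that is easiest to get wrong when moving between image-editor coordinates and texture coordinates.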
Given this coordinate system, we can map the lower-right triangle of the image to the first face created above, and we can map the upper-left triangle of the image to the second face created above: // map the region of the image described by the lower-left, // lower-right, and upper-right vertices to the first face // of the geometry geometry.faceVertexUvs[0].push([ new THREE.Vector2(0,0), new THREE.Vector2(1,0), new THREE.Vector2(1,1) ]); // map the region of the image described by the lower-left, // upper-right, and upper-left vertices to the second face // of the geometry geometry.faceVertexUvs[0].push([ new THREE.Vector2(0,0), new THREE.Vector2(1,1), new THREE.Vector2(0,1) ]); With the uv coordinates in place, one can render the custom geometry within the scene just as above: See the Pen Building Custon Geometries by Douglas Duhaime (@duhaime) on CodePen. This may seem like a lot of work for the same result we achieve with a one-line PlaneGeometry declaration above. If a scene only required one image and nothing else, one could certainly use the PlaneGeometry and call it a day. However, each mesh added to a Three.js scene necessitates an additional “draw call”, and each draw call requires the browser agent’s CPU to send all mesh related data to the browser agent’s GPU. These draw calls happen for each mesh during each animation frame, so if a scene is running at 60 frames per second, each mesh in that scene will require the transportation of data from the CPU to the GPU sixty times per second. In short, more draw calls means more work for the host device, so reducing the number of draw calls is essential if you want to keep animations smooth and close to sixty frames per second. The upshot of all this is that a scene with tens of thousands of PlaneGeometry meshes will grind a browser to a halt. 
To render lots of images in a scene, it’s much more performant to use a custom geometry like the one above, and to push lots of vertices, faces, and vertex uvs into that geometry. We’ll explore this idea more below. Displaying multiple images Given the remarks above, let’s next build a single geometry that contains multiple images. To do so, we’ll need to load a number of images into the page in which the scene is running. One way to accomplish this task is to pass a series of urls to the texture loader and load each image individually. The trouble with this approach is it requires one new HTTP request for each image to be loaded, and there are upper bounds to the number of HTTP requests a given browser can make to a given domain at a time. A common solution to this problem is to load an “image atlas”, or montage of small images combined into a single larger image: One can then use the montage the way that performance-minded sites like Google use spritesheets. If you have ImageMagick installed, you can create one of these montages with the montage command: # download directory of images wget https://s3.amazonaws.com/duhaime/blog/tsne-webgl/data/100-imgs.tar.gz tar -zxf 100-imgs.tar.gz # create a file that lists all files to be included in the montage ls 100-imgs/* > images_to_montage.txt # create single montage image from images in a directory montage `cat images_to_montage.txt` -geometry +0+0 -background none -tile 10x 100-img-atlas.jpg The last command will create an image atlas with 10 images per row and no padding between the images in the atlas. The sample directory 100-imgs.tar.gz contains 100 images, and the -tile argument in the montage command indicates the output atlas should have 10 columns, so the command above will generate a 10x10 grid of size 1280px by 1280px. 
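The grid arithmetic above can be sketched in a few lines of plain JavaScript. The helper names atlasSize and atlasCell are hypothetical, and the sketch assumes each subimage is 128px square (1280px divided by 10 columns) and that the montage is filled row by row from the top-left corner:

```javascript
// Hypothetical helper: the pixel dimensions of an atlas holding `count`
// subimages arranged in `cols` columns
function atlasSize(count, cols, cellWidth, cellHeight) {
  var rows = Math.ceil(count / cols);
  return { width: cols * cellWidth, height: rows * cellHeight, rows: rows };
}

// Hypothetical helper: the column, row, and top-left pixel offset of
// subimage `i`, assuming subimages are laid out row by row
function atlasCell(i, cols, cellWidth, cellHeight) {
  var col = i % cols;
  var row = Math.floor(i / cols);
  return { col: col, row: row, x: col * cellWidth, y: row * cellHeight };
}

// 100 subimages of 128px in 10 columns
console.log(atlasSize(100, 10, 128, 128));

// The 24th subimage (index 23) sits in column 3 of row 2
console.log(atlasCell(23, 10, 128, 128));
```

The same column and row arithmetic reappears later when computing which region of the atlas belongs to each face.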
Let’s load the image atlas into a Three.js scene: // Create a texture loader so we can load the image file var loader = new THREE.TextureLoader(); // Load an image file into a custom material var material = new THREE.MeshBasicMaterial({ map: loader.load('https://s3.amazonaws.com/duhaime/blog/tsne-webgl/data/100-img-atlas.jpg') }); Once the image atlas is loaded in, we’ll want to create some helper objects that identify the size of the atlas and its sub images. Those helper objects can then be used to calculate the vertex uvs of each face in a geometry: // Identify the subimage size in px var image = {width: 128, height: 128}; // Identify the total number of cols & rows in the image atlas var atlas = {width: 1280, height: 1280, cols: 10, rows: 10}; The custom geometry example above used four vertices and two faces to render a single image. To represent all 100 images from the image atlas, we can create four vertices and two faces for each of the 100 images to be displayed. Then we can associate the proper region of the image atlas material with each of the geometry’s 200 faces: // Create a helper function that returns an int {-700,700}. // We'll use this function to set each subimage's x and // y coordinate positions function getRandomInt() { var val = Math.random() * 700; return Math.random() > 0.5 ? 
-val : val; } // Create the empty geometry var geometry = new THREE.Geometry(); // For each of the 100 subimages in the montage, add four // vertices (one for each corner), in the following order: // lower left, lower right, upper right, upper left for (var i=0; i<100; i++) { // Create x, y, z coords for this subimage var coords = { x: getRandomInt(), y: getRandomInt(), z: -400 }; geometry.vertices.push( new THREE.Vector3( coords.x, coords.y, coords.z ), new THREE.Vector3( coords.x + image.width, coords.y, coords.z ), new THREE.Vector3( coords.x + image.width, coords.y + image.height, coords.z ), new THREE.Vector3( coords.x, coords.y + image.height, coords.z ) ); // Add the first face (the lower-right triangle) var faceOne = new THREE.Face3( geometry.vertices.length-4, geometry.vertices.length-3, geometry.vertices.length-2 ) // Add the second face (the upper-left triangle) var faceTwo = new THREE.Face3( geometry.vertices.length-4, geometry.vertices.length-2, geometry.vertices.length-1 ) // Add those faces to the geometry geometry.faces.push(faceOne, faceTwo); // Identify this subimage's offset in the x dimension // An xOffset of 0 means the subimage starts flush with // the left-hand edge of the atlas var xOffset = (i % 10) * (image.width / atlas.width); // Identify the subimage's offset in the y dimension // A yOffset of 0 means the subimage starts flush with // the top edge of the atlas var yOffset = Math.floor(i/10) * (image.height / atlas.height); // Use the xOffset and yOffset (and the knowledge that // each row and column contains only 10 images) to specify // the regions of the current image geometry.faceVertexUvs[0].push([ new THREE.Vector2(xOffset, yOffset), new THREE.Vector2(xOffset+.1, yOffset), new THREE.Vector2(xOffset+.1, yOffset+.1) ]); // Map the region of the image described by the lower-left, // upper-right, and upper-left vertices to `faceTwo` geometry.faceVertexUvs[0].push([ new THREE.Vector2(xOffset, yOffset), new THREE.Vector2(xOffset+.1, 
yOffset+.1), new THREE.Vector2(xOffset, yOffset+.1) ]); } // Combine the image geometry and material into a mesh var mesh = new THREE.Mesh(geometry, material); // Set the position of the image mesh in the x,y,z dimensions mesh.position.set(0,0,0) // Add the image to the scene scene.add(mesh); Rendering that scene produces a crazy little scatterplot of images: See the Pen Loading Multiple Images by Douglas Duhaime (@duhaime) on CodePen. Here we represent one hundred images with just a single mesh! This is much better than giving each image its own mesh, as it reduces the number of required draw calls by two orders of magnitude. It’s worth noting, however, that eventually one does need to create additional meshes. A number of graphics devices can only handle 2^16 vertices in a single mesh, so if you need your scene to run on a wide range of devices it’s best to ensure each mesh contains 65,536 or fewer vertices. Using Multiple Atlas Files Having discovered how to visualize multiple images with a single mesh, we can now scale up the image collection size dramatically. One way to crank up the number of visualized images is to squeeze more images into the image atlas. As it turns out, however, the largest texture size supported by many devices is 2048 x 2048px, so the code below will stick to atlas files of that size. For the examples below, I took roughly 20,480 images, resized each to 32px thumbs, then used the montage technique discussed above to build the following atlas files: 1, 2, 3, 4, 5. 
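As a quick sanity check on those numbers, here is a minimal sketch of the capacity arithmetic. The helper name atlasCapacity is hypothetical, and the sketch assumes square 32px thumbs packed with no padding and four vertices per image, as in the custom geometry above:

```javascript
// Hypothetical helper: how many square thumbs fit in a square atlas
function atlasCapacity(atlasPx, thumbPx) {
  var perSide = Math.floor(atlasPx / thumbPx);
  return perSide * perSide;
}

// Thumbs per 2048px atlas when each thumb is 32px
var thumbsPerAtlas = atlasCapacity(2048, 32);

// Atlas files needed to hold 20,480 thumbs
var atlasesNeeded = Math.ceil(20480 / thumbsPerAtlas);

// Vertices in a mesh that draws one full atlas (four per image)
var verticesPerMesh = thumbsPerAtlas * 4;

// Check the per-mesh vertex count against the 2^16 limit
var fitsVertexLimit = verticesPerMesh <= Math.pow(2, 16);

console.log(thumbsPerAtlas, atlasesNeeded, verticesPerMesh, fitsVertexLimit);
```

Under these assumptions, using one mesh per atlas keeps each mesh comfortably under the vertex limit mentioned above.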
Once those atlas files are loaded onto a static file server, one can load each atlas into a scene with a simple loop: // Create a store that maps each atlas file's index position // to its material var materials = {}; // Create a texture loader so we can load the image file var loader = new THREE.TextureLoader(); for (var i=0; i<5; i++) { var url = 'https://s3.amazonaws.com/duhaime/blog/tsne-webgl/data/atlas_files/32px/atlas-' + i + '.jpg'; loader.load(url, handleTexture.bind(null, i)); } // Callback function that adds the texture to the list of textures // and calls the geometry builder if all textures have loaded function handleTexture(idx, texture) { materials[idx] = new THREE.MeshBasicMaterial({ map: texture }) if (Object.keys(materials).length === 5) { document.querySelector('#loading').style.display = 'none'; buildGeometry(); } } The buildGeometry function will then create the vertices and faces for the 20,000 images within the atlas files. Once those are set, one can pump those geometries into some meshes and add the meshes to the scene (click the Codepen link for the full code update): See the Pen Loading Multiple Atlas Files by Douglas Duhaime (@duhaime) on CodePen. Positioning Images with TSNE Coordinates So far we’ve used random coordinates to place images within a scene. Let’s now position images near other similar-looking images. To do so, we’ll create vectorized representations of each image, project those vectors down into a 2D embedding, then use each image’s position in the 2D coordinate space to position the image in the Three.js scene. Generating TSNE Coordinates First things first, let’s create a vector representation of each image. 
If you have tensorflow installed, you can create vectorized representations of each image in 100-imgs by running: # download a script that generates vectorized representations of images wget https://gist.githubusercontent.com/duhaime/2a71921c9f4655c96857dbb6b6ed9bd6/raw/0e72c48e698395265d029fabad0e6ab1f3961b26/classify_images.py # install a dependency for process management pip install psutil # run the script on a glob of images python classify_images.py '100-imgs/*' This script will generate one image vector for each image in 100-imgs/. We can then run the following script to create a 2D TSNE projection of those image vectors: # create_tsne_projection.py from sklearn.manifold import TSNE import numpy as np import glob, json, os # create datastores image_vectors = [] chart_data = [] maximum_imgs = 20480 # build a list of image vectors vector_files = glob.glob('image_vectors/*.npz')[:maximum_imgs] for c, i in enumerate(vector_files): image_vectors.append(np.loadtxt(i)) print(' * loaded', c+1, 'of', len(vector_files), 'image vectors') # build the tsne model on the image vectors model = TSNE(n_components=2, random_state=0) np.set_printoptions(suppress=True) fit_model = model.fit_transform( np.array(image_vectors) ) # store the coordinates of each image in the chart data for c, i in enumerate(fit_model): chart_data.append({ 'x': float(i[0]), 'y': float(i[1]), 'idx': c }) with open('image_tsne_projections.json', 'w') as out: json.dump(chart_data, out) Running that TSNE script on your image vectors will generate a JSON file in which each input image is mapped to x and y coordinate values: [ { 'x': 95.027, 'y': 11.80 }, { 'x': 98.54, 'y': -30.42 }, ... ] Positioning Images in a Three.js Scene Given the JSON file with those TSNE coordinates, all we need to do is iterate over each item in the JSON file and position the image in that index position accordingly. 
To load a JSON file using the Three.js library, we can use a FileLoader:

    // Create a store for image position information
    var imagePositions = null;

    var loader = new THREE.FileLoader();
    loader.load('https://s3.amazonaws.com/duhaime/blog/tsne-webgl/data/image_tsne_projections.json', function(data) {
      imagePositions = JSON.parse(data);
      // Build the geometries if all atlas files are loaded
      conditionallyBuildGeometries()
    })

We can then use the index position of each item in that JSON file to identify the appropriate atlas file and x, y offsets for a given image. To do so, we’ll need to store each material by its index position:

    // Create a store that maps each atlas file's index position
    // to its material
    var materials = {};

    // Create a texture loader so we can load the image file
    var loader = new THREE.TextureLoader();

    for (var i=0; i<5; i++) {
      var url = 'https://s3.amazonaws.com/duhaime/blog/tsne-webgl/data/';
      url += 'atlas_files/32px/atlas-' + i + '.jpg';
      // Pass the texture index position to the callback function
      loader.load(url, handleTexture.bind(null, i));
    }

    // Callback function that adds the texture to the list of textures
    // and calls the geometry builder if all textures have loaded
    function handleTexture(idx, texture) {
      materials[idx] = new THREE.MeshBasicMaterial({ map: texture });
      conditionallyBuildGeometries()
    }

    // If the textures and the mapping from image idx to positional information
    // are all loaded, create the geometries
    function conditionallyBuildGeometries() {
      if (Object.keys(materials).length === 5 && imagePositions) {
        document.querySelector('#loading').style.display = 'none';
        buildGeometry();
      }
    }

buildGeometry() can then pass the index position of each atlas i and the index position of each image within a given atlas j to getCoords(), a function that returns the given image’s x, y, and z coordinates:

    // Identify the total number of cols & rows in the image atlas
    var atlas = {width: 2048, height: 2048, cols: 64, rows: 64};

    // Create a new mesh for each texture
    function buildGeometry() {
      for (var i=0; i<5; i++) {
        // Create one new geometry per atlas
        var geometry = new THREE.Geometry();
        for (var j=0; j< atlas.cols*atlas.rows; j++) {
          var coords = getCoords(i, j);
          geometry = updateVertices(geometry, coords)
          geometry = updateFaces(geometry)
          geometry = updateFaceVertexUvs(geometry, j)
        }
        buildMesh(geometry, materials[i])
      }
    }

    // Get the x, y, z coords for the subimage in index position j
    // of atlas in index position i
    function getCoords(i, j) {
      var idx = (i * atlas.rows * atlas.cols) + j;
      var coords = imagePositions[idx];
      coords.x *= 1200;
      coords.y *= 600;
      coords.z = (-200 + j/100);
      return coords;
    }

Adding Controls

In addition to setting the image positions, we can add some controls to the scene that allow users to zoom in or out. An easy way to do so is to add the pre-packaged trackball controls as an additional JavaScript dependency. Then we can call the control’s constructor and update the controls both on window resize events and inside the main render loop to keep the controls up to date with the application state:

    /**
    * Add Controls
    **/

    var controls = new THREE.TrackballControls(camera, renderer.domElement);

    /**
    * Handle window resizes
    **/

    window.addEventListener('resize', function() {
      camera.aspect = window.innerWidth / window.innerHeight;
      camera.updateProjectionMatrix();
      renderer.setSize( window.innerWidth, window.innerHeight );
      controls.handleResize();
    });

    /**
    * Render!
    **/

    // The main animation function
    function animate() {
      requestAnimationFrame( animate );
      renderer.render( scene, camera );
      controls.update();
    }
    animate();

The result is an interactive visualization of the images in a 2D TSNE projection:

See the Pen Three.js - Positioning Images with TSNE Coordinates by Douglas Duhaime (@duhaime) on CodePen.

Getting Fancy

We’ve now achieved a basic TSNE map with Three.js, but there’s much more that could be done to improve a user’s experience of the visualization.
In particular, within the extant plot:

* Users get no indication of load progress
* Users can't see details within the small images
* Users have no guide through the visualization

To see how our team resolved those challenges, feel free to visit the live site or the GitHub repository with the full source code. Otherwise, if you’re working on something similar, feel free to send me a note or a comment below. I’d love to see what you’re building.

* * *

I want to thank Cyril Diagne, a lead developer on the spectacular Google Arts Experiments TSNE viewer, for generously sharing ideas and optimization techniques that we used to build our own TSNE viewer.

Sun, 19 Nov 2017 00:00:00 -1000
http://douglasduhaime.com/posts/visualizing-tsne-maps-with-three-js.html

Press Piracy from Tonson v. Baker (1710) to Cary v. Kearsley (1803)

Tue, 12 Sep 2017 00:00:00 -1000
http://douglasduhaime.com/posts/press-piracy.html

Identifying Similar Images with TensorFlow

This year’s theme in Yale’s Digital Humanities Lab is visual culture. We’ve spent a good deal of time talking about image mining, color analysis, and related themes, and have become interested in one particular task: identifying similar images in large photo collections. Our work on this subject began when Peter Leonard stumbled across a thread that revolves around using TensorFlow to obtain vector representations of images. The author of that thread pointed out that by making a small change to a script in one of Google’s TensorFlow tutorials, one could produce phenomenal vector representations of images that can be used for a variety of purposes. In what follows, I’ll discuss that change and suggest a few ways one can use the resulting image vectors.

Installing dependencies

To get started, we’ll first need to install TensorFlow.
The easiest way I’ve found to do so is to use the Anaconda distribution of TensorFlow. For those who don’t know, Anaconda is a tremendously helpful distribution of Python that makes it easy to manage multiple versions of Python and various application dependencies in Python. It’s well worth an install, so if you don’t have Anaconda installed, I’d go ahead and install that now. Once you have Anaconda in place, you should be able to create a new virtual environment using Python 3.5 and then install TensorFlow in that environment with the following commands:

    # create virtual environment using python 3.5 with name '3.5'
    conda create -n 3.5 python=3.5

    # activate the virtual environment
    source activate 3.5

    # install tensorflow
    conda install -c conda-forge tensorflow

You should see (3.5) as a preface in your terminal. If you don’t, then you’ve somehow left the virtual environment named 3.5, so you’ll need to re-enter that environment by typing “source activate 3.5” again. If you are in the virtual environment and you type “python”, you’ll enter the Python interpreter. Inside the interpreter, you should be able to load TensorFlow by typing “import tensorflow”. If no error appears, you’ve installed TensorFlow and can leave the interpreter by typing “quit()”. If you do get an error, you’ll need to install TensorFlow before proceeding.

Classifying Images with TensorFlow

The code below revolves around only a slight modification to this original script from TensorFlow’s ImageNet tutorial. The original script takes a single image as input and returns multiple string labels for the image as output.
It is meant to be used from the command line like so:

    # download the original script
    wget https://raw.githubusercontent.com/tensorflow/models/master/tutorials/image/imagenet/classify_image.py

    # download a sample image (the URL is quoted so the shell ignores the "?")
    wget 'http://thecatapi.com/api/images/get?type=jpg' -O cat.jpg

    # run the script to generate text labels for an input image
    python classify_image.py cat.jpg

The first time you run the script, your machine will download Inception, a convolutional neural network pretrained on ImageNet and discussed in the original paper Going Deeper with Convolutions. After downloading the model, the script will print to the terminal several labels for the provided input image, each with a score that shows the model’s confidence in the given label.

Those labels are great for tasks like enhancing image search or algorithmic captioning, but they aren’t necessarily optimal for measuring image similarity. For similarity tasks, it’s generally better to work with floating-point vectors than categorical labels, as vectors capture more of the original object’s signal. Happily, one can obtain vector representations of images by only slightly modifying the classify_image.py script. In essence, instead of asking the last (softmax) layer of the neural network for the text classifications of input images, we’ll ask the penultimate layer of the neural network for its activations on a given image, and we’ll store those activations as a vector representation of the input image. This will allow us to perform traditional vector analysis using images.

Vectorizing Images with TensorFlow

The original classify_image.py calls a function, “run_inference_on_image()”, that handles the image classification for an input image. Here’s that function:

    def run_inference_on_image(image):
      """Runs inference on an image.

      Args:
        image: Image file name.

      Returns:
        Nothing
      """
      if not tf.gfile.Exists(image):
        tf.logging.fatal('File does not exist %s', image)
      image_data = tf.gfile.FastGFile(image, 'rb').read()

      # Creates graph from saved GraphDef.
      create_graph()

      with tf.Session() as sess:
        # Some useful tensors:
        # 'softmax:0': A tensor containing the normalized prediction across
        #   1000 labels.
        # 'pool_3:0': A tensor containing the next-to-last layer containing 2048
        #   float description of the image.
        # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG
        #   encoding of the image.
        # Runs the softmax tensor by feeding the image_data as input to the graph.
        softmax_tensor = sess.graph.get_tensor_by_name('softmax:0')
        predictions = sess.run(softmax_tensor,
                               {'DecodeJpeg/contents:0': image_data})
        predictions = np.squeeze(predictions)

        # Creates node ID --> English string lookup.
        node_lookup = NodeLookup()

        top_k = predictions.argsort()[-FLAGS.num_top_predictions:][::-1]
        for node_id in top_k:
          human_string = node_lookup.id_to_string(node_id)
          score = predictions[node_id]
          print('%s (score = %.5f)' % (human_string, score))

The comments in this function note that the tensor pool_3:0 contains the output of the penultimate layer of the network. That output is a 2048-dimensional vector (a list of 2048 floating-point numbers) that’s perfect for image similarity computations. Let’s grab that layer in addition to the final layer of the network:

    # Excerpt from classify_images.py, which uses the TensorFlow 1.x API
    # (tf.gfile, tf.Session). Beyond the original script's imports it needs:
    #   import os, psutil
    #   from collections import defaultdict
    def run_inference_on_images(image_list, output_dir):
      """Runs inference on an image list.

      Args:
        image_list: {list} a list of paths to image files
        output_dir: {string} name of the directory where image vectors will be saved

      Returns:
        image_to_labels: {dict} a dictionary with image file keys and
          predicted text label values
      """
      image_to_labels = defaultdict(list)

      create_graph()

      with tf.Session() as sess:
        # Some useful tensors:
        # 'softmax:0': A tensor containing the normalized prediction across
        #   1000 labels.
        # 'pool_3:0': A tensor containing the next-to-last layer containing 2048
        #   float description of the image.
        # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG
        #   encoding of the image.
        # Runs the softmax tensor by feeding the image_data as input to the graph.
        softmax_tensor = sess.graph.get_tensor_by_name('softmax:0')

        for image_index, image in enumerate(image_list):
          try:
            print("parsing", image_index, image, "\n")
            if not tf.gfile.Exists(image):
              tf.logging.fatal('File does not exist %s', image)

            with tf.gfile.FastGFile(image, 'rb') as f:
              image_data = f.read()

            predictions = sess.run(softmax_tensor,
                                   {'DecodeJpeg/contents:0': image_data})
            predictions = np.squeeze(predictions)

            ###
            # Get penultimate layer output
            ###

            feature_tensor = sess.graph.get_tensor_by_name('pool_3:0')
            feature_set = sess.run(feature_tensor,
                                   {'DecodeJpeg/contents:0': image_data})
            feature_vector = np.squeeze(feature_set)
            # np.savetxt writes plain text, despite the .npz extension
            outfile_name = os.path.basename(image) + ".npz"
            out_path = os.path.join(output_dir, outfile_name)
            np.savetxt(out_path, feature_vector, delimiter=',')

            ###
            # Store softmax classification results
            ###

            # Creates node ID --> English string lookup.
            node_lookup = NodeLookup()

            top_k = predictions.argsort()[-FLAGS.num_top_predictions:][::-1]
            for node_id in top_k:
              human_string = node_lookup.id_to_string(node_id)
              score = predictions[node_id]
              image_to_labels[image].append({
                "labels": human_string,
                "score": str(score)
              })

            # close the open file handlers
            proc = psutil.Process()
            open_files = proc.open_files()
            for open_file in open_files:
              file_handler = getattr(open_file, "fd")
              os.close(file_handler)
          except Exception:
            print('could not process image index', image_index, 'image', image)

      return image_to_labels

Here we see the modifications we’ll make to run_inference_on_image(). They focus on handling a series of input images, using error handling on each image in case a png or jp2 file sneaks into our collection of jpegs, and most importantly on capturing and saving the output of the penultimate layer of the neural network.
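To make the phrase “image similarity computations” concrete, here is a minimal sketch of one common similarity measure, cosine similarity, written in plain Python. It is an illustration rather than part of classify_images.py: the helper names load_vector and cosine_similarity are hypothetical, and load_vector assumes each saved vector file holds one float per line, which is how np.savetxt writes a one-dimensional array.

```python
import math


def load_vector(path):
    """Read a saved image vector (assumed: one float per line)."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two 2048-dimensional image vectors
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Scores near 1 indicate vectors that point in nearly the same direction, which for image vectors generally corresponds to images the network treats as alike.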
Finding Similar Images

To identify similar images in large image collections, one can run the lines below to download the full updated classify image script, install psutil (which is used for managing open file handlers), and run the updated script on a directory full of images:

    # get the full updated script
    wget https://gist.githubusercontent.com/duhaime/2a71921c9f4655c96857dbb6b6ed9bd6/raw/0e72c48e698395265d029fabad0e6ab1f3961b26/classify_images.py

    # install the new dependency inside your virtual environment
    pip install psutil

    # download a collection of jpg images (or use one you have)
    wget https://goo.gl/Lf9vmN -O images.tar.gz
    tar -zxf images.tar.gz

    # run the script on a glob of images
    python classify_images.py "images/*"

This will generate a new directory “image_vectors” and will create one vector for each input image in that directory, using the input image name as the root of the output vector name.

Finding Nearest Neighbors

The modified version of classify_images.py above generates one image vector for each input image. With those vectors in hand, one can run subsequent analysis to achieve different effects. For example, suppose you wanted to find the most similar images for each of your input images. The browser below offers one example of this kind of functionality:

[Interactive similar-image browser]

A nice way to achieve this functionality is to leverage Erik Bernhardsson’s Approximate Nearest Neighbors Oh Yeah (Annoy) library to identify the approximate nearest neighbors for each image. The similar image viewer above uses Annoy to identify similar images [I used this nearest neighbors script].
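For intuition about what a nearest neighbors search returns, the sketch below runs an exact, brute-force search in plain Python. It is a stand-in for illustration only, not the Annoy-based script: it compares every pair of vectors with Euclidean distance, which is fine for small collections but far slower than an approximate index on large ones. The function name, the toy file names, and the toy vectors are all hypothetical.

```python
import heapq
import math


def nearest_neighbors(vectors, query_name, n_nearest_neighbors=30):
    """Return the names of the vectors closest to `query_name`, nearest first."""
    query = vectors[query_name]
    # pair each other image with its Euclidean distance from the query
    candidates = (
        (math.dist(query, vector), name)
        for name, vector in vectors.items()
        if name != query_name
    )
    return [name for _, name in heapq.nsmallest(n_nearest_neighbors, candidates)]


# Toy 3-dimensional vectors standing in for 2048-dimensional image vectors
toy_vectors = {
    "cat-1.jpg": [0.9, 0.1, 0.0],
    "cat-2.jpg": [0.8, 0.2, 0.1],
    "dog-1.jpg": [0.1, 0.9, 0.2],
    "car-1.jpg": [0.0, 0.1, 0.9],
}
print(nearest_neighbors(toy_vectors, "cat-1.jpg", n_nearest_neighbors=2))
```

Brute force costs one distance computation per pair of images, which is the cost an approximate index is designed to avoid.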
To identify the nearest neighbors for the image vectors we created above, one can run:

    wget http://douglasduhaime.com/assets/posts/similar-images/utils/cluster_vectors.py
    pip install annoy && pip install scipy && pip install nltk
    python cluster_vectors.py

These commands will generate a directory named nearest_neighbors in your current working directory, and will create one outfile for each image in the collection. Each of those outfiles will identify the 30 images most similar to the given image. To search for more or fewer nearest neighbors, one just needs to update the n_nearest_neighbors variable in the nearest neighbors script.

Image TSNE Projections

Another fun application for image vectors is TSNE projection. If you haven’t used TSNE before, it’s essentially a dimension reduction technique similar in some ways to Principal Component Analysis, except it’s optimized for learning and preserving non-linear patterns in high-dimensional datasets. TSNE projections are often used in data visualizations as they are great at making similar high-dimensional vectors appear next to one another even in two-dimensional projections.

If we load all of the image vectors into a TSNE model and then project the data down to two dimensions, we can create a two-dimensional representation of the image collection that preserves similarity between images. Within this representation of the data, each image is positioned near the images to which it’s most similar (click for interactive view):

The plot itself is generated with native HTML5 canvas methods, but D3.js helps provide data fetching, DOM manipulation, and a Voronoi mouseover map. The data for the plot was produced by this tsne clustering script.
Mon, 28 Aug 2017 00:00:00 -1000
http://douglasduhaime.com/posts/identifying-similar-images-with-tensorflow.html

Spenserian Networks

In 1906, William Sumner defined ethnocentrism as the “view of things in which one’s own group is the center of everything, and all others are scaled and rated with reference to it”. Among the ethnocentric, Sumner continued, “each group nourishes its own pride and vanity, boasts itself superior, exalts its own divinities, and looks with contempt on outsiders” [13]. This notion of “in-group favoritism” continues to inspire research and new questions on a wide range of social groups. In an attempt to bring some of these questions back in time, the analysis below uses social network data to evaluate in-group and out-of-group dynamics in a network of historical writers.

The Spenserians Database

To build a social network of early modern writers, one needs some data. The data used below comes from the Spenserians database, David H. Radcliffe’s phenomenally rich database with over 25,000 hand-keyed original poems and structured metadata on virtually all writers in the Spenserian tradition of poetry. The database builds on 200 years of bibliographical research on the Spenserian tradition, and can therefore justly claim “its selection criteria are formal (anyone who wrote in Spenserian stanzas … is included) and its scope comprehensive for printed materials in English”. From its humble origins in the 1990s as a digital HyperCard project to its current implementation on a robust LAMP architecture, the Spenserians Database has evolved into the definitive digital collection of Spenserian poetry.

The Spenserians database includes rich, curated metadata for roughly 1,200 writers in the Spenserian tradition. The database records the career, writing genres, education level, religion, nationality, and gender of each of those poets using a reserved vocabulary of terms.
This structured metadata is extremely rare, and serves as the foundation for the network analysis below. Here is a breakdown of the writers in the database by each of those metadata fields:

[Chart: Poet Metadata Overview, with views for Occupation, Writing, Education, Religion, Nation, and Gender]

A good majority of the archive clearly consists of English clergymen of the Anglican faith. That said, there are enough observations for the other metadata values to allow for some interesting insights below.

It’s helpful to remember that the metadata value counts above are not static but change over time. The following chart shows how the values within each metadata category change over time. By hovering over the chart, you can see the percent of writers in each decade that have a given metadata value:

[Chart: Poet Metadata Over Time, with views for Occupation, Writing, Education, Religion, Nation, and Gender]

Occupation-wise, one finds here that while early poets were often secretaries and courtiers, later poets rarely pursued these professions. Writing-wise, the chart reflects a strong rise of editor-poets, a growth that mirrors the increasingly pervasive presence of the printing press over the seventeenth and eighteenth centuries. Religion-wise, the clear trend is away from Anglicanism and toward a larger polyphony of religious devotions. Nationality-wise, English poets are strongly represented in early decades, while American, Irish, and Scottish poets grow better represented over the eighteenth century. Gender-wise, women only become represented in the seventeenth century, and unfortunately never account for a significant portion of the poet population within the database.

The charts above capture metadata counts within the Spenserians database, but they don’t capture relationships between metadata values in the database. To show the relationships between poet occupations and education levels, for example, the plot below uses a matrix-like view.
In the initial chart, each column represents an occupation and each row represents an education level. Changing the first two dropdowns below updates the metadata fields used for column and row values. The color of each cell is controlled by the third dropdown: normalizing by row makes each row’s values sum to 1, while normalizing by column makes each column’s values sum to 1. By toggling between these values, one can normalize the cell values by either column or row-level metadata values:

[Chart: Poet Metadata Correlations, with Columns and Rows dropdowns (Occupation, Gender, Education, Religion, Nationality, Writing) and a Normalize by dropdown (Row, Column)]

These charts offer unique insights into early modern societies. Examining them, one finds for example that writers with no formal education often worked as laborers or within the book trade, while those with private school training were often women schoolmasters. Additionally, Jewish authors tended to achieve higher education levels, while Presbyterians tended to achieve lower education levels. By toggling through the select options above, one can begin to understand the relationship between the various metadata categories within the Spenserians dataset.

Spenserian Networks

If the charts above reveal some of the relationships between Spenserian author metadata fields, they don’t display relationships between Spenserian authors. Fortunately, the Spenserian editors have painstakingly curated the relationships between Spenserian poets. For each author in the database, the database’s editorial team has hand-identified the relationships that writer had with others in the database. The following chart visualizes these relationships.
By hovering on individual writers within the network, you can visualize that poet’s relationships with other poets (click the image to enter):

Those familiar with literary history might recognize that the y-axis of the network above indicates the passage of time, from Spenser at the top through the Metaphysical poets, on through the Neoclassical poets, and finally to the Romantics at the bottom. Each poet’s position along the y-axis is set by the publication year of their first work within the Spenserians database.

Given this fact, it becomes interesting to consider the historical range of a poet’s associates. There are perhaps three groups of writers here: poets like Horace Walpole (top below) who were associated almost exclusively with earlier writers, those like William Godwin (middle below) who were associated almost exclusively with later writers, and those like Anna Seward (bottom below) who were squarely of their times and had roughly equal shares of earlier and later associates:

One wonders whether authors with predominantly earlier associates have stylistic traits that separate them from those with predominantly later associates. Hopefully subsequent work will be able to pursue this question at some length.

Group Dynamics

The plots above examine the metadata or network connections of Spenserian authors to uncover early modern social trends in the aggregate. By combining author-level metadata with author network connections, the analysis below works to uncover “unexpected relationships” among identified metadata groups.

To do so, we first find the total number of associates that exist for each combination of metadata values. For example, we count the number of times Anglicans are associates of Catholics, the number of times physicians are associates of Scottish writers, and so on. This gives us the raw co-occurrence counts for each combination of two metadata values; i.e., it tells us how many Catholics are associates of Anglicans.
Then we find the relative proportions of associations each metadata value (e.g. Anglicans) has within each metadata type (e.g. occupations, gender). This tells us that within the gender metadata type, for example, 95% of Anglican relationships are with men, and 5% with women.

Finally, to normalize for the varying population sizes among metadata values (e.g. to normalize for the size differential between male and female populations), we subtract from these 95% and 5% values the relative frequency of the given metadata value within the population. 95% of Anglican relationships were with men and 5% with women, for example, but men only account for 93% of the total data population and women account for 7% of the total population. We would therefore say that Anglican relationships with men are more present than one would expect by a margin of 2% (95% - 93%), while Anglican relationships with women are less present than one would expect by a margin of 2% (5% - 7% = -2%). Using this logic, the following chart visualizes over- and under-represented relationships:

[Chart: Unexpected Relationships, with Rows and Points dropdowns (Occupation, Gender, Education, Religion, Nationality, Writing) and Jitter and Discretize controls]

It’s important to note that the total data population contains roughly 850 poets, so it’s more than possible that sampling error accounts for some of these deltas. In other words, the small deltas that cluster near the x=0 vertical line above may be the result of a smallish data set, rather than the result of fundamental differences in the underlying population distributions.

The chart below offers a visual explanation of this notion. The dashed bars each represent the relative frequency of one education value within the total corpus; i.e., 40% of all writers in the data set have a B.A. degree, 30% have an M.A. degree, and so on. Each random sample drawn during the simulation has a corresponding 40% probability of being sent to the B.A. group, a 30% probability of being sent to the M.A. group, and so on. Click the Start button to see how this “weighted random sample” simulation plays out:

[Interactive chart: Weighted Random Sampling, with Start, Pause, and Restart controls and counts of observed and total samples]

If you start this simulation and watch the number of samples grow, you should see the observations within each metadata value grow closer to the expected number of observations for the given metadata value. After sampling all observations, however, you’ll find each education level has a little more or less than the expected value. This delta is formally known as “sampling error”, or a divergence between the true population statistics and the population statistics observed in a sample from the population. In short, the smaller the sample size, the greater the probability that the sample is unrepresentative of the underlying population.

The analysis of network trends is based on the best available curated dataset of early modern networks. That said, the sample is relatively small, and is bound to suffer from sampling error. The only real way around this problem is to increase the sample size by leveraging a larger dataset. I’m currently pursuing this line of analysis, and hope to post some updates soon.

Sat, 13 May 2017 00:00:00 -1000
http://douglasduhaime.com/posts/spenserian-networks.html

CRUD Operations on Static File Sites

Frameworks like Jekyll, Middleman, and Gatsby make it fun and easy to build static file websites. Static file sites are fast, flexible, and can be hosted essentially for free using something like GitHub Pages or Amazon Web Services’ S3 service. The catch to static sites, however, is they lack any real server-side code, which makes it hard to allow users to Create, Read, Update, or Destroy (CRUD) records in a database. This post shows how one can use Google Apps Script and Google Sheets to create a free dynamic backend and database for static file sites.
Sample Application

For a quick example of the way Google Apps Script stores data, try the following minigame and save your score and name at the end.

[Interactive minigame: click the ghost to score points before the timer runs out]

If you then refresh the page and play again, you’ll see the score data persists.

Saving Forms on Traditional Websites

Traditional websites use a server and database combination to store data like the points users achieved in a game session. Wordpress and Squarespace, for example, allow users to create, edit, or delete database items such as blog posts through web forms. These web forms send POST requests to the site’s backend to create, edit, or delete appropriate entries in the site’s database, and then submit GET requests to fetch data from the database so it can be shown to users.

Static file frameworks like those listed above, by contrast, do not have a backend or database, so they lack built-in support for saving user submissions. One way around this problem is to pay for a Software as a Service (“SaaS”) platform such as Formspree or Formkeep, which will allow admins to add web forms to their sites. These services have the advantage of being supported full time by dedicated teams, but they have the disadvantage of costing money. As it turns out, however, with Google Apps Script and Google Sheets one can implement a free solution to the same problem.

Saving Forms with Google Apps Script

Using Google Sheets and Google Apps Script, one can easily save form submissions from static file sites. To get started with this approach, let’s suppose we want users to be able to submit their name and email address to sign up for a mailing list. To support this functionality, we’ll make a web form, a Google Sheet to store user responses, and some Google Apps Script to save form submissions to the Sheet. Let’s get started!

Creating the web form

To allow users to send forms from a static site, the first thing we’ll need is a form to submit.
In our case we want users to be able to send their name and email addresses to sign up for a mailing list, so let’s create a form with Name and Email fields:

    <form>
      <div>Name:</div>
      <input type='text' name='Name'>
      <div>Email:</div>
      <input type='text' name='Email'>
      <input type='submit' value='Submit'>
    </form>

Creating the Google Sheet

With our form in place, let’s create a new Google Sheet to store user responses. In the spreadsheet, enter “Timestamp”, “Name”, and “Email” in cells A1, B1, and C1 [example]:

Creating the Google Apps Script

Given this spreadsheet, one can prepare to accept POST requests by adding a little Google Apps Script to the sheet. To do so, go to Tools → Script Editor. You should see a text editor appear with a placeholder function defined. Replace the placeholder function with the following script:

    /**
    * Save HTTP POST data to the current spreadsheet
    *
    * @params: {Object} e: an event object that contains post data in e.parameter
    * @returns: a JSON response with a success/failure result and the event
    * @documentation: https://developers.google.com/apps-script/guides/web
    **/

    function doPost(e) {
      var result;
      try {
        writeToSheet(e);
        result = 'success';
      } catch(error) {
        Logger.log(e);
        Logger.log(error);
        result = 'error';
      }

      // send a success/failure message
      return ContentService.createTextOutput(JSON.stringify({
        'result': result,
        'event': e
      })).setMimeType(ContentService.MimeType.JSON);
    }

    /**
    * Write the submitted form data to a given sheet
    * @params: {Object} e: an event object that contains post data in e.parameter
    **/

    function writeToSheet(e) {
      try {
        var doc = SpreadsheetApp.getActiveSpreadsheet();
        var sheet = doc.getActiveSheet(); // get active sheet
        var lastCol = sheet.getLastColumn();
        var headers = sheet.getRange(1, 1, 1, lastCol).getValues()[0];
        var nextRow = sheet.getLastRow()+1; // get the next row in the sheet
        var row = [ new Date() ]; // initialize row data with a timestamp

        // add each field to the row data
        // start at index = 1 because the timestamp is already added
        for (var i = 1; i < headers.length; i++) {
          if (headers[i].length > 0) {
            row.push(e.parameter[headers[i]]);
          }
        }

        // write the row data to the sheet
        sheet.getRange(nextRow, 1, 1, row.length).setValues([row]);
      } catch(error) {
        Logger.log(error); // log the error
        throw error; // rethrow so doPost() can report the failure
      }
    }

This script has two main functions. doPost() is a special function defined within Google Apps Script that is called when an app receives an HTTP POST request. writeToSheet() is a custom function that adds the posted data to the sheet; if the write fails, it logs the error and rethrows it so that doPost() reports 'error' rather than 'success'. Together, they receive data sent through POST requests and save them to your Google Sheet.

After adding these functions to your script, click Save and type a name for your project when prompted. Then we need to publish the script as an app so that we can allow other web services to send POST requests to the script. To do so, one can click Publish → Deploy as Web App. Select “Execute the app as me”, and grant “Anyone, even anonymous” access to the app, in order to allow outside web traffic to communicate with the app. Once those values are set, click Deploy, then click Review Permissions and accept the permissions. You should then see a modal that indicates your “Current web app URL”. Copy this URL to your clipboard and save it for later use.

Adding the App URL to the Form

Finally, we can make our form post responses to our Sheet by modifying the form we defined above.
Let’s make the form submit a POST request, and let’s use the “Current web app URL” from the Google Sheet as the form action:

<form id='google-form' method='post' action='https://script.google.com/macros/s/AKfycbyVS-FMaTegLw0tYrr00ZhOdwfHD4EYP6vwJSpdwGMywBkir9Y/exec'>
  <div>Name:</div>
  <input type='text' name='Name'>
  <div>Email:</div>
  <input type='text' name='Email'>
  <input type='submit' value='Submit'>
</form>

Submitting the form

If we add a touch of CSS and render this form on a web page, we should see something like the following:

[rendered form: Name and Email fields]

If you fill out and submit the form, you should see your responses in your new spreadsheet. Voila!

Submitting the form without changing the page

In the code above, we submit a sample web form and are redirected to a new page with a JSON response from the server. This is suboptimal for lots of reasons, not least because it’s confusing to users accustomed to single-page applications. One traditional way around this problem is to add CORS headers to the server that’s sending responses, then to use AJAX calls to fetch data from that server. In this case, however, we don’t control the Google servers, so we can’t add CORS headers to the responses. A suitable workaround is to add a hidden iframe to the page, then specify that iframe as the ‘target’ for the data returned from the server:

<iframe name='hidden-iframe' style='display: none'></iframe>
<form id='google-form' method='post' target='hidden-iframe' action='https://script.google.com/macros/s/AKfycbyVS-FMaTegLw0tYrr00ZhOdwfHD4EYP6vwJSpdwGMywBkir9Y/exec'>
  <div>Name:</div>
  <input type='text' name='Name'>
  <div>Email:</div>
  <input type='text' name='Email'>
  <input type='submit' value='Submit'>
</form>

If you resubmit the form, you’ll now stay on the same page!

Going Further

Google Apps Script is pretty interesting, especially for those working in static-site contexts. To read more about their services, check out their sample applications.
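As a recap of how writeToSheet() lines up posted fields with spreadsheet columns, the matching logic can be mirrored in a few lines of Python. This is only an illustrative sketch: build_row() and its now argument are hypothetical names that are not part of the Apps Script above, and a missing field becomes None here where the script would get undefined.

```python
from datetime import datetime

def build_row(headers, params, now=None):
    """Mirror the row-building loop in writeToSheet().

    headers: the values in row 1 of the sheet, e.g. ["Timestamp", "Name", "Email"]
    params:  the posted form fields, keyed by each input's name attribute
    now:     optional timestamp override (hypothetical, handy for testing)
    """
    # the first column always holds the timestamp
    row = [now if now is not None else datetime.now()]
    # start at index 1 because the timestamp is already added
    for header in headers[1:]:
        if header:  # skip blank header cells, as the script does
            row.append(params.get(header))
    return row

headers = ["Timestamp", "Name", "Email"]
posted = {"Email": "ada@example.com", "Name": "Ada"}
row = build_row(headers, posted, now="2017-04-28")
print(row)  # ['2017-04-28', 'Ada', 'ada@example.com']
```

Note that the values follow the order of the headers rather than the order of the form fields, and that a field whose name matches no header is silently dropped.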
Fri, 28 Apr 2017 00:00:00 -1000 http://douglasduhaime.com/posts/crud-operations-on-static-file-sites.html

Image Transitions with D3 and Primitive

D3.js does magic with svgs, and Primitive transforms images into svgs. Put them together and you can turn Kevin Bacon into Francis Bacon (click the page): To help others produce image transitions like this, I put together a quick proof of concept repository. The general approach is to transform each image from a raster object to an SVG object using Primitive. Here’s a simple Python function that accomplishes this goal:

from bs4 import BeautifulSoup
import json, sys, os, glob, subprocess, shlex

# note: output_directory and size are not defined in this snippet;
# they are assumed to be globals set elsewhere in the script
def img_to_svg(img):
  '''Read in the path to an image and use fogleman/primitive to
  transform that image into an svg'''
  basename = os.path.basename(img)
  new_name = os.path.splitext(basename)[0] + '.svg'
  out_file = os.path.join(output_directory, new_name)
  call = 'primitive -i ' + img
  call += ' -o ' + out_file
  call += ' -r ' + str(size)
  call += ' -s ' + str(size)
  call += ' -n 300'
  call += ' -m 4'
  print(' * running', call)
  subprocess.call(shlex.split(call))
  return out_file

From there, one can transform the svg file to JSON for consumption within D3.
Using BeautifulSoup (installed via pip install beautifulsoup4; the 'lxml' parser used below also requires pip install lxml) makes it relatively easy to parse out each attribute from the elements in the SVG and transform them into a JSON object:

def svg_to_json(svg):
  '''Read in the path to an svg and write json for that svg to disk'''
  filename = os.path.basename(svg)
  data = { 'svg': {}, 'group': {}, 'rect': {}, 'points': [], 'name': filename }

  with open(svg) as f:
    markup = f.read()
  soup = BeautifulSoup(markup, 'lxml')

  svg_elem = soup.find('svg')
  data['svg'] = { 'width': svg_elem['width'], 'height': svg_elem['height'] }

  group = soup.find('g')
  data['group'] = { 'transform': group['transform'] }

  rect = soup.find('rect')
  data['rect'] = {
    'x': rect['x'],
    'y': rect['y'],
    'width': rect['width'],
    'height': rect['height'],
    'fill': rect['fill']
  }

  # collect the attributes of each ellipse drawn by primitive
  point_attributes = ['fill', 'fill-opacity', 'cx', 'cy', 'rx', 'ry']
  points = soup.find_all('ellipse')
  for i in points:
    observation = {}
    for a in point_attributes:
      observation[a] = i[a]
    data['points'].append(observation)

  output_filename = filename.replace('.svg', '.json')
  outfile = os.path.join(output_directory, output_filename)
  with open(outfile, 'w') as out:
    json.dump(data, out)

Once that JSON is ready, one can load that data into D3 and use the general update pattern to transition between images. Here is the JS code used to generate the transition seen above. If you load that into a browser, fire up a web server, open your page and trigger a few click events, you should see a fun image transition. For the complete code used in this post, feel free to visit the full repository and raise any questions or issues you might have. Happy coding!

Sat, 15 Apr 2017 00:00:00 -1000 http://douglasduhaime.com/posts/image-transitions-with-d3-and-primitive.html

Donaldson v. 
Beckett (1774) and the Cheap Literature Hypothesis Fri, 10 Feb 2017 00:00:00 -1000 http://douglasduhaime.com/posts/testing-the-cheap-literature-hypothesis.html http://douglasduhaime.com/posts/testing-the-cheap-literature-hypothesis.html Simple Image Segmentation with Scikit-Image Several months ago, I worked with Professor David Corso and his team at the University of Michigan on an MDP Project to subdivide newspaper articles from full newspaper sheets. Our project revolved around analyzing a newspaper sheet, identifying each of the articles within that sheet, and saving each article to a unique file. The purpose of the exercise was to allow downstream applications to run OCR on the subdivided images to improve OCR quality and ultimately improve search relevancy. In the months that followed, I crossed paths with a number of additional image segmentation tasks in Yale’s Digital Humanities Lab, all of which seemed to suggest that image segmentation is an area of increasing importance in digital research. Given the paucity of material on image segmentation, I thought it would be worthwhile to write up a quick case study that shows how one can perform some simple image segmentation. Case Study: Segmenting playbills The case study discussed below grows out work I pursued when the British Library asked if Yale’s Digital Humanities Lab could help process a large image collection in their possession. Their data consisted of scrapbooks wherein each page/image contained several advertisements for eighteenth-century plays. Here’s a sample image: Given an image such as the above, they wanted to save each of the clippings from that page to its own file: This is a fairly tidy example of an image segmentation task, and one that our lab achieved quickly with Python’s scikit-image package. The write-up below documents the approaches we leveraged for this task. Converting an image file to a pixel matrix To get started, one must first install skimage. 
To do so, just open a terminal and type pip install scikit-image. From there, one can read a jpg or jp2 into RAM with a script such as the following:

from skimage import io
import sys

image_file = sys.argv[1]
file_extension = image_file.split(".")[-1]

# scipy.ndimage.imread has been removed from SciPy,
# so skimage's own reader handles both file types here
if file_extension in ["jpg", "jpeg"]:
  im = io.imread(image_file)
elif file_extension in ["jp2"]:
  im = io.imread(image_file, plugin='freeimage')
else:
  print("your input file isn't jpg or jp2")
  sys.exit()

To invoke this script, save the above to a file (e.g. image_segmentation.py) and run: python image_segmentation.py PATH_TO/AN_IMAGE.jpg, where the sole argument provided to the script is the path to an image file on your machine. If you do so, you’ll instantiate an im object. If you print that object, you’ll see it’s a matrix. The shape of this matrix depends on the input image type, as discussed in the relevant skimage docs. In the case of the grayscale images above, the im object is a 2d matrix, or array of arrays. Each subarray represents one row of pixels in the image, and each integer in a given subarray represents the luminance of a pixel in the given row on an 8-bit scale (0 = black, 255 = white):

>>> print(im)
[[255 255 255 ..., 255 255 255]
 [255 255 255 ..., 255 255 255]
 [255 255 255 ..., 255 255 255]
 ...,
 [254 254 254 ..., 255 255 255]
 [254 254 254 ..., 255 255 255]
 [254 254 254 ..., 255 255 255]]

One can run a wide range of numerical operations on this image pixel matrix in order to achieve different tasks. Below we’ll look at two approaches one can use to save each subimage in a composite image to its own file.

Image segmentation with pixel dilations

One approach that’s often useful in image processing is “pixel dilation.” This term refers here to the process of measuring the total amount of luminance for each row and each column of an image (a measure more commonly called a projection profile; in image morphology, “dilation” names a different operation). Measuring these values can provide helpful inputs with which one can automatically crop or even segment image elements.
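To make the idea concrete, here is a minimal pure-Python sketch of the row and column sums on a tiny made-up “page”. The toy matrix and the dark_runs() helper are assumptions introduced purely for illustration; they are not part of the original project code.

```python
# Toy 8-bit grayscale "page" (made up for illustration): 255 = white, 0 = black.
# Two dark blocks are separated by a white gutter.
W, B = 255, 0
im = [
    [W, W, W, W, W, W, W, W, W, W],
    [W, B, B, B, W, W, B, B, B, W],
    [W, B, B, B, W, W, B, B, B, W],
    [W, B, B, B, W, W, B, B, B, W],
    [W, B, B, B, W, W, B, B, B, W],
    [W, W, W, W, W, W, W, W, W, W],
]

# aggregate luminance of each row and of each column
row_vals = [sum(r) for r in im]
col_vals = [sum(c) for c in zip(*im)]

def dark_runs(profile, cutoff):
    """Return (start, stop) index pairs for runs where profile < cutoff.

    stop is exclusive, so each pair can be used directly as a slice.
    """
    runs, start = [], None
    for i, v in enumerate(profile):
        if v < cutoff and start is None:
            start = i                      # a dark run begins
        elif v >= cutoff and start is not None:
            runs.append((start, i))        # the dark run ends
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

# a column (or row) counts as dark if it sums to less than an all-white line
col_runs = dark_runs(col_vals, cutoff=W * len(im))
row_runs = dark_runs(row_vals, cutoff=W * len(im[0]))
print(col_runs)  # [(1, 4), (6, 9)]
print(row_runs)  # [(1, 5)]
```

On a real scan the background is rarely perfectly white, so the cutoff would need to be chosen more loosely than the all-white total used here.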
Given a matrix representation of the composite image discussed above, for example, one can easily find the aggregate luminance of each column of pixels. The plot below on the right visualizes the aggregate luminance of each column of pixels for the image on the left below: Examining this chart, we can tell there are two dark bands of pixels within the image on the left: one that stretches from roughly pixels 100-800 in the image, and another that stretches from roughly 1200-1900. Given just this representation of the image’s contents, one would have enough information to partition the image into two regions. From there, one could repeat the procedure, this time dilating pixels along the y axis and again splitting the image based on the resulting blocks within the pixel histogram. To generate pixel histograms such as the one above for the x and y axes of an image matrix named im, one can run:

import matplotlib.pyplot as plt

# measure the amount of white ink across the rows & columns
row_vals = list([sum(r) for r in im ])
col_vals = list([sum(c) for c in im.T])

# plot the column (x-axis) pixel dilations
plt.plot(col_vals)
plt.show()

# plot the row (y-axis) pixel dilations
plt.plot(row_vals)
plt.show()

Image segmentation with pixel clustering

While pixel dilations can offer significant clues for image processing, many image segmentation tasks involve identifying non-rectilinear patterns, and therefore require more flexible solutions. Below we’ll examine one approach to automatically segmenting an image into discrete regions of interest.

Binarizing grayscale pixels

The sample image discussed above is an 8-bit grayscale image. Each pixel is represented as an integer value between 0 and 255, where 0 = perfect black and 255 = perfect white. One way to simplify image processing for this kind of color scale is to “binarize” the image, or transform the image such that each pixel is either black or white. To binarize a grayscale image, we simply need to identify a threshold value between 0 and 255.
Once the threshold is established, we can identify each pixel with a boolean (True/False) value that indicates whether the given pixel is black/white. As it happens, the threshold_otsu() function in skimage provides a helper for determining an ideal threshold value for binarizing a grayscale image, and the clear_border() function then removes any masked regions that touch the edge of the image:

from skimage import filters, segmentation
import matplotlib.pyplot as plt

# find a dividing line between 0 and 255
# pixels below this value will be black
# pixels above this value will be white
val = filters.threshold_otsu(im)

# the mask object converts each pixel in the image to True or False
# to indicate whether the given pixel is darker than the threshold
mask = im < val

# remove the masked regions that touch the image border
clean_border = segmentation.clear_border(mask)

# plot the resulting binarized image
plt.imshow(clean_border, cmap='gray')
plt.show()

Segmenting binarized images

After binarizing a grayscale image, one can use the label() function in skimage to partition the image into contiguous areas of self-similar pixel regions:

from skimage.measure import label

# labeled contains one integer for each pixel in the image,
# where that integer indicates the segment to which the pixel belongs
labeled = label(clean_border)

This function returns a matrix with the same shape as im in which each value indicates the segment to which a given pixel has been assigned (pixels labeled 0 belong to the background). The first member of this matrix, for instance, will be an integer indicating the segment to which the pixel at position 0,0 has been assigned, while the second member of the matrix will be an integer indicating the segment to which the pixel at 1,0 has been assigned.

Identifying large segments to crop

With these segment assignments in hand, one should filter out the “noise” segments, or the segments to which a small number of pixels have been assigned.
These segments are the result of noise in the input image, and can be disregarded in the following way:

from skimage.measure import regionprops

# create array in which to store cropped articles
cropped_images = []

# define amount of padding to add to cropped image
pad = 20

# for each segment number, find the area of the given segment.
# If that area is sufficiently large, crop out the identified segment.
for region_index, region in enumerate(regionprops(labeled)):
  if region.area < 2000:
    continue

  # draw a rectangle around the segmented articles
  # bbox describes: min_row, min_col, max_row, max_col
  minr, minc, maxr, maxc = region.bbox

  # use those bounding box coordinates to crop the image; the padded
  # start indices are clamped at 0 so they can't wrap around the array
  cropped_images.append(im[max(minr-pad, 0):maxr+pad, max(minc-pad, 0):maxc+pad])

Having identified all of the images we wish to partition from the composite image, all that’s left to do is to save those images to disk.

Saving cropped images to disk

To save our cropped images, we only need to create an output directory in which to store the images, then save each of the images we just cropped out of the composite image to that output directory:

import os
from skimage import io

# create a directory in which to store cropped images
out_dir = "segmented_articles/"
if not os.path.exists(out_dir):
  os.makedirs(out_dir)

# save each cropped image by its index number
for c, cropped_image in enumerate(cropped_images):
  io.imsave(out_dir + str(c) + ".png", cropped_image)

Conclusion

This post has attempted to show some of the ways one can approach some simple image segmentation problems with scikit-image. If you have any troubles with the snippets above, feel free to consult the full source code used to process the samples above. If you get interested in this line of analysis, feel free to drop me a note. I’d be curious to hear what brings you to image segmentation!
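For readers curious about what the labeling and filtering steps do under the hood, the following pure-Python sketch walks through them on a tiny boolean mask. It is not how skimage implements label() or regionprops(): label_regions() and region_boxes() are hypothetical helpers written for this illustration, and the sketch assumes 4-connectivity (pixels touch only along their edges) to keep things short.

```python
from collections import deque

def label_regions(mask):
    """Label 4-connected True regions in a 2D list of booleans.

    Returns a grid of ints in which 0 is background and 1..n are regions.
    """
    rows, cols = len(mask), len(mask[0])
    labeled = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or labeled[r][c]:
                continue
            # found an unlabeled foreground pixel: flood-fill its region
            next_label += 1
            labeled[r][c] = next_label
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not labeled[ny][nx]):
                        labeled[ny][nx] = next_label
                        queue.append((ny, nx))
    return labeled

def region_boxes(labeled, min_area):
    """Return {label: (min_row, min_col, max_row, max_col)} for regions
    with at least min_area pixels. The max values are exclusive."""
    stats = {}
    for r, row in enumerate(labeled):
        for c, lab in enumerate(row):
            if lab == 0:
                continue
            area, minr, minc, maxr, maxc = stats.get(lab, (0, r, c, r + 1, c + 1))
            stats[lab] = (area + 1, min(minr, r), min(minc, c),
                          max(maxr, r + 1), max(maxc, c + 1))
    return {lab: s[1:] for lab, s in stats.items() if s[0] >= min_area}

mask = [
    [True,  True,  False, False, False],
    [True,  True,  False, False, True],   # the lone pixel at (1, 4) is "noise"
    [False, False, False, False, False],
    [False, False, True,  True,  True],
    [False, False, True,  True,  True],
]
labeled = label_regions(mask)
boxes = region_boxes(labeled, min_area=4)
print(boxes)  # {1: (0, 0, 2, 2), 3: (3, 2, 5, 5)}
```

The single-pixel region (label 2) falls below min_area and is dropped, which is the same role the region.area < 2000 check plays in the snippet above.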
Mon, 20 Jun 2016 00:00:00 -1000 http://douglasduhaime.com/posts/simple-image-segmentation-with-scikit-image.html

Keeping Screens Alive on AFS Filesystems

Shared filesystem servers like AFS sometimes try to manage the duration of user sessions through the concept of authentication tokens. If you SSH to a server running AFS, you create a token that grants you access to your portion of the filesystem. Once you log out, that token is destroyed, which cuts off access to the filesystem. If a user starts some background processes (e.g. behind a screen) during the course of a session and then logs out, or if that user stays logged in but outstays the duration of their token, their jobs will eventually fail if they need to interact with the filesystem. One way around this problem is to obtain a token that is not associated with one’s current session. Scott Hampton, a tremendously helpful guru at Notre Dame’s Center for Research Computing, recently shared with me a series of steps one can take to access such a token:

# login to a machine
# Run these commands:
pagsh -c /bin/tcsh
kinit
aklog
source .cshrc
# start screen/tmux
# start jobs that you need to run
# detach screen/tmux as needed
# logout
# logout

This will create tokens that are good for 30 days. If you need more time than that to complete some jobs (God bless you), you can repeat the process.

Thu, 14 Apr 2016 00:00:00 -1000 http://douglasduhaime.com/posts/keeping-screens-alive-on-afs-filesystems.html

Plagiary Poets

Last week I wrapped up PlagiaryPoets.io, an interactive app that visualizes text reuse within early poetry: I was inspired to build the app after spending a month studying D3.js with Bob Holt, Mike Pennisi, and Yannick Assogba, three brilliant developers who work for Bocoup.
The site was a lot of fun to build, and I look forward to launching related projects in the months to come. If you have any thoughts about this one, feel free to drop me a line below! Sat, 06 Feb 2016 00:00:00 -1000 http://douglasduhaime.com/posts/plagiary-poets.html http://douglasduhaime.com/posts/plagiary-poets.html Visualizing Shakespearean Characters Some time ago, I was intrigued to discover that Shakespeare’s Histories have a noticeable lack of female characters [link]. Since then, I’ve been curious to further explore the nuances of Shakespearean characters, paying particular respect to the gender dynamics of the Bard’s plays. This post is a quick sketch of some of the insights to which that curiosity has led. To get a closer look at Shakespeare’s characters, I ran some analysis on the Folger Shakespeare Library’s gold-standard set of Shakespearean texts [link], all of which are encoded in fantastic XML markup that captures a number of character-level attributes, including gender. Using that markup, I extracted data for each character in Shakespeare’s plays, and then scoured through those features in search of patterns. All of the characters with an identified gender in this dataset are plotted below (mouseover for character name and source play): Looking at this plot, we can see that the most prominent characters in Shakespearean drama are almost all well-known, titular males. There is also a noticeable inverse-relationship between a character’s prominence and the point in the play wherein that character is introduced. Looking more closely at the plot, I’ve further noticed that Shakespeare was curiously consistent in his treatment of characters who appear in multiple plays. In both 1 Henry IV and 2 Henry IV, for instance, Falstaff is given ~6,000 words and is introduced only a few hundred words into the work. 
Looking at the long tail, by contrast, one finds that among the outspoken characters introduced after the ~15,000 word mark (including Westmoreland and Bedford in 2 Henry IV, and Cade, Clifford, and Iden from 2 Henry VI), nearly all hail from Histories. While the plot above gives one a bird’s-eye view of Shakespeare’s characters, the plot doesn’t make it particularly easy to differentiate male and female character dynamics. As a step in this direction, the plot below visualizes character entrances by gender for each of Shakespeare’s plays: Examining the distribution along the x-axis, we can see that male characters consistently enter the stage before female characters. An exception to this general rule may be found in the Comedies, as plays like The Taming of the Shrew, All’s Well That Ends Well, and A Midsummer Night’s Dream begin with female characters on stage. Looking at the distribution along the y-axis, we can also see that for most plays, male characters continue to be introduced on stage long after the last female characters have been introduced. Given the plots above, some might conclude that Shakespeare privileged male characters over female characters, as he introduced the former earlier and tended to give them more lines. There is evidence in the plays, however, that points in the opposite direction. Looking at Shakespeare’s minor characters, we see that the smallest and least significant roles in each play were almost universally assigned to males: Here we see that even important male characters such as Fleance in Macbeth and Cornelius in Hamlet are given very few lines indeed, and the smallest female roles are consistently given more lines than the smallest male roles. In sum, the plots above show that a number of heretofore undisclosed patterns emerge when we analyze Shakespeare’s characters in the aggregate. However, the plots above don’t show the connections between characters.
One way to investigate these interconnections is through a co-occurrence matrix, in which each cell represents the degree to which two characters appear on stage concurrently:

[Interactive co-occurrence matrix. Controls: Play (one entry per play); Order (by Name, by Frequency, by Cluster, by Gender); Color (by Gender, by Cluster)]

In this visualization, "Frequency" represents the number of times a character appears on stage, "Gender" is indicated by the markup within the Folger Shakespeare Digital Collection XML (red = female, blue = male, green = unspecified), and "Cluster" reflects the subgroup of characters with whom a given character regularly appears, as determined by a fast greedy modularity ranking algorithm. Interacting with this plot allows one to uncover a number of insights. In the first place, we can see that the Histories consistently feature more "clusters" of characters than do Comedies or Tragedies. That is to say, while Comedies tend to be wildly interconnected affairs, Histories tend to include many small, isolated groups of characters that interact rather little with each other. Looking at the gender dynamics of these groups, we can also see that in Comedies such as Merry Wives of Windsor and Histories such as Richard III and Henry V, female characters tend to appear on stage together, almost creating a coherent collective over the course of the play.
Finally, a number of female characters—such as Queen Margaret in 2 Henry VI and Adrianna in Comedy of Errors—appear on stage more frequently than any other character in their respective plays, despite the fact that they say fewer words than their respective plays' most outspoken characters. That is to say, their visual presence on stage is disproportionate to their verbal presence on stage. This raises a number of questions: To what extent were female characters meant to fulfill the role of a spectacle in Shakespearean drama? It’s difficult to imagine that the male players who acted as females projected authentic feminine voices. Did the limitations of imitative speech help mitigate the number of lines given to these prominent female characters? These and other questions remain to be explored in future work. Sun, 13 Dec 2015 00:00:00 -1000 http://douglasduhaime.com/posts/visualizing-shakespearean-characters.html http://douglasduhaime.com/posts/visualizing-shakespearean-characters.html Clustering Semantic Vectors with Python Google’s Word2Vec and Stanford’s GloVe have recently offered two fantastic open source software packages capable of transposing words into a high dimension vector space. In both cases, a vector’s position within the high dimensional space gives a good indication of the word’s semantic class (among other things), and in both cases these vector positions can be used in a variety of applications. In the post below, I’ll discuss one approach you can take to clustering the vectors into coherent semantic groupings. Both Word2Vec and GloVe can create vector spaces given a large training corpus, but both maintain pretrained vectors as well. To get started with ~1GB of pretrained vectors from GloVe, one need only run the following lines: wget http://www-nlp.stanford.edu/data/glove.6B.300d.txt.gz gunzip glove.6B.300d.txt.gz If you unzip and then glance at glove.6B.300d.txt, you’ll see that it’s organized as follows: the 0.04656 0.21318 -0.0074364 [...] 
0.053913
, -0.25539 -0.25723 0.13169 [...] 0.35499
. -0.12559 0.01363 0.10306 [...] 0.13684
of -0.076947 -0.021211 0.21271 [...] -0.046533
to -0.25756 -0.057132 -0.6719 [...] -0.070621
[...]
sandberger 0.429191 -0.296897 0.15011 [...] -0.0590532

Each new line contains a token followed by 300 signed floats, and the tokens appear to be ordered from most to least common. Given this ready format, it’s fairly straightforward to get straight to clustering! There are a variety of methods for clustering vectors, including density-based clustering, hierarchical clustering, and centroid clustering. One of the most intuitive and most commonly used centroid-based methods is K-Means. Given a collection of points in a space, K-Means uses a Hunger Games-style random lottery to pick a few lucky points (colored green below), then assigns each of the non-lucky points to the lucky point to which it’s closest. Using these preliminary groupings, the next step is to find the “centroid” (or geometric center) of each group, using the same technique one would use to find the center of a square. These centroids become the new lucky points, and each non-lucky point is again assigned to the lucky point to which it’s closest. This process continues until the centroids settle down and stop moving, after which the clustering is complete. Here’s a nice visual description of K-Means [source]: To cluster the GloVe vectors in a similar fashion, one can use the sklearn package in Python, along with a few other packages:

from __future__ import division
from sklearn.cluster import KMeans
from numbers import Number
from pandas import DataFrame
import sys, codecs, numpy

It will also be helpful to build a class to mimic the behavior of autovivification in Perl, which is essentially the process of creating new default hash values given a new key. In Python, this behavior is available through collections.defaultdict(), but a defaultdict can be awkward to serialize (one built with a lambda factory, for instance, can’t be pickled), so the following class is handy.
Given an input key it hasn’t seen, the class will create an empty list as the corresponding hash value:

class autovivify_list(dict):
  '''A pickleable version of collections.defaultdict'''
  def __missing__(self, key):
    '''Given a missing key, set initial value to an empty list'''
    value = self[key] = []
    return value

  def __add__(self, x):
    '''Override addition for numeric types when self is empty'''
    if not self and isinstance(x, Number):
      return x
    raise ValueError

  def __sub__(self, x):
    '''Also provide subtraction method'''
    if not self and isinstance(x, Number):
      return -1 * x
    raise ValueError

We also want a method to read in a vector file (e.g. glove.6B.300d.txt) and store each word and the position of that word within the vector space. Because reading in and analyzing some of the larger GloVe files can take a long time, to get going quickly one can limit the number of lines to read from the input file by passing in a value (n_words), which is set later on:

def build_word_vector_matrix(vector_file, n_words):
  '''Return the vectors and labels for the first n_words in vector file'''
  numpy_arrays = []
  labels_array = []
  with codecs.open(vector_file, 'r', 'utf-8') as f:
    for c, r in enumerate(f):
      # stop once n_words lines have been read
      if c == n_words:
        break
      sr = r.split()
      labels_array.append(sr[0])
      numpy_arrays.append( numpy.array([float(i) for i in sr[1:]]) )
  return numpy.array( numpy_arrays ), labels_array

Scikit-Learn’s implementation of K-Means returns an object (cluster_labels in these snippets) that indicates the cluster to which each input vector belongs. That object doesn’t tell one which word belongs in each cluster, however, so the following method takes care of this.
Because all of the words being analyzed are stored in labels_array and the cluster to which each word belongs is stored in cluster_labels, the following method can easily map those two sequences together:

def find_word_clusters(labels_array, cluster_labels):
  '''Return the set of words in each cluster'''
  cluster_to_words = autovivify_list()
  for c, i in enumerate(cluster_labels):
    cluster_to_words[ i ].append( labels_array[c] )
  return cluster_to_words

Finally, we can call the methods above, perform K-Means clustering, and print the contents of each cluster with the following block:

if __name__ == "__main__":
  input_vector_file = sys.argv[1] # Vector file input (e.g. glove.6B.300d.txt)
  n_words = int(sys.argv[2]) # Number of words to analyze
  reduction_factor = float(sys.argv[3]) # Amount of dimension reduction, in (0, 1]
  n_clusters = int( n_words * reduction_factor ) # Number of clusters to make
  df, labels_array = build_word_vector_matrix(input_vector_file, n_words)
  kmeans_model = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
  kmeans_model.fit(df)

  cluster_labels = kmeans_model.labels_
  cluster_inertia = kmeans_model.inertia_
  cluster_to_words = find_word_clusters(labels_array, cluster_labels)

  for c in cluster_to_words:
    print(cluster_to_words[c])
    print("\n")

The full script is available here. To run it, one needs to specify the vector file to be read in, the number of words one wishes to sample from that file (one can of course read them all, but doing so can take some time), and the “reduction factor”, which determines the number of clusters to be made. If one specifies a reduction factor of .1, for instance, the routine will produce n*.1 clusters, where n is the number of words sampled from the file.
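For readers who want to see the K-Means loop itself, independent of scikit-learn, here is a minimal pure-Python sketch of the procedure described earlier (random lottery, assignment, centroid update, repeat). It is an illustration under simplifying assumptions and not scikit-learn's implementation: the kmeans() function, its seed and max_iter parameters, and the toy points are all made up for this example, and the starting points are drawn uniformly at random where the script above uses 'k-means++' initialization.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster a list of equal-length tuples into k groups.

    Returns (labels, centroids), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    # the "lottery": pick k of the input points as the starting centroids
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # assign every point to its nearest centroid (squared distance)
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the clustering is done
        labels = new_labels
        # move each centroid to the mean of the points assigned to it
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return labels, centroids

# two well-separated toy groups in 2D
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary (which group is called 0 depends on the lottery), so only the grouping of the points is meaningful. The same is true of the labels_ values returned by scikit-learn.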
The following command reads in the first 10,000 words, and produces 1,000 clusters: python cluster_vectors.py glove.6B.300d.txt 10000 .1 The output of this command is the series of clusters produced by the K-Means clustering: [u'Chicago', u'Boston', u'Houston', u'Atlanta', u'Dallas', u'Denver', u'Philadelphia', u'Baltimore', u'Cleveland', u'Pittsburgh', u'Buffalo', u'Cincinnati', u'Louisville', u'Milwaukee', u'Memphis', u'Indianapolis', u'Auburn', u'Dame'] [u'Product', u'Products', u'Shipping', u'Brand', u'Customer', u'Items', u'Retail', u'Manufacturer', u'Supply', u'Cart', u'SKU', u'Hardware', u'OEM', u'Warranty', u'Brands'] [u'home', u'house', u'homes', u'houses', u'housing', u'offices', u'household', u'acres', u'residence'] [...] [u'Night', u'Disney', u'Magic', u'Dream', u'Ultimate', u'Fantasy', u'Theme', u'Adventure', u'Cruise', u'Potter', u'Angels', u'Adventures', u'Dreams', u'Wonder', u'Romance', u'Mystery', u'Quest', u'Sonic', u'Nights'] I’m currently using these word clusters for fuzzy plagiarism detection, but they can serve a wide variety of purposes. If you find them helpful for a project you’re working on, feel free to drop me a note below! Sat, 12 Sep 2015 00:00:00 -1000 http://douglasduhaime.com/posts/clustering-semantic-vectors.html http://douglasduhaime.com/posts/clustering-semantic-vectors.html Crosslingual Plagiarism Detection with Scikit-Learn Oliver Goldsmith, one of the great poets, playwrights, and historians of science from the Enlightenment, was many things. He was ‘an idle, orchard-robbing schoolboy; a tuneful but intractable sizar of Trinity; a lounging, loitering, fair-haunting, flute-playing Irish ‘buckeen.’’ He was also a brilliant plagiarist. Goldsmith frequently borrowed whole sentences and paragraphs from French philosophes such as Voltaire and Diderot, closely translating their works into his own voluminous books without offering so much as a word that the passages were taken from elsewhere. 
Over the last several months, I have worked with several others to study the ways Goldsmith adapted and freely translated these source texts into his own writing in order to develop methods that can be used to discover crosslingual text reuse. By outlining below some of the methods that I have found useful within this field of research, the following post attempts to show how automated methods can be used to further advance our understanding of the history of authorship. Sample Training Data In order to identify the passages within Goldsmith’s corpus that were taken from other writers, I decided to train a machine learning algorithm to differentiate between plagiarisms and non-plagiarisms. To distinguish between these classes of writing, John Dillon and I collected a large number of plagiarized and non-plagiarized passages within Goldsmith’s writing, and provided annotations to identify whether the target passage had been plagiarized or not. Here are a few sample rows from the training data: French Source Goldsmith Text Plagiarism Bothwell eut toute l'insolence qui suit les grands crimes. Il assembla les principaux seigneurs, et leur fit signer un écrit, par lequel il était dit expressément que la reine ne se pouvait dispenser de l'éspouser, puisqu'il l'avait enlevée, et qu'il avait couché avec elle. Bothwell was possessed of all the insolence which attends great crimes: he assembled the principal Lords of the state, and compelled them to sign an instrument, purporting, that they judged it the Queen's interest to marry Bothwell, as he had lain with her against her will. 1 Histoire c'est le récit des faits donnés pour vrais; au contraire de la fable, qui est le récit des faits donnés pour faux. In the early part of history a want of real facts hath induced many to spin out the little that was known with conjecture. 
0 La meilleure maniere de connoître l'usage qu'on doit faire de l' esprit, est de lire le petit nombre de bons ouvravrages de génie qu'on a dans les langues savantes & dans la nôtre. The best method of knowing the true use to be made of wit is, by reading the small number of good works, both in the learned languages, and in our own. 1 Comme il y a en Peinture différentes écoles, il y en a aussi en Sculpture, en Architecture, en Musique, & en général dans tous les beaux Arts. A school in the polite arts, properly signifies, that succession of artists which has learned the principles of the art from some eminent master, either by hearing his lessons, or studying his works. 0 Des étoiles qui tombent, des montagnes qui se fendent, des fleuves qui reculent, le Soleil & la Lune qui se dissolvent, des comparaisons fausses & gigantesques, la nature toûjours outrée, sont le caractere de ces écrivains, parce que dans ces pays où l'on n'a jamais parlé en public. Falling stars, splitting mountains, rivers flowing to their sources, the sun and moon dissolving, false and unnatural comparisons, and nature everywhere exaggerated, form the character of these writers; and this arises from their never, in these countries, being permitted to speak in public. 1 Given this training data, the goal was to identify some features that commonly appear in Goldsmith’s plagiarized passages but don’t commonly appear in his non-plagiarized passages. If we could derive a set of features that differentiate between these two classes, we would be ready to search through Goldsmith’s corpus and tease out only those passages that had been borrowed from elsewhere. Feature Selection: Alzahrani Similarity Because a plagiarized passage can be expected to have language that is similar but not necessarily identical to the language used within the plagiarized source text, I decided to test some fuzzy string similarity measures. One of the more promising leads on this front was adapted from the work of Salha M. 
Alzahrani et al. [2012], who has produced a number of great papers on plagiarism detection. The specific similarity measure adapted from Alzahrani calculates the similarity between two passages (call them Passage A and Passage B) in the following way: def alzahrani_similarity( a_passage, b_passage ): # Create a similarity counter and set its value to zero similarity = 0 # For each word in Passage A for a_word in a_passage: # If that word is in Passage B if a_word in b_passage: # Add one to the similarity counter similarity += 1 # Otherwise, else: # For each word in Passage B for b_word in b_passage: # If the current words from Passages A and B are synonymous if a_word in find_synonyms( b_word ): # Add one half to the similarity counter similarity += .5 break # Lastly, divide the similarity score by the n words in the longer passage return similarity / max( len(a_passage), len(b_passage) ) To prepare the data for this algorithm, I used the Google Translate API to translate French texts into English, the Big Huge Labs Thesaurus API to collect synonyms for each word in Passage B, and the NLTK to clean the resulting texts (dropping stop words, removing punctuation, etc.). Once these resources were prepared, I used an implementation of the algorithm described above to calculate the ‘similarity’ between the paired passages in the training data. As one can see, the similarity value returned by this algorithm discriminates reasonably well between plagiarized and non-plagiarized passages: The y-axis here is discrete–each data point represents either a plagiarized pair of passages (such as those in the training data discussed above), or a non-plagiarized pair of passages. The x-axis is really the important axis. The further to the right a point falls on this axis, the greater the length-normalized similarity score for the passage pair. As one would expect, plagiarized passages have much higher similarity scores than non-plagiarized passages. 
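To make the scoring concrete, here is a minimal runnable sketch of the measure above applied to two toy passages. The `TOY_THESAURUS` dictionary and this version of `find_synonyms` are hypothetical stand-ins for a real thesaurus lookup (they are not the Big Huge Labs API), and the passages are invented for illustration.

```python
# Hypothetical in-memory thesaurus; a real run would query a thesaurus service
TOY_THESAURUS = {
    "tiny": {"small", "little"},
    "sleeps": {"rests", "dozes"},
}


def find_synonyms(word):
    """Return the set of synonyms for a word (empty if the word is unknown)."""
    return TOY_THESAURUS.get(word, set())


def alzahrani_similarity(a_passage, b_passage):
    """Score exact matches as 1 and synonym matches as 0.5, then length-normalize."""
    similarity = 0
    for a_word in a_passage:
        if a_word in b_passage:
            similarity += 1
        else:
            for b_word in b_passage:
                if a_word in find_synonyms(b_word):
                    similarity += .5
                    break
    return similarity / max(len(a_passage), len(b_passage))


passage_a = ["small", "dog", "runs"]
passage_b = ["tiny", "dog", "sleeps", "now"]

# "small" is a synonym of "tiny" (0.5), "dog" matches exactly (1), "runs" scores 0
print(alzahrani_similarity(passage_a, passage_b))
```

The score here is 1.5 divided by 4, the length of the longer passage.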
In order to investigate how sensitive this similarity method is to passage length, I iterated over all sub-windows of n words within the training data, and used the same similarity method to calculate the similarity of the sub-window within the text. When n is five, for instance, one would compare the first five words of Passage A to the first five from Passage B. After storing that value, one would compare words two through six from Passage A to words one through five of Passage B, then words three through seven from Passage A to words one through five of Passage B, proceeding in this way until all five-word windows had been compared. Once all of these five-word scores are calculated, only the maximum score is retained, and the rest are discarded. The following plot shows that as the number of words in the sub-window increases, the separation between plagiarized and non-plagiarized passages also increases: Feature Selection: Word2vec Similarity Although the method discussed above provides helpful separation between plagiarized and non-plagiarized passages, it reduces word pairs to one of three states: equivalent, synonymous, and irrelevant. Intuitively, this model feels limited, because one senses that words can have degrees of similarity. Consider the words small, tiny, and humble. The thesaurus discussed above identifies these terms as synonyms, and the algorithm described above essentially treats the words as interchangeable synonyms. This is slightly unsatisfying because the word small seems more similar to the word tiny than the word humble. To capture some of these finer gradations in meaning, I called on Word2Vec, a method that uses backpropagation to represent words in high-dimensional vector spaces. Once a word has been transposed into this vector space, one can compare a word’s vector to another word’s vector and obtain a measure of the similarity of those words. 
The following snippet, for instance, uses cosine similarity to measure the degree to which tiny and humble are similar to the word small:

from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

# Load the Google pretrained word vectors
# (in recent gensim releases the loader lives on KeyedVectors rather than Word2Vec)
google_dir = '../google_pretrained_word_vectors/'
google_model = google_dir + 'GoogleNews-vectors-negative300.bin.gz'
model = KeyedVectors.load_word2vec_format(google_model, binary=True)

# Obtain the vector representations of three words,
# reshaped to single-row 2D arrays as cosine_similarity expects
v1 = model['small'].reshape(1, -1)
v2 = model['tiny'].reshape(1, -1)
v3 = model['humble'].reshape(1, -1)

# Measure the similarity of 'tiny' and 'humble' to the word 'small'
for v in [v2, v3]:
    print(cosine_similarity(v1, v))

Running this script returns [[ 0.71879274]] and [[ 0.29307675]] respectively, which is to say Word2Vec can recognize that the word small is more similar to tiny than it is to humble. Because Word2Vec allows one to calculate these fine gradations of word similarity, it does a great job calculating the similarity of passages from the Goldsmith training data. The following plot shows the separation achieved by running a modified version of the ‘Alzahrani algorithm’ described above, using this time Word2Vec to measure word similarity: As one can see, the Word2Vec similarity measure achieves very promising separation between plagiarized and non-plagiarized passage pairs. By repeating the subwindow method described above, one can identify the critical value wherein separation between plagiarized and non-plagiarized passages is best achieved with a Word2Vec similarity metric:

Feature Selection: Syntactic Similarity

Much like the semantic features discussed above, syntactic similarity can also serve as a clue of plagiarism.
While a thoroughgoing pursuit of syntactic features might lead one deep into sophisticated analysis of dependency trees, it turns out one can get reasonable results by simply examining the distribution of part of speech tags within Goldsmith’s plagiarisms and their source texts. Using the Stanford Part of Speech (POS) Tagger’s French and English models, and a custom mapping I put together to link the French POS tags to the universal tagset, I transformed each of the paired passages in the training data into a POS sequence such as the following: [(u'Newton', u'NNP'), (u'appeared', u'VBD'),...,(u'amazing', u'JJ'), (u'.', u'.')] [(u'Newton', u'NPP'), (u'parut', u'V'),...,(u'nouvelle:', u'CL'),(u'.', u'.')] Using these sequences, two similarity metrics were used to measure the similarity between each of the paired passages in the training data. The first measure (on the x-axis below) simply measured the cosine distance between the two POS sequences; the second measure (on the y-axis below) calculated the longest common POS substring between the two passages. As one would expect, plagiarized passages tend to have higher values in both categories: Classifier Results From the similarity metrics discussed above, I selected a bare-bones set of six features that could be fed to a plagiarism classifier: (1) the aggregate ‘Alzahrani similarity’ score, (2) the maximum six-gram Alzahrani similarity score, (3) the aggregate Word2Vec similarity score, (4) the cosine distance between the part of speech tag sets, (5) the longest common part of speech string, and (6) the longest contiguous common part of speech string. Those values were all represented in a matrix format with one pair of passages per row and one feature per column. Once this matrix was prepared, a small selection of classifiers hosted within Python’s Scikit Learn library were chosen for comparison. 
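The two part of speech measures can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used for the study: it assumes both passages have already been mapped onto one shared tagset, it treats the cosine measure as a comparison of tag-count vectors, and the function names and toy tag sequences are hypothetical.

```python
import math
from collections import Counter


def tag_cosine_similarity(tags_a, tags_b):
    """Cosine similarity between the tag-count vectors of two tag sequences."""
    counts_a, counts_b = Counter(tags_a), Counter(tags_b)
    dot = sum(counts_a[tag] * counts_b[tag] for tag in counts_a)
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def longest_common_tag_run(tags_a, tags_b):
    """Length of the longest contiguous run of tags shared by both sequences."""
    best = 0
    prev = [0] * (len(tags_b) + 1)
    for tag_a in tags_a:
        curr = [0] * (len(tags_b) + 1)
        for j, tag_b in enumerate(tags_b, start=1):
            if tag_a == tag_b:
                # Extend the run that ended at the previous tag of each sequence
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


# Toy tag sequences on a shared tagset (invented for illustration)
source_tags = ["NNP", "VBD", "DT", "JJ", "NN", "."]
target_tags = ["NNP", "VBD", "JJ", "NN", "."]

print(tag_cosine_similarity(source_tags, target_tags))
print(longest_common_tag_run(source_tags, target_tags))
```

In a feature matrix of the kind described above, each of these values would fill one column in the row for a given passage pair.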
Cross-classifier comparison is valuable, because different classifiers use very different logic to classify observations. The following plot from the Scikit Learn documentation shows that using a common set of input data (the first column below), the various classifiers in the given row classify that data rather differently: In order to avoid prejudging the best classifier for the current task, half a dozen classifiers were selected and evaluated with leave-one-out tests. That is to say, for each observation in the training data, all other rows were used to train the given classifier, and the trained classifier was asked to predict whether the left-out observation was a plagiarism or not. Because this is a two-class prediction task (each observation either is or is not an instance of plagiarism), the baseline success rate is 50%. Any performance below this baseline would be worse than random guessing. Happily, all of the classifiers achieved success rates that greatly exceeded this baseline value: Generally speaking, precision values were higher than recall, perhaps because some of the plagiarisms in the training data were fuzzier than others. Nevertheless, these accuracy values were high enough to warrant further exploration of Goldsmith’s writing. Using the array of features discussed above and others to be discussed in a subsequent post, I tracked down a significant number of plagiarisms that were not part of the training data, including the following outright translations from the Encyclopédie:

French Source Goldsmith Text Il n'est point douteux que l' Empire , composé d'un grand nombre de membres très-puissans, ne dût être regardé comme un état très-respectable à toute l'Europe, si tous ceux qui le composent concouroient au bien général de leur pays.
Mais cet état est sujet à de très-grands inconvéniens: l'autorité du chef n'est point assez grande pour se faire écouter: la crainte, la défiance, la jalousie, regnent continuellement entre les membres: personne ne veut céder en rien à son voisin: les affaires les plus sérieuses les plus importantes pour tout le corps sont quelquefois négligées pour des disputes particulieres, de préséance, d'étiquette, de droits imaginaires d'autres minuties. It is not to be doubted but that the empire, composed as it is of several very powerful states, must be considered as a combination that deserves great respect from the other powers of Europe, provided that all the members which compose it would concur in the common good of their country. But the state is subject to very great inconveniences; the authority of the head is not great enough to command obedience; fear, distrust, and jealousy reign continually among the members; none are willing to yield in the least to their neighbours; the most serious and the most important affairs with respect to the community, are often neglected for private disputes, for precedencies, and all the imaginary privileges of misplaced ambition. L' Eloquence , dit M. de Voltaire, est née avant les regles de la Rhétorique, comme les langues se sont formées avant la Grammaire. Thus we see, eloquence is born with us before the rules of rhetoric, as languages have been formed before the rules of grammar. L' empire Germanique, dans l'état où il est aujourd'hui, n'est qu'une portion des états qui étoient soûmis à Charlemagne. Ce prince possédoit la France par droit de succession; il avoit conquis par la force des armes tous les pays situés depuis le Danube jusqu'à la mer Baltique; il y réunit le royaume de Lombardie, la ville de Rome son territoire, ainsi que l'exarchat de Ravennes, qui étoient presque les seuls domaines qui restassent en Occident aux empereurs de Constantinople. 
The empire of Germany, in its present state is only a part of those states that were once under the dominion of Charlemagne. This prince was possessed of France by right of succession: he had conquered by force of arms all the countries situated between the Baltic Sea and the Danube. He added to his empire the kingdom of Lombardy, the city of Rome and its territory, together with the exarchate of Ravenna, which were almost the only possessions that remained in the West to the emperors of Constantinople. Il n'est point de genre de poésie qui n'ait son caractere particulier; cette diversité, que les anciens observerent si religieusement, est fondée sur la nature même des sujets imités par les poëtes. Plus leurs imitations sont vraies, mieux ils ont rendu les caracteres qu'ils avoient à exprimer....Ainsi l'églogue ne quitte pas ses chalumeaux pour entonner la trompette, l' élégie n'emprunte point les sublimes accords de la lyre. There is no species of poetry that has not its particular character; and this diversity, which the ancients have so religiously observed, is founded in nature itself. The more just their imitations are found, the more perfectly are those characters distinguished. Thus the pastoral never quits his pipe, in order to sound the trumpet; nor does elegy venture to strike the lyre. Conclusion Samuel Johnson once observed that Oliver Goldsmith was “at no pains to fill his mind with knowledge. He transplanted it from one place to another; and it did not settle in his mind; so he could not tell what was his in his own books” (Life of Johnson). Reading the borrowed passages above, one can perhaps understand why Goldsmith struggled to recall what he had written in his books–much of his writing was not really his. As scholars continue to advance the art of detecting textual reuse, we will be better equipped to map these borrowed words at larger and more ambitious scales. 
For the present, writers like Goldsmith offer plenty of data on which to hone those methods.

* * *

This work has benefitted enormously from conversations with a number of others. Antonis Anastasopoulos, David Chiang, Michael Clark, John Dillon, and Kenton Murray of Notre Dame’s Text Analysis Group, and Thom Bartold, Dan Hepp, and Jens Wessling of ProQuest offered key analytic insights, and Mark Olsen and Glenn Roe of the University of Chicago’s ARTFL group shared essential data. I am grateful for the generous help each of you has provided. Code is available on GitHub.

Sun, 19 Jul 2015 00:00:00 -1000 http://douglasduhaime.com/posts/crosslingual-plagiarism-detection.html

Classifying Shakespearean Drama with Sparse Feature Sets

In her fantastic series of lectures on early modern England, Emma Smith identifies an interesting feature that differentiates the tragedies and comedies of Elizabethan drama: ‘Tragedies tend to have more streamlined plots, or less plot, you know, fewer things happening. Comedies tend to enjoy a multiplication of characters, disguises, and trickeries. I mean, you could partly think about the way [tragedies tend to move] towards the isolation of a single figure on the stage, getting rid of other people, moving towards a kind of solitude, whereas comedies tend to end with a big scene at the end where everybody’s on stage’ (6:02-6:37). The distinction Smith draws between tragedies and comedies is fairly intuitive: tragedies isolate the poor player that struts and frets his hour upon the stage and then is heard no more. Comedies, on the other hand, aggregate characters in order to facilitate comedic trickery and tidy marriage plots. While this discrepancy seemed promising, I couldn’t help but wonder whether computational analysis would bear out the hypothesis.
Inspired by the recent proliferation of computer-assisted genre classifications of Shakespeare’s plays (many of which are founded upon high dimensional data sets like those generated by DocuScope), I was curious to know if paying attention to the number of characters on stage in Shakespearean drama could help provide additional feature sets with which to carry out this task. To pursue the question, I ran some analysis on the Folger Digital Texts edition of the Bard’s plays. This delightful collection uses a custom XML schema to indicate when characters enter and exit the stage, which makes it possible to track the number of characters on stage over the course of a play: This visualization of The Tempest, for instance, traces the number of characters on stage from the play’s opening scene (in which the Shipmaster and his Boatswain are quickly joined by Alonso, Sebastian, Antonio, Ferdinand, and Gonzalo) through Prospero’s staff-dashing monologue around the 15,000 word mark to the play’s crowded conclusion. Here are the stagings for the other Shakespearean plays in the FDT canon, ordered by their date of first performance according to Alfred Harbage’s Annals of English Drama: These plots afford ample evidence to suggest that Shakespearean comedies tend to end with large scenes in which everybody’s on stage. Unfortunately, many of the histories and tragedies also end with large gatherings of characters. It therefore seems that the number of characters on stage during a play’s conclusion might not be an ideal feature with which to classify the genres of Shakespeare’s plays. With these results in hand, I decided to measure how often Shakespeare isolates a single character on stage within plays from each of the three canonical genres.
Aggregating the total number of words spoken when only a single character is on stage, as well as the total number of words spoken when only two characters are on stage, and so forth, allows one to measure the degree to which each play distributes its attention between large and small gatherings of characters (click to enlarge): While these plots reveal some interesting features of the works, such as the fact that Two Gentlemen of Verona truly does revolve around dyadic pairs, they make it difficult to compare the amount of time tragedies and comedies feature only a single character on stage. To make this latter comparison, one can find the average amount of time a single character occupies the stage for each genre: Surprisingly, the chief difference between the comedies and tragedies has less to do with the way each handles isolated actors on stage than with the way each handles triads and quadrads. It seems tragedies have a greater tendency to revolve around sets of three characters, while comedies are more often organized around sets of four characters. That said, the similarities between the two genres are far more striking than their differences, and far less encouraging for one in search of distinguishing features. Reflecting on these results, I wondered if tragedies might be better classified by the amount of time their conflicted characters spend addressing the audience. One way to begin measuring the latter, I thought, would be to count the number of words spoken by each character in each play (click to enlarge): Analyzing these figures, I was struck by what should have been a fairly obvious fact: Shakespeare’s most memorable characters (Falstaff, Hal, Prospero, Rosalind, Hamlet…) are each given commanding positions within the plays they lead. 
Given the strong correlation between these memorable characters and the number of lines each speaks, it’s tempting to ask whether we remember these characters most readily simply because Shakespeare allowed them to say the most, or whether Shakespeare allowed them to say the most because he sensed they were his most memorable characters. Either way, the last trio of plots shows a fairly even distribution of commanding figures among the comedies, histories, and tragedies. But those plots also reveal that the histories include rather few words spoken by women, as well as the fact that the comedies tend to be shorter than the tragedies and histories: By analyzing only the length of a play and the number of words women speak in that play, one can start to get reasonably good separation between the genres: comedies tend to be shorter and include more female dialogue, histories tend to be longer and include less female dialogue, and tragedies split provocatively between the upper right and lower left. Reviewing these figures, I can’t shake the suspicion that a third dimension of data could unite these divided tragedies. But what would that dimension consist of? * * * I would like to thank Mike Poston, co-curator of the Folger Digital Text editions used for this analysis, for discussing many of the finer points of the FDT collection with me. In case you want to replicate any of the analysis or assess the assumptions on which it’s founded, the scripts are here. Sun, 12 Oct 2014 00:00:00 -1000 http://douglasduhaime.com/posts/classifying-shakespearean-drama-with-sparse-feature-sets.html http://douglasduhaime.com/posts/classifying-shakespearean-drama-with-sparse-feature-sets.html Batch Processing Python Scripts on Sun Grid Engine Queues Suppose you have a collection of text files and would like to compare each of those files to each of the others. Perhaps you would like to know which characters, locations, or stage directions in each file occur in any of the others. 
Whatever the task, if your collection is small enough—on the order of a few paragraphs, say—you can of course compare the files manually, reading each of your paragraphs in turn, and comparing the given paragraph to each of the others. If your collection is a bit bigger—on the order of a few hundred novels, say—you might automate these comparisons on your computer. If your collection is really big, however, a single computer might not be powerful enough to finish the job during your lifetime. Comparing each file in a four text corpus to each of the others, after all, only involves six comparisons. Running the same analysis on a collection of 50,000 files (roughly the size of the Project Gutenberg collection in English, or the EEBO-TCP corpus), however, means running 1,249,975,000 comparisons. If each of those comparisons takes one minute to execute on your computer, it will take 2376 years to run this job on your machine. Thankfully, we can expedite this process tremendously by leveraging the power of distributed computing systems like Sun Grid Engine (SGE) queues. Pursuing the routine described above, for instance, we can use an SGE system to run each of our 1+ billion comparisons in a few minutes. To get started, we’ll want to create an ‘iteration schedule’ in which we identify all of the comparisons we wish to run. Here is a visual representation of an iteration schedule for a four text corpus: In the table above, each of our iterations-to-be-run is denoted by an ‘o.’ Each ‘o’ sits in the cell that joins the row and the column that denote the two texts to be analyzed in the given iteration. Reading across our first row, for instance, we see that text one does not need to be compared to text one, but does need to be compared to texts two, three, and four. The second row denotes that we want to compare text two to texts three and four, and row three denotes that we want to compare text three to text four. 
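The schedule described above is the set of all unordered pairs of texts, which can be enumerated directly. The snippet below is a minimal sketch, assuming placeholder file names and a zero-based running index for each comparison; `build_iteration_schedule` and `n_comparisons` are hypothetical helper names, not part of any SGE tooling.

```python
from itertools import combinations


def build_iteration_schedule(file_names):
    """Return (iteration_number, first_text, second_text) rows for every pair of files."""
    pairs = combinations(file_names, 2)  # each unordered pair appears exactly once
    return [(i, first, second) for i, (first, second) in enumerate(pairs)]


def n_comparisons(p):
    """Number of pairwise comparisons needed for a corpus of p files."""
    return p * (p - 1) // 2


# Placeholder names for a four text corpus (hypothetical)
texts = ["text_1.txt", "text_2.txt", "text_3.txt", "text_4.txt"]
schedule = build_iteration_schedule(texts)

for row in schedule:
    print(*row, sep="\t")

print(n_comparisons(len(texts)))
print(n_comparisons(50000))
```

Because `combinations` never pairs a text with itself and never repeats a pair in reverse order, its output corresponds to the cells marked ‘o’ in the table.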
After determining all of the comparisons to be run, we will want to render that information in machine-readable form. More specifically, we want to generate a table that has three columns: iteration_number, first_text, and second_text, where first_text and second_text are the file names of the two texts we wish to compare in the given iteration, and iteration_number is an integer whose value is zero in the first row of the iteration schedule, 1 in the next row, 2 in the next, . . . and n-1 in the last, where n equals the total number of comparisons we wish to make. In general, the number of comparisons required is (p-1)(p)/2, where p equals the number of files in your corpus. Here is a sample iteration table:

0 A00002.txt A00005.txt
1 A00002.txt A00007.txt
2 A00002.txt A00008.txt
3 A00002.txt A00011.txt
4 A00002.txt A00012.txt
5 A00002.txt A00013.txt
6 A00002.txt A00014.txt
7 A00002.txt A00015.txt
8 A00002.txt A00018.txt
9 A00002.txt A00019.txt

If you want to batch process a different kind of routine on an SGE system, you can modify your iteration schedule appropriately. If you only want to calculate the type-token ratios of each of your files, for instance, you’ll only need two columns: iteration_number and text_name. Once this iteration schedule is all set, we can turn to the script to be run during each of these iterations. Here’s mine: If you want to run a different kind of analysis, just keep lines 1-21 and line 102 of that script, and use the variable “iteration_number” to guide which texts you will analyze in each iteration. Once your routine is all set, save it as “test.py” and upload it, along with your iteration schedule, the files you wish to compare, and these two, to a single directory on your SGE server: Once all of these files are in the same directory, you are ready to submit your script for batch processing.
To do so using the University of Notre Dame’s Center for Research Computing system, you can simply type python _run_me.py cmd_run.py your_netid: After you submit this command, the higher-order Python scripts _run_me.py and cmd_run.py will create new copies of your test.py script, changing the input files for each iteration according to your iteration schedule. If all has gone well, and you refresh the directory after a few moments, you’ll see a few (or more than a few, depending on the number of iterations you are running!) new files in your directory. More specifically, you’ll have a collection of new job files that give you feedback on the result of each of your iterations. If errors cropped up during your analysis, those errors should be recorded in these job files. Provided that there were no exceptions, though, those files will be empty, and you will find in your directory whatever output you requested in your test.py script. Et voila, now you can finish your analysis in a few minutes, rather than a few millennia! * * * I want to thank Scott Hampton and Dodi Heryadi of Notre Dame’s High Performance Computing Group, who helped me think through the logistics of batch processing, Reid Johnson of Notre Dame’s Computer Science Department, who sent me the higher-order SGE scripts on which this analysis is based, and Tim Peters, who wrote many of the Python modules on which my work depends! Thu, 26 Jun 2014 00:00:00 -1000 http://douglasduhaime.com/posts/batch-processing-python-scripts-on-sge-queues.html http://douglasduhaime.com/posts/batch-processing-python-scripts-on-sge-queues.html Identifying Poetry in Unstructured Corpora Over the last few months, I’ve been working with colleagues at Notre Dame to develop computational approaches we can use to identify the genres to which a literary work belongs. Initially, we focused our research on the georgic, a class of agricultural-cum-labour poems that flourished in the seventeenth and eighteenth centuries. 
Eventually, though, our limited research corpus led us to investigate methods we could use to identify more period poetry, and these investigations helped reveal a fascinating if simple method one can use to identify poetic works in unstructured corpora. We began building our corpus of early modern English poetry by identifying the poetry curated by the Text Creation Partnership (TCP). Running a simple Python script over the TCP’s selections from the Early English Books (EEBO) corpus—which stretches from “the first book printed in English in 1475 through 1700”—and the Eighteenth Century Collections (ECCO) corpus, we extracted all the lines of text wrapped in <l> tags (the TEI designation for a line of verse). This left us with 16,571 text files, each of which contained only poetry from roughly the sixteenth through the eighteenth centuries. After examining some of these files, we realized that many consisted entirely of poetic epigraphs, so we used another script to remove all of these small files (those smaller than 16 kb) from our research corpus, leaving us with a fairly substantive collection of poetic works from the period of interest: Because the EEBO-TCP contains 44,255 volumes—roughly one third of all titles recorded in Alain Veylit’s ESTC data for the appropriate years—we felt reasonably confident that our holdings for the sixteenth and seventeenth centuries were fairly representative of literary trends during the period. The ECCO-TCP, on the other hand, contains only 2,387 texts, less than one percent of ESTC titles from the eighteenth century. Even if we accept John Feather’s argument that only 25,131 literary works were written in English during the eighteenth century—11,789 of which, he claims, were poetic works—we are left to conclude that the 1,698 files in the ECCO-TCP corpus that contain poetry might not be indicative of poetic trends from the period. Given these conclusions, we were eager to supplement our collection of eighteenth-century poetry. 
But where on earth can one find enormous quantities of eighteenth-century poetry in digital form? (This isn’t meant to be a rhetorical question; if you’ve got ideas, please let us know!) After considering the issue for some time, we elected to work with Project Gutenberg. Unfortunately, only after we had downloaded and unzipped all of the English files on Project Gutenberg did we realize that the enormous text collection (roughly 45,000 volumes) is all but entirely unstructured. We couldn’t find any master list of file names, author names, publication dates, or any other essential metadata fields, so we had to build our own. In the first place, we wanted to be able to differentiate poetic texts from non-poetic texts. While I imagine it would be possible to complete this task by analyzing the relative frequency of strings from each of these texts in the manner described in the previous post, we didn’t have reliable publication dates for the Gutenberg texts, so we needed an alternative method. Operating on the hypothesis that poetic texts have more line breaks and fewer words per line than prose works, we decided to measure the number of words in each line of each file. We then collected a random sample of poetic works to see what their words-per-line profiles looked like: In these plots—each of which represents a single poetic text—the numbers along the x-axis indicate the number of words in a line of the text file, and the y-axis indicates the relative frequency of lines that contain such-and-such a number of words within the text. In The Poetical Works of James Beattie, for instance, only ~5% of lines had 12 or 13 words in them, whereas almost 20% of the text’s lines had 7 or 8 words in them. In other words, The Poetical Works of James Beattie is dominated by lines with seven or eight words in them, a fact that applies to all of the poetic works plotted above. 
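A words-per-line profile of the kind plotted above takes only a few lines of Python to compute. The sketch below is a minimal illustration rather than our original script; the function names are hypothetical, and blank lines are skipped on the assumption that they carry no information about line length.

```python
from collections import Counter


def words_per_line_profile(text):
    """Map each line length (in words) to its share of the non-blank lines."""
    counts = Counter(
        len(line.split()) for line in text.splitlines() if line.strip()
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: count / total for length, count in sorted(counts.items())}


def dominant_line_length(text):
    """Return the most common number of words per line, or None if empty."""
    profile = words_per_line_profile(text)
    if not profile:
        return None
    return max(profile, key=profile.get)


stanza = "a b c\nd e f\n\ng h\n"
print(words_per_line_profile(stanza))
print(dominant_line_length(stanza))
```

Plotting the keys of the returned dictionary on the x-axis and the values on the y-axis gives one curve per text.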
With these figures in hand, we plotted the words-per-line profiles for a random assortment of prose works from roughly the same period: We were pleased to see that these plots differed from the poetic plots quite dramatically! Comparing the two sets of curves, we see that poetic works contain a preponderance of lines with 7-8 words, while prose works contain a preponderance of lines with 11-12 words. This is naturally due to the fact that lines of text in prose works run across an entire page, while poets break lines strategically (and regularly in eighteenth-century verse). To identify poetry in unstructured corpora, then, we can calculate a text’s words-per-line profile, and use the results of those calculations in order to classify each text in our corpus as a work of poetry or a work of prose. Using a rather simple approach to the latter task, we found 3150 poetic works tucked in the Gutenberg corpus, a few hundred of which are from our period and can thus contribute to our study of genre classification. Mon, 05 May 2014 00:00:00 -1000 http://douglasduhaime.com/posts/identifying-poetry-in-unstructured-corpora.html http://douglasduhaime.com/posts/identifying-poetry-in-unstructured-corpora.html Ngram Frequency and Eighteenth-Century Commonplaces When Samuel Richardson’s Mrs. Jewkes remarks that ‘Nought can restrain consent of twain,’ we confidently conclude she’s quoting Harington’s translation of Orlando Furioso. When Edmund Burke writes in his Philosophical Enquiry, ‘Dark with excessive light thy skirts appear,’ we know he’s misquoting Milton. While passages like these make their debts fairly clear, though, in most cases literary influence is notoriously difficult to trace. When Mary Wollstonecraft identifies marriage as a form of ‘legal prostitution’ in her Vindications, for instance, are we meant to reflect on the thrust of that phrase in Defoe’s Matrimonial Whoredom? 
When Ann Radcliffe’s Adeline and La Motte stroll ‘under the shade of ‘melancholy boughs’’ in The Romance of the Forest, what gives us the warrant to imagine Orlando ‘under the shade of melancholy boughs’ in As You Like It? In each of the aforementioned cases, both the quoting and the quoted texts include identical (or nearly-identical) sequences of words. If this property is a necessary condition for intertextuality, however, it is clearly not a sufficient one, for while Wollstonecraft’s second Vindication and Defoe’s Matrimonial Whoredom both use the phrase ‘legal prostitution,’ they also both use the phrase ‘if it be,’ as well as the phrase ‘a kind of.’ Nonetheless, literary scholars don’t identify the latter two strings as instances of intertextuality, perhaps because we intuitively sense that ‘if it be’ and ‘a kind of’ are far more common phrases during the period than ‘legal prostitution,’ a thesis to which Google lends some confidence: Such queries demonstrate something literary scholars have known for a long time, namely the fact that the passages we classify as instances of intertextuality have (1) common words in a common order, and (2) significantly lower relative frequency rates than other (equally long) strings from the same period. With this insight in mind, I built an API for the Google Ngrams data with which one can pull down the relative frequencies of a list of strings shared by two (or more) works. Given a set of substrings shared by two texts, and given the relative frequencies of each of those strings in the age during which those texts were published, one can eliminate high frequency strings and thereby reduce the number of passages scholars must hand review to identify relevant instances of intertextuality. 
Although I developed the Ngram API to eliminate high frequency strings from the output of my sequence alignment routines, it eventually helped me to discover an interesting correlation between the relative frequency of n-grams and instances of intertextuality. This discovery unfolded in the following way. On a whim, I decided to examine the relative frequencies of bigrams across passages from a few canonical works published during the long eighteenth century: Henry Fielding’s Joseph Andrews (1742), Edmund Burke’s Enquiry (1757), and Maria Edgeworth’s Ennui (1809). Each of the selections that I drew from these texts centers on a quotation of another writer—the Fielding passage quotes Virgil’s Aeneid, the Burke passage quotes Shakespeare’s Henry V, and the Edgeworth passage quotes Voltaire’s ‘La Bégueule.’ I broke each of these passages down into a set of sequential bigrams, and submitted each of the bigrams to the Google Ngrams data via the API described above. In the case of Burke, for example, I fired up the API and entered the following data into the input fields: After identifying these parameters and clicking ‘Go!’, I watched the tool navigate to the Google Ngram site and search for the relative frequency of the first two words in the Burke passage. The API limits the historical scope of this search to the period between 1752 and 1762 (the user-provided publication date of Burke’s text plus and minus five years), because the Google Ngram data is a bit noisy, and we don’t want anomalies in the data for 1757 to skew our sense of the bigram’s relative frequency in the period. The API then calculates the mean value for the bigram’s relative frequency across those years, and it writes the bigram, the publication year, and the calculated relative frequency to an output file. It then looks at the next bigram (containing words two and three), and reiterates the process, continuing in this fashion until it has queried all valid ngrams in the input file. 
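The two calculations at the heart of this loop, splitting a passage into sequential bigrams and averaging a bigram's relative frequency across the eleven-year window, can be sketched as follows. This is a hedged illustration, not the tool's code: it leaves out the browser automation entirely and assumes the yearly frequencies have already been fetched into a plain dictionary (the `yearly` argument and both function names are hypothetical).

```python
def sequential_bigrams(passage):
    """Split a passage on whitespace and return its overlapping word pairs."""
    words = passage.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def mean_frequency(yearly, pub_year, radius=5):
    """Average the relative frequencies from pub_year - radius to pub_year + radius.

    `yearly` maps a year to a relative frequency. Years missing from the
    mapping are skipped; None is returned if the window is empty.
    """
    window = [
        yearly[year]
        for year in range(pub_year - radius, pub_year + radius + 1)
        if year in yearly
    ]
    if not window:
        return None
    return sum(window) / len(window)


print(sequential_bigrams("dark with excessive light"))
print(mean_frequency({1752: 1.0, 1757: 3.0, 1770: 100.0}, 1757))
```

Averaging over the window rather than reading off the single publication year is what dampens the year-to-year noise mentioned above.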
Preliminary analysis suggests that one can then use this output file to identify instances of intertextuality, even in cases in which one does not have access to the referenced text. (This task is referred to as ‘intrinsic plagiarism detection’ within related scholarship.) Using the aforementioned selections from eighteenth-century texts, I used the method described above to calculate the relative frequencies of the bigrams in each of those selections. I then plotted the bigram frequencies with R’s scatter.smooth() function—identifying the first bigram in the selection as bigram number one, the second bigram as bigram number two, and so forth across the x-axis—so that I could better identify the trends in bigram frequency across each passage. I was surprised by the results (click to enlarge): In each case, the local minimum of the regression line centers on the instance of intertextuality in the queried passage! While this trend is promising, though, it could be due to a number of causes. Chief among these are the differences in language and historical period that divide each of the ‘quoting’ texts cited above from the passage that that work quotes. As we noted above, Henry Fielding quotes Virgil, Burke quotes Shakespeare, and Edgeworth quotes Voltaire, all in the original languages. When we compare the relative frequency of bigrams in Latin, French, and Elizabethan English with bigrams written in colloquial English of the mid- to late-eighteenth century, then, we should perhaps not be surprised that the latter tend to be more common in the Ngram data from that period, ceteris paribus. These initial results yield new questions: Can the method described above identify instances of poetry in works of prose from a particular period? Can such a method be integrated into an ensemble approach to intertextuality, or do these graphs merely contain a half-told truth, mysterious to descry, which in the womb of distant causes lie? 
Such are the questions I hope to pursue in subsequent work. Thu, 13 Mar 2014 00:00:00 -1000 http://douglasduhaime.com/posts/ngram-frequency-and-eighteenth-century-commonplaces.html http://douglasduhaime.com/posts/ngram-frequency-and-eighteenth-century-commonplaces.html Training the Stanford NER Classifier Working with Professor Matthew Wilkens, my fellow doctoral student Suen Wong, and undergraduates at Notre Dame, I have spent the last few months using the Stanford Named Entity Recognition (NER) classifier to identify locations in a few thousand works of nineteenth-century American literature. Using the NER classifier—an enormously powerful tool that can identify such “named entities” in texts as people, places, and company names—our mission was to find all of the locations within Professor Wilkens’ corpus of nineteenth-century novels. While Stanford’s out-of-the-box classifier could be used for such a purpose, we elected to retrain the tool with nineteenth-century text files in order to improve the classifier’s performance. In case others are curious about the process involved in retraining and testing a trained classifier, I thought it might be worthwhile to provide a quick summation of our method and findings to date. In order to train the classifier to correctly identify locations in a text, users essentially provide the classifier with a substantial quantity of texts. These texts are annotated in such a way as to teach the classifier to correctly identify locations. More specifically, these “training texts” break a text document into a series of words, each of which users must identify as a location or a non-location. The training files look a bit like this: the 0 Greenland LOC whale 0 is 0 deposed 0 , 0 - 0 the 0 great 0 sperm 0 whale 0 now 0 reigneth 0 ! 
0 In this sample, as in all of the training texts, each word (or “token”) is listed on its own line, followed by a tab and then a “LOC” or a “0” to indicate whether the given token is or is not a location. Users can feed the Stanford classifier this data, and the tool can use this information in order to improve its ability to classify locations correctly. In our training process, we collected hundreds of passages much longer than the sample section above, and we processed those passages in the way described above—with each token on its own line, followed by a tab and then a “LOC” or a “0”. (Technically, we also identified persons and organizations, but the discussion is simpler if we ignore these other categories for now.) We then used a quick Python script to sort these hundreds of annotated text chunks into ten directories of equal size, and another script to combine all of the chunks within each directory into a single file. Once we had ten unique directories, each containing a single amalgamated file, we trained and tested ten classifiers. To train the first classifier, we combined the annotated texts contained in directories 2-10 into a single text file called “directories2-10combined.tsv”. We then created a .prop file we could use to train the first classifier. This .prop file looked very similar to the default .prop template on the FAQ page for the NER classifier: # location of the training file trainFile = directories2-10combined.tsv # location where you would like to save (serialize) your # classifier; adding .gz at the end automatically gzips the file, # making it smaller, and faster to load serializeTo = ner-model.ser.gz # structure of your training file; this tells the classifier that # the word is in column 0 and the correct answer is in column 1 map = word=0,answer=1 # This specifies the order of the CRF: order 1 means that features # apply at most to a class pair of previous class and current class # or current class and next class. 
maxLeft=1 # these are the features we'd like to train with # some are discussed below, the rest can be # understood by looking at NERFeatureFactory useClassFeature=true useWord=true # word character ngrams will be included up to length 6 as prefixes # and suffixes only useNGrams=true noMidNGrams=true maxNGramLeng=6 usePrev=true useNext=true useDisjunctive=true useSequences=true usePrevSequences=true # the last 4 properties deal with word shape features useTypeSeqs=true useTypeSeqs2=true useTypeySequences=true wordShape=chris2useLC We saved this file as propforclassifierone.prop, and then built the classifier by executing the following command within a shell: java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \ -prop propforclassifierone.prop This command generated an NER model that one can invoke within a shell using a command such as the following: java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \ -loadClassifier ner-model.ser.gz -testFile directoryone.tsv This command will analyze the file specified by the last flag—namely, “directoryone.tsv”, the only training text that we withheld when training our first classifier. We withheld “directoryone.tsv” so that we could test our newly-trained classifier on the file. Because we have already hand identified all of the locations in the file, to test the performance of the trained classifier we need only check to see whether and to what extent the trained classifier is able to find those locations. Similarly, after training our second classifier on training texts 1 and 3-10, we can test that classifier’s accuracy by seeing how well it identifies locations in directorytwo.tsv. In general, we can train our classifier on all ten of our training texts save for one, and then test the classifier on that one tsv file. 
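The hold-one-out rotation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the scripts we used: the round-robin dealing in `assign_to_folds` is one simple way to get folds of near-equal size, and both function names are hypothetical.

```python
def assign_to_folds(chunks, k=10):
    """Deal annotated chunks round-robin into k folds of near-equal size."""
    folds = [[] for _ in range(k)]
    for index, chunk in enumerate(chunks):
        folds[index % k].append(chunk)
    return folds


def leave_one_out_splits(folds):
    """Yield (training_folds, held_out_fold) once for every fold."""
    for i, held_out in enumerate(folds):
        yield folds[:i] + folds[i + 1:], held_out


folds = assign_to_folds(list(range(25)), k=10)
splits = list(leave_one_out_splits(folds))
print(len(splits), len(splits[0][0]), splits[0][1])
```

Each split pairs nine folds for training with one fold for testing, so every annotated chunk is used for testing exactly once.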
This method is called “ten-fold cross validation,” because it gives us ten opportunities to measure the performance of our training routine and thereby estimate the future success rate of our classifier. Running the last command listed above generates a text with three columns: the first column contains the tokens in “directoryone.tsv”, the second column contains the “0”s and “LOC”s we used to classify those tokens by hand in our training texts above, and the third column contains the classifier’s guess regarding the status of each token. In other words, if the first word in “directoryone.tsv” is “Montauk”, and we designated this token as a location, but the trained classifier did not, the first row of the output file will look like this: Montauk LOC 0 By measuring the degree to which the tool’s classifications match our human classifications, we can measure the accuracy of the trained classifier. After training all ten classifiers, we did precisely this, measuring the success rates of each classifier and plotting the resulting figures in R: This first plot compares the number of true positive locations that the out-of-the-box Stanford classifier identified in each of our .tsv files with the number of true positives our trained classifiers identified in each .tsv file. A true positive location is a token that the classifier has identified as a location (this is what makes it “positive”) that we have also designated as a location (this is what makes it “true”). If the classifier designates a token as a location but we have identified it as a non-location, that counts as a “false positive.” The following graph makes it fairly clear that the out-of-the-box classifier tends to produce many more false positives than our trained classifiers: While measuring true positives and false positives is important, it’s also important to measure false negatives, which are tokens that we have identified as locations that the classifier fails to identify as locations. 
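Tallying these categories from the three-column output is straightforward. The sketch below assumes each row holds a token, our hand label, and the classifier's guess separated by whitespace, as described above; rows that do not have exactly three fields (such as blank separator lines) are skipped. The function name `confusion_counts` is hypothetical.

```python
def confusion_counts(rows, positive="LOC"):
    """Count true/false positives and negatives in token / gold / guess rows."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for row in rows:
        fields = row.split()
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        _token, gold, guess = fields
        if guess == positive and gold == positive:
            counts["tp"] += 1
        elif guess == positive:
            counts["fp"] += 1
        elif gold == positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


rows = [
    "Montauk\tLOC\t0",
    "whale\t0\t0",
    "",
    "Greenland\tLOC\tLOC",
    "sperm\t0\tLOC",
]
print(confusion_counts(rows))
```

In this toy input the first row is the “Montauk” miss discussed above, which is counted as a false negative.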
The following graph illustrates the fact that the trained classifier tended to miss many more locations than did the out-of-the-box classifier: Aside from true positives, false positives, and false negatives, the only other possibility is “true negatives”, which are so numerous as to almost prevent comparison when plotted together: While the plots of true positives, false positives, and false negatives above speak to some of the strengths and weaknesses of the trained classifiers, those who work in statistics and information retrieval like to combine some of these values in order to offer additional insight into their data. One such combination that is commonly employed is called “Precision”, which in our case is a measure of the degree to which those tokens identified as locations by a given classifier are indeed locations. More specifically, precision is calculated by taking the total number of true positives and dividing that number by the combined sum of true positives and false positives: P = TP / (TP + FP). Here are the P values of the trained and untrained classifiers: Another common measure used by statisticians is “Recall”, which is calculated by dividing the number of true positives by the sum total of true positives and false negatives: R = TP / (TP + FN). In our tests, recall is essentially an indication of the degree to which a given classifier is able to find all of the tokens that we have identified as locations. Clearly the trained classifier did not excel at this task: Finally, once we have calculated our precision and recall values, we can combine those values into an “F measure,” which serves as an abstract index of both. There are many ways to calculate F values, depending on whether precision or recall is more important for one’s experiment, but to grant equal weight to both precision and recall, we can use the standard harmonic mean of the two: F = 2PR / (P + R). 
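For readers who would like to check these three measures by hand, here is a small Python sketch. The function name is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example rather than part of the definitions above.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall, weighting both equally.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# A classifier that finds 8 of 16 true locations and raises 2 false alarms.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=8)
print(round(p, 3), round(r, 3), round(f, 3))
```

Note the grouping in each denominator: dividing by the whole sum is what distinguishes these measures from a literal left-to-right reading of the formulas.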
The F values below may serve as an aggregate index of the success of our classifiers: So what do these charts tell us? In the first place, they tell us that the trained classifiers tend to operate with much greater precision than the out-of-the-box classifier. To state the point slightly differently, we could say that the trained classifier had far fewer false positives than did the untrained classifier. On the other hand, the trained classifier had far more false negatives than did the untrained classifier. This means that the trained classifier incorrectly identified many locations as non-locations. In sum, if our classifier were a baseball player, it would swing at only some of the many beautiful pitches it saw, but if it decided to swing, it would hit the ball pretty darn well. It stands to reason that further training will only continue to improve the classifier’s performance. After all, the NER classifier learns from the grammatical structures of the training files it is fed, which allows the classifier to correctly identify locations it has never encountered before. (One can independently prove that this is the case by running a few Python scripts on the data generated by the classifier.) As we continue to feed the classifier additional grammatical constructs that are used in discussions of locations, the classifier should expand its “location vocabulary” and should therefore be more willing to swing at pretty pitches. Once we’ve finished compiling the last two thirds of our training texts, we will be able to retrain the classifiers and see whether this hypothesis holds any water. Here’s looking forward to seeing those results! Sun, 10 Nov 2013 00:00:00 -1000 http://douglasduhaime.com/posts/training-the-stanford-ner-classifier.html http://douglasduhaime.com/posts/training-the-stanford-ner-classifier.html Introducing the Literature Online API When it comes to literary research, Literature Online is no doubt one of the best digital resources around. 
The site hosts a third of a million full-length texts, the definitive collection of digitized criticism, and a robust interface that boasts such advanced features as lemmatized and fuzzy spelling search options. When I started looking for a public API that would allow users to mobilize the site’s resources in an algorithmic fashion, though, there was no API to be found. So I decided to build my own. Using Python’s Selenium package, I built an API that sends queries to Literature Online in a procedural fashion and generates clean, user-friendly output data. The program runs as follows: After double-clicking the literatureonlineapi.exe file (or the literatureonlineapi.app file) linked in the Tools tab of this site, the following GUI appears: Using this interface, users may select the appropriate checkboxes pictured above to identify whether they would like to employ Literature Online’s fuzzy spelling and/or lemmatized search features. Additionally, users can limit potential matches by publication date and author date ranges. Then, users may click the “Select Input File” button to select a file they would like to use to query Literature Online. This file should be a plain text file that contains one or more words or series of words one would like to use to search Literature Online. The program will send the first n words of this file to Literature Online, where n = the value of “window size” (in the image above, n = 3). The program will then record the name, publication date, and author of texts that contain the first n words of your file. This match will be an exact match–i.e. if n = 3 and the first three words of your file are “the king will”, the program will find all texts in the Literature Online database that contain the exact string “the king will”. Then the script will look at words p + 1 through n + p in your plain text file, where p = window slide interval. 
In the image above, p = 1 and n = 3, so in its second pass through our hypothetical text file the program would look at words 2 through 4 (inclusive). The program will once again pull down all relevant metadata for the found hits. It will then slide p words forward once again, examining words 3 through 5, and so forth, until it reaches the end of the document. Then, once it has reached the end, it will go back to the beginning of the document and repeat the process, this time submitting not exact searches, but proximity searches. E.g. instead of searching “the king will”, the program will find all instances of “the near.3 king near.3 will” and then slide its search window forward in the customary fashion. Finally, the program will write its .tsv output to the directory selected with the “Select Output Location” button. In the case of the sample string discussed in this paragraph, the output file looks like this: Users can then use this output to create plots, inform stylometric analysis, or simply to help allocate their readerly attention in a more efficient manner. Let’s suppose you wanted to find all texts in the Literature Online database that contain the word “king,” as well as all of the texts that contain the word “queen.” In this case, you could proceed as follows: First, make sure you have Firefox installed on your computer. Then, download the LiteratureOnlineAPI folder, open it up, and double click the file entitled “literatureonlineapi.exe”. If you double click that file, the GUI pictured above should appear. Next, create and save a text file that contains only “king queen”. After selecting this file with the “Select File” button, set the “window size” to 1. Doing so will tell the program that you want each of the searches you send to Literature Online to contain exactly one word (where a word is defined as any character or series of characters bounded by whitespace). 
Next, set the slide interval to 1, so that the program will know to send the first query (“king”), then slide forward 1 word to “queen” and submit that search term. Finally, click Start. If all goes to plan, a Firefox window will open up and the program will be off and running. If you need to terminate the program, just close that Firefox window. Doing so, however, will prevent the out.tsv output from documenting any found matches. If this happens, you can simply restart the program. Building this tool was a blast, not least of all because doing so allowed me to learn much more about Webdriver, GUIs, and code compilation. Here’s hoping the finished tool will help others create stimulating literary and historical research! 2017 Update: The New Oxford Shakespeare Published In 2017, Oxford University Press published the volume for which the tool described above was built–The New Oxford Shakespeare: This collection of volumes contains an enormous wealth of scholarship that revises our understanding of the works of William Shakespeare. The Authorship Companion volume in the collection focuses in particular on reassessing the body of work that William Shakespeare wrote, leveraging stylometric analysis to isolate the portions of Shakespeare’s plays most probably written by playwrights with whom the bard collaborated (a common practice in his day). The premier Shakespearean Gary Taylor wrote a tremendous amount of scholarship for this collection, including a chapter on which we collaborated using the tool described above. If you have a chance to review the stylometric work we undertake in the volume, it would be great to hear your thoughts: I’ll close by noting that the LION API itself is no longer publicly available, but ProQuest does make raw data exports available for partner institutions in case other data-driven scholars wish to conduct related research. 
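For readers who want to reproduce the windowing logic described earlier in this post without the GUI, the query strings themselves are easy to generate. The sketch below is an approximation, not the tool's source: it assumes only full windows are submitted (the behavior at the very end of a file is not specified above), and both function names are hypothetical.

```python
def sliding_windows(text, window_size, slide_interval):
    """Return every full window of `window_size` words, stepping by `slide_interval`."""
    words = text.split()  # a word is any run of characters bounded by whitespace
    last_start = len(words) - window_size
    return [
        " ".join(words[start:start + window_size])
        for start in range(0, last_start + 1, slide_interval)
    ]


def proximity_query(window, distance=3):
    """Rewrite an exact-match window as a proximity search string."""
    return f" near.{distance} ".join(window.split())


for window in sliding_windows("the king will not yield", 3, 1):
    print(window, "|", proximity_query(window))
```

With a window size of 1 and a slide interval of 1, the same function turns the “king queen” file from the walkthrough into the two single-word queries.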
Fri, 13 Sep 2013 00:00:00 -1000 http://douglasduhaime.com/posts/introducing-the-literature-online-api.html http://douglasduhaime.com/posts/introducing-the-literature-online-api.html Submitting Python Scripts to SGE Queues I have recently begun submitting scripts to my home institution’s computer cluster. Although submitting jobs to the cluster is a fairly straightforward task, it took me a bit of time to figure out how to format jobs such that the Sun Grid Engine queuing system employed here at Notre Dame could process my scripts and distribute them over the cluster. In case others would like to be able to distribute jobs over a Unix-based computer cluster that uses an SGE front end, I wanted to briefly type up the protocols I have followed to accomplish this task. To submit a script to Notre Dame’s SGE front end, one needs two files: a hashbanged script that one would like to run, and a .job file. Say the script you want to run is a Python script called “tmp.py”. In order to prepare this script for the cluster, you will want to add a line at the very start of the script in order to point to the version of Python that you plan to employ in your script. This line has a variety of amusing names—it is sometimes called a shebang line, or a hashbang, crunchbang, hashpling, pound bang, etc.—but it usually takes a form such as the following: #!/usr/bin/env python Then, once you have your script ready to go, you only need to make a .job file that can tell the cluster where to find all of the elements required to run the script. My .job files look something like this: #!/bin/csh #$ -M dduhaime@nd.edu #$ -m abe #$ -r y #$ -o tmp.out #$ -e tmp.err module load python/2.7.3 echo "Start - `date`" python tmp.py echo "Finish - `date`" The first line in this .job file indicates that the script intends to send a command to the C shell. The second line specifies the email address to which the cluster will report its output. 
The third line specifies that you would like the cluster to send emails to the email address in the -M line above when the job aborts, begins, or ends (the “a”, “b”, and “e” in “abe”). The fourth line is meant to indicate if your script is re-runnable (it appears that Gaussian scripts and certain machine learning algorithms are not re-runnable and so should be flagged -r n). The final two lines prefaced with #$ name the files to which the job’s standard output and standard error will be written. The line that begins “module load” indicates, as one might expect, the module one would like to load (make sure to specify the version of the software you would like to run). The penultimate line indicates the name of the script you would like to run, and the echo lines merely ask the C shell to identify the start and end times of the job. Using a text editor like Notepad++, users can modify these fields to suit the demands of their job, and then save the file as something like “test.job”. Once you have your hashbanged script and your .job file, upload them to a directory on your cluster. I use Filezilla for this purpose, but one could accomplish the same task at the command line. Then, ssh into the cluster and navigate to that directory (if you are working on a Windows machine, you can use Putty to accomplish this task). Then, once you’re in the directory in which your .job and .py files are located, you can submit your job by simply typing “qsub test.job” and hitting enter. If you left the “-m abe” line in your .job script, you should soon receive an email indicating that your job has been submitted. When I first started submitting scripts, I could tell by the email reports I received that my jobs began and ended almost simultaneously. Eventually I realized that I had not properly formatted my file paths. Once I fixed the file paths, all was well and the scripts ran properly. I found it helpful to try running my scripts from the command line before submitting them with a job file. 
To do this with a Python script, you need only ssh into the cluster, navigate to the directory in which your script is located, and then type “python tmp.py” followed by a return. If your script is properly formatted, any print statements in your script will print to the terminal. If there is a problem with your script, the terminal will list the error message. Using the cluster has allowed me to process computationally-demanding jobs very rapidly, which has in turn allowed me to continue refining my scripts quickly. I hope employing methods similar to those above can help some readers to submit their own scripts to clusters. Mon, 11 Jun 2012 00:00:00 -1000 http://douglasduhaime.com/posts/submitting-python-scripts-to-sge-queues.html http://douglasduhaime.com/posts/submitting-python-scripts-to-sge-queues.html