Over the last year or so, several curious circumstances sent me down the rabbit hole of algorithmic music composition. First an intriguing question on classifying genuine and fake piano rolls, then a brilliant professor writing an opera on the life of Alan Turing, and finally a gifted graduate student asking probing questions about OpenAI’s VQ-VAE model all made me increasingly interested in generating music with machine learning. After I shared some early results from my explorations, a few friends were interested in learning more. This post is my attempt to share some of the paths I’ve been pursuing, and to lay out some relatively easy ways to get started with automatic music generation.
To keep things as simple as possible, the post below describes how one can use basic Markov models to generate MIDI audio. We’ll first examine how Markov models work by building a simple text generation model in a dozen or so lines of Python. Then we’ll discuss how one can convert MIDI data to text sequences, which will let us use the same Markov model approach to generate MIDI audio. Finally, to spice things up a bit, we’ll convert our generated MIDI files into chiptune waveform audio with a disco dance beat. Let’s dive in!
Building Markov Models
While the term “Markov model” is used to describe a wide range of statistical models, essentially all Markov models follow one basic rule: the model generates a sequence of outputs, and each element in the sequence is conditioned only on the prior element in the sequence. Given a single word, a Markov model can predict the next word in the sequence. Given a pixel, a Markov model can predict the next pixel in the sequence. Given an item in a sequence, a Markov model can predict the next item in the sequence.
As an example, let’s build a Markov model that can accomplish a simple text generation task. Our goal will be to train a model using the plays of William Shakespeare, then to use that model to generate new pseudo-Shakespearean play text. We’ll train our model using tiny-shakespeare.txt, a single file that contains raw text from Shakespeare’s plays. Here are the first few lines from the file:
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First Citizen:
You are all resolved rather to die than to famish?
All:
Resolved. resolved.
As you can see, the text follows a regular format in which a character’s name immediately precedes their speech. To help our model recognize these character speech boundaries, let’s add START and END tokens before and after each speech, like so:
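The snippet below is a minimal sketch of that step; it assumes the raw text lives in a local tiny-shakespeare.txt file and that each speech is separated from the next by a blank line.

# read the raw play text; each speech is separated from the next by a blank line
text = open('tiny-shakespeare.txt').read()

# wrap each speech in START and END tokens to mark speech boundaries
training_data = '\n'.join('START ' + speech.strip() + ' END' for speech in text.split('\n\n'))
print(training_data)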
If you print training_data, you’ll see that it includes the word START before each speech and the word END after each speech:
START First Citizen:
Before we proceed any further, hear me speak. END
START All:
Speak, speak. END
START First Citizen:
You are all resolved rather to die than to famish? END
START All:
Resolved. resolved. END
Those START and END tokens will help our model know what a proper speech looks like so it can create new speeches that have the same format as our training speeches.
Having prepared the training data, we can now train our model. To do so, we just need to build up a dictionary in which we map each word to the list of words by which it is followed. For example, given the sequence 1 2 1 3, our dictionary would look like this: {1: [2, 3], 2: [1]}. This dictionary tells us that the value 1 is followed by 2 and 3, while the value 2 is followed by only 1. The value 3 is not followed by anything, because it’s the last token in our sequence. Let’s build this dictionary using our Shakespearean text data:
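The snippet below is one minimal way to do it; it assumes training_data is the START/END-annotated string built above.

from collections import defaultdict

# map each word to the list of words that follow it in the training data
next_words = defaultdict(list)
words = training_data.split()
for current_word, following_word in zip(words, words[1:]):
    next_words[current_word].append(following_word)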
That’s all it takes to train a Markov model!
If we examine next_words, we’ll find that it maps each key to the list of words by which it is followed. The values of this dictionary contain duplicates by design. If the word “to” is followed by the word “be” often but is only followed by the word “suffer” once, then given the word “to”, our model should be more likely to predict “be” than “suffer”. In slightly fancier parlance, our next_words dictionary represents the weighted probabilities of a particular word following another particular word. To generate new sequences, we’ll simply sample from those weighted probabilities and piece together a sequence of text word by word.
Now for the fun part. Let’s use the model to generate new speeches. To do so, we’ll run the following loop 100 times. First we’ll randomly select a word that follows the START token. Character names always follow the START token, so the first word in each speech will contain a character’s name. Then we’ll use next_words to randomly select one of the words that appears after the selected character’s name. For example, if our selected character is “Claudius”, in this step we’ll randomly select one of the words that immediately follows the word “Claudius” (i.e. one of the words with which Claudius begins one of his speeches). Then we’ll randomly sample a word that follows that last word. We’ll carry on in this way until we hit an END token, at which point we conclude the speech. We can implement this operation in code as follows:
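The loop below is a minimal sketch of that procedure, relying only on the next_words dictionary built above.

import random

# generate 100 new speeches by walking the next_words map
for _ in range(100):
    speech = []
    # character names always follow the START token
    word = random.choice(next_words['START'])
    # keep sampling words that follow the previous word until we hit END
    while word != 'END':
        speech.append(word)
        word = random.choice(next_words[word])
    print(' '.join(speech))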
That’s all it takes to sample from a Markov model! The output of that block should read like a collection of lunatics muttering Shakespearean nonsense:
Upon the way to see.
I have vouchsafed,
With your promise pass'd:
I do you confine yourself desired of your favour,
I do foretell of mine.
MARCIUS:
May these men should say you must
change this second Grissel,
And Roman camp conduct him with your soul
The generated text looks pseudo-Shakespearean! Next let’s see if we can train some Markov models that generate musical expressions.
Making Music with Markov Models
As it turns out, we can use essentially the same strategy we used above to generate music with Markov models. To do so, we just need to convert an audio file into a text file. To accomplish this goal, we can parse a midi file and convert each note in the file into a word. The fantastic music21 Python library, written by Michael Scott Cuthbert’s lab at MIT, makes this task fairly straightforward. We can install music21 and all the dependencies we’ll use below as follows:
pip install music21==7.1.0
pip install nltk==3.6.2
pip install pretty-midi==0.2.9
pip install scipy==1.4.0
pip install https://github.com/duhaime/nesmdb/archive/python-3-support.zip
After installing music21, we can use the function below to convert ambrosia.midi (a charming melody from the 8-bit Nintendo game Ultima III) into a string. Here’s the midi file, and here’s how we’ll convert it into a string:
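The function below is a rough sketch of such a conversion; the exact token shapes (n_<pitch>_<duration> for notes, w_<duration> for waits, with durations written as fractions of a 4/4 bar) are illustrative assumptions rather than a canonical encoding.

from fractions import Fraction
from music21 import converter

def midi_to_string(midi_path):
    score = converter.parse(midi_path)
    tokens = []
    last_offset = Fraction(0)
    # walk every note (and chord) in the file in onset order
    for element in score.flatten().notes:
        offset = Fraction(element.offset)
        # if time has passed since the last onset, emit a wait token;
        # durations are kept as fractions of a bar (four quarter notes, assuming 4/4)
        if offset > last_offset:
            tokens.append('w_{}'.format((offset - last_offset) / 4))
            last_offset = offset
        duration = Fraction(element.duration.quarterLength) / 4
        # chords contribute one note token per pitch
        for p in element.pitches:
            tokens.append('n_{}_{}'.format(p.midi, duration))
    return ' '.join(tokens)

s = midi_to_string('ambrosia.midi')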
The block above turns ambrosia.midi into the string s. Within that string, each note in ambrosia.midi is represented by a token that begins with “n_” and each pause between notes is represented by a token that begins with “w_”. If we print s, we can see the string representation of our MIDI data more clearly:
This string indicates that the file begins with a full bar of rest. Next we play notes 65 and 38 for a quarter bar and half bar respectively, then wait half a bar, then play note 62 for an eighth bar, and so on. In this way, using just two token types (“n_” tokens and “w_” tokens) we can record each keystroke that should be played as well as the durations of time between those keystrokes. We leave note durations in fractional form to prevent floating point truncation.
To test if this conversion worked, let’s reverse the process and convert the string s into a new midi file. If both conversions were successful, we should expect that new midi file to sound like the original ambrosia.midi file. Happily, music21 makes the conversion from string to midi straightforward as well:
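The function below sketches the reverse conversion under the same assumed token format; the output file name is just an example.

from fractions import Fraction
from music21 import note, stream

def string_to_midi(s):
    score = stream.Stream()
    offset = Fraction(0)
    for token in s.split():
        if token.startswith('w_'):
            # advance the running offset by the wait duration (bars -> quarter lengths)
            offset += Fraction(token.split('_')[1]) * 4
        elif token.startswith('n_'):
            _, midi_pitch, duration = token.split('_')
            n = note.Note()
            n.pitch.midi = int(midi_pitch)
            n.duration.quarterLength = Fraction(duration) * 4
            score.insert(offset, n)
    return score

# reverse the conversion and listen for differences from the original
string_to_midi(s).write('midi', 'ambrosia-roundtrip.midi')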
As you can see, the block above simply reverses the operations performed in midi_to_string, converting each token into a midi note. The resulting midi file should indeed sound like the midi with which we started:
Now we’re rolling! From here, all we have to do is train a Markov model on the string representation of our MIDI file. To do so, let’s transform the Markov model we used above into a reusable function:
from collections import defaultdict
from nltk import ngrams
import random

def markov(s, sequence_length=6, output_length=250):
    # train markov model
    d = defaultdict(list)
    # make a list of tuples in which each tuple contains a word sequence
    tokens = list(ngrams(s.split(), sequence_length))
    # store the map from a token to its following tokens
    for idx, i in enumerate(tokens[:-1]):
        d[i].append(tokens[idx+1])
    # sample from the markov model
    l = [random.choice(tokens)]
    while len(l) < output_length:
        l.append(random.choice(d.get(l[-1], tokens)))
    # format the result into a string
    return ' '.join([' '.join(i) for i in l])

# sample a new string from s then convert that string to midi
generated_midi = string_to_midi(markov(s))

# save the midi data in "generated.midi"
generated_midi.write('midi', 'generated.midi')
If we run the markov function, we’ll get a new string that contains a sequence of notes expressed in text form. We can then convert that string to a proper midi file using the string_to_midi function we defined above. The result sounds like a pair of drunken sailors wailing away on a piano:
The good news is that if you don’t like that audio, you can just rerun the markov function until you get a keeper. Before you banish our sample, though, let’s try pushing it through the chiptune meat grinder we’ll write below.
Markov Models Meet Chiptunes
Chris Donahue, a brilliant postdoc in Computer Science at Stanford University, accomplished the Herculean task of converting the original 8-bit Nintendo synthesizer, or “audio processing unit”, into a simple API exposed in the Python package nesmdb. Nesmdb exports a function midi_to_wav that converts a midi file into nostalgic 8-bit audio that captures the raw energy of the original NES soundtracks. In what follows below, we’ll use that function to convert a midi file to chiptune waveform audio.
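The function below is a rough reconstruction of the converter described in the next paragraph: it rearranges a midi file onto the track names nesmdb expects (“p1”, “p2”, “tr”, “no”), lays a simple beat on the noise track, and then calls midi_to_wav. The track-assignment details, the noise-track pitch and velocity, and the assumption that midi_to_wav accepts raw MIDI bytes are illustrative guesses rather than documented behavior.

import pretty_midi
import nesmdb.convert
from scipy.io import wavfile

def midi_to_chiptune(midi_path, wav_path):
    source = pretty_midi.PrettyMIDI(midi_path)
    nes = pretty_midi.PrettyMIDI()
    # the three melodic NES tracks we can assign notes to
    tracks = [pretty_midi.Instrument(program=80, name=name) for name in ('p1', 'p2', 'tr')]
    # assign each note to the first track that isn't already sounding a note
    notes = sorted([n for i in source.instruments for n in i.notes], key=lambda n: n.start)
    for n in notes:
        for track in tracks:
            if not track.notes or track.notes[-1].end <= n.start:
                track.notes.append(n)
                break
    # lay down a simple dance beat on the noise track ("no");
    # the pitch and velocity here are guesses at reasonable values
    noise = pretty_midi.Instrument(program=121, is_drum=True, name='no')
    beat = 0.0
    while beat < source.get_end_time():
        noise.notes.append(pretty_midi.Note(velocity=8, pitch=8, start=beat, end=beat + 0.05))
        beat += 0.5
    nes.instruments.extend(tracks + [noise])
    # write the rearranged midi to disk, then hand its raw bytes to nesmdb
    nes.write('nes-formatted.midi')
    with open('nes-formatted.midi', 'rb') as f:
        wav = nesmdb.convert.midi_to_wav(f.read())
    # nesmdb renders audio at 44.1 kHz
    wavfile.write(wav_path, 44100, wav)

midi_to_chiptune('generated.midi', 'generated.wav')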
The original NES synthesizer supported five concurrent audio tracks: two pulse-wave tracks (“p1”, “p2”), a triangle-wave track (“tr”), a noise track (“no”), and a sampling track that’s not implemented in nesmdb. In the function above, we simply assign each note from the input midi file to the first unused track in our synthesizer (excluding the “no” track, which is assigned a dance beat later in the function). There are certainly more clever ways to assign notes to the synthesizer tracks, but we’ll use this approach for the sake of simplicity. Here are some sample results:
If you’re curious to try making your own audio, feel free to try this Colab notebook, which will download the ambrosia.midi file and process it using the steps discussed above. There’s a cell in that notebook to make it easier to upload custom MIDI files for processing as well.
Going Further with Markov-Generated MIDI
The foregoing discussion is meant only to serve as a relatively straightforward way to get started generating audio with Markov models. We’ve barely scratched the surface of what’s possible, though. If you get interested in automatic music generation, you might want to experiment with more sophisticated text sampling techniques, such as an LSTM network or a transformer model like GPT-2. It could also be interesting to train your model with a larger collection of data, such as Colin Raffel’s Lakh MIDI Dataset (possibly stripping the drum tracks and transposing each training file to a common relative major to prevent overfitting).
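For example, music21 can estimate a file’s key and transpose it to a common tonal center before training; the sketch below (with a hypothetical file name) normalizes one file to C major:

from music21 import converter, interval, pitch

# estimate the key of a training file and transpose it so its (relative) major tonic lands on C
score = converter.parse('training-file.midi')
estimated_key = score.analyze('key')
if estimated_key.mode == 'minor':
    estimated_key = estimated_key.relative  # use the relative major instead
shift = interval.Interval(estimated_key.tonic, pitch.Pitch('C'))
normalized = score.transpose(shift)
normalized.write('midi', 'training-file-normalized.midi')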
If you generate some fun audio using some of these techniques, please feel free to get in touch! I’d love to hear from you.
I would like to thank Christine Mcleavey, whose Clara project first introduced me to the idea of transforming MIDI files into text data, and Professor Matthew Suttor in Yale’s School of Drama, whose opera I Am Alan Turing has inspired me to continue pursuing algorithmic music composition.