I needed something to take my mind off the high drama going on at work, so I turned to a side project that hadn’t received much love in the 6+ years since I launched it: op1.fun. It’s a site for users of the Teenage Engineering OP-1 to share patches and store “tapes”. Over the years, people have generously shared thousands of patches (10,611 public ones as of writing). For fun, I wanted to come up with a way to apply some of the techniques I’ve been learning about at work to these sound files.

I settled on the idea of building an interactive tool that lets people explore all of these sounds based on the similarity of their embeddings. Check it out here! Read on below if you’re curious about how it was made.

Embeddings

If you’re not already familiar with embeddings, think of them as a translation mechanism: they convert complex data, like the audio of a sound patch, into a compact numerical representation. Each patch becomes a vector of numbers, where each dimension captures some feature or characteristic of the sound – a bit like a digital DNA for sounds. Computing these embeddings maps each sound into an abstract, high-dimensional space, and by comparing positions in that space we can gauge how similar any two patches are.

The similarity between two sound patches is determined by the distance between their corresponding vectors in this space. One common measure is cosine similarity, which looks at the angle between the two vectors: the smaller the angle, the more alike the sounds. Alternatively, Euclidean distance can be used, where vectors that sit closer together represent more similar sounds. Either way, we get a precise, mathematical way to quantify the likeness between any two sounds, which is what lets us navigate the vast collection of patches by their sonic qualities.
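To make that concrete, here’s a minimal sketch of both measures in NumPy, using two made-up toy vectors (real embeddings have far more dimensions):

```python
import numpy as np

# Two toy embedding vectors (real embeddings have far more dimensions)
a = np.array([0.12, -0.53, 0.91, 0.04])
b = np.array([0.10, -0.48, 0.85, 0.11])

# Cosine similarity: closer to 1.0 means the vectors point the same way
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: smaller means the vectors sit closer together
euclidean_distance = np.linalg.norm(a - b)

print(f"cosine similarity: {cosine_similarity:.3f}")
print(f"euclidean distance: {euclidean_distance:.3f}")
```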

PANNs

Pre-trained Audio Neural Networks (PANNs) operate on the principle of transforming audio signals into spectrograms – visual representations of how a sound’s frequency content changes over time – and then analyzing those representations with a neural network. By feeding the raw audio of the samples into one of these pre-trained networks, we can extract an embedding vector from each one and store it for visualization.

I found an open source pre-trained model here, which is the result of the work described in this paper. I used this to create the embeddings.
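For the curious, extracting an embedding for a single file looks roughly like this. It’s a minimal sketch assuming the panns_inference package as the wrapper around that model; the file path is made up:

```python
import librosa
from panns_inference import AudioTagging

# Load the pre-trained PANNs model (downloads the checkpoint on first use)
model = AudioTagging(checkpoint_path=None, device='cpu')

# PANNs expects 32 kHz mono audio shaped (batch, samples)
audio, _ = librosa.load('patches/example.aif', sr=32000, mono=True)
audio = audio[None, :]

# inference() returns per-class tag predictions and a fixed-length embedding
clipwise_output, embedding = model.inference(audio)
print(embedding.shape)  # one embedding vector per input clip
```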

t-SNE

The t-SNE (t-distributed Stochastic Neighbor Embedding) algorithm is the other main ingredient in this project. t-SNE is a technique specifically designed for dimensionality reduction. It’s particularly adept at taking high-dimensional data, like our sound embeddings, and representing it in a way that’s easier to visualize and interpret, typically in two or three dimensions.

What makes t-SNE stand out is its ability to preserve the local structure of the data while reducing dimensions. It does this by converting the distances between points in the high-dimensional space into probabilities, then arranging points in the lower-dimensional space so that their pairwise probabilities match as closely as possible. The ‘t-distributed’ part of its name refers to the Student’s t-distribution used for the low-dimensional similarities; its heavier tails keep distinct clusters from crowding together more effectively than a normal distribution would.
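Here’s a toy sketch of those two kinds of similarity in NumPy – a Gaussian kernel in the original space and a heavier-tailed Student’s t in the reduced space. It’s simplified: the real algorithm picks a separate bandwidth per point via the perplexity parameter and symmetrizes the probabilities.

```python
import numpy as np

def gaussian_similarities(X, sigma=1.0):
    """Pairwise similarities between high-dimensional points (simplified)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)  # a point is not its own neighbor
    return P / P.sum()

def student_t_similarities(Y):
    """Pairwise similarities between low-dimensional points (heavier tails)."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

# t-SNE iteratively nudges the low-dimensional points Y so that
# student_t_similarities(Y) matches gaussian_similarities(X) as closely
# as possible (by minimizing the KL divergence between the two).
```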

Applying t-SNE to the sound patch embeddings allowed me to create a 3D map of the entire sound library. In this map, each point represents a sound patch, and patches with similar features are located closer together. This visualization isn’t just a static picture; it’s an interactive exploration tool. Users can zoom in on clusters of interest, listen to how these similar sounds compare, and perhaps discover patterns or relationships that aren’t immediately obvious from the raw data or a simple list. Such a visual approach can reveal the underlying structure of the sound patch library, highlighting similarities and differences in a way that’s both intuitive and insightful.

Implementation

To start, I needed to download all the patches from op1.fun. I was able to export a CSV file of all the public patches, including their metadata, and then write a script to download all the AIFF audio files from S3. The OP-1 has multiple types of sample patches, including drum and sampler patches. A drum patch consists of a single file with start/stop markers stored in a metadata header of the AIFF file. I wrote an additional script that uses these markers to chop each drum patch into individual sound files. After this step, I had a total of ~105,000 separate samples to work with!
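Parsing the OP-1’s metadata header is its own rabbit hole, so here’s just a rough sketch of the chopping step, assuming the start/stop markers have already been converted into frame offsets (the function and file names are hypothetical, and I’m using the soundfile library for I/O):

```python
import soundfile as sf

def chop_drum_patch(path, markers, out_prefix):
    """Split one drum patch into individual samples.

    `markers` is assumed to be a list of (start_frame, end_frame) pairs
    already parsed from the patch's AIFF metadata header.
    """
    audio, sample_rate = sf.read(path)
    for i, (start, end) in enumerate(markers):
        sf.write(f"{out_prefix}_{i:02d}.aif", audio[start:end], sample_rate,
                 format="AIFF")

# Hypothetical usage: three slices out of one drum patch
chop_drum_patch("drum_patch.aif",
                [(0, 44100), (44100, 66150), (66150, 88200)],
                "samples/kit01")
```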

Then, I found the librosa Python library, which contains several tools for analyzing sound files, and used it alongside a PANNs (Pre-trained Audio Neural Networks) inference package that generates audio file embeddings from the freely available pre-trained model mentioned above. I scripted a loop that generated an embedding vector for each file and saved a JSON record of the values along with some metadata about the sample file.
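The loop itself was nothing fancy. Roughly (with hypothetical paths, a minimal metadata record, and again assuming the panns_inference package):

```python
import json
from pathlib import Path

import librosa
from panns_inference import AudioTagging

model = AudioTagging(checkpoint_path=None, device='cpu')
records = []

for path in sorted(Path("samples").glob("*.aif")):
    audio, _ = librosa.load(path, sr=32000, mono=True)
    _, embedding = model.inference(audio[None, :])
    records.append({
        "file": path.name,                   # metadata about the sample file
        "embedding": embedding[0].tolist(),  # embedding values as plain floats
    })

with open("embeddings.json", "w") as f:
    json.dump(records, f)
```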

With all the audio files analyzed, I needed a way to present everything visually. I loaded all the embeddings and used the sklearn.manifold module’s TSNE class to reduce them to a 3-dimensional dataframe of all the samples, which I wrote out to another JSON file.
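In sketch form (file names carry over from the snippets above, and the TSNE settings shown are plausible values rather than necessarily the exact ones used):

```python
import json

import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

with open("embeddings.json") as f:
    records = json.load(f)

embeddings = np.array([r["embedding"] for r in records])

# Reduce each high-dimensional embedding to 3 coordinates
coords = TSNE(n_components=3, perplexity=30, init="pca").fit_transform(embeddings)

df = pd.DataFrame(coords, columns=["x", "y", "z"])
df["file"] = [r["file"] for r in records]

# Write the point cloud out for the web frontend to load
df.to_json("points.json", orient="records")
```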

The final part was creating a web frontend. ChatGPT helped me write a three.js visualization that loads the JSON output from the last step and renders a point cloud. You can rotate the cloud of points and hold the shift key to audition any of the samples by mousing over them.

I hope you enjoy it!