The spelled-out intro to language modeling: building makemore
What You Will Learn
- How to build a character-level language model from scratch
- How to train a model to generate new names based on a given dataset
- How to use PyTorch to create and manipulate tensors for language modeling
Key Concepts
Character-level language modeling trains a model to predict the next character in a sequence, given the characters that came before it. In this lesson, we use a dataset of names to train a model that generates new names. The model is a bi-gram language model: it predicts the next character from the single preceding character, so it works entirely with pairs of adjacent characters. We also use PyTorch to create and manipulate tensors, the multi-dimensional arrays that hold the model’s parameters and data.
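To make the bi-gram idea concrete, here is a minimal sketch in plain Python. It assumes a tiny hypothetical list of names (toy_names) rather than the lesson’s dataset, and the helper next_char_probs is a name introduced here for illustration. It counts adjacent character pairs and estimates how likely each character is to follow a given one.

```python
from collections import Counter

# Hypothetical toy data; the lesson uses a much larger dataset of names.
toy_names = ["emma", "ava", "mia"]

# Count every adjacent pair of characters across all names.
pair_counts = Counter()
for name in toy_names:
    for prev_ch, next_ch in zip(name, name[1:]):
        pair_counts[(prev_ch, next_ch)] += 1

def next_char_probs(prev_ch):
    """Estimate P(next | prev) by normalizing the counts of pairs that start with prev_ch."""
    following = {b: n for (a, b), n in pair_counts.items() if a == prev_ch}
    total = sum(following.values())
    return {b: n / total for b, n in following.items()}

print(next_char_probs("m"))
```

In this toy data, "m" is followed once each by "m", "a", and "i", so the three estimated probabilities are equal. A character that never appears in first position yields an empty dictionary.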
Code Examples
for character_one, character_two in zip(w, w[1:]):
    print(character_one, character_two)
This loop walks through a single word w and prints each pair of consecutive characters. Repeating it for every word in the dataset visits every bi-gram.
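A common extension of this loop, sketched below, wraps each word in start and end markers so that the pairs also record which characters begin and end a word. The marker strings <S> and <E> and the list toy_words are assumptions chosen for illustration, not taken from the lesson’s code.

```python
# Hypothetical sample words; any list of strings works.
toy_words = ["emma", "ava"]

pairs = []
for w in toy_words:
    # Assumed marker tokens for the start and end of a word.
    chs = ["<S>"] + list(w) + ["<E>"]
    for character_one, character_two in zip(chs, chs[1:]):
        pairs.append((character_one, character_two))
        print(character_one, character_two)
```

With the markers in place, a word of k characters contributes k + 1 pairs instead of k - 1.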
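import torch
n = torch.zeros(28, 28, dtype=torch.int32)
This code creates a 28x28 tensor of 32-bit integer zeros that will store the count of each bi-gram in the dataset: the entry at row i, column j counts how often character j follows character i. The size 28 is consistent with 26 lowercase letters plus two special markers for the start and end of a word.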
s2i = {c: i for i, c in enumerate(chars)}
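This dictionary comprehension maps each character in chars, the list of distinct symbols in the vocabulary, to its integer index. That index is what selects a row or column of the count tensor.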
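The snippets above can be combined into one counting pass. The following is a minimal sketch under stated assumptions: toy_words is a hypothetical stand-in for the dataset, the <S> and <E> markers are assumed to account for the two symbols beyond the 26 letters, and the inverse mapping i2s is a name introduced here for illustration.

```python
import string
import torch

# Hypothetical stand-in for the names dataset.
toy_words = ["emma", "ava", "mia"]

# 26 lowercase letters plus two assumed boundary markers gives 28 symbols.
chars = list(string.ascii_lowercase) + ["<S>", "<E>"]
s2i = {c: i for i, c in enumerate(chars)}
i2s = {i: c for c, i in s2i.items()}  # inverse lookup: index -> symbol

n = torch.zeros(28, 28, dtype=torch.int32)
for w in toy_words:
    chs = ["<S>"] + list(w) + ["<E>"]
    for character_one, character_two in zip(chs, chs[1:]):
        # Row = first character of the pair, column = second character.
        n[s2i[character_one], s2i[character_two]] += 1

# How often "a" follows "m" in the toy data.
print(n[s2i["m"], s2i["a"]].item())
```

Keeping the counts in a 2D tensor rather than a dictionary means a whole row, that is, everything that can follow a given character, is available with a single index.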
Lesson Summary
In this lesson, we started building a character-level language model from scratch using a dataset of names. We began by loading the dataset and splitting it into individual words, and then we set up a bi-gram language model that predicts the next character from the current one. We used PyTorch to create and manipulate tensors, which store the model’s parameters and data. We also created a dictionary that maps each character to its integer index, which is used to index into the tensor. By the end of this lesson, we had a tensor that stores the count of each bi-gram in the dataset; once its rows are normalized into probabilities, it can be used to generate new names.
Practice Exercise
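Using the code snippets from this lesson, try to sample the first character of a new name from its probability distribution. To do this, take the row of the tensor that holds the counts for characters following the start of a word (the first row only if the start marker has index 0), convert it to floating point, and divide it by its sum so that it adds up to 1. Then use the torch.multinomial function to sample an index from that distribution, and map the index back to a character.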
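One possible approach to this exercise is sketched below. It assumes a small hypothetical count vector start_counts standing in for the start-of-word row of the count tensor, a four-symbol alphabet, and a seeded torch.Generator so that repeated runs draw the same sample; none of these come from the lesson’s code.

```python
import torch

# Hypothetical counts of first characters for a 4-symbol alphabet.
alphabet = ["a", "e", "m", "o"]
start_counts = torch.tensor([3, 1, 4, 2], dtype=torch.int32)

# Convert to float before dividing so the result is a probability vector.
p = start_counts.float()
p = p / p.sum()

# Seeded generator so that repeated runs draw the same sample.
g = torch.Generator().manual_seed(0)
ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
print(alphabet[ix])
```

The same pattern applies to the full count tensor: replace start_counts with the appropriate row and alphabet with the list of symbols used to build the index mapping.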
What Is Next
In the next lesson, we’ll be exploring more advanced techniques for language modeling, including the use of recurrent neural networks and transformers. We’ll also be learning how to fine-tune our model to generate more realistic and diverse names.