Markus Schläfli

A little ML structure

Code for the post can be found here.

In the last post, we implemented an MLP for MNIST from scratch using C++. We kept things very basic so that we understand exactly what’s happening, forgoing any consideration for architecture, structure, reusability and performance.

Before we move to any more complicated concepts, a little cleanup is in order.

Current state

If we look at our current code, we roughly have the following. Keep the ‘forward’ and ‘backwards’ sections in mind when reading the code.

for (int epoch = 0; epoch < epochs; epoch++) {
  float totalLoss = 0;
  for (int b = 0; b < numBatches; b++) {
    // populate xBatch & yBatch
    int currentBatchSize = ...
    Matrix xBatch(currentBatchSize, 784);
    std::vector<int> yBatch(currentBatchSize);

    // forward
    z1Cache = matmul(xBatch, weights1); addBiasInPlace(z1Cache, biases1);
    a1Cache = relu(z1Cache);
    z2Cache = matmul(a1Cache, weights2); addBiasInPlace(z2Cache, biases2);
    a2Cache = relu(z2Cache);
    z3Cache = matmul(a2Cache, weights3); addBiasInPlace(z3Cache, biases3);
    a3Cache = softmax(z3Cache);

    // loss
    totalLoss += crossentropy(a3Cache, yBatch);

    // backwards
    Matrix dz3 = a3Cache;
    for (int i = 0; i < currentBatchSize; i++)
      dz3.at(i, yBatch[i]) -= 1.f; // crossentropy+softmax simplification
    scaleInPlace(dz3, 1 / (float)currentBatchSize);
    Matrix dW3 = matmul(transpose(a2Cache), dz3);
    Matrix db3 = colsum(dz3);
    Matrix da2 = matmul(dz3, transpose(weights3));
    Matrix dz2 = elemmul(da2, relugrad(z2Cache));
    Matrix dW2 = matmul(transpose(a1Cache), dz2);
    Matrix db2 = colsum(dz2);
    Matrix da1 = matmul(dz2, transpose(weights2));
    Matrix dz1 = elemmul(da1, relugrad(z1Cache));
    Matrix dW1 = matmul(transpose(xBatch), dz1);
    Matrix db1 = colsum(dz1);

    // apply gradients
    axpy(-lr, dW3, weights3); axpy(-lr, db3, biases3);
    axpy(-lr, dW2, weights2); axpy(-lr, db2, biases2);
    axpy(-lr, dW1, weights1); axpy(-lr, db1, biases1);
  }

  auto epochLoss = totalLoss / (float)numBatches;
}

And let’s take PyTorch’s Python example as inspiration:

model = ImageClassifier()
optimizer = Adam(model.parameters(), lr=lr)
loss_fn = CrossEntropyLoss()
data_loader = get_data_loader()

for epoch in range(epochs):
    total_loss = 0
    for images, labels in data_loader:
        outputs = model(images)         # forward
        loss = loss_fn(outputs, labels) # loss

        optimizer.zero_grad()           # clear existing gradients
        loss.backward()                 # backwards
        optimizer.step()                # apply gradients

        total_loss += loss.item()

    epoch_loss = total_loss/len(data_loader)

At first glance, it looks like every section in the C++ code has been collapsed into a class: iterating the training data, the model, the loss function, the gradient computation, and applying the gradients via an optimizer.

I’m against extracting code into functions or classes just for the sake of brevity, so let’s unpack why this would be a good thing to do.

A little structure

A large reason behind organizing things into these classes is to help with automatic differentiation. If we look at the C++ code, all of the steps in the backwards pass are derived using the chain rule manually. If anything changes in the model, then we have to again make sure everything aligns. The “automatic” in an automatic differentiation engine - such as autograd in PyTorch - hides all of this behind a backward() step, and because of the nice interface between different blocks of the model (such as a Linear Layer, ReLU, softmax, etc.), these steps flow from one into the next. It’ll become clearer when we look at the code, but each of these steps has a forward() and backward() function, and you chain together a sequence into a sequential model in our case.

Each step in our model will be considered a C++ class and will contain its weights and biases, along with the forward() and backward() methods. Instead of adding an update() method that updates the weights after the backwards pass, we extract this into a separate optimizer class. The reasons are pretty straightforward: the way we update the weights is dependent on the algorithm, and then it’s uniformly applied across the components of the various layers. Putting it directly into each component would result in a lot of duplicated code. And another significant point is that we approach the problem in a Data-Oriented Design manner: we process all the weights and biases contiguously in a tight loop, making better use of memory hierarchies and arguably having a cleaner code structure.

Other parts of the system are more about aestethics: a data loader in a separate class means we can combine loading, iterating, and shuffling all in 1 place; and a loss function allows things like cross entropy loss to combine softmax and negative log-likehood using the log-sum-exp trick.

Base classes

Let’s start with the core base classes that we’ll need to define various parts of the model:

struct Parameter {
  Matrix *data;
  Matrix *grad;
};

struct Layer {
  virtual void init(std::mt19937 &) {}
  virtual Matrix *forward(Matrix &in) = 0;
  virtual Matrix *backward(Matrix &dout) = 0;
  virtual std::vector<Parameter> parameters() { return {}; }
  virtual ~Layer() = default;
};

The Parameter class combines the data with its gradient, which is what we wrap for both the weights and biases. The Layer is the abstract base class with pure virtual functions that all the various steps of the model will provide an implementation for.

Note on pointers: it’s important to keep the memory copies in mind here. These matrices can be large (our first layer’s weights are 784 x 128 = 100’352 parameters which with 32 bit floats means ~400kB), so having these copied as they flow from layer to layer is not great. I’m always reminded of the string copies that at one time accounted for half of all allocations in Chrome.

Implementing some Layers

So now, what implements this pure virtual class? From our existing C++ code, we’ll need to provide an implementation for the Linear layer and the ReLU operation.

struct Linear : Layer {
  int Fin, Fout; 
  Matrix weight, dweight;
  Matrix bias, dbias;

  Matrix outForward, outBackward; // we own these

  Matrix *lastIn; // cache for backwards pass

  Linear(int fin, int fout)
      : Fin(fin), Fout(fout), weight(fin, fout), dweight(fin, fout),
        bias(1, fout), dbias(1, fout) {}

  void init(std::mt19937 &rng) override { heInit(weight, rng); }

  Matrix *forward(Matrix &in) override {
    lastIn = &in;
    outForward = matmul(in, weight);
    addBiasInPlace(outForward, bias);
    return &outForward;
  }

  Matrix *backward(Matrix &dout) override {
    dweight = matmul(transpose(*lastIn), dout);
    dbias = colsum(dout);
    outBackward = matmul(dout, transpose(weight));
    return &outBackward;
  }

  std::vector<Parameter> parameters() override {
    return { {&weight, &dweight}, {&bias, &dbias} };
  }
};

struct ReLU : Layer {
  Matrix outForward, outBackward; // we own these
  Matrix *lastIn;

  Matrix *forward(Matrix &in) override {
    lastIn = &in;
    outForward = relu(in);
    return &outForward;
  }

  Matrix *backward(Matrix &dout) override {
    outBackward = elemmul(dout, relugrad(*lastIn));
    return &outBackward;
  }
};

And now let’s create something that is going to wrap this all together:

struct Sequential {
  std::vector<Layer *> layers;

  void add(Layer *layer) { layers.push_back(layer); }

  void init(std::mt19937 &rng) {
    for (auto layer : layers)
      layer->init(rng);
  }

  Matrix *forward(Matrix &in) {
    Matrix *m = &in;
    for (auto layer : layers)
      m = layer->forward(*m);
    return m;
  }

  void backward(Matrix &dout) {
    Matrix *m = &dout;
    for (int i = layers.size() - 1; i >= 0; i--)
      m = layers[i]->backward(*m);
  }

  std::vector<Parameter> parameters() {
    std::vector<Parameter> params;
    for (auto layer : layers) {
      auto lp = layer->parameters(); // append
      params.insert(params.end(), lp.begin(), lp.end());
    }
    return params;
  }
};

/* 
example use -
  Sequential model;
  model.add(new Linear(784, 128));
  model.add(new ReLU());
  model.add(new Linear(128, 64));
  model.add(new ReLU());
  model.add(new Linear(64, 10));
  
  Matrix* logits = model.forward(xBatch);
    // run loss fn and compute dLogits
  model.backward(dLogits);
*/

This has collapsed our forward and backward computation into a single class, though we still do need to provide the loss function to turn our logits into a probability distribution on the forward pass, and the inverse on the backward pass.

struct CrossEntropyLoss {
  Matrix cachedProbs;
  const std::vector<int> *cachedLabels;
  int cachedBatchSize;

  float forward(const Matrix &logits, const std::vector<int> &labels) {
    cachedProbs = softmax(logits);
    cachedLabels = &labels;
    cachedBatchSize = logits.rows;
    return crossentropy(cachedProbs, labels);
  }

  Matrix backward() {
    // d(CrossEntropy+Softmax)/dLogits = softmax(logits) - oneHot(labels)
    Matrix dLogits = cachedProbs;
    for (int i = 0; i < cachedBatchSize; i++)
      dLogits.at(i, (*cachedLabels)[i]) -= 1.f;
    scaleInPlace(dLogits, 1.f / cachedBatchSize);
    return dLogits;
  }
};

We will take in the logits into our forward pass, and we will return the gradient version of these during the backward pass. The fact that we just have a single correct label makes the gradient computation straightforward, which is why the entire code is concise.

Optimizer

For the optimizer, we’ll create a class for the same Stochastic Gradient Descent algorithm as before. There are much better options but we already gain a lot of needed structure just by moving out the weights + biases update into a dedicated class.

struct SGD {
  std::vector<Parameter> params;
  float lr;

  SGD(Sequential &model, float lr) : params(model.parameters()), lr(lr) {}

  void step() {
    for (auto &p : params)
      axpy(-lr, *p.grad, *p.data);
  }
};

As a reminder, the Parameter class wraps weights or biases, as well as their gradients, from any layer, which is all we need to perform the update.

Data Loader

The last thing we’re going to be doing is wrapping the data loader into an easy-to-iterate helper that will also take care of shuffling the order. Seeing as all of our data is in a single matrix where each row is one image, we simply shuffle an index list, then copy the entire row at that index when processing the next() in line. The same thing applies for the correct label, though there it’s just a single integer we need to return. And for all of this, we process batches of data, so we return multiple rows of the image matrix and a list of integers for the labels.

struct DataLoader {
  const Matrix &X;
  const std::vector<int> &y;
  int batchSize;
  std::mt19937 &rng;
  std::vector<int> idx;
  int pos;

  DataLoader(const Matrix &X, const std::vector<int> &y, int batchSize,
             std::mt19937 &rng)
      : X(X), y(y), batchSize(batchSize), rng(rng), idx(X.rows), pos(0) {
    std::iota(idx.begin(), idx.end(), 0);
  }

  void shuffle() {
    std::shuffle(idx.begin(), idx.end(), rng);
    pos = 0;
  }

  bool hasNext() const { return pos < (int)idx.size(); }

  std::pair<Matrix, std::vector<int>> next() {
    int feats = X.cols;
    int end = std::min(pos + batchSize, (int)idx.size());
    int currentBatchSize = end - pos;

    Matrix xBatch(currentBatchSize, feats);
    std::vector<int> yBatch(currentBatchSize);
    for (int i = 0; i < currentBatchSize; i++) {
      int src = idx[pos + i];
      std::copy(X.data.begin() + src * feats,
                X.data.begin() + (src + 1) * feats,
                xBatch.data.begin() + i * feats);
      yBatch[i] = y[src];
    }
    pos = end;
    return {xBatch, yBatch};
  }
};

Putting it all together

Now all we need to do is to bring everything together. Our main function is now much simpler.

int epochs = 10;
int batchSize = 64;
float lr = 0.01f;
std::mt19937 rng(42);

Sequential model;
model.add(new Linear(784, 128));
model.add(new ReLU());
model.add(new Linear(128, 64));
model.add(new ReLU());
model.add(new Linear(64, 10));
model.init(rng);

SGD optimizer(model, lr);
CrossEntropyLoss loss_fn;
DataLoader loader(xTrain, yTrain, batchSize, rng);

for (int epoch = 0; epoch < epochs; epoch++) {
  auto startTime = std::chrono::steady_clock::now();
  float epochLoss = 0;
  int numBatches = 0;

  loader.shuffle();
  while (loader.hasNext()) {
    auto xyPair = loader.next();
    auto xBatch = xyPair.first;
    auto yBatch = xyPair.second;

    Matrix *logits = model.forward(xBatch);
    float loss = loss_fn.forward(*logits, yBatch);
    std::cout << "minibatch loss: " << loss << std::endl;
    epochLoss += loss;
    numBatches++;

    Matrix dLogits = loss_fn.backward();
    model.backward(dLogits);
    optimizer.step();
  }
}

And there we have it! From here we should be able to add new types of layers, optimizers, and starting looking at more complicated topics.