Training Deep Neural Networks on a GPU with PyTorch

Arun Purakkatt · Published in Analytics Vidhya · Aug 19, 2020


MNIST using feedforward neural networks


In my previous posts, we have gone through:

  1. Deep Learning — Artificial Neural Network(ANN)
  2. Tensors — Basics of PyTorch programming
  3. Linear Regression with PyTorch
  4. Image Classification with PyTorch — logistic regression

Let us now try to classify the MNIST data set using a feedforward neural network.

Step 1: Import libraries, explore the data, and prepare it

With the necessary libraries imported, the data is loaded as PyTorch tensors. The MNIST data set contains 60,000 labelled images, which we randomly split into a training set of 50,000 images and a validation set of 10,000 images.
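For reference, here is roughly what the imports and data loading look like; the exact transform and download path are my assumptions, since they are not shown in this post:

import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from torchvision.utils import make_grid

# Download MNIST and load it as PyTorch tensors (the root path is an assumption)
dataset = MNIST(root='data/', download=True, transform=ToTensor())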

val_size = 10000
train_size = len(dataset) - val_size

train_ds, val_ds = random_split(dataset, [train_size, val_size])
len(train_ds), len(val_ds)

We create data loaders, which allow us to load the data in batches; this matters especially when the data set is too large to fit into memory for training. Here we use a batch size of 128. How do we decide the batch size? Typically you can keep doubling it (128, 256, 512, ...) for as long as it fits on your GPU/in memory and processing stays fast; when it starts to slow down, step back down one size.

In the train_loader we set shuffle=True, which randomizes the order of the data. pin_memory=True tells the data loader to copy tensors into CUDA pinned memory before returning them. num_workers sets how many subprocesses to use for data loading, which parallelizes it.

Try looking into the documentation: https://pytorch.org/docs/stable/data.html

batch_size=128
train_loader = DataLoader(train_ds, batch_size, shuffle=True, num_workers=4, pin_memory=True)
val_loader = DataLoader(val_ds, batch_size*2, num_workers=4, pin_memory=True)

Let’s visualize a batch of data in a grid using the make_grid function from torchvision. We'll also use the .permute method on the tensor to move the channels to the last dimension, as expected by matplotlib. Each batch has 128 images with 1 channel (since they are greyscale) and a size of 28 x 28 pixels, in other words 28 rows and 28 columns. For a colour (RGB) image, the number of channels would be 3.

for images, _ in train_loader:
    print('images.shape:', images.shape)
    plt.figure(figsize=(16, 8))
    plt.axis('off')
    plt.imshow(make_grid(images, nrow=16).permute((1, 2, 0)))
    break

images.shape: torch.Size([128, 1, 28, 28])

Step 2: Model Preparation

This is how our model looks. We are creating a neural network with one hidden layer, so the structure is: input layer, hidden layer, output layer. Let us understand each layer in detail.

Input layer — Each image is 28x28; since we use a batch size of 128, the input is a 128x28x28 tensor, which we flatten to a matrix of shape 128x784.

Hidden layer — The first layer (also known as the hidden layer) transforms the input matrix of shape batch_size x 784 into an intermediate output matrix of shape batch_size x hidden_size, where hidden_size is a preconfigured parameter (usually a value such as 32 or 64).

The intermediate outputs are then passed into a non-linear activation function, which operates on individual elements of the output matrix.

Output layer — The result of the activation function, which is also of size batch_size x hidden_size, is passed into the second layer (also known as the output layer), which transforms it into a matrix of size batch_size x 10.
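To make these shapes concrete, here is a small sketch (using the imports above, with hidden_size assumed to be 32) tracing a batch through the two layers:

batch = torch.randn(128, 1, 28, 28)      # a batch of 128 greyscale 28x28 images
flat = batch.view(128, -1)               # flatten to shape (128, 784)
hidden = nn.Linear(784, 32)(flat)        # hidden layer: (128, 784) -> (128, 32)
activated = F.relu(hidden)               # activation keeps the shape (128, 32)
output = nn.Linear(32, 10)(activated)    # output layer: (128, 32) -> (128, 10)
print(flat.shape, hidden.shape, output.shape)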


What is an activation function, and what is ReLU?

An activation function decides whether a neuron should be activated or not, based on the weighted sum of its inputs plus a bias. The purpose of the activation function is to introduce non-linearity into the output of a neuron.

ReLU (Rectified Linear Unit): relu(x) = max(0, x), i.e. if an element is negative we replace it with 0; otherwise we leave it unchanged.

Activation functions
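As a quick illustration (not from the original notebook), this is how F.relu behaves on a small tensor:

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(F.relu(x))   # tensor([0.0000, 0.0000, 0.0000, 1.5000, 3.0000]): negatives become 0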

Why do we use a hidden layer and an activation function?

This allows our model to learn more complex, multi-layered and non-linear relations between the inputs and the targets.

We define the MnistModel class as below. The __init__ constructor takes in_size, hidden_size and out_size, and creates one hidden layer and one output layer.

In the forward function we take the batch of images xb, i.e. 128x1x28x28, and flatten it, since our linear layer expects a 2-dimensional matrix (batch_size x 784). We then get intermediate outputs from the hidden layer and apply the activation function on top of them, i.e. we simply replace the negative numbers with zero. The output of ReLU is passed into the output layer, so for every image we get 10 outputs.

In the training_step function, we generate predictions and calculate the loss using cross_entropy, which takes the model outputs and the actual labels of the data and returns the loss.

In validation_step we take a batch of images and labels, pass it into the model to generate predictions, and calculate the loss and the accuracy.

In validation_epoch_end we collect the batch losses and batch accuracies, stack them and take the mean to get epoch_loss and epoch_acc, which are returned.

In epoch_end, the epoch's val_loss and val_acc are printed.

class MnistModel(nn.Module):
    """Feedforward neural network with 1 hidden layer"""
    def __init__(self, in_size, hidden_size, out_size):
        super().__init__()
        # hidden layer
        self.linear1 = nn.Linear(in_size, hidden_size)
        # output layer
        self.linear2 = nn.Linear(hidden_size, out_size)

    def forward(self, xb):
        # Flatten the image tensors
        xb = xb.view(xb.size(0), -1)
        # Get intermediate outputs using hidden layer
        out = self.linear1(xb)
        # Apply activation function
        out = F.relu(out)
        # Get predictions using output layer
        out = self.linear2(out)
        return out

    def training_step(self, batch):
        images, labels = batch
        out = self(images)                    # Generate predictions
        loss = F.cross_entropy(out, labels)   # Calculate loss
        return loss

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)                    # Generate predictions
        loss = F.cross_entropy(out, labels)   # Calculate loss
        acc = accuracy(out, labels)           # Calculate accuracy
        return {'val_loss': loss, 'val_acc': acc}

    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()   # Combine losses
        batch_accs = [x['val_acc'] for x in outputs]
        epoch_acc = torch.stack(batch_accs).mean()      # Combine accuracies
        return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()}

    def epoch_end(self, epoch, result):
        print("Epoch [{}], val_loss: {:.4f}, val_acc: {:.4f}".format(
            epoch, result['val_loss'], result['val_acc']))

We create a model with a hidden layer of size 32; we can change this, and more hidden layers can be added. Each hidden layer learns something from the data and helps build the relation between input and target. Here the input size is 784, the hidden size is 32, and the output size (num_classes) is 10. model.parameters() gives you the weights and biases.

input_size = 784
hidden_size = 32  # you can change this
num_classes = 10

model = MnistModel(input_size, hidden_size=32, out_size=num_classes)

for t in model.parameters():
    print(t.shape)

Step 3: Training the model on a GPU and evaluating accuracy

As the sizes of our models and datasets increase, we need to use GPUs to train our models within a reasonable amount of time. We define a helper function that picks the GPU if it is available and defaults to the CPU if it isn't.

torch.cuda.is_available()

def get_default_device():
    """Pick GPU if available, else CPU"""
    if torch.cuda.is_available():
        return torch.device('cuda')
    else:
        return torch.device('cpu')

device = get_default_device()
device

To move the data to the device we create another helper function. It takes a tensor (or a list/tuple of tensors) and calls the .to(device) method on each one; here the data is copied from the current device, the CPU, to the GPU. We try out the to_device function by taking a batch of images from the train_loader and printing the device the tensors end up on.

def to_device(data, device):
    """Move tensor(s) to chosen device"""
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

for images, labels in train_loader:
    print(images.shape)
    images = to_device(images, device)
    print(images.device)
    break

We define a DeviceDataLoader class to wrap our existing data loaders and move data to the selected device as batches are accessed. Interestingly, we don't need to extend an existing class to create a PyTorch data loader. All we need is an __iter__ method to retrieve batches of data and a __len__ method to get the number of batches.

class DeviceDataLoader():
    """Wrap a dataloader to move data to a device"""
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device

    def __iter__(self):
        """Yield a batch of data after moving it to device"""
        for b in self.dl:
            yield to_device(b, self.device)

    def __len__(self):
        """Number of batches"""
        return len(self.dl)
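The wrapping step itself is not shown in this post, but to actually feed GPU batches into the model, the existing loaders would typically be wrapped like this:

train_loader = DeviceDataLoader(train_loader, device)
val_loader = DeviceDataLoader(val_loader, device)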

We use the evaluate function to evaluate the model on the validation dataset. It loops over val_loader (whose batches are moved onto the GPU), calls validation_step to calculate the loss and accuracy for each batch, and passes the outputs to validation_epoch_end.

The fit function takes the number of epochs, the learning rate, the model, train_loader, val_loader and opt_func, the optimization function, which is SGD by default. Let us understand each line of code. optimizer = opt_func(model.parameters(), lr) creates the optimizer from the model parameters and the learning rate. We then loop over the number of epochs. In the training phase we loop over the training data loader: loss = model.training_step(batch) calculates the loss for each batch, loss.backward() calculates the gradients (the derivatives of the loss with respect to the weights), optimizer.step() performs a parameter update based on the current gradients, and optimizer.zero_grad() resets the gradients to zero. Since backward() accumulates gradients, and you don't want to mix up gradients between batches, you have to zero them out at the start of each new mini-batch. In the validation phase we evaluate the model on val_loader, and model.epoch_end prints the loss and accuracy.

def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training Phase
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result)
        history.append(result)
    return history

Before we train the model, we need to ensure that the data and the model’s parameters (weights and biases) are on the same device (CPU or GPU). We can reuse the to_device function to move the model’s parameters to the right device. Before training, we can also check how the model performs on the validation set with its initial, random weights and biases.

# Model (on GPU)
model = MnistModel(input_size, hidden_size=hidden_size, out_size=num_classes)
to_device(model, device)
history = [evaluate(model, val_loader)]
history

Let’s train for 5 epochs and look at the results. We can use a relatively high learning rate of 0.5; the learning rate is something to experiment with for your model. We get around 96% accuracy.

One way to go about it is to start with a standard value like 0.01. If your loss shoots up too much or goes to NaN, try decreasing it by a factor of 10, e.g. from 1e-2 to 1e-3. If your loss is decreasing only slowly, try increasing it by a factor of 10.

96% is pretty good! Let’s train the model for 5 more epochs at a lower learning rate of 0.1, to further improve the accuracy.

history += fit(5, 0.5, model, train_loader, val_loader)
history += fit(5, 0.1, model, train_loader, val_loader)

We plot the loss against the number of epochs; we can see it goes down quickly and then flattens out.

loss vs no of epochs

We plot the accuracy against the number of epochs; we can see it goes up quickly and then flattens out.

accuracy vs no of epochs
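The plotting code is not included in this post; a minimal sketch that produces these two plots from the history list returned by evaluate and fit could look like this:

losses = [x['val_loss'] for x in history]
accuracies = [x['val_acc'] for x in history]

plt.plot(losses, '-x')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.title('Loss vs. no. of epochs')
plt.show()

plt.plot(accuracies, '-x')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.title('Accuracy vs. no. of epochs')
plt.show()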

Our model quickly reaches an accuracy of 97% but doesn’t improve much beyond this. To improve the accuracy further, we need to make the model more powerful, which can be achieved by increasing the size of the hidden layer or by adding more hidden layers.
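As an illustration of the second option (my own sketch, not code from the original notebook), the model could be extended with a second hidden layer like this:

class DeeperMnistModel(nn.Module):
    """Feedforward network with two hidden layers (illustrative variant)"""
    def __init__(self, in_size, hidden_size, out_size):
        super().__init__()
        self.linear1 = nn.Linear(in_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, hidden_size)
        self.linear3 = nn.Linear(hidden_size, out_size)

    def forward(self, xb):
        out = xb.view(xb.size(0), -1)     # flatten the images
        out = F.relu(self.linear1(out))   # first hidden layer + activation
        out = F.relu(self.linear2(out))   # second hidden layer + activation
        return self.linear3(out)          # output layer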

Please look into the entire code in the notebook on GitHub, and stay connected with me on LinkedIn.

Credits & references :

  1. https://jovian.ml/aakashns/04-feedforward-nn
  2. https://pytorch.org/docs/stable/tensors.html
