diff --git a/docs/data/understand/deep_learning/TextClassification_3.png b/docs/data/understand/deep_learning/TextClassification_3.png new file mode 100644 index 000000000..6c2e2eafd Binary files /dev/null and b/docs/data/understand/deep_learning/TextClassification_3.png differ diff --git a/docs/data/understand/deep_learning/TextClassification_4.png b/docs/data/understand/deep_learning/TextClassification_4.png new file mode 100644 index 000000000..14d8d4260 Binary files /dev/null and b/docs/data/understand/deep_learning/TextClassification_4.png differ diff --git a/docs/data/understand/deep_learning/TextClassification_5.png b/docs/data/understand/deep_learning/TextClassification_5.png new file mode 100644 index 000000000..31626e9ef Binary files /dev/null and b/docs/data/understand/deep_learning/TextClassification_5.png differ diff --git a/docs/data/understand/deep_learning/Text Classification 6.png b/docs/data/understand/deep_learning/TextClassification_6.png similarity index 100% rename from docs/data/understand/deep_learning/Text Classification 6.png rename to docs/data/understand/deep_learning/TextClassification_6.png diff --git a/docs/data/understand/deep_learning/Text Classification 7.png b/docs/data/understand/deep_learning/TextClassification_7.png similarity index 100% rename from docs/data/understand/deep_learning/Text Classification 7.png rename to docs/data/understand/deep_learning/TextClassification_7.png diff --git a/docs/data/understand/deep_learning/inception v3.png b/docs/data/understand/deep_learning/inception_v3.png similarity index 100% rename from docs/data/understand/deep_learning/inception v3.png rename to docs/data/understand/deep_learning/inception_v3.png diff --git a/docs/data/understand/deep_learning/mnist 1.png b/docs/data/understand/deep_learning/mnist_1.png similarity index 100% rename from docs/data/understand/deep_learning/mnist 1.png rename to docs/data/understand/deep_learning/mnist_1.png diff --git a/docs/data/understand/deep_learning/mnist 2.png b/docs/data/understand/deep_learning/mnist_2.png similarity index 100% rename from docs/data/understand/deep_learning/mnist 2.png rename to docs/data/understand/deep_learning/mnist_2.png diff --git a/docs/data/understand/deep_learning/mnist 3.png b/docs/data/understand/deep_learning/mnist_3.png similarity index 100% rename from docs/data/understand/deep_learning/mnist 3.png rename to docs/data/understand/deep_learning/mnist_3.png diff --git a/docs/data/understand/deep_learning/mnist_4.png b/docs/data/understand/deep_learning/mnist_4.png new file mode 100644 index 000000000..3d35f354b Binary files /dev/null and b/docs/data/understand/deep_learning/mnist_4.png differ diff --git a/docs/data/understand/deep_learning/mnist_5.png b/docs/data/understand/deep_learning/mnist_5.png new file mode 100644 index 000000000..ac92f29bb Binary files /dev/null and b/docs/data/understand/deep_learning/mnist_5.png differ diff --git a/docs/examples/inception_casestudy/inception_casestudy.md b/docs/examples/inception_casestudy/inception_casestudy.md index 7bdab7e2d..756284516 100644 --- a/docs/examples/inception_casestudy/inception_casestudy.md +++ b/docs/examples/inception_casestudy/inception_casestudy.md @@ -1,5 +1,1273 @@ -# Inception V3 with PyTorch - -Pull content from -. -Ignore training description. +# Inception V3 with PyTorch + +## Deep Learning Training + +Deep Learning models are designed to capture the complexity of the problem and the underlying data. These models are "deep," comprising multiple component layers. 
Training is finding the best parameters for each model layer to achieve a well-defined objective. + +The training data consists of input features in supervised learning, similar to what the learned model is expected to see during the evaluation or inference phase. The target output is also included, which serves to teach the model. A loss metric is defined as part of training that evaluates the model's performance during the training process. + +Training also includes the choice of an optimization algorithm that reduces the loss by adjusting the model's parameters. Training is an iterative process where training data is fed in, usually split into different batches, with the entirety of the training data passed during one training epoch. Training usually is run for multiple epochs. + +## Training Phases + +Training occurs in multiple phases for every batch of training data. {numref}`TypesOfTrainingPhases` provides an explanation of the types of training phases. + +:::{table} Types of Training Phases +:name: TypesOfTrainingPhases +:widths: auto +| Types of Phases | | +| ----------- | ----------- | +| Forward Pass | The input features are fed into the model, whose parameters may be randomly initialized initially. Activations (outputs) of each layer are retained during this pass to help in the loss gradient computation during the backward pass. | +| Loss Computation | The output is compared against the target outputs, and the loss is computed. | +| Backward Pass | The loss is propagated backward, and the model's error gradients are computed and stored for each trainable parameter. | +| Optimization Pass | The optimization algorithm updates the model parameters using the stored error gradients. | +::: + +Training is different from inference, particularly from the hardware perspective. {numref}`TrainingVsInference` shows the contrast between training and inference. + +:::{table} Training vs. Inference +:name: TrainingVsInference +:widths: auto +| Training | Inference | +| ----------- | ----------- | +| Training is measured in hours/days. | The inference is measured in minutes. | +| Training is generally run offline in a data center or cloud setting. | The inference is made on edge devices. | +| The memory requirements for training are higher than inference due to storing intermediate data, such as activations and error gradients. | The memory requirements are lower for inference than training. | +| Data for training is available on the disk before the training process and is generally significant. The training performance is measured by how fast the data batches can be processed. | Inference data usually arrive stochastically, which may be batched to improve performance. Inference performance is generally measured in throughput speed to process the batch of data and the delay in responding to the input (latency). | +::: + +Different quantization data types are typically chosen between training (FP32, BF16) and inference (FP16, INT8). The computation hardware has different specializations from other datatypes, leading to improvement in performance if a faster datatype can be selected for the corresponding task. + +## Case Studies + +The following sections contain case studies for the Inception v3 model. + +### Inception v3 with PyTorch + +Convolution Neural Networks are forms of artificial neural networks commonly used for image processing. One of the core layers of such a network is the convolutional layer, which convolves the input with a weight tensor and passes the result to the next layer. 
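For illustration, the following is a minimal PyTorch sketch of a single convolutional layer applied to a batch of random images. The channel counts and image size are arbitrary examples chosen for the sketch and are not the layer sizes used inside Inception v3.

```py
import torch
import torch.nn as nn

# A single convolutional layer: 3 input channels (RGB), 16 output channels,
# and a 3x3 weight kernel. These sizes are illustrative only.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# A dummy batch of four 224x224 RGB images.
x = torch.randn(4, 3, 224, 224)

# The layer convolves the input with its weight tensor and produces the
# activations that are passed on to the next layer of the network.
y = conv(x)
print(y.shape)  # torch.Size([4, 16, 224, 224])
```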
Inception v3 [1] is an architectural development over the ImageNet competition-winning entry, AlexNet, using more profound and broader networks while attempting to meet computational and memory budgets. + +The implementation uses PyTorch as a framework. This case study utilizes torchvision [2], a repository of popular datasets and model architectures, for obtaining the model. Torchvision also provides pretrained weights as a starting point to develop new models or fine-tune the model for a new task. + +#### Evaluating a Pretrained Model + +The Inception v3 model introduces a simple image classification task with the pretrained model. This does not involve training but utilizes an already pretrained model from torchvision. + +This example is adapted from the PyTorch research hub page on Inception v3 [3]. + +Follow these steps: + +1. Run the PyTorch ROCm-based Docker image or refer to the section [Installing PyTorch](https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4-/page/Frameworks_Installation.html#d1667e113) for setting up a PyTorch environment on ROCm. + + ```dockerfile + docker run -it -v $HOME:/data --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest + ``` + +2. Run the Python shell and import packages and libraries for model creation. + + ```py + import torch + import torchvision + ``` + +3. Set the model in evaluation mode. Evaluation mode directs PyTorch not to store intermediate data, which would have been used in training. + + ```py + model = torch.hub.load('pytorch/vision:v0.10.0', 'inception_v3', pretrained=True) + model.eval() + ``` + +4. Download a sample image for inference. + + ```py + import urllib + url, filename = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg") + try: urllib.URLopener().retrieve(url, filename) + except: urllib.request.urlretrieve(url, filename) + ``` + +5. Import torchvision and PIL Image support libraries. + + ```py + from PIL import Image + from torchvision import transforms + input_image = Image.open(filename) + ``` + +6. Apply preprocessing and normalization. + + ```py + preprocess = transforms.Compose([ + transforms.Resize(299), + transforms.CenterCrop(299), + transforms.ToTensor(), + transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), + ]) + ``` + +7. Use input tensors and unsqueeze them later. + + ```py + input_tensor = preprocess(input_image) + input_batch = input_tensor.unsqueeze(0) + if torch.cuda.is_available(): + input_batch = input_batch.to('cuda') + model.to('cuda') + ``` + +8. Find out probabilities. + + ```py + with torch.no_grad(): + output = model(input_batch) + print(output[0]) + probabilities = torch.nn.functional.softmax(output[0], dim=0) + print(probabilities) + ``` + +9. To understand the probabilities, download and examine the Imagenet labels. + + ```py + wget https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt + ``` + +10. Read the categories and show the top categories for the image. + + ```py + with open("imagenet_classes.txt", "r") as f: + categories = [s.strip() for s in f.readlines()] + top5_prob, top5_catid = torch.topk(probabilities, 5) + for i in range(top5_prob.size(0)): + print(categories[top5_catid[i]], top5_prob[i].item()) + ``` + +#### Training Inception v3 + +The previous section focused on downloading and using the Inception v3 model for a simple image classification task. This section walks through training the model on a new dataset. 
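The loop assembled in the steps below follows the training phases described earlier: a forward pass, loss computation, a backward pass, and an optimization pass. As a quick orientation, here is a schematic sketch of a single training step using a toy linear model and random data; the actual Inception v3 code appears in the steps that follow.

```py
import torch
import torch.nn as nn

# Toy stand-ins for the model and a batch of training data.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(8, 10)            # one batch of input features
targets = torch.randint(0, 2, (8,))    # matching class labels

output = model(inputs)                 # forward pass
loss = criterion(output, targets)      # loss computation
optimizer.zero_grad()
loss.backward()                        # backward pass: compute gradients
optimizer.step()                       # optimization pass: update parameters
```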
+ +Follow these steps: + +1. Run the PyTorch ROCm Docker image or refer to the section [Installing PyTorch](https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4-/page/Frameworks_Installation.html#d1667e113) for setting up a PyTorch environment on ROCm. + + ```dockerfile + docker pull rocm/pytorch:latest + docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest + ``` + +2. Download an imagenet database. For this example, the tiny-imagenet-200 [4], a smaller ImageNet variant with 200 image classes and a training dataset with 100,000 images, was downsized to 64x64 color images. + + ```py + wget http://cs231n.stanford.edu/tiny-imagenet-200.zip + ``` + +3. Process the database to set the validation directory to the format expected by PyTorch DataLoader. + +4. Run the following script: + + ```py + import io + import glob + import os + from shutil import move + from os.path import join + from os import listdir, rmdir + target_folder = './tiny-imagenet-200/val/' + val_dict = {} + with open('./tiny-imagenet-200/val/val_annotations.txt', 'r') as f: + for line in f.readlines(): + split_line = line.split('\t') + val_dict[split_line[0]] = split_line[1] + + paths = glob.glob('./tiny-imagenet-200/val/images/*') + for path in paths: + file = path.split('/')[-1] + folder = val_dict[file] + if not os.path.exists(target_folder + str(folder)): + os.mkdir(target_folder + str(folder)) + os.mkdir(target_folder + str(folder) + '/images') + + for path in paths: + file = path.split('/')[-1] + folder = val_dict[file] + dest = target_folder + str(folder) + '/images/' + str(file) + move(path, dest) + + rmdir('./tiny-imagenet-200/val/images') + ``` + +5. Open a Python shell. + +6. Import dependencies, including torch, OS, and torchvision. + + ```py + import torch + import os + import torchvision + from torchvision import transforms + from torchvision.transforms.functional import InterpolationMode + ``` + +7. Set parameters to guide the training process. + + :::{note} + The device is set to "cuda". In PyTorch, "cuda" is a generic keyword to denote a GPU. + ::: + + ```py + device = "cuda" + ``` + +8. Set the data_path to the location of the training and validation data. In this case, the tiny-imagenet-200 is present as a subdirectory to the current directory. + + ```py + data_path = "tiny-imagenet-200" + ``` + + The training image size is cropped for input into Inception v3. + + ```py + train_crop_size = 299 + ``` + +9. To smooth the image, use bilinear interpolation, a resampling method that uses the distance weighted average of the four nearest pixel values to estimate a new pixel value. + + ```py + interpolation = "bilinear" + ``` + + The next parameters control the size to which the validation image is cropped and resized. + + ```py + val_crop_size = 299 + val_resize_size = 342 + ``` + + The pretrained Inception v3 model is chosen to be downloaded from torchvision. + + ```py + model_name = "inception_v3" + pretrained = True + ``` + + During each training step, a batch of images is processed to compute the loss gradient and perform the optimization. In the following setting, the size of the batch is determined. + + ```py + batch_size = 32 + ``` + + This refers to the number of CPU threads the data loader uses to perform efficient multiprocess data loading. + + ```py + num_workers = 16 + ``` + + The PyTorch optim package provides methods to adjust the learning rate as the training progresses. 
This example uses the StepLR scheduler, which decays the learning rate by lr_gamma at every lr_step_size number of epochs. + + ```py + learning_rate = 0.1 + momentum = 0.9 + weight_decay = 1e-4 + lr_step_size = 30 + lr_gamma = 0.1 + ``` + + :::{note} + One training epoch is when the neural network passes an entire dataset forward and backward. + ::: + + ```py + epochs = 90 + ``` + + The train and validation directories are determined. + + ```py + train_dir = os.path.join(data_path, "train") + val_dir = os.path.join(data_path, "val") + ``` + +10. Set up the training and testing data loaders. + + ```py + interpolation = InterpolationMode(interpolation) + + TRAIN_TRANSFORM_IMG = transforms.Compose([ + Normalizaing and standardardizing the image + transforms.RandomResizedCrop(train_crop_size, interpolation=interpolation), + transforms.PILToTensor(), + transforms.ConvertImageDtype(torch.float), + transforms.Normalize(mean=[0.485, 0.456, 0.406], + std=[0.229, 0.224, 0.225] ) + ]) + dataset = torchvision.datasets.ImageFolder( + train_dir, + transform=TRAIN_TRANSFORM_IMG + ) + TEST_TRANSFORM_IMG = transforms.Compose([ + transforms.Resize(val_resize_size, interpolation=interpolation), + transforms.CenterCrop(val_crop_size), + transforms.PILToTensor(), + transforms.ConvertImageDtype(torch.float), + transforms.Normalize(mean=[0.485, 0.456, 0.406], + std=[0.229, 0.224, 0.225] ) + ]) + + dataset_test = torchvision.datasets.ImageFolder( + val_dir, + transform=TEST_TRANSFORM_IMG + ) + + print("Creating data loaders") + train_sampler = torch.utils.data.RandomSampler(dataset) + test_sampler = torch.utils.data.SequentialSampler(dataset_test) + + data_loader = torch.utils.data.DataLoader( + dataset, + batch_size=batch_size, + sampler=train_sampler, + num_workers=num_workers, + pin_memory=True + ) + + data_loader_test = torch.utils.data.DataLoader( + dataset_test, batch_size=batch_size, sampler=test_sampler, num_workers=num_workers, pin_memory=True + ) + ``` + + :::{note} + Use torchvision to obtain the Inception v3 model. Use the pretrained model weights to speed up training. + ::: + + ```py + print("Creating model") + print("Num classes = ", len(dataset.classes)) + model = torchvision.models.__dict__[model_name](pretrained=pretrained) + ``` + +11. Adapt Inception v3 for the current dataset. Tiny-imagenet-200 contains only 200 classes, whereas Inception v3 is designed for 1,000-class output. The last layer of Inception v3 is replaced to match the output features required. + + ```py + model.fc = torch.nn.Linear(model.fc.in_features, len(dataset.classes)) + model.aux_logits = False + model.AuxLogits = None + ``` + +12. Move the model to the GPU device. + + ```py + model.to(device) + ``` + +13. Set the loss criteria. For this example, Cross Entropy Loss [5] is used. + + ```py + criterion = torch.nn.CrossEntropyLoss() + ``` + +14. Set the optimizer to Stochastic Gradient Descent. + + ```py + optimizer = torch.optim.SGD( + model.parameters(), + lr=learning_rate, + momentum=momentum, + weight_decay=weight_decay + ) + ``` + +15. Set the learning rate scheduler. + + ```py + lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=lr_step_size, gamma=lr_gamma) + ``` + +16. Iterate over epochs. Each epoch is a complete pass through the training data. + + ```py + print("Start training") + for epoch in range(epochs): + model.train() + epoch_loss = 0 + len_dataset = 0 + ``` + +17. Iterate over steps. The data is processed in batches, and each step passes through a full batch. 
+ + ```py + for step, (image, target) in enumerate(data_loader): + ``` + +18. Pass the image and target to the GPU device. + + ```py + image, target = image.to(device), target.to(device) + ``` + + The following is the core training logic: + + a. The image is fed into the model. + + b. The output is compared with the target in the training data to obtain the loss. + + c. This loss is back propagated to all parameters that require optimization. + + d. The optimizer updates the parameters based on the selected optimization algorithm. + + ```py + output = model(image) + loss = criterion(output, target) + optimizer.zero_grad() + loss.backward() + optimizer.step() + ``` + + The epoch loss is updated, and the step loss prints. + + ```py + epoch_loss += output.shape[0] * loss.item() + len_dataset += output.shape[0]; + if step % 10 == 0: + print('Epoch: ', epoch, '| step : %d' % step, '| train loss : %0.4f' % loss.item() ) + epoch_loss = epoch_loss / len_dataset + print('Epoch: ', epoch, '| train loss : %0.4f' % epoch_loss ) + ``` + + The learning rate is updated at the end of each epoch. + + ```py + lr_scheduler.step() + ``` + + After training for the epoch, the model evaluates against the validation dataset. + + ```py + model.eval() + with torch.inference_mode(): + running_loss = 0 + for step, (image, target) in enumerate(data_loader_test): + image, target = image.to(device), target.to(device) + + output = model(image) + loss = criterion(output, target) + + running_loss += loss.item() + running_loss = running_loss / len(data_loader_test) + print('Epoch: ', epoch, '| test loss : %0.4f' % running_loss ) + ``` + +19. Save the model for use in inferencing tasks. + +```py +# save model +torch.save(model.state_dict(), "trained_inception_v3.pt") +``` + +Plotting the train and test loss shows both metrics reducing over training epochs. This is demonstrated in {numref}`inceptionV3`. + +```{figure} ../../data/understand/deep_learning/inception_v3.png +:name: inceptionV3 +--- +align: center +--- +Inception v3 Train and Loss Graph +``` + +### Custom Model with CIFAR-10 on PyTorch + +The CIFAR-10 (Canadian Institute for Advanced Research) dataset is a subset of the Tiny Images dataset (which contains 80 million images of 32x32 collected from the Internet) and consists of 60,000 32x32 color images. The images are labeled with one of 10 mutually exclusive classes: airplane, motor car, bird, cat, deer, dog, frog, cruise ship, stallion, and truck (but not pickup truck). There are 6,000 images per class, with 5,000 training and 1,000 testing images per class. Let us prepare a custom model for classifying these images using the PyTorch framework and go step-by-step as illustrated below. + +Follow these steps: + +1. Import dependencies, including torch, OS, and torchvision. + + ```py + import torch + import torchvision + import torchvision.transforms as transforms + import matplotlib.pyplot as plot + import numpy as np + ``` + +2. The output of torchvision datasets is PILImage images of range [0, 1]. Transform them to Tensors of normalized range [-1, 1]. + + ```py + transform = transforms.Compose( + [transforms.ToTensor(), + transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) + ``` + + During each training step, a batch of images is processed to compute the loss gradient and perform the optimization. In the following setting, the size of the batch is determined. + + ```py + batch_size = 4 + ``` + +3. Download the dataset train and test datasets as follows. 
Specify the batch size, shuffle the dataset once, and specify the number of workers to the number of CPU threads used by the data loader to perform efficient multiprocess data loading. + + ```py + train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform) + train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=2) + ``` + +4. Follow the same procedure for the testing set. + + ```py + test_set = TorchVision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform) + test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=2) + print ("teast set and test loader") + ``` + +5. Specify the defined classes of images belonging to this dataset. + + ```py + classes = ('Aeroplane', 'motorcar', 'bird', 'cat', 'deer', 'puppy', 'frog', 'stallion', 'cruise', 'truck') + print("defined classes") + ``` + +6. Unnormalize the images and then iterate over them. + + ```py + global image_number + image_number = 0 + def show_image(img): + global image_number + image_number = image_number + 1 + img = img / 2 + 0.5 # de-normalizing input image + npimg = img.numpy() + plot.imshow(np.transpose(npimg, (1, 2, 0))) + plot.savefig("fig{}.jpg".format(image_number)) + print("fig{}.jpg".format(image_number)) + plot.show() + data_iter = iter(train_loader) + images, labels = data_iter.next() + show_image(torchvision.utils.make_grid(images)) + print(' '.join('%5s' % classes[labels[j]] for j in range(batch_size))) + print("image created and saved ") + ``` + +7. Import the torch.nn for constructing neural networks and torch.nn.functional to use the convolution functions. + + ```py + import torch.nn as nn + import torch.nn.functional as F + ``` + +8. Define the CNN (Convolution Neural Networks) and relevant activation functions. + + ```py + class Net(nn.Module): + def __init__(self): + super().__init__() + self.conv1 = nn.Conv2d(3, 6, 5) + self.pool = nn.MaxPool2d(2, 2) + self.conv2 = nn.Conv2d(6, 16, 5) + self.pool = nn.MaxPool2d(2, 2) + self.conv3 = nn.Conv2d(3, 6, 5) + self.fc2 = nn.Linear(120, 84) + self.fc3 = nn.Linear(84, 10) + + def forward(self, x): + x = self.pool(F.relu(self.conv1(x))) + x = self.pool(F.relu(self.conv2(x))) + x = torch.flatten(x, 1) # flatten all dimensions except batch + x = F.relu(self.fc1(x)) + x = F.relu(self.fc2(x)) + x = self.fc3(x) + return x + net = Net() + print("created Net() ") + ``` + +9. Set the optimizer to Stochastic Gradient Descent. + + ```py + import torch.optim as optim + ``` + +10. Set the loss criteria. For this example, Cross Entropy Loss [5] is used. + + ```py + criterion = nn.CrossEntropyLoss() + optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9) + ``` + +11. Iterate over epochs. Each epoch is a complete pass through the training data. 
+ + ```py + for epoch in range(2): # loop over the dataset multiple times + + running_loss = 0.0 + for i, data in enumerate(train_loader, 0): + # get the inputs; data is a list of [inputs, labels] + inputs, labels = data + + # zero the parameter gradients + optimizer.zero_grad() + + # forward + backward + optimize + outputs = net(inputs) + loss = criterion(outputs, labels) + loss.backward() + optimizer.step() + + # print statistics + running_loss += loss.item() + if i % 2000 == 1999: # print every 2000 mini-batches + print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000)) + running_loss = 0.0 + print('Finished Training') + ``` + + ```py + PATH = './cifar_net.pth' + torch.save(net.state_dict(), PATH) + print("saved model to path :",PATH) + net = Net() + net.load_state_dict(torch.load(PATH)) + print("loding back saved model") + outputs = net(images) + _, predicted = torch.max(outputs, 1) + print('Predicted: ', ' '.join('%5s' % classes[predicted[j]] for j in range(4))) + correct = 0 + total = 0 + ``` + + As this is not training, calculating the gradients for outputs is not required. + + ```py + # calculate outputs by running images through the network + with torch.no_grad(): + for data in test_loader: + images, labels = data + # calculate outputs by running images through the network + outputs = net(images) + # the class with the highest energy is what you can choose as prediction + _, predicted = torch.max(outputs.data, 1) + total += labels.size(0) + correct += (predicted == labels).sum().item() + print('Accuracy of the network on the 10000 test images: %d %%' % ( 100 * correct / total)) + # prepare to count predictions for each class + correct_pred = {classname: 0 for classname in classes} + total_pred = {classname: 0 for classname in classes} + ``` + + ```py + # again no gradients needed + with torch.no_grad(): + for data in test_loader: + images, labels = data + outputs = net(images) + _, predictions = torch.max(outputs, 1) + # collect the correct predictions for each class + for label, prediction in zip(labels, predictions): + if label == prediction: + correct_pred[classes[label]] += 1 + total_pred[classes[label]] += 1 + # print accuracy for each class + for classname, correct_count in correct_pred.items(): + accuracy = 100 * float(correct_count) / total_pred[classname] + print("Accuracy for class {:5s} is: {:.1f} %".format(classname,accuracy)) + ``` + +### Case Study: TensorFlow with Fashion MNIST + +Fashion MNIST is a dataset that contains 70,000 grayscale images in 10 categories. + +Implement and train a neural network model using the TensorFlow framework to classify images of clothing, like sneakers and shirts. + +The dataset has 60,000 images you will use to train the network and 10,000 to evaluate how accurately the network learned to classify images. The Fashion MNIST dataset can be accessed via TensorFlow internal libraries. + +Access the source code from the following repository: + +[https://github.com/ROCmSoftwarePlatform/tensorflow_fashionmnist/blob/main/fashion_mnist.py](https://github.com/ROCmSoftwarePlatform/tensorflow_fashionmnist/blob/main/fashion_mnist.py) + +To understand the code step by step, follow these steps: + +1. Import libraries like TensorFlow, Numpy, and Matplotlib to train the neural network and calculate and plot graphs. + + ```py + import tensorflow as tf + import numpy as np + import matplotlib.pyplot as plt + ``` + +2. 
To verify that TensorFlow is installed, print the version of TensorFlow by using the below print statement: + + ```py + print(tf._version__) r + ``` + +3. Load the dataset from the available internal libraries to analyze and train a neural network upon the MNIST Fashion Dataset. Loading the dataset returns four NumPy arrays. The model uses the training set arrays, train_images and train_labels, to learn. + +4. The model is tested against the test set, test_images, and test_labels arrays. + + ```py + fashion_mnist = tf.keras.datasets.fashion_mnist + (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data() + ``` + + Since you have 10 types of images in the dataset, assign labels from zero to nine. Each image is assigned one label. The images are 28x28 NumPy arrays, with pixel values ranging from zero to 255. + +5. Each image is mapped to a single label. Since the class names are not included with the dataset, store them, and later use them when plotting the images: + + ```py + class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat','Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'] + ``` + +6. Use this code to explore the dataset by knowing its dimensions: + + ```py + train_images.shape + ``` + +7. Use this code to print the size of this training set: + + ```py + print(len(train_labels)) + ``` + +8. Use this code to print the labels of this training set: + + ```py + print(train_labels) + ``` + +9. Preprocess the data before training the network, and you can start inspecting the first image, as its pixels will fall in the range of zero to 255. + + ```py + plt.figure() + plt.imshow(train_images[0]) + plt.colorbar() + plt.grid(False) + plt.show() + ``` + + ```{figure} ../../data/understand/deep_learning/mnist_1.png + --- + align: center + --- + ``` + +10. From the above picture, you can see that values are from zero to 255. Before training this on the neural network, you must bring them in the range of zero to one. Hence, divide the values by 255. + + ```py + train_images = train_images / 255.0 + + test_images = test_images / 255.0 + ``` + +11. To ensure the data is in the correct format and ready to build and train the network, display the first 25 images from the training set and the class name below each image. + + ```py + plt.figure(figsize=(10,10)) + for i in range(25): + plt.subplot(5,5,i+1) + plt.xticks([]) + plt.yticks([]) + plt.grid(False) + plt.imshow(train_images[i], cmap=plt.cm.binary) + plt.xlabel(class_names[train_labels[i]]) + plt.show() + ``` + + ```{figure} ../../data/understand/deep_learning/mnist_2.png + --- + align: center + --- + ``` + + The basic building block of a neural network is the layer. Layers extract representations from the data fed into them. Deep Learning consists of chaining together simple layers. Most layers, such as tf.keras.layers.Dense, have parameters that are learned during training. + + ```py + model = tf.keras.Sequential([ + tf.keras.layers.Flatten(input_shape=(28, 28)), + tf.keras.layers.Dense(128, activation='relu'), + tf.keras.layers.Dense(10) + ]) + ``` + + - The first layer in this network tf.keras.layers.Flatten transforms the format of the images from a two-dimensional array (of 28 x 28 pixels) to a one-dimensional array (of 28 * 28 = 784 pixels). Think of this layer as unstacking rows of pixels in the image and lining them up. This layer has no parameters to learn; it only reformats the data. + + - After the pixels are flattened, the network consists of a sequence of two tf.keras.layers.Dense layers. 
These are densely connected or fully connected neural layers. The first Dense layer has 128 nodes (or neurons). The second (and last) layer returns a logits array with a length of 10. Each node contains a score that indicates the current image belongs to one of the 10 classes. + +12. You must add the Loss function, Metrics, and Optimizer at the time of model compilation. + + ```py + model.compile(optimizer='adam', + loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), + metrics=['accuracy']) + ``` + + - Loss function —This measures how accurate the model is during training when you are looking to minimize this function to "steer" the model in the right direction. + + - Optimizer —This is how the model is updated based on the data it sees and its loss function. + + - Metrics —This is used to monitor the training and testing steps. + + The following example uses accuracy, the fraction of the correctly classified images. + + To train the neural network model, follow these steps: + + 1. Feed the training data to the model. The training data is in the train_images and train_labels arrays in this example. The model learns to associate images and labels. + + 2. Ask the model to make predictions about a test set—in this example, the test_images array. + + 3. Verify that the predictions match the labels from the test_labels array. + + 4. To start training, call the model.fit method because it "fits" the model to the training data. + + ```py + model.fit(train_images, train_labels, epochs=10) + ``` + + 5. Compare how the model will perform on the test dataset. + + ```py + test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2) + + print('\nTest accuracy:', test_acc) + ``` + + 6. With the model trained, you can use it to make predictions about some images: the model's linear outputs and logits. Attach a softmax layer to convert the logits to probabilities, making it easier to interpret. + + ```py + probability_model = tf.keras.Sequential([model, + tf.keras.layers.Softmax()]) + + predictions = probability_model.predict(test_images) + ``` + + 7. The model has predicted the label for each image in the testing set. Look at the first prediction: + + ```py + predictions[0] + ``` + + A prediction is an array of 10 numbers. They represent the model's "confidence" that the image corresponds to each of the 10 different articles of clothing. You can see which label has the highest confidence value: + + ```py + np.argmax(predictions[0]) + ``` + + 8. Plot a graph to look at the complete set of 10 class predictions. + + ```py + def plot_image(i, predictions_array, true_label, img): + true_label, img = true_label[i], img[i] + plt.grid(False) + plt.xticks([]) + plt.yticks([]) + + plt.imshow(img, cmap=plt.cm.binary) + + predicted_label = np.argmax(predictions_array) + if predicted_label == true_label: + color = 'blue' + else: + color = 'red' + + plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label], + 100*np.max(predictions_array), + class_names[true_label]), + color=color) + + def plot_value_array(i, predictions_array, true_label): + true_label = true_label[i] + plt.grid(False) + plt.xticks(range(10)) + plt.yticks([]) + thisplot = plt.bar(range(10), predictions_array, color="#777777") + plt.ylim([0, 1]) + predicted_label = np.argmax(predictions_array) + + thisplot[predicted_label].set_color('red') + thisplot[true_label].set_color('blue') + ``` + + 9. With the model trained, you can use it to make predictions about some images. 
Review the 0th image predictions and the prediction array. Correct prediction labels are blue, and incorrect prediction labels are red. The number gives the percentage (out of 100) for the predicted label. + + ```py + i = 0 + plt.figure(figsize=(6,3)) + plt.subplot(1,2,1) + plot_image(i, predictions[i], test_labels, test_images) + plt.subplot(1,2,2) + plot_value_array(i, predictions[i], test_labels) + plt.show() + ``` + + ```{figure} ../../data/understand/deep_learning/mnist_3.png + --- + align: center + --- + ``` + + ```py + i = 12 + plt.figure(figsize=(6,3)) + plt.subplot(1,2,1) + plot_image(i, predictions[i], test_labels, test_images) + plt.subplot(1,2,2) + plot_value_array(i, predictions[i], test_labels) + plt.show() + ``` + + ```{figure} ../../data/understand/deep_learning/mnist_4.png + --- + align: center + --- + ``` + + 10. Use the trained model to predict a single image. + + ```py + # Grab an image from the test dataset. + img = test_images[1] + print(img.shape) + ``` + + 11. tf.keras models are optimized to make predictions on a batch, or collection, of examples at once. Accordingly, even though you are using a single image, you must add it to a list. + + ```py + # Add the image to a batch where it's the only member. + img = (np.expand_dims(img,0)) + + print(img.shape) + ``` + + 12. Predict the correct label for this image. + + ```py + predictions_single = probability_model.predict(img) + + print(predictions_single) + + plot_value_array(1, predictions_single[0], test_labels) + _ = plt.xticks(range(10), class_names, rotation=45) + plt.show() + ``` + + ```{figure} ../../data/understand/deep_learning/mnist_5.png + --- + align: center + --- + ``` + + 13. tf.keras.Model.predict returns a list of lists—one for each image in the batch of data. Grab the predictions for our (only) image in the batch. + + ```py + np.argmax(predictions_single[0]) + ``` + +### Case Study: TensorFlow with Text Classification + +This procedure demonstrates text classification starting from plain text files stored on disk. You will train a binary classifier to perform sentiment analysis on an IMDB dataset. At the end of the notebook, there is an exercise for you to try in which you will train a multiclass classifier to predict the tag for a programming question on Stack Overflow. + +Follow these steps: + +1. Import the necessary libraries. + + ```py + import matplotlib.pyplot as plt + import os + import re + import shutil + import string + import tensorflow as tf + + from tensorflow.keras import layers + from tensorflow.keras import losses + ``` + +2. Get the data for the text classification, and extract the database from the given link of IMDB. + + ```py + url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz" + + dataset = tf.keras.utils.get_file("aclImdb_v1", url, + untar=True, cache_dir='.', + cache_subdir='') + ``` + + ```py + Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz + 84131840/84125825 [==============================] – 1s 0us/step + 84149932/84125825 [==============================] – 1s 0us/step + ``` + +3. Fetch the data from the directory. + + ```py + dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb') + print(os.listdir(dataset_dir)) + ``` + +4. Load the data for training purposes. + + ```py + train_dir = os.path.join(dataset_dir, 'train') + os.listdir(train_dir) + ``` + + ```py + ['labeledBow.feat', + 'urls_pos.txt', + 'urls_unsup.txt', + 'unsup', + 'pos', + 'unsupBow.feat', + 'urls_neg.txt', + 'neg'] + ``` + +5. 
The directories contain many text files, each of which is a single movie review. To look at one of them, use the following: + + ```py + sample_file = os.path.join(train_dir, 'pos/1181_9.txt') + with open(sample_file) as f: + print(f.read()) + ``` + +6. As the IMDB dataset contains additional folders, remove them before using this utility. + + ```py + remove_dir = os.path.join(train_dir, 'unsup') + shutil.rmtree(remove_dir) + batch_size = 32 + seed = 42 + ``` + +7. The IMDB dataset has already been divided into train and test but lacks a validation set. Create a validation set using an 80:20 split of the training data by using the validation_split argument below: + + ```py + raw_train_ds=tf.keras.utils.text_dataset_from_directory('aclImdb/train',batch_size=batch_size, validation_split=0.2,subset='training', seed=seed) + ``` + +8. As you will see in a moment, you can train a model by passing a dataset directly to model.fit. If you are new to tf.data, you can also iterate over the dataset and print a few examples as follows: + + ```py + for text_batch, label_batch in raw_train_ds.take(1): + for i in range(3): + print("Review", text_batch.numpy()[i]) + print("Label", label_batch.numpy()[i]) + ``` + +9. The labels are zero or one. To see which of these correspond to positive and negative movie reviews, check the class_names property on the dataset. + + ```py + print("Label 0 corresponds to", raw_train_ds.class_names[0]) + print("Label 1 corresponds to", raw_train_ds.class_names[1]) + ``` + +10. Next, create validation and test the dataset. Use the remaining 5,000 reviews from the training set for validation into two classes of 2,500 reviews each. + + ```py + raw_val_ds = tf.keras.utils.text_dataset_from_directory('aclImdb/train', + batch_size=batch_size,validation_split=0.2,subset='validation', seed=seed) + + raw_test_ds = + tf.keras.utils.text_dataset_from_directory( + 'aclImdb/test', + batch_size=batch_size) + ``` + +To prepare the data for training, follow these steps: + +1. Standardize, tokenize, and vectorize the data using the helpful tf.keras.layers.TextVectorization layer. + + ```py + def custom_standardization(input_data): + lowercase = tf.strings.lower(input_data) + stripped_html = tf.strings.regex_replace(lowercase, '
', ' ') + return tf.strings.regex_replace(stripped_html, '[%s]' % re.escape(string.punctuation),'') + ``` + +2. Create a TextVectorization layer. Use this layer to standardize, tokenize, and vectorize our data. Set the output_mode to int to create unique integer indices for each token. Note that we are using the default split function and the custom standardization function you defined above. You will also define some constants for the model, like an explicit maximum sequence_length, which will cause the layer to pad or truncate sequences to exactly sequence_length values. + + ```py + max_features = 10000 + sequence_length = 250 + vectorize_layer = layers.TextVectorization( + standardize=custom_standardization, + max_tokens=max_features, + output_mode='int', + output_sequence_length=sequence_length) + ``` + +3. Call adapt to fit the state of the preprocessing layer to the dataset. This causes the model to build an index of strings to integers. + + ```py + # Make a text-only dataset (without labels), then call adapt + train_text = raw_train_ds.map(lambda x, y: x) + vectorize_layer.adapt(train_text) + ``` + +4. Create a function to see the result of using this layer to preprocess some data. + + ```py + def vectorize_text(text, label): + text = tf.expand_dims(text, -1) + return vectorize_layer(text), label + + text_batch, label_batch = next(iter(raw_train_ds)) + first_review, first_label = text_batch[0], label_batch[0] + print("Review", first_review) + print("Label", raw_train_ds.class_names[first_label]) + print("Vectorized review", vectorize_text(first_review, first_label)) + ``` + + ```{figure} ../../data/understand/deep_learning/TextClassification_3.png + --- + align: center + --- + ``` + +5. As you can see above, each token has been replaced by an integer. Look up the token (string) that each integer corresponds to by calling get_vocabulary() on the layer. + + ```py + print("1287 ---> ",vectorize_layer.get_vocabulary()[1287]) + print(" 313 ---> ",vectorize_layer.get_vocabulary()[313]) + print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary()))) + ``` + +6. You are nearly ready to train your model. As a final preprocessing step, apply the TextVectorization layer we created earlier to train, validate, and test the dataset. + + ```py + train_ds = raw_train_ds.map(vectorize_text) + val_ds = raw_val_ds.map(vectorize_text) + test_ds = raw_test_ds.map(vectorize_text) + ``` + + The cache() function keeps data in memory after it is loaded off disk. This ensures the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files. + + The prefetch() function overlaps data preprocessing and model execution while training. + + ```py + AUTOTUNE = tf.data.AUTOTUNE + + train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE) + val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE) + test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE) + ``` + +7. Create your neural network. + + ```py + embedding_dim = 16 + model = tf.keras.Sequential([layers.Embedding(max_features + 1, embedding_dim),layers.Dropout(0.2),layers.GlobalAveragePooling1D(), + layers.Dropout(0.2),layers.Dense(1)]) + model.summary() + ``` + + ```{figure} ../../data/understand/deep_learning/TextClassification_4.png + --- + align: center + --- + ``` + +8. A model needs a loss function and an optimizer for training. 
Since this is a binary classification problem and the model outputs a probability (a single-unit layer with a sigmoid activation), use [losses.BinaryCrossentropy](https://www.tensorflow.org/api_docs/python/tf/keras/losses/BinaryCrossentropy) loss function. + + ```py + model.compile(loss=losses.BinaryCrossentropy(from_logits=True), + optimizer='adam',metrics=tf.metrics.BinaryAccuracy(threshold=0.0)) + ``` + +9. Train the model by passing the dataset object to the fit method. + + ```py + epochs = 10 + history = model.fit(train_ds,validation_data=val_ds,epochs=epochs) + ``` + + ```{figure} ../../data/understand/deep_learning/TextClassification_5.png + --- + align: center + --- + ``` + +10. See how the model performs. Two values are returned: loss (a number representing our error; lower values are better) and accuracy. + + ```py + loss, accuracy = model.evaluate(test_ds) + + print("Loss: ", loss) + print("Accuracy: ", accuracy) + ``` + + :::{note} + model.fit() returns a History object that contains a dictionary with everything that happened during training. + ::: + + ```py + history_dict = history.history + history_dict.keys() + ``` + +11. Four entries are for each monitored metric during training and validation. Use these to plot the training and validation loss for comparison, as well as the training and validation accuracy: + + ```py + acc = history_dict['binary_accuracy'] + val_acc = history_dict['val_binary_accuracy'] + loss = history_dict['loss'] + val_loss = history_dict['val_loss'] + + epochs = range(1, len(acc) + 1) + + # "bo" is for "blue dot" + plt.plot(epochs, loss, 'bo', label='Training loss') + # b is for "solid blue line" + plt.plot(epochs, val_loss, 'b', label='Validation loss') + plt.title('Training and validation loss') + plt.xlabel('Epochs') + plt.ylabel('Loss') + plt.legend() + + plt.show() + ``` + + {numref}`TextClassification6` and {numref}`TextClassification7` illustrate the training and validation loss and the training and validation accuracy. + + ```{figure} ../../data/understand/deep_learning/TextClassification_6.png + :name: TextClassification6 + --- + align: center + --- + Training and Validation Loss + ``` + + ```{figure} ../../data/understand/deep_learning/TextClassification_7.png + :name: TextClassification7 + --- + align: center + --- + Training and Validation Accuracy + ``` + +12. Export the model. + + ```py + export_model = tf.keras.Sequential([ + vectorize_layer, + model, + layers.Activation('sigmoid') + ]) + + export_model.compile( + loss=losses.BinaryCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy'] + ) + + # Test it with `raw_test_ds`, which yields raw strings + loss, accuracy = export_model.evaluate(raw_test_ds) + print(accuracy) + ``` + +13. To get predictions for new examples, call model.predict(). + + ```py + examples = [ + "The movie was great!", + "The movie was okay.", + "The movie was terrible..." + ] + + export_model.predict(examples) + ``` diff --git a/docs/how_to/deep_learning_rocm.md b/docs/how_to/deep_learning_rocm.md index 3653a00ca..4a1fc9509 100644 --- a/docs/how_to/deep_learning_rocm.md +++ b/docs/how_to/deep_learning_rocm.md @@ -1 +1,12 @@ -# Deep Learning Guide +# Frameworks Installation + +The following sections cover the different framework installations for ROCm and +Deep Learning applications. {numref}`Rocm-Compat-Frameworks-Flowchart` provides the sequential flow for the use of each framework. 
Refer to the ROCm Compatible Frameworks Release Notes for each framework's most current release notes at [Framework Release Notes](https://docs.amd.com/bundle/ROCm-Compatible-Frameworks-Release-Notes/page/Framework_Release_Notes.html). + +```{figure} ../data/how_to/magma_install/image.005.png +:name: Rocm-Compat-Frameworks-Flowchart +--- +align: center +--- +ROCm Compatible Frameworks Flowchart +``` diff --git a/docs/how_to/magma_install/magma_install.md b/docs/how_to/magma_install/magma_install.md index 7843572c8..518dabf77 100644 --- a/docs/how_to/magma_install/magma_install.md +++ b/docs/how_to/magma_install/magma_install.md @@ -1,415 +1,5 @@ # Magma Installation for ROCm -Pull content from - - -The following sections cover the different framework installations for ROCm and -Deep Learning applications. {numref}`rocm-compat-frameworks-flowchart` provides the sequential flow for the use of -each framework. Refer to the ROCm Compatible Frameworks Release Notes for each -framework's most current release notes at -[/bundle/ROCm-Compatible-Frameworks-Release-Notes/page/Framework_Release_Notes.html](/bundle/ROCm-Compatible-Frameworks-Release-Notes/page/Framework_Release_Notes.html). - - -:::{figure-md} rocm-compat-frameworks-flowchart - -ROCm Compatible Frameworks Flowchart - -ROCm Compatible Frameworks Flowchart -::: - -## PyTorch - -PyTorch is an open source Machine Learning Python library, primarily -differentiated by Tensor computing with GPU acceleration and a type-based -automatic differentiation. Other advanced features include: - -- Support for distributed training -- Native ONNX support -- C++ frontend -- The ability to deploy at scale using TorchServe -- A production-ready deployment mechanism through TorchScript - -### Installing PyTorch - -To install ROCm on bare metal, refer to the section -[ROCm Installation](https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4-/page/Prerequisites.html#d2999e60). -The recommended option to get a PyTorch environment is through Docker. However, -installing the PyTorch wheels package on bare metal is also supported. - -#### Option 1 (Recommended): Use Docker Image with PyTorch Pre-installed - -Using Docker gives you portability and access to a prebuilt Docker container -that has been rigorously tested within AMD. This might also save on the -compilation time and should perform as it did when tested without facing -potential installation issues. - -Follow these steps: - -1. Pull the latest public PyTorch Docker image. - - ```bash - docker pull rocm/pytorch:latest - ``` - - Optionally, you may download a specific and supported configuration with - different user-space ROCm versions, PyTorch versions, and supported operating - systems. To download the PyTorch Docker image, refer to - [https://hub.docker.com/r/rocm/pytorch](https://hub.docker.com/r/rocm/pytorch). - -2. Start a Docker container using the downloaded image. - - ```bash - docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest - ``` - - :::{note} - This will automatically download the image if it does not exist on the host. - You can also pass the -v argument to mount any data directories from the host - onto the container. - ::: - -#### Option 2: Install PyTorch Using Wheels Package - -PyTorch supports the ROCm platform by providing tested wheels packages. 
To -access this feature, refer to -[https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/) -and choose the "ROCm" compute platform. Figure 6 is a matrix from pytroch.org -that illustrates the installation compatibility between ROCm and the PyTorch -build. - -| ![Figure 6](../../data/how_to/magma_install/image.006.png) | -|:------------------------------------------------------------------:| -| Figure 6. Installation Matrix from Pytorch.org | - -To install PyTorch using the wheels package, follow these installation steps: - -1. Choose one of the following options: - a. Obtain a base Docker image with the correct user-space ROCm version - installed from - [https://hub.docker.com/repository/docker/rocm/dev-ubuntu-20.04](https://hub.docker.com/repository/docker/rocm/dev-ubuntu-20.04). - - or - - b. Download a base OS Docker image and install ROCm following the - installation directions in the section - [Installation](https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4-/page/Prerequisites.html#d2999e60). - ROCm 5.2 is installed in this example, as supported by the installation - matrix from pytorch.org. - - or - - c. Install on bare metal. Skip to Step 3. - - ```bash - docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest - ``` - -3. Install any dependencies needed for installing the wheels package. - - ```bash - sudo apt update - sudo apt install libjpeg-dev python3-dev - pip3 install wheel setuptools - ``` - -4. Install torch, torchvision, and torchaudio as specified by the installation - matrix. - - :::{note} - ROCm 5.2 PyTorch wheel in the command below is shown for reference. - ::: - - ```bash - pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/rocm5.2/ - ``` - -#### Option 3: Install PyTorch Using PyTorch ROCm Base Docker Image - -A prebuilt base Docker image is used to build PyTorch in this option. The base -Docker has all dependencies installed, including: - -- ROCm -- Torchvision -- Conda packages -- Compiler toolchain - -Additionally, a particular environment flag (`BUILD_ENVIRONMENT`) is set, and -the build scripts utilize that to determine the build environment configuration. - -Follow these steps: - -1. Obtain the Docker image. - - ```bash - docker pull rocm/pytorch:latest-base - ``` - - The above will download the base container, which does not contain PyTorch. - -2. Start a Docker container using the image. - - ```bash - docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest-base - ``` - - You can also pass the -v argument to mount any data directories from the host - onto the container. - -3. Clone the PyTorch repository. - - ```bash - cd ~ - git clone https://github.com/pytorch/pytorch.git - cd pytorch - git submodule update --init –recursive - ``` - -4. Build PyTorch for ROCm. - - :::{note} - By default in the rocm/pytorch:latest-base, PyTorch builds for these - architectures simultaneously: - - gfx900 - - gfx906 - - gfx908 - - gfx90a - - gfx1030 - ::: - -5. To determine your AMD uarch, run: - - ```bash - rocminfo | grep gfx - ``` - -6. In the event you want to compile only for your uarch, use: - - ```bash - export PYTORCH_ROCM_ARCH= - ``` - - `` is the architecture reported by the `rocminfo` command. - -7. 
Build PyTorch using the following command: - - ```bash - ./.jenkins/pytorch/build.sh - ``` - - This will first convert PyTorch sources for HIP compatibility and build the - PyTorch framework. - -8. Alternatively, build PyTorch by issuing the following commands: - - ```bash - python3 tools/amd_build/build_amd.py - USE_ROCM=1 MAX_JOBS=4 python3 setup.py install ––user - ``` - -#### Option 4: Install Using PyTorch Upstream Docker File - -Instead of using a prebuilt base Docker image, you can build a custom base -Docker image using scripts from the PyTorch repository. This will utilize a -standard Docker image from operating system maintainers and install all the -dependencies required to build PyTorch, including - -- ROCm -- Torchvision -- Conda packages -- Compiler toolchain - -Follow these steps: - -1. Clone the PyTorch repository on the host. - - ```bash - cd ~ - git clone https://github.com/pytorch/pytorch.git - cd pytorch - git submodule update --init –recursive - ``` - -2. Build the PyTorch Docker image. - - ```bash - cd.circleci/docker - ./build.sh pytorch-linux-bionic-rocm-py3.7 - # eg. ./build.sh pytorch-linux-bionic-rocm3.10-py3.7 - ``` - - This should be complete with a message "Successfully build ``." - -3. Start a Docker container using the image: - - ```bash - docker run -it --cap-add=SYS_PTRACE --security-opt - seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add - video --ipc=host --shm-size 8G - ``` - - You can also pass -v argument to mount any data directories from the host - onto the container. - -4. Clone the PyTorch repository. - - ```bash - cd ~ - git clone https://github.com/pytorch/pytorch.git - cd pytorch - git submodule update --init --recursive - ``` - -5. Build PyTorch for ROCm. - - :::{note} - By default in the rocm/pytorch:latest-base, PyTorch builds for these - architectures simultaneously: - - gfx900 - - gfx906 - - gfx908 - - gfx90a - - gfx1030 - ::: - -6. To determine your AMD uarch, run: - - ```bash - rocminfo | grep gfx - ``` - -7. If you want to compile only for your uarch: - - ```bash - export PYTORCH_ROCM_ARCH= - ``` - - `` is the architecture reported by the rocminfo command. - -8. Build PyTorch using: - - ```bash - ./.jenkins/pytorch/build.sh - ``` - -This will first convert PyTorch sources to be HIP compatible and then build the -PyTorch framework. - -Alternatively, build PyTorch by issuing the following commands: - -```bash -python3 tools/amd_build/build_amd.py -USE_ROCM=1 MAX_JOBS=4 python3 setup.py install --user -``` - -### Test the PyTorch Installation - -You can use PyTorch unit tests to validate a PyTorch installation. If using a -prebuilt PyTorch Docker image from AMD ROCm DockerHub or installing an official -wheels package, these tests are already run on those configurations. -Alternatively, you can manually run the unit tests to validate the PyTorch -installation fully. - -Follow these steps: - -1. Test if PyTorch is installed and accessible by importing the torch package in - Python. - - :::{note} - Do not run in the PyTorch git folder. - ::: - - ```bash - python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure' - ``` - -2. Test if the GPU is accessible from PyTorch. In the PyTorch framework, - torch.cuda is a generic mechanism to access the GPU; it will access an AMD - GPU only if available. - - ```bash - python3 -c 'import torch; print(torch.cuda.is_available())' - ``` - -3. Run the unit tests to validate the PyTorch installation fully. 
Run the - following command from the PyTorch home directory: - - ```bash - BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT:-rocm} ./.jenkins/pytorch/test.sh - ``` - - This ensures that even for wheel installs in a non-controlled environment, - the required environment variable will be set to skip certain unit tests for - ROCm. - - :::{note} - Make sure the PyTorch source code is corresponding to the PyTorch wheel or - installation in the Docker image. Incompatible PyTorch source code might give - errors when running the unit tests. - ::: - - This will first install some dependencies, such as a supported torchvision - version for PyTorch. Torchvision is used in some PyTorch tests for loading - models. Next, this will run all the unit tests. - - :::{note} - Some tests may be skipped, as appropriate, based on your system - configuration. All features of PyTorch are not supported on ROCm, and the - tests that evaluate these features are skipped. In addition, depending on the - host memory, or the number of available GPUs, other tests may be skipped. No - test should fail if the compilation and installation are correct. - ::: - -4. Run individual unit tests with the following command: - - ```bash - PYTORCH_TEST_WITH_ROCM=1 python3 test/test_nn.py --verbose - ``` - - test_nn.py can be replaced with any other test set. - -### Run a Basic PyTorch Example - -The PyTorch examples repository provides basic examples that exercise the -functionality of the framework. MNIST (Modified National Institute of Standards -and Technology) database is a collection of handwritten digits that may be used -to train a Convolutional Neural Network for handwriting recognition. -Alternatively, ImageNet is a database of images used to train a network for -visual object recognition. - -Follow these steps: - -1. Clone the PyTorch examples repository. - - ```bash - git clone https://github.com/pytorch/examples.git - ``` - -2. Run the MNIST example. - - ```bash - cd examples/mnist - ``` - -3. Follow the instructions in the README file in this folder. In this case: - - ```bash - pip3 install -r requirements.txt - python3 main.py - ``` - -4. Run the ImageNet example. - - ```bash - cd examples/imagenet - ``` - -5. Follow the instructions in the README file in this folder. In this case: - - ```bash - pip3 install -r requirements.txt - python3 main.py - ``` - ## MAGMA for ROCm Matrix Algebra on GPU and Multicore Architectures, abbreviated as MAGMA, is a @@ -472,180 +62,3 @@ To build MAGMA from the source, follow these steps: popd mv magma /opt/rocm ``` - -## TensorFlow - -TensorFlow is an open source library for solving Machine Learning, -Deep Learning, and Artificial Intelligence problems. It can be used to solve -many problems across different sectors and industries but primarily focuses on -training and inference in neural networks. It is one of the most popular and -in-demand frameworks and is very active in open source contribution and -development. - -### Installing TensorFlow - -The following sections contain options for installing TensorFlow. - -#### Option 1: Install TensorFlow Using Docker Image - -To install ROCm on bare metal, follow the section -[ROCm Installation](https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4-/page/Prerequisites.html#d2999e60). -The recommended option to get a TensorFlow environment is through Docker. - -Using Docker provides portability and access to a prebuilt Docker container that -has been rigorously tested within AMD. 
This might also save compilation time and -should perform as tested without facing potential installation issues. -Follow these steps: - -1. Pull the latest public TensorFlow Docker image. - - ```bash - docker pull rocm/tensorflow:latest - ``` - -2. Once you have pulled the image, run it by using the command below: - - ```bash - docker run -it --network=host --device=/dev/kfd --device=/dev/dri - --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE - --security-opt seccomp=unconfined rocm/tensorflow:latest - ``` - -#### Option 2: Install TensorFlow Using Wheels Package - -To install TensorFlow using the wheels package, follow these steps: - -1. Check the Python version. - - ```bash - python3 –version - ``` - - | If: | Then: | - |:-----------------------------------:|:--------------------------------:| - | The Python version is less than 3.7 | Upgrade Python. | - | The Python version is more than 3.7 | Skip this step and go to Step 3. | - - :::{note} - The supported Python versions are: - - - 3.7 - - 3.8 - - 3.9 - - 3.10 - ::: - - ```bash - sudo apt-get install python3.7 # or python3.8 or python 3.9 or python 3.10 - ``` - -2. Set up multiple Python versions using update-alternatives. - - ```bash - update-alternatives --query python3 - sudo update-alternatives --install - /usr/bin/python3 python3 /usr/bin/python[version] [priority] - ``` - - :::{note} - Follow the instruction in Step 2 for incompatible Python versions. - ::: - - ```bash - sudo update-alternatives --config python3 - ``` - -3. Follow the screen prompts, and select the Python version installed in Step 2. - -4. Install or upgrade PIP. - - ```bash - sudo apt install python3-pip - ``` - - To install PIP, use the following: - - ```bash - /usr/bin/python[version] -m pip install --upgrade pip - ``` - - Upgrade PIP for Python version installed in step 2: - - ```bash - sudo pip3 install --upgrade pip - ``` - -5. Install TensorFlow for the Python version as indicated in Step 2. - - ```bash - /usr/bin/python[version] -m pip install --user tensorflow-rocm==[wheel-version] –upgrade - ``` - - For a valid wheel version for a ROCm release, refer to the instruction below: - - ```bash - sudo apt install rocm-libs rccl - ``` - -6. Update protobuf to 3.19 or lower. - - ```bash - /usr/bin/python3.7 -m pip install protobuf=3.19.0 - sudo pip3 install tensorflow - ``` - -7. Set the environment variable PYTHONPATH. - - ```bash - export PYTHONPATH="./.local/lib/python[version]/site-packages:$PYTHONPATH" #Use same python version as in step 2 - ``` - -8. Install libraries. - - ```bash - sudo apt install rocm-libs rccl - ``` - -9. Test installation. - - ```bash - python3 -c 'import tensorflow' 2> /dev/null && echo 'Success' || echo 'Failure' - ``` - - :::{note} - For details on tensorflow-rocm wheels and ROCm version compatibility, see: - [https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-rocm-release.md](https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-rocm-release.md) - ::: - -### Test the TensorFlow Installation - -To test the installation of TensorFlow, run the container image as specified in -the previous section Installing TensorFlow. Ensure you have access to the Python -shell in the Docker container. 
- -```bash -python3 -c 'import tensorflow' 2> /dev/null && echo ‘Success’ || echo ‘Failure’ -``` - -### Run a Basic TensorFlow Example - -The TensorFlow examples repository provides basic examples that exercise the -framework's functionality. The MNIST database is a collection of handwritten -digits that may be used to train a Convolutional Neural Network for handwriting -recognition. - -Follow these steps: - -1. Clone the TensorFlow example repository. - - ```bash - cd ~ - git clone https://github.com/tensorflow/models.git - ``` - -2. Install the dependencies of the code, and run the code. - - ```bash - #pip3 install requirement.txt - #python mnist_tf.py - ``` diff --git a/docs/how_to/pytorch_install/pytorch_install.md b/docs/how_to/pytorch_install/pytorch_install.md index a1e6e3dfb..5f615b74f 100644 --- a/docs/how_to/pytorch_install/pytorch_install.md +++ b/docs/how_to/pytorch_install/pytorch_install.md @@ -1,6 +1,402 @@ # PyTorch Installation for ROCm -Pull content from - +## PyTorch -TEST +PyTorch is an open source Machine Learning Python library, primarily +differentiated by Tensor computing with GPU acceleration and a type-based +automatic differentiation. Other advanced features include: + +- Support for distributed training +- Native ONNX support +- C++ frontend +- The ability to deploy at scale using TorchServe +- A production-ready deployment mechanism through TorchScript + +### Installing PyTorch + +To install ROCm on bare metal, refer to the section +[ROCm Installation](https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4-/page/Prerequisites.html#d2999e60). +The recommended option to get a PyTorch environment is through Docker. However, +installing the PyTorch wheels package on bare metal is also supported. + +#### Option 1 (Recommended): Use Docker Image with PyTorch Pre-installed + +Using Docker gives you portability and access to a prebuilt Docker container +that has been rigorously tested within AMD. This might also save on the +compilation time and should perform as it did when tested without facing +potential installation issues. + +Follow these steps: + +1. Pull the latest public PyTorch Docker image. + + ```bash + docker pull rocm/pytorch:latest + ``` + + Optionally, you may download a specific and supported configuration with + different user-space ROCm versions, PyTorch versions, and supported operating + systems. To download the PyTorch Docker image, refer to + [https://hub.docker.com/r/rocm/pytorch](https://hub.docker.com/r/rocm/pytorch). + +2. Start a Docker container using the downloaded image. + + ```bash + docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest + ``` + + :::{note} + This will automatically download the image if it does not exist on the host. + You can also pass the -v argument to mount any data directories from the host + onto the container. + ::: + +#### Option 2: Install PyTorch Using Wheels Package + +PyTorch supports the ROCm platform by providing tested wheels packages. To +access this feature, refer to +[https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/) +and choose the "ROCm" compute platform. {numref}`Installation-Matrix-from-Pytorch` is a matrix from pytroch.org that illustrates the installation compatibility between ROCm and the PyTorch build. 
+ +```{figure} ../../data/how_to/magma_install/image.006.png +:name: Installation-Matrix-from-Pytorch +--- +align: center +--- +Installation Matrix from Pytorch.org +``` + +To install PyTorch using the wheels package, follow these installation steps: + +1. Choose one of the following options: + a. Obtain a base Docker image with the correct user-space ROCm version + installed from + [https://hub.docker.com/repository/docker/rocm/dev-ubuntu-20.04](https://hub.docker.com/repository/docker/rocm/dev-ubuntu-20.04). + + or + + b. Download a base OS Docker image and install ROCm following the + installation directions in the section + [Installation](https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4-/page/Prerequisites.html#d2999e60). + ROCm 5.2 is installed in this example, as supported by the installation + matrix from pytorch.org. + + or + + c. Install on bare metal. Skip to Step 3. + + ```bash + docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest + ``` + +2. Start the Docker container, if not installing on bare metal. + + ```dockerfile + docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest + ``` + +3. Install any dependencies needed for installing the wheels package. + + ```bash + sudo apt update + sudo apt install libjpeg-dev python3-dev + pip3 install wheel setuptools + ``` + +4. Install torch, torchvision, and torchaudio as specified by the installation + matrix. + + :::{note} + ROCm 5.2 PyTorch wheel in the command below is shown for reference. + ::: + + ```bash + pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/rocm5.2/ + ``` + +#### Option 3: Install PyTorch Using PyTorch ROCm Base Docker Image + +A prebuilt base Docker image is used to build PyTorch in this option. The base +Docker has all dependencies installed, including: + +- ROCm +- Torchvision +- Conda packages +- Compiler toolchain + +Additionally, a particular environment flag (`BUILD_ENVIRONMENT`) is set, and +the build scripts utilize that to determine the build environment configuration. + +Follow these steps: + +1. Obtain the Docker image. + + ```bash + docker pull rocm/pytorch:latest-base + ``` + + The above will download the base container, which does not contain PyTorch. + +2. Start a Docker container using the image. + + ```bash + docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest-base + ``` + + You can also pass the -v argument to mount any data directories from the host + onto the container. + +3. Clone the PyTorch repository. + + ```bash + cd ~ + git clone https://github.com/pytorch/pytorch.git + cd pytorch + git submodule update --init –recursive + ``` + +4. Build PyTorch for ROCm. + + :::{note} + By default in the rocm/pytorch:latest-base, PyTorch builds for these + architectures simultaneously: + - gfx900 + - gfx906 + - gfx908 + - gfx90a + - gfx1030 + ::: + +5. To determine your AMD uarch, run: + + ```bash + rocminfo | grep gfx + ``` + +6. In the event you want to compile only for your uarch, use: + + ```bash + export PYTORCH_ROCM_ARCH= + ``` + + `` is the architecture reported by the `rocminfo` command. + +7. Build PyTorch using the following command: + + ```bash + ./.jenkins/pytorch/build.sh + ``` + + This will first convert PyTorch sources for HIP compatibility and build the + PyTorch framework. + +8. 
Alternatively, build PyTorch by issuing the following commands: + + ```bash + python3 tools/amd_build/build_amd.py + USE_ROCM=1 MAX_JOBS=4 python3 setup.py install ––user + ``` + +#### Option 4: Install Using PyTorch Upstream Docker File + +Instead of using a prebuilt base Docker image, you can build a custom base +Docker image using scripts from the PyTorch repository. This will utilize a +standard Docker image from operating system maintainers and install all the +dependencies required to build PyTorch, including + +- ROCm +- Torchvision +- Conda packages +- Compiler toolchain + +Follow these steps: + +1. Clone the PyTorch repository on the host. + + ```bash + cd ~ + git clone https://github.com/pytorch/pytorch.git + cd pytorch + git submodule update --init –recursive + ``` + +2. Build the PyTorch Docker image. + + ```bash + cd.circleci/docker + ./build.sh pytorch-linux-bionic-rocm-py3.7 + # eg. ./build.sh pytorch-linux-bionic-rocm3.10-py3.7 + ``` + + This should be complete with a message "Successfully build ``." + +3. Start a Docker container using the image: + + ```bash + docker run -it --cap-add=SYS_PTRACE --security-opt + seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add + video --ipc=host --shm-size 8G + ``` + + You can also pass -v argument to mount any data directories from the host + onto the container. + +4. Clone the PyTorch repository. + + ```bash + cd ~ + git clone https://github.com/pytorch/pytorch.git + cd pytorch + git submodule update --init --recursive + ``` + +5. Build PyTorch for ROCm. + + :::{note} + By default in the rocm/pytorch:latest-base, PyTorch builds for these + architectures simultaneously: + - gfx900 + - gfx906 + - gfx908 + - gfx90a + - gfx1030 + ::: + +6. To determine your AMD uarch, run: + + ```bash + rocminfo | grep gfx + ``` + +7. If you want to compile only for your uarch: + + ```bash + export PYTORCH_ROCM_ARCH= + ``` + + `` is the architecture reported by the rocminfo command. + +8. Build PyTorch using: + + ```bash + ./.jenkins/pytorch/build.sh + ``` + +This will first convert PyTorch sources to be HIP compatible and then build the +PyTorch framework. + +Alternatively, build PyTorch by issuing the following commands: + +```bash +python3 tools/amd_build/build_amd.py +USE_ROCM=1 MAX_JOBS=4 python3 setup.py install --user +``` + +### Test the PyTorch Installation + +You can use PyTorch unit tests to validate a PyTorch installation. If using a +prebuilt PyTorch Docker image from AMD ROCm DockerHub or installing an official +wheels package, these tests are already run on those configurations. +Alternatively, you can manually run the unit tests to validate the PyTorch +installation fully. + +Follow these steps: + +1. Test if PyTorch is installed and accessible by importing the torch package in + Python. + + :::{note} + Do not run in the PyTorch git folder. + ::: + + ```bash + python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure' + ``` + +2. Test if the GPU is accessible from PyTorch. In the PyTorch framework, + torch.cuda is a generic mechanism to access the GPU; it will access an AMD + GPU only if available. + + ```bash + python3 -c 'import torch; print(torch.cuda.is_available())' + ``` + +3. Run the unit tests to validate the PyTorch installation fully. 
Run the + following command from the PyTorch home directory: + + ```bash + BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT:-rocm} ./.jenkins/pytorch/test.sh + ``` + + This ensures that even for wheel installs in a non-controlled environment, + the required environment variable will be set to skip certain unit tests for + ROCm. + + :::{note} + Make sure the PyTorch source code is corresponding to the PyTorch wheel or + installation in the Docker image. Incompatible PyTorch source code might give + errors when running the unit tests. + ::: + + This will first install some dependencies, such as a supported torchvision + version for PyTorch. Torchvision is used in some PyTorch tests for loading + models. Next, this will run all the unit tests. + + :::{note} + Some tests may be skipped, as appropriate, based on your system + configuration. All features of PyTorch are not supported on ROCm, and the + tests that evaluate these features are skipped. In addition, depending on the + host memory, or the number of available GPUs, other tests may be skipped. No + test should fail if the compilation and installation are correct. + ::: + +4. Run individual unit tests with the following command: + + ```bash + PYTORCH_TEST_WITH_ROCM=1 python3 test/test_nn.py --verbose + ``` + + test_nn.py can be replaced with any other test set. + +### Run a Basic PyTorch Example + +The PyTorch examples repository provides basic examples that exercise the +functionality of the framework. MNIST (Modified National Institute of Standards +and Technology) database is a collection of handwritten digits that may be used +to train a Convolutional Neural Network for handwriting recognition. +Alternatively, ImageNet is a database of images used to train a network for +visual object recognition. + +Follow these steps: + +1. Clone the PyTorch examples repository. + + ```bash + git clone https://github.com/pytorch/examples.git + ``` + +2. Run the MNIST example. + + ```bash + cd examples/mnist + ``` + +3. Follow the instructions in the README file in this folder. In this case: + + ```bash + pip3 install -r requirements.txt + python3 main.py + ``` + +4. Run the ImageNet example. + + ```bash + cd examples/imagenet + ``` + +5. Follow the instructions in the README file in this folder. In this case: + + ```bash + pip3 install -r requirements.txt + python3 main.py + ``` diff --git a/docs/how_to/tensorflow_install/tensorflow_install.md b/docs/how_to/tensorflow_install/tensorflow_install.md index 13a1e01d7..ebe84d787 100644 --- a/docs/how_to/tensorflow_install/tensorflow_install.md +++ b/docs/how_to/tensorflow_install/tensorflow_install.md @@ -1,4 +1,178 @@ # TensorFlow Installation for ROCm -Pull content from - +## TensorFlow + +TensorFlow is an open source library for solving Machine Learning, +Deep Learning, and Artificial Intelligence problems. It can be used to solve +many problems across different sectors and industries but primarily focuses on +training and inference in neural networks. It is one of the most popular and +in-demand frameworks and is very active in open source contribution and +development. + +### Installing TensorFlow + +The following sections contain options for installing TensorFlow. + +#### Option 1: Install TensorFlow Using Docker Image + +To install ROCm on bare metal, follow the section +[ROCm Installation](https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4-/page/Prerequisites.html#d2999e60). +The recommended option to get a TensorFlow environment is through Docker. 
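
Before pulling any image, it can help to confirm that the ROCm device nodes and the group referenced by the `docker run` commands below are actually present on the host. The quick check below is only a sketch; group names can vary by distribution.

```bash
# The --device flags used later assume these nodes exist on the host.
ls -l /dev/kfd /dev/dri

# The containers are started with --group-add video; verify the invoking user
# belongs to the video (and, on some distributions, render) group.
groups
```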
+ +Using Docker provides portability and access to a prebuilt Docker container that +has been rigorously tested within AMD. This might also save compilation time and +should perform as tested without facing potential installation issues. +Follow these steps: + +1. Pull the latest public TensorFlow Docker image. + + ```bash + docker pull rocm/tensorflow:latest + ``` + +2. Once you have pulled the image, run it by using the command below: + + ```bash + docker run -it --network=host --device=/dev/kfd --device=/dev/dri + --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE + --security-opt seccomp=unconfined rocm/tensorflow:latest + ``` + +#### Option 2: Install TensorFlow Using Wheels Package + +To install TensorFlow using the wheels package, follow these steps: + +1. Check the Python version. + + ```bash + python3 –version + ``` + + | If: | Then: | + |:-----------------------------------:|:--------------------------------:| + | The Python version is less than 3.7 | Upgrade Python. | + | The Python version is more than 3.7 | Skip this step and go to Step 3. | + + :::{note} + The supported Python versions are: + + - 3.7 + - 3.8 + - 3.9 + - 3.10 + ::: + + ```bash + sudo apt-get install python3.7 # or python3.8 or python 3.9 or python 3.10 + ``` + +2. Set up multiple Python versions using update-alternatives. + + ```bash + update-alternatives --query python3 + sudo update-alternatives --install + /usr/bin/python3 python3 /usr/bin/python[version] [priority] + ``` + + :::{note} + Follow the instruction in Step 2 for incompatible Python versions. + ::: + + ```bash + sudo update-alternatives --config python3 + ``` + +3. Follow the screen prompts, and select the Python version installed in Step 2. + +4. Install or upgrade PIP. + + ```bash + sudo apt install python3-pip + ``` + + To install PIP, use the following: + + ```bash + /usr/bin/python[version] -m pip install --upgrade pip + ``` + + Upgrade PIP for Python version installed in step 2: + + ```bash + sudo pip3 install --upgrade pip + ``` + +5. Install TensorFlow for the Python version as indicated in Step 2. + + ```bash + /usr/bin/python[version] -m pip install --user tensorflow-rocm==[wheel-version] –upgrade + ``` + + For a valid wheel version for a ROCm release, refer to the instruction below: + + ```bash + sudo apt install rocm-libs rccl + ``` + +6. Update protobuf to 3.19 or lower. + + ```bash + /usr/bin/python3.7 -m pip install protobuf=3.19.0 + sudo pip3 install tensorflow + ``` + +7. Set the environment variable PYTHONPATH. + + ```bash + export PYTHONPATH="./.local/lib/python[version]/site-packages:$PYTHONPATH" #Use same python version as in step 2 + ``` + +8. Install libraries. + + ```bash + sudo apt install rocm-libs rccl + ``` + +9. Test installation. + + ```bash + python3 -c 'import tensorflow' 2> /dev/null && echo 'Success' || echo 'Failure' + ``` + + :::{note} + For details on tensorflow-rocm wheels and ROCm version compatibility, see: + [https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-rocm-release.md](https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-rocm-release.md) + ::: + +### Test the TensorFlow Installation + +To test the installation of TensorFlow, run the container image as specified in +the previous section Installing TensorFlow. Ensure you have access to the Python +shell in the Docker container. 

```bash
python3 -c 'import tensorflow' 2> /dev/null && echo 'Success' || echo 'Failure'
```

### Run a Basic TensorFlow Example

The TensorFlow examples repository provides basic examples that exercise the
framework's functionality. The MNIST database is a collection of handwritten
digits that may be used to train a Convolutional Neural Network for handwriting
recognition.

Follow these steps:

1. Clone the TensorFlow example repository.

    ```bash
    cd ~
    git clone https://github.com/tensorflow/models.git
    ```

2. Install the dependencies of the code, and run the code.

    ```bash
    #pip3 install -r requirements.txt
    #python mnist_tf.py
    ```
diff --git a/docs/reference/openmp/openmp.md b/docs/reference/openmp/openmp.md
index 3c9c7b943..cd4c077b4 100644
--- a/docs/reference/openmp/openmp.md
+++ b/docs/reference/openmp/openmp.md
@@ -1,4 +1,330 @@
# OpenMP Support in ROCm

-Pull from
-

## Introduction to OpenMP Support Guide

The ROCm™ installation includes an LLVM-based implementation that fully supports the OpenMP 4.5 standard and a subset of the OpenMP 5.0, 5.1, and 5.2 standards. Fortran and C/C++ compilers and the corresponding runtime libraries are included. Along with the host APIs, the OpenMP compilers support offloading code and data onto GPU devices. This document briefly describes the installation location of the OpenMP toolchain, example usage of device offloading, and the usage of rocprof with OpenMP applications. The GPUs supported are the same as those supported by this ROCm release; see the list of supported GPUs in the installation guide at [https://docs.amd.com/](https://docs.amd.com/).

### Installation

The OpenMP toolchain is automatically installed as part of the standard ROCm installation and is available under /opt/rocm-{version}/llvm. The sub-directories are:

- bin: Compilers (flang and clang) and other binaries.

- examples: The usage section below shows how to compile and run these programs.

- include: Header files.

- lib: Libraries, including those required for target offload.

- lib-debug: Debug versions of the above libraries.

## OpenMP: Usage

The example programs can be compiled and run by pointing the environment variable AOMP to the OpenMP install directory.

**Example:**

```bash
% export AOMP=/opt/rocm-{version}/llvm
% cd $AOMP/examples/openmp/veccopy
% make run
```

The above invocation of Make compiles and runs the program. Note the options that are required for target offload from an OpenMP program:

```bash
-target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<gpu-arch>
```

Obtain the value of gpu-arch by running the following command:

```bash
% /opt/rocm-{version}/bin/rocminfo | grep gfx
```
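
To see these flags in action outside the packaged examples, the short program and compile line below form a minimal sketch; the file name `veccopy_min.cpp`, the output name, and the `gfx90a` architecture are assumptions to be replaced with the value reported by rocminfo.

```cpp
// veccopy_min.cpp - minimal offload example (illustrative; not part of the ROCm examples).
#include <cstdio>

int main() {
  const int n = 1024;
  int a[1024], b[1024];
  for (int i = 0; i < n; i++) { a[i] = i; b[i] = 0; }

  // Offload the copy loop; the map clauses stage the arrays to and from the GPU.
  #pragma omp target teams distribute parallel for map(to: a[0:n]) map(from: b[0:n])
  for (int i = 0; i < n; i++)
    b[i] = a[i];

  int errors = 0;
  for (int i = 0; i < n; i++)
    if (b[i] != a[i]) errors++;

  if (errors)
    printf("FAILED: %d mismatches\n", errors);
  else
    printf("PASSED\n");
  return errors ? 1 : 0;
}
```

```bash
$AOMP/bin/clang++ -O2 -target x86_64-pc-linux-gnu -fopenmp \
    -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
    -march=gfx90a veccopy_min.cpp -o veccopy_min
./veccopy_min
```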

[//]: # (dated link below, needs updating)

See the complete list of compiler command-line references [here](https://github.com/RadeonOpenCompute/llvm-project/blob/amd-stg-open/clang/docs/ClangCommandLineReference.rst).

### Using rocprof with OpenMP

The following steps describe a typical workflow for using rocprof with OpenMP code compiled with AOMP:

1. Run rocprof with the program command line:

   ```bash
   % rocprof <application>
   ```

   This produces a results.csv file in the user’s current directory that shows basic stats such as kernel names, grid size, number of registers used, etc. You can specify a preferred output file name using the `-o` option.

2. Add the `--stats` option for a more detailed result:

   ```bash
   % rocprof --stats <application>
   ```

   The stats option produces timestamps for the kernels. Look into the output CSV file for the field DurationNs, which is useful in understanding the critical kernels in the code.

   Apart from `--stats`, the option `--timestamp on` also produces timestamps for the kernels.

3. After learning about the required kernels, you can take a detailed look at each one of them. rocprof supports hardware counters: a set of basic counters and a set of derived ones. See the complete list of counters using the options `--list-basic` and `--list-derived`. rocprof accepts either a text or an XML file as an input.

For more details on rocprof, refer to the ROCm Profiling Tools document on [https://docs.amd.com](https://docs.amd.com).

### Using Tracing Options

**Prerequisite:** When using the --sys-trace option, compile the OpenMP program with:

```bash
-Wl,-rpath,/opt/rocm-{version}/lib -lamdhip64
```

The following tracing options are widely used to generate useful information:

- **--hsa-trace**: This option is used to get a JSON output file with the HSA API execution traces and a flat profile in a CSV file.

- **--sys-trace**: This allows programmers to trace both HIP and HSA calls. Since this option results in loading `libamdhip64.so`, follow the prerequisite mentioned above.

A CSV and a JSON file are produced by the above trace options. The CSV file presents the data in a tabular format, and the JSON file can be visualized using Google Chrome at chrome://tracing/ or [Perfetto](https://perfetto.dev/). Navigate to Chrome or Perfetto and load the JSON file to see the timeline of the HSA calls.

For more details on tracing, refer to the ROCm Profiling Tools document on [https://docs.amd.com](https://docs.amd.com).

### Environment Variables

:::{table}
:widths: auto
| Environment Variable | Description |
| ----------- | ----------- |
| OMP_NUM_TEAMS | The implementation chooses the number of teams for kernel launch. The user can change this number for performance tuning using this environment variable, subject to implementation limits. |
| OMPX_DISABLE_MAPS | Under USM mode, the implementation automatically checks the correctness of the map clauses without performing any copying. The user can disable this check by setting this environment variable to 1. |
| LIBOMPTARGET_KERNEL_TRACE | This environment variable is used to print useful statistics for device operations. Setting it to 1 and running the program emits the name of every kernel launched, the number of teams and threads used, and the corresponding register usage. Setting it to 2 additionally emits timing information for kernel launches and data transfer operations between the host and the device. |
| LIBOMPTARGET_INFO | This environment variable is used to print informational messages from the device runtime as the program executes. Users can request fine-grain information by setting it to the value of 1 or higher and can set the value to -1 for complete information. |
| LIBOMPTARGET_DEBUG | If a debug version of the device library is present, setting this environment variable to 1 and using that library emits further detailed debugging information about data transfer operations and kernel launch. |
| GPU_MAX_HW_QUEUES | This environment variable is used to set the number of HSA queues in the OpenMP runtime. |
:::
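
As a quick illustration of how these variables are typically combined, the commands below assume the veccopy example from the usage section above has already been built; the binary name and the queue count are assumptions.

```bash
# Print each kernel launch with its team/thread counts and register usage (value 1),
# and emit informational messages from the device runtime while the program runs.
LIBOMPTARGET_KERNEL_TRACE=1 LIBOMPTARGET_INFO=1 ./veccopy

# Re-run with the OpenMP runtime limited to four HSA queues for comparison.
GPU_MAX_HW_QUEUES=4 ./veccopy
```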

## OpenMP: Features

The OpenMP programming model is greatly enhanced by the following features implemented in recent releases.

### Asynchronous Behavior in OpenMP Target Regions

- Multithreaded offloading on the same device: The libomptarget plugin for GPU offloading allows the creation of separate, configurable HSA queues per chiplet, which enables two or more threads to offload concurrently to the same device.

- Parallel memory copy invocations: Implicit asynchronous execution of a single target region enables parallel memory copy invocations.

### Unified Shared Memory

Unified Shared Memory (USM) provides a pointer-based approach to memory management. To implement USM, fulfill the following system requirements along with Xnack capability.

#### Prerequisites

- Linux kernel versions above 5.14

- Latest KFD driver packaged in the ROCm stack

- Xnack, as USM support can only be tested with applications compiled with Xnack capability

#### Xnack Capability

When enabled, Xnack capability allows programmers to handle page faults gracefully at runtime. When executing binaries compiled with Xnack replay enabled (`xnack+`), any page fault at runtime leads to a repeated attempt to access the memory:

```bash
--offload-arch=gfx908:xnack+
```

When choosing to disable Xnack replay (`xnack-`), the programmer must write offloading kernels carefully to avoid any page faults on the GPU at runtime:

```bash
--offload-arch=gfx908:xnack-
```

#### Unified Shared Memory Pragma

This OpenMP pragma is available on MI200 through xnack+ support.

```cpp
#pragma omp requires unified_shared_memory
```

As stated in the OpenMP specifications, this pragma makes the map clause on target constructs optional. By default, on MI200, all memory allocated on the host is fine grain. Using the map clause on a target construct is still allowed, which transforms the access semantics of the associated memory to coarse grain.

A simple program demonstrating the use of this feature is:

```bash
$ cat parallel_for.cpp
#include <cstdio>
#include <cstdlib>

#define N 64
#pragma omp requires unified_shared_memory
int main() {
  int n = N;
  int *a = new int[n];
  int *b = new int[n];

  for(int i = 0; i < n; i++)
    b[i] = i;

  #pragma omp target parallel for map(to:b[:n])
  for(int i = 0; i < n; i++)
    a[i] = b[i];

  for(int i = 0; i < n; i++)
    if(a[i] != i)
      printf("error at %d: expected %d, got %d\n", i, i, a[i]);

  return 0;
}
$ clang++ -O2 -target x86_64-pc-linux-gnu -fopenmp --offload-arch=gfx90a:xnack+ parallel_for.cpp
$ HSA_XNACK=1 ./a.out
```

In the above code example, pointer “a” is not mapped in the target region, while pointer “b” is. Both are valid pointers on the GPU device and are passed by value to the kernel implementing the target region. This means the pointer values on the host and the device are the same.

The difference between the memory pages pointed to by these two variables is that the pages pointed to by “a” are in fine-grain memory, while the pages pointed to by “b” are in coarse-grain memory during and after the execution of the target region. This is accomplished in the OpenMP runtime library with calls to the ROCR runtime to set the pages pointed to by “b” as coarse grain.

### OMPT Target Support

The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as described in the OpenMP specification document. These APIs allow first-party tools to examine the profile and kernel traces that execute on a device. A tool can register callbacks for data transfer and kernel dispatch entry points or use APIs to start and stop tracing for device-related activities such as data transfer and kernel dispatch timings and associated metadata. If device tracing is enabled, trace records for device activities are collected during program execution and returned to the tool using the APIs described in the specification.

The following example demonstrates how a tool uses the supported OMPT target APIs. The README in /opt/rocm/llvm/examples/tools/ompt outlines the steps to be followed, and the provided example can be run as shown below:

```bash
% cd /opt/rocm/llvm/examples/tools/ompt/veccopy-ompt-target-tracing
% make run
```

The file veccopy-ompt-target-tracing.c simulates how a tool initiates device activity tracing. The file callbacks.h shows the callbacks registered and implemented by the tool.

### Floating Point Atomic Operations

The MI200-series GPUs support the generation of hardware floating-point atomics using the OpenMP atomic pragma. The support includes single- and double-precision floating-point atomic operations. The programmer must ensure that the memory subjected to the atomic operation is in coarse-grain memory by mapping it explicitly with the help of map clauses when not implicitly mapped by the compiler as per the [OpenMP specifications](https://www.openmp.org/specifications/). This makes these hardware floating-point atomic instructions “fast,” as they are faster than using a default compare-and-swap loop scheme, but at the same time “unsafe,” as they are not supported on fine-grain memory. The operation in unified_shared_memory mode also requires programmers to map the memory explicitly when not implicitly mapped by the compiler.

To request fast floating-point atomic instructions at the file level, use the compiler flag -munsafe-fp-atomics or a hint clause on a specific pragma:

```cpp
double a = 0.0;
#pragma omp atomic hint(AMD_fast_fp_atomics)
a = a + 1.0;
```

:::{note}
AMD_unsafe_fp_atomics is an alias for AMD_fast_fp_atomics, and AMD_safe_fp_atomics is implemented with a compare-and-swap loop.
:::

To disable the generation of fast floating-point atomic instructions at the file level, build using the option -msafe-fp-atomics or use a hint clause on a specific pragma:

```cpp
double a = 0.0;
#pragma omp atomic hint(AMD_safe_fp_atomics)
a = a + 1.0;
```

The hint clause value always takes precedence over the compiler flag, which allows programmers to create atomic constructs with a different behavior than the rest of the file.

See the example below, where the user builds the program using -msafe-fp-atomics to select a file-wide “safe atomic” compilation. However, the fast-atomics hint clause on variable “a” takes precedence, so “a” is operated on using a fast/unsafe floating-point atomic, while variable “b”, which has no hint clause, is operated on using safe floating-point atomics as per the compiler flag.

```cpp
double a = 0.0;
#pragma omp atomic hint(AMD_fast_fp_atomics)
a = a + 1.0;

double b = 0.0;
#pragma omp atomic
b = b + 1.0;
```
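
Putting these pieces together, the small program below is a sketch (not taken from the ROCm examples) of a complete atomic accumulation that relies on the hint clause; the file name and the `gfx90a` target are assumptions, and the scalar is mapped explicitly so that the atomic operates on coarse-grain memory, as required above.

```cpp
// atomic_hint.cpp - illustrative sketch of hardware FP atomics via the hint clause.
#include <cstdio>

int main() {
  const int n = 1 << 20;
  double sum = 0.0;

  // Mapping "sum" explicitly keeps it in coarse-grain memory, which the fast
  // (unsafe) floating-point atomics require.
  #pragma omp target teams distribute parallel for map(tofrom: sum)
  for (int i = 0; i < n; i++) {
    #pragma omp atomic hint(AMD_fast_fp_atomics)
    sum = sum + 1.0;
  }

  printf("sum = %.1f, expected %d\n", sum, n);
  return 0;
}
```

```bash
clang++ -O2 -target x86_64-pc-linux-gnu -fopenmp --offload-arch=gfx90a atomic_hint.cpp -o atomic_hint
./atomic_hint
```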
### Address Sanitizer (ASan) Tool

Address Sanitizer is a memory error detector tool utilized by applications to detect various errors, ranging from spatial issues such as out-of-bounds accesses to temporal issues such as use-after-free. The AOMP compiler supports ASan for AMD GPUs with applications written in both HIP and OpenMP.
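
Before walking through the features and examples below, it may help to see the general shape of an instrumented build. The following is a hedged sketch only: the source file name and the `gfx90a` target are assumptions, `-fsanitize=address` is the standard Clang switch, and the exact set of device-sanitizer options (for example, whether `-shared-libsan` is also needed) depends on the ROCm/AOMP release, so consult the AOMP ASan examples for the authoritative flags.

```bash
# Hypothetical invocation modeled on the AOMP ASan examples; adjust flags per release notes.
clang++ -fopenmp --offload-arch=gfx90a:xnack+ -fsanitize=address -g \
    vecadd-HBO.cpp -o vecadd-HBO

# Xnack must be enabled at run time for the instrumented GPU code
# (see the Unified Shared Memory section above).
HSA_XNACK=1 ./vecadd-HBO
```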
+ +**Features Supported on Host Platform (Target x86_64):** + +- Use-after-free + +- Buffer overflows + +- Heap buffer overflow + +- Stack buffer overflow + +- Global buffer overflow + +- Use-after-return + +- Use-after-scope + +- Initialization order bugs + +**Features Supported on AMDGPU Platform (amdgcn-amd-amdhsa):** +- Heap buffer overflow + +- Global buffer overflow + +**Software (Kernel/OS) Requirements:** Unified Shared Memory support with Xnack capability. See the section on [Unified Shared Memory](#unified-shared-memory) for prerequisites and details on Xnack. + +**Example:** + +- Heap buffer overflow + +```bash +void main() { +....... // Some program statements +....... // Some program statements +#pragma omp target map(to : A[0:N], B[0:N]) map(from: C[0:N]) +{ +#pragma omp parallel for + for(int i =0 ; i < N; i++){ + C[i+10] = A[i] + B[i]; + } // end of for loop +} +....... // Some program statements +}// end of main +``` + +See the complete sample code for heap buffer overflow [here](https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/examples/tools/asan/heap_buffer_overflow/openmp/vecadd-HBO.cpp). + +- Global buffer overflow + +```bash +#pragma omp declare target + int A[N],B[N],C[N]; +#pragma omp end declare target +void main(){ +...... // some program statements +...... // some program statements +#pragma omp target data map(to:A[0:N],B[0:N]) map(from: C[0:N]) +{ +#pragma omp target update to(A,B) +#pragma omp target parallel for +for(int i=0; i