The NVidia Volta-100 GPU released in Dec 2017 was the first microprocessor with dedicated cores purely for matrix computations called Tensor Cores. The Ampere-100 GPU released May’20 is its successor. A comparison is provided here. Ampere has 84 Streaming Multiprocessors (SMs) with 4 Tensor Cores (TCs) each for a total of 336 TCs. Tensor Cores reduce the cycle time for matrix multiplications, operating on 4×4 matrices of 16bit floating point numbers. These GPUs are aimed at Deep Learning use cases which consist of a pipeline of matrix operations. Here’s an article on choosing the right EC2 instance type for DL – https://towardsdatascience.com/choosing-the-right-gpu-for-deep-learning-on-aws-d69c157d8c86 (tldr – G4 for inferencing, P4 for training).

How did the need for specialized DL chips arise, and why are Tensors important in DL ? In math, we have Scalars and Vectors. Scalars are used for magnitude and Vectors encode magnitude and direction. To transform Vectors, one applies Linear Transformations in the form of Matrices. Matrices for Linear Transformations have EigenVectors and EigenValues which describe the invariants of the transformation. A Tensor in math and physics is a concept that exhibits certain types invariance during transformations. In 3 dimensions, a Stress Tensor has 9 components, which can be representated as a 3×3 matrix; under a change of basis the components of the tensor change however the tensor itself does not.

In Deep Learning applications a Tensor is basically a Matrix. The **Generalized Matrix Multiplication (GEMM) operation, D=AxB+C**, is at the heart of Deep Learning, and Tensor Cores are designed to speed these up.

In Deep Learning, multilinear maps are interleaved with non-linear transforms to model arbitrary transforms of input to output and a specific model is arrived by a process of error reduction on training of actual data. This PyTorch Deep Learning page is an excellent resource to transition from traditional linear algebra to deep learning software – https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html .

Tesla Dojo is planned to build a processor/computer dedicated for Deep Learning to train on vast amounts of video data.

AWS Inferentia is a chip for deep learning inferencing, with its four Neuron Cores .

AWS Trainium is an ML chip for training.

Generally speaking the desire in deep learning community is to have simpler processing units in larger numbers.

Updates: Cerebras announced a chip which can handle neural networks with 120 trillion parameters, with 850,000 AI optimized cores per chip.

SambaNova, Anton, Cerebras and Graphcore presentations are at https://www.anandtech.com/show/16908/hot-chips-2021-live-blog-machine-learning-graphcore-cerebras-sambanova-anton

SambaNova is building 400,000 AI cores per chip.