Course: CS25: Transformers United

Course website: https://web.stanford.edu/class/cs25/

Video Lecture: https://youtu.be/P127jhj-8-Y

Instructors: Div Garg, Chetanya Rastogi, Advay Pal

Find all my notes for this course in the ML Course Notes repo.

Please note that this is a rough draft of the notes, so you might find mistakes. The figures and equations are directly obtained/adapted from the slides used in the course. All of the credit goes to the instructors. I simply hope that the notes serve as accompanying study material.


What you will learn in the course:

Introduction

The course is about Transformers, which have revolutionized fields like natural language processing (NLP) and computer vision. They are also now making strides in other areas of machine learning, such as reinforcement learning, and in scientific fields like physics and biology.

Before jumping into Transformers and self-attention, we can start by discussing attention and its timeline. Before self-attention, which is one of the key ingredients of Transformers, we had other classical models such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and simple attention mechanisms.
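To make the idea concrete before walking through the timeline, here is a minimal sketch of scaled dot-product attention, the core operation inside self-attention. This is not code from the course; the function name, shapes, and the use of NumPy are illustrative assumptions, and the learned query/key/value projections of a real Transformer are omitted for brevity.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch of scaled dot-product attention.

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v).
    Returns one attention-weighted sum of the values per query.
    """
    d_k = Q.shape[-1]
    # Similarity between each query and every key, scaled by sqrt(d_k)
    # to keep the softmax well-behaved for large dimensions.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the value vectors.
    return weights @ V

# Toy example: 3 tokens with 4-dimensional representations.
# In self-attention, Q, K, and V all come from the same input
# (via learned projections, which we skip here).
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (3, 4)
```

With that intuition in hand, let's look at the timeline below in more detail.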