Course: CS25: Transformers United
Course website: https://web.stanford.edu/class/cs25/
Video Lecture: https://youtu.be/P127jhj-8-Y
Instructors: Div Garg, Chetanya Rastogi, Advay Pal
Find all my notes for this course in the ML Course Notes repo.
Please note that this is a rough draft of the notes, so you might find mistakes. The figures and equations are directly obtained/adapted from the slides used in the course. All of the credit goes to the instructors. I simply hope that the notes serve as accompanying study material.
The course is about Transformers, which have revolutionized fields like natural language processing (NLP) and computer vision. Transformers are also now making strides in other areas of machine learning, such as reinforcement learning, and in scientific fields like physics and biology.
Before jumping into Transformers and self-attention, we can start by discussing attention and its timeline. Before self-attention, which is one of the key ingredients of Transformers, we had other classical models such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and simple attention mechanisms. Let’s look at the timeline below in more detail.
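To make the idea concrete before walking through the timeline, here is a minimal NumPy sketch of scaled dot-product self-attention, the core operation inside a Transformer. This is an illustrative toy (the function name, toy dimensions, and the use of the raw input as queries, keys, and values are assumptions for brevity), not the course's reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted average of values

# Toy example: a sequence of 3 tokens with model dimension 4 (illustrative values).
np.random.seed(0)
X = np.random.randn(3, 4)
# In real self-attention, Q, K, and V are separate learned linear projections of X;
# here we reuse X directly to keep the sketch short.
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (3, 4)
```

Each output row is a weighted average of all value vectors, with weights given by how strongly that token's query matches every key; this is the mechanism the rest of the course builds on.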