Transformers and Attention

The transformer is the architecture behind modern large language models, and it has spread into time series, vision, and quantitative finance. Its core idea, attention, is simpler than it sounds: instead of passing information down a chain of hidden states the way a recurrent network does, attention lets every position in a sequence look directly at every other position and decide what is relevant. That one change removes the recurrent bottleneck, connects distant positions in a single step, and lets the computation run in parallel across the sequence. This lesson builds attention from the query, key, value idea, works through a scaled dot-product attention calculation by hand on short vectors, and then assembles the full transformer block. Attention is now a standard interview topic even outside pure deep learning roles, so being able to compute it on a small example is worth the practice.
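To make the idea concrete before any derivation, here is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, in plain Python. The function names `softmax` and `scaled_dot_product_attention` and the toy vectors `Q`, `K`, `V` are illustrative assumptions, not values taken from the lesson. The sketch handles a single head with no masking and no learned projections.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of vectors.

    Single head, no masking, no learned projections (assumed for brevity).
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        # Softmax turns the scores into weights that sum to 1.
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy, made-up inputs: 3 positions, 2-dimensional vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = scaled_dot_product_attention(Q, K, V)
for i, (w, o) in enumerate(zip(weights, outputs)):
    print(i, [round(x, 3) for x in w], [round(x, 3) for x in o])
```

Each row of `weights` shows how strongly one position attends to every other position, and each output is a blend of the value vectors under those weights. Every query is scored against every key directly, with no hidden state carried from one position to the next, which is why the loop over queries could in principle run in parallel.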