Attention notes
The classical attention mechanism is usually presented like this:
\[ \begin{align} & x \in \mathbb{R}^{d \times n} \quad \text{(one $d$-dimensional token per column)} \\ & W_q \in \mathbb{R}^{d \times d} \\ & W_k \in \mathbb{R}^{d \times d} \\ & W_v \in \mathbb{R}^{d \times d} \\ \\ & \text{softmax}(z) := \frac{e^{\circ z}}{\sum e^{\circ z}} \quad \text{(applied to each row)} \\ & \text{Attention}(x) := \text{softmax}\left( (W_q x)^\top (W_k x) \right) (W_v x)^\top \in \mathbb{R}^{n \times d} \end{align} \]
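One concrete reading of the formula above, as a minimal sketch in plain Python so that every sum stays visible. It rests on assumptions that the notes do not spell out: x stores n token vectors of dimension d as columns, all three weight matrices are d × d, and the softmax is taken over each row of the n × n score matrix. The helper names (`matmul`, `transpose`, `attention`) are made up for illustration, and the max-subtraction inside `softmax` is a standard numerical safeguard that does not change the result.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax(z):
    """Softmax of one vector: elementwise exp divided by the sum of the exps."""
    m = max(z)  # subtracting the max avoids overflow; the ratio is unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v):
    """Assumed shapes: x is d x n (one token per column), each w_* is d x d.

    Returns an n x d matrix with one output vector per row.
    """
    q = matmul(w_q, x)  # d x n, queries as columns
    k = matmul(w_k, x)  # d x n, keys as columns
    v = matmul(w_v, x)  # d x n, values as columns
    scores = matmul(transpose(q), k)  # n x n, scores[i][j] = q_i . k_j
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, transpose(v))  # n x d


# Two 2-dimensional tokens stored as columns; identity weights are placeholders.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
print(attention(x, identity, identity, identity))
```

Each output row is a weighted average of the value vectors, with weights given by how strongly that token's query matches every key. Setting `w_q` to zero makes all scores equal, so every output row collapses to the plain mean of the values, which is a quick sanity check on the shapes.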