by Younginn Park
This repository contains solutions to selected deep neural network architecture optimization tasks, implemented in PyTorch and tuned to run in the Google Colab environment.
The solutions were developed as part of the Deep Neural Networks course at the University of Warsaw (2024/25), where they placed in the top 10% of the class by performance.
The projects implement methods proposed in recent publications that have the potential to improve existing models:
1. Proximal Backpropagation (ProxProp)
A modification of the backpropagation algorithm that takes implicit instead of explicit gradient steps to update the network parameters during training.
The weight update is the implicit (proximal) gradient step

$$W^{k+1} = W^k - \tau \nabla \mathcal{L}(W^{k+1}),$$

which is equivalent to $W^{k+1} = \operatorname{prox}_{\tau \mathcal{L}}(W^k) = \arg\min_{W} \big( \mathcal{L}(W) + \tfrac{1}{2\tau} \lVert W - W^k \rVert^2 \big)$.
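A minimal PyTorch sketch of such an implicit step is shown below. The helper name `proximal_step`, the inner-loop solver, and the hyperparameter values are illustrative assumptions rather than the repository's actual implementation (ProxProp applies proximal steps layer-wise); the sketch only demonstrates minimizing the loss plus a proximal penalty toward the previous iterate.

```python
import torch

def proximal_step(w_old, loss_fn, tau=0.1, inner_steps=20, inner_lr=0.05):
    # One implicit (proximal) update: approximately solves
    #   w_new = argmin_w  loss_fn(w) + ||w - w_old||^2 / (2 * tau)
    # with a few explicit gradient steps on the proximal objective.
    w = w_old.detach().clone().requires_grad_(True)
    for _ in range(inner_steps):
        obj = loss_fn(w) + ((w - w_old.detach()) ** 2).sum() / (2 * tau)
        grad, = torch.autograd.grad(obj, w)
        with torch.no_grad():
            w -= inner_lr * grad
    return w.detach()

# Toy usage: the implicit step moves w toward the loss minimizer while the
# proximal term keeps it close to the previous iterate w_old.
w_old = torch.tensor([2.0, -1.0])
quadratic_loss = lambda v: ((v - torch.tensor([0.0, 1.0])) ** 2).sum()
print(w_old, "->", proximal_step(w_old, quadratic_loss))
```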
2. Adversarial Training (AdvProp and SparseTopK)
An improvement to the robustness of image classification models, achieved by training on batches that contain both clean and adversarially perturbed examples. This forces the model to learn features that are invariant to small, malicious input changes, leading to better generalization and resilience against attacks.
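As a rough illustration of this training scheme, the sketch below mixes clean and FGSM-perturbed examples in each optimization step. The function names and hyperparameters are assumptions for illustration only; in particular, AdvProp's separate BatchNorm statistics for the adversarial branch are omitted here.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=8 / 255):
    # Craft adversarial examples with a single FGSM step: move each pixel by
    # eps in the direction that increases the classification loss.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    return (x_adv + eps * grad.sign()).clamp(0.0, 1.0).detach()

def adversarial_training_step(model, optimizer, x, y, eps=8 / 255):
    # One optimization step on a batch containing both clean and perturbed examples.
    model.eval()                               # keep BatchNorm stats fixed while attacking
    x_adv = fgsm_perturb(model, x, y, eps)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```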
3. Differential Attention (Diff Attention)
An enhancement to transformer attention that lets the model amplify attention to the relevant context while canceling noise, by computing attention scores as the difference between two separate softmax attention maps.
Differential Attention works similarly to standard attention, but with a few differences (a minimal sketch follows the list below):
- Differential Attention partitions the $Q$ and $K$ matrices into chunks $Q_1$, $Q_2$, $K_1$, $K_2$, each of shape `(b, l, n_h, head / 2)`.
- It calculates two attention matrices $A_1$ and $A_2$ using the respective $Q_i, K_i$ matrices. Note that the attention pre-activation is normalized by $\sqrt{head / 2}$, and the causal mask is still added to the pre-activation.
- The final attention matrix is calculated as $A = A_1 - \lambda A_2$.
- $\lambda$ in the equation above is calculated as $\lambda = \exp(\lambda_{K_1} * \lambda_{Q_1}) - \exp(\lambda_{K_2} * \lambda_{Q_2}) + \lambda_{\text{init}}$, where $\lambda_{K_1}, \lambda_{Q_1}, \lambda_{K_2}, \lambda_{Q_2}$ are all parameters of shape `head / 2`, $*$ denotes the scalar (dot) product of vectors, and $\lambda_{\text{init}} = 0.8$.
- Additionally, normalization is applied per head to the output tokens: $O' = \text{RMSNorm}(O') \cdot (1 - \lambda_{\text{init}})$.
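Putting the steps above together, a minimal PyTorch sketch of a single Differential Attention forward pass could look as follows. The function name `diff_attention`, the argument names, and the epsilon in the RMS normalization are illustrative assumptions; a full implementation would use a proper module with learnable RMSNorm gains.

```python
import math
import torch
import torch.nn.functional as F

def diff_attention(q, k, v, lambda_q1, lambda_k1, lambda_q2, lambda_k2,
                   lambda_init=0.8, causal=True):
    # q, k, v: (b, l, n_h, head); the lambda_* vectors have shape (head / 2,).
    b, l, n_h, head = q.shape
    q1, q2 = q.chunk(2, dim=-1)         # each chunk: (b, l, n_h, head / 2)
    k1, k2 = k.chunk(2, dim=-1)
    scale = 1.0 / math.sqrt(head // 2)  # pre-activation normalized by sqrt(head / 2)

    def attn_map(qi, ki):
        scores = torch.einsum("blhd,bmhd->bhlm", qi, ki) * scale
        if causal:                      # causal mask is still added to the pre-activation
            mask = torch.triu(torch.ones(l, l, dtype=torch.bool, device=q.device), 1)
            scores = scores.masked_fill(mask, float("-inf"))
        return F.softmax(scores, dim=-1)

    # lambda = exp(lambda_K1 * lambda_Q1) - exp(lambda_K2 * lambda_Q2) + lambda_init
    lam = torch.exp(lambda_k1 @ lambda_q1) - torch.exp(lambda_k2 @ lambda_q2) + lambda_init
    a = attn_map(q1, k1) - lam * attn_map(q2, k2)    # A = A1 - lambda * A2
    out = torch.einsum("bhlm,bmhd->blhd", a, v)
    # Per-head RMS normalization of the output, rescaled by (1 - lambda_init).
    rms = out.pow(2).mean(dim=-1, keepdim=True).add(1e-6).rsqrt()
    return out * rms * (1.0 - lambda_init)

# Shape check with dummy tensors and zero-initialized lambda vectors.
b, l, n_h, head = 2, 5, 4, 16
q, k, v = (torch.randn(b, l, n_h, head) for _ in range(3))
lq1, lk1, lq2, lk2 = (torch.zeros(head // 2) for _ in range(4))
print(diff_attention(q, k, v, lq1, lk1, lq2, lk2).shape)  # torch.Size([2, 5, 4, 16])
```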
Improving exploration in reinforcement learning, enabling agents to learn more effectively in sparse reward environments.

