by Younginn Park
This repository contains solutions to selected deep neural network architecture optimization tasks, implemented in PyTorch and tuned to run in the Google Colab environment.
The solutions were developed as part of the Deep Neural Networks course at the University of Warsaw (2024/25), where they placed in the top 10% of the class by performance.
The projects implement methods proposed in recent publications that have the potential to improve existing models:
1. Proximal Backpropagation (ProxProp)
A modification of the backpropagation algorithm that takes implicit instead of explicit gradient steps to update the network parameters during training.
The weight update is the implicit (proximal) gradient step

$$W^{k+1} = W^k - \tau \nabla \mathcal{L}(W^{k+1}),$$

which is equivalent to $W^{k+1} = \operatorname{prox}_{\tau \mathcal{L}}(W^k) = \arg\min_{W} \big( \mathcal{L}(W) + \tfrac{1}{2\tau} \lVert W - W^k \rVert^2 \big)$.
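A minimal PyTorch sketch of such an implicit step is shown below. The helper name `proximal_step`, the inner-loop solver, and the hyperparameter values are illustrative assumptions rather than the repository's actual implementation (ProxProp applies proximal steps layer-wise); the sketch only demonstrates minimizing the loss plus a proximal penalty toward the previous iterate.

```python
import torch

def proximal_step(w_old, loss_fn, tau=0.1, inner_steps=20, inner_lr=0.05):
    # One implicit (proximal) update: approximately solves
    #   w_new = argmin_w  loss_fn(w) + ||w - w_old||^2 / (2 * tau)
    # with a few explicit gradient steps on the proximal objective.
    w = w_old.detach().clone().requires_grad_(True)
    for _ in range(inner_steps):
        obj = loss_fn(w) + ((w - w_old.detach()) ** 2).sum() / (2 * tau)
        grad, = torch.autograd.grad(obj, w)
        with torch.no_grad():
            w -= inner_lr * grad
    return w.detach()

# Toy usage: the implicit step moves w toward the loss minimizer while the
# proximal term keeps it close to the previous iterate w_old.
w_old = torch.tensor([2.0, -1.0])
quadratic_loss = lambda v: ((v - torch.tensor([0.0, 1.0])) ** 2).sum()
print(w_old, "->", proximal_step(w_old, quadratic_loss))
```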
2. Adversarial Training (AdvProp and SparseTopK)
An improvement to the robustness of image classification models, achieved by training on batches that contain both clean and adversarially perturbed examples. This forces the model to learn features that are invariant to small, malicious input changes, leading to better generalization and resilience against attacks.
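As a rough illustration of this training scheme, the sketch below mixes clean and FGSM-perturbed examples in each optimization step. The function names and hyperparameters are assumptions for illustration only; in particular, AdvProp's separate BatchNorm statistics for the adversarial branch are omitted here.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=8 / 255):
    # Craft adversarial examples with a single FGSM step: move each pixel by
    # eps in the direction that increases the classification loss.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    return (x_adv + eps * grad.sign()).clamp(0.0, 1.0).detach()

def adversarial_training_step(model, optimizer, x, y, eps=8 / 255):
    # One optimization step on a batch containing both clean and perturbed examples.
    model.eval()                               # keep BatchNorm stats fixed while attacking
    x_adv = fgsm_perturb(model, x, y, eps)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```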
3. Differential Attention (Diff Attention)
An enhancement to transformer attention that lets the model amplify attention to the relevant context while canceling noise, by computing attention scores as the difference between two separate softmax attention maps.
Differential Attention works similarly to standard attention, but with a few differences (a minimal sketch follows the list below):
- Differential Attention partitions the $Q$ and $K$ matrices into chunks $Q_1$, $Q_2$, $K_1$, $K_2$, each of shape `(b, l, n_h, head / 2)`.
- It calculates two attention matrices $A_1$ and $A_2$ using the respective $Q_i, K_i$ matrices. Note that the attention pre-activation is normalized by $\sqrt{head / 2}$, and the causal mask is still added to the pre-activation.
- The final attention matrix is calculated as $A = A_1 - \lambda A_2$.
- $\lambda$ in the equation above is calculated as $\lambda = \exp(\lambda_{K_1} * \lambda_{Q_1}) - \exp(\lambda_{K_2} * \lambda_{Q_2}) + \lambda_{\text{init}}$, where $\lambda_{K_1}, \lambda_{Q_1}, \lambda_{K_2}, \lambda_{Q_2}$ are all parameters of shape `head / 2`, $*$ denotes the scalar (dot) product of vectors, and $\lambda_{\text{init}} = 0.8$.
- Additionally, normalization is applied per head to the output tokens: $O' = \text{RMSNorm}(O') \cdot (1 - \lambda_{\text{init}})$.
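Putting the steps above together, a minimal PyTorch sketch of a single Differential Attention forward pass could look as follows. The function name `diff_attention`, the argument names, and the epsilon in the RMS normalization are illustrative assumptions; a full implementation would use a proper module with learnable RMSNorm gains.

```python
import math
import torch
import torch.nn.functional as F

def diff_attention(q, k, v, lambda_q1, lambda_k1, lambda_q2, lambda_k2,
                   lambda_init=0.8, causal=True):
    # q, k, v: (b, l, n_h, head); the lambda_* vectors have shape (head / 2,).
    b, l, n_h, head = q.shape
    q1, q2 = q.chunk(2, dim=-1)         # each chunk: (b, l, n_h, head / 2)
    k1, k2 = k.chunk(2, dim=-1)
    scale = 1.0 / math.sqrt(head // 2)  # pre-activation normalized by sqrt(head / 2)

    def attn_map(qi, ki):
        scores = torch.einsum("blhd,bmhd->bhlm", qi, ki) * scale
        if causal:                      # causal mask is still added to the pre-activation
            mask = torch.triu(torch.ones(l, l, dtype=torch.bool, device=q.device), 1)
            scores = scores.masked_fill(mask, float("-inf"))
        return F.softmax(scores, dim=-1)

    # lambda = exp(lambda_K1 * lambda_Q1) - exp(lambda_K2 * lambda_Q2) + lambda_init
    lam = torch.exp(lambda_k1 @ lambda_q1) - torch.exp(lambda_k2 @ lambda_q2) + lambda_init
    a = attn_map(q1, k1) - lam * attn_map(q2, k2)    # A = A1 - lambda * A2
    out = torch.einsum("bhlm,bmhd->blhd", a, v)
    # Per-head RMS normalization of the output, rescaled by (1 - lambda_init).
    rms = out.pow(2).mean(dim=-1, keepdim=True).add(1e-6).rsqrt()
    return out * rms * (1.0 - lambda_init)

# Shape check with dummy tensors and zero-initialized lambda vectors.
b, l, n_h, head = 2, 5, 4, 16
q, k, v = (torch.randn(b, l, n_h, head) for _ in range(3))
lq1, lk1, lq2, lk2 = (torch.zeros(head // 2) for _ in range(4))
print(diff_attention(q, k, v, lq1, lk1, lq2, lk2).shape)  # torch.Size([2, 5, 4, 16])
```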
Improving exploration in reinforcement learning, enabling agents to learn more effectively in sparse reward environments.

