# Neural Network

## The Fine Line between Linguistic Generalization and Failure in Seq2Seq-Attention Models (Weber et al., 2018)

We know that training a neural network involves optimising over a non-convex space, but using standard evaluation methods we see that our models…

## Attention Strategies for Multi-Source Sequence-to-Sequence Learning (Libovicky et al., 2017)

To apply attention across multiple input sources, it works best to attend over each source independently and then run a second phase of attention over the resulting per-source summary vectors.
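
A minimal NumPy sketch of this hierarchical idea, assuming simple dot-product scoring (the paper uses learned attention energies); the helper names `attend` and `hierarchical_attention` are illustrative, not from the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, states):
    """Dot-product attention: weight each state by its similarity to the query."""
    scores = states @ query                    # (T,)
    weights = softmax(scores)                  # (T,)
    return weights @ states                    # (d,) context vector

def hierarchical_attention(query, sources):
    """First attend within each source, then attend over the per-source summaries."""
    summaries = np.stack([attend(query, s) for s in sources])  # (n_sources, d)
    return attend(query, summaries)

# Example: one decoder state attending over two sources of different lengths.
d = 8
rng = np.random.default_rng(0)
query = rng.normal(size=d)
sources = [rng.normal(size=(5, d)), rng.normal(size=(3, d))]
context = hierarchical_attention(query, sources)
print(context.shape)  # (8,)
```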

## Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Shazeer et al., 2017)

Neural networks for language can be scaled up through a form of conditional computation, where a noisy single-layer gating network chooses among many feed-forward networks (experts) that sit between LSTM layers.
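
A toy NumPy sketch of noisy top-k gating under simplifying assumptions: a single-layer gate with input-dependent noise selects k experts per input, and only those experts are run. The sizes, the ReLU experts, and the helper names are illustrative rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 4, 2

# Each "expert" is just a feed-forward weight matrix in this toy example.
experts = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_experts)]
W_gate = rng.normal(scale=0.1, size=(d, n_experts))
W_noise = rng.normal(scale=0.1, size=(d, n_experts))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_layer(x):
    """Noisy top-k gating: pick k experts for this input, run only those, mix outputs."""
    noise = rng.standard_normal(n_experts) * np.log1p(np.exp(x @ W_noise))  # softplus noise scale
    logits = x @ W_gate + noise
    top_k = np.argsort(logits)[-k:]              # indices of the k highest-scoring experts
    gates = softmax(logits[top_k])               # renormalise over the selected experts only
    return sum(g * np.maximum(0, x @ experts[i]) for g, i in zip(gates, top_k))

x = rng.normal(size=d)
print(moe_layer(x).shape)  # (16,)
```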

## Dynamic Evaluation of Neural Sequence Models (Krause et al., 2017)

Language model perplexity can be reduced by maintaining a separate copy of the model that is updated with gradient steps while the model is being applied, allowing adaptation to short-term patterns in the text.
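
A rough sketch of the idea with a toy bigram model standing in for the RNN: score each test token, then take a small gradient step on that token so the parameters track local patterns. The plain SGD update and learning rate are assumptions for illustration; the paper's update rule is more involved:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, lr = 5, 0.5

# Toy bigram language model: W[prev] gives logits over the next token.
W = rng.normal(scale=0.1, size=(vocab, vocab))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dynamic_eval(tokens):
    """Measure perplexity while updating W after each prediction (dynamic evaluation)."""
    global W
    nll = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        probs = softmax(W[prev])
        nll -= np.log(probs[nxt])
        # Adapt to the text just seen: one SGD step on this token's cross-entropy.
        grad = probs.copy()
        grad[nxt] -= 1.0
        W[prev] -= lr * grad
    return np.exp(nll / (len(tokens) - 1))

# A repetitive test sequence: adaptation lets the model latch onto the local pattern.
tokens = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
print(dynamic_eval(tokens))
```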

## Searching for Activation Functions (Ramachandran et al., 2017)

Switching from the ReLU non-linearity, $\max(0, x)$, to Swish, $x \cdot \text{sigmoid}(x)$, consistently improves performance in neural networks across both vision and machine translation tasks.
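
The two activations side by side, as a quick check of the formulas (plain NumPy):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def swish(x):
    # Swish: x * sigmoid(x); smooth, non-monotonic, close to ReLU for large positive x.
    return x * (1.0 / (1.0 + np.exp(-x)))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(relu(x))   # [0. 0. 0. 1. 5.]
print(swish(x))  # approx [-0.0335 -0.2689  0.      0.7311  4.9665]
```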

## Attention Is All You Need (Vaswani et al., arXiv 2017)

To get context-dependence without recurrence, we can use a network that applies attention multiple times over both the input and the output (as it is generated).
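
A minimal sketch of the scaled dot-product attention at the Transformer's core, $\text{softmax}(QK^\top / \sqrt{d_k})\,V$, with a single head and no masking or multi-head projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys)
    return softmax(scores) @ V          # (n_queries, d_v)

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
# Self-attention: queries, keys, and values all come from the same sequence.
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```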

## DeftNN: Addressing Bottlenecks for DNN Execution on GPUs via Synapse Vector Elimination and Near-compute Data Fission (Hill et al., MICRO 2017)

GPU processing can be sped up ~2x by removing low-impact rows from weight matrices and by switching to a specialised floating-point representation.
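
A rough illustration of the row-elimination idea (not the paper's implementation): drop the lowest-importance rows of a weight matrix together with the matching input dimensions, so the matrix multiply shrinks. Using the row norm as the importance score and keeping 90% of rows are assumptions here:

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim = 512, 256

W = rng.normal(size=(in_dim, out_dim))
x = rng.normal(size=in_dim)

def eliminate_rows(W, x, keep_fraction=0.9):
    """Drop the lowest-importance input rows of W and the matching entries of x.

    Row norm stands in for an importance score; the smaller matrices make the
    matrix multiply proportionally cheaper.
    """
    importance = np.linalg.norm(W, axis=1)
    n_keep = int(keep_fraction * W.shape[0])
    keep = np.sort(np.argsort(importance)[-n_keep:])
    return W[keep], x[keep]

W_small, x_small = eliminate_rows(W, x)
full = x @ W
approx = x_small @ W_small
print(W_small.shape)  # (460, 256)
print(np.linalg.norm(full - approx) / np.linalg.norm(full))  # relative error from dropping 10% of rows
```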