Search ICLR 2018

Searching papers submitted to ICLR 2018 can be painful. You might want to know which papers use technique X, dataset D, or cite author ME. Unfortunately, the official search is limited to titles, abstracts, and keywords, and misses the actual contents of the papers. This Frankensteinian search was made to help scour the papers by ripping out their souls using pdftotext.

Good luck! Warranty's not included :)


Ranking Disclaimer: The rankings as they currently stand do not represent what the end state will be. Do not lose confidence if they're not where you want them to be. Reviewers are there to get the very best from you, not to gate you. You and your work are valuable and you're awesome :)
List of the top 100 ranked papers so far

"Top 100 papers" has 100 results


Certifiable Distributional Robustness with Principled Adversarial Training    

tl;dr We provide a fast, principled adversarial training procedure with computational and statistical performance guarantees.

Reviews: 9 (confidence: 5), 9 (confidence: 4), 9 (confidence: 4) = rank of 999 / 1000 = top 1%

Neural networks are vulnerable to adversarial examples and researchers have proposed many heuristic attack and defense mechanisms. We take the principled view of distributionally robust optimization, which guarantees performance under adversarial input perturbations. By considering a Lagrangian penalty formulation of perturbation of the underlying data distribution in a Wasserstein ball, we provide a training procedure that augments model parameter updates with worst-case perturbations of training data. For smooth losses, our procedure provably achieves moderate levels of robustness with little computational or statistical cost relative to empirical risk minimization. Furthermore, our statistical guarantees allow us to efficiently certify robustness for the population loss. We match or outperform heuristic approaches on supervised and reinforcement learning tasks.
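
For intuition, here is a minimal sketch of the inner maximization that such a Lagrangian-penalty procedure solves at each training step; the step count, step size, and penalty weight gamma are illustrative choices, not the paper's:

```python
import torch

def worst_case_inputs(model, loss_fn, x, y, gamma=1.0, steps=15, lr=0.1):
    """Approximate argmax_z [loss(model(z), y) - gamma * ||z - x||^2]
    by gradient ascent; the outer training loop then takes an ordinary
    descent step on the loss evaluated at the returned z."""
    z = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        obj = loss_fn(model(z), y) - gamma * ((z - x) ** 2).sum()
        g, = torch.autograd.grad(obj, z)
        z = (z + lr * g).detach().requires_grad_(True)  # ascent step on z
    return z.detach()
```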


Deep Mean Field Games for Learning Optimal Behavior Policy of Large Populations    

tl;dr Inference of a mean field game (MFG) model of large population behavior via a synthesis of MFG and Markov decision processes.

Reviews: 10 (confidence: 5), 8 (confidence: 4), 8 (confidence: 3) = rank of 998 / 1000 = top 1%

We consider the problem of representing a large population's behavior policy that drives the evolution of the population distribution over a discrete state space. A discrete time mean field game (MFG) is motivated as an interpretable model founded on game theory for understanding the aggregate effect of individual actions and predicting the temporal evolution of population distributions. We achieve a synthesis of MFG and Markov decision processes (MDP) by showing that a special MFG is reducible to an MDP. This enables us to broaden the scope of mean field game theory and infer MFG models of large real-world systems via deep inverse reinforcement learning. Our method learns both the reward function and forward dynamics of an MFG from real data, and we report the first empirical test of a mean field game model of a real-world social media population.


Emergence of grid-like representations by training recurrent neural networks to perform spatial localization    

tl;dr To our knowledge, this is the first study to show how neural representations of space, including grid-like cells and border cells as observed in the brain, could emerge from training a recurrent neural network to perform navigation tasks.

Reviews: 9 (confidence: 4), 8 (confidence: 4), 8 (confidence: 4) = rank of 997 / 1000 = top 1%

Decades of research on the neural code underlying spatial navigation have revealed a diverse set of neural response properties. The Entorhinal Cortex (EC) of the mammalian brain contains a rich set of spatial correlates, including grid cells which encode space using tessellating patterns. However, the mechanisms and functional significance of these spatial representations remain largely mysterious. As a new way to understand these neural representations, we trained recurrent neural networks (RNNs) to perform navigation tasks in 2D arenas based on velocity inputs. Surprisingly, we find that grid-like spatial response patterns emerge in trained networks, along with units that exhibit other spatial correlates, including border cells and band-like cells. All these different functional types of neurons have been observed experimentally. The order of the emergence of grid-like and border cells is also consistent with observations from developmental studies. Together, our results suggest that grid cells, border cells and others as observed in EC may be a natural solution for representing space efficiently given the predominant recurrent connections in the neural circuits.


i-RevNet: Deep Invertible Networks    

No tl;dr =[

Reviews: 9 (confidence: 4), 8 (confidence: 4), 8 (confidence: 4) = rank of 996 / 1000 = top 1%

It is widely believed that the success of deep convolutional networks is based on progressively discarding uninformative variability about the input with respect to the problem at hand. This is supported empirically by the difficulty of recovering images from their hidden representations, in most commonly used network architectures. In this paper we show that this loss of information is not a necessary condition to learn representations that generalize well on complicated problems, such as ImageNet. Via a cascade of homeomorphic layers, we build the i-RevNet, a network that can be fully inverted up to the final projection onto the classes, i.e. no information is discarded. Building an invertible architecture is difficult; for example, the local inversion is ill-conditioned. We overcome this by providing an explicit inverse. An analysis of i-RevNet’s learned representations suggests an explanation of the good accuracy by a progressive contraction and linear separation with depth. To shed light on the nature of the model learned by the i-RevNet we reconstruct linear interpolations between natural image representations.
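
The inversion mechanism can be illustrated with an additive coupling block in the style of RevNet; this is a simplified stand-in for the actual i-RevNet block, which also interleaves invertible downsampling and channel shuffles:

```python
import torch.nn as nn

class AdditiveCouplingBlock(nn.Module):
    """Exactly invertible no matter what the inner function f computes;
    f only ever needs to be evaluated forward, never inverted."""
    def __init__(self, f):
        super().__init__()
        self.f = f  # any shape-preserving network on half the channels

    def forward(self, x1, x2):
        return x2, x1 + self.f(x2)

    def inverse(self, y1, y2):
        return y2 - self.f(y1), y1
```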


On the Convergence of Adam and Beyond    

tl;dr We investigate the convergence of popular optimization algorithms like Adam and RMSProp, and propose new variants of these methods which provably converge to the optimal solution in convex settings.

Reviews: 9 (confidence: 5), 8 (confidence: 4), 8 (confidence: 3) = rank of 995 / 1000 = top 1%

Several recently proposed stochastic optimization methods that have been successfully used in training deep networks such as RMSProp, Adam, Adadelta, Nadam, etc., are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. It has been empirically observed that sometimes these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause for such failures is the exponential moving average used in the algorithms. We provide an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution, and describe the precise problems with the previous analysis of the Adam algorithm. Our analysis suggests that the convergence issues may be fixed by endowing such algorithms with "long-term memory" of past gradients, and we propose new variants of the Adam algorithm which not only fix the convergence issues but often also lead to improved empirical performance.
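
The paper's proposed variant, AMSGrad, keeps a running maximum of the second-moment estimate so the effective learning rate never increases. A minimal NumPy sketch (bias correction omitted, as in the paper's presentation; hyperparameters are the usual defaults):

```python
import numpy as np

def amsgrad_step(p, g, m, v, vhat, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One parameter update with the 'long-term memory' fix: vhat keeps
    the running maximum of v, so the denominator is non-decreasing."""
    m = b1 * m + (1 - b1) * g          # first-moment moving average
    v = b2 * v + (1 - b2) * g * g      # second-moment moving average
    vhat = np.maximum(vhat, v)         # the AMSGrad max step
    p = p - lr * m / (np.sqrt(vhat) + eps)
    return p, m, v, vhat
```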


Multi-Scale Dense Networks for Resource Efficient Image Classification    

No tl;dr =[

Reviews: 10 (confidence: 4), 8 (confidence: 4), 7 (confidence: 4) = rank of 994 / 1000 = top 1%

In this paper we investigate image classification with computational resource limits at test time. Two such settings are: 1. anytime classification, where the network’s prediction for a test example is progressively updated, facilitating the output of a prediction at any time; and 2. budgeted batch classification, where a fixed amount of computation, available to classify a set of examples, can be spent unevenly across “easier” and “harder” inputs. In contrast to most prior work, such as the popular Viola and Jones algorithm, our approach is based on convolutional neural networks. We train multiple classifiers with varying resource demands, which we adaptively apply during test time. To maximally re-use computation between the classifiers, we incorporate them as early-exits into a single deep convolutional neural network and inter-connect them with dense connectivity. To facilitate high quality classification early on, we use a two-dimensional multi-scale network architecture that maintains coarse and fine level features throughout the network. Experiments on three image-classification tasks demonstrate that our framework substantially improves the existing state-of-the-art in both settings.
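
A minimal sketch of the anytime-inference pattern (not the exact multi-scale architecture): evaluate early-exit classifiers in order of cost and stop as soon as one is confident. The threshold and the batch-of-one assumption are illustrative:

```python
import torch

def anytime_predict(stages, x, threshold=0.9):
    """`stages` is a list of (feature_block, classifier) pairs, cheapest
    exit first. Assumes a single example in the batch."""
    h = x
    pred = None
    for block, classifier in stages:
        h = block(h)
        probs = torch.softmax(classifier(h), dim=1)
        conf, pred = probs.max(dim=1)
        if conf.item() >= threshold:   # confident enough: spend no more compute
            break
    return pred
```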


Wasserstein Auto-Encoders    

tl;dr We propose a new auto-encoder based on the Wasserstein distance, which improves on the sampling properties of VAE.

Reviews: 8 (confidence: 4), 8 (confidence: 3), 8 (confidence: 3) = rank of 993 / 1000 = top 1%

We propose the Wasserstein Auto-Encoder (WAE)---a new algorithm for building a generative model of the data distribution. WAE minimizes a penalized form of the Wasserstein distance between the model distribution and the target distribution, which leads to a different regularizer than the one used by the Variational Auto-Encoder (VAE). This regularizer encourages the encoded training distribution to match the prior. We compare our algorithm with several other techniques and show that it is a generalization of adversarial auto-encoders (AAE). Our experiments show that WAE shares many of the properties of VAEs (stable training, encoder-decoder architecture, nice latent manifold structure) while generating samples of better quality.
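
A minimal sketch of the objective of the WAE-MMD variant, where the latent regularizer is the maximum mean discrepancy between encoded codes and prior samples; the inverse multiquadratic kernel matches the paper's choice, while the constants c and lam are illustrative:

```python
import torch

def imq_kernel(a, b, c=1.0):
    """Inverse multiquadratic kernel k(x, y) = c / (c + ||x - y||^2)."""
    return c / (c + torch.cdist(a, b) ** 2)

def wae_mmd_loss(x, x_rec, z_q, z_p, lam=10.0):
    """Reconstruction + lam * MMD(q_Z, p_Z); z_q = encoder(x), z_p ~ prior.
    Unbiased MMD estimate: diagonals of the within-sample kernels dropped."""
    n = z_q.size(0)
    k_qq, k_pp, k_qp = imq_kernel(z_q, z_q), imq_kernel(z_p, z_p), imq_kernel(z_q, z_p)
    mmd = ((k_qq.sum() - k_qq.diagonal().sum()) / (n * (n - 1))
           + (k_pp.sum() - k_pp.diagonal().sum()) / (n * (n - 1))
           - 2 * k_qp.mean())
    return ((x - x_rec) ** 2).mean() + lam * mmd
```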


Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments    

No tl;dr =[

Reviews: 9 (confidence: 2), 8 (confidence: 4), 7 (confidence: 4) = rank of 992 / 1000 = top 1%

The ability to continuously learn and adapt from limited experience in nonstationary environments is an important milestone on the path towards general intelligence. In this paper, we cast the problem of continuous adaptation into the learning-to-learn framework. We develop a simple gradient-based meta-learning algorithm suitable for adaptation in dynamically changing and adversarial scenarios. Additionally, we design a new multi-agent competitive environment, RoboSumo, and define iterated adaptation games for testing various aspects of continuous adaptation. We demonstrate that meta-learning enables significantly more efficient adaptation than reactive baselines in the few-shot regime. Our experiments with a population of agents that learn and compete suggest that meta-learners are the fittest.


Learning to Represent Programs with Graphs    

tl;dr Programs have structure that can be represented as graphs, and graph neural networks can learn to find bugs on such graphs

Reviews: 8 (confidence: 4), 8 (confidence: 4), 8 (confidence: 4) = rank of 991 / 1000 = top 1%

Learning tasks on source code (i.e., formal languages) have been considered recently, but most work has tried to transfer natural language methods and does not capitalize on the unique opportunities offered by code's known syntax. For example, long-range dependencies induced by using the same variable or function in distant locations are often not considered. We propose to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures. In this work, we present how to construct graphs from source code and how to scale Gated Graph Neural Networks training to such large graphs. We evaluate our method on two tasks: VarNaming, in which a network attempts to predict the name of a variable given its usage, and VarMisuse, in which the network learns to reason about selecting the correct variable that should be used at a given program location. Our comparison to methods that use less structured program representations shows the advantages of modeling known structure, and suggests that our models learn to infer meaningful names and to solve the VarMisuse task in many cases. Additionally, our testing showed that VarMisuse identifies a number of bugs in mature open-source projects.
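
A minimal sketch of one round of gated graph message passing over such program graphs: neighbours send a linear message per edge type, and a GRU cell updates each node state. The dense per-edge-type adjacency is an illustrative simplification (large graphs would use sparse operations):

```python
import torch
import torch.nn as nn

class GGNNStep(nn.Module):
    """One propagation round of a gated graph neural network (sketch)."""
    def __init__(self, dim, n_edge_types):
        super().__init__()
        self.msg = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_edge_types))
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, h, adj):
        # h: [n_nodes, dim]; adj[t]: [n_nodes, n_nodes] adjacency per edge type
        m = sum(a @ self.msg[t](h) for t, a in enumerate(adj))
        return self.gru(m, h)  # gated update of every node state
```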


Spherical CNNs    

tl;dr We introduce Spherical CNNs, a convolutional network for spherical signals, and apply it to 3D model recognition and molecular energy regression.

Reviews: 9 (confidence: 4), 8 (confidence: 4), 7 (confidence: 3) = rank of 990 / 1000 = top 1%

Convolutional Neural Networks (CNNs) have become the method of choice for learning problems involving 2D planar images. However, a number of problems of recent interest have created a demand for models that can analyze spherical images. Examples include omnidirectional vision for drones, robots, and autonomous cars, molecular regression problems, and global weather and climate modelling. A naive application of convolutional networks to a planar projection of the spherical signal is destined to fail, because the space-varying distortions introduced by such a projection will make translational weight sharing ineffective. In this paper we introduce the building blocks for constructing spherical CNNs. We propose a definition for the spherical convolution that is both expressive and rotation-equivariant. We show that this spherical convolution satisfies a generalized convolution theorem, which allows us to compute it efficiently using a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm. We demonstrate the computational efficiency, numerical accuracy, and effectiveness of spherical CNNs applied to 3D model recognition and atomization energy regression.


Boosting Dilated Convolutional Networks with Mixed Tensor Decompositions    

tl;dr We introduce the notion of mixed tensor decompositions, and use it to prove that interconnecting dilated convolutional networks boosts their expressive power.

Reviews: 9 (confidence: 4), 8 (confidence: 3), 7 (confidence: 4) = rank of 989 / 1000 = top 2%

The driving force behind deep networks is their ability to compactly represent rich classes of functions. The primary notion for formally reasoning about this phenomenon is expressive efficiency, which refers to a situation where one network must grow unfeasibly large in order to replicate functions of another. To date, expressive efficiency analyses focused on the architectural feature of depth, showing that deep networks are representationally superior to shallow ones. In this paper we study the expressive efficiency brought forth by connectivity, motivated by the observation that modern networks interconnect their layers in elaborate ways. We focus on dilated convolutional networks, a family of deep models delivering state of the art performance in sequence processing tasks. By introducing and analyzing the concept of mixed tensor decompositions, we prove that interconnecting dilated convolutional networks can lead to expressive efficiency. In particular, we show that even a single connection between intermediate layers can already lead to an almost quadratic gap, which in large-scale settings typically makes the difference between a model that is practical and one that is not. Empirical evaluation demonstrates how the expressive efficiency of connectivity, similarly to that of depth, translates into gains in accuracy. This leads us to believe that expressive efficiency may serve a key role in developing new tools for deep network design.


Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection    

tl;dr An end-to-end trained deep neural network that leverages Gaussian Mixture Modeling to perform density estimation and unsupervised anomaly detection in a low-dimensional space learned by a deep autoencoder.

Reviews: 8 (confidence: 5), 8 (confidence: 4), 8 (confidence: 4) = rank of 988 / 1000 = top 2%

Unsupervised anomaly detection on multi- or high-dimensional data is of great importance in both fundamental machine learning research and industrial applications, for which density estimation lies at the core. Although previous approaches based on dimensionality reduction followed by density estimation have made fruitful progress, they mainly suffer from decoupled model learning with inconsistent optimization goals and incapability of preserving essential information in the low-dimensional space. In this paper, we present a Deep Autoencoding Gaussian Mixture Model (DAGMM) for unsupervised anomaly detection. Our model utilizes a deep autoencoder to generate a low-dimensional representation and reconstruction error for each input data point, which is further fed into a Gaussian Mixture Model (GMM). Instead of using decoupled two-stage training and the standard Expectation-Maximization (EM) algorithm, DAGMM jointly optimizes the parameters of the deep autoencoder and the mixture model simultaneously in an end-to-end fashion, leveraging a separate estimation network to facilitate the parameter learning of the mixture model. The joint optimization, which well balances autoencoding reconstruction, density estimation of latent representation, and regularization, helps the autoencoder escape from less attractive local optima and further reduce reconstruction errors, avoiding the need for pre-training. Experimental results on several public benchmark datasets show that DAGMM significantly outperforms state-of-the-art anomaly detection techniques, and achieves up to 14% improvement based on the standard F1 score.


Simulating Action Dynamics with Neural Process Networks    

tl;dr We propose a new recurrent memory architecture that can track common sense state changes of entities by simulating the causal effects of actions.

Reviews: 9 (confidence: 4), 8 (confidence: 4), 6 (confidence: 4) = rank of 987 / 1000 = top 2%

Understanding procedural language requires anticipating the causal effects of actions, even when they are not explicitly stated. In this work, we introduce Neural Process Networks to understand procedural text through (neural) simulation of action dynamics. Our model complements existing memory architectures with dynamic entity tracking by explicitly modeling actions as state transformers. The model updates the states of the entities by executing learned action operators. Empirical results demonstrate that our proposed model can reason about the unstated causal effects of actions, allowing it to provide more accurate contextual information for understanding and generating procedural text, all while offering more interpretable internal representations than existing alternatives.


Zero-Shot Visual Imitation    

tl;dr Agents can learn to imitate solely visual demonstrations (without actions) at test time after learning from their own experience without any form of supervision at training time.

Reviews: 8 (confidence: 4), 8 (confidence: 3), 7 (confidence: 5) = rank of 986 / 1000 = top 2%

Existing approaches to imitation learning distill both what to do---goals---and how to do it---skills---from expert demonstrations. This expertise is effective but expensive supervision: it is not always practical to collect many detailed demonstrations. We argue that if an agent has access to its environment along with the expert, it can learn skills from its own experience and rely on expertise for the goals alone. We weaken the expert supervision required to a single visual demonstration of the task, that is, observation of the expert without knowledge of the actions. Our method is ``zero-shot'' in that we never see expert actions and never see demonstrations during learning. Through self-supervised exploration our agent learns to act and to recognize its actions so that it can infer expert actions once given a demonstration in deployment. During training the agent learns a skill policy for reaching a target observation from the current observation. During inference, the expert demonstration communicates the goals to imitate while the skill policy determines how to imitate. Our novel skill policy architecture and dynamics consistency loss extend visual imitation to more complex environments while improving robustness. Our zero-shot imitator, having no prior knowledge of the environment and making no use of the expert during training, learns from experience to follow experts for navigating an office with a TurtleBot, and manipulating rope with a Baxter robot. Videos and a detailed result analysis are available at https://sites.google.com/view/zero-shot-visual-imitation/home


Spatially Transformed Adversarial Examples    

tl;dr We propose a new approach for generating adversarial examples based on spatial transformation, which produces perceptually realistic examples compared to existing attacks.

Reviews: 9 (confidence: 5), 7 (confidence: 4), 7 (confidence: 4) = rank of 985 / 1000 = top 2%

Recent studies show that widely used deep neural networks (DNNs) are vulnerable to carefully crafted adversarial examples. Many advanced algorithms have been proposed to generate adversarial examples by leveraging the $\mathcal{L}_p$ distance for penalizing perturbations. Different defense methods have also been explored to defend against such adversarial attacks. While the effectiveness of $\mathcal{L}_p$ distance as a metric of perceptual quality remains an active research area, in this paper we will instead focus on a different type of perturbation, namely spatial transformation, as opposed to manipulating the pixel values directly as in prior works. Perturbations generated through spatial transformation could result in large $\mathcal{L}_p$ distance measures, but our extensive experiments show that such spatially transformed adversarial examples are more perceptually realistic and more difficult to defend against with existing defense systems. This potentially provides a new direction in adversarial example generation and the design of corresponding defenses. We visualize the spatial transformation based perturbation for different examples and show that our technique can produce realistic adversarial examples with smooth image deformation. Finally, we visualize the attention of deep networks with different types of adversarial examples to better understand how these examples are interpreted.
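
A minimal sketch of the core operation: instead of adding a pixel perturbation, optimize a per-pixel flow field and warp the image through differentiable bilinear sampling. The clamping and align_corners choices are illustrative:

```python
import torch
import torch.nn.functional as F

def spatially_transform(x, flow):
    """Warp images x [N, C, H, W] by a displacement field flow [N, H, W, 2]
    given in normalized (-1, 1) coordinates, via bilinear sampling."""
    n = x.size(0)
    identity = F.affine_grid(torch.eye(2, 3).repeat(n, 1, 1), list(x.shape),
                             align_corners=True)
    grid = (identity + flow).clamp(-1, 1)
    return F.grid_sample(x, grid, align_corners=True)
```

The attack then optimizes `flow` against the classifier's loss while penalizing non-smooth flows, which is what keeps the deformation perceptually plausible.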


Can recurrent neural networks warp time?    

tl;dr Proves that gating mechanisms provide invariance to time transformations. Introduces and tests a new initialization for LSTMs from this insight.

Reviews: 8 (confidence: 4), 8 (confidence: 4), 7 (confidence: 4) = rank of 984 / 1000 = top 2%

Successful recurrent models such as long short-term memories (LSTMs) and gated recurrent units (GRUs) use \emph{ad hoc} gating mechanisms. Empirically these models have been found to improve the learning of medium to long term temporal dependencies and to help with vanishing gradient issues. We prove that learnable gates in a recurrent model formally provide \emph{quasi-invariance to general time transformations} in the input data. We recover part of the LSTM architecture from a simple axiomatic approach. This result leads to a new way of initializing gate biases in LSTMs and GRUs. Experimentally, this new \emph{chrono initialization} is shown to greatly improve learning of long term dependencies, with minimal implementation effort.
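
The chrono initialization itself is a few lines: draw the forget-gate bias as log(Uniform(1, T_max - 1)) and set the input-gate bias to its negative, where T_max is the longest dependency range you expect. A sketch for torch.nn.LSTM, whose gates are ordered i, f, g, o; putting the whole bias in bias_ih and zeroing bias_hh is one implementation choice:

```python
import torch

def chrono_init(lstm, t_max):
    """Chrono initialization sketch: forget bias ~ log(U(1, t_max - 1)),
    input bias = -forget bias, everything else zero."""
    with torch.no_grad():
        for name, p in lstm.named_parameters():
            if not name.startswith("bias"):
                continue
            p.zero_()
            if name.startswith("bias_ih"):
                n = p.size(0) // 4
                b_f = torch.log(torch.empty(n).uniform_(1, t_max - 1))
                p[n:2 * n] = b_f   # forget gate slice
                p[:n] = -b_f       # input gate slice
```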


When is a Convolutional Filter Easy to Learn?    

tl;dr We prove randomly initialized (stochastic) gradient descent learns a convolutional filter in polynomial time.

Reviews: 9 (confidence: 4), 8 (confidence: 3), 6 (confidence: 3) = rank of 983 / 1000 = top 2%

We analyze the convergence of (stochastic) gradient descent algorithm for learning a convolutional filter with Rectified Linear Unit (ReLU) activation function. Our analysis does not rely on any specific form of the input distribution and our proofs only use the definition of ReLU, in contrast with previous works that are restricted to standard Gaussian input. We show that (stochastic) gradient descent with random initialization can learn the convolutional filter in polynomial time and the convergence rate depends on the smoothness of the input distribution and the closeness of patches. To the best of our knowledge, this is the first recovery guarantee of gradient-based algorithms for convolutional filter on non-Gaussian input distributions. Our theory also justifies the two-stage learning rate strategy in deep neural networks. While our focus is theoretical, we also present experiments that justify our theoretical findings.


Breaking the Softmax Bottleneck: A High-Rank RNN Language Model    

No tl;dr =[

Reviews: 8 (confidence: 4), 7 (confidence: 5), 7 (confidence: 4) = rank of 982 / 1000 = top 2%

We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. Given that natural language is highly context-dependent, this further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language. We propose a simple and effective method to address this issue, and improve the state-of-the-art perplexities on Penn Treebank and WikiText-2 to 47.69 and 40.68 respectively.
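
The proposed fix is a mixture of softmaxes (MoS): compute K softmax distributions from K context vectors and mix them with input-dependent weights, which makes the log-probability matrix high-rank. A minimal sketch (K = 15 is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    """Mix K softmaxes with context-dependent weights (sketch)."""
    def __init__(self, d, vocab, k=15):
        super().__init__()
        self.k, self.d = k, d
        self.prior = nn.Linear(d, k)        # mixture weights pi_k(context)
        self.latent = nn.Linear(d, k * d)   # K context vectors per input
        self.decoder = nn.Linear(d, vocab)  # shared output embeddings

    def forward(self, h):                   # h: [batch, d]
        pi = F.softmax(self.prior(h), dim=-1)                   # [B, K]
        hk = torch.tanh(self.latent(h)).view(-1, self.k, self.d)
        probs = F.softmax(self.decoder(hk), dim=-1)             # [B, K, V]
        return (pi.unsqueeze(-1) * probs).sum(dim=1)            # [B, V]
```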


CausalGAN: Learning Causal Implicit Generative Models with Adversarial Training    

tl;dr We introduce causal implicit generative models, which can sample from conditional and interventional distributions and also propose two new conditional GANs which we use for training them.

Reviews: 9 (confidence: 3), 7 (confidence: 3), 6 (confidence: 3) = rank of 981 / 1000 = top 2%

We introduce causal implicit generative models (CiGMs): models that allow sampling from not only the true observational but also the true interventional distributions. We show that adversarial training can be used to learn a CiGM, if the generator architecture is structured based on a given causal graph. We consider the application of conditional and interventional sampling of face images with binary feature labels, such as mustache, young. We preserve the dependency structure between the labels with a given causal graph. We devise a two-stage procedure for learning a CiGM over the labels and the image. First we train a CiGM over the binary labels using a Wasserstein GAN where the generator neural network is consistent with the causal graph between the labels. Later, we combine this with a conditional GAN to generate images conditioned on the binary labels. We propose two new conditional GAN architectures: CausalGAN and CausalBEGAN. We show that the optimal generator of the CausalGAN, given the labels, samples from the image distributions conditioned on these labels. The conditional GAN combined with a trained CiGM for the labels is then a CiGM over the labels and the generated image. We show that the proposed architectures can be used to sample from observational and interventional image distributions, even for interventions which do not naturally occur in the dataset.


Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models    

tl;dr A novel adversarial attack that can directly attack real-world black-box machine learning models without transfer.

Reviews: 8 (confidence: 3), 7 (confidence: 4), 7 (confidence: 4) = rank of 980 / 1000 = top 2%

Most modern machine learning algorithms are vulnerable to minimal and almost imperceptible perturbations of their inputs. So far it was unclear how much risk adversarial perturbations carry for the safety of real-world machine learning applications, because most attack methods rely either on detailed model information (gradient-based attacks) or on confidence scores such as probability or logit outputs of the model (score-based attacks), neither of which are available in most real-world scenarios. In such cases, currently the only option is transfer-based attacks; however, these need access to the training data, rely on cumbersome substitute models, and are much easier to defend against. Here we introduce a new category of attacks - decision-based attacks - that rely solely on the final decision of a model. Decision-based attacks are therefore (1) applicable to real-world black-box models such as autonomous cars, (2) need less knowledge and are easier to apply than transfer-based attacks and (3) are more robust to simple defences than gradient-based or score-based attacks. As a first effective representative of this category we introduce the Boundary Attack. This attack is conceptually simple, extremely flexible, requires close to no hyperparameter tuning and yields adversarial perturbations that are competitive with those produced by the best gradient-based attacks in standard computer vision applications. We demonstrate the attack on two real-world black-box models from Clarifai.com. The Boundary Attack in particular and the class of decision-based attacks in general thus open new avenues to study the robustness of machine learning models and raise new questions regarding the safety of deployed machine learning systems. An implementation of the Boundary Attack is available at XXXXXX.
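
A heavily simplified sketch of one Boundary Attack step, assuming `predict` returns a top-1 label (the only model access the attack needs) and `x_adv` is already adversarial. The real attack also projects the random step onto the sphere around the original and adapts both step sizes from acceptance rates:

```python
import numpy as np

def boundary_step(predict, x_orig, y_orig, x_adv, delta=0.1, eps=0.01):
    """Random walk along the decision boundary: small random step, small
    contraction toward the original, keep only if still misclassified."""
    dist = np.linalg.norm(x_adv - x_orig)
    d = np.random.randn(*x_adv.shape)
    d *= delta * dist / np.linalg.norm(d)             # scale the random step
    cand = x_orig + (1 - eps) * (x_adv + d - x_orig)  # contract toward x_orig
    return cand if predict(cand) != y_orig else x_adv
```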


Learning to Teach    

tl;dr We propose and verify the effectiveness of learning to teach, a new framework to automatically guide the machine learning process.

Reviews: 9 (confidence: 3), 8 (confidence: 4), 5 (confidence: 4) = rank of 979 / 1000 = top 3%

Teaching plays a very important role in our society, by spreading human knowledge and educating our next generations. A good teacher will select appropriate teaching materials, impart suitable methodologies, and set up targeted examinations, according to the learning behaviors of the students. In the field of artificial intelligence, however, the role of teaching has been largely overlooked, with most attention paid to machine \emph{learning}. In this paper, we argue that equal attention, if not more, should be paid to teaching, and furthermore, an optimization framework (instead of heuristics) should be used to obtain good teaching strategies. We call this approach ``learning to teach''. In the approach, two intelligent agents interact with each other: a student model (which corresponds to the learner in traditional machine learning algorithms), and a teacher model (which determines the appropriate data, loss function, and hypothesis space to facilitate the training of the student model). The teacher model leverages the feedback from the student model to optimize its own teaching strategies by means of reinforcement learning, so as to achieve teacher-student co-evolution. To demonstrate the practical value of our proposed approach, we take the training of deep neural networks (DNN) as an example, and show that by using the learning-to-teach techniques, we are able to use much less training data and fewer iterations to achieve almost the same accuracy for different kinds of DNN models (e.g., multi-layer perceptron, convolutional neural networks and recurrent neural networks) under various machine learning tasks (e.g., image classification and text understanding).


Polar Transformer Networks    

tl;dr We learn feature maps invariant to translation, and equivariant to rotation and scale.

Reviews: 8 (confidence: 3), 7 (confidence: 4), 7 (confidence: 4) = rank of 978 / 1000 = top 3%

Convolutional neural networks (CNNs) are inherently equivariant to translation. Efforts to embed other forms of equivariance have concentrated solely on rotation. We expand the notion of equivariance in CNNs through the Polar Transformer Network (PTN). PTN combines ideas from the Spatial Transformer Network (STN) and canonical coordinate representations. The result is a network invariant to translation and equivariant to both rotation and scale. PTN is trained end-to-end and composed of three distinct stages: a polar origin predictor, the newly introduced polar transformer module and a classifier. PTN achieves state-of-the-art on rotated MNIST and the newly introduced SIM2MNIST dataset, an MNIST variation obtained by adding clutter and perturbing digits with translation, rotation and scaling. The ideas of PTN are extensible to 3D which we demonstrate through the Cylindrical Transformer Network.


FusionNet: Fusing via Fully-aware Attention with Application to Machine Comprehension    

tl;dr We propose a light-weight enhancement for attention and a neural architecture, FusionNet, to achieve SOTA on SQuAD and adversarial SQuAD.

Reviews: 8 (confidence: 3), 7 (confidence: 5), 7 (confidence: 4) = rank of 977 / 1000 = top 3%

This paper introduces a new neural structure called FusionNet, which extends existing attention approaches from three perspectives. First, it puts forward a novel concept of "History of Word" to characterize attention information from the lowest word-level embedding up to the highest semantic-level representation. Second, it introduces an improved attention scoring function that better utilizes the "History of Word" concept. Third, it proposes a fully-aware multi-level attention mechanism to capture the complete information in one text (such as a question) and exploit it in its counterpart (such as context or passage) layer by layer. We apply FusionNet to the Stanford Question Answering Dataset (SQuAD) and it achieves the first position for both single and ensemble model on the official SQuAD leaderboard at the time of writing (Oct. 4th, 2017). Meanwhile, we verify the generalization of FusionNet with two adversarial SQuAD datasets and it sets up the new state-of-the-art on both datasets: on AddSent, FusionNet increases the best F1 metric from 46.6% to 51.4%; on AddOneSent, FusionNet boosts the best F1 metric from 56.0% to 60.7%.


Neural Map: Structured Memory for Deep Reinforcement Learning    

No tl;dr =[

Reviews: 9 (confidence: 5), 7 (confidence: 4), 6 (confidence: 5) = rank of 976 / 1000 = top 3%

A critical component to enabling intelligent reasoning in partially observable environments is memory. Despite this importance, Deep Reinforcement Learning (DRL) agents have so far used relatively simple memory architectures, with the main methods to overcome partial observability being either a temporal convolution over the past k frames or an LSTM layer. More recent work (Oh et al., 2016) has gone beyond these architectures by using memory networks, which allow more sophisticated addressing schemes over the past k frames. But even these architectures are unsatisfactory because they are limited to remembering information from only the last k frames. In this paper, we develop a memory system with an adaptable write operator that is customized to the sorts of 3D environments that DRL agents typically interact with. This architecture, called the Neural Map, uses a spatially structured 2D memory image to learn to store arbitrary information about the environment over long time lags. We demonstrate empirically that the Neural Map surpasses previous DRL memories on a set of challenging 2D and 3D maze environments and show that it is capable of generalizing to environments that were not seen during training.


Learning Differentially Private Recurrent Language Models    

tl;dr User-level differential privacy for recurrent neural network language models is possible with a sufficiently large dataset.

Reviews: 8 (confidence: 4), 7 (confidence: 4), 7 (confidence: 2) = rank of 975 / 1000 = top 3%

We demonstrate that it is possible to train large recurrent language models with user-level differential privacy guarantees with only a negligible cost in predictive accuracy. Our work builds on recent advances in the training of deep networks on user-partitioned data and privacy accounting for stochastic gradient descent. In particular, we add user-level privacy protection to the federated averaging algorithm, which makes large step updates from user-level data. Our work demonstrates that given a dataset with a sufficiently large number of users (a requirement easily met by even small internet-scale datasets), achieving differential privacy comes at the cost of increased computation, rather than in decreased utility as in most prior work. We find that our private LSTM language models are quantitatively and qualitatively similar to un-noised models when trained on a large dataset.
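
The core of the user-level mechanism fits in a few lines: clip each user's update, average, and add Gaussian noise scaled to the clip bound. This sketch omits the per-round user sampling and the moments-accountant bookkeeping that the actual (epsilon, delta) guarantee depends on:

```python
import numpy as np

def private_average(updates, clip_norm, noise_mult):
    """DP-FedAvg sketch: bound each user's influence by L2-clipping their
    model delta, then add noise calibrated to that bound."""
    clipped = [u * min(1.0, clip_norm / max(np.linalg.norm(u), 1e-12))
               for u in updates]
    avg = np.mean(clipped, axis=0)
    sigma = noise_mult * clip_norm / len(updates)
    return avg + np.random.normal(0.0, sigma, size=avg.shape)
```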


Training and Inference with Integers in Deep Neural Networks    

tl;dr We apply training and inference with only low-bitwidth integers in DNNs

Reviews: 8 (confidence: 4), 7 (confidence: 4), 7 (confidence: 3) = rank of 974 / 1000 = top 3%

Research on deep neural networks with discrete parameters and their deployment in embedded systems has been an active and promising topic. Although previous works have successfully reduced precision in inference, transferring both training and inference processes to low-bitwidth integers has not been demonstrated simultaneously. In this work, we develop a new method termed ``WAGE" to discretize both training and inference, where weights (W), activations (A), gradients (G) and errors (E) among layers are shifted and linearly constrained to low-bitwidth integers. To perform pure discrete dataflow for fixed-point devices, we further replace batch normalization by a constant scaling layer and simplify other components that are arduous for integer implementation. Improved accuracies can be obtained on multiple datasets, which indicates that WAGE somehow acts as a type of regularization. Empirically, we demonstrate the potential to deploy training on hardware systems such as integer-based deep learning accelerators and neuromorphic chips with comparable accuracy and higher energy efficiency, which is crucial for future AI applications in diverse scenarios with transfer and continual learning demands.
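
The basic building block is a uniform k-bit quantizer. A NumPy sketch following the paper's grid of spacing 2^(1 - k); the shift-based scaling and the stochastic rounding applied to gradients are omitted here:

```python
import numpy as np

def quantize(x, bits):
    """Uniform k-bit quantization on [-1, 1]: snap values to a grid of
    spacing 2^(1 - k) and clip to the representable range."""
    step = 2.0 ** (1 - bits)
    return np.clip(step * np.round(x / step), step - 1, 1 - step)

# e.g. weights -> quantize(w, 2), activations -> quantize(a, 8) in a
# WAGE-style W/A/G/E bitwidth split (the exact widths are per-experiment).
```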


Skip Connections Eliminate Singularities    

tl;dr Degenerate manifolds arising from the non-identifiability of the model slow down learning in deep networks; skip connections help by breaking degeneracies.

Reviews: 8 (confidence: 3), 8 (confidence: 3), 6 (confidence: 4) = rank of 973 / 1000 = top 3%

Skip connections made the training of very deep networks possible and have become an indispensable component in a variety of neural architectures. A completely satisfactory explanation for their success remains elusive. Here, we present a novel explanation for the benefits of skip connections in training very deep networks. The difficulty of training deep networks is partly due to the singularities caused by the non-identifiability of the model. Two such singularities have been identified in previous work: (i) overlap singularities caused by the permutation symmetry of nodes in a given layer and (ii) elimination singularities corresponding to the elimination, i.e. consistent deactivation, of nodes. These singularities cause degenerate manifolds in the loss landscape previously shown to slow down learning. We argue that skip connections eliminate these singularities by breaking the permutation symmetry of nodes and by reducing the possibility of node elimination. Moreover, for typical initializations, skip connections move the network away from the "ghosts" of these singularities and sculpt the landscape around them to alleviate the learning slow-down. These hypotheses are supported by evidence from simplified models, as well as from experiments with deep networks trained on CIFAR-10 and CIFAR-100.


Mastering the Dungeon: Grounded Language Learning by Mechanical Turker Descent    

No tl;dr =[

Reviews: 8 (confidence: 5), 7 (confidence: 4), 7 (confidence: 4) = rank of 972 / 1000 = top 3%

Contrary to most natural language processing research, which makes use of static datasets, humans learn language interactively, grounded in an environment. In this work we propose an interactive learning procedure called Mechanical Turker Descent (MTD) that trains agents to execute natural language commands grounded in a fantasy text adventure game. In MTD, Turkers compete to train better agents in the short term, and collaborate by sharing their agents' skills in the long term. This results in a gamified, engaging experience for the Turkers and a better quality teaching signal for the agents compared to static datasets, as the Turkers naturally adapt the training data to the agent's abilities.


Synthetic and Natural Noise Both Break Neural Machine Translation    

tl;dr CharNMT is brittle

Reviews: 8 (confidence: 4), 7 (confidence: 4), 7 (confidence: 4) = rank of 971 / 1000 = top 3%

Character-based neural machine translation (NMT) models alleviate out-of-vocabulary issues, learn morphology, and move us closer to completely end-to-end translation systems. Unfortunately, they are also very brittle and easily falter when presented with noisy data. In this paper, we confront NMT models with synthetic and natural sources of noise. We find that state-of-the-art models fail to translate even moderately noisy texts that humans have no trouble comprehending. We explore two approaches to increase model robustness: structure-invariant word representations and robust training on noisy texts. We find that a model based on a character convolutional neural network is able to simultaneously learn representations robust to multiple kinds of noise.


Efficient Sparse-Winograd Convolutional Neural Networks    

tl;dr Prune and ReLU in Winograd domain for efficient convolutional neural network

Reviews: 8 (confidence: 4), 7 (confidence: 4), 7 (confidence: 3) = rank of 970 / 1000 = top 3%

Convolutional Neural Networks (CNNs) are computationally intensive, which limits their application on mobile devices. Their energy is dominated by the number of multiplies needed to perform the convolutions. Winograd’s minimal filtering algorithm (Lavin, 2015) and network pruning (Han et al., 2015) can reduce the operation count, but these two methods cannot be straightforwardly combined — applying the Winograd transform fills in the sparsity in both the weights and the activations. We propose two modifications to Winograd-based CNNs to enable these methods to exploit sparsity. First, we move the ReLU operation into the Winograd domain to increase the sparsity of the transformed activations. Second, we prune the weights in the Winograd domain to exploit static weight sparsity. For models on CIFAR-10, CIFAR-100 and ImageNet datasets, our method reduces the number of multiplications by 10.4x, 6.8x and 10.8x respectively with loss of accuracy less than 0.1%, outperforming previous baselines by 2.0x-3.0x. We also show that moving ReLU to the Winograd domain allows more aggressive pruning.


Generating Wikipedia by Summarizing Long Sequences    

tl;dr We generate Wikipedia articles abstractively conditioned on source document text.

Reviews: 8 (confidence: 3), 7 (confidence: 5), 7 (confidence: 4) = rank of 969 / 1000 = top 4%

We show that generating English Wikipedia articles can be approached as a multi-document summarization of source documents. We use extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. For the abstractive model, we introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder-decoder architectures used in sequence transduction. We show that this model can generate fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles. When given reference documents, we show it can extract relevant factual information as reflected in perplexity, ROUGE scores and human evaluations.


SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data    

tl;dr We show that SGD learns two-layer over-parameterized neural networks with Leaky ReLU activations that provably generalize on linearly separable data.

Reviews: 8 (confidence: 4), 7 (confidence: 3), 7 (confidence: 3) = rank of 968 / 1000 = top 4%

Neural networks exhibit good generalization behavior in the over-parameterized regime, where the number of network parameters exceeds the number of observations. Nonetheless, current generalization bounds for neural networks fail to explain this phenomenon. In an attempt to bridge this gap, we study the problem of learning a two-layer over-parameterized neural network, when the data is generated by a linearly separable function. In the case where the network has Leaky ReLU activations, we provide both optimization and generalization guarantees for over-parameterized networks. Specifically, we prove convergence rates of SGD to a global minimum and provide generalization guarantees for this global minimum that are independent of the network size. Therefore, our result clearly shows that the use of SGD for optimization both finds a global minimum, and avoids overfitting despite the high capacity of the model. This is the first theoretical demonstration that SGD can avoid overfitting, when learning over-specified neural network classifiers.


Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy    

tl;dr We show that knowledge transfer techniques can improve the accuracy of low precision networks and set new state-of-the-art accuracy for ternary and 4-bits precision.

Reviews: 8 (confidence: 4), 7 (confidence: 4), 7 (confidence: 4) = rank of 967 / 1000 = top 4%

Deep learning networks have achieved state-of-the-art accuracies on computer vision workloads like image classification and object detection. The performant systems, however, typically involve big models with numerous parameters. Once trained, a challenging aspect for such top performing models is deployment on resource constrained inference systems -- the models (often deep networks or wide networks or both) are compute and memory intensive. Low precision numerics and model compression using knowledge distillation are popular techniques to lower both the compute requirements and memory footprint of these deployed models. In this paper, we study the combination of these two techniques and show that the performance of low precision networks can be significantly improved by using knowledge distillation techniques. We call our approach Apprentice and show state-of-the-art accuracies using ternary precision and 4-bit precision for many variants of the ResNet architecture on the ImageNet dataset. We study three schemes in which one can apply knowledge distillation techniques to various stages of the train-and-deploy pipeline.
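
One of the studied schemes trains the low-precision student jointly against ground-truth labels and the full-precision teacher's softened outputs. A minimal sketch of such a distillation loss; the temperature and mixing weight are illustrative, not the paper's settings:

```python
import torch.nn.functional as F

def apprentice_loss(student_logits, teacher_logits, labels, t=1.0, alpha=0.5):
    """Student matches hard labels and the teacher's softened distribution;
    the t*t factor keeps gradient magnitudes comparable across temperatures."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / t, dim=1),
                    F.softmax(teacher_logits / t, dim=1),
                    reduction="batchmean") * (t * t)
    return alpha * hard + (1 - alpha) * soft
```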


AmbientGAN: Generative models from lossy measurements    

tl;dr How to learn GANs from noisy, distorted, partial observations

Reviews: 8 (confidence: 4), 7 (confidence: 4), 7 (confidence: 4) = rank of 966 / 1000 = top 4%

Generative models provide a way to model structure in complex distributions and have been shown to be useful for many tasks of practical interest. However, current techniques for training generative models require access to fully-observed samples. In many settings, it is expensive or even impossible to obtain fully-observed samples, but economical to obtain partial, noisy observations. We consider the task of learning an implicit generative model given only lossy measurements of samples from the distribution of interest. We show that the true underlying distribution can be provably recovered even in the presence of per-sample information loss for a class of measurement models. Based on this, we propose a new method of training Generative Adversarial Networks (GANs) which we call AmbientGAN. On three benchmark datasets, and for various measurement models, we demonstrate substantial qualitative and quantitative improvements. Generative models trained with our method can obtain $2$-$4$x higher inception scores than the baselines.


A DIRT-T Approach to Unsupervised Domain Adaptation    

tl;dr SOTA on unsupervised domain adaptation by leveraging the cluster assumption.

Reviews: 8 (confidence: 4), 7 (confidence: 4), 7 (confidence: 2) = rank of 965 / 1000 = top 4%

Domain adaptation refers to the problem of how to leverage labels in one source domain to boost learning performance in a new target domain where labels are scarce or completely unavailable. A recent approach for finding a common representation of the two domains is via domain adversarial training (Ganin & Lempitsky, 2015), which attempts to induce a feature extractor that matches the source and target feature distributions in some feature space. However, domain adversarial training faces two critical limitations: 1) if the feature extraction function has high capacity, feature distribution matching is under-constrained; 2) in the non-conservative domain adaptation setting, where no single classifier can perform well jointly in both the source and target domains, domain adversarial training is over-constrained. In this paper, we address these issues through the lens of the cluster assumption, i.e., decision boundaries should not cross high-density data regions. We propose two novel and related models: (1) the Virtual Adversarial Domain Adaptation (VADA) model, which combines domain adversarial training with a penalty term that punishes violations of the cluster assumption; (2) the Decision-boundary Iterative Refinement Training with a Teacher (DIRT-T) model, which takes the VADA model as initialization and employs natural gradient steps to further minimize the cluster assumption violation. Extensive empirical results demonstrate that the combination of these two models significantly improves the state-of-the-art performance on several visual domain adaptation benchmarks.


Neural Language Modeling by Jointly Learning Syntax and Lexicon    

tl;dr In this paper, we propose a novel neural language model, called the Parsing-Reading-Predict Networks (PRPN), that can simultaneously induce the syntactic structure from unannotated sentences and leverage the inferred structure to learn a better language model.

Reviews: 8 (confidence: 4), 7 (confidence: 4), 7 (confidence: 3) = rank of 964 / 1000 = top 4%

We propose a neural language model capable of unsupervised syntactic structure induction. The model leverages the structure information to form better semantic representations and better language modeling. Standard recurrent neural networks are limited by their structure and fail to efficiently use syntactic information. On the other hand, tree-structured recursive networks usually require additional structural supervision at the cost of human expert annotation. In this paper, we propose a novel neural language model, called the Parsing-Reading-Predict Networks (PRPN), that can simultaneously induce the syntactic structure from unannotated sentences and leverage the inferred structure to learn a better language model. In our model, the gradient can be directly back-propagated from the language model loss into the neural parsing network. Experiments show that the proposed model can discover the underlying syntactic structure and achieve state-of-the-art performance on word/character-level language model tasks.


Neural Speed Reading via Skim-RNN    

No tl;dr =[

Reviews: 8 (confidence: 3), 7 (confidence: 3), 7 (confidence: 3) = rank of 963 / 1000 = top 4%

Inspired by the principles of speed reading, we introduce Skim-RNN, a recurrent neural network (RNN) that dynamically decides to update only a small fraction of the hidden state for relatively unimportant input tokens. Skim-RNN gives a significant computational advantage over an RNN that always updates the entire hidden state. Skim-RNN uses the same input and output interfaces as a standard RNN and can be easily used instead of RNNs in existing models. In our experiments, we show that Skim-RNN can achieve significantly reduced computational cost without losing accuracy compared to standard RNNs across five different natural language tasks. In addition, we demonstrate that the trade-off between accuracy and speed of Skim-RNN can be dynamically controlled during inference time in a stable manner. Our analysis also shows that Skim-RNN running on a single CPU offers lower latency compared to standard RNNs on GPUs.
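
A minimal sketch of the skimming step: a small RNN updates only the first d_small dimensions of the state when a token is deemed unimportant. The hard argmax shown here reflects inference-time behavior; during training the discrete choice is made differentiable (the paper uses Gumbel-softmax):

```python
import torch
import torch.nn as nn

class SkimRNNCell(nn.Module):
    """Per-token choice between a big full-state update and a cheap
    small-state update (sketch)."""
    def __init__(self, d_in, d_big, d_small):
        super().__init__()
        self.big = nn.GRUCell(d_in, d_big)
        self.small = nn.GRUCell(d_in, d_small)
        self.choose = nn.Linear(d_in + d_big, 2)
        self.d_small = d_small

    def forward(self, x, h):
        skim = self.choose(torch.cat([x, h], dim=1)).argmax(dim=1)  # hard choice
        h_big = self.big(x, h)
        h_small = torch.cat([self.small(x, h[:, :self.d_small]),
                             h[:, self.d_small:]], dim=1)  # rest is carried over
        return torch.where(skim.unsqueeze(1).bool(), h_small, h_big)
```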


Alternating Multi-bit Quantization for Recurrent Neural Networks    

tl;dr We propose a new quantization method and apply it to quantize RNNs for both compression and acceleration

Reviews: 8 (confidence: 4), 7 (confidence: 4), 7 (confidence: 2) = rank of 962 / 1000 = top 4%

Recurrent neural networks have achieved excellent performance in many applications. However, on portable devices with limited resources, the models are often too large to deploy. For server applications with large-scale concurrent requests, the latency during inference can also be critical for costly computing resources. In this work, we address these problems by quantizing the network, both weights and activations, into multiple binary codes {-1,+1}. We formulate the quantization as an optimization problem. Under the key observation that once the quantization coefficients are fixed the binary codes can be derived efficiently by a binary search tree, alternating minimization is then applied. We test the quantization on two well-known RNNs, i.e., long short-term memory (LSTM) and gated recurrent unit (GRU), on language models. Compared with the full-precision counterpart, by 2-bit quantization we can achieve ~16x memory saving and potential ~13.5x inference acceleration on CPUs, with only a reasonable loss in accuracy. By 3-bit quantization, we can achieve almost no loss in accuracy or even surpass the original model, with ~10.5x memory saving and potential ~6.5x inference acceleration. Both results beat existing quantization works by large margins.
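
A sketch of the greedy initialization for the k-bit decomposition w ~ sum_i a_i * b_i with b_i in {-1, +1}; the paper then refines it by alternating between refitting the coefficients a by least squares and recomputing the codes b via the binary-search-tree step mentioned above:

```python
import numpy as np

def multibit_quantize(w, k):
    """Greedy residual quantization: each round fits one binary code and
    one scalar coefficient to what the previous rounds left over."""
    residual, alphas, codes = w.astype(float), [], []
    for _ in range(k):
        b = np.sign(residual)
        b[b == 0] = 1
        a = np.abs(residual).mean()   # least-squares scalar for fixed b
        alphas.append(a)
        codes.append(b)
        residual = residual - a * b
    return np.array(alphas), np.array(codes)
```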


Modular Continual Learning in a Unified Visual Environment    

tl;dr We propose a neural module approach to continual learning using a unified visual environment with a large action space.

Reviews: 8 (confidence: 3), 8 (confidence: 2), 6 (confidence: 2) = rank of 961 / 1000 = top 4%

A core aspect of human intelligence is the ability to learn new tasks quickly and switch between them flexibly. Here, we describe a modular continual reinforcement learning paradigm inspired by these abilities. We first introduce a visual interaction environment that allows many types of tasks to be unified in a single framework. We then describe a reward map prediction scheme that learns new tasks robustly in the very large state and action spaces required by such an environment. We investigate how properties of module architecture influence efficiency of task learning, showing that a module motif incorporating specific design principles (e.g. early bottlenecks, low-order polynomial nonlinearities, and symmetry) significantly outperforms more standard neural network motifs, needing fewer training examples and fewer neurons to achieve high levels of performance. Finally, we present a meta-controller architecture for task switching based on a recurrent neural voting scheme, which allows new modules to use information learned from previously-seen tasks to substantially improve their own learning efficiency.


A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks    

No tl;dr =[

Reviews: 9 (confidence: 4), 7 (confidence: 4), 6 (confidence: 3) = rank of 960 / 1000 = top 4%

We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.


Spectral Normalization for Generative Adversarial Networks    

tl;dr We propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator of GANs.

Reviews: 8 (confidence: 3), 7 (confidence: 4), 7 (confidence: 2) = rank of 959 / 1000 = top 5%

One of the challenges in the study of generative adversarial networks is the instability of their training. In this paper, we propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator. Our new normalization technique is computationally light and easy to incorporate into existing implementations. We tested the efficacy of spectral normalization on the CIFAR-10, STL-10, and ILSVRC2012 datasets, and we experimentally confirmed that spectrally normalized GANs (SN-GANs) are capable of generating images of better or equal quality relative to previous training stabilization techniques.
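
The normalization itself is cheap: estimate each weight matrix's largest singular value by power iteration and divide it out. A minimal sketch; persisting `u` across training steps is what makes a single iteration per update sufficient in practice:

```python
import torch
import torch.nn.functional as F

def spectral_normalize(w, u, n_iter=1):
    """Rescale a weight tensor so its spectral norm is ~1. `u` is a
    persistent vector of size w.size(0), updated by power iteration."""
    w_mat = w.view(w.size(0), -1)      # treat conv kernels as a matrix
    for _ in range(n_iter):
        v = F.normalize(w_mat.t() @ u, dim=0)
        u = F.normalize(w_mat @ v, dim=0)
    sigma = torch.dot(u, w_mat @ v)    # estimate of the top singular value
    return w / sigma, u
```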


Learning One-hidden-layer Neural Networks with Landscape Design    

tl;dr The paper analyzes the optimization landscape of one-hidden-layer neural nets and designs a new objective that provably has no spurious local minimum.

Reviews: 9 (confidence: 3), 7 (confidence: 3), 6 (confidence: 3) = rank of 958 / 1000 = top 5%

We consider the problem of learning a one-hidden-layer neural network: we assume the input $x$ is drawn from a Gaussian distribution and the label $y = a^\top \sigma(Bx) + \xi$, where $a$ is a nonnegative vector, $B$ is a full-rank weight matrix, and $\xi$ is a noise vector. We first give an analytic formula for the population risk of the standard squared loss and demonstrate that it implicitly attempts to decompose a sequence of low-rank tensors simultaneously. Inspired by the formula, we design a non-convex objective function $G$ whose landscape is guaranteed to have the following properties: 1. All local minima of $G$ are also global minima. 2. All global minima of $G$ correspond to the ground truth parameters. 3. The value and gradient of $G$ can be estimated using samples. With these properties, stochastic gradient descent on $G$ provably converges to the global minimum and learns the ground-truth parameters. We also prove finite sample complexity results and validate the results by simulations.


Unsupervised Machine Translation Using Monolingual Corpora Only    

tl;dr We propose a new unsupervised machine translation model that can learn without using parallel corpora; experimental results show impressive performance on multiple corpora and pairs of languages.

Reviews: 8 (confidence: 5), 7 (confidence: 5), 7 (confidence: 4) = rank of 957 / 1000 = top 5%

Machine translation has recently achieved impressive performance thanks to recent advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource language pairs, yet these still require tens of thousands of parallel sentences. In this work, we take this research direction to the extreme and investigate whether it is possible to learn to translate even without any parallel data. We propose a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space. By learning to reconstruct in both languages from this shared feature space, the model effectively learns to translate without using any labeled data. We demonstrate our model on two widely used datasets and two language pairs, reporting BLEU scores up to 32.8, without using even a single parallel sentence at training time.


Learning how to explain neural networks: PatternNet and PatternAttribution    

tl;dr Without learning, it is impossible to explain a machine learning model's decisions.

Reviews: 8 (confidence: 4), 8 (confidence: 3), 6 (confidence: 4) = rank of 956 / 1000 = top 5%

DeConvNet, Guided BackProp, and LRP were invented to better understand deep neural networks. We show that these methods do not produce the theoretically correct explanation for a linear model. Yet they are used on multi-layer networks with millions of parameters. This is a cause for concern, since linear models are simple neural networks. We argue that explanation methods for neural nets should work reliably in the limit of simplicity, the linear models. Based on our analysis of linear models, we propose a generalization that yields two explanation techniques (PatternNet and PatternAttribution) that are theoretically sound for linear models and produce improved explanations for deep networks.
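
The linear-model failure is easy to reproduce. In the toy setup below (our construction, not the paper's exact example), y depends only on the signal s, the weights must cancel a correlated distractor, and so the gradient w points away from the signal direction while the covariance-based pattern recovers it:

    import numpy as np

    rng = np.random.default_rng(0)
    s, d = rng.normal(size=10000), rng.normal(size=10000)   # signal, distractor
    X = np.outer(s, [1.0, 0.0]) + np.outer(d, [1.0, 1.0])   # observed data
    w = np.array([1.0, -1.0])                               # cancels d: y = s
    y = X @ w

    print("gradient 'explanation':", w)                     # [1, -1]
    a = ((X * y[:, None]).mean(0) - X.mean(0) * y.mean()) / y.var()
    print("pattern a = cov(x, y)/var(y):", a.round(2))      # ~ [1, 0] = signal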


Unsupervised Cipher Cracking Using Discrete GANs    

No tl;dr =[

Reviews: 8 (confidence: 4), 7 (confidence: 4), 7 (confidence: 1) = rank of 955 / 1000 = top 5%

This work details CipherGAN, an architecture inspired by CycleGAN, used for inferring the underlying cipher mapping given banks of unpaired ciphertext and plaintext. We demonstrate that CipherGAN is capable of cracking language data enciphered using shift and Vigenère ciphers to a high degree of fidelity and for vocabularies much larger than previously achieved. We show how CycleGAN can be made compatible with discrete data and train in a stable way. We then prove that the technique used in CipherGAN avoids the common problem of uninformative discrimination associated with GANs applied to discrete data.
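
For reference, the ciphers being cracked are simple substitution schemes; a shift cipher is just a Vigenère cipher with a length-1 key. CipherGAN, of course, has to infer the mapping from unpaired text banks rather than from a known key:

    def vigenere(text, key, decrypt=False):
        # polyalphabetic shift over a-z
        sign = -1 if decrypt else 1
        return ''.join(
            chr((ord(c) - 97 + sign * (ord(key[i % len(key)]) - 97)) % 26 + 97)
            for i, c in enumerate(text))

    ct = vigenere("attackatdawn", "lemon")
    print(ct, vigenere(ct, "lemon", decrypt=True))   # lxfopvefrnhr attackatdawn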


On the insufficiency of existing momentum schemes for Stochastic Optimization    

tl;dr Existing momentum/acceleration schemes such as heavy ball method and Nesterov's acceleration employed with stochastic gradients do not improve over vanilla stochastic gradient descent, especially when employed with small batch sizes.

Reviews: 8 (confidence: 5), 7 (confidence: 4), 7 (confidence: 3) = rank of 954 / 1000 = top 5%

Momentum based stochastic gradient methods such as the heavy ball (HB) method and Nesterov's accelerated gradient descent (NAG) method are widely used in practice for training deep networks and other supervised learning models, as they often provide significant improvements over stochastic gradient descent (SGD). Theoretically, these fast gradient methods have provable improvements over gradient descent only in the deterministic case, where the gradients are exact. In the stochastic case, the popular explanation for their wide applicability is that when these fast gradient methods are applied with stochastic gradients, they partially mimic their exact gradient counterparts, resulting in some practical gain. This work provides a counterpoint to this belief by proving that there are simple problem instances where these methods cannot outperform SGD despite the best setting of their parameters. These negative problem instances are, in an informal sense, generic; they do not look like carefully constructed pathological instances. These results suggest (along with empirical evidence) that HB's or NAG's practical performance gains are a by-product of minibatching. Furthermore, this work provides a viable (and provable) alternative which, on the same set of problem instances, significantly improves over the performance of HB, NAG, and SGD. This algorithm, denoted ASGD, is a simple-to-implement stochastic algorithm based on a relatively less popular version of Nesterov's AGD. Extensive empirical results in this paper show that ASGD has performance gains over HB, NAG, and SGD.
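
For reference, the updates in question, written with a minibatch gradient g (a sketch; the paper's point is precisely that with noisy g these schemes lose their provable edge over plain SGD):

    def sgd(w, g, lr):
        return w - lr * g

    def heavy_ball(w, v, g, lr, mu=0.9):
        # v_{t+1} = mu * v_t - lr * g(w_t);  w_{t+1} = w_t + v_{t+1}
        v = mu * v - lr * g
        return w + v, v

    def nesterov(w, v, grad_fn, lr, mu=0.9):
        # same recursion, but the gradient is taken at the look-ahead point
        v = mu * v - lr * grad_fn(w + mu * v)
        return w + v, v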


Relational Neural Expectation Maximization    

tl;dr We introduce a novel approach to common-sense physical reasoning that learns physical interactions between objects from raw visual images in a purely unsupervised fashion

Reviews: 8 (confidence: 5), 7 (confidence: 4), 7 (confidence: 3) = rank of 953 / 1000 = top 5%

Common-sense physical reasoning is an essential ingredient for any intelligent agent operating in the real world. For example, it can be used to simulate the environment, or to infer the state of parts of the world that are currently unobserved. In order to match real-world conditions, this causal knowledge must be learned without access to supervised data. To solve this problem, we present a novel method that incorporates prior knowledge about the compositional nature of human perception to factor interactions between object-pairs and to learn them efficiently. It learns to discover objects and to model physical interactions between them from raw visual images in a purely unsupervised fashion. On videos of bouncing balls we show the superior modelling capabilities of our method compared to other unsupervised neural approaches that do not incorporate such prior knowledge. We show its ability to handle occlusion and that it can extrapolate learned knowledge to environments with different numbers of objects.


Neural Sketch Learning for Conditional Program Generation    

tl;dr We give a method for generating type-safe programs in a Java-like language, given a small amount of syntactic information about the desired code.

Reviews: 8 (confidence: 4), 7 (confidence: 3), 7 (confidence: 2) = rank of 952 / 1000 = top 5%

We study the problem of generating source code in a strongly typed, Java-like programming language, given a label (for example a set of API calls or types) carrying a small amount of information about the code that is desired. The generated programs are expected to respect a ``realistic'' relationship between programs and labels, as exemplified by a corpus of labeled programs available during training. Two challenges in such *conditional program generation* are that the generated programs must satisfy a rich set of syntactic and semantic constraints, and that source code contains many low-level features that impede learning. We address these problems by training a neural generator not on code but on *program sketches*, or models of program syntax that abstract out names and operations that do not generalize across programs. During generation, we infer a posterior distribution over sketches, then concretize samples from this distribution into type-safe programs using combinatorial techniques. We implement our ideas in a system for generating API-heavy Java code, and show that it can often predict the entire body of a method given just a few API calls or data types that appear in the method.


Distributed Prioritized Experience Replay    

tl;dr A distributed architecture for deep reinforcement learning at scale, using parallel data-generation to improve the state of the art on the Arcade Learning Environment benchmark in a fraction of the wall-clock training time of previous approaches.

Reviews: 9 (confidence: 4), 7 (confidence: 4), 6 (confidence: 3) = rank of 951 / 1000 = top 5%

We propose a distributed architecture for deep reinforcement learning at scale, that enables agents to learn effectively from orders of magnitude more data than previously possible. The algorithm decouples acting from learning: the actors interact with their own instances of the environment by selecting actions according to a shared neural network, and accumulate the resulting experience in a shared experience replay memory; the learner replays samples of experience and updates the neural network. The architecture relies on prioritized experience replay to focus only on the most significant data generated by the actors. Our architecture substantially improves the state of the art on the Arcade Learning Environment, achieving better final performance in a fraction of the wall-clock training time.
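
A minimal sketch of the proportional prioritized sampling the architecture builds on (Schaul et al.'s scheme; the real system shards this across many actors and uses a sum-tree rather than the O(n) arrays here):

    import numpy as np

    class PrioritizedReplay:
        def __init__(self, capacity, alpha=0.6):
            self.capacity, self.alpha = capacity, alpha
            self.data, self.prios = [], []

        def add(self, transition, priority=1.0):
            if len(self.data) >= self.capacity:
                self.data.pop(0); self.prios.pop(0)
            self.data.append(transition); self.prios.append(priority)

        def sample(self, batch_size, beta=0.4):
            # P(i) proportional to p_i^alpha
            p = np.asarray(self.prios) ** self.alpha
            p /= p.sum()
            idx = np.random.choice(len(self.data), batch_size, p=p)
            w = (len(self.data) * p[idx]) ** -beta    # importance weights
            return idx, [self.data[i] for i in idx], w / w.max()

        def update(self, idx, td_errors, eps=1e-3):
            for i, e in zip(idx, td_errors):          # learner feeds back |TD error|
                self.prios[i] = abs(e) + eps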


Eigenoption Discovery through the Deep Successor Representation    

tl;dr We show how we can use the successor representation to discover eigenoptions in stochastic domains, from raw pixels. Eigenoptions are options learned to navigate the latent dimensions of a learned representation.

Reviews: 9 (confidence: 5), 7 (confidence: 4), 6 (confidence: 3) = rank of 950 / 1000 = top 5%

Options in reinforcement learning allow agents to hierarchically decompose a task into subtasks, having the potential to speed up learning and planning. However, autonomously learning effective sets of options is still a major challenge in the field. In this paper we focus on the recently introduced idea of using representation learning methods to guide the option discovery process. Specifically, we look at eigenoptions, options obtained from representations that encode diffusive information flow in the environment. We extend the existing algorithms for eigenoption discovery to settings with stochastic transitions and in which handcrafted features are not available. We propose an algorithm that discovers eigenoptions while learning non-linear state representations from raw pixels. It exploits recent successes in the deep reinforcement learning literature and the equivalence between proto-value functions and the successor representation. We use traditional tabular domains to provide intuition about our approach and Atari 2600 games to demonstrate its potential.


GENERALIZING ACROSS DOMAINS VIA CROSS-GRADIENT TRAINING    

tl;dr Domain guided augmentation of data provides a robust and stable method of domain generalization

Reviews: 8 (confidence: 4), 7 (confidence: 5), 7 (confidence: 5), 7 (confidence: 4) = rank of 949 / 1000 = top 6%

We present CROSSGRAD, a method to use multi-domain training data to learn a classifier that generalizes to new domains. CROSSGRAD does not need an adaptation phase via labeled or unlabeled data, or domain features in the new domain. Most existing domain adaptation methods attempt to erase domain signals using techniques like domain adversarial training. In contrast, CROSSGRAD is free to use domain signals for predicting labels, if it can prevent overfitting on training domains. We conceptualize the task in a Bayesian setting, in which a sampling step is implemented as data augmentation, based on domain-guided perturbations of input instances. CROSSGRAD jointly trains a label and a domain classifier on examples perturbed by loss gradients of each other’s objectives. This enables us to directly perturb inputs, without separating and re-mixing domain signals while making various distributional assumptions. Empirical evaluation on three different applications where this setting is natural establishes that (1) domain-guided perturbation provides consistently better generalization to unseen domains, compared to generic instance perturbation methods, and (2) data augmentation is a more stable and accurate method than domain adversarial training.
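
The perturbation step is compact; a sketch of one batch, assuming PyTorch models label_net and domain_net (our simplified reading of the method, not the authors' code):

    import torch
    import torch.nn.functional as F

    def cross_gradients(x, y, d, label_net, domain_net, eps=1.0):
        # y: label targets, d: domain targets
        x = x.clone().requires_grad_(True)
        g_label,  = torch.autograd.grad(F.cross_entropy(label_net(x), y), x,
                                        retain_graph=True)
        g_domain, = torch.autograd.grad(F.cross_entropy(domain_net(x), d), x)
        # the label net trains on inputs nudged along the *domain* gradient,
        # and the domain net on inputs nudged along the *label* gradient
        return (x + eps * g_domain).detach(), (x + eps * g_label).detach()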


Learning an Embedding Space for Transferable Robot Skills    

No tl;dr =[

Reviews: 7 (confidence: 5), 7 (confidence: 4), 7 (confidence: 4) = rank of 948 / 1000 = top 6%

We present a method for reinforcement learning of closely related skills that are parameterized via a skill embedding space. We learn such skills by taking advantage of latent variables and exploiting a connection between reinforcement learning and variational inference. The main contribution of our work is an entropy-regularized policy gradient formulation for hierarchical policies, and an associated, data-efficient and robust off-policy gradient algorithm based on stochastic value gradients. We demonstrate the effectiveness of our method on several simulated robotic manipulation tasks. We find that our method allows for discovery of multiple solutions and is capable of learning the minimum number of distinct skills that are necessary to solve a given set of tasks. In addition, our results indicate that the technique proposed here can interpolate and/or sequence previously learned skills in order to accomplish more complex tasks, even in the presence of sparse rewards.


Learning from Between-class Examples for Deep Sound Recognition    

tl;dr We propose a novel learning method for deep sound recognition named BC learning.

Reviews: 9 (confidence: 4), 8 (confidence: 4), 4 (confidence: 4) = rank of 947 / 1000 = top 6%

Deep learning methods have achieved high performance in sound recognition tasks. Deciding how to feed the training data is important for further performance improvement. We propose a novel learning method for deep sound recognition: learning from between-class examples (BC learning). Our strategy is to learn a discriminative feature space by recognizing between-class sounds as between-class sounds. We generate between-class sounds by mixing two sounds belonging to different classes with a random ratio. We then input the mixed sound to the model and train the model to output the mixing ratio. The advantages of BC learning are not limited to the increase in variation of the training data; BC learning leads to an enlargement of Fisher’s criterion in the feature space and a regularization of the positional relationship among the feature distributions of the classes. The experimental results show that BC learning improves performance across various sound recognition networks, datasets, and data augmentation schemes, proving consistently beneficial. Furthermore, we construct a new deep sound recognition network (DSRNet) and train it with BC learning. As a result, the performance of DSRNet surpasses the human level.
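
The data side of BC learning fits in a few lines (simplified: the paper additionally accounts for the two sounds' differing sound pressure levels when forming the mixture):

    import numpy as np

    def bc_mix(x1, y1, x2, y2, n_classes):
        # y1 != y2 are class indices; the mixing ratio becomes the soft label
        r = np.random.uniform(0, 1)
        x = (r * x1 + (1 - r) * x2) / np.sqrt(r ** 2 + (1 - r) ** 2)
        y = np.zeros(n_classes)
        y[y1], y[y2] = r, 1 - r
        return x, y   # train against y with a KL-divergence loss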


Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play    

tl;dr Unsupervised learning for reinforcement learning using an automatic curriculum of self-play

Reviews: 8 (confidence: 4), 8 (confidence: 4), 5 (confidence: 3) = rank of 946 / 1000 = top 6%

We describe a simple scheme that allows an agent to learn about its environment in an unsupervised manner. Our scheme pits two versions of the same agent, Alice and Bob, against one another. Alice proposes a task for Bob to complete; and then Bob attempts to complete the task. In this work we will focus on two kinds of environments: (nearly) reversible environments and environments that can be reset. Alice will ``propose'' the task by doing a sequence of actions and then Bob must undo or repeat them, respectively. Via an appropriate reward structure, Alice and Bob automatically generate a curriculum of exploration, enabling unsupervised training of the agent. When Bob is deployed on an RL task within the environment, this unsupervised training reduces the number of supervised episodes needed to learn, and in some cases converges to a higher reward.


PixelNN: Example-based Image Synthesis    

tl;dr Pixel-wise nearest neighbors used for generating multiple images from incomplete priors such as a low-res images, surface normals, edges etc.

Reviews: 8 (confidence: 4), 7 (confidence: 3), 6 (confidence: 4) = rank of 945 / 1000 = top 6%

We present a simple nearest-neighbor (NN) approach that synthesizes high-frequency photorealistic images from an ``incomplete'' signal such as a low-resolution image, a surface normal map, or edges. Current state-of-the-art deep generative models designed for such conditional image synthesis lack two important things: (1) they are unable to generate a large set of diverse outputs, due to the mode collapse problem; (2) they are not interpretable, making it difficult to control the synthesized output. We demonstrate that NN approaches potentially address such limitations, but suffer in accuracy on small datasets. We design a simple pipeline that combines the best of both worlds: the first stage uses a convolutional neural network (CNN) to map the input to an (overly-smoothed) image, and the second stage uses a pixel-wise nearest neighbor method to map the smoothed output to multiple high-quality, high-frequency outputs in a controllable manner. Importantly, pixel-wise matching allows our method to compose novel high-frequency content by cutting-and-pasting pixels from different training exemplars. We demonstrate our approach for various input modalities, and for domains including human faces, pets, shoes, and handbags.


Ask the Right Questions: Active Question Reformulation with Reinforcement Learning    

tl;dr We propose an agent that sits between the user and a black box question-answering system and which learns to reformulate questions to elicit the best possible answers

Reviews: 8 (confidence: 3), 7 (confidence: 5), 6 (confidence: 4) = rank of 944 / 1000 = top 6%

We frame Question Answering as a Reinforcement Learning task, an approach that we call Active Question Answering. We propose an agent that sits between the user and a black box question-answering system and which learns to reformulate questions to elicit the best possible answers. The agent probes the system with, potentially many, natural language reformulations of an initial question and aggregates the returned evidence to yield the best answer. The reformulation system is trained end-to-end to maximize answer quality using policy gradient. We evaluate on SearchQA, a dataset of complex questions extracted from Jeopardy!. Our agent improves F1 by 11.4% over a state-of-the-art base model that uses the original question/answer pairs. Based on a qualitative analysis of the language that the agent has learned while interacting with the question answering system, we propose that the agent has discovered basic information retrieval techniques such as term re-weighting and stemming.


Multi-level Residual Networks from Dynamical Systems View    

No tl;dr =[

Reviews: 7 (confidence: 4), 7 (confidence: 4), 7 (confidence: 3) = rank of 943 / 1000 = top 6%

Deep residual networks (ResNets) and their variants are widely used in many computer vision applications and natural language processing tasks. However, the theoretical principles for designing and training ResNets are still not fully understood. Recently, several points of view have emerged to try to interpret ResNet theoretically, such as unraveled view, unrolled iterative estimation and dynamical systems view. In this paper, we adopt the dynamical systems point of view, and analyze the lesioning properties of ResNet both theoretically and experimentally. Based on these analyses, we additionally propose a novel method for accelerating ResNet training. We apply the proposed method to train ResNets and Wide ResNets for three image classification benchmarks, reducing training time by more than 40% with superior or on-par accuracy.
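
The dynamical systems view in one line: a residual block x_{l+1} = x_l + F(x_l) is a forward-Euler step of the ODE dx/dt = F(x), so halving the step size while doubling the number of blocks tracks the same trajectory, which is the kind of structure a multi-level training scheme can exploit. A toy illustration:

    def euler(x, f, steps, h):
        # x_{l+1} = x_l + h * f(x_l), the residual-block recursion
        for _ in range(steps):
            x = x + h * f(x)
        return x

    f = lambda x: -x                         # toy dynamics dx/dt = -x
    print(euler(1.0, f, steps=4, h=0.25))    # 0.316...
    print(euler(1.0, f, steps=8, h=0.125))   # 0.343..., same ODE, finer "net"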


Self-ensembling for visual domain adaptation    

tl;dr Self-ensembling based label propagation algorithm for visual domain adaptation, won VisDA-2017 image classification domain adaptation challenge.

Reviews: 7 (confidence: 5), 7 (confidence: 4), 7 (confidence: 3) = rank of 942 / 1000 = top 6%

This paper explores the use of self-ensembling for visual domain adaptation problems. Our technique is derived from the mean teacher variant (Tarvainen et al., 2017) of temporal ensembling (Laine et al., 2017), a technique that achieved state-of-the-art results in semi-supervised learning. We introduce a number of modifications to their approach for challenging domain adaptation scenarios and evaluate its effectiveness. Our approach achieves state-of-the-art results in a variety of benchmarks, including our winning entry in the VISDA-2017 visual domain adaptation challenge. On small image benchmarks, our algorithm not only outperforms prior art, but can also achieve accuracy close to that of a classifier trained in a supervised fashion.
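
The heart of the mean-teacher variant is a one-line weight update; a sketch assuming teacher and student are torch.nn.Modules of identical architecture (a consistency loss between their predictions on unlabeled target images does the rest):

    import torch

    @torch.no_grad()
    def update_teacher(teacher, student, alpha=0.99):
        # teacher = exponential moving average of the student's weights
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(alpha).add_(s, alpha=1 - alpha)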


Deep Sensing: Active Sensing using Multi-directional Recurrent Neural Networks    

No tl;dr =[

Reviews: 8 (confidence: 4), 7 (confidence: 4), 6 (confidence: 3) = rank of 941 / 1000 = top 6%

For every prediction we might wish to make, we must decide what to observe (what source of information) and when to observe it. Because making observations is costly, this decision must trade off the value of information against the cost of observation. Making observations (sensing) should be an active choice. To solve the problem of active sensing, we develop a novel deep learning architecture: Deep Sensing. At training time, Deep Sensing learns how to issue predictions at various cost-performance points. To do this, it creates multiple representations at various performance levels associated with different measurement rates (costs). This requires learning how to estimate the value of real measurements vs. inferred measurements, which in turn requires learning how to infer missing (unobserved) measurements. To infer missing measurements, we develop a Multi-directional Recurrent Neural Network (M-RNN). An M-RNN differs from a bi-directional RNN in that it sequentially operates across streams in addition to within streams, and because the timing of inputs into the hidden layers is both lagged and advanced. At runtime, the operator prescribes a performance level or a cost constraint, and Deep Sensing determines what measurements to take and what to infer from those measurements, and then issues predictions. To demonstrate the power of our method, we apply it to two real-world medical datasets with significantly improved performance.


Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach    

tl;dr We propose the first attack-independent robustness metric, a.k.a. CLEVER, that can be applied to any neural network classifier.

Reviews: 7 (confidence: 3), 7 (confidence: 3), 7 (confidence: 1) = rank of 940 / 1000 = top 6%

The robustness of neural networks to adversarial examples has received great attention due to security implications. Despite various attack approaches to crafting visually imperceptible adversarial examples, little has been developed towards a comprehensive measure of robustness. In this paper, we provide theoretical justification for converting robustness analysis into a local Lipschitz constant estimation problem, and propose to use extreme value theory for efficient evaluation. Our analysis yields a novel robustness metric called CLEVER, which is short for Cross Lipschitz Extreme Value for nEtwork Robustness. The proposed CLEVER score is attack-agnostic and is computationally feasible for large neural networks. Experimental results on various networks, including ResNet, Inception-v3 and MobileNet, show that (i) CLEVER is aligned with the robustness indication measured by the $\ell_2$ and $\ell_\infty$ norms of adversarial examples from powerful attacks, and (ii) defended networks using defensive distillation or bounded ReLU indeed give better CLEVER scores. To the best of our knowledge, CLEVER is the first attack-independent robustness metric that can be applied to any neural network classifier.


Variational Message Passing with Structured Inference Networks    

tl;dr We propose a message-passing algorithm for models that contain both the deep model and probabilistic graphical model.

Reviews: 7 (confidence: 4), 7 (confidence: 3), 7 (confidence: 2) = rank of 939 / 1000 = top 7%

Powerful models require powerful algorithms to be effective on real-world problems. We propose such an algorithm for a class of models that combine deep models with probabilistic graphical models. Our algorithm is a natural-gradient message-passing algorithm whose messages automatically reduce to stochastic gradients for the deep components of the model. Using an inference network with special structure, our algorithm exploits the structural properties of the model to gain computational efficiency while retaining the simplicity and generality of deep-learning algorithms. By combining the strengths of two different types of inference procedures, our approach offers a framework that simultaneously enables structured, amortized, and natural-gradient inference for complex models.


Backpropagation through the Void: Optimizing control variates for black-box gradient estimation    

tl;dr We present a general method for unbiased estimation of gradients of black-box functions of random variables. We apply this method to discrete variational inference and reinforcement learning.

Reviews: 8 (confidence: 4), 7 (confidence: 3), 6 (confidence: 2) = rank of 938 / 1000 = top 7%

Gradient-based optimization is the foundation of deep learning and reinforcement learning. Even when the mechanism being optimized is unknown or not differentiable, optimization using high-variance or biased gradient estimates is still often the best strategy. We introduce a general framework for learning low-variance, unbiased gradient estimators for black-box functions of random variables, based on gradients of a learned function. These estimators can be jointly trained with model parameters or policies, and are applicable in both discrete and continuous settings. We give unbiased, adaptive analogs of state-of-the-art reinforcement learning methods such as advantage actor-critic. We also demonstrate this framework for training discrete latent-variable models.


Stochastic Variational Video Prediction    

tl;dr Stochastic variational video prediction in real-world settings.

Reviews: 7 (confidence: 5), 7 (confidence: 4), 7 (confidence: 4) = rank of 937 / 1000 = top 7%

Predicting the future in real-world settings, particularly from raw sensory observations such as images, is exceptionally challenging. Real-world events can be stochastic and unpredictable, and the high dimensionality and complexity of natural images require the predictive model to build an intricate understanding of the natural world. Many existing methods tackle this problem by making simplifying assumptions about the environment. One common assumption is that the outcome is deterministic and there is only one plausible future. This can lead to low-quality predictions in real-world settings with stochastic dynamics. In this paper, we develop a stochastic variational video prediction (SV2P) method that predicts a different possible future for each sample of its latent variables. To the best of our knowledge, our model is the first to provide effective stochastic multi-frame prediction for real-world video. We demonstrate the capability of the proposed method in predicting detailed future frames of videos on multiple real-world datasets, both action-free and action-conditioned. We find that our proposed method produces substantially improved video predictions when compared to the same model without stochasticity, and to other stochastic video prediction methods. Our SV2P implementation will be open sourced upon publication.


Active Learning for Convolutional Neural Networks: A Core-Set Approach    

tl;dr We approach the problem of active learning as a core-set selection problem and show that this approach is especially useful in the batch active learning setting, which is crucial when training CNNs.

Reviews: 7 (confidence: 4), 7 (confidence: 4), 7 (confidence: 3) = rank of 936 / 1000 = top 7%

Convolutional neural networks (CNNs) have been successfully applied to many recognition and learning tasks using a universal recipe: training a deep model on a very large dataset of supervised examples. However, this approach is rather restrictive in practice, since collecting a large set of labeled images is very expensive. One way to ease this problem is to come up with smart ways of choosing images to be labelled from a very large collection (i.e. active learning). Our empirical study suggests that many of the active learning heuristics in the literature are not effective when applied to CNNs in the batch setting. Inspired by these limitations, we define the problem of active learning as core-set selection, i.e. choosing a set of points such that a model learned over the selected subset is competitive for the remaining data points. We further present a theoretical result characterizing the performance of any selected subset using the geometry of the datapoints. As an active learning algorithm, we choose the subset which is expected to yield the best result according to our characterization. Our experiments show that the proposed method significantly outperforms existing approaches in image classification experiments.
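
The selection rule reduces to the classic greedy 2-approximation for k-center; a numpy sketch, where X would be the CNN's feature embeddings of the unlabeled pool:

    import numpy as np

    def k_center_greedy(X, labeled_idx, budget):
        # distance from each point to its nearest already-labeled point
        dist = np.linalg.norm(X[:, None] - X[labeled_idx][None], axis=2).min(1)
        chosen = []
        for _ in range(budget):
            i = int(np.argmax(dist))          # farthest point joins the centers
            chosen.append(i)
            dist = np.minimum(dist, np.linalg.norm(X - X[i], axis=1))
        return chosen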


DCN+: Mixed Objective And Deep Residual Coattention for Question Answering    

tl;dr We introduce the DCN+ with deep residual coattention and mixed-objective RL, which achieves state of the art performance on the Stanford Question Answering Dataset.

Reviews: 8 (confidence: 2), 7 (confidence: 4), 6 (confidence: 4) = rank of 935 / 1000 = top 7%

Traditional models for question answering optimize using cross entropy loss, which encourages exact answers at the cost of penalizing nearby or overlapping answers that are sometimes equally accurate. We propose a mixed objective that combines cross entropy loss with self-critical policy learning, using rewards derived from word overlap to solve the misalignment between evaluation metric and optimization objective. In addition to the mixed objective, we introduce a deep residual coattention encoder that is inspired by recent work in deep self-attention and residual networks. Our proposals improve model performance across question types and input lengths, especially for long questions that require the ability to capture long-term dependencies. On the Stanford Question Answering Dataset, our model achieves state of the art results with 75.1% exact match accuracy and 83.1% F1, while the ensemble obtains 78.9% exact match accuracy and 86.0% F1.


Detecting Statistical Interactions from Neural Network Weights    

tl;dr We develop a method of detecting statistical interactions in data by directly interpreting the trained weights of a feedforward multilayer neural network.

Reviews: 7 (confidence: 5), 7 (confidence: 4), 7 (confidence: 4) = rank of 934 / 1000 = top 7%

We develop a method of detecting statistical interactions in data by directly interpreting the trained weights of a feedforward multilayer neural network. By structuring the neural network according to statistical properties of the data and applying sparsity regularization, we are able to leverage the weights to detect interactions with performance similar to the state of the art, without searching an exponential solution space of possible interactions. We obtain our computational savings by first observing that interactions between input features are created by the non-additive effect of nonlinear activation functions and that interacting paths are encoded in weight matrices. We use these observations to develop a way of identifying higher-order interactions with a simple traversal over the input weight matrix. In experiments on simulated and real-world data, we demonstrate the performance of our method and the importance of discovered interactions.


Towards Image Understanding from Deep Compression Without Decoding    

No tl;dr =[

Reviews: 9 (confidence: 5), 6 (confidence: 4), 6 (confidence: 3) = rank of 933 / 1000 = top 7%

Motivated by recent work on deep neural network (DNN)-based image compression methods showing potential improvements in image quality, savings in storage, and bandwidth reduction, we propose to perform image understanding tasks such as classification and segmentation directly on the compressed representations produced by these compression methods. Since the encoders and decoders in DNN-based compression methods are neural networks with feature-maps as internal representations of the images, we directly integrate these with architectures for image understanding. This bypasses decoding of the compressed representation into RGB space and reduces computational cost. Our study shows that performance comparable to networks that operate on compressed RGB images can be achieved. Furthermore, we show that synergies are obtained by jointly training compression networks with classification networks on the compressed representations, improving image quality, classification accuracy, and segmentation performance. We find that inference from compressed representations is particularly advantageous compared to inference from compressed RGB images at high compression rates.


Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks    

No tl;dr =[

Reviews: 9 (confidence: 3), 6 (confidence: 4), 6 (confidence: 3) = rank of 932 / 1000 = top 7%

We consider the problem of detecting out-of-distribution images in neural networks. We propose ODIN, a simple and effective method that does not require any change to a pre-trained neural network. Our method is based on the observation that using temperature scaling and adding small perturbations to the input can separate the softmax score distributions of in- and out-of-distribution images, allowing for more effective detection. We show in a series of experiments that ODIN is compatible with diverse network architectures and datasets. It consistently outperforms the baseline approach by a large margin, establishing a new state-of-the-art performance on this task. For example, ODIN reduces the false positive rate from the baseline 34.7% to 4.3% on the DenseNet (applied to CIFAR-10) when the true positive rate is 95%.
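
Both ingredients fit in a short function; a sketch under PyTorch, with T and eps standing in for values the paper tunes per dataset:

    import torch
    import torch.nn.functional as F

    def odin_score(model, x, T=1000.0, eps=0.0014):
        x = x.clone().requires_grad_(True)
        log_probs = F.log_softmax(model(x) / T, dim=1)
        loss = F.nll_loss(log_probs, log_probs.argmax(dim=1))   # -log S_max
        loss.backward()
        x_pert = x - eps * x.grad.sign()   # nudge input toward higher confidence
        with torch.no_grad():
            score = F.softmax(model(x_pert) / T, dim=1).max(dim=1).values
        return score   # in-distribution inputs score higher; threshold below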


Learning to cluster in order to Transfer across domains and tasks    

tl;dr A learnable clustering objective to facilitate transfer learning across domains and tasks

Reviews: 9 (confidence: 5), 7 (confidence: 4), 5 (confidence: 4) = rank of 931 / 1000 = top 7%

This paper introduces a novel method to perform transfer learning across domains and tasks, formulating it as a problem of learning to cluster. The key insight is that, in addition to features, we can transfer similarity information and this is sufficient to learn a similarity function and clustering network to perform both domain adaptation and cross-task transfer learning. We begin by reducing categorical information to pairwise constraints, which only consider whether two instances belong to the same class or not (pairwise semantic similarity). This similarity is category-agnostic and can be learned from data in the source domain using a similarity network. We then present two novel approaches for performing transfer learning using this similarity function. First, for unsupervised domain adaptation, we design a new loss function to regularize classification with a constrained clustering loss, hence learning a clustering network with the transferred similarity metric generating the training inputs. Second, for cross-task learning (i.e., unsupervised clustering with unseen categories), we propose a framework to reconstruct and estimate the number of semantic clusters, again using the clustering network. Since the similarity network is noisy, the key is to use a robust clustering algorithm, and we show that our formulation is more robust than the alternative constrained and unconstrained clustering approaches. Using this method, we first show state-of-the-art results for the challenging cross-task problem, applied on Omniglot and ImageNet. Our results show that we can reconstruct semantic clusters with high accuracy. We then evaluate the performance of cross-domain transfer using images from the Office-31 and SVHN-MNIST tasks, and present top accuracy on both datasets. Our approach does not explicitly deal with domain discrepancy; if we combine it with a domain adaptation loss, it shows further improvement.


LEARNING TO SHARE: SIMULTANEOUS PARAMETER TYING AND SPARSIFICATION IN DEEP LEARNING    

tl;dr We have proposed using the recent GrOWL regularizer for simultaneous parameter sparsity and tying in DNN learning.

Reviews: 8 (confidence: 5), 7 (confidence: 4), 6 (confidence: 3) = rank of 930 / 1000 = top 7%

Deep neural networks (DNNs) usually contain millions, maybe billions, of parameters/weights, making both storage and computation very expensive. Moreover, although this complexity allows DNNs to learn complex mappings, it may also allow them to over-fit, which has motivated the use of regularization methods, e.g., by encouraging weight sparsity. Another well-known approach for controlling the complexity of DNNs is parameter sharing/tying, where certain sets of weights are forced to share a common value. Some forms of weight sharing are hard-wired to express certain invariances, with a notable example being the shift-invariance of convolutional layers. However, there may be other groups of weights that may be tied together during the learning process, thus further reducing the complexity of the network. In this paper, we adopt a recently proposed sparsity-inducing regularizer, named GrOWL (group ordered weighted l1), which encourages sparsity and, simultaneously, learns which groups of parameters should share a common value. GrOWL has been proven effective in linear regression, being able to identify and cope with strongly correlated covariates. Unlike standard sparsity-inducing regularizers (e.g., l1, a.k.a. Lasso), GrOWL not only eliminates unimportant neurons by setting all the corresponding weights to zero, but also explicitly identifies strongly correlated neurons by tying the corresponding weights to a common value. This ability of GrOWL motivates the following two-stage procedure: (i) use GrOWL regularization during training to simultaneously identify significant neurons and groups of parameters that should be tied together; (ii) retrain the network, enforcing the structure unveiled in the previous phase, i.e., keeping only the significant neurons and enforcing the learned tying structure. We evaluate the proposed approach on several benchmark datasets, showing that it can dramatically compress the network with slight or even no loss in generalization performance.


Variance Reduction for Policy Gradient Methods with Action-Dependent Baselines    

tl;dr Action-dependent baselines can be bias-free and yield greater variance reduction than state-only dependent baselines for policy gradient methods.

Reviews: 8 (confidence: 3), 7 (confidence: 4), 6 (confidence: 4) = rank of 929 / 1000 = top 8%

Policy gradient methods have enjoyed success in deep reinforcement learning but suffer from high variance of gradient estimates. The high variance problem is particularly exacerbated in problems with long horizons or high dimensional action spaces. To mitigate this issue, we derive an action-dependent baseline for variance reduction which fully exploits the structural form of the stochastic policy itself, and does not make any additional assumptions about the MDP. We demonstrate and quantify the benefit of the action-dependent baseline both through theoretical analysis as well as numerical results. Our experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks as well as on high dimensional manipulation and multi-agent communication tasks.


Understanding Short-Horizon Bias in Stochastic Meta-Optimization    

tl;dr We investigate the bias in the short-horizon meta-optimization objective.

Reviews: 8 (confidence: 3), 7 (confidence: 4), 6 (confidence: 4) = rank of 928 / 1000 = top 8%

Careful tuning of the learning rate, or even schedules thereof, can be crucial to effective neural net training. There has been much recent interest in gradient-based meta-optimization, where one tunes hyperparameters, or even learns an optimizer, in order to minimize the expected loss when the training procedure is unrolled. But because the training procedure must be unrolled thousands of times, the meta-objective must be defined with an orders-of-magnitude shorter time horizon than is typical for neural net training. We show that such short-horizon meta-objectives cause a serious bias towards small step sizes, an effect we term short-horizon bias. We introduce a toy problem, a noisy quadratic cost function, on which we analyze short-horizon bias by deriving and comparing the optimal schedules for short and long time horizons. We then run meta-optimization experiments (both offline and online) on standard benchmark datasets, showing that meta-optimization chooses too small a learning rate by multiple orders of magnitude, even when run with a moderately long time horizon (100 steps) typical of work in the area. We believe short-horizon bias is a fundamental problem that needs to be addressed if meta-optimization is to scale to practical neural net training regimes.


Variational image compression with a scale hyperprior    

No tl;dr =[

Reviews: 7 (confidence: 5), 7 (confidence: 5), 7 (confidence: 4) = rank of 927 / 1000 = top 8%

We describe an end-to-end trainable model for image compression based on variational autoencoders. The model incorporates a hyperprior into the generative model to effectively capture spatial dependencies in the latent representation. This hyperprior relates to side information, a concept universal to virtually all modern image codecs but largely unexplored for image compression with artificial neural networks (ANNs). Unlike existing autoencoder compression methods, our model trains a powerful entropy model jointly with the underlying autoencoder. We demonstrate that this model leads to state-of-the-art image compression when measuring visual quality using the popular MS-SSIM index, and yields rate-distortion performance surpassing published ANN-based methods when evaluated using a more traditional metric based on squared error (PSNR).


Learning Approximate Inference Networks for Structured Prediction    

No tl;dr =[

Reviews: 9 (confidence: 4), 7 (confidence: 5), 5 (confidence: 3) = rank of 926 / 1000 = top 8%

Structured prediction energy networks (SPENs; Belanger & McCallum 2016) use neural network architectures to define energy functions that can capture arbitrary dependencies among parts of structured outputs. Prior work used gradient descent for inference, relaxing the structured output to a set of continuous variables and then optimizing the energy with respect to them. We replace this use of gradient descent with a neural network trained to approximate structured argmax inference. This “inference network” outputs continuous values that we treat as the output structure. We develop large-margin training criteria for joint training of the structured energy function and inference network. On multi-label classification we report speed-ups of 10-60x compared to (Belanger et al., 2017) while also improving accuracy. For sequence labeling with simple structured energies, our approach performs comparably to exact inference while being much faster at test time. We also show how inference networks can replace dynamic programming for test-time inference in conditional random fields, suggesting their general use for fast inference in structured settings.


Deep Learning with Logged Bandit Feedback    

tl;dr The paper proposes a new output layer for deep networks that permits the use of logged contextual bandit feedback for training.

Reviews: 8 (confidence: 3), 7 (confidence: 4), 6 (confidence: 3) = rank of 925 / 1000 = top 8%

We propose a new output layer for deep networks that permits the use of logged contextual bandit feedback for training. Such contextual bandit feedback can be available in huge quantities (e.g., logs of search engines, recommender systems) at little cost, opening up a path for training deep networks on orders of magnitude more data. To this effect, we propose a counterfactual risk minimization approach for training deep networks using an equivariant empirical risk estimator with variance regularization, BanditNet, and show how the resulting objective can be decomposed in a way that allows stochastic gradient descent training. We empirically demonstrate the effectiveness of the method in two scenarios. First, we show how deep networks -- ResNets in particular -- can be trained for object recognition without conventionally labeled images. Second, we learn to place banner ads based on propensity-logged click logs, where BanditNet substantially improves on the state-of-the-art.


Monotonic Chunkwise Attention    

tl;dr An online and linear-time attention mechanism that performs soft attention over adaptively-located chunks of the input sequence.

Reviews: 8 (confidence: 4), 7 (confidence: 5), 6 (confidence: 4) = rank of 924 / 1000 = top 8%

Sequence-to-sequence models with soft attention have been successfully applied to a wide variety of problems, but their decoding process incurs a quadratic time and space cost and is inapplicable to real-time sequence transduction. To address these issues, we propose Monotonic Chunkwise Attention (MoChA), which adaptively splits the input sequence into small chunks over which soft attention is computed. We show that models utilizing MoChA can be trained efficiently with standard backpropagation while allowing online and linear-time decoding at test time. When applied to online speech recognition, we obtain state-of-the-art results and match the performance of an offline soft attention model. In document summarization experiments where we do not expect monotonic alignments, we show significantly improved performance compared to a baseline monotonic attention model.


Learning Latent Permutations with Gumbel-Sinkhorn Networks    

tl;dr A new method for gradient-descent inference of permutations, with applications to latent matching inference and supervised learning of permutations with neural networks

Reviews: 8 (confidence: 4), 7 (confidence: 4), 6 (confidence: 2) = rank of 923 / 1000 = top 8%

Permutations and matchings are core building blocks in a variety of latent variable models, as they allow us to align, canonicalize, and sort data. Learning in such models is difficult, however, because exact marginalization over these combinatorial objects is intractable. In response, this paper introduces a collection of new methods for end-to-end learning in such models that approximate discrete maximum-weight matching using the continuous Sinkhorn operator. Sinkhorn iteration is attractive because it functions as a simple, easy-to-implement analog of the softmax operator. With this, we can define the Gumbel-Sinkhorn method, an extension of the Gumbel-Softmax method (Jang et al., 2016; Maddison et al., 2016) to distributions over latent matchings. We demonstrate the effectiveness of our method by outperforming competitive baselines on a range of qualitatively different tasks: sorting numbers, solving jigsaw puzzles, and identifying neural signals in worms.
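
The Sinkhorn operator itself is a few lines of alternating row and column normalization; adding Gumbel noise and a temperature gives the relaxation over permutations (a numpy sketch):

    import numpy as np
    from scipy.special import logsumexp

    def sinkhorn(log_alpha, n_iters=20):
        # alternate row/column normalization in log space; the output
        # approaches a doubly stochastic matrix
        for _ in range(n_iters):
            log_alpha = log_alpha - logsumexp(log_alpha, axis=1, keepdims=True)
            log_alpha = log_alpha - logsumexp(log_alpha, axis=0, keepdims=True)
        return np.exp(log_alpha)

    def gumbel_sinkhorn(log_alpha, tau=1.0, n_iters=20):
        # as tau -> 0, samples concentrate on permutation matrices
        g = -np.log(-np.log(np.random.uniform(size=log_alpha.shape)))
        return sinkhorn((log_alpha + g) / tau, n_iters)

    P = gumbel_sinkhorn(np.random.randn(5, 5), tau=0.1, n_iters=50)
    print(P.sum(axis=0).round(3), P.sum(axis=1).round(3))   # ~ all ones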


A Neural Representation of Sketch Drawings    

tl;dr We investigate alternatives to traditional pixel image modelling approaches, and propose a generative model for vector images.

Reviews: 8 (confidence: 4), 8 (confidence: 4), 5 (confidence: 4) = rank of 922 / 1000 = top 8%

We present sketch-rnn, a recurrent neural network able to construct stroke-based drawings of common objects. The model is trained on a dataset of human-drawn images representing many different classes. We outline a framework for conditional and unconditional sketch generation, and describe new robust training methods for generating coherent sketch drawings in a vector format. We demonstrate that our representation is useful for computer-aided artistic work such as conditional generation, generating multiple outcomes from a partial drawing, and morphing a drawing from one class to another.


Generative Models of Visually Grounded Imagination    

tl;dr A VAE-variant which can create diverse images corresponding to novel concrete or abstract "concepts" described using attribute vectors.

Reviews: 7 (confidence: 4), 7 (confidence: 3), 7 (confidence: 3) = rank of 921 / 1000 = top 8%

It is easy for people to imagine what a man with pink hair looks like, even if they have never seen such a person before. We call the ability to create images of novel semantic concepts visually grounded imagination. In this paper, we show how we can modify variational auto-encoders to perform this task. Our method uses a novel training objective, and a novel product-of-experts inference network, which can handle partially specified (abstract) concepts in a principled and efficient way. We also propose a set of easy-to-compute evaluation metrics that capture our intuitive notions of what it means to have good visual imagination, namely correctness, coverage, and compositionality (the 3 C’s). Finally, we perform a detailed comparison of our method with two existing joint image-attribute VAE methods (the JMVAE method of (Suzuki et al., 2017) and the BiVCCA method of (Wang et al., 2016)) by applying them to two datasets: the MNIST-with-attributes dataset (which we introduce here), and the CelebA dataset (Liu et al., 2015).
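
The product-of-experts piece has a closed form for Gaussian experts: precisions add, and the mean is precision-weighted. A sketch (the model typically also includes a standard-normal prior expert; an unspecified attribute simply contributes no expert):

    import numpy as np

    def product_of_gaussians(mus, logvars):
        prec = np.exp(-np.asarray(logvars))           # one row per expert
        var = 1.0 / prec.sum(axis=0)
        mu = var * (np.asarray(mus) * prec).sum(axis=0)
        return mu, var

    # a confident expert (low variance) dominates the fused belief
    mu, var = product_of_gaussians(
        mus=[np.zeros(3), np.ones(3)],
        logvars=[np.zeros(3), np.full(3, -4.0)])
    print(mu.round(3), var.round(3))   # mu ~ 0.98, var ~ 0.018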


A Scalable Laplace Approximation for Neural Networks    

tl;dr We construct a Kronecker factored Laplace approximation for neural networks that leads to an efficient matrix normal distribution over the weights.

Reviews: 9 (confidence: 4), 6 (confidence: 4), 6 (confidence: 4) = rank of 920 / 1000 = top 8%

We leverage recent insights from second-order optimisation for neural networks to construct a Kronecker factored Laplace approximation to the posterior over the weights of a trained network. Our approximation requires no modification of the training procedure, enabling practitioners to estimate the uncertainty of their models currently used in production without having to retrain them. We extensively compare our method to using Dropout and a diagonal Laplace approximation for estimating the uncertainty of a network. We demonstrate that our Kronecker factored method leads to better uncertainty estimates on out-of-distribution data and is more robust to simple adversarial attacks. Our approach only requires calculating two square curvature factor matrices for each layer. Their size is equal to the respective square of the input and output size of the layer, making the method efficient both computationally and in terms of memory usage. We illustrate its scalability by applying it to a state-of-the-art convolutional network architecture.


Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality    

tl;dr We characterize the intrinsic dimensional property of adversarial subspaces in the neighborhood of adversarial examples, via the use of Local Intrinsic Dimensionality (LID) and empirically show that such characteristics can discriminate adversarial examples effectively.

Reviews: 8 (confidence: 3), 7 (confidence: 4), 6 (confidence: 1) = rank of 919 / 1000 = top 9%

Deep Neural Networks (DNNs) have recently been shown to be vulnerable against adversarial examples, which are carefully crafted instances that can mislead DNNs to make errors during prediction. To better understand such attacks, the properties of subspaces in the neighborhood of adversarial examples need to be characterized. In particular, effective measures are required to discriminate adversarial examples from normal examples in such subspaces. We tackle this challenge by characterizing the intrinsic dimensional property of adversarial subspaces, via the use of Local Intrinsic Dimensionality (LID). LID assesses the space-filling capability of the subspace surrounding a reference example, based on the distance distribution of the example to its neighbors. We first provide explanations about how adversarial perturbation affects the LID characteristic of adversarial subspaces. Then, we explain how the LID characteristic can be used to discriminate adversarial examples generated using state-of-the-art attacks. We empirically show that the LID characteristic can outperform several state-of-the-art detection measures by large margins for five attacks across three benchmark datasets. Our analysis of the LID characteristic for adversarial subspaces not only motivates new directions of effective adversarial defense but also opens up more challenges for developing new attacks to better understand the vulnerabilities of DNNs.
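
The LID estimate itself is the classic maximum-likelihood (Hill-type) estimator over nearest-neighbor distances; a numpy sketch (detection then trains a simple classifier on per-layer LID features):

    import numpy as np

    def lid_mle(x, batch, k=20):
        # MLE of local intrinsic dimensionality from k-NN distances
        d = np.sort(np.linalg.norm(batch - x, axis=1))
        r = d[d > 0][:k]                    # drop x itself if it is in the batch
        return -1.0 / np.mean(np.log(r / r[-1]))

    rng = np.random.default_rng(0)
    batch = rng.normal(size=(10000, 10))    # Gaussian cloud in R^10
    print(lid_mle(batch[0], batch, k=100))  # roughly the dimension, ~10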


Emergent Communication in a Multi-Modal, Multi-Step Referential Game    

No tl;dr =[

Reviews: 7 (confidence: 4), 7 (confidence: 4), 7 (confidence: 3) = rank of 918 / 1000 = top 9%

Inspired by previous work on emergent communication in referential games, we propose a novel multi-modal, multi-step referential game, where the sender and receiver have access to distinct modalities of an object, and their information exchange is bidirectional and of arbitrary duration. The multi-modal multi-step setting allows agents to develop an internal communication significantly closer to natural language, in that they share a single set of messages, and that the length of the conversation may vary according to the difficulty of the task. We examine these properties empirically using a dataset consisting of images and textual descriptions of mammals, where the agents are tasked with identifying the correct object. Our experiments indicate that a robust and efficient communication protocol emerges, where gradual information exchange informs better predictions and higher communication bandwidth improves generalization.


Model compression via distillation and quantization    

No tl;dr =[

Reviews: 8 (confidence: 5), 7 (confidence: 2), 6 (confidence: 4) = rank of 917 / 1000 = top 9%

Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation or reinforcement learning. One aspect of the field receiving considerable attention is efficiently executing deep models in resource-constrained environments, such as mobile or embedded devices. This paper focuses on this problem, and proposes two new compression methods, which jointly leverage weight quantization and distillation of larger teacher networks into smaller student networks. The first method we propose is called quantized distillation and leverages distillation during the training process, by incorporating distillation loss, expressed with respect to the teacher, into the training of a student network whose weights are quantized to a limited set of levels. The second method, differentiable quantization, optimizes the location of quantization points through stochastic gradient descent, to better fit the behavior of the teacher model. We validate both methods through experiments on convolutional and recurrent architectures. We show that quantized shallow students can reach similar accuracy levels to full-precision teacher models, while providing order of magnitude compression, and inference speedup that is linear in the depth reduction. In sum, our results enable DNNs for resource-constrained environments to leverage architecture and accuracy advances developed on more powerful devices.


PixelDefend: Leveraging Generative Models to Understand and Defend against Adversarial Examples    

No tl;dr =[

Reviews: 7 (confidence: 4), 7 (confidence: 4), 7 (confidence: 4) = rank of 916 / 1000 = top 9%

Adversarial perturbations of normal images are usually imperceptible to humans, but they can seriously confuse state-of-the-art machine learning models. What makes them so special in the eyes of image classifiers? In this paper, we show empirically that adversarial examples mainly lie in the low-probability regions of the training distribution, regardless of attack type and targeted model. Using statistical hypothesis testing, we find that modern neural density models are surprisingly good at detecting imperceptible image perturbations. Based on this discovery, we devise PixelDefend, a new approach that purifies a maliciously perturbed image by moving it back towards the distribution seen in the training data. The purified image is then run through an unmodified classifier, making our method agnostic to both the classifier and the attacking method. As a result, PixelDefend can be used to protect already deployed models and can be combined with other model-specific defenses. Experiments show that our method greatly improves resilience across a wide variety of state-of-the-art attacking methods, increasing accuracy on the strongest attack from 63% to 84% for Fashion MNIST and from 32% to 70% for CIFAR-10.
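
A rough sketch of the purification idea, assuming a PixelCNN-style density model exposed through a hypothetical cond_probs(z, i) callback returning the 256-way conditional distribution of pixel i given the pixels already fixed; the greedy raster-order sweep inside an eps-ball follows the description above, but implementation details differ from the paper's.

import numpy as np

def purify(x, cond_probs, eps=16):
    x = x.astype(np.int32)                        # flat image, values in [0, 255]
    z = x.copy()
    for i in range(z.size):
        lo, hi = max(0, x[i] - eps), min(255, x[i] + eps)
        p = cond_probs(z, i)                      # length-256 conditional p(z_i | z_<i)
        z[i] = lo + int(np.argmax(p[lo:hi + 1]))  # most likely value inside the eps-ball
    return z.astype(np.uint8)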


Learning Wasserstein Embeddings    

tl;dr We show that it is possible to quickly approximate Wasserstein distance computations by finding an appropriate embedding where the Euclidean distance emulates the Wasserstein distance

Reviews: 7 (confidence: 4), 7 (confidence: 3), 7 (confidence: 3) = rank of 915 / 1000 = top 9%

The Wasserstein distance has recently received a lot of attention in the machine learning community, especially for its principled way of comparing distributions. It has found numerous applications in several hard problems, such as domain adaptation, dimensionality reduction, and generative models. However, its use is still limited by a heavy computational cost. Our goal is to alleviate this problem by providing an approximation mechanism that breaks its inherent complexity. It relies on finding an embedding in which the Euclidean distance mimics the Wasserstein distance. We show that such an embedding can be found with a siamese architecture paired with a decoder network that allows us to map from the embedding space back to the original input space. Once this embedding has been found, optimization problems in the Wasserstein space (e.g. barycenters, principal directions or even archetypes) can be solved extremely fast. Numerical experiments supporting this idea are conducted on image datasets, and show the wide potential benefits of our method.
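
A minimal sketch of the siamese regression described above: an embedding network phi is trained so that Euclidean distances between embeddings match precomputed Wasserstein distances of input pairs. The network sizes are placeholders, and the decoder that maps back to input space is omitted.

import torch
import torch.nn as nn

# Hypothetical embedding network; input/output sizes are placeholders
phi = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 50))

def embedding_loss(x1, x2, w_dist):
    # Euclidean distance between embeddings, regressed onto the precomputed
    # Wasserstein distance of each input pair (decoder term omitted)
    d = torch.norm(phi(x1) - phi(x2), dim=1)
    return ((d - w_dist) ** 2).mean()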


Routing Networks: Adaptive Selection of Non-Linear Functions for Multi-Task Learning    

tl;dr routing networks: a new kind of neural network which learns to adaptively route its input for multi-task learning

Reviews: 8 (confidence: 4), 7 (confidence: 3), 6 (confidence: 3) = rank of 914 / 1000 = top 9%

Multi-task learning (MTL) with neural networks leverages commonalities in tasks to improve performance, but often suffers from task interference, which reduces transfer. To address this issue we introduce the routing network paradigm, a novel neural network unit and training algorithm. A routing network is a kind of self-organizing neural network consisting of two components: a router and a set of one or more function blocks. A function block may be any neural network – for example a fully-connected or a convolutional layer. Given an input, the router makes a routing decision, choosing a function block to apply and passing the output back to the router recursively, terminating when the router decides to stop or a fixed recursion depth is reached. In this way the routing network dynamically composes different function blocks for each input. We employ a collaborative multi-agent reinforcement learning (MARL) approach to jointly train the router and function blocks. We evaluate our model on multi-task settings of the MNIST, mini-ImageNet, and CIFAR-100 datasets. Our experiments demonstrate a significant improvement in accuracy, with sharper convergence, over both single-task and joint-training baselines for these tasks.
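
A hedged sketch of the routing idea, with a greedy router standing in for the paper's RL-trained one; the block shapes, the recursion depth, and the extra "stop" logit are assumptions for illustration only.

import torch
import torch.nn as nn

class RoutingNet(nn.Module):
    def __init__(self, dim=128, n_blocks=4, max_depth=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_blocks))
        self.router = nn.Linear(dim, n_blocks + 1)   # last logit = "stop" action
        self.stop, self.max_depth = n_blocks, max_depth

    def forward(self, h):
        for _ in range(self.max_depth):
            choices = self.router(h).argmax(dim=-1).tolist()  # greedy routing decision
            if all(c == self.stop for c in choices):          # every input chose "stop"
                break
            # route each example through its chosen block (batched naively)
            h = torch.stack([h[i] if c == self.stop else self.blocks[c](h[i])
                             for i, c in enumerate(choices)])
        return h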


Deep Learning as a Mixed Convex-Combinatorial Optimization Problem    

tl;dr We learn deep networks of hard-threshold units by setting hidden-unit targets using combinatorial optimization and weights by convex optimization, resulting in improved performance on ImageNet.

Reviews: 7 (confidence: 4), 7 (confidence: 4), 7 (confidence: 3) = rank of 913 / 1000 = top 9%

As neural networks grow deeper and wider, learning networks with hard-threshold activations is becoming increasingly important, both for network quantization, which can drastically reduce time and energy requirements, and for creating large integrated systems of deep networks, which may have non-differentiable components and must avoid vanishing and exploding gradients for effective learning. However, since gradient descent is not applicable to hard-threshold functions, it is not clear how to learn them in a principled way. We address this problem by observing that setting targets for hard-threshold hidden units in order to minimize loss is a discrete optimization problem, and can be solved as such. The discrete optimization goal is to find a set of targets such that each unit, including the output, has a linearly separable problem to solve. Given these targets, the network decomposes into individual perceptrons, which can then be learned with standard convex approaches. Based on this, we develop a recursive mini-batch algorithm for learning deep hard-threshold networks that includes the popular but poorly justified straight-through estimator as a special case. Empirically, we show that our algorithm improves classification accuracy in a number of settings, including for AlexNet and ResNet-18 on ImageNet, when compared to the straight-through estimator.
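
For reference, a minimal PyTorch sketch of the straight-through estimator that the paper recovers as a special case: the forward pass applies the hard threshold, while the backward pass lets gradients through in the non-saturated region. This is the baseline estimator, not the paper's target-setting algorithm.

import torch

class SignSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)                      # hard threshold in the forward pass

    @staticmethod
    def backward(ctx, g):
        x, = ctx.saved_tensors
        return g * (x.abs() <= 1).float()         # pass gradients through, saturated

hard_threshold = SignSTE.apply                    # drop-in hard-threshold activation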


Emergence of Linguistic Communication from Referential Games with Symbolic and Pixel Input    

tl;dr A controlled study of the role of environments with respect to properties in emergent communication protocols.

Reviews: 9 (confidence: 5), 7 (confidence: 4), 5 (confidence: 4) = rank of 912 / 1000 = top 9%

Emergent communication problems can be used to study the ability of algorithms to evolve or learn communication protocols. In this work, we study the properties of protocols emerging when reinforcement learning agents are trained end-to-end on referential communication games. We extend previous work using symbolic representations to using raw pixel input data, a more challenging and realistic input representation. We find that the degree of structure found in the input data affects the nature of the emerged protocols, and thereby corroborate the hypothesis that structured language is most likely to emerge when agents perceive the world as being structured.


The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning    

tl;dr Reactor combines multiple algorithmic and architectural contributions to produce an agent with higher sample-efficiency than Prioritized Dueling DQN while giving better run-time performance than A3C.

Reviews: 7 (confidence: 4), 7 (confidence: 4), 7 (confidence: 2) = rank of 911 / 1000 = top 9%

In this work we present a new agent architecture, called Reactor, which combines multiple algorithmic and architectural contributions to produce an agent with higher sample-efficiency than Prioritized Dueling DQN (Wang et al., 2016) and Categorical DQN (Bellemare et al., 2017), while giving better run-time performance than A3C (Mnih et al., 2016). Our first contribution is a new policy evaluation algorithm called Distributional Retrace, which brings multi-step off-policy updates to the distributional reinforcement learning setting. The same approach can be used to convert several classes of multi-step policy evaluation algorithms designed for expected value evaluation into distributional ones. Next, we introduce the β-leave-one-out policy gradient algorithm, which improves the trade-off between variance and bias by using action values as a baseline. Our final algorithmic contribution is a new prioritized replay algorithm for sequences, which exploits the temporal locality of neighboring observations for more efficient replay prioritization. Using the Atari 2600 benchmarks, we show that each of these innovations contributes to both the sample efficiency and final agent performance. Finally, we demonstrate that Reactor reaches state-of-the-art performance after 200 million frames and less than a day of training.
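
One reading of the β-leave-one-out estimator, written as a surrogate loss whose gradient is β(R − Q(â))∇π(â) + Σ_a Q(a)∇π(a); this is an interpretation of the description above for a single step, not a verified reimplementation of the paper's agent.

import torch

def beta_loo_loss(logits, q_values, action, ret, beta=1.0):
    # pi over actions; critic values treated as constants for the policy update
    pi = torch.softmax(logits, dim=-1)
    q = q_values.detach()
    advantage = (ret - q[action]).detach()   # return replaces Q for the taken action
    # surrogate whose gradient is beta*(R - Q(a^))*grad pi(a^) + sum_a Q(a)*grad pi(a)
    surrogate = beta * advantage * pi[action] + (q * pi).sum()
    return -surrogate                        # minimized by gradient descent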


Smooth Loss Functions for Deep Top-k Classification    

tl;dr Smooth Loss Function for Top-k Error Minimization

Reviews: 8 (confidence: 4), 7 (confidence: 4), 6 (confidence: 5) = rank of 910 / 1000 = top 9%

Human labeling of data constitutes a long and expensive process. As a consequence, many classification tasks entail incomplete annotation and incorrect labels, while being built on a restricted amount of data. In order to handle the ambiguity and the label noise, the performance of machine learning models is usually assessed with top-$k$ error metrics rather than top-$1$. Theoretical results suggest that to minimize this error, various loss functions, including cross-entropy, are equally optimal choices of learning objectives in the limit of infinite data. However, the choice of loss function becomes crucial in the context of limited and noisy data. Moreover, our empirical evidence suggests that the loss function must be smooth and non-sparse to work well with deep neural networks. Consequently, we introduce a family of smoothed loss functions that are suited to top-$k$ optimization via deep learning. The widely used cross-entropy is a special case of our family. Evaluating our smooth loss functions is computationally challenging: a naïve algorithm would require $\mathcal{O}(\binom{C}{k})$ operations, where $C$ is the number of classes. Thanks to a connection to polynomial algebra and a divide-and-conquer approach, we provide an algorithm with a time complexity of $\mathcal{O}(k C)$. Furthermore, we present a novel and error-bounded approximation to obtain fast and stable algorithms on GPUs in single floating-point precision. We compare the performance of the cross-entropy loss and our margin-based losses in various regimes of noise and data size. Our investigation reveals that our loss provides on-par performance with cross-entropy for $k=1$, and is more robust to noise and overfitting for $k=5$.
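
The $\mathcal{O}(kC)$ algorithm rests on evaluating elementary symmetric polynomials of the (exponentiated) scores, whose naive evaluation is the $\mathcal{O}(\binom{C}{k})$ cost mentioned above. A minimal dynamic-programming sketch of that core quantity follows; the paper itself uses a divide-and-conquer variant for GPU-friendly numerical stability.

import numpy as np

def elementary_symmetric(x, k):
    # e[j] = sum over all j-subsets of x of the product of their entries
    e = np.zeros(k + 1)
    e[0] = 1.0
    for xi in x:                         # add one variable at a time: O(kC) total
        for j in range(k, 0, -1):        # descend so each xi is used at most once
            e[j] += xi * e[j - 1]
    return e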


Unbiased Online Recurrent Optimization    

tl;dr Introduces an online, unbiased and easily implementable gradient estimate for recurrent models.

Reviews: 8 (confidence: 5), 7 (confidence: 4), 6 (confidence: 4) = rank of 909 / 1000 = top 10%

The novel \emph{Unbiased Online Recurrent Optimization} (UORO) algorithm allows for online learning of general recurrent computational graphs such as recurrent network models. It works in a streaming fashion and avoids backtracking through past activations and inputs. UORO is computationally as costly as \emph{Truncated Backpropagation Through Time} (truncated BPTT), a widespread algorithm for online learning of recurrent networks \cite{jaeger2002tutorial}. UORO is a modification of \emph{NoBackTrack} \cite{DBLP:journals/corr/OllivierC15} that bypasses the need for model sparsity and makes implementation easy in current deep learning frameworks, even for complex models. Like NoBackTrack, UORO provides unbiased gradient estimates; unbiasedness is the core hypothesis in stochastic gradient descent theory, without which convergence to a local optimum is not guaranteed. In contrast, truncated BPTT does not provide this property, which can lead to divergence. On synthetic tasks where truncated BPTT is shown to diverge, UORO converges. For instance, when a parameter has a positive short-term but negative long-term influence, truncated BPTT diverges unless the truncation span is significantly longer than the intrinsic temporal range of the interactions, while UORO performs well thanks to the unbiasedness of its gradients.


Compressing Word Embeddings via Deep Compositional Code Learning    

tl;dr Compressing the word embeddings over 94% without hurting the performance.

Reviews: 8 (confidence: 4), 7 (confidence: 4), 6 (confidence: 4) = rank of 908 / 1000 = top 10%

Natural language processing (NLP) models often require a massive number of parameters for word embeddings, resulting in a large storage or memory footprint. Deploying neural NLP models to mobile devices requires compressing the word embeddings without any significant sacrifice in performance. For this purpose, we propose to construct the embeddings with few basis vectors. For each word, the composition of basis vectors is determined by a hash code. To maximize the compression rate, we adopt a multi-codebook quantization approach instead of a binary coding scheme. Each code is composed of multiple discrete numbers, such as (3, 2, 1, 8), where the value of each component is limited to a fixed range. We propose to directly learn the discrete codes in an end-to-end neural network by applying the Gumbel-softmax trick. Experiments show the compression rate reaches 98% in a sentiment analysis task and 94% to 99% in machine translation tasks without performance loss. In both tasks, the proposed method can improve model performance by slightly lowering the compression rate. Compared to other approaches such as character-level segmentation, the proposed method is language-independent and does not require modifications to the network architecture.
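
A minimal sketch of the code-to-embedding composition; the sizes M, K, and dim and the example code are illustrative, and the Gumbel-softmax machinery that learns the codes end-to-end is omitted.

import numpy as np

M, K, dim = 32, 16, 300                     # M codebooks with K basis vectors each
codebooks = np.random.randn(M, K, dim)      # learned basis vectors in the real model
code = np.array([3, 2, 1, 8] * 8)           # a word's discrete code, one entry per codebook

# The word vector is the sum of one selected basis vector per codebook
embedding = codebooks[np.arange(M), code].sum(axis=0)
# Storage per word: M * log2(K) = 128 bits instead of dim float32 values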


Lifelong Learning with Dynamically Expandable Networks    

tl;dr We propose a novel deep network architecture that can dynamically decide its network capacity as it trains on a lifelong learning scenario.

Reviews: 8 (confidence: 2), 7 (confidence: 3), 6 (confidence: 3) = rank of 907 / 1000 = top 10%

We propose a novel deep network architecture for lifelong learning, which we refer to as the Dynamically Expandable Network (DEN), that can dynamically decide its network capacity as it trains on a sequence of tasks, learning a compact, overlapping knowledge-sharing structure among tasks. DEN is efficiently trained in an online manner by performing selective retraining, dynamically expands network capacity upon arrival of each task with only the necessary number of units, and effectively prevents semantic drift by splitting/duplicating units and timestamping them. We validate DEN on multiple public datasets in lifelong learning scenarios, on which it not only significantly outperforms existing lifelong learning methods for deep networks, but also achieves the same level of performance as the batch model with substantially fewer parameters.


On the importance of single directions for generalization    

tl;dr We find that deep networks which generalize poorly are more reliant on single directions than those that generalize well, and evaluate the impact of dropout and batch normalization, as well as class selectivity on single direction reliance.

Reviews: 9 (confidence: 3), 7 (confidence: 3), 5 (confidence: 4) = rank of 906 / 1000 = top 10%

Despite their ability to memorize large datasets, deep neural networks often achieve good generalization performance. However, the differences between the learned solutions of networks which generalize and those which do not remain unclear. Here, we demonstrate that a network's reliance on single directions in activation space is a good predictor of its generalization performance, across networks trained on datasets with different fractions of corrupted labels, across ensembles of networks trained on datasets with unmodified labels, and over the course of training. While dropout only regularizes this quantity up to a point, batch normalization implicitly discourages single direction reliance, in part by decreasing the class selectivity of individual units. Finally, we find that class selectivity is a poor predictor of task importance, suggesting not only that networks which generalize well minimize their dependence on individual units by reducing their selectivity, but also that individually selective units may not be necessary for strong network performance.
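
A sketch of the single-direction ablation behind this analysis, assuming precomputed hidden activations and a hypothetical classify readout: clamp one unit and compare accuracy before and after. The paper also considers clamping to nonzero values and ablating random directions in activation space.

import numpy as np

def ablate_unit(activations, classify, unit):
    # Clamp a single hidden unit to zero and re-evaluate the readout;
    # the resulting accuracy drop measures reliance on that single direction
    ablated = activations.copy()
    ablated[:, unit] = 0.0
    return classify(ablated)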


Certified Defenses against Adversarial Examples    

tl;dr We demonstrate a certifiable, trainable, and scalable method for defending against adversarial examples.

Reviews: 8 (confidence: 4), 8 (confidence: 4), 5 (confidence: 3) = rank of 905 / 1000 = top 10%

While neural networks have achieved high accuracy on standard image classification benchmarks, their accuracy drops to nearly zero in the presence of small adversarial perturbations to test inputs. Defenses based on regularization and adversarial training have been proposed, but often followed by new, stronger attacks that defeat these defenses. Can we somehow end this arms race? In this work, we study this problem for neural networks with one hidden layer. We first propose a method based on a semidefinite relaxation that outputs a certificate that for a given network and test input, no attack can force the error to exceed a certain value. Second, as this certificate is differentiable, we jointly optimize it with the network parameters, providing an adaptive regularizer that encourages robustness against all attacks. On MNIST, our approach produces a network and a certificate that no attack that perturbs each pixel by at most $\epsilon = 0.1$ can cause more than $35\%$ test error.


Bi-directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling    

tl;dr A self-attention network for RNN/CNN-free sequence encoding with small memory consumption, highly parallelizable computation and state-of-the-art performance on several NLP tasks

Reviews: 9 (confidence: 4), 6 (confidence: 4), 6 (confidence: 4) = rank of 904 / 1000 = top 10%

Recurrent neural networks (RNN), convolutional neural networks (CNN) and self-attention networks (SAN) are commonly used to produce context-aware representations. RNN can capture long-range dependency but is hard to parallelize and not time-efficient. CNN focuses on local dependency but does not perform well on some tasks. SAN can model both such dependencies via highly parallelizable computation, but its memory requirement grows rapidly with sequence length. In this paper, we propose a model, called "bi-directional block self-attention network (Bi-BloSAN)", for RNN/CNN-free sequence encoding. It requires as little memory as an RNN, but with all the merits of a SAN. Bi-BloSAN splits the entire sequence into blocks, and applies an intra-block SAN to each block for modeling local context, then applies an inter-block SAN to the outputs of all blocks to capture long-range dependency. Thus, each SAN only needs to process a short sequence, and only a small amount of memory is required. Additionally, we use feature-level attention to handle the variation of contexts around the same word, and use forward/backward masks to encode temporal order information. On nine benchmark datasets for different NLP tasks, Bi-BloSAN achieves or improves upon state-of-the-art accuracy, and shows a better efficiency-memory trade-off than existing RNN/CNN/SAN models.
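
A toy numpy sketch of the memory trade-off: intra-block attention costs O(block^2) per block rather than O(n^2) overall, and a second attention over block summaries restores long-range dependency. Masks, feature-level attention, and the bi-directional parts are omitted, and n is assumed divisible by the block size.

import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def block_self_attention(x, block=8):
    # x: (n, d) sequence, n assumed divisible by `block`
    n, d = x.shape
    b = x.reshape(n // block, block, d)
    # intra-block attention: each block attends only to itself
    local = softmax(b @ b.transpose(0, 2, 1) / np.sqrt(d)) @ b
    # inter-block attention over mean-pooled block summaries -> long-range context
    s = local.mean(axis=1)
    context = softmax(s @ s.T / np.sqrt(d)) @ s
    return local.reshape(n, d), context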


HexaConv    

tl;dr We introduce G-HexaConv, a group equivariant convolutional neural network on hexagonal lattices.

Reviews: 7 (confidence: 4), 7 (confidence: 4), 7 (confidence: 4) = rank of 903 / 1000 = top 10%

The effectiveness of Convolutional Neural Networks stems in large part from their ability to exploit the translation invariance that is inherent in many learning problems. Recently, it was shown that CNNs can exploit other invariances, such as rotation invariance, by using group convolutions instead of planar convolutions. However, for reasons of performance and ease of implementation, it has been necessary to limit the group convolution to transformations that can be applied to the filters without interpolation. Thus, for images with square pixels, only integer translations, rotations by multiples of 90 degrees, and reflections are admissible. Whereas the square tiling provides a 4-fold rotational symmetry, a hexagonal tiling of the plane has a 6-fold rotational symmetry. In this paper we show how one can efficiently implement planar convolution and group convolution over hexagonal lattices, by re-using existing highly optimized convolution routines. We find that, due to the reduced anisotropy of hexagonal filters, planar HexaConv provides better accuracy than planar convolution with square filters, given a fixed parameter budget. Furthermore, we find that the increased degree of symmetry of the hexagonal grid increases the effectiveness of group convolutions, by allowing for more parameter sharing. We show that our method significantly outperforms conventional CNNs on the AID aerial scene classification dataset, even outperforming ImageNet pre-trained models.


MaskGAN: Textual Generative Adversarial Networks from Filling-in-the-Blank    

tl;dr Natural language GAN for filling in the blank

Reviews: 7 (confidence: 5), 7 (confidence: 4), 7 (confidence: 3) = rank of 902 / 1000 = top 10%

Neural text generation models are often autoregressive language models or seq2seq models. Neural autoregressive and seq2seq models that generate text by sampling words sequentially, with each word conditioned on the previous words, are state-of-the-art for several machine translation and summarization benchmarks. These benchmarks are often defined by validation perplexity, even though this is not a direct measure of sample quality. Language models are typically trained via maximum likelihood, most often with teacher forcing. Teacher forcing is well-suited to optimizing perplexity but can result in poor sample quality, because generating text requires conditioning on sequences of words that were never observed at training time. We propose to improve sample quality using Generative Adversarial Networks (GANs), which explicitly train the generator to produce high-quality samples and have shown a lot of success in image generation. GANs were originally designed to output differentiable values, so discrete language generation is challenging for them. We introduce an actor-critic conditional GAN that fills in missing text conditioned on the surrounding context. We show, both qualitatively and quantitatively, evidence that this approach produces more realistic text samples than a maximum-likelihood-trained model.


Active Neural Localization    

tl;dr "Active Neural Localizer", a fully differentiable neural network that learns to localize efficiently using deep reinforcement learning.

Reviews: 8 (confidence: 5), 7 (confidence: 4), 6 (confidence: 4) = rank of 901 / 1000 = top 10%

Localization is the problem of estimating the location of an autonomous agent from an observation and a map of the environment. Traditional methods of localization, which filter the belief based on the observations, are sub-optimal in the number of steps required, as they do not decide the actions taken by the agent. We propose "Active Neural Localizer", a fully differentiable neural network that learns to localize efficiently. The proposed model incorporates ideas of traditional filtering-based localization methods, by using a structured belief of the state with multiplicative interactions to propagate belief, and combines it with a policy model to minimize the number of steps required for localization. Active Neural Localizer is trained end-to-end with reinforcement learning. We use a variety of simulation environments for our experiments which include random 2D mazes, random mazes in the Doom game engine and a photo-realistic environment in the Unreal game engine. The results on the 2D environments show the effectiveness of the learned policy in an idealistic setting while results on the 3D environments demonstrate the model's capability of learning the policy and perceptual model jointly from raw-pixel based RGB observations. We also show that a model trained on random textures in the Doom environment generalizes well to a photo-realistic office space environment in the Unreal engine.
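
A minimal sketch of the structured, multiplicative belief update at the core of the model, written as a plain Bayes filter over a discrete 2D pose grid; in the paper the observation likelihood comes from a learned perceptual model, and the policy chooses actions that sharpen this belief.

import numpy as np

def belief_update(belief, likelihood):
    # Multiplicative interaction: elementwise product of the belief over the
    # discrete pose grid with the observation likelihood, then renormalization
    belief = belief * likelihood
    return belief / belief.sum()

def move_belief(belief, dx, dy):
    # Transition for a deterministic move: shift the belief map accordingly
    return np.roll(np.roll(belief, dy, axis=0), dx, axis=1)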


Regularizing and Optimizing LSTM Language Models    

tl;dr Effective regularization and optimization strategies for LSTM-based language models achieves SOTA on PTB and WT2.

Reviews: 7 (confidence: 5), 7 (confidence: 4), 7 (confidence: 4) = rank of 900 / 1000 = top 10%

In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM, which uses DropConnect on hidden-to-hidden weights, as a form of recurrent regularization. Further, we introduce NT-ASGD, a non-monotonically triggered (NT) variant of the averaged stochastic gradient method (ASGD), wherein the averaging trigger is determined using a NT condition as opposed to being tuned by the user. Using these and other regularization strategies, our ASGD Weight-Dropped LSTM (AWD-LSTM) achieves state-of-the-art word level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the effectiveness of a neural cache in conjunction with our proposed model, we achieve an even lower state-of-the-art perplexity of 52.8 on Penn Treebank and 52.0 on WikiText-2. We also explore the viability of the proposed regularization and optimization strategies in the context of the quasi-recurrent neural network (QRNN) and demonstrate comparable performance to the AWD-LSTM counterpart. The code for reproducing the results is open sourced and is available at \url{MASKED}.
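
A hedged sketch of the weight-dropped LSTM: DropConnect is applied to the hidden-to-hidden weight matrix itself before a forward pass, rather than to activations. The in-place .data swap keeps the sketch short; the real implementation re-registers a raw weight so that gradients are handled correctly, and applies this to every layer.

import torch
import torch.nn as nn
import torch.nn.functional as F

def weight_drop_forward(lstm, x, p=0.5):
    raw = lstm.weight_hh_l0.data
    # DropConnect: zero entries of the recurrent weight matrix for this pass
    lstm.weight_hh_l0.data = F.dropout(raw, p=p, training=True)
    out, state = lstm(x)
    lstm.weight_hh_l0.data = raw             # restore the undropped weights
    return out, state

lstm = nn.LSTM(input_size=400, hidden_size=1150, batch_first=True)
out, _ = weight_drop_forward(lstm, torch.randn(20, 35, 400))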