
Long Short-Term Memory-Networks for Machine Reading

Apoorv Kulshreshtha & Samarth Tripathi

Challenges faced by sequence level networks

Solution Proposed

Modeling 2 Sequences with LSTMN

Shallow Attention Fusion

Vanishing and exploding gradients, which are ameliorated by LSTMs and gradient clipping (Pascanu et al. 2013)

Leverage memory and attention to empower a recurrent network with stronger memorization capability and ability to discover relations among tokens

Many NLP tasks are concerned with modeling two sequences rather than a single one

Introduction

Memory generalization problems: the network generalizes poorly to long sequences while wasting memory on shorter ones

Insert a memory network module in the update of a recurrent network together with attention for memory addressing

Standard tool for modeling two sequences with RNNs: the encoder-decoder architecture, where the second sequence (target) is conditioned on the first one (source)

Machine reading is related to a wide range of tasks, from answering reading comprehension questions to fact and relation extraction, ontology learning and textual entailment

Sequence level networks lack a mechanism for handling the structure of the input. This imposes an inductive bias which is at odds with the fact that language has an inherent structure

Attention acts as a weak inductive module discovering relations between input tokens, and is trained without direct supervision

Explains how to combine the LSTMN, which applies attention for intra-relation reasoning, with the encoder-decoder network, whose attention module learns the inter-alignment between two sequences

Aim: Enable sequence-level networks to better handle structured input

The memory network used is internal to the recurrence, thereby strengthening the interaction and leading to a representation learner that is able to reason over shallow structures

Extends the LSTM architecture with a memory network in place of a single memory cell

This can be done in two ways: shallow attention fusion and deep attention fusion

About the paper and its authors...

The LSTMN is initially designed to process a single sequence, but the paper also demonstrates how to integrate it with an encoder-decoder architecture

Shallow Attention Fusion

Treats the LSTMN as a separate module that can be readily used in an encoder-decoder architecture

Both encoder and decoder are modeled as LSTMNs with intra-attention

Inter-attention is triggered when the decoder reads a target token (computed as in Bahdanau et al., 2014)
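
As an illustration, a minimal NumPy sketch of this inter-attention step (additive attention in the style of Bahdanau et al., 2014); the weight matrices, encoder states, and dimensions are hypothetical placeholders, not the paper's implementation:

  import numpy as np

  def softmax(x):
      e = np.exp(x - x.max())
      return e / e.sum()

  def inter_attention(enc_states, dec_state, W_enc, W_dec, v):
      # Additive score between the current decoder state and every encoder state.
      scores = np.tanh(enc_states @ W_enc.T + dec_state @ W_dec.T) @ v
      weights = softmax(scores)          # distribution over source tokens
      context = weights @ enc_states     # weighted summary of the source
      return context, weights

  # Toy usage: six 4-dimensional encoder states and one decoder state.
  rng = np.random.default_rng(0)
  enc, dec = rng.normal(size=(6, 4)), rng.normal(size=4)
  ctx, w = inter_attention(enc, dec, rng.normal(size=(4, 4)),
                           rng.normal(size=(4, 4)), rng.normal(size=4))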

Jianpeng Cheng, Li Dong & Mirella Lapata

Related Work

School of Informatics, University of Edinburgh

Language Modeling

Experiments conducted on English Penn Treebank dataset

Deep Attention Fusion

Published at EMNLP 2016

Training: Sections 0-20 (1M words), Validation: Sections 21-22 (80K words), Testing: Sections 23-24 (90K words)

Deep Attention Fusion

Perplexity as evaluation metric
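
For reference, perplexity is the exponential of the average per-word negative log-likelihood; a tiny illustrative computation with made-up token probabilities:

  import math

  # Made-up probabilities a language model assigns to four consecutive tokens.
  token_probs = [0.10, 0.02, 0.30, 0.05]
  avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
  perplexity = math.exp(avg_nll)   # lower is better
  print(round(perplexity, 2))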

Introducing structural bias to neural models is not new

  • RNN with an external memory stack for learning CFGs (Das et al., 1992)
  • Set of data structures (stacks, queues, and deques) as memory controlled by an RNN (Grefenstette et al., 2015)
  • LSTM with an external memory block component which interacts with its hidden state (Tran et al., 2016)
  • Structured neural network with episodic memory modules (Kumar et al., 2016)

Results

Introduces the concept of LSTM-Networks (LSTMNs) and demonstrates their effectiveness on various NLP tasks

Stochastic gradient descent for optimization, with an initial learning rate of 0.65 that decays by a factor of 0.85 per epoch if there is no improvement on the validation set (see the sketch below)

LSTMN Advantages

Renormalize the gradient if its norm is greater than 5

Mini-batch size was set to 40 and the dimensions of the word embeddings were set to 150 for all the models

As there are no single-layer variants of gLSTM and dLSTM, they have to be implemented as multi-layer systems

In contrast to work on dependency grammar induction (Klein and Manning, 2004), where the learned head-modifier relations are directed, the LSTMN model captures undirected relations

Hidden unit size for all models (except KN5) was set to 300
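
A minimal sketch of this optimization schedule (SGD at 0.65, decay by 0.85 when validation perplexity stops improving, gradient renormalized to norm 5); the model and data pipeline are left abstract and the helper names are placeholders:

  import numpy as np

  LR_INIT, LR_DECAY, CLIP_NORM = 0.65, 0.85, 5.0
  BATCH_SIZE, EMBED_DIM, HIDDEN_DIM = 40, 150, 300

  def clip_gradient(grad, max_norm=CLIP_NORM):
      # Renormalize the gradient if its norm is greater than the threshold.
      norm = np.linalg.norm(grad)
      return grad * (max_norm / norm) if norm > max_norm else grad

  def next_learning_rate(lr, val_ppl, best_val_ppl):
      # Decay the learning rate when the validation perplexity does not improve.
      return lr * LR_DECAY if val_ppl >= best_val_ppl else lr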

Results continued...

The two tapes, one of hidden state vectors and one of memory vectors, allow the model to compute adaptive summary vectors of the previous hidden and memory tapes using a softmax.

LSTMNs use attention to induce relations between the current token xt and the previous tokens x1..xt-1, using the hidden states h1..ht-1.

The implementation focuses on read operations on the memories, with attention linking the current token to previous memories and selecting useful content; however, writes are also possible, e.g. to correct wrong interpretations.

At each time step, the model computes the relation between the current token and all previous tokens via an attention layer over the previous hidden states.

This yields a probability distribution over the hidden state vectors of previous tokens which can then be used to compute an adaptive summary vector for the previous hidden tape and memory tape.
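
A minimal NumPy sketch of this intra-attention step, assuming additive scoring; the weight matrices and the previous summary vector are hypothetical placeholders:

  import numpy as np

  def softmax(x):
      e = np.exp(x - x.max())
      return e / e.sum()

  def adaptive_summaries(hidden_tape, memory_tape, x_t, prev_summary, W_h, W_x, W_s, v):
      # One attention score per previous hidden state h_1 .. h_{t-1}.
      scores = np.tanh(hidden_tape @ W_h.T + x_t @ W_x.T + prev_summary @ W_s.T) @ v
      s = softmax(scores)            # probability distribution over previous tokens
      h_tilde = s @ hidden_tape      # adaptive summary of the hidden tape
      c_tilde = s @ memory_tape      # adaptive summary of the memory tape
      return h_tilde, c_tilde, s

The same weights s are applied to both tapes, so the hidden and memory summaries stay aligned with each other.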

Combines inter and intra-attention when computing state updates

LSTM

Sentiment Analysis

The core of the Machine Reader model is an LSTM unit with an extended memory tape and an attention-based memory addressing mechanism at every step

LSTMN

Conclusions

Sentiment Analysis

Modifies the LSTM structure by replacing the memory cell with a memory network at each unit, keeping a tape of hidden states and a tape of memory vectors.

Sentiment analysis is performed on the Stanford Sentiment Treebank, in both the 5-class and 2-class settings.

Experimented with 1- and 2-layered classifiers with ReLU activation, and a memory size of 168 for compatibility.

Used GloVe 300D word embeddings, the Adam optimizer, an adaptive learning rate, and dropout.

Both the 1- and 2-layered LSTMNs outperformed the respective LSTM models and are comparable with the state of the art
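
A toy sketch of such a classifier head (one hidden ReLU layer over the LSTMN's final 168-dimensional representation, softmax over the 5 classes); the hidden width and all weights are arbitrary placeholders:

  import numpy as np

  def softmax(x):
      e = np.exp(x - x.max())
      return e / e.sum()

  def classify(sentence_repr, W1, b1, W2, b2):
      # One hidden ReLU layer followed by a softmax over the 5 sentiment classes.
      hidden = np.maximum(0.0, W1 @ sentence_repr + b1)
      return softmax(W2 @ hidden + b2)

  rng = np.random.default_rng(0)
  repr_168 = rng.normal(size=168)   # stand-in for the final LSTMN representation
  probs = classify(repr_168,
                   rng.normal(size=(64, 168)), np.zeros(64),   # hidden layer (64 units, arbitrary)
                   rng.normal(size=(5, 64)), np.zeros(5))      # 5-way output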

Their model is based on a Long Short-Term Memory architecture embedded with a memory network, explicitly storing contextual representations of input tokens without recursively compressing them.

More importantly, an intra-attention mechanism is employed for memory addressing, as a way to induce undirected relations among tokens.

The attention layer is not optimized with a direct supervision signal but jointly with the entire network on downstream tasks.

Experimental results across three tasks show that their model yields performance comparable or superior to the state of the art.

Natural Language Inference

LSTMN vs LSTM

Natural Language Inference

In LSTMs the next state is always computed from the current state in a Markov manner: given the current state ht, the next state ht+1 is conditionally independent of the states h1..ht-1 and tokens x1..xt.

The LSTM assumes unbounded memory, so that the current state can summarize all previous tokens; this fails with long sequences or a small memory size.

An LSTM aggregates information on a token-by-token basis in sequential order, but there is no explicit mechanism for reasoning over structure and modeling relations between tokens

The LSTMN replaces the memory cell with a memory network, storing the contextual representation of each input token in a unique memory slot; the size of the memory grows with time until an upper bound on the memory span is reached.

This design enables the LSTMN to reason about relations between tokens with a neural attention layer and then perform non-Markov state updates.
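
Concretely, a minimal sketch of such a non-Markov update: the usual LSTM gates are driven by the adaptive summaries of the tapes (as sketched earlier) rather than by the single previous state; the weight matrices are placeholders:

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def lstmn_update(h_tilde, c_tilde, x_t, W_i, W_f, W_o, W_c):
      # Gates depend on the adaptive summary of the whole hidden tape,
      # so the new state is not a function of h_{t-1} alone.
      z = np.concatenate([h_tilde, x_t])
      i, f, o = sigmoid(W_i @ z), sigmoid(W_f @ z), sigmoid(W_o @ z)
      c_hat = np.tanh(W_c @ z)
      c_t = f * c_tilde + i * c_hat   # forget acts on the summarized memory tape
      h_t = o * np.tanh(c_t)
      return h_t, c_t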

Recognizing Textual Entailment: determine whether premise-hypothesis pairs are entailing, contradicting, or neutral, using the Stanford Natural Language Inference (SNLI) dataset.

Experimented with 1- and 2-layered classifiers with ReLU activation, and a memory size of 168 for compatibility.

Used GloVe 300D word embeddings, the Adam optimizer, an adaptive learning rate, and dropout.

Both the 1- and 2-layered LSTMNs outperformed the respective LSTM models with and without attention, and are state of the art.

A key observation is that attention fusion is generally beneficial, and deep fusion generally performs better than shallow fusion.

With deep fusion, the inter-attention vectors are recurrently memorized by the decoder through a gating operation, which also improves the information flow of the network.
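
A rough sketch of this gating, under the assumption that a dedicated gate controls how much of the inter-attention (source) summary enters the decoder cell alongside the intra-attention terms; the parameterization is illustrative only:

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def deep_fusion_cell(i_gate, f_gate, c_hat, c_tilde, source_summary, x_t, W_r):
      # Extra gate r decides how much source (inter-attention) content is
      # written into the decoder cell together with the usual update terms.
      r = sigmoid(W_r @ np.concatenate([source_summary, x_t]))
      return r * source_summary + f_gate * c_tilde + i_gate * c_hat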