Long Short Term Memory Networks for Machine Reading

Research presentation for Deep Learning Course @ Columbia University, NY

Apoorv Kulshreshtha

on 23 March 2017


Transcript of Long Short Term Memory Networks for Machine Reading

About the paper and its authors...
School of Informatics, University of Edinburgh
Aim: Enable sequence-level networks to better handle structured input
Challenges faced by sequence level networks
Vanishing and exploding gradients, which are ameliorated by LSTMs and gradient clipping (Pascanu et al. 2013)
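To make the gradient-clipping remedy concrete, here is a minimal NumPy sketch (the function name is illustrative; the threshold 5 matches the value reported later in the language-modeling setup):

```python
import numpy as np

def clip_gradient_by_norm(grad, max_norm=5.0):
    """Rescale `grad` so that its L2 norm does not exceed `max_norm`,
    the standard remedy for exploding gradients (Pascanu et al. 2013)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```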
Related Work
Introducing structural bias to neural models is not new
RNN with an external memory stack for learning CFGs (Das et al. 1992)
Set of data structures (stacks, queues, and deques) as memory controlled by an RNN (Grefenstette et al. 2015)
LSTM with an external memory block component which interacts with its hidden state (Tran et al. 2016)
Structured neural network with episodic memory modules (Kumar et al. 2016)
Modeling 2 Sequences with LSTMN
Standard tool for modeling 2 sequences with RNNs: an encoder-decoder architecture where the second sequence (target) is conditioned on the first one (source)
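For reference, a minimal encoder-decoder sketch (PyTorch; the module layout and dimensions are illustrative assumptions, not taken from the paper), in which the target sequence is conditioned on the source through the encoder's final state:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the decoder is conditioned on the source
    sequence through the encoder's final hidden and cell states."""

    def __init__(self, vocab_size, embed_dim=150, hidden_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, source_ids, target_ids):
        _, state = self.encoder(self.embed(source_ids))       # summarize the source
        out, _ = self.decoder(self.embed(target_ids), state)  # condition the target on it
        return self.proj(out)                                  # per-token logits
```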
Solution Proposed
Leverage memory and attention to empower a recurrent network with stronger memorization capability and ability to discover relations among tokens
Long Short Term Memory Networks for Machine Reading
Apoorv Kulshreshtha & Samarth Tripathi

Jianpeng Cheng, Li Dong & Mirella Lapata
Published at EMNLP 2016
Introduces the concept of LSTM-Networks (LSTMNs) and demonstrates their effectiveness on various NLP tasks
Extend LSTM architecture to a memory network instead of a single memory cell
Initially designed to process a single sequence,
but the paper also demonstrates how to integrate it with an encoder-decoder architecture
Memory generalization problems: the network generalizes poorly to long sequences while wasting memory on shorter ones
Sequence level networks lack a mechanism for handling the structure of the input. This imposes an inductive bias which is at odds with the fact that language has an inherent structure
Insert a memory network module in the update of a recurrent network together with attention for memory addressing
Memory network used is internal to the recurrence, thereby strengthening the interaction and leading to a representation learner which is able to reason over shallow structures
Explains how to combine the LSTMN, which applies attention for intra-relation reasoning, with the encoder-decoder network whose attention module learns the inter-alignment between 2 sequences
Can be done in 2 ways: Shallow attention fusion, Deep attention fusion
Shallow Attention Fusion
Deep Attention Fusion
Shallow Attention Fusion
Treats the LSTMN as a separate module that can be readily used in an encoder-decoder architecture
Both encoder and decoder are modeled as LSTMNs with intra-attention
Inter-attention is triggered when the decoder reads a target token (the calculation is the same as in Bahdanau et al. 2014)
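A minimal NumPy sketch of that inter-attention step (additive, Bahdanau-style scoring; the weight matrices W_d, W_e and vector v are illustrative parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def inter_attention(decoder_state, encoder_states, W_d, W_e, v):
    """Score each encoder state against the current decoder state,
    normalize the scores, and return a context vector over the source."""
    scores = np.array([v @ np.tanh(W_d @ decoder_state + W_e @ h)
                       for h in encoder_states])
    weights = softmax(scores)                                # alignment over source tokens
    context = (weights[:, None] * encoder_states).sum(axis=0)
    return context, weights
```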
Deep Attention Fusion
Language Modeling
Experiments conducted on English Penn Treebank dataset
Training: Sections 0-20 (1M words), Validation: Sections 21-22 (80K words), Testing: Sections 23-24 (90K words)
Perplexity as evaluation metric
Stochastic Gradient Descent for optimization with an initial learning rate of 0.65, which decays by a factor of 0.85 per epoch if no improvement on validation set
Renormalize the gradient if its norm is greater than 5
Mini-batch size was set to 40 and the dimensions of the word embeddings were set to 150 for all the models
As there are no single-layer variants of gLSTM and dLSTM, they have to be implemented as multi-layer systems
Hidden unit size for all models (except KN5) was set to 300
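A hedged sketch of this optimization recipe (PyTorch; `model`, `batches`, `evaluate`, and `validation_set` are placeholders, and the exact schedule in the paper may differ in details):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.65)   # initial learning rate
best_val = float("inf")

for epoch in range(max_epochs):
    for x, y in batches:                                    # mini-batches of size 40
        optimizer.zero_grad()
        loss = model(x, y)                                  # per-token cross-entropy
        loss.backward()
        # renormalize the gradient if its norm is greater than 5
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()

    val_ppl = evaluate(model, validation_set)               # perplexity on validation
    if val_ppl >= best_val:                                 # no improvement
        for group in optimizer.param_groups:
            group["lr"] *= 0.85                             # decay by a factor of 0.85
    else:
        best_val = val_ppl
```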
Results continued...
Combines inter and intra-attention when computing state updates
The core of the Machine Reader model is an LSTM unit with an extended memory tape and an attention-based memory addressing mechanism at every step
Modifies the LSTM structure by replacing the memory cell with a memory network at each unit, keeping vectors of hidden states and memories.
In LSTMs the next state is always computed from the current state in a Markov manner: given the current state ht, the next state ht+1 is conditionally independent of states h1...ht-1 and tokens x1...xt.

LSTM assumes unbounded memory so the current state can summarize all the previous tokens. This fails with long sequences or small memory size.

An LSTM aggregates information on a token-by-token basis in sequential order, but there is no explicit mechanism for reasoning over structure and modeling relations between tokens

The LSTMN replaces the memory cell with a memory network, storing the contextual representation of each input token in a unique memory slot; the size of the memory grows with time until an upper bound on the memory span is reached.
This design enables the LSTMN to reason about relations between tokens with a neural attention layer and then perform non-Markov state updates.
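As a concrete illustration, a simplified single-step sketch of such a unit (NumPy; the parameter names, shapes, and gating follow the standard LSTM template and are assumptions, not the authors' exact formulation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lstmn_step(x_t, hidden_tape, memory_tape, params):
    """One LSTMN step: attend over all previous hidden/memory slots,
    build adaptive summaries, then apply LSTM-style gating.
    hidden_tape, memory_tape: (t-1, d) arrays; x_t: (e,) embedding."""
    W_h, W_x, v, W_gates = params                    # illustrative parameters

    # Intra-attention over the memory network (previous tokens); the paper's
    # attention also conditions on the previous summary, omitted here for brevity
    scores = np.array([v @ np.tanh(W_h @ h_i + W_x @ x_t) for h_i in hidden_tape])
    s = softmax(scores)                              # distribution over previous tokens
    h_tilde = s @ hidden_tape                        # adaptive summary of the hidden tape
    c_tilde = s @ memory_tape                        # adaptive summary of the memory tape

    # Standard LSTM-style gating driven by the adaptive summary (non-Markov update)
    z = W_gates @ np.concatenate([h_tilde, x_t])
    i, f, o, c_hat = np.split(z, 4)
    i, f, o = [1.0 / (1.0 + np.exp(-g)) for g in (i, f, o)]
    c_t = f * c_tilde + i * np.tanh(c_hat)
    h_t = o * np.tanh(c_t)

    # Append the new slot: the memory grows with each token (up to a bound)
    return np.vstack([hidden_tape, h_t]), np.vstack([memory_tape, c_t])
```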
LSTMN Advantages
The two sets of vectors (hidden states and memories) allow the model to compute adaptive summary vectors for the previous hidden and memory tapes using a softmax.

LSTMNs use attention for inducing relations between tokens (xt and x1...xt-1) using h1...ht-1.

The implementation focuses on read operations on memories, with attention linking the current token to previous memories and selecting useful content; however, writes are also possible, e.g. to correct wrong interpretations.

At each time step, the model computes the relation between the current token and previous tokens from the previous hidden states via an attention layer.

This yields a probability distribution over the hidden state vectors of previous tokens which can then be used to compute an adaptive summary vector for the previous hidden tape and memory tape.
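In equation form, the step just described can be sketched as follows (reconstructed from the paper's general formulation; weight names and indices are approximate):

```latex
% attention score of previous token i at time t, and its normalization
a_i^t = v^\top \tanh\!\left( W_h h_i + W_x x_t + W_{\tilde h} \tilde{h}_{t-1} \right),
\qquad
s_i^t = \operatorname{softmax}\!\left( a_i^t \right)

% adaptive summaries of the hidden tape and the memory tape
\begin{bmatrix} \tilde{h}_t \\ \tilde{c}_t \end{bmatrix}
= \sum_{i=1}^{t-1} s_i^t \begin{bmatrix} h_i \\ c_i \end{bmatrix}
```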
Sentiment Analysis
Sentiment analysis performed on the Stanford Sentiment Treebank in both the 5-class (fine-grained) and 2-class (binary) settings.
Experimented with 1- and 2-layer classifiers with ReLU activation, and a memory size of 168 for compatibility.
Used GloVe 300D word embeddings, Adam as the optimizer (adaptive learning rate), and dropout.
Both 1- and 2-layer LSTMNs outperformed the respective LSTM models and are comparable with the state of the art.
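A minimal sketch of the kind of classifier head described above (PyTorch; the layer sizes are the ones quoted here, while the dropout rate and everything else are illustrative assumptions):

```python
import torch
import torch.nn as nn

# 2-layer ReLU classifier over the LSTMN sentence representation
# (hidden/memory size 168, 5 sentiment classes)
classifier = nn.Sequential(
    nn.Dropout(p=0.5),          # dropout rate is an assumption
    nn.Linear(168, 168),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(168, 5),
)
optimizer = torch.optim.Adam(classifier.parameters())   # adaptive learning rate
```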
Sentiment Analysis
Natural Language Inference
Recognizing textual entailment: determining whether a premise-hypothesis pair is entailing, contradicting, or neutral, using the Stanford Natural Language Inference (SNLI) dataset.
Experimented with 1- and 2-layer classifiers with ReLU activation, and a memory size of 168 for compatibility.
Used GloVe 300D word embeddings, Adam as the optimizer (adaptive learning rate), and dropout.
Both 1- and 2-layer LSTMNs outperformed the respective LSTM models with and without attention, and the result is state of the art.
A key observation is that attention fusion generally helps, and deep fusion generally outperforms shallow fusion.
With deep fusion, the inter-attention vectors are recurrently memorized by the decoder through a gating operation, which also improves the information flow of the network.
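A hedged sketch of that gating in equation form (the gate r_t, the adaptive source summary \tilde{\gamma}_t obtained by inter-attention, and the weight names approximate the paper's notation):

```latex
% the inter-attention summary of the source memory tape enters the cell update
% through an extra gate r_t
r_t = \sigma\!\left( W_r \left[ \tilde{\gamma}_t ,\, x_t \right] \right),
\qquad
c_t = r_t \odot \tilde{\gamma}_t + f_t \odot \tilde{c}_t + i_t \odot \hat{c}_t
```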
Natural Language Inference
Machine reading is related to a wide range of tasks, from answering reading comprehension questions to fact and relation extraction, ontology learning, and textual entailment
Attention acts as a weak inductive module discovering relations between input tokens, and is trained without direct supervision
Their model is based on a Long Short-Term Memory architecture embedded with a memory network, explicitly storing contextual representations of input tokens without recursively compressing them.

More importantly, an intra-attention mechanism is employed for memory addressing, as a way to induce undirected relations among tokens.

The attention layer is not optimized with a direct supervision signal but jointly with the entire network on downstream tasks.

Experimental results across three tasks show that their model yields performance comparable or superior to the state of the art.
In contrast to work on dependency grammar induction (Klein and Manning 2004), where the learned head-modifier relations are directed, the LSTMN model captures undirected relations
Many NLP tasks are concerned with modeling two sequences rather than a single one