Stanford NLP Toolkit

As a part of WSM Practical

Jinal Dhruv

on 23 January 2014

Transcript of Stanford NLP Toolkit

Natural Language Processing
Research at Stanford
Software Distributions by Stanford NLP Group
Running First Program
Stanford CoreNLP
Main Components
Information Extraction
Stanford CoreNLP provides a set of natural language analysis tools which can take raw English-language text as input and give:

The base forms of words
Their parts of speech
Whether they are names of companies, people, etc.
Word dependencies
Which noun phrases refer to the same entities

This includes...
1. Identifying named entities ==> Named Entity Recognition (NER), the core of Information Extraction (IE).
2. Resolving tokens and linking them to a global namespace ==> Entity Linking.
3. Identifying relations between the entities ==> Relation Extraction.


1. Tokenize
Stanford CoreNLP: PTBTokenizerAnnotator
2. Filter (stop words such as "a", "an", "the", ...)
Regular expressions in Java: [a-zA-Z]*-?[a-zA-Z]*
3. Lemmatization
Find the base form of each token using the lemma annotator implemented by the package
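Step 2 (the filter) can be sketched in plain Java with `java.util.regex`. This is an illustrative sketch, not part of CoreNLP: the class name `TokenFilter` and the tiny stop-word list are assumptions, and the pattern tightens the slide's regex from `*` to `+` so that empty strings are rejected.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

class TokenFilter {
    // keep only tokens made of letters, optionally with one hyphen,
    // mirroring the slide's pattern [a-zA-Z]*-?[a-zA-Z]* (with + for *)
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+(-[a-zA-Z]+)?");

    // a tiny illustrative stop-word list; a real one would be longer
    private static final Set<String> STOP =
            new HashSet<>(Arrays.asList("a", "an", "the"));

    static boolean keep(String token) {
        String t = token.toLowerCase();
        return WORD.matcher(t).matches() && !STOP.contains(t);
    }
}
```

With this, "brown" and "well-known" pass the filter, while "the" and purely numeric tokens are dropped.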

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        // StanfordCoreNLP loads a lot of models, so you probably
        // only want to do this once per execution
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText) {
        List<String> lemmas = new LinkedList<String>();

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);

        // run all annotators on this text
        this.pipeline.annotate(document);

        // iterate over all sentences found, then over each token,
        // collecting the lemma of every token
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }

        return lemmas;
    }
}

Problems that may occur:
1. Computational efficiency
2. Limited I/O speed when using disk

Possible solutions:
1. Parallel computing (Hadoop MapReduce, MPI, ...)
2. In-memory caching (Memcached, Redis)
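As a minimal in-process illustration of the caching idea (a stand-in for Memcached/Redis, not an actual client for either), repeated lookups of the same word can be memoized in a map so the expensive computation runs only once per distinct word. The class name `LemmaCache` is illustrative, and the lowercasing function in the usage example merely stands in for a real, slow lemmatizer call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class LemmaCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> compute;
    int misses = 0; // counts how often the expensive path actually ran

    LemmaCache(Function<String, String> compute) {
        this.compute = compute;
    }

    String lemma(String word) {
        // computeIfAbsent invokes the expensive function only on a cache miss
        return cache.computeIfAbsent(word, w -> {
            misses++;
            return compute.apply(w);
        });
    }
}
```

Calling `lemma("Dogs")` twice against the same cache triggers the underlying computation once; the second call is served from memory.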

Stanford CoreNLP integrates all our NLP tools, including
the part-of-speech (POS) tagger,
the named entity recognizer (NER),
the parser,
the coreference resolution system, and
the sentiment analysis tools.

It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled.
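That "single option" is the `annotators` property. The sketch below only builds the configuration with `java.util.Properties`, so it runs without the CoreNLP jars on the classpath; the class name `AnnotatorConfig` and the `withNer` toggle are illustrative assumptions.

```java
import java.util.Properties;

class AnnotatorConfig {
    static Properties build(boolean withNer) {
        Properties props = new Properties();
        // enable or disable tools by editing this comma-separated list;
        // each annotator's prerequisites must appear before it
        String annotators = "tokenize, ssplit, pos, lemma";
        if (withNer) {
            annotators += ", ner";
        }
        props.setProperty("annotators", annotators);
        return props;
    }
}
```

The resulting `Properties` object is what gets passed to `new StanfordCoreNLP(props)`, as in the code fragments elsewhere in this presentation.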

A group called "Natural Language Processing" was established at Stanford University, consisting of faculty, research scientists, postdocs, programmers, and students.

It covers areas such as sentence understanding, machine translation, probabilistic parsing and tagging, biomedical information extraction, grammar induction, word sense disambiguation, and automatic question answering.
NLP Group at Stanford University
Major Applications
It is concerned with the interaction between computers and humans, and with developing systems that can cope with natural languages such as English and French.

The program can only answer with what it has been programmed with, and cannot answer about something it has no knowledge of.

The program looks for keywords in a sentence; if a keyword it expects is missing, it will need more data.

Background noise can interfere with the program.

Different accents will affect the program.

The program has to understand the language that you are using.
Everyday applications like
Speech Recognition
Machine Translation
Main Goal of Group is developing key applications in all the areas of Human Language Technology.
Stanford TokensRegex: A tool for matching regular expressions over tokens.

Stanford Word Segmenter:
A word segmenter in Java. Also supports Arabic and Chinese.

Stanford Parser:
Implementations of probabilistic natural language parsers in Java

Stanford CoreNLP:
An integrated suite of natural language processing tools

More software:
Stanford CoreNLP
Online demo
Mailing list: you can send other questions and feedback to java-nlp-support@lists.stanford.edu.
Removal of Stop Words
Speech Recognition
Automatic Summarization
Information Retrieval
Keyphrase Extraction
Basic Requirements
1. CoreNLP Toolkit

2. Netbeans or Eclipse IDE

3. For the POS tagger,
include the following jar libraries:
1. stanford-corenlp
2. stanford-corenlp-models
3. joda-time
4. xom

4. For NER,
include the following jar libraries:
1. stanford-corenlp
2. stanford-corenlp-models
3. joda-time
4. xom
5. stanford-ner (not included in the CoreNLP toolkit)
6. jollyday
Part of Speech Tagging
Part-of-speech tagging is assigning the correct part of speech (noun, verb, etc.) to words.

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc.
Properties props = new Properties();
props.put("annotators", "pos");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = "A quick brown Fox jumped over the lazy dog."; // Add your text here!

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

// these are all the sentences in this document
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences) {
// traversing the words in the current sentence
// a CoreLabel is a CoreMap with additional token-specific methods
for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
// this is the text of the token
String word = token.get(TextAnnotation.class);
// this is the POS tag of the token
String pos = token.get(PartOfSpeechAnnotation.class);
// this is the NER label of the token
String ne = token.get(NamedEntityTagAnnotation.class);
}
}
Something went wrong:
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [2.3 sec].
Exception in thread "main" java.lang.IllegalArgumentException: annotator "pos" requires annotator "tokenize"
Adding other needed arguments
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
The order of the annotators matters!
Now it's running!
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [2.3 sec].
BUILD SUCCESSFUL (total time: 2 seconds)
But what are DT, JJ, NN?
Another version that is easier to understand:
NN: Noun, singular or mass
JJ: Adjective
DT: Determiner
NNP: Proper noun, singular
Tag Description
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
Exploring the toolkit
Character encoding:
By default, it uses Unicode's UTF-8. You can change the encoding used when reading files either by setting the Java encoding property or, more simply, by supplying the program with the command-line flag -encoding FOO.
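The same effect can be had in plain Java by decoding a file with an explicit charset instead of relying on the platform default. This is a generic-Java sketch rather than CoreNLP's own reader; the class name `Utf8Reader` is illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8Reader {
    static String read(Path file) throws IOException {
        // read the whole file and decode it explicitly as UTF-8,
        // rather than using the platform default charset
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }
}
```

Reading a file written with UTF-8 bytes this way round-trips non-ASCII text such as accented characters correctly on any platform.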
We can also add custom annotators by extending the class edu.stanford.nlp.pipeline.Annotator.
Input can be a file or a URL, and output can be generated in XML or a visualized format.
Adding constraints to the parser:
The parser can be instructed to keep certain sets of tokens together as a single constituent. For any sentence where you want to add constraints, attach the ParserAnnotations.ConstraintAnnotation to that sentence.
Fruit flies like a banana
Here is the full list...