Transcript

Part of Speech Tagging

Part-of-speech tagging is assigning the correct part of speech (noun, verb, etc.) to words.

A part-of-speech tagger (POS tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other token).

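// imports needed by this snippet
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// "pos" needs "tokenize" and "ssplit" to run before it
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = "A quick brown Fox jumped over the lazy dog."; // Add your text here!

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

// these are all the sentences found in the document
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences) {
    // traversing the words in the current sentence
    // a CoreLabel is a CoreMap with additional token-specific methods
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        // this is the text of the token
        String word = token.get(TextAnnotation.class);
        System.out.println(word);

        // this is the POS tag of the token
        String pos = token.get(PartOfSpeechAnnotation.class);
        System.out.println(pos);

        // this is the NER label of the token
        // (null unless the "lemma" and "ner" annotators are also enabled)
        String ne = token.get(NamedEntityTagAnnotation.class);
        System.out.println(ne);
    }
}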

Tag Description

1. CC Coordinating conjunction

2. CD Cardinal number

3. DT Determiner

4. EX Existential there

5. FW Foreign word

6. IN Preposition or subordinating conjunction

7. JJ Adjective

8. JJR Adjective, comparative

9. JJS Adjective, superlative

10. LS List item marker

11. MD Modal

12. NN Noun, singular or mass

13. NNS Noun, plural

14. NNP Proper noun, singular

15. NNPS Proper noun, plural

16. PDT Predeterminer

17. POS Possessive ending

18. PRP Personal pronoun

19. PRP$ Possessive pronoun

20. RB Adverb

21. RBR Adverb, comparative

22. RBS Adverb, superlative

23. RP Particle

24. SYM Symbol

25. TO to

26. UH Interjection

27. VB Verb, base form

28. VBD Verb, past tense

29. VBG Verb, gerund or present participle

30. VBN Verb, past participle

31. VBP Verb, non-3rd person singular present

32. VBZ Verb, 3rd person singular present

33. WDT Wh-determiner

34. WP Wh-pronoun

35. WP$ Possessive wh-pronoun

36. WRB Wh-adverb
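As a rough sketch of how the tag list above could be used in code, the following maps a handful of the tags to their descriptions and prints them next to tagged words. The class name `TagDescriptions` and the method `describe` are hypothetical helpers for illustration, not part of CoreNLP, and only a few of the 36 tags are included.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps a few Penn Treebank tags to the descriptions listed above.
class TagDescriptions {
    private static final Map<String, String> DESCRIPTIONS = new LinkedHashMap<>();
    static {
        DESCRIPTIONS.put("DT", "Determiner");
        DESCRIPTIONS.put("IN", "Preposition or subordinating conjunction");
        DESCRIPTIONS.put("JJ", "Adjective");
        DESCRIPTIONS.put("NN", "Noun, singular or mass");
        DESCRIPTIONS.put("NNP", "Proper noun, singular");
        DESCRIPTIONS.put("VBD", "Verb, past tense");
    }

    // Returns the description for a tag, or "Unknown tag" if it is not in the map.
    static String describe(String tag) {
        return DESCRIPTIONS.getOrDefault(tag, "Unknown tag");
    }

    public static void main(String[] args) {
        // Word/tag pairs as a tagger might return them for the example sentence.
        String[][] tagged = {
            {"A", "DT"}, {"quick", "JJ"}, {"Fox", "NNP"}, {"jumped", "VBD"}, {"dog", "NN"}
        };
        for (String[] pair : tagged) {
            System.out.println(pair[0] + "\t" + pair[1] + "\t" + describe(pair[1]));
        }
    }
}
```

Punctuation tokens get tags that are not in the list above (for example "." for a full stop), so a lookup like this needs a fallback such as the "Unknown tag" default.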

class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    // pattern only include letters
    Pattern pattern;

    // matcher to match
    Matcher matcher;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        // StanfordCoreNLP loads a lot of models, so you probably
        // only want to do this once per execution
        this.pipeline = new StanfordCoreNLP(props);

        .........
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();

        // temp string
        String temp;

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);

        .........

        return lemmas;
    }
}

Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give:

  • The base forms of words
  • Their parts of speech
  • Whether they are names of companies, people, etc.
  • Word dependencies
  • Which noun phrases refer to the same entities

Example:

Fruit flies like a banana

Major Applications

Stanford CoreNLP

A group called "Natural Language Processing" was established at Stanford University consisting of faculty, research scientists, postdocs, programmers and students.

It covers areas such as sentence understanding, machine translation, probabilistic parsing and tagging, biomedical information extraction, grammar induction, word sense disambiguation, and automatic question answering.

Steps:

1. Tokenize

Stanford CoreNLP: PTBTokenizerAnnotator

2. Filter (a an the before...)

Regular Expressions in Java: [a-zA-Z]*-?[a-zA-Z]*

3. Lemmatization

Find the source word (lemma) of each token, in line with a semantic web implemented by the package.
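The filter step can be sketched with the Java standard library alone. The example below keeps tokens that fully match the regular expression from step 2 and drops stop words. The class name `TokenFilter` is hypothetical, and the stop-word set is an assumption that contains only the four words named in step 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper illustrating step 2 (Filter); not part of CoreNLP.
class TokenFilter {
    // The pattern from step 2: letters, optionally with one hyphen inside.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]*-?[a-zA-Z]*");

    // Assumed stop-word list; step 2 only names "a", "an", "the" and "before".
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "before"));

    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            boolean isWord = WORD.matcher(token).matches();
            // The pattern also matches "" and "-", so require at least one letter.
            boolean hasLetter = token.chars().anyMatch(Character::isLetter);
            boolean isStopWord = STOP_WORDS.contains(token.toLowerCase());
            if (isWord && hasLetter && !isStopWord) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "A", "quick", "brown", "Fox", "jumped", "over", "the", "lazy", "dog", ".");
        System.out.println(filter(tokens));
    }
}
```

Because every part of the pattern is optional, it matches the empty string and a lone hyphen too, which is why the sketch adds the extra letter check.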

The main goal of the group is to develop key applications in all areas of human language technology.

Stanford TokensRegex: A tool for matching regular expressions over tokens.

Stanford Word Segmenter:

A word segmenter in Java. It also supports Arabic and Chinese.

Stanford Parser:

Implementations of probabilistic natural language parsers in Java

Stanford CoreNLP:

An integrated suite of natural language processing tools

More software:

http://www-nlp.stanford.edu/software/index.shtml

  • Keyphrase Extraction

This includes...

1. Identifying named entities, ==> Named Entity Recognition (NER) and Information Extraction (IE).

2. Resolving tokens and linking them to a global namespace, ==> Biological Process Extraction.

3. Identifying relations between the entities. ==> Coreference Resolution.

NLP Group at Stanford University

  • Removal of Stop Words
  • Speech Recognition
  • Automatic Summarization
  • Information Retrieval

Exploring the toolkit

Software Distributions by Stanford NLP Group

Research at Stanford

HOLAAAA

It is concerned with the interaction between computers and humans, and with developing systems that can handle natural languages such as French and English.

Everyday applications like

  • Speech Recognition
  • Machine Translation

Natural Language Processing

Information Extraction

Something Wrong

Adding other needed arguments

Online Demo

Mailing List

You can send other questions and feedback to java-nlp-support@lists.stanford.edu.

http://nlp.stanford.edu:8080/corenlp/

http://nlp.stanford.edu:8080/parser/index.jsp

Challenges

The program can only answer with what it has been programmed with and cannot answer questions about something it has no knowledge of.

The program relies on keywords in a sentence; if the sentence contains none of the keywords it is looking for, it needs more data.

Background noises could interfere with the program.

Different accents will affect the program.

The program has to understand the language that you are using.

  • Character Encoding: By default, it uses Unicode's UTF-8. You can change the encoding used when reading files by either setting the Java encoding property or more simply by supplying the program with the command line flag -encoding FOO
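To show why the encoding setting matters, here is a small standard-library sketch that decodes the same UTF-8 bytes with two different charsets. The class name `EncodingDemo` is hypothetical, and the example uses plain Java strings rather than CoreNLP's own file readers.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: the same bytes read with the right and the wrong encoding.
class EncodingDemo {
    // Decodes raw bytes into a String using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the final letter.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset restores the original 4 characters.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).length());

        // Decoding as ISO-8859-1 turns the 2-byte accented letter into 2 characters.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).length());
    }
}
```

If text is read with the wrong encoding, tokens that contain non-ASCII characters reach the tokenizer already corrupted, so it is worth setting the encoding explicitly for files that are not UTF-8.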

Properties props = new Properties();

props.put("annotators", "tokenize,ssplit, pos");

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Basic Requirements

Adding annotator pos

Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [2.3 sec].

Exception in thread "main" java.lang.IllegalArgumentException: annotator "pos" requires annotator "tokenize"

1. CoreNLP Toolkit

2. Netbeans or Eclipse IDE

3. For the POS tagger, include the following jar libraries:

   1. stanford corenlp
   2. stanford corenlp models
   3. joda-time
   4. xom

4. For NER, include the following jar libraries:

   1. stanford corenlp
   2. stanford corenlp models
   3. joda-time
   4. xom
   5. stanford-ner (not included in the CoreNLP toolkit)
   6. jollyday

The order of the arguments matters!
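The following is a simplified illustration of why the annotator list is order-sensitive. It is not CoreNLP's own implementation: the class `AnnotatorOrderCheck` is hypothetical, and the prerequisite table is an assumption covering only a few common annotators.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified illustration only; this is not CoreNLP's own implementation.
class AnnotatorOrderCheck {
    // Assumed prerequisites for a few common annotators.
    private static final Map<String, List<String>> REQUIRES = new HashMap<>();
    static {
        REQUIRES.put("ssplit", Arrays.asList("tokenize"));
        REQUIRES.put("pos", Arrays.asList("tokenize", "ssplit"));
        REQUIRES.put("lemma", Arrays.asList("tokenize", "ssplit", "pos"));
        REQUIRES.put("ner", Arrays.asList("tokenize", "ssplit", "pos", "lemma"));
    }

    // Walks the comma-separated list left to right and fails on the first
    // annotator whose prerequisite has not appeared earlier in the list.
    static void check(String annotators) {
        Set<String> seen = new HashSet<>();
        for (String raw : annotators.split(",")) {
            String name = raw.trim();
            List<String> needed = REQUIRES.getOrDefault(name, Collections.emptyList());
            for (String prerequisite : needed) {
                if (!seen.contains(prerequisite)) {
                    throw new IllegalArgumentException(
                            "annotator \"" + name + "\" requires annotator \"" + prerequisite + "\"");
                }
            }
            seen.add(name);
        }
    }

    public static void main(String[] args) {
        check("tokenize, ssplit, pos");   // passes silently
        try {
            check("pos, tokenize, ssplit"); // same annotators, wrong order
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch a list fails even when all the needed annotators are present, if one of them appears before its prerequisites.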

Running First Program

Another version which is easily understandable

Stanford CoreNLP

Now it's running!

  • JJ: Adjective
  • DT: Determiner
  • NN: Noun, singular or mass

run:

Adding annotator tokenize

Adding annotator ssplit

Adding annotator pos

Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [2.3 sec].

A

DT

quick

JJ

brown

JJ

Fox

NNP

jumped

VBD

over

IN

the

DT

lazy

JJ

dog

NN

.

.

BUILD SUCCESSFUL (total time: 2 seconds)

  • We can also add custom annotators by extending class edu.stanford.nlp.pipeline.Annotator.

Here is the full list...

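Word      Tag    Description
A         DT     Determiner
quick     JJ     Adjective
brown     JJ     Adjective
Fox       NNP    Proper noun, singular
jumped    VBD    Verb, past tense
over      IN     Preposition or subordinating conjunction
the       DT     Determiner
lazy      JJ     Adjective
dog       NN     Noun, singular or mass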

But what are DT, JJ, and NN?

  • Input can be a file or a URL, and output can be generated as XML or in a visualized format.
  • Adding constraints to the parser: The parser can be instructed to keep certain sets of tokens together as a single constituent. For any sentence where you want to add constraints, attach the ParserAnnotations.ConstraintAnnotation to that sentence.

Stanford CoreNLP integrates all our NLP tools, including 

  • the part-of-speech (POS) tagger, 
  • the named entity recognizer (NER), 
  • the parser, 
  • the coreference resolution system,
  • the sentiment analysis tools.

It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled.

Main Components

Problems that may occur:

1. Computational efficiency

2. Limited disk I/O speed

Solutions:

1. Parallel computing (Hadoop MapReduce, MPI...)

2. Caching in memory (Memcached, Redis)
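The two solutions can be sketched on a small scale inside one JVM. This example swaps in an in-memory `ConcurrentHashMap` for Memcached or Redis and a parallel stream for Hadoop or MPI, so it only illustrates the idea. The class `CachedAnalyzer` is hypothetical, and `String::toUpperCase` stands in for a slow pipeline call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical wrapper: caches the result of an expensive text analysis.
class CachedAnalyzer {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> expensive;
    private final AtomicInteger misses = new AtomicInteger();

    CachedAnalyzer(Function<String, String> expensive) {
        this.expensive = expensive;
    }

    // Runs the expensive function only the first time a given text is seen.
    String analyze(String text) {
        return cache.computeIfAbsent(text, key -> {
            misses.incrementAndGet();
            return expensive.apply(key);
        });
    }

    // Number of times the expensive function actually ran.
    int misses() {
        return misses.get();
    }

    public static void main(String[] args) {
        // String::toUpperCase is a stand-in for a slow call such as a full NLP pipeline.
        CachedAnalyzer analyzer = new CachedAnalyzer(String::toUpperCase);
        List<String> docs = Arrays.asList("dog", "fox", "dog", "fox", "dog");

        // Process the documents in parallel; repeated texts are served from the cache.
        List<String> results = docs.parallelStream()
                .map(analyzer::analyze)
                .collect(Collectors.toList());

        System.out.println(results + " computed " + analyzer.misses() + " times");
    }
}
```

A shared cache such as Memcached or Redis plays the same role across several machines, at the cost of a network round trip per lookup.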
