
Word Watching - NLP Meetup

Presented to Austin NLP

Brent Schneeman

on 22 April 2016


The Vacation Rental market is large but fragmented, with hundreds of thousands of suppliers (homeowners). It has few brands and no commonly accepted service levels or rating standards.

- Douglas Quinby, PhoCusWright

We Just returned from a great week at your wonderful villa. The location by the pool is really convenient and the short walk to the beach. The villa is very comfortable (especially the beds) and we always have a very relaxing time. We have stayed here numerous times and are never disappointed. Fiddler's Cove is great for families.
We have recently stayed here for 2 weeks and have really enjoyed our time. The house is brand new and immaculate, and the saltwater pool has been awesome for our 2 children(ages 1 and 4). The beach is a very short walk from the house- easy with a double stroller and all of our gear. The house is set up really well to accommodate kids or adults and has an outdoor shower, shady areas to hang out in and sunny spots to lay out, as well. The owners live nearby and were very helpful with the house and suggestions for things to do. The location is great for walking to get coffee, going for a run, taking a beach walk or just hanging out with little kiddos or adults. Plus, it's near the best beach areas on the island. We highly recommend it!
The house situation is [...], close to all facilities, restaurants, groceries, beach, stores, etc. The pool, the patio furniture, the deck, the beach chairs and the towels are very good for bathing and dining outside, The house offers enough space. We were disapointed by the old tv sets; the bathrooms need to be refreshed as well as the cupboard in the kitchen and the laundry room. We were expecting more. We already rented two other houses with Homeaway before of better quality. The other couple also rent something cleaner and nicer for a better price. The cleaning must have been done more metiscusly. The oven was very dirty. We found that kitchen pot and pans were chipped and old . There are many old stuff under the cupboard. The toaster heats properly only on one side. The BBQ grill was rusty; all the protection was gone on half the surface. We had problems twice with the internet. The manager/owner came once to try (without success) to repair the leaking sink. The bath was very slow to drain; a plumber came one morning and waited half an hour for the owner who never showed up, so no repair were done. The small carpets in the bathrooms were old, dirty and disgutting. In the yard, close to the pool, there were old mops, brooms, plastic plants that should all be sent to garbage. It's more a 3.5* than a 4*. There is a real potential for this house but now it seems a bit neglected. If you haven't seen other places, you don't know; the four of us can compare and we were all disapointed this time.
This analysis found that about 10% of reviews have text-and-star inconsistencies, but...

  • grouping by the language used
  • comparing with known labels
  • extracting statistically significant tri-grams

"Pots and Pans"

Most Significant Phrase in Negative Reviews
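The tri-gram step can be sketched without LingPipe: tokenize each review, count every three-token window, and report the most frequent one. A real analysis would score n-grams against a background model (as the LingPipe code later in the deck does) rather than use raw counts; the `TrigramSketch` class and the tiny corpus below are purely illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TrigramSketch {

    // count every three-token window across all reviews
    public static Map<String, Integer> countTrigrams(List<String> reviews) {
        Map<String, Integer> counts = new HashMap<>();
        for (String review : reviews) {
            String[] tokens = review.toLowerCase().split("\\s+");
            for (int i = 0; i + 2 < tokens.length; i++) {
                String trigram = tokens[i] + " " + tokens[i + 1] + " " + tokens[i + 2];
                counts.merge(trigram, 1, Integer::sum);
            }
        }
        return counts;
    }

    // return the tri-gram with the highest raw count
    public static String mostFrequent(Map<String, Integer> counts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> reviews = List.of(
                "the pots and pans were old",
                "chipped pots and pans in the kitchen");
        System.out.println(mostFrequent(countTrigrams(reviews))); // prints "pots and pans"
    }
}
```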
Maslow's Hierarchy of Needs
Love / Belonging
Abraham Maslow, 1943
Vacationer's Hierarchy of Needs
Hustle / Bustle
Pots and Pans
Washer and Dryer
Sliding Glass Doors
Bring your own
Visitor Recently Left
Open floor plan
Labor Day Weekend
Within Walking Distance
Glass of Wine
5 = "good" and 1 = "bad" might not be the only interpretation:

* = Tourist
** = Standard
*** = Comfort
**** = First Class
***** = Luxury
Word Watching
"The J Curve"
//Base Code
//randomize the list of all reviews, but use a known seed to
//recreate the study if needed
Collections.shuffle(allReviews, new Random(12));

int sampleSize = 1000;
Set<CharSequence> inputSet = new HashSet<>(sampleSize); //just a subset
for (int i = 0; i < sampleSize; i++) {
    inputSet.add(allReviews.get(i));
}

//prepare for magic
Clusterer cl = new Clusterer(inputSet, tokenizerFactory());
//here be magic -> cluster the reviews in the input set based
//on their distance from each other (defined in the
//TfIdfDistance used by the Clusterer)
Tree clusters = cl.buildTree();

//just for display
outputJavaScript(clusters, "reviews.js");
public static TokenizerFactory tokenizerFactory() {
    TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
    factory = new WhitespaceNormTokenizerFactory(factory);
    factory = new LowerCaseTokenizerFactory(factory);
    factory = new EnglishStopTokenizerFactory(factory);
    // factory = new TokenLengthTokenizerFactory(factory, 2,
    //         Integer.MAX_VALUE);
    // factory = new PorterStemmerTokenizerFactory(factory);
    return factory;
}

// Create the tokenizer factory
TokenizerFactory tokenizerFactory = tokenizerFactory();

// build a statistical model of the text
// NGRAM = 3, so find tri-grams
TokenizedLM backgroundModel = buildModel(tokenizerFactory, NGRAM,
        ...); // (remaining arguments truncated on the slide)

// clean up some internal data structures
// (call truncated on the slide; the model's counters are pruned here)

// find the collocations
SortedSet<ScoredObject<String[]>> coll = backgroundModel
        .collocationSet(...); // (arguments truncated on the slide)
Spot key words to trump, and trust, all reviews

Brent Schneeman

brent@homeaway.com @schnee
Topic Modeling

  • Patterns in the use of words in a corpus
  • Words occur in statistically meaningful ways
  • Words are selected from baskets to form text

"Latent Dirichlet allocation"


#import the reviews
reviews = read.csv("./resources/Reviews_Top2000_ByRating.csv", stringsAsFactors=F)

# initialize Mallet with the reviews and a stop-word list
mallet.instances <- mallet.import(rownames(reviews),
                                  reviews$text,     # (column name assumed; truncated on the slide)
                                  "stopwords.txt")  # (stop-word file assumed; truncated on the slide)

#create topic trainer object. 15 Topics
n.topics <- 15
topic.model <- MalletLDA(n.topics)

#load reviews
topic.model$loadDocuments(mallet.instances)

## Get the vocabulary, and some statistics about word frequencies.
## These may be useful in further curating the stopword list.
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)

## Optimize hyperparameters every 20 iterations,
## after 50 burn-in iterations.
topic.model$setAlphaOptimization(20, 50)

## Now train a model. Note that hyperparameter optimization is on, by default.
## We can specify the number of iterations. Here we'll use a large-ish round number.
topic.model$train(200)  # (iteration count assumed; omitted on the slide)

## NEW: run through a few iterations where we pick the best topic for each token,
## rather than sampling from the posterior distribution.
topic.model$maximize(10)  # (iteration count assumed; omitted on the slide)
## Get the probability of topics in reviews and the probability of words in topics.
## By default, these functions return raw word counts. Here we want probabilities,
## so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)

# from http://www.cs.princeton.edu/~mimno/R/clustertrees.R
## transpose and normalize the doc topics
topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)
write.csv(topic.docs, "dcb-topic-docs.csv")

## Get a vector containing short names for the topics
topics.labels <- rep("", n.topics)
for (topic in 1:n.topics) {
  topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")
}
# have a look at keywords for each topic
# create data.frame with columns as reviews and rows as topics
topic_docs <- data.frame(topic.docs)
names(topic_docs) <- rownames(reviews)

docs_topics = t(topic_docs)
best_fits = tail(sort(docs_topics[,2]))
best_fit = names(best_fits)[length(best_fits)]
Mallet and R

[1] "dirty cleaning clean water broken"
[2] "unit condo property stayed nice"
[3] "owner property stay day time"
[4] "cottage boat ð little dog"
[5] "deposit owner security money owners"
[6] "kitchen towels etc tv paper"
[7] "bedroom bed night living door"
[8] "villa food local beach hotel"
[9] "time thank coupon rebook stay"
[10] "de la en und die"
[11] "home time wonderful stay thank"
[12] "house pool hot nice water"
[13] "apartment location clean stay recommend"
[14] "beach condo nice pool stay"
[15] "rental vacation manager property location"
public Clusterer(Set<CharSequence> inputSet,
        TokenizerFactory tf) {
    this.inputSet = inputSet;
    this.tf = tf;

    distance = new TfIdfDistance(tf);

    for (CharSequence charSequence : inputSet) {
        // (loop body truncated on the slide; presumably trains
        // the TfIdfDistance on each review)
    }
}
Term frequency - inverse document frequency (TF-IDF)
measures how important a word is to a document in a corpus
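The idea can be shown with the textbook formula: term frequency (how often t occurs in a document, normalized by document length) times inverse document frequency (the log of the corpus size over the number of documents containing t). This is a minimal sketch of the concept, not LingPipe's TfIdfDistance, which encapsulates its own weighting; the `TfIdfSketch` class and the toy corpus are illustrative.

```java
import java.util.List;

public class TfIdfSketch {

    // fraction of tokens in doc that equal term
    public static double tf(String term, List<String> doc) {
        long hits = doc.stream().filter(term::equals).count();
        return (double) hits / doc.size();
    }

    // log of corpus size over documents containing term
    // (assumes the term appears in at least one document)
    public static double idf(String term, List<List<String>> corpus) {
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) corpus.size() / docsWithTerm);
    }

    public static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }

    public static void main(String[] args) {
        List<String> d1 = List.of("pool", "beach", "pool");
        List<String> d2 = List.of("beach", "kitchen");
        List<List<String>> corpus = List.of(d1, d2);
        // "pool" is frequent in d1 and absent from d2, so it scores high;
        // "beach" appears in every document, so its idf (and tf-idf) is 0
        System.out.println(tfIdf("pool", d1, corpus));
        System.out.println(tfIdf("beach", d1, corpus));
    }
}
```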
Stop Words

a, be, had, it, only, she, was, about, because, has, its, of, some, we, after, been, have, last, on, such, were, all, but, he, more, one, than, when, also, by, her, most, or, that, which, an, can, his, mr, other, the, who, any, co, if, mrs, out, their, will, and, corp, in, ms, over, there, with, are, could, inc, mz, s, they, would, as, for, into, no, so, this, up, at, from, is, not, says, to
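A stop list like the one above is applied by simply dropping matching tokens before any counting happens, which is what EnglishStopTokenizerFactory does inside the LingPipe pipeline shown earlier. A minimal sketch (the `StopWordFilter` class and its abbreviated seven-word list are illustrative):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilter {
    // a tiny subset of the stop list above, for illustration
    static final Set<String> STOP = Set.of("a", "the", "and", "of", "was", "in", "to");

    // lower-case each token and drop it if it is a stop word
    public static List<String> filter(List<String> tokens) {
        return tokens.stream()
                .map(String::toLowerCase)
                .filter(t -> !STOP.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filter(List.of("The", "pool", "was", "a", "delight")));
        // prints "[pool, delight]"
    }
}
```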
When writing each document, you:

  1. Decide on the number of words N the document will have (say, according to a Poisson distribution).
  2. Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have food and cute animal topics, you might choose the document to consist of 1/3 food and 2/3 cute animals.
  3. Generate each word w_i in the document by:
       • First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
       • Using the topic to generate the word itself (according to the topic's multinomial distribution). For example, if we selected the food topic, we might generate the word "broccoli" with 30% probability, "bananas" with 15% probability, and so on.

LDA then backtracks from the observed documents to infer the probable topics that generated them.
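The generative story can be run directly as code: for each word, draw a topic from the document's topic mixture, then draw a word from that topic's word distribution. The `LdaGenerativeSketch` class is illustrative; it uses the two-topic food / cute-animals example from the text, with uniform per-topic word distributions standing in for the real multinomials.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class LdaGenerativeSketch {

    // generate an n-word document whose topic mixture is
    // pFood food and (1 - pFood) cute animals
    public static List<String> generateDocument(int n, double pFood,
                                                String[] foodWords, String[] animalWords,
                                                Random rng) {
        List<String> doc = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            // step 1: pick a topic from the document's topic mixture
            boolean foodTopic = rng.nextDouble() < pFood;
            // step 2: pick a word from that topic's word distribution
            // (uniform here, a multinomial in real LDA)
            String[] vocab = foodTopic ? foodWords : animalWords;
            doc.add(vocab[rng.nextInt(vocab.length)]);
        }
        return doc;
    }

    public static void main(String[] args) {
        String[] food = {"broccoli", "bananas", "spinach"};
        String[] animals = {"puppy", "kitten"};
        // a document that is 1/3 food and 2/3 cute animals, as in the example;
        // seed 12 echoes the Random(12) used in the clustering code
        List<String> doc = generateDocument(12, 1.0 / 3.0, food, animals, new Random(12));
        System.out.println(doc);
    }
}
```

Fitting LDA is the inverse of this sketch: given only the generated words, recover the topic mixtures and word distributions.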
Latent Dirichlet allocation