South Park Speech Prediction

Armando Galeana

on 29 March 2017

Transcript of South Park Speech Prediction

Our data set
Source: https://www.kaggle.com/tovarischsukhov/southparklines/downloads/south-park-dialogue.zip
Model Definition
Vectorization using TfidfVectorizer:

GridSearchCV found the highest accuracy with the following parameters:

Modeling used: MultinomialNB

Model's code
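The model code itself did not survive in this transcript. The following is only a minimal sketch of the setup described above (TfidfVectorizer, GridSearchCV, MultinomialNB), not the author's code. The toy dialogue lines, the speaker labels, and the parameter grid are assumptions made for illustration; only the values max_features=850 and ngram_range=(1, 1) come from the slides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for the dialogue data (hypothetical lines, not from the data set).
lines = [
    "screw you guys i am going home",
    "respect my authority",
    "i want my cheesy poofs",
    "screw you i am going home now",
    "dude this is pretty messed up",
    "dude that is not cool",
    "oh my god dude",
    "dude we have to go to school",
]
speakers = ["Cartman"] * 4 + ["Stan"] * 4

# Vectorize the text with TF-IDF, then classify with multinomial Naive Bayes.
pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Assumed search grid; it includes the values the slides report as best.
param_grid = {
    "vect__max_features": [50, 850],
    "vect__ngram_range": [(1, 1), (1, 2)],
}

# cv=2 only because the toy data is tiny; a real run would use more folds.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(lines, speakers)

best_params = search.best_params_
prediction = search.predict(["dude that is messed up"])[0]
print(best_params, prediction)
```

Wrapping the vectorizer and the classifier in one Pipeline means the grid search refits the vectorizer inside each fold, so the vocabulary is never learned from held-out lines.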
Data set description
The Problem
Determining character speech prediction accuracy
Findings and other applications
Results Presentation format
Who is talking is in question
Cleaning our data set
The main challenge in this data set is the amount of "irrelevant" data:

1. Season

2. Episode

3. Supporting characters

50% of the dialogue lines are spoken by only 15 characters.

The other 50% of the dialogue lines are spoken by 3935 characters
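One way to arrive at such a split is to rank speakers by line count and keep the most frequent ones until they cover half of all lines. The sketch below assumes the data is available as one speaker label per dialogue line; the helper name top_speakers and the toy labels are hypothetical.

```python
from collections import Counter

def top_speakers(speakers, coverage=0.5):
    """Return the most frequent speakers that together reach `coverage` of all lines."""
    counts = Counter(speakers)
    total = sum(counts.values())
    kept, covered = [], 0
    for name, n in counts.most_common():
        if covered >= coverage * total:
            break
        kept.append(name)
        covered += n
    return kept

# Hypothetical speaker column, one entry per dialogue line.
speakers = ["Cartman"] * 4 + ["Stan"] * 3 + ["Kyle"] * 2 + ["Clerk", "Nurse", "Pilot"]

main = top_speakers(speakers)
# Keep only the lines spoken by the main characters.
kept_lines = [s for s in speakers if s in main]
print(main, len(kept_lines))
```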

Main characters
Character Dialogue Prediction
new_text = ["Well, I guess we'll have to roshambo for it. I'll kick you in the nuts as hard as I can, then you kick me square in the nuts as hard as you can..."]

new_text_transform = vect.transform(new_text)

new_text is the phrase entered
Using a data set that includes the transcripts of 18 seasons of South Park, determine how accurately a model can predict which character is most likely speaking, given a random word or set of words.
Presenting results
By Armando Galeana, General Assembly DS-SF-31
Applying MultinomialNB with cross validation to the characters that account for 50% of the dialogue lines gives a prediction accuracy of 33.33%.

It is hard to get a much higher accuracy score because the "Spam & Ham" words per character are very similar, except for the top 4 characters.

4 features

19 seasons

70896 dialogue lines

3950 unique characters
What would success look like?
Accuracy score must be higher than the percentage of lines of the main character
max_features = 850
ngram_range = (1, 1)
MultinomialNB accuracy: 0.405300702664
Accuracy applying cross validation: 0.3333783092

Main character's percentage of lines
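The success criterion amounts to beating a majority-class baseline: a model that always guesses the character with the most lines is right exactly as often as that character speaks. A minimal sketch of the comparison follows, with hypothetical helper names and toy labels rather than the real data set.

```python
from collections import Counter

def majority_baseline(speakers):
    """Return the most frequent speaker and their share of all lines."""
    name, n = Counter(speakers).most_common(1)[0]
    return name, n / len(speakers)

def beats_baseline(accuracy, speakers):
    """True if a model's accuracy exceeds always guessing the top speaker."""
    _, baseline = majority_baseline(speakers)
    return accuracy > baseline

# Hypothetical speaker labels, one per dialogue line.
speakers = ["Cartman"] * 4 + ["Stan"] * 3 + ["Kyle"] * 3

name, baseline = majority_baseline(speakers)
print(name, baseline)
print(beats_baseline(0.5, speakers))  # 0.5 is an illustrative score, not a result
```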
Future of primary healthcare?
Who knows???
Now, imagine that instead of having characters you have user profiles, and instead of episode dialogues you have patient-doctor conversations...

Could you determine the success of a new treatment based on the symptoms patients describe?
Can you predict what may be the most likely issues of a new patient?