NLP Project

Author Prediction for Turkish Texts

Ziynet d

on 3 January 2013

Transcript of NLP Project

Gathering Data
Data content: 30 authors
Dataset groups: author + gender + age
Data: articles gathered from different newspapers

The genre of the authors is similar (politics and everyday events).
1,200 articles in total: 40 articles from each of the 30 authors.
Each data file has a specific format:
- the extension is .txt
- the first line is the title
- every other line is a paragraph
- the file name is the publication date of the article
- each file is inside a folder named after its author

Creating ARFF File
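The file layout above (one folder per author, one .txt file per article, first line the title) maps naturally onto an ARFF file for WEKA. A minimal sketch, assuming a single illustrative word_count feature and a root folder of author directories (the real attribute set is described later in this transcript):

```python
from pathlib import Path

def to_arff(rows, authors, relation="authorship"):
    """Serialize (features, author) rows into a minimal ARFF string for WEKA."""
    names = list(rows[0][0])
    out = [f"@relation {relation}", ""]
    out += [f"@attribute {n} numeric" for n in names]
    out.append("@attribute author {%s}" % ",".join(sorted(authors)))
    out += ["", "@data"]
    for feats, author in rows:
        out.append(",".join(str(feats[n]) for n in names) + f",{author}")
    return "\n".join(out)

def build_rows(root):
    """Walk root/<author>/<date>.txt; the folder name is the author label
    and the file name is the publication date."""
    rows, authors = [], set()
    for txt in sorted(Path(root).glob("*/*.txt")):
        author = txt.parent.name
        authors.add(author)
        lines = txt.read_text(encoding="utf-8").splitlines()
        # First line is the title; every other line is a paragraph.
        feats = {"word_count": sum(len(l.split()) for l in lines[1:])}
        rows.append((feats, author))
    return rows, authors
```

The resulting string can be saved as a .arff file and loaded directly into WEKA.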
Prizma
The Prizma program:
- runs WEKA in the background
- takes an ARFF file as input
- lets the user select a classification algorithm
- reports the correctly classified rate, the confusion matrix, the accuracy of the results, and so on:
  - how many instances are classified correctly
  - how many are misclassified, and as which class

Preprocessing & Experiment
Why an extra attribute? We want the minimum training data needed to recognize an author, and the maximum test data on which to predict the correct author.
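The summary outputs Prizma reports (correctly classified rate and a confusion matrix) can be computed like this; a plain illustration of the metrics, not Prizma's actual code:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are the actual class, columns the predicted class."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, pred)] for pred in labels] for actual in labels]

def correctly_classified_rate(y_true, y_pred):
    """Fraction of instances whose predicted class matches the actual one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

The diagonal of the matrix counts how many instances of each class are classified correctly; the off-diagonal cells show how many are misclassified, and as which class.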
Creating Datasets
- 3 datasets (author, gender, age)
- data split into training and testing sets per author:
  - 30x10 (75% training)
  - 20x20 (50% training)
  - 15x25 (37.5% training)
  - 10x30 (25% training)

Results
Even 25% training (10 training + 30 testing articles per author) is enough to find the correct author.
The difference between 25% and 75% training is small, which shows that our attributes work well.
About 10 articles are almost enough to predict the correct author with these attributes.

Author Prediction for Turkish Texts (ziynetnesibe.com)

In this experiment, the aim is to characterize documents:
- correct prediction of the author, age, and gender of a given text
- try to identify the writing style of the authors
- using measurements (word counts, usage of some specific words) on 40 articles from each of 30 Turkish columnists
- machine learning algorithms (Naive Bayes, Naive Bayes Multinomial, Neural Networks, SMO, and others)
- implemented on WEKA

Before: 10 authors -> now: 30 authors
Before: 1 prediction (author only) -> now: 3 predictions (author, gender, age)
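The per-author training/testing splits used in the experiment (30x10 down to 10x30) could be produced with a sketch like this; the shuffling and the seed are assumptions, since the actual split procedure is not described:

```python
import random

def split_per_author(articles_by_author, n_train, seed=0):
    """Split each author's articles into n_train for training and the rest
    for testing, e.g. n_train=10 gives the 10x30 (25% training) split."""
    rng = random.Random(seed)
    train, test = [], []
    for author, articles in articles_by_author.items():
        shuffled = list(articles)
        rng.shuffle(shuffled)
        train += [(a, author) for a in shuffled[:n_train]]
        test += [(a, author) for a in shuffled[n_train:]]
    return train, test
```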
When the number of possible outputs increases, it becomes more difficult to make correct predictions about the author and to understand the writing style.

Confusion Matrix
Produced with our program Prizma (WEKA's library in the background), using Naive Bayes and Multilayer Perceptron.

Original Gender Dataset vs. 4 Fake Datasets
The gender dataset only looks at the authors within its own group and tries to recognize them; which particular authors are inside does not matter much. Whether the two groups are female and male, or two mixed groups, has little effect. However, even if by a small margin, the original gender dataset has the highest correctly classified rate compared with the fake gender datasets.
Inference: the last 19 attributes were eliminated.

Thank You for Your Interest!
ziynetnesibe@gmail.com
ziynetnesibe.com
Attributes
Since future work may classify not only complete articles but also parts of an article, I have generally preferred ratio attributes (average paragraph length with/without spaces, average sentence length, empty paragraph ratio, length of the title, paragraph count, some punctuation ratios, punctuation count of the title, stopword ratio, subtitle ratio, word length variance, ...).
Count measurements are used only where necessary, as exceptions (e.g. writing very long articles, using long words).
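The preference for ratios over raw counts can be made concrete. A sketch of a few of the ratio attributes listed above; the names and exact definitions (e.g. whether punctuation is stripped) are guesses:

```python
def ratio_attributes(text, stopwords):
    """A few length-independent ratio attributes, so that part of an
    article can be scored the same way as a complete one."""
    lines = text.splitlines()
    title, body = lines[0], lines[1:]          # first line is the title
    words = " ".join(body).split()
    lengths = [len(w) for w in words] or [0]
    mean = sum(lengths) / len(lengths)
    return {
        "title_length": len(title),
        "empty_paragraph_ratio": sum(1 for l in body if not l.strip()) / max(len(body), 1),
        "stopword_ratio": sum(1 for w in words if w.lower() in stopwords) / max(len(words), 1),
        "word_length_variance": sum((x - mean) ** 2 for x in lengths) / len(lengths),
    }
```

Because every value is normalized by the article's own length, a 3-paragraph excerpt and a 30-paragraph article produce comparable feature vectors.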