Text Mining Algorithms on Turkish and English Texts

Yakup Arslan

on 11 January 2013

Transcript of Text Mining Algorithms on Turkish and English Texts

Agenda
1. Introduction
2. What is Text Mining?
3. Purpose
4. Datasets that we used
5. The Tools for the Project
6. Processes
6.1. Tokenize
6.2. Stopwords Handler
6.3. Stemming
6.4. TFxIDF
6.5. Cosine Similarity
6.6. Hierarchical Clustering integration with R
7. References
8. Questions

Processes

TOKENIZE
- Remove punctuation and numbers from the documents to get the pure text.
- Visual Studio: read all words from the columns and stored them in a database, tokenizing each document.
- Approximately 140,000 words in total across all documents.
- SQL Server: approximately 10,000 distinct words.

TF-IDF WEIGHTING APPROACH

Text Datasets

We have 51 English texts in 6 different categories.
The categories include Cinema and Economy.

We also have 25 Turkish texts in 5 different categories.
The categories include Din (Religion) and Sinema (Cinema).

Text Mining on Turkish and English Texts

Advisor: Assoc. Prof. Melih Kırlıdoğ
Developed by: Yakup Arslan, Doğan Güneş, Melik Erfidan

Introduction

Unstructured data (textual information).

Hard to run algorithms on texts.

Extract meaningful numeric indices from the text.

Most data mining algorithms work only with numeric data.

Text mining support for Turkish is insufficient.

Purpose

Find similar documents in the dataset and cluster them in a meaningful way.

Scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted.

Areas Where the Project Could Be Used
Copyright detection.
Spam filters.
Text categorization for comments and forums.
Search engine for articles.

What is Text Mining?

A set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.

The term text analytics also describes the application of text analytics to respond to business problems, whether independently or in conjunction with query and analysis of fielded, numerical data.

The Tools and Programming Languages that we used in the Project

We tried different text processing tools to find the best solution, such as RapidMiner, Oracle Text, and R.
Among these tools, R, an open-source statistical programming language, gave us the best solution for text processing.
However, these tools do not support the Turkish language.
So we implemented our project in C#, integrated with R.

- Zemberek is an open-source natural language processing library for the Turkish language.
- We used this library to get the roots of Turkish words.
- Rserve is a TCP/IP server for R; we used it for communication between R and C#.

STOPWORDS HANDLER

Stop words are words that are filtered out prior to, or after, processing of natural language data (text).
We filtered the stop words out of our word lists.
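The tokenize and stop-word steps can be sketched in a few lines. This is a minimal Python sketch, not the project's C# and SQL Server code: the function names and the sample sentence are made up for illustration, and the stop-word set holds only the Turkish words quoted in these slides.

```python
import re

# Only the stop words quoted in the slides; a real list is much longer.
TURKISH_STOP_WORDS = {"altı", "artık", "ben", "beş", "birşeyi", "bizi"}


def tokenize(text):
    """Lowercase the text and return its words, dropping punctuation and numbers."""
    # [^\W\d_]+ matches runs of Unicode letters only.
    # Note: str.lower() does not apply Turkish casing rules for "I" and "İ".
    return re.findall(r"[^\W\d_]+", text.lower())


def remove_stop_words(tokens, stop_words=TURKISH_STOP_WORDS):
    """Keep only the tokens that are not stop words."""
    return [token for token in tokens if token not in stop_words]


# Hypothetical sample sentence.
tokens = tokenize("Ben artık 5 film izledim, bizi bekleme!")
print(remove_stop_words(tokens))
```

For the sample sentence this prints `['film', 'izledim', 'bekleme']`: the number and punctuation are dropped by the tokenizer, and "ben", "artık", and "bizi" are removed as stop words.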

Here are some of the stop words in Turkish:
altı (six)
artık (anymore)
birşeyi (something)
bizi (us)
ben (I)
beş (five)

ZEMBEREK AND PORTER STEMMER

Identifies a word by its root.

Reduces dimensionality (the number of features).

We get the roots with the Zemberek NLP library for Turkish and the Porter stemmer algorithm for English.

The program also detected misspelled words.
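To illustrate how stemming reduces the number of features, here is a sketch of only step 1a of the Porter algorithm (plural suffix removal) in Python. It is not the full Porter stemmer and not the project's implementation; the function name and the sample words are illustrative.

```python
def porter_step_1a(word):
    """Apply step 1a of the Porter algorithm: SSES -> SS, IES -> I, SS -> SS, S -> ''."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


# Hypothetical sample vocabulary.
words = ["document", "documents", "cluster", "clusters", "class", "classes"]
stems = {porter_step_1a(word) for word in words}
print(len(set(words)), "distinct words ->", len(stems), "distinct stems")
```

The six sample words collapse to three stems, which is the dimensionality reduction the slide refers to. The later Porter steps, which this sketch omits, handle suffixes such as "-ing", "-ed", and "-ational".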

Turkish example: "Çekoslovakyalılaştıramadıklarımızdansınız" [Çekoslovakya-lı-laş-tır-a-ma-dık-lar-ımız-dan-sınız]. The root is "Çekoslovakya".

Term frequency–inverse document frequency (TF-IDF) is a numerical statistic that reflects how important a word is to a document in a collection.

Output

Cosine Similarity

Cosine similarity is a measure of similarity between two vectors of an inner product space: it is the cosine of the angle between them.

COSINE SIMILARITY RESULT

We computed the cosine similarity between all pairs of documents.

Hierarchical Clustering Method with R libraries

R has a text mining package (tm), and hierarchical clustering is available in R through the hclust function.
Once we have the cosine similarities of the documents, we send this matrix to R over the Rserve connection to run hierarchical clustering and plot the dendrogram.
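The steps from TF-IDF weighting through cosine similarity to hierarchical clustering can be sketched end to end. This is a pure Python sketch under stated assumptions, not the project's C# and R code: the slides do not say which TF-IDF variant or linkage was used, so it assumes relative term frequency times ln(N/df) and average linkage on distance = 1 - cosine similarity. The function names and the three tiny documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Turn token lists into sparse {term: weight} TF-IDF vectors."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_linkage(similarity):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(similarity))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(p, q) for p in clusters[i] for q in clusters[j]]
                dist = sum(1.0 - similarity[p][q] for p, q in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical documents, already tokenized and stemmed.
docs = [
    ["film", "actor", "scene"],
    ["film", "director", "scene"],
    ["market", "inflation", "bank"],
]
vectors = tf_idf_vectors(docs)
similarity = [[cosine_similarity(a, b) for b in vectors] for a in vectors]
merges = average_linkage(similarity)
print(merges[0][:2])  # the two documents merged first
```

Documents 0 and 1 share two terms, so they merge first; document 2 shares no terms with them and joins last at distance 1.0. The merge history holds the same information a dendrogram draws.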
The hierarchical clustering method is similar to graph algorithms.

HIERARCHICAL CLUSTERING FOR ENGLISH TEXTS

DEMO FOR TURKISH TEXTS

References

Assoc. Prof. Melih Kırlıdoğ, Advisor


R Programming, http://www.r-project.org/

Text Mining Package, http://cran.r-project.org/web/packages/tm/tm.pdf

Oracle Text, http://www.oracle.com/technetwork/database/enterprise-edition/11goracletexttwp-133192.pdf

Thank you for your attention.