AN ANALYSIS FOR MALAYALAM-ENGLISH CODE MIXED DATA

GREESHMA MOHAN G

PAVITHRA M M

ROSHNA N

Mentor: RENU S, ICFOSS

Dept. of Futures Studies, University of Kerala

Outline

1. Introduction

2. Approach

3. Data analysis

4. Conclusion

Introduction

Code-mixing is the mixing of the vocabulary and syntax of multiple languages within the same sentence. It is a result of bilingualism and multilingualism. Analysis of code-mixed text is essential in many real-world applications, such as review analysis and recommendation systems. Code-mixing is not only a common occurrence but also a communication requirement: a message can be conveyed in any single language, but speakers often mix languages to express themselves precisely and effectively, which is why the practice has become so common. It is used not only by ordinary bilingual speakers but sometimes deliberately by educated speakers as well.

Why Code-Mixed Data?

Even when speaking in their mother tongue, a speaker may use code-mixed words to emphasize a particular point. Communication over social media is growing, and so is the volume of code-mixed data it produces, which makes the analysis of this data increasingly relevant. We present a hybrid architecture for analyzing English-Malayalam code-mixed data. Our method consists of several steps, each addressing a specific issue.

Our Subject

Machines cannot readily interpret the informal language used in chat, so a statistical analysis of this data is not directly possible. To address this problem to a limited extent, we created a website, CODE MIXED ANALYSIS, for analyzing code-mixed data.

APPROACH

  • Decide on a data source
  • Search the unstructured data and eliminate useless data
  • Create a Malayalam-English code-mixed corpus
  • Train the machine using a machine learning algorithm
  • Create a web interface
  • Analyze the data
  • Visualize the interpreted result

Data science process

YouTube comments downloader

!git clone https://github.com/egbertbouman/youtube-comment-downloader.git

Data source

{"cid": "UgzjPwNqgyq3u7HafNB4AaABAg", "text": "Driver ayalayakkum ithil food kodukathe moshayi pooyi cheruthayitt just vilikanayirunnu", "time": "6 minutes ago", "author": "Riyu Fathima"}

{"cid": "UgwEmMtSHBuvUperHah4AaABAg", "text": "Who is here just to watch what it is because this show is always there in trending", "time": "9 minutes ago", "author": "Reva K"}

{"cid": "Ugxk7vcp58uKaTxyroN4AaABAg", "text": "Edentha lechu evde", "time": "16 minutes ago", "author": "Fidha Rahman"}

{"cid": "UgwMiKBc2c1B6ekFbz54AaABAg", "text": "Sitting at the back is really terrible especially on hills", "time": "22 minutes ago", "author": "Susan Eapen"}

{"cid": "Ugz5yg0ZqSphVCidtf94AaABAg", "text": "KL06idukki kar ethraperund ivide", "time": "26 minutes ago", "author": "Shibina Rinas"}

Data pre-processing

Only the text (reviews) is extracted from the collected data set. Emojis are removed from the text. The language of the text is identified using Python's Unicode handling and regular expressions.

A corpus of Malayalam-English code-mixed data is created on this basis.
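
The pre-processing steps can be sketched in Python as follows. This is a minimal sketch, not the authors' code: the function names (`extract_texts`, `remove_emojis`, `detect_script`) and the emoji ranges are assumptions made for illustration. It reads one JSON record per line, as in the downloader output shown earlier, keeps only the "text" field, strips emojis, and uses the Malayalam Unicode block (U+0D00 to U+0D7F) to tell Malayalam script from Latin script. Script detection alone cannot separate English from transliterated Malayalam, since both use Latin letters; that distinction is left to the trained tagger.

```python
import json
import re

# Malayalam Unicode block: U+0D00 to U+0D7F.
MALAYALAM_RE = re.compile(r"[\u0D00-\u0D7F]")
LATIN_RE = re.compile(r"[A-Za-z]")

# Rough emoji ranges (an assumption; not an exhaustive list of emoji code points).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U0001F1E6-\U0001F1FF\u2600-\u27BF\uFE0F]+"
)


def extract_texts(lines):
    """Yield the "text" field of each JSON comment record, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)["text"]


def remove_emojis(text):
    """Replace emoji runs with a space, then collapse repeated whitespace."""
    without_emojis = EMOJI_RE.sub(" ", text)
    return re.sub(r"\s+", " ", without_emojis).strip()


def detect_script(token):
    """Return "malayalam", "latin", or "other" based on the characters in a token."""
    if MALAYALAM_RE.search(token):
        return "malayalam"
    if LATIN_RE.search(token):
        return "latin"
    return "other"
```

A file of downloaded comments could be processed with `[remove_emojis(t) for t in extract_texts(open(path, encoding="utf-8"))]`, where `path` is wherever the downloader output was saved.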

Corpus

The corpus is created by giving each word a label of the form "lan_pos" (language and part of speech). Our corpus consists of around 2000 words, including Malayalam, English, Manglish, and English abbreviations used in chats. The BIS tagset is used for POS tagging. The language labels we used are as follows:

ml - Malayalam

en - English

mal - Malayalam transliterated

eng - Abbreviated English words
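
As an illustration of the "lan_pos" label format, the sketch below splits a label into its language code and its POS tag. The helper name `split_label` and the example labels are assumptions made for illustration and are not taken from the corpus itself. Because POS tags can themselves contain underscores, only the first underscore is treated as the separator.

```python
# Language codes listed above, mapped to their descriptions.
LANGUAGE_NAMES = {
    "ml": "Malayalam",
    "en": "English",
    "mal": "Malayalam transliterated",
    "eng": "Abbreviated English words",
}


def split_label(label):
    """Split a "lan_pos" label into (language code, POS tag).

    Only the first underscore separates the two parts, so a POS tag
    that contains an underscore is kept whole.
    """
    lang, sep, pos = label.partition("_")
    if not sep or lang not in LANGUAGE_NAMES or not pos:
        raise ValueError(f"not a valid lan_pos label: {label!r}")
    return lang, pos


# Hypothetical labels, for illustration only.
lang, pos = split_label("mal_N_NN")
print(LANGUAGE_NAMES[lang], pos)
```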

Corpus

(demo)

Data Analysis

Training and Testing

Using the machine learning toolkit "TnT", we trained the machine on the corpus that we created. A test file was then tagged with the trained model. The output is as follows.

Statistical Analysis

The tested words are counted by label: Malayalam (ml), English (en), Malayalam transliterated (mal), and abbreviated English (eng). The percentage of each is then calculated, which completes the statistical analysis of the test data. Given a message to test, we obtain a label for each word that includes its language and part of speech. In addition, we get the count and percentage of code-mixed words present in that message. The counts are computed from the TnT output with the help of Python regular expressions.
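
The counting step can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' script: it assumes each line of the tagged output holds a token, whitespace, and a "lan_pos" label, and that lines starting with "%%" are comments. The function name `language_stats` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# One tagged line: token, whitespace, then a label such as "mal_N_NN".
TAGGED_LINE_RE = re.compile(r"^(\S+)\s+(ml|en|mal|eng)_(\S+)$")


def language_stats(tagged_lines):
    """Return {language code: (count, percentage)} for the tagged tokens."""
    counts = Counter()
    for line in tagged_lines:
        line = line.strip()
        # Skip blank lines and comment lines (assumed to start with "%%").
        if not line or line.startswith("%%"):
            continue
        match = TAGGED_LINE_RE.match(line)
        if match:
            counts[match.group(2)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: (n, 100.0 * n / total) for lang, n in counts.items()}


# Hypothetical tagged output, for illustration only.
sample = [
    "%% tagged output",
    "evde\tmal_PR_PRQ",
    "just\ten_RB",
    "food\ten_N_NN",
    "lol\teng_N_NN",
]
for lang, (count, percent) in language_stats(sample).items():
    print(f"{lang}: {count} words, {percent:.1f}%")
```

The resulting percentages are the kind of values that a pie chart of the language distribution would be drawn from.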

Visualization

The percentage analysis and the TnT output are visualized as a pie chart and a word cloud.

Website

A web interface called CODE MIXED ANALYSIS was created to receive the user's input and show the result within the page, so that users do not have to run the code themselves.

The web interface is built with Python Flask and HTML.

CONCLUSION

Our desktop search application performed well with code-mixed Malayalam-English queries. It was also able to handle the unstructured files on our computers in an efficient way. We intend to make the application generic: the user will be able to provide their own training files to train the system. In this way, multilingual users will be able to train the system to work with other code-mixed query languages.

FUTURE WORK

  • Smart composer
  • Dictionary
  • Communication

Thank you!
