GREESHMA MOHAN G
PAVITHRA M M
ROSHNA N
Mentor: RENU S, ICFOSS
Dept. of Futures Studies, University of Kerala
1. Introduction
2. Approach
3. Data analysis
4. Conclusion
Even when speaking their mother tongue, speakers may use code-mixed words to emphasize a particular point. As communication over social media grows, so does the volume of code-mixed data, which makes the analysis of this data increasingly relevant. We present a hybrid architecture for analyzing English-Malayalam code-mixed data. Our method consists of several steps, each addressing a specific issue.
Machines cannot directly interpret informal chat language, so a statistical analysis of this data is not straightforward. To address this problem in part, we created a website, CODE MIXED ANALYSIS, for analyzing code-mixed data.
APPROACH
YouTube comment downloader
!git clone https://github.com/egbertbouman/youtube-comment-downloader.git
{"cid": "UgzjPwNqgyq3u7HafNB4AaABAg", "text": "Driver ayalayakkum ithil food kodukathe moshayi pooyi cheruthayitt just vilikanayirunnu", "time": "6 minutes ago", "author": "Riyu Fathima"}
{"cid": "UgwEmMtSHBuvUperHah4AaABAg", "text": "Who is here just to watch what it is because this show is always there in trending", "time": "9 minutes ago", "author": "Reva K"}
{"cid": "Ugxk7vcp58uKaTxyroN4AaABAg", "text": "Edentha lechu evde", "time": "16 minutes ago", "author": "Fidha Rahman"}
{"cid": "UgwMiKBc2c1B6ekFbz54AaABAg", "text": "Sitting at the back is really terrible especially on hills", "time": "22 minutes ago", "author": "Susan Eapen"}
{"cid": "Ugz5yg0ZqSphVCidtf94AaABAg", "text": "KL06idukki kar ethraperund ivide", "time": "26 minutes ago", "author": "Shibina Rinas"}
Only the text (the comment body) is extracted from the collected data set. Emojis are removed from the text. The language of the text is identified using Unicode ranges and regular expressions in Python.
A corpus of Malayalam-English code-mixed data is created on this basis.
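The preprocessing steps described here can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names and the emoji ranges are assumptions, and the emoji pattern is approximate rather than exhaustive. Note that a script check can only separate Malayalam script from Latin script; telling English apart from Manglish is left to the trained tagger.

```python
import json
import re

# Malayalam Unicode block: U+0D00 to U+0D7F.
MALAYALAM = re.compile(r"[\u0D00-\u0D7F]")
LATIN = re.compile(r"[A-Za-z]")

# Approximate emoji ranges (an assumption; not an exhaustive list).
EMOJI = re.compile(
    "[\U0001F300-\U0001FAFF"   # pictographs, emoticons, transport, etc.
    "\U00002600-\U000027BF"    # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"    # regional indicator (flag) letters
    "\uFE0F\u200D]+"           # variation selector and zero-width joiner
)


def extract_text(jsonl_line):
    """Return only the "text" field of one downloaded comment record."""
    return json.loads(jsonl_line)["text"]


def remove_emojis(text):
    """Strip characters that fall in the emoji ranges above."""
    return EMOJI.sub("", text)


def script_of(word):
    """Classify a word by script: 'malayalam', 'latin', or 'other'."""
    if MALAYALAM.search(word):
        return "malayalam"
    if LATIN.search(word):
        return "latin"
    return "other"
```

Each line of the downloaded file would pass through `extract_text` and `remove_emojis`, after which the words can be split on whitespace and checked with `script_of`.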
Corpus
Each word in the corpus is given a label of the form "lan_pos" (language, then part of speech). Our corpus consists of around 2000 words, including Malayalam, English, Manglish, and English chat abbreviations. The BIS tagset is used for POS tagging. The language labels we used are as follows:
ml - Malayalam
en - English
mal - Malayalam transliterated
eng - Abbreviated English words
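As an illustration of the "lan_pos" scheme, the sketch below builds and splits such labels. The helper names are hypothetical, and the exact way the language code and the BIS tag are joined in the corpus is an assumption. Because BIS tags such as N_NN (common noun) or V_VM (main verb) contain an underscore themselves, the label is split only at the first underscore.

```python
# Language codes used in the corpus, as listed above.
LANGS = {
    "ml": "Malayalam",
    "en": "English",
    "mal": "Malayalam transliterated",
    "eng": "Abbreviated English words",
}


def make_label(lang, pos):
    """Join a language code and a BIS POS tag into one "lan_pos" label."""
    if lang not in LANGS:
        raise ValueError(f"unknown language code: {lang}")
    return f"{lang}_{pos}"


def split_label(label):
    """Split a label at the FIRST underscore only, since BIS tags
    such as N_NN contain an underscore of their own."""
    lang, _, pos = label.partition("_")
    return lang, pos
```

For example, `make_label("mal", "N_NN")` gives `"mal_N_NN"`, and `split_label` recovers the two parts from it.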
Data Analysis
Using the machine learning toolkit "TnT", we trained a model on the corpus that we created and then ran it on a test file. The output is as follows.
The tested words are counted by label: Malayalam (ml), English (en), Malayalam transliterated (mal), and abbreviated English (eng). The percentage of each is then calculated, which gives the statistical analysis of the test data. When a message is submitted for testing, each word receives a label giving its language and part of speech. In addition, we obtain the count and percentage of code-mixed words present in that message. The counts are computed from the TnT output with the help of Python regular expressions.
The percentage analysis and the TnT output are visualised as a pie chart and a word cloud.
A web interface called CODE MIXED ANALYSIS receives the user's input and shows the result within the same page, so that users do not have to run the code themselves.
The web interface is built with Python Flask and HTML.
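The counting step could look something like the sketch below. It assumes the tagged output has one word per line followed by whitespace and a "lan_pos" label, with comment lines starting with `%%`; the function name and the sample tags in the usage note are illustrative, not taken from the project.

```python
import re
from collections import Counter

# One tagged token per line: word, whitespace, then a "lan_pos" label.
TAGGED_LINE = re.compile(r"^(\S+)\s+(ml|en|mal|eng)_(\S+)$")


def language_stats(tagged_output):
    """Count words per language label and convert counts to percentages."""
    counts = Counter()
    for line in tagged_output.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("%%"):
            continue
        match = TAGGED_LINE.match(line)
        if match:
            counts[match.group(2)] += 1
    total = sum(counts.values())
    percentages = {}
    if total:
        percentages = {lang: 100 * n / total for lang, n in counts.items()}
    return counts, percentages
```

Given a made-up tagged message such as `"food\ten_N_NN\nkodukathe\tmal_V_VM"`, the function returns the per-language counts and each label's share of the message, which can then feed the pie chart.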
Smart composer
Dictionary
Communication