Word Level Language Identification of Code Mixing Text in Social Media using NLP

Kasthuri Shanmugalingam, Sagara Sumathipala, Chinthaka Premachandra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Understanding social media content has been a primary research topic since the dawn of social networking. Of particular interest is the contextual understanding of noisy text, which is characterized by a high percentage of spelling mistakes, creative spelling, phonetic typing, wordplay, abbreviations, and meta tags. Processing such data therefore demands a more complex system than traditional natural language processors. In addition, people readily mix two or more languages to express their thoughts on social media. Automatic language identification at the word level thus becomes a necessary part of analyzing noisy social media content, and it supports the automated analysis of content generated on social media. This study uses Tamil-English code-mixed data from popular social media posts and comments and provides word-level language tags using Natural Language Processing (NLP) and modern Machine Learning (ML) technologies. The methodology is a novel approach implemented as a machine learning classifier based on features such as Tamil Unicode characters in Roman script, dictionaries, double consonants, and term frequency. Several machine learning classifiers, namely Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Decision Trees, and Random Forest, were used in training and testing. Among them, the SVM classifier obtained the highest accuracy, 89.46%.
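
As an illustration of the word-level features listed in the abstract, the following Python sketch computes a small feature dictionary for each token. It is not the authors' implementation: the word lists, the function names `word_features` and `featurize_sentence`, the reading of "Tamil Unicode characters" as a check against the Tamil Unicode block (U+0B80 to U+0BFF), and the regular expression used for double consonants are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical toy dictionaries; the paper's actual word lists are not given here.
ENGLISH_WORDS = {"the", "is", "super", "movie", "song", "love"}
TAMIL_ROMAN_WORDS = {"padam", "romba", "nalla", "irukku", "paattu"}

# Assumed definition of "double consonant": the same consonant letter twice in a row.
DOUBLE_CONSONANT = re.compile(r"([b-df-hj-np-tv-z])\1")


def word_features(word, term_counts):
    """Return a feature dictionary for one token.

    term_counts is a Counter of lowercased tokens from some corpus.
    """
    w = word.lower()
    return {
        "in_english_dict": w in ENGLISH_WORDS,
        "in_tamil_dict": w in TAMIL_ROMAN_WORDS,
        "has_double_consonant": bool(DOUBLE_CONSONANT.search(w)),
        # True if any character lies in the Tamil Unicode block.
        "has_tamil_script": any("\u0b80" <= ch <= "\u0bff" for ch in word),
        # Counter returns 0 for tokens it has never seen.
        "term_frequency": term_counts[w],
    }


def featurize_sentence(sentence):
    """Split on whitespace and pair each token with its features."""
    tokens = sentence.split()
    counts = Counter(t.lower() for t in tokens)
    return [(t, word_features(t, counts)) for t in tokens]


if __name__ == "__main__":
    for token, feats in featurize_sentence("padam romba nalla irukku super movie"):
        print(token, feats)
```

In a setup like the one the abstract describes, feature dictionaries of this kind would be converted to vectors and passed to a classifier such as an SVM; that training step is omitted here.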

Original language: English
Title of host publication: 2018 3rd International Conference on Information Technology Research, ICITR 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728114705
DOIs: https://doi.org/10.1109/ICITR.2018.8736127
Publication status: Published - 2018 Dec 1
Event: 3rd International Conference on Information Technology Research, ICITR 2018 - Moratuwa, Sri Lanka
Duration: 2018 Dec 5 to 2018 Dec 7

Publication series

Name: 2018 3rd International Conference on Information Technology Research, ICITR 2018

Conference

Conference: 3rd International Conference on Information Technology Research, ICITR 2018
Country: Sri Lanka
City: Moratuwa
Period: 18/12/5 to 18/12/7

Fingerprint

Learning systems
Classifiers
Support vector machines
Processing
Speech analysis
Glossaries
Decision trees
Logistics
Large scale systems
Testing
Natural language processing
Language
Social media
Classifier
Machine learning
Support vector machine
Tag

Keywords

  • Code-mixing
  • language identification
  • machine learning
  • NLP

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Computer Science Applications
  • Information Systems and Management
  • Media Technology

Cite this

Shanmugalingam, K., Sumathipala, S., & Premachandra, C. (2018). Word Level Language Identification of Code Mixing Text in Social Media using NLP. In 2018 3rd International Conference on Information Technology Research, ICITR 2018 [8736127] (2018 3rd International Conference on Information Technology Research, ICITR 2018). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICITR.2018.8736127

@inproceedings{bf5ab831b8644eef8040ba689c8cdc33,
title = "Word Level Language Identification of Code Mixing Text in Social Media using NLP",
keywords = "Code-mixing, language identification, machine learning, NLP",
author = "Kasthuri Shanmugalingam and Sagara Sumathipala and Chinthaka Premachandra",
year = "2018",
month = "12",
day = "1",
doi = "10.1109/ICITR.2018.8736127",
language = "English",
series = "2018 3rd International Conference on Information Technology Research, ICITR 2018",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
booktitle = "2018 3rd International Conference on Information Technology Research, ICITR 2018",

}

TY - GEN

T1 - Word Level Language Identification of Code Mixing Text in Social Media using NLP

AU - Shanmugalingam, Kasthuri

AU - Sumathipala, Sagara

AU - Premachandra, Chinthaka

PY - 2018/12/1

Y1 - 2018/12/1

KW - Code-mixing

KW - language identification

KW - machine learning

KW - NLP

UR - http://www.scopus.com/inward/record.url?scp=85068474646&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85068474646&partnerID=8YFLogxK

U2 - 10.1109/ICITR.2018.8736127

DO - 10.1109/ICITR.2018.8736127

M3 - Conference contribution

AN - SCOPUS:85068474646

T3 - 2018 3rd International Conference on Information Technology Research, ICITR 2018

BT - 2018 3rd International Conference on Information Technology Research, ICITR 2018

PB - Institute of Electrical and Electronics Engineers Inc.

ER -