A Deep Learning Model for Classifying the Hate and Offensive Language in Social Media Text

Nidhi Bhandari

Abstract


We recently introduced a model for identifying and removing toxic content from Twitter using an Information Retrieval (IR) model, SOIR (Semantic query Optimization-based Information Retrieval). SOIR identifies the class labels of tweets based on lexical and semantic analysis, and the results demonstrated the superiority of the SOIR model. The model is accurate, but social media is a big-data problem, and SOIR requires a significant amount of time and memory. In this paper, a deep learning technique is used to process large-scale social media text data. First, Natural Language Processing (NLP) based feature extraction is used to create four different sets of training samples: TF-IDF-based features, POS-tagged features, a reduced feature vector of POS, and a combined vector of TF-IDF and POS-tagged features. A deep Convolutional Neural Network (CNN) is then trained to classify hate and offensive language. The dataset was obtained from Kaggle. Performance is measured in terms of training accuracy, validation accuracy, training loss, and validation loss, together with time complexity. In addition, the class-wise precision, recall, F1-score, and mean accuracy are investigated. The experimental results show that the combined TF-IDF and POS-based features provide superior performance.
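To illustrate the combined feature set, the following is a minimal sketch of how a TF-IDF vector and a POS-tag vector might be joined for each tweet. It is not the implementation used in the paper: the toy tweets, the tiny `POS_LEXICON` lookup (standing in for a trained POS tagger), the smoothed IDF variant, and the use of simple concatenation are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus standing in for tweets (made-up examples, not the Kaggle dataset).
tweets = [
    "you are a wonderful friend",
    "you are a terrible person",
    "what a wonderful day",
]

# Hypothetical mini POS lexicon; a real pipeline would use a trained tagger.
POS_LEXICON = {
    "you": "PRON", "what": "PRON",
    "are": "VERB",
    "a": "DET",
    "wonderful": "ADJ", "terrible": "ADJ",
    "friend": "NOUN", "person": "NOUN", "day": "NOUN",
}
POS_TAGS = ["ADJ", "DET", "NOUN", "PRON", "VERB"]


def tfidf_vectors(docs):
    """Return (vocabulary, one TF-IDF vector per document)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Term frequency is the count normalised by document length.
        vectors.append([counts[t] / len(toks) * idf[t] for t in vocab])
    return vocab, vectors


def pos_vector(doc):
    """Relative frequency of each known POS tag in the document."""
    toks = doc.lower().split()
    # Words missing from the lexicon get tag "X", which is not in POS_TAGS
    # and therefore does not contribute to the vector.
    tags = Counter(POS_LEXICON.get(t, "X") for t in toks)
    return [tags[tag] / len(toks) for tag in POS_TAGS]


vocab, tfidf = tfidf_vectors(tweets)
pos = [pos_vector(t) for t in tweets]

# Combined features: concatenate the TF-IDF and POS vectors per tweet (assumed).
combined = [tv + pv for tv, pv in zip(tfidf, pos)]
```

Each row of `combined` has one entry per vocabulary term followed by one entry per POS tag, so the three feature sets (`tfidf`, `pos`, `combined`) can be fed to the same classifier and compared on equal terms.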

Keywords


Text mining, social media, semantic knowledge, sentiment analysis, deep learning, hate and offensive language.


DOI: http://dx.doi.org/10.23755/rm.v42i0.705



Copyright (c) 2022 NIDHI BHANDARI

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Ratio Mathematica - Journal of Mathematics, Statistics, and Applications. ISSN 1592-7415; e-ISSN 2282-8214.