Team: 10
School: Academy For Tech & Classics
Area of Science: Natural Language Processing
Interim:
Problem Definition: Social media platforms currently confront a large volume of controversial content, which carries significant consequences for the emotional health of their users [1]. Without an effective way to classify this content, platforms struggle to limit its spread.
Plan for Solving Problem Computationally: In order to accurately categorize controversial content, some researchers have found success using manually annotated datasets. However, these datasets are limited in size and would require constant updating to keep pace with what is currently controversial [4]. To avoid these issues, we instead plan to predict various heuristics that could indicate controversy (rather than controversy itself). For example, Twitter posts expose several variables that could serve this purpose.
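As a rough illustration of the idea, the sketch below derives a few candidate heuristics from a tweet's engagement counts. The field names assume the Twitter API v2 "public_metrics" object, and the specific ratios are hypothetical examples, not our final feature list.

```python
# Hypothetical sketch: deriving controversy-related heuristics from a tweet's
# engagement metadata (Twitter API v2 "public_metrics" field names assumed).

def controversy_heuristics(tweet: dict) -> dict:
    """Compute simple proxy signals from a tweet's engagement counts."""
    m = tweet["public_metrics"]
    likes = m.get("like_count", 0)
    replies = m.get("reply_count", 0)
    retweets = m.get("retweet_count", 0)
    quotes = m.get("quote_count", 0)
    return {
        # High reply-to-like ratios often accompany contested posts.
        "reply_like_ratio": replies / max(likes, 1),
        # Quote tweets (commentary) vs. plain retweets (endorsement).
        "quote_retweet_ratio": quotes / max(retweets, 1),
        "total_engagement": likes + replies + retweets + quotes,
    }

sample = {"public_metrics": {"like_count": 40, "reply_count": 120,
                             "retweet_count": 10, "quote_count": 30}}
print(controversy_heuristics(sample))
```

Heuristics like these can be computed automatically for any post, which is what lets the approach scale without manual annotation.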
Current Progress: We have made significant progress toward collecting the data needed to train our model. We have a working program that collects data from the Twitter API, and we plan to deploy it at scale very soon so that it is collecting and storing data continuously. We also attempted to collect data from Facebook, but due to that platform's restrictions on scraping, we were unable to obtain any. Additionally, we have written the code that takes our raw text output and represents it as a series of word embeddings, which can be input into our BERT model to make predictions.
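The text-preparation step can be sketched as follows: map each token to an integer id, add the special [CLS] and [SEP] markers, pad to a fixed length, and build an attention mask. In our actual pipeline this is handled by BERT's WordPiece tokenizer (e.g. via a library such as Hugging Face `transformers`); the toy whitespace vocabulary below is only a stand-in so the shape of the model's input is clear.

```python
# Sketch of preparing raw post text for a BERT-style model. A real pipeline
# would use BERT's WordPiece tokenizer; this toy whitespace vocabulary is a
# stand-in to illustrate the output format (ids + attention mask).

SPECIALS = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3}

def build_vocab(texts):
    """Assign an integer id to every whitespace token seen in the corpus."""
    vocab = dict(SPECIALS)
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(text, vocab, max_len=16):
    """Return (input_ids, attention_mask), both padded to max_len."""
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(t, vocab["[UNK]"]) for t in text.lower().split()]
    ids = ids[: max_len - 1] + [vocab["[SEP]"]]  # truncate, then close
    mask = [1] * len(ids)
    pad = max_len - len(ids)
    return ids + [0] * pad, mask + [0] * pad

vocab = build_vocab(["this post is fine"])
print(encode("this post is fine", vocab, max_len=8))
```

The attention mask tells the model which positions hold real tokens versus padding, which is what allows posts of different lengths to be batched together.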
Expected Results: Because of the highly stochastic nature of social media, it is unrealistic to expect extraordinarily precise predictions from our model. However, we anticipate that it will give generally accurate predictions of certain post heuristics that align with a post's controversy. While it does not make sense to censor all forms of controversial content, the ability to de-prioritize it represents a powerful tool for reducing the emotional and societal harm largely caused by social media.
References:
[1] Brady, William J., et al. "How Social Learning Amplifies Moral Outrage Expression in Online Social Networks." Science Advances 7.33 (2021): eabe5641.
[2] Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." (2018).
[3] Devlin, Jacob, and Ming-Wei Chang. "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing." Google AI Blog, Google, 2 Nov. 2018, https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html.
[4] Mozafari, M., Farahbakhsh, R., and Crespi, N. "Hate Speech Detection and Racial Bias Mitigation in Social Media Based on BERT Model." PLoS ONE 15.8 (2020): e0237861. https://doi.org/10.1371/journal.pone.0237861.
[5] NLTK Project. "nltk.sentiment.sentiment_analyzer module." NLTK, 2 Jan. 2023, https://www.nltk.org/api/nltk.sentiment.sentiment_analyzer.html. Retrieved 9 Jan. 2023.
Team Members:
Sponsoring Teacher: Jenifer Hooten