A collaboration between researchers in computer science, mathematics and linguistics has resulted in a program that distinguishes automatically generated Twitter posts from manually written ones, regardless of language. The purpose is to improve accuracy when texts from Twitter are used in sociolinguistic studies.
Twitter has over 300 million active users every month and plays a large role in public debate. This makes Twitter an excellent platform for research – but also for spammers and automated programs.
Studies have estimated that 5–10 % of all users are so-called bots, which automatically create their own tweets and retweet others, and that 20–25 % of all tweets are autogenerated. Many bot accounts are used to influence societal issues, such as elections, and to spread fake news. They can thereby also distort research in areas such as political campaigns and social change.
Man or machine?
Therefore, it is important to be able to investigate whether a given tweet is written by a human or a machine. For this purpose, researchers in the digital humanities – computer science, mathematics and linguistics – at Linnaeus University and the University of Eastern Finland have developed a computer program that uses machine learning.
"The program provides data of better quality and thus a better picture of reality when collecting texts from Twitter for sociolinguistic research based on their content", says Jonas Lundberg, senior lecturer in computer science at Linnaeus University.
The program developed by the researchers has a unique feature that distinguishes it from previous attempts.
"The algorithm in the program only examines parameters that are both language and country independent in the metadata that accompanies each tweet. The text, the Twitter message itself, is not used. This makes the algorithm language independent and applicable also in less used languages and on data sets that use several different languages", says Jonas Lundberg.
The results are promising. After the program had been trained using Swedish and Finnish tweets, it could correctly classify 98.2 % of all tweets in a third language, English. But the development work continues.
"A lot of work remains. We need to train and test the algorithm in more languages before it can be considered reliable", says Jonas Lundberg.
The article is called Towards a language independent Twitter bot detector and was presented at the 4th Digital Humanities in the Nordic Countries conference.
The work is part of the research in the Data Intensive Digital Humanities group at Linnaeus University. This interdisciplinary research team combines traditional philological methods with empirical evidence to develop and apply new methods to enrich and (visually) analyse natural-language data streams in social media. The goal is to gain new insights on language variation in social contexts.
The program is currently used mainly for filtering out machine-generated tweets before linguistic analyses are initiated. However, attempts have also been made to use the program to recognize Twitter users with a very large proportion of machine-generated tweets, so-called bots (from the word robot).