The impossible puzzle of piecing together unique data sets.

April Ye
5 min read · Jun 6, 2021

Since the 2016 election, fake news has been a major discussion point. Until the emergence of the coronavirus in late 2019, however, fake news mostly impacted political views. Now, after living through a pandemic for over a year, we can see just how dangerous fake news can be when it comes to stopping the spread of a deadly virus across the world. A major reason the initial COVID lockdown in the US was unsuccessful was fake news. Article upon article claiming that COVID was a conspiracy theory spread across the country like wildfire, and as a result, this deadly virus has claimed over half a million lives in the US alone. So how can we combat such a force? Reputable institutions such as Princeton, MIT, Stanford, and many others have already conducted research on just how quickly fake news spreads, with the consensus being that fake news spreads, on average, six times faster than true news.

I wanted to replicate the general consensus that fake news spreads six times faster than true news, and even take it a step further to evaluate whether extreme fake news spreads faster than mild fake news. The project seemed pretty straightforward. I would review multiple open data sets made available by previous research teams, comparing how many times a fake news article was shared in a certain time frame versus how many times a true news article was shared in the same time frame. For the second piece of my research, I would find 7–10 articles I deemed extreme fake news and 7–10 articles I deemed mild fake news and compare how many times each article was shared in a certain time frame. I knew this second piece would be subjective to my own interpretation of what counted as extreme or mild, but since I would disclose this weakness in my research report, I felt it was okay to go ahead and entertain my curiosity.
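
To make the plan concrete, here is a minimal sketch of the comparison I had in mind, assuming one tidy data set existed with a truth label and a share count per article measured over the same time window. The file name and column names below are hypothetical; as you'll see, no data set like this actually turned up.

```python
# A minimal sketch of the planned comparison, assuming a single tidy data set.
# File and column names ("label", "shares") are hypothetical placeholders.
import pandas as pd

# Hypothetical file: one row per article, with a truth label ("fake"/"true")
# and a share count measured over the same fixed time window for every article.
articles = pd.read_csv("articles_with_share_counts.csv")

# Average shares per article within the window, split by label.
avg_shares = articles.groupby("label")["shares"].mean()

fake_avg = avg_shares.get("fake", float("nan"))
true_avg = avg_shares.get("true", float("nan"))

print(f"Average shares (fake): {fake_avg:.1f}")
print(f"Average shares (true): {true_avg:.1f}")
print(f"Ratio fake/true: {fake_avg / true_avg:.2f}")  # hoping to see roughly 6x
```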

At first I reviewed the ESOC COVID-19 Misinformation Dataset, published by Princeton University's Empirical Studies of Conflict Project: https://esoc.princeton.edu/publications/esoc-covid-19-misinformation-dataset. It felt like a good starting point because the data set included tweets that mention information flagged as false, the source/link to the fake or unreliable article, the tweet's country of origin, keywords, and the type of misinformation. However, I later realized that while this data set had a good directory of fake news articles, it only listed up to four tweets that directly mentioned each article and failed to capture how many times the original tweet had been retweeted, which is how articles gain major traction and spread like wildfire in the twitterverse.

No biggie; to overcome this bump in the road, I decided to look at other available data sets and combine their information to get what I needed. Sadly, I soon realized this was much easier said than done. After looking through 10+ pages of Google Scholar search results, I found that every data set I could access provided a small piece of my puzzle and was completely lacking in others. Although I can't find it anymore (it got lost in my history of combing through those 10+ pages of results), one data set tracked all the tweets and retweets of fake news articles but didn't list the actual linked article, so I couldn't confirm for myself that those tweets were, in fact, referencing fake news. After all, the whole point of my project was to validate previous findings, and I couldn't do that by making assumptions about the previous research. Even less helpful, the tweets in this data set were listed only by reference number, with no context as to what each tweet said. I couldn't even cross-compare it with the Princeton data set, because the lack of article titles made it impossible to pair each tweet reference number with Princeton's list of directly mentioned fake news articles. Each data set I viewed may have brought one piece of the puzzle to the table, but it felt like each piece came from a different puzzle entirely. So even if I had all the pieces, in the end they wouldn't have fit together anyway.
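
For reference, this is roughly the join I was hoping to perform, assuming both data sets exposed a shared key such as the article URL. Every file name and column name below is hypothetical, and the absence of any such shared key is exactly what made this impossible in practice.

```python
# A sketch of the hoped-for join, assuming both data sets shared a key like
# the article URL. All file and column names here are hypothetical; the real
# retweet data set only exposed opaque tweet reference numbers.
import pandas as pd

# ESOC-style directory: one row per flagged article, with its link,
# country of origin, keywords, and misinformation type.
esoc = pd.read_csv("esoc_misinformation_directory.csv")

# Hypothetical second data set: tweet and retweet counts per shared link.
spread = pd.read_csv("tweet_retweet_counts.csv")

# Inner join on the article link: only articles present in both survive.
combined = esoc.merge(spread, on="article_url", how="inner")

# Without a shared column like "article_url", this merge has nothing to
# join on, which is exactly the wall I hit.
print(combined[["article_url", "misinformation_type", "retweet_count"]].head())
```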

This entire journey was interesting to me because open data sets are made open so that future researchers can review the original work and attempt to replicate its results, validating the findings if they do. However, although each article I read had interesting insights into fake news and how it spreads, it seems like many of them either document their process really well but publish a data set that doesn't quite have all the information you need to recreate that process, or publish a very detailed data set without defining the process clearly enough for you to know what to do with all that information. I'm not a professional data analyst or researcher, so take this last bit with a grain of salt. For the scope of my project and what I was attempting to do, it would have been extremely helpful to have a detailed, documented research process that directly referenced columns in the available data set, either by title or by number, so future researchers could truly recreate the results and validate the findings. If some of the data used for the research can't be made available to the public, and therefore not everyone can recreate the results, it would be helpful to know that from a simple disclaimer at the end of the report. I know data has never been this easily available before, so these research reports are often written for other professional researchers rather than the general public, but I think we could all learn from the interesting research being done, and more direct documentation of the process and data set could really help people validate research for themselves. As an added benefit, making it easier for people to replicate research results could also teach a few people how to distinguish a reputable source from a non-reputable one and help slow the spread of fake news.
