"The goal is to turn data into information, and information into insight." Carly Fiorina, Former CEO of HP.
"You will get stuck in the very first step - get meaningful data from the chaotic social media platform. Once passed the stage, you start to fly." Jeffrey
You might have read in the news about personal protective equipment (PPE) and equipment shortages. Dwindling supplies have led many nurses, doctors, and frontline workers to put up posts on social media asking for help. However, these posts are often lost in the hectic environment of the social media platform. A quick search under the latest #coronavirus posts reveals what Twitter has to offer: a plethora of foreign users posting in foreign languages, spam, or anger.
Finding these posts and creating a comprehensive report is going to require a large amount of data. Since manually sifting through it is far too time-consuming and unproductive, the process should be automated.
We get our tweet data by querying different hashtags. There are many hashtags, and although some are similar in wording (e.g. #coronavirusupdate and #coronavirusupdates), each one is unique and has unique content. It is important to review each hashtag to determine whether it should be included in our search. For example, #covid19vic would not be a good query since it revolves around Victoria, Australia.
To find these subtle differences, we started with keywords, e.g. coronavirus, covid, pandemic. These keywords were fed into a supplementary website named keywordtool.io, which finds lexicographically related hashtags.
For example, a query using coronavirus will yield #coronavirusnl, #coronavirusbrasil, #coronavirusbh, and so on.
Four categories were considered when reviewing a single hashtag:
Region. Our efforts focus on the United States, which means that a sizable number of the posts made under the hashtag must be from within the United States.
Language. We filter and process based on the text in each tweet. Keeping everything in English is easier to manage.
Content. The tweets should be from legitimate users. Spam and bot posts do not help our cause.
Activity. We need to balance the data extraction time against the query character limit of the Twitter API.
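The four criteria above can be sketched as a simple review function. This is an illustrative mock-up, not our production code: the thresholds, field names, and the idea that each criterion is checked against a pre-labeled sample of the hashtag's tweets are all assumptions for the sake of the example.

```python
# Hypothetical sketch of the four-criteria hashtag review.
# Thresholds and tweet-dict fields ('country', 'lang', 'is_spam') are
# illustrative assumptions about upstream processing.

def review_hashtag(sample_tweets, min_us_share=0.5, min_english_share=0.8,
                   max_spam_share=0.2, min_sample_size=100):
    """Decide whether a hashtag is worth querying, given a sample of its tweets."""
    n = len(sample_tweets)
    if n == 0:
        return False
    us_share = sum(t["country"] == "US" for t in sample_tweets) / n
    english_share = sum(t["lang"] == "en" for t in sample_tweets) / n
    spam_share = sum(t["is_spam"] for t in sample_tweets) / n
    # Region, Language, Content, Activity -- all four criteria must pass.
    return (us_share >= min_us_share
            and english_share >= min_english_share
            and spam_share <= max_spam_share
            and n >= min_sample_size)
```

A hashtag like #covid19vic would fail the region check here, since most of its sampled tweets would not carry a US country label.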
After going through 181 hashtags, we found that, surprisingly, only a small percentage of them met our criteria.
Here are the most-popular ones used by US tweeters:
#coronavaccine, #coronavirus, #coronavirus2020, #coronavirusoutbreak, #CoronavirusPandemic, #covid1984, #covid19colorado, #covid19isgettingcloser, #covid19ma, #covid19pandemic, #covid19quarantine, #covid19usa, #covid19wl, #covidchronicles, #covidcoping, #covidcrisis, #covidfalseflag, #covidfatigue, #covidflorida, #covidfoam, #covidkim, #covidstigmake, #covidstopswithme, #covidsymptoms, #covidtest, #covidtesting, #covidtv, #covidusa, #flattenthecurve, #getmeppe, #LockdownNow, #Mask, #pandemic, #pandemiclife, #pandemicpods, #ppe, #socialdistancing, #stayhomesavelives, #stayhomestaysafe, #WearADamnMask, #wearamask, #WearAMaskSaveALife
Getting location from tweets brings many benefits. Having a per-region breakdown of trends yields powerful insights on the Coronavirus Pandemic. Looking at location at a larger scale, it also allows us to filter a majority of tweets which saves us time and resources.
We want to focus on tweets made in the United States, and having accuracy down to the city level will allow us to better visualize pandemic trends on social media.
After analyzing 81,422 tweets from one typical day in July 2020, these were our findings. The tweets were queried using the Twitter API from the 10 most popular Coronavirus hashtags.
Surprisingly, only 31.5% of all tweets in our database were from the United States. Now, how did we come to this conclusion?
Geoparsing "refers to the process of extracting place-names from text and matching those names unambiguously with proper nouns and spatial coordinates" (definition via Medium). To determine the location of a tweet, we geoparse two sections: the author location and the text.
Right now, we are using CLIFF, an entity-extraction and geoparsing tool that can recognize common place words, e.g. street names or cities, in a string of text. This makes it great for our use case, since the location is often buried in sentences or phrases. We run the geoparsing tool on the author bios as well as on the tweet text.
One question remains after looking at the JSON object for a single tweet: why not use reverse geocoding to analyze the latitude and longitude tag of a tweet? After all, some tweets do contain coordinates representing an exact location or encircling a region. All we need to do is turn the latitude/longitude coordinates into an address using reverse geocoding.
Unfortunately, not all tweets have geotagging enabled. This might be due to privacy concerns or other factors we cannot control. Reverse geocoding also carries a time cost at our scale. Searching for reverse geocoding leads us to Geopy, a Python client for several popular geocoding services.
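The two-path lookup described above can be sketched as follows. The `reverse_geocode` and `geoparse` parameters stand in for calls to a geocoding service like Geopy and a geoparser like CLIFF; the tweet field names are assumptions for illustration.

```python
# Illustrative sketch: prefer exact coordinates when the tweet is geotagged,
# otherwise fall back to geoparsing free text (author bio first, then the
# tweet body). The two callables stand in for Geopy and CLIFF lookups.

def locate_tweet(tweet, reverse_geocode, geoparse):
    """Return a best-effort location string for a tweet, or None."""
    coords = tweet.get("coordinates")
    if coords is not None:
        # Geotagged tweet: turn latitude/longitude into an address.
        return reverse_geocode(coords)
    # No geotag: geoparse the author bio, then the tweet text.
    for text in (tweet.get("author_bio", ""), tweet.get("text", "")):
        place = geoparse(text)
        if place:
            return place
    return None
```

Because only a minority of tweets are geotagged, the geoparsing fallback ends up doing most of the work in practice.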
Getting location data has monumentally helped our efforts to find those on social media who need help. Not only does it allow us to filter out almost two-thirds of the data extracted, it also allows us to create reports with detail going down to the city level. This technology is still not perfect, but we are constantly striving to find more accurate and efficient ways to aid our bulk location processing in the future.
Categorizing tweets is a crucial step in the data analysis process. Not only does it allow us to remove extraneous tweets from the final dataset, but it also provides insight into which areas the coronavirus is affecting the most.
Every tweet in our dataset is put through our labeling algorithm. We use the labels to filter out the discussions that are not focused on frontline workers or people affected by the pandemic.
As a general overview, we run the tweets through two filter layers before arriving at the final dataset. In First Level Noise Filtering, we remove tweets that do not fit our general interest. In Second Level Noise Filtering, we select the discussions that match our topics of interest.
The first layer removes tweets that are unrelated to the Coronavirus pandemic in the United States. The tags we remove are:
From a typical day in July 2020, we extracted 81,422 tweets. The above graph shows the breakdown for tweets classified as first-level noise, second-level noise, and tweets of interest.
Upon detecting political terminology and content, we flag the tweet under the political tag. We also run a word count that excludes links, emojis, hashtags, and mentions; if the count is less than 10, we flag the tweet as gibberish. CLIFF, a geoparsing tool, is used to extract the locations and entities present in the tweet. (Please read our blog post about location.)
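The gibberish check can be sketched like this. The 10-word threshold comes from the text above, but the regular expressions are rough approximations, not our exact patterns.

```python
import re

# Sketch of the first-level gibberish check: count words after stripping
# links, hashtags, mentions, and emojis. The regexes are illustrative.
TOKEN_NOISE = re.compile(
    r"https?://\S+"    # links
    r"|[#@]\w+"        # hashtags and mentions
    "|[\U0001F300-\U0001FAFF\u2600-\u27BF]"  # common emoji ranges (approximate)
)

def is_gibberish(text, min_words=10):
    cleaned = TOKEN_NOISE.sub(" ", text)
    words = [w for w in cleaned.split() if any(ch.isalnum() for ch in w)]
    return len(words) < min_words
```

A tweet made up entirely of hashtags, a link, and an emoji would count as zero words and be flagged, while a genuine plea for help easily clears the threshold.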
However, after the first layer, we are still left with off-topic tweets that are not asking for help. That's where second-level noise filtering comes in. We iterate through each of the remaining tweets and run an emotion-detection AI and a content analysis algorithm to match them to the main topic categories.
The emotion-detection AI distinguishes between tweets asking for help and tweets that are not. For a tweet to be given a main-topic tag, it must be classified under the tag not_needs_met.
The content analysis algorithm divides each main topic into the various subtopics. For example, critical resources may encompass personal protective equipment, ventilators, or COVID tests. We search for these subtopics in the tweet, and upon detecting related content, we classify the tweet under the main topic.
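A keyword-matching version of this content analysis might look like the sketch below. The topic names and keyword lists are illustrative assumptions (only "critical resources" and its example subtopics appear in the text above); the real algorithm may be more sophisticated.

```python
# Hedged sketch of the second-level content analysis: each main topic owns a
# list of subtopic keywords, and a tweet is tagged with every main topic whose
# subtopics appear in it. Topic names and keywords here are illustrative.
MAIN_TOPICS = {
    "critical_resources": ["ppe", "ventilator", "covid test", "n95"],
    "frontline_workers": ["nurse", "doctor", "hospital staff", "first responder"],
    "affected_people": ["lost my job", "quarantine", "evicted"],
}

def classify_content(text):
    """Return the list of main topics whose subtopics the tweet mentions."""
    text = text.lower()
    return [topic for topic, keywords in MAIN_TOPICS.items()
            if any(kw in text for kw in keywords)]
```

A single tweet can land in more than one category, e.g. a nurse asking for PPE matches both a resource subtopic and a frontline-worker subtopic.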
The above graph shows the breakdown of our three categories of interest on a typical July day.
How can emotions be rational? Many argue that there is no logic in emotion. However, we believe that emotions are the gateway to rational discussions. Emotion can reveal an opinion about a subject: approval or dissatisfaction. In the context of the Aid-(d)irect platform, if we could determine the emotions of a social media post, then we would be able to identify whether it is a plea for help.
At their core, emotions are used to communicate with other people and are expressed based on what a person is experiencing. For example, joy corresponds to positive experiences. Emotion can be the gateway to rational discussions if we uncover its underlying causes.
According to psychologist Robert Plutchik's wheel of emotions, there are eight basic emotions that serve as the foundation for all others. This got us thinking: if we could distinguish between these eight emotions, we could determine whether a tweet is asking for help. While researching the topic, we came across the feelings wheel published by the University of Oregon. It neatly organizes the emotions that are experienced when needs are met and when needs are not met. Although it covers more than the eight basic emotions, we believe it offers a comprehensive solution.
According to the feelings wheel, when a person's needs are met, they will usually experience
When a person’s needs are not met, they will usually experience
From this, we can see that emotion is not irrational. There is a reason why a person is feeling a certain emotion. If we can analyze the emotional content of the tweet, then we should be able to determine if a post is asking for help.
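The feelings-wheel idea reduces to a mapping from detected emotions to a needs label. The emotion lists below are a rough, illustrative reading of the wheel, not its exact contents, and the label names mirror the not_needs_met tag used in our filtering.

```python
# Toy sketch of the feelings-wheel mapping: a detected emotion implies whether
# the author's needs appear met. Emotion lists are illustrative, not the
# wheel's exact contents.
NEEDS_MET = {"joyful", "peaceful", "hopeful", "grateful", "confident"}
NEEDS_NOT_MET = {"afraid", "angry", "helpless", "sad", "anxious", "desperate"}

def needs_met_label(emotion):
    """Return 'needs_met', 'not_needs_met', or None for unknown emotions."""
    if emotion in NEEDS_MET:
        return "needs_met"
    if emotion in NEEDS_NOT_MET:
        return "not_needs_met"
    return None
```

A tweet whose dominant emotion maps to not_needs_met is the kind of post our pipeline surfaces as a possible plea for help.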
To deal with the natural language processing aspect, we have opted to use artificial intelligence. Our model is not 100% accurate, which is why we have implemented a feedback mechanism. If a post is correctly identified, hearting it will reinforce the positive behavior. If a post is analyzed incorrectly, you can re-annotate the data and change the AI algorithm's behavior for the better.