*Deadline May 29th*
We are looking to partner with a programmer/computer scientist who is experienced in data retrieval. This project has been chosen as a test project to gauge the programmer's fluency with data retrieval.
You need a Twitter account and a valid cell phone number to complete this task. Extract tweets using the Twitter API, as described below. You may want to use Python.
• Extract 1000 tweets without any query to form data set D1.
• Extract 1000 tweets using the query “COVID-19” to form data set D2.
Use spaCy or NLTK to parse and tokenize the tweets in each data set.
1. Report the number of unique tokens in D1 and D2, respectively.
2. List the top-100 most frequent tokens and their frequencies in the two data sets. Here,
the frequency of a token is the number of tweets in which it appears divided by 1000, no
matter how many times the token appears in one tweet.
3. Use any word cloud tools available on the web, such as <[login to view URL]>,
<[login to view URL]>,
<[login to view URL]>, and many others, to produce word clouds
for both data sets D1 and D2. Please submit your word cloud figures and also explain how
you made them, including the tools you used, the meaning of the size and color of a word
in the cloud, etc.
4. Comparing two word-cloud figures is inconvenient and may not be intuitive. Can you
propose an idea to make one word cloud that can compare two data sets? As a baseline,
consider the following figure, in which the two sets of keywords are shown in two different
colors, one in the upper part and one in the lower part. Can you propose another approach?
Please describe your method and use examples from D1 and D2 to argue how your approach
compares with this baseline: <[login to view URL]>
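To make the frequency definition in item 2 concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the regex `tokenize` function is a stand-in so the example runs without extra packages (the task itself asks for spaCy or NLTK), and the function names and the three sample tweets are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Stand-in tokenizer (assumption): lowercased word-like runs.
    # Replace with a spaCy or NLTK tokenizer for the actual task.
    return re.findall(r"[a-z0-9][a-z0-9'-]*", text.lower())


def token_frequencies(tweets, tokenizer=tokenize):
    # Frequency of a token = (number of tweets containing it) / (number of tweets).
    # Converting each tweet to a set means repeats within one tweet count once.
    if not tweets:
        return {}
    tweet_counts = Counter()
    for tweet in tweets:
        tweet_counts.update(set(tokenizer(tweet)))
    total = len(tweets)
    return {token: count / total for token, count in tweet_counts.items()}


def top_tokens(freqs, k=100):
    # Sort by descending frequency, breaking ties alphabetically.
    return sorted(freqs.items(), key=lambda item: (-item[1], item[0]))[:k]


# Made-up sample tweets, for illustration only.
sample = ["covid covid cases rise", "new covid rules", "sunny day today"]
freqs = token_frequencies(sample)
print(len(freqs))            # number of unique tokens (item 1)
print(top_tokens(freqs, 3))  # most frequent tokens with frequencies (item 2)
```

With 1000 tweets per data set, `len(tweets)` is 1000, which matches the divisor in item 2.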
Please upload your Python code (or the code in whichever language you use). Also upload
the two data sets D1 and D2 (in CSV format) that you extracted and used to answer the
above questions.
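For the CSV upload, one possible way to store a data set is sketched below with Python's standard `csv` module. The column names `id` and `text` and both function names are assumptions made for illustration; the task does not prescribe a schema. The `csv` module quotes fields automatically, so tweets containing commas, quotes, or line breaks are read back unchanged.

```python
import csv
import io

FIELDS = ["id", "text"]  # hypothetical column names


def write_tweets_csv(rows, fileobj):
    # Write one tweet per row; quoting is handled by the csv module.
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow({"id": row["id"], "text": row["text"]})


def read_tweets_csv(fileobj):
    # Read the rows back as dictionaries (all values come back as strings).
    return list(csv.DictReader(fileobj))


# Made-up rows, for illustration only.
rows = [
    {"id": 1, "text": 'line one\nline two, with "quotes"'},
    {"id": 2, "text": "plain tweet"},
]
buffer = io.StringIO()
write_tweets_csv(rows, buffer)
buffer.seek(0)
restored = read_tweets_csv(buffer)
print(restored[0]["text"] == rows[0]["text"])
```

When writing to a real file instead of an in-memory buffer, open it with `open(path, "w", newline="", encoding="utf-8")` so the `csv` module controls line endings.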