1. Design and code up a class that can preprocess and store the LA Times articles. Specifically,
the methods of the class should take as an input the LA Times articles collection, extract each article
in the collection, and construct a hash table whose key is a word (in the collection) and whose
value is a linked list of all the documents that contain this word, together with the count of the word in each
document. For example, if the word “the” appears in all three articles, 20 times in the first, 34 times
in the second, and 12 times in the third, while the word “author” appears 7 times in the first, 3 times
in the second and does not appear at all in the third, the hash table should look as follows:
[the] -> [1, 20] -> [2, 34] -> [3, 12]
[author] -> [1, 7] -> [2, 3]
Create an object of your class and initialize it with the data collection. Think about how
you could handle different forms of the same word, e.g. "author", "Author", "authors".
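One way to sketch such a class in Python (a minimal illustration, not a full solution: it assumes articles arrive as plain strings with integer document IDs, uses a Python list in place of a hand-built linked list, and folds case as a crude form of normalization):

```python
from collections import Counter, defaultdict
import re

class ArticleIndex:
    """Inverted index: word -> list of [doc_id, count] pairs."""

    def __init__(self):
        # defaultdict plays the role of the hash table.
        self.table = defaultdict(list)

    def add_article(self, doc_id, text):
        # Lowercase and keep only letter runs, so "Author" and "author"
        # collapse to one key; handling "authors" would additionally
        # require stemming or lemmatization.
        words = re.findall(r"[a-z]+", text.lower())
        for word, count in Counter(words).items():
            self.table[word].append([doc_id, count])

index = ArticleIndex()
index.add_article(1, "The author wrote the article.")
index.add_article(2, "The authors replied.")
print(index.table["the"])  # -> [[1, 2], [2, 1]]
```

A production version would iterate over the actual LA Times files and parse out each article before calling `add_article`.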
2. Generate a plot (a histogram should be good enough) of the count distribution of the words in
all documents (that is, the x-axis is the total number of times a word appears in the entire
collection, and the y-axis is the frequency of that total count). Characterize this distribution.
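The quantities to plot can be computed directly from the hash table. A small sketch, using the toy table from the example above (the `table` contents here are illustrative, and the plotting call is left as a comment since it needs `matplotlib`):

```python
from collections import Counter

# Hypothetical hash-table contents: word -> list of [doc_id, count] pairs.
table = {
    "the":    [[1, 20], [2, 34], [3, 12]],
    "author": [[1, 7],  [2, 3]],
}

# Total count of each word across the whole collection (the x-axis values).
totals = {word: sum(c for _, c in postings) for word, postings in table.items()}

# Frequency of each total count (the y-axis values).
distribution = Counter(totals.values())
print(totals)        # -> {'the': 66, 'author': 10}

# With matplotlib: plt.hist(list(totals.values()), bins=50) draws the histogram.
# On real text the distribution is typically heavily right-skewed: a few
# function words ("the", "of", ...) dominate, consistent with Zipf's law.
```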