In this assignment you have to extend the spider that you developed in the previous assignment as follows:
1. Extend the search to internal links (one level). Take the first 10 links to Wikipedia pages (at most) from the main page returned by Wikipedia for the searched word or words. A thread pool of 4 threads will process the links. Each thread will get a link, parse it, and update the index (the same process as for the text from the main page).
2. Implement different listeners for the search engine. All listeners will run concurrently and present different information about the search to the user:
a. SearchResults will present the final search results to the user (without change but now including links),
b. IndexMonitor will build the index and monitor the index state: which link and from which depth (0 for the main page or 1 for a link) was added, which words were added (for new words), and entries for which words were updated (for words that already exist in the index). At the end of building, the total number of documents and words (terms) in the index will be shown to the user.
c. Statistics will present the search statistics: three types of frequency for each word: tf, df and tf-idf, where:
i. tf – number of word occurrences in the processed document (see the formula in HW1), presented as follows: <word, url, tf>
ii. df – number of documents (in Wikiindex) containing the word, presented as follows: <word, df>
iii. tf-idf = tf*idf, where: idf = log (|D|/df), where |D| is the number of documents in the Wikiindex.
Note that df and, as a result, tf-idf are two metrics that must be updated each time the word entry (list of documents) in the Wikiindex is updated!
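The three metrics above can be illustrated with a minimal sketch. It assumes that tf is the raw occurrence count (the HW1 formula is not reproduced here), that the logarithm is natural (the base is not specified), and that the index is a plain dictionary; the names `index`, `documents`, `add_document`, `df` and `tf_idf` are hypothetical. Because df and tf-idf are computed on demand from the current index, they always reflect the latest update.

```python
import math
from collections import Counter, defaultdict

# Hypothetical in-memory index: word -> {url: tf}. The real Wikiindex layout
# from HW1 is not shown here, so this structure is an assumption.
index = defaultdict(dict)
documents = set()

def add_document(url, text):
    """Add one page to the index, using raw occurrence counts as tf."""
    documents.add(url)
    for word, count in Counter(text.lower().split()).items():
        index[word][url] = count

def df(word):
    """Number of indexed documents that contain the word."""
    return len(index.get(word, {}))

def tf_idf(word, url):
    """tf * log(|D| / df); natural log is assumed, the base is not specified."""
    tf = index.get(word, {}).get(url, 0)
    if tf == 0:
        return 0.0
    return tf * math.log(len(documents) / df(word))

add_document("page_a", "spider web spider")
add_document("page_b", "web crawler")
print(("spider", "page_a", index["spider"]["page_a"]))  # <word, url, tf>
print(("spider", df("spider")))                          # <word, df>
print(tf_idf("spider", "page_a"), tf_idf("web", "page_a"))
```

A word that appears in every document gets idf = log(1) = 0, so its tf-idf is 0 regardless of how often it occurs.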
Each listener has to run in a separate window. There are no special requirements for the Graphical User Interface (GUI) in this work!
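As a rough illustration of items 1 and 2, the sketch below processes the main page at depth 0, then hands at most 10 links to a pool of 4 worker threads, and reports every index change to a single listener through a thread-safe queue. The assignment does not prescribe a language or a design: `fetch_text`, `process_link`, `index_monitor`, `crawl` and the event tuple layout are all hypothetical, and the page download is replaced by a stub.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_LINKS = 10   # at most 10 links taken from the main page
POOL_SIZE = 4    # pool size required by the assignment

index = {}                     # word -> {url: count}; hypothetical layout
index_lock = threading.Lock()  # guards every read/write of `index`
events = queue.Queue()         # thread-safe channel to a listener

def fetch_text(url):
    """Hypothetical stand-in for downloading and parsing a Wikipedia page."""
    return "sample text for " + url

def process_link(url, depth):
    """Parse one page and merge its words into the shared index."""
    words = fetch_text(url).lower().split()
    with index_lock:
        for word in words:
            is_new = word not in index
            postings = index.setdefault(word, {})
            postings[url] = postings.get(url, 0) + 1
            events.put(("added" if is_new else "updated", word, url, depth))

def index_monitor(log):
    """Listener: consume events until the None sentinel arrives."""
    while True:
        event = events.get()
        if event is None:
            break
        log.append(event)

def crawl(main_url, links):
    log = []
    listener = threading.Thread(target=index_monitor, args=(log,))
    listener.start()
    process_link(main_url, 0)                       # depth 0: main page
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        for url in links[:MAX_LINKS]:               # depth 1: internal links
            pool.submit(process_link, url, 1)
    events.put(None)                                # all workers finished
    listener.join()
    return log
```

Holding the lock while deciding between "added" and "updated" keeps that decision consistent with the index state. A real solution would give each listener (SearchResults, IndexMonitor, Statistics) its own queue and window.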
Bonus section (1 point to the final grade):
Extend the search to internal links where the depth of search is unlimited and user-specified. All possible race conditions in this case must be handled properly!
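One race condition in the bonus task is two threads reaching the same page at the same time and indexing it twice. A minimal sketch of one way to avoid this, assuming a level-by-level traversal and a lock-protected visited set, is shown below. The link graph `GRAPH` is a hypothetical stand-in for real Wikipedia pages, and `claim`, `process` and `crawl` are illustrative names, not part of the assignment.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical link graph standing in for real Wikipedia pages.
GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}

visited = {}                     # url -> depth at which it was first claimed
visited_lock = threading.Lock()

def claim(url, depth):
    """Atomically mark a URL as visited; True only for the first caller."""
    with visited_lock:
        if url in visited:
            return False
        visited[url] = depth
        return True

def process(url, depth):
    """Return the outgoing links of a page, or [] if it was already claimed."""
    if not claim(url, depth):
        return []
    return GRAPH.get(url, [])    # real code would fetch, parse and index here

def crawl(start, max_depth, pool_size=4):
    frontier = [start]
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for depth in range(max_depth + 1):
            # Process one depth level concurrently, then collect the next one.
            results = pool.map(process, frontier, [depth] * len(frontier))
            frontier = [link for links in results for link in links]
            if not frontier:
                break
    return visited
```

The check and the insert into `visited` happen under the same lock, so a page that is linked from several places at the same depth (such as "D" above) is processed by exactly one worker. Here `max_depth` is the user-specified depth.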
The first assignment is attached.