This assignment asks you to create a web page categorization program.
• The program reads 20 (or more) web pages. The URLs for some of these pages can be maintained in a control file that is read when the program starts; the others should be links reached from those pages. (Wikipedia is a recommended source.) For each page, the program maintains word frequencies along with any other related information that you choose.
• The user can enter any other URL, and the program reports which other known page is most closely related, using a similarity metric of your choosing.
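The word-frequency bookkeeping and the similarity report above can be sketched as follows. This is only one possible design (the class and method names are illustrative, and cosine similarity is just one metric you might choose):

```java
import java.util.HashMap;
import java.util.Map;

public class PageProfile {
    // Raw word counts for one page.
    private final Map<String, Integer> counts = new HashMap<>();

    public void addWord(String word) {
        counts.merge(word.toLowerCase(), 1, Integer::sum);
    }

    public Map<String, Integer> counts() { return counts; }

    // Cosine similarity between two frequency maps; higher means more related.
    // To answer a user query, compute this against every known page and
    // report the page with the highest score.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * (double) other;
        }
        double normA = 0, normB = 0;
        for (int v : a.values()) normA += (double) v * v;
        for (int v : b.values()) normB += (double) v * v;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

A page is most similar to itself (cosine of a profile with itself is 1.0), which is a quick sanity check for your metric.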
The implementation restrictions are:
• Create a cache, backed by a custom hash table class that you implement, to keep track of pages that have not been modified since they were last accessed; store the cached pages in local files.
• Use library collections or your own data structures for all other data stores. Read through the Java Collections tutorial.
• Establish a similarity metric. This must be based in part on word frequencies, but may include other attributes. If you follow the recommended approach of hash-based TF-IDF, create a hash table that stores the TF-IDF values.
• A GUI allows a user to indicate one entity, and displays one or more similar ones.
The presentation details are up to you. Use Swing, JavaFX, or Android components for the GUI.
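The custom hash table required for the page cache might look like the sketch below: separate chaining, mapping a URL to its local file and the modification timestamp recorded at fetch time. The entry fields and class name are assumptions for illustration; the assignment only requires that the table be your own class rather than a library map.

```java
import java.util.LinkedList;

// Minimal separate-chaining hash table mapping a URL to its cache entry
// (local file name plus the Last-Modified value seen when the page was fetched).
public class PageCacheTable {
    public static class Entry {
        public final String url;
        public String localFile;
        public String lastModified;
        Entry(String url, String localFile, String lastModified) {
            this.url = url;
            this.localFile = localFile;
            this.lastModified = lastModified;
        }
    }

    private final LinkedList<Entry>[] buckets;
    private int size;

    @SuppressWarnings("unchecked")
    public PageCacheTable(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) buckets[i] = new LinkedList<>();
    }

    // Map the URL's hash code into a bucket index (floorMod handles negatives).
    private int index(String url) {
        return Math.floorMod(url.hashCode(), buckets.length);
    }

    public void put(String url, String localFile, String lastModified) {
        for (Entry e : buckets[index(url)]) {
            if (e.url.equals(url)) {        // update an existing entry in place
                e.localFile = localFile;
                e.lastModified = lastModified;
                return;
            }
        }
        buckets[index(url)].add(new Entry(url, localFile, lastModified));
        size++;
    }

    public Entry get(String url) {
        for (Entry e : buckets[index(url)]) {
            if (e.url.equals(url)) return e;
        }
        return null;
    }

    public int size() { return size; }
}
```

On a cache hit you would compare the stored `lastModified` against the server's current Last-Modified header (for example via a conditional HTTP request) and reuse the local file when the page is unchanged.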
This assignment extends Assignment 1 with persistent data structures and additional similarity metrics. It requires two programs.
• For each of at least 100 URLs, create a persistent file-based B-tree (or B+-tree) containing word frequencies (and/or other information), keyed by hash-based key representations.
• Load each B-tree with word frequencies (possibly along with other data), extending or modifying the data from Assignment 1 where applicable.
• Implement and use a fixed-size buffer cache to reduce I/O.
• Pre-categorize pages into 5 to 10 clusters using k-means, k-medoids, or a similar clustering algorithm.
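One simple way to meet the fixed-size buffer cache requirement, assuming library collections are still allowed for stores other than the page cache, is an LRU cache built on `LinkedHashMap`. This is a sketch, not a required design; a real buffer cache for B-tree nodes would also write dirty pages back to disk on eviction.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Fixed-size LRU buffer cache for disk pages (e.g., B-tree nodes).
// A LinkedHashMap in access order tracks recency; removeEldestEntry
// evicts the least-recently-used entry once capacity is exceeded.
public class BufferCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public BufferCache(int capacity) {
        super(capacity, 0.75f, true); // true = iterate in access order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict when over capacity
    }
}
```

On a cache miss you read the node from the B-tree file and `put` it here; subsequent `get` calls avoid the disk read until the entry is evicted.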
Extend the Assignment 1 GUI to display a page's category (cluster) and its most similar key, drawn from the data structures above.
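The clustering step can be sketched with a bare-bones k-means over dense feature vectors (for example, rows of a reduced TF-IDF matrix). This illustrative version seeds the centroids with the first k points and runs a fixed number of iterations; a production version would use better seeding (e.g., k-means++) and a convergence test.

```java
// Minimal k-means: returns, for each input point, the index of its cluster.
public class KMeans {
    public static int[] cluster(double[][] points, int k, int iterations) {
        int n = points.length, dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++) centroids[i] = points[i].clone(); // naive seeding
        int[] assign = new int[n];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: nearest centroid by squared Euclidean distance.
            for (int p = 0; p < n; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = 0;
                    for (int j = 0; j < dim; j++) {
                        double diff = points[p][j] - centroids[c][j];
                        d += diff * diff;
                    }
                    if (d < best) { best = d; assign[p] = c; }
                }
            }
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < n; p++) {
                counts[assign[p]]++;
                for (int j = 0; j < dim; j++) sums[assign[p]][j] += points[p][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // leave empty clusters in place
                for (int j = 0; j < dim; j++) centroids[c][j] = sums[c][j] / counts[c];
            }
        }
        return assign;
    }
}
```

The cluster index returned for a page is what the extended GUI would display as its category.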