125635 Build a Web spidering tool
N/A
Paid on delivery
The system consists of two primary "blocks" of functionality: a back end and a front end.
The back-end system is a "spider" that builds a dynamic database of Web sites, scans those sites for RSS/XML feeds, and adds newly discovered feeds to the database. The URL and contents of each feed entry should be stored in the database. The system will be "seeded" with several starting Web sites but should grow autonomously after that. For each site in the database, a secondary index must be built that tracks inbound and outbound links from all other entries in the database. In other words, if site A links to sites B, C, and D in one XML entry, all three references must be counted. If site A then links again to sites B and D, their counters must increase. Furthermore, sites B, C, and D should all be tracked from this point forward. All data should be tracked historically so that queries can be run against any given timeframe. The system must scale to track several million Web sites.
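To make the link-counting requirement concrete, here is a minimal sketch of the secondary index, assuming a SQLite store and one row per observed link. The table names (`sites`, `link_events`), the function names, and the sample hostnames are all hypothetical and not part of the brief; a deployment tracking several million sites would likely need a different database and indexing strategy.

```python
import sqlite3
from datetime import datetime, timezone


def open_index(path=":memory:"):
    """Open (or create) the hypothetical link index."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS sites (url TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS link_events ("
        " source TEXT NOT NULL,"
        " target TEXT NOT NULL,"
        " seen_at REAL NOT NULL)"  # Unix timestamp of the feed entry
    )
    return conn


def record_entry(conn, source, targets, seen_at):
    """Record every link found in one feed entry of `source`.

    Each link is stored as its own timestamped row, so counts can be
    recomputed for any timeframe. Every target is also added to `sites`
    so the spider can start tracking it from this point forward.
    """
    stamp = seen_at.timestamp()
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO sites (url) VALUES (?)",
            [(url,) for url in [source, *targets]],
        )
        conn.executemany(
            "INSERT INTO link_events (source, target, seen_at) VALUES (?, ?, ?)",
            [(source, target, stamp) for target in targets],
        )


def outbound_counts(conn, source, start, end):
    """Return {target: count} for links from `source` seen in [start, end)."""
    rows = conn.execute(
        "SELECT target, COUNT(*) FROM link_events"
        " WHERE source = ? AND seen_at >= ? AND seen_at < ?"
        " GROUP BY target",
        (source, start.timestamp(), end.timestamp()),
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    conn = open_index()
    utc = timezone.utc
    # Site A links to B, C, D in one entry, then to B and D in a later one.
    record_entry(conn, "a.example", ["b.example", "c.example", "d.example"],
                 datetime(2024, 1, 5, tzinfo=utc))
    record_entry(conn, "a.example", ["b.example", "d.example"],
                 datetime(2024, 2, 5, tzinfo=utc))
    print(outbound_counts(conn, "a.example",
                          datetime(2024, 1, 1, tzinfo=utc),
                          datetime(2024, 3, 1, tzinfo=utc)))
```

Storing one row per link instead of a running counter is a design choice made here so that historical queries fall out of a simple range filter; it trades storage for flexibility.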
The front-end system provides a Web-based interface to the database. Its main purpose is to let users (who must register) run queries against the database, with virtually any combination of criteria possible. Queries will typically show the number of inbound and outbound references for a given URL (or for multiple URLs simultaneously). For example, in the previous scenario, a user should be able to query site A and see the counts of links to sites B, C, and D. Users must be able to query by time period, by specific keywords, by specific URLs, or by any combination of these. Comparison queries should be allowed as well (e.g. viewing site A vs. site B during a given time period). The system must include a dynamic graphing system to chart query results visually. The interface must have a very "clean" look and resemble www.alexa.com. Finally, the system must have a Web-based administrative module for making manual database edits if needed. Admins must be able to ban URLs or remove individual entries from the database; a removed entry must stay removed until an admin restores it.
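As an illustration of a comparison query, the sketch below counts inbound links for several URLs over one time period and skips any banned URL. It assumes link observations are available as plain `(source, target, seen_at)` tuples; the function name `inbound_counts`, the `banned` parameter, and the sample hostnames are hypothetical, and a real front end would run the equivalent query against the database.

```python
from datetime import datetime, timezone


def inbound_counts(events, urls, start, end, banned=frozenset()):
    """Count inbound links to each URL in `urls` seen in [start, end).

    `events` is an iterable of (source, target, seen_at) tuples.
    Links whose source or target is banned are ignored.
    """
    counts = {url: 0 for url in urls}
    for source, target, seen_at in events:
        if source in banned or target in banned:
            continue
        if target in counts and start <= seen_at < end:
            counts[target] += 1
    return counts


if __name__ == "__main__":
    utc = timezone.utc
    events = [
        ("a.example", "b.example", datetime(2024, 1, 5, tzinfo=utc)),
        ("a.example", "c.example", datetime(2024, 1, 5, tzinfo=utc)),
        ("a.example", "b.example", datetime(2024, 1, 20, tzinfo=utc)),
        ("x.example", "b.example", datetime(2024, 2, 1, tzinfo=utc)),
    ]
    # Compare site B vs. site C during January 2024.
    print(inbound_counts(events, ["b.example", "c.example"],
                         datetime(2024, 1, 1, tzinfo=utc),
                         datetime(2024, 2, 1, tzinfo=utc)))
```

The resulting dictionary maps each compared URL to a single number, which is the shape a charting component would need for a side-by-side bar or line graph.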
Project ID: #1871801