The task is to create a webpage that, when loaded, asynchronously fetches the source of other webpages and builds a tag cloud of the most popular terms on each of those pages.
The more specific functional requirements are as follows:
1) The process starts when a PHP page is loaded which contains anywhere from one to ten different images, with each image representing, say, one of the online video providers like YouTube. The images themselves are not important. Generating the tag clouds should not hold up the loading of the page; it should begin only after the page has loaded.
2) Associated with each of the images is a URL, so for this example say there are four images on the page and the URLs are (1) [url removed, login to view] (2) [url removed, login to view] (3) [url removed, login to view] (4) [url removed, login to view]
3) After the PHP page loads, some AJAX magic should take place so that the URL associated with each image (for this example the previously mentioned four URLs) is contacted and the source of the page is downloaded. It is this page source that should be used to eventually generate the tag cloud.
4) The page source used to generate the tag cloud should not include the HTML tags, the CSS, the meta tags, etc., but only the text that you would see if you were looking at the page in a browser. Take for example what is seen in the SEO Text Browser box for this example page: [url removed, login to view] Using the Lynx browser may be a good means of capturing this text.
5) When the text of the page is captured, all words contained in a list of negative keywords that I can provision should be removed from the list of words on that page. The purpose of this is to allow removing common words like the, this, an, etc. The means to provision this list does not need to be fancy; a simple text file would suffice.
6) Once the negative keywords are removed from the list, the 15 most popular terms should be determined.
7) Two things should then be done with this list:
a) The list of words should be stored in a MySQL database along with the date the list was stored. The key to the entry should be the URL. Note that back in step 3 the SQL database would first be checked to see if an entry existed for the URL, and if so, the relevant data would of course just be pulled from the database.
b) Along with being stored in the database, the data must also be formatted into a tag cloud, with the three most popular terms using one CSS class, terms 4-6 using another CSS class, and so on, for a total of 5 classes. Using onmouseover, the tag cloud for [url removed, login to view], for example, would then become visible when the user positions their mouse over the image described back in step 2 which represents YouTube. Before the tag cloud has finished generating, mousing over the YouTube image should do nothing.
8) The creation of the CSS for the tag cloud itself is not important, but an example of what a tag cloud for a selection of text looks like can be seen here: [url removed, login to view]
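To make steps 4 through 7b a little more concrete, here is a rough sketch of the text processing I have in mind. It is written in Python only for brevity; the delivered solution would do the equivalent in PHP on the LAMP stack. All function names, the one-word-per-line format of the negative keyword file, and the CSS class names cloud-1 to cloud-5 are placeholders I made up for illustration, not requirements.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping elements whose content is not displayed."""

    SKIPPED = {"script", "style", "noscript", "title"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Step 4: reduce page source to roughly the text a browser would show."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def load_stop_words(path):
    """Step 5: read negative keywords, assuming one word per line (hypothetical format)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def top_terms(text, stop_words, limit=15):
    """Step 6: count the remaining words and keep the most frequent ones."""
    words = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())
    counts = Counter(word for word in words if word not in stop_words)
    return counts.most_common(limit)


def assign_classes(terms, per_class=3, class_count=5):
    """Step 7b: ranks 1-3 get cloud-1, ranks 4-6 get cloud-2, up to cloud-5."""
    tagged = []
    for rank, (word, count) in enumerate(terms):
        level = min(rank // per_class, class_count - 1) + 1
        tagged.append((word, count, f"cloud-{level}"))
    return tagged
```

A call such as assign_classes(top_terms(visible_text(source), stop_words)) would then give the word, its count, and its CSS class for each of the top 15 terms, ready to be stored in the database and rendered as the tag cloud. A real text browser like Lynx will handle messy markup better than this simple parser, so treat the extraction part as a stand-in.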
A few system requirements:
1) The foundation for this should be LAMP
2) The solution should be completely cross-browser compatible.
3) Emphasis of course on reliability and real-time performance.
4) Full ownership rights to the source are to be held by me.
Anything not clear? Please feel free to ask.