I'm doing some research and I have a small project to scrape a website for data.
Applicants must be excellent at parsing HTML files, using regular expressions or other tools of their choice. A regex wizard would likely fly through this project.
I envisage 2 small programs. The first program will crawl a fixed list of URLs, each in an identical template format, without following links beyond that list. It will parse the HTML on each page to extract 5 fields, make a calculation based on 1 of those fields, and then populate a database table with the results (flat text files would also work, since the data will ultimately be exported to Excel).
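To illustrate the kind of thing I mean for the first program, here is a minimal Python sketch using only the standard library. The field names, the HTML patterns, the 20% markup calculation, and the `pages` table are all placeholders I made up for illustration; the real ones are in the detailed guideline and depend on each site's template.

```python
import re
import sqlite3
from urllib.request import urlopen

# Hypothetical patterns; the real ones depend on the site's template.
FIELD_PATTERNS = {
    "title": re.compile(r'<h1 class="title">(.*?)</h1>', re.S),
    "price": re.compile(r'<span class="price">([\d.]+)</span>'),
    "quantity": re.compile(r'<span class="qty">(\d+)</span>'),
    "seller": re.compile(r'<span class="seller">(.*?)</span>', re.S),
    "end_time": re.compile(r'<span class="ends">(.*?)</span>', re.S),
}


def parse_page(html):
    """Extract the five fields; a field that is not found becomes None."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(html)
        row[name] = match.group(1).strip() if match else None
    # Hypothetical calculation on one field: price plus 20%.
    row["price_with_markup"] = (
        round(float(row["price"]) * 1.2, 2) if row["price"] else None
    )
    return row


def init_db(conn):
    """Create the results table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, title TEXT, price TEXT, quantity TEXT, "
        "seller TEXT, end_time TEXT, price_with_markup REAL)"
    )


def store(conn, url, row):
    """Insert or overwrite the row for one URL."""
    conn.execute(
        "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?, ?, ?, ?)",
        (url, row["title"], row["price"], row["quantity"],
         row["seller"], row["end_time"], row["price_with_markup"]),
    )
    conn.commit()


def crawl(urls, conn):
    """Fetch only the listed URLs; no links on the pages are followed."""
    for url in urls:
        with urlopen(url, timeout=30) as response:
            html = response.read().decode("utf-8", errors="replace")
        store(conn, url, parse_page(html))
```

A call such as `crawl(url_list, sqlite3.connect("scrape.db"))` after `init_db` would fill the table, and SQLite data can later be dumped to CSV for Excel. SQLite is just one option here; flat files or another database would do equally well.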
The second program will revisit some of the pages, chosen using data in the table populated by the first program, extract one more variable, and add it to the table. It will then calculate when it should next run and adjust cron so that it runs at that time.
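The rescheduling step could look roughly like the Python sketch below. The rule for picking the next run time (earliest pending time plus a 5 minute margin), the `# scraper-revisit` marker, and the function names are my own placeholders, not the actual calculation. It also assumes the server's `crontab` command accepts `-l` to list entries and `-` to read a new crontab from standard input, which is common but worth checking.

```python
import subprocess
from datetime import datetime, timedelta

# Hypothetical marker used to find our own entry in the crontab.
TAG = "# scraper-revisit"


def next_run_time(pending, margin=timedelta(minutes=5)):
    """Placeholder rule: run shortly after the earliest pending time."""
    return min(pending) + margin


def cron_line(when, command):
    """Format a crontab entry firing at the given minute, hour, day and month."""
    return f"{when.minute} {when.hour} {when.day} {when.month} * {command} {TAG}"


def replace_tagged_line(crontab_text, new_line):
    """Drop any previous tagged entry, keep everything else, append the new one."""
    kept = [line for line in crontab_text.splitlines() if TAG not in line]
    kept.append(new_line)
    return "\n".join(kept) + "\n"


def install(new_line):
    """Rewrite the current user's crontab with the updated entry."""
    current = subprocess.run(
        ["crontab", "-l"], capture_output=True, text=True
    ).stdout
    subprocess.run(
        ["crontab", "-"],
        input=replace_tagged_line(current, new_line),
        text=True,
        check=True,
    )
```

Because the entry pins a specific day and month, it would fire again a year later if left alone, so each run should overwrite it with the newly calculated time.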
The programs will reside on an existing server I have.
The programs will scrape 2 separate domains whose templates are different, so the above 2 programs will have to be written twice, once for each domain.
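One possible way to limit the duplication, offered only as a suggestion, is to keep the crawling and storage code shared and swap in a parser per domain. In this Python sketch the host names `site-a.example` and `site-b.example`, the patterns, and the function names are all hypothetical.

```python
import re
from urllib.parse import urlparse


def parse_site_a(html):
    """Hypothetical parser for the first domain's template."""
    match = re.search(r"<h1>(.*?)</h1>", html, re.S)
    return {"title": match.group(1).strip() if match else None}


def parse_site_b(html):
    """Hypothetical parser for the second domain's template."""
    match = re.search(r'<div id="name">(.*?)</div>', html, re.S)
    return {"title": match.group(1).strip() if match else None}


# Map each host name to the parser that understands its template.
PARSERS = {
    "site-a.example": parse_site_a,
    "site-b.example": parse_site_b,
}


def parse_for(url, html):
    """Pick the parser by the URL's host name and apply it to the page."""
    host = urlparse(url).hostname
    if host not in PARSERS:
        raise ValueError(f"no parser registered for {host!r}")
    return PARSERS[host](html)
```

With this layout only the parsing functions differ between the two domains; whether that suits the project is up to whoever takes it on.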
I have a more extensive guideline of the project available for interested parties.