This is a PHP 4 job. The bidder should have experience scraping sites with cURL and regular expressions and be able to write clean, tight PHP.
I occasionally need to harvest information from websites (usually because they do not provide search functionality).
This time I need to pull out information from a DMOZ-like directory. I need to collect details about each of the thousand-or-so websites listed in the directory.
To avoid naming collisions, I want a base parsing class built on cURL and a derived class that scrapes this specific site.
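A minimal sketch of that base/derived split might look like the following. The class names (BaseScraper, DirectoryScraper), the method names, and the listing markup assumed in the regex are all hypothetical placeholders, not the target site's real layout. It avoids constructors and visibility keywords so that it should stay PHP 4 compatible.

```php
<?php
// Base class: generic fetching and pattern matching shared by all scrapers.
class BaseScraper
{
    var $timeout = 30;     // seconds; an assumed default
    var $lastError = '';   // last cURL error message, if any

    // Fetch a URL with cURL. Returns the body as a string, or false on failure.
    function fetch($url)
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout);
        $body = curl_exec($ch);
        if ($body === false) {
            $this->lastError = curl_error($ch);
        }
        curl_close($ch);
        return $body;
    }

    // Return every match of $pattern in $html, one array per match.
    function matchAll($pattern, $html)
    {
        $matches = array();
        preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
        return $matches;
    }
}

// Derived class: everything specific to one site lives here.
class DirectoryScraper extends BaseScraper
{
    // Hypothetical markup: <li><a href="URL">Title</a> - Description</li>
    var $listingPattern = '#<li>\s*<a href="([^"]+)">([^<]+)</a>\s*-\s*([^<]*)</li>#i';

    // Turn one directory page into an array of site records.
    function parseListings($html)
    {
        $sites = array();
        $matches = $this->matchAll($this->listingPattern, $html);
        foreach ($matches as $m) {
            $sites[] = array(
                'url'         => $m[1],
                'title'       => trim($m[2]),
                'description' => trim($m[3])
            );
        }
        return $sites;
    }
}
```

Under these assumptions, a page would be processed with something like $scraper->parseListings($scraper->fetch($pageUrl)), after checking that fetch() did not return false. Scraping a different site would then only need another derived class with its own pattern.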
This is not a database job. Information will simply be harvested into an array and saved to an XML file.
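The array-to-XML step could be sketched roughly as below. The function names, the sites/site element names, and the UTF-8 declaration are assumptions for illustration, and the sketch assumes every array key is already a valid XML element name. It uses fopen/fwrite rather than file_put_contents, which PHP 4 does not have.

```php
<?php
// Build an XML document (as a string) from an array of site records.
// Each record is an associative array such as array('url' => ..., 'title' => ...).
function buildSitesXml($sites)
{
    $xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<sites>\n";
    foreach ($sites as $site) {
        $xml .= "  <site>\n";
        foreach ($site as $field => $value) {
            // Escape &, < and > so the values cannot break the markup.
            $xml .= '    <' . $field . '>' . htmlspecialchars($value)
                  . '</' . $field . ">\n";
        }
        $xml .= "  </site>\n";
    }
    $xml .= "</sites>\n";
    return $xml;
}

// Write the XML to $path. Returns true on success, false if the file
// could not be opened for writing.
function saveSitesXml($sites, $path)
{
    $fh = fopen($path, 'w');
    if (!$fh) {
        return false;
    }
    fwrite($fh, buildSitesXml($sites));
    fclose($fh);
    return true;
}
```

Keeping the string building separate from the file write makes the output easy to inspect before anything touches the disk.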