I need to do some large-scale website crawling for data mining purposes. It needs to scale to potentially millions of sites. Unless you have done a project like this, please do not bid. I need an expert who has done it before.
I am not looking for a custom-built solution. I expect that existing solutions can be leveraged.
- Input is a list of URLs
- Each URL is FULLY crawled starting at the depth provided. For example, if [url removed, login to view] is provided, only the pages under [url removed, login to view] are crawled; the crawler would not visit [url removed, login to view]
- All pages are saved as HTML on a network file system, under a directory named for the top-level target domain
- All pages are saved in their rendered form, as HTML
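To make the scoping and storage rules above concrete, here is a minimal Python sketch of how I picture them. It is an illustration under my own assumptions, not a required design: the function names `in_scope` and `storage_path`, the `example.com` URLs, the `/mnt/crawl` root, and the `index.html` / `.html` naming convention are all hypothetical, and query strings are ignored for simplicity.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def in_scope(seed_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url sits at or below the seed URL's path."""
    seed, cand = urlsplit(seed_url), urlsplit(candidate_url)
    if cand.scheme not in ("http", "https"):
        return False
    if (cand.hostname or "").lower() != (seed.hostname or "").lower():
        return False
    # Compare as directory prefixes so "/blog" does not match "/blogger".
    prefix = seed.path if seed.path.endswith("/") else seed.path + "/"
    path = cand.path if cand.path.endswith("/") else cand.path + "/"
    return path.startswith(prefix)


def storage_path(root: str, url: str) -> PurePosixPath:
    """Map a URL to a file under <root>/<domain>/... (assumed layout)."""
    parts = urlsplit(url)
    host = (parts.hostname or "unknown-host").lower()
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index.html"   # directory-style URLs get an index file
    elif not path.endswith((".html", ".htm")):
        path += ".html"        # keep every saved page an HTML file
    return PurePosixPath(root) / host / path.lstrip("/")
```

With a seed of `https://example.com/blog`, a link to `https://example.com/blog/post-1` would be followed and saved, while `https://example.com/about` would be skipped. Rendering the page before saving is a separate step that this sketch does not cover.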
Another process will consume and process the directory tree after each site is done. The crucial piece is that everything is traversable by following a regular HTML link.
Please suggest what technologies you would use to accomplish this project.
Please show that you have read this fully by putting your favorite color in CAPS as the first word in your bid. Do not post a generic list of projects. I want to know what you have done in site crawling specifically. Thanks.