I have a list of 9 websites that I need to have mapped and added to an existing Python scraping script.
Each of the 9 websites contains about 1,000 items that need to be mapped. Each item has about 10 different values that need to be captured.
The scripts are written in Python using the Scrapy framework and reside on my VPS. A cron job should execute a shell script once every 6 hours to run each scraping script.
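As a sketch of the scheduling piece only (the path, script name, and spider names below are placeholders, not details from the existing setup), the cron entry and runner script could look like:

```shell
# Crontab entry (edit with `crontab -e`): run the batch every 6 hours.
#   0 */6 * * * /home/user/scrapers/run_scrapers.sh >> /home/user/scrapers/cron.log 2>&1

#!/bin/sh
# run_scrapers.sh -- hypothetical runner; adjust PROJECT_DIR and spider names.
PROJECT_DIR="/home/user/scrapers"   # placeholder path
cd "$PROJECT_DIR" || exit 1

# One Scrapy spider per website, each defined in its own file.
for spider in site1 site2 site3; do   # extend to all 9 spider names
    scrapy crawl "$spider"
done
```

Running the spiders sequentially keeps VPS load predictable; they could also be backgrounded if the 6-hour window gets tight.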
Each website's scraper is defined and executes based on its own file, **so you will need to create a separate script for each website.**
Here is a sample set of code for one of the scrapers. You can use this as a basic template to build from. However, you will need strong Python (and general programming) knowledge to map and implement the scripts. This is not simply a fill-in-the-data project.
(Here is the existing code of the scraping engine).
- Can navigate both static and non-static (e.g. dynamically loaded) pagination
- Can deal with variations among sites (some sites have pagination, some have none; some have Item Details pages, some do not)
- Cache invalidation for when data is changed or removed from the source websites (should already be built into the script template)
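One way to keep the per-site variations above manageable is a declarative spec per site that shared spider logic consults. This is a sketch of an approach, not code from the existing template; all names and fields here are hypothetical:

```python
# Hypothetical per-site configuration consumed by shared spider logic.
# Structure and selectors are illustrative, not taken from the template.
SITE_SPECS = {
    "site_a": {
        "has_pagination": True,
        "next_page_css": "a.next::attr(href)",  # placeholder selector
        "has_detail_pages": True,
    },
    "site_b": {
        "has_pagination": False,    # single listing page
        "next_page_css": None,
        "has_detail_pages": False,  # all ~10 fields live on the list page
    },
}

def follow_next_page(spec: dict) -> bool:
    """Decide whether the spider should look for a next-page link."""
    return bool(spec.get("has_pagination") and spec.get("next_page_css"))
```

A spec like this lets one shared code path branch on pagination and detail-page presence instead of duplicating that logic across nine files.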
There is data on both the Item List and Item Details pages that should be captured (some of it is redundant and present on both pages).
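Because some fields appear on both the Item List and Item Details pages, the spider needs a merge policy. A minimal sketch, assuming (purely for illustration, since the brief does not specify) that detail-page values should win when both are present and non-empty:

```python
def merge_item(list_data: dict, detail_data: dict) -> dict:
    """Combine fields scraped from the list page and the details page.

    Detail-page values override list-page values when non-empty; empty
    strings / None on the detail page fall back to the list-page value.
    This precedence is an assumption, not a rule from the template.
    """
    merged = dict(list_data)
    for key, value in detail_data.items():
        if value not in (None, ""):
            merged[key] = value
    return merged
```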
You should map the websites in a reasonably robust way (for example, preferring stable attributes over brittle positional selectors) so that the scrapers don't instantly break when something small on a website is changed.
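One common trick for surviving small markup changes is to try several selectors in priority order and take the first that yields data. The sketch below uses plain-Python callables to stand in for selector queries (in Scrapy these would typically be CSS/XPath calls); the names are illustrative only:

```python
def extract_first(extractors, default=None):
    """Try extractor callables in order, return the first non-empty result.

    Each extractor stands in for a selector query: a stable
    attribute-based selector first, a positional fallback second.
    """
    for extract in extractors:
        try:
            value = extract()
        except Exception:  # one broken selector shouldn't kill the item
            continue
        if value not in (None, ""):
            return value
    return default

# Usage sketch: first extractor finds nothing, second supplies the value.
page = {"data-price": "9.99"}
price = extract_first([lambda: page.get("missing"),
                       lambda: page["data-price"]])
```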
Budget for this project is $150 - $200.