This is my first time hiring a freelancer, and I noticed you are skilled in Python and crawlers/scrapers. If you can take this task, I will have no problem convincing my partners to employ you as a steady programmer, if you are interested.
Please disregard the number of hours and the time length we chose in the dropdown menu below. I don't know how long this would take you, but from my experience with Python and scraping, this should not take more than several hours.
I have attempted to code this myself using Scrapy, but my knowledge of Python is limited, so I feel it is best to hire an expert.
The project is simple and is separated into two layers.
1. I need a crawler which takes a retail website URL and crawls the entire site. The crawler must identify which pages on the site are "product" pages, perhaps by matching a marker such as an "add to cart" button. This marker may differ from site to site, so I must be able to define it before starting the crawl.
2. The crawler then outputs the list of URLs which are "product" pages, and now the scraper program goes to work.
- The scraper must pull a set of predefined fields from each product page. I understand this is done with XPaths, and I guess that if the scraper can accept a list of pre-selected XPaths then it will pull the correct data fields. The XPaths are site specific, so just as the crawler accepts a marker definition, the scraper should accept pre-selected variables defining the XPaths to pull. These XPaths will be chosen manually by me, and the scraper should be able to accept a varying number of them.
- The scraper pulls the plain text from these data fields, and outputs everything into a CSV file.
- Each output value must have an internal label assigned to it, for us as developers to know what each value is.
- Examples of the data fields would be: product title, price, meta keywords, meta description, product description, and whatever "attributes" are in the table or div located on the product page.
- I feel that these XPaths may change from page to page within the same site, so how do you think we can keep the scrape consistent when they do?
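
To make the two layers a bit more concrete, here is a rough sketch of the shape I imagine. Please treat it as an illustration only, not a design: every name in it (`SITE_CONFIG`, `is_product_page`, `extract_field`, `scrape_pages`, `rows_to_csv`, the `shop.example` URLs and the sample markup) is made up by me. It skips the crawling itself and just takes pages that were already downloaded. It also uses Python's built-in `xml.etree.ElementTree`, which only understands a small subset of XPath and only parses well-formed markup, so a real version would presumably use Scrapy selectors or a similar tolerant HTML parser. For the question about XPaths changing from page to page, the sketch tries one possible answer: each label gets an ordered list of XPaths, and the first one that matches wins.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical per-site configuration, written by hand before each run.
SITE_CONFIG = {
    # A page counts as a "product" page if this XPath matches an element.
    "product_marker": ".//button[@id='add-to-cart']",
    # Internal label -> XPaths to try in order (fallbacks for layout variants).
    "fields": {
        "title": [".//h1[@class='product-title']", ".//h1"],
        "price": [".//span[@class='price']", ".//div[@class='sale-price']"],
        "meta_description": [".//meta[@name='description']"],
    },
}

# Made-up, well-formed sample pages standing in for the crawler's downloads.
SAMPLE_PAGES = {
    "https://shop.example/widget": (
        '<html><head><meta name="description" content="A sturdy widget"/></head>'
        '<body><h1 class="product-title">Blue Widget</h1>'
        '<span class="price">$19.99</span>'
        '<button id="add-to-cart">Add to cart</button></body></html>'
    ),
    "https://shop.example/gadget": (
        "<html><head></head><body><h1>Red Gadget</h1>"
        '<div class="sale-price">$14.50</div>'
        '<button id="add-to-cart">Add to cart</button></body></html>'
    ),
    "https://shop.example/about": (
        "<html><head></head><body><h1>About us</h1></body></html>"
    ),
}


def is_product_page(root, marker_xpath):
    """Return True if the site-specific marker element exists on the page."""
    return root.find(marker_xpath) is not None


def extract_field(root, xpaths):
    """Return plain text from the first XPath that matches, or '' if none do."""
    for xpath in xpaths:
        node = root.find(xpath)
        if node is None:
            continue
        # <meta> tags carry their value in the "content" attribute.
        if node.tag == "meta":
            return node.get("content", "").strip()
        # Join all nested text and collapse runs of whitespace.
        return " ".join("".join(node.itertext()).split())
    return ""


def scrape_pages(pages, config):
    """Take {url: markup}; return one labeled row per product page."""
    rows = []
    for url, markup in pages.items():
        root = ET.fromstring(markup)
        if not is_product_page(root, config["product_marker"]):
            continue  # not a product page, so it is left out of the output
        row = {"url": url}
        for label, xpaths in config["fields"].items():
            row[label] = extract_field(root, xpaths)
        rows.append(row)
    return rows


def rows_to_csv(rows, config):
    """Render rows as CSV text; the header row holds the internal labels."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["url", *config["fields"]])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(scrape_pages(SAMPLE_PAGES, SITE_CONFIG), SITE_CONFIG))
```

The point of the sketch is that moving to a new site should only mean writing a new configuration, with any number of labeled fields, while the code stays the same. Whether an ordered fallback list is enough to handle XPaths that change between pages is exactly the kind of thing I would like your opinion on.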
This is all I will describe for now. I would like to know if you are interested and how long this would take. If you are looking to work with us, these initial two items are part of a much larger project.
I await your reply and look forward to hearing from you.