I am looking for a detailed web crawl of an arbitrary website.
The goal is to crawl every page of a website, extract only specific information, and store it in a database (a suitable one to be suggested by you).
The input will be a domain; you need to find a way to compile all of its URLs and then collect the information specified in the attached Excel sheet.
- Tab “Crawled URLs” will list all the URLs of the site
- Tab “Internal Links Raw Data” will list the details of every internal link
Each crawl must be recorded under a unique crawl ID. This is the first phase of the project; we will expand the scope once we get the data correctly and reliably for large websites.
The details of the required information are in the attached sheet, and I can explain them further if needed.
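To illustrate the kind of first-phase deliverable described above, here is a minimal sketch using only Python's standard library: a breadth-first crawl of one domain that records every visited URL and every internal link under a unique crawl ID in SQLite. All function names, the table layout, and the injected `fetch` callable are placeholder assumptions, not a specification; a production bid would also need politeness delays, robots.txt handling, and error recovery.

```python
import sqlite3
import uuid
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collects href values from <a> tags on one page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def internal_links(page_url, html, domain):
    """Return absolute URLs found in `html` that stay within `domain`."""
    parser = LinkParser()
    parser.feed(html)
    out = []
    for href in parser.links:
        absolute = urljoin(page_url, href)  # resolve relative hrefs
        if urlparse(absolute).netloc == domain:
            out.append(absolute)
    return out

def crawl(start_url, fetch, db_path=":memory:"):
    """Breadth-first crawl of one domain; `fetch(url)` returns page HTML.
    Every row is tagged with a unique crawl ID, as the brief requires."""
    crawl_id = str(uuid.uuid4())
    domain = urlparse(start_url).netloc
    db = sqlite3.connect(db_path)
    # Two tables mirroring the two tabs in the sheet (assumed layout).
    db.execute("CREATE TABLE IF NOT EXISTS crawled_urls "
               "(crawl_id TEXT, url TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS internal_links "
               "(crawl_id TEXT, source_url TEXT, target_url TEXT)")
    queue, seen = [start_url], {start_url}
    while queue:
        url = queue.pop(0)
        db.execute("INSERT INTO crawled_urls VALUES (?, ?)", (crawl_id, url))
        for link in internal_links(url, fetch(url), domain):
            db.execute("INSERT INTO internal_links VALUES (?, ?, ?)",
                       (crawl_id, url, link))
            if link not in seen:  # visit each page only once
                seen.add(link)
                queue.append(link)
    db.commit()
    return crawl_id, db
```

For testing, `fetch` can be a stub that serves pages from a dict, which is also how the crawler's correctness can be demonstrated without touching a live site.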
To qualify for serious consideration, your bid must include the following:
- Which Python library/package you will use and why
- What challenges you foresee and how you will overcome them. Detail is extremely important here; this is your chance to show how good a fit you are for this project.
- Your suggestion for data storage and why
- A similar project you have done before, and whether I can see it in action
Please note that without the points above, your bid is unlikely to be considered seriously.