I'm looking to build a web crawler that can handle millions of pages, and a database structure that can support a large number of entries. Right now, I only want three simple things:
1. My needs for the crawler are very basic - I want to index content in the body section of web pages, and basic meta information like page title and description. While that's a really simple task, I'm more concerned with the size of the database and making sure that it doesn't get so large that it becomes slow. How would you structure a database for something that needs to handle storing millions of pages?
2. For the crawler, like I mentioned, I'm indexing entire pages, so this should be very quick. I do have a large number of sites that I want to crawl though (20,000+). My preferred language and framework are Python + Scrapy, but if you have experience with another language (like Java) for large-scale crawling, I am open to considering other options. I want to scrape anything between the HTML body tags plus basic meta information, and store the time and date that the page was crawled. No other specifications at this time. The question here is, how long would it take for you to build a crawler that can handle crawling and scraping a large number of web pages?
3. I'm thinking in a different direction than I was before, and want to handle any parsing or searching for specific information through code that is separate from the crawler. I want to build an API so specific information is standardized and can be used by other websites. Do you have any experience in this area?
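To make the scope of points 1 and 2 concrete, here is a rough sketch of the kind of record I have in mind for each page. It is only an illustration under assumptions, not a required design: it uses Python's standard library (`html.parser`, `sqlite3`) instead of Scrapy so it runs on its own, and the table name `pages`, its columns, and the helpers `PageExtractor` and `store_page` are hypothetical names. At millions of pages, a server database would probably replace SQLite.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical schema: one row per URL, keyed by a hash of the URL so
# re-crawling a page overwrites the old row instead of adding a duplicate.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    url_hash    TEXT PRIMARY KEY,
    url         TEXT NOT NULL,
    domain      TEXT NOT NULL,
    title       TEXT,
    description TEXT,
    body_text   TEXT,
    crawled_at  TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_pages_domain ON pages (domain);
"""


class PageExtractor(HTMLParser):
    """Collects <title>, the meta description, and text inside <body>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.body_parts = []
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body and not self._skip_depth and data.strip():
            self.body_parts.append(data.strip())


def store_page(conn, url, html):
    """Parse one fetched page and insert or overwrite its row; returns the row."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    row = (
        hashlib.sha256(url.encode("utf-8")).hexdigest(),
        url,
        urlsplit(url).netloc,
        parser.title.strip(),
        parser.description.strip(),
        " ".join(parser.body_parts),
        datetime.now(timezone.utc).isoformat(),  # crawl time in UTC
    )
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?, ?, ?, ?)", row)
    conn.commit()
    return row
```

Keeping only the extracted text and meta fields in this table, with the domain indexed, is one way to keep per-site lookups fast as the row count grows. Any parsing for specific information (point 3) would read from this table rather than run inside the crawler.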
Right now, I want to get an idea of how long you see this taking, how you would handle a large dataset, and when you would be available to work on this. I'm looking only for a detailed estimate here: no code and no other work right now, which is why the rate is so low.
If I invited you to bid, I'm considering your listed hourly rate, not the price on this project.
15 freelancers are bidding an average of $1415 for this job
Hi sir, I am a scraping expert and have done many similar projects; please check my feedback and you will see. Can you tell me more details? Then I will provide demo data for you. Thanks, Kimi