Closed

Web Crawler + Database + API Estimate

I'm looking to build a web crawler that can handle millions of pages, and a database structure that can support a large number of entries. Right now, I only want 3 simple things:

1. My needs for the crawler are very basic - I want to index content in the body section of web pages, and basic meta information like page title and description. While that's a really simple task, I'm more concerned with the size of the database and making sure that it doesn't get so large that it becomes slow. How would you structure a database for something that needs to handle storing millions of pages?

2. For the crawler, like I mentioned, I'm indexing entire pages, so this should be very quick. I do have a large number of sites that I want to crawl though (20,000+). Preferred language and framework are Python + Scrapy but if you have experience with another language (like Java) for large scale crawling, I am open to considering other things. I want to scrape anything between HTML body tags and basic meta information, and store the time and date that the page was crawled. No other specifications at this time. The question here is, how long would it take for you to build a crawler that can handle crawling and scraping a large number of web pages?

3. I'm thinking in a different direction than I was before, and want to handle any parsing or searching for specific information through code that is separate from the crawler. I want to build an API so specific information is standardized and can be used by other websites. Do you have any experience in this area?
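To make the fields named in items 1 and 2 concrete (page title, meta description, body content, and crawl date/time), here is a minimal sketch rather than a proposed design. It assumes Python's standard-library `html.parser` and `sqlite3` in place of Scrapy and MySQL so that it runs on its own; the names `PageParser`, `extract_page`, `store_page`, and the `pages` table are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects the <title>, the meta description, and text inside <body>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.body_parts = []
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body and not self._skip_depth and data.strip():
            self.body_parts.append(data.strip())


def extract_page(url, html):
    """Return one record with the fields the posting asks for."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": parser.title.strip(),
        "description": parser.description.strip(),
        "body": " ".join(parser.body_parts),
        # Crawl time stored as an ISO 8601 UTC string (an assumption).
        "crawled_at": datetime.now(timezone.utc).isoformat(),
    }


def store_page(conn, page):
    """Insert or refresh one page row, keyed by URL (hypothetical schema)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               url TEXT PRIMARY KEY,
               title TEXT,
               description TEXT,
               body TEXT,
               crawled_at TEXT
           )"""
    )
    conn.execute(
        "INSERT OR REPLACE INTO pages "
        "VALUES (:url, :title, :description, :body, :crawled_at)",
        page,
    )
    conn.commit()


if __name__ == "__main__":
    sample = (
        "<html><head><title>Example</title>"
        "<meta name='description' content='A demo page'></head>"
        "<body><h1>Hello</h1><script>var x = 1;</script><p>World</p></body></html>"
    )
    conn = sqlite3.connect(":memory:")
    store_page(conn, extract_page("http://example.com/", sample))
    print(conn.execute("SELECT title, description, body FROM pages").fetchall())
```

Keying rows by URL means a re-crawl overwrites the earlier copy. Whether to keep history per crawl, and whether full page bodies belong in the database at all, are the open questions this posting is asking bidders to answer.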

Right now, I want to get an idea of how long you see this taking, how you would handle a large dataset, and when you would be available to work on this. Looking only for a detailed estimate here, no code, no other work right now which is why the rate is so low.

If I invited you to bid, I'm considering your listed hourly rate, not the price on this project.

Skills: Hadoop, MySQL, Python, Software Architecture, Web Scraping

About the Employer:
(11 reviews) Las Vegas, United States

Project ID: #6246220

15 freelancers are bidding an average of $1415 for this job

e3d

Hi, OK, so my quote is for the idea only :) 1. First of all, please explain what you mean by "indexing". Is your ultimate goal to build some kind of search engine, or do you want to be able to have a local copy of the HTML code …

$29 USD in 3 days
(212 reviews)
8.6
mantislin

Hi sir, I am a scraping expert. I have done many similar projects; please check my feedback and you will see. Can you tell me more details? Then I will provide demo data for you. Thanks, Kimi

$250 USD in 5 days
(271 reviews)
7.4
dotnetsqlCoder

I think it is best to use a search engine, something like Apache Solr. It works on Windows and Linux and can be accessed easily from programming languages. Solr is distributed and can …

$30 USD in 1 day
(120 reviews)
7.0
SigmaVisual

Dear Client, I can help with your project. We already have experience working on similar projects. Please see below to get an idea of our experience: Amazon/Ebay Bots: [login to view URL] …

$222 USD in 5 days
(128 reviews)
7.6
mmadi

Hi, I am interested in your project and I'll be happy to do it for you. I have rich experience in scraping with cURL, regular expressions, DOM, and Selenium RC. I worked for [login to view URL] and [login to view URL] search eng …

$24 USD in 1 day
(19 reviews)
6.6
WorkXpressPaaS

We are a US company with a 13-year history of developing cloud-based custom software solutions for a diverse client list. After reviewing your post, I am confident that we can develop and deploy a custom solut …

$5154 USD in 50 days
(1 review)
5.5
dabing1205

A proposal has not yet been provided

$11111 USD in 30 days
(30 reviews)
5.3
smellyfinger

Hi. I have good news for you. The API you want is already built and is ready to use. The language is Python. Here are some of its features: 1- Write a full site scraper using a few lines of code (mainly, the sele …

$25 USD in 1 day
(13 reviews)
4.7
ngcomp

I have done similar work in the past for another company. Based on your requirements, we can go with the following: [login to view URL] (Crawler) [login to view URL] (Extracting Body) scrape and parse HTML from a U …

$3333 USD in 21 days
(6 reviews)
4.5
lenzai

Hello, I am a Python Scrapy specialist. I spent the past 4 years focusing exclusively on web scraping and web automation projects. 1. A database is not suitable for storing full page content. We can use other appr …

$277 USD in 1 day
(11 reviews)
3.7
OwasqaCo

Good with automated tasks, I can do this. PM for more discussion. Thank you!

$25 USD in 1 day
(7 reviews)
2.8
tonypaul006

Hi, our company has been in the web crawling business for a long time now. We have developed an open source web crawling framework, Dragline ([login to view URL]). We also have an affordable web sc …

$25 USD in 1 day
(0 reviews)
0.0
kadianpuneet

I have prior experience building distributed systems, generic crawlers, and indexing solutions on HDFS, which I delivered successfully to a major US company. The project was well appreciated by the client. I h …

$35 USD in 60 days
(0 reviews)
0.0
vtico

I use ElasticSearch as the search engine, which is based on Lucene. I will use Java for HTML parsing and for calling web services on ElasticSearch.

$666 USD in 20 days
(0 reviews)
0.0
maredefjore

A proposal has not yet been provided

$25 USD in 1 day
(0 reviews)
0.0