Closed

Website domain crawling

I need to do some large-scale website crawling for data mining purposes. It needs to scale to potentially millions of sites. Unless you have done a project like this, please do not bid. I need an expert who has done it before.

Not looking for a custom solution. I would expect that solutions exist that can be leveraged.

Specs:

- Input is a list of URLs

- Each URL is FULLY crawled starting at the depth provided. For example, if [url removed, login to view] is provided, only the data under [url removed, login to view] is crawled; the crawler would not see [url removed, login to view]

- All pages are saved as HTML in a network file system under the top-level target domain directory

- All pages and links are crawled, including pages that require a click via JavaScript or user interaction

- All pages are saved as their rendered versions as HTML

- Child pages that are behind a JavaScript link: when this happens, the link is converted to an HTML link and inserted into the parent page. The rendered child page is saved as HTML

- Inserted HTML links (JavaScript clicks) should be in a human-readable format for QA
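As a rough illustration of the scoping and storage rules above, here is a minimal Python sketch using only the standard library. The function names `in_scope` and `local_path`, the `/mnt/crawl` root, and the choice to use the URL's host name as the "top-level target domain directory" are all assumptions made for this example, not part of the spec.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def in_scope(seed_url: str, candidate_url: str) -> bool:
    """True if candidate_url is on the seed's host and at or below the seed's path."""
    seed, cand = urlsplit(seed_url), urlsplit(candidate_url)
    if cand.scheme not in ("http", "https"):
        return False
    if (cand.hostname or "").lower() != (seed.hostname or "").lower():
        return False
    seed_base = seed.path.rstrip("/")
    cand_path = cand.path or "/"
    # Match the seed page itself, or anything strictly underneath it.
    return cand_path.rstrip("/") == seed_base or cand_path.startswith(seed_base + "/")


def local_path(root: str, url: str) -> PurePosixPath:
    """Map a URL to an .html file under <root>/<host>/... (hypothetical layout)."""
    parts = urlsplit(url)
    host = (parts.hostname or "unknown-host").lower()
    # Drop empty and dot segments so the result cannot escape the host directory.
    segments = [s for s in unquote(parts.path).split("/") if s not in ("", ".", "..")]
    if not segments or parts.path.endswith("/"):
        segments.append("index.html")
    elif not segments[-1].lower().endswith((".html", ".htm")):
        segments[-1] += ".html"
    if parts.query:
        # Different query strings are treated as different pages.
        digest = hashlib.sha1(parts.query.encode("utf-8")).hexdigest()[:8]
        stem, _, ext = segments[-1].rpartition(".")
        segments[-1] = f"{stem}-{digest}.{ext}"
    return PurePosixPath(root, host, *segments)
```

With a seed of `https://example.com/blog`, this check accepts `https://example.com/blog/post-1` and rejects both `https://example.com/about` and `https://example.com/blogger`. A real deployment would also need to decide how to treat subdomains, redirects, and URL fragments, which this sketch ignores.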

Another process will consume and process the directory tree after each site is done. The crucial piece is that everything is traversable by following a regular HTML link.
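One way the "traversable by a regular HTML link" requirement could be met is to append plain anchors to each saved page for every child page that was reached through a JavaScript click. The sketch below assumes the crawler has already rendered the page (for instance with a headless browser) and collected `(label, href)` pairs. The function name `inject_links`, the `crawler-js-links` class, and the label format are hypothetical.

```python
import re
from html import escape


def inject_links(rendered_html: str, discovered: list[tuple[str, str]]) -> str:
    """Insert plain <a> links for JS-discovered child pages before the last </body>."""
    if not discovered:
        return rendered_html
    items = "\n".join(
        f'  <li><a href="{escape(href, quote=True)}">{escape(label)}</a></li>'
        for label, href in discovered
    )
    block = (
        "<!-- crawler: links discovered via JavaScript clicks -->\n"
        '<ul class="crawler-js-links">\n' + items + "\n</ul>\n"
    )
    # Find the last closing body tag; fall back to appending if there is none.
    last = None
    for last in re.finditer(r"</body\s*>", rendered_html, flags=re.IGNORECASE):
        pass
    if last is None:
        return rendered_html + "\n" + block
    return rendered_html[: last.start()] + block + rendered_html[last.start():]
```

A label such as `Load more (button#more)` keeps each inserted link readable for QA, and the downstream process only has to follow ordinary `href` attributes.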

Please suggest what technologies you would use to accomplish this project.

Please show that you have read this fully by putting your favorite color in CAPS as the first word in your bid. Do not post a generic list of projects. I want to know what you have done in site crawling specifically. Thanks.

Skills: Software Architecture


About the Employer:
(83 reviews) Oconomowoc, United States

Project ID: #14450929

5 freelancers are bidding an average of $538 for this job

prashushinde9

******************* SCRAPING EXPERT ****************** Hello, we have scraped data from various websites and have expertise in that. Since I have worked with scraping, I am very confident that I can easily h

$773 USD in 10 days
(15 reviews)
6.1
$555 USD in 10 days
(6 reviews)
5.3
$250 USD in 10 days
(1 review)
3.0
$555 USD in 10 days
(5 reviews)
3.3
saurabh04rk

I am interested in doing this project and will give my best to complete it. Relevant Skills and Experience: I have 9 years of experience providing high end innovative and reliable [login to view URL] Make Work Easy. W

$555 USD in 10 days
(0 reviews)
0.0