Cancelled

PHP - Extraction Engine and Spider

We would like an HTML extraction engine that first spiders a given site and then extracts the HTML from every page it spiders. I could not find an existing script that does this (aside from, possibly, php dig).

Therefore, we would like the following in a PHP script or series of scripts:

-spider a given website completely, ignoring images, media files, XML, docs, PDFs, etc.

-record the URL/path of each page spidered (for example, /products/[url removed, login to view] or /[url removed, login to view])

-extract the HTML from each spidered page

-if the page contains any of the following tags, print off the values within the html tags: ( through , , first paragraph of text, )

-The rest of the text doesn't need to be printed off.

-The expected result is a PHP page that prints each URL together with the 'defined' text found on that page.
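The extraction step in the list above could be sketched roughly as follows. The tag names in the posting were lost, so this sketch simply assumes the title, the h1 through h6 headings, and the first paragraph; the function name extractDefinedText() is hypothetical and is not taken from the attached script. It relies on PHP's bundled DOM extension and assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the attached script): pulls a few "defined"
// pieces of text out of one page. The tag set is an ASSUMPTION, because the
// original list of tags did not survive in the posting.
function extractDefinedText(string $html): array
{
    $result = ['title' => null, 'headings' => [], 'first_paragraph' => null];
    if (trim($html) === '') {
        return $result; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real pages are rarely valid HTML, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    $title = $xpath->query('//title')->item(0);
    if ($title !== null) {
        $result['title'] = trim($title->textContent);
    }

    // The union query returns headings in document order.
    foreach ($xpath->query('//h1|//h2|//h3|//h4|//h5|//h6') as $heading) {
        $result['headings'][] = trim($heading->textContent);
    }

    $paragraph = $xpath->query('//p')->item(0);
    if ($paragraph !== null) {
        $result['first_paragraph'] = trim($paragraph->textContent);
    }

    return $result;
}
```

A results page would then loop over the spidered URLs, call this function on each page's HTML, and print the URL followed by the returned values.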

Attached is the code for a PHP file and the relevant classes, which we found and slightly modified. The script "test_pagelinks.php" looks at a URL and collects all of its links. Another part of the script visits the URL, fetches all the HTML, and prints it out.
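For reference, the link-collecting step described above might look something like the sketch below. This is not the code in test_pagelinks.php; extractLinks() is a hypothetical name, and the sketch assumes PHP 7 or later with the bundled DOM extension. It returns href values exactly as written in the page, so relative links would still need to be resolved against the page URL.

```php
<?php
// Hypothetical helper (not the attached script): returns the distinct href
// values of all <a> tags in a page, skipping anchors that do not lead to
// another page.
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        // Skip empty links, in-page fragments, and non-page schemes.
        if ($href === '' || $href[0] === '#'
            || preg_match('#^(mailto|javascript):#i', $href)) {
            continue;
        }
        $links[] = $href;
    }

    // Drop repeats while keeping first-seen order.
    return array_values(array_unique($links));
}
```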

This script has deficiencies. First, it doesn't capture the home page (index page). It also duplicates pages, so the same page can be indexed more than once. Finally, it does no HTML parsing based on the requirements above.
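One possible way to address the first two deficiencies is to seed the crawl queue with the start URL itself and to normalize every URL before checking whether it has already been seen. The sketch below illustrates that idea under several assumptions: normalizeUrl() and crawl() are hypothetical names, the link fetcher is passed in as a callable that returns absolute URLs, index.html, index.htm, and index.php are treated as the directory itself, and the list of skipped file extensions is only an example. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helpers; none of these names come from the attached script.

// Reduce a URL to one canonical form so the same page is queued only once.
function normalizeUrl(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null; // not an absolute URL
    }
    $scheme = strtolower($parts['scheme'] ?? 'http');
    $host = strtolower($parts['host']);
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';
    $path = $parts['path'] ?? '/';
    // ASSUMPTION: an index file is the same page as its directory.
    $path = preg_replace('#/index\.(html?|php)$#i', '/', $path);
    $query = isset($parts['query']) ? '?' . $parts['query'] : '';
    // The #fragment is dropped on purpose.
    return $scheme . '://' . $host . $port . $path . $query;
}

// Breadth-first crawl of one host. $fetchLinks takes a URL and returns the
// absolute URLs linked from that page.
function crawl(string $startUrl, callable $fetchLinks, int $maxPages = 500): array
{
    $start = normalizeUrl($startUrl);
    if ($start === null) {
        return [];
    }
    $host = parse_url($start, PHP_URL_HOST);
    $queue = [$start];        // seeding with the start URL captures the home page
    $seen = [$start => true]; // normalized URLs already queued
    $visited = [];

    while ($queue !== [] && count($visited) < $maxPages) {
        $url = array_shift($queue);
        $visited[] = $url;
        foreach ($fetchLinks($url) as $link) {
            $next = normalizeUrl($link);
            if ($next === null || isset($seen[$next])) {
                continue; // unusable or duplicate
            }
            if (parse_url($next, PHP_URL_HOST) !== $host) {
                continue; // stay on the given site
            }
            $path = parse_url($next, PHP_URL_PATH) ?? '';
            if (preg_match('#\.(jpe?g|gif|png|pdf|docx?|xml|mp3|avi|mov)$#i', $path)) {
                continue; // example list of non-HTML files to ignore
            }
            $seen[$next] = true;
            $queue[] = $next;
        }
    }
    return $visited;
}
```

Passing the fetcher in as a callable keeps the queue logic separate from the HTTP code, so the de-duplication can be checked against a small in-memory map of pages before it is pointed at a live site.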

So, in closing, you can modify the attached script, modify php dig components, or create your own (there may also be other components that can be used for this).

Eventually this code will form the "cornerstone" of a WAP translator. The specs are still being worked out on that, but look for that project to be posted soon.

Skills: PHP, Website Design


About the Employer:
(24 reviews) Derby, United States

Project ID: #24774

Awarded to:

trajamohan

6+ years of experience in software development

$100 USD in 5 days
(19 reviews)
4.5

12 freelancers are bidding an average of $239 for this job

webexpertz

Please check your PM.

$250 USD in 15 days
(167 reviews)
8.6
Properbouncetech

Greetings, we are pleased to submit our bid, which is most reasonable and accurate. Please check the details in PMB. Microsectec.

$275 USD in 15 days
(81 reviews)
7.5
cstl

Chandusoft is a customer-specific, service-oriented company with a professional and creative team that has 6 years of experience in web design, web programming, and software development. We have expertise and experience…

$240 USD in 15 days
(21 reviews)
7.0
viaden

Dear Sirs! Thank you for considering our bid. We've gone through your description and are ready to create the engine you require professionally and on time. Our company has been in the sphere of web development and p…

$300 USD in 21 days
(9 reviews)
6.6
gigapromoters

It can be done... www.gigapromoters.com

$250 USD in 12 days
(23 reviews)
6.5
Spin

We will do everything you want (and more... :-))). Quick, professional, quality work: that is our answer to you and your organization. We have been working for more than 10 years. Any questions?

$300 USD in 10 days
(3 reviews)
6.1
nidle

Hello, we have examined your request and would be glad to develop the required engine for you. More than 7 years of experience in the field of web technologies and programming help us achieve outstanding results fo…

$300 USD in 20 days
(3 reviews)
4.9
xhenxhe

This seems doable. I'm wondering whether it would be easier to do in Perl, though, or to start by using wget to spider and then use Perl or PHP to parse out what you need.

$300 USD in 30 days
(8 reviews)
3.8
ebusiness2

Our bid is for very high-quality work on your PHP - Extraction Engine and Spider, built to be upgradable in case you need upgrades in the future. We will always be available for upgrades. Our bid in…

$300 USD in 10 days
(4 reviews)
3.8
silviuciprian

Dear Sir, we have vast experience using PHP as a scripting language. We are perfectly capable of doing this in 20 days.

$150 USD in 20 days
(0 reviews)
0.0
Randeep

We are a group of highly educated professionals specializing in various fields. Although we are new to elance.com, we have done such projects before and are confident we can do your project. We will always be…

$100 USD in 10 days
(0 reviews)
2.8