Completed

Web scraper and parser (patent data)

Job Description for Scraping and Parsing Patents

I’d like to query the US Patent and Trademark Office website. (Visit [url removed, login to view] to understand this process.) In particular, I’d like to run the following two types of queries:

1) A patent number (for example, 4237224) in the Term 1 box and “Referenced By” selected from the drop-down menu next to the Field 1 box

2) A phrase (for example, “recombinant DNA”) in the Term 1 box and “Title”, “Abstract”, or “Claims” selected from the drop-down menu next to the Field 1 box
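The two query types above can be modelled as a small value type before any scraping is written. This is only a sketch: the `PatentQuery` class and its validation rules are hypothetical names introduced for illustration, and the field labels are simply the drop-down labels quoted above, not the site's internal field codes.

```python
from dataclasses import dataclass

# Field labels as quoted in the job description (assumed to match the drop-down).
PATENT_NUMBER_FIELDS = {"Referenced By"}
PHRASE_FIELDS = {"Title", "Abstract", "Claims"}


@dataclass(frozen=True)
class PatentQuery:
    """One search: a Term 1 value plus a Field 1 selection (hypothetical type)."""
    term: str
    field: str

    def __post_init__(self):
        if self.field in PATENT_NUMBER_FIELDS:
            # Query type 1: the term must be a patent number such as 4237224.
            if not self.term.isdigit():
                raise ValueError(f"expected a patent number, got {self.term!r}")
        elif self.field in PHRASE_FIELDS:
            # Query type 2: the term is a free-text phrase.
            if not self.term.strip():
                raise ValueError("phrase must not be empty")
        else:
            raise ValueError(f"unsupported field: {self.field!r}")


citing = PatentQuery("4237224", "Referenced By")
by_title = PatentQuery("recombinant DNA", "Title")
```

Validating the term against the chosen field up front means a mistyped patent number fails before any request is sent.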

These queries will result in a list of several patents. (Try the first query described above as an example. It will result in 268 patents, with 50 hits per page.) If you then click on the link for any one of these hits, you’ll see that it contains a wealth of information for a single patent. I’d like a program that, for each of these resultant patents, will automatically download the name and location of each inventor; the name and location of each assignee; the filed date; and the issue date (which appears in the upper-right corner). For example, for patent 7375758 (the first hit in the list from the above query), the program should download:

Inventors: Harvey; Alex J. (Athens, GA), Wang; Youliang (Monroe, GA)

Assignee: AviGenics, Inc. (Athens, GA)

Filed: December 2, 2002

Issued: May 20, 2008

It should also download this information for each of the other hits (268 in total).
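Once the text of each field has been extracted from a patent page, splitting it into names and locations could look like the sketch below. It assumes every entry has the form "Name (Location)" as in the example above, that names never contain parentheses, and that dates are written as English month names; the function names `split_parties` and `parse_date` are hypothetical.

```python
import re
from datetime import datetime

# One "Name (Location)" entry, optionally preceded by the comma that
# separates it from the previous entry. Assumes no parentheses in names.
ENTRY_RE = re.compile(r"\s*,?\s*([^()]+?)\s*\(([^()]*)\)")


def split_parties(text):
    """Return a list of (name, location) pairs from an inventor or assignee field."""
    return [(name.strip(), loc.strip()) for name, loc in ENTRY_RE.findall(text)]


def parse_date(text):
    """Parse a date such as 'December 2, 2002' (English month names assumed)."""
    return datetime.strptime(text.strip(), "%B %d, %Y").date()


inventors = split_parties("Harvey; Alex J. (Athens, GA), Wang; Youliang (Monroe, GA)")
assignees = split_parties("AviGenics, Inc. (Athens, GA)")
filed = parse_date("December 2, 2002")
issued = parse_date("May 20, 2008")
```

Commas inside a name or a location (as in "AviGenics, Inc." or "Athens, GA") are why the sketch splits on the parenthesised location rather than on commas.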

Each piece of information should appear in a separate field. There may be up to 40 inventors and associated inventor locations, and 10 assignees and assignee locations. (There will only be one filed date and one issue date.) Thus, the process should work as follows: I enter the first query listed above. (The program should obviously work for the other queries described, too.) The program should output a datafile that, if imported into a spreadsheet, has 268 rows (one for each hit). It has 40 inventor columns, 40 inventor location columns, 10 assignee columns, 10 assignee location columns, one filed column and one issued column (102 columns total). If there aren’t 40 inventors or 10 assignees (few, if any, patents will have all of them), it should insert blanks such that the fields line up from row to row.
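The fixed 102-column layout described above can be sketched as follows. The helper names (`pad`, `header`, `build_row`, `write_tsv`) and the column headings are assumptions made for illustration; only the column counts, their order, and the blank padding come from the description.

```python
import csv
import io

MAX_INVENTORS = 40
MAX_ASSIGNEES = 10


def pad(values, width):
    """Right-pad a list with blanks so every row has the same number of fields."""
    if len(values) > width:
        raise ValueError(f"{len(values)} values exceed the {width} available columns")
    return list(values) + [""] * (width - len(values))


def header():
    """Column names (hypothetical labels) in the order given in the description."""
    return (
        [f"inventor_{i}" for i in range(1, MAX_INVENTORS + 1)]
        + [f"inventor_location_{i}" for i in range(1, MAX_INVENTORS + 1)]
        + [f"assignee_{i}" for i in range(1, MAX_ASSIGNEES + 1)]
        + [f"assignee_location_{i}" for i in range(1, MAX_ASSIGNEES + 1)]
        + ["filed", "issued"]
    )


def build_row(inventors, assignees, filed, issued):
    """Build one 102-field row from (name, location) pairs and two date strings."""
    return (
        pad([name for name, _ in inventors], MAX_INVENTORS)
        + pad([loc for _, loc in inventors], MAX_INVENTORS)
        + pad([name for name, _ in assignees], MAX_ASSIGNEES)
        + pad([loc for _, loc in assignees], MAX_ASSIGNEES)
        + [filed, issued]
    )


def write_tsv(rows, out):
    """Write a header line and all rows as tab-delimited text."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header())
    writer.writerows(rows)


row = build_row(
    [("Harvey; Alex J.", "Athens, GA"), ("Wang; Youliang", "Monroe, GA")],
    [("AviGenics, Inc.", "Athens, GA")],
    "December 2, 2002",
    "May 20, 2008",
)
buffer = io.StringIO()
write_tsv([row], buffer)
```

Raising an error when a patent has more than 40 inventors or 10 assignees, instead of silently truncating, keeps unexpected records visible.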

Finally, I should note that I had someone write a program to do this three years ago in Tcl. However, it only worked for patent numbers (not for phrases, as described in query 2 above), and the US patent office has since changed the structure of its database, so the program no longer works. I can provide you with the full commented source code that this person wrote. It may be that this job is as straightforward as updating the field names and adding the ability to query by phrase.

The deliverable is:

1) An executable program that allows me to enter the queries described above and that outputs a file in a format that I can import into a spreadsheet (e.g., a tab-delimited text file)

2) The code you used to do this. It must be well commented.

Skills: Java, Perl, Python


About the Employer:
(6 reviews) Eugene, United States

Project ID: #299186

Awarded to:


Hi, Please check PMB.

$100 USD in 0 days
(5 reviews)

9 freelancers are bidding an average of $94 for this job


I can do this job for you. See PM for details.

$150 USD in 3 days
(141 reviews)

Hi, I can write a Firefox add-on to do that. I have recently finished 2 Firefox projects.

$80 USD in 3 days
(49 reviews)

please review my pmb

$65 USD in 3 days
(24 reviews)

can be done

$100 USD in 4 days
(4 reviews)

Hi, I have created many web scrapers before. I'll do this in Python.

$100 USD in 2 days
(2 reviews)

please check PM

$200 USD in 4 days
(7 reviews)

please check PMB....

$60 USD in 3 days
(0 reviews)

Implementation in Perl with standard modules, producing an Excel (.xls) file rather than CSV.

$45 USD in 3 days
(0 reviews)

This is Lekha. My team is fully experienced, and we have expertise in developing many desktop/web applications and SMS marketing products using the .NET platform and J2ME/J2EE and WML technologies. My team comprises of hi...

$50 USD in 2 days
(0 reviews)