Job Description for Scraping and Parsing Patents
I’d like to query the US Patent and Trademark Office website. (Visit [url removed, login to view] to understand this process.) In particular, I’d like to run the following two types of queries:
1) A patent number (for example, 4237224) in the Term 1 box and “Referenced By” selected from the drop-down menu next to the Field 1 box
2) A phrase (for example, “recombinant DNA”) in the Term 1 box and “Title”, “Abstract”, or “Claims” selected from the drop-down menu next to the Field 1 box
These queries will result in a list of several patents. (Try the first query described above as an example. It will result in 268 patents, with 50 hits per page.) If you then click on the link for any one of these hits, you’ll see that it contains a wealth of information for a single patent. I’d like a program that, for each of these resultant patents, will automatically download the name and location of each inventor; the name and location of each assignee; the filed date; and the issue date (which appears in the upper-right corner). For example, for patent 7375758 (the first hit in the list from the above query), the program should download:
Inventors: Harvey; Alex J. (Athens, GA), Wang; Youliang (Monroe, GA)
Assignee: AviGenics, Inc. (Athens, GA)
Filed: December 2, 2002
Issued: May 20, 2008
It should also download this information for each of the other 267 hits.
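As a rough illustration of the "separate field" requirement, the sketch below splits an Inventors or Assignee line like the ones above into name and location pairs. It assumes each entry has the form "Name (Location)" and that names never contain parentheses; the function name split_parties is hypothetical and the real page text may need extra cleanup first.

```python
import re

# One "Name (Location)" entry. The name may contain semicolons or commas
# ("Harvey; Alex J." or "AviGenics, Inc."); the location is the text inside
# the parentheses. Assumption: names themselves contain no parentheses.
ENTRY = re.compile(r"\s*([^()]+?)\s*\(([^()]*)\)\s*,?")


def split_parties(text):
    """Return a list of (name, location) tuples parsed from one line."""
    return [(name.strip(), loc.strip()) for name, loc in ENTRY.findall(text)]


inventors = split_parties(
    "Harvey; Alex J. (Athens, GA), Wang; Youliang (Monroe, GA)"
)
assignees = split_parties("AviGenics, Inc. (Athens, GA)")
```

Here inventors holds two (name, location) pairs and assignees holds one, ready to be placed in separate columns.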
Each piece of information should appear in a separate field. There may be up to 40 inventors and associated inventor locations, and 10 assignees and assignee locations. (There will only be one filed date and one issue date.) Thus, the process should work as follows: I enter the first query listed above. (The program should obviously work for the other queries described, too.) The program should output a datafile that, if imported into a spreadsheet, has 268 rows (one for each hit). It has 40 inventor columns, 40 inventor location columns, 10 assignee columns, 10 assignee location columns, one filed column and one issued column (102 columns total). If there aren’t 40 inventors or 10 assignees (few, if any, patents will have all of them), it should insert blanks such that the fields line up from row to row.
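The 102-column layout described above could be produced along these lines. This is only a sketch under stated assumptions: the helper names (pad, build_row, write_rows) are hypothetical, the inputs are lists of (name, location) pairs that have already been extracted from each patent page, and the columns are written in the order listed above.

```python
import csv
import io

MAX_INVENTORS = 40
MAX_ASSIGNEES = 10


def pad(values, width):
    """Truncate or right-pad a list with empty strings to exactly `width` items."""
    return (list(values) + [""] * width)[:width]


def build_row(inventors, assignees, filed, issued):
    """Build one 102-field row from lists of (name, location) tuples."""
    inv_names = pad([name for name, _ in inventors], MAX_INVENTORS)
    inv_locs = pad([loc for _, loc in inventors], MAX_INVENTORS)
    asg_names = pad([name for name, _ in assignees], MAX_ASSIGNEES)
    asg_locs = pad([loc for _, loc in assignees], MAX_ASSIGNEES)
    return inv_names + inv_locs + asg_names + asg_locs + [filed, issued]


def write_rows(rows, handle):
    """Write rows as tab-delimited text that a spreadsheet can import."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# Example using the values quoted for patent 7375758.
row = build_row(
    [("Harvey; Alex J.", "Athens, GA"), ("Wang; Youliang", "Monroe, GA")],
    [("AviGenics, Inc.", "Athens, GA")],
    "December 2, 2002",
    "May 20, 2008",
)
buffer = io.StringIO()
write_rows([row], buffer)
```

Padding every group to a fixed width is what keeps the fields aligned from row to row, whatever the number of inventors or assignees on a given patent.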
Finally, I should note that I had someone write a program to do this three years ago in TCL. However, it only worked for patent numbers (not for phrases, as described in query 2 above), and the US patent office has since changed the structure of their database, so the program no longer works. I can provide you with the full commented source code that this person wrote. It may be that this job is as straightforward as updating the field names and adding the ability to query by phrase.
The deliverables are:
1) An executable program that allows me to enter the queries described above and that outputs a file in a format that I can import into a spreadsheet (e.g., a tab-delimited text file)
2) The code you used to do this. It must be well commented.