Phase 2: Implement MapReduce (MR) programs to solve unstructured data problems on the HDFS setup.
In this phase you will implement the word co-occurrence MR algorithm discussed in Lin and Dyer's
book. Select a data set of publications from any subject area you are familiar with, and prepare
co-occurrence (co-author) information from those publications. The stripes method for co-occurrence
may be better suited to this application. The map function will have to parse each publication and
discard the extra text. We need only the first author as the key; the value is the rest of the
authors, each with its number of occurrences in the given corpus.
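As an illustration of the map step described above, here is a minimal sketch in plain Python (not Hadoop API code). The record layout is an assumption made only for this example: one publication per line, with a tab-separated author field first and the authors separated by semicolons. The function name map_publication is hypothetical; a real data set will need its own parsing.

```python
from collections import Counter


def map_publication(record):
    """Emit (first_author, stripe) pairs for one publication record.

    Assumed (hypothetical) layout: 'authors<TAB>title<TAB>other text',
    with the authors separated by ';'. Everything after the first tab is
    the "extra text" that the mapper drops.
    """
    author_field = record.split("\t", 1)[0]
    authors = [a.strip() for a in author_field.split(";") if a.strip()]
    if not authors:
        return []  # no author information could be parsed from this record
    first, rest = authors[0], authors[1:]
    # The stripe is an associative array: co-author -> count in this record.
    return [(first, Counter(rest))]
```

Emitting one stripe per publication, rather than one pair per co-author, is what distinguishes the stripes method from the pairs method: fewer, larger intermediate records reach the reducer.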
Input: Many publications from an author.
Output: The author as the key; the value is an associative array whose entries are the co-authors,
each with its number of occurrences.
Mandatory requirement: Every team must have its own data set; teams may not copy from one another.
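The output format described above can be sketched as a reduce step that sums stripes element-wise. This is a plain Python illustration under assumptions, not Hadoop code: the function name reduce_stripes and the sample author names are hypothetical, and the input is taken to be (author, stripe) pairs as a stripes-style mapper would emit them.

```python
from collections import Counter, defaultdict


def reduce_stripes(pairs):
    """Merge (author, stripe) pairs into one associative array per author.

    Each stripe maps co-author -> count. Stripes that share a key are
    summed element-wise, which is the reduce step of the stripes method.
    """
    merged = defaultdict(Counter)
    for author, stripe in pairs:
        merged[author].update(stripe)  # Counter.update adds counts
    return {author: dict(stripe) for author, stripe in merged.items()}


# Hypothetical intermediate pairs from three publications.
pairs = [
    ("A", {"B": 1, "C": 1}),
    ("A", {"B": 1}),
    ("D", {"A": 1}),
]
result = reduce_stripes(pairs)
print(result)
```

In this sample, author A appears as first author on two publications, so the counts for co-author B are added together in A's associative array.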