In this project, the students are to implement data pre-processing techniques and apply them to a gene expression dataset.
The dataset contains 62 samples collected from colon-cancer patients. 40 of the samples are labeled "negative" and 22 are labeled "positive". Each tuple (row) in the dataset is a sample containing the readings for the genes, followed by the class of the sample in the last column. Each gene is an attribute. The columns are separated by commas (","), a format commonly used in data mining. We will refer to the genes as G0, ..., GN, assigned in left-to-right order as given in the original file.
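The row layout described above can be read with a few lines of Java. This is only a sketch: the class name GeneCsv and its method names are made up for illustration, and it assumes the file has no header row, that every column except the last is numeric, and that the class label is stored as plain unquoted text.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the assignment hand-out.
class GeneCsv {
    // Gene readings G0..GN: every comma-separated field except the last one.
    static double[] parseValues(String line) {
        String[] parts = line.trim().split(",");
        double[] genes = new double[parts.length - 1];
        for (int i = 0; i < genes.length; i++) {
            genes[i] = Double.parseDouble(parts[i].trim());
        }
        return genes;
    }

    // Class label: the last comma-separated field of the row.
    static String parseLabel(String line) {
        String[] parts = line.trim().split(",");
        return parts[parts.length - 1].trim();
    }

    // Reads all non-empty lines of the file (assumption: no header row).
    static List<String> readRows(Path file) throws IOException {
        List<String> rows = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            if (!line.trim().isEmpty()) {
                rows.add(line);
            }
        }
        return rows;
    }
}
```

Each row returned by readRows can then be passed to parseValues and parseLabel to build one array of readings per gene plus a label per sample.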
You will write a C++ or Java program to handle the following two tasks:
Task 1. Discretize the data using equi-density binning with 3 bins for each of the first k attributes.
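One possible reading of equi-density binning is equal-frequency binning: sort the values of a gene and give each bin roughly the same number of samples. The sketch below follows that reading. The class and method names are hypothetical, and the tie policy (equal values may land in different bins, in original row order) is an assumption that the assignment does not specify.

```java
import java.util.Arrays;

// Hypothetical helper illustrating equal-frequency binning for one attribute.
class EquiDensityBinning {
    // Returns, for each input value, a bin index in 0..bins-1 such that
    // the bins hold (almost) equal numbers of values.
    static int[] assignBins(double[] values, int bins) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Sort sample indices by value; the sort is stable, so ties keep row order.
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        int[] result = new int[n];
        for (int rank = 0; rank < n; rank++) {
            // The rank in sorted order decides the bin.
            result[order[rank]] = (int) ((long) rank * bins / n);
        }
        return result;
    }
}
```

With the 62 samples of this dataset and 3 bins, this rule yields bins of 21, 21, and 20 samples. If samples with equal readings must share a bin, the boundaries would need to be shifted to the nearest change in value.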
Task 2. Use the entropy-based binning method to discretize all genes and to select the top-k genes, ranked in decreasing order of information gain. Use 3 bins for each gene. Information gain for three bins generalizes the two-bin case: it is the entropy of the whole set minus the size-weighted entropy of the bins. To get three bins, first divide the range of the attribute into two bins, then divide one of those two bins into two more. For the second split, find the best split of the left interval and the best split of the right interval, and keep whichever gives the larger size-weighted entropy gain over the resulting three intervals.
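The greedy two-step split described above could be sketched as follows. All names here are hypothetical. The sketch assumes two classes (positive or negative), allows cuts only between distinct adjacent values, and reports each cut as the midpoint of its two neighbors. None of these details are fixed by the assignment text.

```java
import java.util.Arrays;

// Hypothetical sketch of entropy-based 3-bin discretization and gene ranking.
class EntropyBinning {
    // Base-2 entropy of a two-class sample with the given class counts.
    static double entropy(int pos, int neg) {
        int n = pos + neg;
        if (pos == 0 || neg == 0) return 0.0;
        double p = (double) pos / n, q = (double) neg / n;
        return -(p * Math.log(p) + q * Math.log(q)) / Math.log(2);
    }

    // Size of the interval [from, to) times its entropy.
    static double weighted(boolean[] lab, int from, int to) {
        int pos = 0;
        for (int i = from; i < to; i++) if (lab[i]) pos++;
        return (to - from) * entropy(pos, to - from - pos);
    }

    // Best cut index inside [from, to) of the sorted data, or -1 if none exists.
    static int bestCut(double[] v, boolean[] lab, int from, int to) {
        int best = -1;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int c = from + 1; c < to; c++) {
            if (v[c] == v[c - 1]) continue; // cannot separate equal values
            double cost = weighted(lab, from, c) + weighted(lab, c, to);
            if (cost < bestCost) { bestCost = cost; best = c; }
        }
        return best;
    }

    static double mid(double[] v, int c) { return (v[c - 1] + v[c]) / 2; }

    // Returns {information gain, lower cut, upper cut}; missing cuts are NaN.
    static double[] threeBinGain(double[] values, boolean[] positive) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));
        double[] v = new double[n];
        boolean[] lab = new boolean[n];
        for (int i = 0; i < n; i++) { v[i] = values[order[i]]; lab[i] = positive[order[i]]; }

        int c1 = bestCut(v, lab, 0, n);             // first split
        if (c1 < 0) return new double[] {0.0, Double.NaN, Double.NaN};
        int cl = bestCut(v, lab, 0, c1);            // candidate second split, left
        int cr = bestCut(v, lab, c1, n);            // candidate second split, right
        double right = weighted(lab, c1, n), left = weighted(lab, 0, c1);
        double costL = cl < 0 ? Double.POSITIVE_INFINITY
                : weighted(lab, 0, cl) + weighted(lab, cl, c1) + right;
        double costR = cr < 0 ? Double.POSITIVE_INFINITY
                : left + weighted(lab, c1, cr) + weighted(lab, cr, n);
        int lo, hi;
        double cost;
        if (cl < 0 && cr < 0) { lo = c1; hi = -1; cost = left + right; } // only two bins possible
        else if (costL <= costR) { lo = cl; hi = c1; cost = costL; }
        else { lo = c1; hi = cr; cost = costR; }
        double gain = (weighted(lab, 0, n) - cost) / n;
        return new double[] {gain, mid(v, lo), hi < 0 ? Double.NaN : mid(v, hi)};
    }

    // Indices of the k genes with the largest gain; columns[g] holds gene g's readings.
    static int[] topK(double[][] columns, boolean[] positive, int k) {
        int g = columns.length;
        double[] gain = new double[g];
        Integer[] idx = new Integer[g];
        for (int j = 0; j < g; j++) { gain[j] = threeBinGain(columns[j], positive)[0]; idx[j] = j; }
        Arrays.sort(idx, (a, b) -> Double.compare(gain[b], gain[a]));
        int[] top = new int[Math.min(k, g)];
        for (int i = 0; i < top.length; i++) top[i] = idx[i];
        return top;
    }
}
```

The search is quadratic in the number of samples per gene, which is unproblematic for 62 samples. A running class count would make it linear if needed.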
If possible: I have a similar Java project whose code needs only a little modification to meet the given requirements.