Tokenization and statistics

Cancelled. Posted on Feb 22, 2015. Paid on delivery.

1. Write a program that preprocesses the collection. The preprocessing stage should include:

a. Function that eliminates SGML tags (e.g. <DOC> <DOCNO> ...)

b. Function that tokenizes the text. In doing this, pay particular attention to characters that need special handling, as discussed in class (. , - etc.). For this task, you can use your own implementation of a tokenizer or the class StringTokenizer.
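A minimal sketch of steps a and b, assuming that tags can be matched with a simple `<...>` pattern and that tokens are lowercased. The class name `Preprocessor`, the delimiter set, and the lowercasing are illustrative assumptions, not part of the assignment; the treatment of `.`, `,` and `-` should follow whatever was agreed in class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

public class Preprocessor {
    // Delimiters handed to StringTokenizer: whitespace plus common punctuation.
    // Splitting on '.', ',' and '-' is an assumption; adjust it to the rules
    // discussed in class (e.g. to keep hyphenated words or decimals intact).
    private static final String DELIMITERS = " \t\n\r\f.,;:!?\"()[]{}<>/-";

    /** Replaces anything that looks like an SGML tag (e.g. <DOC>, </DOCNO>) with a space. */
    public static String stripTags(String text) {
        return text.replaceAll("<[^>]+>", " ");
    }

    /** Splits tag-free text into lowercase tokens. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, DELIMITERS);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String raw = "<DOC><DOCNO>D-1</DOCNO> A clear-cut case, clearly.</DOC>";
        // prints [d, 1, a, clear, cut, case, clearly]
        System.out.println(tokenize(stripTags(raw)));
    }
}
```

Keeping the two steps as separate static methods makes it easy to swap in a hand-written tokenizer later without touching the tag removal.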

2. Determine the frequency of occurrence for all the words in this collection. To do so, for each input file (document), produce an output file with the same input file name but with extension .Dat. Put inside this file each term along with its frequency in this document.
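One way to produce the per-document `.Dat` files is sketched below. The class name `TermFrequencyWriter`, the "term count" line layout, UTF-8 encoding, and the simplified regex-based tag stripping and splitting are all assumptions made for illustration; the assignment does not specify the output format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class TermFrequencyWriter {
    /** Counts terms; splitting on non-alphanumerics stands in for the real tokenizer. */
    public static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by term for stable output
        String noTags = text.replaceAll("<[^>]+>", " ");
        for (String token : noTags.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Maps e.g. "doc1.txt" to "doc1.Dat" in the same directory. */
    public static Path datPathFor(Path input) {
        String name = input.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        return input.resolveSibling(base + ".Dat");
    }

    /** Writes one "term frequency" line per term next to the input file. */
    public static void writeFrequencies(Path input) throws IOException {
        String text = new String(Files.readAllBytes(input), StandardCharsets.UTF_8);
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : countTerms(text).entrySet()) {
            lines.add(e.getKey() + " " + e.getValue());
        }
        Files.write(datPathFor(input), lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Each command-line argument is treated as one input document.
        for (String arg : args) {
            writeFrequencies(Paths.get(arg));
        }
    }
}
```

Summing the per-document maps into one collection-wide map gives the counts needed for the questions that follow.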

Answer the following questions:

a. What is the vocabulary size? (i.e. number of unique terms)

b. What are the top 10 words in the ranking? (i.e. the words with the highest frequencies)

c. Of these top 10 words, which are "meaningful" (i.e. not stopwords), and which would you eliminate as stopwords? Stopwords may include: a, the, of, ... (search the Internet for a suitable stopword list)

d. What is the minimum number of unique words accounting for half of the total number of words in the collection?

Example: if the total number of words in the collection is 100, and we have the following word-frequency pairs: the - 30, of - 10, a - 10, clear - 8, cut - 7, etc., the answer to this question will be 3 (3 unique words account for half of the total 100 words).
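The four questions can be answered from a single collection-wide frequency map. The sketch below is illustrative only: the class name `CollectionStats`, the tiny hard-coded stopword list, and the alphabetical tie-breaking are assumptions. Vocabulary size is simply the number of keys in the map, and question d is answered by walking the ranking from the most frequent term until the running total reaches half of all words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CollectionStats {
    // Tiny illustrative stopword list; a real run would load a published list.
    public static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "the", "of", "and", "to", "in"));

    /** Terms sorted by descending frequency (ties broken alphabetically). */
    public static List<Map.Entry<String, Integer>> ranked(Map<String, Integer> freq) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        return entries;
    }

    /** Question b: the k most frequent terms. */
    public static List<String> topK(Map<String, Integer> freq, int k) {
        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Integer> e : ranked(freq)) {
            if (top.size() == k) break;
            top.add(e.getKey());
        }
        return top;
    }

    /** Question d: fewest top-ranked terms whose counts reach half of totalWords. */
    public static int wordsCoveringHalf(Map<String, Integer> freq, long totalWords) {
        long running = 0;
        int used = 0;
        for (Map.Entry<String, Integer> e : ranked(freq)) {
            if (2 * running >= totalWords) break;
            running += e.getValue();
            used++;
        }
        return used;
    }

    public static void main(String[] args) {
        // Pairs listed in the assignment's example; the remaining words ("etc.") are omitted.
        Map<String, Integer> freq = new LinkedHashMap<>();
        freq.put("the", 30); freq.put("of", 10); freq.put("a", 10);
        freq.put("clear", 8); freq.put("cut", 7);
        System.out.println("unique terms listed: " + freq.size()); // question a
        for (String w : topK(freq, 5)) {                            // questions b and c
            System.out.println(w + (STOPWORDS.contains(w) ? " (stopword)" : " (meaningful)"));
        }
        System.out.println("words covering half: " + wordsCoveringHalf(freq, 100)); // 3
    }
}
```

With the example figures, the running total goes 30, 40, 50, so the method returns 3, matching the answer given in the example.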

Note: It is highly recommended that your code is as modularized as possible; many of the functions that you implement during this assignment will be needed in future assignments or in the term project.

Submission instructions:

- write a README file including:

* a detailed note about the functionality of each of the above programs,

* complete instructions on how to run them

* answers to the questions above

- make sure you include your name in each program and in the README file.

- make sure all your programs run correctly before you submit.

- submit your assignment by the due date.

Java

Project ID: #7186168

About the project

4 proposals. Remote project. Open as of Feb 22, 2015.

4 freelancers are bidding an average of $25 for this job

maheshtippani

Hi, I have overall 3 years of experience on java/j2ee ,spring, hibernate,Junit,Easymock technologies and also work with STS. I am a quick learner. if you give a chance to me i will prove myself.

$30 USD in 1 day
(4 reviews)
2.4
wiserehan

Hello Sir, Kindly have a look at my reviews. I always code efficiently. For now I only want awesome reviews. Don't pay me a single dollar until you check the results. I guarantee that you'll like my work and will de...

$30 USD in 0 days
(1 review)
0.6
jdrana11

Hy, I am a java developer and having experience of 3+ years.........!! I just make a tokenizer like this(not fully like your project)........!! Hope you award me this project.......!! Thankx......!

$20 USD in 2 days
(1 review)
0.4
saurabheights

Hi, I am SDE at amazon currently with skills in Java and web. Give me 2 days to work on this problem. Once done, I will send you the screenshots and then you can purchase my bid. Let me know if you are OK with this.

$20 USD in 5 days
(0 reviews)
0.0