Closed

C code to index large text library and find similar -- 2

I need a mini-app (Compiled C on Linux) that groups similar sentences together.

I have 100,000 sentences (say in a PostgreSQL DB, Unicode text). It must perform VERY fast - by indexing each root-word to a 16-bit integer (which would reduce its memory footprint), then re-creating a new data structure with sentence delimiters and sentence length. Group into buckets of similar sentence length.
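
As a rough illustration of the word-to-integer indexing described above, here is a minimal sketch of an open-addressing hash table that hands out 16-bit IDs. It assumes words arrive already reduced to their root form as NUL-terminated UTF-8 strings, and that the vocabulary fits in 65,535 IDs. The names `dict_t`, `dict_intern`, `DICT_SLOTS` and `DICT_FULL` are hypothetical and not part of the brief.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DICT_SLOTS 131072u      /* power of two, about twice the 65,535 possible IDs */
#define DICT_FULL  UINT16_MAX   /* returned when no ID can be assigned */

typedef struct {
    char    *words[DICT_SLOTS]; /* NULL marks an empty slot */
    uint16_t ids[DICT_SLOTS];
    uint16_t next_id;
} dict_t;

/* FNV-1a hash over the raw bytes, so it works on UTF-8 text unchanged. */
static uint32_t hash_word(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

/* Return the ID for `word`, assigning the next free ID on first sight. */
static uint16_t dict_intern(dict_t *d, const char *word) {
    uint32_t i = hash_word(word) & (DICT_SLOTS - 1);
    while (d->words[i] != NULL) {
        if (strcmp(d->words[i], word) == 0) return d->ids[i];
        i = (i + 1) & (DICT_SLOTS - 1);   /* linear probing */
    }
    if (d->next_id == DICT_FULL) return DICT_FULL;   /* out of 16-bit IDs */
    size_t n = strlen(word) + 1;
    char *copy = malloc(n);
    if (copy == NULL) return DICT_FULL;
    memcpy(copy, word, n);
    d->words[i] = copy;
    d->ids[i] = d->next_id;
    return d->next_id++;
}
```

The table is a little over 1 MB, so it would normally be heap-allocated and zeroed, e.g. `dict_t *d = calloc(1, sizeof *d);`, after which each sentence becomes an array of `uint16_t` values plus its length.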

Then iterate through, doing word-by-word comparisons (16-bit comparisons).

Two algorithms are acceptable:

1. Simple - Take a source sentence and iterate through XORing word by word (irrespective of word order or word frequency). If there are more than x words outstanding - then it is NOT a similar sentence. Here x would be 25% of the total number of words.
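
One possible reading of this step is a set comparison: ignore order and frequency, count the word IDs that appear in only one of the two sentences, and reject the pair if that count exceeds 25%. The sketch below takes that reading. It assumes sentences are already arrays of 16-bit word IDs, and it assumes "total words" means the distinct-word count of the longer sentence, which the brief does not pin down. All function names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

static int cmp_u16(const void *a, const void *b) {
    uint16_t x = *(const uint16_t *)a, y = *(const uint16_t *)b;
    return (x > y) - (x < y);
}

/* Sort IDs and drop duplicates in place; returns the distinct count. */
static size_t sort_unique(uint16_t *ids, size_t n) {
    if (n == 0) return 0;
    qsort(ids, n, sizeof ids[0], cmp_u16);
    size_t out = 1;
    for (size_t i = 1; i < n; i++)
        if (ids[i] != ids[out - 1]) ids[out++] = ids[i];
    return out;
}

/* Count IDs present in exactly one of two sorted, de-duplicated arrays. */
static size_t unmatched_words(const uint16_t *a, size_t na,
                              const uint16_t *b, size_t nb) {
    size_t i = 0, j = 0, diff = 0;
    while (i < na && j < nb) {
        if (a[i] == b[j]) { i++; j++; }
        else if (a[i] < b[j]) { i++; diff++; }
        else { j++; diff++; }
    }
    return diff + (na - i) + (nb - j);
}

/* Assumption: the 25% threshold is taken against the longer sentence. */
static int maybe_similar(const uint16_t *a, size_t na,
                         const uint16_t *b, size_t nb) {
    size_t total = na > nb ? na : nb;
    return unmatched_words(a, na, b, nb) * 4 <= total;   /* unmatched <= 25% */
}
```

Sorting each sentence once up front keeps every later pairwise check to a single linear merge over 16-bit values.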

We leave such a large gap so that we don't need to worry about word roots.

From the smaller data set - we then proceed to do a classic Levenshtein comparison - but with an upper bound of x deviation - meaning after it detects more than say 10% deviation - it exits that comparison. Here it is a character-by-character comparison.
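
The bounded comparison could look something like the following sketch: a two-row Levenshtein that gives up as soon as every cell in the current row is above the allowed distance. It compares bytes, so on UTF-8 input it counts byte edits rather than character edits; decoding to code points first is left out here. The function name and the `max_dist + 1` return convention are assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the edit distance between a and b, or max_dist + 1 once the
 * distance is known to exceed max_dist. Returns (size_t)-1 if malloc fails. */
static size_t bounded_levenshtein(const char *a, const char *b, size_t max_dist) {
    size_t la = strlen(a), lb = strlen(b);
    size_t gap = la > lb ? la - lb : lb - la;
    if (gap > max_dist) return max_dist + 1;   /* lengths alone rule it out */

    size_t *prev = malloc((lb + 1) * sizeof *prev);
    size_t *cur  = malloc((lb + 1) * sizeof *cur);
    if (!prev || !cur) { free(prev); free(cur); return (size_t)-1; }

    for (size_t j = 0; j <= lb; j++) prev[j] = j;
    for (size_t i = 1; i <= la; i++) {
        size_t row_min = cur[0] = i;
        for (size_t j = 1; j <= lb; j++) {
            size_t sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
            size_t del = prev[j] + 1, ins = cur[j - 1] + 1;
            size_t best = sub < del ? sub : del;
            cur[j] = best < ins ? best : ins;
            if (cur[j] < row_min) row_min = cur[j];
        }
        if (row_min > max_dist) {              /* later rows cannot go lower: exit */
            free(prev); free(cur);
            return max_dist + 1;
        }
        size_t *tmp = prev; prev = cur; cur = tmp;
    }
    size_t d = prev[lb];
    free(prev); free(cur);
    return d > max_dist ? max_dist + 1 : d;
}
```

For the 10% figure mentioned above, a caller might pass something like `strlen(source) / 10` as `max_dist` and treat any result above it as "not similar".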

The app should communicate with a folder of .gz files that contain the text and it could use a text boundary to distinguish each sentence.

The output would need to be a new text file that sorts every sentence into groups of similarity - separated by a text boundary.
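
One way to turn pairwise "similar" verdicts into the grouped output file is a union-find over sentence indices, sketched below. The brief does not say how groups should be formed, so treating similarity as transitive is an assumption, as are the function names and the boundary string passed in by the caller.

```c
#include <stdio.h>
#include <stddef.h>

/* Follow parent links to the group representative, halving paths on the way. */
static size_t uf_find(size_t *parent, size_t i) {
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

/* Merge the groups containing sentences a and b. */
static void uf_union(size_t *parent, size_t a, size_t b) {
    size_t ra = uf_find(parent, a), rb = uf_find(parent, b);
    if (ra != rb) parent[rb] = ra;
}

/* Write each group's sentences followed by a boundary line; returns the
 * number of groups. This scan is O(n * groups); sorting indices by
 * representative would be faster on 100,000 sentences. */
static size_t write_groups(FILE *out, const char *const *sentences,
                           size_t *parent, size_t n, const char *boundary) {
    size_t groups = 0;
    for (size_t root = 0; root < n; root++) {
        if (uf_find(parent, root) != root) continue;   /* not a representative */
        for (size_t i = 0; i < n; i++)
            if (uf_find(parent, i) == root) fprintf(out, "%s\n", sentences[i]);
        fprintf(out, "%s\n", boundary);
        groups++;
    }
    return groups;
}
```

Initialise `parent[i] = i` for every sentence, call `uf_union(parent, i, j)` for each pair that survives both filters, then call `write_groups` once on the output file.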

I need something very soon. A mediocre algorithm is fine.

To be awarded: explain in 1-2 sentences your proposed approach, and bid a base amount plus a bonus on completion. Come in cheap, and get the big reward after you have delivered.

Skills: C Programming, C# Programming, C++ Programming, Linux, Python

About the Employer:
(14 reviews) Ultimo, Australia

Project ID: #17629535

17 freelancers are bidding an average of $462 for this job

hbxfnzwpf

I am very proficient in c and c++. I have 16 years c++ developing experience now, and have worked for more than 7 years. My work is online game developing, and mainly focus on server side, using c++ under Linux environ…

$400 USD in 5 days
(134 reviews)
7.0
quantumcube

Hello, since the number of records is not so big I'd load the entire DB in RAM for faster [login to view URL] need to run it on a PC, right?

$777 USD in 10 days
(16 reviews)
6.2
ppgjsc

Hi. I'm very interested in your project. I've been already developing several NLP projects using C#, C++, Java, VB and open-source APIs before I bid. Also, if your project has some algorithmic issue, then it would b…

$444 USD in 10 days
(23 reviews)
6.2
tudiptechnology

Hi there, We have proven track record of delivering C#/.Net web applications with AngularJS front end. Also, we have been working extensively in MVC and prowess in Azure. We have worked on MVC 1, MVC 3 and MVC 5.…

$1000 USD in 21 days
(6 reviews)
5.9
dinhfreedom

Dear sir. Your project attracted my attention at first glance, because I've extensive experience in C Programming. I'm really confident about your project, and very eager to join your project. If we have a chance to…

$400 USD in 10 days
(46 reviews)
6.0
ITPyramid85

hello, how are you. i read your bid carefully. i am c/c++, linux expert and have full experience for 10 years. so i can handle your project by c/c++ and parse the text by unicode method. i can provide most quality an…

$444 USD in 10 days
(4 reviews)
5.5
polarjin2017

[login to view URL] I saw your project description carefully and i'm very interesting your project. But i have some question about your project. If u have enough time to discuss about your project with me, please contact me. An…

$444 USD in 10 days
(24 reviews)
5.2
erShashi

Hi, I must say very interesting and challenging project. I have done some work on the similar project and did research on how Twitter search works on large volume. I would suggest lucene search library to create ind…

$588 USD in 25 days
(32 reviews)
5.2
freelancerSolvit

.................................................................................................................................................................................................................

$444 USD in 10 days
(32 reviews)
4.8
kalyanprakash4

please discuss

$388 USD in 6 days
(24 reviews)
4.6
limillion819

Hello sir. I am very interested in your proposal. I can instantly help you with your starting project with a successful completion. C is a very friendly language for me and in just some days, you'll get a wonderful…

$500 USD in 10 days
(11 reviews)
4.2
naishodayo

hello, sir. I'm a professional programmer with 9 years of experience. I've already done this kind of project before. If you award me, I'll implement all of your requirements in a short time. C code to index large text…

$444 USD in 3 days
(2 reviews)
3.7
magadhmindslx

Dear Sir, I have gone through project description and interested taking it up. Posted bid amount is indicative and a more accurate I can give once more details are shared. Looking forward to hear from you. Thanks

$200 USD in 10 days
(17 reviews)
3.4
teamspirit3

Dear hiring manager, I am senior Web Scraping expert with 13 years rich experience in the past. I have strong skills and so many experience in web scraping (10M Amazon products Images scraping, Cryptocurrency marke…

$444 USD in 2 days
(3 reviews)
1.6
TobiObadiah

Hi there, Interesting project you have there. Here is my approach. I have data structure library in C which is in development but will meet this project needs as some of the data structures have been implemented.…

$300 USD in 4 days
(0 reviews)
0.0
mdolgun

Hello, I am expert on C/C++/Python/Data Structures/Algorithms. For word indexing, i propose using trie structure (character tree). Leaf nodes would carry the index value. We could also use a hash table for indexing, bu…

$200 USD in 7 days
(0 reviews)
0.0
mbenkendorf

Dear Employer, Due to my own interest in such natural language processing problems, I already developed your described approach into a first unoptimized prototype to see how fast it can process and group 100k sentence…

$444 USD in 3 days
(0 reviews)
0.0