Summary: I need a system that produces text for a given keyword. The text needs to be unique, targeted to the keyword, and halfway readable.
The good news: I have a concept that will probably work.
The bad news: It's not easy to develop and requires handling large amounts of data.
Goal: A text production machine where I can enter a keyword and it generates a lot of texts for it. Those texts must pass Copyscape and look halfway legit at first glance. I know that real "sense" is not possible and I don't expect it.
Input: $keyword (e.g. "credit card")
Output: $text (string of ~400 words of unique text about "credit card")
Please read the project description at least twice before you bid. THIS IS A HARD TASK. If you have any questions, please let me know. The project has high priority for me and I'm available 24/7 for the developer.
Step 0 [preparations]: We collect large amounts of human-written content (German texts). I have a list of 1.7 million .de domains; let's crawl them (including subpages) and extract all text to a database/semantic cloud.
If you choose a database, I'd suggest MongoDB, as it's much faster than MySQL with that amount of data. Our main business is hosting, so we can provide you with custom server technology (like a server with 24 GB of RAM to load parts of the cloud into memory, or SSD RAIDs to speed up access). I also have a good web crawler available, but it's written in Python.
Step 1 [generation process]: The user inputs a keyword. Generate a random number between 1 and 30. Let's assume it is 15.
Step 2: Run a Google query for the keyword we want to generate text for. Parse result number 15, remove tags and navigation (a script for this already exists), and extract the remaining content.
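To make Steps 1 and 2 more concrete, here is a minimal Python sketch. It only covers picking the result rank and stripping markup from an HTML page that has already been downloaded; the Google query itself is left out. The names `pick_result_rank`, `extract_text` and `SKIP_TAGS` are hypothetical, and the tag stripper is just a stand-in for the existing script mentioned above, not a description of it.

```python
import random
from html.parser import HTMLParser

# Assumption: content inside these tags counts as navigation or code, not text.
SKIP_TAGS = {"script", "style", "nav", "header", "footer"}


class TextExtractor(HTMLParser):
    """Collects visible text and ignores everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only while we are outside all skipped tags.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def pick_result_rank(rng=random, lowest=1, highest=30):
    """Step 1: random search result position, both bounds inclusive."""
    return rng.randint(lowest, highest)


def extract_text(html):
    """Step 2: remove tags and navigation, return the remaining text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For example, `extract_text("<nav>Home</nav><p>Credit cards are useful</p>")` keeps the paragraph text and drops the navigation entry.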
Step 3: We now have a snippet of relevant text for our keyword. But of course, it's only a copy - this is where the rewriting begins.
Step 4 [rewriting]: Split the snippet into single sentences. A new sentence begins after any of these characters: . , : ! ? ;
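The splitting rule from Step 4 could look roughly like this in Python. The function name `split_sentences` is hypothetical, and requiring whitespace after the delimiter is my own assumption so that numbers like "3.5" are not cut in half.

```python
import re

# Split at whitespace that directly follows one of the delimiters . , : ! ? ;
# The delimiter itself stays attached to the sentence before it.
SENTENCE_END = re.compile(r"(?<=[.,:!?;])\s+")


def split_sentences(text):
    """Return the non-empty pieces of text, cut after . , : ! ? ;"""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]
```

Note that the comma is treated as a sentence boundary here because the step lists it, so the resulting pieces are often clauses rather than full sentences.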
Step 5: Here it gets tricky. We need to find out which words form a block. Take the sentence "Mark studies law at Harvard University.": the system needs to detect that "Harvard University" is a block, so "University" must not be replaced with "school" on its own; however, "Harvard University" as a whole may be replaced with "Stanford". The words in a block need to be replaced together. How do we find out which words belong together? We check how "near" they are in our word cloud, i.e. how often they stand next to each other: "Mark | studies | law | at | Harvard University."
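One possible way to measure how often two words "stand next to each other" is sketched below in Python. It scores each pair of neighbouring words with the Dice coefficient over word and word-pair counts and merges neighbours whose score reaches a threshold. The Dice score, the threshold of 0.8 and all function names are my assumptions, not part of the concept above; the real counts would come from the crawled content, not from an in-memory list.

```python
from collections import Counter


def count_ngrams(sentences):
    """Count single words and adjacent word pairs in a list of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def cohesion(left, right, unigrams, bigrams):
    """Dice coefficient: 1.0 means the two words only ever occur together."""
    pair = bigrams[(left, right)]
    if pair == 0:
        return 0.0
    return 2 * pair / (unigrams[left] + unigrams[right])


def find_blocks(sentence, unigrams, bigrams, threshold=0.8):
    """Merge neighbouring words whose cohesion reaches the threshold."""
    tokens = sentence.lower().rstrip(".!?").split()
    blocks = []
    for token in tokens:
        if blocks and cohesion(blocks[-1][-1], token, unigrams, bigrams) >= threshold:
            blocks[-1].append(token)  # extend the current block
        else:
            blocks.append([token])  # start a new block
    return [" ".join(block) for block in blocks]
```

With a corpus in which "harvard" is always followed by "university", the two words score 1.0 and end up in one block, while looser pairs such as "law at" stay separate. The threshold would need tuning against real data.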
Step 5 (continued): Okay, we now have the blocks. The next step is to replace as many blocks as possible in order to make the text unique. To do that, we query our natural content cloud with something like this: "$left" * "$right".
In this practical example, the query is "Mark" * "law". As you can see, we took three consecutive blocks and replaced the middle one with a placeholder. Our natural content database should now return legit blocks for *, as they were used in natural language, for example: "Mark teaches law", "Mark demands law", "Mark is still searching for law" etc.
Not all of them will make perfect sense, but it's a start and far better than working with synonyms, because you can also replace single words with blocks of multiple words and vice versa.
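The "$left" * "$right" query can be sketched in Python against a small in-memory list of sentences. This is only an illustration under my own assumptions: `find_fillers` and `max_gap` are hypothetical names, both neighbours must be non-empty (so the start and end of a sentence are not handled here), and a real system would run this as a query against the database rather than scanning every sentence.

```python
from collections import Counter


def find_fillers(corpus_sentences, left, right, max_gap=4):
    """Count every word sequence found between left and right in the corpus."""
    left_tokens = left.lower().split()
    right_tokens = right.lower().split()
    if not left_tokens or not right_tokens:
        raise ValueError("left and right must each contain at least one word")

    fillers = Counter()
    for sentence in corpus_sentences:
        tokens = sentence.lower().split()
        for start in range(len(tokens) - len(left_tokens) + 1):
            if tokens[start:start + len(left_tokens)] != left_tokens:
                continue
            gap_start = start + len(left_tokens)
            # Try every filler length from one word up to max_gap words.
            for gap in range(1, max_gap + 1):
                gap_end = gap_start + gap
                if tokens[gap_end:gap_end + len(right_tokens)] == right_tokens:
                    fillers[" ".join(tokens[gap_start:gap_end])] += 1
    return fillers
```

Calling `find_fillers(corpus, "Mark", "law")` returns the candidate blocks together with how often each was seen, so the generator could prefer frequent fillers or pick randomly among them for more variety.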
We should use this replacement system multiple times in each sentence. It also works for the beginning and the end of sentences. From my manual tests with Google, this works pretty well and might work even better with our own data pool. The system works identically for all languages, so you don't need to speak German.
Step 6: Output rewritten text.
You can use any programming language you like.
I can't write more description text here due to [url removed, login to view], so happy bidding & discussion!