Help with Tesseract font training

Closed. Posted 6 months ago. Paid on delivery.

Short summary of what I did:

I generated an artificial, randomized data set that closely resembles the actual data. I made sure that every character appears, and that special character combinations and commonly confused letters appear more often.
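The text side of such a generator can be sketched roughly as follows. This is only an illustration, not the script used in the project: the function names, the boost factor, and the example alphabet are assumptions, and rendering the lines to images and box files is left out.

```python
import random


def build_weights(alphabet, boosted, boost=3.0):
    """Weight 1.0 for ordinary characters, `boost` for easily confused ones."""
    return [boost if ch in boosted else 1.0 for ch in alphabet]


def generate_lines(alphabet, boosted, n_lines, line_len=40, seed=0):
    """Draw random text lines with boosted characters sampled more often.

    If some character of the alphabet never came up, one extra line holding
    the missing characters is appended so that coverage is complete.
    """
    rng = random.Random(seed)
    chars = list(alphabet)
    weights = build_weights(chars, boosted)
    lines = [
        "".join(rng.choices(chars, weights=weights, k=line_len))
        for _ in range(n_lines)
    ]
    missing = set(chars) - set("".join(lines))
    if missing:
        lines.append("".join(sorted(missing)))
    return lines


# Hypothetical alphabet: look-alikes such as 0/O and 1/l/I are boosted.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789äöüÄÖÜß"
LOOKALIKES = set("0O1lI5S8B")
sample_lines = generate_lines(ALPHABET, LOOKALIKES, n_lines=1000)
```

Because the text is generated first and rendered afterwards, the ground truth for every line is known exactly, which is the point of the synthetic route.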

I chose synthetic generation so that the box data is guaranteed to be 100% correct.

The training was carried out using the Python-based tool TessTrainGUI, which iteratively determines (probably with an SVM) the optimal combination of feature parameters.

It divides the ground-truth (GT) data (>= 1000 rows) into a training set (T1 = 90% of GT) and a validation set (V1 = 10% of GT).
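For reference, a 90/10 split of this kind can be reproduced outside the tool with a few lines of Python. This is a minimal sketch under the assumption that the GT is a list of line identifiers; the function name and the fixed seed are illustrative and not taken from TessTrainGUI.

```python
import random


def split_ground_truth(items, val_fraction=0.10, seed=42):
    """Shuffle the GT items and return (training set T1, validation set V1)."""
    items = list(items)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(items)
    n_val = max(1, round(len(items) * val_fraction))
    return items[n_val:], items[:n_val]


# Hypothetical GT line identifiers.
gt_lines = [f"line_{i:04d}" for i in range(1000)]
t1, v1 = split_ground_truth(gt_lines)
```

Keeping the seed fixed makes it possible to compare a from-scratch run and a fine-tuning run on exactly the same T1 and V1.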

There are two possible ways to train: a completely new training from scratch, or a training based on an existing model (best/[login to view URL]).

With both options, training converges very quickly, reaching < 1% BCER after 1000 cycles.

However, when I test the trained models on real (i.e. not artificially generated) scans (V2), there are slight weaknesses, and the results of the two options differ.
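To put a number on the behaviour against V2, one option is a plain edit-distance character error rate over (ground truth, OCR output) pairs. The sketch below is an assumption about how such a check could look; it is not necessarily the same metric as the BCER figure reported during training, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(pairs):
    """Total edit distance divided by total ground-truth length.

    `pairs` is an iterable of (ground_truth, ocr_output) strings.
    """
    pairs = list(pairs)
    total_chars = sum(len(gt) for gt, _ in pairs)
    if total_chars == 0:
        return 0.0
    errors = sum(levenshtein(gt, hyp) for gt, hyp in pairs)
    return errors / total_chars
```

Running this separately for the from-scratch model and the model based on the German one gives two directly comparable figures on the real scans.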

The newly trained font has high error rates, while the training based on the German model is sometimes very good, but not perfect. It also turns out that training for a very high number of cycles produces poor results, which indicates overtraining.
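One way to make the overtraining visible is to score every saved checkpoint against the real scans and keep the one with the lowest error. The sketch below assumes the per-checkpoint error rates on V2 have already been measured; the numbers in the example are made up purely for illustration.

```python
def best_checkpoint(scores):
    """Return the iteration count whose error rate on V2 is lowest.

    `scores` maps iteration count -> error rate measured on the real scans.
    Ties go to the earlier checkpoint, which has had less chance to overfit.
    """
    if not scores:
        raise ValueError("no checkpoints were scored")
    return min(sorted(scores), key=lambda iteration: scores[iteration])


# Hypothetical error rates on V2, not measured values.
v2_scores = {200: 0.081, 400: 0.043, 600: 0.037, 800: 0.037, 1000: 0.052}
chosen = best_checkpoint(v2_scores)
```

If V2 is used to pick the checkpoint, a further held-out set of real scans would be needed for an unbiased final error estimate.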

The task that still needs to be solved:

1) Find the number of iterations ("checkpoint") that is ideal in terms of validation against the second test set V2

2) Alternatively, find a better training data set

3) There are also small discrepancies regarding special characters.

The German model "deu", on which the training is based, contains umlauts such as "ä", "ö", "ü", etc. The documents to be recognized, however, also contain characters such as Ø and characters from the Slavic area such as Ž (due to an incorrect language setting when they were printed at the time). In validation, these characters are recognized when I train from scratch. However, they are NOT recognized if the training is based on "deu" (probably because they are not included in the data that deu was built from).

It's not that important, but it would be interesting to know how to solve it.
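A first diagnostic step could be to list which characters in the ground truth are absent from the base model's character set. The sketch below assumes that character set is available as a plain string; how it is obtained from "deu" is left open, and the function name is illustrative. Text is normalized to NFC so that a decomposed Ž (Z plus combining caron) is counted as the same character as the precomposed one.

```python
import unicodedata


def _nfc(text):
    """Normalize to NFC so composed and decomposed forms compare equal."""
    return unicodedata.normalize("NFC", text)


def missing_characters(texts, base_charset):
    """Return the sorted characters used in `texts` but absent from `base_charset`."""
    known = set(_nfc(base_charset))
    seen = set()
    for text in texts:
        seen.update(_nfc(text))
    return sorted(ch for ch in seen - known if not ch.isspace())


# Hypothetical base character set and ground-truth lines.
base = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZäöüÄÖÜß"
gaps = missing_characters(["Größe Ø 12", "Z\u030cena"], base + "0123456789")
```

This only identifies the gap. Whether the missing characters can be added while still starting from "deu" depends on the training tool; the Tesseract training documentation describes a fine-tuning mode for adding a few characters, which may be the relevant route.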

Python, Machine Learning (ML), OCR, Deep Learning

Project ID: #37429412

About the project

5 proposals. Remote project. Open 5 months ago.

5 freelancers are bidding an average of ₹5687 for this job

talentedDev312

I want to assure you that I have prior experience working on similar projects, which gives me a clear understanding of what needs to be accomplished. I am confident in my ability to complete the project efficiently a...

₹3500 INR in 7 days
(0 Reviews)
0.0
alinorzadd

I'm ALI, an experienced Python developer with over 8 years of industry experience. From training fonts using TessTrainGUI to optimizing training cycles to ensuring high quality results, I'm the perfect person for this...

₹3000 INR in 7 days
(0 Reviews)
2.7
ndambukiboniface

Greetings Dear Client, Welcome to my profile, Home to Professional and Quality services with 100% customer satisfaction guarantee. I'm a Certified & Experienced Expert in the respective project requirements. Dear Clie...

₹4000 INR in 1 day
(0 Reviews)
0.0
jackiliu1239

Hi, Greetings! Primarily I have in-depth knowledge of machine learning algorithms, especially since I'm familiar with deep learning algorithms such as ANN, CNN, RNN, transformer neural network, Autoencoder and DCGAN. A...

₹13500 INR in 1 day
(0 Reviews)
0.0
WorkerDE

Hi, my name is Julien. I'm a professional Python developer with a Master's degree in Computer Science, specializing in AI and web scraping. I offer tailored solutions for your data-driven challenges. My expertise exten...

₹4436 INR in 1 day
(0 Reviews)
0.0