Manual data extraction from scanned images and PDFs (e.g., invoices, bank statements, diagnostic reports) is cumbersome and error-prone.
Teams spend hours on this manual extraction work. The process is expensive and still not very accurate.
Robotic Process Automation (RPA) did not yield the expected results.
Traditional OCR fails when the complexity of the input changes. When the input file format changes, the plain OCR approach also requires code changes. Moreover, traditional OCR does not support data extraction from engineering diagrams, symbols, logos, complex tables, images, etc.
Extracting text from handwritten documents (scanned images, PDFs) is still a major problem.
Document classification functionality is missing, which results in a lot of rework.
Rule-based and template-based extraction mechanisms leave little or no room to improve data extraction accuracy.
The client is a fintech company with a loan origination platform. Loan applicants upload their bank statements to apply for a loan. The client already had an OCR solution to extract data from bank statements. However, it could not deliver the desired accuracy on complex bank statements, and every change in bank statement format required coding effort. The client was looking for a solution that could accommodate new bank statement formats with minimal or no coding effort.
Python, CTPN, OpenCV, Deep Learning, Tesseract, Node.js, React.js, MongoDB
Traditional OCR-based data extraction works on coordinates. If the structure of the input changes, the OCR solution fails. Also, if the image is very noisy, the OCR-based solution gives very poor extraction accuracy.
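To make the coordinate problem concrete, here is a minimal sketch in plain Python. The word positions, the `TEMPLATE` box and both helper functions are hypothetical and invented for illustration; they are not taken from the client's system. The label-anchored lookup is only a simple contrast to fixed coordinates, not the deep learning approach described in this case study.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x: int  # left edge of the word on the page (made-up units)
    y: int  # top edge of the word on the page


# Hypothetical template: the field is expected inside a fixed box (x0, y0, x1, y1).
TEMPLATE = {"closing_balance": (140, 35, 260, 50)}


def extract_by_template(words: List[Word], field: str) -> str:
    """Return the words whose position falls inside the field's fixed box."""
    x0, y0, x1, y1 = TEMPLATE[field]
    hits = [w for w in words if x0 <= w.x <= x1 and y0 <= w.y <= y1]
    return " ".join(w.text for w in sorted(hits, key=lambda w: w.x))


def extract_by_label(words: List[Word], label: str, row_tolerance: int = 5) -> str:
    """Return the words to the right of a label on the same row, wherever it sits."""
    anchor = next((w for w in words if w.text == label), None)
    if anchor is None:
        return ""
    hits = [w for w in words
            if abs(w.y - anchor.y) <= row_tolerance and w.x > anchor.x]
    return " ".join(w.text for w in sorted(hits, key=lambda w: w.x))


# Same information, two layouts: layout B adds a header and shifts the row.
layout_a = [Word("Closing", 10, 40), Word("Balance:", 70, 40),
            Word("1,250.00", 150, 40)]
layout_b = [Word("Statement", 10, 40), Word("Closing", 10, 90),
            Word("Balance:", 70, 90), Word("1,250.00", 200, 90)]

print(repr(extract_by_template(layout_a, "closing_balance")))
print(repr(extract_by_template(layout_b, "closing_balance")))
print(repr(extract_by_label(layout_b, "Balance:")))
```

The fixed box finds the balance in the first layout and returns an empty string for the second, even though the value is still on the page. Supporting the new layout with this approach means editing `TEMPLATE`, which is the kind of per-format coding effort the client wanted to avoid.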
To address this problem, we combined deep learning with OCR. The deep-learning-based OCR solution involved image pre-processing to improve image resolution, automatic marking of regions of interest, and text extraction and recognition. OpenCV was used for image processing, CTPN (Connectionist Text Proposal Network) for automatically marking regions of interest and detecting text, and Tesseract for text extraction. The application was developed using Node.js and React.js.
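The overall flow of such a pipeline (clean the image, find the text regions, then recognize each region) can be sketched as follows. This is a toy, dependency-free illustration under stated assumptions and not the delivered implementation. A 3x3 median filter stands in for the OpenCV pre-processing, a row-projection scan stands in for CTPN text detection, and the recognizer is passed in as a function where a Tesseract call would go. All function names are hypothetical.

```python
from statistics import median_low
from typing import Callable, List, Tuple

Image = List[List[int]]  # toy binary image: 0 = background, 1 = ink


def denoise(image: Image) -> Image:
    """3x3 median filter that removes isolated specks (stand-in for pre-processing)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(h):
        for c in range(w):
            window = [image[rr][cc]
                      for rr in range(max(0, r - 1), min(h, r + 2))
                      for cc in range(max(0, c - 1), min(w, c + 2))]
            out[r][c] = median_low(window)
    return out


def detect_text_rows(image: Image) -> List[Tuple[int, int]]:
    """Group consecutive rows containing ink into (top, bottom) bands.

    A simple projection method used here in place of a learned detector.
    """
    regions, start = [], None
    for r, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = r
        elif not has_ink and start is not None:
            regions.append((start, r - 1))
            start = None
    if start is not None:
        regions.append((start, len(image) - 1))
    return regions


def extract(image: Image, recognize: Callable[[Image], str]) -> List[str]:
    """Run the stages in order: denoise, detect regions, recognize each crop."""
    clean = denoise(image)
    return [recognize(clean[top:bottom + 1])
            for top, bottom in detect_text_rows(clean)]


# Toy page: two text bands plus one noise speck just below the first band.
page = [[0] * 8 for _ in range(8)]
for r in (1, 2, 5, 6):
    page[r] = [1] * 8
page[3][4] = 1  # isolated noise pixel

print(detect_text_rows(page))           # regions found on the noisy page
print(detect_text_rows(denoise(page)))  # regions found after cleaning
print(extract(page, lambda crop: f"{len(crop)} rows"))  # placeholder recognizer
```

Because the recognizer and the detector are separate stages, either one can be replaced without touching the other. The regions are found from the image content itself and not from fixed positions, so a statement whose rows move does not require a template change.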
The business problem was challenging, especially handling the changing format of the input documents. We successfully delivered the solution with high extraction accuracy and saved much of the human effort involved in the data extraction process.