Intelligent Document Processing: Overcoming the Challenges
Challenges Related to Data Capture and Extraction
Although the world is rapidly digitalizing, document processing has not changed significantly in years. We still tend to handle documents much as we did 70 years ago. Even the PDF format was initially designed solely for publishing, not for the complicated data processing workflows we face now.
Digital data extraction poses a number of challenges. Documents may contain structured data, which is easier to extract (e.g. financial data tables), unstructured data with values distributed across the text, or a mixture of tables and paragraphs. A document may be a true PDF (ready for digital processing) or simply a scanned image. Documents of a similar type that are easily understandable to a human may have dramatically different layouts that a machine is unable to interpret.
There are even more challenges in moving from digitizing documents to understanding what they actually mean and finding the data we need in them.
It is relatively easy to automate the data extraction process when document layouts are similar. In this case, a template-based approach can help. Sometimes documents have similar layouts, but the length of the text blocks inside them varies. If so, classic robotic process automation (RPA) comes into play. But what if the layout changes completely from one document to another?
We worked with a company that has thousands of clients, the majority of which issue purchase orders that differ in layout while containing similar information. To extract the information from these purchase orders, we needed a more advanced approach that uses computer vision techniques, pattern recognition, and named entity extraction algorithms. Simpler, template-based approaches would likely have led to unsuccessful process automation.
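To make the contrast with fixed templates concrete, here is a minimal sketch of label-anchored pattern matching: each field is found by the text around it rather than by its position on the page. It is a deliberately simplified stand-in, not the approach used in the project described above, which also relied on computer vision and named entity extraction. The field names, regular expressions, and sample documents are all hypothetical.

```python
import re

# Hypothetical patterns: each one looks for a label and captures the value
# that follows it, wherever the label appears in the text.
FIELD_PATTERNS = {
    "po_number": re.compile(
        r"(?:purchase\s+order|p\.?o\.?)\s*(?:number|no\.?|#)?\s*:?\s*"
        r"([A-Z0-9-]*\d[A-Z0-9-]*)",
        re.IGNORECASE,
    ),
    "order_date": re.compile(r"date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(text):
    """Return the first match for every known field, or None if it is absent."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


# Two made-up purchase orders with the same information in different layouts.
doc_a = (
    "ACME Corp\n"
    "Purchase Order No: PO-10045\n"
    "Date: 2023-04-01\n"
    "Widget 10 125.00\n"
    "Total: $1,250.00\n"
)
doc_b = "Total $980.50 | Order date: 2023-05-17\nGlobex Ltd.\nP.O. # 77812\n"

print(extract_fields(doc_a))
print(extract_fields(doc_b))
```

Both sample documents yield the same three fields even though the order and wording of the labels differ. Patterns like these break down quickly on scanned or heavily formatted documents, which is where the more advanced techniques come in.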
What Is a PDE Engine?
The PDE engine is a library developed by DataArt that aggregates years of knowledge about data extraction from PDF documents. Its core functionality is recognizing the structure of a document, finding the tables inside it, and converting the PDF into structured JSON, rather than the plain text that OCR tools usually produce. The JSON output is ready for further processing by existing financial and insurance systems, CRMs, ERPs and so on.
The PDE engine plugs into an existing document processing pipeline, finding and extracting tables automatically and hierarchically, so that we know where column headers are located and what each row represents. It also extracts other blocks of text, recognizing headers and titles.
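As an illustration of what hierarchical, structured output can look like, the sketch below turns a table header and its rows into JSON in which every cell is paired with its column header. The source does not describe the PDE engine's actual output schema, so the structure and every name here (`build_document`, `blocks`, `columns`, `rows`) are assumptions made for the example.

```python
import json


def table_to_records(header, rows):
    """Pair each cell with its column header so every row describes itself."""
    return [dict(zip(header, row)) for row in rows]


def build_document(title, paragraphs, header, rows):
    """Assemble a hypothetical structured representation of one document."""
    blocks = [{"type": "paragraph", "text": text} for text in paragraphs]
    blocks.append(
        {
            "type": "table",
            "columns": list(header),
            "rows": table_to_records(header, rows),
        }
    )
    return {"title": title, "blocks": blocks}


# Made-up content standing in for what was recognized in a PDF.
document = build_document(
    title="Purchase Order",
    paragraphs=["Please deliver the items below."],
    header=["Item", "Qty", "Price"],
    rows=[["Widget", "10", "125.00"], ["Gadget", "2", "40.00"]],
)

print(json.dumps(document, indent=2))
```

With output in this shape, a downstream system can read `rows[0]["Qty"]` directly instead of re-parsing lines of plain text.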
Another benefit is that the PDE engine is highly customizable and can be adjusted to each particular business case and workflow. In addition, it can be deployed internally in your infrastructure, so that you don’t have to pay per document as in SaaS solutions.
Our plans also include automatic recognition of document types and consequently the key fields that people typically search for in a particular type of document, the so-called key-value pairs.
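The idea of tying key fields to a document type can be sketched as a simple lookup. The example below guesses the type from keywords and returns the fields one would then search for. It illustrates the concept only and says nothing about how the planned feature will be implemented. The document types, keywords, and field names are hypothetical.

```python
# Hypothetical mapping from document type to the key fields people look for.
KEY_FIELDS_BY_TYPE = {
    "purchase_order": ["po_number", "order_date", "total"],
    "invoice": ["invoice_number", "due_date", "amount_due"],
}

# Hypothetical keywords that hint at each document type.
TYPE_KEYWORDS = {
    "purchase_order": ["purchase order", "p.o."],
    "invoice": ["invoice"],
}


def detect_document_type(text):
    """Pick the type whose keywords occur most often; None if nothing matches."""
    lowered = text.lower()
    scores = {
        doc_type: sum(lowered.count(keyword) for keyword in keywords)
        for doc_type, keywords in TYPE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def expected_key_fields(text):
    """Return the detected type and the key fields associated with it."""
    doc_type = detect_document_type(text)
    return doc_type, KEY_FIELDS_BY_TYPE.get(doc_type, [])


print(expected_key_fields("Purchase Order No: PO-10045"))
print(expected_key_fields("Invoice #12, payable within 30 days"))
```

A real classifier would rely on far more than keyword counts, but the lookup step stays the same: once the type is known, the list of key-value pairs to extract follows from it.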
The PDE engine provides some functionality that is not supported by Textract, Tabula or PyPDF.
DataArt helps implement intelligent document processing (IDP) solutions by combining open source, MLaaS and SaaS solutions with our own IP and solution accelerators. We offer our customers IDP consulting services, pilot implementation, and integration with RPA, business process management and internal systems.
If you are facing data extraction challenges, please contact us, so we can find a way to address your issues using a PDE engine or a combination of the PDE engine with other machine learning, RPA, SaaS, or MLaaS tools available on the market.