Invoice Processing

Extracting custom structured information from images or PDF documents

From challenges to solutions
Challenge

Many invoices have to be processed each day, and the important information is extracted manually. This process is slow and involves a lot of repetitive work.

Idea

Automate the extraction of custom entities such as recipient name, amount or date from documents to save time and gain more information from the invoices.

Solution

A scalable on-premise service that uses OCR (optical character recognition), object detection and NER (named entity recognition) to extract the defined information from the invoices.

Challenge

More and more unstructured or only partly structured text data is produced every day. Receipts, forms, descriptions, contracts, order requests and technical documents are only a few examples. The desired information often has to be extracted manually, which takes a lot of time and staff.
The value of the information hidden in archived documents is often underestimated. As a consequence, a lot of valuable information never enters the business workflow.

Automate entity extraction from various document types to enhance your business workflow.

Idea

A machine learning model is trained to extract custom named entities from unstructured text data. Entities can be, for example, names, dates, numbers, descriptions or prices.
We use AI to extract the information you need from unstructured text.

We automatically extract your custom-defined entities, tailored to your domain, to shorten your customers' waiting times.

Solution

3. Named Entity Recognition (NER)

An AI model is trained to extract custom-defined entities. This requires a labelled dataset: the text is first extracted from the training documents via OCR and then labelled in a tool that Catalysts developed specifically for labelling texts and training NER models. The model is trained on the final dataset and then used for predictions on new documents.
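To illustrate what a labelled training example might look like, the following sketch converts character-level entity annotations into token-level BIO tags, a common input format for NER training. It is an assumption for illustration only: the format used by the Catalysts labelling tool is not described here, and the function name `to_bio_tags`, the label names and the sample sentence are all hypothetical.

```python
import re


def to_bio_tags(text, spans):
    """Convert character-level entity spans into token-level BIO tags.

    spans: list of (start, end, label) tuples with an exclusive end,
    assumed not to overlap. Tokens that are only partly covered by a
    span are tagged "O". Returns a list of (token, tag) pairs.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):  # simple whitespace tokenization
        tag = "O"
        for start, end, label in spans:
            if match.start() >= start and match.end() <= end:
                # First token of a span gets "B-", later tokens get "I-".
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged


# Invented sample sentence with hypothetical labels.
text = "Invoice to Jane Doe dated 2019-03-01 total 120.50 EUR"
spans = [(11, 19, "RECIPIENT"), (26, 36, "DATE"), (43, 49, "AMOUNT")]

for token, tag in to_bio_tags(text, spans):
    print(f"{token}\t{tag}")
```

Whitespace tokenization keeps the sketch short; a real pipeline would more likely reuse the tokenizer of the model being trained so that tags and tokens stay aligned.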

2. Optical Character Recognition (OCR)

The OCR step extracts the text from an image. For each word, a score is calculated that represents the probability that the word was extracted correctly. In addition, handwriting detection is applied to find documents with additional notes on them, and handwritten numbers in predefined fields are detected and recognized.
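One possible use of the per-word score is to send uncertain words to manual review instead of trusting them blindly. The sketch below only illustrates that idea: the `OcrWord` type, the `split_by_confidence` helper, the 0.8 threshold and the sample words are assumptions, not part of the service described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    """One recognized word with its OCR score (hypothetical structure)."""
    text: str
    confidence: float  # probability in [0, 1] that the word was read correctly


def split_by_confidence(words, threshold=0.8):
    """Split words into (accepted, needs_review) using an assumed threshold."""
    accepted = [w for w in words if w.confidence >= threshold]
    needs_review = [w for w in words if w.confidence < threshold]
    return accepted, needs_review


# Invented sample: the third word contains a letter O misread for a zero.
words = [
    OcrWord("Invoice", 0.99),
    OcrWord("No.", 0.95),
    OcrWord("2O19-0042", 0.41),
    OcrWord("Total", 0.97),
]
accepted, needs_review = split_by_confidence(words)
print([w.text for w in needs_review])
```

The threshold trades manual effort against error rate, so in practice it would be tuned on documents with known content.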

1. Preprocessing

Preprocessing is adjusted to the type of document being processed: photographs of documents need different preprocessing than scanned documents. Typical tasks include rotating and deskewing the image, improving the contrast and removing noise.
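Two of these tasks, contrast improvement and noise removal, can be sketched in a few lines. The example below is a minimal pure-Python illustration on a grayscale image stored as nested lists. It assumes min-max contrast stretching and a 3x3 median filter as the techniques, which is not necessarily what the service uses, and it does not cover rotation or deskewing. A production system would normally rely on an image processing library instead.

```python
from statistics import median_low


def stretch_contrast(image):
    """Linearly rescale pixel values so they span the full 0..255 range."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # uniform image: nothing to stretch
        return [row[:] for row in image]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in image]


def median_filter(image):
    """Replace each pixel by the (low) median of its 3x3 neighbourhood.

    At the image border the neighbourhood is clipped to the image bounds.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(median_low(window))
        result.append(row)
    return result


# Invented 3x3 sample: a single bright speck on a uniform background.
noisy = [
    [100, 100, 100],
    [100, 255, 100],
    [100, 100, 100],
]
print(median_filter(noisy))
```

The median filter removes isolated specks without blurring edges as much as averaging would, which helps keep thin character strokes intact for the OCR step.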

0. Paper

Documents are scanned, so they no longer have to be sorted manually and typed into a system by hand.

Vision
  • Do you have to process documents and extract certain information?
  • Do you receive many free-text orders every day?
  • Do you have a huge number of text documents that no one ever reads because there are too many?

Get in touch with us about your individual use case!

Your contact person

Katrin Strasser

Team Lead Natural Language Processing

katrin.strasser@catalysts.cc
