PDF Parsing API
Summary
The PDF Parsing API extracts data from bank statements in PDF format. PDF files have no global structure, which makes extraction especially hard, but our proprietary algorithms recognise transactions, KYC and other related data. The algorithms also aim to verify the credibility and integrity of the information contained in the statements.
As with our Banking API, the data from PDF statements can be complemented by our labels and scoring solutions.
How to integrate
Integrating the PDF Parsing API takes two steps. First, upload a bank statement in PDF format, encoded as Base64 (plain text), to the Kontomatik endpoint. Then call the data endpoint to get the results back in XML format.
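The encoding part of the upload step can be sketched in Python as follows. This is only a minimal sketch: the helper name encode_statement is made up for illustration, and the endpoint URL and request parameter names are deliberately left out, since they should be taken from the API reference.

```python
import base64


def encode_statement(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as Base64 plain text, ready to be sent in a request.

    Hypothetical helper for illustration; not part of any Kontomatik SDK.
    """
    return base64.b64encode(pdf_bytes).decode("ascii")


if __name__ == "__main__":
    # Stand-in bytes; in practice read the file with open(path, "rb").read()
    sample = b"%PDF-1.7\n...statement content...\n%%EOF\n"
    print(encode_statement(sample)[:16])
```

The file should be read in binary mode and passed on unchanged, so that the bytes sent are exactly the bytes downloaded from the bank.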
If you choose our SaaS solution, you can alternatively use our front-end widget to make these steps even more seamless.
Supported banks and features
The PDF Parsing API is currently available only in Poland and we support all major banks there.
Nevertheless, we can only extract as much information as is available in a given statement, and this differs from bank to bank. We’ve prepared a table showing what you can get from each bank here.
Some of the features include:
- Extracting KYC data, accounts and transactions
- Verifying possible document modification (fraud detection)
- Returning transaction labels (optional)
- Getting owner score (optional)
Supported types of documents:
- Monthly statement (wyciąg)
- Account history (lista transakcji, historia rachunku itp.)
The following are not supported:
- statements from company or corporate accounts (other than sole traders, i.e. “jednoosobowa działalność gospodarcza”),
- statements in English or other languages (the layout needs to be in Polish),
- encrypted statements,
- scans and any other documents that were not downloaded directly from the bank’s online banking website.
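Some of these cases can be caught on your side before uploading. The sketch below is a rough heuristic, not the parser's own logic: the function name precheck_statement is hypothetical, and the byte search for an /Encrypt entry can give false positives. It cannot detect scans, foreign-language layouts or corporate accounts.

```python
def precheck_statement(pdf_bytes: bytes) -> list[str]:
    """Return reasons the file is likely to be rejected (empty list if none found)."""
    problems = []
    # A PDF file starts with a "%PDF-" header; tolerate a little leading junk.
    if b"%PDF-" not in pdf_bytes[:1024]:
        problems.append("file does not look like a PDF")
    # Encrypted PDFs reference an /Encrypt dictionary from the trailer.
    # Searching the raw bytes is an approximation, not a real PDF parse.
    if b"/Encrypt" in pdf_bytes:
        problems.append("PDF appears to be encrypted")
    return problems
```

An empty result only means that no obvious problem was found, not that the statement is supported.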
Anti-tampering verification
Verification aims to hinder malicious attempts at tampering with the PDF statements. Our algorithms verify:
- Bank’s digital signature
- Consistency of the account balance with transactions
- PDF metadata characteristics
- Fonts, colours and sizes
- Bank’s logotype
- Consistency of the header period with transaction dates
- Keywords
- Document structure
Checking these properties is not guaranteed to catch every fraud attempt, but based on our analysis these are the properties that tampering most commonly affects.
A PDF document can also be edited by accident. In such cases, it’s best to ask the end-user to download the document again and upload it to the parser without opening it first.
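To illustrate two of the checks listed above (balance consistency and period consistency), here is a minimal sketch. It is not the proprietary algorithm: the Transaction type, the field names and both function names are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class Transaction:
    booked_on: date
    amount: Decimal  # negative for debits, positive for credits


def balance_is_consistent(opening: Decimal, closing: Decimal,
                          transactions: list[Transaction]) -> bool:
    """True if the opening balance plus all transaction amounts equals the closing balance."""
    total = sum((t.amount for t in transactions), Decimal("0"))
    return opening + total == closing


def dates_within_period(start: date, end: date,
                        transactions: list[Transaction]) -> bool:
    """True if every transaction date falls inside the period stated in the header."""
    return all(start <= t.booked_on <= end for t in transactions)
```

Decimal is used instead of float so that amounts such as 120.50 are compared exactly. A statement in which one amount has been altered, but not the closing balance, would fail the first check.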
PDF Parsing Widget
Our PDF Parsing Widget is a front-end widget that makes the process of parsing the bank statement even more seamless. It is available as part of our SaaS solution.
The end-user chooses the bank and uploads a bank statement as a PDF file. You can then get the extracted information using our data endpoint.
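Once the data endpoint returns its XML reply, reading it could look roughly like the sketch below. The reply shown is a made-up sample: the element names (reply, owner, name, transaction, date, amount) are assumptions for illustration only, and the real structure should be taken from the API reference.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Made-up sample reply; the real element names come from the API reference.
SAMPLE_REPLY = """
<reply>
  <owner><name>Jan Kowalski</name></owner>
  <transactions>
    <transaction><date>2024-01-05</date><amount>-120.50</amount></transaction>
    <transaction><date>2024-01-20</date><amount>3000.00</amount></transaction>
  </transactions>
</reply>
"""


def parse_reply(xml_text: str) -> dict:
    """Pull the owner name and the transactions out of the hypothetical reply."""
    root = ET.fromstring(xml_text.strip())
    return {
        "owner": root.findtext("owner/name"),
        "transactions": [
            (tx.findtext("date"), Decimal(tx.findtext("amount")))
            for tx in root.iter("transaction")
        ],
    }
```

Calling parse_reply(SAMPLE_REPLY) returns a dictionary with the owner name and a list of (date, amount) pairs, which can then be passed on to your own processing.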
On-premise deployment
Our PDF Parsing solution is mainly used as SaaS, but we also offer on-premises deployments. This option is designed for clients who don’t want the data to pass through any servers other than their own.
This option does, however, have some disadvantages:
- you need to maintain the infrastructure on your own,
- it’s not updated as frequently as the SaaS solution and you’re in charge of the installation,
- if a bug arises, debugging is much harder, so it might take our developers longer to fix the problem.
To find out more about this solution, contact our Sales team.
FAQ
What is PDF parsing?
PDF parsing is a general term for a class of methods that extract human-readable plain text from PDF documents.
Our proprietary PDF Parsing solution is designed specifically for handling bank documents. It automatically recognises transactions, KYC and other related data in a bank statement and extracts them for further processing.
Moreover, our algorithm always tries to verify that the statement hasn’t been edited, that the data is consistent, that the digital signature is correct and that other expected features are in place, so that you can protect yourself from accepting statements that have been tampered with.
Should I use PDF Parsing or the Banking API?
It depends on what is important to you and your end-users. In both cases, you will get the data in the same format. The main difference is that PDF parsing doesn’t require the user to log in to their online banking via our widget. As a result, the end-user has more flexibility in how they obtain the bank statement.
On the other hand, the Banking API combined with our widget makes it easier for the end-user to provide the data.
Finally, the PDF Parsing solution can run entirely on your own servers (on-premise), in contrast to the Banking API, which is served only via the cloud (SaaS)*.
*The on-premise PDF Parsing solution requires a contract and fees separate from those of the SaaS version. We recommend it only to big companies with highly developed infrastructure and IT security teams.