PDF Parsing API
Summary
PDF Parsing lets you extract data from bank statements and transaction confirmations in PDF format. Depending on the document you wish to parse, you select the appropriate endpoint and upload the encoded document. Our system then extracts the key data into JSON or XML format for easy processing. Additionally, our parsing algorithms verify the credibility and integrity of the information contained in the documents.
How to integrate
Integrating PDF Parsing is straightforward. Upload a bank statement in PDF format, encoded as Base64 (plain text), to one of our API endpoints. In response, you’ll receive the data extracted from the document.
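As an illustration of the encoding step only, here is a minimal Python sketch that turns a PDF file into Base64 plain text. The function names are hypothetical helpers written for this example; the endpoint URL, request parameters and authentication are not shown here and should be taken from the API documentation.

```python
import base64
from pathlib import Path


def encode_pdf_base64(pdf_bytes: bytes) -> str:
    """Return the raw PDF bytes as Base64 plain text (ASCII only)."""
    return base64.b64encode(pdf_bytes).decode("ascii")


def encode_pdf_file(path: str) -> str:
    """Read a PDF file from disk, unmodified, and Base64-encode its content."""
    # Read in binary mode so the document is not altered before encoding.
    return encode_pdf_base64(Path(path).read_bytes())
```

The resulting string is what you would send as the encoded document. Reading the file in binary mode, without opening it in a PDF editor first, keeps the content byte-for-byte identical to what the bank issued.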
See our documentation for more details.
Extracting data
We can only extract as much information as is available in a given document, and this differs from bank to bank. In most cases, the data we can extract includes:
From statements (monthly statements and account history documents):
- KYC data,
- accounts,
- transactions
From transaction confirmations:
- status of the transaction,
- who generated the confirmation,
- sender/recipient information,
- transaction details (amount, title, kind, etc.)
For more detailed information on the data returned, please refer to our Kontomatik Services & Coverage document, which specifies the data available in each supported bank.
Parsing different document types
| Document type | Statements | Transaction confirmations |
| --- | --- | --- |
| Supported countries | Poland, Latvia | Poland |
| Supported types of documents | | |
| Features | | |
| Not supported | | |
| Response format | XML | JSON |
| Ability to operate on parsed data | Yes | No. Data extracted from a confirmation is returned only in the immediate parsing response; it won’t be available in Insight, and you won’t be able to use any features based on aggregated data, such as Data Analysis (e.g. Scoring or Data Summary) or Data Processing. |
Anti-tampering verification
Verification aims to hinder malicious attempts at tampering with PDF documents. Our algorithms verify:
- Bank’s digital signature
- Consistency of the account balance with transactions
- PDF metadata characteristics
- Fonts, colors and sizes
- Bank’s logotype
- Consistency of the header period with transaction dates
- Keywords
- Document structure
Checking these properties is not guaranteed to catch every fraud attempt, but based on our analysis they are the most common indicators of tampering.
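To make two of the checks listed above more concrete (balance consistency and header period consistency), here is a simplified Python sketch. It is an illustration under stated assumptions, not our actual verification algorithm: the `Transaction` type, the function names and the sign convention (credits positive, debits negative) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Sequence


@dataclass(frozen=True)
class Transaction:
    booked_on: date
    amount: Decimal  # assumption: credits are positive, debits are negative


def balance_is_consistent(
    opening: Decimal, closing: Decimal, transactions: Sequence[Transaction]
) -> bool:
    """True if the opening balance plus all transaction amounts equals the closing balance."""
    total = sum((t.amount for t in transactions), Decimal("0"))
    return opening + total == closing


def dates_within_period(
    period_start: date, period_end: date, transactions: Sequence[Transaction]
) -> bool:
    """True if every transaction date falls inside the statement's header period (inclusive)."""
    return all(period_start <= t.booked_on <= period_end for t in transactions)
```

`Decimal` is used instead of `float` so that monetary amounts compare exactly. A statement where either function returns `False` would be a candidate for closer review rather than proof of fraud.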
Sometimes a PDF document is edited accidentally. In such cases, it’s best to ask the end-user to download the document again and upload it to the parser without opening it first.
PDF Parsing Widget
Please note that the PDF Parsing Widget is available only for statements. Transaction confirmations can be parsed only via the API.
Our PDF Parsing Widget is a front-end widget that makes parsing a bank statement even more seamless. It is available as part of our SaaS solution.
The end-user chooses the bank and uploads bank statements as a PDF file. You can then get the extracted information using our data endpoint or review it in Insight.
On-premise deployment
Our PDF Parsing solution is mainly used in the SaaS model, but we also offer on-premise deployments. This option is designed for clients who don’t want their data to pass through any servers other than their own.
It does, however, have some disadvantages:
- you need to maintain the infrastructure and security on your own,
- it’s not updated as frequently as the SaaS solution, and you’re in charge of the installation,
- if bugs arise, debugging is much harder, so it might take longer for our developers to fix them.
To find out more about this solution, contact our Sales team.
FAQ
What is PDF Parsing?
PDF parsing is a general term for a class of methods that extract human-readable plain text from PDF documents.
Our proprietary PDF Parsing solution is designed specifically for handling bank documents. It automatically recognises transactions, KYC and other related data in a bank statement and extracts them for further processing.
Moreover, our algorithm checks whether the statement has been edited, whether the data is consistent, whether the digital signature is valid, and whether other expected features are in place, so that you can protect yourself from accepting statements that have been tampered with.
Should I use PDF Parsing or the Banking API?
It depends on what is important to you and your end-users. In both cases, you will get the data in the same format. The main difference is that PDF Parsing doesn’t require the user to log in to their online bank via our widget. As a result, the end-user has more flexibility in how they obtain the bank statement.
On the other hand, the Banking API combined with our widget makes providing the data easier for the end-user.
Finally, the PDF Parsing solution can work entirely from your servers (on-premise), in contrast to the Banking API, which is served only via the cloud (SaaS)*.
*The on-premise PDF Parsing solution requires a separate contract and separate fees from the SaaS version. We recommend it only to large companies with well-developed infrastructure and IT security teams.