Python library and command line tool for parsing pdf bank statements

marlanperumal, updated 🕥 2022-12-09 05:24:27

PDF Statement Reader



Inspired by https://github.com/antonburger/pdf2csv

Objectives

Banks generally send account statements in pdf format. These pdfs are often encrypted, tables are difficult to extract from the pdf format, and when you finally get the table out it is not in a tidy format. This package aims to help by providing a library of functions and a set of command line tools for converting these statements into more useful formats such as csv files and pandas dataframes.

Installation

Python software can optionally be installed in a virtual environment to eliminate conflicts with system packages. For example, on Windows:

python -m venv ./venv/psr

.\venv\psr\scripts\activate

cd .\venv\psr

Use deactivate to return to the normal system.

pip install pdf-statement-reader

Troubleshooting

This package uses tabula-py under the hood, which itself is a wrapper for tabula-java. You thus need to have java installed for it to work. If you get any errors complaining about java, check out the tabula-py page for troubleshooting advice.

In the future, we hope to move to a pure python implementation.

Usage

The package provides a command line application psr

```
Usage: psr [OPTIONS] COMMAND [ARGS]...

  Utility for reading bank and other statements in pdf form

Options:
  --help  Show this message and exit.

Commands:
  bulk      Bulk converts all files in a folder
  decrypt   Decrypts a pdf file Uses pikepdf to open an encrypted pdf file...
  pdf2csv   Converts a pdf statement to a csv file using a given format
  validate  Validates the csv statement rolling balance
```

Configuration

PDF files are notoriously difficult to extract data from. (Here's a nice blog post on why). For a really good semi-manual GUI solution, check out tabula. In fact this package uses tabula's pdf parsing library under the hood.

Since bank statements are generally of the same (if inconvenient) format, we can set up a configuration to tell the tool how to grab the data.

For each type of bank statement, the exact format will be different. A config file holds the instructions for how to process the raw pdf. For now the only config supported is for Cheque account statements from Absa bank in South Africa.

To set up a different statement, you can simply add a new config file and then tell the psr tool to use it. These config files are stored in a folder structure as follows:

config > [country code] > [bank] > [statement type].json

So for example the default config is stored in

config > za > absa > cheque.json

The config spec is a code of the form

[country code].[bank].[statement type]

Once again for the default this will be

za.absa.cheque
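The relationship between a config spec and the folder structure can be sketched as follows. This is a minimal illustration only: `config_spec_to_path` and the `config_root` default are hypothetical names, not part of the package's API, and the package may resolve the path differently.

```python
from pathlib import Path


def config_spec_to_path(spec: str, config_root: Path = Path("config")) -> Path:
    """Map a spec such as 'za.absa.cheque' to config/za/absa/cheque.json.

    Hypothetical helper for illustration; not the package's own function.
    """
    parts = spec.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected [country code].[bank].[statement type], got {spec!r}"
        )
    country, bank, statement_type = parts
    return config_root / country / bank / f"{statement_type}.json"


print(config_spec_to_path("za.absa.cheque").as_posix())
# config/za/absa/cheque.json
```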

The configuration file itself is in JSON format. Here's the Absa cheque account one with some commentary to explain what each field does.

```json5
{
    "$schema": "https://raw.githubusercontent.com/marlanperumal/pdf_statement_reader/develop/pdf_statement_reader/config/psr_config.schema.json",
    // Describes the page layout that should be scanned
    "layout": {
        // Default layout for all pages not otherwise defined
        "default": {
            // The page coordinates containing the table in pts
            // [top, left, bottom, right]
            "area": [280, 27, 763, 576],
            // The right x coordinate of each column in the table
            "columns": [83, 264, 344, 425, 485, 570]
        },
        // Layout for the first page
        "first": {
            "area": [480, 27, 763, 576],
            "columns": [83, 264, 344, 425, 485, 570]
        }
    },

    // The column names to be used, exactly as they appear
    // in the statement
    "columns": {
        "trans_date": "Date",
        "trans_type": "Transaction Description",
        "trans_detail": "Transaction Detail",
        "debit": "Debit Amount",
        "credit": "Credit Amount",
        "balance": "Balance"
    },

    // The order of the columns to be output in the csv
    "order": [
        "trans_date",
        "trans_type",
        "trans_detail",
        "debit",
        "credit",
        "balance"
    ],

    // Specifies any cleaning operations required
    "cleaning": {
        // Convert these columns to numeric
        "numeric": ["debit", "credit", "balance"],
        // Convert these columns to date
        "date": ["trans_date"],
        // Use this date format to parse any date columns
        "date_format": "%d/%m/%Y",
        // For cases where the transaction detail is stored
        // in the next line below the transaction type
        "trans_detail": "below",
        // Only keep the rows where these columns are populated
        "dropna": ["balance"]
    }
}
```

These were the configuration options that were required for the default format. It is envisaged that as more formats are added, the list of options will grow.
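To make the cleaning options more concrete, here is a rough sketch of what `numeric`, `date`, `date_format` and `dropna` could mean when applied to extracted rows. It uses plain dicts and the standard library rather than pandas, and it is not the package's implementation: `clean_rows` is a hypothetical name, the comma thousands separator is an assumption about the input, and the `trans_detail` option is not covered.

```python
from datetime import datetime

# The "cleaning" section of the config, as a plain dict
cleaning = {
    "numeric": ["debit", "credit", "balance"],
    "date": ["trans_date"],
    "date_format": "%d/%m/%Y",
    "dropna": ["balance"],
}


def clean_rows(rows, cleaning):
    """Apply numeric, date and dropna cleaning to a list of row dicts."""
    cleaned = []
    for row in rows:
        row = dict(row)  # work on a copy so the input is left untouched
        for col in cleaning.get("numeric", []):
            # Assumption: amounts use "," as a thousands separator
            text = (row.get(col) or "").replace(",", "").strip()
            row[col] = float(text) if text else None
        for col in cleaning.get("date", []):
            text = (row.get(col) or "").strip()
            row[col] = (
                datetime.strptime(text, cleaning["date_format"]).date()
                if text
                else None
            )
        # Keep the row only if every "dropna" column is populated
        if all(row.get(col) is not None for col in cleaning.get("dropna", [])):
            cleaned.append(row)
    return cleaned


rows = [
    {"trans_date": "01/02/2021", "debit": "1,250.00", "credit": "", "balance": "8,750.00"},
    {"trans_date": "", "debit": "", "credit": "", "balance": ""},  # no balance: dropped
]
result = clean_rows(rows, cleaning)
```

Only the first row survives here, because the second has no balance and `balance` is listed under `dropna`.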

This format is also captured in pdf_statement_reader/config/psr_config.schema.json as a json-schema. If you're using vscode or some other compatible text editor, you should get autocompletion hints as long as you include that $schema tag at the top of your json file.

A key part of setting up a new configuration is getting the page coordinates for the area and columns. The easiest way to do this is to run the tabula GUI, autodetect the page areas, save the settings as a template, then download and inspect the json template file. It's not a one-to-one mapping to the psr config but hopefully it will be a good starting point.
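As a starting point for that mapping, the sketch below reorders a tabula selection into the `[top, left, bottom, right]` order used by `area`. The template shape shown (a JSON list of selections with `page`, `x1`, `x2`, `y1`, `y2` keys) is an assumption about the exported file, so check it against your own download; `selection_to_area` is a hypothetical helper.

```python
import json

# Assumed shape of an exported tabula template: one entry per selection.
# The keys in your own file may differ.
template_json = """
[
  {"page": 1, "extraction_method": "guess",
   "x1": 27.0, "x2": 576.0, "y1": 480.0, "y2": 763.0,
   "width": 549.0, "height": 283.0}
]
"""


def selection_to_area(selection):
    """Reorder a selection into the psr [top, left, bottom, right] area."""
    return [selection["y1"], selection["x1"], selection["y2"], selection["x2"]]


selections = json.loads(template_json)
# Map each page number to the area selected on that page
areas = {sel["page"]: selection_to_area(sel) for sel in selections}
print(areas)
```

The `columns` values (the right x coordinate of each column) are not part of such a selection and still need to be read off the page separately.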

CLI API

decrypt

```
Usage: psr decrypt [OPTIONS] INPUT_FILENAME [OUTPUT_FILENAME]

  Decrypts a pdf file

  Uses pikepdf to open an encrypted pdf file and then save the unencrypted
  version. If no output_filename is specified then overwrites the original
  file.

Options:
  -p, --password TEXT  The pdf encryption password. If not supplied, it will
                       be requested at the prompt
  --help               Show this message and exit.
```

pdf2csv

```
Usage: psr pdf2csv [OPTIONS] INPUT_FILENAME [OUTPUT_FILENAME]

  Converts a pdf statement to a csv file using a given format

Options:
  -c, --config TEXT  The configuration code defining how the file should be
                     parsed  [default: za.absa.cheque]
  --help             Show this message and exit.
```

validate

```
Usage: psr validate [OPTIONS] INPUT_FILENAME

  Validates the csv statement rolling balance

Options:
  -c, --config TEXT  The configuration code defining how the file should be
                     parsed  [default: za.absa.cheque]
  --help             Show this message and exit.
```
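The idea behind a rolling balance check can be sketched as below: each row's balance should equal the previous balance plus the credit minus the debit. This is an illustration under assumptions, not the check that `psr validate` actually performs: `find_balance_breaks` is a hypothetical name, and debits are assumed to be stored as positive amounts.

```python
from decimal import Decimal


def find_balance_breaks(rows):
    """Return indexes of rows whose balance does not follow from the row before."""
    breaks = []
    for i in range(1, len(rows)):
        prev, row = rows[i - 1], rows[i]
        # Assumption: debit and credit are both non-negative magnitudes
        expected = prev["balance"] + row["credit"] - row["debit"]
        if expected != row["balance"]:
            breaks.append(i)
    return breaks


rows = [
    {"debit": Decimal("0"), "credit": Decimal("0"), "balance": Decimal("1000.00")},
    {"debit": Decimal("250.00"), "credit": Decimal("0"), "balance": Decimal("750.00")},
    # 750.00 + 100.00 should give 850.00, so this row is flagged
    {"debit": Decimal("0"), "credit": Decimal("100.00"), "balance": Decimal("900.00")},
]
print(find_balance_breaks(rows))  # [2]
```

`Decimal` is used instead of `float` so that exact comparisons of currency amounts are not thrown off by rounding.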

bulk

```
Usage: psr bulk [OPTIONS] FOLDER

  Bulk converts all files in a folder

Options:
  -c, --config TEXT          The configuration code defining how the file
                             should be parsed  [default: za.absa.cheque]
  -p, --password TEXT        The pdf encryption password. If not supplied, it
                             will be requested at the prompt
  -d, --decrypt-suffix TEXT  The suffix to append to the decrypted pdf file
                             when created  [default: _decrypted]
  -k, --keep-decrypted       Keep the a copy of the decrypted file. It is
                             removed by default
  -v, --verbose              Print verbose output while running
  --help                     Show this message and exit.
```

Issues

Bump certifi from 2021.5.30 to 2022.12.7

opened on 2022-12-09 05:24:26 by dependabot[bot]

Bumps certifi from 2021.5.30 to 2022.12.7.



Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
- `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language
- `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language
- `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language
- `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/marlanperumal/pdf_statement_reader/network/alerts).

Bump lxml from 4.6.3 to 4.9.1

opened on 2022-07-06 20:04:18 by dependabot[bot]

Bumps lxml from 4.6.3 to 4.9.1.

Changelog

Sourced from lxml's changelog.

4.9.1 (2022-07-01)

Bugs fixed

  • A crash was resolved when using iterwalk() (or canonicalize()) after parsing certain incorrect input. Note that iterwalk() can crash on valid input parsed with the same parser after failing to parse the incorrect input.

4.9.0 (2022-06-01)

Bugs fixed

  • GH#341: The mixin inheritance order in lxml.html was corrected. Patch by xmo-odoo.

Other changes

  • Built with Cython 0.29.30 to adapt to changes in Python 3.11 and 3.12.

  • Wheels include zlib 1.2.12, libxml2 2.9.14 and libxslt 1.1.35 (libxml2 2.9.12+ and libxslt 1.1.34 on Windows).

  • GH#343: Windows-AArch64 build support in Visual Studio. Patch by Steve Dower.

4.8.0 (2022-02-17)

Features added

  • GH#337: Path-like objects are now supported throughout the API instead of just strings. Patch by Henning Janssen.

  • The ElementMaker now supports QName values as tags, which always override the default namespace of the factory.

Bugs fixed

  • GH#338: In lxml.objectify, the XSI float annotation "nan" and "inf" were spelled in lower case, whereas XML Schema datatypes define them as "NaN" and "INF" respectively.

... (truncated)

Commits
  • d01872c Prevent parse failure in new test from leaking into later test runs.
  • d65e632 Prepare release of lxml 4.9.1.
  • 86368e9 Fix a crash when incorrect parser input occurs together with usages of iterwa...
  • 50c2764 Delete unused Travis CI config and reference in docs (GH-345)
  • 8f0bf2d Try to speed up the musllinux AArch64 build by splitting the different CPytho...
  • b9f7074 Remove debug print from test.
  • b224e0f Try to install 'xz' in wheel builds, if available, since it's now needed to e...
  • 897ebfa Update macOS deployment target version from 10.14 to 10.15 since 10.14 starts...
  • 853c9e9 Prepare release of 4.9.0.
  • d3f77e6 Add a test for https://bugs.launchpad.net/lxml/+bug/1965070 leaving out the a...
  • Additional commits viewable in compare view



Execute Default Use Case (za.absa.cheque)

opened on 2022-03-28 17:27:36 by sendublon

Hello, I have installed the software, got an absa cheque template statement but when I try to execute the software I get the below error. I understood how to create a config file for the statement of my bank (the principle at least) but I cannot even run the default example. Anyone could help? Many thanks

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Debit Amount'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/bin/psr", line 8, in <module>
    sys.exit(cli())
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdf_statement_reader/__init__.py", line 80, in pdf2csv
    df = parse_statement(input_filename, config)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdf_statement_reader/parse.py", line 104, in parse_statement
    clean_numeric(statement, config)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdf_statement_reader/parse.py", line 50, in clean_numeric
    df[col] = df[col].apply(format_negatives)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'Debit Amount'

Bump ipython from 7.16.1 to 7.16.3

opened on 2022-01-21 19:51:43 by dependabot[bot]

Bumps ipython from 7.16.1 to 7.16.3.




Bump pillow from 8.2.0 to 8.3.2

opened on 2021-09-08 03:06:59 by dependabot[bot]

Bumps pillow from 8.2.0 to 8.3.2.

Release notes

Sourced from pillow's releases.

8.3.2

https://pillow.readthedocs.io/en/stable/releasenotes/8.3.2.html

Security

  • CVE-2021-23437 Raise ValueError if color specifier is too long [hugovk, radarhere]

  • Fix 6-byte OOB read in FliDecode [wiredfool]

Python 3.10 wheels

  • Add support for Python 3.10 #5569, #5570 [hugovk, radarhere]

Fixed regressions

  • Ensure TIFF RowsPerStrip is multiple of 8 for JPEG compression #5588 [kmilos, radarhere]

  • Updates for ImagePalette channel order #5599 [radarhere]

  • Hide FriBiDi shim symbols to avoid conflict with real FriBiDi library #5651 [nulano]

8.3.1

https://pillow.readthedocs.io/en/stable/releasenotes/8.3.1.html

Changes

8.3.0

https://pillow.readthedocs.io/en/stable/releasenotes/8.3.0.html

Changes

... (truncated)

Changelog

Sourced from pillow's changelog.

8.3.2 (2021-09-02)

  • CVE-2021-23437 Raise ValueError if color specifier is too long [hugovk, radarhere]

  • Fix 6-byte OOB read in FliDecode [wiredfool]

  • Add support for Python 3.10 #5569, #5570 [hugovk, radarhere]

  • Ensure TIFF RowsPerStrip is multiple of 8 for JPEG compression #5588 [kmilos, radarhere]

  • Updates for ImagePalette channel order #5599 [radarhere]

  • Hide FriBiDi shim symbols to avoid conflict with real FriBiDi library #5651 [nulano]

8.3.1 (2021-07-06)

  • Catch OSError when checking if fp is sys.stdout #5585 [radarhere]

  • Handle removing orientation from alternate types of EXIF data #5584 [radarhere]

  • Make Image.__array__ take optional dtype argument #5572 [t-vi, radarhere]

8.3.0 (2021-07-01)

  • Use snprintf instead of sprintf. CVE-2021-34552 #5567 [radarhere]

  • Limit TIFF strip size when saving with LibTIFF #5514 [kmilos]

  • Allow ICNS save on all operating systems #4526 [baletu, radarhere, newpanjing, hugovk]

  • De-zigzag JPEG's DQT when loading; deprecate convert_dict_qtables #4989 [gofr, radarhere]

  • Replaced xml.etree.ElementTree #5565 [radarhere]

... (truncated)

Commits
  • 8013f13 8.3.2 version bump
  • 23c7ca8 Update CHANGES.rst
  • 8450366 Update release notes
  • a0afe89 Update test case
  • 9e08eb8 Raise ValueError if color specifier is too long
  • bd5cf7d FLI tests for Oss-fuzz crash.
  • 94a0cf1 Fix 6-byte OOB read in FliDecode
  • cece64f Add 8.3.2 (2021-09-02) [CI skip]
  • e422386 Add release notes for Pillow 8.3.2
  • 08dcbb8 Pillow 8.3.2 supports Python 3.10 [ci skip]
  • Additional commits viewable in compare view



Statement Columns as Graphics

opened on 2021-06-23 14:05:16 by flywire

The Australian Citibank cheque account uses graphics rather than text for statement columns (ie can't swipe it like the transactions) so pdf_statement_reader can't detect the start of the columns. It makes some attempt at the first two columns but it would be better if it used CLOSING BALANCE to detect the end of the transactions rather than picking up broken parts of bank notices.


The general layout is similar to CBA except dates are dd Mmm yyyy.
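The suggestion above, stopping at CLOSING BALANCE instead of reading on into the bank notices, could look roughly like this. It is only a sketch of the idea: `rows_until_marker` is a hypothetical helper, not something the package provides, and it assumes the extracted table is available as a list of rows of cell strings.

```python
def rows_until_marker(rows, marker="CLOSING BALANCE"):
    """Keep rows up to and including the first one that contains the marker."""
    kept = []
    for row in rows:
        kept.append(row)
        # Compare case-insensitively against every cell in the row
        if any(marker in str(cell).upper() for cell in row):
            break
    return kept


rows = [
    ["01 Jan 2021", "Deposit", "100.00"],
    ["31 Jan 2021", "CLOSING BALANCE", "100.00"],
    ["Important notice about your account", "", ""],  # bank notice, not a transaction
]
transactions = rows_until_marker(rows)
```

If the marker never appears, all rows are kept, so the behaviour falls back to what happens today.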

Releases

Release v0.2.3 2021-06-20 11:02:39

  • Added json-schema for config file
  • Added secondary alias for cli pdfsr
  • Started tests for parse methods

Release v0.2.2 2021-04-30 21:01:43

Upgraded dependent library versions

Release v0.2.1 2021-02-12 21:06:14

The release that hopefully actually builds

Release v0.2.0 2021-02-12 20:57:54

First release created using github actions

Marlan Perumal
GitHub Repository

python bank-statement pdf pdf-converter