SQL upsert using pandas DataFrames for PostgreSQL, SQLite and MySQL with extra features

ThibTrip, updated 🕥 2023-02-06 01:03:42

pangres

Thanks to freesvg.org for the logo assets

Upsert with pandas DataFrames (ON CONFLICT DO NOTHING or ON CONFLICT DO UPDATE) for PostgreSQL, MySQL, SQLite and potentially other databases behaving like SQLite (untested), with some additional optional features (see Features). Upserting can be done with primary keys or unique keys. Pangres also handles the creation of non-existing SQL tables and schemas.
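
To make the two conflict strategies concrete, here is a minimal sketch using only the sqlite3 module from the standard library (it does not use pangres). The table and column names are made up for illustration, and the SQLite library linked to Python must support the UPSERT syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'old')")

# ON CONFLICT DO NOTHING: the existing row with id=1 is left untouched
conn.execute(
    "INSERT INTO users (id, name) VALUES (?, ?) ON CONFLICT (id) DO NOTHING",
    (1, "ignored"),
)
after_do_nothing = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]

# ON CONFLICT DO UPDATE: the existing row is overwritten with the incoming values
conn.execute(
    "INSERT INTO users (id, name) VALUES (?, ?) "
    "ON CONFLICT (id) DO UPDATE SET name = excluded.name",
    (1, "new"),
)
after_do_update = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]

print(after_do_nothing, after_do_update)  # old new
```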

Features

  1. (optional) Automatic column creation (when a column exists in the DataFrame but not in the SQL table)
  2. (optional) Automatic column type alteration for columns that are empty in the SQL table (except for SQLite where alteration is limited)
  3. (optional) Creates the table if it is missing
  4. (optional) Creates missing schemas in Postgres (and potentially other databases that have a schema system)
  5. JSON is supported (with pd.to_sql it does not work) with some exceptions (see Gotchas and caveats)
  6. Fast (except for SQLite where some help is needed)
  7. Works even if the DataFrame does not contain all the columns defined in the SQL table
  8. SQL injection safe (schema, table and column names are escaped and values are given as parameters)
  9. New in version 4.1: asynchronous support. Tested using aiosqlite for SQLite, asyncpg for PostgreSQL and aiomysql for MySQL
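
As an illustration of features 1 and 8, the sketch below adds missing columns to a SQLite table while quoting identifiers. It uses only the standard library and is not how pangres is implemented: quote_ident and add_missing_columns are hypothetical helpers, and the SQL types are assumed to come from trusted input.

```python
import sqlite3
from typing import Dict, List


def quote_ident(name: str) -> str:
    """Quote an SQL identifier for SQLite by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def add_missing_columns(conn: sqlite3.Connection, table: str,
                        wanted: Dict[str, str]) -> List[str]:
    """Add every column of `wanted` (name -> SQL type) that `table` lacks."""
    # PRAGMA table_info returns one row per column; index 1 is the column name
    info = conn.execute(f"PRAGMA table_info({quote_ident(table)})")
    existing = {row[1] for row in info}
    added = []
    for name, sql_type in wanted.items():
        if name not in existing:
            # sql_type is interpolated as-is: it must come from trusted input
            conn.execute(
                f"ALTER TABLE {quote_ident(table)} "
                f"ADD COLUMN {quote_ident(name)} {sql_type}"
            )
            added.append(name)
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
added = add_missing_columns(conn, "users", {"name": "TEXT", "age": "INTEGER"})
print(added)  # ['age']
```

Values, unlike identifiers, should always be passed as query parameters (the `?` placeholders of sqlite3) rather than interpolated into the statement.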

Requirements

  • SQLite >= 3.24.4
  • Python >= 3.6.4
  • See also ./pangres/requirements.txt
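
A quick way to check these minimums from Python is sketched below. Note that sqlite3.sqlite_version_info reports the version of the SQLite library linked to the Python interpreter; the meets helper and the constant names are made up for this example.

```python
import sqlite3
import sys

MIN_SQLITE = (3, 24, 4)  # minimum stated in the requirements above
MIN_PYTHON = (3, 6, 4)   # minimum stated in the requirements above


def meets(current, minimum) -> bool:
    """Compare two version tuples element by element."""
    return tuple(current) >= tuple(minimum)


sqlite_ok = meets(sqlite3.sqlite_version_info, MIN_SQLITE)
python_ok = meets(sys.version_info[:3], MIN_PYTHON)
print(f"SQLite {sqlite3.sqlite_version}: {'OK' if sqlite_ok else 'too old'}")
print(f"Python {sys.version.split()[0]}: {'OK' if python_ok else 'too old'}")
```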

Requirements for sqlalchemy>=2.0

To use pangres together with sqlalchemy>=2.0 (sqlalchemy is one of the pangres dependencies listed in requirements.txt), you will need the following base requirements:

  • alembic>=1.7.2
  • pandas>=1.4.0
  • Python >= 3.8 (pandas>=1.4.0 only supports Python >= 3.8)

Requirements for asynchronous engines

For using asynchronous engines (such as aiosqlite, asyncpg or aiomysql) you will need Python >= 3.8.

Gotchas and caveats

All flavors

  1. We can't create JSON columns automatically, but we can insert JSON-like objects (list, dict) into existing JSON columns.
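
To illustrate what "JSON-like objects in an existing column" means at the driver level, here is a small sketch with the standard library only. It does not use pangres: the table is created up front, the objects are serialized by hand with json.dumps, and SQLite simply stores the result as text. All names are made up for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# The table and its JSON column exist before any data is inserted
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload JSON)")

rows = [(1, {"tags": ["a", "b"]}), (2, [1, 2, 3])]
conn.executemany(
    "INSERT INTO events (id, payload) VALUES (?, ?)",
    [(row_id, json.dumps(obj)) for row_id, obj in rows],
)

# Reading back: deserialize the stored text into Python objects again
loaded = {
    row_id: json.loads(text)
    for row_id, text in conn.execute("SELECT id, payload FROM events")
}
print(loaded[1]["tags"], loaded[2])  # ['a', 'b'] [1, 2, 3]
```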

Postgres

  1. "%", ")" and "(" in column names will most likely cause errors with PostgreSQL (this is due to psycopg2 and also affects pd.to_sql). Use the function pangres.fix_psycopg2_bad_cols to "clean" the columns in the DataFrame. You'll also have to rename columns in the SQL table accordingly (if the table already exists).
  2. We only alter data types of empty columns. However, since we don't want to lose column information (e.g. constraints), we use true column alteration (instead of drop + create), so the old data type must be castable to the new data type. Postgres seems a bit restrictive in this regard even when the columns are empty (e.g. BOOLEAN to TIMESTAMP is impossible).
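
The idea behind cleaning column names can be sketched in plain Python as follows. This is not the implementation of pangres.fix_psycopg2_bad_cols: clean_column_names is a hypothetical helper, and replacing the problematic characters with an empty string is an assumption made for the example.

```python
from typing import Dict, Iterable, List, Optional

# Characters reported as problematic above; the replacements are an assumption
BAD_CHARS = {"%": "", "(": "", ")": ""}


def clean_column_names(columns: Iterable[str],
                       replacements: Optional[Dict[str, str]] = None) -> List[str]:
    """Return the column names with every problematic character replaced."""
    replacements = BAD_CHARS if replacements is None else replacements
    cleaned = []
    for col in columns:
        for bad, good in replacements.items():
            col = col.replace(bad, good)
        cleaned.append(col)
    # Two different names may collapse into the same cleaned name
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("cleaning produced duplicate column names")
    return cleaned


print(clean_column_names(["price (USD)", "growth_%", "id"]))
# ['price USD', 'growth_', 'id']
```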

SQLite

  1. SQLite must be version 3.24.4 or higher! UPSERT syntax did not exist before.
  2. Column type alteration is not possible for SQLite.
  3. SQLite inserts can be, at worst, 5 times slower than pd.to_sql for some reason. If you can help please contact me!
  4. Inserts with 1000 columns or more (32767 or more for SQLite >= 3.32.0) are not supported, because not even a single row could be inserted without exceeding the maximum number of parameters per query. One way to fix this would be to insert the columns progressively, but this seems quite tricky. If you know a better way please contact me.
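
The parameter limit from the last item also bounds how many rows fit into a single statement. The sketch below shows the arithmetic; it is not necessarily how pangres picks its chunk sizes. The limits 999 and 32766 are SQLite's default compile-time settings (a given build may differ), and the function names are made up for the example.

```python
import sqlite3
from typing import Iterator, List, Sequence


def default_max_params(version_info=sqlite3.sqlite_version_info) -> int:
    """Default SQLITE_MAX_VARIABLE_NUMBER for a given SQLite version."""
    return 32766 if tuple(version_info) >= (3, 32, 0) else 999


def max_rows_per_statement(nb_columns: int, max_params: int) -> int:
    """Each row needs one parameter per column."""
    if nb_columns > max_params:
        raise ValueError("not even a single row fits in one statement")
    return max_params // nb_columns


def chunked(rows: Sequence, size: int) -> Iterator[List]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield list(rows[start:start + size])


rows = [(i, f"name {i}") for i in range(1200)]
size = max_rows_per_statement(nb_columns=2, max_params=999)
print(size, [len(chunk) for chunk in chunked(rows, size)])  # 499 [499, 499, 202]
```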

MySQL

  1. MySQL will often change the order of the primary keys in the SQL table when using INSERT ... ON CONFLICT ... DO NOTHING/UPDATE. This seems to be the expected behavior, so there is nothing we can do about it, but please keep it in mind!
  2. You may need to provide SQL dtypes, e.g. if you have a primary key with text you will need to provide a character length (e.g. VARCHAR(50)) because MySQL does not support indices/primary keys with flexible text length. pd.to_sql has the same issue.

Notes

This is a library I was using in production in private with very good results and decided to publish.

Ideally such features will be integrated into pandas since there is already a PR on the way and I would like to give the option to add columns via another PR.

There is also pandabase which does almost the same thing (plus lots of extra features) but my implementation is different. Btw big thanks to pandabase and the sql part of pandas which helped a lot.

Installation

pip install pangres

Additionally, depending on which database you want to work with, you will need to install the corresponding library (note that SQLite is included in the standard library):

  • Postgres: pip install psycopg2

  • MySQL: pip install pymysql

  • Postgres (asynchronous): pip install asyncpg

  • MySQL (asynchronous): pip install aiomysql

  • SQLite (asynchronous): pip install aiosqlite

Usage

Head over to pangres' wiki! Note that the wiki is also available locally under the wiki folder.

Note:

The wiki is generated with a command which uses my library npdoc_to_md. It must be installed with pip install npdoc_to_md and you will also need the extra dependency fire, which you can install with pip install fire. Replace $DESTINATION_FOLDER with the folder of your choice in the command below:

npdoc-to-md render-folder ./wiki_templates $DESTINATION_FOLDER

Contributing

Pull requests/issues are welcome.

Development

I develop the library inside of Jupyter Lab using the jupytext extension.

I recommend using this extension for the best experience. It will split code blocks within modules in notebook cells and will allow interactive development.

If you wish you can also use the provided conda environment (see environment.yml file) inside of Jupyter Lab/Notebook thanks to nb_conda_kernels.

Testing

Pytest

You can test one or more of the following SQL flavors (you will of course need a live database for this): PostgreSQL, SQLite or MySQL.

NOTE: one of the pangres tests tries to drop and then create a PostgreSQL schema called pangres_create_schema_test. If the schema exists and is not empty, an error will be raised.

Clone pangres then set your current working directory to the root of the cloned repository folder. Then use the commands below. You will have to replace the following variables in those commands:

  • SQLITE_CONNECTION_STRING: replace with a SQLite sqlalchemy connection string (e.g. "sqlite:///test.db")
  • ASYNC_SQLITE_CONNECTION_STRING: replace with an asynchronous SQLite sqlalchemy connection string (e.g. "sqlite+aiosqlite:///test.db")
  • POSTGRES_CONNECTION_STRING: replace with a Postgres sqlalchemy connection string (e.g. "postgresql://user:password@host:5432/database"). Specifying the schema is optional for Postgres (defaults to public)
  • ASYNC_POSTGRES_CONNECTION_STRING: replace with an asynchronous Postgres sqlalchemy connection string (e.g. "postgresql+asyncpg://user:password@host:5432/database"). Specifying the schema is optional for Postgres (defaults to public)
  • MYSQL_CONNECTION_STRING: replace with a MySQL sqlalchemy connection string (e.g. "mysql+pymysql://user:password@host:3306/database")
  • ASYNC_MYSQL_CONNECTION_STRING: replace with an asynchronous MySQL sqlalchemy connection string (e.g. "mysql+aiomysql://user:password@host:3306/database")
  • PG_SCHEMA (optional): schema for Postgres (defaults to public)

```bash
# 1. Create and activate the build environment
conda env create -f environment.yml
conda activate pangres-dev

# 2. Install pangres in editable mode (changes are reflected upon reimporting)
pip install -e .

# 3. Run pytest
# -s prints stdout
# -v prints test parameters
# --cov=./pangres shows coverage only for pangres
# --doctest-modules tests with doctest in all modules
# --benchmark-XXX : these are options for benchmarks tests (see https://pytest-benchmark.readthedocs.io/en/latest/usage.html)
pytest -s -v pangres --cov=pangres --doctest-modules \
    --async_sqlite_conn=$ASYNC_SQLITE_CONNECTION_STRING \
    --sqlite_conn=$SQLITE_CONNECTION_STRING \
    --async_pg_conn=$ASYNC_POSTGRES_CONNECTION_STRING \
    --pg_conn=$POSTGRES_CONNECTION_STRING \
    --async_mysql_conn=$ASYNC_MYSQL_CONNECTION_STRING \
    --mysql_conn=$MYSQL_CONNECTION_STRING \
    --pg_schema=tests \
    --benchmark-group-by=func,param:engine,param:nb_rows \
    --benchmark-columns=min,max,mean,rounds \
    --benchmark-sort=name \
    --benchmark-name=short
```

Additionally, the following flags could be of interest to you:

  • -x for stopping at the first failure
  • --benchmark-only for only testing benchmarks
  • --benchmark-skip for skipping benchmarks

flake8

flake8 must run without errors for pipelines to succeed. If you are not using the conda environment, you can install flake8 with: pip install flake8.

To test flake8 locally you can simply execute this command:

flake8 .

Issues

Doesn't work with pandas 2.0

opened on 2023-03-29 02:59:18 by tiokouBrice

Some functions need to be adapted. For example, this line in engine.py:

pandas_sql_engine = pd.io.sql.SQLDatabase(engine=connection, schema=schema)

needs to be changed as follows to work with pandas>=2.0:

pandas_sql_engine = pd.io.sql.SQLDatabase(conn=connection, schema=schema)

Feature Request: add a return value from `pangres.upsert` that shows how many rows were inserted, updated, etc.

opened on 2023-02-13 21:53:18 by DeflateAwning

This is a common feature in other software like DBeaver, where the result of running a query shows how many rows were inserted, updated, deleted, etc. It would be valuable if such feedback could be added to this awesome library.
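
Until something like this exists in the library, one possible approach outside of pangres is to count rows before and after the statement. The sketch below does this for SQLite with the standard library; upsert_with_counts is a hypothetical helper, "updated" counts every incoming row that hit an existing key (even if the values did not change), and the counts are only reliable when no other connection writes to the table at the same time.

```python
import sqlite3
from typing import Dict, List, Tuple

UPSERT_SQL = (
    "INSERT INTO users (id, name) VALUES (?, ?) "
    "ON CONFLICT (id) DO UPDATE SET name = excluded.name"
)


def count_rows(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


def upsert_with_counts(conn: sqlite3.Connection,
                       rows: List[Tuple[int, str]]) -> Dict[str, int]:
    """Upsert `rows` and report how many were inserted vs. updated."""
    before = count_rows(conn)
    conn.executemany(UPSERT_SQL, rows)
    # Every row that did not increase the table size hit an existing key
    inserted = count_rows(conn) - before
    return {"inserted": inserted, "updated": len(rows) - inserted}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'old')")
counts = upsert_with_counts(conn, [(1, "a"), (2, "b"), (3, "c")])
print(counts)  # {'inserted': 2, 'updated': 1}
```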

AttributeError: 'Distribution' object has no attribute 'convert_2to3_doctests'

opened on 2023-02-06 14:02:04 by ManPython

\setuptools\command\build_py.py", line 132, in build_package_data
srcfile in self.distribution.convert_2to3_doctests):
AttributeError: 'Distribution' object has no attribute 'convert_2to3_doctests'

I can't install the new version of pangres (https://github.com/ThibTrip/pangres/releases/tag/v4.1.3). It worked before.

Workaround for Access?

opened on 2022-09-01 19:29:05 by Codein99

Hey,

when I try to use pangres with Access I get this error:

raise exc.CompileError(

sqlalchemy.exc.CompileError: The 'access' dialect with current database version settings does not support in-place multirow inserts.

Does someone have a workaround for Access or something similar?

dataframe = pd.DataFrame(data=[data])
columnnamespr = pr_DB.columns.keys()

dataframe = dataframe.rename(columns=dict(zip(dataframe.columns, columnnamespr)))
dataframe.set_index(['ID'], drop=True, inplace=True)

with engine.connect() as con:
    upsert(con=con, df=dataframe, table_name='produkte', if_row_exists='update', create_table=False, chunksize=1)

[Question] how is this performance-wise compared to df.to_sql(method='multi')?

opened on 2020-10-25 19:41:17 by lefnire

From reading around, it seems INSERT .. ON CONFLICT is higher-performance than DELETE .. INSERT .., which is how I'm doing things now, so this library is compelling. However, one big boon of df.to_sql is method='multi', which creates a big-ol' insert statement rather than individual ones, which (combined with chunksize) I've found improves my bulk-insert performance massively. I realize this is more a question about sqlalchemy.dialects.postgresql.insert, but I'm asking here because, being less familiar with that method, I don't see any arguments that can be passed to pangres.upsert for managing the insert approach. E.g., I see postgresql.insert(inline=True) might be something along these lines? Or does postgresql.insert handle it like that by default?

TL;DR: is pangres as fast as df.to_sql(method='multi'), or are there plans to add options which get passed to sqlalchemy..postgresql for performance management? (Does this make sense?)
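
For readers unfamiliar with the idea behind method='multi', the sketch below combines a single multi-row VALUES clause with ON CONFLICT, using SQLite and the standard library. It only illustrates the SQL technique and makes no claim about how pangres or sqlalchemy build their statements; multi_row_upsert and the table are made up for the example.

```python
import sqlite3
from typing import List, Tuple


def multi_row_upsert(conn: sqlite3.Connection, rows: List[Tuple[int, str]]) -> None:
    """Upsert all rows with one statement instead of one statement per row."""
    if not rows:
        return
    # One "(?, ?)" group per row, e.g. "(?, ?), (?, ?), (?, ?)"
    placeholders = ", ".join(["(?, ?)"] * len(rows))
    sql = (
        "INSERT INTO users (id, name) VALUES " + placeholders +
        " ON CONFLICT (id) DO UPDATE SET name = excluded.name"
    )
    # Flatten the rows so that the values line up with the placeholders
    params = [value for row in rows for value in row]
    conn.execute(sql, params)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'old')")
multi_row_upsert(conn, [(1, "a"), (2, "b"), (3, "c")])
result = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(result)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

The number of rows per statement is bounded by the maximum number of parameters the database accepts, which is where a chunk size comes in.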

Releases

v4.1.3 2023-02-06 00:57:12

Bug Fixes

  • pangres is now compatible with sqlalchemy>=2.0. IMPORTANT: you will need alembic>=1.7.2 (it is one of the dependencies of pangres) and Python >= 3.8
  • typing should now work (py.typed file was missing). Please note that support for typing is partial, unless you are using sqlalchemy>=2.0

Development

  • added code style check with flake8
  • improvements to linting and typing of internal as well as "public" objects

v4.1.2 2022-07-29 19:40:03

Bug Fixes

  • pangres was not running on Python 3.10 because I never added packaging in requirements.txt. I fixed it by using pkg_resources (which ships with setuptools rather than the standard library) instead to avoid the additional dependency (see commit 89d3679)

Note: the tests were running fine because we use pytest which uses packaging (so I never saw the missing dependency)

v4.1.1 2022-03-13 15:41:21

Bug Fixes

  • fixed bug where I used a synchronous method instead of its asynchronous variant (UpsertQuery.execute instead of UpsertQuery.aexecute in pangres.aupsert). This has no repercussions for the end user

Documentation

  • fixed illogical code in the example for pangres.aupsert (it used the engine instead of the connection in contexts) and added the commit which I had forgotten!
  • added changelog

Testing

  • overhaul of the tests. asynchronous and synchronous tests have been separated
  • module test_upsert_end_to_end has been renamed to test_core

v4.1 2022-01-21 19:15:55

New Features

  • Added async support with function pangres.aupsert 🚀! Tested using aiosqlite for SQLite, asyncpg for PostgreSQL and aiomysql for MySQL. See documentation in dedicated wiki page

v4.0.2 2022-01-17 19:48:28

This patches an important bug with MySQL. We recommend that all users upgrade to this version.

Bug Fixes

  • Fixed bug where tables in MySQL were created with auto increment on the primary key (see #56)

v4.0.1 2022-01-13 13:19:52

Bug Fixes

  • removed warning due to deprecated code when checking versions of other libraries in Python >= 3.10 (see issue #54)
Thibault Bétrémieux

Data Analyst | Python Software Developer