Thanks to freesvg.org for the logo assets
Upsert with pandas DataFrames (ON CONFLICT DO NOTHING
or ON CONFLICT DO UPDATE
) for PostgreSQL, MySQL, SQlite and potentially other databases behaving like SQlite (untested) with some additional optional features (see features). Upserting can be done with primary keys or unique keys.
Pangres also handles the creation of non-existing SQL tables and schemas.
aiosqlite
for SQlite, asyncpg
for PostgreSQL and aiomysql
for MySQLFor using pangres
together with sqlalchemy>=2.0
(sqlalchemy is one of pangres dependencies
listed in requirements.txt) - you will need the following base requirements:
* alembic>=1.7.2
* pandas>=1.4.0
* Python >= 3.8 (pandas>=1.4.0
only supports Python >=3.8)
For using asynchronous engines (such as aiosqlite
, asyncpg
or aiomysql
) you will need Python >= 3.8.
This is a library I was using in production in private with very good results and decided to publish.
Ideally such features will be integrated into pandas since there is already a PR on the way and I would like to give the option to add columns via another PR.
There is also pandabase which does almost the same thing (plus lots of extra features) but my implementation is different. Btw big thanks to pandabase and the sql part of pandas which helped a lot.
pip install pangres
Additionally depending on which database you want to work with you will need to install the corresponding library (note that SQlite is included in the standard library):
Postgres
pip install psycopg2
MySQL
pip install pymysql
Postgres (asynchronous)
pip install asyncpg
MySQL (asynchronous)
pip install aiomysql
SQLite (asynchronous)
pip install aiosqlite
Head over to pangres' wiki! Note that the wiki is also available locally under the wiki folder.
Note:
The wiki is generated with a command which uses my library npdoc_to_md.
It must be installed with pip install npdoc_to_md
and you will also need the extra dependency fire
which you
can install with pip install fire
. Replace $DESTINATION_FOLDER
with the folder of you choice in the command below:
bash
npdoc-to-md render-folder ./wiki_templates $DESTINATION_FOLDER
Pull requests/issues are welcome.
I develop the library inside of Jupyter Lab using the jupytext extension.
I recommend using this extension for the best experience. It will split code blocks within modules in notebook cells and will allow interactive development.
If you wish you can also use the provided conda environment (see environment.yml
file) inside of Jupyter Lab/Notebook
thanks to nb_conda_kernels.
You can test one or multiple of the following SQL flavors (you will of course need a live database for this): PostgreSQL, SQlite or MySQL.
NOTE: in one of the tests of pangres
we will try to drop and then create a PostgreSQL schema called pangres_create_schema_test
. If the schema existed and was not empty an error will be raised.
Clone pangres then set your curent working directory to the root of the cloned repository folder. Then use the commands below. You will have to replace the following variables in those commands:
* SQLITE_CONNECTION_STRING
: replace with a SQlite sqlalchemy connection string (e.g. "sqlite:///test.db")
* ASYNC_SQLITE_CONNECTION_STRING
: replace with an asynchronous SQlite sqlalchemy connection string (e.g. "sqlite+aiosqlite:///test.db")
* POSTGRES_CONNECTION_STRING
: replace with a Postgres sqlalchemy connection string (e.g. "postgres:///user:[email protected]:5432/database"). Specifying schema is optional for postgres (will default to public)
* ASYNC_POSTGRES_CONNECTION_STRING
: replace with an asynchronous Postgres sqlalchemy connection string (e.g. "postgres+asyncpg:///user:[email protected]:5432/database"). Specifying schema is optional for postgres (will default to public)
* MYSQL_CONNECTION_STRING
: replace with a MySQL sqlalchemy connection string (e.g. "mysql+pymysql:///user:[email protected]:3306/database")
* ASYNC_MYSQL_CONNECTION_STRING
: replace with an asynchronous MySQL sqlalchemy connection string (e.g. "mysql+aiomysql:///user:[email protected]:3306/database")
* PG_SCHEMA
(optional): schema for postgres (defaults to public)
```bash
conda env create -f environment.yml conda activate pangres-dev
pip install -e .
pytest -s -v pangres --cov=pangres --doctest-modules --async_sqlite_conn=$ASYNC_SQLITE_CONNECTION_STRING --sqlite_conn=$SQLITE_CONNECTION_STRING --async_pg_conn=$ASYNC_POSTGRES_CONNECTION_STRING --pg_conn=$POSTGRES_CONNECTION_STRING --async_mysql_conn=$ASYNC_MYSQL_CONNECTION_STRING --mysql_conn=$MYSQL_CONNECTION_STRING --pg_schema=tests --benchmark-group-by=func,param:engine,param:nb_rows --benchmark-columns=min,max,mean,rounds --benchmark-sort=name --benchmark-name=short ```
Additionally, the following flags could be of interest for you:
* -x
for stopping at the first failure
* --benchmark-only
for only testing benchmarks
* --benchmark-skip
for skipping benchmarks
flake8 must run without errors for pipelines to succeed.
If you are not using the conda environment, you can install flake8 with: pip install flake8
.
To test flake8 locally you can simply execute this command:
flake8 .
Need to adapt some function: Exemple this line in engine.py
pandas_sql_engine = pd.io.sql.SQLDatabase(engine=connection, schema=schema)
Need to be too work with pandas>=2.0
pandas_sql_engine = pd.io.sql.SQLDatabase(conn=connection, schema=schema)
This is a common feature in other software like DBeaver, where the result of running a query shows how many rows were inserted, updated, deleted, etc. It would be valuable if such feedback could be added to this awesome library.
\setuptools\command\build_py.py", line 132, in build_package_data
srcfile in self.distribution.convert_2to3_doctests):
AttributeError: 'Distribution' object has no attribute 'convert_2to3_doctests'
Can't install new version pangres
https://github.com/ThibTrip/pangres/releases/tag/v4.1.3
Befoere it works.
Hey,
when I try to use panges with access I become this Error:
raise exc.CompileError(
sqlalchemy.exc.CompileError: The 'access' dialect with current database version settings does not support in-place multirow inserts.
Have someone a workaround for Access or something like this?
` dataframe = pd.DataFrame(data=[data]) columnnamespr = pr_DB.columns.keys()
dataframe = dataframe.rename(columns=dict(zip(dataframe.columns, columnnamespr)))
dataframe.set_index(['ID'], drop=True, inplace=True)
with engine.connect() as con:
upsert(con=con, df=dataframe, table_name='produkte', if_row_exists='update', create_table=False, chunksize=1)`
From reading around, it seems INSERT .. ON CONFLICT
) is higher-performance than DELETE .. INSERT ..
, which is how I'm doing things now, so this library is compelling. However, one big boon of df.to_sql
is method='multi'
, which creates a big-ol' insert statement rather than individual ones, which (combined with chunksize) I've found improves my bulk-insert performance massively. I realize this is more a question about sqlalchemy.dialects.postgresql.insert
, but I'm asking here because - being less familiar with that method - I don't see any arguments that can be passed to pangres.upsert
for managing the insert approach. Eg, I see postgresql.insert(inline=True)
might be something along these lines? Or does postgresql.insert
handle like that by default?
TL;DR: is pangres as fast as df.to_sql(method='multi')
, or are there plans to add a options which get passed to sqlalchemy..postgresql
for performance management? (Does this make sense?)
pangres
is now compatible with sqlalchemy>=2.0
. IMPORTANT: you will need alembic>=1.7.2
(it is one of the dependencies of pangres
) and Python >= 3.8sqlalchemy>=2.0
pangres
was not running on Python 3.10 because I never added packaging
in requirements.txt
. I fixed it by using pkg_resources
from the standard library instead to avoid the additional dependency (see commit 89d3679)Note: the tests were running fine because we use pytest
which uses packaging
(so I never saw the missing dependency)
UpsertQuery.execute
instead of UpsertQuery.aexecute
in pangres.aupsert
). This has no repercussions for the end userpangres.aupsert
(using engine
instead of connection
in contexts) and commit
which I had forgotten!test_upsert_end_to_end
has been renamed to test_core
pangres.aupsert
:rocket: ! Tested using aiosqlite
for SQlite, asyncpg
for PostgreSQL and aiomysql
for MySQL. See documentation in dedicated wiki pageThis patches an important bug with MySQL. We recommend that all users upgrade to this version.
sqlite3 sql python3 postgresql pandas