Python bindings to Zstandard (zstd) compression library, the API style is similar to Python's bz2/lzma/zlib modules.

animalize, updated 🕥 2023-03-25 05:40:18

Introduction

The pyzstd module provides classes and functions for compressing and decompressing data, using Facebook's `Zstandard <http://www.zstd.net>`_ (zstd for short) algorithm.

The API style is similar to Python's bz2/lzma/zlib modules.

  • Includes zstd v1.5.4 source code
  • Can also dynamically link to the zstd library provided by the system, see `this note <https://pyzstd.readthedocs.io/en/latest/#build-pyzstd>`_.
  • Has a CFFI implementation that can work with PyPy
  • Has a command line interface: ``python -m pyzstd --help``

Links

Documentation: https://pyzstd.readthedocs.io/en/latest

GitHub: https://github.com/animalize/pyzstd

Release note

0.15.4 (Feb 24, 2023)

1. Update bundled zstd source code from v1.5.2 to `v1.5.4 <https://github.com/facebook/zstd/releases/tag/v1.5.4>`_. v1.5.3 is a non-public release.

2. Support the pyproject.toml build mechanism (PEP-517). Note that build options specified in the old way may no longer work, see `doc <https://pyzstd.readthedocs.io/en/latest/#build-pyzstd>`_.

3. Support "multi-phase initialization" (PEP-489) on CPython 3.11+, so that the module can work with CPython sub-interpreters in the future. Currently this build option is disabled by default.

4. Add a command line interface (CLI).

0.15.3 (Aug 3, 2022)

Fix: ``ZstdError`` objects could not be pickled.

0.15.2 (Jan 22, 2022)

Update bundled zstd source code from v1.5.1 to `v1.5.2 <https://github.com/facebook/zstd/releases/tag/v1.5.2>`_.

0.15.1 (Dec 25, 2021)

1. Update bundled zstd source code from v1.5.0 to `v1.5.1 <https://github.com/facebook/zstd/releases/tag/v1.5.1>`_.

2. Fix: ``ZstdFile.write()`` / ``train_dict()`` / ``finalize_dict()`` could use the wrong length for some buffer protocol objects, see `this issue <https://github.com/animalize/pyzstd/issues/4>`_.

3. Two behavior changes:

   * Setting ``CParameter.nbWorkers`` to ``1`` now means "1-thread multi-threaded mode", rather than "single-threaded mode".

   * If the underlying zstd library doesn't support multi-threaded compression, pyzstd no longer automatically falls back to "single-threaded mode"; it now raises a ``ZstdError`` exception.

4. Add a module level variable `zstd_support_multithread <https://pyzstd.readthedocs.io/en/latest/#zstd_support_multithread>`_.

5. Add a setup.py option ``--avx2``, see `this note <https://pyzstd.readthedocs.io/en/latest/#build-pyzstd>`_.

0.15.0 (May 18, 2021)

1. Update bundled zstd source code from v1.4.9 to `v1.5.0 <https://github.com/facebook/zstd/releases/tag/v1.5.0>`_.

2. Some improvements, no API changes.

0.14.4 (Mar 24, 2021)

1. Add a CFFI implementation that can work with PyPy.

2. Allow dynamic linking to the zstd library.

0.14.3 (Mar 4, 2021)

Update bundled zstd source code from v1.4.8 to `v1.4.9 <https://github.com/facebook/zstd/releases/tag/v1.4.9>`_.

0.14.2 (Feb 24, 2021)

1. Add two convenience functions: `compress_stream() <https://pyzstd.readthedocs.io/en/latest/#compress_stream>`_, `decompress_stream() <https://pyzstd.readthedocs.io/en/latest/#decompress_stream>`_.

2. Some improvements.

0.14.1 (Dec 19, 2020)

1. Update bundled zstd source code from v1.4.5 to `v1.4.8 <https://github.com/facebook/zstd/releases/tag/v1.4.8>`_.

   * v1.4.6 is a non-public release for Linux kernel.

   * v1.4.8 is a hotfix for `v1.4.7 <https://github.com/facebook/zstd/releases/tag/v1.4.7>`_.

2. Some improvements, no API changes.

0.13.0 (Nov 7, 2020)

1. ``ZstdDecompressor`` class: it now has the same API and behavior as the ``BZ2Decompressor`` / ``LZMADecompressor`` classes in the Python standard library; it stops after a frame is decompressed.

2. Add an ``EndlessZstdDecompressor`` class, which accepts multiple concatenated frames. It is the previous ``ZstdDecompressor`` class renamed, except that ``.at_frame_edge`` is True when both the input and output streams are at a frame edge.

3. Rename the ``zstd_open()`` function to ``open()``, consistent with the Python standard library.

4. ``decompress()`` function:

   * ~9% faster when there is a single frame and the decompressed size is recorded in the frame header.

   * raises ``ZstdError`` when input **or** output data is not at a frame edge. Previously, it only raised when the output data was not at a frame edge.

0.12.5 (Oct 12, 2020)

No longer use `Argument Clinic <https://docs.python.org/3/howto/clinic.html>`_; now supports Python 3.5+ (previously 3.7+).

0.12.4 (Oct 7, 2020)

It seems the API is stable.

0.2.4 (Sep 2, 2020)

The first version uploaded to PyPI.

Includes `zstd v1.4.5 <https://github.com/facebook/zstd/releases/tag/v1.4.5>`_ source code.

Issues

Find project collaborator(s)

opened on 2023-03-08 08:48:37 by animalize

Looking for project collaborator(s) to release new versions in my absence. About 2~3 versions of zstd are released every year, and a new version of pyzstd needs to be released each time.

I recently changed the status of the project from Beta to Stable. As I said in https://github.com/animalize/pyzstd/pull/3#issuecomment-825365829, there is basically no need for other maintenance work:

I used to spend time checking such details, and manually triggering exceptions to see if they can be handled correctly. Once the development of pyzstd module is completed, almost no maintenance is needed. Basically just update the zstd source code, and use new API in major version updates.

Other precautions are written down in the tech memo. If you are interested, I can explain more, such as what I have tried.

Please ensure that:

  1. You can commit to long-term service, so this is better suited to an enterprise with an interest in the project.
  2. Wheels are only built and uploaded via CI, to avoid the risk of a build host controlled by an attacker.

tech memo

opened on 2020-12-22 02:27:46 by animalize

background

This module was originally written for the Python stdlib: https://github.com/animalize/cpython/pull/8/files. A script was then used to convert the code into this pyzstd module.

After Oct-20-2020, all development was transferred to this module, and it no longer uses CPython's internal feature, Argument Clinic. It now only uses CPython's public API for C extensions.

In mid-March 2021 the code seemed stable, so a CFFI implementation was added.

After exploring some API/implementation changes, I always came back to "the current one is better". So in Jan 2023, the Development Status was changed from Beta to Stable. It has surpassed its stdlib siblings by a lot.

Comparison with the zstandard/zstd modules: https://github.com/animalize/pyzstd/discussions/19#discussioncomment-4702814

Some links:

[Feature Request] Add zstd module to stdlib, on Python issue tracker. https://bugs.python.org/issue37095

A discussion about adding zstd to Python standard library, on Python-Ideas mail-list. https://mail.python.org/archives/list/[email protected]/thread/VQIFA7WTNRAOYZGTVP4WZC2CD36KYIVY/

link to zstd library

The zstd library source code is included without any changes. It lives in the zstd/ folder; to upgrade/downgrade the bundled zstd lib, just replace this folder.

The code supports zstd v1.4.0+ (released in Apr 2019).

Only zstd's "stable" API is used, not the "experimental" API. This means ZSTD_STATIC_LINKING_ONLY is not #defined.

When statically linking to the zstd lib, the ZSTD_MULTITHREAD build macro is used (in setup.py) to enable multi-threaded compression. MT is enabled by default in zstd v1.5.0+, but pyzstd still defines the macro for zstd v1.4.x. No other zstd macros are defined.

See this note: https://pyzstd.readthedocs.io/en/latest/#build-pyzstd

API

The API is similar to Python's bz2/lzma/zlib modules.

The aim is to expose all major functionality provided by zstd's "stable" API.

future plans, zstd

🔴 If "skippable frame" is used more, related API may be added. (unlikely. It's not difficult to implement "skippable frame" functions at user side.)

🟢 When the ZSTD_c_stableInBuffer parameter is moved from the "experimental" API to the "stable" API, it can be used to speed up .FLUSH_FRAME compression if (.last_mode == .FLUSH_FRAME). No plan to use ZSTD_c_stableOutBuffer, because it raises an error when the output buffer is not large enough. (likely)

🟢 When the ZSTD_getFrameHeader() function is moved from the "experimental" API to the "stable" API, more items can be added to the get_frame_info() function. (likely)

🟠 When the ZSTD_d_refMultipleDDicts parameter is moved from the "experimental" API to the "stable" API, the zstd_dict parameter may accept a tuple that contains multiple dictionaries. (not very likely, few people use it, and it makes the API a bit more complex. This functionality can be implemented via the get_frame_info() function and dispatching to different decompressors.)

🟢 When ZDICT_finalizeDictionary() supports training a dictionary (with no custom dictionary), the first argument can be None: finalize_dict(zstd_dict, samples, dict_size, level). Compared to train_dict(samples, dict_size), it can specify the level. (likely)

future plans, python

🟒~~Use multi-phase init when it matures, then pyzstd module can support CPython sub-interpreters.~~ (implemented in 0.15.4, support subclass well.)

Depends on the progresses of CPython: - Subinterpreters for Python, https://lwn.net/Articles/820424/ - METH_METHOD flag metioned in PEP 573 can be used with more flag, otherwise have to disable subclass for ZstdDict/Compressor/Decompressor. Maybe need to wait until at least 3.11.

PEP 489 -- Multi-phase extension module initialization PEP 573 -- Module State Access from C Extension Methods

🟒 If the minimum version is 3.6: - use f-string. Its performance is better than % a bit. Currently string formatting is only used for exception message, so it's not a big problem. - remove #include "stdint.h", #include "pythread.h" in _zstdmodule.c. - try to add -fvisibility=hidden compile option, it reduces ~12KiB .so size. see commit https://github.com/animalize/pyzstd/commit/ab21add1e8d9b93e90eb49e62811159846934178. - remove this try...except in __init__.py, and related code in unit-test: python try: from os import PathLike except ImportError: # For Python 3.5 class PathLike: pass

🟒 If the minimum version is 3.7: - consider use METH_FASTCALL - remove #define Py_UNREACHABLE() assert(0) - remove this code in ZstdFile.read1(): python if size < 0: size = _32_KiB

🟒 If the minimum version is 3.8: - use := operator in ZstdFile, it's a bit faster.

known issues

🟑 ZstdDict.__init__(self, dict_content, is_raw=False) When dict_content is a normal dictionary, and set is_raw to True, the dictionary is NOT treated as raw dictionary. Very rare cases. If has magic number, it's probably a normal dict.

🟑 ~~When dynamically linking to zstd lib, compressionLevel_values.default may be wrong, it uses the value of ZSTD_CLEVEL_DEFAULT macro from zstd.h.~~ Very rare cases. Very few people modify ZSTD_CLEVEL_DEFAULT when building zstd lib. Fixed when zstd_version >= 1.5 and pyzstd_version >= 0.15

Ma Lin

Use Python, Android(Java).


zstd zstandard python