================
python-zstandard
================

This project provides a Python C extension for interfacing with the
`Zstandard <http://www.zstd.net>`_ compression library.

The primary goal of the extension is to provide a Pythonic interface to
the underlying C API. This means exposing most of the features and flexibility
of the C API while not sacrificing the usability or safety that Python provides.

| |ci-status| |win-ci-status|

State of Project
================

The project is officially in beta state. The author is reasonably satisfied
with the current API and that functionality works as advertised. There
may be some backwards incompatible changes before 1.0, though the author
does not intend to make any major changes to the Python API.

There is continuous integration for Python versions 2.6, 2.7, and 3.3+
on Linux x86_64 and Windows x86 and x86_64. The author is reasonably
confident the extension is stable and works as advertised on these
platforms.

Expected Changes
----------------

The author is reasonably confident in the current state of what's
implemented on the ``ZstdCompressor`` and ``ZstdDecompressor`` types.
Those APIs likely won't change significantly. Some low-level behavior
(such as the naming of arguments and the types they expect) may change.

There will likely be arguments added to control the input and output
buffer sizes (currently, certain operations read and write in chunk
sizes using zstd's preferred defaults).

There should be an API that accepts an object that conforms to the buffer
interface and returns an iterator over compressed or decompressed output.

The author is on the fence as to whether to support the extremely
low-level compression and decompression APIs. It could be useful to
support compression without the framing headers. But the author doesn't
believe it is a high priority at this time.

The CFFI bindings are half-baked and need to be finished.

Requirements
============

This extension is designed to run with Python 2.6, 2.7, 3.3, 3.4, and 3.5
on common platforms (Linux, Windows, and OS X). Only x86_64 is currently
well-tested as an architecture.

Installing
==========

This package is uploaded to PyPI at https://pypi.python.org/pypi/zstandard.
So, to install this package::

   $ pip install zstandard

Binary wheels are made available for some platforms. If you need to
install from a source distribution, all you should need is a working C
compiler and the Python development headers/libraries. On many Linux
distributions, you can install a ``python-dev`` or ``python-devel``
package to provide these dependencies.

Packages are also uploaded to Anaconda Cloud at
https://anaconda.org/indygreg/zstandard. See that URL for how to install
this package with ``conda``.
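
That typically amounts to something like the following (hedged; the exact
invocation may change over time, so prefer the instructions at the URL above)::

   $ conda install -c indygreg zstandard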

Performance
===========

Very crude and non-scientific benchmarking (most benchmarks fall in this
category because proper benchmarking is hard) shows that the Python bindings
perform within 10% of the native C implementation.

The following table compares the performance of compressing and decompressing
a 1.1 GB tar file comprised of the files in a Firefox source checkout. Values
obtained with the ``zstd`` program are on the left. The remaining columns detail
performance of various compression APIs in the Python bindings.

+-------+-----------------+-----------------+-----------------+---------------+
| Level | Native          | Simple          | Stream In       | Stream Out    |
|       | Comp / Decomp   | Comp / Decomp   | Comp / Decomp   | Comp          |
+=======+=================+=================+=================+===============+
|   1   | 490 / 1338 MB/s | 458 / 1266 MB/s | 407 / 1156 MB/s | 405 MB/s      |
+-------+-----------------+-----------------+-----------------+---------------+
|   2   | 412 / 1288 MB/s | 381 / 1203 MB/s | 345 / 1128 MB/s | 349 MB/s      |
+-------+-----------------+-----------------+-----------------+---------------+
|   3   | 342 / 1312 MB/s | 319 / 1182 MB/s | 285 / 1165 MB/s | 287 MB/s      |
+-------+-----------------+-----------------+-----------------+---------------+
|  11   |  64 / 1506 MB/s |  66 / 1436 MB/s |  56 / 1342 MB/s |  57 MB/s      |
+-------+-----------------+-----------------+-----------------+---------------+

Again, these are very unscientific. But they show that Python is capable of
compressing at several hundred MB/s and decompressing at over 1 GB/s.

Comparison to Other Python Bindings
===================================

https://pypi.python.org/pypi/zstd is an alternative Python binding to
Zstandard. At the time this was written, the latest release of that
package (1.0.0.2) had the following significant differences from this package:

* It only exposes the simple API for compression and decompression operations.
  This extension exposes the streaming API, dictionary training, and more.
* It adds a custom framing header to compressed data and there is no way to
  disable it. This means that data produced with that module cannot be used by
  other Zstandard implementations.

Bundling of Zstandard Source Code
=================================

The source repository for this project contains a vendored copy of the
Zstandard source code. This is done for a few reasons.

First, Zstandard is relatively new and not yet widely available as a system
package. Providing a copy of the source code enables the Python C extension
to be compiled without requiring the user to obtain the Zstandard source code
separately.

Second, Zstandard has both a stable *public* API and an *experimental* API.
The *experimental* API is actually quite useful (it contains functionality
for training dictionaries, for example), so it is something we wish to expose
to Python. However, the *experimental* API is only available via static
linking. Furthermore, the *experimental* API can change at any time. So,
control over the exact version of the Zstandard library linked against is
important to ensure known behavior.

Instructions for Building and Testing
=====================================

Once you have the source code, the extension can be built via setup.py::

   $ python setup.py build_ext

We recommend testing with ``nose``::

   $ nosetests

A Tox configuration is present to test against multiple Python versions::

   $ tox

Tests use the ``hypothesis`` Python package to perform fuzzing. If you
don't have it, those tests won't run.

There is also an experimental CFFI module. You need the ``cffi`` Python
package installed to build and test that.

To create a virtualenv with all development dependencies, do something
like the following::

   # Python 2
   $ virtualenv venv

   # Python 3
   $ python3 -m venv venv

   $ source venv/bin/activate
   $ pip install cffi hypothesis nose tox

API
===

The compiled C extension provides a ``zstd`` Python module. This module
exposes the following interfaces.

ZstdCompressor
--------------

The ``ZstdCompressor`` class provides an interface for performing
compression operations.

Each instance is associated with parameters that control compression
behavior. These come from the following named arguments (all optional):

level
   Integer compression level. Valid values are between 1 and 22.
dict_data
   Compression dictionary to use.

   Note: When using dictionary data and ``compress()`` is called multiple
   times, the ``CompressionParameters`` derived from an integer compression
   ``level`` and the first compressed data's size will be reused for all
   subsequent operations. This may not be desirable if source data size
   varies significantly.
compression_params
   A ``CompressionParameters`` instance (overrides the ``level`` value).
write_checksum
   Whether a 4 byte checksum should be written with the compressed data.
   Defaults to False. If True, the decompressor can verify that decompressed
   data matches the original input data.
write_content_size
   Whether the size of the uncompressed data will be written into the
   header of compressed data. Defaults to False. The data will only be
   written if the compressor knows the size of the input data. This is
   likely not true for streaming compression.
write_dict_id
   Whether to write the dictionary ID into the compressed data.
   Defaults to True. The dictionary ID is only written if a dictionary
   is being used.
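
For example, here is a minimal sketch combining several of these arguments to
produce frames that carry a checksum and (for one-shot compression, where the
input size is known) the content size::

   cctx = zstd.ZstdCompressor(level=10,
                              write_checksum=True,
                              write_content_size=True)
   compressed = cctx.compress(b'data to compress')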

Simple API
^^^^^^^^^^

``compress(data)`` compresses and returns data as a one-shot operation.::

   cctx = zstd.ZstdCompressor()
   compressed = cctx.compress(b'data to compress')

Streaming Input API
^^^^^^^^^^^^^^^^^^^

``write_to(fh)`` (which behaves as a context manager) allows you to *stream*
data into a compressor.::

   cctx = zstd.ZstdCompressor(level=10)
   with cctx.write_to(fh) as compressor:
       compressor.write(b'chunk 0')
       compressor.write(b'chunk 1')
       ...

The argument to ``write_to()`` must have a ``write(data)`` method. As
compressed data is available, ``write()`` will be called with the compressed
data as its argument. Many common Python types implement ``write()``, including
open file handles and ``io.BytesIO``.

``write_to()`` returns an object representing a streaming compressor instance.
It **must** be used as a context manager. That object's ``write(data)`` method
is used to feed data into the compressor.

If the size of the data being fed to this streaming compressor is known,
you can declare it before compression begins::

   cctx = zstd.ZstdCompressor()
   with cctx.write_to(fh, size=data_len) as compressor:
       compressor.write(chunk0)
       compressor.write(chunk1)
       ...

Declaring the size of the source data allows compression parameters to
be tuned. And if ``write_content_size`` is used, it also results in the
content size being written into the frame header of the output data.

The size of the chunks written via ``write()`` to the destination can be
specified::

   cctx = zstd.ZstdCompressor()
   with cctx.write_to(fh, write_size=32768) as compressor:
       ...

To see how much memory is being used by the streaming compressor::

   cctx = zstd.ZstdCompressor()
   with cctx.write_to(fh) as compressor:
       ...
       byte_size = compressor.memory_size()

Streaming Output API
^^^^^^^^^^^^^^^^^^^^

``read_from(reader)`` provides a mechanism to stream data out of a compressor
as an iterator of data chunks.::

   cctx = zstd.ZstdCompressor()
   for chunk in cctx.read_from(fh):
       # Do something with emitted data.

``read_from()`` accepts an object that has a ``read(size)`` method or conforms
to the buffer protocol. (``bytes`` and ``memoryview`` are 2 common types that
provide the buffer protocol.)

Uncompressed data is fetched from the source either by calling ``read(size)``
or by fetching a slice of data from the object directly (in the case where
the buffer protocol is being used). The returned iterator consists of chunks
of compressed data.
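
For example, a ``bytes`` instance can be fed in directly and the emitted
chunks joined to obtain the full compressed output (a minimal sketch)::

   cctx = zstd.ZstdCompressor()
   compressed = b''.join(cctx.read_from(b'data to compress' * 1024))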

Like ``write_to()``, ``read_from()`` also accepts a ``size`` argument
declaring the size of the input stream::

   cctx = zstd.ZstdCompressor()
   for chunk in cctx.read_from(fh, size=some_int):
       pass

You can also control the size of each ``read()`` from the source and
the ideal size of output chunks::

   cctx = zstd.ZstdCompressor()
   for chunk in cctx.read_from(fh, read_size=16384, write_size=8192):
       pass

Stream Copying API
^^^^^^^^^^^^^^^^^^

``copy_stream(ifh, ofh)`` can be used to copy data between 2 streams while
compressing it.::

   cctx = zstd.ZstdCompressor()
   cctx.copy_stream(ifh, ofh)

For example, say you wish to compress a file::

   cctx = zstd.ZstdCompressor()
   with open(input_path, 'rb') as ifh, open(output_path, 'wb') as ofh:
       cctx.copy_stream(ifh, ofh)

It is also possible to declare the size of the source stream::

   cctx = zstd.ZstdCompressor()
   cctx.copy_stream(ifh, ofh, size=len_of_input)

You can also specify the sizes of the chunks that are ``read()`` from the
source and written to the destination::

   cctx = zstd.ZstdCompressor()
   cctx.copy_stream(ifh, ofh, read_size=32768, write_size=16384)

The stream copier returns a 2-tuple of bytes read and written::

   cctx = zstd.ZstdCompressor()
   read_count, write_count = cctx.copy_stream(ifh, ofh)

Compressor API
^^^^^^^^^^^^^^

``compressobj()`` returns an object that exposes ``compress(data)`` and
``flush()`` methods. Each returns compressed data or an empty ``bytes``.

The purpose of ``compressobj()`` is to provide an API-compatible interface
with ``zlib.compressobj`` and ``bz2.BZ2Compressor``. This allows callers to
swap in different compressor objects while using the same API.

Once ``flush()`` is called, the compressor will no longer accept new data
to ``compress()``. ``flush()`` **must** be called to end the compression
context. If it is not called, the returned data may be incomplete.

Here is how this API should be used::

   cctx = zstd.ZstdCompressor()
   cobj = cctx.compressobj()
   data = cobj.compress(b'raw input 0')
   data = cobj.compress(b'raw input 1')
   data = cobj.flush()

For best performance results, keep input chunks under 256KB. This avoids
extra allocations for a large output object.

It is possible to declare the input size of the data that will be fed into
the compressor::

   cctx = zstd.ZstdCompressor()
   cobj = cctx.compressobj(size=6)
   data = cobj.compress(b'foobar')
   data = cobj.flush()
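
Because the interface matches ``zlib.compressobj`` and ``bz2.BZ2Compressor``,
generic code can accept any of them. Here is a hedged sketch (the
``compress_chunks`` helper is hypothetical, not part of this module)::

   def compress_chunks(compressor, chunks):
       # Works with any object exposing zlib-style compress()/flush()
       # methods, e.g. zlib.compressobj(), bz2.BZ2Compressor(), or
       # zstd.ZstdCompressor().compressobj().
       out = []
       for chunk in chunks:
           out.append(compressor.compress(chunk))
       out.append(compressor.flush())
       return b''.join(out)

   data = compress_chunks(zstd.ZstdCompressor().compressobj(),
                          [b'raw input 0', b'raw input 1'])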

ZstdDecompressor
----------------

The ``ZstdDecompressor`` class provides an interface for performing
decompression.

Each instance is associated with parameters that control decompression. These
come from the following named arguments (all optional):

dict_data
   Compression dictionary to use.

The interface of this class is very similar to ``ZstdCompressor`` (by design).

Simple API
^^^^^^^^^^

``decompress(data)`` can be used to decompress an entire compressed zstd
frame in a single operation.::

   dctx = zstd.ZstdDecompressor()
   decompressed = dctx.decompress(data)

By default, ``decompress(data)`` will only work on data written with the content
size encoded in its header. This can be achieved by creating a
``ZstdCompressor`` with ``write_content_size=True``. If compressed data without
an embedded content size is seen, ``zstd.ZstdError`` will be raised.

If the compressed data doesn't have its content size embedded within it,
decompression can be attempted by specifying the ``max_output_size``
argument.::

   dctx = zstd.ZstdDecompressor()
   uncompressed = dctx.decompress(data, max_output_size=1048576)

Ideally, ``max_output_size`` will be identical to the decompressed output
size.

If ``max_output_size`` is too small to hold the decompressed data,
``zstd.ZstdError`` will be raised.

If ``max_output_size`` is larger than the decompressed data, the allocated
output buffer will be resized to only use the space required.

Please note that an allocation of the requested ``max_output_size`` will be
performed every time the method is called. Setting this to a very large value
could result in a lot of work for the memory allocator and may result in
``MemoryError`` being raised if the allocation fails.

If the exact size of decompressed data is unknown, it is **strongly**
recommended to use a streaming API.

Streaming Input API
^^^^^^^^^^^^^^^^^^^

``write_to(fh)`` can be used to incrementally send compressed data to a
decompressor.::

   dctx = zstd.ZstdDecompressor()
   with dctx.write_to(fh) as decompressor:
       decompressor.write(compressed_data)

This behaves similarly to ``zstd.ZstdCompressor``: compressed data is written to
the decompressor by calling ``write(data)`` and decompressed output is written
to the output object by calling its ``write(data)`` method.

The size of the chunks written via ``write()`` to the destination can be
specified::

   dctx = zstd.ZstdDecompressor()
   with dctx.write_to(fh, write_size=16384) as decompressor:
       pass

You can see how much memory is being used by the decompressor::

   dctx = zstd.ZstdDecompressor()
   with dctx.write_to(fh) as decompressor:
       byte_size = decompressor.memory_size()

Streaming Output API
^^^^^^^^^^^^^^^^^^^^

``read_from(fh)`` provides a mechanism to stream decompressed data out of a
compressed source as an iterator of data chunks.::

   dctx = zstd.ZstdDecompressor()
   for chunk in dctx.read_from(fh):
       # Do something with original data.

``read_from()`` accepts either a) an object with a ``read(size)`` method that
will return compressed bytes or b) an object conforming to the buffer protocol
that can expose its data as a contiguous range of bytes. The ``bytes`` and
``memoryview`` types expose this buffer protocol.

``read_from()`` returns an iterator whose elements are chunks of the
decompressed data.

The size of each ``read()`` requested from the source can be specified::

   dctx = zstd.ZstdDecompressor()
   for chunk in dctx.read_from(fh, read_size=16384):
       pass

It is also possible to skip leading bytes in the input data::

   dctx = zstd.ZstdDecompressor()
   for chunk in dctx.read_from(fh, skip_bytes=1):
       pass

Skipping leading bytes is useful if the source data contains extra
*header* data but you want to avoid the overhead of making a buffer copy
or allocating a new ``memoryview`` object in order to decompress the data.

Similarly to ``ZstdCompressor.read_from()``, the consumer of the iterator
controls when data is decompressed. If the iterator isn't consumed,
decompression is put on hold.

When ``read_from()`` is passed an object conforming to the buffer protocol,
the behavior may seem similar to what occurs when the simple decompression
API is used. However, this API works when the decompressed size is unknown.
Furthermore, if feeding large inputs, the decompressor will work in chunks
instead of performing a single operation.

Stream Copying API
^^^^^^^^^^^^^^^^^^

``copy_stream(ifh, ofh)`` can be used to copy data across 2 streams while
performing decompression.::

   dctx = zstd.ZstdDecompressor()
   dctx.copy_stream(ifh, ofh)

e.g. to decompress a file to another file::

   dctx = zstd.ZstdDecompressor()
   with open(input_path, 'rb') as ifh, open(output_path, 'wb') as ofh:
       dctx.copy_stream(ifh, ofh)

The sizes of the chunks that are ``read()`` from the source and written to
the destination can be specified::

   dctx = zstd.ZstdDecompressor()
   dctx.copy_stream(ifh, ofh, read_size=8192, write_size=16384)

Decompressor API
^^^^^^^^^^^^^^^^

``decompressobj()`` returns an object that exposes a ``decompress(data)``
method. Compressed data chunks are fed into ``decompress(data)`` and
uncompressed output (or an empty ``bytes``) is returned. Output from subsequent
calls needs to be concatenated to reassemble the full decompressed byte
sequence.

The purpose of ``decompressobj()`` is to provide an API-compatible interface
with ``zlib.decompressobj`` and ``bz2.BZ2Decompressor``. This allows callers
to swap in different decompressor objects while using the same API.

Each object is single use: once an input frame is decoded, ``decompress()``
can no longer be called.

Here is how this API should be used::

   dctx = zstd.ZstdDecompressor()
   dobj = dctx.decompressobj()
   data = dobj.decompress(compressed_chunk_0)
   data = dobj.decompress(compressed_chunk_1)

Choosing an API
---------------

Various forms of compression and decompression APIs are provided because each
is suitable for different use cases.

The simple/one-shot APIs are useful for small data, when the decompressed
data size is known (either recorded in the zstd frame header via
``write_content_size`` or known via an out-of-band mechanism, such as a file
size).

A limitation of the simple APIs is that input or output data must fit in memory.
And unless using advanced tricks with Python *buffer objects*, both input and
output must fit in memory simultaneously.

Another limitation is that compression or decompression is performed as a single
operation. So if you feed large input, it could take a long time for the
function to return.

The streaming APIs do not have the limitations of the simple API. The cost of
this is that they are more complex to use than a single function call.

The streaming APIs put the caller in control of compression and decompression
behavior by allowing them to directly control either the input or output side
of the operation.

With the streaming input APIs, the caller feeds data into the compressor or
decompressor as they see fit. Output data will only be written after the caller
has explicitly written data.

With the streaming output APIs, the caller consumes output from the compressor
or decompressor as they see fit. The compressor or decompressor will only
consume data from the source when the caller is ready to receive it.

One end of the streaming APIs involves a file-like object that must
``write()`` output data or ``read()`` input data. Depending on what the
backing storage for these objects is, those operations may not complete quickly.
For example, when streaming compressed data to a file, the ``write()`` into
a streaming compressor could result in a ``write()`` to the filesystem, which
may take a long time to finish due to slow I/O on the filesystem. So, there
may be overhead in streaming APIs beyond the compression and decompression
operations.
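
As a concrete illustration, here is a sketch of decompressing a file whose
decompressed size is unknown while keeping memory usage bounded (the
``input_path`` and ``output_path`` names are placeholders)::

   dctx = zstd.ZstdDecompressor()
   with open(input_path, 'rb') as ifh, open(output_path, 'wb') as ofh:
       for chunk in dctx.read_from(ifh):
           ofh.write(chunk)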

Dictionary Creation and Management
----------------------------------

Zstandard allows *dictionaries* to be used when compressing and
decompressing data. The idea is that if you are compressing a lot of similar
data, you can precompute common properties of that data (such as recurring
byte sequences) to achieve better compression ratios.

In Python, compression dictionaries are represented as the
``ZstdCompressionDict`` type.

Instances can be constructed from bytes::

   dict_data = zstd.ZstdCompressionDict(data)

More interestingly, instances can be created by *training* on sample data::

   dict_data = zstd.train_dictionary(size, samples)

This takes a list of bytes instances and creates and returns a
``ZstdCompressionDict``.
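
For instance, the samples might be gathered from a set of similar files on
disk (a sketch; ``sample_paths`` is a placeholder for wherever your samples
live)::

   samples = []
   for path in sample_paths:
       with open(path, 'rb') as fh:
           samples.append(fh.read())

   dict_data = zstd.train_dictionary(16384, samples)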

You can see how many bytes are in the dictionary by calling ``len()``::

   dict_data = zstd.train_dictionary(size, samples)
   dict_size = len(dict_data)  # will not be larger than ``size``

Once you have a dictionary, you can pass it to the objects performing
compression and decompression::

   dict_data = zstd.train_dictionary(16384, samples)

   cctx = zstd.ZstdCompressor(dict_data=dict_data)
   for source_data in input_data:
       compressed = cctx.compress(source_data)
       # Do something with compressed data.

   dctx = zstd.ZstdDecompressor(dict_data=dict_data)
   for compressed_data in input_data:
       buffer = io.BytesIO()
       with dctx.write_to(buffer) as decompressor:
           decompressor.write(compressed_data)
       # Do something with raw data in ``buffer``.

Dictionaries have unique integer IDs. You can retrieve this ID via::

   dict_id = zstd.dictionary_id(dict_data)

You can obtain the raw data in the dict (useful for persisting and constructing
a ``ZstdCompressionDict`` later) via ``as_bytes()``::

   dict_data = zstd.train_dictionary(size, samples)
   raw_data = dict_data.as_bytes()

Explicit Compression Parameters
-------------------------------

Zstandard's integer compression levels, along with the input size and dictionary
size, are converted into a data structure defining multiple parameters that tune
the behavior of the compression algorithm. It is possible to define this
data structure explicitly to have lower-level control over compression behavior.

The ``zstd.CompressionParameters`` type represents this data structure.
You can see how Zstandard converts compression levels to this data structure
by calling ``zstd.get_compression_parameters()``. e.g.::

   params = zstd.get_compression_parameters(5)

This function also accepts the uncompressed data size and dictionary size
to adjust parameters::

   params = zstd.get_compression_parameters(3, source_size=len(data), dict_size=len(dict_data))

You can also construct compression parameters from their low-level components::

   params = zstd.CompressionParameters(20, 6, 12, 5, 4, 10, zstd.STRATEGY_FAST)

You can then configure a compressor to use the custom parameters::

   cctx = zstd.ZstdCompressor(compression_params=params)

The members of the ``CompressionParameters`` tuple are as follows:

* 0 - Window log
* 1 - Chain log
* 2 - Hash log
* 3 - Search log
* 4 - Search length
* 5 - Target length
* 6 - Strategy (one of the ``zstd.STRATEGY_`` constants)
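
Since the type is described as a tuple, individual members should be readable
by index. A small sketch, assuming the index assignments above::

   params = zstd.get_compression_parameters(5)
   window_log = params[0]
   strategy = params[6]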

You'll need to read the Zstandard documentation for what these parameters
do.

Misc Functionality
------------------

estimate_compression_context_size(CompressionParameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Given a ``CompressionParameters`` struct, estimate the memory size required
to perform compression.

estimate_decompression_context_size()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Estimate the memory size requirements for a decompressor instance.
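
A short sketch showing both estimators::

   params = zstd.get_compression_parameters(3)
   compression_bytes = zstd.estimate_compression_context_size(params)
   decompression_bytes = zstd.estimate_decompression_context_size()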

Constants
---------

The following module constants/attributes are exposed:

ZSTD_VERSION
   This module attribute exposes a 3-tuple of the Zstandard version. e.g.
   ``(1, 0, 0)``
MAX_COMPRESSION_LEVEL
   Integer max compression level accepted by compression functions
COMPRESSION_RECOMMENDED_INPUT_SIZE
   Recommended chunk size to feed to compressor functions
COMPRESSION_RECOMMENDED_OUTPUT_SIZE
   Recommended chunk size for compression output
DECOMPRESSION_RECOMMENDED_INPUT_SIZE
   Recommended chunk size to feed into decompressor functions
DECOMPRESSION_RECOMMENDED_OUTPUT_SIZE
   Recommended chunk size for decompression output

FRAME_HEADER
   bytes containing the header of the Zstandard frame
MAGIC_NUMBER
   Frame header as an integer

WINDOWLOG_MIN
   Minimum value for compression parameter
WINDOWLOG_MAX
   Maximum value for compression parameter
CHAINLOG_MIN
   Minimum value for compression parameter
CHAINLOG_MAX
   Maximum value for compression parameter
HASHLOG_MIN
   Minimum value for compression parameter
HASHLOG_MAX
   Maximum value for compression parameter
SEARCHLOG_MIN
   Minimum value for compression parameter
SEARCHLOG_MAX
   Maximum value for compression parameter
SEARCHLENGTH_MIN
   Minimum value for compression parameter
SEARCHLENGTH_MAX
   Maximum value for compression parameter
TARGETLENGTH_MIN
   Minimum value for compression parameter
TARGETLENGTH_MAX
   Maximum value for compression parameter
STRATEGY_FAST
   Compression strategy
STRATEGY_DFAST
   Compression strategy
STRATEGY_GREEDY
   Compression strategy
STRATEGY_LAZY
   Compression strategy
STRATEGY_LAZY2
   Compression strategy
STRATEGY_BTLAZY2
   Compression strategy
STRATEGY_BTOPT
   Compression strategy
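
As an illustrative sketch (assuming ``FRAME_HEADER`` is the raw 4-byte magic
prefix described above and ``path`` is a placeholder), the recommended sizes
and frame constants can be used like so::

   cctx = zstd.ZstdCompressor()
   cobj = cctx.compressobj()
   chunks = []
   with open(path, 'rb') as fh:
       while True:
           chunk = fh.read(zstd.COMPRESSION_RECOMMENDED_INPUT_SIZE)
           if not chunk:
               break
           chunks.append(cobj.compress(chunk))
   chunks.append(cobj.flush())
   data = b''.join(chunks)

   # The compressed frame should begin with the Zstandard magic number.
   assert data.startswith(zstd.FRAME_HEADER)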

Note on Zstandard's *Experimental* API
======================================

Many of the Zstandard APIs used by this module are marked as *experimental*
within the Zstandard project. This includes a large number of useful
features, such as compression and frame parameters and parts of dictionary
compression.

It is unclear how Zstandard's C API will evolve over time, especially with
regards to this *experimental* functionality. We will try to maintain
backwards compatibility at the Python API level. However, we cannot
guarantee this for things not under our control.

Since a copy of the Zstandard source code is distributed with this
module and since we compile against it, the behavior of a specific
version of this module should be constant over time. So if you
pin the version of this module used in your projects (which is a Python
best practice), you should be insulated from unwanted future changes.

Donate
======

A lot of time has been invested into this project by the author.

If you find this project useful and would like to thank the author for
their work, consider donating some money. Any amount is appreciated.

.. image:: https://www.paypalobjects.com/en_US/i/btn/btn_donate_LG.gif
    :target: https://www.paypal.com/cgi-bin/webscr?cmd=_donations&business=gregory%2eszorc%40gmail%2ecom&lc=US&item_name=python%2dzstandard&currency_code=USD&bn=PP%2dDonationsBF%3abtn_donate_LG%2egif%3aNonHosted
    :alt: Donate via PayPal

.. |ci-status| image:: https://travis-ci.org/indygreg/python-zstandard.svg?branch=master
    :target: https://travis-ci.org/indygreg/python-zstandard

.. |win-ci-status| image:: https://ci.appveyor.com/api/projects/status/github/indygreg/python-zstandard?svg=true
    :target: https://ci.appveyor.com/project/indygreg/python-zstandard
    :alt: Windows build status