util: expose an "available" API on compression engines
When the zstd compression engine is introduced, it won't work in all
installations, namely pure Python installs. So, we need a mechanism to
declare whether a compression engine is available. We don't want to
conditionally register the compression engine because it is sometimes
useful to know when a compression engine name or encountered data is
valid but just not available versus unknown.
setup: compile zstd C extension
Now that zstd and python-zstandard are vendored, we can start compiling
them as part of the install.
python-zstandard provides a self-contained Python function that returns
a distutils.extension.Extension, so it is really easy to add zstd
to our setup.py without having to worry about defining source files,
include paths, etc. The function even allows specifying the module
name the extension should be compiled as. This conveniently allows us
to compile the module into the "mercurial" package so "our" version
won't collide with a version installed under the canonical "zstd"
module name.
zstd: vendor python-zstandard 0.5.0
As the commit message for the previous changeset says, we wish
for zstd to be a 1st class citizen in Mercurial. To make that
happen, we need to enable Python to talk to the zstd C API. And
that requires bindings.
This commit vendors a copy of existing Python bindings. Why do we
need to vendor? As the commit message of the previous commit says,
relying on systems in the wild to have the bindings or zstd present
is a losing proposition. By distributing the zstd and bindings with
Mercurial, we significantly increase our chances that zstd will
work. Since zstd will deliver a better end-user experience by
achieving better performance, this benefits our users. Another
reason is that the Python bindings still aren't stable and the
API is somewhat fluid. While Mercurial could be coded to target
multiple versions of the Python bindings, it is safer to bundle
an explicit, known working version.
The added Python bindings are mostly a fully-featured interface
to the zstd C API. They allow one-shot operations, streaming,
reading and writing from objects implements the file object
protocol, dictionary compression, control over low-level compression
parameters, and more. The Python bindings work on Python 2.6,
2.7, and 3.3+ and have been tested on Linux and Windows. There are
CFFI bindings, but they are lacking compared to the C extension.
Upstream work will be needed before we can support zstd with PyPy.
But it will be possible.
The files added in this commit come from Git commit
e637c1b214d5f869cf8116c550dcae23ec13b677 from
https://github.com/indygreg/python-zstandard and are added without
modifications. Some files from the upstream repository have been
omitted, namely files related to continuous integration.
In the spirit of full disclosure, I'm the maintainer of the
"python-zstandard" project and have authored 100% of the code
added in this commit. Unfortunately, the Python bindings have
not been formally code reviewed by anyone. While I've tested
much of the code thoroughly (I even have tests that fuzz APIs),
there's a good chance there are bugs, memory leaks, not well
thought out APIs, etc. If someone wants to review the code and
send feedback to the GitHub project, it would be greatly
appreciated.
Despite my involvement with both projects, my opinions of code
style differ from Mercurial's. The code in this commit introduces
numerous code style violations in Mercurial's linters. So, the code
is excluded from most lints. However, some violations I agree with.
These have been added to the known violations ignore list for now.
zstd: vendor zstd 1.1.1
zstd is a new compression format and it is awesome, yielding
higher compression ratios and significantly faster compression
and decompression operations compared to zlib (our current
compression engine of choice) across the board.
We want zstd to be a 1st class citizen in Mercurial and to eventually
be the preferred compression format for various operations.
This patch starts the formal process of supporting zstd by vendoring
a copy of zstd. Why do we need to vendor zstd? Good question.
First, zstd is relatively new and not widely available yet. If we
didn't vendor zstd or distribute it with Mercurial, most users likely
wouldn't have zstd installed or even available to install. What good
is a feature if you can't use it? Vendoring and distributing the zstd
sources gives us the highest liklihood that zstd will be available to
Mercurial installs.
Second, the Python bindings to zstd (which will be vendored in a
separate changeset) make use of zstd APIs that are only available
via static linking. One reason they are only available via static
linking is that they are unstable and could change at any time.
While it might be possible for the Python bindings to attempt to
talk to different versions of the zstd C library, the safest thing to
do is link against a specific, known-working version of zstd. This
is why the Python zstd bindings themselves vendor zstd and why we
must as well. This also explains why the added files are in a
"python-zstandard" directory.
The added files are from the 1.1.1 release of zstd (Git commit
4c0b44f8ced84c4c8edfa07b564d31e4fa3e8885 from
https://github.com/facebook/zstd) and are added without modifications.
Not all files from the zstd "distribution" have been added. Notably
missing are files to support interacting with "legacy," pre-1.0
versions of zstd. The decision of which files to include is made by
the upstream python-zstandard project (which I'm the author of). The
files in this commit are a snapshot of the files from the 0.5.0
release of that project, Git commit
e637c1b214d5f869cf8116c550dcae23ec13b677 from
https://github.com/indygreg/python-zstandard.
bdiff: give slight preference to removing trailing lines
[This change could be folded into the previous changeset to minimize the repo
churn ...]
Similar to the previous change, introduce an exception to the general
preference for matches in the middle of bdiff ranges: If the best match on the
B side starts at the beginning of the bdiff range, don't aim for the
middle-most A side match but for the earliest.
New (later) matches on the A side will only be considered better if the
corresponding match on the B side *not* is at the beginning of the range.
Thus, if the best (middle-most) match on the B side turns out to be at the
beginning of the range, the earliest match on the A side will be used.
The bundle size for 4.0 (hg bundle --base null -r 4.0 x.hg) happens to go from
22807275 to
22808120 bytes - a 0.004% increase.
bdiff: give slight preference to appending lines
[This change could be folded into the previous changeset to minimize the repo
churn ...]
The general preference to matches in the middle of bdiff ranges helps getting
balanced recursion and efficient computation. But, as previous changes have
shown, it might also give diffs that seems "obviously wrong".
To mitigate that: If the best match on the A side starts at the beginning of
the bdiff range, don't aim for the middle-most B side match but for the
earliest.
This will make the matches balanced (by both sides being "early") even though
the bisection will be less balanced. Still, this case only apply if the *best*
and middle-most match was fully unbalanced on the A side. Each recursion will
thus even in this worst case reduce the problem significantly and we are not
re-introducing the problem that was fixed in
f1ca249696ed.
The bundle size for 4.0 (hg bundle --base null -r 4.0 x.hg) happens to go from
22806817 to
22807275 bytes - a 0.002% increase.
This make the recent test-bdiff.py changes give a more pretty output ... but
they no longer show that the recursion is around middle matches (because it in
these cases isn't).
bdiff: give slight preference to longest matches in the middle of the B side
We already have a slight preference for matches close to the middle on the A
side. Now, do the same on the B side.
j is iterating the b range backwards and we thus accept a new j if the previous
match was in the upper half.
This makes the test-bhalf diff "correct". It obviously also gives more
preference to balanced recursion than to appending to sequences. That is kind
of correct, but will also unfortunately make some bundles bigger. No doubt, we
can also create examples where it will make them smaller ...
The bundle size for 4.0 (hg bundle --base null -r 4.0 x.hg) happens to go from
22803824 to
22806817 bytes - an 0.01% increase.