Jun Wu <quark@fb.com> [Fri, 09 Mar 2018 14:24:27 -0800] rev 36822
xdiff: replace {unsigned ,}long with {u,}int64_t
MSVC treats "long" as 4-byte. That could cause overflows since the xdiff
code uses "long" in places where "size_t" or "ssize_t" should be used.
Let's use explicit 8 byte integers to avoid
FWIW git avoids that overflow by limiting diff size to 1GB [1]. After
examining the code, I think the remaining risk (the use of "int") is low
since "int" is only used for return values and hash table size. Although a
wrong hash table size would not affect the correctness of the code, but that
could make the code extremely slow. The next patch will change hash table
size to 8-byte integer so the 1GB limit is unlikely needed.
This patch was done by using `sed`.
[1]: https://github.com/git/git/commit/
dcd1742e56ebb944c4ff62346da4548e1e3be67
Differential Revision: https://phab.mercurial-scm.org/D2762
Jun Wu <quark@fb.com> [Sun, 04 Mar 2018 11:30:16 -0800] rev 36821
xdiff: add comments for fields in xdfile_t
This makes the related code easier to understand.
Differential Revision: https://phab.mercurial-scm.org/D2685
Jun Wu <quark@fb.com> [Wed, 07 Mar 2018 14:45:31 -0800] rev 36820
xdiff: add a preprocessing step that trims files
xdiff has a `xdl_trim_ends` step that removes common lines, unmatchable
lines. That is in theory good, but happens too late - after splitting,
hashing, and adjusting the hash values so they are unique. Those splitting,
hashing and adjusting hash values steps could have noticeable overhead.
Diffing two large files with minor (one-line-ish) changes are not uncommon.
In that case, the raw performance of those preparation steps seriously
matter. Even allocating an O(N) array and storing line offsets to it is
expensive. Therefore my previous attempts [1] [2] cannot be good enough
since they do not remove the O(N) array assignment.
This patch adds a preprocessing step - `xdl_trim_files` that runs before
other preprocessing steps. It counts common prefix and suffix and lines in
them (needed for displaying line number), without doing anything else.
Testing with a crafted large (169MB) file, with minor change:
```
open('a','w').write(''.join('%s\n' % (i % 100000) for i in xrange(
30000000) if i != 6000000))
open('b','w').write(''.join('%s\n' % (i % 100000) for i in xrange(
30000000) if i != 6003000))
```
Running xdiff by a simple binary [3], this patch improves the xdiff perf by
more than 10x for the above case:
```
# xdiff before this patch
2.41s user 1.13s system 98% cpu 3.592 total
# xdiff after this patch
0.14s user 0.16s system 98% cpu 0.309 total
# gnu diffutils
0.12s user 0.15s system 98% cpu 0.272 total
# (best of 20 runs)
```
It's still slightly slower than GNU diffutils. But it's pretty close now.
Testing with real repo data:
For the whole repo, this patch makes xdiff 25% faster:
```
# hg perfbdiff --count 100 --alldata -c
d334afc585e2 --blocks [--xdiff]
# xdiff, after
! wall 0.058861 comb 0.050000 user 0.050000 sys 0.000000 (best of 100)
# xdiff, before
! wall 0.077816 comb 0.080000 user 0.080000 sys 0.000000 (best of 91)
# bdiff
! wall 0.117473 comb 0.120000 user 0.120000 sys 0.000000 (best of 67)
```
For files that are long (ex. commands.py), the speedup is more than 3x, very
significant:
```
# hg perfbdiff --count 3000 --blocks commands.py.i 1 [--xdiff]
# xdiff, after
! wall 0.690583 comb 0.690000 user 0.690000 sys 0.000000 (best of 12)
# xdiff, before
! wall 2.240361 comb 2.210000 user 2.210000 sys 0.000000 (best of 4)
# bdiff
! wall 2.469852 comb 2.440000 user 2.440000 sys 0.000000 (best of 4)
```
[1]: https://phab.mercurial-scm.org/D2631
[2]: https://phab.mercurial-scm.org/D2634
[3]:
```
// Code to run xdiff from command line. No proper error handling.
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include "mercurial/thirdparty/xdiff/xdiff.h"
#define ensure(x) if (!(x)) exit(255);
mmfile_t readfile(const char *path) {
struct stat st; int fd = open(path, O_RDONLY);
fstat(fd, &st); mmfile_t file = { malloc(st.st_size), st.st_size };
ensure(read(fd, file.ptr, st.st_size) == st.st_size); close(fd);
return file;
}
int main(int argc, char const *argv[]) {
mmfile_t a = readfile(argv[1]), b = readfile(argv[2]);
xpparam_t xpp = {0}; xdemitconf_t xecfg = {0}; xdemitcb_t ecb = {0};
xdl_diff(&a, &b, &xpp, &xecfg, &ecb);
return 0;
}
```
Differential Revision: https://phab.mercurial-scm.org/D2686
Martin von Zweigbergk <martinvonz@google.com> [Fri, 09 Mar 2018 14:30:15 -0800] rev 36819
transaction: add a name and a __repr__ implementation (API)
This has been useful for me for debugging.
Differential Revision: https://phab.mercurial-scm.org/D2758
Joerg Sonnenberger <joerg@bec.de> [Fri, 09 Mar 2018 16:10:55 +0100] rev 36818
phabricator: update doc string for deprecated token argument
Differential Revision: https://phab.mercurial-scm.org/D2755
Joerg Sonnenberger <joerg@bec.de> [Fri, 09 Mar 2018 16:09:27 +0100] rev 36817
phabricator: print deprecation warning only once
Differential Revision: https://phab.mercurial-scm.org/D2754
Martin von Zweigbergk <martinvonz@google.com> [Thu, 08 Mar 2018 21:17:26 -0800] rev 36816
tests: add a few tests involving --collapse and rebase.singletransaction=1
I'm about to change the rebase code quite a bit and this was poorly
tested.
Differential Revision: https://phab.mercurial-scm.org/D2757
Martin von Zweigbergk <martinvonz@google.com> [Thu, 08 Mar 2018 20:55:51 -0800] rev 36815
tests: simplify test-rebase-transaction.t
The file was extracted from test-rebase-base.t in
8cef8f7d51d0
(test-rebase-base: clarify it is about the "--base" flag,
2017-10-05). This patch follows up that and clarifies the new file's
purpose and simplifies it a bit.
Differential Revision: https://phab.mercurial-scm.org/D2756
Gregory Szorc <gregory.szorc@gmail.com> [Thu, 08 Mar 2018 16:22:25 -0800] rev 36814
hgweb: parse and store HTTP request headers
WSGI transmits HTTP request headers as HTTP_* environment variables.
We teach our parser about these and hook up a dict-like data
structure that supports case insensitive header manipulation.
Differential Revision: https://phab.mercurial-scm.org/D2742
Gregory Szorc <gregory.szorc@gmail.com> [Thu, 08 Mar 2018 16:43:32 -0800] rev 36813
wireprotoserver: remove broken optimization for non-httplib client
There was an experimental non-httplib client in core for several
years. It was removed a week or so ago.
We kept the optimization for this client in the server code. I'm
not sure if that was intended or not. But it doesn't matter: the
code was wrong.
Because the code was accessing a WSGI environment dict, it needed to
access the HTTP_X_HGHTTP2 key to actually read the HTTP header. So
the code deleted by this commit wasn't actually doing anything
meaningful. Doh.
Differential Revision: https://phab.mercurial-scm.org/D2741
Gregory Szorc <gregory.szorc@gmail.com> [Thu, 08 Mar 2018 15:58:52 -0800] rev 36812
wireprotoserver: move all wire protocol handling logic out of hgweb
Previous patches from several days ago worked to isolate processing
of HTTP wire protocol requests to wireprotoserver. We still had a
little logic in hgweb. If feels like the right time to finish the
job.
This commit moves WSGI request servicing from hgweb to wireprotoserver.
The ugly dict holding the parsed request is no more. I think the new
code is cleaner.
As part of this, we now process wire protocol requests before the
block to obtain the "query" variable. This makes it clear that this
wonky "query" variable is not used by the wire protocol.
The wonkiest part about this code is the HTTP 404. I'm actually not
sure what all is going on here. It looks like the code is trying to
prevent URL with path components that specify a command from not
working. That part I grok. What I don't grok is why we need to send
a 404. I would think it would be OK to no-op and let another handler
try to service the request. But if we do this, we get some subrepo
test failures. So it looks like something is expecting the HTTP 404
and reacting to it in a specific way. It /might/ be possible to
change the behavior here. But it isn't something I'm comfortable
doing because I don't understand the problem space.
Differential Revision: https://phab.mercurial-scm.org/D2740
Gregory Szorc <gregory.szorc@gmail.com> [Thu, 08 Mar 2018 15:37:05 -0800] rev 36811
hgweb: use parsed request to construct query parameters
The way hgweb routes requests is kind of bonkers. If PATH_INFO is
set, we take the URL path after the repository. Otherwise, we take
the first part of the query string before "&" and the part before
";" in that.
We then kinda/sorta treat this as a path and route based on that.
This commit ports that code to use the parsed request object. This
required a new attribute on the parsed request to indicate whether
there is any PATH_INFO.
The new code still feels a bit convoluted for my liking. But we'll
need to rewrite more of the code before a better solution becomes
apparant. This code feels strictly better since we're no longer
doing low-level WSGI manipulation during routing.
Differential Revision: https://phab.mercurial-scm.org/D2739
Gregory Szorc <gregory.szorc@gmail.com> [Thu, 08 Mar 2018 11:33:33 -0800] rev 36810
hgweb: only recognize wire protocol commands from query string (BC)
Previously, we attempted to parse the wire protocol command from
`req.form`. Data could have come from the query string or POST
form data.
The wire protocol states that the command must be declared in the
query string. And AFAICT all Mercurial releases from at least 1.0
send the command in the query string.
So let's actual require this behavior.
This is technically BC. But I'm not sure how anyone in the wild
would encounter this. POST has historically been used for sending
bundle data. So there's no opportunity to encode arguments there.
And the experimental HTTP POST args also takes over the body. So
the only way someone would be impacted by this is if they wrote
a custom client that both used POST for everything and sent arguments
via the HTTP body. I don't believe such a client exists.
.. bc::
The HTTP wire protocol server no longer accepts the ``cmd``
argument to control which command to run via HTTP POST bodies.
The ``cmd`` argument must be specified on the URL query string.
Differential Revision: https://phab.mercurial-scm.org/D2738
Gregory Szorc <gregory.szorc@gmail.com> [Thu, 08 Mar 2018 11:21:46 -0800] rev 36809
hgweb: teach WSGI parser about query strings
Currently, req.form uses cgi.parse() to populate form data. Depending
on the request, form data can come from POST multipart/form-data,
application/x-www-form-urlencoded, or the URL query string.
Putting all these things into one data structure makes it difficult
to reason about how exactly parameters got to the request. It can
lead to wonkiness such as pulling parameters from both the URL and
POST data.
This commit teaches our WSGI request parser about argument data
in query strings. We populate fields containing the query string
data and only the query string data so it can't be confused with
POST data.
Differential Revision: https://phab.mercurial-scm.org/D2737
Gregory Szorc <gregory.szorc@gmail.com> [Thu, 08 Mar 2018 15:08:20 -0800] rev 36808
hgweb: use the parsed application path directly
Previously, we assigned a custom system string with a trailing slash
to wsgirequest.url.
The addition of the trailing slash felt arbitrary and seems to go
against how things typically work in WSGI.
We also want our URLs to be bytes, not system strings.
And, assigning a custom attribute to wsgirequest felt wrong.
This commit fixes all those things by removing the trailing
slash from the app path, changing consumers to use that variable
and to use it without a trailing slash, and removing the custom
attribute from wsgirequest.
We preserve the trailing slash on {url}. Also, makebreadcrumb
strips the trailing slash. So no change to it was needed.
Differential Revision: https://phab.mercurial-scm.org/D2736
Gregory Szorc <gregory.szorc@gmail.com> [Thu, 08 Mar 2018 12:59:25 -0800] rev 36807
hgweb: use computed base URL from parsed request
Let's not reinvent URL construction in a function that runs the
templater.
Differential Revision: https://phab.mercurial-scm.org/D2735
Gregory Szorc <gregory.szorc@gmail.com> [Sat, 10 Mar 2018 10:20:51 -0800] rev 36806
hgweb: parse WSGI request into a data structure
Currently, our WSGI applications (hgweb_mod and hgwebdir_mod) process
the raw WSGI request instance themselves. This means they have to
talk in terms of system strings. And they need to know details
about what's in the WSGI request. And in the case of hgweb_mod, it
is doing some very funky things with URL parsing to impact
dispatching. The code is difficult to read and maintain.
This commit introduces parsing of the WSGI request into a higher-level
and easier-to-reason-about data structure.
To prove it works, we hook it up to hgweb_mod and use it for populating
the relative URL on the request instance.
We hold off on using it in more places because the logic in hgweb_mod
is crazy and I don't want to involve those changes with review of
the parsing code.
The URL construction code has variations that use the HTTP: Host header
(the canonical WSGI way of reconstructing the URL) and with the use
of SERVER_NAME. We need to differentiate because hgweb is currently
using SERVER_NAME for URL construction.
Differential Revision: https://phab.mercurial-scm.org/D2734
Gregory Szorc <gregory.szorc@gmail.com> [Thu, 08 Mar 2018 15:14:32 -0800] rev 36805
hgweb: always use "?" when writing session vars
This code resolves a string to insert in URLs as part of a
query string. Essentially, it resolves the {sessionvars}
template keyword, which is used by hgweb templates to build
a URL as a string.
The whole approach here feels wrong because there's no way of
knowing when this code runs how the final URL will look. There
could be additional URL fragments added before this template
keyword that add a query string component.
Furthermore, I don't think there's *any* for req.url to have
a query string. That's because the code that populates this
variable only takes SCRIPT_NAME and REPO_NAME into account. The
"?" character it is searching for would only be added if some
code attempted to add QUERY_STRING to the URL. Hacking the code
up to raise if "?" is present in the URL yields a clean test
suite run. I'm not sure if we broke this code or if it has
always been broken.
Anyway, this commit removes support for emitting "&" as the
first character in {sessionvars} and makes it always emit "?",
which is what it was always doing before AFAICT.
Differential Revision: https://phab.mercurial-scm.org/D2733
Gregory Szorc <gregory.szorc@gmail.com> [Thu, 08 Mar 2018 15:15:59 -0800] rev 36804
hgweb: rename req to wsgireq
We will soon introduce a parsed WSGI request object so we don't
have to concern ourselves with low-level WSGI matters. Prepare
for multiple request objects by renaming the existing one so it
is clear it deals with WSGI.
We also remove a symbol import to avoid even more naming confusion.
# no-check-commit because of some new foo_bar naming that's required
Differential Revision: https://phab.mercurial-scm.org/D2732
Gregory Szorc <gregory.szorc@gmail.com> [Thu, 08 Mar 2018 09:44:27 -0800] rev 36803
hgweb: validate WSGI environment dict
The wsgiref.validate module contains useful functions for validating
that various WSGI data structures are proper.
This commit adds validation of the environment dict to our built-in
HTTP server, which turns an HTTP request into an environment dict.
The check discovered that we weren't always setting QUERY_STRING,
which would cause the cgi module to fall back to sys.argv. So we
change things to always set QUERY_STRING.
The check passes on Python 2 and 3.
Differential Revision: https://phab.mercurial-scm.org/D2731
Gregory Szorc <gregory.szorc@gmail.com> [Thu, 08 Mar 2018 09:26:51 -0800] rev 36802
hgweb: ensure all wsgi environment values are str
Previously, we had a few entries that were bytes on Python 3.
PEP-0333 states that all entries must be the native str type
(bytes on Python 2, str on Python 3).
This required a number of changes to hgweb_mod to unbreak
things on Python 3. I suspect there still may be some regressions.
I'm going to introduce a data structure that represents a parsed
WSGI request in upcoming commits. This will hold bytes and will
allow us to stop using raw literals throughout the WSGI code.
Differential Revision: https://phab.mercurial-scm.org/D2730
Gregory Szorc <gregory.szorc@gmail.com> [Wed, 07 Mar 2018 16:18:52 -0800] rev 36801
wireproto: formalize permissions checking as part of protocol interface
Per the inline comment desiring to formalize permissions checking
in the protocol interface, we do that.
I'm not convinced this is the best way to go about things. I would love
for there to e.g. be a better exception for denoting permissions
problems. But it does feel strictly better than snipping attributes
on the proto instance.
Differential Revision: https://phab.mercurial-scm.org/D2719
Gregory Szorc <gregory.szorc@gmail.com> [Wed, 07 Mar 2018 16:02:24 -0800] rev 36800
wireproto: declare permissions requirements in @wireprotocommand (API)
With the security patches from 4.5.2 merged into default, we now
have a per-command attribute defining what permissions are needed
to run that command. We now have a richer @wireprotocommand that
can be extended to record additional command metadata. So we
port the permissions mechanism to be based on @wireprotocommand.
.. api::
hgweb_mod.perms and wireproto.permissions have been removed. Wire
protocol commands should declare their required permissions in the
@wireprotocommand decorator.
Differential Revision: https://phab.mercurial-scm.org/D2718
Gregory Szorc <gregory.szorc@gmail.com> [Tue, 06 Mar 2018 15:08:33 -0800] rev 36799
wireprotoserver: check permissions in main dispatch function
The permissions checking code merged from stable is out of place
in the refactored hgweb_mod module.
This commit moves the main call to wireprotoserver. We still have
some lingering code in hgweb_mod. This will get addressed later.
Differential Revision: https://phab.mercurial-scm.org/D2717