annotate tests/md5sum.py @ 40326:fed697fa1734

sqlitestore: file storage backend using SQLite

This commit provides an extension which uses SQLite to store file data (as opposed to revlogs).

As the inline documentation describes, there are still several aspects to the extension that are incomplete. But it's a start. The extension does support basic clone, checkout, and commit workflows, which makes it suitable for simple use cases.

One notable missing feature is support for "bundlerepos." This is probably responsible for the most test failures when the extension is activated as part of the test suite.

All revision data is stored in SQLite. Data is stored as zstd compressed chunks (default if zstd is available), zlib compressed chunks (default if zstd is not available), or raw chunks (if configured or if a compressed delta is not smaller than the raw delta). This makes things very similar to revlogs.

Unlike revlogs, the extension doesn't yet enforce a limit on delta chain length. This is an obvious limitation and should be addressed. This is somewhat mitigated by the use of zstd, which is much faster than zlib to decompress.

There is a dedicated table for storing deltas. Deltas are stored by the SHA-1 hash of their uncompressed content. The "fileindex" table has columns that reference the delta for each revision and the base delta that delta should be applied against. A recursive SQL query is used to resolve the delta chain along with the delta data.

By storing deltas by hash, we are able to de-duplicate delta storage! With revlogs, the same deltas in different revlogs would result in duplicate storage of that delta. In this scheme, inserting the duplicate delta is a no-op and delta chains simply reference the existing delta.

When initially implementing this extension, I did not have content-indexed deltas and deltas could be duplicated across files (just like revlogs). When I implemented content-indexed deltas, the size of the SQLite database for a full clone of mozilla-unified dropped:

before: 2,554,261,504 bytes
after:  2,488,754,176 bytes

Surprisingly, this is still larger than the bytes size of revlog files:

revlog files: 2,104,861,230 bytes
du -b:        2,254,381,614

I would have expected storage to be smaller since we're not limiting delta chain length and since we're using zstd instead of zlib. I suspect the SQLite indexes and per-column overhead account for the bulk of the differences. (Keep in mind that revlog uses a 64-byte packed struct for revision index data and deltas are stored without padding. Aside from the 12 unused bytes in the 32 byte node field, revlogs are pretty efficient.) Another source of overhead is file name storage. With revlogs, file names are stored in the filesystem. But with SQLite, we need to store file names in the database. This is roughly equivalent to the size of the fncache file, which for the mozilla-unified repository is ~34MB.

Since the SQLite database isn't append-only and since delta chains can reference any delta, this opens some interesting possibilities.

For example, we could store deltas in reverse, such that fulltexts are stored for newer revisions and deltas are applied to reconstruct older revisions. This is likely a more optimal storage strategy for version control, as new data tends to be more frequently accessed than old data. We would obviously need wire protocol support for transferring revision data from newest to oldest. And we would probably need some kind of mechanism for "re-encoding" stores. But it should be doable.

This extension is very much experimental quality. There are a handful of features that don't work. It probably isn't suitable for day-to-day use. But it could be used in limited cases (e.g. read-only checkouts like in CI). And it is also a good proving ground for alternate storage backends. As we continue to define interfaces for all things storage, it will be useful to have a viable alternate storage backend to see how things shake out in practice.

test-storage.py passes on Python 2 and introduces no new test failures on Python 3.

Having the storage-level unit tests has proved to be insanely useful when developing this extension. Those tests caught numerous bugs during development and I'm convinced this style of testing is the way forward for ensuring alternate storage backends work as intended. Of course, test coverage isn't close to what it needs to be. But it is a start. And what coverage we have gives me confidence that basic store functionality is implemented properly.

Differential Revision: https://phab.mercurial-scm.org/D4928
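The commit message describes three storage ideas: deltas keyed by the SHA-1 of their uncompressed content, a raw fallback when compression does not shrink a delta, and a recursive SQL query that walks a delta chain. The following is a minimal sketch of those ideas, not the extension's code. The schema, the helper names (`encode`, `store_delta`, `add_revision`, `read_revision`) and the use of plain concatenation in place of real delta application are all assumptions made for illustration, and zlib stands in for zstd so that only the standard library is needed.

```python
import hashlib
import sqlite3
import zlib

# Hypothetical toy schema; names and columns are assumptions for illustration.
SCHEMA = """
CREATE TABLE delta (
    id          INTEGER PRIMARY KEY,
    hash        BLOB NOT NULL UNIQUE,  -- SHA-1 of the uncompressed delta
    compression TEXT NOT NULL,         -- 'zlib' or 'raw'
    data        BLOB NOT NULL
);
CREATE TABLE fileindex (
    path        TEXT NOT NULL,
    revnum      INTEGER NOT NULL,
    deltaid     INTEGER NOT NULL REFERENCES delta(id),
    deltabaseid INTEGER REFERENCES delta(id),  -- NULL marks the chain base
    PRIMARY KEY (path, revnum)
);
"""

# Walk from the requested revision back to the chain base, then return the
# chunks base-first. Assumes a delta id appears once per file in this toy.
CHAIN_QUERY = """
WITH RECURSIVE chain(deltaid, baseid, depth) AS (
    SELECT deltaid, deltabaseid, 0 FROM fileindex
        WHERE path = :path AND revnum = :rev
    UNION ALL
    SELECT f.deltaid, f.deltabaseid, chain.depth + 1
        FROM fileindex AS f JOIN chain ON f.deltaid = chain.baseid
        WHERE f.path = :path
)
SELECT delta.compression, delta.data
FROM chain JOIN delta ON delta.id = chain.deltaid
ORDER BY chain.depth DESC
"""


def encode(raw):
    """Keep the compressed form only when it is smaller than the raw bytes."""
    compressed = zlib.compress(raw)
    if len(compressed) < len(raw):
        return "zlib", compressed
    return "raw", raw


def store_delta(db, raw):
    """Insert a delta keyed by its SHA-1; a duplicate insert is a no-op."""
    digest = hashlib.sha1(raw).digest()
    compression, data = encode(raw)
    db.execute(
        "INSERT OR IGNORE INTO delta (hash, compression, data) VALUES (?, ?, ?)",
        (digest, compression, data),
    )
    return db.execute("SELECT id FROM delta WHERE hash = ?", (digest,)).fetchone()[0]


def add_revision(db, path, revnum, raw, baseid=None):
    """Record a revision whose delta applies on top of the delta `baseid`."""
    deltaid = store_delta(db, raw)
    db.execute(
        "INSERT INTO fileindex VALUES (?, ?, ?, ?)", (path, revnum, deltaid, baseid)
    )
    return deltaid


def read_revision(db, path, rev):
    """Resolve the chain; 'applying' a delta is just appending in this toy."""
    rows = db.execute(CHAIN_QUERY, {"path": path, "rev": rev})
    return b"".join(
        zlib.decompress(data) if compression == "zlib" else data
        for compression, data in rows
    )


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
base_a = add_revision(db, "a.txt", 0, b"hello\n")
add_revision(db, "a.txt", 1, b"world\n", base_a)
base_b = add_revision(db, "b.txt", 0, b"hello\n")  # same bytes as a.txt rev 0
add_revision(db, "b.txt", 1, b"there\n", base_b)
delta_rows = db.execute("SELECT COUNT(*) FROM delta").fetchone()[0]
```

In this sketch the two files share one stored row for their identical base chunk, so four revisions produce three rows in `delta`. A real store would apply binary deltas rather than concatenate chunks, and would need to handle the same delta appearing more than once in a single file's history.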
author Gregory Szorc <gregory.szorc@gmail.com>
date Tue, 09 Oct 2018 08:50:13 -0700
parents 904bc1dc2694
children 2372284d9457
changesets referenced in the annotation:

4122   306055f5b65c  Unified #! paths for python scripts and removed them for test modules.
       Thomas Arendsen Hein <thomas@intevation.de>, parents: 3223
1928   50e1c90b0fcf  clarify license on md5sum.py
       Peter van Dijk <peter@dataloss.nl>, parents: 1924
29485  6a98f9408a50  py3: make files use absolute_import and print_function
       Pulkit Goyal <7895pulkit@gmail.com>, parents: 25660
33873  904bc1dc2694  md5sum: assume hashlib exists now that we're 2.7 only
       Augie Fackler <raf@durin42.com>, parents: 32852
6470   ac0bcd951c2c  python 2.6 compatibility: compatibility wrappers for hash functions
       Dirkjan Ochtman <dirkjan@ochtman.nl>, parents: 6212
7080   a6477aa893b8  tests: Windows compatibility fixes
       Patrick Mezard <pmezard@gmail.com>, parents: 6470
1924   46fb38ef9a91  add md5sum.py required by fix in previous changeset
       Peter van Dijk <peter@dataloss.nl>, parents:
25660  328739ea70c3  global: mass rewrite to use modern exception syntax
       Gregory Szorc <gregory.szorc@gmail.com>, parents: 14494
3223   53e843840349  Whitespace/Tab cleanup
       Thomas Arendsen Hein <thomas@intevation.de>, parents: 1928
32852  3a64ac39b893  md5sum: adapt for python 3 support
       Augie Fackler <augie@google.com>, parents: 29731

rev    line  source
4122      1  #!/usr/bin/env python
1928      2  #
1928      3  # Based on python's Tools/scripts/md5sum.py
1928      4  #
1928      5  # This software may be used and distributed according to the terms
1928      6  # of the PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2, which is
1928      7  # GPL-compatible.
1928      8
29485     9  from __future__ import absolute_import
29485    10
33873    11  import hashlib
29485    12  import os
29485    13  import sys
6470     14
6470     15  try:
7080     16      import msvcrt
7080     17      msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
7080     18      msvcrt.setmode(sys.stderr.fileno(), os.O_BINARY)
7080     19  except ImportError:
7080     20      pass
7080     21
1924     22  for filename in sys.argv[1:]:
1924     23      try:
1924     24          fp = open(filename, 'rb')
25660    25      except IOError as msg:
1924     26          sys.stderr.write('%s: Can\'t open: %s\n' % (filename, msg))
1924     27          sys.exit(1)
3223     28
33873    29      m = hashlib.md5()
1924     30      try:
32852    31          for data in iter(lambda: fp.read(8192), b''):
1924     32              m.update(data)
25660    33      except IOError as msg:
1924     34          sys.stderr.write('%s: I/O error: %s\n' % (filename, msg))
1924     35          sys.exit(1)
1924     36      sys.stdout.write('%s %s\n' % (m.hexdigest(), filename))
1924     37
1924     38  sys.exit(0)
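The hashing loop in the listing reads each file in 8192-byte chunks and feeds them to `hashlib.md5()`, so memory use stays flat regardless of file size. A minimal sketch of the same technique as a reusable function follows. The name `md5_of_file` and the `chunk_size` parameter are assumptions for illustration and are not part of the script; unlike the script, the sketch closes the file with a `with` block.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=8192):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as fp:
        # iter() with a sentinel stops when read() returns b'' at end of file.
        for data in iter(lambda: fp.read(chunk_size), b''):
            digest.update(data)
    return digest.hexdigest()


# Compare the chunked digest with a one-shot digest of the same bytes.
payload = b'x' * 20000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    chunked = md5_of_file(tmp.name)
finally:
    os.unlink(tmp.name)
whole = hashlib.md5(payload).hexdigest()
```

Feeding the data in chunks gives the same digest as hashing the whole payload at once, which is what makes the chunked loop safe for large files.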