tests/test-encoding.t
author Martin von Zweigbergk <martinvonz@google.com>
Sun, 09 Jul 2017 17:02:09 -0700
changeset 33379 7ddb2aa2b7af
parent 33262 8e6f4939a69a
child 34661 eb586ed5d8ce
permissions -rw-r--r--
match: express anypats(), not prefix(), in terms of the others

When I added prefix() in 9789b4a7c595 (match: introduce boolean prefix() method, 2014-10-28), we already had always(), isexact(), and anypats(), so it made sense to write it in terms of them (a prefix matcher is one that isn't any of the other types). It's only now that I realize that it's much more natural to define prefix() explicitly (it's one that uses path: patterns, roughly speaking) and let anypats() be defined in terms of the others. Remember that these methods are all used for determining which fast paths are possible. anypats() simply means that no fast paths are possible (it could be called complex() instead). Further evidence is that rootfilesin:some/dir does not have any patterns, but it's still considered to be an anypats() matcher. That's because anypats() really just means that it's not a prefix() matcher (and not always() and not isexact()).

This patch thus changes prefix() to return False by default and anypats() to return True only if the other three are False. Having anypats() be True by default also seems like a good thing, because it means forgetting to override it will lead only to performance bugs, not correctness bugs.

Since the base class's implementation changes, we're also forced to update the subclasses. That change exposed and fixed a bug in the differencematcher: for example when both its two input matchers were prefix matchers, we would say that the result was also a prefix matcher, which is incorrect, because e.g. "path:dir - path:dir/foo" no longer matches everything under "dir" (which is what prefix() means).
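
The relationship described in the commit message above can be sketched in a few lines of Python. The classes below are hypothetical stand-ins written for illustration, not Mercurial's actual matcher classes; only the method names always(), isexact(), prefix(), and anypats() come from the message.

```python
class BaseMatcher:
    """Hypothetical base class: every fast path is off by default."""

    def always(self):
        return False

    def isexact(self):
        return False

    def prefix(self):
        return False

    def anypats(self):
        # True only when none of the three fast paths applies.
        return not (self.always() or self.isexact() or self.prefix())


class PrefixMatcher(BaseMatcher):
    """Matches everything under the given directories (like path: patterns)."""

    def __init__(self, roots):
        self.roots = list(roots)

    def prefix(self):
        return True

    def __call__(self, path):
        return any(path == r or path.startswith(r + "/") for r in self.roots)


class DifferenceMatcher(BaseMatcher):
    """Matches paths accepted by m1 but not by m2."""

    def __init__(self, m1, m2):
        self.m1 = m1
        self.m2 = m2

    def __call__(self, path):
        return self.m1(path) and not self.m2(path)


# "path:dir - path:dir/foo" from the commit message
diff = DifferenceMatcher(PrefixMatcher(["dir"]), PrefixMatcher(["dir/foo"]))
print(diff("dir/bar"), diff("dir/foo/x"), diff.prefix(), diff.anypats())
```

Because DifferenceMatcher in this sketch does not override prefix(), it inherits the conservative default and reports anypats(), which costs at most a fast path and never a wrong answer.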

Test character encoding

  $ hg init t
  $ cd t

we need a repo with some legacy latin-1 changesets

  $ hg unbundle "$TESTDIR/bundles/legacy-encoding.hg"
  adding changesets
  adding manifests
  adding file changes
  added 2 changesets with 2 changes to 1 files
  (run 'hg update' to get a working copy)
  $ hg co
  1 files updated, 0 files merged, 0 files removed, 0 files unresolved
  $ $PYTHON << EOF
  > f = open('latin-1', 'wb'); f.write(b"latin-1 e' encoded: \xe9"); f.close()
  > f = open('utf-8', 'wb'); f.write(b"utf-8 e' encoded: \xc3\xa9"); f.close()
  > f = open('latin-1-tag', 'wb'); f.write(b"\xe9"); f.close()
  > EOF

committing a latin-1 message with HGENCODING=ascii should fail with an encoding error

  $ echo "plain old ascii" > a
  $ hg st
  M a
  ? latin-1
  ? latin-1-tag
  ? utf-8
  $ HGENCODING=ascii hg ci -l latin-1
  transaction abort!
  rollback completed
  abort: decoding near ' encoded: \xe9': 'ascii' codec can't decode byte 0xe9 in position 20: ordinal not in range(128)! (esc)
  [255]
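
The position reported in the abort above can be reproduced outside hg. This is a minimal sketch, assuming Python 3 and using only the standard library (no Mercurial APIs); the variable names are illustrative.

```python
# Same bytes the test writes to the 'latin-1' message file.
message = b"latin-1 e' encoded: \xe9"

try:
    message.decode("ascii")
except UnicodeDecodeError as err:
    # err.start is the offset of the first byte the codec rejected.
    failure = (err.encoding, err.start, message[err.start:err.end])
else:
    failure = None

# The same bytes decode cleanly as latin-1, which is why the
# HGENCODING=latin-1 commit is expected to succeed.
decoded = message.decode("latin-1")

print(failure)
print(repr(decoded[-1]))
```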

these should work

  $ echo "latin-1" > a
  $ HGENCODING=latin-1 hg ci -l latin-1
  $ echo "utf-8" > a
  $ HGENCODING=utf-8 hg ci -l utf-8
  $ HGENCODING=latin-1 hg tag `cat latin-1-tag`
  $ HGENCODING=latin-1 hg branch `cat latin-1-tag`
  marked working directory as branch \xe9 (esc)
  (branches are permanent and global, did you want a bookmark?)
  $ HGENCODING=latin-1 hg ci -m 'latin1 branch'
  $ hg -q rollback
  $ HGENCODING=latin-1 hg branch
  \xe9 (esc)
  $ HGENCODING=latin-1 hg ci -m 'latin1 branch'
  $ rm .hg/branch

hg log (ascii)

  $ hg --encoding ascii log
  changeset:   5:a52c0692f24a
  branch:      ?
  tag:         tip
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     latin1 branch
  
  changeset:   4:94db611b4196
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     Added tag ? for changeset ca661e7520de
  
  changeset:   3:ca661e7520de
  tag:         ?
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     utf-8 e' encoded: ?
  
  changeset:   2:650c6f3d55dd
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     latin-1 e' encoded: ?
  
  changeset:   1:0e5b7e3f9c4a
  user:        test
  date:        Mon Jan 12 13:46:40 1970 +0000
  summary:     koi8-r: ????? = u'\u0440\u0442\u0443\u0442\u044c'
  
  changeset:   0:1e78a93102a3
  user:        test
  date:        Mon Jan 12 13:46:40 1970 +0000
  summary:     latin-1 e': ? = u'\xe9'
  

hg log (latin-1)

  $ hg --encoding latin-1 log
  changeset:   5:a52c0692f24a
  branch:      \xe9 (esc)
  tag:         tip
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     latin1 branch
  
  changeset:   4:94db611b4196
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     Added tag \xe9 for changeset ca661e7520de (esc)
  
  changeset:   3:ca661e7520de
  tag:         \xe9 (esc)
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     utf-8 e' encoded: \xe9 (esc)
  
  changeset:   2:650c6f3d55dd
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     latin-1 e' encoded: \xe9 (esc)
  
  changeset:   1:0e5b7e3f9c4a
  user:        test
  date:        Mon Jan 12 13:46:40 1970 +0000
  summary:     koi8-r: \xd2\xd4\xd5\xd4\xd8 = u'\\u0440\\u0442\\u0443\\u0442\\u044c' (esc)
  
  changeset:   0:1e78a93102a3
  user:        test
  date:        Mon Jan 12 13:46:40 1970 +0000
  summary:     latin-1 e': \xe9 = u'\\xe9' (esc)
  

hg log (utf-8)

  $ hg --encoding utf-8 log
  changeset:   5:a52c0692f24a
  branch:      \xc3\xa9 (esc)
  tag:         tip
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     latin1 branch
  
  changeset:   4:94db611b4196
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     Added tag \xc3\xa9 for changeset ca661e7520de (esc)
  
  changeset:   3:ca661e7520de
  tag:         \xc3\xa9 (esc)
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     utf-8 e' encoded: \xc3\xa9 (esc)
  
  changeset:   2:650c6f3d55dd
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     latin-1 e' encoded: \xc3\xa9 (esc)
  
  changeset:   1:0e5b7e3f9c4a
  user:        test
  date:        Mon Jan 12 13:46:40 1970 +0000
  summary:     koi8-r: \xc3\x92\xc3\x94\xc3\x95\xc3\x94\xc3\x98 = u'\\u0440\\u0442\\u0443\\u0442\\u044c' (esc)
  
  changeset:   0:1e78a93102a3
  user:        test
  date:        Mon Jan 12 13:46:40 1970 +0000
  summary:     latin-1 e': \xc3\xa9 = u'\\xe9' (esc)
  

hg tags (ascii)

  $ HGENCODING=ascii hg tags
  tip                                5:a52c0692f24a
  ?                                  3:ca661e7520de

hg tags (latin-1)

  $ HGENCODING=latin-1 hg tags
  tip                                5:a52c0692f24a
  \xe9                                  3:ca661e7520de (esc)

hg tags (utf-8)

  $ HGENCODING=utf-8 hg tags
  tip                                5:a52c0692f24a
  \xc3\xa9                                  3:ca661e7520de (esc)

hg tags (JSON)

  $ hg tags -Tjson
  [
   {
    "node": "a52c0692f24ad921c0a31e1736e7635a8b23b670",
    "rev": 5,
    "tag": "tip",
    "type": ""
   },
   {
    "node": "ca661e7520dec3f5438a63590c350bebadb04989",
    "rev": 3,
    "tag": "\xc3\xa9", (esc)
    "type": ""
   }
  ]

hg branches (ascii)

  $ HGENCODING=ascii hg branches
  ?                              5:a52c0692f24a
  default                        4:94db611b4196 (inactive)

hg branches (latin-1)

  $ HGENCODING=latin-1 hg branches
  \xe9                              5:a52c0692f24a (esc)
  default                        4:94db611b4196 (inactive)

hg branches (utf-8)

  $ HGENCODING=utf-8 hg branches
  \xc3\xa9                              5:a52c0692f24a (esc)
  default                        4:94db611b4196 (inactive)
  $ echo '[ui]' >> .hg/hgrc
  $ echo 'fallbackencoding = koi8-r' >> .hg/hgrc
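
The effect of the fallback encoding on the two legacy changesets can be checked with plain byte arithmetic. This is a sketch assuming Python 3 and the standard codecs only; it does not call Mercurial, and it assumes the legacy descriptions hold the raw bytes shown in the latin-1 log.

```python
# Raw bytes of the word in the koi8-r changeset description.
legacy = b"\xd2\xd4\xd5\xd4\xd8"

# Interpreted as latin-1 and re-encoded as UTF-8 (the earlier utf-8 log).
as_latin1 = legacy.decode("latin-1").encode("utf-8")

# Interpreted as KOI8-R and re-encoded as UTF-8 (with the fallback set).
as_koi8r = legacy.decode("koi8-r").encode("utf-8")

# The single latin-1 byte 0xe9 is a different letter in KOI8-R.
e9_as_koi8r = b"\xe9".decode("koi8-r").encode("utf-8")

print(as_latin1)
print(as_koi8r)
print(e9_as_koi8r)
```

Under KOI8-R the first changeset's byte 0xe9 is no longer an accented e, so its summary changes too once the fallback is configured.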

hg log (utf-8, with fallbackencoding = koi8-r)

  $ HGENCODING=utf-8 hg log
  changeset:   5:a52c0692f24a
  branch:      \xc3\xa9 (esc)
  tag:         tip
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     latin1 branch
  
  changeset:   4:94db611b4196
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     Added tag \xc3\xa9 for changeset ca661e7520de (esc)
  
  changeset:   3:ca661e7520de
  tag:         \xc3\xa9 (esc)
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     utf-8 e' encoded: \xc3\xa9 (esc)
  
  changeset:   2:650c6f3d55dd
  user:        test
  date:        Thu Jan 01 00:00:00 1970 +0000
  summary:     latin-1 e' encoded: \xc3\xa9 (esc)
  
  changeset:   1:0e5b7e3f9c4a
  user:        test
  date:        Mon Jan 12 13:46:40 1970 +0000
  summary:     koi8-r: \xd1\x80\xd1\x82\xd1\x83\xd1\x82\xd1\x8c = u'\\u0440\\u0442\\u0443\\u0442\\u044c' (esc)
  
  changeset:   0:1e78a93102a3
  user:        test
  date:        Mon Jan 12 13:46:40 1970 +0000
  summary:     latin-1 e': \xd0\x98 = u'\\xe9' (esc)
  

hg log (dolphin)

  $ HGENCODING=dolphin hg log
  abort: unknown encoding: dolphin
  (please check your locale settings)
  [255]
  $ HGENCODING=ascii hg branch `cat latin-1-tag`
  abort: decoding near '\xe9': 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)! (esc)
  [255]
  $ cp latin-1-tag .hg/branch
  $ HGENCODING=latin-1 hg ci -m 'auto-promote legacy name'

Test roundtrip encoding of lookup tables when not using UTF-8 (issue2763)

  $ HGENCODING=latin-1 hg up `cat latin-1-tag`
  0 files updated, 0 files merged, 1 files removed, 0 files unresolved

  $ cd ..

Test roundtrip encoding/decoding of utf8b for generated data

#if hypothesis

  >>> from hypothesishelpers import *
  >>> from mercurial import encoding
  >>> roundtrips(st.binary(), encoding.fromutf8b, encoding.toutf8b)
  Round trip OK

#endif
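
The round-trip property checked above can be illustrated without Mercurial or hypothesis. The sketch below is an analogy only: it uses Python 3's standard 'surrogateescape' error handler, which also maps undecodable bytes to lone surrogates in the U+DC80..U+DCFF range, and it is not the implementation of encoding.toutf8b or encoding.fromutf8b. The helper names are hypothetical.

```python
def escape_bytes(raw):
    """Decode arbitrary bytes, mapping invalid ones to lone surrogates."""
    return raw.decode("utf-8", "surrogateescape")


def unescape_text(text):
    """Inverse of escape_bytes: turn escaped surrogates back into bytes."""
    return text.encode("utf-8", "surrogateescape")


samples = [
    b"",
    b"plain old ascii",
    b"utf-8 e' encoded: \xc3\xa9",    # valid UTF-8 passes through unchanged
    b"latin-1 e' encoded: \xe9",      # invalid UTF-8 byte gets escaped
    b"\xed\xb3\xa9",                  # UTF-8-style encoding of a surrogate
]
samples.extend(bytes([b]) for b in range(256))

failures = [s for s in samples if unescape_text(escape_bytes(s)) != s]
print("Round trip OK" if not failures else failures)
```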