highlight: add option to prevent content-only based fallback
When Mozilla enabled Pygments on hg.mozilla.org, we got a lot of weirdly
colorized files. Upon further investigation, the highlight extension
first attempts a filename+content based match, then falls back to a
purely content-driven detection mode in Pygments. Sounds good in theory.

Unfortunately, Pygments' content-driven detection establishes no minimum
threshold for returning a lexer. Furthermore, the detection code for
a number of languages is very liberal. For example, ActionScript 3 will
return a confidence of 0.3 (out of 1.0) if the first 1k of the file
we pass in matches the regex "\w+\s*:\s*\w"! Python matches on
"import ". It's no coincidence that a number of our extension-less files
were getting highlighted improperly.
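
To illustrate how little such a pattern requires, here is a minimal sketch
that applies the regex quoted above to ordinary prose. It uses only Python's
re module and does not call Pygments; the sample text and the names AS3_HINT
and sample are made up for illustration, and anchoring the pattern at the
start of the text is an assumption.

```python
import re

# The pattern quoted above for ActionScript 3 detection.
AS3_HINT = re.compile(r"\w+\s*:\s*\w")

# Hypothetical extension-less file content: plain English, not ActionScript.
sample = "Note: remember to update the changelog before release.\n"

# Only the first 1k is inspected, mirroring text[:1024] in the extension.
# Assumption: the pattern is tried at the start of the text.
match = AS3_HINT.match(sample[:1024])
print(match.group(0) if match else None)
```

Any file that opens with a "word: word" sequence satisfies the pattern, which
is why plain notes and logs can end up treated as source code.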

This patch adds an option to have the highlighter not fall back to
purely content-based detection when filename+content detection fails.
This can be enabled to render unhighlighted text instead of taking the risk
that unknown file types are highlighted incorrectly. The old behavior is
still the default.
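
The two-stage lookup and the effect of the new option can be sketched roughly
as follows. This is not the extension's code: NoLexerFound, pick_lexer,
by_filename and by_content are hypothetical stand-ins for the Pygments
exception and guessing functions, so the sketch runs without Pygments
installed.

```python
class NoLexerFound(Exception):
    """Stand-in for Pygments' ClassNotFound in this sketch."""


def pick_lexer(path, text, by_filename, by_content, filename_only=False):
    """Return a lexer name, or None to leave the file unhighlighted."""
    try:
        # Stage 1: filename + content match.
        return by_filename(path, text[:1024])
    except NoLexerFound:
        if filename_only:
            # Option enabled: do not guess from content alone.
            return None
    try:
        # Stage 2: content-only guess (the old, still-default behavior).
        return by_content(text[:1024])
    except NoLexerFound:
        return None


def by_filename(path, text):
    # Hypothetical matcher that only knows the .py extension.
    if path.endswith(".py"):
        return "python"
    raise NoLexerFound(path)


def by_content(text):
    # Hypothetical, deliberately liberal content matcher.
    if "import " in text:
        return "python"
    raise NoLexerFound("no guess")


text = "Please import the attached CSV into the spreadsheet.\n"
default_result = pick_lexer("README", text, by_filename, by_content)
strict_result = pick_lexer("README", text, by_filename, by_content,
                           filename_only=True)
print(default_result, strict_result)
```

With the default settings the prose file is misidentified as Python; with the
filename-only flag set, the same file is left unhighlighted.
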
--- a/hgext/highlight/__init__.py Wed Oct 14 17:43:44 2015 -0700
+++ b/hgext/highlight/__init__.py Wed Oct 14 18:22:16 2015 -0700
@@ -13,11 +13,17 @@
It depends on the Pygments syntax highlighting library:
http://pygments.org/
-There are two configuration options::
+There are the following configuration options::
[web]
pygments_style = <style> (default: colorful)
highlightfiles = <fileset> (default: size('<5M'))
+ highlightonlymatchfilename = <bool> (default False)
+
+``highlightonlymatchfilename`` will only highlight files if their type could
+be identified by their filename. When this is not enabled (the default),
+Pygments will try very hard to identify the file type from content and any
+match (even matches with a low confidence score) will be used.
"""
import highlight
@@ -32,12 +38,14 @@
def pygmentize(web, field, fctx, tmpl):
style = web.config('web', 'pygments_style', 'colorful')
expr = web.config('web', 'highlightfiles', "size('<5M')")
+ filenameonly = web.configbool('web', 'highlightonlymatchfilename', False)
ctx = fctx.changectx()
tree = fileset.parse(expr)
mctx = fileset.matchctx(ctx, subset=[fctx.path()], status=None)
if fctx.path() in fileset.getset(mctx, tree):
- highlight.pygmentize(field, fctx, style, tmpl)
+ highlight.pygmentize(field, fctx, style, tmpl,
+ guessfilenameonly=filenameonly)
def filerevision_highlight(orig, web, req, tmpl, fctx):
mt = ''.join(tmpl('mimetype', encoding=encoding.encoding))
--- a/hgext/highlight/highlight.py Wed Oct 14 17:43:44 2015 -0700
+++ b/hgext/highlight/highlight.py Wed Oct 14 18:22:16 2015 -0700
@@ -20,7 +20,7 @@
SYNTAX_CSS = ('\n<link rel="stylesheet" href="{url}highlightcss" '
'type="text/css" />')
-def pygmentize(field, fctx, style, tmpl):
+def pygmentize(field, fctx, style, tmpl, guessfilenameonly=False):
# append a <link ...> to the syntax highlighting css
old_header = tmpl.load('header')
@@ -46,6 +46,12 @@
lexer = guess_lexer_for_filename(fctx.path(), text[:1024],
stripnl=False)
except (ClassNotFound, ValueError):
+ # guess_lexer will return a lexer if *any* lexer matches. There is
+ # no way to specify a minimum match score. This can give a high rate of
+ # false positives on files with an unknown filename pattern.
+ if guessfilenameonly:
+ return
+
try:
lexer = guess_lexer(text[:1024], stripnl=False)
except (ClassNotFound, ValueError):
--- a/tests/test-highlight.t Wed Oct 14 17:43:44 2015 -0700
+++ b/tests/test-highlight.t Wed Oct 14 18:22:16 2015 -0700
@@ -644,4 +644,43 @@
% hgweb filerevision, html
% errors encountered
+We attempt to highlight unknown files by default
+
+ $ killdaemons.py
+
+ $ cat > .hg/hgrc << EOF
+ > [web]
+ > highlightfiles = **
+ > EOF
+
+ $ cat > unknownfile << EOF
+ > #!/usr/bin/python
+ > def foo():
+ > pass
+ > EOF
+
+ $ hg add unknownfile
+ $ hg commit -m unknown unknownfile
+
+ $ hg serve -p $HGPORT -d -n test --pid-file=hg.pid
+ $ cat hg.pid >> $DAEMON_PIDS
+
+ $ get-with-headers.py localhost:$HGPORT 'file/tip/unknownfile' | grep l2
+ <span id="l2"><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span></span><a href="#l2"></a>
+
+We can prevent Pygments from falling back to a non filename-based
+detection mode
+
+ $ cat > .hg/hgrc << EOF
+ > [web]
+ > highlightfiles = **
+ > highlightonlymatchfilename = true
+ > EOF
+
+ $ killdaemons.py
+ $ hg serve -p $HGPORT -d -n test --pid-file=hg.pid
+ $ cat hg.pid >> $DAEMON_PIDS
+ $ get-with-headers.py localhost:$HGPORT 'file/tip/unknownfile' | grep l2
+ <span id="l2">def foo():</span><a href="#l2"></a>
+
$ cd ..