Three ways to check whether a file is ASCII-only
19 October 2018I have a 5k line LaTeX file, and I wasn’t sure whether it contained any non-ASCII characters. I’m using LuaLaTeX to compile it, which supports UTF-8, so just running pdfLaTeX on it is not a solution, and it would be nice to have a method that works on any text file.
The first tool any seasoned Unix admin reaches for to answer the
question ‘Does a particular character occur in a text file?’ is, of
course, grep
. The man pages don’t list a built-in character
class right away for ASCII characters, so a tiny bit of ingenuity is
required. This
Stack Overflow answer gives a command which almost works (in my
testing it found the ‘ï’ of ‘naïve’ but missed the ‘ń’ in a Polish
name), but it’s easily modified into the following:
grep -P '[^\x00-\x7f]' myfile.tex
Having the full scope of Perl-compatible regexes available means we
could use the (slightly more memorable) [^[:ascii:]]
.
The second is a less well-known part of the standard Unix toolbox,
iconv
. Its primary purpose is to convert between text
encodings, but we can abuse it to detect non-ASCII characters by asking
it to convert our file into ASCII and seeing if it gives any errors:
iconv -t ASCII myfile.tex > /dev/null
Finally, trusty old file
will tell us the encoding used,
and is probably the easiest one to remember:
file myfile.tex