What are the metrics that `file` uses to determine the type of a text-like file?

Question

I have a bunch of LaTeX source files that all share the same properties: Unix-style line endings, UTF-8 encoding, roughly the same size (1-2 KB), and spaces for indentation. They are included in a bigger document, with each file handling one section. Every section has the same layout, so the files are structured identically and use mostly the same LaTeX commands, just with different text content; each file starts and ends with LaTeX commands and contains many of them. The strange thing is this:

$ file *.tex
file1.tex: LaTeX document, Unicode text, UTF-8 text
file2.tex: CSV text

This is just a tiny excerpt. Whether a file is detected as CSV or LaTeX looks totally random, with CSV detected slightly less often (maybe 40% CSV, 60% LaTeX), but for each file the result is reproducible.

I tried varying some formatting and content in CSV-detected files, but they stay detected as CSV.

What is going on here?

Read `man file`. Inspect the first few bytes of your files with `head -n 1 file | od -bc`. `file` only looks at the first few bytes. – waltinator Nov 25 '23 at 21:59
@waltinator if you followed your own advice, you would see that `file` reads up to the first mebibyte of a file to identify it. – Stephen Kitt Nov 25 '23 at 22:31

score 11 · Accepted Answer · edited Nov 25 '23 at 23:58

Most file type detection in `file` is based on “magic” values, described in a large set of files; TeX files have their own set of detection recipes.
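
To illustrate the general idea of a magic-based check, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the rule table `TOY_RULES`, the function `toy_identify`, and the patterns and search windows are hypothetical and are not copied from the real magic files, whose rule syntax is considerably richer.

```python
# Toy rules: (byte pattern, how many leading bytes to search, description).
# Hypothetical values for illustration only, not file's real magic rules.
TOY_RULES = [
    (b"\\documentclass", 4096, "LaTeX document"),
    (b"\\begin", 4096, "LaTeX document"),
    (b"%PDF-", 5, "PDF document"),
]


def toy_identify(data: bytes) -> str:
    """Return the description of the first toy rule whose pattern is found."""
    for pattern, window, description in TOY_RULES:
        # Only the first `window` bytes are searched for the pattern.
        if pattern in data[:window]:
            return description
    return "data"


print(toy_identify(b"\\begin{document}\n"))  # LaTeX document
print(toy_identify(b"plain words\n"))        # data
```

The point of the sketch is only that this kind of detection looks for known byte patterns near the start of the content, which is a different mechanism from counting fields.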

CSV files, however, are handled differently, by a dedicated routine in `file` itself. It counts comma-separated fields in the first ten lines of a file. If the file has at least two lines, each line has at least two fields, and the number of fields is the same on every one of the first ten lines (or on every line, if the file has fewer than ten), then the file is identified as a CSV file.
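
The rule as described above can be approximated in a few lines of Python. This is a minimal sketch of the description, not the actual routine in `file`: the name `looks_like_csv` is made up here, and the sketch assumes that a line's field count is simply its number of commas plus one, ignoring any quoting.

```python
def looks_like_csv(text: str, max_lines: int = 10) -> bool:
    """Rough approximation of the CSV rule described above (no quote handling)."""
    lines = text.splitlines()[:max_lines]
    # At least two lines are required.
    if len(lines) < 2:
        return False
    # Assumption: fields per line = commas + 1.
    field_counts = [line.count(",") + 1 for line in lines]
    expected = field_counts[0]
    # Each inspected line needs two or more fields, and all counts must match.
    return expected >= 2 and all(n == expected for n in field_counts)


print(looks_like_csv("\\begin\n"))                 # False
print(looks_like_csv("\\begin, now\nHi, Jack\n"))  # True
```

Under this approximation, a LaTeX file whose first ten lines each happen to contain the same number of commas (at least one) would satisfy the rule, regardless of what else the lines contain.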

CSV detection can be disabled using the -e option:

file -e csv -- *.tex

answered Nov 25 '23 at 22:15 by Stephen Kitt · edited Nov 25 '23 at 23:58 by Stéphane Chazelas

For example, `printf '%s\n' '\begin' | file -` will tell you LaTeX, but `printf '%s\n' '\begin, now' 'Hi, Jack' | file -` will tell you CSV. – Stéphane Chazelas Nov 25 '23 at 23:59
Thank you. It basically came down to the number of commas in the first 10 lines (by chance, the CSV-detected files had exactly one comma in each of the first 10 lines). So, interestingly, CSV seems to win over TeX. – Jack Nov 26 '23 at 08:13
@StéphaneChazelas: Possibly OT, but I noticed your edit. What purpose does the added "--" serve (as part of the "file -e csv…" that SK had suggested)? – MC68020 Nov 26 '23 at 13:18
@MC68020 see What does "--" (double-dash) mean?, especially the ⚠️ Important part. – Stéphane Chazelas Nov 26 '23 at 13:23
