I have a bunch of LaTeX source files that all share the same properties: Unix-style line endings, UTF-8 encoding, roughly the same size (1-2 KB), and spaces for indentation. They are included in a bigger document, with each file handling a separate section, and every section has the same layout. So the files are structured identically and use mostly the same LaTeX commands, just with different text content; each file starts and ends with LaTeX commands and contains many of them. The strange thing is this:

$ file *.tex
file1.tex: LaTeX document, Unicode text, UTF-8 text
file2.tex: CSV text

This is just a tiny excerpt. Whether a file is detected as CSV or LaTeX seems totally random, with CSV detected slightly less often (maybe 40% CSV, 60% LaTeX), but for each individual file the result is reproducible.

I tried varying some formatting and content in CSV-detected files, but they stay detected as CSV.

What is going on here?

Jack
  • Read man file. Inspect the first few bytes of your files with head -n 1 file | od -bc. file only looks at the first few bytes. – waltinator Nov 25 '23 at 21:59
  • @waltinator if you followed your own advice, you would see that file reads up to the first mebibyte of a file to identify it. – Stephen Kitt Nov 25 '23 at 22:31

1 Answer

Most file type detection in file is based on “magic” values, described in a large set of files; TeX files have their own set of detection recipes.

CSV files however are handled differently, with a dedicated routine in file itself. This counts comma-separated fields in the first ten lines of a file. If there are at least two fields in each line, and there are at least two lines in the file, and the number of fields is the same in the first ten lines (or the whole file if it has fewer than ten lines), then the file is identified as a CSV file.
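
To make the rule above concrete, here is a minimal Python sketch of the heuristic as described in this answer. It is an approximation for illustration, not file's actual code: the function name looks_like_csv and the max_lines parameter are made up, and quoting is ignored entirely.

```python
def looks_like_csv(text: str, max_lines: int = 10) -> bool:
    """Approximate the CSV heuristic described above (hypothetical helper)."""
    lines = text.split("\n")
    # A trailing newline leaves an empty last element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    # Only the first max_lines lines are inspected.
    lines = lines[:max_lines]
    # At least two lines are required.
    if len(lines) < 2:
        return False
    # A line with n commas has n + 1 fields (assumption: no quote handling).
    counts = [line.count(",") + 1 for line in lines]
    # Every inspected line needs at least two fields and the same field count.
    return counts[0] >= 2 and all(c == counts[0] for c in counts)


# LaTeX-like input where each line happens to contain exactly one comma.
sample = "\\section{Intro, part one}\n\\textbf{Alpha, beta}\n"
print(looks_like_csv(sample))
```

Under this sketch, a LaTeX file whose first ten lines each happen to contain the same non-zero number of commas would pass the check, which would fit the reproducible but seemingly random results in the question.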

CSV detection can be disabled using the -e option:

file -e csv -- *.tex
Stephen Kitt