How can I determine if a file is compressed from Elisp?

Question

I know that I can use the file command to determine this, but I'd like a cross-platform solution using elisp only (or as few subprocesses as possible). I have the compressed data in a variable called response; you can use the following shell command to get a sample of the data I'm trying to profile:

curl --silent "http://api.stackexchange.com/2.2/filter/create"

Piping the above through gunzip will give a readable result. The problem is that my mechanism for retrieving the information has different behavior when it is run locally and when it is run on Travis.

Unfortunately, the Content-Encoding header lies.

Constantine · Accepted Answer · 2014-11-12T16:47:17.523

It seems to me that the question is a little vague: is the goal to recognize gzip-compressed data? If not, what formats need to be supported?

Focussing on the gzip case:

The way I see it, possible approaches depend on the use case. For instance, if the length of a possibly compressed response is expected to be small, one can try decompressing to test if it was compressed in the first place.

;; get sample data for testing
(setq response
      (with-temp-buffer
        (set-buffer-multibyte nil)
        (shell-command "curl --silent 'http://api.stackexchange.com/2.2/filter/create'" t)
        (buffer-string)))

(Now I can test the code below using data in response.)

(defun zlib-compressed-p (string)
  "Return t if STRING is compressed with zlib, nil otherwise."
  (when (not (zlib-available-p))
    (error "This function requires zlib!"))
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert string)
    (if (zlib-decompress-region (point-min) (point-max))
        t
      nil)))

A small modification returns human-readable data regardless of whether it was compressed:

(defun zlib-decompress-if-compressed (string)
  "Decompress STRING if it is recognized as a compressed
unibyte string by zlib, otherwise return STRING unchanged.

Requires zlib."
  (when (not (zlib-available-p))
    (error "zlib-decompress-if-compressed requires zlib!"))
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert string)
    (if (zlib-decompress-region (point-min) (point-max))
        (buffer-string)
      string)))

As far as I know, zlib is the only compression library Emacs can be compiled with, so we can't handle other formats this way.

The original question states "I have the compressed data in a variable called response...", and zlib-decompress-if-compressed can process it without writing it to a file. It is easy to create versions that take a file name:

(defun zlib-file-compressed-p (filename)
  "Return t if the file FILENAME is compressed with zlib, nil otherwise."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally filename nil)
    (zlib-compressed-p (buffer-string))))

(defun zlib-decompress-file (filename)
  "Return the contents of the file FILENAME as a string,
decompressed using zlib if the file is recognized as compressed.
If the file is not compressed with zlib, return its contents
literally."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally filename nil)
    (zlib-decompress-if-compressed (buffer-string))))

For other formats, or if a response is too long to attempt decompressing, one can check if the first two bytes of a string match the magic number defined by the gzip file format specification.

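(defun gzip-check-magic (data)
  "Check if the first two bytes of a string in DATA match magic
numbers identifying the gzip file format. Return nil if DATA is
shorter than two bytes. See
http://www.gzip.org/zlib/rfc-gzip.html for the file format
description."
  (let ((bytes (string-as-unibyte data)))
    ;; Guard against short input; `substring' would signal an error.
    (and (>= (length bytes) 2)
         (equal (substring bytes 0 2) (unibyte-string 31 139)))))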

(defun gzip-compressed-p (filename)
  "Check if the file FILENAME is gzip-compressed by checking
magic numbers identifying the gzip file format. See
`gzip-check-magic' for details."
  (let ((first-two-bytes (with-temp-buffer
                           (set-buffer-multibyte nil)
                           (insert-file-contents-literally filename nil 0 2)
                           (buffer-string))))
    (gzip-check-magic first-two-bytes)))

This method is also limited to gzip and may give a false positive (any data that happens to start with the bytes 31 and 139 will match), but it is truly cross-platform.
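
One way to reduce false positives is to look at one more byte: in the gzip format (RFC 1952) the third byte is the compression method, and 8 means deflate. The following is only a sketch; `my-gzip-header-p` is a hypothetical name, and it assumes its argument is already a unibyte string.

```elisp
;; Hypothetical helper, not part of Emacs or of the answer above.
;; Assumes BYTES is a unibyte string (raw bytes, not characters).
(defun my-gzip-header-p (bytes)
  "Return non-nil if unibyte string BYTES starts with a plausible gzip header.
Checks ID1 = 31, ID2 = 139 and CM = 8 (deflate)."
  (and (>= (length bytes) 3)
       (= (aref bytes 0) 31)    ; ID1
       (= (aref bytes 1) 139)   ; ID2
       (= (aref bytes 2) 8)))   ; CM: 8 means deflate
```

This is still a heuristic: three matching bytes make a false positive less likely, not impossible.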

(It can be extended to other formats that use "magic numbers" to identify the format, for example bzip2, but this is certainly not scalable.)
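
As a rough sketch of such an extension, the magic numbers can be kept in a table and tried in turn. The names `my-compression-magic-alist` and `my-guess-compression` are made up for illustration, the table is deliberately incomplete, and the input is assumed to be a unibyte string.

```elisp
;; Hypothetical names; the table lists only three well-known signatures.
(defvar my-compression-magic-alist
  (list (cons 'gzip  (unibyte-string #x1f #x8b))
        (cons 'bzip2 (unibyte-string #x42 #x5a #x68))                  ; "BZh"
        (cons 'xz    (unibyte-string #xfd #x37 #x7a #x58 #x5a #x00)))
  "Alist mapping a format symbol to the bytes that start such a file.")

(defun my-guess-compression (bytes)
  "Return a symbol naming the format unibyte string BYTES starts with, or nil.
Only the leading magic number is checked, so false positives are possible."
  (let ((found nil))
    (dolist (entry my-compression-magic-alist found)
      (when (and (null found)
                 (string-prefix-p (cdr entry) bytes))
        (setq found (car entry))))))
```

With the data from the question, (my-guess-compression response) could then be used to decide whether to call zlib-decompress-if-compressed at all.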

Overall, using call-process and file or similar seems to be the most flexible approach.
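
A minimal sketch of that approach follows. It assumes a `file` executable that understands --brief and --mime-type is on exec-path (which is not the case on every platform), and `my-file-mime-type` is a hypothetical helper name.

```elisp
(require 'subr-x)  ; for `string-trim' on older Emacs versions

;; Hypothetical helper; assumes file(1) with --brief and --mime-type.
(defun my-file-mime-type (filename)
  "Return the MIME type that file(1) reports for FILENAME, or nil.
Return nil if `file' is not installed or exits with a non-zero status."
  (when (executable-find "file")
    (with-temp-buffer
      (when (eq 0 (call-process "file" nil t nil
                                "--brief" "--mime-type"
                                (expand-file-name filename)))
        (string-trim (buffer-string))))))
```

The exact string depends on the version of `file`; for gzip data it is typically "application/gzip" or "application/x-gzip", so a caller would probably want to accept both, for example with (member (my-file-mime-type name) '("application/gzip" "application/x-gzip")).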

Magic bytes are brilliant! I'll try this out [upstream](http://www.github.com/vermiculus/stack-mode). — Sean Allred, Nov 04 '14 at 00:39
This works when all of the characters can be converted to unibyte, but chokes if they cannot. Use `string-as-unibyte` to interpret the data as bytes instead of characters. — Sean Allred, Nov 11 '14 at 03:10
@SeanAllred: Thanks for pointing out that `string-to-unibyte` signals an error *if STRING contains a non-ASCII, non-eight-bit character*! I updated the answer. — Constantine, Nov 12 '14 at 16:49
