6

I have a large file that I need to process. After writing some scripts that didn't seem to work properly, I discovered that a small subset of the lines in the file are space-separated rather than tab-separated.

Question: I'm wondering what the best way would be to change these space-separated lines to tab-separated ones?

The file contains 4 entries in each line and about 5000 entries in total; about 150 of them are space-separated rather than tab-separated.

zx8754
  • 109
  • Could you show an example of the file? Is it safe to assume that all stretches of one or more spaces should be turned into tabs? Can any line contain valid spaces? – terdon Aug 20 '13 at 13:58
  • Yes, sorry I should have made it clear that the only spaces appearing anywhere in the file are as separators. That is, yes to the second question. – Stiofán Fordham Aug 20 '13 at 14:11

3 Answers

8
tr ' ' '\t' < file 1<> file

This replaces every space character in the file with a tab character, in place.
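To try the command above without touching real data, here is a minimal sketch on a throwaway file. The directory from mktemp -d and the name sample.txt are assumptions made up for the illustration; the check simply confirms that the byte count is unchanged (each space is swapped for exactly one tab) and that no spaces remain.

```shell
# Hypothetical sample: one tab-separated line and one space-separated line.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt"
printf 'a\tb\tc\td\ne f g h\n' > "$sample"

# $(( ... )) strips any padding that some wc implementations print.
size_before=$(( $(wc -c < "$sample") ))

# fd 0 is opened read-only, fd 1 read-write without truncation.
tr ' ' '\t' < "$sample" 1<> "$sample"

size_after=$(( $(wc -c < "$sample") ))

# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
spaces_left=$(grep -c ' ' "$sample" || true)

echo "before=$size_before after=$size_after lines_with_spaces=$spaces_left"
```

Because tr maps one byte to one byte here, the rewritten blocks land exactly on top of the blocks they were read from.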


Just to respond to people saying it's not safe:

The shell will open the file for reading on file descriptor 0, and for reading and writing on file descriptor 1. If either of those fails, the shell bails out and tr is not even executed. If the redirections succeed, tr is executed.

tr will read the file one block at a time, do the transliteration and output the modified block over the unmodified one.

In doing so, it will generally not need to allocate any space on disk. The exceptions are a file that was sparse to start with, or a file system that implements copy-on-write. So "no space available" errors are not likely.

Other errors may occur though, such as an I/O error if the underlying disk is failing, or if the file system is on a block device that has been thinly provisioned (like an LVM snapshot). Both conditions are rare and would probably involve restoring from a backup anyway.

In any case, upon failure of the write() system call, tr should report an error and exit. Because its stdout is open in read-write mode, it will not be truncated. For the file to be truncated, tr would have to explicitly call truncate() on its standard output on exit which would not make sense.

What would happen, though, is that the file would be left partially transliterated (up to the point where tr failed).

What I found out, though, is that the GNU tr currently found on Debian sid amd64 has a bug: it segfaults upon a failure of the write() system call and outputs garbage on stdout (edit: now fixed since version 2.19-1 of the libc6 Debian package). That would actually corrupt the file (but again not truncate it).

tr ' ' '\t' < file > newfile && mv newfile file

would not replace file unless newfile has been created correctly, but it has a number of issues associated with it:

  • you need to make sure you don't clobber an already existing newfile (think also of symlinks)
  • you need write access to the current directory
  • you need additional storage space for that extra copy of the file
  • you lose the permissions, ownership, birth time, extended attributes... of the original file
  • if the original file was a symlink, you're going to replace it with a regular file.
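As a hedged sketch of how the first issue in the list above could be softened, the temporary copy can be created with mktemp, which is not specified by POSIX but is widely available. The file name data.txt and the scratch directory are hypothetical. The other issues (directory write access, extra space, lost metadata, replaced symlinks) still apply to this variant.

```shell
# Hypothetical input file in a scratch directory.
tmpdir=$(mktemp -d)
file="$tmpdir/data.txt"
printf 'a b c d\n' > "$file"

# mktemp creates a new, uniquely named file next to the original,
# so an existing "newfile" (or a symlink with that name) is never clobbered.
newfile=$(mktemp "$file.XXXXXX")

if tr ' ' '\t' < "$file" > "$newfile"; then
    # Only replace the original once the copy was written successfully.
    mv -- "$newfile" "$file"
else
    # On failure the original is left alone and the partial copy is dropped.
    rm -f -- "$newfile"
fi

result=$(cat "$file")
```

Keeping the temporary file in the same directory as the original means the final mv is a rename within one file system rather than a copy.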

tr ' ' '\t' < file 1<> file is safer than the commonly used perl -pi -e 's/ /\t/g' because upon failure of perl (like on disk full), you lose the original file and only get what perl has managed to output so far.

2

You can use sed as well.

sed -i.bak 's/ /\t/g' filename

This will create a filename.bak before editing the file.

s/ /\t/g => This tells sed to substitute each space with a tab character, globally across each line of the file.

Anthon
  • 79,293
ptierno
  • 202
1

To change every space in a file to a tab, use tr.

tr ' ' '\t' <input_file >output_file

To change every sequence of one or more spaces to a single tab, use sed.

sed -e 's/  */\t/g' <input_file >output_file

Some sed implementations understand \t to mean a tab, others need a literal tab character.
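For sed implementations that do not understand \t, one possible workaround is to put a literal tab into a shell variable with printf and interpolate it into the expression. The variable name tab and the sample input are just illustrations.

```shell
# Store a literal tab character; printf interprets \t portably.
tab=$(printf '\t')

# Double quotes let the shell expand $tab before sed sees the expression,
# so sed receives a real tab rather than the two characters \ and t.
out=$(printf 'a   b  c d\n' | sed -e "s/  */$tab/g")

printf '%s\n' "$out"
```

Each run of one or more spaces collapses to a single tab, regardless of which sed is installed.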

If you have a file with aligned columns that use a variable number of spaces to align the columns, you can convert it to have tab-separated columns with unexpand.
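A small sketch of the unexpand case, assuming the default tab stops of 8 columns and input whose second column starts exactly at column 9. The -a option asks unexpand to convert runs of blanks anywhere in the line, not only leading ones. The sample rows are made up.

```shell
# Two rows padded with spaces so the second column starts at a tab stop:
# "name" + 4 spaces and "ab" + 6 spaces both fill 8 columns.
out=$(printf 'name    value\nab      cd\n' | unexpand -a)

printf '%s\n' "$out"
```

Note that unexpand only converts blanks that end at a tab stop, so columns aligned to other widths may need the -t option or a different tool.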