
I have a text file that is processed on a Windows machine. Trailing tab characters need to be removed from it before the bcp utility is used to load the data from the file into a database table.

The following command, in a Bash script, stripped out the trailing tabs:

sed 's/[\t]*$//' < ./input/raw.txt >> ./input/data.txt

but it converted the CR-LF to LF which caused the bcp command to fail.
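
A quick way to confirm which line endings a file actually has is the file utility, which calls out CR-LF endings explicitly (the exact wording varies between versions):

$ file ./input/raw.txt ./input/data.txt
./input/raw.txt:  ASCII text, with CRLF line terminators
./input/data.txt: ASCII text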

In an effort to keep the CR-LF I tried this:

sed 's/[\t]*$/$CR/' < ./input/raw.txt >> ./input/data.txt

but that resulted in:

(screenshot of the resulting file: each line ends with the literal text $CR rather than a carriage return)

The desired outcome is:

(screenshot of the desired file: trailing tabs removed and the CR-LF line endings preserved)

How do I modify the command to achieve the desired output?


3 Answers


You need to install the unix2dos package. It has two utilities:

unix2dos    Convert UNIX newlines to CR-LF
dos2unix    Convert DOS CR-LF to UNIX newlines

Let's create a test file of five lines, and do a hex dump to examine the line endings:

$ jot -w 'line %d' 5 > foo
$ hexdump -C foo
00000000  6c 69 6e 65 20 31 0a 6c  69 6e 65 20 32 0a 6c 69  |line 1.line 2.li|
00000010  6e 65 20 33 0a 6c 69 6e  65 20 34 0a 6c 69 6e 65  |ne 3.line 4.line|
00000020  20 35 0a                                          | 5.|
00000023

We see that each line ends in a newline character, hex 0a.

Now we convert those newlines to DOS CR-LF line endings, and inspect again:

$ unix2dos foo
$ hexdump -C foo
00000000  6c 69 6e 65 20 31 0d 0a  6c 69 6e 65 20 32 0d 0a  |line 1..line 2..|
00000010  6c 69 6e 65 20 33 0d 0a  6c 69 6e 65 20 34 0d 0a  |line 3..line 4..|
00000020  6c 69 6e 65 20 35 0d 0a                           |line 5..|
00000028

Now each line ends in CR-LF, hex 0d 0a.

Finally, we can convert the file back to the original UNIX newlines:

$ dos2unix foo
$ hexdump -C foo
00000000  6c 69 6e 65 20 31 0a 6c  69 6e 65 20 32 0a 6c 69  |line 1.line 2.li|
00000010  6e 65 20 33 0a 6c 69 6e  65 20 34 0a 6c 69 6e 65  |ne 3.line 4.line|
00000020  20 35 0a                                          | 5.|
00000023
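
Applied to the files from the question, one possible sequence (a sketch: the question's original sed command to strip the tabs, then unix2dos to restore the CR-LF endings, writing rather than appending to data.txt) would be:

$ sed 's/[\t]*$//' < ./input/raw.txt > ./input/data.txt
$ unix2dos ./input/data.txt

unix2dos converts the named file in place by default, as in the demonstration above.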

Note that in standard sed, sed 's/[\t]*$//' removes all backslash and t characters from the end of the line. The GNU implementation of sed only does it when there's a POSIXLY_CORRECT variable in its environment.

sed 's/\t*$//' is unspecified, but at least with GNU sed, that happens to remove trailing TABs whether POSIXLY_CORRECT is in the environment or not.

Here you could do:

sed $'s/\t*$/\r/'

This uses the ksh93-style $'...' form of quoting, inside which sequences like \t or \r are expanded to TAB and CR respectively. That form is now supported by many other shells and will be in the next version of the POSIX standard for sh.
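
For example, with the file names from the question:

sed $'s/\t*$/\r/' < ./input/raw.txt > ./input/data.txt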

Alternatively, you can put the TAB and CR characters in shell variables, which can be done without $'...', for instance with:

eval "$(printf 'TAB="\t" CR="\r"')"

You could do:

sed "s/$TAB*\$/$CR/"

But that has to be within double-quotes. Inside single quotes, no expansion is performed.
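
Putting the two snippets together with the paths from the question gives, as a sketch:

eval "$(printf 'TAB="\t" CR="\r"')"
sed "s/$TAB*\$/$CR/" < ./input/raw.txt > ./input/data.txt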

Now, in the unlikely event that the input doesn't end in a LF character (which would make it invalid text in Unix), those (with GNU sed at least) would produce a file that ends in a CR character, making it invalid in DOS as well.

To convert the text file from Unix to DOS line endings, you could instead use the unix2dos utility, which doesn't have that problem:

sed $'s/\t*$//' | unix2dos
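
For example, with the question's paths:

sed $'s/\t*$//' < ./input/raw.txt | unix2dos > ./input/data.txt

When given no file arguments, unix2dos reads stdin and writes the converted text to stdout, so it can be used as a filter like this.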

Or use perl's sed mode:

perl -pe 's/\t*$//; s/\n/\r\n/'

perl -p works like sed in that it runs the code for each line of input, except that in perl the pattern space ($_ there) has the full line including the line delimiter. It also supports those \t, \n, \r escapes (while standard sed only supports \n and only in regular expressions), and can cope with non-text files.
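
The -p switch is roughly equivalent to wrapping the code in a read-print loop, so the command above is more or less the same as (with the question's file names):

perl -e 'while (<>) { s/\t*$//; s/\n/\r\n/; print }' < ./input/raw.txt > ./input/data.txt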


Using Raku (formerly known as Perl_6)

~$ cat unix2dos.raku
my $fh1 = open $*IN, :r;
#below :x opens write-only :exclusive (i.e. 'no-clobber')
my $fh2 = open $*OUT, :x, nl-out => "\r\n";

for $fh1.lines() { $fh2.put($_) };

$fh1.close; $fh2.close;

Raku (a.k.a. Perl6) is a programming language in the Perl family. One thing the Perl6 project tried to do was abstract out OS-specific niggles to make code more portable, and one of these niggles is newline processing. Raku provides a nl-in parameter for filehandle input (defaults to ["\x0A", "\r\n"]), autochomps lines by default, uses \n-terminated newlines internally, and provides a nl-out parameter for filehandle output (defaults to "\n").

The key statement by the OP is as follows:

...but it converted the CR-LF to LF which caused the bcp command to fail.

So for the Raku script above (on whatever platform you happen to be working on), you can open a file for writing and set nl-out => "\r\n", i.e. newline-out to CR-LF. Raku reads lines lazily, so this script should be memory-efficient. Even without making the above script executable, you can call it at the command line as follows:

~$ raku unix2dos.raku < ends_with_LF.txt > ends_with_CRLF.txt

The above script defaults to taking $*IN stdin and is therefore a "one-off", but Raku also provides ways to read from $*ARGFILES and to iterate over directories with dir. Finally, there's an excellent summary of newline processing in Raku at the first link below:

https://docs.raku.org/language/newline.html
https://raku.org
