combine text files column-wise

Question

I have two text files. The first one has content:

Languages
Recursively enumerable
Regular

while the second one has content:

Minimal automaton
Turing machine
Finite

I want to combine them into one file column-wise. So I tried paste 1 2 and its output is:

Languages   Minimal automaton
Recursively enumerable  Turing machine
Regular Finite

However I would like to have the columns aligned well such as

Languages               Minimal automaton
Recursively enumerable  Turing machine
Regular                 Finite

Is it possible to achieve that without aligning the columns by hand?

Added:

Here is another example, where Bruce's method almost nails it, except for a slight misalignment. Why does that happen?

$ cat 1
Chomsky hierarchy
Type-0
—

$ cat 2
Grammars
Unrestricted

$ paste 1 2 | pr -t -e20
Chomsky hierarchy   Grammars
Type-0              Unrestricted
—                    (no common name)

That last example, with misalignment, is a doozy. I can duplicate it on Arch linux, pr (GNU coreutils) 8.12. I can't duplicate it on an elderly Slackware (11.0) I also have around: pr (GNU coreutils) 5.97. The problem is with the '-' character, and it's in pr, not paste. — , Jul 11 '11 at 04:18
I get the same thing with the EM-DASH with both pr and expand ... columns avoids this issue. — Peter.O, Jul 11 '11 at 16:28
I've produced output for most of the different answers except for awk + paste, which will left-shift the right-most column(s) if a left file is shorter than any to the right of it. The same, and more, applies to 'paste + column', which also has this problem with blank lines in the left column(s)... If you want to see all the outputs together, here is the link: http://paste.ubuntu.com/643692/ I've used 4 columns. — Peter.O, Jul 14 '11 at 02:48
I just noticed something misleading on the paste.ubuntu link... I originally set the data up for testing my scripts, (and that led on to doing the others)... so the fields which say ➀ unicode may render oddly but the column count is ok definitely does not apply to wc-paste-pr and wc-paste-pr They do show column count differences.. The others are ok. — Peter.O, Jul 14 '11 at 03:39
@BruceEdiger: The alignment problem occurs when non-ASCII characters are used (in his question, the OP used a dash (—) instead of a minus (-) character), most probably due to a bad or no handling by pr of the multibyte characters in the current locale (usually UTF8). — WhiteWinterWolf, Sep 13 '16 at 09:15
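
The multibyte explanation in the comment above can be checked with a rough sketch like the following. It assumes the dash is a UTF-8 encoded em dash (U+2014), written here as octal escapes so the snippet itself stays ASCII. A tool that counts bytes instead of characters would treat this one visible character as three columns.

```shell
# The em dash (U+2014) is encoded in UTF-8 as the three bytes E2 80 94.
emdash=$(printf '\342\200\224')

# Byte count: what a byte-oriented tool advances its column counter by.
bytes=$(printf '%s' "$emdash" | wc -c | tr -d ' ')
echo "bytes: $bytes"

# Character count: locale-dependent, so no particular value is claimed here.
printf '%s' "$emdash" | wc -m
```

Comparing the two counts under your own locale shows whether a given system counts the dash as one character or as several bytes.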

glenn jackman · Accepted Answer · 2016-11-03T10:40:15.567

93

You just need the column command; tell it to use tabs to separate columns:

paste file1 file2 | column -s $'\t' -t

To address the "empty cell" controversy, we just need the -n option to column:

$ paste <(echo foo; echo; echo barbarbar) <(seq 3) | column -s $'\t' -t
foo        1
2
barbarbar  3

$ paste <(echo foo; echo; echo barbarbar) <(seq 3) | column -s $'\t' -tn
foo        1
           2
barbarbar  3

My column man page indicates -n is a "Debian GNU/Linux extension." My Fedora system does not exhibit the empty cell problem: it appears to be derived from BSD and the man page says "Version 2.23 changed the -s option to be non-greedy"

edited Nov 03 '16 at 10:40

answered Jul 11 '11 at 13:57

5

glenn: You are the hero of the hour! I knew there was something like this around, but I couldn't remember it. I've been lurking on this question; waiting for you :) ... column, of course; how obvious (in hindsight) +1... Thanks... – Peter.O Jul 11 '11 at 15:58
6

I've just noticed that column -s $'\t' -t ignores empty cells, resulting in all subsequent cells to the right of it (on that line) being moved to the left; i.e., as a result of a blank line in a file, or of it being shorter... :( – Peter.O Jul 13 '11 at 04:08
1

@masi, corrected – glenn jackman Nov 03 '16 at 10:40
-n does not work in RHEL. Is there an alternative? – Koshur Jul 14 '17 at 06:20
I can finally comment, so want to note that I previously added an answer below that addresses Peter.O's issue with runs of empty cells by using nulls. – techno Jun 22 '18 at 14:49

score 19 · Answer 2 · answered Jul 11 '11 at 01:26

You're looking for the handy dandy pr command:

paste file1 file2 | pr -t -e24

The "-e24" is "expand tab stops to 24 spaces". Luckily, paste puts a tab-character between columns, so pr can expand it. I chose 24 by counting the characters in "Recursively enumerable" and adding 2.
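
To see what tab-stop expansion does, here is a small sketch using the string "123", a tab, then "abc". It only assumes a pr that accepts -t and -eN, as used above; the variable names are made up for illustration.

```shell
# Expand the tab with stops every 8 columns, then every 24 columns.
# awk's index() reports the 1-based position of "a" in the expanded line.
pos8=$(printf '123\tabc\n' | pr -t -e8 | awk '{ print index($0, "a") }')
pos24=$(printf '123\tabc\n' | pr -t -e24 | awk '{ print index($0, "a") }')

echo "tab stops every 8:  'a' is character number $pos8"
echo "tab stops every 24: 'a' is character number $pos24"
```

Character number 9 means the "a" sits 8 character widths from the start of the line; with -e24 it sits 24 widths from the start.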

answered Jul 11 '11 at 01:26

1

Thanks! What does "expand tab stops to 24 spaces" mean? – Tim Jul 11 '11 at 01:53
I also update with an example where your method almost nails it except a slight misalignment. – Tim Jul 11 '11 at 02:08
1

Traditionally "tabstops" hit every 8 spaces. "123TABabc" would get printed out with the 'a' character 8 character-widths from the start of the line. Setting it to 24 would put the 'a' at 24 char widths from the start of the line. – Jul 11 '11 at 04:20
You say the "-e24" is "expand tab stops to 24 spaces", so why not use the expand command directly: paste file1 file2 | expand -t 24 ? – WhiteWinterWolf Sep 13 '16 at 08:48
Can you please compare this answer to techno's answer? – Léo Léopold Hertz 준영 Nov 03 '16 at 10:30
2

@Masi - my answer is similar to but less complicated than @techno's answer below. It doesn't invoke sed so there's one process that doesn't run. It uses pr which is an ancient command, dating to Unix SysV days, I think, so it might exist on more installs than expand. It's just old school, in short. – Nov 03 '16 at 16:05
The difference between the two approaches is that using column will allow for varying column widths rather than the fixed width of pr or expand. So it depends upon what you need and can live with. – techno Jun 23 '18 at 18:06

Peter.O · Answer 3 · 2011-07-13T04:33:16.490

Update: Here is a much simpler script (than the one further down) for tabulated output. Just pass filenames to it as you would to paste... It uses HTML to make the frame, so it is tweakable. It does preserve multiple spaces, and the column alignment is preserved when it encounters Unicode characters. However, the way the editor or viewer renders the Unicode is another matter entirely...

┌──────────────────────┬────────────────┬──────────┬────────────────────────────┐
│ Languages            │ Minimal        │ Chomsky  │ Unrestricted               │
├──────────────────────┼────────────────┼──────────┼────────────────────────────┤
│ Recursive            │ Turing machine │ Finite   │     space indented         │
├──────────────────────┼────────────────┼──────────┼────────────────────────────┤
│ Regular              │ Grammars       │          │ ➀ unicode may render oddly │
├──────────────────────┼────────────────┼──────────┼────────────────────────────┤
│ 1 2  3   4    spaces │                │ Symbol-& │ but the column count is ok │
├──────────────────────┼────────────────┼──────────┼────────────────────────────┤
│                      │                │          │ Context                    │
└──────────────────────┴────────────────┴──────────┴────────────────────────────┘

#!/bin/bash
{ echo -e "<html>\n<table border=1 cellpadding=0 cellspacing=0>"
  paste "$@" |sed -re 's#(.*)#\x09\1\x09#' -e 's#\x09# </pre></td>\n<td><pre> #g' -e 's#^ </pre></td>#<tr>#' -e 's#\n<td><pre> $#\n</tr>#'
  echo -e "</table>\n</html>"
} |w3m -dump -T 'text/html'

---

A synopsis of the tools presented in the answers (so far).
I've had a pretty close look at them; here is what I've found:

paste
# This tool is common to all the answers presented so far.
# It can handle multiple files, and therefore multiple columns... Good!
# It delimits each column with a Tab... Good.
# Its output is not tabulated.

The tools below all remove this delimiter!... Bad if you need a delimiter.

column
# It removes the Tab delimiter, so field identification is purely by columns, which it seems to handle quite well. I haven't spotted anything awry...
# Aside from not having a unique delimiter, it works fine!

expand
# Only has a single tab setting, so it is unpredictable beyond 2 columns.
# The alignment of columns is not accurate when handling Unicode, and it removes the Tab delimiter, so field identification is purely by column alignment.

pr
# Only has a single tab setting, so it is unpredictable beyond 2 columns.
# The alignment of columns is not accurate when handling Unicode, and it removes the Tab delimiter, so field identification is purely by column alignment.
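
One possible way around the two-column limit of expand is to pass it a comma-separated list of tab positions instead of a single width. The sketch below assumes an expand that accepts such a list for -t (GNU coreutils does), non-empty input files, and ASCII-only content; the file names F1 to F3 and their contents are sample data made up for illustration.

```shell
# Sample input files (illustrative data only).
dir=$(mktemp -d)
printf '%s\n' 'Languages' 'Recursively enumerable' 'Regular' > "$dir/F1"
printf '%s\n' 'Minimal automaton' 'Turing machine' 'Finite' > "$dir/F2"
printf '%s\n' 'Chomsky hierarchy' 'Type-0' 'Type-3' > "$dir/F3"

# Build cumulative tab stops: each stop is the previous stop plus the
# widest line of the column just finished, plus 2 spaces of gap.
# The last file needs no stop, so only N-1 positions are produced.
stops=$(awk '
    FNR == 1 && NR > 1  { pos += width + 2; list = list sep pos; sep = ","; width = 0 }
    length($0) > width  { width = length($0) }
    END                 { print list }
' "$dir/F1" "$dir/F2" "$dir/F3")

out=$(paste "$dir/F1" "$dir/F2" "$dir/F3" | expand -t "$stops")
printf '%s\n' "$out"
rm -r "$dir"
```

The Unicode caveat still applies: the widths are whatever awk reports for length, which may not match the displayed width of non-ASCII text.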

To me, column is the obvious best solution as a one-liner. If you want either the delimiter, or an ASCII-art tabulation of your files, read on; otherwise, column is pretty darn good :)...

Here is a script which takes any number of files and creates an ASCII-art tabulated presentation. (Bear in mind that Unicode may not render to the expected width, e.g. ௵ which is a single character. This is quite different to the column numbers being wrong, as is the case in some of the utilities mentioned above.) ... The script's output, shown below, is from 4 input files, named F1 F2 F3 F4...

+------------------------+-------------------+-------------------+--------------+
| Languages              | Minimal automaton | Chomsky hierarchy | Grammars     |
| Recursively enumerable | Turing machine    | Type-0            | Unrestricted |
| Regular                | Finite            | —                 |              |
| Alphabet               |                   | Symbol            |              |
|                        |                   |                   | Context      |
+------------------------+-------------------+-------------------+--------------+

#!/bin/bash

# Note: The next line is for testing purposes only!
set F1 F2 F3 F4 # Simulate commandline filename args $1 $2 etc...

p=' '                                # The pad character
# Get line and column stats
cc=${#@}; lmax=                      # Count of columns (== input files)
for c in $(seq 1 $cc) ;do            # Filenames from the commandline 
  F[$c]="${!c}"        
  wc=($(wc -l -L <${F[$c]}))         # File length and width of longest line 
  l[$c]=${wc[0]}                     # File length  (per file)
  L[$c]=${wc[1]}                     # Longest line (per file) 
  ((lmax<${l[$c]})) && lmax=${l[$c]} # Length of longest file
done
# Determine line-count deficits  of shorter files
for c in $(seq 1 $cc) ;do  
  ((${l[$c]}<lmax)) && D[$c]=$((lmax-${l[$c]})) || D[$c]=0 
done
# Build '\n' strings to cater for short-file deficits
for c in $(seq 1 $cc) ;do
  for n in $(seq 1 ${D[$c]}) ;do
    N[$c]=${N[$c]}$'\n'
  done
done
# Build the command to suit the number of input files
source=$(mktemp)
>"$source" echo 'paste \'
for c in $(seq 1 $cc) ;do
    ((${L[$c]}==0)) && e="x" || e=":a -e \"s/^.{0,$((${L[$c]}-1))}$/&$p/;ta\""
    >>"$source" echo '<(sed -re '"$e"' <(cat "${F['$c']}"; echo -n "${N['$c']}")) \'
done
# include the ASCII-art Table framework
>>"$source" echo ' | sed  -e "s/.*/| & |/" -e "s/\t/ | /g" \'   # Add vertical frame lines
>>"$source" echo ' | sed -re "1 {h;s/[^|]/-/g;s/\|/+/g;p;g}" \' # Add top and bottom frame lines
>>"$source" echo '        -e "$ {p;s/[^|]/-/g;s/\|/+/g}"'
>>"$source" echo  
# Run the code
source "$source"
rm     "$source"
exit

Here is my original answer (trimmed a bit in lieu of the above script)

Using wc to get the column width, and sed to right pad with a visible character . (just for this example)... and then paste to join the two columns with a Tab char...

paste <(sed -re :a -e 's/^.{1,'"$(($(wc -L <F1)-1))"'}$/&./;ta' F1) F2

# output (No trailing whitespace)
Languages.............  Minimal automaton
Recursively enumerable  Turing machine
Regular...............  Finite

If you want to pad out the right column:

paste <( sed -re :a -e 's/^.{1,'"$(($(wc -L <F1)-1))"'}$/&./;ta' F1 ) \
      <( sed -re :a -e 's/^.{1,'"$(($(wc -L <F2)-1))"'}$/&./;ta' F2 )  

# output (With trailing whitespace)
Languages.............  Minimal automaton
Recursively enumerable  Turing machine...
Regular...............  Finite...........
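
The same pad-then-paste idea can be sketched with awk doing the padding instead of the sed loop. This is a swapped-in technique rather than the script above: the helper name pad and the sample files are assumptions for illustration, and the padding uses spaces. The Tab delimiter that paste inserts is kept, so the fields can still be split afterwards.

```shell
# Sample input files (illustrative data only).
dir=$(mktemp -d)
printf '%s\n' 'Languages' 'Recursively enumerable' 'Regular' > "$dir/F1"
printf '%s\n' 'Minimal automaton' 'Turing machine' 'Finite' > "$dir/F2"

# pad FILE: right-pad every line with spaces to the width of the longest line.
# The file is read twice: first pass finds the width, second pass prints.
pad() {
    awk 'NR == FNR { if (length($0) > w) w = length($0); next }
                   { printf "%-*s\n", w, $0 }' "$1" "$1"
}

pad "$dir/F1" > "$dir/F1.pad"
out=$(paste "$dir/F1.pad" "$dir/F2")   # columns are still Tab-delimited
printf '%s\n' "$out"
rm -r "$dir"
```

Because the Tab survives, something like cut -f2 still extracts the second column from the result.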

Thanks! You have done quite a lot of work. That's amazing. – Tim Jul 14 '11 at 03:00

Gilles 'SO- stop being evil' · Answer 4 · 2011-07-11T18:21:22.883

You're almost there. paste puts a tab character between each column, so all you need to do is expand the tabs. (I assume your files don't contain tabs.) You do need to determine the width of the left column. With (recent enough) GNU utilities, wc -L shows the length of the longest line. On other systems, make a first pass with awk. The +1 is the amount of blank space you want between columns.

paste left.txt right.txt | expand -t $(($(wc -L <left.txt) + 1))
paste left.txt right.txt | expand -t $(awk 'n<length {n=length} END {print n+1}' left.txt)

If you have the BSD column utility, you can use it to determine the column width and expand the tabs in one go. (␉ is a literal tab character; under bash/ksh/zsh you can use $'\t' instead, and in any shell you can use "$(printf '\t')".)

paste left.txt right.txt | column -s '␉' -t
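
As a small illustration of the portable way to obtain a literal tab mentioned above, the following sketch stores the tab in a variable (the name tab is arbitrary) and uses it with tr so the delimiter becomes visible.

```shell
# Works in any POSIX shell: command substitution keeps the tab itself.
tab=$(printf '\t')

# Make the delimiter visible by turning each tab into a pipe character.
joined=$(printf 'left\tright\n' | tr "$tab" '|')
echo "$joined"
```

The same variable can be passed wherever a literal tab is expected, for example as the argument to column -s.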

In my version of wc, the command needs to be: wc -L <left.txt ... because, when a filename is specified as a command line arg, its name is output to stdout — Peter.O, Jul 11 '11 at 16:18

techno · Answer 5 · 2018-06-23T18:04:28.970

5

I'm unable to comment on glenn jackman's answer, so am adding this to address the issue of empty cells that Peter.O noted. Adding a null char prior to each tab eliminates the runs of delimiters that are treated as a single break and addresses the issue. (I originally used spaces, but using the null char eliminates the extra space between columns.)

paste file1 file2 | sed 's/\t/\0\t/g' | column -s $'\t' -t

If the null char causes problems for various reasons, try either:

paste file1 file2 | sed 's/\t/ \t/g' | column -s $'\t' -t

or

paste file1 file2 | sed $'s/\t/ \t/g' | column -s $'\t' -t

Both sed and column appear to vary in implementation across flavors and versions of Unix/Linux, especially BSD (and Mac OS X) vs. GNU/Linux.

edited Jun 23 '18 at 18:04

answered Nov 05 '14 at 21:20

That sed command appears to do nothing. I replace the column command with od -c and I don't see any null bytes. This is on centos and ubuntu. – glenn jackman Jun 22 '18 at 15:21
1

This worked for me in RedHat EL4. Both sed and column seem to vary over time and system. In Ubuntu 14.4 using \0 didn't work as a null in sed, but \x0 did. However, then column gave a line too long error.
Simplest thing seems to be to use a space and live with the extra character.
– techno Jun 23 '18 at 18:08

score 4 · Answer 6 · answered Jul 11 '11 at 01:22

This is multi-step, so it's non-optimal, but here goes.

1) Find the length of the longest line in file1.txt.

while read line
do
echo ${#line}
done < file1.txt | sort -n | tail -1

With your example, the longest line is 22.

2) Use awk to pad file1.txt, padding each line shorter than 22 characters up to 22 with the printf statement.

awk 'BEGIN {FS="---"} {printf "%-22s\n", $1}' < file1.txt > file1-pad.txt

Note: For FS, use a string that does not exist in file1.txt.

3) Use paste as you did before.

$ paste file1-pad.txt file2.txt
Languages               Minimal automaton
Recursively enumerable  Turing machine
Regular                 Finite

If this is something you do often, this can easily be turned into a script.

In your code to find the longest line, you need while IFS= read -r line, otherwise the shell will mangle whitespace and backslashes. But the shell isn't the best tool for that job; recent versions of GNU coreutils have wc -L (see fred's answer), or you can use awk: awk 'n<length {n=length} END {print +n}'. — Gilles 'SO- stop being evil', Jul 11 '11 at 13:20
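
The point about read in the comment above can be seen in a small sketch. The sample string is made up for illustration: two leading spaces, then a, a backslash, and b.

```shell
# Made-up sample line, 5 characters long.
sample='  a\b'

# Plain read trims the leading whitespace and treats the backslash as an escape.
plain=$(printf '%s\n' "$sample" | { read line; printf '%s' "${#line}"; })

# IFS= keeps the whitespace; -r keeps the backslash.
safe=$(printf '%s\n' "$sample" | { IFS= read -r line; printf '%s' "${#line}"; })

echo "plain read: $plain characters; IFS= read -r: $safe characters"
```

Only the second form reports the true length, which is what matters when the result is used as a column width.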

score 1 · Answer 7 · edited Apr 13 '17 at 12:37

Building on bahamat’s answer: this can be done entirely in awk, reading the files only once and not creating any temporary files. To solve the problem as stated, do

awk '
        NR==FNR { if (length > max_length) max_length = length
                  max_FNR = FNR
                  save[FNR] = $0
                  next
                }
                { printf "%-*s", max_length+2, save[FNR]
                  print
                }
        END     { if (FNR < max_FNR) {
                        for (i=FNR+1; i <= max_FNR; i++) print save[i]
                  }
                }
    '   file1 file2

As with many awk scripts of this ilk, the above first reads file1, saving all the data in the save array and simultaneously computing the maximum line length. Then it reads file2 and prints the saved (file1) data side-by-side with the current (file2) data. Finally, if file1 is longer than file2 (has more lines), we print the last few lines of file1 (the ones for which there is no corresponding line in the second column).

Regarding the printf format:

"%-nns" prints a string left-justified in a field nn characters wide.
"%-*s", nn does the same thing — the * tells it to take the field width from the next parameter.
By using max_length+2 for nn, we get two spaces between the columns. Obviously the +2 can be adjusted.
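
A minimal sketch of the two printf forms side by side, assuming an awk that supports the * width (the scripts in this answer rely on the same feature). The trailing | is only there to make the padding visible.

```shell
# Fixed width written into the format string.
fixed=$(awk 'BEGIN { printf "%-6s|\n", "ab" }')

# Same width supplied as an extra argument via *.
star=$(awk 'BEGIN { width = 6; printf "%-*s|\n", width, "ab" }')

echo "$fixed"
echo "$star"
```

Both lines should come out identical: the string left-justified and padded with spaces to six characters.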

The above script works only for two files. It can trivially be modified to handle three files, or to handle four files, etc., but this would be tedious and is left as an exercise. However, it turns out not to be hard to modify it to handle any number of files:

awk '
        FNR==1  { file_num++ }
                { if (length > max_length[file_num]) max_length[file_num] = length
                  max_FNR[file_num] = FNR
                  save[file_num,FNR] = $0
                }
        END     { for (j=1; j<=file_num; j++) {
                        if (max_FNR[j] > global_max_FNR) global_max_FNR = max_FNR[j]
                  }
                  for (i=1; i<=global_max_FNR; i++) {
                        for (j=1; j<file_num; j++) printf "%-*s", max_length[j]+2, save[j,i]
                        print save[file_num,i]
                  }
                }
    '   file*

This is very similar to my first script, except

It turns max_length into an array.
It turns max_FNR into an array.
It turns save into a two-dimensional array.
It reads all the files, saving all the contents. Then it writes out all the output from the END block.

I know that this question is old; I just stumbled upon it. I agree that paste is the best solution; specifically, glenn jackman’s paste file1 file2 | column -s $'\t' -t. But I thought it would be fun to try to improve on the awk approach. — G-Man Says 'Reinstate Monica', Oct 24 '16 at 04:34
