
I have a very big file and I want to remove every column whose value is 9 in all rows.

Sample:

My file value like this:

1 5 8 3 5 9 5 7 6 9
2 5 7 4 2 9 7 6 3 1
5 9 7 4 1 9 5 7 9 1

I want to delete any column where the value in every row is 9 (the file has many columns, so I cannot check them one by one: first column = 9, second column = 9, etc.). I need a dynamic script.

Output should be like this:

1 5 8 3 5 5 7 6 9
2 5 7 4 2 7 6 3 1
5 9 7 4 1 5 7 9 1

I am new to this and have tried many things without success.

How can I do it?

Thanks for your help

terdon
John

5 Answers


You can try this with awk:

awk '{ t[NR] = $0                     # remember each line for the END pass
       if (NR == 1) {
         for (i = 1; i <= NF; i++)
           if ($i == 9) met[i] = 1    # mark columns that start as all-9 candidates
       } else {
         for (i = 1; i <= NF; i++)
           if (met[i] != 1 || $i != 9) met[i] = 0   # unmark on any non-9 value
       } }
     END { for (i = 1; i <= NR; i++) {
             n = split(t[i], a)       # re-split the saved line into fields
             for (z = 1; z <= NF; z++)
               if (met[z] != 1) printf("%s ", a[z])
             print "" } }' file

Or, as in this answer, we can construct the parameters for a cut call, which is faster:

awk '{ if (NR == 1) {                 # no need to store the lines: cut rereads the file
         for (i = 1; i <= NF; i++)
           if ($i == 9) met[i] = 1
       } else {
         for (i = 1; i <= NF; i++)
           if (met[i] != 1 || $i != 9) met[i] = 0
       } }
     END { c = 0; s = " -f"
           for (i = 1; i <= NF; i++)
             if (met[i] == 1) {       # build the -f list of all-9 columns
               if (c == 0) s = s " " i; else s = s "," i
               c++ }
           s = s " -d\" \" "
           if (c != 0) { system("cut --complement " s " " FILENAME) }
           else { system("cat " FILENAME) } }' file

And of course both are open for criticism.

taliezin
  • Good! The cut version is approximately 3 times faster than the split version, and even that one is significantly faster than what I came up with (+1). BTW, you seem to have an extra print s, c; in the cut version – Peter.O Mar 24 '15 at 01:41

In Python:

#! /usr/bin/env python3

import sys
# Get the numbers
numbers = [[int(x) for x in line.strip().split()] for line in sys.stdin] 
# Get indexes of 9 in sets for each row
index_9 = (set(x for x, num in enumerate(line) if num == 9) for line in numbers)  

# Column indexes where every row has a 9
common_column = next(index_9).intersection(*index_9)

for line in numbers:
    print(' '.join((str(num) for x, num in enumerate(line) if x not in common_column)))
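
The script reads the table on standard input, so assuming it is saved as drop9.py (a name chosen here just for illustration):

python3 drop9.py < file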
muru

This awk method assumes that each row has the same number of fields (as in the example given in the question), that there are no empty fields, and that the values are single characters: the test strips every literal 9, so a value like 99 would also be treated as a 9. It also uses gensub, which is specific to GNU awk (gawk).

cat <<EOF >file
1 5 8 3 5 9 5 7 6 9
2 5 7 4 2 9 7 6 3 1
5 9 7 4 1 9 5 7 9 1
EOF

awk '{ for (c=1; c<=NF; c++) a[NR,c]=$c }
 END { for (c=1; c<=NF; c++) { 
         vc="" # values in column
         for (r=1; r<=NR; r++) { 
           vc = vc " " a[r,c]  }
         if ( ! gensub( /[9 ]/,"","g",vc) ) {
           for (r=1; r<=NR; r++) {
             a[r,c] = "" } }
       }
       for (r=1; r<=NR; r++) {
         for (c=1; c<=NF; c++) {
           if ( a[r,c] != "" ) printf "%s ", a[r,c] }  # != "" so a legitimate 0 still prints
         print "" }
     }' file

# output
1 5 8 3 5 5 7 6 9 
2 5 7 4 2 7 6 3 1 
5 9 7 4 1 5 7 9 1
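
gensub is specific to GNU awk. For other awk implementations, here is a sketch of the same approach using plain gsub, which edits a copy of the string in place instead of returning a new one (otherwise unchanged, same assumptions as above):

awk '{ for (c=1; c<=NF; c++) a[NR,c]=$c }
 END { for (c=1; c<=NF; c++) {
         vc="" # values in column
         for (r=1; r<=NR; r++) {
           vc = vc " " a[r,c]  }
         tmp = vc; gsub( /[9 ]/, "", tmp )  # strip 9s and spaces from a copy
         if ( tmp == "" ) {                 # nothing left: the column was all 9s
           for (r=1; r<=NR; r++) {
             a[r,c] = "" } }
       }
       for (r=1; r<=NR; r++) {
         for (c=1; c<=NF; c++) {
           if ( a[r,c] != "" ) printf "%s ", a[r,c] }
         print "" }
     }' file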
Peter.O

Here's a possible approach using bash / GNU coreutils that doesn't require much storage:

  1. cut the file column-by-column and record the indexes of any columns that do not consist entirely of 9s; if you know how many columns your file has (in this case 10; a way to detect the count automatically is sketched after this answer's code) it could be as simple as

    for ((i=1;i<11;i++)); do 
      [[ $(cut -d' ' -f${i} file | sed '/^9$/d' | wc -l) -eq 0 ]] || a+=($i)
    done
    

    (using the fact that only columns consisting entirely of 9s have length 0 after all the 9s are deleted); then

  2. pass the list of columns to be retained to a further cut command, using a change of IFS to turn the array to a comma-separated list

    (IFS=, ; cut -d' ' -f"${a[*]}" file)
    

If your version of cut supports the --complement flag you could record the columns that do contain all-9s and cut all except those:

    for ((i=1;i<11;i++)); do
      [[ $(cut -d' ' -f${i} file | sed '/^9$/d' | wc -l) -eq 0 ]] && a+=($i)
    done

    (IFS=, ; cut -d' ' --complement -f"${a[*]}" file)
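
If the number of columns isn't known in advance, it can be read from the first line of the file before looping. Here is a minimal sketch combining that with the grep test suggested in the comments below (it assumes GNU cut for --complement, like the variant above):

    ncols=$(awk 'NR == 1 { print NF; exit }' file)   # field count of the first line
    for ((i=1; i<=ncols; i++)); do
      # record the index if the whole column consists of 9s
      cut -d' ' -f"$i" file | grep -qv '^9$' || a+=($i)
    done
    if ((${#a[@]})); then
      (IFS=, ; cut -d' ' --complement -f"${a[*]}" file)
    else
      cat file   # no all-9 columns to remove
    fi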
steeldriver
  • +1, I had the same in mind. You could simplify the first part: cut -d' ' -f${i} infile | grep -qv '^9$' && a+=($i) (or || a+=($i) if using cut with --complement). Now, it'd be interesting to run the four working solutions here against a big file and see which one is faster... – don_crissti Mar 24 '15 at 02:04
  • @don_crissti thanks - I like the grep idea - much neater. With the complement version, one could maybe even make use of the -m option to quit processing as soon as it finds a non-9 entry? i.e. ... | grep -vqm1 '^9$' || a+=($i). Might speed it up a bit further. – steeldriver Mar 24 '15 at 03:39
  • Indeed, that should speed it up (it does work the other way around too: grep -vqm1 '^9$' && a+=($i) for your first approach but if you want to keep it portable then you'll have to do without -m1 & --complement). – don_crissti Mar 24 '15 at 11:54

Given the info in the question I can currently come up with:

awk '{for (i=1; i<=NF; i++){ a[i]+=$i; b[i]=b[i]" " $i}} END{for (i=1; i<=NF; i++) if (a[i]/NR!=9) {printf "%s\n", b[i]}}' same-column-values

The script iterates through the file, accumulating each column's sum in array 'a' and appending each value to the corresponding entry of array 'b'. Once the file has been read completely, it loops over the sums; if a sum divided by the number of records (NR) does not equal 9, the corresponding line of array 'b' is printed.

This gets me the output:

 1 2 5
 5 5 9
 8 7 7
 3 4 4
 5 2 1
 5 7 5
 7 6 7
 6 3 9
 9 1 1

The drawback is that the output is transposed: each printed line is one column of the original file, so it has to be read top to bottom and mapped back to left to right.

Alternatively, you can get a list of the columns that contain only the value 9 with the following command:

awk '{for (i=1; i<=NF; i++){ a[i]+=$i; b[i]=b[i]" " $i}} END{for (i=1; i<=NF; i++) if (a[i]/NR==9){print i; }}' same-column-values
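
To turn that list of column numbers into a filtered file, one possible follow-up is to join the numbers with commas and hand them to cut. This is only a sketch: it assumes GNU cut for --complement and inherits the false-positive caveat raised in the comment below.

cols=$(awk '{for (i=1; i<=NF; i++) a[i]+=$i}
            END{for (i=1; i<=NF; i++) if (a[i]/NR==9) print i}' same-column-values |
       paste -sd, -)
if [ -n "$cols" ]; then
  cut -d' ' --complement -f"$cols" same-column-values
else
  cat same-column-values   # no all-9 columns found
fi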

Lambert
  • The add-and-divide method can easily give false positives: e.g. three lines of a file may have a column with 8+9+10=27, which gives the same result as three lines containing 9+9+9=27. – Peter.O Mar 23 '15 at 18:47