How to remove duplicate names and print array after unique names

Question

How can I collapse KO categories with the same name and print the gene names assigned to each category on a single line, as in the example below?

I have this:

K00002  gene_65472
K00002  gene_212051
K00002  gene_403626
K00003  gene_666
K00003  gene_5168
K00003  gene_7635
K00003  gene_12687
K00003  gene_175295
K00003  gene_647659
K00003  gene_663019
K00004  gene_88381
K00005  gene_30485
K00005  gene_193699
K00005  gene_256294
K00005  gene_307497

And want this:

K00002  gene_65472  gene_212051 gene_403626             
K00003  gene_666    gene_5168   gene_7635   gene_12687  gene_175295 gene_647659 gene_663019
K00004  gene_88381                      
K00005  gene_30485  gene_193699 gene_256294 gene_307497

The following command worked (taken from roaima's answer):

tr -d '\r' < file | awk '$1 != p { if (p>"") {printf "\n"} printf "%s",$1; p=$1 } { printf "\t%s",$2 } END { if(p>"") {printf "\n"} }' > output

Kamila, can you put your sample file online somewhere? I can't see any obvious reason why my code should be giving you the results you see, and I'm wondering if there is some strange but unseen formatting embedded in the file. — Chris Davies, Jan 16 '19 at 15:35
If you can't/won't share your file, you can edit your question with the output of head -n 5 your_file | hexdump -C. This will help us detecting non-printable characters. — fra-san, Jan 16 '19 at 15:49
I uploaded the file here: https://www.filemail.com/d/ggrodflaiihybep — Kamila, Jan 16 '19 at 15:51
It's a Windows format file. Apply the fixes I've given you at the bottom of my question. — Chris Davies, Jan 16 '19 at 15:54
I tried your suggestion but it didn't help: tr -d '\r' < file | awk '$1 != p { if (p>"") {printf "\n"} printf "%s",$1; p=$1 } { printf "\t%s",$2 } END { if(p>"") {printf "\n"} }' file — Kamila, Jan 16 '19 at 15:57
@Kamila It's tr -d '\r' < file | awk '$1 != p { if (p>"") {printf "\n"} printf "%s",$1; p=$1 } { printf "\t%s",$2 } END { if(p>"") {printf "\n"} }', without the final file. — fra-san, Jan 16 '19 at 16:10

Chris Davies · Accepted Answer · 2019-01-16T15:30:07.850

More of the same

awk '$1 != p { if (p>"") {printf "\n"} printf "%s",$1; p=$1 } { printf "\t%s",$2 } END { if(p>"") {printf "\n"} }' datafile

K00002  gene_65472      gene_212051     gene_403626
K00003  gene_666        gene_5168       gene_7635       gene_12687      gene_175295     gene_647659     gene_663019
K00004  gene_88381
K00005  gene_30485      gene_193699     gene_256294     gene_307497

If you don't want separation by tab then change the \t to a space.

Here's how it works:

# Each line is processed in turn. "p" is the previous line's key field value

# Key field isn't the same as before
$1 != p {
    # Flush this line if we have printed something already
    if (p > "") { printf "\n" }

    # Print the key field name and set it as the current key field
    printf "%s", $1; p = $1
}

# Every line, print the second value on the line
{ printf "\t%s", $2 }

# No more input. Flush the line if we have already printed something
END {
    if (p > "") { printf "\n" }
}
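The annotated program above can be run as it stands by saving it to a file and passing that file to awk with -f. The following is a minimal sketch; the temporary files and the three-line sample input are made up for illustration and are not part of the original data.

```shell
# Hypothetical temporary files for the awk program and a small tab-separated sample
prog=$(mktemp)
data=$(mktemp)

# The same program as above, stored in a file
cat > "$prog" <<'EOF'
$1 != p {
    if (p > "") { printf "\n" }
    printf "%s", $1; p = $1
}
{ printf "\t%s", $2 }
END { if (p > "") { printf "\n" } }
EOF

printf 'K00004\tgene_88381\nK00005\tgene_30485\nK00005\tgene_193699\n' > "$data"

# Run the stored program against the sample and show the grouped lines
out=$(awk -f "$prog" "$data")
printf '%s\n' "$out"

rm -f "$prog" "$data"
```

Keeping the program in a file avoids the quoting problems that tend to creep in when a long one-liner is copied between a comment box and a terminal.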

From the vague comments you're making against everyone's Answers, it seems the underlying issue is that you're using a data file generated on a Windows system and expecting it to work on a UNIX/Linux platform. Don't do that. Or if you must, convert the file to the correct format first.

dos2unix < datafile | awk '...'       # As above

tr -d '\r' < datafile | awk '...'     # Also as above
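To check whether a file really has Windows line endings before converting it, one option is to count the carriage returns in it. This is only a sketch: the two-line sample is made up and written with printf so that it contains CRLF line endings. With such input, every $2 that awk sees ends in an invisible \r, which is why the output looks wrong even though the command is correct.

```shell
# Hypothetical sample with Windows (CRLF) line endings
sample=$(mktemp)
printf 'K00002\tgene_65472\r\nK00002\tgene_212051\r\n' > "$sample"

# Keep only the carriage returns and count them; 0 would mean UNIX line endings
cr_count=$(tr -cd '\r' < "$sample" | wc -c | tr -d ' ')
echo "carriage returns found: $cr_count"

# Strip the carriage returns on the fly, then group as before
fixed=$(tr -d '\r' < "$sample" |
    awk '$1 != p { if (p>"") {printf "\n"} printf "%s",$1; p=$1 } { printf "\t%s",$2 } END { if(p>"") {printf "\n"} }')
printf '%s\n' "$fixed"

rm -f "$sample"
```

Note that awk reads from the pipe here, so no file name follows the awk program.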

It didn't work. The gene names are listed in separate lines instead of in one line — Kamila, Jan 16 '19 at 13:38
I used your command: awk '$1 != p { if (p>"") {printf "\n"} printf "%s",$1; p=$1 } { printf "\t%s",$2 } END { if(p>"") {printf "\n"} }' file.txt > output — Kamila, Jan 16 '19 at 13:43
I'm working on Linux. I tried to paste the example of the output I'm getting but this forum doesn't allow me to paste it in its original form — Kamila, Jan 16 '19 at 13:48

score 0 · Answer 2 · edited Jan 16 '19 at 11:32

file:

K00002  gene_65472
K00002  gene_212051
K00002  gene_403626
K00003  gene_666
K00003  gene_5168
K00003  gene_7635
K00003  gene_12687
K00003  gene_654221
K00003  gene_663019
K00004  gene_88381
K00005  gene_30485
K00005  gene_193699
K00005  gene_256294

Using awk:

awk '1 {if (a[$1]) {a[$1] = a[$1]" "$2} else {a[$1] = $2}} END {for (i in a) { print i,a[i]}}' file

output:

K00002 gene_65472 gene_212051 gene_403626
K00003 gene_666 gene_5168 gene_7635 gene_12687 gene_654221 gene_663019
K00004 gene_88381
K00005 gene_30485 gene_193699 gene_256294

I took this post as reference.

fra-san · Answer 3 · 2020-06-28T20:59:23.357

Assuming that your values do not contain spaces and are space-separated; assuming also that your data is in a file named file (see below for a tab-separated version):

for x in $(<file cut -d ' ' -f 1 | sort | uniq); do
    printf '%s %s\n' "$x" "$(grep "^$x " file | cut -d ' ' -f 2- | tr '\n' ' ' | sed 's/.$//')"
done

This will:

Extract the distinct values of the first field:
- cut selects only the first chunk (-f 1) of a line, breaking it at each space (-d ' ');
- sort | uniq will sort the values of the first field and output each one a single time only (alternatively, shorter and more efficient: sort -u);
For each distinct value:
- Extract all the relevant lines from file with grep;
- Strip the first field from them with cut (-f 2- means "take the second and following fields");
- Translate the remainder into a list of space separated values (tr);
- Get rid of the last character - an unneeded space - using sed (yes, this is really inelegant);
- Concatenate the result to the value of the first field and print to standard output.

If your input is tab-separated and you want tab-separated output, the above code becomes:

for x in $(<file cut -f 1 | sort | uniq); do
    printf '%s\t%s\n' "$x" "$(grep "^$x$(printf '\t')" file | cut -f 2- | tr '\n' '\t' | sed 's/.$//')"
done

Notes:

Performance: this approach is significantly slower than the awk-based solutions (I tested roaima's answer), by at least an order of magnitude.
On the other hand, this approach works even if the input file is not ordered.
Despite this kind of solution being a quick (and dirty?) way of effectively doing the job, processing text with shell loops is generally not advisable; see for reference "Why is using a shell loop to process text considered bad practice?".

score 0 · Answer 4 · answered Jan 16 '19 at 12:02

Using Miller (http://johnkerl.org/miller/doc), with

mlr --csv --implicit-csv-header --headerless-csv-output cat -n -g 1 then label a,b,c then reshape -s a,c then unsparsify --fill-with "" input.csv

and this example CSV input

A,234
A,4945
B,8798
B,8798
B,790

you will get

A,234,4945,
B,8798,8798,790
