Grouping data based on the second column

Question

I have a file with the following lines:

1 a
2 a
3 a
1 b
2 b
1 c
2 c
3 c
4 c
1 d

I want to get the result as:

a 1 2 3
b 1 2
c 1 2 3 4
d 1

Hello and welcome to SE. Have you tried anything on your own at all or are you just waiting for somebody to do your work? And why did you roll back all the edits on this badly formatted (and titled/tagged) question? — Panki, Mar 18 '19 at 11:34
Please add more details to the question: Are the data lines unique or are duplicate lines possible? In case of e.g. a duplicate line 2 a, how should the output look like? Are the input lines always sorted already (by the 2nd column, followed by the 1st column)? In the output, should the numbers be sorted or do you want them to appear in the same order as in the input data? — Bodo, Mar 18 '19 at 11:44
Answering 4 specific questions with a simple 'Yes' isn't going to help anybody. — Panki, Mar 18 '19 at 11:50
Syed. I'm looking back at your original question text. Is your input a single line containing 1 a 2 a 3 a 1 b 2 b 1 c 2 c 3 c 4 c 1 d? Or is it multiple lines 1 a, 2 a, etc.? — Chris Davies, Mar 18 '19 at 11:57
It's with multiple line. I want the distinct of column 2 i.e. (a,b,c,d) with column value horizontally i.e. a 1 2 3, then b 1 2, etc — syed, Mar 18 '19 at 12:01
OK. Please [edit] your question to include this requirement ("I want the distinct...") as well as clarifications that answer @Rui's four questions. — Chris Davies, Mar 18 '19 at 12:57

Kusalananda · Answer 1 · 2019-03-18T17:37:02.170

Using awk:

awk '{ group[$2] = (group[$2] == "" ? $1 : group[$2] OFS $1 ) }
     END { for (group_name in group) print group_name, group[group_name] }' inputfile

This stores the groups in an array called group. This array is indexed on the group name (the second column in the input data) and for each line of input from inputfile, the value in the first column is appended to the correct group.

The END block loops over all collected groups and outputs the group name and the entries of that group.

This awk program with a nicer layout:

{
    group[$2] = (group[$2] == "" ? $1 : group[$2] OFS $1 )
}

END {
    for (group_name in group)
        print group_name, group[group_name]
}

Note that this is not what you'd want to do if you have massive amounts of data as the group array will actually store all input data read from the file.

For huge amounts of data, we assume that the input is sorted on the group names (the second column) and use

awk '$2 != group_name { if (group != "") print group_name, group; group = ""; group_name = $2 }
    { group = (group == "" ? $1 : group OFS $1) }
    END { if (group != "") print group_name, group }' inputfile

This keeps track of what the current group is, and collects the data for that group. Whenever the second column in the input switches to another value, it outputs the collected group data and starts collecting new data. This means that only the data for the current group is held in memory at any time, rather than the whole input data set.

This last awk program with a nicer layout:

$2 != group_name {
    if (group != "")
        print group_name, group

    group = ""
    group_name = $2
}

{
    group = (group == "" ? $1 : group OFS $1)
}

END {
    # Output last group (only), if there was any data at all.
    if (group != "")
        print group_name, group
}
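
If the input is not already sorted on the second column, the streaming program can be fed from sort. The following is a sketch under the assumption that it is acceptable for the numbers within each group to come out in numeric order rather than input order. The wrapper name group_sorted is made up for this illustration.

```shell
# Sketch: sort on column 2 (then numerically on column 1) and stream the
# result through the group-at-a-time awk program.
# "group_sorted" is an illustrative name only.
group_sorted() {
    sort -k2,2 -k1,1n "$@" |
    awk '
        # The group name changed: flush the previous group, start a new one.
        $2 != group_name {
            if (group != "")
                print group_name, group
            group = ""
            group_name = $2
        }

        # Append the first column to the current group.
        { group = (group == "" ? $1 : group OFS $1) }

        # Flush the last group, if there was any data at all.
        END { if (group != "") print group_name, group }
    '
}

# Demo with unsorted input.
printf '%s\n' '2 c' '1 a' '1 c' '2 a' | group_sorted
```

sort may spill to temporary files for large inputs, so awk still only ever holds one group at a time.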

Siva · Answer 2 · 2019-03-18T14:28:26.817

Try this:

for i in  `awk '!a[$2]++ { print $2}' file.txt`
do
        echo "$i `awk -v z=$i '$2==z{print $1}' file.txt | tr '\n' ' '`"
done

awk '!a[$2]++ { print $2}' prints each distinct value of column 2 once, in order of first appearance.
$2==z{print $1} prints the first column of every line whose second column equals the variable z.
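
The syntax error reported in the comments on this answer is consistent with Solaris's legacy /usr/bin/awk, which predates the POSIX awk language. The following is a rough sketch of choosing a newer awk before running the commands, assuming one of the listed candidates is installed. The function name pick_awk and the candidate list are illustrative, not taken from the answer.

```shell
# Sketch: pick the first available awk from a list of candidates that are
# usually POSIX-capable. "pick_awk" and the list are assumptions.
pick_awk() {
    for candidate in /usr/xpg4/bin/awk nawk gawk awk; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Demo: list the distinct values of column 2 with the chosen awk.
AWK=$(pick_awk) &&
    printf '%s\n' '1 a' '2 a' '1 b' | "$AWK" '!seen[$2]++ { print $2 }'
```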

answered Mar 18 '19 at 11:55, edited Mar 18 '19 at 14:28

for i in `awk '!a[$2]++ { print $2}' a.txt`; do echo "$i `awk -v z=$i '$2==z{print $1}' a.txt | tr '\n' ' '`"; done
awk: syntax error near line 1
awk: bailing out near line 1
– syed Mar 18 '19 at 12:09
Are you working on Solaris? – Siva Mar 18 '19 at 12:12
Yes. My OS is Solaris 5.11 – syed Mar 18 '19 at 12:13
Check this: https://stackoverflow.com/a/31338810/9702260 – Siva Mar 18 '19 at 12:14
This is not what you want to do in general: you're launching an independent copy of awk for each unique value in column 2, even though the task at hand could be done with a single awk script. (This is related to Why is using a shell loop to process text considered bad practice?) Also, using for i in $(foo) isn't good either: it breaks on whitespace (which the columns admittedly can't contain here) and also on glob characters. (See: Why is looping over find's output bad practice?) – ilkkachu Mar 18 '19 at 12:16
Tried with gawk; it's working. – syed Mar 18 '19 at 12:17
for i in `gawk '!a[$2]++ { print $2}' a.txt`; do echo "$i `gawk -v z=$i '$2==z{print $1}' a.txt | tr '\n' ' '`"; done
a 1 2 3
b 1 2
c 1 2 3 4
d 1
– syed Mar 18 '19 at 12:18

score -1 · Answer 3 · answered Mar 19 '19 at 08:13

Command:

for i in a b c d; do echo $i;awk -v i="$i" '$2 == i{print $1}' filename| perl -pne "s/\n/ /g";echo " "| perl -pne "s/ /\n/g";done| sed '/^$/d'| sed "N;s/\n/ /g"

Output:

for i in a b c d; do echo $i;awk -v i="$i" '$2 == i{print $1}' l.txt | perl -pne "s/\n/ /g";echo " "| perl -pne "s/ /\n/g";done| sed '/^$/d'| sed "N;s/\n/ /g"

a 1 2 3 
b 1 2 
c 1 2 3 4 
d 1

Praveen Kumar BS

Please let me know the reason for the downvote. – Praveen Kumar BS Mar 19 '19 at 23:32
