1

I have multiple CSV files with many records. The total number of rows should be 134. I have many files, and each rows has its own number of columns (from 15 to 200). I need to sort them according to their number of columns.

I was able to count a file's columns using:

$ awk -F"," '{print NF}' file # 1.csv

... which gives something like:

134
134
134
5
25
133
...

Now, I would like to add these numbers to each row, so that I can later sort my rows according to it. How can I add this information at the beginning of each, and sort?

I'd also like to split the files with value=134 into 1 file other by their respective count.

small INPUT file example (this is 3 rows):

2,"A.B.C.D",50,"SDf3oa701-ab73-a0pcs90","7012218969217-1413752517-32448","SDf3oa701-ab73-a0pcs90","SIP",,"<99999@sipgw5003.com>;tag=70122","<8888888@X.Y.Z.W>",17,0,"00:01:57.827 GMT Oct 20 2014","00:00:00.000 UTC Jan 01 1970","00:01:57.870 GMT Oct 20 2014",3,"sp3",1904,"sp3",1904,"realm_IN","realmTERM_OUT",,,,"::",0,"::",0,,"::",0,"::",0,0,0,0,0,0,0,0,0,0,0,,,,"::",0,"::",0,,"::",0,"::",0,0,0,0,0,0,0,0,0,0,0,,,,"::",0,"::",0,,"::",0,"::",0,0,0,0,0,0,0,0,0,0,0,,,,"::",0,"::",0,,"::",0,"::",0,0,0,0,0,0,0,0,0,0,0,,,"Sw-buildabcd","GMT-03:00",0,"8888888@X.Y.Z.W",,,,,,"X.Y.Z.W:50","A.S.D.F:50","A.S.D.F:50","A.S.D.F:50",,1,2,1,404,"8888888@A.S.D.F",,,4493101
2,"A.B.C.D",50,,,,4493105
2,"A.B.C.D",50,,"88888@B.D.S.E",,,4493106
slm
  • 369,824
  • Wow, I spent more time than I thought fixing this question... Next time, please make sure you read our How to ask before posting a question. By the way, I was unable to *clarify" your last sentence. – John WH Smith Nov 25 '14 at 11:25
  • Please [edit] your question and give us an example of your input file and the output you would like from that example. The solutions will depend on the format of your input files. Most importantly what you have to delimit fields. Are columns separated by spaces? Tabs? Commas? – terdon Nov 25 '14 at 11:28
  • Thanks. So, should row2 be considered to have 4 fields or 7? Can you have commas within a field? What is the output you would like to have from that input? – terdon Nov 25 '14 at 11:40
  • @terdon row2 & row3 should have 134 fields , same as row1 i dont want to add 0 values at this stage, rather i want to sort them by number of fields – Hussein Kray Nov 25 '14 at 11:44
  • @HusseinKray So in your example they all have the same number of fields and won't need sorting ? –  Nov 25 '14 at 11:49
  • @jidder no , i have many rows with vairalble length, this why i will seprate them to multiple files later on .. ( the correct nb of fields is 134) – Hussein Kray Nov 25 '14 at 11:51
  • They should have 134 fields? One hundred and thirty four? Why? Please show us a workable example. In order to help you, we need to see both input and desired output. If you need 134 fields, show us a file with 134 fields. Otherwise, show us a dummy file with 5 fields and explain what output you want to see from it. – terdon Nov 25 '14 at 11:55
  • @terdon both answers below helps alot to fix the files i have, the files i have are CDR ( call detail records) , it is complicated to give a golden rule since amny rows have different forms, . i marked the Q as answered, if i need more help i will make sure to post a new Q with proper input\output\question thanks alot – Hussein Kray Nov 25 '14 at 11:58
  • by the way INPUT-ROW#1 is a required format for each row , and each row should have 134 field ( this is the CDR format) – Hussein Kray Nov 25 '14 at 11:59

3 Answers3

2

Think this is what you want
Add -F, for comma separated.
E.g awk -F, '$(NF+1)=NF' file

Add Number to end of rows

 awk '$(NF+1)=NF' file

Input

1
1 2 3
1 2
1 2 3 4 5 6
a b

Output

1 1
1 2 3 3
1 2 2
1 2 3 4 5 6 6
a b 2

Sort Rows

 awk '{a[NF]=a[NF]?a[NF]"\n"$0:$0;x=x<NF?NF:x}END{for(i=1;i<=x;i++)if(i in a)print a[i]}'

Input

1
1 2 3
1 2
1 2 3 4 5 6
a b

Output

1
1 2
a b
1 2 3
1 2 3 4 5 6

Print to different files

Using field length of 4 for example, change to 134 or whatever you want for yours

 awk '{print > (NF>=4?"LargeFile.txt":"SmallFile.txt")}' file 

Input

1
1 2 3
1 2
1 2 3 4 5 6
a b

Output

LargeFile.txt

 1 2 3 4 5 6

SmallFile.txt

1
1 2 3
1 2
a b
  • that would work , but i have a comma separated values (CSV files) – Hussein Kray Nov 25 '14 at 11:28
  • 1
    @HusseinKray that's why you need to show your input. Also, do you want the lines from all CSVs do be mixed? Separate? Sorted? You need to show us an example of both your input and your desired output. – terdon Nov 25 '14 at 11:29
  • @HusseinKray Add -F, after awk –  Nov 25 '14 at 11:31
  • @Jidder awk isn't a native CSV parser : there may be commas in quoted fields. The OP may need a more robust command line tool for processing CSV files. – John WH Smith Nov 25 '14 at 11:33
  • @JohnWHSmith OP seems happy with the output for fields in his post, but yes if it has embedded commas then depending on which version of awk they are using all of this may be useless. –  Nov 25 '14 at 11:37
  • @jidder no comma in quoted fields so this wont be an issue – Hussein Kray Nov 25 '14 at 11:39
  • @HusseinKray Okay well is that what you wanted ? –  Nov 25 '14 at 11:47
  • @Jidder, that is exactly what i want , appreciate if you can tell me how to handle these input as CSV , not space-seperated (thanks a lot ) – Hussein Kray Nov 25 '14 at 11:49
  • @HusseinKray Add -F, after awk for comma separated –  Nov 25 '14 at 11:50
  • @slm I'd appreciate it if you didn't edit my code purely for your personal preference of formatting. Also For e.g doesn't even make sense –  Nov 25 '14 at 12:59
  • @Jidder - I edited more than that. You had formatting, typos. and punctuation issues. It's not my personal preference, I'm making your answer easier to read too. Your first paragraph is oddly formatted too. For e.g. means For example. If you disagree roll it back as you've done. But I've done enough editing to know what works/looks better than what you had originally posted. The posts here are read on a variety of devices, not just desktop browsers, so keep that in mind as well when posting. – slm Nov 25 '14 at 15:27
  • @slm I had no typos? Also e.g means exempli gratia(literally for example). For e.g is like saying 'for for example' so maybe do some research before incorrectly editing answers. As for it 'looking better' with your formatting that is completely your opinion/preference. Also doing a lot of editing doesn't necessarily mean your are good at it –  Nov 25 '14 at 20:17
2

Similar to @terdon's answer, but with sed:

{ seq -s, 10; seq -s, 5; seq -s, 15; } | 
tee - -

That's my infile - it looks like:

1,2,3,4,5,6,7,8,9,10
1,2,3,4,5,6,7,8,9,10
1,2,3,4,5,6,7,8,9,10
1,2,3,4,5
1,2,3,4,5
1,2,3,4,5
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

I can then do:

sed 'h;s/[^,]*//g;G;s/\n/ /' | sort -t\  -nk1,1

...which gets...

,,,, 1,2,3,4,5
,,,, 1,2,3,4,5
,,,, 1,2,3,4,5
,,,,,,,,, 1,2,3,4,5,6,7,8,9,10
,,,,,,,,, 1,2,3,4,5,6,7,8,9,10
,,,,,,,,, 1,2,3,4,5,6,7,8,9,10
,,,,,,,,,,,,,, 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,,,,,,,,,,,,,, 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,,,,,,,,,,,,,, 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

The numbers are not there, but the count is sure enough, I guess. To remove the leading commas I can just do:

PIPELINE | sed 's/,* //'

...which gets...

1,2,3,4,5
1,2,3,4,5
1,2,3,4,5
1,2,3,4,5,6,7,8,9,10
1,2,3,4,5,6,7,8,9,10
1,2,3,4,5,6,7,8,9,10
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

Perhaps not the most auspicious answer thus far, but the primary reason I decided to write this was that you mention you want to write lines with 134 comma-separated entries to another file. This, as it happens, is a simple matter with sed. For example - lets say I wanted to write the lines with 10 fields from the above sequence to a file2:

PIPELINE | sed '/^\([^,]*,[^,]*\)\{9\}$/w file2'
cat file2

OUTPUT

1,2,3,4,5,6,7,8,9,10
1,2,3,4,5,6,7,8,9,10
1,2,3,4,5,6,7,8,9,10

I use \{9\} above because it specifies 9 instances of the pattern - which makes 9 delimiters to 10 delimited fields. Ranges are simply handled as well:

PIPELINE | sed '/^\([^,]*,[^,]*\)\{4,9\}$/w file2'
cat file2

OUTPUT

1,2,3,4,5
1,2,3,4,5
1,2,3,4,5
1,2,3,4,5,6,7,8,9,10
1,2,3,4,5,6,7,8,9,10
1,2,3,4,5,6,7,8,9,10
mikeserv
  • 58,310
1

This will add the number of (comma separated) fields to the beginning of each line, print the line and then sort everything:

awk -F"," '{print NF,$0}' *csv | sort -nk1,1

The -n is numerical sort and the -k1,1 ensures that it is only sorted on the first field. To remove the number of fields after sorting, use:

awk -F"," 'print NF,$0' *csv | sort -nk1,1 | cut -d ' ' -f 2- 

NOTE: This will break very easily depending on your actual data. Can you have commas within fields? Can you have fields spanning multiple lines? This is a very naive approach and can't deal with any of that.

terdon
  • 242,166