
I'm using Bash on Ubuntu and my issue is the following:

I have dozens of .TXT files in a certain folder.

As far as I know, all of them have a header (I believe every header fits on one line, but I am not 100% sure). However, they do not necessarily share the same header. Five files might have the same header while another file might have a unique header.

Ultimately, what I would like to do is concatenate the files that share the same header. The answers to the following question (Concatenate multiple files with same header) explain how to concatenate multiple files with the same header. In my case, however, I would first need to group the files sharing the same header before concatenating them (and keep only the header of the first file in each group).

Any ideas are welcome :) Thank you!

fra-san
Alex

1 Answer

awk '
  FNR==1{
    if (!($0 in h)){file=h[$0]=i++}
    else{file=h[$0];next}
  }
  {print >> (file)}
' *.txt

When awk is on the first line of a file (FNR==1):

  • If the header is not yet a key of the array h, the counter i++ (initially zero) becomes the output file name and is also stored in h under the key $0. No next is executed here, so the header itself is written to the new file.
  • Otherwise (the header is already in the array h), the file name is fetched from the array and next skips the duplicate header.

Finally, the line is printed to the corresponding file.
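
To make the grouping concrete, here is a small worked run of the same command. The scratch directory and the three sample files (a.txt, b.txt, c.txt) are made up for illustration and are not the asker's data; a.txt and b.txt share a header, c.txt has its own.

```shell
# Hypothetical sample data in a scratch directory (not the asker's real files).
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'id,name\n1,a\n' > a.txt
printf 'id,name\n2,b\n' > b.txt
printf 'x,y,z\n7,8,9\n' > c.txt

# Same awk program as above, unchanged.
awk '
  FNR==1{
    if (!($0 in h)){file=h[$0]=i++}
    else{file=h[$0];next}
  }
  {print >> (file)}
' *.txt

# Print every output line prefixed with the name of the file it ended up in.
grep . 0 1
# 0:id,name
# 0:1,a
# 0:2,b
# 1:x,y,z
# 1:7,8,9
```

Because the program appends with >>, running it a second time in the same directory would add the data again to the existing 0 and 1, so it is probably safest to remove old output files before a rerun.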


But I get "too many open files" or something of the sort.

GNU awk works around that by closing and reopening files as needed, but other awks may not. In that case, use

awk '
  FNR==1{
    # Entering a different group: close the previous output file first.
    if (!($0 in h)||file!=h[$0]){close(file)}
    if (!($0 in h)){file=h[$0]=i++}
    else{file=h[$0];next}
  }
  {print >> (file)}
' *.txt

Bear in mind this can be slower.
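
Whichever variant is used, a quick sanity check is to compare line counts: every input file beyond the first in its group loses exactly one line (its header), so the outputs should hold the input line count minus (number of input files minus number of output files). The sketch below is only one way to check that. The function name check_line_counts is hypothetical, and it assumes it is run in the directory holding the inputs, that the inputs match *.txt, and that no input file name starts with a digit (so ./[0-9]* matches only the outputs 0, 1, 2, ...).

```shell
# check_line_counts is a hypothetical helper name, not part of the answer above.
# Assumptions: inputs match ./*.txt, outputs are the files matching ./[0-9]*,
# and no input file name starts with a digit.
check_line_counts() {
  # NR in END is the total number of lines read across all files.
  in_lines=$(awk 'END{print NR}' ./*.txt)
  # FNR==1 fires once per non-empty file, so n counts the files.
  in_files=$(awk 'FNR==1{n++} END{print n+0}' ./*.txt)
  out_lines=$(awk 'END{print NR}' ./[0-9]*)
  out_files=$(awk 'FNR==1{n++} END{print n+0}' ./[0-9]*)

  # One header is dropped for every input file that joined an existing group.
  expected=$((in_lines - (in_files - out_files)))

  if [ "$out_lines" -eq "$expected" ]; then
    echo "OK: $out_lines output lines"
  else
    echo "MISMATCH: expected $expected, got $out_lines"
    return 1
  fi
}
```

Calling check_line_counts after the awk command prints either the OK line or the mismatch. A mismatch would suggest a rerun that appended twice, or stray files matching one of the two patterns.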

Quasímodo
  • There are many columns in these tables so the headers are very long. As a result, I get the error awk: .. fatal: can't redirect to... : File name too long. That is because the output file name is taken from the header. – Alex Aug 03 '20 at 10:30
  • @Alex Check out the edit. Hope it helps. Otherwise, please show us at least one fragment of one file of yours. – Quasímodo Aug 03 '20 at 11:59
  • I know it doesn't help but I am not allowed to show you any fragment of file... So I tried running the edited version (I thought I already had but I hadn't, my bad) and after 10 minutes it created 7 new files called 3,4,5,6,7,8 and 9. I guess it means there were 7 distinct headers initially – Alex Aug 03 '20 at 12:19
  • @Alex You could come up with a file that resembled yours, but that's OK. Are you sure there are no 0, 1 and 2 files too? The new files are supposed to begin from 0. – Quasímodo Aug 03 '20 at 12:33
  • I wrote other comments but just deleted them, because I just realized that there are files 0, 1 and 2, and moreover not 9 but 18 files were created. I'm now making sure the number of lines adds up – Alex Aug 03 '20 at 13:21
  • Thank you, it worked perfectly :) – Alex Aug 03 '20 at 13:40
  • Last question though : My .TXT files all have the same name structure, like: XXXXFile1_part1.TXT, XXXXFile1_part2.TXT would be the two parts of File1, that your command properly concatenated in a new file named 0, so how could I edit this command so that the new file is called 'File1' instead of '0' (for all the files, the file name is from the 5th to the 9th character of the path) – Alex Aug 03 '20 at 14:19
  • @Alex Glad it helps. This last request is a bit more complicated than it seems and may lead to edge cases. I'd need a list of filenames without masks to be sure the solution would be reliable. And you would have to guarantee the constraint you mentioned. All in all, it would probably be better to open a new question if you can give the details; otherwise there is not much we can do. – Quasímodo Aug 03 '20 at 14:27
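
For the naming request in the last two comments, one possible direction is sketched below. It is not a vetted solution and it rests entirely on the constraint stated in the comment: characters 5 to 9 of the file name identify the group, and awk is run from inside the folder so that FILENAME carries no directory prefix. The sample file names and contents are made up. If two different headers mapped to the same five characters, their data would end up mixed in one output file, which is exactly the kind of edge case mentioned above.

```shell
# Hypothetical sample files following the XXXXFile1_part1.TXT naming pattern.
names_demo=$(mktemp -d)
cd "$names_demo" || exit 1
printf 'id,name\n1,a\n' > XXXXFile1_part1.TXT
printf 'id,name\n2,b\n' > XXXXFile1_part2.TXT
printf 'x,y\n7,8\n'     > XXXXFile2_part1.TXT

# The glob is case-sensitive, so it is written as *.TXT here.
awk '
  FNR==1{
    if (!($0 in h)||file!=h[$0]){close(file)}
    # Assumption: characters 5-9 of the first file name seen for a header
    # name the group (XXXXFile1_part1.TXT -> File1).
    if (!($0 in h)){file=h[$0]=substr(FILENAME,5,5)}
    else{file=h[$0];next}
  }
  {print >> (file)}
' *.TXT

# Show which lines landed in which output file.
grep . File1 File2
```

The only change from the close() variant in the answer is that the output name comes from substr(FILENAME,5,5) instead of the counter i++.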