5

I need to feed a program some specific files, in the correct order and grouped two by two.

If I have

A_file.txt
B_file.txt
C_file.txt
D_file.txt

I need to feed them to a program so that files A and B are processed first, then C and D, and so on. In essence:

for i in *.txt; do 
   some_program A_file.txt B_file.txt > output_AB
   some_program C_file.txt D_file.txt > output_CD

I know that the above doesn't make sense, but it was to illustrate the point. Essentially, iterate over all .txt files in the folder but feed them two at a time to the program, then move to the next two.

Looking to learn, many thanks.

Kusalananda
  • 333,661

4 Answers

7

You could do this with the xargs command. If I have these files:

$ ls
A_file.txt  B_file.txt  C_file.txt  D_file.txt  E_file.txt  F_file.txt  G_file.txt  H_file.txt

Then I can process these two at a time like this:

$ find . -type f | xargs -n2 echo some_program
some_program ./A_file.txt ./B_file.txt
some_program ./C_file.txt ./D_file.txt
some_program ./E_file.txt ./F_file.txt
some_program ./G_file.txt ./H_file.txt

Here I'm simply calling echo, but you could of course drop the echo and actually run some_program instead. This will process two files at a time...but it doesn't handle generating an output filename for each invocation.

If we make it a little more elaborate, we can output to a file named after the first input filename:

find . -type f | xargs -n2 sh -c 'echo some_program $1 $2 > $1.output' --

This will produce the file A_file.txt.output for A_file.txt and B_file.txt, C_file.txt.output for the next pair, and so forth. You can get fancier with the output filename by applying various transformations; for example, to get the filename you asked for in your question, you could write:

find . -type f | xargs -n2 sh -c 'echo some_program $1 $2 > output_${1:2:1}${2:2:1}' --

This will generate output filenames output_AB, output_CD, etc.

larsks
  • 34,737
  • 1
    +1, I was going to post something like yours. Instead of -n2 I'd use -L2. Also, I used -print0 in find and -0 in xargs. That is to avoid problems with filenames containing spaces (although I do not think the user has any files with spaces) – Edgar Magallon Dec 14 '22 at 04:45
  • 1
    I intentionally left out the use of -print0/-0 in order to simplify the example, but I agree that in practice it's a good idea. – larsks Dec 14 '22 at 05:00
  • You probably want to restrict the find search to the current directory and files matching the pattern ?_file.txt (if you're picking the 1st character from the names to create the name of the output file). If you don't you will pick up old output files if you run it a second time, as well as any file in any subdirectory. – Kusalananda Dec 14 '22 at 07:25
  • 1
    @EdgarMagallon, that's not limited to spaces, that's potentially all whitespace characters (list varying with the implementation, but includes at least tab and newline) and quote and backslash characters. Since larsks forgot the quotes around parameter expansions, characters in $IFS and glob characters would also be a problem. – Stéphane Chazelas Dec 14 '22 at 08:38
  • ${var:offset:length} is a ksh93 operator (now also supported by bash and zsh), not a sh operator. Also, here, you're getting the 3rd character of the full path, so of the first directory component for a file like ./foo/bar/file.txt, not of the file name. – Stéphane Chazelas Dec 14 '22 at 08:40
  • 2
    Also note that the output of find is not sorted, so it's unspecified what pairs will be passed to some_program. – Stéphane Chazelas Dec 14 '22 at 08:41
  • @StéphaneChazelas thanks, that's useful. Recently I learned about the use of -0/-print0 but I had no idea that this is useful for other cases apart from the simple spaces. – Edgar Magallon Dec 14 '22 at 19:09
  • If the ordering of the files is important, then you will have to ensure that they are found by find in the correct order somehow. – Kusalananda Dec 14 '22 at 20:16
  • Ordering of files would also be an issue by just relying on shell wildcards, so I like the idea of controlling file ordering via find (e.g., do you care about upper/lower case? Numerical sort, etc?). Example: find . -name "foo*.txt" -print0 | xargs -0 printf '%s\n' | sort -f | xargs -L 2 sh -c 'echo 0=${0}, 1=${1}' – michael Dec 14 '22 at 22:49
  • @michael The ordering of the list resulting from a filename globbing pattern is guaranteed to be lexicographical by name by default. The order in which find finds files is dependent on the filesystem. Why are you using -print0 in your code when you then later do not use that nul terminator with sort? – Kusalananda Dec 15 '22 at 10:37
  • @Kusalananda the OP didn't really say what the expectations/assumptions were (plus these can change), so I'm kinda exploring all the options. Both glob expansion & find are going to be "environment dependent", e.g., wildcards are expanded based on LC_COLLATE as well (of course sort is also affected by this). E.g., try touch a1.txt A2.txt; LC_COLLATE=en_US.UTF-8 bash -c 'echo *'; LC_COLLATE=C bash -c 'echo *' (output: a1.txt A2.txt and A2.txt a1.txt). – michael Dec 16 '22 at 16:30
  • @Kusalananda As for when to use xargs and find w/ print0, my first find uses print0, which is paired with the first xargs -0, but that explicitly and necessarily converts back to lines of text for input to sort. Unfortunately, after experimenting, there's really no good one-liner piped solution here for filenames that have e.g. newlines (\n) in them, so I'm not sure it's all worth the effort. – michael Dec 16 '22 at 16:31
  • note: after testing on linux and macos, I do need to use $0 and $1 with xargs -L 2 ..., not $1,$2... on both platforms – michael Dec 16 '22 at 16:33
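Pulling the points from the comments together, a more defensive variant might look like the sketch below. It is only an illustration under stated assumptions: it relies on the non-POSIX options -maxdepth, -print0, sort -z and xargs -0 (available in GNU and most BSD tools), it uses echo as a stand-in for the real some_program, and the temporary directory and sample files exist only to make the example self-contained.

```bash
#!/bin/bash
# Sketch only: sample files in a scratch directory (assumption for the demo).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch A_file.txt B_file.txt C_file.txt D_file.txt

# -maxdepth 1 and -name keep old output files and subdirectories out of the list;
# -print0 / sort -z / xargs -0 pass names NUL-delimited and in sorted order.
find . -maxdepth 1 -type f -name '?_file.txt' -print0 |
  sort -z |
  xargs -0 -n 2 sh -c '
    # Strip the leading "./" so the prefix comes from the file name, not the path.
    a=${1##*/} b=${2##*/}
    # echo stands in for the real some_program; all expansions are quoted.
    echo some_program "$1" "$2" > "output_${a%_file.txt}${b%_file.txt}"
  ' sh
```

The trailing sh fills $0 of the inline script, so the two filenames arrive as $1 and $2. The ${var##pattern} and ${var%pattern} expansions are standard sh, unlike ${var:offset:length}.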
7
#!/bin/sh

set -- *_file.txt

until [ "$#" -lt 2 ]; do
    process "$1" "$2" >"output_${1%_file.txt}${2%_file.txt}"
    shift 2
done

This sets the positional parameters to the list of filenames you are interested in, based on a filename globbing pattern matching the names in the question. It then uses a loop to iterate over this list until there are fewer than two names left in the list ($# is the length of the list of positional parameters).

In each iteration, the first two elements of the list, $1 and $2, are processed and then shifted off the list using shift 2.

The output from the processing is redirected to a file named output_ followed by the concatenation of the variable parts of the two filenames (whatever is before the static _file.txt string in each).

This assumes that the files are named in such a way that sorting the names in lexicographical order (which the expansion of the globbing pattern will do) results in a list of names that can be paired in the way shown in the question.

Kusalananda
  • 333,661
  • 1
    Works like a charm. I researched the set function as I wasn't familiar with positional parameters; pretty cool! I have to work out the output names as files have more complicated names than A_file.txt which was more of an example - but I can figure that out. – ThePresident Dec 15 '22 at 02:17
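To try the set/shift loop without the real program, a self-contained sketch along these lines could be used. The process function here is a hypothetical stand-in, the scratch directory and the five sample files are assumptions for the demo, and the final check for an unpaired file is an addition that the loop in the answer does not include.

```sh
#!/bin/sh
# Hypothetical stand-in for the real program.
process() { printf 'processing %s and %s\n' "$1" "$2"; }

# Scratch directory with an odd number of sample files (demo assumption).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
touch A_file.txt B_file.txt C_file.txt D_file.txt E_file.txt

# The glob expands in sorted order into the positional parameters.
set -- *_file.txt

# Consume the list two names at a time.
while [ "$#" -ge 2 ]; do
    process "$1" "$2" >"output_${1%_file.txt}${2%_file.txt}"
    shift 2
done

# With an odd number of files, one name is still left in $1.
if [ "$#" -eq 1 ]; then
    leftover=$1
    printf 'unpaired file: %s\n' "$leftover" >&2
fi
```

With the five sample files this writes output_AB and output_CD and reports E_file.txt as unpaired on standard error.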
5

If switching from bash to zsh is an option, then it's just:

for i j ( *.txt(N) ) some_program -- $i $j > output_$i[1]$j[1]

(N) enables nullglob for that one glob expansion so as not to report an error if there's no match.

If there's an odd number of files, then the last run will be run with $j set to the empty string. Because $j is left unquoted in the arguments to some_program, no corresponding argument is passed to it in that case. Replace it with "$j" if you'd rather an empty argument be passed instead.

The *.txt expansion will be in alphabetical order; you can change the order to anything you want using the o, O and/or n glob qualifiers.

For an arbitrary number of files at each iteration as opposed to just 2:

files=( *.txt(N) ) n=5
while (( $#files )) {
  some_program -- $files[1,n] > output_${(Mj[])files[1,n]#?}
  files[1,n]=()
}

Or using zargs:

autoload -Uz zargs
process() some_program -- $@ > output_${(Mj[])@#?}
zargs -rl5 -- *.txt(N) -- process

In ${(Mj[])array#?}, ${array#?} would strip the leading character from each element of the array, but with M, what is Matched is returned instead. The result is joined with nothing ([]), so you get a string made of the first character of each element.
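For readers staying with bash, roughly the same batching can be sketched with array slicing. This is an illustration rather than a translation of the zsh code: echo stands in for some_program, the scratch directory and sample files are assumptions for the demo, and printf '%.1s' is used here to collect the first character of each name.

```bash
#!/bin/bash
shopt -s nullglob   # an unmatched glob expands to nothing instead of itself

# Scratch directory with sample files (demo assumption).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch A_file.txt B_file.txt C_file.txt D_file.txt E_file.txt

files=( *.txt ) n=3
for (( i = 0; i < ${#files[@]}; i += n )); do
    # Slice out up to n names starting at index i.
    batch=( "${files[@]:i:n}" )
    # printf reuses the format for every argument: first character of each name.
    suffix=$(printf '%.1s' "${batch[@]}")
    # echo stands in for the real some_program.
    echo some_program -- "${batch[@]}" > "output_$suffix"
done
```

With five files and n=3, the first batch goes to output_ABC and the shorter final batch to output_DE.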

2

Dump the list of files into an array and read from it.

#!/bin/bash
# Collect the matching names; the glob expands in sorted order.
arr=( *.txt )
i=0
# Step through the array two elements at a time.
while [ "$i" -lt "${#arr[@]}" ]; do
  echo "${arr[i]}" "${arr[i+1]}"
  i=$((i + 2))
done

If you have an odd number of files, the expansion ${arr[i+1]} will silently give you an empty string for the last pair. It is up to you to decide what to do in this case.

Kusalananda
  • 333,661
White Owl
  • 5,129