
I used the following syntax to read in a file name and modify its extension to generate the name of a matched data file. As an example, I have ABC_1.fastq.Block12 and XYZ_1.fastq.Block34, and I want to generate ABC_2.fastq.Block12 and XYZ_2.fastq.Block34 as my new file names:

for infile in *_1.fastq.Block*
do
    #base=$(basename ${infile} _1.fastq.**)
    IFS='_'
    read -ra ADDR <<< $infile
    base=${ADDR[0]}
    IFS='.'
    read -ra ADDR <<< ${ADDR[1]}
    second_file="${base}_2.fastq.${ADDR[2]}"
    echo $second_file
done

When executed, this script prints e.g.

ABC_2 fastq Block12
XYZ_2 fastq Block34

i.e. spaces between fastq and Block12. Why am I getting spaces rather than a period between these three strings when I concatenate? I thought that using braces for variable names should have eliminated this problem.

Max
  • Wouldn't second_file="${infile/_1/_2}" be a lot easier? – frabjous Aug 31 '22 at 22:46
  • Yes, that works in this case. However, for more complicated examples I do want to be able to use pieces of the split string to create new strings. I still don't understand why I'm getting spaces and no periods between the string elements with second_file. – Max Aug 31 '22 at 22:51
  • I take it it's because IFS is still set to ".", which makes the unquoted variable reach echo as multiple arguments rather than one. You can unset IFS prior to the echo or put the variable in quotes: echo "$second_file". – frabjous Aug 31 '22 at 23:02

1 Answer

In

echo $second_file

since you forgot to quote the parameter expansion, it is subject to split+glob. With IFS=., ABC_2.fastq.Block12 is first split into ABC_2, fastq and Block12, and each word is then subject to globbing, which has no effect here since none of the words contains glob operators.

So 3 arguments are passed to echo, which prints them separated by spaces.

To print the contents of a variable followed by a newline character, you need:

printf '%s\n' "$var"
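The effect of the missing quotes can be checked with a minimal sketch. The count_args function is a hypothetical helper introduced only for this illustration; it reports how many arguments it received. The file name is the one from the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the question): print the argument count.
count_args() { printf '%s\n' "$#"; }

second_file='ABC_2.fastq.Block12'
IFS='.'   # the lingering value left behind by the loop in the question

# Unquoted: the expansion is split on "." (and then globbed).
unquoted_count=$(count_args $second_file)
unquoted_output=$(echo $second_file)

# Quoted: the value is passed as a single argument.
quoted_count=$(count_args "$second_file")
quoted_output=$(printf '%s\n' "$second_file")

unset IFS   # back to default splitting behaviour

printf 'unquoted: %s argument(s), printed as: %s\n' "$unquoted_count" "$unquoted_output"
printf 'quoted:   %s argument(s), printed as: %s\n' "$quoted_count" "$quoted_output"
# unquoted: 3 argument(s), printed as: ABC_2 fastq Block12
# quoted:   1 argument(s), printed as: ABC_2.fastq.Block12
```

The variable itself always contains the periods; only the unquoted expansion loses them.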


Now, a few more comments on your code:

  • Since bash doesn't have the equivalent of zsh's (N) glob qualifier or ksh93's ~(N) glob operator, you need to set the nullglob option before using a glob in a for loop (at least):

    shopt -s nullglob
    for infile in *_1.fastq.Block*; do...
    

    If you don't and there's no matching file, the loop runs once over the literal string *_1.fastq.Block*.

  • You can set IFS for read only with: IFS=_ read -ra ADDR <<< "$infile" (see also the quotes around $infile which are needed in older versions of bash). That way, $IFS is only changed while read runs¹ and it's restored to its previous value after read returns.

  • IFS=. read -ra <<< "$var" is a poor method for splitting. First, it only works for single-line $vars, which is not necessarily the case for file names. It is also quite inefficient: it involves either storing the contents of $var in a temporary file or feeding it through a pipe, depending on the version of bash and/or the size of $var, and then reading it one byte at a time until a newline is found.

    Here, you could use the split+glob operator instead, with IFS set to whichever separator you need (: in this example):

    IFS=:; set -o noglob
    addr=( $infile )
    

    (or addr=( $infile'' ) so that a trailing : is not ignored.)

    Or switch to better shells with proper splitting operators.

    Another approach here would be to do:

    regex='^(.*)_1\.fastq\.(Block.*)$'
    if [[ $infile =~ $regex ]]; then
      outfile=${BASH_REMATCH[1]}_2.fastq.${BASH_REMATCH[2]}
      ...
    

    With the caveat that regex matching only works with valid text, which again is not guaranteed for file names.

    Here, you could also use standard sh parameter expansion operators:

    new_file=${infile%_1.fastq.Block*}_2.fastq.Block${infile##*_1.fastq.Block}
    

    Or the ksh93-style:

    new_file=${infile/_1.fastq.Block/_2.fastq.Block}
    

    (note the variations in behaviour among all those approaches if _1.fastq.Block occurs more than once in the file name).
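To make that last remark concrete, here is a small sketch that runs the regex, the standard parameter expansion and the ksh93-style substitution on one name. The input name is made up for the illustration and deliberately contains _1.fastq.Block twice; the variable names by_regex, by_expansion and by_substitution are likewise assumptions.

```shell
#!/usr/bin/env bash
# Made-up file name in which the pattern occurs twice.
infile='x_1.fastq.Block1_1.fastq.Block2'

# Regex: the greedy (.*) extends up to the last occurrence.
regex='^(.*)_1\.fastq\.(Block.*)$'
if [[ $infile =~ $regex ]]; then
  by_regex=${BASH_REMATCH[1]}_2.fastq.${BASH_REMATCH[2]}
fi

# Standard sh operators: % strips the shortest matching suffix and
# ## strips the longest matching prefix, so both act on the last occurrence.
by_expansion=${infile%_1.fastq.Block*}_2.fastq.Block${infile##*_1.fastq.Block}

# ${var/pattern/replacement} replaces the first occurrence only.
by_substitution=${infile/_1.fastq.Block/_2.fastq.Block}

printf '%s\n' "$by_regex" "$by_expansion" "$by_substitution"
# x_1.fastq.Block1_2.fastq.Block2
# x_1.fastq.Block1_2.fastq.Block2
# x_2.fastq.Block1_1.fastq.Block2
```

For names where the pattern occurs only once, such as those in the question, all three give the same result.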


¹ Though beware that if a trap is handled whilst read is running, the code in that trap will see the modified $IFS.
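As a rough check of the scoped form IFS=_ read ... discussed above, the following sketch rebuilds the name from the question while leaving the global $IFS alone. The array names parts and ext and the variables ifs_before and ifs_state are assumptions chosen for the illustration. It does not exercise the trap caveat from the footnote.

```shell
#!/usr/bin/env bash
infile='ABC_1.fastq.Block12'   # sample name from the question
ifs_before=$IFS                # remember the global value for comparison

# The assignment prefix applies to this read command only.
IFS=_ read -ra parts <<< "$infile"        # parts: ABC, 1.fastq.Block12
IFS=. read -ra ext   <<< "${parts[1]}"    # ext:   1, fastq, Block12

second_file="${parts[0]}_2.fastq.${ext[2]}"

# Compare the global IFS with the value saved before the two reads.
if [[ $IFS == "$ifs_before" ]]; then
  ifs_state=unchanged
else
  ifs_state=changed
fi

printf '%s\n' "$second_file"          # ABC_2.fastq.Block12
printf 'global IFS is %s\n' "$ifs_state"   # global IFS is unchanged
```

Because the global $IFS keeps its default value here, even an unquoted echo $second_file would print the periods, though quoting the expansion remains the reliable fix.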