
I have got this script here. It is supposed to loop over LineNumbers.file line by line (each line contains a line number) and replace 0/0 with ./. on the corresponding line of BEFORE_File.txt. It runs, but it only applies the change for the very last line number in LineNumbers.file instead of all >100 entries.

I am not sure what I am doing wrong here. Can you please help me to get awk to read the LineNumbers.file line by line?

I have got it to work with sed -i "${line}s/0\/0/\.\/\./" "${myFileTmp}", but it was really slow on the >3 GB files that I have, so I thought awk would be a faster option.

Many thanks!

cat ./LineNumbers_TEMP/LineNumbers.file | while read line
do
myFileTmp=BEFORE_File.txt
awk -v var=${line} 'FNR==var { sub(/0\/0/, "\.\/\."); print }' "${myFileTmp}" > AFTER_File.txt
done

For example, this is what the files look like:

cat ./LineNumbers_TEMP/LineNumbers.file
1
2
5

BEFORE_File.txt before running the script:

cat BEFORE_File.txt
0/0
0/0
0/1
0/1
0/0
0/0
0/0

This is how the file should look after running the script:

cat AFTER_File.txt
./.
./.
0/1
0/1
./.
0/0
0/0

At the moment I get only this:

./.

4 Answers

Answer by Kusalananda (score 7)

Your code does not work because for each line number read from LineNumbers.file, you re-read BEFORE_File.txt from the start and overwrite AFTER_File.txt, and your awk program prints only the single line selected by FNR==var. The final AFTER_File.txt therefore contains only the one modified line corresponding to the last line number listed in LineNumbers.file.

Additionally, parsing a whole file just to change a single line, and then doing that many times, is extremely inefficient, and doubly so when the modification to the lines is identical.

It would be better to first read the line numbers, and then to modify all lines in one go:

awk 'FNR == NR { lineno[$1] = 1; next }
     (FNR in lineno) && $0 == "0/0" { $0 = "./." }
     { print }' LineNumbers.file BEFORE_File.txt >AFTER_File.txt

FNR and NR are two special variables in awk that hold, respectively, the record number (line number, by default) within the current input file and the total number of records (lines) read so far across all files. For the first input file these two values are the same, and while they are, we store the line number as a key in the associative array lineno and skip to the next line.

When they are not the same, we test whether the current line number is a key in the lineno array, and whether the current line is additionally equal to 0/0. If so, it is changed to ./.. The last { print } block outputs all lines of the second file, whether modified or not.
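
As a quick sanity check with the sample files from the question, one could compare the result against the expected output (a minimal sketch; diff printing nothing means the result matches):

awk 'FNR == NR { lineno[$1] = 1; next }
     (FNR in lineno) && $0 == "0/0" { $0 = "./." }
     { print }' LineNumbers.file BEFORE_File.txt > AFTER_File.txt
diff AFTER_File.txt - <<'EOF'
./.
./.
0/1
0/1
./.
0/0
0/0
EOF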


A totally different approach is to use sed to create a sed script to do the necessary changes.

Given a line number n, the sed expression ns,^0/0$,./., would change line n by substituting 0/0 into ./.. If the line is anything but exactly 0/0, no changes would be made. I'm using commas as delimiters for the s/// command to avoid the leaning toothpick syndrome.
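
Applied on its own, such a command changes only the addressed line. For example, with the sample BEFORE_File.txt from the question:

sed '5s,^0/0$,./.,' BEFORE_File.txt
# 0/0
# 0/0
# 0/1
# 0/1
# ./.
# 0/0
# 0/0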

All we have to do is to create expressions like that for each line number n:

sed 's#.*#&s,^0/0$,./.,#' LineNumbers.file

Here, I use # as delimiters for s///. The & in the replacement part of the command would be replaced by the line number read from the input file.

For the given list of line numbers, this generates

1s,^0/0$,./.,
2s,^0/0$,./.,
5s,^0/0$,./.,

We can simply apply this directly to our file:

sed 's#.*#&s,^0/0$,./.,#' LineNumbers.file | sed -f /dev/stdin BEFORE_File.txt >AFTER_File.txt
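
If your shell supports process substitution (bash, ksh93, zsh), the same pipeline can be written without the explicit /dev/stdin:

sed -f <(sed 's#.*#&s,^0/0$,./.,#' LineNumbers.file) BEFORE_File.txt >AFTER_File.txt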
  • Sounds great. :) How would the awk and sed commands look for non-exact line matches, please? I should have clarified this at the beginning, but 0/0 is just part of a whole bunch of characters in a line. Let's say the line looks something like this: blabla 4858 ABC 0/0:4,3,2 0/1:4,3,2. At the moment it would not work for me because of the exact line match. Thanks.

    I like the toothpick syndrome :D.

    – P. HamB Apr 24 '20 at 07:08
  • @P.HamB You would use \<0/0\> rather than ^0/0$. \< and \> match word boundaries. On GNU systems, you may use \b instead of these (see the sketch after this comment thread). The answer would have to be updated further if you expect to replace all 0/0 on these lines, but this is not currently what the question asks about. Always ask about exactly what you need. Update the question if you have further requirements (I won't update my answer, as it would otherwise not be in line with the question). – Kusalananda Apr 24 '20 at 07:12
  • Nice, that's great. Many thanks. I have slightly modified it to my needs and it seems to work pretty well so far: sed 's#.*#&s,0/0,./.,#' LineNumbers.file | sed -f /dev/stdin BEFORE_File.txt >AFTER_File.txt. Thanks for the help, very much appreciated. – P. HamB Apr 24 '20 at 07:31
  • @P.HamB Note that you actually do need \<0/0\> or \b0/0\b there if you don't want to replace 10/0 with 1./.. – Kusalananda Apr 24 '20 at 07:33
  • Ok, makes sense. Sounds good. I don't have any cases with 10/0 :D. – P. HamB Apr 24 '20 at 07:39
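
Putting the advice from this comment thread together, a hedged sketch of the word-boundary variant (GNU sed syntax; \< and \> are not POSIX):

sed 's#.*#&s,\<0/0\>,./.,#' LineNumbers.file | sed -f /dev/stdin BEFORE_File.txt >AFTER_File.txt
# replaces the first whole-word 0/0 on each listed line, leaving 10/0 untouched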
Answer (score 5)

Let's see if this works for you:

awk '{ 
  if ( NR == FNR ) { 
    n[$1] = 0 
  } else { 
    if ( FNR in n ) { 
      gsub(/^0\/0$/, "./.", $0) 
    } 
    print 
  } 
}' LineNumbers.file BEFORE_File.txt > AFTER_File.txt

Output:

./.
./.
0/1
0/1
./.
0/0
0/0
Answer by Ed Morton (score 3)

Given that your input actually looks like blabla 4858 ABC 0/0:4,3,2 0/1:4,3,2 rather than the example you posted in your question, all you need is:

awk 'NR==FNR{a[$1]; next} FNR in a{sub("0/0","./.")} 1' LineNumbers.file BEFORE_File.txt >AFTER_File.txt
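
Note that sub() replaces only the first (leftmost) match on the line, so a later field like 0/1:4,3,2 is left alone. A quick illustration:

printf '%s\n' 'blabla 4858 ABC 0/0:4,3,2 0/1:4,3,2' | awk '{ sub("0/0", "./.") } 1'
# blabla 4858 ABC ./.:4,3,2 0/1:4,3,2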

For your next question please post sample input that looks like your real input to avoid getting answers that are too simple or more complicated than necessary and/or only work with input you don't actually have.

Don't do this, as it's a bad approach in multiple ways, but FYI, if you were to use a shell loop like the one in your question, then you'd write it as:

myFileTmp=$(mktemp)
cp BEFORE_File.txt AFTER_File.txt
while IFS= read -r line
do
    awk -v var="${line}" '
        FNR==var { sub("0/0", "./.") } { print }
    ' AFTER_File.txt > "$myFileTmp" &&
    mv "$myFileTmp" AFTER_File.txt
done < LineNumbers.file

Also wrt the script in your question - "\.\/\." in your sub() is a string. You don't need to escape regexp metacharacters in strings, just in regexps; ditto for /. All you need to write there is "./.". See Why is using a shell loop to process text considered bad practice?, http://porkmail.org/era/unix/award.html, and https://mywiki.wooledge.org/Quotes for some of the other issues with your script beyond just the problem you're having.
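
A minimal illustration with gawk, which warns about the bogus escape sequences and then treats them as plain characters (other awks may handle them differently, which is another reason to avoid them):

echo '0/0' | awk '{ sub(/0\/0/, "\.\/\.") } 1'   # gawk warns; "\.\/\." collapses to "./."; prints ./.
echo '0/0' | awk '{ sub(/0\/0/, "./.") } 1'      # equivalent and correct; prints ./.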

  • You can avoid storing the whole file with line numbers in memory by reading the value of line numbers into a variable with getline var <file. –  Apr 24 '20 at 19:58
  • Yeah but you'd need to add more code than that to call getline safely (see http://awk.freeshell.org/AllAboutGetline) and to decide when to call it (and make sure the file of line numbers is sorted which it looks like it probably is anyway) - it's not worth doing any of that until if/when simply reading the line numbers into memory proves to be a problem. – Ed Morton Apr 24 '20 at 20:41
  • Your position, then, is that knowing that a problem could be avoided and knowing the (not complex) way to solve it doesn't mean that a better solution needs to be provided. Interesting, noted, thanks. –  Apr 24 '20 at 20:50
  • My position is that you should write clear, concise, efficient, robust, portable, idiomatic code and then solve any specific speed or memory issues with that on those extremely rare occasions when that's not adequate. At a glance - compared to the idiomatic solution I posted, the answer you wrote is lengthier, less clear, probably slower, harder to enhance (try simply printing every input line from both files to stderr for debugging or printing the count of the number of lines read or ....) and will fail cryptically if/when getline fails and/or when the file of line numbers isn't sorted. – Ed Morton Apr 24 '20 at 21:40
  • Yes, sure, provide an answer that doesn't capture the whole file in memory then, any other discussion is just noise and personal opinion. –  Apr 24 '20 at 22:24
  • Ha, faster? Make seq 1000000 > filelinenum and time both solutions. –  Apr 24 '20 at 22:57
  • Not sure why you're choosing to further discuss probably slower rather than everything else I said but anyway I'm certain the solution I posted will be fast enough for the OPs needs and it has all of the other benefits I mentioned so I don't plan to devote any more time to this but - the OP said they have >100 lines in their line numbers file and >3GB in each file they want to change so if you want to test some timing then you should do so using files with those magnitude values, i.e. about 150 lines in the line numbers file and 100,000,000 lines in the other is probably representative. – Ed Morton Apr 25 '20 at 00:17
  • If you do get any results you find interesting, maybe take it up with one of the other people who posted answers that use the same approach as mine, since I don't care as mine will be fast enough. Also, to test one non-sunny-day case, try both approaches after running chmod oug-r LineNumbers.file first (or removing the file), and you'll find that the approach I used fails immediately, telling you it can't open the file, while your approach just keeps silently trying and failing to open the file but otherwise behaving as if it COULD open the file. As I said, you need more code than that to call getline safely. – Ed Morton Apr 25 '20 at 00:51
Answer (score 0)

There is the possibility of reading the lines from the file with the line numbers directly into a variable in awk via getline (assuming the line numbers are sorted):

getline var <"filename"
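
getline returns 1 when it reads a line, 0 at end of file, and -1 on error; the script below relies on this. A minimal sketch of consuming the whole file this way:

awk 'BEGIN {
    while ((rc = getline var < "LineNumbers.file") > 0)
        print "read line number:", var
    if (rc < 0)
        print "error reading LineNumbers.file" | "cat 1>&2"
}'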

The whole script will be one single call to awk as follows:

awk -v f1='./LineNumbers.file' '
    NR > var+0 {                  # past the current target: fetch the next line number
        rc = getline var < f1
        if (rc < 0) {             # getline returns -1 on a read error
            stderr = "cat 1>&2"
            print "error reading", f1 | stderr
            close(stderr)
            exit 1
        }
    }
    NR == var+0 { sub(/0\/0/, "./.") }   # on a target line, do the substitution
    1' BEFORE_File.txt

Of course, redirect the output to any file you like.