
I'm not sure how to phrase the question as most of the answers are about removing \r\n from a file.

I have a unique problem where compressed files are numbered randomly and in order to associate them properly with a database record, I need to list the file contents and check them.

I'm using this solution, "In bash script, how to capture stdout line by line", which was a great start.

Some of the contents have a name with whitespace and I found this solution: How to print third column to last column?

I discovered, when attempting to update the database record, that ^M is being inserted in the results of the awk pipe, but only for the last column ($NF).

Not sure how to resolve this particular glitch. I don't see where ^M is being inserted, or how to remove it from the last column.

My code

This line works fine if I strip ^M

filename="$(echo "$line" | awk '{if ($3 ~ /^M$/) {sub(/^M$/,"", $3)} printf $3; printf ""}')"

This line fails:

text="$(echo "$line" | awk '{for(i=6;i<NF+1;i++) {if ($i ~ /^M$/) {sub(/^M$/,"", $i)} } printf "%s ", $i; printf ""}')"

And the simplified version fails:

text="$(echo "$line" | awk '{for(i=6;i<NF+1;i++) sub(/^M$/,"", $i) printf "%s ", $i; printf ""}')"

In vim/vi, ^M is created with ctrl-V + <return key>. Using \r\n has no effect.

I'm using cygwin, and have been for a long time, and I have other *nix scripts that I have written which run fine. I discovered that for some reason, this particular run of awk is adding ^M to the output.

I found this question with a similar problem, but I created my script with vim from the start, so there was no Windows-based editor involved.

If I mount that Windows folder as a Samba share and run the script from Linux, it produces the output without a ^M, so at this point I'm wondering if this is a bug or something else. It's really strange.

UPDATE: My use of the regex in sub() was causing the string to come back empty, so I had not properly understood how to clear out the CRLF.

NF+1 was a leftover from attempting to find out what was introducing the CRLF; I was using i<=NF before that.

2 Answers

With a few implementations of awk, including GNU awk, mawk and busybox awk (the 3 implementations commonly found on Linux-based systems, Cygwin's being GNU awk by default I believe), RS, the input record separator, can be a regular expression (as opposed to a single character in POSIX).

In those, you can do:

awk -v RS='\r\n' '{print $NF}' < your-file.msdos

to process those files, or:

awk -v RS='\r?\n' '{print $NF}' < your-file.msdos-or-unix

to be able to process both files with \n delimiters or \r\n separators.
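As a rough sketch of the RS approach above: the function name last_field and the sample input are made up for illustration, and this assumes an awk that accepts a regular expression in RS (GNU awk, mawk or busybox awk).

```shell
# Hypothetical helper: print the last field of every record, accepting
# either LF or CRLF line endings. Assumes an awk with regex RS support
# (GNU awk, mawk, busybox awk); POSIX only guarantees a single character.
last_field() {
    awk -v RS='\r?\n' '{ print $NF }'
}

# Made-up sample: the first line ends in CRLF, the second in LF only.
printf 'a b c\r\nd e f\n' | last_field
```

Because the CR is consumed as part of the record separator, it never reaches $NF, so nothing has to be stripped afterwards.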

Some MS-DOS files also tend to have the last line non-delimited, but awk will also fix that on output, as it appends the output record separator (ORS, which remains \n here) to all records when printing.

As far as default field splitting goes in awk, you'll also find that there is variation between implementations. POSIX says it should be split on sequences of blanks, leading and trailing ones removed. The notion of blank is locale-dependent, and includes at least SPC and TAB. You'll find many awk implementations restrict it to SPC and TAB only regardless of the locale; many also add NL (only relevant when the record separator is not newline).

busybox awk splits on all ASCII whitespace, including CR, FF and VT. So in busybox awk, fields by default never contain CR. You can achieve the same behaviour with GNU awk by doing gawk -v 'FPAT=[^[:space:]]+', where fields are then defined as sequences of non-whitespace characters.
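One way to see whether your awk leaves the CR in the last field is to print the field's length. This is only a sketch: the two function names and the sample line are made up, and the first result depends on the implementation, as described above.

```shell
# Hypothetical helpers for checking whether a CR is hiding in $NF.

# Length of the last field as awk sees it. With default splitting this
# counts the CR too in GNU awk and mawk; busybox awk treats CR as a
# field separator, so the CR is not part of the field there.
last_field_length() {
    awk '{ print length($NF) }'
}

# Same, but with one trailing CR removed from the record first.
last_field_length_stripped() {
    awk '{ sub(/\r$/, ""); print length($NF) }'
}

# Made-up sample line ending in CRLF; "name" is 4 characters long.
printf 'a b name\r\n' | last_field_length
printf 'a b name\r\n' | last_field_length_stripped
```

If the two numbers differ, the CR is being kept as part of the last field, which matches the ^M showing up only in the $NF column.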

A few more notes:

  • avoid shell loops to process text, especially here since you're already using awk which is one of the right tools to process text.
  • don't use echo on arbitrary data
  • the first argument to printf is the format, you don't want to use arbitrary data there. Use printf "%s", $3 if you want to print $3 without appending ORS, not printf $3.
  • printf "" is a no-op. It doesn't do anything. If you want to print a newline, use printf "\n" or print "" (the latter prints ORS, newline by default).
  • Ok. Thanks for the insight. – Ken Ingram Jan 24 '20 at 07:35
  • This was really good info. I see that I need to stop using "echo" the way I am doing it and use printf instead. This will help me level up my bash skills. Perhaps after all these years I will truly be intermediate level. – Ken Ingram Jan 24 '20 at 08:22
  • Just be careful with RS='\r?\n' since some files, e.g. a CSV exported from MS-Excel, will use \r\n as the end of a record but use \n as linebreaks within fields so if you set RS='\r?\n' then you get parts of the record read as if it was a whole record since the \n within a field will also be interpreted as an end of a record. – Ed Morton Jan 24 '20 at 16:03
  • Most of my ETL work is generated in a unix environment. However, this is a good caveat to tape to the toolbox. – Ken Ingram Jan 24 '20 at 20:39
  • @EdMorton, a CSV can have \r\n or \n within a field, you can't process those CSVs with awk in any case in that way (regardless of whether RS is \n, \r\n or \r?\n). – Stéphane Chazelas Jan 24 '20 at 21:35
  • Right, a "CSV" could mean just about anything and all I'm saying is that some CSVs, including those exported from MS-Excel, end records with \r\n but use \n mid-field which means that if you use RS='\r?\n' then you will have failures. So you should determine what your particular CSV uses to terminate records and mid-record and then use the specific RS='\n' or RS='\r\n' (or whatever else) as appropriate rather than assuming that if you use RS='\r?\n' it will generally let you process both files with \n delimiters or \r\n separators. – Ed Morton Jan 25 '20 at 14:46
  • Overall I appreciate the push to tighter and more efficient algorithm creation, and from the link on loops, I'd really like to see an alternative. The bottom line is that tasks need to be done on the SysAdmin side. When they are repetitive and ongoing, it makes sense to "Tom Sawyer" the situation... Absorbing these concepts and using them requires some adjustment... while work still needs to get done. – Ken Ingram Jan 26 '20 at 00:13
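To illustrate the RS caveat discussed in the comments above, here is a small sketch that counts records under each setting. The helper name count_records and the CSV-like sample are invented for the example, and it assumes an awk with regex RS support (GNU awk, mawk or busybox awk).

```shell
# Hypothetical helper: count the records awk sees for a given RS value.
# The RS value is passed with literal backslashes; awk expands the escapes.
count_records() {
    awk -v RS="$1" 'END { print NR }'
}

# Made-up sample: two CRLF-terminated records, the first containing a
# bare LF inside a quoted field.
sample() {
    printf 'id1,"line1\nline2"\r\nid2,x\r\n'
}

sample | count_records '\r\n'     # the embedded LF stays inside the record
sample | count_records '\r?\n'    # the embedded LF also ends a record
```

With RS='\r\n' the two records are counted as two; with RS='\r?\n' the embedded LF splits the first record, giving three.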

awk does not see ^M as a literal ^ followed by M. ^M is how the carriage return character (CR, \r) is displayed, i.e. the first half of a CRLF (\r\n) line ending, so your sub() can use the \r escape directly, as below. Also, you don't have to check whether the field contains the character before doing the replacement: the substitution functions simply do nothing if the pattern is not found. So all you need to remove the CR from the last column only is the following.

awk '{ sub("\r", "", $NF); print $NF }' 

If there are multiple columns that need to be replaced, switch $NF with the appropriate column needed.

If you are doing this in a loop for all the columns up to the end of the line, just do

awk '{ for(i=6; i<=NF ; i++) { sub("\r", "", $i); printf "%s ", $i; } }'

Also, a record can only have at most NF columns, and $NF is the value of the last one. Change your loop to run while i<=NF to reach the last column.
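Putting the pieces together, a minimal sketch of the whole extraction might look like this. The helper name name_from_line and the sample line are made up; it strips one trailing CR from the whole record before the fields are used, then joins fields 6 to NF with single spaces.

```shell
# Hypothetical helper: print fields 6..NF of each input line joined by
# single spaces, after removing one trailing CR from the record.
name_from_line() {
    awk '{
        sub(/\r$/, "")               # changes $0, so fields are re-split
        out = ""
        for (i = 6; i <= NF; i++)    # i <= NF includes the last field
            out = out (i > 6 ? " " : "") $i
        print out
    }'
}

# Made-up sample line: five leading columns, then a name with spaces,
# terminated by CRLF.
printf '1 2 3 4 5 John Q Public\r\n' | name_from_line
```

Inside a script this could be used as text="$(printf '%s\n' "$line" | name_from_line)", which also avoids echo on arbitrary data.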

Inian
  • I don't understand how that fits into my code. It looks like that's the sum total of the awk string. Also don't understand the function of the 1. – Ken Ingram Jan 24 '20 at 06:53
  • @KenIngram: I added an explicit print statement now. You can modify it how you want to use it for your other column values – Inian Jan 24 '20 at 06:54
  • I tried this: text="$(echo "$line" | awk '{for(i=6;i<NF+1;i++) {sub("\r", "", $NF) printf "%s ", $i; printf ""}}')" – Ken Ingram Jan 24 '20 at 06:55
  • The key point is that the particular text is sometimes separated with whitespace. It's a name so the whitespace needs to be included. Thus the loop. Integrating that last column is proving to be a major obstacle. – Ken Ingram Jan 24 '20 at 06:57
  • @KenIngram: I mentioned in the answer that $NF is for the last column only. You need to change it to $i in your loop. Also, why do you use NF+1? A record can only have at most NF columns, and $NF is the last column value. – Inian Jan 24 '20 at 06:57
  • +1 was an attempt to find the problem. i < NF leaves out $NF. I figured NF+1 in the condition would work. – Ken Ingram Jan 24 '20 at 06:57
  • Ok. I got it. Your solution solved the problem. – Ken Ingram Jan 24 '20 at 07:01
  • sub("\r", "", $NF) should be sub(/\r$/, "") since sub() takes a regexp as the first arg, doing the sub on $NF would force awk to rebuild each record unnecessarily, and if you have white space at the end of the line then the \r is after the white space rather than at the end of $NF. – Ed Morton Jan 24 '20 at 16:06
  • The problem with the regex is that it leaves out the other pieces. If there's no /\r$/, that $i is skipped. The other way it simply returns what it was given; unedited. – Ken Ingram Jan 26 '20 at 00:16