6

Sometimes there are a few really annoying lines in otherwise tabular data like

column name | other column name
-------------------------------

I generally prefer removing garbage lines that shouldn't be there by piping through grep -v with a reasonably unique string, but if that string also appears in the data by accident, real rows get removed too, which is a serious problem.

Is there a way to limit the number of lines that grep -v can remove (say to 1)? For bonus points, is there a way to count the number of lines from the end without resorting to <some command> | tac | grep -v <some stuff> | tac ?

GAD3R
Greg Nisbet
  • 1
    how about awk 'NR>2' ? –  Jan 06 '17 at 04:11
  • Are you looking for something like cat a.txt | sed -e '1,2d' | tac | sed -e '1,1d' | tac, where 1,2d removes the first two lines (the column-name row and the hyphen row) and 1,1d removes the result-count row? See also http://unix.stackexchange.com/questions/37790/how-do-i-delete-the-first-n-lines-of-an-ascii-file-using-shell-commands – RBB Jan 06 '17 at 04:13
  • grep unfortunately cannot do it. The closest option is -m, which limits how many lines are output rather than how many are removed: grep -v -m 10 would print the first 10 non-matching lines and then stop. – Julie Pelletier Jan 06 '17 at 04:22
  • You can do both of these using the POSIX-specified predecessor to vi known as ex. If you include example input output I'll elucidate further. – Wildcard Jan 11 '17 at 22:28

5 Answers

5

You could use awk to ignore the first n lines that match (e.g. assuming you wanted to remove only the 1st and 2nd match from the file):

n=2
awk -v c=$n '/PATTERN/ && i++ < c {next};1' infile
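As a rough sketch of how the one-liner above behaves on the kind of table in the question, here it is wrapped in a small function. The name drop_first_matches and the sample rows are made up for illustration; the pattern is passed in with -v, so it should not contain backslashes.

```shell
# drop_first_matches N REGEX   (hypothetical helper, reads stdin)
# Deletes only the first N lines that match the awk regex REGEX.
drop_first_matches() {
    awk -v c="$1" -v pat="$2" '$0 ~ pat && i++ < c { next } 1'
}

# Sample table whose header text shows up again in a data row
printf '%s\n' \
    'column name | other column name' \
    '-------------------------------' \
    'foo | bar' \
    'column name | baz' |
    drop_first_matches 1 '^column name'
```

Only the real header is dropped; the later data row that happens to start with the same text is printed unchanged.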

To ignore the last n lines that match:

awk -v c=${lasttoprint} '!(/PATTERN/ && NR > c)' infile

where ${lasttoprint} is the line number of the (n+1)th-to-last match in your file, so every match after that line is dropped. There are various ways to get that line number (e.g. print only the line number of each match with tools like sed/awk, then use tail | head to extract it)... here's one way with GNU awk:

n=2
lasttoprint=$(gawk -v c=$((n+1)) '/PATTERN/{x[NR]};
END{asorti(x,z,"@ind_num_desc");{print z[c]}}' infile)
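If gawk is not available, a possible alternative (not the method above, just a sketch) is to let a single awk read the file twice: the first pass records where the matches are, the second pass skips the last n of them. The function name drop_last_matches is hypothetical, and this assumes a regular file that can be read twice, not a pipe.

```shell
# drop_last_matches N REGEX FILE   (hypothetical helper)
# Deletes only the last N lines of FILE that match the awk regex REGEX.
drop_last_matches() {
    awk -v n="$1" -v pat="$2" '
        # Pass 1: record the line number of every matching line
        NR == FNR { if ($0 ~ pat) hit[++k] = FNR; next }
        # Start of pass 2: mark the last n recorded line numbers for deletion
        FNR == 1 { for (j = k; j > k - n && j >= 1; j--) drop[hit[j]] = 1 }
        # Pass 2: print every line that is not marked
        !(FNR in drop)
    ' "$3" "$3"
}

# Demo on a throwaway file: only the final "total" line is removed
tmp=$(mktemp)
printf '%s\n' 'total' 'a' 'total' 'b' 'total' > "$tmp"
drop_last_matches 1 '^total$' "$tmp"
rm -f "$tmp"
```

This avoids the tac | grep -v | tac round trip at the cost of reading the file twice.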
don_crissti
2

sed provides another way:

... |  sed '/some stuff/ {N; s/^.*\n//; :p; N; $q; bp}' | ...

This deletes only the first occurrence.

If you want more:

sed '1 {h; s/.*/iiii/; x}; /some stuff/ {x; s/^i//; x; td; b; :d; d}'

Here the number of i characters is the number of occurrences to delete (one or more, not zero).

Multi-line Explanation

sed '1 {
    # Save the first line in the hold buffer, put the `i` counter in the
    # pattern space, then swap the two buffers
    h
    s/^.*$/iiii/
    x
}

# For lines matching the regexp we are looking for
/some stuff/ {
    # Remove one `i` from the counter in the hold buffer
    x
    s/i//
    x
    # If that succeeded, an `i` was still left: jump to `:d` and delete the line
    td
    # If not, the counter is used up: print this line and go on to the next
    b
    :d
    d
}'
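The counter idea explained above can be packaged as follows. This is only a sketch: the function name drop_first_sed is made up, N is assumed to be at least 1, and REGEX is assumed not to contain a slash. Each sed command is passed as its own -e fragment so that nothing follows a label name inside the same argument.

```shell
# drop_first_sed N REGEX   (hypothetical helper, reads stdin)
# Deletes the first N lines matching REGEX using a counter of "i" characters.
drop_first_sed() {
    # Build the counter string, e.g. N=3 gives "iii"
    counter=""
    k=0
    while [ "$k" -lt "$1" ]; do
        counter="${counter}i"
        k=$((k + 1))
    done
    # Line 1 stores the counter in the hold buffer; each match removes one "i"
    # and is deleted for as long as an "i" was still available
    sed -e "1 {" -e h -e "s/.*/$counter/" -e x -e "}" \
        -e "/$2/ {" -e x -e "s/^i//" -e x -e td -e b -e ":d" -e d -e "}"
}

# Removes x1 and x2, keeps a and x3
printf '%s\n' 'x1' 'a' 'x2' 'x3' | drop_first_sed 2 'x'
```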

In addition

This variant is probably faster, because once the requested occurrences have been deleted, the next match makes it read all the remaining lines and print them in one go:

sed '1 {h; s/.*/ii/; x}; /a/ {x; s/i//; x; td; :print_all; N; $q; bprint_all; :d; d}'

As a result

You can put this code in your .bashrc (or the configuration file of whichever shell you use):

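dtrash() {
    if [ $# -eq 0 ]
    then
        # No arguments: pass the input through unchanged
        cat
    elif [ $# -eq 1 ]
    then
        # One argument: delete only the first line matching regexp $1
        sed "/$1/ {N; s/^.*\n//; :p; N; \$q; bp}"
    else
        # Two arguments: delete the first $1 lines matching regexp $2
        count=""
        for i in $(seq "$1")
        do
            count="${count}i"
        done
        sed "1 {h; s/.*/$count/; x}; /$2/ {x; s/i//; x; td; :print_all; N; \$q; bprint_all; :d; d}"
    fi
}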

And use it this way:

# Remove first occurrence
cat file | dtrash 'stuff' 
# Remove four occurrences
cat file | dtrash 4 'stuff'
# Don't modify
cat file | dtrash
ValeriyKr
  • This is GNU-Sed specific. To use Sed labels and branching portably in a one-liner, use e.g. sed -e '/some stuff/ {N; s/^.*\n//; :p' -e 'N; $q; bp' -e '}' In other words you cannot portably include anything after a label name within the same argument. (You can also embed a newline, or put the Sed script in a file and use sed -f, but this is the clean portable one-liner branching approach.) – Wildcard Jan 11 '17 at 22:06
0

Perhaps reduce the chance of filtering out your data by using a more specific grep command. For example:

grep -v -F -x 'str1'

This removes only lines that are exactly str1. Or maybe:

grep -v '^str1.*str2$'

For lines that start with 'str1' and end with 'str2'.
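Applied to the table from the question, the exact-match variant might look like the sketch below. The helper name strip_exact and the sample rows are assumptions for illustration. The -e option is there because the separator line starts with a dash and would otherwise be parsed as an option.

```shell
# strip_exact LINE   (hypothetical helper, reads stdin)
# Removes only lines that are exactly LINE.
strip_exact() {
    # -F: fixed string, -x: whole line must match, -e: lets LINE start with "-"
    grep -v -F -x -e "$1"
}

printf '%s\n' \
    'column name | other column name' \
    '-------------------------------' \
    'x | column name | other column name' |
    strip_exact 'column name | other column name' |
    strip_exact '-------------------------------'
```

The data row that merely contains the header text survives, because -x requires the whole line to match. Note that this still removes every exact duplicate of the header, not just the first one.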

ifb
0

To do this you might have to use awk.

The simple way I know is this:

cat file | awk '{ $1=""; print}'

You can skip multiple columns too:

cat file | awk '{ $1=$2=$3=""; print}'

If you want to skip the last column and you're not sure how many columns you will have:

cat file | awk '{ $NF=""; print}'

Tested on Ubuntu 16.04 (GNU bash, version 4.3.48)

Best.

  • Note that you don't need to cat a file to awk; just provide it as command-line argument, it can read files completely on its own. – AdminBee Jul 22 '20 at 12:50
0

Another possible solution is to use bash's own built-ins:

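count=1   # number of matching lines to remove
found=0   # matching lines removed so far
# IFS= and -r keep leading whitespace and backslashes intact
while IFS= read -r line
do
   if [[ $line == *"myPattern"* ]]
   then
      if [ "$found" -eq "$count" ]
      then
         printf '%s\n' "$line"
      else
         found=$((found+1))
      fi
   else
      printf '%s\n' "$line"
   fi
done < execute-commons-fileupload.sh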

By setting count, you can change how many occurrences of your pattern are removed.

For me personally, this seems easier to extend, since you can easily add other conditions to the if statement (but this might be due to my marginal knowledge of sed).

  • 1
    Please note that while it is possible to perform text processing using shell read loops, it is quite inefficient when compared to solutions involving awk, for example, which has a rather powerful syntax for complex text and string manipulation tasks (see here for an overview on why using a shell loop is considered bad practice). – AdminBee Jul 22 '20 at 12:54
  • 1
    Thanks for the hint. For some use cases, using shell read loops is sufficiently fast and easier to configure (as my use case, where I only need to process ~200 lines), but you are right that in bigger cases the usage of awk or sed is better. – David Georg Reichelt Jul 22 '20 at 13:42