10

I want to remove every line of a file that contains a particular character exactly once. If the character appears more than once, or not at all, the line should be kept.

For example:

DTHGTY
FGTHDC
HYTRHD
HTCCYD
JUTDYC

Here the character is C, so the command should remove the lines FGTHDC and JUTDYC because they contain C exactly once.

How can I do this using either sed or awk?

Namz

8 Answers

20

In awk you can set the field separator to anything. If you set it to C, each line has one more field than it has occurrences of C.

So if you run awk -F'C' '{print NF}' <<< "C1C2C3" you get 4: the string contains three Cs and hence four fields.
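To see the field count at work on the question's sample, here is a small sketch that prints the number of Cs (NF - 1) next to each line. The variable names `sample` and `counts` are illustrative and not part of the answer.

```shell
# Sample lines from the question (variable names are illustrative).
sample='DTHGTY
FGTHDC
HYTRHD
HTCCYD
JUTDYC'

# With C as the field separator, NF - 1 is the number of Cs on a non-empty line.
counts=$(printf '%s\n' "$sample" | awk -F'C' '{ print NF - 1, $0 }')
printf '%s\n' "$counts"
```

The lines reported with a count of 1 (FGTHDC and JUTDYC) are the ones that have exactly two fields.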

You want to remove the lines in which C occurs exactly once, that is, the lines that have exactly two C-separated fields. So just skip them:

$ awk -F'C' 'NF!=2' file
DTHGTY
HYTRHD
HTCCYD
fedorqui
  • 4
    Astute use of awk field separator ! – Valentin B. May 02 '17 at 09:24
  • Interesting, as in the default case (FS=" ") it ignores leading spaces ($1 = the first non-space on the line) and also repetitions (you can have 5 spaces separating field 1 and field 2) ... space is probably treated specially? (To see it, one can run awk 'BEGIN { print "FS={" FS"}","OFS={" OFS "}";} {printf "%d fields : ",NF; for (i=1;i<=NF;i++) {printf "{" $i "} ";}; print "" }' and feed it some lines, some having multiple spaces and others beginning with space(s).) – Olivier Dulac May 02 '17 at 16:18
  • 2
    @OlivierDulac, yes, space is handled specially as specified by POSIX. – Wildcard May 02 '17 at 20:13
8

sed approach:

sed -i '/^[^C]*C[^C]*$/d' input

The -i option edits the file in place

/^[^C]*C[^C]*$/ - matches lines that contain C exactly once

d - delete matched lines
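Before letting -i rewrite a file, it may be worth previewing which lines the pattern selects. A minimal sketch, assuming the question's sample is fed on standard input (the names `sample` and `doomed` are made up for illustration):

```shell
# Sample lines from the question (variable names are illustrative).
sample='DTHGTY
FGTHDC
HYTRHD
HTCCYD
JUTDYC'

# -n suppresses the default print; p prints only the lines the pattern matches,
# i.e. exactly the lines that the d command would delete.
doomed=$(printf '%s\n' "$sample" | sed -n '/^[^C]*C[^C]*$/p')
printf '%s\n' "$doomed"
```

This prints FGTHDC and JUTDYC, the two lines with a single C.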

8

This can be done with sed as:

Code:

sed '/C.*C/p;/C/d' file1

Results:

DTHGTY
HYTRHD
HTCCYD

How?

  1. Match and print any line with at least two copies of C via /C.*C/p
  2. Delete any line containing a C via /C/d. This includes the lines already printed in step 1, so they are not printed a second time
  3. Print the remaining lines (those without any C) by default
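The need for the second step can be seen by running the first step alone. The sketch below uses the question's sample; the variable names `sample`, `step1` and `both` are illustrative only.

```shell
# Sample lines from the question (variable names are illustrative).
sample='DTHGTY
FGTHDC
HYTRHD
HTCCYD
JUTDYC'

# Step 1 alone: a line with two or more Cs comes out twice
# (once from the explicit p, once from sed's default print).
step1=$(printf '%s\n' "$sample" | sed '/C.*C/p')

# Steps 1 and 2 together: d discards every line containing C
# and suppresses its default print, so nothing is duplicated.
both=$(printf '%s\n' "$sample" | sed '/C.*C/p;/C/d')

printf 'step 1 only:\n%s\n\nboth steps:\n%s\n' "$step1" "$both"
```

With step 1 only, HTCCYD appears twice and the single-C lines are still present; with both steps, only DTHGTY, HYTRHD and HTCCYD remain.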
6

This removes the lines with exactly one occurrence of C.

grep -v '^[^C]*C[^C]*$' file

The regular expression [^C] matches one character which isn't C (or newline), and the repetition operator (aka Kleene star) * specifies zero or more repetitions of the preceding expression.

grep, like most text-oriented tools, writes to standard output by default; redirect the output to a new file and, if you want to replace the original, move the new file over it. The same regex can be used with sed -i for in-place editing:

sed -i '/^[^C]*C[^C]*$/d' file

(On some platforms, notably *BSD including macOS, the -i option requires an argument, like -i ''.)
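The redirect-and-move route mentioned above could be sketched as follows. The function name `remove_single_c` is hypothetical, and the sketch assumes `mktemp` is available (it is common but not guaranteed everywhere). Note that grep -v exits with status 1 when it prints nothing, which is a valid result here and not an error.

```shell
# Hypothetical helper: rewrite the file "$1" without the lines
# that contain C exactly once.
remove_single_c() {
    file=$1
    tmp=$(mktemp) || return 1   # assumption: mktemp exists on this system
    status=0
    grep -v '^[^C]*C[^C]*$' "$file" > "$tmp" || status=$?
    # grep exits 0 if it printed lines, 1 if it printed none, >1 on a real error.
    if [ "$status" -le 1 ]; then
        mv "$tmp" "$file"
    else
        rm -f "$tmp"
        return "$status"
    fi
}
```

It would be called as remove_single_c file. Because mv replaces the original with the temporary file, the result takes on the temporary file's permissions; writing the contents back with cat "$tmp" > "$file" is one alternative if that matters.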

tripleee
4

The POSIX tool for scripted edits of a file (rather than printing the modified contents to standard out) is ex.

printf '%s\n' 'g/^[^C]*C[^C]*$/d' x | ex file.txt

Of course you can use sed -i if your version of sed supports it; just be aware that it's not portable if you're writing a script that's intended to run on different types of systems.


David Foerster asked in the comments:

Is there a reason why you're using printf and not echo or something like ex -c COMMAND?

Answer: Yes.

For printf vs. echo it's a question of portability; see Why is printf better than echo? And it's also easier to intersperse newlines between commands using printf.

For printf ... | ex vs. ex -c ..., it's a question of error handling. For this specific command it would not matter, but in general it does; for example, try putting

ex -c '%s/this pattern is not in the file/replacement text/g | x' filename

in a script. Contrast with the following:

printf '%s\n' '%s/no matching lines/replacement/g' x | ex file

The first will hang and await input; the second will exit when EOF is received by the ex command, so the script will continue. There are alternative workarounds, such as s///e, but they are not specified by POSIX. I prefer using the portable form, which is shown above.

For the g command, there must be a newline at the end, and I prefer using printf to wrap the commands rather than embedding a newline in single quotes.

Wildcard
2
sed -e '
  s/C/&/2;t   # when 2nd C matches skip processing and print
  /C/d        # either one C or no C, so delete on C
'

sed -e '
   /C/!b     # no C, skip processing and print
   /C.*C/!d  # not(at least 2 C) => 1 C => delete
'

perl -lne 's/C/C/g == 1 or print'
2

Here are a couple of options using perl.

Since you're only matching a single character, you can use tr/C// (a translate, with no replacements), to return the number of matches of C:

perl -lne 'print if tr/C// != 1' file

More generally, if you want to match a multi-character string or regular expression, then you can use this:

perl -lne 'print if (@m = /C/g) != 1' file

This assigns the matches of the regular expression /C/g to a list @m and prints lines when the length of that list is not 1.

The -i switch can be added to edit "in-place".
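For readers who would rather stay with awk, the same count-the-matches idea can be sketched with gsub, which returns the number of substitutions it made. This is a swapped-in technique and not part of the perl answer; the pattern AB and the sample lines are made up for illustration.

```shell
# Made-up sample: AB occurs once, twice, and not at all.
sample='xxABxx
ABxxAB
xxxxxx'

# gsub() replaces each match with itself ("&"), leaving the line unchanged,
# and returns how many matches it found; lines whose count is not 1 are printed.
kept=$(printf '%s\n' "$sample" | awk 'gsub(/AB/, "&") != 1')
printf '%s\n' "$kept"
```

This keeps ABxxAB and xxxxxx and drops the line where AB occurs exactly once. The count is of non-overlapping matches.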

1

For anyone wanting awk specifically, I'd offer

awk '/^[^C]*C[^C]*$/{next}//{print}'

Skip the line if it matches the pattern (a line containing C exactly once) and print it otherwise. You don't actually need {print}; you can use // and the default print, but I think it's clearer spelled out.

My first thought was to use egrep -v with the same pattern, but that doesn't actually answer the question as posed.
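The spelled-out form and its terser equivalents can be compared side by side. This is only a sketch on the question's sample, using the exactly-one-C regex from the grep and sed answers; the variable names `explicit`, `terse` and `negated` are illustrative.

```shell
# Sample lines from the question (variable names are illustrative).
sample='DTHGTY
FGTHDC
HYTRHD
HTCCYD
JUTDYC'

# Skip matching lines, then print the rest explicitly.
explicit=$(printf '%s\n' "$sample" | awk '/^[^C]*C[^C]*$/ { next } { print }')

# Same idea, with the always-true pattern 1 triggering the default print.
terse=$(printf '%s\n' "$sample" | awk '/^[^C]*C[^C]*$/ { next } 1')

# Negate the match directly; the default action prints non-matching lines.
negated=$(printf '%s\n' "$sample" | awk '!/^[^C]*C[^C]*$/')

printf '%s\n' "$explicit"
```

All three produce the same output on this input, so the choice between them is a matter of readability.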

nigel222
  • 1
    What's the point of matching anything after {next}? Just say awk '/pattern/ {next} 1' and all lines not matching the pattern will be printed. Or, better, awk '!/pattern/' to directly print those. – fedorqui May 02 '17 at 21:45
  • @fedorqui good point about !/pattern/ (which somehow slipped my mind) but I'd far rather see a self-explanatory //{print} than a cryptic 1. Assume the least competence and fluency from the next person to maintain your code, consistent with not making it seriously less efficient or effective. – nigel222 May 03 '17 at 07:54