How to remove line if it contains a character exactly once

Question

I want to remove a line from a file if it contains a particular character exactly once. If the character is present more than once, or is not present at all, the line should stay in the file.

For example:

DTHGTY
FGTHDC
HYTRHD
HTCCYD
JUTDYC

Here, the character I want to match is C, so the command should remove the lines FGTHDC and JUTDYC because they contain C exactly once.

How can I do this using either sed or awk?

score 20 · Answer 1 · answered May 02 '17 at 08:50


In awk you can set the field separator to anything. If you set it to C, each line has one more field than it has occurrences of C.

So if you say awk -F'C' '{print NF}' <<< "C1C2C3" you get 4: the string contains 3 Cs, which split it into 4 fields (the first one empty).

You want to remove lines in which C occurs exactly once, which are the lines with exactly two fields when C is the separator. So just skip them:

$ awk -F'C' 'NF!=2' file
DTHGTY
HYTRHD
HTCCYD
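To see why NF!=2 selects the right lines, it may help to print the field count next to each line. This is a minimal sketch that feeds the question's sample data through a pipe instead of a file; the variable names are only for illustration.

```shell
# Sample input from the question, one string per line.
input='DTHGTY
FGTHDC
HYTRHD
HTCCYD
JUTDYC'

# With C as the field separator, NF is the number of Cs plus one.
counts=$(printf '%s\n' "$input" | awk -F'C' '{ print NF, $0 }')
printf '%s\n' "$counts"
```

The two lines reported with 2 fields are the ones that contain C exactly once, and they are the ones NF!=2 filters out.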

answered May 02 '17 at 08:50

fedorqui


Astute use of awk field separator ! – Valentin B. May 02 '17 at 09:24
Interesting, as in the default case (FS=" ") it ignores leading spaces ($1 = the first non-space on the line) and also repetitions (you can have 5 spaces separating field 1 and field 2) ... space is probably treated specially? (To see it, one can do awk 'BEGIN { print "FS={" FS"}","OFS={" OFS "}";} {printf "%d fields : ",NF; for (i=1;i<=NF;i++) {printf "{" $i "} ";}; print "" }' and feed it some lines, some having multiple spaces, and others beginning with space(s).) – Olivier Dulac May 02 '17 at 16:18

@OlivierDulac, yes, space is handled specially as specified by POSIX. – Wildcard May 02 '17 at 20:13
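The special handling of the default separator can be checked with a small comparison. This sketch assumes a test line with two leading spaces and three spaces between a and b; writing the separator as the bracket expression [ ] makes awk treat every single space as a separator instead of applying the default whitespace rules.

```shell
line='  a   b'   # two leading spaces, three spaces between a and b

# Default FS: leading blanks are ignored and runs of blanks count once.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# FS given as the regex [ ]: every single space separates two fields.
single_nf=$(printf '%s\n' "$line" | awk -F'[ ]' '{ print NF }')

printf 'default FS: %s fields\n' "$default_nf"
printf 'FS=[ ]:     %s fields\n' "$single_nf"
```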

RomanPerekhrest · Answer 2 · 2017-05-02T05:15:25.523

8

sed approach:

sed -i '/^[^C]*C[^C]*$/d' input

-i option allows in-place file modification

/^[^C]*C[^C]*$/ - matches lines that contain C only once

d - delete matched lines
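Before modifying a file in place, the expression can be tried on a pipe. A minimal sketch using the question's sample data, without -i so that nothing on disk changes:

```shell
# Lines matching "non-Cs, one C, non-Cs" from start to end are deleted.
result=$(printf '%s\n' DTHGTY FGTHDC HYTRHD HTCCYD JUTDYC |
  sed '/^[^C]*C[^C]*$/d')
printf '%s\n' "$result"
```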

edited May 02 '17 at 05:15

answered May 02 '17 at 05:09

RomanPerekhrest


Stephen Rauch · Answer 3 · 2017-05-02T20:29:20.603

8

This can be done with sed as:

Code:

sed '/C.*C/p;/C/d' file1

Results:

DTHGTY
HYTRHD
HTCCYD

How?

1. Match and print any line with at least two copies of C via /C.*C/p
2. Delete any line with a C via /C/d; this includes the lines already printed in step 1
3. Default print the rest of the lines
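The three steps can be traced on a tiny input. This is a sketch with made-up test lines (x, C, CC, aCbCc) chosen to cover zero, one, and several occurrences; it also shows that lines with two or more Cs are printed only once, because d ends the cycle before the default print.

```shell
# x: no C (default print), C: one C (deleted),
# CC and aCbCc: at least two Cs (printed by p, then deleted before auto-print).
result=$(printf '%s\n' x C CC aCbCc | sed '/C.*C/p;/C/d')
printf '%s\n' "$result"
```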

edited May 02 '17 at 20:29

answered May 02 '17 at 05:10

Stephen Rauch


Clever alternative approach; I like it. – Wildcard May 02 '17 at 05:15

score 6 · Answer 4 · answered May 02 '17 at 05:21


This removes the lines with exactly one occurrence of C.

grep -v '^[^C]*C[^C]*$' file

The regular expression [^C] matches one character which isn't C (or newline), and the repetition operator (aka Kleene star) * specifies zero or more repetitions of the preceding expression.

The default output from grep (and most other text-oriented tools) is to standard output; redirect to a new file and maybe move it on top of the original file if that's what you want. The same regex can be used with sed -i for in-place editing:

sed -i '/^[^C]*C[^C]*$/d' file

(On some platforms, notably *BSD including macOS, the -i option requires an argument, like -i ''.)
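The redirect-then-replace route mentioned above could look roughly like this. It is only a sketch: the scratch files come from mktemp (widely available but not specified by POSIX), and the file variable stands in for the real input file.

```shell
# Hypothetical scratch file standing in for the real input.
file=$(mktemp)
printf '%s\n' DTHGTY FGTHDC HYTRHD HTCCYD JUTDYC > "$file"

tmp=$(mktemp)
grep -v '^[^C]*C[^C]*$' "$file" > "$tmp"
status=$?

# grep exits 0 when it selected lines, 1 when it selected none,
# and greater than 1 on a real error; only the last case should abort.
if [ "$status" -le 1 ]; then
  cat "$tmp" > "$file"   # overwrite the contents, keeping the file's permissions
fi
rm -f "$tmp"
cat "$file"
```

Copying the contents back with cat rather than mv keeps the original file's permissions and inode, at the cost of not being atomic.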

answered May 02 '17 at 05:21

tripleee


sed -i '/^[^C]*C[^C]*$/d' file - sounds like it was posted before, how do you think, plagiarism? – RomanPerekhrest May 02 '17 at 05:24

Indeed, there is some duplication. I started out with the grep answer but it obviously easily extends to the sed -i variant. Didn't see your answer because I was looking for previous grep answers. – tripleee May 02 '17 at 05:24

It's safer to just plainly avoid -i with sed and instead redirect to a new file and replace the original with that if the sed utility exited with no error. – Kusalananda May 02 '17 at 09:34

Or grep -vx '[^C]*C[^C]*' – Stéphane Chazelas May 02 '17 at 09:35
@Kusalananda But then you might as well use grep because it's clearer and more robust (in particular, sed has a less informative exit code). – tripleee May 02 '17 at 09:52
@tripleee In this case, yes maybe. – Kusalananda May 02 '17 at 09:54
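The -x variant suggested in the comments makes grep match whole lines only, so the ^ and $ anchors can be dropped. A quick sketch comparing the two spellings on the question's sample data:

```shell
sample=$(printf '%s\n' DTHGTY FGTHDC HYTRHD HTCCYD JUTDYC)

# Explicit anchors versus -x (the pattern must match the entire line).
anchored=$(printf '%s\n' "$sample" | grep -v '^[^C]*C[^C]*$')
whole_line=$(printf '%s\n' "$sample" | grep -vx '[^C]*C[^C]*')

[ "$anchored" = "$whole_line" ] && echo 'both forms keep the same lines'
```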

Wildcard · Answer 5 · 2017-05-02T20:10:39.467

The POSIX tool for scripted edits of a file (rather than printing the modified contents to standard out) is ex.

printf '%s\n' 'g/^[^C]*C[^C]*$/d' x | ex file.txt

Of course you can use sed -i if your version of Sed supports it, just be aware that's not portable if you're writing a script that's intended to run on different types of systems.

David Foerster asked in the comments:

Is there a reason why you're using printf and not echo or something like ex -c COMMAND?

Answer: Yes.

For printf vs. echo it's a question of portability; see Why is printf better than echo? And it's also easier to intersperse newlines between commands using printf.

For printf ... | ex vs. ex -c ..., it's a question of error handling. For this specific command it would not matter, but in general it does; for example, try putting

ex -c '%s/this pattern is not in the file/replacement text/g | x' filename

in a script. Contrast with the following:

printf '%s\n' '%s/no matching lines/replacement/g' x | ex file

The first will hang and await input; the second will exit when EOF is received by the ex command, so the script will continue. There are alternative workarounds, such as s///e, but they are not specified by POSIX. I prefer using the portable form, which is shown above.

For the g command, there must be a newline at the end, and I prefer using printf to wrap the commands rather than embedding a newline in single quotes.
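The printf-to-ex pipeline can be tried on a throwaway file. This sketch makes a few assumptions beyond the answer: the scratch file comes from mktemp, -s (the POSIX batch-mode option of ex) is added to suppress interactive feedback, and the whole thing is guarded because ex is not installed on every system.

```shell
file=$(mktemp)   # hypothetical scratch file
printf '%s\n' DTHGTY FGTHDC HYTRHD HTCCYD JUTDYC > "$file"

if command -v ex >/dev/null 2>&1; then
  # Each printf argument becomes one ex command line:
  # first the global delete, then x to write the file and quit.
  printf '%s\n' 'g/^[^C]*C[^C]*$/d' x | ex -s "$file"
  cat "$file"
else
  echo 'ex is not installed here' >&2
fi
```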

Is there a reason why you're using printf and not echo or something like ex -c COMMAND? — David Foerster, May 02 '17 at 15:43
@DavidFoerster, yes. I started to answer you in comments but it grew long, so I added it to the answer. — Wildcard, May 02 '17 at 20:11
Thanks and +1! I knew about printf vs. echo (though I usually just prefer echo when the argument is hard-coded) but I haven't used ex extensively so far. — David Foerster, May 02 '17 at 20:17

score 2 · Answer 6 · 2017-05-02T11:28:11.437


sed -e '
  s/C/&/2;t   # when 2nd C matches skip processing and print
  /C/d        # either one C or no C, so delete on C
'

sed -e '
   /C/!b     # no C, skip processing and print
   /C.*C/!d  # not(at least 2 C) => 1 C => delete
'

perl -lne 's/C/C/g == 1 or print'

edited May 02 '17 at 11:28

answered May 02 '17 at 07:11

Note that it assumes GNU sed, t #... would typically branch to the label called #... in most other sed implementations. – Stéphane Chazelas May 02 '17 at 09:37
Even the !b is GNU sed since branch doesn't like anything except a label or a newline after it. – May 02 '17 at 09:38
Yes, b, t, :, } (and r file, w file...) can't have a command after them on the same line. You can also use separate -e options. – Stéphane Chazelas May 02 '17 at 09:41
Your perl option doesn't produce the correct output. I guess you forgot to add the g modifier. – Tom Fenech May 02 '17 at 10:34
@TomFenech You are correct. I am fixing that. Thanks. – May 02 '17 at 11:28
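Following the comments about t and b needing to end their line, both scripts can be written with separate -e options and without trailing comments, which should also work outside GNU sed. A sketch on the question's sample data:

```shell
sample=$(printf '%s\n' DTHGTY FGTHDC HYTRHD HTCCYD JUTDYC)

# Replace the 2nd C with itself; if that succeeded, t skips to the end
# and the line is printed. Otherwise delete the line if it has a C at all.
first=$(printf '%s\n' "$sample" | sed -e 's/C/&/2' -e t -e '/C/d')

# No C: branch to the end and print. Otherwise delete unless there are two Cs.
second=$(printf '%s\n' "$sample" | sed -e '/C/!b' -e '/C.*C/!d')

printf '%s\n' "$first"
```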

score 2 · Answer 7 · answered May 02 '17 at 09:27

Here are a couple of options using perl.

Since you're only matching a single character, you can use tr/C// (a translate, with no replacements), to return the number of matches of C:

perl -lne 'print if tr/C// != 1' file

More generally, if you want to match a multi-character string or regular expression, then you can use this:

perl -lne 'print if (@m = /C/g) != 1' file

This assigns the matches of the regular expression /C/g to a list @m and prints lines when the length of that list is not 1.

The -i switch can be added to edit "in-place".

score 1 · Answer 8 · answered May 02 '17 at 16:05


For anyone wanting awk specifically, I'd offer

awk '/^[^C]*C[^C]*$/{next}//{print}'

skip the line if it matches the pattern, print it otherwise. You don't actually need {print}, you can use // and default print, but I think it's clearer spelled out.

My first thought was to use egrep -v with the same pattern, but that doesn't actually answer the question as posed.

answered May 02 '17 at 16:05

nigel222


What's the point of matching anything after {next}? Just say awk '/pattern/ {next} 1' and all lines not matching the pattern will be printed. Or, better, awk '!/pattern/' to directly print those. – fedorqui May 02 '17 at 21:45
@fedorqui good point about !/pattern/ (which somehow slipped my mind) but I'd far rather see a self-explanatory //{print} than a cryptic 1. Assume the least competence and fluency from the next person to maintain your code, consistent with not making it seriously less efficient or effective. – nigel222 May 03 '17 at 07:54
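The spellings discussed here are interchangeable in what they print. The sketch below assumes the anchored exactly-once pattern from the sed and grep answers and compares the spelled-out form, the bare 1, and the negated pattern on the question's sample data.

```shell
sample=$(printf '%s\n' DTHGTY FGTHDC HYTRHD HTCCYD JUTDYC)

# Skip matching lines, then print everything else explicitly.
spelled_out=$(printf '%s\n' "$sample" | awk '/^[^C]*C[^C]*$/ { next } { print }')

# Same, with the always-true pattern 1 and the default print action.
with_one=$(printf '%s\n' "$sample" | awk '/^[^C]*C[^C]*$/ { next } 1')

# Negated pattern with the default print action.
negated=$(printf '%s\n' "$sample" | awk '!/^[^C]*C[^C]*$/')

printf '%s\n' "$negated"
```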
