0

I have a file with about 7 million lines which looks like this:

head gokind_SNPs.txt
1:753541:G:A
1:769223:C:G
1:771967:G:A
1:778745:A:G
1:779322:A:G
...

How do I remove everything after the 2nd colon so that it looks like this:

1:753541
1:769223
1:771967
1:778745
1:779322
...

I tried doing this but it didn't work, the file didn't change:

sed 's/:[A-Z].* / /g' gokind_SNPsF.txt > gokind_SNPsf.txt
anamaria
  • 121

4 Answers

1

Your command did not do anything because the regular expression you're using is trying to match a space that is not there in the data.

Instead, use

sed 's/:[A-Z].*//' gokind_SNPsF.txt >new-gokind_SNPsf.txt

This deletes all text on every line from the first : that is immediately followed by an uppercase letter. I've also chosen to replace with nothing, rather than with a space, and I've dropped the g flag, which isn't necessary because the pattern runs to the end of the line and so can match at most once per line.
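As a quick check, here is a minimal sketch run against two sample lines that mirror the question's data (the `sample`, `unchanged` and `trimmed` variable names are only for illustration). It shows the original expression leaving the lines untouched and the corrected one trimming them:

```shell
# Sample lines mirroring the question's data (not the real file).
sample=$(printf '%s\n' '1:753541:G:A' '1:769223:C:G')

# Original expression: requires a literal space after ".*",
# which never occurs in the data, so no line is changed.
unchanged=$(printf '%s\n' "$sample" | sed 's/:[A-Z].* / /g')

# Corrected expression: delete from the first ":<uppercase letter>"
# to the end of the line.
trimmed=$(printf '%s\n' "$sample" | sed 's/:[A-Z].*//')

printf '%s\n' "$trimmed"
```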

I'm assuming that you didn't actually run the command that you showed though, as that would have truncated (emptied) your data file before even starting sed (due to redirecting into the same file that you're reading from).

If you want to do in-place editing with sed use sed -i, but also read "How can I achieve portability with sed -i (in-place editing)?".

A slightly quicker alternative to your sed command is

cut -d: -f -2 gokind_SNPsF.txt >new-gokind_SNPsf.txt

which simply extracts the first two :-delimited fields from each line. Instead of -f -2 you could use either -f 1,2 or -f 1-2 to specify that you want to get the first two columns.
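The three field specifications can be compared on a couple of sample lines. This is only a sketch; the variable names `a`, `b` and `c` are made up for the comparison:

```shell
# Sample lines mirroring the question's data (not the real file).
sample=$(printf '%s\n' '1:753541:G:A' '1:769223:C:G')

# Three ways of asking cut for the first two ":"-delimited fields.
a=$(printf '%s\n' "$sample" | cut -d: -f -2)
b=$(printf '%s\n' "$sample" | cut -d: -f 1,2)
c=$(printf '%s\n' "$sample" | cut -d: -f 1-2)

# Print the result once if all three agree.
[ "$a" = "$b" ] && [ "$b" = "$c" ] && printf '%s\n' "$a"
```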

Using awk, you would do

awk -F : 'BEGIN { OFS=FS } { print $1, $2 }' gokind_SNPsF.txt >new-gokind_SNPsf.txt

to print only the two first fields from each line to a new file.

With GNU awk, you could do in-place editing with

awk -i inplace -F : 'BEGIN { OFS=FS } { print $1, $2 }' gokind_SNPsF.txt

See "How to change a file in-place using awk? (as with "sed -i")" for more about that.

Kusalananda
  • 333,661
  • OP's command clobbers input only on a case-insensitive filesystem, typically Windows/WSL or MacOS. But it is clearer and safer to use a more obvious difference like 'NEW' or 'OUTPUT' or 'TEMP'. sponge is also an option. – dave_thompson_085 Apr 17 '20 at 04:10
1

Using awk, if you want to drop everything after the second colon no matter which characters follow it:

awk -F":" '{ print $1":"$2 }' gokind_SNPs.txt > gokind_SNPs_OUTPUT.txt
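To illustrate the difference from a pattern-based approach, here is a small sketch. The second sample line, with `-` as its third field, is a hypothetical one that does not appear in the question's data; it is there only to show that the field-based awk command does not care what follows the second colon, whereas a sed expression that looks for an uppercase letter does:

```shell
# First line is from the question; the second is a hypothetical line
# whose third field is not an uppercase letter.
sample=$(printf '%s\n' '1:753541:G:A' '1:769223:-:G')

# Field-based: always keeps exactly the first two fields.
by_field=$(printf '%s\n' "$sample" | awk -F":" '{ print $1":"$2 }')

# Pattern-based: cuts at the first ":<uppercase letter>", which on the
# hypothetical line is only found after the third field.
by_pattern=$(printf '%s\n' "$sample" | sed 's/:[A-Z].*//')

printf 'awk:\n%s\nsed:\n%s\n' "$by_field" "$by_pattern"
```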
1

The cut command is designed for exactly this:

cut -d: -f-2
0

Never try to write to the same file you're trying to read:

sed 's/:[A-Z].*//' gokind_SNPsF.txt > tmp && mv tmp gokind_SNPsf.txt

or use sed -i if your version of sed supports it.
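A slightly more defensive sketch of the same temp-file pattern follows. It assumes `mktemp` is available (it is not part of POSIX, though GNU and BSD systems ship it), and the directory and file names are hypothetical stand-ins for the real data file:

```shell
# Hypothetical scratch directory and file name, for illustration only.
workdir=$(mktemp -d)
file="$workdir/snps.txt"
printf '%s\n' '1:753541:G:A' '1:769223:C:G' > "$file"

# Write to a uniquely named temporary file in the same directory,
# then replace the original only if sed succeeded.
tmp=$(mktemp "$workdir/snps.XXXXXX")
sed 's/:[A-Z].*//' "$file" > "$tmp" && mv "$tmp" "$file"

result=$(cat "$file")
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -rf "$workdir"
```

Keeping the temporary file in the same directory as the target means the mv is a rename within one filesystem. Note that mktemp creates the file with restrictive permissions, so the replaced file may not keep the original's mode.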

Ed Morton
  • 31,617