I'm trying to extract a few fields from each entry of a VCF file: the first field, the second field, and the number after END=. Here is one entry of the file:

1 234529926 AC=1;AF=0.00019968;AFR_AF=0.0008;AMR_AF=0;AN=5008;CIEND=0,500;CIPOS=-500,0;CS=DEL_union;EAS_AF=0;END=234549706;EUR_AF=0;MC=YL_CN_ACB_337;NS=2504;SAS_AF=0;SVTYPE=DEL 0|0

I tried the following to get the result I want:

sed 's|\([\d\s]*\)AC=.*;END=\([0-9]*\).*|\1\2|'

Results in:

1 234529926 234549706

Replacing [0-9] with \d should give the same result but it doesn't:

sed 's|\([\d\s]*\)AC=.*;END=\(\d*\).*|\1\2|'

Gives:

1 234529926

This doesn't make sense to me: the \([\d\s]*\) group at the beginning seems to work just fine, so it can't be that sed doesn't understand \d. Why does this happen?

  • Possible duplicate of sed regular expression behaving differently than in vim and perl? and https://unix.stackexchange.com/questions/119905/why-does-my-regular-expression-work-in-x-but-not-in-y – Sundeep Sep 28 '18 at 07:00
  • sed doesn't recognize \d. In the first command you have * on [\d\s], so it matches zero times, which is still a valid match. \s is recognized by GNU sed (not sure about other versions), but it won't be recognized inside a character class in any case. – Sundeep Sep 28 '18 at 07:02
  • If you run echo '1 234529926 AC=1;AF=0' | sed 's|\([\d\s]*\)AC=|"\1"AC=|', you will see that the capture group didn't capture any character, but * means an empty match is fine, whereas sed 's|\([[:space:]]*\)AC=|"\1"AC=|' will match the space. – Sundeep Sep 28 '18 at 07:05
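
To make the point from the comments concrete, here is a minimal sketch that first shows the empty capture and then uses POSIX bracket expressions ([0-9] and [[:space:]]) in place of \d and \s. The variable names line, captured, and result and the function extract_end are only illustrative, and the sample entry is assumed to be separated by spaces as pasted in the question (a real VCF file is normally tab-separated, which [[:space:]] also covers).

```shell
# Sample entry as pasted in the question.
line='1 234529926 AC=1;AF=0.00019968;AFR_AF=0.0008;AMR_AF=0;AN=5008;CIEND=0,500;CIPOS=-500,0;CS=DEL_union;EAS_AF=0;END=234549706;EUR_AF=0;MC=YL_CN_ACB_337;NS=2504;SAS_AF=0;SVTYPE=DEL 0|0'

# Quote the first group to see what it really captured.
# [\d\s] matches neither digits nor spaces here, so the group is empty
# and the match simply starts at "AC=".
captured=$(printf '%s\n' "$line" | sed 's|\([\d\s]*\)AC=.*|"\1"|')
printf '%s\n' "$captured"

# Same idea as the original command, but with POSIX classes:
# group 1 = leading digits and whitespace, group 2 = digits after END=.
extract_end() {
    sed 's|^\([0-9[:space:]]*\)AC=.*;END=\([0-9]*\).*|\1\2|'
}

result=$(printf '%s\n' "$line" | extract_end)
printf '%s\n' "$result"
```

The first command works in the question only because everything before AC= is left untouched by the substitution, not because the group matched it. In the second command \(\d*\) likewise matches the empty string, so \2 is empty and the END value is dropped.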
