I'm trying to extract a couple of fields from each entry of a VCF file. Specifically, I want the first and second fields and the number after END= to be included. Here is one entry of the file:
1 234529926 AC=1;AF=0.00019968;AFR_AF=0.0008;AMR_AF=0;AN=5008;CIEND=0,500;CIPOS=-500,0;CS=DEL_union;EAS_AF=0;END=234549706;EUR_AF=0;MC=YL_CN_ACB_337;NS=2504;SAS_AF=0;SVTYPE=DEL 0|0
I tried the following to get the result I want:
sed 's|\([\d\s]*\)AC=.*;END=\([0-9]*\).*|\1\2|'
Results in:
1 234529926 234549706
Replacing [0-9] with \d should give the same result, but it doesn't:
sed 's|\([\d\s]*\)AC=.*;END=\(\d*\).*|\1\2|'
Gives:
1 234529926
This doesn't make sense to me: the \([\d\s]*\) group at the beginning seems to work just fine, so it can't be the case that sed doesn't understand \d. Why does this happen?
sed doesn't support \d. In the first command you have * on [\d\s], so it matches zero times, which is still a valid match. \s is recognized by GNU sed, not sure about other versions. Also, \s won't be recognized inside a character class in any case. – Sundeep Sep 28 '18 at 07:02
Try echo '1 234529926 AC=1;AF=0' | sed 's|\([\d\s]*\)AC=|"\1"AC=|' and you will see that the capture group didn't capture any character, but * means an empty match is fine, whereas sed 's|\([[:space:]]*\)AC=|"\1"AC=|' will match the space. – Sundeep Sep 28 '18 at 07:05
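Building on the comments, here is a minimal sketch of the substitution written with POSIX bracket expressions only, so it does not rely on \d or \s. It assumes the first two fields are purely numeric, as in the sample entry (a chromosome name such as X would need [^[:space:]]* instead), and the variable names line and result are just for illustration. The sample line is shortened but keeps CIEND= to show that ;END= does not match it.

```shell
# Shortened copy of the sample entry from the question (hypothetical variable name).
line='1 234529926 AC=1;AF=0.00019968;CIEND=0,500;CS=DEL_union;END=234549706;EUR_AF=0;SVTYPE=DEL 0|0'

# Group 1: the first two numeric fields plus the whitespace after each.
# Group 2: the digits that follow ";END=".
# [0-9] and [[:space:]] are POSIX classes, so this should also work outside GNU sed.
result=$(printf '%s\n' "$line" |
  sed 's|^\([0-9]*[[:space:]]*[0-9]*[[:space:]]*\)AC=.*;END=\([0-9]*\).*|\1\2|')

printf '%s\n' "$result"
# prints: 1 234529926 234549706
```

Unlike the first command in the question, group 1 here actually captures the leading fields instead of matching the empty string, and the ^ anchor makes that explicit.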