0

I want to insert characters in TSV files using the sed command with wildcard expressions:

I have a file like this:

Marker  Pvalue  Trait   Chr Pos
S3_16887238 6.172847e-09    Total_Soil_S    3   16887238
S3_16887238 6.172847e-09    Total_Soil_Pa   3   16887238
S3_16887238 6.172847e-09    Total_Soil_Cl   3   16887238

And I want to add _All at the end of all the texts in the third column like this:

Marker  Pvalue  Trait   Chr Pos
S3_16887238 6.172847e-09    Total_Soil_S_All    3   16887238
S3_16887238 6.172847e-09    Total_Soil_Pa_All   3   16887238
S3_16887238 6.172847e-09    Total_Soil_Cl_All   3   16887238

I am using this command and it is not working:

sed -i 's/Total_Soil_\(.*\)/&_\1_All/g' top1.txt

This is just an example file, there could be anything instead of S, Pa, and Cl.

jubilatious1
  • 3,195
  • 8
  • 17

3 Answers3

4

I would avoid working on tabular data with sed as it's difficult to correctly address the exact positions in the data that you want to modify. The sed utility is more suitable for processing unstructured data like text.


Using Miller (mlr; a tool specifically made for working with structured data) to append the string _All to the end of the data in the Trait field of each TSV record:

$ mlr --tsv put '$Trait .= "_All"' file
Marker  Pvalue  Trait   Chr     Pos
S3_16887238     6.172847e-09    Total_Soil_S_All        3       16887238
S3_16887238     6.172847e-09    Total_Soil_Pa_All       3       16887238
S3_16887238     6.172847e-09    Total_Soil_Cl_All       3       16887238

Use mlr with its -I option to do the change in-place.

Would you need to ensure that you are modifying the field only if it starts with the string Total_Soil, then use

mlr --tsv put '$Trait =~ "^Total_Soil" { $Trait .= "_All" }' file

With awk, appending the string _All to the end of the data in the 3rd tab-delimited field of each record:

$ awk -F '\t' 'BEGIN { OFS=FS } NR > 1 { $3 = $3 "_All" }; 1' file
Marker  Pvalue  Trait       Chr     Pos
S3_16887238     6.172847e-09    Total_Soil_S_All        3       16887238
S3_16887238     6.172847e-09    Total_Soil_Pa_All       3       16887238
S3_16887238     6.172847e-09    Total_Soil_Cl_All       3       16887238

The trailing 1 at the end of the awk code causes the modified record to be unconditionally outputted. It is, in a way, a short-hand way of writing { print }. Note that we explicitly need to avoid modifying the header. We do this by testing using NR > 1 and modifying the field only if the test evaluates to true (NR is the ordinal number of the current record).

Redirect the output to a new file and then rename the new file to the original name. Or, if you are using GNU awk, use -i inplace as described in another question+answer.

Again, if you need to ensure that you only modify the 3rd field if it starts with the string Total_Soil, then use

awk -F '\t' 'BEGIN { OFS=FS } NR > 1 && $3 ~ /^Total_Soil/ { $3 = $3 "_All" }; 1' file

Using Perl in pretty much the same way as awk:

$ perl -F'\t' -e 'BEGIN { $" = "\t" } if ($. > 1) { $F[2] .= "_All" } print "@F"' file
Marker  Pvalue  Trait   Chr     Pos
S3_16887238     6.172847e-09    Total_Soil_S_All        3       16887238
S3_16887238     6.172847e-09    Total_Soil_Pa_All       3       16887238
S3_16887238     6.172847e-09    Total_Soil_Cl_All       3       16887238

Ensuring we only modify Total_Soil data:

perl -F'\t' -e 'BEGIN { $" = "\t" } if ($. > 1 && $F[2] =~ /^Total_Soil/) { $F[2] .= "_All" } print "@F"' file
Kusalananda
  • 333,661
0

Using Raku (formerly known as Perl_6)

~$ raku -ne 'BEGIN put get; my @a = .split("\t"); @a.[2] = @a.[2] ~ "_All"; put @a.join("\t");' file

Raku is a member of the Perl-family of programming languages. An advantage of Raku is high-level support for Unicode built-in, no external libraries (or special flags) required.

Above is a fairly direct translation of @Kusalananda's excellent Perl(5) answer. Raku's -ne non-autoprinting "linewise" commandline flags are used. To print the headerline verbatim, use a BEGIN phaser that put get puts (print-using-terminator) the first line it gets.

The body of the linewise commands works like so: declare an array with my @a and assign to it the input line $_ that has been .split("\t") on tabs [ .split("\t") is short for $_.split("\t") ].

Take @a.[2] the third element (i.e. column) and over-write it with the same element, @a.[2] ~ "_All" tilde-concatenated with a trailing _All string.

Then take all @a elements, joined back together on tabs, and output.

Sample Input:

Marker  Pvalue  Trait   Chr Pos
S3_16887238 6.172847e-09    Total_Soil_S    3   16887238
S3_16887238 6.172847e-09    Total_Soil_Pa   3   16887238
S3_16887238 6.172847e-09    Total_Soil_Cl   3   16887238

Sample Output:

Marker  Pvalue  Trait   Chr Pos
S3_16887238 6.172847e-09    Total_Soil_S_All    3   16887238
S3_16887238 6.172847e-09    Total_Soil_Pa_All   3   16887238
S3_16887238 6.172847e-09    Total_Soil_Cl_All   3   16887238

https://docs.raku.org
https://raku.org

jubilatious1
  • 3,195
  • 8
  • 17
0

With sed you can do:

sed 's/Total_Soil_[^[:blank:]]*/&_All/' top1.txt

To do it inline, add an -i after sed

Edit: Substituted [^ ] with [^[:blank:]] to match everything but Spaces and Tabs.

Oliver Knodel
  • 114
  • 1
  • 3