2

I have a file containing SNP data called snp.bed, which looks like this:

head snp.bed

    Chr17   214708483   214708484   Chr17:214708484
    Chr17   214708507   214708508   Chr17:214708508
    Chr17   214708573   214708574   Chr17:214708574

I also have a file called intersect.bed, which looks like this:

head intersect.bed

    Chr17   214708483   214708484   Chr17:214708484 Chr17   214706266   214710783   gene50573
    Chr17   214708507   214708508   Chr17:214708508 Chr17   214706266   214710783   gene50573
    Chr17   214708587   214708588   Chr17:214708580 Chr17   214706266   214710783   gene50573

I want to print out a modified version of snp.bed which contains an extra column appended to each row. If a row in snp.bed matches the first 4 columns of a row in intersect.bed, then I want to print the entire row from snp.bed with an extra column obtained by adjoining the last column from the corresponding row in intersect.bed (the gene name). Alternatively, if a row from snp.bed does not match any row from intersect.bed then adjoin an extra column consisting of the string "NA" instead of the gene name.

This is my desired output:

head snp.matched.bed

    Chr17   214708483   214708484   Chr17:214708484   gene50573
    Chr17   214708507   214708508   Chr17:214708508   gene50573
    Chr17   214708573   214708574   Chr17:214708574   NA

How can I do this?

Anna1364
  • 1,026

3 Answers3

3

Here is a solution using awk:

$ awk -F '\t' 'BEGIN{while(getline line<"intersect.bed") {N=split(line,a,"\t"); seen[a[1]"\t"a[2]"\t"a[3]"\t"a[4]]=a[N];}} {if(seen[$0]) {name=seen[$0];} else{name="NA"}; print $0 "\t" name}' snp.bed
Chr17       214708483       214708484       Chr17:214708484 gene50573
Chr17       214708507       214708508       Chr17:214708508 gene50573
Chr17       214708573       214708574       Chr17:214708574 NA

I am assuming single tab characters as the delimiter of both input files.

Note also that I interpreted "first 4th column" as "first four columns."

user001
  • 3,698
3

Personally, I think that for this kind of task it's probably best to use a "real" programming language. I like Python, so here's a Python script that does what you want (it's intentionally verbose so that you can understand it and modify it easily):

#!/usr/bin/env python2

# intersect.py

# Read data from the first file
snp_rows = []
with open("snp.bed", 'r') as snp_file:
    for row in snp_file:
        snp_rows.append(row.split())

# Read data from the second file
int_rows = []
with open("intersect.bed", 'r') as int_file:
    for row in int_file:
        int_rows.append(row.split())

# Compare data and compute results
results = []
for row in int_rows:
    if row[:4] in snp_rows:
        results.append(row[:4] + [row[-1]])
    else:
        results.append(row[:4] + ["NA"])

# Print the results
for row in results:
    print(' '.join(row))

Save it to a file and then execute it:

python2 intersect.py

And just for fun, here is a Bash solution using standard commands (just grep and cut):

while read row; do
    match="$(grep -F "${row}" intersect.bed)";
    if [[ -n "${match}" ]]; then
        echo "${row} $(echo ${match} | cut -d' ' -f8)";
    else
        echo "${row} NA";
    fi;
done < snp.bed

Note that in general it isn't recommended to use Bash to do serious text processing. See, for example, the following post:

igal
  • 9,886
3

This solution assumes, that files don't have spaces in the beginning of lines. What is different from your examples, which have these spaces.

awk '
{
    str = $1$2$3$4; 
}
FNR == NR {
    arr[str] = $NF;
}
FNR != NR {
    gene_name = arr[str] ? arr[str] : "NA";
    print $0, gene_name;
}' intersect.bed snp.bed 

Output

Chr17   214708483   214708484   Chr17:214708484 gene50573
Chr17   214708507   214708508   Chr17:214708508 gene50573
Chr17   214708573   214708574   Chr17:214708574 NA
MiniMax
  • 4,123