gawk columns from multiple files and add to a single text file

Question

I have 50 files, each containing 9 columns (sample shown in the attached picture).

The files are named inputfile_1.assoc.logistic, inputfile_2.assoc.logistic, and so on up to inputfile_50.assoc.logistic.

The values in columns 1, 2 and 3 are identical in all 50 files.

I want to extract columns 7, 8 and 9 from all 50 files and add them to a single .txt file, to look like this (fields tab-separated, and columns 7, 8 and 9 labelled as shown).

I have been using the loop shown below to extract the columns from each file individually, save them as text files, and import the .txt files into Stata to merge them. This takes considerable time (I have over 7 million rows), and I need to do this for several analyses.

for i in $(seq 1 50); do
    gawk -F" " '{print $2, $7, $8, $9}' inputfile_${i}.assoc.logistic >>/mnt/jw01-aruk-home01/projects/jia_mtx_gwas_2016/common_files/output/imputed_dataset/all_50_mi_datasets/acr30R_vs_acr30NR_combined_coefficients/outputfile_${i}.txt
done

Can this be made more efficient and incorporated in a shell loop?

Please post sample text as text, not pictures. – muru Jan 06 '17 at 11:40

glenn jackman · Answer 1 · 2017-01-06T13:44:35.917

untested due to lack of input data:

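gawk '
    # Input is assumed whitespace-separated, as in the question (-F" "); output is tab-separated.
    BEGIN {OFS = "\t"}
    # Capture the numeric file suffix in m[1] (BEGINFILE and 3-argument match() are gawk extensions).
    BEGINFILE {match(FILENAME, /inputfile_([0-9]+)\.assoc\.logistic/, m)}
    # Header row: label columns 7-9 with the file number.
    FNR == 1 {
        key = $1 OFS $2 OFS $3
        data[key] = data[key] OFS $7"_"m[1] OFS $8"_"m[1] OFS $9"_"m[1]
        next
    }
    # Data rows: append columns 7-9 under the key built from columns 1-3.
    {
        key = $1 OFS $2 OFS $3
        data[key] = data[key] OFS $7 OFS $8 OFS $9
    }
    END {
        for (key in data) {
            print key data[key]
        }
    }
' inputfile_*.assoc.logistic > outputfile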

Since I'm iterating over hash keys to output the data, the lines (including the header line) will appear in random-ish order.
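If the original row order matters, one possible variation is to remember each key the first time it is seen and print in that order. The following is only a sketch under assumptions: the two sample files and their column names (OR, STAT, P) are made up for illustration, the input is assumed to be whitespace-separated, and the arrays `order` and `data` plus the variables `suffix` and `tag` are names chosen here, not part of the answer above.

```shell
# Hypothetical sample data: two tiny files standing in for the real 50.
workdir=$(mktemp -d)
printf '%s\n' 'CHR SNP BP A1 TEST NMISS OR STAT P' \
              '1 rs1 100 A ADD 50 1.10 0.50 0.60' \
              '1 rs2 200 G ADD 50 0.90 -0.40 0.70' > "$workdir/inputfile_1.assoc.logistic"
printf '%s\n' 'CHR SNP BP A1 TEST NMISS OR STAT P' \
              '1 rs1 100 A ADD 50 1.20 0.55 0.58' \
              '1 rs2 200 G ADD 50 0.80 -0.45 0.65' > "$workdir/inputfile_2.assoc.logistic"

gawk '
    BEGIN { OFS = "\t" }
    FNR == 1 {
        # Derive the numeric suffix from the current file name.
        n = FILENAME
        sub(/.*inputfile_/, "", n)
        sub(/\.assoc\.logistic$/, "", n)
        suffix = "_" n
    }
    {
        key = $1 OFS $2 OFS $3
        if (!(key in data)) order[++count] = key   # remember first-seen order
        tag = (FNR == 1) ? suffix : ""             # label header cells only
        data[key] = data[key] OFS $7 tag OFS $8 tag OFS $9 tag
    }
    END {
        # Print rows in the order they first appeared, header first.
        for (i = 1; i <= count; i++) print order[i] data[order[i]]
    }
' "$workdir"/inputfile_*.assoc.logistic > "$workdir/outputfile.txt"

cat "$workdir/outputfile.txt"
```

Note that the shell expands the glob in lexical order, so with 50 files inputfile_10 is read before inputfile_2; the suffix in each header label still identifies which file a column came from.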

edited Jan 06 '17 at 13:44

answered Jan 06 '17 at 11:52

Hi Glenn, thank you for your time and input. I am getting an error: "gawk:cmd. line:3: fatal:0 is invalid as number of arguments for match". Is [0-9] in the script for selecting the files with those numbers as a suffix? I tried replacing it with [1-9], as all my files are labelled _1, _2, ..., _50, but I still get the same error message. I am completely new to this scripting language. Thanks for your help. – Sam Jan 06 '17 at 12:11
The regular expression [0-9]+ (with the plus) should capture all the digits. I don't understand the error message: I provide 3 arguments to match(). Perhaps you did not cut and paste correctly? – glenn jackman Jan 06 '17 at 15:01
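To see what the regular expression captures in isolation, a small check along these lines may help. It is a sketch, not a diagnosis of the error above: it passes two example file names in through a variable called `name` (a name chosen here) and relies on the three-argument form of match(), which is a gawk extension and so will not work in other awk implementations.

```shell
# Feed two example file names to gawk and print what ([0-9]+) captured.
captured=$(
    for f in inputfile_1.assoc.logistic inputfile_50.assoc.logistic; do
        gawk -v name="$f" 'BEGIN {
            # The third argument m receives the parenthesised group in m[1].
            if (match(name, /inputfile_([0-9]+)\.assoc\.logistic/, m))
                print m[1]
        }'
    done
)
echo "$captured"
```

With gawk this should print 1 and then 50, one per line, showing that the plus sign makes the group take every digit of the suffix rather than only the first.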
