
I'm working on a project for which I need to collate specific lines of data from multiple files into one new text file. For example, say I had 3 files that each contain a matrix of values:

Text File 1

Obs.    TGCP_WM23   STT_WM189   MPO_WM496   PTP_WM724
TGCP_WM23   0.000000    0.174510    0.153292    0.177030
STT_WM189   0.174510    0.000000    0.077663    0.203359
MPO_WM496   0.153292    0.077663    0.000000    0.183706
PTP_WM724   0.177030    0.203359    0.183706    0.000000

Text File 2

Obs.    TGCP_WM15   STT_WM187   MPO_WM485   PTP_WM725
TGCP_WM15   0.000000    0.157164    0.145516    0.168991
STT_WM187   0.157164    0.000000    0.051973    0.187443
MPO_WM485   0.145516    0.051973    0.000000    0.171824
PTP_WM725   0.168991    0.187443    0.171824    0.000000

Text File 3

Obs.    TGCP_WM1    STT_WM184   MPO_WM489   PTP_WM721
TGCP_WM1    0.000000    0.166831    0.161654    0.192732
STT_WM184   0.166831    0.000000    0.059373    0.202718
MPO_WM489   0.161654    0.059373    0.000000    0.185286
PTP_WM721   0.192732    0.202718    0.185286    0.000000

I want to automate reading the 3 files and printing the second line from each into sequential lines of one new text file, such that the new text file contains:

New Text File

TGCP_WM23   0.000000    0.174510    0.153292    0.177030
TGCP_WM15   0.000000    0.157164    0.145516    0.168991
TGCP_WM1    0.000000    0.166831    0.161654    0.192732

Is there a relatively straightforward way to do something like this using the Terminal on a Mac? As it stands, I'm looking at 2,200 files from which I need to extract and format data so that I can run some downstream analyses. I would like to avoid having to manually open all those files, copy text and paste into a new file where the values are formatted in a more useful fashion.

Edit: All of the files I'm working with are text files outputted from a program called Genodive. Half of the files are Fst matrix files that look like the examples shown above; the other 1,100 files are genetic diversity output files, the contents of which look like...


___________________________________________________________________

GenoDive 3.01, 2019-12-12 23:28:01 +0000
Genetic Diversity: Nei 1987.
File: TrkNbr_1083n1282_L1n2_PrelimPops_02SubSampPops_Rep001.txt
8 of 8 individuals included, 6843 of 6843 loci included

– Summary of indices of genetic diversity

Statistic   Value   Std.Dev.    c.i.2.5%    c.i.97.5%   Description
Num 1.418   0.006   1.405   1.428   Number of alleles
Eff_num 1.086   0.002   1.082   1.088   Effective number of alleles
Ho  0.092   0.002   0.089   0.096   Observed Heterozygosity
Hs  0.098   0.002   0.094   0.101   Heterozygosity Within Populations
Ht  0.114   0.002   0.110   0.117   Total Heterozygosity
H't 0.122   0.002   0.117   0.125   Corrected total Heterozygosity
Gis 0.055   0.013   0.030   0.079   Inbreeding coefficient

Standard deviations of F-statistics were obtained through jackknifing over loci.
95% confidence intervals of F-statistics were obtained through bootstrapping over loci.


– Indices of genetic diversity per population

Population  Num Eff_num Ho  Hs  Gis
TGCP_WM3    1.261   1.183   0.142   0.141   -0.003
STT_WM186   1.186   1.132   0.088   0.108   0.183
MPO_WM483   1.194   1.136   0.097   0.109   0.110
PTP_WM732   1.095   1.068   0.056   0.051   -0.097


___________________________________________________________________

I don't need to process the Fst files and the genetic diversity files all at once; I want to extract different data from each type of file.

The naming convention of the two file types is as follows:

Fst files are named

TrkNbr_1083n1282_L1n2_PrelimPops_02SubSampPops_Rep001_FstRslts

Genetic diversity files are named

TrkNbr_1083n1282_L1n2_PrelimPops_02SubSampPops_Rep001_GenDivRslts

The distinguishing part of the file names is the '##SubSampPops_Rep###' portion. There are 1,100 'FstRslts' files, and those 1,100 files are subdivided into 11 groups of 100 files...

02SubSampPops_Rep001
02SubSampPops_Rep002
02SubSampPops_Rep003
.
.
.
02SubSampPops_Rep100
04SubSampPops_Rep001
04SubSampPops_Rep002
04SubSampPops_Rep003
.
.
.
04SubSampPops_Rep100

Similarly, there are 1,100 'GenDivRslts' files organized in the same fashion.

  • Are all of the files in that same format? How many of the files do you want to extract that data from? Is it a specific number or just whatever you need at the time? – Nasir Riley Dec 31 '19 at 01:30

3 Answers


zsh version (default shell in Mac terminal):

for file in $(find . -type f -iname "*.txt"); cat "$file" | head -2 | tail -1 >> output.txt

This assumes all the input text files are in the same directory and order of processing the files is not important.
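If the output order does matter (say, Rep001 before Rep002), the file list can be sorted first. A minimal sketch with made-up sample files (the fst_demo directory and the file contents below are illustrative assumptions, not the real data):

```shell
# Create two tiny sample files whose names sort in the desired order
# (illustrative data only).
mkdir -p fst_demo
printf 'Obs. header\nTGCP_WM23 0.000000 0.174510\n' > fst_demo/Rep001_FstRslts
printf 'Obs. header\nTGCP_WM15 0.000000 0.157164\n' > fst_demo/Rep002_FstRslts

# Sort the file list, then print line 2 of each file into output.txt.
find fst_demo -type f -iname "*FstRslts" | sort | while IFS= read -r file; do
  awk 'NR==2' "$file"
done > output.txt

cat output.txt
# TGCP_WM23 0.000000 0.174510
# TGCP_WM15 0.000000 0.157164
```

Piping find into a while read loop also avoids the word-splitting problems of for file in $(find ...), though it still assumes the file names contain no newlines.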

bash version:

for file in $(find . -type f -iname "*.txt"); do cat "$file" | head -2 | tail -1; done >> output.txt

EDIT-1: Following the suggestions from Nasir and steeldriver, echo with command substitution is not necessary. The following is the awk version:

for file in $(find . -type f -iname "*.txt"); awk 'NR==2' "$file" >> output.txt

And, if the files don't have a .txt extension, any pattern that is common to all the files can be used instead. Assuming all the files have File in their name, the awk version can be

for file in $(find . -type f -iname "*File*"); awk 'NR==2' "$file" >> output.txt

EDIT-2:

From what you have mentioned, FstRslts and GenDivRslts are the unique identifiers of the two file groups. Hence you can use "*FstRslts" instead of "*.txt" for your FstRslts files, and likewise "*GenDivRslts" for the others.

NOTE

I am taking up @steeldriver's suggestion and adding the following, more idiomatic, command as one of the answers:

find . -type f -iname "*FstRslts" -exec awk 'NR==2' {} \; > output.txt

EDIT-3:

find . - start searching from the present working directory

-type f - match regular files only

-iname "*FstRslts" - match file names ending in FstRslts, ignoring case

-exec - run the following command on each file found

awk 'NR==2' - print the 2nd line of the file

{} - placeholder that is replaced by each matched file name

\; - terminates the command given to -exec

> output.txt - redirect the results to a file named 'output.txt'
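Since the two result types are processed separately, the same command can simply be run once per name pattern. A self-contained sketch (the demo2 directory, the sample contents, and the output file names are assumptions for illustration):

```shell
# Tiny stand-ins for the two GenoDive output types (illustrative only).
mkdir -p demo2
printf 'header\nTGCP_WM23 0.000000\n' > demo2/Rep001_FstRslts
printf 'header\nNum 1.418 0.006\n'    > demo2/Rep001_GenDivRslts

# One pass per file type, each writing its own collated output file.
find demo2 -type f -iname "*FstRslts"    -exec awk 'NR==2' {} \; > fst_lines.txt
find demo2 -type f -iname "*GenDivRslts" -exec awk 'NR==2' {} \; > gendiv_lines.txt
```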

Bussller

  • Using cat and then head and tail isn't necessary as he only wants to print the second line. awk 'NR==2' can be used for that without cat. Even if you were going to use head and tail, cat isn't needed as those utilities, as well as awk, work without it. He also may not want to extract data from all of the files and may want the lines to be in a certain order in the output file. – Nasir Riley Dec 31 '19 at 01:27
  • Text files do not necessarily have the .txt extension. – Paulo Tomé Dec 31 '19 at 01:51
  • file * | awk -F: '/ASCII text/ {print $1}' would be a better way to return all text files in a folder. See Finding all “Non-Binary” files – Paulo Tomé Dec 31 '19 at 02:05
  • I'm having some difficulty following what the suggested command line does; would you be willing to explain in some detail what each component of find . -type f -iname "*FstRslts" -exec awk 'NR==2' {} \; > output.txt is doing so that I can understand if this would accomplish what I'm looking for?

    I'm a novice when it comes to using the command line to accomplish new things that I'm not practiced at doing, so while everyone's suggestions and interest are very appreciated, I'm going to be slow to understand them.

    – jmahguib Dec 31 '19 at 04:39
  • Thank you for your suggestions. I tested the command find . -type f -iname "*FstRslts" -exec awk 'NR==2' {} \; > output.txt and it did EXACTLY what I wanted. This will save me so much time and I'm very grateful.

    And thank you @RussellB for providing the explanations of each component of the command because it's important for my documentation. On that note, could you tell me also what the {} \ portion of the command line does? It's the last bit of it that I still don't understand what it actually is doing.

    – jmahguib Jan 03 '20 at 21:12
  • Added comments. – Bussller Jan 04 '20 at 04:15
  • +1 - thank you so much for taking the time to document this so completely. Greatly appreciated, probably saved me hours :D – Rax Adaam Dec 07 '20 at 18:49

First we define some useful shell variables on our commandline:

$ d='[0-9]'
$ pre='TrkNbr_1083n1282_L1n2_PrelimPops'
$ main="$d${d}SubSampPops_Rep$d$d$d"
$ post='GenDivRslts'
$ filename="${pre}_${main}_${post}"

With GNU awk (note: the pipelines below use GNU-specific features such as sed -s and xargs -r; on macOS the GNU versions can be installed via Homebrew, e.g. brew install gawk gnu-sed findutils):

$ find . -type f -name "$filename"      |
  sort -t_ -nk5.1,5.2 -nk6.4,6.6        |
  xargs -r awk 'FNR==2{print;nextfile}' \
> new_text_file;

With GNU sed:

$ find . -type f -name "$filename" |
  sort -t_ -nk5.1,5.2 -nk6.4,6.6   |
  xargs -r sed -se '2!d'           \
> new_text_file;

With perl:

$ find . -type f -name "$filename"                |
  sort -t_ -nk5.1,5.2 -nk6.4,6.6                  |
  xargs -r perl -ne 'print,close ARGV if $. == 2' \
> new_text_file;

With head/tail:

$ find . -type f -name "$filename" |
  sort -t_ -nk5.1,5.2 -nk6.4,6.6   |
  xargs -r \
   sh -c '
    for f
    do
     head -n 2 "$f" | tail -n 1
    done
   ' x > new_text_file;

Why not simply

awk 'FNR == 2' *FstRslts > NewFile

? If the command line becomes too long, try to group input files by their subdivisions, or use xargs to split the line.
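For instance, one way to sidestep the argument-length limit is to let the shell's builtin printf (which is not subject to ARG_MAX) feed the glob to xargs NUL-delimited. A sketch with made-up sample files (the demo3 directory and its contents are illustrative assumptions):

```shell
# Two tiny sample files (illustrative contents).
mkdir -p demo3
printf 'header\nrow_a 1\n' > demo3/a_FstRslts
printf 'header\nrow_b 2\n' > demo3/b_FstRslts

# printf is a builtin, so the expanded glob never hits ARG_MAX;
# xargs -0 splits the NUL-delimited list into as many awk runs as needed,
# and FNR==2 still selects line 2 of every individual file.
printf '%s\0' demo3/*FstRslts | xargs -0 awk 'FNR == 2' > NewFile.txt

cat NewFile.txt
# row_a 1
# row_b 2
```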

RudiC