Splitting a text file into new files

Question

I have a text file in the following format:

Model        1   
Atom….
Atom….
Atom….
ENDMDL
Model        2   
Atom….
Atom….
Atom….
ENDMDL
Model       n   
Atom….
Atom….
Atom….
ENDMDL

I need to split this file into separate files, one per Model, with each new file named after its Model number.

PM 2Ring · Answer 1 · 2016-06-14T11:29:02.103

This is easily done using a small awk script.

#!/usr/bin/awk -f 
# Write sections of the input file to separate files
# Written by PM 2Ring 2016.06.14

BEGIN{outbase = "outfile"}

/^Model/{outname = outbase $2}

{print > outname}

outbase is the base file name. The Model number is appended to it, so for your sample file the output files outfile1, outfile2, etc. will be created. With a minor change to the script you could set outbase from the command line, using awk's -v option.
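As a rough sketch of that change (not part of the original answer): drop the BEGIN block and pass outbase with -v. The input name models.txt and the prefix model_ are placeholders chosen for this illustration.

```shell
# Work in a scratch directory with a small sample input (models.txt is a made-up name).
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'Model        1' 'Atom a' 'ENDMDL' \
              'Model        2' 'Atom b' 'ENDMDL' > models.txt

# No BEGIN block: outbase is set on the command line with -v.
awk -v outbase="model_" '/^Model/{outname = outbase $2} {print > outname}' models.txt

# List the files that were written.
ls model_*
```

Because the prefix is now an argument, the same script can be reused for different data sets without editing it.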

The heart of this script is

/^Model/{outname = outbase $2}

It says: if the current line starts with "Model", concatenate the outbase string and the contents of field #2, and assign the result to outname.

By default, awk processes a file line by line, splitting each line into fields using whitespace as the field separator.

{print > outname}

simply prints the current line to the file whose name is stored in outname.

This script is small enough to write the whole thing on the command line:

awk 'BEGIN{outbase = "outfile"}; /^Model/{outname = outbase $2}; {print > outname}' infile.txt

You can actually supply multiple input file arguments and they will be handled correctly, as long as you don't have duplicated Model numbers.
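A small sketch of the multi-file case, with one addition that is not in the script above: awk keeps every output file open until it exits, and some awk implementations limit how many files can be open at once, so this variant closes the previous output file when a new Model line starts. The names part1.txt and part2.txt are made up for the example.

```shell
# Scratch directory with two sample inputs (hypothetical file names).
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'Model        1' 'Atom a' 'ENDMDL' > part1.txt
printf '%s\n' 'Model        2' 'Atom b' 'ENDMDL' > part2.txt

# Same logic as the script above, plus close() on the file we are done with.
awk 'BEGIN{outbase = "outfile"}
     /^Model/{if (outname != "") close(outname); outname = outbase $2}
     {print > outname}' part1.txt part2.txt

ls outfile*
```

One trade-off to be aware of: after close(), writing to the same name again with > starts the file afresh, so this variant relies even more on Model numbers being unique.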

Chris Davies · Answer 2 · 2016-06-14T21:02:35.017

I would probably use csplit for this. The following works for a file called file.txt (the -z option and the '{*}' repeat count are GNU extensions):

csplit -ksz file.txt '/^Model/' '{*}'
for xx in xx*
do
    newname=$(awk '{print $2; exit}' "$xx")
    test ! -f "$newname" && mv -f "$xx" "$newname"
done

csplit splits file.txt into multiple pieces, starting a new piece at each line that matches the regular expression. By default the pieces are named xx followed by a monotonically increasing numeric suffix. We then look at each piece in turn and rename it to the model number found on its first line.

Any files still matching xx* at the end of the loop contain duplicate model numbers (the renaming is performed on a first come, first served basis).
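To see the duplicate handling in action, here is a small run on made-up data in which model number 1 appears twice. It assumes GNU csplit, as the options above do.

```shell
# Scratch directory with a sample file.txt containing a repeated model number.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'Model        1' 'Atom a' 'ENDMDL' \
              'Model        2' 'Atom b' 'ENDMDL' \
              'Model        1' 'Atom c' 'ENDMDL' > file.txt

# Split and rename exactly as in the answer.
csplit -ksz file.txt '/^Model/' '{*}'
for xx in xx*
do
    newname=$(awk '{print $2; exit}' "$xx")
    test ! -f "$newname" && mv -f "$xx" "$newname"
done

# Files 1 and 2 now exist; the piece holding the second "Model 1" keeps its xx name.
ls
```

The leftover xx file is the signal that something needs manual attention, since nothing is overwritten silently.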

Vombat · Answer 3 · 2016-06-14T11:25:31.813

#!/bin/bash

while read -r line
do
    case $line in
        Model*)
            # file name is the Model line without whitespace, e.g. Model1
            f="${line//[[:space:]]/}"
            touch "$f"
            ;;
        ENDMDL)
            # end of a model: nothing to write
            :
            ;;
        *)
            echo "$line" >> "$f"
            ;;
    esac
done < "$1"

Something like this. Run it with the models file as its argument: ./script_name models.txt

Note that, as mentioned by @PM 2Ring, this approach is slow, especially if you have large files.
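If speed matters, roughly the same output (files named ModelN that contain only the lines between Model and ENDMDL) could be produced by awk in a single pass. This is a sketch under assumptions, not a drop-in replacement: models.txt is a placeholder name, and unlike the loop above it starts each output file afresh rather than appending to an existing one.

```shell
# Scratch directory with a small sample input (models.txt is a made-up name).
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'Model        1   ' 'Atom a' 'ENDMDL' \
              'Model        2   ' 'Atom b' 'ENDMDL' > models.txt

awk '
    /^Model/  { if (out != "") close(out)   # finish the previous model
                out = "Model" $2            # e.g. Model1
                printf "" > out             # create the file even if the model is empty
                next }
    /^ENDMDL/ { next }                      # drop the terminator line
    out != "" { print > out }               # body lines go to the current file
' models.txt

ls Model*
```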

The filenames will be Model.............N (dots are spaces). I'd assume OP would not want all that space in filenames. — 123, Jun 14 '16 at 11:10
Note that Bash read is rather slow. It obtains its input character by character, with a system call required for each character. It's really designed for interactive use, not text processing. Please see Why is using a shell loop to process text considered bad practice? for further details. — PM 2Ring, Jun 14 '16 at 11:21
