How to get portion of lines from all .txt files in a directory?

Question

I have 5,000 text files of journal article citations, and I am trying to extract only the abstract portion. That is, I want to keep each text file but delete all of its text except for the abstract. I am very new to Linux and I have been trolling your board for a while.

how to extract words that after keyword

execute command on all file in a directory

for file in test
nano my.sh
while read variable do
  sed '0,/^Abstract$/d' 
done <file

Here is an example of a file; it is similar to a scientific journal article:

Sponsor     : Beckman Res Inst Cty Hope
      1500 E. Duarte Road
      Duarte, CA  910103000    /   -
NSF Program : 1114      CELL BIOLOGY
Fld Applictn: 0000099   Other Applications NEC

          61        Life Science Biological

Program Ref : 9285,
Abstract    :
      Studies of chickens have provided serological and nucleic acid                 
      probes useful in defining the major histocompatibility complex                 
      (MHC) in other avian species.  Methods used in detecting genetic               
      diversity at loci within the MHC of chickens and mammals will be               
      applied to determining the extent of MHC polymorphism within                   
      small populations of ring-necked pheasants, wild turkeys, cranes,              
      Andean condors and other species.  The knowledge and expertise                 
      gained from working with the MHC of the chicken should make for                
      rapid progress in defining the polymorphism of the MHC in these                
      species and in detecting the polymorphism of MHC gene pool within              
      small wild and captive populations of these birds.

It would also help if you would indicate what result you want from the sample text. Do you want to delete lines 1 through 10 (where line 9 is the one that says Abstract and line 10 is blank) and keep lines 11 (Studies of chickens …) through 21 (… these birds.)? Is line 21 the last line of the file? If not, how is the command supposed to identify the last line of the abstract? (E.g., is line 22 blank? Or does it contain some other constant heading string?) — Scott - Слава Україні, Dec 07 '14 at 20:17
@Scott I do not know which lines this portion of the text is on. The text in these txt files can vary by author. — user3426338, Dec 07 '14 at 20:37
I know that you don’t know the line numbers in general; that’s why I said, “from the sample text” and referred to the sample text that you added to your question. I’m requesting that, for the sample text that you posted in your question, you specify what results you expect. — Scott - Слава Україні, Dec 07 '14 at 21:34
@Scott Ohh. Alright gotcha. I'll keep that in mind for the future. And I'll edit my question now. — user3426338, Dec 07 '14 at 21:40

John1024 · Accepted Answer · 2014-12-07T20:07:56.230


As I understand it, you want to change a series of files in-place: in each one, you want to delete everything up to and including the first line that consists solely of the word Abstract. If those files are in the current directory and are all named with a .txt extension, then use:

sed -i '0,/^Abstract$/d' *.txt

Since this overwrites the old files, don't use it without having a backup in case something goes wrong.

This requires GNU sed (which is standard on Linux), because the 0,/regex/ address form is a GNU extension.

How it works

-i

The -i option tells sed to edit files in-place. The old file will be overwritten.

0,/^Abstract$/d

This command tells sed to delete (d) every line from the start of the file up to and including the first line that matches the regular expression ^Abstract$. The address 0 is a GNU sed extension: it behaves like 1, except that the regex is also allowed to match on the very first line. The caret, ^, matches at the start of the line and the dollar sign, $, matches at the end of the line. Thus, this regex matches only a line that consists of the word Abstract with no other characters on the line.

*.txt

This tells the shell to select all files in the current directory that have the .txt suffix.
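To illustrate what the 0 address changes, here is a small sketch that runs both forms on made-up input whose very first line is Abstract. It assumes GNU sed; the sample text and variable names are only for illustration.

```shell
# Made-up sample: the heading is on line 1 and there is no later match.
sample=$(printf '%s\n' 'Abstract' 'body line 1' 'body line 2')

# With 0,/re/ the regex may match on line 1, so only that line is deleted.
with_zero=$(printf '%s\n' "$sample" | sed '0,/^Abstract$/d')

# With 1,/re/ the end regex is first tested on line 2; nothing matches,
# so the range runs to the end of the input and every line is deleted.
with_one=$(printf '%s\n' "$sample" | sed '1,/^Abstract$/d')

printf 'with 0: [%s]\n' "$with_zero"
printf 'with 1: [%s]\n' "$with_one"
```

When the heading never appears on line 1, the two forms behave the same.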

Update

This will delete all lines in each file up to and including the first line that starts with Abstract:

sed -i '0,/^Abstract/d' *.txt

Because the $ has been removed, this regular expression only requires that the line begin with Abstract.
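The difference matters for input like the sample in the question, where the heading line reads "Abstract    :". The sketch below, which assumes GNU sed and uses a scratch directory with hypothetical file names, applies both regexes to copies of the same made-up file. With ^Abstract$ no line matches, so the range runs to the end of the file and everything is deleted, which would explain files coming out blank.

```shell
# Scratch directory with two identical made-up files (names are hypothetical).
demo_dir=$(mktemp -d)
printf '%s\n' 'Program Ref : 9285,' 'Abstract    :' '      Studies of chickens' > "$demo_dir/strict.txt"
cp "$demo_dir/strict.txt" "$demo_dir/loose.txt"

# Anchored at both ends: "Abstract    :" does not match, so all lines are deleted.
sed -i '0,/^Abstract$/d' "$demo_dir/strict.txt"

# Anchored only at the start: the heading matches and the abstract text survives.
sed -i '0,/^Abstract/d' "$demo_dir/loose.txt"

strict_bytes=$(wc -c < "$demo_dir/strict.txt")
loose_result=$(cat "$demo_dir/loose.txt")
printf 'strict.txt size: %s bytes\n' "$strict_bytes"
printf 'loose.txt content: [%s]\n' "$loose_result"
```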

edited Dec 07 '14 at 20:07

answered Dec 07 '14 at 19:38


what does the -i mean? And your solution made all of my files blank. – user3426338 Dec 07 '14 at 19:47
@user3426338 -i means edit in-place. It causes the old files to be overwritten with the new version. If you want something else to happen instead, let me know. – John1024 Dec 07 '14 at 19:53
@John1024 Yeah, I am looking at the modified files through mousepad and all of my files are blank. I'm using backup files. I'm glad I took your advice on not using the original files. – user3426338 Dec 07 '14 at 19:56
@user3426338 My solution was based on the regex that you provided: ^Abstract$. I see from the updated question, that this is not a match to the sample input. See the updated answer. And, continue making sure that you have a backup. – John1024 Dec 07 '14 at 19:58
So I tried executing this command over a directory with countless subdirectories. I looked at this manual about the find command and came up with this: sudo find / awd_1990_02 -name *txt -exec sed -i '0,/^Abstract/d' *.txt {} \; but I keep getting sed: can't read *.txt: No such file or directory – user3426338 Dec 07 '14 at 23:16
Figured it out. – user3426338 Dec 08 '14 at 00:10

αғsнιη · Answer 2 · 2014-12-08T07:28:09.630

Using sed:

sed -ni.bak '/^Abstract/,$p' *.txt

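This prints every line from the first line that starts (^) with Abstract through the end ($) of the file, so the Abstract heading line itself is kept. The -i.bak option edits each file in place and saves a copy of the original as a .bak file (for example, file.txt.bak).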
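As a quick check of what this leaves behind, here is a sketch on a made-up file in a scratch directory. It assumes GNU sed; the file name c.txt and its contents are hypothetical.

```shell
# Scratch directory with one made-up file.
demo_dir=$(mktemp -d)
printf '%s\n' 'Sponsor     : example' 'Abstract    :' '      body text' > "$demo_dir/c.txt"

# -n suppresses default output; -i.bak edits in place and keeps c.txt.bak.
sed -ni.bak '/^Abstract/,$p' "$demo_dir/c.txt"

edited=$(cat "$demo_dir/c.txt")
backup_lines=$(wc -l < "$demo_dir/c.txt.bak")
printf 'edited file:\n%s\n' "$edited"
printf 'backup has %s lines\n' "$backup_lines"
```

The option order matters here: written as -in, GNU sed would take n as the backup suffix rather than as the -n option.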

With awk (this prints to standard output and leaves the files unchanged):

awk '/^Abstract/,0' *.txt
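One caveat when several files are passed at once: awk reads them as a single stream, and a range ending in 0 never closes, so every file after the first match is printed in full. A possible per-file variant, sketched here with made-up files and a hypothetical flag variable named keep, resets the flag whenever FNR returns to 1.

```shell
# Two made-up files in a scratch directory.
demo_dir=$(mktemp -d)
printf '%s\n' 'header A' 'Abstract    :' 'text A' > "$demo_dir/a.txt"
printf '%s\n' 'header B' 'Abstract    :' 'text B' > "$demo_dir/b.txt"

# Original form: the range opened in a.txt stays open, so "header B" is printed too.
single_stream=$(awk '/^Abstract/,0' "$demo_dir"/*.txt)

# Per-file form: FNR == 1 marks the first line of each file and clears the flag.
per_file=$(awk 'FNR == 1 { keep = 0 } /^Abstract/ { keep = 1 } keep' "$demo_dir"/*.txt)

printf 'single stream:\n%s\n\n' "$single_stream"
printf 'per file:\n%s\n' "$per_file"
```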

If you also want to process subdirectories, combine the command with find like this:

find /path/to/main-dir -type f -name "*.txt" -exec sed -ni.bak '/^Abstract/,$p' '{}' \;

A loop over NUL-delimited names (this needs bash for read -d '') is also safe for file names that contain spaces or newlines:

find /path/to/main-dir -type f -name "*.txt" -print0 | while IFS= read -d '' -r file
do
    sed -ni.bak '/^Abstract/,$p' "$file";
done

In the command you gave in the body of your question (find -name *txt -type d -exec sed -i '0,/^Abstract/d' *.txt {} \;), -type d makes find search for directories whose names end in txt. If you don't have any directory with such a name, the -exec part never runs, so the command does nothing.

So you have to change *txt -type d to "*.txt" -type f (that is, all regular files whose names end in .txt), quoting the pattern so that the shell does not expand it before find sees it. You also need to remove *.txt from the end of the sed command, because '{}' already stands for the file that find has just found; quote it too. It is also better to give find an explicit starting path. With those changes, your command becomes:

find /path/to/main-dir -name "*.txt" -type f -exec sed -i '0,/^Abstract/d' '{}' \;
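Because a file with no line starting with Abstract would be emptied by this command, it may be worth listing such files first. The following is only a sketch: it assumes a grep that supports -L (list files without a match), as GNU grep does, and the directory and file names are made up.

```shell
# Scratch tree with one file that has the heading and one that does not.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub"
printf '%s\n' 'Abstract    :' 'text' > "$demo_dir/sub/ok.txt"
printf '%s\n' 'no heading here' > "$demo_dir/sub/missing.txt"

# List the .txt files that contain no line starting with Abstract.
# These are the files that '0,/^Abstract/d' would delete entirely.
no_abstract=$(find "$demo_dir" -type f -name '*.txt' -exec grep -L '^Abstract' {} +)
printf 'files without an Abstract line:\n%s\n' "$no_abstract"
```

Replacing "$demo_dir" with the real starting path gives a list to inspect before running the in-place edit.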

Thank you for all of your help. Do you know of any really good books or resources out there to get a better grasp of Linux and regex? Or does it just take practice? — user3426338, Dec 07 '14 at 20:17
You are welcome. 1. www.regexr.com, 2. Famous Sed One-Liners Explained, Part I: File Spacing, Numbering and Text Conversion and Substitution, 3. http://linuxreviews.org, 4. Sed - An Introduction and Tutorial by Bruce Barnett, 5. https://www.cs.ucy.ac.cy/~dzeina/courses/epl371/lectures/06-sed.pdf and more on google search... ;) — αғsнιη, Dec 07 '14 at 20:28
@user3426338 I have updated answer for working on sub_directory. — αғsнιη, Dec 08 '14 at 07:38
