I have 5,000 text files of journal article citations, and I am trying to extract only the abstract portion. That is, I want to keep the same text file and delete all the text except for the abstract. I am very new to Linux and have been searching your board for a while.

how to extract words that after keyword

execute command on all file in a directory

for file in test
nano my.sh
while read variable do
  sed '0,/^Abstract$/d' 
done <file

Here is an example of a file; it is similar to a scientific journal article:

Sponsor     : Beckman Res Inst Cty Hope
      1500 E. Duarte Road
      Duarte, CA  910103000    /   -

NSF Program : 1114 CELL BIOLOGY
Fld Applictn: 0000099 Other Applications NEC
61 Life Science Biological
Program Ref : 9285,
Abstract :

      Studies of chickens have provided serological and nucleic acid                 
      probes useful in defining the major histocompatibility complex                 
      (MHC) in other avian species.  Methods used in detecting genetic               
      diversity at loci within the MHC of chickens and mammals will be               
      applied to determining the extent of MHC polymorphism within                   
      small populations of ring-necked pheasants, wild turkeys, cranes,              
      Andean condors and other species.  The knowledge and expertise                 
      gained from working with the MHC of the chicken should make for                
      rapid progress in defining the polymorphism of the MHC in these                
      species and in detecting the polymorphism of MHC gene pool within              
      small wild and captive populations of these birds.       

αғsнιη
user3426338
  • Please give a sample of the file. – jimmij Dec 07 '14 at 19:26
  • put sample data. – PersianGulf Dec 07 '14 at 19:37
  • @jimmij I just edited the question. I put in example data. – user3426338 Dec 07 '14 at 19:41
  • It would also help if you would indicate what result you want from the sample text. Do you want to delete lines 1 through 10 (where line 9 is the one that says Abstract and line 10 is blank) and keep lines 11 (Studies of chickens …) through 21 (… these birds.)? Is line 21 the last line of the file? If not, how is the command supposed to identify the last line of the abstract? (E.g., is line 22 blank? Or does it contain some other constant heading string?) – Scott - Слава Україні Dec 07 '14 at 20:17
  • @Scott I do not know what lines that this text portion is located. The text in these txt files can vary from author. – user3426338 Dec 07 '14 at 20:37
  • I know that you don’t know the line numbers in general; that’s why I said, “from the sample text” and referred to the sample text that you added to your question. I’m requesting that, for the sample text that you posted in your question, you specify what results you expect. – Scott - Слава Україні Dec 07 '14 at 21:34
  • @Scott Ohh. Alright gotcha. I'll keep that in mind for the future. And I'll edit my question now. – user3426338 Dec 07 '14 at 21:40

2 Answers

1

As I understand it, you want to change a series of files in place: in each one, you want to delete everything up to and including the first line that consists solely of the word Abstract. If those files are in the current directory and are all named with a .txt extension, then use:

sed -i '0,/^Abstract$/d' *.txt

Because this overwrites the original files, make a backup before running it in case something goes wrong.

The 0,/regex/ address form is a GNU sed extension, so this requires GNU sed (which is standard on Linux).

How it works

  • -i

    The -i option tells sed to edit files in-place. The old file will be overwritten.

  • 0,/^Abstract$/d

    This command tells sed to delete (d) every line from the start of the file up to and including the first line that matches the regular expression ^Abstract$. In GNU sed, a starting address of 0 behaves like 1 except that the end pattern is also tested against line 1, so the range closes correctly even when the very first line matches. The caret, ^, matches at the start of the line and the dollar sign, $, matches at the end of the line. Thus, this regex matches only a line that contains the word Abstract and no other characters.

  • *.txt

    This tells the shell to select all files in the current directory that have the .txt suffix.
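
The difference between a starting address of 0 and 1 can be seen with a small experiment. This is only a sketch, assuming GNU sed; the four-line sample input is made up for illustration and nothing on disk is modified:

```shell
# Made-up input in which "Abstract" is the very first line and also appears later.
sample=$(printf '%s\n' 'Abstract' 'body' 'Abstract' 'tail')

# 0,/re/ : the end pattern is tested on line 1, so only that line is deleted.
with_zero=$(printf '%s\n' "$sample" | sed '0,/^Abstract$/d')

# 1,/re/ : the end pattern is first tested on line 2, so lines 1-3 are deleted.
with_one=$(printf '%s\n' "$sample" | sed '1,/^Abstract$/d')

printf 'with 0:\n%s\n\nwith 1:\n%s\n' "$with_zero" "$with_one"
```

With 0 the remaining lines are body, Abstract and tail; with 1 only tail is left.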

Update

This will delete all lines in each file up to and including the first line that starts with Abstract:

sed -i '0,/^Abstract/d' *.txt

Because the $ has been removed, this regular expression only requires that the line begin with Abstract.
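
One thing worth checking before editing in place: if a file contains no line matching the pattern, the range never closes and sed deletes every line, leaving the file empty. A rough pre-check might look like the sketch below. It assumes GNU sed and a grep that supports -L; the scratch directory and the two file names are hypothetical and exist only for the demonstration:

```shell
# Build two throwaway files: one with an Abstract heading, one without.
tmpdir=$(mktemp -d)
printf '%s\n' 'Program Ref : 9285,' 'Abstract :' 'Text of the abstract' > "$tmpdir/has_abstract.txt"
printf '%s\n' 'Program Ref : 9285,' 'No heading in this file' > "$tmpdir/no_abstract.txt"

# grep -L lists the files that contain NO line matching the pattern.
missing=$(cd "$tmpdir" && grep -L '^Abstract' *.txt)
echo "Files without an Abstract line: $missing"

# Without -i, sed only prints the result, so this previews the edit safely.
# For a file with no match, the preview is empty.
preview=$(sed '0,/^Abstract/d' "$tmpdir/no_abstract.txt")
echo "Preview for no_abstract.txt: '$preview'"
```

Any file that grep -L reports would end up empty after the in-place edit, so it is probably best to inspect those files by hand first.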

John1024
  • what does the -i mean? And your solution made all of my files blank. – user3426338 Dec 07 '14 at 19:47
  • @user3426338 -i means edit in-place. It causes the old files to be overwritten with the new version. If you want something else to happen instead, let me know. – John1024 Dec 07 '14 at 19:53
  • @Jon1024 yeah I am looking at the modified files through mousepad and all of my files are blank. I'm using backup files. I'm glad I took your advice on not using the original files. – user3426338 Dec 07 '14 at 19:56
  • @user3426338 My solution was based on the regex that you provided: ^Abstract$. I see from the updated question, that this is not a match to the sample input. See the updated answer. And, continue making sure that you have a backup. – John1024 Dec 07 '14 at 19:58
  • So tried executing this command over a directory with countless sub directories. I looked at this manual about the find command and came up with this sudo find / awd_1990_02 -name txt -exec sed -i '0,/^Abstract/d' .txt {} ; but I keep getting sed: can't read *.txt: No such file or directory – user3426338 Dec 07 '14 at 23:16
  • Figured it out. – user3426338 Dec 08 '14 at 00:10
1

Using sed:

sed -ni.bak '/^Abstract/,$p' *.txt

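This prints only the lines from the first line that starts (^) with Abstract through the end ($) of the file, so the Abstract line itself is kept. The -i.bak option edits each file in place and saves a copy of the original as *.txt.bak.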
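
To see what the backup suffix does, the command can be tried on a throwaway file first. This is a sketch assuming GNU sed; the scratch directory and sample.txt are hypothetical:

```shell
# Create a throwaway file with a header, an Abstract heading and one line of text.
tmpdir=$(mktemp -d)
printf '%s\n' 'Sponsor : Example' 'Abstract :' 'Text of the abstract' > "$tmpdir/sample.txt"

# -n suppresses automatic printing, p prints the selected range,
# and -i.bak writes the result back while keeping the original as sample.txt.bak.
sed -ni.bak '/^Abstract/,$p' "$tmpdir/sample.txt"

echo "Edited file:"
cat "$tmpdir/sample.txt"
echo "Backup of the original:"
cat "$tmpdir/sample.txt.bak"
```

Note the option order: in -ni.bak everything after the i is taken as the backup suffix, so writing -in.bak instead would be read as -i with the suffix n.bak rather than as -n plus -i.bak.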

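With awk, which prints the result to standard output instead of editing the files. A plain range such as /^Abstract/,0 never closes, so with several input files it would run on into the files that follow; the version below resets a flag at the first line of each file:

awk 'FNR == 1 { keep = 0 } /^Abstract/ { keep = 1 } keep' *.txt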

If you also want to process files in subdirectories, combine the command with find like this:

find /path/to/main-dir -type f -name "*.txt" -exec sed -ni.bak '/^Abstract/,$p' '{}' \;

If you prefer a shell loop, this null-delimited form stays safe even when file names contain newlines:

find /path/to/main-dir -type f -name "*.txt" -print0 | while IFS= read -d '' -r file
do
    sed -ni.bak '/^Abstract/,$p' "$file";
done

The command in the body of your question (find -name *txt -type d -exec sed -i '0,/^Abstract/d' *.txt {} \;) searches for directories (-type d matches only directories) whose names end in txt. If no directory has such a name, the -exec part never runs, so the command does nothing.

So change *txt -type d to "*.txt" -type f (which matches regular files whose names end in .txt), and quote the pattern so that the shell does not expand it before find sees it. Also remove *.txt from the end of the sed command: '{}' already stands for the file that find has just found, and it should be quoted too. It is also better to give find an explicit starting path. With those changes, your command becomes:

find /path/to/main-dir -name "*.txt" -type f -exec sed -i '0,/^Abstract/d' '{}' \;
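
A quick way to gain confidence before running this on the real tree is to try it on a small scratch tree. The sketch below assumes GNU sed and a find that supports the {} + form of -exec; the directory and file names are invented for the test. It also adds a .bak suffix so that the originals are kept:

```shell
# Build a scratch tree with one file at the top and one in a subdirectory
# whose name contains a space.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/awards/sub dir"
printf '%s\n' 'Header' 'Abstract :' 'Top-level text' > "$tmpdir/awards/a.txt"
printf '%s\n' 'Header' 'Abstract :' 'Nested text' > "$tmpdir/awards/sub dir/b.txt"

# {} + hands many file names to a single sed invocation; with -i, GNU sed
# treats each file separately, so the 0,/re/ range restarts for every file.
find "$tmpdir/awards" -type f -name '*.txt' -exec sed -i.bak '0,/^Abstract/d' {} +

# Show what is left in each edited file.
find "$tmpdir/awards" -type f -name '*.txt' -exec cat {} +
```

Using {} + instead of '{}' \; starts far fewer sed processes, which may matter with thousands of files, while the result per file should be the same.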
αғsнιη