11

My text file looks like this:

This is one
sentence that is broken.
However this is a good one.
And this
one is
somehow, broken into
many.

I want to remove the trailing newline character for any line which is followed by a line starting with a lowercase letter.

So this should be:

This is one sentence that is broken.
However this is a good one.
And this one is somehow, broken into many.

How can I do this?

Edit: There are some really good answers here, but I chose to accept the first one that worked and was earliest. Thanks so much everyone!

  • 1
    LaTeX? The problem is that you don't really state the rules for proper sentence breaking. Do you want to put everything up to and including end-of-sentence punctuation on a single line? But what if you have a long sentence and it runs off the edge of your display window? – jamesqf Jul 26 '17 at 18:40
  • 1
    I wonder what you're really trying to solve? Perhaps you should use markdown formatting? – Wildcard Jul 26 '17 at 22:22
  • @JeffSchaller Thanks for the reminder! I had missed out somehow. :) –  Jul 31 '17 at 10:31

7 Answers

10

With awk:

awk -v ORS= '{print (NR == 1 ? "" : /^[[:lower:]]/ ? " " : RS) $0}
             END {if (NR) print RS}'

That is, do not append the record separator to each line (ORS empty). But prepend a record separator before the current line if not on the first line and the current line doesn't start with a lowercase letter. Otherwise prepend a space character instead, except on the first line.
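For readers who do not use awk, the same "choose the separator from the current line" idea can be sketched in Python. This is only an illustration, not a replacement for the one-liner: the function name `join_lines` and the `sample` string are made up here, and `str.islower()` stands in for awk's `[[:lower:]]` class.

```python
import io

def join_lines(lines):
    """Yield each line preceded by a separator chosen from that line itself."""
    count = 0
    for count, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if count == 1:
            sep = ""      # nothing goes before the first line
        elif line[:1].islower():
            sep = " "     # continuation: glue it to the previous line
        else:
            sep = "\n"    # new sentence: terminate the previous line
        yield sep + line
    if count:
        yield "\n"        # terminate the last line, like the END block

# Hypothetical sample, copied from the question.
sample = (
    "This is one\nsentence that is broken.\nHowever this is a good one.\n"
    "And this\none is\nsomehow, broken into\nmany.\n"
)
result = "".join(join_lines(io.StringIO(sample)))
print(result, end="")
```

Deciding the separator when the next line arrives, rather than when the current one is printed, is what lets the input be processed as a stream without look-ahead.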

  • When I run this some pairs of words are concatenated. For example And thisone issomehow, broken intomany. I don't know awk but should lines be joined with <space> in addition to RS? Or is this user error? – B Layer Aug 13 '17 at 19:58
  • @BLayer, well spotted, thanks. Should be fixed now. – Stéphane Chazelas Aug 14 '17 at 06:54
  • No problem. Though one wonders where the 11 upvotes came from. Must be nice to have people just assume you're always right. ;) – B Layer Aug 14 '17 at 21:24
7

try

awk '$NF !~ /\.$/ { printf "%s ",$0 ; next ; } {print;}' file

where

  • $NF !~ /\.$/ matches lines whose last field does not end with a dot,
  • { printf "%s ",$0 prints such a line followed by a space instead of a line feed,
  • next ; } skips to the next line,
  • {print;} prints every other line (those ending with a dot) unchanged, with its newline.
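The steps above can be approximated in Python as follows. This is a rough sketch: `join_until_dot` is a name invented for this example, and `str.split()` only approximates awk's default field splitting.

```python
def join_until_dot(lines):
    """Join each line to the next one unless its last field ends with a dot."""
    pieces = []
    for raw in lines:
        line = raw.rstrip("\n")
        fields = line.split()
        last_field = fields[-1] if fields else ""  # rough stand-in for awk's $NF
        if last_field.endswith("."):
            pieces.append(line + "\n")  # sentence complete: keep the line break
        else:
            pieces.append(line + " ")   # sentence continues: a space, no newline
    return "".join(pieces)

# Hypothetical sample, copied from the question.
sample = [
    "This is one\n", "sentence that is broken.\n", "However this is a good one.\n",
    "And this\n", "one is\n", "somehow, broken into\n", "many.\n",
]
print(join_until_dot(sample), end="")
```

Because the decision depends only on how a line ends, a line without a final dot is joined to the next one even when that next line starts with an uppercase letter.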

I am sure there will be a sed option.

Note: this joins lines based on whether a line ends in a dot, not on whether the following line starts with a lowercase letter, so it does not implement the rule exactly as stated in the question. For that, see Stéphane Chazelas's answer.

Archemar
4

In perl:

#!/usr/bin/perl -w
use strict;
my $input = join("", <>);
$input =~ s/\n([a-z])/ $1/g;
print $input;

Technically you wanted to replace "newline followed by lower-case letter" with "space and-that-lower-case-letter", which is what the core of the above perl script does:

  1. Read the entire input into the string $input.
  2. Update $input to be the result of the search-and-replace operation.
  3. Print the new value.
Jeff Schaller
  • 1
    good one!! translated to one-liner, perl -0777 -pe 's/\n([a-z])/ $1/g' and can similarly be done with GNU sed as sed -zE 's/\n([a-z])/ \1/g' (assuming input doesn't have null characters) – Sundeep Jul 26 '17 at 13:56
  • 3
    @Sundeep, or perl -Mopen=locale -0777 -pe 's/\n(?=[[:lower:]])/ /g' for it not to be limited to ASCII letters. – Stéphane Chazelas Jul 26 '17 at 16:05
4

With sed you could use an N;P;D cycle, so as to always have two lines in the pattern space: if the first character after the newline is lowercase, the newline is replaced with a space. Adding a test (t) restarts the cycle after each successful substitution:

sed -e :t -e '$!N;/\n[[:lower:]]/s/\n/ /;tt' -e 'P;D' infile
don_crissti
  • 1
    I think I see what's going on here, but an expanded answer would help those of us who don't use sed loops and pattern spaces very often. – Joe Jul 29 '17 at 12:32
  • @Joe - what do you mean by "not using the pattern space very often" ? That's where almost all operations take place - the hold space is a "storage space" - you can't do anything with the data while it's there. Anyway, I have explained in detail how a N;P;D cycle works here so I won't go over it again. The difference here is the test - to check whether something was replaced or not - if the test is successful then we branch to the top of the script, otherwise it means nothing was replaced and P;D are executed. Let me know if it's still unclear. – don_crissti Jul 29 '17 at 14:16
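For those less used to sed loops, the "two lines in the pattern space plus a test" idea can be loosely mimicked in Python. This sketch is not a faithful model of sed's semantics: `npd_join` is a hypothetical name, `current` plays the role of the pattern space, and `nxt` is the line that N would append.

```python
def npd_join(lines):
    """Hold one line, peek at the next, and merge while the next starts lowercase."""
    pending = (line.rstrip("\n") for line in lines)
    out = []
    current = next(pending, None)        # the "pattern space"
    while current is not None:
        nxt = next(pending, None)        # like N: bring in the following line
        # Like the s/\n/ / plus t loop: keep merging while the join succeeds.
        while nxt is not None and nxt[:1].islower():
            current = current + " " + nxt
            nxt = next(pending, None)
        out.append(current)              # like P: print the first line
        current = nxt                    # like D: drop it, continue with the rest
    return "".join(line + "\n" for line in out)

# Hypothetical sample, copied from the question.
sample = [
    "This is one\n", "sentence that is broken.\n", "However this is a good one.\n",
    "And this\n", "one is\n", "somehow, broken into\n", "many.\n",
]
print(npd_join(sample), end="")
```

The inner loop corresponds to branching back to the label after a successful substitution. The outer loop corresponds to the P;D step that runs only once nothing more can be joined.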
3

Another way you can do this is:

perl -lpe '$\ = /\.$/ ? $/ : $"' data

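where $\ is the output record separator (ORS), $/ is the input record separator (a newline by default), and $" is the list separator (a space by default).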

perl -pe '$_ .= <>, eof or redo if s/[^.]\K\n/ /' data

sed -e '
   :a
      /\.$/!N
      s/\n/ /
   ta
' data
3

Using sed and fmt:

$ sed -e '1n; s/^[[:upper:]]/\n&/' input.txt | fmt
This is one sentence that is broken.

However this is a good one.

And this one is somehow, broken into many.

The sed script inserts a newline before every line that begins with a capital letter (except for the very first line of input). sed's output is then piped into fmt to reformat the resulting paragraphs.
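The same two-stage idea (start a new paragraph at each capitalised line, then reflow every paragraph) can be sketched with Python's standard `textwrap` module. This is an approximation only: `split_then_reflow` is a made-up name, the width of 75 is an assumption, the input is assumed to contain no blank lines, and `textwrap.fill` does not reproduce every formatting decision that fmt makes.

```python
import textwrap

def split_then_reflow(text, width=75):
    """Start a paragraph at each line beginning with an uppercase letter, then refill."""
    paragraphs = []
    for number, line in enumerate(text.splitlines(), start=1):
        if number == 1 or line[:1].isupper():
            paragraphs.append([line])    # a capitalised line opens a new paragraph
        else:
            paragraphs[-1].append(line)  # anything else continues the current one
    filled = [textwrap.fill(" ".join(p), width=width) for p in paragraphs]
    # Paragraphs are separated by one blank line, as in the output shown here.
    return "\n\n".join(filled) + ("\n" if filled else "")

# Hypothetical sample, copied from the question.
sample = (
    "This is one\nsentence that is broken.\nHowever this is a good one.\n"
    "And this\none is\nsomehow, broken into\nmany.\n"
)
print(split_then_reflow(sample), end="")
```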

Alternatively use par if you have it installed. It's another paragraph reformatter, but much more capable than fmt, with many more features and options.

Note that there will be a blank line between paragraphs, because fmt treats blank lines as paragraph separators. Without the blank lines, your entire input sample is reformatted as a single multi-sentence paragraph, e.g.:

$ fmt input.txt
This is one sentence that is broken.  However this is a good one.
And this one is somehow, broken into many.

If you need to remove the blank lines after reformatting just pipe it through sed again - but this will remove ALL blank lines, including any that may have been in the original input. e.g.

$ sed -e '1n; s/^[[:upper:]]/\n&/' input.txt | fmt | sed -e '/^$/d'
This is one sentence that is broken.
However this is a good one.
And this one is somehow, broken into many.
cas
2

Python 3

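import re
with open('file.txt') as f:  # close the file when done
    # end='' avoids adding a second newline after the file's own trailing one
    print(re.sub(r'\n([a-z])', r' \1', f.read()), end='')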

This is the same regex/substitution as Jeff's answer
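As a comment on Jeff's answer points out, `[a-z]` only covers ASCII letters. A possible variant, offered as a sketch rather than a drop-in fix, uses a lookahead and lets `str.islower()` decide, which also recognises non-ASCII lowercase letters. The function name is invented for this example.

```python
import re

def join_lowercase_continuations(text: str) -> str:
    """Replace a newline with a space when the next character is lowercase."""
    # The lookahead captures the following character without consuming it,
    # so only the newline itself is replaced.
    return re.sub(
        r"\n(?=(.))",
        lambda m: " " if m.group(1).islower() else "\n",
        text,
    )

# Hypothetical sample, copied from the question.
sample = (
    "This is one\nsentence that is broken.\nHowever this is a good one.\n"
    "And this\none is\nsomehow, broken into\nmany.\n"
)
print(join_lowercase_continuations(sample), end="")
```

Wrapping the substitution in a function also makes it straightforward to check against the sample from the question.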

wjandrea