How can I fix lines broken in wrong places?

Question

My text file looks like this:

This is one
sentence that is broken.
However this is a good one.
And this
one is
somehow, broken into
many.

I want to remove the trailing newline character for any line which is followed by a line starting with a lowercase letter.

So this should be:

This is one sentence that is broken.
However this is a good one.
And this one is somehow, broken into many.

How can I do this?

Edit: There are some really good answers here, but I chose to accept the first one that worked and was earliest. Thanks so much everyone!

LaTeX? The problem is that you don't really state the rules for proper sentence breaking. Do you want to put everything up to and including end-of-sentence punctuation on a single line? But what if you have a long sentence and it runs off the edge of your display window? — jamesqf, Jul 26 '17 at 18:40
I wonder what you're really trying to solve? Perhaps you should use markdown formatting? — Wildcard, Jul 26 '17 at 22:22
@JeffSchaller Thanks for the reminder! I had missed out somehow. :) — , Jul 31 '17 at 10:31

Stéphane Chazelas · Answer 1 · 2017-08-14T06:54:19.390

10

With awk:

awk -v ORS= '{print (NR == 1 ? "" : /^[[:lower:]]/ ? " " : RS) $0}
             END {if (NR) print RS}'

That is, do not append the record separator to each line (ORS empty). But prepend a record separator before the current line if not on the first line and the current line doesn't start with a lowercase letter. Otherwise prepend a space character instead, except on the first line.
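For readers who do not use awk, the same "choose what goes in front of each line" idea can be sketched in Python. This is only an illustration, not part of the original answer: the function name `join_awk_style` is made up, and `str.islower()` on the first character is assumed to be a reasonable stand-in for awk's locale-dependent `[[:lower:]]`.

```python
def join_awk_style(text):
    """Join lines by choosing a prefix for each line, like the awk one-liner."""
    parts = []
    for number, line in enumerate(text.splitlines(), start=1):
        if number == 1:
            prefix = ""        # nothing before the very first line
        elif line[:1].islower():
            prefix = " "       # continuation: glue to the previous line
        else:
            prefix = "\n"      # new sentence: end the previous line first
        parts.append(prefix + line)
    if parts:
        parts.append("\n")     # mirrors END {if (NR) print RS}
    return "".join(parts)
```

Calling `join_awk_style()` on the sample text from the question gives the three joined sentences, each on its own line.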

edited Aug 14 '17 at 06:54

answered Jul 26 '17 at 16:13

Stéphane Chazelas


When I run this some pairs of words are concatenated. For example And thisone issomehow, broken intomany. I don't know awk but should lines be joined with <space> in addition to RS? Or is this user error? – B Layer Aug 13 '17 at 19:58
@BLayer, well spotted, thanks. Should be fixed now. – Stéphane Chazelas Aug 14 '17 at 06:54
No problem. Though one wonders where the 11 upvotes came from. Must be nice to have people just assume you're always right. ;) – B Layer Aug 14 '17 at 21:24

Archemar · Accepted Answer · 2017-07-26T18:36:53.110

7

try

awk '$NF !~ /\.$/ { printf "%s ",$0 ; next ; } {print;}' file

where

$NF !~ /\.$/ matches lines whose last field does not end with a dot,
{ printf "%s ",$0 prints such a line with a trailing space and no line feed,
next ; } skips the remaining rules and moves on to the next line,
{print;} prints every other line (one that ends with a dot) unchanged.

I am sure there will be a sed option.

Note: this relies on sentences ending with a dot. It does not test whether the next line begins with a lowercase letter, so its result can differ from what the question asks for. See Stéphane Chazelas's answer.
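To make the difference between the two rules concrete, here is a rough Python sketch of the "join unless the line ends with a dot" rule used by the awk command above. The function name `join_unless_dot` is hypothetical, and the whitespace splitting only approximates awk's default field splitting.

```python
def join_unless_dot(text):
    """Follow each line with a space unless its last field ends with a dot."""
    out = []
    for line in text.splitlines():
        fields = line.split()                  # rough equivalent of awk fields
        last_field = fields[-1] if fields else ""
        # A dot at the end closes the sentence; anything else is continued.
        out.append(line + ("\n" if last_field.endswith(".") else " "))
    return "".join(out)
```

On the sample from the question this gives the wanted output. On input such as "It ends here." followed by "but continues" and "Next one.", the second and third lines are merged and the first is left alone, which is not what the lowercase rule would do.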

edited Jul 26 '17 at 18:36

answered Jul 26 '17 at 13:35

Archemar


If you like clever (many don't) awk 'ORS=$NF~/\.$/?"\n":" "' – dave_thompson_085 Aug 14 '17 at 08:36

score 4 · Answer 3 · answered Jul 26 '17 at 13:43


In perl:

#!/usr/bin/perl -w
use strict;
my $input = join("", <>);
$input =~ s/\n([a-z])/ $1/g;
print $input;

Technically you wanted to replace "newline followed by lower-case letter" with "space and-that-lower-case-letter", which is what the core of the above perl script does:

Read in the input to a string input.
Update the input variable to be the result of the search & replace operation.
Print the new value.

answered Jul 26 '17 at 13:43

Jeff Schaller


good one!! translated to one-liner, perl -0777 -pe 's/\n([a-z])/ $1/g' and can similarly be done with GNU sed as sed -zE 's/\n([a-z])/ \1/g' (assuming input doesn't have null characters) – Sundeep Jul 26 '17 at 13:56
@Sundeep, or perl -Mopen=locale -0777 -pe 's/\n(?=[[:lower:]])/ /g' for it not to be limited to ASCII letters. – Stéphane Chazelas Jul 26 '17 at 16:05
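The point about not being limited to ASCII letters can also be sketched in Python. This is an illustration under assumptions: the function name `join_before_lowercase` is made up, and `str.islower()` is used as a Unicode-aware substitute for `[[:lower:]]`, which Python's `re` module does not support.

```python
import re

def join_before_lowercase(text):
    """Turn a newline into a space when the next character is lowercase."""
    def choose(match):
        # group(1) is the character after the newline; it is captured inside
        # the lookahead, so it is inspected but not consumed.
        return " " if match.group(1).islower() else "\n"
    return re.sub(r"\n(?=(.))", choose, text)
```

With this version a continuation line starting with "é" is joined to the previous line, whereas a pattern written with `[a-z]` leaves it alone.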

score 4 · Answer 4 · edited Jul 26 '17 at 15:56


With sed you could use an N;P;D cycle, so that there are always two lines in the pattern space. If the first character after the newline is lowercase, the newline is replaced with a space. A test (t) then restarts the cycle after each successful substitution:

sed -e :t -e '$!N;/\n[[:lower:]]/s/\n/ /;tt' -e 'P;D' infile
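As a loose model of what that sed loop does, the following Python sketch keeps one "pattern space" string and looks at one incoming line at a time. It is an approximation written for illustration, not a sed emulator: the function name `join_with_window` is hypothetical, and `str.islower()` stands in for `[[:lower:]]`.

```python
def join_with_window(lines):
    """Model the sed N;P;D loop: hold one pending line, inspect the next one."""
    out = []
    it = iter(lines)
    try:
        pattern = next(it)           # first line read into the pattern space
    except StopIteration:
        return out                   # empty input, nothing to print
    for nxt in it:                   # N: bring in the next line
        if nxt[:1].islower():
            pattern += " " + nxt     # s/\n/ / succeeded, t branches back to :t
        else:
            out.append(pattern)      # P: print up to the newline
            pattern = nxt            # D: drop it, keep the rest, start over
    out.append(pattern)              # last line: nothing left to append
    return out
```

Here `lines` is a list of strings without trailing newlines, and the result is the list of joined lines.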

edited Jul 26 '17 at 15:56

Stéphane Chazelas


answered Jul 26 '17 at 13:57

don_crissti


I think I see what's going on here, but an expanded answer would help those of us who don't use sed loops and pattern spaces very often. – Joe Jul 29 '17 at 12:32
@Joe - what do you mean by "not using the pattern space very often" ? That's where almost all operations take place - the hold space is a "storage space" - you can't do anything with the data while it's there. Anyway, I have explained in detail how a N;P;D cycle works here so I won't go over it again. The difference here is the test - to check whether something was replaced or not - if the test is successful then we branch to the top of the script, otherwise it means nothing was replaced and P;D are executed. Let me know if it's still unclear. – don_crissti Jul 29 '17 at 14:16

score 3 · Answer 5 · 2017-07-27T05:31:23.097


Another way you can do this is:

perl -lpe '$\ = /\.$/ ? $/ : $"' data

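where $\ is the output record separator (ORS in awk), $/ is the input record separator (a newline by default), and $" is the list separator (a space by default).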

perl -pe '$_ .= <>, eof or redo if s/[^.]\K\n/ /' data

sed -e '
   :a
      /\.$/!N
      s/\n/ /
   ta
' data

edited Jul 27 '17 at 05:31

answered Jul 26 '17 at 14:14

cas · Answer 6 · 2017-07-27T02:42:39.630

Using sed and fmt:

$ sed -e '1n; s/^[[:upper:]]/\n&/' input.txt | fmt
This is one sentence that is broken.

However this is a good one.

And this one is somehow, broken into many.

The sed script inserts a newline before every line that begins with a capital letter (except for the very first line of input). sed's output is then piped into fmt to reformat the resulting paragraphs.
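The same two-step idea (start a new paragraph at each capitalised line, then reflow each paragraph) can be sketched in Python with the standard `textwrap` module. This is an illustration only: the function name `reflow_sentences` is made up, `textwrap.fill` is not identical to fmt, and the width of 75 is an assumption meant to resemble fmt's default.

```python
import textwrap

def reflow_sentences(text, width=75):
    """Start a paragraph at each capitalised line, then refill every paragraph."""
    paragraphs = []
    for index, line in enumerate(text.splitlines()):
        if index == 0 or line[:1].isupper():
            paragraphs.append([line])      # a capitalised line opens a paragraph
        else:
            paragraphs[-1].append(line)    # anything else continues the current one
    # Paragraphs are separated by one blank line, as in the fmt output above.
    return "\n\n".join(textwrap.fill(" ".join(p), width=width) for p in paragraphs)
```

As with the sed and fmt pipeline, the result has a blank line between paragraphs, and long sentences are wrapped at the given width rather than kept on one line.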

Alternatively use par if you have it installed. It's another paragraph reformatter, but much more capable than fmt, with many more features and options.

Note that there will be a blank line between each paragraph. Paragraphs should be separated from each other by at least one blank line. Without the blank lines, your entire input sample is reformatted as a single multi-sentence paragraph, e.g.:

$ fmt input.txt
This is one sentence that is broken.  However this is a good one.
And this one is somehow, broken into many.

If you need to remove the blank lines after reformatting just pipe it through sed again - but this will remove ALL blank lines, including any that may have been in the original input. e.g.

$ sed -e '1n; s/^[[:upper:]]/\n&/' input.txt | fmt | sed -e '/^$/d'
This is one sentence that is broken.
However this is a good one.
And this one is somehow, broken into many.

score 2 · Answer 7 · edited Jun 11 '20 at 14:16


Python 3

import re
# Read the whole file, then turn "newline + lowercase letter" into "space + that letter".
with open('file.txt') as f:
    print(re.sub(r'\n([a-z])', r' \1', f.read()), end='')

This is the same regex/substitution as Jeff's answer

edited Jun 11 '20 at 14:16

Community


answered Jul 26 '17 at 21:11

wjandrea

