3

I have a file that contains a list of words, one per line, where each run of consecutive words belongs to one sentence. The words of each sentence are followed by a space (a blank line), as shown in Representation #2 below.

Expected Output: (Representation #1):

These are the words for sentence 1
These are the words for sentence 2

Input (Representation #2):

These
are
the
words
for
sentence 1

these are the words for sentence 2

I tried following this question, but it doesn't work when different sentences contain different words. How can I change Representation #2 into Representation #1 on Linux?

M.A.G
  • Your example shows that you can have single blanks in a line, e.g. sentence 1. Can you also have chains of blanks or tabs within a line? If so, please edit your example to show how they should be handled: left as is, converted to individual blank chars, or something else. – Ed Morton Jan 04 '22 at 22:33
  • @EdMorton Yes, I can have blanks within a line; for example, I can have sentence a b . in one line. – M.A.G Jan 04 '22 at 22:36
  • I'm not talking about individual blanks, you already show individual blanks with sentence<blank>1, I'm talking about chains of blanks like foo<blank><blank>bar, or tabs, e.g. foo<tab>bar. – Ed Morton Jan 04 '22 at 22:38
  • @EdMorton No, this case does not occur. – M.A.G Jan 04 '22 at 22:43

6 Answers

7
$ awk -v RS= '{$1=$1}1' file
These are the words for sentence 1
these are the words for sentence 2
Ed Morton
5

With awk:

awk 'BEGIN { RS = "" } {gsub(/ *\n */, " "); print}' FILE
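The two awk commands posted so far treat whitespace inside a line differently, which matters for the "chains of blanks" case raised in the comments. A minimal sketch, assuming a POSIX-like awk; the function names `join_collapse`, `join_keep` and `sample`, and the sample input itself, are made up for illustration:

```shell
#!/bin/sh
# Hypothetical wrapper names; only the awk programs come from the answers.

# Rebuild each record from its fields: runs of blanks/tabs collapse to one space.
join_collapse() { awk -v RS= '{$1=$1}1'; }

# Replace only the newlines (and blanks around them): blanks inside a line survive.
join_keep() { awk 'BEGIN { RS = "" } {gsub(/ *\n */, " "); print}'; }

# Made-up input: two blanks between "foo" and "bar".
sample() { printf 'foo  bar\nbaz\n\nqux\n'; }

sample | join_collapse   # prints "foo bar baz", then "qux"
sample | join_keep       # prints "foo  bar baz", then "qux"
```

For the input in the question, which has no chains of blanks, both variants should print the same thing.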
Stephen Kitt
3

Using GNU sed in extended regex mode, with the hold space used to accumulate the non-empty lines:

sed -Ee 's/^\s+|\s+$//g
  /./{H;$!d;}
  x;s/.//;y/\n/ /
' file

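Another method is to set awk's built-in variables, making each line a field (FS set to ORS, a newline) and rebuilding the record with the default OFS: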

awk -v RS= '
BEGIN{FS=ORS}
{$1=$1}1
' file
guest_7
2
sed ':1;N;/\n$/!{$!b1};s/\s*\n/ /g' file

Lines are accumulated in the pattern space until either a trailing line feed (an empty line was just appended) or the last line of input triggers the substitution.

nezabudka
2
$ perl -00 -aE 'say join " ", @F' input.txt 
These are the words for sentence 1
these are the words for sentence 2
  • -00 tells perl to read the file in paragraph mode (paragraphs are separated by one or more blank lines).

  • -a tells perl to auto-split the input on white-space into array @F (similar to how awk auto-splits its input into $1, $2, $3, etc).

    -a also implicitly sets the -n option (since perl 5.20), which makes perl behave like sed -n (read all input, without automatically printing it). This can be overridden (to auto-print the possibly-modified input, like sed without -n) by adding the -p option to the command line.

  • -E enables all optional features for the script, such as the say function, which automatically appends a newline after printing. That is slightly simpler than print join(" ", @F), "\n" (which is what you'd have to do if you used -e instead of -E).

    say has been in perl for a long time now (since 5.10) and arguably should be enabled by default, but the perl devs decided years ago not to do that because of the risk of breaking old scripts that defined their own say functions.

  • The join() function joins the elements of array @F with spaces between them.


Alternatively, you can set the output field separator ($,) and not use join:

$ perl -00 -aE 'BEGIN{$,=" "}; say @F' input.txt 
These are the words for sentence 1
these are the words for sentence 2

Unlike awk, where the default OFS is a space character, perl's output field separator ($,) is undefined by default, so list elements are printed with nothing between them. This would print the array without any spaces between the words:

$ perl -00 -aE 'say @F' input.txt 
Thesearethewordsforsentence1
thesearethewordsforsentence2

That is not exactly what you wanted.

cas
0

Using Raku (formerly known as Perl 6)

raku -e 'for slurp.split(/\n**2..*/) {S:g/\n/ /.put};' 

Sample Input:

These
are
the
words
for
sentence 1

these are the words for sentence 2

Sample Output:

These are the words for sentence 1
these are the words for sentence 2

Here is a solution coded in Raku, a member of the Perl family of programming languages. Briefly, the file is slurped (read into memory all at once) and split wherever two or more consecutive \n newlines occur. (The quantified newline \n ** 2..* literally means 'two newlines up to any number of newlines'.) Each paragraph chunk is then processed with the S/// operator; the capital S tells Raku to return the resulting string rather than modify the original. In the whole expression, S:g globally substitutes a space character wherever a single \n newline is found.

Note that this prints the sentences on consecutive lines, i.e. any significant spacing between paragraphs (for example, triple-spacing) will be lost in the output. Since the OP states that each sentence is "...followed by a space" (i.e. a single blank line), hopefully this answer suffices. The OP will encounter problems, however, if the 'blank' lines separating the 'sentences' aren't truly blank (e.g. they contain horizontal whitespace characters).
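
One possible way around the "not truly blank" problem is to strip trailing whitespace before joining. A minimal sketch, using sed and awk rather than Raku and assuming POSIX versions of both; the function names `sample` and `normalize_and_join` and the sample input are made up for illustration:

```shell
#!/bin/sh
# Made-up input: the separator line holds a space and a tab, so it is not empty.
sample() { printf 'These\nare\n \t\nthese are\n'; }

# Hypothetical helper: strip trailing whitespace from every line, so that
# whitespace-only lines become truly empty, then join each paragraph with awk.
normalize_and_join() {
    sed 's/[[:space:]]*$//' | awk -v RS= '{$1=$1}1'
}

sample | awk -v RS= '{$1=$1}1'   # separator not recognized: one output line
sample | normalize_and_join      # two output lines
```

The same sed step could be placed in front of any of the paragraph-based commands in this thread, since they all expect the separator lines to be empty.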

https://raku.org

jubilatious1