Merge successive lines into one line

Question

In a text file, if the first 25 characters of a line are spaces, how can I append that line to the previous line, and keep doing so until a line appears that starts with an ASCII character in column one? Since it can't be displayed like that here, I've added a screenshot. In the original file, I first have to remove the trailing spaces from each line. That works, but I have no idea how to implement the rest. I would prefer the whole thing as a script (no Perl or similar).

Original file:

08/07/2023 09:02:07      ANR8592T Session 137576 connection is using protocol
                         TLSVI3, cipher specification TLS_AES_256_GCM_SHA384,
                         certificate TSM Self-Signed Certificate. (SESSION:
                         137576)
08/07/2023 09:02:07      ANR@B4OT Session 137576 started for administrator ADMIN
                         (WinNT) (SSL MU-SV-SPS1.de.bertrandt.net[192.168.171.56]-
                         :65234) on MU-SV-SPS1.de.bertrandt.net:1500. (SESSTON:
                         137576)
08/07/2023 09:02:07      ANR2017T Administrator ADMIN issued command: select
                         status from processes where process="NAS SnapMirror
                         Backup’ and status like 'WMU-SV-CL2%' (SESSION: 137576)
                         08/07/2023 09:02:07 ANR@46ST Session 137576 ended for administrator ADMIN
                         (WinNT). (SESSION: 137576)
08/07/2023 09:02:38      ANR8592T Session 137577 connection is using protocol
                         TLSVI3, cipher specification TLS_AES_256_GCM_SHA384,
                         certificate TSM Self-Signed Certificate. (SESSION:
                         137577)
08/07/2023 09:02:38      ANR@B4OT Session 137577 started for administrator ADMIN
                         (WinNT) (SSL MU-SV-SPS1.de.bertrandt.net[192.168.171.56]-
                         :65235) on MU-SV-SPS1.de.bertrandt.net:1560. (SESSTON:
                         137577)
08/07/2023 09:02:38      ANR2017T Administrator ADMIN issued command: select
                         node_name, filespace_name, BACKUP_START, BACKUP_END,
                         CAPACITY, PCT_UTIL from filespaces where node_name like
                         “MU-SV-CL2%" (SESSION: 137577)
08/07/2023 09:02:38      ANR@46ST Session 137577 ended for administrator ADMIN
                         (WinNT). (SESSION: 137577)
08/07/2023 09:02:38      ANR8592T Session 137578 connection is using protocol
                         TLSVI3, cipher specification TLS_AES_256_GCM_SHA384,
                         certificate TSM Self-Signed Certificate. (SESSION:
                         137578)
08/07/2023 09:02:38      ANR@B4OT Session 137578 started for administrator ADMIN
                         (WinNT) (SSL MU-SV-SPS1.de.bertrandt.net[192.168.171.56]-
                         :65236) on MU-SV-SPS1.de.bertrandt.net:1560. (SESSTON:
                         137578)

Requested result:

08/07/2023 09:02:07      ANR8592T Session 137576 connection is using protocol TLSVI3, cipher specification TLS_AES_256_GCM_SHA384, certificate TSM Self-Signed Certificate. (SESSION: 137576)
08/07/2023 09:02:07      ANR@B4OT Session 137576 started for administrator ADMIN (WinNT) (SSL MU-SV-SPS1.de.bertrandt.net[192.168.171.56]- :65234) on MU-SV-SPS1.de.bertrandt.net:1500. (SESSTON: 137576)
08/07/2023 09:02:07      ANR2017T Administrator ADMIN issued command: select status from processes where process="NAS SnapMirror Backup’ and status like 'WMU-SV-CL2%' (SESSION: 137576) 08/07/2023 09:02:07 ANR@46ST Session 137576 ended for administrator ADMIN (WinNT). (SESSION: 137576)
08/07/2023 09:02:38      ANR8592T Session 137577 connection is using protocol TLSVI3, cipher specification TLS_AES_256_GCM_SHA384, certificate TSM Self-Signed Certificate. (SESSION: 137577)
08/07/2023 09:02:38      ANR@B4OT Session 137577 started for administrator ADMIN (WinNT) (SSL MU-SV-SPS1.de.bertrandt.net[192.168.171.56]- :65235) on MU-SV-SPS1.de.bertrandt.net:1560. (SESSTON: 137577)
08/07/2023 09:02:38      ANR2017T Administrator ADMIN issued command: select node_name, filespace_name, BACKUP_START, BACKUP_END, CAPACITY, PCT_UTIL from filespaces where node_name like “MU-SV-CL2%" (SESSION: 137577)
08/07/2023 09:02:38      ANR@46ST Session 137577 ended for administrator ADMIN (WinNT). (SESSION: 137577)
08/07/2023 09:02:38      ANR8592T Session 137578 connection is using protocol TLSVI3, cipher specification TLS_AES_256_GCM_SHA384, certificate TSM Self-Signed Certificate. (SESSION: 137578)
08/07/2023 09:02:38      ANR@B4OT Session 137578 started for administrator ADMIN (WinNT) (SSL MU-SV-SPS1.de.bertrandt.net[192.168.171.56]- :65236) on MU-SV-SPS1.de.bertrandt.net:1560. (SESSTON: 137578)

Whitespace in questions and answers can be preserved by enclosing the example text in code blocks. Quote: The first four spaces will be stripped off, but all other whitespace will be preserved. — Vilinkameni, Aug 15 '23 at 09:59
@OP: Perl is a scripting language, so I assume you mean shell script? — Vilinkameni, Aug 15 '23 at 10:02
This question seems similar to an earlier question with 'indented' input, for which there are "one-liner" answers written in sed, awk, perl, and raku: https://unix.stackexchange.com/q/738723 — jubilatious1, Aug 17 '23 at 05:04

Kusalananda · Answer 1 · 2023-08-15T10:25:41.480

Solution using sed at the end of this answer.

Use the ed editor to first replace all blanks (tabs or spaces) at the start of lines with a single space, and then join each line that starts with a space with the line before:

printf '%s\n' 'g/^[[:blank:]]\{1,\}/ s// /' 'g/^ / -,.j' ,p Q | ed -s file

The resulting document is printed to the standard output stream, but you can change ,p to w to write it back to the original file.

The two main commands in this editing session:

g/^[[:blank:]]\{1,\}/ s// /
This removes all runs of one or more blanks from the start of any line and replaces them with a single space.
g/^ / -,.j
This joins each line that starts with a space with its previous line.

Combining these two g commands into a single g command that executes both the s and j command:

printf '%s\n' 'g/^[[:blank:]]\{1,\}/ s// /\' '-,.j' ,p Q | ed -s file

Testing on this example input:

XXX XXX XXX     YYY YYY YYY YYY YYY
                ZZZ ZZZ ZZZ ZZZ ZZZ
                YYY ZZZ YYY ZZZ YYY
                YYY ZZZ YYY ZZZ YYY
                YYY ZZZ YYY ZZZ YYY
XXX XXX XXX     YYY YYY YYY YYY YYY
                ZZZ ZZZ ZZZ ZZZ ZZZ
                YYY ZZZ YYY ZZZ YYY
XXX XXX XXX     YYY YYY YYY YYY YYY
                ZZZ ZZZ ZZZ ZZZ ZZZ
                YYY ZZZ YYY ZZZ YYY
                YYY ZZZ YYY ZZZ YYY
                YYY ZZZ YYY ZZZ YYY

The result:

XXX XXX XXX     YYY YYY YYY YYY YYY ZZZ ZZZ ZZZ ZZZ ZZZ YYY ZZZ YYY ZZZ YYY YYY ZZZ YYY ZZZ YYY YYY ZZZ YYY ZZZ YYY
XXX XXX XXX     YYY YYY YYY YYY YYY ZZZ ZZZ ZZZ ZZZ ZZZ YYY ZZZ YYY ZZZ YYY
XXX XXX XXX     YYY YYY YYY YYY YYY ZZZ ZZZ ZZZ ZZZ ZZZ YYY ZZZ YYY ZZZ YYY YYY ZZZ YYY ZZZ YYY YYY ZZZ YYY ZZZ YYY

It does not matter if the indentation of the indented lines is made with spaces or tabs. Also, the spacing (whether done with spaces or tabs) between the first bit of each "section" (the XXX... bit in my example) and the rest is not altered.

With sed:

sed -e '/^[[:blank:]]\{1,\}/ { s///; H; $!d; }' -e 'x; y/\n/ /' file

This detects any line with one or more initial blanks, removes these and appends the line to the hold space (an auxiliary buffer in sed that is not wiped between cycles). If the line is not the last, it is discarded, and the script skips to the next input line.

For any other line (and the last line in the document, if it starts with a blank), the buffer is swapped with the hold space and all newlines (inserted as a delimiter by the H command) are replaced by spaces before the result is outputted.

This produces the same output as the ed pipeline above but will fail to process the last line of input if it does not start with a blank (this is not the case in the example text, as far as I can make out from the image).
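
If the final line of the input may start in column one, a different sed idiom can be used instead. The following is a minimal sketch, not the script above: it keeps two lines in the pattern space with N, folds an indented line into its predecessor, and uses P and D to print and drop a finished record. The sample input and the temporary file are made up for illustration.

```shell
# Hypothetical sample input; the last line is NOT indented.
tmp=$(mktemp)
printf '%s\n' 'A 1' '    2' '    3' 'B 4' 'C 5' '    6' 'D 7' > "$tmp"

# :a      label for the loop
# $!N     append the next input line (if there is one) to the pattern space
# s/../   turn "newline + indentation" into a single space ...
# ta      ... and loop again if that substitution happened
# P; D    otherwise print the finished first line and restart with the rest
result=$(sed -e ':a' -e '$!N' -e 's/\n[[:blank:]]\{1,\}/ /' -e 'ta' -e 'P' -e 'D' "$tmp")
rm -f "$tmp"

printf '%s\n' "$result"
```

With this input it should print A 1 2 3, B 4, C 5 6 and D 7 on separate lines. Only one record is held in memory at a time.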

Arnaud Valmary · Answer 2 · 2023-08-15T08:58:37.297

A solution with awk

Script format_text.awk:

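#! /usr/bin/awk -f
# A line that starts with a date and time stamp begins a new record.
/^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
    # Print the previous record, if any, before starting the new one
    if (line) {
        print line
    }
    line = $0
    next
}
# Any other line is a continuation: strip its indentation and append it.
{
    gsub(/^[ \t]+/, "")
    line = line " " $0
}
# Print the last record.
END {
    if (line) {
        print line
    }
}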

With:

chmod +x format_text.awk

Run like this:

./format_text.awk log.txt

edited Aug 15 '23 at 08:58

answered Aug 14 '23 at 20:34

This is actually a solution in awk, not even requiring POSIX sh, much less the features of GNU Bash. The shebang could simply be changed to #!/usr/bin/awk -f, with the text between single quotes after the invocation of awk pasted after the shebang, and (not really necessary, but providing a clue to the user about the script type) renamed to format_text.awk. What's more critical though, is that in either form it duplicates lines not beginning with spaces, and doesn't join with them those that do. – Vilinkameni Aug 15 '23 at 08:50
@Vilinkameni, I follow you. It's a pure awk solution. I don't see any duplicate lines – Arnaud Valmary Aug 15 '23 at 09:00
This is a log of a session using the current version of the solution: http://ix.io/4DwV I'm using GNU awk 5.1.1 in Artix Linux. – Vilinkameni Aug 15 '23 at 09:10
This is a possible working solution: http://ix.io/4DwX (The awk script is replaced by BEGIN{FS="\n"; RS=""} { gsub(/\n\s+/, " "); print }.) – Vilinkameni Aug 15 '23 at 09:12
With your example, I have only two lines. I'm using GNU Awk 5.1.1, API: 3.1 (GNU MPFR 4.1.1-p1, GNU MP 6.2.1) on Fedora 38 – Arnaud Valmary Aug 15 '23 at 09:22
I just tried the script format_text.sh on an Alpine 3.17 machine with GNU awk 5.1.1, with the exact same results as on Artix Linux (duplicate lines, unmerged lines). – Vilinkameni Aug 15 '23 at 09:29
@Vilinkameni don't use a shebang to call awk (see https://stackoverflow.com/a/61002754/1745001), don't add a suffix to a Unix command or you can't change the implementation (e.g. replace awk with perl or compiled C) without changing the calls to it, the script you suggested requires the input to contain no empty lines and would read all input into memory at once and would require GNU awk for the non-POSIX extension \s. – Ed Morton Aug 16 '23 at 10:11
@EdMorton I'm not sure I understand the first sentence in your comment. The only parameter here is the pathname of the file, which gets passed to the script with the awk shebang, while I avoid the overhead of unnecessarily invoking the bloated Bash. Second, as far as I could see from the picture posted in the OP, the input contains no empty lines, and there is no requirement to use a specific version of awk (\s could be rewritten easily though). The rationale behind your linked answer goes against the Unix philosophy. Scripts should do one thing and work with other programs/scripts. – Vilinkameni Aug 16 '23 at 12:57
The only parameter for now is the pathname to the file. If the OP decides to add something in future (e.g. an option) they'll need to do much more than necessary if they used a shebang and may be tempted to implement it all in awk when doing some in shell may be better. The overhead is minuscule and the script is not bloated by adding a call to awk. The OP posted some sample data, that doesn't mean they don't have other data. – Ed Morton Aug 16 '23 at 13:01
The rationale behind my linked answer is valid and consistent with the Unix philosophy and good engineering judgement - a script doing 1 thing doesn't mean it must only call 1 tool, and needing to require more changes if you use a shebang than if you don't demonstrates unnecessary coupling in the former case. – Ed Morton Aug 16 '23 at 13:08
As you stated, any necessary parameters could be passed to awk with -v. There is no need to pass them to shell at all if shell is not used. If more sophisticated functionality would be needed, the output from the program could be manipulated with other programs or scripts anyway. In any case, the answer we are commenting on doesn't work for the supplied data. – Vilinkameni Aug 16 '23 at 13:09
That's one of my points - if we use a shebang then the CALLER of the script would need to know that it's implemented in awk to pass parameters to it using -v, that's bad software. What if I want to add a "help" option to my tool - should the user need to call it with -v help instead of --help like in every other Unix tool because it's using an awk shebang? If not, I have to change my shebang and my callers may need to change how they're already calling it as a consequence (e.g. if they use -v for other things). The answer we're commenting on works just fine as far as I can see. – Ed Morton Aug 16 '23 at 13:15
Interesting, because on two of the machines I tested it on (both using GNU awk and mksh, one Artix and the other Alpine) both the original script and the modified one give the incorrect output I linked. – Vilinkameni Aug 16 '23 at 13:25
Unfortunately I can't see the site you linked as it's blocked for me but just reading the code (I've been using awk for 30+ years and Unix for 40+ so I've got a decent grasp of the nooks and crannies) I can't imagine how it could not work. Maybe when you copy/pasted the input your editor added CRs at the end of the lines so it just looks like the script isn't working? – Ed Morton Aug 16 '23 at 13:33
@Vilinkameni does the script I posted work for you? – Ed Morton Aug 16 '23 at 13:34
Yes, your script does work. And no, there are no CRs in the file. I use vim under Alpine and Artix. – Vilinkameni Aug 16 '23 at 13:37
Then I'm out of ideas, sorry. – Ed Morton Aug 16 '23 at 13:37
About accessing the links I posted, they are served as plain text on ix.io and can be retrieved by using curl: curl http://ix.io/4DwV and curl http://ix.io/4DwX. – Vilinkameni Aug 16 '23 at 13:47
I found out the culprit. It was the first regex. When that line was pasted into vim, the end of that line, ...[0-9]/ { was broken into two lines at space: ...[0-9]/ and { on the next line. This is interpreted by awk as a regex without an action, followed by an action without a regex. I can now confirm that this answer works as well. – Vilinkameni Aug 16 '23 at 14:33

Ed Morton · Answer 3 · 2023-08-16T13:29:46.313

Using any POSIX awk:

$ awk -F'^[[:space:]]+' '
    NF==1 { if (NR>1) print rec; rec=$0; next }
    { rec = rec OFS $2 }
    END { print rec }
' file
08/07/2023 09:02:07      ANR8592T Session 137576 connection is using protocol TLSVI3, cipher specification TLS_AES_256_GCM_SHA384, certificate TSM Self-Signed Certificate. (SESSION: 137576)
08/07/2023 09:02:07      ANR@B4OT Session 137576 started for administrator ADMIN (WinNT) (SSL MU-SV-SPS1.de.bertrandt.net[192.168.171.56]- :65234) on MU-SV-SPS1.de.bertrandt.net:1500. (SESSTON: 137576)
08/07/2023 09:02:07      ANR2017T Administrator ADMIN issued command: select status from processes where process="NAS SnapMirror Backup’ and status like 'WMU-SV-CL2%' (SESSION: 137576) 08/07/2023 09:02:07 ANR@46ST Session 137576 ended for administrator ADMIN (WinNT). (SESSION: 137576)
08/07/2023 09:02:38      ANR8592T Session 137577 connection is using protocol TLSVI3, cipher specification TLS_AES_256_GCM_SHA384, certificate TSM Self-Signed Certificate. (SESSION: 137577)
08/07/2023 09:02:38      ANR@B4OT Session 137577 started for administrator ADMIN (WinNT) (SSL MU-SV-SPS1.de.bertrandt.net[192.168.171.56]- :65235) on MU-SV-SPS1.de.bertrandt.net:1560. (SESSTON: 137577)
08/07/2023 09:02:38      ANR2017T Administrator ADMIN issued command: select node_name, filespace_name, BACKUP_START, BACKUP_END, CAPACITY, PCT_UTIL from filespaces where node_name like “MU-SV-CL2%" (SESSION: 137577)
08/07/2023 09:02:38      ANR@46ST Session 137577 ended for administrator ADMIN (WinNT). (SESSION: 137577)
08/07/2023 09:02:38      ANR8592T Session 137578 connection is using protocol TLSVI3, cipher specification TLS_AES_256_GCM_SHA384, certificate TSM Self-Signed Certificate. (SESSION: 137578)
08/07/2023 09:02:38      ANR@B4OT Session 137578 started for administrator ADMIN (WinNT) (SSL MU-SV-SPS1.de.bertrandt.net[192.168.171.56]- :65236) on MU-SV-SPS1.de.bertrandt.net:1560. (SESSTON: 137578)
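
The same record-building logic can also be written with explicit patterns instead of the field-separator trick. This is only a sketch, not the script above; join_indented is a hypothetical wrapper name and the sample input is invented.

```shell
# join_indented: hypothetical helper; reads the named files or standard input
join_indented() {
    awk '
        /^[ \t]/ {                      # continuation line (starts with space or tab)
            sub(/^[ \t]+/, "")          # drop the indentation
            rec = rec " " $0            # append to the record being built
            next
        }
        NR > 1 { print rec }            # a new record starts: flush the previous one
        { rec = $0 }                    # remember the start of the new record
        END { if (NR > 0) print rec }   # flush the last record
    ' "$@"
}

printf '%s\n' 'A 1' '    2' 'B 3' '    4' '    5' | join_indented
```

Unlike the version above, this one joins with a literal space rather than OFS and prints nothing for empty input.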

Johan B · Answer 4 · 2023-08-16T18:46:04.443

Here is a one-line solution using sed

sed -E -z 's/\n([ ]{25})//g' ./input.txt > ./output.txt

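-E enables extended regular expressions, so ( ) and {25} can be written without backslashes
-z treats the input as NUL-separated records, so the whole file is read as a single record and \n can be matched (a GNU sed extension)
s/\n([ ]{25})//g
- s/ to substitute
- \n([ ]{25})/ to replace a newline followed by 25 spaces with nothing
- /g to apply the substitution to every match in the input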
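
Note that replacing the match with nothing glues the last word of one line to the first word of the next unless the lines still carry trailing spaces. A hedged variant, assuming GNU sed (for -z) and assuming that any run of leading blanks marks a continuation line, substitutes a single space instead. The sample input is invented.

```shell
# Assumes GNU sed: -z makes the whole input one record, so \n can be matched.
result=$(printf '%s\n' 'A 1' '    2' '    3' 'B 4' | sed -E -z 's/\n[[:blank:]]+/ /g')
printf '%s\n' "$result"
```

As with the original command, the whole input is held in memory at once.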

edited Aug 16 '23 at 18:46

answered Aug 14 '23 at 21:56

Why do you use dos2unix? dos2unix does not remove \r characters; it replaces \r\n with \n. Single \r characters are not removed. Test with: echo $'a\r\nb\rc' | dos2unix | cat -e – Arnaud Valmary Aug 15 '23 at 09:06
The question doesn't require converting text from DOS/Windows, so there is no need for dos2unix (which often isn't even installed by default on many distributions). Also, \n is not matched by sed unless you add a -z parameter, which also makes N; not needed. Other than that, you can just use sed ... <input.txt >output.txt instead of cat input.txt | sed ... >output.txt. (This is often referred to as "useless use of cat".) – Vilinkameni Aug 15 '23 at 09:43
Thanks for the feedback, I edited the answer with your comments – Johan B Aug 15 '23 at 12:06
Regarding "Here is a one-line solution in bash:": that's not bash, it's sed, a completely different tool that you could call from any shell. Note that the script requires GNU sed for -z (it won't work in any other sed), and it requires the whole input file to be read into memory at once. Regarding [ ]: blanks are already literal, so there's no need to put one in a bracket expression. – Ed Morton Aug 16 '23 at 10:13

Prabhjot Singh · Answer 5 · 2023-08-15T15:38:59.143

Using fmt:

$ fmt -tw 2500 file

From fmt manual:

-t, --tagged-paragraph

indentation of first line different from second

This solution works until the combined length of the merged lines exceeds 2500 characters.

edited Aug 15 '23 at 15:38

answered Aug 15 '23 at 09:17

What does the -t option do to your fmt command? On my system, -t takes a numerical argument (the width of a tab). – Kusalananda Aug 15 '23 at 10:28
On Alpine (with GNU coreutils installed), fmt(1) states: -t, --tagged-paragraph indentation of first line different from second – Vilinkameni Aug 15 '23 at 10:38

Arnaud Valmary · Accepted Answer · 2023-08-14T20:56:53.267

A pure bash solution

Script format_text_purebash.sh:

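#! /usr/bin/env bash
declare -r input_filename="$1"
declare new_line=
# read (with the default IFS) strips leading and trailing whitespace,
# which removes the indentation of continuation lines.
# The "|| [[ -n ... ]]" part also handles a last line without a newline.
while read -r line || [[ -n "$line" ]]; do
    # A line starting with a date and time stamp begins a new record
    if [[ "$line" =~ ^[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]\ [0-9][0-9]:[0-9][0-9]:[0-9][0-9] ]]; then
        if [[ -n "$new_line" ]]; then
            printf '%s\n' "$new_line"
        fi
        new_line="$line"
    else
        # Continuation line: append it to the current record
        new_line="$new_line $line"
    fi
done < "$input_filename"
# Print the last record
if [[ -n "$new_line" ]]; then
    printf '%s\n' "$new_line"
fi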

With:

chmod +x format_text_purebash.sh

Run like this:

./format_text_purebash.sh log.txt

edited Aug 14 '23 at 20:56

answered Aug 14 '23 at 20:47

You should add the usual warning for newcomers to that - why-is-using-a-shell-loop-to-process-text-considered-bad-practice. – Ed Morton Aug 16 '23 at 10:01
