Join lines into a single line every time a specific character is found, across multiple files

Question

Here is an example of the file whose blocks I am trying to convert into single lines:

-
Jun 6th
something2
09:00
some text blah blah
something1
–
Jun 6th
something1
09:00
some text xxx
something1
–

I am trying to get each block of lines onto one single line, as in a CSV. For example:

Jun 6th, something2, 09:00, some text blah blah, something1
Jun 6th, something1, 09:00, some text xxx, something1

Check this out https://unix.stackexchange.com/questions/655262/squash-file-with-key-value-records-to-csv — annahri, Jul 04 '23 at 16:24
What exactly is the character sequence delimiting the blocks? When I paste your text, the first looks like an ordinary ASCII hyphen, but the other two look like Unicode en-dashes. Are there always 5 lines per block? — steeldriver, Jul 04 '23 at 16:34
The right solution will depend on the values of something* and some text*, e.g. whether or not those strings can contain commas, double quotes, newlines, or dashes. So [edit] your question to tell us more about those and provide sample input/output that includes all of the rainy-day cases like those. — Ed Morton, Jul 05 '23 at 13:02
Separating your output by <comma><blank> instead of just <comma> seems like a bad idea as it just makes parsing by subsequent tools more difficult. Do you really want to do that? — Ed Morton, Jul 05 '23 at 13:04
Is your input really separated by 2 different types of dashes or is one of them a typo? — Ed Morton, Jul 05 '23 at 13:05

Grobu · Answer 1 · 2023-07-04T17:25:12.047

You could try this sed one-liner (GNU sed, since z is a GNU extension):

sed -ne '/^–/{g; /./!b; s/\n//; s/\n/, /g; p; z; h; b}; H' INPUTFILE

Explanation:

/^–/{                 -->  if line starts with char "–", then:
    g                 -->      copy hold space to pattern space
    /./!b             -->      nothing collected yet? start new cycle
    s/\n//            -->      get rid of first newline
    s/\n/, /g         -->      replace all other newlines by ", "
    p                 -->      print pattern space
    z                 -->      erase pattern space
    h                 -->      copy (now empty) pattern space to hold space, erasing it
    b                 -->      start new cycle
    }
H                     -->  otherwise, append newline + pattern space to hold space

Input:

–
Jun 6th
something2
09:00
some text blah blah
some other thing2
–
Jun 7th
something1
10:30
some text xxx
some other thing1
–
Jun 9th
something3
12:15
some text yyy
some other thing3
–
Jun 8th
something4
07:05
some text zzz
some other thing4
–

Output:

Jun 6th, something2, 09:00, some text blah blah, some other thing2
Jun 7th, something1, 10:30, some text xxx, some other thing1
Jun 9th, something3, 12:15, some text yyy, some other thing3
Jun 8th, something4, 07:05, some text zzz, some other thing4
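
The one-liner above only treats the en dash (–) as a separator, while the sample in the question opens with an ASCII hyphen. A minimal sketch of a variant that accepts either character on a line of its own could look like this; the function name join_blocks is made up for illustration, and GNU sed is still assumed because of z:

```shell
# Hypothetical helper; requires GNU sed (z is a GNU extension).
# A line holding only an en dash or an ASCII hyphen ends the current block.
join_blocks() {
    sed -nE '/^(–|-)$/{g; /./!b; s/\n//; s/\n/, /g; p; z; h; b}; H' "$@"
}

# Usage: join_blocks INPUTFILE [MORE_FILES...]
```

Several files can be passed at once. sed reads them as one continuous stream, so a block is only printed once a separator line follows it.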

Hope that helps.

Kaz · Answer 2 · 2023-07-04T18:30:12.117

First of all, note that your data uses two different kinds of dash characters as separators: the ASCII hyphen and the Unicode en dash (U+2013).

The GNU implementation of Awk (GNU Awk) can handle regular expressions for record separation. Here is a one-liner:

$ gawk -v RS='\n?[–-]\n' -v FS='\n' -v OFS=', ' '$1 = $1' data
Jun 6th, something2, 09:00, some text blah blah, something1
Jun 6th, something1, 09:00, some text xxx, something1

Where data is a file containing your example, verbatim.

We set up a record separator regex which matches an optional newline, followed by either an ASCII hyphen or a Unicode en dash, followed by a newline. The field separator within these records is then the newline, and the output separator is a comma followed by a space.

The expression $1=$1 serves two purposes. Assigning a field back to itself causes the record $0 to be reconstituted, taking into account the custom OFS field separator. So then we just have to print it. Because the data begins with a record separator, there is an initial blank record. For that record, the expression $1 = $1 assigns the blank value, and since that is the result, the expression is a Boolean false; that record isn't printed.

If we don't include the optional leading \n in the RS pattern, then each record ends up with an extra blank field, because the newline after something1 gets interpreted as a field separator. We need the newline which follows the last field to count as part of the record separation. It has to be optional because the file starts with a record separator character not preceded by a newline. Without it we get this:

$ gawk -v RS='[–-]\n' -v FS='\n' -v OFS=', ' '$1 = $1' data
Jun 6th, something2, 09:00, some text blah blah, something1,
Jun 6th, something1, 09:00, some text xxx, something1,

Extra commas, due to an extra empty field.
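
POSIX leaves the behaviour of a multi-character RS unspecified, so the regex separator above depends on GNU Awk (or another awk with the same extension). As a rough sketch for other awks, the same joining can be done line by line; join_records is a hypothetical name, and the separators are assumed to sit alone on their line:

```shell
# Hypothetical helper; should work with any POSIX awk.
join_records() {
    awk '
        # Separator line (ASCII hyphen or en dash): flush the collected record.
        $0 == "-" || $0 == "–" {
            if (n) print rec
            n = 0; rec = ""
            next
        }
        # Any other line is a field: append it, comma-space separated.
        { rec = (n++ ? rec ", " : "") $0 }
        # Flush a final record that has no closing separator.
        END { if (n) print rec }
    ' "$@"
}

# Usage: join_records data [more_files...]
```

Counting fields in n rather than testing rec for emptiness keeps a record whose first field is empty from being dropped.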

score 0 · Answer 3 · answered Jul 04 '23 at 18:57

You can do that with tr and sed commands:

$ tr '\n' ',' <input_file | sed 's/-,/\n/g' | sed '/^$/d; s/,$//'
Jun 6th,something2,09:00,some text blah blah,something1
Jun 6th,something1,09:00,some text xxx,something1

(The second sed drops the empty first line and strips the trailing commas)

You need to make sure the dash-line separators in your input file are all the same character. They were not when I copied them to test this code.
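
One way to make the separators uniform before running the pipeline is to rewrite every line that consists solely of an en dash into an ASCII hyphen. This is only a sketch: normalize_dashes is a made-up name, and it assumes the separators always occupy a whole line.

```shell
# Hypothetical helper: turn en-dash-only lines into ASCII hyphen lines.
# En dashes inside ordinary text lines are left untouched.
normalize_dashes() {
    sed 's/^–$/-/' "$@"
}

# Usage: normalize_dashes input_file > normalized_file
```

The normalized file can then be fed to the tr and sed commands in place of input_file.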

Ed Morton · Answer 4 · 2023-07-05T13:26:04.353

Assuming you want valid CSV output, the input could contain quotes or commas, the something*, some text*, etc. parts of the records can't contain newlines, and - really is the only record separator, here's an input file that would test a potential solution:

$ cat file
-
Jun 6th
something2
09:00
"some "text" blah blah"
"something1"
-
Jun 6th
something1
09:00
some, text, xxx
something1
-

and here is a solution using any POSIX awk, and its output, which is valid CSV:

$ cat tst.awk
$1 == "-" {
    if ( NR > 1 ) {
        print ""
    }
    sep = ""
    next
}
/[",]/ {
    gsub(/^"|"$/,"")
    gsub(/"/,"\"\"")
    $0 = "\"" $0 "\""
}
{
    printf "%s%s", sep, $0
    sep = ","
}

$ awk -f tst.awk file
Jun 6th,something2,09:00,"some ""text"" blah blah","something1"
Jun 6th,something1,09:00,"some, text, xxx",something1

If that's not the output you'd want given that input then edit the example in your question to show how to handle the cases with ,s and "s in the input.
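
Since the question title mentions multiple files: with the script above, NR keeps counting across files, so a second file that also starts with - would produce an empty line between the files. A sketch of a variant that tracks whether a record is in progress instead (csv_blocks and inrec are names invented here, not part of the original script):

```shell
# Hypothetical helper; same quoting rules as tst.awk, any POSIX awk.
csv_blocks() {
    awk '
        $1 == "-" {
            if (inrec) print ""    # terminate the record in progress, if any
            inrec = 0; sep = ""
            next
        }
        /[",]/ {
            gsub(/^"|"$/, "")      # drop one pair of enclosing quotes
            gsub(/"/, "\"\"")      # double any remaining quotes
            $0 = "\"" $0 "\""      # wrap the field in quotes
        }
        {
            printf "%s%s", sep, $0
            sep = ","; inrec = 1
        }
        END { if (inrec) print "" }  # record with no closing separator
    ' "$@"
}

# Usage: csv_blocks file1 file2 ...
```

Consecutive separator lines, whether within one file or at a file boundary, then no longer emit empty output lines.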
