First, note that your data uses two different dash characters as separators: the ASCII hyphen (U+002D) and the Unicode en dash (U+2013).
GNU Awk (gawk) accepts a regular expression as the record separator. Here is a one-liner:
$ gawk -v RS='\n?[–-]\n' -v FS='\n' -v OFS=', ' '$1 = $1' data
Jun 6th, something2, 09:00, some text blah blah, something1
Jun 6th, something1, 09:00, some text xxx, something1
Where data is a file containing your example, verbatim.
The record separator RS is a regex which matches an optional newline, followed by either an ASCII hyphen or a Unicode en dash, followed by a newline. The field separator FS within these records is the newline. The output field separator OFS is a comma followed by a space.
The expression $1 = $1 serves two purposes. First, assigning a field back to itself causes the record $0 to be reconstituted, taking into account the custom output field separator OFS, so all that remains is to print it. Second, the expression doubles as the pattern that decides whether the record gets printed: a pattern with no action prints the record whenever the pattern is true. Because the data begins with a record separator, there is an initial blank record. For that record, the expression $1 = $1 assigns the empty string, and since that is the value of the expression, it is Boolean false; that record isn't printed.
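The idiom can be tried in isolation on a toy input. This is a minimal sketch using the default field separator and made-up input (a b c followed by a blank line), not your data; it should behave the same in any POSIX awk.

```shell
# Two input lines: one with three fields, then a blank line.
# '$1 = $1' rebuilds $0 with OFS and is also the pattern:
# a non-empty result is true (record printed), an empty one is false (skipped).
out=$(printf 'a b c\n\n' | awk -v OFS=', ' '$1 = $1')
printf '%s\n' "$out"
```

This prints a, b, c and nothing for the blank line. One caveat of using the assignment as the pattern: a record whose first field is the number 0 would be suppressed as well, which doesn't arise with the data in the question.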
If we don't include the optional leading \n in the RS pattern, then each record ends up with an extra blank field, because the newline after something1 gets interpreted as a field separator. We need the newline which follows the last field to count as part of the record separation. It has to be optional because the file starts with a record separator character not preceded by a newline. Without it we get this:
$ gawk -v RS='[–-]\n' -v FS='\n' -v OFS=', ' '$1 = $1' data
Jun 6th, something2, 09:00, some text blah blah, something1,
Jun 6th, something1, 09:00, some text xxx, something1,
Note the trailing commas, caused by the extra empty field at the end of each record.
something* and some text*, e.g. whether or not those strings can contain commas or double quotes or newlines or dashes. So [edit] your question to tell us more about those and provide sample input/output that includes all of the rainy day cases like those. – Ed Morton Jul 05 '23 at 13:02
<comma><blank> instead of just <comma> seems like a bad idea as it just makes parsing by subsequent tools more difficult. Do you really want to do that? – Ed Morton Jul 05 '23 at 13:04