2

I have a pipe separated file with the format something like below

1|ABC|11|DEF|111
2|ABC|22|PQR
ST
UW|222
3|ABC|33|XYZ|333
4|ABC|44|LMN|444

Now for the line starting with 2 when I try to insert this record into the table the record is inserted only till PQR and start inserting the record from the line starting 4

I would appreciate if there is any command to bring all of the line 2 record into single line which would help me to insert the line 2 into the table successfully?

Chris Davies
  • 116,213
  • 16
  • 160
  • 287
3DM
  • 21
  • 1
    No spaces, right. Line 2 should read 2|ABC|22|PQRSTUW|222 I suggest you clarify what starts a record. like Newline followed by decimal number followed by "|" or something. Right now it could be X|anything|anything|angthing|XXX where X is some particular digit. – infixed Jun 01 '16 at 22:14
  • So you are saying that PQR \n ST \n UW is actually all a single field. Ugh. Can you amend your file generator to encode the embedded newlines? That would make things far, far, simpler than trying to work around this broken format that you've currently been given. – Chris Davies Jun 01 '16 at 22:25
  • I need to get it clarified as not sure whether its a for all the new lines even if the record goes to next line OR there is a separate delimiter for new line and separate for the full record on the same line. – 3DM Jun 01 '16 at 23:22
  • I had a confirmation that there is a Carriage Return in the record starting with '2'. The start of record is a NEWLINE. – 3DM Jun 02 '16 at 12:35

2 Answers2

1

To replace newlines embedded in records with spaces using GNU awk.

num_fields=4
awk -v RS='([^|]*\\|){'"$num_fields"'}[^|]*\n' '
  {
   n = split(RT, a,"|"); 
   for (i=1; i<=n; ++i)
   {
      gsub("\n", " ", a[i]); 
      printf "%s%s", a[i], i==n?"\n":"|"
   }
  }' file

This gives

1|ABC|11|DEF|111 
2|ABC|22|PQR ST UW|222 
3|ABC|33|XYZ|333 
4|ABC|44|LMN|444 

The trick here is to use GNU awk's support for arbitrary record separators to define one as four pipe-terminated fields followed by a newline-terminated field, with no field allowed to contain embedded pipes (via RS='([^|]*\\|){4}[^|]*\n') .

The actual record separator answering to this specification is then accessible via RT. It's a simple matter of splitting RT by pipe into an array a, stripping embedded newlines from each element of a and finally reconstructing the record by re-concatenating elements of a

iruvar
  • 16,725
  • Hello Iruvar, This {4} will work fine for the file with four pipe separator but what if I have multiple files and each have this record splitting on multiple lines with separator and more columns than {4}. The start of new record is NewLine separator. – 3DM Jun 02 '16 at 13:34
  • @3DM, you can store the number of fields in a variable and then use that variable in place of the hard-coded 4. See my updated example – iruvar Jun 02 '16 at 14:44
0

One way of picking up the pieces from this insane file format is with Perl.

#!/usr/bin/perl
#
use warnings;
use strict;

undef $/;
my $file = <>;

while ($file =~ /^(.*?\|.*?\|.*?\|.*?\|.*?)$/mscg) {
    my $fields = $1;
    $fields =~ s/\n(.)/\\n$1/sg;
    print "$fields\n";
}

The code slurps the entire file into memory, and then reblocks it on five |-delimited fields per line. Embedded newlines are replaced with \n.

If the script were called repipe.pl you could use it to process a file called datafile as perl repipe.pl datafile.

If you're feeling challenged you can embed it as a one-liner like this, but it won't do much for maintainability:

perl -e 'undef $/; $file = <>; while ($file =~ /^(.*?\|.*?\|.*?\|.*?\|.*?)$/mscg) { $fields = $1; $fields =~ s/\n(.)/\\n$1/sg; print "$fields\n"; }' datafile

Here's the resulting output from your sample datafile

1|ABC|11|DEF|111
2|ABC|22|PQR\nST\nUW|222
3|ABC|33|XYZ|333
4|ABC|44|LMN|444
Chris Davies
  • 116,213
  • 16
  • 160
  • 287