2

I have a Perl script that parses data sent to me from a bunch of school districts. I'm adding a new school and have run into a problem I've never faced before. When I do $line = <INPUT>, it slurps up the whole file instead of one line.

If I run file on the file, it returns UTF-8 Unicode text, with CRLF, CR line terminators. All my other files return ASCII text, with CRLF line terminators. I've run it through dos2unix but it still operates as one long string. When I edit it in emacs, it still shows ^M for the line endings.

What can I do to convert these line endings into usable line endings?

Update: The vendor sent me another file with different line endings which still don't work. They report as CRLF, LF. I've extracted a few sample lines.

Here are some snippets from my code:

$line = <INPUT> if ($schooldistricts{$schooldistrict}{'header'});
LINE: foreach $line (<INPUT>) {
    next LINE unless ($line =~ /\S/);
    <do stuff>
}

The file does have a header which gets stripped off correctly. Then in the foreach loop it reads the first line successfully and then that's it -- it's like the rest of the file is empty.

I tried setting $/ to \r\n\n but then the script does nothing. Same if I try \r\n. Is there a way to definitively see what characters are encoded for the line ending?

Second update: As an experiment, I brought the file into Excel, split it out, and saved it as a tab-delimited file. On the server, I ran dos2unix. The Perl script still won't parse after the second line. File now returns UTF-8 Unicode text, with CRLF line terminators. That's the right line ending so that leaves Unicode as being the issue. Is there something different about how Unicode would encode the line endings?

jubilatious1
Chanel
  • That "with CRLF, CR line terminators" means the file has both CRLF-sequences and lone CRs. The LF in CRLF would usually be recognized as a line ending regardless of the CR, so a file like that likely shouldn't appear as just one line anyway, unless the only LF is at the end of the line. The file you posted behind the link has only LFs, no CRs at all. All of which start me thinking that your issue might be somewhere else. – ilkkachu May 12 '23 at 19:25
  • Now, if it is somewhere else, we can't know where, since we don't see the full code and perhaps not the actual input data either. You need to post another question and include a full (but preferably minimal) program that shows the issue, along with a corresponding input. You can look at the file with e.g. od -c, it should show CRs as \r and LFs as \n. And you can use the same escapes with e.g. printf; printf 'one\rtwo\nthree\r\n' would print stuff with three different CR/LF-combinations. (Also I'm not sure if you tried the solutions you got in answers.) – ilkkachu May 12 '23 at 19:34
  • Are you looking for a Perl solution, only? Or would shell, awk, sed, ruby, raku, python, etc. also fit the bill? – jubilatious1 May 19 '23 at 21:52
  • @jubilatious1 The vendor sent me a file in a different format and now it's working. – Chanel May 22 '23 at 14:38
  • Yes, but you don't know why. You should try piping a few lines of your file through hexdump -C and tell us what you see for line endings. Additionally, if you think your problem is Unicode-related have a look at: https://stackoverflow.com/q/13836352/7270649 – jubilatious1 May 23 '23 at 04:24
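The od -c suggestion from the comments can be tried on a small made-up sample before pointing it at the real file. This is only a sketch: the sample text and the variable names are hypothetical, and counting bytes with tr -cd is one way among several to summarize which terminators are present.

```shell
# Hypothetical sample containing CRLF, a lone CR, and a lone LF.
sample=$(mktemp)
printf 'one\r\ntwo\rthree\n' > "$sample"

# od -c prints CR as \r and LF as \n, so each terminator is visible.
od -c "$sample"

# tr -cd deletes every byte except the one named; wc -c counts what is left.
# The $(( ... )) wrapper strips any padding some wc implementations add.
cr_count=$(( $(tr -cd '\r' < "$sample" | wc -c) ))
lf_count=$(( $(tr -cd '\n' < "$sample" | wc -c) ))
echo "CR bytes: $cr_count, LF bytes: $lf_count"   # CR bytes: 2, LF bytes: 2

rm -f "$sample"
```

If the CR count is higher than the LF count, the file contains lone CRs, which matches what file reports as "CRLF, CR line terminators".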

3 Answers

4
perl -pi -e 's/\r\n?/\n/g' your-file

This turns each CR, optionally followed by an LF, into an LF, similar to what mac2unix or dos2unix -c mac would do.

Or:

perl -pi -e 's/\r\n?/\r\n/g' your-file

This turns them into CRLF instead, if that's what your script expects (for instance, because it sets $/, the input record separator, to "\r\n").

  • 1
    It needs a small edit otherwise it only changes the first occurrence: perl -pi -e 's/\r\n?/\n/g' your-file. However, now the script is only able to read the first two lines of the file even though the line endings look good throughout in emacs. – Chanel May 11 '23 at 21:25
  • 1
    Sorry about the missing g. Fixed now. Hard to tell what the problem is without seeing the script or the data. <HANDLE> reads a $/-delimited record, so you may want to check what your script sets $/ to (the default is LF). See perldoc -v '$/' for details. – Stéphane Chazelas May 12 '23 at 05:34
0

This pipeline will convert CR characters or CR/LF sequences to LF:

tr '\r\n' '\n\r' | sed 's/^\r//g' | tr '\r' '\n'
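The pipeline is easier to follow with the intermediate stage made visible. The sketch below feeds it a made-up sample; it assumes a sed that understands \r in a pattern (GNU sed does), and the variable name is hypothetical.

```shell
# Stage 1 only: tr swaps CR and LF, so CRLF becomes LF CR, a lone CR becomes
# LF, and a lone LF becomes CR.
printf 'one\r\ntwo\rthree\n' | tr '\r\n' '\n\r' | od -c

# Full pipeline: sed drops the CR that now sits at the start of a line (the
# second half of an original CRLF), then the last tr turns the remaining CRs
# (the original lone LFs) into LFs.
converted=$(printf 'one\r\ntwo\rthree\n' |
  tr '\r\n' '\n\r' | sed 's/^\r//g' | tr '\r' '\n')

# Show the result; only \n terminators should remain.
printf '%s\n' "$converted" | od -c
```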
Chris Davies
0

Using Raku (formerly known as Perl_6)

If the OP believes the problem to be Unicode-based, passing through a Raku script might help, since Raku handles UTF-8 by default:

~$ cat dos2unix.raku
my $fh1 = open $*IN, :r;

# Below, use :w (write-only) or :x (write-only and :exclusive, i.e. 'no-clobber')
my $fh2 = open $*OUT, :x, nl-out => "\n";

for $fh1.lines() { $fh2.put($_) };

$fh1.close; $fh2.close;

Save the above code to a script (e.g. "dos2unix.raku"), add a shebang line and make it executable, or simply call it at the command line:

~$ raku dos2unix.raku < ends_with_CRLF.txt > ends_with_LF.txt 

Example Input with DOS line endings (0d 0a per line):

~$ jot -w '%d' 5 | raku unix2dos.raku | hexdump -C
00000000  31 0d 0a 32 0d 0a 33 0d  0a 34 0d 0a 35 0d 0a     |1..2..3..4..5..|

Example Output converted to Unix line endings (0a per line):

~$ jot -w '%d' 5 | raku unix2dos.raku | raku dos2unix.raku | hexdump -C
00000000  31 0a 32 0a 33 0a 34 0a  35 0a                    |1.2.3.4.5.|
0000000a

Above replicates authentic Unix line endings (0a per line):

~$ jot -w '%d' 5 | hexdump -C
00000000  31 0a 32 0a 33 0a 34 0a  35 0a                    |1.2.3.4.5.|
0000000a

If the script above doesn't work then a Regex solution might be helpful on the slurped file (\v stands for vertical-whitespace). Raku claims to honor the Unicode definition of Line Boundaries within the Raku Regex dialect: https://unicode.org/reports/tr18/#Line_Boundaries .

~$ raku -e 'slurp.subst(:global, / \v /, "\n").chop.put;'  file

#OR

~$ raku -e 'slurp.subst(:global, / <+ :Zl + :Zp> /, "\n").chop.put;' file

See the first link below for the unix2dos.raku script (i.e. the converse answer).

References:
https://unix.stackexchange.com/a/743445/227738
https://docs.raku.org/language/newline.html
https://raku.org

Example Source:
https://unix.stackexchange.com/a/742732/227738

jubilatious1