
According to man perlrun:

-0[octal/hexadecimal]
     specifies the input record separator ($/) as an octal or
     hexadecimal number. If there are no digits, the null character is
     the separator. 

and

The special value 00 will cause Perl to slurp files in paragraph
mode.  Any value 0400 or above will cause Perl to slurp files
whole, but by convention the value 0777 is the one normally used
for this purpose.

However, given this input file:

This is paragraph one.

This is paragraph two.

I get some unexpected results:

$ perl -0ne 'print; exit' file ## \0 is used, so everything is printed
This is paragraph one.

This is paragraph two.

$ perl -00ne 'print; exit' file ## Paragraph mode, as expected
This is paragraph one.

So far, so good. Now, why do these two seem to also work in paragraph mode?

$ perl -000ne 'print; exit' file 
This is paragraph one.

$ perl -0000ne 'print; exit' file 
This is paragraph one.

And why is this one apparently slurping the entire file again?

$ perl -00000ne 'print; exit' file 
This is paragraph one.

This is paragraph two.

Further testing shows that these all seem to work in paragraph mode:

perl -000 
perl -0000
perl -000000
perl -0000000
perl -00000000

While these seem to slurp the file whole:

perl -00000
perl -000000000

I guess my problem is that I don't understand octal well enough (at all, really); I am a biologist, not a programmer. Do the latter two slurp the file whole because both 0000 and 00000000 are >= 0400? Or is there something completely different going on?

terdon

2 Answers


Octal is just like decimal in that 0 == 0, and 0000 == 0, 0 == 000000, etc. The fact that the switch here is -0 may make things a little confusing -- I would presume the point about "the special value 00" means one 0 for the switch and one for the value; adding more zeros is not going to change the latter, so you get the same thing...

Up to a point. The behavior of 000000 etc. is kind of bug-like, but keep in mind that this is supposed to refer to a single 8-bit value. The range of 8 bits in decimal is 0-255, in octal, 0-377. So you can't possibly use more than 3 digits here meaningfully (the special values are all outside that range, but still 3 digits + the switch). You are perhaps meant to just infer this from:

You can also specify the separator character using hexadecimal notation: -0xHHH..., where the H are valid hexadecimal digits. Unlike the octal form, this one may be used to specify any Unicode character, even those beyond 0xFF.

0xFF hex == 255 decimal == 377 octal == max for 8-bits, the size of one byte and a character in the (extended) ASCII set.
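
The arithmetic above is easy to check. A minimal sketch, using Python purely as a calculator (it plays no part in Perl's own parsing, and the variable name is made up for illustration):

```python
# Leading zeros never change the value of an octal number.
values = [int(digits, 8) for digits in ("0", "00", "0000", "000000")]
print(values)                  # every entry is 0

# The largest value that fits in one 8-bit byte, in three notations.
print(0o377 == 255 == 0xFF)    # True

# 0400 octal is 256 decimal, the first value that no longer fits in a byte;
# this lines up with perlrun's "any value 0400 or above".
print(0o400, 0o777)            # 256 511
```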

goldilocks
    Ah, yes! Checking more deeply (perl -0e 'print $/' | od -c) shows that -00000 and -000000000 set the record separator to \0 so I guess they're cycling back to -0 when they're at multiples of 4. And yes, the special value is -00, they include the 0 of the flag itself. – terdon Mar 25 '15 at 19:17

Let's look into the Perl source for more details. In perl.c:

case '0':
    {
        I32 flags = 0;
        STRLEN numlen;

        SvREFCNT_dec(PL_rs);
        if (s[1] == 'x' && s[2]) {
            const char *e = s+=2;
            U8 *tmps;

            while (*e)
                e++;
            numlen = e - s;
            flags = PERL_SCAN_SILENT_ILLDIGIT;
            rschar = (U32)grok_hex(s, &numlen, &flags, NULL);
            if (s + numlen < e) {
                rschar = 0; /* Grandfather -0xFOO as -0 -xFOO. */
                numlen = 0;
                s--;
            }
            PL_rs = newSVpvs("");
            SvGROW(PL_rs, (STRLEN)(UNISKIP(rschar) + 1));
            tmps = (U8*)SvPVX(PL_rs);
            uvchr_to_utf8(tmps, rschar);
            SvCUR_set(PL_rs, UNISKIP(rschar));
            SvUTF8_on(PL_rs);
        }
        else {
            numlen = 4;
            rschar = (U32)grok_oct(s, &numlen, &flags, NULL);
            if (rschar & ~((U8)~0))
                PL_rs = &PL_sv_undef;
            else if (!rschar && numlen >= 2)
                PL_rs = newSVpvs("");
            else {
                char ch = (char)rschar;
                PL_rs = newSVpvn(&ch, 1);
            }
        }
        sv_setsv(get_sv("/", GV_ADD), PL_rs);
        return s + numlen;
    }

grok_oct converts a string representing an octal number to numeric form. It stops as soon as it meets an invalid octal digit, and it reads at most numlen characters (here numlen = 4, counting the 0 of the switch itself). On return, numlen holds the number of characters actually consumed. You can see the for loop in its implementation in numeric.c.
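
The length-limited scan can be modelled in a few lines. This is a simplified Python sketch, not the code in numeric.c: the name scan_octal is made up for illustration, and the warnings and overflow handling of the real function are left out.

```python
OCTAL_DIGITS = "01234567"

def scan_octal(text, max_len=4):
    """Hypothetical stand-in for grok_oct.

    Reads up to max_len leading octal digits of text and returns
    (value, consumed); consumed plays the role of numlen after the call.
    """
    value = 0
    consumed = 0
    for ch in text[:max_len]:
        if ch not in OCTAL_DIGITS:
            break                      # stop at the first non-octal character
        value = value * 8 + int(ch)
        consumed += 1
    return value, consumed

print(scan_octal("00000"))     # (0, 4): only four of the five zeros are read
print(scan_octal("0777ne"))    # (511, 4)
print(scan_octal("00ne"))      # (0, 2): stops at 'n'
```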

So in -00000, perl first parses -0000, which sets $/ to the empty string (paragraph mode, because the value is 0 and numlen >= 2). The last 0 is then treated as a separate -0, which sets $/ to \000. The same chunking can be seen in:

$ perl -MO=Deparse -00000777ne 'print; exit' file
BEGIN { $/ = undef; $\ = undef; }
LINE: while (defined($_ = <ARGV>)) {
    print $_;
    exit;
}
-e syntax OK

$/ was set to undef, because the last octal sequence perl parsed is 0777.

More clearly:

$ perl -MO=Deparse -00000x1FF -ne 'print; exit' file
BEGIN { $/ = "\x{1ff}"; $\ = undef; }
LINE: while (defined($_ = <ARGV>)) {
    print $_;
    exit;
}
-e syntax OK

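You can see that the first four characters (0000) were consumed as an octal value, and the remaining 0x1FF was then parsed as a new -0 switch in hexadecimal form, so $/ ended up as "\x{1ff}".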
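
Putting the pieces together, here is a rough Python model of how one bundled argument is consumed at most four characters at a time. The function name final_separator and the strings it returns are invented for this sketch; it only mirrors the branches of the perl.c excerpt above and is no substitute for running Perl itself.

```python
import re

def final_separator(arg):
    """Describe $/ after a bundled argument has been processed.

    arg is the text after the leading '-', e.g. "00000" or "00000x1FF".
    Only -0 switches are modelled; scanning stops at any other switch.
    """
    sep = "default"
    pos = 0
    while pos < len(arg) and arg[pos] == "0":
        rest = arg[pos:]
        # Hex form: taken only if everything after 0x is a hex digit.
        hex_form = re.fullmatch(r"0x([0-9A-Fa-f]+)", rest)
        if hex_form:
            return "U+%04X" % int(hex_form.group(1), 16)
        # Octal form: at most 4 characters, including the switch's own 0.
        digits = re.match(r"[0-7]{1,4}", rest).group(0)
        value = int(digits, 8)
        if value > 0xFF:
            sep = "undef (slurp)"
        elif value == 0 and len(digits) >= 2:
            sep = "'' (paragraph mode)"
        else:
            sep = repr(chr(value))
        pos += len(digits)   # anything left over is parsed as a new switch
    return sep

for n in range(1, 10):
    print("-" + "0" * n, "->", final_separator("0" * n))
print("-00000777ne ->", final_separator("00000777ne"))
print("-00000x1FF  ->", final_separator("00000x1FF"))
```

Under this model, runs of 1, 5 and 9 zeros end with a lone -0 (separator \0), while every other length from 2 to 8 ends in paragraph mode, which matches the results reported in the question.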

cuonglm
  • Since any number of zeros is still a valid octal number, I'd be interested to read your theory regarding the (unconfirmed by me but reported by terdon) special behavior of -00000 and -000000000. I don't see it here, since that and any number of zeros >= 2 should result in the same thing. – goldilocks Mar 25 '15 at 23:36
  • @goldilocks since you won't just take my word for it (the gall!), you can see the link provided by don_crissti where the same thing is reported. – terdon Mar 26 '15 at 00:24
  • @goldilocks: perl only reads 4 characters at a time from the whole string as a valid octal value. See my update. – cuonglm Mar 26 '15 at 03:13
  • @terdon I didn't mean that I didn't believe you, or that I thought cuonglm was wrong -- I only meant that the original version didn't show exactly why the problem occurred. Well -- it sort of did because there is that numlen = 4, but that is passed by reference, implying that it's going to be set to the true length. Which it is, but looking at the grok_oct() source (if you don't get the "grok" reference, BTW, it's from Robert Heinlein...only in perl, lol) the initial value of that arg is a maximum (i.e., if it is less than 4, it will be set to that). So cuonglm is correct. +1 – goldilocks Mar 28 '15 at 12:26