11

Given a sorted input file (or command output) that contains unique numbers, one per line, I would like to collapse all runs of consecutive numbers into ranges such that

n
n+1
...
n+m

becomes

n,n+m

input sample:

2
3
9
10
11
12
24
28
29
33

expected output:

2,3
9,12
24
28,29
33
αғsнιη
don_crissti

13 Answers

10

With dc for the mental exercise:

dc -f "$1" -e '
[ q ]sB
z d 0 =B sc sa z sb
[ Sa lb 1 - d sb 0 <Z ]sZ
lZx
[ 1 sk lf 1 =O lk 1 =M ]sS
[ li p c 0 d sk sf ]sO
[ 2 sf lh d sj li 1 + !=O ]sQ
[ li n [,] n lj p c 0 sf ]sM
[ 0 sk lh sj ]sN
[ 1 sk lj lh 1 - =N lk 1 =M ]sR
[ 1 sf lh si ]sP
[ La sh lc 1 - sc lf 2 =R lf 1 =Q lf 0 =P lc 0 !=A ]sA
lAx
lSx
'
Jeff Schaller
ctac_
  • this is the kind of answer that makes you wish you could upvote twice... – don_crissti Sep 21 '18 at 19:18
  • @don_crissti, if you're into this sort of stuff, post the same question on codegolf.se and someone will implement it in Brainf**k. – ilkkachu Sep 24 '18 at 21:12
  • I like this answer, but have two questions: (1) How do I provide dc the numbers via STDIN? I tried to remove -f "$1" and prepend an echo "$numbers" or append <<< "$numbers", but it didn’t work. I can use -f - and then dc reads the STDIN. (2) How could I replace the newlines with a custom separator? I have already replaced [,] with [-], but I have no idea how to replace the newlines with , (of course, I can do it using sed, for example). – tukusejssirs Jun 29 '21 at 08:56
  • @tukusejssirs I'm not sure I understand what you want to know. Some implementations of dc don't accept options at all, only files, so for (1) you can try echo '2 3 z p' | dc to print the number of values: the values come first, then the code. For (2), take a look at the man page for the n, p and c commands. Be aware that the c command is not always present. If the c command doesn't exist, you need a trash stack. – ctac_ Jun 29 '21 at 16:53
  • @ctac_, I have solved the first problem by myself (I needed to use -f -). The second problem I wish to solve is to replace the newlines with a custom string, so that the output looks like 2-3, 9-12, 24, 28-29, 33 (I copied the example from the OP). Currently, I can do it only with an additional command (like sed), but not directly with dc. Note that I have no idea if it is actually possible with dc. ;) – tukusejssirs Jul 01 '21 at 16:18
  • @tukusejssirs All can be done with dc. Hum, perhaps not the coffee! To get what you want, you must add 2 new macros. [ li p c 0 d sk sf ]sD and [ li n [-] n lj p c 0 sf ]sC. You must also modify 3 macros. [ 1 sk lf 1 =D lk 1 =C ]sS and [ li n [,] n [ ] n 0 d sk sf ]sO and [ li n [-] n lj n [,] n [ ] n 0 sf ]sM. It's not optimized but a patch for you. – ctac_ Jul 03 '21 at 16:56
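
The "additional command" route mentioned in these comments can be sketched as a small post-processing filter. This is only an illustration, assuming the collapsed ranges arrive one per line in the n,m form used in the question; join_ranges is a hypothetical helper name, not part of dc or of the answer above.

```shell
# Hypothetical helper: turn "n,m" lines into a single "n-m, ..." line.
join_ranges() {
    sed 's/,/-/' |      # range separator: comma -> hyphen
    paste -sd, - |      # join all lines with commas
    sed 's/,/, /g'      # add a space after each joining comma
}

printf '2,3\n9,12\n24\n28,29\n33\n' | join_ranges
```

For the sample from the question this prints 2-3, 9-12, 24, 28-29, 33 on one line.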
7
awk '
    function output() { print start (prev == start ? "" : ","prev) }
    NR == 1 {start = prev = $1; next}
    $1 > prev+1 {output(); start = $1}
    {prev = $1}
    END {output()}
'
glenn jackman
5

awk, with a different (more C-like) approach:

awk '{ do{ for(s=e=$1; (r=getline)>0 && $1<=e+1; e=$1); print s==e ? s : s","e }while(r>0) }' file

the same thing, even less awk-ward:

awk 'BEGIN{
    for(r=getline; r>0;){
        for(s=e=$1; (r=getline)>0 && $1<=e+1; e=$1);
        print s==e ? s : s","e
    }
    exit -r
}' file
  • Nice. In the interests of compactness, for(r=getline; r>0;) could just be for(r=getline;r;) – steve Sep 21 '18 at 18:56
  • And (r=getline)>0 could just be (r=getline) – steve Sep 21 '18 at 18:56
  • @steve getline returns -1 on error (e.g. EIO) –  Sep 21 '18 at 19:02
  • @steve. That's why the exit -r too -- that could be removed (awk will handle that itself on the next automatic getline) but I wanted the second version to be completely unmagical. –  Sep 21 '18 at 19:14
  • on my gnu awk, getline returns zero on EIO. Example : echo foo | awk 'BEGIN{a=getline;print a;a=getline;print a}' yields output of "1" followed by "0". Man page : "The getline command returns 1 on success, 0 on end of file, and -1 on an error" – steve Sep 22 '18 at 08:37
  • You're not simulating an EIO in your example, but an end-of-file. Try this instead '(trap '' SIGTTIN; gawk 'BEGIN{print getline}') &'. In fact, only gawk handles errors on stdin correctly (as required by the standard), mawk/nawk/oawk/etc simply treat them as eof (they do warn or exit with fatal errors in some cases, quite inconsistently -- this whole thing probably deserves its own question) tl;dr with gawk, for(r=getline;r;) ... may result in an infinite loop. –  Sep 22 '18 at 10:37
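
The exchange above comes down to getline having three outcomes, so a loop that only tests for non-zero cannot tell a read error from more input. A minimal sketch of a loop that keeps the three apart follows; count_lines is a made-up name, and writing to /dev/stderr assumes an awk or system that provides it.

```shell
# Hypothetical helper: count input lines, treating a read error as fatal.
count_lines() {
    awk 'BEGIN {
        # > 0: a record was read; 0: end of file; < 0: read error
        while ((r = getline line) > 0)
            n++
        if (r < 0) {
            print "read error" > "/dev/stderr"
            exit 1
        }
        print n + 0
    }'
}

printf 'a\nb\nc\n' | count_lines
```

With a test of just (r = getline line), the -1 case would keep the loop running, which is the infinite loop described in the last comment.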
5

Using Perl substitute with eval (Sorry for the obfuscation...):

perl -0pe 's/(\d+)\n(?=(\d+))/ $1+1==$2 ? "$1," : $& /ge; 
           s/,.*,/,/g' ex
  • the first substitution joins each run of consecutive integers into one ","-separated line;
  • the second substitution removes the middle numbers from each such line.
JJoao
2

Another awk approach (a variation of glenn's answer):

awk '
    function output() { print start (start != end? ","end : "") }
    NR>1 && (end==$0-1 || end==$0) { end=$0; next }
    end!=""{ output() }
    { start=end=$0 }
END{ output() }' infile
αғsнιη
2

Yet another awk solution, similar to the others:

#!/usr/bin/awk -f

function output() {
    # This function is called when a completed range needs to be
    # outputted. It will use the global variables rstart and rend.

    if (rend != "")
        print rstart, rend
    else
        print rstart
}

# Output field separator is a comma.
BEGIN { OFS = "," }

# At the start, just set rstart and prev (the previous line's number) to
# the first number, then continue with the next line.
NR == 1 { rstart = prev = $0; next }

# Calculate the difference between this line and the previous. If it's
# 1, move the end of the current range here.
(diff = $0 - prev) == 1 { rend = $0 }

# If the difference is more than one, then we're onto a new range.
# Output the range that we were processing and reset rstart and rend.
diff > 1 {
    output()

    rstart = $0
    rend = ""
}

# Remember this line's number as prev before moving on to the next line.
{ prev = $0 }

# At the end, output the last range.
END { output() }

The rend variable is not actually needed, but I wanted to keep as much range logic as possible away from the output() function.
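
For comparison, a variant without rend might look like the sketch below. It is not the author's script: collapse_no_rend is a hypothetical name, the range test moves into output(), and the input is assumed to be sorted and unique as in the question.

```shell
# Hypothetical variant: track only the range start and the previous number.
collapse_no_rend() {
    awk '
        # A range is a single number when it starts and ends on the same value.
        function output() { print (rstart == prev ? rstart : rstart "," prev) }

        NR == 1       { rstart = prev = $0; next }
        $0 - prev > 1 { output(); rstart = $0 }
                      { prev = $0 }
        END           { if (NR) output() }
    '
}

printf '2\n3\n9\n10\n11\n12\n24\n28\n29\n33\n' | collapse_no_rend
```

The trade-off is the one named above: the script gets shorter, but output() now has to decide whether it is printing a single number or a range.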

muru
Kusalananda
2

An alternative in awk:

<infile sort -nu | awk '
     { l=p=$1 }
     { while ( (r=getline) >= 0 ){
           if ( $1 == p+1 ) { p=$1;  continue };
           print ( l==p ? l : l","p );
           l=p=$1
           if(r==0){ break };
           }
       if (r == -1 ) { print "Unexpected error in reading file"; exit 1 }
     }
    ' 

On one line (no error check):

<infile awk '{l=p=$1}{while((r=getline)>=0){if($1==p+1){p=$1;continue};print(l==p?l:l","p);l=p=$1;if(r==0){ break };}}'

With comments (and pre-processing the file to ensure a sorted, unique list):

<infile sort -nu | awk '

     { l=p=$1 }    ## Only on the first line. The loop will read all lines.

     ## read all lines while there is no error.
     { while ( (r=getline) >= 0 ){

           ## If present line ($1) follows previous line (p), continue.
           if ( $1 == p+1 ) { p=$1;  continue };

           ### Starting a new range ($1>p+1): print the previous range.
           print ( l==p ? l : l","p );

           ## Save values in the variables left (l) and previous (p).
           l=p=$1

           ## At the end of the file, break the loop.
           if(r==0){ break };

           }

       ## All lines have been processed or got an error.
          if (r == -1 ) { print "Unexpected error in reading file"; exit 1 }
     }
    ' 
2

Taken from a nice 2001 discussion on perlmonks.org and adapted to read from STDIN or from files named on the command line (as Perl is wont to do):

#!/usr/bin/env perl
use strict;
use warnings;
use 5.6.0;  # for (??{ ... })
sub num2range {
  local $_ = join ',' => @_;
  s/(?<!\d)(\d+)(?:,((??{$++1}))(?!\d))+/$1-$+/g;
  tr/-,/,\n/;
  return $_;
}
my @list;
chomp(@list = <>);
my $range = num2range(@list);
print "$range\n";
2

There's a perl module called Set::IntSpan which already does this (it was originally written in 1996 to collapse lists of article numbers for .newsrc files, which could be enormous).

There is also a similar module for python called intspan, but I haven't used it.

Anyway, with perl and Set::IntSpan (and tr to get the input data into comma-separated format, and tr again to munge the output), this is trivial.

$ tr $'\n' ',' < input.txt  | 
    perl -MSet::IntSpan -lne 'print Set::IntSpan->new($_)' |
    tr ',-' $'\n,'
2,3
9,12
24
28,29
33

Set::IntSpan is packaged for debian and ubuntu as libset-intspan-perl, for fedora as perl-Set-IntSpan, and probably for other distros too. Also available on CPAN, of course.

cas
1

How about

awk '
$0 > LAST+1     {if (NR > 1)  print (PR != LAST)?"," LAST:""
                 printf "%s", $0
                 PR = $0
                }
                {LAST  = $0
                }
END             {print (PR != LAST)?"," LAST:""
                }
' file
2,3
9,12
24
28,29
33
RudiC
  • I'd stick to lower case variable names, but that's just style. I came up with the same logic. – glenn jackman Sep 19 '18 at 17:42
  • Actually, there is a bug here: if the first line is less than one, the first condition will be false, so the first number will not be printed. You need a separate rule for NR==1 – glenn jackman Sep 19 '18 at 17:44
  • I might recheck. I'm not too happy with the clumsy logic anyhow. – RudiC Sep 19 '18 at 18:06
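
One way to add the separate NR==1 rule suggested in the comments is sketched below. It is an illustration, not a revision by the answer's author: the function name is hypothetical and the variables are lower-cased.

```shell
# Hypothetical fix: give the first line its own rule.
collapse_first_line_safe() {
    awk '
        # Handle the first line separately so that 0 or negative starts work.
        NR == 1       { printf "%s", $0; pr = last = $0; next }
        $0 > last + 1 { print ((pr != last) ? "," last : ""); printf "%s", $0; pr = $0 }
                      { last = $0 }
        END           { if (NR) print ((pr != last) ? "," last : "") }
    '
}

printf '0\n1\n5\n' | collapse_first_line_safe
```

For this input it prints 0,1 and then 5, so the leading number is no longer dropped.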
1

Perl approach!

#!/bin/perl
print ranges(2,3,9,10,11,12,24,28,29,33), "\n";

sub ranges {
    my @vals = @_;
    my $first = $vals[0];
    my $last;
    my @list;
    for my $i (0 .. (scalar(@vals)-2)) {
        if (($vals[$i+1] - $vals[$i]) != 1) {
            $last = $vals[$i];
            push @list, ($first == $last) ? $first : "$first,$last";
            $first = $vals[$i+1];
        }
    }
    $last = $vals[-1];
    push @list, ($first == $last) ? $first : "$first,$last";
    return join ("\n", @list);
}
1

Ugly software tools bash shell code, where file is the input file:

diff -y file <(seq $(head -1 file) $(tail -1 file))  |  cut -f1  | 
sed -En 'H;${x;s/([0-9]+)\n([0-9]+\n)*([0-9]+)/\1,\3/g;s/\n\n+/\n/g;s/^\n//p}'

Or with wdiff:

wdiff -12 file <(seq $(head -1 file) $(tail -1 file) ) | 
sed -En 'H;${x;s/([0-9]+)\n([0-9]+\n)*([0-9]+)/\1,\3/g;s/=+\n\n//g;s/^\n//p}'

How these work: seq makes a gapless sequential list from the first and last numbers in the input file (this relies on file already being sorted), and diff against that list does most of the work. The sed code is mainly formatting: it replaces the in-between numbers of each run with a comma.
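
The underlying idea, compare the file against a gapless list so that every missing number becomes a break between runs, can be seen in isolation with a small stand-in. This is not the diff pipeline above and makes no claim about diff's exact output; mark_gaps is a hypothetical helper that assumes a sorted, unique, non-empty file and uses awk instead of diff.

```shell
# Hypothetical stand-in for the diff step: print every number from the
# first to the last line of the file, blanking those missing from it.
mark_gaps() {
    seq "$(head -n 1 "$1")" "$(tail -n 1 "$1")" |
    awk 'NR == FNR { have[$1]; next } { print (($1 in have) ? $1 : "") }' "$1" -
}

tmp=$(mktemp)
printf '2\n3\n6\n' > "$tmp"
mark_gaps "$tmp"
rm -f "$tmp"
```

Here 2 and 3 are followed by two blank lines and then 6. Once the gaps are visible like this, collapsing each unbroken run to its first and last number is a formatting job.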

For a related problem, which is the inverse of this one, see: Finding gaps in sequential numbers

agc
1

On a "Unix & Linux" site, a simple, readable, pure (bash) shell script feels most appropriate to me:

#!/bin/bash

inputfile=./input.txt

unset prev begin
while read num ; do
    if [ "$prev" = "$((num-1))" ] ; then
        prev=$num
    else
        if [ "$begin" ] ; then
            [ "$begin" = "$prev" ] && echo "$prev" || echo "$begin,$prev"
        fi
        begin=$num
        prev=$num
    fi
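done < "$inputfile"

# The loop only prints a range when the next one starts, so print the last one here.
if [ "$begin" ] ; then
    [ "$begin" = "$prev" ] && echo "$prev" || echo "$begin,$prev"
fi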
Hkoof