improve mock data generation

Question

I'm trying to generate a CSV file with mock data:

for i in {1..1000000..1}
do 
  echo "$i,$(date -d "2017-08-01 + $(shuf -i 1-31 -n 1) days" +'%Y-%m-%d')" >> $F
done;

This loops from 1 to one million, generating a unique ID and a random date for each line,

but it runs very slowly. Is there a one-liner to make it parallel?

you don't really want it parallel; you just want it faster, right? — Jeff Schaller, Aug 27 '17 at 13:24
also, if I'm reading this correctly, you're just generating dates in August -- is this your actual use-case, or has it been simplified? Otherwise, you could just generate (one million divided by 31) entries for each date, then shuffle them. — Jeff Schaller, Aug 27 '17 at 13:25
relating only: https://unix.stackexchange.com/questions/140750/generate-random-numbers-in-specific-range — Jeff Schaller, Aug 27 '17 at 13:27
also interesting reading: https://unix.stackexchange.com/q/323845/117549 — Jeff Schaller, Aug 27 '17 at 13:28

Kusalananda · Accepted Answer · 2017-08-27T15:40:15.547

See the very end for the end result.

for i in {1..1000000..1}
do 
  echo "$i,$(date -d "2017-08-01 + $(shuf -i 1-31 -n 1) days" +'%Y-%m-%d')" >> $F
done;

Shell loops are slow, and there are two main things that make this particular loop extra slow:

Opening and appending to a file in each iteration.
Two executions of external utilities (shuf and date) in each iteration. The echo is likely built into the shell, so that incurs less overhead.

The output redirection is most easily remedied:

for i in {1..1000000..1}
do 
  echo "$i,$(date -d "2017-08-01 + $(shuf -i 1-31 -n 1) days" +'%Y-%m-%d')" 
done >"$F"

This opens the output file only once and keeps it open for the duration of the loop.
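
A small sketch of the difference, using a 1000-line loop and two temporary files (the variable names slow and fast are made up for illustration). Both loops should produce identical files; only the number of times the file gets opened differs.

```shell
#!/bin/sh
# Sketch only: compare per-iteration appending with a single redirection.
# "slow" and "fast" are illustrative names, not taken from the question.
slow=$(mktemp)
fast=$(mktemp)
trap 'rm -f "$slow" "$fast"' EXIT

# Variant 1: the file is opened, appended to, and closed on every iteration.
i=1
while [ "$i" -le 1000 ]; do
    echo "$i" >>"$slow"
    i=$((i + 1))
done

# Variant 2: the redirection belongs to the loop, so the file is opened once.
i=1
while [ "$i" -le 1000 ]; do
    echo "$i"
    i=$((i + 1))
done >"$fast"

# The contents should be the same either way.
cmp -s "$slow" "$fast" && echo "identical output"
```

Wrapping each variant in time with a larger count should show how much the repeated open and close costs on a given system.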

The rest of the code can be done more efficiently with awk and GNU date (since you're using shuf I presume that you are on a Linux system, which means it's pretty likely that date is in fact GNU date).

awk 'END { for (i=0;i<100;++i) { printf("2017-08-01 + %d days\n", 1+int(31*rand())) }}' /dev/null

This thing generates 100 lines like

2017-08-01 + 22 days
2017-08-01 + 31 days
2017-08-01 + 11 days
2017-08-01 + 27 days
2017-08-01 + 27 days
2017-08-01 + 20 days
(etc.)

Let's feed these into GNU date. GNU date has a flag, -f, that lets us feed it multiple date specifications in one batch, for example those output by our awk program:

awk 'END { for (i=0;i<100;++i) { printf("2017-08-01 + %d days\n", 1+int(31*rand())) }}' /dev/null |
date -f - +'%Y-%m-%d'

Now we get

2017-08-23
2017-08-27
2017-08-21
2017-08-29
2017-08-25
2017-08-17
2017-08-07
(etc.)
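
To see which dates those offsets can actually produce, it may help to feed date -f just the two extreme offsets, 1 and 31. This is only a check and assumes GNU date; TZ=UTC is set here merely to keep the result independent of the local time zone.

```shell
# Smallest and largest offsets produced by 1+int(31*rand()).
printf '%s\n' '2017-08-01 + 1 days' '2017-08-01 + 31 days' |
TZ=UTC date -f - +'%Y-%m-%d'
# prints:
# 2017-08-02
# 2017-09-01
```

So the range runs from 2 August to 1 September, the same as with shuf -i 1-31 in the original loop. If the dates should stay within August, int(31*rand()) without the leading 1+ gives offsets 0 to 30 instead.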

Then it's just a matter of adding the unique ID (a sequential integer) to each line:

awk 'END { for (i=0;i<100;++i) { printf("2017-08-01 + %d days\n", 1+int(31*rand())) }}' /dev/null |
date -f - +'%Y-%m-%d' |
awk -vOFS=',' '{ print NR, $0 }'

This gives you

1,2017-08-06
2,2017-08-17
3,2017-08-25
4,2017-08-28
5,2017-08-14
6,2017-08-15
7,2017-08-17
8,2017-08-10
9,2017-08-16
10,2017-08-08
(etc.)

And now we're done. And in the process, I totally forgot we had a shell loop. Turns out it's not needed.

Just crank up the 100 to whatever value you want, and adjust the random number generator to fit your needs. rand() returns a floating point value such that 0 <= number < 1.
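
One possible way to make the count and the seed into parameters is sketched below. The function name gen_dates and its two arguments are made up for this sketch. It uses a BEGIN block instead of END with /dev/null, which should behave the same here. The explicit srand() call is there because some awk implementations (GNU awk, for example) return the same rand() sequence on every run unless a seed is set.

```shell
# gen_dates COUNT [SEED]: hypothetical helper, not part of the pipeline above.
# Prints COUNT lines of the form "2017-08-01 + N days" with N in 1..31.
gen_dates() {
    awk -v n="$1" -v seed="${2:-}" '
        BEGIN {
            # Seed explicitly if a seed was given, else from the time of day.
            if (seed != "") srand(seed); else srand()
            for (i = 0; i < n; ++i)
                printf("2017-08-01 + %d days\n", 1 + int(31 * rand()))
        }'
}

# Same pipeline as above with 1000 rows; the fixed seed makes it repeatable.
gen_dates 1000 42 |
date -f - +'%Y-%m-%d' |
awk -v OFS=',' '{ print NR, $0 }' |
tail -n 3
```

Note that srand() without an argument usually seeds from the clock with one-second resolution, so two unseeded runs within the same second may produce the same rows.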

Obviously, if you just want random dates in August (a month with 31 days), you may bypass date altogether:

awk 'END { for (i=1;i<=100;++i) { printf("%d,2017-08-%02d\n", i, 1+int(31*rand())) }}' /dev/null
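
A quick sanity check for this variant is to tally how often each date occurs. The sketch below assumes a fixed seed and uses a made-up function name, august_csv. With enough rows, all 31 days should show up with roughly equal counts.

```shell
# august_csv COUNT: hypothetical wrapper around the awk one-liner above,
# with a fixed seed so that repeated runs give the same rows.
august_csv() {
    awk -v n="$1" 'BEGIN {
        srand(1)
        for (i = 1; i <= n; ++i)
            printf("%d,2017-08-%02d\n", i, 1 + int(31 * rand()))
    }'
}

# Count rows per date; expect 31 distinct dates, each near 1000 occurrences.
august_csv 31000 | cut -d, -f2 | sort | uniq -c
```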

With GNU awk and Mike's awk (mawk), but not with BSD awk, you may even do proper date handling directly in awk:

awk 'END { for (i=1;i<=100;++i) { printf("%d,%s\n", i, strftime("%Y-%m-%d", 1501545600 + int(2678400*rand()),1 )) }}' /dev/null

Now we're dealing with Unix timestamps rather than with days though. 1501545600 corresponds to "Tue Aug 1 00:00:00 UTC 2017" and there are 2678400 seconds in 31 days.
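
The two constants can be double-checked from the shell. This is only a verification sketch and assumes GNU date, since -d and the @SECONDS syntax are GNU extensions.

```shell
# Start of the range as a Unix timestamp.
date -u -d '2017-08-01 00:00:00' +%s              # prints 1501545600

# Number of seconds in 31 days.
echo $((31 * 24 * 60 * 60))                        # prints 2678400

# Largest timestamp the expression can yield: start + 2678399.
date -u -d @$((1501545600 + 2678399)) +'%F %T'     # prints 2017-08-31 23:59:59
```

For a different month or a longer range, the same two commands should give the values to plug into the strftime() call.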

glenn jackman · Answer 2 · 2017-08-28T21:29:51.323

# A "random" date between 2000-01-01 and 2024-12-28
# (RANDOM%25+2000 yields the years 2000 to 2024)
# Only uses day 01 to 28 
rand_date() {
    printf "%4d-%02d-%02d" $((RANDOM%25+2000)) $((RANDOM%12+1)) $((RANDOM%28+1))
}

csv_data() {
    for ((i = 1; i <= $1; i++)); do printf '%d,%s\n' "$i" "$(rand_date)"; done
}

$ time (csv_data 1000000 > data.csv)
real    7m26.683s
user    0m36.376s
sys     1m57.768s

Perl's probably faster; let's try it:

$ cat data.pl
#!/usr/bin/perl
$, = ",";
$\ = "\n";

sub rand_date {
    sprintf "%4d-%02d-%02d", int(rand(25))+2000, int(rand(12))+1, int(rand(28))+1;
}

sub csv_data {
    my $n = shift;
    for ($i = 1; $i <= $n; $i++) {
        print $i, rand_date();
    }
}

csv_data(1_000_000);

$ time (perl data.pl > data.csv)

real    0m0.881s
user    0m0.876s
sys     0m0.004s

Hmm, yeah, a bit faster...
