Parallelization of PDL code

Introduction

In a previous post I developed some PDL codes for vectorizing orthogonalization of basis sets, but the code had some flaws. In particular, it wouldn’t work in the presence of paralelization. However, when I tried to check parallelizing the code, PDL refused to launch parallel threads. I want to understand why, so I do a few experiments here.

Boilerplate code

Initialization

use warnings;
use strict;
use PDL;
use v5.12;

my $N=20; # size of vector
set_autopthread_targ(4); # Number of cores in my laptop
set_autopthread_size(0); # zero to attemp to parallelize small vectors
my $x=sequence($N);

Report

say "Threads: ",get_autopthread_actual();
say "Result: $x";

Simple scalar routine

Code

<<initialization>>
$x++;
<<report>>

Run it

./simple.pl

Results

Threads: 4
Result: [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]

Seems ok.

Same with thread_define

Code

<<initialization>>
increment($x);
<<report>>
BEGIN {
    thread_define 'increment([io]a())', over {
	$_[0]++;
    }
}

Run it

./thread_define.pl

Results

Threads: 0
Result: [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]

I repeat but using an explicit vector instead of a scalar in the thread_define.

Code

<<initialization>>
increment($x);
<<report>>
BEGIN {
    thread_define 'increment([io]a(n))', over {
	$_[0]++;
    }
}

Run it

./thread_define_v.pl

Results

Threads: 4

Seems ok again. So implicit threading seems not to parallelize for operations defined with thread_define but it does work for explicit vector operations.

Now I try with a scalar and a separate output.

Code

<<initialization>>
$y=null;
increment($x, $y);
$x=$y;
<<report>>
BEGIN {
    thread_define 'increment(a();[o]b())', over {
	$_[1].=$_[0]+1;
    }
}

Run it

./thread_define_o.pl

Results:

Threads: 0
Result: [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]

So it seems that when implicit threading a thread_defined function, there is no parallelization even when input and output are separated.

Inline::Pddlpp

Now I solve the same using Inline::Pdlpp.

Code

<<initialization>>
$x->increment;
<<report>>
use Inline Pdlpp => <<'EOPP'
pp_def('increment',
       Pars=>'[io]a()',
       Code=>'++$x();',
);
EOPP

Run it

./pdlpp.pl

Results

Threads: 4
Result: [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]

So the implicit threading for routines defined with Inline PDL does parallelize.

Dumb mistakes

Now I test the kind of dumb mistake that could happen when parallelizing codes that modify their inputs, as mentioned in the introduction.

Example

Below is a code to perform partial sums of an array.

perl -MList::Util=reductions -E 'say join " ", reductions {$a+$b} 0..19'

Results

0 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136 153 171 190

Same in PDL

perl -MPDL -MPDL::NiceSlice -E '$x=sequence(20)->dummy(1,20);say +($x*($x->xvals<=$x->yvals))->sumover;'

Results

[0 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136 153 171 190]

The same with a complicated trick as in my previous post.

Code

use warnings;
use strict;
use v5.12;
use PDL;
use PDL::NiceSlice;
my $x=sequence(20);
sums($x);
say $x;
sub sums {
    $x=shift;
    $x->sums_aux($x->sequence(), $x);
}
no PDL::NiceSlice;
use Inline Pdlpp => <<'EOPP'
    pp_def('sums_aux',
           Pars=>'[io]a(); p(); b(n)',
	   Code=>q{
		 PDL_Indx i=$p()-1;
	         if(i>=0){$a()+=$b(n=>i);}
		 },
    );
EOPP

Run it

./tricky.pl

Results

[0 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136 153 171 190]

Tricky, but it works.

Parallelized version.

I rewrite the same program, but I only print the last value (an error in any previous value would produce an error in the last one), and I accept an argument for the size of the sequence. I also print the exact expected value. I explicitly use the longlong type, to test large numbers.

Code

use warnings;
use strict;
use v5.12;
use PDL;
use PDL::NiceSlice;
set_autopthread_targ(4);
set_autopthread_size(0);
my $N=shift @ARGV;
my $x=sequence(longlong, $N);
sums($x);
say "Threads: ", get_autopthread_actual;
say "Size of array: $N\nLast result: ", scalar $x((-1)), " vs. exact result: ", $N*($N-1)/2;
sub sums {
    $x=shift;
    $x->sums_aux($x->sequence(), $x);
}
no PDL::NiceSlice;
use Inline Pdlpp => <<'EOPP'
    pp_def('sums_aux',
           Pars=>'[io]a(); p(); b(n)',
	   Code=>q{
		   PDL_Indx i=$p()-1;
	           if(i>=0){
		     $a()+=$b(n=>i);
		   }
		},
    );
EOPP

Run it

./tricky_p.pl 20

Results:

Threads: 4
Size of array: 20
Last result: 190 vs. exact result: 190

This results look good, but…

Run it again

./tricky_p.pl 18000

Results

Threads: 4
Size of array: 18000
Last result: 151872749 vs. exact result: 161991000

So for large enough vectors, the parallelized version of the tricky code yields the wrong result.

Conclusion

It seems that thread_define doesn’t parallelize iterations over intrinsic dimensions, but Inline::Pdlpp does. And of course, if the code is indeed parallelized, you shouldn’t trust the results if they depend on the order of execution, as in my tricky but dumb example above. Thus, for the vectorizable orthogonalization trick presented in my previous post, care should be taken to avoid its parallelization.

Written on September 9, 2021