Orthogonalization of vector sets in PDL.
Introduction
I have a code in my package for Photonic and metamaterials calculations to do some orthonormalization within Haydock iterations for a large set of states. I use the standard Gram-Schmidt procedure, which is iterative. For each vector, I substract its projections onto the previous elements. I used to store each vector as an element in an array, but after many modifications of the code, they are now part of a large ndarray. Thus Ed (@mohawk2) suggested to vectorize the procedure using PDL threading instead of PERL iterations. I thought it is not possible, but here, I test a few alternatives.
Gram Schmidt
The original version goes like this:
# Gram Schmidt orthogonalization, iterative
use warnings;
use strict;
use v5.12;
use PDL;
use PDL::NiceSlice;
use Time::HiRes qw(time);
$PDL::BIGPDL=1;
my ($N,$M)=@ARGV; # dimensions of system
die "Usage: ./gramschmidt N M with N>=M" unless defined $N && defined $M and $N>=$M;
srand(0); # for reproducible tests.
my $V=random($N,$M); # M random N-vectors.
# The first index is the vector index i and the second the vector number: i,n
my $t0=time();
foreach my $m (0..$M-1){
$V(:,($m))-= (($V(:,($m))*$V(:,0:$m-1))->sumover*($V(:,0:$m-1)->transpose))->sumover if $m>0;
$V(:,($m))/=($V(:,($m))**2)->sumover->sqrt; #normalize
}
my $t1=time();
say "Gram Schmidt time for $N,$M=", $t1-$t0;
# Check orthonormality
my $O=(($V->dummy(2)*$V->dummy(1))->sumover-identity($M))->abs->sum;
say "Orthogonality=$O";
my $t2=time();
say "Total time: ", $t2-$t0;
say $V if $N<=5;
This program builds a square array of random numbers which I interpret as a collection of row vectors and then orthogonalizes them iteratively using |n’>=summ<n |m><m|n> and then normalizing |n’>. I test it on a 1000x1000 matrix
./gramschmidt.pl 1000 1000
Results:
Gram Schmidt time for 1000,1000=4.96293306350708
Orthogonality=3.69626767899845e-08
Total time: 10.2968249320984
Checking the orthogonality seems to take longer than the actual orthogonalization, but in this example it requires building a large 1000x1000x1000 array.
Modified Gram Schmidt
In the modified version after each step orthogonalizing, say, |n> to |m>, building |n’>=|n>-|m><m|n>, I replace |n> by |n’> before orthogonalizing to the next basis vector |m+1>. The results would agree with the original procedure using infinite precision, but might differ for finite precision.
# Modified Gram Schmidt orthogonalization, iterative
use warnings;
use strict;
use v5.12;
use PDL;
use PDL::NiceSlice;
use Time::HiRes qw(time);
$PDL::BIGPDL=1;
my ($N,$M)=@ARGV; # dimensions of system
die "Usage: ./gramschmidt N M with N>=M" unless defined $N && defined $M and $N>=$M;
srand(0); # for reproducible tests.
my $V=random($N,$M); # random vectors.
# The first index is the vector index i and the second the vector number: i,m
my $t0=time();
foreach my $m (0..$M-1){
foreach my $n (0..$m-1){
$V(:,($m))-=(($V(:,($m))*$V(:,($n)))->sumover*$V(:,($n)));
}
$V(:,($m))/=($V(:,($m))**2)->sumover->sqrt; #normalize
}
my $t1=time();
say "Modified Gram Schmidt time for $N,$M=", $t1-$t0;
# Check orthonormality
my $O=(($V->dummy(2)*$V->dummy(1))->sumover-identity($M))->abs->sum;
say "Orthogonality=$O";
my $t2=time();
say "Total time: ", $t2-$t0;
say $V if $N<=5;
./modifiedgramschmidt.pl 1000 1000
Results:
Modified Gram Schmidt time for 1000,1000=12.7265810966492
Orthogonality=3.30134552816193e-10
Total time: 17.4651801586151
The procedure is much slower, due to the nested PERL iteration, but the precision is two orders of magnitude better.
QR factorization
Using the Householder transformation, a matrix V may be decomposed
into the product QR, where Q is an orthogonal matrix and R is upper
triangular. Furthermore, the first n columns of Q span the same space
as the first n columns of V. Thus, QR factorization accomplishes an
orthogonalization of the columns of V. So, I’ll try
my luck with the mqr
routine from PDL::LinearAlgebra.
# QR factorization
use warnings;
use strict;
use v5.12;
use PDL;
use PDL::NiceSlice;
use PDL::LinearAlgebra;
use Time::HiRes qw(time);
$PDL::BIGPDL=1;
my ($N,$M)=@ARGV; # dimensions of system
die "Usage: ./modifiedgramschmidt N M with N>=M" unless defined $N && defined $M and $N>=$M;
srand(0); # for reproducible tests.
my $V=random($N,$M); # random vectors.
# The first index is the vector index i and the second the vector number: i,m
my $t0=time();
$V=mqr($V->transpose)->transpose; # from PDL::LinearAlgebra
my $t1=time();
say "QR time for $N,$M=", $t1-$t0;
# Check orthonormality
my $O=(($V->dummy(2)*$V->dummy(1))->sumover-identity($M))->abs->sum;
say "Orthogonality=$O";
my $t2=time();
say "Total time: ", $t2-$t0;
say $V if $N<=5;
./QR.pl 1000 1000
Results:
QR time for 1000,1000=0.342689037322998
Orthogonality=2.99989157025321e-11
Total time: 7.1581711769104
The orthogonality became another order of magnitude better, and the time gained one order of magnitude with respect to the first algorithm and two with respect to the second one. So it seems this is the best procedure for orthogonalization within PDL.
Now I do a case with a small number of entries to check that I get the same result set of orthogonalized vectors. I also test sets with fewer vectors than the space dimension, to make sure I didn’t mix up dimensions.
./gramschmidt.pl 5 4
./modifiedgramschmidt.pl 5 4
./QR.pl 5 4
Results:
Gram Schmidt time for 5,4=0.000535011291503906
Orthogonality=3.12944115066216e-15
Total time: 0.000841140747070312
[
[ 0.1313358 0.57653872 0.074092339 0.66923001 0.44384178]
[ 0.87588479 -0.20720125 0.38458399 -0.1370171 0.15236453]
[ -0.3739352 0.19733953 0.88397582 -0.19940276 0.0074073153]
[ 0.25173294 0.43494961 0.047217819 0.13603524 -0.85247537]
]
Modified Gram Schmidt time for 5,4=0.000305891036987305
Orthogonality=1.6531914726059e-15
Total time: 0.000445842742919922
[
[ 0.1313358 0.57653872 0.074092339 0.66923001 0.44384178]
[ 0.87588479 -0.20720125 0.38458399 -0.1370171 0.15236453]
[ -0.3739352 0.19733953 0.88397582 -0.19940276 0.0074073153]
[ 0.25173294 0.43494961 0.047217819 0.13603524 -0.85247537]
]
QR time for 5,4=0.000292062759399414
Orthogonality=2.16840434497101e-15
Total time: 0.000442981719970703
[
[ -0.1313358 -0.57653872 -0.074092339 -0.66923001 -0.44384178]
[ 0.87588479 -0.20720125 0.38458399 -0.1370171 0.15236453]
[ 0.3739352 -0.19733953 -0.88397582 0.19940276 -0.0074073153]
[ 0.25173294 0.43494961 0.047217819 0.13603524 -0.85247537]
]
Disregarding sign changes, all three methods yielded the same result.
Thus, although I didn’t vectorize the code, the use of the Householder
transformation to do a QR factorization performs its iterations within a
C
code, and thus is much faster, besides being a more stable and
accurate algorithm.
I’m still not sure this may be applied to Photonic
as the product
between states is not a simple multiplication of vectors of
numbers.
Inline::Pdlpp
Now I use Inline::Pdlpp to build a PDL program that interfaces directly to C-like code. I implement below the modified Gram Schmidt algorithm.
# Gram Schmidt orthogonalization, vectorized
use warnings;
use strict;
use v5.12;
use PDL;
use PDL::NiceSlice;
use Time::HiRes qw(time);
$PDL::BIGPDL=1;
my ($N,$M)=@ARGV; # dimensions of system
die "Usage: ./modifiedgramschmidtv N M with N>=M" unless defined $N && defined $M and $N>=$M;
srand(0); # for reproducible tests.
my $V=random($N,$M); # M random N-vectors.
# The first index is the vector index i and the second the vector number: i,n
my $t0=time();
$V->orthogonalize();
my $t1=time();
say "Modified GS Inline::Pdlpp time for $N,$M=", $t1-$t0;
# Check orthonormality
my $O=(($V->dummy(2)*$V->dummy(1))->sumover-identity($M))->abs->sum;
say "Orthogonality=$O";
my $t2=time();
say "Total time: ", $t2-$t0;
say $V if $N<=5;
no PDL::NiceSlice;
use Inline Pdlpp => <<'EOPP'
pp_def('orthogonalize',
Pars=>'[io] V(n,m);', # modify input matrix
Code=>'
loop(m)%{
for(int l=0; l < m; ++l){
$GENERIC() p=0; /* accumulator for inner product */
loop(n) %{ /* component */
p+=$V()*$V(m=>l); /* syntax is strange */
%}
loop(n) %{ /* subtract projection */
$V()-=p*$V(m=>l);
%}
}
$GENERIC() q=0;
loop(n) %{ /* square magnitude */
q+=$V()*$V();
%}
q=sqrt(q);
loop(n) %{
$V()/=q; /* normalize */
%}
%}
',
);
EOPP
./gramschmidt.pl 5 4
./modifiedgramschmidt.pl 5 4
./QR.pl 5 4
./modifiedgramschmidtv.pl 5 4
Results:
Gram Schmidt time for 5,4=0.000308036804199219
Orthogonality=3.12944115066216e-15
Total time: 0.000442981719970703
[
[ 0.1313358 0.57653872 0.074092339 0.66923001 0.44384178]
[ 0.87588479 -0.20720125 0.38458399 -0.1370171 0.15236453]
[ -0.3739352 0.19733953 0.88397582 -0.19940276 0.0074073153]
[ 0.25173294 0.43494961 0.047217819 0.13603524 -0.85247537]
]
Modified Gram Schmidt time for 5,4=0.000325202941894531
Orthogonality=1.6531914726059e-15
Total time: 0.000478029251098633
[
[ 0.1313358 0.57653872 0.074092339 0.66923001 0.44384178]
[ 0.87588479 -0.20720125 0.38458399 -0.1370171 0.15236453]
[ -0.3739352 0.19733953 0.88397582 -0.19940276 0.0074073153]
[ 0.25173294 0.43494961 0.047217819 0.13603524 -0.85247537]
]
QR time for 5,4=0.000297069549560547
Orthogonality=2.16840434497101e-15
Total time: 0.000451087951660156
[
[ -0.1313358 -0.57653872 -0.074092339 -0.66923001 -0.44384178]
[ 0.87588479 -0.20720125 0.38458399 -0.1370171 0.15236453]
[ 0.3739352 -0.19733953 -0.88397582 0.19940276 -0.0074073153]
[ 0.25173294 0.43494961 0.047217819 0.13603524 -0.85247537]
]
Modified GS Inline::Pdlpp time for 5,4=5.96046447753906e-06
Orthogonality=1.6531914726059e-15
Total time: 0.000267982482910156
[
[ 0.1313358 0.57653872 0.074092339 0.66923001 0.44384178]
[ 0.87588479 -0.20720125 0.38458399 -0.1370171 0.15236453]
[ -0.3739352 0.19733953 0.88397582 -0.19940276 0.0074073153]
[ 0.25173294 0.43494961 0.047217819 0.13603524 -0.85247537]
]
./gramschmidt.pl 1000 1000
./modifiedgramschmidt.pl 1000 1000
./QR.pl 1000 1000
./modifiedgramschmidtv.pl 1000 1000
Results:
Gram Schmidt time for 1000,1000=4.89408993721008
Orthogonality=3.69626767899845e-08
Total time: 9.66118001937866
Modified Gram Schmidt time for 1000,1000=12.8312540054321
Orthogonality=3.30134552816193e-10
Total time: 17.5760328769684
QR time for 1000,1000=0.338634014129639
Orthogonality=2.99989157025321e-11
Total time: 6.98031520843506
Modified GS Inline::Pdlpp time for 1000,1000=2.36270093917847
Orthogonality=3.29437648148214e-10
Total time: 7.08088278770447
Vectorization with threaddefine
After much thought, I found a tricky way to vectorize the ordinary
Gram Schmidt orthogonalization. To that end I use thread_define
, a
function that builds threadable perl functions. Thus I define an auxiliary
function that orthogonalizes one vector A to an orthonormal subset 0:p-1 of a set
of vectors B. I add a scalar parameter with the index p. The
routine modifies its input parameter A. Then I call this function
using the full set of vectors V that I want to
orthonormalize as an nXm ndarray as both A and B. Instead of a
scalar p I use a sequence of numbers 0..m-1. The threading engine
of PDL
than sets p=0, A=V(:,0) and normalizes it, then sets p=1,
A=V(:,1), orthogonalizes it to the previous normalized vector and
normalizes it, then proceeds with p=2,3…m-1 orthogonalizing to all
the previously orthogonalized vectors. My expectation is that using
the threading engine may be faster than using PERL
loops.
# Gram Schmidt orthogonalization, vectorized
use warnings;
use strict;
use v5.12;
use PDL;
use PDL::NiceSlice;
use Time::HiRes qw(time);
$PDL::BIGPDL=1;
my ($N,$M)=@ARGV; # dimensions of system
die "Usage: ./gramschmidtv N M with N>=M" unless defined $N && defined $M and $N>=$M;
srand(0); # for reproducible tests.
my $V=random($N,$M); # M random N-vectors.
# The first index is the vector index i and the second the vector number: i,n
my $t0=time();
orthonormalize($V);
my $t1=time();
say "GS vectorized with thread_define time for $N,$M=", $t1-$t0;
# Check orthonormality
my $O=(($V->dummy(2)*$V->dummy(1))->sumover-identity($M))->abs->sum;
say "Orthogonality=$O";
my $t2=time();
say "Total time: ", $t2-$t0;
say $V if $N<=5;
sub orthonormalize {
my $V=shift;
ortho_aux($V, sequence($V->dim(1)), $V);
}
BEGIN{
thread_define 'ortho_aux([io]A(n); p(); B(n,m))', over {
# Orthogonalize one vector A to the first p orthonormal vectors of B.
# With threading it can orthogonalize an arbitrary set basis
# Modifies first argument
my($A, $p, $B)=@_;
my $N=$B(:,0:$p-1); # assume already normalized
$A-=(($A*$N)->sumover*$N->transpose)->sumover if $p>0;
my $q=($A*$A)->sumover->sqrt;
$A/=$q if $q!=0;
};
}
./gramschmidtv.pl 5 4
./gramschmidt.pl 5 4
Results:
GS vectorized with thread_define time for 5,4=0.000516891479492188
Orthogonality=3.12944115066216e-15
Total time: 0.000686883926391602
[
[ 0.1313358 0.57653872 0.074092339 0.66923001 0.44384178]
[ 0.87588479 -0.20720125 0.38458399 -0.1370171 0.15236453]
[ -0.3739352 0.19733953 0.88397582 -0.19940276 0.0074073153]
[ 0.25173294 0.43494961 0.047217819 0.13603524 -0.85247537]
]
Gram Schmidt time for 5,4=0.000303983688354492
Orthogonality=3.12944115066216e-15
Total time: 0.000446081161499023
[
[ 0.1313358 0.57653872 0.074092339 0.66923001 0.44384178]
[ 0.87588479 -0.20720125 0.38458399 -0.1370171 0.15236453]
[ -0.3739352 0.19733953 0.88397582 -0.19940276 0.0074073153]
[ 0.25173294 0.43494961 0.047217819 0.13603524 -0.85247537]
]
To my surprise, given the confusing logic (the ndarray B is silently modified in situ as the argument A is modified, as they share the same locations), it worked!
./gramschmidtv.pl 1000 1000
./gramschmidt.pl 1000 1000
Results:
GS vectorized with thread_define time for 1000,1000=4.72058701515198
Orthogonality=3.69307216405263e-08
Total time: 10.2687690258026
Gram Schmidt time for 1000,1000=4.92849779129028
Orthogonality=3.69626767899845e-08
Total time: 10.0826327800751
The result is as good (or as bad) as the first Gram Schmidt program. It has the advantage that it may be easily generalized to arbitrary inner products. It can also be easily modified to deal with the modified Gram Schmidt algorithm.
Unfortunately, there was no speed increase compared with the
pure PERL/PDL
code! Nevertheless, it may be threaded over additional
dimensions if desired, with no additional work.
However, this approach depends on PDL threading in ascending order of
the index. Thus I worry that it will fail if the calculation is
inadvertently parallelized, probably using PDL::ParallelCPU. I tried
to test this, but I was unable to convince PDL to split my code into
parallel threads.
Conclusion
The winner so far is still the QR factorization, but the Inline::Pdlpp
code is as good as the modified PDL Gram Schmidt algorithm and is
about 5 times faster for the 1000x1000 orthogonalization (I don’t know
why the orthogonality is close but not identical). It would be
useful to be able to call PERL,PDL
code from Inline::Pdlpp
code to
implement more complex inner products, but I guess it won’t be too easy.
The thread_define
solution has the advantage that it can accomodate
a generalized inner products, and it can be threaded over additional
dimensions if present, but it seems it didn’t yield a speed
increase. Maybe, given an algorithm, there is no much speed difference
if the main iteration is done in C
or in PERL
, as the inner
iteration is done in C
but is so large that it eats up most of the
time. Therefore, maybe the next step is to program the QR algorithm in
such a way that it may accomodate other scalar products. Finally, I
would have to test the complex cases.