I wanted to compile MADbench2, which is a program designed to test the interaction of I/O with communication in an HPC environment. It has some prerequisites such as Scalapack, Lapack and their prerequisites. I have root access on this particular cluster, so I was hoping I could just install a few precompiled packages and just run it. Hahahahahaha!
$ sudo yum install lapack-devel lapack
$ sudo yum install scalapack-common scalapack-openmpi scalapack-openmpi-devel scalapack-openmpi-static
Next, try to compile MADbench2:
$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c -lm
And that gives me some errors:
/tmp/ccVFN3Hw.o: In function `define_gang':
MADbench2.c:(.text+0xb42): undefined reference to `blacs_get'
MADbench2.c:(.text+0xb72): undefined reference to `blacs_gridmap'
MADbench2.c:(.text+0xc05): undefined reference to `numroc'
MADbench2.c:(.text+0xc46): undefined reference to `numroc'
MADbench2.c:(.text+0xcb8): undefined reference to `descinit'
MADbench2.c:(.text+0xd0c): undefined reference to `descinit'
Hmm.. that looks like I need another package.
$ sudo yum install blacs-openmpi
I tried building MADbench2 again, but I get the same error. Hmm. When I check /usr/lib64/openmpi/lib, I see libmpiblacs.so.2 and libmpiblacs.so.4, but no libmpiblacs.so. Let’s try this again:
$ sudo yum install blacs-openmpi-devel
$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c -L/usr/lib64/openmpi/lib -lm -lmpiblacs
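A side note on why the -devel package matters: -lmpiblacs only matches an unversioned libmpiblacs.so (or a static libmpiblacs.a), so versioned files like libmpiblacs.so.2 on their own are not enough. A minimal sketch of a check for that follows; the function name has_linkable_lib is made up for illustration.

```shell
#!/usr/bin/env bash
# Succeeds if <dir> holds the unversioned lib<name>.so or lib<name>.a
# that "-l<name>" searches for. Versioned files (lib<name>.so.2) do not count.
# has_linkable_lib is a hypothetical helper, not part of any package.
has_linkable_lib() {
    local dir="$1" name="$2"
    [ -e "$dir/lib$name.so" ] || [ -e "$dir/lib$name.a" ]
}

# Example with the directory and library name from this post:
if has_linkable_lib /usr/lib64/openmpi/lib mpiblacs; then
    echo "-lmpiblacs can be resolved"
else
    echo "no libmpiblacs.so or libmpiblacs.a here: the -devel package is probably missing"
fi
```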
Now I’m including the location of the library, and I’m linking to it, but, maddeningly, I get the same error. The other thing that bothers me is that the precompiled OpenMPI version for CentOS 7 is v1.10, and I’ve been regularly using 3.80, and I even have v4.0.0 ready to go. I don’t really want my research to use such an old version of OpenMPI. So I decide to compile from source.. because that’s always easier, right?
$ wget http://www.netlib.org/scalapack/scalapack_installer.tgz
$ tar zxvf scalapack_installer.tgz
$ ./setup.py --prefix /opt/scalapack --mpibindir /opt/openmpi-4.0.0
Permission denied! It didn’t like where I was trying to install scalapack.
$ sudo ./setup.py --prefix /opt/scalapack --mpibindir /opt/openmpi-4.0.0
Failure! Now it fails because the installer tries to do an mpirun, and OpenMPI doesn’t like you doing that as root. So let’s try to install in my home directory:
$ ./setup.py --prefix /home/kfrye/scalapack --mpibindir /opt/openmpi-4.0.0/bin
Permission denied! Okay. I guess it didn’t like me running this setup file from a directory outside of my home directory. So I moved it to my home and tried again. This was more successful. Now I got an error message:
Please provide a working BLAS library. If a BLAS library
is not present on the system, the reference BLAS library it can be
automatically downloaded and installed by adding the --downblas flag.
Progress!
$ ./setup.py --prefix /home/kfrye/scalapack --mpibindir /opt/openmpi-4.0.0/bin --downblas
... good compiling stuff ...
Unzip and untar reference BLAS… done
Traceback (most recent call last):
  File "./setup.py", line 51, in <module>
    sys.exit(main(sys.argv))
  File "./setup.py", line 43, in main
    Blas(config, scalapack);
  File "/home/kfrye/scalapack_installer/script/blas.py", line 78, in __init__
    self.down_install_blas()
  File "/home/kfrye/scalapack_installer/script/blas.py", line 187, in down_install_blas
    os.chdir(os.path.join(os.getcwd(),'BLAS'))
OSError: [Errno 2] No such file or directory: '/home/kfrye/scalapack_installer/build/BLAS'
Hmm. That’s weird. So I check the contents of the build directory and discover that it had created BLAS-3.8.0 instead of BLAS. I can work around that. So I go into the BLAS-3.8.0 directory and run “make”. Success!
$ ./setup.py --prefix /home/kfrye/scalapack --mpibindir /opt/openmpi-4.0.0/bin --blaslib=/home/kfrye/scalapack/build/BLAS-3.8.0/blas_LINUX.a
Success! The installation continues:
What do you want to do ?
- d : download the netlib LAPACK
- q : quit and try with another BLAS library or define the
lapacklib parameter.
I tell it to download LAPACK. Everything looks good. Then:
ScaLAPACK installer is starting now. Buckle up!
Downloading ScaLAPACK… done
Installing scalapack-2.0.2 …
Writing SLmake.inc… done.
Compiling BLACS, PBLAS and ScaLAPACK… done
Getting ScaLAPACK version number… 2.0.1
Installation of ScaLAPACK successful.
(log is in /home/kfrye/scalapack_installer/build/log/scalog )
Compiling test routines…
ScaLAPACK: error building ScaLAPACK test routines
Warning: Type mismatch in argument 'ierr' at (1); passed REAL(8) to INTEGER(4)
../../libscalapack.a(igamx2d_.oo): In function `Cigamx2d':
igamx2d_.c:(.text+0x208): undefined reference to `MPI_Type_struct'
../../libscalapack.a(sgamx2d_.oo): In function `Csgamx2d':
sgamx2d_.c:(.text+0x208): undefined reference to `MPI_Type_struct'
../../libscalapack.a(dgamx2d_.oo): In function `Cdgamx2d':
dgamx2d_.c:(.text+0x208): undefined reference to `MPI_Type_struct'
../../libscalapack.a(cgamx2d_.oo): In function `Ccgamx2d':
cgamx2d_.c:(.text+0x210): undefined reference to `MPI_Type_struct'
../../libscalapack.a(zgamx2d_.oo): In function `Czgamx2d':
zgamx2d_.c:(.text+0x208): undefined reference to `MPI_Type_struct'
So it looks like it can’t find OpenMPI, except I explicitly included a link to OpenMPI, and I know that link works. What I didn’t know then, but know now, is that it’s looking for old MPI-1 symbols such as MPI_Type_struct, which were removed from the MPI standard in MPI-3.0 and are no longer available by default in newer versions of OpenMPI. But what I thought at the time was “This installer thing is crap! I need to try something else.”
$ wget http://www.netlib.org/scalapack/scalapack-2.0.2.tgz
$ tar zxvf scalapack-2.0.2.tgz
$ wget http://www.netlib.org/blas/blas-3.8.0.tgz
$ tar zxvf blas-3.8.0.tgz
$ cd BLAS-3.8.0
$ make
This compiles a bunch of files and gives me a new library blas_LINUX.a. Scalapack also requires Lapack:
$ wget http://www.netlib.org/lapack/lapack-3.8.0.tar.gz
$ tar zxvf lapack-3.8.0.tar.gz
$ cd lapack-3.8.0
$ mkdir build
$ cd build
$ cmake ..
$ make
$ sudo make install
Success! Now back to scalapack:
$ cd scalapack-2.0.2
$ mkdir build
$ cd build
$ cmake ..
$ make
Looks good for a while, and then…
[ 63%] Linking Fortran executable ../../TESTING/xFbtest
../../lib/libscalapack.a(igamx2d_.c.o): In function `igamx2d_':
igamx2d_.c:(.text+0x3fa): undefined reference to `MPI_Type_struct'
../../lib/libscalapack.a(sgamx2d_.c.o): In function `sgamx2d_':
sgamx2d_.c:(.text+0x408): undefined reference to `MPI_Type_struct'
../../lib/libscalapack.a(dgamx2d_.c.o): In function `dgamx2d_':
dgamx2d_.c:(.text+0x408): undefined reference to `MPI_Type_struct'
../../lib/libscalapack.a(cgamx2d_.c.o): In function `cgamx2d_':
cgamx2d_.c:(.text+0x40e): undefined reference to `MPI_Type_struct'
Hmmm! This is the same error that we got running the scalapack installer script. After some googling, I found out that MPI_Type_struct is old MPI-1 functionality that has since been removed from newer versions of OpenMPI. To fix this error, OpenMPI needs to be configured with --enable-mpi1-compatibility. I’ve been meaning to upgrade to v4.0.1 anyway, so let’s do that:
$ wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz
$ tar xvfz openmpi-4.0.1.tar.gz
$ cd openmpi-4.0.1
$ ./configure --prefix=/opt/openmpi-4.0.1 --enable-mpi1-compatibility
$ make
$ make install
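If you want to know ahead of time whether a source tree depends on the MPI-1 calls that were removed in MPI-3.0, a grep over the sources is usually enough. The sketch below is one way to do it; find_removed_mpi1 is a made-up helper name, and the pattern only covers the removed functions, not the removed MPI_LB and MPI_UB constants.

```shell
#!/usr/bin/env bash
# List source lines that call MPI-1 functions removed in MPI-3.0:
# MPI_Address, MPI_Errhandler_{create,get,set},
# MPI_Type_{extent,hindexed,hvector,lb,struct,ub}.
# find_removed_mpi1 is a hypothetical helper; -w keeps the match to whole words.
find_removed_mpi1() {
    local dir="${1:-.}"
    grep -rnwE \
        'MPI_(Address|Errhandler_(create|get|set)|Type_(extent|hindexed|hvector|lb|struct|ub))' \
        "$dir"
}

# Usage (path is only an example):
#   find_removed_mpi1 scalapack-2.0.2
```

Any hit means the code needs either an OpenMPI built with --enable-mpi1-compatibility or a port to the replacement calls (for example MPI_Type_create_struct instead of MPI_Type_struct).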
BTW… compiling OpenMPI takes a looong time. I think I went and cleaned my kitchen and made pancakes and probably even had enough time to solve a jigsaw puzzle while it ran. But it works fine without any problems. After completion, I created a modulefile for it by copying the existing 4.0.0 modulefile in /etc/modulefiles/mpi and did a global search and replace to change everything to 4.0.1. All good. Finally, I copied the files to my cluster because I run openmpi locally and not on the shared file server.
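The modulefile copy-and-replace step can be scripted in one go. This is only a sketch: clone_modulefile is a made-up name, and the actual modulefile names under /etc/modulefiles/mpi are an assumption you would need to adjust.

```shell
#!/usr/bin/env bash
# Copy a modulefile, rewriting every occurrence of the old version string.
# Dots in the old version are escaped so "4.0.0" is matched literally.
# clone_modulefile is a hypothetical helper.
clone_modulefile() {
    local src="$1" dst="$2" old="$3" new="$4" old_re
    old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
    sed "s/${old_re}/${new}/g" "$src" > "$dst"
}

# Usage (file names are assumptions):
#   clone_modulefile /etc/modulefiles/mpi/OLD_FILE /etc/modulefiles/mpi/NEW_FILE 4.0.0 4.0.1
```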
Next, I went back to scalapack and ran cmake and make again. This time, linked against the new OpenMPI v4.0.1 built with the backward compatibility option, scalapack compiles fine, and then I install it in /usr/local/lib. Great! MADbench2 should compile fine now, right?
Hahahahaahahaha!
First I tried the somewhat naive:
$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c -lm
Error!
/tmp/cca9jv5n.o: In function `define_gang':
MADbench2.c:(.text+0xb42): undefined reference to `blacs_get'
MADbench2.c:(.text+0xb72): undefined reference to `blacs_gridmap'
MADbench2.c:(.text+0xc05): undefined reference to `numroc'
MADbench2.c:(.text+0xc46): undefined reference to `numroc'
MADbench2.c:(.text+0xcb8): undefined reference to `descinit'
MADbench2.c:(.text+0xd0c): undefined reference to `descinit'
Let’s try linking in the library:
$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c -lm -lblas
This didn’t help.
$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c -lm -L/usr/local/lib64 -lblas -llapack
No dice. Wait! I was looking at the BLAS library, but it’s looking for BLACS. Whooops!
$ wget http://www.netlib.org/blacs/mpiblacs.tgz
$ tar xvfz mpiblacs.tgz
$ cd BLACS
There are bmake files in the BMAKE directory. So I go in there and:
$ cp Bmake.MPI-LINUX ..
$ cd ..
$ mv Bmake.MPI-LINUX Bmake.inc
I edit the file so that MPIdir = /opt/openmpi-4.0.1 and a couple of other minor changes. Then: make mpi
Error! make[2]: g77: Command not found
Okay. So I need to make some more adjustments to the default compiler settings. I set F77 = mpif77 in Bmake.inc and try again. Success! It creates 3 library files: blacsCinit_MPI-LINUX-0.a, blacsF77init_MPI-LINUX-0.a, blacs_MPI-LINUX-0.a.
I decide to try to compile the tester program that came with the library to make sure everything is working fine.
mpif77 -o /home/kfrye/BLACS/TESTING/EXE/xFbtest_MPI-LINUX-0 blacstest.o btprim_MPI.o tools.o /home/kfrye/BLACS/LIB/blacsF77init_MPI-LINUX-0.a /home/kfrye/BLACS/LIB/blacs_MPI-LINUX-0.a /home/kfrye/BLACS/LIB/blacsF77init_MPI-LINUX-0.a /opt/openmpi-4.0.1/lib//libmpi.so
blacstest.o: In function `dchkamn_':
blacstest.f:(.text+0x12a9): undefined reference to `blacs_gridinfo_'
That’s not good. Hrm. I go ahead and copy the library files into /usr/local/lib64 and try to compile MADbench2 again. I get the same error, complaining about undefined reference to `blacs_get'. Argh!
$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c /usr/local/lib64/blacs_MPI-LINUX-0.a /usr/local/lib64/blacsCinit_MPI-LINUX-0.a /usr/local/lib64/blacsF77init_MPI-LINUX-0.a -lm -L/usr/local/lib64 -lblas -llapack
Same error! My new library files didn’t help. I check the contents of the library files for one of the functions that was missing when it tried to compile the tester program:
$ nm blacsCinit_MPI-LINUX-0.a | grep pinfo
blacs_pinfo_.o:
0000000000000000 T blacs_pinfo__
Cblacs_pinfo.o:
0000000000000000 T Cblacs_pinfo
Good! It’s finding the function. And yet. What’s going on? The error is:
blacstest.f:(.text+0x48bd): undefined reference to `blacs_pinfo_'
But with nm, I can confirm the function blacs_pinfo__ is in one of the libraries. See the difference? There are two underscores instead of one! And, if you go back to the error from MADbench2, it’s looking for function names without any underscores at the end. Is this a problem? It turns out that computers are stupid and, yes, this is a problem. The symbols have to match perfectly for everything to work. Back to the drawing board.
I read a bunch about the problem, looking for other people with similar issues. It seems this is an issue with the interaction between C and Fortran. Fortran compilers usually add an underscore to the end of function names, and some add a second one when the name already contains an underscore. C compilers leave names alone, so C code that talks to Fortran has to be built to expect whichever convention the Fortran side uses. And that’s why you can have function names with 0, 1, or 2 underscores at the end. The BLACS tester program seems to be expecting 1 underscore. MADbench2 is looking for functions without any underscores.
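Since I ended up running nm and squinting at underscores a lot, a small filter helps. The sketch below reads nm output on stdin and prints whichever spelling of a routine is actually defined; defined_variants is a made-up name, and it only looks at text symbols (type T).

```shell
#!/usr/bin/env bash
# Read `nm` output on stdin and print which spelling of a routine is defined
# as a text symbol (type T): plain, one trailing underscore, or two.
# defined_variants is a hypothetical helper.
defined_variants() {
    local name="$1"
    awk -v n="$name" '
        $2 == "T" && ($3 == n || $3 == (n "_") || $3 == (n "__")) { print $3 }
    '
}

# Usage:
#   nm blacs_MPI-LINUX-0.a | defined_variants blacs_pinfo
```

Comparing that output with the name in the "undefined reference" message shows straight away how many underscores the caller expects versus how many the library provides.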
Eventually, I figure out how to partially solve the problem. In Bmake.inc, I update:
F77FLAGS = $(F77NO_OPTFLAGS) -O -fPIC -fno-underscoring
It took a bit for me to figure this out, but IT’S REALLY IMPORTANT to clean the existing compiled object files. If you just change the flags in the Bmake.inc file and rerun make, it will not recompile the existing object files for you, and thus nothing will happen. This is VERY frustrating. But after I did a make clean in the SRC/MPI directory (the make cleanall in the root directory isn’t working for some reason), and reran make mpi in the root directory, I checked the symbol table:
$ nm blacs_MPI-LINUX-0.a | grep blacs_gridinfo_
Partial success!! I’m now getting only one underscore after the function name instead of two. Halfway there! The testing program still isn’t compiling, but it’s bombing out on a different error this time:
blacs_pinfo_.c:(.text+0xa0): undefined reference to `bi_f77_get_constants_'
That looks like a fortran library issue. So back to Bmake.inc and update the fortran flags again:
F77FLAGS = $(F77NO_OPTFLAGS) -O -fPIC -fno-underscoring -lgfortran
That fixed that problem. Now the testing program fails to compile with:
blacstest.f:(.text+0x6c): undefined reference to `ibtmyproc_'
More importantly, my library functions still have an underscore that MADbench2 doesn’t like. After some more research, I change a different compile option in Bmake.inc:
INTFACE = -DNoChange
I make clean and compile again. This time:
$ nm blacs_MPI-LINUX-0.a | grep grid
U Cblacs_gridinfo
U Cblacs_gridexit
blacs_gridinit_.o:
0000000000000000 T blacs_gridinit
No underscores!! Success!! Of course, my test program won’t like this, because it wants a single underscore, but I care more about MADbench2. So let’s copy my library files over to /usr/local/lib64 and try to compile it again:
$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c /usr/local/lib64/blacs_MPI-LINUX-0.a /usr/local/lib64/blacsCinit_MPI-LINUX-0.a /usr/local/lib64/blacsF77init_MPI-LINUX-0.a -lm -L/usr/local/lib64 -lblas -llapack
Error! But, this time it’s a different error:
MADbench2.c:(.text+0xc05): undefined reference to `numroc'
$ nm /usr/local/lib/libscalapack.a | grep numroc
numroc.f.o:
0000000000000000 T numroc_
Oh, crap. libscalapack.a has the same problem with underscores. I need to go back, fix the compilation flags and compile it again.
I edit SLmake.inc and find a setting for CDEFS that is currently set to -DAdd_, which is what INTFACE in BLACS was set to. So I change this to -DNoChange, and change FCFLAGS to -O3 -fno-underscoring.
After this:
$ rm -rf build
$ mkdir build
$ cd build
$ cmake ..
$ make
And… it didn’t work. But the CMake output is really opaque and I don’t know if it’s using my new flags. So I try compiling it in the root directory just by using “make.” This allows me to see that my flags are being used. For example:
$ mpicc -c -DNoChange -O3 BI_HypBS.c
$ mpif77 -c -O3 -fno-underscoring iceil.f
After it’s done compiling (which, thankfully, doesn’t have any problems, even though I’m not using CMake):
$ nm libscalapack.a | grep numroc
numroc.o:
0000000000000000 T numroc
Success!! No underscores! Back to compiling MADbench2:
$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c /usr/local/lib64/blacs_MPI-LINUX-0.a /usr/local/lib64/blacsCinit_MPI-LINUX-0.a /usr/local/lib64/blacsF77init_MPI-LINUX-0.a -L/usr/local/lib -lm -lblas -llapack -lscalapack
Now I’m getting a different error:
/usr/local/lib/libscalapack.a(BI_GlobalVars.o):(.bss+0x0): multiple definition of `BI_Stats'
/usr/local/lib64/blacs_MPI-LINUX-0.a(BI_GlobalVars.o):(.bss+0x0): first defined here
/usr/local/lib/libscalapack.a(BI_GlobalVars.o):(.bss+0x10): multiple definition of `BI_SysContxts'
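One way to confirm that two archives really define the same symbols is to save the nm output of each and intersect the defined names. This is a rough sketch: defined_syms and common_defined are made-up names, and only symbol types T, D and B (text, data, bss) are considered.

```shell
#!/usr/bin/env bash
# Print symbols defined in both of two saved `nm` listings.
# Only types T, D and B (text, data, bss) are treated as definitions.
# defined_syms and common_defined are hypothetical helpers.
defined_syms() {
    awk 'NF == 3 && $2 ~ /^[TDB]$/ { print $3 }' "$1" | sort -u
}

common_defined() {
    # Each list is already unique, so a name seen twice is in both archives.
    { defined_syms "$1"; defined_syms "$2"; } | sort | uniq -d
}

# Usage:
#   nm /usr/local/lib/libscalapack.a       > scalapack.nm
#   nm /usr/local/lib64/blacs_MPI-LINUX-0.a > blacs.nm
#   common_defined scalapack.nm blacs.nm
```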
So, it looks like the BLACS functions are built directly into libscalapack, so I don’t need those BLACS libraries after all. So let’s change my compile line for MADbench2:
$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c -L/usr/local/lib -lm -lblas -llapack -lscalapack -lgfortran
This time:
MADbench2.c:(.text+0x3159): undefined reference to `dposv'
At least I know what to look for this time:
$ nm /usr/local/lib64/liblapack.a | grep dposv
dposv.f.o:
0000000000000000 T dposv_
The lapack library needs to be fixed for underscores too.
$ cd lapack-3.8.0
$ cp make.inc.example make.inc
$ vim make.inc
CFLAGS = -O3 -DNoChange
OPTS = -O2 -frecursive -fno-underscoring
$ make lapacklib
The next error that comes up is:
pdgemv_.c:(.text+0x94b): undefined reference to `dgemv'
Turns out this is part of the BLAS library (not the BLACS library!)
$ nm /usr/local/lib64/blas_LINUX.a | grep dgemv
dgemv.o:
0000000000000000 T dgemv
So I need to include that library in my compilation:
$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c /usr/local/lib64/blas_LINUX.a /usr/local/lib64/liblapack.a -L/usr/local/lib -lm -lscalapack -lgfortran
This didn’t help.
$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c /usr/local/lib64/blas_LINUX.a /usr/local/lib64/liblapack.a /usr/local/lib/libscalapack.a -lm -lgfortran
This should work, but it doesn’t. If you are a more experienced C programmer than I am, you might have said “But wait! Isn’t that in the wrong linker order?” And you would have been right.
Even though it seems to me like you should include the files that the other files depend on first, it turns out that you need to list the dependent files first. The linker works through the libraries from left to right and only pulls in the parts of a static library that satisfy symbols it already knows are missing, so a library has to come after everything that uses it. So this, FINALLY, was successful:
$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c /usr/local/lib/libscalapack.a /usr/local/lib64/liblapack.a /usr/local/lib64/blas_LINUX.a -lm -lgfortran
$ mpirun -n 4 ./MADbench2.x 640 80 1 8 8 4 4
MADbench 2.0
no_pe = 4 no_pix = 640 no_bin = 80 no_gang = 1 sblocksize = 8 fblocksize = 8 r_mod = 4 w_mod = 4
IOMETHOD = POSIX IOMODE = SYNC FILETYPE = UNIQUE REMAP = CUSTOM
S_cc 3.96 [ 3.95: 3.96]
S_w 0.16 [ 0.16: 0.16]
-------
S_total 4.11 [ 4.11: 4.11]
D_cc 0.06 [ 0.06: 0.06]
-------
D_total 0.06 [ 0.06: 0.06]
W_cc 5.20 [ 5.20: 5.20]
W_r 0.07 [ 0.07: 0.07]
W_w 0.09 [ 0.09: 0.09]
-------
W_total 5.37 [ 5.37: 5.37]
C_cc 1.22 [ 1.22: 1.23]
C_r 2.29 [ 2.29: 2.30]
-------
C_total 3.52 [ 3.52: 3.52]
dC[0] = -4.99994e-01
Success!!! And, yes, this took a very large chunk of a beautiful Spring Saturday that I probably should have spent outside instead of sitting at my computer. But I learned a heck of a lot about the exchange between Fortran and C, and how to fix linker problems.
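P.S. The linker-order rule is easier to believe once you see it play out, so here is a toy model in shell. It is a deliberate simplification: each archive is treated as a single object, the dependency table is illustrative (loosely based on the symbols in this post), and simulate_link is a made-up name. It is not how ld is implemented, just the same left-to-right idea.

```shell
#!/usr/bin/env bash
# Toy model of one left-to-right pass over static archives.
# First argument: symbols the main object needs (space separated).
# Remaining arguments: "name=defined,symbols:needed,symbols" (colon required).
# An archive is pulled in only if it defines something undefined at that moment.
# simulate_link is a hypothetical helper; real archives hold many objects.
simulate_link() {
    local undefined="$1" defined="" spec defs needs sym pulled
    shift
    for spec in "$@"; do
        defs="${spec#*=}"; needs="${defs#*:}"; defs="${defs%%:*}"
        defs=$(printf '%s' "$defs" | tr ',' ' ')
        needs=$(printf '%s' "$needs" | tr ',' ' ')
        pulled=no
        for sym in $defs; do
            case " $undefined " in *" $sym "*) pulled=yes ;; esac
        done
        [ "$pulled" = yes ] || continue          # nothing wanted yet: skipped for good
        for sym in $defs; do
            defined="$defined $sym"
            undefined=$(printf '%s\n' $undefined | grep -vxF "$sym" | tr '\n' ' ')
        done
        for sym in $needs; do
            case " $defined $undefined " in
                *" $sym "*) ;;
                *) undefined="$undefined $sym" ;;
            esac
        done
    done
    echo "undefined:" $undefined
}

# Order that worked: users first, providers last.
simulate_link "numroc descinit dposv" \
    "scalapack=numroc,descinit,pdgemv:dgemv" "lapack=dposv:dgemv" "blas=dgemv:"
# Order that failed: BLAS is visited before anyone has asked for dgemv.
simulate_link "numroc descinit dposv" \
    "blas=dgemv:" "lapack=dposv:dgemv" "scalapack=numroc,descinit,pdgemv:dgemv"
```

The first call prints "undefined:" with nothing after it, and the second prints "undefined: dgemv", which is the same complaint I got from the real linker. GNU ld can also be told to rescan a set of archives with -Wl,--start-group ... -Wl,--end-group, but getting the order right is the simpler fix.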