A tale of linker woes

I wanted to compile MADbench2, which is a program designed to test the interaction of I/O with communication in an HPC environment. It has some prerequisites such as Scalapack, Lapack and their prerequisites. I have root access on this particular cluster, so I was hoping I could just install a few precompiled packages and just run it. Hahahahahaha!

$ sudo yum install lapack-devel lapack
$ sudo yum install scalapack-common scalapack-openmpi scalapack-openmpi-devel scalapack-openmpi-static

Next, try to compile MADbench2:

$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c -lm

And that gives me some errors:

/tmp/ccVFN3Hw.o: In function `define_gang':
MADbench2.c:(.text+0xb42): undefined reference to `blacs_get'
MADbench2.c:(.text+0xb72): undefined reference to `blacs_gridmap'
MADbench2.c:(.text+0xc05): undefined reference to `numroc'
MADbench2.c:(.text+0xc46): undefined reference to `numroc'
MADbench2.c:(.text+0xcb8): undefined reference to `descinit'
MADbench2.c:(.text+0xd0c): undefined reference to `descinit'

Hmm.. that looks like I need another package.

$ sudo yum install blacs-openmpi

I tried building MADbench2 again, but I get the same error. Hmm. When I check /usr/lib64/openmpi/lib, I see libmpiblacs.so.2 and libmpiblacs.so.4, but no libmpiblacs.so. Let’s try this again:

$ sudo yum install blacs-openmpi-devel
$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c -L/usr/lib64/openmpi/lib -lm -lmpiblacs
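
For anyone wondering why the -devel package is needed at all: -lmpiblacs makes the linker look for a file named libmpiblacs.so or libmpiblacs.a in its search path, and the versioned libmpiblacs.so.2 that the runtime package installs doesn’t count. Here is a minimal sketch of that check. The helper name has_link_name is made up for illustration, and it only looks in one directory rather than the linker’s whole search path.

```shell
# Hypothetical helper: succeed if DIR contains a file that -lNAME can resolve,
# i.e. libNAME.so or libNAME.a (versioned names like libNAME.so.2 don't count).
has_link_name() {
    dir=$1
    name=$2
    [ -e "$dir/lib$name.so" ] || [ -e "$dir/lib$name.a" ]
}

# Example use (directory and library name taken from the post):
# has_link_name /usr/lib64/openmpi/lib mpiblacs || echo "install the -devel package"
```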

Now I’m including the location of the library, and I’m linking to it, but, maddeningly, I get the same error. The other thing that bothers me is that the precompiled OpenMPI version for CentOS 7 is v1.10, and I’ve been regularly using 3.80, and I even have v4.0.0 ready to go. I don’t really want my research to use such an old version of OpenMPI. So I decide to compile from source... because that’s always easier, right?

$ wget http://www.netlib.org/scalapack/scalapack_installer.tgz
$ tar zxvf scalapack_installer.tgz
$ ./setup.py --prefix /opt/scalapack --mpibindir /opt/openmpi-4.0.0

Permission denied! It didn’t like where I was trying to install scalapack.

$ sudo ./setup.py --prefix /opt/scalapack --mpibindir /opt/openmpi-4.0.0

Failure! Now it fails because the installer tries to do an mpirun, and OpenMPI doesn’t like you doing that as root. So let’s try to install in my home directory:

$ ./setup.py --prefix /home/kfrye/scalapack --mpibindir /opt/openmpi-4.0.0/bin

Permission denied! Okay. I guess it didn’t like me running this setup file from a directory outside of my home directory. So I moved it to my home and tried again. This was more successful. Now I got an error message:

Please provide a working BLAS library. If a BLAS library
is not present on the system, the reference BLAS library it can be
automatically downloaded and installed by adding the --downblas flag.

Progress!

$ ./setup.py --prefix /home/kfrye/scalapack --mpibindir /opt/openmpi-4.0.0/bin --downblas
... good compiling stuff ... 
Unzip and untar reference BLAS… done
Traceback (most recent call last):
  File "./setup.py", line 51, in <module>
    sys.exit(main(sys.argv))
  File "./setup.py", line 43, in main
    Blas(config, scalapack);
  File "/home/kfrye/scalapack_installer/script/blas.py", line 78, in __init__
    self.down_install_blas()
  File "/home/kfrye/scalapack_installer/script/blas.py", line 187, in down_install_blas
    os.chdir(os.path.join(os.getcwd(),'BLAS'))
OSError: [Errno 2] No such file or directory: '/home/kfrye/scalapack_installer/build/BLAS'

Hmm. That’s weird. So I checked the contents of the build directory and discovered that it had created BLAS-3.8.0 instead of BLAS. I can work around that. So I go into the BLAS-3.8.0 directory and run “make”. Success!

$ ./setup.py --prefix /home/kfrye/scalapack --mpibindir /opt/openmpi-4.0.0/bin --blaslib=/home/kfrye/scalapack/build/BLAS-3.8.0/blas_LINUX.a 

Success! The installation continues:

What do you want to do ?
- d : download the netlib LAPACK
- q : quit and try with another BLAS library or define the
lapacklib parameter.

I tell it to download LAPACK. Everything looks good. Then:

ScaLAPACK installer is starting now. Buckle up!
Downloading ScaLAPACK… done
Installing scalapack-2.0.2 …
Writing SLmake.inc… done.
Compiling BLACS, PBLAS and ScaLAPACK… done
Getting ScaLAPACK version number… 2.0.1
Installation of ScaLAPACK successful.
(log is in /home/kfrye/scalapack_installer/build/log/scalog )
Compiling test routines…
ScaLAPACK: error building ScaLAPACK test routines

Warning: Type mismatch in argument 'ierr' at (1); passed REAL(8) to INTEGER(4)
../../libscalapack.a(igamx2d_.oo): In function `Cigamx2d':
igamx2d_.c:(.text+0x208): undefined reference to `MPI_Type_struct'
../../libscalapack.a(sgamx2d_.oo): In function `Csgamx2d':
sgamx2d_.c:(.text+0x208): undefined reference to `MPI_Type_struct'
../../libscalapack.a(dgamx2d_.oo): In function `Cdgamx2d':
dgamx2d_.c:(.text+0x208): undefined reference to `MPI_Type_struct'
../../libscalapack.a(cgamx2d_.oo): In function `Ccgamx2d':
cgamx2d_.c:(.text+0x210): undefined reference to `MPI_Type_struct'
../../libscalapack.a(zgamx2d_.oo): In function `Czgamx2d':
zgamx2d_.c:(.text+0x208): undefined reference to `MPI_Type_struct'

So it looks like it can’t find OpenMPI, except I explicitly pointed it at OpenMPI, and I know that path works. What I didn’t know then, but know now, is that it’s looking for MPI symbols that newer versions of OpenMPI no longer provide by default. But what I thought at the time was “This installer thing is crap! I need to try something else.”

$ wget http://www.netlib.org/scalapack/scalapack-2.0.2.tgz
$ tar zxvf scalapack-2.0.2.tgz
$ wget http://www.netlib.org/blas/blas-3.8.0.tgz
$ tar zxvf blas-3.8.0.tgz
$ cd blas-3.8.0
$ make

This compiles a bunch of files and gives me a new library blas_LINUX.a. Scalapack also requires Lapack:

$ wget http://www.netlib.org/lapack/lapack-3.8.0.tar.gz
$ tar zxvf lapack-3.8.0.tar.gz
$ cd lapack-3.8.0
$ mkdir build
$ cd build
$ cmake ..
$ make
$ sudo make install

Success! Now back to scalapack:

$ cd scalapack-2.0.2
$ mkdir build
$ cd build
$ cmake ..
$ make

Looks good for a while, and then…

[ 63%] Linking Fortran executable ../../TESTING/xFbtest
../../lib/libscalapack.a(igamx2d_.c.o): In function `igamx2d_':
igamx2d_.c:(.text+0x3fa): undefined reference to `MPI_Type_struct'
../../lib/libscalapack.a(sgamx2d_.c.o): In function `sgamx2d_':
sgamx2d_.c:(.text+0x408): undefined reference to `MPI_Type_struct'
../../lib/libscalapack.a(dgamx2d_.c.o): In function `dgamx2d_':
dgamx2d_.c:(.text+0x408): undefined reference to `MPI_Type_struct'
../../lib/libscalapack.a(cgamx2d_.c.o): In function `cgamx2d_':
cgamx2d_.c:(.text+0x40e): undefined reference to `MPI_Type_struct'

Hmmm! This is the same error that we got running the scalapack installer script. After some googling, I found out that MPI_Type_struct referred to some old functionality that has since been removed from newer versions of OpenMPI. To fix this error, OpenMPI needs to be compiled with configure --enable-mpi1-compatibility. I’ve been meaning to upgrade to v4.0.1 anyway, so let’s do that:

$ wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz
$ tar xvfz openmpi-4.0.1.tar.gz
$ cd openmpi-4.0.1
$ ./configure --prefix=/opt/openmpi-4.0.1 --enable-mpi1-compatibility
$ make
$ make install
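
MPI_Type_struct is one of a handful of MPI-1 names that were removed in the MPI-3.0 standard (its replacement is MPI_Type_create_struct). A quick way to see whether a source tree still uses any of them is to grep for them. This is only a sketch: the helper name removed_mpi1_names is hypothetical, and the list of names is a partial one written from memory, so treat it as an assumption.

```shell
# Hypothetical helper: read C source on stdin and print each removed MPI-1
# name it mentions, one per line, without duplicates. The name list is a
# partial, from-memory set of MPI-3.0 removals; extend it as needed.
removed_mpi1_names() {
    grep -o -w -E 'MPI_(Type_struct|Type_extent|Type_hindexed|Type_hvector|Type_lb|Type_ub|Address|Errhandler_create|Errhandler_get|Errhandler_set)' | sort -u
}

# Example use (the path is a placeholder, not a real location):
# cat path/to/source/*.c | removed_mpi1_names
```

If this prints anything, the code needs either an MPI built with the MPI-1 compatibility option or a port to the newer function names.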

BTW… compiling OpenMPI takes a looong time. I think I went and cleaned my kitchen and made pancakes and probably even had enough time to solve a jigsaw puzzle while it ran. But it works fine without any problems. After completion, I created a modulefile for it by copying the existing 4.0.0 modulefile in /etc/modulefiles/mpi and did a global search and replace to change everything to 4.0.1. All good. Finally, I copied the files to my cluster because I run OpenMPI locally and not on the shared file server.

Next, I went back to scalapack and ran cmake and make again. This time, with the link to the new version of OpenMPI v.4.0.1 with the backward compatibility, scalapack compiles fine, and then I install it in /usr/local/lib. Great! MADbench2 should compile fine now, right?

Hahahahaahahaha!

First I tried the somewhat naive:

$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c -lm

Error!

/tmp/cca9jv5n.o: In function `define_gang':
MADbench2.c:(.text+0xb42): undefined reference to `blacs_get'
MADbench2.c:(.text+0xb72): undefined reference to `blacs_gridmap'
MADbench2.c:(.text+0xc05): undefined reference to `numroc'
MADbench2.c:(.text+0xc46): undefined reference to `numroc'
MADbench2.c:(.text+0xcb8): undefined reference to `descinit'
MADbench2.c:(.text+0xd0c): undefined reference to `descinit'

Let’s try linking in the library:

$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c -lm -lblas

This didn’t help.

$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c -lm -L/usr/local/lib64 -lblas -llapack

No dice. Wait! I was looking at the BLAS library, but it’s looking for BLACS. Whooops!

$ wget http://www.netlib.org/blacs/mpiblacs.tgz
$ tar xvfz mpiblacs.tgz
$ cd BLACS

There are bmake files in the BMAKE directory. So I go in there and:

$ cp Bmake.MPI-LINUX ..
$ cd ..
$ mv Bmake.MPI-LINUX Bmake.inc

I edit the file so that MPIdir = /opt/openmpi-4.0.1 and a couple of other minor changes. Then: make mpi

Error! make[2]: g77: Command not found

Okay. So I need to make some more adjustments to the default compiler settings. I set F77 = mpif77 in Bmake.inc and try again. Success! It creates 3 library files: blacsCinit_MPI-LINUX-0.a, blacsF77init_MPI-LINUX-0.a, blacs_MPI-LINUX-0.a.

I decide to try to compile the tester program that came with the library to make sure everything is working fine.

mpif77  -o /home/kfrye/BLACS/TESTING/EXE/xFbtest_MPI-LINUX-0 blacstest.o btprim_MPI.o tools.o /home/kfrye/BLACS/LIB/blacsF77init_MPI-LINUX-0.a /home/kfrye/BLACS/LIB/blacs_MPI-LINUX-0.a /home/kfrye/BLACS/LIB/blacsF77init_MPI-LINUX-0.a /opt/openmpi-4.0.1/lib//libmpi.so
blacstest.o: In function `dchkamn_':
blacstest.f:(.text+0x12a9): undefined reference to `blacs_gridinfo_'

That’s not good. Hrm. I go ahead and copy the library files into /usr/local/lib64 and try to compile MADbench2 again. I get the same error, complaining about undefined reference to `blacs_get'. Argh!

$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c /usr/local/lib64/blacs_MPI-LINUX-0.a /usr/local/lib64/blacsCinit_MPI-LINUX-0.a /usr/local/lib64/blacsF77init_MPI-LINUX-0.a -lm -L/usr/local/lib64 -lblas -llapack

Same error! My new library files didn’t help. I check the contents of the library files for one of the symbols that was missing when it tried to link the tester program:

$ nm blacsCinit_MPI-LINUX-0.a | grep pinfo
blacs_pinfo_.o:
0000000000000000 T blacs_pinfo__
Cblacs_pinfo.o:
0000000000000000 T Cblacs_pinfo

Good! It’s finding the function. And yet. What’s going on? The error is:

blacstest.f:(.text+0x48bd): undefined reference to `blacs_pinfo_'

But with nm, I can confirm the function blacs_pinfo__ is in one of the libraries. See the difference? There are two underscores instead of one! And, if you go back to the error from MADbench2, it’s looking for function names without any underscores at the end. Is this a problem? It turns out that computers are stupid and, yes, this is a problem. The symbols have to match perfectly for everything to work. Back to the drawing board.

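I read a bunch about the problem, looking for other people with similar issues. It seems this is an issue with the interaction between C and Fortran. Fortran compilers usually add an underscore to the end of function names in the object file, and some (g77, or gfortran with -fsecond-underscore) add a second one when the name already contains an underscore. C compilers on Linux leave names alone, so C code that is meant to be called from Fortran has to be written, usually through preprocessor macros, with whatever suffix the Fortran side expects. And that’s why you can have function names with 0, 1, or 2 underscores at the end. The BLACS tester program seems to be expecting 1 underscore. MADbench2 is looking for functions without any underscores.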
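
Since the linker compares symbol names character by character, it helps to see exactly which spelling a library exports. The sketch below filters nm output for a base name followed by zero, one, or two underscores. The helper name symbol_variants is hypothetical, and it assumes the usual three-column nm format (address, type, name).

```shell
# Hypothetical helper: read `nm` output on stdin and print every defined
# text symbol (type T) that is BASE followed by zero, one, or two underscores.
symbol_variants() {
    base=$1
    awk -v base="$base" '
        $2 == "T" && ($3 == base || $3 == base "_" || $3 == base "__") { print $3 }
    '
}

# Example use (archive name taken from the post):
# nm blacsCinit_MPI-LINUX-0.a | symbol_variants blacs_pinfo
```

Whatever it prints has to match the name in the "undefined reference" message exactly, underscores included.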

Eventually, I figure out how to partially solve the problem. In Bmake.inc, I update:

F77FLAGS       = $(F77NO_OPTFLAGS) -O -fPIC -fno-underscoring

It took a bit for me to figure this out, but IT’S REALLY IMPORTANT to clean the existing compiled object files. If you just change the flags in the Bmake.inc file and rerun make, it will not recompile the existing object files for you, and thus nothing will happen. This is VERY frustrating. But after I did a make clean in the SRC/MPI directory (the make cleanall in the root directory isn’t working for some reason), and reran make mpi in the root directory, I checked the symbol table:

$ nm blacs_MPI-LINUX-0.a | grep blacs_gridinfo_

Partial success!! I’m now getting only one underscore after the function name instead of two. Halfway there! The testing program still isn’t compiling, but it’s bombing out on a different error this time:

blacs_pinfo_.c:(.text+0xa0): undefined reference to `bi_f77_get_constants_'

That looks like a fortran library issue. So back to Bmake.inc and update the fortran flags again:

F77FLAGS       = $(F77NO_OPTFLAGS) -O -fPIC -fno-underscoring -lgfortran

That fixed that problem. Now the testing program fails to compile with:

blacstest.f:(.text+0x6c): undefined reference to `ibtmyproc_'

More importantly, my library functions still have an underscore that MADbench2 doesn’t like. After some more research, I change a different compile option in Bmake.inc:

INTFACE = -DNoChange

I make clean and compile again. This time:

$ nm blacs_MPI-LINUX-0.a | grep grid
U Cblacs_gridinfo
U Cblacs_gridexit
blacs_gridinit_.o:
0000000000000000 T blacs_gridinit

No underscores!! Success!! Of course, my test program won’t like this, because it wants a single underscore, but I care more about MADbench2. So let’s copy my library files over to /usr/local/lib64 and try to compile it again:

$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c /usr/local/lib64/blacs_MPI-LINUX-0.a /usr/local/lib64/blacsCinit_MPI-LINUX-0.a /usr/local/lib64/blacsF77init_MPI-LINUX-0.a -lm -L/usr/local/lib64 -lblas -llapack

Error! But, this time it’s a different error:

MADbench2.c:(.text+0xc05): undefined reference to `numroc'

$ nm /usr/local/lib/libscalapack.a | grep numroc
numroc.f.o:
0000000000000000 T numroc_

Oh, crap. libscalapack.a has the same problem with underscores. I need to go back, fix the compilation flags and compile it again.

I edit SLmake.inc and find a setting for CDEFS that is currently set to -DAdd_, which is what INTFACE in BLACS was set to. So I change this to -DNoChange, and change FCFLAGS to -O3 -fno-underscoring.

After this:

$ rm -rf build
$ mkdir build
$ cd build
$ cmake ..
$ make

And… it didn’t work. But the CMake output is really opaque and I don’t know if it’s using my new flags. So I try compiling it in the root directory just by using “make.” This allows me to see that my flags are being used. For example:

$ mpicc  -c -DNoChange -O3  BI_HypBS.c
$ mpif77 -c -O3 -fno-underscoring iceil.f

After it’s done compiling (which, thankfully, doesn’t have any problems, even though I’m not using CMake), I check the symbol table again:

$ nm libscalapack.a | grep numroc

numroc.o:
0000000000000000 T numroc

Success!! No underscores! Back to compiling MADbench2:

$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c /usr/local/lib64/blacs_MPI-LINUX-0.a /usr/local/lib64/blacsCinit_MPI-LINUX-0.a /usr/local/lib64/blacsF77init_MPI-LINUX-0.a -L/usr/local/lib -lm -lblas -llapack -lscalapack

Now I’m getting a different error:

/usr/local/lib/libscalapack.a(BI_GlobalVars.o):(.bss+0x0): multiple definition of `BI_Stats'
/usr/local/lib64/blacs_MPI-LINUX-0.a(BI_GlobalVars.o):(.bss+0x0): first defined here
/usr/local/lib/libscalapack.a(BI_GlobalVars.o):(.bss+0x10): multiple definition of `BI_SysContxts'

So, it looks like the BLACS functions are built directly into libscalapack, so I don’t need those BLACS libraries after all. So let’s change my compile line for MADbench2:

$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c -L/usr/local/lib -lm -lblas -llapack -lscalapack -lgfortran

This time:

MADbench2.c:(.text+0x3159): undefined reference to `dposv'

At least I know what to look for this time:

$ nm /usr/local/lib64/liblapack.a | grep dposv
dposv.f.o:
0000000000000000 T dposv_

The lapack library needs to be fixed for underscores too.

$ cd lapack-3.8.0
$ cp make.inc.example make.inc
$ vim make.inc
CFLAGS = -O3 -DNoChange
OPTS = -O2 -frecursive -fno-underscoring
$ make lapacklib

The next error that comes up is:

pdgemv_.c:(.text+0x94b): undefined reference to `dgemv'

Turns out this is part of the BLAS library (not the BLACS library!)

$ nm /usr/local/lib64/blas_LINUX.a | grep dgemv
dgemv.o:
0000000000000000 T dgemv

So I need to include that library in my compilation:

$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c /usr/local/lib64/blas_LINUX.a /usr/local/lib64/liblapack.a -L/usr/local/lib -lm -lscalapack -lgfortran

This didn’t help.

$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c /usr/local/lib64/blas_LINUX.a /usr/local/lib64/liblapack.a /usr/local/lib/libscalapack.a -lm  -lgfortran

This should work, but it doesn’t. If you are a more experienced C programmer than I am, you might have said “But wait! Isn’t that in the wrong linker order?” And you would have been right.

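Even though it seems to me like you should list the libraries that everything else depends on first, it turns out the linker works the other way around. It reads the command line left to right, and it only pulls an object file out of a static library if that object defines a symbol that is still undefined at that point. So the code that uses a function has to come before the library that provides it: MADbench2.c first, then ScaLAPACK, then LAPACK, then BLAS. So this, FINALLY, was successful: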

$ mpicc -D SYSTEM -D COLUMBIA -o MADbench2.x MADbench2.c /usr/local/lib/libscalapack.a /usr/local/lib64/liblapack.a /usr/local/lib64/blas_LINUX.a -lm  -lgfortran
$ mpirun -n 4 ./MADbench2.x 640 80 1 8 8 4 4

MADbench 2.0
no_pe = 4 no_pix = 640 no_bin = 80 no_gang = 1 sblocksize = 8 fblocksize = 8 r_mod = 4 w_mod = 4
IOMETHOD = POSIX IOMODE = SYNC FILETYPE = UNIQUE REMAP = CUSTOM
S_cc 3.96 [ 3.95: 3.96]
S_w 0.16 [ 0.16: 0.16]
-------
S_total 4.11 [ 4.11: 4.11]
D_cc 0.06 [ 0.06: 0.06]
-------
D_total 0.06 [ 0.06: 0.06]
W_cc 5.20 [ 5.20: 5.20]
W_r 0.07 [ 0.07: 0.07]
W_w 0.09 [ 0.09: 0.09]
-------
W_total 5.37 [ 5.37: 5.37]
C_cc 1.22 [ 1.22: 1.23]
C_r 2.29 [ 2.29: 2.30]
-------
C_total 3.52 [ 3.52: 3.52]
dC[0] = -4.99994e-01
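
The ordering rule that made this link line work can be written down as a tiny check: every library has to appear before the libraries it calls into. The sketch below is only an illustration. The helper name check_link_order and the "user:provider" notation are made up, and it assumes each library is listed once.

```shell
# Hypothetical helper: check a static-link order.
#   $1: libraries in command-line order, space separated
#   $2: "user:provider" pairs, meaning user calls functions in provider
# Prints each pair whose provider does not come after its user and
# returns non-zero if there is any such pair.
check_link_order() {
    order=$1
    deps=$2
    bad=0
    for pair in $deps; do
        user=${pair%%:*}
        provider=${pair#*:}
        upos=0
        ppos=0
        i=0
        for lib in $order; do
            i=$((i + 1))
            if [ "$lib" = "$user" ]; then upos=$i; fi
            if [ "$lib" = "$provider" ]; then ppos=$i; fi
        done
        if [ "$ppos" -le "$upos" ]; then
            echo "$provider must come after $user"
            bad=1
        fi
    done
    return $bad
}

# Example use with the three libraries from the post:
# check_link_order "scalapack lapack blas" "scalapack:lapack scalapack:blas lapack:blas"
```

GNU ld's --start-group and --end-group options are another way around ordering problems, since they make the linker rescan the grouped archives until nothing new resolves, at some cost in link time.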

Success!!! And, yes, this took a very large chunk of a beautiful Spring Saturday that I probably should have spent outside instead of sitting at my computer. But I learned a heck of a lot about the exchange between Fortran and C, and how to fix linker problems.
