Thursday, October 11, 2007

Errors you may run into with MPI

1. p1_xxxxx: p4_error: interrupt SIGSEGV: 11

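This error usually means that one of the processes hit a segmentation fault.
Mistakes in my own code that have caused it:
1. Allocating memory for a pointer in only one process and not in the others, so the broadcast failed
2. Connecting to the MySQL database in one process but closing the connection in every process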

Someone online explained it well:
"There are 2 things to check.
* Run one of the test programs like pi3.f or cpi.c to see whether your cluster's OK.
* if it is, the fault is in your code. See if you're exceeding array bounds or accessing memory which you haven't allocated, There's a SIGSEGV error - that's a segmentation violation. That might explain stuff like
bm_list_21829: p4_error: interrupt SIGINT: 2
Once you have a seg. violation, all the 4 processors are sent a signal to interrupt the process (SIGINT). Signals are defined in /usr/include/sys/signal.h (at least on the SGIs; might be
different on other systems). "

2. p1_10401: p4_error: : 14
1 - MPI_BCAST : Message truncated
[1] Aborting program !
[1] Aborting program!

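This one is also a buffer problem: the receiving side of MPI_Bcast does not have enough room for the message. Allocate a large enough buffer before calling MPI_Bcast, and pass the same count as the root, and the message will no longer be truncated.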

3. p4_error: alloc_p4_msg failed:

p0_6773: (7.828703) xx_shmalloc: returning NULL; requested 1048616 bytes
p0_6773: (7.828762) p4_shmalloc returning NULL; request = 1048616 bytes
Not enough memory was available to p4. You can give the program more by setting the environment variable P4_GLOBMEMSIZE (in bytes) to a larger value:
export P4_GLOBMEMSIZE=32000000 (for bash users)
setenv P4_GLOBMEMSIZE 32000000 (for csh or tcsh users)
 
4. libcprts.so.5: cannot open shared object file: No such file or directory
 
/home/jbrandt/tests/test.exe: error while loading shared libraries:
libcprts.so.5: cannot open shared object file: No such file or directory
p0_792: p4_error: Child process exited while making connection to remote process on compute-0-0.local: 0
/opt/mpich/intel/bin/mpirun: line 1: 792 Broken pipe /home/jbrandt/tests/test.exe -p4pg /home/jbrandt/tests/PI646 -p4wd /home/jbrandt/tes
 
The program was not linked statically, so the shared library could not be found at run time. Recompiling with -static fixed it.
