The Big Red Cluster

Outstanding Problems

Suspect files on GPFS

The GPFS file system serving Big Red was returned to service at 15:30 EDT on Friday, June 20. As a result of the multiple power outages between June 4 and June 18, data had to be migrated off several disks, and some data may have been lost in the process. We have placed a list of all potentially damaged files for each user in /N/gpfsbr/gpfs_suspect_files/USERNAME, where USERNAME is your username. If any of your files are listed, please restore them from backup, if available, or verify them by some other means.
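As a starting point for reviewing your list, here is a minimal Python sketch. It assumes the list contains one file path per line, which this notice does not confirm, so adjust the parsing if the real format differs. The function names are hypothetical helpers, not part of any Big Red tooling.

```python
from pathlib import Path


def read_suspect_list(list_path):
    """Return the non-empty lines of a suspect-file list.

    Assumption: one file path per line; the real list format may differ.
    """
    text = Path(list_path).read_text(errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_by_existence(paths):
    """Split paths into (present, missing) based on what is on disk now."""
    present = [p for p in paths if Path(p).exists()]
    missing = [p for p in paths if not Path(p).exists()]
    return present, missing
```

To use it, pass /N/gpfsbr/gpfs_suspect_files/USERNAME (with your own username) to read_suspect_list, then pass the result to split_by_existence. A file that is still present is not necessarily intact, so it still needs to be restored from backup or verified by some other means.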

"Signal 15" errors

MPICH-MX job processes are susceptible to termination when one or more nodes allocated to the job have no available Myrinet ports. This situation results from "orphaned" processes left behind by previous MPICH-MX jobs, and it often shows up in job output files as the string "Killed by signal 15". Because existing scheduler policy allows a single user to execute multiple jobs on one or more nodes, these orphaned processes are difficult to identify. Administrators continue to work on a solution.
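To check whether your own jobs were affected, you can search your job output files for that string. The following Python sketch is one way to do it. The default "*.o*" file pattern is an assumption about how your output files are named, and the function name is a hypothetical helper. Change both to suit your setup.

```python
from pathlib import Path

MARKER = "Killed by signal 15"


def find_signal15_outputs(directory, pattern="*.o*"):
    """Return files in directory matching pattern that contain MARKER.

    Assumption: the "*.o*" pattern matches your job output file names.
    """
    hits = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        # Read line by line so large output files are not loaded whole.
        with path.open(errors="replace") as handle:
            if any(MARKER in line for line in handle):
                hits.append(path)
    return hits
```

Calling find_signal15_outputs on the directory that holds your job output returns the files that contain the string, which identifies the jobs to resubmit.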