The Big Red Cluster

Outages and Hardware Service

2008

Date System(s) Problem
Oct 3 BigRed Big Red suffered a power outage around 3:03 AM, Friday, Oct 3rd, EDT. All blades were affected. Service was fully restored as of 7:20 AM.
Sep 14 BigRed Big Red suffered a power outage at approximately 3:15pm, Sunday, September 14. Thirty-five compute nodes power-cycled during the outage; deferred jobs were requeued, and service restored at approximately 4:00pm.
Sep 12 BigRed From 10:05am EDT until 10:23am EDT, Big Red suffered a network outage which affected connections to user home directories, the Data Capacitor, GPFSWAN, and other external connections. Running jobs should have paused until connectivity was restored.
Sep 3 BigRed From 1:50pm EDT until 2:04pm EDT, Big Red suffered a network outage which affected connections to user home directories, the Data Capacitor, GPFSWAN, and other external connections. Running jobs should have paused until connectivity was restored.
Aug 28 BigRed At 3:54pm EDT Big Red suffered a network outage which affected connections to the Data Capacitor, the home directories, GPFSWAN, and external connections. The network was restored at 8:20pm. Running jobs relying on those connections paused until connectivity was restored.
Aug 14 BigRed nodes Big Red login nodes h1 and h2 were rebooted at 2:53pm EDT in order to resolve an issue with the Data Capacitor.
Aug 13 BigRed Job scheduling was paused for approximately seven hours while technicians resolved a chiller outage in the data center. The measure was precautionary; no jobs were lost.
Jul 23 BigRed The Big Red machine room experienced a power failure at 1:42am EDT; power was restored shortly thereafter, but GPFS was unavailable until 10:18am.
Jun 24 Data Capacitor Problems accessing the Data Capacitor WAN filesystem.
Jun 14-18 BigRed At approximately 3:00pm, Saturday, June 14, the IU data center lost power. BigRed was returned to service, following a series of power infrastructure upgrades, at 8:00pm, Wednesday, June 18.
Jun 4-20 GPFS GPFS was unavailable during this period as a result of disk errors caused by multiple site power outages. The recovery process may have left some data files corrupted. Please see /N/gpfsbr/gpfs_suspect_files/USERNAME for a list of files that may be affected. Recover these files from backup or, if possible, verify them by other means.
Jun 4-10 BigRed At 4:40pm, the IU data center experienced a weather-related power outage. Subsequent problems with GPFS and additional power outages prevented the system's return to service until 11am, Tuesday, June 10.
May 31 GPFS At 10:47am, the GPFS file system became inaccessible due to multiple disk failures. Access was restored at 2:26pm. Running jobs were affected by this outage.
May 1 GPFS At 6:18pm, the GPFS file system became inaccessible due to a hardware failure. Access was restored at 9:36pm. Running jobs were affected by this outage.
May 19 BigRed nodes At 11:05am, the BigRed machine room experienced a brief power outage, causing approximately 200 nodes to reboot. Scheduling was paused until 12:09pm while admins examined the system.
Apr 5 Myrinet At 3:39pm, the Myrinet network went down due to a mapper issue. The Myrinet network was returned to service at 11:47am on April 6.
Apr 5 GPFS At 3:39pm, the GPFS file system became inaccessible due to a hardware failure. Access was restored at 10:35am on April 6.
Feb 4 BigRed At 11:00pm BigRed was returned to service. Cooling towers were repaired, and power maintenance and network upgrades were performed. Regularly scheduled maintenance for 2/5/2008 is canceled.
Feb 3 BigRed At 10:23am on February 3, the BigRed machine room experienced loss of cooling. Admins shut down the machine at 10:30am.

2007

Date System(s) Problem
Nov 28 GPFS GPFS was inaccessible from 12:34PM until 8:59PM. Running jobs were affected by this event.
Nov 16 GPFS GPFS was inaccessible from 4:59PM until Nov 17 at 2:03am. Running jobs were affected by this event.
Nov 1 GPFS GPFS was inaccessible from ~4pm until 6:20pm. Running jobs were affected by this event.
Oct 25 GPFS GPFS was inaccessible from ~midnight until 12:47am. Running jobs were not affected by this event.
Oct 23 GPFS GPFS was inaccessible from ~11:00pm until Oct 24 at 9:00am. Running jobs were not affected by this event.
Oct 20 GPFS GPFS was inaccessible from 8:47am to 12:25pm. Running jobs were affected by this event.
Oct 19 GPFS GPFS was inaccessible from 5:54pm until Oct 20 at 2:26am. Running jobs were affected by this event.
Oct 16 GPFS GPFS was inaccessible from ~8:00pm until Oct 17 at 12:07am. Running jobs were not affected by this event.
Oct 7 GPFS GPFS was inaccessible from ~4:26pm until 10:16pm. Running jobs were not affected by this event.
Sep 30 Big Red, GPFS Power outage, ~6:25am to 3:49pm EDT. All systems were down during this event.
Sep 25 Big Red, GPFS Power outage, 4:14pm to 10:08pm EDT. All systems were down during this event.
Jul 4 Big Red, GPFS Power outage, ~9:00am to 2:30pm EDT. All systems were down during this event.
Jun 21 GPFS Failure of several blades during benchmarking of expansion hardware resulted in GPFS instability from approximately 3:45pm until 5:50pm. GPFS recovered without a restart, though remounts did occur on several nodes; some jobs were lost.
Apr 27 GPFS Communication failure during test of a firmware update process resulted in GPFS instability, approximately noon until 5pm. GPFS restarted; no data was lost.
Apr 17 Login nodes Network configuration change on campus resulted in a routing error; access to login nodes was unavailable from approximately 2:30pm until 3:00pm.
Apr 10-11 Racks 1-8 NFS server issues resulted in hanging NFS mounts (see Apr 6 outage).
Apr 6 Rack 9 10:45am - 3:00pm; NFS server issues resulted in hanging NFS mounts on the compute blades.
Mar 7-10 GPFS Switch and disk issues resulted in multiple GPFS outages.
Jan 31 GPFS Failed disk controller resulted in a GPFS outage from 16:27 until 22:15 EST. No files appear to have been lost during the rebuild process.

2006

Date System(s) Problem
Nov 30 Rack 4 Outage due to failed /dev/sdb (intermittent SCSI errors) in the image server, s4. This resulted in a loss of access to the NFS exports for all blades in rack 4. Resolved following a power-cycle of s4. Began 6:40am, resolved 10:30am.
Nov 6-7 GPFS Outage due to defective SFP on DDN disk controller, which resulted in SAN switch problems. Began 11:08am, 11/6, resolved 9:41am, 11/7.
Sep 18 GPFS Three storage hosts (storage8u, 11u, and 15u) were affected during last night's storm. Some NSDs are unavailable to GPFS.