The Big Red Cluster

Outages and Hardware Service

2009

Date System(s) Problem
Jan 25 BigRed The Data Capacitor file system mounted at /N/dc on Big Red became unavailable at 1:07am due to a hardware failure. Admins restored service at 4:45pm. Some jobs may have exceeded walltime without completing during this period. Please contact hps-admin@iu.edu with any questions.
Jan 05 BigRed Big Red suffered a power outage at 12:00am EST Monday, January 5. All compute nodes were affected; running jobs failed. Service was restored at approximately 1:45am EST Monday, January 5. Please contact hps-admin@iu.edu for more information or for TeraGrid refunds.

2008

Date System(s) Problem
Nov 18-19 GPFS GPFS was unresponsive for about 12 hours from the evening of November 18 until approximately 10:30am EST November 19. Jobs using GPFS may have run out of walltime during this period.
Nov 10-11 Data Capacitor The Data Capacitor mounted at /N/dc is back in production as of 11:00am on November 11. It was unavailable between 9:00am November 10 and 11:00am on November 11. Jobs using the Data Capacitor were affected.
Nov 4-5 BigRed BigRed was powered down at 5pm on November 4 to replace a damaged transformer. The system was returned to service at 9:45am on November 5.
Oct 30-31 BigRed The IUPUI data center experienced a power outage around 5:25pm on October 30 resulting in network connectivity problems until 12:40am on October 31. BigRed experienced login problems and connectivity issues to the home directories. Jobs may have been affected by this outage. Please send any questions or concerns to hps-admin@iu.edu.
Oct 12 BigRed Big Red suffered a power outage around 9:37 AM, Sunday, Oct 12th, EDT. All blades were affected. Service was fully restored as of 12:11PM.
Oct 3 BigRed Big Red suffered a power outage around 3:03 AM, Friday, Oct 3rd, EDT. All blades were affected. Service was fully restored as of 7:20 AM.
Sep 14 BigRed Big Red suffered a power outage at approximately 3:15pm, Sunday, September 14. Thirty-five compute nodes power-cycled during the outage; deferred jobs were requeued, and service was restored at approximately 4:00pm.
Sep 12 BigRed From 10:05am EDT until 10:23am EDT, Big Red suffered a network outage which affected connections to user home directories, the Data Capacitor, GPFSWAN, and other external connections. Running jobs should have paused until connectivity was restored.
Sep 3 BigRed From 1:50pm EDT until 2:04pm EDT, Big Red suffered a network outage which affected connections to user home directories, the Data Capacitor, GPFSWAN, and other external connections. Running jobs should have paused until connectivity was restored.
Aug 28 BigRed At 3:54pm EDT Big Red suffered a network outage which affected connections to the Data Capacitor, the home directories, GPFSWAN, and external connections. The network was restored at 8:20pm. Running jobs relying on those connections paused until connectivity was restored.
Aug 14 BigRed nodes Big Red login nodes h1 and h2 were rebooted at 2:53pm EDT in order to resolve an issue with the Data Capacitor.
Aug 13 BigRed Job scheduling was paused for approximately seven hours while technicians resolved a chiller outage in the data center. The measure was precautionary; no jobs were lost.
Jul 23 BigRed The Big Red machine room experienced a power failure at 1:42am EDT; power was restored shortly thereafter, but GPFS was unavailable until 10:18am.
Jun 24 Data Capacitor Problems accessing the Data Capacitor WAN filesystem.
Jun 14-18 BigRed At approximately 3:00pm, Saturday, June 14, the IU data center lost power. BigRed was returned to service, following a series of power infrastructure upgrades, at 8:00pm, Wednesday, June 18.
Jun 4-20 GPFS GPFS was unavailable during this period as a result of disk errors caused by multiple site power outages. The recovery process may have corrupted some data files. Please see /N/gpfsbr/gpfs_suspect_files/USERNAME for a list of files that may be affected. You will need to recover these files from backup or, if possible, verify them by other means.
Jun 4-10 BigRed At 4:40pm on June 4, the IU data center experienced a weather-related power outage. Subsequent problems with GPFS and additional power outages prevented the system's return to service until 11am, Tuesday, June 10.
May 31 GPFS At 10:47am, the GPFS file system became inaccessible due to multiple disk failures. Access was restored at 2:26pm. Running jobs were affected by this outage.
May 1 GPFS At 6:18pm, the GPFS file system became inaccessible due to a hardware failure. Access was restored at 9:36pm. Running jobs were affected by this outage.
May 19 BigRed nodes At 11:05am, the BigRed machine room experienced a brief power outage, causing approximately 200 nodes to reboot. Scheduling was paused until 12:09pm while admins examined the system.
Apr 5 Myrinet At 3:39pm, the Myrinet network went down due to a mapper issue. It was returned to service at 11:47am on April 6.
Apr 5 GPFS At 3:39pm, the GPFS file system became inaccessible due to a hardware failure. Access was restored at 10:35am on April 6.
Feb 4 BigRed At 11:00pm BigRed was returned to service. Cooling towers were repaired, and power maintenance and network upgrades were performed. Regularly scheduled maintenance for 2/5/2008 is canceled.
Feb 3 BigRed At 10:23am on February 3, the BigRed machine room experienced a loss of cooling. Admins shut down the machine at 10:30am.

2007

Date System(s) Problem
Nov 28 GPFS GPFS was inaccessible from 12:34PM until 8:59PM. Running jobs were affected by this event.
Nov 16 GPFS GPFS was inaccessible from 4:59PM until Nov 17 at 2:03am. Running jobs were affected by this event.
Nov 1 GPFS GPFS was inaccessible from ~4pm until 6:20pm. Running jobs were affected by this event.
Oct 25 GPFS GPFS was inaccessible from ~midnight until 12:47am. Running jobs were not affected by this event.
Oct 23 GPFS GPFS was inaccessible from ~11:00pm until Oct 24 at 9:00am. Running jobs were not affected by this event.
Oct 20 GPFS GPFS was inaccessible from 8:47am to 12:25pm. Running jobs were affected by this event.
Oct 19 GPFS GPFS was inaccessible from 5:54pm until Oct 20 at 2:26am. Running jobs were affected by this event.
Oct 16 GPFS GPFS was inaccessible from ~8:00pm until Oct 17 at 12:07am. Running jobs were not affected by this event.
Oct 7 GPFS GPFS was inaccessible from ~4:26pm until 10:16pm. Running jobs were not affected by this event.
Sep 30 Big Red, GPFS Power outage, ~6:25am to 3:49pm EDT. All systems were down during this event.
Sep 25 Big Red, GPFS Power outage, 4:14pm to 10:08pm EDT. All systems were down during this event.
Jul 4 Big Red, GPFS Power outage, ~9:00am to 2:30pm EDT. All systems were down during this event.
Jun 21 GPFS Failure of several blades during benchmarking of expansion hardware resulted in GPFS instability from approximately 3:45pm until 5:50pm. GPFS recovered without a restart, though remounts did occur on several nodes; some jobs were lost.
Apr 27 GPFS Communication failure during a test of a firmware update process resulted in GPFS instability from approximately noon until 5pm. GPFS was restarted; no data was lost.
Apr 17 Login nodes Network configuration change on campus resulted in a routing error; access to login nodes was unavailable from approximately 2:30pm until 3:00pm.
Apr 10-11 Racks 1-8 NFS server issues resulted in hanging NFS mounts (see Apr 6 outage).
Apr 6 Rack 9 10:45am - 3:00pm; NFS server issues resulted in hanging NFS mounts on the compute blades.
Mar 7-10 GPFS Switch and disk issues resulted in multiple GPFS outages.
Jan 31 GPFS Failed disk controller resulted in a GPFS outage from 16:27 until 22:15 EST. No files appear to have been lost during the rebuild process.

2006

Date System(s) Problem
Nov 30 Rack 4 Outage due to failed /dev/sdb (intermittent SCSI errors) in the image server, s4. This resulted in a loss of access to the NFS exports for all blades in rack 4. Resolved following a power-cycle of s4. Began 6:40am, resolved 10:30am.
Nov 6-7 GPFS Outage due to defective SFP on DDN disk controller, which resulted in SAN switch problems. Began 11:08am, 11/6, resolved 9:41am, 11/7.
Sep 18 GPFS Three storage hosts (storage8u, 11u, and 15u) were affected during last night's storm. Some NSDs are unavailable to GPFS.