| Date |
System(s) |
Problem |
| Nov 18-19 |
GPFS |
GPFS was unresponsive for about 12 hours from the evening of
November 18 until approximately 10:30am EST November 19. Jobs using
GPFS may have run out of walltime during this period.
|
| Nov 10-11 |
Data Capacitor |
The Data Capacitor mounted at /N/dc is back in production as of 11:00am
on November 11. It was unavailable between 9:00am November 10 and
11:00am on November 11. Jobs using the Data Capacitor were affected.
|
| Nov 4-5 |
BigRed |
BigRed was powered down at 5pm on November 4 to
replace a damaged transformer. The system was returned
to service at 9:45am on November 5.
|
| Oct 30-31 |
BigRed |
The IUPUI data center experienced a power outage around 5:25pm on
October 30 resulting in network connectivity problems until 12:40am
on October 31. BigRed experienced login problems and connectivity
issues to the home directories. Jobs may have been affected by this
outage. Please send any questions or concerns to hps-admin@iu.edu.
|
| Oct 12 |
BigRed |
Big Red suffered a power outage around 9:37 AM, Sunday, Oct 12th,
EDT. All blades were affected. Service was fully restored as of
12:11PM.
|
| Oct 3 |
BigRed |
Big Red suffered a power outage around 3:03 AM, Friday, Oct 3rd,
EDT. All blades were affected. Service was fully restored as of
7:20 AM.
|
| Sep 14 |
BigRed |
Big Red suffered a power outage at approximately 3:15pm, Sunday,
September 14. Thirty-five compute nodes power-cycled during the
outage; deferred jobs were requeued, and service was restored at
approximately 4:00pm.
|
| Sep 12 |
BigRed |
From 10:05am EDT until 10:23am EDT, Big Red suffered a network outage
which affected connections to user home directories, the Data
Capacitor, GPFSWAN, and other external connections. Running jobs
should have paused until connectivity was restored.
|
| Sep 3 |
BigRed |
From 1:50pm EDT until 2:04pm EDT, Big Red suffered a network outage
which affected connections to user home directories, the Data
Capacitor, GPFSWAN, and other external connections. Running jobs
should have paused until connectivity was restored.
|
| Aug 28 |
BigRed |
At 3:54pm EDT Big Red suffered a network outage which affected
connections to the Data Capacitor, the home directories, GPFSWAN, and
external connections. The network was restored at 8:20pm. Running
jobs relying on those connections paused until connectivity was
restored.
|
| Aug 14 |
BigRed nodes |
Big Red login nodes h1 and h2 were rebooted at 2:53pm EDT in order to
resolve an issue with the Data Capacitor.
|
| Aug 13 |
BigRed |
Job scheduling was paused for approximately seven hours while
technicians resolved a chiller outage in the data center. The measure
was precautionary; no jobs were lost.
|
| Jul 23 |
BigRed |
The Big Red machine room experienced a power failure at 1:42am EDT;
power was restored shortly thereafter, but GPFS was unavailable until
10:18am.
|
| Jun 24 |
Data Capacitor |
Problems accessing the Data Capacitor WAN filesystem.
|
| Jun 14-18 |
BigRed |
At approximately 3:00pm, Saturday, June 14, the IU data center lost
power. BigRed was returned to service, following a series of
power infrastructure upgrades, at 8:00pm, Wednesday, June 18.
|
| Jun 4-20 |
GPFS |
GPFS was unavailable during this period as a result of disk errors
caused by multiple site power outages. The recovery process may have
corrupted some data files. Please see
/N/gpfsbr/gpfs_suspect_files/USERNAME for a list of files that may be
affected. Recover these files from backup or, if possible, verify them
by other means.
|
| Jun 4-10 |
BigRed |
At 4:40pm, the IU data center experienced a weather-related power
outage. Subsequent problems with GPFS and additional power outages
prevented the system's return to service until 11am, Tuesday, June 10.
|
| May 31 |
GPFS |
At 10:47am, the GPFS file system became inaccessible due to multiple disk failures. Access was restored at 2:26pm. Running jobs were affected by this outage. |
| May 1 |
GPFS |
At 6:18pm, the GPFS file system became inaccessible due to a hardware failure. Access was restored at 9:36pm. Running jobs were affected by this outage. |
| May 19 |
BigRed nodes |
At 11:05am, the BigRed machine room experienced a brief power outage, causing approximately 200 nodes to reboot. Scheduling was paused until 12:09pm while admins examined the system. |
| Apr 5 |
Myrinet |
At 3:39pm, the Myrinet network went down due to a mapper issue. The Myrinet network was returned to service at 11:47am on April 6. |
| Apr 5 |
GPFS |
At 3:39pm, the GPFS file system became inaccessible due to a hardware failure. Access was restored at 10:35am on April 6. |
| Feb 4 |
BigRed |
At 11:00pm BigRed was returned to service. Cooling towers were repaired, and power maintenance and network upgrades were performed. Regularly scheduled maintenance for 2/5/2008 is canceled. |
| Feb 3 |
BigRed |
At 10:23am on February 3, the BigRed machine room experienced a loss of cooling. Admins shut down the machine at 10:30am. |
| Date |
System(s) |
Problem |
| Nov 28 |
GPFS |
GPFS was inaccessible from 12:34PM until 8:59PM. Running jobs were affected by this event. |
| Nov 16 |
GPFS |
GPFS was inaccessible from 4:59PM until Nov 17 at 2:03am. Running jobs were affected by this event. |
| Nov 1 |
GPFS |
GPFS was inaccessible from ~4pm until 6:20pm. Running jobs were affected by this event. |
| Oct 25 |
GPFS |
GPFS was inaccessible from ~midnight until 12:47am. Running jobs were not affected by this event. |
| Oct 23 |
GPFS |
GPFS was inaccessible from ~11:00pm until Oct 24 at 9:00am. Running jobs were not affected by this event. |
| Oct 20 |
GPFS |
GPFS was inaccessible from 8:47am to 12:25pm. Running jobs were affected by this event. |
| Oct 19 |
GPFS |
GPFS was inaccessible from 5:54pm until Oct 20 at 2:26am. Running jobs were affected by this event. |
| Oct 16 |
GPFS |
GPFS was inaccessible from ~8:00pm until Oct 17 at 12:07am. Running jobs were not affected by this event. |
| Oct 7 |
GPFS |
GPFS was inaccessible from ~4:26pm until 10:16pm. Running jobs were not affected by this event. |
| Sep 30 |
Big Red, GPFS |
Power outage, ~6:25am to 3:49pm EDT. All systems were down during
this event. |
| Sep 25 |
Big Red, GPFS |
Power outage, 4:14pm to 10:08pm EDT. All systems were down during
this event. |
| Jul 4 |
Big Red, GPFS |
Power outage, ~9:00am to 2:30pm EDT. All systems were down during
this event. |
| Jun 21 |
GPFS |
Failure of several blades during benchmarking of expansion
hardware resulted in GPFS instability, approximately 3:45pm until
5:50pm. GPFS recovered without a restart, though remounts did occur
on several nodes; some jobs were lost. |
| Apr 27 |
GPFS |
Communication failure during test of a firmware update process
resulted in GPFS instability, approximately noon until 5pm. GPFS
restarted; no data was lost. |
| Apr 17 |
Login nodes |
Network configuration change on campus resulted in a routing
error; access to login nodes was unavailable from approximately 2:30pm
until 3:00pm. |
| Apr 10-11 |
Racks 1-8 |
NFS server issues resulted in hanging NFS mounts (see Apr 6
outage). |
| Apr 6 |
Rack 9 |
10:45am - 3:00pm; NFS server issues resulted in hanging NFS mounts
on the compute blades. |
| Mar 7-10 |
GPFS |
Switch and disk issues resulted in multiple GPFS outages. |
| Jan 31 |
GPFS |
Failed disk controller resulted in a GPFS outage from 16:27 until
22:15 EST. No files appear to have been lost during the rebuild
process.
|