Break reports

All breaks in HIIT's IT services are listed here.

Break at 2012-11-02 18:32 - 18:46

2012-11-02 18:32 to 18:46
14 min
Affected services: 
Services running in Adaptive group's test server

The Apache process on Universe, and possibly the kernel's multipath layer, had become stuck. To resolve this, the server was rebooted, and all pending updates were installed at the same time.

Update at 18:48: Universe was up and running at 18:46. The problem was that multipathd did not fail one path during maintenance on the disk array system at 2012-10-23T13:45, even though it had marked the devices behind it as faulty:

mithlond-lun-14 (3600601603c0027009a1ae2bc8a12e011) dm-2 ,
size=2.0T features='0' hwhandler='1 emc' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| |- #:#:#:# -   #:#   active faulty running
| `- #:#:#:# -   #:#   active faulty running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 1:0:4:0 sda 8:0   active ready running
  `- 0:0:6:0 sdc 8:32  active ready running

This caused disk I/O to fail, which led Apache to generate heavy load and leave one zombie process:

top - 18:26:31 up 150 days,  6:23, 10 users,  load average: 93.99, 93.97, 93.64
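
A condition like the one above could be spotted earlier by scanning multipath output for path groups that are still marked active although every path in them is faulty. The following is only a minimal sketch of such a check, not a tool that was used here: the function name find_stale_groups and the regular expressions are assumptions written against the output format shown in this report, and other multipath-tools versions may print a different layout.

```python
import re

# Hypothetical patterns, written against the output shown in this report.
MAP_RE = re.compile(r"^(\S+)\s+\([0-9a-f]+\)\s+dm-\d+")
GROUP_RE = re.compile(r"policy='[^']*'\s+prio=\S+\s+status=(\w+)")
# Path line: tree marker, three identifier columns, then three state columns.
PATH_RE = re.compile(r"[|`]-\s+\S+\s+\S+\s+\S+\s+\w+\s+(\w+)\s+\w+\s*$")

SAMPLE = """\
mithlond-lun-14 (3600601603c0027009a1ae2bc8a12e011) dm-2 ,
size=2.0T features='0' hwhandler='1 emc' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| |- #:#:#:# -   #:#   active faulty running
| `- #:#:#:# -   #:#   active faulty running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 1:0:4:0 sda 8:0   active ready running
  `- 0:0:6:0 sdc 8:32  active ready running
"""


def find_stale_groups(text):
    """Return (map_name, group_index) for every path group whose status is
    'active' while all of its paths are reported as 'faulty'."""
    groups = []
    current_map = None
    index = 0
    for line in text.splitlines():
        map_match = MAP_RE.match(line)
        if map_match:
            # A new multipath map starts; group numbering restarts at 0.
            current_map = map_match.group(1)
            index = 0
            continue
        group_match = GROUP_RE.search(line)
        if group_match:
            groups.append({"map": current_map, "index": index,
                           "status": group_match.group(1), "paths": []})
            index += 1
            continue
        path_match = PATH_RE.search(line)
        if path_match and groups:
            # Record the checker state (for example 'faulty' or 'ready').
            groups[-1]["paths"].append(path_match.group(1))
    return [(g["map"], g["index"]) for g in groups
            if g["status"] == "active" and g["paths"]
            and all(state == "faulty" for state in g["paths"])]


if __name__ == "__main__":
    print(find_stale_groups(SAMPLE))
```

Run against the output quoted in this report, the sketch flags the first path group of mithlond-lun-14 and leaves the healthy second group alone.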


Break in network connections at 2012-10-26 10:40-11:02

2012-10-26 10:40 to 11:02
22 min
Affected services: 
All of HIIT's network connections

HIIT's whole network went down at 2012-10-26 10:40 and came back at 11:02. We are still investigating the reason for the break.

Update at 11:02: The break is over.

Break at 2012-10-17 13:15 - 13:35

2012-10-17 13:15 to 13:35
20 min
Affected services: 
Software Factory fileserver
Description: The server crashed after a rescan of the FC HBAs. The server was rebooted.

Update at 13:35: The break is over.

Break at 2012-10-10 22:23 - 2012-10-11 09:29

2012-10-10 22:23 to 2012-10-11 09:29
11h 6min
Affected services: 
Domain Name Service (DNS)

The DNS server was rebooted because of a kernel upgrade. Due to problems with the Tigon3 module, the network did not come up automatically and had to be brought up manually. Unfortunately, in the middle of this, the DNS service was left unstarted until the morning.

DNS resolution may have been a bit slower but still worked, because the other DNS server was functioning normally.
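
A simple post-reboot check could catch a service that was left unstarted. The sketch below is an illustration only, not a check that was in place here: it assumes the DNS server also accepts TCP connections on port 53, and the function name tcp_port_open and the host name in the usage comment are hypothetical.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Hypothetical usage after a reboot (the host name is a placeholder):
#   if not tcp_port_open("ns1.example.org", 53):
#       print("DNS service is not answering on TCP port 53")
```

An open TCP port only shows that something is listening; it does not prove that queries are answered correctly, so a real check would also resolve a known name.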

Break in network connections at 2012-10-01 09:50-13:35

2012-10-01 09:50 to 13:35
3h 45min
Affected services: 
Open Innovation House's network connections

Open Innovation House (OIH) is connected to the rest of HIIT's IT infrastructure by a light path running along the route Vallila - CSC - Otaniemi. This light path went down at 09:50 because of BPDU packets transmitted by Aalto IT's router in Otaniemi and received by University of Helsinki's router in Vallila. Those packets should have been discarded; instead, spanning tree shut the port down.

At 13:35 the port was brought up manually and OIH returned to the network.

Update at 16:30: Both Aalto University's and University of Helsinki's routers now filter BPDU packets from the OIH light path, so this kind of situation should not happen again.
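
The difference between the behaviour before and after the fix can be illustrated with a toy model. This is a deliberately simplified sketch and not the routers' actual configuration or logic: the class EdgePort and its fields are hypothetical names, and real spanning tree implementations are considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class EdgePort:
    """Toy model of a port that either filters BPDUs or shuts down on them."""
    bpdu_filter: bool
    up: bool = True
    dropped: int = 0

    def receive(self, frame_type):
        """Handle one frame; return True if the frame is passed on."""
        if not self.up:
            return False          # a port that is down passes nothing
        if frame_type == "bpdu":
            if self.bpdu_filter:
                self.dropped += 1  # after the fix: silently discard the BPDU
            else:
                self.up = False    # before the fix: the port is shut down
            return False
        return True               # ordinary traffic passes while the port is up


if __name__ == "__main__":
    before = EdgePort(bpdu_filter=False)
    before.receive("bpdu")
    print(before.up, before.receive("data"))

    after = EdgePort(bpdu_filter=True)
    after.receive("bpdu")
    print(after.up, after.receive("data"))
```

In the unfiltered case a single BPDU takes the port down and all later traffic is lost until someone brings the port up by hand, which matches what happened to OIH. With filtering enabled the BPDU is dropped and ordinary traffic keeps flowing.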
