FYI – K20 Announcement of fiber damage from the storm. No issues have been reported yet as a result.
CURRENT EVENT START : 12/13/10 22:22 PST
CURRENT EVENT STOP : ?
This notice is being sent to all K20 customers. We probably should have sent this notice sooner.
WSIPC is part of a 14-site outage caused by a fiber break that occurred last night at 22:22 PST. That specific outage is being tracked in our ticket 940112.
As many customers have noticed, WSIPC services such as hosted DNS, email, and some applications such as Skyward and Citrix are affected.
These WSIPC services will remain down until the outage is resolved. The fiber provider is working on an alternate path for the affected transport circuits, but there is still no ETR.
K20 Network Operations Center
DETAILED OUTAGE HISTORY
K20 Area Wide
FIRST DOWN: 12/13/10 22:22 PST
LAST UP: ?
TOTAL DOWN TIME: 15 hours 10 minutes
12/13/10 22:22 PST – ?
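The TOTAL DOWN TIME figure in a notice like this is simply the elapsed time from FIRST DOWN to the moment the notice was generated. A minimal sketch of that arithmetic follows. The function name `format_downtime` is hypothetical, and the "as of" time is derived here from the stated 15 hours 10 minutes rather than taken from the notice itself.

```python
from datetime import datetime, timedelta


def format_downtime(start: datetime, now: datetime) -> str:
    """Render the elapsed time between two timestamps as 'H hours M minutes'."""
    total_minutes = int((now - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


# FIRST DOWN from the notice (PST, handled as a naive local time in this sketch).
first_down = datetime(2010, 12, 13, 22, 22)

# Assumption: the notice was generated exactly 15 h 10 min after FIRST DOWN.
as_of = first_down + timedelta(hours=15, minutes=10)

print(as_of.strftime("%m/%d/%y %H:%M"))
print(format_downtime(first_down, as_of))
```

Under that assumption the notice would have been generated at 13:32 PST on 12/14/10.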
In production, Boss had been using the rep_bossweb reports server, which is not stable.
It was down this morning, so we switched it to use the UPS_REPORTS_NONSSO_PROD reports server in production. That worked fine.
There is still the issue of report output security, because anyone using that reports server can access other people’s report file output.
We will plan to test Boss with the SSO reports server and see how it functions and whether it would work for us.
Our residential Internet connections are experiencing intermittent outages due to a vendor outage in the Seattle area. They are working to replace the equipment but we can expect intermittent outages for a short time.
[Update 1/14/10 4:00 PM] Service has been fully restored.
The Active Directory password policy was inadvertently set to reject passwords that did not contain any special (non-alphanumeric) character, such as *#$% etc.
The problem began about 3/21/2009 and was corrected at 3:15pm on 3/26/2009. During this period, anyone changing a password using Windows was instructed to include a special character.
Passwords changed using Cascade Web during this period were not synchronized to Active Directory, so the new password did not work for Webmail, Windows, etc. This can now be corrected by changing either the AD or OID password.
The problem was corrected by deselecting the special character requirement in the AD password policy.
Here is an example of the error in the ActiveExportUsers_Groups.trc log:
Error in executing mapping DIP_LDAPWRITER_ERROR_MODIFY
javax.naming.OperationNotSupportedException: [LDAP: error code 53 - 0000052D: SvcErr: DSID-031A0FC0, problem 5003 (WILL_NOT_PERFORM), data 0
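The two numbers in that log line can be read separately. LDAP result code 53 is unwillingToPerform, and the hexadecimal value 0000052D is Windows error 1325 (ERROR_PASSWORD_RESTRICTION), which is consistent with a password being rejected by the AD policy. The sketch below pulls both values out of a line in this format. The function name and the two small lookup tables are illustrative only, not part of any Oracle or Microsoft tool.

```python
import re

# Illustrative lookup tables; only the codes seen in this incident are listed.
LDAP_RESULT_NAMES = {53: "unwillingToPerform"}
WIN32_ERROR_NAMES = {0x52D: "ERROR_PASSWORD_RESTRICTION"}

# Matches "LDAP: error code <decimal> - <8 hex digits>:" as it appears in the log.
PATTERN = re.compile(r"LDAP: error code (\d+) - ([0-9A-Fa-f]{8}):")

SAMPLE = (
    "javax.naming.OperationNotSupportedException: [LDAP: error code 53 - "
    "0000052D: SvcErr: DSID-031A0FC0, problem 5003 (WILL_NOT_PERFORM), data 0"
)


def decode_ldap_error(line):
    """Return (ldap_code, ldap_name, win32_code, win32_name), or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    ldap_code = int(match.group(1))
    win32_code = int(match.group(2), 16)  # the AD sub-code is hexadecimal
    return (
        ldap_code,
        LDAP_RESULT_NAMES.get(ldap_code, "unknown"),
        win32_code,
        WIN32_ERROR_NAMES.get(win32_code, "unknown"),
    )


print(decode_ldap_error(SAMPLE))
```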
A database error occurred in CRM about 11:30 am on 4/15/2009. DST was alerted to the problem, a table that was unable to extend, and fixed it about noon. A couple of email campaigns were in progress and were adversely affected:
1. A message from the President’s office going to faculty, staff and students was sent out twice, but the records indicated it only went out once.
2. A message that was being created by Admission was in the middle of generating the target group and got stuck there. Every attempt to resolve it failed, so the solution was to copy the schedule without the target group and re-create the target group, after which the email was sent out successfully.
[Update 3/19/09 4:20 PM] Services have been restored. The application services were stopped and restarted. Root cause analysis is underway.
The application server that hosts Cascade is currently unavailable. Other services impacted are Cascade, Portal, Discoverer, CRM, and Views Flash Survey. TS is aware of the problem and it is being investigated.
Cascade Web Forms and Banner Web Forms were down for a little while after 2 p.m. today. We are having unexplained errors on the forms server and we had to bounce it. We are changing the configuration to see if that helps this problem.
Technology Services was alerted around 8:15 am that Alexandria shares were unavailable. Services were restored as of 8:30 am. Steps are being taken to 1) improve the stability of Alexandria and 2) better alert TS staff when critical services on Alexandria are unavailable.
Reportedly at 3 PM Saturday, 12/13/2008, the Blackboard server became unresponsive to login requests. At 4 PM, Technology Services staff were notified by Security Services. The Blackboard services and database were restarted at about 4:20 PM, which restored service.
CRM was temporarily unavailable for about 40 minutes.
At approximately 10:30 AM, Internet service was resumed as Integra Telecom moved our routes to a temporary router. Internet service resumed until about 12:30 PM, when Integra moved our routes back to our normal router, causing a brief 5-10-minute interruption in service.
A major fiber feed for Integra has been broken. A pole damaged along route 167 has broken the fiber feed for the state of Washington. This feed cannot be repaired until later this afternoon, when 167 can be shut down to restring the fiber across the highway. There is no estimated time of repair.
Beginning April 1st, the Mailman listservs started failing to deliver messages to their membership. This happened because one of the real-time blackhole lists that Mailman was subscribed to went out of service permanently, causing Mailman’s mail server (sendmail) to reject all messages outgoing to the list memberships. So the messages were received by Mailman and archived properly, but they were not delivered. This was found and resolved on April 7th.
This coincided with another email problem on April 2-4 that masked the listserv issue, which is why it took so long to find and resolve.
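For background, a real-time blackhole list is consulted through ordinary DNS: the mail server reverses the octets of the connecting IP address, appends the list's zone, and treats any successful lookup as "listed". The sketch below shows only that mechanism. The zone `dnsbl.example.org` is a placeholder, since the notice does not name the list involved, and `is_listed` with its injectable resolver is a hypothetical helper, not sendmail's own code.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name a mail server would look up for an IPv4 address."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Treat any successful resolution of the query name as 'listed'."""
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# 192.0.2.10 is a documentation-only address; the zone is a placeholder.
print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
```

Because the decision rests entirely on whether the name resolves, a list whose zone changes behavior after being retired can affect every message checked against it, so removing retired lists from the mail server configuration is the usual remedy.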
There was a 5-10 minute interruption today at approximately 2:15. Our internet service provider increased bandwidth to the University without coordinating the change with us. Configuration changes were made to the University's equipment, which restored the connection to the internet. The University now has a full DS3 (T3) connection at 45 Mbps.
Our ISP, Integra, was having some issues last night between 8:10pm and 9:46pm. I called ELI to have them look into the problem. Mark and I came onsite to assist. The problem was resolved by Integra, who reset the connection interface on their end. I have not seen any further issues.