Due to hardware problems, the spam filter end user interface (the web system where University members check their spam quarantine contents) will be unreliable. Service has been engaged, and we hope to have a resolution today.
Category Archives: Unscheduled Outages
6/3 – Database Systems Temporarily Offline
The database systems went offline sometime on 3 June, over the weekend, when rainier and crystal lost their connections to several disk volumes. The DBA was called at 9 AM this morning. The disks were remounted, and the databases were restarted.
The www2 webserver also became unresponsive because it lost connection to the database. The webserver was restarted once the database came online, restoring service.
5/30 – License problems with the FTP server on www2
Sometime during the weekend, the ftp server on www2 decided that its license had expired. Until a new license could be obtained, the standard ftp server (wu-ftpd) was run. This led to some slow response (wu-ftpd operates under the xinetd master daemon) for much of the day. The vendor provided a new license key at 4:30 PM. This was installed, and ftp service was restored to normal.
5/25 – www2 failed
The secondary web server www2 became unresponsive when the SurgeFTP service began consuming 99% of the CPU cycles. We were unable to stop the service or gracefully restart the system. The system was powered down, then back on, after which the system ran normally.
5/24 – MERLIN2 failure
Merlin2 became unresponsive to fileshare access today at 5:30 PM. The console was still responding, and we were able to log on. Access to the disk arrays appeared to be impaired – we were unable to list the disks or view their contents. No pertinent events were logged in the Event Log. The system was rebooted, and was back to normal.
We have made an adjustment to the antivirus software (changed vendors), and will keep monitoring.
Email Problems
There were reports last night and this morning that users were having problems connecting to webmail. The problem appears to be linked to the Sophos PureMessage software, the timing of the quarantine digest messages, and the slow disk array.
We updated our PureMessage configuration to send one digest message instead of having each server send individual messages. As a result, all the messages hit the mail server at the same time and, combined with client access at the start of the business day, overwhelmed it.
We are changing the frequency of digest messages from twice a day to once a day and moving the send time to 1:30 am each morning.
IMAP, POP3, and Webmail Slowness IV
The firmware upgrades did not resolve the problem with disk performance, even though we had a good couple of days. Sun’s final analysis led them to the conclusion that the disk array on which mailboxes reside has simply reached a saturation level in terms of I/O rate. This will mean that we will have to get a faster disk array.
IMAP and POP3 and Webmail Slowness II
Sun has identified some issues with our disk array configuration and has provided some new settings for us to apply. One of the changes has been made, with little effect on disk performance. The other two changes will require that the email server be shut down. We are scheduling this right now. Please refer to the “Scheduled Outages” page (http://www2.ups.edu/ois/nssg/network/alerts.shtml) for the latest information.
IMAP and POP3 and Webmail Slowness I
We are currently experiencing disk performance problems on the mail server. This is causing slowness and problems connecting with Webmail and POP3 and IMAP clients. We are working with the hardware vendor to correct the problem.
PureMessage not accepting messages
PureMessage stopped accepting messages this morning when the disk volume was filled by log files. Corrections have been made to the logrotate.conf file in an attempt to prevent this from occurring in the future.
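The idea behind the log rotation fix, capping how much disk space logs may consume, can be illustrated with a minimal Python sketch. This is not the actual logrotate.conf change, which is not recorded here; it swaps in Python's standard `RotatingFileHandler` to show the same principle. The function names, the logger name, the size limit, and the backup count are all assumptions chosen for illustration.

```python
import logging
import logging.handlers
import os


def make_capped_logger(log_path, max_bytes=1024, backups=3):
    """Return a logger whose files on disk are bounded in size and number.

    Hypothetical helper: max_bytes and backups are illustrative values,
    not settings taken from the PureMessage server.
    """
    logger = logging.getLogger("capped_example")
    logger.setLevel(logging.INFO)
    logger.propagate = False

    # Drop handlers left over from an earlier call so output is not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()

    # Roll over before a file would reach max_bytes; keep at most `backups`
    # rotated files and discard the oldest beyond that.
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=backups
    )
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def total_log_bytes(directory):
    """Sum the sizes of all files in the log directory."""
    return sum(
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
    )
```

With these values, the log directory should stay at or below roughly (backups + 1) × max_bytes no matter how many messages are written, which is the property that keeps a volume from filling.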
Network upgrade
The planned OS upgrade of core network equipment on Sunday was not as smooth as planned. Two systems had difficulty with the upgrade and required a reboot: the mail server, and the Oracle development server. Otherwise, the upgrade was a success.
As a result of the problem, we have a better understanding of the upgrade process for future upgrades.
Dial-in Failures
The radius daemon and portmaster were reset to address several reported issues with dial-in access. The core reason users were not being authenticated is not completely clear, but there are indications that the portmaster or the radius daemon became confused about the appropriate shared secret. Once all entries were reset, authentication started to be validated correctly.
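To show why a mismatched secret would produce this kind of failure, here is a minimal Python sketch of how a RADIUS reply is authenticated under RFC 2865: the server hashes the reply together with the secret, and the client (here, the portmaster) recomputes the hash with its own copy. The function names and sample values are hypothetical, and this is an illustration of the protocol rule rather than a description of how our daemon is implemented.

```python
import hashlib
import hmac
import struct


def response_authenticator(code, identifier, request_auth, attributes, secret):
    """Compute the RFC 2865 Response Authenticator.

    MD5(Code + ID + Length + Request Authenticator + Attributes + Secret),
    where Length covers the 20-byte header plus the attributes.
    """
    length = 20 + len(attributes)
    header = struct.pack("!BBH", code, identifier, length)
    return hashlib.md5(header + request_auth + attributes + secret).digest()


def authenticator_matches(received, code, identifier, request_auth,
                          attributes, secret):
    """Check a received authenticator against the locally configured secret."""
    expected = response_authenticator(
        code, identifier, request_auth, attributes, secret
    )
    return hmac.compare_digest(received, expected)
```

If the two sides hold different secrets, the recomputed value never matches, so the client discards even valid replies and every login appears to fail.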
During the process, four modems were identified as failing to respond properly and were removed from service.
Crystal – problems connecting to SAN
DBA reported problems with CRYSTAL on Saturday, 8/28 in the evening.
Examined the system on Monday, 8/30. It appears to be unable to communicate with the SAN. Cleaned the fibre, switch, and HBA, but no luck.
Examined the HBA on 8/31: no LED on the card. Called vendor support; 3.5 hours later the HBA was considered bad and a new HBA was sent out with 4-hour delivery. New HBA installed, but still not loading SAN drives. Lights are now working on both the HBA and the switch.
Called back the support engineer at 9 am and left a voice message. Called vendor support at 11 am to talk with another engineer and was told our call would be assigned to another engineer. The original engineer called back at 4 pm to apologize that another engineer had not been assigned. Dual entries were seen in the switch; the old entry was removed, the system was rebooted, and the SAN drives are now visible.
Restarted databases.
Media server down
A hard disk failure in the RAID array on the media server caused the server to hang. After several attempts to recover the system, the data was backed up and the system was rebuilt.
Directory Server Reboot
The directory server was rebooted at 5:15pm to finish the OS upgrade.
Server back on-line at 5:25pm.