6/3 – Database Systems Temporarily Offline

The database systems went offline sometime over the weekend of 3 June, when rainier and crystal lost connections to several disk volumes. The disks were remounted, and the databases were restarted. The DBA was called at 9 AM this morning.

The www2 webserver also became unresponsive because it lost connection to the database. The webserver was restarted once the database came online, restoring service.

5/30 – License problems with the FTP server on www2

Sometime during the weekend, the ftp server on www2 decided that its license had expired. Until a new license could be obtained, the standard ftp server (wu-ftpd) was run. This led to some slow response (wu-ftpd operates under the xinetd master daemon) for much of the day. The vendor provided a new license key at 4:30 PM. This was installed, and ftp service was restored to normal.

5/24 – MERLIN2 failure

Merlin2 became unresponsive to fileshare access today at 5:30 PM. The console was still responding, and we were able to log on. Access to the disk arrays appeared to be impaired – we were unable to list the disks or view their contents. No pertinent events were logged in the Event Log. The system was rebooted, and was back to normal.

We have made an adjustment to the antivirus software (changed vendors), and will keep monitoring.

Email Problems

There were reports last night and this morning that users were having problems connecting to webmail. The problem appears to be linked to the Sophos PureMessage software, the timing of the quarantine digest messages, and the slow disk array.

We updated our PureMessage configuration to send one digest message instead of having each server send individual messages. As a result, all the messages hit the mail server at the same time, and the combined load of e-mail and client access at the start of the business day overwhelmed it.

We are modifying the interval for digest messages from twice a day to once per day and moving the time to 1:30 am each morning.
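The effect of the new schedule can be sketched in a few lines of Python. This is only an illustration of a once-daily 1:30 am run time, not the PureMessage scheduler itself; the names `DIGEST_TIME` and `next_digest_run` are hypothetical.

```python
from datetime import datetime, time, timedelta

# 1:30 am, the new once-daily digest time (hypothetical constant name).
DIGEST_TIME = time(1, 30)


def next_digest_run(now: datetime) -> datetime:
    """Return the first 1:30 am that falls strictly after `now`."""
    candidate = datetime.combine(now.date(), DIGEST_TIME)
    if candidate <= now:
        # Today's slot has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate
```

With this schedule the digests go out well before the start of the business day, so they no longer compete with the morning rush of client logins.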

IMAP and POP3 and Webmail Slowness II

Sun has identified some issues with our disk array configuration and has provided some new settings for us to apply. One of the changes has been made, with little effect on disk performance. The other two changes will require that the email server be shut down. We are scheduling this right now. Please refer to the “Scheduled Outages” page (http://www2.ups.edu/ois/nssg/network/alerts.shtml) for the latest information.

Network upgrade

The planned OS upgrade of core network equipment on Sunday was not as smooth as planned. Two systems had difficulty with the upgrade and required a reboot: the mail server, and the Oracle development server. Otherwise, the upgrade was a success.

As a result of the problem, we have a better understanding of the upgrade process for future upgrades.

Dial-in Failures

The radius daemon and portmaster were reset to address several reported issues with dial-in access. The core reason for users not being authenticated is not completely clear, but there are indications that the portmaster or the radius daemon became confused about the appropriate shared secret. Once all entries were reset, authentication started to be validated correctly.
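To show why a mismatched shared secret would produce authentication failures, here is a minimal sketch of the User-Password hiding scheme from RFC 2865, section 5.2. It assumes PAP-style authentication, which may not match our actual dial-in configuration; the function names and the sample secrets are hypothetical.

```python
import hashlib


def hide_password(password: bytes, secret: bytes, request_auth: bytes) -> bytes:
    """Client (portmaster) side: hide the password using the shared secret."""
    # Pad with NULs to a multiple of 16 octets (at least one block).
    length = max(16, ((len(password) + 15) // 16) * 16)
    padded = password.ljust(length, b"\x00")
    hidden = b""
    prev = request_auth
    for i in range(0, length, 16):
        digest = hashlib.md5(secret + prev).digest()
        block = bytes(p ^ d for p, d in zip(padded[i:i + 16], digest))
        hidden += block
        prev = block  # each block is chained to the previous hidden block
    return hidden


def reveal_password(hidden: bytes, secret: bytes, request_auth: bytes) -> bytes:
    """Server (radius daemon) side: recover the password with its own secret."""
    plain = b""
    prev = request_auth
    for i in range(0, len(hidden), 16):
        digest = hashlib.md5(secret + prev).digest()
        block = hidden[i:i + 16]
        plain += bytes(c ^ d for c, d in zip(block, digest))
        prev = block
    return plain.rstrip(b"\x00")


# Hypothetical values for illustration only.
request_auth = bytes(range(16))
hidden = hide_password(b"hunter2", b"portmaster-secret", request_auth)
matched = reveal_password(hidden, b"portmaster-secret", request_auth)
mismatched = reveal_password(hidden, b"stale-secret", request_auth)
```

When both sides hold the same secret, `matched` is the original password. When they differ, `mismatched` is garbage, so the daemon rejects a user who typed the correct password.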

During the process, four modems were identified as failing to respond properly and were removed from service.

Crystal – problems connecting to SAN

DBA reported problems with CRYSTAL on Saturday, 8/28 in the evening.

Examined system on Monday, 8/30. System appears to be unable to communicate with SAN. Cleaned fibre, switch and HBA, but no luck.

Examined HBA on 8/31: no LED on card. Called vendor support; 3.5 hours later the HBA was considered bad and a new HBA was sent out (4-hour delivery). New HBA installed, but still loading SAN drives. Lights are now working on both HBA and switch.

Called back support engineer at 9 am and left a voice message. Called vendor support at 11 am to talk with another engineer and was told our call would be assigned to another engineer. Original engineer called back at 4 pm to apologize that another engineer had not been assigned. Dual entries were seen in the switch. The old entry was removed and the system rebooted, and the SAN drives are now visible.

Restarted databases.