5/24 – MERLIN2 failure

MERLIN2 became unresponsive to file share access today at 5:30 PM. The console was still responding, and we were able to log on. Access to the disk arrays appeared to be impaired: we were unable to list the disks or view their contents. No pertinent events were logged in the Event Log. The system was rebooted and returned to normal.

We have made an adjustment to the antivirus software (changed vendors), and will keep monitoring.

Email Problems

There were reports last night and this morning that users were having problems connecting to webmail. The problem appears to be linked to the timing of the Sophos PureMessage quarantine digest messages, combined with the slow disk array.

We updated our PureMessage configuration to send one digest message instead of having each server send individual messages. As a result, all the messages hit the mail server at the same time, and the combination of digest e-mail and client access at the start of the business day overwhelmed it.

We are changing the interval for digest messages from twice a day to once per day and moving the time to 1:30 AM each morning.

MERLIN2 Problems

MERLIN2, the administrative file server, became unresponsive for unknown reasons, causing workstations connected to it to also become unresponsive.

The server was rebooted, which cleared the problem. Workstations may need to be rebooted as well.

IMAP and POP3 and Webmail Slowness II

Sun has identified some issues with our disk array configuration and has provided some new settings for us to apply. One of the changes has been made, with little effect on disk performance. The other two changes will require that the email server be shut down. We are scheduling this right now. Please refer to the “Scheduled Outages” page (http://www2.ups.edu/ois/nssg/network/alerts.shtml) for the latest information.

WebMail Problems I

Today at about 10:00 AM, the HelpDesk reported a major failure in WebMail.

We noticed a large number of processes (about 1000 and growing) on the mail server, and a larger than normal amount of mail in the queue. WebMail was stopped, server processes on the mail server were stopped, and the mail queues were processed by hand.
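A runaway process count like the one described above can be caught earlier with a simple periodic check. The following is a minimal sketch, not the tooling we actually used: the function names and the threshold are hypothetical, and it assumes a `ps` that accepts the POSIX `-e -o comm=` options.

```python
import subprocess
from collections import Counter


def count_by_command(ps_output):
    """Count processes per command name from `ps -e -o comm=` style output."""
    return Counter(
        line.strip() for line in ps_output.splitlines() if line.strip()
    )


def over_threshold(counts, threshold):
    """Return True when the total number of processes exceeds the threshold."""
    return sum(counts.values()) > threshold


def current_ps_output():
    """Run ps and return one command name per line (assumes a POSIX ps)."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    counts = count_by_command(current_ps_output())
    # The threshold of 500 is an illustrative assumption, not a measured baseline.
    if over_threshold(counts, 500):
        for name, n in counts.most_common(5):
            print(f"{n:6d}  {name}")
```

Grouping by command name shows which service is spawning the extra processes, which is usually the first question once the total looks abnormal.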

E-mail problem: 5550 5.3.0 Can’t create output

Some users reported this morning that they were unable to send messages to other users. They received a common error:

Final-Recipient: RFC822; username@ups.edu
X-Actual-Recipient: RFC822; username@ups.edu
Action: failed
Status: 5.3.0
(reason: Can’t create output)

This error was the result of an incomplete deactivation of the quota system. The quota system had been turned off, but its entries had not been removed from the fstab file. When the system was rebooted yesterday, the quota system was re-enabled and locked accounts that had exceeded their time limit. This issue has been corrected.
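A quick way to catch this kind of leftover before a reboot is to scan the fstab file for quota mount options. The sketch below is illustrative only: it assumes a Linux-style fstab layout (device, mount point, type, options) and the `usrquota`, `grpquota`, and `quota` option names. Other Unix systems use different file layouts and option names, and the sample entries are made up.

```python
QUOTA_OPTIONS = {"usrquota", "grpquota", "quota"}


def quota_mounts(fstab_text):
    """Return (mount_point, quota_options) for entries that still enable quotas."""
    found = []
    for line in fstab_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # Not a complete fstab entry.
        mount_point = fields[1]
        options = fields[3].split(",")
        # Options may carry a value, e.g. usrquota=aquota.user.
        enabled = sorted(o for o in options if o.split("=")[0] in QUOTA_OPTIONS)
        if enabled:
            found.append((mount_point, enabled))
    return found


if __name__ == "__main__":
    # Hypothetical fstab contents for illustration.
    sample = (
        "/dev/sda1  /          ext3  defaults           1 1\n"
        "/dev/sdb1  /var/mail  ext3  defaults,usrquota  1 2\n"
    )
    for mount_point, opts in quota_mounts(sample):
        print(f"{mount_point}: quota options still present: {', '.join(opts)}")
```

An empty result means no quota options will be picked up at the next boot, at least as far as this file is concerned.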

Network upgrade

The planned OS upgrade of core network equipment on Sunday did not go as smoothly as planned. Two systems had difficulty with the upgrade and required a reboot: the mail server and the Oracle development server. Otherwise, the upgrade was a success.

As a result of the problem, we have a better understanding of the upgrade process for future upgrades.

Packeteer Fails

The Packeteer failed on 12/15/04. In fact, it has been failing for the past few months with a hard drive problem, but today it went down hard. Packeteer is shipping a new unit to arrive 12/17/04. Until then, traffic on our Internet connection is uncontrolled.

Dial-in Failures

The radius daemon and portmaster were reset to address several reported issues with dial-in access. The core reason for users not being authenticated is not completely clear, but there are indications that the portmaster or the radius daemon became confused about the appropriate shared secret. Once all entries were reset, authentication started to be validated correctly.

During the process, four modems were identified as failing to respond properly and were removed from service.