The secondary web server www2 became unresponsive when the SurgeFTP service began consuming 99% of the CPU cycles. We were unable to stop the service or gracefully restart the system. The system was powered down, then back on, after which the system ran normally.
Category Archives: Failures
5/24 – MERLIN2 failure
Merlin2 became unresponsive to fileshare access today at 5:30 PM. The console was still responding, and we were able to log on. Access to the disk arrays appeared to be impaired – we were unable to list the disks or view their contents. No pertinent events were logged in the Event Log. The system was rebooted, and was back to normal.
We have made an adjustment to the antivirus software (changed vendors), and will keep monitoring.
5/10 – ALEXANDRIA Stable – for the moment
ALEXANDRIA activated its spare memory bank on 11 April. Since that time it has not frozen. A new server is en route to replace the defective replacement server.
Email Problems
There were reports last night and this morning that users were having problems connecting to webmail. The problem appears to be linked to the Sophos PureMessage software, the timing of the quarantine digest messages, and the slow disk array.
We updated our PureMessage configuration to send one digest message instead of having each server send individual messages. As a result, all the messages hit the mail server at the same time and, combined with client access at the start of the business day, overwhelmed it.
We are modifying the interval for digest messages from twice a day to once per day and moving the send time to 1:30 AM each morning.
MERLIN2 Problems
MERLIN2, the administrative file server, became unresponsive for unknown reasons, causing workstations connected to it to also become unresponsive.
The server was rebooted, which cleared the problem. Workstations may need to be rebooted as well.
IMAP, POP3, and Webmail Slowness IV
The firmware upgrades did not resolve the problem with disk performance, even though we had a good couple of days. Sun’s final analysis led them to the conclusion that the disk array on which mailboxes reside has simply reached a saturation level in terms of I/O rate. This will mean that we will have to get a faster disk array.
IMAP and POP3 and Webmail Slowness III
Firmware upgrades to the mail server disks have been applied. This had little effect on performance. We’re following up with Sun to determine next steps.
IMAP and POP3 and Webmail Slowness II
Sun has identified some issues with our disk array configuration. They have provided some new settings for us to apply. One of the changes has been made, with little effect on disk performance. The other two changes will require that the email server be shut down. We are scheduling this right now. Please refer to the “Scheduled Outages” page (http://www2.ups.edu/ois/nssg/network/alerts.shtml) for the latest information.
IMAP and POP3 and Webmail Slowness I
We are currently experiencing disk performance problems on the mail server. This is causing slowness and problems connecting with Webmail and POP3 and IMAP clients. We are working with the hardware vendor to correct the problem.
WebMail Problems I
Today at about 10:00 AM, the HelpDesk reported a major failure in WebMail.
We noticed a large number of processes (about 1000 and growing) on the mail server, and a larger than normal volume of mail in the queue. WebMail was stopped, server processes on the mail server were stopped, and the mail queues were processed by hand.
PureMessage not accepting messages
PureMessage stopped accepting messages this morning when the disk volume was filled by log files. Corrections have been made to the logrotate.conf file in an attempt to prevent this from occurring in the future.
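A check along the following lines could give early warning before a log volume fills again. This is only a minimal sketch: the function names and the 90% threshold are hypothetical choices for illustration, not part of our actual monitoring or of PureMessage.

```python
import shutil


def volume_usage_percent(path):
    """Return the percentage of the volume containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path, threshold_percent=90.0):
    """Return True when the volume holding `path` is at or above the threshold.

    The 90% default is an assumed value chosen for illustration only.
    """
    return volume_usage_percent(path) >= threshold_percent
```

Calling is_nearly_full() on the log directory from a scheduled job would flag a filling volume while there is still time to rotate or remove logs.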
E-mail problem: 5550 5.3.0 Can’t create output
Some users reported this morning that they were unable to send messages to other users. They all received the same error:
Final-Recipient: RFC822; username@ups.edu
X-Actual-Recipient: RFC822; username@ups.edu
Action: failed
Status: 5.3.0
(reason: Can’t create output)
This error was the result of an incomplete deactivation of the quota system. The quota system had been turned off, but its entries had not been removed from the fstab file. When the system was rebooted yesterday, the quota system was re-enabled and locked accounts in excess of their time limit. This issue has been corrected.
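A leftover quota entry of this kind can be spotted by scanning the fstab file before a reboot. The sketch below assumes Linux-style mount option names (usrquota, grpquota and their variants); the option names on our mail server may differ, and the function name is hypothetical.

```python
# Mount options that enable quotas on Linux-style systems (an assumption;
# other platforms use different option names).
QUOTA_OPTIONS = {"quota", "usrquota", "grpquota", "usrjquota", "grpjquota"}


def quota_mounts(fstab_text):
    """Return (mount_point, quota_options) for fstab entries that enable quotas."""
    found = []
    for line in fstab_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        fields = line.split()
        if len(fields) < 4:
            continue  # blank, comment-only, or malformed line
        options = fields[3].split(",")
        # Compare only the option name, ignoring any "=value" suffix.
        enabled = sorted(o for o in options if o.split("=")[0] in QUOTA_OPTIONS)
        if enabled:
            found.append((fields[1], enabled))
    return found
```

Passing the contents of the fstab file to quota_mounts() lists every mount point that would have quotas switched back on at the next boot.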
Network upgrade
The planned OS upgrade of core network equipment on Sunday was not as smooth as planned. Two systems had difficulty with the upgrade and required a reboot: the mail server, and the Oracle development server. Otherwise, the upgrade was a success.
As a result of the problem, we have a better understanding of the upgrade process for future upgrades.
Packeteer Fails
The Packeteer failed on 12/15/04. In fact, it has been failing for the past few months with a hard drive problem, but today it went down hard. Packeteer is shipping a new unit to arrive 12/17/04. Until then, traffic on our internet connection is uncontrolled.
Dial-in Failures
The radius daemon and portmaster were reset to address several reported issues with dial-in access. The core reason for users not being authenticated is not completely clear, but there are indications that the portmaster or the radius daemon became confused about the appropriate shared secret. Once all entries were reset, authentication started to be validated correctly.
During the process, four modems were identified as failing to respond properly and were removed from service.