The planned OS upgrade of core network equipment on Sunday did not go as smoothly as hoped. Two systems had difficulty with the upgrade and required a reboot: the mail server and the Oracle development server. Otherwise, the upgrade was a success.
As a result of the problem, we have a better understanding of the upgrade process for future upgrades.
The radius daemon and portmaster were reset to address several reported issues with dial-in access. The core reason for users not being authenticated is not completely clear, but there are indications that the portmaster or the radius daemon became confused about the appropriate shared secret. Once all entries were reset, authentication requests started to be validated correctly.
During the process, four modems were identified as failing to respond properly and were removed from service.
The MX records in our DNS zones were changed to mx00.ups.edu and mx01.ups.edu in an effort to normalize the naming convention for our mail exchange servers. This change resulted in some mail delivery problems, since not all external mail servers picked up the change in a timely manner. A workaround was implemented to allow mail delivery to continue. Mail messages sent between 10:00am and 11:45am (PST, UTC-8:00) seem to have been affected.
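As a rough illustration of how a change like this could be checked, the sketch below compares the MX hostnames we expect against the answers seen from several resolvers and lists the resolvers that are still serving something else. The resolver names and the old hostname in the example data are hypothetical, and the DNS lookups themselves are not shown; they would need a DNS client library or a tool such as dig.

```python
def normalize(host):
    """Lowercase a hostname and drop the trailing dot DNS answers often carry."""
    return host.strip().rstrip(".").lower()


def stale_resolvers(expected, observed):
    """Return the names of resolvers whose MX answers differ from `expected`.

    expected: iterable of MX hostnames that should be published.
    observed: dict mapping a resolver label to the MX hostnames it returned.
    """
    want = {normalize(h) for h in expected}
    return sorted(
        name
        for name, hosts in observed.items()
        if {normalize(h) for h in hosts} != want
    )


if __name__ == "__main__":
    expected = ["mx00.ups.edu", "mx01.ups.edu"]
    # Hypothetical answers, collected by hand for illustration only.
    observed = {
        "resolver-a": ["MX00.ups.edu.", "mx01.ups.edu."],
        "resolver-b": ["old-mx.example.net."],
    }
    print(stale_resolvers(expected, observed))
```

A resolver that still appears in the output after the old record's TTL has passed would be worth a closer look.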
The Anti-Virus Gateway failed to update to the 4357 DAT file. The update was run manually and the DAT files were updated. We are now able to detect the W32/Bagle.aa virus.
In our efforts to identify the cause of recent problems with the WebMail server, we have been at a loss for information. We have tried to discover what has been causing the recent delays and unresponsiveness in WebMail. We have looked at possible memory leaks in daemons, possible attacks, and possible misconfigurations. None of these has led to a clear answer.
If Ockham’s Razor holds true, we may now have found the source of the problem. It was discovered late yesterday that the available disk space on the WebMail server was extremely low. Since WebMail serves as an IMAP gateway, temporarily caching mail messages and displaying them via an HTTP server, it needs disk space for temporary files. This is the best explanation so far for the problems we have seen.
We have increased available disk space on the server. We have also contacted several individuals who reported problems to determine whether the issue persists.
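A simple check along the following lines might have flagged the low disk space sooner. This is only a sketch: the 10% threshold and the "/tmp" path are assumptions for illustration, not settings taken from the WebMail server.

```python
import shutil


def free_fraction(total_bytes, free_bytes):
    """Return the share of the filesystem that is still free (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def is_low_on_space(path, min_free=0.10):
    """True when the filesystem holding `path` has less than `min_free` free."""
    usage = shutil.disk_usage(path)
    return free_fraction(usage.total, usage.free) < min_free


if __name__ == "__main__":
    # "/tmp" is a placeholder; point this at wherever temporary files live.
    if is_low_on_space("/tmp"):
        print("warning: temporary file space is running low")
```

Run from cron, a script like this could mail the operators before the temporary file area fills up.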
For background on this problem, see the January 12 weblog entry.
A new domain, mail.ups.edu, was defined with an MX record pointing mail.ups.edu to the antivirus gateway (AVG). This allows us to deliver messages destined for addresses of the form email@example.com. We will leave this in place until 1 July.
I checked the C-BORD Odyssey backup process this morning to inspect my changes. It was determined that the modifications made on 3/18 were not executed. Upon further inspection it was determined that the original file did not use the correct command-line version switches for the backup log. This was corrected, and we are now waiting for the scheduled process to run again.
This morning the operators asked about the status of the ntbackup process on the C-BORD Odyssey server and about the location of the backup log file.
During the course of my investigation, I discovered that the %Odyssey% variable does not work in all situations, so I modified the tapebackup.cmd file to use the absolute path contained in the %Odyssey% variable.
The University mail server began to time out client connections and slow down at approximately noon today. The cause of the problem is unclear.
The number of processes running on the server and the load average appeared to be normal. The server's response over secure shell was slow. It appeared as though the system was having a problem allocating resources for the running processes, but memory and disk space were both available.
We shut down processes to try to identify the cause of the slowness. Imapd and ipop3d were disabled with no luck. The web server was shut down, again with no luck. Then mailman and sendmail were stopped. At this point things improved.
I disabled the webmail interface and brought sendmail and mailman back up. I then brought the web server back up. The server still appeared to be responding more slowly than normal, but in a somewhat timely manner. I restarted the ipop3d daemon and, about five minutes later, the imapd daemon. The server still looked good: slower than normal, but somewhat timely. After the webmail interface was restarted, things went into the tank.
We tried restarting the webmail server between 3:00 and 3:30 pm, but the unbearable slowness remained.
We then rebooted the mail server at about 3:40pm.
The server was back to normal after the reboot.
During the implementation of e-mail routing changes on 2/2/04, the MX record for the ups.edu domain on the internal DNS server was inadvertently disabled.
This problem has been fixed, and a backlog of messages held by the anti-virus gateway is now being delivered.
A high volume of messages has been noticed in the queues of the anti-virus gateway. The cause of this situation is not clear. Currently messages are taking anywhere from 15 minutes to several hours to be delivered. One cause may be the high volume of virus/worm traffic.
The University e-mail service has been reconfigured to route as many messages as possible through the anti-virus gateway before they are sent off-campus or received by the University’s mail servers.
This additional stop for messages increases the delivery time, but is necessary to reduce the propagation of e-mail viruses and worms.
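To put numbers on a backlog like this, one option is to summarize how long queued messages have been waiting. The sketch below assumes we already have the time each message entered the queue (for example, from queue file timestamps); how to obtain those from the gateway is not covered, and the sample times are made up for illustration.

```python
from datetime import datetime, timedelta


def summarize_delays(enqueued, now):
    """Summarize how long queued messages have been waiting.

    enqueued: list of datetimes at which each message entered the queue.
    now: the datetime to measure against.
    Returns None for an empty queue, otherwise a dict with the message
    count, the median wait and the longest wait (as timedeltas).
    """
    ages = sorted(now - t for t in enqueued)
    if not ages:
        return None
    return {
        "count": len(ages),
        "median": ages[len(ages) // 2],
        "oldest": ages[-1],
    }


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the gateway.
    now = datetime(2004, 2, 3, 12, 0)
    sample = [
        now - timedelta(minutes=15),
        now - timedelta(minutes=45),
        now - timedelta(hours=2),
    ]
    print(summarize_delays(sample, now))
```

Tracking the median and the oldest wait over the day would show whether the queue is draining or still growing.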
The C-BORD server was reported to not be responding to inquiries from the cash registers at approximately 12:30 PST. Upon inspection of the server, it was discovered that pcAnywhere was hung waiting for a connection. We received a VFEP error when trying to access the Odyssey Control panel.
It is our belief that C-BORD support encountered errors while doing routine maintenance, and that these errors hung the server.
RESOLUTION: We rebooted the Odyssey server and had the cash registers re-inquire.
The sendmail configuration files on the University’s mail server have been upgraded to version 8.12.10. The earlier problems with the configuration files and the vacation program have been resolved. The vacation program has also been upgraded to prevent rev-lag.
The sendmail configuration files were also modified to use the correct os_type.
Problems were encountered with the sendmail 8.12.10 configuration file and the auto-responder (vacation) message. The configuration file has been rolled back while further investigation is being done.