Network upgrade

Sunday's planned OS upgrade of core network equipment did not go as smoothly as expected. Two systems had difficulty with the upgrade and required a reboot: the mail server and the Oracle development server. Otherwise, the upgrade was a success.

As a result of the problem, we have a better understanding of the upgrade process for future upgrades.

Dial-in Failures

The radius daemon and portmaster were reset to address several reported issues with dial-in access. The core reason users were not being authenticated is not completely clear, but there are indications that the portmaster or the radius daemon became confused about the appropriate shared secret. Once all entries were reset, authentication requests started to be validated correctly.

During the process, four modems were identified as failing to respond properly and were removed from service.

MX record change

The MX record on DNS zones was changed to mx00.ups.edu and mx01.ups.edu in an effort to normalize the naming convention for our mail exchange servers. This change resulted in some mail delivery problems, since not all external mail servers picked up the change in a timely manner. A workaround was implemented to allow mail delivery to continue. Mail messages sent between 10:00am and 11:45am (-8:00 PST) seem to have been affected.

Possible WebMail problem identified

In our efforts to identify the cause of recent problems with the WebMail server, we have been at a loss for information. We have tried to discover what has been causing the delays and unresponsiveness in WebMail as of late. We have looked at possible memory leaks in daemons, possible attacks, and possible misconfigurations. None of these has led to a clear answer.

If Ockham’s Razor holds true, we may now have found the source of the problem. Late yesterday we discovered that the available disk space on the WebMail server was extremely low. Since WebMail serves as an IMAP gateway, temporarily caching and displaying mail messages via an HTTP server, it needs disk space for temporary files. This is the best explanation so far for the problems we have seen.

We have increased available disk space on the server. We have also contacted several individuals who reported problems to determine whether the issue persists.
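
A low-disk condition like this one can be caught earlier with a periodic check. The following is only a minimal sketch in Python, not something that runs on the WebMail server today; the function name, the checked path, and the 10% threshold are all assumptions chosen for illustration.

```python
import shutil


def free_space_report(path, min_free_fraction=0.10):
    """Return (free_fraction, is_low) for the filesystem that contains path.

    min_free_fraction is an illustrative threshold, not a measured value.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; treat them as not low.
        return 1.0, False
    free_fraction = usage.free / usage.total
    return free_fraction, free_fraction < min_free_fraction


if __name__ == "__main__":
    # "." is a placeholder; point this at the directory that holds
    # the temporary files of interest.
    fraction, is_low = free_space_report(".")
    status = "LOW" if is_low else "ok"
    print(f"free space: {fraction:.1%} ({status})")
```

Run from cron or a similar scheduler, a check of this kind would report shrinking free space before temporary files can no longer be written.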

Modification to CBord backup

This morning the operators raised questions about the status of the ntbackup process on the C-BORD Odyssey server, including where the backup log file is located.

During my investigation, I discovered that the %Odyssey% variable does not work in all situations, so I modified the tapebackup.cmd file to use the absolute path that the %Odyssey% variable contains.
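
For scripts that still need to rely on a variable, one alternative to hard-coding the path is to fall back to a known absolute path only when the variable is missing or empty. The sketch below shows that idea in Python rather than in the .cmd file itself; the function name, the variable name as spelled here, and the fallback path are hypothetical placeholders, not the values used on the Odyssey server.

```python
import os


def resolve_dir(var_name, fallback, env=None):
    """Return the directory named by an environment variable.

    Falls back to the given absolute path when the variable is unset
    or blank. env defaults to the real process environment; passing a
    dict makes the behaviour easy to check.
    """
    env = os.environ if env is None else env
    value = env.get(var_name, "").strip()
    return value if value else fallback


if __name__ == "__main__":
    # Hypothetical placeholder path, for illustration only.
    base = resolve_dir("ODYSSEY", r"C:\path\to\odyssey")
    print("backup base directory:", base)
```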

Mail server slowness I

The University mail server began timing out clients and slowing down at approximately noon today. The cause of the problem is unclear.

The number of processes running on the server and the load average both appeared to be normal. The server's response over secure shell was slow. It appeared as though the system was having a problem allocating resources for the running processes, but memory and disk space were both available.

We shut down processes one at a time to try to identify the cause of the slowness. Imapd and ipop3d were disabled with no luck. The web server was shut down, again with no luck. Then we shut down mailman and sendmail. At this point things improved.

I disabled the webmail interface and brought sendmail and mailman back up. I then brought the web server back up. The server still appeared to be responding more slowly than normal, but in a somewhat timely manner. I restarted the ipop3d daemon and, about five minutes later, the imapd daemon. The server still looked good: slower than normal, but somewhat timely. After restarting the webmail interface, things went into the tank.

We tried restarting the webmail server between 3:00 and 3:30pm, but the unbearable slowness remained.

We then rebooted the mail server at about 3:40pm.

The server was back to normal after the reboot.

Cbord Odyssey unresponsive to registers

The Cbord server was reported to not be responding to inquiries from the cash registers at approximately 12:30 PST. Upon inspection of the server, we discovered that pcAnywhere was hung waiting for a connection. We received a VFEP error when trying to access the Odyssey Control panel.

We believe that C-Bord support encountered errors while doing routine maintenance, and that these errors hung the server.

RESOLUTION: We rebooted the Odyssey server and had the cash registers re-inquire.