Researching the Cause of a Linux Server Crash
From Ubiquity Server Wiki
Server crashes can occur for many reasons, including file corruption, server administration settings, out-of-control scripts, exploited software, and failing hardware, to name just a few. If your server is becoming completely unreachable on a regular basis, a responsible systems administrator will want to closely examine the server's configuration.
Examining the Events
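- Once your Linux server returns to service, run the command dmesg to view the most recent kernel messages and errors. Note that dmesg only shows messages from the current boot, so it is most useful for errors that recur during or after startup.
- Check the system messages stored in /var/log/messages for the events leading up to the crash (on Debian-based systems the equivalent file is /var/log/syslog).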
- If you don't know the meaning of an error, search Google for the exact error text and compare a variety of sources for accurate advice from other sysadmins.
- In the time after a crash, it's highly recommended to leave the command top running to watch for daemons that may be locking up the server.
- If problems continue, ask a tech to examine and record the data shown at the console of your server following the crash, rather than just rebooting it remotely.
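The checks above can be sketched as a small shell helper. This is a minimal illustration rather than a complete diagnostic tool: the function name scan_crash_log and its list of patterns are assumptions chosen for the example, and the messages your server actually logs may look different.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of a log file that match a few
# common crash signatures. The pattern list is an illustrative assumption,
# not an exhaustive catalogue of kernel or daemon errors.
scan_crash_log() {
    log_file="$1"
    grep -iE 'out of memory|oom-killer|kernel panic|i/o error|segfault|call trace' "$log_file"
}

# Typical use once the server is back in service (usually needs root):
#   dmesg | tail -n 50                              # latest kernel messages
#   scan_crash_log /var/log/messages | tail -n 50   # suspicious log lines
#   top -b -n 1 | head -n 20                        # one batch-mode snapshot of top
```

Running top in batch mode (-b) with a single iteration (-n 1) prints one snapshot that can be saved to a file, which is handy when the server locks up before anyone can watch an interactive session.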
Possible Solutions for Continued Instability
Sometimes a crash just plain won't give a clear error. If no obvious solution presents itself at the systems administration level and the problems persist, a few options are available:
- In Linux, request that our staff take your server offline to run FSCK (takes 1-2 hours)
  - A Linux File System ChecK (fsck) checks the file system for errors and inconsistencies and repairs them
- Usually a lot of errors will show if hardware is failing; however, our staff will always run hardware diagnostic checks for you if requested
  - We typically use MemTest to test memory and PCTools for hard disk checks. If you suspect one or the other, please specify which you would like
  - This process will generally diagnose definitively whether hardware problems exist and where, and generally takes 1-2 hours to complete
- Sometimes a fresh re-install is the only cure for operating system instability on a problematic server configuration
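The server has to come offline for FSCK because repairing a mounted file system can damage it. The sketch below illustrates that precaution under stated assumptions: is_mounted and read_only_fsck are hypothetical helper names, /dev/sdXN is a placeholder device, and the -n option (report problems without repairing them) is honored by many but not all file-system-specific checkers. It is not the procedure our staff follow.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the device is listed in a mounts table.
# The second argument defaults to /proc/mounts. Note that some systems
# list the root file system under another name (for example /dev/root),
# so treat a negative answer with care.
is_mounted() {
    device="$1"
    mounts_file="${2:-/proc/mounts}"
    awk -v dev="$device" '$1 == dev { found = 1 } END { exit (found ? 0 : 1) }' "$mounts_file"
}

# Hypothetical wrapper: run a no-repair check only on an unmounted device.
read_only_fsck() {
    device="$1"
    mounts_file="${2:-/proc/mounts}"
    if is_mounted "$device" "$mounts_file"; then
        echo "refusing to check $device: it is mounted" >&2
        return 1
    fi
    fsck -n "$device"   # -n: report problems only, do not attempt repairs
}

# Example (placeholder device name, run as root from a rescue environment):
#   read_only_fsck /dev/sdXN
```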
