Researching the Cause of a Linux Server Crash

From Ubiquity Server Wiki


Server crashes have many possible causes, including file corruption, misconfigured administration settings, runaway scripts, exploited software, and failing hardware, to name just a few. If your server becomes completely unreachable on a regular basis, a responsible systems administrator will want to examine the server's logs and configuration closely.


Examining the Events

  • Once your Linux server returns to service, run the command dmesg to view the most recent kernel messages and errors.
  • Check the system messages stored in /var/log/messages for the most recent events.
  • If you don't know what an error means, search Google for its exact text and compare several sources for accurate guidance from other sysadmins.
  • In the period after a crash, it's highly recommended to leave the command top running to watch for daemons that may be locking up the server.
  • If problems continue, ask a tech to examine and record what is shown on your server's console after the crash, rather than just rebooting it remotely.
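The checks above can be gathered into one report so nothing is lost before the next reboot. The following is only a minimal sketch: the names LOG_FILE, OUT_DIR and collect_crash_info are illustrative, the keyword list is an assumption about which messages matter, and /var/log/messages may be named differently on some distributions.

```shell
#!/bin/sh
# Illustrative defaults; override them in the environment if needed.
LOG_FILE="${LOG_FILE:-/var/log/messages}"
OUT_DIR="${OUT_DIR:-/tmp/crash-report-$(date +%Y%m%d-%H%M%S)}"

collect_crash_info() {
    mkdir -p "$OUT_DIR" || return 1

    # Kernel ring buffer (may require root on some systems).
    dmesg > "$OUT_DIR/dmesg.txt" 2>&1 || echo "dmesg failed" >&2

    # Last 200 lines of the system log.
    if [ -r "$LOG_FILE" ]; then
        tail -n 200 "$LOG_FILE" > "$OUT_DIR/messages.txt"
    else
        echo "cannot read $LOG_FILE" > "$OUT_DIR/messages.txt"
    fi

    # Lines matching a few common failure keywords (an assumed list).
    grep -iE 'panic|oops|out of memory|segfault|i/o error' \
        "$OUT_DIR/messages.txt" > "$OUT_DIR/suspicious.txt" || true

    # One batch-mode snapshot of top, for comparison with later runs.
    top -b -n 1 > "$OUT_DIR/top.txt" 2>&1 || echo "top failed" >&2

    # Print the report directory so the caller knows where to look.
    echo "$OUT_DIR"
}
```

Calling collect_crash_info shortly after the server comes back up prints the directory holding the report. The snapshot does not replace leaving top running interactively; it only records a starting point.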


Possible Solutions for Continued Instability

Sometimes a crash just plain won't give a clear error. If no obvious solution presents itself at the systems administration level, and problems persist for an extended period of time, a few options are available:

  • On Linux, request that our staff take your server offline to run fsck (takes 1-2 hours).
    • fsck (file system check) scans the file system for errors and inconsistencies and repairs them.
  • Failing hardware usually produces a lot of errors, but our staff will always run hardware diagnostics for you if requested.
    • We typically use MemTest to test memory and PCTools for hard disk checks. If you suspect one or the other, please specify which you would like.
    • These tests will generally show whether a hardware problem exists and where, and usually take 1-2 hours to complete.
  • Sometimes a fresh re-install on a problematic server configuration is the only cure for operating system instability.
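For readers who want to see what a file system check involves before asking staff to run one, here is a hedged sketch of a report-only check. It assumes an ext-family file system, for which fsck -n reports problems without repairing them. The helper names is_mounted and safe_fsck_check, the MOUNTS_FILE variable, and any device path you pass are illustrative. The check refuses to run on a mounted device, because checking a live file system can give misleading results.

```shell
#!/bin/sh
# Illustrative default; /proc/mounts lists currently mounted devices.
MOUNTS_FILE="${MOUNTS_FILE:-/proc/mounts}"

# Succeeds if the device appears as the first field of a mounts line.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "$MOUNTS_FILE"
}

# Report-only check: refuses mounted devices, never repairs anything.
safe_fsck_check() {
    dev="$1"
    if is_mounted "$dev"; then
        echo "refusing: $dev is mounted" >&2
        return 2
    fi
    # -n: open read-only and answer "no" to every repair prompt.
    fsck -n "$dev"
}
```

A call such as safe_fsck_check /dev/sdb1 (a hypothetical device) would need root and an unmounted partition. An actual repair pass is still best left to the offline procedure described above.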


