One of the issues I've run into in the past was a MySQL problem where their system was creating tons of temporary tables. With the massive amount of traffic they were getting, this was killing their database server: the temporary tables were choking the disk's I/O, and the server sat at 80% iowait most of the time because the disk cache was paging things in and out like crazy.
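If you want to check whether a box is in the same state, here is a rough Python sketch that works out the iowait share from two samples of the aggregate "cpu" line in /proc/stat on Linux. The function names and the 5-second sampling interval are my own choices for illustration, not what was used on that server; tools like top or iostat report the same number.

```python
import time


def parse_cpu_line(line):
    """Return the integer counters from the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in parts[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two 'cpu' lines."""
    old = parse_cpu_line(before)
    new = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(old, new)]
    # Fields: user, nice, system, idle, iowait, irq, softirq, steal, ...
    # The guest fields that follow are already counted in user/nice,
    # so only the first eight are summed for the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open(path) as handle:
        before = handle.readline()
    time.sleep(interval)
    with open(path) as handle:
        after = handle.readline()
    return iowait_percent(before, after)
```

Calling sample_iowait() on the database host during peak traffic gives a figure comparable to the 80% mentioned above.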

I went back and forth with them about optimizing the app, but it was apparently some huge labyrinthine monstrosity. They insisted that they didn't have the resources to do any of the significant rewrites that it would require to fix the app to do proper queries (or at least, not enough to be worthwhile).

Eventually I gave up. /tmp was mounted onto a separate partition, so I disabled ext3 journalling and set commit=30 so that it only synced to the disk every 30 seconds. Since no temporary tables lasted that long, the VFS layer never wrote to the disk if it didn't have to. /tmp effectively became an in-memory cache, and CPU use dropped to 5%.
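To confirm that a change like this actually took effect, one option is to read the mount options back out of /proc/mounts. The sketch below is only an illustration of that check: the helper names are hypothetical, and it reports whatever options the kernel lists for a mount point rather than changing anything.

```python
def mount_options(mounts_text, mount_point):
    """Find a mount point in /proc/mounts-style text.

    Returns a dict with device, fstype and options, or None if not mounted.
    The last matching line wins, since later mounts shadow earlier ones.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        if fields[1] == mount_point:
            found = {
                "device": fields[0],
                "fstype": fields[2],
                "options": fields[3].split(","),
            }
    return found


def commit_interval(options):
    """Return the commit=N value in seconds, or None if the option is absent."""
    for option in options:
        if option.startswith("commit="):
            return int(option.split("=", 1)[1])
    return None


def check_tmp(path="/proc/mounts"):
    """Report the filesystem type and commit interval for /tmp (Linux only)."""
    with open(path) as handle:
        entry = mount_options(handle.read(), "/tmp")
    if entry is None:
        return None
    return entry["fstype"], commit_interval(entry["options"])
```

Running check_tmp() after remounting should show the commit interval you set; a result of None for the interval means the option is not listed for that mount.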

Optimizing isn't about a checklist. It's about looking at the system that you have, understanding what it's doing and why, and understanding how the other systems around it behave so that you can resolve the issue. Moving to another database server wouldn't have helped them. Moving to a RAID would have reduced the impact, but their load didn't scale linearly, so they'd have hit their limit in a few months anyway.