A little off-topic, but looking at the disassembly is important on the big processors too if you want to squeeze out the best performance. In particular, auto-vectorizing compilers can be pretty finicky: something as small as a possible pointer alias or a loop-carried dependency is enough to derail them. For DSP filter implementations, properly vectorized inner loops make a huge difference.
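To make that concrete, here is a minimal sketch of the kind of inner loop I mean. The function name `fir_f32`, its signature, and the coefficient ordering are made up for illustration, not taken from any particular library. The `restrict` qualifiers promise the compiler that the buffers do not overlap, and the `omp simd` pragma permits reordering the floating-point sum, which strict FP semantics would otherwise forbid.

```c
#include <stddef.h>

/* Hypothetical FIR kernel (illustrative only):
 *   y[n] = sum over k of h[k] * x[n + k]
 * Assumes the taps in h are already stored in the order they are applied,
 * and that x holds at least nout + ntaps - 1 samples. */
void fir_f32(float *restrict y, const float *restrict x,
             const float *restrict h, size_t nout, size_t ntaps)
{
    for (size_t n = 0; n < nout; ++n) {
        float acc = 0.0f;
        /* Allow the reduction to be split across SIMD lanes.
         * Takes effect only when built with -fopenmp-simd (or -fopenmp);
         * otherwise the pragma is ignored and the loop stays as written. */
        #pragma omp simd reduction(+:acc)
        for (size_t k = 0; k < ntaps; ++k)
            acc += h[k] * x[n + k];
        y[n] = acc;
    }
}
```

Building with something like `gcc -O3 -march=native -fopenmp-simd -S fir.c` and reading the inner loop in the resulting `fir.s` shows whether it really uses packed instructions (for example `vfmadd231ps` on x86 with FMA, or `fmla` on AArch64) rather than scalar ones. GCC's `-fopt-info-vec-missed` and Clang's `-Rpass-missed=loop-vectorize` report loops the vectorizer gave up on, but the disassembly is still the final word.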

