We keep running into this issue. The mem_per_process setting does help, but it is only a heuristic and needs to be retuned quite frequently.
The last time this occurred, I gave it some thought. What if we tuned the build system a little? We could identify the biggest memory hogs and build them sequentially (perhaps after the parallel part of the build has finished). I know ceph-dencoder was extremely memory-hungry: it required more than 4 GB of RAM, which a 32-bit process cannot address, so we could no longer build it on 32-bit architectures at all. It might be the first candidate for the 'sequential' list. Do we have any other culprits?
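To make the idea concrete, here is a rough sketch of the two-phase plan in Python. It is only an illustration, not how our build scripts work today: the function names, the sample target list, and the assumption that the mem_per_process heuristic boils down to "available memory divided by a per-process budget" are all mine. Only ceph-dencoder as a heavy target comes from this thread.

```python
# Hypothetical list of memory-heavy targets; extend it as culprits are found.
HEAVY_TARGETS = ["ceph-dencoder"]


def parallel_jobs(total_mem_mb, mem_per_process_mb, cpu_count):
    """Assumed form of the heuristic: cap the job count by memory and by CPUs."""
    return max(1, min(cpu_count, total_mem_mb // mem_per_process_mb))


def build_plan(all_targets, heavy_targets, jobs):
    """Return a list of (targets, job_count) phases.

    Phase 1 builds every light target in parallel; each heavy target then
    gets its own phase with a single job, so it never competes for memory.
    """
    heavy = [t for t in all_targets if t in heavy_targets]
    light = [t for t in all_targets if t not in heavy_targets]
    plan = []
    if light:
        plan.append((light, jobs))
    plan.extend(([t], 1) for t in heavy)
    return plan


if __name__ == "__main__":
    # Sample numbers only: 16 GB of RAM, 2500 MB per process, 8 CPUs.
    jobs = parallel_jobs(16384, 2500, 8)
    for targets, n in build_plan(["ceph-osd", "ceph-dencoder", "ceph-mon"],
                                 HEAVY_TARGETS, jobs):
        print(f"-j{n}: {' '.join(targets)}")
```

One caveat with this approach: building a heavy target with a single job serializes all of its compilation units, not just the expensive ones, so the sequential list should stay short.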
What do you think?
Regards,
Boris