We have a board layout that satisfies the routing rules in the "DDR3 Design Requirements for KeyStone Devices" application report (SPRABI1B, May 2014, pages 27-29). After simulating the complete DDR3 bus in HyperLynx, we found that most of the address, command, and control signals fail with insufficient setup time at the Slow process corner. All signals have plenty of hold time in all cases, and the Typical and Fast corners pass in all cases.
Worst-case setup time margin is about -100 ps (Slow corner); worst-case hold margin is about +240 ps (Fast corner). The average setup time margin is about -70 ps for Bank A and about -77 ps for Bank B.
It seems to me that lengthening the clock traces between the KeyStone and the first RAM device would solve this, though it would violate the length requirements spelled out in the document above. I would have to make the clock traces around 400 mils longer to delay the clock by 70 ps, which seems like way too much. To delay the clock by 100 ps, the traces would need to be about 570 mils longer, which definitely seems like too much.
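For reference, here is a rough sketch of the arithmetic behind those lengths. It assumes a propagation delay of about 175 ps/inch, which is simply back-calculated from the 70 ps / 400 mil figure and not taken from the actual stackup; the function name and constant are only illustrative.

```python
# Assumed propagation delay, back-calculated from the figures in this post
# (70 ps over about 400 mils). The real value depends on stackup and layer.
PS_PER_INCH = 175.0


def extra_length_mils(delay_ps, ps_per_inch=PS_PER_INCH):
    """Return the added trace length in mils needed for delay_ps of delay."""
    return delay_ps / ps_per_inch * 1000.0


# Average Bank A margin, average Bank B margin, and worst-case margin.
for delay_ps in (70.0, 77.0, 100.0):
    print(f"{delay_ps:.0f} ps -> {extra_length_mils(delay_ps):.0f} mils")
```

With a different delay per inch (for example, an outer layer versus an inner layer), the required lengths scale inversely, so the numbers should be rechecked against the stackup used in the simulation.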
How hard should we try to make the simulation pass at the Slow process corner, given that doing so means going way out of spec on length matching? The process corner is an outer limit, after all.