Archive for the ‘Implementation’ Category

Hold Derating Kicks Hold Margins’ Butt

Monday, October 19th, 2009

Hold time is an emotive issue amongst SoC designers because, unlike setup timing problems, hold time problems kill chips. Designers generally want a high “hold margin” to ensure their design cannot have hold problems – justifiably, no one wants their chip to come back a brick and then find out it was a hold problem.

However, overconservatism can easily backfire. We worked on one design where the backend team, having had issues with hold in the past, was requiring a very high holdtime margin in the supplied constraints. We did some simple math and determined that

required holdtime margin > flop clock->q delay + 1x buffer delay + D/SI holdtime requirement

What this meant was that every back-to-back flop pair, including almost the entire scan chain, would be an automatic hold violator once implemented. Although we pointed this out, the issue was not resolved until implementation was actually run and the block area estimates blew up – for this very reason, the number of hold buffers added exceeded the number of flops in the design.
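
To make this concrete, here is a back-of-the-envelope version of the same check, written SDC-style. All numbers and the clock name are hypothetical, chosen only to illustrate the inequality above:

  # Hypothetical library numbers (illustration only):
  #   flop CLK->Q delay        : 0.12 ns
  #   one min-delay buffer     : 0.05 ns
  #   SI pin hold requirement  : -0.03 ns  (negative hold requirements are common)
  # For a direct Q -> SI hop on the same clock net (zero skew):
  #   hold slack = 0.12 + 0.05 - (-0.03) - margin = 0.20 ns - margin
  # So any hold margin above 0.20 ns turns every such hop into a violator, e.g.:
  set_clock_uncertainty -hold 0.25 [get_clocks core_clk]
  #   hold slack = 0.20 - 0.25 = -0.05 ns on every back-to-back scan connection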

To make sense of this issue, it helps to understand how flops are characterized for holdtime.

HoldTime Characterization

Generally, characterization tools are limited to observing only the “external” nodes of a cell during analysis runs. So a common approach is to move the “data” input edge toward the clock edge until the CLOCK->Q delay degrades by some percentage (usually 10%), and record that as the hold time. What this means is that the internal flop storage node _is_ being disrupted, just not enough to cause an output node flip. To ensure this doesn’t happen, you add enough margin to your holdtime to account for two things:

  1. the extent to which you’re uncomfortable with a 10% output timing degradation being “assumed”
  2. the extent to which you don’t trust the results of the characterization process 100%.

Usually we recommend 2-3 buffer delays of margin, depending on technology. This covers pretty much everything. BUT WAIT, we hear you say, this is not nearly enough to ensure functionality! What about derating?! Well, we’re coming to that.
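
In constraint terms, that margin usually ends up in the hold uncertainty. A minimal sketch, assuming a hypothetical clock called core_clk and roughly 50 ps per buffer in the target library:

  # Roughly two buffer delays of hold margin (assuming ~0.05 ns per buffer)
  set_clock_uncertainty -hold 0.10 [get_clocks core_clk]
  # Setup uncertainty is a separate, usually larger, number that also covers jitter
  set_clock_uncertainty -setup 0.25 [get_clocks core_clk]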

(The other approach is to custom-characterize all the flops in your design so that you observe the internal storage node & make sure IT doesn’t glitch by more than X%. You can do this if you’re a big company or a genius. Otherwise, use the supplied models & add a little margin.)

Now, our controversial assertion here is that, no matter how bad your clock tree or your variation is, when two flops are driven by the same clock net (with transition times in the characterized range), if holdtime is “met” in STA between those two flops, then it’s impossible to have a true holdtime violation between them in the lab.

This is because it’s _always_ the same edge that arrives at the two back-to-back flops. Here’s an example.

[Figure: HoldTime-NoViolationPossible]

Where true hold violations do occur, it’s due to _variation_ in cell delays and parasitic delays (noise-induced or otherwise). This occurs when there is divergence in the driving CTS tree.

[Figure: HoldTime-ViolationPossible]

A few short years ago, only Synopsys PrimeTime supported this properly (with pessimism removal and clock/data derating). Back then, we did ECOs to fix these violations, but since we were in 130nm or the early stages of 90nm the number was relatively small, so it didn’t matter much. Nowadays, every major tool supports derating for both setup & hold natively, so unless you have severe correlation issues between implementation & signoff tools it’s not such a huge issue.

Now the critical question is: how do you choose a good derating factor? Interestingly, we often see incredible conservatism in hold “margin”/uncertainty, but much less conservatism in the derating factors applied during hold fixing. The range we’ve seen is anywhere from 5% (low) to 18% (high, but that was a high-performance design with custom clock trees). In general, set the derating as high as you can while keeping an appropriate lid on the number of buffers required to fix hold.
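
In PrimeTime-style commands, a modest hold derate plus the pessimism removal mentioned above looks roughly like the sketch below. The 5% figure is just the low end of the range above; treat the exact syntax as an assumption and check your own tool’s documentation and recommended flow:

  # Shrink "early" (shortest-path) delays and stretch "late" (longest-path) delays by 5%
  set_timing_derate -early 0.95
  set_timing_derate -late 1.05
  # Analyze with on-chip variation and remove common clock path pessimism (CRPR)
  set_operating_conditions -analysis_type on_chip_variation
  set timing_remove_clock_reconvergence_pessimism true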

Where things really get interesting is multimode and multicorner hold analysis and fixing. This article is too short to cover that properly, but some things to consider, especially if you’re doing complicated SoCs with 10s-100s of clocks, or if you’ve turned on “useful skew optimization” in your implementation tool, are:

  1. don’t assume that the BC model / BC parasitics corner covers every possible worst-case behavior for holdtime. We’ve found cases where BC models combined with either WC or TT parasitics cause additional holdtime problems (although the right initial “optimization” corner is still BC/BC).
  2. often, implementation is done with a functional SDC, and hold closure is done with a different constraint set. It’s possible to create a multimode constraint file – which is in general a tough problem. Probably the “easiest” way to get 95% of the problem solved is to remove any case analysis that sets the value of the flop SE (ScanEnable) pins and replace it with a set_false_path (see the constraint sketch after this list). That forces the CP->Q->SI paths to be sensitized, while ignoring the timing on the scan control signals, which are typically don’t-cares. Watch out when you do this, though – all of a sudden a very large number of interclock paths will become sensitized (paths that were previously masked by the SI pins in the scan chain being disabled). Having a tool that automatically finds & fixes missing interclock false paths can help!
  3. back in the days of 90nm/130nm, we could generally assume that any hold violator had plenty of setup slack, and you could just drop buffers all over the place to fix hold violations. This is not the case anymore with the really small mincase delay values in sub-65nm technologies. We’re finding more and more cases where there are 100s to 1000s of pins that are BOTH setup and hold violators – which means you need far more sophistication to fix the hold violations – something we’ll discuss in the future.
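
Returning to point 2: a minimal sketch of the scan-enable swap, with hypothetical port and clock names:

  # Functional-mode SDC typically pins scan enable low, which masks all Q -> SI paths:
  #   set_case_analysis 0 [get_ports scan_enable]
  # For hold closure, drop the case analysis and instead ignore timing *on* the
  # scan-enable signal itself, so the CP -> Q -> SI paths stay sensitized:
  set_false_path -from [get_ports scan_enable]
  # Watch out: newly sensitized Q -> SI hops can cross clock domains; genuinely
  # asynchronous crossings still need their own exceptions, e.g.:
  set_false_path -from [get_clocks clk_a] -to [get_clocks clk_b]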