subbdue commented on How Intel makes sure the FDIV bug never happens again   chiplog.io/p/how-intel-ma... · Posted by u/subbdue
subbdue · 10 months ago
The people and breakthroughs behind Intel’s quiet revolution in formal verification
subbdue commented on DDR4 SDRAM – Initialization, Training and Calibration   systemverilog.io/ddr4-ini... · Posted by u/ivank
willis936 · 6 years ago
Are the EQ values never looked at during debugging/testing?
subbdue · 6 years ago
By EQ values I assume you're referring to the calibration results. These are typically called Delay Registers.

Yes, before signing off on a system, we carefully look at the calibration results and check whether there is enough margin.

Now what do I mean by margin? The delay registers hold an integer value within a certain range, for example 0 to 7. You typically don't want the result to sit at either end of that range, because you want to give periodic calibration (which could run due to big temperature swings) a little room to move in either direction.

If the calibration result does end up at the ends of the range, we have to play around with other DRAM timing parameters, which are managed by the DDR controller.
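A minimal sketch of that margin check, assuming the 0-7 example range above (a real PHY's register map, ranges, and names are vendor-specific):

```python
# Hypothetical margin check on calibration results read back from the PHY's
# delay registers. The 0-7 range is the example from the comment above.
DELAY_MIN, DELAY_MAX = 0, 7   # example delay register range
MARGIN = 1                    # keep at least 1 step away from either end

def has_margin(delay_values):
    """Return True if every calibration result sits away from the rails."""
    return all(DELAY_MIN + MARGIN <= v <= DELAY_MAX - MARGIN
               for v in delay_values)

print(has_margin([3, 4, 2, 5]))  # True  - comfortable margin on every lane
print(has_margin([0, 4, 2, 5]))  # False - first lane railed at the minimum
```

If a result rails like the second case, that's the cue to revisit the other DRAM timing parameters as described above.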

subbdue commented on DDR4 SDRAM – Initialization, Training and Calibration   systemverilog.io/ddr4-ini... · Posted by u/ivank
retSava · 6 years ago
Thanks for the write-up, very interesting!

Wow, the cold/warm soaks sound like they make for a very slow iterative process when it doesn't work on the first try. Do you have several systems soaked at the same time? So if the first one you test fails, you can adjust something and test on a second setup while the first is re-soaked?

I also thought much of the difference in trace length on the PCB is compensated for with those wiggly traces (so all have equal-ish length), but you still need to compensate for it? Or is it just to gain a larger error margin?

subbdue · 6 years ago
> Wow, the cold/warm soaks sound like they make for a very slow iterative process when it doesn't work on the first try. Do you have several systems soaked at the same time? So if the first one you test fails, you can adjust something and test on a second setup while the first is re-soaked?

Thermal chambers are quite expensive, around $100,000 per unit. So bigger shops such as Intel, AMD, Qualcomm probably have many. But I would be surprised if smaller companies have more than a couple.

It is a painful process when a company develops their first system. As you would guess, once they have a proven PCB design with DDR controller firmware, the DDR sub-system design is reused in subsequent systems.

Now say you've been shipping the system for a couple of years. There is one situation under which the above experiments will need to be performed again.

Say your system uses a 16GB DIMM. Micron and Samsung, the DIMM makers, are always trying to improve their manufacturing process, moving to the next node (14nm to 7nm) and so on. So every couple of years you'll find them EOL-ing (End of life) a certain 16GB DIMM for a newer one. There is a chance you'll start seeing failures with the new 16GB DIMM.

> I also thought much of the difference in trace length on the PCB is compensated for with those wiggly traces (so all have equal-ish length), but you still need to compensate for it? Or is it just to gain a larger error margin?

You are partially correct.

Check out this image: https://www.systemverilog.io/ddr4-initialization-and-calibra...

PCB designers match the length of the data lines, which are hooked up in a star topology from the processor to the different DRAMs on the DIMM.

But the address lines are hooked up using a fly-by topology. So data signals launched from the processor arrive at all the DRAMs at the same time, but the clock and address signals launched from the processor reach each DRAM on the DIMM at a different time. Initial calibration compensates for this.
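A toy illustration of what that calibration absorbs, with made-up flight times (real values depend on the board layout):

```python
# Hypothetical sketch: on a fly-by topology the clock reaches each DRAM at a
# different time, while the length-matched data lines arrive together. The
# controller therefore learns a per-DRAM delay (write leveling) so the data
# strobe lines up with the clock at every DRAM. Numbers are illustrative only.

clock_arrival_ps = [120, 185, 250, 315]  # CK flight time to DRAM 0..3 (fly-by)
data_arrival_ps = 120                    # DQ/DQS flight time (length-matched)

# Per-DRAM delay the PHY must add to the strobe to match the clock's arrival
dqs_delay_ps = [ck - data_arrival_ps for ck in clock_arrival_ps]
print(dqs_delay_ps)  # [0, 65, 130, 195]
```

The farther down the fly-by chain a DRAM sits, the larger the compensating delay the calibration has to dial in.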

subbdue commented on DDR4 SDRAM – Initialization, Training and Calibration   systemverilog.io/ddr4-ini... · Posted by u/ivank
subbdue · 6 years ago
Hello, I’m the author of this article. Reading all your comments was an absolute blast.

Since there is some curiosity around temperature and voltage variation - here are some more details for you folks to geek out on.

When you build a system with a DRAM interface, you typically specify 2 parameters:
- A temperature range you guarantee operation within. For example, 0C-80C.
- A maximum rate of change of temperature the system can handle. For example, +/-2C/min.

Now, to test whether the system can withstand the above 2 parameters, it is put in a Thermal Chamber while the firmware is being developed, and experiments such as this are conducted:
- Do a cold soak for a few hours (i.e., power down the system and leave it in a 0C chamber for a few hours).
- Then power on the system and let the DRAM interface calibrate at this low temp.
- Then start a stress test which reads and writes to the memory, and simultaneously ramp up the chamber temperature at the specified rate up to your maximum (2C/min up to 80C in our example here).
- If the test fails, it typically means the signal integrity is not good enough. Then you go back to the lab, probe the DRAM interface, and observe the signals on an oscilloscope (if you have to). Then you re-calculate/fiddle around with 6 parameters until you have it all working. These parameters are:
  1. The drive strength of the transistors at the processor when it is writing data to memory
  2. The termination resistance of the transistors at the processor when it is reading data back from memory
  3. Voltage reference (Vref) - the value the PHY[++] uses to decide if a voltage level is a binary-0 or a binary-1
  4. The same set of 3 parameters exists on the DRAMs as well, making it a total of 6.
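The "fiddle around with 6 parameters" step is essentially a sweep over the signal-integrity knobs. A hypothetical sketch of that loop for the three controller-side knobs (the value ranges and the pass criterion here are invented for illustration; real values come from the DRAM/PHY datasheets and a real stress test):

```python
# Hypothetical parameter sweep: try combinations of drive strength,
# termination, and Vref, keeping those where the stress test passes.
from itertools import product

drive_strengths = [34, 40, 48]   # ohms, controller-side output driver
odt_values = [40, 60, 120]       # ohms, controller-side termination
vrefs = [0.60, 0.65, 0.70]       # Vref as a fraction of VDDQ

def stress_test_passes(ds, odt, vref):
    # Placeholder for a real read/write stress test under a temperature ramp;
    # this toy criterion just pretends one drive/ODT pairing works.
    return ds == 40 and odt == 60 and 0.60 <= vref <= 0.70

passing = [(ds, odt, v)
           for ds, odt, v in product(drive_strengths, odt_values, vrefs)
           if stress_test_passes(ds, odt, v)]
print(passing)  # the settings that survived the sweep
```

In practice you'd pick a setting from the middle of the passing region, for the same margin reasons discussed for delay registers.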

It is easy to imagine what drive strength and termination of transistors mean. But Vref is a bit more interesting.

In DDR4, binary-1 is represented by a 1.2V signal, but binary-0 is a variable voltage level. It could be 0.2V or 0.4V, or whatever - it depends on the termination at either end of the PCB trace. This type of circuit is called POD (Pseudo Open Drain). Since the level of binary-0 is variable, the DDR controller calibration logic has to figure out where to place Vref so it can reliably decode 1s and 0s.
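To make the Vref placement concrete, here is a small worked example with assumed resistor values (the driver and termination impedances are illustrative, not from any specific datasheet):

```python
# Hypothetical POD (Pseudo Open Drain) level calculation: binary-1 is driven
# to VDDQ (1.2 V in DDR4), while binary-0 settles at the voltage divider formed
# by the driver's pull-down and the termination to VDDQ. Vref is then placed
# between the two levels so both can be decoded reliably.
VDDQ = 1.2  # DDR4 I/O rail, volts

def vol_pod(r_driver, r_term):
    """Low level of a POD driver pulling down against a termination to VDDQ."""
    return VDDQ * r_driver / (r_driver + r_term)

def vref_center(r_driver, r_term):
    """Place Vref midway between the high level (VDDQ) and the low level."""
    return (VDDQ + vol_pod(r_driver, r_term)) / 2

# Example: 34-ohm pull-down driver against a 48-ohm termination to VDDQ
print(round(vol_pod(34, 48), 3))      # 0.498 - so binary-0 is nowhere near 0 V
print(round(vref_center(34, 48), 3))  # 0.849
```

Change the termination and the binary-0 level moves, which is exactly why Vref has to be trained per system rather than fixed at VDDQ/2.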

Lastly, just like the cold soak experiment, we also do hot-soaks with a ramp down and several other modalities to ensure the system is solid.

The PHY has delay registers within it which you can read to figure out the result of calibration. When you power on a system after a cold-soak vs a hot-soak, you'll see different values in these delay registers.

PHYs these days are very robust. They typically don't need periodic calibration (re-tuning of delay registers) while operating in a typical data center environment. Of course, it's a different story if the system is sitting somewhere off on an oil rig.

— [++] The PHY is separate from the DDR controller. It is the set of actual analog circuits at the edge of the processor that send and receive signals on the PCB.
