
Validation, Test and Performance Characterization of DDR Memory Subsystems


We here at FuturePlus often use the Google StressApp memory test to exercise DDR memory.  One recent discovery was the nice Gaussian distribution we saw when measuring the number of simultaneously open banks.  The graph below shows how many banks are open at one time as a percent of total cycles.  This is an important metric: open banks speed up performance, but at the cost of power consumption.

Open_banks_stress_app.png

Figure:  DDR Detective NUMBER OF SIMULTANEOUS OPEN BANKS
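
For readers who want to reproduce this kind of measurement from their own bus captures, here is a minimal sketch in Python. It is not the DDR Detective's implementation. It assumes a hypothetical trace format of (cycle, command, bank) tuples, where "ACT" opens a bank, "PRE" closes one bank and "PREA" closes all banks.

```python
from collections import Counter

def open_bank_histogram(trace, total_cycles):
    """Return {open_bank_count: percent_of_cycles} for a command trace.

    trace: iterable of (cycle, command, bank) sorted by cycle. The tuple
    layout and command names are assumptions made for this sketch.
    """
    open_banks = set()
    cycles_at = Counter()
    last_cycle = 0
    for cycle, command, bank in trace:
        # Credit the time since the previous command to the current count.
        cycles_at[len(open_banks)] += cycle - last_cycle
        last_cycle = cycle
        if command == "ACT":
            open_banks.add(bank)
        elif command == "PRE":
            open_banks.discard(bank)
        elif command == "PREA":
            open_banks.clear()
    # Account for the tail of the capture after the last command.
    cycles_at[len(open_banks)] += total_cycles - last_cycle
    return {n: 100.0 * c / total_cycles for n, c in sorted(cycles_at.items())}

example_trace = [
    (10, "ACT", 0),
    (20, "ACT", 1),
    (50, "PRE", 0),
    (70, "ACT", 2),
    (90, "PREA", None),
]
histogram = open_bank_histogram(example_trace, total_cycles=100)
print(histogram)  # {0: 20.0, 1: 30.0, 2: 50.0}
```

Plotting the returned percentages against the open bank count gives the same kind of distribution as the figure above.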

What can we learn from this type of analysis? 

Characterizing workloads can lead to better software, compilers and BIOS settings for the memory controller.  In the area of cybersecurity, certain critical workloads may also have unique signatures.  If a virus is present, the signature might be disturbed.

The DDR Detective is now a Finalist in the 2015 DesignCon BEST IN TEST!

Rowhammer.js: Row Hammer Attacks Can Now Be Triggered Through Webpages

 

(July 31st, 2015)  The unique and terrifying computer attacks known as 'Row hammer' can now be triggered through a webpage. A paper by security researchers Daniel Gruss, Clémentine Maurice and Stefan Mangard reveals how what was once perceived as an attack dependent on specific CPU instructions can now be triggered simply through a web browser.

This attack on the physical memory of a system is so destructive because it can occur under any operating system and can potentially gain access to stored data. Row hammer events do not affect the bits at the address being accessed but those in adjacent rows: the accessed row is activated ('hammered') over and over, which ultimately leads to data corruption in its neighbors. Previously it was believed to occur only through local execution on the computer, but it has now been found that it can be triggered remotely through JavaScript, the scripting language used by nearly 90% of webpages. The three researchers developed a piece of code that escapes the security sandbox of a browser and can perform a Row hammer event on a system’s hardware.

The best solution to the problem so far is shutting off JavaScript on untrusted sites; browsers are unable to patch this problem since there is nothing in the browser to patch. Fortunately, Rowhammer events are very hard to control and the researchers' code is very difficult to replicate, but it does raise huge security concerns about the memory of a system. Google’s own security researchers showed how row hammer events can lead to gaining access to the entire memory of a computer, but this was done locally on the computer and was not tested by the researchers of Rowhammer.js.

FuturePlus remains at the forefront of this topic and has introduced test equipment that can monitor the DDR3 memory bus in a server, laptop or embedded design to identify Row hammer events. See more information here. 

ASUS X99 memory subsystem teardown

We received our ASUS X99 a few weeks back and immediately set it up in our lab to take a look at the DDR4 memory bus.  The first thing we all noticed was the fan on top of the processor.  It is very quiet.  In fact it is the quietest fan I think any of us has ever seen inside of a computer.

Onto the memory!  We got hold of a 2 rank DDR4 DIMM from a vendor and plugged it in on top of our DDR4 DIMM Interposer, which is attached to our FS2800 DDR Detective®.  When we look at a memory bus we really look at it!  The eyes on the Address/Command and Control bus looked great at 1867MT/s, and since they did we cranked it up to 2400MT/s, so the DDR clock itself is 1.2GHz.  All is well: we did not fail any of the memory tests we threw at it, including Memtest86, Google StressApp and ThirdIO's Nemesis.

We also used our U4154A Logic Analyzer and our FS2510 Interposer to do eye scans and burst scans on the DQ lines.  They looked great and completely jibed with what our friends at Keysight were seeing.  BTW Keysight took it up a notch, clocked their ASUS X99 at 3.125GT/s and were able to successfully capture the DQ data signals.  See that announcement here.  So at first blush it looks like the folks at ASUS get a passing grade on their first DDR4 implementation.
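
As a quick check on the numbers quoted above, the sketch below shows why 2400MT/s corresponds to a 1.2GHz clock: DDR transfers data on both clock edges. The peak bandwidth figure assumes a single channel with a 64 bit data bus (ECC bits excluded), which is our assumption for illustration and not a measurement from the teardown.

```python
def ddr_clock_mhz(transfer_rate_mts):
    """DDR moves data on both clock edges, so clock = transfer rate / 2."""
    return transfer_rate_mts / 2

def peak_bandwidth_gbs(transfer_rate_mts, bus_width_bits=64):
    """Theoretical peak in GB/s for one channel (bus width is an assumption)."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9

for rate in (1867, 2400):
    print(f"{rate} MT/s -> clock {ddr_clock_mhz(rate):.1f} MHz, "
          f"peak {peak_bandwidth_gbs(rate):.1f} GB/s")
```

These are theoretical ceilings; sustained bandwidth on a real system will be lower because of refresh, bank conflicts and bus turnarounds.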

NIST gets into server BIOS

Concern for our nation's cloud infrastructure has caused the National Institute of Standards and Technology, or NIST, to get involved in BIOS updating.  See this article for more information.  In summary, nefarious activity can take place if a server's BIOS is compromised.  NIST's ITL, or Information Technology Lab, has produced a special publication to give guidance on BIOS updates.

 

The Threat That Just Keeps Getting Bigger: DRAM Row Hammer 

 

One of the more recent papers from Carnegie Mellon University and ETH Zurich tries to forewarn the tech industry about the prevalence of DRAM Row Hammer failures and lays out possible strategies to combat them. The testing they performed on DRAM devices (DDR3, DDR4, and LPDDR4) helped to illustrate the magnitude of this error, which appears to get worse in newer technologies. This means that Row Hammer is a problem that needs to be addressed sooner rather than later. Row Hammer is an electrical interference effect in which repeated accesses to one row disturb nearby cells and cause what are called “bit flips”. A bit flip turns a “zero” into a “one” or vice versa.

This characterization study also explores the organization of DRAM technology in order to thoroughly explain the Row Hammer error and give suggestions moving forward. The easiest suggestion is to refresh the memory more frequently, which restores the charge in the victim row more often and thus makes it less susceptible to bit flips. But this suggestion takes a toll on power consumption and performance, both of which are very important to data center operators. The other suggestion is to reconsider the design of DRAM at the component level, which DRAM manufacturers are unwilling to do.
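
To put a rough number on the performance cost of refreshing more often, the sketch below estimates the fraction of time a rank spends refreshing. The timing values (a refresh cycle time of 260 ns and an average refresh interval of 7.8 µs) are illustrative assumptions on our part; real values depend on the device density, temperature range and data sheet, and the study itself may use different figures.

```python
def refresh_overhead(t_rfc_ns, t_refi_ns):
    """Fraction of time a rank is busy refreshing and unavailable for accesses."""
    return t_rfc_ns / t_refi_ns

T_RFC_NS = 260.0    # assumed refresh cycle time (illustrative, density dependent)
T_REFI_NS = 7800.0  # assumed average refresh interval at the normal rate

normal = refresh_overhead(T_RFC_NS, T_REFI_NS)
doubled = refresh_overhead(T_RFC_NS, T_REFI_NS / 2)  # refresh twice as often
print(f"normal refresh: {normal:.1%} of time, doubled refresh: {doubled:.1%}")
```

Under these assumptions, halving the refresh interval doubles the time lost to refresh, and the extra refresh operations also draw additional power.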

This in-depth exploration helps readers understand the Row Hammer error a little better. It offers insight into the prevalence of the error in older technologies versus newer technologies. It also offers suggestions on how to mitigate the error. There are big risks with newer technologies having higher vulnerability to attacks such as Row Hammer. Hopefully, as more research dives deep into this topic, more and more engineers will start to take this error seriously, but until then it still poses a serious reliability and security threat.

Link to the study is found here.

Row Hammer. The problem no DRAM vendor wants to talk about, except for one, Zentel

 

Zentel has published data showing zero Row Hammer failures for its new DDR3 DRAM.  I fondly recall talking to a large vendor’s ‘tiger team’ concerning Row Hammer failures a few years back.  I asked them what their DDR3 users should do if they start to experience Row Hammer failures.  Their response? ‘Upgrade to DDR4!’  ‘How convenient,’ I responded, ‘forcing the industry to throw away all those DDR3 based systems so you can sell more DDR4.’  And come to find out, DDR4, although a bit better, still experienced Row Hammer failures!

We here at FuturePlus Systems are glad to see our colleagues in academia are still hunting down those Row Hammer vulnerabilities (https://rambleed.com/).  They can feel good about causing the industry to look for solutions, and it appears that Zentel has answered the call.

To the best of our knowledge Zentel has the ONLY Row Hammer hardened DDR3 memory on the market.  See the Zentel data sheets here. 

Data Center Down Time: DRAM Row Hammer Failures in the field

 

DDR3 memory is at the heart of almost all cloud computing servers today.  A recently publicized failure mechanism in DDR3 memory, coined Row Hammer,  has been shown to not only be a reliability issue but also a security risk for servers, laptops, desktops and embedded systems around the world.  In short, excessive accesses to a single Row in memory can cause bit flips in adjacent locations causing system crashes, corrupted data and even security exploits.  Several research papers have been published and more work is being done to help the industry understand this problem.  No industry standards group, government agency or trade association has signed up to address this issue.  Data Centers and end users are on their own.

Computer architecture relies on three basic building blocks: the CPU (central processing unit), the I/O (input and output) and the memory.  When it comes to the memory, the dominant technology is DRAM, or Dynamic Random Access Memory.  Today’s most prevalent version is called DDR3, which stands for the 3rd generation of Double Data Rate memory.  In the quest to make memories smaller and faster, memory vendors have had to move to very small physical geometries.  These small geometries put memory cells very close together, and as a result one memory cell’s charge can leak into an adjacent one, causing a bit flip.

It has come to the attention of the industry that this is indeed happening under certain conditions.  Very simply, the problem occurs when the memory controller, under command of the software, issues an ACTIVATE command to a single row address repetitively.  If the physically adjacent rows have not been ACTIVATED or Refreshed recently, the charge from the over-ACTIVATED row leaks into the dormant adjacent rows and causes a bit to flip.  This failure mechanism has been coined ‘Row Hammer’, as a row of memory cells is being ‘hammered’ with ACTIVATE commands.

Additionally, double sided Row Hammering has also been proven.  This involves two ‘aggressor’ rows on either side of a ‘victim’ row.  Double sided hammering produces failures faster and causes more bits to flip.  Once this failure occurs, a Refresh command from the Memory Controller solidifies the error into the memory cell.  Current understanding is that the charge leakage does not permanently damage the physical memory cell, which makes repeated memory tests trying to find the failing device useless.
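
The mechanism described above suggests a simple way to look for Row Hammer conditions in a bus capture: count ACTIVATE commands per row within a refresh window and flag rows that exceed a threshold. The Python sketch below illustrates the idea only and is not how any FuturePlus product works. The trace format, the 64 ms window and the threshold are all assumptions, and treating logically adjacent row numbers as physical neighbors is a simplification, since devices may remap rows internally.

```python
from collections import Counter

def find_hammered_rows(activates, window_ns=64_000_000, threshold=100_000):
    """Flag (bank, row) pairs ACTIVATEd more than `threshold` times in one window.

    activates: iterable of (timestamp_ns, bank, row) sorted by time.
    window_ns and threshold are illustrative assumptions, not vendor figures.
    Returns {(bank, row): highest ACTIVATE count seen in any window}.
    """
    flagged = {}
    counts = Counter()
    window_start = None
    for timestamp_ns, bank, row in activates:
        if window_start is None or timestamp_ns - window_start >= window_ns:
            # Start a new fixed window; counts from the old one are dropped.
            window_start = timestamp_ns
            counts.clear()
        key = (bank, row)
        counts[key] += 1
        if counts[key] > threshold:
            flagged[key] = max(flagged.get(key, 0), counts[key])
    return flagged

def likely_victims(row, max_row):
    """Rows next to an aggressor, assuming logical and physical order match."""
    return [r for r in (row - 1, row + 1) if 0 <= r <= max_row]

# Synthetic example: one row in bank 2 is hit 12 times, five others once each.
trace = [(i * 50, 2, 0x1234) for i in range(12)]
trace += [(600 + i * 50, 2, 0x0100 + i) for i in range(5)]
hammered = find_hammered_rows(trace, threshold=10)
for (bank, row), count in hammered.items():
    print(f"bank {bank} row {row:#06x}: {count} ACTIVATEs, "
          f"possible victims {[hex(v) for v in likely_victims(row, 0xFFFF)]}")
```

Fixed windows keep the sketch short, but a burst that straddles a window boundary can be undercounted; a sliding window aligned to the actual Refresh commands would be more faithful.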

DDR3 memory is pervasive today and used in nearly all cloud server systems, many embedded applications and military applications.  Most critical applications do use error detection and correction, ECC.  However, typical ECC only corrects single bit errors and detects double bit errors.  When more than two bits in a word are in error, which has been demonstrated with Row Hammer failures, ECC falls short.  Our dependence on DDR3 memory and this known failure mechanism should be a wake up call for the industry.  So far the most common workaround is to double the refresh rate to the memory.  This is an attempt to ‘charge up’ the dormant memory cells so that they do not fall victim to adjacent rows that might become ‘hammered’.  This reduces performance and increases power consumption, and the problem is not going away.  This workaround just reduces the statistical probability.
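
To illustrate why single error correct, double error detect ECC falls short, here is a toy Python sketch using an extended Hamming (8,4) code. Real ECC DIMMs protect 64 data bits with 8 check bits and use different codes, so the bit layout and function names here are assumptions chosen for brevity. The behavior is the point: one flipped bit is corrected, two are detected, and in this example three are silently "corrected" into wrong data.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # parity bits sit at positions 0, 1, 2 and 4

def encode(data_bits):
    """Encode 4 data bits into an 8 bit extended Hamming codeword."""
    code = [0] * 8
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        code[pos] = bit
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        for i in range(1, 8):
            if i != p and i & p:
                code[p] ^= code[i]
    code[0] = sum(code[1:]) % 2  # overall parity enables double bit detection
    return code

def decode(received):
    """Return (data_bits, status) the way a SECDED decoder would."""
    code = list(received)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i
    overall_odd = sum(code) % 2 == 1
    if overall_odd:
        # Odd parity is assumed to be a single error at position `syndrome`
        # (syndrome 0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        status = "uncorrectable"  # even number of errors: detected only
    else:
        status = "clean"
    return [code[p] for p in DATA_POSITIONS], status

def flip(code, positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out

word = [1, 0, 1, 1]
stored = encode(word)
one = decode(flip(stored, [5]))
two = decode(flip(stored, [3, 5]))
three = decode(flip(stored, [3, 5, 6]))
print("1 flip :", one)
print("2 flips:", two)
print("3 flips:", three, "<- reported as corrected, but data differs from", word)
```

The three flip case is the dangerous one: the decoder reports success, so the corrupted data is passed on without any error being logged.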

Why does this happen?  Simply put, the memory controller’s job is to read and write information to and from the memory under program control.  If the running software executes certain commands that cause repeated accesses to a single location, the memory controller will generate excessive ACTIVATE commands.  Currently there is nothing in DDR3 memory controller designs to prevent this from happening.

DDR3 memory is a critical part of the world’s cloud computing strategy, and today’s servers have an extensive amount of DDR3 memory.  The studies have shown a potential for millions of Row Hammer failures per system.  Given the vast amount of DDR3 memory in today’s systems, failures should clearly be a concern.  This known failure mechanism can lead to undetected data corruption, reliability issues and security breaches.  Current mitigation strategies for deployed systems are impractical, expensive or just reduce the statistical likelihood.  Should we upgrade to DDR4?  Not so fast…..studies have shown that DDR4 has the same problem.  A strategy to determine whether applications even create the Row Hammer failure should be considered.  Understanding whether an application is at risk can reduce the pressure to implement unneeded, expensive and time consuming mitigation strategies, saving organizations millions of dollars.  If applications are shown to be at risk, then steps can be taken to upgrade hardware, rewrite the application and provide warnings to the field that such failures might occur.