
[ale] fascinating data on temperature, including ATI / AMD Radeon gpu



On 4/24/2013 17:47, Ron Frazier (ALE) wrote:

>
> My opinion is that any solid state component in my system should be fine
> if I stay at least 15 degrees below the maximum limits listed.
> Mechanical devices (hdd's, optical drives, floppy drives) are a whole
> other matter.
>
> In my opinion, with proper ventilation, the PC should be able to run
> almost indefinitely at full load at Tmax - 15.  I don't believe I'm
> shortening the life substantially.  Again, I could be wrong.

Nope, you are causing damage and shortening the life (but you will
probably still get a useful life out of the chip).  Semiconductors do NOT
like heat, period.  The problem is that heat causes a slow degradation of
the P-N junctions and the metal contacts (diffusion).  The damage is
cumulative.  At low temperatures (below about 40 C) the diffusion is very
slow.  Increasing the temperature accelerates the process along an
exponential curve. [1]  The likelihood of the chip up and dying within a
year if you run it hot is low, but it's not zero.  However, run it like
that for a few years and its life will be shorter.
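
To put rough numbers on that exponential curve, here is a minimal Python
sketch using the Arrhenius model, which is a common way to estimate
thermally driven wear-out.  The model choice, the function name
acceleration_factor, and the 0.7 eV activation energy are assumptions for
illustration only; real activation energies depend on the specific
failure mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def acceleration_factor(t_ref_c, t_hot_c, activation_energy_ev=0.7):
    """Arrhenius acceleration factor between two temperatures (Celsius).

    Returns how many times faster a thermally activated degradation
    process runs at t_hot_c than at t_ref_c.  The 0.7 eV default is an
    assumed, illustrative activation energy.
    """
    t_ref_k = t_ref_c + 273.15
    t_hot_k = t_hot_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_ref_k - 1.0 / t_hot_k
    )
    return math.exp(exponent)


if __name__ == "__main__":
    # Compare several die temperatures against the 40 C baseline.
    for temp_c in (40, 60, 80, 100):
        factor = acceleration_factor(40, temp_c)
        print(f"{temp_c} C: about {factor:.1f}x the 40 C degradation rate")
```

The point of the sketch is only the shape of the curve: each equal step
up in temperature multiplies the degradation rate rather than adding to
it.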

The eventual outcomes of the thermal stress are non-functional P-N 
junctions (device characteristics change) or short circuits (due to 
metal migration).  Lots of steps have been taken to mitigate these 
effects but the process can't be stopped entirely simply because 
performance and economics prevent it.  The more performance we try to 
push out of chip designs the more susceptible they become to this type 
of damage.[2]  However, the expected useful operating life is so short 
that the damage is simply accepted.  The manufacturer assumes that after 
a few years the chip will end up in the trash because the owner replaced 
it with the next greatest thing on the market at that point.



[1] Note that the temperatures you report are *external* temperatures as 
reported near the package or heatsink, not on the silicon die inside the 
package.  If your sensor is reporting 50 C it's very likely that the 
on-die temperature is closer to 100 C.  What is most critical for 
keeping your chip cool is to ensure maximum thermal transfer (maximum 
conductance of heat) and a large temperature differential (cold 
environment).  Lowering your internal case temperature by just ten 
degrees (e.g. more airflow) can have a big effect on the die temperature 
as more heat can flow out.  Also, ensuring that you have good conduction 
between the package and the heatsink using an appropriate (and *proper* 
amount of[3]) thermal grease (silver colloid is best) will ensure heat 
does not become trapped in the package.
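
The relationship between case temperature, thermal transfer, and die
temperature can be sketched with the usual steady-state thermal
resistance model, T_die = T_ambient + P * theta.  This is a simplified
model, and the function names and all of the numbers below (65 W, the
individual C/W values) are hypothetical, not measurements from any real
chip.

```python
def total_theta(theta_junction_case, theta_interface, theta_sink_ambient):
    """Thermal resistances in series add up (all values in C/W).

    Path modeled: die -> package -> grease/pad -> heatsink -> case air.
    """
    return theta_junction_case + theta_interface + theta_sink_ambient


def die_temperature_c(ambient_c, power_w, theta_c_per_w):
    """Steady-state die temperature: T_die = T_ambient + P * theta."""
    return ambient_c + power_w * theta_c_per_w


if __name__ == "__main__":
    power_w = 65.0  # hypothetical chip power draw

    # Hypothetical thermal resistances: good grease vs. a poor pad.
    good = total_theta(0.3, 0.1, 0.4)
    poor = total_theta(0.3, 0.5, 0.4)

    for case_air_c in (45.0, 35.0):
        print(
            f"case air {case_air_c:.0f} C: "
            f"die {die_temperature_c(case_air_c, power_w, good):.1f} C (good interface), "
            f"{die_temperature_c(case_air_c, power_w, poor):.1f} C (poor interface)"
        )
```

In this model a ten degree drop in case air temperature shows up
one-for-one at the die, and a worse interface between package and
heatsink costs you P times the extra resistance.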


[2] The move to copper interconnects was one of the attempts.  (IBM
pioneered it in the late 1990s and Intel and others followed; despite
the name, Intel's Coppermine Pentium III core still used aluminum
wiring.)  Most chips prior to that point used aluminum as the
interconnect wiring.  However, if the temperature gets high enough, the
aluminum alloys with the silicon and "spikes" into it, causing shorts.
With processes approaching nanometer scales[4], this spiking was
absolutely disastrous.  So processes were developed that use copper
instead of aluminum to wire the chip.  The idea was that copper is a
very good thermal conductor in addition to being a good electrical
conductor (better than aluminum on both counts).  Lowering the
electrical resistance reduces heating (Joule heating, P = I^2 * R) and
lowering the thermal resistance means the heat can be pulled away from
the junctions towards the outside of the package faster (a chip
typically has metal covering 40-60% of its total area).  However, copper
can alloy with silicon at room temperature, which leads to its own
problems.  So the copper is isolated from the silicon with thin layers
of other metals like platinum, titanium, nickel, tantalum, etc. (i.e.
anything not terribly reactive).  The copper process worked well for
managing heat but it's more complicated and expensive than the standard
aluminum deposition methods.  There's a lot of research into cooling
chips at nanometer scales, but not much beyond copper interconnects that
will help without major modifications to fab lines.  This is part of the
reason why the multi-core chips showed up.  Split the work across
several cores with an on-chip controller and the cores individually
don't have to work so hard, which means they run cooler (and more
efficiently, since cooler chips have larger signal margins and fewer
errors).
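
As a rough illustration of the P = I^2 * R point, the sketch below
compares the heat dissipated in an aluminum and a copper wire of
identical geometry carrying the same current.  The resistivities are
approximate bulk values at room temperature (real thin-film interconnect
is more resistive), and the wire dimensions, current, and function names
are hypothetical.

```python
# Approximate bulk resistivities at about 20 C, in ohm-meters.
RESISTIVITY_OHM_M = {
    "aluminum": 2.65e-8,
    "copper": 1.68e-8,
}


def wire_resistance_ohm(material, length_m, cross_section_m2):
    """Resistance of a uniform wire: R = rho * L / A."""
    return RESISTIVITY_OHM_M[material] * length_m / cross_section_m2


def joule_heating_w(current_a, resistance_ohm):
    """Power dissipated as heat in a resistor: P = I^2 * R."""
    return current_a ** 2 * resistance_ohm


if __name__ == "__main__":
    # Hypothetical interconnect: 1 mm long, 1 um x 1 um cross-section, 1 mA.
    length_m = 1e-3
    area_m2 = 1e-6 * 1e-6
    current_a = 1e-3

    for material in ("aluminum", "copper"):
        r = wire_resistance_ohm(material, length_m, area_m2)
        p = joule_heating_w(current_a, r)
        print(f"{material}: R = {r:.1f} ohm, P = {p * 1e6:.1f} uW")
```

For the same geometry and current, the heating scales directly with
resistivity, so the copper wire dissipates a bit under two thirds of
what the aluminum one does with these bulk values.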

[3] There is such a thing as too much thermal grease.  A proper amount 
fills the microscopic voids in the two macroscopically flat surfaces 
without interfering with the physical proximity of the surfaces.  That's 
what permits maximum thermal transfer.  Too much grease puts a gap 
between the surfaces and reduces thermal transfer efficiency.  Silver 
colloid is the best because the silver microparticles are pliable and 
can help to fill the voids and improve contact between the surfaces. 
The sticky pads that some companies put on their chips are garbage. 
They're dirt cheap to make, which is why the companies use them (silver 
colloid is expensive), but they don't have very good thermal conduction. 
I rip out those pads whenever I find them and use silver colloid. 
Again, just use a proper (tiny) amount and make sure not to slather it 
on.  Not only will the excess reduce cooling efficiency but it can also 
cause short circuits.



[4] Chip scales have gotten so small and the device density so high that 
the major manufacturers (Intel, AMD, Samsung, etc.) are having to 
concern themselves with cosmic ray events.  Older chips might have one 
cosmic ray event in a year or two.  Today's chips are so densely packed 
with such tiny devices on them (approaching single digit nanometers in 
many cases) that it's becoming more likely to see a cosmic ray event 
once every week and sometimes once a day at higher altitudes.  They've 
considered adding special circuits to monitor for cosmic ray events and 
signal the CPU to repeat the last instruction if an event is detected.