Name: AMD Core Counts and Bulldozer: Preparing for an APU World
Item: AMD Core Counts and Bulldozer: Preparing for an APU World
Author: Anand Lal Shimpi

AMD Core Counts and Bulldozer: Preparing for an APU World

by Anand Lal Shimpi on 11/30/2009 12:00 AM EST

Posted in
CPUs

Post Your Comment
Please log in or sign up to comment.

Comments Locked

94 Comments

Back to Article

mattclary - Monday, December 14, 2009 - link
[quote]

Anand,

Think of each twin Integer core Bulldozer module as a single unit, so correct.

[/quote]

It's no wonder you misinterpreted what he said. This is vague at best! "Is it either or? - Correct!"
aj28 - Thursday, December 3, 2009 - link
Alright, so here's the quote from the article. Take note of the parts in bold...

[QUOTE]Also, just to confirm, when your roadmap refers to 4 bulldozer cores that is four of these cores:

http://images.anandtech.com/reviews/cpu/amd/FAD200...">http://images.anandtech.com/reviews/cpu/amd/FAD200...

Or does each one of those cores count as two? I think it's the former but I just wanted to confirm.[/QUOTE]

And AMD's response...

[QUOTE]Think of each twin Integer core Bulldozer module as a single unit, so correct.[/QUOTE]

So to me this reads, "Correct, the former, meaning..."

[QUOTE]...when your roadmap refers to 4 bulldozer cores that is four of these cores:

http://images.anandtech.com/reviews/cpu/amd/FAD200...">http://images.anandtech.com/reviews/cpu/amd/FAD200...[/QUOTE]

There's a good chance that the majority is correct and I am in fact wrong, but... Well, that's just how I read their response. I feel there is a good chance of some more confusion afoot, much like the percentages being thrown around in the original article.
aj28 - Thursday, December 3, 2009 - link
I think it's also worth noting that I fail at quoting... Evidently... Sorry!
swindelljd - Wednesday, December 2, 2009 - link
I bet Oracle is salivating over the new core count technique since it is sure to create a huge surge in their revenue because they charge per core on the x86 platform.
Sivar - Tuesday, December 1, 2009 - link
If FP performance is given the backseat, it could impact game performance for well multi-threaded games.
JumpingJack - Wednesday, December 2, 2009 - link
Depends on how effectively the designers are able to share the FP in this arrangement, but yeah -- gaming will be a question mark. I am pretty confident it will be better not worse.
nirmv - Tuesday, December 1, 2009 - link
For what I understand, AMD figured out how to reduce core size by 25% without impacting performance.
Each 2 cores will now share the same fetch/decode units (using SMT like Intel), and also the same FP unit (but doubled for 256 bits so actually it's 2 128 bit unit), but seperate Int unit like before). So actually they share half of the logic of two cores together, so they now use 150% of the die area of one core for 2 cores, or in other words save 25% of each core (75% * 2 = 150%).
But, it will still have 1/2 the throuput of Sandy Bridge in FP, and they still will have 1/2 the bandwidth of the fetch/decode because they use 1 for two cores instead using 1 per each.

Nevertheless it looks like a wise decision in terms of power/performance. So nice, but it won't give AMD the performance crown.
Seramics - Tuesday, December 1, 2009 - link
From the way it seems, I'm afraid the badly delayed, highly anticipated, much hyped and AMD's only hope to retake the performance crown from Intel will fall short of expectations. Unless they really come up with a competitive n powerful processor, I'm afraid the AMD we know from A64 days will continue to be history till the next major architecture after bulldozer which could well be 5 years or so after 2011. AMD to be budget player till then.
Alberto - Tuesday, December 1, 2009 - link
Buldozer seems too late against Intel upcoming offerings.
An eight core Buldozer will be clearly slower than an eight core Sandy Bridge, in both integer and Fp.
This cpu implementation seems done to fight Nehalem ( two 128 bit units, both possibly utilized from one core only ).
Sandy Bridge will have two times Fp power and threads per die,
assuming the article right.
The only manner to be competitive is to consider a single "block"
like a monolitic core. Intel can answer with 50% more cores/die,
performing a complessive better integer and Fp performance.

Still we don't know what will be the new integer performance of the
Sandy Bridge integer unit. I believe it will be higher than in Nehalem.
epobirs - Tuesday, December 1, 2009 - link
I don't buy this claim that FP will be eliminated from CPUs in favor of doing it all on a GPU. There are too many situations where FP is still needed on a per core basis with a primarily integer load. About two minutes after the first systems ship with no integrated FP in the CPU (Bulldozer SX?) there will be engineers thinking themselves clever by proposing to boost FP performance by integrating it into the CPU die!

What will happen instead is the FP and onboard low-end graphics solution will merge. The monster GPUs will be there for high-end FP as needed and the die area consumed by the FP and IGA minimized so as to be beneath concern. FP may be external to the cores but they won't be sold without at least one FP/IGA module in the mix. That way you have a chip that is versatile for a wide range of different boxes but also cost competitive.
Alberto - Tuesday, December 1, 2009 - link
You are right.
The eight core Sandy Bridge will have over 200 Gflops Double Precision with a power budget of 130W in 32nm and 95W in 22nm.
In these conditions the "dream" to throw away the Fp unit from the CPU it's only a Nvidia desire.....to survive.
gruffi - Sunday, December 6, 2009 - link
Give me your calculation please. I see Sandy Bridge nowhere near 200 GFLOPS in DP.

Sandy Bridge may have up to 8 cores/16 threads (the known die shot shows only 4 cores), probably clocked around 3 GHz.

4 DP (AVX/256-bit) * 1 op/cycle (no FMA) * 8 cores * 3 GHz = 96 GFLOPS

OTOH, AMD may have twice as much FP throughput with "Interlagos" (8 modules/16 cores/16 threads) if we assume similar clock rates.

4 DP (AVX/256-bit) * 2 ops/cycle (FMA4) * 8 modules * 3 GHz = 192 GFLOPS
psychobriggsy - Tuesday, December 1, 2009 - link
That certainly beats AMD's ~100GFLOPS in double precision from an 8-core Bulldozer.

Calculation: 3GHz * 2 (FMA) * 2 (units) * 2 DP (128-bit unit) * 4 (modules).

Clearly AMD are providing enough CPU power for OpenCL, etc, to run "well", but if you need "serious" power then you'll plug in an RV900 series GPU that will probably try to get near 1TFLOP in DP in the same timeframe. With OpenCL, the exact same code will run (AMD's OpenCL driver can switch between CPU and GPU without any application changes).
epobirs - Tuesday, December 1, 2009 - link
It looks like AMD is engaging in another word of words instead of performance. Remember when they claimed ownership of what was or was not 'dual core' and 'quad core?' While AMD declared the C2Q line as 'not true quad-core' the Intel product was actually shipping and available for use a year before AMD's 'true' chips came out with less performance and some serious bugs for added enjoyment.

This gets tiresome to the point where I hold AMD in great suspicion when they lead with a new official vocabulary instead of the product and how it actually performs.

I truly don't give a damn about your modules, AMD. Take your new architecture and define the smallest portion that could be sold as a discrete product to run a PC. That is a core. It doesn't matter how many threads it runs. It is a core. If we cannot have meaningful definition to which all companies adhere, the conversation is dead and all that remains is useless PR blather.
Nehemoth - Tuesday, December 1, 2009 - link
Well said. At the end of the day users don't care about the elegance of the architecture they'll care about performance, performance per watt, etc, etc.

PD : Where is the Z Ram technology they're licensed back time ago for the Cache Memory?

What about the license for XDR from Rambus?

At less for some servers should have a value.
Nehemoth - Tuesday, December 1, 2009 - link
Well said. At the end of the day users don't care about the elegance of the architecture they'll care about performance, performance per watt, etc, etc.

PD : Where is the Z Ram technology they're licensed back time ago for the Cache Memory?

What about the license for XDR from Rambus?

At less for some servers should have a value.
Milleman - Monday, November 30, 2009 - link
It's good to see that the existance of AMD is healthy for the competition, progress and innovation. The existance of AMD is even good for the Intel fan-boys. The Inte CPU's wouldent be half that fast today, if there wasn't any competition on the market.
jmurbank - Monday, November 30, 2009 - link
An AMD representative said that the picture you provided is one core, but it has two integer units. These two integer units are hardware basis of a similar feature of Intel's Hyperthreading. The following picture is a dual core.

http://images.anandtech.com/reviews/cpu/amd/Bulldo...">http://images.anandtech.com/reviews/cpu/amd/Bulldo...

The four core is the following image.

http://images.anandtech.com/reviews/cpu/amd/Bulldo...">http://images.anandtech.com/reviews/cpu/amd/Bulldo...

This is all assuming the Bulldozer core is for their enthusiasts or high end setups. For the low end, these pictures will not include two integer units. Though it all depends what AMD has in store for the microcode for their Bulldozer core because it can be one way and other or it can be both that can take advantage of both features by including a switch in the BIOS or software, but it is too soon.
Milleman - Monday, November 30, 2009 - link
Looks like the AMD CPU's are slowly getting structures "borrowed" from ATI GPU's, which is very interrresting. The traditional CPU strukture from the seventies are on the way out. The future looks really exciting!
tatertot - Monday, November 30, 2009 - link
AMD marketing made a mistake (Fruehe, on his blog) when referring to an AMD engineering claim made by Moore.

The claim is on slide 4:

http://www.amd.com.cn/chcn/assets/content_type/Dow...">http://www.amd.com.cn/chcn/assets/conte...loadable...

80% more throughput (integer work) for 50% more (core) area.

Fruehe LOLed this into 80% more performance for 5% more area (ooops!), and now this meme has taken hold.

It's wrong. Each module is 50% larger to get 80% more integer throughput, and even adding in all the "uncore" portions on a chip does not get this number anywhere NEAR 5%. (The uncore is nowhere near 10x the area of all the core area combined)
vsary6968 - Tuesday, December 1, 2009 - link
this slide was 2005. This is not the latest slide. you need to do more research.
vsary6968 - Tuesday, December 1, 2009 - link
this slide was 2005. This is not the latest slide. you need to do more research.
Anand Lal Shimpi - Monday, November 30, 2009 - link
You're very right, AMD responded and said that the 5% figure was incorrect. Unfortunately it looks like both Johan and I were given the same incorrect info.

The real figure is closer to 50%, I've updated the article accordingly.

Thanks again :)

Take care,
Anand
piesquared - Monday, November 30, 2009 - link
I think i'd investigate a little further. Judging by the block diagrams each integer core is no where near 50% of the die, so obviously that number can't be correct....
JumpingJack - Tuesday, December 1, 2009 - link
And as we all know, these power point block diagrams are carefully scaled to ensure that blocks are exactly proportional to the actual units located on the floor plan of the die.

From this, one may extrapolate the L3 cache is not much more the 512 KB.

Thanks for the knee slapper.

Jack
GaiaHunter - Monday, November 30, 2009 - link
I'm still not sure. JF arguing is solid.

That 5% and 50% could be just semantics.

Because JF said, distinctly and repeatedly, he was talking about total die size, while the 50% is referring to the area of the module, sans L3$/IMC/NB/etc. And more specifically the Int-core area, which clearly doubles when going from 1 Int-core to 2 Int-cores.

So, while to get up to 180% increase in integer performance you need to double the area (or 50% of the total integer area)dedicated to integer operations, that relatively to the total die size may well take only 5% of the die space.
GaiaHunter - Monday, November 30, 2009 - link
I really think this is semantics again.

Module and Die.

Module is 50% bigger but die is only 5% bigger.
psychobriggsy - Tuesday, December 1, 2009 - link
A single integer core (just the unique per-core parts, not the shared functionality in the module) takes up 5% of a typical quad-core Bulldozer die (including uncore and L3)? Or maybe even an octo-core die.

Also assume rounding up and down and nearest. Could be 47% and 5.4%, etc.

It's a way away yet. Let's see what happens.
smilingcrow - Monday, November 30, 2009 - link
5% always sounded very unrealistic as that would mean a remarkable increase in IPC for such a small increase in ‘core’ size.
If it was only 5% we would expect to see a native 8 module version being for the desktop if looked at purely from die size or on a cost basis. But at 50% extra it means that all other things being equal 4 modules = 6 ‘simple’ cores in space terms ignoring the uncore.

I’m still not 100% clear on the 50% thing. If a die is 50% cores and 50% un-core and measures 100 sq mm. When we add the 50% larger cores to the equation the cores become 75 sq mm and the die becomes 25% larger or 125 sq mm. Or is there another portion of the module/core that is excluded so the total size increase is less than 25 sq mm?
Zool - Monday, November 30, 2009 - link
That 50% sounds much more realistic.
On the k10 die are http://en.wikipedia.org/wiki/File:K10h.jpg">http://en.wikipedia.org/wiki/File:K10h.jpg u can see
that doubling the integer pipeline, data cache and load store unit is clearly more than 5% :P.
The thing is that L2 cache and L3 cache are in the buldozer module picture and they are several times bigger die area than the core. And there are also other things in the uncore like memory controler, hypertransport. The whole die vs core is quite diferent than the whole die vs module. They say 50% more core area invested not module or die area.
Zool - Monday, November 30, 2009 - link
This is the K10 core from wikipedia with integer pipeline highlited (and other areas too) http://en.wikipedia.org/wiki/File:K10h.jpg">http://en.wikipedia.org/wiki/File:K10h.jpg .
The 5% are is quite realistic if count in the shared L1 and L2 cache for one module.
psychobriggsy - Tuesday, December 1, 2009 - link
The L1 caches are duplicated however. Also the Load/Store units I presume, but maybe there is a way to share some resource there.

What that diagram does show is that there are two 64-bit SIMDs (one of which can do x87) in K10 (not K10.5).

In Bulldozer there are two 128-bit SIMDS (that can also do FMA). I presume that they can each do x87 if they deign to lower themselves to the task.

That's why the FP performance has gone up. FMA counts as two operations when it comes to Linpack. :D FP is doubled compared to K10, even on a per-BDcore basis.

Will we refer to a Bulldozer module as K11?
GaiaHunter - Monday, November 30, 2009 - link
In the guy own words

http://forums.anandtech.com/showpost.php?p=2893509...">http://forums.anandtech.com/showpost.php?p=2893509...

[quote]I think the difference between 50% and 5% might be the difference between marketing and engineering. Engineers tend to be very literal.

If 2 cores get you 180% performance of 1, then in simple terms, that extra core is 50% that gets you the 80%.

What I asked the engineering team was "what is the actual die space of the dedicated integer portion of the module"? So, for instance, if I took an 8 core processor (with 4 modules) and removed one integer core from each module, how much die space would that save. The answer was ~5%.

Simply put, in each module, there are plenty of shared components. And there is a large cache in the processor. And a northbridge/memory controller. The dies themselves are small in relative terms.[/quote]
tatertot - Monday, November 30, 2009 - link
The guy is wrong, or his engineering team misunderstood.

Moore (the lead designer) said about 50% increase to double the integer resources, L1D, etc. That sounds about right.

What I COULD believe is this:

Q: "If I took an 8 core processor (with 4 modules) and removed 1 integer core from ONE module, how much die space would that save?" A: 5%

In other words, if you removed them from all 4, you'd save 20%.

If you figure that the uncore takes up a bit more than half of the die, that would be totally consistent with Moore's 50% larger core figure.

For example (totally made up numbers):

Die size 300 mm2

uncore 160 mm2
4 BD modules 140mm^2

1 BD module 35 mm2
1 BD module without extra integer units: 23 mm2
(Savings from lopping 1 BD module: 12mm^2)

4 BD modules without extra integer units: 93 mm2
(Savings from lopping 4 BD modules: 47mm^2)

12/300 is 4%, which is what his engineers thought he was asking.

But really he was asking about 47/300 or ~16%.

So as stated, the 5% is wrong. It's the area cost of 1 of the module's extra int resources on a 4 module die. All 4 of them cost more.

And this would be consistent with Moore's estimate that relative to JUST the module, it is a 50% area increase.
psychobriggsy - Tuesday, December 1, 2009 - link
Thanks for doing the example maths.

Yes, it looks like adding a single core to each module adds around 15%-20% to the die size of a dual-module/quad-core Bulldozer.

So 20% die space for 80% performance increase. Well, until you decide to make the L2 larger because there will be more contention for it.

Of course the 5% die area for SMT in Nehalem is negligable when you start factoring in the uncore portions as above...
GaiaHunter - Monday, November 30, 2009 - link
Meh, I miss-clicked and reported your post by mistake, sorry about that. :(

Anyway, imagine a int core is 5% area of a total module and that a module size is 100 (size units not mm^2), so the int core size is 5. 4 modules will then be 400 and 4 int cores will be 20. 20 is 5% of 400, not 20%. Same for 8 modules.

You have to see, that they do need 50% more area to get 80% Int boost performance as they are using a second Int core to accomplish that. So the dedicated area of module to do Int operations is 2x the size of a regular Int core.
GaiaHunter - Monday, November 30, 2009 - link
Meh, I miss-clicked and reported your post by mistake, sorry about that. :(

Anyway, imagine a int core is 5% area of a total module and that a module size is 100 (size units not mm^2), so the int core size is 5. 4 modules will then be 400 and 4 int cores will be 20. 20 is 5% of 400, not 20%. Same for 8 modules.

You have to see, that they do need 50% more area to get 80% Int boost performance as they are using a second Int core to accomplish that. So the dedicated area of module to do Int operations is 2x the size of a regular Int core.
HolKann - Monday, November 30, 2009 - link
Nah, you don't understand him. His assumptions are:
1. one int core is about 5% of the whole die (including uncore).
2. one int core is about 50% of a module.
3. the uncore makes up about half of the core.

Put this in numbers:
Take a module as 100 size units. 4 modules means 400 size units, adding the uncore makes the size of the whole die 800. 5% of 800 is 40 size units. And tadaa, this makes an int core 40% of the size of a module ;) The number gets closer to 50% if one takes the uncore bigger.

If his assumptions are correct, a 25% total die increase (4*5% to 80%) results in 80% extra performance. This is about as good as Intel's 5% die increase for 15-20% extra performance (I know, this is a bold statement, a lot of unknown variables could alter this situation drastically).
GaiaHunter - Monday, November 30, 2009 - link
The data we have is: Removing 1 int core 5% from each module would result on 5% reduction of total die size. 1 int core = 50% of the total area dedicated to integer operations.

So, for a total die of lets say 1000 units with 8 int cores, 4 int cores represent 5% of the total die size or 50 size units.

So each int core is 12,5 size units and the 8 int cores take 100 size units or 10% of the core.

Assuming sizes for total die size or what is the Bulldozer Module size relative to total size is pure speculation, as we don't have any numbers other than that JF affirmation.

To remember:
"What I asked the engineering team was "what is the actual die space of the dedicated integer portion of the module"? So, for instance, if I took an 8 core processor (with 4 modules) and removed one integer core from each module, how much die space would that save. The answer was ~5%. "

That was the affirmation.

In no way this contradicts the affirmation that AMD increased the Module area dedicated to integer operations by 50% to achieve 80% performance.

Main point is DIE SIZE ? BULLDOZER MODULE.
tatertot - Monday, November 30, 2009 - link
I am disputing the JF claim: " Removing 1 int core 5% from each module would result on 5% reduction of total die size. "

I suspect that his engineers misunderstood his question, and it is actually the removal of the "extra core" from ONE BD module that would result in 5% overall die savings.

You can take it to the bank that Moore is correct that adding another integer execution unit group , L1D, etc to the core (thus making 2 cores, or a 'module') increased the size by 50%. Moore is the designer, not a marketing guy.

In order for Fruehe's claim to be correct, the uncore area would have to be VERY large:

Some more (different numbers):

Assume BD module is 30 mm2, (thus increased by 10 mm2, or 50% from 20 mm2 to add the second 'core', per Moore)

If 5% were actually the correct estimation of the area added for 4 BD modules (4 * 10 mm2 increase = 40 mm2 increase), then the overall die size would need to be... 800 mm^2.

This is nuts.

On the other hand, if "5% of the total die area" is the estimate of the space needed to add the integer resources to just 1 BD module, then the overall die can be 200 mm^2, so uncore 80 mm^2, 4 BD modules at 120 mm^2, and then Moore's numbers can be consistent with what JF heard back from the engineers.

So, my theory is that his engineers thought they were being asked how much of the total die (for a 4 BD module part) the increase in integer units to 1 BD module resulted in, while JF thought he was asking how much the increase to ALL 4 modules would be. This would be an easy misunderstanding to have, and I don't see another way to reconcile Moore's information (which I trust), with JF's claim.
GaiaHunter - Monday, November 30, 2009 - link
You start assuming the BM is 30mm^2.

But both integer cores are exactly the same size. So if the resources you add are 10mm^2, that means that 2 int cores take 20mm^2 and 8 is 80 mm^2.
GaiaHunter - Monday, November 30, 2009 - link
You start assuming the BM is 30mm^2.

But both integer cores are exactly the same size. So if the resources you add are 10mm^2, that means that 2 int cores take 20mm^2 and 8 is 80 mm^2.

Now each integer core is 10mm^2 and represents 5% total die size - bam 200mm^2 die as you said.

You had 50% resources to the BM and end with each int core at 5% of the die size and even get your 200mm^2 die.
GaiaHunter - Tuesday, December 1, 2009 - link
Now lets go the other way.

Lets assume JF is right and Moore is also right.

Grab a 8 core bulldozer CPU, shave 4 cores and save 5% die space.

CPU die size is 200mm^2.

5% is 10mm^2, so each int core is 2.5mm^2 and 8 of these will take 20mm^2.

Now, Deneb is 260mm^2.

If 8 core Bulldozer is 300mm^2, you end with 3.75mm^2 int cores.

Small?

Maybe.

Around half of the die will be L3$.

Northbridge circuits stay. Memory controller and the HT PHY also stay. L2$, fetch, decode and FPU are also shared.

So basically you are just removing a very small portion.

The question would be if you would need as much of those resources in the first place.
GaiaHunter - Monday, November 30, 2009 - link
Moore affirmation is only related to the INTEGER AREA of the BULLDOZER Module.

Fruehe's claim is about total die size.

If Moore's claim is about Integer Area of the Bulldozer module and Fruehe's claim is about die size, then, these claims don't have to be mutually exclusive.

Additionally, there is no BULLDOZER MODULE. Forget about it. It isn't a unit by itself.

2 (or more) of those bulldozer modules will share L2$ and L3$ for example, so you can't even define a damn size for a frigging module to start with.
ThaHeretic - Monday, November 30, 2009 - link
I really hate the move to call each "module" 2 "cores." AMD is shooting themselves in the foot when it comes to software licensing, in particular, Oracle DB licensing where they charge .5 CPU license for each x86 "multi-core." AMD's decision will double the cost of software running on Bulldozer.

Bad move AMD, bad move.
cfaalm - Monday, November 30, 2009 - link
Really? This issue popped up for OS licenses when x86 dual cores were first introduced. Microsoft decided to go on a "per socket" base, not counting cores.
Calin - Monday, November 30, 2009 - link
Microsoft requests licensing per mainboard socket. Oracle requests a decreased licensing cost per core, if the core is a part of a socket. Meanwhile, some other companies requests licensing costs per core.
Everyone with its own ways.
cfaalm - Tuesday, December 1, 2009 - link
Fair enough. So indeed this is a concern when buying this new stuff. I'd rather have AMD not call a this module 2 cores for the simple reason that is a sort of siamese twincore, not a true dual core. Though that is just the naming game. It looks promising nonetheless. Hopefully AMD/Oracle can enlighten the big system buyers by the time the decisions need to be made.
ThaHeretic - Tuesday, December 1, 2009 - link
A lot of licenses and MRCs are based on socket count, unfortunately, many of the most expensive software packages licensing arrangements are derived from core-count. Since Oracle changed their multi-core licensing near the end of 2006, quad-core x86 processors have counted as two licenses, six-cores as three licenses, etc. A Bulldozer quad-module die will therefore need 4 licenses for OracleDB.

Does this suck? Yes. Is SQLServer's licensing model better for end-users? Yes. Is SQLServer anywhere near as awesome as Oracle Database? Hell no, not even close.
DominionSeraph - Monday, November 30, 2009 - link
"AMD claims that the performance benefit from the second integer core on a single Bulldozer module is up to 80% on threaded code."

Yet their performance graph has FP gains outrunning integer.
Lifted - Monday, November 30, 2009 - link
This. Maybe this takes into account the moving of FP off the die to a add on module or GPU.

I never could have imagined we'd be going back to the add on FP modules.
GodisanAtheist - Monday, November 30, 2009 - link
I believe that's because Interlagos gets the integrated GPU core, which in terms of theoretical performance will send FP performance through the roof.
DominionSeraph - Monday, November 30, 2009 - link
But then the performance is underwhelming. Current-gen GPUs would be off the chart.
medi01 - Monday, November 30, 2009 - link
Could someone decrypt the following text for me please:

[quote]It all started about two weeks ago when I got a request from AMD to have a quick conference call about Bulldozer. I get these sorts of calls for one of two reasons. Either:

1) I did something wrong, or
2) Intel did something wrong.

This time it was the former. I hate when it's the former.[/quote]
GaiaHunter - Monday, November 30, 2009 - link
It means Anand get these calls ("short conference calls") requested by AMD when:

1) Anand makes a mistake

or

2) Intel is being naughty (like telling OEM to not sell AMD).

I asked Anand in the Bulldozer article if a quad-core zambezi meant 4cores/8 threads or 4cores/4 threads.

He said (and I was convinced at that time it was the correct answer too)that a zambezi quad-core meant 4 cores/8 threads and an octo-core would be 8cores/16 threads. Or if you prefer 4Modules/8cores/8threads and 8Modules/16cores/16threads.

But it seems it is 2Modules/4cores/4threads and 4modules/8cores/8threads.

Sincerely, I can't really blame Anand - this shit is confusing.
Kiijibari - Monday, November 30, 2009 - link
Yes .. for desktops. However for Servers there will be again an MCM with two dies, i.e. 8 modules, 16 cores, 16 threads, called Interlagos.
piesquared - Monday, November 30, 2009 - link
Which makes a person wonder, if AMD has a 16 core Intelagos in the server space, how nice and cool and efficient will an 8 core Zambezi be.
GaiaHunter - Monday, November 30, 2009 - link
U can see it in there

http://www.anandtech.com/cpuchipsets/showdoc.aspx?...">http://www.anandtech.com/cpuchipsets/showdoc.aspx?...

And also we can see that this new designation was causing quite a confusion in the forums.

http://forums.anandtech.com/showthread.php?t=20230...">http://forums.anandtech.com/showthread.php?t=20230...
pcfxer - Monday, November 30, 2009 - link
The first with L3 cache was Intel Pentium 4 EXTREME EDITION.

Phenom just had the most logical use of L3 in the sense that it served as a "community" buffer.
JimmiG - Monday, November 30, 2009 - link
The first with L3 cache was actually the AMD K6-III released in 1999. Of course, the L3 was actually on the mobo, while the L2 was on-die. But it did use a tri-level cache, making it outperform the Pentium III Katmai on integer workloads.
Calin - Monday, November 30, 2009 - link
Only on instructions per clock - the K6-3 was available at 400 and 450 MHz, while the Pentium !!! was available at (much later) up to 1300 MHz.
However, the K6-3 was in the competition against Pentium !!! at up to 550-600 MHz, as the original K7 appeared around those times.
mino - Sunday, January 17, 2010 - link
K6-2+ and K6-3 were PII competitors that were able to outperform PII and even early PIII purely thanks to ON-DIE L2 and 3Dnow.
L3 was on motherboard back then, was slow, and had little to do with K6-2+/3 performance gains.

Also K6-2+/3 was the top AMD CPU for a very short time as K7 came righ afterward.
K6-2+/3 was the notebook & low cost bussiness desktop solution of the times while K6-2 (without cache) was the budget solution.
medi01 - Monday, November 30, 2009 - link
Was it shared?
Zool - Monday, November 30, 2009 - link
Any info on the amount of transistors for each module ?
At least they can make a decent notebook cpu from the modules. Mobile nehalem and phenons are everything with 4 cores just not low power notebook cpu-s.
Zool - Monday, November 30, 2009 - link
Something like 1 module no L3 cache for netbooks,notebooks.
2 modules with L3 cache for average notebooks and 4 or more modules for desktop replacement notebooks. They could play with the cache sizes too.
The 1 module no L3 cache for example would kill atom in performance. And the power usage could be still quite good. There is no meaning for 2-4 W slow cpu in netbook when the mainboard,hdd,display eats several times more electric power together than the cpu itself.
JVLebbink - Monday, November 30, 2009 - link
I am wondering about the shared FPU. Does one thread really have to be purely integer for the other thread to use the 2 FMACs at the same time, or can one thread use the 2 FMACs if the other one is currently not sending FPU instructions. "if one thread is purely integer, the other can use all of the FP execution resources to itself." sounds like the first, but that would (1) waste FPU resources, (2) give problems with threads switching cores. (how does the not witched thread know it can now use all of the FPU resources?)

My presumption is that AMD chose the former and that if 2 (FMAC) instructions of different 'cores' reach the FPU (at that point in time) they will be executed using an FMAC each and if 2 instructions of one core reach the FPU without the other core sending any (at that point in time) they will be executed in parallel using both FMACs.

If my presumption is correct AMD decided not to HyperThread their ALUs but did HyperThread their FPU.
Kiijibari - Monday, November 30, 2009 - link
You do not have to switch anything.

The shared FPU has ONE common scheduler. Both threads can issue Ops into the scheduler queue. If there are no FPU Ops from the first thread then - of course - the second thread has the power of the whole FPU.

Very simple ... it's like a chat room. Several people type in messages and you can see it serialized in the chat window (that would be equivalent to the queue).

You will read the messages one after another, according to their posted time / when they were issued. It is the same for the Bulldozer FPU.

You do not need to switch from one chat member to another to read their messages. Neither does the FPU has to switch ;-)
kobblestown - Monday, November 30, 2009 - link
I hope AMD regain their common sense and use the term "core" in amore conventional sense. According to their definition a quad core will only have 2 FP pipelines. What I see in a quad core Bulldozer is dual core with 2 Int and 1 FP pipeline each. I wish them the very best in their effort to regain the performance crown but abusing existing terminology will not help in that.
Zool - Monday, November 30, 2009 - link
OS , drivers, API layers trashing the cpu constantly are usualy integer loads. For average math in code the curent fpus are fast enough. Things that realy need paralel FP performance (multimedia things,graphic) are using SIMD SSE units and those things should run much faster on gpus anyway. For 5% extra die area the extra integer pipeline rocks.
fitten - Tuesday, December 1, 2009 - link
Except it isn't 5% additional die area.
Calin - Monday, November 30, 2009 - link
This was started by Sun's Niagara (I think) processor - 32 "int cores" and only one FP unit. A physical integer core ran four threads at a time (one instruction from each, with instant context switching between them), so one would have had eight physical integer cores with only one FP unit.
The Niagara 2 would have had one FP unit for each of those integer cores, so one FP for each four int cores.
defter - Monday, November 30, 2009 - link
This confusion has nothing to do with int or fp cores. By a common definition, a core a standalone unit, which can function on it's own if necessary.

For example, each of Niagara's 8 cores had an own fetch and decode unit. In AMD's case, the "module" is the unit with it's own fetch and decode units, and integer ALU clusters only have own scheduler. Therefore, it's very confusing to call these clusters "cores".

AMD seems to learned confusing marketing from from ATI and it's 320/800/1600 shader GPUs (which have actually have 64/160/320 shader units) :)
Spoelie - Monday, November 30, 2009 - link
You're abusing the term pipelines and cores as well.

Trying to describe an unconventional design with conventional terms might not always be very clear, but there is no right or wrong on this issue, just different viewpoints.
kobblestown - Monday, November 30, 2009 - link
Fair enough. The use of the term "core" is still confusing though. There seems to be only one complete pipeline which branches after instruction decode. It's interesting whether the Icache is trace cache, i.e. contains decoded instructions or is a regular cache that needs to be fed back via the fetch/decode bottleneck. In the latter case I see no merit in calling the two integer (for lack of a better term) pipelines separate cores.
Penti - Monday, November 30, 2009 - link
Uhm, the Sun UltraSPARC T1 was a 8-core eight integer-units and one floating point unit, and with CMT - eight threads per core 32-thread. It was still called a 8-core CPU, but the T2 included a fp unit for every integer unit, so I doubt AMD will use this config for long. But who knows. It's impressive if its up to 30% faster then a PII with twich the number of FPU's.

Also one integer unit already includes three ALU's. They (the core/scheduler) seems independent enough.
blyndy - Monday, November 30, 2009 - link
I also don't like their mixing of definitions. Let's just skirt that issue for now by call them by their thread number ie: 4-thread bulldozer, 8-thread bulldozer etc.
blyndy - Monday, November 30, 2009 - link
From a marketing standpoint, '8 cores' would be more desirable to the laymen buyer than '4 cores', so that's one reason why they might have chosen to do it. Indeed I think Intel will quickly be following AMDs definition so as not to have 'less cores'.
lyeoh - Wednesday, December 2, 2009 - link
8 cores is not automatically more desirable than 4 cores to someone who buys stuff like Oracle. They get charged per _core_.
heulenwolf - Monday, November 30, 2009 - link
Its a poor choice of words. Unfortunately, since the term "core" was never fully defined, everyone gets to have their own understanding of what it means. I took core to mean a complete processor that could, if packaged alone, perform all the functions of a processor (both integer and floating point). I think AMD should have taken the high road and called "modules" "dual-integer cores" instead of splitting them into two "cores" with this extraneous, shared FPU tacked on like an afterthought. It makes the term core meaningless. I would guess that making the term "Core" meaningless would be advantageous to AMD, however, since that is the term Intel uses for their entire architecture.
Nocturnal - Monday, November 30, 2009 - link
Very interesting. I hope that AMD will one day regain their edge they once held against. Oh how I miss those days. I embrace Intel nonetheless.
Alouette Radeon - Wednesday, March 10, 2010 - link
How the hell could you embrace Intel after all the harm they've caused AMD, nVidia, VIA and ultimately us, the consumers with their criminal tactics? They don't give a damn about you, they just want your money. A lot of people say that AMD is no different (I know nVidia sure is the same as Intel in that regard) but at least they operate with integrity. They've never been accused of anything underhanded or sneaky. For that matter, neither has VIA. Intel and nVidia on the other hand, while nVidia has never done anything downright CRIMINAL, they've still been dishonest as hell. Intel on the other hand, has stooped about as low as you can go. So go ahead, embrace Intel, just like a stupid biatch who won't leave her abusive spouse. She just keeps going back for more and people like me who have brains can only shake our heads and wonder.
AmbroseAthan - Monday, November 30, 2009 - link
While I agree Intel has the performance crown now, I can't knock AMD for being the value right now. Picked up an AMD x4 955 BE and Asus motherboard (full ATX/crossfire) for $230 to build my parents a computer with (Newegg combo). Intel can't compete in that price space easily.
dilidolo - Monday, November 30, 2009 - link
Intel can't compete in that price range? No, Intel doesn't want to. Manufacturing capacity is limited, if I can sell more higher margin products, why should I go after lower margin segment? Leave that segment to AMD, the more AMD sells in that segment, the more money AMD looses. If Intel wants to compete in that segment, they can easily kill AMD, that's not what Intel wants to do.
siuol11 - Tuesday, December 1, 2009 - link
Ah, the ravings of the marginally informed... How the internet loves them!
blyndy - Monday, November 30, 2009 - link
I'm very excited about AMD's brand-new design and how it's new ideas translate into performance, however:

"The quad-core Zambezi should have roughly 10 - 35% better integer performance than a similarly clocked quad-core Phenom II"

That sounds a bit low, I hope the final comparable CPUs can manage something more like 15 - 40% better integer performance over their PhII counterparts. Then again perhaps that's just because of Intels large performance increases between their recent architectures making us expect more -- they are more the exception than the rule, so 10 - 35 % shouldn't be sneezed at, although that just may not be competitive on their release in 2011.
mczak - Monday, November 30, 2009 - link
Considering that the int cores actually have less execution units (used to be 3 alus (plus shared load/store, but can do two operations per clock), bulldozer only 2 alus (+ load and separate store)) I think 10-35% better integer performance is amazing. More than that would be a miracle imho...
Zool - Monday, November 30, 2009 - link
From the previous article "The extra integer core (schedulers, D-cache and pipelines) adds only 5% die space".
So the quad core Zambezi (2 modules, 4 integer pipelines)should have roughly 10-35% better integer performance than a similarly clocked quad-core Phenom II. Thats a super boost per transistor count.
nafhan - Monday, November 30, 2009 - link
Based on AMD's re-defining of the word core that's actually a HUGE improvement. A quad core Zambezi has a similar transistor budget as a dual core Phenom II, and a 10-35% performance improvement.
In other words, quad core integer performance for dual core price.
psychobriggsy - Tuesday, December 1, 2009 - link
A quad-core Bulldozer has the same transistor budget as a tri-core Phenom II (if they existed natively), yet performs around 20% better than a quad-core.

I think that SMT would have provided easier performance pickings (20% for 5% die space). I don't understand why AMD have been avoiding SMT so far. Sure, 80% more performance for 50% die space isn't to be sneezed at, but it's not so easy pickings.

In addition there are more integer resources than in a Phenom II core, and the FPU has two 128-bit FMAs, so each core could be reasonably bigger. In effect it could be that 1 Bulldozer module is the same size as two Phenom II cores - so all you have then is the 10-35% performance increase. I hope this is per-clock...
titan7 - Sunday, December 6, 2009 - link
Perhaps the k7/k8 didn't make sense to add SMT? The p4 was really designed for it and had it enabled in genII. Look how long it took Intel to get SMT into the Pentium Pro/Core/i7.

I suspect AMD is designing for SMT right now, but gen1 is just "get to market ASAP because Intel is faster right now" and genII will have SMT enabled.
gost80 - Monday, November 30, 2009 - link
Judging apparent benefit of this architecture over Intel's can be done only if the die size per _module_ is also made available. So, how about it?
Zool - Thursday, December 3, 2009 - link
The thing is that the picture in this article contains shared L2 cache and L3 cache too and its quite unclear from the picture if L2 is shared to one module or all modules.(sharing all modules 2 times with L2 and L3 would be quite useless)
The bulldozer picture in the other article from anandtech http://it.anandtech.com/IT/showdoc.aspx?i=3681&...">http://it.anandtech.com/IT/showdoc.aspx?i=3681&... shows clearly that the L2 cache belongs to module.
So clearly ading 50% to the core(which is everything till L1) is much less than 2 whole cores each with its own same size L2 cache ( Nehalem has only tiny 256KB cache per core from die area reasons).
If we take whole die size with 8MB L3 cache and 1MB L2 cache per module/core (+ things like memmory controler,hypertransport core/module conects) the final die increase could end in 10-15% or even less.
Zool - Thursday, December 3, 2009 - link
So a 4 module Bulldozer core with 512KB L2 cache and 6MB L3 cache could be something like 10-15% bigger than a 4 core PhenomII with 512KB L2 cache per core and same 6MB L3 cache. For 80% more integer performance that wouldnt be bad.
And about Oracle , the server cpu-s from both intel and AMD are running in ranges from few hundred dolars to over 2k dolars with minimal performance increase just more sockets suported and everyone is buying them. So i wouldnt care less for them than a fly on my window. It will end on final pricing per core for cpu not core/modul license price.
swindelljd - Wednesday, December 2, 2009 - link
I bet Oracle is salivating over the new core count technique since it is sure to create a huge surge in their revenue because they charge per core on the x86 platform.

AMD Core Counts and Bulldozer: Preparing for an APU World

Post Your Comment

94 Comments

Back to Article

mattclary - Monday, December 14, 2009 - link

aj28 - Thursday, December 3, 2009 - link

aj28 - Thursday, December 3, 2009 - link

swindelljd - Wednesday, December 2, 2009 - link

Sivar - Tuesday, December 1, 2009 - link

JumpingJack - Wednesday, December 2, 2009 - link

nirmv - Tuesday, December 1, 2009 - link

Seramics - Tuesday, December 1, 2009 - link

Alberto - Tuesday, December 1, 2009 - link

epobirs - Tuesday, December 1, 2009 - link

Alberto - Tuesday, December 1, 2009 - link

gruffi - Sunday, December 6, 2009 - link

psychobriggsy - Tuesday, December 1, 2009 - link

epobirs - Tuesday, December 1, 2009 - link

Nehemoth - Tuesday, December 1, 2009 - link

Nehemoth - Tuesday, December 1, 2009 - link

Milleman - Monday, November 30, 2009 - link

jmurbank - Monday, November 30, 2009 - link

Milleman - Monday, November 30, 2009 - link

tatertot - Monday, November 30, 2009 - link

vsary6968 - Tuesday, December 1, 2009 - link

vsary6968 - Tuesday, December 1, 2009 - link

Anand Lal Shimpi - Monday, November 30, 2009 - link

piesquared - Monday, November 30, 2009 - link

JumpingJack - Tuesday, December 1, 2009 - link

GaiaHunter - Monday, November 30, 2009 - link

GaiaHunter - Monday, November 30, 2009 - link

psychobriggsy - Tuesday, December 1, 2009 - link

smilingcrow - Monday, November 30, 2009 - link

Zool - Monday, November 30, 2009 - link

Zool - Monday, November 30, 2009 - link

psychobriggsy - Tuesday, December 1, 2009 - link

GaiaHunter - Monday, November 30, 2009 - link

tatertot - Monday, November 30, 2009 - link

psychobriggsy - Tuesday, December 1, 2009 - link

GaiaHunter - Monday, November 30, 2009 - link

GaiaHunter - Monday, November 30, 2009 - link

HolKann - Monday, November 30, 2009 - link

GaiaHunter - Monday, November 30, 2009 - link

tatertot - Monday, November 30, 2009 - link

GaiaHunter - Monday, November 30, 2009 - link

GaiaHunter - Monday, November 30, 2009 - link

GaiaHunter - Tuesday, December 1, 2009 - link

GaiaHunter - Monday, November 30, 2009 - link

ThaHeretic - Monday, November 30, 2009 - link

cfaalm - Monday, November 30, 2009 - link

Calin - Monday, November 30, 2009 - link

cfaalm - Tuesday, December 1, 2009 - link

ThaHeretic - Tuesday, December 1, 2009 - link

DominionSeraph - Monday, November 30, 2009 - link

Lifted - Monday, November 30, 2009 - link

GodisanAtheist - Monday, November 30, 2009 - link

DominionSeraph - Monday, November 30, 2009 - link

medi01 - Monday, November 30, 2009 - link

GaiaHunter - Monday, November 30, 2009 - link

Kiijibari - Monday, November 30, 2009 - link

piesquared - Monday, November 30, 2009 - link

GaiaHunter - Monday, November 30, 2009 - link

pcfxer - Monday, November 30, 2009 - link

JimmiG - Monday, November 30, 2009 - link

Calin - Monday, November 30, 2009 - link

mino - Sunday, January 17, 2010 - link

medi01 - Monday, November 30, 2009 - link

Zool - Monday, November 30, 2009 - link

Zool - Monday, November 30, 2009 - link

JVLebbink - Monday, November 30, 2009 - link

Kiijibari - Monday, November 30, 2009 - link

kobblestown - Monday, November 30, 2009 - link

Zool - Monday, November 30, 2009 - link

fitten - Tuesday, December 1, 2009 - link

Calin - Monday, November 30, 2009 - link

defter - Monday, November 30, 2009 - link

Spoelie - Monday, November 30, 2009 - link

kobblestown - Monday, November 30, 2009 - link

Penti - Monday, November 30, 2009 - link

blyndy - Monday, November 30, 2009 - link