32 Comments
GNUminex_l_cowsay - Friday, June 26, 2020 - link
Did Fujitsu stop using SPARC?
jeremyshaw - Friday, June 26, 2020 - link
Yeah, they abandoned SPARC (with their own custom extensions) in favor of Arm SVE.
Deicidium369 - Friday, June 26, 2020 - link
Abandoned or did Larry do what Larry does?
eastcoast_pete - Friday, June 26, 2020 - link
This! Oracle stopped any further development of SPARC, and I can see why Fujitsu didn't want to keep paying license fees, royalties and whatever for an arch that is basically abandoned. Too bad, SPARC was quite good at handling a large number of threads simultaneously.
Oxford Guy - Saturday, June 27, 2020 - link
These Chinese are still developing MIPS.
Oxford Guy - Saturday, June 27, 2020 - link
(Typo: The, not these.)
kgardas - Saturday, June 27, 2020 - link
You are mixing apples and oranges here. (1) Fujitsu has its own SPARC implementation, not comparable to Oracle's/Sun's SPARC Mx and Tx, so they basically just license the SPARC64 ISA. (2) The SPARC that handles a lot of threads simultaneously is the SPARC Tx line, which Sun Micro acquired by purchasing another company; the first was the SPARC T1 in the T1000 and T2000 boxes. Their single-thread performance was really low even at that time. (3) IMHO Fujitsu is still able to sell you a SPARC box if you like, and probably will be for the foreseeable future. Their A64FX is just for HPC.
jeremyshaw - Saturday, June 27, 2020 - link
AFAIK, Fujitsu has merged their workstation/server/HPC chips for a few generations now. While the front end of the A64FX isn't really set up for general-purpose workloads (definitely HPC), given its narrow nature, it does indicate Fujitsu is moving over to Arm sooner or later. They already spent the time moving their SDK over to Arm to support the A64FX.
yetanotherhuman - Friday, June 26, 2020 - link
Yup. Look up the world's fastest supercomputer.
Oxford Guy - Saturday, June 27, 2020 - link
"one based on Arm – specifically, the A64FX"
Yeah. Apparently, they're using old AMD parts!
Oxford Guy - Saturday, June 27, 2020 - link
The creativity of these people is off the charts.
sing_electric - Monday, June 29, 2020 - link
That was my first thought when I saw Fujitsu's choice of name for the processor. I'm not sure what marketing agency they use, but I'll bet they also pitched using the name 'Camry' to Honda or something.
Sahrin - Friday, June 26, 2020 - link
This architecture kind of begs the question, what does an x86 CPU with HBM on-package perform like? There aren't a massive number of bandwidth-constrained applications, but those that are also tend to be ones where the x86 FPU (which is a monster as currently constituted) would enjoy the bandwidth.
jeremyshaw - Friday, June 26, 2020 - link
CPUs do have a tendency of making up for this shortfall by using a lot of cache: 256MB of L3 on the best x86 setups, and Power9 goes even wilder with complicated cache and memory system setups (which claim 350GB/s for their memory system). Even something like the massive GA100 only has 40MB of last level cache available.
That being said, the latest Intel setup is ~140GB/s (2933, 6 channel) and the latest AMD is ~200GB/s (3200, 8 channel). Sure, GDDR6 can go higher (at the cost of latency) and HBM can go even faster, but I think part of a CPU's FPU performance is deep integration into the execution pipeline, using the same registers and cache hierarchy. Memory bandwidth alone would not be the main performance differentiator.
However, I would also like to see a HBM2 high performance x86 CPU, or at least good reasons why they don't exist.
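For anyone who wants to check those bandwidth figures, here is a rough back-of-the-envelope sketch in Python. It assumes a 64-bit (8-byte) data bus per DDR4 channel and reads "2933" and "3200" as mega-transfers per second. The function name is made up for illustration, and these are theoretical peaks, not measured numbers.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumption: each DDR4 channel has a 64-bit (8-byte) data bus.
    """
    bytes_per_second = mega_transfers_per_s * 1e6 * bus_bytes * channels
    return bytes_per_second / 1e9


# The two configurations quoted in the comment above.
intel = peak_bandwidth_gbs(2933, channels=6)
amd = peak_bandwidth_gbs(3200, channels=8)

print(f"Intel, 6 x DDR4-2933: {intel:.1f} GB/s")
print(f"AMD,   8 x DDR4-3200: {amd:.1f} GB/s")
```

That works out to roughly 141GB/s and 205GB/s, in line with the ~140 and ~200 quoted above.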
Kevin G - Friday, June 26, 2020 - link
The biggest benefit of HBM isn't necessarily bandwidth but a potential reduction in memory latency. The amount of time to access HBM is lower than for commodity DRAM.
There is operational parallelism with HBM too, as each 1024-bit-wide stack necessitates its own independent memory controller: performing three reads and one write operation across four stacks is not an issue. Desktop systems generally have a single memory controller, Xeons tend to have two, and AMD's Epyc has four. Future generations of HBM have the possibility of incorporating independent read/write buses plus separate address and command buses. While that would radically increase the number of vias through an interposer, such changes would eliminate the need for the turn-around times that DRAM has to account for.
HBM has enough capacity now for commodity PC work but the reason we don't see them (yet?) for general usage is simply down to cost. There is still a huge premium attached to HBM that doesn't make it viable in the razor thin margin of PCs. Servers could easily absorb the cost of HBM enabled chips but server workloads tend to leverage far more memory than can be put into a package with a CPU. Like you though, I'd love to see such a system.
saratoga4 - Friday, June 26, 2020 - link
>There is operational parallelism with HBM too as each 1024 bit wide stack necessitates its own independent memory controller: performing three reads and one write operation across four stacks is not an issue.
HBM channels are 128 bits wide, so each 1024-bit stack has 8 fully independent channels. 4 stacks give a total of 32 independent memory accesses concurrently.
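The channel arithmetic can be sketched in a few lines of Python. The widths come straight from the numbers above. The 2.0 Gb/s per-pin rate is my own assumption (a commonly quoted HBM2 speed), and the function names are made up for illustration.

```python
STACK_WIDTH_BITS = 1024    # interface width of one HBM stack
CHANNEL_WIDTH_BITS = 128   # width of one independent HBM channel


def hbm_channels(stacks):
    """Number of independent channels across all stacks."""
    channels_per_stack = STACK_WIDTH_BITS // CHANNEL_WIDTH_BITS  # 8
    return stacks * channels_per_stack


def hbm_peak_gbs(stacks, gbit_per_pin=2.0):
    """Theoretical peak bandwidth in GB/s.

    Assumption: every pin runs at gbit_per_pin Gb/s (2.0 is a
    commonly quoted HBM2 rate, not a figure from this thread).
    """
    return stacks * STACK_WIDTH_BITS * gbit_per_pin / 8


print(hbm_channels(4))   # 32 independent channels
print(hbm_peak_gbs(4))   # 1024.0 GB/s
```

Under that per-pin assumption, four stacks land at about 1TB/s, which is the same ballpark as the figure mentioned elsewhere in this thread.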
schujj07 - Saturday, June 27, 2020 - link
Using HBM as a high-speed L4 would for sure help performance for servers. The reason server CPUs have so much cache is that they can use it and it helps performance. In a virtualized environment you might have 30 VMs running on a single host. A good number of those VMs might only have 4-8GB RAM and be sitting idle most of the time. Give that host 8+GB of L4 HBM at 1TB/sec and you could probably increase performance substantially.
brucethemoose - Friday, June 26, 2020 - link
"good reasons why they don't exist."
Price, price, and price.
According to this: https://semiengineering.com/whats-next-for-high-ba...
HBM2 is about $120/16GB stack, and that may not include the interposer and the extra testing/validation.
Oxford Guy - Saturday, June 27, 2020 - link
Plenty of enthusiasts would be willing to pay that.
That's peanuts when compared with the Nvidia GPU tax.
brucethemoose - Monday, June 29, 2020 - link
Maybe, but every addition to the BoM is multiplied, and comes at the cost of other features (like, say, a bigger die).
saratoga4 - Friday, June 26, 2020 - link
>This architecture kind of begs the question, what does an x86 CPU with HBM on-package perform like?
Probably similar to standard DDR4. Even the big Skylake-SP dies have far fewer cores than typical high-bandwidth designs like GPUs or these ARM vector accelerators, so having huge numbers of parallel memory channels doesn't make as much sense. You just need enough channels to keep up with your demand; having more than you need doesn't make individual accesses any faster.
MenhirMike - Friday, June 26, 2020 - link
The naming of the CPU still makes me do a double take on whether AMD just came out with a new Athlon 64 FX-series :)
I think it's interesting how in the span of a month or so, we went from "ARM in the cloud is nice and all, but there are no real desktop systems to develop on" to several options (including whatever Apple's gonna do), though of course, a $40000 machine is not a developer desktop machine. Then again, for essentially having a slice of a TOP500 supercomputer, it's not bad pricing?
Anyway, good times ahead for ARM outside of just mobile devices.
MenhirMike - Friday, June 26, 2020 - link
(Sidenote: I do wish that CPU companies would offer defective chips as souvenir/decorative pieces. I wouldn't mind wallmounting one of these next to an Itanium and Opteron, but I doubt these will show up on eBay for less than 100 bucks anytime soon :P)
Deicidium369 - Friday, June 26, 2020 - link
https://www.youtube.com/watch?v=rUieSdFbLA4 - goes a little past Itanium and Opteron
thetrashcanisfull - Friday, June 26, 2020 - link
Seems less appealing without the built-in TOFU interconnect. Unless that is (hopefully) used for nodes within a single chassis?
Stele - Saturday, June 27, 2020 - link
Nah, this card's meant to interface with a string of others in a self-contained computing pod, so it uses the EDAMAME interconnect instead.
SuperiorSpecimen - Saturday, June 27, 2020 - link
Fantastic! I lol'd
ozzuneoj86 - Friday, June 26, 2020 - link
I still read it as "Athlon 64 FX". I just can't help it.
Oxford Guy - Saturday, June 27, 2020 - link
Obviously, it was intentional. Creative naming is not this company's strong suit, clearly.
Ripping someone else off, though, is.
I am strongly reminded of a certain company's laptops that look almost exactly like a MacBook Pro.
ravyne - Sunday, June 28, 2020 - link
2 nodes for 40k seems like not a great deal TBH, and I'm very much a fan of the A64FX otherwise. That's only rough parity with the cost of a modest server and a couple of high-end Tesla cards, and not much different in practice since both platforms are effectively SIMD architectures. Tough to sell a newcomer at the same price as an incumbent without offering something very different, or at least significantly undercutting their price. 4 nodes at 40-50k and you'd really be talking; 8 nodes at 65-70k even better.
Ian Cutress - Sunday, June 28, 2020 - link
I have read (though I can't confirm) that if you order these outside Japan, the minimum order quantity is 128 nodes, so 16 x 2U
AJ_NEWMAN - Sunday, June 28, 2020 - link
If this chip is successful in HPC, do you think Fujitsu will accelerate the creation of an SVE2 / 5nm / HBM3 / ARM v9 / more (130?) core SoC - or is the next colossal jump in performance going to come from stacking the memory vertically (i.e. on the same chipset)?