Hot Chips 2018: Tachyum Prodigy CPU Live Blog

Original Link: https://www.anandtech.com/show/13255/hot-chips-2018-tachyum-prodigy-cpu-live-blog

Hot Chips 2018: Tachyum Prodigy CPU Live Blog

by Dr. Ian Cutress on August 21, 2018 5:55 PM EST

15 Comments

05:55PM EDT - One of the more interesting talks is from Tachyum, who have a deep presentation about their new hyperscale Prodigy processors with up to 64 cores and eight channel memory. Tachyum is headed up by one of the original co-Founders of SandForce and Wave Computing. The talk is set to start at 3pm PT / 10pm UTC.

06:01PM EDT - Trying to give the best of CPU GPU TPU

06:01PM EDT - Prodigy is a server/AI/supercomputer chip for hyperscale datacenters

06:01PM EDT - Is programmable and optimized

06:02PM EDT - AI based on spiking model requires other hardware

06:02PM EDT - Not all AI is neural networks

06:02PM EDT - Tape out in 2019

06:02PM EDT - One die with three different versions

06:03PM EDT - Different variants for different markets

06:03PM EDT - T864 = 64 cores and 8 DDR4/5 controllers, 72 PCIe 5.0, 2x400G internet on single piece of silicon

06:03PM EDT - Each core will be faster than Intel Xeon

06:03PM EDT - Single threaded performance on Spec INT/FP better than Xeon

06:04PM EDT - 2TF performance per core

06:04PM EDT - Better perf than Volta chip wide

06:04PM EDT - 180W at 4 GHz

06:04PM EDT - 7nm, 12 metal layers, 0.8V, 290 mm2

06:05PM EDT - Data wires are small to reduce power

06:05PM EDT - Smaller than ARM

06:05PM EDT - Using standard cells and memories

06:05PM EDT - In order design - out of order execution with compiler

06:05PM EDT - Accepts C/C++/Fortran

06:05PM EDT - Programming model is CPU-like

06:06PM EDT - 32 x INT64 registers

06:06PM EDT - 32 vector registers 256/128 bits

06:06PM EDT - Sorry pictures have stopped

06:06PM EDT - maybe not

06:06PM EDT - ILP based on bundling

06:07PM EDT - 2 load + 2 multiply-add + 1 store + 1 address increment + 1 compre + 1 branch per clock

06:07PM EDT - 1.72 instructions per cycle normal

06:07PM EDT - based on bundle size from compiler

06:07PM EDT - 2.6 instructions/bundle on average

06:07PM EDT - 8-RISC-style micro-ops per cycle

06:08PM EDT - Itanium was stalled 50% of the time due to dynamic scheduling. Normal OoO core is about 15%. Due to our simulator, we achieve less than 20% stall

06:08PM EDT - Can do OoO and speculation in software without expensive hardware

06:08PM EDT - Changing the perf/$ and perf/W equation

06:09PM EDT - Linux and FreeBSD ported in 2019

06:09PM EDT - Device drivers, boot-loaded and Java JIT

06:09PM EDT - Replace parts of GCC to enable Prodigy core

06:09PM EDT - Working so critical applications work day one

06:10PM EDT - Learning from Arm entering the hyperscale space - took over 12 months and $100m to compile and port to Arm

06:10PM EDT - For a startup that would be problem

06:10PM EDT - Allow customers to use x86 binary through QEMU emulator

06:11PM EDT - allows for shorter deployment time

06:11PM EDT - Existing binaries supported via emulators

06:11PM EDT - Use service providers in Europe to port existing applications

06:12PM EDT - The core - 8 micro-ops decode and dispatch per clock (16 bytes per op)

06:12PM EDT - 1 store, 1 load, 1 load/store unit

06:12PM EDT - 32 x 512-bit vector registers

06:12PM EDT - Conventional front end

06:12PM EDT - Simple branch predictor works sufficiently well with compiler

06:13PM EDT - Machine can execute 1-4 words per clock, each word with 1-2 ops

06:13PM EDT - Instruction Execution looks like in-order, but can slide instructions on cache misses

06:14PM EDT - 9-stage pipeline due to no register renaming required

06:14PM EDT - Supports 64B/clock per memory op into L1

06:14PM EDT - Private 512KB L2 cache

06:14PM EDT - Spill and fill with data cache and L2 cache

06:15PM EDT - L3 distributed cache is same latency as L2

06:15PM EDT - Supports FP16 and INT8

06:15PM EDT - 2x512-bit MAC or 3x512-bit int per clcok

06:16PM EDT - Standard mesh with the cores, 8x8

06:16PM EDT - It also has an IO ring

06:16PM EDT - for networking applications, lower latency for far memory access

06:16PM EDT - important for no packet loss and reduces congestion

06:16PM EDT - 32 bytes per clock per direction I/O to DRAM

06:17PM EDT - 2 x HBM3 controllers and 8 x DDR controllers

06:17PM EDT - Average utilization on Amazon EC is 30%. Facebook average utilization is 40%

06:18PM EDT - 60-70% idle. With an AI chip, you can train / inference when idle with chips that are capable

06:18PM EDT - Our chips you can run servers and then AI when idle

06:18PM EDT - access to 20x AI

06:18PM EDT - don't need different hardware

06:19PM EDT - In a Facebook 100MW datacenter, 442k servers. 40% idle means 265k idle servers per day

06:19PM EDT - With our chip, those chips can run AI during downtime

06:19PM EDT - e.g. 256k servers, each with 4x2x100 GbE with no oversubscription

06:21PM EDT - Also saves power

06:21PM EDT - New technology is needed - datacenters could consume 40% of planet energy by 2040

06:22PM EDT - Overall aims

06:22PM EDT - Outperform Xeon E5 v4 using same GCC

06:22PM EDT - 4.0 GHz on 7 nm

06:22PM EDT - Tape out in 2019

06:22PM EDT - Multiple interested and engaged customers

06:23PM EDT - Integer Datapath is in place - currently limited by speed of SRAMs

06:23PM EDT - Q&A time

06:26PM EDT - Q: I feel like deja vu - at Hot Chips, Intel introduced VLIW-concept Itanium that pushed complexity onto the compiler. I see traces of that here. What are you doing to avoid the Itanium traps? How will you avoid IP from Intel? A: Itanium was in-order VLIW, hope people will build compiler to get perf. We came from opposite direction - we use dynamic scheduling. We are not VLIW, every node defines sub-graphs and dependent instructions. We designed the compiler first. We build hardware around the compiler, Intel approach the opposite.

06:27PM EDT - Q: You mention emulation for x86. What kind of penalty in performance? A: About 40% performance loss. Significant because our customer didn't want us to invest in that, as 90% of their software will be natively compiled by next year. It's a temporary deployment. Binary 4.0 GHz emulated still outperforms 2.5 GHz Xeon

06:28PM EDT - Q: How do you dodge the patents in place in this? A: We building a new beast. Our tech is significantly different. This is America.

06:30PM EDT - That's a wrap. Next live blog in 30 mins on IBM.

Hot Chips 2018: Tachyum Prodigy CPU Live Blog

Log in

Don't have an account? Sign up now