38 Comments
CajunArson - Thursday, July 20, 2017 - link
Give him the stick. DON'T give him the stick!
Jhlot - Thursday, July 20, 2017 - link
As a layperson, can someone explain how a 1 W (maybe 2.5 W) tiny USB device contributes anything to AI acceleration? This is slower than using a similarly priced discrete GPU, right? If it is similar in speed then Intel is dumb; they should make a PCIe card with a whole bunch of these to make an actually useful product.
bcronce - Thursday, July 20, 2017 - link
"As a lay person someone explain how a 1W maybe 2.5W tiny USB device contributes anything to AI acceleration"This 1watt $80 stick is about 33% of the performance of a $5500 GPU and consumes 1/250th the power. Slight apples to oranges, but ballpark close.
FriendlyUser - Thursday, July 20, 2017 - link
Given the extreme efficiency of modern GPUs and the considerable know-how of nVidia, which has been building specific libraries for years, I have a hard time believing that this tiny stick can really be that competitive. Probably in a very, very specific scenario involving very, very specific benchmarks. But general AI use? Hard to believe.
saratoga4 - Friday, July 21, 2017 - link
These devices pair a low-power mobile processor with a big vector multiply unit and some SRAM cache. They use less power than a GPU because they lack 99.9% of the hardware in a GPU. Aside from matrix multiplication operations, they have the processing power of a low-end smartphone.
Yojimbo - Thursday, July 20, 2017 - link
At 100 GFLOPS of FP16, how do you figure it has 1/3 the performance of a $5500 GPU? It's a useless comparison, anyway. A discrete GPU has a much different intended workload than this thing.

It makes more sense to compare this to NVIDIA's Jetson line, even though the form factor and use case are still different. The Jetson is meant for others to embed into their devices or to be used as a development platform. The Jetson can be used with a lot more than just Caffe, and can handle a lot more tasks than just accelerating CNNs. It can handle full CUDA and graphics workloads. It comes with Wi-Fi, video encode/decode blocks, and I think an ISP. It can't just be plugged into a USB port, though. But it's a much better comparison than to a discrete GPU because it is a device for computing at the edge. The Tegra X2 in the Jetson TX2 module has 874 GFLOPS of FP16 at 7.5 W. That's 8.7 times the performance at 7.5 times the power draw. The Jetson TX2 has more and faster memory, but costs 5 times as much.
NVIDIA is open sourcing a deep learning accelerator though, and I wouldn't be surprised if someone came out with a product using it that is meant to compete with this Movidius stick.
ddriver - Thursday, July 20, 2017 - link
So the TX2 has 8.7x the perf at 7.5x the power draw, 8x the memory, and 5x the price. Gee, Intel offerings are really slipping down in value.
Yojimbo - Thursday, July 20, 2017 - link
That 7.5 W figure for the TX2 is for the module, I believe, not the SoC. Maybe it's better to compare the 7.5 W with the 2.5 W number of the stick. But the two products are probably not really competitors to each other.

I only made the comparison because someone tried to make a comparison to some unnamed "$5500 GPU". I figure that must be the P40, with its 250 W TDP. But again, this Intel stick has a 2.5 W power envelope, not a 1 W power envelope. So it would only be 1/100th the power draw, not 1/250th like he claimed. The P40 has 10,000 GFLOPS, whereas this stick has 100 GFLOPS, so I have no idea where he got the 33% performance number. The stick has 1/100th the performance at 1/100th the power draw of the P40, using peak throughput and power envelopes. But really, we need actual benchmarks to make such comparisons, not peak throughput and TDP numbers. The P40 is designed to inference with a batch size in the hundreds, however, and will only be efficient when doing so. It has 24 GB of memory. It's a silly comparison.
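For anyone who wants to redo this back-of-the-envelope math, here is a small Python sketch that uses only the peak spec-sheet figures quoted in this thread (100 GFLOPS / 2.5 W for the stick, 874 GFLOPS / 7.5 W for the TX2 module, 10,000 GFLOPS / 250 W for the P40); it says nothing about real benchmarked throughput.

devices = {
    # name: (peak FP16 GFLOPS, power envelope in watts) -- spec-sheet peaks only
    "Movidius NCS": (100.0, 2.5),
    "Jetson TX2 module": (874.0, 7.5),
    "Tesla P40": (10000.0, 250.0),
}

stick_gflops, stick_watts = devices["Movidius NCS"]

for name, (gflops, watts) in devices.items():
    print(f"{name}: {gflops / watts:.0f} GFLOPS/W, "
          f"{gflops / stick_gflops:.1f}x the stick's peak throughput "
          f"at {watts / stick_watts:.1f}x its power envelope")

Swap in 1 W for the stick instead of 2.5 W to reproduce the "1/250th the power" figure from the earlier comment.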
Yojimbo - Thursday, July 20, 2017 - link
It is possible, however, that the SoC on this Movidius stick is capable of inferencing a non-batched workload faster than the Tegra X2. GPUs need batching in order to take advantage of their parallelism and perform well with inferencing. The Tegra X2, with only 256 CUDA cores, needs a much smaller batch size than a discrete GPU, though.
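The batching point is easy to see even without a GPU. Here is a toy NumPy sketch (CPU-only, with made-up layer sizes, so a rough illustration rather than a benchmark) that pushes 256 inputs through one dense layer either one sample at a time or as a single batched matrix multiply:

import time
import numpy as np

# Toy comparison: 256 separate matrix-vector products vs one batched matmul
# through the same dense layer. Sizes are arbitrary; this is an illustration,
# not a GPU benchmark.
rng = np.random.default_rng(0)
batch, in_dim, out_dim = 256, 1024, 1024
inputs = rng.standard_normal((batch, in_dim), dtype=np.float32)
weights = rng.standard_normal((in_dim, out_dim), dtype=np.float32)

t0 = time.perf_counter()
one_at_a_time = np.vstack([x @ weights for x in inputs])  # "batch size 1"
t1 = time.perf_counter()
batched = inputs @ weights                                # one batched call
t2 = time.perf_counter()

assert np.allclose(one_at_a_time, batched, rtol=1e-3, atol=1e-3)
print(f"unbatched: {(t1 - t0) * 1e3:.1f} ms, batched: {(t2 - t1) * 1e3:.1f} ms")

The batched call does the same math in far fewer, larger operations; on hardware with thousands of parallel lanes that gap is much wider, which is why the P40 wants batch sizes in the hundreds while the TX2, and presumably this stick, can stay busy with much smaller batches.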
ddriver - Thursday, July 20, 2017 - link
*pulls arbitrary numbers out of ass*

Intel is scattering to generate as much revenue as possible via cheap silly projects like hypetane cache and compute sticks. Probably to avoid disappointing shareholders on the next quarter results.
ddriver - Friday, July 21, 2017 - link
"1watt $80 stick is about 33% of the performance of a $5500 GPU and consumes 1/250th the power. "Just to illustrate how silly this statement is, and how poor this product's value is, let's compare it to an arm single board computer.
The ODROID-XU4 comes with 2 GB of RAM and a GPU capable of 102 GFLOPS of compute. Aside from matching that performance, it can also run applications, graphics and whatnot. All this for 60 bucks.
Funny thing that Intel needed to make a $400 million acquisition to essentially get a purpose-specific 16-bit CPU. Maybe it has been so long since Intel made 16-bit chips that they no longer remember how to do it?
tipoo - Friday, July 21, 2017 - link
How is 100 GFLOPS 33% of the power of a $5500 GPU?
Kvaern1 - Thursday, July 20, 2017 - link
"That means targeting use cases where the latency of going to a server would be too great, a high-performance CPU too power hungry, or where privacy is a greater concern."tuxRoller - Thursday, July 20, 2017 - link
It's an ASIC. A GPGPU is a general-purpose processing unit. NVIDIA realizes this, which is why they've started to offer NN-specific compute areas with their latest offerings.

It's difficult to say if this is slower than an $80 GPU, but it's far faster for inference and can be plugged into any laptop.

Too bad it looks to be only useful for inference and only supports Caffe.
ddriver - Friday, July 21, 2017 - link
It is slower and less capable than a $30 ARM SoC. Basically all of the high-end ARM SoCs released in the last two years can do in excess of 100 GFLOPS, and can do a bunch of other stuff on top of computing deep learning.
tuxRoller - Saturday, July 22, 2017 - link
From the link to Tom's: "Movidius claimed that the Myriad 2 can hit up to 1,000 GFLOPS per watt[...]"
I await your demonstration proving your claims.
tuxRoller - Saturday, July 22, 2017 - link
Btw, this is an ASIC. It isn't intended to play Battlefield or let you watch Netflix. It's designed for one thing, and that's why it's efficient.
tipoo - Thursday, July 20, 2017 - link
Look at what happened to Bitcoin mining on GPUs vs ASICs for an idea of how this works. Silicon built *specifically for one application* will always be faster/cheaper/more efficient than something that can do a lot of things, like a CPU or GPU.

This is a fixed-function neural network accelerator: it does neural networks and that's it, without the other functions of a GPU.
Yojimbo - Thursday, July 20, 2017 - link
Yes, most edge inferencing will be done on either ARM or on an ASIC. It is far from clear, however, that other aspects of neural network computing, such as data center inferencing and training, will be done on ASICs. That is because there will probably be a lot of mixed workload usage of neural networks, e.g., neural networks used in conjunction with simulation or analytics. An ASIC for inferencing or training/inferencing a neural network will not be able to carry out the other workloads, unlike GPUs. Whatever power saving and performance are gained by having the ASIC run the neural network will be given back when transferring the data to some other processor to complete the workload.
Yojimbo - Thursday, July 20, 2017 - link
What it comes down to is that the GPU is already very efficient at the number crunching needed in neural networks. The inefficiency comes from memory access, or more precisely, data flow. However, to optimize the data flow for a CNN it seems one gives up doing much else other than training or inferencing CNNs.
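To put a rough number on that data-flow point, here is a small Python sketch estimating the arithmetic intensity (FLOPs per byte of weights and activations touched) of one made-up 3x3 convolution layer; the dimensions and the FP16 assumption are arbitrary, chosen only for illustration.

# Back-of-the-envelope arithmetic intensity of one 3x3 convolution layer.
# All dimensions are made up for illustration; values assumed to be FP16.
H, W = 56, 56              # output feature-map height and width
C_in, C_out = 128, 128     # input and output channels
K = 3                      # kernel size
bytes_per_value = 2        # FP16

# Each output value needs K*K*C_in multiply-accumulates (2 FLOPs each).
flops = 2 * H * W * C_out * K * K * C_in

# Minimum data to touch if every weight and activation is read/written once.
weight_bytes = K * K * C_in * C_out * bytes_per_value
activation_bytes = (H * W * C_in + H * W * C_out) * bytes_per_value
total_bytes = weight_bytes + activation_bytes

print(f"{flops / 1e9:.2f} GFLOPs vs {total_bytes / 1e6:.2f} MB touched "
      f"-> ~{flops / total_bytes:.0f} FLOPs per byte with full reuse")

A ratio in the hundreds of FLOPs per byte is only reachable if the dataflow actually keeps weights and partial sums in on-chip SRAM and reuses them; spill that traffic to external memory repeatedly and the energy goes to data movement rather than math, which is exactly the inefficiency being traded away here.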
"As a lay person someone explain how a 1W maybe 2.5W tiny USB device contributes anything to AI acceleration."The short answer is that this is a specialized ASIC to run neural networks. So at 100 GLOPS, it has enough performance to run basic networks on its own, without the help of a server. And since it's 1W, this means it can be deployed in certain types of mobile devices (e.g. drones), along with low-powered devices such as cameras. This is the basic concept behind inference at the edge.
Meanwhile it also doubles as a developer kit. So developers using this stick or other Movidius VPUs can program and test their neural nets against the NCS, and then deploy that code on the final product. It's somewhat similar to NVIDIA's Jetson in that it allows developers to either use the hardware as-is, or use it as a dev kit should they install the ASIC on a custom design.
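As a rough picture of that developer workflow: you compile a trained Caffe model into a Movidius graph file offline, then at runtime open the stick over USB, load the graph, and push FP16 tensors through it. The sketch below is from memory of the NCSDK v1 Python bindings, so treat the module path, function names, and the "graph" file name as illustrative rather than authoritative.

import numpy as np
from mvnc import mvncapi as mvnc  # NCSDK v1 Python bindings (names from memory)

# Find and open the first Neural Compute Stick on the USB bus.
devices = mvnc.EnumerateDevices()
device = mvnc.Device(devices[0])
device.OpenDevice()

# Load a Caffe network that was pre-compiled into a Movidius graph blob
# ("graph" is just a placeholder file name here).
with open("graph", "rb") as f:
    graph = device.AllocateGraph(f.read())

# Run a single FP16 input through the network and fetch the result.
image = np.zeros((224, 224, 3), dtype=np.float16)  # placeholder input
graph.LoadTensor(image, "user object")
output, _ = graph.GetResult()
print(output.shape)

# Clean up.
graph.DeallocateGraph()
device.CloseDevice()

The same compiled graph can then be driven from whatever embedded host the Myriad 2 ends up soldered onto, which is the "dev kit now, integrated ASIC later" flow being described.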
ddriver - Friday, July 21, 2017 - link
But which USB 3-capable system will lack the juice to run it itself? And considering USB latencies, it may well turn out to be a decelerator.

Basically, Intel must have felt that they don't have enough "stick" failures and need more.
tipoo - Friday, July 21, 2017 - link
It's a developer device; end products will likely have the ASIC integrated. The drone was a good example.
tbaier - Friday, July 21, 2017 - link
This stick is really interesting. I wonder what the deep learning TensorFlow performance is compared to the NVIDIA DGX-1 https://www.cadnetwork.de/de/produkte/deep-learnin... or other GPU servers.

We use two GPU systems with 4x GTX 1080 Ti in each server, with Ubuntu and TensorFlow, and get nearly the same performance as the DGX-1 at about one-fourth of the price.
psakamoori - Friday, July 21, 2017 - link
MTCNN is for face detection, not for recognition.
Farhan Jami - Saturday, July 22, 2017 - link
I wonder how this would affect those who are leveraging AI for business? I feel it can greatly impact the advertising and media industry if utilized properly.
https://mediamastersinc.com/blog/artificial-intell...
sriram87 - Monday, August 14, 2017 - link
It is good for computer vision, not general AI, if you know what I mean.