Last month, Qualcomm unveiled the Snapdragon 855 mobile platform. The Snapdragon 855 is the mobile platform that will power most Android flagship smartphones in 2019. Qualcomm has made substantial year-on-year improvements with their next-generation mobile platform. The Snapdragon 855 is built on a 7nm manufacturing process and offers an impressive 45% jump in CPU performance over the Snapdragon 845. The improvements in computation across the board allow Qualcomm to boast excellent AI performance on the new Snapdragon 855. There’s a lot of information to unpack here, and we’ve done our best to show how Qualcomm has improved performance and AI on the Snapdragon 855. However, we still had questions of our own after the product unveiling, so we sat down with Travis Lanier, Senior Director of Product Management at Qualcomm, to talk about the Kryo 485 CPU and AI on Qualcomm’s new mobile platform.
Mario Serrafero: “45% [jump], it’s like the biggest ever. Let’s unwrap that. We have the A76 base, 7nm—those are big contributors. It seems that ever since you guys moved away from custom cores, some publications and audiences haven’t had much of a clue as to what the Built on ARM license entails in terms of what it can allow you to do. You’ve been pretty secretive about what that entails [too]. Now, on stage—for one of the first times, at least beyond Q&As—you’ve shown what some of the improvements were, and that’s cool. So we were wondering whether you would like to expand on how Qualcomm tuned the Kryo 485 to squeeze more [out] of ARM’s base, whether that’s expanding on the stuff you’ve exposed over there or something that you haven’t presented.”
Travis Lanier: “So I can’t really say too much more than what was in my slides. Maybe at a future date we can sit down and get some of the experts who actually did the work; I know the high-level talking points. But as you know, the A76 is already a high-end design—it’s quite good. And it’s one of the reasons why, when we saw ARM’s roadmap, I thought, okay, maybe we should work with these guys more closely, because it looked very strong. And just going back to your comment on customization versus ARM: so, there are all these things that you can do. And if you’re doing something and it needs to have differentiation, you can do something 100% [custom] or partner with them. And [as in] previous years, we were a little bit more about integration. So buses, how we hooked up to the system, security features that we put into the CPUs, cache configurations. Now that the engagements have been going longer, we were able to do a deeper customization on this one. And that’s how we were able to put some of these things in there, like larger [out-of-order] execution windows, right, so you have more instructions in flight. Data prefetching is actually one of the areas where there’s the most innovation going on in the microprocessor industry right now. A lot of the techniques for a lot of these things are pretty similar—everyone uses a TAGE branch predictor these days, it’s just how big you provision it; people know how to do out-of-order, and forwarding, and all that stuff for bigger caches. But prefetching—there’s still a lot of, it’s one of those dark-art type things. So there’s still a lot of innovation going into that area. So that’s something we felt we could help with.
And then just because we feel that we generally do a better job with… usually we can implement a design faster than others can integrate a process node. And so when we put some of these things in there—like when you go more out-of-order, it’s more stress on your design, right? It’s not free to add all these execution things in there. So, to be able to do that, and not take a hit on your fmax—yeah, that’s part of the engagement we have with ARM, like how do you pull that off?”
Mario Serrafero: “Just out of curiosity, in the presentation, you had talked about efficiency improvements coming from the pre-fetching, were you talking about power efficiency, performance improvements, a bit of both?”
Travis Lanier: “All the above. So, by its nature, we’re doing pre-fetching—you’ve pulled things in the cache. So when you have the cache not doing as many memory accesses, now there’s a flip side to pre-fetching: If you do too much pre-fetching, you are [using] more memory because, you know, [you’re] doing too much speculative prefetching, but as far as, if you have stuff in and you’re pulling the right stuff, then you’re not going out to memory to pull it in there. So if you have a more efficient prefetcher, you’re saving power and you’re increasing performance.”
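To give readers a concrete feel for Lanier’s point, here is a toy sketch of a stride prefetcher. It is purely illustrative—the cache model, stride detector, and numbers are all invented for this example and bear no relation to the actual prefetchers in the Kryo 485:

```python
# Illustrative only (not Qualcomm's design): a toy stride prefetcher.
# A simple cache model counts accesses that miss and must go to memory;
# the prefetcher watches the address stream, detects a repeated stride,
# and pulls the predicted next line into the cache ahead of time.

def run(addresses, prefetch=True):
    cache = set()
    misses = 0
    last_addr, last_stride = None, None
    for addr in addresses:
        if addr not in cache:
            misses += 1              # had to go out to memory
            cache.add(addr)
        if last_addr is not None:
            stride = addr - last_addr
            # Two accesses with the same stride -> predict the next one.
            if prefetch and stride == last_stride:
                cache.add(addr + stride)   # speculative fill
            last_stride = stride
        last_addr = addr
    return misses

# A regular streaming pattern, e.g. walking an array 64 bytes at a time.
stream = list(range(0, 640, 64))   # 10 accesses, constant stride of 64

print(run(stream, prefetch=False))  # 10 misses: every access hits memory
print(run(stream, prefetch=True))   # 3 misses: the rest were prefetched
```

The trade-off Lanier describes is visible in the `speculative fill` line: a bad predictor would add addresses nobody asks for, wasting memory bandwidth and power, which is exactly why accurate prefetching saves power and raises performance at the same time.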
Mario Serrafero: “Okay, cool, yeah. Yeah, I didn’t expect that you would be able to expand much more beyond that but, it’s interesting that if you say that now you guys are customizing more and maybe you’re able to share more in the future then I’ll keep an eye open for that. So the other kind of head turner, at least among people I’m surrounded by, is the prime core. So we were expecting kind of more flexible, cluster arrangements for a couple years now with [the] inclusion of DynamIQ and that we expected other companies are moving away from [the] 4+4 arrangement. So two questions: What was the motive behind the prime core? How is the prime core benefiting the user experience, because our readers would like to know why there’s just a lone core over there, and also why it’s not quite a lone core? Wouldn’t sharing the power plane with the performance cluster kind of mitigate some of the utility that you could obtain if you were using DynamIQ and kind of sitting [it] on its own?”
Travis Lanier: “So let’s talk about different clocks and different voltage planes first. Every time you add a clock, and every time you add a voltage, it costs money. There’s a limit to the number of pins you put on the package, there are more PLLs you have to have for different clocks, and there’s just increased complexity. So there’s a trade-off to doing things. We went kind of extreme at one point; we had four different domains on four different clocks, so we had experience with that, and it was expensive. Once you start to go big.LITTLE, you have the small cores on [the] small cluster, and they don’t quite need that same granularity, so to speak, of a separate clock between the small cores. Yes, it’s kind of up in the air what you do with those. So when you have a big.LITTLE system, then conversely you have these big cores. Well, okay, do you put each of those on its own clock? Well, you’re not running on all of them all [the] time—if you’re actually in a low enough scenario, it will run on a small core anyway. So really, kind of two of them is good enough there.
Now we have three cores that run at a slightly lower frequency, but they’re also more power efficient. And so, whenever you—I don’t know how much you know about implementation of cores—but whenever you start to hit the top of the frequency in the implementations of these cores, there’s a trade-off in power; things start to get exponential in those last few megahertz or gigahertz that you have. Yeah, and so I talked a moment ago about how all games are starting to get multithreaded—like, suddenly, if you look back, there were a couple of games not too long ago that were just using one thread. But it’s weird how quickly the industry can change. In the past year, year and a half, they’ve really started putting all these games into… I’ve gotten excited over these high-fidelity games. And so while a lot of stuff even six months to a year ago, before—it’s actually flipped over all of China. In China, I’d hear, “I don’t really care about big cores, give me eight of anything, give me eight of the smallest cores so I can have eight cores.” They’ve changed because they want these games, and these games require big cores. And now we’re getting feedback from partners that “no, we actually want four big cores,” because of all the advanced games that are coming out. And they’re going to use all those cores.
So when you game, you don’t game for 30 seconds, or 5 minutes—you game for longer. So it makes sense: we have these three other cores for most of your multithreaded big-core use cases, and they should have a little bit more power efficiency. It kind of balances out: you have this higher-performance core when you need it for some of these things, in some of these sustained cases where they also use big cores, and you have this more power-efficient solution to pair with it. That’s kind of the thinking—it’s a little bit of an unusual symmetry. But hopefully that answers why [there’s a] prime core, why don’t you have separate clocks, and why don’t you have separate voltages? And so I think I touched on all of those.”
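Lanier’s remark that power “starts to get exponential” in the last few hundred megahertz can be sketched with the classic dynamic-power relation P ≈ C·V²·f, where voltage must rise along with frequency. The coefficients below are invented for illustration and are not Qualcomm data:

```python
# Toy model (invented numbers, not Qualcomm data): dynamic CPU power
# scales as C * V^2 * f, and voltage must rise with frequency, so power
# grows much faster than performance as a core approaches its fmax.

def dynamic_power(freq_ghz, v_min=0.6, v_slope=0.25, capacitance=1.0):
    """Approximate power, assuming voltage rises linearly with frequency."""
    voltage = v_min + v_slope * freq_ghz
    return capacitance * voltage ** 2 * freq_ghz

for f in (1.0, 2.0, 2.5, 2.8):
    print(f"{f:.1f} GHz -> {dynamic_power(f):.2f} power units")
```

In this toy model, going from 2.5 GHz to 2.8 GHz is a 12% frequency gain for roughly a 26% power increase—which is why it can make sense to build one “prime” core that pays that cost only when needed, alongside more efficient cores clocked below the knee of the curve.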
Mario Serrafero: “Now, heterogeneous compute. That’s what Qualcomm has been stressing since the move away from the old branding to the mobile platform, and that kind of [a] descriptor, and also aggregating blocks from describing certain performance metrics like AI. How has that evolution been in switching to a more heterogeneous compute approach? Anywhere from design to execution to marketing, or whatever you can touch upon.”
Travis Lanier: “It goes a little bit back and forth. But in the end, you have to have these engines, because the name of the game in mobile is power efficiency. Now, you sometimes see it move back to generalization every once in a while. If you go back to the original—even before smartphones—feature phones had multimedia and camera capabilities to some extent, and so they had all these little dedicated things, because you couldn’t do it [on the CPU]. If you go back to the phones that were built on an ARM 9 or an ARM 7, they all had a hardware acceleration widget for everything.
But to give you an example where something went general and now they’re asking for hardware again, it would be JPEG. There used to be a JPEG accelerator. The CPU eventually got good enough and power efficient enough, and JPEGs kind of stayed the same size, that, hey, you know what, we’ll just go ahead and do it on the CPU [as] it’s just easier. Now, as pictures get bigger and bigger, all of a sudden people are going, you know, actually, I’d like these really gigantic photo file sizes to be accelerated. The CPUs [are] kind of either not fast enough or burning too much power. It’s just that suddenly there’s interest in potentially having JPEG accelerators again. So it isn’t always a straight line how things go; then you have to look at what’s going on right now with Moore’s Law. Everybody keeps talking about, hey, it may not be dead, but it’s slowing down a little bit, right? So if you’re not getting that power boost, or performance boost, from each next node, how do you continue to put more functionality on the phone if you don’t have that overhead? You could just put it on the CPU. But if you don’t have more headroom on your CPU, how do you accelerate these things? Well, the answer is, you put in all these specialized cores and do things more efficiently. And so it’s that natural tension.
You’ll see people being forced to do these things for common functions, as maybe not everybody’s going to be on the bleeding edge. But we’re certainly going to try to stay there as long as possible, though we can’t force the fabs to move to the next node if it’s not there. So that’s why you have to focus on continuous innovation in these architectures, to continue to get better performance and power efficiency. So that’s our strength and our background.”
Mario Serrafero: “Even though there’s been this move to heterogeneous compute, on Qualcomm’s part, many audiences and certainly many publications, certainly many enthusiasts, surprisingly, who you think would know better, they still think of, consider, and evaluate, the blocks as separate entities. They still focus on, “I want to see the CPU numbers because I care about that.” They want to see GPU numbers because they like games, and so on and so forth. They don’t consider them as interconnected parts of one integral product. How do you think Qualcomm has, is, and will shatter that paradigm, as competitors keep focusing on that specific block-by-block kind of improvement in marketing? Specifically, [we’ll] move on to the neural networks, the neural engine stuff later.”
Travis Lanier: “I hope I touched on some of that today. We focus on, for example, sustained gaming, so maybe you score well on all the gaming benchmarks. People get obsessed about that. But really, what matters is, if you’re playing your game, does your frames per second stay consistently where you want it to be at the highest point for these things? I think people put way too much weight into a number for one of these blocks. It’s so hard, and I understand that desire to give me one number that tells me what the best is. It’s just so convenient, especially in AI right now, it’s just nuts. Even with CPU benchmarks, what does a CPU benchmark measure? They all measure different things. Take any of the benchmarks, like GeekBench has a bunch of sub components. Do you see anybody ever tear apart and look into which one of these sub components is most relevant to what I’m actually doing?”
Mario Serrafero: “Sometimes, we do.”
Travis Lanier: “Maybe you guys do. You guys are like an outlier. But maybe one CPU is better at this, and maybe one’s better at another. Same thing with SPEC: people will highlight the one SPEC [number]; well, okay, there are multiple different workloads within that. And they’re pretty rigorous things, but even SPEC, which we actually use for developing CPUs—if you look at the actual workloads, are they really relevant? They’re great for comparing workstation workloads, but am I really doing molecular modeling on my phone? No. But again, that’s my point: most of these benchmarks are useful in some way, but you have to understand the context of what [it’s] for and how you get there. And so it’s really hard to distill things down to one number.
And I see this especially—I’m pivoting here a little bit—but I see this with AI right now; it’s bonkers. I see that there are a couple of different things that [try to] get one number for AI. And as much as I was talking about CPU, where you have all these different workloads and you’re trying to get one number—holy moly, AI. There are so many different neural networks, and so many different workloads. Are you running it in floating point, are you running it in int, running it in 8-bit or 16-bit precision? And so what’s happened is, I see people try to create these things and, well, we chose this workload, and we did it in floating point, and we’re going to weight 50% of our tests on this one network and two other tests, and we’ll weight them like this. Okay, does anybody actually even use that particular workload on that net? Any real applications? AI is fascinating because it’s moving so fast. Anything I tell you will probably be wrong in a month or two. So that’s what’s also cool about it, because it’s changing so much.
But the biggest thing in AI isn’t the hardware; it’s the software. Because everybody that’s using it is like, I’m using this neural net. And so basically, there are all these multipliers on there. Have you optimized that particular neural network? And so did you optimize the one for the benchmark, or do you optimize the one so—some people will say, you know what, I’ve created a benchmark that measures super resolution; it’s a benchmark on a super resolution AI. Well, they use this network, and they may have done it in floating point. But with every partner we engage, we’ve either managed to do it in 16 bit and/or 8 bit, and using a different network. So does that mean we’re not good at super resolution, because this work doesn’t match up with that? So my only point is that AI benchmark[ing] is really difficult. You think CPU and GPU is difficult? AI is just crazy.”
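To make the 8-bit-versus-floating-point point concrete: quantized inference maps weights and activations to small integers with a scale factor, runs the multiply-accumulates entirely in integer math, and rescales once at the end. The sketch below is a minimal, invented example of symmetric linear quantization—not any partner’s or benchmark’s actual pipeline:

```python
# Illustrative only: why 8-bit integer inference can stand in for
# floating point. Values are mapped to int8 with a per-tensor scale;
# the dot product runs in integers and is rescaled at the end.

def quantize(values, bits=8):
    """Symmetric linear quantization to signed integers."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8
    scale = max(abs(v) for v in values) / qmax
    q = [round(v / scale) for v in values]
    return q, scale

weights = [0.12, -0.5, 0.33, 0.07]
activations = [1.0, 0.25, -0.75, 0.5]

qw, sw = quantize(weights)
qa, sa = quantize(activations)

# Integer multiply-accumulate, then one rescale back to real values.
int_acc = sum(w * a for w, a in zip(qw, qa))
approx = int_acc * sw * sa

exact = sum(w * a for w, a in zip(weights, activations))
print(f"float: {exact:.4f}  int8: {approx:.4f}")
assert abs(exact - approx) < 0.01   # small quantization error
```

The integer path trades a tiny accuracy loss for far cheaper arithmetic—which is also why, as Lanier notes, two results on “the same” network in different precisions are not directly comparable.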
Mario Serrafero: “Yeah, there’s too many types of networks, too many parameterizations—different parameterization leads to different impacts, how it’s computed.”
Travis Lanier: “It’ll keep reviewers busy.”
Mario Serrafero: “But if you want to measure the whole breadth of things, well, it’s a lot more difficult. But yeah, nobody’s doing it.”
Mishaal Rahman: “That’s why you guys are focusing more on the use cases.”
Travis Lanier: “I think in the end, once you show use cases, that’s how good your AI is right now. It comes down to the software; I think it will mature a little bit more in a few years. But right now there’s just so much software work that has to be done, and then it changes—like, okay, well, this network’s hot, and then next year, “Oh, no, we found a new network that’s more efficient at all these things,” so then you have to go redo the software. It’s pretty crazy.”
Mario Serrafero: “Speaking of NN, you kind of did the transition for me—a less awkward transition, thankfully. Moving on to the Hexagon. This is kind of one of the components that is least understood, I would say, by consumers, even most enthusiasts, certainly my colleagues. You know, especially given that it was not introduced as an AI block; and with the whole digital signal processing idea, when you introduce something, that original idea kind of sticks. So if you’re going to do something—okay, it’s a neural thing, with the neural, neural, neural brain intelligence—it kind of sticks with people. They have the AI machine learning neural, neural, neural labels for other solutions. So we want to maybe give you a chance to explain the evolution of the Hexagon DSP, why you haven’t moved away from those kinds of engineering-sound[ing] names like Hexagon DSP, vector extensions, and so on that are not as marketing friendly. But yeah, just maybe a quick rundown of how it’s been for you at the forefront of DSP, to see it go from the imaging workload beginnings to the brand new tensor accelerator.”
Travis Lanier: “It’s actually an interesting point, because some of our competitors actually have something they’ll call a neural engine or a neural accelerator—it’s actually a DSP, it’s the same thing. So I guess the name is important, but you touched on an important point: in all honesty, when we put this out there it was for imaging, and we just happened to support 8 bit. And I remember we were presenting at Hot Chips, and Pete Warden of Google kind of tracked us down and was like, “Hey, so you guys support 8 bit, huh?” Yeah, we do. And so from there, we immediately went out and, like, hey, we’ve got all [these] projects going on. That’s when we went and ported TensorFlow to Hexagon, because it’s like, hey, we’ve got this 8-bit-capable vector processor out there to do that, and it was on our Hexagon DSP. If I had to do it again, I’d probably call it the Hexagon Neural Signal Processor. And we still have the other DSPs—we do have scalar DSPs, and those are DSPs in the truest sense. And then we call this kind of a vector DSP. Maybe we should rename it; maybe we should call it a neural signal processor, because we’re probably not giving ourselves as much credit as we should for this. Because, like I said, some people just have vector DSPs and they’re calling them whatever, and they haven’t published what it is. Did I answer your question?”
Mario Serrafero: “So, yeah—that probably covers most of it.”
Travis Lanier: “What was the second question?”
Mario Serrafero: “Just how you saw kind of that development internally. What’s it been like: the experience, the difficulties, the challenges, whatever you want to tell us about? How [have] you seen the evolution from the image processing beginnings to the tensor accelerator?”
Travis Lanier: “It’s been a little frustrating, because the thing that makes me cringe is when some of the press raise their hand and go, “Qualcomm, you’re so behind! Why didn’t you—when are you going to get a dedicated neural signal processor?” and I just want to pound my head. I’m like, we were the first ones to have a vector processor! But that said, we added this, and there will probably continue to be more things as we learn more about AI. So we did add this other thing, and yeah, this one only does AI—it doesn’t do image processing—as part of the Hexagon complex. So while we still call it the Hexagon DSP, we’re calling the whole complex the Hexagon processor, [to] try and get an umbrella name for the whole Hexagon thing now. We did add stuff which is actually [more] direct compute—I shouldn’t say direct compute—like it has this automatic management of how you do this higher-order mapping of where you’re multiplying matrices.”
Mario Serrafero: “Tensors are actually pretty hard for me to wrap my head around. It’s just like they kind of wrap around themselves too, anyway.”
Travis Lanier: “Yeah—like, I took my linear algebra classes in college, and I was like, man, “I hope I never have to do that again!” And they came back with a vengeance. I was like, ‘Oh man, differential equations and linear algebra are back with a vengeance!’”
Mario Serrafero: “No, yeah—I took multi[variable calculus], right, and linear algebra, and after linear algebra, I’m like, “Okay, now I have the proof-based courses of upper-level math, I don’t have to think about this stuff again.” And then I’m taking ML, with chained derivatives and backpropagation and a bunch of matrix calculus—just when you don’t like the statistics part—and it’s fine in the sense that I’m like, “Well, I guess it’s back.” But it’s actually more fun than proofs for me; at least now you can see faster whether it works, which is nice. Anyway, we’ll get back to the math bit. One of the things that stuck with me—that you mentioned last year, and it was actually one of the reasons why I was inspired to take AI courses now—is that you said: “Vector math is at the foundation of deep learning.” You said this last year, and it stuck with me.
I feel a lot of my colleagues haven’t caught up on this. They still think there’s this mystifying aspect to the NPU, when it’s just a bunch of matrix multiplications, dot products, nonlinearity functions, convolutions, [and] so on. And I don’t personally think that kind of neural processing engine name helps, but that’s the thing, right? How much of it is either not being explained or obfuscated—the underlying math shoved under the naming conventions—and what could perhaps be done? I don’t know if you’ve thought about this. [What] can be done to inform people about how this works? How it’s not just—why, for example, can the DSP do what the other new neural processing engines can do? I mean, it’s just math, but it doesn’t seem that consumers, readers, some journalists, understand that. What can—I’m not saying it’s Qualcomm’s responsibility—but what do you think could be done differently? It’s probably my responsibility.”
Travis Lanier: “Honestly, I’m starting to give up. Maybe we just have to name things “neural.” We just talked about how linear algebra and differential equations made our heads spin when we started looking at this stuff, and so when you start trying to explain that to people—like when you start doing the regression analysis, you look at the equations and stuff—peoples’ heads explode. You can teach most people basic programming, but when you start teaching them how the backpropagation equations work, they’re going to look at that and their heads are going to explode. So yeah, fun stuff. They don’t want to see partial derivatives…”
Mario Serrafero: “Chains of partial derivatives, not across scalars but across vectors and including nonlinear functions.”
Travis Lanier: “Good luck with that! Yeah, so it’s hard, and I don’t know that most people do want to know about that. But I try: I put in a little thing like, “Hey, all we’re doing here is vector math. We have a vector processor.” And I think people look at that and are like, “Okay, but man, I really want a neural accelerator.” “Tensor” is still mathematical, but I think people may associate it a bit more with AI processing.”
Mario Serrafero: “Could be like bridging the gap, the semantic gap.”
Travis Lanier: “In the end, I think it comes down to: we probably just have to come up with a different name.”
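Lanier’s core claim—“all we’re doing here is vector math”—is easy to demonstrate. A fully connected neural network layer is nothing more than dot products plus an elementwise nonlinearity; the toy layer below, with made-up weights, is the fundamental operation any “AI engine” accelerates, just far wider and usually in lower precision:

```python
# A neural network layer really is just vector math: a matrix-vector
# multiply (a row of dot products) followed by an elementwise
# nonlinearity. Weights and inputs here are invented for illustration.

def relu(x):
    return max(0.0, x)

def dense_layer(weights, bias, inputs):
    """One fully connected layer: out = relu(W @ x + b)."""
    out = []
    for row, b in zip(weights, bias):
        acc = sum(w * x for w, x in zip(row, inputs))  # dot product
        out.append(relu(acc + b))
    return out

W = [[0.5, -0.2],
     [1.0,  0.8]]
b = [0.1, -0.3]
x = [2.0, 1.0]

print(dense_layer(W, b, x))  # [0.9, 2.5]
```

Nothing here is exotic: a vector DSP with wide multiply-accumulate units runs exactly this pattern, which is why Hexagon could take on neural workloads long before anything was branded “neural.”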
All graphics in this article are sourced from Travis Lanier’s presentation at the Snapdragon Tech Summit. You can view the presentation slides here.