Introducing Project Magna – Artificial Intelligence and Machine Learning for vSphere self-driving operations
At VMworld 2018, Pat Gelsinger made reference to a project that was looking to use Artificial Intelligence and Machine Learning to create self-driving operations for the vSphere stack. At VMworld 2019 last week, we were given a tech preview of the first iteration of this effort, called Project Magna. A number of VMworld breakout sessions were dedicated to this effort, and I will reference them near the end of this post. However, this first tech preview is focused solely on hyperconverged infrastructure (HCI), namely vSAN.
It is an interesting choice to start with, since we position vSAN as a platform for all workloads. And in many cases, workloads will come and go. How do you stay optimized when workloads keep coming and going? Typically, it involves a lot of manual intervention: tweaking knobs and settings, then monitoring to see if it made any difference. If it did, well great. If it didn’t, you most likely roll back the previous set of tweaks and try some new ones. One of the disadvantages of this approach is that sometimes you make things worse. And remember that age-old guidance I learnt working in tech support: change one thing at a time! So this is a very time-consuming and ongoing process.
Since this first phase is designed for vSAN, customers will be asked what their Key Performance Indicator (KPI) is for vSAN performance: is it read optimization, write optimization, or a balance of the two? Project Magna then compares your environment to the industry average, and if your system is below average, it will begin making changes to improve your system’s performance. There is no manual interaction by the user, other than selecting the appropriate KPI. This is a screenshot from one of the Project Magna presentations (HCI1620BU by Adam Hawley) which gives a good overview of the constituent parts.
I do want to re-emphasize that Project Magna is about the whole Software Defined Data Center (SDDC), not just vSAN. The longer term goal is to have it work for compute, storage, network, and security, as you can see from the above diagram.
How Project Magna works
Project Magna is actually a set of cloud services. It integrates with vRealize Operations (vROps) for enabling/disabling the service, as well as for visualizing the impact it has had on the system. Once enabled, information about your system is sent to a VMware Data Lake, which compares analytics from your system against everyone else’s and determines whether your system is above, below, or at the industry standard for performance. The Magna Cloud Services, which include the Magna AI/ML Engine and services such as self-tuning, self-healing, etc., then use vROps to display what performance was like before Project Magna was enabled, and again afterwards. Customers can then see at a glance whether their KPIs have improved (better read throughput and lower read latency, for example). The Magna Cloud Service is responsible for initiating any actions (tweaking of knobs) on the cluster to achieve those better KPIs, i.e. any underlying knobs that need to be tweaked will be adjusted automatically.
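To make that flow a little more concrete, here is a minimal sketch (in Python) of what a "compare against the fleet" check might look like. To be clear, this is purely illustrative: the ClusterMetrics fields and the below_fleet_average logic are my own stand-ins, since the actual Magna data lake interfaces have not been published.

```python
# Hypothetical sketch: deciding whether a cluster is below the fleet
# average for the read KPI. Names and fields are illustrative only.
from dataclasses import dataclass
from statistics import mean


@dataclass
class ClusterMetrics:
    read_latency_ms: float
    read_throughput_mbps: float


def below_fleet_average(cluster: ClusterMetrics,
                        fleet: list[ClusterMetrics]) -> bool:
    """Compare one cluster's read KPI against the fleet-wide average."""
    avg_latency = mean(m.read_latency_ms for m in fleet)
    avg_throughput = mean(m.read_throughput_mbps for m in fleet)
    # Below average if latency is higher or throughput is lower than the fleet.
    return (cluster.read_latency_ms > avg_latency or
            cluster.read_throughput_mbps < avg_throughput)


fleet = [ClusterMetrics(2.1, 850.0), ClusterMetrics(1.8, 910.0)]
mine = ClusterMetrics(3.4, 620.0)
if below_fleet_average(mine, fleet):
    print("Cluster is below the fleet average -- start tuning")
```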
In the VMworld session HCI1207BU, vSAN PM Junchi Zhang showed us a preview of how a proposed integration of IO Insight for vSAN might link directly into Project Magna. IO Insight highlights to the end user that the system may benefit from some optimizations. On clicking the Magna link, the end user could immediately see that read performance was below average.
The suggested action above was to enable Magna and have it attempt to improve performance. You then have to pick the KPI against which Magna should operate. Selecting the KPI is very simple: here is a screenshot from another presentation that I watched (HCI1650BU) which shows how to do it, and what the options mean.
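For illustration, you could imagine the three KPI choices as nothing more than weightings that blend read and write sub-scores into a single performance index. The weights and names below are assumptions on my part, not the actual Magna scoring model.

```python
# Illustrative only: mapping the three KPI choices to a single score
# that a tuner would try to maximize. Weights are made-up values.
KPI_WEIGHTS = {
    "read":     {"read": 1.0, "write": 0.0},
    "write":    {"read": 0.0, "write": 1.0},
    "balanced": {"read": 0.5, "write": 0.5},
}


def performance_index(kpi: str, read_score: float, write_score: float) -> float:
    """Blend read/write sub-scores according to the selected KPI."""
    w = KPI_WEIGHTS[kpi]
    return w["read"] * read_score + w["write"] * write_score


print(performance_index("balanced", read_score=0.72, write_score=0.58))  # 0.65
```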
So how often will it take an action? At the moment, it has not yet been decided how often it will check the performance of the system. But when it does check, and if it thinks it can improve on the current performance, it will carry out some actions/tweaks on the system. When the results are measured, and if performance has improved, it will increase its rewards. If performance does not improve, it will not reward itself. This is the basis of what is referred to as reinforcement learning (RL). RL is where the system adjusts itself based on rewards. The example given in a number of places is that RL is like a video game, where the objective is to get the highest score or extend your lifespan. To do so, you learn which actions to carry out, and which actions to avoid, each time you play. Eventually you get better and better at it, and build rewards in the process. This is what underpins reinforcement learning.
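If you want a feel for what that reward loop looks like in code, here is a minimal Python sketch of an epsilon-greedy agent: it picks a "knob setting", observes whether a (simulated) KPI improved, and gradually favours the settings that earn the most reward. The action names and the measure_kpi() function are entirely made up for the example; this shows the general RL idea, not actual Magna code.

```python
# Minimal reinforcement-learning sketch: epsilon-greedy action selection
# with an incremental-mean value estimate per action. Hypothetical knobs.
import random

ACTIONS = ["cache_policy_a", "cache_policy_b", "queue_depth_up"]
value = {a: 0.0 for a in ACTIONS}   # estimated reward per action
counts = {a: 0 for a in ACTIONS}    # times each action was tried
EPSILON = 0.2                       # fraction of the time we explore


def measure_kpi(action: str) -> float:
    """Stand-in for measuring e.g. read throughput after a tweak."""
    base = {"cache_policy_a": 0.5, "cache_policy_b": 0.8, "queue_depth_up": 0.6}
    return base[action] + random.gauss(0, 0.1)


baseline = 0.55
for step in range(200):
    # Explore occasionally; otherwise exploit the best-known action.
    if random.random() < EPSILON:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=value.get)
    reward = measure_kpi(action) - baseline   # positive only if the KPI improved
    counts[action] += 1
    # Incremental mean update of the action's estimated value.
    value[action] += (reward - value[action]) / counts[action]

print(max(ACTIONS, key=value.get))  # most likely prints "cache_policy_b"
```

The "reward" here is simply the KPI delta against a baseline, which mirrors the description above: actions that improve the KPI earn reward, actions that do not are gradually avoided.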
The screenshot below shows Project Magna in action. Each of the dots on the performance index graph indicates a tweak or some action that Project Magna carried out on the system to improve the read performance of the workload on vSAN. And now read performance has improved significantly, without any manual user intervention.
One final point to highlight: there are built-in safeguards to ensure that these actions do not do any harm to the system. Thus, Project Magna actions should never make performance worse than it already is.
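You could imagine such a safeguard as a simple wrapper that measures the KPI before and after each tweak, and rolls the tweak back if there is no gain. This is purely my own sketch, assuming every tweak can be applied and reverted; apply_tweak, revert_tweak and measure_kpi are hypothetical stand-ins, not real Magna or vSAN APIs.

```python
# Hypothetical guardrail: keep a tweak only if the KPI did not regress.
def guarded_tune(tweak, apply_tweak, revert_tweak, measure_kpi,
                 min_gain: float = 0.0) -> bool:
    """Apply a tweak; revert it unless the KPI improved by at least min_gain."""
    before = measure_kpi()
    apply_tweak(tweak)
    after = measure_kpi()
    if after < before + min_gain:
        revert_tweak(tweak)   # roll back: never leave things worse
        return False
    return True


# Tiny in-memory demonstration with stand-in callables:
state = {"kpi": 0.50}
kept = guarded_tune(
    tweak="queue_depth_up",
    apply_tweak=lambda t: state.update(kpi=state["kpi"] + 0.05),
    revert_tweak=lambda t: state.update(kpi=state["kpi"] - 0.05),
    measure_kpi=lambda: state["kpi"],
)
print(kept, state["kpi"])  # True 0.55 -- tweak kept because the KPI improved
```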
I’ll close on one other item which the presenters raised. Although the initial phase is focused on self-tuning, an area I think is very interesting is self-escalating and self-explaining. This is where Project Magna becomes integrated with VMware’s Skyline support product. Now if there are issues with the system, the idea is that Project Magna can proactively escalate issues directly into VMware’s technical support organization. There is also integration planned with VMware Knowledge Base articles. This means that when a solution to a particular problem is found, it should be very easy to implement the solution (who knows, perhaps it will even be automated). Some very interesting times ahead for sure!
VMworld Sessions on Project Magna
Here are some links to the VMworld 2019 sessions on Project Magna:
- Artificial Intelligence and Machine Learning for HCI: Where We’re Going (HCI1620BU) by Adam Hawley, Director Product Management, VMware
- Optimize vSAN performance using vRealize Operations and Reinforcement Learning (HCI1650BU) by Arun Annavarapu, Product Line Manager, and David Pham, Sr. Product Marketing Manager, VMware
- HCI Management: Current and Future (HCI1207BU) by Christian Dickmann, Principal Engineer, and Junchi Zhang, Product Line Manager, VMware
Please note that these are tech previews and as such, there is no guidance given about which future version of vSphere will include these products/features. Also, there is no commitment or obligation that technical preview features will become generally available.
Sounds like cool technology. How does Magna know the action it takes is not impacting something else? For example, improving read throughput of VM1 at the expense of another VM. How does Magna control for those situations?
Since this initial tech preview was for vSAN, and vSAN only has a single datastore, it’s relatively easy to control. You will need to determine the type of workload that is running on vSAN. If it is a read-intensive workload, then you would select the read workload KPI, and tuning is done to improve read performance. If it is write-intensive, then it tunes for writes. If it is a mix, then it attempts to balance both.
However, one of our goals would be to eventually be able to do this for VM-granular workloads. Even on vSAN, it is likely that you will run workloads of different types. Then it becomes rather more complex. To make sure there is no untoward behaviour, we develop a bunch of tried and tested algorithms for Magna to use. Currently, in our testing, if we see a downside we were not expecting, we tweak the “reward”/”goal” mechanism to train the system not to let that happen again. And then we test again (…rinse, repeat…).
But you do make a good point – and this is the nature of AI/ML. I reached out to some of the Magna folks about the guardrails. While we are focused on creating guardrails to control the possibility of introducing a performance regression, when Magna is first turned on and gets oriented with your system, there may be temporary regressions as the first tunings start and Magna starts learning. But these will be temporary, and once Magna builds up its reinforcement learning, performance will improve. This may be the temporary price you pay to experience the much larger, and longer-term, upside.
We also have techniques that help converge on the right tuning strategy faster and with fewer data points. There are also ways to put in guardrails to get Magna to halt/reset if it is clearly not converging on a winning strategy in a reasonable amount of time. And we are doing all of that. Hope that helps.
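To illustrate the idea (and this is just my own sketch, not how Magna actually implements it), a "halt if not converging" guardrail could be as simple as checking whether the KPI has made meaningful progress over a recent window of tuning attempts. The window size and threshold below are made-up values.

```python
# Hypothetical "stop if not converging" check over recent KPI samples.
def should_halt(kpi_history: list[float], window: int = 20,
                min_improvement: float = 0.01) -> bool:
    """True if the last `window` measurements show no real progress."""
    if len(kpi_history) < window:
        return False
    recent = kpi_history[-window:]
    return (max(recent) - recent[0]) < min_improvement


print(should_halt([0.5] * 25))  # True -- the KPI is flat over the window
```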
Thanks for the detailed explanation. I suppose it also has the benefit of observing the reward (+/-) of actions taken from different customers/environments and if there are enough examples it can anticipate the reward with a level of confidence that withstands regressions.
Indeed. My understanding is that we build a number of configurations internally to do the modelling, but it wouldn’t be possible to build every customer environment/configuration policy. As adoption grows, so will the catalog of different environments against which we can model. Then, as you said, tuning can be done with more confidence that the outcome will be positive.