Rebuilding for the cloud with Robert Venable

This is Robert Venable. He likes whitewater rafting, shrimp poboys from the Old Tyme Grocery in Lafayette, Louisiana, and hanging out with his sons. He works as a principal architect for Microsoft IT, leading the effort to rebuild dated financial reporting systems in the cloud.

The thing I love about my job is solving real-world problems. Specifically, I get to solve them within finance, which is historically a risk-averse field for companies. I get to provide new capabilities and advancements. One of those financial systems is our revenue reporting system, commonly referred to internally as MS Sales.

MS Sales is a large data warehouse and analytical platform built for Microsoft’s revenue reporting, based on Microsoft SQL Server technology. We are rebuilding it to be in Azure and use some of Azure’s capabilities to make our users happy. We have to keep 21 years of sales data: 10 years forward-looking, 10 years historical, and the current year. We need to do this for compliance reasons. The people who look at the data want to see what revenue looked like based on the business rules and the organizational structure in the past, as well as what it would look like in the future given changes to a business rule.

Overall, MS Sales is a pretty big system. It was originally built on SQL Server, mainly in a scale-up fashion. MS Sales is about 20 years old, and over those 20 years it has been enhanced and added on to, which made it more complex, so some of the code was spaghetti-ish in nature. We get about 1,500 sources of data through MS Sales. This includes our channel partners. We also have multiple billing systems within Microsoft, as well as licensing and product systems. The system integrates all this data to give you a view of Microsoft’s financial revenue position across organizations, business segments, geographic hierarchies, those kinds of things.

When the MS Sales app was built 20 years ago, customers typically purchased a box of new software from Microsoft every three to six years. Over time, the number of incoming transactions has multiplied exponentially. The app now operates ’round the clock, tracking billions of financial transactions. These could be large purchases, like when a global company subscribes its workforce to Office 365, or micro-transactions, like when a customer makes a short call on Skype or uses a few minutes of server time on Microsoft Azure. The MS Sales app has struggled to keep up with the heavy demands of modern financial reporting.

The uh-oh moment was when one of the development leads came to me a couple of years ago. He said, “I think we have a problem with MS Sales as it currently is architected. The data size and the growth that we see, based upon the hardware that was available, doesn’t look like it’s gonna keep up.” We did a graph of how fast the data was growing, and we had an exponential data curve but more of a linear compute curve, basically from the scale out and Moore’s Law. We found out that, hey, in 18 months we’re gonna tip over if we don’t do anything. We’re not gonna meet our business needs.

That kicked off an effort to find out what technology stack we could use in the future to help MS Sales fulfill its needs. We chose a distributed system, so instead of scaling up we thought about how we could scale out. We’re going towards more of a modern data warehouse, or a logical warehouse, where we try not to hop the data. We actually try to bring the compute to the data as much as we bring the data to the compute.

As the clock ticked towards the demise of MS Sales, the team considered the best options. Would they lift and shift the entire system to the cloud, employing infrastructure as a service? Or would they build something from scratch using platform as a service fabric and big data computing solutions?
Whatever solution they picked would have to support hefty future performance and capability needs. The solution would also need to take into account the cost and complexity of re-engineering the app in a race against time. The team finally decided on Apache Spark for HDInsight, which allowed for reuse of existing code but also provided a robust architecture that could easily scale out.

Spark is a big data processing engine. It has a couple of different advantages. One advantage that we like is the in-memory processing, and the other is that I can basically use the same code for streaming or for batch. I’m a firm believer in keeping your options open. Especially when you start down a path and you don’t know exactly where you’re gonna end up, you try to keep as many options open in your back pocket as you can.

As the MS Sales app continued chugging towards a cliff, the team seemed to have its solution. They would use a distributed system based in Microsoft Azure, which would remedy all of the app’s current shortcomings as well as add robust cloud capabilities. Though everyone agreed the solution was best for the situation, implementing it would require IT experts to move well outside their comfort zones.

When we moved to open source, there were a couple of different cultural changes that we needed to embrace. One was that we had a development team that had been working on MS Sales for a long time. What they knew was SQL, they knew it inside and out, and we needed to move to an open source technology, and that new technology landscape was scary for them. It’s just a different way of thinking about the processing, and trying to do that is a cultural change that we had to make within our own engineering team. Being open source was just another thing that was scary, because most of them had some C# capabilities, and moving to where we actually ended up, which is Scala, was daunting to them. It was a cultural change.
From a business aspect, even they knew SQL. SQL being a Microsoft product, they were able to open up SQL Server Management Studio, write a T-SQL statement, and actually view the data. They were comfortable in what they knew.

Rebuilding an app this big and important to Microsoft required significant buy-in from teams across the company. The team worked hard to earn the trust of key stakeholders.

We can’t go dark for 12 months or 16 months and then say, “Oh by the way, we’re here two months before we fall over, and here’s your new system.” So there’s a lot of confidence building and trust building you have to do with both the business side and the engineering side. With MS Sales there were two ways to really do this. The first way is we took vertical slices of the platform and tried to move them into the cloud and use a different paradigm. The problem with that was that if I just moved ingestion, or just moved processing, or just moved distribution, I had really no end-to-end value, and I didn’t get to start the cultural change from a business aspect of what does it mean when my data is refreshed every 30 minutes?

We decided that instead of taking a vertical approach we would take a horizontal approach. So we took a specific pipeline within MS Sales; we call it the channel pipeline. We actually took two, channel and inventory. But we took that holistically, so it’s a little bit of ingestion, a little bit of processing, a little bit of distribution, and we moved that piece as our pilot phase. The current model in which Microsoft operated was more batch ingestion: we would get a file once a day, or three times a day, and we would basically batch data through the system. The real thought process there is, how do we get out of that batch, latent, inherent system and think about, hey, when a transaction hits an event hub, for example, I can process that transaction from beginning to end without ever even landing the data if I want to.

To support current internal systems and partners, ingestion must allow for both batch and streaming methods. Incoming file transfers land in blob storage, and a simple process built in Service Fabric validates basic elements of the file: number of columns, schema, data types, and more. The process then streams each row as a transaction into the event hub. A copy of the validated data is saved for archival, which allows for auditing at each stage of the process. The future state will use an API for partners to stream transactions into the event hub in real time, providing faster ingestion and processing.

It was really about how do we use a lot of distributed computing versus scale-up computing? It was about how do I make sure that the system can meet the demands of today as well as the demands of tomorrow? Historically, we have been running 21 years of data and our end users would see data every 24 hours. In the new processing, we are able to process 21 years of data in 42 minutes, so the end users can actually see fresh data every 42 minutes. To test the scale of that, we increased the data to 10 times that volume, and our processing time only increased by 10 minutes. So even at 10 times the volume of today’s 21 years of data, the end user can see data in 52 minutes.

Where we’re going with this is that when a change happens on a business rule, an event gets fired. That event is taken, and then an analysis of which transactions are affected by that event is needed. Then only those transactions actually get fed through and are incrementally updated. Currently we’re using Drools, basically as an add-on to the Spark processing pipeline. On our distribution side, today we offer data marts that people can pivot to see data the way that they need to see it, to figure out what they’re trying to solve or to make decisions for their businesses or for their specific application.
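
The batch ingestion path described earlier (validate the file's columns and data types, archive the validated copy, then stream each row as its own transaction) might look roughly like the following Python sketch. It is only an illustration, not the team's Service Fabric code: the three-column schema, the function names, and the `publish` and `archive` callbacks (stand-ins for an event hub client and blob storage) are all assumptions made for this example.

```python
import csv
import io
from datetime import date
from decimal import Decimal, InvalidOperation

# Hypothetical schema: column name -> parser that raises on bad input.
# The real MS Sales file layout is not described in the source.
SCHEMA = {
    "transaction_id": str,
    "sale_date": date.fromisoformat,
    "amount": Decimal,
}


def validate_file(text):
    """Check header, column count, and data types; return the parsed rows.

    Raises ValueError on the first problem, so a bad file is rejected whole.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != list(SCHEMA):
        raise ValueError(f"unexpected header: {header}")
    rows = []
    for line_no, raw in enumerate(reader, start=2):
        if len(raw) != len(SCHEMA):
            raise ValueError(
                f"line {line_no}: expected {len(SCHEMA)} columns, got {len(raw)}"
            )
        try:
            rows.append(
                {name: parse(value)
                 for (name, parse), value in zip(SCHEMA.items(), raw)}
            )
        except (ValueError, InvalidOperation) as exc:
            raise ValueError(f"line {line_no}: bad value ({exc})") from exc
    return rows


def ingest(text, publish, archive):
    """Validate a file, archive the validated copy, then emit one event per row."""
    rows = validate_file(text)   # nothing is published if validation fails
    archive(text)                # copy kept for auditing
    for row in rows:
        publish(row)             # stand-in for sending to an event hub
    return len(rows)


# Usage with in-memory lists standing in for the event hub and the archive.
sample = (
    "transaction_id,sale_date,amount\n"
    "T1,2017-01-05,19.99\n"
    "T2,2017-01-06,5.00\n"
)
events, archived = [], []
count = ingest(sample, events.append, archived.append)
```

Validating the whole file before publishing anything is one possible design choice: a malformed file never produces partial events, at the cost of holding the parsed rows in memory first.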

The team is currently using SQL Database for distribution, primarily to maintain backwards compatibility with the client app used to access the platform. The transfer time for distribution has become the bottleneck in end-to-end processing, and the team is implementing Azure Data Warehouse in combination with changes to the client app to create a distributed data model that mimics the design of processing.

It’s hard to say that it’s one technology. It’s not; it’s a lot of different technologies that make up the end-to-end solution. I believe that from a user standpoint they will start seeing the benefits of data that’s refreshed and gets to them quicker. I see us adding machine learning into the pipeline so that we can actually do predictive forecasting, so we can actually get ahead of the game. Instead of looking at what has happened, we look at what is going to happen, or how we can make it happen in the future. So that’s what I see as the future.

We’ll address the scale, we’ll address the latency from end to end, and we’ll add the agility and the componentization that we talked about earlier, which is how can I make MS Sales more agile to Microsoft’s business? Then the combinability piece is more how can I combine this normal relational financial data with other big data elements, whether it be Twitter feeds or market sentiment or whatever it is, to actually provide bigger, better value for our customers, whether they be in marketing or in finance, so that we can say, “Hey, we are going to sell X today,” rather than, “What did we sell?”

Right now we get about 2.6 million transactions a day, but we’re architecting to do about 10 times that, about 26 million. We have basically done the things that we said we were going to do. It gave us a spot where we could have the business see the benefits and start thinking about how their world will change, as well as have the team prove out the technology stack in between.
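
To put the daily volumes quoted above in per-second terms, here is a small back-of-the-envelope calculation in Python. It assumes, purely for illustration, that transactions arrive evenly over 24 hours; real traffic is likely burstier, so peak rates would be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(transactions_per_day):
    """Average arrival rate, assuming an even spread across the day."""
    return transactions_per_day / SECONDS_PER_DAY


current = average_rate_per_second(2_600_000)    # today's quoted volume
target = average_rate_per_second(26_000_000)    # the 10x design target

print(f"current: {current:.1f}/s, target: {target:.1f}/s")
# prints "current: 30.1/s, target: 300.9/s"
```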
I believe we can impact not just Microsoft, but almost any financial institution that is risk-averse about moving its information to the cloud and taking advantage of these capabilities, because they know what worked in the past. Maybe we can help show them what it looks like in the future. The power of the cloud is actually the ability to not think about your infrastructure and to light up new capabilities from both a business standpoint and an IT standpoint. It allows you to really focus your investment on your core business value, not the maintaining of servers, to be honest. That’s what the cloud means to me. It’s expanding the capabilities of an organization.

In our 10-part video series Expedition Cloud, Brad Wright and other Microsoft technical experts will share the inside story of Microsoft’s journey to the cloud, including proven practices, things we’d do differently, and advice for customers on the same journey toward digital transformation.
