Hear Rasmus Selsmark, Lead Developer at Unity, explain best practices for securing your build and deployment pipeline, creating the capability for 100+ deployments per day
Lauri (00:03):
Welcome to the DevOps Sauna, my name is Lauri and I am the chief marketing officer of Eficode. Not too long ago, we held a hugely popular two-day DevOps 2020 event. In the event, we had awesome speakers from around the world telling stories about DevOps tools and culture for over 1,000 people online.
Lauri (00:23):
Since then, we have made these recordings available at the Eficode website, and due to the popularity of the speeches we have now made them also available in our podcast. You can find the links to the video recording and to the materials referred to in the speech in the show notes. We're also keen to feature topics that you find interesting in the area of DevOps. Do let us hear from you on our Twitter, Facebook, or LinkedIn pages.
Lauri (00:48):
Today, our speaker is Rasmus Selsmark. He is a lead developer at Unity, and his topic is about scaling DevSecOps to integrate security tooling for more than 100 deployments per day. Unity's next step is to integrate security tooling into the deployment process, thereby avoiding container vulnerability scanning that happens days or weeks after the actual deployment took place. He will also talk about build pipelines for microservices. With that, it's high time that we tune in for the show.
Rasmus (01:23):
I'm here to talk about how we at Unity have scaled security tooling into our build pipelines, on some days with more than 100 deployments per day. We're also going to talk about microservices, where we have roughly 200 microservices using the build pipeline that I'll get into. I am Rasmus. I have been working at Unity Technologies since 2013 and have been a lead engineer in one of our DevOps teams since 2018, where I have been building this as part of Unity. I'll come back to the different areas of Unity. My background is originally as a developer. Then I worked in test automation, also leading QA teams, before transitioning into DevOps. With that background, I've seen a similar movement in quality: 10, 15, 20 years ago, you would have separate QA teams, and in some places you might still have them. That works if you have fairly infrequent deployments, say once a month; then there is time for manual testing in that process.
Rasmus (02:42):
I can see the same happening with security. With infrequent releases, you would have time to do a security audit, but today, when you want to ship early and ship fast, there really isn't time for either manual testing or a long security audit. That is why we want early feedback, both on testing, by running automated tests as part of the pipeline, and on security, by integrating security tooling into the pipeline.
Rasmus (03:13):
I actually attended the same conference in 2018, where I talked about the first part of what we've done at Unity in terms of building a common build and deployment pipeline. That enables a lot of this: we are able to integrate tools into one build pipeline, which is then available to all the teams using it.
Rasmus (03:44):
And that goes for security tooling as well. I've seen a lot of questions in earlier presentations today asking about actual tooling. I hope to address those in this presentation by showing what tools we're using. Coming from a game company, I thought a good way to visualize this is a tower defense game, so this is, say, running in Unity. For security, like in a tower defense game, you have enemies, in this case various threats and vulnerabilities, and we have tools, weapons, we can use to eliminate those. One point of this presentation is that I will show some of the tools we're using. Your context might vary, so it might not be the same tools for you. These tools are constantly evolving, and we also work with our security teams at Unity to see if there are new, better tools we can use.
Rasmus (04:47):
So keep in mind that both the vulnerabilities themselves, the security threats, are constantly evolving, and the market for security tooling, especially open-source tooling, is evolving too. One side of Unity is building the game engine for building games. Another thing we do at Unity, and the part of Unity I'm in, is building online services for game developers. My background is from the Unity Ads network, where we have roughly 35,000 requests per second from games, and that's across almost two billion devices per month. And this, as mentioned, takes about 200 microservices behind it. That was also one of the interesting aspects of scaling in this part of Unity. I'd like to borrow these slides from Guy Podjarny, founder of Snyk.
Rasmus (06:00):
Snyk is a static analysis tool. We're not using that tool, but I think it's a good presentation, and I would recommend people watch it. It talks about microservices architecture and how, in a monolithic application, you have a very clear perimeter. There are typically a few inputs and some outputs from that system, and from a security aspect, that is what you're interested in: what are the interfaces into your system? The fewer inputs an application has, the easier it is to address and test those from a security point of view.
Rasmus (06:52):
Constrained flows is a bit of a black-box approach: when you have input into the system, something happens within the monolith and you get an output. That's relatively easy from a security point of view. You have a few endpoints you can target, and you can test those, do fuzz testing, and try to break them. And wholesale deploys goes back to the fact that you usually have fewer deployments, which allows a bit more time in each deployment for doing a security audit, for instance. On the other side, the microservice architecture has way more endpoints. Each of these microservices has its own endpoints and interfaces to interact with. Flexible flows means that data can flow in different ways through a microservice architecture.
Rasmus (07:53):
So it might be that one service is taken down and data flows through other services. It might be that you're running A/B tests, meaning that you are sending 10% of traffic to a new service to see how that service works in terms of functionality, latency, and performance. And not least, constant deploys, meaning that you release often. Getting back to that point: in our case, it typically takes 15 to 30 minutes from merge to being deployed. In this case, for some of our services, that's 1,000. It means that you don't have much time to run security tooling. On this point, having microservices actually helps, because each microservice has a smaller code base, and a smaller code base takes less time to scan. Guy, the presenter of these slides, calls microservices a mess from a security perspective. We are running a microservices architecture.
Rasmus (09:16):
The primary reason for that is the ability to scale out. You can scale out the relevant microservices where there is a need for it. In our case, the dashboards, which game developers access, get far fewer requests per second than the runtime environment, which is the one the games are contacting. The second reason is a clearer code ownership model: one microservice is owned by one team. Another thing a microservice architecture introduces is a lot of library duplication. Because you are using the same libraries across many of the microservice repositories, if a vulnerability is found in one external library, it needs to be addressed in a lot of different places.
Rasmus (10:25):
In a monolith, you have to update that once, and then it's done. In a microservice architecture, it could be, in our case, 200 microservices that need to have the same library updated. That's also what we try to address. A quick slide about the DevOps team culture at Unity Ads. An important note about Unity is that we ship both a desktop product, the Unity game engine, which has a different deployment model with a few releases, and our online services, which are released continuously. For Unity Ads, we have a relatively small DevOps team of seven people, compared to probably 250 developers. So it's a small team, with the intention all the time of driving DevOps culture and avoiding becoming a bottleneck. That also means, and this will also be part of this presentation, spreading good practices and best practices across development teams.
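To make the library duplication point concrete, here is a minimal Python sketch of finding every service that pins a vulnerable version of a shared library. The service names, library names, versions, and the `services_using` helper are all hypothetical and only illustrate the idea; this is not how Unity's tooling is implemented.

```python
# Hypothetical inventory: service name -> {library: pinned version}.
SERVICE_DEPENDENCIES = {
    "ads-delivery": {"json-lib": "2.9.8", "collections-lib": "28.0"},
    "ads-dashboard": {"json-lib": "2.10.1"},
    "ads-events": {"collections-lib": "28.0"},
}


def services_using(library, vulnerable_versions, inventory):
    """Return the sorted names of services pinning a vulnerable version."""
    return sorted(
        name
        for name, deps in inventory.items()
        # deps.get() returns None when the service does not use the library.
        if deps.get(library) in vulnerable_versions
    )


# Every service listed here needs its own update and redeploy.
print(services_using("collections-lib", {"28.0"}, SERVICE_DEPENDENCIES))
```

In a monolith this list would have a single entry; with hundreds of repositories, the same query can return hundreds of services that each need a pull request.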
Rasmus (11:33):
So if one development team is using tools that we can see would also benefit other teams, then we spread that practice. One product that the DevOps team at Unity is developing is a common shared CI/CD pipeline, which is the one used to deploy the 200 microservices. We do have other CI/CD pipelines within Unity, so this is just one example. I don't necessarily think that you need to have one pipeline cover everything from shipping a desktop product to online services. At the bottom, we have shared infrastructure. We are using the same network and Terraform, as configuration as code, for spinning up the infrastructure running on Google Cloud, and then also the security tools. We're working with a centralized security team at Unity, who evaluate security tools. Many of the tools that I show today are picked by the security team, and we see it as our role to integrate them into the pipeline, because having centralized security tools gives the security team insight into where there could be an opportunity to do a security audit of some microservices, for instance. On top of that, we have the build and deployment pipeline, and on top of that, the dev teams are building and deploying the services. One place where we started was that we're running on Kubernetes.
Rasmus (13:11):
I believe most of the tools that I show today will also work in a non-Kubernetes environment. The Kubernetes blog has some guidelines in a post on 11 ways not to get hacked. Static analysis of, in this case, YAML, since Kubernetes has a lot of YAML, or static analysis in general, as early as possible, instead of having to do the evaluation at runtime. Run containers as non-root users, so in case there is a vulnerability, at least an attacker doesn't get that far. And scan images for vulnerabilities and run intrusion detection to verify whether there is suspicious traffic in your environment. That blog post mentions the notion of shifting left, which for me just means early feedback. We want to release often and fast, which means having less time for doing a security audit, for instance.
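As an illustration of static analysis of Kubernetes manifests, the following Python sketch flags containers that do not declare `runAsNonRoot`. It assumes the manifest has already been parsed from YAML into nested dictionaries, and the function name and sample Deployment are made up for this example. `runAsNonRoot` is a real field in both the pod-level and container-level `securityContext`, with the container-level value taking precedence.

```python
def containers_missing_run_as_non_root(manifest):
    """Return names of containers that do not enforce runAsNonRoot.

    A missing setting does not prove the container runs as root (the image
    may set its own user); it only means Kubernetes is not asked to enforce it.
    """
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    pod_level = pod_spec.get("securityContext", {}).get("runAsNonRoot", False)
    offenders = []
    for container in pod_spec.get("containers", []):
        context = container.get("securityContext", {})
        # A container-level setting overrides the pod-level one.
        if not context.get("runAsNonRoot", pod_level):
            offenders.append(container["name"])
    return offenders


# Hypothetical Deployment, already parsed from YAML.
deployment = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {
        "securityContext": {"runAsNonRoot": True},
        "containers": [
            {"name": "api"},
            {"name": "sidecar", "securityContext": {"runAsNonRoot": False}},
        ],
    }}},
}

print(containers_missing_run_as_non_root(deployment))
```

A check like this can run on every pull request, long before the manifest reaches a cluster, which is the early feedback the talk describes.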
Rasmus (14:14):
So the earlier we can catch issues and give feedback to the development teams, the better in terms of avoiding vulnerabilities in production. If we look at a development or release workflow, we have the development of a service, then the CI/CD pipeline of build, test, static analysis, and deployment, and finally going into staging and production. I'll go through some of the tools. On the left side, during the development phase, it is about keeping your code base tidy, meaning avoiding vulnerabilities at the code base level. The earlier we can detect and correct issues, the better. For the build and test phase, we don't have any standardized tool. I know some teams also test for security issues as part of their unit tests, but we don't have any standardized tool for that.
Rasmus (15:13):
The next one is static analysis, where we use some tools to get information about vulnerabilities, at least before deploying to production. I'll come back to those tools later. During deployments, we have secrets management so we don't have to store secrets in code, and when running in staging and production, we use runtime container scanning for detecting any issues at runtime. The whole point of moving tools to the left side is to provide early feedback to development teams, whereas the right side is about collecting metrics. Usually, on the right side, it's the security team who is able to react to those and address those issues. Let's start with the early feedback part. It's always good when we have worst-case examples. So Equifax, which most of you are probably familiar with.
Rasmus (16:11):
They had a huge leak of personal data because of an unpatched library in an application. There was a window of two months where a fix for the vulnerability in Struts had been available, and during those two months, a hacker actually managed to get in and get access to the data. For that, we are using a tool called Renovate. Renovate will run every day, check if there are updates to libraries that you are using in your code base, and automatically create a pull request in our GitHub repository, which then goes through the pipeline and runs the tests. So you can feel confident that yes, this change will still work and doesn't break anything. Another added benefit is that you avoid building up technical debt. So updating your libraries as early as possible means, from a security point of view, avoiding having vulnerable libraries. The next example is an AWS token getting leaked from a repository.
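The core of the daily update check can be sketched in a few lines of Python. This is not Renovate's implementation: Renovate reads real package manifests and opens pull requests, while this sketch only shows the comparison step. The library names, versions, and helper functions are hypothetical, and it assumes plain dotted numeric versions.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '1.4.2' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def proposed_updates(pinned, latest):
    """Return {library: (current, newest)} for every outdated pinned library."""
    updates = {}
    for library, current in pinned.items():
        newest = latest.get(library)
        if newest and parse_version(newest) > parse_version(current):
            updates[library] = (current, newest)
    return updates


# Hypothetical data: versions pinned in a repository, and the latest
# versions published upstream.
pinned = {"example-web-framework": "1.4.0", "example-json-lib": "2.10.1"}
latest = {"example-web-framework": "1.4.2", "example-json-lib": "2.10.1"}

for library, (current, newest) in proposed_updates(pinned, latest).items():
    # In a real setup, each entry would become its own pull request.
    print(f"update {library}: {current} -> {newest}")
```

Because each proposed update goes through the same pipeline as any other pull request, the tests decide whether the bump is safe to merge.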
Rasmus (17:22):
And for that: don't store secrets in your code base. We actually have an internally developed tool, but I know that GitHub also has a token scanning tool which will warn you about tokens in your source code. So this is an example of what happens if you try to push a secret into a code base. In this case, it's a set of regular expressions which check: does this look like a token? And then, on the commit in GitHub, you will get a notification saying that, hey, this looks to be a secret. And then of course you remove that from the code and rotate the token. The next part is static analysis, for which we're running three tools: SonarQube, SourceClear, and Trivy. We'll go through those one at a time.
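The regular-expression check for secrets described here can be sketched roughly as follows. This is an assumption-laden illustration, not Unity's internal tool or GitHub's scanner: the two patterns and the `find_secrets` helper are examples only, and a real scanner would carry many more patterns. The sample key is the placeholder used in AWS documentation.

```python
import re

# Illustrative patterns only (assumed for this sketch).
TOKEN_PATTERNS = {
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Quoted values assigned to names containing secret/token/password.
    "generic_assignment": re.compile(
        r"(?:secret|token|password)\w*\s*[:=]\s*['\"][^'\"]{8,}['\"]",
        re.IGNORECASE,
    ),
}


def find_secrets(text):
    """Return (line_number, pattern_name) for every line that looks like a secret."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in TOKEN_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings


sample_commit = 'region = "eu-west-1"\naws_key = "AKIAIOSFODNN7EXAMPLE"\n'
print(find_secrets(sample_commit))
```

A match only says the text looks like a token. As noted in the talk, the follow-up is to remove it from the code and rotate the token, since it has already been exposed.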
Rasmus (18:22):
So SonarQube: we actually started using that as a tool for quality analysis of our code base. It gives a very good overview of code smells in your code base, but it also has a security hotspots feature, which can tell us about possibly vulnerable areas of the code base and is used by our security team. If the security team is doing an audit of a code base, SonarQube is a way for them to get a good idea of where to look and, at the first level, by having the metrics available across different services, to know which service is actually interesting to look at. Another tool is SourceClear. SourceClear will scan the source code, but also the libraries. So it does something similar to Renovate, but at a later stage. If you, as part of a pull request, add a vulnerable library, SourceClear will detect that during the build of that pull request.
Rasmus (19:34):
So this is the ideal view of a project scanned by SourceClear, showing "Nice work": no vulnerable libraries in this project. Another nice feature of SourceClear, which other tools also provide, Snyk for instance, the one I mentioned earlier, is license scanning. You get an overview of which licenses are used in the libraries that you're referencing, even transitively, so libraries of libraries. At one point, our legal team would keep a manually updated list of the licenses; using SourceClear, that information is now gathered automatically. Quickly showing how we have integrated this into the common build pipeline: as mentioned, we have a pipeline used across the microservices, and the idea behind that is to have the logic for building and deploying services in one place.
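License scanning across transitive dependencies amounts to walking the dependency graph. The Python sketch below shows that walk under stated assumptions: the dependency graph, the license data, and the `collect_licenses` helper are invented for illustration and do not reflect how SourceClear or Snyk work internally.

```python
def collect_licenses(root, dependencies, licenses):
    """Map each license to the direct and transitive dependencies of `root` using it."""
    seen = set()
    stack = list(dependencies.get(root, []))
    by_license = {}
    while stack:
        library = stack.pop()
        if library in seen:
            continue  # Shared or cyclic dependencies are visited only once.
        seen.add(library)
        license_name = licenses.get(library, "UNKNOWN")
        by_license.setdefault(license_name, set()).add(library)
        # Follow libraries of libraries.
        stack.extend(dependencies.get(library, []))
    return by_license


# Hypothetical dependency graph and license metadata.
dependencies = {
    "my-service": ["http-client", "json-lib"],
    "http-client": ["tls-lib"],
    "json-lib": [],
}
licenses = {"http-client": "Apache-2.0", "json-lib": "MIT", "tls-lib": "GPL-3.0"}

for license_name, libraries in sorted(collect_licenses("my-service", dependencies, licenses).items()):
    print(license_name, sorted(libraries))
```

Here `tls-lib` is never referenced directly by the service, yet its license still shows up in the report, which is exactly the case a manually maintained list tends to miss.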
Rasmus (20:41):
So we have one place for the logic, and then at the service level, this is on the left side, a specific service just specifies configuration. In this case, for SourceClear, you specify the Java version, which determines which image is used for scanning. Here is a basic example of how this is actually implemented. The idea is that if SourceClear has been enabled for this repository, then we run the SourceClear scan within a Docker container and process the log we get from SourceClear. How does that look to the development team? That's this part, where we have the static code analysis steps running as part of the pipeline. This is for a full deployment. You would also have this on a pull request, where you would also get the SourceClear, or static analysis, scanning.
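The split between shared pipeline logic and per-service configuration can be sketched as below. The slide's actual code is not in the transcript, so everything here is assumed: the configuration keys, the image names under `registry.example.com`, the log format, and both helper functions are hypothetical. Only the `docker run` flags (`--rm`, `-v`, `-w`) are real Docker CLI options.

```python
# Hypothetical mapping from a service's declared language to a scanner image.
SCANNER_IMAGES = {
    "java8": "registry.example.com/scanners/sourceclear-java8:latest",
    "node12": "registry.example.com/scanners/sourceclear-node12:latest",
}


def scan_command(service_config, workspace):
    """Return the docker command for the scan step, or None if scanning is disabled."""
    scan = service_config.get("sourceclear", {})
    if not scan.get("enabled", False):
        return None
    image = SCANNER_IMAGES[scan["language"]]
    # Mount the checked-out repository into the container and scan it there.
    return ["docker", "run", "--rm", "-v", f"{workspace}:/src", "-w", "/src", image]


def count_findings(log_text):
    """Count log lines flagged as vulnerabilities (the log format is assumed)."""
    return sum(1 for line in log_text.splitlines() if line.startswith("VULNERABILITY"))


# Per-service configuration lives in the service repository; the logic above
# lives once in the shared pipeline.
service_config = {"sourceclear": {"enabled": True, "language": "java8"}}
print(" ".join(scan_command(service_config, "/workspace")))
```

Keeping the logic in one place means a new scanner or a changed image reaches every service at once, while each team only maintains a few lines of configuration.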
Rasmus (21:39):
So even at the pull request level, you get information if a vulnerability has been found. One thing to note here, and this is quite typical: this is one of our services written in Java, and 95% of its code base is external libraries, which matches quite well with the numbers I hear, between 90 and 95%. And then, in this case, the development team gets a notification as a comment on the pull request saying that issues were found with this pull request, and they can expand that to get more information, at the bottom, about which risks were found. Another open-source tool is Trivy, for doing container scanning. I'll say that container scanning has shown not to be that relevant for development teams, given that it's usually not part of their workflow: if there is a vulnerability in your image, it's simply caused by some external dependency in which a vulnerability was found, and it's not necessarily related to that pull request. So this is something that we want to do as runtime scanning instead. Another aspect of security is wanting to track deployments, to make sure that you know what is going into a production environment.
Rasmus (22:59):
So we require a pull request with at least one approver. That's a team-specific rule, so some teams actually require multiple approvers. No direct pushes to master, and it's only the CI/CD pipeline, Jenkins in this case, which has access to deploy to production. And secrets management, back to the point of no secrets in code. There are various tools for this. We are using Vault. How you specify that is that you store the secret in Vault, and then you can reference it, in this case, in the Kubernetes manifest. So secrets are resolved at deployment time by this Vault secret feature that we're using. There is more information about that in a blog post; search for managing credentials at Unity. We have open-sourced this tool. There are also other tools available for this, Google Secret Manager for instance. The last part is that we have experimented with Falco, which is an open-source intrusion detection tool that can give information about, for instance, whether you have containers running as root. So, conclusions and learnings: early feedback. The earlier you can get feedback, the better, but make it actionable. Some of the feedback we've got is that if it's not actionable, then dev teams cannot really use it. It has to be actionable in relation to the workflow that they are in, for instance the pull request workflow. If something is not detectable until runtime, then use runtime scanning for it instead. I hope it was useful, and thanks for joining.
Lauri (24:43):
Oh, yes, for sure it was useful, Rasmus. Thanks for the lessons learned and your advice. Next time, our guest comes from Microsoft. He is Sam Guckenheimer, Azure DevOps product owner. He tells us an interesting story of transforming Microsoft to using Azure DevOps and GitHub with a globally distributed service on the public cloud. But he will also refer to the lessons learned at Microsoft during the recent lockdown period. Until then, remember to give early actionable feedback to your developers.