How did GitOps get started? An interview with Alexis Richardson

As part of the research for my upcoming GitOps article in iX magazine, we - Johannes Schnatterer and I - had the opportunity to interview Alexis Richardson, founder & CEO of WeaveWorks, about how he invented GitOps - both the concept and the term. The interview covers Alexis' views on GitOps and Kubernetes and gives an interesting insight into his company and their take on Open Source.

The following transcript is auto-generated by YouTube, please excuse any mistakes:

Schlomo: Welcome everybody to our little fireside chat about GitOps. In the call we have Alexis Richardson, who is the founder and probably CEO or something like that at WeaveWorks, we have Johannes Schnatterer from Cloudogu, and myself: my name is Schlomo Schapiro, I'm working at Deutsche Bahn, DB Systel. The goal is to figure out a little bit what GitOps actually is, and the reason I asked Alexis for this interview is that I have the impression that you, Alexis, invented GitOps some time ago.

Alexis: Oh, that's right.

S: Can you maybe shed some light on how you came to invent GitOps: what happened, what was the background, and why is it the thing that is gonna save IT forever?

A: I'm glad that you think it's gonna save IT forever.

I'm just trying to remember, it was some years ago. WeaveWorks started in 2014. We had a team originally in London and then not long after that also in Berlin; now, of course, we're an international company. We wanted to make it easy for software developers to build and operate applications in this new environment of containers. And we cared about this because we had spent the last five years trying to build cloud applications ourselves, originally at RabbitMQ and then at VMware and Pivotal, when I was in charge of the Rabbit team, with Redis as well and Spring, all in one company portfolio, which I was head of products for. Trying to offer an application development capability that would run anywhere in the cloud, with some of the ease of use of things like Heroku, was very challenging, and we tried different experiments. There were of course different versions of Cloud Foundry, and there was OpenShift, which is essentially built on a similar set of principles: trying to make Heroku work in more environments, maybe behind the firewall, maybe on other clouds than Amazon. But the thing is that none of these solutions really provided much more, in those days, than the 12-factor application model, which I'm sure you've heard about. It's one pattern for a scalable web app. They went a bit further, but essentially you couldn't imagine more complex applications being pushed into this setting. We could imagine lots of enterprise applications that were heavily data-oriented not working well in this environment, and there were assumptions in the architecture about how it was load balanced and how it would scale that meant it would work well for some things but not others, and it wasn't particularly easy to set up, and so on.
What we saw was that containers could be a better way forward, because they provided a very simple operating model for a single node, where you have a package which is also a runtime, and it's kind of portable, sort of. And also, if you spent five years trying to build applications out of VMs, as I did at VMware, you soon realize that this is good for some things but not everything, because VMs include extra software to make them reliable, so they're virtualizing a real machine, but actually what you want is something more ephemeral, something that you can start and stop quickly. It doesn't matter if it fails, you start another, you know, the whole cattle concept, and all of these things were there in containers. So we had several different really nice properties. It's developer friendly, it's a democratic model, anyone can do it, you don't need to understand an architecture. You could see NodeJS programmers doing it, you could see Python programmers doing it, you could see Spring Java programmers doing it, and this is all due really to the innovations by the Docker team around Solomon, so big kudos to Solomon for popularizing this and realizing just how important it was and pushing it through. And then of course there was this industry attention around it, because you could see developers picking up this technology really quickly and running with it, which we never saw with platform as a service.

And so we started this company thinking, let's make it easy to build applications like Rabbit as a service, Redis as a service or, again, as I said, more complex things. We could immediately see there were pieces missing: there was no network, therefore you couldn't build an application with two nodes. That was a problem, so we built a network, Weave Net. That got us into the market and got us some funding, and then we built a set of tools around management and monitoring. So in those days we thought that there would be developers building apps in the cloud, and they would use containers instead of VMs, and they would want to manage and monitor those applications. But what we didn't think was that they would pick and standardize on one orchestration technology, because you could see there were lots of different ways of organizing an application, and part of the freedom and appeal of containers was that you weren't pushed down one particular architecture. So we built an orchestration-agnostic management and monitoring tool, still running today, called Weave Cloud; lots of people are using it. It provides a deployment capability, it provides a monitoring capability, and unifies them, so you can do deployment, management and monitoring. The thing is, though, that we ourselves had to choose an orchestration tool to run this on. We originally of course wrote our own and then realized, oh no, that's not a very clever thing to do. We've got a very very very small good team, but we shouldn't be writing orchestration technology, we should be building on it, so let's pick one to run our stack on. Remember, this is 2015. What was available was Docker Swarm, Nomad, Mesos and Marathon, and Kubernetes. Remind me if there's anything else kicking around, I don't think so. There might have been some things from Spotify or some other companies that had done microservices for a few years.
So the problem with Docker Swarm was, it was a bit buggy, and we'd been trying to collaborate with the Docker team and community around networking and generally found that it wasn't a very smooth, completely seamless collaboration, because Docker was already coming under pressure from its investors to try and carve out certain terrain in the market. I think that created tensions between the commercial and open models, which they eventually resolved by completely separating them off with Moby later on, but at the time it was hard, and we just couldn't feel that we could rely on Docker Swarm.

Mesos was fascinating but really general-purpose technology, very abstract. You had to write Marathon to make Mesos work simply, and nobody could really understand it. And there were things like ECS, which was early. We called up HashiCorp; they said, no no no, don't use Nomad. It was version 1.1 at the time. And that left, by process of elimination, Kubernetes, which we were sort of half a fan of, because we could see that it was relatively well thought out, although a real architectural mess in the sense that you couldn't find any person to explain Kubernetes to you, and you could only understand it by reviewing the design many many many times from different directions, and eventually you became familiar with its unique shape and form. I mean, it's just far too complicated really, but it's complexity that had arisen for a reason, which was: these are some of the bumps that you run into, these are things you learn, and these are ways around them. And I think that was where the knowledge of the Google team was so important. We had people from Google in our company who had left Google to come to work with us. They'd been SREs and developers, and they were like, oh gosh, please let's not use Google technology, because it's got so many sort of Googly things that would just frustrate us. But we had to pick something, and really Kubernetes was the only game in town by 2015 for us. And so we had to become experts in installing, deploying and running Kubernetes. We'd already been trying to get something called Kubernetes Anywhere working, playing around with it, putting it on EC2 and things. That was Ilia's work, if you remember. But, you know, getting Kubernetes up and running in 2015 was no joke! Not easy at all. It took us a long time, and eventually people said, we've got it working, we can install it, we can make it work, but it's a bit of a bugger, which is an English word.

Now, we then built our stack on top, and we had this SaaS. So remember that we wanted to sell a product that was a SaaS product, so now we're operating Kubernetes on EC2, a couple of different staging areas, multiple zones for reliability. We have a stack of app components on top, we're deploying those, we've also got some monitoring and security baked into this, and after a while the process gets a bit simpler and faster and better, and we actually have some confidence, and people are joining the team and extending the SaaS app. And one day, it was a sunny spring day, maybe 2016, maybe 2017, I can't remember when, one of the engineers said loudly: I'm about to deploy a change to the system, I'm going to push a change to the system, which, if it doesn't work quite as I intend, could wipe out our entire system.

S: So back then you were doing CI-Ops, like you were doing -

A: Not exactly, no. We had a tool which was basically doing GitOps, but I didn't know that at the time. We were not pushing from CI in the CI-Ops sense. CI-Ops is a massive anti-pattern, so anyone who says that they're doing GitOps and they're doing CI-Ops has just not understood the point. Anyway, that's another story, because on that day the whole system basically got wiped out by that change. And this is a person who often did this kind of thing, so it wasn't a huge surprise, but we didn't manage to stop him. You know, it was like: no, don't do that, and then you hear that click, and then two seconds later, oh f***, and then, yes, the whole team springs into action, into issue resolution team mode. We got the whole thing back up in about 40 minutes, and I said, that was pretty quick, it's quite a complex system, how did we do that? And they said, well, it's a lot of these practices that we learned working at Google and these other places, to do with infrastructure as code and DevOps and automation. But fundamentally we can do this quickly because we have an image of the entire system, not just the cluster but the app and the monitoring and other pieces, all described in various config files in Git, and whenever we make a change to the system we do so by first making sure that the change has been committed and then allowing the change to propagate automatically through into production. So this was the very early version, the version 0.1 element, of GitOps. Today we have much more sophisticated concepts, but notice, there was no requirement that everybody had to use Git and write PRs. It was simply that you couldn't allow a change to happen in the system, if it was a permanent change, without also committing it to the system of record for the desired state, which was stored in a resilient, distributed, authenticated, non-repudiable, secure GitHub instance.

And so I thought, that was pretty cool, and I said, well, what else are we doing that's interesting? And so we talked about that, and we wrote down all the principles that this team had learnt and built themselves from Kubernetes. And they said, look, Kubernetes is particularly amenable to working this way because it's based on declarative config, but we've also applied the concept of declarative config to other parts of the system, like the dashboards for example, and the apps. And we've also made it possible to do things like alerting: when the system which is trying to converge to its correct state isn't doing so, fire an alert, sticking a message into Slack with a link, and you can click on it and open up the dashboard. And we can also force convergence in pieces that Kubernetes is not itself traditionally controlling. So the job of Kubernetes, the orchestrator, is to have an eventually consistent convergence model around the cluster state based on what's in your YAML file. This is again the basic level, and today we have more complicated things with templates and generated things and so on, but at that basic level the YAML file drives convergence in the cluster, and so the cluster takes that YAML file, holds a copy of it in memory and uses that to force convergence.
We had applied that in other areas of the stack, and I said, that's really interesting. And so we talked about it and talked about it and talked about it, and sort of drove the list of principles down down down to just a few, and I was thinking, you know, this is pretty cool, we should talk about this more. And then one day we were talking about it and I just suddenly realized that one word that we could use to describe all of this was GitOps, because the fundamental thing we were doing was making operations automatic for the whole system, based on a model of the system which was living outside the system, and Git was where we had chosen to put that model. But of course there is no absolute requirement that it be Git. In fact, you could do GitOps without any Git at all, if you really wanted to. But then we don't want to talk about SVN-Ops, because when you're in the world of naming things, it really helps if the name is one that somebody else can say. And so then I went downstairs. We were at the time working in Shoreditch, where we still have a small office; now that we have lockdown in London most people are working from home, but we've retained a little footprint in Shoreditch. And we were sharing a building with James Governor from RedMonk. You may have read some of their work; Stephen O'Grady's "Developers are the new kingmakers" is very good. And I went to see James and said, hi James, I just want to run something by you. What is your reaction if I say the word GitOps to you? And he said, well, that's possibly the ugliest word I've heard in a long time, and I'm not sure I can unhear it now. So I thought, excellent, we have a winner! Because you can say this thing and it sticks in somebody's head, they can't undo it, so it gives them a way, a hook, into thinking about something.
Now this is important, because I made it my job to write down a description of what I thought the engineering team was doing, and it was really an operating model, because it gives you a way to think about your entire stack's operations. It also has this very nice security property, because when you make a change you only update the image repos or the Git configs or some of the other stores with immutable information, so that is then pulled into the running system, which gives you a different security model from one where you're pushing directly in through the authentication layer. Anyway, all of these quite complex ideas with deep deep deep ramifications really could be rolled up into a smaller set of principles, which we've talked about a lot, and I can bring up some slides to show you if you want me to - sure - but okay, let me do that in just a second, let me finish the story and then I'll bring up the slides. I'm losing my train of thought now, thanks. We started talking...

S: the complete world description in Git

A: There were several things that made me realize we should talk about this more. One was, I thought what the team had done was important, and it wasn't just about having automated pipelines, because they also needed to have the full description, the way that metadata was used, the ability of the system to update itself, drift alerting, dealing with stacks on top of the cluster itself, at runtime, not just building a machine. All of these things were implied by, but were evolutions from, things like Puppet and Terraform and Chef, which had been around doing 95% of this stuff but in a slightly different way. It had just gone further, and it was in a position where it could actually do much more. Kubernetes was really the breakthrough to make it possible. Now, I wrote it all down and started talking about it, and people said, this is how we do things, or, this is how we want to do things, we're trying to get there, or, we've been talking about what to do, and you've written down a way which captures the idea we had really nicely. Or people would say, I'm trying to introduce a lot of changes into my organization, because I know we need continuous delivery, I know we need deployments and updates and patches, and I know we need elements of automation, and we need scalable management, and we need these things to be correct, due to the automation and the programmatic updates; all of this is GitOps, and you've given us a way, with just one word, to carry this change through the organization. And so I thought, that's really nice, because it's an ugly word and it doesn't always mean exactly what you think it means, because as I said you don't absolutely have to use Git, but if people benefit by having a way to create an entry point into a world where they can make changes that are actually really meaningful and useful to development teams, and hence to IT, that's great.
And so then I realized that we needed to make sure that GitOps was something that everybody could talk about, everybody could do. It wasn't supposed to be like a WeaveWorks product, and so we talked about it in a way that described what we were doing. We had some open source software - Flux, and Flagger, which does progressive delivery in a GitOps ... file, and other things like customer management. All of that stuff was, if you like, reference implementations of GitOps ideas, but there are other people, like CloudBees, people in other companies like Replicated, Amazon, Google, Microsoft, Alibaba, all doing GitOps themselves, obviously sometimes in their own way.

So by talking about it in a way that it wasn't owned by us and anybody could own it, more people described their own vision of how they could do this kind of automation: model-driven automation. And as a result that's allowed people to make their own stories and investments, which is why an industry trend has arisen, because it is actually the right way to do DevOps and CI/CD and management for Kubernetes. It scales, it's secure, and also, the really cool thing is, if you're a developer, you don't need to understand Kubernetes to make changes, if you can be given a way to update the desired state. Now of course there are many problems, like people think, oh, I don't like using Git. Okay, great, use a UI that goes into Git, things like that. There are many things we can do better, there are challenges like how do you manage secrets, but all of these things are gradually being overcome with different solutions, and it's really nice that you can do it in many many different ways. Let me see if I can find some slides for you quickly.

This may even be a public link, so I'm going to also share it with you here, Schlomo and Johannes. This is a slide from FOSDEM at the weekend, a presentation made by our CTO Cornelia. I apologize for the WeaveWorks branding on some of the slides we have not remembered

This is a slide presentation of a new thing called the GitOps Working Group in the CNCF, underneath the auspices of the SIG for App Delivery. So it shouldn't have any WeaveWorks branding on it, it should just be Cloud Native Foundation branding. I'm sorry about that, but what's happened is we've used the Foundation's legal umbrella and auspices for collaboration to bring together about 80 people from about 50 companies talking about what GitOps is, in a way that lets us as a community come up with a robust, if you like, definition that people can then work from. This is great because it means that we stop arguing about whether x or y or x plus y is exactly GitOps, but also it will mean that people can say things like, well, our system isn't 100% GitOps, but we do the first half and then we do this other thing in a different way. Okay, so let's have a quick look. This is Cornelia's bio.

So yeah, I mean, it's about mapping between the model, which is in code, config and potentially some kind of user experience to update it, and the runtime environment. And we see this as a cycle of updating the desired system of record in Git, then deploying changes, observing them and then managing them, which might mean updates back to Git. So in the GitOps world of WeaveWorks today, one of the things we're doing is also sending information back to the desired state that reflects what the runtime state thinks it's doing.

So this is typically how you merge with a continuous integration tool like Jenkins, GitLab, Travis CI, GitHub Actions and - hang on a second, please -

Okay, so we see the world really as two parallel cycles: what we think of as Dev, test and CI on the left, and then runtime, usually production, on the right. And the convergence here is an eventually consistent continual change, always driving to a correct state. So it isn't the case that you have an automated pipeline and do a push and update, and then that's GitOps. It is the case that there are agents, inside Kubernetes in this case and potentially other systems, that are always pushing for convergence between the desired state and the running state. That means that if you drift from the correct state, the agents will try to fix your application back to the correct state and alert you if they can't, for example. And it means that if you ask somebody, do you know if your fleet of clusters is in the right state, then if they've had no alerts they will say, yes, I think everything is in the correct state. So that's very powerful.
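The convergence behaviour described here can be illustrated with a minimal Python sketch. It is not how Flux or Kubernetes is implemented; the `Cluster` class, the `read_only` set and the alert format are assumptions made up for this illustration. It only shows the shape of the idea: an agent compares desired and running state, fixes what it can, and reports what it cannot.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for a running system; live state is a plain dict."""
    running: dict = field(default_factory=dict)
    read_only: set = field(default_factory=set)  # keys the agent cannot change

    def apply(self, key, value):
        """Try to set (or, for None, remove) one item; return False on failure."""
        if key in self.read_only:
            return False
        if value is None:
            self.running.pop(key, None)
        else:
            self.running[key] = value
        return True


def reconcile(desired: dict, cluster: Cluster) -> list:
    """One pass of the loop: fix every difference, collect alerts for failures."""
    alerts = []
    for key in sorted(set(desired) | set(cluster.running)):
        want = desired.get(key)          # None means "should not exist"
        have = cluster.running.get(key)
        if want == have:
            continue
        if not cluster.apply(key, want):
            alerts.append(f"drift on {key}: want {want!r}, have {have!r}")
    return alerts


# The desired state would normally be read from Git; here it is a literal.
desired = {"web": "v2", "dashboard": "v2"}
cluster = Cluster(running={"web": "v1", "debug-pod": "v0"})
alerts = reconcile(desired, cluster)
print(alerts, cluster.running)
```

A real agent would run such a pass continuously rather than once, which is what makes "no alerts" a meaningful signal that the running state matches the desired state.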

And this means of course you can use any CI tool as a foundation for your GitOps, provided you also have the operational pieces, which we, WeaveWorks, provide for you. So commercially, as a company, we now focus on the stuff on the right, and we recommend that you use one of the known Git providers and a good CI tool on the left. So we work with GitLab, we work with GitHub, etc.

And then this is a slightly more complex slide, which I'm not going to try to explain, but essentially with Flux 2, which is a reference implementation of GitOps, we've discovered that it's useful to think in terms of an additional cycle. For example, you might compile some of your config from documents for policy, from YAML files or even from TypeScript.

So here are the principles that we've found to be fairly stable. Notice that I mentioned earlier the GitOps Working Group, which has taken a version of these principles and is working on and expanding that. So WeaveWorks wrote down some thoughts we had about this and just essentially nailed them to the door and said, hey everybody, let's take this as a starting point, version 0.01, and everybody can iterate from here. So where we end up in the community process may be slightly different, but I hope it's at least consistent with this. So here you can see we want everything that we can describe to be described declaratively. Having a declaration of the intended state allows us to do continuous operations, instead of having a set of imperative commands that may or may not succeed and have to be checked to see if the outcome has been correct, which is something that's hard to scale to large-scale systems and to many systems. We recommend using Git because it has beautiful properties for the version management of these state descriptions. You can use other things, and sometimes you want to use Git as well as additional stores; we have a tool called libgitops that lets you do that. Once changes have been approved, which may be a manual policy step for some, it is then possible for them to be automatically applied to the system by agents, and those agents continue running to ensure correctness and act against divergence.

And now the GitOps Working Group: you can see there were some interesting companies involved in creating this. It was the first time we got Microsoft, Amazon and GitHub into the room at the same time, for a few seconds, which is really good, and then a bunch of other folks joined, and it's very much an open process under the Foundation, which is all very good. This is the CNCF radar, showing that Flux is a very popular tool, with Helm, for this. SIG App Delivery is the CNCF's place for discussing CI/CD and app delivery and GitOps management, and that's how they're related. So I guess that brings you up to speed. What I haven't talked about is some of the interesting discussions around, well, how do you do progressive delivery, or what's the data story, what's the secrets story, or how do I generate the right number of repos if I have a thousand-plus fleet, or actually how do I do cluster management. All of these things I'm happy to try and talk about. But I hope I've given you a good overview up to the present day.

S: Thank you, thank you, this is really good. So first of all I would like to mention that problems like secret management are not related to GitOps. We had these problems before, and we've been using Git before, so I think that everything that was done properly before continues to work properly, and everything that was a bad practice before GitOps continues to be a bad practice, even after introducing GitOps.

A: Yes, that's true, Schlomo, but let me put it another way. When we first clarified what we thought GitOps should be, we said, look, this is an evolution of DevOps, it's adding the operations piece, the missing piece, to the DevOps pipeline and developer automation concept, and you can do everything in this way now, and then you can have automation for your operations as well as for your Dev. And people said, well, okay, we've got questions, and we had this FAQ, essentially. The first question everybody would ask is, what about secrets? And I'm like, what do you mean, what about secrets? And they would say, well, you say everything has to be in Git; does that mean secrets should be in Git? We would say, well, no, okay, so you've got a good point, we shouldn't say everything has to be in Git. So we have to clarify these assumptions, otherwise, you know, the personality of the technical mind is usually to ask questions as a way of clarifying what's going on. So now I think people have a better understanding, and of course you're correct, Schlomo, but people didn't realize that, because we were thought to be absolutists, saying that everything must be written in Git exactly. It's not the case.

S: I mean, I have this conversation almost on a daily basis, what about secrets and Git, partially also because I wrote a policy recently that explained how to put secrets in Git, obviously encrypted. But then the question is, well, why do we need to version them whatsoever? And I say, well, if you don't version them in Git, where do you version them? Or: we don't version them. Well, whoever allowed you to have unversioned content going into production? So actually, in my opinion, you should version secrets, and if I have a versioning system for 99.99% of all my strings or text in the application, why not use it for this last percentile as well?

A: How about this: I mean, typically you make a complex change to your application. You change the app, you add services, you might change the data model a bit, you might make some changes around the permissions and secrets, you might change the dashboard. You roll out all of these changes as part of an upgrade or an update. If you want to roll back, you want to roll back all of the changes, not some of them, right? You need to be either on the updated version or the previous version. You can't have a dashboard that is on the old version looking at the new code, or the new dashboard looking at the old code, most of the time. Same thing with secrets. I mean, people who think you shouldn't version control your systems are, in my opinion, stuck way in the past. I know that you're fighting this, so you're sympathetic, but obviously you might also be using something like Vault, we love Vault. You know, Sealed Secrets is great, that's one way of doing it. There are other encryption ways of doing it in Git. Vault is also a very nice tool, because it integrates with other enterprise secrets systems and gives you a central panel. You can make this work in a GitOps style too; there are so many things that you can do.
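The point about rolling back everything together can be sketched in a few lines of Python. This is a toy model under stated assumptions, not real Git: the `DesiredStateRepo` class and the keys `app`, `dashboard` and `secret_ref` are hypothetical names chosen for the illustration. Each commit is a snapshot of the whole system description, so a revert moves the app, the dashboard and the reference to the secret version back as one unit.

```python
from copy import deepcopy


class DesiredStateRepo:
    """Toy stand-in for a Git repository: every commit is a full snapshot."""

    def __init__(self, initial: dict):
        self._history = [deepcopy(initial)]

    def commit(self, **changes) -> int:
        """Record a new snapshot with the given keys changed; return its index."""
        snapshot = deepcopy(self._history[-1])
        snapshot.update(changes)
        self._history.append(snapshot)
        return len(self._history) - 1

    def head(self) -> dict:
        return deepcopy(self._history[-1])

    def revert(self) -> dict:
        """Re-record the previous snapshot as a new commit; history is kept."""
        if len(self._history) < 2:
            raise ValueError("nothing to revert to")
        self._history.append(deepcopy(self._history[-2]))
        return self.head()


initial = {"app": "1.0", "dashboard": "1.0", "secret_ref": "db-creds@1"}
repo = DesiredStateRepo(initial)

# One upgrade touches the app, its dashboard and the secret version together.
repo.commit(app="2.0", dashboard="2.0", secret_ref="db-creds@2")
upgraded = repo.head()

# Rolling back restores all three at once, never a mix of old and new.
rolled_back = repo.revert()
print(upgraded, rolled_back)
```

Note that only a reference to the secret version is stored here; whether the secret value itself lives encrypted in Git or in an external store is a separate choice, as the discussion above makes clear.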

S: I mean, I always ask people, well, if you keep your secrets somewhere else, for example in Vault, how do you apply the four-eyes approval process there? And if the answer is, we don't - I say, why do you not have that four-eyes approval process for such an important change like a secret, which will take your entire application down if it's wrong? And I find this to be a very circular discussion, which in my experience always ends up with, well, in the end we do the same as we would be doing with Git, just somewhere else.

A: You know, I mean, we really don't want to be overzealous about insisting that every single thing has to be done this way. I don't know if you're familiar with George Orwell's rules of clear writing, have you come across this phenomenon? It's really good. Let me see if I can bring it up on Google. Whoops. George Orwell "clear rules for writing pair english"

six rules should be five, someone should have told him

here we go

Never do this, never do that. The last rule is "break any of these rules sooner than say anything outright barbarous", you know, and GitOps is the same, I think. By all means break the rules. If there are rules, good, but also ignore them if your system needs something different for a reason. Let's not be overzealous.

S: Yeah, I mean, you mentioned the declarative descriptions to be the base for GitOps success, which is really very close - oh, can I share the screen please? Let me stop sharing so that you can share your screen - yeah, so this is really close to what I'm trying to show people or to explain, which is that for me GitOps is actually a way to rescue ourselves, our systems, our environments, our organizations even, from a lot of the challenges that we face, and I always try to show that the core principle is actually declarative descriptions. That is the one change that allows all the other things to happen. The reason I explain it that way is that separating declarative descriptions from deployment automation allows you to have separate tests, where you test the automation for correctness and the descriptions for compliance. If you mix the declarative descriptions with the deployment automation, like in the good old Puppet times, then you can't test them independently, or if you have a bash script that does an important job, you can't test the correctness independently from the compliance - right - so for me this is actually the core principle, and I find it very hard to convince people to really go this 100% - everything has to be declarative.
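A small Python sketch may make the separation of the two kinds of test concrete. Everything in it is an assumption for illustration: the shape of the description, the two policy rules (pinned image tags, at least two replicas) and the function names are hypothetical, not taken from any tool mentioned in the interview. The compliance check reads only the description and deploys nothing; the automation is generic and can be tested for correctness with any description, compliant or not.

```python
def check_compliance(description: dict) -> list:
    """Policy checks on the description alone; nothing is deployed."""
    problems = []
    for name, svc in description.get("services", {}).items():
        if svc.get("image", "").endswith(":latest"):
            problems.append(f"{name}: image tag must be pinned")
        if svc.get("replicas", 0) < 2:
            problems.append(f"{name}: needs at least 2 replicas")
    return problems


def deploy(description: dict, target: dict) -> dict:
    """Generic automation: make the target match the description exactly."""
    target.clear()
    for name, svc in description.get("services", {}).items():
        target[name] = dict(svc)
    return target


good = {"services": {"web": {"image": "web:1.4.2", "replicas": 3}}}
bad = {"services": {"web": {"image": "web:latest", "replicas": 1}}}

# Compliance test: inspects descriptions only.
print(check_compliance(good))
print(check_compliance(bad))

# Correctness test: the automation must faithfully apply whatever it is given,
# including removing leftovers that are no longer described.
target = {"old-job": {"image": "job:0.1", "replicas": 1}}
deploy(bad, target)
print(target)
```

With an imperative script that both decides what to deploy and performs the deployment, these two questions cannot be asked separately, which is the problem described above.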

A: Right, and it's hard. I mean, what if you don't have a way of declaring things that's good? Let's take an example: workflows. Show me a good declarative workflow model. It's hard; they end up being basically a nested tree of steps of the workflow. Why don't you just write down the workflow and not put it in XML, for god's sake? Anyway, look, I think you should be part of the GitOps Working Group. It's a contributor-driven community, so please do get involved and help to shape these emphases, because this is going to get nailed on as a definition of GitOps, and your contribution could be very important in that process. Yes, let's not be absolutist. We hope to describe more things in the future.

S: We're lucky my company is already a CNCF member

A: You don't even need to be a CNCF member at company level, any individual can just contribute - oh nice - Very good, it's just Linux anyway. But another objection people have is: well, what if I don't have time to go through Git? Am I saying that I have to do a transaction in Git to do an update in the operational system? Sometimes that's too high latency, or sometimes I'm unsure of my change, so I might do a branch. How do I do a provisional transaction and look at the effects of a change on a branch before committing that to the main system? So people are asking these questions now, which are very interesting questions. The other one is: how many repos should I have if I have lots of clusters, what's the best way to organize my repos, do I want to have more repos than clusters, this kind of thing. Some of these are a matter of taste, but we need better answers to some of them than people come up with today, and we're going to get that soon. But that's exciting, it shows you that it's not a trivial concept.

S: Yeah, actually you mentioned that CI-Ops is an anti-pattern - yeah. - As it happens, I get the impression that we as an industry are currently very much stuck on CI-Ops. We have this map which we try to use as a discussion tool, and I have the theory that we come from the lower left corner, where it's Dev against Ops and we do manual ops. As an industry we took a great step towards CI-Ops, which is a big step both towards Ops automation and towards DevOps, which is a culture thing, and now we discuss what the next step is.

A: yeah look, first thing is, can I steal your slides? They're great! I love them both, especially this one, this is really really cool. And actually it's very important to be respectful of the fact that everybody starts in a different place in the journey. Maybe you're starting at the bottom left, and that's okay, because maybe you have a system that doesn't need updating frequently, or maybe you have a large team of people who approve changes. We spoke to a bank where marketing needed to approve any change to the mobile app because the GUI might change. What happened was, there was once a change that the marketing people were very embarrassed about, and they got angry and they said, you may never make a change to this app without showing us first. So the automated pipeline included filing a ticket in Jira and sending an email to the marketing team. They had to check the GUI was still okay before it could go through. I mean, we laughed about it, but that was what the organization needed for that app, and the marketing team needed to have a stake in the process. And then of course there are small-scale changes, occasional changes: if you do something manually, or you do something using kubectl or SSH or through a GUI tool that does an update, you know, the GUI tool updates Kubernetes and the application, you do this with something like Rancher, which is not a good Ops tool, and then you don't know whether your system is in the state that the GUI tool just told you it was in. This is okay for small scale, but when you have more than one cluster and more than one team, more than one change per day, you do want to be automated and programmatic if you can be.
So you start adding your CI, and the CI lets you run scripts. But the problem is, if you run a lot of scripts and something breaks halfway through, you don't know whether you're in a correct state, because your mechanism for verifying correctness is based on the CI tool updating the system and then looking at a monitor to see if you've got the results you wanted. That cannot prove or assert correctness, it can only give you a way of observing that correctness may indeed be the case. Whereas what you have with GitOps is, inside the cluster you have agents that can see everything in the running state and check that against the desired state, so they can tell you whether the cluster is in the correct state or not. So you have guarantees. But maybe you don't need those guarantees, maybe you're happy having a system that's having a few changes a day.
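
The idea of an agent that compares the running state with the desired state can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Flux or any other agent: the state is modelled as a plain dictionary from resource name to spec, and `diff` and `reconcile` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

State = Dict[str, dict]  # resource name -> spec (a simplification)


def diff(desired: State, actual: State) -> List[Tuple[str, str]]:
    """List the drift between what Git declares and what is running."""
    drift = []
    for name in sorted(desired):
        if name not in actual:
            drift.append(("create", name))
        elif actual[name] != desired[name]:
            drift.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            drift.append(("delete", name))
    return drift


def reconcile(desired: State, actual: State) -> List[Tuple[str, str]]:
    """One pass of the loop: change `actual` in place until it matches `desired`."""
    actions = diff(desired, actual)
    for verb, name in actions:
        if verb == "delete":
            del actual[name]
        else:
            actual[name] = dict(desired[name])
    return actions


desired = {"web": {"replicas": 3}, "db": {"version": "13"}}
actual = {"web": {"replicas": 1}, "old-job": {}}
print(reconcile(desired, actual))
print(diff(desired, actual))  # an empty list means the states agree
```

A CI script runs something like `reconcile` once and then stops. An agent inside the cluster would call `diff` repeatedly, so an empty result is an ongoing statement about the state of the cluster and not a one-time observation.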

S: wait, i don't believe that the observation of the correct deployment of cluster resources actually guarantees, that the application fulfills its purpose.

A: no it doesn't guarantee that it fulfills its purpose

S: i still need external monitoring to make sure, that my application is actually serving my customers or doing its purpose, beyond the fact that it's all up and running.

A: right well there's two parts to the answer to that: one is yes you're right. So for example, maybe i deploy a container. The container starts and then immediately crashes and then it reboots and then it immediately crashes. Is that a correct or incorrect behavior? So I need to have some tools to help me detect this and do stuff about it. Secondly, this goes back to your point: I can't declare everything today. So if I could describe my customer SLAs declaratively and verify those inside the runtime, then I could check the application was doing what it should be doing for my customer. But today we don't have many tools that help us to do that. So yes I agree with you, that's part of the gradual transition which is why I'm saying that I think it's okay to be somewhere in the middle of this picture a lot of the time

S: so this picture is kind of a discussion tool I came up with, because I realized that, at least for me, DevOps is really a culture thing. DevOps is the way we improve how people work together. And there's actually another dimension, which is technology, which is the question of how far we automate processes and essentially take our hands out of production. Yes, my personal belief is that if we push both to the maximum, like maximum automation and maximum people working together on the same eye level, then of course we get to hands-off operations, and I see GitOps as a major stepping stone towards achieving that goal. I hope that GitOps will be the tool to make that happen, but maybe there will be some other evolution that comes after GitOps. I'm pretty sure that GitOps is the next correct step that we should take in that direction.

A: I think we're still exploring the technology consequences, and until we finished exploring them, we won't know what the cultural impact is.

S: so you see us currently working more on the tech side than on the culture side actually?

A: the tech side is easier to move forward quickly, because culture change can be slow. I mean, I believe that we're having a conversation, I'm in England, you're in Germany. We both live in a country which has large organizations which are, you know, very regulated: banks, utilities, hospitals. And so, you know, culture change is slow for a reason! People wish to be careful, conservative about introducing change too fast, because it might have an impact on people. So the technology can race ahead at times. I think we're having one of those moments today.

S: my favorite hobby is to bring up new technology that helps to change the culture. As I believe that changing culture through technology is much easier than changing culture without technology, without an accompanying change in how we work. Because I think

A: right very very sensible to say that

S: think we were like animals of habit so to speak. And if you change our habits it helps us a lot to change our culture and our thinking and actually even our beliefs, like if you do it often enough in a different way you really believe that this way works well and eventually you come to believe that this is a better way. But you need the experience to believe that.

A: it's like the unexpected changes that occur once a technology has been adopted. Let's take for example the mobile phone. You know, when the mobile phone first appeared, it was big, it was hard to move around, and people would use it instead of essentially a radio system. But when it got small enough, it was possible to be contactable. But then what really made it break through was the idea that it could be used as a computer. And you know, also other things like photographs and so on, and people thought, okay, it's actually useful to have a powerful computer on your person. And that creates social change, and so I see for example my parents, who are, you know, in their 70s and 80s respectively, happily using WhatsApp to do emojis with each other, because that concept of group chat has been introduced to them through technology. Now I think that's a bit of a funny, silly example, but I do think that actually you're quite right, that it would be impossible to introduce that sort of change without putting the tool in their hands to make it easy.

S: so about the tech progress, I have another picture, sorry it's not colored, which tries to depict the different layers that we find in a GitOps world. So we have some sort of utmost layer, which is the Git repository for one application. And I can do GitOps between this Git repository for one application and, let's say, a Kubernetes namespace or even an application running inside. And if I go one step outside, I have maybe a Git repository that is coupled with a namespace, where I say that whatever is in the namespace has to be in that Git repository. And so on down, till we come to: well, let's have one Git repository for the Cloud account, and everything that's in the Cloud account should be there. And I have the impression that this is already the place where GitOps actually fails at the moment, because we don't have the tools. I haven't seen many places that play this GitOps game on the Cloud account level, where they say, if I delete a file, or a line from a YAML file, then a whole bunch of CloudFormation stacks will be deleted as a result. Or something like that.

A: So I think we will see that in the future

S: because I believe that the question, how do you delete stuff, is a good question to understand, how honestly people actually implemented GitOps.

A: so we had a bug in Flux for a while, well no, not a bug, there was a complex feature around the difference between moving and deleting elements of a repo. That's all been fixed now and I can't remember the technical details, but it was quite challenging mentally to follow through all the reasoning. Now it's all done correctly, but you know, there's some real deep technology trickery that you need to think through with some of this stuff. And so you have to be super careful what delete means, for example. I mean I would say, this is a good list. I like the way you've got it as a sort of production-ready journey. There are other things as well, so you've got tools like ACK on Amazon, and Microsoft and Google have one as well, which is basically a Kubernetes CRD for describing bindings to external services like RDS and Route53 and all those things. Then you have service management, so Flagger allows you to state properties of a canary, A/B rollouts, that kind of thing, and manage those declaratively. Grafana Labs has a tool which lets you manage Grafana dashboards declaratively. We have CAPI in Kubernetes, which essentially goes hand in hand with GitOps. Also ways of describing add-ons to Kubernetes clusters, so typically you might have a base cluster and then Prometheus, Helm, something else added as a standard baseline on top of that. And then you might have a machine learning environment consisting of TensorFlow and two or three other libraries, and then finally you might have the application. So dealing with all of that through GitOps is actually great, but does require more elements of your diagram. I really like the diagram.
Down at the bottom of the stack we've experimented with this tool Weave Ignite, which is a wrapper around Firecracker, which is an Amazon tool written in Rust to do a Virtual Machine for Lambda. It's a very lightweight, very secure, fast-starting Virtual Machine that potentially could be used, for example, in edge computing by Telcos and train companies. And Ignite gives it a nice Docker-like API for people who are more developer-friendly. We've got some declarative config in there and I can imagine there being more. I think networking will have more and more declarative config, because networking is the one thing where it would be really great if somebody could automate operating that stuff. And so on, so yeah, lots to say about this. It's just an emerging space.
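
Coming back to the question of what delete means: the difference between moving and deleting an element of a repo can be shown with a small Python sketch. It is an illustration under assumptions only and says nothing about how Flux solves this internally. A repo revision is modelled as a mapping from file path to the set of resource identifiers declared in that file, and all names and paths are hypothetical.

```python
from typing import Dict, Set

Repo = Dict[str, Set[str]]  # file path -> resource identifiers declared there


def declared(repo: Repo) -> Set[str]:
    """All resources declared anywhere in one revision of the repo."""
    found: Set[str] = set()
    for resources in repo.values():
        found |= resources
    return found


def to_prune(previous: Repo, current: Repo) -> Set[str]:
    """Resources that disappeared from the whole repo and should be deleted."""
    return declared(previous) - declared(current)


def naive_per_file_prune(previous: Repo, current: Repo) -> Set[str]:
    """Per-file comparison: wrongly treats a moved resource as deleted."""
    pruned: Set[str] = set()
    for path, resources in previous.items():
        pruned |= resources - current.get(path, set())
    return pruned


previous = {
    "apps/shop.yaml": {"deploy/shop", "svc/shop"},
    "infra/db.yaml": {"stack/db"},
}
current = {
    "apps/shop.yaml": {"deploy/shop"},
    "apps/net.yaml": {"svc/shop"},  # svc/shop was moved, not deleted
}
print(to_prune(previous, current))
print(naive_per_file_prune(previous, current))
```

In this example only `stack/db` has really been removed from Git, so only it is pruned by the repo-wide comparison. The per-file variant would also delete `svc/shop`, although it has merely moved to another file.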

S: so do you agree that actually GitOps is a bit slim the lower we go in the stack and that maybe hardware is not yet a domain that has been conquered by GitOps?

A: agreed

S: okay, well my main question was whether you would confirm this picture. So there's another thought that we have, which is again on this people versus automation scale

A: hold on if I may. I like the picture, I would say I confirm it. I would say that it is a particular point of view down a very important narrow middle track of the world of GitOps and I think it's good for that, but it's not a comprehensive picture

S: no no no, my point here is really, how far down in the stack are you willing to do GitOps nowadays. And my observation is, many people are maybe in the first two or maybe three levels, but I don't see people really working on the fourth level, which is the AWS account with GitOps. Like with full-blown GitOps, meaning if you delete it in Git and commit and push, then it's actually gone, including the data that was there. So the other thought that I had was, with this DevOps people versus automation technology landscape: I have the impression that at the moment, as an industry, we're investing more into the DevOps culture change than into the tech change. Which leads to some sort of roundabout way towards reaching this goal of hands-off operation. And then I think 10 million Euro is not much compared to what many companies spend on this culture change. So I'm always wondering how I would start a discussion to spend maybe a bit more money than that on the required technology change, to actually get where we want to be, because otherwise we'll just stop with a great culture and a lousy technology.

A: good question. I don't know the answer. I mean, we sell products, I don't think anyone's paid us $10 million for one of our products yet, it would be a good thing to do of course. I think that typically we would like to say that a big organization can achieve meaningful change using six and seven figure sums of money. It isn't necessary to use eight figure sums of money to achieve change. You should be moving in smaller steps. Because, and this is part of the underlying conceptual principle, we iterate in small steps to make our journey the right one.

S: okay thanks for the insights! It's a lot of stuff to think about. Johannes, I think you also had some questions prepared

Johannes: that's all right. One point I find particularly interesting is the definition of GitOps. The slides you just shared, Alexis, about the GitOps Working Group and the four principles, I think you call them. This definition has actually been following me around for some time, because I'm writing an article about GitOps tools and comparing them. So I had a look at, I think it was almost 50 tools that called themselves GitOps, and they had different definitions of what GitOps is really about. And I think you already had those principles or similar ones in one of your blog posts. So I had a look at them and tried to match them with what my ideas about GitOps were, and for me it turned out to be three things. Some of the tools had them all in common and some had only one or two of the properties: there's the reconciliation loop and the pull principle, I think those are combined in the agent software principle you had on your slide, and another one was operations by pull request. And my question is, and you also talked about not being too overzealous, but it was difficult for me to judge which tools are actually GitOps tools. Do they have to have all of those properties, or is it enough to have one of them, or two of them? Or did I miss anything, is there anything else that they should fulfill to be considered GitOps?

A: so what you're raising is an interesting question, because I suppose that my view is GitOps is not a property of a tool. It is a property of somebody's actual system, so it's an end-to-end property of a system, just like security and reliability and availability are properties of a system. So you wouldn't say that a web browser is an availability tool, but you might say that if a system is accessed mainly through a web browser and the system itself is highly available, making web browsers the main way in is part of making that possible, because everybody has a web browser, if it crashes you can restart it, blah blah blah. So you know, each individual system that somebody sets up, whether it's in a company or in a Cloud or some other way, can be said to be a GitOps system or not. And so then each tool that is used to make that possible could be regarded as a partial enabler of the whole system's GitOps property. So for instance you could imagine having new tools for making Git itself a much more pleasant and humane experience for normal people, which is something a lot of people want, or lots of user interface stuff that is a beautiful user experience and then underneath plugs into the system so that everything's done correctly, so that you don't have changes that are not tracked, for example. Such a tool could be argued to be a GitOps tool, because it's providing fundamental enabling technology for part of the overall process.
I think that's okay. By the same token, you know, I've mentioned Flux and Flagger a few times. If you want to use Flux to manage a fleet of Kubernetes clusters, which you can, then you also use the cluster management tool, Cluster API, CAPI, to do that. That lets you combine a declarative definition of a cluster with a tool for enforcing declarative correctness continuously, and patching and supporting things continuously. So you've taken two technologies and put them together to achieve a fleet management system that works through GitOps. So again it's the end system that is GitOps or not, and the individual components are enabling GitOps to happen in that way. And so the question then becomes, well, what's the best way to talk about these tools. I think it's okay for somebody to say, you know, if you use GitLab then you can achieve GitOps. Because I believe that to achieve GitOps you should use a decent Git tool like GitLab or GitHub and the operations tools, because GitLab is not a tool for managing operations, doing security patches, rolling out platforms, keeping them correct, doing service rollouts. But it's a damn useful tool for plugging in the Dev side of all of that. So you know, I think that means that when you see people like GitLab running around saying, we actually invented GitOps and everything is about us, you have to let them do that, because that's their way. It's sort of centering everything around their universe. But the reality is that the whole system will be made up of GitLab and many other components, hopefully obviously from WeaveWorks, or GitHub and WeaveWorks, or maybe both, sometimes Atlassian, Gitea, so many other things. So really, I think we need to make sure that people understand what a GitOps system is and not get too hung up about whether tools call themselves GitOps or not.

S: I really like how you put GitOps clearly on the behavioral scale, so GitOps is an attribute of the behavior of an entire system. Same as I always try to tell people that DevOps is not a tool, but DevOps is the result of doing the right thing between people, in how people work together. So maybe GitOps is also much more the result of doing the right thing in your systems, no matter which tool you use and actually how the people work together.

A: only those things that may be described, programmed, and observed can be described to be automatic.

S: yep very nice

J: if you try to put this definition on the map that Schlomo showed, with the production readiness kind of decreasing the nearer we get to the physical hardware, what do you think would have to evolve to get this production readiness onto those lower layers? So my question basically is, at the moment, would you say there is tooling to do what Schlomo said, to do GitOps on this AWS account level?

A: so what I see happening is people adopting GitOps because it's the right operating model for Cloud Native, and associating it with their Kubernetes machinery and systems and services. On Amazon for example EKS, that's great, and now EKS Anywhere of course. That will probably mean that the more systems touch Kubernetes and containers on that Cloud provider and their associated offerings, the more GitOps will creep in and become a way to operate associated components. Like the ACK library that I mentioned, which joins the world of Kubernetes with controlling external services. Now that means that pretty soon, if containers and obviously functions and Micro VMs become the norm for how we roll out our applications globally, which I think is happening more and more, then eventually all of the other tooling will catch up, and that will mean that there will be demand for, and an expectation of, declaratively managed fundamental tools for the bottom layer in your picture. And why not. And actually I discovered, when we talked about GitOps, that for example there are people who make hardware firewalls or even some other network switch type technology who have been working in this way for some time. The standard in some of these segments of the industry is that you have declarative tools for provisioning your setup in your hardware, but those are esoteric, they're associated with a narrow function, however important. Now what we're doing with GitOps is seeing things pooling around much larger, deeper pools of technology, like the whole Kubernetes as the platform concept. So I think that will create more momentum for more change, which will eventually lead to all of it, but we're not there yet and that's okay.

A: and we haven't mentioned Terraform yet, I think that's a great GitOps type of tool, I think it doesn't have drift alerts in the free version, but I think if you pay for it, you get some kind of more GitOps automation hands-free alert system

S: so you mentioned applications and managing more via GitOps. What's your opinion on managing data with GitOps tooling?

A: Joe Beda was asking about this on Twitter this week. I've seen a few things. There are people doing things like, you know, DVCC or whatever it's called, which is the open metadata store for machine learning applications. A couple of years ago there were tools from Replicated, an LA-based Kubernetes company, for doing some database schema management through GitOps. I think this will come into the market in stages. For most of the apps that people are writing around Kubernetes, you can get a very long way by not worrying too much about the data versioning, but of course eventually it'll be really nice to correlate between data snapshots and things in the application stack in a more meaningful way. I see it as open territory for people to go after. Don't forget though, if you let people version control their data, they'll want to roll back, and that may lead you into some computationally adventurous parts of the landscape.

S: I mean the reason I'm asking is, that my personal experience is, that kind of the day 2 operations problem is actually one of the biggest problems for IT. Which is after setting up something, how do you actually run it for years in production doing updates, upgrades, backup, disaster recovery, performance tuning, whatnot. Which is for me all kind of some day 2, day 3 problem and

A: then the answer is, you don't do it by fiddling with a GUI, you do it through Git, and that's why you need tools from WeaveWorks. So yeah

S: so what does the WeaveWorks tooling offer for day 2 operations? I mean, my data is in RDS, and it's fine there. But eventually I need to update or upgrade my RDS and maybe do a test of my disaster recovery procedure, or something like that. How will your tooling help with that?

A: we will help you upgrade everything else: the Kubernetes, the app, the services. We can do progressive delivery to do a test against the new version of your data service, we can work with the ACK libraries to declaratively describe which version of RDS we're pointing to. What we cannot do is give you GitOps management for what's inside the database itself. But we can help you with everything else.

S: that's the hard part, that's what I'm looking for, maybe it's the missing piece at the moment

A: the missing piece, I think, is a lot harder, but you know, for most people, having a way to get to the hard part is very challenging. That's what I would say, because they don't know when they are being successful. What you really want is for them to have automation as much as possible for the bits that you describe, Schlomo, as not hard, and then when they get to the place where it's actually difficult, they can really focus on the problem and maybe solve it manually. I mean, I remember a few years ago, people would do highly available systems, and in about 2010 you would typically not worry too much about network partitions. You'd focus more on maintaining consistency and availability for a non-partitioned system. But as more and more workloads moved into the Cloud and systems got bigger, and other kinds of apps were introduced, people began to experience more network partitions, and now you can't really release a piece of distributed technology middleware and not have a strategy for dealing with reconnecting after a network partition. And actually people want that to be automatic, but 10 years ago there was no way of dealing with it, and then five years ago the state of the art was that you have support for a manual merge, but you don't have automated merges. And so I think this will also happen with databases: initially people will ignore it, then they'll say it's a pain, but they'll ignore it, and then they'll say it's a pain, but you have to sort it out manually, and then there'll be guidance, there'll be documentation telling you how to do it, then there'll be tool support, and then finally there might be a better solution

that'll take ten years

in fact, making a prediction: your problem, Schlomo, will not be solved in the next four years.

S: that's okay, my pension's still much further out

A: we'll be able to come back again in four years' time, and we'll be talking about how great everything has become, how much easier, how we can do this, how we can do that, culture has begun to accept automation, and policy is the way forward. What about this database problem? Nah, still not solved.

J: Four years ago, the GitOps term wasn't around, so it's quite some time

A: yeah, there you go. Well you know, there are some important precedents that should be mentioned. So originally, in the 90s, the work by Mark Burgess with promise theory really has the main ideas all there. The idea of basically having a contract around a system, which is then provided to a system that has to honor that contract, which it does using a different mechanism, which is independent from the mechanism of delivery and creation. And then in the later 90s apparently Microsoft talked about the model-driven data center, and tried to do everything in this way, and then you had Subversion and Martin Fowler writing about pipelines and CD, and then you have the Jez Humble book with Dave Farley, Continuous Delivery, 2010. There are sections of that book that say operations should be autonomous based on a description

S: so if you make a prediction, in five years where will we be?

A: hopefully not all on Zoom working from home all the time

probably not yet on Mars, so we're all still gonna be here.

I think that a lot more systems will be managed in the way that they are using Kubernetes today

So I think containers will evolve, there'll be Micro VMs, there are things like functions, so the unit of compute, the packaging, will change perhaps, but there'll still be more emphasis on the declarative management of that thing. And then we'll continue to have declarative models for systems management like Kubernetes, and maybe other tools will do it too, like Nomad, I don't know much about how that works. There'll be more support inside the cloud providers for a range of scenarios, and the majority of new applications will be written in a way that you just focus on the code and almost all of the config and operations is driven declaratively. And we'll probably have in five years twice as many programmers as we do today, because that's how the numbers work. And so yeah, I think this will become accepted more as a normal way of working, people will be habituated to saying: give me the system I want. Don't make me write the system for you, I'll tell you what I want, you do the how, like in your chart and your slide, Schlomo. So that turns everybody, I guess, into a sort of product manager, a digital champion, and they'll all still be working, you know, on laptops in cafes

J: Alexis, you talked a bit about Weave and the kind of things it does for GitOps, being a reference implementation and all that, and there are other operators around. Could you describe the common history and collaboration with, for example, the Argo project, or Rancher has Fleet, and maybe there are even other operators. Is there some collaboration going on? Or is there some collaboration being planned within the GitOps Working Group?

A: I don't think there will be any collaboration in the GitOps Working Group around implementations. There'll be collaboration on reference materials and probably on tests, that will certainly be running code, but there won't be just one reference implementation. Rancher Fleet, I don't know much about it, but we looked at it, and it looks like an extremely belt-and-braces approach, more like a sort of Puppet-esque way of doing GitOps for starting out more than one cluster, but it doesn't really do the enforcement operationally, so it has no day 2, day 3, day 4, day 5 story. There can be divergence. To enforce convergence you have to redeploy, I think. I could be wrong, so don't hold me to that. Argo and Flux, those are the most similar tools. Flux has Flagger, which is much more sophisticated than Argo Rollouts. The main difference is Argo CD has a concept of an application, which is very opinionated, and that's perfectly okay. But it's tied up with the whole architecture, so that the security model, the user interface, and the application design all fit together into a single workflow deployment concept. Which means that, for example, you're committed to using Argo's own role-based access control and user database rather than Kubernetes RBAC, so they're not integrated. Whereas with Flux, for example, everything's driven by Kubernetes RBAC, which I think is much safer. Also if you look at the critical vulnerabilities, the CVEs, Argo has over 300 reported on tools like Snyk, because it has so much extra software, and I think you download, you know, over a Gigabyte of images and packages, maybe one and a half Gig. Whereas Flux I think is currently 40 Megabytes, so it's really narrowly focused on doing certain things. Flux 2 introduces the notion of a pipeline and it's very extensible, so that's the one that's being baked into Cloud vendors.
Whereas Argo, as I said, is more opinionated. It is extensible, but by extending the opinion rather than by changing the opinion. So I think that they both have their place in the world. We did try very hard to work with them. Unfortunately we didn't quite find a way to combine them: they were focusing on slightly different layers of the stack and had a different set of opinions about how that should work, which meant that we couldn't quite find a way to mesh everything together. It's a shame, and there are no hard feelings. I think what we're going to do is kind of develop independently for a bit, and then maybe we'll see another attempt at convergence in the future. I mean, the tools work in a similar way, so it's definitely a possibility, but at the moment it's not really possible to integrate Flux with Argo due to the tightly coupled nature of the UI, the security model, and the application model.

A: I hope that answers your question.

J: yeah, definitely, thanks, those are interesting insights. I don't think that I have many more questions. So it was a lot of input from you, Alexis. Something to ponder, and I also

A: what's the plan for this recording? Are you going to publish it as a video or write something up? What's the deal?

S: so the idea is that I'll try to transcribe it, no promises here. And I'll definitely give you a review copy, which I will kindly ask you to sign off on, and then the idea is to put the video essentially uncut on YouTube as a contribution to "What about GitOps - an interview with Alexis". I think a lot of the things that you mentioned here are very interesting for a wider audience too, and I plan to add this to my blog. Johannes, you will probably also do something with that. Alexis, feel free to use it as well. Plus I will try to put an excerpt of your answers into our article, in a German translation unfortunately.

A: That's great, our head of marketing is German so, you know, she can always tell me what it says.

I would love to see that, please share anything that you can, including the recording. Feel free to cut out bits where, you know, people get up to go outside for a second or whatever. But yeah, I think that's good. Thank you very, very much for the questions and the time today. I really enjoyed the conversation, and I hope that I've left you with the impression that it's an exciting area that needs more development, not a triviality or a done deal by any means.

S: thank you for your time. I took your advice and contacted you in the GitOps Working Group on the CNCF Slack, so I'll be happy to help with anything I can help with.

A: thank you, all right, and Johannes, thank you too.

The GitOps logo shown is made by @iboonox and has been suggested for the GitOps Working Group.

