Test Driven Infrastructure
Yesterday I was at the Berlin DevOps meetup and we had a very nice fishbowl about Test Driven Infrastructure (TDI). I used my Lightning Talk from the PyCon in Köln as an introduction to the topic, but quickly realized that the term does not fully explain itself.
My idea of TDI is to apply most of the basic ideas of TDD also to the development process of the software that runs our platform. In other words: the same thing that developers have been doing with their code for a long time. Let's just say that we start to test all code that goes onto a server, no matter who wrote it or what it actually does.
Test Driven Development in itself is not a new thing, but it is perhaps not yet common to apply it to platform operations. As an old Ops guy, I had a lot to learn when I started to work at ImmobilienScout24, which is a real software development company.
The bottom line is really simple:
Untested = Broken
Some specific examples of what we did in the last month:
- A service that mounts SAN LUNs (via file system labels) is tested on a server with a mock SAN LUN. The mock is just a loop-back device with a file system on it that carries a suitable label. The tests check that the service gets installed and activated properly and that starting/stopping the service actually mounts/unmounts the file system. (A sketch of such a test follows this list.)
- A package containing configuration files for a Squid proxy server is tested on a server to make sure that the rules actually apply correctly. The test server has no Internet access, because even a 5xx return code would mean that the proxy allowed the request to pass.
- The packages that make up a cluster of subversion servers are put through a large integration test chain on a test cluster to make sure that the automated backup and restore, the cluster failover, the various post-commit hooks and other scripts involved there actually work correctly.
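To make the first example more concrete, here is a minimal sketch of how such a mocked SAN LUN test could look. It assumes pytest, root privileges on the test server, and a hypothetical systemd service called "san-mount" that mounts the file system labelled "TESTLUN" at "/mnt/testlun"; all of these names are made up for illustration and are not part of our actual setup.

```python
# Sketch of a system test with a mocked SAN LUN (assumptions: pytest, root
# privileges, a hypothetical "san-mount" service that mounts by label).
import subprocess
import pytest

LABEL = "TESTLUN"            # hypothetical label the service looks for
MOUNTPOINT = "/mnt/testlun"  # hypothetical mount point configured in the service

def run(*cmd):
    """Run a command; a non-zero exit code raises and thus fails the test."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)

@pytest.fixture
def mock_lun(tmp_path):
    """Create a loop-back device with a labelled ext4 file system as a fake SAN LUN."""
    image = tmp_path / "lun.img"
    run("truncate", "-s", "64M", str(image))
    device = run("losetup", "--find", "--show", str(image)).stdout.strip()
    run("mkfs.ext4", "-q", "-L", LABEL, device)
    yield device
    run("losetup", "-d", device)

def test_service_mounts_and_unmounts_mock_lun(mock_lun):
    # Starting the service should mount the labelled file system ...
    run("systemctl", "start", "san-mount")
    assert MOUNTPOINT in open("/proc/mounts").read()
    # ... and stopping it should unmount it again.
    run("systemctl", "stop", "san-mount")
    assert MOUNTPOINT not in open("/proc/mounts").read()
```

The point of the loop-back trick is that the test needs no real SAN at all: everything the service cares about (a block device with the right label) is faked with standard Linux tools.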
All these cases (and many others) have one thing in common: If the packages pass the tests then they are automatically propagated to our production environment and automatically installed on the respective servers.
And exactly that is the main motivation for investing the effort to create those tests for infrastructure components:
- Continuous Delivery for everything that goes on a server.
- Trusting the tests means trusting other people, even those who don't know a system very well, to fix things and add features. If their changes pass the tests, most likely no harm will be done. If something breaks in production, we first fix the test and then the code.
- Testing each commit and deploying all successful commits to production gives us small changes with very little risk for each change.
- Having tests means that our systems run much more stably and reliably. Even if no customer is hurt by a short outage of a server, our colleagues really value a stable environment for their own work.
DevOps is not only about Devs doing more Ops work; it is just as much about Ops starting to think like Devs about their own work.
Some ideas that helped us to start doing more tests with our infrastructure components:
- Do unit tests at package build time: Syntax check everything. Run tools on test data and check the result. (A sketch of such build-time checks follows this list.)
- Do system tests on test servers: Try to mock everything that is not directly relevant to the test subject. Linux is a great tool for mocking stuff. Be creative in using firewall tricks, loop-back connections / mounts, fake services or data etc. to reduce the need for external systems to run a test.
- Do integration tests by running more complex scenarios on test servers with test data.
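As a small illustration of the build-time checks mentioned above, the following pytest sketch syntax-checks all shell scripts and parses all JSON configuration files of a package before it is built. The "src/" and "conf/" directories are assumptions about the package layout; adapt the globs to whatever your packages actually contain.

```python
# Sketch of build-time unit tests for an infrastructure package
# (assumed layout: shell scripts under src/, JSON configuration under conf/).
import glob
import json
import subprocess

def test_shell_scripts_have_valid_syntax():
    # "bash -n" parses a script without executing it, so any syntax error
    # fails the package build.
    for script in glob.glob("src/**/*.sh", recursive=True):
        subprocess.run(["bash", "-n", script], check=True)

def test_json_configuration_parses():
    # Broken JSON raises an exception and therefore fails the test.
    for config in glob.glob("conf/**/*.json", recursive=True):
        with open(config) as handle:
            json.load(handle)
```

Checks like these are cheap, run in seconds and catch the most embarrassing class of errors before a package ever reaches a test server.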
The longer a test runs, the later in the delivery chain it should come. Beware of making huge test scenarios where you will spend a lot of time debugging problems. Small tests fail faster and tell you immediately where to look for the problem.
I hope to collect more information about TDI and to present my findings at a conference next year.
Update October 2014: I gave an improved talk about DevOps Risk Mitigation at the EuroPython 2014 (video).
Update August 2014: I wrote a Linux Magazin article "Testgetrieben".
Update April 2014: I gave a talk about TDI at the Open Source Data Center Conference 2014 (video).