Here at Ravello we’re no different from any other startup that develops new and innovative software: we need to develop fast, adjust our plans as we go along, react to user feedback and roll out versions as quickly as possible. In other words, we need to be AGILE.
But our product is fairly complicated. It’s not your average web application: we’re talking about a large distributed system, with a specialized hypervisor that runs on multiple cloud providers.
In this post we will describe some aspects of our test and dev mechanisms, which enable us to stay agile, and thus move faster.
The system we’ll concentrate on is the Ravello management system.
For the sake of simplicity, we’ll describe it as an application made up of 5 Ubuntu servers: a frontend server (using nginx and tomcat7), a backend server (tomcat7), a database (postgresql), a queuing system (HornetQ) and an additional server for other services we support.
The blueprint below describes this application, and is actually a screenshot from our product:
As we started developing our management system long before the Ravello product was live and kicking, we had to use our local lab and install our management system on VMs on top of vSphere. While everything initially worked great, as time progressed we encountered more and more issues, which are probably typical of this kind of setup.
First of all, due to a lack of resources in our lab, we couldn’t really dedicate 5 different VMs (as we have in our production system) to our test/dev environment. Instead we had just two VMs there, one running the database and the other running all the other components. All our QA engineers and developers shared these two servers for integration and system tests. This introduced all kinds of problems:
- “Version day mayhem”: once every few weeks (typically two) we roll out a version to production. In the few days before version day, the integration VMs became extremely loaded, and every few hours (or even minutes) someone would shout: “Hi guys, I’m restarting that tomcat/database/… it’s stuck again!”.
- Whenever a serious (and not easily reproducible) bug was found, the developers would usually ask the QA engineer whether they could confiscate the integration servers for debugging. While it did improve the QA engineers’ table tennis skills as they waited for dev to do their work, it was a productivity killer for them.
- Most of our automatic testing scenarios are quite lengthy (deploying applications of different types on different clouds and operating them), and we have many of those. Ideally, we would like to run each scenario independently of the others. With just a limited set of integration servers in our lab, most of these automatic tests ran serially, which made the whole test suite take very long.
- Having a different setup in test than in production caused us to miss a few critical bugs, which were caught only in our staging env, a few minutes before rolling out to production…
After a few months of working with just these servers in our lab, we duplicated them a few times in order to solve some of the problems. Unfortunately, we learnt that not only had we not really solved anything (as we still had a largely shared environment), we had also introduced new problems:
- We had to configure each duplicate manually (changing IPs, etc…) in order for the system to work. Even worse, whenever we made a change in one of the VMs (for example, updated the database’s configuration), we had to go and repeat it on all the others.
- In a few cases, wrong configurations led to cross-wiring between the different environments (e.g. the backend server from one env. using the DB from the other env. by mistake).
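To make the cross-wiring problem concrete, here is a minimal Python sketch. It is purely illustrative and not part of our tooling: the environment names, IP addresses and the `find_cross_wiring` helper are all assumptions. It takes the addresses owned by each manually cloned environment and the database address each backend was configured with, and flags any backend that points into another environment.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical inventory: the addresses owned by each manually cloned environment.
ENV_HOSTS: Dict[str, Dict[str, str]] = {
    "env-a": {"backend": "10.0.1.10", "db": "10.0.1.20"},
    "env-b": {"backend": "10.0.2.10", "db": "10.0.2.20"},
}

# Hypothetical hand-edited settings: the DB address each backend is configured with.
BACKEND_DB_TARGET: Dict[str, str] = {
    "env-a": "10.0.1.20",
    "env-b": "10.0.1.20",  # copy-paste mistake: this is env-a's database
}


def find_cross_wiring(
    env_hosts: Dict[str, Dict[str, str]],
    targets: Dict[str, str],
) -> List[Tuple[str, Optional[str]]]:
    """Return (environment, actual owner) pairs for backends using a foreign DB.

    The owner is None when the configured address belongs to no known environment.
    """
    owner_of = {hosts["db"]: env for env, hosts in env_hosts.items()}
    problems = []
    for env, target_ip in sorted(targets.items()):
        owner = owner_of.get(target_ip)
        if owner != env:
            problems.append((env, owner))
    return problems


if __name__ == "__main__":
    for env, owner in find_cross_wiring(ENV_HOSTS, BACKEND_DB_TARGET):
        print(f"{env} backend is wired to the database of {owner}")
```

A check like this only detects the mistake after the fact; it does nothing about the manual per-clone editing that makes the mistake possible in the first place.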
We understood that, ideally, we wanted each developer/tester to have a full-blown copy of the management application that he could fire up at will. He could use it for his regular tasks (deploy a version from his IDE, connect to the DB from the DB viewer, run automated tests from Jenkins, etc.), while not suffering from the problems mentioned above.
In order to do that, a few months ago we started using our own system to develop and test the system itself. We’ve nicknamed this recursive concept “Babushka” (the commonly misused name for the Matryoshka doll).
Basically, we defined the 5-server application in our system and created the VMs there once. We configured the application once (using private networking and DNS names) and saved it as the master babushka blueprint. As soon as that was done, all our developers and testers got the green light to use this blueprint to publish their own instances of the management application instead of using the integration servers.
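Why do private networking and DNS names matter here? The Python sketch below shows one way to look at it. It is an illustration under stated assumptions, not our actual configuration: the setting keys, the `management` database name, the host names `db` and `queue`, and the IP ranges are all made up. The point is that address-based configuration differs for every clone, while name-based configuration on a private network is the same in every copy, so it only has to be written once.

```python
from typing import Callable, Dict


def render_config(db_host: str, queue_host: str) -> Dict[str, str]:
    """Build the backend's connection settings (illustrative keys only)."""
    return {
        # 5432 is PostgreSQL's default port; the database name is hypothetical.
        "jdbc.url": f"jdbc:postgresql://{db_host}:5432/management",
        "queue.host": queue_host,
    }


def config_with_ips(instance: int) -> Dict[str, str]:
    # Manual cloning: every copy has its own addresses, so every copy needs editing.
    return render_config(f"10.0.{instance}.20", f"10.0.{instance}.30")


def config_with_dns_names(instance: int) -> Dict[str, str]:
    # Private network per instance: the same names resolve inside every copy,
    # so the instance number never appears in the configuration.
    return render_config("db", "queue")


def distinct_configs(builder: Callable[[int], Dict[str, str]], instances: int) -> int:
    """Count how many different configurations a set of clones ends up with."""
    rendered = {tuple(sorted(builder(i).items())) for i in range(1, instances + 1)}
    return len(rendered)


if __name__ == "__main__":
    print("IP-based configs for 3 clones: ", distinct_configs(config_with_ips, 3))
    print("DNS-based configs for 3 clones:", distinct_configs(config_with_dns_names, 3))
```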
Just like that – we’ve eliminated all of the pains listed above:
- No more resource contention, no more shouts in the hallways before releases – everyone gets what he needs!
- Whenever a QA engineer encounters a serious bug, he simply saves his app as a blueprint and sends this blueprint to his favorite developer, without interrupting his daily routine.
- The automatic tests are now parallelized, running on multiple instances of the application in isolation. They finish much sooner, which gives us much faster feedback from our continuous integration.
- We have the exact same setup as production, so we can experiment with things that we couldn’t before:
  - We can test server failures much more easily.
  - We can test the impact of server size on performance much more easily.
- Whenever a change is made to the application infrastructure, the new application is saved as a blueprint, and everyone can publish new versions of the application at the click of a button!
- The different applications are completely isolated, so there is no need to worry about cross-wiring between them.
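The effect of parallelizing the test scenarios can be sketched in a few lines of Python. This is a toy model rather than our Jenkins setup: the scenario names, the `babushka-…` instance naming and the `run_scenario` stand-in (which just sleeps) are assumptions for illustration. Each scenario gets its own instance, so the scenarios can run side by side instead of queuing for shared servers.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Made-up scenario names, for illustration only.
SCENARIOS: List[str] = ["deploy-web-app", "deploy-db-cluster", "stop-start-app"]


def run_scenario(name: str, scenario_seconds: float = 0.2) -> Dict[str, str]:
    """Stand-in for publishing a fresh instance and running one scenario on it."""
    instance = f"babushka-{name}"  # hypothetical: one isolated instance per scenario
    time.sleep(scenario_seconds)   # placeholder for the real, lengthy test
    return {"scenario": name, "instance": instance, "result": "passed"}


def run_serially(scenarios: List[str]) -> List[Dict[str, str]]:
    # Shared integration servers: scenarios wait for each other.
    return [run_scenario(s) for s in scenarios]


def run_in_parallel(scenarios: List[str]) -> List[Dict[str, str]]:
    # One worker per scenario; map() returns results in the input order.
    with ThreadPoolExecutor(max_workers=len(scenarios)) as pool:
        return list(pool.map(run_scenario, scenarios))


if __name__ == "__main__":
    for runner in (run_serially, run_in_parallel):
        start = time.perf_counter()
        results = runner(SCENARIOS)
        elapsed = time.perf_counter() - start
        print(f"{runner.__name__}: {len(results)} scenarios in {elapsed:.2f}s")
```

Threads are enough for this kind of work because the test driver mostly waits on remote environments rather than computing. In the serial case the total time is roughly the sum of the scenario durations; in the parallel case it approaches the duration of the longest scenario.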
A quick (and unplanned) benefit of having the babushka was the ability to integrate with contractors much more easily. Whenever a 3rd party needs a sandbox management system to play with, we just publish a babushka for him and he’s good to go, working on a system identical to our production system…
We’ve also integrated this concept with our Jenkins (so on-commit continuous integration tests use this environment), and with our Maven build (so each dev can easily deploy to his babushka from the comfort of his IDE/shell).
In one of the next blog posts we will describe our CI integration with our system which uses the babushka – stay tuned for more!
About Ravello Systems
Ravello is the industry’s leading nested virtualization and software-defined networking SaaS. It enables enterprises to create cloud-based development, test, UAT, integration and staging environments by automatically cloning their VMware-based applications in AWS. Ravello is built by the same team that developed the KVM hypervisor in Linux.