Suppose you’re writing a script to spin up servers for your web application.
    def deploy(ip):
        copy('code/', ip + ':~/code', recursive=True)
        write_template('conf/config.py', ip + ':~/config.py')
        write_template('conf/crontab', ip + ':~/.crontab')
        # The Apache config comes from its own template, not the crontab one.
        write_template('conf/httpd.conf', ip + ':/etc/apache2/httpd.conf')
        run_as_root('service cron restart')
        run_as_root('service apache restart')
        post('https://pingdom.com/api/2.0/checks',
             { 'name':ip, 'host':ip, 'type':'ping' })
Everything is going well until you decide to split up your machines into ones that run tasks and ones that answer requests. Since you’re not sure which deploy logic you want to be shared and which you want to keep separate, you decide to start out by copy-pasting the logic:
    def deploy_taskrunner(ip):
        # Warning: don't forget to edit deploy_webserver as well
        copy('code/', ip + ':~/code', recursive=True)
        write_template('conf/task_config.py', ip)
        write_template('conf/crontab', ip + ':~/.crontab')
        run_as_root('service cron restart')
        post('https://pingdom.com/api/2.0/checks',
             { 'name':ip, 'host':ip, 'type':'ping' })

    def deploy_webserver(ip):
        # Warning: don't forget to edit deploy_taskrunner as well
        copy('code/', ip + ':~/code', recursive=True)
        write_template('conf/web_config.py', ip)
        write_template('conf/httpd.conf',
                       ip + ':/etc/apache2/httpd.conf')
        run_as_root('service apache restart')
        post('https://pingdom.com/api/2.0/checks',
             { 'name':ip, 'host':ip, 'type':'ping' })
Your application hums along for a while, and everything is fine. When you need to manually deploy something, you can read the list of steps from the relevant function. When you need to tweak some config files to improve performance, you just add the code to both locations (and put up with some nagging from your coworkers).
Suddenly, disaster strikes! You release an API! This means you need to have a third type of machine in your cluster.
Since your coworkers have been nagging you about the copy-pasted code in the deploy scripts, you decide to factor out the common logic into functions.
    def predeploy_common(ip):
        copy_code_to(ip)
        tweak_config_files(ip)

    def postdeploy_common(ip):
        run_tests_on(ip)
        setup_pingdom(ip)

    def deploy_taskrunner(ip):
        predeploy_common(ip)
        write_template('conf/task_config.py', ip)
        write_template('conf/crontab',
                       ip + ':~/.crontab')
        run_as_root('service cron restart')
        postdeploy_common(ip)

    def deploy_webserver(ip):
        predeploy_common(ip)
        write_template('conf/web_config.py', ip)
        write_template('conf/httpd_web.conf',
                       ip + ':/etc/apache2/httpd.conf')
        run_as_root('service apache restart')
        postdeploy_common(ip)

    def deploy_apiserver(ip):
        predeploy_common(ip)
        write_template('conf/api_config.py', ip)
        write_template('conf/httpd_api.conf',
                       ip + ':/etc/apache2/httpd.conf')
        run_as_root('service apache restart')
        postdeploy_common(ip)
Everything is glorious again. Your machines are fruitful and multiply. Adding more new machine types is a cinch. Your application hums along for a while.
Suddenly, disaster strikes! While adding support for BSD machines in addition to Linux, you realize that you have to change a function three levels deep inside tweak_config_files.
You snottily point out to your coworkers that if they had let you keep the copy-paste code, you could have just added an if...then block and been done with it. Your coworkers tell you to stop being ridiculous.
You briefly consider threading a bsd=True parameter through three levels of function calls before deciding that you’ve taken enough beatings in code review. Instead, you realize that you just need to think of your machines as objects with behaviors instead of passive IP address strings that just have stuff done to them. So you refactor your code to be object-oriented:
    import abc

    class Machine(object):
        # Abstract class. (Python 2 syntax; in Python 3 you would
        # subclass abc.ABC instead of setting __metaclass__.)
        __metaclass__ = abc.ABCMeta

        def __init__(self, ip):
            self.ip = ip

        def copy_files(self, files):
            ...

        @abc.abstractmethod
        def tweak_ram_config(self):
            pass

        @abc.abstractmethod
        def tweak_fs_config(self):
            pass

        @abc.abstractmethod
        def tweak_network_config(self):
            pass

        def tweak_config_files(self):
            # Each step is supplied by the OS-specific subclass.
            self.tweak_ram_config()
            self.tweak_fs_config()
            self.tweak_network_config()
        ...

    class LinuxMachine(Machine):
        ...

    class BSDMachine(Machine):
        ...

    class DeploymentPlan(object):
        __metaclass__ = abc.ABCMeta

        @abc.abstractmethod
        def deploy(self, machine):
            pass
        ...

    class APIDeploymentPlan(DeploymentPlan):
        ...

    class WebDeploymentPlan(DeploymentPlan):
        ...

    class TaskDeploymentPlan(DeploymentPlan):
        ...
    ...
Everything is glorious again. Your machines are fruitful and multiply. When you add more different hardware types, you can just add more subclasses. Your application hums along for a while.
Suddenly, disaster strikes! A botched deploy causes your machines to run amok, losing your app hundreds of millions of dollars before anyone figures out how to turn it off. Just kidding: it’s not that hard to turn off a computer. But it was a near thing.
Shaken, you drag yourself to the incident post-mortem. “What happened here?” your boss asks.
Well, you say, the deploy script started by constructing a Machine subclass for each machine in the cluster, and a DeploymentPlan instance to manage deploying to it… After a couple hours of jumping between files and trying to do vtable lookups in your head, you eventually trace the bug to a bad assumption in a DeploymentPlan subclass method: the author was confused about which subtype of Machine they were operating on.
Your boss groans. “Didn’t this stuff used to be, like, a function in a one-file script?”
This is a circle I often find myself going around. I start out with the equivalent of copy-paste code (or a single script with if-then statements). It’s not abstract or DRY, but you can read it from beginning to end and you can change whatever part of it you want.
As the code gets more complex, I’ll decide that to keep it manageable I need to split it up into a hierarchy of functions. It always feels refreshing to abstract away a hairball of code—until I try to modify part of the internals, and realize I have to thread an argument through too many layers of functions.
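To make the threading problem concrete, here is a minimal sketch of what passing bsd=True through the hierarchy might look like. Apart from predeploy_common, tweak_config_files, and tweak_ram_config, which appear earlier in the post, every name and return value is a hypothetical stand-in; the real functions would run commands rather than return strings.

```python
def write_ram_settings(ip, bsd=False):
    # Level 3: the only function that actually cares about the OS.
    flavor = 'bsd' if bsd else 'linux'
    return '{}: wrote {} RAM settings'.format(ip, flavor)


def tweak_ram_config(ip, bsd=False):
    # Level 2: never looks at `bsd`, only forwards it.
    return write_ram_settings(ip, bsd=bsd)


def tweak_config_files(ip, bsd=False):
    # Level 1: also just forwards it.
    return [tweak_ram_config(ip, bsd=bsd)]


def predeploy_common(ip, bsd=False):
    # Entry point shared by every deploy_* function, so its signature
    # (and every caller) has to change as well.
    return tweak_config_files(ip, bsd=bsd)


print(predeploy_common('10.0.0.5', bsd=True))
```

One decision at the bottom of the tree costs a signature change at every level above it, which is the part that tends to draw fire in code review.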
At that point, I might decide to take the state that gets threaded through functions, and encapsulate it in an object. But now each method has access to a huge amount of hidden state (in the form of instance variables), and each method call could mean one of many different things depending on the runtime class of the object it’s being called on. That makes it much harder to look at code and know what path will be executed, so the code becomes much harder to keep in my head.
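A small sketch of why the dispatching version is harder to trace, using simplified stand-ins for the Machine classes above (the method bodies and return strings are invented for illustration):

```python
class Machine(object):
    def __init__(self, ip):
        self.ip = ip

    def tweak_fs_config(self):
        raise NotImplementedError


class LinuxMachine(Machine):
    def tweak_fs_config(self):
        return self.ip + ': linux fs config'


class BSDMachine(Machine):
    def tweak_fs_config(self):
        return self.ip + ': bsd fs config'


def tweak_all(machines):
    # The call below reads the same for every machine. Which method body
    # runs is decided at run time by the class of each object, so the
    # answer is not visible at this call site.
    return [machine.tweak_fs_config() for machine in machines]


print(tweak_all([LinuxMachine('10.0.0.5'), BSDMachine('10.0.0.6')]))
```

To know what tweak_all does, a reader has to know how every element of the list was constructed, which may have happened in another file.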
Copy-paste code is readable and hackable but poorly abstracted. Large trees of functions are readable and abstract, but hard to hack. Stateful objects and virtual dispatch are both hackable and abstract, but difficult to understand. I haven’t found any pattern that accomplishes all three at once.
In fact, I wonder if it’s even possible to get all three of readability, hackability and abstraction. When I cycle through these three patterns, it feels like I’m playing code complexity whack-a-mole. As soon as I simplify one part of the program, another one gets hairier.
This might just mean the problem I’m trying to solve is inherently complex. If I’m trying to run five different deploy recipes across four different hardware/OS configurations, that’s 20 different potential interactions to take care of. At that point, unless I explicitly take stock and notice that my problem has a certain amount of inherent complexity, I’m likely to oscillate several times between different ways of expressing the same thing.
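The arithmetic is easy to check by enumerating the pairs. The recipe and platform names below are made up; only the counts (five and four) come from the paragraph above.

```python
from itertools import product

# Hypothetical names; only the counts matter here.
recipes = ['web', 'api', 'task', 'cache', 'db']
platforms = ['linux-x86', 'linux-arm', 'bsd-x86', 'bsd-arm']

# Every (recipe, platform) pair is a case that some piece of code has to
# handle, however the program is organized.
combinations = list(product(recipes, platforms))
print(len(combinations))
```

Copy-paste, function trees, and class hierarchies only change where those 20 cases are written down, not how many there are.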
At the same time, I still hold out hope. Someday, I tell myself, I’ll write a nontrivial program that’s readable, hackable and abstract. Hopefully before I blow anyone up with a botched deploy.
Comments
I think the copy-pasted functions (deploy_taskrunner and deploy_webserver) are pretty close to what you want in this situation, since it’s reasonable to expect them to diverge further.
It might be possible to remove some of the repetition without adding lots of abstraction and without composing things into a hierarchy of function calls that have lots of effects (not all of which you might want in the future, and if you want to omit or modify an effect in a special case, you’re in trouble!).
I can only see two lines that are guilty of much repetition:
    copy('code/', ip + ':~/code', recursive=True)
and especially:
    post('https://pingdom.com/api/2.0/checks', { 'name':ip, 'host':ip, 'type':'ping' })
I think your copy_code_to function would be fine to use here, as long as it’s called directly instead of via another level of indirection. Perhaps it can even have src and dst parameters with default values of 'code/' and ip + ':~/code', which can be overridden if you ever have a special case (or the special-case machine could just call copy directly).
In the second case, you could write a notify_pingdom(ip) function and call it directly. Alternatively, since it’s something that is probably always going to be run as the last step, you could write an @pingdom decorator.
It’s not a perfect solution, but I think it has a better balance of readability/hackability/abstraction for this case.
I think readability and hackability are at a premium here: in ops, it’s super important to easily know exactly what’s going to happen when you run a command. And it’s also super important to be able to hack in small changes.
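A rough sketch of what those suggestions might look like. The copy and post functions here are recording stand-ins for the post's helpers, whose real implementations are not shown, and the exact signatures of copy_code_to and the pingdom decorator are assumptions.

```python
import functools

calls = []  # records what the stand-in helpers were asked to do


def copy(src, dst, recursive=False):
    # Stand-in for the post's copy() helper.
    calls.append(('copy', src, dst, recursive))


def post(url, data):
    # Stand-in for the post's post() helper.
    calls.append(('post', url, data))


def copy_code_to(ip, src='code/', dst=None):
    # The default destination depends on ip, so it is computed here
    # rather than written in the signature.
    if dst is None:
        dst = ip + ':~/code'
    copy(src, dst, recursive=True)


def notify_pingdom(ip):
    post('https://pingdom.com/api/2.0/checks',
         {'name': ip, 'host': ip, 'type': 'ping'})


def pingdom(deploy_func):
    # Decorator: run the deploy, then register the check as the last step.
    @functools.wraps(deploy_func)
    def wrapper(ip):
        result = deploy_func(ip)
        notify_pingdom(ip)
        return result
    return wrapper


@pingdom
def deploy_taskrunner(ip):
    copy_code_to(ip)
    # ... task-runner-specific steps stay inline here ...


deploy_taskrunner('10.0.0.5')
```

The deploy function still reads top to bottom, and a special case can skip the decorator or call copy directly without touching the shared helpers.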
From the perspective of a strongly-typed pure functional programmer, this problem looks self-inflicted. Your explanation of the runtime error, “the author was confused about which subtype of Machine they were operating on,” could only happen because the state is hidden and a strong type system isn’t being leveraged. Why force the programmer to keep track of things in their head when the compiler can do that for them? For instance, in (incomplete) Haskell:
If you try to call tweakLinuxConfig on a BSDMachine, the compiler will stop you. Functions that are currently common can be called on Machine os, and easily refactored into specific versions if the need arises. Each Machine value has a params record containing further needed data about its IP, whether it’s an API or a web server, etc.
You could argue that by passing around that params record, I’m basically still threading an argument around, which you wanted to avoid. Regardless of approach, though, the data has to live somewhere. You can make it hidden state in an object, but then it’s difficult to reason locally and equationally about your program. It makes much more sense, to me, to treat the data as a value and provide it to the functions that need it directly.
You mention that hacking would be easier if you could just add an if...then block. There’s a way to preserve that simplicity: rather than factoring out the common behavior into functions, factor out the specific behavior into functions. Then your API has a single entry-point, it’s a simple matter of pattern matching (equivalent of your if...then block) to figure out which specific functions to call, and the parametric types resolve to more concrete ones as you call deeper functions, providing you with more safety. The goal is to create large functions by composing many small functions. The most specific functions should be the smallest, so they should be the ones being composed, rather than the result of the composition.
I agree that large collections of functions are difficult to hack on if you don’t have a powerful compiler and type-checker. With one, though, it becomes fairly trivial. I’ve made changes to core data types in production Haskell code, whether to add additional data or to do more thorough refactorings, without any difficulty at all. It’s only difficult if a human has to load the whole structure into their head. If a compiler can do that part for you and just tell you which bits to change, it’s no problem at all.
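For readers following along in the post's own language, here is a rough Python approximation of the "factor out the specific behavior" idea. It is not the Haskell described above: the OS enum and the function names are invented, and Python checks none of this at compile time, so the ValueError is a run-time stand-in for what a type checker would reject earlier.

```python
from enum import Enum


class OS(Enum):
    LINUX = 'linux'
    BSD = 'bsd'


# The most specific functions are the smallest ones.
def tweak_config_linux(ip):
    return ip + ': linux config'


def tweak_config_bsd(ip):
    return ip + ': bsd config'


def tweak_config(ip, machine_os):
    # Single entry point: this is the if...then block, kept in one place.
    if machine_os is OS.LINUX:
        return tweak_config_linux(ip)
    if machine_os is OS.BSD:
        return tweak_config_bsd(ip)
    # Python will not notice a forgotten case on its own, so fail loudly.
    raise ValueError('unhandled OS: %r' % (machine_os,))


print(tweak_config('10.0.0.5', OS.BSD))
```

The OS is passed as an explicit value rather than hidden in an object, so the branch taken is visible at the entry point.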
Really nice article! Well-written and accurate.
I’d agree with Ted in this case and I think it’s also a situation where you anticipate this will always happen, and build safe-checks into the deployment/build process itself.
If you’re operating a business big enough for this to be relevant, I’m guessing each service would have at least two “machines” or instances dedicated–by using a half-at-once deployment strategy and running some post checks in your build flow, you can usually prevent any failures (even in the build/deploy scripts themselves) from ever compromising the service.
Still sucks when it happens, but, the service doesn’t go down and everyone on the business side remains happy (ie, you don’t have to re-justify the point of abstraction or efforts to enhance readability which may introduce bugs).
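A half-at-once rollout with post checks could be sketched roughly like this. The deploy and health_check callables are assumptions supplied by the caller; nothing here comes from a real deployment tool.

```python
def deploy_in_halves(ips, deploy, health_check):
    """Deploy to the first half, verify it, then deploy to the rest."""
    middle = (len(ips) + 1) // 2
    deployed = []
    for batch in (ips[:middle], ips[middle:]):
        for ip in batch:
            deploy(ip)
            deployed.append(ip)
        failed = [ip for ip in batch if not health_check(ip)]
        if failed:
            # Stop here: any batch not yet touched keeps serving traffic
            # on the old version.
            raise RuntimeError('health check failed on: ' + ', '.join(failed))
    return deployed


# Usage with trivial stand-ins for the real deploy and check functions.
touched = []
print(deploy_in_halves(['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.4'],
                       touched.append, lambda ip: True))
```

If the check fails after the first batch, the second batch is never deployed to, so a bug in the deploy script itself takes out at most half of the machines.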
@Ted: I’d agree with you about the copy-pasted code if you didn’t expect the scripts to grow. As the number of things the script does increases, the copy-paste code is going to become more unwieldy and the factored-out function hierarchy will become deeper, leading to the parameter-threading problems I mentioned before.
@Nathan: It looks to me like you’re essentially doing the same thing as virtual dispatch with your code snippet. It’s true that this will catch something like using tweakConfigLinux instead of tweakConfigBSD, but this isn’t an advantage over normal object-oriented virtual dispatch where you can just implement tweakConfig differently for the BSDMachine and LinuxMachine classes.
That said, I do agree that using a type system as rich as Haskell’s would cut down the potential bug surface significantly (even if not eliminate it entirely). Is there a good resource for patterns that Haskellers/other functional programmers use for this kind of thing? I’d be interested in learning more.
@Julian: Thanks for the kind words!
I definitely agree with the mitigation strategies you suggest. In practice I usually solve this problem at least partly with outside-the-program tools like better safety checks, intensive code review for critical components, etc.
@Ben: Part of this is definitely similar to dispatch, but this version enforces a lot more safety. It does this in large part by making everything completely explicit, and by forcing the programmer to handle every case. If you write a function that works on any general Machine os, you have to either only use data and other functions that are common to all types of machines, or you have to handle each type of machine explicitly in a pattern match. In the OO version, on the other hand, these requirements are not enforced, so you end up with run-time errors rather than compile-time errors.
I wish I could point you to good educational resources. Unfortunately, most of it is just learned on the job, so I’m not aware of very good instructional material.
As a general programmer I also agree with Nathan about the strong types here. It’s not specific to Haskell and functional programming; in other languages like C++ or Java you can get the same kind of safety with only some small investment in keeping good discipline.
You may also get tools that enable meta-programming, generics, or a declarative style that can bring even more problems into the scope of detection for the compiler whilst reducing the amount of code and more clearly communicating its intent and structure to the human reader as well. This code may be identical in the final executed bytes as some “good ol’ procedural/object-oriented code” but will be furnishing the compiler with the information it requires to do additional checks.
This is, however, a very hard problem to solve for interpreted languages, those with extremely weak typing or that only allow very late dynamic binding, without making a considerable effort yourself. If you only know about something as late as at run-time then how can you reliably find it before it is detected by causing a problem?
These kinds of languages often have lots of benefits for rapid application development and day-to-day scripting, but that effectiveness comes at a cost of lacking features and constraints that are useful for large, complicated software which has to work reliably.
I agree with Nathan’s claim that a lot of these problems would be solved with a type system.
People often make a subtle error when deciding whether to use a dynamic vs statically typed language for a particular task. For script-like tasks, such as the server-deployment example you describe, people intuitively lean towards using a dynamic language. But this idea of “scriptiness” is a misleading concept. Instead, the relevant factor is how intensively the code gets tested before being checked-in or deployed. If the code is highly tested before checkin, either because of good unit test coverage or because the codebase is so tightly integrated that a bug anywhere would cause everything to break, then the protections offered by a good type checker aren’t as relevant, and the use of a dynamic language should be encouraged.
In a lot of cases, though, people have code that is run infrequently and is detached from the rest of the codebase, and therefore not subject to constant testing. This profile matches your deployment script example very well. In these cases, you really benefit from locking the code down as much as possible, using both a strong type system and a lot of assertions.
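In a dynamic language, "locking the code down" might start with checks at the top of the script, before anything touches a server. The sketch below is one possible shape; the role names and the validated_target helper are hypothetical. It uses the standard library's ipaddress module, whose ip_address function raises ValueError for a malformed address.

```python
import ipaddress

KNOWN_ROLES = ('web', 'api', 'task')  # hypothetical role names


def validated_target(ip, role):
    """Check a deploy target up front and return it in normalized form."""
    # Raises ValueError if ip is not a valid IPv4 or IPv6 address.
    address = ipaddress.ip_address(ip)
    # Note: assert statements are skipped when Python runs with -O.
    assert role in KNOWN_ROLES, 'unknown role: %r' % (role,)
    return str(address), role


def deploy(ip, role):
    ip, role = validated_target(ip, role)
    # ... the real deploy steps would go here ...
    return 'deploying %s to %s' % (role, ip)


print(deploy('10.0.0.5', 'web'))
```

A mistyped role or address then fails immediately on the operator's machine instead of partway through a deploy.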
@Nathan: no good educational resources about production Haskell patterns? Have you considered starting a blog?
I’d also be interested in any open-source codebases you think are particularly good with stuff like this.
Hey Ben,
Great article. Have you read anything about DCI? It’s intended to solve this problem, and centralize your paths of code execution into a single, understandable chunk (rather than a confusing network of calls among objects) while still allowing you to keep your objects.
Good intro here: http://www.infoq.com/presentations/The-DCI-Architecture
your code has become data. treat it as data and not as object code. Your copy-paste code level is a series of instructions(rows) you execute. Or rows, you add, re-order, and change as needed. If you want to abstract that code, abstract it as data.
you have a table of commands
you have a table for server types
you have a table that sequences commands per server type.
if you need to add a server type, it’s just a new set of data and rows in tables.
if you have to change a command, you can see what server types use that command… and opt to add a new command(row) instead of modifying an existing one.
if you want to get fancy, you could abstract command groups, and then apply groups and individual commands to server types creating a more mixed structure… of data. Keep a list of your servers in your database, and now you can print your whole server config setup and processing sequence with a single SQL select.
you could easily keep track of how each command is executed in code by writing it upon completion to a log file, creating an easily readable facsimile of your copy-paste code. And then if something goes wrong, you know exactly which record it breaks down on in your database.
also, you can run a process that then monitors your servers, and if one stops or fails for some reason, it can auto-restart it by reading the database setup just for that server.
anyway, this is how I’ve solved problems like these in the past.
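As a rough illustration of the three tables described above, here is a sketch using an in-memory SQLite database. The table names, column names, and command descriptions are all invented for the example; the steps only loosely mirror the deploy functions in the post.

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript("""
    CREATE TABLE commands (id INTEGER PRIMARY KEY, action TEXT);
    CREATE TABLE server_types (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sequences (server_type_id INTEGER, step INTEGER,
                            command_id INTEGER);
""")
db.executemany('INSERT INTO commands VALUES (?, ?)', [
    (1, 'copy code'), (2, 'write crontab'), (3, 'restart cron'),
    (4, 'write httpd.conf'), (5, 'restart apache'),
    (6, 'register pingdom check'),
])
db.executemany('INSERT INTO server_types VALUES (?, ?)',
               [(1, 'taskrunner'), (2, 'webserver')])
db.executemany('INSERT INTO sequences VALUES (?, ?, ?)', [
    (1, 1, 1), (1, 2, 2), (1, 3, 3), (1, 4, 6),   # taskrunner steps
    (2, 1, 1), (2, 2, 4), (2, 3, 5), (2, 4, 6),   # webserver steps
])


def plan_for(server_type):
    # The ordered list of commands for one server type.
    rows = db.execute("""
        SELECT c.action FROM sequences s
        JOIN commands c ON c.id = s.command_id
        JOIN server_types t ON t.id = s.server_type_id
        WHERE t.name = ? ORDER BY s.step
    """, (server_type,))
    return [action for (action,) in rows]


def server_types_using(action):
    # Which server types would be affected by changing this command.
    rows = db.execute("""
        SELECT DISTINCT t.name FROM sequences s
        JOIN commands c ON c.id = s.command_id
        JOIN server_types t ON t.id = s.server_type_id
        WHERE c.action = ? ORDER BY t.name
    """, (action,))
    return [name for (name,) in rows]


print(plan_for('webserver'))
print(server_types_using('copy code'))
```

Adding a server type is then a matter of inserting rows, and the second query answers the "who uses this command" question before a shared command is edited.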
I wrote a paper about how to get the best of both worlds:
http://harmonia.cs.berkeley.edu/papers/toomim-linked-editing.pdf
Attack the problems with Copy-Pasted code directly using your editor!
I started reading your post, and have to admit it is a bit above my head despite 33 years of computing experience, 28 in industry. Presently I am a .NET developer.
However, your description of co-workers and their reactions to the pre-solutions you encountered elicited some simple understanding.
I have read some of the comments and Calvin’s seems to make the most sense to me. If I understand him correctly he is suggesting putting all the meta information regarding the script in a Database. Then write some sort of algorithm (script) to manipulate the Database in an order giving the correct results. Then if a new machine is added the Database is the only place that requires changing. Sounds ideal?
I guess you are working on VMs, from within the host control area, although it is possible you are “spinning” up physical machines.
I have not the time to read all the links just at the moment, however, I thought I would mention that I liked the post, and how it was written.
I followed the trail from linkedin and here I am now.
Sometimes we are not afforded the time to work with abstraction. It might be said that it would save time in the future if you start by abstracting, however, there might not be a future if we stopped and abstracted at every “required” point in time.
As a .NET developer, I have only read about abstraction and have never written an abstract class, except perhaps in my head pre-class creation.
For the time being I am happy to remain a .NET developer and write the WebApplications that your machines host.
Hi,
I’m part of a team that uses a similar kind of script to orchestrate deployments.
I don’t want to get into implementation too much, other than to say your post was timely.
The way you phrased your query, was, to me, scratching the surface of a fundamental set of issues in developing software I’ve seen time and again over the course of my career.
I hope you don’t mind this, but via Twitter I asked Robert Martin, the author of “Clean Code” and a source of clarity around this subject matter for me, to comment on your post. I’ve only met him once at a class I took and was rather blown away by his reply!
I hope his post gives you a different frame to think about what you are struggling with, which is shared by all of us, no matter how long we’ve been in this vocation.
Hi Ben,
Great post, thank you. What language is this code in? It looks very easy to read. Cheers.
It’s Python.