Mitigate the risk of temporary data science development hacks

programming-istock.jpg
programming-istock.jpg

 Image: iStock

I'm in the doghouse again. I'm the one responsible in my household for hooking up anything electronic. Therefore, when the call came to wire-up all the various devices connected to our family room TV, I responded with dignity and valor. My approach resembled my agile-esque philosophy on life: first make it work, then make it pretty, and finally make it better.

Within an hour or two, I was done with phase one. Everything was working; however, there were wires everywhere. I assured my concerned wife that I would soon make everything look clean and neat. That was several years ago. My wife has had it, and I don't blame her.

This happens in the world of data science as well. We're all guilty of writing a hack with the best of intentions of fixing it later, and then forgetting all about it until it shows up at the most inauspicious time (like during a client demo). When building data science solutions, make sure your temporary hacks don't become permanent.

Good intentions and bad practices

Temporary code is a necessary evil. Unfortunately, it's part and parcel of an extremely effective best practice. Code for data science can get extremely complex, which leads to confusion and frustration when things aren't going right.

To simplify your approach, isolate errant code, and most importantly protect your sanity. Your first order of business must be to get the code working as quickly as possible -- by any means necessary. More often than not, that includes one or more temporary hacks. For instance, you might hard-code the value of an incoming parameter, so you can troubleshoot what's happening with the other parameter. Of course, you see the setup here. If you accidentally deploy this code into production without removing your hack, any calls to this function will return misleading results.

To make matters worse, it's very difficult to catch these mistakes. To your computer, your hack looks like perfectly working code, so you won't get any complaints at compile-time. Plus, most programmers aren't conspicuous enough about their hacks. Hacks are usually buried somewhere deep in the code with a small, unnoticeable comment -- successfully escaping functional testing and even code reviews.

When profiling the risks of data science development, most analytic managers focus on probability and impact, but overlook one very important third dimension: detectability. Temporary hacks shoot the moon with risk: highly probable, highly impactful, and extremely difficult to detect. Something must be done to mitigate this risk.

Clean up your mess

Dealing with temporary hacks involves a couple of preventive measures. First, you can prevent the appearance of temporary hacks by following good unit testing practices. Instead of writing a hack, just write a good test that doesn't interfere with your production code. So, instead of hard-coding that parameter's value in your production code, call the function from within your unit test's code, with the hard-coded parameter value. Since the code for your unit tests aren't deployed with your production code, you eliminate any risk that these hacks will show up in the wrong place. This takes planning and discipline, so your success with this approach is greatly dependent on your team's overall approach to test-driven design (TDD).

Even if your team hasn't fully embraced the TDD paradigm, there's still hope with another preventive design pattern called a marker interface. A marker is like a comment on steroids. It's a simple but effective design pattern that structurally tags any function that contains a hack. In an object-oriented language like Java, you would create a simple marker interface (i.e., with no method declarations) with a conspicuous name like TEMPORARY_HACK. Then in your production code, just before you hard-code that parameter's value, you would write a small comment to mark your place (which you probably would have done otherwise) and then immediately tell your class to implement the marker interface. Finally, make sure your build process flags temporary hacks (i.e., all classes that implement the TEMPORARY_HACK interface) as errors (not warnings), so those insidious hacks never make it to your production code.

Once your temporary hack is removed, you can safely remove the TEMPORARY_HACK marker. Worst-case, you forget to remove it, and your build process erroneously stops the build. That is much better than erroneously allowing temporary hacks into your production code. Once you understand the concept, you can easily apply this technique to non-object-oriented languages as well.

Summary

It's never a good scenario when temporary code becomes permanent code. Even with the best of intentions, we're all guilty of diving deep into code surgery only to leave sponges and sutures in the production image.

I've shown you a couple of ways to prevent these hacks from ever making their way into the wrong place: coded unit tests and marker interfaces. Both involve good practices and discipline, so review your code but more importantly review your habits. Your dog may be your best friend, but it's pretty cold outside in his house.

Automatically subscribe to TechRepublic's Big Data Analytics newsletter.