This is a practice that is commonly preached against by any serious testing advocate. “Never test in production; you must test your software from the design phase, making sure your design will meet the requirements.”
So you unit test your modules, integration test your code with the larger system, and acceptance test the user interface to make sure it all feels right to users. You test with fake, contrived data, and finally with data that seems real.
“Only teams with poor practices test in production,” it is said, all with a smugness that makes you envision them with a pinkie in the air sipping a steaming cup of tea.
Yet everyone who ships code definitely tests in production.
What do you mean, we all do it?
Yep, everyone. In fact, every manufacturer does it. Every restaurant especially does it. But software creators do it as much as, and maybe more than, anyone else.
Before I go any further, I want to note the key to this clearly clickbait topic: while we all test in production, the only errors we should see are ones we could not have anticipated and tested for before shipping to users.
What kinds of production testing do we do?
There are two kinds of testing that almost everyone does with a release: smoke tests and live user testing.
Smoke tests
It may sound like a bad habit, but the concept of a smoke test is to make sure your software has been deployed as expected. As a page on Geeks for Geeks notes, this is also called “Build Verification Testing” or “Build Acceptance Testing.”
Apparently, the smoke testing name came from people building circuit boards: the first test was that the board didn’t start smoking when you applied power. Originally, I assumed it had something to do with checking seals (on tires, not the animal!), but I suppose that is more of a bubble test to see if air is leaking from a tire. Either way, the point here is to make sure everything seems to run.
As an example, you might check all the pages on the site you deployed to make sure they work, check to see that tables were created, and so on. Obviously, the better your testing up front, the fewer issues you will find.
You might have special fake accounts you test a process with to make sure that order processing works too. The more automation you can apply to this process, the better; the less automation you have in your testing and deployment processes, the more necessary this is. We have all gone to a major company’s website and gotten weird errors that were likely related to a bad deployment.
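To make that concrete, here is a minimal post-deployment smoke test sketch in Python. The URLs are hypothetical stand-ins for your own key pages, and it assumes the third-party requests library; the idea is that your deployment pipeline runs this right after a release and treats a non-zero exit code as a failed deploy.

```python
# smoke_test.py -- a minimal post-deployment smoke test sketch.
# The URLs below are hypothetical; substitute your own key pages.
import sys

import requests  # third-party: pip install requests

PAGES = [
    "https://example.com/",
    "https://example.com/login",
    "https://example.com/orders",
]

def check_pages() -> list[str]:
    """Request each key page and collect any that don't return 200."""
    failures = []
    for url in PAGES:
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code != 200:
                failures.append(f"{url} returned {resp.status_code}")
        except requests.RequestException as exc:
            failures.append(f"{url} raised {exc}")
    return failures

if __name__ == "__main__":
    problems = check_pages()
    for p in problems:
        print("SMOKE FAIL:", p)
    sys.exit(1 if problems else 0)  # non-zero exit fails the pipeline
```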
Live User Testing
Now (pause for dramatic effect) the real testing starts.
Every piece of hardware you have in your house, your body, your car, the cars of other drivers, the kitchen at that bistro where you get the sandwich you love that tastes just like mother made… all of that is being tested in production.
Sometimes it actually is somewhat experimental. The chef adds a new spice to see what people say. Software is beta tested: Apple’s iOS, for example, is tested every year for months on real phones doing real work, and Azure has lots of preview features. The chef gets feedback; iOS 26 will go out of beta; and even those long-in-preview Azure features will become GA (probably).
But once you ship the software as fully ready for everyone’s use… this is when the software will be tested for real, by everyone who uses it. And these users are going to beat your software to death. They will:
- Test every boundary you have set
- Use your software for things it wasn’t designed for
- Tell you, their friends, and their followers of your failures
Your company will egg them on to use your software heavily with marketing and promotions. This leads to true load testing (see the sketch after this list). A few examples that come immediately to mind of offerings that regularly fail live user testing:
- Those who sell high-demand things. Tickets, for example. It took years to get companies to add virtual queues. I have had Ticketmaster’s site fail too many times when I had choice seats ready to buy.
- Sports streaming. Netflix’s Paul vs. Tyson fight is a recent example, but it happens quite often.
- Companies that make offers to customers that go viral, and for a few hours, there is 100X the normal daily load that is typically spread out over 24 hours.
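To get a feel for how quickly a traffic spike finds your limits, here is a toy load-spike sketch: it fires concurrent requests at one endpoint and reports failures and the slowest response. The target URL is a hypothetical placeholder, and for real load testing you would reach for a dedicated tool such as k6, Locust, or JMeter.

```python
# load_spike.py -- a toy load-spike sketch, not a real load-testing tool.
# The target URL is hypothetical; use k6/Locust/JMeter for real work.
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # third-party: pip install requests

TARGET = "https://example.com/checkout"  # hypothetical endpoint
NUM_REQUESTS = 500
WORKERS = 50  # 50 concurrent "users"

def hit(_: int) -> tuple[bool, float]:
    """Issue one request; return (succeeded, seconds taken)."""
    start = time.perf_counter()
    try:
        ok = requests.get(TARGET, timeout=10).status_code == 200
    except requests.RequestException:
        ok = False
    return ok, time.perf_counter() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        results = list(pool.map(hit, range(NUM_REQUESTS)))

    errors = sum(1 for ok, _ in results if not ok)
    slowest = max(t for _, t in results)
    print(f"{errors}/{NUM_REQUESTS} failed, slowest response {slowest:.2f}s")
```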
None of this is completely unexpected. Sometimes it is just that they couldn’t test with a realistically large number of users. Sometimes the software crashes on hardware they didn’t (couldn’t?) test on.
There is a reason why most software asks for feature usage info: the makers are testing to see what is being used. And when there is a crash, they want the details to see if it happens to other people.
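As a rough illustration of that crash-detail reporting, here is a minimal Python sketch that hooks unhandled exceptions and ships the stack trace plus environment info to a collector. The collector URL is hypothetical, and real products typically use a service like Sentry or Crashlytics rather than a hand-rolled hook like this.

```python
# crash_report.py -- a minimal crash-telemetry sketch.
# The collector URL is hypothetical; real apps use Sentry, Crashlytics, etc.
import platform
import sys
import traceback

import requests  # third-party: pip install requests

COLLECTOR = "https://telemetry.example.com/crash"  # hypothetical endpoint

def report_crash(exc_type, exc_value, exc_tb):
    """Send the stack trace and environment details, then crash normally."""
    payload = {
        "error": repr(exc_value),
        "trace": "".join(traceback.format_exception(exc_type, exc_value, exc_tb)),
        "os": platform.platform(),  # the hardware/OS detail vendors want
        "python": platform.python_version(),
    }
    try:
        requests.post(COLLECTOR, json=payload, timeout=5)
    except requests.RequestException:
        pass  # never let telemetry make the crash worse
    sys.__excepthook__(exc_type, exc_value, exc_tb)

sys.excepthook = report_crash

if __name__ == "__main__":
    1 / 0  # demo crash: report is attempted, then the traceback prints as usual
```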
So?
There has always been some humor around the phrase “we test in production,” and then some deep scolding from the “experts” who say “you should never test in production.” To me, you need to treat your production system just like a system you are actively testing.
Get a monitoring tool and set up proper alerts to tell you when things are outside of the norm. Then, whether or not your system buckles under the pressure, look to see where there may have been limitations.
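Here is a minimal sketch of what an “outside of the norm” check might look like, run on a schedule. Both fetch_error_rate and page_on_call are hypothetical stand-ins for your metrics store and paging tool (think Prometheus and PagerDuty); the simulated numbers are only there so the sketch runs.

```python
# alert_check.py -- a minimal "outside of the norm" check, run on a schedule.
import random
import statistics

def fetch_error_rate(minutes_ago: int) -> float:
    """Hypothetical stand-in for a metrics-store query (e.g., Prometheus).
    Here it just simulates a quiet system with roughly 1% errors."""
    return random.gauss(1.0, 0.2)

def page_on_call(message: str) -> None:
    """Hypothetical stand-in for an alerting tool (e.g., PagerDuty)."""
    print("ALERT:", message)

def check() -> None:
    # Baseline: the last hour of one-minute windows, excluding the newest.
    history = [fetch_error_rate(m) for m in range(2, 62)]
    baseline = statistics.mean(history)
    spread = statistics.stdev(history)
    current = fetch_error_rate(1)
    # "Outside of the norm" here means three standard deviations above baseline.
    if current > baseline + 3 * spread:
        page_on_call(f"error rate {current:.2f}% vs baseline {baseline:.2f}%")

if __name__ == "__main__":
    check()
```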
Always have a testing mindset
The “never test in production” attitude is dead wrong. Always think of production as the proving grounds for your software. Testing before going to production is like a game plan: the better you plan and practice, the more likely you are to succeed.
But as Mike Tyson once said, “Everybody has a plan until they get punched in the mouth.”
Simply make sure production is not the ONLY place you test. Because you don’t want to test your users’ patience, or they will stop testing your software and move on to someone else’s.