Why the testing/checking debate is so messy – a fruit salad analogy

Five days ago James Thomas posted the following in the Software Testing & Quality Assurance group on LinkedIn:

Are Testing and Checking different or not?
This article by Paul Gerrard explains why we shouldn’t be trying to draw a distinction between checking and testing, but should be paying more attention to the skills of the testers we employ to do the job.

I posted a reply there, but I think I can do better than those initial thoughts, so here we go.

Let’s imagine the following scene: Alice and Bob are preparing a fruit salad together.
Alice: “Ok, let’s make a nice fruit salad. We need some apples and some fruit.”
Bob: “Euh, aren’t apples fruit?”
Alice: “Yes. Of course. But when I say ‘fruit’, I mean ‘non-apple fruit’.”
Bob: “So you don’t think that an apple is fruit?”
Alice: “No, I do. It’s just when I say ‘fruit’, I want to focus on the non-apple fruit.”
Bob: “Uhuh. So fruit is stuff like bananas, pears and pomegranate?”
Alice: “Exactly. And that would actually make a great fruit salad: apple and those three fruits.”
Bob: “Ok, but what if I feel like having a fruit salad. And it turns out that I only have apples and bananas at home and I don’t have time to go to the store. And, importantly, I really really don’t like bananas. So I decide to only use apple. That’s still a fruit salad, right?”
Alice: “I suppose so, technically, but still… a fruit salad without any non-apple fruit… I mean, everyone puts apples in their fruit salad and there’s so much more fruit than just apples! So when I say ‘fruit’, I just really want to focus on the non-apple fruit, ok?”
Bob: “Ok, fine. Glad we cleared that up. One more question though: what about tomatoes?”
Alice: “Don’t. Just don’t.”

Now read this piece of dialogue again and replace ‘apple’ with ‘checking’ and ‘fruit’ with ‘testing’. Bob’s confusion is exactly the reason why the whole testing/checking debate is messy: most of the time it’s about testing *versus* checking. You can see it in the title of the LinkedIn post: “Are testing and checking different or not?” You can see it in Paul Gerrard’s article: “[…] the James Bach & Michael Bolton demarcation of or distinction between ‘testing versus checking’.” You can see it in Cem Kaner’s article: “According to the new doctrine, “checking” is the opposite of “testing” and therefore, automated tests that check against expectations are not only not sapient, they are not tests.” You can also see it in the original “Testing vs. Checking” blog post by Michael Bolton dated August 2009. It’s right there in the title. Do note, though, that this post has been retired and we are directed to the new version, “Testing and Checking Refined”. However, the new version still contains a sub-title “Checking vs. Testing”.

“Testing and Checking Refined” also contains a helpful diagram that’s key to the point I want to make in this post. The diagram shows us that there’s one overarching category ‘testing’ (fruit), which contains two things: ‘checking’ (apples) and all other testing activities (non-apple fruit). This helps us to understand two things.

First of all, it shows that any discussion about testing *versus* checking is bullshit. They are on a different conceptual level, just like fruit and apples, so any direct comparison is meaningless. To throw in one more analogy, what would you answer if, while visiting a friend at his home, you were asked: “Would you like coffee, or something to drink?”

Secondly, it explains why my previous point is difficult to get(1). The diagram presents people with two concepts: ‘testing’ and ‘checking’. Of course there’s also “learning by experimenting, including study, questioning, modeling, observation, inference, etc.”, but that’s just too vague to register mentally as an entity. It does not coalesce into a concept. What we’re left with are only two concepts, ‘testing’ and ‘checking’, and the non-checking part of testing is gone. This is actually illustrated by the title of James Bach’s and Michael Bolton’s blog post: “Testing and Checking Refined”.

So when you present two concepts in this way, is it really that surprising that people talk about them like apples and oranges, instead of like apples and fruit? I think not.

— — —

(1) Including for myself. See for instance how I’m struggling with this very problem in my blog post from August: “What’s the word for the part of testing that’s not checking?”

What’s the word for the part of testing that’s not checking?

The question I asked
Yesterday I asked on twitter:

The reason I asked is that I noticed I needed that word in discussions about testing and checking. If checking is part of testing – and in the RST namespace it most definitely is, see ‘Testing and checking refined’ – then what can I contrast checking with? Contrasting checking with testing (as in ‘checking versus testing’) isn’t going to work: there’s one thing that’s checking and then there’s this other thing, testing, that contains that one thing and some other stuff(1), but it’s like a completely different thing. See the difference? Conceptually that just doesn’t work – at least not in my mind.

The answers I got
So I figured I’d ask twitter in all its infinite testing wisdom and lo and behold, not only did people reply, a discussion ensued with the following people (listed in no particular order) participating in different configurations: @eddybruin, @mariakedemo, @SandroIbig, @TestPappy, @dwiersma, @ilarihenrik, @PhilipHoeben, @huibschoots and @deefex. Thank you all!

Do click on the embedded tweet to read all of it, but here’s a list of the answers they came up with:

  • Exploring
  • Learning
  • Evaluating
  • Monitoring
  • Confidence building and refining
  • Experimenting
  • Non-checking

It only took a few replies for me to realize I may have asked the wrong question – as in: not the question I had intended to ask. And a quick look at the diagram in ‘Testing and checking refined’ confirmed this:

full blog post at http://www.satisfice.com/blog/archives/856

Testing is a very big box. Learning is a part of it and so are experimenting, studying, questioning, modeling, etc. *and* checking. So the part of testing that’s not checking, isn’t just one thing, it’s many things. Hence Del Delwar’s (@deefex) reply: “I’d suggest that’s possibly too wide an array of things to encapsulate in a single word. Try ‘non-checking’ :-)”

The question I meant to ask
So with that settled, on to the question I meant to ask: If checking is “the process of making evaluations by applying algorithmic decision rules to specific observations of a product” (source yet again ‘Testing and checking refined’), then what’s the name for the non-algorithmic evaluation of a product? A ‘heuristic evaluation’? Does such a thing exist? Or are all our evaluations during testing checks?
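To make that definition a bit more tangible, here is a minimal Python sketch of what a check in this sense could look like: a decision rule, written down in advance, applied mechanically to one specific observation. The names (`Observation`, `check_response_time`) and the 500 ms limit are invented for illustration; none of them come from ‘Testing and checking refined’.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One specific observation of the product (hypothetical example)."""
    page: str
    response_time_ms: float


def check_response_time(obs: Observation, limit_ms: float = 500.0) -> bool:
    """Algorithmic decision rule: pass if the page responded within the limit.

    Once the rule is written down, applying it requires no judgment.
    """
    return obs.response_time_ms <= limit_ms


fast = Observation(page="/login", response_time_ms=120.0)
slow = Observation(page="/report", response_time_ms=2300.0)

print(check_response_time(fast))
print(check_response_time(slow))
```

The sketch only covers applying the rule. Where the rule came from (why this page, why 500 ms) is not in the code at all, and whether every evaluation made during testing can be written down like this is exactly the open question.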

First of all, when I test, it doesn’t feel like all my evaluations are checks. That may not be a very strong argument, but I do think it’s at least worth noting.

Secondly, where do these algorithmic decision rules come from? Do I need to have them beforehand? Do I need to have them recorded somewhere? Or can I just make them up as I go along? More importantly, do these rules need to be explicit?

And that last question led me to a bunch of philosophical questions:
– If not all of our evaluations are checks per se, is it possible to (re-)formulate them as checks?
– If I can’t express my evaluation as a check, how would I be able to communicate in a meaningful way about my evaluation?
– If my evaluation is founded on tacit knowledge and there’s no need to make that knowledge explicit, because the people I communicate with, share in that tacit knowledge, can that evaluation still be considered a check?
– Does it matter if the tacit knowledge on which an evaluation is based, is ‘weak’ (we could make it explicit) or ‘strong’ (we can’t make it explicit)?
– Where does algorithmic end? I can make an algorithmic decision if I find something beautiful, by observing my aesthetic feelings towards that object. If I observe positive feelings within myself, I find the thing beautiful. However, I can’t make an algorithmic decision if I find something beautiful (yes, I know, that’s the exact same sentence), because I can’t specify a set of algorithmic rules that decide if I would find something beautiful. So what it boils down to is: is the algorithmic evaluation of an observation of my own mental state a valid check? Or should we go one level deeper, to the cause(s) of my mental state?
– Is the previous bullet point anything other than a philosophical quagmire one needs to extract oneself out of Münchhausen-style? In any case I highly recommend reading Raymond M. Smullyan’s “An Epistemological Nightmare” to sink a little deeper.

Back to the context of the question
So… why was I asking this question again about non-algorithmic evaluation during testing? It’s quite simple actually:
(1) If testing is investigating a product to evaluate it, and
(2) all evaluation is done by applying algorithmic decision rules, then
(3) the core of testing is checking.

Of course, there’s all the stuff going on around the checking. There’s all the investigating, modeling, experimenting to come to the checks. And there’s all the sense-making of the results of the checks to provide valuable information to our stakeholders. But it all revolves around checking.

So when I am talking to someone who seems to think that there’s nothing more to testing than checking, I can argue that there’s all this other stuff we testers do that is testing, but is not checking. But what I cannot argue is that there is something we do instead of checking (so something non-algorithmic) that leads to evaluative data(2) about the product, because there is no such thing. And that bugs me, because that’s not how I’ve been using the words ‘checking’ and ‘testing’.

— — —

(1) Advanced semantics question: is ‘checking versus testing’ more like ‘apples versus fruit’ or more like ‘squares versus rectangles’? (a)

(2) Data + interpretation = information. Hmm… or: Interpretation(data) = information.

— — —

(a) Apparently the correct answer is “leaves vs trees”. (https://twitter.com/al3ksis/status/633343017252995073)

Test automation – five questions leading to five heuristics

(I wrote a follow-up to this post in June 2019: how this tester writes code.)

In 1984 Abelson and Sussman said in the Preface to ‘Structure and Interpretation of Computer Programs’:

Our design of this introductory computer-science subject reflects two major concerns. First, we want to establish the idea that a computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. *Thus, programs must be written for people to read, and only incidentally for machines to execute.* Second, we believe that the essential material to be addressed by a subject at this level is not the syntax of particular programming-language constructs, nor clever algorithms for computing particular functions efficiently, nor even the mathematical analysis of algorithms and the foundations of computing, but rather the techniques used to control the intellectual complexity of large software systems. [emphasis mine]

The oft-quoted sentence I emphasized is even more true if the purpose of our programs is test automation(1). So let’s say you run your test automation program and the result is a list of passes and fails. The purpose of testing is to produce information. You could say that this list of results qualifies as information and I would disagree. I would say it is data, data in need of interpretation. When we attempt this interpretation, we should consider the following five questions.

Question 1: What exactly is this list of results as such telling you?
Picture the list of test results. All it contains are the names of the test cases and whether they passed or failed. With just that list in front of you, how much do you know? How easy is it to identify potential problems? To identify where you need to start investigating? Are you able to do that based on the list as such? Or will you have to dive into the details of each test case to be able to do this? I certainly hope not…

Question 2: How do you tell false negatives from true ones?
Going through the list of passes and fails, you’ll probably feel good about the passes and bad about the fails.(2) So you set out to investigate the failed test cases. However, some will be true negatives (the test exposed a bug) and some will be false negatives (the test is wrong). How will you be able to tell the difference?

Question 3: How do you tell false positives from true ones?
Not only can we have false negatives in our test results, we might also have false positives. Test cases that pass, although they shouldn’t have. How will you be able to tell the difference here? And more poignantly, where will you find the motivation to even start looking for the false positives? Why can’t we just be happy all those tests passed?
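The two questions above can be laid out as a small table. The sketch below follows this post’s convention (a pass is a ‘positive’, a fail is a ‘negative’); the names `Outcome` and `classify` are made up for illustration. Note that `product_has_bug` is an input here, while in practice it is the very thing you only find out by investigating, so this is a way of naming the outcomes and not a way of detecting them.

```python
from enum import Enum


class Outcome(Enum):
    TRUE_POSITIVE = "passed, and the product is fine"
    FALSE_POSITIVE = "passed, although it should have failed"
    TRUE_NEGATIVE = "failed, and exposed a real bug"
    FALSE_NEGATIVE = "failed, but the test itself is wrong"


def classify(test_passed: bool, product_has_bug: bool) -> Outcome:
    """Name the outcome, given the result and the (usually unknown) truth."""
    if test_passed:
        return Outcome.FALSE_POSITIVE if product_has_bug else Outcome.TRUE_POSITIVE
    return Outcome.TRUE_NEGATIVE if product_has_bug else Outcome.FALSE_NEGATIVE


# Print all four combinations of result and truth.
for passed in (True, False):
    for has_bug in (True, False):
        outcome = classify(passed, has_bug)
        print(f"passed={passed!s:5} bug={has_bug!s:5} -> {outcome.name}: {outcome.value}")
```

A list of passes and fails only gives you the first column of that table. The second column is what the interpretation has to supply.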

Question 4: How do you find the thing that’s broken? Or even more fun, the things that are broken?
So you have at least one test that doesn’t return the result you want to have. That means the result is either a fail or a false positive. (So yes, of the four possible outcomes, three require further action.) For your investigation there are basically four areas to focus on:
– The product under test. You found a bug. Good job!
– Test design. You designed a test to identify a potential problem, but it turns out that problem isn’t actually a problem.
– Test execution. You made a mistake in how you translated your test designs into test automation code.
– Test tooling. Your tool (this includes the test environment) had a ‘hiccup’ or a ‘glitch’.

These four areas are relevant whether you’re investigating automated tests or other tests. However, a major problem with automated tests is that this investigation is more difficult because two of the four areas are bigger. First of all there’s the test execution area. Your translated test designs will be interpreted by a computer, which has a lot less interpretative flexibility than a human being. So your translation needs to be of a higher quality than if you were translating for another human being. Secondly, the test tooling area is bigger, simply because you have more tooling.

Question 5 (bonus meta-question): What understanding are you losing by automating?
Toyota is not unfamiliar with automation. And last year, they decided to replace a number of robots in their factories with human workers. Why? As project lead Mitsuru Kawai says:

We cannot simply depend on the machines that only repeat the same task over and over again. To be the master of the machine, you have to have the knowledge and the skills to teach the machine. (source: Bloomberg)

Toyota realized that by fully automating the car manufacturing process, they were losing important knowledge and skills about how to build cars. So no, they’re not replacing all robots with humans, but they are putting humans back into the manufacturing process so that learning and improvement can happen. The same applies to test automation. If it is keeping you from interacting with the product, from actually testing yourself, it’s time to rethink your approach.

Epistemic testability
In the end it all boils down to one question: is your test automation increasing or decreasing your epistemic testability? Does it make it easier or harder to bridge the gap between what we know and what we need to know about the status of the product? Test automation is excellent in providing you with the illusion of increased epistemic testability: “Every night we run 10,000 tests in less than an hour!” While actually decreasing it: “Alice and Bob spend four hours every day processing the results!”

Having thought about those questions, I have gathered the following set of heuristics on test automation. Time and experience will tell if they’re any good…

Heuristic 0: Don’t call it test automation.
As James Bach pointed out at Tasting Let’s Test Benelux, developers used to talk about “automatic programming”. The meaning of the term has changed over time, but at no point in time did developers think that when you do automatic programming (e.g. use a compiler), all of programming has been automated. So either we change the meaning of ‘test automation’ in a similar way (which fails to account for the testing-checking distinction), or we come up with a better term. I’m still looking for a better term, all suggestions are welcome.

Heuristic 1: Never trust a test you haven’t seen fail. (source: Colin Vipurs via Rob Fletcher)
It will help you avoid false positives. But we should actually take this several steps further, as you can read in this blog post by Richard Bradshaw: Who tests the checks? Do go read the whole post, but one excellent thing he proposes is to test if a failing test gives sufficient information about why it fails.
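One way to see a check fail on purpose is to run it against a deliberately broken variant of the code under test and then read the failure message. The Python sketch below is only an illustration of that idea, not the approach from either source; `apply_discount`, `broken_apply_discount`, `check_discount` and `seen_to_fail` are all hypothetical names.

```python
from typing import Callable

Discount = Callable[[float, float], float]


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test."""
    return price * (1 - percent / 100)


def broken_apply_discount(price: float, percent: float) -> float:
    """Deliberately wrong variant: ignores the discount entirely."""
    return price


def check_discount(impl: Discount) -> None:
    """The check, parameterised so it can run against either variant."""
    actual = impl(200.0, 25.0)
    expected = 150.0
    if abs(actual - expected) >= 1e-9:
        raise AssertionError(
            f"apply_discount(200.0, 25.0) returned {actual}, expected {expected}"
        )


def seen_to_fail(check: Callable[[Discount], None], broken_impl: Discount) -> str:
    """Run the check against the broken variant and return its failure message.

    Raises RuntimeError if the check passes anyway: it cannot detect this bug.
    """
    try:
        check(broken_impl)
    except AssertionError as exc:
        return str(exc)
    raise RuntimeError("check passed against a known-broken implementation")


check_discount(apply_discount)  # passes silently against the real code
message = seen_to_fail(check_discount, broken_apply_discount)
print(message)  # read this: does it tell you enough about why it failed?
```

Returning the message, instead of just a boolean, is a deliberate choice in this sketch: it lets you judge whether the failure is informative, which is the second half of the heuristic.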

Heuristic 2: Each test should test only one thing. (s/test/check, of course)
This will reduce the complexity of your investigation when your test needs investigating. If it fails, you can begin looking at the one thing your test is testing. Also, if each test tests only one thing, you will have several quite similar tests. Looking at all of them, seeing which passed and which failed, will give you useful clues in your investigation.
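A small Python sketch of the difference, assuming a hypothetical function `normalise_username` with a planted bug (it trims whitespace but never lowercases). All names here are invented for illustration, and the tiny `run` helper stands in for whatever test runner you actually use.

```python
from typing import Callable, Dict, List


def normalise_username(raw: str) -> str:
    """Hypothetical function under test, with a planted bug: it never lowercases."""
    return raw.strip()


def expect_equal(actual: str, expected: str) -> None:
    if actual != expected:
        raise AssertionError(f"got {actual!r}, expected {expected!r}")


def check_all_in_one() -> None:
    # Stops at the first mismatch, so the later expectations are never evaluated.
    expect_equal(normalise_username("Alice"), "alice")
    expect_equal(normalise_username("  alice"), "alice")
    expect_equal(normalise_username("alice  "), "alice")


def check_lowercases() -> None:
    expect_equal(normalise_username("Alice"), "alice")


def check_strips_leading_whitespace() -> None:
    expect_equal(normalise_username("  alice"), "alice")


def check_strips_trailing_whitespace() -> None:
    expect_equal(normalise_username("alice  "), "alice")


def run(checks: List[Callable[[], None]]) -> Dict[str, str]:
    """Run each check and record 'pass' or the failure message under its name."""
    results = {}
    for check in checks:
        try:
            check()
            results[check.__name__] = "pass"
        except AssertionError as exc:
            results[check.__name__] = f"fail ({exc})"
    return results


combined = run([check_all_in_one])
separate = run([
    check_lowercases,
    check_strips_leading_whitespace,
    check_strips_trailing_whitespace,
])
print(combined)
print(separate)
```

The combined check reports one fail and says nothing about trimming. The separate checks show two passes next to one fail, and that pattern already points at the lowercasing before you have opened any test code.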

Heuristic 3: It’s better to have reliable information that doesn’t exactly tell you what you want to know, than unreliable information that does.
By reliable I mean: does it run all the tests every time with a minimal risk of false positives or negatives? If, to get that reliability, my tests can’t run at the level I would like them to (e.g. the GUI level), I’m more than happy to make that trade-off. The additional interpretative step I need to make is less of a risk than the extra effort it takes to deal with a flaky, unreliable test set that doesn’t require that step.

Heuristic 4: Every minute spent debugging test automation code is wasted, because you learn nothing.
Going back to the four areas to investigate, the first three (product, test design, test execution) are interesting from a tester’s perspective. Investigating these will provide you with opportunities to learn about the product or about testing. Not so with a failure in your test tooling. It’s an impediment that needs to be solved quickly. In this respect there is no difference between a failure in your test automation tool and a failure of your keyboard.

Heuristic 5: Epistemic testability, epistemic testability, epistemic testability.
Repeating this because it is so important. It is the litmus test of your test automation. Consider it when choosing your tools, when deciding on abstraction layers, when designing your tests, when composing your test set, when writing your test automation code, when testing your tests, when documenting your tests, when interpreting the results. Because when you have your first test results, your first list of passes and fails, it’s the epistemic testability that will decide for a large part how useful that list will be.

(This post was deeply influenced by the ideas of James Bach, Michael Bolton, Alan Richardson, Pascal Dufour, Richard Bradshaw and the BBST Bug Advocacy Course. Thank you to all.)

— — —
(1) Or, instead of test automation, a better term would be ‘check execution automation’. Although this is an important distinction, I’m not going to pursue it today. If you do want to, this post is a good starting point: Testing and Checking Refined by James Bach and Michael Bolton.

(2) Be wary of the binary disease! Luckily there’s a cure: Curing Our Binary Disease by Rikard Edgren at Øredev 2011.