Monday, April 22, 2024

Learning the hard way, experience

Over my career, I have delivered my fair share of good results in testing. There are two signature moves I wanted to write a post on, and what I learned while failing at both of them after I first thought I had them pocketed. 

These two signature moves are changes I have been driving through successfully over multiple organizations and teams: 

  1. Frequent releases
  2. Resultful contemporary exploratory testing with good automation at the heart of it

Frequent releases

Turning up the release frequency is an improvement initiative I have now driven for over 10 different teams and products, across four organizations. Of all the products/teams/orgs I have done this at, only one of them was the optimal case of a web application with control over its distribution channels. I've been through this the hard way - removing reboots from upgrade cycles, integrating processes that were designed for non-continuous delivery such as localization and legal, including telemetry to know if we leave customers hanging on old versions - all of that is an article of its own.

Succeeding with this is not the interesting part, except for the latest success of my current team making releases routine again after dropping the ball and suffering through a four-month stabilization period last year. I dare to call it a success again since we made our 5th release of this year last week, and it included the best scope and process we have had so far. 

Why did we fail with a stabilization phase then? We can surely learn something from it. 

The experiences point us to failing at communicating and collaborating with a non-technical product owner. How can that create a four-month stabilization phase? By leaking uncertainty, and pushing uncertainty to a time when it piles up. We leaked uncertainty in functional scope, but also in parafunctional scope, and when pushed to address performance and reliability, we found functional scope we had not recognized existed. 

Breaking the main branch and not realizing how much we had broken it (automation was still passing!) cost us extra stress and uncertainty. We would look at tasks as "I did what you asked", not as "We together did what you did not know to ask". 

And when we tested, we failed at communicating our results effectively. New people, new troubles, and they accumulated quickly when we did not want to stop the line and fix before moving forward. We optimized for keeping people busy over getting things to a shape where they could be released.  

Having my team fail with my signature move taught me that there is much more implicit knowledge in continuous, timely testing than what we capture with automation. Designing for feedback and making it timely takes learning, and this year we have been learning that with a short leash of releases.

Resultful contemporary exploratory testing with good automation at the heart of it

Framing test automation as a worthwhile notetaking mechanism while exploring has become so much of a signature move of mine that I have given it a label of its own: contemporary exploratory testing. When we explore, some of it is attended and some unattended. Test automation, when failing, calls for us to attend. Usually pretty much continuously. 

Working with developers, we have learned to capture our notes as lovely English sentences describing the product capabilities in some tests; as developer intent captured in unit tests in others; as component tests and as subsystem tests; and as system-of-systems tests when those make sense. We have 2500 tests we work with on the new tech stack side, some hand-crafted, and others model-based generated tests we use as reliability tests, allowing longer unique flows to be generated and tracked.
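To make the notetaking idea concrete, here is a minimal sketch of what I mean by a test whose name reads as an English sentence about a capability. The API, station name, and value range are made up for illustration; our real tests live at several levels and in several frameworks.

```python
# A hypothetical note captured as a test: the name and docstring read as an
# English sentence about a capability; the API and values are made up here.
def fetch_latest_pressure_sample(station_id: str) -> float:
    """Stand-in for a real product API; returns a fixed value for the sketch."""
    return 1013.25

def test_pressure_sample_is_available_for_a_known_station():
    """A known weather station serves a plausible surface pressure reading."""
    pressure_hpa = fetch_latest_pressure_sample("station-001")
    assert 850.0 <= pressure_hpa <= 1100.0  # plausible range in hectopascals
```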

We learn about information we have been missing, and we extend our tests as we fix the problems. We extend the tests with new capabilities. How can all this successful routine be a failure?

It turns out there was a difficult conversation we did not have in time, one of validation. And when the problem is that you are building your product so that it does not address the user needs, none of the other tests you may have built matter. Instead of being a help, they become friction, where you will explain how changing all of those will be work, at a time when more work is less welcome. 

This failure comes from two primary sources: carefulness in cross-organizational communication, due to project setup reasons, delaying the learning; and, as much as that, the unavailability of more senior test/dev folks on a previous project delivery. 


Both failures are successes if we learned and can apply what we learned going forward. For me they taught that a career change is in order: I don't want to find myself spread so thin that I don't recognize these expensive mistakes happening. So I will be a tester again, fully. Well, surrounded by testers, as a director of test consulting, with a hands-on consulting role starting in June. Looking forward to new mistakes, and to correcting the mistakes I make. Because: mistakes were made, and by me. Only taking ownership allows us to move forward and learn. 



Sunday, March 17, 2024

Urban Legends, Fact Checking and Speaking in Conferences

I consume a lot of material in the form of conference talks, and I know exactly the moment when conference talks changed for me forever. It was the Scan Agile conference in Helsinki many years ago, and I had just listened to a talk from an American speaker. I enjoyed their experience as told from the stage so much that I shared what I had learned with my family. Only to learn the story was fabricated. 

In one go, I became suspicious of all stories told from the stage. I started recognizing that my stories are lies too: they are me-sided retellings of actual events. While I don't actively go and fabricate them, those who do will tell me that is how human experience and memory work anyway. And the responsibility of taking everything with a grain of salt is on the consumer. 

The stage seeks the memorable and the impactful. And if that includes creating urban legends, it bothers some popular speakers less than it bothers me. 

With that mindset, I still enjoy conference talks, but they cause me the extra work of fact-checking. I may go and have a conversation to link a personal story to a reference. But more often, it's the stories speakers share about other people that I end up searching for online. 

In the last two weeks of conferences, I have been fact-checking two stories told from the stage. They were told by my fellow testing professionals. And the evidence points to word-of-mouth urban legends on AI. 

The first story was about some people sticking tiny traffic signs around to fool machine-vision-based models. The technique is called an adversarial attack. There are articles on how small changes with stickers mess up recognition models. There are ideas that this could be done. But with my searches, I did not find conclusive evidence that someone actually did this in production, in live environments, risking other people's health. I found the story dangerous as it came without warnings about not testing this in production, or about the liability of doing something like this. 

In addition to searching online for this, I asked a colleague at work with experience of scanning millions of kilometers of imagery of the roads where this could be happening. It just so happens I have worked very close to a product with machine vision, and could assess if this was something those in the problem space knew of. I was unable to confirm the story. The evidence points to misdirections from stickers on buses, but not to tiny stickers a computer sees but a human doesn't. 


The story was told to about 50 people. Those 50 people, trusting the presenter on the stage, would get to apply their fact-checking skills before they do what I do now: tell the story forward as it was told, with their own flavor of it. 

The second story was told by two separate presenters. The story was about someone getting a binding contract, on company letterhead, for buying a car for $1, where the justice system is still out deciding whether the contract must be upheld. One presenter showed it as a reimagined chat conversation translated to Finnish, but made no claims about letterhead or litigation, as the more colorful version was already out there, presented to this audience. 

Fact-checking suggests that someone, specifically Chris Bakke, asked a ChatGPT-based car sales bot to give an offer for a 2024 Chevy Tahoe with "no takesies backsies" - far from an official offer on company letterhead. 

So no binding contract. No company letterhead. No car bought for this price. No litigation pending. 

The third story told was a personal one. You can find it as a practical example of analyzing a screenshot for bugs. It suggested ChatGPT-4 does better at recognizing bugs from an image than people do. While it may not be entirely incorrect in the scope of that individual picture, the danger of this story is that the example used in the picture is a testing classic. Myers points out that 7.8/14 is what a typical developer scored in 1979, and there are more detailed listings in the literature that show we do even worse in analyzing it, and that the results are technology-dependent. 


Someone else at the conference also suggested we should not read books but ask for summaries from ChatGPT, completely missing a frame of reference for how well the model would then do compared to a human who has read many of these references. Reading less would not help us. 

Starting an urban legend, especially now that people are keen on hearing stories of good and bad in the realm of AI, is easy. Tell a story, and they tell it forward. Educational and entertainment value over facts. 

So let's finish with a recognition of what I just did with this blog post. I told you four stories. I provided no names of the people who left an impact significant enough for me to take the time to write this. It's up to you to assess, filter, fact-check, and choose which stories become part of what you tell forward. Most information we have available is folklore. There are no absolutes. But there is harm in telling stories from the stage, and I would think speakers need to start stepping up to that responsibility. 

Then again, it's theater. Entertainment with a sprinkle of education. 

Wednesday, March 6, 2024

A sample of attended testing

Today, prompted by day 6 of 30 Days of AI in Testing, I tried a tool: Testar. My reasons for giving it a go are many: 

  • Tanja Vos as project leader for the research that generated this would get my attention
  • I set up a research project at a previous employer on AI in / for testing, and some generation of this tool was one of that project's outcomes
  • The day 6 challenge said I should
  • Open source over commercial for my hobbyist learning attention, all the way
I read the code, read the website, and tried the tool. The tool did not crash but survived over an hour "testing" our software with the standards of testing I learned some large software contractors still expect from testers, so I could say it did well. I would not go as far as "beats manual testing" like the videos on the site did. 



The tool was not the point for me though. While the tool ran, clicking through 50 scenarios with 50 actions and what seemed to be a lot more than 50 clicks, I intertwined unattended testing (the tool doing what the tool does) with attended testing, and I was watching a user interface the tool did not watch for fascinating patterns. 

In my test now I have two systems that depend on each other, and two ways of integrating the two systems for a sample value of pressure. And I have a view that allows me to look at the two ways side by side. 

As Testar was doing its best to eventually conclude "Test verdict for this sequence: No problem detected", I was watching the impact of running a tool such as this on the integration with the second system. 

I noted some patterns: 
  • The values on the left were blinking between no data available and expected values, whereas the values on the right were not. I would expect blinking. 
  • The values on the left changed with a delay compared to the values on the right. I would expect no difference in time of availability. 
  • The bling sounds of trying something with a warning were connected to the patterns I was observing, and directed me to make visual comparisons sampled across the hour of Testar running. 

This is yet another example of why there is no *manual testing* and *automated testing*. With contemporary exploratory testing, this is a sample of attended testing with relevant results, while simultaneously doing unattended testing with no problem detected. 

This was not the first time we used generated test cases, and the types of programmatic oracles general enough to note a crash, a hang, or an error we have had around for quite a while - ever since we realized it's about programmatic tests, not automated testing. Programmatic tests provide some of their greatest value in attended modes of running them. 
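To show the kind of generic programmatic oracle I mean - not Testar's implementation, just a minimal sketch with a made-up command, timeout, and error markers:

```python
# A sketch of a generic programmatic oracle: run a command, flag crashes,
# hangs, and error output. Command and limits are made-up placeholders.
import subprocess

def generic_oracle(command: list[str], timeout_s: int = 60) -> str:
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return "hang: no exit within time limit"
    if result.returncode != 0:
        return f"crash: exit code {result.returncode}"
    if "ERROR" in result.stderr or "Traceback" in result.stderr:
        return "error: suspicious output on stderr"
    return "no problem detected"

if __name__ == "__main__":
    print(generic_oracle(["python", "-c", "print('ok')"]))
```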

For me as a contemporary exploratory tester, automation gives four things: 

Running Testar, I now have 50 scenarios of clicking with 50 actions, with screenshots, on my disk. That alone serves as documentation in ways I could theoretically benefit from if anyone asked me how we tested things. It would most definitely replace the testing that I cut from a 30-day investment to a 2-day investment last year, and replace one tester. 

Testar gave me quick clicking on one system while my real interest was on the other. Like with many forms of automation, time and numbers are how we extend reach. 

Testar did not tell me to look at things, but our application's sound alerting did. Looking at the screenshots, I saw a state I would not expect, and went back to investigate it manually. That too was helpful, sampling what I care to attend to. 

The last piece, guiding to detail, I usually get when I don't rely on auto-clickers but actually have to understand the system well enough to write a programmatic test. 



Tuesday, March 5, 2024

A Bunnycode Case Study for AI in Testing

It's day 5 of 30 Days of AI in Testing, and they ask us to read a case study or share our experience. I shared experience already on an earlier day, and on the whim of a moment, set up a teaching example. 

I google for obfuscated code in Python to find https://pyobfusc.com/. I'm drawn to the most reproducible entry, authored by mindiell, and when I see the code, I'm sold. How would you test this? 

Pretty little rabbit, right? Reminds me of reading some code at work; work is just less intentional with its obfuscation. And I really do not have the time or energy to read that. I could test it as a black box, learning that given a number as a parameter, it gives me a number:
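Something like this minimal probing sketch is what I mean; the file name rabbit.py and the command-line interface are my assumptions about how the entry is run:

```python
# Black-box probing: feed the script a few numbers and record what comes back.
# The file name rabbit.py and its command-line interface are assumptions.
import subprocess

for n in [0, 1, 2, 5, 10, 20]:
    result = subprocess.run(
        ["python", "rabbit.py", str(n)], capture_output=True, text=True
    )
    print(f"input={n} output={result.stdout.strip()}")
```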


As if I didn’t know what the rabbit implements or recognize the pattern in the output, I was thinking of just asking ChatGPT about it. However, I did not get that far. 

Instead, I wrote def function(): in my IDE while GitHub Copilot was on, thinking I would wrap the program into a function. It reformatted it into something a bit more readable. 

Prompting some more in the context of code. 

Comment line “#This function” proposes “is obfuscated”. Duh. 

Comment line “#This function imp” proposes "lements the Fibonacci sequence using Binet’s formula".

At this point, I ask ChatGPT how to test a function that implements the Fibonacci sequence using Binet’s formula. I get a long text saying to try the values I already tried, but in code, and a hint to consider edge cases and performance. I try a few formats of asking for a value that would make a good performance benchmark, and lose patience. 
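The "values I already tried, but in code" part would look something like this sketch; the fib() wrapper, its zero-based indexing, and the chosen values are my assumptions, with Binet's formula standing in for the obfuscated rabbit:

```python
# A sketch of known-value checks plus a nudge toward edge cases. The fib()
# wrapper is a stand-in for the rabbit code; indexing is assumed zero-based.
import math
import pytest

def fib(n: int) -> int:
    """Binet's formula stand-in for the obfuscated implementation."""
    sqrt5 = math.sqrt(5)
    phi = (1 + sqrt5) / 2
    return round(phi ** n / sqrt5)

@pytest.mark.parametrize("n, expected", [
    (0, 0), (1, 1), (2, 1), (3, 2), (10, 55), (20, 6765),
])
def test_known_fibonacci_values(n, expected):
    assert fib(n) == expected

# Edge cases worth a conversation rather than an assertion: negative inputs,
# non-integer inputs, and large n where float precision in Binet's formula
# starts to drift from the exact integer sequence.
```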

I google for a performance benchmark to learn that Binet’s formula is much faster than the recursive algorithm, and find performance benchmarks comparing the two. 
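Those benchmarks boil down to something like this timing sketch, comparing the closed form against the naive recursive version (both written here by me, not taken from the benchmarks I found):

```python
# Timing the closed-form Binet implementation against the naive recursion.
import math
import timeit

def fib_binet(n: int) -> int:
    sqrt5 = math.sqrt(5)
    phi = (1 + sqrt5) / 2
    return round(phi ** n / sqrt5)

def fib_recursive(n: int) -> int:
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)

if __name__ == "__main__":
    n = 30
    print("binet    :", timeit.timeit(lambda: fib_binet(n), number=10))
    print("recursive:", timeit.timeit(lambda: fib_recursive(n), number=10))
```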

I think of finalizing my work today by inserting the bunny code into ChatGPT and asking “what algorithm does this use”, to get a second language model to generate the likely answer of Binet’s formula. Given the risk and importance of this testing at this time, I conclude it’s time to close my case. 

There are so many uses for figuring out what it is I am testing (while being aware of what I can share with tool vendors when giving access to code), and this serves as a simulation of the idea that you could ask about a pull request. This was the case I wanted to add to the world today.

I should write a real case study. After all, that was one of the accommodations we agreed on with multiple levels of management above me when my team at work started its GitHub Copilot tryouts some 6 months ago. I should publish this in action, with real time. As soon as something generates me some time. 

Sunday, March 3, 2024

List Ways in Which AI is Used In Testing

I have now been through 3 out of 30 days of AI in Testing by Ministry of Testing. I was awarded my fifth Anniversary Badge, meaning that I have not shown up in that community for a while. 

The first day was an introduction. The second day was to read an introductory article. The third day asked us to list ways AI is used in testing. As with blogging, I filled the paper and wanted to leave the notes from personal experience behind as a blog post. 

Practical applications, personal reflection rather than research:

Explaining code. Especially on a particularly tired day, while being aware that I cannot share secrets, I like to ask ChatGPT to explain to me what changes in the code of some pull request so that I understand what to test. Answers vary from useful to hilarious and overly extensive.

Test ideas for a feature. Whenever I have completed an analysis of a feature to brainstorm my ideas, I tend to ask how ChatGPT would recommend I test it. Works nicely on domain concepts that are not specific to this company only, and I have a lot of those with standards and weather phenomena.

Manipulating statistics. I seem to be bad at remembering Excel formulas, but I do a lot of cross-referencing of test-generated results in Excel. ChatGPT has been most helpful with formulas for manipulating masses of data in Excel.

Generating input/output data. Especially with Copilot, I get data values for parameterized tests. Same test, multiple inputs and outputs generated. More effort goes into reviewing whether I like them and find them useful.

Generating (manual) test cases. I have seen multiple tools do this, and I hate hate hate it. I always turn off steps from test cases and write down only core insights I would want the future me to remember in 3 months.

Generating programmatic tests. Copilot does well on these at the unit testing level, but I am not sure I would want all that stuff available. Sometimes it helps in capturing intent. But I prefer approvals of inputs and outputs over handcrafted scripts anyway for unit-level exploratory testing.

Generating tests based on models. Has nothing to do with AI, but is a pre-AI practice of avoiding writing scripts and working with more maintainable state-based models. Love this, especially for long-running reliability testing (see the sketch at the end of this post).

Generating a database full of test data. I have liked tools for this. I think they are not AI though, even though they often claim they are. The problem of having realistic patterns but not real people’s data is a thing.

Refactoring test code. Works fine at least for Robot Framework and Python tests as long as we first open the code we want to move things from. Trust it to be pre-aware and suffer the duplication. We’ve been measuring this a bit, and Copilot seems to encourage duplication for us.

I wrote down a few, and will revisit when the time is right. 
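And the model-based item above, as a minimal sketch - not our real model or tooling, just the idea of generating long flows from a small state model:

```python
# Not a real model, just the idea: describe the system as states and actions,
# then generate long unique flows by walking the model.
import random

# Hypothetical state model: state -> {action: next_state}
MODEL = {
    "logged_out": {"log_in": "dashboard"},
    "dashboard": {"open_report": "report", "log_out": "logged_out"},
    "report": {"close_report": "dashboard", "log_out": "logged_out"},
}

def generate_flow(steps: int, seed: int) -> list[str]:
    random.seed(seed)
    state, flow = "logged_out", []
    for _ in range(steps):
        action, state = random.choice(list(MODEL[state].items()))
        flow.append(action)
        # In a real run, each action would drive the product and a generic
        # oracle would check for crashes, hangs, and errors along the way.
    return flow

print(generate_flow(steps=10, seed=42))
```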

Saturday, March 2, 2024

Fooled by Microservices, APIs and Common Components

These days writing software is not the problem. Reading software is the problem. And reading is a big part of the real problem, which is owning software. The last two years have been a particularly challenging experience in owning software, and in navigating changes in owning software. I have not cracked it, and I am not sure if I will crack it, but I have learned a lot. 

To set the stage for my experience: imagine coming to a company with a product created over a period of 20 years. There's a lot of documentation, none of it particularly useful except the code. While the shape of the existing product is invisible, you join a team dedicated to modernisation. And the team has already chosen a rewrite approach. 

For the first year, the invisible is not a priority. After all, it will be replaced, and figuring out the new thing is enough work as is with a new product. With what feels like heroic effort, you complete the goal of the year with managed compromises. Instead of a full rewrite, it's a full rewrite of selected pieces. The release is called "Proof of Concept", and it does not survive the first customer contact. 

The second year has its goals set: to add more functionality on top of the first year's. Customer feedback derails the goals, leading to an entire redesign of the user interface, addressing 9/19 individually listed pieces of feedback. Again with what feels like heroic effort, you complete the goal of the year with managed compromises, but now with an upwards rather than downwards trend. 

The second year starts to give a bit of shape to the existing invisible product, with deliberate actions. You learn you own something with 852k lines of code. It has 6.2% duplication and 16.4% unit test code coverage. The new thing you've been focusing on has grown into 34k lines of code, with 5% duplication, unit test code coverage of 70%, and a great set of programmatic tests that don't show up in the unit test coverage numbers. 

Meanwhile, you start seeing other trends:

  • Management around you casually drops expectations of microservices, APIs that allow easily building other products than this one, and common components with expectations of organizational reuse.
  • You and your team are struggling to explain that the compromises you took may not have changed it all into what some people now seem to be expecting,
  • You realize that what you now live with is four different generations of technological choices that no one told you the invisible mass would bring - time adds to your understanding.
In one particularly difficult discussion, someone throws microservices at you as the key thing, and you realize that the two sides of the table aren't talking about the same thing. 

One party is describing Scripts with flat file integration: 
  • Promised by shadow R&D without promises of support, usually done on a scale of days
  • Automate a specific task
  • Deployment is file drop on filesystem, and not available if not dropped
  • Can break with product changes
  • Output: file or database, usually a file
  • Expected to not have dependencies
  • Needs monitoring extended separately
  • Using files comes with inherent synchronisation problems
  • Needs improving if the scale is insufficient
  • Can be modified by a user
  • Batch work, will not be real time
  • File based processing steps quickly increase complexity
The other party is describing microservices with API integration: 
  • Promised by R&D assuming it will drive the product perspective forward, usually on a scale of months when deployment risks are addressed
  • Develop apps that scale
  • Deployment is with product and feature can be on or off
  • Protected by design as product capability with product changes
  • Output: well defined API
  • Deployed independently / separately with dependencies - API, DB, logic, UI
  • Follows a pattern that allows for common monitoring
  • Uses http and message queues as communication protocol
  • Can be scaled independently
  • Can't be modified by a user
  • Can be close to real time data synchronization
  • Plug in extra processing steps
One party thinks micro is "fast additions of features" and the other thinks it is a way of creating common designs. 
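To make the gap concrete, here are two hypothetical sketches of the same small capability, one per vocabulary; the file paths, record shape, and port are all made up:

```python
# Two hypothetical sketches of the same "integration", to make the gap concrete.

# Party one, a script with flat-file integration: read a dropped file,
# write results to another file for whoever happens to pick it up.
import json
from pathlib import Path

def process_dropped_files(inbox: Path, outbox: Path) -> None:
    for dropped in inbox.glob("*.json"):
        records = json.loads(dropped.read_text())
        results = [{"id": r["id"], "ok": True} for r in records]
        (outbox / dropped.name).write_text(json.dumps(results))

# Party two, a microservice with API integration: the same capability behind
# a well-defined HTTP endpoint, deployed and monitored with the product.
from http.server import BaseHTTPRequestHandler, HTTPServer

class ProcessHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        records = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        results = [{"id": r["id"], "ok": True} for r in records]
        body = json.dumps(results).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ProcessHandler).serve_forever()
```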

You start paying more attention to what people are saying and expecting, and realize the pattern is everywhere. Living with so many generations of expectations and lessons makes owning software particularly tricky. 

We often come to organizations at a single point in their timeline. We learn as we move along. And if we are lucky, we learn sooner rather than later. Meanwhile, we are fooled by the unknown unknowns, and confused by the promises of the future in relation to the realities of today. 

The best you can do is ground conversations in what there is now, and manage towards better from there. 


Friday, March 1, 2024

How AI changes Software Testing?

This week Wednesday, two things happened. 

I received an email from Tieturi, a Finnish training company, to respond to the question "How AI changes software testing?". 

I went to the Finnish Testing Meetup group, to a session themed on AI & Testing. 

These two events make me want to write two pieces into a single post. 

  • My answer to the question
  • My thinking behind answering the way I do 

My Answer to the Question: How AI changes Software Testing

I know the question is asking me to speculate on the future, but the future is already here, it's just not equally divided - repurposing the quote from a sci-fi author. 

Five years ago, AI changed *my software testing* by becoming a core part of it. I tested systems with machine learning. I networked with people creating and testing systems with machine learning. Machine learning was a test target, and it was a tool applied to testing problems for me, in a research project I set up at the employer I was with at that time. 

Five years ago, I learned that AI -- effectively applications of machine learning -- comes as components in our systems, "algorithms" generated from data. I learned that treating systems with these kinds of components as if the entire system is "AI" is not the right way to think about testing, and AI changed my software testing with the reality that it is more important than ever to decompose our systems into their pieces. These pieces serve a purpose in the overall flow, and there are a lot of other things around them. 

Now that I understood that AI components are probabilistic and not hand-written, I also understood that the problem is not the testing of them, but the fixing of them. We had a world where we could fix bugs before. With AI, we no longer had that. We had the possibility of experimenting with parameters and data in case those created a fix. We had the possibility of filtering out the worst results. But the control we had would never again be the same. 

For five years, I have had the privilege of working to support teams with such systems. I was very close to focusing solely on one such team, but felt there was another purpose to serve. 

Two years ago, AI changed *my software testing* by giving me GitHub Copilot. I got it early on, and used it for hobby and teaching projects. I created a talk and a workshop on it based on a Roman Numerals example, and paired and ensembled on its use with some hundred people. I learned to make choices between what my IDE was capable of doing without it and with it, and reinforced my learning of intent in programming. If you have clarity of intent, you reject stupid proposals and let it fill in some text. I learned that my skills of exploratory testing (and the intent in that) made me someone who would jump to identify bugs in talks showing Copilot-generated code. 

These two years culminated 6 months ago in me and my whole team starting to use Copilot on our production code, after making agreements on accommodations for ethical considerations. I believe erasing attribution for open source programmers may not be a direct violation of copyright, but it is ethically shifting the power balance in ways I don't support. We agreed on the accommodations: using work time to contribute to open source projects and using direct money to support open source projects. 

One year ago, AI changed *my software testing* through access to ChatGPT. I have been on it since its first week, suffering through the scaling issues. I had my Testing Dozen mentoring group testing it as soon as it was out, and I learned that the thing I had learned in 5 years of AI about decomposing systems was something newbies were lost on. From watching that group and then teaching ensembles that scaled to about 50 people, including professional testers in the community, I realized the big change was that testers would need to skill up in their thinking. Noticing it has gender bias is too low a bar. Knowing how you would fix gender bias in the data used for teaching the model would now be required. Saying there's a problem would not suffice for more than adding big blunders to filtering rules. Smart people at scale would fill social media with examples of how your data and filtering fixes were failing. 

One year ago, I also learned the problems of stupid testing -- test case writing would scale to unprecedented heights with this kind of genAI. I would be stuck in a perpetual loop of someone writing text not worth reading. Instead of inheriting 5000 (manual) test cases a human wrote and throwing them away after calculating that it would take me 11 full working days to read them at one minute each, I would now have this problem at scale, with humans babysitting computers creating materials we should not create in testing in the first place. 

Or creating code that is just flat-out wrong, even when the programmer, with intent lacking, does not notice. 


AI would change testing to be potentially stuck in a perpetual loop of copy-pasting mistakes at scale and pointing the same ones out in systems. We would be reviewing code that was not thought through algorithmically. And this testing would be part of our programmer lives, because testing this without looking at the code would be nonsensical. 

They ask how AI changes Software Testing - it already changed it. Next we ask how people change software testing, understanding what they have at hand. 

I have laughed with AI, worked with tricky bugs making me feel a sense of powerlessness like never before, and learned tons with great people around AI and its use. I have welcomed it as a tool, and worried about what it does when people struggle with asking for help, asking for help from a source such as this without the skills to understand whether what is given is good. I've concluded that faster writing of code or text is not the problem - reading is the problem. Some things are worth reading for a laugh. 


AI changed software testing. Like all technology changes software testing. The most recent change is that we use the word "AI" to talk about automation of things to get computers acting humanly:
  • natural language processing to communicate successfully in a human language.
  • knowledge representation to store what it knows or hears.
  • automated reasoning to answer questions and to draw new conclusions.
  • machine learning to adapt to new circumstances and to detect and extrapolate patterns.
  • computer vision and speech recognition to perceive the world. 
  • robotics to manipulate objects and move about.
I feel like adding specific acting-humanly use cases like 'parroting nonsense' or 'mansplaining as a service', to fill in the very human space of claims and stories that could be categorised as fake news or fake certainty. 

What we really need to work on is problems (in testing) worth applying this to. Maybe it is the popular "routes a human would click" or the "changing locators" problems. Maybe it is the research-inspiring examples of combining bug reports from users with automated repro steps. Maybe it's the choice of not testing everything for every change. We should fill the space more with decomposed problems than with discussion about "AI".

My thinking behind answering the way I do 

This week, the people on stage at the meetup said they are interested yet not experienced in this space. I was sharing some of my actual experience from the audience, as I am retired from public speaking. There is a chance I may have to unretire with a change of job I am considering, but until then I hold space for conversations as chair of events such as AI & Testing in a few weeks, or as a loud audience member of events such as the Finnish Testing Meetup this week. I don't speak from stage, but I occasionally write, and I always have meaningful 1:1 conversations with my peers over video, the modern global face-to-face. 

I collaborate with a lot of different parties in the industry as part of my work-like hobbies. It's kind of a win-win for me to do my thing and write a blog post, and for someone else to make business out of intertwining my content with their ads. I have said yes to many such requests this last month, one of them allowing the Finnish training company Tieturi to nominate me for a competition for the title of "Tester of the Year 2023 in Finland". This award has been handed out 16 times before, and I have been nominated every single year for 16 years; I just asked not to be included, actively opting out, after someone had nominated me for 4 or 5 years.  

The criteria for this award, which I have never been considered worthy of winning, are: 

  • inspiring colleagues or other organizations to better testing
  • bringing testing thoughts and trends from the world to Finnish testing
  • positively influencing the testing culture in their own organization
  • positively influencing the resultfulness of testing (coverage, found bugs etc.) in their own organization or community
  • creating testing innovations, rationalising improvements or new kinds of ways to do testing
  • influencing the establishment of the testing profession in Finland
  • influencing Finnish testing culture and testers' professional development positively
  • OR in other ways improving the possibilities to do testing

I guess my 26 years, 529 talks, 848 blog posts on this blog, or the thousands of people I have taught testing to don't match these criteria. It was really hard to keep going at the 10-year mark of this award, and I worked hard to move past it. 

So asking me to freely contribute "How AI changes Software Testing?" as marketing material may have made me a little edgy. But I hope the edginess resulted in something useful for you to read. Getting it out of my system at least helped me.