What does it mean to test big data? As testers, aside from ensuring data quality, we rarely test data at all. To answer this question, we have to start with what big data means. In general, big data refers to data that exceeds the in-memory capacity of traditional databases, generally taken to be on the order of a few gigabytes.
But big data means more than that. It involves collecting large amounts of data on customers, transactions, website visits, network performance, and more, and storing that data over time, perhaps over a long time.
Big data also implies a purpose. We rarely have to search big data stores for a particular value or field. Rather, big data equates to analytics: the results are typically expressed in statistical terms, such as trends, likelihoods, or distributions, rather than as individual records. We want to mine the data for detailed information about its domain, not run a query for a single answer.
Where did big data come from?
Big data has become possible in recent years because of three computing trends. The first is inexpensive storage. The data has always been available for collection, but storing millions of data points used to be cost-prohibitive. Today, enterprises can store data locally or in the cloud for pennies per gigabyte.
The second is different types of databases. Traditional SQL databases usually aren’t appropriate for processing very large amounts of data. The maturation of NoSQL databases such as MongoDB and Couchbase makes it possible to mine big data for analytics more effectively. Specialized databases are also available for specific uses, such as in-memory databases for high performance and time-series databases for tracking trends over time.
Last, computing in general has evolved into fast, inexpensive networked server farms. These farms, which can consist of hundreds of systems and thousands of processor cores, can each process a portion of the data, making it possible to spread a complex, time-consuming computation across many servers and compute the result in parallel. The application gets its results back in seconds, and the cost of the computation is usually trivial.
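The partition-and-combine pattern those server farms rely on can be sketched on a single machine. The following is a minimal illustration, not production code: real deployments distribute partitions across actual servers with frameworks built for the purpose, and the function names here are purely illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each "server" computes a result over its own slice of the data.
    return sum(chunk)

def parallel_total(data, workers=4):
    # Split the data into roughly equal partitions, one per worker.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Compute each partition concurrently, then combine the partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_total(list(range(1_000_000))))  # same answer as sum(range(1_000_000))
```

The key property is that the combined answer is identical to the serial one; only the elapsed time changes as partitions are added.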
Where testing big data comes in
Testers generally aren’t testing the data, but they need to use big data in order to test the application. This means they need an underlying knowledge of the database and data architecture. It’s unlikely they will be using live data, so testers may have to maintain their own test environment, complete with the database and enough data to make tests realistic.
But it’s the applications themselves that are different. Big data is inextricably tied to analytics. Users are more likely to be running statistical and sensitivity analyses than querying the database for a specific result. That means it’s impossible to know ahead of time what the answer is going to be; it will be expressed in distributions, probabilities, or time-series trends. And those answers won’t be obviously correct or incorrect, as the answers in most traditional applications are.
For testers, that lack of certainty makes designing test cases and analyzing test results a challenge. Another challenge is that testers, out of necessity, have to use big data to test these applications. They can’t just take a few sample data points and expect test results that reflect real-life use.
This means understanding the underlying data architecture, including the type of database used and how to access that database. You can’t use big data in testing unless you know what it is and how it’s stored.
We will find ways of testing
None of this means it’s impossible to test analytics applications using big data. But testing expectations have to change. First, you need to look at the data as a resource for producing statistical results rather than (usually) as something to query. That means using more than just a small sample. The good news is that because you aren’t querying for individual records, the exact contents of any one record matter less than the overall distribution.
Second, your test cases are going to be different. You’re not usually looking for a specific and known answer, but for a statistical result, and the test cases have to reflect that.
Last, you can’t evaluate test results simply by their correctness, because there is no easy way of determining that. Instead, you might have to break the problem into smaller pieces and analyze the results from each piece. Use your skill and creativity to decide how best to interpret your test results.
However you test it, automation will play an important role. A large number of tests, run multiple times, is needed to determine whether the results align with expected values. In fact, the same test can be run multiple times with different results, as long as those results fall within the expected distribution. Automation also lets you take the same tests and easily adjust their parameters to produce variations that yield new tests and new results.
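That parameter-variation idea can be sketched as a small data-driven suite: the same test logic runs against each variation, and each variation carries its own expected band. Everything here is illustrative (the `analyze` function stands in for the real application, and the bands are made-up numbers).

```python
import statistics

def analyze(values):
    # Stand-in for the analytics application under test (hypothetical).
    return {"mean": statistics.mean(values), "stdev": statistics.pstdev(values)}

# Each variation reuses the same test logic with different parameters
# and its own expected band; the numbers are purely illustrative.
VARIATIONS = [
    {"data": list(range(0, 100)),  "mean_band": (49, 51)},
    {"data": list(range(0, 1000)), "mean_band": (499, 500)},
]

def run_suite(variations):
    # Return the indices of variations whose result falls outside its band.
    failures = []
    for i, v in enumerate(variations):
        result = analyze(v["data"])
        low, high = v["mean_band"]
        if not (low <= result["mean"] <= high):
            failures.append(i)
    return failures

print(run_suite(VARIATIONS))  # an empty list means every variation passed
```

Adding a new test variation is then just a matter of appending another entry to the list, which is what makes this style of automation cheap to extend.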
Any way you test, you’ll find testing big data to be an exciting new area for discovery.