
Hacking A/B Testing with Confidence

Posted by Eran Hammer

A couple of months ago, our team was tasked with improving mCommerce conversion. This is an ongoing focus, but when the Black Friday weekend looms it becomes paramount: any improvement gets significantly amplified.

Our first step was to implement an A/B testing framework so we could try out both client-side and server-side ideas and measure their impact. We’ve used A/B solutions in the past, but they weren’t up to the task (or were built for server-generated HTML, not for single-page apps). We had three weeks to make the cut.

Server-Side State

Our mobile services do not maintain server state. That’s all managed upstream by the various backend providers (orders, accounts, inventory). However, A/B testing requires someone to keep track of which client is running which test, and to persist that over time. It is not enough to serve blue buttons to 50% of users and green buttons to the other 50%; we need to serve the same color button to the same user over multiple sessions throughout the test.

The two obvious solutions are to keep state on the client using local storage or cookies, or to keep state on the server using a database or memory cache (mapped to a smaller client-side state holding a unique identifier). Implementing a new server-side state store this close to Black Friday was a non-starter.

Analytics Integration

We decided to use the same unique identifier already used for analytics reporting to determine A/B segments. This identifier is a GUID unique to each client installation, contains no user-identifiable properties, and is generated server-side using the node-uuid module.
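For reference, generating such an identifier with node-uuid is a one-liner:

    // Server-side generation of the per-installation GUID
    var Uuid = require('node-uuid');

    var clientId = Uuid.v4();    // e.g. '110ec58a-a0f2-4ac4-8393-c866d813b8d1'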

We needed to link the unique identifier to the A/B test groups each client belongs to. Since we had ruled out server-side state, we had to find a way to pass that extra information from the clients directly to the analytics system, without adding significant (mobile) bandwidth to each analytics report.

We decided to add a string suffix (e.g. '-xxxxxxxxx') to the GUID format and extract it on the analytics system side. Not ideal, but it was the only way to make it work without requiring changes to the existing system and data flow.
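Here is a sketch of that flow; the suffix payload shown is illustrative (the encoding we settled on is described below):

    // Client side: append the A/B state to the clientId from above
    var tagged = clientId + '-1f30c7e1e';

    // Analytics side: a standard GUID is 36 characters long, so
    // everything past it (minus the separator) is the extra state
    var guid = tagged.slice(0, 36);
    var state = tagged.slice(37);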

Minimizing Bandwidth

The suffix solution required keeping that extra state small. Instead of coming up with a rich test-description format, we decided to define a set of groups (A, B, C, etc.). Each client gets a set of random numbers between 1 and 100 representing the test groups, which are then assigned to specific tests. For example, our button color test can use group A, where values up to 50 get green and values above 50 get blue, giving us a 50/50 distribution.
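In code, a test simply reads its group value and picks a variant. A minimal sketch, with hypothetical names:

    // Map group A's value (1-100) to a button color: a 50/50 split
    function buttonColor(groupA) {

        return (groupA <= 50 ? 'green' : 'blue');
    }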

Once a test is done, we can repurpose the group value for something else. This is not statistically perfect, but if we accommodate a large enough set of groups (e.g. 10+), we can perform multiple tests with sufficient accuracy.

Once we generate the values for each group, we combine them into a single string by simply concatenating them (e.g. 10 and 20 become '1020'). This allows us to express 10 random groups using 20 characters, where each pair represents a group ('38429831034956013452' breaks into ten 2-digit sets: '38', '42', '98', etc.).
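A minimal sketch of that packing step (the helper name is ours); it reproduces the example string above:

    // Pack ten group values into a 20-character string, two digits per
    // value, zero-padding so that position identifies the group
    function pack(values) {

        return values.map(function (value) {

            return (value < 10 ? '0' + value : String(value));
        }).join('');
    }

    pack([38, 42, 98, 31, 3, 49, 56, 1, 34, 52]);    // '38429831034956013452'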

To further shrink the digit string, we realized we could reuse each digit, as long as it is used only once as a most-significant digit. For example, '8372649502' breaks into '83', '37', '72', etc., where the least-significant digit of each group doubles as the most-significant digit of the next.

And last, using hex encoding we can compress the string further ('8372649502' can be expressed as '1F30C7E1E', 10% shorter). We didn’t use a more condensed encoding such as base64 because of the additional complexity and parsing cost compared to native hex support, and because the GUID specification already uses hex encoding.
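Putting the last two steps together, decoding looks roughly like this (a sketch; the actual confidence encoding differs in some details):

    // Decode a hex-encoded payload back into ten overlapping two-digit
    // groups; assumes a fixed ten-digit decimal payload
    function unpack(hex) {

        var digits = String(parseInt(hex, 16));      // '1F30C7E1E' -> '8372649502'
        while (digits.length < 10) {
            digits = '0' + digits;                   // restore leading zeros
        }

        var groups = [];
        for (var i = 0; i < digits.length; ++i) {

            // Each digit serves once as a most-significant digit; the
            // least-significant digit comes from the next position (wrapping)
            groups.push(parseInt(digits[i] + digits[(i + 1) % digits.length], 10));
        }

        return groups;    // [83, 37, 72, 26, 64, 49, 95, 50, 2, 28]
    }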

Even Distribution

A/B testing requires setting clear distribution goals for each test, both for risk mitigation and for statistical accuracy. We had to generate our random group values in a way that produces an even distribution across all clients, and we knew we had to use a cryptographically secure random number generator. However, random number generators produce an even distribution of bits, not of decimal digits.

For example, generating 8 random bits repeatedly will produce an evenly distributed set of values between 0 and 255. If we want a random number between 0 and 200, we still need 8 bits (7 bits only gets us up to 127), but we might end up with values outside our desired range. We can’t just ignore the most significant bit when the number is over 200, because that would skew the distribution. To keep it even, we have to throw out the entire 8-bit block and generate a new one, repeating until we get a value within our 0-200 range.
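A minimal sketch of that rejection loop using node’s crypto module (limited to single-byte ranges for brevity):

    // Evenly distributed integer in [0, max] via rejection sampling
    var Crypto = require('crypto');

    function randomInt(max) {                        // assumes max <= 255

        for (;;) {
            var value = Crypto.randomBytes(1)[0];    // evenly distributed 0-255
            if (value <= max) {
                return value;
            }

            // Out of range: discard the entire byte and draw again; masking
            // or ignoring bits here would skew the distribution
        }
    }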

Most random utilities produce a value between 0 and 1, letting you simply multiply it by the maximum value of the desired range. But because of floating-point arithmetic and the way the 0-to-1 number is generated, an even distribution is unlikely.

Random GUID

We noticed that the algorithm we were already using to produce the GUID was very similar; it just lacked the even-distribution guarantee. Instead of appending another string as we originally planned, we found we could simply replace some of the randomness already in the GUID with our own evenly distributed digits. We now had a way to generate GUIDs with built-in A/B testing groups, evenly distributed, at no additional cost in bandwidth or complexity.
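Conceptually, generation now looks something like this. It is an illustrative sketch, not the exact confidence algorithm, and the version and variant handling in particular is simplified:

    // Embed a hex-encoded group payload (zero-padded to nine characters,
    // as in the earlier example) into the random portion of a v4-style GUID
    var Crypto = require('crypto');

    function generateId(groupHex) {

        var random = Crypto.randomBytes(16).toString('hex');     // 32 hex chars
        var hex = groupHex + random.slice(groupHex.length);      // splice ours in

        return hex.slice(0, 8) + '-' +
            hex.slice(8, 12) + '-' +
            '4' + hex.slice(13, 16) + '-' +          // version 4 marker
            '8' + hex.slice(17, 20) + '-' +          // variant bits (simplified)
            hex.slice(20, 32);
    }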

The problem was that we already had millions of these GUIDs in the wild, none of them generated under these strict distribution rules. The first idea was to add some kind of flag so we could tell whether a given GUID was old or new.

But then we realized that wasn’t needed. All we had to do when a client makes its first request with a GUID is ensure that the value we received complies with our random generation rules (by applying the same check used at generation time). We already had to parse the GUID to make A/B decisions, so this was just a small extra test. If the GUID complied, great; that assured us the value was already evenly distributed. If it didn’t, we generated a new one and told the client to replace it.
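Continuing the earlier sketch, the server-side check is just the generation-time range test applied in reverse (the nine-character payload position is our illustrative assumption):

    // A GUID complies if its payload could have come from our generator:
    // nine hex characters can hold values up to 0xFFFFFFFFF (68719476735),
    // but a ten-digit decimal payload never exceeds 9999999999
    function validateId(guid) {

        var payload = parseInt(guid.replace(/-/g, '').slice(0, 9), 16);
        return (payload <= 9999999999);
    }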

This was all nice in theory, but we didn’t know how well the existing GUID values would work alongside the new ones until we got some real numbers. We set up a test with a 50/50 distribution, which resulted in an actual 49.9/50.1 split, and one with a 20/60/20 distribution, which resulted in an actual 20.2/59.5/20.3 split. The theory was working out.

Configuring Tests

The last step was to define a way to describe tests and tie them to the random group numbers we generated. For that we used confidence, a node.js module we wrote which provides a simple JSON-based format for describing conditional configuration. confidence works by applying a set of criteria to a configuration document and producing a “compiled” version specific to those criteria.
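For example, the earlier button color test could be described like this (the $filter, $range, and $default directives come from the confidence format; the criteria key name is illustrative):

    // Conditional configuration: group A's value selects the variant
    var document = {
        buttonColor: {
            $filter: 'random.a',                     // criteria supplied per client
            $range: [
                { limit: 50, value: 'green' },       // values 1-50
                { limit: 100, value: 'blue' }        // values 51-100
            ],
            $default: 'green'
        }
    };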

To tie it all together, we added our GUID generation logic to the confidence module and created a function that takes the GUID and returns a criteria object confidence understands. When a client connects to the server to request its configuration (a flow that was already in place), we use its GUID to generate the client-specific criteria and pass that to confidence, producing a configuration object specific to that client, including all the A/B test options.
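A sketch of that request flow, reusing the document and clientId from the earlier snippets (exact confidence API details may differ between versions):

    // Compile a client-specific configuration from its GUID
    var Confidence = require('confidence');

    var store = new Confidence.Store(document);

    var criteria = Confidence.id.criteria(clientId); // group values parsed from the GUID
    var config = store.get('/', criteria);           // e.g. { buttonColor: 'green' }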

Because the GUID encoded all the required information, we didn’t need any further integration with the analytics system; it already included the identifier in its reports. We just had to enhance the reports to extract that additional meaning when needed.

Future Plans

The confidence module, including the GUID generation algorithm, is available as open source. We plan to continue enhancing the configuration format and the A/B testing functionality, as well as port it to our other platforms, including Android and iOS, in the near future. As always, feedback is greatly appreciated.