Improving User Experience through A/B Testing

Project Overview

At the time of this experiment, Udacity courses had two options on the home page: “start free trial” and “access course materials”. If a student clicked “start free trial”, they were asked to enter their credit card information and were then enrolled in a free trial of the paid version of the course. After 14 days, they would automatically be charged unless they cancelled first. If the student clicked “access course materials”, they could view the videos and take the quizzes for free, but they would not receive coaching support or a verified certificate, and they would not submit their final project for feedback.

In the experiment, Udacity tested a change where if the student clicked “start free trial”, they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. The screenshot below shows what the experiment looked like.

Free Trial Screener Feature


The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn’t have enough time, without significantly reducing the number of students who continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches’ capacity to support students who are likely to complete the course.

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

Experiment Design

Metric Choice

This experiment has several possible metrics that could be used. The metrics have been generally categorized as either invariant, meaning that they should not be affected by our experimental manipulation, or evaluation metrics, which are affected by our manipulation. Invariant metrics are used to check if there were any issues with our randomization procedure. If users are randomly assigned to the control and experimental groups, then we should expect the invariant metrics to be similar for the two groups.

Invariant Metrics:

The metrics below are considered invariant because they happen before the “Start free trial” button event. We expect the control and experimental groups to have similar numbers for these metrics.

  • Number of cookies: represents the number of unique cookies to view the course overview page (dmin = 3000).
  • Number of clicks: represents the number of unique cookies to click the “Start free trial” button, which happens before the free trial screener is triggered (dmin = 240).
  • Click-through-probability: represents the number of unique cookies to click the “Start free trial” button divided by the number of unique cookies to view the course overview page (dmin = 0.01).

Evaluation Metrics:

The metrics below are not invariant, since they can change in response to our manipulation. For instance, the retention and net conversion metrics could increase because our experimental manipulation aims to decrease the frustration experienced by students who might not be aware of the time commitment required to get through the course. On the other hand, the screener might decrease enrollment in the free trial, since it steers away users who are made aware that they don’t have enough time to commit to the course. In summary, the metrics below are not invariant because they are likely to change in response to the new feature we are testing.

  • Gross conversion: number of user-ids to complete checkout and enroll in the free trial divided by the number of unique cookies to click the “Start free trial” button. In other words, this measure is an indication of enrollment in the free trial period (dmin = 0.01).
  • Retention: number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of user-ids to complete checkout (dmin = 0.01).
  • Net conversion: number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the “Start free trial” button (dmin = 0.0075).
  • Number of user-ids: number of users who enroll in the free trial. This can be considered an evaluation metric, since fewer users might enroll in the experimental condition relative to the control condition because users are made aware of the time commitment (dmin = 50).

Although we have four possible evaluation metrics, I will not use the number of user-ids, since it is not normalized. Gross conversion, retention, and net conversion provide a better measure of how likely our manipulation is to reduce student frustration. We should expect the following results:

  • The gross conversion metric should be smaller in the experimental group relative to the control group. A decrease in this metric will lead to a decrease in the costs required to support mentors and coaches, and to a higher level of satisfaction among students, since they will have clear expectations about course requirements.

  • The retention metric should increase in the experimental group. We expect the number of students who remain enrolled after the free trial to be the same for both the experimental and control groups. Nonetheless, we also expect the number of user-ids who go through the free trial to be smaller for the experimental group. Since the numerator stays the same and the denominator decreases in the experimental group, the screener should lead to a larger retention ratio.

  • Net conversion: we expect net conversion not to decrease as a result of our manipulation. We don’t want our manipulation to decrease the number of students who remain enrolled past the free trial, since this would affect our profitability. We would prefer that this metric remain the same or increase.

Measuring Standard Deviation

Assuming that the number of clicks and enrollments follows a binomial distribution, the standard deviation for each of our three metrics can be computed analytically using the following formula:

\[\sigma = \sqrt{\frac{p(1-p)}{n}}\]

In the formula above, p corresponds to the rates of our evaluation metrics provided in the baseline. p could take the following values:

\(p_{grossConversion}\) = 0.20625; \(p_{retention}\) = 0.53; \(p_{netConversion}\) = 0.1093125;

n corresponds to the number of unique cookies that click “Start free trial” for gross conversion and net conversion. In the case of retention, the sample size n corresponds to the number of enrollments. Based on a sample of 5,000 unique cookies viewing the course overview page and a click-through probability of 0.08, we should expect 400 clicks. Out of these 400 clicks, we should expect 82.5 enrollments (\(p_{grossConversion} \times 400\)).

The standard deviation for each of our metrics is:

\[\sigma_{grossConversion} = \sqrt{\frac{ 0.20625 (1 - 0.20625)}{400}} = 0.0202\]

\[\sigma_{retention} = \sqrt{\frac{ 0.53 (1 - 0.53)}{82.5}} = 0.0549\]

\[\sigma_{netConversion} = \sqrt{\frac{ 0.1093125 (1 - 0.1093125)}{400}} = 0.0156\]
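For reference, these standard deviations can be reproduced with a few lines of R. This is a minimal sketch mirroring the formula above; it uses only the baseline rates and the 5,000-cookie sample described earlier:

```r
# Analytic standard deviations for the evaluation metrics (sketch)
p_gross <- 0.20625      # baseline gross conversion
p_ret   <- 0.53         # baseline retention
p_net   <- 0.1093125    # baseline net conversion

clicks      <- 5000 * 0.08        # 400 expected clicks per 5,000 pageviews
enrollments <- p_gross * clicks   # 82.5 expected enrollments

se <- function(p, n) sqrt(p * (1 - p) / n)

round(se(p_gross, clicks), 4)        # 0.0202
round(se(p_ret,   enrollments), 4)   # 0.0549
round(se(p_net,   clicks), 4)        # 0.0156
```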

Udacity’s unit of diversion is a cookie, so the analytic estimate should be comparable to the empirical variability for only two of our measures: gross conversion and net conversion, since both use cookies as the unit of analysis. For retention, the unit of analysis is the user-id, so the empirical standard deviation is not likely to match the analytic one. In this circumstance, it is recommended to use a bootstrapping method to derive the empirical standard deviation for retention.

Sizing

Number of Samples vs. Power

I did not use the Bonferroni correction because it is overly conservative. Moreover, our metrics are likely to be correlated with each other (see the effect size section), so it would be more suitable to rely on other methods for adjusting p-values, such as the Holm or false discovery rate procedures.

To calculate the sample sizes, I used the same formula as the online calculator. The code for the calculations can be found in the .rmd file. The numbers of pageviews required for each evaluation metric are:

\(n_{grossConversion}\) = 645875; \(n_{retention}\) = 4741212; \(n_{netConversion}\) = 685325;

One striking result is the very large sample required for the retention metric: 4,741,212 pageviews. Since our daily traffic is only 40,000 pageviews, it would take several months to reach a decision regarding retention. Hence, it is more suitable to use gross conversion and net conversion as our only decision metrics and to discard retention as an evaluation metric.

Our required sample size is therefore the larger of the two remaining values: 685,325 pageviews.
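For transparency, below is a minimal R sketch of a standard two-proportion sample-size calculation, assuming α = .05 and 80% power. It approximately reproduces the figures above; the exact values depend on the specific formula the online calculator implements.

```r
# Sketch: per-group sample size for detecting a difference d from baseline p
sample_size <- function(p, d, alpha = 0.05, beta = 0.20) {
  z_a <- qnorm(1 - alpha / 2)
  z_b <- qnorm(1 - beta)
  sd1 <- sqrt(2 * p * (1 - p))                        # under the null
  sd2 <- sqrt(p * (1 - p) + (p + d) * (1 - p - d))    # under the alternative
  ceiling((z_a * sd1 + z_b * sd2)^2 / d^2)
}

ctr <- 0.08   # baseline click-through probability

# convert per-group clicks (or enrollments) into total pageviews
2 * sample_size(0.20625,   0.01)   / ctr              # gross conversion, ~646,000
2 * sample_size(0.53,      0.01)   / (ctr * 0.20625)  # retention,        ~4,740,000
2 * sample_size(0.1093125, 0.0075) / ctr              # net conversion,   ~685,000
```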

Duration vs. Exposure

Using the screener constitutes minimal risk, since users are simply shown a reminder about the time commitment required to be successful in the course. Hence, I would advise exposing 100% of traffic to this experiment, which means dedicating all 40,000 daily pageviews to it. Given that we require a sample of 685,325 pageviews, it will take us 18 days to complete the data collection for this experiment.
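As a quick arithmetic check of the duration estimate:

```r
# days of data collection if all 40,000 daily pageviews are diverted
ceiling(685325 / 40000)   # 18
```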

Experiment Analysis

Sanity Checks

After collecting all the data for this experiment, we can check whether our randomization procedure was successful by comparing the two groups on the invariant metrics mentioned above. There are two approaches for testing whether the two groups differ on each of the invariant metrics: the first involves a signal-to-noise ratio test (e.g., a test based on t or z) and checking whether the p-value for that test is significant; the second involves checking whether the observed value for the control group falls within the 95% confidence interval built around the expected value. You can find the function for the sanity checks in the .rmd file. The output of the calculations is provided below.

Sanity test for pageviews: 95% CI = [0.49882, 0.50118]; observed value = 0.50064. Conclusion: passes the sanity check.

Sanity test for clicks: 95% CI = [0.495884, 0.504116]; observed value = 0.500467. Conclusion: passes the sanity check.

Sanity test for click-through probability: 95% CI = [0.081214, 0.083045]; observed value = 0.082191. Conclusion: passes the sanity check.

All sanity checks have been passed.
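For illustration, the counts-based checks above can be reproduced with a short function along these lines; the counts passed in at the bottom are hypothetical placeholders, not the actual experiment data (which is processed in the .rmd file).

```r
# Sketch: is the fraction of events assigned to the control group
# consistent with the expected 50/50 split?
sanity_check <- function(n_control, n_experiment, expected = 0.5, alpha = 0.05) {
  n   <- n_control + n_experiment
  moe <- qnorm(1 - alpha / 2) * sqrt(expected * (1 - expected) / n)
  obs <- n_control / n
  list(ci       = c(expected - moe, expected + moe),
       observed = obs,
       passes   = abs(obs - expected) <= moe)
}

# hypothetical pageview counts, for illustration only
sanity_check(n_control = 345000, n_experiment = 344000)
```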

Result Analysis

Effect Size Tests

We have two evaluation metrics: gross conversion and net conversion. For all the tests, I assume an alpha of .05 and a z value of 1.96. I did not use the Bonferroni correction because a correlation test between gross conversion and net conversion revealed that the two metrics are correlated (r = 0.642, t = 5.5591, p < 0.0001). You can examine the relationship in the figure below. Also, the Bonferroni correction would not be appropriate here because we need both of our metrics to meet our expectations: if the reduction in gross conversion comes at the cost of a decrease in net conversion, we would likely not implement the screener.

All the tests have been performed in R. The calculations and the code can be found in the .rmd file.

We found a significant effect for gross conversion. The confidence interval of our effect excludes both d = 0 and the practical significance boundary (dmin = 0.01). In other words, the effect passes both the statistical and practical significance thresholds.

95% CI: [-0.02912, -0.011989]. Conclusion: the effect is significant with respect to both d = 0 and dmin.

This effect indicates that the manipulation led to fewer enrollments in the experimental group compared to the control group.

We did not find a significant effect for net conversion. The confidence interval of our effect includes d = 0. Nonetheless, the lower bound of the confidence interval is smaller than the negative practical significance boundary (-dmin = -0.0075), indicating that there is a chance the screener could affect our profitability.

95% CI: [-0.011604, 0.001857]. Conclusion: fail to reject the null hypothesis of no difference.
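For reference, a minimal sketch of how such a confidence interval can be computed from raw counts is shown below. The counts in the usage line are hypothetical placeholders rather than the actual experiment data, which is processed in the .rmd file.

```r
# Sketch: 95% CI for the difference in proportions (experiment - control),
# using a pooled standard error
effect_size_ci <- function(x_cont, n_cont, x_exp, n_exp, alpha = 0.05) {
  p_pool <- (x_cont + x_exp) / (n_cont + n_exp)
  se     <- sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
  d      <- x_exp / n_exp - x_cont / n_cont
  d + c(-1, 1) * qnorm(1 - alpha / 2) * se
}

# e.g., enrollments over clicks for gross conversion (hypothetical counts)
effect_size_ci(x_cont = 3800, n_cont = 17300, x_exp = 3450, n_exp = 17250)
```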

Sign Tests

In addition to the tests above, we conducted a sign test using the binomial distribution. For each of our metrics, we counted the number of days with a positive difference between the control and experimental groups. I did not use the Bonferroni correction: as I pointed out above, the two metrics are correlated with each other, and a Bonferroni adjustment of the p-values would be overly conservative. The day-level results are listed below, followed by a short sketch of the test in R.

  • The gross conversion rate was greater on 4 of 23 days for the experimental group relative to the control group. The two-tailed p-value is 0.0026, which is below an uncorrected threshold of .05.
  • The net conversion rate was greater on 10 of 23 days for the experimental group relative to the control group. The two-tailed p-value is 0.6776, indicating that there is no significant difference in net conversion rates between the two groups.
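These day-level counts map directly onto R's exact binomial test; a minimal sketch:

```r
# Sign test: days on which the experimental rate exceeded the control rate,
# out of 23 days with enrollment data
binom.test(x = 4,  n = 23, p = 0.5)$p.value   # gross conversion, ~0.0026
binom.test(x = 10, n = 23, p = 0.5)$p.value   # net conversion,   ~0.6776
```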

Summary

The goal of this experiment was to decrease the frustration users experience from not knowing the time commitment required to be successful in a course. The manipulation involved a screener that essentially reminded students that if they do not have enough time to dedicate to a course, they might be better off not enrolling in the free trial. Hence, we expected that our manipulation would lead to a decrease in enrollment. Although a decrease in enrollment might sound like a bad thing, we also checked whether the manipulation affected in any way the number of payments that we end up receiving from users.

Our results show a reliable effect for gross conversion, indicating that the enrollment decreased due to the screener button. The results also show that there is no significant effect for net conversion, meaning that even though we have fewer users enrolling in the free trial, the number of users who eventually stay in the course after the free-trial period ends remains essentially unchanged.

Recommendation

The screener button seems to be effective at reducing the number of enrollments. By using the screener button, we should expect to reduce the frustration among incoming users and reduce the workload for coaches and TAs.

Nonetheless, the screener also appears to slightly reduce the profitability of the business: the effect size test for net conversion shows that the negative practical significance boundary falls within the 95% CI.

Because our confidence interval includes the negative practical significance boundary, I would recommend against implementing the screener feature.

Follow-up Experiment

Although the experiment above screens out some users by reminding them of the time commitment required for the class, it is likely that some users will still feel frustrated by the difficulty of the material and end up not using the paid service.

For example, I’m currently taking a class on data visualization. The class requires some experience with JavaScript, HTML, and CSS; however, these requirements are not spelled out clearly before enrolling in the Nanodegree. I ended up having to study a lot on my own in order to catch up, but I can see why many students might simply give up because they were not aware of these prerequisites.

Imagine this scenario: a student signs up for a class in order to learn to make cool visualizations, but then finds out that they need to know three languages to be able to work on the project. Their experience might be very frustrating, and they might decide not to sign up for the paid service.

Hence, one thing I would do is include a detailed description of course pre-requisites and tell students that if they don’t have those pre-requisites, they might benefit more from taking the course for free rather than signing up for the free trial. Moreover, the message prompt could link to some introductory materials or other introductory courses that would help the student build a foundational base before tackling the more advanced course they are trying to take.

The course of the experiment will be as follows:

  1. Students see a class they would like to take.
  2. They click on the button to sign up for the free trial.
      • The control group proceeds to sign up for the free trial.
      • The experimental group is given a list of prerequisites and additional resources that would help them prepare for the course. They are then given the option to continue signing up for the free trial or to take the course for free.

  • Hypothesis: the additional information about course prerequisites will decrease student frustration and the workload of Udacity coaches without affecting net conversion. We should expect gross conversion for the experimental group to be smaller relative to the control group.

  • For this experiment, I would use the following measures:
    • Invariant measures: number of unique cookies, number of clicks, and click-through-probability; these measures should be the same for both experimental and control groups.
    • Evaluation metrics: gross conversion and net conversion. These metrics should change due to our manipulation.
  • The unit of diversion will be the unique cookie, similar to the experiment above. I picked cookies because users in the experimental group are not yet enrolled in the free trial when the prompt with the additional information about the class appears on the screen.

Stas Sajin

April 22, 2016