One of the challenges encountered in analyzing program outcome results is achieving sufficient statistical power.
Statistical power bears on statistical conclusion validity: low power raises the risk of a Type II error, in which the null hypothesis is false and should be rejected, but in practice we have retained it. In outcome evaluation, this most often means the intervention appears not to be effective when, in fact, it is. Statistical power may be influenced by the sample size, the alpha level set to indicate that a result is significant, the magnitude of the result (referred to as the effect size), and in some cases the actual statistical procedure being used.
Speculate on what would happen to statistical power given the recommended change in each scenario. Identify any practical challenges that making the change could introduce. These could include demands on resources to implement the change, threats to statistical conclusion validity, or even ethical concerns if relevant. There may be others you think of that apply.
Scenario A: Upon completion of an evaluation of the Templeton program, the results were surprisingly "not significant" in terms of the reported improvement in client well-being. One issue noted was that evaluation data were collected on only a sample of the participants during the evaluation cycle. It was recommended that in the next evaluation cycle all participants in the program be included in the evaluation.
Scenario B: In developing an evaluation plan for the upcoming cycle, the administration and evaluation team met to discuss the plan. As is typical for most studies and evaluations, a significance value of .05 was used to indicate that a result was statistically significant when the data are analyzed with inferential statistics. During the meeting, one administrator noted that the program is unique and, after all, "this isn't data to be used in rocket science or of a life-and-death matter," and recommended that a new significance value of .10 be used to indicate clinical significance.
Scenario C: In developing an evaluation plan for the upcoming cycle, the administration and evaluation team met to discuss the plan. Even though the previous evaluation's results were not statistically significant, the administration noted that satisfaction surveys and their own discussions with participants indicated that the program had a positive impact. In fact, they argued that were it not for other factors in the process, the actual "effect" of the program would be larger than anticipated. In "statistics speak," they were indicating that the effect size for their program results should be "medium" as opposed to "small to medium." They are suggesting a larger effect size.
Be sure to speak in terms of statistical power and statistical conclusion validity. Suggest any additional consequences of implementing the suggestion.
References
Aberson, C. L. (2019). Applied power analysis for the behavioral sciences. Routledge.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
Fitzner, K., & Heckinger, E. (2010). Sample size calculation and power analysis: A quick review. The Diabetes Educator, 36(5), 701–707.
Prajapati, B., Dunne, M., & Armstrong, R. (2010). Sample size estimation and statistical power analysis. Optometry Today.
Resources
Statistical Power
Statistical power is the likelihood that a study will detect an effect when there is an effect to be detected. In terms of statistical conclusion validity, it is the probability that a Type II error is not committed. It is influenced, for the most part, by the size of the effect and the sample size. Bigger effects are easier to detect than smaller ones. Suppose you were playing a game of "hide and seek" and were asked to find a golf cart or a cigar box in a backyard. You would be much more likely to find the golf cart, as it is larger. You can think of the sizes of the two items as effect sizes: the larger one is easier to find. Larger sample sizes improve power because of the increased sensitivity they provide.

The alpha level and the nature of the hypothesis also influence power. Reducing the alpha level (e.g., from .05 to .01) decreases power: for the same sample size, a lower alpha raises the "standard" that must be surpassed to reject the null hypothesis, and raising the standard makes surpassing it less likely. Consider this analogy: if the final average required for an A in this course were 70%, many more people would be confident of earning an A than if the required average were 99%. Finally, a directional hypothesis (a one-tailed test) provides more power than a non-directional hypothesis (a two-tailed test).
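The levers described above can be sketched numerically. The function below is a minimal normal-approximation of power for a two-group mean comparison; it is an illustrative sketch, not a replacement for power charts or software (the exact t-based calculation gives slightly different answers for small samples). Effect size, group size, alpha, and the one- versus two-tailed choice are all inputs, so each lever can be moved independently.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05, two_tailed=True):
    """Approximate power of a two-group mean comparison (normal approximation).

    d           : true effect size (Cohen's d)
    n_per_group : number of participants in each group
    """
    z = NormalDist()
    # Critical value: the "standard" the test statistic must surpass
    z_crit = z.inv_cdf(1 - alpha / 2) if two_tailed else z.inv_cdf(1 - alpha)
    # Expected z-statistic when the true effect size is d
    expected_z = d * sqrt(n_per_group / 2)
    return 1 - z.cdf(z_crit - expected_z)


# Each lever behaves as the text describes:
print(approx_power(0.5, 64))                    # ≈ 0.81: medium effect, 64 per group
print(approx_power(0.5, 64, alpha=0.01))        # ≈ 0.60: stricter alpha lowers power
print(approx_power(0.5, 64, two_tailed=False))  # ≈ 0.88: one-tailed test raises power
print(approx_power(0.2, 64))                    # ≈ 0.20: smaller effects are harder to detect
```

Note how the baseline case lands near the 80% benchmark discussed below, while tightening alpha or shrinking the effect pulls power well under it.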
Calculating Statistical Power
The power of a study can be calculated before or after the study, and conducting the calculation in advance is generally a good idea. Using the parameters above, evaluators and researchers consult tables called power charts (or use software). The main reason a power calculation is done in advance is to decide how many people need to be included in the study (expected attrition must be accounted for as well). What is acceptable power? Cohen (1988) suggests a study should have at least 80% power to be worth conducting. More power is always better, but the cost of recruiting additional subjects can be prohibitive.
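The planning question just named can be answered by inverting the same normal approximation: given a target power, solve for the per-group sample size. This is a rough sketch under a normal approximation, not a substitute for a power chart; the exact t-based answer is slightly larger for small samples.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05, two_tailed=True):
    """Approximate per-group sample size for a two-group mean comparison."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2) if two_tailed else z.inv_cdf(1 - alpha)
    z_power = z.inv_cdf(power)
    # Solve  d * sqrt(n / 2) = z_crit + z_power  for n, rounding up
    return ceil(2 * ((z_crit + z_power) / d) ** 2)


print(n_per_group(0.5))  # ≈ 63 per group (exact t-test tables give 64)
print(n_per_group(0.2))  # ≈ 393 per group: small effects demand far larger samples
```

In practice, the planned number would then be inflated to allow for attrition, as noted above.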
Calculating Effect Size
There are different ways to calculate effect size depending on the evaluation design you use. Generally, effect size is calculated by taking the difference between the two groups (e.g., the mean of the treatment group minus the mean of the control group) and dividing it by the pooled standard deviation of the two groups. For example, in an evaluation with a pre-test and a post-test, the effect size is the difference between the pre-test and post-test means divided by the pooled standard deviation of the two sets of scores.
How to Calculate the Pooled Standard Deviation
Cohen (1988) offers a couple of options for calculating the pooled standard deviation. The simplest is:
SD_pooled = sqrt( (SD1^2 + SD2^2) / 2 )
Where:
SD1 = standard deviation for group 1
SD2 = standard deviation for group 2
To interpret the resulting number, most social scientists use this general guide developed by Cohen:
d ≈ 0.2 = small effect
d ≈ 0.5 = medium effect
d ≈ 0.8 = large effect
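A minimal sketch of the whole calculation, using the simple pooled-SD formula above. The group scores here are made-up illustration data, not program results.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group1, group2):
    """Effect size: mean difference divided by the pooled SD,
    where SD_pooled = sqrt((SD1^2 + SD2^2) / 2)."""
    sd_pooled = sqrt((stdev(group1) ** 2 + stdev(group2) ** 2) / 2)
    return (mean(group1) - mean(group2)) / sd_pooled


# Hypothetical well-being scores for a treatment and a comparison group
treatment = [14, 15, 16, 17, 18]
comparison = [12, 13, 14, 15, 16]
print(round(cohens_d(treatment, comparison), 2))  # 1.26: a large effect
```

Plugging the resulting d into a power calculation (along with the planned sample size and alpha) ties the three resource topics above together.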