A Meta-Analysis and Literature Review of Language Programs 

Article Key Findings:

For this article, I conducted a meta-analysis of 61 language program studies. Although this meta-analysis was conducted 20 years after the National Reading Panel (NRP) meta-analysis and included studies up to 2022, I found an identical mean effect size for phonics of .45. I also conducted a secondary meta-analysis of 13 phonics meta-analyses from the last 25 years, which found a mean effect size for phonics of .43. This helps to show that, although the NRP meta-analysis was conducted 20 years ago, its findings are still valid today. Moreover, not only did I find the same mean effect size, I also found that the same general trends about phonics education held true.

I also found that phonics-focused programs consistently outperformed Balanced Literacy programs across all demographics. Indeed, phonics programs on average showed double the effect found for Balanced Literacy programs for grades 1-2, for class-based instruction, and for at-risk learners.

Phonics interventions showed efficacious results both for early primary instruction and for older students with reading deficits. This suggests that students should receive phonics instruction during their foundational education years and that, if they miss this instruction, they benefit from receiving it later on.

Purpose of Analysis: 

I often see the claim that a program is the best program or the most evidence-based program. However, these types of statements are rarely backed by sufficient evidence. To show that something is the best program, I would ideally like to see multiple meta-analyses showing that the program has greater evidence of efficacy than the other programs studied. Of course, even this type of evidence would only show that the program is better than the other programs studied, according to the currently available evidence. Most language programs have little to no research done on them, let alone a meta-analysis. To make matters more complex, results are context dependent: the program that works best for one grade or one outcome might be completely different from the one that works best for a different grade or outcome. The reality is that proving a certain program is the best is likely an intellectually fruitless endeavor.

However, I do believe we can make educated guesses about which programs work better by conducting meta-analyses of the available evidence. While research studies do not always exactly match real-world conditions, meta-analyses of individual programs give us a more objective proxy for what the evidence supports. This is not to say that programs that show low results in meta-analyses are inherently bad or that programs that show high results are inherently good, but rather that this method lets us make our most educated guess as to what works.

To this end, I have been conducting a series of small, non-peer-reviewed meta-analyses of popular language programs on my website. Ideally, this work would take the form of a published, peer-reviewed meta-analysis. However, most programs have not had a meta-analysis conducted on them, and some of the meta-analyses that have been conducted appear to have limited scopes. This research has been my attempt to make the available scientific literature on the topic more useful and meaningful to teachers; however, I recognize that it has severe limitations, in that it is not published peer-reviewed work and I do not have post-graduate qualifications.




For the purposes of this study, searches were conducted on Education Source, Sage Pub, Google, and company websites for studies of the most popular language programs. Programs were selected based on polling of teachers interested in the science of reading. All studies found with a control group, sufficient reporting data (meaning either effect sizes or sufficient data for me to calculate a proper effect size), and total samples above 10 were included in this analysis. No time restrictions were placed on these inclusion criteria. In total, 61 studies were included.

Three studies were initially included but later excluded, after review, due to non-equivalent pre-test scores. In all three cases, effect sizes based on the post-test results were extremely misleading. The first study was by Ring et al. on Take Flight, and the second was by Stuart et al. on Jolly Phonics. In both cases, the Cohen's d formula produced an effect size that was exaggerated in comparison to the actual gains made. For the Take Flight study, the gains were not statistically significant. For the Jolly Phonics study, the treatment group did worse than the control group in terms of gains but still ended up with a positive effect size of .73 using Cohen's d. The third study excluded was by Corcoran et al. on Remediation Plus. The gains made by the treatment group were very significant, with an effect size of over .50; however, the non-equivalent groups resulted in a much smaller Cohen's d effect size.
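
To illustrate why a post-test-only effect size can mislead when groups differ at pre-test, here is a minimal Python sketch. All numbers are hypothetical and are not taken from the excluded studies. The gain-based comparison (difference in mean gains divided by the pooled standard deviation) is one common approach and is an assumption here, not necessarily the calculation used for this analysis.

```python
import math


def pooled_sd(sd_a: float, n_a: int, sd_b: float, n_b: int) -> float:
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))


def cohens_d(mean_t: float, mean_c: float, sd_t: float, sd_c: float,
             n_t: int, n_c: int) -> float:
    """Cohen's d: treatment mean minus control mean, over the pooled SD."""
    return (mean_t - mean_c) / pooled_sd(sd_t, n_t, sd_c, n_c)


# Hypothetical study: the treatment group starts well ahead of the control group.
n_t, n_c = 30, 30
pre_t, pre_c = 60.0, 50.0     # pre-test means
post_t, post_c = 65.0, 57.0   # post-test means
sd = 10.0                     # same SD assumed for every measurement, for simplicity

# Post-test-only effect size: ignores the 10-point head start.
d_post = cohens_d(post_t, post_c, sd, sd, n_t, n_c)

# Gain-based effect size: compares how much each group improved.
gain_t = post_t - pre_t       # treatment gain
gain_c = post_c - pre_c       # control gain
d_gain = (gain_t - gain_c) / pooled_sd(sd, n_t, sd, n_c)

print(f"post-test-only d = {d_post:.2f}")
print(f"gain-based d     = {d_gain:.2f}")
```

In this made-up case, the treatment group gains less than the control group, yet the post-test-only calculation still reports a large positive effect. This is the same pattern described above for the Jolly Phonics study.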

Effect sizes for studies with sample sizes above 50 per group were calculated with Cohen's d. Effect sizes for studies with sample sizes below 50 were calculated with Hedges' g. When effect sizes had already been calculated by the peer-reviewed author, they were usually accepted as is, unless there was reason to suspect an improperly calculated effect size, in which case the effect size was double-checked. If the effect sizes did not match, I used my own calculation. All effect size calculations were independently duplicated by a second analyst to help ensure the reliability of the results. In the case of a disagreement, both analysts redid their calculations and discussed them to reach a consensus. Effect sizes were not rounded; they were truncated at the hundredths place. When the calculation found that the control group outperformed the treatment group, the effect size was entered into the meta-analysis as a negative value.

Meta-analyses typically report results as effect sizes. These effect sizes are meant to be interpreted in the following way:

However, I think this can be a reductive lens, as effect sizes tend to be context specific. For example, studies on reading tend to show lower results, whereas studies on math tend to show higher results. For this reason, I do not think effect sizes for single outcomes, programs, or pedagogical ideas are helpful for teachers. I think it is more useful to compare the effect sizes of similar factors, so that teachers can see which factors have a stronger research base. For this purpose, I have compiled my research on language programs so far into the following infographic:



As you can see in the above graph, there is not a single Balanced Literacy program in the top seven ranks. Moreover, all Balanced Literacy programs showed a low mean effect size. In order to find larger trends in this data, I also broke down this information according to moderator variables, sample type, and program type:

As can be seen in the above results, phonics-heavy programs outperformed Balanced Literacy programs across all grades and sample types. Indeed, phonics programs showed roughly double the impact for grades 1-2, at-risk learners, and class-based instruction. This research does not support the use of Balanced Literacy programs over phonics-focused programs in any context.

Of course, Balanced Literacy is not the only program sub-type worth comparing. Within this meta-analysis, I also had samples of Orton Gillingham programs, analytic phonics programs, and synthetic phonics programs. To compare the efficacy of these styles of phonics programs, I calculated the mean effect size for each sub-type, the results of which can be found below.

Within this study, synthetic phonics programs showed better results than Balanced Literacy, analytic, and Orton Gillingham programs. However, the differences between synthetic, analytic, and Orton Gillingham programs were not statistically significant. These results were very similar to the NRP meta-analysis results, the only exception being that the NRP meta-analysis found a much lower result for Orton Gillingham programs.

One interesting claim I often see about effect size and meta-analysis research is that study design biases the result. Indeed, people often claim that RCT studies show substantially lower results. Similarly, study authors often claim that the reason they found low results was low teacher fidelity. With this in mind, I broke down my meta-analysis data across design variables, the results of which can be seen below.



Similar to the NRP meta-analysis, this meta-analysis found higher results for studies that did not track fidelity than for studies that did. This seriously calls into question the notion that fidelity should be the first culprit examined when study results are low. Indeed, program type variables showed far larger impacts on results. While RCT studies did show lower results on average than quasi-experimental studies, this difference was not statistically significant. This calls into question the idea that we should expect lower results from RCT studies.


The data showed higher outcomes for shorter phonics studies and for longer Balanced Literacy studies. This last result was particularly surprising to me. However, I think one logical explanation might be that phonics programs work more efficiently. Because teaching students to memorize large numbers of words takes more instruction, it makes sense that Balanced Literacy programs might take longer to show a positive effect, as this is an inefficient method of instruction. By contrast, some students might learn how to decode from phonics quite quickly, so phonics programs might show diminishing returns over time. The NRP meta-analysis actually showed similar results for phonemic awareness instruction; however, it did not look at duration effects for phonics programs.

Ultimately, this meta-analysis was not a peer-reviewed, published study, which limits its credibility. However, its results are very comparable with the rest of the scientific literature. To help address this limitation, I have included a literature review of all other meta-analyses I could find on this topic, starting with the NRP meta-analysis.

NRP Discussion: 

Overall, we see three really clear patterns within this meta-analysis. Firstly, phonics is most useful for younger students and becomes less useful over time. Indeed, within this meta-analysis, the overall effect of teaching phonics to students in grade 2 and up is not statistically significant. I actually had the opportunity to discuss these specific results with Timothy Shanahan, who was one of the project leads. To paraphrase, he told me that the panel saw an inverse relationship, with older students getting lower results from phonics and younger students getting higher results. That being said, the results for grades 2 through 6 are all lumped together, but from my discussions with Dr. Shanahan, I think we can assume that the grade 2 results are much higher than the grade 6 results.


Interestingly, we see here that the NRP evidence that phonics helps at-risk readers above grade 2, or dyslexic students, is quite weak, something Dr. Shanahan also conceded when he discussed the topic with me. Indeed, opponents of phonics have used this information to try to discredit the use of phonics for remedial students. However, I would caution against this interpretation. What this is likely actually showing is that dyslexic students learn to read more slowly and therefore require more instructional time spent on foundational knowledge. I also think the results might have been deflated at the time by the studies chosen for this sample.


I also think it should be noted that the NRP meta-analysis found the same trends as I did, both for phonics overall and for individual programs. Both the NRP meta-analysis and my meta-analysis found a mean effect size for phonics of .45. The NRP found that Jolly Phonics had the highest results of any individual program and that synthetic phonics programs tended to have higher results than analytic phonics programs. Moreover, it found lower results on average for Orton Gillingham programs. This is not to say that Orton Gillingham programs are bad programs; indeed, it is possible that these programs are helping students in a way that the research is not capturing. However, meta-analysis data does seem to consistently find lower results for these programs, with the notable exception of the SPIRE program. Indeed, my meta-analysis result for Orton Gillingham programs might be inflated by the heavy weight I gave to SPIRE, which has the highest results for this type of program and indeed some of the highest results overall.

Other Phonics Meta-Analyses:

In total, I was able to find 15 meta-analyses, including my own, that looked at the efficacy of phonics. The average result of these meta-analyses was .55. However, the 2014 Garcia meta-analysis of the topic is clearly an extreme outlier. Indeed, I have no idea how its authors found this effect size, as I have never found a single phonics study with an effect size over 1.5, let alone a mean of over 2. If we correct for this outlier, we get a mean effect size of .43, which, I must point out, is within .02 of the result of my own meta-analysis of the subject. However, three of these meta-analyses (mine included) were not peer reviewed. If we remove these three meta-analyses and the outlier, we get a mean effect size of .60, which I must admit I think is likely a bit high. That being said, whereas I used to be thoroughly convinced that a secondary meta-analysis was the best method for assessing the efficacy of phonics, I am now more skeptical after completing this analysis. Personally, I would be more willing to accept the results of my own analysis or the NRP's over the results of a secondary meta-analysis, due to the amount of outlier data.
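
The outlier correction described above is simple arithmetic, sketched here in Python. The effect sizes in the list are hypothetical placeholders, not the actual results of the 15 meta-analyses, and the cut-off of 1.5 is an assumption borrowed from the observation that no single phonics study exceeded that value.

```python
from statistics import mean

# Hypothetical mean effect sizes from a set of meta-analyses (placeholders only).
effect_sizes = [0.45, 0.41, 0.32, 0.60, 0.38, 2.10]

# Assumed cut-off: treat any meta-analysis mean above 1.5 as an outlier.
OUTLIER_CUTOFF = 1.5

mean_all = mean(effect_sizes)
trimmed = [es for es in effect_sizes if es <= OUTLIER_CUTOFF]
mean_trimmed = mean(trimmed)

print(f"mean with outlier:    {mean_all:.2f}")
print(f"mean without outlier: {mean_trimmed:.2f}")
```

With so few values in the list, a single extreme result pulls the overall mean up noticeably, which is why the corrected and uncorrected means can differ so much.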


The results of phonics studies seem very dependent on two factors: the age of the students and the program being used. By contrast, fidelity, experiment design, and duration seemed to have less impact on the results of phonics studies. Indeed, prior to starting this project, I was firmly convinced that the principles of phonics are more crucial to instructional outcomes than the program used. However, I am less confident in this belief now, as some phonics programs showed results that were below .20 and others showed results that were above .80. I think this fact is substantially damaging to the idea that, as long as a program is based on the principles of the science of reading, it is essentially evidence-based.

Whole Language and Balanced Literacy Meta-Analyses:

As you can see from these charts, phonics appears to have a stronger evidence base than Balanced Literacy, which in turn appears to have a stronger evidence base than Whole Language.


Orton Gillingham: 

Orton Gillingham programs are very popular in the Science of Reading community, so I thought I would share the available meta-analysis data I had on the subject as well. However, the evidence on the topic might be weaker than some might assume. Indeed, my meta-analysis results, which were themselves low, appear to be by far the highest.

To help summarize all of this data, I have created a secondary meta-analysis, which shows the mean meta-analysis result for each program type, where I had available data.

As can be seen here, there appears to be strong evidence that a phonics program is better than a Balanced Literacy program or a Whole Language program. The evidence also seems to suggest that Orton Gillingham programs might not be more effective than Balanced Literacy ones. However, I personally find that hard to believe, in part because my meta-analysis found significantly higher results for Orton Gillingham programs than for Balanced Literacy ones.

Written by Nathaniel Hansford and Joshua King

Last Edited 2022-07-31



