What is the Best Language Program?

Written by: Nathaniel Hansford

Contributed to by: Elizabeth Reenstra, Pamela Aitchison, Rachel Schechter, Joshua King, & Sky McGlynn


Article Summary:

This article summarizes the findings of all Pedagogy Non Grata language reading program reviews, based on experimental research, and compares them with the findings of Evidence for ESSA on language program reviews. Evidence for ESSA employs far stricter inclusion criteria and does not utilize the principles of meta-analysis. In theory, their results are supposed to be more reliable than the results found in traditional meta-analytic approaches. However, they reject a far greater pool of studies, and thus their findings are based on smaller sets of data. On average, there were no statistically significant differences between the mean effect sizes found for language programs reviewed by both organizations. On average, mean effect sizes for Pedagogy Non Grata were larger than the mean effect sizes found by Evidence for ESSA. This difference is likely in part due to the stricter inclusion criteria for Evidence for ESSA and a greater focus on systematic phonics programs by Pedagogy Non Grata.


The mean results for structured literacy (systematic phonics) and balanced literacy (unsystematic phonics) were compared between organizations and related meta-analyses. All meta-analyses and both sets of reviewing organizations showed larger results for systematic phonics instruction than unsystematic phonics instruction. However, Pedagogy Non Grata showed the smallest difference of d = 0.14.

Figure 6: What is the Impact of Systematic Phonics Compared to Unsystematic Phonics?


Lastly, the database of research collected by Rachel Schechter and LXD Research from Evidence for ESSA was used to examine whether higher quality studies yielded lower effect sizes. In general, a negative correlation was found between higher quality research and effect sizes, but that relationship was weak and not consistent (r = -0.19, p < 0.0001).

Studies reviewed by Evidence for ESSA that had a promising ranking had a mean effect size of 0.15. Studies with a moderate ranking had a mean effect size of 0.37. Studies with a strong rating had a mean effect size of 0.25. Studies with a strong rating that were large-scale had a mean effect size of 0.17.



Pedagogy Non Grata was founded in 2018. Our original approach was to provide teacher-friendly reviews of education meta-analyses. We hoped that doing this would help make education research more accessible to teachers and, thus, lead to an increase in evidence-based practices and equitable learning outcomes for students.


In 2020, we shared a few reviews of language programs using the principles of meta-analysis. This means we reviewed studies of these programs and calculated a mean effect size for each program. The interest in Pedagogy Non Grata grew very rapidly when we did this, and it became incredibly obvious that there was an interest in such reviews. While we continued to share summaries of peer-reviewed meta-analyses, we also added many language program reviews over time. However, these reviews did receive some criticisms from scholars in the field, most notably for:


  1. These reviews essentially amounted to original scientific research, but we weren't peer-reviewing said research.

  2. Our inclusion criteria and search parameters were not clear.

  3. We did not explain the calculation methods.

  4. We did not include confidence intervals.

  5. We did not include moderator analyses.

  6. Many reviews were conducted entirely by one person. However, meta-analysis requires replication by at least one extra author to achieve reliability.

  7. None of these reviews were authored by a person with any formal training in meta-analysis.


All of these criticisms were indeed fair. This project was a massive learning curve, and we have worked extremely hard over the last two and a half years to improve the quality of these reviews. Most reviews have been updated at least once to match our increasing standards of quality, and some reviews have been updated multiple times. We are planning to update more reviews in the near future. Most notably, we will be updating our review of Reading Recovery to include additional studies found in new systematic searches and to reflect a better weighting method.


As we improved our processes, we made the following improvements to reviews over time:


  1. We had effect size calculations and sometimes coding duplicated by at least one other author. For a few reviews, there were multiple duplications.

  2. We attempted to make our inclusion criteria and search parameters more clear.

  3. We attempted to make our calculation methods more clear.

  4. We included confidence intervals.

  5. We included moderator analysis, and even regression analysis, on a few reviews.

  6. We started to weight analyses, according to sample size, with the inverse variance method.

  7. We added an appeal process so that companies could share criticisms with us. Indeed, we have had multiple reviews appealed and updated upon receiving fair criticism from companies.
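For readers curious what inverse variance weighting looks like in practice, here is a minimal sketch. It assumes a fixed-effect model and a common large-sample approximation for the variance of Cohen's d; the function names and the example numbers are hypothetical and are not taken from any Pedagogy Non Grata review.

```python
import math


def d_variance(d, n_treat, n_ctrl):
    """Approximate sampling variance of Cohen's d (large-sample formula)."""
    n_total = n_treat + n_ctrl
    return n_total / (n_treat * n_ctrl) + d ** 2 / (2 * n_total)


def inverse_variance_mean(effects, variances):
    """Pool effect sizes, weighting each study by 1 / variance."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    # 95% confidence interval under a normal approximation
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)


# Hypothetical studies: (effect size, treatment n, control n)
studies = [(0.50, 20, 20), (0.20, 300, 300)]
effects = [d for d, _, _ in studies]
variances = [d_variance(d, nt, nc) for d, nt, nc in studies]
pooled, ci = inverse_variance_mean(effects, variances)
print(round(pooled, 2), [round(x, 2) for x in ci])
```

Because larger studies have smaller variances, the large study dominates the pooled result, which is why the weighted mean sits much closer to 0.20 than to 0.50.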


With these improvements in mind, I feel much more confident about our more recent reviews than the ones written in 2020. However, I believe one criticism stays as valid now as ever before. None of this research is peer-reviewed.


Late in 2022, we submitted two large-scale meta-analyses for peer review, based on original research shared on this blog. This has been a great learning experience for us, and that submitted research has been updated and resubmitted multiple times now. However, this process has been extremely long: six months in, our research still has not been published. While I will continue to work on submitting more research for peer review, the duration of the peer-review process leaves me more convinced than ever that peer-reviewing these language program reviews on a regular basis is not feasible, as it would delay the release of each review by many months. Moreover, the format of this blog was specifically meant for teachers, not academics, and peer-reviewed journals require a very different format. We continually try to strike a balance between presenting information in a teacher-friendly format and being scientifically accurate. This is a difficult balance to strike, as these goals are not well aligned.


There are other, more accredited organizations that offer reviews of such programs, most notably Evidence for ESSA, What Works Clearinghouse (WWC), and EdReports. These reviews can be very useful. I think Evidence for ESSA can be especially helpful for teachers, and I have a video explaining why here: 


I also think WWC can be extremely useful for researchers; however, I think it is too unfriendly for teacher use. That said, none of these review sources use the principles of meta-analysis. EdReports only looks at programs qualitatively. WWC and Evidence for ESSA do calculate a mean effect size for programs, but instead of using meta-analytic principles, they select the highest quality studies and take a mean effect size of only those studies; typically, only 1-3 studies are included in total. In my non-expert opinion, there are some methodological limitations with their approach, as outlined in this paper on the topic, written by myself and Dr. Rachel Schechter for the International Journal of Modern Education Studies: 


Most specifically, I think the rating system of promising, moderate, or strong is very confusing for teachers. This system appears to be based mostly on the rigor of the research assessed and not on the results of the studies. In other words, a program can yield low or negligible improvements in its research but still be given the "strong" rating. Personally, I think research on any pedagogy or program should consider the research impact, quality, and quantity, as outlined in this article, also written with Dr. Rachel Schechter. 


That said, I recognize that the quantitative reviews on this website do not hold the same level of legitimacy as those conducted by WWC and Evidence for ESSA. So rather than continuing to conduct my reviews in isolation, I wanted to provide a reasonable comparison. All future language program reviews will include a reference to the findings of other reviewing organizations. Moreover, in this article, I will provide a comparison between the findings of Evidence for ESSA to date and our findings to date. Dr. Rachel Schechter and her firm, LXD Research, recently conducted a systematic review of the Evidence for ESSA website and recorded its findings for each language program. She was kind enough to share that research with me for the purposes of this article. The LXD Research data was then duplicated by Pedagogy Non Grata and reviewed to minimize potential errors. Below, in Figures 1-3, you can see the effect size findings of Evidence for ESSA on language programs. Findings are separated by the research strength ratings of strong, moderate, and promising. In brackets next to each program is the number of student participants that the analysis was based on. All initial data collection was conducted by LXD Research, and coding was conducted by Nathaniel Hansford and Elizabeth Reenstra (Pedagogy Non Grata).

Figure 1: Evidence for ESSA strong findings. 

Figure 2: Evidence for ESSA Moderate Findings


Figure 3: Evidence for ESSA Promising Findings

This research is based on the current findings of Evidence for ESSA (as of June 2023). However, the requirements for Evidence for ESSA are changing. In previous years, studies only had to have a minimum of 35 students per group to be included within the strong or moderate tier. This has recently been changed to 350 per group. Studies that have smaller numbers of participants, on average, show larger findings due to an increased risk of authors conducting multiple studies to search for positive findings (Lee & Hotopf, 2012). In order to retroactively model how this change would alter previous Evidence for ESSA findings, I have duplicated the "strong" findings analysis while removing effect sizes based on sample sizes below 700. The results can be seen in Figure 4.


Figure 4: Evidence for ESSA Large Scale, Strong Studies 

Pedagogy Non Grata Results:

For Language Program Reviews conducted by Pedagogy Non Grata, a database was created by Pamela Aitchison. All programs were then coded for the quality of their research by Nathaniel Hansford and Pamela Aitchison. However, some studies were also coded by Elizabeth Reenstra.


To evaluate research quality, studies were awarded 1 point for each of the following criteria: sample size above 100, randomized design, standardized assessment used, and a duration of at least 14 weeks. All programs that did not have experimental or quasi-experimental research were excluded. A graph was then generated based on these results. Programs were ranked by mean effect size, the number of studies was indicated in brackets next to each program, and the bars were color-coded according to study quality. A blue bar indicates that, on average, the program's research had 4 points for quality; a green bar indicates that, on average, it had 3 points; an orange bar indicates that, on average, it had 2 points; and a red bar indicates that, on average, it had 1 or fewer points. It should be noted that for the Wonders program, only one study could be properly coded for quality (Eddy, 2002), as the other study was no longer publicly available. This study showed the smaller effect size of the two reviewed studies.
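The scoring rule described above can be sketched in a few lines of code. The four criteria and the color scheme come from this article; the function names, the example studies, and the choice to round a fractional program average to the nearest whole point are assumptions made for illustration only.

```python
import math


def quality_score(sample_size, randomized, standardized_assessment, duration_weeks):
    """Award 1 point per criterion met (0 to 4 points per study)."""
    criteria = [
        sample_size > 100,              # sample size above 100
        bool(randomized),               # randomized design
        bool(standardized_assessment),  # standardized assessment used
        duration_weeks >= 14,           # duration of at least 14 weeks
    ]
    return sum(criteria)


def bar_color(mean_quality):
    """Map a program's average quality score to a bar color.

    Assumption: fractional averages are rounded half up to a whole point.
    """
    rounded = math.floor(mean_quality + 0.5)
    if rounded >= 4:
        return "blue"
    if rounded == 3:
        return "green"
    if rounded == 2:
        return "orange"
    return "red"


# Two hypothetical studies of one program
study_scores = [
    quality_score(250, True, True, 20),   # meets all four criteria
    quality_score(60, False, True, 12),   # meets only the assessment criterion
]
program_mean = sum(study_scores) / len(study_scores)
print(study_scores, program_mean, bar_color(program_mean))
```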


It should be noted that the inclusion criteria for Pedagogy Non Grata reviews are less strict than the inclusion criteria for Evidence for ESSA or WWC reviews. For a study to be included in a review by Pedagogy Non Grata, it needed to be on the relevant topic, have a control group, contain sufficient reporting for an effect to be found, and involve a minimum of 20 study participants. The inclusion criteria for Evidence for ESSA and WWC are much more specific and can be found in the links below, respectively: 


The inclusion criteria for Evidence for ESSA span five pages, while the inclusion criteria and protocols for WWC extend over 52 pages. The strict inclusion criteria of these organizations are theoretically intended to produce more sensitive and reliable results. However, this also means that they include far fewer studies in their final review. Moreover, in my 2023 paper with Dr. Rachel Schechter, we thoroughly examined the assertion that this methodology yields more reliable results than meta-analysis and found this claim to be untrue. The results of our synthesized Pedagogy Non Grata findings can be seen below in Figure 4.

Figure 4: Pedagogy Non Grata Review Results. 

*Bars are coded for research quality. Blue denotes the highest quality, green the second highest, orange the third highest, and red the lowest. The number of studies each effect size is based on can be found in brackets next to each program. 


Overall, this analysis looked at 115 studies and 27 programs, revealing a mean effect size of 0.36 with 95% confidence intervals of [0.26, 0.46]. In comparison, Evidence for ESSA reviewed 92 programs, yielding a mean effect size of 0.24 with confidence intervals of [0.20, 0.28]. The fact that Evidence for ESSA, on average, found lower results is likely a consequence of their stricter inclusion criteria. Plonsky and Oswald (2014) demonstrated that, on average, less rigorous studies tend to show larger effect sizes.
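As an illustration of how a mean effect size and its 95% confidence interval can be computed across programs, here is a minimal sketch. It assumes a normal approximation (mean plus or minus 1.96 standard errors), which may differ from the exact method used for the figures above; the function name and the effect sizes are hypothetical.

```python
import math
import statistics


def mean_with_ci(values, z=1.96):
    """Return (mean, lower bound, upper bound) using a normal approximation."""
    m = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))  # standard error of the mean
    return m, m - z * se, m + z * se


# Hypothetical program-level effect sizes
program_effects = [0.10, 0.25, 0.30, 0.45, 0.60]
m, lower, upper = mean_with_ci(program_effects)
print(f"mean = {m:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The interval narrows as more programs are included, which is one reason a review covering more programs can report a tighter interval.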


However, it is worth noting that for many program reviews, the difference in effect sizes between Evidence for ESSA and Pedagogy Non Grata was not statistically significant. For example, none of the following programs showed significant differences: Targeted Reading Intervention, LLI, iStation, Read 180, Reading Recovery, and SIPPS. The only programs for which the two organizations' effect sizes differed significantly were Wilson, Lexia, and ARC. This similarity is surprising, given the vast difference in the methodologies being used.

How Do Structured Literacy and Balanced Literacy Program Results Vary Across Both Sets of Reviews?


One of the topics most discussed on this blog is the Reading Wars debate, particularly the question of whether structured literacy (systematic phonics) outperforms balanced literacy (unsystematic phonics). Out of intellectual curiosity, we sought to compare the results of these reviews for structured literacy against balanced literacy programs. We calculated the overall mean effect size for each type of program in both sets of reviews. It's important to note that while this methodology resembles meta-analysis, it is a more simplistic approach, as the sum of the results was divided by the number of programs reviewed rather than the number of studies. Additionally, results were not weighted by sample size.
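To show why dividing by the number of programs rather than the number of studies matters, here is a small sketch comparing the two approaches. The program names and effect sizes are hypothetical and chosen only to make the contrast visible.

```python
from statistics import mean

# Hypothetical data: each program maps to the effect sizes of its studies
program_studies = {
    "Program A": [0.10, 0.20, 0.30, 0.40],  # four studies
    "Program B": [0.80],                     # one study
}


def program_level_mean(studies_by_program):
    """Average the program means: every program counts once (approach used here)."""
    return mean(mean(effects) for effects in studies_by_program.values())


def study_level_mean(studies_by_program):
    """Average all studies together: every study counts once."""
    return mean(d for effects in studies_by_program.values() for d in effects)


print(round(program_level_mean(program_studies), 3))
print(round(study_level_mean(program_studies), 3))
```

In the program-level average, a program with a single study carries as much weight as a program with four, so the two numbers can differ noticeably.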


Although the reviews conducted for this blog were already coded, we were not familiar enough with the 95 programs reviewed by Evidence for ESSA to properly code them as balanced literacy, structured literacy, or other. To classify these programs, we conducted polls widely shared on multiple social media platforms. The polling questions were phrased as follows:


“We need help classifying these programs. For the purposes of this form, we are defining each classification based on the following:


I Don't Know: Please don't guess. If you're not personally familiar with the program, please select this option.


Structured Literacy: The explicit instruction of the 5 pillars of Literacy (phonics, phonemic awareness, vocabulary, fluency, and comprehension). Structured literacy also includes systematic phonics, meaning phonics is taught explicitly, with a scope and sequence, and with decodables.


Balanced Literacy: Also teaches the 5 pillars of literacy instruction. However, phonics and phonemic awareness instruction are taught as needed, not systematically. Balanced literacy programs also include instruction or prompts of contextual clues, such as MSV.


Whole Language: Early reading instruction, or reading interventions that do not include phonics, morphology, or phonemic awareness instruction.


Supplementary: Refers to add-on tools, readers, and/or programs that only target one specific skill. E.g., Words Their Way, Letterland, or Secret Stories. This category refers to items that do not represent complete literacy programs.


Other: Anything that does not fit into the above criteria.”


A total of 112 people responded to the poll, with 96 respondents answering every polling question. Most participants selected “I don’t know” for most programs, giving us increased confidence that respondents only coded programs they had personal experience with.


Responses were used to categorize Evidence for ESSA reviews as not codable, structured literacy, balanced literacy, or supplementary. Both Nathaniel Hansford and Elizabeth Reenstra coded each program. Mean effect sizes and 95% confidence intervals were then calculated based on these results. It should be noted that no programs were classified as Whole Language in this analysis.
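One way to turn poll responses into a classification is sketched below. This is not necessarily the rule the coders applied: the majority-vote logic, the minimum of three informed votes, and the treatment of ties as "not codable" are all assumptions for illustration, and the responses are hypothetical.

```python
from collections import Counter


def classify(responses, min_votes=3):
    """Classify a program by majority vote among informed respondents.

    Assumptions: "I Don't Know" answers are ignored, at least `min_votes`
    informed answers are required, and ties are treated as not codable.
    """
    informed = [r for r in responses if r != "I Don't Know"]
    if len(informed) < min_votes:
        return "not codable"
    counts = Counter(informed).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "not codable"  # tie between the top two categories
    return counts[0][0]


# Hypothetical responses for one program
responses = (
    ["I Don't Know"] * 10
    + ["Structured Literacy"] * 4
    + ["Balanced Literacy"]
)
print(classify(responses))
```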


The results of the original survey can be found here: 


For Evidence for ESSA reviews, the following programs were classified as Balanced Literacy-based: ARC, Journeys, LLI, Journeys Secondary, Reading Recovery, and Reading Rescue. For these program reviews, a mean effect size of 0.21 was found, with 95% confidence intervals of [0.05, 0.37].


For Evidence for ESSA Reviews, the following programs were classified as Structured Literacy-based: Lindamood, Targeted Reading Instruction, (RS), ECRI, Lexia Core, Lexia Intervention, Super Kids, SpellRead, Wilson Reading System, 95% Core Program, and Corrective Reading. For these program reviews, a mean effect size of 0.36 was found, with confidence intervals of [0.17, 0.54].


For Pedagogy Non Grata reviews, the following programs were classified as Balanced Literacy: ARC, LLI, Reading Recovery, and Units of Study. However, it should be noted that ARC has made substantial changes to its program, and whether or not the program should still be classified as Balanced Literacy is a debatable topic. The research referenced in this article was conducted prior to those changes. Similarly, Units of Study and LLI have also made recent updates to their programming, but those changes are less substantial than those made by ARC.


It should be noted that we previously referred to Read 180 as a balanced literacy program. We made this coding decision because the authors of Read 180 studies specifically referred to their program as a balanced literacy program. However, many of those authors have since contacted us and rejected the classification of balanced literacy, as it has come to be understood in the current academic context. Unfortunately, there is not enough publicly available information on the program for us to independently code it. Therefore, we excluded it from this analysis. The balanced literacy program reviews (ARC, LLI, Reading Recovery, and Units of Study) showed a mean effect size of 0.26, with 95% confidence intervals of [0.03, 0.50].


For Pedagogy Non Grata reviews, the following programs were classified as structured literacy-based: Jolly Phonics, Empower, Letterland, Spire, Corrective Reading, Lexia Core-5, Targeted Reading Intervention, SpellRead, Spell-Links, MindPlay, Amplify CKLA, iReady, Wilson, SIPPS, Open Court, Wonders, iStation, and Reading Mastery. It should be noted that we felt iReady and iStation met the classification of structured literacy. However, one of the criticisms we have seen of these programs is that they don't include enough phonics instruction, which is also a common criticism of balanced literacy programs. Both programs had low effect sizes. These program reviews showed a mean effect size of 0.40, with 95% confidence intervals of [0.27, 0.53]. These results are graphed in Figure 5.

Figure 5: Comparing Evidence for ESSA and Pedagogy Non Grata, Language Program Reviews. 


As Figure 5 illustrates, structured literacy outperformed balanced literacy in Pedagogy Non Grata reviews by 0.14 and by 0.15 in Evidence for ESSA reviews. To provide context, we can refer to meta-analyses that compared systematic phonics with unsystematic phonics, as the most fundamental difference between structured literacy and balanced literacy is that structured literacy programs teach phonics systematically.


The National Reading Panel (NRP, 2000) demonstrated that systematic phonics instruction, on average, yielded a mean effect size of 0.44. However, this effect size did not account for the impact of unsystematic phonics. Camilli et al. (2003) found that systematic phonics outperformed unsystematic phonics by a mean effect size of 0.27, while Stuebing et al. (2008) reported a mean effect size difference of 0.31 in favor of systematic phonics.


Earlier this year, I submitted a meta-analysis for peer review with Joshua King, which found a mean weighted effect size of 0.46 for structured literacy and 0.23 for balanced literacy. However, this paper has not been published yet. An earlier version of the article was posted on this blog and can be found here: However, the article has been heavily edited since those results were published on this blog. To make these results easier to compare visually, they have been graphed in Figure 6. 
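The differences quoted in this section can be checked with simple arithmetic. The sketch below uses only the effect sizes reported in this article; the dictionary layout and labels are just one convenient way to organize them.

```python
# Mean effect sizes (structured literacy, balanced literacy) reported in this article
paired_results = {
    "Pedagogy Non Grata reviews": (0.40, 0.26),
    "Evidence for ESSA reviews": (0.36, 0.21),
    "Hansford & King (2023, unpublished)": (0.46, 0.23),
}

# Differences in favor of systematic phonics
differences = {
    source: round(structured - balanced, 2)
    for source, (structured, balanced) in paired_results.items()
}

# These meta-analyses reported the difference directly
differences["Camilli et al. (2003)"] = 0.27
differences["Stuebing et al. (2008)"] = 0.31

for source, diff in sorted(differences.items(), key=lambda item: item[1]):
    print(f"{source}: {diff:.2f}")

smallest = min(differences, key=differences.get)
```

Every source shows a positive difference in favor of systematic phonics, with the Pedagogy Non Grata reviews showing the smallest gap.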

Figure 6: Systematic vs unsystematic phonics: A review of meta-analytic findings. 


Interpreting Effect Sizes:

One common criticism of meta-analysis, in general, is that studies with more rigorous methods tend to yield lower results. Therefore, interpreting effect sizes somewhat depends on the level of rigor applied in the study. On numerous occasions, I have seen scholars claim on social media that studies of the highest quality, which show a mean effect size below 0.20, should still be considered meaningful, even if the result is negligible according to Cohen's guide.


I thought Evidence for ESSA data would provide an interesting way to evaluate the legitimacy of this claim. I have calculated the mean effect size, confidence intervals, quartiles, and IQR for each tier of research quality as outlined by Evidence for ESSA. The mean effect size and confidence intervals allow us to understand what constitutes an average result. The quartile analysis helps us gauge how an effect size compares to others from similar research. The IQR analysis helps us identify outlier status. It logically follows that study results lower than the IQR result should be regarded as negligible.
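For readers who want to see how quartiles and the IQR are calculated, here is a minimal sketch. The effect sizes are hypothetical, and the lower and upper fences use the common 1.5 x IQR convention, which is an assumption and may not match the cut-off used in this analysis.

```python
import statistics


def quartile_summary(effect_sizes):
    """Return quartiles, IQR, and 1.5 x IQR fences for a list of effect sizes."""
    q1, median, q3 = statistics.quantiles(effect_sizes, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": q1 - 1.5 * iqr,  # values below this are low outliers
        "upper_fence": q3 + 1.5 * iqr,  # values above this are high outliers
    }


# Hypothetical effect sizes for one research tier
tier_effects = [0.0, 0.1, 0.2, 0.3, 0.4]
summary = quartile_summary(tier_effects)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```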

Figure 7: Interpreting Effect Sizes.


Interestingly, the above analysis did not reveal significant trends among quartiles, making this analysis quite challenging and possibly an exercise in futility. The correlation between study quality and effect size was weak (r = -0.19, p < 0.0001). That said, we did find some evidence to support the idea of lowering the benchmark for negligible findings when considering higher quality studies. However, we did not find support for reducing the benchmark below 0.10. Furthermore, we did not consistently find evidence supporting the notion that higher quality studies showed reduced effects. The mean effect size and confidence intervals for promising research were lower than the mean effect size and confidence intervals for strong and large-scale research.
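The correlation reported above is a Pearson correlation coefficient. A minimal sketch of that calculation follows; how the quality tiers were converted to numbers is not stated in this article, so the 1-4 quality codes and the effect sizes below are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical data: a numeric quality code per study and its effect size
quality = [1, 1, 2, 2, 3, 3, 4, 4]
effects = [0.45, 0.20, 0.35, 0.30, 0.15, 0.40, 0.25, 0.10]
r = pearson_r(quality, effects)
print(f"r = {r:.2f}")
```

A negative r means higher quality codes tend to go with smaller effect sizes; a value near zero means the relationship is weak.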



On average, the Evidence for ESSA review process yields slightly lower results than the Pedagogy Non Grata review process. The Evidence for ESSA model employs stricter inclusion criteria, leading to the exclusion of more studies and, consequently, a smaller pool of research to review per program. For most programs reviewed by both organizations, the results are statistically similar across the reviews. There is compelling evidence, stemming from various research methodologies, demonstrating that systematic phonics outperforms unsystematic phonics. An analysis into the impact of research quality on effect size results did not yield meaningful results.

A note on authorship:

-Nathaniel Hansford wrote this article and calculated the mean overall effect sizes. 

-Rachel Schechter and research assistants from LXD Research compiled the Evidence for ESSA research. 

-Elizabeth Reenstra and Pamela Aitchison helped with coding and effect size calculations.

-All effect size calculations for Pedagogy Non Grata Language reviews were duplicated by at least one other author. 

-The following individuals have helped to duplicate coding or effect size calculations for Pedagogy Non Grata language program reviews: Elizabeth Reenstra, Pamela Aitchison, Joshua King, and Sky McGlynn. 

Questions About This Article? Reach out to us at 

Last Edited: 2023-10-16



Camilli, G., Vargas, S., & Yurecko, M. (2003). Teaching children to read: The fragile link between science and federal education policy. Education Policy Analysis Archives, 11(15). Retrieved March 20, 2007, from


Evidence for ESSA. (2023). Reading program reviews. Johns Hopkins University.


Evidence for ESSA: standards and procedures. (n.d.). Slavin, R. Johns Hopkins University.


Hansford, N., & Schechter, R. E. (2023). Challenges and opportunities of meta-analysis in education research. International Journal of Modern Education Studies, 7(1), 218-231.


Hansford, N., & Schechter, R. E. (2023). Is this _____ evidence-based? Rethinking how we evaluate evidence. Pedagogy Non Grata.


Hansford, N., & King, J. (2023). Structured Literacy Compared to Balanced Literacy. Unpublished.


Lee, W., & Hotopf, M. (2012). Funnel plot. Accessed at on March 24, 2023.


NRP. (2000). Teaching Children to Read: An Evidence-Based Assessment of the Scientific Research Literature on Reading and Its Implications for Reading Instruction. United States Government. Retrieved from


Pedagogy Non Grata. (2023). Language program reviews.


Plonsky, L., & Oswald, F. (2014). How big is "big"? Interpreting effect sizes in L2 research. Language Learning, 64, 878-912. doi:10.1111/lang.12079


Stuebing, K. K., Barth, A. E., Cirino, P. T., Francis, D. J., & Fletcher, J. M. (2008). A response to recent reanalyses of the National Reading Panel report: Effects of systematic phonics instruction are practically significant. Journal of Educational Psychology, 100(1), 123–134.


WWC. (2008). Evidence Standards for Reviewing Studies. What Works Clearinghouse.
