Comments on Bishop et al.’s Article on Boaler’s Work
This post makes a few comments on “A Close Examination of Jo Boaler’s Railside Report” by Bishop, Milgram, and Clopton (hereafter Bishop et al.), comparing its account with that of two articles written by Boaler and Staples: a 2005 conference paper and a 2008 journal article.
Disclosure: I am not and have never been a friend or collaborator of any author listed above. On the other hand, the math education world is small. I work on projects and communicate regularly with people who are or have been friends or collaborators of Boaler or Milgram.
No one that I know condones the actions of Bishop et al. in attempting to determine the identities of the schools in Boaler’s studies. (September 2013 update: Two people have contacted me to say that they do. They note the rationale given by Milgram in BrainMind Magazine (with response from Boaler in the same issue), Nonpartisan Educational Review, and on his web site.) However, at first glance, it is hard to determine what the two sides are claiming and on what basis. Some claims are not connected with details of the study, the details are complicated, and the two sides seem to talk past each other. The comments below are intended to help in making sense of the articles, rather than to be an exhaustive discussion of their merits.
All three articles concern the performance of students from three schools which are given the pseudonyms Railside, Greendale, and Hilltop. Railside and Greendale had large percentages of English language learners and of students who were eligible for free or reduced-price lunches; Hilltop had few of either.
Part of the controversy (described in Inside Higher Education) centers on the different distributions of scores on different mathematics tests. Students from Railside outscored students from the other schools on the California Standards Test in Algebra and, in Years 2 and 3 of the study, on tests of algebra and geometry developed by the researchers (Boaler & Staples, 2008, p. 621), but the reverse occurred for two state tests in Year 3 of the study. Different groups of students took the different tests, making comparison complicated. At Railside, block scheduling allowed students to take geometry in their second, third, or fourth year of high school, but this was not the case at the other schools. Due to the difficulties of obtaining individual test scores from school district offices, Boaler and Staples reported school-level scores on state tests, rather than scores for the students in their study (2005, p. 11).
Thus, possible explanations for the different score distributions include sample bias (due to the different groups of test-takers) and the nature of the different tests. Bishop et al. attribute the difference to sample bias (“different populations of students at the three schools,” p. 2) and also claim that the researchers’ tests were “tailored to favor their [Railside’s] program” (p. 3). The researchers say “Only content common to the three approaches was included and an equal proportion of question-types from each of the three teaching approaches” (Boaler & Staples, 2008, p. 618). They contend there were many reasons for the discrepancy between scores, but that item format was a major one: “the cultural and linguistic barriers provided by the state tests” (2005, p. 12).
Details of publication and date. The article written by Bishop et al. is undated, but refers to Boaler’s “recent” talk at a National Council of Teachers of Mathematics meeting in Anaheim, CA. According to the NCTM web site, this occurred in 2005.
Bishop et al. say that Boaler “has just published an already well known study” but do not give a citation with publication information. The referent appears to be a paper written by Boaler and Staples called “Transforming Students’ Lives Through an Equitable Mathematics Approach: The Case of Railside School.” In 2005, this paper was “published” in the sense of being presented at the American Educational Research Association meeting. Talks at this meeting generally refer to associated papers that are available upon request.
The AERA paper is similar to the Boaler and Staples article “Creating Mathematical Futures Through an Equitable Teaching Approach: The Case of Railside School,” which was published in the Teachers College Record in 2008. However, the latter also reports on Years 3, 4, and 5 of the five-year study.
The study students. In contrast with Hilltop and Greendale, Railside did not track students. All students entering high school at Railside were placed in algebra courses, but students entering Hilltop and Greendale might be placed in remedial non-algebra courses or geometry as well as algebra. The Hilltop and Greendale students in the study were those placed in algebra (Boaler & Staples, 2005, pp. 8, 13). Thus, neither the advanced nor the remedial students at Hilltop and Greendale were included in the study. Moreover, the study was longitudinal. Analysis of scores on study tests included only scores from students who began algebra in Year 1, took geometry in Year 2, and took the four tests in Years 1, 2, and 3 (Boaler & Staples, 2008, p. 621).
Course organization at Railside.
Railside followed a practice of “block scheduling” and lessons were 90 minutes long, with courses taking place over half a school year, rather than a full academic year. In addition, the introductory algebra curriculum, generally taught in one course in US high schools, including Greendale and Hilltop, was taught in the equivalent of two courses at Railside. (Boaler & Staples, 2005, pp. 12–13)
Thus, it appears that by the end of Year 1, Railside students had much more mathematics instruction (the equivalent of two courses) than students in algebra courses at Hilltop and Greendale. In Year 2 of the study, however, a student at Railside might not be taking geometry. A comparison of the entire Railside grade-level cohort with the cohort of Hilltop and Greendale students who took algebra in Year 1 and geometry in Year 2 would include Railside students who had not yet taken geometry.
Scores on state tests. Like Boaler and Staples, Bishop et al. compare school-level results on state mathematics tests: the California Standards Test for algebra, CAT 6, and AYP (adequate yearly progress). Not all students took all state tests. Only students who completed algebra took the state algebra test (Boaler & Staples, 2005, p. 11). Both sets of authors note that the Railside scores are lower on STAR and AYP than those of Greendale and Hilltop. However, as noted earlier, their explanations differ.
There were many reasons for the [Railside] students’ lower performance we contend, most importantly the cultural and linguistic barriers provided by the state tests. The correlation between students’ scores on the language arts and mathematics sections of the AYP tests, across the whole state of California is a staggering 0.932 for 2004. This data point provides a strong indication that the mathematics tests were testing language as much as mathematics. (Boaler & Staples, 2005, p. 12)
At Railside and Greendale, 25% and 24% of students, respectively, were English language learners. The corresponding figure at Hilltop was 0% (Boaler & Staples, 2005, p. 4). The hypothesis that language was a factor might have been explored further via an analysis of the tests: study tests and the California Standards Test in algebra vs AYP and STAR tests. None of the authors explored other aspects of test format. For example, the study tests used constructed response items in which students needed to show their work. In contrast, the state exams were multiple choice. The testing literature documents how score patterns among groups can vary according to item format (see, e.g., Gipps and Murphy’s A Fair Test?).
Scores and samples for study tests. Bishop et al. say:
At Railside, her population appeared to consist primarily of the upper two quartiles, while at the other two schools the treatment group was almost entirely contained in the two middle quartiles. (p. 2)
Table 1 combines information from the two Boaler and Staples articles (2005, p. 10; 2008, pp. 610, 613, 627). At Hilltop and Greendale, the standard sequence of courses was algebra, geometry, advanced algebra, then precalculus. Students who took calculus were likely to have been placed in an upper track upon entering high school. I cannot find any information about the proportions of students in lower tracks other than the assertion quoted above. Thus, the study populations at Greendale and Hilltop appear to have excluded the 23% and 30% (respectively) of students on upper tracks.
Table 1
School      Calculus course-takers   Approx. no. in each grade   Approx. total enrollment
Hilltop     30%                      475                         1900
Greendale   23%                      300                         1200
It is not clear that “At Railside, her population appeared to consist primarily of the upper two quartiles.” Railside’s total enrollment was approximately 1500, so there were about 375 students in each grade. Approximately 87% of eligible students agreed to be in the study. These numbers are consistent with the number of Railside test scores reported for Year 1 of the study (Boaler & Staples, 2005, p. 9). The Railside sample sizes were:
347 (Y1 pretest)
344 (Y1 posttest).
Thus, “upper two quartiles” does not describe the Railside sample for Year 1 of the study.
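The arithmetic behind this comparison can be sketched in a few lines of Python. This is only a rough check: it assumes, as above, that the total enrollment of about 1500 is spread evenly over four grades, and the variable and function names are mine, not the researchers’.

```python
# Rough check of the Year 1 Railside sample sizes against the approximate grade size.
# Assumption: total enrollment (about 1500) is split evenly across four grades.
TOTAL_ENROLLMENT = 1500
GRADES = 4

grade_size = TOTAL_ENROLLMENT / GRADES   # about 375 students per grade
two_quartiles = grade_size / 2           # number of students in any two quartiles

# Sample sizes reported for the Year 1 study tests.
year1_samples = {"Y1 pretest": 347, "Y1 posttest": 344}


def share_of_grade(n, size=grade_size):
    """Return the sample size n as a fraction of the approximate grade size."""
    return n / size


for test, n in year1_samples.items():
    print(f"{test}: {n} students, {share_of_grade(n):.0%} of the grade "
          f"(two quartiles would be about {two_quartiles:.0f} students)")
```

Under that assumption, both Year 1 samples come out at more than nine-tenths of the approximate grade size, well above the half of the grade that two quartiles would represent.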
In Years 2 and 3, the Railside sample sizes were:
199 (Y2 posttest)
130 (Y3 posttest).
Recall that Railside used a block schedule, allowing students to take geometry in their second, third, or fourth year of high school. So the reduction in numbers is due to researcher selection only in the sense that the researchers tested only those students who took algebra in Year 1 and geometry in Year 2. However, students with this course-taking pattern might have some characteristic that put them in the upper two quartiles according to some measure. Whatever that measure might be, it does not appear to be the same as the one used to decide which students were placed in algebra upon entry at Greendale and Hilltop. Only this information is given: “The students in geometry classes at Railside did not represent a selective group; they were of the same range as the students entering Year 1” (Boaler & Staples, 2008, p. 620). Boaler and Staples might have strengthened their case by giving details and analysis of the score distributions on Year 1 tests for the two groups: those who took geometry in Year 2 and those who did not.
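To see how much the Railside sample shrank, the reported sample sizes can be compared with the Year 1 pretest sample. The following is a minimal Python sketch that uses only the counts listed above; the names are illustrative, and the calculation compares counts only, since the articles do not let a reader track individual students.

```python
# Railside sample sizes for the study tests, as listed in this post.
samples = {
    "Y1 pretest": 347,
    "Y1 posttest": 344,
    "Y2 posttest": 199,
    "Y3 posttest": 130,
}

baseline = samples["Y1 pretest"]


def relative_size(n, base=baseline):
    """Return sample size n as a fraction of the Year 1 pretest sample size.

    This compares counts only; it does not establish that the same
    individuals appear in both samples.
    """
    return n / base


for test, n in samples.items():
    print(f"{test}: {n} students, {relative_size(n):.0%} of the Y1 pretest sample")
```

The Year 2 and Year 3 samples are a little under three-fifths and a little under two-fifths, respectively, of the Year 1 pretest sample, which is why the question of who remained in the sample matters for the comparison.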
To summarize:
Test              Railside sample            Outcome on study tests
Year 1 pretest    Most students in grade 9   Average below Greendale and Hilltop algebra students
Year 1 posttest   Most students in grade 9   Average similar to Greendale and Hilltop algebra students
Year 2 posttest   199 students               Average above Greendale and Hilltop geometry students
Year 3 posttest   130 students               Average above Greendale and Hilltop advanced algebra students
Gender differences in SAT scores. Bishop et al. note that although Boaler and Staples report no gender differences in performance on the study tests, there were gender differences on SAT scores from Railside (Bishop et al., p. 6). This type of finding is well known. For example:
When the SAT-M scores of boys and girls are matched, girls go on to earn higher grades in college mathematics classes (see Royer & Garofoli, 2005, for a review). The SAT-M’s underprediction of girls’ mathematics performance is widely known (e.g., Gallagher & Kaufman, 2005; Nature Neuroscience Board of Editors, 2005; Willingham & Cole, 1997). (Spelke, 2005, p. 955)
Thus, Bishop et al.’s comment “this [lack of gender difference on study tests] is not supported by the SAT I data” (p. 6) does not seem relevant.
Scores on Advanced Placement tests other than calculus. Bishop et al. note that the Railside students did not take the Advanced Placement Calculus test between 2000 and 2004 (p. 7). However, Bishop et al. report Railside students’ combined AP scores for unspecified subjects. There are now 34 AP courses and exams in mathematics and computer science, arts, English, history and social science, world languages and cultures, and science. It is not clear what can be inferred about mathematics at Railside from its students’ AP scores in some unspecified subset of these other subjects.
Concluding Remarks
The information from both sides is incomplete and reflects different viewpoints. The additional details about the schools collected by Bishop et al. do little to explain the different score distributions on the study tests and the state tests.
In their abstract, Boaler and Staples describe the achievement of the study students in general terms (e.g., “Railside students” vs “Greendale and Hilltop students”). They describe the samples used, but do not discuss their limitations, nor do they frame the results and discussion in terms of the samples rather than in terms of the schools.
This might seem deceptive. However, it’s consistent with a tradition in developmental psychology, a discipline which influences some areas of research in mathematics education. In this tradition, findings are reported in ways that “universalize” from a sample to a whole population. So, for example, a recent abstract says (in part), “Among young adolescents in the top 1% of quantitative reasoning ability, individual differences . . . lead to differences in educational, occupational, and creative outcomes decades later” (Robertson et al., 2010). Without digging into the details of the sample involved (the Study of Mathematically Precocious Youth cohorts), it is difficult to know its limitations. As I have described elsewhere in articles about this dataset (e.g., Kessel, 2006), understanding its limitations is a job largely left to the reader.
If Boaler and Staples are to be censured for following the same reporting tradition, then so are numerous others. Boaler and Staples are, however, to be applauded for being candid about their sample and for reporting performance differences on various tests.
Additional comments that are consistent with the comments policy are welcome.
References
Bishop, W., Milgram, R., & Clopton, P. (n.d.). A close examination of Jo Boaler’s Railside report. Retrieved from ftp://math.stanford.edu/pub/papers/milgram/combinedevaluationsversion3.pdf
Boaler, J., & Staples, M. (2005). Transforming students’ lives through an equitable mathematics approach: The case of Railside School. (Paper presented at meeting of the American Educational Research Association, Montreal, Canada, April 2005.)
Boaler, J., & Staples, M. (2008). Creating mathematical futures through an equitable teaching approach: The case of Railside School. The Teachers College Record, 110(3), 608–645. Retrieved from http://ed.stanford.edu/faculty/joboaler
Gipps, C., & Murphy, P. (1994). A fair test? Assessment, achievement, and equality. Philadelphia, PA: Open University Press.
Jaschik, S. (2012, October 15). Casualty of the math wars. Inside Higher Education. Retrieved from http://www.insidehighered.com/news/2012/10/15/stanfordprofessorgoespublicattacksoverhermatheducationresearch
Kessel, C. (2006). Perceptions and research: Mathematics, gender, and the SAT. Focus, 26(9), 14–15. Retrieved from http://www.maa.org/pubs/december06focus.pdf
Robertson, K., Smeets, S., Lubinski, D., & Benbow, C. (2010). Beyond the threshold hypothesis: Even among the gifted and top math/science graduate students, cognitive abilities, vocational interests, and lifestyle preferences matter for career choice, performance, and persistence. Current Directions in Psychological Science, 19(6), 346–351. Retrieved from http://cdp.sagepub.com/content/19/6/346.abstract
Spelke, E. (2005). Sex differences in intrinsic aptitude for mathematics and science?: A critical review. American Psychologist, 60(9), 950–958. Retrieved from http://www.wjh.harvard.edu/~lds/index.html?spelke.html
You might be interested in Wayne Bishop’s two new responses to Boaler, first on the merits of her work and second on her allegations:
http://math.stanford.edu/~milgram/
http://math.stanford.edu/~milgram/JoBoalerrevealsattacksAccusationsResponsetrans.html
JDE
December 14, 2012 at 2:12 pm