Psychological Bulletin, 1988, Vol. 103, No. 1, 105-110
Copyright 1988 by the American Psychological Association, Inc. 0033-2909/88/$00.75
Significance Test or Effect Size?
Siu L. Chow*
University of Wollongong, Wollongong, New South Wales, Australia
I describe and question the argument that, in psychological research, the significance test should be replaced (or, at least, supplemented) by a more informative index (viz., effect size or statistical power) in the case of theory-corroboration experimentation. I question it because it has been made on the basis of some debatable assumptions about the rationale of scientific investigation. The rationale of theory-corroboration experimentation requires nothing more than a binary decision about the relation between two variables. This binary decision supplies the minor premise for the syllogism implicated when a theory is being tested. Some metatheoretical considerations reveal that the magnitude of the effect-size estimate is not a satisfactory alternative to the significance test.
Although the usefulness of the significance test in psychological research was questioned in the 1960s, it is still in use today, despite the fact that its critics were more numerous than its defenders (Morrison & Henkel, 1970). Nonetheless, it is under scrutiny again (Cohen, 1977). Whereas its critics in the 1960s emphasized primarily what the significance test could not do (or what was wrong with the null hypothesis), its contemporary critics exhort what two alternatives to the significance test (viz., effect size and statistical power) can do.
The significance test is used in psychology because psychologists aspire to be scientific in their endeavor. Consequently, the use of the significance test should be assessed with reference to the rationale of scientific investigation that is at a level of abstraction different from that of statistics. I will make the case that the use of the significance test is appropriate if its role in theory-corroboration investigation is made explicit. I will first recapitulate critics' reasons for advocating the two alternatives to the significance test. I will then describe and examine the assumptions on which these reasons are based. I will defend the use of the significance test accordingly.
Early Criticisms of the Significance Test
Criticisms of the use of the significance test in the 1960s argued against the null hypothesis. For example, Grant (1962) said that the null hypothesis could never be true because no theory is perfect, and Bakan (1966) said, "There is really no good reason to expect the null hypothesis to be true in any population" (p. 426). The second objection to the null hypothesis is that the choice of which hypothesis to be identified with the null hypothesis is arbitrary (Rozeboom, 1960) because null hypothesis means the hypothesis to be nullified, not necessarily a hypothesis of no difference (see Bakan, 1966, p. 424, Footnote 1).
Other criticisms are concerned with the statistical hypothesis testing procedure itself. For example, it has been said that the all-or-none decision implicated in statistical hypothesis testing is antithetical to the view that scientific knowledge generally accumulates bit by bit (Grant, 1962; Nunnally, 1960). The use of the null hypothesis testing procedure is objectionable to some investigators because it apparently favors only one hypothesis when there are actually an infinite number of alternative hypotheses (Rozeboom, 1960).
More closely related to the current misgivings about the use of the significance test are the following arguments: First, a hypothesis is a belief in something. It is obvious that the various alternative hypotheses may be accepted to different degrees. That is, an investigator is likely to assign different a priori probabilities to the various alternative hypotheses. Second, the use of the significance test leads investigators to concentrate on a point estimate (e.g., the mean is 5), whereas it may be more fruitful to ask questions about interval estimates (Bakan, 1966; Grant, 1962; Lykken, 1968; Meehl, 1967; Nunnally, 1960; Rozeboom, 1960). Finally, the whole statistical hypothesis testing approach is made suspect because the choice of the alpha level (i.e., the probability of Type I error) is arbitrary (Glass, McGaw, & Smith, 1981).
Current Critique of the Significance Test
The current critique of the use of the significance test stems from two sources, namely, (a) the observation that a linear regression analysis (which is concerned with the degree of relatedness between two variables) provides more information than the conventional analysis of variance (ANOVA) procedure (whose primary concern is whether there is a significant difference; Cohen, 1977) and (b) the contention that meta-analysis has an important role to play in scientific investigations (Glass et al., 1981; Rosenthal, 1984). This contemporary perspective may be best illustrated as follows:
Consider the outcomes of the four hypothetical studies and their associated t tests depicted in Table 1 (assuming that it is appropriate to use the independent-sample t test). In terms of the difference between the two means, Studies 1 and 4 are the same. However, the t test is significant in Study 1 but not in Study 4. The difference between the two means in Study 3 is larger than that in Study 1. Yet the t test is significant in Study 1 but not in Study 3. Studies 3 and 4 each have fewer subjects than does Study 1. This procedural difference suggests that whether a test is significant or not depends on the number of subjects used. Moreover, the number of subjects used in an experiment is arbitrary. Consequently, one should have reservations about the methodological contribution of the significance test to scientific investigations. This difficulty may be called the "sample-size problem."
Table 1
Hypothetical Outcomes of Four Hypothetical Studies

Study   M_{1}   M_{2}   M_{1} - M_{2}   df   t test significant
1       5       4       1               20   Yes
2       12      2       10              20   Yes
3       6       2       4               5    No
4       5       4       1               5    No

Note. M_{1} = mean of experimental condition; M_{2} = mean of control condition; M_{1} - M_{2} = difference between M_{1} and M_{2}.
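The sample-size pattern in Table 1 can be reproduced with a short sketch. Table 1 reports only means and degrees of freedom, so the group sizes and pooled standard deviations below are hypothetical values chosen to match the df column; the critical values are the usual two-tailed .05 entries from a t table.

```python
from math import sqrt

# Two-tailed .05 critical values of t from a standard t table.
T_CRIT_05 = {5: 2.571, 20: 2.086}


def pooled_t(mean_diff, pooled_sd, n1, n2):
    """Independent-sample t statistic for a given difference between means."""
    standard_error = pooled_sd * sqrt(1 / n1 + 1 / n2)
    return mean_diff / standard_error


def is_significant(mean_diff, pooled_sd, n1, n2):
    """Binary decision: does |t| reach the two-tailed .05 critical value?"""
    df = n1 + n2 - 2
    return abs(pooled_t(mean_diff, pooled_sd, n1, n2)) >= T_CRIT_05[df]


# Assumed (not from the article): group sizes and pooled SDs.
study_1 = is_significant(1, 1.0, 11, 11)  # difference of 1, df = 20
study_3 = is_significant(4, 2.5, 4, 3)    # difference of 4, df = 5
study_4 = is_significant(1, 1.0, 4, 3)    # difference of 1, df = 5

print(study_1, study_3, study_4)
```

Under these assumed values, Studies 1 and 4 share the same difference and the same variability, and only the number of subjects changes the decision.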
Although the test is significant in Studies 1 and 2, the difference is considerably larger in Study 2 than in Study 1. This valuable information is not being used, however, if the investigators consider only whether the tests are significant. This is particularly serious in assessing the effectiveness of a particular applied program, such as the study of teachers' expectancy effects (Harris & Rosenthal, 1985) or the assessment of the effectiveness of a psychotherapeutic program (Fiske, 1983; Rosenthal, 1983; Strube & Hartman, 1982, 1983). By the same token, although the test is not significant in either Study 3 or 4, the magnitude of the difference is larger in Study 3 than in Study 4. Again, this valuable information is lost if the decision is simply not to reject the null hypothesis in Studies 3 and 4. This difficulty may be called the "effect-size problem."
There is another way of stating the effect-size problem. Consider Studies 1 and 3 again. The magnitude of the difference between the two means is smaller in Study 1 than in Study 3. Yet the null hypothesis is rejected in Study 1 but retained in Study 3. Reliance on the significance test may lead one to accept an effect of a trivial magnitude as well as (or even instead of) an effect of a larger magnitude.
The effect-size problem has also been presented as follows: The general practice of using the significance test is nothing more than an explicit commitment to a particular level of Type I error (i.e., the probability of wrongly rejecting a true null hypothesis). The general emphasis (but an undue one, according to the critics of the use of the significance test) on Type I error leads to a neglect of Type II error (i.e., the probability of accepting a wrong null hypothesis; Cohen, 1977). Because Type II error is inversely proportional to the extent to which the null and the alternative hypotheses overlap, the complement of Type II error reflects the probability that a true alternative hypothesis is accepted as such (i.e., the power of the statistical test; see Cohen, 1977). Rosenthal (1984) shares this view.
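The relation between Type II error and power mentioned above can be illustrated for the simplest case. The sketch below assumes a one-tailed z test with a known population standard deviation, which is a simplification rather than the t test of Table 1; the function name and example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_one_tailed_z(true_diff, sigma, n, alpha=0.05):
    """Power of a one-tailed z test, i.e., 1 minus the Type II error rate."""
    std_normal = NormalDist()
    # Cutoff fixed by the chosen Type I error rate.
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Distance between the null and alternative means in standard-error units.
    shift = true_diff / (sigma / sqrt(n))
    # Probability of staying below the cutoff although the alternative is true.
    type_ii_error = std_normal.cdf(z_crit - shift)
    return 1 - type_ii_error


# Assumed values: true difference of 0.5 SD, 25 subjects.
print(round(power_one_tailed_z(0.5, 1.0, 25), 2))
```

When the true difference is zero, the function returns the alpha level itself, which is the sense in which a fixed alpha is a commitment to a Type I error rate only.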
Some investigators are concerned with the practical significance of experimental results (e.g., Rosenthal, 1983). The statistical significance of a set of data is not informative about the practical importance (or substantive significance) of the findings. It has been suggested, however, that an index of substantive significance can be derived from an effect-size estimate (Harris & Rosenthal, 1985; Rosenthal, 1983; Rosenthal & Rubin, 1979, 1982). The fact that a significance test does not have any implication on the substantive significance of experimental outcomes may be called the "substantive-significance problem."
Alternatives to the Significance Test
To the critics of the significance test, the sample-size, effect-size, and substantive-significance problems can be resolved by appealing to the power of the statistical test or the size of the experimental effect. At the mathematical level, Cohen and Cohen (1983) showed that whatever can be achieved by an ANOVA can be achieved by a linear regression analysis. Moreover, an estimate of the power of a test may be obtained by considering the proportion of variance accounted for by the variable of interest. Consequently, instead of rejecting or accepting the null hypothesis, experimental results may be ranked in terms of the amount of variance accounted for by the independent variable involved. More specifically, statistical tests showing that the independent variable accounts for 20, 50, and 80% of the variance may be considered tests of low, medium, and high statistical power, respectively (Cohen, 1977). That is, instead of receiving only a reject-or-accept answer from a statistical analysis, an investigator may gain additional information.
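For the two-group case, the proportion of variance accounted for can be recovered from a t value and its degrees of freedom with the standard identity r^2 = t^2 / (t^2 + df). The sketch below is illustrative only; the function name and the example t values are assumptions, not figures from the article.

```python
def variance_accounted_for(t_value, df):
    """Proportion of variance accounted for in a two-group comparison."""
    t_squared = t_value ** 2
    return t_squared / (t_squared + df)


# Assumed t values for two hypothetical studies with different df.
larger_study = variance_accounted_for(2.345, 20)
smaller_study = variance_accounted_for(1.309, 5)

print(round(larger_study, 3), round(smaller_study, 3))
```

With these assumed inputs, studies can be ranked on a continuous index instead of being sorted into significant and nonsignificant outcomes, which is the additional information the critics have in mind.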
Some investigators who subscribe to the notion of meta-analysis also advocate obtaining an effect-size estimate for every experiment (see Glass et al., 1981; Rosenthal, 1984). The advantages of appealing to effect size are twofold in this view. First, it enables the investigators to quantitatively compare the outcomes of two or more studies. At the level of applied research, this facility makes it possible to assess the practical importance of an experimental effect (Harris & Rosenthal, 1985). That is, the intuitive anomaly of the picture presented jointly by Studies 1 and 3 in Table 1 may then be resolved. The second advantage of dealing with effect size is that it enables meta-analysts to obtain a numerical average for a set of experiments (Glass et al., 1981; Harris & Rosenthal, 1985; Rosenthal, 1984).
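The two advantages claimed for effect size (comparison across studies and a numerical average) can be sketched as follows. The standardized mean difference is used here as one common effect-size index; the pooled standard deviations are hypothetical, because Table 1 does not report them, and the unweighted mean is only the simplest way of averaging.

```python
from statistics import mean


def standardized_difference(m1, m2, pooled_sd):
    """Difference between two means expressed in pooled-SD units."""
    return (m1 - m2) / pooled_sd


# (M1, M2, assumed pooled SD) for the four studies of Table 1.
studies = [(5, 4, 1.0), (12, 2, 4.0), (6, 2, 2.5), (5, 4, 1.0)]

effect_sizes = [standardized_difference(*study) for study in studies]
average_effect = mean(effect_sizes)

print(effect_sizes, average_effect)
```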
Role of Statistical Analysis in Descriptive Research
The sample-size and effect-size problems are both concerned with the role of statistical analysis in scientific investigation. Consequently, the purpose and the rationale of experimentation must be taken into account when the use of the significance test is being evaluated.
There are two types of experimental investigation, namely, descriptive and theory corroborative. In the case of descriptive investigation, the objective is to have an estimate of a parameter of a population of interest on the basis of what can be known about a sample of a certain size randomly chosen from the population. It is descriptive in the sense that the concern is whether there is an effect or what the magnitude of the effect is but not why there is the effect. For this descriptive purpose, an interval estimate is definitely superior to a point estimate. Moreover, the availability of a well-defined and properly derived estimate of effect size is more informative than the mere knowledge that a statistical test is significant. The sample-size problem is no longer an issue because the effect of the sample size is reflected in the interval estimate. More specifically, smaller samples give larger interval estimates.
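The point that smaller samples give larger interval estimates can be shown with a 95% interval for a difference between two means. As before, the group sizes and the pooled standard deviation are hypothetical, and the critical values are taken from a standard t table.

```python
from math import sqrt

# Two-tailed .05 critical values of t from a standard t table.
T_CRIT_05 = {5: 2.571, 20: 2.086}


def interval_estimate(mean_diff, pooled_sd, n1, n2):
    """95% interval estimate for the difference between two means."""
    df = n1 + n2 - 2
    half_width = T_CRIT_05[df] * pooled_sd * sqrt(1 / n1 + 1 / n2)
    return (mean_diff - half_width, mean_diff + half_width)


# Same observed difference and assumed SD, different numbers of subjects.
larger_sample = interval_estimate(1, 1.0, 11, 11)  # df = 20
smaller_sample = interval_estimate(1, 1.0, 4, 3)   # df = 5

print(larger_sample, smaller_sample)
```

Under these assumptions, the interval from the smaller sample is more than twice as wide and includes zero, so the influence of sample size is visible in the estimate itself.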
The substantive-significance issue arises for two reasons, only one of which is relevant to the use of statistics. It becomes relevant only if the treatment of interest (called "substantive treatment," e.g., a particular drug, A) is used as the experimental manipulation. An example par excellence of this situation is early experimentation in agricultural research. The substantive question was whether a particular fertilizer (or a certain type of soil or seed) would give a better yield. The experimental manipulation was the application of the fertilizer in question or the choice of the type of soil (or seed) under investigation. That is, the substantive treatment was used as the experimental manipulation. Another way of putting this is that the investigator was interested in the experimental question for its own sake. This practice of not differentiating between the substantive treatment and the experimental manipulation may be called the "agricultural model" of science (see also Hogben, 1957; Meehl, 1978).
The null hypothesis testing procedure in statistics was developed with the agricultural model as the prototype of scientific investigation (Hogben, 1957; Meehl, 1978; Mook, 1983).^{1} As a result of identifying the experimental question with the substantive question, the null hypothesis testing procedure in statistics became indistinguishable from the procedure of testing a substantive theory in the agricultural model. It is not unreasonable, then, to give the effect-size estimate a substantive meaning under these circumstances. When these metatheoretical assumptions are made, the effect-size and substantive-significance problems are indeed shortcomings of using the significance test. The following two questions are crucial, however, and have not been given proper consideration: (a) Is the agricultural model the appropriate one for the bulk of psychological and educational research?^{2} (b) Do the effect-size and substantive-significance problems arise in theory-corroboration experimentation?
Theory-Corroboration Experimentation
Many experiments are conducted in psychology to corroborate explanatory theories. That is, they are concerned with the tenability of certain hypothetical mechanisms that enable an investigator to answer why certain things happen the way they do. As I will show later, the investigator is not interested in the experimental question for its own sake (i.e., the question about the relation between the independent and dependent variables per se). The effect-size and substantive-significance problems assume a different complexion when theory-corroboration experimentation is being considered. This statement can be best illustrated by considering the role of statistical analyses in theory-corroboration experimentation. The latter cannot be described without first considering the rationale of theory-corroboration experimentation. This rationale can be described by referring to Table 2.
Table 2
Two Syllogisms Showing the Relations Among One Implication of a Theory, the Experimental Outcome, and the Permissible Conclusion

                          Modus tollens                     Affirming consequent
Theory                    T_{1}                             T_{1}
Implication               I_{11}                            I_{11}
Major premise             If A.I_{11}, then X under EFG.    If A.I_{11}, then X under EFG.
Minor premise             D is dissimilar to X.             D is similar to X.
Experimental conclusion   A.I_{11} is false.                A.I_{11} is probably true.

Note. T_{1} = theory of interest; I_{11} = one implication of T_{1}; EFG = control and independent variables of the experiment; X = experimental expectation; A = set of auxiliary assumptions underlying the experiment; D = experimental outcomes (i.e., the pattern shown by the dependent variable in various conditions of the experiment).
Psychologists have to theorize (e.g., proposing a theory, T_{1}) when they are confronted with a phenomenon that is not readily accounted for in terms of existent knowledge. At the same time, more than one potentially successful theory may be proposed to account for a phenomenon. These theories appeal to different unobservable hypothetical mechanisms. The task for the psychologist is to choose among these rival hypothetical mechanisms (which are unobservable) in an objective way.
The hypothetical mechanism implicated in a theory often cannot be tested directly. The necessary condition, however, for a theory's being good is that it leads to testable implications. A theory is tested by means of one or more of its implications (e.g., implication I_{11} of theory T_{1}). I_{11}, in turn, specifies what should happen in a specific situation by virtue of the theoretical properties of the hypothetical mechanism in question. This theoretical specification (expectation or prediction) is the experimental hypothesis, which is represented by the following statement: Observation D should be like X under conditions EFG by virtue of I_{11}. Strictly speaking, no experiment is ever conducted in the absence of some auxiliary assumptions (Cohen & Nagel, 1934; Meehl, 1978). Hence, the relations among (a) the theory in question (T_{1}), (b) one of its implications (I_{11}), (c) the experimental setup (E, F, and G), and (d) the experimental expectation (X) are represented by the following two conditional propositions: (a) If T_{1}, then I_{11}. (b) If A.I_{11}, then D should be like X under EFG. Data (D) from the experiment either conform to the pattern prescribed by X, or they do not. That is, either D is similar to X, or D is dissimilar to X.
An important departure from the agricultural model may be seen in a summary description of Table 2. Although the investigator is interested in theory T_{1}, the theory is tested by means of one of its implications (viz., I_{11}). Moreover, the experimental question is one of what happens under conditions E, F, and G. Unlike subscribers to the agricultural model, the investigator is not interested in the experimental question for its own sake.
If the experimental expectation is not met by the experimental outcomes (i.e., D is dissimilar to X), Implication I_{11} is refuted, thereby refuting the substantive theory, T_{1} (see the modus tollens paradigm in Table 2). If the experimental outcomes conform to the experimental expectation (i.e., D is similar to X), the experimental hypothesis, I_{11}, is not rejected (see the "affirming consequent" paradigm in Table 2), thereby adding credibility to the substantive theory, T_{1}, because it has withstood a deliberate attempt to refute it. As may be seen from the affirming consequent paradigm in Table 2, the permissible conclusion is that A.I_{11} is probably true. Hence, following Popper (1968), the investigator has corroborated but not proven theory T_{1}.
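The asymmetry between the two paradigms of Table 2 (a refutation is conclusive, a corroboration is only "probably true") can be checked with a small truth-table sketch. Here `h` stands for the antecedent A.I_{11} and `x` for the statement that D is like X; the helper names are illustrative assumptions.

```python
from itertools import product


def implies(p, q):
    """Material conditional: 'if p, then q'."""
    return (not p) or q


def is_valid(premises, conclusion):
    """Valid if no truth assignment makes all premises true and the conclusion false."""
    for h, x in product([True, False], repeat=2):
        if all(premise(h, x) for premise in premises) and not conclusion(h, x):
            return False
    return True


def major_premise(h, x):
    return implies(h, x)


# Modus tollens: if h then x; not x; therefore not h.
modus_tollens = is_valid([major_premise, lambda h, x: not x], lambda h, x: not h)

# Affirming the consequent: if h then x; x; therefore h.
affirming_consequent = is_valid([major_premise, lambda h, x: x], lambda h, x: h)

print(modus_tollens, affirming_consequent)
```

Modus tollens passes the check and affirming the consequent does not, which is why a confirmed expectation licenses only the weaker conclusion that A.I_{11} is probably true.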
Role of Statistics in Theory-Corroboration Experimentation
The experimental expectation, X, is an explicit statement about the different ways in which the underlying mechanism should exemplify itself in the experimental and control conditions if the theory, T_{1}, is true. It is usually in the form of a statement of an ordinal relation such as the statement that the performance under the experimental condition is superior (or inferior) to that under the control condition. This expectation of an ordinal relation becomes the statistical alternative hypothesis (H_{1}). On the rare occasions when normative data are available (e.g., the mean IQ on the Wechsler Intelligence Scale for Children is 100 for normal children), the experimental expectation is still expressed in the form of a statement of an ordinal relation, as may be seen from the use of the one-sample-case t test or the use of Fisher's Z transformation when the expected correlation coefficient is not zero (Edwards, 1976).
The concern of a statistical analysis is with the experimental outcomes, D (e.g., whether there is any difference between the experimental and control conditions or whether the functional relation between the independent and dependent variables is a linear one). Even if there is really no difference between the experimental and control populations, the actual difference obtained between the two corresponding samples may nonetheless be numerically different from zero because of human errors, instrumental failures, and other unexpected momentary influences unrelated to the experimental manipulation. In other words, a binary decision is to be made about D with respect to two mutually exclusive and exhaustive alternatives, namely, a real difference or (in the exclusive sense) a chance variation. The practical problem is how to choose between the two mutually exclusive alternatives.
The mathematical solution is to base the binary decision on the probability of obtaining various deviations from zero if there is only chance variation (i.e., the null hypothesis, H_{0}). The convention is that if the probability associated with an observed difference is larger than .05, the difference is ignored (i.e., considered a deviation due to chance variation). This is a rule in statistical decision that is independent of the specific nature of either the substantive or experimental hypothesis. The null hypothesis (H_{0}) is the logical complement of the experimental expectation that is identified as the statistical alternative hypothesis (H_{1}).
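Chance variation of the kind described above can be simulated. The sketch below draws both conditions from the same population, so any observed difference is due to chance alone, and counts how often the standardized difference lies beyond the conventional two-tailed .05 cutoff. The known-variance z statistic, the group size, the number of replications, and the seed are all simplifying assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def proportion_beyond_cutoff(n_per_group=20, replications=4000, alpha=0.05, seed=1):
    """Share of chance-only differences that fall beyond the two-tailed cutoff."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference when the population SD is 1.
    standard_error = sqrt(2 / n_per_group)
    beyond = 0
    for _ in range(replications):
        experimental = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(experimental) / n_per_group - sum(control) / n_per_group
        if abs(diff / standard_error) >= z_crit:
            beyond += 1
    return beyond / replications


print(proportion_beyond_cutoff())
```

The sample differences are almost never exactly zero, yet only a small share of them (close to the chosen alpha level) lies beyond the cutoff.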
Characteristics of the statistical decision are as follows: First, the magnitude of the experimental effect is treated in a binary manner even though it numerically is a continuous variable. That is, the magnitude is relevant only in deciding whether the result falls within the region of rejection (of the null hypothesis) or without. The magnitude of the effect size is irrelevant once the region is determined. Hence, it is misleading to make statements like "Although the effect is statistically significant, it is nonetheless very small." This statement is misleading because it misrepresents the binary nature of the statistical decision.
Second, by itself, such a statistical decision does not (and cannot) say anything about the truth of the substantive theory because they belong to different domains. As may be seen from Table 2, although the statistical decision about D has implications for which experimental and theoretical conclusions to draw, it is not the same as either of the two conclusions.
As may be seen from the foregoing discussion and Table 2, the task of a statistical analysis is to supply the investigator with the minor premise of the syllogism used in relating (a) the theoretical expectation of the theory, (b) the outcomes of an experiment designed to test it, and (c) the theoretical conclusion permissible. Moreover, the theoretical expectation is a qualitative one (viz., Is D like X?), not a quantitative one (How unlike X is D?). Finally, the tenability of the theory under investigation is determined by the syllogistic argument in toto, not by the statistical decision (see also Tukey, 1960).
It has been shown that the role of a statistical analysis in the context of theory-corroboration experimentation is to supply the investigator with the minor premise for the syllogistic argument. Could this role be filled better by an alternative index (e.g., effect size)? As has been shown, all that is required of a statistical analysis is a binary decision. This is the case because the validity of the syllogistic argument requires only that information. Even if a quantitatively more informative index is available (e.g., effect size, the amount of variance accounted for, or the power of the test), it will still be used in a binary manner. That is, nothing is gained by using an effect-size estimate in this context.
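The claim that any index is ultimately used in a binary manner can be made concrete with a minimal sketch. It assumes a directional expectation (experimental condition superior to control) tested one-tailed at the .05 level with 20 degrees of freedom; the function name and the example t values are hypothetical.

```python
# One-tailed .05 critical value of t for df = 20, from a standard t table.
T_CRIT_ONE_TAILED_DF20 = 1.725


def minor_premise(t_value, t_crit=T_CRIT_ONE_TAILED_DF20):
    """Reduce a continuous test statistic to the statement used in the syllogism."""
    if t_value >= t_crit:
        return "D is similar to X"
    return "D is dissimilar to X"


# Two hypothetical experiments with very different t values.
modest_effect = minor_premise(2.3)
large_effect = minor_premise(9.0)
no_effect = minor_premise(0.4)

print(modest_effect, large_effect, no_effect)
```

Both significant results supply the same minor premise, so the syllogism of Table 2 proceeds identically for the modest and the large effect.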
The use of the significance test has been criticized because the choice of the alpha level is arbitrary (Rozeboom, 1960). However, this criticism can be directed to the alternative criteria suggested. For example, Cohen (1977) acknowledged that his index of statistical power was also an arbitrary one. The question here is not whether the criterion is arbitrary or not. The issues should be (a) whether the criterion is well-defined and (b) whether the criterion would mislead its users. It can be argued that the use of the significance test is more satisfactory with regard to the latter issue.
Suppose two experiments were conducted to test theory T. Further suppose that the results were significant in both studies. However, the size of the effect of Experiment 1 was larger than its counterpart in Experiment 2. Does this mean that Experiment 1 lends more support to theory T than Experiment 2 does? The answer is no because the corroboration of a theory depends on the argument form in toto depicted in the affirming consequent paradigm in Table 2, not on the magnitude of an effect-size estimate. The paradigm ensures that the theoretical conclusion follows logically from its premises, and logical validity is an all-or-none property of an argument.
As may be seen from Table 2, the experimental hypothesis (i.e., the statistical alternative hypothesis, H_{1}) is not the same as the theoretical implication, I_{11}, of the theory of interest. More important, the statistical hypothesis testing procedure itself is not the theory-corroboration procedure. It is only one step, albeit a very important one, in the syllogistic argument implicated in testing a theory. For this reason, it is important to note that the experimental manipulation may be remote from the theoretical property of the underlying mechanism (see Meehl, 1978). Consequently, the magnitude of the effect size in an experiment is not necessarily a quantitative index of a theoretical property of the underlying mechanism. An investigator may be misled, however, to think otherwise if the emphasis is on the magnitude of the effect-size estimate.
Some Criticisms of Significance Revisited
Some of the early criticisms of the use of the significance test in the context of theory-corroboration experimentation may now be considered. First, consider the view that the null hypothesis is never true (Bakan, 1966; Grant, 1962; Lykken, 1968; Nunnally, 1960). The null hypothesis is used as the antecedent of a conditional proposition when it is used in testing an experimental hypothesis. That is, the statement used is "If the null hypothesis is true, then the probability associated with a t value of such and such is so and so." Any proposition, regardless of its truth value, can be used as the antecedent of a conditional proposition (Copi, 1965). Consequently, whether the state of affairs described in a null hypothesis occurs or not is immaterial to the use of the significance test.
Only one statistical alternative hypothesis is being considered in making a statistical decision. An investigator may believe in several alternative hypotheses, albeit to various degrees. Does it not mean that the use of the significance test is incompatible with the spirit of scientific investigation?
It is necessary to distinguish between (a) a statistical alternative hypothesis (H_{1}) and (b) a theoretical alternative hypothesis (e.g., T_{2}). Consider Table 2 again: X is the statistical alternative hypothesis in the statistical decision used to test theory T_{1}. A theoretical alternative hypothesis would be a rival theory to T_{1}, for example, T_{2}. What the critics may be suggesting is that T_{1} is not the only substantive theory. The criticism then becomes a question of whether rival theories are being ignored when using the significance test. There are two responses to this concern.
First, psychologists often contrast two rival theories in the same experiment (e.g., Reitman's, 1971, 1974, studies of forgetting in short-term memory). That is, the number of substantive theories tested in an experiment is a question of experimental design and of ingenuity. It is not limited by the binary nature of the statistical decision procedure. Second, the theory-corroboration procedure does not stop at the completion of an experiment. Rival theories may be tested in a series of related experiments. The use of the significance test is not incompatible with this strategy.
Similarly, the binary nature of the statistical decision is not incompatible with the fact that scientific knowledge accumulates bit by bit. Again consider Table 2. Theory T_{1} has more than one theoretical implication. Given that the conclusion permissible by the affirming consequent paradigm is the statement "Theory T_{1} is probably true," theory T_{1} should be tested further. This is done by deriving further implications, such as I_{12}, I_{13}, and so on. Each implication leads to a unique experimental expectation. Separate experiments are then designed to test these expectations. These tests constitute the "converging operations" for corroborating theory T_{1} (Garner, Hake, & Eriksen, 1956). Understanding of the phenomenon grows in this incremental manner. The important point is that the binary statistical decision procedure is implicated at every stage of this incremental growth of the theory of interest.
Finally, the sample-size problem should be considered. It is true that a statistical test will be significant if a large enough sample is used. On the other hand, any test may be made nonsignificant if too few subjects are tested. Does it follow that the use of the significance test is inappropriate? Binder (1963) gave a cogent answer:
[The sample size problem was] a particular form of the more general argument against bad experimentation. It is unquestionably the case that an . . . experiment that is too small and insensitive is poor, but the poorness is a property of the insensitivity and not of the . . . procedure. (p. 112)
That is, that the significance test may sometimes be misused is an insufficient reason to abandon the method. The sample-size problem is really one about the statistical conclusion validity (Cook & Campbell, 1979) of an experiment. It arises because the experiment has not been conducted in a way that satisfies the assumptions of a statistical analysis. Under such circumstances, an alternative to the significance test (be it an effect-size estimate or an interval estimate) is also not meaningful because something is fundamentally wrong with the experiment. An appeal for an alternative to the use of the significance test may, in fact, be misleading because attention is directed away from the design problems of the experiment. That is, a statistical solution is not appropriate here because it is not a statistical problem. The proper solution is to rectify the shortcomings of the original study and conduct the experiment again.
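The dependence of significance on the number of subjects, which underlies the sample-size problem discussed above, can be quantified with a rough sketch. It uses a large-sample z approximation for two equal groups and assumes that the observed difference equals the stated difference exactly; the function name and example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_for_significance(mean_diff, sd, alpha=0.05):
    """Smallest group size at which a given difference reaches two-tailed significance."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Solve mean_diff / (sd * sqrt(2 / n)) >= z_crit for n.
    return ceil(2 * (z_crit * sd / mean_diff) ** 2)


# Assumed differences, expressed in standard-deviation units.
n_for_large_difference = n_per_group_for_significance(1.0, 1.0)
n_for_tiny_difference = n_per_group_for_significance(0.01, 1.0)

print(n_for_large_difference, n_for_tiny_difference)
```

Under this approximation, a difference of one hundredth of a standard deviation requires tens of thousands of subjects per group, which illustrates Binder's point that sensitivity is a property of the experiment rather than of the test procedure.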
Implication for Meta-Analysis
One putative advantage of the meta-analytic approach to literature review and integration is that the meta-analyst can pool the results of many studies (Glass et al., 1981; Rosenthal, 1984). The suggested candidate for pooling is the magnitude of the effect size. This putative advantage is apparent only for the following reasons:
First, as has been mentioned before, the experimental hypothesis is not the same as the theoretical implication of interest. Hence, the magnitude of an experimental effect does not necessarily reflect a theoretical property in a quantitative way. Second, I have just shown that an explanatory theory should be tested systematically by means of its diverse implications with separate experiments that differ in some theoretically significant ways. (That is, a theory is strengthened not by mere literal replications of the same experiment but by a series of converging operations.) In other words, the outcomes of a set of theory-corroboration experiments may be too theoretically dissimilar for pooling. This may be the case even though all the experiments involve the same theory.
Summary and Conclusion
Statistics is an important tool for experimentation. Underlying this tool is a view of science in which the substantive question is not differentiated from the experimental question. Consequently, users of statistics may easily overlook that the rationale of the statistical hypothesis testing procedure should not be identified with that of the procedure used in corroborating an explanatory theory. In the context of theory-corroboration experimentation, a statistical analysis is necessary only because it enables one to make a statistical decision of a binary nature. This function is fulfilled satisfactorily by the use of a significance test. It is preferred to the magnitude of an effect-size estimate because it is less likely to mislead its users. Some early criticisms of the use of the significance test have been responded to by (a) distinguishing between a statistical alternative hypothesis and a theoretical rival hypothesis and (b) showing that the binary statistical decision is compatible with the incremental growth of scientific knowledge.
References
Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423-437.
Binder, A. (1963). Further considerations on testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review, 70, 107-115.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Academic Press.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, M. R., & Nagel, E. (1934). An introduction to logic and scientific method. London: Routledge & Kegan Paul.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field studies. Chicago: Rand McNally.
Copi, I. M. (1965). Symbolic logic (2nd ed.). New York: Macmillan.
Edwards, A. L. (1976). An introduction to linear regression and correlation. San Francisco: Freeman.
Fiske, D. W. (1983). The meta-analysis revolution in outcome research. Journal of Consulting and Clinical Psychology, 51, 65-70.
Garner, W. R., Hake, H. W., & Eriksen, C. W. (1956). Operationism and the concept of perception. Psychological Review, 63, 149-159.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage Publications.
Grant, D. A. (1962). Testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review, 69, 54-61.
Harris, M. J., & Rosenthal, R. (1985). Mediation of interpersonal expectancy effects: 31 meta-analyses. Psychological Bulletin, 97, 363-386.
Hogben, L. (1957). Statistical theory. London: Allen and Unwin.
Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70, 151-159.
Meehl, P. E. (1967). Theory testing in psychology and in physics: A methodological paradox. Philosophy of Science, 34, 103-115.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Mook, D. G. (1983). In defense of external invalidity. American Psychologist, 38, 379-387.
Morrison, D. E., & Henkel, R. E. (Eds.). (1970). The significance test controversy: A reader. Chicago: Aldine.
Myers, J. L. (1979). Fundamentals of experimental design (3rd ed.). Boston: Allyn & Bacon.
Nunnally, J. (1960). The place of statistics in psychology. Educational and Psychological Measurement, 20, 641-650.
Popper, K. R. (1968). Conjectures and refutations. New York: Harper & Row. (Original work published 1962)
Reitman, J. S. (1971). Mechanisms of forgetting in short-term memory. Cognitive Psychology, 2, 185-195.
Reitman, J. S. (1974). Without surreptitious rehearsal, information in short-term memory decays. Journal of Verbal Learning and Verbal Behavior, 13, 365-377.
Rosenthal, R. (1983). Assessing the statistical and social importance of the effects of psychotherapy. Journal of Consulting and Clinical Psychology, 51, 4-13.
Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage Publications.
Rosenthal, R., & Rubin, D. B. (1979). A note on percent variance explained as a measure of the importance of effects. Journal of Applied Social Psychology, 9, 395-396.
Rosenthal, R., & Rubin, D. B. (1982). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Sternberg, S. (1969). Memory-scanning: Mental processes revealed by reaction time. American Scientist, 57, 421-457.
Strube, M. J., & Hartman, D. P. (1982). A critical appraisal of meta-analysis. British Journal of Clinical Psychology, 21, 129-139.
Strube, M. J., & Hartman, D. P. (1983). Meta-analysis: Techniques, applications, and functions. Journal of Consulting and Clinical Psychology, 51, 14-27.
Tukey, J. W. (1960). Conclusions vs. decisions. Technometrics, 2, 1-11.
*I did this research while spending my sabbatical leave at the Department of Psychology, University of Alberta. I wish to thank Vincent Di Lollo and the Department of Psychology of the University of Alberta for their hospitality. Thanks are also due Don Mixon and William Rozeboom for their helpful comments. I am grateful to an anonymous reviewer who pointed out a factual error I made in an earlier draft of this article.
Correspondence concerning this article should be addressed to Siu L. Chow, Department of Psychology, University of Wollongong, P.O. Box 1144, Wollongong, New South Wales, Australia, 2500.
Endnotes
1. This historical origin may be responsible for the fact that the majority of examples given in introductory statistics textbooks follow the agricultural model as defined here. For example, Myers (1979) characterized the aim of psychological experimentation as an attempt "to determine what factors influence a certain behavior, and the extent and direction of the influence. We seek answers to such questions as: What are the relative effects of these three drugs on the number of errors made in learning a maze? Which of these three training methods is most effective? What changes in auditory acuity occur as a function of certain changes in sound intensity?" (p. 1). Although these questions are good examples to use in introducing statistical concepts and computational procedures, they may be misleading about the function of psychological experimentation. For example, they may give the impression that psychologists necessarily ask those questions for their own sake.
2. I do not mean that the kinds of questions described in Footnote 1 should not be asked. Rather, the issue is whether these questions are asked for their own sake in psychological experimentation. For example, Sternberg (1969) assessed his subjects' correct reaction times under several memory-load conditions. He was not interested, however, in the effect of the variation in memory load on his subjects' performance per se. Rather, he was interested in the manner in which memory search was conducted. He used the effect of memory load to determine whether the search process was serial exhaustive, serial self-terminating, or parallel.
Received June 16, 1986
Revision received March 17, 1987
Accepted March 17, 1987
Erratum: The phrase "as small as or smaller than .05" is incorrect. It should read "larger than .05."