Author's Response

**The Popperian framework, statistical significance, and rejection of chance**

**Siu L. Chow**

*Department of Psychology, University of Regina, Regina, Saskatchewan, Canada S4S 0A2.*
**siu.chow@uregina.ca uregina.ca/chowsl/**

Abstract: That **Haig** and **Sohn** find the hypothetico-deductive approach wanting in different ways shows that multiple conditional syllogisms are being used in different stages of theory corroboration in the Popperian approach. The issues raised in the two commentaries assume a different complexion when certain distinctions are made.

Separate conditional syllogisms at different levels of abstraction are being used to justify the rejection of chance in significance tests and to corroborate theories. **Haig's** concerns are met with a discussion of the nature of psychometric instruments, the incommensurability problem of meta-analysis, and the circularity of abductive conclusions. Simulation data are used to answer **Sohn's** critiques by showing that (a) the alpha level is not meant to be applied to a set of *t* tests, and (b) statistical significance is dependent on neither effect size nor sample size.

It is important to calibrate an instrument if it is used to obtain exact measurements (e.g., a clock). Psychologists do not use significance tests for calibration purposes because psychometric instruments provide relative, not absolute, measurements. For example, a WISC-R score of 115 indicates, not how intelligent an individual is, but that the individual is better than 84.13% of the norm group. The acceptability of a psychometric instrument depends on its validity, not the sort of precision monitored by calibration.
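The 84.13% figure follows from the standard norming of Wechsler scores (mean 100, standard deviation 15, so 115 lies exactly one standard deviation above the mean). A quick check of the arithmetic:

```python
from statistics import NormalDist

# WISC-R scores are normed to a mean of 100 and a standard deviation of 15,
# so a score of 115 sits exactly one standard deviation above the mean.
wisc_r = NormalDist(mu=100, sigma=15)
percentile = wisc_r.cdf(115) * 100  # proportion of the norm group below 115
print(round(percentile, 2))  # 84.13
```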

The 12 studies described in Chow's (1996) Table 5.5 belong to the same domain. To conduct a meta-analysis on them is to obtain the average of the diverse effects of the qualitatively different independent variables. The result is conceptually anomalous because it is not theoretically meaningful to mix apples and oranges. As no valid conclusion can be drawn from meta-analysis, it cannot be used to discover new phenomena.

**Haig's** concerns about theory discovery and the interrelationships among phenomenon, theory, and evidential data have been anticipated in Sections 3.2.1 (pp. 46-47) and 3.7 (pp. 63-64) in Chow (1996). Given the "phenomenon-hypothesis-evidential data" sequence (Chow 1996, pp. 46 & 63), the theory is necessarily an *ad hoc* postulation *vis-à-vis* the to-be-explained phenomenon. It is circular for Haig to assert that "phenomena provide the evidence for theories." Hence, theories obtained by abduction, like theories established by any other means, have to be corroborated. A series of three embedding conditional syllogisms is used when corroborating theories, including abductively established ones (Chow 1996, Table 4.2, p. 70).

The null hypothesis (H_{0}) is used
in significance tests as the antecedent and the consequent of
two conditional propositions (Chow 1996, p. 32) as follows:

[Proposition 1]: *If* the research manipulation is not efficacious (i.e., only chance influences are assumed), *then* H_{0}.

[Proposition 2]: *If* H_{0}, *then* the mean difference of the sampling distribution of differences is zero.

This practice of emphasizing that H_{0} is an implication of the chance hypothesis, not the chance hypothesis itself, will henceforth be called the formal approach. **Sohn's** treatment of H_{0} as the chance hypothesis is acceptable as a casual way to express the transitive relationship between Propositions 1 and 2. It is subsequently called the vernacular approach.

There are important differences between the formal and vernacular stances. For example, Proposition 2 is true only if H_{0} is the zero-null (i.e., H_{0}: *u*_{1} = *u*_{2}). If H_{0} is a point-null (e.g., *u*_{1} - *u*_{2} = 5), Proposition 2 is replaced by Proposition 3:

[Proposition 3]: *If* H_{0}: *u*_{1} - *u*_{2} = 5, *then* the mean difference of the sampling distribution of differences is 5.

Not making the distinction between the formal and vernacular approaches is responsible for some of the issues raised by **Sohn**.

If H_{0} were the chance hypothesis, "significant" and "rejecting the chance hypothesis" would become synonymous characterizations. The question about justification becomes moot because two synonymous expressions do not (and cannot) have a justificatory relationship. On the other hand, excluding chance as an explanation by rejecting H_{0} in the case of Proposition 1 is warranted by *modus tollens* (Chow 1996, pp. 50-52).

**Sohn** questions the justificatory function of *modus tollens* because of Berkson's (1942) conditional syllogism. If one were to follow Berkson's example, A represents the chance hypothesis, and B stands for H_{0}. This is not possible when H_{0} is the chance hypothesis (viz., Sohn's contention), however, because a concept does not imply itself. Moreover, the "sometimes" qualifier makes Berkson's major premise ambiguous. Furthermore, there is a confusion in Sohn's appeal to Berkson.

[Proposition 4]: Of all possible differences between two sample means, 5% produce a *t* value equal to, or smaller than, the critical *t* value.

[Proposition 5]: Some differences between two sample means produce a *t* value equal to, or smaller than, the critical *t* value.

Proposition 4 is a definite statement about a probabilistic phenomenon that can be tested. The ambiguity of Berkson's (1942) minor premise (like Proposition 5) precludes it from being used as a criterion for making the statistical decision. Subscribing to Berkson's reasoning betrays a confusion between adopting a well-defined probabilistic statement and using a vague proposition.

**Sohn** finds the reasonableness of the formal approach wanting because one can never be certain that H_{0} is false when one rejects it. This objection would be unassailable if absolute certainty were the prerequisite for reasonableness. Be that as it may, the inevitable uncertainty in question does not invalidate the formal approach.

The Type I error becomes a concern when there
are reservations about the statistical significance of the result
of a specific experiment. Such is an occasion for checking the
correctness of the experimental hypothesis, the presence of a
confounding variable, or the appropriateness of the experimental
design, task or procedure. That is, instead of disputing the validity,
usefulness or importance of significance tests, the inevitable
uncertainty serves to ensure conceptual or methodological rigor.

**Sohn's** two scenarios set in high relief a common misunderstanding about significance tests. Specifically, it is said in the first scenario (Sohn's para. 5) that, given α = .05, the results of around 50 of 1,000 separate *t* tests will be significant by chance when the zero-null hypothesis is true. This statement is as incorrect as saying that there will be *n* heads and *n* tails in 2*n* identical tosses of a fair coin. What a fair coin implies is that 50% of an infinite number of identical tosses result in heads. It does not follow that half of any exact number of identical tosses will result in heads. Consider the first scenario more closely with reference to Table R1.
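The coin-toss point can be checked exactly: the probability of obtaining exactly *n* heads in 2*n* fair tosses is C(2*n*, *n*)/2^{2n}, which is always well below 1 and shrinks as the number of tosses grows. A small sketch:

```python
from math import comb

def p_exactly_half_heads(tosses):
    """Probability of exactly tosses/2 heads in `tosses` fair coin flips."""
    return comb(tosses, tosses // 2) / 2 ** tosses

# Even 20 tosses yield exactly 10 heads less than a fifth of the time,
# and the probability keeps shrinking as the number of tosses grows.
print(round(p_exactly_half_heads(20), 4))    # 0.1762
print(round(p_exactly_half_heads(1000), 4))
```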

Underlying the *t* test are two statistical populations specified by the two levels of the independent variable (Winer 1962). Shown in Panel 1 of Table R1 is the composition of two such populations. Their means are shown in Panel 2 (viz., *u*_{1} = *u*_{2} = 4.812) and they have the same standard deviation (viz., s_{1} = s_{2} = .894). The following steps were carried out:

(a) Selected with replacement a random sample of *n* units from Population 1 and another random sample of *n* units from Population 2, with *n*_{1} = *n*_{2}.

(b) Ascertained the difference between the two sample means, as well as the standard error of the difference.

(c) Calculated the *t* ratio.

(d) Returned the two sets of *n* units
to their respective populations.

(e) Repeated Steps (a) through (d) 5,000 times.

(f) Steps (a) through (e) were repeated with *n*_{1} = *n*_{2} = 5, 75, 750, and 1,000.
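Steps (a) through (e) can be sketched in a few lines. The population below is a hypothetical stand-in, since the unit values of Panel 1 of Table R1 are not reproduced here:

```python
import random
import statistics

# Hypothetical stand-ins for the two identical statistical populations of
# Panel 1 (the actual unit values of Table R1 are not reproduced here).
population_1 = [3, 4, 4, 5, 5, 5, 6, 6]
population_2 = list(population_1)  # same composition, so the zero-null is true

def one_t_ratio(n):
    """Steps (a)-(d): draw n units with replacement from each population
    and return the t ratio for the difference between the sample means."""
    sample_1 = random.choices(population_1, k=n)
    sample_2 = random.choices(population_2, k=n)
    diff = statistics.mean(sample_1) - statistics.mean(sample_2)
    se = (statistics.variance(sample_1) / n
          + statistics.variance(sample_2) / n) ** 0.5
    if se == 0:               # degenerate samples (rare); redraw
        return one_t_ratio(n)
    return diff / se

# Step (e): 5,000 repetitions approximate the random sampling
# distribution of the differences between two sample means.
t_values = [one_t_ratio(5) for _ in range(5000)]
print(round(statistics.mean(t_values), 2))  # close to 0 under the zero-null
```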

Given any sample size, there are 5,000 differences at the end of the exercise. When they are tabulated in the form of a frequency distribution, the result is an empirical approximation to the random sampling distribution of the differences between two sample means. It is only an approximation because, in theory, Step (e) should be repeated an infinite number of times.

The 5,000 *t* values obtained in Step (c) represent the result of standardizing the 5,000 differences in terms of their respective standard errors of differences. Shown in Column 2A of Table R1 are the numbers of empirically determined *t* values that fall within the ranges identified in the corresponding row. For example, 104 *t* values fall between -1.90 and -1.701. This simulation exercise is to make explicit four points:

(1) The probability foundation of the *t* test is the sampling distribution of differences.

(2) A different sampling distribution of differences is used when the sample size changes (see Columns 2A through 2D of Table R1).

(3) The expression "α = .05" means that 5% of an infinite number of differences between two means give *t* values that are as extreme as, or more extreme than, 1.86 (or -1.86 as the case may be) for the one-tailed test with *df* = 8.

(4) It does not follow from (3) that 5% of any 1,000 differences would be as extreme as, or more extreme than, the critical *t* value.
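Point (4) can be made concrete: under a true zero-null, the number of significant results among 1,000 independent tests at α = .05 follows a Binomial(1000, .05) distribution, so an exact count of 50 is merely the single most likely outcome, not a guaranteed one. A sketch:

```python
from math import comb

def p_significant_count(k, n_tests=1000, alpha=0.05):
    """Binomial probability that exactly k of n_tests independent tests
    are significant when the zero-null is true in every test."""
    return comb(n_tests, k) * alpha ** k * (1 - alpha) ** (n_tests - k)

# Exactly 50 significant results is the modal outcome, yet it occurs
# in only about 6% of such 1,000-test batches.
print(round(p_significant_count(50), 3))
```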

To recapitulate (1), every application of the *t* test evokes the appropriate sampling distribution of differences. Hence, the same sampling distribution is evoked 1,000 times in **Sohn's** first scenario if the two statistical populations (as well as *n*_{1} and *n*_{2}) remain the same throughout. The 50-950 split of the 1,000 experiments envisioned by Sohn has nothing to do with the alpha level for the reasons stated in (3) and (4).

Given that testing a point-null hypothesis is no different from testing a zero-null hypothesis (Kirk 1984), the outcomes of significance tests should be independent of the expected effect size (Chow 1996, pp. 132-34; 1998a, pp. 184-85). This contradicts the second scenario described in **Sohn's** paragraph six, which is an echo of the power-analytic "significance-effect size dependence" assertion that the outcomes of significance tests depend on effect size (Cohen 1987). This point is amplified below.

The entries in Table R2 were also obtained with Steps (a) through (f), except that the mean of the second statistical population is larger than that of the first one by 0.5 of the standard deviation of the first statistical population (viz., *u*_{1} = 4.812; *u*_{2} = 5.262; Panel 2). If the "significance-effect size dependence" thesis were correct, the mean *t* ratio should differ from zero. There is no support for the "significance-effect size dependence" thesis because none of the four mean *t* ratios differs from 0 (viz., .028, .013, .012, and .008).

**Sohn's** second scenario also echoes another power-analytic assertion, namely, that larger sample sizes increase statistical power, thereby making it easier to obtain statistical significance. This "significance-sample size dependence" thesis is questioned by the four χ^{2} tests reported in the two panels of Table R3, as may be seen from the italicized and boldface entries in the two panels.

Use Panel B of Table R3 as an illustration. Each of the 5,000 *t* values in Column 2A of Table R2 was classified as "Significant" or "Not significant." For example, there are 456 and 4,544 *t* values in the "Significant" and "Not significant" categories, respectively, when *n*_{1} = *n*_{2} = 5. The same process was repeated with the entries from each of the other columns of Table R2 (i.e., for sample sizes of 75, 750, and 1,000). The result is the eight boldface entries in Panel B of Table R3. They make up the two-way χ^{2} test for the independence of statistical significance (columns) and sample size (rows). As the χ^{2} = 2.64 (*df* = 3) is not significant, there is no reason to reject the independence in question. That is, there is no support for the view that statistical significance is a function of sample size in the case of the point-null.
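The two-way χ^{2} computation itself is standard; a minimal sketch follows. Only the 456/4,544 row for *n* = 5 comes from the text above; the other three rows are hypothetical placeholders for the Table R2 tallies, chosen merely to illustrate a non-significant result at *df* = 3 (critical value 7.81 at α = .05):

```python
def chi_square(table):
    """Two-way chi-square statistic for independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return sum((obs - rt * ct / grand) ** 2 / (rt * ct / grand)
               for row, rt in zip(table, row_totals)
               for obs, ct in zip(row, col_totals))

# Rows: sample sizes 5, 75, 750, 1,000; columns: significant / not significant.
# Only the first row (456 and 4,544) is taken from the text; the rest are
# hypothetical counts standing in for the actual Panel B entries.
table = [[456, 4544],
         [471, 4529],
         [440, 4560],
         [462, 4538]]
print(round(chi_square(table), 2))  # below the df = 3 critical value of 7.81
```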

The procedure just described was also carried out with the entries of Table R1. The result is the eight italicized entries in Panel A of Table R3. The χ^{2} of 2.93 (*df* = 3) is also not significant. Hence, there is also no support for the "significance-sample size dependence" thesis in the case of the zero-null.

To conclude, it is necessary to distinguish between (a) phenomenon and evidential data, and (b) the chance hypothesis and H_{0}. The inevitable possibility of committing the Type I error does not invalidate the formal approach to significance tests. The exclusion of the chance explanation by rejecting H_{0} is warranted by *modus tollens*. Although the critical *t* value defined by the alpha level serves as the decision criterion in every *t* test, it has nothing to do with a collection of separate *t* tests as a set.

References

**Letters "a" and "r" appearing before authors' initials refer to target article and response respectively**

Baird, D. (1992) *Inductive logic: Probability and statistics*. Prentice Hall. [BDH]

Berkson, J. (1942) Tests of significance considered as evidence. *Journal of the American Statistical Association* 37:325-35. [DS]

Chow, S. L. (1996) *Statistical significance: Rationale, validity and utility*. Sage. [rSLC, BDH, DS]

(1998a) Précis of *Statistical significance: Rationale, validity and utility*. *Behavioral and Brain Sciences* 21:169-239. [rSLC, DS]

(1998b) The null-hypothesis significance-test procedure is still warranted. *Behavioral and Brain Sciences* 21:228-38. [BDH]

Cohen, J. (1987) *Statistical power analysis for the behavioral sciences* (revised edition). Academic Press. [rSLC]

Erwin, E. (1998) The logic of null hypothesis testing. *Behavioral and Brain Sciences* 21:197-98. [BDH]

Franklin, A. (1997) Calibration. *Perspectives on Science* 5:31-80. [BDH]

Hunter, J. E. (1998) Testing significance
testing: A flawed defense. *Behavioral and Brain Sciences*
21:204. [BDH]

Josephson, J. R. & Josephson, S. G., eds.
(1994) *Abductive inference*. Cambridge University Press.
[BDH]

Kirk, R. E. (1984) *Basic statistics*,
2nd edition. Brooks/Cole. [rSLC]

Nickles, T. (1987) Methodology, heuristics and rationality. In: *Rational changes in science*, ed. J. C. Pitt & M. Pera. Reidel. [BDH]

Schmidt, F. L. (1992) What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. *American Psychologist* 47:1173-81. [BDH]

Sohn, D. (1993) Psychology of the scientist: LXVI. The idiot savants have taken over the psychology labs! Or why in science the rejection of the null hypothesis as the basis for affirming the research hypothesis is unwarranted. *Psychological Reports* 73:1167-75. [DS]

Thagard, P. (1992) *Conceptual revolutions*. Princeton University Press. [BDH]

Winer, B. J. (1962) *Statistical principles in experimental design*. McGraw-Hill. [rSLC]

Woodward, J. (1989) Data and phenomena. *Synthese* 79:393-472. [BDH]