- Thagard, P (1997 in press). Probabilistic networks and explanatory coherence.
In: P. O'Rourke & J. Josephson (eds.)

To go directly a particular section of this paper, click on a section title below.

- Social reasoning: when friends are acting strange, we conjecture about what might be bothering them.
- Legal reasoning: when a crime has been committed, jurors must decide whether the prosecution's case gives a convincing explanation of the evidence.
- Medical diagnosis: given a set of symptoms, a physician tries to decide what disease or diseases produced them.
- Fault diagnosis in manufacturing: when a piece of equipment breaks down, a trouble shooter must try to determine the cause of the breakdown.
- Scientific theory evaluation: scientists seek an acceptable theory to explain experimental evidence.

The many discussions of causal reasoning over the centuries can be seen
as falling under two general traditions that I shall call *explanationism*
and *probabilism*. Explanationists understand causal reasoning qualitatively,
while probabilists exploit the resources of the probability calculus to
understand causal reasoning quantitatively. Explanationism goes back at
least to Aristotle (1984, vol. 1 p. 128) who considered the inference that
the planets are near as providing an explanation of why they do not twinkle.
Some Renaissance astronomers such as Copernicus and Rheticus evaluated
theories according to their explanatory capabilities (Blake 1960). The
leading explanationists in the nineteenth century were the British scientist,
philosopher, and historian William Whewell (1967), and the American polymath
C. S. Peirce (1931-58). The most enthusiastic explanationists in this century
have been epistemologists such as Gilbert Harman (1973, 1986) and William
Lycan (1988). Very recently, computational models of inference to the best
explanation have been developed (Josephson et al. 1987; Thagard 1988, 1989,
1992b; O'Rorke, Morris, and Schulenburg 1990; Rajamoney 1990; Ng and Mooney
1990; Ram 1990).

The probabilist tradition is less ancient than explanationism, for the mathematical theory of probability only arose in the seventeenth century through the work of Pascal, Bernoulli, and others (Hacking 1975). Laplace and Jevons were the major proponents of probabilistic approaches to induction in the eighteenth and nineteenth century, respectively (Laudan 1981, ch. 12). Many twentieth-century philosophers have advocated probabilistic approaches to epistemology, including Keynes (1921), Carnap (1950), Jeffrey (1983), Levi (1980), and Kyburg (1983). Probabilistic approaches have recently become influential in artificial intelligence as a way of dealing with uncertainty encountered in expert systems (Pearl 1988, Peng and Reggia 1990, Neapolitain 1990). Probabilistic approaches are also being applied to natural language understanding (Goldman and Charniak 1990).

The explanationist versus probabilist issue surfaces in a variety of sub-areas. Some legal scholars concerned with evidential reasoning have been probabilist (Lempert 1986, Cohen 1977), while some are explanationist and see probabilist reasoning as neglecting important aspects of how jurors reach decisions (Allen 1990, Pennington and Hastie 1986). In the philosophy of science, there is an unresolved tension between probabilist accounts of scientific inference (Hesse 1974, Horwich 1982, Howson and Urbach 1989) and explanationist accounts (Thagard 1978, 1988, 1989). Neither the probabilist nor the explanationist tradition is monolithic: there are competing interpretations of probability and inference within the former, and differing views of explanatory inference within the latter.

In recent years, it has become possible to examine the differences between explanationist and probabilist approaches at a much finer level, because algorithms have been developed for implementing them computationally. I have proposed a theory of explanatory coherence that incorporates the kinds of reasoning advocated by explanationists, and implemented the theory in a connectionist program called ECHO that shows how explanatory coherence can be computed in networks of propositions (Thagard 1989). Pearl (1988) and others have shown how probabilistic reasoning can be implemented computationally using networks. The question naturally arises of the relation between ECHO networks and probabilistic networks.

This paper shows how ECHO's qualitative input can be used to produce a probabilistic network to which Pearl's algorithms are applicable. At one level, this result can be interpreted as showing that ECHO is a special case of a probabilistic network. The production of a probabilistic version of ECHO highlights, however, several computational problems with probabilistic networks. The probabilistic version of ECHO requires the provision of many conditional probabilities of dubious availability, and the computational techniques needed to translate ECHO into probabilistic networks are potentially combinatorially explosive. ECHO can therefore be viewed as an intuitively appealing and computationally efficient approximation to probabilistic reasoning. We will also see that ECHO puts important constraints on the conditional probabilities used in probabilistic networks.

The comparison between ECHO and probabilistic networks does not in itself settle the relation between the explanationist and probabilist traditions, since there are other ways of being an explanationist besides ECHO, and there are other ways of being a probabilist besides Pearl's networks. But from a computational perspective, ECHO and Pearl networks are much more fully specified than previous explanationist and probabilist proposals, so a head-to-head comparison is potentially illuminating. After briefly reviewing my theory of explanatory coherence and Pearl's approach to probabilistic networks, I shall sketch the probabilistic interpretation of explanatory coherence and discuss the computational problems that arise. Then a demonstration of how ECHO naturally handles Pearl's central examples will support the conclusion that explanatory coherence theory is not obviated by the probabilistic approach. (2)

*Principle 1. Symmetry*.

(a) If P and Q cohere, then Q and P cohere.(b) If P and Q incohere, then Q and P incohere.

If P1 ...Pmexplain Q, then:(a) For each P

iin P1 ...Pm, Piand Q cohere.(b) For each P

iand Pjin P1 ...Pm, Piand Pjcohere.(c) In (a) and (b) the degree of coherence is inversely proportional to the number of propositions P

1... Pm.

If P1explains Q1, P2explains Q2, P1is analogous to P2, and Q1is analogous to Q2, then P1and P2cohere, and Q1and Q2cohere.

Propositions that describe the results of observation have a degree of acceptability on their own.

If P contradicts Q, then P and Q incohere.

If P and Q both explain a proposition R, and if P and Q are not explanatorily connected, then P and Q incohere. Here P and Q are explanatorily connected if any of the following conditions holds:(a) P is part of the explanation of Q,(b) Q is part of the explanation of P,

(c) P and Q are together part of the explanation of some proposition S.

(a) The acceptability of a proposition P in a system S depends on its coherence with the propositions in S.To understand these principles, further discussion is required. (4) Principle 1, Symmetry, asserts that pairwise coherence and incoherence are symmetric relations, in keeping with the everyday sense of coherence as holding together. The coherence of two propositions is thus different from the nonsymmetric relations of entailment and conditional probability. Typically, P entails Q without Q entailing P, and the conditional probability of P given Q is different from the probability of Q given P. But if P and Q hold together, so do Q and P. The use of a symmetrical relation has advantages that will become clearer in the discussion of the connectionist implementation below.(b) If many results of relevant experimental observations are unexplained, then the acceptability of a proposition P that explains only a few of them is reduced.

Principle 2, Explanation, is by far the most important for assessing
explanatory coherence, since it establishes most of the coherence relations.
Part (a) is the most obvious: If a hypothesis P is part of the explanation
of a piece of evidence Q, then P and Q cohere. Moreover, if a hypothesis
P2 is explained by another hypothesis P*1*, then P*1* and P*2*
cohere. Part (a) presupposes that explanation is a more restrictive relation
than deductive implication, since otherwise we could prove that any two
propositions cohere.

Whereas part (a) of Principle 2 says that what explains coheres with
what is explained, part (b) states that two propositions cohere if together
they provide an explanation. Behind part (b) is the idea that the evaluation
of a hypothesis depends partly on the other hypotheses with which it furnishes
explanations. I call two hypotheses that are used together in an explanation
``cohypotheses.'' Again I assume that explanation is more restrictive than
implication, since otherwise it would follow that any proposition that
explained something was coherent with every other proposition, because
if P1 implies Q, then so does P*1* & P2. But any scientist who
maintained at a conference that the theory of general relativity and today's
baseball scores together explain the motion of planets would be laughed
off the podium. Principle 2 is intended to apply to explanations and hypotheses
actually proposed by scientists.

Part (c) of Principle 2 embodies the claim that if numerous propositions
are needed to furnish an explanation, then the coherence of the explaining
propositions with each other and with what is explained is thereby diminished.
Scientists tend to be skeptical of hypotheses that require myriad *ad
hoc* assumptions in their explanations. There is nothing wrong in principle
in having explanations that draw on many assumptions, but we should prefer
theories that generate explanations using a unified core of hypotheses.
I have elsewhere contended that the notion of *simplicity* most appropriate
for scientific theory choice is a comparative one preferring theories that
make fewer special assumptions (Thagard 1988). Principles 2(b) and 2(c)
together subsume this criterion. I shall not attempt further to characterize
``degree of coherence'' here, but the onnectionist algorithm described
below provides a natural interpretation.

The third criterion for the best explanation in my earlier account was analogy, and this is subsumed in principle 3. It is controversial whether analogy is of more than heuristic use, but scientists such as Charles Darwin have used analogies to defend their theories; his argument for evolution by natural selection is analyzed below. Principle 3 does not say simply that any two analogous propositions cohere. There must be an explanatory analogy, with two analogous propositions occurring in explanations of two other propositions that are analogous to each other. Recent computational models of analogical mapping and retrieval show how such correspondences can be noticed (Holyoak and Thagard 1989, Thagard, Holyoak, Nelson, and Gochfeld 1990).

Principle 4, Data Priority, stands much in need of elucidation and defense. In saying that a proposition describing the results of observation has a degree of acceptability on its own, I am not suggesting that it is indubitable, only that it can stand on its own more successfully than a hypothesis whose sole justification is what it explains. A proposition Q may have some independent acceptability and still not be accepted, if it is only coherent with propositions that are not themselves acceptable.

From the point of view of explanatory coherence alone, we should not take propositions based on observation as independently acceptable without any explanatory relations to other propositions. The coherence of such propositions is non-explanatory, based on background knowledge that observations of certain sorts are very likely to be true. From experience, we know that our observations are very likely to be true, so we should believe them unless there is substantial reason not to. Similarly, at a different level, we have some confidence in the reliability of descriptions of experimental results in carefully refereed scientific journals. Observations may be ``theory-laden'', but they are far from being theory-determined. I count as data not just individual observations such as ``the instrument dial reads .5,'' but also generalizations from such observations.

Principle 5, Contradiction, is straightforward. By ``contradictory'' here I mean not just syntactic contradictions like P \& not-P but also semantic contradictions such as ``This ball is black all over'' and ``This ball is white all over.'' In my earlier version of TEC, I tried to stretch ``contradiction'' to cover cases where hypotheses that are not strictly contradictory are nevertheless held to be incompatible, but such cases are better handled by the new principle 6, Competition.

According to Principle 6, we should assume that *hypotheses that explain
the same evidence compete with each other unless there is reason to believe
otherwise*. Hence there need be no special relation between two hypotheses
for them to be incoherent, since hypotheses that both explain a piece of
evidence are judged to incohere unless there are reasons to think that
they cohere. Not all alternative hypotheses incohere, however, since many
phenomena have multiple causes. For example, explanations of why someone
has certain medical symptoms may involve hypotheses that the patient has
various diseases, and it is possible that more than one disease is present.
Normally, however, if hypotheses are proposed to explain the same evidence,
they will be treated as competitors. For example, in the debate over dinosaur
extinction (Thagard 1991), scientists generally treat as contradictory
the hypotheses:

(1) Dinosaurs became extinct because of a meteorite collision.Logically, (1) and (2) could both be true, but scientists treat them as conflicting explanations. According to principle 6, they incohere because they both are claimed to explain why dinosaurs became extinct and there is no explanatory relation between them.(2) Dinosaurs became extinct because the sea level fell.

Principle 7, Acceptability, proposes in part 7(a) that we can make sense of the overall coherence of a proposition in an explanatory system just from the pairwise coherence relations established by principles 1--5. If we have a hypothesis P that coheres with evidence Q by virtue of explaining it, but incoheres with another contradictory hypothesis, should we accept P? To decide, we cannot merely count the number of propositions with which P coheres and incoheres, since the acceptability of P depends in part on the acceptability of those propositions themselves. We need a dynamic and parallel method of deriving general coherence from particular coherence relations; such a method is provided by the connectionist program described below.

Principle 7(b), reducing the acceptability of a hypothesis when much
of the relevant evidence is unexplained by any hypothesis, is intended
to handle cases where the best available hypothesis is still not very good,
in that it accounts for only a fraction of the available evidence. Consider,
for example, a theory in economics that could explain the stock market
crashes of 1929 and 1987 but had nothing to say about myriad other similar
economic events. Even if the theory gave the best available account of
the two crashes, we would not be willing to elevate it to an accepted part
of general economic theory. What does ``relevant'' mean here? As a first
approximation, we can say that a piece of evidence is *directly* relevant
to a hypothesis if the evidence is explained by it or by one of its competitors.
We can then add that a piece of evidence is relevant if it is directly
relevant or if it is similar to evidence that is relevant, where similarity
is a matter of dealing with phenomena of the same kind. Thus a theory of
the business cycle that applies to the stock market crashes of 1929 and
1987 should also have something to say about nineteenth century crashes
and major business downturns in the twentieth century.

I will only briefly review how the principles are implemented in the computer program ECHO. My emphasis will be on the graphical and computational properties of ECHO networks most relevant to comparison with probabilistic networks. ECHO takes as input statements such as

(`EXPLAIN '(H1 H2) 'E1)`

whose interpretation is that hypotheses H1 and H2 together explain evidence
E1. ECHO represents each proposition by a network node called a *unit*,
and constructs links between units in accord with TEC. Whenever two propositions
cohere according to TEC, ECHO places an excitatory link (with weight greater
than 0) between the units that represent them. Whenever two propositions
incohere according to TEC, ECHO places an inhibitory link (with weight
less than 0) between them.

In accord with Principle 1, Symmetry, all links are symmetric. Given the above input, Principle 2, Explanation, requires that ECHO produce excitatory links between the units representing the following pairs of propositions: H1 and E1, H2 and E1, and H1 and H2. Notice the last link, required by Principle 2(b), since it provides one of the greatest differences between ECHO networks and probabilistic networks.

In accord with principle 2(b), Simplicity, the weights among the units given the above input are lower than they would be if only one hypotheses had been needed in the explanation.

Principle 3 says that analogy can also be a source of coherence, and
ECHO constructs the appropriate excitatory links given input that says
that propositions are analogous. To implement Principle 4, Data Priority,
ECHO takes input specifying propositions as data and constructs an excitatory
link between each of those propositions and a special evidence unit. On
the basis of Principles 5 and 6, Contradiction and Competition, ECHO constructs
inhibitory links between units representing propositions that contradict
each other or compete to explain other propositions. Finally, Principle
7, Acceptability, is implemented in ECHO using a simple connectionist method
for updating the activation of a unit based on the units to which it is
linked. Units typically start with an activation level of 0, except for
the special evidence unit whose activation is always 1. Activation spreads
from it to the units representing data, and from them to units representing
propositions that explain data, and then to units representing higher-level
propositions that explain propositions that explain data, and so on. Inhibitory
links between units make them suppress each other's activation. Activation
of each unit u, a*j*, is updated in parallel using the equation:

{netj (max - aj (t)) |
if\ net_j > 0 | |

aj (t+1) = aj (t)(1 - theta ) + |
||

{cr netj (aj (t) - min) |
otherwise |

Figure 1.1 displays a typical network produced by ECHO, given that H1 and H2 together explain E1, H1 and H2 together explain E2, H4 and H5 together also explain E1 and E2, and H1 is itself explained by H3. All links are symmetric, although the flow of activation is not, since it originates in the EVIDENCE unit. Note that there are many loops in the graph in figure 1.1. Loops with 3 nodes arise both from explanation (e.g. H1, H2, and E1) and from competition (e.g. H1, H4 and E1). Loops with 4 nodes occur whenever there are two propositions that both explain two other propositions (e.g. H1, H4, E1, E2). The presence of loops brings no complications to ECHO's computation of acceptability, but we shall see that loops are problematic for probabilistic networks.

(Graphic not available)

**Figure 1.1** A typical ECHO network. Solid lines indicate excitatory
links, while dotted lines indicate inhibitory links.

Updating an ECHO network is very simple and can be performed locally
at each unit *u*, which must have access, for each *u* to which
is linked, to the weight of the link between *u* and *u* and
to the activation of *u*. ECHO networks are never completely connected
- most units are not directly linked to most other units. Even if networks
were completely connected, the maximum number of links per unit in a network
with *n* units would be *n-1*, so that *n*(n-1)* simple
calculations are all that updating requires. In principle, if each unit
has its own processor, the *n-1* calculations can be performed independently
of each other. There is, however, no guarantee that a small number of updating
steps will suffice to produce stable activation values. It is possible
that activation will pass round and round the network, producing oscillations
in activations of units so that no judgment of acceptability is produced.
But computational experiments have shown that ECHO networks can be highly
stable given appropriate weight values for excitatory and inhibitory links.
For most of the major ECHO examples, 1000 runs have been done to determine
how sensitive network performance is to excitation, inhibition, and decay,
considering each in the range of .01 to .1. The sensitivity experiments
show that ECHO performs well so long as excitation is low relative to inhibition,
when the inhibitory links between units make the network settle without
the oscillations that uncontrolled excitation would produce. Resulting
activations can range between 1 (maximum acceptance) and -1 (maximum rejection).

(Graphic not available)

**Figure 1.2** Graph of settling time versus number of units for
15 ECHO applications, as described in: Thagard (1989, 1991b, 1991a); Thagard
and Nowak (1988, 1990); Nowak and Thagard (1991a, 1991b).

On the basis of these experiments, I chose the default parameter values of .04 for excitation, -.06 for inhibition, and .05 for decay, since they tend to lead to rapid settling. All runs of ECHO on different examples now use these parameter values. Figure~\ref{echo-2} shows the settling time of the most interesting ECHO applications as a function of number of units. Strikingly, larger networks with more links do not require more updating steps to reach stable activation values. This result is consistent with experiments performed using ARCS, a program for analog retrieval that uses networks similar to ECHO's but with many more units and links (Thagard, Holyoak, Nelson, and Gochfeld 1990). Thus although we can not show analytically that ECHO can efficiently compute explanatory coherence, experimental results suggest that ECHO's ability to determine the acceptability of propositions is little affected by the size or connectivity of networks. Further information on how ECHO works will be provided in section 5 which shows how ECHO handles some probabilistic cases. (5)

P(H/E) = P(H) P(E/H) quantity over P(E)

It says that the probability of a hypothesis given the evidence is the prior probability of the hypothesis times the probability of the evidence given the hypothesis, divided by the probability of the evidence. Bayes' theorem is very suggestive for causal reasoning, since we can hope to decide what caused an effect by considering what cause has the greatest probability given the effect. (6) In practice, however, application of the probability calculus becomes complicated. Harman (1986, p.~25) points out that in general probabilistic updating is combinatorially explosive, since we need to know the probabilities of a set of conjunctions whose size grow exponentially with the number of propositions. For example, full probabilistic information about three propositions, A, B, and C, would require knowing a total of 8 different values: P(A & B & C), P(A & B & not-C), P(A & not-B & not-C), etc. Only 30 propositions would require more than a billion probabilities.

Probabilistic networks prune enormously the required number of probabilities
and probability calculations, since they restrict calculations to a limited
set of dependencies. Suppose you know that B depends only on A, and C depends
only on B. You then have the simple network *A right arrow B right arrow
C*. This means that A can affect the probability of C only through B,
so that the calculation of the probability of C can take into account the
value of B while ignoring A.

Probabilistic networks have gone under many different names: causal networks, belief networks, Bayesian networks, influence diagrams, and independence networks. For precision's sake, I want to consider a particular kind of probabilistic network, concentrating on the elegant and powerful methods of Pearl (1988). Since methods for dealing with probabilistic networks other than his are undoubtedly possible, I shall not try to compare ECHO generally with probabilistic networks, but will make the comparison specifically with Pearl networks.

In Pearl networks, each node represents a multi-valued *variable*
such as a patient's temperature, which might take three values: high, medium,
low. In the simplest cases, the variable can be propositional, with two
values, true and false. Already we see a difference between Pearl networks
and ECHO networks, since ECHO requires separate nodes for a proposition
and its negation. But translations between Pearl nodes and ECHO nodes are
clearly possible and will be discussed below.

More problematic are the edges in the two kinds of networks. Pearl networks
are directed, acyclic graphs. Edges are directed, pointing from causes
to effects, so that *A right arrow B* indicates that A causes B and
not vice versa. In contrast, ECHO's links are all symmetric, befitting
the character of coherence and incoherence (Principle 1 of TEC), but symmetries
are not allowed in Pearl networks. The specification that the graphs be
acyclic rules out relations such as those shown in figure 1.3. Since the
nodes are variables, a more accurate interpretation of the edge *A right
arrow B* would be: the values of B are causally dependent on the values
of A.

(Graphic not available)

**Figure 1.3** Examples of cyclic graphs.

(Graphic not available)

**Figure 1.4 **Sample Pearl network, in which the variable D is dependent
on A, B, and C, while E and F are dependent on D. Lines with arrows indicate
dependencies.

The structure of Pearl networks is used to localize probability calculations and surmount the combinatorial explosion that can result from considering the probabilities of everything given everything else. Figure1.4 shows a fragment of a Pearl network in which the variable D is identified as being dependent on A, B, and C, while E and F are dependent on D. The probabilities that D will take on its various values can then be calculated only by looking at A, B, C, E, and F, ignoring other variables in the network from which D is assumed to be conditionally independent, given the five variables on which it is directly dependent. The probabilities of the values of D can be expressed as a vector corresponding to the set of values. For example, if D is temperature and has values (high, medium, low), the vector (.5 .3 .2) assigned to D means that the probability of high temperature is .5, of medium temperature is .3, and low temperature is .2. In accord with the axioms of probability theory, the numbers in the vector must sum to 1, since they are the probabilities of all the exclusive value of the variable. The desired result of computing using a Pearl network is that each node should have a stable vector representing the probabilities of its values given all the other information in the network. If a measurement determines that the temperature is high, then the vector for D would be (1 0 0). If the temperature is not known, it must be inferred using information gathered both from the variables on which D is dependent and the ones that depend on D. In terms of Bayes' theorem, we can think of A, B, and C as providing prior probabilities for the values of D, while E and F provide observed evidence for it. The explanatory coherence interpretation of figure~\ref{pearl} is that A, B, and C explain D, while D explains E and F.

For each variable X, Pearl uses BEL(x) to indicate the computed degrees of belief (probabilities) that X takes on for each of its values x. BEL(x) is thus a vector with as many entries as X has values, and is calculated using the equation:

BEL(x) = alpha * lambda(x) * pi(x)

Here alpha is a normalizing constant used to ensure that the entries in the vector sum to 1. Lambda(x) is a vector representing the amount of support for particular values of X coming up from below, that is from variables that depend on X. Pi(x) is a vector representing the amount of support for particular values of X coming down from above, that is from variables on which X depends. For a variable V at the very top of the network, the value for passed down by V will be a vector of the prior probabilities that V takes on its various values. Ultimately, BEL(x) should be a function of these prior probabilities and fixed probabilities at nodes where the value of the variable is known, producing BEL vectors such as (1 0 0).

Calculating BEL values is non-trivial, because it requires repeatedly
updating the lambda, pi, and BEL values until the prior probabilities and
the known values based on evidence have propagated throughout the network.
It has been shown that the general problem of probabilistic inference in
networks is NP-hard (Cooper 1990), so we should not expect there to be
a universal efficient algorithm for updating BEL.(7) Pearl presents algorithms
for computing BEL in the special case where networks are *singly connected*,
that is where no more than one path exists between two nodes (Pearl 1988,
ch.~4; see also Neapolitain 1990). A loop here is a sequence of edges independent
of direction. If there is more than one path between nodes, then the network
contains a loop that can interfere with achievement of stable values of
BEL, lambda, and pi. Hence methods have been developed for converting multiply
connected networks into singly connected ones by clustering nodes into
new nodes with many values. For example, consider the network shown in
figure 1.5 (from Pearl 1988, p. 196). In this example, metastatic cancer
is a cause of both increased total serum calcium and brain tumor, either
of which can cause a coma. This is problematic for Pearl because there
are two paths between A and D. Clustering involves collapsing nodes B and
C into a new Z representing a variable with values that are all possible
combinations of the values of B and C: increased calcium and tumor, increased
calcium and no tumor, no increased calcium and tumor, and no increased
calcium and no tumor. ECHO deals with cases such as these very differently,
using the principle of competition, as we will see in section 5. There
are other ways of dealing with loops in probabilistic networks besides
clustering.

Pearl discusses two approximating alternatives, while Lauritzen and Spiegelharter (1988) have offered a powerful general method for converting any directed acyclic graph into a tree of cliques of that graph. See also Neapolitain 1990, ch.~7. Hrycej (1990) shows how approximation by stochastic simulation can be understood as sampling from the Gibbs distribution in a random Markov field.

(Graphic not available)

**Figure 1.5** Pearl's representation of a multiply connected network
that must be manipulated before probability calculations can be performed.

ECHO | Pearl | |
---|---|---|

Nodes represent | propositions | variables |

Edges represent | coherence | dependencies |

Directedness | symmetric | directed |

Loops | many | must be eliminated |

Node quantity updated | activation [-1,1] | BEL: vector of [0,1] |

Additional updating | none | lambda, pi |

Additional info used | explanations | conditional probabilities |

data | prior probabilities |

More generally, if D depends on *n* variables with *k* values
each, *k^n* conditional probabilities will be required for computation.
This raises two problems that Pearl discusses. First, if *n* is large,
the calculations become computational intractable, so that approximation
methods must be used. Probabilistic networks are nevertheless much more
attractive computationally than the general problem of computing probabilities,
since the threat of combinatorial explosion is localized to nodes which
we can hope to be dependent on a relatively small number of other nodes.
Second, even if *n* is not so large, there is the problem of obtaining
sensible conditional probabilities to plug into the calculations. Pearl
acknowledges that it is unreasonable to expect a human or other system
to store all this information about conditional probabilities, and shows
how it is sometimes possible to use simplified models of particular kinds
of causal interaction to avoid having to do many of the calculations that
the algorithms would normally require.

Table 1.1 summarizes the descriptions so far of ECHO and Pearl networks. The characterizations have been very terse, but for more detail the original sources can be consulted. We now have enough information to begin considering the relation between ECHO and Pearl networks.

A translation algorithm from ECHO networks to Pearl networks could take either of two forms. The most direct would be an immediate network-to-network translation. Every proposition node in the ECHO network would become a variable in a Pearl network, and every ECHO link would become a Pearl conditional probability between values of variables. This direct translation clearly fails, since ECHO's symmetric links would translate into 2-way conditional probabilities not allowed in Pearl networks. Moreover, the translation would produce many cycles that Pearl networks exclude by definition. (8) Alternatively we can bypass the creation of the ECHO network and simply use the input to ECHO directly to generate a Pearl network. We can then try to produce a program that will take ECHO's input and produce an Pearl network suitable for running Pearl's algorithms. Let us call this program PECHO.

First we must worry about creating the appropriate nodes. ECHO creates nodes when it is given input describing a proposition P. Analogously, PECHO would create a variable node with two values, TRUE and FALSE. At this point, PECHO would have to consult ECHO's input concerning contradictions, and check whether there is some proposition not-P that contradicts P. If so, there is no need to construct a new variable node, since not-P would simply be represented by the FALSE value of the variable node for which P represents the value TRUE. It becomes more complicated if there are several propositions in ECHO that all contradict each other, but these could all be amalgamated into one variable node with multiple values.

Since the vector representing the BEL values for a variable is required to sum to 1, PECHO will be able to duplicate ECHO's effect that the acceptability of a proposition counts against the acceptability of any proposition that contradicts it. Because we are not directly translating from ECHO networks to Pearl networks, we do not have to worry that ECHO activation values range from -1 to 1, rather than from 0 to 1 like probabilities; but in any case it is possible to normalize ECHO's activations into the same range as probabilities. To normalize ECHO in the simplest case, just add 1 to a unit's activation and divide the result by 2. If two units represent contradictory propositions, normalize the resulting values by multiplying each value by 1 divided by the sum of the values. When a proposition contradicts more than one other, the normalization becomes problematic unless the propositions in question are all mutually contradictory, as when they all represent different values of the same variable. In ECHO networks this is not always the case.

Simulation of Copernicus's case against Ptolemy (Nowak and Thagard 1992a) includes:

P6 The Earth is always at the center of the heavenly sphere.

P12 The sun moves eastward along a circle about the earth in one year.

C12 The sun is immobile at the center of the universe.

Here Ptolemy's propositions P6 and P12 each contradict Copernicus's C12, but they do not contradict each other.

ECHO uses the range [-1,1] for both conceptual and computational reasons.
Conceptually, ECHO is interpreted in terms of degrees of *acceptance*
(activation > 0) and degrees of *rejection* (activation < 0), sharing
with some expert systems the intuition that attitudes toward hypotheses
are better characterized in terms of accept/reject rather than degrees
of belief as probabilists hold (cf. Buchanan and Shortliffe 1984, ch. 10).
The computational reason is that activation updating in ECHO has the consequence
that if a hypothesis coheres with one that is deactivated, it will also
tend to be deactivated. Thus sets of hypotheses (theories) tend to be accepted
and rejected as wholes, as is usually the case in the history of science
(Thagard 1992b).

Now we get to the crucial question of links. When ECHO reads the input

`(EXPLAIN '(P1 P2 P3 P4) 'Q),`

it creates excitatory links between each of the explaining propositions and Q. PECHO correspondingly would note that the variable node whose TRUE value represents Q is causally dependent on the variable node whose TRUE value represents P1, and so on. More problematically, PECHO would have to contrive 4 conditional probabilities: P(Q/P1), P(Q/not-P1), P(not-Q/P1), and P(not-Q/not-P1). The first of these could perhaps be derived approximately from the weight that ECHO puts on the link between the nodes representing P1 and Q, and we could derive the third from the first since P(Q/P1) and P(not-Q/P1) sum to 1; but ECHO provides no guidance about the other two probabilities. In fact, the situation is much worse, since Pearl's algorithms actually require 32 different conditional probabilities, for example P(Q/P1 & not-P2 & P3 & not-P4). In the most complex ECHO network to date, modeling the case of Copernicus against Ptolemy (Nowak and Thagard 1992a), there are 143 propositions. A search through the units created by ECHO, counting the number of explainers, shows that the number of conditional probabilities that PECHO would need is 45,348. This is a big improvement over the 2^143 (more than 10^43) probabilities that a full distribution would require, but it is still daunting. Of the 45,348, only 469 could be directly derived from the weight on the ECHO link. Does it matter what these probabilities are? Perhaps PECHO could you give them all a simple default value and still perform as well as ECHO. We shall shortly see that in fact TEC requires some constraints on the conditional probabilities if PECHO is to duplicate ECHO's judgements.

A derivation of P(Q/P1) based on the ECHO link between Q and P1 would effectively implement TEC's simplicity principle, 2(c), since it would mean that in updating the Pearl network P1 would get less support from Q than it would if P1 explained Q without the help of other hypotheses. This is because ECHO makes the strength of such links inversely proportional to the number of hypotheses. PECHO is able to get by with a unidirectional link between the node for P and the node for Q, since the contribution of the lambda and pi functions to the BEL functions of the two nodes effectively spreads support in both directions.

The construction just described would enable PECHO to implement principles 2(a) and 2(c) of TEC, but what about 2(b), according to which P1, P2, P3 and P4 all cohere with each other? Here PECHO encounters serious difficulties. Explanatory coherence theory assumes that cohypotheses (hypotheses that participate together in an explanation) support each other, but this is impossible in Pearl networks which have to be acyclic. Here we see a deep difference in the fundamental assumptions of TEC and the probabilistic networks. Probabilistic networks gain their relative efficiency by making strong assumptions of independence. In contrast, TEC assumes that every proposition in a belief system can be affected by every other one, although the effects may be very indirect and small. To put it graph-theoretically, ECHO networks are strongly connected, by virtue of their symmetric links, but probabilistic networks with directed edges are emphatically not. Another aside: At first glance, ECHO networks might seem to be similar to the noisy or-gates for which Pearl (1988, 188ff.) provides an efficient method. However, his method requires assumptions not appropriate for ECHO, such as that an event be presumed false if all conditions listed as its causes are false.

Pearl networks can, however, get the effects of excitatory links between cohypotheses by means of the clustering methods that are used to eliminate loops. We saw in the last section that Pearl considers collapsing competing nodes into single nodes with multiple values. If H1 and H2 together explain E, then instead of creating separate variable nodes for H1 and H2, PECHO would create a variable node <H1-H2> with values representing H1 & H2, H1 & not-H2, not-H1 & H2, and not-H1 & not-H2. To implement TEC's principle 2(b) that establishes coherence between H1 and H2, PECHO would have to ensure that it has conditional probabilities such that P(E/H1 & H2) is greater than either P(E/H1 & not-H2) and P(E/not-H1 & H2). (I am simplifying the representation here: in the Pearl network, by P(E/H1 & H2) I mean the probability that variable E takes the value TRUE given that variable <H1-H2> takes the value <H1 & H2>.)

A similar method should enable PECHO to deal with the inhibitory links required by ECHO to implement TEC's Principle 6, Competition. We saw that PECHO can handle contradictions by constructing complex variables, but it has no direct way of having a negative impact of one hypothesis on another when they are competing to explain a piece of evidence. The clustering technique mentioned in the last section shows how this can be done. If an ECHO network has H1 and H2 independently explaining E, PECHO will have to replace variable nodes for H1 and H2 by a combined node with values for H1 & H2, H1 & not-H2, not-H1 & H2, and not-H1 & not-H2. PECHO can enforce competition between H1 and H2 by requiring that P(E/H1 & H2) be less than either P(E/H1 & not-H2) or P(E/not-H1 & H2). (9)

The clustering situation gets much more complicated, since H1 may compete
with other hypotheses besides H2. In the Copernicus versus Ptolemy simulation,
ECHO finds 214 pairs of competing hypotheses, and there is an important
Copernican hypothesis that competes with more than 20 Ptolemaic hypotheses.
In general, if a proposition P has *n* hypotheses participating in
explaining it, either in cooperation or in competition with each other,
then a clustered variable node with 2^*n* values will have to be created.
For example, in the Copernicus versus Ptolemy simulation, there are pieces
of evidence explained by 5 Ptolemaic hypotheses working together and by
5 Copernican hypotheses. To handle both cohypotheses supporting each other
and competition between conflicting hypotheses, PECHO would need to have
a single node with 1024 values, in place of the 10 nodes corresponding
to ECHO units. In fact, the node would have to be still more complicated
because these 10 hypotheses compete and cohere with many others because
they participate in additional explanations. Typically, the set of hypotheses
formed by collecting all those that either compete with or co-explain with
some member of the set will be virtually all the hypotheses that there
are. In the Copernicus simulation, there are 80 explaining hypotheses,
including Ptolemaic ones, and search shows that each is connected to every
other by chains of coherence or competition. We would then require a single
hypothesis node with 2^80 values, more than the number of milliseconds
in a billion years. Thus dealing with competition and cohypotheses in Pearl
networks can be combinatorially disastrous, although there may be more
efficient method of clustering. Pearl 1988 (p. 201) considers an alternative
method akin to that of Lauritzen and Spiegelharter (1988) in which the
nodes would represent cliques in an ECHO network such as the cluster of
H1, H2 and E. (10) In the Copernicus simulation, there are more than 4000
such cliques, so the reconstituted Pearl network would be much larger than
the original. More importantly, it is not at all clear how to assign conditional
probabilities in ways that yield the desired results concerning cohypotheses
and competitors. Thus in principle ECHO input can be used to drive a Pearl
network, but in practice the computational obstacles are formidable.

The input to ECHO also includes information about data and analogies. PECHO could implement something like TEC Principle 6, Data Priority, by special treatment of variable nodes corresponding to evidence propositions in ECHO. It would be a mistake to instantiate an evidence variable node with a value (1 0) since that would not allow the possibility that the evidence is mistaken. (ECHO can reject evidence if it does not fit with accepted hypotheses.) Pearl (1988, p. 170) provides a method of dummy variables that allows a node to represent virtual evidence, an effective solution if updating can lead to low BEL values for such nodes. As for analogy, there is no direct way in a Bayesian network in which a hypothesis H1 can support an analogous one H2, but once again dummy nodes might be constructed to favor the hypothesis in question. Analogy is normally used to support a contested hypothesis by analogy with an established one, so little is lost if there is no symmetric link and the contested one is simply viewed as being slightly dependent on the established one. Analogy is thus viewed as contributing to the prior probability of the contested hypothesis.

In sum, the theory of explanatory coherence that is implemented in ECHO by connectionist networks can also be implemented by probabilistic networks. There is, however, a high computational cost associated with the alternative implementation. A massively greater amount of information in the form of conditional probabilities is needed to run the algorithms for updating probabilistic networks, and the problem of creating a probabilistic network is non-trivial: reconstruction is required to avoid loops, and care must be taken to retain information about cohypotheses and competitors. Combinatorial explosions must be avoided. Hence while ECHO's connectionist networks can be abstractly viewed in probabilistic terms, there are potentially great practical gains to be had by not abandoning the explanationist approach for the apparently more general probabilist one. Practically, the probabilist approach must use the explanationist one for guidance in assessing probabilities. We saw that consideration of cohypotheses and competitors puts constraints on the conditional probabilities allowable in probabilistic networks, and explanatory coherence theory can also contribute to setting prior probabilities. One can also think of the principles of analogy and data priority as giving advice on how to set prior probabilities. How far can we do with ECHO alone?

`(EXPLAIN '(BURGLARY) 'ALARM)`

`(EXPLAIN '(EARTHQUAKE) 'ALARM)`

`(EXPLAIN '(EARTHQUAKE) 'RADIO-REPORT)`

`(EXPLAIN '(ALARM) 'WATSON-CALLED)`

`(EXPLAIN '(JOKE NO-ALARM) 'WATSON-CALLED)`

`(CONTRADICT 'ALARM 'NO-ALARM)`

`(DATA '(WATSON-CALLED RADIO-REPORT))`

The network created by ECHO using this input is shown in figure 1.6. Note that in implementing the principle of competition ECHO automatically places inhibitory links between BURGLARY and EARTHQUAKE, and between ALARM and JOKE. Given the above input, ECHO reaches the conclusion that there was an earthquake rather than a burglary.

(Graphic not available)

**Figure 1.6** ECHO network created for Pearl's burglary example
for input given in the text. Solid lines indicate excitatory links, while
dotted lines indicate inhibitory ones.

This simple qualitative information may give misleading results in cases where statistical information is available. Suppose Holmes knows that burglaries almost always set off his alarm, but earthquakes do so only rarely. ECHO need not assume that every explanation is equally good, but allows the input to include an indicator of the strength of the explanation. We could, for example, alter the above input to include the statements:

`(EXPLAIN '(BURGLARY) 'ALARM .8)`

`(EXPLAIN '(EARTHQUAKE) 'ALARM .1)`

This has the effect of making the excitatory link between BURGLARY and ALARM nine times stronger than the link between EARTHQUAKE and ALARM, so, other things being equal, ECHO will prefer the burglary hypothesis to the earthquake hypothesis.

Statistical information that provides prior probabilities can be used in similar ways. Suppose that an alarm is as likely given a burglary as given an earthquake, but Mr. Holmes knows that in his neighborhood burglaries are far more common than earthquakes. Without the radio report of the earthquake, Holmes should prefer the burglary hypothesis to the earthquake hypothesis. In Bayesian terms, the burglary base rate is higher. ECHO can implement consideration of such prior probabilities by assuming that the base rates provide a statistical explanation of the occurrences (cf. Harman 1986, p. 70). The base rates can be viewed has hypotheses that themselves explain statistical information that has been collected. We could thus have the additional input to ECHO:

`(EXPLAIN '(BURGLARY-RATE) 'BURGLARY .1)`

`(EXPLAIN '(BURGLARY-RATE) 'BURGLARY-STATISTICS)`

`(EXPLAIN '(EARTHQUAKE-RATE) 'EARTHQUAKE-STATISTICS)`

`(EXPLAIN '(EARTHQUAKE-RATE) 'EARTHQUAKE .01)`

`(DATA '(BURGLARY-STATISTICS EARTHQUAKE-STATISTICS))`

The network constructed by ECHO is shown in figure 1.7. Similar cases in which prior probabilities need to be taken into account often arise in medical diagnosis. Medical students are cautioned to prefer routine diagnoses to exotic ones with the adage: when you hear hoofbeats, think horses, not zebras.

(Graphic not available)

**Figure 1.7 **Enhanced ECHO network for burglary example including
statistical explanations.

ECHO is also capable of handling the cancer example (figure~\ref{cancer}) using the prior and conditional probabilities provided by Pearl (1988, p. 197). Of course, ECHO's final activations are not exactly equivalent to the final probabilities that Pearl calculates, but without recourse to clustering methods ECHO gets results that are qualitatively very similar. ECHO strongly accepts just the propositions to which Pearl's calculation gives high probability, and strongly rejects just the propositions to which he gives low probability.

ECHO is thus capable of using probabilistic information when it is available, but does not require it. There may well be cases in which a full probability distribution is known and ECHO can be shown to give a defective answer because activation adjustment does not exactly mirror calculation of posterior probabilities. In such cases where there are few variables and the relevant probabilities are known, it is unnecessary to muddy the clear probabilistic waters with explanatory coherence considerations. But in most real cases found in science, law, medicine, and ordinary life, the explanationist will not be open to the charge of being probabilistically incoherent, since the probabilities are sparsely available and calculating them is computationally unfeasible. What matters then are the qualitative considerations that explanatory coherence theory takes into account, and probabilities are at best epiphenomenal. I shall not discuss here the argument that probabilistic approaches are preferable because they provide a clear semantics for numerical assessment of hypotheses. (11)

At a more local level, however, it is an open question whether explanationist or probabilist accounts are superior. Local disputes can be epistemological, psychological or technological. If one accepts the view of Goldman (1986) that power and speed are epistemological goals as well as reliability, then explanationist models can be viewed as desirable ways of proceeding apace with causal inferences while probabilistic models are still lost in computation. Similarly, the computational cost associated with the probabilistic interpretation of explanatory coherence suggest that such models may be inappropriate as models of human psychology. ECHO and probabilistic networks can be compared with respect to human performance, with the latter apparently predicting that people should be much slower at inference tasks that require the most work to translate into probabilistic terms. We saw such cases arise when there are cohypotheses and competitors. ECHO takes such complications in stride, whereas Pearl networks require computations to realign networks and many more conditional probabilities to handle such cases. It should therefore be possible to present people with examples of varying complexity and determine whether their reasoning ability declines rapidly as the complexity of probabilistic computations suggest.

Similarly, for technological applications in expert systems, ECHO may perform better than probabilistic networks. If rich probabilistic information is not generally available, and if the domains are complex enough with cohypotheses and competitors, then ECHO may be more effective than probabilistic cases. The issue must be decided on a local basis, application by application, just as the psychological issue depends on experiments that have not yet been done. My conjecture is that the psychological and technological applicability of explanationist and probabilist techniques will vary from domain to domain, with the following approximate ordering from most appropriate for explanationist to most appropriate for probabilist approaches: social reasoning, scientific reasoning, legal reasoning, medical diagnosis, fault diagnosis, games of chance. The psychological and technological answers need not be the same: diagnosis may well be an enterprise where a non-psychological probabilistic approach can bring technological gains.

Since the theory of explanatory coherence has a probabilistic interpretation, albeit a computationally expensive one, the analysis in this paper suggests that probabilism reigns supreme as the epistemology of Eternal Beings. But explanationism survives as epistemology for the rest of us.

2. One important recent contribution is Peng and Reggia (1990), which I review in Thagard (1990).

3. In the original statement of TEC in Thagard (1989), Principle 3 included a second clause concerning disanalogies that is not included here because it lacks interesting applications. The old principle 7, system coherence, has similarly been deleted because it does little to illuminate actual cases. A new principle 6, Competition, has been added to cover cases where non-contradictory hypotheses compete with each other (Thagard 1991). The old principle 6, Acceptability, becomes the new principle 7.

4. The next 11 paragraphs are adapted from Thagard (1992b), pp. 66-69, with permission of Princeton University Press.

5. Settling time is also unaffected by number of links. The total time for ALL 15 runs on a Sun 4 workstation is less than two minutes, if graphics and detailed screen output are not used.

6. Hence probabilists are often called Bayesians.

7. Note for non-computer-scientists. A problem belongs in class P if it can be solved by a deterministic automaton in polynomial time. A problem belongs in class NP if it can be solved by a non-deterministic automaton in polynomial time. Here ``non-deterministic'' means that given an automaton in a particular state and given a particular input, the automaton may reach more than one state. Anything that a non-deterministic automaton can do can be done by a deterministic one, but the non-deterministic automaton may require an exponentially increasing number of steps to do it. Because of the rapid growth of exponential functions, problems requiring an exponential number of steps are said to be intractable. A problem is NP-complete if it is in NP and all other problems in NP are reducible to it. A problem is NP-hard if there is some NP-complete problem that is reducible to it. A problem that is NP-complete or NP-hard cannot be solved in polynomial time unless P=NP, which is widely believed to be false. See Garey and Johnson (1979) for more careful and complete definitions. Some problems involving computing the best explanation non-probabilistically have also been shown to be NP-hard (Bylander 1990).

8. Clarification: in Pearl's terminology, all cycles are loops, but
not all loops are cycles. *A right arrow B right arrow C right arrow
A* is a cycle, because the direction of the path is maintained. However,
*A right arrow B right arrow C left arrow A* is a loop, since for
loops direction is ignored, but it is not a cycle.

9. In very simple situations, it is possible to enforce competition in probabilistic networks without using clustering. Pearl describes how the effect of one cause ``explaining away'' another can be modeled in singly connected networks. Often, however, in the examples to which ECHO has been applied, two hypotheses compete to explain more than one piece of evidence. Units representing those hypotheses are therefore connected by two different paths, and clustering or some other technique will be necessary to translate the network into one not multiply connected.

10. Cliques are subgraphs whose nodes are all adjacent to each other.

11. While probability theory undoubtedly has a clear syntax, the meaning or meanings of probability is an unsolved problem. All the available interpretations, in terms of frequencies, propensities, degrees of belief, and possible worlds, can be challenged. Cohen (1989) provides a concise review.

Aristotle (1984). *The complete works of Aristotle*. Ed. J. Barnes.
Princeton: Princeton University Press.

Blake, R. (1960). Theory of hypothesis among renaissance astronomers.
In R. Blake, C. Ducasse, and E.H. Madden (eds.), *Theories of scientific
method*. Seattle: University of Washington Press, pp. 22--49.

Buchanan, B., and Shortliffe, E. (Eds.), (1984). *Rule-based expert
systems*. Reading, MA: Addison Wesley.

Bylander, T. (1990). When efficient assembly performs correct abduction and why abduction is otherwise trivial or intractable. Paper presented at AAAI Spring Symposium on Automated Abduction.

Carnap, R. (1950). *Logical foundations of probability*. Chicago:
University of Chicago Press.

Cohen, L.J. (1977). *The probable and the provable*. Oxford: Clarendon
Press.

Cohen, L.J. (1989). *An introduction to the philosophy of induction
and probability*. Oxford: Clarendon.

Cooper, G. (1990). The computational complexity of probabilistic inference
using Bayesian belief networks. *Artificial Intelligence, 42*, 393--405.

Garey, M., and Johnson, D. (1979). *Computers and intractability*.
New York: Freeman.

Goldman, A. (1986). *Epistemology and cognition*. Cambridge, MA:
Harvard University Press.

Goldman, R., and Charniak, E. (1990). Incremental construction of probabilistic models for language abduction: Work in progress. Paper presented at AAAI Spring Symposium on Automated Abduction.

Hacking, I. (1975). *The emergence of probability*. Cambridge:
Cambridge University Press.

Harman, G. (1973). *Thought*. Princeton: Princeton University Press.

Harman, G. (1986). *Change in view: Principles of reasoning*. Cambridge,
MA: MIT Press/Bradford Books.

Hesse, M. (1974). *The structure of scientific inference*. Berkeley:
University of California Press.

Horwich, P. (1982). *Probability and evidence*. Cambridge: Cambridge
University Press.

Howson, C., and Urbach, P. (1989). *Scientific reasoning: The Bayesian
tradition*. Lasalle, IL: Open Court.

Hrycej, T. (1990). Gibbs sampling in Bayesian networks. *Artificial
Intelligence*, 46, 351--363.

Jeffrey, R. (1983). *The logic of decision*. Second edition. Chicago:
University of Chicago Press. First published 1965.

Josephson, J, Chandrasekaran, B., Smith, J., and Tanner, M. (1987) A
mechanism for forming composite explanatory hypotheses. *IEEE Transactions
on Systems, Man, and Cybernetics, 17*: 445--454.

Keynes, J.M. (1921). *A treatise on probability*. London: Macmillan.

Kyburg, H. (1983). *Epistemology and inference*. Minneapolis: University
of Minnesota Press.

Laudan, L. (1981). *Science and hypothesis*. Dordrecht: Reidel.

Lauritzen, S., and Spiegelharter, D. (1988). Local computation with probabilities in graphical structures and

their applications to expert systems. *Journal of the Royal Statistical
Society B*, 50.

Lempert, R. (1986). The new evidence scholarship: analyzing the process
of proof. *Boston University Law Review, 66*, 439--477.

Levi, I. (1980). *The enterprise of knowledge*. Cambridge, MA:
MIT Press.

Lycan, W. (1988) *Judgement and justification*. Cambridge: Cambridge
University Press.

Neapolitain, R. (1990). *Probabilistic reasoning in expert systems*.
New York: John Wiley.

Ng, H., and Mooney, R. (1990). The role of coherence in constructing and evaluating abductive explanations.

Paper presented at AAAI Spring Symposium on Automated Abduction.

Nowak, G., and Thagard, P., (1992a). Copernicus, Ptolemy, and explanatory
coherence. In R. Giere (Ed.), *Cognitive Models of Science, Minnesota
Studies in the Philosophy of Science*, vol. 15. Minneapolis: University
of Minnesota Press, 274-309.

Nowak, G., and Thagard, P., (1992b). Newton, Descartes, and explanatory
coherence. In R. Duschl and R. Hamilton (Eds)., *Philosophy of science,
cognitive psychology and educational theory and practice*, Albany: SUNY
Press, 69-115.

O'Rorke, P., Morris, S., and Schulenburg, D., (1990). Theory formation
by abduction: A case study based on the chemical revolution. In J. Shrager
and P. Langley (Eds.), *Computational models of discovery and theory
formation*. San Mateo, CA: Morgan Kaufman. 197--224.

Pearl, J. (1988). *Probabilistic reasoning in intelligent systems*.
San Mateo: Morgan Kaufman.

Peirce, C. (1931--1958). *Collected papers*. 8 vols. Edited by
C. Hartshorne, P. Weiss, and A. Burks.

Cambridge, MA: Harvard University Press.

Peng, Y. and Reggia, J. (1990). *Abductive inference models for diagnostic
problem solving*. New York: Springer-Verlag.

Pennington, N., and Hastie, R. (1986). Evidence evaluation in complex
decision making. *Journal of Personality and Social Psychology* 51:242--258.

Rajamoney, S. (1990). A computational approach to theory revision. In
J. Shrager and P. Langley (Eds.), *Computational models of discovery
and theory formation*. San Mateo, CA: Morgan Kaufman. 225--253.

Ram, A. (1990). Decision models: A theory of volitional explanation.
*Proceedings of the Twelfth Annual Conference of the Cognitive Science
Society*. Hillsdale, NJ: Erlbaum. 198--205.

Thagard, P. (1978). The best explanation: Criteria for theory choice.
*Journal of Philosophy*, *75*, 76--92.

Thagard, P. (1988). *Computational philosophy of science*. Cambridge,
MA: MIT Press/Bradford Books.

Thagard, P. (1989). Explanatory coherence. *Behavioral and Brain Sciences,
12*, 435--467.

Thagard, P. (1990). Review of Y. Peng and J. Reggia, *Abductive Inference
Models for Diagnostic Problem Solving*, *SIGART Bulletin, 2*, no.
1, 72--75.

Thagard, P. (1991) The dinosaur debate: Explanatory coherence and the
problem of competing hypotheses. In J. Pollock and R. Cummins (Eds.), *Philosophy
and AI: Essays at the interface*. MIT Press/Bradford Books, 279--300.

Thagard, P. (1992a). Adversarial problem solving: Modeling an opponent
using explanatory coherence. *Cognitive Science, 16*, 123--149.

Thagard, P. (1992b). *Conceptual revolutions*. Princeton Unversity
Press.

Thagard, P., Holyoak, K., Nelson, G., and Gochfeld, D. (1990). Analog
retrieval by constraint satisfaction. *Artificial Intelligence, 46*,
259--310.

Thagard, P., and Nowak, G. (1988). The explanatory coherence of continental
drift. In A. Fine and J. Leplin (Eds.), *PSA 1988*, vol. 1. East Lansing,
Mich.: Philosophy of Science Association, 118--126.

Thagard, P., and Nowak, G. (1990). The conceptual structure of the geological
revolution. In J. Shrager and P. Langley (Eds.), *Computational models
of discovery and theory formation*. San Mateo, CA: Morgan Kaufman. 27--72.

Whewell, W. (1967). *The philosophy of the inductive sciences*.
New York: Johnson Reprint Corp. Originally published 1840.