10
1965 International Conference on Computational Linguistics

MEASUREMENT OF SIMILARITY BETWEEN NOUNS

Kenneth E. Harper
The RAND Corporation
1700 Main Street
Santa Monica, California 90406

ABSTRACT
A study was made of the degree of similarity between pairs of Russian nouns, as expressed by their tendency to occur in sentences with identical words in identical syntactic relationships. A similarity matrix was prepared for forty nouns; for each pair of nouns the number of shared (i) adjective dependents, (ii) noun dependents, and (iii) noun governors was automatically retrieved from machine-processed text. The similarity coefficient for each pair was determined as the ratio of the total of such shared words to the product of the frequencies of occurrence of the two nouns in the text. The 780 pairs were ranked according to this coefficient. The text comprised 120,000 running words of physics text processed at The RAND Corporation; the frequencies of the forty nouns in this text ranged from 42 to 328.

The results suggest that the sample of text is of sufficient size to be useful for the intended purpose. Many noun pairs with similar properties (synonymy, antonymy, derivation from distributionally similar verbs, etc.) are characterized by high similarity coefficients; the converse is not observed. The relevance of various syntactic relationships as criteria for the measurement of similarity is discussed.
1. INTRODUCTION

One of the goals of studies in Distributional Semantics is the establishment of word classes on the basis of the observed behavior of words in written texts. A convenient and significant way of discussing "behavior" of words is in terms of syntactic relationship. At the outset, in fact, it is necessary that we treat a word in terms of its Syntactically Related Words (SRW). In a given text, each word bears a given syntactic relationship to a finite number of other words; e.g., a finite number of words (nouns and pronouns) appear as "subject" for each active verb; another group of nouns and pronouns are used as "direct object" of each transitive verb; other words of the class, "adverb," appear as modifiers of a given verb. In each instance we may speak of the related words as SRW of a given verb, so that in our example three different types of SRW emerge; a given SRW is then defined in terms both of word class and specific relationship to the verb. (A given noun may of course belong to two different types of SRW, e.g., as both subject and object of the same verb.) Distributionally, we may compare two verbs in terms of their SRW.

The objective of the present study is to test the premise that "similar" words tend to have the same SRW. This premise is tested, not with verbs, as in the
above in
example,
a given
nouns,
text
(2)
of nouns
to
three
formed
the
from
of
nouns.
types
out
that
("avocado"
and
"cherry")
have
common t h e
in
facts
would
lar,
that
"c"
less
similar,
etc.
A number
of
anyway?
Do w o r d s
a significant
subject
these
is
based
limited
are
questions that
are
areas?
as
is
to
claimed
the
and "b" as
words
the
throw for
possession
foreknowledge
are
share
What
is
also
have to
from
the
differ
that
=
be
light
on
"results"
experiment of
many
of
should some
"a
estab-
effect
texts
the
"c"
simi-
really
in order is
are
"similarity"
text?
using,
attempting the
and
investigation
designed
factors:
(physics),
What
it
These
and " b "
meaning
necessary
of nouns,
modifiers
"a"
dissimilar
and of
no v a l i d i t y
on t h r e e
"a"
a given
pair the
"a"
"modern."
that
find
and "furniture")
in
is
in
("chair"
similar
The p r e s e n t
Our a u d a c i t y
field
adjective
What i s
Do n o t
an e x p e r i m e n t
nouns
arise:
o f SRW i n
of
example:
such
similar)
in words,
questions;
obtained.
"d"
group
and groups
Another
that
to
to express
the
conclude
(i)
by e a c h
text
o f word b e h a v i o r ?
multiple=meaning
regarded
to
(3)
modifier
flow much t e x t
patterns
ent
us
a small
nouns)
" c )' a n d " d "
number"?
common S R W ?
and
share
is
Sill;T s h a r e d
SRI~.
adjective
number
significant
lish
and
group,
in a given
nouns
lead
of
shared
turn
whereas
o f SRW f o r
individual
their
"ripe,"
Our p r o c e d u r e
number
the
between
a function
might
with
find
"similarity" as
but
a text the
at
all
in a multiple
=
meaning probler:l is processing
of text.
i n v i e w o£ t h e reader
We w o u l d h o p e ,
or t h e
2.
conclude however,
that
capability
is
and c o m p l e x i t y
a critical
clearly of
the problem.)
with his
if
the
would n o t employed,
results
do n o t
expectations.
2. PROCEDURE

The present study was based on a series of articles from Russian physics journals, comprising approximately 120,000 running words (some 500 pages). The processing of this text has been described elsewhere;(1,2) here, we note only that each sentence of text is recorded on magnetic tape, together with the following information for each occurrence in the sentence: its part of speech, its "word number" (an identification number in the machine glossary), and its syntactic "governor" or "dependent" (if any) in the sentence. A retrieval program applied to this text tape then yielded information about the SRW for the words in which we were interested. For convenience and economy, all words in this study are identified in the machine printout by word number (rather than in their "natural-language" form).

In our study Russian
The
proves nothing.
s u c h an o p i n i o n
of disbelief
automatic
a necessity,
the experiment
that
for
judgment of the procedures
suspension
correspond
and t h e
(The l a t t e r
size
may w e l l
preclude
mininlal,
we chose to deal with the SRW of forty nouns, herein called Test Words (TW). The number is completely arbitrary; the particular nouns chosen (see Table 1) were presumed to form different semantic groupings. Table 1 gives one possible grouping of these words; the criteria for grouping are more or less obvious, although the reader may easily form different groups, by expanding or contracting the groups that we have designated. The only purpose of this grouping is to provide a weak measure of control in the experiment: if two nouns are found to be similar in terms of their SRW, we should like to compare the finding with some intuitive understanding of their similarity. (For convenience, we shall refer to the TWs by their English equivalents.)

Two nouns may be compared with reference to several different types of SRW. Here, we have chosen to limit the comparison to three types: the adjective dependents (in either attributive or predicative function), the noun dependents (normally, but not necessarily, in the genitive case in Russian), and the noun governors (the TW is normally, but not necessarily, in the genitive case). Strictly speaking, the syntactic function of the SRW should be taken into account. In ignoring this factor, we are consciously permitting certain inexactitudes, on the premise that the distortions introduced into our measurement will not be severe.

The task of manually retrieving SRW for each occurrence of the 40 TWs, and of comparing each TW with every other TW, is too tedious to be attempted. The aid of the computer was enlisted, in two ways:
Table 1

39 TEST NOUNS

Group  English         Russian          W No.    F   L1   L2   L3   L4
  1    calculation 1   vycislenie         782   62   15   23   11   49
       measurement     izmerenie         1579  328   29   63   36  128
       determination   opredelenie       3324  121    7   39   14   60
       calculation 2   rascet            4627   90   12   24   16   53
  2    consideration   rassmotrenie      4598   51   14   29    6   49
       comparison      sravnenie         5200  106    6   22    4   32
       study           izucenie          1610   64    8   44    6   58
       investigation   issledovanie      1723  159   32   65   21  118
  3    relation        sootnosenie       5111  113   14   18   15   47
       ratio           otnosenie         3455  102   14   22    9   45
       correspondence  sootvetstvie      5109   29    2    1    0    3
  4    solution        rastvor           4608  129    6   22   24   52
       compound        soedinenie        5082   15    5    5    6   16
       alloy           splav             5182   27    6    2    4   12
  5    metal           metall            2400   86   11    2   28   41
       gas             gaz                807   37    7    2    8   17
       liquid          zidkost'          1329   56    8    2   15   25
       crystal         kristall          2131  171   15   19   44   78
  6    uranium         uran              5745  171    0    0   18   18
       silver          serebro           4899   48    4    1   17   22
       copper          med'              2419   58    2    3   20   25
       phosphor        fosfor            5913  130    9    2   34   45
  7    proton          proton            4565  125    8    2   27   37
       ion             ion               1686   98   14   10   31   55
       molecule        molekula          2568  112   18   18   39   75
       atom            atom               186  106    9   23   28   60
  8    formula         formula           5911  231   20   21   19   60
       expression      vyrazenie          739  223   25   12   24   61
       equation        uravnenie         5742  412   42   24   32   98
  9    width           sirina            6198   43    4    9    9   22
       depth           glubina            913   40    6    8    9   23
       length          dlina             1194  112   16   21   22   59
       height          vysota             764   23    2   11    3   16
 10    presence        nalicie           2696  119    3   73    5   81
       absence         otsutstvie        3485   44    2   35    1   38
       existence       suscestvovanie    5352   41    3   25    6   34
 11    question        vopros             615   96    5    3   10   18
       problem 1       zadaca            1362   68   15   11   10   36
       problem 2       problema          4254   26    4   10    6   20

"W No." = word number; "F" = frequency
1. Through automatic scanning of the text, each occurrence of the 40 TWs was located, and in each instance the identity (word number) of relevant SRW was recorded. A listing is produced for each of the TWs (see Table 2, "SRW Detail," for an example of the TW, VYCISLENIE = calculation 1), showing the different words used as adjective dependents (List 1), noun dependents (List 2), and noun governors (List 3). The number of words on each of these lists is also shown in Table 1, together with the total number of SRW for each TW (List 4). We stress the fact that these numbers refer to different words used as SRW; the repetition of a given SRW (for a given SRW type) was not recorded.

2. Each TW was automatically compared with every other TW, with respect to their shared SRW, i.e., in terms of the words in Lists 1, 2, and 3 of the "SRW Detail Listing." A new listing, "Similarity Ranking by TW," is then produced (see Table 3 for the TW, VYCISLENIE = calculation 1). This listing shows for each TW the number of shared SRW of each of the three types (N1, N2, and N3, Table 3), the total number of shared SRW (NA), and a measure of similarity for the pairs, herein designated as the Similarity Coefficient (SC). The SC is a decimal fraction obtained by dividing the sum of shared SRW for each pair of TWs by the product of the frequencies of the two TWs. (The latter is of course a device for taking into account the differing frequencies
[Table 2. "SRW Detail" listing for the TW VYCISLENIE (calculation 1): adjective dependents (List 1), noun dependents (List 2), and noun governors (List 3). Table body not recoverable.]

[Table 3. "Similarity Ranking by TW" listing for the TW VYCISLENIE (calculation 1): columns N1, N2, N3, NA, and SC. Table body not recoverable.]
of the TWs; other means for determining this coefficient can be utilized.) The pairings for each TW are ordered on the value of the SC. It should be noted that the similarity between TWs is measured in terms of the total number of shared SRW (Column NA of Table 3); it is also possible to express this measurement in terms of shared SRW of any single type.

A third listing was also produced: a listing of the 780 TW-pairs, ordered on the value of the SC. This listing, not reproduced here because of its length, will be referred to as "Ranking of TW-Pairs by SC." Table 4 shows the distribution of the SC as compared with the number of TW pairs.

[Table 4. Distribution of the SC as compared with the number of TW pairs. Graph not recoverable.]

The following discussion is based on the three listings described above. A few additional remarks may be made about the procedure itself, which may be likened to deep-sea fishing with a tea strainer full of holes. The limitations of size are obvious: we have limited ourselves to three of the numerous ways of comparing nouns in terms of their SRW. Other types of SRW that suggest themselves are: verbs, where TW is subject; verbs, where TW is direct object; prepositional phrases as dependents, or governors, of TW; nouns joined to TW through coordinate conjunctions (i.e., "apples" and "grapes" are said to be more similar if "apples and oranges" and "grapes and oranges" occur in text). Some of the holes in our tea strainer are: the neglect of the case of the noun dependent of TW, or the case of the TW when the SRW is a noun governor; the neglect of technical symbols in physical text, as dependent or governor of the TW; the failure to distinguish between different functions of governors or dependents in a noun/noun pair (e.g., the distinction between "subjective" and "objective" genitive); the neglect of transformationally equivalent constructions. In view of these deficiencies (not to mention the problem of statistics), the success of our fishing expedition is open to doubt. Let us then proceed to examine the catch.
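The two machine steps of the procedure (collecting the distinct SRW of each TW, then counting shared SRW and dividing by the product of the TW frequencies) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the retrieval program used in the study: the dictionary layout, the function names, and all word numbers and frequencies below are hypothetical.

```python
from itertools import combinations

# Hypothetical toy input (all values invented for illustration): for each
# Test Word, its text frequency F and three sets of distinct SRW word
# numbers, standing in for Lists 1-3 of the "SRW Detail" listing.
TWS = {
    "calc": {"F": 60, "L1": {11, 12, 13}, "L2": {21, 22}, "L3": {31, 32}},
    "meas": {"F": 300, "L1": {11, 12}, "L2": {21, 29}, "L3": {31}},
    "gas": {"F": 40, "L1": {19}, "L2": {28}, "L3": {31, 39}},
}
LISTS = ("L1", "L2", "L3")


def shared_counts(a, b):
    """Return (N1, N2, N3, NA): shared SRW per type and their total."""
    per_type = [len(a[k] & b[k]) for k in LISTS]
    return (*per_type, sum(per_type))


def similarity_coefficient(a, b):
    """SC = total number of shared SRW / (F_a * F_b)."""
    return shared_counts(a, b)[3] / (a["F"] * b["F"])


def rank_pairs(tws):
    """All unordered TW pairs as (TWi, TWj, NA, SC), highest SC first."""
    rows = []
    for x, y in combinations(sorted(tws), 2):
        na = shared_counts(tws[x], tws[y])[3]
        rows.append((x, y, na, similarity_coefficient(tws[x], tws[y])))
    return sorted(rows, key=lambda r: r[3], reverse=True)
```

In this toy data the pair with the highest SC shares only a single SRW; its coefficient is high because the product of the two frequencies is small. With forty TWs, `combinations` yields the 780 pairs mentioned in the text.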
3. RESULTS

The evaluation of the data contained in our three machine listings is not an easy task. We can scarcely examine and discuss the degrees of similarity of the 780 noun-pairs. The problem of interpretation is also complicated: how completely and accurately should the results correspond with our expectations, as represented in the tentative semantic groupings (Table 1)? Our approach is to deal in a summary manner with the noun-pairs characterized by highest Similarity Coefficients, especially with respect to their intra- and inter-group relationships.

Before proceeding to this discussion, a few preliminary remarks should be made about the data in the various machine listings. The summary of SRW counts for each TW, contained in Table 1, suggests that all TWs do not have the same opportunity for comparison. In the case of "correspondence" (Group 3), a total of only three SRW is noted (Column L4); as a result, this TW should be eliminated from further consideration. In addition, unless at least two, and preferably all three, types of SRW are well represented for a given TW, the SC for that noun will tend to be skewed. As examples, we note all nouns in Group 6 (for which the L3 column predominates), and the nouns in Group 10 (for which the L2 column predominates). In effect, these nouns are "deficient" in certain types of SRW, and require special handling.*

On the printout, "Ranking of TW-Pairs by SC," a number of noun pairs appear at the top end of the scale although the value of column "NA" (see Table 3) is "1," "2," or "3" (i.e., the total number of shared SRW is small); the SC may be high because the product of the frequencies of these two TWs is relatively or extremely low. Our policy has been to discount these pairs in determining significant similarity. The minimum value for NA was arbitrarily set at four. Table 4 shows the distribution of SC by noun-pairs. By any standard, the data shows negative or extremely weak similarity between most of the 780 pairs.

Keeping in mind these emendations of the data, we proceed to the discussion of the noun-pairs characterized by highest SC.

*An abstract of a paper on the proclivity of nouns to enter into certain combinations is cited in Reference 3.
At which point on the curve shall we draw a line, saying that an SC above this value indicates similarity, and that an SC below this value indicates dissimilarity or weak similarity (all this, of course, in terms of reliability)? For purposes of discussion, we propose to set the threshold at .00100--a rigorously high figure. After eliminating pairs whose NA value is less than 4, we find 38 pairs whose SC lies in the range .00100 to .00337 (Table 5). (The first two zeroes are dropped.) The reader may draw his own conclusions about the degree of similarity between the nouns in any given pairing. For purposes of discussion, we will refer to the pairings in terms of our preliminary groupings (Table 1). The following intra- and inter-Group pairings are observed in Table 5:

Nouns of Group    pair with nouns of Group
      1             1, 2
      2             1, 2, 10
      3
      4             5
      5             4, 5, 6, 7
      6             5, 6
      7             5, 7
      8
      9             9
     10             2, 10
     11             5, 11

We note that no pairings appear for nouns of Groups 3 and 8. All other groups except Group 4 are represented by intra-group pairings; to this degree, our expectations are fulfilled, i.e., the data supports our a priori feelings for the similarity between words.
The amount of inter-
Tab le 5 "HIGH RANKING TW-PAIRS" TWI
TWJ
SC
NA
m
calculation I
1
determination study
investigation
consideration
liquid
gas metal crystal
5 5 5
copper ion atom
7 7
height depth length absence
i0
presence
i0
question
Ii
calculation 2 consideration determination investigation measurement study calculation 2
1 2 1 2 1 2 1
323 285 2OO 183 113 101 165
18 9 IS 18 23 4 18
consideration existence investigation absence calculation 2 presence determination consideration calculation a absence existence calculation 2
2 I0 2 i0 1 i0 1 2 1 I0 I0 i
337 267 246 213 139 118 116 173 154 114 107 174
Ii 7 25 8 9 9 14 22 8 7 8
molecule problem 1 metal crystal metal silver compound
7 11 5 5 5 7 4
143 I05 125 104 126 194 156
9 4 6 I0 4 8 4
silver metal
6 S
180 120
copper ion
6 7
106 125
length width width
9 9 9
155 233 125
I0 1 I0 I0
222 101 229 225
4 4 12 11
Ii
240
6
e xi s tence calculation absence existence
problem 2
2
6 13
group pairing may indicate
either that the data is inconclusive, or that our original groupings were too narrow. In fact, two larger groups emerge: one composed of Groups 1 and 2 (perhaps including Group 10), the other composed of Groups 4, 5, 6, and 7. This tendency is more marked if we lower the SC threshold from .00100 to .00070, thereby adding a total of 28 pairs to the number listed in Table 5. For example, nouns of Group 1 are found to pair with those of Group 10, and nouns of Group 4 pair with those of Groups 6 and 7. The data is not statistically conclusive, but strongly suggests the emergence of the two major groups mentioned above.

The amalgamation of Groups 1 and 2 can easily be defended on semantic grounds; since Group 10, as noted above, is subject to aberrant behavior (because of the very high number of noun dependents), its inter-relation with Groups 1 and 2 may not be taken seriously.* Groups 4, 5, 6, and 7, which include the names of chemical elements, mixtures of elements, and components of individual elements, may be taken together semantically as a single sub-class of "object nouns." The physicist tends to say the same things about all nouns in this group.

One of the 38 pairs listed in Table 5 appears to contradict expectation: "liquid"/"problem" (Groups 5 and 11).

*It should also be noted that the noun dependents of Group 10 nouns serve a "subjective" rather than "objective" function. If we had distinguished between the syntactic function of the noun dependent, TWs of Group 10 would be only weakly similar to TWs of Groups 1 and 2.
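The selection of high-ranking pairs (a minimum NA of four and an SC at or above the chosen threshold) and the tally of intra- and inter-Group pairings can be sketched as follows. This is a hedged illustration only: the function name, the row layout (TWi, TWj, NA, SC), and the toy rows and group numbers are assumptions and do not reproduce Table 5.

```python
from collections import defaultdict

# Hypothetical ranked rows (TWi, TWj, NA, SC) and group assignments;
# every value here is invented for illustration.
RANKED = [
    ("a", "b", 18, 0.00323),
    ("c", "d", 2, 0.00500),   # high SC but NA below the minimum
    ("b", "e", 6, 0.00150),
    ("a", "c", 5, 0.00080),   # passes only if the threshold is lowered
]
GROUP_OF = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 10}


def high_ranking_pairs(ranked, group_of, sc_min=0.00100, na_min=4):
    """Keep pairs with NA >= na_min and SC >= sc_min; tally group pairings."""
    kept = [r for r in ranked if r[2] >= na_min and r[3] >= sc_min]
    pairings = defaultdict(set)
    for x, y, _na, _sc in kept:
        gx, gy = group_of[x], group_of[y]
        pairings[gx].add(gy)   # record the pairing in both directions
        pairings[gy].add(gx)
    return kept, {g: sorted(s) for g, s in pairings.items()}
```

Calling the function again with `sc_min=0.00070` mirrors the lowering of the threshold discussed above: more pairs are admitted, and the group tally widens accordingly.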
The four SRW shared by those two nouns include the adjective "certain" and the noun governor "number." The non-discriminatory ("promiscuous") nature of these two SRW is obvious, and one of the refinements that should be introduced in future studies is the neglect of such words as "significant" SRW. (The study of "promiscuity" in adjectives is referred to in Reference 4.) At the present, experience suggests that the distortions introduced by such words are minimal if the number of SRW is sufficiently large.

Our general conclusion is that, with perhaps a few anomalies, the 66 pairings for which the SC is .00070 or higher meet with our expectations. Another aspect of the question remains: many pairs with presumed similarity are not represented on the high end of the SC distribution curve. (If we lower the threshold to include such pairs we shall also encounter many non-similar pairs.) One way of dealing with this problem is to consider the pairs that nouns in each Group form, whether or not the SC is "significantly" high. In lieu of presenting this information in full detail, we show in Table 6 the most highly correlated pairs for a representative noun from each of the Groups (excepting Groups 3, 4, and 8). The most striking aspect of Table 6 is the repetition of intra- and inter-Group pairings noted in Table 5 for high-SC pairings. In other words, the relative value of
[Table 6. Most highly correlated pairs for a representative noun from each of the Groups (excepting Groups 3, 4, and 8). Table body not recoverable.]
the SC appears to be as significant as the absolute value. This result was not expected, and perhaps indicates a greater sensitivity in our measurement than we would have thought reasonable.

Table 6 suggests, but does not prove, the existence of clusters (or "clumps") of TWs, in which the members are closely correlated with each other, and in which no member is closely correlated to any outside word. We have not yet attempted to apply clumping procedures; a better understanding of the data is perhaps a prerequisite to rigorous treatment. For the present, we shall point out a phenomenon that strongly suggests the existence of clumps: the recurrence of the same SRW among several TWs with high mutual correlation. Consider, for example, three Test Words, A, B, and C, which have been found to have a high mutual correlation (i.e., the SC is high between Words A and B, B and C, and A and C); if, in addition, a relatively high proportion of SRW are shared by all three words, the connection between the three words would appear to be considerably strengthened. The recurrence of SRW has not been systematically studied, but a sample is offered as an illustration of this phenomenon. Below, we list SRW of all three types for the TW calculation 1. The underlined SRW are those that also served as SRW for two other TWs (determination and measurement) that are highly correlated to calculation 1.
Table 7

SRW OF CALCULATION 1

Adjective Dependents (L1): TAKOJ (such); ANALOGICNYJ (analogous); DAL'NEJSIJ (further); NAS (our); NEPOSREDSTVENNYJ (direct).

Noun Dependents (L2): ZAVISIMOST' (dependence); MASSA (mass); VELICINA (magnitude); SECENIE (cross-section); KOEFFICIENT (coefficient); MODUL' (modulus); RASSTOJANIE (distance); SILA (force); FORMA (form).

Noun Governors (L3): ZRENIE (view); REZUL'TAT (result); VOZMOZNOST' (possibility); METOD (method).

Table 7 shows eighteen SRW for calculation 1. Of these, one half (nine) also appeared as SRW for both determination and measurement. It would seem that the "togetherness" of these three TWs is strengthened by this feature, which we term "recurrence of SRW." We have no ready formula for determining that recurrence is or is not significant in a given situation. In general, the nature and behavior of individual SRW remain to be studied, so far as their relevance to our problem is concerned.
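The "recurrence of SRW" feature can be expressed as a simple set intersection: the SRW of one TW that also serve as SRW for every other member of a presumed clump. The sketch below is illustrative only; the function name, the representation of an SRW as a (type, word number) pair, and the toy data are assumptions, and the text itself offers no formula for deciding when the resulting proportion is significant.

```python
# Hypothetical SRW sets for three Test Words. Each SRW is a
# (type, word number) pair so that the same word used in two different
# relationships counts as two different SRW. All values are invented.
SRW_BY_TW = {
    "A": {("L1", 1), ("L1", 2), ("L2", 3), ("L3", 4)},
    "B": {("L1", 2), ("L2", 3), ("L3", 4), ("L3", 9)},
    "C": {("L2", 3), ("L3", 4), ("L1", 7)},
}


def recurrence(srw_by_tw, focus, others):
    """Return the SRW of `focus` shared by every TW in `others`,
    together with their proportion among all SRW of `focus`.
    Assumes `focus` has at least one SRW."""
    common = set(srw_by_tw[focus])
    for tw in others:
        common &= srw_by_tw[tw]
    return common, len(common) / len(srw_by_tw[focus])
```

For example, `recurrence(SRW_BY_TW, "A", ["B", "C"])` reports which SRW of A recur with both B and C, which is the kind of count reported for calculation 1, determination, and measurement.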
4. CONCLUSIONS

We conclude that there is considerable agreement between the results of our experiment and an a priori feeling for the similarity of words. Words that are similar in meaning tend to have the same SRW, to a far greater degree than chance would determine. If this conclusion is valid, a large-scale experiment is suggested, using a larger number of Test Words, more SRW types, and a larger amount of text. (The text base in the present experiment proved to be adequate; however, larger amounts of text should, perhaps, remove some of the anomalies.) The question of further refinements in the procedure must also be taken seriously: e.g., we may take into account to some degree the multiple occurrence of an SRW, distinguish different functions of noun governors or noun dependents, and discount the occurrences of "promiscuous" SRW. Clumping procedures should be applied, taking into account the recurrence of individual SRW among a group of Test Words.
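One of the refinements named above, discounting "promiscuous" SRW, could be approximated by dropping any SRW that occurs with more than a chosen share of the Test Words before shared SRW are counted. The paper does not specify such a rule; the cutoff `max_share`, the function name, and the toy data below are all assumptions made for this sketch.

```python
from collections import Counter

# Hypothetical SRW sets (adjective modifiers) for four Test Words.
SRW_SETS = {
    "A": {"certain", "ripe"},
    "B": {"certain", "modern"},
    "C": {"certain", "ripe"},
    "D": {"modern"},
}


def drop_promiscuous(srw_by_tw, max_share=0.5):
    """Remove SRW that occur with more than max_share of all TWs.
    The cutoff is an assumed heuristic, not a rule from the study."""
    counts = Counter(w for words in srw_by_tw.values() for w in words)
    limit = max_share * len(srw_by_tw)
    return {
        tw: {w for w in words if counts[w] <= limit}
        for tw, words in srw_by_tw.items()
    }
```

Here "certain" occurs with three of the four TWs and is removed, while the more discriminating modifiers are kept; the filtered sets can then be compared pair by pair as before.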
REFERENCES

1. Hays, D. G., and T. W. Ziehe, Russian Sentence-Structure Determination, The RAND Corporation, RM-2538, April 1960.

2. Hays, D. G., Basic Principles and Technical Variations in Sentence-Structure Determination, The RAND Corporation, P-1981, April 1960.

3. Harper, K. E., "A Study of the Combinatorial Properties of Russian Nouns," Mechanical Translation, August 1963, p. 36.

4. Harper, K. E., Procedures for the Determination of Distributional Classes, The RAND Corporation, RM-2713, January 1961.