1965 International Conference on Computational Linguistics

MEASUREMENT OF SIMILARITY BETWEEN NOUNS

Kenneth E. Harper

The RAND Corporation
1700 Main Street
Santa Monica, California 90406

ABSTRACT

A study was made of the degree of similarity between pairs of Russian nouns, as expressed by their tendency to occur in sentences with identical words in identical syntactic relationships. A similarity matrix was prepared for forty nouns; for each pair of nouns the number of shared (i) adjective dependents, (ii) noun dependents, and (iii) noun governors was automatically retrieved from machine-processed text. The similarity coefficient for each pair was determined as the ratio of the total of such shared words to the product of the frequencies of occurrence of the two nouns in the text. The 780 pairs were ranked according to this coefficient. The text comprised 120,000 running words of physics text processed at The RAND Corporation; the frequencies of occurrence of the forty nouns in this text ranged from 42 to 328.

The results suggest that the sample of text is of sufficient size to be useful for the intended purpose. Many noun pairs with similar properties (synonymy, antonymy, derivation from distributionally similar verbs, etc.) are characterized by high similarity coefficients; the converse is not observed. The relevance of various syntactic relationships as criteria for the measurement of similarity is discussed.


MEASUREMENT OF SIMILARITY BETWEEN NOUNS

1. INTRODUCTION

One of the goals of studies in Distributional Semantics is the establishment of word classes on the basis of the observed behavior of words in written texts. A convenient and significant way of discussing "behavior" of words is in terms of syntactic relationship. At the outset, in fact, it is necessary that we treat a word in terms of its Syntactically Related Words (SRW). In a given text, each word bears a given syntactic relationship to a finite number of other words; e.g., a finite number of words (nouns and pronouns) appear as "subject" for each active verb; another group of nouns and pronouns are used as "direct object" of each transitive verb; other words of the class, "adverb," appear as modifiers of a given verb. In each instance we may speak of the related words as SRW of a given verb, so that in our example three different types of SRW emerge; a given SRW is then defined in terms both of word class and specific relationship to the verb. (A given noun may of course belong to two different types of SRW, e.g., as both subject and object of the same verb.) Distributionally, we may compare two verbs in terms of their SRW.

The objective of the present study is to test the premise that "similar" words tend to have the same SRW. This premise is tested, not with verbs, as in the


above in

example,

a given

nouns,

text

(2)

of nouns

to

three

formed

the

from

of

nouns.

types

out

that

("avocado"

and

"cherry")

have

common t h e

in

facts

would

lar,

that

"c"

less

similar,

etc.

A number

of

anyway?

Do w o r d s

a significant

subject

these

is

based

limited

are

questions that

are

areas?

as

is

to

claimed

the

and "b" as

words

the

throw for

possession

foreknowledge

are

share

What

is

also

have to

from

the

differ

that

=

be

light

on

"results"

experiment of

many

of

should some

"a

estab-

effect

texts

the

"c"

simi-

really

in order is

are

"similarity"

text?

using,

attempting the

and

investigation

designed

factors:

(physics),

What

it

These

and " b "

meaning

necessary

of nouns,

modifiers

"a"

dissimilar

and of

no v a l i d i t y

on t h r e e

"a"

a given

pair the

"a"

"modern."

that

find

and "furniture")

in

is

in

("chair"

similar

The p r e s e n t

Our a u d a c i t y

field

adjective

What i s

Do n o t

an e x p e r i m e n t

nouns

arise:

o f SRW i n

of

example:

such

similar)

in words,

questions;

obtained.

"d"

group

and groups

Another

that

to

to express

the

conclude

(i)

by e a c h

text

o f word b e h a v i o r ?

multiple=meaning

regarded

to

(3)

modifier

flow much t e x t

patterns

ent

us

a small

nouns)

" c )' a n d " d "

number"?

common S R W ?

and

share

is

Sill;T s h a r e d

SRI~.

adjective

number

significant

lish

and

group,

in a given

nouns

lead

of

shared

turn

whereas

o f SRW f o r

individual

their

"ripe,"

Our p r o c e d u r e

number

the

between

a function

might

with

find

"similarity" as

but

a text the

at

all

in a multiple

=


meaning probler:l is processing

of text.

i n v i e w o£ t h e reader

We w o u l d h o p e ,

or t h e

2.

conclude however,

that

capability

is

and c o m p l e x i t y

a critical

clearly of

the problem.)

with his

if

the

would n o t employed,

results

do n o t

expectations.

2. PROCEDURE

The present study was based on a series of articles from Russian physics journals, comprising approximately 120,000 running words (some 500 pages). The processing of this text has been described elsewhere (1, 2). Here, we note only that each sentence of text is recorded on magnetic tape, together with the following information for each occurrence in the sentence: its part of speech, its "word number" (an identification number in the machine glossary), and its syntactic "governor" or "dependent" (if any) in the sentence. A retrieval program applied to this text tape then yielded information about the SRW for words in which we were interested. (For convenience and economy, all words in the machine printout are identified by word number, rather than in their "natural-language" form.)

In our study

The

proves nothing.

s u c h an o p i n i o n

of disbelief

automatic

a necessity,

the experiment

that

for

judgment of the procedures

suspension

correspond

and t h e

(The l a t t e r

size

may w e l l

preclude

mininlal,

we chose to deal with the SRW of forty Russian nouns, herein called Test Words (TW). The number is completely arbitrary;

tile p a r t i c u l a r t o form d i f f e r e n t

Table

1) a'ere

presumed

Table

1 gives

one p o s s i b l e

criteria the

for

reader

grouping

may e a s i l y

or c o n t r a c t i n g only purpose

of

similar

in terms

by t h e i r

is

of their

types

comparison either

case

speaking, into

o f SRW.

to three

the

account.

permitting

are

we s h a l l

to compare of their

refer

t o t h e 'rWs

reference

we h a v e c h o s e n t.t}e a d j e c t i v e

syntactic

this

into

factor,

in the

(in

genitive

( t h e TN i s case).

we a r e

norStrictly

measurement will SRW f o r

40 TWs, and o f c o m p a r i n g e a c h TW w i t h

TW, i s

too tedious

t o be a t t e m p t e d .

consciously

on t h e p r e m i s e

of manualiy retrieving

i n two w a y s ,

our

o f t h e SRIq s h o u l d be t a k e n

inexactitudes,

introduced

dif-

t h e noun

in tile g e n i t i v e

function

iimit

dependents

function),

and t h e n o u n g o v e r n o r s

to several

to

of the

was e n l i s t e d ,

The

f o u n d t o be

like

but not necessarily,

In i g n o r i n g

The t a s k

by e x p a n d i n g

understanding

or p r e d i c a t i v e

necessarily,

certain

distortions

ilere,

types:

(normally,

but not

although

equivalents.)

in Russian),

mally,

the

a weak m e a s u r e o f

SRN, we s h o u l d

some i n t u i t i v e

attributive

depend.ents

groups,

two n o u n s

Two n o u n s may be c o m p a r e d w i t h ferent

words;

obvious,

(see

groupings.

we h a v e d e s i g n a t e d .

if

(For convenience, English

of these

to provide

experiment:

with

similarity.

that

semantic

less

form d i f f e r e n t

grouping

in the

finding

a r e more o r

the groups

control

this

grouping

nouns chosen

that

the

n o t be s e v e r e . each occurrence every

other

The a i d o f t h e c o m p u t e r


Table 1

39 TEST NOUNS

                                   W No.     F    L1    L2    L3    L4

Group 1
calculation 1    vycislenie          782    62    15    23    11    49
measurement      izmerenie          1579   328    29    63    36   128
determination    opredelenie        3324   121     7    39    14    60
calculation 2    rascet             4627    90    12    24    16    53

Group 2
consideration    rassmotrenie       4598    51    14    29     6    49
comparison       sravnenie          5200   106     6    22     4    32
study            izucenie           1610    64     8    44     6    58
investigation    issledovanie       1723   159    32    65    21   118

Group 3
relation         sootnosenie        5111   113    14    18    15    47
ratio            otnosenie          3455   102    14    22     9    45
correspondence   sootvetstvie       5109    29     2     1     0     3

Group 4
solution         rastvor            4608   129     6    22    24    52
compound         soedinenie         5082    15     5     5     6    16
alloy            splav              5182    27     6     2     4    12

Group 5
metal            metall             2400    86    11     2    28    41
gas              gaz                 807    37     7     2     8    17
liquid           zidkost'           1329    56     8     2    15    25
crystal          kristall           2131   171    15    19    44    78

Group 6
uranium          uran               5745   171     0     0    18    18
silver           serebro            4899    48     4     1    17    22
copper           med'               2419    58     2     3    20    25
phosphor         fosfor             5913   130     9     2    34    45

Group 7
proton           proton             4565   125     8     2    27    37
ion              ion                1686    98    14    10    31    55
molecule         molekula           2568   112    18    18    39    75
atom             atom                186   106     9    23    28    60

Group 8
formula          formula            5911   231    20    21    19    60
expression       vyrazenie           739   223    25    12    24    61
equation         uravnenie          5742   412    42    24    32    98

Group 9
width            sirina             6198    43     4     9     9    22
depth            glubina             913    40     6     8     9    23
length           dlina              1194   112    16    21    22    59
height           vysota              764    23     2    11     3    16

Group 10
presence         nalicie            2696   119     3    73     5    81
absence          otsutstvie         3485    44     2    35     1    38
existence        suscestvovanie     5352    41     3    25     6    34

Group 11
question         vopros              615    96     5     3    10    18
problem 1        zadaca             1362    68    15    11    10    36
problem 2        problema           4254    26     4    10     6    20

"W No." = word number; "F" = frequency
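The relationship between the columns of Table 1 can be checked mechanically: L4 is the total of the three SRW lists, and the share of the largest list shows how evenly a noun's SRW are spread over the three types. The following is a minimal Python sketch using three rows copied from the table (uranium, presence, formula); the function name `dominant_share` is an assumption introduced for this illustration and is not part of the original procedure.

```python
# Three rows copied from Table 1: (English gloss, L1, L2, L3, L4).
rows = [
    ("uranium", 0, 0, 18, 18),
    ("presence", 3, 73, 5, 81),
    ("formula", 20, 21, 19, 60),
]


def dominant_share(l1, l2, l3):
    """Return (label, share) for the largest of the three SRW lists.

    Hypothetical helper: 'share' is the fraction of all different SRW
    of the noun that fall in its largest list.
    """
    counts = {"L1": l1, "L2": l2, "L3": l3}
    label = max(counts, key=counts.get)
    return label, counts[label] / sum(counts.values())


for name, l1, l2, l3, l4 in rows:
    # L4 is the total number of different SRW over the three lists.
    assert l1 + l2 + l3 == l4
    label, share = dominant_share(l1, l2, l3)
    print(f"{name}: largest list {label}, {share:.0%} of all SRW")
```

For "uranium" every SRW is a noun governor (L3), and for "presence" the noun dependents (L2) account for most of the total, whereas "formula" is spread almost evenly over the three lists.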


1. Through automatic scanning of the text, each occurrence of the 40 TWs was located, and in each instance the identity (word number) of relevant SRW was recorded. A listing is produced for each of the TWs (see Table 2, "SRW Detail," for an example of the TW, VYCISLENIE = calculation 1), showing the different words used as adjective dependents (List 1), noun dependents (List 2), and noun governors (List 3). The number of words on each of these lists is also shown in Table 1, together with the total number of SRW for each TW (List 4). We stress the fact that these numbers refer to different words used as SRW; the repetition of a given SRW (for a given SRW type) was not recorded.

2. Each TW was automatically compared with every other TW, with respect to their shared SRW, i.e., in terms of the words in Lists 1, 2, and 3 of the "SRW Detail Listing." A new listing, "Similarity Ranking by TW," is then produced (see Table 3 for the TW, VYCISLENIE = calculation 1). This listing shows for each TW the number of shared SRW of each of the three types (N1, N2, and N3, Table 3), the total number of shared SRW (NA), and a measure of similarity for the pairs, herein designated as the Similarity Coefficient (SC). The SC is a decimal fraction obtained by dividing the sum of shared SRW for each pair of TWs by the product of the frequencies of the two TWs. (The latter is of course a device for taking into account the differing frequencies


of the TWs; other means for determining this coefficient can be utilized.) The pairings for each TW are ordered on the value of the SC. It should be noted that the similarity between TWs is measured in terms of the total number of shared SRW (Column NA of Table 3); it is also possible to express this measurement in terms of shared SRW of any single type.

A third listing was also produced: a listing of the 780 TW-pairs, ordered on the value of the SC. This listing, not reproduced here because of its length, will be referred to as "Ranking of TW-Pairs by SC." Table 4 shows the distribution of the SC as compared with the number of TW pairs.

The following discussion is based on the three listings described above. A few additional remarks may be made about the procedure itself, which may be likened to deep-sea fishing with a tea strainer full of holes. The limitations of size are obvious: we have limited ourselves to three of the numerous ways of comparing nouns in terms of their SRW. Other types of SRW that suggest themselves are: verbs, where TW is subject; verbs, where TW is direct object; prepositional phrases as dependents, or governors, of TW; nouns joined to TW through coordinate conjunctions (i.e., "apples" and "grapes" are said to be more similar if "apples and oranges" and "grapes and oranges" occur in text). Some of the holes in our tea strainer are: the neglect of the case of the noun dependent of TW, or the


case of the TW when the SRW is a noun governor; the neglect of technical symbols in physical text, as dependent or governor of the TW; the failure to distinguish between different functions of governors or dependents in a noun/noun pair (e.g., the distinction between "subjective" and "objective" genitive); the neglect of transformationally equivalent constructions. In view of these deficiencies (not to mention the problem of statistics), the success of our fishing expedition is open to doubt. Let us then proceed to examine the catch.
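The comparison step described above (shared SRW per type, their total NA, and the Similarity Coefficient as NA divided by the product of the two frequencies) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original machine program: the function names and the data layout (one set of word numbers per SRW type) are hypothetical, the word numbers are invented, and only the three frequencies are taken from Table 1.

```python
from itertools import combinations

# Hypothetical SRW data: for each Test Word (TW), one set of word numbers per
# SRW type (L1 = adjective dependents, L2 = noun dependents, L3 = noun
# governors). The word numbers are invented for illustration.
srw = {
    "calculation 1": {"L1": {101, 102, 103}, "L2": {201, 202}, "L3": {301, 302}},
    "measurement":   {"L1": {101, 103, 104}, "L2": {202, 203}, "L3": {301}},
    "metal":         {"L1": {105},           "L2": {204},      "L3": {302, 303}},
}

# Frequencies of occurrence (column F of Table 1).
freq = {"calculation 1": 62, "measurement": 328, "metal": 86}


def shared_counts(a, b):
    """Return (N1, N2, N3, NA): shared SRW of each type and their total."""
    n1, n2, n3 = (len(srw[a][t] & srw[b][t]) for t in ("L1", "L2", "L3"))
    return n1, n2, n3, n1 + n2 + n3


def similarity_coefficient(a, b):
    """SC = total shared SRW (NA) / (frequency of a * frequency of b)."""
    na = shared_counts(a, b)[3]
    return na / (freq[a] * freq[b])


def rank_pairs(words):
    """All unordered TW pairs, ordered on descending SC."""
    pairs = combinations(sorted(words), 2)
    return sorted(pairs, key=lambda p: similarity_coefficient(*p), reverse=True)


for a, b in rank_pairs(srw):
    print(a, "/", b, shared_counts(a, b), f"{similarity_coefficient(a, b):.5f}")
```

Enumerating unordered pairs in this way yields n(n-1)/2 comparisons, which for forty TWs gives the 780 pairs mentioned in the text. Sets are used because only different words are counted as SRW; repetitions of the same SRW are not recorded.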

3. RESULTS

The evaluation of the data contained in our three machine listings is not an easy task. The problem of interpretation is also complicated: how completely and accurately should the results correspond with our expectations, as represented in the tentative semantic groupings (Table 1)? We can scarcely examine and discuss the degrees of similarity of 780 noun-pairs. Our approach is to deal in a summary manner with the noun-pairs characterized by highest Similarity Coefficients, especially with respect to their intra- and inter-group relationships.

Before proceeding to this discussion, a few preliminary remarks should be made about the data in the various machine listings. The summary of SRW counts for each TW, contained in Table 1, suggests that all TWs do not have the same opportunity for comparison. In the case of "correspondence" (Group 3), a total of only three SRW is noted (Column L4); as a result, this TW should be eliminated from further consideration. In addition, unless at least two, and preferably all three, types of SRW are well represented for a given TW, the SC for that noun will tend to be skewed. As examples, we note all nouns in Group 6 (for which the L3 column predominates), and the nouns in Group 10 (for which the L2 column predominates). In effect, these nouns are "deficient" in certain types of SRW, and require special

handling. On t h e p r i n t o u t , number o f noun p a i r s although value The

appear at

pairs

low.

on t h e

for

the

that

similarity

NA was a r b i t r a r i l y

to

the discussion

i z e d by h i g h e s t

S(:.

5(2 by n o u n - p a i r s .

i

enter

or e x t r e m e l y A

Table

is

"1,"

at

the

frequencies these

significant The minimum

four. to

the data

of t h e n o u n - p a i r s 3 shows t h e

weak s i m i l a r i t y

(i.e.,

discount

two TWs.

anmndations

By a n y s t a n d a r d ,

a

"~,.," o r " 3 . "

o f "NA" i s

between set

small

of the

has b e e n t o

the value

K e e p i n g i n mind t h e s e We p r o c e e d

4)

SRW i s

the product

Our p o l i c y

grounds

in determining value

because

,t

t o p end o f t h e s c a l e

number o f s h a r e d

may be h i g h ,

relatively

tive

the

of colurnn "NA" ( s e e T a b l e SC

is

the total

" R a n k i n g o f T l ~ - P a i r s by SC,

for

character-

distribution

the

data

in mind,

of

shows n e g a -

most of the

780 p a i r s .

An abstract of a paper on the proclivity of nouns to enter into certain combinations is cited in Reference 3.

At which point on the curve shall we draw a line, saying that an SC above this value indicates similarity, and that an SC below this value indicates dissimilarity or weak similarity (all this of course in terms of reliability)? For purposes of discussion, we propose to set the threshold at .00100, a rigorously high figure. After eliminating pairs whose NA value is less than 4, we find 38 pairs whose SC lies in the range .00100 to .00337 (Table 5). (The first two zeroes are dropped.) The reader may draw his own conclusions about the degree of similarity between the nouns in any given pairing. For purposes of discussion, we will refer to the pairings in terms of our preliminary groupings (Table 1). The following intra- and inter-Group pairings are observed in Table 5:

Nouns of Group     pair with nouns of Group
      1            1, 2
      2            1, 2, 10
      3            (none)
      4            5
      5            4, 5, 6, 7, 11
      6            5, 6
      7            5, 7
      8            (none)
      9            9
     10            2, 10
     11            5, 11

We note that no pairings appear for nouns of Groups 3 and 8. All other groups except Group 4 are represented by intra-group pairings; to this degree, our expectations are fulfilled, i.e., the data supports our a priori feelings for the similarity between words. The amount of inter-


Tab le 5 "HIGH RANKING TW-PAIRS" TWI

TWJ

SC

NA

m

calculation I

1

determination study

investigation

consideration

liquid

gas metal crystal

5 5 5

copper ion atom

7 7

height depth length absence

i0

presence

i0

question

Ii

calculation 2 consideration determination investigation measurement study calculation 2

1 2 1 2 1 2 1

323 285 2OO 183 113 101 165

18 9 IS 18 23 4 18

consideration existence investigation absence calculation 2 presence determination consideration calculation a absence existence calculation 2

2 I0 2 i0 1 i0 1 2 1 I0 I0 i

337 267 246 213 139 118 116 173 154 114 107 174

Ii 7 25 8 9 9 14 22 8 7 8

molecule problem 1 metal crystal metal silver compound

7 11 5 5 5 7 4

143 I05 125 104 126 194 156

9 4 6 I0 4 8 4

silver metal

6 S

180 120

copper ion

6 7

106 125

length width width

9 9 9

155 233 125

I0 1 I0 I0

222 101 229 225

4 4 12 11

Ii

240

6

e xi s tence calculation absence existence

problem 2

2

6 13


group pairing may indicate either that the data is inconclusive, or that our original groupings were too narrow. In fact, two larger groups emerge: one composed of Groups 1 and 2 (perhaps including Group 10), the other composed of Groups 4, 5, 6, and 7. This tendency is more marked if we lower the SC threshold from .00100 to .00070, thereby adding a total of 28 pairs to the number listed in Table 5. For example, nouns of Group 1 are found to pair with those of Group 10, and nouns of Group 4 pair with those of Groups 6 and 7. The data is not statistically conclusive, but strongly suggests the emergence of the two major groups mentioned above.

The amalgamation of Groups 1 and 2 can easily be defended on semantic grounds; since Group 10, as noted above, is subject to aberrant behavior (because of the very high number of noun dependents), its inter-relation with Groups 1 and 2 may not be taken seriously.* Groups 4, 5, 6, and 7, which include the names of chemical elements, mixtures, classes of elements, and components of individual elements, may be taken together semantically as a single sub-class of "object nouns." The physicist tends to say the same things about all nouns in this group.

One of the 38 pairs listed in Table 5 appears to contradict expectation: "liquid"/"problem" (Groups 5 and 11).

*It should also be noted that the noun dependents of Group 10 nouns serve a "subjective" rather than "objective" function. If we had distinguished between the syntactic function of the noun dependent, TWs of Group 10 would be only weakly similar to TWs of Groups 1 and 2.

The four SRW shared by those two nouns include the adjective "certain" and the noun governor "number." The non-discriminatory ("promiscuous") nature of these two SRW is obvious, and one of the refinements that should be introduced in future studies is the neglect of such words as "significant" SRW. (The study of "promiscuity" in adjectives is referred to in Reference 4.) At the present, experience suggests that the distortions introduced by such words are minimal if the number of SRW is sufficiently large. Our general conclusion is that, with perhaps a few anomalies, the 66 pairings for which the SC is .00070 or higher meet with our expectations.

Another aspect of the question remains: many nouns with presumed similarity are not represented on the high end of the SC distribution curve. (If we lower the threshold to include such pairs we shall also encounter many non-similar pairs.) One way of dealing with this problem is to consider the most highly correlated pairs that nouns in each Group form, whether or not the SC is "significantly" high. In lieu of presenting this information in full detail, we show in Table 6 the most closely correlated pairs for a representative noun from each of the Groups (excepting Groups 3, 4, and 8). The most striking aspect of Table 6 is the repetition of intra- and inter-Group pairings noted in Table 5 for high-SC pairings. In other words, the relative value of


the

SC a p p e a r s

This

result

cates than

to

was

Table

as

not

sensitivity have

as

our

and perhaps

but does not prove, of T~s,

in which

to any outside word.

lee have not

of the data is perhaps

clumps:

that strongly suggests

with high mutual a high

5C i s

and A and C; o f SRW a r e of the

three

ened.

The

studied) tration of

the

lined

if,

shared

the

the

three words

corresponding

SRI;~ f o r

measurement ) that to c a l c u l a t i o n

those

1.

are

the

which,

not

high mutual

been

is

Below,

we l i s t

in

T;is

connection

all

the

SRW

The u n d e r -

also

to

strength-

an i l l u s -

served

(determination

correlated

C,

proportion

as

1.

addition,

that

systematically

offered

]'I~ c a l c u l a t i o n

two o t h e r highly

the

TWs

for example,

t o be c o n s i d e r a b l y

sample

phenomenon.

are

Tl~s,

o f SRW h a s

for

out

of

Words A a n d B) B a n d

appear

following

types,

the existence

a relatively

three

would

recurrence

we shall point

Consider,

Test

addition,

by a l l

words

but of

between

in

to this

of the same SRI~~ among several

correlation.

found

a better

a prerequisite

For the present,

the recurrence

are

and in which no member

understanding

a phenomenon

procedures

the members

to apply clumping procedures;

treatment.

indi-

the existence

yet attempted

rigorous

value.

reasonable.

(or "clumps")

correlated

absolute

measurement

closely correlated with each other, is closely

the

expected,

in

thought

6 suggests,

of clusters

significant

certainly

a greater we w o u l d

be

each

as

, and other

and

Table 7

SRW OF CALCULATION 1

Adjective Dependents (L1): TAKOJ (such); ANALOGICNYJ (analogous); DAL'NEJSIJ (further); NAS (our); NEPOSREDSTVENNYJ (direct).

Noun Dependents (L2): ZAVISIMOST' (dependence); MASSA (mass); VELICINA (magnitude); SECENIE (cross-section); KOEFFICIENT (coefficient); MODUL' (modulus); RASSTOJANIE (distance); SILA (force); FORMA (form).

Noun Governors (L3): ZRENIE (view); REZUL'TAT (result); VOZMOZNOST' (possibility); [illegible] (method).

Table 7 shows that one half (nine) of these eighteen SRW for calculation 1 also appeared as SRW for both determination and measurement. It would seem that the "togetherness" of these three TWs is strengthened by this feature, which we term "recurrence of SRW." We have no ready formula for determining that recurrence is or is not significant in a given situation. In general, the nature and behavior of individual SRW remain to be studied, so far as their relevance to our problem is concerned.
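The "recurrence of SRW" feature amounts to a set intersection over the SRW lists of several TWs. The Python sketch below is illustrative only: the function name `recurrence` is an assumption, and the three sets are small invented stand-ins labeled with transliterated words, not the actual SRW lists of calculation 1, determination, and measurement.

```python
def recurrence(srw_sets):
    """Return the SRW shared by every TW in the group, together with the
    proportion of the first TW's SRW that recur in all the others."""
    first, *rest = srw_sets
    common = set(first).intersection(*rest)
    proportion = len(common) / len(first) if first else 0.0
    return common, proportion


# Invented stand-in data; membership in each set is hypothetical.
calculation = {"TAKOJ", "NAS", "MASSA", "SILA", "FORMA", "REZULTAT"}
determination = {"TAKOJ", "MASSA", "SILA", "REZULTAT", "METOD"}
measurement = {"TAKOJ", "MASSA", "REZULTAT", "DLINA"}

common, proportion = recurrence([calculation, determination, measurement])
print(sorted(common), proportion)
```

In this toy case three of the six SRW of the first word recur in both of the others, a proportion of one half. Whether a given proportion is significant is left open here, as it is in the text.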

4. CONCLUSIONS

We conclude that there is considerable agreement between the results of our experiment and an a priori feeling for the similarity of words. Words that are similar in meaning tend to have the same SRW, to a far greater degree than chance would determine. If this conclusion is valid, a large-scale experiment is suggested, using a larger number of Test Words, more SRW types, and a larger amount of text.

proved

t o be a d e q u a t e ;

however, further

remove

occurrences

count

procedures the

e.g.,

in the

for

should of

procedure take

into to

be a p p l i e d ,

should,

also

account

SRIV. taking

of

be t a k e n multiple

some d e g r e e

or noun

perhaps

experiment

The q u e s t i o n

must

of "promiscuous"

individual

of text

anomalies.)

of noun g o v e r n o r s

occurrence

tile present

amounts

o f an SRW, d i s t i n g u i s h

recurrence

Words.

larger

we may a l s o

functions the

base

some o f t h e

refinements

seriously:

ferent

(The t e x t

the

dependents,

difdis-

(:lumping into

SRW among a g r o u p

account

of Test

REFERENCES

1. Hays, D. G., and T. W. Ziehe, Russian Sentence-Structure Determination, The RAND Corporation, RM-2538, April 1960.

2. Hays, D. G., Basic Principles and Technical Variations in Sentence-Structure Determination, The RAND Corporation, P-1981, April 1960.

3. Harper, K. E., "A Study of the Combinatorial Properties of Russian Nouns," Mechanical Translation, August 1963, p. 36.

4. Harper, K. E., Procedures for the Determination of Distributional Classes, The RAND Corporation, RM-2713, January 1961.