
USING THE “SEMANTIC VECTOR” CONCEPT FOR IMPROVING KNOWLEDGE SOURCES, AND THEIR USE IN LANGUAGE TRANSLATION SOFTWARE

Andreea Stanciu, ASC, MA, National Institute of Research and Development in Mechatronics and Measurement Technique, Bucharest

Abstract: In my daily work at the National Institute of Research and Development in Mechatronics and Measurement Technique, I often come across translations, as new knowledge needs to be drawn not only from national, but also from foreign sources. This has increased my awareness of and interest in translation algorithms, and has made me keen on better understanding the process that lies beneath the seemingly “magical” process of translation embedded in instant translation software. Bearing this in mind, I started studying the mechanisms involved in sensing meaning, finding the appropriate way of conveying the right message and, finally, translating a phrase, or a block of language, into another language. This paper describes how this process is enabled, using a remarkable method developed by authors from the Centre of Technology and Systems UNINOVA, Portugal, and the Federal University of Western Pará UFOPA / IEG / PSI, Brazil, in their paper “Semantic Enrichment of Building and Construction Knowledge Sources Using a Domain Ontology for Classification”; it then looks further into the process of generating meaning and conveying it in another language, and into ways of improving this process.

Keywords: language translation, semantic vector (SV), semantics, semiotics, equivalent terms, statistical methods used in linguistics, co-occurrences, term weights, knowledge sources (KS), signifier, signified, referent, pragmatics.

1. Introduction

Difference between “Language” and “Manner of Speech”. Where does meaning come from and how can it be conveyed?

Translating the real world into meaning is a tricky process. This is why, although the concern for doing so goes back, as in most sciences and techniques, almost to the beginning of history, and has been an endeavour of philosophers ever since, the most recent breakthroughs date to the middle of the 19th century and to the 20th century. Focused at first on written and spoken language, this work was later extended to meaning in general, and was crystallized in the science of semiotics. Building on the work of Saussure, Eco and Foucault, this science explains how language can express and reflect the world we live in (however imperfectly), and how language, left alone, can lead to a newer, independent world, regardless of the surrounding one. So, how does this happen? The philosophers of language show how it happens in spoken and written language. Our starting point is a piece of reality. We can speak, for example, of a “chair” (from the real world). In terms of semiotics, this is called a “referent”. Although it may not seem obvious, the referent is not translated into the manner of speech directly.

The real world is a continuum. It is only our senses, pre-judgements and former knowledge of the world which shape the way we see it and segment it into “pieces of reality”. The real world, thus, cannot be perceived free of a series of conditionalities, some imposed by mere physiology (sight, smell) and some by former knowledge (e.g. the way we perceive and refer to time). Still, in order to communicate, the basis of evolution, we need to be able to express this in a way that is understood by the recipient of the communication. To be able to do this, we need a shared set of information, which is arbitrary. Thus, we use a signifier. The signifier is just a way of expressing what we want to communicate (the signified); it has a “somehow tangible form”, in that it can be perceived using the senses of the receiver, and it is socially shared. In the case of language as “manner of speech”, the signifier can take a phonic / audible form or that of a written word. But, when we come back to nature, do we have any certainty that this signified, or even the signifier, is the very translation of the world? Due to the previously stated causes, neither is. The ultimate reality corresponding to the referent can never be expressed perfectly. The signifier is a socially shared code which corresponds to the signified, but the signified itself can never express its

The Romanian Review Precision Mechanics, Optics & Mechatronics, 2014, No. 46

essence perfectly; it is the mental image of the referent in the mind of each speaker, thus being personal itself, but never perfect. As in the myth of the cave described by Plato, the ultimate reality escapes us.

The relationship between signifier, signified and referent can be summarized as follows:

Signifier – socially shared code, used to express a signified

Referent – ultimate reality, which can never be expressed, but to which the signifier corresponds

Signified – personal expression of the referent

This is better illustrated by the “chair” example. The previous concepts are thus illustrated as follows:

Signifier – the written form of “chair”, or the series of phonemes corresponding to “chair”

Signified – the image of the “chair” in the mind of each speaker

Referent – the actual chair in the real world, which is perceived via perceptions, i.e. mental interpretations of the world

Moreover, once we recognize that this process is arbitrary, we can notice that, for example, a signified can itself be used as a referent. This process can lead to infinite evolutions and to a world of meaning which ignores the real one. Thus, if we think of an actual chair (referent), we make a mental image of it, based on reality (signified), and we communicate using the written or spoken word (signifier), by means of its written form (set of letters) or its spoken form (set of phonemes). But if we use the word “chair” as a referent and we want to associate it with the word “cat”, we use the image of the chair as a signified and the word “cat” as a signifier. Thus, we can call a “chair” a “cat”, and this is true for virtually every other combination:


Signifier – the spoken or written word “cat”

Referent – the chair

Signified – personal expression of the chair

This shows how meaning occurs. But this is true of a word, one of the building blocks of the phrases that need to be translated. How can we make a sentence? A sentence consists of a series of words, like this:

I have a nice cat. (eg. 1)

The relations between words are conditioned by grammar and sense. We can choose various alternate words, but they need to be combined according to a series of rules. This is the conditionality of each language. For example, we can use alternate words for “I”, “have”, “a”, “nice” and “cat”, but they also need to respect the conditions imposed by grammar. Our example could take a virtually infinite number of forms, all of which must respect the conditions imposed by grammar and meaning:

He has some cool cards.
We have a little time.
I have a nice cat.
Anna has two brothers.
John has nice cards.
…

Due to these reasons, in my opinion, perfect instant translation software, regardless of whether it is used independently (on the web, as a desktop or laptop application) or even embedded in a high-tech robot which could substitute for a human translator, could never be developed. The main cause is that, although grammar includes a finite set of rules, meaning raises a lot of issues. Here is where another science of language comes into place, with its own argument: pragmatics, which shows that underneath any act of communication there lies an “intention of the author”. Even plain talk raises a lot of questions, and we can understand where they occur in light of the things mentioned before about meaning. An example is the following:

I want to go home. (eg. 2)

The “communicative intention” is, let us say, “I want to go to my own house”. Still, the receiver can understand that I want to go to his/her home. This shows that the intention of the author, the person who makes the statement, cannot be perceived perfectly. This is due to formal causes (the “communicative intention”), namely the imperfect (ambiguous or vague) sentence. Also, the resulting meaning depends on the previous knowledge of the receiver. Thus, the final result is called the “intention of the receiver” and is the perceived meaning. For example, as mentioned, the receiver may understand the sentence in the wrong way, namely that the emitter intends to come to the receiver's house. Thus, the conclusion is that the transmission of the message depends on the previous knowledge of the emitter, on the previous knowledge of the receiver and on inner aspects of language. This brings along a lot of bias that needs to be countered for a proper transmission of the message, and the part that depends on the imperfection of language itself cannot be counterbalanced by any means. This also brings another aspect into the equation: context. The relations between emitter, receiver, message and intentions can be summarized below:


Previous knowledge of the Emitter

Conditionalities related to language

Emitter

Message

Context

Receiver

Previous knowledge of the Receiver

This previous example aimed to show what happens when someone who is “well intentioned” wants to convey a message. What if this is not the case, and we speak of intended misleading? We might say that we want to help someone while not wanting that at all. A classic example of divergence between meaning and form is the kind demand expressed as a question:

Will you close the door? (eg. 3)

The light in which we interpret this question, if it is addressed to us, is conditioned by a lot of things. In itself, it can be a kind request, a question about someone's capacity to close the door, a puzzle, or even something else. In the light of what was shown in relation to the occurrence of meaning, it could be a coded message (for spies or not) for virtually anything. Jason Bourne could use it to ask whether a bombing was completed.

2. Issues raised by context, intentions of emitter and receiver, linguistic conditionalities, and other aspects related to translation software

I made this (quite) wide exposé of the linguistic conditionalities and theories which apply to how meaning is conveyed. Without former knowledge of the context, of the previous knowledge of the emitter and of the receiver, and very good knowledge of the conditionalities (semantic, grammatical) of the language used, the message would suffer an erroneous decoding. And even if all former knowledge of the mentioned aspects is very good, the message could still fail to be understood, even if no formal or semantic noise occurred. So, when it comes to translating a phrase from one language to another, it is clear that the context of communication and the previous sets of knowledge of both the receiver and the emitter are unknown to the software or to the machine. The parameter which can be optimized more easily is grammar. Generally, if every grammatical form has a very strict correspondent in terms of signifier, then grammar correction is not very hard to achieve. This is the case in most Romance languages, which generally have individualized verbal forms. The conjugation of the verb “traer” (“to bring”) in Spanish:

Yo traigo
Tú traes
Él trae
Nosotros traemos
Vosotros traéis
Ellos traen

(eg. 4)

This shows that, due to the verbal endings, the correlation between the verbal form and the pronoun is univocal. This becomes a very serious problem when translating from languages such as Japanese, where the verbal form is the same for all persons and only the tense of the verb differs:

Watashi wa Andreea desu. Anata wa John desu. (copula “desu”, present tense) (eg. 5)

This is also true for English, where there is no univocal correspondence:

They are going home. → Ellos van a casa. (Spanish)
You are going home. → Tú vas a casa. (Spanish) (eg. 6)
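The univocal correspondence described above can be checked mechanically: a verbal paradigm is univocal when every verb form determines its subject pronoun uniquely. A minimal sketch (the form-to-pronoun tables are illustrative, not a full paradigm):

```python
# Map each verb form to the set of subject pronouns it can combine with.
# Spanish "traer": every ending selects exactly one grammatical person.
SPANISH_TRAER = {"traigo": {"yo"}, "traes": {"tú"}, "trae": {"él"},
                 "traemos": {"nosotros"}, "traéis": {"vosotros"},
                 "traen": {"ellos"}}

# English "go": the bare form is shared by several persons, so the
# pronoun cannot be recovered from the verb form alone.
ENGLISH_GO = {"go": {"I", "you", "we", "they"}, "goes": {"he", "she", "it"}}

def is_univocal(paradigm):
    """True when every verb form maps to exactly one subject pronoun."""
    return all(len(pronouns) == 1 for pronouns in paradigm.values())
```

When `is_univocal` returns False, a translator cannot recover the person from the verb form and must fall back on context (an explicit pronoun, the surrounding sentence), which is exactly the difficulty raised by Japanese and English above.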


Still, with proper adaptation, most translation software does quite a good job. Approaches such as “sounds like”, which identify paronyms or words that are simply misspelled, are also very helpful.

3. The Semantic Enrichment of Building and Construction Knowledge Sources Using a Domain Ontology for Classification method – a new and valuable approach for translation software

The biggest problem when it comes to translation software is not related, as one might think, to the identification of the general type of context (formal or informal, colloquial, scientific and so on), but of the peculiar one. The former can be handled easily if we create a very good database of all linguistic contexts; but, as Ruben Costa, Celson Lima, João Sarraipa and Ricardo Jardim-Gonçalves show in their paper “Semantic Enrichment of Building and Construction Knowledge Sources Using a Domain Ontology for Classification”, the harder issue is to identify the very peculiar one. They have shown, in their brilliant piece of work, that classical methods used for processing knowledge in the form of texts, such as methods related to equivalent terms, weights, relations amongst words and other elements, are useful, but that in order to achieve better results an ontology, which is a way to represent knowledge within a specific domain, must be considered. The authors use in their approach a broader framework than that used in translation software. Thus, they use the term “knowledge representation techniques”, encompassing a much broader area. In this light, these techniques are used for “bringing human understanding of the meaning of data to the world of machines. Such techniques create knowledge representations of knowledge sources (KS), whether they are web pages or documents”. This approach aims to optimize the representations of the KS. In order to do this, another concept comes into place: the Semantic Vector (SV) approach, based on the combination of the Vector Space Model (VSM) approach and a domain-specific ontology.
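In the VSM part of this combination, each KS is represented as a vector of term weights, and two KSs are compared by the cosine of the angle between their vectors. A minimal sketch with made-up weights:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

doc_a = {"architect": 0.7, "design": 0.5}   # illustrative tf-idf weights
doc_b = {"design": 0.6, "concrete": 0.8}
similarity = cosine_similarity(doc_a, doc_b)  # only "design" overlaps
```

Note that such a purely statistical comparison sees no relation at all between “architect” and “concrete”; this is the gap the ontology is meant to fill.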


Their supporting argument is that, in a document corpus, words like “architect” and “design” may not appear together frequently, so in this situation statistical methods will fail to infer that such terms are somehow related. The Vector Space Model approach, which formerly treated the document as a “bag of words”, ignoring the dependence between terms by assuming that terms in a document occur independently of each other, is therefore enriched by adding an ontology to the procedure. The sense in which “ontology” is used here is that of “a way to represent knowledge within a specific domain”. The authors used “K-Means clustering”, an unsupervised classification algorithm, to measure the effectiveness of their approach, as such an algorithm is inherently limited to what can be inferred from the given information, without knowing the knowledge field to which the information is related (in our example, constructions). The method and the underlying algorithm consist of two main modules. The first one is the Document Analysis Module. This module consists of two sub-steps, Term Extraction (a) and Term Selection (b), whose purpose is to reduce the dimensionality of the source document set. Term Extraction (a) breaks the document into sentences and extracts the terms in each sentence as “tokens”; the “tokens” are then converted to lower case. Subsequently, terms belonging to a predefined set of words (stop words) are removed and the remaining terms are converted to their base forms, using the “Snowball” method. Terms with the same stem are combined and their frequency is counted. Next, the “tokens” that are shorter than 4 characters or longer than 50 are discarded and the generation of n-grams is initiated. As terms with low frequencies are considered to represent noise, and are thus useless, the tf-idf (term frequency – inverse document frequency) method is applied. This way, key terms can be identified.
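The Term Extraction sub-step can be sketched as follows. This is a simplified reading of the pipeline: the stop-word list is illustrative, a crude suffix-stripper stands in for the Snowball stemmer, and n-gram generation and the tf-idf filter are omitted:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def simple_stem(token: str) -> str:
    # Crude stand-in for the Snowball stemmer used by the authors.
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def extract_terms(document: str) -> Counter:
    """Term Extraction sketch: sentence split, tokenization, lowercasing,
    stop-word removal, stemming, length filter, frequency count."""
    counts = Counter()
    for sentence in re.split(r"[.!?]", document):
        for token in re.findall(r"[a-zA-Z]+", sentence.lower()):
            if token in STOPWORDS:
                continue
            stem = simple_stem(token)
            if 4 <= len(stem) <= 50:   # discard very short / very long terms
                counts[stem] += 1
    return counts

doc = "The architect designs the building. Architects review building designs."
terms = extract_terms(doc)  # "architects" and "architect" share one stem
```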
Still, the main limitation of this method is that long documents tend to receive higher weights than short ones. Thus, the authors use a series of normalization equations to prevent this bias towards long documents.
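The exact equations are not reproduced here, but the standard remedy is length normalization: each document's tf-idf weights are divided by the vector's Euclidean norm, so every document ends up with a unit-length vector regardless of its size. A sketch in that spirit (following common tf-idf practice, not necessarily the authors' exact formulation):

```python
import math

def tfidf_vectors(docs):
    """docs: list of token lists. Returns cosine-normalized tf-idf vectors,
    so long documents do not automatically outweigh short ones."""
    n = len(docs)
    df = {}                              # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        raw = {t: doc.count(t) * math.log(n / df[t]) for t in set(doc)}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors
```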


After performing these steps, the authors choose all terms with a tf-idf score greater than or equal to 0.001 and retain them as the set of key terms for the document set D. Then the Term Selection phase follows. This consists in generating a statistical vector for a given document, denoted di. The document is now translated into a logical unit of text, characterized by a set of key terms ti together with their corresponding frequencies fi, and can be represented as di = {(t1, fi1), (t2, fi2), …, (tm, fim)}. The following step, the second main module, is the Semantic Enrichment Module. This is the phase in which a new, reduced term vector taking into consideration ontology concepts, called the Semantic Vector (SV), is developed for all the documents in D. This semantic vector is represented by two columns: the first column contains the concepts that build up the knowledge representation of the KS, i.e. the most relevant concepts for contextualizing the information within the KS, whereas the second column contains the degree of relevance, or weight, of the concept in the KS. The approach is built on three interconnected procedures for building up the semantic vector, each iteration being expected to add new semantic enrichment to the KS representation: keyword-based, taxonomy-based and ontology-based semantic vectors. By applying these two phases of the algorithm, the authors could see better where a term in a document is best included, in terms of a field of knowledge. For example, the authors were able to differentiate between terms in the “Sanitary Laundry” and “Cleaning” categories by using vectors which take into account the ontology concepts and their relationships, and to add to the vector new concepts that are highly similar to concepts present in the document, but are not stated “per se” in its contents.
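The taxonomy-based procedure can be illustrated with a toy sketch (the taxonomy, concept names and damping factor are all hypothetical, not the authors' data): each concept found in the statistical vector passes a damped share of its weight up to its ancestor concepts, which is how concepts not stated “per se” in the document can enter the semantic vector:

```python
# Hypothetical toy taxonomy: each concept maps to its parent concept.
TAXONOMY = {"laundry_machine": "sanitary_laundry",
            "sanitary_laundry": "building_services",
            "mop": "cleaning",
            "cleaning": "building_services"}

def taxonomy_enriched_vector(stat_vector, damping=0.5):
    """Sketch of a taxonomy-based semantic vector: every concept in the
    statistical vector also contributes a damped weight to its ancestors."""
    semantic = dict(stat_vector)
    for concept, weight in stat_vector.items():
        parent, w = TAXONOMY.get(concept), weight * damping
        while parent is not None:
            semantic[parent] = semantic.get(parent, 0.0) + w
            parent, w = TAXONOMY.get(parent), w * damping
    return semantic

# "sanitary_laundry" and "cleaning" enter the vector without appearing
# in the document itself.
sv = taxonomy_enriched_vector({"laundry_machine": 0.8, "mop": 0.4})
```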
Given the findings of this method, it is clear that it can and should be used for improving translation software, by limiting the use of certain terms in certain documents after identifying the fields addressed in the text (its context).

4. Conclusions

Building translation software, be it used independently as a software application, integrated into web pages (amongst which the best known is the Google Translate application), embedded in portable devices (smartphones or tablets) or, anticipating an SF scenario, in robots that perform the functions of human interpreters, is not an easy job. We currently think that language theories shed a great deal of light on how the process of creating meaning occurs, and without this understanding the creation of translation software would be impossible

to say the least. But this does not bring great news: generating meaning is related not only to drawbacks caused by the imperfection of language itself or by the competence of the speaker, but also to the context of speech, to the previous knowledge of both the emitter and the receiver of the message, and to the “communicative intention”. Together, these issues make it very hard to convey a meaning even within one language, and much harder when it comes to translating from one language to another (the famous saying “Traduttore, traditore”, invoked by Eco, is very telling of that). Still, some results can be attained even in this area. By using the Semantic Vector (SV) approach described above, based on the combination of the Vector Space Model (VSM) approach and a domain-specific ontology, great results can be attained; its integration into translation software is of very high value, as it is able to identify terms that belong to one specific area of knowledge, which could then be used to eliminate a series of other words that are not adequate as translations in a certain context. Also, speaking of other issues related to the difficulties of translation, I believe that more importance should be given to the identification of verbal forms, as they are the building blocks of sentences, especially when they are implicit. Moreover, more work should be done to differentiate at least formal and informal contexts of speech, and more nuances should be added to the translation of terms. Another idea would be to create an environment similar to the concept of “middleware” in programming, that is, a language that is very precise, has many nuances and has very well individualized verbal and personal forms, in order to ensure a better translation between the two languages involved in the translation process.

5. References

[1]. Ruben Costa, Celson Lima, João Sarraipa and Ricardo Jardim-Gonçalves, “Semantic Enrichment of Building and Construction Knowledge Sources Using a Domain Ontology for Classification”, INCAMM 2013 Conference, Chennai (Madras), India, July 4th-6th, 2013.
[2]. P. Figueiras, R. Costa, L. Paiva, R. Jardim-Gonçalves and C. Lima, “Information Retrieval in Collaborative Engineering Projects – A Vector Space Model Approach”, in Knowledge Engineering and Ontology Development Conference 2012, Barcelona, Spain, 2012.
[3]. G. Salton, A. Wong and C. S. Yang, “A Vector Space Model for Automatic Indexing”, Communications of the ACM, vol. 18, no. 11, pp. 613-620, November 1975.
[4]. R. Costa, P. Figueiras, L. Paiva, R. Jardim-Gonçalves and C. Lima, “Capturing Knowledge Representations Using Semantic Relationships”, in The Sixth International Conference on Advances in Semantic Processing, Barcelona, Spain, 2012.
[5]. S. Scott and S. Matwin, “Text Classification Using WordNet Hypernyms”, in Use of WordNet in Natural Language Processing Systems, 1998.
[6]. D. Mladenic and M. Grobelnik, “Feature Selection for Classification Based on Text Hierarchy”, in Conference on Automated Learning and Discovery, 1998.
[7]. T. Gruber, “Toward principles for the design of ontologies used for knowledge sharing”, International Journal of Human-Computer Studies, pp. 907-928, 1993.
[8]. J. MacQueen, “Some methods for classification and analysis of multivariate observations”, Berkeley, 1967.
[9]. Rapid-I GmbH, “RapidMiner”, 2012. [Online]. Available: http://rapid-i.com. [Accessed 3 September 2012].
[10]. G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval”, Information Processing and Management, pp. 513-523, 1988.
[11]. M. Nagarajan, A. Sheth, M. Aguilera, K. Keeton, A. Merchant and M. Uysal, “Altering Document Term Vectors for Classification – Ontologies as Expectations of Co-occurrence”, in 16th International Conference on World Wide Web, Alberta, 2007.