Program and results: Multilingual Lexical Resources and Tools

This page is a report on the CLARA training course on Multilingual Lexical Resources and Tools. It contains hyperlinks to selected materials used in the course.

Valeria Quochi (University of Pisa): Lexical Resources and Standards: WordNet – LMF

Lexical resources are of great value in today's Human Language Technology (HLT) for many different types of stakeholders: researchers and companies in the first place, but also translators, cultural heritage curators, etc.

Standards are asserted to be a prerequisite for interoperability, and are also important for maintainability. They allow for easy comparison of different resources, for sharing methodologies across languages, and for sharing software and APIs. They also facilitate documentation of the resources and format maintenance. Of course, they require some extra effort to convert, map, or build one's own resource in a standard format (usually XML). However, the advantages seem to outweigh the costs, all the more so in the emerging data-distribution and e-infrastructure paradigms.

This four-hour course aimed to raise awareness among young researchers of standards for interoperability and to provide useful insights, in particular into the ISO LMF standard. This was achieved by presenting LMF and showing how different types of (semantic) lexical resources can be represented in it. The focus of the course was on wordnets, for which an LMF version has been defined and applied to various wordnets within the KYOTO project, so that interesting issues related to multilinguality were also tackled. Students did exercises on encoding and manipulating sample entries.
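For illustration, the following minimal Python sketch (standard library only) builds a wordnet-style entry in an LMF-like XML structure. The element and attribute names follow the general shape of LMF (LexicalResource, Lexicon, LexicalEntry, Lemma, Sense, Synset) but are simplified, and the synset identifiers and glosses are invented; this is not a validated Kyoto-LMF document.

```python
# A minimal sketch (not the official Kyoto-LMF schema) of how a wordnet-style
# entry might be serialized in an LMF-like XML structure. Names and IDs are
# simplified illustrations.
import xml.etree.ElementTree as ET

def make_entry(lemma, pos, synset_id, gloss):
    """Build one LexicalEntry plus the Synset its sense points to."""
    entry = ET.Element("LexicalEntry", id=f"le_{lemma}_{pos}")
    ET.SubElement(entry, "Lemma", writtenForm=lemma, partOfSpeech=pos)
    ET.SubElement(entry, "Sense", id=f"{lemma}_{pos}_1", synset=synset_id)

    synset = ET.Element("Synset", id=synset_id)
    ET.SubElement(synset, "Definition", gloss=gloss)
    return entry, synset

lexicon = ET.Element("Lexicon", language="en")
for args in [("dog", "n", "syn-en-1", "a domesticated canid"),
             ("cat", "n", "syn-en-2", "a small domesticated felid")]:
    entry, synset = make_entry(*args)
    lexicon.append(entry)
    lexicon.append(synset)

resource = ET.Element("LexicalResource")
resource.append(lexicon)
ET.indent(resource)                       # pretty-printing; Python 3.9+
print(ET.tostring(resource, encoding="unicode"))
```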

The last part of the course introduced the issue of semantic interoperability, which is still largely unsolved and for which no established solution exists yet. Some exercises were proposed so that students could become acquainted with the issue and its related problems, and perhaps come up with interesting novel ideas for possible working solutions. ISOcat was briefly introduced as the current proposal for attempting to reach semantic interoperability on a broad scale.

Recommended readings:

Francopoulo G. et al. 2006. Lexical Markup Framework (LMF). In: LREC 2006 – 5th International Conference on Language Resources and Evaluation. Proceedings, pp. 233 – 236. http://lirics.loria.fr/doc_pub/LMFPaperForLREC2006FinalSubmission6March06.pdf

Francopoulo G. et al. 2009. “Multilingual resources for NLP in the Lexical Markup Framework (LMF)”. In: G. Sérasset, A. Witt, U. Heid, F. Sasaki (eds.) Special Issue: Multilingual Language Resources and Interoperability. Language Resources and Evaluation, vol. 43 (1) pp. 57 – 70. Springer. http://puma.isti.cnr.it/dfdownload.php?ident=/cnr.ilc/2009-A0-002&langver=it&scelta=Metadata

Monachini M. et al. 2011. Kyoto-LMF, WordNet representation format. Version 5. KYOTO Working paper: WP02_TR002_V5_Kyoto_LMF. Technical Report. http://puma.isti.cnr.it/linkdoc.php?idauth=8&idcol=8&icode=2011-TR-001&authority=cnr.ilc&collection=cnr.ilc&langver=it

Software advised: An XML editor and validator.

Links:

www.isocat.org/interface/index.html — We will most probably explore/use ISOcat. Browsing does not require an account, but you can register in advance if you want to try the full functionality.

http://deb.fi.muni.cz/kyoto/user.php — Please register on this platform; we may want to view and download some examples from it.

http://www.clarin-it.it/Simple/

Sample materials

Slides Part 1

Other useful readings:

Francopoulo G. and M. George. 2008. ISO-24613: Language Resource Management – Lexical Markup Framework (LMF). FDIS revision 16: 21 March 2008. (Section 5, Annexes G, H, I, J) http://tagmatica.fr/doc/iso_tc37_sc4_n453_rev16_FDIS_24613_LMF.pdf

Francopoulo G. 2005. Extended examples of lexicons using LMF (auxiliary working paper for LMF). http://lirics.loria.fr/doc_pub/ExtendedExamplesOfLexiconsUsingLMF29August05.pdf

Toral A. et al. 2010. Rejuvenating the Italian WordNet: upgrading, standardising, extending. In: GWC2010 – 5th Global Wordnet Conference (Mumbai, India, 31 January – 4 February 2010). Proceedings. P. Bhattacharyya, C. Fellbaum, P. Vossen (eds.). Narosa Publishing House, 2010. (http://puma.isti.cnr.it/dfdownload.php?ident=/cnr.ilc/2010-A2-025&langver=it&scelta=Metadata)

Martha Thunes (University of Bergen): A classification model for translational relations

A clear understanding of translational relations is essential for any work in multilingual resources and tools. This one-hour course presented a few topics from Thunes (2011), which is a study of translational complexity in English-Norwegian parallel texts. It focused on the model for analysing translationally corresponding text units. Using the finite clause as the basic unit of translation, the chosen parallel texts are segmented into pairs of corresponding strings. Each string pair is classified according to a hierarchy of four correspondence types, reflecting the complexity of the relation between source and target string.

In type 1, the least complex type, the corresponding strings are pragmatically, semantically, and syntactically equivalent, down to the level of the sequence of word forms. In type 2, source and target string are pragmatically and semantically equivalent, and equivalent with respect to syntactic functions, but there is at least one mismatch in the sequence of constituents or in the use of grammatical form words. In type 3, source and target string are pragmatically and semantically equivalent, but there is at least one structural difference violating syntactic functional equivalence between the strings. In type 4, there is at least one linguistically non-predictable semantic discrepancy between source and target string. That is, type 4 covers correspondences where the translation cannot be predicted from the source expression together with information about the source and target languages and their interrelations; such cases are probably outside the scope of automatic translation.
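The hierarchy can be pictured as a simple decision procedure. The Python sketch below is only an illustration: the equivalence judgments are assumed to be supplied by a human annotator (pragmatic equivalence is folded into the semantic judgment for brevity), and the helper function and toy data are hypothetical, not part of Thunes's annotation procedure. It also tallies the distribution of types over a set of string pairs, the complexity measure discussed next.

```python
# A minimal sketch of the four-way correspondence hierarchy described above.
# Equivalence judgments are assumed to come from a human annotator; the
# function only encodes the decision order (hypothetical helper).
from collections import Counter

def correspondence_type(semantically_equivalent: bool,
                        syntactically_equivalent: bool,
                        word_order_identical: bool) -> int:
    if not semantically_equivalent:
        return 4   # linguistically non-predictable semantic discrepancy
    if not syntactically_equivalent:
        return 3   # structural difference violating syntactic equivalence
    if not word_order_identical:
        return 2   # mismatch in constituent order or grammatical form words
    return 1       # equivalent down to the sequence of word forms

# Toy annotated string pairs: (semantic, syntactic, word-order) judgments.
annotated_pairs = [(True, True, True), (True, True, False),
                   (True, False, False), (False, False, False),
                   (True, True, False)]

distribution = Counter(correspondence_type(*p) for p in annotated_pairs)
total = sum(distribution.values())
for t in sorted(distribution):
    print(f"type {t}: {distribution[t] / total:.0%}")
```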

The distribution of the four correspondence types within the set of string pairs extracted from a given body of parallel texts is a measure of the degree of translational complexity in that text pair. In the parallel texts analysed, the target texts were produced by human translators, and the identification and classification of string pairs was also done manually. The study involves two text types, narrative fiction and law texts. The analysed data cover about 68,000 words and include comparable amounts of text for each direction of translation and for each text type.

Reference:

Thunes, Martha. 2011. Complexity in Translation. An English-Norwegian Study of Two Text Types. Doctoral dissertation. University of Bergen.

Slides

Recommended reading:

The abstract from Thunes (2011) is a good introduction. In chapter 1, sections 1.1, 1.2, and 1.3 with subsections are most relevant. In chapter 2, sections 2.2.4, 2.3 with subsections and 2.4.2 with subsections are recommended. In chapter 3, suggested sections are 3.2.4 and 3.2.5. The table of contents and bibliography are provided for your reference.

Miloš Jakubíček (Lexicography MasterClass Ltd and Masaryk Univ. Brno): Exploiting Corpora with Sketch Engine

It is now about two decades since the significant turn in linguistic analysis that moved researchers from language introspection to language observation: in the 21st century, language examples must always be found, not invented. Since then, great progress has been made in the field of corpus linguistics, and corpus-based linguistic studies have become a standard investigation method. Moreover, thanks to numerous NLP teams and the tools they have developed, corpora now serve as a statistical backbone in many practical applications.

As usual, new technology brings new challenges: these started with finding effective ways of preparing, storing and encoding corpora and of retrieving basic search results and statistics; they currently continue with extended corpus annotation, complex querying and the provision of contextual characteristics of various language phenomena; and there is always the problem of where and how to obtain large corpus sources.

These sessions provided an introduction to state-of-the-art methods of corpus building, processing and querying, first summarizing the necessary background knowledge and then applying it within the Sketch Engine corpus management suite. The practical parts included exercises in building custom corpora from source texts as well as from the web using the WebBootCaT tool, searching corpora with complex queries in CQL, obtaining statistical characteristics, and investigating the context-dependent behaviour of a word with respect to particular grammatical relations, which introduced the concept of word sketches. Finally, several applications built on top of word sketches were demonstrated.
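As a rough illustration of the kind of querying and statistics exercised in these sessions, the following toy Python sketch produces a keyword-in-context concordance and computes the logDice association score used for ranking collocates in word sketches. It is not the Sketch Engine implementation or its API; the in-memory corpus and the window-based co-occurrence counting are simplifications.

```python
# A toy sketch of a KWIC concordance and a simple collocation score.
# Not the Sketch Engine implementation; just an illustration on a tiny
# tokenized in-memory corpus.
import math
from collections import Counter

corpus = ("the bank raised interest rates while the river bank "
          "flooded after heavy rain and the bank cut rates again").split()

def kwic(tokens, query, window=3):
    """Yield keyword-in-context lines for every hit of `query`."""
    for i, tok in enumerate(tokens):
        if tok == query:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            yield f"{left:>30} [{tok}] {right}"

def log_dice(tokens, node, collocate, window=2):
    """logDice association score: 14 + log2(2*f_xy / (f_x + f_y))."""
    freq = Counter(tokens)
    f_xy = sum(1 for i, t in enumerate(tokens) if t == node
               and collocate in tokens[max(0, i - window):i + window + 1])
    if f_xy == 0:
        return float("-inf")
    return 14 + math.log2(2 * f_xy / (freq[node] + freq[collocate]))

for line in kwic(corpus, "bank"):
    print(line)
print("logDice(bank, rates) =", round(log_dice(corpus, "bank", "rates"), 2))
```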

Slides

References:

Kilgarriff, A. and Rychlý, P. and Smrž, P. and Tugwell, D.: The Sketch Engine. In Proceedings of EURALEX, 2004 http://promethee.philo.ulg.ac.be/engdep1/download/bacIII/sketch-engine.pdf

Baroni, M. and Kilgarriff, A. and Pomikálek, J. and Rychlý, P.: WebBootCaT: A Web Tool for Instant Corpora. In Proceedings of the EURALEX Conference, 2006 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.5136&rep=rep1&type=pdf

Kilgarriff, A.: I Don’t Believe in Word Senses. In Computers and the Humanities, Volume 31, Springer Verlag 1997 http://arxiv.org/pdf/cmp-lg/9712006

Kilgarriff, A. and Grefenstette, G.: Introduction to the special issue on the web as corpus. Computational Linguistics, Volume 29, MIT Press, 2003 http://www.mitpressjournals.org/doi/pdf/10.1162/089120103322711569

Session schedule:

  1. General overview: today's corpora and their creation
  2. Exercises: corpus building and preparing
  3. Word sense & word sketch
  4. Exercises: corpus querying and exploitation of word sketches

Helge Dyvik (University of Bergen): Semantic Mirrors

In this part it was shown how word-aligned parallel corpora can be used as a source of knowledge for deriving monolingual lexical relations in an inter-subjective way.
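The following toy Python sketch illustrates only the first step of the method: computing a word's first t-image (its translations) and inverse t-image (the back-translations of those translations) from word-aligned pairs. The alignment data below are invented, and the later steps of the method (sense partitioning, semantic fields) are not reproduced.

```python
# A simplified illustration of the first step of the Semantic Mirrors idea:
# from word-aligned translation pairs, compute a word's first t-image and
# its inverse t-image. Toy Norwegian-English data, invented for illustration.
from collections import defaultdict

alignments = [("tak", "roof"), ("tak", "ceiling"), ("tak", "grip"),
              ("grep", "grip"), ("grep", "hold"), ("hold", "hold")]

no_to_en = defaultdict(set)
en_to_no = defaultdict(set)
for no, en in alignments:
    no_to_en[no].add(en)
    en_to_no[en].add(no)

def first_t_image(word):
    """Translations of a Norwegian word observed in the aligned corpus."""
    return no_to_en[word]

def inverse_t_image(word):
    """Union of the back-translations of each word in the first t-image."""
    return {back for en in first_t_image(word) for back in en_to_no[en]}

print("first t-image of 'tak':", first_t_image("tak"))
print("inverse t-image of 'tak':", inverse_t_image("tak"))
```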

Session schedule:

  1. Introduction to the methodology
  2. Examples and applications

Reading: Helge Dyvik: Semantic Mirrors

Gunn Inger Lyse (University of Bergen): Translation-based Word Sense Disambiguation

This part of the course showed how multilingual information (Dyvik, 2005) can be used for Word Sense Disambiguation (WSD) with illustrations for Norwegian-English. WSD is the process of determining the relevant sense of an ambiguous word in context automatically. Automated WSD is relevant for Natural Language Processing systems such as machine translation (MT), information retrieval, information extraction and content analysis.

The most successful WSD approaches to date are so-called supervised machine learning techniques, in which the system learns the contextual characteristics of each sense from a collection of concrete examples of contexts in which that sense typically occurs. Although promising, this approach has two problems: first, the system needs to know which senses to disambiguate between (what is the sense inventory?), and second, it needs many training examples to learn the contextual characteristics of each sense. Preparing such sense-tagged training material manually is costly and time-consuming, and automated methods are therefore desirable.
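As a toy illustration of this supervised set-up, the sketch below trains a Naive Bayes classifier on a handful of invented sense-tagged contexts for the noun bill; real systems of the kind discussed here require far larger training sets.

```python
# A toy illustration of supervised WSD: a classifier learns the contextual
# characteristics of each sense from sense-tagged examples. The tiny training
# set is invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_contexts = [
    "please pay the bill for your phone before the due date",
    "the invoice and the bill arrived in the mail",
    "the duck dipped its bill into the water",
    "the bird opened its bill and sang",
]
train_senses = ["invoice", "invoice", "beak", "beak"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_contexts, train_senses)

test = ["she forgot to pay the telephone bill"]
print(clf.predict(test))   # 'pay' only occurs in 'invoice' contexts here
```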

The Mirrors method (Dyvik, 2005) is explored as an experimental lexical knowledge source for WSD. The method derives word senses, as well as relations between word senses, on the basis of translational correspondences in a parallel corpus. The general idea of using translations to approximate word senses is widely acknowledged; however, it is not easy to evaluate the quality of the semantic information that the Mirrors method produces. The main research question of the presented work is thus: are the translation-based senses and semantic relations of the Mirrors method linguistically motivated from a monolingual point of view? To this end, the monolingual task of WSD can be used as a practical framework for evaluating the usefulness of the Mirrors method as a lexical knowledge source.

The innovative aspect of applying the Mirrors method to WSD is two-fold. First, the Mirrors method is used to obtain sense-tagged data automatically (using cross-lingual data), providing a SemCor-like corpus which allows us to exploit semantically analysed context features in a subsequent WSD classifier. Second, we attempt to generalise from the actually occurring context words by using the semantic network between word senses in the Mirrors method. For instance, if the noun phone was found to co-occur with the ambiguous noun bill-N in the ‘invoice’ sense, and if the classifier can generalise from this to words that are semantically close to phone (according to the Mirrors method), such as telephone, then the presence of only one of them during learning could make both of them ‘known’ to the classifier at classification time.
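A minimal sketch of this feature-generalisation idea is given below; the relatedness sets are an invented stand-in for Mirrors-derived relations, not output of the actual method.

```python
# A sketch of the feature-generalisation idea described above: each context
# word is expanded with a set of semantically related words before training
# or classification, so that seeing 'phone' during training also makes
# 'telephone' informative later. The relatedness sets are invented stand-ins
# for Mirrors-derived relations.
related = {
    "phone": {"telephone"},
    "telephone": {"phone"},
    "invoice": {"bill"},
}

def expand_features(context_words):
    expanded = set(context_words)
    for w in context_words:
        expanded |= related.get(w, set())
    return sorted(expanded)

print(expand_features(["pay", "the", "phone", "bill"]))
# ['bill', 'pay', 'phone', 'telephone', 'the']
```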

The results indicate that the gain from abstracting from context words to classes of semantically related word senses was only marginal compared to a traditional word-based classifier. In terms of classification accuracy, the use of related words from the Mirrors was as good as, and sometimes better than, the traditional word model, but the differences were not statistically significant. Although unfortunate for the purpose of enriching a traditional WSD model with Mirrors-derived information, the lack of a difference between the traditional word model and the Mirrors-based related words nevertheless provides promising indications regarding the plausibility of the Mirrors method.

Reading suggestions:

Dyvik, H. (2005). Translations as a semantic knowledge source. In Proceedings of the Second Baltic Conference on Human Language Technologies. Tallinn University of Technology.

Ide, N. & Wilks, Y. (2006). Making sense about sense. In E. Agirre & P. Edmonds (Eds.), Word Sense Disambiguation: Algorithms and applications (chap. 3). Springer.

Slides

Núria Bel & Muntsa Padró (IULA, Universitat Pompeu Fabra): Cue-based Lexical Information Acquisition

According to the distributional hypothesis, words that can be inserted in the same contexts can be said to belong to the same class (Harris, 1951). Thus, lexical classes are linguistic generalizations drawn from the characteristics of the contexts in which a number of words tend to appear. One approach to lexical information acquisition classifies words by training a classifier on information about their occurrence in selected contexts where words of a given class are expected to occur; for example, members of the class of transitive verbs will appear in passive constructions, while intransitive verbs will not (Merlo and Stevenson, 2001). Following this idea, the task of cue-based lexical information induction has achieved successful results. A learner is supplied with pre-classified examples represented by numerical information about matched and unmatched cues, and the final exercise is to confirm that the data characterized by the linguistically motivated cues indeed support the division into the proposed classes. That is, the whole set of occurrences (tokens) of a word is taken as evidence for its class membership (the class of the type), depending on whether or not the word is observed in a number of particular contexts.
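The set-up can be pictured with a small Python sketch: regular-expression cues are matched against POS-tagged contexts of each target word, the match counts form a feature vector, and the vectors are written out in ARFF format so that they can be loaded into Weka. The corpus lines, tag set, cues and class labels below are all invented for illustration.

```python
# A minimal sketch of cue-based lexical information acquisition: regex cues
# over toy POS-tagged contexts -> per-word feature vectors -> ARFF file for
# Weka. All data and cues are invented for illustration.
import re

# Toy tagged sentences: token/POS, with VBN = past participle, IN = preposition.
corpus = {
    "break": ["the window was broken/VBN by/IN the storm",
              "she breaks/VBZ the rules regularly"],
    "sleep": ["he sleeps/VBZ soundly", "they slept/VBD all day"],
}

# Linguistically motivated cues, e.g. a passive construction suggests transitivity.
cues = {
    "passive": re.compile(r"\w+/VBN\s+by/IN"),
    "third_person_sg": re.compile(r"\w+/VBZ"),
}

def cue_vector(sentences):
    """Count how often each cue matches across a word's occurrences."""
    return [sum(len(cue.findall(s)) for s in sentences) for cue in cues.values()]

with open("verbs.arff", "w") as out:
    out.write("@relation verb_cues\n")
    for name in cues:
        out.write(f"@attribute {name} numeric\n")
    out.write("@attribute class {transitive,intransitive}\n@data\n")
    # Class labels here are pre-assigned training labels, not predictions.
    labels = {"break": "transitive", "sleep": "intransitive"}
    for verb, sents in corpus.items():
        out.write(",".join(map(str, cue_vector(sents))) + f",{labels[verb]}\n")
```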

These training sessions provided the students with the concepts and the tools to start research on cue-based lexical information acquisition. After an introduction to the task of lexical acquisition and a short review of the state of the art, this part of the training guided students toward practical work:

  • Definition of context for a particular lexical class and the cues that can assist in classification.
  • The problems of data-sparseness and noise filtering and their impact on the design of a battery of cues for a particular lexical class.
  • The representation of word occurrences in a vector. Regular Expressions for cue matching and vector building.
  • Fundamentals of Machine Learning methods applied to lexical information acquisition: Naïve Bayes, Decision Trees, and k-means clustering. Supervised vs. unsupervised learning.
  • Experiments with Weka and IULA’s web services.

The course was divided into the following sessions:

  1. Introduction to cue-based lexical information acquisition
  2. Introduction to Machine Learning methods applied to LIA: Naïve Bayes
  3. Distributional Hypothesis and cue identification: the linguistic facts and their implementation as Regular Expressions
  4. Supervised and Unsupervised ML for LIA: Decision Trees and k-means clustering
  5. Hands-on exercises (1)
  6. Hands-on exercises (2)

Readings:

Paola Merlo & Suzanne Stevenson, 2001. Automatic Verb Classification Based on Statistical Distributions of Argument Structure, Computational Linguistics, 27:3. http://www.cs.toronto.edu/~suzanne/papers/cl01-ref.pdf

Tom Mitchell, 1997. Machine Learning. McGraw-Hill. http://www.cs.cmu.edu/~tom/

Núria Bel, Maria Coll, and Gabriela Resnik. 2010. Automatic detection of non-deverbal event nouns for quick lexicon production. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING ’10). Association for Computational Linguistics, Stroudsburg, PA, USA, 46-52. http://portal.acm.org/citation.cfm?id=1873787

Software to be used: WEKA http://www.cs.waikato.ac.nz/ml/weka/

Exercise: http://sites.google.com/site/cuebasedlia/

Data for eventive verbs: http://gilmere.upf.edu/WS/upload/system/files/365/original/weka_eventive_en_freq.arff

Slides Núria, Slides Muntsa

Marita Kristiansen (Norwegian School of Economics): Multilingual termbases and their relevance to ontologies

Ontologies are frequently used as tools to aid language understanding and analysis, for instance in translation, information retrieval, or knowledge management. However, when such ontologies are restricted to specific languages, they are closely related to the terminologies typically found in multilingual termbases.

This part of the course advocated the use of terminologies as found in multilingual termbases, such as the EU's IATE termbase (http://iate.europa.eu/iatediff/) or the Swedish Rikstermbanken (http://www.rikstermbanken.se/), as relevant complementary resources for building the ontologies used in, for instance, wordnets.

Termbases typically cover a high number of domains and concepts, including their representations such as definitions, terms, synonyms and specialized collocations in addition to concept relations. Such resources may be exploited for word sense disambiguation purposes, machine translation or information extraction.

Whereas ontologies are developed to support machine understanding, terminologies are developed to give a comprehensive account of the concepts and specialized vocabulary within one specific domain. An overview was given of the theoretical foundations underlying research on languages for specific purposes (LSP). In particular, terminology theory should guide the language data comprised in termbases, including terms, concepts and concept relations as disclosed in underlying concept systems. These concept systems are in fact increasingly called ontologies in terminology theory as well. Emphasis was placed on what distinguishes “terminological” ontologies from “computational” ontologies.

Some real-life examples based on online multilingual termbases were given to see whether the two kinds of information resources may be complementary, i.e. whether termbase content could benefit the construction and exploitation of other lexical tools supported by ontologies.

Recommended reading:

Project presentations:

Carla Parra: Detection of complex translation equivalents to improve Spanish – German/Norwegian word alignment

Pedro Patiño: Description and representation in language resources of Spanish and English specialized collocations from Free Trade Agreements
