Report on presentations at the thematic training course on evaluation of Human Language Technologies


Khalid Choukri, ELDA, and Bente Maegaard, University of Copenhagen, welcomed the participants to the workshop and participants introduced themselves and briefly described the theme of their work.

Khalid Choukri then went on to introduce the field of Language Technology. He showed examples from fields such as speech recognition (including emotion recognition), speech-to-speech translation, Named Entity Recognition, question answering (Watson), subtitling and gesture tracking, and related each theme to evaluation: what to evaluate, since these applications have many aspects that could be evaluated.

In the afternoon Edouard Geoffrois, French National Defence Procurement Agency (DGA), gave an Introduction to Evaluation of Knowledge Processing Technologies, discussing among other themes when to use existing corpora and data collections and when to create your own evaluation data. He also described the roles of automatic vs. human evaluation: human evaluation is useful, and is necessary for usability tests, but it is not suited to measuring progress during development; that is the role of automatic evaluation. Automatic evaluation measures distances, in particular the distance to the goal, and the purpose of the research effort is to reduce this distance. Evaluation campaigns mostly use automatic evaluation, and the well-known shared-task concept is essentially the same as an evaluation campaign.

An important part of the course was the presentation and discussion of workshop participants’ projects, including the need for evaluation. The fellows presented their projects with a specific focus on evaluation questions, followed by a discussion. These presentations took place in two sessions in order to allow for sufficient time for the discussion of each project.

Evaluation of speech: Djamel Mostefa, ELDA, France.

Djamel Mostefa gave a presentation on the evaluation of core speech technologies, namely speaker identification, speaker verification, automatic speech recognition, spoken language understanding and speech synthesis. He detailed the different metrics and methodologies for the evaluation of each technology.

The misclassification rate is commonly used for speaker identification.
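As a minimal illustration (the function name and inputs are our own, not from the talk), the misclassification rate over a set of identification trials is simply the fraction of trials where the identified speaker is not the true one:

```python
def misclassification_rate(true_speakers, predicted_speakers):
    """Fraction of identification trials where the predicted speaker
    differs from the true speaker."""
    errors = sum(t != p for t, p in zip(true_speakers, predicted_speakers))
    return errors / len(true_speakers)
```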

For speaker verification, the false rejection rate and false acceptance rate are used; they can be combined to compute the equal error rate, the geometric mean error rate, cost functions, or a detection error trade-off (DET) plot.
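How these two rates trade off against each other can be sketched in a few lines (the score lists and threshold sweep are illustrative; real evaluations plot DET curves over large trial sets):

```python
def far_frr(genuine, impostor, threshold):
    """False acceptance rate and false rejection rate at one threshold.
    Scores at or above the threshold are accepted as the claimed speaker."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Sweep candidate thresholds; the equal error rate (EER) is
    approximated at the threshold where FAR and FRR are closest."""
    best_diff, eer = float("inf"), None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        if abs(far - frr) < best_diff:
            best_diff, eer = abs(far - frr), (far + frr) / 2
    return eer
```

For perfectly separable genuine and impostor score distributions the EER is 0; as the distributions overlap it rises towards 50%.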

For speech recognition, the word error rate (WER) is the most commonly used metric; it is based on the Levenshtein distance (also called edit distance).
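A minimal WER implementation, computing the Levenshtein distance over words with dynamic programming and dividing by the reference length, might look like this:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    via the Levenshtein (edit) distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions.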

For spoken language understanding, the metric is similar to WER but considers semantic concepts instead of words as basic units.

For speech synthesis, the quality of the speech output is assessed through subjective human evaluation based on the mean opinion score (MOS).
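As a sketch, the MOS is simply the arithmetic mean of the listeners' quality ratings on a 1–5 opinion scale (the rating-validation check is our own addition):

```python
def mean_opinion_score(ratings):
    """MOS: arithmetic mean of listeners' quality ratings on the 1-5 scale."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 opinion scale")
    return sum(ratings) / len(ratings)
```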

He concluded his talk with a hands-on session: students were asked to evaluate real system outputs for speech recognition, speaker identification and speaker verification.

Data for evaluation: How to compose evaluation material: Djamel Mostefa, Jeremy Leixa, Diego Antolinos-Basso, Maylis Bodin, ELDA.

Djamel Mostefa presented the multi-modal data produced for the evaluation of multi-modal technologies within the European FP6 CHIL (Computers in the Human Interaction Loop) project. He gave a demonstration of the ISL Video Labelling Tool used to annotate video images with different features.

Diego Antolinos-Basso presented the data produced for REPERE, a national evaluation project in the field of multimedia people recognition in television documents. The audio channels are orthographically transcribed and segmented with Transcriber, and the video part is annotated with ViPER.

Maylis Bodin then presented the collection and annotation of written documents, both handwritten and typed, for OCR. These annotated documents are used for the development and evaluation of systems for the automatic processing of written documents.

Finally, Jérémy Leixa introduced the Quaero 2 project, for which annotation of named entities in French and English is carried out on different types of data, including old newspapers (OCR) and transcriptions of broadcast news.

Sustainability of evaluation platforms: Olivier Hamon, ELDA.

Olivier Hamon gave a presentation on aspects of reuse related to evaluation data and tools.

His main point was that it is important to see the generic aspects in evaluation, and to try to convert repetitive tasks into generic and sustainable ones. He also emphasized the possibility of reuse of LRs irrespective of technology to be evaluated. As an example he mentioned a FR/EN parallel corpus in the medical domain which could be used for evaluating QA, term extraction and MT – and could also have been used for alignment evaluation.

Principles in evaluation and basic statistical issues: Anders Søgaard, University of Copenhagen.

Anders Søgaard started out with two assumptions: labeled data is scarce, and labeled data is biased. Since much of NLP builds on labeled data, these are important problems. Parsers trained on Wall Street Journal data give good results on similar data, less good results on BBC news, and still worse results on data that differ more. Data is always biased.

He went through three basic ML algorithms and showed how algorithms can be trained to take the right decision. Example application: document classification.
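The report does not record which three algorithms were covered, but a multinomial naive Bayes classifier with add-one smoothing, a common textbook choice used here purely as an illustration, shows how such a model can be trained for document classification:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """Train a multinomial naive Bayes model with add-one smoothing.
    `docs` is a list of (label, text) pairs."""
    label_counts = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)
    for label, text in docs:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab

def classify(model, text):
    """Return the label with the highest posterior log-probability."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)   # prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:                                # likelihood
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Trained on a handful of labeled documents per class, the model picks the class whose word distribution best explains a new document.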

Finally, he briefly went through significance testing.
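One widely used significance test for paired NLP system comparisons is approximate randomization: randomly swap the two systems' per-example scores and count how often the shuffled difference is at least as extreme as the observed one. A minimal sketch (the function and its inputs are illustrative, not from the talk):

```python
import random

def paired_permutation_test(correct_a, correct_b, trials=10000, seed=0):
    """Approximate randomization test for the difference in accuracy
    between two systems scored on the same examples.

    `correct_a` / `correct_b` are per-example 0/1 correctness indicators.
    Returns an estimated two-sided p-value."""
    rng = random.Random(seed)
    observed = abs(sum(correct_a) - sum(correct_b))
    extreme = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:    # randomly swap the systems' outputs
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            extreme += 1
    return extreme / trials
```

A small p-value means the observed accuracy difference is unlikely under the null hypothesis that the two systems are interchangeable.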

Evaluation of text technologies: Patrick Paroubek, LIMSI.

Patrick Paroubek started with a definition of evaluation and gave a short historical background, including the question: do we evaluate a system, or a system+user pair? He stressed that it is important always to define the goal of the evaluation.

He gave definitions of the basic concepts: a measure assigns a value to a subset, while a metric is a system of parameters or ways of assessing; a metric is thus a constrained type of measure. He argued that the evaluation paradigm is an accelerator of science, and classified competition, validation and evaluation as follows:

Competition: one criterion, total ordering, no performance analysis, not reproducible

Validation: several criteria, partial ordering, performance threshold, yes/no answer, reproducible

Evaluation: several criteria, partial ordering, performance analysis, reproducible

For evaluation of POS tagging he presented the experience from the GRACE project.

He also presented two initiatives on parser evaluation, EASy (2003-2006) and PASSAGE (2007-2010). Finally he presented evaluation of opinion mining (DoXa, 2009-2011), which received a lot of interest from French companies.

Evaluation of machine translation: Aurélien Max, LIMSI.

Aurélien Max went through the most widely used evaluation metrics for machine translation, briefly introducing the statistical MT methodology of systems such as Moses. He described the advantages and challenges of each approach, presenting in particular the BLEU, METEOR, TER and HTER approaches to MT evaluation.
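To give a flavour of how such metrics work, here is a much-simplified sentence-level variant of BLEU: the geometric mean of modified n-gram precisions, times a brevity penalty. Real BLEU is computed over a whole corpus, supports multiple references, and handles zero counts with smoothing; this sketch is for illustration only.

```python
import math
from collections import Counter

def simple_bleu(reference, hypothesis, max_n=4):
    """Simplified sentence-level BLEU with a single reference."""
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clipped counts: a hypothesis n-gram is credited at most as often
        # as it occurs in the reference
        clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if clipped == 0:
            return 0.0        # any zero n-gram precision zeroes the score
        log_prec += math.log(clipped / total) / max_n
    # brevity penalty: punish hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec)
```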

He concluded with some current directions in MT evaluation, focusing on error analysis: automatic error detection, and automatic error classification.

Victoria Arranz, ELDA, presented a detailed and comprehensive overview of the tasks involved in Building Reference Data for MT evaluation. Reference data or test data are data used to “automatically” measure system output. When such data are produced in a professional way, protocols are needed, such as Protocol for the Production of Parallel Corpora for MT Evaluation, Translation and Proofreading Protocol, Quality Control Protocol etc.

She gave a comprehensive list of issues to take into account, such as size of corpus, type (text or speech) and domain, languages involved etc.

She mentioned IPR and copyright issues, and stressed that these have to be taken into account also when crawling the internet for material. This is one of the reasons that it takes 50-60 working days to produce a quality corpus of 22,000 words.

A corpus produced according to such standards will be costly, but reusable and shareable.


The course ended with a discussion of how to integrate the knowledge obtained into the participants’ projects. Each participant was asked to identify knowledge acquired during the course that would be useful for completing their project. Interestingly, every part of the programme was useful to at least one participant, which suggests that the topics had been well chosen. Finally, the participants formally evaluated the course; the feedback showed that it had been successful.
