TRIS by Carla Parra

The TRIS corpus is a specialized parallel corpus of Spanish and German texts, compiled and aligned by Carla Parra (see the LREC 2012 paper listed below).

The texts come from the European Commission and date from 1997 to 2010. They are technical regulations in a variety of domains. The most recent version (v0.3) is sentence-aligned and available in TMX and TEI formats: the TMX files contain the aligned sentences directly, while the TEI-encoded files store the sentence alignment as stand-off annotation. Every sentence carries information about its domain, year, source file and sentence number. The corpus contains files written in Austria and translated into European Spanish, drawn from three domains:

  • B00: Construction (205 files; 70,648 sentences; 1,563,000 words; time frame: 1999-2010)
  • C00A: Agriculture, Fishing and Foodstuffs (12 files; 4,879 sentences; 137,354 words; time frame: 1999-2001)
  • H00: Domestic and Leisure Equipment (12 files; 1,229 sentences; 58,328 words; time frame: 2005-2010)
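
For readers who want to work with the TMX files, the following is a minimal sketch of how aligned sentence pairs could be read in Python. It assumes the standard TMX layout (`tu` elements containing one `tuv` per language, each with a `seg`) and language codes beginning with `de` and `es`. The exact attributes used in the TRIS files, including how the domain, year and sentence number are stored, are not shown here and may differ. The function name `read_tmx_pairs` and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="de", tgt_lang="es"):
    """Yield (source, target) sentence pairs from a TMX file or file-like object.

    Hypothetical helper: assumes one <tuv> per language inside each <tu>.
    """
    tree = ET.parse(source)
    for tu in tree.getroot().iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Reduce regional codes such as "de-AT" to the base language.
                segs[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]


# Invented sample data, not taken from the corpus.
SAMPLE = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="de"><seg>Dies ist ein Beispiel.</seg></tuv>
      <tuv xml:lang="es"><seg>Esto es un ejemplo.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = list(read_tmx_pairs(io.BytesIO(SAMPLE.encode("utf-8"))))
for german, spanish in pairs:
    print(german, "|||", spanish)
```

Passing a file path instead of the in-memory sample works the same way, since `ET.parse` accepts either.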

The corpus has also been part-of-speech (POS) tagged with TreeTagger, and the POS-tagged files are available as well.
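
As a rough illustration, TreeTagger commonly writes one token per line as three tab-separated columns (token, POS tag, lemma). The sketch below parses that layout. Whether the TRIS POS-tagged files follow exactly this layout is an assumption, and the names `Token` and `parse_treetagger` as well as the sample lines are invented for the example.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    form: str
    pos: str
    lemma: str


def parse_treetagger(lines: Iterable[str]) -> Iterator[Token]:
    """Parse token<TAB>tag<TAB>lemma lines (assumed layout; hypothetical helper)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip lines that are not in the three-column layout
        yield Token(*parts)


# Invented sample in the assumed layout, not taken from the corpus.
SAMPLE_LINES = [
    "Das\tART\tdie\n",
    "ist\tVAFIN\tsein\n",
    "ein\tART\teine\n",
    "Beispiel\tNN\tBeispiel\n",
    ".\t$.\t.\n",
]

tokens = list(parse_treetagger(SAMPLE_LINES))
nouns = [t.form for t in tokens if t.pos == "NN"]
print(nouns)
```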

Versions v0.1 and v0.2 are kept as individual records because they are (currently) intended to be downloaded individually.

Version v0.3 is encoded in TEI P5 and includes files from two domains not included in versions v0.1 and v0.2: C00A (Agriculture, Fishing and Foodstuffs), which is currently under alignment, and H00 (Domestic and Leisure Equipment), which includes all files available in the database up to 2010.

The TRIS corpus is currently being used in Machine Translation (MT) experiments. Its compilation is documented in the following papers:

  • Parra Escartín, Carla. 2013. Encoding a parallel corpus: The TRIS corpus experience. In: The many facets of corpus linguistics in Bergen – in honour of Knut Hofland. Bergen Language and Linguistics Studies (BeLLS), Vol. 3, No. 1. Bergen. ISBN 978-82-998587-2-4. pp. 61-80.
  • Parra Escartín, Carla. 2013. Encoding a parallel corpus: The TRIS corpus experience. In: The many facets of corpus linguistics in Bergen – in honour of Knut Hofland. Bergen Language and Linguistics Studies (BeLLS), Vol. 3, No. 1 (online version). Bergen. ISBN 978-82-998587-3-1. pp. 61-80.
  • Lyse, Gunn Inger; Parra Escartín, Carla and De Smedt, Koenraad. 2012. Applying Current Metadata Initiatives: The META-NORD Experience. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12) (Istanbul, Turkey) (Victoria Arranz, Daan Broeder, Bertrand Gaiffe, Maria Gavrilidou, Monica Monachini, and Thorsten Trippel, eds.), vol. Describing LRs with Metadata: Towards Flexibility and Interoperability in the Documentation of LR, European Language Resources Association (ELRA), 22 May 2012, pp. 20–27.
  • Parra Escartín, Carla. 2012. Design and compilation of a specialized Spanish-German parallel corpus. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12) (Istanbul, Turkey) (Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, eds.), European Language Resources Association (ELRA), May 2012, pp. 2199-2206. ISBN 978-2-9517408-7-7.

If you are using the TRIS corpus in your work, please cite the LREC 2012 paper.
