The Unit for Linguistic Data (ULD) is concerned with the creation, improvement and maintenance of linguistic data (also known as language resources) through a variety of methods. The term linguistic data refers to a range of data types that are of use to researchers in linguistics and natural language processing (NLP). Linguistic data can be split into four major categories: firstly, lexical data, which describe words and their meanings, syntax and relations; secondly, corpora, which are collections of texts made for a particular purpose; thirdly, language descriptions, which document typological properties of languages to enable comparative studies; and finally, metadata, which describe language resources and their availability.

As its primary research method, the group explores the use of linked data technologies, that is, Linguistic Linked Open Data (LLOD), for processing linguistic data. This has led to the development of several tools and resources that use linked data as a key part of their mechanism. One such tool, Naisc, is a novel tool developed by the group for linking together resources of different kinds; it has been applied to the task of linking lexicographical resources in the context of the ELEXIS project. Another tool, Teanga, enables the construction of pipelines of NLP tools that can be composed and integrated through the use of linked data and standards for linguistic data, such as the OntoLex-Lemon standard developed in this project. Finally, ULD maintains and develops several catalogues for the discovery of linguistic data resources, including the Linghub website as well as the Linked Open Data Cloud and its Linguistic Linked Open Data Subcloud. In the context of the Prêt-à-LLOD project, ULD is further exploring how the quality and availability of resources can be improved.

One of the major uses of linguistic data is the application of already developed NLP technologies to new languages and domains. As such, a major part of the group's work concerns under-resourced languages. There is much ongoing work on the development of technologies for minority languages, as well as an active collaboration with the Irish Department and the Moore Institute on the development of NLP techniques for historical languages, in particular Old Irish. Furthermore, the unit is working on expanding WordNet to many under-resourced languages by means of machine translation.

Areas of work:

Linked data, Under-resourced languages, Digital humanities, Language resources, Lexicography, Metadata, Linguistic linked open data, Linked-data-based services

Unit Leader:

Unit Members: