Welsh Natural Language Toolkit (WNLT)

Background

The Welsh Natural Language Toolkit (WNLT) is a Welsh Government-funded project (£39K) under the Welsh-language technology and digital media grant. The project commenced on 1st July 2015 and is planned to be completed by the end of March 2016. Doug Tudhope is Project Director, Daniel Cunliffe is Project Manager and Andreas Vlachidis is Research Fellow on the project.

Aims

WNLT aims to develop a suite of open source software modules that enable Welsh language computational linguistic applications and strengthen the Welsh language technology infrastructure with a set of core Natural Language Processing (NLP) tools.

The project outcomes aim to:

  • Enable Welsh and non-Welsh speakers from academia and industry to develop general purpose and bespoke applications for a range of Welsh language processing tasks.
  • Accelerate the development of Welsh language Computational Linguistic and NLP applications by facilitating adaptation from English to Welsh of existing Information Extraction and Linguistic Analysis applications.
  • Support teaching and training of Computational Linguistics for Welsh.
  • Enable the delivery of industry-standard, interoperable outputs of linguistic analysis applications, advancing the potential for collaborative activities.

Objectives

The proposed work will support the Welsh Language Technology domain with a set of NLP tools that drive innovation and advance the development of sophisticated textual analysis solutions.
 
The WNLT project will deliver four core NLP modules:

  • Word Segmentation for separating text into words.
  • Sentence Boundary Disambiguation for finding sentence boundaries.
  • Part of Speech Tagger for determining the part of speech of each word.
  • Stemmer for identifying the root form of words.
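
As a rough illustration of what the first two modules do, the sketch below splits text into sentences and then into words. It is not the WNLT implementation: it uses the standard Java class java.text.BreakIterator, and the class name SegmentationSketch and its helper methods are hypothetical names chosen for this example. The JDK may not ship Welsh-specific rules for the "cy" locale, in which case generic default rules apply, which is one reason dedicated Welsh modules are useful.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of the WNLT code base.
class SegmentationSketch {

    // Assumption: the JDK may have no Welsh-specific break rules and
    // silently fall back to its default rules for this locale.
    static final Locale WELSH = Locale.forLanguageTag("cy");

    // Sentence boundary detection: one list entry per sentence.
    static List<String> sentences(String text) {
        return spans(BreakIterator.getSentenceInstance(WELSH), text, false);
    }

    // Word segmentation: one list entry per word, punctuation dropped.
    static List<String> words(String text) {
        return spans(BreakIterator.getWordInstance(WELSH), text, true);
    }

    // Collects the text between consecutive boundaries reported by the iterator.
    static List<String> spans(BreakIterator it, String text, boolean wordsOnly) {
        List<String> parts = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String span = text.substring(start, end).trim();
            if (span.isEmpty()) {
                continue; // whitespace between tokens
            }
            if (wordsOnly && !Character.isLetterOrDigit(span.codePointAt(0))) {
                continue; // punctuation is not a word
            }
            parts.add(span);
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "Bore da. Sut wyt ti?";
        for (String sentence : sentences(text)) {
            System.out.println(sentence + " -> " + words(sentence));
        }
    }
}
```

Part-of-speech tagging and stemming need Welsh lexical resources and rules, so they are not sketched here.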

In addition, the project will use the resulting modules to construct an exemplar Welsh language Information Extraction application.

Final workshop

A workshop was held at USW (Trefforest) in April 2016 to discuss the results with participants including representatives from Welsh Universities and SMEs, the GATE NLP community and RCAHMW. The presentations are available:

  • WNLT Toolkit
  • Background
  • Introduction to Project

Outputs

The modules are written in Java and ‘wrapped’ for execution under the General Architecture for Text Engineering (GATE) framework. They are disseminated under the GNU Lesser General Public License (LGPL).

The WNLT modules can be downloaded from SourceForge at https://sourceforge.net/projects/wnlt/