Welsh Natural Language Toolkit (WNLT)

Background

The Welsh Natural Language Toolkit (WNLT) was a Welsh Government-funded project (2015-2016) under the Welsh-language technology and digital media grant. Doug Tudhope was Project Director, Daniel Cunliffe was Project Manager and Andreas Vlachidis was the Research Fellow on the project.

Aims

WNLT aims to develop a suite of open source software modules that enable Welsh language computational linguistic applications and strengthen Welsh language technology infrastructure with a set of core Natural Language Processing (NLP) tools within the GATE framework.

The project aims to:

  • Enable Welsh and non-Welsh speakers from academia and industry to develop general-purpose and bespoke applications for a range of Welsh language processing tasks.
  • Accelerate the development of Welsh language Computational Linguistic and NLP applications by facilitating adaptation from English to Welsh of existing Information Extraction and Linguistic Analysis applications.
  • Support teaching and training of Computational Linguistics for Welsh.
  • Enable the delivery of industry-standard, interoperable outputs from linguistic analysis applications, advancing the potential for collaborative activities.

Objectives

The proposed work supports the Welsh Language Technology domain with a set of NLP tools that facilitate the development of sophisticated textual analysis solutions. 
 
The WNLT project delivers four core NLP modules:

  • Word Segmentation for separating text into words
  • Sentence Boundary Disambiguation for finding sentence boundaries
  • Part-of-Speech Tagger for determining the part of speech of each word
  • Stemmer for identifying the root form of words
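
To give a feel for what the first two modules do, here is a minimal Java sketch of sentence splitting followed by word segmentation. It is not the WNLT code and does not use the GATE API. The class name, the method names and both regular-expression rules are assumptions made for illustration, and the rules are far simpler than a real Welsh tokeniser would need (for example, contractions such as "i'r" are kept as single tokens, and abbreviations ending in a full stop would be split wrongly).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a toy sentence splitter and word segmenter.
// All names and rules here are hypothetical, not taken from WNLT.
public class SegmentationSketch {

    // Assumed rule: a sentence boundary is whitespace that follows '.', '!' or '?'
    // and precedes an uppercase letter.
    private static final Pattern SENTENCE_GAP =
            Pattern.compile("(?<=[.!?])\\s+(?=\\p{Lu})");

    // Assumed rule: a word is a run of letters or digits, optionally joined
    // by internal apostrophes or hyphens (e.g. "hi'n").
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:['\\u2019-][\\p{L}\\p{N}]+)*");

    // Split text into sentences at the assumed boundary rule.
    public static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : SENTENCE_GAP.split(text.trim())) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Extract word tokens from one sentence, dropping punctuation.
    public static List<String> splitWords(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "Mae hi'n braf heddiw. Rwy'n mynd i'r dref.";
        for (String sentence : splitSentences(text)) {
            System.out.println(sentence + " -> " + splitWords(sentence));
        }
    }
}
```

In a pipeline of this kind, the part-of-speech tagger and stemmer would then operate on the word tokens produced by the earlier steps, which is why segmentation and sentence boundary detection come first.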

In addition, the project uses the resulting modules to construct an exemplar Welsh language Information Extraction application. 

Final workshop

A workshop was held at USW (Trefforest) in April 2016 to discuss the results with participants including representatives from Welsh Universities and SMEs, the GATE NLP community and RCAHMW. The presentations are available:

  • WNLT Toolkit
  • Background
  • Introduction to Project

Outputs

The modules are written in Java and ‘wrapped’ for execution under the General Architecture for Text Engineering (GATE) framework. They are distributed under the GNU Lesser General Public License (LGPL).

The WNLT modules can be downloaded from SourceForge at https://sourceforge.net/projects/wnlt/


A second stage of the work took place under the WNLT2 project.