September 23-24, 2010
Andromeda Hotel & Thalassa
Albert I-Promenade 60
8400 OOSTENDE
[Day 1: 23 September]
[Day 2: 24 September]
| 10.00 | Registration & welcome coffee |
| 10.45 | Opening |
| 11.00-12.40 |
Building a Gold Standard for Dutch Spelling Correction Tanja Gaustad van Zaanen and Antal van den Bosch (ILK)
The main question in the NWO project "Implicit Linguistics" is whether abstract
linguistic representations are necessary as an intermediate step in NLP models
and systems. To investigate this, we focus on text-to-text processing tasks,
i.e. processes which map form to form. In particular, we are investigating
Dutch spelling correction where a corrupted text is converted to a clean
version of the same text.
A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging Guy De Pauw, Naomi Maajabu and Peter Waiganjo Wagacha (CLiPS)
This presentation describes the collection and exploitation of a small trilingual corpus English - Swahili - Luo (Dholuo). Taking advantage of existing morphosyntactic annotation tools for English and Swahili and the unsupervised induction of Luo morphological segmentation patterns, we are able to perform fairly accurate word alignment on factored data. This not only enables the development of a workable bidirectional statistical machine translation system English - Luo, but also allows us to part-of-speech tag the Luo data using the projection of annotation technique. The experiments described in this paper demonstrate how this knowledge-light and language-independent approach to machine translation and part-of-speech tagging can result in the fast development of language technology components for a resource-scarce language.
Using Domain Similarity for Performance Estimation Vincent Van Asch and Walter Daelemans (CLiPS)
Many natural language processing (NLP) tools exhibit a decrease in performance when they are applied to data that is linguistically different from the corpus used during development. This makes it hard to develop NLP tools for domains for which annotated corpora are not available. This paper explores a number of metrics that attempt to predict the cross-domain performance of an NLP tool through statistical inference. We apply different similarity metrics to compare different domains and investigate the correlation between similarity and the accuracy loss of an NLP tool. We find that the correlation between the performance of the tool and the similarity metric is linear and that the latter can therefore be used to predict the performance of an NLP tool on out-of-domain data. The approach also provides a way to quantify the difference between domains.
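As a rough illustration of the idea in the abstract above, the sketch below measures how far two domains are apart and fits a straight line relating that distance to tool accuracy. The abstract does not name its metrics, so the Jensen-Shannon divergence used here is an assumption, and the function names (`distribution`, `js_divergence`, `fit_line`) are hypothetical helpers rather than anything from the authors' work.

```python
import math
from collections import Counter


def distribution(tokens):
    """Relative word frequencies of a tokenized corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0 for identical, 1 for disjoint distributions.

    Assumption: this metric stands in for the unnamed similarity metrics of the abstract.
    """
    divergence = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mean = 0.5 * (pw + qw)
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mean)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mean)
    return divergence


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_accuracy(divergence, slope, intercept):
    """Estimate accuracy on an unseen domain from its divergence to the training domain."""
    return slope * divergence + intercept
```

In use, one would compute the divergence between the training corpus and each annotated test domain, fit the line on the resulting (divergence, accuracy) pairs, and then call `predict_accuracy` for a domain that has no annotated data.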
Fusion of Readability Assessments for Ranking in a Possibilistic Logic Framework Dries Tanghe, Philip van Oosten and Véronique Hoste (LT³)
In an attempt to collect reliable reference data against which readability predictions can be evaluated, a set of readability assessments was created by a group of language experts. In order to construct an overall gold standard ranking, the assessment data should be merged in an optimally unbiased way. We present a method to construct a gold standard that consists of a partial ordering of the evaluated texts, based on possibilistic logic. In the gold standard, one of two possible outcomes is expressed for each pair of texts: either that it can be said with a given degree of certainty that the first text is more readable than the second, or that the comparability of the texts cannot be determined. The method we present has polynomial time complexity and can therefore be regarded as efficient. It is extensible to ranking unrelated objects of any kind, based on preferences expressed by multiple individuals. Future work also includes comparisons of evaluation methods, systems that learn to rank, and active learning to extend the gold standard. |
| 12.40 | Lunch |
| 13.45-15.00 |
CLAM: Computational Linguistics Application Mediator
Maarten van Gompel and Martin Reynaert (ILK)
CLAM allows you to quickly and transparently transform your Natural Language Processing
application into a RESTful webservice, with which both human end-users and
automated clients can interact. CLAM takes a description of your system and wraps itself
around the system, allowing end-users or automated clients to upload input files to your
application, start your application with specific parameters of their choice, and
download and view the output of the application once it is completed.
ALeF: An Active Learning Framework for Readability Prediction Philip van Oosten and Véronique Hoste (LT³)
Readability research, in which the ultimate goal is to produce a system that can automatically determine the readability of an unseen text, can be largely sketched as follows. From an existing text corpus, texts are selected for readability assessment by people. At the same time, features that seem to be relevant are extracted from the texts. The assessments and the features are used to train a readability prediction system. The system can finally be evaluated by means of an evaluation metric. A number of publications from the last decades follow this pattern. Most of those, however, use their own set of resources. The training corpus, the implementations of learning methods and the evaluation metrics are seldom reused. Further, due to the lack of a common ground on which research can be performed, the workload is high, so the influence of more sophisticated techniques, such as Active Learning, has not yet received the attention it deserves. That holds for the domain of readability research as well as for other domains. To overcome those issues, we felt that readability researchers should be able to build on solutions that were proposed earlier, and that their own contributions to the field should be made available to other researchers. Readability research needs a community and tools to build new systems much more efficiently. Such a community should lead to comparable research results. In this talk, we propose ALeF, an Active Learning Framework upon which such a community can be built.
WINKLE: Web Interface for Narrative Kernel Labeling and Extraction Martin Reynaert and Bart Persoons (ILK)
We present a collaborative online environment for basic metadata and text structure annotation.
In the framework of the reference corpus of written Dutch SoNaR we are amassing far more texts
in difficult formats and layouts than we have the means of processing. In the corpus, only the
texts' running discourse is incorporated, which requires that non-running text elements
be identified and removed.
|
| 15.00 | Coffee break |
| 15.30-16.45 |
Classifying Songs into Moods using Lyrics Menno van Zaanen (ILK)
With the recent boost in music sharing, many people have music
collections that are different in nature from those in the past.
Current collections are not only larger, they often consist of a
selection of individual songs in contrast to a set of albums, which
was normal in the past. Due to this unstructured nature of the
collection, music listeners create playlists, which describe which
songs should be played in which order. Creating playlists by hand,
however, is not only very time-consuming, it is also hard. Many
playlists group songs with similar emotional content (happy, sad);
however, this information is typically not available explicitly on a
per-song basis. Furthermore, the fact that people tend not to know
all songs in their collection makes the task even harder.
Beyond Reported History: Strikes That Never Happened Martha van den Hoven, Antal van den Bosch and Kalliopi Zervanou (ILK)
We present a study on applying text analytics methods to historical text and data to uncover aspects of event structure. First, we associate primary historical resources, newspaper articles, to a secondary resource, a database of labor conflicts in the Netherlands, detecting newspaper stories denoting labor conflicts. For the task of retrieving newspaper articles based on database record information, we construct a query model exploiting the database record fields. We consider labor conflicts as historical events referred to in sequences of newspaper article narratives, of which the climax, i.e., the strike, may or may not have occurred. We analyse documents preceding a strike by considering them as a sub-narrative class of "strike threat" articles, and we then attempt to retrieve articles referring to conflicts which were about to burst into strike, but for some reason never did: strikes that never happened.
Age and Gender Prediction on Netlog Data Claudia Peersman, Walter Daelemans and Leona Van Vaerenbergh (CLiPS)
In recent years millions of people have started using social networking sites such as Netlog to support their personal and professional communications, creating digital communities. However, a common characteristic of these digital communities is that users can easily provide a false name, age, gender and location in order to hide their true identity. This way, social networking sites can be used by people with criminal intentions (e.g., paedophiles) to support their activities on-line. In the context of the DAPHNE project (Defending Against Paedophiles in Heterogeneous Network Environments), we present first results of a machine learning approach for age and gender prediction on a corpus of posts on the social network site Netlog.
We investigate which types of linguistic and stylistic features are effective for age and gender prediction, given the specific characteristics of (the Dutch) chat language, and compare the effectiveness of different machine learning techniques for age and gender prediction on the Netlog data. We will conclude our presentation by discussing how these results will guide future research in the DAPHNE project. |
| 16.45 | Room check-in & social activity |
| 20.00 | Dinner |
| 9.00 | Breakfast & check-out |
| 10.00-10.50 |
Advances in Memory-Based Machine Translation Maarten van Gompel, Peter Berck and Antal van den Bosch (ILK)
We present advances in research on Memory-based Machine Translation (MBMT), a form of machine translation in which the translation model takes the form of approximate k-nearest neighbour classifiers. These classifiers are trained to map words or phrases in context to a target word or phrase. The modelling of source-side context is a key feature distinguishing this approach from standard Statistical Machine Translation (SMT). We present three progressively more complex MBMT systems that range from lightweight (i.e. fast training and processing) to incorporating the typical ingredients of SMT systems, such as a target language model. In the third system, which is the main focus of this contribution, we embrace the concept of phrases, as opposed to the single words or fixed n-grams that the two earlier systems focused on. In this phrase-based MBMT system we use a phrase-translation table generated by Moses as the basis for the generation of training and test instances for our classifiers. We present an automatic method for hyperparameter optimization for our system, and investigate the inclusion of simple syntactic features in this model. We critically measure and compare the performance of our systems against baselines, its precursor systems, and state-of-the-art competitors.
Examining the validity of Cross-Lingual Word Sense Disambiguation Els Lefever and Véronique Hoste (LT³)
We present the data creation process and results of the "Cross-Lingual
Word Sense Disambiguation" task that was defined in the framework of
the SemEval-2010 competition. The goal of this task was to evaluate
the feasibility of multilingual WSD on a newly developed multilingual
lexical sample data set. Participants in the task were asked to
automatically determine the contextually appropriate translation of a
given English noun in five languages, viz. Dutch, German, Italian,
Spanish and French.
|
| 10.50 | Coffee break |
| 11.20-12.10 |
Experimental Design in Multi-Topic Authorship Attribution Kim Luyckx and Walter Daelemans (CLiPS)
Topic is one of the most important factors interfering with authorship, a
characteristic making it hard to ‘separate’ from authorship. In addition,
including topic-related words in the attribution model can either aid
classification, affect performance in a negative way, or confuse the
discriminative method. This results in a model that is unreliable when tested
on other topics; in other words, a model that does not scale.
The author et al. Medieval authorship attribution and the case of the Battle of the Golden Spurs Mike Kestemont (CLiPS)
The fifth part (Vijfde Partie, ca. 1316) is the last part of the continuation of the Spiegel historiael, a Middle Dutch rhymed chronicle initiated in the thirteenth century by Jacob van Maerlant and Filip Utenbroeke. Historians agree that one of the most interesting accounts of the famous Battle of the Golden Spurs (Guldensporenslag, 1302) is to be found in the fourth book of this fifth part, which is traditionally attributed to Lodewijk of Velthem. This contribution claims, however, that there is ample reason to doubt the attribution of the account of the battle to Velthem. A stylometric analysis (Machine Learning) of the rhyme words demonstrates that the bulk of the fourth book shows a significant deviation from Velthem's style. The author seems to have borrowed many a passage from a pre-existing vernacular source text, possibly written by another unidentified author. This hypothesis is backed up by various non-stylistic arguments. We shall present a novel way to assess 'confidence of attribution' in nearest neighbor classification for authorship discrimination. |
| 12.10 | Lunch |
| 13.30-14.30 |
Invited talk Richard Beaufort
Introduced a few years ago, Short Message Service (SMS) offers the possibility of exchanging written messages
between mobile phones. SMS has quickly been adopted by users. These messages often greatly deviate from
traditional spelling conventions. As shown by specialists (Thurlow and Brown, 2003; Fairon et al., 2006;
Bieswanger, 2007), this variability is due to the simultaneous use of numerous coding strategies,
like phonetic plays (2m1 read ‘demain’, “tomorrow”), phonetic transcriptions
(kom instead of ‘comme’, “like”),
consonant skeletons (tjrs for ‘toujours’, “always”), misapplied, missing or
incorrect separators (j esper
for ‘j'espère’, “I hope”; j'croibi1k, instead of
‘je crois bien que’, “I am pretty sure that”), etc. These
deviations are due to three main factors: the small number of characters allowed per text message by the
service (140 bytes), the constraints of small phone keypads and, last but not least, the fact that
people mostly communicate with friends and relatives in an informal register.
In the general framework of an SMS-to-speech synthesis system (Cf. the Vocalise project,
http://cental.fltr.ucl.ac.be/team/projects/vocalise/),
we developed an SMS normalization module that only uses models learned from a training corpus. The normalization involves
three steps. First, an SMS-dedicated lexicon look-up, which differentiates between known and unknown parts of a noisy token.
Second, a rewrite process, which creates a lattice of weighted solutions. The rewrite model differs depending on whether
the part to rewrite is known or not. Third, a combination of the lattice of solutions with a language model, and the choice
of the best sequence of lexical units.
Our presentation will be organized as follows. First, we will focus on the way we gathered the SMS corpus in the framework of the sms4science project (Cf. http://www.sms4science.org/), which aims at collecting international SMS corpora. Second, we will describe the finite-state algorithm we designed to align the SMS corpus and its transcription at the character level, a step needed to automatically learn pieces of knowledge from these corpora. Third, we will detail our finite-state normalization process and, alongside, the way we learned the normalization models from our aligned corpora. The presentation will end with an evaluation of the system and a small demonstration of the complete SMS-to-speech prototype already available for Android smartphones: text-it/voice-it.
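The three normalization steps described in this abstract (lexicon look-up, weighted rewriting, combination with a language model) can be pictured with the toy sketch below. It is not the finite-state system presented in the talk. It works on whole tokens rather than parts of tokens and replaces the lattice search with brute-force enumeration. The SMS forms 2m1, kom and tjrs come from the abstract, while every cost, the extra candidate "comment", the bigram table and all function names are invented for illustration.

```python
from itertools import product

# Toy SMS lexicon: noisy form -> weighted candidates (lower cost is better).
# The SMS forms come from the abstract; costs and the "comment" candidate are invented.
SMS_LEXICON = {
    "2m1": [("demain", 0.1)],
    "kom": [("comme", 0.4), ("comment", 0.5)],
    "tjrs": [("toujours", 0.1)],
}

# Toy bigram language model costs (invented); unseen bigrams get a default cost.
BIGRAM_COST = {
    ("comme", "toujours"): 0.2,
    ("comment", "toujours"): 2.0,
}
DEFAULT_BIGRAM_COST = 1.0


def is_known(token):
    """Step 1: lexicon look-up separating known from unknown tokens."""
    return token in SMS_LEXICON


def rewrite(token):
    """Step 2: weighted candidates; unknown tokens pass through unchanged."""
    if is_known(token):
        return SMS_LEXICON[token]
    return [(token, 1.0)]


def normalize(tokens):
    """Step 3: add language model costs to each path and keep the cheapest one."""
    lattice = [rewrite(token) for token in tokens]
    best_words, best_cost = None, float("inf")
    for path in product(*lattice):
        words = [word for word, _ in path]
        cost = sum(candidate_cost for _, candidate_cost in path)
        cost += sum(
            BIGRAM_COST.get(pair, DEFAULT_BIGRAM_COST)
            for pair in zip(words, words[1:])
        )
        if cost < best_cost:
            best_words, best_cost = words, cost
    return best_words
```

Calling `normalize(["kom", "tjrs"])` shows why the third step matters: the language model cost, not the rewrite cost alone, decides between the two candidates for kom.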
|
| 14.30-14.55 |
The SemEval-2010 shared task on "Linking Events and Their Participants in Discourse" Josef Ruppenhofer, Caroline Sporleder, and Roser Morante (ILK)
In this talk we present the SemEval-2010 shared task on "Linking Events and Their Participants in Discourse". This task is a variant of the classical semantic role labelling task. The novel aspect is that we focus on linking local semantic argument structures across sentence boundaries. In the shared task, we intend to make a first step towards taking SRL beyond the domain of individual sentences by linking local semantic argument structures to the wider discourse context. In particular, we address the problem of finding fillers for roles which are neither instantiated as direct dependents of our target predicates nor displaced through long-distance dependency or coinstantiation constructions. We will describe the dataset, the evaluation method and the systems that participated in the task. |
| 14.55 | Coffee break |
| 15.25-16.15 |
Combining Classifiers for Named Entity Recognition Bart Desmet and Véronique Hoste (LT³)
This presentation explores the use of classifier ensembles for the task of named entity recognition (NER) on a Dutch dataset. Classifiers from 3 classification frameworks, namely memory-based learning (MBL), conditional random fields (CRF) and support vector machines (SVM), were trained on 8 different feature sets to create a pool of classifiers from which an ensemble could be built. A genetic algorithm approach was used to find the optimal ensemble combination, given various voting mechanisms for combining classifier outputs. The experiments yielded a classifier ensemble that outperformed the best individual classifier by 0.67 percentage points (F-score), a small but statistically significant margin. Experimental results also showed that ensembling classifiers from different frameworks benefits generalization performance.
Computational approaches to creativity Tom De Smedt (CLiPS)
Traditionally, software applications for computer graphics have been based on real-world analogies. Each icon in the application's
user interface represents a concrete object - a pen, an eraser, scissors. This model raises creative limitations: features can only be
used as implemented by the developers, the screen is too small to display all the features (some are never discovered), actions are
mouse-based so the user's decision-making process is literally lost in translation. "NodeBox"
(http://nodebox.net) is an ongoing effort to produce software that allows more people to express
themselves creatively. Essentially,
NodeBox is a free software application that produces 2D visuals (artistic, data visualization) based on Python programming code.
|
| 16.15-17.00 | Discussion & closing |