ATILA MEETING

September 23-24, 2010

CLiPS - Computational Linguistics & Psycholinguistics | ILK - Induction of Linguistic Knowledge | LT³ - Language and Translation Technology Team

Location

Andromeda Hotel & Thalassa
Albert I-Promenade 60
8400 OOSTENDE


Program

[Day 1: 23 September]
[Day 2: 24 September]

Day 1: 23 September

10.00 Registration & welcome coffee
10.45 Opening
11.00-12.40

Building a Gold Standard for Dutch Spelling Correction

Tanja Gaustad van Zaanen and Antal van den Bosch (ILK)

The main question in the NWO project "Implicit Linguistics" is whether abstract linguistic representations are necessary as an intermediate step in NLP models and systems. To investigate this, we focus on text-to-text processing tasks, i.e. processes which map form to form. In particular, we are investigating Dutch spelling correction where a corrupted text is converted to a clean version of the same text.
In order to test the quality of a spelling corrector, a gold standard is needed. No such standard exists for Dutch yet. For this reason, we set out to build one, containing a mixed selection of texts in which we aim to mark all errors and their corrections. In this talk, we will present the (current state of the) gold standard, including inter-annotator agreement and other statistics relating to the data used. Furthermore, we will present first results of applying our language model WOPR to the corpus, comparing it against two baselines: a high-precision known error list and a context-insensitive lexical baseline. Evaluation is performed in terms of precision and recall on detection and correction on full text.
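
As an illustration of the evaluation described above, the following minimal sketch computes precision and recall separately for detection (was the right position flagged?) and correction (was the right replacement proposed?). The toy data, the position-to-correction representation and the function name are assumptions made for this example; they are not taken from the project.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for a set of predicted items against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: token position -> corrected form.
gold_corrections = {3: "gebeurd", 7: "wordt", 12: "niettemin"}
system_corrections = {3: "gebeurd", 7: "word", 9: "zij"}

# Detection: only the flagged positions are compared.
det_p, det_r = precision_recall(set(system_corrections), set(gold_corrections))

# Correction: position and proposed replacement must both match.
cor_p, cor_r = precision_recall(
    set(system_corrections.items()), set(gold_corrections.items())
)

print(f"detection  P={det_p:.2f} R={det_r:.2f}")
print(f"correction P={cor_p:.2f} R={cor_r:.2f}")
```

In this toy case the system detects two of the three gold errors but corrects only one of them properly, so the correction scores are lower than the detection scores.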

A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging

Guy De Pauw, Naomi Maajabu and Peter Waiganjo Wagacha (CLiPS)

This presentation describes the collection and exploitation of a small trilingual corpus English - Swahili - Luo (Dholuo). Taking advantage of existing morphosyntactic annotation tools for English and Swahili and the unsupervised induction of Luo morphological segmentation patterns, we are able to perform fairly accurate word alignment on factored data. This not only enables the development of a workable bidirectional statistical machine translation system English - Luo, but also allows us to part-of-speech tag the Luo data using the projection of annotation technique. The experiments described in this paper demonstrate how this knowledge-light and language-independent approach to machine translation and part-of-speech tagging can result in the fast development of language technology components for a resource-scarce language.

Using Domain Similarity for Performance Estimation

Vincent Van Asch and Walter Daelemans (CLiPS)

Many natural language processing (NLP) tools exhibit a decrease in performance when they are applied to data that is linguistically different from the corpus used during development. This makes it hard to develop NLP tools for domains for which annotated corpora are not available. This paper explores a number of metrics that attempt to predict the cross-domain performance of an NLP tool through statistical inference. We apply different similarity metrics to compare different domains and investigate the correlation between similarity and the accuracy loss of an NLP tool. We find that the correlation between the performance of the tool and the similarity metric is linear and that the latter can therefore be used to predict the performance of an NLP tool on out-of-domain data. The approach also provides a way to quantify the difference between domains.
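
A linear relation of this kind can be exploited with an ordinary least-squares fit. The sketch below is only an illustration: the similarity scores and accuracies are invented, and the abstract does not say which similarity metric or fitting procedure the authors used.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements: similarity between the training domain and each
# test domain, and the accuracy of the tool observed on that test domain.
similarity = [0.2, 0.4, 0.6, 0.8]
accuracy = [0.80, 0.85, 0.90, 0.95]

slope, intercept = fit_line(similarity, accuracy)

# Estimate accuracy on a new domain for which only the similarity is known.
new_similarity = 0.5
estimated_accuracy = slope * new_similarity + intercept
print(f"estimated accuracy: {estimated_accuracy:.3f}")
```

Only unannotated text from the new domain is needed to compute its similarity score, which is what makes such an estimate useful when no annotated corpus exists.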

Fusion of Readability Assessments for Ranking in a Possibilistic Logic Framework

Dries Tanghe, Philip van Oosten and Véronique Hoste (LT³)

In an attempt to collect reliable reference data against which readability predictions can be evaluated, a set of readability assessments was created by a group of language experts. In order to construct an overall gold standard ranking, the assessment data should be merged in an optimally unbiased way. We present a method to construct a gold standard that consists of a partial ordering of the evaluated texts, based on possibilistic logic. In the gold standard, one of two possible outcomes is expressed for each pair of texts: either it can be said with a given degree of certainty that one text is more readable than the other, or the comparability of the texts cannot be determined. The method we present has polynomial time complexity and can therefore be regarded as efficient. It is extensible to ranking unrelated objects of any kind, based on preferences expressed by multiple individuals. Future work also includes comparisons of evaluation methods, systems that learn to rank, and active learning to extend the gold standard.

12.40 Lunch
13.45-15.00

CLAM: Computational Linguistics Application Mediator
TICCLops: Text-Induced Corpus CLean-up online processing system

Maarten van Gompel and Martin Reynaert (ILK)

CLAM allows you to quickly and transparently transform your Natural Language Processing application into a RESTful webservice, with which both human end-users as well as automated clients can interact. CLAM takes a description of your system and wraps itself around the system, allowing end-users or automated clients to upload input files to your application, start your application with specific parameters of their choice, and download and view the output of the application once it is completed.
CLAM is set up in a universal fashion, requiring minimal effort on the part of the service developer. Your actual NLP application is treated as a black box, of which only the parameters, input formats and output formats need to be described.
We illustrate this by means of TICCLops.

ALeF: An Active Learning Framework for Readability Prediction

Philip van Oosten and Véronique Hoste (LT³)

Readability research, in which the ultimate goal is to produce a system that can automatically determine the readability of an unseen text, can largely be sketched as follows. From an existing text corpus, texts are selected for readability assessment by people. At the same time, features that seem to be relevant are extracted from the texts. The assessments and the features are used to train a readability prediction system. The system can finally be evaluated by means of an evaluation metric. A number of publications from the last decades follow this pattern. Most of those, however, use their own set of resources. Neither the training corpus nor the implementations of learning methods or evaluation metrics are often reused. Further, due to the lack of a common ground on which research can be performed, the workload is high, so more sophisticated techniques, such as Active Learning, have not yet received the attention they deserve. That holds for the domain of readability research as well as for other domains. To overcome those issues, we felt that readability researchers should be able to build on solutions that were proposed earlier, and that their own contributions to the field should be made available to other researchers. Readability research needs a community and tools to build new systems much more efficiently. Such a community should lead to comparable research results. In this talk, we propose ALeF, an Active Learning Framework upon which such a community can be built.

WINKLE: Web Interface for Narrative Kernel Labeling and Extraction

Martin Reynaert and Bart Persoons (ILK)

We present a collaborative online environment for basic metadata and text structure annotation. In the framework of the reference corpus of written Dutch SoNaR we are amassing far more texts in difficult formats and layouts than we have the means of processing. In the corpus, only the texts' running discourse is incorporated, which requires that non-running text elements be identified and removed.
Because automatic text conversion tools, e.g. PDF conversion tools, have delivered unsatisfactory results to date, we have developed WINKLE (Web Interface for Narrative Kernel Labeling and Extraction), around which we hope to foster a community of volunteers to help us. WINKLE provides dual opportunities for supporting SoNaR. First, it offers facilities for donating text to the corpus by simple upload and IPR-settlement provisions. Second and foremost, it provides innovative tag-cloud-based access to texts waiting to be processed, which helps volunteers choose texts that appeal to them. It then offers an ergonomic environment for the necessary annotation work.
WINKLE lives at http://winkle.uvt.nl and is due to be made available under an open source license.

15.00 Coffee break
15.30-16.45

Classifying Songs into Moods using Lyrics

Menno van Zaanen (ILK)

With the recent boost in music sharing, many people have music collections that are different in nature from those in the past. Current collections are not only larger, they often consist of a selection of individual songs in contrast to a set of albums, which was normal in the past. Due to this unstructured nature of the collection, music listeners create playlists, which describe which songs should be played in which order. Creating playlists by hand, however, is not only very time-consuming, it is also hard. Many playlists group songs with similar emotional content (happy, sad), but this information is typically not available explicitly on a per-song basis. Furthermore, the fact that people tend not to know all songs in their collection makes the task even harder.
We will present research into using lingual parts of songs in a mood classification system. This allows for the classification of songs into moods using the lyrics only. The results of the classifier can help users to select songs for a mood-oriented playlist, making the creation of playlists easier. It also makes it easier for users to investigate the unknown songs of a particular mood in their collection.
Based on tf*idf values of words in the lyrics, the most representative words that occur in lyrics of a particular mood are identified. These values are also used for classification. In practice, it turns out that using words taken from the lyrics and their tf*idf values provides a valuable source of information for automatic mood classification of music. Furthermore, we will show that different aspects of mood can be classified equally effectively.
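
The general idea of tf*idf-based mood classification can be sketched as follows. This is a simplified illustration, not the system presented in the talk: treating all lyrics of one mood as a single document, the scoring rule in `classify`, and the toy lyrics are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_per_mood(lyrics_by_mood):
    """Weight words per mood, treating all lyrics of one mood as one document."""
    counts = {
        mood: Counter(" ".join(texts).lower().split())
        for mood, texts in lyrics_by_mood.items()
    }
    n_moods = len(counts)
    # Number of moods in which each word occurs at least once.
    doc_freq = Counter(word for c in counts.values() for word in c)
    weights = {}
    for mood, c in counts.items():
        total = sum(c.values())
        weights[mood] = {
            word: (freq / total) * math.log(n_moods / doc_freq[word])
            for word, freq in c.items()
        }
    return weights


def classify(lyric, weights):
    """Assign the mood whose tf*idf weights best cover the words of the lyric."""
    words = lyric.lower().split()
    scores = {
        mood: sum(w.get(word, 0.0) for word in words)
        for mood, w in weights.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy training data.
training = {
    "happy": ["sunshine dancing all night", "smile and dancing in the sunshine"],
    "sad": ["tears falling all night", "lonely tears in the rain"],
}
weights = tfidf_per_mood(training)
label = classify("dancing in the sunshine", weights)
print(label)
```

Words that occur in every mood, such as "night" here, receive a weight of zero, so only mood-specific words contribute to the decision.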

Beyond Reported History: Strikes That Never Happened

Martha van den Hoven, Antal van den Bosch and Kalliopi Zervanou (ILK)

We present a study on applying text analytics methods to historical text and data to uncover aspects of event structure. First, we associate primary historical resources, newspaper articles, to a secondary resource, a database of labor conflicts in the Netherlands, detecting newspaper stories denoting labor conflicts. For the task of retrieving newspaper articles based on database record information, we construct a query model exploiting the database record fields. We consider labor conflicts as historical events referred to in sequences of newspaper article narratives, of which the climax, i.e., the strike, may or may not have occurred. We analyse documents preceding a strike by considering them as a sub-narrative class of "strike threat" articles, and we then attempt to retrieve articles referring to conflicts which were about to burst into strike, but for some reason never did: strikes that never happened.

Age and Gender Prediction on Netlog Data

Claudia Peersman, Walter Daelemans and Leona Van Vaerenbergh (CLiPS)

In recent years millions of people have started using social networking sites such as Netlog to support their personal and professional communications, creating digital communities. However, a common characteristic of these digital communities is that users can easily provide a false name, age, gender and location in order to hide their true identity. This way, social networking sites can be used by people with criminal intentions (e.g., paedophiles) to support their activities on-line. In the context of the DAPHNE project (Defending Against Paedophiles in Heterogeneous Network Environments), we present first results of a machine learning approach for age and gender prediction on a corpus of posts on the social network site Netlog. We investigate which types of linguistic and stylistic features are effective for age and gender prediction, given the specific characteristics of (the Dutch) chat language and compare the effectiveness of different machine learning techniques for age and gender prediction on the Netlog data. We will conclude our presentation by discussing how these results will guide future research in the DAPHNE project.

16.45 Room check-in & social activity
20.00 Dinner

Day 2: 24 September

9.00 Breakfast & check-out
10.00-10.50

Advances in Memory-Based Machine Translation

Maarten van Gompel, Peter Berck and Antal van den Bosch (ILK)

We present advances in research on Memory-based Machine Translation (MBMT), a form of machine translation in which the translation model takes the form of approximate k-nearest neighbour classifiers. These classifiers are trained to map words or phrases in context to a target word or phrase. The modelling of source-side context is a key feature distinguishing this approach from standard Statistical Machine Translation (SMT). We present three progressively more complex MBMT systems that range from lightweight (i.e. fast training and processing) to incorporating the typical ingredients of SMT systems, such as a target language model. In the third system, which is the main focus of this contribution, we embrace the concept of phrases, as opposed to the single words or fixed n-grams that the two earlier systems focused on. In this phrase-based MBMT system we use a phrase-translation table generated by Moses as the basis for the generation of training and test instances for our classifiers. We present an automatic method for hyperparameter optimization for our system, and investigate the inclusion of simple syntactic features in this model. We critically measure and compare the performance of our systems against baselines, their precursor systems, and state-of-the-art competitors.
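
To make the notion of classifying a word in its source-side context concrete, here is a minimal sketch of an exact, brute-force k-nearest neighbour lookup with a simple overlap metric. It is not the MBMT implementation, which uses approximate k-NN classifiers; the instance layout (left word, focus word, right word), the function names and the toy Dutch-English data are assumptions made for illustration.

```python
from collections import Counter


def overlap(a, b):
    """Number of feature positions on which two instances agree."""
    return sum(x == y for x, y in zip(a, b))


def knn_translate(instance, memory, k=1):
    """Return the majority target among the k most similar stored instances."""
    ranked = sorted(
        memory, key=lambda item: overlap(instance, item[0]), reverse=True
    )
    votes = Counter(target for _, target in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training instances: (left, focus, right) -> target word.
memory = [
    (("de", "bank", "is"), "bank"),
    (("de", "bank", "leent"), "bank"),
    (("op", "bank", "zitten"), "couch"),
    (("de", "bank", "zitten"), "couch"),
    (("een", "huis", "kopen"), "house"),
]

result = knn_translate(("op", "bank", "liggen"), memory, k=1)
print(result)
```

The left-context word "op" is what separates the two senses of "bank" in this toy memory, which is the kind of source-side evidence the abstract refers to.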

Examining the validity of Cross-Lingual Word Sense Disambiguation

Els Lefever and Véronique Hoste (LT³)

We present the data creation process and results of the "Cross-Lingual Word Sense Disambiguation" task that was defined in the framework of the SemEval-2010 competition. The goal of this task was to evaluate the feasibility of multilingual WSD on a newly developed multilingual lexical sample data set. Participants in the task were asked to automatically determine the contextually appropriate translation of a given English noun in five languages, viz. Dutch, German, Italian, Spanish and French.
In addition, we report on a set of experiments in which the viability of a multilingual classification-based Word Sense Disambiguation system is investigated. Instead of using a predefined sense inventory, we construct our gold standard from the translations that result from word alignments on a parallel corpus. To train and test the classifier, we used English as our input language and we incorporated the translations of our target word in five languages as additional features in the feature vectors. Our results show that the multilingual approach outperforms the classification experiments where no additional evidence from other languages is used. Moreover, the scores also degrade relative to the number of translation features that are used. This allows us to develop a proof of concept for a multilingual approach to Word Sense Disambiguation.

10.50 Coffee break
11.20-12.10

Experimental Design in Multi-Topic Authorship Attribution

Kim Luyckx and Walter Daelemans (CLiPS)

Topic is one of the most important factors interfering with authorship, a characteristic making it hard to ‘separate’ from authorship. In addition, including topic-related words in the attribution model can either aid classification, affect performance in a negative way, or confuse the discriminative method. This results in a model that is unreliable when tested on other topics, in other words, a model that does not scale.
A commonly applied solution for avoiding topic is the selection of function words, as these are considered insensitive to topic shifts. Although function words are robust to sparse data and provide good indicators of authorship, we see at least two reasons to consider the integration of content words. First of all, it has been shown that stylistic features work well for topic identification, and that hardly any of the so-called topic-neutral features are in fact topic-neutral. Secondly, a lot of useful predictive information is disregarded by excluding content words. Our aim is to integrate content words into the model without sacrificing scalability and reliability.
In this talk, we focus on two aspects of experimental design: feature selection and cross-validation schemes. Depending on the feature selection method or cross-validation scheme chosen, topic will play a larger or smaller role in the model. We show that seemingly small decisions in experimental design heavily affect the scalability of the resulting model.

The author et al. Medieval authorship attribution and the case of the Battle of the Golden Spurs

Mike Kestemont (CLiPS)

The fifth part (Vijfde Partie, ca. 1316) is the last part of the continuation of the Spiegel historiael, a Middle Dutch rhymed chronicle initiated in the thirteenth century by Jacob van Maerlant and Filip Utenbroeke. Historians agree that one of the most interesting accounts of the famous Battle of the Golden Spurs (Guldensporenslag, 1302) is to be found in the fourth book of this fifth part, which is traditionally attributed to Lodewijk of Velthem. This contribution claims, however, that there is ample reason to doubt the attribution of the account of the battle to Velthem. A stylometric analysis (Machine Learning) of the rhyme words demonstrates that the bulk of the fourth book shows a significant deviation from Velthem's style. The author seems to have borrowed many a passage from a pre-existing vernacular source text, possibly written by another unidentified author. This hypothesis is backed up by various non-stylistic arguments. We shall present a novel way to assess 'confidence of attribution' in nearest neighbor classification for authorship discrimination.

12.10 Lunch
13.30-14.30

Invited talk
From SMS gathering to SMS normalization: linguistic studies and finite-state algorithms

Richard Beaufort
CENTAL, Université Catholique de Louvain
http://cental.fltr.ucl.ac.be/team/beaufort/

Introduced a few years ago, Short Message Service (SMS) offers the possibility of exchanging written messages between mobile phones. SMS has quickly been adopted by users. These messages often greatly deviate from traditional spelling conventions. As shown by specialists (Thurlow and Brown, 2003; Fairon et al., 2006; Bieswanger, 2007), this variability is due to the simultaneous use of numerous coding strategies, like phonetic plays (2m1 read ‘demain’, “tomorrow”), phonetic transcriptions (kom instead of ‘comme’, “like”), consonant skeletons (tjrs for ‘toujours’, “always”), misapplied, missing or incorrect separators (j esper for ‘j'espère’, “I hope”; j'croibi1k, instead of ‘je crois bien que’, “I am pretty sure that”), etc. These deviations are due to three main factors: the small number of characters allowed per text message by the service (140 bytes), the constraints of the phones' small keypads and, last but not least, the fact that people mostly communicate with friends and relatives in an informal register.
Whatever their causes, these deviations considerably hamper any standard natural language processing (NLP) system, which stumbles against so many Out-Of-Vocabulary words. For this reason, as noted by Sproat et al. (2001), an SMS normalization must be performed before a more conventional NLP process can be applied. As defined by Yvon (2008), “SMS normalization consists in rewriting an SMS text using a more conventional spelling, in order to make it more readable for a human or for a machine.”

In the general framework of an SMS-to-speech synthesis system (Cf. the Vocalise project, http://cental.fltr.ucl.ac.be/team/projects/vocalise/), we developed an SMS normalization module that only uses models learned from a training corpus. The normalization involves three steps. First, an SMS-dedicated lexicon look-up, which differentiates between known and unknown parts of a noisy token. Second, a rewrite process, which creates a lattice of weighted solutions. The rewrite model differs depending on whether the part to rewrite is known or not. Third, a combination of the lattice of solutions with a language model, and the choice of the best sequence of lexical units.
Our normalization models were trained on a French SMS corpus of 30,000 messages, gathered in Belgium, semi-automatically anonymized and manually normalized by the Catholic University of Louvain (Fairon and Paumier, 2006). Together, the SMS corpus and its transcription constitute parallel corpora aligned at the message-level.

Our presentation will be organized as follows. First, we will focus on the way we gathered the SMS corpus in the framework of the sms4science project (Cf. http://www.sms4science.org/), which aims at collecting international SMS corpora. Second, we will describe the finite-state algorithm we designed to align the SMS corpus and its transcription at the character level, a step needed to automatically learn pieces of knowledge from these corpora. Third, we will detail our finite-state normalization process and, alongside, the way we learned the normalization models from our aligned corpora. This presentation will end with an evaluation of the system, and a small demonstration of the complete SMS-to-speech prototype already available for Android smartphones: text-it/voice-it.

References
Markus Bieswanger. 2007. abbrevi8 or not 2 abbrevi8: A contrastive analysis of different space and time-saving strategies in English and German text messages. In Texas Linguistics Forum, volume 50.
Cédrick Fairon and Sébastien Paumier. 2006. A translated corpus of 30,000 French SMS. In Proc. LREC 2006, May.
Cédrick Fairon, Jean R. Klein, and Sébastien Paumier. 2006. Le langage SMS: étude d'un corpus informatisé à partir de l'enquête Faites don de vos SMS à la science. Presses Universitaires de Louvain. 136 pages.
Richard Sproat, A.W. Black, S. Chen, S. Kumar, M. Ostendorf, and C. Richards. 2001. Normalization of non-standard words. Computer Speech & Language, 15(3):287-333.
Crispin Thurlow and Alex Brown. 2003. Generation txt? The sociolinguistics of young people's text-messaging. Discourse Analysis Online, 1(1).
François Yvon. 2008. Reorthography of SMS messages. Technical Report 2008, LIMSI/CNRS, Orsay, France.

14.30-14.55

The SemEval-2010 shared task on "Linking Events and Their Participants in Discourse"

Josef Ruppenhofer, Caroline Sporleder, and Roser Morante (ILK)

In this talk we present the SemEval-2010 shared task on "Linking Events and Their Participants in Discourse". This task is a variant of the classical semantic role labelling task. The novel aspect is that we focus on linking local semantic argument structures across sentence boundaries. In the shared task, we intend to make a first step towards taking SRL beyond the domain of individual sentences by linking local semantic argument structures to the wider discourse context. In particular, we address the problem of finding fillers for roles which are neither instantiated as direct dependents of our target predicates nor displaced through long-distance dependency or coinstantiation constructions. We will describe the dataset, the evaluation method and the systems that participated in the task.

14.55 Coffee break
15.25-16.15

Combining Classifiers for Named Entity Recognition

Bart Desmet and Véronique Hoste (LT³)

This presentation explores the use of classifier ensembles for the task of named entity recognition (NER) on a Dutch dataset. Classifiers from 3 classification frameworks, namely memory-based learning (MBL), conditional random fields (CRF) and support vector machines (SVM), were trained on 8 different feature sets to create a pool of classifiers from which an ensemble could be built. A genetic algorithm approach was used to find the optimal ensemble combination, given various voting mechanisms for combining classifier outputs. The experiments yielded a classifier ensemble that outperformed the best individual classifier by 0.67 percentage points (F-score), a small but statistically significant margin. Experimental results also showed that ensembling classifiers from different frameworks benefits generalization performance.
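
One of the simplest voting mechanisms for combining classifier outputs is a plurality vote per token, sketched below. The label sequences are invented for illustration, and the sketch covers neither the genetic algorithm used to select ensemble members nor the other voting schemes explored in the talk.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-token label sequences from several classifiers by plurality vote."""
    combined = []
    for labels in zip(*predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


# Hypothetical outputs of three classifiers on the same four tokens.
mbl = ["B-PER", "I-PER", "O", "B-LOC"]
crf = ["B-PER", "O", "O", "B-LOC"]
svm = ["B-PER", "I-PER", "O", "O"]

ensemble = majority_vote([mbl, crf, svm])
print(ensemble)
```

Each classifier makes at most one error in this toy example, but on different tokens, so the vote recovers the labels that two of the three agree on. This is the intuition behind combining classifiers from different frameworks.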

Computational approaches to creativity

Tom De Smedt (CLiPS)

Traditionally, software applications for computer graphics have been based on real-world analogies. Each icon in the application's user interface represents a concrete object - a pen, an eraser, scissors. This model raises creative limitations: features can only be used as implemented by the developers, the screen is too small to display all the features (some are never discovered), actions are mouse-based so the user's decision-making process is literally lost in translation. "NodeBox" (http://nodebox.net) is an ongoing effort to produce software that allows more people to express themselves creatively. Essentially, NodeBox is a free software application that produces 2D visuals (artistic, data visualization) based on Python programming code.
One of our areas of interest is the way creative ideas are established and how these ideas can be mined from text. Using a number of NLP techniques (shallow parser, semantic network) and drawing inspiration from cognitive processes such as analogy and concept fluidity, the system is able to translate graphically underspecified concepts to something that can be used in a visual representation. For example, "creepy" has no direct visual representation - instead the system could propose you use an image of an octopus for your creepy design. For a given property (e.g. "creepy") and a range of concepts (e.g. animals) it yields the concepts from the range that best resemble the property (the creepiest animals). In this particular example the system will suggest such animals as octopus, bat, crow, locust, mayfly, termite, tick, amphibian, arachnid... No fluffy bunnies or frolicking ponies there!

16.15-17.00 Discussion & closing

List of participants

Walter Daelemans
Guy De Pauw
Tom De Smedt
Mike Kestemont
Kim Luyckx
Roser Morante
Claudia Peersman
Frederik Vaassen
Vincent Van Asch
Peter Berck
Tanja Gaustad van Zaanen
Steve Hunt
Martin Reynaert
Matje van de Camp
Antal van den Bosch
Maarten van Gompel
Menno van Zaanen
Sander Wubben
Kalliopi Zervanou
Orphée De Clercq
Bart Desmet
Véronique Hoste
Germán Hurtado Martín
Els Lefever
Lieve Macken
Dries Tanghe
Philip van Oosten