META-NET White Paper Series
Languages in the European Information Society
Volume X – Swedish

INTERNAL DRAFT – NOT TO BE USED FOR EXTERNAL COMMUNICATION

www.meta-net.eu
office@meta-net.eu
Tel: +49 30 3949 1833
Fax: +49 30 3949 1810


Preface

Many European languages run the risk of becoming victims of the digital age as they are underrepresented and under-resourced online. Huge regional market opportunities remain unused today because of language barriers. If we do not take action now, speaking their native language will become a social and economic disadvantage for many European citizens.

Innovative multilingual Language Technology is the ultimate intermediary that can help all European citizens to participate in an egalitarian, inclusive, and economically successful knowledge and information society. Language technology can be an enabler of instantaneous, cheap, and effortless communication and interaction across language boundaries. However, the degree to which Language Technology is used in the EU varies from language to language. So do the actions that need to be taken within META-NET, depending on factors such as the complexity of the respective language, the size of its community, and the existence of active research centres in this area.

The META-NET white paper series ‘Languages in the European Information Society’ reports on the state of each European language with respect to Language Technology and explains the most prominent risks and chances. The series will cover all official European languages.

While there are numerous valuable and comprehensive scientific reviews on certain aspects of individual languages and the language technology available for them, there is as yet no generally understandable survey that summarises the main findings and challenges for each language. The META-NET white paper series is intended to fill this gap.

Imprint

Authors/Editors:
Mikael Parkvall, Stockholm University
Jonas Lindh, University of Gothenburg
Lars Borin, University of Gothenburg


This document is part of the Network of Excellence “Multilingual Europe Technology Alliance (META-NET)”, co-funded by the 7th Framework Programme and the ICT Policy Support Programme of the European Commission through the contracts T4ME (grant agreement no.: 249119), CESAR (grant agreement no.: 271022), METANET4U (grant agreement no.: 270893), and META-NORD (grant agreement no.: 270899).


Table of Contents

Preface
Imprint
Table of Contents
Executive Summary
Introduction: A Risk for our Languages – a Challenge for Language Technology
    Language Technology is the Key
    Opportunities of Language Technology
    Challenges of Language Technology
    Why is it so immensely difficult for computers to deal with language?
    Why Language Whitepapers?
    Who is META-NET?
Swedish in the European Information Society
    General Facts
    Particularities of the Swedish Language
    Recent developments
    Language cultivation in Sweden
    Language in Education
    International aspects
    Swedish on the Internet
    Selected Further Reading
Language Technology Support for Swedish
    Language Technologies
    Language Technology Application Architectures
    Core application areas
    Web search
    Language checking
    Speech interaction
    Machine translation
    Information management/“LT behind the scenes”
    Miscellaneous
    LT Industry and Programs (ca. 1 page)
    LT Research and Education (ca. 1 page)
    Status of Tools and Resources for Swedish
    Status of Tools and Resources for Swedish
    Conclusions
References
META-NET
    What is the goal?
    First META-NET Events in 2010
    Current composition of the META Technology Council
    Composition of the META-NET Network of Excellence


Executive Summary


(max. 1 page)

Introduction: A Risk for our Languages – a Challenge for Language Technology

We are currently witnessing a digital revolution whose impact on language and society is comparable to that of Gutenberg’s invention of the printing press. Digitisation and networked communication technology make possible an unlimited exchange of information and services – at any place, at any time. The downside is that certain groups (for example, people who live in rural areas or senior citizens) have difficulties participating in this new information-driven society – a problem widely known as the digital divide.

Aspects like the availability of broadband access or appropriate user interfaces and mobile devices have so far dominated the discussion of the digital divide. These are hardware issues that will eventually be solved. Surprisingly, fundamental questions have not gained any attention in the public discourse yet:

Which of our languages will make it into and then persist in the networked information and knowledge society?

From larger languages such as French or German to smaller ones such as Latvian and Maltese, do other languages stand a chance of surviving next to English?

Just like modern printing did five hundred years ago, digital communication will have far-reaching and dramatic effects on the languages in Europe. Five hundred years ago the new opportunities of large-scale communication triggered orthographic and grammatical standardisation for some languages and made the rapid dissemination of new scientific and intellectual ideas possible. At the same time, small languages and regional dialects were rarely put to print. This turned out to be a considerable disadvantage as it limited their sphere to oral conversation and sometimes even contributed to their eventual extinction, e.g., in the case of Cornish.

Today’s multitude of languages is one of Europe’s richest and most important cultural assets and it is also a vital part of its social success story. While big languages such as English or Chinese will certainly be well represented in the emerging digital society and marketplace, we have to be aware that many European languages are in real danger of being cut off from digital communication if we do not act now.

Such a development would be most unwelcome. First, this would mean that a strategic opportunity remains unused, weakening Europe’s position in the global market. Second, such a development would run counter to the crucial goal of equal participation of every European citizen regardless of his or her language.


Language Technology is the Key

The key to protecting and furthering the heterogeneous group of more than 60 European languages is Language Technology. Research has made considerable progress in the last few years. Machine translation delivers a reasonable amount of accuracy, albeit only in specific domains, and experimental applications provide multilingual information and knowledge management as well as content production across many European languages. This opens a genuine window of opportunity.

The Language Whitepaper series provided by the META-NET initiative is intended to promote knowledge about language technology and its consequences for all official European languages. The expert analysis and assessment of the situation for each language will help to maximise the impact of language technology and to avoid the risks it potentially entails. This whitepaper takes a close look at Swedish.

Opportunities of Language Technology

The Internet connects citizens nationally and internationally via a growing number of mobile and stationary devices. Once the necessary infrastructure is in place, Language Technology will allow people to collaborate, do business, share a tremendous amount of knowledge, and participate in social and political opinion forming – across language borders and independent of their computer skills.

Language Technology can realise automatic translation, multilingual information and knowledge management and content production across all European languages. It will enhance the development of intuitive language-based interfaces to technology ranging from household electronics, machinery and vehicles to computers and robots. While prototypes for several of these technologies exist, they are by no means perfect and are rudimentary at best. Nevertheless it is safe to say that current progress opens a genuine window of opportunity. Further huge market opportunities lie in the entertainment sector, including games and mobile information services, and in the educational sector, including computer-assisted language learning and next-generation self-assessment software for many different areas of study.

Language Technology offers tremendous chances for the European Union and its multilingual environment, both from the viewpoint of economy and the viewpoint of citizenship. But it also opens international economic opportunities. From a worldwide perspective, multilingualism is the rule, not the exception. Experiences with multilingual Language Technology within the EU can be adapted to the specific needs of other highly multilingual communities, such as those in India or China.


Challenges of Language Technology

As has been stated above, Language Technology has made considerable progress in the last few years. Unfortunately, the current pace of technological progress is too slow to arrive at substantial software products to further communication and productivity in a multilingual environment within the next 10 to 20 years. Those basic technologies that are already widely used are usually monolingual and only available for a handful of languages. Well-known examples of the broad use of Language Technology are the spelling and, recently, grammar correction features in modern text processing systems.

Applications for multilingual communication such as Machine Translation require a certain level of sophistication. Online services like Google Translate or Bing Translator are helpful when it comes to getting a rough idea of what a document in a foreign language is about. However, services such as these and also professional Machine Translation applications are fraught with multiple difficulties, especially if correct and also complete translations are needed. Well-known examples of funny-sounding mistranslations (for example, literal translations of names such as “Bush” or “Kohl”) are only the tip of the iceberg here.

Applications such as language and voice-based user interfaces or dialogue systems are used only in specialised domains and exhibit limited performance. An active field of research is technology for rescue operations in disaster areas. In such high-risk environments, the accuracy of translations can determine the outcome of life or death situations. The same holds for language-equipped technology in the health care sector. Intelligent robots with cross-lingual language capabilities have the potential to save lives.

A concerted, substantial, continent-wide effort in language technology research and engineering is needed for realising applications that enable automatic translation, multilingual information and knowledge management and content production across all European languages.

Why is it so immensely difficult for computers to deal with language?

In this whitepaper, you will find information on available software applications and digital text collections as well as databases for Swedish. To illustrate how computers deal with language and why this is a very difficult task, we take a brief look at the way humans acquire first and second languages and then sketch how Machine Translation systems work – after all, there is a reason why the field of Language Technology is closely linked to research in the field of Artificial Intelligence.


Humans acquire their language skills basically in two different ways. A baby learns its mother tongue from examples. It grows up surrounded by language users such as parents, siblings and other family members and, from about age two onwards, it is able to produce its own first words and short phrases. At school age, a second language is usually acquired by learning its grammatical structure, vocabulary, and the orthographic system with the help of books that contain linguistic knowledge in terms of abstract rules and tables as well as example texts. Learning a foreign language takes a lot of time and effort and it gets more difficult with age.

In a rather simplified view, the two main types of Language Technology systems acquire their language capabilities in very much the same way as humans. In the first, currently predominant, statistical setting, computers analyse vast collections of texts, either in a single language or as so-called parallel texts that are available in two or more languages. Machine Learning algorithms are, to a certain extent, able to derive patterns of how words, short phrases, and full sentences are correctly used in one language or translated from one language to another. Spelling correction in text processing software works this way. The sheer number of sentences needed is huge, and performance improves the more text material is analysed. It is not uncommon to train such systems on texts that comprise millions of sentences. Available online information and translation services such as Google Search and Google Translate work in a purely statistical (“data-driven”) fashion. This is one of the reasons why search engine providers are eager to collect as much written material as possible.
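To make the statistical setting more tangible, the following is a minimal sketch of a frequency-based spelling corrector. It is not the method of any particular product: the toy corpus, the alphabet and all function names are assumptions chosen for illustration, and a real system would be trained on millions of sentences rather than two.

```python
from collections import Counter

# Toy training corpus (an assumption for illustration only).
CORPUS = "the woman gave an apple to the man . the man ate the apple ."
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def train(text):
    """Count how often each word occurs in the training text."""
    return Counter(w for w in text.lower().split() if w.isalpha())


def edits1(word):
    """All strings one deletion, substitution, or insertion away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + replaces + inserts)


def correct(word, counts):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=lambda w: counts[w]) if candidates else word


counts = train(CORPUS)
print(correct("aple", counts))   # a known word one insertion away
print(correct("mam", counts))    # a known word one substitution away
```

The corrector knows nothing about grammar or meaning; everything it can do comes from the word counts, which is why such systems improve as more text is analysed.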

The second type of Language Technology system works in a rule-based fashion. Linguists, along with Computer Science experts and Computational Linguists, encode grammatical analysis or translation rules and compile, among other things, vocabulary lists (lexicons). Setting up rule-based systems is very time-consuming and labour-intensive. It requires highly specialised experts – some of the well-known rule-based Machine Translation systems have been under constant development for more than 20 years. The advantage of rule-based systems is that the experts can control the language processing in detail. This makes it possible to correct mistakes of the software in a systematic way and to give detailed feedback to the user, for example, if it is used for language learning. However, due to financial constraints, rule-based language technology is only feasible for major languages.

Why Language Whitepapers?

META-NET takes up the challenge and starts with an analysis of the state of affairs in Language Resources and Language Technologies. The detected gaps will certainly be heterogeneous for the different European Languages. In parallel, META-NET is teaming up with stakeholders from many areas of society, industry and research to generate strategic visions and finally a strategic research agenda of how language technology applications will bridge these gaps within the next years until 2020 (see the Appendix for more information). One of the topics that will deserve special attention on the research side is coupling statistical and rule-based Language Technology into hybrid systems. These will eventually produce much better results while lowering the costs for development and maintenance.

Who is META-NET?

META-NET is a Network of Excellence funded by the European Union. It currently consists of 44 members, representing 31 European countries, which are listed in Table XX at the end of this document. The list of members may be expanded in the future.

META-NET is dedicated to fostering the technological foundations for establishing and maintaining a truly multilingual European information society that:

enables communication and cooperation across languages,

secures users of any language equal access to information and knowledge,

offers advanced functionalities of networked information technology to all citizens at affordable costs.

META-NET wants the multilingual European digital information space to become a success story like the written culture after Gutenberg. If it works, the multicultural union of nations will prosper and serve as a model for the peaceful and egalitarian cooperation of people in other parts of the world. If it fails, Europe will be forced to choose between sacrificing cultural identities and economic defeat.

META – The Multilingual Europe Technology Alliance

Swedish in the European Information Society

General Facts

According to my own estimate (Parkvall 2009), native speakers of Swedish alone make up about 85% of Sweden’s population, or about 7.7 million people. Of the remaining 15% (about 1.35 million people), virtually all of those who have grown up in Sweden [cannot find their number offhand; should I dig?] must have acquired Swedish as one of their native languages, be it in addition to an immigrant language or an indigenous minority tongue. As of 2010, 1.35 million inhabitants of Sweden were born abroad, according to Statistics Sweden. The foreign-born population, however, includes adopted children, some individuals born abroad to Swedish parents, and members of Swedish-speaking ethnic groups in Finland, Estonia and Ukraine (for which, see below). Together, these total just over 100 000, making Swedish one of the major immigrant languages of the country.

Parkvall (2009) estimates about 185 000 native speakers of highly divergent Swedish dialects, of whom 5 000–10 000 use varieties eccentric enough to merit being considered languages in their own right.

Outside Sweden, Swedish also enjoys official standing in Finland, whose statistical authorities claim 290 000 native speakers. Their number has been declining since the Second World War, and as a proportion of the population of Finland, Swedish speakers have been shrinking since the 17th century.

All Finns are also required to study Swedish, which of course does not guarantee that they leave school with any proficiency in it. Most in fact do not, but when questioned in a survey administered by the European Union, 38% [chk+ref] did claim to be capable of conversing in Swedish. For whatever it is worth, my personal experience suggests that Finns are more prone to underestimate than to overestimate their proficiency in Swedish.

Indigenous Swedish-speaking communities (here arbitrarily defined as groups where the language survives more than three generational changes among a sizeable proportion) have also existed in four other (contemporary) countries: Russia (small enclaves in the Petersburg and Karelian areas, which were mainly offshoots of Finland’s Swedish-speaking population), the United States (where the language of the 17th century colony of New Sweden survived until the early 1800s), Estonia and Ukraine. In Estonia, the vast majority of the Swedish-speaking population (present since at least the 13th century) of about 7 000 fled to Sweden in the wake of the Second World War, and the remaining individuals are probably to be counted in dozens (at most) rather than hundreds. The Ukrainian group descended from Estonian Swedes deported in the late 18th century. Most emigrated to Sweden and North America in 1929, and only a handful of survivors remain today.

Apart from these groups, Swedish speakers outside of Sweden and Finland consist of emigrants and temporary expatriates from these two countries. The number is likely to be around 300 000 (Parkvall [ref Nationalatlasen]), mainly in the other Nordic countries, in western Europe, the United States, Canada and Australia. In none of these countries, however, do they represent more than a negligible proportion of the total population.

Second-language speakers of Swedish can of course be found in many countries, among former exchange students or emigrants, or simply among individuals who have learnt the language in situ for any of a number of reasons. There are no statistics on how many people this group might consist of, especially as one-time studies do not guarantee competence in a language.

The number of daily newspapers in Sweden was 168 in 2008, according to Statistics Sweden, a number which seems reasonably stable despite falling circulation. The definition of a “daily” newspaper used is one which is published at least three times a week.

In 2008, 26 182 "books and pamphlets" were published in Sweden, a number which increased constantly during the decade. The total includes 86% original works and 14% translations. Interestingly, a fourth of the original works were published in languages other than Swedish. These publications, however, were normally not in any of the indigenous minority languages or any of the major immigrant languages, but overwhelmingly in English. An impressive 22% of all original works published in Sweden in 2008 were in English.

In Finland, about 500 original Swedish-language titles are published yearly (Statistics Finland), in addition to which there are about 100 translations into Swedish.

Some remarks on translations have already been made above. In addition to this, I have consulted UNESCO’s Index Translationum. That database lists 31 474 translations into Swedish, and 31 358 with Swedish as the source language.

Given that Statistics Sweden counts about 3 000 annual translations into Swedish in Sweden alone, it would seem that the two sources differ in scope. However, since 2005, the Index Translationum does include about 2 500 cases yearly of Swedish as a target language of translations.

As of 2011, Sweden’s foremost trading partner (according to Statistics Sweden) is Germany, followed by (in order) Norway, Denmark, Britain, the Netherlands, Finland, the United States, France, Belgium, China and Russia.

According to its Finnish counterpart, Finland’s ten main trading partners are Germany, Russia, Sweden, China, Britain, the United States, the Netherlands, France, Italy and Estonia.

Swedish has a large variety of dialects, e.g. XX and YY. The orthography is very similar in all areas, except for minor differences... In general, all Swedish dialects share the same grammar, even though some dialects exhibit slightly different syntactic constructions. Minor lexical differences exist, e.g. the word “XX” is only used in YYY instead of the standard Swedish “ZZZ”.

Particularities of the Swedish Language

The Swedish language exhibits some specific characteristics, which contribute to the richness of the language by allowing the speakers to express ideas in a large variety of ways. One such particularity is …

blab la bla. In English, there are two more ways to express the same idea, namely:

The woman gave an apple to the man. An apple was given to the man by the woman.

In Swedish, there exist at least… .... ....

Also, the Swedish language is extremely productive when it comes to coining new words. This is mainly due to the compounding system, which allows speakers to put together words (and affixes) in a quite simple way. In theory, this allows the creation of infinitely long words:

Stab
Trennstab
Warentrennstab
Warentrennstabregal
Warentrennstabregalboden
…
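A chain like the one above can be taken apart mechanically if a lexicon of possible parts is available. The following is a minimal sketch of such a compound splitter; the mini-lexicon is a hypothetical one built only from the parts of the example compound, and the function name is an assumption. Real splitters also have to deal with linking elements, ambiguous segmentations and parts missing from the lexicon, none of which is handled here.

```python
# Hypothetical mini-lexicon built from the parts of the example compound above.
LEXICON = {"waren", "trenn", "stab", "regal", "boden"}


def split_compound(word, lexicon=LEXICON):
    """Return one segmentation of `word` into lexicon entries, or None if there is none."""
    word = word.lower()
    if not word:
        return []
    # Try the longest prefix first and back off to shorter ones.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in lexicon:
            rest = split_compound(word[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None


print(split_compound("Warentrennstabregalboden"))
print(split_compound("Trennstab"))
print(split_compound("Stabx"))  # no full segmentation exists, so None
```

The sketch only succeeds when every part is already listed, which illustrates why freshly coined compounds are easy for humans but hard for machines.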

Usually, a human can easily derive the meaning of these neologisms, but a machine can hardly process them. In addition, Swedes tend to use comparatively long and nested sentences, which adds another difficulty to automatic processing. Another characteristic of Swedish that makes processing difficult is separable verb prefixes, which can occur far from the verb in nested constructions like:

....

The difference in meaning between verbs…


Recent developments

From the 1950s on…bla bla…, …anglicisms, i.e., originating from the English language. According to Lemnitzer (2007), …

However, in some areas, anglicisms…

The example demonstrates the importance of raising awareness of a development that entails the risk of excluding large parts of the population from taking part in the information society, namely those who are not familiar with English.

Language cultivation in Sweden

The Swedish language is represented by various publicly funded societies and language bodies, e.g.,.

Others raise awareness of sensible language use by discussing funny developments such as the influential use of incorrect...

Private initiatives specifically turn against anglicisms:.

Unlike other countries, Sweden…

Language in Education

The first XX study, conducted in XXX, revealed that...

International aspects

Sweden is often referred to as the land of….

Swedish on the Internet

In spring 2010, 86% of Swedes were internet users [1]. Most of them reported being online every day…

For language technology, the growing importance of the internet matters in two ways. On the one hand, the large amount of digitally available language data represents a rich source for analysing the usage of natural language, in particular by collecting statistical information. On the other hand, the internet offers a wide range of application areas involving language technology.

The most commonly used web application is certainly web search, which involves the automatic processing of language on multiple levels, as we will see in more detail in the second part of this paper. It involves sophisticated language technology, differing for each language. For Swedish, this comprises matching...

[1] http://www.iis.se/docs/SOI2010_web_v1.pdf

However, it becomes less surprising if we consider the complexity of the Swedish language and the number of technologies involved in typical LT applications. In the next chapter, we will present an introduction to language technology and its core application areas as well as an evaluation of the current situation of LT support for Swedish.

Selected Further Reading etc.

Language Technology Support for Swedish

Language Technologies

Language technologies are information technologies that are specialized for dealing with human language. Therefore these technologies are also often subsumed under the term Human Language Technology. Human language occurs in spoken and written form. Whereas speech is the oldest and most natural mode of language communication, complex information and most of human knowledge is maintained and transmitted in written texts. Speech and text technologies process or produce language in these two modes of realization. But language also has aspects that are shared between speech and text such as dictionaries, most of grammar and the meaning of sentences. Thus large parts of language technology cannot be subsumed under either speech or text technologies. Among those are technologies that link language to knowledge. Figure 1 illustrates the Language Technology landscape.

In our communication we mix language with other modes of communication and other information media. We combine speech with gesture and facial expressions. Digital texts are combined with pictures and sounds. Movies may contain language in spoken and written form. Thus speech and text technologies overlap and interact with many other technologies that facilitate processing of multimodal communication and multimedia documents.

Language Technology Application Architectures

Typical software applications for language processing consist of several components that mirror different aspects of language and of the task they implement. Figure 2 displays a highly simplified architecture that can be found in a text processing system. The first three modules deal with the structure and meaning of the text input:

Pre-processing: cleaning up the data, removing formatting, detecting the input language, normalising the encoding of characters such as “å”, “ä” and “ö” for Swedish, etc.

Grammatical analysis: finding the verb and its objects, modifiers, etc.; detecting the sentence structure.

Semantic analysis: disambiguation (Which meaning of “apple” is the right one in the given context?), resolving anaphora and referring expressions like “she”, “the car”, etc.; representing the meaning of the sentence in a machine-readable way.
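The three modules listed above can be pictured as functions chained one after another. The sketch below is a deliberately naive illustration under stated assumptions: the verb list, the cue words and all function names are hypothetical, and each stage is a toy stand-in for what would be a substantial component in a real system.

```python
import re

VERBS = {"gave", "ate", "bought"}            # hypothetical verb list
FRUIT_CUES = {"ate", "tree", "juice"}        # hypothetical context cues
COMPANY_CUES = {"shares", "laptop", "phone"}


def preprocess(raw):
    """Pre-processing: strip markup-like residue and normalise whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", text).strip()


def grammatical_analysis(text):
    """Toy stand-in for parsing: tokenise and pick the first known verb."""
    tokens = re.findall(r"\w+", text.lower())
    verb = next((t for t in tokens if t in VERBS), None)
    return {"tokens": tokens, "verb": verb}


def semantic_analysis(parse):
    """Toy disambiguation: choose a sense of 'apple' from co-occurring words."""
    tokens = set(parse["tokens"])
    senses = {}
    if "apple" in tokens:
        if tokens & COMPANY_CUES:
            senses["apple"] = "company"
        elif tokens & FRUIT_CUES:
            senses["apple"] = "fruit"
        else:
            senses["apple"] = "unknown"
    return {**parse, "senses": senses}


def run_pipeline(raw):
    """Chain the three modules in the order described in the text."""
    return semantic_analysis(grammatical_analysis(preprocess(raw)))


result = run_pipeline("<p>The woman  ate an apple.</p>")
print(result["verb"], result["senses"])
```

Each stage consumes the output of the previous one, so an error made early (for example a wrongly detected verb) propagates to all later stages.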

Task-specific modules then perform many different operations such as automatic summarization of an input text, database look-ups and many others. Below, we will illustrate core application areas and highlight certain of the modules of the different architectures in each section. Again, the architectures are highly simplified and idealised, serving to illustrate the complexity of language technology applications in a generally understandable way.

[Figure 1: The Language Technology Landscape]

[Figure 2 diagram, boxes: Input Text, Pre-processing, Grammatical Analysis, Semantic Analysis, Task-Specific Modules, Output]

After the introduction of the core application areas, we will briefly give an overview of the situation in LT research and education, concluding with an overview of (past) funding programs. At the end of this section, we will present an expert estimate of the situation regarding core LT tools and resources in a number of dimensions such as availability, maturity, or quality. The resulting table gives a good overview of the situation of LT for Swedish.

Core application areas

Web search

The search engine Google, which started in 1998, is nowadays used for about 80% of all search queries worldwide [2]. Since 2004, the verb googla even has an entry in the Swedish Duden dictionary. Neither the search interface nor the presentation of the retrieved results has significantly changed since the first version. In the current version, Google offers spelling correction for misspelled words, and in 2009 it incorporated basic semantic search capabilities into its algorithmic mix [3], which can improve search accuracy by analysing the meaning of the query terms in context. The success story of Google shows that with a lot of data at hand and efficient techniques for indexing these data, a mainly statistically-based approach can lead to satisfactory results.

However, for a more sophisticated information need, integrating deeper linguistic knowledge is essential. In particular, if a search query consists of a question or a complete sentence rather than a list of keywords, retrieving relevant answers to this query requires an analysis of this question or sentence on a syntactic and semantic level as well as the availability of an index that allows for a fast retrieval of relevant documents.
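The kind of index that allows fast retrieval is commonly an inverted index, which maps each term to the documents containing it. The following is a minimal sketch, not a description of how any particular search engine works; the three sample documents and the function names are assumptions made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical mini-collection of documents.
DOCS = {
    1: "Company A was taken over by company B",
    2: "Company C took over a rival",
    3: "Apples grow on trees",
}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map every term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing all query terms (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(DOCS)
print(search(index, "company taken"))
print(search(index, "company over"))
```

Note that the query "company over" returns both document 1 and document 2: pure keyword matching cannot tell a company that was taken over from one that took over another, which is where syntactic and semantic analysis comes in.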

2 http://www.spiegel.de/netzwelt/web/0,1518,619398,00.html
3 See http://www.pcworld.com/businesscenter/article/161869/google_rolls_out_semantic_search_capabilities.html

Figure 2: A Typical Text Processing Application Architecture

Figure 3: A Web Search Architecture (components: Web pages, Pre-processing, Semantic Processing, Indexing; User query, Pre-processing, Query Analysis; Matching & Relevance; Search Results)

For example, imagine a user inputs the query “Give me a list of all companies that were taken over by other companies in the last five years”. A simple keyword-based approach will not take us very far here. Expanding the query terms with synonyms, for example using an ontological language resource like WordNet (or its Swedish equivalent), may improve the results. However, for a satisfactory answer, a deeper query analysis is necessary. For example, by applying a syntactic parser to analyse the grammatical structure of the sentence, we can determine that the user is looking for companies that have been taken over and not companies that took over others. We also need to process the expression “last five years” to find out which years it refers to.
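Two of the query analysis steps mentioned above, synonym expansion and the interpretation of “last five years”, can be illustrated with a minimal sketch. The synonym table, the function names and the reading of “last n years” as an inclusive range ending in the current year are assumptions made purely for illustration; they do not describe any particular search engine or lexical resource.

```python
from datetime import date

# Hypothetical synonym table standing in for a WordNet-style resource.
SYNONYMS = {
    "companies": ["firms", "corporations"],
    "taken over": ["acquired", "bought"],
}


def expand_terms(terms):
    """Return each query term followed by its listed synonyms."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded


def resolve_last_n_years(n, today):
    """Map 'the last n years' to an inclusive (first_year, last_year) range.

    Assumption: the current year counts as one of the n years.
    """
    return (today.year - n + 1, today.year)


if __name__ == "__main__":
    print(expand_terms(["companies", "taken over"]))
    print(resolve_last_n_years(5, date.today()))
```

Note that synonym expansion alone still cannot tell which company was the buyer and which was taken over; that distinction needs the syntactic analysis described above.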

Finally, the processed query needs to be matched to a massive amount of unstructured data in order to find the piece or pieces of information the user is looking for. This involves the retrieval and ranking of relevant documents. In addition, to generate a list of companies, we also need to extract the information that a particular string of words in a document refers to a company name. This kind of information is tagged using a named-entity recognizer.

We face an additional challenge if we want to match a query to documents written in a different language. For multilingual search, we have to automatically translate the query to all possible source languages and map the retrieved information back to the target language. Again, this requires a linguistic analysis of all texts involved.

For users with a very specialized information need, an expansion of the query may require additional knowledge resources like a domain-specific ontology, representing the concepts relevant within the domain and the relationships between those concepts.
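The retrieval and ranking step mentioned above can be sketched with an inverted index and a simple TF-IDF score. This is a minimal illustration only: the three example documents, the whitespace tokenizer and the scoring formula (term frequency times the logarithm of the inverse document frequency) are assumptions chosen for brevity, not a description of how any existing search engine works.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (no linguistic pre-processing)."""
    return text.lower().split()


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def rank(query, index, n_docs):
    """Return doc ids sorted by descending TF-IDF score (ties by id)."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in any document
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical example documents.
DOCS = {
    "d1": "acme was acquired by globex in 2008",
    "d2": "globex reported record profits",
    "d3": "initech acquired a small startup",
}

if __name__ == "__main__":
    index = build_index(DOCS)
    print(rank("acquired globex", index, len(DOCS)))
```

A keyword index of this kind finds documents that mention the query terms, but it does not know that “acme” and “globex” are company names; that is the job of the named-entity recognizer.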

The increasing share of data available in non-textual format also drives the demand for services enabling multimedia search, i.e., information search on images, audio and video data. For audio and video files, this involves a speech recognition module to convert speech content into text or a phonetic representation, to which user queries can be matched.

In Sweden, companies like…. Swedish search engines include. (…)

Language checking (will be provided soon)

Anyone using a word processing tool such as Microsoft Word has come across a spell checking component that indicates spelling mistakes and proposes corrections. 40 years after the first spelling correction program by Ralph Gorin, language checkers no longer simply compare the list of extracted words against a dictionary of correctly spelled words, but have become increasingly sophisticated. In addition to language-dependent algorithms for handling morphology (e.g. plural formation), some are now capable of recognizing simple syntax-related errors, such as a missing verb or a verb that does not agree with its subject in person and number, e.g. in “She *write a letter.”

However, for other common error types the currently used methods are not sufficient. For example, take a look at the following first verse of a poem by Jerrold H. Zar (1992):

Eye have a spelling chequer,
It came with my Pea Sea.
It plane lee marks four my revue
Miss Steaks I can knot sea.

Most available spell checkers (including Microsoft Word) will find no errors in this poem because they mostly look at words in isolation. However, for detecting so-called homophone errors (e.g. “Eye” instead of “I”), the language checker needs to consider the context in which a word occurs. This requires either the formulation of language-specific grammar rules, i.e. a high degree of expertise and manual labor, or the use of a statistical language model to calculate the probability of a particular word occurring along with the preceding and following words. For a statistical approach, usually based on n-grams, a large amount of language data (i.e. a corpus) is required to obtain sufficient statistical information. Up to now, these approaches have mostly been developed and evaluated on English language data. However, they do not necessarily transfer well to other languages, e.g. highly inflectional ones or languages with a flexible word order. For these more complex languages, an advanced high-precision language checker may require the development of more sophisticated methods, involving a deeper linguistic analysis.

The use of language checking is not limited to word processing tools. Other application areas are authoring support, for example to assist the writer of technical documentation in using technical vocabulary consistently, and the field of computer-assisted language learning. Language checking is also applied to automatically correct queries sent to search engines, e.g. Google’s “Did you mean…” suggestions.
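The context-based detection of homophone errors described above can be illustrated with a small bigram model. The sketch below is a toy under stated assumptions: the six-sentence corpus, the confusion sets and the add-one smoothed score are invented for illustration, and a usable checker would need a large corpus and a more careful probability model.

```python
from collections import Counter

# Tiny illustrative corpus; a real checker needs a large one.
CORPUS = [
    "i have a spelling checker",
    "it came with my pc",
    "mistakes i can not see",
    "the eye is an organ",
    "the sea is blue",
    "tie a knot in the rope",
]

# Hypothetical confusion sets of words that sound alike.
CONFUSION_SETS = [{"eye", "i"}, {"sea", "see"}, {"knot", "not"}]


def train_bigrams(sentences):
    """Count adjacent word pairs, with sentence boundary markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


def context_score(counts, left, word, right):
    """Add-one smoothed product of the bigram counts around `word`."""
    return (counts[(left, word)] + 1) * (counts[(word, right)] + 1)


def suggest(sentence, counts):
    """Return (original, suggestion) pairs for likely homophone errors."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    corrections = []
    for i in range(1, len(tokens) - 1):
        for cset in CONFUSION_SETS:
            if tokens[i] not in cset:
                continue

            def score(word):
                return context_score(counts, tokens[i - 1], word, tokens[i + 1])

            best = max(sorted(cset), key=score)
            # Only suggest a change if the alternative fits strictly better.
            if score(best) > score(tokens[i]):
                corrections.append((tokens[i], best))
                tokens[i] = best  # later words see the corrected context
    return corrections


if __name__ == "__main__":
    model = train_bigrams(CORPUS)
    print(suggest("eye can knot sea", model))
```

A dictionary lookup would accept every word in “eye can knot sea”, since each one is spelled correctly; only the neighbouring words reveal the errors.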

Speech interaction (will be provided soon)

Processing speech data is also one of the main tasks in many interactive systems. For some applications, for example telephone banking, a speech recognition component matching a voice pattern against an existing vocabulary is enough. For other applications, e.g. dictation systems, more sophisticated software with the ability to process arbitrary natural speech input is required. For these applications, some linguistic analysis of the speech input is required.


In spite of major technological advances in recent years, currently available systems are still very restricted with respect to the vocabulary and sentence complexity they can process. This may mean, for example, that words unknown to the system are incorrectly processed or that the system can only deal with sentences on a simple syntactic level.

The expected accuracy rate of the recognition module is highly dependent on the application. Whereas the user of a dictation system will usually manually verify and edit the system output, more complex requirements are imposed on a dialog system intended to naturally converse with a human. Not only does this involve a deep linguistic analysis of the speech input (i.e. named entity recognition, part-of-speech tagging, co-reference resolution, parsing), but also a dialog management component, which uses knowledge of the specific task domain to analyse the input on a semantic and pragmatic level and generates the appropriate output.

Transforming the generated output of an interactive system into a speech signal is done by a speech synthesis component. Nowadays, speech synthesis is usually combined with pre-recorded language data in order to produce a more natural result. This is possible for systems used in a restricted domain. However, for the ultimate aim of automatically producing natural speech output from arbitrary textual input, more research is needed, in particular concerning the interrelation between syntax (as well as semantics and pragmatics) and prosody.

A key issue for future research is the personalization of interactive systems. To some degree, this is already possible, for example in dictation systems or car navigation systems, which can be trained to adapt to the user’s speaking style. The user-friendly design of dialog systems is especially important in assistive systems, e.g. for people with disabilities or elderly people, who may be reluctant to use computer systems. This will involve an analysis of human speech behavior in general and in particular of the way humans interact with computers.

At a time when European and international markets are growing together, an important future enhancement for interactive systems is the ability to work in a multilingual environment, which involves the automatic translation of text into other languages.

Machine translation (will be provided soon)

The idea of using digital computers to translate natural languages was put forward in 1946 by A. D. Booth and was followed by substantial funding for research in this area in the 1950s and again beginning in the 1980s. Nevertheless, machine translation (MT) still fails to fulfill the high expectations formulated in the early years.

At its most basic level, MT simply substitutes words in one natural language with words in another. This can be useful in domains where a very restricted, formulaic language is used, e.g. weather reports. However, for a good translation of a less standardized text, larger text units (phrases, sentences or even whole passages) need to be matched to their closest counterparts in the target language. The major difficulty here lies in the fact that human language is ambiguous, which presents challenges on multiple levels, for example word sense disambiguation on the lexical level or the attachment of prepositional phrases on the syntactic level.

One way of approaching the task is based on linguistic rules. For translations between closely related languages, a direct translation may be applied. But often, rule-based systems analyze the input text and create an intermediary, symbolic representation, from which the text in the target language is generated. The success of these methods is highly dependent on the availability of extensive lexicons with morphological, syntactic, and semantic information, and large sets of grammar rules carefully designed by a skilled linguist.

Beginning in the late 1980s, as computational power increased and became less expensive, more interest was shown in statistical models for MT. The parameters of these statistical models are derived from the analysis of bilingual text corpora, such as the Europarl parallel corpus, which contains the proceedings of the European Parliament in 11 European languages. Given enough data, statistical machine translation works well enough to convey the approximate meaning of a foreign language text. However, unlike rule-based systems, statistical MT often generates ungrammatical output. On the other hand, besides the advantage that less human effort is required for grammar writing, statistical MT can also cover particularities of the language missing in the rule-based system, for example idiomatic expressions.

As the strengths and weaknesses of rule-based and statistical MT are complementary, there is nowadays broad consensus in favour of hybrid approaches combining methodologies of both. This can be done in several ways. One is to use both rule-based and statistical systems and have a selection module decide on the best output for each sentence. However, for longer sentences, no result will be perfect. A better solution is to combine the best parts of each sentence from multiple outputs, which can be fairly complex, as corresponding parts of multiple alternatives are not always obvious and need to be aligned.

Another, more challenging approach is to design a new setup that combines the advantages of the two paradigms by integrating the good features of each, for example by making a rule-based system adaptive through an added module for rule learning, or by making a statistical MT system syntax-aware through added syntactic constraints.
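The limits of word-for-word substitution noted earlier in this section, and the benefit of matching larger text units, can be seen in a minimal sketch. The tiny Swedish-English lexicon and phrase table below are hypothetical and hand-written for illustration; real systems derive such tables from large lexicons or from parallel corpora.

```python
# Hypothetical toy lexicon and phrase table (Swedish -> English).
WORDS = {"det": "it", "regnar": "rains", "i": "in", "dag": "day"}
PHRASES = {("i", "dag"): "today"}


def translate_word_by_word(sentence):
    """Substitute each word independently; unknown words are kept as-is."""
    return " ".join(WORDS.get(w, w) for w in sentence.lower().split())


def translate_with_phrases(sentence, max_len=2):
    """Greedy left-to-right translation that tries the longest unit first."""
    tokens = sentence.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            unit = tuple(tokens[i:i + n])
            if n > 1 and unit in PHRASES:
                out.append(PHRASES[unit])  # multi-word unit matched
                i += n
                break
            if n == 1:
                out.append(WORDS.get(unit[0], unit[0]))  # single-word fallback
                i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(translate_word_by_word("Det regnar i dag"))   # it rains in day
    print(translate_with_phrases("Det regnar i dag"))   # it rains today
```

The word-by-word output is wrong because “i dag” only means “today” as a unit. The phrase-aware variant handles this one case, but it still knows nothing about word order, agreement or ambiguity, which is where the rule-based and statistical methods discussed above come in.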

- cooking metaphor for statistical vs. symbolic: with “learning by doing”/improvisation/sample-based learning you will get far, but for more complicated dishes, a recipe is needed

Information management/“LT behind the scenes” (will be provided soon)
- question answering
- IE
- IR
- summarization
- text generation

Miscellaneous (will be provided soon)
- eMobility: localized services
- eHumanities/digital humanities
- CLARIN (D-SPIN)
- digital libraries (DFG)
  o digitization of data/ more could be done with available data
- eLearning
- security
- plagiarism
- serious games
- self-assessment
- sentiment/opinion analysis

LT Industry and Programs (ca. 1 page)
(to be written)
- The user and provider industries in Sweden are certainly important and vital (XX, YY)
- many successful LT businesses, mostly SMEs/start-ups
- LT is often done “secretly” (marketing problem)
- Language industry a significant employer (how many jobs in Sweden?)
- Markets
  o The market for language technologies can only be estimated, and will most probably get a boost from mobile appliances, the Apple iPad and similar products, (educational) games, etc.
  o In Sweden, all foreign movies are translated and subtitles are provided for the dialogue.
- Previous programs:
  o MOLTO
  o Etc.
  o


LT Research and Education (ca. 1 page)
(to be written)
- Sweden has a number of excellent centres of computational linguistics. There is a multitude of universities and research centres.
- Universities
  o Speech Technology
      KTH, Royal Institute of Technology, School of Computer Science and Communication (division of Speech, Music and Hearing)
      University of Gothenburg, CLT (Centre for Language Technology), Dialogue Lab, mainly at Department of Philosophy, Linguistics and Theory of Science
  o Text based language technology research
      University of Gothenburg, CLT, including several departments and units.
        Faculty of Arts
          o The Swedish Language Bank (Språkbanken)
          o Department of Swedish
          o Department of Philosophy, Linguistics and Theory of Science
        IT faculty
          o Department of Applied IT
      Chalmers University of Technology
        Department of Computer Science and Engineering (also part of CLT)
      University of Borås
        The Swedish School of Library and Information Science
      Linköping University
        Department of Computer and Information Science
      Lund University
        Department of Linguistics and Phonetics
        Department of Computer Science
      Stockholm University
        Department of Computer and Systems Sciences
        Department of Linguistics
      Royal Institute of Technology (KTH)
        School of Computer Science and Communication
      Uppsala University
        Department of Linguistics and Philology
- Research institutes
  o Swedish Institute of Computer Science (SICS)
- Language technology consortia
  o Centre for Language Technology, University of Gothenburg
  o Graduate School of Language Technology (GSLT)
- Language technology documentation centres
  o
- Language councils
  o The Swedish Language Council (Språkrådet)
- Statistical bureaus
- Standards organizations
- Research councils
  o Swedish Research Council (Vetenskapsrådet)
  o (Riksbankens Jubileumsfond)
- Ministries of science, education, technology, industry, commerce
- National libraries and information centres
- Federations of industries
- Associations of copyright owners
- User groups for LT-related software and tools
- Language policy documents
- ICT policy documents
- Other relevant institutions, companies, individuals … varies between countries
- Conferences
- best paper awards?


Status of Tools and Resources for Swedish

The following table provides an overview of the current situation of language technology support for Swedish. The rating of existing technologies and resources is based on educated estimations by several leading experts using the following criteria (each ranging from 0 to 6).

1. Quantity: Does a tool/resource exist for the language at hand? The more technologies/resources exist, the higher the rating.
   0: no tools/resources whatsoever
   6: many technologies/resources, large variety

2. Availability: Are technologies/resources accessible, i.e., are they Open Source, freely usable on any platform or only available for a high price or under very restricted conditions?
   0: practically all technologies/resources are only available for a high price
   6: a large number of technologies/resources are freely, openly available under sensible Open Source or Creative Commons licenses that allow re-use and re-purposing

3. Quality: How well are the respective performance criteria of technologies and quality indicators of resources met by the best available tools, applications or resources? Are these technologies/resources current and also actively maintained?
   0: toy resource/technology
   6: high-quality technology, human-quality annotations in a resource

4. Coverage: To which degree do the best technologies meet the respective coverage criteria (styles, genres, text sorts, linguistic phenomena, types of input/output, number of languages supported by an MT system etc.)? To which degree are resources representative of the targeted language or sublanguages?
   0: special-purpose resource or technology, specific case, very small coverage, only to be used for very specific, non-general use cases
   6: very broad coverage resource, very robust technology, widely applicable, many languages supported

5. Maturity: Can the technology/resource be considered mature, stable, ready for the market? Can the best available technologies/resources be used out-of-the-box or do they have to be adapted? Is the performance of such a technology adequate and ready for production use or is it only a prototype that cannot be used for production systems? An indicator may be whether resources/technologies are accepted by the community and successfully used in LT systems.
   0: preliminary prototype, toy system, proof-of-concept, example resource exercise
   6: immediately integratable/applicable component

6. Sustainability: How well can the technology/resource be maintained/integrated into current IT systems? Does the technology/resource fulfil a certain level of sustainability concerning documentation/manuals, explanation of use cases, front-ends, GUIs etc.? Does it use/employ standard/best-practice programming environments (such as Java EE)? Do industry/research standards/quasi-standards exist and if so, is the technology/resource compliant (data formats etc.)?
   0: completely proprietary, ad hoc data formats and APIs
   6: full standard-compliance, fully documented

7. Adaptability: How well can the best technologies or resources be adapted/extended to new tasks/domains/genres/text types/use cases etc.?
   0: practically impossible to adapt a technology/resource to another task, impossible even with large amounts of resources or person months at hand
   6: very high level of adaptability; adaptation also very easy and efficiently possible

Further information on the table can be found in a separate document, the Language Whitepaper FAQ.

Status of Tools and Resources for Swedish

Tool / Resource | Quantity | Availability | Quality | Coverage | Maturity | Sustainability | Adaptability

Language Technology (Tools, Technologies, Applications)
Tokenization, Morphology (tokenization, POS tagging, morphological analysis/generation) | 5 | 4 | 5 | 4 | 5 | 5 | 5
Parsing (shallow or deep syntactic analysis) | 4 | 3 | 5 | 4 | 5 | 5 | 5
Sentence Semantics (WSD, argument structure, semantic roles) | 2 | 1 | 2 | 2 | 2 | 1 | 2
Text Semantics (coreference resolution, context, pragmatics, inference) | 2 | 1 | 3 | 2 | 2 | 1 | 2
Advanced Discourse Processing (text structure, coherence, rhetorical structure/RST, argumentative zoning, argumentation, text patterns, text types etc.) | 1 | 1 | 1 | 1 | 1 | 1 | 1
Information Retrieval (text indexing, multimedia IR, crosslingual IR) | 4 | 1 | 4 | 3 | 4 | 3 | 3
Information Extraction (named entity recognition, event/relation extraction, opinion/sentiment recognition, text mining/analytics) | 4 | 2 | 4 | 4 | 4 | 3 | 4
Language Generation (sentence generation, report generation, text generation) | 3 | 3 | 3 | 2 | 4 | 3 | 4
Summarization, Question Answering, advanced Information Access Technologies | 2 | 1 | 1 | 1 | 1 | 1 | 1
Machine Translation | 4 | 2 | 4 | 2 | 5 | 4 | 4
Speech Recognition | 2 | 1 | 3 | 4 | 5 | 5 | 5
Speech Synthesis | 3 | 1 | 3 | 3 | 3 | 3 | 3
Dialogue Management (dialogue capabilities and user modelling) | 3 | 2 | 3 | 3 | 4 | 3 | 5

Language Resources (Resources, Data, Knowledge Bases)
Reference Corpora | 2 | 2 | 4 | 3 | 5 | 5 | 5
Syntax-Corpora (treebanks, dependency banks) | 2 | 3 | 3 | 3 | 5 | 5 | 5
Semantics-Corpora | 1 | 1 | 1 | 1 | 1 | 1 | 1
Discourse-Corpora | 1 | 1 | 1 | 1 | 1 | 1 | 1
Parallel Corpora, Translation Memories | 3 | 1 | 5 | 3 | 5 | 5 | 5
Speech-Corpora (raw speech data, labelled/annotated speech data, speech dialogue data) | 4 | 3 | 3 | 3 | 5 | 4 | 4
Multimedia and multimodal data (text data combined with audio/video) | 1 | 1 | 1 | 1 | 1 | 1 | 1
Language Models | 3 | 3 | 4 | 4 | 5 | 3 | 3
Lexicons, Terminologies | 5 | 1 | 5 | 4 | 3 | 3 | 3
Grammars | 3 | 2 | 3 | 3 | 3 | 4 | 5
Thesauri, WordNets | 3 | 3 | 5 | 4 | 4 | 5 | 5
Ontological Resources for World Knowledge (e.g. upper models, Linked Data) | 1 | 1 | 1 | 1 | 1 | 1 | 1

The most important goal of the table is not to provide an exhaustive and scientific chart of the field. The table is meant to support abstract, high-level messages, which can be further explained in the next section.

Conclusions

1) Interpretation of the table.

The most important goal of the table is not to provide an exhaustive and scientific chart of the field. The table is meant to support abstract, high-level messages, which can be further explained in this section. Among these messages are, for example:

- While some specific corpora of high quality exist, a very large syntactically annotated corpus is not available.
- For Swedish, a large corpus exists, but it is not easily/cheaply accessible.
- Many of the resources lack standardization, i.e., even if they exist, sustainability is not given; concerted programs and initiatives are needed to standardize data and interchange formats.
- Semantics is more difficult than syntax; text semantics is more difficult than word and sentence semantics.
- The more semantics a tool has to deal with, the more difficult it is to find the right data; more efforts for supporting deep processing are needed.
- Standards do exist for semantics in the sense of world knowledge (RDF, OWL, etc.); they are, however, not easily applicable in NLP tasks.
- Speech processing is currently more mature than NLP for written text.
- Research was successful in designing particular high quality software, but it is nearly impossible to come up with sustainable and standardized solutions given the current funding situations.
- For a certain language, a certain technology simply does not exist (don’t be afraid to use 0, zero, whenever applicable).
- Etc.

2) Where do we stand? What needs to be done?

3) This document describes the state of a certain language and the support that exists for the language through Language Technology. What is the situation concerning cross- and multilingual technologies? Where does the language and its LT stand in the European context?


Option B of the Table (to be filled out)

Tool / Resource | Availability | Accessibility | Quality | Coverage | Maturity | Sustainability | Adaptability | Multilinguality

Language Technology (Tools, Technologies, Applications)
Morphology – Syntax – Grammar | 4 | 3 | 3 | 4 | 5 | 3 | 2 | 1
Sentence Semantics | | | | | | | |
Text Semantics | | | | | | | |
Advanced Discourse Processing and Generation | | | | | | | |
Information Retrieval | | | | | | | |
Information Extraction | | | | | | | |
Summarization, Question Answering and other common LT applications | | | | | | | |
Machine Translation | | | | | | | |
Speech Recognition | | | | | | | |
Speech Synthesis | | | | | | | |
Dialogue Management | | | | | | | |

Language Resources (Resources, Data, Knowledge Bases)
Reference Corpus or Corpora | | | | | | | |
Syntax-Corpora | | | | | | | |
Semantics-Corpora | | | | | | | |
Discourse-Corpora | | | | | | | |
Speech-Corpora | | | | | | | |
Multimedia and multimodal data | | | | | | | |
Language Models | | | | | | | |
Evaluation Packages | | | | | | | |
Lexicons, Thesauri, Dictionaries, Translation Memories | | | | | | | |
Grammars | | | | | | | |
Ontological Resources for Language | | | | | | | |
Ontological Resources for World Knowledge | | | | | | | |

References

To be completed

Ethnologue. Lewis, M. Paul (ed.), 2009. Ethnologue: Languages of the World, Sixteenth edition. Dallas, Tex.: SIL International. Online version: http://www.ethnologue.com

EUROMAP study. “Benchmarking HLT progress in Europe”, EUROMAP study, 2003.

Internet World Stats, http://www.internetworldstats.com. Copyright © 2010, Miniwatts Marketing Group. All rights reserved.

About META-NET

META-NET

META-NET is a Network of Excellence dedicated to fostering the technological foundations of a multilingual European information society. Realising applications that enable automatic translation, multilingual information and knowledge management, and content production across all European languages requires a concerted, substantial, and continent-wide effort in language technology research and engineering. To this end, META-NET is pursuing three lines of action:

What is the goal?

A key goal of META-NET is to build the Multilingual European Technology Alliance (META), bringing together researchers, commercial technology providers, private and corporate language technology users, language professionals and other information society stakeholders. META will prepare the necessary ambitious joint effort towards furthering language technologies as a means towards realising the vision of a Europe united as one single digital market and information space.

First META-NET Events in 2010

In the short period of its existence since February 2010, META-NET has already established high visibility in a number of key stakeholder communities. Here is a selection of past events organised by META-NET or with META-NET participation:

- Language Technology Days 2010 (March 2010, Luxembourg): Presenting META-NET to ca. 250 key representatives from the European Language Technology R&D landscape; networking with new and upcoming projects and initiatives.
- LREC 2010 (Language Resources and Evaluation Conference, May 2010, Malta): biggest conference in Computational Linguistics and Language Technology with a focus on language resources (ca. 1500 participants). META-NET was present with multiple presentations in the main conference and workshops, with a booth in the EC Projects Village, and as a sponsor.
- theMETAnk 2010 (June 2010, Berlin): brainstorming meeting with about 120 key Language Technology researchers mostly from academia.
- Translingual Europe 2010 (June 2010, Berlin): invitation-only industry conference with 150 participants; organized by META-NET.
- META-FORUM 2010 (November, Brussels): to be added

META-NET booth at LREC (Malta, May 2010)

Translingual Europe (Berlin, June 2010)

theMETAnk (Berlin, June 2010)


Current composition of the META Technology Council

Name | Affiliation | Role | Country
Nicoletta Calzolari | Consiglio Nazionale d. Ricerche | Director of Research | Italy
Bill Dolan | Microsoft Research | Head of NLP | USA
Josef van Genabith | Dublin City University, CNGL | Director | Ireland
Yota Georgakopolou | European Captioning Institute | Managing Director | UK, Greece
Gregory Grefenstette | Exalead | Chief Science Officer | France
Jan Hajic | Charles University | Professor | Czech Republic
Theo Hoffenberg | Softissimo | CTO | France
Thomas Hofmann | Google | Dir. Engineering | Switzerland
Keith Jeffrey | ERCIM | President | UK
Stefan Kreckwitz | Across | CTO | Germany
Claude de Loupy | Syllabs | CEO | France
Elisabeth Maier | CLS Communication | CTO | Switzerland
Daniel Marcu | Language Weaver | CTO | USA, Romania
Joseph Mariani | CNRS-LIMSI, IMMI | Director | France
Penny Marinou | EUATC | President | Greece
Jaap van der Meer | TAUS | Director | Netherlands
Roger Moore | University of Sheffield | Professor | UK
Stelios Piperidis | ILSP, Research Centre “Athena” | Head of Department | Greece
Gabor Proszeky | Morphologic | CEO | Hungary
Georg Rehm | DFKI | Senior Consultant | Germany
C.M. Sperberg-McQueen | World Wide Web Consortium | Technical Staff | USA
Daniel Tapias | Sigma Technologies | CEO | Spain
Alessandro Tescari | Pervoice | CEO | Italy
Hans Uszkoreit | DFKI | Scientific Director | Germany
Andrejs Vasiljevs | Tilde | CEO | Latvia
Michel Vérel | Vecsys | CEO | France
Alex Waibel | CMU, University of Karlsruhe | Professor | USA/Germany


Composition of the META-NET Network of Excellence

Country | Member (Affiliation) | Contacts
Austria | Universität Wien | Gerhard Budin
Belgium | University of Antwerp | Walter Daelemans
Belgium | University of Leuven | Dirk van Compernolle
Bulgaria | Bulgarian Academy of Sciences | Svetla Koeva
Croatia | Zagreb University | Marko Tadic
Cyprus | University of Cyprus | Jack Burston
Czech Rep. | Charles University in Prague | Jan Hajic
Denmark | University of Copenhagen | Bente Maegaard
Estonia | University of Tartu | Tiit Roosmaa
Finland | Aalto University | Timo Honkela
Finland | University of Helsinki | Kimmo Koskenniemi, Krister Linden
France | CNRS, LIMSI | Joseph Mariani
France | ELDA | Khalid Choukri
Germany | DFKI | Hans Uszkoreit, Georg Rehm
Germany | RWTH Aachen | Hermann Ney
Greece | ILSP, R.C. “Athena” | Stelios Piperidis
Hungary | Hungarian Academy of Sciences | Tamás Váradi
Hungary | Budapest Technical University | Géza Németh, Gábor Olaszy
Iceland | University of Iceland | Eirikur Rögnvaldsson
Ireland | Dublin City University | Josef van Genabith
Italy | Consiglio Nazionale Ricerche | Nicoletta Calzolari
Italy | Fondazione Bruno Kessler | Bernardo Magnini
Latvia | Tilde | Andrejs Vasiljevs
Latvia | University of Latvia | Inguna Skadina
Lithuania | Institute of the Lithuanian Language | Jolanta Zabarskaitë
Luxembourg | Arax Ltd. | Vartkes Goetcherian
Malta | University of Malta | Mike Rosner
Netherlands | Universiteit Utrecht | Jan Odijk
Norway | University of Bergen | Koenraad De Smedt
Poland | Polish Academy of Sciences | Adam Przepiórkowski
Poland | University of Lódz | Barbara L.-Tomaszczyk
Portugal | University of Lisbon | Antonio Branco
Portugal | Inst. for Systems Engineering and Computers | Isabel Trancoso
Romania | Romanian Academy of Sciences | Dan Tufis
Romania | University Alexandru Ioan Cuza | Dan Cristea
Serbia | Belgrade University | Dusko Vitas, Cvetana Krstev
Serbia | Pupin Institute | Sanja Vranes
Slovakia | Slovak Academy of Sciences | Radovan Garabik
Slovenia | Jozef Stefan Institute | Marko Grobelnik
Spain | Barcelona Media | Toni Badia
Spain | Technical University of Catalonia | Asunción Moreno
Spain | University Pompeu Fabra | Núria Bel
Sweden | University of Gothenburg | Lars Borin
UK | University of Manchester | Sophia Ananiandou