See also: IRC log
This is the raw scribe log for the sessions on day two of the MultilingualWeb workshop in Pisa. The log has not undergone careful post-editing and may contain errors or omissions; it should be read with that in mind. It represents the best efforts of the scribes to capture, in real time, the gist of the talks and the discussions that followed. IRC is used not only to capture notes on the talks; it can also be followed live by remote participants and participants with accessibility needs, who can add their own contributions to the flow of text.
See also the log for the first day.
Dave: presenting on CNGL research
... multilingual IR, real-time social media translation etc. are all part of the aim to support the global customer
... web services - benefits for localisation like "pay as you use" models, easy deployment, ...
... industry survey shows barriers for adoption of technology
... web services interoperability - one needs to be very careful in profiling
<tadej> yes
<tadej> Dave: proposing employing semantic web technology to the MT use case
dave: semantic web may help to solve the problems we are looking at
... SW is a good mechanism to leverage other things
... tools are maturing
... we are interested in a small part of the SW stack, that is RDF
... RDF is a triple language; everything gets a URI and can be referenced, and RDF Schema provides some basic modeling methods
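[Ed. note: a minimal sketch of the triple model just described, using the Python rdflib library; the vocabulary and URIs are invented for illustration.]

```python
# Everything gets a URI; statements are subject-predicate-object triples;
# RDF Schema adds basic modelling such as classes and subclassing.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/ngl#")  # hypothetical vocabulary

g = Graph()
g.bind("ex", EX)

# RDFS-level modelling: declare a class hierarchy.
g.add((EX.WebPage, RDFS.subClassOf, EX.LocalisableContent))

# Instance-level triples: any resource can be referenced by its URI.
doc = URIRef("http://example.org/docs/42")
g.add((doc, RDF.type, EX.WebPage))
g.add((doc, EX.sourceLanguage, Literal("en")))

print(g.serialize(format="turtle"))
```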
Dave compares RDF to relational databases
dave: RDF provides classes, properties, ...
... including multiple inheritance, which allows combinations in interesting ways
... the semantic web does not necessarily involve standardization; people just create a vocabulary
... if it is taken up, good - a "survival of the fittest" approach
... existing data can be annotated with RDF - e.g. for Web services there is SAWSDL
... developed a seed taxonomy for next generation localisation
(NGL) content
... working with many researchers in CNGL to see whether the
taxonomy fits their needs, otherwise it is changed
... have a model refinement cycle for this
... fine-grained roundtrips involving customer, content
developer, LSP, translators
... looking into doing this with RDF
... "linked open data" - not focusing so much on reasoning, but
to see how to publish data you have
... triple stores are becoming robust, starting to scale
... important vocabulary from LOD: open provenancy
vocabulary
... helpful for author, segment and source QA
... next steps:
... revise semantic model, semantic sandpit, content markup via
RDFa, not standardising semantics, testing semantic
technology
... access control, etc.
... real power of SW is its extensibility
... semantic annotations can help to improve interoperability
... provenance linked data can help for roundtripping (a sketch follows at the end of this turn)
... will gather a lot of quality metadata about the content we
are localising
... that might be helpful for training statistical MT
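[Ed. note: an illustrative sketch of provenance-style metadata for a localisation roundtrip, in the spirit of the provenance vocabulary mentioned above. The property names below are invented, not a real schema.]

```python
# Attach who-translated-what and QA metadata to a segment as RDF triples.
from rdflib import Graph, Literal, Namespace, URIRef

PROV = Namespace("http://example.org/provenance#")  # hypothetical vocabulary
g = Graph()
g.bind("prov", PROV)

seg = URIRef("http://example.org/docs/42#seg-7")
g.add((seg, PROV.wasTranslatedBy, URIRef("http://example.org/translators/ana")))
g.add((seg, PROV.derivedFrom, URIRef("http://example.org/docs/42-en#seg-7")))
g.add((seg, PROV.qualityScore, Literal(0.92)))  # e.g. a reviewer's QA score

# Quality metadata like this could later be used to filter MT training data.
print(g.serialize(format="turtle"))
```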
alexandra: introducing the SWINNG project, part of the Software-Cluster
... central principle: emergence
... emergent software: enables combination of components and services for digital enterprises
... components can come from ERP, BPM, BPI, the Web, ...
... agility to better account for reducing waste, empowering the team and the employee, ...
... challenges: find a balance for right amount of
documentation
... had experience with writing larger user concepts, or user concepts on the whiteboard
alexandra: actions and research areas: include a technical writer in at most 2 SCRUM teams
... want to set up controlling to measure software quality and time to market
... a difficult task; software quality is hard to measure
Andrejs: talking about challenges for smaller languages
... tools should be provided to help bridge language barriers, esp. for these languages
... UNESCO is working on a code of ethics, including the demand to represent all linguistic groups in cyberspace
... Alvin Toffler: "survival of smaller languages depends on outcome of MT versus proliferation of larger languages"
... Tilde is doing both language technology and localization
services
... we can see real needs of users and test new
approaches
... MT at tilde: first rule-based, switching to data-driven
methods in 2008, heavy participation in EU R&D
... about MT development
... not only research, but bringing results into the tools we provide
... MT, dictionaries widely used in the country
... work with Microsoft Research to improve the MT engine for our language
... problem of data-driven MT: translation quality is low for under-resourced languages
... the other challenge is customization: mass-market online MT systems are general
... performance is poor for specific domains
... open source tools like GIZA++ or Moses are hard to use for the ordinary user, too complex
... strategies to help: see "LetsMT!" project
... building a platform to gather public and user-provided MT
training data
... increasing quality, scope and language coverage for
MT
... area is "machine translation for the multilingual
web"
... user survey about IPR of text resources
... there is some willingness to share data
... another project "Accurat"
... non-parallel bi- or multilingual text resources
... e.g. multilingual news feeds
... wikipedia articles, multilingual web sites, ...
... these sources show varying degrees of comparability
... we calculate the comparability and develop comparability metrics (a toy sketch follows at the end of this turn)
... develop methods for automatic acquisition of parallel texts
... consortium has both research institutions and SMEs
... tagging MT-translated texts would be very helpful
... to be able to distinguish MT-translated texts from human-translated text
... common interfaces for MT engines would facilitate interoperability
... standardization / best practices are needed
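[Ed. note: the toy sketch referred to above - a naive comparability score between two documents in different languages, using word-by-word translation with a seed dictionary plus vocabulary overlap. Accurat's actual metrics are far richer; the dictionary and texts are invented.]

```python
def comparability(doc_a, doc_b, dictionary):
    """Jaccard overlap between doc_b's words and dictionary translations of doc_a."""
    translated = {dictionary.get(w.lower()) for w in doc_a.split()}
    translated.discard(None)  # drop words the seed dictionary cannot translate
    words_b = {w.lower() for w in doc_b.split()}
    if not translated or not words_b:
        return 0.0
    return len(translated & words_b) / len(translated | words_b)

seed = {"election": "vēlēšanas", "results": "rezultāti", "president": "prezidents"}
en = "Election results announced by the president"
lv = "Prezidents paziņo vēlēšanu rezultātus"
# Inflection already hurts the naive match - one reason under-resourced,
# morphologically rich languages need better metrics.
print(round(comparability(en, lv, seed), 2))
```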
Boštjan: about collecting aligned textual corpora from the hidden web
Boštjan: aligned parallel corpus: a text alongside its translation(s)
... usage: translation memory, training MT systems, many NLP scenarios
... looked at standards, decided to go for TMX
... XLIFF is in my list in the last bullet point, in
brackets
... so XLIFF needs more marketing & development
... getting data: non-english professional web sites
... huge amount of translated text
... in general quality translations
Boštjan: problems: translation memory is hard to get
... data should have high precision
... no standard fully supports automatic harvesting or cleaning of data
... proposed solution: crawl from the web
... crawl > database > list of HTML candidates > list of text candidates > parallel corpora (a toy sketch of the candidate-pairing step follows below)
... see http://kameleon.ijs.si/t4me for more info
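[Ed. note: a toy sketch of one step in the pipeline above - pairing crawled URLs that differ only by a language marker to form HTML candidate pairs. Real harvesters also compare document structure and length; all names are illustrative.]

```python
import re

LANG = re.compile(r"/(en|de|fr|it|sl)(/|$)")

def candidate_pairs(urls):
    """Group URLs by their language-stripped form; emit cross-language pairs."""
    buckets = {}
    for url in urls:
        m = LANG.search(url)
        if not m:
            continue  # no recognisable language marker in this URL
        key = LANG.sub("/*/", url)  # normalise the language part away
        buckets.setdefault(key, []).append((m.group(1), url))
    for variants in buckets.values():
        for i, (lang_a, url_a) in enumerate(variants):
            for lang_b, url_b in variants[i + 1:]:
                if lang_a != lang_b:
                    yield url_a, url_b

urls = ["http://example.org/en/products.html",
        "http://example.org/sl/products.html",
        "http://example.org/en/about.html"]
print(list(candidate_pairs(urls)))  # one en-sl pair
```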
Boštjan: we used TMX - is it the right choice?
Boštjan: source language must be defined
... no need for me to do that, I just have parallel texts for machine consumption
... would need an optional parameter to define the source for each segment (see the sketch after this turn)
... when you develop a standard, think also about "machines" as users, not only people
... future work: optimization in the areas of two-phase crawling, character encoding, enhanced candidate extraction
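[Ed. note: a minimal sketch of emitting harvested segment pairs as TMX 1.4 with the Python standard library. TMX does allow srclang="*ALL*" in the header when no single source language is designated, which is close to - though not quite - the per-segment source attribute asked for above. Tool names and texts are invented.]

```python
import xml.etree.ElementTree as ET

def to_tmx(pairs, langs=("en", "sl")):
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "kameleon-sketch", "creationtoolversion": "0.1",
        "segtype": "sentence", "o-tmf": "none", "adminlang": "en",
        "srclang": "*ALL*",  # any <tuv> may serve as the source
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for seg_a, seg_b in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in zip(langs, (seg_a, seg_b)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

print(to_tmx([("A cat sleeps.", "Maček spi.")]))
```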
<luke> To answer Bostjan's question about "how many errors are acceptable", the answer (frustratingly for him, I'm sure) is "it depends": is the text a guide for system administrators or the company homepage? Also: what are the types of errors (people can usually understand text with some grammatical errors, but if the key nouns/verbs are incorrect, it could be confusing/embarrassing).
Boštjan: web service for TM distribution and filtering (Web 2.0 style)
gavin: interactive alignment of parallel texts
... world wide web: need to think both globally and locally, e.g. in terms of minority languages
... "a seed-bed for poetic expression, beyond mere communication"
... cultural context is important, see R. Jakobson
... there is an osmosis between minority languages and global languages
... everybody becomes a 2nd-language speaker
... parallel text alignment is a way to communicate semantics
... we have standards-based markup, web delivery cross-browser,
non-verbal interactivity ...
... statistical MT will not translate poetry in the next 20-50
years
... we developed a parallel text alignment web interface
demo of interactive text alignment
scribe: standards that have been
used for the demo: TEI (XML-based) structure
... presented as XHTML, with CSS, JavaScript
... semantics is not RDF, but the TEI structure
gavin: beauty of Unicode - one can put multilingual information directly into the content
... pros and cons: we can interact directly with semantics
... the W3C Range API does not work consistently across browsers
... TEI P5 must be subsetted
... CSS selection helps, with jQuery
... some browser issues; does not work everywhere
question about semantic web for MT training
dave: have thought about that
... e.g. linking to terminology databases
... looking into lexical markup; there was a presentation at the last MLW workshop about this
... a hot topic in MT: linguistically informed MT
discussion about legal issues with gathering corpora via the Web - is it legal at all?
Boštjan: lawyers will work on finding that out
Alexandra: all languages in our project need to be finished
... depending on the language, this is more or less difficult
christian: funny to see the same questions, I had the remark on IP too, let's see where this goes
christian lieske: not a question, but a remark - everyone mentioned that categorisation of what we find on the Web would be helpful for reliable machine analysis
scribe: some communities have a
detailed approach to this
... look at last year's w3c day in berlin and you'll see how
work on digital libraries may fit well with machine
translation
<fsasaki> (above is presentation from Günther Neher)
??: often pages at the same URL that are translated do not have exactly the same structure
<fsasaki> (see, in German, http://www.xinnovations.de/downloads-2010.html?file=tl_files/xinnovations.2010/Download/W3C-Tag/Prof.%20Dr.%20Guenther%20Neher.pdf)
bostjan: we have done little
testing so far - about 7000 translations - and it worked
well
... our preliminary experiments show that it still works very well, even if the content is not exactly the same on both sides of the parallel text
andrejs: see the FP7 project that is looking at how to extract comparable corpora
Steven Pemberton: i'm impressed by the willingness to translate poetry - i'm performing in an opera and it took me a while to understand some allusions and references (gives examples)
... i'm amazed that you hope ever to do this
gavinB: our approach is to find the interface - to see how far machines can go
... it is possible to do a translation based on the bare bones - even humans can get things wrong...
jorgS: if you have conceptual mismatches, how do you resolve them?
gavinB: this is where the human
translator accepts that they need to go away and study it - in
our system we mark it up in red
... the translation will never be exact
jorgS: for dave, what do you think of the next generation of content generation based on RDF?
dave: there's still a gap between
computational linguists and semantic web folks - there are
people looking at how to apply these things, and there are
proposals out there
... we're looking at how to integrate those approaches into
what we do
jorgS: i'm looking forward to multilingual text generation
lukeS: i was intrigued by gavin's
presentation
... it seems the best you can do wrt translation is to come up with a separate poem that has the same feel
... but this may be a useful tool for understanding the original material better
... there may be implications for other translation approaches
christianL: i understand the remarks about translating poems with machines - but to me Gavin's talk was about an annotation mechanism based on standards
... there is a need for this approach, and gavin's presentation was inspirational
... more and more accurate annotations are needed, but there are other aspects to translation, and gavin's presentation pointed to many useful aspects of this
paula: introducing how social
media is changing localization
... showing video on social media
... video emphasizing rapid growth and scale of various SNs,
describing the relationship of new generation towards social
media
... video focusing on effect of social media on advertising,
enabling higher ROI for marketing
... introducing the term "socialnomics"
<Steven> One mistake in the video - it conflated Internet and Web, so the time to 50M users was for the web, not for the internet
paula: describing the notion of reputation control via media - the talk will show how this does not hold in the presence of social media
... analogy with toddlers as example of parents not being in
control
... in social media, the user is in the middle of the system
and his worldview actually defines his experience
... emphasizing social networks other than Facebook, e.g. hi5, Orkut - a reason for their success was the fact that they were localized
... talking about surveys on social media and Lionbridge involvement - how people are using social media multilingually
... companies using social media: a quarter of companies are using all 4 platforms - European and especially Asian businesses are growing much faster than U.S. companies, likely due to legal issues
... Twitter is increasingly popular, with the fastest growth
... 60% of tweets are non-English, but Twitter is localized in only 7 languages
... companies engage in hyper-local strategies, twitter
account-per-region
... twitter brought new metric: TPS - tweets per second
paula: smartphones becoming the
relevant computing platform
... why are companies engaging? because SM allows them to
really interact with the users
paula: strategies of social media: 1) single centralized, controlled SM outlet
... 2) decentralized local pages - more effective, but users have more control
... it is still a huge opportunity - example: Coca-Cola has 250 people who are tasked with buying keywords
... important assertions: it is happening quickly, it's huge and growing; instantly available content has more value than quality content
... the real-time aspect also affects localization processes -
when localizing a message, the process might take too much
time
... real-time multilingual communication does not leave space
for pre- and post- editing, leaving a lot of human intervention
out
... last assertion: machine translation is being increasingly
more relevant for SM outlets
maarten: intro - academics are not concerned with standards per se, but with trying to get things done
... talk will be about standards supporting intelligent
information access of content
... in social media, people still do the same things, but
online instead of offline
... presenting concrete project of a political mashup
... gather political social media content, debates, analyze and
semantify it. political scientists are interested in tracking
topic ownership
... traditionally, this research was conducted via classic press clipping, now via social media
... however, data gathered this way is increasingly
multilingual
... another project, CoSyne, about cross-completing wikipedia
pages using different language articles on the same topic
... third example: The Mood of the Web - LiveJournal has mood-annotated blogs, serving as a stream of mood-annotated data
... when following mood patterns across time, you can try to interpret them, for instance "shocked", "tired"
... what would explain a huge spike in "shocked" in 2008? by combining LiveJournal streams with news and counting word usage statistics, it turns out that it was the death of the actor Heath Ledger (a toy sketch of this kind of analysis follows at the end of this turn)
... showing a time series on stress measurements, with a spike at the end of the year - that sort of analysis requires a lot of technology for text processing and information extraction
... introducing Fietstas, a multilingual en/nl text processing engine as infrastructure for what was presented
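[Ed. note: the toy sketch referred to above - ranking words that are unusually frequent inside a spike window relative to a background window. Corpus and counts are invented.]

```python
from collections import Counter

def spike_terms(spike_docs, background_docs, top=5):
    """Rank words by spike-window frequency over background frequency."""
    spike = Counter(w for d in spike_docs for w in d.lower().split())
    background = Counter(w for d in background_docs for w in d.lower().split())
    score = {w: c / (background[w] + 1) for w, c in spike.items()}  # +1 smoothing
    return sorted(score, key=score.get, reverse=True)[:top]

spike = ["shocked by heath ledger news", "heath ledger found dead"]
background = ["work was tiring today", "news about the weather"]
print(spike_terms(spike, background))  # 'heath' and 'ledger' rank first
```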
gustavo: comparing the SEO
process with preparing a gourmet meal
... posing the question, "what are the right ingredients for
multilingual SEO?"
... high search engine positioning is very important, holding
potential for high revenue
... introducing terms: SEO, MSEO, SMO, Social SEO as different
strategies in the field
... an important distinction is that whereas in SEO traffic
comes from search engines, in SMO traffic comes from social
media
... for example, 500 tweets have more effect than 500 incoming
links
... however, SEO still has higher ROI than SMO
... an important concept in SEO is the long tail effect in
certain business models
... just translating keywords does bring traffic, but has low
conversion rates
... for effective multilingual, international SEO, he recommends the W3C language standards as basic rules
... SEO can be multilingual, international or geographical; these are not mutually exclusive
... what did we learn doing it:
... 1) focus on the long tail and niche market
... 2) conversions, not traffic
... 3) things change, iterate
... showing examples - a legal company campaign was successful
once they used correct glossary translations
... healthcare insurance campaign was better once they
regionalized their content
... hotel chain: 12 languages, necessary to cover all
chiara: the talk will be about control of content, and the implications of having to include more complex multimedia in the mix of controlled vs. uncontrolled content
... controlled environment - the user does not have influence,
the content is relatively static
... in a controlled environment, the developers work with
strings with sentences, which are then combined
... in an uncontrolled component, the content is very dynamic,
developers have limited control - they combine it with the
controlled component before outputting
... even in a single sentence, there may be a combination of
controlled and uncontrolled strings
... in the translator's view, the content is treated as token
variables
... explaining their approach to i18n: handling languages with
gender, number, declensions, etc.
... different languages may have different needs than the
source language
... they solve that by "dynamic string explosion", which enables a translator to provide multiple translations for the same source string depending on the linguistic context (a toy sketch follows at the end of this turn)
... in Romanian, the translator must specify gender, but in Finnish and Russian it is even more complicated
... an important aspect and the point of this talk is that
facebook users are the translators
... considering machine translation, but haven't implemented it
yet
... French was translated in 24 hours and released in three weeks; now supporting 67 languages, many released without professional review
... review process:
... 1) translating the glossary of individual terms
... 2) translating the content
... 3) professional supervision and checking
... the tool supports both inline and bulk translation, for in-context and out-of-context translation
... why use community translation: 1) users are domain experts
2) speed 3) reach
... why do users translate: personal satisfaction and pride,
leaderboard of translation statistics
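[Ed. note: the toy sketch referred to above - one source string "exploded" into several target variants, with the variant picked at render time from the grammatical context of the token variables. Keys and strings are invented; this is not Facebook's implementation.]

```python
# (lang, source string, grammatical context) -> exploded target variant
TRANSLATIONS = {
    ("ro", "{name} was tagged in a photo", "masculine"):
        "{name} a fost etichetat într-o fotografie",
    ("ro", "{name} was tagged in a photo", "feminine"):
        "{name} a fost etichetată într-o fotografie",
}

def render(lang, source, name, gender):
    # Fall back to the source string when no exploded variant exists.
    template = TRANSLATIONS.get((lang, source, gender), source)
    return template.format(name=name)

print(render("ro", "{name} was tagged in a photo", "Ana", "feminine"))
```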
ian: SDL is an international
company, and itself also faces the multilingual problem
... big themes: social media and different devices, how
information is shaping opinions, relevant content is often in
the user's language
... reiterates the point that buyers are sensitive to the
language of the content when buying
... while around 50% of tweets are in English, that share is diminishing
... connecting with visitors: be relevant, listen, understand,
engage
... this requires monitoring solutions
... understanding: finding common interests across languages, demographics and geographies
... it turns out that the common interests are key
... content should be relevant, and better relevance via
localisation is reflected in better effectiveness of
communication
... presenting the journey of the customer engagement, from
research of products to buying and customer support
... for the customer's journey, there's a lot of content with
which the user engages that needs to be appropriate
... if people are coming to the website, they are trying to get
stuff done, so 'user engagement' may be an obstacle
... users' expectations have changed
... they expect content in their own language
DavidGrunwald: to paula - you haven't discussed whether you have the tools in place to harness social media?
paula: they don't crowdsource, they crowd manage - using input from users of various levels of skills, split the work into tasks and monitor that
DavidGrunwald: you are not letting the crowd control the message, as you claimed in your talk
paula: the content that I am referring to is not always in public or social media
ian: with social media, you can translate and listen, but you have to be cautious with translating and speaking (with automatic tools)
... agree with crowdsourcing, but it needs to be a love brand, one people want to write for
LukeS: on exploding translations in Facebook - there is an open source project supported by the Unicode Consortium that handles a subset of the language morphology problem
DanTufiş: to maarten - what theories is your work relying on
maarten: machine translation as core technologies, political science as application
DanTufiş: points out Osgood's work on subjectivity, using WordNet to extract sentiment
Steven: points out that in paula's presentation, it was not the internet that took 4 years to 50 million, but the WWW
Jaap: interoperability
questionnaire
... and interest in standards in particular
... quotes some of the statements regarding costs
... where is the friction? mostly TM followed by
terminology
... reasons to support: freedom of tool choice
... biggest barriers: lack of compliance, lack of maturity,
etc.
... some resistance against interop., such as market drop-down
... different perspectives of believers
... the realists' point of view: "accept market forces", "show business advantage", "resistance to tools", etc.
... and now the pragmatists: "they have hope..." ;-)
... future outlook (5 years!)
... content increase, multimedia, mobile, more cross-lingual
challenges, ...
... brief SWOT analysis (see other TAUS publications too)
... information pyramid representing content disruption
... apply pyramid to SWOT graphic
... business model attributes: old vs. new
... e.g. TM is core vs. data is core; one- vs.
multi-directional; word based pricing vs. SaaS; GMS vs. MT
embedded
... enterprises will need a language strategy in 5 years
... last slide: interoperability agenda
... more changes in the next 5 years than in the past 25
years
Fernando: [talks about the
challenges of multilinguality for international
organizations]
... gives the context of the food and agriculture organization
of the UN
... 6 languages (en, fr, es, ar, zh, ru); approx. 12 m words/year
... English has the largest share of doc lang.
... websites in 6 lang. and regional relevance content in 3
lang.
... challenges for doc. and web content: tech., prof. profiles,
workflow, "consumer" languages
... additional challenges are: rules and regulations, re-use of
translations, TM/MT integration
... no analysis or lessons learned available currently
... envision the employment of CMS, CAT tools; extend prof. profiles, optimize workflows
... under discussion: employment of open source software, cloud
services, etc.
... funding could be based on current SME call of the EC
Stelios: [talks about the language resource sharing initiative in the context of MetaNet]
... introduces the objectives and structure of Meta-Net; focus will be on Meta-Share
... emphasizes the key challenge of data and how it relates to LT research and development
... another important point in the initial discussions was
standards
... observations: making data employable is costly
... Meta-Share shall be an open infrastructure that enables
interoperability on various layers
... it is also built on existing projects and initiatives that are already active in this broad field
... as an umbrella organization which shall also include
national efforts
... the main idea of the Meta-Share architecture is
distribution based on a "meta schema" model
... users/consumers will have the possibility to search, browse
and download resources
... fully supports open source developments including
appropriate maintenance
... Meta-Share governance is given by members and associate members; legal issues are handled under CC (Creative Commons) licensing
Chaals: that word count is going down does not mean the translation workload decreases. Can you speculate on the implications?
Jaap: identification of different rating criteria; human interference; word count is unmanageable; more demand
... for MT, but with different pricing models
Fernando: New challenges through users; relying on help from different sites
Stelios: subtitling has a different approach, based on the intellectual capabilities needed; pricing is the running time of the media content
... multiplied by a certain factor
Chaals: Who owns the data question?
Steven: In the Netherlands all films are subtitled... quotes a translator: "we are paid by the word". What would the integration of MT mean?
Stelios: Translation based on a "master file", i.e. the translation pricing model applies.
Reinhard: Subtitling for free, i.e. by volunteers?
Chaals: Students' translations, shipped to India; there are several models...
Stefanov: Some points need to be highlighted: PEs, interpretation vs. translation, different multimedia presentations, quality control will change, etc.
Chaals: You mean librarians?
Stefanov: Not really... the picture is changing.
<chaals> [modern librarians learn to manage digital multimedia collections, and don't have to have their hair in a bun anymore. I am often surprised that they are not present at all at conferences like this - it seems we're missing out on expertise that seems highly relevant]
Christian: MT in subtitling already existing, e.g. in Scandinavia. Question on whether there are policies aimed at reducing translation costs by limiting use of multimedia?
Chaals: I have seen such rules but there are a lot of options.
END of Session