This is a live blog of the Dalhousie CS seminar by Dr. Colin Cherry of the Microsoft Research Natural Language Processing Group, entitled “Syntactic Movement in Statistical Translation”.
11:33 – People are still arriving. Dr. Keselj is having a conversation with Dr. Cherry. Dr. Cherry said his experiments are in English and French, since he speaks a little bit of French.
11:35 – Dr. Keselj is introducing the speaker.
11:36 – Studied at Acadia, work mostly from his thesis at U. of Alberta. Trying to put linguistics back into machine translation. Adding syntactic constraints.
11:37 – Outline: tutorial on statistical MT, word order, syntactic cohesion. Phrasal SMT – a decoder from French to English and vice versa.
11:39 – Rule-based translation is very common, e.g., Babel Fish on AltaVista and Yahoo. Hard to fix problems without introducing new ones. A different approach is to treat MT as a math problem: trans(e) = argmax_f [ P(f|e) ]
11:41 – f and e are not well defined. Ridiculous: search over all French sentences. Instead: find all possible French sentences that can be created by breaking down and reassembling an English sentence.
11:42 – Need a score to evaluate the different possible sentences.
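A minimal sketch of the "generate candidates, score them, take the argmax" idea, not the system from the talk. The option table and its probabilities are made-up toy values, and word order is kept fixed here for simplicity.

```python
from itertools import product

# Hypothetical toy options: English word -> [(French word, P(f|e))].
OPTIONS = {
    "the": [("la", 0.6), ("le", 0.4)],
    "session": [("séance", 0.7), ("session", 0.3)],
    "begins": [("débute", 0.6), ("commence", 0.4)],
}

def candidates(english_words):
    """Yield (french_sentence, score) for every word-by-word combination."""
    for combo in product(*(OPTIONS[w] for w in english_words)):
        score = 1.0
        for _, p in combo:
            score *= p  # score is the product of the per-word probabilities
        yield " ".join(f for f, _ in combo), score

def translate(english_words):
    """Return the highest-scoring candidate: the argmax over the candidate set."""
    return max(candidates(english_words), key=lambda pair: pair[1])

best, score = translate(["the", "session", "begins"])
print(best, round(score, 3))
```

Enumerating every candidate only works for toy inputs; the search space grows quickly, which is why the talk moves on to word order and search.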
11:43 – Example: begins -> débute. Big collections of English text with French translations exist and are available free online. Sentence alignment is easy. Word alignment within a sentence is harder, but we assume it works.
11:46 – tri-gram and 4-gram models for language modelling. No need for bilingual text to build these language models.
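A rough sketch of how a trigram model can be estimated from monolingual text alone. This is an unsmoothed maximum-likelihood version on a made-up two-sentence corpus; real language models add smoothing, which is omitted here.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word contexts in monolingual text."""
    tri, ctx = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(tokens)):
            tri[tuple(tokens[i - 2:i + 1])] += 1
            ctx[tuple(tokens[i - 2:i])] += 1
    return tri, ctx

def trigram_prob(tri, ctx, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1 w2); 0.0 if the context was never seen."""
    seen = ctx[(w1, w2)]
    return tri[(w1, w2, w3)] / seen if seen else 0.0

# Toy corpus (an assumption for illustration only).
tri, ctx = train_trigram(["the session begins tomorrow", "the session ends today"])
print(trigram_prob(tri, ctx, "the", "session", "begins"))
```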
11:47 – Word order: which word to translate first affects quality. Reordering a sentence is like the travelling salesman problem: factorial complexity. However, language has structure so we don’t need to examine all combinations.
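To make the factorial blow-up concrete, here is a small sketch that counts the orderings of an n-word sentence. The sentence is an arbitrary example.

```python
from itertools import permutations
from math import factorial

def count_orderings(words):
    """Count every possible ordering of the words by brute-force enumeration."""
    return sum(1 for _ in permutations(words))

words = "the voting session begins".split()
print(count_orderings(words))  # enumeration is feasible for 4 words
print(factorial(20))           # number of orderings for a 20-word sentence
```

Brute-force enumeration is already hopeless at ordinary sentence lengths, which is why structural constraints on reordering matter.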
11:49 – Represent a sentence as a dependency tree of which word describes which. Uses CFG. Subtrees are always contiguous substrings (phrases). [basically POS parsing]
11:51 – assume that subtrees are units of thought and are language universal, i.e., preserved during translation across languages. This is called cohesion. Can be verified with word alignment in parallel text.
11:53 – project tree from one language to text in another language. If subtrees don’t intersect then we’re good.
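A simplified sketch of the projection check, assuming a dependency tree given as a list of head indices (-1 for the root) and a word alignment given as a dict from source positions to target positions. It only tests whether disjoint subtrees project onto overlapping target spans; the exact definition used in the talk may differ (for example in how heads are treated). The sentence pair, tree and alignments are made up for illustration.

```python
def subtree(heads, root):
    """Return the set of word indices in the subtree rooted at `root`."""
    nodes = {root}
    changed = True
    while changed:
        changed = False
        for child, head in enumerate(heads):
            if head in nodes and child not in nodes:
                nodes.add(child)
                changed = True
    return nodes

def span(nodes, alignment):
    """Project a set of source words to its (min, max) target span, or None."""
    targets = [t for n in nodes for t in alignment.get(n, [])]
    return (min(targets), max(targets)) if targets else None

def is_cohesive(heads, alignment):
    """True if no two disjoint subtrees project to overlapping target spans."""
    trees = [subtree(heads, i) for i in range(len(heads))]
    for i, a in enumerate(trees):
        for b in trees[i + 1:]:
            if a & b:  # one subtree is nested in the other: nothing to check
                continue
            sa, sb = span(a, alignment), span(b, alignment)
            if sa and sb and sa[0] <= sb[1] and sb[0] <= sa[1]:
                return False
    return True

# English: the(0) voting(1) session(2) begins(3) tomorrow(4)
heads = [2, 2, 3, -1, 3]
# French: la(0) séance(1) de(2) vote(3) débute(4) demain(5)
good = {0: [0], 1: [2, 3], 2: [1], 3: [4], 4: [5]}
# Hypothetical bad alignment: "tomorrow" lands inside the subject's span.
bad = {0: [0], 1: [3], 2: [2], 3: [4], 4: [1]}
print(is_cohesive(heads, good), is_cohesive(heads, bad))
```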
11:54 – Fox (2002) used word alignment to validate cohesion: it holds for 90% of French-English. Both sides win: the stats people say 90% is not 100%, can’t do much; linguists say 90% is better than 80%, and it’s almost 100%.
11:56 – Google and Microsoft use phrasal MT, winning approach until last year. Similar to example-based MT. Idea: trust what you’ve seen people do before over any computed translation. Phrase or expression translation (hold your horses) vs. translation in context (the voting session begins tomorrow).
11:59 – Phrase tables from aligned text + statistics from aligned text (maximum likelihood of P(e|f) and P(f|e) )
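A sketch of the maximum-likelihood estimates, assuming phrase pairs have already been extracted from word-aligned text. The extraction step is not shown, and the pair list is a made-up toy.

```python
from collections import Counter

def phrase_table(pairs):
    """Relative-frequency estimates of P(f|e) and P(e|f) from (e, f) phrase pairs."""
    joint = Counter(pairs)
    e_total, f_total = Counter(), Counter()
    for (e, f), count in joint.items():
        e_total[e] += count
        f_total[f] += count
    return {
        (e, f): {"p_f_given_e": count / e_total[e], "p_e_given_f": count / f_total[f]}
        for (e, f), count in joint.items()
    }

# Hypothetical extracted phrase pairs (English, French).
pairs = [
    ("begins", "débute"),
    ("begins", "débute"),
    ("begins", "commence"),
    ("starts", "commence"),
]
table = phrase_table(pairs)
print(table[("begins", "débute")])
```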
12:00 – Translation process. Load all your options (different translation for each word or phrase). Translation is similar to single-player search.
12:02 – Use beam search, one of the favourite NLP tools.
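A toy beam search in the spirit of the search described above, not the actual decoder. It is word-based rather than phrase-based, and the option table and bigram scores are invented for illustration. Each hypothesis tracks which source words are covered, so words may be translated in any order.

```python
import math

def beam_search(source, options, lm_score, beam_size=3):
    """Return the best (score, covered, output) hypothesis covering every source word."""
    beam = [(0.0, frozenset(), ())]
    for _ in range(len(source)):
        expanded = []
        for score, covered, output in beam:
            for i, word in enumerate(source):
                if i in covered:
                    continue  # each source word is translated exactly once
                for target, prob in options[word]:
                    prev = output[-1] if output else "<s>"
                    new_score = score + math.log(prob) + lm_score(prev, target)
                    expanded.append((new_score, covered | {i}, output + (target,)))
        # Prune: keep only the best few hypotheses for the next step.
        expanded.sort(key=lambda hyp: hyp[0], reverse=True)
        beam = expanded[:beam_size]
    return beam[0]

# Hypothetical toy models.
OPTIONS = {"blue": [("bleue", 1.0)], "house": [("maison", 1.0)]}
BIGRAMS = {("<s>", "maison"): 0.5, ("maison", "bleue"): 0.5}

def toy_lm(prev, word):
    """Log-probability of `word` after `prev`, with a small floor for unseen bigrams."""
    return math.log(BIGRAMS.get((prev, word), 0.01))

score, covered, output = beam_search(["blue", "house"], OPTIONS, toy_lm)
print(" ".join(output))
```

Pruning means the search can miss the true best translation; the beam size trades speed against that risk.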
12:03 – scoring states. A linear update that adds a weighted probability ( lambda log P(e|f) for each word)
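A small sketch of the linear update: each extension adds a weighted sum of log-probabilities to the running score. The feature names and weight values are assumptions for illustration.

```python
import math

# Hypothetical feature weights (the lambdas); real values come from tuning.
WEIGHTS = {"p_e_given_f": 1.0, "p_f_given_e": 0.5, "lm": 0.8}

def feature_delta(features, weights=WEIGHTS):
    """Weighted sum of log-probabilities contributed by one new word or phrase."""
    return sum(weights[name] * math.log(p) for name, p in features.items())

def extend(score, features, weights=WEIGHTS):
    """Linear update: new state score = old score + delta of the added piece."""
    return score + feature_delta(features, weights)

state = 0.0
state = extend(state, {"p_e_given_f": 0.5, "p_f_given_e": 0.5, "lm": 0.5})
state = extend(state, {"p_e_given_f": 0.25, "p_f_given_e": 0.5, "lm": 0.1})
print(state)
```

Because the update is additive, a partial translation can be scored incrementally without rescoring what came before.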
12:05 – weights are determined by training to maximize match to some human translator. (BLEU metric). Precision with respect to a reference translation. (word count)
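A sketch of the clipped n-gram precision that BLEU is built on, assuming a single reference. Full BLEU combines several n-gram orders and adds a brevity penalty, both omitted here. The example sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    # Clip each n-gram's count so repeating a word cannot inflate the score.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

reference = "the cat is on the mat"
print(modified_precision("the the the the", reference, n=1))
print(modified_precision("the cat sat on the mat", reference, n=2))
```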
12:07 – Modifying the search to account for cohesiveness. Questions: is that translation cohesive? Not based on POS parsing or tree but on a notion of phrases.
12:10 – subtrees don’t match phrases because phrases cross some arcs in the tree.
12:12 – if intersection happens at a boundary point then we don’t care. What is a cohesion violation then? Should push the process to the decoder and send the decoder in a different direction. Detect if a subtranslation is not cohesive and do something about it.
12:15 – Shows an example of a ‘decoder-eye view’. At the end of the day, eliminating interruption in translating subtrees is equiv to eliminating violations.
12:18 – use soft and hard constraints. Hard: if you detect an interruption, throw it away. Since only 90% accurate, use soft constraints. Use interruption count as a new feature in the log linear score. Constraints change when you can use a phrase, not which phrase you can use.
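A sketch of the difference between the two constraint types, assuming the decoder already knows how many interruptions a hypothesis contains. The function name and the weight value are hypothetical.

```python
def apply_cohesion(score, interruptions, mode="soft", weight=-1.0):
    """Apply a cohesion constraint to a hypothesis score.

    hard: discard the hypothesis (return None) if it has any interruption.
    soft: keep it, but add weight * interruptions as an extra log-linear feature.
    """
    if mode == "hard":
        return None if interruptions > 0 else score
    return score + weight * interruptions

print(apply_cohesion(-2.0, 1, mode="hard"))  # hypothesis is thrown away
print(apply_cohesion(-2.0, 2, mode="soft"))  # hypothesis is penalized instead
```

In the soft case the weight would be tuned along with the other feature weights, so the system can still choose a non-cohesive translation when the other features favour it strongly enough.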
12:20 – Lexical reordering: a data driven model by examining the bi-grams.
12:21 – Experiment: English to French translator, built by extending an open source system.
12:23 – in the baseline, 4/5 of translations are cohesive already, with high BLEU score. The other 1/5 have lower BLEU score, which may mean a difficult translation.
12:25 – Overall, base < lex < coh < coh+lex, where base << lex in 1/5 with violations and insignificant elsewhere.
12:26 – results evaluated with human evaluators. They agree little except when coh helps. System produces more literal translations.
12:28 – conclusion: method to move blocks around as units but then translate them separately. Can be applied to named entity, collocation, etc. Non-cohesive translation means it’s a hard one for phrase-based systems.
12:29 – Six years in PhD. Open for questions.
12:30 – Mike McAllister: how do you account for adjective location differences across languages? Answer: doesn’t play with statistics and doesn’t work well with text-based approaches. Current system can be modelled as a finite state machine. Need a more powerful model such as CFG.
12:31 – Question: why does the 10% exist? Answer: 5% come from ‘not’ -> ‘ne pas’, others include heavy paraphrasing in human translations, NLP parsers are imperfect. Question: how will it work with German where verbs move to the end? Answer: doesn’t work for German. BTW, never seen English parse trees, would be interesting to see the parse tree for ‘ne pas’.
12:37 – Question: what’s the effect of punctuation? Answer: unreliable. Can improve performance a little. Hard to train because human translators are creative. Can help by restricting moving words across commas, true. But in a way a human would not say.
12:40 – Question: how to handle unseen words? Answer: There are approaches based on modelling new words, currently just copy as is.
12:41 – Vlado Keselj: how to handle source languages that are not cohesive? Answer: can’t do.
12:43 – Tony Abou-Assaleh: Google won competitions up to last year, who is winning now? Answer:
- Chinese to English: MS Asia
- Arabic to English: Google
- Unlimited Data: Google
- Microsoft approach: combine many systems on closed data.
- ISI: generally good.
… a couple other questions, but blogging stops here.