Begoña Villada Moirón and Jörg Tiedemann
Alfa Informatica, University of Groningen
Oude Kijk in ’t Jatstraat 26
9712 EK Groningen, The Netherlands
{M.B.Villada.Moiron,J.Tiedemann}@rug.nl
Abstract
For NLP applications that require some sort of semantic interpretation it would be helpful to know what expressions exhibit an idiomatic meaning and what expressions exhibit a literal meaning. We investigate whether automatic word-alignment in existing parallel corpora facilitates the classification of candidate expressions along a continuum ranging from literal and transparent expressions to idiomatic and opaque expressions. Our method relies on two criteria: (i) meaning predictability that is measured as semantic entropy and (ii) the overlap between the meaning of an expression and the meaning of its component words. We approximate the mentioned overlap as the proportion of default alignments. We obtain a significant improvement over the baseline with both measures.
1 Introduction
Knowing whether an expression receives a literal meaning or an idiomatic meaning is important for natural language processing applications that require some sort of semantic interpretation. Some applications that would benefit from knowing this distinction are machine translation (Imamura et al., 2003), finding paraphrases (Bannard and Callison-Burch, 2005), (multilingual) information retrieval (Melamed, 1997a), etc.
The purpose of this paper is to explore to what extent word-alignment in parallel corpora can be used to distinguish idiomatic multiword expressions from more transparent multiword expressions and fully productive expressions.
In the remainder of this section, we present our characterization of idiomatic expressions, the motivation to use parallel corpora and related work. Section 2 describes the materials required to apply our method. Section 3 describes the routine to extract a list of candidate expressions from automatically annotated data. Experiments with different word alignment types and metrics are shown in section 4. Our results are discussed in section 5. Finally, we draw some conclusions in section 6.
1.1 What are idiomatic expressions?
Idiomatic expressions constitute a subset of multiword expressions (Sag et al., 2001). We assume that literal expressions can be distinguished from idiomatic expressions provided we know how their meaning is derived. The meaning of linguistic expressions can be described within a scale that ranges from fully transparent to opaque (in figurative expressions).
(1) Wat moeten lidstaten ondernemen om aan haar eisen te voldoen?
    what must member states do to at her demands to meet?
    ‘What must EU member states do to meet her demands?’
(2) Deze situatie brengt de bestaande politieke barrières zeer duidelijk aan het licht.
    this situation brings the existing political barriers very clearly in the light
    ‘This situation brings the existing political limitations to light very clearly.’
(3) Wij mogen ons hier niet bij neerleggen, maar moeten de situatie publiekelijk aan de kaak stellen.
    we may us here not by agree, but must the situation publicly op the cheek state
    ‘We cannot agree but must denounce the situation openly.’
Literal and transparent meaning is associated with high meaning predictability. The meaning of an expression is fully predictable if it results from combining the meaning of its individual words when they occur in isolation (see (1)). When the expression undergoes a process of metaphorical interpretation its meaning is less predictable. Moon (1998) considers a continuum of transparent, semi-transparent and opaque metaphors. The more transparent metaphors have a rather predictable meaning (2); the more opaque have an unpredictable meaning (3). In general, an unpredictable meaning results from the fact that the meaning of the expression has been fossilized and conventionalized. In an uninformative context, idiomatic expressions have an unpredictable meaning (3). Put differently, the meaning of an idiomatic expression cannot be derived from the cumulative meaning of its constituent parts when they appear in isolation.
1.2 Why checking translations?
This paper addresses the task of distinguishing literal (transparent) expressions from idiomatic expressions. Deciding what sort of meaning an expression shows can be done in two ways:
• measuring how predictable the meaning of the expression is and
• assessing the link between (a) the meaning of the expression as a whole and (b) the cumulative literal meanings of the components.
Fernando and Flavell (1981) observe that no connection between (a) and (b) suggests the existence of opaque idioms, whereas a clear link between (a) and (b) is observed in clearly perceived metaphors and literal expressions.
We believe we can approximate the meaning of an expression by looking up the expression's translation in a foreign language. Thus, we are interested in exploring to what extent parallel corpora can help us to find out the type of meaning an expression has.
For our approach we make the following assumptions:
• regular words are translated (more or less) consistently, i.e. there will be one or only a few highly frequent translations whereas translation alternatives will be infrequent;
• an expression has an (almost) literal meaning if its translation(s) into a foreign language is the result of combining each word's translation(s) when they occur in isolation into a foreign language;
• an expression has a non-compositional meaning if its translation(s) into a foreign language does not result from a combination of the regular translations of its component words.
We also assume that an automatic word aligner will get into trouble when trying to align non-decomposable idiomatic expressions word by word. We expect the aligner to produce a large variety of links for each component word in such expressions and that these links are different from the default alignments found in the corpus otherwise.
Bearing these assumptions in mind, our approach attempts to locate the translation of a MWE in a target language. On the basis of all reconstructed translations of a (potential) MWE, it is decided whether the original expression (in source language) is idiomatic or a more transparent one.
1.3 Related work
Melamed (1997b) measures the semantic entropy of words using bitexts. Melamed computes the translational distribution T of a word s in a source language and uses it to measure the translational entropy of the word H(T|s); this entropy approximates the semantic entropy of the word that can be interpreted either as (a) the semantic ambiguity or (b) the inverse of reliability. Thus, a word with high semantic entropy is potentially very ambiguous and therefore, its translations are less reliable (or highly context-dependent). We also use entropy to approximate meaning predictability. Melamed (1997a) investigates various techniques to identify non-compositional compounds in parallel data. Non-compositional compounds are those sequences of 2 or more words (adjacent or separate) that show a conventionalized meaning. From English-French parallel corpora, Melamed's method induces and compares pairs of translation models. Models that take into account non-compositional compounds are highly accurate in the identification task.
2 Data and resources
We base our investigations on the Europarl corpus consisting of several years of proceedings from the European Parliament (Koehn, 2003). We focus on Dutch expressions and their translations into English, Spanish and German. Thus, we used the entire sections of Europarl in these three languages. The corpus has been tokenized and aligned at the sentence level (Tiedemann and Nygaard, 2004). The Dutch part contains about 29 million tokens in about 1.2 million sentences. The English, Spanish and German counterparts are of similar size, between 28 and 30 million words in roughly the same number of sentences.
Automatic word alignment has been done using GIZA++ (Och, 2003). We used standard settings of the system to produce Viterbi alignments of IBM model 4. Alignments have been produced for both translation directions (source to target and target to source) on tokenized plain text. We also used a well-known heuristic for combining the two directional alignments, the so-called refined alignment (Och et al., 1999). Word-to-word alignments have been merged such that words are connected with each other if they are linked to the same target. In this way we obtained three different word alignment files: source to target (src2trg) with possible multi-word units in the source language, target to source (trg2src) with possible multi-word units in the target language, and refined with possible multi-word units in both languages. We also created bilingual word type links from the different word-aligned corpora. These lists include alignment frequencies that we will use later on for extracting default alignments for individual words. Henceforth, we will call them link lexica.
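As a rough illustration of what a link lexicon with alignment frequencies could look like, the sketch below counts word-type links from a flat list of aligned word pairs and returns the most frequent targets per source word. The input format, the names build_link_lexicon and default_alignments, and the toy pairs are assumptions made for illustration only; they are not GIZA++ output and not the authors' code.

```python
from collections import Counter, defaultdict

def build_link_lexicon(links):
    """Count how often each source word type is linked to each target type.

    `links` is an iterable of (source_word, target_word) pairs, a hypothetical
    flattened view of a word-aligned corpus.
    """
    lexicon = defaultdict(Counter)
    for source, target in links:
        lexicon[source][target] += 1
    return lexicon

def default_alignments(lexicon, source, n=4):
    """Return up to n most frequent target types for a source word."""
    return [target for target, _ in lexicon[source].most_common(n)]

# Invented toy links, not corpus data.
toy_links = [
    ("licht", "light"), ("licht", "light"), ("licht", "lamp"),
    ("brengen", "bring"), ("brengen", "bring"), ("brengen", "take"),
]
lexicon = build_link_lexicon(toy_links)
print(default_alignments(lexicon, "licht"))
```

Storing raw counts rather than probabilities keeps the lexicon reusable for both frequency-based selection of defaults and later proportion estimates.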
4 Available at http://www.let.rug.nl/~vannoord/alp/Alpino.
5 Butt (2003) maintains that the first 7 verbs are examples of support verbs crosslinguistically. The other 5 have been suggested for Dutch by Hollebrandse (1993).
[...] which we considered a manageable size to test our method.
4 Methodology
We examine how expressions in the source language (Dutch) are conceptualized in a target language. The translations in the target language encode the meaning of the expression in the source language. Using the translation links in parallel corpora, we attempt to establish what type of meaning the expression in the source language has. To accomplish this we make use of the three word-aligned parallel corpora from Europarl as described in section 2.
Once the translation links of each expression in the source language have been collected, the entropy observed among the translation links is computed per expression. We also take into account how often the translation of an expression is made out of the default alignment for each triple component. The default ‘translation’ is extracted from the corresponding bilingual link lexicon.
4.1 Collecting alignments
For each triple in the source language (Dutch) we collect its corresponding (hypothetical) translations in a target language. Thus, we have a list of 200 VERB PP triples representing 200 potential MWEs in Dutch. We selected all occurrences of each triple in the source language and all aligned sentences containing their corresponding translations into English, German and Spanish. We restricted ourselves to instances found in 1:1 sentence alignments. Other units contain many errors in word and sentence alignment and, therefore, we discarded them. Relying on automated word-alignment, we collect all translation links for each verb, preposition and noun occurrence within the triple context in the three target languages.
To capture the meaning of a source expression (triple) S, we collect all the translation links of its component words s in each target language. Thus, for each triple, we gather three lists of translation links T_s. Let us see the example AAN LICHT BRENG representing the MWE iets aan het licht brengen ‘reveal’. Table 1 shows some of the links found for the triple AAN LICHT BRENG. If a word in the source language has no link in the target language (which is usually due to alignments to the empty word), NO LINK [...]
[Table 1, partially recoverable: the surviving entries are NO LINK for aan and breng, and the target links light, revealed, exposed, highlight, shown, shed light, clarify.]
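The collection step can be pictured with a small sketch. It assumes a hypothetical input in which every corpus occurrence of a triple is a dictionary from component word to the list of target words the aligner linked it to; the name collect_links, the NO_LINK label string and the toy occurrences are invented for illustration and do not come from the paper.

```python
from collections import Counter

NO_LINK = "NO_LINK"  # assumed label for words aligned to the empty word

def collect_links(occurrences):
    """Gather translation link counts for each component of a triple.

    `occurrences` holds one dict per corpus occurrence of the triple, mapping
    a component word to the (possibly empty) list of aligned target words.
    Multi-word targets are joined into a single link type.
    """
    links = {}
    for occurrence in occurrences:
        for component, targets in occurrence.items():
            counter = links.setdefault(component, Counter())
            label = " ".join(targets) if targets else NO_LINK
            counter[label] += 1
    return links

# Invented toy occurrences of the triple AAN LICHT BRENG.
occurrences = [
    {"aan": [], "licht": ["light"], "breng": ["revealed"]},
    {"aan": [], "licht": ["shed", "light"], "breng": []},
    {"aan": ["to"], "licht": ["light"], "breng": ["exposed"]},
]
links = collect_links(occurrences)
print(links["licht"])
```

Running this once per target language would give the three link lists per triple mentioned above.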
4.2 Measuring translational entropy
According to our intuition it is harder to align words in idiomatic expressions than other words. Thus, we expect a larger variety of links (including erroneous alignments) for words in such expressions than for words taken from expressions with a more literal meaning. For the latter, we expect fewer alignment candidates, possibly with only one dominant default translation. Entropy is a good measure for the unpredictability of an event. We use this measure for comparing the alignment of our candidates and expect a high average entropy for idiomatic expressions. In this way we approximate a measure for meaning predictability.
For each word in a triple, we compute the entropy of the aligned target words as shown in equation (1).
$$H(T_s \mid s) = -\sum_{t \in T_s} P(t \mid s) \log P(t \mid s) \qquad (1)$$
This measure is equivalent to translational entropy (Melamed, 1997b). P(t|s) is estimated as the proportion of alignment t among all alignments of word s found in the corpus in the context of the given triple. Finally, the translational entropy of a triple is the average translational entropy of its components. It is unclear how to treat NO LINKS, [...] (2) counting NO LINKS as one unique type.
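A minimal sketch of the entropy score follows, assuming link counts per component are available as dictionaries. The base-2 logarithm, the NO_LINK label and the function names are assumptions (the text does not state a log base). The keep_no_link flag switches between counting NO LINKS as one type and leaving them out; treating the latter as the other variant is an assumption based on the "without NO LINKS" label used in the results.

```python
import math
from collections import Counter

NO_LINK = "NO_LINK"  # assumed label for alignments to the empty word

def word_entropy(link_counts, keep_no_link=True):
    """Entropy of the target links of one source word, as in equation (1)."""
    counts = [c for t, c in link_counts.items() if keep_no_link or t != NO_LINK]
    total = sum(counts)
    if total == 0:
        return 0.0  # no usable links: treated as zero entropy (assumption)
    return -sum((c / total) * math.log2(c / total) for c in counts)

def triple_entropy(links_per_component, keep_no_link=True):
    """Average entropy over the components of a triple."""
    entropies = [word_entropy(c, keep_no_link)
                 for c in links_per_component.values()]
    return sum(entropies) / len(entropies)

# Invented toy counts, not corpus data.
links = {
    "aan": Counter({NO_LINK: 2, "to": 2}),
    "licht": Counter({"light": 4}),
    "breng": Counter({"bring": 1, "reveal": 1, "expose": 1, "show": 1}),
}
print(triple_entropy(links), triple_entropy(links, keep_no_link=False))
```

In this toy case dropping NO LINKS lowers the score, because the component aan is left with a single link type.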
4.3 Proportion of default alignments (pda)
If an expression has a literal meaning, we expect the default alignments to be accurate literal translations. If an expression has idiomatic meaning, the default alignments will be very different from the links observed in the translations.
For each triple S, we count how often each of its components s is linked to one of the default alignments D_s. For the latter, we used the four most frequent alignment types extracted from the corresponding link lexicon as described in section 2. A large proportion of default alignments suggests that the expression is very likely to have literal meaning; a low percentage is suggestive of non-transparent meaning. Formally, pda is calculated in the following way:
$$pda(S) = \frac{\sum_{s \in S} \sum_{d \in D_s} \mathit{align\_freq}(s, d)}{\sum_{s \in S} \sum_{t \in T_s} \mathit{align\_freq}(s, t)}$$
where align_freq(s, t) is the alignment frequency of word s to word t in the context of the triple S.
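The proportion of default alignments can be sketched as a ratio of two frequency sums. The data layout (one Counter of link frequencies per component, one list of default targets per component), the function name and the toy values are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def pda(links_per_component, defaults_per_component):
    """Share of a triple's link frequency that falls on default alignments.

    Numerator: frequency of links whose target is one of the component's
    default alignments. Denominator: frequency of all links of all components.
    """
    default_total = 0
    link_total = 0
    for word, counts in links_per_component.items():
        defaults = set(defaults_per_component.get(word, []))
        link_total += sum(counts.values())
        default_total += sum(c for t, c in counts.items() if t in defaults)
    return default_total / link_total if link_total else 0.0

# Invented toy data: link frequencies in the triple context and default lists.
links = {
    "aan": Counter({"to": 3, "NO_LINK": 1}),
    "licht": Counter({"light": 2, "revealed": 2}),
}
defaults = {
    "aan": ["to", "at", "on", "in"],
    "licht": ["light", "lamp", "bright", "lit"],
}
print(pda(links, defaults))
```

The function returns a proportion between 0 and 1; multiplying by 100 gives a percentage.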
5 Discussion of experiments and results
We experimented with the three word-alignment types (src2trg, trg2src and refined) and the two scoring methods (entropy and pda). The 200 candidate MWEs have been assessed and classified into idiomatic or literal expressions by a human expert. For assessing performance, standard precision and recall are not applicable in our case because we do not want to define an artificial cut-off for our ranked list but evaluate the ranking itself. Instead, we measured the performance of each alignment type and scoring method by obtaining another evaluation metric employed in information retrieval, uninterpolated average precision (uap), that aggregates precision points into one evaluation figure. At each point c where a true positive S_c in the retrieved list is found, the precision P(S_1..S_c) is computed and all precision points are then averaged (Manning and Schütze, 1999).
$$uap = \frac{\sum_{S_c} P(S_1..S_c)}{|S_c|}$$
[...] NO LINKS into account when computing the proportions.
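The averaging of precision points can be sketched in a few lines. The input is assumed to be the ranked candidate list reduced to booleans (True where the human expert judged the candidate idiomatic); the function name and this encoding are illustrative choices.

```python
def uap(ranked_labels):
    """Uninterpolated average precision of a ranked list of boolean labels.

    Precision is taken at every rank that holds a true positive, and these
    precision points are averaged.
    """
    hits = 0
    precisions = []
    for position, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0

# True positives at ranks 1, 3 and 4: precision points 1/1, 2/3 and 3/4.
print(uap([True, False, True, True]))
```

Because every true positive contributes one precision point, no cut-off on the ranked list is needed.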
[...] NO LINKS) with the three alignment types for the NL-EN language pair.
Alignment    uap
[...]
baseline     0.755
Table 2: uap values of various alignments.
Using word alignments improves the ranking of candidates in all three cases. Among them, src2trg shows the best performance. This is surprising because the quality of word-alignment from English-to-Dutch (trg2src) in general is higher due to differences in compounding in the two languages. However, this is mainly an issue for noun phrases which make up only one component in the triples.
We assume that src2trg works better in our case because in this alignment model we explicitly link each word in the source language to exactly one target word (or the empty word) whereas in the trg2src model we often get multiple words (in the target language) aligned to individual words in the triple. Many errors are introduced in such alignment units. Table 3 illustrates this with an example with links for the Dutch triple op prijs stel corresponding to the expression iets op prijs stellen ‘to appreciate sth.’
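[Table 3 body, partially recoverable. Columns: source and target, for src2trg and for trg2src. Legible fragments: gesteld → appreciate, prijs → appreciate, gesteld → be fact, and several NO LINK entries involving stellen, prijs and op.]
Table 3: Example src2trg and trg2src alignments for the triple OP PRIJS STEL.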
The src2trg alignment proposes appreciate as a link to all three triple components. This type of alignment is not possible in trg2src. Instead, trg2src includes two NO LINKS [...] and many alignment alternatives in trg2src that influence our entropy scores. This can be observed for idiomatic expressions as well as for literal expressions, which makes translational entropy less reliable in trg2src alignments for contrasting these two types of expressions.
The refined alignment model starts with the intersection of the two directional models and iteratively adds links if they meet some adjacency constraints. This results in many NO LINKS [...]
[Table 4 body, partially recoverable. Row labels: entropy (without NO LINKS; NO LINKS = many), [...], baseline (0.755 for each of the three language pairs). The values 0.858 and 0.883 survive without row or column labels.]
Table 4: Translational entropy and the pda across three language pairs. Alignment is src2trg.
All scores produce better rankings than the baseline. In general, pda achieves a slightly better accuracy than entropy except for the NL-DE language pair. Nevertheless, the difference between the metrics is hardly significant.
5.3 Further improvements
One problem in our data is that we deal with word-form alignments and not with lemmatized versions. For Dutch, we know the lemma of each word instance from our candidate set. However, for the target languages, we only have access to surface forms from the corpus. Naturally, inflectional variations influence entropy scores (because of the larger variety of alignment types) and also the pda scores (where the exact word forms have to be matched with the default alignments instead of lemmas). In order to test the effect of lemmatization on different language pairs, we used CELEX (Baayen et al., 1993) for English and German to reduce word forms in the alignments and in the link lexicon to corresponding lemmas. We assigned the most frequent lemma to ambiguous word forms. Table 5 shows the scores obtained from applying lemmatization for the src2trg alignment using entropy (without NO LINKS) [...]
[Table 5 body, partially recoverable. Column groups: using entropy scores, using pda scores. Baseline row: 0.755, 0.755, 0.755.]
Table 5: Translational entropy and pda from src2trg alignments across language pairs with different settings.
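The effect of lemmatization on the link counts can be illustrated with a short sketch. The lemma_freqs dictionary is a hypothetical stand-in for a lexical database lookup (it is not the CELEX interface), and the function name and toy values are invented; ambiguous word forms receive their most frequent lemma, as described above.

```python
from collections import Counter

def lemmatize_links(link_counts, lemma_freqs):
    """Merge link counts of word forms that share a lemma.

    `lemma_freqs` maps a word form to a dict of candidate lemmas with their
    frequencies. Forms missing from the lookup are kept unchanged.
    """
    merged = Counter()
    for form, count in link_counts.items():
        candidates = lemma_freqs.get(form)
        lemma = max(candidates, key=candidates.get) if candidates else form
        merged[lemma] += count
    return merged

# Invented toy values, not corpus or CELEX data.
link_counts = Counter({"appreciate": 3, "appreciated": 2,
                       "appreciates": 1, "saw": 1})
lemma_freqs = {
    "appreciated": {"appreciate": 50},
    "appreciates": {"appreciate": 40},
    "saw": {"see": 90, "saw": 10},  # ambiguous form: most frequent lemma wins
}
print(lemmatize_links(link_counts, lemma_freqs))
```

Merging forms this way reduces the number of distinct link types, which is why it can change both the entropy and the pda scores.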
Surprisingly, lemmatization adds little or even decreases the accuracy of the pda and entropy scores. It is also surprising that lemmatization does not affect the scores for morphologically richer languages such as German (compared to English). One possible reason for this is that lemmatization discards morphological information that is crucial to identify idiomatic expressions. In fact, nouns in idiomatic expressions are more fixed than nouns in literal expressions. By contrast, verbs in idiomatic expressions often allow tense inflection. By clustering word forms into lemmas we lose this information. In future work, we might lemmatize only the verb.
Another issue is the reliability of the word alignment that we base our investigation upon. We want to make use of the fact that automatic word alignment has problems with the alignment of individual words that belong to larger lexical units. However, we believe that the alignment program in general has problems with highly ambiguous words such as prepositions. Therefore, prepositions might blur the contrast between idiomatic expressions and literal translations when measured on the alignment of individual words. Table 5 includes scores for ranking our candidate expressions with and without prepositions. We observe that there is a large improvement when leaving out the alignments of prepositions. This is consistent for all language pairs and the scores we used for ranking.
[Table body, partially recoverable. Columns: rank, pda, entropy, assessment (ok or *), triple, MWE. The assignment of scores to rows is lost; the legible triples and glosses are listed below.]
vast met voldoening ‘settle with satisfaction’
kom tot akkoord ‘reach agreement’
breng in stemming ‘get in mood’
sta op schroeven ‘be unsettled’
voldoe aan criterium ‘satisfy criterion’
beschik over informatie ‘decide over information’
stem voor amendement ‘vote for amending’
neem terug naar commissie ‘refer to commission’
stem tegen amendement ‘vote against a amending’
onthoud van stemming ‘withhold one’s vote’
feliciteer met werk ‘congratulate with work’
stem voor verslag ‘vote for report’
schep van werkgelegenheid ‘set up of employment’
stem voor resolutie ‘vote for resolution’
bedank voor feit ‘thank for fact’
[...] number of NO LINKS is harder to translate compositionally, and probably an idiomatic or ambiguous expression. Alternatively, an expression with no NO LINKS [...]