
Identifying idiomatic expressions using automatic word alignment


Begoña Villada Moirón and Jörg Tiedemann
Alfa Informatica, University of Groningen
Oude Kijk in 't Jatstraat 26
9712 EK Groningen, The Netherlands
{M.B.Villada.Moiron,J.Tiedemann}@rug.nl

Abstract

For NLP applications that require some sort of semantic interpretation it would be helpful to know what expressions exhibit an idiomatic meaning and what expressions exhibit a literal meaning. We investigate whether automatic word-alignment in existing parallel corpora facilitates the classification of candidate expressions along a continuum ranging from literal and transparent expressions to idiomatic and opaque expressions. Our method relies on two criteria: (i) meaning predictability, which is measured as semantic entropy, and (ii) the overlap between the meaning of an expression and the meaning of its component words. We approximate the mentioned overlap as the proportion of default alignments. We obtain a significant improvement over the baseline with both measures.

1 Introduction

Knowing whether an expression receives a literal meaning or an idiomatic meaning is important for natural language processing applications that require some sort of semantic interpretation. Some applications that would benefit from knowing this distinction are machine translation (Imamura et al., 2003), finding paraphrases (Bannard and Callison-Burch, 2005), (multilingual) information retrieval (Melamed, 1997a), etc.

The purpose of this paper is to explore to what extent word-alignment in parallel corpora can be used to distinguish idiomatic multiword expressions from more transparent multiword expressions and fully productive expressions.

In the remainder of this section, we present our characterization of idiomatic expressions, the motivation to use parallel corpora and related work. Section 2 describes the materials required to apply our method. Section 3 describes the routine to extract a list of candidate expressions from automatically annotated data. Experiments with different word alignment types and metrics are shown in section 4. Our results are discussed in section 5. Finally, we draw some conclusions in section 6.

1.1 What are idiomatic expressions?

Idiomatic expressions constitute a subset of multiword expressions (Sag et al., 2001). We assume that literal expressions can be distinguished from idiomatic expressions provided we know how their meaning is derived. The meaning of linguistic expressions can be described within a scale that ranges from fully transparent to opaque (in figurative expressions).

(1) Wat moeten lidstaten ondernemen om aan haar eisen te voldoen?
    what must member states do to at her demands to meet?
    'What must EU member states do to meet her demands?'

(2) Deze situatie brengt de bestaande politieke barrières zeer duidelijk aan het licht.
    this situation brings the existing political barriers very clearly in the light
    'This situation brings the existing political limitations to light very clearly.'

(3) Wij mogen ons hier niet bij neerleggen, maar moeten de situatie publiekelijk aan de kaak stellen.
    we may us here not by agree, but must the situation publicly on the cheek state
    'We cannot agree but must denounce the situation openly.'

Literal and transparent meaning is associated with high meaning predictability. The meaning of an expression is fully predictable if it results from combining the meaning of its individual words when they occur in isolation (see (1)). When the expression undergoes a process of metaphorical interpretation its meaning is less predictable. Moon (1998) considers a continuum of transparent, semi-transparent and opaque metaphors. The more transparent metaphors have a rather predictable meaning (2); the more opaque have an unpredictable meaning (3). In general, an unpredictable meaning results from the fact that the meaning of the expression has been fossilized and conventionalized. In an uninformative context, idiomatic expressions have an unpredictable meaning (3). Put differently, the meaning of an idiomatic expression cannot be derived from the cumulative meaning of its constituent parts when they appear in isolation.

1.2 Why check translations?

This paper addresses the task of distinguishing literal (transparent) expressions from idiomatic expressions. Deciding what sort of meaning an expression shows can be done in two ways:

• measuring how predictable the meaning of the expression is and

• assessing the link between (a) the meaning of the expression as a whole and (b) the cumulative literal meanings of the components.

Fernando and Flavell (1981) observe that no connection between (a) and (b) suggests the existence of opaque idioms, whereas a clear link between (a) and (b) is observed in clearly perceived metaphors and literal expressions.

We believe we can approximate the meaning of an expression by looking up the expression's translation in a foreign language. Thus, we are interested in exploring to what extent parallel corpora can help us to find out the type of meaning an expression has.

For our approach we make the following assumptions:

• regular words are translated (more or less) consistently, i.e. there will be one or only a few highly frequent translations whereas translation alternatives will be infrequent;

• an expression has an (almost) literal meaning if its translation(s) into a foreign language is the result of combining each word's translation(s) when they occur in isolation into a foreign language;

• an expression has a non-compositional meaning if its translation(s) into a foreign language does not result from a combination of the regular translations of its component words.

We also assume that an automatic word aligner will get into trouble when trying to align non-decomposable idiomatic expressions word by word. We expect the aligner to produce a large variety of links for each component word in such expressions and that these links are different from the default alignments found in the corpus otherwise.

Bearing these assumptions in mind, our approach attempts to locate the translation of a MWE in a target language. On the basis of all reconstructed translations of a (potential) MWE, it is decided whether the original expression (in the source language) is idiomatic or a more transparent one.

1.3 Related work

Melamed (1997b) measures the semantic entropy of words using bitexts. Melamed computes the translational distribution T of a word s in a source language and uses it to measure the translational entropy of the word H(T|s); this entropy approximates the semantic entropy of the word that can be interpreted either as (a) the semantic ambiguity or (b) the inverse of reliability. Thus, a word with high semantic entropy is potentially very ambiguous and therefore, its translations are less reliable (or highly context-dependent). We also use entropy to approximate meaning predictability.

Melamed (1997a) investigates various techniques to identify non-compositional compounds in parallel data. Non-compositional compounds are those sequences of 2 or more words (adjacent or separate) that show a conventionalized meaning. From English-French parallel corpora, Melamed's method induces and compares pairs of translation models. Models that take into account non-compositional compounds are highly accurate in the identification task.

2 Data and resources

We base our investigations on the Europarl corpus consisting of several years of proceedings from the European Parliament (Koehn, 2003). We focus on Dutch expressions and their translations into English, Spanish and German. Thus, we used the entire sections of Europarl in these three languages. The corpus has been tokenized and aligned at the sentence level (Tiedemann and Nygaard, 2004). The Dutch part contains about 29 million tokens in about 1.2 million sentences. The English, Spanish and German counterparts are of similar size, between 28 and 30 million words in roughly the same number of sentences.

Automatic word alignment has been done using GIZA++ (Och, 2003). We used standard settings of the system to produce Viterbi alignments of IBM model 4. Alignments have been produced for both translation directions (source to target and target to source) on tokenized plain text. We also used a well-known heuristic for combining the two directional alignments, the so-called refined alignment (Och et al., 1999). Word-to-word alignments have been merged such that words are connected with each other if they are linked to the same target. In this way we obtained three different word alignment files: source to target (src2trg) with possible multi-word units in the source language, target to source (trg2src) with possible multi-word units in the target language, and refined with possible multi-word units in both languages. We also created bilingual word type links from the different word-aligned corpora. These lists include alignment frequencies that we will use later on for extracting default alignments for individual words. Henceforth, we will call them link lexica.
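A link lexicon of the kind described here can be sketched as a table of alignment-type frequencies per source word. The snippet below is only an illustration under stated assumptions: the input format (a flat list of source/target word pairs), the function names and the toy data are hypothetical and are not taken from the paper or from GIZA++ output. The cut-off of four default alignments mirrors the setting reported in section 4.3.

```python
from collections import Counter, defaultdict


def build_link_lexicon(aligned_pairs):
    """Count how often each source word type is linked to each target word type.

    aligned_pairs: iterable of (source_word, target_word) tuples, one per
    word-alignment link (hypothetical input format).
    """
    lexicon = defaultdict(Counter)
    for source_word, target_word in aligned_pairs:
        lexicon[source_word][target_word] += 1
    return lexicon


def default_alignments(lexicon, source_word, k=4):
    """Return the k most frequent target types linked to source_word."""
    return [target for target, _ in lexicon[source_word].most_common(k)]


# Toy data, invented for illustration only.
links = [
    ("licht", "light"), ("licht", "light"), ("licht", "bright"),
    ("aan", "to"), ("aan", "NO_LINK"), ("aan", "to"),
]
lexicon = build_link_lexicon(links)
print(default_alignments(lexicon, "licht", k=1))
```

Keeping the raw frequencies rather than only the top-ranked types leaves the choice of cut-off open for later experiments.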

Footnote 4: Available at http://www.let.rug.nl/~vannoord/alp/Alpino.

Footnote 5: Butt (2003) maintains that the first 7 verbs are examples of support verbs crosslinguistically. The other 5 have been suggested for Dutch by Hollebrandse (1993).

[...] which we considered a manageable size to test our method.

4 Methodology

We examine how expressions in the source language (Dutch) are conceptualized in a target language. The translations in the target language encode the meaning of the expression in the source language. Using the translation links in parallel corpora, we attempt to establish what type of meaning the expression in the source language has. To accomplish this we make use of the three word-aligned parallel corpora from Europarl as described in section 2.

Once the translation links of each expression in the source language have been collected, the entropy observed among the translation links is computed per expression. We also take into account how often the translation of an expression is made out of the default alignment for each triple component. The default 'translation' is extracted from the corresponding bilingual link lexicon.

4.1 Collecting alignments

For each triple in the source language (Dutch) we collect its corresponding (hypothetical) translations in a target language. Thus, we have a list of 200 VERB PP triples representing 200 potential MWEs in Dutch. We selected all occurrences of each triple in the source language and all aligned sentences containing their corresponding translations into English, German and Spanish. We restricted ourselves to instances found in 1:1 sentence alignments. Other units contain many errors in word and sentence alignment and, therefore, we discarded them. Relying on automated word-alignment, we collect all translation links for each verb, preposition and noun occurrence within the triple context in the three target languages.

To capture the meaning of a source expression (triple) S, we collect all the translation links of its component words s in each target language. Thus, for each triple, we gather three lists of translation links T_s. Let us see the example AAN LICHT BRENG representing the MWE iets aan het licht brengen 'reveal'. Table 1 shows some of the links found for the triple AAN LICHT BRENG. If a word in the source language has no link in the target language (which is usually due to alignments to the empty word), NO_LINK [...]

Table 1 (only partly recoverable): links found for the triple AAN LICHT BRENG.

aan      NO_LINK, [...]
licht    [...] light, revealed, exposed, highlight, shown, shed light, clarify
breng    NO_LINK, [...]
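The collection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the occurrence records, their keys and the example sentence are all hypothetical, and a source word without a link is recorded with the placeholder NO_LINK.

```python
from collections import defaultdict

NO_LINK = "NO_LINK"


def collect_links(occurrences):
    """Gather the target-side links of each component of one triple.

    occurrences: list of dicts with hypothetical keys
      'target'     - list of target-language tokens
      'alignment'  - dict mapping source token index -> target token index
      'components' - dict mapping component lemma -> source token index
    Returns a dict mapping each component lemma to its list of linked words.
    """
    links = defaultdict(list)
    for occ in occurrences:
        for lemma, src_idx in occ["components"].items():
            trg_idx = occ["alignment"].get(src_idx)
            if trg_idx is None:
                # The aligner linked this word to the empty word.
                links[lemma].append(NO_LINK)
            else:
                links[lemma].append(occ["target"][trg_idx])
    return dict(links)


# One invented occurrence of the triple AAN LICHT BRENG.
occurrence = {
    "target": ["this", "brings", "problems", "to", "light"],
    "alignment": {1: 1, 5: 4},  # 'aan' (index 3) has no link
    "components": {"aan": 3, "licht": 5, "breng": 1},
}
print(collect_links([occurrence]))
```

Running the function once per target language yields the three link lists per triple that the text refers to as T_s.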

4.2 Measuring translational entropy

According to our intuition it is harder to align words in idiomatic expressions than other words. Thus, we expect a larger variety of links (including erroneous alignments) for words in such expressions than for words taken from expressions with a more literal meaning. For the latter, we expect fewer alignment candidates, possibly with only one dominant default translation. Entropy is a good measure for the unpredictability of an event. We use this measure for comparing the alignment of our candidates and expect a high average entropy for idiomatic expressions. In this way we approximate a measure for meaning predictability.

For each word in a triple, we compute the entropy of the aligned target words as shown in equation (1).

$$H(T_s \mid s) = -\sum_{t \in T_s} P(t \mid s)\,\log P(t \mid s) \qquad (1)$$

This measure is equivalent to translational entropy (Melamed, 1997b). P(t|s) is estimated as the proportion of alignment t among all alignments of word s found in the corpus in the context of the given triple. Finally, the translational entropy of a triple is the average translational entropy of its components. It is unclear how to treat NO_LINKS, [...] (2) counting NO_LINKS as one unique type.
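The entropy score can be sketched in a few lines. The code below is an illustration under assumptions that the text does not fix: the logarithm is taken to base 2, the `no_link_mode` switch and its two values are hypothetical names, and only two treatments of NO_LINK are covered (dropping them, or counting them as one type). The example links are invented.

```python
import math
from collections import Counter

NO_LINK = "NO_LINK"


def word_entropy(links, no_link_mode="drop"):
    """Entropy of the target links observed for one component word.

    no_link_mode (hypothetical switch):
      'drop' - ignore NO_LINK entries
      'one'  - treat all NO_LINK entries as a single target type
    """
    if no_link_mode == "drop":
        links = [t for t in links if t != NO_LINK]
    counts = Counter(links)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # P(t|s) is the relative frequency of link t; base-2 log is an assumption.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def triple_entropy(links_per_word, no_link_mode="drop"):
    """Average the per-word entropies over the components of a triple."""
    entropies = [word_entropy(links, no_link_mode)
                 for links in links_per_word.values()]
    return sum(entropies) / len(entropies)


# Invented links for illustration.
example = {
    "aan": ["to", "to", NO_LINK, "in"],
    "licht": ["light", "light", "light", "light"],
    "breng": ["bring", "reveal"],
}
print(round(triple_entropy(example), 4))
```

A component with a single dominant translation contributes an entropy near zero, so triples whose words are translated consistently end up low in a ranking by this score.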

4.3 Proportion of default alignments (pda)

If an expression has a literal meaning, we expect the default alignments to be accurate literal translations. If an expression has idiomatic meaning, the default alignments will be very different from the links observed in the translations.

For each triple S, we count how often each of its components s is linked to one of the default alignments D_s. For the latter, we used the four most frequent alignment types extracted from the corresponding link lexicon as described in section 2. A large proportion of default alignments suggests that the expression is very likely to have literal meaning; a low percentage is suggestive of non-transparent meaning. Formally, pda is calculated in the following way:

$$\mathrm{pda}(S) = \frac{\sum_{s \in S} \sum_{d \in D_s} \mathrm{align\_freq}(s,d)}{\sum_{s \in S} \sum_{t \in T_s} \mathrm{align\_freq}(s,t)}$$

where align_freq(s,t) is the alignment frequency of word s to word t in the context of the triple S.
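As a rough sketch of the pda score described above, the function below divides the number of links that hit a default alignment by the number of all links observed for the triple. The data structures, names and example links are hypothetical; NO_LINK entries are simply counted like any other link here, which is an assumption.

```python
from collections import Counter


def pda(links_per_word, defaults_per_word):
    """Proportion of default alignments for one triple.

    links_per_word:    dict component -> list of observed target links (T_s)
    defaults_per_word: dict component -> set of default targets (D_s)
    """
    default_count = 0
    total_count = 0
    for word, links in links_per_word.items():
        freq = Counter(links)  # alignment frequency within the triple context
        defaults = defaults_per_word.get(word, set())
        total_count += sum(freq.values())
        # Counter returns 0 for defaults that were never observed.
        default_count += sum(freq[d] for d in defaults)
    return default_count / total_count if total_count else 0.0


# Invented links and defaults for illustration.
observed = {
    "licht": ["light", "light", "revealed", "exposed"],
    "breng": ["bring", "bring", "highlight", "NO_LINK"],
}
defaults = {
    "licht": {"light", "lamp"},
    "breng": {"bring", "take"},
}
print(pda(observed, defaults))
```

In this toy case four of the eight links are default alignments, so the score is 0.5; multiplying by 100 gives a percentage.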

5 Discussion of experiments and results

We experimented with the three word-alignment types (src2trg, trg2src and refined) and the two scoring methods (entropy and pda). The 200 candidate MWEs have been assessed and classified into idiomatic or literal expressions by a human expert. For assessing performance, standard precision and recall are not applicable in our case because we do not want to define an artificial cut-off for our ranked list but evaluate the ranking itself. Instead, we measured the performance of each alignment type and scoring method with another evaluation metric employed in information retrieval, uninterpolated average precision (uap), which aggregates precision points into one evaluation figure. At each point c where a true positive S_c in the retrieved list is found, the precision P(S_1..S_c) is computed, and all precision points are then averaged (Manning and Schütze, 1999).

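$$\mathrm{uap} = \frac{\sum_{S_c} P(S_1..S_c)}{|S_c|}$$

Here |S_c| stands for the number of true positives, i.e. the number of precision points being averaged.

Footnote (fragment): [...] NO_LINKS into account when computing the proportions.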
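The evaluation metric can be illustrated with a short sketch that follows the verbal definition: precision is taken at every rank where a true positive occurs, and the resulting values are averaged. The function name and the toy ranking are assumptions made for illustration.

```python
def uninterpolated_average_precision(ranked_labels):
    """Average of the precision values at each true positive in a ranked list.

    ranked_labels: list of booleans in rank order, True where the candidate
    was judged idiomatic by the human annotator.
    """
    hits = 0
    precisions = []
    for position, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            # Precision over the list prefix ending at this true positive.
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy ranking: true positives at ranks 1, 3 and 4.
print(round(uninterpolated_average_precision([True, False, True, True]), 4))
```

Because no cut-off is needed, the score rewards rankings that place true positives early and penalizes every false positive that precedes one.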

[...] NO_LINKS) with the three alignment types for the NL-EN language pair.

Alignment    uap
[...]
baseline     0.755

Table 2: uap values of various alignments.

Using word alignments improves the ranking of candidates in all three cases. Among them, src2trg shows the best performance. This is surprising because the quality of word-alignment from English-to-Dutch (trg2src) in general is higher due to differences in compounding in the two languages. However, this is mainly an issue for noun phrases, which make up only one component in the triples.

We assume that src2trg works better in our case because in this alignment model we explicitly link each word in the source language to exactly one target word (or the empty word) whereas in the trg2src model we often get multiple words (in the target language) aligned to individual words in the triple. Many errors are introduced in such alignment units. Table 3 illustrates this with an example with links for the Dutch triple op prijs stel corresponding to the expression iets op prijs stellen 'to appreciate sth.'

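[Table 3 is only partly recoverable. Surviving src2trg rows (source, target): gesteld, appreciate; prijs, appreciate. Surviving trg2src fragments: stellen, gesteld, prijs, op, "be", "fact" and NO_LINK; their pairing is lost.]

Table 3: Example src2trg and trg2src alignments for the triple OP PRIJS STEL.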

src2trg alignment proposes appreciate as a link to all three triple components. This type of alignment is not possible in trg2src. Instead, trg2src includes two NO_LINKS [...] and many alignment alternatives in trg2src that influence our entropy scores. This can be observed for idiomatic expressions as well as for literal expressions, which makes translational entropy less reliable in trg2src alignments for contrasting these two types of expressions.

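The refined alignment model starts with the intersection of the two directional models and iteratively adds links if they meet some adjacency constraints. This results in many NO_LINKS [...]

[Table 4 is only partly recoverable. Surviving row labels: "entropy, without NO_LINKS" and "NO_LINKS = many". Surviving values, with cell positions lost: 0.858, 0.0, 0.883. The baseline is 0.755 for each of the three language pairs.]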

Table 4: Translational entropy and the pda across three language pairs. Alignment is src2trg.

All scores produce better rankings than the baseline. In general, pda achieves a slightly better accuracy than entropy except for the NL-DE language pair. Nevertheless, the difference between the metrics is hardly significant.

5.3 Further improvements

One problem in our data is that we deal with word-form alignments and not with lemmatized versions. For Dutch, we know the lemma of each word instance from our candidate set. However, for the target languages, we only have access to surface forms from the corpus. Naturally, inflectional variations influence entropy scores (because of the larger variety of alignment types) and also the pda scores (where the exact word forms have to be matched with the default alignments instead of lemmas). In order to test the effect of lemmatization on different language pairs, we used CELEX (Baayen et al., 1993) for English and German to reduce word forms in the alignments and in the link lexicon to corresponding lemmas. We assigned the most frequent lemma to ambiguous word forms. Table 5 shows the scores obtained from applying lemmatization for the src2trg alignment using entropy (without NO_LINKS [...]

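[Table 5 is only partly recoverable. Surviving row groups: "using entropy scores" and "using pda scores". The baseline is 0.755 for each of the three language pairs.]

Table 5: Translational entropy and pda from src2trg alignments across language pairs with different settings.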

Surprisingly, lemmatization adds little or even decreases the accuracy of the pda and entropy scores. It is also surprising that lemmatization does not affect the scores for morphologically richer languages such as German (compared to English). One possible reason for this is that lemmatization discards morphological information that is crucial to identify idiomatic expressions. In fact, nouns in idiomatic expressions are more fixed than nouns in literal expressions. By contrast, verbs in idiomatic expressions often allow tense inflection. By clustering word forms into lemmas we lose this information. In future work, we might lemmatize only the verb.

Another issue is the reliability of the word alignment that we base our investigation upon. We want to make use of the fact that automatic word alignment has problems with the alignment of individual words that belong to larger lexical units. However, we believe that the alignment program in general has problems with highly ambiguous words such as prepositions. Therefore, prepositions might blur the contrast between idiomatic expressions and literal translations when measured on the alignment of individual words. Table 5 includes scores for ranking our candidate expressions with and without prepositions. We observe that there is a large improvement when leaving out the alignments of prepositions. This is consistent for all language pairs and the scores we used for ranking.
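Leaving out prepositions amounts to averaging the per-word scores over the verb and noun only. The sketch below illustrates this under assumptions: the part-of-speech labels, the function name and the scores are hypothetical, and the per-word scores are taken to have been computed beforehand.

```python
def average_without_prepositions(score_per_word, pos_per_word):
    """Average per-word scores of a triple, skipping preposition components.

    score_per_word: dict component -> score (e.g. translational entropy)
    pos_per_word:   dict component -> part-of-speech label (hypothetical tags)
    """
    kept = [score for word, score in score_per_word.items()
            if pos_per_word.get(word) != "prep"]
    return sum(kept) / len(kept) if kept else 0.0


# Invented scores for the triple AAN LICHT BRENG.
scores = {"aan": 2.0, "licht": 0.5, "breng": 1.5}
tags = {"aan": "prep", "licht": "noun", "breng": "verb"}
print(average_without_prepositions(scores, tags))
```

With these toy values the highly ambiguous preposition no longer inflates the triple score, which drops from about 1.33 to 1.0.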

rank  pda    entropy    MWE  triple
118   36.90  4.2931     ok   vast met voldoening 'settle with satisfaction'
119   36.49  [garbled]  ok   kom tot akkoord 'reach agreement'
120   14.06  [garbled]  ok   breng in stemming 'get in mood'
...
180   70.53  2.7395     *    sta op schroeven 'be unsettled'
181   52.33  2.7351     *    voldoe aan criterium 'satisfy criterion'
182   74.71  [garbled]  *    beschik over informatie 'decide over information'
183   76.56  2.5883     *    stem voor amendement 'vote for amending'
...
187   80.39  2.0992     *    neem terug naar commissie 'refer to commission'
188   78.04  2.0924     *    stem tegen amendement 'vote against amending'
189   77.63  1.9997     *    onthoud van stemming 'withhold one's vote'
190   82.21  1.9020     *    feliciteer met werk 'congratulate with work'
191   77.78  1.9016     *    stem voor verslag 'vote for report'
192   86.36  1.8775     *    schep van werkgelegenheid 'set up of employment'
193   73.33  1.8687     *    stem voor resolutie 'vote for resolution'
194   39.13  1.8497     *    bedank voor feit 'thank for fact'

[Table caption lost. Rows are paired in the order in which the cells survive in the source.]

[...] number of NO_LINKS is harder to translate compositionally, and is probably an idiomatic or ambiguous expression. Alternatively, an expression with no NO_LINKS [...]
