Multinomial logistic regression
"Multinomial regression" redirects here. For the related probit procedure, see Multinomial probit.

In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes.[1] That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.).

Multinomial logistic regression is known by a variety of other names, including polytomous LR,[2][3] multiclass LR, softmax regression, multinomial logit (mlogit), the maximum entropy (MaxEnt) classifier, and the conditional maximum entropy model.[4]

Background

Multinomial logistic regression is used when the dependent variable in question is nominal (equivalently categorical, meaning that it falls into any one of a set of categories that cannot be ordered in any meaningful way) and for which there are more than two categories. Some examples would be:

- Which major will a college student choose, given their grades, stated likes and dislikes, etc.?
- Which blood type does a person have, given the results of various diagnostic tests?
- In a hands-free mobile phone dialing application, which person's name was spoken, given various properties of the speech signal?
- Which candidate will a person vote for, given particular demographic characteristics?
- Which country will a firm locate an office in, given the characteristics of the firm and of the various candidate countries?
These are all statistical classification problems. They all have in common a dependent variable to be predicted that comes from one of a limited set of items that cannot be meaningfully ordered, as well as a set of independent variables (also known as features, explanators, etc.), which are used to predict the dependent variable. Multinomial logistic regression is a particular solution to classification problems that use a linear combination of the observed features and some problem-specific parameters to estimate the probability of each particular value of the dependent variable. The best values of the parameters for a given problem are usually determined from some training data (e.g. some people for whom both the diagnostic test results and blood types are known, or some examples of known words being spoken).

Assumptions

The multinomial logistic model assumes that data are case-specific; that is, each independent variable has a single value for each case. The multinomial logistic model also assumes that the dependent variable cannot be perfectly predicted from the independent variables for any case. As with other types of regression, there is no need for the independent variables to be statistically independent from each other (unlike, for example, in a naive Bayes classifier); however, collinearity is assumed to be relatively low, as it becomes difficult to differentiate between the impact of several variables if this is not the case.[5]

If the multinomial logit is used to model choices, it relies on the assumption of independence of irrelevant alternatives (IIA), which is not always desirable. This assumption states that the odds of preferring one class over another do not depend on the presence or absence of other "irrelevant" alternatives. For example, the relative probabilities of taking a car or bus to work do not change if a bicycle is added as an additional possibility. This allows the choice of K alternatives to be modeled as a set of K − 1 independent binary choices, in which one alternative is chosen as a "pivot" and the other K − 1 are compared against it, one at a time. The IIA hypothesis is a core hypothesis in rational choice theory; however, numerous studies in psychology show that individuals often violate this assumption when making choices. An example of a problem case arises if choices include a car and a blue bus. Suppose the odds ratio between the two is 1 : 1. Now if the option of a red bus is introduced, a person may be indifferent between a red and a blue bus, and hence may exhibit a car : blue bus : red bus odds ratio of 1 : 0.5 : 0.5, thus maintaining a 1 : 1 ratio of car : any bus while adopting a changed car : blue bus ratio of 1 : 0.5. Here the red bus option was not in fact irrelevant, because a red bus was a perfect substitute for a blue bus.

If the multinomial logit is used to model choices, it may in some situations impose too strong a constraint on the relative preferences between the different alternatives. This point is especially important to take into account if the analysis aims to predict how choices would change if one alternative were to disappear (for instance, if one political candidate withdraws from a three-candidate race). Other models like the nested logit or the multinomial probit may be used in such cases, as they allow for violation of the IIA.[6] The red-bus problem can be seen directly in the model's algebra, as in the sketch below.
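The following is a minimal numpy sketch of the red-bus example (the scores and function name are illustrative, not from the article). Under the logit model, the odds between any two alternatives depend only on their own scores, so adding a third alternative leaves existing odds ratios unchanged and draws probability mass from all alternatives proportionally:

```python
import numpy as np

def probs(scores):
    """Logit choice probabilities from per-alternative scores (beta_k . x)."""
    z = np.exp(scores - np.max(scores))
    return z / z.sum()

two = probs(np.array([1.0, 1.0]))           # car, blue bus
three = probs(np.array([1.0, 1.0, 1.0]))    # car, blue bus, red bus

# The car : blue-bus odds are unchanged (IIA): both ratios are 1.0 ...
print(two[0] / two[1], three[0] / three[1])
# ... so the red bus draws share equally from the car and the blue bus
# (the car drops from 1/2 to 1/3), which is unrealistic when the red bus
# is a perfect substitute for the blue bus only.
print(two, three)
```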
Model

See also: Logistic regression

Introduction

There are multiple equivalent ways to describe the mathematical model underlying multinomial logistic regression. This can make it difficult to compare different treatments of the subject in different texts. The article on logistic regression presents a number of equivalent formulations of simple logistic regression, and many of these have analogues in the multinomial logit model.

The idea behind all of them, as in many other statistical classification techniques, is to construct a linear predictor function that constructs a score from a set of weights that are linearly combined with the explanatory variables (features) of a given observation using a dot product:

\operatorname{score}(\mathbf{X}_i, k) = \boldsymbol{\beta}_k \cdot \mathbf{X}_i,

where X_i is the vector of explanatory variables describing observation i, β_k is a vector of weights (or regression coefficients) corresponding to outcome k, and score(X_i, k) is the score associated with assigning observation i to category k. In discrete choice theory, where observations represent people and outcomes represent choices, the score is considered the utility associated with person i choosing outcome k. The predicted outcome is the one with the highest score.

The difference between the multinomial logit model and numerous other methods, models, algorithms, etc. with the same basic setup (the perceptron algorithm, support vector machines, linear discriminant analysis, etc.) is the procedure for determining (training) the optimal weights/coefficients and the way that the score is interpreted. In particular, in the multinomial logit model, the score can directly be converted to a probability value, indicating the probability of observation i choosing outcome k given the measured characteristics of the observation. This provides a principled way of incorporating the prediction of a particular multinomial logit model into a larger procedure that may involve multiple such predictions, each with a possibility of error. Without such means of combining predictions, errors tend to multiply. For example, imagine a large predictive model that is broken down into a series of submodels where the prediction of a given submodel is used as the input of another submodel, and that prediction is in turn used as the input into a third submodel, etc. If each submodel has 90% accuracy in its predictions, and there are five submodels in series, then the overall model has only 0.9^5 ≈ 59% accuracy. If each submodel has 80% accuracy, then overall accuracy drops to 0.8^5 ≈ 33% accuracy. This issue is known as error propagation and is a serious problem in real-world predictive models, which are usually composed of numerous parts. Predicting probabilities of each possible outcome, rather than simply making a single optimal prediction, is one means of alleviating this issue.[citation needed]

Setup

The basic setup is the same as in logistic regression, the only difference being that the dependent variables are categorical rather than binary, i.e. there are K possible outcomes rather than just two. The following description is somewhat shortened; for more details, consult the logistic regression article.

Data points

Specifically, it is assumed that we have a series of N observed data points. Each data point i (ranging from 1 to N) consists of a set of M explanatory variables x_{1,i} ... x_{M,i} (also known as independent variables, predictor variables, features, etc.), and an associated categorical outcome Y_i (also known as the dependent variable or response variable), which can take on one of K possible values. These possible values represent logically separate categories (e.g. different political parties, blood types, etc.), and are often described mathematically by arbitrarily assigning each a number from 1 to K. The explanatory variables and outcome represent observed properties of the data points, and are often thought of as originating in the observations of N "experiments", although an "experiment" may consist in nothing more than gathering data. The goal of multinomial logistic regression is to construct a model that explains the relationship between the explanatory variables and the outcome, so that the outcome of a new "experiment" can be correctly predicted for a new data point for which the explanatory variables, but not the outcome, are available. In the process, the model attempts to explain the relative effect of differing explanatory variables on the outcome.
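As a concrete picture of this data layout and of the scoring step, here is a minimal numpy sketch (the dimensions and variable names are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: N observations, M explanatory variables, K outcomes.
N, M, K = 1000, 4, 3

# Design matrix: a leading column of ones for the intercept, giving the
# vectors of size M + 1 described above; shape (N, M + 1).
X = np.column_stack([np.ones(N), rng.normal(size=(N, M))])

# One coefficient vector beta_k per outcome, shape (K, M + 1).
B = rng.normal(size=(K, M + 1))

scores = X @ B.T              # score(X_i, k) = beta_k . X_i, shape (N, K)
y_hat = scores.argmax(axis=1) # predicted outcome: the highest-scoring category
```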
Some examples:

- The observed outcomes are different variants of a disease such as hepatitis (possibly including "no disease" and/or other related diseases) in a set of patients, and the explanatory variables might be characteristics of the patients thought to be pertinent (sex, race, age, blood pressure, outcomes of various liver-function tests, etc.). The goal is then to predict which disease is causing the observed liver-related symptoms in a new patient.
- The observed outcomes are the party chosen by a set of people in an election, and the explanatory variables are the demographic characteristics of each person (e.g. sex, race, age, income, etc.). The goal is then to predict the likely vote of a new voter with given characteristics.

Linear predictor

As in other forms of linear regression, multinomial logistic regression uses a linear predictor function f(k, i) to predict the probability that observation i has outcome k, of the following form:

f(k, i) = \beta_{0,k} + \beta_{1,k} x_{1,i} + \beta_{2,k} x_{2,i} + \cdots + \beta_{M,k} x_{M,i},

where \beta_{m,k} is a regression coefficient associated with the mth explanatory variable and the kth outcome. As explained in the logistic regression article, the regression coefficients and explanatory variables are normally grouped into vectors of size M + 1, so that the predictor function can be written more compactly:

f(k, i) = \boldsymbol{\beta}_k \cdot \mathbf{x}_i,

where \boldsymbol{\beta}_k is the set of regression coefficients associated with outcome k, and \mathbf{x}_i (a row vector) is the set of explanatory variables associated with observation i.

As a set of independent binary regressions

To arrive at the multinomial logit model, one can imagine, for K possible outcomes, running K − 1 independent binary logistic regression models, in which one outcome is chosen as a "pivot" and then the other K − 1 outcomes are separately regressed against the pivot outcome. This would proceed as follows, if outcome K (the last outcome) is chosen as the pivot:

\ln \frac{\Pr(Y_i = 1)}{\Pr(Y_i = K)} = \boldsymbol{\beta}_1 \cdot \mathbf{X}_i
\ln \frac{\Pr(Y_i = 2)}{\Pr(Y_i = K)} = \boldsymbol{\beta}_2 \cdot \mathbf{X}_i
\cdots
\ln \frac{\Pr(Y_i = K-1)}{\Pr(Y_i = K)} = \boldsymbol{\beta}_{K-1} \cdot \mathbf{X}_i

This formulation is also known as the alr transform commonly used in compositional data analysis. Note that we have introduced separate sets of regression coefficients, one for each possible outcome.
If we exponentiate both sides and solve for the probabilities, we get:

\Pr(Y_i = 1) = \Pr(Y_i = K)\, e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}
\Pr(Y_i = 2) = \Pr(Y_i = K)\, e^{\boldsymbol{\beta}_2 \cdot \mathbf{X}_i}
\cdots
\Pr(Y_i = K-1) = \Pr(Y_i = K)\, e^{\boldsymbol{\beta}_{K-1} \cdot \mathbf{X}_i}

Using the fact that all K of the probabilities must sum to one, we find:

\Pr(Y_i = K) = 1 - \sum_{k=1}^{K-1} \Pr(Y_i = k) = 1 - \sum_{k=1}^{K-1} \Pr(Y_i = K)\, e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i} \;\Rightarrow\; \Pr(Y_i = K) = \frac{1}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}}

We can use this to find the other probabilities:

\Pr(Y_i = 1) = \frac{e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}}
\Pr(Y_i = 2) = \frac{e^{\boldsymbol{\beta}_2 \cdot \mathbf{X}_i}}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}}
\cdots
\Pr(Y_i = K-1) = \frac{e^{\boldsymbol{\beta}_{K-1} \cdot \mathbf{X}_i}}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}}

where the summation runs from 1 to K − 1. Or, generally:

\Pr(Y_i = k) = \frac{e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol{\beta}_j \cdot \mathbf{X}_i}}

where \boldsymbol{\beta}_K is defined to be zero. The fact that we run multiple regressions reveals why the model relies on the assumption of independence of irrelevant alternatives described above.
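These pivot-form probability equations translate directly into code. The following is a minimal numpy sketch (array shapes and the function name are illustrative, not from the article); for large scores a numerically stabilized version such as the softmax shown later is preferable:

```python
import numpy as np

def pivot_probabilities(B, x):
    """Class probabilities under the pivot parameterization.

    B: (K-1, M+1) array holding beta_1 ... beta_{K-1}; the pivot class K
       has beta_K = 0 implicitly.
    x: (M+1,) feature vector, with a leading 1 for the intercept.
    Returns a length-K vector of probabilities summing to 1.
    """
    scores = B @ x               # beta_k . x for k = 1 .. K-1
    expd = np.exp(scores)
    denom = 1.0 + expd.sum()     # the "1 +" is exp(0) for the pivot class K
    return np.append(expd / denom, 1.0 / denom)

# Usage with made-up numbers: K = 3 classes, M = 2 features plus intercept.
B = np.array([[0.2, 1.0, -0.5],
              [-0.3, 0.4, 0.8]])
x = np.array([1.0, 0.5, -1.2])
print(pivot_probabilities(B, x))   # three probabilities summing to 1
```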
Estimating the coefficients

The unknown parameters in each vector β_k are typically jointly estimated by maximum a posteriori (MAP) estimation, which is an extension of maximum likelihood using regularization of the weights to prevent pathological solutions (usually a squared regularizing function, which is equivalent to placing a zero-mean Gaussian prior distribution on the weights, but other distributions are also possible). The solution is typically found using an iterative procedure such as generalized iterative scaling,[7] iteratively reweighted least squares (IRLS),[8] by means of gradient-based optimization algorithms such as L-BFGS,[4] or by specialized coordinate descent algorithms.[9]
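As a rough illustration of MAP estimation with a Gaussian prior, here is a sketch using plain batch gradient descent on the L2-regularized negative log-likelihood. This is a minimal stand-in, not one of the specialized algorithms cited above, and the hyperparameters (learning rate, penalty, iteration count) are arbitrary; note also that, for simplicity, the intercept column is penalized along with the other weights:

```python
import numpy as np

def fit_map(X, y, K, l2=1.0, lr=0.1, iters=500):
    """MAP estimation sketch for multinomial logistic regression.

    X: (N, M+1) design matrix with an intercept column.
    y: (N,) integer labels in 0 .. K-1.
    Minimizes mean negative log-likelihood + (l2/2) * ||B||^2 by gradient descent.
    """
    N, D = X.shape
    B = np.zeros((K, D))                    # one coefficient vector per class
    Y = np.eye(K)[y]                        # one-hot targets, shape (N, K)
    for _ in range(iters):
        S = X @ B.T                         # scores, shape (N, K)
        S -= S.max(axis=1, keepdims=True)   # stabilize the exponentials
        P = np.exp(S)
        P /= P.sum(axis=1, keepdims=True)   # softmax probabilities
        grad = (P - Y).T @ X / N + l2 * B   # likelihood gradient + prior term
        B -= lr * grad
    return B
```

Without the penalty term the full set of K coefficient vectors would be nonidentifiable (see the discussion of identifiability below); the Gaussian prior picks out a unique solution.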
As a log-linear model

The formulation of binary logistic regression as a log-linear model can be directly extended to multi-way regression. That is, we model the logarithm of the probability of seeing a given output using the linear predictor as well as an additional normalization factor, the logarithm of the partition function:

\ln \Pr(Y_i = 1) = \boldsymbol{\beta}_1 \cdot \mathbf{X}_i - \ln Z
\ln \Pr(Y_i = 2) = \boldsymbol{\beta}_2 \cdot \mathbf{X}_i - \ln Z
\cdots
\ln \Pr(Y_i = K) = \boldsymbol{\beta}_K \cdot \mathbf{X}_i - \ln Z

As in the binary case, we need an extra term -\ln Z to ensure that the whole set of probabilities forms a probability distribution, i.e. so that they all sum to one:

\sum_{k=1}^{K} \Pr(Y_i = k) = 1

The reason why we need to add a term to ensure normalization, rather than multiply as is usual, is because we have taken the logarithm of the probabilities. Exponentiating both sides turns the additive term into a multiplicative factor, so that the probability is just the Gibbs measure:

\Pr(Y_i = 1) = \frac{1}{Z} e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}
\Pr(Y_i = 2) = \frac{1}{Z} e^{\boldsymbol{\beta}_2 \cdot \mathbf{X}_i}
\cdots
\Pr(Y_i = K) = \frac{1}{Z} e^{\boldsymbol{\beta}_K \cdot \mathbf{X}_i}

The quantity Z is called the partition function for the distribution. We can compute the value of the partition function by applying the above constraint that requires all probabilities to sum to 1:

1 = \sum_{k=1}^{K} \Pr(Y_i = k) = \sum_{k=1}^{K} \frac{1}{Z} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i} = \frac{1}{Z} \sum_{k=1}^{K} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}

Therefore:

Z = \sum_{k=1}^{K} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}

Note that this factor is "constant" in the sense that it is not a function of Y_i, which is the variable over which the probability distribution is defined. However, it is definitely not constant with respect to the explanatory variables, or, crucially, with respect to the unknown regression coefficients β_k, which we will need to determine through some sort of optimization procedure.

The resulting equations for the probabilities are

\Pr(Y_i = 1) = \frac{e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}}
\Pr(Y_i = 2) = \frac{e^{\boldsymbol{\beta}_2 \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}}
\cdots
\Pr(Y_i = K) = \frac{e^{\boldsymbol{\beta}_K \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}}

Or, generally:

\Pr(Y_i = c) = \frac{e^{\boldsymbol{\beta}_c \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}}

The following function:

\operatorname{softmax}(k, x_1, \ldots, x_n) = \frac{e^{x_k}}{\sum_{i=1}^{n} e^{x_i}}

is referred to as the softmax function. The reason is that the effect of exponentiating the values x_1, \ldots, x_n is to exaggerate the differences between them. As a result, \operatorname{softmax}(k, x_1, \ldots, x_n) will return a value close to 0 whenever x_k is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. Thus, the softmax function can be used to construct a weighted average that behaves as a smooth function (which can be conveniently differentiated, etc.) and which approximates the indicator function

f(k) = \begin{cases} 1 & \text{if } k = \operatorname{arg\,max}(x_1, \ldots, x_n), \\ 0 & \text{otherwise}. \end{cases}

Thus, we can write the probability equations as

\Pr(Y_i = c) = \operatorname{softmax}(c, \boldsymbol{\beta}_1 \cdot \mathbf{X}_i, \ldots, \boldsymbol{\beta}_K \cdot \mathbf{X}_i)

The softmax function thus serves as the equivalent of the logistic function in binary logistic regression.
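In code, the softmax is usually implemented in a numerically stabilized form. A minimal numpy sketch (the stabilization trick is standard practice, not something the article itself discusses):

```python
import numpy as np

def softmax(x):
    """Softmax over a vector of scores. Subtracting max(x) first leaves the
    result unchanged (softmax is invariant to a common shift) but prevents
    overflow in exp for large scores."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

# The "exaggeration of differences" described above: the largest score takes
# nearly all the mass, unless a runner-up is very close.
print(softmax(np.array([5.0, 1.0, 0.5])))   # ~[0.97, 0.02, 0.01]
print(softmax(np.array([5.0, 4.9, 0.5])))   # ~[0.52, 0.47, 0.01]
```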
Note that not all of the \boldsymbol{\beta}_k vectors of coefficients are uniquely identifiable. This is due to the fact that all probabilities must sum to 1, making one of them completely determined once all the rest are known. As a result, there are only K − 1 separately specifiable probabilities, and hence K − 1 separately identifiable vectors of coefficients. One way to see this is to note that if we add a constant vector to all of the coefficient vectors, the equations are identical:

\frac{e^{(\boldsymbol{\beta}_c + C) \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{(\boldsymbol{\beta}_k + C) \cdot \mathbf{X}_i}} = \frac{e^{\boldsymbol{\beta}_c \cdot \mathbf{X}_i} e^{C \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i} e^{C \cdot \mathbf{X}_i}} = \frac{e^{C \cdot \mathbf{X}_i} e^{\boldsymbol{\beta}_c \cdot \mathbf{X}_i}}{e^{C \cdot \mathbf{X}_i} \sum_{k=1}^{K} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}} = \frac{e^{\boldsymbol{\beta}_c \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}}

As a result, it is conventional to set C = -\boldsymbol{\beta}_K (or alternatively, one of the other coefficient vectors). Essentially, we set the constant so that one of the vectors becomes 0, and all of the other vectors get transformed into the difference between those vectors and the vector we chose. This is equivalent to "pivoting" around one of the K choices, and examining how much better or worse all of the other K − 1 choices are, relative to the choice we are pivoting around. Mathematically, we transform the coefficients as follows:

\boldsymbol{\beta}'_1 = \boldsymbol{\beta}_1 - \boldsymbol{\beta}_K
\cdots
\boldsymbol{\beta}'_{K-1} = \boldsymbol{\beta}_{K-1} - \boldsymbol{\beta}_K
\boldsymbol{\beta}'_K = 0

This leads to the following equations:

\Pr(Y_i = 1) = \frac{e^{\boldsymbol{\beta}'_1 \cdot \mathbf{X}_i}}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol{\beta}'_k \cdot \mathbf{X}_i}}
\cdots
\Pr(Y_i = K-1) = \frac{e^{\boldsymbol{\beta}'_{K-1} \cdot \mathbf{X}_i}}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol{\beta}'_k \cdot \mathbf{X}_i}}
\Pr(Y_i = K) = \frac{1}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol{\beta}'_k \cdot \mathbf{X}_i}}

Other than the prime symbols on the regression coefficients, this is exactly the same as the form of the model described above, in terms of K − 1 independent two-way regressions.
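The pivoting transformation can be checked numerically. A minimal sketch with made-up numbers (the variable names are illustrative): subtracting the last coefficient vector from all of them leaves every probability unchanged, confirming that only the differences between coefficient vectors matter.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 4, 3
B = rng.normal(size=(K, D))      # unconstrained coefficient vectors
x = rng.normal(size=D)

def probs(B, x):
    s = B @ x
    z = np.exp(s - np.max(s))
    return z / z.sum()

B_pivot = B - B[-1]              # beta'_k = beta_k - beta_K, so beta'_K = 0
print(np.allclose(probs(B, x), probs(B_pivot, x)))   # True
```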
As a latent-variable model

It is also possible to formulate multinomial logistic regression as a latent variable model, following the two-way latent variable model described for binary logistic regression. This formulation is common in the theory of discrete choice models, and makes it easier to compare multinomial logistic regression to the related multinomial probit model, as well as to extend it to more complex models.

Imagine that, for each data point i and possible outcome k = 1, 2, ..., K, there is a continuous latent variable Y_{i,k}* (i.e. an unobserved random variable) that is distributed as follows:

Y_{i,1}^{\ast} = \boldsymbol{\beta}_1 \cdot \mathbf{X}_i + \varepsilon_1
Y_{i,2}^{\ast} = \boldsymbol{\beta}_2 \cdot \mathbf{X}_i + \varepsilon_2
\cdots
Y_{i,K}^{\ast} = \boldsymbol{\beta}_K \cdot \mathbf{X}_i + \varepsilon_K

where \varepsilon_k \sim \operatorname{EV}_1(0, 1), i.e. a standard type-1 extreme value distribution.

This latent variable can be thought of as the utility associated with data point i choosing outcome k, where there is some randomness in the actual amount of utility obtained, which accounts for other unmodeled factors that go into the choice. The value of the actual variable Y_i is then determined in a non-random fashion from these latent variables (i.e. the randomness has been moved from the observed outcomes into the latent variables), where outcome k is chosen if and only if the associated utility (the value of Y_{i,k}^{\ast}) is greater than the utilities of all the other choices, i.e. if the utility associated with outcome k is the maximum of all the utilities. Since the latent variables are continuous, the probability of two having exactly the same value is 0, so we ignore the scenario. That is:

\Pr(Y_i = 1) = \Pr(Y_{i,1}^{\ast} > Y_{i,2}^{\ast} \text{ and } Y_{i,1}^{\ast} > Y_{i,3}^{\ast} \text{ and } \cdots \text{ and } Y_{i,1}^{\ast} > Y_{i,K}^{\ast})
\Pr(Y_i = 2) = \Pr(Y_{i,2}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,2}^{\ast} > Y_{i,3}^{\ast} \text{ and } \cdots \text{ and } Y_{i,2}^{\ast} > Y_{i,K}^{\ast})
\cdots
\Pr(Y_i = K) = \Pr(Y_{i,K}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,K}^{\ast} > Y_{i,2}^{\ast} \text{ and } \cdots \text{ and } Y_{i,K}^{\ast} > Y_{i,K-1}^{\ast})

Or equivalently:

\Pr(Y_i = 1) = \Pr(\max(Y_{i,1}^{\ast}, Y_{i,2}^{\ast}, \ldots, Y_{i,K}^{\ast}) = Y_{i,1}^{\ast})
\Pr(Y_i = 2) = \Pr(\max(Y_{i,1}^{\ast}, Y_{i,2}^{\ast}, \ldots, Y_{i,K}^{\ast}) = Y_{i,2}^{\ast})
\cdots
\Pr(Y_i = K) = \Pr(\max(Y_{i,1}^{\ast}, Y_{i,2}^{\ast}, \ldots, Y_{i,K}^{\ast}) = Y_{i,K}^{\ast})
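The equivalence between this latent-utility formulation and the softmax probabilities can be checked by simulation. A minimal Monte Carlo sketch (made-up scores; the type-1 extreme value distribution is the Gumbel distribution, available in numpy as Generator.gumbel): the empirical frequencies with which each alternative attains the maximum utility match the softmax of the scores.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 2
B = rng.normal(size=(K, D))
x = rng.normal(size=D)
scores = B @ x                   # deterministic utility components beta_k . x

# Draw latent utilities: score plus standard Gumbel (EV1) noise, then pick
# the alternative with the maximum utility in each of many replications.
n = 100_000
draws = scores + rng.gumbel(size=(n, K))
freqs = np.bincount(draws.argmax(axis=1), minlength=K) / n

z = np.exp(scores - scores.max())
print(freqs)          # empirical choice frequencies ...
print(z / z.sum())    # ... closely match the softmax probabilities
```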
Let's look more closely at the first equation, which we can write as follows:

\Pr(Y_i = 1) = \Pr(Y_{i,1}^{\ast} > Y_{i,k}^{\ast}\ \forall\, k = 2, \ldots, K)
= \Pr(Y_{i,1}^{\ast} - Y_{i,k}^{\ast} > 0\ \forall\, k = 2, \ldots, K)
= \Pr(\boldsymbol{\beta}_1 \cdot \mathbf{X}_i + \varepsilon_1 - (\boldsymbol{\beta}_k \cdot \mathbf{X}_i + \varepsilon_k) > 0\ \forall\, k = 2, \ldots, K)
= \Pr((\boldsymbol{\beta}_1 - \boldsymbol{\beta}_k) \cdot \mathbf{X}_i > \varepsilon_k - \varepsilon_1\ \forall\, k = 2, \ldots, K)

There are a few things to realize here:

- In general, if X \sim \operatorname{EV}_1(a, b) and Y \sim \operatorname{EV}_1(a, b) then X - Y \sim \operatorname{Logistic}(0, b). That is, the difference of two independent identically distributed extreme-value-distributed variables follows the logistic distribution, where the first parameter is unimportant. This is understandable since the first parameter is a location parameter, i.e. it shifts the mean by a fixed amount, and if two values are both shifted by the same amount, their difference remains the same. This means that all of the relational statements underlying the probability of a given choice involve the logistic distribution, which makes the initial choice of the extreme-value distribution, which seemed rather arbitrary, somewhat more understandable.
- The second parameter in an extreme-value or logistic distribution is a scale parameter, such that if X \sim \operatorname{Logistic}(0, 1) then bX \sim \operatorname{Logistic}(0, b). This means that the effect of using an error variable with an arbitrary scale parameter in place of scale 1 can be compensated simply by multiplying all regression vectors by the same scale. Together with the previous point, this shows that the use of a standard extreme-value distribution (location 0, scale 1) for the error variables entails no loss of generality over using an arbitrary extreme-value distribution. In fact, the model is nonidentifiable (no single set of optimal coefficients) if the more general distribution is used.
- Because only differences of vectors of regression coefficients are used, adding an arbitrary constant to all coefficient vectors has no effect on the model. This means that, just as in the log-linear model, only K − 1 of the coefficient vectors are identifiable, and the last one can be set to an arbitrary value (e.g. 0).

Actually finding the values of the above probabilities is somewhat difficult, and is a problem of computing a particular order statistic (the first, i.e. maximum) of a set of values. However, it can be shown that the resulting expressions are the same as in the above formulations, i.e. the two are equivalent.

Estimation of intercept

When using multinomial logistic regression, one category of the dependent variable is chosen as the reference category. Separate odds ratios are determined for all independent variables for each category of the dependent variable with the exception of the reference category, which is omitted from the analysis. The exponential beta coefficient represents the change in the odds of the dependent variable being in a particular category vis-à-vis the reference category, associated with a one-unit change of the corresponding independent variable.
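As a small numeric illustration of this interpretation (the coefficient values below are hypothetical, not from any fitted model):

```python
import numpy as np

# Hypothetical fitted coefficients for category k versus the reference
# category, in the pivot parameterization: [intercept, x1, x2].
beta_k = np.array([-0.7, 0.4, -1.1])

odds_ratios = np.exp(beta_k)
# e.g. exp(0.4) ≈ 1.49: a one-unit increase in x1 multiplies the odds of
# being in category k rather than the reference category by about 1.49,
# holding the other variables fixed.
print(odds_ratios)
```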
Application in natural language processing

In natural language processing, multinomial LR classifiers are commonly used as an alternative to naive Bayes classifiers because they do not assume statistical independence of the random variables (commonly known as features) that serve as predictors. However, learning in such a model is slower than for a naive Bayes classifier, and thus may not be appropriate given a very large number of classes to learn. In particular, learning in a naive Bayes classifier is a simple matter of counting up the number of co-occurrences of features and classes, while in a maximum entropy classifier the weights, which are typically maximized using maximum a posteriori (MAP) estimation, must be learned using an iterative procedure; see the section on estimating the coefficients above.

See also

- Logistic regression
- Multinomial probit

References

1. Greene, William H. (2012). Econometric Analysis (Seventh ed.). Boston: Pearson Education. pp. 803–806. ISBN 978-0-273-75356-8.
2. Engel, J. (1988). "Polytomous logistic regression". Statistica Neerlandica. 42 (4): 233–252. doi:10.1111/j.1467-9574.1988.tb01238.x.
3. Menard, Scott (2002). Applied Logistic Regression Analysis. SAGE. p. 91.
4. Malouf, Robert (2002). "A comparison of algorithms for maximum entropy parameter estimation". Sixth Conf. on Natural Language Learning (CoNLL). pp. 49–55.
5. Belsley, David (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. New York: Wiley. ISBN 9780471528890.
6. Baltas, G.; Doyle, P. (2001). "Random Utility Models in Marketing Research: A Survey". Journal of Business Research. 51 (2): 115–125. doi:10.1016/S0148-2963(99)00058-2.
7. Darroch, J. N.; Ratcliff, D. (1972). "Generalized iterative scaling for log-linear models". The Annals of Mathematical Statistics. 43 (5): 1470–1480. doi:10.1214/aoms/1177692379.
8. Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. pp. 206–209.
9. Yu, Hsiang-Fu; Huang, Fang-Lan; Lin, Chih-Jen (2011). "Dual coordinate descent methods for logistic regression and maximum entropy models". Machine Learning. 85 (1–2): 41–75. doi:10.1007/s10994-010-5221-8.