fitdistrplus/NAMESPACE

### Imports
import(stats)
importFrom("survival", Surv, survfit)
importFrom("MASS", kde2d)
importFrom("grDevices", colorRampPalette, terrain.colors)
importFrom("graphics", abline, axis, hist, hist.default, legend, lines,
  par, plot, points, polygon, contour, layout, matlines, segments,
  stripchart, image, pairs, rect, text)
importFrom("utils", head, stack, modifyList, tail)
importFrom("methods", formalArgs)

### Exports

#fitdist class
export(fitdist)
S3method(summary, fitdist)
S3method(plot, fitdist)
S3method(print, fitdist)
S3method(print, summary.fitdist)
S3method(print, quantile.fitdist)
S3method(print, gofstat.fitdist)
S3method(quantile, fitdist)
S3method(logLik, fitdist)
S3method(vcov, fitdist)
S3method(coef, fitdist)

#fitdistcens class
export(fitdistcens)
S3method(summary, fitdistcens)
S3method(plot, fitdistcens)
S3method(print, fitdistcens)
S3method(print, summary.fitdistcens)
S3method(print, quantile.fitdistcens)
S3method(quantile, fitdistcens)
S3method(logLik, fitdistcens)
S3method(vcov, fitdistcens)
S3method(coef, fitdistcens)
export(Surv2fitdistcens)

#bootdist class
export(bootdist)
export(CIcdfplot)
S3method(summary, bootdist)
S3method(plot, bootdist)
S3method(print, bootdist)
S3method(print, summary.bootdist)
S3method(print, quantile.bootdist)
S3method(quantile, bootdist)

#bootdistcens class
export(bootdistcens)
S3method(summary, bootdistcens)
S3method(plot, bootdistcens)
S3method(print, bootdistcens)
S3method(print, summary.bootdistcens)
S3method(print, quantile.bootdistcens)
S3method(quantile, bootdistcens)

#descdist class
export(descdist)
S3method(print, descdist)

#new generic functions [to be removed]
#export(loglik)
#S3method(loglik, default)

#dedicated fit functions
export(mledist, mmedist, qmedist, mgedist, msedist)
export(detectbound, prefit)

#plot functions
export(plotdist, plotdistcens)
export(llplot, llsurface, llcurve)

#graphical comparison
export(cdfcomp, denscomp, qqcomp, ppcomp)
export(cdfcompcens, qqcompcens, ppcompcens)

#quantile functions [to be removed]
#export(quantiledist)
#export(quantiledistcens)

#other functions
export(gofstat)

### use dynamic library for C code
#useDynLib(fitdistrplus, .registration = TRUE)

fitdistrplus/README.md

# Help to Fit of a Parametric Distribution to Non-Censored or Censored Data

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/fitdistrplus)](https://cran.r-project.org/package=fitdistrplus)
[![CRAN Downloads](https://cranlogs.r-pkg.org/badges/fitdistrplus)](https://cran.r-project.org/package=fitdistrplus)
[![R-CMD-check](https://github.com/aursiber/fitdistrplus/workflows/R-CMD-check/badge.svg)](https://github.com/aursiber/fitdistrplus/actions)

`fitdistrplus` extends the `fitdistr()` function (of the `MASS` package) with several functions to help the fit of a parametric distribution to non-censored or censored data. Censored data may contain left-censored, right-censored and interval-censored values, with several lower and upper bounds. In addition to maximum likelihood estimation (MLE), the package provides moment matching (MME), quantile matching (QME) and maximum goodness-of-fit estimation (MGE) methods; the last three are available only for non-censored data. Weighted versions of MLE, MME and QME are available.
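As a quick illustration of a typical workflow, here is a minimal sketch on simulated data (the gamma sample and its parameter values are arbitrary, chosen only for this example):

```r
library(fitdistrplus)

# simulate a positive-valued sample (arbitrary gamma parameters)
set.seed(1234)
x <- rgamma(100, shape = 2, rate = 0.5)

# fit a gamma distribution by maximum likelihood (the default method) ...
fit.mle <- fitdist(x, "gamma", method = "mle")
# ... and, for comparison, by moment matching
fit.mme <- fitdist(x, "gamma", method = "mme")

summary(fit.mle)  # parameter estimates, standard errors, AIC and BIC
plot(fit.mle)     # density, CDF, Q-Q and P-P diagnostic plots
```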
## The package The stable version of `fitdistrplus` can be installed from CRAN using: ```r install.packages("fitdistrplus") ``` The development version of `fitdistrplus` can be installed from GitHub (`remotes` needed): ```r if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes") remotes::install_github("aursiber/fitdistrplus") ``` Finally load the package in your current R session with the following R command: ```r library(fitdistrplus) ``` ## Documentation Three **vignettes** are attached to the `fitdistrplus` package: - Overview of the fitdistrplus package - Which optimization algorithm to choose? - Frequently Asked Questions ## Authors & Contacts Issues can be reported on https://github.com/aursiber/fitdistrplus/issues. - Marie-Laure Delignette-Muller: marielaure.delignettemuller@vetagro-sup.fr - Christophe Dutang: dutangc@gmail.com - Aurélie Siberchicot: aurelie.siberchicot@univ-lyon1.fr ## Citation If you use `fitdistrplus`, you should cite:
Marie Laure Delignette-Muller, Christophe Dutang (2015). *fitdistrplus: An R Package for Fitting Distributions.* Journal of Statistical Software, 64(4), 1-34. https://www.jstatsoft.org/article/view/v064i04 DOI 10.18637/jss.v064.i04.

fitdistrplus/data/salinity.rda [binary R data file: contents omitted]
fitdistrplus/data/dataFAQscale2.rda [binary R data file: contents omitted]
fitdistrplus/data/danishmulti.rda [binary R data file: contents omitted]
fitdistrplus/data/danishuni.rda [binary R data file: contents omitted]
fitdistrplus/data/toxocara.rda [binary R data file: contents omitted]
fitdistrplus/data/fremale.rda [binary R data file: contents omitted]
fitdistrplus/data/smokedfish.rda [binary R data file: contents omitted]
fitdistrplus/data/groundbeef.rda [binary R data file: contents omitted]
fitdistrplus/data/endosulfan.rda [binary R data file: contents omitted]
fitdistrplus/data/fluazinam.rda [binary R data file: contents omitted]
fitdistrplus/data/dataFAQscale1.rda [binary R data file: contents omitted]

fitdistrplus/man/llplot.Rd [beginning of file lost in the binary stream above; the surviving fragment of its \details section follows]
} \seealso{ See \code{\link{llsurface}} and \code{\link{llcurve}} for manual (log)likelihood plots (surface ou curve) and \code{\link{plot}}, \code{\link{contour}}, \code{\link{image}} for classic plotting functions. } \references{ Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34. } \author{ Marie-Laure Delignette-Muller and Christophe Dutang. } \examples{ # (1) a distribution with one parameter # x <- rexp(50) fite <- fitdist(x, "exp") llplot(fite) llplot(fite, col = "red", fit.show = TRUE) llplot(fite, col = "red", fit.show = TRUE, loglik = FALSE) # (2) a distribution with two parameters # data(groundbeef) serving <- groundbeef$serving fitg <- fitdist(serving, "gamma") llplot(fitg) \donttest{ llplot(fitg, expansion = 2) llplot(fitg, pal.col = heat.colors(100), fit.show = TRUE) llplot(fitg, back.col = FALSE, nlev = 25, fit.show = TRUE) } # (3) a distribution with two parameters with one fixed # fitg2 <- fitdist(serving, "gamma", fix.arg = list(rate = 0.5)) llplot(fitg2, fit.show = TRUE) # (4) a distribution with three parameters # \donttest{ data(endosulfan) ATV <-endosulfan$ATV library("actuar") fBurr <- fitdist(ATV, "burr", start = list(shape1 = 0.3, shape2 = 1, rate = 1)) llplot(fBurr) llplot(fBurr, back.col = FALSE, fit.show = TRUE, fit.pch = 16) llplot(fBurr, nlev = 0, pal.col = rainbow(100), lseq = 100) } # (5) a distribution with two parameters fitted on censored data # data(salinity) fsal <- fitdistcens(salinity, "lnorm") llplot(fsal, fit.show = TRUE) llplot(fsal, fit.show = TRUE, loglik = FALSE) } \keyword{ distribution } fitdistrplus/man/groundbeef.Rd0000644000176200001440000000161114213576417016201 0ustar liggesusers\name{groundbeef} \alias{groundbeef} \docType{data} \title{Ground beef serving size data set} \description{Serving sizes collected in a French survey, for ground beef patties consumed by children under 5 years old.} \usage{data(groundbeef)} \format{\code{groundbeef} is a data frame with 1 column (serving: serving sizes in grams)} \source{ Delignette-Muller, M.L., Cornu, M. 2008. Quantitative risk assessment for \emph{Escherichia coli} O157:H7 in frozen ground beef patties consumed by young children in French households. \emph{International Journal of Food Microbiology}, \bold{128}, 158-164.} \examples{ # (1) load of data # data(groundbeef) # (2) description and plot of data # serving <- groundbeef$serving descdist(serving) plotdist(serving) # (3) fit of a Weibull distribution to data # fitW <- fitdist(serving, "weibull") summary(fitW) plot(fitW) gofstat(fitW) } \keyword{datasets} fitdistrplus/man/CIcdfplot.Rd0000644000176200001440000002466414212104322015722 0ustar liggesusers\name{CIcdfplot} \alias{CIcdfplot} \title{Empirical cumulative distribution function with pointwise confidence intervals on probabilities or on quantiles} \description{ \code{cdfband} plots the empirical cumulative distribution function with the bootstraped pointwise confidence intervals on probabilities of on quantiles. 
}
\usage{
CIcdfplot(b, CI.output, CI.type = "two.sided", CI.level = 0.95, CI.col = "red",
  CI.lty = 2, CI.fill = NULL, CI.only = FALSE, xlim, ylim, xlogscale = FALSE,
  ylogscale = FALSE, main, xlab, ylab, datapch, datacol, fitlty, fitcol, fitlwd,
  horizontals = TRUE, verticals = FALSE, do.points = TRUE, use.ppoints = TRUE,
  a.ppoints = 0.5, name.points = NULL, lines01 = FALSE,
  plotstyle = "graphics", \dots)
}
\arguments{
\item{b}{One \code{"bootdist"} object.}

\item{CI.output}{The quantity on which bootstrapped confidence intervals are computed:
either \code{"probability"} or \code{"quantile"}.}

\item{CI.type}{Type of confidence intervals: either \code{"two.sided"} or one-sided
intervals (\code{"less"} or \code{"greater"}).}

\item{CI.level}{The confidence level.}

\item{CI.col}{the color of the confidence intervals.}

\item{CI.lty}{the line type of the confidence intervals.}

\item{CI.fill}{a color to fill the confidence area. Default is \code{NULL}
corresponding to no filling.}

\item{CI.only}{A logical indicating whether to plot empirical and fitted distribution
functions or only the confidence intervals. Default to \code{FALSE}.}

\item{xlim}{The \eqn{x}-limits of the plot.}

\item{ylim}{The \eqn{y}-limits of the plot.}

\item{xlogscale}{If \code{TRUE}, uses a logarithmic scale for the \eqn{x}-axis.}

\item{ylogscale}{If \code{TRUE}, uses a logarithmic scale for the \eqn{y}-axis.}

\item{main}{A main title for the plot, see also \code{\link{title}}.}

\item{xlab}{A label for the \eqn{x}-axis, defaults to a description of \code{x}.}

\item{ylab}{A label for the \eqn{y}-axis, defaults to a description of \code{y}.}

\item{datapch}{An integer specifying a symbol to be used in plotting data points,
see also \code{\link{points}} (only for non censored data).}

\item{datacol}{A specification of the color to be used in plotting data points.}

\item{fitcol}{A (vector of) color(s) to plot fitted distributions.
If there are fewer colors than fits they are recycled in the standard fashion.}

\item{fitlty}{A (vector of) line type(s) to plot fitted distributions/densities.
If there are fewer values than fits they are recycled in the standard fashion.
See also \code{\link{par}}.}

\item{fitlwd}{A (vector of) line width(s) to plot fitted distributions/densities.
If there are fewer values than fits they are recycled in the standard fashion.
See also \code{\link{par}}.}

\item{horizontals}{If \code{TRUE}, draws horizontal lines for the step empirical
cdf function (only for non censored data). See also \code{\link{plot.stepfun}}.}

\item{verticals}{If \code{TRUE}, draws also vertical lines for the empirical cdf function.
Only taken into account if \code{horizontals=TRUE} (only for non censored data).}

\item{do.points}{logical; if \code{TRUE}, also draw points at the x-locations.
Default is \code{TRUE} (only for non censored data).}

\item{use.ppoints}{If \code{TRUE}, probability points of the empirical distribution
are defined using function \code{\link{ppoints}} as
\code{(1:n - a.ppoints)/(n - 2a.ppoints + 1)} (only for non censored data).
If \code{FALSE}, probability points are simply defined as \code{(1:n)/n}.
This argument is ignored for discrete data.}

\item{a.ppoints}{If \code{use.ppoints=TRUE}, this is passed to function
\code{\link{ppoints}} (only for non censored data).}

\item{name.points}{Label vector for points if they are drawn, i.e.
if \code{do.points = TRUE} (only for non censored data).}

\item{lines01}{A logical to plot two horizontal lines at \code{h=0} and \code{h=1}
for \code{cdfcomp}.}

\item{plotstyle}{\code{"graphics"} or \code{"ggplot"}.
If \code{"graphics"}, the display is built with \code{\link{graphics}} functions.
If \code{"ggplot"}, a graphic object output is created with \code{ggplot2} functions
(the \code{ggplot2} package must be installed).}

\item{\dots}{Further graphical arguments passed to \code{matlines} or \code{polygon},
respectively when \code{CI.fill=FALSE} and \code{CI.fill=TRUE}.}
}
\details{
\code{CIcdfplot} provides a plot of the empirical distribution using
\code{\link{cdfcomp}} or \code{\link{cdfcompcens}}, with bootstrapped pointwise
confidence intervals on probabilities (y values) or on quantiles (x values).
Each interval is computed by evaluating the quantity of interest (probability
associated with an x value or quantile associated with a y value) using all the
bootstrapped values of parameters to get a bootstrapped sample of the quantity of
interest, and then by calculating percentiles on this sample to get a confidence
interval (classically 2.5 and 97.5 percentiles for a 95 percent confidence level).
If \code{CI.fill != NULL}, then the whole confidence area is filled by the color
\code{CI.fill} using the function \code{polygon}; otherwise only borders are drawn
using the function \code{matlines}. Further graphical arguments can be passed to
these functions using the three-dots argument \code{\dots}.
}
\seealso{
See also \code{\link{cdfcomp}}, \code{\link{cdfcompcens}},
\code{\link{bootdist}} and \code{\link{quantile}}.
}
\references{
Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}.
Journal of Statistical Software, 64(4), 1-34.
}
\author{
Christophe Dutang and Marie-Laure Delignette-Muller.
}
\examples{
# We choose a low number of bootstrap replicates in order to satisfy CRAN running times
# constraint.
# For practical applications, we recommend using at least niter=501 or niter=1001.

ggplotEx <- requireNamespace("ggplot2", quietly = TRUE)

# (1) Fit of an exponential distribution
#
set.seed(123)
s1 <- rexp(50, 1)
f1 <- fitdist(s1, "exp")
b1 <- bootdist(f1, niter = 11) # voluntarily low to decrease computation time

# plot 95 percent bilateral confidence intervals on y values (probabilities)
CIcdfplot(b1, CI.level = 95/100, CI.output = "probability")
if (ggplotEx) CIcdfplot(b1, CI.level = 95/100, CI.output = "probability",
  plotstyle = "ggplot")

\donttest{
# plot of the previous intervals as a band
CIcdfplot(b1, CI.level = 95/100, CI.output = "probability",
  CI.fill = "pink", CI.col = "red")
if (ggplotEx) CIcdfplot(b1, CI.level = 95/100, CI.output = "probability",
  CI.fill = "pink", CI.col = "red", plotstyle = "ggplot")

# plot of the previous intervals as a band without empirical and fitted dist. functions
CIcdfplot(b1, CI.level = 95/100, CI.output = "probability", CI.only = TRUE,
  CI.fill = "pink", CI.col = "red")
if (ggplotEx) CIcdfplot(b1, CI.level = 95/100, CI.output = "probability", CI.only = TRUE,
  CI.fill = "pink", CI.col = "red", plotstyle = "ggplot")

# same plot without contours
CIcdfplot(b1, CI.level = 95/100, CI.output = "probability", CI.only = TRUE,
  CI.fill = "pink", CI.col = "pink")
if (ggplotEx) CIcdfplot(b1, CI.level = 95/100, CI.output = "probability", CI.only = TRUE,
  CI.fill = "pink", CI.col = "pink", plotstyle = "ggplot")

# plot 95 percent bilateral confidence intervals on x values (quantiles)
CIcdfplot(b1, CI.level = 95/100, CI.output = "quantile")
if (ggplotEx) CIcdfplot(b1, CI.level = 95/100, CI.output = "quantile",
  plotstyle = "ggplot")

# plot 95 percent unilateral confidence intervals on quantiles
CIcdfplot(b1, CI.level = 95/100, CI.output = "quant", CI.type = "less",
  CI.fill = "grey80", CI.col = "black", CI.lty = 1)
if (ggplotEx) CIcdfplot(b1, CI.level = 95/100, CI.output = "quant", CI.type = "less",
  CI.fill = "grey80", CI.col = "black", CI.lty = 1, plotstyle = "ggplot")

CIcdfplot(b1, CI.level = 95/100, CI.output = "quant", CI.type = "greater",
  CI.fill = "grey80", CI.col = "black", CI.lty = 1)
if (ggplotEx) CIcdfplot(b1, CI.level = 95/100, CI.output = "quant", CI.type = "greater",
  CI.fill = "grey80", CI.col = "black", CI.lty = 1, plotstyle = "ggplot")

# (2) Fit of a normal distribution on acute toxicity log-transformed values of
# endosulfan for nonarthropod invertebrates, using maximum likelihood estimation
# to estimate what is called a species sensitivity distribution
# (SSD) in ecotoxicology, followed by estimation of the 5, 10 and 20 percent quantile
# values of the fitted distribution, which are called the 5, 10, 20 percent hazardous
# concentrations (HC5, HC10, HC20) in ecotoxicology, with their
# confidence intervals, from a small number of bootstrap
# iterations to satisfy CRAN running times constraint, and plot of the band
# representing pointwise confidence intervals on any quantiles (any HCx values).
# For practical applications, we recommend using at least niter=501 or niter=1001.
#
data(endosulfan)
log10ATV <- log10(subset(endosulfan, group == "NonArthroInvert")$ATV)
namesATV <- subset(endosulfan, group == "NonArthroInvert")$taxa
fln <- fitdist(log10ATV, "norm")
bln <- bootdist(fln, bootmethod = "param", niter = 101)
quantile(bln, probs = c(0.05, 0.1, 0.2))
CIcdfplot(bln, CI.output = "quantile", CI.fill = "lightblue", CI.col = "blue",
  xlim = c(0, 5), name.points = namesATV)
if (ggplotEx) CIcdfplot(bln, CI.output = "quantile", CI.fill = "lightblue", CI.col = "blue",
  xlim = c(0, 5), name.points = namesATV, plotstyle = "ggplot")

# (3) Same type of example as example (2) from ecotoxicology
# with censored data
#
data(salinity)
log10LC50 <- log10(salinity)
fln <- fitdistcens(log10LC50, "norm")
bln <- bootdistcens(fln, niter = 101)
(HC5ln <- quantile(bln, probs = 0.05))
CIcdfplot(bln, CI.output = "quantile", CI.fill = "lightblue", CI.col = "blue",
  xlab = "log10(LC50)", xlim = c(0.5, 2), lines01 = TRUE)
if (ggplotEx) CIcdfplot(bln, CI.output = "quantile", CI.fill = "lightblue", CI.col = "blue",
  xlab = "log10(LC50)", xlim = c(0.5, 2), lines01 = TRUE, plotstyle = "ggplot")

# zoom around the HC5
CIcdfplot(bln, CI.output = "quantile", CI.fill = "lightblue", CI.col = "blue",
  xlab = "log10(LC50)", lines01 = TRUE, xlim = c(0.8, 1.5), ylim = c(0, 0.1))
abline(h = 0.05, lty = 2) # line corresponding to a CDF of 5 percent
if (ggplotEx) CIcdfplot(bln, CI.output = "quantile", CI.fill = "lightblue", CI.col = "blue",
  xlab = "log10(LC50)", lines01 = TRUE, xlim = c(0.8, 1.5), ylim = c(0, 0.1),
  plotstyle = "ggplot") +
  ggplot2::geom_hline(yintercept = 0.05, lty = 2) # line corresponding to a CDF of 5 percent
}
}
\keyword{ distribution }

fitdistrplus/man/smokedfish.Rd

\name{smokedfish}
\alias{smokedfish}
\docType{data}
\title{Contamination data of Listeria monocytogenes in smoked fish}
\description{Contamination data of \emph{Listeria monocytogenes} in smoked fish on the Belgian market in the period 2005 to 2007.}
\usage{data(smokedfish)}
\format{
\code{smokedfish} is a data frame with 2 columns named left and right, describing
each observed value of \emph{Listeria monocytogenes} concentration (in CFU/g) as an interval.
The left column contains either NA for left censored observations, the left bound of
the interval for interval censored observations, or the observed value for non-censored
observations. The right column contains either NA for right censored observations, the
right bound of the interval for interval censored observations, or the observed value
for non-censored observations.
}
\source{
Busschaert, P., Geeraerd, A.H., Uyttendaele, M., Van Impe, J.F., 2010.
Estimating distributions out of qualitative and (semi) quantitative microbiological
contamination data for use in risk assessment. \emph{International Journal of Food Microbiology}.
\bold{138}, 260-269.}
\examples{
# (1) load of data
#
data(smokedfish)

# (2) plot of data in CFU/g
#
plotdistcens(smokedfish)

# (3) plot of transformed data in log10[CFU/g]
#
Clog10 <- log10(smokedfish)
plotdistcens(Clog10)

# (4) Fit of a normal distribution to data in log10[CFU/g]
#
fitlog10 <- fitdistcens(Clog10, "norm")
summary(fitlog10)
plot(fitlog10)
}
\keyword{datasets}

fitdistrplus/man/salinity.Rd

\name{salinity}
\alias{salinity}
\docType{data}
\title{Species-Sensitivity Distribution (SSD) for salinity tolerance}
\description{72-hour acute salinity tolerance (LC50 values) of riverine macro-invertebrates.}
\usage{data(salinity)}
\format{
\code{salinity} is a data frame with 2 columns named left and right, describing
each observed LC50 value (in electrical conductivity, millisiemens per centimeter)
as an interval. The left column contains either NA for left censored observations,
the left bound of the interval for interval censored observations, or the observed
value for non-censored observations. The right column contains either NA for right
censored observations, the right bound of the interval for interval censored
observations, or the observed value for non-censored observations.
}
\source{
Kefford, B.J., Nugegoda, D., Metzeling, L., Fields, E. 2006. Validating species
sensitivity distributions using salinity tolerance of riverine macroinvertebrates in
the southern Murray-Darling Basin (Victoria, Australia). \emph{Canadian Journal of
Fisheries and Aquatic Sciences}, \bold{63}, 1865-1877.}
\examples{
# (1) load of data
#
data(salinity)

# (2) plot of data using Turnbull cdf plot
#
log10LC50 <- log10(salinity)
plotdistcens(log10LC50)

# (3) fit of a normal and a logistic distribution to data in log10
# (classical distributions used for species sensitivity
# distributions, SSD, in ecotoxicology)
# and visual comparison of the fits using Turnbull cdf plot
#
fln <- fitdistcens(log10LC50, "norm")
summary(fln)

fll <- fitdistcens(log10LC50, "logis")
summary(fll)

cdfcompcens(list(fln, fll), legendtext = c("normal", "logistic"),
  xlab = "log10(LC50)", xlim = c(0.5, 2), lines01 = TRUE)

# (4) estimation of the 5 percent quantile value of
# the normal fitted distribution (5 percent hazardous concentration: HC5)
# with its two-sided 95 percent confidence interval calculated by
# non-parametric bootstrap
# from a small number of bootstrap iterations to satisfy CRAN running times constraint.
# For practical applications, we recommend using at least niter=501 or niter=1001.
#
# in log10(LC50)
bln <- bootdistcens(fln, niter = 101)
HC5ln <- quantile(bln, probs = 0.05)
# in LC50
10^(HC5ln$quantiles)
10^(HC5ln$quantCI)

# (5) estimation of the HC5 value
# with its one-sided 95 percent confidence interval (type "greater")
#
# in log10(LC50)
HC5lnb <- quantile(bln, probs = 0.05, CI.type = "greater")
# in LC50
10^(HC5lnb$quantiles)
10^(HC5lnb$quantCI)
}
\keyword{datasets}

fitdistrplus/man/plotdistcens.Rd

\name{plotdistcens}
\alias{plotdistcens}
\title{Plot of empirical and theoretical distributions for censored data}
\description{
 Plots an empirical distribution for censored data with a theoretical one if specified.
}
\usage{
plotdistcens(censdata, distr, para, leftNA = -Inf, rightNA = Inf, NPMLE = TRUE,
  Turnbull.confint = FALSE, NPMLE.method = "Wang", \dots)
}
\arguments{
\item{censdata}{A dataframe of two columns respectively named \code{left}
and \code{right}, describing each observed value as an interval.
The \code{left} column contains either \code{NA} for left censored observations,
the left bound of the interval for interval censored observations,
or the observed value for non-censored observations.
The \code{right} column contains either \code{NA} for right censored observations,
the right bound of the interval for interval censored observations,
or the observed value for non-censored observations.}

\item{distr}{A character string \code{"name"} naming a distribution, for which the
corresponding density function \code{dname} and the corresponding distribution function
\code{pname} must be defined, or directly the density function.}

\item{para}{A named list giving the parameters of the named distribution. This argument
may be omitted only if \code{distr} is omitted.}

\item{leftNA}{the real value of the left bound of left censored observations:
\code{-Inf} or a finite value such as \code{0} for positive data for example.}

\item{rightNA}{the real value of the right bound of right censored observations:
\code{Inf} or a finite value such as a realistic maximum value.}

\item{NPMLE}{if \code{TRUE}, an NPMLE (nonparametric maximum likelihood estimate)
technique is used to estimate the cdf curve of the censored data, and the previous
arguments \code{leftNA} and \code{rightNA} are not used (see details).}

\item{Turnbull.confint}{if \code{TRUE}, confidence intervals will be added to the
Turnbull plot. In that case NPMLE.method is forced to \code{"Turnbull.middlepoints"}.}

\item{NPMLE.method}{Three NPMLE techniques are provided: \code{"Wang"}, the default one,
rewritten from the package npsurv using function constrOptim from the package stats
for optimisation; \code{"Turnbull.middlepoints"}, an older one which is implemented in
the package survival; and \code{"Turnbull.intervals"}, which uses the same Turnbull
algorithm from the package survival but associates an interval to each equivalence class
instead of the middlepoint of this interval (see details). Only \code{"Wang"} and
\code{"Turnbull.intervals"} enable the derivation of a Q-Q plot and a P-P plot.}

\item{\dots}{further graphical arguments passed to other methods. The title of the plot
can be modified using the argument \code{main} only for the CDF plot.}
}
\details{
If \code{NPMLE} is \code{TRUE}, and \code{NPMLE.method} is \code{"Wang"}, empirical
distributions are plotted in cdf using either the constrained Newton method (Wang, 2008)
or the hierarchical constrained Newton method (Wang, 2013) to compute the overall
empirical cdf curve. If \code{NPMLE} is \code{TRUE}, and \code{NPMLE.method} is
\code{"Turnbull.intervals"}, the empirical distribution is plotted in cdf using the EM
approach of Turnbull (Turnbull, 1974). In those two cases, grey rectangles represent
areas where the empirical distribution function is not unique. In cases where a
theoretical distribution is specified, two goodness-of-fit plots are also provided:
a Q-Q plot (plot of the quantiles of the theoretical fitted distribution (x-axis)
against the empirical quantiles of the data) and a P-P plot (i.e. for each value of the
data set, plot of the cumulative distribution function of the fitted distribution
(x-axis) against the empirical cumulative distribution function (y-axis)).
Grey rectangles in a Q-Q plot or a P-P plot also represent areas of
non-uniqueness of empirical quantiles or probabilities, directly derived from
non-uniqueness areas of the empirical cumulative distribution.

If \code{NPMLE} is \code{TRUE}, and \code{NPMLE.method} is \code{"Turnbull.middlepoints"},
empirical and, if specified, theoretical distributions are plotted in cdf using the EM
approach of Turnbull (Turnbull, 1974) to compute the overall empirical cdf curve,
with confidence intervals if \code{Turnbull.confint} is \code{TRUE}, by calls to functions
\code{\link{survfit}} and \code{\link{plot.survfit}} from the \code{survival} package.

If \code{NPMLE} is \code{FALSE}, empirical and, if specified, theoretical distributions
are plotted in cdf, with data directly reported as segments for interval, left and
right censored data, and as points for non-censored data. Before plotting, observations
are ordered and a rank r is associated with each of them. Left censored observations
are ordered first, by their right bounds. Interval censored and non censored observations
are then ordered by their mid-points and, at last, right censored observations are
ordered by their left bounds. If \code{leftNA} (resp. \code{rightNA}) is finite,
left censored (resp. right censored) observations are considered as interval censored
observations and ordered by mid-points with non-censored and interval censored data.
It is sometimes necessary to fix \code{rightNA} or \code{leftNA} to a realistic extreme
value, even if not exactly known, to obtain a reasonable global ranking of observations.
After ranking, each of the n observations is plotted as a point (one x-value) or a
segment (an interval of possible x-values), with a y-value equal to r/n, r being the
rank of each observation in the global ordering previously described. This second method
may be interesting but is certainly less rigorous than the other methods, which should
be preferred.
}
\seealso{
\code{\link{plotdist}}, \code{\link{survfit.formula}}.
}
\references{
Turnbull BW (1974), \emph{Nonparametric estimation of a survivorship function with
doubly censored data}. Journal of the American Statistical Association, 69, 169-173.

Wang Y (2008), \emph{Dimension-reduced nonparametric maximum likelihood computation
for interval-censored data}. Computational Statistics & Data Analysis, 52, 2388-2402.

Wang Y and Taylor SM (2013), \emph{Efficient computation of nonparametric survival
functions via a hierarchical mixture formulation}. Statistics and Computing, 23, 713-725.

Wang Y and Fani S (2018), \emph{Nonparametric maximum likelihood computation of a
U-shaped hazard function}. Statistics and Computing, 28(1), 187-200.

Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting
Distributions}. Journal of Statistical Software, 64(4), 1-34.
}
\author{
Marie-Laure Delignette-Muller and Christophe Dutang.
} \examples{ # (1) Plot of an empirical censored distribution (censored data) as a CDF # using the default Wang method # data(smokedfish) d1 <- as.data.frame(log10(smokedfish)) plotdistcens(d1) # (2) Add the CDF of a normal distribution # plotdistcens(d1, "norm", para=list(mean = -1.6, sd = 1.5)) # (3) Various plots of the same empirical distribution # # default Wang plot with representation of equivalence classes plotdistcens(d1, NPMLE = TRUE, NPMLE.method = "Wang") # same plot but using the Turnbull algorithm from the package survival plotdistcens(d1, NPMLE = TRUE, NPMLE.method = "Turnbull.intervals") # Turnbull plot with middlepoints (as in the package survival) plotdistcens(d1, NPMLE = TRUE, NPMLE.method = "Turnbull.middlepoints") # Turnbull plot with middlepoints and confidence intervals plotdistcens(d1, NPMLE = TRUE, NPMLE.method = "Turnbull.middlepoints", Turnbull.confint = TRUE) # with intervals and points plotdistcens(d1, rightNA=3, NPMLE = FALSE) # with intervals and points # defining a minimum value for left censored values plotdistcens(d1, leftNA=-3, NPMLE = FALSE) } \keyword{ distribution } fitdistrplus/man/gofstat.Rd0000644000176200001440000001776613742313702015536 0ustar liggesusers\name{gofstat} \alias{gofstat} \alias{print.gofstat.fitdist} \title{ Goodness-of-fit statistics} \description{ Computes goodness-of-fit statistics for parametric distributions fitted to the same non-censored data set. } \usage{ gofstat(f, chisqbreaks, meancount, discrete, fitnames=NULL) \method{print}{gofstat.fitdist}(x, \dots) } \arguments{ \item{f}{ An object of class \code{"fitdist"}, output of the function \code{fitdist}, or a list of \code{"fitdist"} objects.} \item{chisqbreaks}{ A numeric vector defining the breaks of the cells used to compute the chi-squared statistic. If omitted, these breaks are automatically computed from the data in order to reach roughly the same number of observations per cell, roughly equal to the argument \code{meancount}, or slightly more if there are some ties. } \item{meancount}{ The mean number of observations per cell expected for the definition of the breaks of the cells used to compute the chi-squared statistic. This argument will not be taken into account if the breaks are directly defined in the argument \code{chisqbreaks}. If \code{chisqbreaks} and \code{meancount} are both omitted, \code{meancount} is fixed in order to obtain roughly \eqn{(4n)^{2/5}} cells with \eqn{n} the length of the dataset. } \item{discrete}{If \code{TRUE}, only the Chi-squared statistic and information criteria are computed. If missing, \code{discrete} is passed from the first object of class \code{"fitdist"} of the list \code{f}.} \item{fitnames}{A vector defining the names of the fits.} \item{x}{An object of class \code{"gofstat.fitdist"}. } \item{\dots}{Further arguments to be passed to generic functions.} } \details{ Goodness-of-fit statistics are computed. The Chi-squared statistic is computed using cells defined by the argument \code{chisqbreaks} or cells automatically defined from data, in order to reach roughly the same number of observations per cell, roughly equal to the argument \code{meancount}, or slightly more if there are some ties. The choice to define cells from the empirical distribution (data), and not from the theoretical distribution, was made to enable the comparison of Chi-squared values obtained with different distributions fitted on the same data set.
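As a short illustration (a minimal sketch based on the serving size example given below; the break values and the mean count are chosen arbitrarily), cells can either be supplied explicitly or controlled through the expected mean count per cell:
\preformatted{
data(groundbeef)
fitg <- fitdist(groundbeef$serving, "gamma")
# cells defined by explicit breaks
gofstat(fitg, chisqbreaks = c(50, 75, 100, 125, 150))
# cells defined from a target mean number of observations per cell
gofstat(fitg, meancount = 20)
}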
If \code{chisqbreaks} and \code{meancount} are both omitted, \code{meancount} is fixed in order to obtain roughly \eqn{(4n)^{2/5}} cells, with \eqn{n} the length of the data set (Vose, 2000). The Chi-squared statistic is not computed if the program fails to define enough cells due to a too small dataset. When the Chi-squared statistic is computed, and if the degrees of freedom (number of cells - number of parameters - 1) of the corresponding distribution is strictly positive, the p-value of the Chi-squared test is returned. For continuous distributions, Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling statistics are also computed, as defined by Stephens (1986). An approximate Kolmogorov-Smirnov test is performed by assuming the distribution parameters known. The critical value defined by Stephens (1986) for a completely specified distribution is used to reject or not the distribution at the significance level 0.05. Because of this approximation, the result of the test (decision to reject the distribution or not) is returned only for data sets with more than 30 observations. Note that this approximate test may be too conservative. For data sets with more than 5 observations and for distributions for which the test is described by Stephens (1986) for maximum likelihood estimations (\code{"exp"}, \code{"cauchy"}, \code{"gamma"} and \code{"weibull"}), the Cramer-von Mises and Anderson-Darling tests are performed as described by Stephens (1986). Those tests take into account the fact that the parameters are not known but estimated from the data by maximum likelihood. The result is the decision to reject or not the distribution at the significance level 0.05. Those tests are available only for maximum likelihood estimations. Only recommended statistics are automatically printed, i.e. Cramer-von Mises, Anderson-Darling and Kolmogorov-Smirnov statistics for continuous distributions and Chi-squared statistics for discrete ones (\code{"binom"}, \code{"nbinom"}, \code{"geom"}, \code{"hyper"} and \code{"pois"}). Results of the tests are not printed but stored in the output of the function.
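Since the test decisions are stored rather than printed, they can be retrieved from the returned object; a minimal sketch (component names as described in the value section below):
\preformatted{
g <- gofstat(fitg)  # fitg as in the examples below
g$kstest   # decision of the Kolmogorov-Smirnov test
g$cvmtest  # decision of the Cramer-von Mises test
g$adtest   # decision of the Anderson-Darling test
}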
} \value{ \code{gofstat} returns an object of class \code{"gofstat.fitdist"} with the following components, \item{ chisq }{ a named vector with the Chi-squared statistics or \code{NULL} if not computed} \item{ chisqbreaks }{ common breaks used to define cells in the Chi-squared statistic } \item{ chisqpvalue }{ a named vector with the p-values of the Chi-squared statistic or \code{NULL} if not computed} \item{ chisqdf }{ a named vector with the degrees of freedom of the Chi-squared distribution or \code{NULL} if not computed } \item{ chisqtable }{ a table with observed and theoretical counts used for the Chi-squared calculations } \item{ cvm }{ a named vector of the Cramer-von Mises statistics or \code{"not computed"} if not computed } \item{ cvmtest }{ a named vector of the decisions of the Cramer-von Mises test or \code{"not computed"} if not computed } \item{ ad }{ a named vector with the Anderson-Darling statistics or \code{"not computed"} if not computed } \item{ adtest }{ a named vector with the decisions of the Anderson-Darling test or \code{"not computed"} if not computed } \item{ ks }{ a named vector with the Kolmogorov-Smirnov statistics or \code{"not computed"} if not computed } \item{ kstest }{ a named vector with the decisions of the Kolmogorov-Smirnov test or \code{"not computed"} if not computed } \item{aic}{a named vector with the values of the Akaike's Information Criterion.} \item{bic}{a named vector with the values of the Bayesian Information Criterion.} \item{discrete}{the input argument or the automatic definition by the function from the first object of class \code{"fitdist"} of the list in input.} \item{nbfit}{the number of fits in argument.} } \seealso{ \code{\link{fitdist}}. } \references{ Cullen AC and Frey HC (1999), \emph{Probabilistic techniques in exposure assessment}. Plenum Press, USA, pp. 81-155. Stephens MA (1986), \emph{Tests based on edf statistics}. In Goodness-of-fit techniques (D'Agostino RB and Stephens MA, eds), Marcel Dekker, New York, pp. 97-194. Venables WN and Ripley BD (2002), \emph{Modern applied statistics with S}. Springer, New York, pp. 435-446. Vose D (2000), \emph{Risk analysis, a quantitative guide}. John Wiley & Sons Ltd, Chichester, England, pp. 99-143. Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34. } \author{ Marie-Laure Delignette-Muller and Christophe Dutang.
} \examples{ # (1) fit of two distributions to the serving size data # by maximum likelihood estimation # and comparison of goodness-of-fit statistics # data(groundbeef) serving <- groundbeef$serving (fitg <- fitdist(serving, "gamma")) gofstat(fitg) (fitln <- fitdist(serving, "lnorm")) gofstat(fitln) gofstat(list(fitg, fitln)) # (2) fit of two discrete distributions to toxocara data # and comparison of goodness-of-fit statistics # data(toxocara) number <- toxocara$number fitp <- fitdist(number,"pois") summary(fitp) plot(fitp) fitnb <- fitdist(number,"nbinom") summary(fitnb) plot(fitnb) gofstat(list(fitp, fitnb),fitnames = c("Poisson","negbin")) # (3) Use of Chi-squared results in addition to # recommended statistics for continuous distributions # set.seed(1234) x4 <- rweibull(n=1000,shape=2,scale=1) # fit of the good distribution f4 <- fitdist(x4,"weibull") # fit of a bad distribution f4b <- fitdist(x4,"cauchy") gofstat(list(f4,f4b),fitnames=c("Weibull", "Cauchy")) } \keyword{ distribution } fitdistrplus/man/detectbound.Rd0000644000176200001440000000272014213577253016362 0ustar liggesusers\name{detectbound} \alias{detectbound} \title{ Detect bounds for density function} \description{ Manual detection of the bounds of the parameters of a density function. } \usage{ detectbound(distname, vstart, obs, fix.arg=NULL, echo=FALSE) } \arguments{ \item{distname}{ A character string \code{"name"} naming a distribution for which the corresponding density function \code{dname} must be classically defined. } \item{vstart}{A named vector giving the initial values of parameters of the named distribution. } \item{obs}{ A numeric vector for non-censored data. } \item{fix.arg}{An optional named vector giving the values of fixed parameters of the named distribution. Defaults to \code{NULL}.} \item{echo}{A logical to show some traces.} } \details{ This function manually tests the following bounds: -1, 0, and 1. } \value{ \code{detectbound} returns a 2-row matrix with the lower bounds in the first row and the upper bounds in the second row. } \seealso{ \code{\link{fitdist}}. } \references{ Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34. } \author{ Christophe Dutang and Marie-Laure Delignette-Muller. } \examples{ # cases where the density returns a NaN (not-a-number) value detectbound("exp", c(rate=3), 1:10) detectbound("binom", c(size=3, prob=1/2), 1:10) detectbound("nbinom", c(size=3, prob=1/2), 1:10) } \keyword{ distribution }fitdistrplus/man/bootdistcens.Rd0000644000176200001440000001516714355535041016567 0ustar liggesusers\name{bootdistcens} \alias{bootdistcens} \alias{plot.bootdistcens} \alias{print.bootdistcens} \alias{summary.bootdistcens} \title{ Bootstrap simulation of uncertainty for censored data} \description{ Uses nonparametric bootstrap resampling in order to simulate uncertainty in the parameters of the distribution fitted to censored data.
} \usage{ bootdistcens(f, niter = 1001, silent = TRUE, parallel = c("no", "snow", "multicore"), ncpus) \method{print}{bootdistcens}(x, \dots) \method{plot}{bootdistcens}(x, \dots) \method{summary}{bootdistcens}(object, \dots) } \arguments{ \item{f}{ An object of class \code{"fitdistcens"}, output of the \code{\link{fitdistcens}} function.} \item{niter}{ The number of samples drawn by bootstrap.} \item{silent}{A logical to remove or show warnings and errors when bootstrapping.} \item{parallel}{The type of parallel operation to be used, \code{"snow"} or \code{"multicore"} (the second one not being available on Windows), or \code{"no"} if no parallel operation.} \item{ncpus}{Number of processes to be used in parallel operation: typically one would fix it to the number of available CPUs.} \item{x}{ An object of class \code{"bootdistcens"}.} \item{object}{ An object of class \code{"bootdistcens"}.} \item{\dots}{ Further arguments to be passed to generic methods.} } \details{ Samples are drawn by nonparametric bootstrap (resampling with replacement from the data set). On each bootstrap sample the function \code{\link{mledist}} is used to estimate bootstrapped values of parameters. When \code{\link{mledist}} fails to converge, \code{NA} values are returned. Medians and 2.5 and 97.5 percentiles are computed by removing \code{NA} values. The medians and the 95 percent confidence intervals of parameters (2.5 and 97.5 percentiles) are printed in the summary. If smaller than the total number of iterations, the number of iterations for which \code{\link{mledist}} converges is also printed in the summary. The plot of an object of class \code{"bootdistcens"} consists of a scatterplot or a matrix of scatterplots of the bootstrapped values of parameters. It uses the function \code{\link{stripchart}} when the fitted distribution is characterized by only one parameter, and the function \code{\link{plot}} in other cases. In these last cases, it provides a representation of the joint uncertainty distribution of the fitted parameters. It is possible to accelerate the bootstrap using parallelization. We recommend using \code{parallel = "multicore"}, or \code{parallel = "snow"} if you work on Windows, and fixing \code{ncpus} to the number of available processors. } \value{ \code{bootdistcens} returns an object of class \code{"bootdistcens"}, a list with 6 components, \item{estim}{ a data frame containing the bootstrapped values of parameters.} \item{converg}{ a vector containing the codes for convergence of the iterative method used to estimate parameters on each bootstrapped data set.} \item{method}{ A character string coding for the type of resampling: in this case \code{"nonparam"} as it is the only available method for censored data. } \item{nbboot}{ The number of samples drawn by bootstrap.} \item{CI}{ bootstrap medians and 95 percent confidence percentile intervals of parameters. } \item{fitpart}{ The object of class \code{"fitdistcens"} on which the bootstrap procedure was applied.} Generic functions: \describe{ \item{\code{print}}{ The print of a \code{"bootdistcens"} object shows the bootstrap parameter estimates. If smaller than the total number of bootstrap iterations, the number of iterations for which the estimation converges is also printed. } \item{\code{summary}}{ The summary provides the median and 2.5 and 97.5 percentiles of each parameter. If smaller than the total number of bootstrap iterations, the number of iterations for which the estimation converges is also printed in the summary.
} \item{\code{plot}}{ The plot shows the bootstrap estimates with the \code{\link{stripchart}} function for univariate parameters and the \code{\link{plot}} function for multivariate parameters. } } } \seealso{ See \code{\link{fitdistrplus}} for an overview of the package. \code{\link{fitdistcens}}, \code{\link{mledist}}, \code{\link{quantile.bootdistcens}} for another generic function to calculate quantiles from the fitted distribution and its bootstrap results and \code{\link{CIcdfplot}} for adding confidence intervals on quantiles to a CDF plot of the fitted distribution. } \references{ Cullen AC and Frey HC (1999), \emph{Probabilistic techniques in exposure assessment}. Plenum Press, USA, pp. 181-241. Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34. } \author{ Marie-Laure Delignette-Muller and Christophe Dutang. } \examples{ # We choose a low number of bootstrap replicates in order to satisfy CRAN running times # constraint. # For practical applications, we recommend using at least niter=501 or niter=1001. # (1) Fit of a normal distribution to fluazinam data in log10 # followed by nonparametric bootstrap and calculation of quantiles # with 95 percent confidence intervals # data(fluazinam) (d1 <- log10(fluazinam)) f1 <- fitdistcens(d1, "norm") b1 <- bootdistcens(f1, niter = 101) b1 summary(b1) plot(b1) quantile(b1) CIcdfplot(b1, CI.output = "quantile") # (2) Estimation of the mean of the normal distribution # by maximum likelihood with the standard deviation fixed at 1 # using the argument fix.arg # followed by nonparametric bootstrap # and calculation of quantiles with 95 percent confidence intervals # f1b <- fitdistcens(d1, "norm", start = list(mean = 1), fix.arg = list(sd = 1)) b1b <- bootdistcens(f1b, niter = 101) summary(b1b) plot(b1b) quantile(b1b) # (3) comparison of sequential and parallel versions of bootstrap # to be tried with a greater number of iterations (1001 or more) # \donttest{ niter <- 1001 data(fluazinam) d1 <- log10(fluazinam) f1 <- fitdistcens(d1, "norm") # sequential version ptm <- proc.time() summary(bootdistcens(f1, niter = niter)) proc.time() - ptm # parallel version using snow require(parallel) ptm <- proc.time() summary(bootdistcens(f1, niter = niter, parallel = "snow", ncpus = 2)) proc.time() - ptm # parallel version using multicore (not available on Windows) ptm <- proc.time() summary(bootdistcens(f1, niter = niter, parallel = "multicore", ncpus = 2)) proc.time() - ptm } } \keyword{ distribution }fitdistrplus/man/graphcomp.Rd0000644000176200001440000003520014102210303016022 0ustar liggesusers\name{graphcomp} \alias{graphcomp} \alias{cdfcomp} \alias{denscomp} \alias{qqcomp} \alias{ppcomp} \title{Graphical comparison of multiple fitted distributions (for non-censored data)} \description{ \code{cdfcomp} plots the empirical cumulative distribution against fitted distribution functions, \code{denscomp} plots the histogram against fitted density functions, \code{qqcomp} plots theoretical quantiles against empirical ones, \code{ppcomp} plots theoretical probabilities against empirical ones. Only \code{cdfcomp} is able to plot fits of a discrete distribution.
} \usage{ cdfcomp(ft, xlim, ylim, xlogscale = FALSE, ylogscale = FALSE, main, xlab, ylab, datapch, datacol, fitlty, fitcol, fitlwd, addlegend = TRUE, legendtext, xlegend = "bottomright", ylegend = NULL, horizontals = TRUE, verticals = FALSE, do.points = TRUE, use.ppoints = TRUE, a.ppoints = 0.5, name.points = NULL, lines01 = FALSE, discrete, add = FALSE, plotstyle = "graphics", fitnbpts = 101, \dots) denscomp(ft, xlim, ylim, probability = TRUE, main, xlab, ylab, datacol, fitlty, fitcol, fitlwd, addlegend = TRUE, legendtext, xlegend = "topright", ylegend = NULL, demp = FALSE, dempcol = "black", plotstyle = "graphics", discrete, fitnbpts = 101, fittype="l", \dots) qqcomp(ft, xlim, ylim, xlogscale = FALSE, ylogscale = FALSE, main, xlab, ylab, fitpch, fitcol, fitlwd, addlegend = TRUE, legendtext, xlegend = "bottomright", ylegend = NULL, use.ppoints = TRUE, a.ppoints = 0.5, line01 = TRUE, line01col = "black", line01lty = 1, ynoise = TRUE, plotstyle = "graphics", \dots) ppcomp(ft, xlim, ylim, xlogscale = FALSE, ylogscale = FALSE, main, xlab, ylab, fitpch, fitcol, fitlwd, addlegend = TRUE, legendtext, xlegend = "bottomright", ylegend = NULL, use.ppoints = TRUE, a.ppoints = 0.5, line01 = TRUE, line01col = "black", line01lty = 1, ynoise = TRUE, plotstyle = "graphics", \dots) } \arguments{ \item{ft}{One \code{"fitdist"} object or a list of objects of class \code{"fitdist"}.} \item{xlim}{The \eqn{x}-limits of the plot.} \item{ylim}{The \eqn{y}-limits of the plot.} \item{xlogscale}{If \code{TRUE}, uses a logarithmic scale for the \eqn{x}-axis.} \item{ylogscale}{If \code{TRUE}, uses a logarithmic scale for the \eqn{y}-axis.} \item{main}{A main title for the plot. See also \code{\link{title}}.} \item{xlab}{A label for the \eqn{x}-axis, defaults to a description of \code{x}.} \item{ylab}{A label for the \eqn{y}-axis, defaults to a description of \code{y}.} \item{datapch}{An integer specifying a symbol to be used in plotting data points. See also \code{\link{par}}.} \item{datacol}{A specification of the color to be used in plotting data points. See also \code{\link{par}}.} \item{fitcol}{A (vector of) color(s) to plot fitted distributions. If there are fewer colors than fits they are recycled in the standard fashion. See also \code{\link{par}}.} \item{fitlty}{A (vector of) line type(s) to plot fitted distributions/densities. If there are fewer values than fits they are recycled in the standard fashion. See also \code{\link{par}}.} \item{fitlwd}{A (vector of) line width(s) to plot fitted distributions/densities. If there are fewer values than fits they are recycled in the standard fashion. See also \code{\link{par}}.} \item{fitpch}{A (vector of) plotting character(s) to plot fitted quantiles/probabilities. If there are fewer values than fits they are recycled in the standard fashion. See also \code{\link{par}}.} \item{fittype}{The type of plot for fitted probabilities in the case of discrete distributions: possible types are \code{"p"} for points, \code{"l"} for lines and \code{"o"} for both overplotted (as in \code{\link{plot.default}}). \code{fittype} is not used for non-discrete distributions.} \item{fitnbpts}{A numeric for the number of points to compute fitted probabilities or cumulative probabilities. Defaults to \code{101}.} \item{addlegend}{If \code{TRUE}, a legend is added to the plot.} \item{legendtext}{A character or expression vector of length \eqn{\ge 1} to appear in the legend. See also \code{\link{legend}}.} \item{xlegend, ylegend}{The \eqn{x} and \eqn{y} coordinates to be used to position the legend.
They can be specified by keyword. If \code{plotstyle = "graphics"}, see \code{\link{xy.coords}} and \code{\link{legend}}. If \code{plotstyle = "ggplot"}, the \code{xlegend} keyword must be one of \code{top}, \code{bottom}, \code{left}, or \code{right}. See also \code{guide_legend} in \code{ggplot2}.} \item{horizontals}{If \code{TRUE}, draws horizontal lines for the step empirical cumulative distribution function (ecdf). See also \code{\link{plot.stepfun}}.} \item{verticals}{If \code{TRUE}, draws vertical lines for the empirical cumulative distribution function (ecdf). Only taken into account if \code{horizontals=TRUE}.} \item{do.points}{If \code{TRUE} (by default), draws points at the x-locations. For large datasets (\code{n > 1e4}), \code{do.points} is ignored and no point is drawn.} \item{use.ppoints}{If \code{TRUE}, probability points of the empirical distribution are defined using function \code{\link{ppoints}} as \code{(1:n - a.ppoints)/(n - 2*a.ppoints + 1)}. If \code{FALSE}, probability points are simply defined as \code{(1:n)/n}. This argument is ignored for discrete data.} \item{a.ppoints}{If \code{use.ppoints=TRUE}, this is passed to the \code{\link{ppoints}} function.} \item{name.points}{A vector of labels for the points if they are drawn, i.e. if \code{do.points = TRUE} (only for non-censored data).} \item{lines01}{A logical to plot two horizontal lines at \code{h=0} and \code{h=1} for \code{cdfcomp}.} \item{line01}{A logical to plot the diagonal line \eqn{y=x} for \code{qqcomp} and \code{ppcomp}.} \item{line01col, line01lty}{Color and line type for \code{line01}. See also \code{\link{par}}.} \item{demp}{A logical to add the empirical density on the plot, using the \code{\link{density}} function.} \item{dempcol}{A color for the empirical density in case it is added on the plot (\code{demp=TRUE}).} \item{ynoise}{A logical to add a small amount of noise when plotting empirical quantiles/probabilities for \code{qqcomp} and \code{ppcomp}.} \item{probability}{A logical to use the probability scale for \code{denscomp}. See also \code{\link{hist}}.} \item{discrete}{If \code{TRUE}, the distributions are considered discrete. When missing, \code{discrete} is set to \code{TRUE} if at least one object of the list \code{ft} is discrete.} \item{add}{If \code{TRUE}, adds to an already existing plot. If \code{FALSE}, starts a new plot. This parameter is not available when \code{plotstyle = "ggplot"}.} \item{plotstyle}{\code{"graphics"} or \code{"ggplot"}. If \code{"graphics"}, the display is built with \code{\link{graphics}} functions. If \code{"ggplot"}, a graphic object output is created with \code{ggplot2} functions (the \code{ggplot2} package must be installed).} \item{\dots}{Further graphical arguments passed to graphical functions used in \code{cdfcomp}, \code{denscomp}, \code{ppcomp} and \code{qqcomp} when \code{plotstyle = "graphics"}. When \code{plotstyle = "ggplot"}, these arguments are only used by the histogram plot (\code{\link{hist}}) in the \code{denscomp} function. When \code{plotstyle = "ggplot"}, the graphical output can be customized with relevant \code{ggplot2} functions after you store your output.} } \details{ \code{cdfcomp} provides a plot of the empirical distribution and each fitted distribution in cdf, by default using Hazen's rule for the empirical distribution, with probability points defined as \code{(1:n - 0.5)/n}. If \code{discrete} is \code{TRUE}, probability points are always defined as \code{(1:n)/n}.
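As an example (a minimal sketch mirroring example (3) below), the plotting position used for the empirical cdf can be changed through \code{use.ppoints} and \code{a.ppoints}:
\preformatted{
data(endosulfan)
log10ATV <- log10(subset(endosulfan, group == "NonArthroInvert")$ATV)
fln <- fitdist(log10ATV, "norm")
# Hazen plotting position (default): (1:n - 0.5)/n
cdfcomp(fln)
# basic plotting position: (1:n)/n
cdfcomp(fln, use.ppoints = FALSE)
}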
For large datasets (\code{n > 1e4}), no point is drawn but the line for the ecdf is drawn instead. Note that when \code{horizontals}, \code{verticals} and \code{do.points} are \code{FALSE}, no empirical point is drawn, only the fitted cdf is shown. \code{denscomp} provides a density plot of each fitted distribution with the histogram of the data for continuous data. When \code{discrete=TRUE}, distributions are considered as discrete, no histogram is plotted but \code{demp} is forced to \code{TRUE} and fitted and empirical probabilities are plotted either with vertical lines (\code{fittype="l"}), with single points (\code{fittype="p"}), or with both lines and points (\code{fittype="o"}). \code{ppcomp} provides a plot of the probabilities of each fitted distribution (\eqn{x}-axis) against the empirical probabilities (\eqn{y}-axis), by default defined as \code{(1:n - 0.5)/n} (data are assumed continuous). For large datasets (\code{n > 1e4}), lines are drawn instead of points and customized with the \code{fitpch} parameter. \code{qqcomp} provides a plot of the quantiles of each theoretical distribution (\eqn{x}-axis) against the empirical quantiles of the data (\eqn{y}-axis), by default defining probability points as \code{(1:n - 0.5)/n} for theoretical quantile calculation (data are assumed continuous). For large datasets (\code{n > 1e4}), lines are drawn instead of points and customized with the \code{fitpch} parameter. By default a legend is added to these plots. Many graphical arguments are optional, dedicated to personalizing the plots, and fixed to default values if omitted. } \seealso{ See \code{\link{plot}}, \code{\link{legend}}, \code{\link{ppoints}}, \code{\link{plot.stepfun}} for classic plotting functions. See \code{\link{CIcdfplot}} and \code{\link{plotdist}} for other plot functions of fitdistrplus. } \references{ Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34. } \author{ Christophe Dutang, Marie-Laure Delignette-Muller and Aurelie Siberchicot.
} \examples{ # (1) Plot various distributions fitted to serving size data # data(groundbeef) serving <- groundbeef$serving fitW <- fitdist(serving, "weibull") fitln <- fitdist(serving, "lnorm") fitg <- fitdist(serving, "gamma") cdfcomp(list(fitW, fitln, fitg), horizontals = FALSE) cdfcomp(list(fitW, fitln, fitg), horizontals = TRUE) cdfcomp(list(fitW, fitln, fitg), horizontals = TRUE, verticals = TRUE, datacol = "purple") cdfcomp(list(fitW, fitln, fitg), legendtext = c("Weibull", "lognormal", "gamma"), main = "ground beef fits", xlab = "serving sizes (g)", ylab = "F", xlim = c(0, 250), xlegend = "center", lines01 = TRUE) denscomp(list(fitW, fitln, fitg), legendtext = c("Weibull", "lognormal", "gamma"), main = "ground beef fits", xlab = "serving sizes (g)", xlim = c(0, 250), xlegend = "topright") ppcomp(list(fitW, fitln, fitg), legendtext = c("Weibull", "lognormal", "gamma"), main = "ground beef fits", xlegend = "bottomright", line01 = TRUE) qqcomp(list(fitW, fitln, fitg), legendtext = c("Weibull", "lognormal", "gamma"), main = "ground beef fits", xlegend = "bottomright", line01 = TRUE, xlim = c(0, 300), ylim = c(0, 300), fitpch = 16) # (2) Plot lognormal distributions fitted by # maximum goodness-of-fit estimation # using various distances (data plotted in log scale) # data(endosulfan) ATV <- subset(endosulfan, group == "NonArthroInvert")$ATV taxaATV <- subset(endosulfan, group == "NonArthroInvert")$taxa flnMGEKS <- fitdist(ATV, "lnorm", method = "mge", gof = "KS") flnMGEAD <- fitdist(ATV, "lnorm", method = "mge", gof = "AD") flnMGEADL <- fitdist(ATV, "lnorm", method = "mge", gof = "ADL") flnMGEAD2L <- fitdist(ATV, "lnorm", method = "mge", gof = "AD2L") cdfcomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), xlogscale = TRUE, main = "fits of a lognormal dist. using various GOF dist.", legendtext = c("MGE KS", "MGE AD", "MGE ADL", "MGE AD2L"), verticals = TRUE, xlim = c(1, 100000), name.points=taxaATV) qqcomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), main = "fits of a lognormal dist. using various GOF dist.", legendtext = c("MGE KS", "MGE AD", "MGE ADL", "MGE AD2L"), xlogscale = TRUE, ylogscale = TRUE) ppcomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), main = "fits of a lognormal dist. 
using various GOF dist.", legendtext = c("MGE KS", "MGE AD", "MGE ADL", "MGE AD2L")) # (3) Plot normal and logistic distributions fitted by # maximum likelihood estimation # using various plotting positions in cdf plots # data(endosulfan) log10ATV <-log10(subset(endosulfan, group == "NonArthroInvert")$ATV) fln <- fitdist(log10ATV, "norm") fll <- fitdist(log10ATV, "logis") # default plot using Hazen plotting position: (1:n - 0.5)/n cdfcomp(list(fln, fll), legendtext = c("normal", "logistic"), xlab = "log10ATV") # plot using mean plotting position (named also Gumbel plotting position) # (1:n)/(n + 1) cdfcomp(list(fln, fll),legendtext = c("normal", "logistic"), xlab = "log10ATV", use.ppoints = TRUE, a.ppoints = 0) # plot using basic plotting position: (1:n)/n cdfcomp(list(fln, fll),legendtext = c("normal", "logistic"), xlab = "log10ATV", use.ppoints = FALSE) # (4) Comparison of fits of two distributions fitted to discrete data # data(toxocara) number <- toxocara$number fitp <- fitdist(number, "pois") fitnb <- fitdist(number, "nbinom") cdfcomp(list(fitp, fitnb), legendtext = c("Poisson", "negative binomial")) denscomp(list(fitp, fitnb),demp = TRUE, legendtext = c("Poisson", "negative binomial")) denscomp(list(fitp, fitnb),demp = TRUE, fittype = "l", dempcol = "black", legendtext = c("Poisson", "negative binomial")) denscomp(list(fitp, fitnb),demp = TRUE, fittype = "p", dempcol = "black", legendtext = c("Poisson", "negative binomial")) denscomp(list(fitp, fitnb),demp = TRUE, fittype = "o", dempcol = "black", legendtext = c("Poisson", "negative binomial")) # (5) Customizing of graphical output and use of ggplot2 # data(groundbeef) serving <- groundbeef$serving fitW <- fitdist(serving, "weibull") fitln <- fitdist(serving, "lnorm") fitg <- fitdist(serving, "gamma") if (requireNamespace ("ggplot2", quietly = TRUE)) { denscomp(list(fitW, fitln, fitg), plotstyle = "ggplot") cdfcomp(list(fitW, fitln, fitg), plotstyle = "ggplot") qqcomp(list(fitW, fitln, fitg), plotstyle = "ggplot") ppcomp(list(fitW, fitln, fitg), plotstyle = "ggplot") } # customizing graphical output with graphics denscomp(list(fitW, fitln, fitg), legendtext = c("Weibull", "lognormal", "gamma"), main = "ground beef fits", xlab = "serving sizes (g)", xlim = c(0, 250), xlegend = "topright", addlegend = FALSE) # customizing graphical output with ggplot2 if (requireNamespace ("ggplot2", quietly = TRUE)) { dcomp <- denscomp(list(fitW, fitln, fitg), legendtext = c("Weibull", "lognormal", "gamma"), xlab = "serving sizes (g)", xlim = c(0, 250), xlegend = "topright", plotstyle = "ggplot", breaks = 20, addlegend = FALSE) dcomp + ggplot2::theme_minimal() + ggplot2::ggtitle("Ground beef fits") } } \keyword{ distribution } fitdistrplus/man/fluazinam.Rd0000644000176200001440000000452414213576336016055 0ustar liggesusers\name{fluazinam} \alias{fluazinam} \docType{data} \title{Species-Sensitivity Distribution (SSD) for Fluazinam} \description{48-hour acute toxicity values (EC50 values) for exposure of macroinvertebrates and zooplancton to fluazinam.} \usage{data(fluazinam)} \format{ \code{fluazinam} is a data frame with 2 columns named left and right, describing each observed EC50 value (in micrograms per liter) as an interval. The left column contains either NA for left censored observations, the left bound of the interval for interval censored observations, or the observed value for non-censored observations. 
The right column contains either NA for right censored observations, the right bound of the interval for interval censored observations, or the observed value for non-censored observations. } \source{Hose, G.C., Van den Brink, P.J. 2004. The species sensitivity distribution approach compared to a microcosm study: A case study with the fungicide fluazinam. \emph{Ecotoxicology and Environmental Safety}, \bold{73}, 109-122.} \examples{ # (1) load of data # data(fluazinam) # (2) plot of data using the default NPMLE (Wang) cdf plot # log10EC50 <- log10(fluazinam) plotdistcens(log10EC50) # (3) fit of a lognormal and a logistic distribution to data # (classical distributions used for species sensitivity # distributions, SSD, in ecotoxicology) # and visual comparison of the fits using a cdf plot # fln <- fitdistcens(log10EC50, "norm") summary(fln) fll <- fitdistcens(log10EC50, "logis") summary(fll) cdfcompcens(list(fln,fll), legendtext = c("normal", "logistic"), xlab = "log10(EC50)") # (4) estimation of the 5 percent quantile value of # the normal fitted distribution (5 percent hazardous concentration: HC5) # with its two-sided 95 percent confidence interval calculated by # nonparametric bootstrap # with a small number of iterations to satisfy CRAN running times constraint. # For practical applications, we recommend using at least niter=501 or niter=1001. # # in log10(EC50) bln <- bootdistcens(fln, niter = 101) HC5ln <- quantile(bln, probs = 0.05) # in EC50 10^(HC5ln$quantiles) 10^(HC5ln$quantCI) # (5) estimation of the HC5 value # with its one-sided 95 percent confidence interval (type "greater") # # in log10(EC50) HC5lnb <- quantile(bln, probs = 0.05, CI.type = "greater") # in EC50 10^(HC5lnb$quantiles) 10^(HC5lnb$quantCI) } \keyword{datasets} fitdistrplus/man/mmedist.Rd0000644000176200001440000002347614421726011015524 0ustar liggesusers\name{mmedist} \alias{mmedist} \alias{mme} \title{ Matching moment fit of univariate distributions} \description{ Fit of univariate distributions by matching moments (raw or centered) for non-censored data. } \usage{ mmedist(data, distr, order, memp, start = NULL, fix.arg = NULL, optim.method = "default", lower = -Inf, upper = Inf, custom.optim = NULL, weights = NULL, silent = TRUE, gradient = NULL, checkstartfix=FALSE, \dots) } \arguments{ \item{data}{A numeric vector for non-censored data.} \item{distr}{A character string \code{"name"} naming a distribution (see 'details').} \item{order}{A numeric vector for the moment order(s). The length of this vector must be equal to the number of parameters to estimate.} \item{memp}{A function implementing empirical moments, raw or centered, which has to be consistent with the \code{distr} argument (and the \code{weights} argument). See details below. } \item{start}{A named list giving the initial values of parameters of the named distribution or a function of data computing initial values and returning a named list. This argument may be omitted (default) for some distributions for which reasonable starting values are computed (see the 'details' section of \code{\link{mledist}}). } \item{fix.arg}{An optional named list giving the values of fixed parameters of the named distribution or a function of data computing (fixed) parameter values and returning a named list. Parameters with fixed value are thus NOT estimated.} \item{optim.method}{ \code{"default"} or optimization method to pass to \code{\link{optim}}. } \item{lower}{ Left bounds on the parameters for the \code{"L-BFGS-B"} method (see \code{\link{optim}}).
} \item{upper}{ Right bounds on the parameters for the \code{"L-BFGS-B"} method (see \code{\link{optim}}). } \item{custom.optim}{a function carrying out the optimization.} \item{weights}{an optional vector of weights to be used in the fitting process. Should be \code{NULL} or a numeric vector with strictly positive integers (typically the number of occurrences of each observation). If non-\code{NULL}, weighted MME is used, otherwise ordinary MME.} \item{silent}{A logical to remove or show warnings when bootstrapping.} \item{gradient}{A function to return the gradient of the squared difference for the \code{"BFGS"}, \code{"CG"} and \code{"L-BFGS-B"} methods. If it is \code{NULL}, a finite-difference approximation will be used, see details.} \item{checkstartfix}{A logical to test starting and fixed values. Do not change it.} \item{\dots}{further arguments passed to the \code{\link{optim}}, \code{\link{constrOptim}} or \code{custom.optim} function.} } \details{ The argument \code{distr} can be one of the base R distributions: \code{"norm"}, \code{"lnorm"}, \code{"exp"} and \code{"pois"}, \code{"gamma"}, \code{"logis"}, \code{"nbinom"}, \code{"geom"}, \code{"beta"} and \code{"unif"}. In that case, no other arguments than \code{data} and \code{distr} are required, because the estimate is computed by a closed-form formula. For distributions characterized by one parameter (\code{"geom"}, \code{"pois"} and \code{"exp"}), this parameter is simply estimated by matching theoretical and observed means, and for distributions characterized by two parameters, these parameters are estimated by matching theoretical and observed means and variances (Vose, 2000). Note that for these closed-form formulas, \code{fix.arg} cannot be used and \code{start} is ignored. The argument \code{distr} can also be the distribution name as long as a corresponding \code{mdistr} function exists, e.g. \code{"pareto"} if \code{"mpareto"} exists. In that case, the arguments \code{order} and \code{memp} have to be supplied in order to carry out the matching numerically, by minimization of the sum of squared differences between observed and theoretical moments. Optionally, other arguments can be supplied to control optimization (see the 'details' section of \code{\link{mledist}} for details about arguments for the control of optimization). In that case, \code{fix.arg} can be used and \code{start} is taken into account. For non closed-form estimators, \code{memp} must be provided to compute empirical moments. When \code{weights=NULL}, this function must have two arguments \code{x, order}: \code{x} the numeric vector of the data and \code{order} the order of the moment. When \code{weights!=NULL}, this function must have three arguments \code{x, order, weights}: \code{x} the numeric vector of the data, \code{order} the order of the moment, \code{weights} the numeric vector of weights. See examples below. Optionally, a vector of \code{weights} can be used in the fitting process. By default (when \code{weights=NULL}), ordinary MME is carried out, otherwise the specified weights are used to compute (raw or centered) weighted moments. For closed-form estimators, weighted mean and variance are computed by \code{\link[Hmisc:wtd.stats]{wtd.mean}} and \code{\link[Hmisc:wtd.stats]{wtd.var}} from the \code{Hmisc} package. When a numerical minimization is used, weighted moments are expected to be computed by the \code{memp} function.
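The expected signatures of \code{memp} can be sketched as follows (these definitions are taken up again in the examples below):
\preformatted{
# empirical raw moment (unweighted case): arguments x and order
memp <- function(x, order) mean(x^order)
# weighted case: arguments x, order and weights
memp2 <- function(x, order, weights) sum(x^order * weights)/sum(weights)
}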
It is not yet possible to take into account weights in functions \code{plotdist}, \code{plotdistcens}, \code{plot.fitdist}, \code{plot.fitdistcens}, \code{cdfcomp}, \code{cdfcompcens}, \code{denscomp}, \code{ppcomp}, \code{qqcomp}, \code{gofstat} and \code{descdist} (developments planned in the future). This function is not intended to be called directly but is internally called in \code{\link{fitdist}} and \code{\link{bootdist}} when used with the matching moments method. } \value{ \code{mmedist} returns a list with the following components, \item{estimate}{ the parameter estimates.} \item{convergence}{ an integer code for the convergence of \code{\link{optim}} defined as below or defined by the user in the user-supplied optimization function. \code{0} indicates successful convergence. \code{1} indicates that the iteration limit of \code{\link{optim}} has been reached. \code{10} indicates degeneracy of the Nelder-Mead simplex. \code{100} indicates that \code{\link{optim}} encountered an internal error. } \item{value}{the minimal value reached for the criterion to minimize.} \item{hessian}{ a symmetric matrix computed by \code{\link{optim}} as an estimate of the Hessian at the solution found or computed in the user-supplied optimization function. } \item{optim.function}{(if appropriate) the name of the optimization function used.} \item{optim.method}{(if appropriate) when \code{\link{optim}} is used, the name of the algorithm used, the field \code{method} of the \code{custom.optim} function otherwise.} \item{fix.arg}{the named list giving the values of parameters of the named distribution that must be kept fixed rather than estimated or \code{NULL} if there are no such parameters. } \item{fix.arg.fun}{the function used to set the value of \code{fix.arg} or \code{NULL}.} \item{weights}{the vector of weights used in the estimation process or \code{NULL}.} \item{counts}{A two-element integer vector giving the number of calls to the objective function and its gradient respectively. This excludes those calls needed to compute the Hessian, if requested, and any calls to the objective function to compute a finite-difference approximation to the gradient. \code{counts} is returned by \code{\link{optim}} or the user-supplied function or set to \code{NULL}.} \item{optim.message}{A character string giving any additional information returned by the optimizer, or \code{NULL}. To understand exactly the message, see the source code.} \item{loglik}{ the log-likelihood value. } \item{method}{either \code{"closed formula"} or the name of the optimization method.} \item{order}{ the order of the moment(s) matched.} \item{memp}{ the empirical moment function.} } \seealso{ \code{\link{mledist}}, \code{\link{qmedist}}, \code{\link{mgedist}}, \code{\link{fitdist}}, \code{\link{fitdistcens}}, \code{\link{optim}}, \code{\link{bootdistcens}} and \code{\link{bootdist}}. } \references{ Evans M, Hastings N and Peacock B (2000), \emph{Statistical distributions}. John Wiley and Sons Inc. Vose D (2000), \emph{Risk analysis, a quantitative guide}. John Wiley & Sons Ltd, Chichester, England, pp. 99-143. Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34. } \author{ Marie-Laure Delignette-Muller and Christophe Dutang.
} \examples{ # (1) basic fit of a normal distribution with moment matching estimation # set.seed(1234) n <- 100 x1 <- rnorm(n=n) mmedist(x1, "norm") #weighted w <- c(rep(1, n/2), rep(10, n/2)) mmedist(x1, "norm", weights=w)$estimate # (2) fit a discrete distribution (Poisson) # set.seed(1234) x2 <- rpois(n=30,lambda = 2) mmedist(x2, "pois") # (3) fit a finite-support distribution (beta) # set.seed(1234) x3 <- rbeta(n=100,shape1=5, shape2=10) mmedist(x3, "beta") # (4) fit a Pareto distribution # \donttest{ require(actuar) #simulate a sample x4 <- rpareto(1000, 6, 2) #empirical raw moment memp <- function(x, order) mean(x^order) memp2 <- function(x, order, weights) sum(x^order * weights)/sum(weights) #fit by MME mmedist(x4, "pareto", order=c(1, 2), memp=memp, start=list(shape=10, scale=10), lower=1, upper=Inf) #fit by weighted MME w <- rep(1, length(x4)) w[x4 < 1] <- 2 mmedist(x4, "pareto", order=c(1, 2), memp=memp2, weights=w, start=list(shape=10, scale=10), lower=1, upper=Inf) } } \keyword{ distribution }fitdistrplus/man/dataFAQ.Rd0000644000176200001440000000070314213577421015317 0ustar liggesusers\name{dataFAQ} \alias{dataFAQ} \alias{dataFAQlog1} \alias{dataFAQscale1} \alias{dataFAQscale2} \docType{data} \title{Datasets for the FAQ} \description{ Datasets used in the FAQ vignette. } \usage{ data(dataFAQlog1) data(dataFAQscale1) data(dataFAQscale2) } \format{ \code{dataFAQlog1}, \code{dataFAQscale1} and \code{dataFAQscale2} are vectors of numeric data. } \author{ Marie-Laure Delignette-Muller and Christophe Dutang. } \keyword{ datasets }fitdistrplus/man/prefit.Rd0000644000176200001440000001204513742313702015345 0ustar liggesusers\name{prefit} \alias{prefit} \title{Pre-fitting procedure} \description{ Searches for good starting values. } \usage{ prefit(data, distr, method = c("mle", "mme", "qme", "mge"), feasible.par, memp=NULL, order=NULL, probs=NULL, qtype=7, gof=NULL, fix.arg=NULL, lower, upper, weights=NULL, silent=TRUE, \dots) } \arguments{ \item{data}{A numeric vector.} \item{distr}{A character string \code{"name"} naming a distribution for which the corresponding density function \code{dname}, the corresponding distribution function \code{pname} and the corresponding quantile function \code{qname} must be defined, or directly the density function.} \item{method}{A character string coding for the fitting method: \code{"mle"} for 'maximum likelihood estimation', \code{"mme"} for 'moment matching estimation', \code{"qme"} for 'quantile matching estimation' and \code{"mge"} for 'maximum goodness-of-fit estimation'.} \item{feasible.par}{A named list giving the initial values of parameters of the named distribution or a function of data computing initial values and returning a named list. This argument may be omitted (default) for some distributions for which reasonable starting values are computed (see the 'details' section of \code{\link{mledist}}). It may not be taken into account for closed-form formulas.} \item{order}{A numeric vector for the moment order(s). The length of this vector must be equal to the number of parameters to estimate.} \item{memp}{A function implementing empirical moments, raw or centered, which has to be consistent with the \code{distr} argument (and the \code{weights} argument).} \item{probs}{A numeric vector of the probabilities for which the quantile matching is done.
The length of this vector must be equal to the number of parameters to estimate.} \item{qtype}{The quantile type used by the R \code{\link{quantile}} function to compute the empirical quantiles (default 7 corresponds to the default quantile method in R).} \item{gof}{A character string coding for the name of the goodness-of-fit distance used: \code{"CvM"} for Cramer-von Mises distance, \code{"KS"} for Kolmogorov-Smirnov distance, \code{"AD"} for Anderson-Darling distance, and \code{"ADR"}, \code{"ADL"}, \code{"AD2R"}, \code{"AD2L"} and \code{"AD2"} for variants of the Anderson-Darling distance described by Luceno (2006).} \item{fix.arg}{An optional named list giving the values of fixed parameters of the named distribution or a function of data computing (fixed) parameter values and returning a named list. Parameters with fixed value are thus NOT estimated. The use of this argument is not possible if \code{method="mme"} and a closed-form formula is used.} \item{weights}{an optional vector of weights to be used in the fitting process. Should be \code{NULL} or a numeric vector. If non-\code{NULL}, weighted MLE is used, otherwise ordinary MLE.} \item{silent}{A logical to remove or show warnings.} \item{lower}{Lower bounds on the parameters.} \item{upper}{Upper bounds on the parameters.} \item{\dots}{Further arguments to be passed to generic functions, or to one of the functions \code{"mledist"}, \code{"mmedist"}, \code{"qmedist"} or \code{"mgedist"}, depending on the chosen method. See \code{\link{mledist}}, \code{\link{mmedist}}, \code{\link{qmedist}}, \code{\link{mgedist}} for details on parameter estimation.} } \details{ Searching good starting values is achieved by transforming the parameters (from their constraint interval to the real line) of the probability distribution. Indeed, \itemize{ \item positive parameters in \eqn{(0,Inf)} are transformed using the logarithm (typically the scale parameter \code{sd} of a normal distribution, see \link{Normal}), \item parameters in \eqn{(1,Inf)} are transformed using the function \eqn{log(x-1)}, \item probability parameters in \eqn{(0,1)} are transformed using the logit function \eqn{log(x/(1-x))} (typically the parameter \code{prob} of a geometric distribution, see \link{Geometric}), \item negative probability parameters in \eqn{(-1,0)} are transformed using the function \eqn{log(-x/(1+x))}, \item real parameters are of course not transformed at all, typically the \code{mean} of a normal distribution, see \link{Normal}. } Once parameters are transformed, an optimization is carried out by a quasi-Newton algorithm (typically BFGS) and then they are transformed back to their original scale. } \value{ A named list. } \seealso{ See \code{\link{mledist}}, \code{\link{mmedist}}, \code{\link{qmedist}}, \code{\link{mgedist}} for details on parameter estimation. See \code{\link{fitdist}} for the main procedure. } \references{ Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34. } \author{ Christophe Dutang and Marie-Laure Delignette-Muller.
} \examples{ # (1) fit of a gamma distribution by maximum likelihood estimation # x <- rgamma(1e3, 5/2, 7/2) prefit(x, "gamma", "mle", list(shape=3, scale=3), lower=-Inf, upper=Inf) } \keyword{ distribution } fitdistrplus/man/endosulfan.Rd0000644000176200001440000000577314076546311016225 0ustar liggesusers\name{endosulfan} \alias{endosulfan} \docType{data} \title{Species Sensitivity Distribution (SSD) for endosulfan} \description{ Summary of 48- to 96-hour acute toxicity values (LC50 and EC50 values) for exposure of Australian and non-Australian taxa to endosulfan. } \usage{ data(endosulfan) } \format{ \code{endosulfan} is a data frame with 4 columns, named ATV for Acute Toxicity Value (geometric mean of LC50 or EC50 values in micrograms per liter), Australian (coding for Australian or another origin), group (arthropods, fish or non-arthropod invertebrates) and taxa. } \source{ Hose, G.C., Van den Brink, P.J. 2004. Confirming the Species-Sensitivity Distribution Concept for Endosulfan Using Laboratory, Mesocosms, and Field Data. \emph{Archives of Environmental Contamination and Toxicology}, \bold{47}, 511-520. } \examples{ # (1) load of data # data(endosulfan) # (2) plot and description of data for non-Australian fish in decimal logarithm # log10ATV <-log10(subset(endosulfan,(Australian == "no") & (group == "Fish"))$ATV) plotdist(log10ATV) descdist(log10ATV,boot=1000) # (3) fit of a normal and a logistic distribution to data in log10 # (classical distributions used for SSD) # and visual comparison of the fits # fln <- fitdist(log10ATV,"norm") summary(fln) fll <- fitdist(log10ATV,"logis") summary(fll) cdfcomp(list(fln,fll),legendtext=c("normal","logistic"), xlab="log10ATV") denscomp(list(fln,fll),legendtext=c("normal","logistic"), xlab="log10ATV") qqcomp(list(fln,fll),legendtext=c("normal","logistic")) ppcomp(list(fln,fll),legendtext=c("normal","logistic")) gofstat(list(fln,fll), fitnames = c("lognormal", "loglogistic")) # (4) estimation of the 5 percent quantile value of # the logistic fitted distribution (5 percent hazardous concentration: HC5) # with its two-sided 95 percent confidence interval calculated by # parametric bootstrap # with a small number of iterations to satisfy CRAN running times constraint. # For practical applications, we recommend using at least niter=501 or niter=1001. # # in log10(ATV) bll <- bootdist(fll,niter=101) HC5ll <- quantile(bll,probs = 0.05) # in ATV 10^(HC5ll$quantiles) 10^(HC5ll$quantCI) # (5) estimation of the 5 percent quantile value of # the fitted logistic distribution (5 percent hazardous concentration: HC5) # with its one-sided 95 percent confidence interval (type "greater") # calculated by # nonparametric bootstrap # with a small number of iterations to satisfy CRAN running times constraint. # For practical applications, we recommend using at least niter=501 or niter=1001. # # in log10(ATV) bllnonpar <- bootdist(fll,niter=101,bootmethod = "nonparam") HC5llgreater <- quantile(bllnonpar,probs = 0.05, CI.type="greater") # in ATV 10^(HC5llgreater$quantiles) 10^(HC5llgreater$quantCI) # (6) fit of a logistic distribution # by minimizing the modified Anderson-Darling AD2L distance # cf.
?mgedist for the definition of this distance # fllAD2L <- fitdist(log10ATV,"logis",method="mge",gof="AD2L") summary(fllAD2L) plot(fllAD2L) } \keyword{datasets} fitdistrplus/man/qmedist.Rd0000644000176200001440000001760314421726051015527 0ustar liggesusers\name{qmedist} \alias{qmedist} \alias{qme} \title{ Quantile matching fit of univariate distributions} \description{ Fit of univariate distributions by matching quantiles for non-censored data. } \usage{ qmedist(data, distr, probs, start = NULL, fix.arg = NULL, qtype = 7, optim.method = "default", lower = -Inf, upper = Inf, custom.optim = NULL, weights = NULL, silent = TRUE, gradient = NULL, checkstartfix=FALSE, \dots) } \arguments{ \item{data}{ A numeric vector for non-censored data. } \item{distr}{ A character string \code{"name"} naming a distribution for which the corresponding quantile function \code{qname} and the corresponding density function \code{dname} must be classically defined. } \item{probs}{A numeric vector of the probabilities for which the quantile matching is done. The length of this vector must be equal to the number of parameters to estimate.} \item{start}{A named list giving the initial values of parameters of the named distribution or a function of data computing initial values and returning a named list. This argument may be omitted (default) for some distributions for which reasonable starting values are computed (see the 'details' section of \code{\link{mledist}}). } \item{fix.arg}{An optional named list giving the values of fixed parameters of the named distribution or a function of data computing (fixed) parameter values and returning a named list. Parameters with fixed value are thus NOT estimated.} \item{qtype}{The quantile type used by the R \code{\link{quantile}} function to compute the empirical quantiles (default 7 corresponds to the default quantile method in R).} \item{optim.method}{ \code{"default"} or optimization method to pass to \code{\link{optim}}. } \item{lower}{ Left bounds on the parameters for the \code{"L-BFGS-B"} method (see \code{\link{optim}}). } \item{upper}{ Right bounds on the parameters for the \code{"L-BFGS-B"} method (see \code{\link{optim}}). } \item{custom.optim}{a function carrying out the optimization.} \item{weights}{an optional vector of weights to be used in the fitting process. Should be \code{NULL} or a numeric vector with strictly positive integers (typically the number of occurrences of each observation). If non-\code{NULL}, weighted QME is used, otherwise ordinary QME.} \item{silent}{A logical to remove or show warnings when bootstrapping.} \item{gradient}{A function to return the gradient of the squared difference for the \code{"BFGS"}, \code{"CG"} and \code{"L-BFGS-B"} methods. If it is \code{NULL}, a finite-difference approximation will be used, see details.} \item{checkstartfix}{A logical to test starting and fixed values. Do not change it.} \item{\dots}{further arguments passed to the \code{\link{optim}}, \code{\link{constrOptim}} or \code{custom.optim} function.} } \details{ The \code{qmedist} function carries out the quantile matching numerically, by minimization of the sum of squared differences between observed and theoretical quantiles. Note that for discrete distributions, the sum of squared differences is a step function and consequently, the optimum is not unique, see the FAQ. The optimization process is the same as \code{\link{mledist}}, see the 'details' section of that function. Optionally, a vector of \code{weights} can be used in the fitting process.
By default (when \code{weights=NULL}), ordinary QME is carried out, otherwise the specified weights are used to compute weighted quantiles used in the squared differences. Weighted quantiles are computed by \code{\link[Hmisc:wtd.stats]{wtd.quantile}} from the \code{Hmisc} package. It is not yet possible to take into account weights in functions \code{plotdist}, \code{plotdistcens}, \code{plot.fitdist}, \code{plot.fitdistcens}, \code{cdfcomp}, \code{cdfcompcens}, \code{denscomp}, \code{ppcomp}, \code{qqcomp}, \code{gofstat} and \code{descdist} (developments planned in the future). This function is not intended to be called directly but is internally called in \code{\link{fitdist}} and \code{\link{bootdist}}. } \value{ \code{qmedist} returns a list with the following components, \item{estimate}{ the parameter estimates.} \item{convergence}{ an integer code for the convergence of \code{\link{optim}} defined as below or defined by the user in the user-supplied optimization function. \code{0} indicates successful convergence. \code{1} indicates that the iteration limit of \code{\link{optim}} has been reached. \code{10} indicates degeneracy of the Nelder-Mead simplex. \code{100} indicates that \code{\link{optim}} encountered an internal error. } \item{value}{the minimal value reached for the criterion to minimize.} \item{hessian}{ a symmetric matrix computed by \code{\link{optim}} as an estimate of the Hessian at the solution found or computed in the user-supplied optimization function. } \item{optim.function}{the name of the optimization function used.} \item{optim.method}{when \code{\link{optim}} is used, the name of the algorithm used, the field \code{method} of the \code{custom.optim} function otherwise.} \item{fix.arg}{the named list giving the values of parameters of the named distribution that must be kept fixed rather than estimated or \code{NULL} if there are no such parameters. } \item{fix.arg.fun}{the function used to set the value of \code{fix.arg} or \code{NULL}.} \item{weights}{the vector of weights used in the estimation process or \code{NULL}.} \item{counts}{A two-element integer vector giving the number of calls to the objective function and its gradient respectively. This excludes those calls needed to compute the Hessian, if requested, and any calls to the objective function to compute a finite-difference approximation to the gradient. \code{counts} is returned by \code{\link{optim}} or the user-supplied function or set to \code{NULL}.} \item{optim.message}{A character string giving any additional information returned by the optimizer, or \code{NULL}. To understand exactly the message, see the source code.} \item{loglik}{ the log-likelihood value. } \item{probs}{ the probability vector on which quantiles are matched. } } \seealso{ \code{\link{mmedist}}, \code{\link{mledist}}, \code{\link{mgedist}}, \code{\link{fitdist}} for other estimation methods and \code{\link{quantile}} for empirical quantile estimation in R. } \references{ Klugman SA, Panjer HH and Willmot GE (2012), \emph{Loss Models: From Data to Decisions}, 4th edition. Wiley Series in Statistics for Finance, Business and Economics, p. 253. Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34. } \author{ Christophe Dutang and Marie-Laure Delignette-Muller.
} \examples{ # (1) basic fit of a normal distribution # set.seed(1234) x1 <- rnorm(n=100) qmedist(x1, "norm", probs=c(1/3, 2/3)) # (2) defining your own distribution functions, here for the Gumbel # distribution. For other distributions, see the CRAN task view dedicated # to probability distributions dgumbel <- function(x, a, b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b)) qgumbel <- function(p, a, b) a - b*log(-log(p)) qmedist(x1, "gumbel", probs=c(1/3, 2/3), start=list(a=10,b=5)) # (3) fit a discrete distribution (Poisson) # set.seed(1234) x2 <- rpois(n=30,lambda = 2) qmedist(x2, "pois", probs=1/2) # (4) fit a finite-support distribution (beta) # set.seed(1234) x3 <- rbeta(n=100,shape1=5, shape2=10) qmedist(x3, "beta", probs=c(1/3, 2/3)) # (5) fit frequency distributions on USArrests dataset. # x4 <- USArrests$Assault qmedist(x4, "pois", probs=1/2) qmedist(x4, "nbinom", probs=c(1/3, 2/3)) } \keyword{ distribution } fitdistrplus/man/fitdist.Rd0000644000176200001440000005700414212104547015524 0ustar liggesusers\name{fitdist} \alias{fitdist} \alias{plot.fitdist} \alias{print.fitdist} \alias{summary.fitdist} \alias{logLik.fitdist} \alias{vcov.fitdist} \alias{coef.fitdist} \title{ Fit of univariate distributions to non-censored data} \description{ Fit of univariate distributions to non-censored data by maximum likelihood (mle), moment matching (mme), quantile matching (qme) or maximizing goodness-of-fit estimation (mge). The latter is also known as minimizing distance estimation. Generic methods are \code{print}, \code{plot}, \code{summary}, \code{quantile}, \code{logLik}, \code{vcov} and \code{coef}. } \usage{ fitdist(data, distr, method = c("mle", "mme", "qme", "mge", "mse"), start=NULL, fix.arg=NULL, discrete, keepdata = TRUE, keepdata.nb=100, \dots) \method{print}{fitdist}(x, \dots) \method{plot}{fitdist}(x, breaks="default", \dots) \method{summary}{fitdist}(object, \dots) \method{logLik}{fitdist}(object, \dots) \method{vcov}{fitdist}(object, \dots) \method{coef}{fitdist}(object, \dots) } \arguments{ \item{data}{A numeric vector.} \item{distr}{A character string \code{"name"} naming a distribution for which the corresponding density function \code{dname}, the corresponding distribution function \code{pname} and the corresponding quantile function \code{qname} must be defined, or directly the density function.} \item{method}{A character string coding for the fitting method: \code{"mle"} for 'maximum likelihood estimation', \code{"mme"} for 'moment matching estimation', \code{"qme"} for 'quantile matching estimation', \code{"mge"} for 'maximum goodness-of-fit estimation' and \code{"mse"} for 'maximum spacing estimation'.} \item{start}{A named list giving the initial values of parameters of the named distribution or a function of data computing initial values and returning a named list. This argument may be omitted (default) for some distributions for which reasonable starting values are computed (see the 'details' section of \code{\link{mledist}}). It may not be taken into account for closed-form formulas.} \item{fix.arg}{An optional named list giving the values of fixed parameters of the named distribution or a function of data computing (fixed) parameter values and returning a named list. Parameters with fixed value are thus NOT estimated by this maximum likelihood procedure. The use of this argument is not possible if \code{method="mme"} and a closed-form formula is used.} \item{keepdata}{a logical.
If \code{TRUE}, the dataset is returned, otherwise only a sample subset is returned.} \item{keepdata.nb}{When \code{keepdata=FALSE}, the length (>1) of the subset returned.} \item{discrete}{ If TRUE, the distribution is considered as discrete. If \code{discrete} is missing, \code{discrete} is automatically set to \code{TRUE} when \code{distr} belongs to \code{"binom"}, \code{"nbinom"}, \code{"geom"}, \code{"hyper"} or \code{"pois"} and to \code{FALSE} in the other cases. It is thus recommended to enter this argument when using another discrete distribution. This argument will not directly affect the results of the fit but will be passed to functions \code{\link{gofstat}}, \code{\link{plotdist}} and \code{\link{cdfcomp}}. } \item{x}{An object of class \code{"fitdist"}. } \item{object}{An object of class \code{"fitdist"}. } \item{breaks}{If \code{"default"} the histogram is plotted with the function \code{hist} with its default breaks definition. Else \code{breaks} is passed to the function \code{hist}. This argument is not taken into account with discrete distributions: \code{"binom"}, \code{"nbinom"}, \code{"geom"}, \code{"hyper"} and \code{"pois"}. } \item{\dots}{Further arguments to be passed to generic functions, or to one of the functions \code{"mledist"}, \code{"mmedist"}, \code{"qmedist"} or \code{"mgedist"} depending on the chosen method. See \code{\link{mledist}}, \code{\link{mmedist}}, \code{\link{qmedist}}, \code{\link{mgedist}} for details on parameter estimation.} } \details{ It is assumed that the \code{distr} argument specifies the distribution by the probability density function, the cumulative distribution function and the quantile function (d, p, q). The five possible fitting methods are described below: \describe{ \item{When \code{method="mle"}}{ Maximum likelihood estimation consists in maximizing the log-likelihood. A numerical optimization is carried out in \code{\link{mledist}} via \code{optim} to find the best values (see \code{\link{mledist}} for details). } \item{When \code{method="mme"}}{ Moment matching estimation consists in equalizing theoretical and empirical moments. Estimated values of the distribution parameters are computed by a closed-form formula for the following distributions: \code{"norm"}, \code{"lnorm"}, \code{"pois"}, \code{"exp"}, \code{"gamma"}, \code{"nbinom"}, \code{"geom"}, \code{"beta"}, \code{"unif"} and \code{"logis"}. Otherwise the theoretical and the empirical moments are matched numerically, by minimization of the sum of squared differences between observed and theoretical moments. In this last case, further arguments are needed in the call to \code{fitdist}: \code{order} and \code{memp} (see \code{\link{mmedist}} for details). } \item{When \code{method = "qme"}}{ Quantile matching estimation consists in equalizing theoretical and empirical quantiles. A numerical optimization is carried out in \code{\link{qmedist}} via \code{optim} to minimize the sum of squared differences between observed and theoretical quantiles. The use of this method requires an additional argument \code{probs}, defined as the numeric vector of the probabilities for which the quantile(s) is(are) to be matched (see \code{\link{qmedist}} for details). } \item{When \code{method = "mge"}}{ Maximum goodness-of-fit estimation consists in maximizing a goodness-of-fit statistic. A numerical optimization is carried out in \code{\link{mgedist}} via \code{optim} to minimize the goodness-of-fit distance.
The use of this method requires an additional argument \code{gof} coding for the goodness-of-fit distance chosen. One can use the classical Cramer-von Mises distance (\code{"CvM"}), the classical Kolmogorov-Smirnov distance (\code{"KS"}), the classical Anderson-Darling distance (\code{"AD"}) which gives more weight to the tails of the distribution, or one of the variants of this last distance proposed by Luceno (2006) (see \code{\link{mgedist}} for more details). This method is not suitable for discrete distributions. } \item{When \code{method = "mse"}}{ Maximum spacing estimation consists in maximizing the average log spacing. A numerical optimization is carried out in \code{\link{msedist}} via \code{optim}. } } By default, direct optimization of the log-likelihood (or other criteria depending on the chosen method) is performed using \code{\link{optim}}, with the "Nelder-Mead" method for distributions characterized by more than one parameter and the "BFGS" method for distributions characterized by only one parameter. The optimization algorithm used in \code{\link{optim}} can be chosen or another optimization function can be specified using the \dots argument (see \code{\link{mledist}} for details). \code{start} may be omitted (i.e. \code{NULL}) for some classic distributions (see the 'details' section of \code{\link{mledist}}). Note that when errors are raised by \code{optim}, it's a good idea to start by adding traces during the optimization process by adding \code{control=list(trace=1, REPORT=1)} in the \dots argument. Once the parameter(s) is(are) estimated, \code{fitdist} computes the log-likelihood for every estimation method and, for maximum likelihood estimation, the standard errors of the estimates calculated from the Hessian at the solution found by \code{optim} or by the user-supplied function passed to \code{mledist}. By default (\code{keepdata = TRUE}), the object returned by \code{fitdist} contains the data vector given in input. When dealing with large datasets, we can remove the original dataset from the output by setting \code{keepdata = FALSE}. In such a case, only \code{keepdata.nb} points (at most) are kept by random subsampling \code{keepdata.nb}-2 points from the dataset and adding the minimum and the maximum. If combined with \code{\link{bootdist}} and used with non-parametric bootstrap, be aware that bootstrap is performed on the subset randomly selected in \code{fitdist}. Currently, the graphical comparisons of multiple fits are not available in this framework. A weighted version of the estimation process is available for \code{method = "mle", "mme", "qme"} by using \code{weights=\dots}. See the corresponding man page for details. Weighted maximum GOF estimation (when \code{method = "mge"}) is not allowed. It is not yet possible to take into account weights in functions \code{plotdist}, \code{plot.fitdist}, \code{cdfcomp}, \code{denscomp}, \code{ppcomp}, \code{qqcomp}, \code{gofstat} and \code{descdist} (developments planned in the future). NB: if your data values are particularly small or large, a scaling may be needed before the optimization process. See example (14) in this man page and examples (14,15) in the test file of the package. Please also take a look at the \code{Rmpfr} package available on CRAN for numerical accuracy issues.
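As an illustration of the \code{keepdata} mechanism described above, here is a minimal sketch on a large simulated sample (the sample size, the gamma parameters and the object names are purely illustrative): \preformatted{
set.seed(1234)
bigdata <- rgamma(1e6, shape = 2, rate = 0.5)
# keep only (at most) 100 data points in the returned object
fbig <- fitdist(bigdata, "gamma", keepdata = FALSE, keepdata.nb = 100)
length(fbig$data)  # 100: 98 subsampled points plus the minimum and the maximum
}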
} \value{ \code{fitdist} returns an object of class \code{"fitdist"}, a list with the following components: \item{estimate}{ the parameter estimates.} \item{method}{ the character string coding for the fitting method: \code{"mle"} for 'maximum likelihood estimation', \code{"mme"} for 'matching moment estimation', \code{"qme"} for 'matching quantile estimation', \code{"mge"} for 'maximum goodness-of-fit estimation' and \code{"mse"} for 'maximum spacing estimation'.} \item{sd}{ the estimated standard errors, \code{NA} if numerically not computable or \code{NULL} if not available.} \item{cor}{ the estimated correlation matrix, \code{NA} if numerically not computable or \code{NULL} if not available.} \item{vcov}{ the estimated variance-covariance matrix, \code{NULL} if not available.} \item{loglik}{ the log-likelihood.} \item{aic}{ the Akaike information criterion.} \item{bic}{ the so-called BIC or SBC (Schwarz Bayesian criterion).} \item{n}{ the length of the data set.} \item{data}{ the data set.} \item{distname}{ the name of the distribution.} \item{fix.arg}{ the named list giving the values of parameters of the named distribution that must be kept fixed rather than estimated by maximum likelihood or \code{NULL} if there are no such parameters. } \item{fix.arg.fun}{the function used to set the value of \code{fix.arg} or \code{NULL}.} \item{dots}{ the list of further arguments passed in \dots to be used in \code{\link{bootdist}} in iterative calls to \code{\link{mledist}}, \code{\link{mmedist}}, \code{\link{qmedist}}, \code{\link{mgedist}} or \code{NULL} if no such arguments.} \item{convergence}{ an integer code for the convergence of \code{\link{optim}}/\code{\link{constrOptim}} defined as below or defined by the user in the user-supplied optimization function. \code{0} indicates successful convergence. \code{1} indicates that the iteration limit of \code{\link{optim}} has been reached. \code{10} indicates degeneracy of the Nelder-Mead simplex. \code{100} indicates that \code{\link{optim}} encountered an internal error.} \item{discrete}{ the input argument or the automatic definition by the function to be passed to functions \code{\link{gofstat}}, \code{\link{plotdist}} and \code{\link{cdfcomp}}. } \item{weights}{the vector of weights used in the estimation process or \code{NULL}.} Generic functions: \describe{ \item{\code{print}}{ The print of a \code{"fitdist"} object shows brief information about the fitting method and the fitted distribution. } \item{\code{summary}}{ The summary provides the parameter estimates of the fitted distribution, the log-likelihood, AIC and BIC statistics and when the maximum likelihood is used, the standard errors of the parameter estimates and the correlation matrix between parameter estimates. } \item{\code{plot}}{ The plot of an object of class "fitdist" returned by \code{fitdist} uses the function \code{\link{plotdist}}. An object of class "fitdist" or a list of objects of class "fitdist" corresponding to various fits using the same data set may also be plotted using a cdf plot (function \code{\link{cdfcomp}}), a density plot (function \code{\link{denscomp}}), a Q-Q plot (function \code{\link{qqcomp}}), or a P-P plot (function \code{\link{ppcomp}}). } \item{\code{logLik}}{ Extracts the estimated log-likelihood from the \code{"fitdist"} object. } \item{\code{vcov}}{ Extracts the estimated variance-covariance matrix from the \code{"fitdist"} object (only available when \code{method = "mle"}).
} \item{\code{coef}}{ Extracts the fitted coefficients from the \code{"fitdist"} object. } } } \seealso{ See \code{\link{fitdistrplus}} for an overview of the package. See \code{\link{mledist}}, \code{\link{mmedist}}, \code{\link{qmedist}}, \code{\link{mgedist}}, \code{\link{msedist}} for details on parameter estimation. See \code{\link{gofstat}} for goodness-of-fit statistics. See \code{\link{plotdist}}, \code{\link{graphcomp}}, \code{\link{CIcdfplot}} for graphs (with or without uncertainty and/or multiple fits). See \code{\link{llplot}} for (log-)likelihood plots in the neighborhood of the fitted value. See \code{\link{bootdist}} for bootstrap procedures and \code{\link{fitdistcens}} for censored-data fitting methods. See \code{\link{optim}} for base R optimization procedures. See \code{\link{quantile.fitdist}}, another generic function, which calculates quantiles from the fitted distribution. See \code{\link{quantile}} for base R quantile computation. } \references{ Cullen AC and Frey HC (1999), \emph{Probabilistic techniques in exposure assessment}. Plenum Press, USA, pp. 81-155. Venables WN and Ripley BD (2002), \emph{Modern applied statistics with S}. Springer, New York, pp. 435-446. Vose D (2000), \emph{Risk analysis, a quantitative guide}. John Wiley & Sons Ltd, Chichester, England, pp. 99-143. Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34. } \author{ Marie-Laure Delignette-Muller and Christophe Dutang. } \examples{ # (1) fit of a gamma distribution by maximum likelihood estimation # data(groundbeef) serving <- groundbeef$serving fitg <- fitdist(serving, "gamma") summary(fitg) plot(fitg) plot(fitg, demp = TRUE) plot(fitg, histo = FALSE, demp = TRUE) cdfcomp(fitg, addlegend=FALSE) denscomp(fitg, addlegend=FALSE) ppcomp(fitg, addlegend=FALSE) qqcomp(fitg, addlegend=FALSE) # (2) use the moment matching estimation (using a closed formula) # fitgmme <- fitdist(serving, "gamma", method="mme") summary(fitgmme) # (3) Comparison of various fits # fitW <- fitdist(serving, "weibull") fitg <- fitdist(serving, "gamma") fitln <- fitdist(serving, "lnorm") summary(fitW) summary(fitg) summary(fitln) cdfcomp(list(fitW, fitg, fitln), legendtext=c("Weibull", "gamma", "lognormal")) denscomp(list(fitW, fitg, fitln), legendtext=c("Weibull", "gamma", "lognormal")) qqcomp(list(fitW, fitg, fitln), legendtext=c("Weibull", "gamma", "lognormal")) ppcomp(list(fitW, fitg, fitln), legendtext=c("Weibull", "gamma", "lognormal")) gofstat(list(fitW, fitg, fitln), fitnames=c("Weibull", "gamma", "lognormal")) # (4) defining your own distribution functions, here for the Gumbel distribution # for other distributions, see the CRAN task view # dedicated to probability distributions # dgumbel <- function(x, a, b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b)) pgumbel <- function(q, a, b) exp(-exp((a-q)/b)) qgumbel <- function(p, a, b) a-b*log(-log(p)) fitgumbel <- fitdist(serving, "gumbel", start=list(a=10, b=10)) summary(fitgumbel) plot(fitgumbel) # (5) fit discrete distributions (Poisson and negative binomial) # data(toxocara) number <- toxocara$number fitp <- fitdist(number,"pois") summary(fitp) plot(fitp) fitnb <- fitdist(number,"nbinom") summary(fitnb) plot(fitnb) cdfcomp(list(fitp,fitnb)) gofstat(list(fitp,fitnb)) # (6) how to change the optimisation method?
# data(groundbeef) serving <- groundbeef$serving fitdist(serving, "gamma", optim.method="Nelder-Mead") fitdist(serving, "gamma", optim.method="BFGS") fitdist(serving, "gamma", optim.method="SANN") # (7) custom optimization function # \donttest{ #create the sample set.seed(1234) mysample <- rexp(100, 5) mystart <- list(rate=8) res1 <- fitdist(mysample, dexp, start=mystart, optim.method="Nelder-Mead") #show the result summary(res1) #the warning tells us to use optimise, because the Nelder-Mead is not adequate. #to meet the standard 'fn' argument and specific name arguments, we wrap optimize, myoptimize <- function(fn, par, ...) { res <- optimize(f=fn, ..., maximum=FALSE) #assume the optimization function minimizes standardres <- c(res, convergence=0, value=res$objective, par=res$minimum, hessian=NA) return(standardres) } #call fitdist with a 'custom' optimization function res2 <- fitdist(mysample, "exp", start=mystart, custom.optim=myoptimize, interval=c(0, 100)) #show the result summary(res2) } # (8) custom optimization function - another example with the genetic algorithm # \donttest{ #set a sample fit1 <- fitdist(serving, "gamma") summary(fit1) #wrap the genoud function from the rgenoud package mygenoud <- function(fn, par, ...) { require(rgenoud) res <- genoud(fn, starting.values=par, ...) standardres <- c(res, convergence=0) return(standardres) } #call fitdist with a 'custom' optimization function fit2 <- fitdist(serving, "gamma", custom.optim=mygenoud, nvars=2, Domains=cbind(c(0, 0), c(10, 10)), boundary.enforcement=1, print.level=1, hessian=TRUE) summary(fit2) } # (9) estimation of the standard deviation of a gamma distribution # by maximum likelihood with the shape fixed at 4 using the argument fix.arg # data(groundbeef) serving <- groundbeef$serving f1c <- fitdist(serving,"gamma",start=list(rate=0.1),fix.arg=list(shape=4)) summary(f1c) plot(f1c) # (10) fit of a Weibull distribution to serving size data # by maximum likelihood estimation # or by quantile matching estimation (in this example # matching first and third quartiles) # data(groundbeef) serving <- groundbeef$serving fWmle <- fitdist(serving, "weibull") summary(fWmle) plot(fWmle) gofstat(fWmle) fWqme <- fitdist(serving, "weibull", method="qme", probs=c(0.25, 0.75)) summary(fWqme) plot(fWqme) gofstat(fWqme) # (11) Fit of a Pareto distribution by numerical moment matching estimation # \donttest{ require(actuar) #simulate a sample x4 <- rpareto(1000, 6, 2) #empirical raw moment memp <- function(x, order) mean(x^order) #fit fP <- fitdist(x4, "pareto", method="mme", order=c(1, 2), memp="memp", start=list(shape=10, scale=10), lower=1, upper=Inf) summary(fP) plot(fP) } # (12) Fit of a Weibull distribution to serving size data by maximum # goodness-of-fit estimation using all the distances available # \donttest{ data(groundbeef) serving <- groundbeef$serving (f1 <- fitdist(serving, "weibull", method="mge", gof="CvM")) (f2 <- fitdist(serving, "weibull", method="mge", gof="KS")) (f3 <- fitdist(serving, "weibull", method="mge", gof="AD")) (f4 <- fitdist(serving, "weibull", method="mge", gof="ADR")) (f5 <- fitdist(serving, "weibull", method="mge", gof="ADL")) (f6 <- fitdist(serving, "weibull", method="mge", gof="AD2R")) (f7 <- fitdist(serving, "weibull", method="mge", gof="AD2L")) (f8 <- fitdist(serving, "weibull", method="mge", gof="AD2")) cdfcomp(list(f1, f2, f3, f4, f5, f6, f7, f8)) cdfcomp(list(f1, f2, f3, f4, f5, f6, f7, f8), xlogscale=TRUE, xlim=c(8, 250), verticals=TRUE) denscomp(list(f1, f2, f3, f4, f5, f6, f7, f8)) } # (13) Fit of a uniform
distribution using maximum likelihood # (a closed formula is used in this special case where the loglikelihood is not defined), # or maximum goodness-of-fit with Cramer-von Mises or Kolmogorov-Smirnov distance # set.seed(1234) u <- runif(50, min=5, max=10) fumle <- fitdist(u, "unif", method="mle") summary(fumle) plot(fumle) gofstat(fumle) fuCvM <- fitdist(u, "unif", method="mge", gof="CvM") summary(fuCvM) plot(fuCvM) gofstat(fuCvM) fuKS <- fitdist(u, "unif", method="mge", gof="KS") summary(fuKS) plot(fuKS) gofstat(fuKS) # (14) scaling problem # the simulated dataset (below) has particularly small values, hence without scaling (10^0), # the optimization raises an error. The for loop shows how scaling by 10^i # for i=1,...,6 makes the fitting procedure work correctly. set.seed(1234) x2 <- rnorm(100, 1e-4, 2e-4) for(i in 0:6) cat(i, try(fitdist(x2*10^i, "cauchy", method="mle")$estimate, silent=TRUE), "\n") # (15) Fit of a normal distribution on acute toxicity values of endosulfan in log10 for # nonarthropod invertebrates, using maximum likelihood estimation # to estimate what is called a species sensitivity distribution # (SSD) in ecotoxicology, followed by estimation of the 5 percent quantile value of # the fitted distribution (which is called the 5 percent hazardous concentration, HC5, # in ecotoxicology) and estimation of other quantiles. # data(endosulfan) ATV <- subset(endosulfan, group == "NonArthroInvert")$ATV log10ATV <- log10(subset(endosulfan, group == "NonArthroInvert")$ATV) fln <- fitdist(log10ATV, "norm") quantile(fln, probs = 0.05) quantile(fln, probs = c(0.05, 0.1, 0.2)) # (16) Fit of a triangular distribution using Cramer-von Mises or # Kolmogorov-Smirnov distance # \donttest{ set.seed(1234) require(mc2d) t <- rtriang(100, min=5, mode=6, max=10) fCvM <- fitdist(t, "triang", method="mge", start = list(min=4, mode=6,max=9), gof="CvM") fKS <- fitdist(t, "triang", method="mge", start = list(min=4, mode=6,max=9), gof="KS") cdfcomp(list(fCvM,fKS)) } # (17) fit a non classical discrete distribution (the zero inflated Poisson distribution) # \donttest{ require(gamlss.dist) set.seed(1234) x <- rZIP(n = 30, mu = 5, sigma = 0.2) plotdist(x, discrete = TRUE) fitzip <- fitdist(x, "ZIP", start = list(mu = 4, sigma = 0.15), discrete = TRUE, optim.method = "L-BFGS-B", lower = c(0, 0), upper = c(Inf, 1)) summary(fitzip) plot(fitzip) fitp <- fitdist(x, "pois") cdfcomp(list(fitzip, fitp)) gofstat(list(fitzip, fitp)) } # (18) examples with distributions in actuar (predefined starting values) # \donttest{ require(actuar) x <- c(2.3,0.1,2.7,2.2,0.4,2.6,0.2,1.,7.3,3.2,0.8,1.2,33.7,14., 21.4,7.7,1.,1.9,0.7,12.6,3.2,7.3,4.9,4000.,2.5,6.7,3.,63., 6.,1.6,10.1,1.2,1.5,1.2,30.,3.2,3.5,1.2,0.2,1.9,0.7,17., 2.8,4.8,1.3,3.7,0.2,1.8,2.6,5.9,2.6,6.3,1.4,0.8) #log logistic ft_llogis <- fitdist(x,'llogis') x <- c(0.3837053, 0.8576858, 0.3552237, 0.6226119, 0.4783756, 0.3139799, 0.4051403, 0.4537631, 0.4711057, 0.5647414, 0.6479617, 0.7134207, 0.5259464, 0.5949068, 0.3509200, 0.3783077, 0.5226465, 1.0241043, 0.4384580, 1.3341520) #inverse weibull ft_iw <- fitdist(x,'invweibull') } } \keyword{ distribution } fitdistrplus/man/toxocara.Rd0000644000176200001440000000164014213576765015711 0ustar liggesusers\name{toxocara} \alias{toxocara} \docType{data} \title{Parasite abundance in insular feral cats} \description{Toxocara cati abundance in feral cats living on Kerguelen island.} \usage{data(toxocara)} \format{\code{toxocara} is a data frame with 1 column (number: number of parasites in digestive tract)} \source{ 
Fromont, E., Morvilliers, L., Artois, M., Pontier, D. 2001. Parasite richness and abundance in insular and mainland feral cats. \emph{Parasitology}, \bold{123}, 143-151.} \examples{ # (1) load of data # data(toxocara) # (2) description and plot of data # number <- toxocara$number descdist(number, discrete = TRUE, boot = 1000) plotdist(number, discrete = TRUE) # (3) fit of a Poisson distribution to data # fitp <- fitdist(number, "pois") summary(fitp) plot(fitp) # (4) fit of a negative binomial distribution to data # fitnb <- fitdist(number, "nbinom") summary(fitnb) plot(fitnb) } \keyword{datasets}fitdistrplus/man/mgedist.Rd0000644000176200001440000002041114213577504015512 0ustar liggesusers\name{mgedist} \alias{mgedist} \alias{mge} \title{ Maximum goodness-of-fit fit of univariate continuous distributions} \description{ Fit of univariate continuous distribution by maximizing goodness-of-fit (or minimizing distance) for non censored data. } \usage{ mgedist(data, distr, gof = "CvM", start = NULL, fix.arg = NULL, optim.method = "default", lower = -Inf, upper = Inf, custom.optim = NULL, silent = TRUE, gradient = NULL, checkstartfix=FALSE, \dots) } \arguments{ \item{data}{ A numeric vector for non censored data. } \item{distr}{ A character string \code{"name"} naming a distribution for which the corresponding quantile function \code{qname} and the corresponding density function \code{dname} must be classically defined. } \item{gof}{A character string coding for the name of the goodness-of-fit distance used: \code{"CvM"} for Cramer-von Mises distance, \code{"KS"} for Kolmogorov-Smirnov distance, \code{"AD"} for Anderson-Darling distance, \code{"ADR"}, \code{"ADL"}, \code{"AD2R"}, \code{"AD2L"} and \code{"AD2"} for variants of Anderson-Darling distance described by Luceno (2006).} \item{start}{A named list giving the initial values of parameters of the named distribution or a function of data computing initial values and returning a named list. This argument may be omitted (default) for some distributions for which reasonable starting values are computed (see the 'details' section of \code{\link{mledist}}). } \item{fix.arg}{An optional named list giving the values of fixed parameters of the named distribution or a function of data computing (fixed) parameter values and returning a named list. Parameters with fixed value are thus NOT estimated.} \item{optim.method}{ \code{"default"} or optimization method to pass to \code{\link{optim}}. } \item{lower}{ Left bounds on the parameters for the \code{"L-BFGS-B"} method (see \code{\link{optim}}). } \item{upper}{ Right bounds on the parameters for the \code{"L-BFGS-B"} method (see \code{\link{optim}}). } \item{custom.optim}{a function carrying the optimization.} \item{silent}{A logical to remove or show warnings when bootstrapping.} \item{gradient}{A function to return the gradient of the gof distance for the \code{"BFGS"}, \code{"CG"} and \code{"L-BFGS-B"} methods. If it is \code{NULL}, a finite-difference approximation will be used.} \item{checkstartfix}{A logical to test starting and fixed values. Do not change it.} \item{\dots}{further arguments passed to the \code{\link{optim}}, \code{\link{constrOptim}} or \code{custom.optim} function.} } \details{ The \code{mgedist} function numerically maximizes goodness-of-fit, or minimizes a goodness-of-fit distance coded by the argument \code{gof}.
One may use one of the classical distances defined in Stephens (1986), the Cramer-von Mises distance (\code{"CvM"}), the Kolmogorov-Smirnov distance (\code{"KS"}) or the Anderson-Darling distance (\code{"AD"}) which gives more weight to the tails of the distribution, or one of the variants of this last distance proposed by Luceno (2006). The right-tail AD (\code{"ADR"}) gives more weight only to the right tail, the left-tail AD (\code{"ADL"}) gives more weight only to the left tail. Either of the tails, or both of them, can receive even larger weights by using second-order Anderson-Darling statistics (using \code{"AD2R"}, \code{"AD2L"} or \code{"AD2"}). The optimization process is the same as \code{\link{mledist}}, see the 'details' section of that function. This function is not intended to be called directly but is internally called in \code{\link{fitdist}} and \code{\link{bootdist}}. This function is intended to be used only with continuous distributions and weighted maximum goodness-of-fit estimation is not allowed. NB: if your data values are particularly small or large, a scaling may be needed before the optimization process. See example (4). } \value{ \code{mgedist} returns a list with the following components, \item{estimate}{ the parameter estimates.} \item{convergence}{ an integer code for the convergence of \code{\link{optim}} defined as below or defined by the user in the user-supplied optimization function. \code{0} indicates successful convergence. \code{1} indicates that the iteration limit of \code{\link{optim}} has been reached. \code{10} indicates degeneracy of the Nelder-Mead simplex. \code{100} indicates that \code{\link{optim}} encountered an internal error. } \item{value}{the minimal value reached for the criterion to minimize.} \item{hessian}{ a symmetric matrix computed by \code{\link{optim}} as an estimate of the Hessian at the solution found or computed in the user-supplied optimization function. } \item{optim.function}{the name of the optimization function used.} \item{optim.method}{when \code{\link{optim}} is used, the name of the algorithm used, the field \code{method} of the \code{custom.optim} function otherwise.} \item{fix.arg}{the named list giving the values of parameters of the named distribution that must be kept fixed rather than estimated or \code{NULL} if there are no such parameters. } \item{fix.arg.fun}{the function used to set the value of \code{fix.arg} or \code{NULL}.} \item{weights}{the vector of weights used in the estimation process or \code{NULL}.} \item{counts}{A two-element integer vector giving the number of calls to the objective function and its gradient respectively. This excludes those calls needed to compute the Hessian, if requested, and any calls to the objective function to compute a finite-difference approximation to the gradient. \code{counts} is returned by \code{\link{optim}} or the user-supplied function or set to \code{NULL}.} \item{optim.message}{A character string giving any additional information returned by the optimizer, or \code{NULL}. To understand exactly the message, see the source code.} \item{loglik}{ the log-likelihood value. } \item{gof}{ the code of the goodness-of-fit distance used. } } \seealso{ \code{\link{mmedist}}, \code{\link{mledist}}, \code{\link{qmedist}}, \code{\link{fitdist}} for other estimation methods. } \references{ Luceno A (2006), \emph{Fitting the generalized Pareto distribution to data using maximum goodness-of-fit estimators}.
Computational Statistics and Data Analysis, 51, 904-917. Stephens MA (1986), \emph{Tests based on edf statistics}. In Goodness-of-fit techniques (D'Agostino RB and Stephens MA, eds), Marcel Dekker, New York, pp. 97-194. Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34. } \author{ Marie-Laure Delignette-Muller and Christophe Dutang. } \examples{ # (1) Fit of a Weibull distribution to serving size data by maximum # goodness-of-fit estimation using all the distances available # data(groundbeef) serving <- groundbeef$serving mgedist(serving, "weibull", gof="CvM") mgedist(serving, "weibull", gof="KS") mgedist(serving, "weibull", gof="AD") mgedist(serving, "weibull", gof="ADR") mgedist(serving, "weibull", gof="ADL") mgedist(serving, "weibull", gof="AD2R") mgedist(serving, "weibull", gof="AD2L") mgedist(serving, "weibull", gof="AD2") # (2) Fit of a uniform distribution using Cramer-von Mises or # Kolmogorov-Smirnov distance # set.seed(1234) u <- runif(100,min=5,max=10) mgedist(u,"unif",gof="CvM") mgedist(u,"unif",gof="KS") # (3) Fit of a triangular distribution using Cramer-von Mises or # Kolmogorov-Smirnov distance # \donttest{ require(mc2d) set.seed(1234) t <- rtriang(100,min=5,mode=6,max=10) mgedist(t,"triang",start = list(min=4, mode=6,max=9),gof="CvM") mgedist(t,"triang",start = list(min=4, mode=6,max=9),gof="KS") } # (4) scaling problem # the simulated dataset (below) has particularly small values, hence without scaling (10^0), # the optimization raises an error. The for loop shows how scaling by 10^i # for i=1,...,6 makes the fitting procedure work correctly. set.seed(1234) x2 <- rnorm(100, 1e-4, 2e-4) for(i in 6:0) cat(i, try(mgedist(x2*10^i,"cauchy")$estimate, silent=TRUE), "\n") } \keyword{ distribution }fitdistrplus/man/msedist.Rd0000644000176200001440000002255314213577117015537 0ustar liggesusers\name{msedist} \alias{msedist} \alias{mse} \title{Maximum spacing estimation of univariate distributions} \description{ Fit of univariate distribution by maximizing (log) spacings for non censored data. } \usage{ msedist(data, distr, phidiv="KL", power.phidiv=NULL, start = NULL, fix.arg = NULL, optim.method = "default", lower = -Inf, upper = Inf, custom.optim = NULL, weights=NULL, silent = TRUE, gradient = NULL, checkstartfix=FALSE, \dots) } \arguments{ \item{data}{ A numeric vector for non censored data. } \item{distr}{ A character string \code{"name"} naming a distribution for which the corresponding quantile function \code{qname} and the corresponding density function \code{dname} must be classically defined. } \item{phidiv}{A character string coding for the name of the phi-divergence used: \code{"KL"} for Kullback-Leibler information (corresponds to classic maximum spacing estimation), \code{"J"} for Jeffreys' divergence, \code{"R"} for Renyi's divergence, \code{"H"} for Hellinger distance, \code{"V"} for Vajda's measure of information, see details.} \item{power.phidiv}{If relevant, a numeric for the power used in some phi-divergences: should be \code{NULL} when \code{phidiv="KL"} or \code{phidiv="J"}, should be positive and different from 1 when \code{phidiv="R"}, should be greater than or equal to 1 when \code{phidiv="H"} or \code{phidiv="V"}, see details.} \item{start}{A named list giving the initial values of parameters of the named distribution or a function of data computing initial values and returning a named list.
This argument may be omitted (default) for some distributions for which reasonable starting values are computed (see the 'details' section of \code{\link{mledist}}). } \item{fix.arg}{An optional named list giving the values of fixed parameters of the named distribution or a function of data computing (fixed) parameter values and returning a named list. Parameters with fixed value are thus NOT estimated.} \item{optim.method}{ \code{"default"} or optimization method to pass to \code{\link{optim}}. } \item{lower}{ Left bounds on the parameters for the \code{"L-BFGS-B"} method (see \code{\link{optim}}). } \item{upper}{ Right bounds on the parameters for the \code{"L-BFGS-B"} method (see \code{\link{optim}}). } \item{custom.optim}{a function carrying the optimization.} \item{weights}{an optional vector of weights to be used in the fitting process. Should be \code{NULL} or a numeric vector with strictly positive integers (typically the number of occurrences of each observation). If non-\code{NULL}, weighted MSE is used, otherwise ordinary MSE.} \item{silent}{A logical to remove or show warnings when bootstrapping.} \item{gradient}{A function to return the gradient of the gof distance for the \code{"BFGS"}, \code{"CG"} and \code{"L-BFGS-B"} methods. If it is \code{NULL}, a finite-difference approximation will be used.} \item{checkstartfix}{A logical to test starting and fixed values. Do not change it.} \item{\dots}{further arguments passed to the \code{\link{optim}}, \code{\link{constrOptim}} or \code{custom.optim} function.} } \details{ The \code{msedist} function numerically maximizes a phi-divergence function of spacings, where spacings are the differences of the cumulative distribution function evaluated at the sorted dataset. The classical maximum spacing estimation (MSE) was introduced by Cheng and Amin (1983) and Ranneby (1984) independently, where the phi-divergence is the logarithm; see Anatolyev and Kosenok (2005) for a link between MSE and maximum likelihood estimation. MSE was generalized by Ranneby and Ekstrom (1997) by allowing different phi-divergence functions. Generalized MSE maximizes \deqn{ S_n(\theta)=\frac{1}{n+1}\sum_{i=1}^{n+1} \phi\left(F(x_{(i)}; \theta)-F(x_{(i-1)}; \theta) \right), }{ S_n(\theta)= sum(i=1,..,n+1; \phi(F(x_(i); \theta)-F(x_(i-1); \theta))) / (n+1), } where \eqn{F(;\theta)} is the parametric distribution function to be fitted, \eqn{\phi} is the phi-divergence function, \eqn{x_{(1)}<\dots<x_{(n)}} is the sorted sample, with the conventions \eqn{F(x_{(0)}; \theta)=0} and \eqn{F(x_{(n+1)}; \theta)=1}. The possible phi-divergence functions are \itemize{ \item Kullback-Leibler information (when \code{phidiv="KL"}, corresponding to classic maximum spacing estimation) \deqn{\phi(x)=\log(x) }{ \phi(x)=log(x) } \item Jeffreys' divergence (when \code{phidiv="J"}) \deqn{\phi(x)=(1-x)\log(x) }{ \phi(x)=(1-x)log(x) } \item Renyi's divergence (when \code{phidiv="R"} and \code{power.phidiv=alpha}) \deqn{\phi(x)=x^\alpha\textrm{sign}(1-\alpha) \textrm{ with } \alpha>0, \alpha\neq 1 }{ \phi(x)=x^\alpha*sign(1-\alpha) with 0<\alpha<1 or 1 <\alpha } \item Hellinger distance (when \code{phidiv="H"} and \code{power.phidiv=p}) \deqn{\phi(x)=-|1-x^{1/p}|^p \textrm{ with } p\ge 1 }{ \phi(x)=-|1-x^(1/p)|^p with p\ge 1 } \item Vajda's measure of information (when \code{phidiv="V"} and \code{power.phidiv=beta}) \deqn{\phi(x)=-|1-x|^\beta \textrm{ with } \beta\ge 1 }{ \phi(x)=-|1-x|^\beta with \beta\ge 1 } } The optimization process is the same as \code{\link{mledist}}, see the 'details' section of that function. This function is not intended to be called directly but is internally called in \code{\link{fitdist}} and \code{\link{bootdist}}. This function is intended to be used only with non-censored data. NB: if your data values are particularly small or large, a scaling may be needed before the optimization process, see \code{\link{mledist}}'s examples.
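As an illustration of the phi-divergences described above, here is a minimal sketch on a simulated Weibull sample (the sample, the chosen divergences and the power value are arbitrary): \preformatted{
set.seed(123)
x <- rweibull(100, shape = 2, scale = 1)
msedist(x, "weibull")                                  # phidiv = "KL", classic MSE
msedist(x, "weibull", phidiv = "J")                    # Jeffreys' divergence
msedist(x, "weibull", phidiv = "H", power.phidiv = 2)  # Hellinger distance, p = 2
}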
} \value{ \code{msedist} returns a list with the following components, \item{estimate}{ the parameter estimates.} \item{convergence}{ an integer code for the convergence of \code{\link{optim}} defined as below or defined by the user in the user-supplied optimization function. \code{0} indicates successful convergence. \code{1} indicates that the iteration limit of \code{\link{optim}} has been reached. \code{10} indicates degeneracy of the Nelder-Mead simplex. \code{100} indicates that \code{\link{optim}} encountered an internal error. } \item{value}{the minimal value reached for the criterion to minimize.} \item{hessian}{ a symmetric matrix computed by \code{\link{optim}} as an estimate of the Hessian at the solution found or computed in the user-supplied optimization function. } \item{optim.function}{the name of the optimization function used.} \item{optim.method}{when \code{\link{optim}} is used, the name of the algorithm used, the field \code{method} of the \code{custom.optim} function otherwise.} \item{fix.arg}{the named list giving the values of parameters of the named distribution that must be kept fixed rather than estimated or \code{NULL} if there are no such parameters. } \item{fix.arg.fun}{the function used to set the value of \code{fix.arg} or \code{NULL}.} \item{weights}{the vector of weights used in the estimation process or \code{NULL}.} \item{counts}{A two-element integer vector giving the number of calls to the objective function and its gradient respectively. This excludes those calls needed to compute the Hessian, if requested, and any calls to the objective function to compute a finite-difference approximation to the gradient. \code{counts} is returned by \code{\link{optim}} or the user-supplied function or set to \code{NULL}.} \item{optim.message}{A character string giving any additional information returned by the optimizer, or \code{NULL}. To understand exactly the message, see the source code.} \item{loglik}{ the log-likelihood value.} \item{phidiv}{The character string coding for the name of the phi-divergence used, either \code{"KL"}, \code{"J"}, \code{"R"}, \code{"H"} or \code{"V"}.} \item{power.phidiv}{Either \code{NULL} or a numeric for the power used in the phi-divergence.} } \seealso{ \code{\link{mmedist}}, \code{\link{mledist}}, \code{\link{qmedist}}, \code{\link{mgedist}}, \code{\link{fitdist}} for other estimation methods. } \references{ Anatolyev, S., and Kosenok, G. (2005). \emph{An alternative to maximum likelihood based on spacings}. Econometric Theory, 21(2), 472-476. Cheng, R.C.H. and N.A.K. Amin (1983) \emph{Estimating parameters in continuous univariate distributions with a shifted origin}. Journal of the Royal Statistical Society Series B 45, 394-403. Ranneby, B. (1984) \emph{The maximum spacing method: An estimation method related to the maximum likelihood method}. Scandinavian Journal of Statistics 11, 93-112. Ranneby, B. and Ekstroem, M. (1997). \emph{Maximum spacing estimates based on different metrics}. Umea universitet. } \author{ Marie-Laure Delignette-Muller and Christophe Dutang.
} \examples{ # (1) Fit of a Weibull distribution to serving size data by maximum # spacing estimation # data(groundbeef) serving <- groundbeef$serving msedist(serving, "weibull") # (2) Fit of an exponential distribution # set.seed(123) x1 <- rexp(1e3) #the convergence is quick msedist(x1, "exp", control=list(trace=0, REPORT=1)) } \keyword{ distribution } fitdistrplus/man/fremale.Rd0000644000176200001440000000126114410525510015460 0ustar liggesusers\name{fremale} \alias{fremale} \docType{data} \title{Fictive survival dataset of a French male population} \description{100 male individuals randomly taken from \code{frefictivetable} in the \code{CASdatasets} package} \usage{data(fremale)} \format{ \code{fremale} is a data frame with 3 columns named \code{AgeIn} and \code{AgeOut}, respectively the entry age and the exit age, and \code{Death}, a binary dummy: 1 indicating the death of the individual; 0 a censored observation. } \references{See the full dataset \code{frefictivetable} of \code{CASdatasets} at \url{http://dutangc.perso.math.cnrs.fr/RRepository/}} \examples{ # (1) load of data # data(fremale) summary(fremale) } \keyword{datasets} fitdistrplus/man/graphcompcens.Rd0000644000176200001440000002153613750266146016712 0ustar liggesusers\name{graphcompcens} \alias{graphcompcens} \alias{cdfcompcens} \alias{denscompcens} \alias{qqcompcens} \alias{ppcompcens} \title{Graphical comparison of multiple fitted distributions for censored data} \description{ \code{cdfcompcens} plots the empirical cumulative distribution against fitted distribution functions, \code{qqcompcens} plots theoretical quantiles against empirical ones, \code{ppcompcens} plots theoretical probabilities against empirical ones. } \usage{ cdfcompcens(ft, xlim, ylim, xlogscale = FALSE, ylogscale = FALSE, main, xlab, ylab, datacol, fillrect, fitlty, fitcol, fitlwd, addlegend = TRUE, legendtext, xlegend = "bottomright", ylegend = NULL, lines01 = FALSE, Turnbull.confint = FALSE, NPMLE.method = "Wang", add = FALSE, plotstyle = "graphics", \dots) qqcompcens(ft, xlim, ylim, xlogscale = FALSE, ylogscale = FALSE, main, xlab, ylab, fillrect, fitcol, fitlwd, addlegend = TRUE, legendtext, xlegend = "bottomright", ylegend = NULL, line01 = TRUE, line01col = "black", line01lty = 1, ynoise = TRUE, NPMLE.method = "Wang", plotstyle = "graphics", \dots) ppcompcens(ft, xlim, ylim, xlogscale = FALSE, ylogscale = FALSE, main, xlab, ylab, fillrect, fitcol, fitlwd, addlegend = TRUE, legendtext, xlegend = "bottomright", ylegend = NULL, line01 = TRUE, line01col = "black", line01lty = 1, ynoise = TRUE, NPMLE.method = "Wang", plotstyle = "graphics", \dots) } \arguments{ \item{ft}{One \code{"fitdistcens"} object or a list of objects of class \code{"fitdistcens"}.} \item{xlim}{The \eqn{x}-limits of the plot.} \item{ylim}{The \eqn{y}-limits of the plot.} \item{xlogscale}{If \code{TRUE}, uses a logarithmic scale for the \eqn{x}-axis.} \item{ylogscale}{If \code{TRUE}, uses a logarithmic scale for the \eqn{y}-axis.} \item{main}{A main title for the plot, see also \code{\link{title}}.} \item{xlab}{A label for the \eqn{x}-axis, defaults to a description of \code{x}.} \item{ylab}{A label for the \eqn{y}-axis, defaults to a description of \code{y}.} \item{datacol}{A specification of the color to be used in plotting data points.} \item{fillrect}{A specification of the color to be used for filling rectangles of non-uniqueness of the empirical cumulative distribution (only used if \code{NPMLE.method} is equal to \code{"Wang"} in \code{cdfcompcens}).
Fix it to \code{NA} if you do not want to fill the rectangles.} \item{fitcol}{A (vector of) color(s) to plot fitted distributions. If there are fewer colors than fits they are recycled in the standard fashion.} \item{fitlty}{A (vector of) line type(s) to plot fitted distributions. If there are fewer values than fits they are recycled in the standard fashion. See also \code{\link{par}}.} \item{fitlwd}{A (vector of) line width(s) to plot fitted distributions. If there are fewer values than fits they are recycled in the standard fashion. See also \code{\link{par}}.} \item{addlegend}{If \code{TRUE}, a legend is added to the plot.} \item{legendtext}{A character or expression vector of length \eqn{\geq 1} to appear in the legend, see also \code{\link{legend}}.} \item{xlegend, ylegend}{The \eqn{x} and \eqn{y} coordinates to be used to position the legend. They can be specified by keyword. If \code{plotstyle = "graphics"}, see \code{\link{xy.coords}} and \code{\link{legend}}. If \code{plotstyle = "ggplot"}, the \code{xlegend} keyword must be one of \code{top}, \code{bottom}, \code{left}, or \code{right}. See also \code{guide_legend} in \code{ggplot2}.} \item{lines01}{A logical to plot two horizontal lines at \code{h=0} and \code{h=1} for \code{cdfcompcens}.} \item{Turnbull.confint}{If \code{TRUE}, confidence intervals will be added to the Turnbull plot. In that case \code{NPMLE.method} is forced to \code{"Turnbull.middlepoints"}.} \item{NPMLE.method}{Three NPMLE techniques are provided, \code{"Wang"}, the default one, rewritten from the package npsurv using function constrOptim from the package stats for optimisation, \code{"Turnbull.middlepoints"}, an older one which is implemented in the package survival and \code{"Turnbull.intervals"} that uses the same Turnbull algorithm from the package survival but associates an interval to each equivalence class instead of the middlepoint of this interval (see details). Only \code{"Wang"} and \code{"Turnbull.intervals"} enable the derivation of a Q-Q plot and a P-P plot.} \item{add}{If \code{TRUE}, adds to an already existing plot. If \code{FALSE}, starts a new plot. This parameter is not available when \code{plotstyle = "ggplot"}.} \item{line01}{A logical to plot the diagonal line \eqn{y=x} for \code{qqcompcens} and \code{ppcompcens}.} \item{line01col, line01lty}{Color and line type for \code{line01}. See also \code{\link{par}}.} \item{ynoise}{A logical to add a small noise when plotting empirical quantiles/probabilities for \code{qqcompcens} and \code{ppcompcens}. \code{ynoise} is only used when various fits are plotted with the \code{"graphics"} \code{plotstyle}. Facets are used instead with the \code{"ggplot"} \code{plotstyle}.} \item{plotstyle}{\code{"graphics"} or \code{"ggplot"}. If \code{"graphics"}, the display is built with \code{\link{graphics}} functions. If \code{"ggplot"}, a graphic object output is created with \code{ggplot2} functions (the \code{ggplot2} package must be installed). In \code{"cdfcompcens"}, \code{"ggplot"} graphics are only available with \code{"Wang"} NPMLE technique.} \item{\dots}{Further graphical arguments passed to graphical functions used in \code{cdfcompcens}, \code{ppcompcens} and \code{qqcompcens}.} } \details{ See details of \code{\link{plotdistcens}} for a detailed description of provided goodness-of-fit plots. } \seealso{ \code{\link{plotdistcens}}, \code{\link{survfit.formula}}, \code{\link{legend}} and \code{\link{par}}. } \references{ Turnbull BW (1974), \emph{Nonparametric estimation of a survivorship function with doubly censored data}.
Journal of the American Statistical Association, 69, 169-173. Wang Y (2008), \emph{Dimension-reduced nonparametric maximum likelihood computation for interval-censored data}. Computational Statistics & Data Analysis, 52, 2388-2402. Wang Y and Taylor SM (2013), \emph{Efficient computation of nonparametric survival functions via a hierarchical mixture formulation}. Statistics and Computing, 23, 713-725. Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34. } \author{ Marie-Laure Delignette-Muller and Christophe Dutang. } \examples{ # (1) Plot various distributions fitted to bacterial contamination data # data(smokedfish) Clog10 <- log10(smokedfish) fitsfn <- fitdistcens(Clog10,"norm") summary(fitsfn) fitsfl <- fitdistcens(Clog10,"logis") summary(fitsfl) dgumbel <- function(x,a,b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b)) pgumbel <- function(q,a,b) exp(-exp((a-q)/b)) qgumbel <- function(p,a,b) a-b*log(-log(p)) fitsfg <- fitdistcens(Clog10,"gumbel",start=list(a=-3,b=3)) summary(fitsfg) # CDF plot cdfcompcens(list(fitsfn,fitsfl,fitsfg)) cdfcompcens(list(fitsfn,fitsfl,fitsfg),datacol="orange",fillrect = NA, legendtext=c("normal","logistic","Gumbel"), main="bacterial contamination fits", xlab="bacterial concentration (CFU/g)",ylab="F", xlegend = "bottom",lines01 = TRUE) # alternative Turnbull plot for the empirical cumulative distribution # (default plot of the previous versions of the package) cdfcompcens(list(fitsfn,fitsfl,fitsfg), NPMLE.method = "Turnbull.middlepoints") # customizing graphical output with ggplot2 if (requireNamespace ("ggplot2", quietly = TRUE)) { cdfcompcens <- cdfcompcens(list(fitsfn,fitsfl,fitsfg),datacol="orange",fillrect = NA, legendtext=c("normal","logistic","Gumbel"), xlab="bacterial concentration (CFU/g)",ylab="F", xlegend = "bottom",lines01 = TRUE, plotstyle = "ggplot") cdfcompcens + ggplot2::theme_minimal() + ggplot2::ggtitle("Bacterial contamination fits") } # PP plot ppcompcens(list(fitsfn,fitsfl,fitsfg)) ppcompcens(list(fitsfn,fitsfl,fitsfg), ynoise = FALSE) par(mfrow = c(2,2)) ppcompcens(fitsfn) ppcompcens(fitsfl) ppcompcens(fitsfg) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { ppcompcens(list(fitsfn,fitsfl,fitsfg), plotstyle = "ggplot") ppcompcens(list(fitsfn,fitsfl,fitsfg), plotstyle = "ggplot", fillrect = c("lightpink", "lightblue", "lightgreen"), fitcol = c("red", "blue", "green")) } # QQ plot qqcompcens(list(fitsfn,fitsfl,fitsfg)) qqcompcens(list(fitsfn,fitsfl,fitsfg), ynoise = FALSE) par(mfrow = c(2,2)) qqcompcens(fitsfn) qqcompcens(fitsfl) qqcompcens(fitsfg) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { qqcompcens(list(fitsfn,fitsfl,fitsfg), ynoise = FALSE, plotstyle = "ggplot") qqcompcens(list(fitsfn,fitsfl,fitsfg), ynoise = FALSE, plotstyle = "ggplot", fillrect = c("lightpink", "lightblue", "lightgreen"), fitcol = c("red", "blue", "green")) } } \keyword{distribution} fitdistrplus/man/quantile.Rd0000644000176200001440000001650514213577171015705 0ustar liggesusers\name{quantile} \alias{quantile} \alias{quantile.fitdist} \alias{quantile.fitdistcens} \alias{quantile.bootdist} \alias{quantile.bootdistcens} \alias{print.quantile.fitdist} \alias{print.quantile.fitdistcens} \alias{print.quantile.bootdist} \alias{print.quantile.bootdistcens} \title{Quantile estimation from a fitted distribution} \description{ Quantile estimation from a fitted distribution, optionally with confidence intervals calculated from the bootstrap result.
} \usage{ \method{quantile}{fitdist}(x, probs = seq(0.1, 0.9, by=0.1), \dots) \method{quantile}{fitdistcens}(x, probs = seq(0.1, 0.9, by=0.1), \dots) \method{quantile}{bootdist}(x, probs = seq(0.1, 0.9, by=0.1),CI.type = "two.sided", CI.level = 0.95, \dots) \method{quantile}{bootdistcens}(x, probs = seq(0.1, 0.9, by=0.1),CI.type = "two.sided", CI.level = 0.95, \dots) \method{print}{quantile.fitdist}(x, \dots) \method{print}{quantile.fitdistcens}(x, \dots) \method{print}{quantile.bootdist}(x, \dots) \method{print}{quantile.bootdistcens}(x, \dots) } \arguments{ \item{x}{ An object of class \code{"fitdist"}, \code{"fitdistcens"}, \code{"bootdist"}, \code{"bootdistcens"} or \code{"quantile.fitdist"}, \code{"quantile.fitdistcens"}, \code{"quantile.bootdist"}, \code{"quantile.bootdistcens"} for the \code{print} generic function. } \item{probs}{ A numeric vector of probabilities with values in [0, 1] at which quantiles must be calculated.} \item{CI.type}{ Type of confidence intervals: either \code{"two.sided"} or one-sided intervals (\code{"less"} or \code{"greater"}).} \item{CI.level}{ The confidence level.} \item{\dots}{Further arguments to be passed to generic functions.} } \details{ Quantiles of the parametric distribution are calculated for each probability specified in \code{probs}, using the estimated parameters. When used with an object of class \code{"bootdist"} or \code{"bootdistcens"}, percentile confidence intervals and median estimates are also calculated from the bootstrap result. If \code{CI.type} is \code{two.sided}, the \code{CI.level} two-sided confidence intervals of quantiles are calculated. If \code{CI.type} is \code{less} or \code{greater}, the \code{CI.level} one-sided confidence intervals of quantiles are calculated. The print functions show the estimated quantiles with percentile confidence intervals and median estimates when a bootstrap resampling has been done previously, and the number of bootstrap iterations for which the estimation converges if it is smaller than the total number of bootstrap iterations. } \value{ \code{quantile} returns a list with 2 components (the first two described below) when called with an object of class \code{"fitdist"} or \code{"fitdistcens"} and 8 components (described below) when called with an object of class \code{"bootdist"} or \code{"bootdistcens"}: \item{quantiles}{ a dataframe containing the estimated quantiles for each probability value specified in the argument \code{probs} (one row, and as many columns as values in \code{probs}).} \item{probs}{the numeric vector of probabilities at which quantiles are calculated.} \item{bootquant}{ A data frame containing the bootstrapped values for each quantile (many rows, as specified in the call to \code{\link{bootdist}} in the argument \code{niter}, and as many columns as values in \code{probs})} \item{quantCI}{ If \code{CI.type} is \code{two.sided}, the two bounds of the \code{CI.level} percent two.sided confidence interval for each quantile (two rows and as many columns as values in \code{probs}). If \code{CI.type} is \code{less}, the right bound of the \code{CI.level} percent one.sided confidence interval for each quantile (one row). If \code{CI.type} is \code{greater}, the left bound of the \code{CI.level} percent one.sided confidence interval for each quantile (one row).
} \item{quantmedian}{Median of bootstrap estimates (per probability).} \item{CI.type}{ Type of confidence interval: either \code{"two.sided"} or one-sided intervals (\code{"less"} or \code{"greater"}).} \item{CI.level}{ The confidence level.} \item{nbboot}{ The number of samples drawn by bootstrap.} \item{nbconverg}{ The number of iterations for which the optimization algorithm converges.} } \seealso{ \code{\link{fitdist}}, \code{\link{bootdist}}, \code{\link{fitdistcens}}, \code{\link{bootdistcens}} and \code{\link{CIcdfplot}}. } \references{ Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34. } \author{ Marie-Laure Delignette-Muller and Christophe Dutang. } \examples{ # (1) Fit of a normal distribution on acute toxicity log-transformed values of # endosulfan for nonarthropod invertebrates, using maximum likelihood estimation # to estimate what is called a species sensitivity distribution # (SSD) in ecotoxicology, followed by estimation of the 5, 10 and 20 percent quantile # values of the fitted distribution, which are called the 5, 10, 20 percent hazardous # concentrations (HC5, HC10, HC20) in ecotoxicology, followed by calculations of their # confidence intervals with various definitions, from a small number of bootstrap # iterations to satisfy CRAN running times constraint. # For practical applications, we recommend using at least niter=501 or niter=1001. # data(endosulfan) ATV <- subset(endosulfan, group == "NonArthroInvert")$ATV log10ATV <- log10(subset(endosulfan, group == "NonArthroInvert")$ATV) fln <- fitdist(log10ATV, "norm") quantile(fln, probs = c(0.05, 0.1, 0.2)) bln <- bootdist(fln, bootmethod="param", niter=101) quantile(bln, probs = c(0.05, 0.1, 0.2)) quantile(bln, probs = c(0.05, 0.1, 0.2), CI.type = "greater") quantile(bln, probs = c(0.05, 0.1, 0.2), CI.level = 0.9) # (2) Draw of 95 percent confidence intervals on quantiles of the # previously fitted distribution # cdfcomp(fln) q1 <- quantile(bln, probs = seq(0,1,length=101)) points(q1$quantCI[1,],q1$probs,type="l") points(q1$quantCI[2,],q1$probs,type="l") # (2b) Draw of 95 percent confidence intervals on quantiles of the # previously fitted distribution # using the function CIcdfplot # CIcdfplot(bln, CI.output = "quantile", CI.fill = "pink") # (3) Fit of a distribution on acute salinity log-transformed tolerance # for riverine macro-invertebrates, using maximum likelihood estimation # to estimate what is called a species sensitivity distribution # (SSD) in ecotoxicology, followed by estimation of the 5, 10 and 20 percent quantile # values of the fitted distribution, which are called the 5, 10, 20 percent hazardous # concentrations (HC5, HC10, HC20) in ecotoxicology, followed by calculations of # their confidence intervals with various definitions, # from a small number of bootstrap iterations to satisfy CRAN running times constraint. # For practical applications, we recommend using at least niter=501 or niter=1001.

# (3) Fit of a distribution on acute salinity log-transformed tolerance 
# for riverine macro-invertebrates, using maximum likelihood estimation
# to estimate what is called a species sensitivity distribution 
# (SSD) in ecotoxicology, followed by estimation of the 5, 10 and 20 percent quantile
# values of the fitted distribution, which are called the 5, 10, 20 percent hazardous 
# concentrations (HC5, HC10, HC20) in ecotoxicology, followed by calculation of
# their confidence intervals with various definitions, from a small number of
# bootstrap iterations to satisfy CRAN running time constraints.
# For practical applications, we recommend using at least niter=501 or niter=1001.
#
data(salinity)
log10LC50 <- log10(salinity)
flncens <- fitdistcens(log10LC50, "norm")
quantile(flncens, probs = c(0.05, 0.1, 0.2))
blncens <- bootdistcens(flncens, niter = 101)
quantile(blncens, probs = c(0.05, 0.1, 0.2))
quantile(blncens, probs = c(0.05, 0.1, 0.2), CI.type = "greater")
quantile(blncens, probs = c(0.05, 0.1, 0.2), CI.level = 0.9)
}
\keyword{ distribution }fitdistrplus/man/mledist.Rd0000644000176200001440000003243714213577400015525 0ustar liggesusers\name{mledist}
\alias{mledist}
\alias{mle}
\title{ Maximum likelihood fit of univariate distributions}
\description{
 Fit of univariate distributions using maximum likelihood for censored or non-censored data.
}
\usage{
mledist(data, distr, start = NULL, fix.arg = NULL, optim.method = "default", 
    lower = -Inf, upper = Inf, custom.optim = NULL, weights = NULL, silent = TRUE, 
    gradient = NULL, checkstartfix=FALSE, \dots)
}
\arguments{
\item{data}{ A numeric vector for non censored data or a dataframe of two columns respectively named \code{left} and \code{right}, describing each observed value as an interval for censored data. In that case the \code{left} column contains either \code{NA} for left censored observations, the left bound of the interval for interval censored observations, or the observed value for non-censored observations. The \code{right} column contains either \code{NA} for right censored observations, the right bound of the interval for interval censored observations, or the observed value for non-censored observations. }
\item{distr}{ A character string \code{"name"} naming a distribution for which the corresponding density function \code{dname} and the corresponding distribution function \code{pname} must be classically defined. }
\item{start}{A named list giving the initial values of parameters of the named distribution or a function of data computing initial values and returning a named list. This argument may be omitted (default) for some distributions for which reasonable starting values are computed (see details). }
\item{fix.arg}{An optional named list giving the values of fixed parameters of the named distribution or a function of data computing (fixed) parameter values and returning a named list. Parameters with fixed value are thus NOT estimated by this maximum likelihood procedure.}
\item{optim.method}{ \code{"default"} (see details) or an optimization method to pass to \code{\link{optim}}. }
\item{lower}{Left bounds on the parameters for the \code{"L-BFGS-B"} method (see \code{\link{optim}}). }
\item{upper}{ Right bounds on the parameters for the \code{"L-BFGS-B"} method (see \code{\link{optim}}). }
\item{custom.optim}{a function carrying out the MLE optimisation (see details).}
\item{weights}{an optional vector of weights to be used in the fitting process. Should be \code{NULL} or a numeric vector with strictly positive integers (typically the number of occurrences of each observation). If non-\code{NULL}, weighted MLE is used, otherwise ordinary MLE.}
\item{silent}{A logical to remove or show warnings when bootstrapping.}
\item{gradient}{A function to return the gradient of the log-likelihood for the \code{"BFGS"}, \code{"CG"} and \code{"L-BFGS-B"} methods. If it is \code{NULL}, a finite-difference approximation will be used, see details.}
\item{checkstartfix}{A logical to test starting and fixed values. 
Do not change it.}
\item{\dots}{further arguments passed to the \code{\link{optim}}, \code{\link{constrOptim}} or \code{custom.optim} function.}
}
\details{
This function is not intended to be called directly but is internally called in \code{\link{fitdist}} and \code{\link{bootdist}} when used with the maximum likelihood method and \code{\link{fitdistcens}} and \code{\link{bootdistcens}}.

It is assumed that the \code{distr} argument specifies the distribution by the probability density function and the cumulative distribution function (d, p). The quantile function and the random generator function (q, r) may be needed by other functions such as \code{\link{mmedist}}, \code{\link{qmedist}}, \code{\link{mgedist}}, \code{\link{fitdist}}, \code{\link{fitdistcens}}, \code{\link{bootdistcens}} and \code{\link{bootdist}}.

For the following named distributions, reasonable starting values will be computed if \code{start} is omitted (i.e. \code{NULL}): \code{"norm"}, \code{"lnorm"}, \code{"exp"} and \code{"pois"}, \code{"cauchy"}, \code{"gamma"}, \code{"logis"}, \code{"nbinom"} (parametrized by mu and size), \code{"geom"}, \code{"beta"}, \code{"weibull"} from the \code{stats} package; \code{"invgamma"}, \code{"llogis"}, \code{"invweibull"}, \code{"pareto1"}, \code{"pareto"}, \code{"lgamma"}, \code{"trgamma"}, \code{"invtrgamma"} from the \code{actuar} package. Note that these starting values may not be good enough if the fit is poor. The function uses a closed-form formula to fit the uniform distribution. If \code{start} is a list, then it should be a named list with the same names as in the d,p,q,r functions of the chosen distribution. If \code{start} is a function of data, then the function should return a named list with the same names as in the d,p,q,r functions of the chosen distribution.

The \code{mledist} function allows the user to set fixed values for some parameters. As for \code{start}, if \code{fix.arg} is a list, then it should be a named list with the same names as in the d,p,q,r functions of the chosen distribution. If \code{fix.arg} is a function of data, then the function should return a named list with the same names as in the d,p,q,r functions of the chosen distribution.

When \code{custom.optim=NULL} (the default), maximum likelihood estimations of the distribution parameters are computed with the R base \code{\link{optim}} or \code{\link{constrOptim}}. If no finite bounds (\code{lower=-Inf} and \code{upper=Inf}) are supplied, \code{\link{optim}} is used with the method specified by \code{optim.method}. Note that \code{optim.method="default"} means \code{optim.method="Nelder-Mead"} for distributions with at least two parameters and \code{optim.method="BFGS"} for distributions with only one parameter. If finite bounds are supplied (among \code{lower} and \code{upper}) and \code{gradient != NULL}, \code{\link{constrOptim}} is used. If finite bounds are supplied (among \code{lower} and \code{upper}) and \code{gradient == NULL}, \code{\link{constrOptim}} is used when \code{optim.method="Nelder-Mead"}; \code{\link{optim}} is used when \code{optim.method="L-BFGS-B"} or \code{"Brent"}; in other cases, an error is raised (same behavior as \code{\link{constrOptim}}). When errors are raised by \code{\link{optim}}, it's a good idea to start by adding traces during the optimization process by adding \code{control=list(trace=1, REPORT=1)}.

If \code{custom.optim} is not \code{NULL}, then the user-supplied function is used instead of the R base \code{\link{optim}}. 
The \code{custom.optim} must have (at least) the following arguments: \code{fn} for the function to be optimized and \code{par} for the initialized parameters. Internally the function to be optimized will also have other arguments, such as \code{obs} with observations and \code{ddistname} with the distribution name for non censored data (beware of potential conflicts with optional arguments of \code{custom.optim}). It is assumed that \code{custom.optim} should carry out a MINIMIZATION. Finally, it should return at least the following components: \code{par} for the estimate, \code{convergence} for the convergence code, \code{value} for \code{fn(par)}, \code{hessian}, \code{counts} for the number of calls (function and gradient) and \code{message} (default to \code{NULL}) for the error message when \code{custom.optim} raises an error, see the returned value of \code{\link{optim}}. See examples in \code{\link{fitdist}} and \code{\link{fitdistcens}}.

Optionally, a vector of \code{weights} can be used in the fitting process. By default (when \code{weights=NULL}), ordinary MLE is carried out, otherwise the specified weights are used to balance the log-likelihood contributions. It is not yet possible to take into account weights in the functions \code{plotdist}, \code{plotdistcens}, \code{plot.fitdist}, \code{plot.fitdistcens}, \code{cdfcomp}, \code{cdfcompcens}, \code{denscomp}, \code{ppcomp}, \code{qqcomp}, \code{gofstat}, \code{descdist}, \code{bootdist}, \code{bootdistcens} and \code{mgedist} (developments are planned in the future).

NB: if your data values are particularly small or large, a scaling may be needed before the optimization process. See Example (7).
}
\value{ 
\code{mledist} returns a list with the following components:
\item{estimate}{ the parameter estimates.}
\item{convergence}{ an integer code for the convergence of \code{\link{optim}}/\code{\link{constrOptim}} defined as below or defined by the user in the user-supplied optimization function. \code{0} indicates successful convergence. \code{1} indicates that the iteration limit of \code{\link{optim}} has been reached. \code{10} indicates degeneracy of the Nelder-Mead simplex. \code{100} indicates that \code{\link{optim}} encountered an internal error. }
\item{value}{the minimal value reached for the criterion to minimize.}
\item{hessian}{a symmetric matrix computed by \code{\link{optim}} as an estimate of the Hessian at the solution found or computed in the user-supplied optimization function. It is used in \code{fitdist} to estimate standard errors. }
\item{optim.function}{the name of the optimization function used for maximum likelihood.}
\item{optim.method}{when \code{\link{optim}} is used, the name of the algorithm used, the field \code{method} of the \code{custom.optim} function otherwise.}
\item{fix.arg}{the named list giving the values of parameters of the named distribution that must be kept fixed rather than estimated by maximum likelihood or \code{NULL} if there are no such parameters. }
\item{fix.arg.fun}{the function used to set the value of \code{fix.arg} or \code{NULL}.}
\item{weights}{the vector of weights used in the estimation process or \code{NULL}.}
\item{counts}{A two-element integer vector giving the number of calls to the log-likelihood function and its gradient respectively. This excludes those calls needed to compute the Hessian, if requested, and any calls to the log-likelihood function to compute a finite-difference approximation to the gradient. 
\code{counts} is returned by \code{\link{optim}} or the user-supplied function or set to \code{NULL}.}
\item{optim.message}{A character string giving any additional information returned by the optimizer, or \code{NULL}. To understand exactly the message, see the source code.}
\item{loglik}{the log-likelihood value.}
\item{method}{\code{"closed formula"} if appropriate otherwise \code{NULL}.}
}
\seealso{ 
\code{\link{mmedist}}, \code{\link{qmedist}}, \code{\link{mgedist}}, \code{\link{fitdist}}, \code{\link{fitdistcens}} for other estimation methods, \code{\link{optim}}, \code{\link{constrOptim}} for optimization routines, \code{\link{bootdistcens}} and \code{\link{bootdist}} for bootstrap, and \code{\link{llplot}} for plotting the (log)likelihood.
}
\references{ 
Venables WN and Ripley BD (2002), \emph{Modern applied statistics with S}. Springer, New York, pp. 435-446.

Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34.
}
\author{ 
Marie-Laure Delignette-Muller and Christophe Dutang.
}
\examples{
# (1) basic fit of a normal distribution with maximum likelihood estimation
#
set.seed(1234)
x1 <- rnorm(n=100)
mledist(x1,"norm")

# (2) defining your own distribution functions, here for the Gumbel distribution
# for other distributions, see the CRAN task view dedicated to probability distributions
dgumbel <- function(x,a,b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b))
mledist(x1,"gumbel",start=list(a=10,b=5))

# (3) fit of a discrete distribution (Poisson)
#
set.seed(1234)
x2 <- rpois(n=30,lambda = 2)
mledist(x2,"pois")

# (4) fit a finite-support distribution (beta)
#
set.seed(1234)
x3 <- rbeta(n=100,shape1=5, shape2=10)
mledist(x3,"beta")

# (5) fit frequency distributions on USArrests dataset.
#
x4 <- USArrests$Assault
mledist(x4, "pois")
mledist(x4, "nbinom")

# (6) fit a continuous distribution (Gumbel) to censored data.
#
data(fluazinam)
log10EC50 <- log10(fluazinam)
# definition of the Gumbel distribution
dgumbel <- function(x,a,b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b))
pgumbel <- function(q,a,b) exp(-exp((a-q)/b))
qgumbel <- function(p,a,b) a-b*log(-log(p))

mledist(log10EC50,"gumbel",start=list(a=0,b=2),optim.method="Nelder-Mead")

# (7) scaling problem
# the simulated dataset (below) has particularly small values, 
# hence without scaling (10^0), the optimization raises an error.
# The for loop shows how scaling by 10^i for i=1,...,6 makes the
# fitting procedure work correctly.
set.seed(1234)
x2 <- rnorm(100, 1e-4, 2e-4)
for(i in 6:0)
    cat(i, try(mledist(x2*10^i, "cauchy")$estimate, silent=TRUE), "\n")

# (8) small example for the zero-modified geometric distribution
#
dzmgeom <- function(x, p1, p2) p1 * (x == 0) + (1-p1)*dgeom(x-1, p2) #pdf
x2 <- c(2, 4, 0, 40, 4, 21, 0, 0, 0, 2, 5, 0, 0, 13, 2) #simulated dataset
initp1 <- function(x) list(p1=mean(x == 0)) #init as MLE
mledist(x2, "zmgeom", fix.arg=initp1, start=list(p2=1/2))
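
# (9) custom optimisation function: a minimal sketch (added for
# illustration, not from the original examples). The wrapper below, here
# named myoptim (a hypothetical helper), simply delegates to the base R
# optim() while requesting the Hessian, so that it returns the components
# expected from a custom.optim function as described in the details
# section (par, convergence, value, hessian, counts, message).
myoptim <- function(fn, par, ...) optim(par = par, fn = fn, ..., hessian = TRUE)
mledist(x1, "norm", custom.optim = myoptim)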
}
\keyword{ distribution }fitdistrplus/man/danish.Rd0000644000176200001440000000432313742313702015322 0ustar liggesusers\name{danish}
\alias{danish}
\alias{danishuni}
\alias{danishmulti}
\docType{data}
\title{ Danish reinsurance claim dataset }
\description{
The univariate dataset was collected at Copenhagen Reinsurance and comprises 2167 fire losses over the period 1980 to 1990. They have been adjusted for inflation to reflect 1985 values and are expressed in millions of Danish Krone. 

The multivariate data set is the same data as above but the total claim has been divided into a building loss, a loss of contents and a loss of profits.
}
\usage{
data(danishuni)
data(danishmulti)
}
\format{ 
\code{danishuni} contains two columns:
\describe{
    \item{\code{Date}}{The day of claim occurrence.}
    \item{\code{Loss}}{The total loss amount in millions of Danish Krone (DKK).}
}
\code{danishmulti} contains five columns:
\describe{
    \item{\code{Date}}{The day of claim occurrence.}
    \item{\code{Building}}{The loss amount (mDKK) of the building coverage.}
    \item{\code{Contents}}{The loss amount (mDKK) of the contents coverage.}
    \item{\code{Profits}}{The loss amount (mDKK) of the profit coverage.}
    \item{\code{Total}}{The total loss amount (mDKK).}
}
All columns are numeric except the Date columns, which are of class Date.
}
\source{ 
Embrechts, P., Kluppelberg, C. and Mikosch, T. (1997) \emph{Modelling Extremal Events for Insurance and Finance}. Berlin: Springer.
}
\references{ 
Dataset used in McNeil (1996), \emph{Estimating the Tails of Loss Severity Distributions using Extreme Value Theory}, ASTIN Bull.

Davison, A. C. (2003) \emph{Statistical Models}. Cambridge University Press. Page 278.
}
\examples{
# (1) load of data
#
data(danishuni)

# (2) plot and description of data
#
plotdist(danishuni$Loss)

# (3) load of data
#
data(danishmulti)

# (4) plot and description of data
#
idx <- sample(1:NROW(danishmulti), 10)
barplot(danishmulti$Building[idx], col = "grey25",
        ylim = c(0, max(danishmulti$Total[idx])), main = "Some claims of danish data set")
barplot(danishmulti$Contents[idx], add = TRUE, col = "grey50", axes = FALSE)
barplot(danishmulti$Profits[idx], add = TRUE, col = "grey75", axes = FALSE)
legend("topleft", legend = c("Building", "Contents", "Profits"),
       fill = c("grey25", "grey50", "grey75"))
}
\keyword{datasets}
fitdistrplus/man/plotdist.Rd0000644000176200001440000001241614213577122015722 0ustar liggesusers\name{plotdist}
\alias{plotdist}
\title{ Plot of empirical and theoretical distributions for non-censored data}
\description{
 Plots an empirical distribution (non-censored data) with a theoretical one if specified.
}
\usage{
plotdist(data, distr, para, histo = TRUE, breaks = "default", demp = FALSE,
   discrete, \dots)
}
\arguments{
\item{data}{ A numeric vector. }
\item{distr}{ A character string \code{"name"} naming a distribution for which the corresponding density function \code{dname}, the corresponding distribution function \code{pname} and the corresponding quantile function \code{qname} must be defined, or directly the density function. This argument may be omitted only if \code{para} is omitted. }
\item{para}{ A named list giving the parameters of the named distribution. This argument may be omitted only if \code{distr} is omitted. }
\item{histo}{A logical to plot the histogram using the \code{\link{hist}} function.}
\item{breaks}{ If \code{"default"} the histogram is plotted with the function \code{hist} with its default breaks definition. Else \code{breaks} is passed to the function \code{hist}. This argument is not taken into account if \code{discrete} is \code{TRUE}. }
\item{demp}{A logical to plot the empirical density on the first plot (alone or superimposed on the histogram depending on the value of the argument \code{histo}) using the \code{\link{density}} function.}
\item{discrete}{ If TRUE, the distribution is considered as discrete. If both \code{distr} and \code{discrete} are missing, \code{discrete} is set to \code{FALSE}. 
If \code{discrete} is missing but not \code{distr}, \code{discrete} is set to \code{TRUE} when \code{distr} belongs to \code{"binom"}, \code{"nbinom"}, \code{"geom"}, \code{"hyper"} or \code{"pois"}. }
\item{\dots}{ further graphical arguments passed to graphical functions used in plotdist.}
}
\details{
Empirical and, if specified, theoretical distributions are plotted in density and in cdf. For the plot in density, the user can use the arguments \code{histo} and \code{demp} to specify whether to plot the histogram using the function \code{\link{hist}}, the density plot using the function \code{\link{density}}, or both (at least one of the two arguments must be set to \code{TRUE}). For continuous distributions, the function \code{\link{hist}} is used with its default breaks definition if \code{breaks} is \code{"default"} or passing \code{breaks} as an argument if it differs from \code{"default"}. For continuous distributions and when a theoretical distribution is specified by both arguments \code{distr} and \code{para}, a Q-Q plot (plot of the quantiles of the theoretical fitted distribution (x-axis) against the empirical quantiles of the data) and a P-P plot (i.e. for each value of the data set, plot of the cumulative distribution function of the fitted distribution (x-axis) against the empirical cumulative distribution function (y-axis)) are also given (Cullen and Frey, 1999). The function \code{\link{ppoints}} (with default parameter for argument a) is used for the Q-Q plot, to generate the set of probabilities at which to evaluate the inverse distribution. NOTE THAT FROM VERSION 0.4-3, \code{\link{ppoints}} is also used for the P-P plot and the cdf plot for continuous data. To customize the four plots proposed for continuous data, for example to change the plotting position, we recommend the use of the functions \code{\link{cdfcomp}}, \code{\link{denscomp}}, \code{\link{qqcomp}} and \code{\link{ppcomp}}.
}
\seealso{ 
\code{\link{graphcomp}}, \code{\link{descdist}}, \code{\link{hist}}, \code{\link{plot}}, \code{\link{plotdistcens}} and \code{\link{ppoints}}.
}
\references{ 
Cullen AC and Frey HC (1999), \emph{Probabilistic techniques in exposure assessment}. Plenum Press, USA, pp. 81-155.

Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34.
}
\author{ 
Marie-Laure Delignette-Muller and Christophe Dutang.
}
\examples{
# (1) Plot of an empirical distribution with changing 
# of default line types for CDF and colors
# and optionally adding a density line
#
set.seed(1234)
x1 <- rnorm(n=30)
plotdist(x1)
plotdist(x1, demp = TRUE)
plotdist(x1, histo = FALSE, demp = TRUE)
plotdist(x1, col="blue", type="b", pch=16)
plotdist(x1, type="s")

# (2) Plot of a discrete distribution against data
#
set.seed(1234)
x2 <- rpois(n=30, lambda = 2)
plotdist(x2, discrete=TRUE)
plotdist(x2, "pois", para=list(lambda = mean(x2)))
plotdist(x2, "pois", para=list(lambda = mean(x2)), lwd=2)

# (3) Plot of a continuous distribution against data
#
xn <- rnorm(n=100, mean=10, sd=5)
plotdist(xn, "norm", para=list(mean=mean(xn), sd=sd(xn)))
plotdist(xn, "norm", para=list(mean=mean(xn), sd=sd(xn)), pch=16)
plotdist(xn, "norm", para=list(mean=mean(xn), sd=sd(xn)), demp = TRUE)
plotdist(xn, "norm", para=list(mean=mean(xn), sd=sd(xn)), histo = FALSE, demp = TRUE)

# (4) Plot of serving size data
#
data(groundbeef)
plotdist(groundbeef$serving, type="s")

# (5) Plot of numbers of parasites with a Poisson distribution 
data(toxocara)
number <- toxocara$number
plotdist(number, discrete = TRUE)
plotdist(number,"pois",para=list(lambda=mean(number)))
}
\keyword{ distribution }
fitdistrplus/man/bootdist.Rd0000644000176200001440000002222614212141547015704 0ustar liggesusers\name{bootdist}
\alias{bootdist}
\alias{plot.bootdist}
\alias{print.bootdist}
\alias{summary.bootdist}
\title{ Bootstrap simulation of uncertainty for non-censored data}
\description{
 Uses parametric or nonparametric bootstrap resampling in order to simulate uncertainty in the parameters of the distribution fitted to non-censored data.
}
\usage{
bootdist(f, bootmethod = "param", niter = 1001, silent = TRUE, 
      parallel = c("no", "snow", "multicore"), ncpus)
\method{print}{bootdist}(x, \dots)
\method{plot}{bootdist}(x, main = "Bootstrapped values of parameters", enhance = FALSE, 
    trueval = NULL, rampcol = NULL, nbgrid = 100, nbcol = 100, \dots)
\method{summary}{bootdist}(object, \dots)
}
\arguments{
\item{f}{ An object of class \code{"fitdist"}, output of the \code{\link{fitdist}} function.}
\item{bootmethod}{ A character string coding for the type of resampling: \code{"param"} for a parametric resampling and \code{"nonparam"} for a nonparametric resampling of data.}
\item{niter}{ The number of samples drawn by bootstrap. }
\item{silent}{A logical to remove or show warnings and errors when bootstrapping.}
\item{parallel}{The type of parallel operation to be used, \code{"snow"} or \code{"multicore"} (the second one not being available on Windows), or \code{"no"} if no parallel operation.}
\item{ncpus}{Number of processes to be used in parallel operation: typically one would fix it to the number of available CPUs.}
\item{x}{ An object of class \code{"bootdist"}. }
\item{object}{ An object of class \code{"bootdist"}. }
\item{main}{an overall title for the plot: see \code{\link{title}}, default to \code{"Bootstrapped values of parameters"}.}
\item{enhance}{a logical to get an enhanced plot.}
\item{trueval}{when relevant, a numeric vector with the true value of parameters (for backfitting purposes).}
\item{rampcol}{colors to interpolate; must be a valid argument to \code{\link[grDevices]{colorRampPalette}()}.}
\item{nbgrid}{Number of grid points in each direction. 
Can be scalar or a length-2 integer vector.}
\item{nbcol}{an integer argument, the required number of colors.}
\item{\dots}{ Further arguments to be passed to generic methods }
}
\details{
Samples are drawn by parametric bootstrap (resampling from the distribution fitted by \code{\link{fitdist}}) or nonparametric bootstrap (resampling with replacement from the data set). On each bootstrap sample the function \code{\link{mledist}} (or \code{\link{mmedist}}, \code{\link{qmedist}}, \code{\link{mgedist}} according to the component \code{f$method} of the object of class \code{"fitdist"}) is used to estimate bootstrapped values of parameters. When that function fails to converge, \code{NA} values are returned. Medians and 2.5 and 97.5 percentiles are computed by removing \code{NA} values. The medians and the 95 percent confidence intervals of parameters (2.5 and 97.5 percentiles) are printed in the summary. If less than the total number of iterations, the number of iterations for which the function converges is also printed in the summary.

By default (when \code{enhance=FALSE}), the plot of an object of class \code{"bootdist"} consists in a scatterplot or a matrix of scatterplots of the bootstrapped values of parameters. It uses the function \code{\link{stripchart}} when the fitted distribution is characterized by only one parameter, the function \code{\link{plot}} when there are two parameters and the function \code{\link{pairs}} in other cases. In these last cases, it provides a representation of the joint uncertainty distribution of the fitted parameters.

When \code{enhance=TRUE}, a customized version of \code{\link{pairs}} is used, where the upper graphs are scatterplots and the lower graphs are heatmap images using \code{\link{image}}, based on a kernel-based estimator of the 2D density function (using \code{kde2d} from the MASS package). The arguments \code{rampcol}, \code{nbgrid} and \code{nbcol} can be used to customize the plots. Default values are \code{rampcol=c("green", "yellow", "orange", "red")}, \code{nbcol=100} (see \code{\link[grDevices]{colorRampPalette}()}) and \code{nbgrid=100} (see \code{kde2d}). In addition, when fitting parameters on simulated datasets for backtesting purposes, an additional argument \code{trueval} can be used to plot a cross at the true value.

It is possible to accelerate the bootstrap using parallelization. We recommend using \code{parallel = "multicore"}, or \code{parallel = "snow"} if you work on Windows, and to fix \code{ncpus} to the number of available processors.
}
\value{ 
\code{bootdist} returns an object of class \code{"bootdist"}, a list with 6 components,
\item{estim}{ a data frame containing the bootstrapped values of parameters.}
\item{converg}{ a vector containing the codes for convergence obtained if an iterative method is used to estimate parameters on each bootstrapped data set (and 0 if a closed formula is used).}
\item{method}{ A character string coding for the type of resampling: \code{"param"} for a parametric resampling and \code{"nonparam"} for a nonparametric resampling. }
\item{nbboot}{ The number of samples drawn by bootstrap.}
\item{CI}{ bootstrap medians and 95 percent percentile confidence intervals of parameters. }
\item{fitpart}{ The object of class \code{"fitdist"} on which the bootstrap procedure was applied.}

Generic functions:
\describe{
\item{\code{print}}{ The print of a \code{"bootdist"} object shows the bootstrap parameter estimates. 
If less than the total number of bootstrap iterations, the number of iterations 
for which the estimation converges is also printed.
}
\item{\code{summary}}{ The summary provides the median and 2.5 and 97.5 percentiles of each parameter. If less than the total number of bootstrap iterations, the number of iterations for which the estimation converges is also printed in the summary.
}
\item{\code{plot}}{ The plot shows the bootstrap estimates with the \code{\link{stripchart}} function for univariate parameters and the \code{\link{plot}} function for multivariate parameters.
}
}
}
\seealso{ 
See \code{\link{fitdistrplus}} for an overview of the package.
\code{\link{fitdist}}, \code{\link{mledist}}, \code{\link{qmedist}}, \code{\link{mmedist}}, \code{\link{mgedist}}, \code{\link{quantile.bootdist}} for another generic function to calculate quantiles from the fitted distribution and its bootstrap results and \code{\link{CIcdfplot}} for adding confidence intervals on quantiles to a CDF plot of the fitted distribution.
}
\references{ 
Cullen AC and Frey HC (1999), \emph{Probabilistic techniques in exposure assessment}. Plenum Press, USA, pp. 181-241.

Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34.
}
\author{ 
Marie-Laure Delignette-Muller and Christophe Dutang.
}
\examples{
# We choose a low number of bootstrap replicates in order to satisfy CRAN running time
# constraints.
# For practical applications, we recommend using at least niter=501 or niter=1001.

# (1) Fit of a gamma distribution to serving size data
# using default method (maximum likelihood estimation)
# followed by parametric bootstrap
#
data(groundbeef)
x1 <- groundbeef$serving
f1 <- fitdist(x1, "gamma")
b1 <- bootdist(f1, niter=51)
print(b1)
plot(b1)
plot(b1, enhance=TRUE)
summary(b1)
quantile(b1)
CIcdfplot(b1, CI.output = "quantile")

# (2) non parametric bootstrap on the same fit
#
b1b <- bootdist(f1, bootmethod="nonparam", niter=51) 
summary(b1b)
quantile(b1b)

# (3) Fit of a normal distribution on acute toxicity values of endosulfan in log10 for
# nonarthropod invertebrates, using maximum likelihood estimation
# to estimate what is called a species sensitivity distribution 
# (SSD) in ecotoxicology, followed by estimation of the 5 percent quantile value of 
# the fitted distribution, what is called the 5 percent hazardous concentration (HC5)
# in ecotoxicology, with its two-sided 95 percent confidence interval calculated by 
# parametric bootstrap
#
data(endosulfan)
ATV <- subset(endosulfan, group == "NonArthroInvert")$ATV
log10ATV <- log10(ATV)
fln <- fitdist(log10ATV, "norm")
bln <- bootdist(fln, bootmethod = "param", niter=51)
quantile(bln, probs = c(0.05, 0.1, 0.2))

# (4) comparison of sequential and parallel versions of bootstrap
# to be tried with a greater number of iterations (1001 or more)
#
\donttest{
niter <- 1001
data(groundbeef)
x1 <- groundbeef$serving
f1 <- fitdist(x1, "gamma")

# sequential version
ptm <- proc.time()
summary(bootdist(f1, niter = niter))
proc.time() - ptm

# parallel version using snow
require(parallel)
ptm <- proc.time()
summary(bootdist(f1, niter = niter, parallel = "snow", ncpus = 2))
proc.time() - ptm

# parallel version using multicore (not available on Windows)
ptm <- proc.time()
summary(bootdist(f1, niter = niter, parallel = "multicore", ncpus = 2))
proc.time() - ptm
}
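
# (5) a minimal sketch (added for illustration, not from the original
# examples): the bootstrapped parameter values are stored in the estim
# component, so the joint uncertainty of the gamma parameters can be
# summarized directly, e.g. by their empirical correlation (iterations
# that failed to converge would appear as NA rows).
cor(b1$estim)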
}
\keyword{ distribution }fitdistrplus/man/fitdistcens.Rd0000644000176200001440000003032014213577227016377 0ustar liggesusers\name{fitdistcens}
\alias{fitdistcens}
\alias{plot.fitdistcens}
\alias{print.fitdistcens}
\alias{summary.fitdistcens}
\alias{logLik.fitdistcens}
\alias{vcov.fitdistcens}
\alias{coef.fitdistcens}
\title{Fitting of univariate distributions to censored data}
\description{
 Fits a univariate distribution to censored data by maximum likelihood.
}
\usage{
fitdistcens(censdata, distr, start=NULL, fix.arg=NULL, keepdata = TRUE, keepdata.nb=100, \dots)
\method{print}{fitdistcens}(x, \dots)
\method{plot}{fitdistcens}(x, \dots)
\method{summary}{fitdistcens}(object, \dots)
\method{logLik}{fitdistcens}(object, \dots)
\method{vcov}{fitdistcens}(object, \dots)
\method{coef}{fitdistcens}(object, \dots)
}
\arguments{
\item{censdata}{ A dataframe of two columns respectively named \code{left} and \code{right}, describing each observed value as an interval. The \code{left} column contains either \code{NA} for left censored observations, the left bound of the interval for interval censored observations, or the observed value for non-censored observations. The \code{right} column contains either \code{NA} for right censored observations, the right bound of the interval for interval censored observations, or the observed value for non-censored observations. }
\item{distr}{ A character string \code{"name"} naming a distribution, for which the corresponding density function \code{dname} and the corresponding distribution function \code{pname} must be defined, or directly the density function.}
\item{start}{ A named list giving the initial values of parameters of the named distribution. This argument may be omitted for some distributions for which reasonable starting values are computed (see the 'details' section of \code{\link{mledist}}). }
\item{fix.arg}{ An optional named list giving the values of parameters of the named distribution that must be kept fixed rather than estimated by maximum likelihood. }
\item{x}{ an object of class \code{"fitdistcens"}. }
\item{object}{ an object of class \code{"fitdistcens"}. }
\item{keepdata}{a logical. If \code{TRUE}, the dataset is returned, otherwise only a sample subset is returned.}
\item{keepdata.nb}{When \code{keepdata=FALSE}, the length of the subset returned.}
\item{\dots}{ further arguments to be passed to generic functions, to the function \code{plotdistcens} in order to control the type of ecdf-plot used for censored data, or to the function \code{mledist} in order to control the optimization method.}
}
\details{
Maximum likelihood estimations of the distribution parameters are computed using the function \code{\link{mledist}}. By default direct optimization of the log-likelihood is performed using \code{\link{optim}}, with the "Nelder-Mead" method for distributions characterized by more than one parameter and the "BFGS" method for distributions characterized by only one parameter. The algorithm used in \code{\link{optim}} can be chosen or another optimization function can be specified using the \dots argument (see \code{\link{mledist}} for details). \code{start} may be omitted (i.e. \code{NULL}) for some classic distributions (see the 'details' section of \code{\link{mledist}}). Note that when errors are raised by \code{optim}, it's a good idea to start by adding traces during the optimization process by adding \code{control=list(trace=1, REPORT=1)} in the \dots argument. The function is not able to fit a uniform distribution. 
With the parameter estimates, the function returns the log-likelihood and the standard errors of the estimates calculated from the Hessian at the solution found by \code{optim} or by the user-supplied function passed to \code{mledist}.

By default (\code{keepdata = TRUE}), the object returned by \code{fitdistcens} contains the data given in input. When dealing with large datasets, we can remove the original dataset from the output by setting \code{keepdata = FALSE}. In such a case, only \code{keepdata.nb} points (at most) are kept by random subsampling \code{keepdata.nb}-4 points from the dataset and adding the component-wise minimum and maximum. If combined with \code{\link{bootdistcens}}, be aware that bootstrap is performed on the subset randomly selected in \code{fitdistcens}. Currently, the graphical comparisons of multiple fits are not available in this framework.

A weighted version of the estimation process is available for \code{method = "mle"} by using \code{weights=\dots}. See the corresponding man page for details. It is not yet possible to take into account weights in the functions \code{plotdistcens}, \code{plot.fitdistcens} and \code{cdfcompcens} (developments are planned in the future).
}
\value{ 
\code{fitdistcens} returns an object of class \code{"fitdistcens"}, a list with the following components:
\item{estimate}{ the parameter estimates.}
\item{method}{ the character string coding for the fitting method: only \code{"mle"} for 'maximum likelihood estimation'.}
\item{sd}{ the estimated standard errors.}
\item{cor}{ the estimated correlation matrix, \code{NA} if numerically not computable or \code{NULL} if not available.}
\item{vcov}{ the estimated variance-covariance matrix, \code{NULL} if not available.}
\item{loglik}{ the log-likelihood.}
\item{aic}{ the Akaike information criterion.}
\item{bic}{ the so-called BIC or SBC (Schwarz Bayesian criterion).}
\item{censdata}{ the censored data set.}
\item{distname}{ the name of the distribution.}
\item{fix.arg}{ the named list giving the values of parameters of the named distribution that must be kept fixed rather than estimated by maximum likelihood or \code{NULL} if there are no such parameters. }
\item{fix.arg.fun}{the function used to set the value of \code{fix.arg} or \code{NULL}.}
\item{dots}{ the list of further arguments passed in \dots to be used in \code{bootdistcens} to control the optimization method used in iterative calls to \code{mledist} or \code{NULL} if no such arguments.}
\item{convergence}{ an integer code for the convergence of \code{\link{optim}}/\code{\link{constrOptim}} defined as below or defined by the user in the user-supplied optimization function. \code{0} indicates successful convergence. \code{1} indicates that the iteration limit of \code{\link{optim}} has been reached. \code{10} indicates degeneracy of the Nelder-Mead simplex. \code{100} indicates that \code{\link{optim}} encountered an internal error.}
\item{discrete}{always \code{FALSE}.}
\item{weights}{the vector of weights used in the estimation process or \code{NULL}.}

Generic functions:
\describe{
\item{\code{print}}{ The print of a \code{"fitdistcens"} object shows a few traces about the fitting method and the fitted distribution.
}
\item{\code{summary}}{ The summary provides the parameter estimates of the fitted distribution, the log-likelihood, AIC and BIC statistics, the standard errors of the parameter estimates and the correlation matrix between parameter estimates. 
}
\item{\code{plot}}{ The plot of an object of class \code{"fitdistcens"} returned by \code{fitdistcens} uses the function \code{\link{plotdistcens}}.
}
\item{\code{logLik}}{ Extracts the estimated log-likelihood from the \code{"fitdistcens"} object.
}
\item{\code{vcov}}{ Extracts the estimated var-covariance matrix from the \code{"fitdistcens"} object (only available when \code{method = "mle"}).
}
\item{\code{coef}}{ Extracts the fitted coefficients from the \code{"fitdistcens"} object.
}
}
}
\seealso{ 
See \code{\link{fitdistrplus}} for an overview of the package.
\code{\link{plotdistcens}}, \code{\link{optim}}, \code{\link{mledist}}, \code{\link{fitdist}} and \code{\link{quantile.fitdistcens}} for another generic function to calculate quantiles from the fitted distribution.
}
\references{ 
Venables WN and Ripley BD (2002), \emph{Modern applied statistics with S}. Springer, New York, pp. 435-446.

Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34.
}
\author{ 
Marie-Laure Delignette-Muller and Christophe Dutang.
}
\examples{
# (1) Fit of a lognormal distribution to bacterial contamination data
#
data(smokedfish)
fitsf <- fitdistcens(smokedfish,"lnorm")
summary(fitsf)
# default plot using the Wang technique (see ?plotdistcens for details)
plot(fitsf)
# plot using the Turnbull algorithm (see ?plotdistcens for details)
# with confidence intervals for the empirical distribution
plot(fitsf, NPMLE = TRUE, NPMLE.method = "Turnbull", Turnbull.confint = TRUE)
# basic plot using intervals and points (see ?plotdistcens for details)
plot(fitsf, NPMLE = FALSE)
# plot of the same fit using the Turnbull algorithm in logscale
cdfcompcens(fitsf,main="bacterial contamination fits",
    xlab="bacterial concentration (CFU/g)",ylab="F",
    addlegend = FALSE,lines01 = TRUE, xlogscale = TRUE, xlim = c(1e-2,1e2))
# zoom on large values of F
cdfcompcens(fitsf,main="bacterial contamination fits",
    xlab="bacterial concentration (CFU/g)",ylab="F",
    addlegend = FALSE,lines01 = TRUE, xlogscale = TRUE, 
    xlim = c(1e-2,1e2),ylim=c(0.4,1))

# (2) Fit of a normal distribution on acute toxicity values 
# of fluazinam (in decimal logarithm) for
# macroinvertebrates and zooplankton, using maximum likelihood estimation
# to estimate what is called a species sensitivity distribution 
# (SSD) in ecotoxicology
#
data(fluazinam)
log10EC50 <- log10(fluazinam)
fln <- fitdistcens(log10EC50,"norm")
fln
summary(fln)
plot(fln)

# (3) defining your own distribution functions, here for the Gumbel distribution
# for other distributions, see the CRAN task view dedicated to  
# probability distributions
#
dgumbel <- function(x,a,b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b))
pgumbel <- function(q,a,b) exp(-exp((a-q)/b))
qgumbel <- function(p,a,b) a-b*log(-log(p))
fg <- fitdistcens(log10EC50,"gumbel",start=list(a=1,b=1))
summary(fg)
plot(fg)

# (4) comparison of fits of various distributions
#
fll <- fitdistcens(log10EC50,"logis")
summary(fll)
cdfcompcens(list(fln,fll,fg),legendtext=c("normal","logistic","gumbel"),
    xlab = "log10(EC50)")

# (5) how to change the optimisation method?
#
fitdistcens(log10EC50,"logis",optim.method="Nelder-Mead")
fitdistcens(log10EC50,"logis",optim.method="BFGS") 
fitdistcens(log10EC50,"logis",optim.method="SANN") 

# (6) custom optimisation function - example with the genetic algorithm
#
\donttest{
# wrap the genoud function of the rgenoud package
mygenoud <- function(fn, par, ...) 
{
   require(rgenoud)
   res <- genoud(fn, starting.values=par, ...)
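   # genoud() does not return a convergence code the way optim() does,
   # so a successful convergence code (0) is appended below to match the
   # components expected from a custom optimization function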
   standardres <- c(res, convergence=0)
   return(standardres)
}
# call fitdistcens with a 'custom' optimization function
fit.with.genoud <- fitdistcens(log10EC50,"logis", custom.optim=mygenoud, nvars=2,    
    Domains=cbind(c(0,0), c(5, 5)), boundary.enforcement=1, 
    print.level=1, hessian=TRUE)
summary(fit.with.genoud)
}

# (7) estimation of the mean of a normal distribution 
# by maximum likelihood with the standard deviation fixed at 1 
# using the argument fix.arg
#
flnb <- fitdistcens(log10EC50, "norm", start = list(mean = 1), fix.arg = list(sd = 1))

# (8) Fit of a lognormal distribution on acute toxicity values of fluazinam for
# macroinvertebrates and zooplankton, using maximum likelihood estimation
# to estimate what is called a species sensitivity distribution 
# (SSD) in ecotoxicology, followed by estimation of the 5 percent quantile value of 
# the fitted distribution (which is called the 5 percent hazardous concentration, HC5,
# in ecotoxicology) and estimation of other quantiles.
data(fluazinam)
log10EC50 <- log10(fluazinam)
fln <- fitdistcens(log10EC50,"norm")

quantile(fln, probs = 0.05)
quantile(fln, probs = c(0.05, 0.1, 0.2))

# (9) Fit of a lognormal distribution on 72-hour acute salinity tolerance (LC50 values)
# of riverine macro-invertebrates using maximum likelihood estimation
data(salinity)
log10LC50 <- log10(salinity)
fln <- fitdistcens(log10LC50,"norm")
plot(fln)
}
\keyword{ distribution }
fitdistrplus/man/Surv2fitdistcens.Rd0000644000176200001440000001220614213577312017337 0ustar liggesusers\name{Surv2fitdistcens}
\alias{Surv2fitdistcens}
\title{Handling of data formatted as in the survival package for use in fitdistcens()}
\description{
 Provides a function to prepare a data frame needed by fitdistcens() from data classically coded when using the Surv() function of the survival package.
}
\usage{
Surv2fitdistcens(time, time2, event, type = c('right', 'left', 'interval', 'interval2'))
}
\arguments{
\item{time}{for right censored data, this is the follow-up time. For interval data, the first argument is the starting time for the interval.}
\item{time2}{ending time of the interval for interval censored data. Intervals are assumed to be open on the left and closed on the right, (start, end].}
\item{event}{The status indicator, normally \code{0}=alive, \code{1}=dead. Other choices are \code{TRUE/FALSE} (\code{TRUE} = death) or \code{1/2} (\code{2}=death). For interval censored data, the status indicator is \code{0}=right censored, \code{1}=event at time, \code{2}=left censored, \code{3}=interval censored. For factor data, assume that it has only two levels with the second level coding death.}
\item{type}{character string specifying the type of censoring. Possible values are \code{"right"}, \code{"left"}, \code{"interval"}, \code{"interval2"}.}
}
\details{
\code{Surv2fitdistcens} makes a \code{data.frame} with two columns respectively named \code{left} and \code{right}, describing each observed value as an interval as required in fitdistcens(): the \code{left} column contains either \code{NA} for left-censored observations, the left bound of the interval for interval-censored observations, or the observed value for non-censored observations. The \code{right} column contains either \code{NA} for right-censored observations, the right bound of the interval for interval-censored observations, or the observed value for non-censored observations.
}
\value{
\code{Surv2fitdistcens} returns a data.frame with two columns respectively named \code{left} and \code{right}. 
}
\seealso{ 
See \code{\link{fitdistrplus}} for an overview of the package.
See \code{\link{fitdistcens}} for fitting of univariate distributions to censored data
and \code{\link{fremale}} for the full dataset used in the examples below.
See \code{\link[survival:Surv]{Surv}} for survival objects which use the same arguments.
}
\references{ 
Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}.
Journal of Statistical Software, 64(4), 1-34.
}
\author{ 
Christophe Dutang and Marie-Laure Delignette-Muller.
}
\examples{
# (1) randomized fictitious survival data - right-censored
#
origdata <- data.frame(rbind(
c( 43.01, 55.00, 0),
c( 36.37, 47.17, 0),
c( 33.10, 34.51, 0),
c( 71.00, 81.15, 1),
c( 80.89, 81.91, 1),
c( 67.81, 78.48, 1),
c( 73.98, 76.92, 1),
c( 53.19, 54.80, 1)))
colnames(origdata) <- c("AgeIn", "AgeOut", "Death")

# add of follow-up time (for type = "right" in Surv())
origdata$followuptime <- origdata$AgeOut - origdata$AgeIn
origdata

### use of default survival type "right"
# in Surv()
survival::Surv(time = origdata$followuptime, event = origdata$Death, type = "right")

# for fitdistcens()
Surv2fitdistcens(origdata$followuptime, event = origdata$Death, type = "right")

# use of survival type "interval"
# in Surv()
survival::Surv(time = origdata$followuptime, time2 = origdata$followuptime, 
  event = origdata$Death, type = "interval")

# for fitdistcens()
Surv2fitdistcens(time = origdata$followuptime, time2 = origdata$followuptime,
  event = origdata$Death, type = "interval")

# use of survival type "interval2"
origdata$survivalt1 <- origdata$followuptime
origdata$survivalt2 <- origdata$survivalt1
origdata$survivalt2[1:3] <- Inf
origdata
survival::Surv(time = origdata$survivalt1, time2 = origdata$survivalt2,
  type = "interval2")
Surv2fitdistcens(origdata$survivalt1, time2 = origdata$survivalt2,
  type = "interval2")

# (2) Other examples with various left, right and interval censored values
#
# with left censored data
(d1 <- data.frame(time = c(2, 5, 3, 7), ind = c(0, 1, 1, 1)))
survival::Surv(time = d1$time, event = d1$ind, type = "left")
Surv2fitdistcens(time = d1$time, event = d1$ind, type = "left")

(d1bis <- data.frame(t1 = c(2, 5, 3, 7), t2 = c(2, 5, 3, 7), 
  censtype = c(2, 1, 1, 1)))
survival::Surv(time = d1bis$t1, time2 = d1bis$t2,
  event = d1bis$censtype, type = "interval")
Surv2fitdistcens(time = d1bis$t1, time2 = d1bis$t2,
  event = d1bis$censtype, type = "interval")

# with interval, left and right censored data
(d2 <- data.frame(t1 = c(-Inf, 2, 3, 4, 3, 7), t2 = c(2, 5, 3, 7, 8, Inf)))
survival::Surv(time = d2$t1, time2 = d2$t2, type = "interval2")
Surv2fitdistcens(time = d2$t1, time2 = d2$t2, type = "interval2")

(d2bis <- data.frame(t1 = c(2, 2, 3, 4, 3, 7), t2 = c(2, 5, 3, 7, 8, 7), 
  censtype = c(2,3,1,3,3,0)))
survival::Surv(time = d2bis$t1, time2 = d2bis$t2,
  event = d2bis$censtype, type = "interval")
Surv2fitdistcens(time = d2bis$t1, time2 = d2bis$t2,
  event = d2bis$censtype, type = "interval")
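
# (3) a minimal sketch (added for illustration, not from the original
# examples): the converted data frame can be passed directly to
# fitdistcens(), here fitting a lognormal distribution to the
# right-censored follow-up times of example (1)
dcens <- Surv2fitdistcens(origdata$followuptime, event = origdata$Death,
  type = "right")
fitdistcens(dcens, "lnorm")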
}
\keyword{distribution}fitdistrplus/man/logLik-surface.Rd0000644000176200001440000001142313742313702016720 0ustar liggesusers\name{logLiksurface}
\alias{llsurface}
\alias{llcurve}
\title{(Log)likelihood surfaces or (log)likelihood curves}
\description{
   \code{llsurface} plots the likelihood surface for distributions with two or more parameters;
   \code{llcurve} plots the likelihood curve for distributions with one or more parameters.
}
\usage{
llsurface(data, distr, plot.arg, min.arg, max.arg, lseq = 50, fix.arg = NULL,
    loglik = TRUE, back.col = TRUE, nlev = 10, pal.col = terrain.colors(100),
    weights = NULL, \dots)
llcurve(data, distr, plot.arg, min.arg, max.arg, lseq = 50, fix.arg = NULL, 
    loglik = TRUE, weights = NULL, \dots)
}
\arguments{
\item{data}{A numeric vector for non censored data or a dataframe of two columns respectively named left and right, describing each observed value as an interval for censored data. In that case the left column contains either NA for left censored observations, the left bound of the interval for interval censored observations, or the observed value for non-censored observations. The right column contains either NA for right censored observations, the right bound of the interval for interval censored observations, or the observed value for non-censored observations.}
\item{distr}{A character string "name" naming a distribution for which the corresponding density function dname and the corresponding distribution function pname must be classically defined.}
\item{plot.arg}{a two-element vector with the names of the two parameters that will vary for \code{llsurface}, only one element for \code{llcurve}.}
\item{min.arg}{a two-element vector with lower plotting bounds for \code{llsurface}, only one element for \code{llcurve}.}
\item{max.arg}{a two-element vector with upper plotting bounds for \code{llsurface}, only one element for \code{llcurve}.}
\item{lseq}{length of sequences of parameters.}
\item{fix.arg}{a named list with fixed values of the other parameters.}
\item{loglik}{a logical to plot the log-likelihood or the likelihood function.}
\item{back.col}{logical (for \code{llsurface} only). Contours are plotted with a background gradient of colors if TRUE.}
\item{nlev}{number of contour levels to plot (for \code{llsurface} only).}
\item{pal.col}{Palette of colors. Colors to be used as background (for \code{llsurface} only).}
\item{weights}{an optional vector of weights to be used in the fitting process. Should be \code{NULL} or a numeric vector with strictly positive values (classically the number of occurrences of each observation).}
\item{\dots}{Further graphical arguments passed to graphical functions.}
}
\details{
These two functions are not intended to be called directly but are internally called in \code{\link{llplot}}.

\code{llsurface} plots the likelihood surface for distributions with two varying parameters and other parameters fixed. When \code{back.col = TRUE}, \code{\link{image}} (2D-plot) is used. When \code{nlev > 0}, \code{\link{contour}} (2D-plot) is used to add \code{nlev} contours.

\code{llcurve} plots the likelihood curve for distributions with one varying parameter and other parameters fixed.
}
\seealso{ 
See \code{\link{llplot}} for automatic (log)likelihood plots (surface or curve) of an object of class \code{"fitdist"} or \code{"fitdistcens"} and \code{\link{plot}}, \code{\link{contour}}, \code{\link{image}} for classic plotting functions.
}
\references{ 
Delignette-Muller ML and Dutang C (2015), \emph{fitdistrplus: An R Package for Fitting Distributions}. Journal of Statistical Software, 64(4), 1-34.
}
\author{ 
Marie-Laure Delignette-Muller and Christophe Dutang.
}
\examples{
# (1) loglikelihood or likelihood curve
#
n <- 100
set.seed(1234)
x <- rexp(n)

llcurve(data = x, distr = "exp", plot.arg = "rate", min.arg = 0, max.arg = 4)

llcurve(data = x, distr = "exp", plot.arg = "rate", min.arg = 0, max.arg = 4, 
loglik = FALSE)

llcurve(data = x, distr = "exp", plot.arg = "rate", min.arg = 0, max.arg = 4, 
main = "log-likelihood for exponential distribution", col = "red")
abline(v = 1, lty = 2)

# (2) loglikelihood surface
# 
x <- rnorm(n, 0, 1)

llsurface(data = x, distr = "norm", plot.arg = c("mean", "sd"),
          min.arg = c(-1, 0.5), max.arg = c(1, 3/2), back.col = FALSE,
          main = "log-likelihood for normal distribution")

llsurface(data = x, distr = "norm", plot.arg = c("mean", "sd"),
          min.arg = c(-1, 0.5), max.arg = c(1, 3/2),
          main = "log-likelihood for normal distribution",
          nlev = 20, pal.col = heat.colors(100))
points(0, 1, pch="+", col="red")

llsurface(data = x, distr = "norm", plot.arg = c("mean", "sd"),
          min.arg = c(-1, 0.5), max.arg = c(1, 3/2),
          main = "log-likelihood for normal distribution",
          nlev = 0, back.col = TRUE, pal.col = rainbow(100, s = 0.5, end = 0.8))
points(0, 1, pch="+", col="black")
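
# (3) log-likelihood curve with one parameter fixed: a minimal sketch
# (added for illustration, not from the original examples) profiling the
# standard deviation with the mean fixed at 0 through fix.arg
llcurve(data = x, distr = "norm", plot.arg = "sd", min.arg = 0.5, max.arg = 1.5,
        fix.arg = list(mean = 0))
abline(v = sd(x), lty = 2)  # empirical standard deviation, for reference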
}
\keyword{ distribution }
fitdistrplus/man/fitdistrplus.Rd0000644000176200001440000000735114205417001016604 0ustar liggesusers\name{fitdistrplus-package}
\alias{fitdistrplus}
\alias{fitdistrplus-package}
\title{Overview of the \pkg{fitdistrplus} package}
\description{
The idea of this package emerged in 2008 from a collaboration between JB Denis, R Pouillot and ML Delignette who at that time worked in the area of quantitative risk assessment. The implementation of this package was a part of a more general project named "Risk assessment with R" gathering different packages and hosted on \href{https://r-forge.r-project.org/projects/riskassessment/}{R-forge}.

The \pkg{fitdistrplus} package was first written by ML Delignette-Muller and made available on \href{https://cran.r-project.org/package=fitdistrplus}{CRAN} in 2009 and presented at the \href{https://www.r-project.org/conferences/useR-2009/}{2009 useR conference} in Rennes. A few months later, C Dutang joined the project and started to participate in the implementation of the \pkg{fitdistrplus} package. The package has also been presented at the \href{https://www.r-project.org/conferences/useR-2011/}{2011 useR conference} and at the 2eme rencontres R in 2013 (https://r2013-lyon.sciencesconf.org/).

Three vignettes are available within the package:
\itemize{
\item a \href{../doc/paper2JSS.pdf}{general overview} of the package published in the Journal of Statistical Software (\doi{10.18637/jss.v064.i04}),
\item a html document answering the most \href{../doc/FAQ.html}{Frequently Asked Questions},
\item a html document presenting a \href{../doc/Optimalgo.html}{benchmark of optimization algorithms} when finding parameters.
}

The \pkg{fitdistrplus} package is a general package that aims at helping the fit of univariate parametric distributions to censored or non-censored data. The two main functions are \code{\link{fitdist}} for fit on non-censored data and \code{\link{fitdistcens}} for fit on censored data. 

The choice of candidate distributions to fit may be helped using the functions \code{\link{descdist}} and \code{\link{plotdist}} for non-censored data and \code{\link{plotdistcens}} for censored data.

Using functions \code{\link{fitdist}} and \code{\link{fitdistcens}}, different methods can be used to estimate the distribution parameters: 
\itemize{
\item maximum likelihood estimation by default (\code{\link{mledist}}), 
\item moment matching estimation (\code{\link{mmedist}}), 
\item quantile matching estimation (\code{\link{qmedist}}), 
\item maximum goodness-of-fit estimation (\code{\link{mgedist}}). 
}
For classical distributions, initial values are automatically calculated if not provided by the user. Graphical functions \code{\link{plotdist}} and \code{\link{plotdistcens}} can be used to help a manual calibration of initial values for parameters of non-classical distributions. Function \code{\link{prefit}} is proposed to help the definition of good starting values in the special case of constrained parameters. In the case where maximum likelihood is chosen as the estimation method, function \code{\link{llplot}} makes it possible to visualize log-likelihood contours.

The goodness-of-fit of fitted distributions (a single fit or multiple fits) can be explored using different graphical functions (\code{\link{cdfcomp}}, \code{\link{denscomp}}, \code{\link{qqcomp}} and \code{\link{ppcomp}} for non-censored data and \code{\link{cdfcompcens}} for censored data). Goodness-of-fit statistics are also provided for non-censored data using function \code{\link{gofstat}}. Bootstrap is proposed to quantify the uncertainty on parameter estimates (functions \code{\link{bootdist}} and \code{\link{bootdistcens}}) and also to quantify the uncertainty on CDF or quantiles estimated from the fitted distribution (\code{\link{quantile}} and \code{\link{CIcdfplot}}).
}
\author{
Marie-Laure Delignette-Muller and Christophe Dutang.
}
fitdistrplus/DESCRIPTION0000644000176200001440000000474414421764232014526 0ustar liggesusersPackage: fitdistrplus
Title: Help to Fit of a Parametric Distribution to Non-Censored or
        Censored Data
Version: 1.1-11
Authors@R: c(person("Marie-Laure", "Delignette-Muller", role = "aut", email = "marielaure.delignettemuller@vetagro-sup.fr", comment = c(ORCID = "0000-0001-5453-3994")),
             person("Christophe", "Dutang", role = "aut", email = "christophe.dutang@ensimag.fr", comment = c(ORCID = "0000-0001-6732-1501")),
             person("Regis", "Pouillot", role = "ctb"),
             person("Jean-Baptiste", "Denis", role = "ctb"),
             person("Aurelie", "Siberchicot", role = c("aut", "cre"), email = "aurelie.siberchicot@univ-lyon1.fr", comment = c(ORCID = "0000-0002-7638-8318")))
Description: Extends the fitdistr() function (of the MASS package) with several functions
        to help the fit of a parametric distribution to non-censored or censored data.
        Censored data may contain left censored, right censored and interval censored
        values, with several lower and upper bounds. In addition to maximum likelihood
        estimation (MLE), the package provides moment matching (MME), quantile matching
        (QME), maximum goodness-of-fit estimation (MGE) and maximum spacing estimation
        (MSE) methods (available only for non-censored data). Weighted versions of MLE,
        MME, QME and MSE are available. See e.g. Casella & Berger (2002), Statistical
        inference, Pacific Grove, for a general introduction to parametric estimation. 
Depends: R (>= 3.5.0), MASS, grDevices, survival, methods Imports: stats Suggests: actuar, rgenoud, mc2d, gamlss.dist, knitr, ggplot2, GeneralizedHyperbolic, rmarkdown, Hmisc, bookdown VignetteBuilder: knitr BuildVignettes: true License: GPL (>= 2) Encoding: UTF-8 URL: https://lbbe.univ-lyon1.fr/fr/fitdistrplus, https://github.com/aursiber/fitdistrplus BugReports: https://github.com/aursiber/fitdistrplus/issues Contact: Marie-Laure Delignette-Muller <marielaure.delignettemuller@vetagro-sup.fr> or Christophe Dutang <christophe.dutang@ensimag.fr> NeedsCompilation: no Packaged: 2023-04-25 13:20:20 UTC; siberchicot Author: Marie-Laure Delignette-Muller [aut] (<https://orcid.org/0000-0001-5453-3994>), Christophe Dutang [aut] (<https://orcid.org/0000-0001-6732-1501>), Regis Pouillot [ctb], Jean-Baptiste Denis [ctb], Aurelie Siberchicot [aut, cre] (<https://orcid.org/0000-0002-7638-8318>) Maintainer: Aurelie Siberchicot <aurelie.siberchicot@univ-lyon1.fr> Repository: CRAN Date/Publication: 2023-04-25 14:50:02 UTC fitdistrplus/build/0000755000176200001440000000000014421751624014113 5ustar liggesusersfitdistrplus/build/vignette.rds0000644000176200001440000000050114421751624016446 0ustar liggesusers[binary RDS content omitted] fitdistrplus/build/partial.rdb0000644000176200001440000000007414421751555016244 0ustar liggesusers[binary content omitted] fitdistrplus/tests/0000755000176200001440000000000014212327571014154 5ustar liggesusersfitdistrplus/tests/t-logLik-vcov-coef.R0000644000176200001440000000076413742313702017653 0ustar liggesuserslibrary(fitdistrplus)
# (1) basic fit of a gamma distribution by maximum likelihood estimation #
data(groundbeef)
serving <- groundbeef$serving
fitg <- fitdist(serving, "gamma")
logLik(fitg)
vcov(fitg)
coef(fitg)
fitg <- fitdist(serving, "gamma", method="mme")
logLik(fitg)
vcov(fitg)
coef(fitg)
# (2) Fit of a lognormal distribution to bacterial contamination data #
data(smokedfish)
fitsf <- fitdistcens(smokedfish,"lnorm")
logLik(fitsf)
vcov(fitsf)
coef(fitsf)
fitdistrplus/tests/t-plotdistcens.R0000644000176200001440000000430713742313702017255 0ustar liggesuserslibrary(fitdistrplus)
# (1) Plot of an empirical censored distribution (censored data) as a CDF # using the default Turnbull method #
data(smokedfish)
plotdistcens(smokedfish)
d1 <- as.data.frame(log10(smokedfish))
plotdistcens(d1)
# (2) Add the CDF of a normal distribution and QQ and PP plots #
plotdistcens(smokedfish,"lnorm", para=list(meanlog=-3.6,sdlog=3.5))
plotdistcens(d1,"norm", para=list(mean=-1.6,sd=1.5))
# (3) Various plots of the same empirical distribution # # default Wang plot
plotdistcens(d1, NPMLE = TRUE, NPMLE.method = "Wang")
plotdistcens(d1, NPMLE = TRUE, NPMLE.method = "Wang", lwd = 3, main = "Wang ECDF plot")
# Turnbull plot
plotdistcens(d1, NPMLE = TRUE, NPMLE.method = "Turnbull", col = "red", main = "Turnbull ECDF plot")
plotdistcens(d1,Turnbull = TRUE) # deprecated way to do it
# Turnbull plot with confidence intervals
plotdistcens(d1,NPMLE = TRUE, NPMLE.method = "Turnbull", Turnbull.confint = TRUE)
plotdistcens(d1,Turnbull = TRUE,Turnbull.confint = TRUE) # deprecated way to do it
# with intervals and points
plotdistcens(d1,NPMLE = FALSE)
plotdistcens(d1,NPMLE = FALSE, col = "red", lwd = 2)
plotdistcens(d1,rightNA=3, NPMLE = FALSE)
plotdistcens(d1,rightNA=3, Turnbull = FALSE) # deprecated way to do it
# with intervals and points # defining a minimum value for left censored values
plotdistcens(d1,leftNA=-3, NPMLE = FALSE)
# (4) Goodness-of-fit plots for the same dataset after logarithmic transformation # with a lognormal distribution, successively using the three proposed methods #
d3 <- smokedfish
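# the three calls below illustrate, in turn, the (default) Wang NPMLE plot,
# the Turnbull NPMLE plot and the plot of ordered intervals (NPMLE = FALSE)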
plotdistcens(d3,"lnorm",para=list(meanlog=-3.6,sdlog=3.5), main = "Wang plot") plotdistcens(d3,"lnorm",para=list(meanlog=-3.6,sdlog=3.5), NPMLE.method = "Turnbull", main = "Turnbull plot") plotdistcens(d3,"lnorm",para=list(meanlog=-3.6,sdlog=3.5), NPMLE = FALSE, leftNA=0, main = "Plot of ordered intervals") # Test with the salinity data set # data(salinity) log10LC50 <-log10(salinity) plotdistcens(log10LC50) plotdistcens(log10LC50, NPMLE.method = "Turnbull") plotdistcens(log10LC50, NPMLE = FALSE) fn <- fitdistcens(log10LC50,"norm") fl <- fitdistcens(log10LC50,"logis") plot(fn) plot(fl) fitdistrplus/tests/t-llplot.R0000644000176200001440000000374613742313702016056 0ustar liggesusersrequire(fitdistrplus) visualize <- FALSE # TRUE for manual tests with visualization of results nsample <- 10000 nsample <- 10 # (1) tests with the Burr distribution (three parameters) # if(any(installed.packages()[, "Package"] == "actuar")) { require(actuar) data(endosulfan) ATV <-endosulfan$ATV library("actuar") fBurr <- fitdist(ATV, "burr", start = list(shape1 = 0.3, shape2 = 1, rate = 1)) llplot(fBurr) fBurr2 <- fitdist(ATV, "burr", start = list(shape1 = 0.3, shape2 = 1), fix.arg = list(rate = 1.5)) llplot(fBurr2) fBurr3 <- fitdist(ATV, "burr", start = list(shape1 = 0.3, rate = 1), fix.arg = list(shape2 = 1.5)) llplot(fBurr3) } # (2) An example on discrete data with or without weights # set.seed(1234) x <- rpois(nsample, 10) xtab <- table(x) xval <- sort(unique(x)) f1 <- fitdist(x, "pois") f2 <- fitdist(xval, "pois", weights = xtab) f1$estimate f2$estimate # should give the same llplot(f1, fit.show = TRUE) llplot(f2, fit.show = TRUE) # should give the same llplot(f1, loglik = FALSE, fit.show = TRUE) llplot(f2, loglik = FALSE,fit.show = TRUE) # should give the same # (3) An example on censored data with or without weights # if(visualize) { data(salinity) salinity.unique <- unique(salinity) string.unique <- paste(salinity.unique$left, salinity.unique$right) string.salinity <- paste(salinity$left, salinity$right) nobs <- nrow(salinity.unique) salinity.weights <- numeric(nobs) for (i in 1:nobs) { salinity.weights[i] <- length(which(string.salinity == string.unique[i])) } cbind(salinity.unique, salinity.weights) (fa <- fitdistcens(salinity, "lnorm")) (fb <- fitdistcens(salinity.unique, "lnorm", weights = salinity.weights)) llplot(fa, fit.show = TRUE) llplot(fb, fit.show = TRUE) # should give the same llplot(fa, fit.show = TRUE, loglik = FALSE) llplot(fb, fit.show = TRUE, loglik = FALSE) # should give the same }fitdistrplus/tests/t-qqcomp.R0000644000176200001440000001745114050660542016046 0ustar liggesuserslibrary(fitdistrplus) # ?qqcomp visualize <- FALSE # TRUE for manual tests with visualization of results nsample <- 1000 nsample <- 10 # (1) Plot various distributions fitted to serving size data # data(groundbeef) serving <- groundbeef$serving fitW <- fitdist(serving, "weibull") fitln <- fitdist(serving, "lnorm") fitg <- fitdist(serving, "gamma") #sanity checks try(qqcomp("list(fitW, fitln, fitg)"), silent = TRUE) try(qqcomp(list(fitW, fitln, fitg, a = 1)), silent = TRUE) #real call qqcomp(list(fitW, fitln, fitg)) qqcomp(list(fitW, fitln, fitg), legendtext = c("Weibull", "lognormal", "gamma"), main = "ground beef fits", xlab = "Theo.", ylab = "serving sizes (g)", xlim = c(0, 250)) qqcomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits", xlab="Theo.", ylab="serving sizes (g)", xlogscale=TRUE) qqcomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), 
main="ground beef fits", xlab="Theo.", ylab="serving sizes (g)", ylogscale=TRUE) qqcomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits", ylim=c(1, 250), xlim=c(1, 250), fitpch=c("+", "-", ".")) if (requireNamespace ("ggplot2", quietly = TRUE)) { qqcomp(list(fitW, fitln, fitg), plotstyle = "ggplot") } if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { qqcomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits", xlab="Theo.", ylab="serving sizes (g)", xlim = c(0,250), plotstyle = "ggplot") qqcomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits", xlab="Theo.", ylab="serving sizes (g)", xlogscale=TRUE, plotstyle = "ggplot") qqcomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits", xlab="Theo.", ylab="serving sizes (g)", ylogscale=TRUE, plotstyle = "ggplot") qqcomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits", ylim=c(1, 250), xlim=c(1, 250), fitpch=c("+", "-", "."), plotstyle = "ggplot") } # (2) Plot lognormal distributions fitted by # maximum goodness-of-fit estimation # using various distances (data plotted in log scale) # data(endosulfan) ATV <-subset(endosulfan, group == "NonArthroInvert")$ATV flnMGEKS <- fitdist(ATV,"lnorm",method="mge",gof="KS") flnMGEAD <- fitdist(ATV,"lnorm",method="mge",gof="AD") flnMGEADL <- fitdist(ATV,"lnorm",method="mge",gof="ADL") flnMGEAD2L <- fitdist(ATV,"lnorm",method="mge",gof="AD2L") llfit <- list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L) qqcomp(llfit, main="fits of a lognormal dist. using various GOF dist.") qqcomp(llfit, xlogscale=TRUE, main="fits of a lognormal dist. using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L")) qqcomp(llfit, xlogscale=TRUE, main="fits of a lognormal dist. using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L"), fitcol=c("black", "darkgreen", "yellowgreen", "yellow2")) qqcomp(llfit, ynoise=FALSE, xlogscale=TRUE, ylogscale=TRUE, xlim=c(10,100000), ylim=c(10,100000)) qqcomp(flnMGEKS, xlogscale=TRUE, xlim=c(10,100000)) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { qqcomp(llfit, main="fits of a lognormal dist. using various GOF dist.", plotstyle = "ggplot") qqcomp(llfit, xlogscale=TRUE, main="fits of a lognormal dist. using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L"), plotstyle = "ggplot") qqcomp(llfit, xlogscale=TRUE, main="fits of a lognormal dist. 
using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L"), fitcol=c("black", "darkgreen", "yellowgreen", "yellow2"), plotstyle = "ggplot")
qqcomp(llfit, ynoise=FALSE, xlogscale=TRUE, ylogscale=TRUE, xlim=c(10,100000), ylim=c(10,100000), plotstyle = "ggplot")
qqcomp(flnMGEKS, xlogscale=TRUE, xlim=c(10,100000), plotstyle = "ggplot") }
# (3) Comparison of a manual QQ-plot with qqcomp for normal, Gumbel and # exponential distributions fitted by maximum likelihood #
if (visualize) { x1 <- c(6.4,13.3,4.1,1.3,14.1,10.6,9.9,9.6,15.3,22.1,13.4,13.2,8.4,6.3,8.9,5.2,10.9,14.4)
n1 <- length(x1)
dgumbel <- function(x, a, b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b))
pgumbel <- function(q, a, b) exp(-exp((a-q)/b))
qgumbel <- function(p, a, b) a - b*log(-log(p))
f1 <- mledist(x1, "norm")
f2 <- mledist(x1, "gumbel", start = list(a = 10, b = 5))
f3 <- mledist(x1, "exp")
xx1 <- qnorm(1:n1/n1, f1$estimate[1], f1$estimate[2])
xx2 <- qgumbel(1:n1/n1, f2$estimate[1], f2$estimate[2])
xx3 <- qexp(1:n1/n1, f3$estimate[1])
xlim <- c(xx1, xx2, xx3)
xlim <- range(xlim[which(is.finite(xlim))])
# graph 1
plot(xx1, sort(x1), col="red", xlim = xlim)
points(xx2, sort(x1), col = "green")
points(xx3, sort(x1), col = "blue")
legend("bottomright", pch = 1, legend = c("Normal","Gumbel","Exp"), col = c("red", "green", "blue"))
# graph 2
f1 <- fitdist(x1,"norm")
f2 <- fitdist(x1,"gumbel",start=list(a=10,b=5))
f3 <- fitdist(x1, "exp")
qqcomp(list(f1, f2, f3), fitcol=c("red","green","blue"), ynoise = FALSE, legendtext = c("Normal","Gumbel","Exp"))
# graph 3
if (requireNamespace ("ggplot2", quietly = TRUE)) { qqcomp(list(f1, f2, f3), fitcol=c("red","green","blue"), ynoise = FALSE, legendtext = c("Normal","Gumbel","Exp"), plotstyle = "gg") } }
# (4) normal mixture # #mixture of two normal distributions #density
dnorm2 <- function(x, poid, m1, s1, m2, s2) poid*dnorm(x, m1, s1) + (1-poid)*dnorm(x, m2, s2)
#numerical approximate quantile function
qnorm2 <- function(p, poid, m1, s1, m2, s2) { L2 <- function(x, prob) (prob - pnorm2(x, poid, m1, s1, m2, s2))^2 sapply(p, function(pr) optimize(L2, c(-1000, 1000), prob=pr)$minimum) }
#distribution function
pnorm2 <- function(q, poid, m1, s1, m2, s2) poid*pnorm(q, m1, s1) + (1-poid)*pnorm(q, m2, s2)
#basic normal distribution
set.seed(1234)
x2 <- c(rnorm(nsample, 5), rnorm(nsample, 10))
#MLE fit
fit1 <- fitdist(x2, "norm2", "mle", start=list(poid=1/3, m1=4, s1=2, m2=8, s2=2), lower=c(0, 0, 0, 0, 0))
fit2 <- fitdist(x2, "norm2", "qme", probs=c(1/6, 1/4, 1/3, 1/2, 2/3), start=list(poid=1/3, m1=4, s1=2, m2=8, s2=2), lower=c(0, 0, 0, 0, 0), upper=c(1/2, Inf, Inf, Inf, Inf))
fit3 <- fitdist(x2, "norm2", "mge", gof="AD", start=list(poid=1/3, m1=4, s1=2, m2=8, s2=2), lower=c(0, 0, 0, 0, 0), upper=c(1/2, Inf, Inf, Inf, Inf))
qqcomp(list(fit1, fit2, fit3), fitpch=rep(".", 3), fitcol=c("green", "red", "blue"))
if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { qqcomp(list(fit1, fit2, fit3), fitpch=rep(".", 3), fitcol=c("green", "red", "blue"), plotstyle = "gg") }
# (5) large data # { n <- 2e4 # n <- 1e2
x <- rlnorm(n)
f1 <- fitdist(x, "lnorm")
f2 <- fitdist(x, "exp")
qqcomp(list(f1, f2), fitpch=2)
qqcomp(list(f1, f2), fitlwd=4)
if (requireNamespace ("ggplot2", quietly = TRUE)) { qqcomp(list(f1, f2), fitpch=2, plotstyle = "gg") qqcomp(list(f1, f2), fitlwd=2, plotstyle = "gg") } }
# (6) test legend labels # if (visualize) { serving <- groundbeef$serving fitW <- fitdist(serving,"weibull") fitW2 <- fitdist(serving,"weibull", method="qme", probs=c(1/3,2/3)) fitW3 <-
fitdist(serving,"weibull", method="qme", probs=c(1/2,2/3)) fitln <- fitdist(serving,"lnorm") fitg <- fitdist(serving,"gamma") qqcomp(list(fitW, fitln, fitg)) #distrib qqcomp(list(fitW, fitW2, fitln, fitg)) #distrib+method qqcomp(list(fitW, fitW2, fitW3, fitln, fitg)) #distrib+method+num if (requireNamespace ("ggplot2", quietly = TRUE)) qqcomp(list(fitW, fitW2, fitW3, fitln, fitg), plotstyle = "ggplot") #distrib+method+num } fitdistrplus/tests/t-weird-ppcomp-cens.R0000644000176200001440000000760513742313702020102 0ustar liggesuserslibrary(fitdistrplus) visualize <- FALSE # TRUE for manual tests with visualization of results set.seed(123) n <- 20 if (visualize) { # (1) test ppcomp/ppcompcens on a good example # x <- rlnorm(n, 0, 1) dx <- data.frame(left=x, right=x) dx$right[1:(n/2)*2] <- NA dx$left[2:(n/4)*4-1] <- NA f1 <- fitdist(x, "lnorm") f1c <- fitdistcens(dx, "lnorm") f3 <- fitdist(x, "lnorm", fix.arg=list(sdlog=1)) f3c <- fitdistcens(dx, "lnorm", fix.arg=list(sdlog=1)) par(mfrow=1:2, mar=c(4,4,2,1)) ppcomp(f1) ppcompcens(f1c) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { ppcomp(f1, plotstyle = "ggplot") ppcompcens(f1c, plotstyle = "ggplot") } #test log-scale par(mfrow=1:2, mar=c(4,4,2,1)) ppcomp(f1, xlogscale = TRUE, ylogscale = TRUE) ppcompcens(f1c, xlogscale = TRUE, ylogscale = TRUE, xlim=c(.1, 1), ylim=c(.1, 1)) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { ppcomp(f1, xlogscale = TRUE, ylogscale = TRUE, plotstyle = "ggplot") ppcompcens(f1c, xlogscale = TRUE, ylogscale = TRUE, xlim=c(.1, 1), ylim=c(.1, 1), plotstyle = "ggplot") } # (2) test ppcomp/ppcompcens on a weird example # f2 <- fitdist(x, "unif") f2c <- fitdistcens(dx, "unif") par(mfrow=1:2, mar=c(4,4,2,1)) ppcomp(list(f1, f2, f3)) ppcompcens(list(f1c, f2c, f3c)) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { ppcomp(list(f1, f2, f3), plotstyle = "ggplot") ppcompcens(list(f1c, f2c, f3c), plotstyle = "ggplot") } #test log-scale par(mfrow=1:2, mar=c(4,4,2,1)) ppcomp(list(f1, f2, f3), xlogscale = TRUE, ylogscale = TRUE, xlim=c(.1, 1), ylim=c(.1, 1)) ppcompcens(list(f1c, f2c, f3c), xlogscale = TRUE, ylogscale = TRUE, xlim=c(.1, 1), ylim=c(.1, 1)) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { ppcomp(list(f1, f2, f3), xlogscale = TRUE, ylogscale = TRUE, xlim=c(.1, 1), ylim=c(.1, 1), plotstyle = "ggplot") ppcompcens(list(f1c, f2c, f3c), xlogscale = TRUE, ylogscale = TRUE, xlim=c(.1, 1), ylim=c(.1, 1), plotstyle = "ggplot") } #test y noise par(mfrow=1:2, mar=c(4,4,2,1)) ppcomp(list(f1, f2, f3)) ppcomp(list(f1, f2, f3), ynoise=FALSE) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { ppcomp(list(f1, f2, f3), plotstyle = "ggplot") ppcomp(list(f1, f2, f3), ynoise=FALSE, plotstyle = "ggplot") } par(mfrow=1:2, mar=c(4,4,2,1)) ppcompcens(list(f1c, f2c, f3c)) ppcompcens(list(f1c, f2c, f3c), ynoise=FALSE) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { ppcompcens(list(f1c, f2c, f3c), ynoise=FALSE, plotstyle = "ggplot") } #test log-scale y-noise par(mfrow=1:2, mar=c(4,4,2,1)) ppcomp(list(f1, f2, f3), xlogscale = TRUE, ylogscale = TRUE, xlim=c(.1, 1), ylim=c(.1, 1)) ppcomp(list(f1, f2, f3), xlogscale = TRUE, ylogscale = TRUE, xlim=c(.1, 1), ylim=c(.1, 1), ynoise=FALSE) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { ppcomp(list(f1, f2, f3), xlogscale = TRUE, ylogscale = TRUE, xlim=c(.1, 1), ylim=c(.1, 1), plotstyle = "ggplot") ppcomp(list(f1, f2, f3), xlogscale = TRUE, 
ylogscale = TRUE, xlim=c(.1, 1), ylim=c(.1, 1), plotstyle = "ggplot") } par(mfrow=1:2, mar=c(4,4,2,1)) ppcompcens(list(f1c, f2c), xlogscale = TRUE, ylogscale = TRUE, xlim=c(.1, 1), ylim=c(.1, 1)) ppcompcens(list(f1c, f2c), xlogscale = TRUE, ylogscale = TRUE, xlim=c(.1, 1), ylim=c(.1, 1), ynoise=FALSE) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { ppcompcens(list(f1c, f2c), xlogscale = TRUE, ylogscale = TRUE, xlim=c(.1, 1), ylim=c(.1, 1), plotstyle = "ggplot") ppcompcens(list(f1c, f2c), xlogscale = TRUE, ylogscale = TRUE, xlim=c(.1, 1), ylim=c(.1, 1), plotstyle = "ggplot") } }fitdistrplus/tests/t-manageparam.R0000644000176200001440000000335613742313702017016 0ustar liggesuserslibrary(fitdistrplus) manageparam <- fitdistrplus:::manageparam obs1 <- rnorm(10) s1 <- NULL s2 <- list("mean"=2, "sd"=3) s3 <- function(x) list("mean"=1.01*mean(x)) s4 <- list("mean"=1) f1 <- NULL f2 <- list("sd"=3) f3 <- function(x) list("sd"=1.01*sd(x)) f4 <- list("toto"=2) #no error manageparam(s1, f1, obs1, "norm") manageparam(s2, f1, obs1, "norm") manageparam(s3, f1, obs1, "norm") manageparam(s1, f2, obs1, "norm") manageparam(s1, f3, obs1, "norm") #raise error try(manageparam(matrix(3), f1, obs1, "norm")) try(manageparam(function(x) c("a"=33), f1, obs1, "norm")) try(manageparam(function(x) list(33), f1, obs1, "norm")) try(manageparam(NULL, list(mean=1, sd=1), obs1, "norm")) #no error checkparamlist <- fitdistrplus:::checkparamlist myformal <- names(formals("dnorm")) res <- manageparam(s1, f1, obs1, "norm") checkparamlist(res$start.arg, res$fix.arg, myformal) res <- manageparam(s1, f2, obs1, "norm") checkparamlist(res$start.arg, res$fix.arg, myformal) res <- manageparam(s1, f3, obs1, "norm") checkparamlist(res$start.arg, res$fix.arg, myformal) res <- manageparam(s2, f1, obs1, "norm") checkparamlist(res$start.arg, res$fix.arg, myformal) #raise errors res <- manageparam(s1, f4, obs1, "norm") try(checkparamlist(res$start.arg, res$fix.arg, myformal)) res <- manageparam(s2, f2, obs1, "norm") try(checkparamlist(res$start.arg, res$fix.arg, myformal)) res <- manageparam(s2, f3, obs1, "norm") try(checkparamlist(res$start.arg, res$fix.arg, myformal)) #no error fitdist(obs1, "norm", start=NULL, fix.arg=NULL) fitdist(obs1, "norm", start=NULL, fix.arg=f3) #raise error try(fitdist(obs1, "norm", start=NULL, fix.arg=f4)) try(fitdist(obs1, "norm", start=s2, fix.arg=f2)) try(fitdist(obs1, "norm", start=s2, fix.arg=f3)) fitdistrplus/tests/t-mledist-paramsupport.R0000644000176200001440000000307614050660542020740 0ustar liggesuserslibrary(fitdistrplus) set.seed(1234) nsample <- 10 # (1) uniform distribution fit - no fixed value #manual check dunif2 <- function(x, min, max) dunif(x, min, max) punif2 <- function(q, min, max) punif(q, min, max) x1 <- runif(nsample, 3, 5) L <- function(a, b, obs) prod(dunif(obs, min=a, max=b)) l <- Vectorize(L, "a") curve(l(x, b=5, obs=x1), from=1, to=3) f1 <- fitdist(x1, "unif") f2 <- fitdist(x1, "unif2", start=list(min=0, max=10), lower=c(-Inf, max(x1)), upper=c(min(x1), Inf)) c(logLik(f1), logLik(f2)) delta <- .2 llsurface(x1, "unif", plot.arg = c("min", "max"), min.arg=c(1, 5-delta), max.arg=c(3+delta, 7), main="likelihood surface for uniform") abline(v=3, h=5, col="red", lty=2) points(f1$estimate[1], f1$estimate[2], pch="x", col="violet") points(f2$estimate[1], f2$estimate[2], pch="o", col="blue") # (2) uniform distribution fit - fixed value f3 <- fitdist(x1, "unif", fix.arg=list(min=2.5)) logLik(f3) llcurve(x1, "unif", plot.arg="max", min.arg = 5-delta, max.arg=7) f4 <- 
fitdist(x1, "unif", fix.arg=list(max=5.5)) logLik(f4) # (3) four parameter beta - also known as PERT distribution require(mc2d) x2 <- rpert(2*nsample, 0, 1, 2, 3) f1 <- fitdist(x2, "pert", start=list(min=-1, mode=0, max=10, shape=1), lower=c(-Inf, -Inf, max(x2), 0), upper=c(min(x2), Inf, Inf, Inf)) f2 <- fitdist(x2, "pert", start=list(mode=1, shape=1), lower=c(-Inf, 0), upper=c(Inf, Inf), fix.arg=list(min=min(x2)-1e-6, max=max(x2)+1e-6)) gofstat(list(f1,f2)) cdfcomp(list(f1,f2)) fitdistrplus/tests/t-mledist-cens.R0000644000176200001440000000376713742313702017142 0ustar liggesuserslibrary(fitdistrplus) # (1) fit a user-defined continuous distribution (Gumbel) to censored data. # data(fluazinam) log10EC50 <-log10(fluazinam) # definition of the Gumbel distribution dgumbel <- function(x,a,b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b)) pgumbel <- function(q,a,b) exp(-exp((a-q)/b)) qgumbel <- function(p,a,b) a-b*log(-log(p)) mledist(log10EC50, "gumbel", start=list(a=0,b=2), optim.method="Nelder-Mead") mledist(log10EC50, "gumbel", start=list(a=0,b=2)) #default NM # (2) test optimization arguments to censored data MLE. # mledist(log10EC50, "lnorm", optim.method="BFGS") mledist(log10EC50, "lnorm", optim.method="Nelder") #optim() is used mledist(log10EC50, "lnorm", optim.method="L-BFGS-B", lower=0) mledist(log10EC50, "lnorm", optim.method="BFGS", lower=0) #a simple warning #constrOptim() is used mledist(log10EC50, "lnorm", optim.method="Nelder", lower=0) mledist(log10EC50, "lnorm", lower=0) #error # (3) weighted MLE # xleft <- c(-1.8, -0.6, -0.1, 0.07, 0.14, 1, 1.2, 1.2, 1.2) xright <- c(-1.8, -0.6, -0.1, 0.07, 0.14, 1, NA, NA, NA) d <- data.frame(left = xleft, right = xright) f1 <- mledist(d, "norm") dbis <- data.frame(left = c(-1.8, -0.6, -0.1, 0.07, 0.14, 1, 1.2), right = c(-1.8, -0.6, -0.1, 0.07, 0.14, 1, NA)) f2 <- mledist(dbis, "norm", weights = c(rep(1,6),3)) # f1 and f2 must give quite the same results (only starting values differ) f1$estimate f2$estimate # (4) test the definition of fix.arg/start.arg as functions # if defined as functions, start.arg and fix.arg must be # functions of pseudo data (output pseudo of cens2pseudo()) mledist(d, "norm", start = function(x) list(mean = 0, sd = 1))$estimate mledist(d, "norm", start = function(x) list(mean = mean(x), sd = 1))$estimate mledist(d, "norm", fix.arg = function(x) list(mean = 0))$estimate mledist(d, "norm", fix.arg = function(x) list(mean = 0.544))$estimate mledist(d, "norm", fix.arg = function(x) list(mean = mean(x)))$estimate fitdistrplus/tests/t-prefit.R0000644000176200001440000000225114050660542016027 0ustar liggesusersrequire(fitdistrplus) nsample <- 10 # (1) gamma x <- rgamma(nsample, 5/2, 7/2) prefit(x, "gamma", "mle", list(shape=3, scale=3), lower=-Inf, upper=Inf, silent=TRUE, control=list(trace=1, REPORT=1)) prefit(x, "gamma", "mle", list(shape=1, scale=1), lower=-Inf, upper=Inf, silent=TRUE) prefit(x, "gamma", "mle", list(shape=3), fix.arg=list(scale=7/2), lower=-Inf, upper=Inf, silent=TRUE) prefit(x, "gamma", "qme", list(shape=1, scale=1), probs=1:2/3, lower=-Inf, upper=Inf, silent=TRUE) prefit(x, "gamma", "mge", list(shape=1, scale=1), gof="CvM", lower=-Inf, upper=Inf, silent=TRUE) prefit(x, "gamma", "mge", list(shape=1, scale=1), gof="AD", lower=-Inf, upper=Inf, silent=TRUE) # (2) geometric x <- rgeom(nsample, 1/7) prefit(x, "geom", "mle", list(prob=1/2), lower=-Inf, upper=Inf, silent=TRUE) tbx <- table(x) prefit(as.numeric(names(tbx)), "geom", "mle", list(prob=1/2), lower=-Inf, upper=Inf, silent=TRUE, weights=tbx) prefit(x, "geom", 
"qme", list(prob=1/2), probs=1/2, lower=-Inf, upper=Inf) # (3) Pareto require(actuar) x <- rpareto(nsample, 6, 2) prefit(x, "pareto", "mme", list(shape=10, scale=10), order=1:2, memp=function(x, order) mean(x^order), lower=-Inf, upper=Inf) fitdistrplus/tests/t-detectbound.R0000644000176200001440000000161414205417001017030 0ustar liggesusersrequire(fitdistrplus) #case where the density returns a Not-an-Numeric value. detectbound("gamma", c(shape=3, scale=3), 1:10, echo=TRUE) detectbound("logis", c(location=3, scale=3), 1:10) detectbound("geom", c(prob=1/2), 1:10) #test rate-scale arg detectbound("gamma", c(shape=1, scale=3), 1:10) detectbound("gamma", c(shape=1, rate=1/3), 1:10) #case where the density returns a Not-an-Numeric value and one parameter is fixed. detectbound("gamma", c(shape=3), 1:10, fix.arg=c(scale=3)) #case where the density returns an error rather than a Not-an-Numeric value. dgeom2 <- function(x, prob, log=FALSE) { stopifnot(prob >= 0 && prob <= 1) dgeom(x, prob, log) } detectbound("geom2", c(prob=1/2), 1:10) #case where the density returns a Not-an-Numeric value for actuar package require(actuar) detectbound("burr", c(shape1=3, shape2=3, rate=1), 1:10) detectbound("llogis", c(shape=3, rate=1), 1:10) fitdistrplus/tests/t-util-npsurv-mainfunction.R0000644000176200001440000001734414421725657021562 0ustar liggesuserslibrary(fitdistrplus) vizualise <- FALSE mytrace <- 1 mytrace <- 0 #------------------------------------------------- # example with right censoring from package npsurv ap <- cbind(L=c(1:15, 1:15), R=c(1:15, rep(Inf, 15)), count=c(456, 226, 152, 171, 135, 125, 83, 74, 51, 42, 43, 34, 18, 9, 6, 39, 22, 23, 24, 107, 133, 102, 68, 64, 45, 53, 33, 27, 23, 30)) dim(ap) resap <- fitdistrplus:::npsurvminimal(ap, w=1, maxit=100, tol=1e-6, verb=mytrace, pkg="stats") #------------------------------------------------ # example with left censoring from package npsurv ap <- cbind(L=c(1:15, rep(-Inf, 15)), R=c(1:15, 1:15), count=c(456, 226, 152, 171, 135, 125, 83, 74, 51, 42, 43, 34, 18, 9, 6, 39, 22, 23, 24, 107, 133, 102, 68, 64, 45, 53, 33, 27, 23, 30)) dim(ap) resap <- fitdistrplus:::npsurvminimal(ap, w=1, maxit=100, tol=1e-6, verb=mytrace, pkg="stats") if(vizualise) { #------------------------------------------------------------------------------------------------- # example with interval censoring from package npsurv leading to 16 maximal intersection intervals ap <- cbind(L=c(0:14,1:15), R=c(1:15, rep(Inf, 15)), count=c(456, 226, 152, 171, 135, 125, 83, 74, 51, 42, 43, 34, 18, 9, 6, 39, 22, 23, 24, 107, 133, 102, 68, 64, 45, 53, 33, 27, 23, 30)) dim(ap) ap.x2 <- fitdistrplus:::icendata(ap, w=1) ap.x <- rbind(cbind(ap.x2$t, ap.x2$t), ap.x2$o) ap.Delta <- fitdistrplus:::Deltamatrix(ap.x) #cbind(ap.Delta$left, unique(ap.x[,"L"])) resap <- fitdistrplus:::npsurvminimal(ap, w=1, maxit=100, tol=1e-6, verb=mytrace, pkg="stats") # if(FALSE) # { # require(npsurv) # rescheck <- npsurv::npsurv(ap, w=1, maxit=100, tol=1e-6, verb=3) # c(resap$ll, rescheck$ll) # # cbind(resap$f$left, rescheck$f$left) # cbind(resap$f$right, rescheck$f$right) # cbind(resap$f$p, rescheck$f$p) # sum(abs(resap$f$p- rescheck$f$p)) # } #---------------------------------------------------------------------------------- # example with interval censoring leading to a single maximal intersection interval LR <- matrix(1:100, ncol=2) cnt <- round(1000*pgeom(1:NROW(LR)-1, prob=1/10, lower=FALSE)) fakedata <- cbind(L=LR[,1], R=LR[,2], count=cnt) fakedata.x2 <- fitdistrplus:::icendata(fakedata, w=1) fakedata.x2.x <- 
rbind(cbind(fakedata.x2$t, fakedata.x2$t), fakedata.x2$o) fakedata.x2.Delta <- fitdistrplus:::Deltamatrix(fakedata.x2.x) # a single vector resfk <- fitdistrplus:::npsurvminimal(fakedata, w=1, maxit=100, tol=1e-6, verb=mytrace, pkg="stats") #--------------------------------------------------------------------------------------- # geometric example with interval censoring leading to 50 maximal intersection intervals LR <- matrix(1:100, ncol=2, byrow = TRUE) theop <- dgeom(1:NROW(LR)-1, prob=1/10) cnt <- round(1000*pgeom(1:NROW(LR)-1, prob=1/10, lower=FALSE)) fakedata <- cbind(L=LR[,1], R=LR[,2], count=cnt) fakedata.x2 <- fitdistrplus:::icendata(fakedata, w=1) fakedata.x2.x <- rbind(cbind(fakedata.x2$t, fakedata.x2$t), fakedata.x2$o) fakedata.x2.Delta <- fitdistrplus:::Deltamatrix(fakedata.x2.x) #is diagonal resfk <- fitdistrplus:::npsurvminimal(fakedata, w=1, maxit=100, tol=1e-6, verb=mytrace, pkg="stats") head(cbind(optimized=resfk$f$p, theoretical=theop)) #---------------------------------------------------------------------------------------- # multimodal example with interval censoring leading to 50 maximal intersection intervals LR <- matrix(1:100, ncol=2, byrow = TRUE) cnt <- round(450*dgeom(1:NROW(LR)-1, prob=1/10) +550*dbinom(1:NROW(LR)-1, size=NROW(LR), prob=4/5)) theop <- cnt/sum(cnt) fakedata <- cbind(L=LR[,1], R=LR[,2], count=cnt) fakedata.x2 <- fitdistrplus:::icendata(fakedata, w=1) fakedata.x2.x <- rbind(cbind(fakedata.x2$t, fakedata.x2$t), fakedata.x2$o) fakedata.x2.Delta <- fitdistrplus:::Deltamatrix(fakedata.x2.x) #is diagonal resfk <- fitdistrplus:::npsurvminimal(fakedata, w=1, maxit=100, tol=1e-6, verb=mytrace, pkg="stats") head(cbind(optimized=resfk$f$p, theoretical=theop[1:resfk$m])) #---------------------------------------------------------------------------------------- # multimodal example with interval censoring leading to 43 maximal intersection intervals n <- 100 set.seed(123) LR <- sample.int(n) LR <- cbind(LR, LR+sample.int(n)/10) cnt <- 1+round(450*dgeom(1:NROW(LR)-1, prob=1/10) +550*dbinom(1:NROW(LR)-1, size=NROW(LR), prob=4/5)) theop <- cnt/sum(cnt) fakedata <- cbind(L=LR[,1], R=LR[,2], count=cnt) fakedata.x2 <- fitdistrplus:::icendata(fakedata, w=1) fakedata.x2.x <- rbind(cbind(fakedata.x2$t, fakedata.x2$t), fakedata.x2$o) fakedata.x2.Delta <- fitdistrplus:::Deltamatrix(fakedata.x2.x) str(fakedata.x2.Delta) resfk <- fitdistrplus:::npsurvminimal(fakedata, w=1, maxit=100, tol=1e-6, verb=mytrace, pkg="stats") # if(FALSE) # { # require(npsurv) # rescheck <- npsurv::npsurv(fakedata, w=1, maxit=100, tol=1e-6, verb=3) # c(resfk$ll, rescheck$ll) # # cbind(resfk$f$left, rescheck$f$left) # cbind(resfk$f$right, rescheck$f$right) # cbind(resfk$f$p, rescheck$f$p) # sum(abs(resfk$f$p- rescheck$f$p)) # } #---------------------------------------------------------------------------------------- # simulated example with interval censoring leading to 258 maximal intersection intervals set.seed(1232) ns <- 500 ns <- 100 r <- rnorm(ns) d8 <- data.frame(left = r, right = r) delta <- rlnorm(ns) icensored <- rbinom(ns, size = 1, prob = 0.2) Lcensored <- rbinom(ns, size = 1, prob = 0.2*(1 - icensored)) Rcensored <- rbinom(ns, size = 1, prob = 0.3*(1 - icensored)*(1 - Lcensored)) # icensored + Lcensored + Rcensored d8$left <- d8$left * (1 - Lcensored) + (-1000) * Lcensored d8$right <- d8$right * (1 - Rcensored) + (1000) * Rcensored d8$right <- d8$right + delta * icensored d8$right[d8$right == 1000] <- +Inf d8$left[d8$left == -1000] <- -Inf d8.x2 <- fitdistrplus:::icendata(d8, w=1) 
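# note: icendata() separates exact observations ($t) from censoring
# intervals ($o); the rbind() below stacks them as [left, right] rows so that
# Deltamatrix() can compute the maximal intersection intervals on which the
# NPMLE places its probability mass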
d8.x2.x <- rbind(cbind(d8.x2$t, d8.x2$t), d8.x2$o) d8.x2.Delta <- fitdistrplus:::Deltamatrix(d8.x2.x) str(d8.x2.Delta) system.time( resd8 <- fitdistrplus:::npsurvminimal(d8, w=1, maxit=100, tol=1e-6, verb=mytrace, pkg="stats") ) # if(FALSE) # { # require(npsurv) # d8bis <- d8 # d8bis$left[is.na(d8bis$left)] <- -Inf # d8bis$right[is.na(d8bis$right)] <- Inf # # rescheck <- npsurv::npsurv(d8bis, w=1, maxit=100, tol=1e-6, verb=3) # c(resd8$ll, rescheck$ll) # # cbind(resd8$f$left, rescheck$f$left) # cbind(resd8$f$right, rescheck$f$right) # head(cbind(resd8$f$p, rescheck$f$p)) # sum(abs(resd8$f$p- rescheck$f$p)) # } #------------------------------------------------------ # crash example with wrong interval censoring intervals set.seed(1232) ns <- 100 r <- rnorm(ns) d8 <- data.frame(left = r, right = r) delta <- rlnorm(ns) icensored <- rbinom(ns, size = 1, prob = 0.2) Lcensored <- rbinom(ns, size = 1, prob = 0.2*(1 - icensored)) Rcensored <- rbinom(ns, size = 1, prob = 0.3*(1 - icensored)*(1 - Lcensored)) # icensored + Lcensored + Rcensored d8$left <- d8$left * (1 - Lcensored) + (-1000) * Lcensored d8$right <- d8$right * (1 - Rcensored) + (1000) * Rcensored d8$right <- d8$right + delta * icensored d8$right[d8$right == 1000] <- -Inf d8$left[d8$left == -1000] <- +Inf try(resd8 <- fitdistrplus:::npsurvminimal(d8, w=1, maxit=100, tol=1e-6, verb=2, pkg="stats")) }fitdistrplus/tests/t-qqcompcens.R0000644000176200001440000000541313742313702016712 0ustar liggesuserslibrary(fitdistrplus) visualize <- FALSE # TRUE for manual tests with visualization of results data(smokedfish) fitsf <- fitdistcens(smokedfish,"lnorm") plot(fitsf) qqcompcens(fitsf) qqcompcens(fitsf, fillrect = NA) qqcompcens(fitsf, fitcol = "black") qqcompcens(fitsf, fitcol = "black", fillrect = NA) qqcompcens(fitsf, ylim = c(0,150)) qqcompcens(fitsf, xlim = c(0,150)) qqcompcens(fitsf, xlim = c(0,150), ylim = c(0, 120)) if (requireNamespace("ggplot2", quietly = TRUE)) { qqcompcens(fitsf, plotstyle = "ggplot") } if (requireNamespace("ggplot2", quietly = TRUE) & visualize) { qqcompcens(fitsf, fillrect = NA, plotstyle = "ggplot") qqcompcens(fitsf, fitcol = "black", plotstyle = "ggplot") qqcompcens(fitsf, fitcol = "black", fillrect = NA, plotstyle = "ggplot") qqcompcens(fitsf, ylim = c(0,150), plotstyle = "ggplot") qqcompcens(fitsf, xlim = c(0,150), plotstyle = "ggplot") qqcompcens(fitsf, xlim = c(0,150), ylim = c(0, 120), plotstyle = "ggplot") } if (visualize) { data(fluazinam) log10EC50 <-log10(fluazinam) fln <- fitdistcens(log10EC50,"norm") plot(fln) qqcompcens(fln) if (requireNamespace("ggplot2", quietly = TRUE)) { qqcompcens(fln, plotstyle = "ggplot") } } data(salinity) log10LC50 <-log10(salinity) plotdistcens(log10LC50) plotdistcens(log10LC50, NPMLE = FALSE) fn <- fitdistcens(log10LC50,"norm") fl <- fitdistcens(log10LC50,"logis") plot(fn) plot(fl) qqcompcens(fn) qqcompcens(fl) qqcompcens(list(fn, fl)) qqcompcens(list(fn, fl), fitlwd = c(5, 2)) if (requireNamespace("ggplot2", quietly = TRUE)) { qqcompcens(list(fn, fl), plotstyle = "ggplot") qqcompcens(list(fn, fl), plotstyle = "ggplot", fitlwd = c(5, 2)) } if (requireNamespace("ggplot2", quietly = TRUE) & visualize) { qqcompcens(fl, plotstyle = "ggplot") qqcompcens(fn, plotstyle = "ggplot") } require(actuar) data(salinity) fln <- fitdistcens(salinity,"lnorm") fll <- fitdistcens(salinity,"llogis") plot(fln) par(mfrow = c(2,1)) qqcompcens(fln) qqcompcens(fll) par(mfrow = c(1,1)) qqcompcens(list(fln, fll)) qqcompcens(list(fln, fll), ynoise = FALSE) qqcompcens(list(fln, fll), fitcol = c("blue", 
"orange")) qqcompcens(list(fln, fll), xlogscale = TRUE, ylogscale = TRUE) qqcompcens(list(fln, fll), ylogscale = TRUE) qqcompcens(list(fln, fll), xlogscale = TRUE, ynoise = FALSE) if (requireNamespace("ggplot2", quietly = TRUE) & visualize) { qqcompcens(list(fln, fll), plotstyle = "ggplot") qqcompcens(list(fln, fll), ynoise = FALSE, plotstyle = "ggplot") qqcompcens(list(fln, fll), fitcol = c("blue", "orange"), plotstyle = "ggplot") qqcompcens(list(fln, fll), xlogscale = TRUE, ylogscale = TRUE, plotstyle = "ggplot") qqcompcens(list(fln, fll), ylogscale = TRUE, plotstyle = "ggplot") qqcompcens(list(fln, fll), xlogscale = TRUE, ynoise = FALSE, plotstyle = "ggplot") } fitdistrplus/tests/t-denscomp.R0000644000176200001440000003306014050660542016350 0ustar liggesuserslibrary(fitdistrplus) # ?denscomp visualize <- FALSE # TRUE for manual test with visualization of plots nsample <- 1000 nsample <- 10 # (1) Plot various distributions fitted to serving size data # data(groundbeef) serving <- groundbeef$serving fitW <- fitdist(serving,"weibull") fitln <- fitdist(serving,"lnorm") fitg <- fitdist(serving,"gamma") #sanity checks try(denscomp("list(fitW, fitln, fitg)",horizontals = FALSE), silent=TRUE) try(denscomp(list(fitW, fitln, fitg, a=1),horizontals = FALSE), silent=TRUE) #real call denscomp(list(fitW, fitln, fitg), probability = TRUE) denscomp(list(fitW, fitln, fitg), probability = FALSE) if (requireNamespace ("ggplot2", quietly = TRUE)) { denscomp(list(fitW, fitln, fitg), probability=TRUE, plotstyle = "ggplot") denscomp(list(fitW, fitln, fitg), probability=FALSE, plotstyle = "ggplot") } #test ylim argument denscomp(list(fitW, fitln, fitg), probability=TRUE, ylim=c(0, .05)) denscomp(list(fitW, fitln, fitg), probability=FALSE, ylim=c(0, 100)) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { denscomp(list(fitW, fitln, fitg), probability=TRUE, ylim=c(0, .05), plotstyle = "ggplot") denscomp(list(fitW, fitln, fitg), probability=FALSE, ylim=c(0, 100), plotstyle = "ggplot") } #test xlim, legend, main, demp denscomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits",xlab="serving sizes (g)", ylab="F",xlim = c(0,250), xlegend = "topright", demp=TRUE) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { denscomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits",xlab="serving sizes (g)", ylab="F",xlim = c(0,250), xlegend = "topright", demp=TRUE, plotstyle = "ggplot") } # (2) Plot lognormal distributions fitted by # maximum goodness-of-fit estimation # using various distances (data plotted in log scale) # data(endosulfan) ATV <-subset(endosulfan, group == "NonArthroInvert")$ATV flnMGEKS <- fitdist(ATV,"lnorm",method="mge",gof="KS") flnMGEAD <- fitdist(ATV,"lnorm",method="mge",gof="AD") flnMGEADL <- fitdist(ATV,"lnorm",method="mge",gof="ADL") flnMGEAD2L <- fitdist(ATV,"lnorm",method="mge",gof="AD2L") llfit <- list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L) denscomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), main="fits of a lognormal dist. using various GOF dist.") if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { denscomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), main="fits of a lognormal dist. using various GOF dist.", plotstyle = "ggplot") } denscomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), main="fits of a lognormal dist. 
using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L")) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { denscomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), main="fits of a lognormal dist. using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L"), plotstyle = "ggplot") } denscomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), main="fits of a lognormal dist. using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L"), fitcol=c("black", "darkgreen", "yellowgreen", "yellow2")) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { denscomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), main="fits of a lognormal dist. using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L"), fitcol=c("black", "darkgreen", "yellowgreen", "yellow2"), plotstyle = "ggplot") } denscomp(llfit, main="fits of a lognormal dist. using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L"), fitcol=c("black", "darkgreen", "yellowgreen", "yellow2"), datacol="grey") if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { denscomp(llfit, main="fits of a lognormal dist. using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L"), fitcol=c("black", "darkgreen", "yellowgreen", "yellow2"), datacol="grey", plotstyle = "ggplot") } denscomp(flnMGEKS, xlim=c(10,100000)) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) denscomp(flnMGEKS, xlim=c(10,100000), plotstyle = "ggplot") # (3) # # if (visualize) { x1 <- c(6.4,13.3,4.1,1.3,14.1,10.6,9.9,9.6,15.3,22.1,13.4,13.2,8.4,6.3,8.9,5.2,10.9,14.4) x <- seq(0, 1.1*max(x1), length=100) dgumbel <- function(x,a,b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b)) pgumbel <- function(x,a,b) exp(-exp((a-x)/b)) f1 <- mledist(x1,"norm") f2 <- mledist(x1,"gumbel", start = list(a = 10, b = 5)) f3 <- mledist(x1, "exp") # graph 1 hist(x1, 10, prob=TRUE) lines(x, dnorm(x, f1$estimate[1], f1$estimate[2]), col="red") lines(x, dgumbel(x, f2$estimate[1], f2$estimate[2]), col="green") lines(x, dexp(x, f3$estimate[1]), col="blue") legend("topright", lty=1, leg = c("Normal", "Gumbel", "Exp"), col = c("red", "green", "blue")) # graph 2 f1 <- fitdist(x1,"norm") f2 <- fitdist(x1,"gumbel", start = list(a = 10, b = 5)) f3 <- fitdist(x1, "exp") denscomp(list(f1, f2, f3), xlim = c(0, 30), fitlty = 1, legendtext = c("Normal","Gumbel","Exp")) # graph 3 if (requireNamespace ("ggplot2", quietly = TRUE)) denscomp(list(f1, f2, f3), xlim = c(0, 30), fitlty = 1, legendtext = c("Normal","Gumbel","Exp"), breaks = 12, plotstyle = "ggplot") } # (4) normal mixture # if (visualize) { #mixture of two normal distributions #density dnorm2 <- function(x, poid, m1, s1, m2, s2) poid*dnorm(x, m1, s1) + (1-poid)*dnorm(x, m2, s2) #numerical approximate quantile function qnorm2 <- function(p, poid, m1, s1, m2, s2) { L2 <- function(x, prob) (prob - pnorm2(x, poid, m1, s1, m2, s2))^2 sapply(p, function(pr) optimize(L2, c(-1000, 1000), prob=pr)$minimum) } #distribution function pnorm2 <- function(q, poid, m1, s1, m2, s2) poid*pnorm(q, m1, s1) + (1-poid)*pnorm(q, m2, s2) #basic normal distribution set.seed(1234) x2 <- c(rnorm(nsample, 5), rnorm(nsample, 10)) #MLE fit fit1 <- fitdist(x2, "norm2", "mle", start=list(poid=1/3, m1=4, s1=2, m2=8, s2=2), lower=c(0, 0, 0, 0, 0)) fit2 <- fitdist(x2, "norm2", "qme", probs=c(1/6, 1/4, 1/3, 1/2, 2/3), start=list(poid=1/3, m1=4, s1=2, m2=8, s2=2), lower=c(0, 0, 0, 0, 0), upper=c(1/2, Inf, Inf, Inf, Inf)) fit3 <- fitdist(x2, "norm2", "mge", 
gof="AD", start=list(poid=1/3, m1=4, s1=2, m2=8, s2=2), lower=c(0, 0, 0, 0, 0), upper=c(1/2, Inf, Inf, Inf, Inf)) denscomp(list(fit1, fit2, fit3)) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) denscomp(list(fit1, fit2, fit3), plotstyle = "ggplot") } # (5) large data # if (visualize) { n <- 1e4 x <- rnorm(n) f <- fitdist(x, "norm") denscomp(f) denscomp(f, demp=TRUE) if (requireNamespace ("ggplot2", quietly = TRUE)) { denscomp(f, plotstyle = "ggplot") denscomp(f, demp=TRUE, plotstyle = "ggplot") } } # (6) graphical parameters # if (visualize) { # 'graphics' plot style denscomp(list(fit1, fit2, fit3), plotstyle = "gr") denscomp(list(fit1, fit2, fit3), title = "Fitted distribution") denscomp(list(fit1, fit2, fit3), main = "Fitted distribution", addlegend = F, demp = T, dempcol = "purple") # 'ggplot' plot style if (requireNamespace ("ggplot2", quietly = TRUE)) { denscomp(list(fit1, fit2, fit3), plotstyle = "gg") denscomp(list(fit1, fit2, fit3), plotstyle = "ggplot", breaks = 20, pro = F) dcomp <- denscomp(list(fit1, fit2, fit3), plotstyle = "gg", demp = T) dcomp + ggplot2::theme_minimal() + ggplot2::ggtitle("Histogram and\ntheoretical densities") dcomp + ggplot2::guides(colour = ggplot2::guide_legend("Fitted distribution"), linetype = ggplot2::guide_legend("Fitted distribution")) } } # (7) test legend labels # if (visualize) { serving <- groundbeef$serving fitW <- fitdist(serving,"weibull") fitW2 <- fitdist(serving,"weibull", method="qme", probs=c(1/3,2/3)) fitW3 <- fitdist(serving,"weibull", method="qme", probs=c(1/2,2/3)) fitln <- fitdist(serving,"lnorm") fitg <- fitdist(serving,"gamma") denscomp(list(fitW, fitln, fitg)) #distrib denscomp(list(fitW, fitW2, fitln, fitg)) #distrib+method denscomp(list(fitW, fitW2, fitW3, fitln, fitg)) #distrib+method+num if (requireNamespace ("ggplot2", quietly = TRUE)) denscomp(list(fitW, fitW2, fitW3, fitln, fitg), plotstyle = "ggplot") #distrib+method+num } # (8) discrete distrib # x <- c(rpois(nsample, 5), rbinom(nsample, 12, 2/3)) fpois <- fitdist(x, "pois") fgeo <- fitdist(x, "geom") fnbinom <- fitdist(x, "nbinom") par(mar=c(4,4,2,1)) denscomp(list(fpois, fnbinom, fgeo), probability = TRUE) denscomp(list(fpois, fnbinom, fgeo), probability = FALSE) denscomp(list(fpois, fnbinom, fgeo), fittype="o") denscomp(list(fpois, fnbinom, fgeo), fittype="p") # 'ggplot' plot style if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { denscomp(list(fpois, fnbinom, fgeo), plotstyle="ggplot", probability = TRUE) denscomp(list(fpois, fnbinom, fgeo), plotstyle="ggplot", probability = FALSE) denscomp(list(fpois, fnbinom, fgeo), fittype="o", plotstyle="ggplot") denscomp(list(fpois, fnbinom, fgeo), fittype="p", plotstyle="ggplot") } # test the call to any() fpois$discrete <- fnbinom$discrete <- FALSE denscomp(list(fpois, fnbinom, fgeo)) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) denscomp(list(fpois, fnbinom, fgeo), plotstyle="ggplot") #test the forced usage fgeo$discrete <- FALSE denscomp(list(fpois, fnbinom, fgeo), discrete=TRUE) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) denscomp(list(fpois, fnbinom, fgeo), discrete=TRUE, plotstyle="ggplot") if (visualize) { x <- c(rpois(nsample, 30), rbinom(nsample, 12, 2/3)) fpois <- fitdist(x, "pois") fgeo <- fitdist(x, "geom") fnbinom <- fitdist(x, "nbinom") #3 types of plot of probability mass function par(mar=c(4,4,2,1)) denscomp(list(fpois, fnbinom, fgeo)) #fittype = "l" denscomp(list(fpois, fnbinom, fgeo), fittype = "p") denscomp(list(fpois, fnbinom, fgeo), fittype = 
"o") if (requireNamespace ("ggplot2", quietly = TRUE)) { denscomp(list(fpois, fnbinom, fgeo), plotstyle="ggplot") #fittype = "l" denscomp(list(fpois, fnbinom, fgeo), fittype = "p", plotstyle="ggplot") denscomp(list(fpois, fnbinom, fgeo), fittype = "o", plotstyle="ggplot") } } # (9) examples with user specified regular of irregular breaks in the histogram # in probability or not # if (visualize) { # two plots with user specified regular breaks in probability or not # hist(serving, breaks = seq(0,200,50)) denscomp(list(fitW, fitln, fitg), addlegend = FALSE, main = "ground beef fits", xlab = "serving sizes (g)", xlim = c(0, 250), breaks = seq(0,200,50)) denscomp(list(fitW, fitln, fitg), addlegend = FALSE, main = "ground beef fits", xlab = "serving sizes (g)", xlim = c(0, 250), probability = FALSE, breaks = seq(0,200,50)) # with ggplot2 denscomp(list(fitW, fitln, fitg), addlegend = FALSE, main = "ground beef fits", xlab = "serving sizes (g)", xlim = c(0, 250), plotstyle = "ggplot", breaks = seq(0,200,50)) denscomp(list(fitW, fitln, fitg), addlegend = FALSE, main = "ground beef fits", xlab = "serving sizes (g)", xlim = c(0, 250), probability = FALSE, plotstyle = "ggplot", breaks = seq(0,200,50)) # two plots with irregular breaks in probability or not # hist(serving, breaks = c(0, 20, 50, 100, 200, 300)) denscomp(list(fitW, fitln, fitg), addlegend = FALSE, main = "ground beef fits", xlab = "serving sizes (g)", xlim = c(0, 250), breaks = c(0, 20, 50, 100, 200, 300)) # hist(serving, breaks = c(0, 20, 50, 100, 200, 300), probability = FALSE) try(denscomp(list(fitW, fitln, fitg), addlegend = FALSE, main = "ground beef fits", xlab = "serving sizes (g)", xlim = c(0, 250), breaks = c(0, 20, 50, 100, 200, 300), probability = FALSE)) # with ggplot2 denscomp(list(fitW, fitln, fitg), addlegend = FALSE, main = "ground beef fits", xlab = "serving sizes (g)", xlim = c(0, 250), breaks = c(0, 20, 50, 100, 200, 300), plotstyle = "ggplot") ##### ggplot2 does not take into account non-equidistant breaks !!!!!!!!!!!!!!!! 
try(denscomp(list(fitW, fitln, fitg), addlegend = FALSE, main = "ground beef fits", xlab = "serving sizes (g)", xlim = c(0, 250), breaks = c(0, 20, 50, 100, 200, 300), probability = FALSE, plotstyle = "ggplot")) } # (10) fitlty, fitlwd for discrete x <- c(rpois(nsample, 30), rbinom(nsample, 12, 2/3)) fpois <- fitdist(x, "pois") fgeo <- fitdist(x, "geom") fnbinom <- fitdist(x, "nbinom") denscomp(list(fpois, fnbinom, fgeo), fitlty = 2, fitlwd = 3:1) denscomp(list(fpois, fnbinom, fgeo), fittype = "o", fitlwd = 3:1) if (requireNamespace ("ggplot2", quietly = TRUE)) { denscomp(list(fpois, fnbinom, fgeo), plotstyle = "ggplot", fitlty = 2, fitlwd = 3:1) denscomp(list(fpois, fnbinom, fgeo), plotstyle = "ggplot", fittype = "o", fitlty = 1, fitlwd = 3:1) } # (11) fitlty, fitlwd for non discrete denscomp(list(fitW, fitln, fitg), fitlty = 1, fitlwd = 3:1) denscomp(list(fitW, fitln, fitg), fitlty = 1, fitlwd = 1:3, fitcol = c(1:2, 7)) if (requireNamespace ("ggplot2", quietly = TRUE)) { denscomp(list(fitW, fitln, fitg), plotstyle = "ggplot", fitlty = 1, fitlwd = 3:1) denscomp(list(fitW, fitln, fitg), plotstyle = "ggplot", fitlty = 1, fitlwd = 1:3, fitcol = c(1:2, 7)) } fitdistrplus/tests/t-util-testdensity.R0000644000176200001440000000467413742313702020103 0ustar liggesuserslibrary(fitdistrplus) testdpqfun <- fitdistrplus:::testdpqfun ##### first argument ##### #a data.frame of TRUE and "" testdpqfun("exp", start=c(rate=1)) #a data.frame with error messages dEXP <- function(y, rate) dexp(x, rate) pEXP <- function(y, rate) pexp(x, rate) qEXP <- function(y, rate) qexp(x, rate) testdpqfun("EXP", start=c(rate=1)) ##### existence ##### #a data.frame of TRUE and "" testdpqfun("exp", start=c(rate=1)) #a data.frame with error messages testdpqfun("exp2", start=c(rate=1)) ##### void vector ##### dexp2 <- function(x, rate) ifelse(length(x)==0, stop("zero input"), dexp(x,rate)) dexp3 <- function(x, rate) ifelse(length(x)==0, NA, dexp(x,rate)) #TRUE testdpqfun("exp", "d", c(rate=1)) #error message testdpqfun("exp2", "d", c(rate=1)) #error message testdpqfun("exp3", "d", c(rate=1)) ##### inconsistent value ##### pexp2 <- function(q, rate) { res <- pexp(q, rate) if(any(is.nan(res))) stop("NaN values") res } pexp3 <- function(q, rate) { res <- pexp(q, rate) if(any(is.infinite(q))) stop("Inf values") res } #TRUE testdpqfun("exp", "p", c(rate=1)) #error message testdpqfun("exp2", "p", c(rate=1)) #error message testdpqfun("exp3", "p", c(rate=1)) ##### missing value ##### qexp2 <- function(p, rate) { res <- qexp(p, rate) if(any(is.na(res))) stop("NA values") res } qexp3 <- function(p, rate) { res <- qexp(p, rate) res[!is.na(res)] } #TRUE testdpqfun("exp", "q", c(rate=1)) #error message testdpqfun("exp2", "q", c(rate=1)) #error message testdpqfun("exp3", "q", c(rate=1)) ##### inconsistent parameter ##### dnorm2 <- function(x, mean, sd) { if(sd < 0) stop("negative param") else dnorm(x,mean,sd) } #TRUE testdpqfun("norm", "d", c(mean=1, sd=1)) #error message testdpqfun("norm2", "d", c(mean=1, sd=1)) ##### inconsistent name ##### dnorm2 <- function(x, mean=0, sd=1, ...) 
dnorm(x, mean, sd)
dnorm3 <- dnorm2
pnorm3 <- pnorm
qnorm3 <- qnorm
#TRUE
testdpqfun("norm", "d", c(mean=1, sd=1))
#error message
testdpqfun("norm2", "d", c(mean=1, sd=1))
#a data.frame with error messages
testdpqfun("norm", c("d", "p", "q"), c(mean=1, sd=1))
testdpqfun("norm2", c("d", "p", "q"), c(mean=1, sd=1))
testdpqfun("norm3", c("d", "p", "q"), c(mean=1, sd=1))
x <- rnorm(100)
fitdist(x, "norm") #ok
fitdist(x, "norm2", start=list(mean=1, sd=1)) #pnorm2 not defined
fitdist(x, "norm3", start=list(mean=1, sd=1)) #the dnorm3 function should raise an error because its arguments are incorrectly named
fitdistrplus/tests/t-fitdistcens.R0000644000176200001440000000477014050660542017065 0ustar liggesuserslibrary(fitdistrplus)
nsample <- 500
nsample <- 10
visualize <- FALSE # TRUE for manual tests with visualization of results
set.seed(1234)
# (6) custom optimisation function - example with the genetic algorithm #
data(fluazinam)
log10EC50 <-log10(fluazinam)
#wrap genoud function rgenoud package
mygenoud <- function(fn, par, ...) { require(rgenoud) res <- genoud(fn, starting.values=par, ...) standardres <- c(res, convergence=0) return(standardres) }
# call fitdistcens with a 'custom' optimization function
fit.with.genoud <- fitdistcens(log10EC50, "logis", custom.optim=mygenoud, nvars=2, start=list(location=0, scale=1), Domains=cbind(c(0,0), c(5, 5)), boundary.enforcement=1, print.level=1, hessian=TRUE)
summary(fit.with.genoud)
# (9) check keepdata #
if (visualize) # LONG TO RUN ON CRAN AND NEEDS VISUALIZATION OF RESULTS
{ set.seed(1234) x <- rexp(1e3, 5) # x <- data.frame(left=x, right=x+rexp(x, 1/2)) x <- data.frame(left=x, right=x) f1 <- fitdistcens(x, "exp", keepdata=FALSE) f2 <- fitdistcens(x, "exp", keepdata=TRUE) f1$censdata f2$censdata plot(f1) plot(f2) }
# (9) fixing parameters #
x <- rexp(nsample, 5)
x <- data.frame(left=x, right=x+.1)
f1 <- fitdistcens(x, "gamma", fix.arg=list(shape=1.5))
f1
f1$fix.arg
f1 <- fitdistcens(x, "gamma", fix.arg=function(x) list(shape=1.5))
f1
f1$fix.arg.fun
# (10) weights #
data(salinity)
salinity.unique <- unique(salinity)
string.unique <- paste(salinity.unique$left, salinity.unique$right)
string.salinity <- paste(salinity$left, salinity$right)
nobs <- nrow(salinity.unique)
salinity.weights <- numeric(nobs)
for (i in 1:nobs) { salinity.weights[i] <- length(which(string.salinity == string.unique[i])) }
cbind(salinity.unique, salinity.weights)
(fa <- fitdistcens(salinity, "lnorm"))
(fb <- fitdistcens(salinity.unique, "lnorm", weights = salinity.weights))
# should give the same results
# (11) check the warning messages when using weights in the fit followed by functions # that do not yet take weights into account # with an example to be used later to see if weights are well taken into account #
x <- rexp(100, 5)
x <- sort(x)
x <- data.frame(left=x, right=x+.1)
(f <- fitdistcens(x, "gamma", weights=c(rep(10, 50), rep(1, 50))))
try(plot(f))
try(cdfcompcens(f))
(f2 <- fitdistcens(x, "weibull", weights=c(rep(10, 50), rep(1, 50))))
try(cdfcompcens(list(f, f2)))
try(bootdistcens(f))
fitdistrplus/tests/t-lnL-surf.R0000644000176200001440000000374713742313702016253 0ustar liggesuserslibrary(fitdistrplus)
nsample <- 1000
nsample <- 10
#(1) beta distribution #
x <- rbeta(nsample, 3, 3/4)
llsurface(data = x, distr = "beta", plot.arg=c("shape1", "shape2"), min.arg=c(0.1, 0.1), max.arg=c(7, 3))
llsurface(data = x, distr = "beta", plot.arg=c("shape1", "shape2"), min.arg=c(0.1, 0.1), max.arg=c(7, 3), back.col = FALSE)
points(3, 3/4, pch="+", col="red")
llcurve(data = x, distr =
"beta", plot.arg = "shape1", min.arg = 0.1, max.arg = 7, fix.arg = list(shape2 = 3/4), lseq=100, col = "blue") llcurve(data = x, distr = "beta", plot.arg = "shape2", min.arg = 0.1, max.arg = 7, fix.arg = list(shape1 = 3), lseq=100, col = "red") #test psi <- function(x) digamma(x) grbetalnl <- function(x, a, b) c(log(x)-psi(a)+psi(a+b), log(1-x)-psi(b)+psi(a+b)) grbetalnl(x, 3, 4) grlnL <- function(par, obs, ...) -rowSums(sapply(obs, function(x) grbetalnl(x, a=par[1], b=par[2]))) rowSums(sapply(x, function(x) grbetalnl(x, 3, 4))) grlnL(c(3, 4), x) grlnL(c(3, 3/4), x) ctr <- list(trace=0, REPORT=1, maxit=1000) bfgs_gr <- mledist(x, dist="beta", optim.method="BFGS", gr=grlnL, control=ctr) bfgs <- mledist(x, dist="beta", optim.method="BFGS", control=ctr) cg_gr <- mledist(x, dist="beta", optim.method="CG", gr=grlnL, control=ctr) cg <- mledist(x, dist="beta", optim.method="CG", control=ctr) nm_gr <- mledist(x, dist="beta", optim.method="Nelder", gr=grlnL, control=ctr) nm <- mledist(x, dist="beta", optim.method="Nelder", control=ctr) getval <- function(x) c(x$estimate, loglik=x$loglik, x$counts) cbind(NM=getval(nm), NMgrad=getval(nm_gr), CG=getval(cg), CGgrad=getval(cg_gr), BFGS=getval(bfgs), BFGSgrad=getval(bfgs_gr)) llsurface(data = x, distr = "beta", plot.arg = c("shape1", "shape2"), min.arg = c(0.1, 0.1), max.arg = c(7, 3), pal.col = heat.colors(50)) points(bfgs$estimate[1], bfgs$estimate[2], pch="+", col="red") points(3, 3/4, pch="x", col="green") fitdistrplus/tests/t-bootdist.R0000644000176200001440000002102014212327571016362 0ustar liggesuserslibrary(fitdistrplus) #We choose a low number of bootstrap replicates in order to satisfy CRAN running times constraint. #For practical application, we recommend to use nbboot=501 or nbboot=1001. nbboot <- 1001 nbboot <- 11 nsample <- 100 nsample <- 10 visualize <- FALSE # TRUE for manual tests with visualization of results # (1) Fit of a gamma distribution to serving size data # using default method (maximum likelihood estimation) # followed by parametric bootstrap # data(groundbeef) serving <- groundbeef$serving f1 <- fitdist(serving, "gamma") b1 <- bootdist(f1, niter=nbboot, silent=TRUE) b1 <- bootdist(f1, niter=nbboot, silent=FALSE) print(lapply(b1, head)) plot(b1) summary(b1) # (1) bis test new plot arguments #for new graph functions f1 <- fitdist(rgamma(nsample, 2, 3), "gamma") b1 <- bootdist(f1, niter=nbboot, silent=TRUE) plot(b1) plot(b1, trueval = c(2, 3)) plot(b1, enhance=TRUE) plot(b1, enhance=TRUE, trueval = c(2, 3)) plot(b1, enhance=TRUE, rampcol=c("blue", "green"), nbgrid=15, nbcol=15) if(any(installed.packages()[, "Package"] == "actuar") && visualize) { require(actuar) set.seed(123) f1 <- fitdist(rburr(nsample, 2, 3, 1), "burr", start=list(shape1=10, shape2=10, rate=1)) b1 <- bootdist(f1, niter=nbboot, silent=TRUE) plot(b1) plot(b1, trueval = c(2, 3, 1)) plot(b1, enhance=TRUE) plot(b1, enhance=TRUE, trueval = c(2, 3, 1)) } # (3) estimation of the rate of a gamma distribution # by maximum likelihood with the shape fixed at 4 using the argument fix.arg # followed by parametric bootstrap # f1c <- fitdist(serving, "gamma", start=list(rate=0.1), fix.arg=list(shape=4)) b1c <- bootdist(f1c, niter=nbboot) summary(b1c) # (4) fit of a gamma distribution to serving size data # by quantile matching estimation (in this example matching # first and third quartiles) followed by parametric bootstrap # f1d <- fitdist(serving, "gamma", method="qme", probs=c(0.25, 0.75)) b1d <- bootdist(f1d, niter=nbboot) summary(b1d) # (5) fit of a gamma distribution 
with control of the optimization
# method, followed by parametric bootstrap
#
if(visualize)
{ # check ERROR on aarch64-apple-darwin20.4.0 (64-bit) (2021/05/12)
  set.seed(1234)
  f1e <- fitdist(serving, "gamma", optim.method="L-BFGS-B", lower=c(0, 0))
  b1e <- bootdist(f1e, niter=nbboot)
  summary(b1e)
}
# (6) fit of a discrete distribution by matching moment estimation
# (using a closed formula) followed by parametric bootstrap
#
set.seed(1234)
x2 <- rpois(nsample, lambda = 5)
f2 <- fitdist(x2, "pois", method="mme")
b2 <- bootdist(f2, niter=nbboot)
plot(b2, pch=16)
summary(b2)
# (7) Fit of a uniform distribution using the Cramer-von Mises distance
# followed by parametric bootstrap
#
if(visualize)
{
  x3 <- runif(nsample, min=5, max=10)
  f3 <- fitdist(x3, "unif", method="mge", gof="CvM")
  b3 <- bootdist(f3, bootmethod="param", niter=nbboot)
  summary(b3)
  plot(b3)
}
# (9) fit of a Weibull distribution to serving size data by maximum likelihood
# estimation or by quantile matching estimation (in this example matching
# first and third quartiles) followed by parametric bootstrap
#
fWmle <- fitdist(serving, "weibull")
bWmle <- bootdist(fWmle, niter=nbboot)
summary(bWmle)
quantile(bWmle, probs=c(0.25, 0.75))
fWqme <- fitdist(serving, "weibull", method="qme", probs=c(0.25, 0.75))
bWqme <- bootdist(fWqme, niter=nbboot)
summary(bWqme)
quantile(bWqme, probs=c(0.25, 0.75))
# (10) Fit of a Pareto distribution by numerical moment matching estimation
# followed by parametric bootstrap
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! LONG TO RUN !!!!!!!!!!!!!!!!!!!!!!!!
#
if (visualize)
{
  if(any(installed.packages()[, "Package"] == "actuar"))
  {
    require(actuar)
    #simulate a sample
    x4 <- rpareto(nsample, 6, 2)
    memp <- function(x, order) ifelse(order == 1, mean(x), sum(x^order)/length(x))
    f4 <- fitdist(x4, "pareto", "mme", order=1:2, start=list(shape=10, scale=10),
      lower=1, memp="memp", upper=50)
    b4 <- bootdist(f4, niter=nbboot)
    summary(b4)
    b4npar <- bootdist(f4, niter=nbboot, bootmethod="nonparam")
    summary(b4npar)
  }
}
# (11) Fit of a Burr distribution (3 parameters) using MLE
# followed by parametric bootstrap
# !!!!!!!!!!!!!!!! LONG TO RUN !!!!!!!!!!!!!!!!!!
#
if (visualize)
{
  if(any(installed.packages()[, "Package"] == "actuar"))
  {
    require(actuar)
    data(danishuni)
    fdan <- fitdist(danishuni$Loss, "burr", method="mle",
      start=list(shape1=5, shape2=5, rate=10), lower=0+1e-1, control=list(trace=0))
    bdan <- bootdist(fdan, bootmethod="param", niter=nbboot)
    summary(bdan)
    plot(bdan)
    cdfcomp(fdan, xlogscale=TRUE)
  }
}
# (12) Fit of a Triangular distribution (3 parameters) using MLE
# followed by parametric bootstrap, with crashes of optim
#
if(any(installed.packages()[, "Package"] == "mc2d"))
{
  require(mc2d)
  set.seed(1234)
  x4 <- rtriang(100, min=0, mode=4, max=20) # nsample not used : does not converge if the sample is too small
  fit4t <- fitdist(x4, dtriang, start=list(min=0, mode=4, max=20))
  summary(fit4t)
  b4t <- bootdist(fit4t, niter=nbboot)
  b4t
  plot(b4t)
  summary(b4t)
  quantile(b4t)
}
# (13) Fit of a Pareto and a Burr distribution, with bootstrap on the Burr distribution
#
#
if(visualize)
{
  data(endosulfan)
  ATV <- endosulfan$ATV
  plotdist(ATV)
  descdist(ATV, boot=nbboot)
  fln <- fitdist(ATV, "lnorm")
  summary(fln)
  gofstat(fln)
  # use of plotdist to find reasonable initial values for parameters
  plotdist(ATV, "pareto", para=list(shape=1, scale=500))
  fP <- fitdist(ATV, "pareto", start=list(shape=1, scale=500))
  summary(fP)
  gofstat(fP)
  # definition of the initial values from the fit of the Pareto
  # as the Burr distribution is the Pareto when shape2 == 1
  fB <- fitdist(ATV, "burr", start=list(shape1=0.3, shape2=1, rate=1))
  summary(fB)
  gofstat(fB)
  cdfcomp(list(fln,fP,fB), xlogscale=TRUE)
  qqcomp(list(fln,fP,fB), xlogscale=TRUE, ylogscale=TRUE)
  ppcomp(list(fln,fP,fB), xlogscale=TRUE, ylogscale=TRUE)
  denscomp(list(fln,fP,fB)) # without great interest as hist does not accept the argument log="x"
  # comparison of HC5 values (5 percent quantiles)
  quantile(fln, probs=0.05)
  quantile(fP, probs=0.05)
  quantile(fB, probs=0.05)
  # bootstrap for the Burr distribution
  bfB <- bootdist(fB, niter=nbboot)
  plot(bfB)
}
# (14) relevant example for zero modified geometric distribution
#
dzmgeom <- function(x, p1, p2) { p1 * (x == 0) + (1-p1)*dgeom(x-1, p2) }
pzmgeom <- function(q, p1, p2) { p1 * (q >= 0) + (1-p1)*pgeom(q-1, p2) }
rzmgeom <- function(n, p1, p2)
{
  u <- rbinom(n, 1, 1-p1) #prob to get zero is p1
  u[u != 0] <- rgeom(sum(u != 0), p2)+1
  u
}
x2 <- rzmgeom(nsample, 1/2, 1/10)
f2 <- fitdist(x2, "zmgeom", method="mle", fix.arg=function(x) list(p1=mean(x == 0)), start=list(p2=1/2))
b2 <- bootdist(f2, niter=nbboot)
plot(b2)
f3 <- fitdist(x2, "zmgeom", method="mle", start=list(p1=1/2, p2=1/2))
b3 <- bootdist(f3, niter=nbboot)
plot(b3, enhance=TRUE)
# (15) does fixing p1 reduce bias of estimating p2?
summary(b2$estim[, "p2"] - 1/10) summary(b3$estim[, "p2"] - 1/10) par(mfrow=c(1, 2)) hist(b2$estim[, "p2"] - 1/10, breaks=100, xlim=c(-.015, .015)) hist(b3$estim[, "p2"] - 1/10, breaks=100, xlim=c(-.015, .015)) par(mfrow=c(1, 1)) # (16) efficiency of parallel operation if (visualize) { niter <- 1001 data(groundbeef) serving <- groundbeef$serving f1 <- fitdist(serving, "gamma") alltime <- matrix(NA, 9, 5) colnames(alltime) <- c("user.self", "sys.self", "elapsed", "user.child", "sys.child" ) rownames(alltime) <- c("base R", paste("snow", 1:4), paste("multicore", 1:4)) alltime[1,] <- system.time(res <- bootdist(f1, niter = niter)) for (cli in 1:4) { cat("\nnb cluster", cli, "\n") #ptm <- proc.time() alltime[cli+1,] <- system.time(res <- bootdist(f1, niter = niter, parallel = "snow", ncpus = cli)) print(summary(res)) #print(proc.time() - ptm) } # not available on Windows if(.Platform$OS.type == "unix") for (cli in 1:4) { cat("\nnb cluster", cli, "\n") #ptm <- proc.time() alltime[cli+5,] <- system.time(res <- bootdist(f1, niter = niter, parallel = "multicore", ncpus = cli)) print(summary(res)) #print(proc.time() - ptm) } alltime } # (17) bootdist with weights (not yet available, test of error message) # x <- rpois(nsample, 10) xtab <- table(x) xval <- sort(unique(x)) (f1 <- fitdist(x, "pois")) (f2 <- fitdist(xval, "pois", weights = xtab)) summary(bootdist(f1, niter = nbboot)) try(summary(bootdist(f2, niter = nbboot))) # not yet developed fitdistrplus/tests/t-init-actuar.R0000644000176200001440000000257114421750625016767 0ustar liggesusersif(FALSE) { require(fitdistrplus) #test actuar initialization startargdefault <- fitdistrplus:::startargdefault #burr library(actuar) alpha <- 3 x <- rburr(1000, alpha, 4, .1) initburr <- function(x) { pi <- 1:3/4 qi <- 1-pi xi <- as.numeric(quantile(x, probs=pi)) y <- log(xi[2])/log(xi[1]/xi[2]) y1 <- log(xi[3]/xi[2])/log(xi[2]/xi[1]) y2 <- log(xi[1]/xi[3])/log(xi[2]/xi[1]) f <- function(eta) (qi[1]^eta-1)^y1*(qi[2]^eta-1)^y2*(qi[3]^eta-1) - 1 eta <- try(uniroot(f, c(-10, -1e-6))$root, silent=TRUE) if(inherits(eta, "try-error")) eta <- -1 alpha <- -1/eta lambda <- (qi[1]^eta-1)^y*(qi[2]^eta-1)^(-y-1) gamma <- log(lambda*(qi[1]^eta-1))/log(xi[1]) theta <- lambda^(1/gamma) list(shape1=alpha, shape2=gamma, rate=1/theta) } initburr(x) #fitdist(x, "burr", lower=0, start=initburr(x)) #transformed gamma x <- rtrgamma(1000, 3, 2, .1) start1 <- startargdefault(x, "trgamma") plot(ecdf(x)) curve(ptrgamma(x, start1$shape1, start1$shape2, start1$rate), add=TRUE, lty=2) #fitdist(x, "trgamma", lower=0) #inverse transformed gamma y <- rinvtrgamma(1000, 3, 2, .1) start1 <- startargdefault(y, "invtrgamma") plot(ecdf(y)) curve(pinvtrgamma(x, start1$shape1, start1$shape2, start1$rate), add=TRUE, lty=2) #fitdist(y, "invtrgamma") } fitdistrplus/tests/t-getparam.R0000644000176200001440000000035713742313702016343 0ustar liggesuserslibrary(fitdistrplus) computegetparam <- fitdistrplus:::computegetparam computegetparam(names(formals(dgamma))) computegetparam(names(formals(pgamma))) computegetparam(names(formals(qgamma))) computegetparam(names(formals(dnbinom))) fitdistrplus/tests/t-mledist-nocens.R0000644000176200001440000002526214205426677017504 0ustar liggesuserslibrary(fitdistrplus) # (1) basic fit of a normal distribution with maximum likelihood estimation # set.seed(1234) x1 <- rnorm(n=100) mledist(x1,"norm") # (2) defining your own distribution functions, here for the Gumbel distribution # for other distributions, see the CRAN task view dedicated to probability distributions 
dgumbel <- function(x,a,b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b))
mledist(x1,"gumbel",start=list(a=10,b=5), silent=TRUE)
mledist(x1,"gumbel",start=list(a=10,b=5), silent=FALSE)
# (3) fit a discrete distribution (Poisson)
#
set.seed(1234)
x2 <- rpois(n=30,lambda = 2)
mledist(x2,"pois")
# (4) fit a finite-support distribution (beta)
#
set.seed(1234)
x3 <- rbeta(n=100,shape1=5, shape2=10)
mledist(x3,"beta")
# (5) fit frequency distributions on USArrests dataset.
#
x4 <- USArrests$Assault
mledist(x4, "pois", silent=TRUE)
mledist(x4, "pois", silent=FALSE)
mledist(x4, "nbinom")
# (6) scaling problem
# the simulated dataset (below) has particularly small values, hence without scaling (10^0),
# the optimization raises an error. The for loop shows how scaling by 10^i
# for i=1,...,6 makes the fitting procedure work correctly.
set.seed(1234)
x2 <- rnorm(100, 1e-4, 2e-4)
for(i in 6:0) cat(i, try(mledist(x2*10^i, "cauchy")$estimate, silent=TRUE), "\n")
# (7) scaling problem
#
x <- c(-0.00707717, -0.000947418, -0.00189753, -0.000474947, -0.00190205, -0.000476077,
  0.00237812, 0.000949668, 0.000474496, 0.00284226, -0.000473149, -0.000473373, 0, 0,
  0.00283688, -0.0037843, -0.0047506, -0.00238379, -0.00286807, 0.000478583, 0.000478354,
  -0.00143575, 0.00143575, 0.00238835, 0.0042847, 0.00237248, -0.00142281, -0.00142484,
  0, 0.00142484, 0.000948767, 0.00378609, -0.000472478, 0.000472478, -0.0014181, 0,
  -0.000946522, -0.00284495, 0, 0.00331832, 0.00283554, 0.00141476, -0.00141476,
  -0.00188947, 0.00141743, -0.00236351, 0.00236351, 0.00235794, 0.00235239, -0.000940292,
  -0.0014121, -0.00283019, 0.000472255, 0.000472032, 0.000471809, -0.0014161, 0.0014161,
  -0.000943842, 0.000472032, -0.000944287, -0.00094518, -0.00189304, -0.000473821,
  -0.000474046, 0.00331361, -0.000472701, -0.000946074, 0.00141878, -0.000945627,
  -0.00189394, -0.00189753, -0.0057143, -0.00143369, -0.00383326, 0.00143919, 0.000479272,
  -0.00191847, -0.000480192, 0.000960154, 0.000479731, 0, 0.000479501, 0.000958313,
  -0.00383878, -0.00240674, 0.000963391, 0.000962464, -0.00192586, 0.000481812,
  -0.00241138, -0.00144963)
#only i == 0 (no scaling) should fail to converge.
for(i in 6:0) cat(i, try(mledist(x*10^i, "cauchy")$estimate, silent=TRUE), "\n")
# (8) normal mixture
#
#mixture of two normal distributions
#density
dnorm2 <- function(x, w, m1, s1, m2, s2) w*dnorm(x, m1, s1) + (1-w)*dnorm(x, m2, s2)
#numerically-approximated quantile function
qnorm2 <- function(p, w, m1, s1, m2, s2)
{
  L2 <- function(x, prob) (prob - pnorm2(x, w, m1, s1, m2, s2))^2
  sapply(p, function(pr) optimize(L2, c(-20, 30), prob=pr)$minimum)
}
#distribution function
pnorm2 <- function(q, w, m1, s1, m2, s2) w*pnorm(q, m1, s1) + (1-w)*pnorm(q, m2, s2)
#basic normal distribution
x <- c(rnorm(1000, 5), rnorm(1000, 10))
#MLE fit
fit1 <- mledist(x, "norm2", start=list(w=1/3, m1=4, s1=2, m2=8, s2=2), lower=c(0, 0, 0, 0, 0))
# (9) fit a Pareto and log-logistic distributions
#
if(any(installed.packages()[,"Package"] == "actuar"))
{
  require(actuar)
  #simulate a sample
  x4 <- rpareto(1000, 6, 2)
  #fit
  mledist(x4, "pareto", start=list(shape=10, scale=10), lower=1, upper=Inf)
  #simulate a sample
  x4 <- rllogis(1000, 6, 2)
  #fit
  mledist(x4, "llogis", start=list(shape=10, rate=1), lower=1, upper=Inf)
}
# (10) custom optim for exponential distribution
#
if(any(installed.packages()[,"Package"] == "rgenoud") && FALSE)
{
  mysample <- rexp(1000, 5)
  mystart <- list(rate=8)
  fNM <- mledist(mysample, "exp", optim.method="Nelder-Mead")
  fBFGS <- mledist(mysample, "exp", optim.method="BFGS")
  fLBFGSB <- mledist(mysample, "exp", optim.method="L-BFGS-B", lower=0)
  fSANN <- mledist(mysample, "exp", optim.method="SANN")
  fCG <- try(mledist(mysample, "exp", optim.method="CG"))
  if(inherits(fCG, "try-error"))
    fCG <- list(estimate=NA)
  #the warning tells us to use optimise...
  #to meet the standard 'fn' argument and specific name arguments, we wrap optimize,
  myoptimize <- function(fn, par, ...)
  {
    res <- optimize(f=fn, ..., maximum=FALSE)
    c(res, convergence=0, value=res$objective, par=res$minimum, hessian=NA)
  }
  foptimize <- mledist(mysample, "exp", start=mystart, custom.optim=myoptimize, interval=c(0, 100))
  library(rgenoud)
  #wrap the genoud function of the rgenoud package
  mygenoud <- function(fn, par, ...)
  {
    res <- genoud(fn, starting.values=par, ...)
c(res, convergence=0, counts=NULL) } fgenoud <- mledist(mysample, "exp", start=mystart, custom.optim= mygenoud, nvars=1, Domains=cbind(0, 10), boundary.enforcement=1, hessian=TRUE, print.level=0) c(NM=fNM$estimate, BFGS=fBFGS$estimate, LBFGSB=fLBFGSB$estimate, SANN=fSANN$estimate, CG=fCG$estimate, optimize=foptimize$estimate, fgenoud=fgenoud$estimate) } # (11) custom optim for gamma distribution # if(any(installed.packages()[,"Package"] == "rgenoud") && FALSE) { mysample <- rgamma(1000, 5, 3) mystart <- c(shape=10, rate=10) fNM <- mledist(mysample, "gamma", optim.method="Nelder-Mead") fBFGS <- mledist(mysample, "gamma", optim.method="BFGS") fLBFGSB <- mledist(mysample, "gamma", optim.method="L-BFGS-B", lower=0) fSANN <- mledist(mysample, "gamma", optim.method="SANN") fCG <- try( mledist(mysample, "gamma", optim.method="CG", control=list(maxit=1000)) ) if(inherits(fCG, "try-error")) fCG <- list(estimate=NA) fgenoud <- mledist(mysample, "gamma", start=mystart, custom.optim= mygenoud, nvars=2, Domains=cbind(c(0,0), c(100,100)), boundary.enforcement=1, hessian=TRUE, print.level=0) cbind(NM=fNM$estimate, BFGS=fBFGS$estimate, LBFGSB=fLBFGSB$estimate, SANN=fSANN$estimate, CG=fCG$estimate, fgenoud=fgenoud$estimate) data(groundbeef) fNM <- mledist(groundbeef$serving, "gamma", optim.method="Nelder-Mead") fBFGS <- mledist(groundbeef$serving, "gamma", optim.method="BFGS") fLBFGSB <- mledist(groundbeef$serving, "gamma", optim.method="L-BFGS-B", lower=0) fSANN <- mledist(groundbeef$serving, "gamma", optim.method="SANN") fCG <- try( mledist(groundbeef$serving, "gamma", optim.method="CG", control=list(maxit=10000)) ) if(inherits(fCG, "try-error")) fCG <- list(estimate=NA) fgenoud <- mledist(groundbeef$serving, "gamma", start=list(shape=4, rate=1), custom.optim= mygenoud, nvars=2, max.generations=10, Domains=cbind(c(0,0), c(10,10)), boundary.enforcement=1, hessian=TRUE, print.level=0, P9=10) cbind(NM=fNM$estimate, BFGS=fBFGS$estimate, LBFGSB=fLBFGSB$estimate, SANN=fSANN$estimate, CG=fCG$estimate, fgenoud=fgenoud$estimate) } # (12) test error messages # dnorm2 <- function(x, a) "NA" x <- rexp(10) #should get a one-line error res <- mledist(x, "norm2", start=list(a=1)) #as in attr(try(log("a"), silent=TRUE), "condition") # (13) weighted MLE # n <- 1e6 n <- 1e2 x <- rpois(n, 10) xtab <- table(x) xval <- sort(unique(x)) f1 <- mledist(x, "pois", start=list(lambda=mean(x)), optim.method="Brent", lower=0, upper=100, control=list(trace=1)) f2 <- mledist(xval, "pois", weights=xtab, start=list(lambda=mean(x))) f1$estimate f2$estimate #should be identical #test discrete distrib f2 <- try(mledist(xval, "pois", weights=1:length(xval), start=list(lambda=mean(x)))) #test non integer weights f2 <- try(mledist(xval, "pois", weights=rep(1/3, length(xval)), start=list(lambda=mean(x)))) f2 <- try(mledist(1:10, "pois", weights=c(rep(1, 9), 1.001), start=list(lambda=mean(x)))) f2 <- try(mledist(1:10, "pois", weights=c(rep(1, 9), 1.0000001), start=list(lambda=mean(x)))) # (14) no convergence # n <- 1e2 x <- c(rep(0, n), rpois(n, 10), rpois(n, 50)) mledist(x, "pois", optim.method="Nelder-Mead", control=list(maxit=10)) # (15) basic fit of a normal distribution with new fix.arg/start.arg # set.seed(1234) x1 <- rnorm(n=100) #correct usage mledist(x1,"norm") mledist(x1,"norm", start=function(x) list(mean=0, sd=1)) mledist(x1,"norm", fix.arg=function(x) list(mean=mean(x))) mledist(x1,"norm", fix.arg=list(mean=1/2)) mledist(x1,"norm", fix.arg=list(mean=1/2), start=list(sd=1)) mledist(x1,"norm", fix.arg=function(x) list(mean=0), 
start=list(sd=1)) mledist(x1,"norm", fix.arg=function(x) list(mean=mean(x)), start=list(sd=1)) #wrong usage (see text message in util-checkparam.R) try( mledist(x1,"norm", start=list(a=1/2)) ) #t3 try( mledist(x1,"norm", start=function(x) list(a=0, b=1)) ) #t3 try( mledist(x1,"norm", fix.arg=list(a=1/2)) ) #t4 try( mledist(x1,"norm", fix.arg=function(x) list(a=0), start=list(sd=1)) ) #t4 try( mledist(x1,"norm", start=matrix(1/2)) ) #t1 try( mledist(x1,"norm", fix.arg=matrix(1/2)) ) #t0 try( mledist(x1,"norm", fix.arg=matrix(1/2), start=matrix(1/2)) ) #t2 try( mledist(x1,"norm", fix.arg=function(x) list(mean=mean(x), sd=2), start=list(sd=1)) ) #t5 dabcnorm <- function(x, mean, sd) 1 try( mledist(x1,"abcnorm", fix.arg=function(x) list(mean=mean(x))) ) #t8 # (16) relevant example for zero modified geometric distribution # dzmgeom <- function(x, p1, p2) { p1 * (x == 0) + (1-p1)*dgeom(x-1, p2) } rzmgeom <- function(n, p1, p2) { u <- rbinom(n, 1, 1-p1) #prob to get zero is p1 u[u != 0] <- rgeom(sum(u != 0), p2)+1 u } # check # dzmgeom(0:5, 1/2, 1/10) x2 <- rzmgeom(100, 1/2, 1/10) x3 <- rzmgeom(100, 1/3, 1/10) x4 <- rzmgeom(100, 1/4, 1/10) table(x2) #this is the MLE which converges almost surely and in distribution to the true value. initp1 <- function(x) list(p1=mean(x == 0)) mledist(x2, "zmgeom", fix.arg=initp1, start=list(p2=1/2))[c("estimate", "fix.arg")] mledist(x3, "zmgeom", fix.arg=initp1, start=list(p2=1/2))[c("estimate", "fix.arg")] mledist(x4, "zmgeom", fix.arg=initp1, start=list(p2=1/2))[c("estimate", "fix.arg")] # (17) test the component optim.message x <- rnorm(100) #change parameter to obtain unsuccessful convergence mledist(x, "norm", control=list(maxit=2), start=list(mean=1e5, sd=1), optim.method="L-BFGS-B", lower=0) # (18) management of bounds in optim/constrOptim x <- rexp(100) mledist(x, "exp") #optim, BFGS mledist(x, "exp", optim.method="Brent", lower=0, upper=100) #optim, Brent mledist(x, "exp", optim.method="Nelder-Mead") #optim, Nelder-Mead mledist(x, "exp", lower=0, optim.method="Nelder-Mead") #constrOptim, Nelder-Mead mledist(x, "exp", lower=0, optim.method="BFGS") #optim, L-BFGS-B x <- rbeta(100, 3/2, 7/3) mledist(x, "beta", optim.method="Nelder") #optim, Nelder-Mead mledist(x, "beta", lower=0, optim.method="Nelder-Mead") #constrOptim, Nelder-Mead #as the result of optim(c(-1.2,1), fr, method = "Nelder-Mead", hessian=TRUE, gr=NULL, lower=-Inf, upper=Inf) from optim() example mledist(x, "beta", lower=0, optim.method="BFGS") #optim, L-BFGS-B fitdistrplus/tests/t-cvg-algo.R0000644000176200001440000001003514050660542016234 0ustar liggesusers if(FALSE) { library(fitdistrplus) #(1) beta distribution # n <- 100 set.seed(12345) x <- rbeta(n, 3, 3/4) psi <- function(x) digamma(x) grbetalnl <- function(x, a, b) c(log(x)-psi(a)+psi(a+b), log(1-x)-psi(b)+psi(a+b)) #grbetalnl(x, 3, 4) grlnL <- function(par, obs, ...) -rowSums(sapply(obs, function(x) grbetalnl(x, a=par[1], b=par[2]))) #rowSums(sapply(x, function(x) grbetalnl(x, 3, 4))) #grlnL(c(3, 4), x) #grlnL(c(3, 3/4), x) constrOptim2 <- function(par, fn, gr=NULL, ui, ci, ...) constrOptim(theta=unlist(par), f=fn, grad=gr, ui=ui, ci=ci, ...) 
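# Added sanity check (a hedged sketch, not part of the original test file):
# compare the analytic gradient grlnL defined above with a central
# finite-difference approximation of the negative beta log-likelihood;
# the two columns should agree to roughly 1e-6 on this sample.
nllbeta <- function(par, obs) -sum(dbeta(obs, par[1], par[2], log = TRUE))
numgrad <- function(f, par, obs, eps = 1e-6)
  sapply(seq_along(par), function(j) {
    h <- replace(numeric(length(par)), j, eps)
    (f(par + h, obs) - f(par - h, obs)) / (2 * eps)
  })
cbind(analytic = grlnL(c(3, 3/4), x), numeric = numgrad(nllbeta, c(3, 3/4), x))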
#control parameters
ctr <- list(trace=3, REPORT=1, maxit=1000)
ctr <- list(trace=0, REPORT=1, maxit=1000)
bfgs_gr$time <- system.time(bfgs_gr <- mledist(x, dist="beta", optim.method="BFGS", gr=grlnL, control=ctr))[3]
bfgs <- mledist(x, dist="beta", optim.method="BFGS", control=ctr)
lbfgs_gr <- mledist(x, dist="beta", optim.method="L-BFGS-B", gr=grlnL, control=ctr, lower=c(0,0))
lbfgs <- mledist(x, dist="beta", optim.method="L-BFGS-B", control=ctr, lower=c(0,0))
cg_gr <- mledist(x, dist="beta", optim.method="CG", gr=grlnL, control=ctr)
cg <- mledist(x, dist="beta", optim.method="CG", control=ctr)
nm_gr <- mledist(x, dist="beta", optim.method="Nelder", gr=grlnL, control=ctr)
nm <- mledist(x, dist="beta", optim.method="Nelder", control=ctr)
constr_nm_gr <- mledist(x, dist="beta", custom.optim=constrOptim2, ui = diag(2), ci = c(0, 0),
  optim.method="Nelder", gr=grlnL, control=ctr)
constr_nm <- mledist(x, dist="beta", custom.optim=constrOptim2, ui = diag(2), ci = c(0, 0),
  optim.method="Nelder", control=ctr)
constr_bfgs_gr <- mledist(x, dist="beta", custom.optim=constrOptim2, ui = diag(2), ci = c(0, 0),
  optim.method="BFGS", gr=grlnL, control=ctr)
constr_bfgs <- mledist(x, dist="beta", custom.optim=constrOptim2, ui = diag(2), ci = c(0, 0),
  optim.method="BFGS", control=ctr)
constr_cg_gr <- mledist(x, dist="beta", custom.optim=constrOptim2, ui = diag(2), ci = c(0, 0),
  optim.method="CG", gr=grlnL, control=ctr)
constr_cg <- mledist(x, dist="beta", custom.optim=constrOptim2, ui = diag(2), ci = c(0, 0),
  optim.method="CG", control=ctr)
lnL <- function(par, fix.arg, obs, ddistnam)
{
  fitdistrplus:::loglikelihood(par, fix.arg, obs, ddistnam, weights = rep(1, NROW(obs)))
}
constrOptim2(c(shape1=1, shape2=1), lnL, obs=x, fix.arg=NULL, ddistnam="dbeta", ui = diag(2), ci = c(0, 0))
#no log
dbeta3 <- function(x, shape1, shape2) dbeta(x, shape1, shape2)
#Ripley trick : param transform
dbeta2 <- function(x, shape1, shape2, log) dbeta(x, exp(shape1), exp(shape2), log=log)
pbeta2 <- function(q, shape1, shape2, log.p) pbeta(q, exp(shape1), exp(shape2), log.p=log.p)
bfgs2 <- mledist(x, dist="beta2", optim.method="BFGS", control=ctr, start=list(shape1=0, shape2=0))
bfgs3 <- mledist(x, dist="beta3", optim.method="BFGS", control=ctr, start=list(shape1=1, shape2=1))
getval <- function(x) c(x$estimate, loglik=x$loglik, x$counts)
getval2 <- function(x) c(exp(x$estimate), loglik=x$loglik, x$counts)
cbind(trueval=c(3, 3/4, lnL(c(3, 3/4), NULL, x, "dbeta"), NA, NA),
  NM=getval(nm), NMgrad=getval(nm_gr), constr_NM=getval(constr_nm), constr_NMgrad=getval(constr_nm_gr),
  CG=getval(cg), CGgrad=getval(cg_gr), constr_CG=getval(constr_cg), constr_CGgrad=getval(constr_cg_gr),
  BFGS=getval(bfgs), BFGSgrad=getval(bfgs_gr), constr_BFGS=getval(constr_bfgs), constr_BFGSgrad=getval(constr_bfgs_gr),
  BFGS_exp=getval2(bfgs2), BFGS_nolog=getval(bfgs3))
llsurface(min.arg=c(0.1, 0.1), max.arg=c(7, 3), plot.arg=c("shape1", "shape2"), lseq=50, data=x, distr="beta")
points(bfgs$estimate[1], bfgs$estimate[2], pch="+", col="red")
points(3, 3/4, pch="x", col="green")
}
fitdistrplus/tests/t-parallel.R0000644000176200001440000000216513742313702016336 0ustar liggesusersvisualize <- FALSE # TRUE for manual tests with visualization of results
if(visualize)
{
  library(parallel)
  #basic function evaluating the empirical mean of a Gaussian sample
  f <- function(i) mean(rnorm(1e4, mean=i))
  #simple example
  glist <- 1:20
  cl <- parallel::makeCluster(2)
  system.time( res <- parallel::parLapply(cl, glist, f) )
  parallel::stopCluster(cl)
  #example varying the
number of cores and the number of simulations
  nbsimu <- 10^(1:2)
  cores <- 1:4
  cores <- 1:getOption("cl.cores", 2)
  partime <- matrix(NA, length(nbsimu), length(cores)+1)
  colnames(partime) <- c("R", paste("core",cores))
  rownames(partime) <- paste("n", nbsimu, sep="=")
  partime[, 1] <- sapply(1:length(nbsimu), function(i) system.time(lapply(1:nbsimu[i], f))[3])
  for(j in 1:length(cores))
  {
    print(cores[j])
    cl <- parallel::makeCluster(cores[j])
    partime[, j+1] <- sapply(1:length(nbsimu), function(i) system.time(parallel::parLapply(cl, 1:nbsimu[i], f))[3])
    parallel::stopCluster(cl)
  }
  partime
}fitdistrplus/tests/t-ppcompcens.R0000644000176200001440000000447013742313702016712 0ustar liggesuserslibrary(fitdistrplus)
visualize <- FALSE # TRUE for manual tests with visualization of results
data(smokedfish)
fitsf <- fitdistcens(smokedfish,"lnorm")
plot(fitsf)
ppcompcens(fitsf)
ppcompcens(fitsf, fillrect = NA)
ppcompcens(fitsf, fitcol = "black")
ppcompcens(fitsf, fitcol = "black", fillrect = NA)
ppcompcens(fitsf, ylim = c(0.4,1))
ppcompcens(fitsf, xlim = c(0.4,1))
ppcompcens(fitsf, xlim = c(0.4,1), ylim = c(0,1))
ppcompcens(fitsf, xlim = c(0.5,0.99), xlogscale = TRUE)
try(ppcompcens(fitsf, xlogscale = TRUE))
if (requireNamespace("ggplot2", quietly = TRUE))
{
  ppcompcens(fitsf, plotstyle = "ggplot")
}
if (requireNamespace("ggplot2", quietly = TRUE) & visualize)
{
  ppcompcens(fitsf, fillrect = NA, plotstyle = "ggplot")
  ppcompcens(fitsf, fitcol = "black", plotstyle = "ggplot")
  ppcompcens(fitsf, fitcol = "black", fillrect = NA, plotstyle = "ggplot")
  ppcompcens(fitsf, ylim = c(0.4,1), plotstyle = "ggplot")
  ppcompcens(fitsf, xlim = c(0.4,1), plotstyle = "ggplot")
  ppcompcens(fitsf, xlim = c(0.4,1), ylim = c(0,1), plotstyle = "ggplot")
  ppcompcens(fitsf, xlim = c(0.5,0.99), xlogscale = TRUE, plotstyle = "ggplot")
}
if (visualize)
{
  data(fluazinam)
  log10EC50 <- log10(fluazinam)
  fln <- fitdistcens(log10EC50,"norm")
  plot(fln)
  if (requireNamespace("ggplot2", quietly = TRUE))
  {
    ppcompcens(fln, plotstyle = "ggplot")
  }
}
data(salinity)
log10LC50 <- log10(salinity)
fn <- fitdistcens(log10LC50,"norm")
fl <- fitdistcens(log10LC50,"logis")
ppcompcens(list(fn, fl))
ppcompcens(list(fn, fl), fitlwd = c(5, 2))
if (visualize)
{
  plotdistcens(log10LC50)
  plotdistcens(log10LC50, NPMLE = FALSE)
  plot(fn)
  plot(fl)
  ppcompcens(fn)
  ppcompcens(fl)
  ppcompcens(list(fn, fl), ynoise = FALSE)
  ppcompcens(list(fn, fl), xlogscale = TRUE, xlim = c(0.01, 0.6))
}
if (requireNamespace ("ggplot2", quietly = TRUE) )
{
  ppcompcens(list(fn, fl), plotstyle = "ggplot", fitcol = "red")
  ppcompcens(list(fn, fl), plotstyle = "ggplot", fitlwd = c(5, 2))
}
if (requireNamespace ("ggplot2", quietly = TRUE) & visualize)
{
  ppcompcens(fn, plotstyle = "ggplot")
  ppcompcens(fl, plotstyle = "ggplot")
  ppcompcens(list(fn, fl), plotstyle = "ggplot", fitcol = "red")
  ppcompcens(list(fn, fl), ynoise = FALSE, plotstyle = "ggplot")
  ppcompcens(list(fn, fl), xlogscale = TRUE, xlim = c(0.01, 0.6), plotstyle = "ggplot")
}
fitdistrplus/tests/t-fitdist.R0000644000176200001440000002626614205426665016220 0ustar liggesuserslibrary(fitdistrplus)
nbboot <- 100
nbboot <- 10
nsample <- 100
nsample <- 10
visualize <- FALSE # TRUE for manual tests with visualization of results
# (1) basic fit of a gamma distribution by maximum likelihood estimation
#
data(groundbeef)
serving <- groundbeef$serving
fitg <- fitdist(serving, "gamma")
summary(fitg)
plot(fitg)
cdfcomp(fitg, addlegend=FALSE)
#check names
names(fitdist(serving, "gamma", optim.method="Brent", lower=0, upper=10, fix.arg=list(shape=2))$estimate)
names(fitdist(serving, "gamma", optim.method="Nelder-Mead")$estimate)
names(fitdist(serving, "gamma", optim.method="BFGS")$estimate)
# names(fitdist(serving, "gamma", optim.method="CG", start=list(shape=4, rate=1/20))$estimate)
if(visualize)
{ # check ERROR on aarch64-apple-darwin20.4.0 (64-bit) (2021/05/12)
  set.seed(1234)
  names(fitdist(serving, "gamma", optim.method="L-BFGS-B", lower=0)$estimate)
}
# (7) custom optimization function
#
#create the sample
set.seed(1234)
mysample <- rexp(nsample, 5)
mystart <- list(rate=8)
res1 <- fitdist(mysample, dexp, start= mystart, optim.method="Nelder-Mead")
#show the result
summary(res1)
#the warning tells us to use optimise, because the Nelder-Mead is not adequate.
#to meet the standard 'fn' argument and specific name arguments, we wrap optimize,
myoptimize <- function(fn, par, ...)
{
  res <- optimize(f=fn, ..., maximum=FALSE) #assume the optimization function minimizes
  standardres <- c(res, convergence=0, value=res$objective, par=res$minimum, hessian=NA)
  return(standardres)
}
#call fitdist with a 'custom' optimization function
res2 <- fitdist(mysample, dexp, start=mystart, custom.optim=myoptimize, interval=c(0, 100))
#show the result
summary(res2)
# (8) custom optimization function - another example with the genetic algorithm
#
if(any(installed.packages()[,"Package"] == "rgenoud") && visualize)
{
  #set a sample
  fit1 <- fitdist(serving, "gamma")
  summary(fit1)
  #wrap the genoud function of the rgenoud package
  mygenoud <- function(fn, par, ...)
  {
    require(rgenoud)
    res <- genoud(fn, starting.values=par, ...)
    standardres <- c(res, convergence=0, counts=NULL)
    return(standardres)
  }
  #call fitdist with a 'custom' optimization function
  fit2 <- fitdist(serving, "gamma", custom.optim=mygenoud, nvars=2, start=as.list(fit1$estimate),
    Domains=cbind(c(0, 0), c(10, 10)), boundary.enforcement=1, print.level=0, hessian=TRUE)
  summary(fit2)
}
# (11) Fit of a Pareto distribution by numerical moment matching estimation
#
if(any(installed.packages()[,"Package"] == "actuar") && visualize)
{
  require(actuar)
  #simulate a sample
  set.seed(1234)
  x4 <- rpareto(nsample, 6, 2)
  #empirical raw moment
  memp <- function(x, order) ifelse(order == 1, mean(x), sum(x^order)/length(x))
  #fit
  fP <- fitdist(x4, "pareto", method="mme", order=c(1, 2), memp="memp",
    start=list(shape=10, scale=10), lower=1, upper=Inf)
  summary(fP)
  plot(fP)
}
# (14) scaling problem - too small values
#
if (visualize) # LONG TO RUN ON CRAN
{
  x2 <- c(-0.00707717, -0.000947418, -0.00189753, -0.000474947, -0.00190205, -0.000476077,
    0.00237812, 0.000949668, 0.000474496, 0.00284226, -0.000473149, -0.000473373, 0, 0,
    0.00283688, -0.0037843, -0.0047506, -0.00238379, -0.00286807, 0.000478583, 0.000478354,
    -0.00143575, 0.00143575, 0.00238835, 0.0042847, 0.00237248, -0.00142281, -0.00142484,
    0, 0.00142484, 0.000948767, 0.00378609, -0.000472478, 0.000472478, -0.0014181, 0,
    -0.000946522, -0.00284495, 0, 0.00331832, 0.00283554, 0.00141476, -0.00141476,
    -0.00188947, 0.00141743, -0.00236351, 0.00236351, 0.00235794, 0.00235239, -0.000940292,
    -0.0014121, -0.00283019, 0.000472255, 0.000472032, 0.000471809, -0.0014161, 0.0014161,
    -0.000943842, 0.000472032, -0.000944287, -0.00094518, -0.00189304, -0.000473821,
    -0.000474046, 0.00331361, -0.000472701, -0.000946074, 0.00141878, -0.000945627,
    -0.00189394, -0.00189753, -0.0057143, -0.00143369, -0.00383326, 0.00143919, 0.000479272,
    -0.00191847, -0.000480192, 0.000960154, 0.000479731, 0, 0.000479501, 0.000958313,
    -0.00383878, -0.00240674, 0.000963391, 0.000962464, -0.00192586, 0.000481812, -0.00241138,
-0.00144963) for(i in 6:0) { cat("\nscaling", 10^i, "\n") res <- try(mledist(x2*10^i, "cauchy"), silent=TRUE) if(inherits(res, "try-error")) print(res) else { cat("estimate\n") print(res$estimate) cat("Hessian\n") print(res$hessian) } } } # (15) scaling problem - too big values # if (visualize) # LONG TO RUN ON CRAN { x1 <- c( 1401928684, 1413455609, 1432458425, 1436910475, 1494883250, 1565770323, 1577486458, 1568908053, 1606424896, 1632264979, 1780495643, 1865525923, 2035689865, 2141429306, 2335443964, 2465661689, 2563368221, 2845012431, 2949890881, 3180645942, 3309009836, 3618581152, 4109197451, 4064662257, 4028375795, 4176781983, 4303024833, 4493470109 ) for(i in 0:5) { cat("\nscaling", 10^(-2*i), "\n") res <- mledist(x1*10^(-2*i), "norm") Hm1 <- try(solve(res$hessian), silent=TRUE) if(inherits(Hm1, "try-error")) print(Hm1) else { cat("estimate\n") print(res$estimate) cat("Hessian\n") print(res$hessian) cat("inverse Hessian\n") print(Hm1) } } fitdist(x1, "norm") fitdist(x1*1e-6, "norm") } # (16) Fit of a lognormal distribution on acute toxicity values of endosulfan for # nonarthropod invertebrates, using maximum likelihood estimation # to estimate what is called a species sensitivity distribution # (SSD) in ecotoxicology, followed by estimation of the 5 percent quantile value of # the fitted distribution, what is called the 5 percent hazardous concentration (HC5) # in ecotoxicology, with its two-sided 95 percent confidence interval calculated by # parametric bootstrap # data(endosulfan) ATV <- subset(endosulfan, group == "NonArthroInvert")$ATV log10ATV <- log10(subset(endosulfan, group == "NonArthroInvert")$ATV) fln <- fitdist(log10ATV, "norm") quantile(fln, probs = 0.05) # (17) Fit of a triangular distribution using Cramer-von Mises or # Kolmogorov-Smirnov distance # if(any(installed.packages()[,"Package"] == "mc2d") && visualize) { set.seed(1234) require(mc2d) t <- rtriang(100,min=5,mode=6,max=10) # nsample not used : does not converge with a too small sample fCvM <- fitdist(t,"triang",method="mge",start = list(min=4, mode=6,max=9),gof="CvM") fKS <- fitdist(t,"triang",method="mge",start = list(min=4, mode=6,max=9),gof="KS") cdfcomp(list(fCvM,fKS)) } # (18) gumbel distribution # dgumbel <- function(x, a, b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b)) pgumbel <- function(q, a, b) exp(-exp((a-q)/b)) qgumbel <- function(p, a, b) a-b*log(-log(p)) data(danishuni) fitdist(danishuni$Loss, "gumbel", start=list(a=5, b=10)) # (19) check the 'start' argument # if (FALSE) # NO INTEREST WITHOUT VISUALIZATION OF THE RESULT { #create the sample mysample <- rexp(nsample, 5) mystart2 <- list(rate2=8) mystart3 <- list(8) try( fitdist(mysample, dexp, start= mystart2, method="mle") ) try( fitdist(mysample, dexp, start= mystart3, method="mle") ) try( fitdist(mysample, dexp, start= mystart2, method="mme") ) try( fitdist(mysample, dexp, start= mystart3, method="mme") ) try( fitdist(mysample, dexp, start= mystart2, method="qme", probs=1/2) ) try( fitdist(mysample, dexp, start= mystart3, method="qme", probs=1/2) ) try( fitdist(mysample, dexp, start= mystart2, method="mge", gof="AD") ) try( fitdist(mysample, dexp, start= mystart3, method="mge", gof="AD") ) } # (20) example with dexgauss # would require to suggest the package gamlss.dist in the Description file # #if(any(installed.packages()[,"Package"] == "gamlss.dist")) #{ # require(gamlss.dist) # set.seed(1234) # a=rexGAUS(100,mu=500,sigma=50,nu=75) # fitdist(a,dexGAUS,start=list(mu=median(a),sigma=sqrt(var(a)/2),nu=sqrt(var(a)/2))) #} # (21) check the 'keepdata' 
argument
#
if (visualize) # REQUIRES VISUALIZATION OF THE RESULTS
{
  #create the sample
  x <- rexp(1e6, 5)
  summary(x)
  f1 <- fitdist(x, "exp", keepdata=FALSE)
  f2 <- fitdist(x, "exp", keepdata=TRUE)
  par(mfrow=c(1,2))
  cdfcomp(f1)
  cdfcomp(f2)
}
# (22) relevant example for zero modified geometric distribution
#
dzmgeom <- function(x, p1, p2) { p1 * (x == 0) + (1-p1)*dgeom(x-1, p2) }
pzmgeom <- function(q, p1, p2) { p1 * (q >= 0) + (1-p1)*pgeom(q-1, p2) }
rzmgeom <- function(n, p1, p2)
{
  u <- rbinom(n, 1, 1-p1) #prob to get zero is p1
  u[u != 0] <- rgeom(sum(u != 0), p2)+1
  u
}
x2 <- rzmgeom(nsample, 1/2, 1/10)
table(x2)
#this is the MLE which converges almost surely and in distribution to the true value.
initp1 <- function(x) list(p1=mean(x == 0))
fitdist(x2, "zmgeom", start=list(p1=1/2, p2=1/2))
f2 <- fitdist(x2, "zmgeom", fix.arg=initp1, start=list(p2=1/2))
print(f2)
summary(f2)
f2 <- fitdist(x2, "zmgeom", fix.arg=list(p1=1/2), start=list(p2=1/2))
print(f2)
summary(f2)
# (23) check the use of weights with MLE
#
set.seed(1234)
x <- rpois(nsample, 10)
xtab <- table(x)
xval <- sort(unique(x))
f1 <- fitdist(x, "pois")
f2 <- fitdist(xval, "pois", weights = xtab)
f1$estimate
f2$estimate #should be identical
# (24) check the use of weights with other methods
#
set.seed(1234)
x <- rpois(nsample, 10)
xtab <- table(x)
xval <- sort(unique(x))
(f1 <- fitdist(x, "norm", method = "mle"))
(f2 <- fitdist(xval, "norm", weights = xtab, method = "mle"))
(f1 <- fitdist(x, "norm", method = "mme"))
(f2 <- fitdist(xval, "norm", weights = xtab, method = "mme"))
(f1 <- fitdist(x, "norm", method = "qme", probs=c(1/4, 3/4)))
(f2 <- fitdist(xval, "norm", method = "qme", weights = xtab, probs=c(1/4, 3/4) ))
fitdist(x, "norm", method="mge", gof = "CvM")
try(fitdist(xval, "norm", method="mge", gof = "CvM", weights = xtab)) # not yet developed
# (24b) check the use of weights with qme with a discrete distribution
#
set.seed(1234)
x <- rpois(nsample, 10)
xtab <- table(x)
xval <- sort(unique(x))
(f1 <- fitdist(x, "pois", method = "qme", probs=c(1/2)))
(f2 <- fitdist(xval, "pois", method = "qme", weights = xtab, probs=c(1/2) )) # similar to f1
fitdist(xval, "pois", method = "qme", weights = xtab, probs=c(1/2), optim.method="SANN", control=list(maxit=1000))
# fitdist(x, "pois", method = "qme", probs=c(1/2), optim.method="SANN", control=list(maxit=1000)) # should be similar
# should give similar results for big samples
# (25) check the warning messages when using weights in the fit followed by functions
# that do not yet take weights into account
# with an example to be used later to see if weights are well taken into account
#
if(visualize)
{
  x3 <- rnorm(100) # this sample size must be fixed here (see next lines, 50+50)
  x3 <- sort(x3)
  (f <- fitdist(x3, "norm", method="mle", weights= c(rep(1, 50), rep(2, 50))))
  try(plot(f))
  try(cdfcomp(f))
  (f2 <- fitdist(x3, "logis", method="mle", weights= c(rep(1, 50), rep(2, 50))))
  try(cdfcomp(list(f,f2)))
  try(denscomp(f))
  try(denscomp(list(f,f2)))
  try(ppcomp(f))
  try(ppcomp(list(f,f2)))
  try(qqcomp(f))
  try(qqcomp(list(f,f2)))
  try(gofstat(f))
  try(gofstat(list(f,f2)))
  try(bootdist(f))
}fitdistrplus/tests/t-startfixarg-overall.R0000644000176200001440000001205613742313702020542 0ustar liggesuserslibrary(fitdistrplus)
# (1) non-censored data (continuous)
#
s1 <- NULL
s2 <- list("mean"=2)
s0 <- list("mean"=2, "sd"=1)
s3 <- function(x) list("mean"=1.01*mean(x))
s4 <- function(x) list("mean"=1.01*mean(x), "sd"=sd(x))
f1 <- NULL
f2 <- list("sd"=3)
f3 <- function(x) list("sd"=1.01*sd(x))
x <- rnorm(1000)
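# Added check (hedged, not part of the original test file): when start or
# fix.arg is given as a function, fitdistrplus evaluates it on the observed
# data, so s3 and f3 above should reduce to plain named lists; str() makes
# that explicit before the fits below.
str(s3(x))
str(f3(x))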
#redefine normal distribution for better check
dnorm2 <- dnorm
pnorm2 <- pnorm
qnorm2 <- qnorm
rnorm2 <- rnorm
mnorm2 <- function(order, mean, sd) ifelse(order == 1, mean, sd^2)
memp <- function(x, order) mean(x^order)
# both NULL
mf1 <- mledist(x, "norm", start=s1, fix.arg=f1)
mf1 <- mmedist(x, "norm", start=s1, fix.arg=f1)
mf1 <- qmedist(x, "norm", start=s1, fix.arg=f1, probs=1:2/3)
mf1 <- mgedist(x, "norm", start=s1, fix.arg=f1)
fit1 <- fitdist(x, "norm", start=s1, fix.arg=f1)
boot1 <- bootdist(fit1, niter=10)
# both named list
mf1 <- mledist(x, "norm2", start=s2, fix.arg=f2)
mf1 <- mmedist(x, "norm2", start=s2, fix.arg=f2, order=1, memp=memp)
mf1 <- qmedist(x, "norm2", start=s2, fix.arg=f2, probs=1/3)
mf1 <- mgedist(x, "norm2", start=s2, fix.arg=f2)
fit1 <- fitdist(x, "norm2", start=s2, fix.arg=f2)
boot1 <- bootdist(fit1, niter=10)
# named list and NULL
mf1 <- mledist(x, "norm2", start=s0, fix.arg=f1)
mf1 <- mmedist(x, "norm2", start=s0, fix.arg=f1, order=1:2, memp=memp)
mf1 <- qmedist(x, "norm2", start=s0, fix.arg=f1, probs=1:2/3)
mf1 <- mgedist(x, "norm2", start=s0, fix.arg=f1)
fit1 <- fitdist(x, "norm2", start=s0, fix.arg=f1)
boot1 <- bootdist(fit1, niter=10)
# NULL and named list
mf1 <- mledist(x, "norm", start=s1, fix.arg=f2)
mf1 <- qmedist(x, "norm", start=s1, fix.arg=f2, probs=1/3)
mf1 <- mgedist(x, "norm", start=s1, fix.arg=f2)
fit1 <- fitdist(x, "norm", start=s1, fix.arg=f2)
boot1 <- bootdist(fit1, niter=10)
# both function
mf1 <- mledist(x, "norm2", start=s3, fix.arg=f3)
mf1 <- mmedist(x, "norm2", start=s3, fix.arg=f3, order=1, memp=memp)
mf1 <- qmedist(x, "norm2", start=s3, fix.arg=f3, probs=1/3)
mf1 <- mgedist(x, "norm2", start=s3, fix.arg=f3)
fit1 <- fitdist(x, "norm2", start=s3, fix.arg=f3)
boot1 <- bootdist(fit1, niter=10)
# function and NULL
mf1 <- mledist(x, "norm2", start=s4, fix.arg=f1)
mf1 <- mmedist(x, "norm2", start=s4, fix.arg=f1, order=1:2, memp=memp)
mf1 <- qmedist(x, "norm2", start=s4, fix.arg=f1, probs=1:2/3)
mf1 <- mgedist(x, "norm2", start=s4, fix.arg=f1)
fit1 <- fitdist(x, "norm2", start=s4, fix.arg=f1)
boot1 <- bootdist(fit1, niter=10)
# NULL and function
mf1 <- mledist(x, "norm", start=s1, fix.arg=f3)
mf1 <- qmedist(x, "norm", start=s1, fix.arg=f3, probs=1/3)
mf1 <- mgedist(x, "norm", start=s1, fix.arg=f3)
fit1 <- fitdist(x, "norm", start=s1, fix.arg=f3)
boot1 <- bootdist(fit1, niter=10)
# should raise error for too few parameters
try(mgedist(x, "norm", start=s2, fix.arg=f1))
try(fitdist(x, "norm", start=s2, fix.arg=f1))
# should raise error for too many parameters
try(mgedist(x, "norm", start=s0, fix.arg=f2))
try(fitdist(x, "norm", start=s0, fix.arg=f2))
# should raise error for NA value
try(mgedist(x, "norm", start=s1, fix.arg=list(sd=NA)))
try(fitdist(x, "norm", start=list(sd=NA)))
# should raise error for inconsistent parameter
try(mgedist(x, "norm", start=function(x) list("toto"=1)))
try(fitdist(x, "norm", fix=list(toto=2)))
#test unset arguments
dbeta2 <- function(x, shape1, ncp2) dbeta(x, shape1, shape1, ncp2)
x <- rbeta(1e2, 3, 3)
try(fitdist(x, "beta2", start=list(shape1=2)))
dbeta3 <- function(x, shape1, ncp2=0) dbeta(x, shape1, shape1, ncp2)
fitdist(x, "beta3", start=list(shape1=2))
# (2) censored data
#
data(salinity)
log10LC50 <- log10(salinity)
s1 <- NULL
s2 <- list("mean"=2)
s0 <- list("mean"=2, "sd"=1)
s3 <- function(x) list("mean"=mean(x))
f1 <- NULL
f2 <- list("sd"=3)
f3 <- function(x) list("sd"=sd(x))
fitdistcens(log10LC50, "norm", start=s1, fix.arg = f1)
fitdistcens(log10LC50, "norm", start=s1, fix.arg = f2)
fitdistcens(log10LC50, "norm",
start=s2, fix.arg = f2) fitdistcens(log10LC50, "norm", start=s0, fix.arg = f1) fitdistcens(log10LC50, "norm", start=s3, fix.arg = f2) fitdistcens(log10LC50, "norm", start=s3, fix.arg = f3) fitdistcens(log10LC50, "norm", start=s1, fix.arg = f3) fit1 <- fitdistcens(log10LC50, "norm", start=s1, fix.arg = f1) boot1 <- bootdistcens(fit1, niter = 10) fit1 <- fitdistcens(log10LC50, "norm", start=s3, fix.arg = f2) boot1 <- bootdistcens(fit1, niter = 10) fit1 <- fitdistcens(log10LC50, "norm", start=s2, fix.arg = f3) boot1 <- bootdistcens(fit1, niter = 10) # (3) non-censored data (discrete) # n <- 200 trueval <- c("size"=10, "prob"=3/4, "mu"=10/3) x <- rnbinom(n, trueval["size"], trueval["prob"]) mledist(x, "nbinom") fitdist(x, "nbinom") # (4) non-censored data (continuous) external distributions # data("endosulfan") ATV <-endosulfan$ATV fendo.ln <- fitdist(ATV, "lnorm") fendo.g <- fitdist(ATV, "gamma", start=list(shape=2, scale=1), lower=0) require("actuar") fendo.ll <- fitdist(ATV, "llogis", start = list(shape = 1, scale = 500)) fendo.P <- fitdist(ATV, "pareto", start = list(shape = 1, scale = 500)) fendo.B <- fitdist(ATV, "burr", start = list(shape1 = 0.3, shape2 = 1, rate = 1)) fitdistrplus/tests/t-msedist.R0000644000176200001440000001300214050660542016202 0ustar liggesuserslibrary(fitdistrplus) nboot <- 1000 nboot <- 10 nsample <- 10 set.seed(123) #manual implementation geomeanspacing <- function(pdist, obs, ...) { sx <- c(-Inf, sort(obs), Inf) n <- length(sx) Di <- pdist(sx[-1], ...) - pdist(sx[-n], ...) mean(log(Di)) } #-------------------------------------------------------- # exponential sample x1 <- rexp(nsample) lam <- seq(0.1, 2, length=51) Sn <- sapply(lam, function(x) geomeanspacing(pexp, obs=x1, rate=x)) Dn <- function(theta) -geomeanspacing(pexp, obs=x1, rate=theta[1]) #theoretical optimum Dn(1/mean(x1)) #check curve behavior par(mar=c(4,4,2,1)) plot(lam, Sn, type="l", xlab="theta", main="Average spacing logarithm") abline(v=1, col="green") abline(v=msedist(x1, "exp")$estimate, lty=2, col="blue") legend("bottomright", lty=1:2, col=c("green", "blue"), leg=c("theoretical value", "fitted value")) msedist(x1, "exp", control=list(trace=0, REPORT=1)) mse_exp <- fitdist(x1, "exp", method="mse") plot(mse_exp) summary(mse_exp) gofstat(mse_exp) mse_exp_boot <- bootdist(mse_exp, niter = nboot) plot(mse_exp_boot) abline(v=1, col="green") abline(v=msedist(x1, "exp")$estimate, lty=2, col="blue") legend("bottomright", lty=1:2, col=c("green", "blue"), leg=c("theoretical value", "fitted value")) # library(BMT) # x <- rBMT(1e3, 1/4, 3/4) # # BMTfit.mpse(x) # fitdist(x, "BMT", method="mse", start=list(p3=1/2, p4=1/2, p1=-1/2, p2=1/2), lower=c(0, 0, -Inf, 0), # upper=c(1,1,0, Inf)) # pBMT(x, p3=1/2, p4=1/2, p1=-1/2, p2=1/2) #-------------------------------------------------------- # lognormal sample x1 <- rlnorm(nsample, 0, 1) mu <- seq(-1, 1, length=51) Sn <- sapply(mu, function(x) geomeanspacing(plnorm, obs=x1, mean=x, sd=1)) Dn <- function(theta) -geomeanspacing(plnorm, obs=x1, mean=theta[1], sd=theta[2]) plot(mu, Sn, type="l") abline(v=0) optim(c(2,2), Dn) msedist(x1, "lnorm", control=list(trace=0, REPORT=1)) mse_lnorm <- fitdist(x1, "lnorm", method="mse") mle_lnorm <- fitdist(x1, "lnorm", method="mle") plot(mse_lnorm) summary(mse_lnorm) cdfcomp(list(mse_lnorm, mle_lnorm)) gofstat(list(mse_lnorm, mle_lnorm)) mse_lnorm_boot <- bootdist(mse_lnorm, niter = nboot) par(mar=c(4,4,2,1)) plot(mse_lnorm_boot, enhance = TRUE, trueval=c(0,1)) #-------------------------------------------------------- # Pareto 
sample library(actuar) x1 <- rburr(nsample, 2,2,2) Dn <- function(theta) -geomeanspacing(pburr, obs=x1, shape1=theta[1], shape2=theta[2], rate=theta[3]) Dn(c(1,1,10)) optim(c(1,1,10), Dn) msedist(x1, "burr", start=list(shape1=1, shape2=1, rate=10), control=list(trace=0, REPORT=1)) mse_burr <- fitdist(x1, "burr", method="mse", start=list(shape1=1, shape2=1, rate=10)) mle_burr <- fitdist(x1, "burr", method="mle", start=list(shape1=1, shape2=1, rate=10)) plot(mse_burr) summary(mse_burr) cdfcomp(list(mse_burr, mle_burr)) gofstat(list(mse_burr, mle_burr)) mse_burr_boot <- bootdist(mse_burr, niter = pmin(nboot,100)) plot(mse_burr_boot, enhance = TRUE, trueval=c(2,2,2)) #-------------------------------------------------------- # Poisson sample x1 <- rpois(nsample, 15) geomeanSpacingUnique <- function(pdist, obs, ...) { sx <- c(-Inf, unique(sort(obs)), Inf) n <- length(sx) Di <- pdist(sx[-1], ...) - pdist(sx[-n], ...) mean(log(Di)) } geomeanSpacingWeight <- function(pdist, obs, weights, ...) { sx <- c(-Inf, unique(sort(obs)), Inf) weights <- c(1, weights) n <- length(sx) Di <- pdist(sx[-1], ...) - pdist(sx[-n], ...) mean(weights*log(Di)) } DnUnique <- function(theta) -geomeanSpacingUnique(ppois, obs=x1, lambda=theta[1]) DnWeight <- function(theta, weights) -geomeanSpacingWeight(ppois, obs=x1, lambda=theta[1], weights=weights) optimize(DnWeight, c(1, 30), weights=as.numeric(table(x1))) optimize(DnUnique, c(1, 30)) optimize(Dn, c(1, 30)) #does not converge mle_pois1 <- fitdist(x1, "pois", method="mle") #no weight mse_pois1 <- fitdist(x1, "pois", method="mse") #with weight mse_pois2 <- fitdist(unique(sort(x1)), "pois", method="mse", weights=as.numeric(table(x1))) plot(mse_pois1) plot(mse_pois2) summary(mse_pois1) gofstat(mse_pois1) gofstat(mse_pois2) par(mfrow=c(1,1)) cdfcomp(list(mle_pois1, mse_pois1), addlegend = FALSE, fitlty = 1) curve(ppois(x, lambda=mse_pois2$estimate), type="s", col="blue", add=TRUE) legend("bottomright", lty=1, col=c("red", "green", "blue"), leg=c("MLE", "MSE no weight", "MSE with weight")) #-------------------------------------------------------- # real dataset # library(CASdatasets) # data("ushustormloss") # x <- ushustormloss$Normalized.CL05 # # plot(Normalized.CL05 ~ Year, data=ushustormloss, type="h", main="Normalized Hurricane Damages in United States") # # mse_burr <- fitdist(x, "burr", method="mse", start=list(shape1=1, shape2=1, rate=10), lower=0) # mle_burr0 <- fitdist(x, "burr", method="mle", start=list(shape1=1, shape2=1, rate=10), lower=0) # # cbind(MSE=coef(mse_burr), MLE=coef(mle_burr0)) # # # setwd("~/Desktop/") # par(mar=c(4,4,2,1)) # pdf("Ushustorm-cdfcomp.pdf", 6, 6) # cdfcomp(list(mse_burr, mle_burr0), xlogscale = TRUE, do.points = FALSE) # dev.off() # pdf("Ushustorm-qqcomp.pdf", 6, 6) # qqcomp(list(mse_burr, mle_burr0), xlogscale=TRUE, ylogscale=TRUE) # dev.off() # pdf("Ushustorm-ppcomp.pdf", 6, 6) # ppcomp(list(mse_burr, mle_burr0)) # dev.off() # # gofstat(list(mse_burr, mle_burr0)) # # mse_iburr <- fitdist(x, "invburr", method="mse", start=list(shape1=1, shape2=1, rate=10), lower=0) # mle_iburr0 <- fitdist(x, "invburr", method="mle", start=list(shape1=1, shape2=1, rate=10), lower=0) # # gofstat(list(mse_iburr, mle_iburr0)) # cdfcomp(list(mse_iburr, mle_iburr0)) fitdistrplus/tests/t-gen-max-spacing-estim.R0000644000176200001440000001026014050660542020632 0ustar liggesuserslibrary(fitdistrplus) set.seed(123) nsample <- 10 #-------------------------------------------------------- # exponential sample x1 <- rexp(nsample) mseKL_exp <- msedist(x1, "exp") 
mseJ_exp <- msedist(x1, "exp", phidiv = "J")
mseR2_exp <- msedist(x1, "exp", phidiv = "R", power.phidiv=2)
mseR1o2_exp <- msedist(x1, "exp", phidiv = "R", power.phidiv=1/2)
mseH3o2_exp <- msedist(x1, "exp", phidiv = "H", power.phidiv=3/2)
mseV3o2_exp <- msedist(x1, "exp", phidiv = "V", power.phidiv=3/2)
c(true=1, mseKL_exp$estimate, mseJ_exp$estimate, mseR2_exp$estimate,
  mseR1o2_exp$estimate, mseH3o2_exp$estimate, mseV3o2_exp$estimate)
mseKL_exp <- fitdist(x1, "exp", method="mse", phidiv="KL")
mseJ_exp <- fitdist(x1, "exp", method="mse", phidiv = "J")
mseR2_exp <- fitdist(x1, "exp", method="mse", phidiv = "R", power.phidiv=2)
mseR1o2_exp <- fitdist(x1, "exp", method="mse", phidiv = "R", power.phidiv=1/2)
mseH3o2_exp <- fitdist(x1, "exp", method="mse", phidiv = "H", power.phidiv=3/2)
mseV3o2_exp <- fitdist(x1, "exp", method="mse", phidiv = "V", power.phidiv=3/2)
gofstat(list(mseKL_exp, mseJ_exp, mseR2_exp, mseR1o2_exp, mseH3o2_exp, mseV3o2_exp))
cdfcomp(list(mseKL_exp, mseJ_exp, mseR2_exp, mseR1o2_exp, mseH3o2_exp, mseV3o2_exp), do.points=FALSE,
  legendtext = c("Kullback-Leibler", "Jeffreys", "Renyi alpha=2", "Renyi alpha=1/2", "Hellinger p=3/2", "Vajda beta=3/2"))
qqcomp(list(mseKL_exp, mseJ_exp, mseR2_exp, mseR1o2_exp, mseH3o2_exp, mseV3o2_exp),
  legendtext = c("Kullback-Leibler", "Jeffreys", "Renyi alpha=2", "Renyi alpha=1/2", "Hellinger p=3/2", "Vajda beta=3/2"))
denscomp(list(mseKL_exp, mseJ_exp, mseR2_exp, mseR1o2_exp, mseH3o2_exp, mseV3o2_exp), demp = TRUE,
  legendtext = c("Kullback-Leibler", "Jeffreys", "Renyi alpha=2", "Renyi alpha=1/2", "Hellinger p=3/2", "Vajda beta=3/2"))
#defensive test
try(msedist(x1, "exp", phidiv="ABC"))
try(msedist(x1, "exp", phidiv="K", power.phidiv = "a"))
try(msedist(x1, "exp", phidiv="J", power.phidiv = "a"))
try(msedist(x1, "exp", phidiv="R", power.phidiv = 0))
try(msedist(x1, "exp", phidiv="R", power.phidiv = 1:10))
try(msedist(x1, "exp", phidiv="H", power.phidiv = 0))
try(msedist(x1, "exp", phidiv="H", power.phidiv = 1:10))
#--------------------------------------------------------
# Poisson sample
x1 <- rpois(nsample, lambda=15)
#no weight
mseKL_pois1 <- fitdist(x1, "pois", method="mse", phidiv="KL")
mseJ_pois1 <- fitdist(x1, "pois", method="mse", phidiv = "J")
mseR2_pois1 <- fitdist(x1, "pois", method="mse", phidiv = "R", power.phidiv=2)
mseR1o2_pois1 <- fitdist(x1, "pois", method="mse", phidiv = "R", power.phidiv=1/2)
mseH3o2_pois1 <- fitdist(x1, "pois", method="mse", phidiv = "H", power.phidiv=3/2)
mseV3o2_pois1 <- fitdist(x1, "pois", method="mse", phidiv = "V", power.phidiv=3/2)
#with weight
mseKL_pois2 <- fitdist(unique(sort(x1)), "pois", method="mse", phidiv="KL", weights=as.numeric(table(x1)))
mseJ_pois2 <- fitdist(unique(sort(x1)), "pois", method="mse", phidiv = "J", weights=as.numeric(table(x1)))
mseR2_pois2 <- fitdist(unique(sort(x1)), "pois", method="mse", phidiv = "R", power.phidiv=2, weights=as.numeric(table(x1)))
mseR1o2_pois2 <- fitdist(unique(sort(x1)), "pois", method="mse", phidiv = "R", power.phidiv=1/2, weights=as.numeric(table(x1)))
mseH3o2_pois2 <- fitdist(unique(sort(x1)), "pois", method="mse", phidiv = "H", power.phidiv=3/2, weights=as.numeric(table(x1)))
mseV3o2_pois2 <- fitdist(unique(sort(x1)), "pois", method="mse", phidiv = "V", power.phidiv=3/2, weights=as.numeric(table(x1)))
#taking weights into account partially unbiases the estimation of the true cdf
cdfcomp(list(mseKL_pois1, mseJ_pois1, mseR2_pois1), do.points=FALSE, fitlty=1,
  legendtext = c("Kullback-Leibler", "Jeffreys", "Renyi alpha=2"))
cdfcomp(list(mseKL_pois2, mseJ_pois2,
mseR2_pois2), do.points=FALSE, add=TRUE, fitlty=2, addlegend = FALSE, datacol="white") cdfcomp(list(mseR1o2_pois1, mseH3o2_pois1, mseV3o2_pois1), do.points=FALSE, fitlty=1, legendtext = c("Renyi alpha=1/2", "Hellinger p=3/2", "Vajda beta=3/2")) cdfcomp(list(mseR1o2_pois2, mseH3o2_pois2, mseV3o2_pois2), do.points=FALSE, add=TRUE, fitlty=2, addlegend = FALSE, datacol="white") fitdistrplus/tests/t-qme-discrete.R0000644000176200001440000000226614050660542017126 0ustar liggesuserslibrary(fitdistrplus) set.seed(1234) nsample <- 10 #Poisson x <- rpois(nsample, lambda=7.5) L2 <- function(lam) (qpois(1/2, lambda = lam) - median(x))^2 curve(L2(x), 5, 9, xlab=expression(lambda), ylab=expression(L2(lambda)), main="squared differences", n=201) fitdist(x, "pois", method="qme", probs=1/2, start=list(lambda=2), control=list(trace=1, REPORT=1)) fitdist(x, "pois", method="qme", probs=1/2, start=list(lambda=6.8), control=list(trace=1, REPORT=1)) fitdist(x, "pois", method="qme", probs=1/2, start=list(lambda=15), control=list(trace=1, REPORT=1)) fitdist(x, "pois", method="qme", optim.method="SANN", probs=1/2, start=list(lambda=2), control=list(trace=1, REPORT=100)) fitdist(x, "pois", method="qme", optim.method="SANN", probs=1/2, start=list(lambda=17), control=list(trace=1, REPORT=100)) #Geometric x <- rgeom(nsample, 1/3) L2 <- function(p) (qgeom(1/2, p) - median(x))^2 curve(L2(x), 0.10, 0.95, xlab=expression(p), ylab=expression(L2(p)), main="squared differences", n=301) fitdist(x, "geom", method="qme", probs=1/2, start=list(prob=1/2), control=list(trace=1, REPORT=1)) fitdist(x, "geom", method="qme", probs=1/2, start=list(prob=1/20), control=list(trace=1, REPORT=1)) fitdistrplus/tests/t-fitdist-customoptim.R0000644000176200001440000000220714050660542020566 0ustar liggesusersvisualize <- FALSE # TRUE for manual tests with visualization of results require(fitdistrplus) myoptimize <- function(fn,par,ui,ci,...){ res <- constrOptim(f=fn,theta=par,method="Nelder-Mead",ui=ui,ci=ci, ...) 
standardres <- c(res, convergence = 0, hessian = NA) # constrOptim already returns value and par
  return(standardres)
}
#one parameter example
x <- rexp(100)
#binding example
fitdist(x, "exp", custom.optim=myoptimize, ui=1, ci=2, start=list(rate=10))
fitdist(x, "exp", lower= 2, optim.method="L-BFGS-B")
#two parameter example
if(visualize)
{ # check ERROR on aarch64-apple-darwin20.4.0 (64-bit) (2021/05/12)
  set.seed(1234)
  x <- rbeta(100, pi, 1/pi)
  fitdist(x, "beta")
  #binding example
  fitdist(x, "beta", custom.optim=myoptimize, ui=rbind(1,1), ci=c(1/2,1/2), start=list(shape1=5, shape2=5))
  fitdist(x, "beta", lower= c(1/2,1/2), optim.method="L-BFGS-B")
}
#true example
library(GeneralizedHyperbolic)
args(dnig)
x <- rnig(100, 3, 1/2, 1/2, 1/4)
ui <- rbind(c(0,1,0,0), c(0,0,1,0), c(0,0,1,-1), c(0,0,1,1))
ci <- c(0,0,0,0)
fitdist(x, "nig", custom.optim=myoptimize, ui=ui, ci=ci, start=list(mu = 0, delta = 1, alpha = 1, beta = 0))
fitdistrplus/tests/t-plotdist.R0000644000176200001440000000225214050660542016401 0ustar liggesuserslibrary(fitdistrplus)
nsample <- 10
# (1) Plot of an empirical distribution with changing
# of default line types for CDF and colors
# and optionally adding a density line
#
set.seed(1234)
x1 <- rnorm(n=nsample)
plotdist(x1)
plotdist(x1, col="blue", type="b", pch=16)
plotdist(x1, type="s")
plotdist(x1, demp = TRUE)
plotdist(x1, demp = TRUE, adjust = 1.5)
# (2) Plot of a discrete distribution against data
#
set.seed(1234)
x2 <- rpois(n=nsample, lambda = 2)
plotdist(x2, discrete=TRUE)
plotdist(x2, "pois", para=list(lambda = mean(x2)))
plotdist(x2, "pois", para=list(lambda = mean(x2)), lwd="2")
# (3) Plot of a continuous distribution against data
#
xn <- rnorm(n=nsample, mean=10, sd=5)
plotdist(xn, "norm", para=list(mean=mean(xn), sd=sd(xn)))
plotdist(xn, "norm", para=list(mean=mean(xn), sd=sd(xn)), pch=16)
plotdist(xn, "norm", para=list(mean=mean(xn), sd=sd(xn)), pch=16, demp = TRUE)
# (4) Plot of serving size data
#
data(groundbeef)
plotdist(groundbeef$serving, type="s")
# (5) Plot of numbers of parasites with a Poisson distribution
data(toxocara)
number <- toxocara$number
plotdist(number, discrete = TRUE)
plotdist(number, "pois", para=list(lambda=mean(number)))
fitdistrplus/tests/t-qmedist.R0000644000176200001440000000577614050660542016213 0ustar liggesuserslibrary(fitdistrplus)
nsample <- 10
# (1) basic fit of a normal distribution
#
set.seed(1234)
x1 <- rnorm(n=nsample)
qmedist(x1, "norm", probs=c(1/3, 2/3))
# (2) defining your own distribution functions, here for the Gumbel
# distribution; for other distributions, see the CRAN task view dedicated
# to probability distributions
dgumbel <- function(x, a, b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b))
qgumbel <- function(p, a, b) a - b*log(-log(p))
qmedist(x1, "gumbel", probs=c(1/3, 2/3), start=list(a=10,b=5))
# (3) fit a discrete distribution (Poisson)
#
set.seed(1234)
x2 <- rpois(n=nsample,lambda = 2)
qmedist(x2, "pois", probs=1/2)
# (4) fit a finite-support distribution (beta)
#
set.seed(1234)
x3 <- rbeta(n=nsample, shape1=5, shape2=10)
qmedist(x3, "beta", probs=c(1/3, 2/3))
# (5) fit frequency distributions on USArrests dataset.
#
x4 <- USArrests$Assault
qmedist(x4, "pois", probs=1/2)
qmedist(x4, "nbinom", probs=c(1/3, 2/3))
# (6) normal mixture
#
#mixture of two normal distributions
#density
dnorm2 <- function(x, poid, m1, s1, m2, s2) poid*dnorm(x, m1, s1) + (1-poid)*dnorm(x, m2, s2)
#numerically approximate quantile function
qnorm2 <- function(p, poid, m1, s1, m2, s2)
{
  L2 <- function(x, prob) (prob - pnorm2(x, poid, m1, s1, m2, s2))^2
  sapply(p, function(pr) optimize(L2, c(-20, 30), prob=pr)$minimum)
}
#distribution function
pnorm2 <- function(q, poid, m1, s1, m2, s2) poid*pnorm(q, m1, s1) + (1-poid)*pnorm(q, m2, s2)
#basic normal distribution
x <- c(rnorm(nsample, 5), rnorm(nsample, 10))
#QME
fit2 <- qmedist(x, "norm2", probs=c(1/6, 1/4, 1/3, 1/2, 2/3),
  start=list(poid=1/3, m1=4, s1=2, m2=8, s2=2),
  lower=c(0, 0, 0, 0, 0), upper=c(1/2, Inf, Inf, Inf, Inf))
# (7) test error messages
#
dnorm3 <- qnorm3 <- function(x, a) "NA"
x <- rexp(nsample)
#should get a one-line error
res <- qmedist(x, "norm3", start=list(a=1), probs=1/2)
#as in attr(try(log("a"), silent=TRUE), "condition")
# (8) weighted QME
#
n <- 1e6
x <- rpois(nsample, 10)
xtab <- table(x)
xval <- sort(unique(x))
f1 <- qmedist(x, "pois", start=list(lambda=mean(x)), lower=0, upper=100, probs=1/2) #, control=list(trace=1, REPORT=1)
f2 <- qmedist(xval, "pois", weights=xtab, start=list(lambda=mean(x)), probs=1/2)
f1$estimate
f2$estimate #should be identical
x <- rexp(nsample)
f3 <- qmedist(x, "exp", probs=1/2)
w4 <- c(rep(1, nsample/2), round(sqrt(1:(nsample/2))))
f4 <- qmedist(x, "exp", weights=w4, probs=1/2)
f3$estimate
f4$estimate
f3$loglik
f4$loglik
median(x)
median(tail(x, 50))
#try non integer weights
try(qmedist(x, "exp", weights=c(rep(1, nsample/2), sqrt(1:(nsample/2))), probs=1/2))
# (9) test the component optim.message
x <- rnorm(nsample)
#change parameter to obtain unsuccessful convergence
qmedist(x, "norm", probs=1:2/3, control=list(maxit=2),
  start=list(mean=1e5, sd=1), optim.method="L-BFGS-B", lower=0)
# (10) test bounds
x <- rnorm(nsample)
qmedist(x, "norm", probs=1:2/3, optim.method="L-BFGS-B", lower=c(-Inf, 0)) #via optim
qmedist(x, "norm", probs=1:2/3, optim.method="Nelder", lower=c(-Inf, 0)) #via constrOptim
fitdistrplus/tests/t-bootdistcens.R0000644000176200001440000000535614050660542017247 0ustar liggesuserslibrary(fitdistrplus)
nbboot <- 101
nbboot <- 11
nsample <- 10
visualize <- FALSE # TRUE for manual tests with visualization of results
# (1) Fit of a normal distribution to fluazinam data in log10
# followed by nonparametric bootstrap
#
data(fluazinam)
(d1 <- log10(fluazinam))
f1 <- fitdistcens(d1, "norm")
b1 <- bootdistcens(f1, niter = nbboot, silent=TRUE)
b1 <- bootdistcens(f1, niter = nbboot, silent=FALSE)
b1
summary(b1)
plot(b1)
# (3) Estimation of the standard deviation of a normal distribution
# by maximum likelihood with the mean fixed at 0.1 using the argument fix.arg
# followed by nonparametric bootstrap
#
f1b <- fitdistcens(d1, "norm", start=list(sd=1.5), fix.arg=list(mean=0.1))
b1b <- bootdistcens(f1b, niter=nbboot)
summary(b1b)
plot(b1b)
# (4) Comparison of fitdist and fitdistcens and bootdist and bootdistcens
# for non censored data
x1 <- c(6.4,13.3,4.1,1.3,14.1,10.6,9.9,9.6,15.3,22.1,13.4,
  13.2,8.4,6.3,8.9,5.2,10.9,14.4)
fx1 <- fitdist(x1,"norm",method="mle")
cx1 <- bootdist(fx1,bootmethod="nonparam", niter=nbboot)
xx1 <- data.frame(left=x1,right=x1)
fxx1 <- fitdistcens(xx1,"norm")
summary(fx1)
summary(fxx1)
cdfcomp(fx1)
cdfcompcens(fxx1)
cxx1 <- bootdistcens(fxx1, niter=nbboot)
summary(cx1)
summary(cxx1)
# (5) fixing parameters
#
set.seed(1234)
x
<- rexp(nsample, 5) x <- data.frame(left=x, right=x+.1) f1 <- fitdistcens(x, "gamma", fix.arg=list(shape=1.5)) b1 <- bootdistcens(f1, niter=nbboot) plot(b1) f1 <- fitdistcens(x, "gamma", fix.arg=function(x) list(shape=1.5)) b1 <- bootdistcens(f1, niter=nbboot) plot(b1) # (6) efficiency of parallel operation if (visualize) # too long to run on CRAN and forbidden due to parallel computing { niter <- 5001 data(fluazinam) d1 <-log10(fluazinam) f1 <- fitdistcens(d1, "norm") for (cli in 1:4) { print(cli) ptm <- proc.time() print(summary(bootdistcens(f1, niter = niter, parallel = "snow", ncpus = cli))) print(proc.time() - ptm) } # not available on Windows for (cli in 1:4) { print(cli) ptm <- proc.time() print(summary(bootdistcens(f1, niter = niter, parallel = "multicore", ncpus = cli))) print(proc.time() - ptm) } } # (5) with weights (not yet available, test of error message) # data(salinity) salinity.unique <- unique(salinity) string.unique <- paste(salinity.unique$left, salinity.unique$right) string.salinity <- paste(salinity$left, salinity$right) nobs <- nrow(salinity.unique) salinity.weights <- numeric(nobs) for (i in 1:nobs) { salinity.weights[i] <- length(which(string.salinity == string.unique[i])) } cbind(salinity.unique, salinity.weights) (fa <- fitdistcens(salinity, "lnorm")) (fb <- fitdistcens(salinity.unique, "lnorm", weights = salinity.weights)) # should give the same results summary(bootdistcens(fa, niter = nbboot)) fitdistrplus/tests/t-CIcdfplot.R0000644000176200001440000001306714210345773016420 0ustar liggesuserslibrary(fitdistrplus) nbboot <- 201 nbboot <- 10 ggplotEx <- requireNamespace("ggplot2", quietly = TRUE) # (1) Fit of a gamma distribution # set.seed(123) s1 <- rgamma(20, 3, 2) f1 <- fitdist(s1, "gamma") b1 <- bootdist(f1, niter=nbboot, silent=TRUE) plot(b1) quantile(b1) par(mfrow=c(1,2)) CIcdfplot(b1, CI.level=95/100, CI.output = "probability", CI.fill="grey80", CI.col="black") CIcdfplot(b1, CI.level=95/100, CI.output = "quantile", datacol="blue") if(ggplotEx) CIcdfplot(b1, CI.level=95/100, CI.output = "probability", CI.fill="grey80", CI.col="black", plotstyle = "ggplot") if(ggplotEx) CIcdfplot(b1, CI.level=95/100, CI.output = "quantile", datacol="blue", plotstyle = "ggplot") par(mfrow=c(1,2)) CIcdfplot(b1, CI.level=85/100, CI.output = "probability") CIcdfplot(b1, CI.level=85/100, CI.output = "quantile") if(ggplotEx) CIcdfplot(b1, CI.level=85/100, CI.output = "probability", plotstyle = "ggplot") if(ggplotEx) CIcdfplot(b1, CI.level=85/100, CI.output = "quantile", plotstyle = "ggplot") par(mfrow=c(1,2)) CIcdfplot(b1, CI.level=90/100, CI.output = "probability") CIcdfplot(b1, CI.level=90/100, CI.output = "quantile", CI.col="black", CI.type = "less", CI.fill="grey85", verticals=TRUE, datacol="blue", do.points=FALSE) if(ggplotEx) CIcdfplot(b1, CI.level=90/100, CI.output = "probability", plotstyle = "ggplot") if(ggplotEx) CIcdfplot(b1, CI.level=90/100, CI.output = "quantile", CI.col="black", CI.type = "less", CI.fill="grey85", verticals=TRUE, datacol="blue", do.points=FALSE, plotstyle = "ggplot") par(mfrow=c(1,2)) CIcdfplot(b1, CI.level=90/100, CI.output = "probability", CI.type = "greater") CIcdfplot(b1, CI.level=90/100, CI.output = "quantile", CI.col="black", CI.type = "greater", CI.fill="grey90", datacol="blue", datapch=21) if(ggplotEx) CIcdfplot(b1, CI.level=90/100, CI.output = "probability", CI.type = "greater", plotstyle = "ggplot") if(ggplotEx) CIcdfplot(b1, CI.level=90/100, CI.output = "quantile", CI.col="black", CI.type = "greater", CI.fill="grey90", datacol="blue", 
datapch=21, plotstyle = "ggplot") par(mfrow=c(1,1)) CIcdfplot(b1, CI.level=90/100, CI.output = "probability", CI.col="black", CI.type = "less", CI.fill="grey90") CIcdfplot(b1, CI.level=90/100, CI.output = "quantile", CI.col="black", CI.type = "less", CI.fill="grey90", verticals=TRUE, datacol="blue", do.points=FALSE) CIcdfplot(b1, CI.level=90/100, CI.output = "quantile", CI.col="grey90", CI.type = "less", CI.fill="grey90", verticals=TRUE, datacol="blue", do.points=FALSE, CI.only=TRUE) CIcdfplot(b1, CI.level=90/100, CI.output = "probability", CI.col="grey85", CI.type = "less", CI.fill="grey90", CI.only = TRUE) CIcdfplot(b1, CI.output = "probability", fitlty=3, fitlwd=4) if(ggplotEx) CIcdfplot(b1, CI.level=90/100, CI.output = "probability", CI.col="black", CI.type = "less", CI.fill="grey90", plotstyle = "ggplot") if(ggplotEx) CIcdfplot(b1, CI.level=90/100, CI.output = "quantile", CI.col="black", CI.type = "less", CI.fill="grey90", verticals=TRUE, datacol="blue", do.points=FALSE, plotstyle = "ggplot") if(ggplotEx) CIcdfplot(b1, CI.level=90/100, CI.output = "quantile", CI.col="grey90", CI.type = "less", CI.fill="grey90", verticals=TRUE, datacol="blue", do.points=FALSE, CI.only=TRUE, plotstyle = "ggplot") if(ggplotEx) CIcdfplot(b1, CI.level=90/100, CI.output = "probability", CI.col="grey85", CI.type = "less", CI.fill="grey90", CI.only = TRUE, plotstyle = "ggplot") if(ggplotEx) CIcdfplot(b1, CI.output = "probability", fitlty=3, fitlwd=4, plotstyle = "ggplot") # (2) an example from ecotoxicology # with censored data # data(salinity) log10LC50 <-log10(salinity) fln <- fitdistcens(log10LC50,"norm") bln <- bootdistcens(fln, niter=nbboot) (HC5ln <- quantile(bln,probs = 0.05)) CIcdfplot(bln, CI.output = "quantile", CI.fill = "lightblue", CI.col = "blue", xlab = "log10(LC50)",xlim=c(0.5,2),lines01 = TRUE) if(ggplotEx) CIcdfplot(bln, CI.output = "quantile", CI.fill = "lightblue", CI.col = "blue", xlab = "log10(LC50)",xlim=c(0.5,2),lines01 = TRUE, plotstyle = "ggplot") # zoom around the HC5 with CI on quantiles CIcdfplot(bln, CI.output = "quantile", CI.fill = "lightblue", CI.col = "blue", xlab = "log10(LC50)", lines01 = TRUE, xlim = c(0.8, 1.5), ylim = c(0, 0.1)) abline(h = 0.05, lty = 1) if(ggplotEx) CIcdfplot(bln, CI.output = "quantile", CI.fill = "lightblue", CI.col = "blue", xlab = "log10(LC50)", lines01 = TRUE, xlim = c(0.8, 1.5), ylim = c(0, 0.1), plotstyle = "ggplot") + ggplot2::geom_hline(yintercept = 0.05) # zoom around the HC5 with CI on probabilities CIcdfplot(bln, CI.output = "probability", CI.fill = "lightblue", CI.col = "blue", xlab = "log10(LC50)", lines01 = TRUE, xlim = c(0.8, 1.5), ylim = c(0, 0.1)) abline(h = 0.05, lty = 1) if(ggplotEx) CIcdfplot(bln, CI.output = "probability", CI.fill = "lightblue", CI.col = "blue", xlab = "log10(LC50)", lines01 = TRUE, xlim = c(0.8, 1.5), ylim = c(0, 0.1), plotstyle = "ggplot") + ggplot2::geom_hline(yintercept = 0.05) # (3) An example where the difference between "probability" # and "quantile" is clear on the plot # set.seed(123) s3 <- rgamma(5, 3, 10) f3 <- fitdist(s3, "norm") b3 <- bootdist(f3, niter=nbboot, silent=TRUE) par(mfrow=c(1,2)) CIcdfplot(b3, CI.level=90/100, CI.output = "probability") CIcdfplot(b3, CI.level=90/100, CI.output = "quantile") if(ggplotEx) CIcdfplot(b3, CI.level=90/100, CI.output = "probability", plotstyle = "ggplot") if(ggplotEx) CIcdfplot(b3, CI.level=90/100, CI.output = "quantile", plotstyle = "ggplot") #some ideas from http://edild.github.io/ssd/ 
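# (4) numerical cross-check of the CI bands (added sketch, not part of the
# original tests): the bootstrap parameter values stored in b3$estim (one row
# per iteration) allow recomputing, with base R only, the pointwise band that
# CIcdfplot draws: CI.output = "probability" bootstraps F(x0) at a fixed x0,
# while CI.output = "quantile" bootstraps a fixed-probability quantile.
x0 <- 0.3
Fx0 <- apply(b3$estim, 1, function(par) pnorm(x0, mean = par[1], sd = par[2]))
quantile(Fx0, probs = c(0.025, 0.975)) # 95% CI on F(x0), "probability" direction
q50 <- apply(b3$estim, 1, function(par) qnorm(0.5, mean = par[1], sd = par[2]))
quantile(q50, probs = c(0.025, 0.975)) # 95% CI on the median, "quantile" direction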
fitdistrplus/tests/t-Surv2fitdistcens.R0000644000176200001440000000050414075737424020031 0ustar liggesusersrequire(fitdistrplus)

data("fremale")
# fremale test
fremale.cens <- Surv2fitdistcens(fremale$AgeIn, fremale$AgeOut, fremale$Death)
f1 <- fitdistcens(fremale.cens, "norm")
f2 <- fitdistcens(fremale.cens, "logis")
f3 <- fitdistcens(fremale.cens, "cauchy")
cdfcompcens(list(f1, f2, f3))
sapply(list(f1, f2, f3), logLik)
fitdistrplus/tests/t-descdist.R0000644000176200001440000000213514050660542016341 0ustar liggesuserslibrary(fitdistrplus)

nbboot <- 100
nbboot <- 10
nsample <- 10

# (1) Description of a sample from a normal distribution
# with and without uncertainty on skewness and kurtosis estimated by bootstrap
#
set.seed(1234)
x1 <- rnorm(nsample)
descdist(x1)
descdist(x1, boot=nbboot)

# (2) Description of a sample from a beta distribution
# with uncertainty on skewness and kurtosis estimated by bootstrap,
# changing the default colors
#
descdist(rbeta(nsample, shape1=0.05, shape2=1), boot=nbboot,
         obs.col="blue", boot.col="orange")

# (3) Description of a sample from a gamma distribution
# with uncertainty on skewness and kurtosis estimated by bootstrap
# without plotting
#
descdist(rgamma(nsample, shape=2, rate=1), boot=nbboot, graph=FALSE)

# (4) Description of a sample from a Poisson distribution
# with uncertainty on skewness and kurtosis estimated by bootstrap
#
descdist(rpois(nsample, lambda=2), discrete=TRUE, boot=nbboot)

# (5) Description of serving size data
# with uncertainty on skewness and kurtosis estimated by bootstrap
#
data(groundbeef)
serving <- groundbeef$serving
descdist(serving, boot=nbboot)
fitdistrplus/tests/t-quantiledist.R0000644000176200001440000000506313742313702017246 0ustar liggesuserslibrary(fitdistrplus)

nbboot <- 1001
nbboot <- 10

# (1) Fit of a normal distribution on acute toxicity log-transformed values of endosulfan for
# nonarthropod invertebrates, using maximum likelihood estimation
# to estimate what is called a species sensitivity distribution
# (SSD) in ecotoxicology, followed by estimation of the 5, 10 and 20 percent quantile values of
# the fitted distribution, which are called the 5, 10, 20 percent hazardous concentrations (HC5, HC10, HC20)
# in ecotoxicology, followed by calculation of their confidence intervals under various definitions.
#
data(endosulfan)
ATV <- subset(endosulfan, group == "NonArthroInvert")$ATV
log10ATV <- log10(subset(endosulfan, group == "NonArthroInvert")$ATV)
fln <- fitdist(log10ATV, "norm")
quantile(fln, probs = c(0.05, 0.1, 0.2))
bln <- bootdist(fln, niter=nbboot, bootmethod="param")
quantile(bln, probs = c(0.05, 0.1, 0.2))
quantile(bln, probs = c(0.05, 0.1, 0.2), CI.type = "greater")
quantile(bln, probs = c(0.05, 0.1, 0.2), CI.level = 0.9)

# (2) Fit of a distribution on acute salinity log-transformed tolerance
# for riverine macro-invertebrates, using maximum likelihood estimation
# to estimate what is called a species sensitivity distribution
# (SSD) in ecotoxicology, followed by estimation of the 5, 10 and 20 percent quantile values of
# the fitted distribution, which are called the 5, 10, 20 percent hazardous concentrations (HC5, HC10, HC20)
# in ecotoxicology, followed by calculation of their confidence intervals under various definitions.
# data(salinity) log10LC50 <-log10(salinity) flncens <- fitdistcens(log10LC50,"norm") quantile(flncens, probs = c(0.05, 0.1, 0.2)) blncens <- bootdistcens(flncens,niter=nbboot) quantile(blncens, probs = c(0.05, 0.1, 0.2)) quantile(blncens, probs = c(0.05, 0.1, 0.2), CI.type = "greater") quantile(blncens, probs = c(0.05, 0.1, 0.2), CI.level = 0.9) # (3) Estimation of quantiles of the fitted distribution (fln) # and two-sided 95 percent confidence intervals for various # probabilities using non-parametric bootstrap with 101 iterations # bln.np <- bootdist(fln, bootmethod = "nonparam", niter = nbboot) quantile(bln.np, probs = c(0.05, 0.1, 0.2)) # (4) Fit of a loglogistic distribution on the same acute toxicity values and # estimation of the 5 percent quantile (HC5) of the fitted distribution # and associated two-sided 95 percent confidence interval # fll <- fitdist(log10ATV, "logis") bll <- bootdist(fll, bootmethod = "param", niter = nbboot) # in log10(ATV) HC5ll <- quantile(bll, probs = 0.05) HC5ll # in ATV 10^(HC5ll$basequant) 10^(HC5ll$quantCI) fitdistrplus/tests/t-mmedist.R0000644000176200001440000001306714421726007016210 0ustar liggesuserslibrary(fitdistrplus) nsample <- 10 # (1) basic fit of a normal distribution with moment matching estimation # set.seed(1234) x1 <- rnorm(n=nsample) mmedist(x1,"norm") try(mmedist(x1,"norm", fix.arg=list(mean=0))) # (2) fit a discrete distribution (Poisson) # set.seed(1234) x2 <- rpois(n=nsample, lambda = 2) mmedist(x2,"pois") # (3) fit a finite-support distribution (beta) # set.seed(1234) x3 <- rbeta(n=nsample, shape1=5, shape2=10) mmedist(x3,"beta") # (4) fit a Pareto distribution # if(any(installed.packages()[, "Package"] == "actuar")) { require(actuar) #simulate a sample x4 <- rpareto(nsample, 6, 2) #empirical raw moment memp <- function(x, order) mean(x^order) #fit mmedist(x4, "pareto", order=c(1, 2), memp=memp, start=c(shape=10, scale=10), lower=1, upper=Inf) mmedist(x4, "pareto", order=1, memp=memp, start=list(shape=10), fix.arg=list(scale=1.5), lower=2, upper=Inf) mmedist(x4, "pareto", order=1, memp=memp, start=function(x) list(shape=10), fix.arg=list(scale=1.5), lower=2, upper=Inf) mmedist(x4, "pareto", order=1, memp=memp, start=list(shape=10), fix.arg=function(x) list(scale=1.5), lower=2, upper=Inf) #weights memp2 <- function(x, order, weights) sum(x^order * weights)/sum(weights) w <- rep(1, length(x4)) w[x4 < 1] <- 2 mmedist(x4, "pareto", order=c(1, 2), memp=memp2, weights=w, start=list(shape=10, scale=10), lower=1, upper=Inf) #fit data(danishuni) fparedanishMME <- mmedist(danishuni$Loss, "pareto", order=1:2, memp=memp, start=c(shape=10, scale=10), lower=2+1e-6, upper=Inf) c(theo = mpareto(1, fparedanishMME$estimate[1], fparedanishMME$estimate[2]), emp = memp(danishuni$Loss, 1)) c(theo = mpareto(2, fparedanishMME$estimate[1], fparedanishMME$estimate[2]), emp = memp(danishuni$Loss, 2)) } # (5) fit a lognormal distribution # f1 <- mledist(x3, "lnorm") #previously mmedist was the same as mledist f2 <- mmedist(x3, "lnorm") n <- length(x3) s2 <- log(1+var(x3)/mean(x3)^2*(n-1)/n) mu <- log(mean(x3)) - s2/2 cbind(c(mu, s2), f2$estimate) c(truestim=exp(mu+s2/2), jensen=as.numeric(exp(f1$estimate["meanlog"]+f1$estimate["sdlog"]^2/2)), emp=mean(x3)) c(truestim=exp(2*mu+s2)*(exp(s2)-1), jensen=as.numeric(exp(2*f1$estimate["meanlog"]+f1$estimate["sdlog"]^2)*(exp(f1$estimate["sdlog"]^2)-1)), emp=var(x3)*(n-1)/n) # (6) test error messages # mnorm3 <- dnorm3 <- function(x, a) "NA" x <- rnorm(nsample) #should get a one-line error res <- mmedist(x, "norm3", 
start=list(a=1), order=1, memp=function(x, order) mean(x)) #as in attr(try(log("a"), silent=TRUE), "condition") # (7) fit of a normal distribution with weighted moment matching estimation # n <- nsample w <- c(rep(1, n/2), rep(10, n/2)) mmedist(x1, "norm", weights=w)$estimate #check sum(w*x1)/sum(w) fitdistrplus:::wtdmean(x1, w) sum(w*(x1-sum(w*x1)/sum(w))^2)/sum(w) fitdistrplus:::wtdvar(x1, w) mmedist(exp(x1), "lnorm", weights=w)$estimate #test non integer weights try(mmedist(x1, "norm", weights=rep(1/3, length(x1)))) try(mmedist(1:10, "pois", weights=c(rep(1, 9), 1.001), start=list(lambda=mean(x)))) try(mmedist(1:10, "pois", weights=c(rep(1, 9), 1.0000001), start=list(lambda=mean(x)))) # (8) fit of a neg binom distribution with weighted moment matching estimation # x4 <- rnbinom(nsample, 5, 1/2) table(x4) w <- rep(1, length(x4)) w[x4 > 5] <- 2 mmedist(x4, "nbinom", weights=w)$estimate mmedist(x4, "nbinom", weights=NULL)$estimate # (9) relevant example for zero modified geometric distribution # rzmgeom <- function(n, p1, p2) { u <- rbinom(n, 1, 1-p1) #prob to get zero is p1 u[u != 0] <- rgeom(sum(u != 0), p2)+1 u } dzmgeom <- function(x, p1, p2) { p1 * (x == 0) + (1-p1)*dgeom(x-1, p2) } mgeom <- function(order, prob) { if(order == 1) (1-prob)/(prob) else if(order == 2) (2-3*prob+prob^2)/prob^2 else stop("not yet implemented") } if(FALSE) { c(mean(rgeom(1e4, 1/6)), mgeom(1, 1/6)) c(mean(rgeom(1e4, 1/6)^2), mgeom(2, 1/6)) } mzmgeom <- function(order, p1, p2) #raw moment { if(order == 1) (1-p1)*( mgeom(1, p2) + 1 ) + p1*0 #mean else if(order == 2) (1-p1)*( mgeom(2, p2)+ 2*mgeom(1, p2)+1) + p1*0 #E(X^2) else stop("not yet implemented") } if(FALSE) { c(mean(rzmgeom(1e4, 1/3, 1/6)), mzmgeom(1, 1/3, 1/6)) c(mean(rzmgeom(1e4, 1/3, 1/6)^2), mzmgeom(2, 1/3, 1/6)) } memp1 <- function(x, order) mean(x^order) memp2 <- function(x, order, weights) sum(x^order * weights)/sum(weights) x5 <- rzmgeom(nsample, 1/3, 1/6) w <- rep(1, length(x5)) w[x5 > 5] <- 2 mmedist(x5, "zmgeom", order=1:2, memp=memp1, start=list(p1=mean(x5 == 0), p2=1/mean(x5[x5 > 0])), lower=0.01, upper=0.99)$estimate mmedist(x5, "zmgeom", order=1:2, memp=memp2, start=list(p1=mean(x5 == 0), p2=1/mean(x5[x5 > 0])), weights=w)$estimate mmedist(x5, "zmgeom", order=1:2, memp=memp1, start=list(p1=mean(x5 == 0), p2=1/mean(x5[x5 > 0])), lower=0.01, upper=0.99)$loglik mmedist(x5, "zmgeom", order=1:2, memp=memp2, start=list(p1=mean(x5 == 0), p2=1/mean(x5[x5 > 0])), weights=w)$loglik # (10) bounds # if(any(installed.packages()[, "Package"] == "actuar")) { require(actuar) #simulate a sample x4 <- rpareto(nsample, 6, 2) #empirical raw moment memp <- function(x, order) mean(x^order) #fit mmedist(x4, "pareto", order=c(1, 2), memp=memp, start=c(shape=10, scale=10), lower=1, upper=Inf, optim.method = "L-BFGS-B") #L-BFGS-B via optim mmedist(x4, "pareto", order=c(1, 2), memp=memp, start=c(shape=10, scale=10), lower=1, upper=Inf, optim.method = "Nelder") #Nelder Mead via constrOptim } fitdistrplus/tests/t-ppcomp.R0000644000176200001440000001752214050660542016043 0ustar liggesuserslibrary(fitdistrplus) # ?ppcomp visualize <- FALSE # TRUE for manual tests with visualization of results nsample <- 1000 nsample <- 10 # (1) Plot various distributions fitted to serving size data # data(groundbeef) serving <- groundbeef$serving fitW <- fitdist(serving, "weibull") fitln <- fitdist(serving, "lnorm") fitg <- fitdist(serving, "gamma") # sanity checks try(ppcomp("list(fitW, fitln, fitg)"), silent = TRUE) try(ppcomp(list(fitW, fitln, fitg, a = 1)), silent = TRUE) # real 
call ppcomp(list(fitW, fitln, fitg)) ppcomp(list(fitW, fitln, fitg), legendtext = c("Weibull", "lognormal", "gamma"), main="ground beef fits", xlab="Theo.", ylab="serving sizes (g)", xlim = c(0, 1/2)) ppcomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits", xlab="Theo.", ylab="serving sizes (g)", xlogscale=TRUE, line01=FALSE) ppcomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits", xlab="Theo.", ylab="serving sizes (g)", ylogscale=TRUE, line01=FALSE) ppcomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits", ylim=c(1e-3, 1), xlim=c(1e-3, 1), fitpch=c("+", "-", ".")) if (requireNamespace ("ggplot2", quietly = TRUE)) { ppcomp(list(fitW, fitln, fitg), plotstyle = "ggplot") } if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { ppcomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits", xlab="Theo.", ylab="serving sizes (g)", xlim = c(0, 1/2), plotstyle = "ggplot") ppcomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits", xlab="Theo.", ylab="serving sizes (g)", xlogscale=TRUE, line01=FALSE, plotstyle = "ggplot") ppcomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits", xlab="Theo.", ylab="serving sizes (g)", ylogscale=TRUE, line01=FALSE, plotstyle = "ggplot") ppcomp(list(fitW, fitln, fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits", ylim=c(1e-3, 1), xlim=c(1e-3, 1), fitpch=c("+", "-", "."), plotstyle = "ggplot") } # (2) Plot lognormal distributions fitted by # maximum goodness-of-fit estimation # using various distances (data plotted in log scale) # data(endosulfan) ATV <-subset(endosulfan, group == "NonArthroInvert")$ATV flnMGEKS <- fitdist(ATV,"lnorm",method="mge",gof="KS") flnMGEAD <- fitdist(ATV,"lnorm",method="mge",gof="AD") flnMGEADL <- fitdist(ATV,"lnorm",method="mge",gof="ADL") flnMGEAD2L <- fitdist(ATV,"lnorm",method="mge",gof="AD2L") llfit <- list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L) ppcomp(llfit, main="fits of a lognormal dist. using various GOF dist.") ppcomp(llfit, xlegend="topleft", xlogscale=TRUE, main="fits of a lognormal dist. using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L")) ppcomp(llfit, xlegend="topleft", xlogscale=TRUE, main="fits of a lognormal dist. using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L"), fitcol=c("black", "darkgreen", "yellowgreen", "yellow2")) ppcomp(llfit, ynoise=FALSE, xlogscale=TRUE, ylogscale=TRUE, xlim=c(1e-3,1), ylim=c(1e-3,1)) ppcomp(flnMGEKS) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { ppcomp(llfit, main="fits of a lognormal dist. using various GOF dist.", plotstyle = "ggplot") ppcomp(llfit, xlegend="topleft", xlogscale=TRUE, main="fits of a lognormal dist. using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L"), plotstyle = "ggplot") ppcomp(llfit, xlegend="topleft", xlogscale=TRUE, main="fits of a lognormal dist. 
using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L"), fitcol=c("black", "darkgreen", "yellowgreen", "yellow2"), plotstyle = "ggplot") ppcomp(llfit, ynoise=FALSE, xlogscale=TRUE, ylogscale=TRUE, xlim=c(1e-3,1), ylim=c(1e-3,1), plotstyle = "ggplot") ppcomp(flnMGEKS, plotstyle = "ggplot") } # (3) Plot lognormal distributions fitted by # maximum goodness-of-fit estimation # using various distances (data plotted in log scale) # if (visualize) { x1 <- c(6.4,13.3,4.1,1.3,14.1,10.6,9.9,9.6,15.3,22.1,13.4,13.2,8.4,6.3,8.9,5.2,10.9,14.4) n1 <- length(x1) dgumbel <- function(x,a,b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b)) pgumbel <- function(q,a,b) exp(-exp((a-q)/b)) f1 <- mledist(x1, "norm") f2 <- mledist(x1, "gumbel", start = list(a = 10, b = 5)) f3 <- mledist(x1, "exp") xx1 <- pnorm(sort(x1), f1$estimate[1], f1$estimate[2]) xx2 <- pgumbel(sort(x1), f2$estimate[1], f2$estimate[2]) xx3 <- pexp(sort(x1), f3$estimate[1]) xlim <- c(xx1, xx2, xx3) xlim <- range(xlim[which(is.finite(xlim))]) # graph 1 plot(xx1, 1:n1/n1, col = "red", xlim = xlim) points(xx2, 1:n1/n1, col = "green") points(xx3, 1:n1/n1, col = "blue") legend("bottomright", pch = 1, leg = c("Normal","Gumbel","Exp"), col = c("red","green","blue")) # graph 2 f1 <- fitdist(x1,"norm") f2 <- fitdist(x1,"gumbel",start=list(a=10,b=5)) f3 <- fitdist(x1, "exp") ppcomp(list(f1, f2, f3), fitcol=c("red","green","blue"), ynoise = FALSE, legendtext = c("Normal","Gumbel","Exp")) # graph 3 if (requireNamespace ("ggplot2", quietly = TRUE)) { ppcomp(list(f1, f2, f3), fitcol=c("red","green","blue"), ynoise = FALSE, legendtext = c("Normal","Gumbel","Exp"), plotstyle = "gg") } } # (4) normal mixture # #mixture of two normal distributions #density dnorm2 <- function(x, poid, m1, s1, m2, s2) poid*dnorm(x, m1, s1) + (1-poid)*dnorm(x, m2, s2) #numerical approximate quantile function qnorm2 <- function(p, poid, m1, s1, m2, s2) { L2 <- function(x, prob) (prob - pnorm2(x, poid, m1, s1, m2, s2))^2 sapply(p, function(pr) optimize(L2, c(-1000, 1000), prob=pr)$minimum) } #distribution function pnorm2 <- function(q, poid, m1, s1, m2, s2) poid*pnorm(q, m1, s1) + (1-poid)*pnorm(q, m2, s2) #basic normal distribution set.seed(1234) x2 <- c(rnorm(nsample, 5), rnorm(nsample, 10)) #MLE fit fit1 <- fitdist(x2, "norm2", "mle", start=list(poid=1/3, m1=4, s1=2, m2=8, s2=2), lower=c(0, 0, 0, 0, 0)) fit2 <- fitdist(x2, "norm2", "qme", probs=c(1/6, 1/4, 1/3, 1/2, 2/3), start=list(poid=1/3, m1=4, s1=2, m2=8, s2=2), lower=c(0, 0, 0, 0, 0), upper=c(1/2, Inf, Inf, Inf, Inf)) fit3 <- fitdist(x2, "norm2", "mge", gof="AD", start=list(poid=1/3, m1=4, s1=2, m2=8, s2=2), lower=c(0, 0, 0, 0, 0), upper=c(1/2, Inf, Inf, Inf, Inf)) ppcomp(list(fit1, fit2, fit3), fitpch=rep(".", 3), fitcol=c("green", "red", "blue")) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { ppcomp(list(fit1, fit2, fit3), fitpch=rep(".", 3), fitcol=c("green", "red", "blue"), plotstyle = "gg") } # (5) large data # if (visualize) { n <- 2e4 x <- rlnorm(n) f1 <- fitdist(x, "lnorm") f2 <- fitdist(x, "exp") ppcomp(list(f1, f2), fitpch=3) ppcomp(list(f1, f2), fitlwd=2) if (requireNamespace ("ggplot2", quietly = TRUE)) { ppcomp(list(f1, f2), fitpch=3, plotstyle = "ggplot") ppcomp(list(f1, f2), fitlwd=2, plotstyle = "ggplot") } } # (6) test legend labels # if (visualize) { serving <- groundbeef$serving fitW <- fitdist(serving,"weibull") fitW2 <- fitdist(serving,"weibull", method="qme", probs=c(1/3,2/3)) fitW3 <- fitdist(serving,"weibull", method="qme", probs=c(1/2,2/3)) fitln <- 
fitdist(serving,"lnorm") fitg <- fitdist(serving,"gamma") ppcomp(list(fitW, fitln, fitg)) #distrib ppcomp(list(fitW, fitW2, fitln, fitg)) #distrib+method ppcomp(list(fitW, fitW2, fitW3, fitln, fitg)) #distrib+method+num if (requireNamespace ("ggplot2", quietly = TRUE)) ppcomp(list(fitW, fitW2, fitW3, fitln, fitg), plotstyle = "ggplot") #distrib+method+num } fitdistrplus/tests/t-fitbench.R0000644000176200001440000000122514421750615016323 0ustar liggesusersif(FALSE) { require(fitdistrplus) x <- rgamma(1e3, shape=3/2, rate= 1/2) initval <- unlist(fitdistrplus:::startargdefault(x, "gamma")) fitdistrplus:::fitbench(x, "gamma", "mle") fitdistrplus:::fitbench(x, "gamma", "mle", lower=0) # grgam(1:2, x) # grgamlnl(x[1], shape=1, rate=2) # grgamlnl(x[1:2], shape=1, rate=2) fitdistrplus:::fitbench(x, "gamma", "mle", grad=fitdistrplus:::grlnlgamma) fitdistrplus:::fitbench(x, "gamma", "mle", grad=fitdistrplus:::grlnlgamma, lower=0) #mledist(x, "gamma", grad=grgam, lower=0, optim.method = "CG") #mledist(x, "gamma", grad=grgam, lower=0, optim.method = "BFGS") } fitdistrplus/tests/t-util-npmle.R0000644000176200001440000001571614421726211016634 0ustar liggesuserslibrary(fitdistrplus) compareplotdistcens <- function(d) { par(mfrow = c(2,2)) plotdistcens(d, NPMLE.method = "Turnbull.middlepoints") plotdistcens(d, NPMLE.method = "Turnbull.intervals") plotdistcens(d, NPMLE.method = "Wang") } comparenpmle <- function(d) { npmleT <- fitdistrplus:::npmle(d, method = "Turnbull.intervals") npmleW <- fitdistrplus:::npmle(d, method = "Wang") print(npmleT) print(npmleW) cat("nb of intervals for Turnbull: ", nrow(npmleT), "\n") cat("nb of intervals for Wang: ", nrow(npmleW), "\n") xmin <- min(c(npmleT$left[is.finite(npmleT$left)],npmleT$left[is.finite(npmleT$left)])) xmax <- max(c(npmleT$right[is.finite(npmleT$right)],npmleT$right[is.finite(npmleT$right)])) par(mfrow = c(2, 1)) plot(npmleT$left, npmleT$p, xlim = c(xmin, xmax), main = "left") points(npmleW$left, npmleW$p, col = "red", pch = 4) plot(npmleT$right, npmleT$p, xlim = c(xmin, xmax), main = "right") points(npmleW$right, npmleW$p, col = "red", pch = 4) } #### Comparison of plotdistcens with different NPMLE methods if(FALSE) { # d1 = trivial case with only interval censored data d1 <- data.frame(left = c(1, 2, 3, 4, 3, 7), right = c(2, 5, 3, 7, 8, 9)) d <- d1 par(mfrow = c(2,2)) compareplotdistcens(d) comparenpmle(d) fa <- fitdistcens(d, "norm") fb <- fitdistcens(d, "logis") par(mfrow = c(2,2)) cdfcompcens(list(fa, fb), NPMLE.method = "Turnbull.middlepoints") cdfcompcens(list(fa, fb), NPMLE.method = "Turnbull.intervals") cdfcompcens(list(fa, fb), NPMLE.method = "Wang") par(mfrow = c(1,2)) ppcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Turnbull.intervals") ppcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Wang") qqcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Turnbull.intervals") qqcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Wang") # d2 = case with left and right censored data data(smokedfish) d2 <- smokedfish d <- d2 compareplotdistcens(d) comparenpmle(d) fa <- fitdistcens(d, "lnorm") fb <- fitdistcens(d, "gamma") par(mfrow = c(2,2)) cdfcompcens(list(fa, fb), NPMLE.method = "Turnbull.middlepoints") cdfcompcens(list(fa, fb), NPMLE.method = "Turnbull.intervals") cdfcompcens(list(fa, fb), NPMLE.method = "Wang") par(mfrow = c(1,2)) ppcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Turnbull.intervals") ppcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Wang") qqcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = 
"Turnbull.intervals") qqcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Wang") # d3 = case with also rigth censored data d3 <- data.frame(left = c(-1.4, 1.18, -1.4, 2, -1.4, 0), right = c(1, 1.18, 2, NA, 0, 2)) d <- d3 compareplotdistcens(d) comparenpmle(d) fa <- fitdistcens(d, "norm") fb <- fitdistcens(d, "logis") par(mfrow = c(1,2)) ppcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Turnbull.intervals") ppcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Wang") qqcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Turnbull.intervals") qqcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Wang") # d4 = case with also right censored data # with differences between the algorithms by the way they polish # the ECDF function, by putting some masses to zero. require(actuar) data(fluazinam) d4 <- fluazinam d <- d4 compareplotdistcens(d) comparenpmle(d) fa <- fitdistcens(d, "lnorm") fb <- fitdistcens(d, "llogis") par(mfrow = c(1,2)) cdfcompcens(list(fa, fb), NPMLE.method = "Turnbull.intervals") cdfcompcens(list(fa, fb), NPMLE.method = "Wang") ppcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Turnbull.intervals") ppcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Wang") qqcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Turnbull.intervals") qqcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Wang") # d5 a random example with exact values set.seed(123) r <- rnorm(10) d5 <- data.frame(left = r, right = r) d <- d5 compareplotdistcens(d) comparenpmle(d) fa <- fitdistcens(d, "norm") fb <- fitdistcens(d, "logis") par(mfrow = c(1,2)) ppcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Turnbull.intervals") ppcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Wang") qqcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Turnbull.intervals") qqcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Wang") # d7 = bigger dataset with also rigth censored data data(salinity) d7 <- log10(salinity) d <- d7 compareplotdistcens(d) comparenpmle(d) fa <- fitdistcens(d, "logis") fb <- fitdistcens(d, "norm") par(mfrow = c(1,2)) ppcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Turnbull.intervals") ppcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Wang") qqcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Turnbull.intervals") qqcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Wang") # d8 = an random example with all types of data (a small one) # set.seed(1234) # check OK # set.seed(1231) # check OK set.seed(1232) ns <- 25 r <- rnorm(ns) d8 <- data.frame(left = r, right = r) delta <- rlnorm(ns) icensored <- rbinom(ns, size = 1, prob = 0.2) Lcensored <- rbinom(ns, size = 1, prob = 0.2*(1 - icensored)) Rcensored <- rbinom(ns, size = 1, prob = 0.3*(1 - icensored)*(1 - Lcensored)) # icensored + Lcensored + Rcensored d8$left <- d8$left * (1 - Lcensored) + (-1000) * Lcensored d8$right <- d8$right * (1 - Rcensored) + (1000) * Rcensored d8$right <- d8$right + delta * icensored d8$right[d8$right == 1000] <- NA d8$left[d8$left == -1000] <- NA d8 d <- d8 compareplotdistcens(d) comparenpmle(d) fa <- fitdistcens(d, "logis") fb <- fitdistcens(d, "norm") par(mfrow = c(1,2)) ppcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Turnbull.intervals") ppcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Wang") qqcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Turnbull.intervals") qqcompcens(list(fa, fb), ynoise = FALSE, NPMLE.method = "Wang") # d8 = an random example with all types of data (a big one) # set.seed(1234) # check OK # 
set.seed(1231) # check OK set.seed(1232) ns <- 500 # ns <- 5000 r <- rnorm(ns) d8 <- data.frame(left = r, right = r) delta <- rlnorm(ns) icensored <- rbinom(ns, size = 1, prob = 0.2) Lcensored <- rbinom(ns, size = 1, prob = 0.2*(1 - icensored)) Rcensored <- rbinom(ns, size = 1, prob = 0.3*(1 - icensored)*(1 - Lcensored)) # icensored + Lcensored + Rcensored d8$left <- d8$left * (1 - Lcensored) + (-1000) * Lcensored d8$right <- d8$right * (1 - Rcensored) + (1000) * Rcensored d8$right <- d8$right + delta * icensored d8$right[d8$right == 1000] <- NA d8$left[d8$left == -1000] <- NA d <- d8 par(mfrow = c(2,2)) system.time(plotdistcens(d, NPMLE.method = "Turnbull.middlepoints")) system.time(plotdistcens(d, NPMLE.method = "Turnbull.intervals")) system.time(plotdistcens(d, NPMLE.method = "Wang")) } fitdistrplus/tests/t-cdfcomp.R0000644000176200001440000003260514102205434016150 0ustar liggesuserslibrary(fitdistrplus) # ?cdfcomp visualize <- FALSE # TRUE for manual tests with visualization of results nsample <- 1000 nsample <- 10 # (1) Plot various distributions fitted to serving size data # data(groundbeef) serving <- groundbeef$serving fitW <- fitdist(serving,"weibull") fitln <- fitdist(serving,"lnorm") fitg <- fitdist(serving,"gamma") #sanity checks try(cdfcomp("list(fitW, fitln, fitg)",horizontals = FALSE), silent=TRUE) try(cdfcomp(list(fitW, fitln, fitg, a=1),horizontals = FALSE), silent=TRUE) #real call cdfcomp(list(fitW, fitln, fitg), horizontals = FALSE) cdfcomp(list(fitW, fitln, fitg), horizontals = TRUE) cdfcomp(list(fitW, fitln, fitg), horizontals = TRUE, lines01 = TRUE) cdfcomp(list(fitW, fitln, fitg), horizontals = TRUE, verticals = TRUE, datacol = "grey") if (requireNamespace ("ggplot2", quietly = TRUE)) { cdfcomp(list(fitW, fitln, fitg), horizontals = FALSE, plotstyle = "ggplot") } if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { cdfcomp(list(fitW, fitln, fitg), horizontals = TRUE, plotstyle = "ggplot") cdfcomp(list(fitW, fitln, fitg), horizontals = TRUE, lines01 = TRUE, plotstyle = "ggplot") cdfcomp(list(fitW, fitln, fitg), horizontals = TRUE, verticals = TRUE, datacol = "grey", plotstyle = "ggplot") } cdfcomp(list(fitW, fitln, fitg), legendtext = c("Weibull", "lognormal", "gamma"), main = "ground beef fits", xlab = "serving sizes (g)", ylab = "F(g)", xlim = c(0, 250), ylim = c(.5, 1)) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { cdfcomp(list(fitW, fitln, fitg), legendtext = c("Weibull", "lognormal", "gamma"), main = "ground beef fits", xlab = "serving sizes (g)", ylab = "F(g)", xlim = c(0, 250), ylim = c(.5, 1), plotstyle = "ggplot") } cdfcomp(list(fitW, fitln, fitg), legendtext = c("Weibull", "lognormal", "gamma"), main = "ground beef fits", xlab = "serving sizes (g)", ylab = "F(g)", xlogscale = TRUE) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { cdfcomp(list(fitW, fitln, fitg), legendtext = c("Weibull", "lognormal", "gamma"), main = "ground beef fits", xlab = "serving sizes (g)", ylab = "F(g)", xlogscale = TRUE, plotstyle = "ggplot") } cdfcomp(list(fitW,fitln,fitg),legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits",xlab="serving sizes (g)", ylab="F(g)", xlogscale=TRUE, ylogscale=TRUE, ylim=c(.005, .99)) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { cdfcomp(list(fitW,fitln,fitg),legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits",xlab="serving sizes (g)", ylab="F(g)", xlogscale=TRUE, ylogscale=TRUE, ylim=c(.005, .99), plotstyle = "ggplot") } cdfcomp(list(fitW,fitln,fitg), 
legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits",xlab="serving sizes (g)", ylab="F(g)",xlim = c(0,250), xlegend = "topleft") if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { cdfcomp(list(fitW,fitln,fitg), legendtext=c("Weibull","lognormal","gamma"), main="ground beef fits",xlab="serving sizes (g)", ylab="F(g)",xlim = c(0,250), xlegend = "topleft", plotstyle = "ggplot") } # (2) Plot lognormal distributions fitted by # maximum goodness-of-fit estimation # using various distances (data plotted in log scale) # data(endosulfan) ATV <-subset(endosulfan, group == "NonArthroInvert")$ATV taxaATV <- subset(endosulfan, group == "NonArthroInvert")$taxa flnMGEKS <- fitdist(ATV,"lnorm",method="mge",gof="KS") flnMGEAD <- fitdist(ATV,"lnorm",method="mge",gof="AD") flnMGEADL <- fitdist(ATV,"lnorm",method="mge",gof="ADL") flnMGEAD2L <- fitdist(ATV,"lnorm",method="mge",gof="AD2L") llfit <- list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L) cdfcomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), xlogscale=TRUE, main="fits of a lognormal dist. using various GOF dist.") cdfcomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), xlogscale=TRUE,main="fits of a lognormal dist. using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L")) cdfcomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), xlogscale=TRUE,main="fits of a lognormal dist. using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L"), fitcol=c("black", "darkgreen", "yellowgreen", "yellow2"), horizontals=FALSE, datapch="+") cdfcomp(llfit, xlogscale=TRUE, main="fits of a lognormal dist. using various GOF dist.", legendtext=paste("MGE", c("KS","AD","ADL","AD2L")), fitcol="grey35", fitlty="dotted", horizontals=FALSE, datapch=21, datacol="grey30") cdfcomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), xlogscale=TRUE, verticals=TRUE, xlim=c(10,100000), datapch=21) cdfcomp(flnMGEKS, xlogscale=TRUE, verticals=TRUE, xlim=c(10,100000)) cdfcomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), xlogscale=TRUE, verticals=TRUE, xlim=c(1,100000), datapch=21, name.points=taxaATV) cdfcomp(flnMGEKS, xlogscale=TRUE, verticals=TRUE, xlim=c(1,100000), name.points=taxaATV) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { cdfcomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), xlogscale=TRUE, main = "fits of a lognormal dist. using various GOF dist.", plotstyle = "ggplot") cdfcomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), xlogscale=TRUE,main="fits of a lognormal dist. using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L"), plotstyle = "ggplot") cdfcomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), xlogscale=TRUE,main="fits of a lognormal dist. using various GOF dist.", legendtext=c("MGE KS","MGE AD","MGE ADL","MGE AD2L"), fitcol=c("black", "darkgreen", "yellowgreen", "yellow2"), horizontals=FALSE, datapch="+", plotstyle = "ggplot") cdfcomp(llfit, xlogscale=TRUE, main="fits of a lognormal dist. 
using various GOF dist.", legendtext=paste("MGE", c("KS","AD","ADL","AD2L")), fitcol="grey35", fitlty="dotted", horizontals=FALSE, datapch=21, datacol="grey30", plotstyle = "ggplot") cdfcomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), xlogscale=TRUE, verticals=TRUE, xlim=c(10,100000), datapch=21, plotstyle = "ggplot") cdfcomp(flnMGEKS, xlogscale=TRUE, verticals=TRUE, xlim=c(10,100000), plotstyle = "ggplot") cdfcomp(list(flnMGEKS, flnMGEAD, flnMGEADL, flnMGEAD2L), xlogscale=TRUE, verticals=TRUE, xlim=c(1,100000), datapch=21, name.points=taxaATV, plotstyle = "ggplot") cdfcomp(flnMGEKS, xlogscale=TRUE, verticals=TRUE, xlim=c(1,100000), name.points=taxaATV, plotstyle = "ggplot") } # (3) Plot normal and logistic distributions fitted by # maximum likelihood estimation # using various plotting positions in cdf plots # if (visualize) { data(endosulfan) log10ATV <-log10(subset(endosulfan, group == "NonArthroInvert")$ATV) taxaATV <- subset(endosulfan, group == "NonArthroInvert")$taxa fln <- fitdist(log10ATV, "norm") fll <- fitdist(log10ATV, "logis") # default plot using Hazen plotting position: (1:n - 0.5)/n cdfcomp(list(fln, fll), legendtext = c("normal", "logistic"), xlab = "log10ATV", name.points=taxaATV, xlim = c(0,5)) if (requireNamespace ("ggplot2", quietly = TRUE)) { cdfcomp(list(fln, fll), legendtext = c("normal", "logistic"),xlab = "log10ATV", name.points=taxaATV, xlim=c(0,5), plotstyle = "ggplot") } # plot using mean plotting position (named also Gumbel plotting position) # (1:n)/(n + 1) cdfcomp(list(fln,fll),legendtext=c("normal","logistic"),xlab="log10ATV", use.ppoints = TRUE, a.ppoints = 0) if (requireNamespace ("ggplot2", quietly = TRUE)) { cdfcomp(list(fln,fll),legendtext=c("normal","logistic"),xlab="log10ATV", use.ppoints = TRUE, a.ppoints = 0, plotstyle = "ggplot") } # plot using basic plotting position: (1:n)/n cdfcomp(list(fln,fll),legendtext=c("normal","logistic"),xlab="log10ATV", use.ppoints = FALSE) if (requireNamespace ("ggplot2", quietly = TRUE)) { cdfcomp(list(fln,fll),legendtext=c("normal","logistic"),xlab="log10ATV", use.ppoints = FALSE, plotstyle = "ggplot") } } # (4) Plot lognormal distributions fitted by # maximum goodness-of-fit estimation # using various distances (data plotted in log scale) # if (visualize) { x1 <- c(6.4,13.3,4.1,1.3,14.1,10.6,9.9,9.6,15.3,22.1,13.4,13.2,8.4,6.3,8.9,5.2,10.9,14.4) x <- seq(0, 1.1*max(x1), length=100) dgumbel <- function(x,a,b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b)) pgumbel <- function(q,a,b) exp(-exp((a-q)/b)) f1 <- mledist(x1, "norm") f2 <- mledist(x1, "gumbel", start = list(a = 10, b = 5)) f3 <- mledist(x1, "exp") # graph 1 plot(ecdf(x1)) lines(x, pnorm(x, f1$estimate[1], f1$estimate[2]), col = "red") lines(x, pgumbel(x, f2$estimate[1], f2$estimate[2]), col = "green") lines(x, pexp(x, f3$estimate[1]), col = "blue") legend("bottomright", lty = 1, leg = c("Normal", "Gumbel", "Exp"), col = c("red", "green", "blue")) # graph 2 f1 <- fitdist(x1, "norm") f2 <- fitdist(x1, "gumbel", start = list(a = 10, b = 5)) f3 <- fitdist(x1, "exp") cdfcomp(list(f1, f2, f3), xlim=range(x), fitcol = c("red", "green", "blue"), fitlty = 1, legendtext = c("Normal", "Gumbel", "Exp")) cdfcomp(list(f1, f2, f3), xlim=range(x), fitcol = c("red", "green", "blue"), fitlty = 1, fitlwd = (1:3)*2, legendtext = c("Normal", "Gumbel", "Exp")) # graph 3 if (requireNamespace ("ggplot2", quietly = TRUE)) { cdfcomp(list(f1, f2, f3), xlim=range(x), fitcol=c("red","green","blue"), fitlty = 1, legendtext = c("Normal", "Gumbel", "Exp"), plotstyle = "ggplot") cdfcomp(list(f1, 
f2, f3), xlim=range(x), fitcol=c("red","green","blue"), fitlty = 1, fitlwd = 1:3, legendtext = c("Normal", "Gumbel", "Exp"), plotstyle = "ggplot") } } # (5) normal mixture # # mixture of two normal distributions # density dnorm2 <- function(x, poid, m1, s1, m2, s2) poid*dnorm(x, m1, s1) + (1-poid)*dnorm(x, m2, s2) # numerically approximate quantile function qnorm2 <- function(p, poid, m1, s1, m2, s2) { L2 <- function(x, prob) (prob - pnorm2(x, poid, m1, s1, m2, s2))^2 sapply(p, function(pr) optimize(L2, c(-1000, 1000), prob=pr)$minimum) } # distribution function pnorm2 <- function(q, poid, m1, s1, m2, s2) poid*pnorm(q, m1, s1) + (1-poid)*pnorm(q, m2, s2) # basic normal distribution set.seed(1234) x2 <- c(rnorm(nsample, 5), rnorm(nsample, 10)) # MLE fit fit1 <- fitdist(x2, "norm2", "mle", start=list(poid=1/3, m1=4, s1=2, m2=8, s2=2), lower=c(0, 0, 0, 0, 0)) fit2 <- fitdist(x2, "norm2", "qme", probs=c(1/6, 1/4, 1/3, 1/2, 2/3), start=list(poid=1/3, m1=4, s1=2, m2=8, s2=2), lower=c(0, 0, 0, 0, 0), upper=c(1/2, Inf, Inf, Inf, Inf)) fit3 <- fitdist(x2, "norm2", "mge", gof="AD", start=list(poid=1/3, m1=4, s1=2, m2=8, s2=2), lower=c(0, 0, 0, 0, 0), upper=c(1/2, Inf, Inf, Inf, Inf)) cdfcomp(list(fit1, fit2, fit3), datapch=".") cdfcomp(list(fit1, fit2, fit3), datapch=".", xlim=c(6, 8), ylim=c(.4, .55)) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { cdfcomp(list(fit1, fit2, fit3), datapch=".", plotstyle = "ggplot") cdfcomp(list(fit1, fit2, fit3), datapch=".", xlim=c(6, 8), ylim=c(.4, .55), plotstyle = "ggplot") } # (6) discrete example # set.seed(1234) x3 <- rpois(20, 10) fit1 <- fitdist(x3, "pois", "mle") fit2 <- fitdist(x3, "nbinom", "qme", probs=c(1/3, 2/3)) cdfcomp(list(fit1, fit2), datapch=21, horizontals=FALSE) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { cdfcomp(list(fit1, fit2), datapch=21, horizontals=FALSE, plotstyle = "ggplot") } # (7) large dataset # if (visualize) { n <- 1e4 f1 <- fitdist(rlnorm(n), "lnorm") cdfcomp(f1, do.points=TRUE) cdfcomp(f1, do.points=FALSE) cdfcomp(f1, horizontals = FALSE, verticals = FALSE, do.points = FALSE) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { cdfcomp(f1, do.points = TRUE, plotstyle = "ggplot") cdfcomp(f1, do.points = FALSE, plotstyle = "ggplot") cdfcomp(f1, horizontals = FALSE, verticals = FALSE, do.points = FALSE, plotstyle = "ggplot") } } # (8) argument add (must give the same plot (except colors) as ex. 
6) # set.seed(1234) x3 <- rpois(nsample, 10) fit1 <- fitdist(x3, "pois", "mle") cdfcomp(fit1, fitcol = "red", horizontals=FALSE, addlegend = FALSE) fit2 <- fitdist(x3, "nbinom", "qme", probs=c(1/3, 2/3)) cdfcomp(fit2, fitcol = "blue", horizontals=FALSE, addlegend = FALSE, add = TRUE) cdfcomp(list(fit1, fit2), horizontals=FALSE, addlegend = FALSE, fitcol=c("red", "blue")) if (requireNamespace ("ggplot2", quietly = TRUE) & visualize) { # the argument add is not available when plotstyle = "ggplot" cdfcomp(list(fit1, fit2), fitcol = c("red", "blue"), fitlty = 1, horizontals = FALSE, addlegend = FALSE, plotstyle = "ggplot") } # (9) test legend labels # if (visualize) { serving <- groundbeef$serving fitW <- fitdist(serving,"weibull") fitW2 <- fitdist(serving,"weibull", method="qme", probs=c(1/3,2/3)) fitW3 <- fitdist(serving,"weibull", method="qme", probs=c(1/2,2/3)) fitln <- fitdist(serving,"lnorm") fitg <- fitdist(serving,"gamma") cdfcomp(list(fitW, fitln, fitg)) #distrib cdfcomp(list(fitW, fitW2, fitln, fitg)) #distrib+method cdfcomp(list(fitW, fitW2, fitW3, fitln, fitg)) #distrib+method+num if (requireNamespace ("ggplot2", quietly = TRUE)) cdfcomp(list(fitW, fitW2, fitW3, fitln, fitg), plotstyle = "ggplot") #distrib+method+num } fitdistrplus/tests/t-cdfcompcens.R0000644000176200001440000000615713742313702017033 0ustar liggesuserslibrary(fitdistrplus) visualize <- FALSE # TRUE for manual tests with visualization of results # (1) Plot various distributions fitted to bacterial contamination data # data(smokedfish) Clog10 <- log10(smokedfish) fitsfn <- fitdistcens(Clog10,"norm") fitsfl <- fitdistcens(Clog10,"logis") dgumbel <- function(x,a,b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b)) pgumbel <- function(q,a,b) exp(-exp((a-q)/b)) qgumbel <- function(p,a,b) a-b*log(-log(p)) fitsfg <- fitdistcens(Clog10, "gumbel", start=list(a=-3,b=3)) cdfcompcens(list(fitsfn,fitsfl,fitsfg)) cdfcompcens(list(fitsfn,fitsfl,fitsfg), fitlty=1, fitlwd=3) # Same plot in y logscale cdfcompcens(list(fitsfn, fitsfl, fitsfg), NPMLE.method = "Turnbull", ylogscale = TRUE, ylim=c(.5, .99)) cdfcompcens(list(fitsfn, fitsfl, fitsfg), NPMLE.method = "Wang", ylogscale = TRUE, ylim=c(.5, .99)) if (requireNamespace ("ggplot2", quietly = TRUE)) { cdfcompcens(list(fitsfn,fitsfl,fitsfg), plotstyle = "ggplot") cdfcompcens(list(fitsfn,fitsfl,fitsfg), plotstyle = "ggplot", fitlty=1, fitlwd=3) cdfcompcens(list(fitsfn,fitsfl,fitsfg), datacol="grey", legendtext=c("normal","logistic","Gumbel"), main="bacterial contamination fits", xlab="bacterial concentration (CFU/g)", ylab="F", xlegend = "center", lines01 = TRUE, plotstyle = "ggplot") # Same plot in y logscale cdfcompcens(list(fitsfn, fitsfl, fitsfg), NPMLE.method = "Wang", ylogscale = TRUE, ylim=c(.5, .99), plotstyle = "ggplot") } # Use of x logscale if (visualize) { if(any(installed.packages()[,"Package"] == "actuar")) { require(actuar) data(smokedfish) fln <- fitdistcens(smokedfish,"lnorm") fll <- fitdistcens(smokedfish,"llogis") cdfcompcens(list(fln, fll)) cdfcompcens(list(fln, fll), xlogscale = TRUE) cdfcompcens(list(fln, fll), xlogscale = TRUE, xlim = c(0.01, 1000)) cdfcompcens(list(fln, fll), NPMLE.method = "Turnbull", xlogscale = TRUE, xlim = c(0.01, 1000)) if (requireNamespace ("ggplot2", quietly = TRUE)) { cdfcompcens(list(fln, fll), plotstyle = "ggplot") cdfcompcens(list(fln, fll), xlogscale = TRUE, plotstyle = "ggplot") cdfcompcens(list(fln, fll), xlogscale = TRUE, xlim = c(0.01, 1000), plotstyle = "ggplot") } } # same plot using argument add cdfcompcens(fitsfn, addlegend = FALSE, 
fitcol = "red") cdfcompcens(fitsfl, addlegend = FALSE, fillrect = NA, fitcol = "green", add = TRUE) cdfcompcens(fitsfg, addlegend = FALSE, fillrect = NA, fitcol = "blue", add = TRUE) cdfcompcens(list(fitsfn, fitsfl, fitsfg), addlegend = FALSE, fitcol = 2:4, fitlty = 1, plotstyle = "ggplot") cdfcompcens(list(fitsfn, fitsfl, fitsfg), addlegend = FALSE, fitcol = 2:4, fitlty = 1, plotstyle = "ggplot") } # Test on the salinity data set # data(salinity) log10LC50 <-log10(salinity) plotdistcens(log10LC50) plotdistcens(log10LC50, NPMLE = FALSE) fn <- fitdistcens(log10LC50,"norm") fl <- fitdistcens(log10LC50,"logis") plot(fn) plot(fl) cdfcompcens(list(fn, fl)) if (requireNamespace("ggplot2", quietly = TRUE)) { cdfcompcens(list(fn, fl), plotstyle = "ggplot") } fitdistrplus/tests/t-mgedist.R0000644000176200001440000000700214050660542016171 0ustar liggesuserslibrary(fitdistrplus) nsample <- 10 # (1) Fit of a Weibull distribution to serving size data by maximum # goodness-of-fit estimation using all the distances available # data(groundbeef) serving <- groundbeef$serving mgedist(serving, "weibull", gof="CvM") mgedist(serving, "weibull", gof="CvM", silent = FALSE) mgedist(serving, "weibull", gof="KS") mgedist(serving, "weibull", gof="AD") mgedist(serving, "weibull", gof="ADR") mgedist(serving, "weibull", gof="ADL") mgedist(serving, "weibull", gof="AD2R") mgedist(serving, "weibull", gof="AD2L") mgedist(serving, "weibull", gof="AD2") # (2) Fit of a uniform distribution using Cramer-von Mises or # Kolmogorov-Smirnov distance # set.seed(1234) u <- runif(nsample, min=5, max=10) mgedist(u, "unif", gof="CvM") mgedist(u, "unif", gof="KS") # (3) Fit of a triangular distribution using Cramer-von Mises or # Kolmogorov-Smirnov distance # require(mc2d) set.seed(1234) t <- rtriang(nsample, min=5, mode=6, max=10) mgedist(t, "triang", start = list(min=4, mode=6,max=9), gof="CvM") mgedist(t, "triang", start = list(min=4, mode=6,max=9), gof="KS") # (4) scaling problem # the simulated dataset (below) has particularly small values, hence without scaling (10^0), # the optimization raises an error. The for loop shows how scaling by 10^i # for i=1,...,6 makes the fitting procedure work correctly. if(FALSE) { set.seed(1234) x2 <- rnorm(nsample, 1e-4, 2e-4) for(i in 6:0) cat(i, try(mgedist(x2*10^i,"cauchy")$estimate, silent=TRUE), "\n") } # (5) scaling problem # if(FALSE) { x <- c(-0.00707717, -0.000947418, -0.00189753, -0.000474947, -0.00190205, -0.000476077, 0.00237812, 0.000949668, 0.000474496, 0.00284226, -0.000473149, -0.000473373, 0, 0, 0.00283688, -0.0037843, -0.0047506, -0.00238379, -0.00286807, 0.000478583, 0.000478354, -0.00143575, 0.00143575, 0.00238835, 0.0042847, 0.00237248, -0.00142281, -0.00142484, 0, 0.00142484, 0.000948767, 0.00378609, -0.000472478, 0.000472478, -0.0014181, 0, -0.000946522, -0.00284495, 0, 0.00331832, 0.00283554, 0.00141476, -0.00141476, -0.00188947, 0.00141743, -0.00236351, 0.00236351, 0.00235794, 0.00235239, -0.000940292, -0.0014121, -0.00283019, 0.000472255, 0.000472032, 0.000471809, -0.0014161, 0.0014161, -0.000943842, 0.000472032, -0.000944287, -0.00094518, -0.00189304, -0.000473821, -0.000474046, 0.00331361, -0.000472701, -0.000946074, 0.00141878, -0.000945627, -0.00189394, -0.00189753, -0.0057143, -0.00143369, -0.00383326, 0.00143919, 0.000479272, -0.00191847, -0.000480192, 0.000960154, 0.000479731, 0, 0.000479501, 0.000958313, -0.00383878, -0.00240674, 0.000963391, 0.000962464, -0.00192586, 0.000481812, -0.00241138, -0.00144963) #only i == 0, no scaling, should not converge. 
for(i in 6:0) cat(i, try(mgedist(x*10^i,"cauchy")$estimate, silent=TRUE), "\n") } # (6) test error messages # dnorm2 <- pnorm2 <- function(x, a) "NA" x <- rexp(10) #should get a one-line error res <- mgedist(x, "norm2", start=list(a=1)) #as in attr(try(log("a"), silent=TRUE), "condition") # (7) test the component optim.message x <- rnorm(1000) #change parameter to obtain unsuccessful convergence mgedist(x, "norm", control=list(maxit=2), start=list(mean=1e5, sd=1), optim.method="L-BFGS-B", lower=0) # (8) test bounds x <- rnorm(1000) mgedist(x, "norm", optim.method="L-BFGS-B", lower=c(-Inf, 0)) #optim and L-BFGS-B mgedist(x, "norm", optim.method="Nelder", lower=c(-Inf, 0)) fitdistrplus/tests/t-startingvalues.R0000644000176200001440000000316013742313702017611 0ustar liggesuserslibrary(fitdistrplus) library(actuar) x <- c(3.1334614, 1.0300544, 0.8839272, 0.7503320, 1.9378476, 1.7571313, 0.5369516, 3.1181863, 1.7412457, 1.1514215, 1.6252670, 1.3333488, 3.0594471, 1.0479867, 0.4466562, 0.4387154, 1.5234127, 1.5476924, 1.3279816, 1.3978141) ft_igam <- fitdist(x, "invgamma") x <- c(2.3,0.1,2.7,2.2,0.4,2.6,0.2,1.,7.3,3.2,0.8,1.2,33.7,14., 21.4,7.7,1.,1.9,0.7,12.6,3.2,7.3,4.9,4000.,2.5,6.7,3.,63., 6.,1.6,10.1,1.2,1.5,1.2,30.,3.2,3.5,1.2,0.2,1.9,0.7,17., 2.8,4.8,1.3,3.7,0.2,1.8,2.6,5.9,2.6,6.3,1.4,0.8) ft_llogis <- fitdist(x,"llogis") x <- c(0.3837053, 0.8576858, 0.3552237, 0.6226119, 0.4783756, 0.3139799, 0.4051403, 0.4537631, 0.4711057, 0.5647414, 0.6479617, 0.7134207, 0.5259464, 0.5949068, 0.3509200, 0.3783077, 0.5226465, 1.0241043, 0.4384580, 1.3341520) ft_iw <- fitdist(x,"invweibull") x <- c(2.06832800, 0.80102221, 9.32589765, 4.31776597, 1.10230717, 2.06120589, 1.17598984, 0.49806928, 27.25326920, 0.03468531, 1.92468452, 0.07877918, 0.58094911, 0.31080116, 3.86929105, 1.05732456, 3.06687014, 7.48990404, 0.18640324, 2.50568830) #equivalent to Pareto type 2 ft_par2 <- fitdist(x, "pareto") x <- c(1.083905, 1.266773, 1.044362, 1.557070, 1.227676, 1.195054, 1.241101, 1.224004, 1.014631, 1.173721, 1.051038, 1.104520, 1.133755, 1.051605, 1.081230, 1.123193, 1.005997, 1.044696, 1.155369, 1.039439) #equivalent to Pareto type 1 ft_par1 <-fitdist(x, "pareto1", upper=c(Inf, min(x)), fix.arg=list(min = 1), control=list(trace=1, REPORT=1), silent=FALSE) fitdistrplus/tests/t-gofstat.R0000644000176200001440000000313214075476130016211 0ustar liggesuserslibrary(fitdistrplus) # (1) fit of two distributions by maximum likelihood estimation # to the serving size data # and comparison of goodness-of-fit statistics # data(groundbeef) serving <- groundbeef$serving (fitg <- fitdist(serving, "gamma")) gg <- gofstat(fitg) (fitln <- fitdist(serving, "lnorm")) gn <- gofstat(fitln) gofstat(list(fitg, fitln)) # (2) fit of two discrete distributions to toxocara data # and comparison of goodness-of-fit statistics # data(toxocara) number <- toxocara$number fitp <- fitdist(number, "pois") summary(fitp) plot(fitp) gp <- gofstat(fitp) gp fitnb <- fitdist(number, "nbinom") summary(fitnb) plot(fitnb) gnb <- gofstat(fitnb) gnb gofstat(list(fitp, fitnb)) attributes(gofstat(list(fitp, fitnb))) # (3) Use of Chi-squared results in addition to # recommended statistics for continuous distributions # set.seed(1234) x4 <- rweibull(n=10,shape=2,scale=1) # fit of the good distribution f4 <- fitdist(x4, "weibull") g4 <- gofstat(f4, meancount=10) print(g4) # fit of a bad distribution f4b <- fitdist(x4, "cauchy") g4b <- gofstat(f4b, meancount=10) print(g4b) # (4) estimation of the standard deviation of a normal distribution # by maximum 
likelihood with the mean fixed at 10 using the argument fix.arg # f1b <- fitdist(serving, "norm", start=list(sd=5), fix.arg=list(mean=10), lower=0) gofstat(f1b) # (5) Use on a small data set (less than 10 observations) # no pb identified # set.seed(1234) x5a <- rweibull(n=4,shape=2,scale=1) f5a <- fitdist(x5a, "weibull") (g5a <- gofstat(f5a)) x5b <- rpois(n = 4, lambda = 1) f5b <- fitdist(x5b, "pois") (g5b <- gofstat(f5b)) fitdistrplus/tests/t-weird-qqcomp-cens.R0000644000176200001440000000635213742313702020102 0ustar liggesuserslibrary(fitdistrplus) set.seed(123) n <- 20 visualize <- FALSE # TRUE for manual tests with visualization of results if (visualize) { # (1) test qqcomp/qqcompcens on a good example # x <- rlnorm(n, 0, 1) dx <- data.frame(left=x, right=x) dx$right[1:(n/2)*2] <- NA dx$left[2:(n/4)*4-1] <- NA f1 <- fitdist(x, "lnorm") f1c <- fitdistcens(dx, "lnorm") f3 <- fitdist(x, "lnorm", fix.arg=list(sdlog=1)) f3c <- fitdistcens(dx, "lnorm", fix.arg=list(sdlog=1)) par(mfrow=1:2) qqcomp(f1) qqcompcens(f1c) par(mfrow = c(1,1)) if(requireNamespace ("ggplot2", quietly = TRUE)) { qqcomp(f1, plotstyle = "ggplot") qqcompcens(f1c, plotstyle = "ggplot") } #test log-scale par(mfrow=1:2, mar=c(4,4,2,1)) qqcomp(f1, xlogscale = TRUE, ylogscale = TRUE) qqcompcens(f1c, xlogscale = TRUE, ylogscale = TRUE) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { qqcomp(f1, xlogscale = TRUE, ylogscale = TRUE, plotstyle = "ggplot") qqcompcens(f1c, xlogscale = TRUE, ylogscale = TRUE, plotstyle = "ggplot") } # (2) test qqcomp/qqcompcens on a weird example # f2 <- fitdist(x, "unif") f2c <- fitdistcens(dx, "unif") par(mfrow=1:2, mar=c(4,4,2,1)) qqcomp(list(f1, f2, f3)) qqcompcens(list(f1c, f2c, f3c)) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { qqcomp(list(f1, f2, f3), plotstyle = "ggplot") qqcompcens(list(f1c, f2c, f3c), plotstyle = "ggplot") } #test log-scale par(mfrow=1:2, mar=c(4,4,2,1)) qqcomp(list(f1, f2, f3), xlogscale = TRUE, ylogscale = TRUE) qqcompcens(list(f1c, f2c, f3c), xlogscale = TRUE, ylogscale = TRUE) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { qqcomp(list(f1, f2, f3), xlogscale = TRUE, ylogscale = TRUE, plotstyle = "ggplot") qqcompcens(list(f1c, f2c, f3c), xlogscale = TRUE, ylogscale = TRUE, plotstyle = "ggplot") } #test y noise par(mfrow=1:2, mar=c(4,4,2,1)) qqcomp(list(f1, f2, f3)) qqcomp(list(f1, f2, f3), ynoise=FALSE) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { qqcomp(list(f1, f2, f3), ynoise=FALSE, plotstyle = "ggplot") } par(mfrow=1:2, mar=c(4,4,2,1)) qqcompcens(list(f1c, f2c, f3c)) qqcompcens(list(f1c, f2c, f3c), ynoise=FALSE) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { qqcompcens(list(f1c, f2c, f3c), plotstyle = "ggplot") } #test log-scale y-noise par(mfrow=1:2, mar=c(4,4,2,1)) qqcomp(list(f1, f2, f3), xlogscale = TRUE, ylogscale = TRUE) qqcomp(list(f1, f2, f3), xlogscale = TRUE, ylogscale = TRUE, ynoise=FALSE) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { qqcomp(list(f1, f2, f3), xlogscale = TRUE, ylogscale = TRUE, ynoise=FALSE, plotstyle = "ggplot") } par(mfrow=1:2, mar=c(4,4,2,1)) qqcompcens(list(f1c, f2c), xlogscale = TRUE, ylogscale = TRUE) qqcompcens(list(f1c, f2c), xlogscale = TRUE, ylogscale = TRUE, ynoise=FALSE) par(mfrow = c(1,1)) if (requireNamespace ("ggplot2", quietly = TRUE)) { qqcompcens(list(f1c, f2c), xlogscale = TRUE, ylogscale = TRUE, plotstyle = "ggplot") } } 
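# appended sketch (not in the original test file): a minimal check that runs
# even when visualize is FALSE, re-using the data construction from example (1)
# above; exact observations coded as left == right should be accepted both by
# qqcomp() on the fitdist object and by qqcompcens() on the fitdistcens object.
x <- rlnorm(n, 0, 1)
dx <- data.frame(left = x, right = x)
f1 <- fitdist(x, "lnorm")
f1c <- fitdistcens(dx, "lnorm")
qqcomp(f1, ynoise = FALSE)
qqcompcens(f1c, ynoise = FALSE)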
fitdistrplus/vignettes/0000755000176200001440000000000014421751624015024 5ustar liggesusersfitdistrplus/vignettes/FAQ.Rmd0000644000176200001440000017036414411504652016106 0ustar liggesusers---
title: Frequently Asked Questions
author: Marie Laure Delignette Muller, Christophe Dutang
date: '`r Sys.Date()`'
output:
  bookdown::html_document2:
    base_format: rmarkdown::html_vignette
    fig_caption: yes
    toc: true
    number_sections: yes
link-citations: true
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Frequently Asked Questions}
  %!\VignetteEncoding{UTF-8}
  \usepackage[utf8]{inputenc}
pkgdown:
  as_is: true
---

```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
require(fitdistrplus)
set.seed(1234)
options(digits = 3)
```

# Questions regarding distributions

## How do I know the root name of a distribution?

The root name of a probability distribution is the `name` which is used in the `d`, `p`, `q`, `r` functions. For base R distributions, the root names are given in R-intro: https://cran.r-project.org/doc/manuals/R-intro.html#Probability-distributions. For example, you must use `"pois"` for the Poisson distribution and **not** `"poisson"`.

## How do I find "non standard" distributions?

For non-standard distributions, you can either find a package implementing them or define them yourself. A comprehensive list of non-standard distributions is given in the Distributions task view https://CRAN.R-project.org/view=Distributions. Here are two examples of user-defined distributions. A third example (shifted exponential) is given in FAQ 3.5.4.

* The Gumbel distribution

```{r, eval=FALSE}
dgumbel <- function(x, a, b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b))
pgumbel <- function(q, a, b) exp(-exp((a-q)/b))
qgumbel <- function(p, a, b) a-b*log(-log(p))
data(groundbeef)
fitgumbel <- fitdist(groundbeef$serving, "gumbel", start=list(a=10, b=10))
```

* The zero-modified geometric distribution

```{r, eval=FALSE}
dzmgeom <- function(x, p1, p2) p1 * (x == 0) + (1-p1)*dgeom(x-1, p2)
pzmgeom <- function(q, p1, p2) p1 * (q >= 0) + (1-p1)*pgeom(q-1, p2)
rzmgeom <- function(n, p1, p2)
{
  u <- rbinom(n, 1, 1-p1) #prob to get zero is p1
  u[u != 0] <- rgeom(sum(u != 0), p2)+1
  u
}
x2 <- rzmgeom(1000, 1/2, 1/10)
fitdist(x2, "zmgeom", start=list(p1=1/2, p2=1/2))
```

## How do I set (or find) initial values for non standard distributions?

As documented, we provide initial values for the following distributions: `"norm"`, `"lnorm"`, `"exp"`, `"pois"`, `"cauchy"`, `"gamma"`, `"logis"`, `"nbinom"`, `"geom"`, `"beta"`, `"weibull"` from the `stats` package; `"invgamma"`, `"llogis"`, `"invweibull"`, `"pareto1"`, `"pareto"`, `"lgamma"`, `"trgamma"`, `"invtrgamma"` from the `actuar` package.

Look first at statistics and probability books such as

* different volumes of N. L. Johnson, S. Kotz and N. Balakrishnan books, e.g. **Continuous Univariate Distributions, Vol. 1**,
* **Thesaurus of univariate discrete probability distributions** by G. Wimmer and G. Altmann.
* **Statistical Distributions** by M. Evans, N. Hastings, B. Peacock.
* **Distributional Analysis with L-moment Statistics using the R Environment for Statistical Computing** by W. Asquith.

If not available, find initial values by equating theoretical and empirical quartiles.
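For instance, here is a quick sketch of this moment-matching strategy for a gamma candidate (purely illustrative, since `fitdist` already provides default starting values for the gamma distribution): as $E(X)=\alpha/\lambda$ and $Var(X)=\alpha/\lambda^2$, starting values can be computed from the empirical mean and variance.

```{r, eval=FALSE}
# moment-based starting values for a gamma candidate:
# E(X) = shape/rate and Var(X) = shape/rate^2 imply
# rate = mean(x)/var(x) and shape = mean(x)^2/var(x)
x <- rgamma(100, shape = 2, rate = 3)
start_gamma <- list(shape = mean(x)^2/var(x), rate = mean(x)/var(x))
fitdist(x, "gamma", start = start_gamma)
```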
The graphical functions `plotdist()` and `plotdistcens()` can also be used to assess the suitability of starting values: by an iterative manual process, you can move parameter values so as to obtain a distribution that roughly fits the data, and then take these parameter values as starting values for the real fit. You may also consider the `prefit()` function to find initial values, especially in cases where parameters are constrained.

## Is it possible to fit a distribution with at least 3 parameters?

Yes, an example with the Burr distribution is detailed in the JSS paper. We reproduce it very quickly here.

```{r, message=FALSE}
data("endosulfan")
library("actuar")
fendo.B <- fitdist(endosulfan$ATV, "burr", start = list(shape1 = 0.3, shape2 = 1, rate = 1))
summary(fendo.B)
```

## Why are there differences between MLE and MME for the lognormal distribution?

We recall that the lognormal distribution function is given by
$$
F_X(x) = \Phi\left(\frac{\log(x)-\mu}{\sigma} \right),
$$
where $\Phi$ denotes the distribution function of the standard normal distribution. We know that $E(X) = \exp\left( \mu+\frac{1}{2} \sigma^2 \right)$ and $Var(X) = \exp\left( 2\mu+\sigma^2\right) (e^{\sigma^2} -1)$. The MME is obtained by inverting the previous formulas, whereas the MLE has the following explicit solution
$$
\hat\mu_{MLE} = \frac{1}{n}\sum_{i=1}^n \log(x_i),~~
\hat\sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (\log(x_i) - \hat\mu_{MLE})^2.
$$
Let us fit a sample by MLE and MME. The fit looks particularly good in both cases.

```{r, fig.height=3, fig.width=6}
x3 <- rlnorm(1000)
f1 <- fitdist(x3, "lnorm", method="mle")
f2 <- fitdist(x3, "lnorm", method="mme")
par(mfrow=1:2)
cdfcomp(list(f1, f2), do.points=FALSE, xlogscale = TRUE, main = "CDF plot")
denscomp(list(f1, f2), demp=TRUE, main = "Density plot")
```

Let us compare the theoretical moments (mean and variance) given the fitted values ($\hat\mu,\hat\sigma$), that is
$$
E(X) = \exp\left( \hat\mu+\frac{1}{2} \hat\sigma^2 \right), Var(X) = \exp\left( 2\hat\mu+\hat\sigma^2\right) (e^{\hat\sigma^2} -1).
$$

```{r}
c("E(X) by MME"=as.numeric(exp(f2$estimate["meanlog"]+f2$estimate["sdlog"]^2/2)),
  "E(X) by MLE"=as.numeric(exp(f1$estimate["meanlog"]+f1$estimate["sdlog"]^2/2)),
  "empirical"=mean(x3))
c("Var(X) by MME"=as.numeric(exp(2*f2$estimate["meanlog"]+f2$estimate["sdlog"]^2) * (exp(f2$estimate["sdlog"]^2)-1)),
  "Var(X) by MLE"=as.numeric(exp(2*f1$estimate["meanlog"]+f1$estimate["sdlog"]^2) * (exp(f1$estimate["sdlog"]^2)-1)),
  "empirical"=var(x3))
```

From an MLE point of view, handling a lognormal sample $x_1,\dots,x_n$ is equivalent to handling a normal sample $\log(x_1),\dots,\log(x_n)$. However, it is well known by the Jensen inequality that $E(X) = E(\exp(\log(X))) \geq \exp(E(\log(X)))$, implying that the MME provides better moment estimates than the MLE.

## Can I fit a distribution with positive support when data contains negative values?

The answer is no: you cannot fit a distribution with positive support (say the gamma distribution) when data contains negative values.

```{r}
set.seed(1234)
x <- rnorm(100, mean = 1, sd = 0.5)
(try(fitdist(x, "exp")))
```

It is irrelevant to do such a fit. If you really need to use that distribution, you have two options: either remove negative values (not recommended) or shift the data.

```{r}
fitdist(x[x >= 0], "exp")
fitdist(x - min(x), "exp")
```

## Can I fit a finite-support distribution when data is outside that support?
The answer is no: you cannot fit a distribution with finite support (say the beta distribution) when data is outside $[0,1]$.

```{r}
set.seed(1234)
x <- rnorm(100, mean = 0.5, sd = 0.25)
(try(fitdist(x, "beta")))
```

It is irrelevant to do such a fit. If you really need to use that distribution, you have two ways to tackle this issue: either remove impossible values (not recommended) or shift/scale the data.

```{r}
fitdist(x[x > 0 & x < 1], "beta")
fitdist((x - min(x)*1.01) / (max(x) * 1.01 - min(x) * 1.01), "beta")
```

## Can I fit truncated distributions?

The answer is yes: but the fitting procedure must be carried out carefully. Let $X$ be the original untruncated random variable. The truncated variable is the conditional random variable $Y = X ~\vert~ l< X <u$ with $l<u$ the truncation bounds. The cdf of $Y$ is $F_Y(y)=\frac{F_X(y) - F_X(l)}{F_X(u)-F_X(l)}$ for $y\in[l,u]$, with density (w.r.t. the Lebesgue measure)
$$
f_Y(y) = \left\{\begin{array}{ll}
\frac{f_X(y)}{F_X(u)-F_X(l)} & \text{if } l < y < u \\
0 & \text{otherwise.}
\end{array}\right.
$$
The code below illustrates this with the truncated exponential distribution, fitting it either with the truncation bounds fixed at the sample extremes or fixed at their known values.

```{r}
dtexp <- function(x, rate, low, upp)
{
  PU <- pexp(upp, rate=rate)
  PL <- pexp(low, rate=rate)
  dexp(x, rate) / (PU-PL) * (x >= low) * (x <= upp)
}
ptexp <- function(q, rate, low, upp)
{
  PU <- pexp(upp, rate=rate)
  PL <- pexp(low, rate=rate)
  (pexp(q, rate)-PL) / (PU-PL) * (q >= low) * (q <= upp) + 1 * (q > upp)
}
n <- 200
x <- rexp(n); x <- x[x > .5 & x < 3]
f1 <- fitdist(x, "texp", method="mle", start=list(rate=3), fix.arg=list(low=min(x), upp=max(x)))
f2 <- fitdist(x, "texp", method="mle", start=list(rate=3), fix.arg=list(low=.5, upp=3))
gofstat(list(f1, f2))
cdfcomp(list(f1, f2), do.points = FALSE, xlim=c(0, 3.5))
```

## Can I fit truncated inflated distributions?

The answer is yes: but the fitting procedure must be carried out carefully. Let $X$ be the original untruncated random variable. The truncated variable is $Y = \max(\min(X, u), l)$ with $l<u$ the truncation bounds. The cdf of $Y$ is $F_Y(y) = F_X(y) 1_{u>y>l} + 1_{y>u}$. There is no density (w.r.t. the Lebesgue measure) since there are two probability masses $P(Y=l)= P(X\leq l)>0$ and $P(Y=u)=P(X>u)>0$. However, the density function with respect to the measure $m(x)= \delta_l(x)+\delta_u(x)+\lambda(x)$ is
$$
f_Y(y) = \left\{\begin{array}{ll}
F_X(l) & \text{if } y=l \\
f_X(y) & \text{if } l<y<u \\
1-F_X(u) & \text{if } y=u.
\end{array}\right.
$$
Given i.i.d. observations $(y_i)_i$, the likelihood $L(l, \theta, u)$ is zero when $l>\min_i y_i$ or $u<\max_i y_i$, increasing with respect to $l$ in $]-\infty, \min_i y_i]$ and decreasing with respect to $u$ in $[\max_i y_i,+\infty[$. So the maximum of $L$ is reached at $l=\min_i y_i$ and $u=\max_i y_i$.

The MLE of $\theta$ is then obtained by maximizing the log-likelihood $\log(L(l, \theta, u))$ with $u=\max_i Y_i$ and $l=\min_i Y_i$. Let us illustrate truncated inflated distributions with the exponential distribution. The log-likelihood is particularly badly shaped.

```{r, message=FALSE, fig.height=4, fig.width=8}
dtiexp <- function(x, rate, low, upp)
{
  PU <- pexp(upp, rate=rate, lower.tail = FALSE)
  PL <- pexp(low, rate=rate)
  dexp(x, rate) * (x >= low) * (x <= upp) + PL * (x == low) + PU * (x == upp)
}
ptiexp <- function(q, rate, low, upp)
  pexp(q, rate) * (q >= low) * (q <= upp) + 1 * (q > upp)
n <- 100; x <- pmax(pmin(rexp(n), 3), .5)
# the loglikelihood has a discontinuous point at the solution
par(mar=c(4,4,2,1), mfrow=1:2)
llcurve(x, "tiexp", plot.arg="low", fix.arg = list(rate=2, upp=5), min.arg=0, max.arg=.5, lseq=200)
llcurve(x, "tiexp", plot.arg="upp", fix.arg = list(rate=2, low=0), min.arg=3, max.arg=4, lseq=200)
```

The first method directly maximizes the log-likelihood $L(l, \theta, u)$; the second method maximizes the log-likelihood $L(\theta)$ assuming that $u=\hat u$ and $l=\hat l$ are known. Inside $[0.5,3]$, the CDF is correctly estimated by both methods, but the first method does not succeed in estimating the true value of the bounds $l,u$.
```{r, fig.height=4, fig.width=6}
(f1 <- fitdist(x, "tiexp", method="mle", start=list(rate=3, low=0, upp=20)))
(f2 <- fitdist(x, "tiexp", method="mle", start=list(rate=3), fix.arg=list(low=min(x), upp=max(x))))
gofstat(list(f1, f2))
cdfcomp(list(f1, f2), do.points = FALSE, addlegend=FALSE, xlim=c(0, 3.5))
curve(ptiexp(x, 1, .5, 3), add=TRUE, col="blue", lty=3)
legend("bottomright", lty=1:3, col=c("red", "green", "blue", "black"),
       leg=c("full MLE", "MLE fixed arg", "true CDF", "emp. CDF"))
```

## Can I fit a uniform distribution?

The uniform distribution $\mathcal U(a,b)$ has only support parameters, since its density $f_U(u) = \frac{1}{b-a}1_{[a,b]}(u)$ does not have a scale or a shape parameter. For this distribution, we should not maximize the log-likelihood but only the likelihood. Let $(x_i)_i$ be i.i.d. observations from the $\mathcal U(a,b)$ distribution. The likelihood is
$$
L(a,b) = \prod_{i=1}^n \frac{1}{b-a} 1_{[a,b]}(x_i)
= 1_{a\leq x_i \leq b, i=1,\dots,n} \frac{1}{(b-a)^n}
= 1_{a\leq \min_i x_i} 1_{\max_i x_i \leq b} \frac{1}{(b-a)^n}.
$$
Hence $a\mapsto L(a,b)$, for any fixed $b\in]\max_i x_i, +\infty[$, is increasing on $]-\infty, \min_i x_i]$; similarly $b\mapsto L(a,b)$ is decreasing for any fixed $a$. This leads to $\min_i x_i$ and $\max_i x_i$ being the MLE of the uniform distribution.

We should notice that the likelihood function $L$ is defined on $\mathbb R^2$, yet it vanishes outside $S=]-\infty, \min_i x_i]\times[\max_i x_i, +\infty[$. Hence, the log-likelihood is undefined outside $S$, which is an issue when maximizing the log-likelihood. For these reasons, `fitdist(data, dist="unif", method="mle")` uses the explicit form of the MLE for this distribution. Here is an example below.

```{r, fig.height=4, fig.width=6}
trueval <- c("min"=3, "max"=5)
x <- runif(n=500, trueval[1], trueval[2])

f1 <- fitdist(x, "unif")
delta <- .01
llsurface(x, "unif", plot.arg = c("min", "max"), min.arg=c(min(x)-2*delta, max(x)-delta),
          max.arg=c(min(x)+delta, max(x)+2*delta), main="likelihood surface for uniform",
          loglik=FALSE)
abline(v=min(x), h=max(x), col="grey", lty=2)
points(f1$estimate[1], f1$estimate[2], pch="x", col="red")
points(trueval[1], trueval[2], pch="+", col="blue")
legend("bottomright", pch=c("+","x"), col=c("blue","red"), c("true", "fitted"))

delta <- .2
llsurface(x, "unif", plot.arg = c("min", "max"), min.arg=c(3-2*delta, 5-delta),
          max.arg=c(3+delta, 5+2*delta), main="log-likelihood surface for uniform")
abline(v=min(x), h=max(x), col="grey", lty=2)
points(f1$estimate[1], f1$estimate[2], pch="x", col="red")
points(trueval[1], trueval[2], pch="+", col="blue")
legend("bottomright", pch=c("+","x"), col=c("blue","red"), c("true", "fitted"))
```

Maximizing the log-likelihood is harder and can be done by defining a new density function. Appropriate starting values and parameter bounds must be supplied. Using the closed-form expression (as in `fitdist()`) or maximizing the log-likelihood (with `unif2`) leads to very similar results.

```{r}
dunif2 <- function(x, min, max) dunif(x, min, max)
punif2 <- function(q, min, max) punif(q, min, max)
f2 <- fitdist(x, "unif2", start=list(min=0, max=10), lower=c(-Inf, max(x)),
              upper=c(min(x), Inf))
print(c(logLik(f1), logLik(f2)), digits=7)
print(cbind(coef(f1), coef(f2)), digits=7)
```

## Can I fit a beta distribution with the same shape parameter?

Yes, you can wrap the density function of the beta distribution so that there is only one shape parameter. Here is an example of a concave density.
```{r}
x <- rbeta(1000, 3, 3)
dbeta2 <- function(x, shape, ...)
  dbeta(x, shape, shape, ...)
pbeta2 <- function(q, shape, ...)
  pbeta(q, shape, shape, ...)
fitdist(x, "beta2", start=list(shape=1/2))
```

Another example with a U-shaped density.

```{r}
x <- rbeta(1000, .3, .3)
fitdist(x, "beta2", start=list(shape=1/2), optim.method="L-BFGS-B", lower=1e-2)
```

## How to estimate support parameters? The case of the four-parameter beta

Let us consider the four-parameter beta distribution, also known as the PERT distribution, defined for $x\in [a,c]$ by the following density
$f_X(x) = (x-a)^{\alpha-1} (c-x)^{\beta-1}/C_N$ with $C_N$ a normalizing constant and $\alpha=1+d(b-a)/(c-a)$, $\beta=1+d(c-b)/(c-a)$. $a,c$ are support parameters, $b\in]a,c[$ is the mode and $d$ the shape parameter.

As for the uniform distribution, one can show that the MLE of $a$ and $c$ are respectively the sample minimum and maximum. The code below illustrates both the strategy using this partial closed-form solution with `fix.arg` and the full numerical search of the MLE. NB: on small sample sizes, the latter generally has better goodness-of-fit statistics; a small positive number is subtracted or added when fixing the support parameters $a$ and $c$ to the sample minimum and maximum.

```{r, message=FALSE, fig.height=4, fig.width=6}
require(mc2d)
x2 <- rpert(n=2e2, min=0, mode=1, max=2, shape=3/4)
eps <- sqrt(.Machine$double.eps)
f1 <- fitdist(x2, "pert", start=list(min=-1, mode=0, max=10, shape=1),
              lower=c(-Inf, -Inf, -Inf, 0), upper=c(Inf, Inf, Inf, Inf))
f2 <- fitdist(x2, "pert", start=list(mode=1, shape=1),
              fix.arg=list(min=min(x2)-eps, max=max(x2)+eps),
              lower=c(min(x2), 0), upper=c(max(x2), Inf))
gofstat(list(f1,f2))
cdfcomp(list(f1,f2))
print(cbind(coef(f1), c(f2$fix.arg["min"], coef(f2)["mode"], f2$fix.arg["max"], coef(f2)["shape"])), digits=7)
```

# Questions regarding goodness-of-fit tests and statistics

## Where can we find the results of goodness-of-fit tests ?

Results of goodness-of-fit tests are not printed but are given in the object returned by `gofstat()`, and you can access them as described in the example below. Nevertheless, p-values are not given for every test. For Anderson-Darling (ad), Cramer von Mises (cvm) and Kolmogorov (ks), only the decision (rejection of H0 or not) is given, when available (see FAQ 2.3 for more details).

```{r, fig.height=3, fig.width=6}
set.seed(1234)
x <- rgamma(n = 100, shape = 2, scale = 1)
# fit of the good distribution
fgamma <- fitdist(x, "gamma")
# fit of a bad distribution
fexp <- fitdist(x, "exp")
g <- gofstat(list(fgamma, fexp), fitnames = c("gamma", "exp"))
denscomp(list(fgamma, fexp), legendtext = c("gamma", "exp"))
# results of the tests
## chi square test (with corresponding table with theoretical and observed counts)
g$chisqpvalue
g$chisqtable
## Anderson-Darling test
g$adtest
## Cramer von Mises test
g$cvmtest
## Kolmogorov-Smirnov test
g$kstest
```

## Is it reasonable to use goodness-of-fit tests to validate the fit of a distribution ?

In the first versions of fitdistrplus, when they were available, the results of the GOF tests (AD, KS, CvM) were automatically printed. We decided to suppress this automatic printing when we realized that some users had difficulties interpreting the results of those tests and sometimes misused them.

Goodness-of-fit tests often appear as objective tools to decide whether a fitted distribution well describes a data set.
**But they are not!** It would not be reasonable at all to reject a distribution just because a goodness-of-fit test rejects it (see FAQ 2.2.1). Nor would it be reasonable to validate a distribution because goodness-of-fit tests do not reject it (see FAQ 2.2.2).

A fitted distribution should be evaluated using graphical methods (goodness-of-fit graphs, automatically provided in our package by plotting the result of the fit, i.e., the output of `fitdist()` or `fitdistcens()`, and the complementary graphs that help to compare different fits - see `?graphcomp`). We really think it is the most appropriate way to evaluate the adequacy of a fit and we are not the only ones to recommend it. You can find the same type of recommendations in reference books:

* **Probabilistic techniques in exposure assessment - a handbook dealing with variability and uncertainty in models and inputs** by A.C. Cullen and H.C. Frey.
* **Application of uncertainty analysis to ecological risks of pesticides** by W.J. Warren-Hicks and A. Hart.
* **Statistical inference** by G. Casella and R.L. Berger.
* **Loss models: from data to decision** by S.A. Klugman, H.H. Panjer and G.E. Willmot.

Moreover, the selection of a distribution should also be driven by knowledge of the underlying processes when available. For example, when a variable cannot be negative, one would be very cautious while fitting a normal distribution, which potentially gives negative values, even if the observed data of this variable seem well fitted by a normal distribution.

### Should I reject a distribution because a goodness-of-fit test rejects it ?

No, it would not be reasonable at all to reject a distribution just because a goodness-of-fit test rejects it, especially in the case of big samples. In real life, as soon as you have a sufficient amount of data, you will reject the fitted distribution. We know that a model cannot perfectly describe real data, and generally the true question is to find the best distribution among a pool of simple parametric distributions to describe the data, and so to compare different models (see FAQ 2.4 and 2.5 for corresponding questions).

To illustrate this point, let us comment on the example presented below. We drew two samples from the same Poisson distribution with a mean parameter equal to 100. In many applications, for this value of its parameter, the Poisson distribution would be considered to be well approximated by a normal distribution. Testing the fit (here using a Kolmogorov-Smirnov test) of the normal distribution on a sample of 100 observations would not reject the normal fit, while testing it on a sample of 10000 observations would reject it, even though both samples come from the same distribution.

```{r, fig.height=3, fig.width=6}
set.seed(1234)
x1 <- rpois(n = 100, lambda = 100)
f1 <- fitdist(x1, "norm")
g1 <- gofstat(f1)
g1$kstest

x2 <- rpois(n = 10000, lambda = 100)
f2 <- fitdist(x2, "norm")
g2 <- gofstat(f2)
g2$kstest

par(mfrow=1:2)
denscomp(f1, demp = TRUE, addlegend = FALSE, main = "small sample")
denscomp(f2, demp = TRUE, addlegend = FALSE, main = "big sample")
```

### Should I accept a distribution because goodness-of-fit tests do not reject it ?

No, it would not be reasonable at all to validate a distribution because goodness-of-fit tests do not reject it. Like all other hypothesis tests, goodness-of-fit tests lack statistical power when the sample size is small.
And the different goodness-of-fit tests are not equally sensitive to different types of deviation between empirical and fitted distributions. For example, the Kolmogorov-Smirnov test is sensitive when distributions differ in a global fashion near the centre of the distribution. The Anderson-Darling test will be more sensitive when distributions differ in their tails, and the Cramer von Mises test will be more sensitive when there are small but repetitive differences between the empirical and theoretical distribution functions.

The sensitivity of a chi square test will depend on the definition of classes, and even if we propose a default definition of classes when the user does not provide classes, this choice is not obvious and could impact the results of the test. This test is more appropriate when data are discrete, even if they are modelled by a continuous distribution, as in the following example. Two samples of respective sizes 500 and 50 are drawn from a Poisson distribution of mean parameter equal to 1 (not a sufficiently high value to consider that the Poisson distribution could be approximated by a normal one). Using a Kolmogorov-Smirnov test, the normal fit is rejected only for the bigger sample. It is not rejected for the smaller sample, even though the fit could be rejected after a simple visual confrontation of the distributions. In that particular case, the chi square test with classes defined by default would have rejected the normal fit for both samples.

```{r, fig.height=3, fig.width=6}
set.seed(1234)
x3 <- rpois(n = 500, lambda = 1)
f3 <- fitdist(x3, "norm")
g3 <- gofstat(f3)
g3$kstest

x4 <- rpois(n = 50, lambda = 1)
f4 <- fitdist(x4, "norm")
g4 <- gofstat(f4)
g4$kstest

par(mfrow=1:2)
denscomp(f3, addlegend = FALSE, main = "big sample")
denscomp(f4, addlegend = FALSE, main = "small sample")
```

```{r}
g3$chisqtable
g3$chisqpvalue
g4$chisqtable
g4$chisqpvalue
```

## Why are all goodness-of-fit tests not available for every distribution ?

The Chi-squared test is available for any distribution, but one must be conscious that its result depends on the definition of cells in which observed data are grouped, and a correct definition is not possible with too small a sample.

Concerning the Kolmogorov-Smirnov test, it is proposed for any continuous distribution, but with a critical value corresponding to the comparison of the empirical distribution to a fully specified distribution. As the distribution is not fully known for a fitted distribution, the result of this test is subject to caution, and there is no general asymptotic theory for the Kolmogorov-Smirnov statistic in the case of a fitted distribution. Nevertheless, one can use Monte Carlo methods to conduct Kolmogorov-Smirnov goodness-of-fit tests in cases when the sample is used to estimate model parameters. Such a method is implemented in the R package `KScorrect` for a variety of continuous distributions.

Such an asymptotic theory was proposed for quadratic statistics for some distributions (Anderson-Darling, Cramer von Mises). The reference book we used on this subject (**Tests based on edf statistics** by Stephens MA in **Goodness-of-fit techniques** by D'Agostino RB and Stephens MA) proposes critical values of those statistics for some classical distributions (exponential, gamma, Weibull, logistic, Cauchy, normal and lognormal). But the asymptotic theory about these statistics also depends on the way the parameters are estimated.
As the parameters were not estimated by maximum likelihood for the Cauchy, normal and lognormal distributions in the results reported by Stephens, we only propose the results of the Anderson-Darling and Cramer von Mises tests, based on those critical values, for the exponential, gamma, Weibull and logistic distributions. The user can refer to the cited books and use the proposed formulas to estimate the parameters of the Cauchy, normal and lognormal distributions, and apply the tests using the critical values given in the book. R packages `goftest` and `ADGofTest` could also be explored by users who would like to apply Anderson-Darling and Cramer von Mises tests to other distributions. But at this time we are not sure that the case where parameters are unknown (estimated by maximum likelihood) is tackled in those two packages.

Concerning the development of our package, rather than developing even more goodness-of-fit tests, we made the choice to develop graphical tools to help to appreciate the quality of a fit and to compare the fits of different distributions on a same data set (see FAQ 2.2 for argumentation).

## How can we use goodness-of-fit statistics to compare the fit of different distributions on a same data set ?

Goodness-of-fit statistics based on the empirical distribution function (Kolmogorov-Smirnov, Anderson-Darling and Cramer von Mises) may be used to measure a distance between the fitted distribution and the empirical distribution. So if one wants to compare the fit of various distributions on the same data set, the smaller those statistics, the better. The Kolmogorov-Smirnov statistic will be sensitive when distributions differ in a global fashion near the centre of the distribution, while the Anderson-Darling statistic will be more sensitive when distributions differ in their tails, and the Cramer von Mises statistic will be more sensitive when there are small but repetitive differences between the empirical and theoretical distribution functions.

But as mentioned in the main vignette of our package, the use of the Anderson-Darling statistic to compare the fit of different distributions is subject to caution due to the weighting of the quadratic distance between fitted and empirical distribution functions, which depends on the parametric distribution. Moreover, statistics based on the empirical distribution function do not penalize distributions with a greater number of parameters, and as those are generally more flexible, this could induce over-fitting.

Goodness-of-fit statistics based on information criteria (AIC, BIC) correspond to deviance penalized by the complexity of the model (the number of parameters of the distribution), and so the smaller, the better. As more generic statistics, they are not adapted to focus on a part of the fitted distribution, but they take into account the complexity of the distribution and thus could help to prevent overfitting.

## Can we use a test to compare the fit of two distributions on a same data set ?

In our package we did not implement such a test, but for **two nested distributions** (when one is a special case of the other one, e.g. exponential and gamma distributions) a likelihood ratio test can be easily implemented using the loglikelihood provided by `fitdist` or `fitdistcens`.
Denoting by $L$ the maximum likelihood obtained with the complete distribution and by $L_0$ the one obtained with the simplified distribution, when the sample size increases, $- 2 \ln(\frac{L_0}{L}) = 2 \ln(L) - 2 \ln(L_0)$ tends to a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters characterizing the **two nested distributions**. You will find below an example of such a test.

```{r}
set.seed(1234)
g <- rgamma(100, shape = 2, rate = 1)
(f <- fitdist(g, "gamma"))
(f0 <- fitdist(g, "exp"))
L <- logLik(f)
k <- length(f$estimate) # number of parameters of the complete distribution
L0 <- logLik(f0)
k0 <- length(f0$estimate) # number of parameters of the simplified distribution
(stat <- 2*L - 2*L0)
(critical_value <- qchisq(0.95, df = k - k0))
(rejected <- stat > critical_value)
```

Such a test can also be used for fits on censored data.

## Can we get goodness-of-fit statistics for a fit on censored data ?

Function `gofstat` is not yet proposed in our package for fits on censored data, but developing it is among our objectives for the future. Published works on goodness-of-fit statistics based on the empirical distribution function for censored data generally focused on data containing only one type of censoring (e.g. right censored data in survival data). Building such statistics in the general case, with data simultaneously containing right, left and interval censoring, remains tricky.

Nevertheless, it is possible, for any type of censored data, to use information criteria (AIC and BIC given in the summary of an object of class `fitdistcens`) to compare the fits of various distributions to a same data set.

# Questions regarding optimization procedures

## How to choose the optimization method?

If you want to perform optimization without bounds, `optim()` is used. You can try the derivative-free method Nelder-Mead and the Hessian-free method BFGS. If you want to perform optimization with bounds, only two methods are available without providing the gradient of the objective function: Nelder-Mead via `constrOptim()` and bounded BFGS via `optim()`. In both cases, see the help of `mledist()` and the vignette on optimization algorithms.

## The optimization algorithm stops with error code 100. What shall I do?

First, add traces by adding `control=list(trace=1, REPORT=1)`. Second, try to set bounds for parameters. Third, find better starting values (see FAQ 1.3).

## Why may a distribution with a `log` argument converge better?

Say we study the shifted lognormal distribution defined by the following density
$$
f(x) = \frac{1}{(x+\delta) \sigma \sqrt{2 \pi}} \exp\left(- \frac{(\ln (x+\delta)- \mu)^2}{2\sigma^2}\right)
$$
for $x>-\delta$, where $\mu$ is a location parameter, $\sigma$ a scale parameter and $\delta$ a boundary parameter. Let us fit this distribution on the dataset `y` by MLE. We define two functions for the densities, with and without a `log` argument.
```{r}
data(dataFAQlog1)
y <- dataFAQlog1
D <- 1-min(y)
f0 <- fitdist(y+D, "lnorm")
start <- list(mean=as.numeric(f0$estimate["meanlog"]),
              sigma=as.numeric(f0$estimate["sdlog"]),
              shift=D)
# works with BFGS, but not Nelder-Mead
f <- fitdist(y, "shiftlnorm", start=start, optim.method="BFGS")
summary(f)
```

If we don't use the `log` argument, the algorithm stalls.

```{r, error=FALSE}
f2 <- try(fitdist(y, "shiftlnorm_no", start=start, optim.method="BFGS"))
print(attr(f2, "condition"))
```

Indeed the algorithm stops because, at the following parameter values, the log-likelihood is infinite.

```{r}
sum(log(dshiftlnorm_no(y, 0.16383978, 0.01679231, 1.17586600 )))
log(prod(dshiftlnorm_no(y, 0.16383978, 0.01679231, 1.17586600 )))
sum(dshiftlnorm(y, 0.16383978, 0.01679231, 1.17586600, TRUE ))
```

There is something wrong in the computation. Only the R implementation using the `log` argument seems reliable. This happens because the C-level implementation of `dlnorm` takes care of the log value. In the file `../src/nmath/dlnorm.c` in the R sources, we find the C code for `dlnorm`

```{r, eval=FALSE, echo=TRUE}
double dlnorm(double x, double meanlog, double sdlog, int give_log)
{
    double y;

#ifdef IEEE_754
    if (ISNAN(x) || ISNAN(meanlog) || ISNAN(sdlog))
        return x + meanlog + sdlog;
#endif
    if(sdlog <= 0) {
        if(sdlog < 0) ML_ERR_return_NAN;
        // sdlog == 0 :
        return (log(x) == meanlog) ? ML_POSINF : R_D__0;
    }
    if(x <= 0) return R_D__0;

    y = (log(x) - meanlog) / sdlog;
    return (give_log ?
            -(M_LN_SQRT_2PI + 0.5 * y * y + log(x * sdlog)) :
            M_1_SQRT_2PI * exp(-0.5 * y * y) / (x * sdlog));
    /* M_1_SQRT_2PI = 1 / sqrt(2 * pi) */
}
```

In the last four lines with the logical condition `give_log?`, we see how the `log` argument is handled:

* when log=TRUE, we use $-(\log(\sqrt{2\pi}) + y^2/2+\log(x\sigma))$

```{r, eval=FALSE, echo=TRUE}
-(M_LN_SQRT_2PI + 0.5 * y * y + log(x * sdlog))
```

* when log=FALSE, we use $\exp(-y^2/2)/(\sqrt{2\pi}x\sigma)$ (and then the logarithm is taken outside `dlnorm`)

```{r, eval=FALSE, echo=TRUE}
M_1_SQRT_2PI * exp(-0.5 * y * y) / (x * sdlog))
```

Note that the constant $\log(\sqrt{2\pi})$ is pre-computed in the C macro `M_LN_SQRT_2PI`.

In order to sort out this problem, we use the function `constrOptim`, which wraps `optim` to take into account linear constraints. This also allows the use of optimization methods other than L-BFGS-B (limited-memory bounded BFGS) used in `optim`.

```{r}
f2 <- fitdist(y, "shiftlnorm", start=start, lower=c(-Inf, 0, -min(y)), optim.method="Nelder-Mead")
summary(f2)
print(cbind(BFGS=f$estimate, NelderMead=f2$estimate))
```

Another possible solution would be to perform all computations with higher-precision arithmetic, as implemented in the package `Rmpfr` using the MPFR library.

## What to do when there is a scaling issue?

Let us consider a dataset which has particularly small values.

```{r}
data(dataFAQscale1)
head(dataFAQscale1)
summary(dataFAQscale1)
```

The only way to sort this out is to multiply the dataset by a large value.

```{r}
for(i in 6:0) cat(10^i, try(mledist(dataFAQscale1*10^i, "cauchy")$estimate), "\n")
```

Let us consider a dataset which has particularly large values.

```{r}
data(dataFAQscale2)
head(dataFAQscale2)
summary(dataFAQscale2)
```

The only way to sort this out is to multiply the dataset by a small value.

```{r}
for(i in 0:5) cat(10^(-2*i), try(mledist(dataFAQscale2*10^(-2*i), "cauchy")$estimate), "\n")
```

## How do I set bounds on parameters when optimizing?
### Setting bounds for scale parameters

Consider the normal distribution $\mathcal{N}(\mu, \sigma^2)$ defined by the density
$$
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), x\in\mathbb{R},
$$
where $\mu$ is a location parameter such that $\mu\in\mathbb{R}$ and $\sigma^2$ is a scale parameter such that $\sigma^2>0$. The constraint $\sigma>0$ must therefore be respected when optimizing the log-likelihood, the squared differences or the GoF statistics. Setting a lower bound for the scale parameter is easy with `fitdist`: just use the `lower` argument.

```{r scalenormal, echo=TRUE, warning=FALSE}
set.seed(1234)
x <- rnorm(1000, 1, 2)
fitdist(x, "norm", lower=c(-Inf, 0))
```

### Setting bounds for shape parameters

Consider the Burr distribution $\mathcal B(a, b, s)$ defined by the density
$$
f(x) = \frac{a b (x/s)^b}{x [1 + (x/s)^b]^{a + 1}}, x>0,
$$
where $a,b$ are shape parameters such that $a,b>0$ and $s$ is a scale parameter such that $s>0$.

```{r shapeburr, echo=TRUE, warning=FALSE}
x <- rburr(1000, 1, 2, 3)
fitdist(x, "burr", lower=c(0, 0, 0), start=list(shape1 = 1, shape2 = 1, rate = 1))
```

### Setting bounds for probability parameters

Consider the geometric distribution $\mathcal G(p)$ defined by the probability mass function
$$
f(x) = p(1-p)^x, x\in\mathbb{N},
$$
where $p$ is a probability parameter such that $p\in[0,1]$.

```{r probgeom, echo=TRUE, warning=FALSE}
x <- rgeom(1000, 1/4)
fitdist(x, "geom", lower=0, upper=1)
```

### Setting bounds for boundary parameters

Consider the shifted exponential distribution $\mathcal E(\mu,\lambda)$ defined by the density function
$$
f(x) = \lambda \exp(-\lambda(x-\mu)), x>\mu,
$$
where $\lambda$ is a scale parameter such that $\lambda>0$ and $\mu$ is a boundary (or shift) parameter such that $\mu\in\mathbb{R}$. When optimizing the log-likelihood, the boundary constraint is
$$
\forall i=1,\dots,n,\ x_i>\mu \Rightarrow \mu < \min_{i=1,\dots,n} x_i,
$$
so that the sample minimum is an upper bound for the shift parameter $\mu$. Note that when optimizing the squared differences or the GoF statistics, this constraint may not be necessary. Let us do it in R.

```{r shiftexp, echo=TRUE, warning=FALSE}
dsexp <- function(x, rate, shift)
  dexp(x-shift, rate=rate)
psexp <- function(x, rate, shift)
  pexp(x-shift, rate=rate)
rsexp <- function(n, rate, shift)
  rexp(n, rate=rate)+shift
x <- rsexp(1000, 1/4, 1)
# the shift parameter is bounded above by the sample minimum
fitdist(x, "sexp", start=list(rate=1, shift=0), lower=c(0, -Inf), upper=c(Inf, min(x)))
```

### Setting linear inequality bounds

For some distributions, bounds between parameters are not independent. For instance, the normal inverse Gaussian distribution ($\mu, \delta, \alpha, \beta$ parametrization) has the following parameter constraints, which can be reformulated as a linear inequality:
$$
\left\{
\begin{array}{l}\alpha > 0\\ \delta >0\\ \alpha > |\beta|\end{array}
\right.
\Leftrightarrow
\underbrace{
\left(
\begin{matrix}
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 1 & -1 \\
0 & 0 & 1 & 1 \\
\end{matrix}
\right)
}_{ui}
\left(
\begin{matrix}
\mu\\ \delta\\ \alpha \\ \beta \\
\end{matrix}
\right)
\geq
\underbrace{
\left(
\begin{matrix}
0\\ 0\\ 0 \\ 0 \\
\end{matrix}
\right)}_{ci}.
$$
These constraints can be enforced via `constrOptim()` and the arguments `ci` and `ui`. Here is an example

```{r, message=FALSE}
library(GeneralizedHyperbolic)
myoptim <- function(fn, par, ui, ci, ...)
{
  res <- constrOptim(f=fn, theta=par, method="Nelder-Mead", ui=ui, ci=ci, ...)
  c(res, convergence=res$convergence, value=res$objective, par=res$minimum, hessian=res$hessian)
}
x <- rnig(1000, 3, 1/2, 1/2, 1/4)
ui <- rbind(c(0,1,0,0), c(0,0,1,0), c(0,0,1,-1), c(0,0,1,1))
ci <- c(0,0,0,0)
fitdist(x, "nig", custom.optim=myoptim, ui=ui, ci=ci, start=list(mu = 0, delta = 1, alpha = 1, beta = 0))
```

## How does quantile matching estimation work for discrete distributions?

Let us consider the geometric distribution with values in $\{0,1,2,3,\dots\}$. The probability mass function, the cumulative distribution function and the quantile function are
$$
P(X=x)= p (1-p)^{\lfloor x\rfloor},
F_X(x) = 1- (1-p)^{\lfloor x\rfloor},
F_X^{-1}(q) = \left\lfloor\frac{\log(1-q)}{\log(1-p)}\right\rfloor.
$$
Due to the integer part (floor function), both the distribution function and the quantile function are step functions.

```{r, fig.height=3, fig.width=6}
pgeom(0:3, prob=1/2)
qgeom(c(0.3, 0.6, 0.9), prob=1/2)
par(mar=c(4,4,2,1), mfrow=1:2)
curve(pgeom(x, prob=1/2), 0, 10, n=301, main="c.d.f.")
curve(qgeom(x, prob=1/2), 0, 1, n=301, main="q.f.")
```

Now we study the QME for the geometric distribution. Since we have only one parameter, we choose one probability, $p=1/2$. The theoretical median is the following integer
$$
F_X^{-1}(1/2) = \left\lfloor\frac{\log(1/2)}{\log(1-p)}\right\rfloor.
$$
Note that the theoretical median for a discrete distribution is an integer. Empirically, the median may not be an integer. Indeed, for an even-length dataset, the empirical median is
$$
q_{n,1/2} = \frac{x_{n/2}^\star + x_{n/2+1}^\star}{2},
$$
where $x_{1}^\star<\dots<x_{n}^\star$ denotes the ordered sample.

## Is it possible to add the names of the observations in a goodness-of-fit plot, e.g. the names of the species in the plot of the Species Sensitivity Distribution (SSD) classically used in ecotoxicology ?

An argument named `name.points` can be used in the functions `cdfcomp` or `CIcdfplot` to pass a label vector for observed points, so as to add the name of each point on its left. This option is available only for ECDF goodness-of-fit plots and only for non-censored data. It can be used as below, for example, to name the species in the classical plot of the Species Sensitivity Distributions (SSD) in ecotoxicology.

```{r, fig.height= 6, fig.width= 6, warning = FALSE}
data(endosulfan)
ATV <- subset(endosulfan, group == "NonArthroInvert")$ATV
taxaATV <- subset(endosulfan, group == "NonArthroInvert")$taxa
f <- fitdist(ATV, "lnorm")
cdfcomp(f, xlogscale = TRUE, main = "Species Sensitivity Distribution",
        xlim = c(1, 100000), name.points = taxaATV, addlegend = FALSE, plotstyle = "ggplot")
```

# Questions regarding (left, right and/or interval) censored data

## How to code censored data in `fitdistrplus` ?

Censored data must be represented in the package by a dataframe of two columns respectively named left and right, describing each observed value as an interval. The left column contains either `NA` for left-censored observations, the left bound of the interval for interval-censored observations, or the observed value for non-censored observations. The right column contains either `NA` for right-censored observations, the right bound of the interval for interval-censored observations, or the observed value for non-censored observations. This type of representation corresponds to the coding name `"interval2"` in the function `Surv` of the package `survival`.
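To make this coding explicit, here is a small hypothetical data frame (values chosen only for illustration) with one observation of each type:

```{r, eval=FALSE}
# one hypothetical observation of each type:
# exact value 2.3, right-censored at 5, left-censored at 0.8,
# interval-censored between 1 and 1.5
dexample <- data.frame(left  = c(2.3,  5,  NA, 1),
                       right = c(2.3, NA, 0.8, 1.5))
```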
There is no other way to represent censored data in `fitdistrplus`, but the function `Surv2fitdistcens()` can be used to help you format data for use in `fitdistcens()` from one of the formats used in the `survival` package (see the help page of `Surv2fitdistcens()`). A toy example is given below.

```{r}
dtoy <- data.frame(left = c(NA, 2, 4, 6, 9.7, 10), right = c(1, 3, 7, 8, 9.7, NA))
dtoy
```

## How do I prepare the input of `fitdistcens()` with `Surv2fitdistcens()`?

Let us consider a classical right-censored dataset for human life: twenty values randomly chosen from the `canlifins` dataset of the `CASdatasets` package. We refer to the help of `Surv2fitdistcens()` for other censoring types.

```{r}
exitage <- c(81.1,78.9,72.6,67.9,60.1,78.3,83.4,66.9,74.8,80.5,75.6,67.1,
             75.3,82.8,70.1,85.4,74,70,71.6,76.5)
death <- c(0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0)
```

When performing survival analysis, it is very common to use the `Surv()` function from the package `survival` to handle different types of censoring. In order to ease the use of `fitdistcens()`, a dedicated function `Surv2fitdistcens()` has been implemented with arguments similar to the ones of `Surv()`.

```{r}
svdata <- Surv2fitdistcens(exitage, event=death)
```

Let us now fit two simple distributions.

```{r, fig.height= 4, fig.width= 6}
flnormc <- fitdistcens(svdata, "lnorm")
fweic <- fitdistcens(svdata, "weibull")
cdfcompcens(list(fweic, flnormc), xlim=range(exitage), xlegend = "topleft")
```

## How to represent an empirical distribution from censored data ?

The representation of an empirical distribution from censored data is not a trivial problem. One can simply represent each observation as an interval at a y-value defined by the rank of the observation, as done below using the function `plotdistcens`. This representation can be interesting to visualize the raw data, but it remains difficult to correctly order the observations in any case (see the example below on the right using the data `smokedfish`).

```{r, fig.height= 4, fig.width= 8}
par(mfrow = c(1,2), mar = c(3, 4, 3, 0.5))
plotdistcens(dtoy, NPMLE = FALSE)
data(smokedfish)
dsmo <- log10(smokedfish)
plotdistcens(dsmo, NPMLE = FALSE)
```

Many authors worked on the development of algorithms for **non parametric maximum likelihood estimation (NPMLE)** of the **empirical cumulative distribution function (ECDF)** from interval-censored data (including left- and right-censored data, which can be considered as interval-censored data with one bound at infinity). In old versions of `fitdistrplus` we used the Turnbull algorithm, calling functions of the package `survival`. Even if this Turnbull algorithm is still available in the package, the default plot now uses the function `npsurv` of the package `npsurv`. This package provides more performant algorithms developed by Yong Wang (see the references cited in the help page of `plotdistcens`). Due to lack of maintenance of this package, we were forced to rewrite its main functions in our package, using another optimization function. The same ECDF plot was also implemented in our package using the Turnbull algorithm of `survival` (see below).
```{r, fig.height= 6, fig.width= 6}
par(mfrow = c(2, 2), mar = c(3, 4, 3, 0.5))
# Turnbull algorithm with representation of middle points of equivalence classes
plotdistcens(dsmo, NPMLE.method = "Turnbull.middlepoints", xlim = c(-1.8, 2.4))
# Turnbull algorithm with representation of equivalence classes as intervals
plotdistcens(dsmo, NPMLE.method = "Turnbull.intervals")
# Wang algorithm with representation of equivalence classes as intervals
plotdistcens(dsmo, NPMLE.method = "Wang")
```

As you can see in the above example, the new implementation of NPMLE provides a different type of plot for the ECDF, representing by filled rectangles the zones of non-uniqueness of the NPMLE ECDF. Indeed, an NPMLE algorithm generally proceeds in two steps.

1. The first step aims at identifying **equivalence classes** (also named in the literature **Turnbull intervals** or **maximal intersection intervals** or **innermost intervals** or **maximal cliques** of the data). Equivalence classes are points/intervals on which the NPMLE ECDF may change. Equivalence classes have been shown to correspond to regions between a left bound of an interval (named L in the following plot of the previous toy example) immediately followed by a right bound of an interval (named R in the following plot). An equivalence class may be of null length (for example at each non-censored value).

2. The second step aims at assigning a **probability mass** to each equivalence class, which may be zero on some classes. The NPMLE is unique only up to these equivalence classes and this **non-uniqueness** of the NPMLE ECDF is represented by **filled rectangles**.

Various NPMLE algorithms are implemented in the packages **Icens**, **interval** and **npsurv**. They are more or less performant, and not all of them can handle data other than survival data, especially data with left-censored observations.

```{r, echo = FALSE, fig.height= 4, fig.width= 8}
d <- data.frame(left = c(NA, 2, 4, 6, 9.5, 10), right = c(1, 3, 7, 8, 9.5, NA))
addbounds <- function(d)
{
  xbounds <- c(d$left, d$right)
  xboundsnotNA <- xbounds[!is.na(xbounds)]
  abline(v = xboundsnotNA, col = "grey")
}
addLR <- function(d)
{
  Lbounds <- d$left[!is.na(d$left)]
  Rbounds <- d$right[!is.na(d$right)]
  range <- range(c(Lbounds,Rbounds))
  eps <- (range[2] - range[1]) * 0.01
  text(x = Lbounds-eps, y = 0.05, labels = "L", col = "red", cex = 0.75)
  text(x = Rbounds+eps, y = 0.05, labels = "R", col = "red", cex = 0.75)
}
addeq <- function(deq)
{
  left <- deq$left
  left[is.na(left)] <- -100
  right <- deq$right
  right[is.na(right)] <- 100
  rect(left, -2, right, 2, density = 10)
}
par(mfrow = c(2,1), mar = c(2, 4, 3, 0.5))
# First step
plotdistcens(d, NPMLE = FALSE, lwd = 2, col = "blue", main = "Step 1 : identification of equivalence classes")
addbounds(d)
addLR(d)
deq <- data.frame(left = c(NA, 2, 6, 9.5, 10), right = c(1, 3, 7,9.5, NA))
addeq(deq)
# Second step
plotdistcens(d, lwd = 2, main = "Step 2 : estimation of mass probabilities")
```

## How to assess the goodness-of-fit of a distribution fitted on censored data ?

The only available method in `fitdistrplus` to fit distributions on censored data is maximum likelihood estimation (MLE). Once a distribution is fitted using `fitdistcens`, AIC and BIC values can be found in the `summary` of the object of class `fitdistcens` returned by the function. Those values can be used to compare the fit of various distributions on a same dataset.
Function `gofstat` is not yet proposed in our package for fits on censored data, but we plan to develop it in the future with the calculation of other goodness-of-fit statistics for censored data.

```{r}
fnorm <- fitdistcens(dsmo,"norm")
flogis <- fitdistcens(dsmo,"logis")
# comparison of AIC values
summary(fnorm)$aic
summary(flogis)$aic
```

Considering goodness-of-fit plots, the generic `plot` function of an object of class `fitdistcens` provides three plots: one in CDF using the NPMLE ECDF plot (by default using the Wang representation, see the previous part for details), a Q-Q plot and a P-P plot simply derived from the Wang plot of the ECDF, with filled rectangles indicating non-uniqueness of the NPMLE ECDF.

```{r, fig.height= 6, fig.width= 6}
par(mar = c(2, 4, 3, 0.5))
plot(fnorm)
```

Functions `cdfcompcens()`, `qqcompcens()` and `ppcompcens()` can be used to individualize and personalize CDF, Q-Q and P-P goodness-of-fit plots and/or to compare the fit of various distributions on a same dataset.

```{r, fig.height= 4, fig.width= 4}
cdfcompcens(list(fnorm, flogis), fitlty = 1)
qqcompcens(list(fnorm, flogis))
ppcompcens(list(fnorm, flogis))
```

Considering Q-Q plots and P-P plots, it may be easier to compare various fits by splitting the plots as below, which is done automatically using `plotstyle = "ggplot"` in `qqcompcens()` and `ppcompcens()`, but can also be done manually with `plotstyle = "graphics"`.

```{r, fig.height= 4, fig.width= 8}
qqcompcens(list(fnorm, flogis), lwd = 2, plotstyle = "ggplot",
           fitcol = c("red", "green"), fillrect = c("pink", "lightgreen"),
           legendtext = c("normal distribution", "logistic distribution"))
```
fitdistrplus/vignettes/fitdistrplus_vignette.Rmd0000644000176200001440000016775514411504334022147 0ustar liggesusers---
title: Overview of the fitdistrplus package
author: Marie Laure Delignette Muller, Christophe Dutang
date: '`r Sys.Date()`'
output:
  bookdown::html_document2:
    base_format: rmarkdown::html_vignette
    fig_caption: yes
    toc: true
    number_sections: yes
bibliography: fitdistrplus.bib
link-citations: true
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Overview of the fitdistrplus package}
  %!\VignetteEncoding{UTF-8}
  \usepackage[utf8]{inputenc}
  \usepackage{amsmath}
pkgdown:
  as_is: true
---

```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
options(digits = 4)
set.seed(1234)
```
Based on the article *fitdistrplus: an R Package for Fitting Distributions* (Marie Laure Delignette-Muller and Christophe Dutang, 2015, Journal of Statistical Software, DOI 10.18637/jss.v064.i04)
***Keywords**: probability distribution fitting, bootstrap, censored data, maximum likelihood, moment matching, quantile matching, maximum goodness-of-fit, distributions, R*
# Introduction {#Introduction}

Fitting distributions to data is a very common task in statistics and consists in choosing a probability distribution modelling the random variable, as well as finding parameter estimates for that distribution. This requires judgment and expertise and generally needs an iterative process of distribution choice, parameter estimation, and quality of fit assessment. In the R [@R13] package **MASS** [@MASS], maximum likelihood estimation is available via the `fitdistr` function; other steps of the fitting process can be done using other R functions [@Ricci05]. In this paper, we present the R package **fitdistrplus** [@fitdistrplus] implementing several methods for fitting univariate parametric distributions. A first objective in developing this package was to provide R users with a set of functions dedicated to helping with this overall process.

The `fitdistr` function estimates distribution parameters by maximizing the likelihood function using the `optim` function. No distinction between parameters with different roles (e.g., main parameter and nuisance parameter) is made, as this paper focuses on parameter estimation from a general point-of-view. In some cases, other estimation methods could be preferred, such as maximum goodness-of-fit estimation (also called minimum distance estimation), as proposed in the R package **actuar** with three different goodness-of-fit distances [@actuarJSS]. While developing the **fitdistrplus** package, a second objective was to consider various estimation methods in addition to maximum likelihood estimation (MLE). Functions were developed to enable moment matching estimation (MME), quantile matching estimation (QME), and maximum goodness-of-fit estimation (MGE) using eight different distances. Moreover, the **fitdistrplus** package offers the possibility to specify a user-supplied function for optimization, useful in cases where classical optimization techniques, not included in `optim`, are more adequate.

In applied statistics, it is frequent to have to fit distributions to censored data [@kleinmoeschberger03; @helsel05; @busschaertetal10; @lehaetal11; @commeauetal12]. The **MASS** `fitdistr` function does not enable maximum likelihood estimation with this type of data. Some packages can be used to work with censored data, especially survival data [@survival; @hiranoetal94; @jordan05], but those packages generally focus on specific models, enabling the fit of a restricted set of distributions. A third objective is thus to provide R users a function to estimate univariate distribution parameters from right-, left- and interval-censored data.

Few packages on CRAN provide estimation procedures for any user-supplied parametric distribution and support different types of data. The **distrMod** package [@distrModJSS] provides an object-oriented (S4) implementation of probability models and includes distribution fitting procedures for a given minimization criterion. This criterion is a user-supplied function which is sufficiently flexible to handle censored data, yet not in a trivial way, see Example M4 of the **distrMod** vignette. The fitting functions `MLEstimator` and `MDEstimator` return an S4 class for which a coercion method to class `mle` is provided so that the respective functionalities (e.g., `confint` and `logLik`) from package **stats4** are available, too. In **fitdistrplus**, we chose to use the standard S3 class system given its familiarity to most R users.
When designing the **fitdistrplus** package, we did not forget to implement generic functions also available for S3 classes. Finally, various other packages provide functions to estimate the mode, the moments or the L-moments of a distribution, see the reference manuals of **modeest**, **lmomco** and **Lmoments** packages. The package is available from the Comprehensive R Archive Network at \url{https://cran.r-project.org/package=fitdistrplus}. The paper is organized as follows: Section \@ref(fitnoncenscont) presents tools for fitting continuous distributions to classic non-censored data. Section \@ref(advtopic) deals with other estimation methods and other types of data, before Section \@ref(ccl) concludes.
# Fitting distributions to continuous non-censored data {#fitnoncenscont} ## Choice of candidate distributions {#Choice} For illustrating the use of various functions of the **fitdistrplus** package with continuous non-censored data, we will first use a data set named `groundbeef` which is included in our package. This data set contains pointwise values of serving sizes in grams, collected in a French survey, for ground beef patties consumed by children under 5 years old. It was used in a quantitative risk assessment published by @Delignette08. ```{r datgroundbeef, echo=TRUE} library("fitdistrplus") data("groundbeef") str(groundbeef) ``` Before fitting one or more distributions to a data set, it is generally necessary to choose good candidates among a predefined set of distributions. This choice may be guided by the knowledge of stochastic processes governing the modeled variable, or, in the absence of knowledge regarding the underlying process, by the observation of its empirical distribution. To help the user in this choice, we developed functions to plot and characterize the empirical distribution. First of all, it is common to start with plots of the empirical distribution function and the histogram (or density plot), which can be obtained with the `plotdist` function of the **fitdistrplus** package. This function provides two plots (see Figure \@ref(fig:figgroundbeef)): the left-hand plot is by default the histogram on a density scale (or density plot of both, according to values of arguments `histo` and `demp`) and the right-hand plot the empirical cumulative distribution function (CDF). ```{r figgroundbeef, fig.align='center', fig.width=7, fig.height=4, fig.cap="Histogram and CDF plots of an empirical distribution for a continuous variable (serving size from the `groundbeef` data set) as provided by the `plotdist` function."} plotdist(groundbeef$serving, histo = TRUE, demp = TRUE) ```
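As a quick variant (relying only on the `histo` and `demp` arguments mentioned above), the empirical density curve alone can be displayed instead of the histogram:

```{r, eval=FALSE}
# density plot only, without the histogram
plotdist(groundbeef$serving, histo = FALSE, demp = TRUE)
```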
In addition to empirical plots, descriptive statistics may help to choose candidates to describe a distribution among a set of parametric distributions. Especially the skewness and kurtosis, linked to the third and fourth moments, are useful for this purpose. A non-zero skewness reveals a lack of symmetry of the empirical distribution, while the kurtosis value quantifies the weight of tails in comparison to the normal distribution for which the kurtosis equals 3. The skewness and kurtosis and their corresponding unbiased estimator [@casellaberger02] from a sample $(X_i)_i \stackrel{\text{i.i.d.}}{\sim} X$ with observations $(x_i)_i$ are given by: \begin{equation} sk(X) = \frac{E[(X-E(X))^3]}{Var(X)^{\frac{3}{2}}}~,~\widehat{sk}=\frac{\sqrt{n(n-1)}}{n-2}\times\frac{m_{3}}{m_{2}^{\frac{3}{2}}},(\#eq:eq1) \end{equation} \begin{equation} kr(X) = \frac{E[(X-E(X))^4]}{Var(X)^{2}}~,~\widehat{kr}=\frac{n-1}{(n-2)(n-3)}((n+1) \times \frac{m_{4}}{m_{2}^{2}}-3(n-1)) + 3,(\#eq:eq2) \end{equation} where $m_{2}$, $m_{3}$, $m_{4}$ denote empirical moments defined by $m_{k}=\frac{1}{n}\sum_{i=1}^n(x_{i}-\overline{x})^{k}$, with $x_{i}$ the $n$ observations of variable $x$ and $\overline{x}$ their mean value. The `descdist` function provides classical descriptive statistics (minimum, maximum, median, mean, standard deviation), skewness and kurtosis. By default, unbiased estimations of the three last statistics are provided. Nevertheless, the argument `method` can be changed from `"unbiased"` (default) to `"sample"` to obtain them without correction for bias. A skewness-kurtosis plot such as the one proposed by @Cullen99 is provided by the `descdist` function for the empirical distribution (see Figure \@ref(fig:descgroundbeefplot) for the `groundbeef` data set). On this plot, values for common distributions are displayed in order to help the choice of distributions to fit to data. For some distributions (normal, uniform, logistic, exponential), there is only one possible value for the skewness and the kurtosis. Thus, the distribution is represented by a single point on the plot. For other distributions, areas of possible values are represented, consisting in lines (as for gamma and lognormal distributions), or larger areas (as for beta distribution). Skewness and kurtosis are known not to be robust. In order to take into account the uncertainty of the estimated values of kurtosis and skewness from data, a nonparametric bootstrap procedure [@efrontibshirani94] can be performed by using the argument `boot`. Values of skewness and kurtosis are computed on bootstrap samples (constructed by random sampling with replacement from the original data set) and reported on the skewness-kurtosis plot. Nevertheless, the user needs to know that skewness and kurtosis, like all higher moments, have a very high variance. This is a problem which cannot be completely solved by the use of bootstrap. The skewness-kurtosis plot should then be regarded as indicative only. The properties of the random variable should be considered, notably its expected value and its range, as a complement to the use of the `plotdist` and `descdist` functions. Below is a call to the `descdist` function to describe the distribution of the serving size from the `groundbeef` data set and to draw the corresponding skewness-kurtosis plot (see Figure \@ref(fig:descgroundbeefplot)). 
Below is a call to the `descdist` function to describe the distribution of the serving size from the `groundbeef` data set and to draw the corresponding skewness-kurtosis plot (see Figure \@ref(fig:descgroundbeefplot)).

```{r descgroundbeefplot, fig.align='center', fig.width=5, fig.height=5, fig.cap="Skewness-kurtosis plot for a continuous variable (serving size from the `groundbeef` data set) as provided by the `descdist` function."}
descdist(groundbeef$serving, boot = 1000)
```

Looking at the results on this example, with a positive skewness and a kurtosis not far from 3, the fit of three common right-skewed distributions could be considered: the Weibull, gamma and lognormal distributions.

## Fit of distributions by maximum likelihood estimation {#FIT}

Once selected, one or more parametric distributions $f(.\vert \theta)$ (with parameter $\theta\in\mathbb{R}^d$) may be fitted to the data set, one at a time, using the `fitdist` function. Under the i.i.d. sample assumption, distribution parameters $\theta$ are by default estimated by maximizing the likelihood function defined as:

\begin{equation}
L(\theta)=\prod_{i=1}^n f(x_{i}\vert \theta)(\#eq:eq3)
\end{equation}

with $x_{i}$ the $n$ observations of variable $X$ and $f(.\vert \theta)$ the density function of the parametric distribution. The other proposed estimation methods are described in Section \@ref(Alternatives).

The `fitdist` function returns an S3 object of class `fitdist` for which `print`, `summary` and `plot` functions are provided. The fit of a distribution using `fitdist` assumes that the corresponding `d`, `p`, `q` functions (standing respectively for the density, the distribution and the quantile functions) are defined. Classical distributions are already defined in that way in the **stats** package, e.g., `dnorm`, `pnorm` and `qnorm` for the normal distribution (see `?Distributions`). Others may be found in various packages (see the CRAN task view: Probability Distributions at \url{https://cran.r-project.org/web/views/Distributions.html}). Distributions not found in any package must be implemented by the user as `d`, `p`, `q` functions. In the call to `fitdist`, a distribution has to be specified via the argument `dist`, either by the character string corresponding to its common root name used in the names of the `d`, `p`, `q` functions (e.g., `"norm"` for the normal distribution) or by the density function itself, from which the root name is extracted (e.g., `dnorm` for the normal distribution). Numerical results returned by the `fitdist` function are (1) the parameter estimates, (2) the estimated standard errors (computed from the estimate of the Hessian matrix at the maximum likelihood solution), (3) the loglikelihood, (4) the Akaike and Bayesian information criteria (the so-called AIC and BIC), and (5) the correlation matrix between parameter estimates. Below is a call to the `fitdist` function to fit a Weibull distribution to the serving size from the `groundbeef` data set.

```{r fitgroundbeef.weibull}
fw <- fitdist(groundbeef$serving, "weibull")
summary(fw)
```

The plot of an object of class `fitdist` provides four classical goodness-of-fit plots [@Cullen99] presented in Figure \@ref(fig:groundbeefcomp):

- a density plot representing the density function of the fitted distribution along with the histogram of the empirical distribution,
- a CDF plot of both the empirical distribution and the fitted distribution,
- a Q-Q plot representing the empirical quantiles (y-axis) against the theoretical quantiles (x-axis),
- a P-P plot representing the empirical distribution function evaluated at each data point (y-axis) against the fitted distribution function (x-axis).
For CDF, Q-Q and P-P plots, the probability plotting position is defined by default using Hazen's rule, with probability points of the empirical distribution calculated as `(1:n - 0.5)/n`, as recommended by @Blom. This plotting position can easily be changed (see the reference manual for details [@fitdistrplus]).

Unlike the generic `plot` function, the `denscomp`, `cdfcomp`, `qqcomp` and `ppcomp` functions make it possible to draw each of these four plots separately, in order to compare the empirical distribution with multiple parametric distributions fitted to the same data set. These functions must be called with a first argument corresponding to a list of objects of class `fitdist`, and optionally further arguments to customize the plot (see the reference manual for lists of arguments that may be specific to each plot [@fitdistrplus]). In the following example, we compare the fits of Weibull, lognormal and gamma distributions to the `groundbeef` data set (Figure \@ref(fig:groundbeefcomp)).

```{r groundbeefcomp, fig.align='center', fig.width=7, fig.height=7, fig.cap="Four Goodness-of-fit plots for various distributions fitted to continuous data (Weibull, gamma and lognormal distributions fitted to serving sizes from the `groundbeef` data set) as provided by functions `denscomp`, `qqcomp`, `cdfcomp` and `ppcomp`."}
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
fg <- fitdist(groundbeef$serving, "gamma")
fln <- fitdist(groundbeef$serving, "lnorm")
plot.legend <- c("Weibull", "lognormal", "gamma")
denscomp(list(fw, fln, fg), legendtext = plot.legend)
qqcomp(list(fw, fln, fg), legendtext = plot.legend)
cdfcomp(list(fw, fln, fg), legendtext = plot.legend)
ppcomp(list(fw, fln, fg), legendtext = plot.legend)
```
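If another plotting position is preferred, the comparison functions expose it through the `use.ppoints` and `a.ppoints` arguments (our reading of the reference manual; Hazen's rule corresponds to `a.ppoints = 0.5`). A hedged sketch switching to Blom's constant $a = 3/8$:

```{r ppointssketch, eval=FALSE}
# Q-Q comparison plot with Blom's plotting position (i - 3/8)/(n + 1/4)
# instead of Hazen's (i - 0.5)/n; check ?qqcomp before relying on this
qqcomp(list(fw, fln, fg), legendtext = plot.legend, a.ppoints = 3/8)
```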
The density plot and the CDF plot may be considered as the basic classical goodness-of-fit plots. The two other plots are complementary and can be very informative in some cases. The Q-Q plot emphasizes the lack-of-fit at the distribution tails while the P-P plot emphasizes the lack-of-fit at the distribution center. In the present example (Figure \@ref(fig:groundbeefcomp)), none of the three fitted distributions correctly describes the center of the distribution, but the Weibull and gamma distributions could be preferred for their better description of the right tail of the empirical distribution, especially if this tail is important in the use of the fitted distribution, as it is in the context of food risk assessment.

The data set named `endosulfan` will now be used to illustrate other features of the **fitdistrplus** package. This data set contains acute toxicity values for the organochlorine pesticide endosulfan (geometric mean of LC50 or EC50 values in $\mu g.L^{-1}$), tested on Australian and non-Australian laboratory species [@Hose04]. In ecotoxicology, a lognormal or a loglogistic distribution is often fitted to such a data set in order to characterize the species sensitivity distribution (SSD) for a pollutant. A low percentile of the fitted distribution, generally the 5% percentile, is then calculated and named the hazardous concentration 5% (HC5). It is interpreted as the value of the pollutant concentration protecting 95% of the species [@Posthuma2010].

But the fit of a lognormal or a loglogistic distribution to the whole `endosulfan` data set is rather bad (Figure \@ref(fig:fitendo)), especially due to a minority of very high values. The two-parameter Pareto distribution and the three-parameter Burr distribution (which is an extension of both the loglogistic and the Pareto distributions) have therefore also been fitted. Pareto and Burr distributions are provided in the package **actuar**. Until now, we did not have to define starting values (for the optimization process) as reasonable starting values are implicitly defined within the `fitdist` function for most of the distributions defined in R (see `?fitdist` for details). For other distributions like the Pareto and the Burr distributions, initial values for the distribution parameters have to be supplied in the argument `start`, as a named list with initial values for each parameter (as they appear in the `d`, `p`, `q` functions). Having defined reasonable starting values[^1], various distributions can be fitted and graphically compared. In this example, the function `cdfcomp` can be used to report CDF values in a logscale so as to emphasize discrepancies on the tail of interest while defining an HC5 value (Figure \@ref(fig:fitendo)).

[^1]: The `plotdist` function can plot any parametric distribution with specified parameter values in argument `para`. It can thus help to find correct initial values for the distribution parameters in non trivial cases, by iterative calls if necessary (see the reference manual for examples [@fitdistrplus]).
```{r fitendo, fig.align='center', fig.width=6, fig.height=6, fig.cap="CDF plot to compare the fit of four distributions to acute toxicity values of various organisms for the organochlorine pesticide endosulfan (`endosulfan` data set) as provided by the `cdfcomp` function, with CDF values in a logscale to emphasize discrepancies on the left tail."} library(actuar) data("endosulfan") ATV <- endosulfan$ATV fendo.ln <- fitdist(ATV, "lnorm") fendo.ll <- fitdist(ATV, "llogis", start = list(shape = 1, scale = 500)) fendo.P <- fitdist(ATV, "pareto", start = list(shape = 1, scale = 500)) fendo.B <- fitdist(ATV, "burr", start = list(shape1 = 0.3, shape2 = 1, rate = 1)) cdfcomp(list(fendo.ln, fendo.ll, fendo.P, fendo.B), xlogscale = TRUE, ylogscale = TRUE, legendtext = c("lognormal", "loglogistic", "Pareto", "Burr")) ```
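As noted in the footnote above, the `para` argument of `plotdist` can be used to eyeball candidate starting values before calling `fitdist`. For instance, with the Burr starting values just used (a visual check of the parametric curve against the data, not a fit):

```{r plotdistparasketch, eval=FALSE}
# Overlay the Burr density and CDF with the candidate starting values
# on the empirical distribution of ATV
plotdist(ATV, "burr", para = list(shape1 = 0.3, shape2 = 1, rate = 1))
```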
None of the fitted distributions correctly describes the right tail observed in the data set, but as shown in Figure \@ref(fig:fitendo), the left tail seems to be better described by the Burr distribution. Its use could then be considered to estimate the HC5 value as the 5% quantile of the distribution. This can easily be done using the `quantile` generic function defined for an object of class `fitdist`. Below is this calculation together with the calculation of the empirical quantile for comparison.

```{r quantilefitdist, echo=TRUE, fig=FALSE}
quantile(fendo.B, probs = 0.05)
quantile(ATV, probs = 0.05)
```

In addition to the ecotoxicology context, the `quantile` generic function is also attractive in the actuarial-financial context. In fact, the value-at-risk $VaR_\alpha$ is defined as the $1-\alpha$-quantile of the loss distribution and can be computed with `quantile` on a `fitdist` object.

The computation of different goodness-of-fit statistics is proposed in the **fitdistrplus** package in order to further compare fitted distributions. Goodness-of-fit statistics aim to measure the distance between the fitted parametric distribution and the empirical distribution: e.g., the distance between the fitted cumulative distribution function $F$ and the empirical distribution function $F_{n}$. When fitting continuous distributions, three goodness-of-fit statistics are classically considered: the Cramer-von Mises, Kolmogorov-Smirnov and Anderson-Darling statistics [@Stephens86]. Naming $x_{i}$ the $n$ observations of a continuous variable $X$ arranged in ascending order, Table \@ref(tab:tabKSCvMAD) gives the definition and the empirical estimate of the three considered goodness-of-fit statistics. They can be computed using the function `gofstat` as defined by Stephens [@Stephens86].

```{r fendo.gof.print, echo=TRUE, fig=FALSE}
gofstat(list(fendo.ln, fendo.ll, fendo.P, fendo.B),
        fitnames = c("lnorm", "llogis", "Pareto", "Burr"))
```
| Statistic | General formula | Computational formula |
|-------------------------|-------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Kolmogorov-Smirnov (KS) | $\sup|F_{n}(x) - F(x)|$ | $\max(D^{+},D^{-})$ with $D^{+}=\max\limits_{i=1,\dots,n}\left(\frac{i}{n} - F_i\right)$ and $D^{-}=\max\limits_{i=1,\dots,n}\left(F_{i}-\frac{i-1}{n}\right)$ |
| Cramer-von Mises (CvM) | $n \int_{-\infty}^{\infty}(F_{n}(x) - F(x))^2 dx$ | $\frac{1}{12n} + \sum\limits_{i=1}^n \left(F_i-\frac{2i-1}{2n} \right)^{2}$ |
| Anderson-Darling (AD) | $n \int_{-\infty}^{\infty}\frac{(F_{n}(x) - F(x))^2}{F(x) (1 - F(x))} dx$ | $-n -\frac{1}{n}\sum\limits_{i=1}^n (2i-1)\log(F_i(1-F_{n+1-i}))$ |

Table: (\#tab:tabKSCvMAD) Goodness-of-fit statistics as defined by Stephens [@Stephens86], where $F_i\stackrel{\triangle}{=} F(x_i)$.
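To illustrate the computational formulas of Table \@ref(tab:tabKSCvMAD), the KS statistic of the lognormal fit can be recomputed by hand and compared with the `gofstat` output above; `Fi`, `Dplus` and `Dminus` below are our own helper names:

```{r kssketch, eval=FALSE}
# Recompute the KS statistic for fendo.ln from the table formula,
# with F_i the fitted CDF evaluated at the ordered observations
Fi <- plnorm(sort(ATV), meanlog = coef(fendo.ln)["meanlog"],
             sdlog = coef(fendo.ln)["sdlog"])
n <- length(ATV)
Dplus  <- max((1:n) / n - Fi)
Dminus <- max(Fi - (0:(n - 1)) / n)
max(Dplus, Dminus)  # should match the KS value reported by gofstat
```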
Because it gives more weight to distribution tails, the Anderson-Darling statistic is of special interest when it matters to emphasize the tails as much as the main body of a distribution. This is often the case in risk assessment [@Cullen99, @Vose10]. For this reason, this statistic is often used to select the best distribution among those fitted. Nevertheless, this statistic should be used cautiously when comparing fits of various distributions. Keeping in mind that the weighting of each CDF quadratic difference depends on the parametric distribution in its definition (see Table \@ref(tab:tabKSCvMAD)), Anderson-Darling statistics computed for several distributions fitted to the same data set are theoretically difficult to compare. Moreover, such a statistic, like the Cramer-von Mises and Kolmogorov-Smirnov ones, does not take into account the complexity of the model (i.e., the number of parameters). This is not a problem when the compared distributions are characterized by the same number of parameters, but it could systematically promote the selection of more complex distributions otherwise. Looking at classical penalized criteria based on the loglikelihood (AIC, BIC) thus also seems interesting, especially to discourage overfitting.

In the previous example, all the goodness-of-fit statistics based on the CDF distance are in favor of the Burr distribution, the only one characterized by three parameters, while the AIC and BIC values respectively give the preference to the Burr distribution or the Pareto distribution. The choice between these two distributions thus seems less obvious and could be discussed. Even if specifically recommended for discrete distributions, the Chi-squared statistic may also be used for continuous distributions (see Section \@ref(otherdata) and the reference manual for examples [@fitdistrplus]).

## Uncertainty in parameter estimates {#Uncertainty}

The uncertainty in the parameters of the fitted distribution can be estimated by parametric or nonparametric bootstraps using the `bootdist` function for non-censored data [@efrontibshirani94]. This function returns the bootstrapped values of parameters in an S3 class object which can be plotted to visualize the bootstrap region. The medians and the 95% confidence intervals of parameters (2.5 and 97.5 percentiles) are printed in the summary. When smaller than the total number of iterations (due to lack of convergence of the optimization algorithm for some bootstrapped data sets), the number of iterations for which the estimation converges is also printed in the summary.

The plot of an object of class `bootdist` consists of a scatterplot (or a matrix of scatterplots) of the bootstrapped values of parameters, providing a representation of the joint uncertainty distribution of the fitted parameters. Below is an example of the use of the `bootdist` function with the previous fit of the Burr distribution to the `endosulfan` data set (Figure \@ref(fig:bootstrap)).

```{r fitBurr.boot.echo, echo=TRUE}
bendo.B <- bootdist(fendo.B, niter = 1001)
summary(bendo.B)
```

```{r bootstrap, fig.align='center', fig.width=6, fig.height=6, fig.cap="Bootstrapped values of parameters for a fit of the Burr distribution characterized by three parameters (example on the `endosulfan` data set) as provided by the plot of an object of class `bootdist`."}
plot(bendo.B)
```
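The bootstrap sample can also be propagated to the whole fitted CDF: the `CIcdfplot` function exported by the package draws pointwise bootstrap confidence intervals around the fitted CDF, either on quantiles or on probabilities. A minimal sketch, with arguments to be checked against `?CIcdfplot`:

```{r cicdfplotsketch, eval=FALSE}
# Pointwise 95% bootstrap confidence intervals on quantiles,
# drawn around the fitted Burr CDF on log-log scales
CIcdfplot(bendo.B, CI.output = "quantile", CI.fill = "lightgrey",
          xlogscale = TRUE, ylogscale = TRUE)
```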
Bootstrap samples of parameter estimates are especially useful to calculate confidence intervals on each parameter of the fitted distribution, from the marginal distribution of the bootstrapped values. It is also interesting to look at the joint distribution of the bootstrapped values in a scatterplot (or a matrix of scatterplots if the number of parameters exceeds two) in order to understand the potential structural correlation between parameters (see Figure \@ref(fig:bootstrap)).

The use of the whole bootstrap sample is also of interest in the risk assessment field. It enables the characterization of uncertainty in distribution parameters. It can be directly used within a second-order Monte Carlo simulation framework, especially within the package **mc2d** [@mc2d]. One could refer to @Pouillot10 for an introduction to the use of the **mc2d** and **fitdistrplus** packages in the context of quantitative risk assessment.

The bootstrap method can also be used to calculate confidence intervals on quantiles of the fitted distribution. For this purpose, a generic `quantile` function is provided for class `bootdist`. By default, 95% percentile bootstrap confidence intervals of quantiles are provided. Going back to the previous example from ecotoxicology, this function can be used to estimate the uncertainty associated with the HC5 estimation, for example from the previously fitted Burr distribution to the `endosulfan` data set.

```{r fitATV.lnorm.quantile, echo=TRUE}
quantile(bendo.B, probs = 0.05)
```

# Advanced topics {#advtopic}

## Alternative methods for parameter estimation {#Alternatives}

This subsection focuses on alternative estimation methods. One of the alternatives for continuous distributions is the maximum goodness-of-fit estimation method, also called the minimum distance estimation method [@Stephens86, @actuarJSS]. In this package this method is proposed with eight different distances: the three classical distances defined in Table \@ref(tab:tabKSCvMAD), or one of the variants of the Anderson-Darling distance proposed by @Luceno06 and defined in Table \@ref(tab:modifiedAD). The right-tail AD gives more weight to the right tail, while the left-tail AD gives more weight only to the left tail. Either of the tails, or both of them, can receive even larger weights by using second-order Anderson-Darling statistics.
| Statistic | General formula | Computational formula |
|---------------------------------|-------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|
| Right-tail AD (ADR) | $\int_{-\infty}^{\infty}\frac{(F_{n}(x) - F(x))^2 }{1 - F(x)} dx$ | $\frac{n}{2} -2\sum\limits_{i=1}^nF_i -\frac{1}{n}\sum\limits_{i=1}^n(2i-1)\ln(\overline F_{n+1-i})$ |
| Left-tail AD (ADL) | $\int_{-\infty}^{\infty}\frac{(F_{n}(x) - F(x))^2 }{F(x)} dx$ | $-\frac{3n}{2} +2\sum\limits_{i=1}^nF_i -\frac{1}{n}\sum\limits_{i=1}^n(2i-1)\ln(F_i)$ |
| Right-tail AD 2nd order (AD2R) | $ad2r=\int_{-\infty}^{\infty}\frac{(F_{n}(x) - F(x))^2 }{(1 - F(x))^{2}} dx$ | $ad2r=2\sum\limits_{i=1}^n\ln(\overline F_i) +\frac{1}{n}\sum\limits_{i=1}^n \frac{2i-1}{\overline F_{n+1-i}}$ |
| Left-tail AD 2nd order (AD2L) | $ad2l=\int_{-\infty}^{\infty}\frac{(F_{n}(x) - F(x))^2 }{(F(x))^{2}} dx$ | $ad2l=2\sum\limits_{i=1}^n\ln(F_i) +\frac{1}{n}\sum\limits_{i=1}^n\frac{2i-1}{F_i}$ |
| AD 2nd order (AD2) | $ad2r+ad2l$ | $ad2r+ad2l$ |

Table: (\#tab:modifiedAD) Modified Anderson-Darling statistics as defined by @Luceno06, where $F_i\stackrel{\triangle}{=} F(x_{i})$ and $\overline F_i\stackrel{\triangle}{=}1-F(x_{i})$.
To fit a distribution by maximum goodness-of-fit estimation, one needs to fix the argument `method` to `"mge"` in the call to `fitdist` and to specify the argument `gof` coding for the chosen goodness-of-fit distance. This estimation method is intended to be used only with continuous non-censored data.

Maximum goodness-of-fit estimation may be useful to give more weight to data at one tail of the distribution. In the previous example from ecotoxicology, we used a non-classical distribution (the Burr distribution) to correctly fit the empirical distribution, especially on its left tail. In order to correctly estimate the 5$\%$ percentile, we could also consider the fit of the classical lognormal distribution, but minimizing a goodness-of-fit distance that gives more weight to the left tail of the empirical distribution. In what follows, the left-tail Anderson-Darling distances of first or second order are used to fit a lognormal distribution to the `endosulfan` data set (see Figure \@ref(fig:plotfitMGE)).

```{r plotfitMGE, fig.align='center', fig.width=6, fig.height=6, fig.cap="Comparison of a lognormal distribution fitted by MLE and by MGE using two different goodness-of-fit distances: left-tail Anderson-Darling and left-tail Anderson Darling of second order (example with the `endosulfan` data set) as provided by the `cdfcomp` function, with CDF values in a logscale to emphasize discrepancies on the left tail."}
fendo.ln.ADL <- fitdist(ATV, "lnorm", method = "mge", gof = "ADL")
fendo.ln.AD2L <- fitdist(ATV, "lnorm", method = "mge", gof = "AD2L")
cdfcomp(list(fendo.ln, fendo.ln.ADL, fendo.ln.AD2L),
        xlogscale = TRUE, ylogscale = TRUE,
        main = "Fitting a lognormal distribution",
        xlegend = "bottomright",
        legendtext = c("MLE", "Left-tail AD", "Left-tail AD 2nd order"))
```
Comparing the 5% percentiles (HC5) calculated using these three fits to the one calculated from the MLE fit of the Burr distribution, we can observe, on this example, that fitting the lognormal distribution by minimizing the left-tail Anderson-Darling distance of first or second order makes it possible to approach the value obtained by fitting the Burr distribution by MLE.

```{r quantilefitdist2, echo=TRUE, fig=FALSE}
(HC5.estimates <- c(
  empirical = as.numeric(quantile(ATV, probs = 0.05)),
  Burr = as.numeric(quantile(fendo.B, probs = 0.05)$quantiles),
  lognormal_MLE = as.numeric(quantile(fendo.ln, probs = 0.05)$quantiles),
  lognormal_ADL = as.numeric(quantile(fendo.ln.ADL, probs = 0.05)$quantiles),
  lognormal_AD2L = as.numeric(quantile(fendo.ln.AD2L, probs = 0.05)$quantiles)))
```

The moment matching estimation (MME) is another method commonly used to fit parametric distributions [@Vose10]. MME consists in finding the value of the parameter $\theta$ that equalizes the first theoretical raw moments of the parametric distribution to the corresponding empirical raw moments, as in Equation \@ref(eq:eq4):

\begin{equation}
E(X^{k}|\theta)=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{k},(\#eq:eq4)
\end{equation}

for $k=1,\ldots,d$, with $d$ the number of parameters to estimate and $x_{i}$ the $n$ observations of variable $X$. For moments of order greater than or equal to 2, it may also be relevant to match centered moments. Therefore, we match the moments given in Equation \@ref(eq:eq5):

\begin{equation}
E(X\vert \theta) = \overline{x} ~,~E\left((X-E(X))^{k}|\theta\right)=m_k, \text{ for } k=2,\ldots,d,(\#eq:eq5)
\end{equation}

where $m_k$ denotes the empirical centered moments. This method can be performed by setting the argument `method` to `"mme"` in the call to `fitdist`. The estimate is computed by a closed-form formula for the following distributions: normal, lognormal, exponential, Poisson, gamma, logistic, negative binomial, geometric, beta and uniform distributions. In this case, for distributions characterized by one parameter (geometric, Poisson and exponential), this parameter is simply estimated by matching theoretical and observed means, and for distributions characterized by two parameters, these parameters are estimated by matching theoretical and observed means and variances [@Vose10]. For other distributions, the equation of moments is solved numerically using the `optim` function by minimizing the sum of squared differences between observed and theoretical moments (see the **fitdistrplus** reference manual for technical details [@fitdistrplus]).

A classical data set from the Danish insurance industry published in @mcneil97 will be used to illustrate this method. In **fitdistrplus**, the data set is stored in `danishuni` for the univariate version and contains the loss amounts collected at Copenhagen Reinsurance between 1980 and 1990. In actuarial science, it is standard to consider positive heavy-tailed distributions and to have a special focus on the right tail of the distributions. In this numerical experiment, we choose classic actuarial distributions for loss modelling: the lognormal distribution and the Pareto type II distribution [@Klugmanetal09].

The lognormal distribution is fitted to the `danishuni` data set by matching moments implemented as a closed-form formula. On the left-hand graph of Figure \@ref(fig:danishmme), the fitted distribution functions obtained using the moment matching estimation (MME) and maximum likelihood estimation (MLE) methods are compared.
The MME method provides a more cautious estimation of the insurance risk, as the MME-fitted distribution function underestimates the empirical distribution function for large values of claim amounts, while the MLE-fitted one overestimates it.

```{r danish.mme, echo=TRUE, eval=TRUE}
data("danishuni")
str(danishuni)
```
```{r danishmme, fig.align='center', fig.width=7, fig.height=4, fig.cap="Comparison between MME and MLE when fitting a lognormal or a Pareto distribution to loss data from the `danishuni` data set."} fdanish.ln.MLE <- fitdist(danishuni$Loss, "lnorm") fdanish.ln.MME <- fitdist(danishuni$Loss, "lnorm", method = "mme", order = 1:2) library(actuar) fdanish.P.MLE <- fitdist(danishuni$Loss, "pareto", start = list(shape = 10, scale = 10), lower = 2+1e-6, upper = Inf) memp <- function(x, order) sum(x^order) / length(x) fdanish.P.MME <- fitdist(danishuni$Loss, "pareto", method = "mme", order = 1:2, memp = "memp", start = list(shape = 10, scale = 10), lower = c(2+1e-6, 2+1e-6), upper = c(Inf, Inf)) par(mfrow = c(1, 2)) cdfcomp(list(fdanish.ln.MLE, fdanish.ln.MME), legend = c("lognormal MLE", "lognormal MME"), main = "Fitting a lognormal distribution", xlogscale = TRUE, datapch = 20) cdfcomp(list(fdanish.P.MLE, fdanish.P.MME), legend = c("Pareto MLE", "Pareto MME"), main = "Fitting a Pareto distribution", xlogscale = TRUE, datapch = 20) ```
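For the lognormal distribution, the closed-form moment matching mentioned above is easy to reproduce by hand: matching $E(X)=e^{\mu+\sigma^2/2}$ and $Var(X)=(e^{\sigma^2}-1)e^{2\mu+\sigma^2}$ to the empirical mean and variance yields the formulas below. The use of the $1/n$-denominator variance is our assumption about the package's convention (see `?mmedist`):

```{r lnormmmesketch, eval=FALSE}
# Closed-form MME for the lognormal: solve the two moment equations for
# (meanlog, sdlog) using the empirical mean m1 and centered moment m2
m1 <- mean(danishuni$Loss)
m2 <- mean((danishuni$Loss - m1)^2)   # empirical variance, 1/n denominator
sdlog.mme <- sqrt(log(1 + m2 / m1^2))
meanlog.mme <- log(m1) - sdlog.mme^2 / 2
rbind(by.hand = c(meanlog.mme, sdlog.mme), fitdist = coef(fdanish.ln.MME))
```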
Second, a Pareto distribution, which gives more weight to the right tail of the distribution, is fitted. Like the lognormal distribution, the Pareto distribution has two parameters, which allows a fair comparison. We use the implementation of the **actuar** package providing raw and centered moments for that distribution (in addition to the `d`, `p`, `q` and `r` functions) [@actuar12]. Fitting a heavy-tailed distribution for which the first and the second moments do not exist for certain values of the shape parameter requires some cautiousness. This is carried out by providing, for the optimization process, a lower and an upper bound for each parameter. The code below calls the L-BFGS-B optimization method in `optim`, since this quasi-Newton method allows box constraints.[^2] We choose to match the moments defined in Equation \@ref(eq:eq4), and so a function for computing the empirical raw moments (called `memp` in our example) is passed to `fitdist`. For two-parameter distributions (i.e., $d=2$), Equations \@ref(eq:eq4) and \@ref(eq:eq5) are equivalent.

[^2]: That is what the B stands for.

```{r danish.mme.pareto, echo=TRUE, fig=FALSE}
gofstat(list(fdanish.ln.MLE, fdanish.P.MLE, fdanish.ln.MME, fdanish.P.MME),
        fitnames = c("lnorm.mle", "Pareto.mle", "lnorm.mme", "Pareto.mme"))
```

As shown in Figure \@ref(fig:danishmme), the MME and MLE fits are far less distant (when looking at the right tail) for the Pareto distribution than for the lognormal distribution on this data set. Furthermore, for these two distributions, the MME method better fits the right tail of the distribution from a visual point of view. This seems logical since empirical moments are influenced by large observed values. In the previous traces, we gave the values of the goodness-of-fit statistics. Whatever the statistic considered, the MLE-fitted lognormal always provides the best fit to the observed data.

Maximum likelihood and moment matching estimations are certainly the most commonly used methods for fitting distributions [@Cullen99]. Keeping in mind that these two methods may produce very different results, the user should be aware of the great sensitivity of moment matching estimation to outliers when choosing that method. This may be seen as an advantage in our example if the objective is to better describe the right tail of the distribution, but it may be seen as a drawback if the objective is different.

Fitting of a parametric distribution may also be done by matching theoretical quantiles of the parametric distribution (for specified probabilities) against the empirical quantiles [@Tse2009]. The equality of theoretical and empirical quantiles is expressed by Equation \@ref(eq:eq6) below, which is very similar to Equations \@ref(eq:eq4) and \@ref(eq:eq5):

\begin{equation}
F^{-1}(p_{k}|\theta)=Q_{n,p_{k}}(\#eq:eq6)
\end{equation}

for $k=1,\ldots,d$, with $d$ the number of parameters to estimate (the dimension of $\theta$ if there are no fixed parameters) and $Q_{n,p_{k}}$ the empirical quantiles calculated from data for specified probabilities $p_{k}$.

Quantile matching estimation (QME) is performed by setting the argument `method` to `"qme"` in the call to `fitdist` and adding an argument `probs` defining the probabilities for which the quantile matching is performed (see Figure \@ref(fig:danishqme)). The length of this vector must be equal to the number of parameters to estimate (as for the vector of moment orders for MME).
Empirical quantiles are computed using the `quantile` function of the **stats** package, with `type = 7` by default (see `?quantile` and @hyndmanfan96). But the quantile type can easily be changed via the `qtype` argument, which is passed on to the underlying `qmedist` function. The quantile matching is carried out numerically, by minimizing the sum of squared differences between observed and theoretical quantiles.

```{r danishqme, fig.align='center', fig.width=6, fig.height=6, fig.cap="Comparison between QME and MLE when fitting a lognormal distribution to loss data from the `danishuni` data set."}
fdanish.ln.QME1 <- fitdist(danishuni$Loss, "lnorm", method = "qme", probs = c(1/3, 2/3))
fdanish.ln.QME2 <- fitdist(danishuni$Loss, "lnorm", method = "qme", probs = c(8/10, 9/10))
cdfcomp(list(fdanish.ln.MLE, fdanish.ln.QME1, fdanish.ln.QME2),
        legend = c("MLE", "QME(1/3, 2/3)", "QME(8/10, 9/10)"),
        main = "Fitting a lognormal distribution", xlogscale = TRUE, datapch = 20)
```
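The sensitivity of the empirical quantiles themselves to the quantile definition can be checked directly with the `quantile` function of **stats**, whose `type` argument corresponds to `qtype`:

```{r qtypesketch, eval=FALSE}
# Empirical quantiles of the Danish losses under two definitions
quantile(danishuni$Loss, probs = c(1/3, 2/3), type = 7)  # default used by QME
quantile(danishuni$Loss, probs = c(1/3, 2/3), type = 5)  # Hazen-type definition
```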
Figure \@ref(fig:danishqme) shows the fit of a lognormal distribution to the `danishuni` data set by matching probabilities $(p_1= 1/3, p_2=2/3)$ and $(p_1= 8/10, p_2=9/10)$. As expected, the second QME fit gives more weight to the right tail of the distribution. Compared to the maximum likelihood estimation, the second QME fit best suits the right tail of the distribution, whereas the first QME fit best models the body of the distribution. Quantile matching estimation is of particular interest when we need to focus on particular quantiles, e.g., $p=99.5\%$ in the Solvency II insurance context or $p=5\%$ for the HC5 estimation in the ecotoxicology context.

## Customization of the optimization algorithm {#Customization}

Each time a numerical minimization is carried out in the **fitdistrplus** package, the `optim` function of the **stats** package is used by default, with the `Nelder-Mead` method for distributions characterized by more than one parameter and the `BFGS` method for distributions characterized by only one parameter. Sometimes the default algorithm fails to converge. It is then interesting to change some options of the `optim` function or to use another optimization function than `optim` to minimize the objective function. The argument `optim.method` can be used in the call to `fitdist` or `fitdistcens`. It is internally passed to `mledist`, `mmedist`, `mgedist` or `qmedist`, and then to `optim` (see `?optim` for details about the different algorithms available).

Even when no error is raised during the optimization, changing the algorithm is of particular interest to enforce bounds on some parameters. For instance, a volatility parameter $\sigma$ is strictly positive ($\sigma>0$) and a probability parameter $p$ lies in $[0,1]$. Bounds can be enforced by using the arguments `lower` and/or `upper`, whose use automatically forces `optim.method="L-BFGS-B"`.

Below are examples of fits of a gamma distribution $\mathcal{G}(\alpha, \lambda)$ to the `groundbeef` data set with various algorithms. Note that the conjugate gradient algorithm (`CG`) needs far more iterations to converge (around 2500 iterations) compared to the other algorithms (which converge in less than 100 iterations).

```{r optimmethod.gamma, echo=TRUE}
data("groundbeef")
fNM <- fitdist(groundbeef$serving, "gamma", optim.method = "Nelder-Mead")
fBFGS <- fitdist(groundbeef$serving, "gamma", optim.method = "BFGS")
fSANN <- fitdist(groundbeef$serving, "gamma", optim.method = "SANN")
fCG <- try(fitdist(groundbeef$serving, "gamma", optim.method = "CG",
                   control = list(maxit = 10000)))
if(inherits(fCG, "try-error")) {fCG <- list(estimate = NA)}
```

It is also possible to use another function than `optim` to minimize the objective function, by specifying the argument `custom.optim` in the call to `fitdist`. It may be necessary to customize this optimization function to meet the following requirements. (1) The `custom.optim` function must have the following arguments: `fn` for the function to be optimized and `par` for the initialized parameters. (2) `custom.optim` should carry out a MINIMIZATION and must return the following components: `par` for the estimate, `convergence` for the convergence code, `value=fn(par)` and `hessian`. Below is an example of code written to wrap the `genoud` function from the **rgenoud** package in order to respect our optimization "template". The **rgenoud** package implements a genetic (stochastic) algorithm.

```{r optimmethod.customgenoud, echo=TRUE}
mygenoud <- function(fn, par, ...)
{
  require(rgenoud)
  res <- genoud(fn, starting.values = par, ...)
  standardres <- c(res, convergence = 0)
  return(standardres)
}
```

The customized optimization function can then be passed as the argument `custom.optim` in the call to `fitdist` or `fitdistcens`. The following code can for example be used to fit a gamma distribution to the `groundbeef` data set. Note that in this example various arguments are also passed from `fitdist` to `genoud`: `nvars`, `Domains`, `boundary.enforcement`, `print.level` and `hessian`. The code below compares the parameter estimates ($\hat\alpha$, $\hat\lambda$) obtained by the different algorithms: the shape $\alpha$ and rate $\lambda$ estimates are relatively similar in this example, roughly 4.00 and 0.05, respectively.

```{r optimmethod.customgenoud.fitdist, echo=TRUE, eval=TRUE}
fgenoud <- mledist(groundbeef$serving, "gamma", custom.optim = mygenoud, nvars = 2,
                   max.generations = 10, Domains = cbind(c(0, 0), c(10, 10)),
                   boundary.enforcement = 1, hessian = TRUE, print.level = 0, P9 = 10)
cbind(NM = fNM$estimate, BFGS = fBFGS$estimate, SANN = fSANN$estimate,
      CG = fCG$estimate, fgenoud = fgenoud$estimate)
```

## Fitting distributions to other types of data {#otherdata}

*This section was modified since the publication of this vignette in the Journal of Statistical Software in order to include new goodness-of-fit plots for censored and discrete data.*

Analytical methods often lead to semi-quantitative results which are referred to as censored data. Observations only known to be under a limit of detection are left-censored data. Observations only known to be above a limit of quantification are right-censored data. Results known to lie between two bounds are interval-censored data. These two bounds may correspond to a limit of detection and a limit of quantification, or more generally to uncertainty bounds around the observation. Right-censored data are also commonly encountered with survival data [@kleinmoeschberger03]. A data set may thus contain right-, left-, or interval-censored data, or may be a mixture of these categories, possibly with different upper and lower bounds. Censored data are sometimes excluded from the data analysis or replaced by a fixed value, which in both cases may lead to biased results. A more recommended approach to correctly model such data is based upon maximum likelihood [@kleinmoeschberger03, @helsel05].

Censored data may thus contain left-censored, right-censored and interval-censored values, with several lower and upper bounds. Before their use in package **fitdistrplus**, such data must be coded into a dataframe with two columns, respectively named `left` and `right`, describing each observed value as an interval. The `left` column contains either `NA` for left-censored observations, the left bound of the interval for interval-censored observations, or the observed value for non-censored observations. The `right` column contains either `NA` for right-censored observations, the right bound of the interval for interval-censored observations, or the observed value for non-censored observations.

To illustrate the use of package **fitdistrplus** to fit distributions to censored continuous data, we will use another data set from ecotoxicology, included in our package and named `salinity`. This data set contains acute salinity tolerance (LC50 values in electrical conductivity, $mS$.$cm^{-1}$) of riverine macro-invertebrate taxa from the southern Murray-Darling Basin in Central Victoria, Australia [@kefford07].
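Before turning to the real data, the `left`/`right` coding described above can be illustrated on an invented toy example:

```{r censcodesketch, eval=FALSE}
# Toy censored data set (invented values): one observation of each kind
dtoy <- data.frame(left  = c(NA, 5, 3, 7),
                   right = c(2, NA, 4, 7))
dtoy
# row 1: value below 2 (left-censored),  row 2: value above 5 (right-censored),
# row 3: value in [3, 4] (interval-censored), row 4: value of 7 (non-censored)
```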
```{r datsalinity, echo=TRUE}
data("salinity")
str(salinity)
```

Using censored data such as those coded in the `salinity` data set, the empirical distribution can be plotted using the `plotdistcens` function. In older versions of the package, by default this function used the Expectation-Maximization approach of @Turnbull74 to compute the overall empirical cdf curve, with optional confidence intervals, by calls to the `survfit` and `plot.survfit` functions from the **survival** package. Even if this representation is still available (by fixing the argument `NPMLE.method` to `"Turnbull.middlepoints"`), the default plot of the empirical cumulative distribution function (ECDF) now explicitly represents the regions of non-uniqueness of the NPMLE ECDF. The default computation of those regions of non-uniqueness and their associated masses uses the non-parametric maximum likelihood estimation (NPMLE) approach developed by Wang [@Wang2007, @Wang2008, @Wang2013, @Wang2018]. Figure \@ref(fig:cdfcompcens) shows on the top left the new plot of data together with two fitted distributions. Grey filled rectangles in such a plot represent the regions of non-uniqueness of the NPMLE ECDF.

A less rigorous but sometimes more illustrative plot can be obtained by fixing the argument `NPMLE` to `FALSE` in the call to `plotdistcens` (see Figure \@ref(fig:plotsalinity2) for an example and the help page of the `plotdistcens` function for details). This plot shows the real nature of censored data, as points and intervals, but the difficulty in building such a plot is to define a relevant ordering of observations.

```{r plotsalinity2, fig.align='center', fig.width=6, fig.height=6, fig.cap="Simple plot of censored raw data (72-hour acute salinity tolerance of riverine macro-invertebrates from the `salinity` data set) as ordered points and intervals."}
plotdistcens(salinity, NPMLE = FALSE)
```
As for non-censored data, one or more parametric distributions can be fitted to the censored data set, one at a time, but using in this case the `fitdistcens` function. This function estimates the vector of distribution parameters $\theta$ by maximizing the likelihood for censored data defined as:

\begin{equation}
L(\theta) = \prod_{i=1}^{N_{nonC}} f(x_{i}|\theta)\times \prod_{j=1}^{N_{leftC}} F(x^{upper}_{j}|\theta) \\
\times \prod_{k=1}^{N_{rightC}} (1- F(x^{lower}_{k}|\theta))\times \prod_{m=1}^{N_{intC}} (F(x^{upper}_{m}|\theta)- F(x^{lower}_{m}|\theta))(\#eq:eq7)
\end{equation}

with $x_{i}$ the $N_{nonC}$ non-censored observations, $x^{upper}_{j}$ the upper values defining the $N_{leftC}$ left-censored observations, $x^{lower}_{k}$ the lower values defining the $N_{rightC}$ right-censored observations, $[x^{lower}_{m} ; x^{upper}_{m}]$ the intervals defining the $N_{intC}$ interval-censored observations, and $F$ the cumulative distribution function of the parametric distribution [@kleinmoeschberger03, @helsel05].

Like `fitdist`, `fitdistcens` returns the results of the fit of any parametric distribution to a data set as an S3 class object that can be easily printed, summarized or plotted. For the `salinity` data set, a lognormal or a loglogistic distribution can be fitted, as commonly done in ecotoxicology for such data. As with `fitdist`, for some distributions (see @fitdistrplus for details), it is necessary to specify initial values for the distribution parameters in the argument `start`. The `plotdistcens` function can help to find correct initial values for the distribution parameters in non trivial cases, by a manual iterative use if necessary.

```{r plotdistcens, echo=TRUE, fig=FALSE}
fsal.ln <- fitdistcens(salinity, "lnorm")
fsal.ll <- fitdistcens(salinity, "llogis", start = list(shape = 5, scale = 40))
summary(fsal.ln)
summary(fsal.ll)
```

Computations of goodness-of-fit statistics have not yet been developed for fits using censored data, but the quality of fit can be judged using the Akaike and Schwarz's Bayesian information criteria (AIC and BIC) and the goodness-of-fit CDF plot, respectively provided when summarizing or plotting an object of class `fitdistcens`. Functions `cdfcompcens`, `qqcompcens` and `ppcompcens` can also be used to compare the fit of various distributions to the same censored data set. Their calls are similar to those of `cdfcomp`, `qqcomp` and `ppcomp`. Below are examples of the use of those functions with the two distributions fitted to the `salinity` data set (see Figure \@ref(fig:cdfcompcens)). When `qqcompcens` and `ppcompcens` are used with more than one fitted distribution, the non-uniqueness rectangles are not filled and a small noise is added on the y-axis in order to help the visualization of the various fits. But we rather recommend the use of the `ggplot` `plotstyle` of `qqcompcens` and `ppcompcens` to compare the fits of various distributions, as it provides a clearer plot split into facets (see `?graphcompcens`).

```{r cdfcompcens, fig.align='center', fig.width=7, fig.height=7, fig.cap="Some goodness-of-fit plots for fits of a lognormal and a loglogistic distribution to censored data: LC50 values from the `salinity` data set."}
par(mfrow = c(2, 2))
cdfcompcens(list(fsal.ln, fsal.ll), legendtext = c("lognormal", "loglogistic "))
qqcompcens(fsal.ln, legendtext = "lognormal")
ppcompcens(fsal.ln, legendtext = "lognormal")
qqcompcens(list(fsal.ln, fsal.ll), legendtext = c("lognormal", "loglogistic "),
           main = "Q-Q plot with 2 dist.")
```
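Equation \@ref(eq:eq7) can be verified by hand on the lognormal fit: summing the log-contributions of the four types of observations should reproduce the fitted loglikelihood. A sketch, in which the splitting logic is our own and assumes non-censored rows satisfy `left == right`:

```{r censllsketch, eval=FALSE}
# Hand-computed censored loglikelihood of Equation (7) for fsal.ln
p <- coef(fsal.ln)
ll <- with(salinity, {
  nonc <- !is.na(left) & !is.na(right) & left == right  # non-censored
  lc <- is.na(left)                                     # left-censored
  rc <- is.na(right)                                    # right-censored
  ic <- !is.na(left) & !is.na(right) & left < right     # interval-censored
  sum(dlnorm(left[nonc], p["meanlog"], p["sdlog"], log = TRUE)) +
    sum(plnorm(right[lc], p["meanlog"], p["sdlog"], log.p = TRUE)) +
    sum(plnorm(left[rc], p["meanlog"], p["sdlog"], lower.tail = FALSE, log.p = TRUE)) +
    sum(log(plnorm(right[ic], p["meanlog"], p["sdlog"]) -
              plnorm(left[ic], p["meanlog"], p["sdlog"])))
})
c(by.hand = ll, fitdistcens = logLik(fsal.ln))  # should coincide
```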
Function `bootdistcens` is the equivalent of `bootdist` for censored data, except that it only proposes nonparametric bootstrap. Indeed, it is not obvious how to simulate censoring within a parametric bootstrap resampling procedure. The generic function `quantile` can also be applied to an object of class `fitdistcens` or `bootdistcens`, as for continuous non-censored data.

In addition to the fit of distributions to censored or non-censored continuous data, our package can also accommodate discrete variables, such as count numbers, using the functions developed for continuous non-censored data. These functions will provide somewhat different graphs and statistics, taking into account the discrete nature of the modeled variable. The discrete nature of the variable is automatically recognized when a classical distribution is fitted to data (binomial, negative binomial, geometric, hypergeometric and Poisson distributions) but must be indicated by fixing the argument `discrete` to `TRUE` in the call to functions in other cases. The `toxocara` data set included in the package corresponds to the observation of such a discrete variable. Numbers of *Toxocara cati* parasites present in the digestive tract are reported from a random sampling of feral cats living on Kerguelen island [@Fromont01]. We will use it to illustrate the case of discrete data.

```{r dattoxocara, echo=TRUE}
data("toxocara")
str(toxocara)
```

The fit of a discrete distribution to discrete data by maximum likelihood estimation requires the same procedure as for continuous non-censored data. As an example, using the `toxocara` data set, Poisson and negative binomial distributions can be easily fitted.

```{r fittoxocara.poisnbinom, echo = TRUE, fig = FALSE}
(ftoxo.P <- fitdist(toxocara$number, "pois"))
(ftoxo.nb <- fitdist(toxocara$number, "nbinom"))
```

For discrete distributions, the plot of an object of class `fitdist` simply provides two goodness-of-fit plots comparing empirical and theoretical distributions in density and in CDF. Functions `cdfcomp` and `denscomp` can also be used to compare several fits to the same data set, as follows for the previous fits (Figure \@ref(fig:fittoxocarapoisnbinom)).

```{r fittoxocarapoisnbinom, fig.align='center', fig.width=7, fig.height=4, fig.cap="Comparison of the fits of a negative binomial and a Poisson distribution to numbers of *Toxocara cati* parasites from the `toxocara` data set."}
par(mfrow = c(1, 2))
denscomp(list(ftoxo.P, ftoxo.nb), legendtext = c("Poisson", "negative binomial"), fitlty = 1)
cdfcomp(list(ftoxo.P, ftoxo.nb), legendtext = c("Poisson", "negative binomial"), fitlty = 1)
```
When fitting discrete distributions, the Chi-squared statistic is computed by the `gofstat` function using cells defined by the argument `chisqbreaks`, or cells automatically defined from the data in order to reach roughly the same number of observations per cell. This number is roughly equal to the argument `meancount`, or slightly greater if there are some ties. The choice to define cells from the empirical distribution (data), and not from the theoretical distribution, was made to enable the comparison of Chi-squared values obtained with different distributions fitted to the same data set. If the arguments `chisqbreaks` and `meancount` are both omitted, `meancount` is fixed in order to obtain roughly $(4n)^{2/5}$ cells, with $n$ the length of the data set [@Vose10]. Using this default option, the two previous fits are compared as follows, giving the preference to the negative binomial distribution, from both the Chi-squared statistics and the information criteria:

```{r fittoxocara.poisnbinom.gof}
gofstat(list(ftoxo.P, ftoxo.nb), fitnames = c("Poisson", "negative binomial"))
```

# Conclusion {#ccl}

The R package **fitdistrplus** makes it easy to fit distributions. Our main objective while developing this package was to provide tools for helping R users to fit distributions to data. We have been encouraged to pursue our work by feedback from users of our package in various areas such as food or environmental risk assessment, epidemiology, ecology, molecular biology, genomics, bioinformatics, hydraulics, mechanics, financial and actuarial mathematics or operations research. Indeed, this package is already used by many practitioners and academics for simple MLE fits [@jaloustreetal11, @saketal11, @kochetal12, @marquetouxetal12, @scholletal12, @suuronenetal12, @varoetal12, @mandletal13, @mala2013publi, @nadarajahbakar2013, @simoetal13, @westphalfitch13, @wayland13, @vianaetal13, @wu13, @staggeetal13, @fiorellietal13, @tikoleetal13, @voigtetal14], for MLE fits and goodness-of-fit statistics [@tarnczi11, @bagariaetal12, @benavidesetal12, @breitbach12, @Pouillot10, @vaninsky13], for MLE fits and bootstrap [@croucheretal12, @meheustetal12, @orellanoetal12, @telloetal12, @hoelzeretal12, @prosseretal13, @Zhang2013, @Rigaux2014], for MLE fits, bootstrap and goodness-of-fit statistics [@larrasetal13], for MME fits [@luangkesornetal12, @callauetal13, @satoetal13], for censored MLE and bootstrap [@lehaetal11, @poulliotetal12, @jongenburgeretal12, @commeauetal12, @contrerasetal2013], for graphical analysis in [@anandetal12], for grouped-data fitting methods [@fusteinercostafreda12] or more generally [@busschaertetal10, @eling12, @sosaetal2013, @srinivasanetal2013, @meyeretal13, @Guillier2013471, @Daelmanetal13, @eiketal13, @Wu2013b, @drakeetal2014].

The **fitdistrplus** package is complementary to the **distrMod** package [@distrModJSS]. **distrMod** provides an even more flexible way to estimate distribution parameters, but its use requires a greater initial investment to learn how to manipulate the S4 classes and methods developed in the `distr`-family packages. Many extensions of the **fitdistrplus** package are planned in the future: we aim to extend to censored data some methods that are currently available only for non-censored data, especially concerning goodness-of-fit evaluation and fitting methods.
We will also enlarge the choice of fitting methods for non-censored data, by proposing new goodness-of-fit distances (e.g., distances based on quantiles) for maximum goodness-of-fit estimation and new types of moments (e.g., limited expected values) for moment matching estimation. Finally, we will consider the case of multivariate distribution fitting.
# Acknowledgments {-}

The package would not have been at this stage without the stimulating contribution of Régis Pouillot and Jean-Baptiste Denis, especially for its conceptualization. We also want to thank Régis Pouillot for his very valuable comments on the first version of this paper. The authors gratefully acknowledge the two anonymous referees and the Editor for useful and constructive comments. The remaining errors, of course, should be attributed to the authors alone.
# References {-}

fitdistrplus/vignettes/Optimalgo.Rmd0000644000176200001440000004373214421750634017432 0ustar liggesusers---
title: Which optimization algorithm to choose?
author: Marie Laure Delignette Muller, Christophe Dutang
date: '`r Sys.Date()`'
output:
  bookdown::html_document2:
    base_format: rmarkdown::html_vignette
    fig_caption: yes
    toc: true
    number_sections: yes
link-citations: true
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Which optimization algorithm to choose?}
  %\VignetteEncoding{UTF-8}
  \usepackage[utf8]{inputenc}
pkgdown:
  as_is: true
---

```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
require(fitdistrplus)
require(knitr) #for kable() function
set.seed(12345)
options(digits = 3)
```

# Quick overview of main optimization methods

We present very quickly the main optimization methods. Please refer to **Numerical Optimization (Nocedal \& Wright, 2006)** or **Numerical Optimization: theoretical and practical aspects (Bonnans, Gilbert, Lemarechal \& Sagastizabal, 2006)** for a good introduction. We consider the following problem $\min_x f(x)$ for $x\in\mathbb{R}^n$.

## Derivative-free optimization methods

The Nelder-Mead method is one of the most well known derivative-free methods that use only values of $f$ to search for the minimum. It consists in building a simplex of $n+1$ points and moving/shrinking this simplex in the right direction.

1. set initial points $x_1, \dots, x_{n+1}$.
2. order points such that $f(x_1)\leq f(x_2)\leq\dots\leq f(x_{n+1})$.
3. compute $x_o$ as the centroid of $x_1, \dots, x_{n}$.
4. Reflection:
    + compute the reflected point $x_r = x_o + \alpha(x_o-x_{n+1})$.
    + **if** $f(x_1)\leq f(x_r)<f(x_n)$ **then** replace $x_{n+1}$ by $x_r$, go to step 2.
    + **else** go to step 5.
5. Expansion:
    + **if** $f(x_r)<f(x_1)$ **then** compute the expansion point $x_e = x_o+\gamma(x_o-x_{n+1})$; replace $x_{n+1}$ by $x_e$ if $f(x_e)<f(x_r)$ and by $x_r$ otherwise; go to step 2.
    + **else** go to step 6.
6. Contraction:
    + compute the contracted point $x_c = x_o + \beta(x_{n+1}-x_o)$.
    + **if** $f(x_c)<f(x_{n+1})$ **then** replace $x_{n+1}$ by $x_c$, go to step 2.
    + **else** go to step 7.
7. Shrink:
    + for $i=2,\dots,n+1$, replace $x_i$ by $x_1+\sigma(x_i-x_1)$; go to step 2.

The Nelder-Mead method is available in `optim`.

## Gradient-based optimization methods

### Computing the direction $d_k$

Given the gradient $g(x)=\nabla f(x)$, these methods iterate $x_{k+1} = x_k + t_k d_k$, where $t_k>0$ is a stepsize and $d_k$ a descent direction. Classical choices for the direction are the steepest descent direction $d_k = -g(x_k)$, the quasi-Newton direction $d_k = -H_k g(x_k)$ with $H_k$ an approximation of the inverse Hessian (the BFGS scheme in `optim`), and the conjugate gradient direction $d_k = -g(x_k)+\beta_k d_{k-1}$ for $k>1$, once initiated by $d_1 = -g(x_1)$. $\beta_k$ are updated according to a scheme:

* $\beta_k = \frac{ g_k^T g_k}{g_{k-1}^T g_{k-1} }$: Fletcher-Reeves update,
* $\beta_k = \frac{ g_k^T (g_k-g_{k-1} )}{g_{k-1}^T g_{k-1}}$: Polak-Ribiere update.

There also exists a three-term formula for computing the direction, $d_k = -g(x_k) + \beta_k d_{k-1}+\gamma_{k} d_t$ for $t<k$, with $\gamma_k\neq 0$ if $k>t+1$ and $\gamma_k=0$ if $k=t$ (the Beale-Sorenson update). See Yuan (2006) for other well-known schemes such as Hestenes-Stiefel, Dixon or Conjugate-Descent. The three updates (Fletcher-Reeves, Polak-Ribiere, Beale-Sorenson) of the (non-linear) conjugate gradient are available in `optim`.

### Computing the stepsize $t_k$

Let $\phi_k(t) = f(x_k + t d_k)$ for a given direction/iterate $(d_k, x_k)$. We need conditions under which a stepsize $t_k$ is satisfactory. In the literature, we consider the descent condition, $\phi_k'(0) < 0$, and the Armijo condition, $\phi_k(t) \leq \phi_k(0) + t c_1 \phi_k'(0)$, which ensures a decrease of $f$. Nocedal \& Wright (2006) present a backtracking (or geometric) approach satisfying the Armijo condition and a minimal condition, i.e., the Goldstein and Price condition.

* set $t_{k,0}$, e.g., 1, and $0 < \alpha < 1$,
* **Repeat** until the Armijo condition is satisfied,
    + $t_{k,i+1} = \alpha \times t_{k,i}$.
* **end Repeat**

This backtracking linesearch is available in `optim`.

## Benchmark

To simplify the benchmark of optimization methods, we create a `fitbench` function that computes the desired estimation method for all optimization methods. This function is currently not exported in the package.

```{r, echo=TRUE, eval=FALSE}
fitbench <- function(data, distr, method, grad = NULL,
                     control = list(trace = 0, REPORT = 1, maxit = 1000),
                     lower = -Inf, upper = +Inf, ...)
```

```{r, echo=FALSE}
fitbench <- fitdistrplus:::fitbench
```

# Numerical illustration with the beta distribution

## Log-likelihood function and its gradient for beta distribution

### Theoretical value

The density of the beta distribution is given by
$$
f(x; \delta_1,\delta_2) = \frac{x^{\delta_1-1}(1-x)^{\delta_2-1}}{\beta(\delta_1,\delta_2)},
$$
where $\beta$ denotes the beta function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/. We recall that $\beta(a,b)=\Gamma(a)\Gamma(b)/\Gamma(a+b)$. Thus the log-likelihood for a set of observations $(x_1,\dots,x_n)$ is
$$
\log L(\delta_1,\delta_2) = (\delta_1-1)\sum_{i=1}^n\log(x_i)+ (\delta_2-1)\sum_{i=1}^n\log(1-x_i)- n \log(\beta(\delta_1,\delta_2))
$$
The gradient with respect to $\delta_1$ and $\delta_2$ is
$$
\nabla \log L(\delta_1,\delta_2) =
\left(\begin{matrix}
\sum\limits_{i=1}^n\ln(x_i) - n\psi(\delta_1)+n\psi( \delta_1+\delta_2) \\
\sum\limits_{i=1}^n\ln(1-x_i)- n\psi(\delta_2)+n\psi( \delta_1+\delta_2)
\end{matrix}\right),
$$
where $\psi(x)=\Gamma'(x)/\Gamma(x)$ is the digamma function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/.

### `R` implementation

As in the `fitdistrplus` package, we minimize the opposite of the log-likelihood: we implement the opposite of the gradient in `grlnL`. Neither the log-likelihood nor its gradient is exported.

```{r}
lnL <- function(par, fix.arg, obs, ddistnam)
  fitdistrplus:::loglikelihood(par, fix.arg, obs, ddistnam)
grlnlbeta <- fitdistrplus:::grlnlbeta
```

## Random generation of a sample

```{r, fig.height=4, fig.width=4}
#(1) beta distribution
n <- 200
x <- rbeta(n, 3, 3/4)
grlnlbeta(c(3, 4), x) #test
hist(x, prob=TRUE)
lines(density(x), col="red")
curve(dbeta(x, 3, 3/4), col="green", add=TRUE)
legend("topleft", lty=1, col=c("red","green"), leg=c("empirical", "theoretical"))
```

## Fit Beta distribution

Define control parameters.

```{r}
ctr <- list(trace=0, REPORT=1, maxit=1000)
```

Call `mledist` with the default optimization function (`optim` implemented in the `stats` package), with and without the gradient, for the different optimization methods.

```{r}
unconstropt <- fitbench(x, "beta", "mle", grad=grlnlbeta, lower=0)
```

In the case of constrained optimization, `mledist` permits the direct use of the `constrOptim` function (also implemented in the `stats` package), which allows linear inequality constraints by using a logarithmic barrier. Here, we instead use an exp/log transformation of the shape parameters $\delta_1$ and $\delta_2$ to ensure that the shape parameters are strictly positive.

```{r}
dbeta2 <- function(x, shape1, shape2, log)
  dbeta(x, exp(shape1), exp(shape2), log=log)
#take the log of the starting values
startarg <- lapply(fitdistrplus:::startargdefault(x, "beta"), log)
#redefine the gradient for the new parametrization
grbetaexp <- function(par, obs, ...)
  grlnlbeta(exp(par), obs) * exp(par)

expopt <- fitbench(x, distr="beta2", method="mle", grad=grbetaexp, start=startarg)
#get back to original parametrization
expopt[c("fitted shape1", "fitted shape2"), ] <- exp(expopt[c("fitted shape1", "fitted shape2"), ])
```

Then we extract the values of the fitted parameters, the value of the corresponding log-likelihood and the number of counts of calls to the function to minimize and to its gradient (whether it is the theoretical gradient or the numerically approximated one).
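As a side note on the conjugate gradient updates presented earlier, `optim`'s `CG` method selects the update scheme through `control$type` (1 = Fletcher-Reeves, 2 = Polak-Ribiere, 3 = Beale-Sorenson; see `?optim`), and this control list can be passed through `mledist`:

```{r cgtypesketch, eval=FALSE}
# Conjugate gradient with the Polak-Ribiere update instead of the default
mledist(x, "beta", optim.method = "CG",
        control = list(type = 2, maxit = 1000))$estimate
```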
## Results of the numerical investigation

Results are displayed in the following tables: (1) the original parametrization without specifying the gradient (`-B` stands for bounded version), (2) the original parametrization with the (true) gradient (`-B` stands for bounded version and `-G` for gradient), (3) the log-transformed parametrization without specifying the gradient, and (4) the log-transformed parametrization with the (true) gradient (`-G` stands for gradient).

```{r, results='asis', echo=FALSE}
kable(unconstropt[, grep("G-", colnames(unconstropt), invert=TRUE)], digits=3)
```

```{r, results='asis', echo=FALSE}
kable(unconstropt[, grep("G-", colnames(unconstropt))], digits=3)
```

```{r, results='asis', echo=FALSE}
kable(expopt[, grep("G-", colnames(expopt), invert=TRUE)], digits=3)
```

```{r, results='asis', echo=FALSE}
kable(expopt[, grep("G-", colnames(expopt))], digits=3)
```

Using `llsurface`, we plot the log-likelihood surface around the true value (green) and the fitted parameters (red).

```{r, fig.width=4, fig.height=4}
llsurface(min.arg=c(0.1, 0.1), max.arg=c(7, 3), plot.arg=c("shape1", "shape2"),
          nlev=25, plot.np=50, data=x, distr="beta", back.col = FALSE)
points(unconstropt[1,"BFGS"], unconstropt[2,"BFGS"], pch="+", col="red")
points(3, 3/4, pch="x", col="green")
```

We can simulate bootstrap replicates using the `bootdist` function.

```{r, fig.width=4, fig.height=4}
b1 <- bootdist(fitdist(x, "beta", method = "mle", optim.method = "BFGS"),
               niter = 100, parallel = "snow", ncpus = 2)
summary(b1)
plot(b1)
abline(v = 3, h = 3/4, col = "red", lwd = 1.5)
```

# Numerical illustration with the negative binomial distribution

## Log-likelihood function and its gradient for negative binomial distribution

### Theoretical value

The p.m.f. of the negative binomial distribution is given by
$$
f(x; m,p) = \frac{\Gamma(x+m)}{\Gamma(m)x!} p^m (1-p)^x,
$$
where $\Gamma$ denotes the gamma function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/. There exists an alternative representation where $\mu=m (1-p)/p$ or equivalently $p=m/(m+\mu)$. Thus, the log-likelihood for a set of observations $(x_1,\dots,x_n)$ is
$$
\log L(m,p) = \sum_{i=1}^{n} \log\Gamma(x_i+m) -n\log\Gamma(m) -\sum_{i=1}^{n} \log(x_i!) + mn\log(p) +\sum_{i=1}^{n} {x_i}\log(1-p)
$$
The gradient with respect to $m$ and $p$ is
$$
\nabla \log L(m,p) =
\left(\begin{matrix}
\sum_{i=1}^{n} \psi(x_i+m) -n \psi(m) + n\log(p) \\
mn/p -\sum_{i=1}^{n} {x_i}/(1-p)
\end{matrix}\right),
$$
where $\psi(x)=\Gamma'(x)/\Gamma(x)$ is the digamma function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/.

### `R` implementation

As in the `fitdistrplus` package, we minimize the opposite of the log-likelihood: we implement the opposite of the gradient in `grlnL`.

```{r}
grlnlNB <- function(x, obs, ...) {
  m <- x[1]
  p <- x[2]
  n <- length(obs)
  c(sum(psigamma(obs+m)) - n*psigamma(m) + n*log(p),
    m*n/p - sum(obs)/(1-p))
}
```

## Random generation of a sample

```{r, fig.height=4, fig.width=4}
#(2) negative binomial distribution
n <- 200
trueval <- c("size"=10, "prob"=3/4, "mu"=10/3)
x <- rnbinom(n, trueval["size"], trueval["prob"])

hist(x, prob=TRUE, ylim=c(0, .3))
lines(density(x), col="red")
points(min(x):max(x), dnbinom(min(x):max(x), trueval["size"], trueval["prob"]),
       col = "green")
legend("topleft", lty = 1, col = c("red", "green"), leg = c("empirical", "theoretical"))
```

## Fit a negative binomial distribution

Define control parameters and make the benchmark.
```{r}
ctr <- list(trace = 0, REPORT = 1, maxit = 1000)
unconstropt <- fitbench(x, "nbinom", "mle", grad = grlnlNB, lower = 0)
# compute the fitted prob from the fitted size and mu, using p = m/(m+mu)
unconstropt <- rbind(unconstropt, 
                     "fitted prob" = unconstropt["fitted size", ] / 
                       (unconstropt["fitted size", ] + unconstropt["fitted mu", ]))
```

In the case of constrained optimization, `mledist` permits the direct use of the `constrOptim` function (also from the `stats` package), which handles linear inequality constraints by using a logarithmic barrier. Here we instead use a log transformation of the size parameter $m$ and a logit transformation of the probability parameter $p$ to ensure that $m>0$ and $p\in(0,1)$.

```{r}
dnbinom2 <- function(x, size, prob, log)
  dnbinom(x, exp(size), 1 / (1 + exp(-prob)), log = log)
# transform starting values: convert mu to prob, then apply the log/logit scales
startarg <- fitdistrplus:::startargdefault(x, "nbinom")
startarg$mu <- startarg$size / (startarg$size + startarg$mu)
startarg <- list(size = log(startarg[[1]]), 
                 prob = log(startarg[[2]] / (1 - startarg[[2]])))
# redefine the gradient for the new parametrization
Trans <- function(x) c(exp(x[1]), plogis(x[2]))
grNBexp <- function(par, obs, ...) 
  grlnlNB(Trans(par), obs) * c(exp(par[1]), plogis(par[2]) * (1 - plogis(par[2])))

expopt <- fitbench(x, distr="nbinom2", method="mle", grad=grNBexp, start=startarg)
# get back to the original parametrization
expopt[c("fitted size", "fitted prob"), ] <- 
  apply(expopt[c("fitted size", "fitted prob"), ], 2, Trans)
```

Then we extract the values of the fitted parameters, the value of the corresponding log-likelihood, and the number of calls to the function to minimize and to its gradient (whether the theoretical gradient or the numerically approximated one).

## Results of the numerical investigation

Results are displayed in the following tables: (1) the original parametrization without specifying the gradient (`-B` stands for bounded version), (2) the original parametrization with the (true) gradient (`-B` stands for bounded version and `-G` for gradient), (3) the log-transformed parametrization without specifying the gradient, (4) the log-transformed parametrization with the (true) gradient (`-G` stands for gradient).

```{r, results='asis', echo=FALSE}
kable(unconstropt[, grep("G-", colnames(unconstropt), invert=TRUE)], digits=3)
```

```{r, results='asis', echo=FALSE}
kable(unconstropt[, grep("G-", colnames(unconstropt))], digits=3)
```

```{r, results='asis', echo=FALSE}
kable(expopt[, grep("G-", colnames(expopt), invert=TRUE)], digits=3)
```

```{r, results='asis', echo=FALSE}
kable(expopt[, grep("G-", colnames(expopt))], digits=3)
```

Using `llsurface`, we plot the log-likelihood surface around the true value (green) and the fitted parameters (red).

```{r, fig.width=4, fig.height=4}
llsurface(min.arg = c(5, 0.3), max.arg = c(15, 1), plot.arg = c("size", "prob"), nlev = 25,
          plot.np = 50, data = x, distr = "nbinom", back.col = FALSE)
points(unconstropt["fitted size", "BFGS"], unconstropt["fitted prob", "BFGS"], 
       pch = "+", col = "red")
points(trueval["size"], trueval["prob"], pch = "x", col = "green")
```

We can simulate bootstrap replicates using the `bootdist` function.

```{r, fig.width=4, fig.height=4}
b1 <- bootdist(fitdist(x, "nbinom", method = "mle", optim.method = "BFGS"), 
               niter = 100, parallel = "snow", ncpus = 2)
summary(b1)
plot(b1)
abline(v = trueval["size"], h = trueval["mu"], col = "red", lwd = 1.5)
```

# Conclusion

Based on the two previous examples, we observe that all methods converge to the same point. This is reassuring. However, the number of function evaluations (and of gradient evaluations) differs greatly from one method to another.
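These evaluation counts are the kind of information that `optim` itself reports in the `counts` component of its return value. A minimal sketch, reusing the negative binomial sample `x` and a hypothetical objective `nll` written only for this illustration:

```{r}
# minimal illustration: optim() reports the number of calls to the objective
# function and to its gradient in the `counts` component of its result
nll <- function(par) -sum(dnbinom(x, size = par[1], mu = par[2], log = TRUE))
optim(c(1, mean(x)), nll, method = "L-BFGS-B", lower = 1e-8)$counts
```

Here the gradient count refers to the finite-difference approximations computed internally, since no gradient function is supplied.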
Furthermore, specifying the true gradient of the log-likelihood does not help the fitting procedure at all and generally slows down convergence. Overall, the best method is the standard BFGS method, or the BFGS method with the exponential transformation of the parameters. Since the exponential function is differentiable, the asymptotic properties are still preserved (by the delta method), but in finite samples this may produce a small bias.

fitdistrplus/vignettes/fitdistrplus.bib0000644000176200001440000012026414410757112020237 0ustar liggesusers%% This BibTeX bibliography file was created using BibDesk. %% http://bibdesk.sourceforge.net/ %% Created for Christophe Dutang at 2013-05-15 16:49:04 +0200 %% Saved with string encoding Unicode (UTF-8) @manual{R13, Address = {Vienna, Austria}, Author = {{R Development Core Team}}, Title = {R: A Language and Environment for Statistical Computing}, Url = {https://www.r-project.org/}, Year = {2013}} @article{hyndmanfan96, Author = {R.J. Hyndman and Y. Fan}, Date-Modified = {2013-05-15 16:18:15 +0200}, Journal = {The American Statistician}, Pages = {361-365}, Title = {{Sample Quantiles in Statistical Packages}}, Volume = {50}, Year = {1996}} @book{casellaberger02, Author = {G. Casella and R.L. Berger}, Date-Modified = {2013-05-15 16:09:07 +0200}, Edition = {2nd}, Publisher = {Duxbury Thomson Learning}, Title = {Statistical Inference}, Year = {2002}} @book{efrontibshirani94, Author = {B. Efron and R.J. Tibshirani}, Date-Added = {2013-03-25 08:56:42 +0100}, Date-Modified = {2013-05-15 16:10:02 +0200}, Edition = {1st}, Publisher = {Chapman \& Hall}, Title = {An Introduction to the Bootstrap}, Year = {1994}} @article{kefford07, Author = {B.J. Kefford and E.J. Fields and C. Clay and D. Nugegoda}, Date-Added = {2013-03-06 21:41:48 +0100}, Date-Modified = {2013-05-15 16:20:52 +0200}, Journal = {Marine and Freshwater Research}, Pages = {1019-1031}, Title = {{Salinity Tolerance of Riverine Macroinvertebrates from the Southern Murray-Darling Basin}}, Volume = {58}, Year = {2007}} @book{helsel05, Author = {D.R. Helsel}, Date-Added = {2013-03-11 16:55:34 +0100}, Date-Modified = {2013-03-11 16:55:34 +0100}, Edition = {1st}, Publisher = {John Wiley \& Sons}, Title = {Nondetects and Data Analysis: Statistics for Censored Environmental Data}, Year = {2005}} @book{kleinmoeschberger03, Author = {J.P. Klein and M.L. Moeschberger}, Date-Added = {2013-03-06 21:37:34 +0100}, Date-Modified = {2013-03-06 21:38:47 +0100}, Edition = {2nd}, Publisher = {Springer-Verlag}, Title = {Survival Analysis: Techniques for Censored and Truncated Data}, Year = {2003}} @article{jordan05, Author = {D. Jordan}, Date-Added = {2013-03-06 18:21:50 +0100}, Date-Modified = {2013-05-15 16:20:35 +0200}, Journal = {Preventive Veterinary Medicine}, Number = {1-2}, Pages = {59-73}, Title = {Simulating the Sensitivity of Pooled-sample Herd Tests for Fecal Salmonella in Cattle}, Volume = {70}, Year = {2005}} @article{hiranoetal94, Author = {S.S. Hirano and M.K. Clayton and C.D.
Upper}, Date-Added = {2013-03-06 18:15:38 +0100}, Date-Modified = {2013-05-15 16:18:01 +0200}, Journal = {Phytopathology}, Number = {9}, Pages = {934-940}, Title = {{Estimation of and Temporal Changes in Means and Variances of Populations of \emph{Pseudomonas syringae} on Snap Bean Leaflets}}, Volume = {84}, Year = {1994}} @manual{actuar12, Author = {Goulet, V.}, Date-Added = {2012-11-07 17:28:38 +0100}, Date-Modified = {2013-05-15 16:44:30 +0200}, Note = {R~package version~1.1-5}, Title = {\pkg{actuar}: An R Package for Actuarial Science}, Url = {https://cran.r-project.org/package=actuar}, Year = 2012 } @article{mcneil97, Author = {A.J. McNeil}, Date-Added = {2012-11-02 18:47:05 +0100}, Date-Modified = {2013-05-15 16:18:38 +0200}, Journal = {ASTIN Bulletin}, Number = {1}, Pages = {117-137}, Title = {Estimating the Tails of Loss Severity Distributions Using Extreme Value Theory}, Volume = {27}, Year = {1997}} @article{distrModJSS, Author = {M. Kohl and P. Ruckdeschel}, Date-Added = {2012-10-18 18:41:33 +0200}, Date-Modified = {2012-10-18 18:42:23 +0200}, Journal = {Journal of Statistical Software}, Number = {10}, Pages = {1-27}, Title = {{R Package \pkg{distrMod}: S4 Classes and Methods for Probability Models}}, Volume = {35}, Year = {2010}} @article{actuarJSS, Author = {C. Dutang and V. Goulet and M. Pigeon}, Date-Added = {2012-10-18 18:32:43 +0200}, Date-Modified = {2012-10-18 18:35:33 +0200}, Journal = {Journal of Statistical Software}, Number = {7}, Pages = {1-37}, Title = {{\pkg{actuar}: an R Package for Actuarial Science}}, Volume = {25}, Year = {2008}} @book{Cullen99, Author = {Cullen, A.C. and Frey, H.C.}, Date-Modified = {2013-05-15 16:09:40 +0200}, Edition = {1st}, Language = {english}, Publisher = {Plenum Publishing Co.}, Title = {Probabilistic Techniques in Exposure Assessment}, Year = 1999} @book{Vose10, Author = {Vose, D.}, Date-Modified = {2013-05-15 16:07:47 +0200}, Edition = {1st}, Language = {english}, Publisher = {John Wiley \& Sons}, Title = {Quantitative Risk Analysis. A Guide to Monte Carlo Simulation Modelling}, Year = {2010}} @book{Klugmanetal09, Author = {Klugman, S.A. and Panjer, H.H. and Willmot, G.E.}, Date-Modified = {2013-05-15 16:08:27 +0200}, Edition = {3rd}, Language = {english}, Publisher = {John Wiley \& Sons}, Title = {Loss Models: from Data to Decisions}, Year = {2009}} @book{Stephens86, Author = {D'Agostino, R.B. and Stephens, M.A.}, Date-Modified = {2013-05-15 16:10:10 +0200}, Edition = {1st}, Language = {english}, Publisher = {Dekker}, Title = {Goodness-of-Fit Techniques}, Year = {1986}} @book{MASS, Author = {Venables, W. N. and Ripley, B. 
D.}, Edition = {4th}, Language = {english}, Publisher = {Springer-Verlag}, Title = {Modern Applied Statistics with {S}}, Year = 2010} @book{Blom, Asin = {B002LBF3JO}, Author = {Blom, G.}, Date-Modified = {2013-05-15 16:10:43 +0200}, Edition = {1st}, Language = {english}, Publisher = {John Wiley \& Sons}, Title = {Statistical Estimates and Transformed Beta Variables}, Year = 1959} @article{Luceno06, Author = {Luceno, A.}, Date-Modified = {2013-05-15 16:39:08 +0200}, Journal = {{Computational Statistics and Data Analysis}}, Month = {{NOV 15}}, Number = {{2}}, Pages = {{904-917}}, Title = {{Fitting the Generalized Pareto Distribution to Data Using Maximum Goodness-of-fit Estimators}}, Volume = {{51}}, Year = {{2006}}} @manual{survival, Author = {Therneau, T.}, Date-Modified = {2013-05-15 16:40:53 +0200}, Note = {R~package version~2.36-9}, Title = {\pkg{survival}: Survival Analysis, Including Penalized Likelihood}, Url = {https://cran.r-project.org/package=survival}, Year = {2011} } @manual{fitdistrplus, Author = {Delignette-Muller, M.L. and Pouillot, R. and Denis, J.B. and Dutang, C.}, Date-Modified = {2014-07-15 16:44:56 +0200}, Note = {R package version 1.0-2}, Title = {\pkg{fitdistrplus}: Help to Fit of a Parametric Distribution to Non-Censored or Censored Data}, Url = {https://cran.r-project.org/package=fitdistrplus}, Year = {2014} } @manual{mc2d, Author = {Pouillot, R. and Delignette-Muller, M.L. and Denis, J.B.}, Date-Modified = {2013-05-15 16:41:18 +0200}, Note = {R package version 0.1-12}, Title = {\pkg{mc2d}: Tools for Two-Dimensional Monte-Carlo Simulations}, Url = {https://cran.r-project.org/package=mc2d}, Year = {2011} } @unpublished{Ricci05, Author = {Ricci, V.}, Date-Modified = {2013-05-15 16:48:58 +0200}, Language = {english}, Note = {Contributed Documentation available on CRAN}, Title = {Fitting Distributions with R}, Url = {https://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf}, Year = {2005} } @article{Pouillot10, Author = {Pouillot, R. and Delignette-Muller, M.L.}, Date-Modified = {2012-11-02 19:01:40 +0100}, Journal = {{International Journal of Food Microbiology}}, Month = {{SEP 1}}, Number = {{3}}, Pages = {{330-340}}, Title = {{Evaluating Variability and Uncertainty Separately in Microbial Quantitative Risk Assessment using two R Packages}}, Unique-Id = {{ISI:000281981700007}}, Volume = {{142}}, Year = {{2010}}, Bdsk-Url-1 = {http://dx.doi.org/10.1016/j.ijfoodmicro.2010.07.011} } @article{Delignette08, title = {Quantitative risk assessment for Escherichia coli O157:H7 in frozen ground beef patties consumed by young children in French households}, journal = {International Journal of Food Microbiology}, volume = {128}, number = {1}, pages = {158-164}, year = {2008}, note = {5th International Conference on Predictive Modelling in Foods}, doi = {https://doi.org/10.1016/j.ijfoodmicro.2008.05.040}, url = {https://www.sciencedirect.com/science/article/pii/S0168160508003103}, author = {M.L. Delignette-Muller and M. Cornu}, } @article{Busschaert10, Author = {Busschaert, P. and Geeraerd, A. H. and Uyttendaele, M. and Van Impe, J. 
F.}, Date-Modified = {2012-11-02 19:01:35 +0100}, Doi = {{10.1016/j.ijfoodmicro.2010.01.025}}, Issn = {{0168-1605}}, Journal = {{International Journal of Food Microbiology}}, Month = {{APR 15}}, Number = {{3}}, Pages = {{260-269}}, Title = {{Estimating Distributions out of Qualitative and (Semi)Quantitative Microbiological Contamination Data for Use in Risk Assessment}}, Unique-Id = {{ISI:000276819700010}}, Volume = {{138}}, Year = {{2010}}, Bdsk-Url-1 = {http://dx.doi.org/10.1016/j.ijfoodmicro.2010.01.025} } @article{Hose04, Author = {Hose, G.C. and Van den Brink, P.J.}, Date-Modified = {2013-03-06 21:55:00 +0100}, Journal = {{Archives of Environmental Contamination and Toxicology}}, Month = {{OCT}}, Number = {{4}}, Pages = {{511-520}}, Title = {{Confirming the Species-Sensitivity Distribution Concept for Endosulfan Using Laboratory, Mesocosm, and Field Data}}, Unique-Id = {{ISI:000224753800011}}, Volume = {{47}}, Year = {{2004}}, Bdsk-Url-1 = {http://dx.doi.org/10.1007/s00244-003-3212-5} } @article{Fromont01, Author = {Fromont, E and Morvilliers, L and Artois, M and Pontier, D}, Date-Modified = {2012-11-02 19:00:22 +0100}, Journal = {{Parasitology}}, Month = {{AUG}}, Number = {{Part 2}}, Pages = {{143-151}}, Title = {{Parasite Richness and Abundance in Insular and Mainland Feral Cats: Insularity or Density?}}, Unique-Id = {{ISI:000170390500004}}, Volume = {{123}}, Year = {{2001}}, Bdsk-Url-1 = {http://dx.doi.org/10.1017/S0031182001008277} } @article{Turnbull74, Author = {Turnbull, B.W.}, Date-Modified = {2013-03-06 21:55:28 +0100}, Journal = {{Journal of the American Statistical Association}}, Number = {{345}}, Pages = {{169-173}}, Title = {{Nonparametric Estimation of a Survivorship Function with Doubly Censored Data}}, Unique-Id = {{ISI:A1974S755300025}}, Volume = {{69}}, Year = {{1974}}, Bdsk-Url-1 = {http://dx.doi.org/10.2307/2285518} } @article{Wang2007, title={On fast computation of the non-parametric maximum likelihood estimate of a mixing distribution}, author={Wang, Yong}, journal={Journal of the Royal Statistical Society: Series B (Statistical Methodology)}, volume={69}, number={2}, pages={185--198}, year={2007}, publisher={Wiley Online Library} } @article{Wang2008, title={Dimension-reduced nonparametric maximum likelihood computation for interval-censored data}, author={Wang, Yong}, journal={Computational Statistics \& Data Analysis}, volume={52}, number={5}, pages={2388--2402}, year={2008}, publisher={Elsevier} } @article{Wang2013, title={Efficient computation of nonparametric survival functions via a hierarchical mixture formulation}, author={Wang, Yong and Taylor, Stephen M}, journal={Statistics and Computing}, volume={23}, number={6}, pages={713--725}, year={2013}, publisher={Springer} } @article{Wang2018, title={Nonparametric maximum likelihood computation of a U-shaped hazard function}, author={Wang, Yong and Fani, Shabnam}, journal={Statistics and Computing}, volume={28}, number={1}, pages={187--200}, year={2018}, publisher={Springer} } @book{Posthuma2010, Author = {Posthuma, L. and Suter, G.W. 
and Traas, T.P.}, Publisher = {Taylor \& Francis}, Series = {Environmental and Ecological Risk Assessment Series}, Title = {Species Sensitivity Distributions in Ecotoxicology}, Year = {2010}} @book{Tse2009, Author = {Tse, Y.K.}, Date-Modified = {2013-05-15 16:08:20 +0200}, Edition = {1st}, Lccn = {2009520513}, Publisher = {Cambridge University Press}, Series = {International Series on Actuarial Science}, Title = {Nonlife Actuarial Models: Theory, Methods and Evaluation}, Year = {2009}} %%%%%%%%%%%%%%%%%%%%%%%%%% papers citing fitdistrplus %%%%%%%%%%%%%%%%%%%% @article{Wu2013b, Author = {Xing Zheng Wu}, Date-Added = {2014-02-04 12:23:42 +0100}, Date-Modified = {2014-02-04 12:24:29 +0100}, Journal = {Soils and Foundations}, Number = {4}, Pages = {540 - 556}, Title = {Trivariate Analysis of Soil Ranking-Correlated Characteristics and its Application to Probabilistic Stability Assessments in Geotechnical Engineering Problems}, Volume = {53}, Year = {2013}, Bdsk-Url-1 = {http://www.sciencedirect.com/science/article/pii/S003808061300070X}, Bdsk-Url-2 = {http://dx.doi.org/10.1016/j.sandf.2013.06.006}} @mastersthesis{lebaut13, Author = {J. Le Baut}, Date-Added = {2014-02-04 12:15:16 +0100}, Date-Modified = {2014-02-04 12:19:59 +0100}, School = {Institut Universitaire Europ\'een de la mer, Universit\'e de Brest}, Title = {Evaluation d'un stock de poissons pour lequel les donn\'ees sont d\'eficientes : Application au stock de sardine (Sardina pilchardus) du golfe de Gascogne, de Manche et de mer Celtique}, Year = {2013}} @article{Zhang2013, Author = {Yu Zhang and Emad Habib and Robert J. Kuligowski and Dongsoo Kim}, Date-Added = {2014-02-04 12:12:17 +0100}, Date-Modified = {2014-02-04 12:12:35 +0100}, Journal = {Advances in Water Resources}, Pages = {133 - 145}, Title = {Joint Distribution of Multiplicative Errors in Radar and Satellite \{QPEs\} and its Use in Estimating the Conditional Exceedance Probability}, Volume = {59}, Year = {2013}, Bdsk-Url-1 = {http://www.sciencedirect.com/science/article/pii/S030917081300105X}, Bdsk-Url-2 = {http://dx.doi.org/10.1016/j.advwatres.2013.06.004}} @article{Rigaux2014, Author = {Cl{\'e}mence Rigaux and St{\'e}phane Andr{\'e} and Isabelle Albert and Fr{\'e}d{\'e}ric Carlin}, Date-Added = {2014-02-04 12:10:00 +0100}, Date-Modified = {2014-02-04 12:10:13 +0100}, Journal = {International Journal of Food Microbiology}, Pages = {119 - 128}, Title = {Quantitative Assessment of the Risk of Microbial Spoilage in Foods. Prediction of Non-Stability at 55$\,^{\circ}$C Caused by \emph{Geobacillus stearothermophilus} in Canned Green Beans}, Volume = {171}, Year = {2014}, Bdsk-Url-1 = {http://www.sciencedirect.com/science/article/pii/S0168160513005357}, Bdsk-Url-2 = {http://dx.doi.org/10.1016/j.ijfoodmicro.2013.11.014}} @article{eiketal13, Author = {M. Eik and K. Luhmus and M. Tigasson and M. Listak and J. Puttonen and H. Herrmann}, Date-Added = {2014-02-04 12:06:14 +0100}, Date-Modified = {2014-02-04 12:07:44 +0100}, Journal = {Journal of Materials Science}, Number = {10}, Pages = {3745-3759}, Title = {{DC-Conductivity Testing Combined with Photometry for Measuring Fibre Orientations in SFRC}}, Volume = {48}, Year = {2013}} @mastersthesis{gustafssonetal13, Author = {A. Gustafsson and K. 
Fagerstroem}, Date-Added = {2014-02-04 12:00:47 +0100}, Date-Modified = {2014-02-04 12:02:27 +0100}, School = {Faculty of Engineering, Lund University}, Title = {Modelling Risk in Forest Insurance: Extreme Value and Frequency Analysis of Insurance Claims Due to Storm Damaged Forest}, Year = {2013}} @article{tikoleetal13, Author = {S. Tikole and V. Jaravine and V. Yu Orekhov and P. Guentert}, Date-Added = {2014-02-04 11:57:02 +0100}, Date-Modified = {2014-02-04 11:58:27 +0100}, Journal = {PloS one}, Title = {{Effects of NMR spectral resolution on protein structure calculation}}, Volume = {8}, Number = {7}, Pages = {e68567}, Year = {2013}} @unpublished{mukhopadhyay13, Author = {S. Mukhopadhyay}, Date-Added = {2014-02-04 11:54:02 +0100}, Date-Modified = {2014-02-04 11:55:14 +0100}, Note = {arXiv}, Title = {{CDfdr: A comparison density approach to local false discovery rate}}, Year = {2013}} @mastersthesis{feng13, Author = {Z. Feng}, Date-Added = {2014-02-04 11:52:24 +0100}, Date-Modified = {2014-02-04 11:53:07 +0100}, School = {Norwegian University of Science and Technology}, Title = {Markov process applied to degradation modelling: different modelling alternatives and their properties}, Year = {2013}} @phdthesis{duarte13, Author = {A.S. Ribeiro Duarte}, Date-Added = {2014-02-04 11:49:31 +0100}, Date-Modified = {2014-02-04 11:50:31 +0100}, School = {Technical University of Denmark and National Food Institute}, Title = {The interpretation of quantitative microbial data: meeting the demands of quantitative microbiological risk assessment}, Year = {2013}} @article{fiorellietal13, Author = {L.E. Fiorelli and M.D. Ezcurra and E.M. Hechenleitner and E. Arga\~{n}araz and R. Jeremias and A. Taborda and M.J. Trotteyn and M. Bel\'en von Baczko and J.B. Desojo}, Date-Added = {2014-02-04 11:45:15 +0100}, Date-Modified = {2014-02-04 11:47:47 +0100}, Journal = {Scientific Reports}, Title = {{The Oldest Known Communal Latrines Provide Evidence of Gregarism in Triassic Megaherbivores}}, Volume = {3}, Number = {3348}, Pages = {1-7}, Year = {2013}} @article{vaninsky13, Author = {A.Y. Vaninsky}, Date-Added = {2014-02-04 11:40:36 +0100}, Date-Modified = {2014-02-04 11:41:26 +0100}, Journal = {American Journal of Applied Mathematics and Statistics}, Number = {4}, Pages = {57-63}, Title = {{Stochastic DEA with a Perfect Object and Its Application to Analysis of Environmental Efficiency}}, Volume = {1}, Year = {2013}} @article{staggeetal13, Author = {Stagge, J. H. and Moglen, G. E.}, Date-Added = {2014-02-04 11:36:49 +0100}, Date-Modified = {2014-02-04 11:37:27 +0100}, Journal = {Water Resources Research}, Number = {10}, Pages = {6179-6193}, Title = {A Nonparametric Stochastic Method for Generating Daily Climate-Adjusted Streamflows}, Volume = {49}, Year = {2013}, Bdsk-Url-1 = {http://dx.doi.org/10.1002/wrcr.20448}} @article{wu13, Author = {Xing Zheng Wu}, Date-Added = {2014-02-03 21:54:24 +0100}, Date-Modified = {2014-02-03 21:55:15 +0100}, Journal = {Computational Geosciences}, Number = {5}, Pages = {739-755}, Title = {Probabilistic Slope Stability Analysis by a Copula-Based Sampling Method}, Volume = {17}, Year = {2013}} @article{vianaetal13, Author = {Viana, D. S. and Santamar\'a, L. and Michot, T. C. 
and Figuerola, J.}, Date-Added = {2014-02-03 21:49:39 +0100}, Date-Modified = {2014-02-03 21:50:54 +0100}, Journal = {The American Naturalist}, Number = {5}, Pages = {649-662}, Title = {Allometric Scaling of Long-Distance Seed Dispersal by Migratory Birds}, Volume = {181}, Year = {2013}} @article{wayland13, Author = {M.T. Wayland}, Date-Added = {2014-02-03 21:46:13 +0100}, Date-Modified = {2014-02-03 21:47:06 +0100}, Journal = {Biodiversity Data Journal}, Title = {{Morphological Variation in \emph{Echinorhynchus truttae} Schrank, 1788 and the \emph{Echinorhynchus bothniensis} Zdzitowiecki \& Valtonen, 1987 species complex from freshwater fishes of northern Europe}}, Volume = {1}, Pages = {e975}, Year = {2013}} @article{westphalfitch13, Author = {Westphal-Fitch, G. and Fitch, W. T.}, Date-Added = {2014-02-03 21:42:15 +0100}, Date-Modified = {2014-02-03 21:43:23 +0100}, Journal = {PloS one}, Number = {9}, Title = {Spatial Analysis of ``Crazy Quilts'', a Class of Potentially Random Aesthetic Artefacts}, Volume = {8}, Pages = {e74055}, Year = {2013}} @article{larrasetal13, Author = {Floriane Larras and Bernard Montuelle and Agn\`es Bouchez}, Date-Added = {2014-02-03 21:34:58 +0100}, Date-Modified = {2014-02-03 21:36:20 +0100}, Journal = {Science of The Total Environment}, Pages = {469-477}, Title = {{Assessment of Toxicity Thresholds in Aquatic Environments: Does Benthic Growth of Diatoms Affect their Exposure and Sensitivity to Herbicides?}}, Volume = {463-464}, Year = {2013}} @article{Daelmanetal13, Author = {Jeff Daelman and Jeanne-Marie Membr\'e and Liesbeth Jacxsens and An Vermeulen and Frank Devlieghere and Mieke Uyttendaele}, Date-Added = {2014-02-03 21:23:53 +0100}, Date-Modified = {2014-02-03 21:29:36 +0100}, Journal = {International Journal of Food Microbiology}, Number = {3}, Pages = {433-449}, Title = {A Quantitative Microbiological Exposure Assessment Model for \emph{Bacillus cereus} in REPFEDs}, Volume = {166}, Year = {2013}} @article{voigtetal14, Author = {Christian C. Voigt and Linn S. Lehnert and Ana G. Popa-Lisseanu and Mateusz Ciechanowski and P\'eter Est\'ok and Florian Gloza-Rausch and Tam\'as Goerfoel and Matthias Goettsche and Carsten Harrje and Meike Hoetzel and Tobias Teige and Reiner Wohlgemuth and Stephanie Kramer-Schadt}, Date-Added = {2014-02-03 21:07:48 +0100}, Date-Modified = {2014-02-03 21:21:42 +0100}, Journal = {Biodiversity and Conservation}, Volume = {23}, Pages = {617-631}, Title = {{The Trans-Boundary Importance of Artificial Bat \emph{hibernacula} in Managed European Forests}}, Year = {2014}} @article{Guillier2013471, Author = {Laurent Guillier and Corinne Danan and H\'el\`ene Bergis and Marie-Laure Delignette-Muller and Sophie Granier and Sylvie Rudelle and Annie Beaufort and Anne Brisabois}, Date-Added = {2014-02-03 21:00:28 +0100}, Date-Modified = {2014-02-03 21:04:18 +0100}, Journal = {International Journal of Food Microbiology}, Number = {3}, Pages = {471 - 478}, Title = {{Use of Quantitative Microbial Risk Assessment when Investigating Foodborne Illness Outbreaks: the Example of a Monophasic \emph{Salmonella Typhimurium} 4,5,12:i:- Outbreak Implicating Beef Burgers}}, Volume = {166}, Year = {2013}} @article{satoetal13, Author = {Maria Ines Z. 
Sato and Ana Tereza Galvani and Jose Antonio Padula and Adelaide Cassia Nardocci and Marcelo de Souza Lauretto and Maria Tereza Pepe Razzolini and Elayse Maria Hachich}, Date-Added = {2014-02-03 19:20:58 +0100}, Date-Modified = {2014-02-03 19:22:42 +0100}, Journal = {Science of The Total Environment}, Pages = {389-396}, Title = {{Assessing the Infection Risk of \emph{Giardia} and \emph{Cryptosporidium} in Public Drinking Water Delivered by Surface Water Systems in Sao Paulo State, Brazil}}, Volume = {442}, Year = {2013}} @article{callauetal13, Author = {Callau Poduje, Ana Claudia and Belli, Aslan and Haberlandt, Uwe}, Date-Added = {2014-02-03 19:19:01 +0100}, Date-Modified = {2014-02-03 19:20:14 +0100}, Doi = {10.1080/02626667.2013.871014}, Journal = {Hydrological Sciences Journal}, Title = {Dam Risk Assessment Based on Univariate versus Bivariate Statistical Approaches - a Case Study for Argentina}, Year = {2013}} @article{meyeretal13, Author = {Meyer, W. K. and Zhang, S. and Hayakawa, S. and Imai, H. and Przeworski, M.}, Date-Added = {2014-02-03 19:15:45 +0100}, Date-Modified = {2014-02-03 19:17:29 +0100}, Journal = {American Journal of Physical Anthropology}, Number = {3}, Title = {The Convergent Evolution of Blue Iris Pigmentation in Primates Took Distinct Molecular Paths}, Volume = {151}, Pages = {398-407}, Year = {2013}} @article{prosseretal13, Author = {D.J. Prosser and L.L. Hungerford and R.M. Erwin and M.A. Ottinger and J.Y. Takekawa and E.C. Ellis}, Date-Added = {2014-02-03 19:11:30 +0100}, Date-Modified = {2014-02-03 19:12:43 +0100}, Journal = {Frontiers in Public Health}, Title = {Mapping Avian Influenza Transmission Risk at the Interface of Domestic Poultry and Wild Birds}, Volume = {1}, Number = {28}, Pages = {1-11}, Year = {2013}} @article{srinivasanetal2013, Author = {S. Srinivasan and T.P. Sorrell and J.P. Brooks and D.J. Edwards and R. Diehl McDougle}, Date-Added = {2014-02-03 19:08:28 +0100}, Date-Modified = {2014-02-03 19:09:55 +0100}, Journal = {Policing: An International Journal of Police Strategies \& Management}, Number = {4}, Pages = {702-718}, Title = {{Workforce Assessment Method for an Urban Police Department: Using Analytics to Estimate Patrol Staffing}}, Volume = {36}, Year = {2013}} @article{drakeetal2014, Author = {T. Drake and Z. Chalabi and R. Coker}, Date-Added = {2014-02-03 19:02:50 +0100}, Date-Modified = {2014-02-03 19:03:43 +0100}, Journal = {Health Policy and Planning}, Title = {{Buy Now, saved Later? The Critical Impact of Time-to-Pandemic Uncertainty on Pandemic Cost-Effectiveness Analyses}}, Doi = {10.1093/heapol/czt101}, Year = {2014}} @article{contrerasetal2013, Author = {V. De La Huerta Contreras and H. Vaquera Huerta and B.C. Arnold}, Date-Added = {2014-02-03 18:58:47 +0100}, Date-Modified = {2014-02-03 19:00:41 +0100}, Journal = {Journal of Statistical Computation and Simulation}, Doi = {10.1080/00949655.2013.825095}, Title = {A Test for Equality of Variance with Censored Samples}, Year = {2013}} @article{sosaetal2013, Author = {A. Samuel-Rosa and R. Simao Diniz Dalmolin and P. Miguel}, Date-Added = {2014-02-03 18:56:07 +0100}, Date-Modified = {2014-02-03 22:02:48 +0100}, Journal = {Revista Brasileira de Ciencia do Solo}, Pages = {422-430}, Title = {Building Predictive Models of Soil Particle-Size Distribution}, Volume = {37}, Year = {2013}} @article{simoetal13, Author = {J. Sim\'o and Francesc Casa{\~n}a and J. 
Sabat\'e}, Date-Added = {2014-02-03 18:46:51 +0100}, Date-Modified = {2014-02-03 18:49:34 +0100}, Journal = {Statistics and Operations Research Transactions}, Number = {1}, Title = {{Modelling ``cal\c{c}ots'' (\emph{Alium cepa L.}) Growth by Gompertz Function}}, Volume = {37}, Pages = {95-106}, Year = {2013}} @unpublished{Kingetal2013, Author = {G. Kon Kam King and P. Veber and S. Charles and M.-L. Delignette-Muller}, Date-Added = {2014-02-03 18:41:16 +0100}, Date-Modified = {2014-02-03 18:42:28 +0100}, Note = {preprint on arXiv}, Title = {{MOSAIC SSD: a New Web-Tool for the Species Sensitivity Distribution, Allowing to Include Censored Data by Maximum Likelihood}}, Year = {2013}} @article{nadarajahbakar2013, Author = {S. Nadarajah and S.A.A. Bakar}, Date-Added = {2014-02-03 18:36:31 +0100}, Date-Modified = {2014-02-03 18:37:29 +0100}, Journal = {R journal}, Number = {2}, Pages = {98-104}, Title = {{CompLognormal: An R Package for Composite Lognormal Distributions.}}, Volume = {5}, Year = {2013}} @article{mala2013publi, Author = {I. Mal\'a}, Date-Added = {2014-02-03 18:33:08 +0100}, Date-Modified = {2014-02-03 18:34:37 +0100}, Journal = {Research Journal of Economics, Business and ICT}, Number = {2}, Pages = {55-61}, Title = {The Use of Finite Mixtures of Lognormal and Gamma Distributions}, Volume = {8}, Year = {2013}} @inproceedings{mala2013, Author = {I. Mal\'a}, Booktitle = {The seventh International Days of Statistics and Economics}, Date-Added = {2014-02-03 18:29:50 +0100}, Date-Modified = {2014-02-03 18:31:12 +0100}, Title = {Finite mixtures of lognormal and gamma distributions}, Year = {2013}} @mastersthesis{rebuge12, Author = {A.J.d.S. Rebuge}, Date-Added = {2013-02-10 12:02:10 +0100}, Date-Modified = {2013-05-15 16:32:28 +0200}, School = {Universidade Tecnica de Lisboa}, Title = {Business Process Analysis in Healthcare Environments}, Year = {2012}} @mastersthesis{nordan12, Author = {R.P.V. Nordan}, Date-Added = {2013-02-10 11:55:59 +0100}, Date-Modified = {2013-05-15 16:32:52 +0200}, School = {Norwegian University of Science and Technology}, Title = {An Investigation of Potential Methods for Topology Preservation in Interactive Vector Tile Map Applications}, Year = {2012}} @article{mandletal13, Author = {J.N. Mandl and J.P. Monteiro and N. Vrisekoop and R.N. Germain}, Date-Added = {2013-02-10 11:48:03 +0100}, Date-Modified = {2013-05-15 16:31:29 +0200}, Journal = {Immunity}, Number = {{2}}, Pages = {{263-274}}, Title = {{T Cell-Positive Selection Uses Self-Ligand Binding Strength to Optimize Repertoire Recognition of Foreign Antigens}}, Volume = {{38}}, Year = {2013}} @phdthesis{bakos11, Author = {R.O. Bakos}, Date-Added = {2013-02-10 11:42:32 +0100}, Date-Modified = {2013-05-15 16:34:11 +0200}, School = {Hungarian Veterinary Archive}, Title = {Poszm\'eh Egy\"uttesek \"Osszehasonl\'it\'o Vizsg\'alata a Cser\'epfalusi f\'as Legel\'o k\"l\"onb\"oz\'o n\"ov\'enybor\'it\'as\'u ter\"uletein}, Year = {2011}} @mastersthesis{poduje12, Author = {A.C. Callau Poduje}, Date-Added = {2013-02-10 11:34:40 +0100}, Date-Modified = {2013-05-15 16:33:15 +0200}, School = {Leibniz Universitat Hannover}, Title = {{Bivariate Analysis and Synthesis of Flood Events for the Design of Hydraulic Structures -- a Case Study for Argentina}}, Year = {2012}} @mastersthesis{garcia12, Author = {P. 
Garcia}, Date-Added = {2013-02-10 11:29:03 +0100}, Date-Modified = {2013-05-15 16:33:24 +0200}, School = {Universit\'e de Strasbourg}, Title = {{Analyse Statistique des Pannes du R\'eseau HTA}}, Year = {2012}} @article{benavidesetal12, Author = {R. Benavides-Piccione and I. Fernaud-Espinosa and V. Robles and R. Yuste and J. DeFelipe}, Date-Added = {2013-02-10 11:23:11 +0100}, Date-Modified = {2013-05-15 16:23:55 +0200}, Number = {8}, Pages = {1798-1810}, Volume = {23}, Journal = {Cerebral Cortex}, Title = {Age-Based Comparison of Human Dendritic Spine Structure Using Complete Three-Dimensional Reconstructions}, Year = {2012}, Bdsk-Url-1 = {http://dx.doi.org/10.1093/cercor/bhs154}} @article{breitbach12, Author = {N. Breitbach and K. B\"ohning-Gaese and I. Laube and M. Schleuning}, Date-Added = {2013-02-10 11:18:03 +0100}, Date-Modified = {2013-05-15 16:24:21 +0200}, Journal = {Journal of Ecology}, Number = {6}, Pages = {1349-1358}, Title = {Short Seed-Dispersal Distances and Low Seedling Recruitment in Farmland Populations of Bird-Dispersed Cherry Trees}, Volume = {100}, Year = {2012}} @article{meheustetal12, Author = {D. M\'eheust and P. Le Cann and T. Reponen and J. Wakefield and S. Vesper}, Date-Added = {2013-02-10 11:10:54 +0100}, Date-Modified = {2013-05-15 16:24:39 +0200}, Journal = {International Journal of Hygiene and Environmental Health}, Number = {{3}}, Pages = {{333-340}}, Title = {{Possible Application of the Environmental Relative Moldiness Index in France: a Pilot Study in Brittany}}, Volume = {{216}}, Year = {2012}} @article{varoetal12, Author = {J.P. Gonz\'alez-Varo and J.V. L\'opez-Bao and J. Guiti\'an}, Date-Added = {2013-02-10 11:03:06 +0100}, Date-Modified = {2013-05-15 16:24:57 +0200}, Volume = {{82}}, Pages = {{562-571}}, Journal = {Journal of Animal Ecology}, Title = {Functional Diversity Among Seed Dispersal Kernels Generated by Carnivorous Mammals}, Year = {2012}, Bdsk-Url-1 = {http://dx.doi.org/10.1111/1365-2656.12024}} @article{telloetal12, Author = {A. Tello and B. Austin and T.C. Telfer}, Date-Added = {2013-02-10 10:57:51 +0100}, Date-Modified = {2013-05-15 16:25:16 +0200}, Journal = {Environmental Health Perspectives}, Number = {8}, Pages = {1100-1106}, Title = {Selective Pressure of Antibiotic Pollution on Bacteria of Importance to Public Health}, Volume = {120}, Year = {2012}} @article{orellanoetal12, Author = {P.W. Orellano and J.I. Reynoso and A. Grassi and A. Palmieri and O. Uez and O. Carlino}, Date-Added = {2013-02-10 10:51:38 +0100}, Date-Modified = {2013-05-15 16:25:38 +0200}, Journal = {Iranian Journal of Public Health}, Number = {12}, Pages = {26-29}, Title = {{Estimation of the Serial Interval for Pandemic Influenza (pH1N1) in the Most Southern Province of Argentina}}, Volume = {41}, Year = {2012}} @unpublished{pouillotetal11, Author = {R. Pouillot and M.L. Delignette-Muller and M. Cornu}, Date-Added = {2013-02-10 09:42:22 +0100}, Date-Modified = {2013-05-15 16:49:50 +0200}, Title = {{Case study: \emph{L. monocytogenes} in Cold-Smoked Salmon}}, Url = {http://CRAN.R-project.org/web/packages/mc2d/vignettes/mc2dLmEnglish.pdf}, Year = {2011}, Bdsk-Url-1 = {http://cran.r-project.org/web/packages/mc2d/vignettes/mc2dLmEnglish.pdf}} @article{luangkesornetal12, Author = {K.L. Luangkesorn and B.A. Norman and Y. Zhuang and M. Falbo and J. 
Sysko}, Date-Added = {2013-02-10 09:29:42 +0100}, Date-Modified = {2013-05-15 16:25:55 +0200}, Journal = {Interfaces}, Number = {4}, Pages = {406-409}, Title = {{Practice Summaries: Designing Disease Prevention and Screening Centers in Abu Dhabi}}, Volume = {42}, Year = {2012}} @article{tarnczi11, Author = {T. Tarnczi and V. Fenyves and Z. Bcs}, Date-Added = {2013-02-10 09:26:15 +0100}, Date-Modified = {2013-05-15 16:22:41 +0200}, Journal = {International Journal of Management Cases}, Number = {3}, Pages = {159-167}, Title = {The Business Uncertainty and Variability Management with Real Options Models Combined two Dimensional Simulation}, Volume = {13}, Year = {2011}} @article{suuronenetal12, Author = {J.P. Suuronen and A. Kallonen and M. Eik and J. Puttonen and Ritva Serimaa and Heiko Herrmann}, Date-Modified = {2013-05-15 16:26:28 +0200}, Journal = {Journal of Materials Science}, Number = {3}, Pages = {1358--1367}, Title = {Analysis of Short Fibres Orientation in Steel Fibre-Reinforced Concrete (SFRC) by X-ray Tomography}, Volume = {48}, Year = {2012}} @article{scholletal12, Author = {C.F. Scholl and C.C. Nice and J.A. Fordyce and Z. Gompert and M.L. Forister}, Date-Added = {2012-11-07 22:56:30 +0100}, Date-Modified = {2013-05-15 16:26:46 +0200}, Journal = {International Journal of Ecology}, Title = {Larval Performance in the Context of Ecological Diversification and Speciation in Lycaeides Butterflies}, Number = {ID 242154}, Pages = {1-13}, Volume = {2012}, Year = {2012}, Bdsk-Url-1 = {http://dx.doi.org/10.1155/2012/242154}} @article{saketal11, Author = {H. Sak and C. Haksoz}, Date-Added = {2012-11-07 22:53:37 +0100}, Date-Modified = {2013-05-15 16:22:57 +0200}, Journal = {Journal of Operational Risk}, Number = {3}, Pages = {15-38}, Title = {A Copula-based Simulation Model for Supply Portfolio Risk}, Volume = {6}, Year = {2011}} @mastersthesis{rosa12, Author = {A. S. Rosa}, Date-Added = {2012-11-07 22:44:37 +0100}, Date-Modified = {2013-05-15 16:33:40 +0200}, School = {Universidade Federal de Santa Maria}, Title = {Fun\c{c}\~{o}es de Predi\c{c}\~{a}o Espacial de Propriedades do Solo}, Year = {2012}} @article{poulliotetal12, Author = {R. Pouillot and K. Hoelzer and Y. Chen and S. Dennis}, Date-Added = {2012-11-07 22:41:36 +0100}, Date-Modified = {2013-05-15 16:27:28 +0200}, Journal = {Food Control}, Keywords = {cited in page 7 for MLE censored fit}, Number = {2}, Pages = {350-357}, Title = {{Estimating Probability Distributions of Bacterial Concentrations in Food Based on Data Generated Using the Most Probable Number (MPN) Method for Use in Risk Assessment}}, Volume = {29}, Year = {2012}} @article{marquetouxetal12, Author = {N. Marquetoux and M. Paul and S. Wongnarkpet and C. Poolkhet and W. Thanapongtham and F. Roger and C. Ducrot and K. Chalvet-Monfray}, Date-Added = {2012-11-07 22:37:48 +0100}, Date-Modified = {2013-05-15 16:27:56 +0200}, Journal = {Preventive Veterinary Medicine}, Number = {2}, Pages = {143-151}, Title = {{Estimating Spatial and Temporal Variations of the Reproduction Number for Highly Pathogenic Avian Influenza H5N1 Epidemic in Thailand}}, Volume = {106}, Year = {2012}} @article{lehaetal11, Author = {A. Leha and T. Beissbarth and K. Jung}, Date-Added = {2012-11-07 22:34:24 +0100}, Date-Modified = {2013-05-15 16:23:17 +0200}, Journal = {BMC Bioinformatics}, Number = {127}, Pages = {1-14}, Title = {{Sequential Interim Analyses of Survival Data in {DNA} Microarray Experiments}}, Volume = {12}, Year = {2011}} @article{kochetal12, Author = {F.H. Koch and D. 
Yemshanov and R.D. Magarey and W.D. Smith}, Date-Added = {2012-11-07 22:28:27 +0100}, Date-Modified = {2013-03-06 21:49:35 +0100}, Journal = {Journal of Economic Entomology}, Number = {2}, Pages = {438-450}, Title = {Dispersal of Invasive Forest Insects via Recreational Firewood: a Quantitative Analysis}, Volume = {105}, Year = {2012}} @article{hoelzeretal12, Author = {K. Hoelzer and R. Pouillot and D. Gallagher and M.B. Silverman and J. Kause and S. Dennis}, Date-Added = {2012-11-07 22:19:27 +0100}, Date-Modified = {2013-05-15 16:28:30 +0200}, Journal = {International Journal of Food Microbiology}, Number = {2}, Pages = {267-277}, Title = {{Estimation of \emph{Listeria Monocytogenes} Transfer Coefficients and Efficacy of Bacterial Removal Through Cleaning and Sanitation}}, Volume = {157}, Year = {2012}} @article{jongenburgeretal12, Author = {I. Jongenburger and M.W. Reij and E.P.J. Boer and M.H. Zwietering and L.G.M. Gorris}, Date-Added = {2012-11-07 22:16:14 +0100}, Date-Modified = {2013-05-15 16:28:56 +0200}, Journal = {International Journal of Food Microbiology}, Number = {1}, Pages = {35-44}, Title = {Modelling Homogeneous and Heterogeneous Microbial Contaminations in a Powdered Food Product}, Volume = {157}, Year = {2012}} @article{jaloustreetal11, Author = {S. Jaloustre and M. Cornu and E. Morelli and V. Noel and M.L. Delignette-Muller}, Date-Added = {2012-11-07 22:11:48 +0100}, Date-Modified = {2013-05-15 16:23:33 +0200}, Journal = {Food microbiology}, Number = {2}, Pages = {311-320}, Title = {{Bayesian Modeling of \emph{Clostridium perfringens} Growth in Beef-in-Sauce Products}}, Volume = {28}, Year = {2011}} @article{fusteinercostafreda12, Author = {C.H.Y. Fu and H. Steiner and S.G. Costafreda}, Date-Added = {2012-11-07 22:05:23 +0100}, Date-Modified = {2013-05-15 16:29:33 +0200}, Journal = {Neurobiology of Disease}, Pages = {75-83}, Title = {Predictive Neural Biomarkers of Clinical Response in Depression: A Meta-Analysis of Functional and Structural Neuroimaging Studies of Pharmacological and Psychological Therapies}, Volume = {52}, Year = {2012}} @article{eling12, Author = {M. Eling}, Date-Added = {2012-10-08 23:04:08 +0200}, Date-Modified = {2013-05-15 16:29:53 +0200}, Journal = {Insurance: Mathematics and Economics}, Number = {2}, Pages = {239-248}, Title = {{Fitting Insurance Claims to Skewed Distributions: Are the Skew-normal and the Skew-student Good Models?}}, Volume = {51}, Year = {2012}} @inproceedings{eikhermann12, Author = {M. Eik and H. Herrmann}, Booktitle = {the Estonian Academy of Sciences}, Date-Added = {2012-10-08 23:00:46 +0200}, Date-Modified = {2013-05-15 16:31:55 +0200}, Number = {2}, Pages = {128-136}, Title = {Raytraced Images for Testing the Reconstruction of Fibre Orientation Distributions}, Volume = {61}, Year = {2012}} @article{croucheretal12, Author = {N. J. Croucher and S. R. Harris and L. Barquist and J. Parkhill and S. D. Bentley}, Date-Added = {2012-10-08 22:56:37 +0200}, Date-Modified = {2013-03-06 21:51:01 +0100}, Journal = {PLoS Pathogens}, Number = {6}, Pages = {e1002745}, Title = {A High-Resolution View of Genome-Wide Pneumococcal Transformation}, Volume = {8}, Year = {2012}} @article{commeauetal12, Author = {N. Commeau and E. Parent and M.-L. Delignette-Muller and M. 
Cornu}, Date-Added = {2012-10-08 22:53:49 +0200}, Date-Modified = {2013-05-15 16:30:14 +0200}, Journal = {International Journal of Food Microbiology}, Pages = {146-152}, Title = {Fitting a Lognormal Distribution to Enumeration and Absence/Presence Data}, Volume = {155}, Year = {2012}} @article{busschaertetal10, Author = {P. Busschaert and A.H. Geeraerd and M. Uyttendaele and J.F. VanImpe}, Date-Added = {2012-10-08 22:48:29 +0200}, Date-Modified = {2013-05-15 16:22:08 +0200}, Journal = {International Journal of Food Microbiology}, Pages = {260-269}, Title = {Estimating Distributions out of Qualitative and (Semi)Quantitative Microbiological Contamination Data for Use in Risk Assessment}, Volume = {138}, Year = {2010}} @inproceedings{brooksetal11, Author = {J.P. Brooks and D.J. Edwards and T.P. Sorrell and S. Srinivasan and R.L. Diehl}, Booktitle = {the 2011 Winter Simulation Conference}, Date-Added = {2012-10-08 22:43:23 +0200}, Date-Modified = {2013-03-06 21:51:21 +0100}, Pages = {1770-1777}, Title = {Simulating Calls for Service for an Urban Police Department}, Year = {2011}} @article{bagariaetal12, Author = {A. Bagaria and V. Jaravine and Y.J. Huang and G.T. Montelione and P. G\"untert}, Date-Added = {2012-10-08 22:39:34 +0200}, Date-Modified = {2013-05-15 16:30:38 +0200}, Journal = {Protein Science}, Number = {2}, Pages = {229-238}, Title = {Protein Structure Validation by Generalized Linear Model Root-Mean-Square Deviation Prediction}, Volume = {21}, Year = {2012}} @article{anandetal12, Author = {P. Anand and K. Yeturu and N. Chandra}, Date-Added = {2012-10-08 22:27:41 +0200}, Date-Modified = {2013-05-15 16:30:52 +0200}, Journal = {Nucleic Acids Research}, Pages = {1-9}, Title = {PocketAnnotate: Towards Site-Based Function Annotation}, Volume = {40}, Year = {2012}} @mastersthesis{aktassjostrand11, Author = {\"O. Akta\c{s} and M. 
Sj\"ostrand}, Date-Added = {2012-10-08 22:22:14 +0200}, Date-Modified = {2013-05-15 16:32:16 +0200}, Rating = {5}, School = {School of Information Science, Computer and Electrical Engineering, Halmstad University}, Title = {Cornish-Fisher Expansion and Value-at-Risk Method in Application to Risk Management of Large Portfolios}, Year = {2011}} fitdistrplus/R/0000755000176200001440000000000014355553455013225 5ustar liggesusersfitdistrplus/R/util-checkparamlist.R0000644000176200001440000000535413767050553017321 0ustar liggesusers# checkparam function checks start.arg and fix.arg that parameters are named correctly # INPUTS # start.arg : a named list # fix.arg : NULL or a named list # argdistname : argument names of the distribution # hasnodefaultval : vector of logical indicating no default value of argument # OUTPUTS # a named list with components: ok (TRUE or FALSE), txt (NULL or the error message), # start.arg : a named list of starting values for optimization # or a function to compute them from data checkparamlist <- function(start.arg, fix.arg, argdistname, hasnodefaultval) { errtxt <- list(t3="'start' must specify names which are arguments to 'distr'.", t4="'fix.arg' must specify names which are arguments to 'distr'.", t5="A distribution parameter cannot be specified both in 'start' and 'fix.arg'.", t6="'start' should not have NA or NaN values.", t7="'fix.arg' should not have NA or NaN values.", t8="Some parameter names have no starting/fixed value: ", t9="Some parameter names have no starting/fixed value but have a default value: ") vstart <- unlist(start.arg) #check unexpected names m <- match(names(vstart), argdistname) if (any(is.na(m))) stop(errtxt$t3) #check NA/NaN values if(any(is.na(vstart)) || any(is.nan(vstart))) stop(errtxt$t6) if(!is.null(fix.arg)) { vfix <- unlist(fix.arg) #check unexpected names mfix <- match(names(vfix), argdistname) if (any(is.na(mfix))) stop(errtxt$t4) # check that some parameters are not both in fix.arg and start minter <- match(names(vstart), names(vfix)) if (any(!is.na(minter))) stop(errtxt$t5) #check NA/NaN values if(any(is.na(vfix)) || any(is.nan(vfix))) stop(errtxt$t7) allparname <- names(c(vstart, vfix)) }else allparname <- names(vstart) theoparam <- computegetparam(argdistname) #special case where both scale and rate are allowed, see ?dgamma if("scale" %in% theoparam && "rate" %in% theoparam) { errt8 <- any(!allparname %in% theoparam) || length(allparname) != length(theoparam)-1 #special case where both prob and mu are allowed, see ?dnbinom }else if(length(theoparam) == 3 && all(c("size", "prob", "mu") %in% theoparam)) { errt8 <- any(!allparname %in% theoparam) || length(allparname) != length(theoparam)-1 }else errt8 <- any(!theoparam %in% allparname) #only make a warning if unset arguments have a default value if(errt8) { unsetarg <- theoparam[!theoparam %in% allparname] if(any(hasnodefaultval[unsetarg])) stop(paste0(errtxt$t8, paste(unsetarg, collapse = ", "), ".")) else warning(paste0(errtxt$t9, paste(unsetarg, collapse = ", "), ".")) } list("start.arg"=start.arg, "fix.arg"=fix.arg) }fitdistrplus/R/cdfcompcens.R0000644000176200001440000003571614355553455015650 0ustar liggesusers############################################################################# # Copyright (c) 2011 Marie Laure Delignette-Muller # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later 
version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### plot cumulative distribution functions for various fits ### of continuous distribution(s) (fitdist results) ### on the same dataset ### ### R functions ### cdfcompcens <- function(ft, xlim, ylim, xlogscale = FALSE, ylogscale = FALSE, main, xlab, ylab, datacol, fillrect, fitlty, fitcol, fitlwd, addlegend = TRUE, legendtext, xlegend = "bottomright", ylegend = NULL, lines01 = FALSE, Turnbull.confint = FALSE, NPMLE.method = "Wang", add = FALSE, plotstyle = "graphics", ...) { if(inherits(ft, "fitdistcens")) { ft <- list(ft) }else if(!is.list(ft)) { stop("argument ft must be a list of 'fitdistcens' objects") }else { if(any(sapply(ft, function(x) !inherits(x, "fitdistcens")))) stop("argument ft must be a list of 'fitdistcens' objects") } NPMLE.method <- match.arg(NPMLE.method, c("Wang", "Turnbull.intervals", "Turnbull.middlepoints", "Turnbull")) if (NPMLE.method == "Turnbull") { warning("Turnbull is now a deprecated option for NPMLE.method. You should use Turnbull.middlepoints or Turnbull.intervals. It was here set to Turnbull.middlepoints, equivalent to the former Turnbull.") NPMLE.method <- "Turnbull.middlepoints" } if ((Turnbull.confint == TRUE) & ((NPMLE.method == "Wang") | (NPMLE.method == "Turnbull.intervals"))) { warning("When Turnbull.confint is TRUE, NPMLE.method is forced to Turnbull.middlepoints." ) NPMLE.method <- "Turnbull.middlepoints" } # check the 'plotstyle' argument plotstyle <- match.arg(plotstyle[1], choices = c("graphics", "ggplot"), several.ok = FALSE) if ((plotstyle == "ggplot") & (NPMLE.method == "Turnbull.middlepoints")) { warning("When NPMLE.method is Turnbull.middlepoints, plotstyle is forced to graphics."
) plotstyle <- "graphics" } # In future developments, it will be necessary to check that all the fits share the same weights if(!is.null(ft[[1]]$weights)) stop("cdfcompcens is not yet available when using weights") nft <- length(ft) if (missing(datacol)) datacol <- "black" if (missing(fillrect)) fillrect <- "lightgrey" if (missing(fitcol)) fitcol <- 2:(nft+1) if (missing(fitlty)) fitlty <- 1:nft if (missing(fitlwd)) fitlwd <- 1 fitcol <- rep(fitcol, length.out=nft) fitlty <- rep(fitlty, length.out=nft) fitlwd <- rep(fitlwd, length.out=nft) if (missing(xlab)) xlab <- ifelse(xlogscale, "censored data in log scale", "censored data") if (missing(ylab)) ylab <- "CDF" if (missing(main)) main <- paste("Empirical and theoretical CDFs") censdata <- ft[[1]]$censdata logxy <- paste(ifelse(xlogscale, "x", ""), ifelse(ylogscale, "y", ""), sep="") verif.ftidata <- function(fti) { if (any(fti$censdata$left != censdata$left, na.rm=TRUE) | any(fti$censdata$right != censdata$right, na.rm=TRUE)) stop("All compared fits must have been obtained with the same dataset") } l <- lapply( ft, verif.ftidata) rm(l) # calculations for Wang method, for both graphics and ggplot displays if ((NPMLE.method == "Wang") | (NPMLE.method == "Turnbull.intervals")) { f <- npmle(censdata, method = NPMLE.method) bounds <- c(f$right, f$left) finitebounds <- bounds[is.finite(bounds)] upper <- max(finitebounds) lower <- min(finitebounds) width <- upper - lower if(missing(xlim)) { if (xlogscale == TRUE) { xmin <- lower * (upper / lower)^(-0.1) xmax <- upper * (upper / lower)^0.1 } else { xmin <- lower - width * 0.1 xmax <- upper + width * 0.1 } xlim <- c(xmin, xmax) } else { xmin <- xlim[1] xmax <- xlim[2] } if ((xlogscale == TRUE) & xmin <= 0) stop("log transformation of data requires only positive values") if (xlogscale == TRUE) { xmininf <- lower * (upper / lower)^(-10) # 10 to be very large xmaxinf <- upper * (upper / lower)^10 } else { xmininf <- lower - width * 10 xmaxinf <- upper + width * 10 } k <- length(f$left) Fnpsurv <- cumsum(f$p) ## calculation of points for Q and P in graphs Fbefore <- c(0, Fnpsurv[-k]) df <- data.frame(left = f$left, right = f$right) # Definition of vertices of each rectangle Qi.left <- df$left # dim k Qi.left4plot <- Qi.left if (is.infinite(Qi.left4plot[1]) | is.nan(Qi.left4plot[1])) Qi.left4plot[1] <- xmininf Qi.right <- df$right Qi.right4plot <- Qi.right if (is.infinite(Qi.right4plot[k]) | is.nan(Qi.right4plot[k])) Qi.right4plot[k] <- xmaxinf # keep only 16 significant digits for R configured with noLD (--disable-long-double) Pi.low <- signif(Fbefore, 16) Pi.up <- signif(Fnpsurv, 16) # the line at right of the rectangles dright <- c(f$left[1], rep(f$right, rep(2,k)), f$right[k]) Fright <- rep(c(0,Fnpsurv), rep(2,k+1)) # the line at left of the rectangles dleft <- rep(c(f$left,f$right[k]), rep(2,k+1)) Fleft <- c(0,rep(Fnpsurv, rep(2,k)),1) } # check legend parameters if added if(missing(legendtext)) { legendtext <- sapply(ft, function(x) x$distname) if(length(legendtext) != length(unique(legendtext))) legendtext <- paste(legendtext, sapply(ft, function(x) toupper(x$method)), sep="-") if(length(legendtext) != length(unique(legendtext))) legendtext <- paste(legendtext, 1:nft, sep="-") } if(plotstyle == "graphics") { ######## plot if plotstyle=='graphics' ######## if (NPMLE.method == "Turnbull.middlepoints") # Turnbull plot { if(missing(xlim)) { xmin <- min(c(censdata$left, censdata$right), na.rm=TRUE) xmax <- max(c(censdata$left, censdata$right), na.rm=TRUE)
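# default x-range for the Turnbull plot: the span of the observed censoring bounds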
xlim <- c(xmin, xmax) } else { xmin <- xlim[1] xmax <- xlim[2] } if ((xlogscale == TRUE) & xmin <= 0) stop("log transformation of data requires only positive values") # plot of data (ecdf) using Turnbull algorithm survdata <- Surv(time = censdata$left, time2 = censdata$right, type="interval2") survfitted <- survfit(survdata ~ 1) #main plotting if(missing(ylim)) { if (Turnbull.confint) { if (!add) plot(survfitted, fun="event", xlab=xlab, ylab=ylab, main=main, log=logxy, col=datacol, xlim = xlim, ...) else lines(survfitted, fun="event", xlab=xlab, ylab=ylab, main=main, log=logxy, col=datacol, xlim = xlim, ...) }else { if (!add) plot(survfitted, fun="event", xlab=xlab, ylab=ylab, main=main, log=logxy, col=datacol, conf.int = FALSE, xlim = xlim, ...) else lines(survfitted, fun="event", xlab=xlab, ylab=ylab, main=main, log=logxy, col=datacol, conf.int = FALSE, xlim = xlim, ...) } } else { if (Turnbull.confint) { if (!add) plot(survfitted, fun="event", xlab=xlab, ylab=ylab, main=main, log=logxy, col=datacol, xlim = xlim, ylim=ylim, ...) else lines(survfitted, fun="event", xlab=xlab, ylab=ylab, main=main, log=logxy, col=datacol, xlim = xlim, ylim=ylim, ...) } else { if (!add) plot(survfitted, fun="event", xlab=xlab, ylab=ylab, main=main, log=logxy, col=datacol, conf.int = FALSE, xlim = xlim, ylim = ylim, ...) else lines(survfitted, fun="event", xlab=xlab, ylab=ylab, main=main, log=logxy, col=datacol, conf.int = FALSE, xlim = xlim, ylim = ylim, ...) } } } else # if NPMLE.method == "Wang" # Wang plot { # Plot of the ECDF if (!add) plot(1, 1, type = "n", xlab=xlab, ylab=ylab, main=main, log = logxy, xlim = xlim, ylim = ylim, ...) # the line at right of the rectangles lines(dright, Fright, col = datacol) # the line at left of the rectangles lines(dleft, Fleft, col = datacol) # Add the filled rectangles # this gives a strange rendering - better to add a fill.datacol argument # rgbdatacol <- col2rgb(datacol) # lightdatacol <- rgb(rgbdatacol[1], rgbdatacol[2], rgbdatacol[3], maxColorValue = 255, # alpha = 10) # rect() is vectorized over its coordinate arguments, so one call draws all k rectangles rect(xleft = Qi.left4plot, ybottom = Pi.low, xright = Qi.right4plot, ytop = Pi.up, border = datacol, col = fillrect) } ################## plot of each fitted distribution plot.fti <- function(i, ...) { fti <- ft[[i]] para <- c(as.list(fti$estimate), as.list(fti$fix.arg)) distname <- fti$distname pdistname <- paste("p", distname, sep="") if (is.element(distname, c("binom", "nbinom", "geom", "hyper", "pois"))) warning(" Be careful, variables are considered continuous in this function!") if (xlogscale == TRUE) { sfin <- 10^seq(log10(xmin), log10(xmax), by=(log10(xmax)-log10(xmin))/100) } else { sfin <- seq(xmin, xmax, by=(xmax-xmin)/100) } theopfin <- do.call(pdistname, c(list(sfin), as.list(para))) lines(sfin, theopfin, lty=fitlty[i], col=fitcol[i], lwd=fitlwd[i], ...) } s <- sapply(1:nft, plot.fti, ...) rm(s) if(lines01) abline(h=c(0, 1), lty="dashed", col="grey") if (addlegend) { legend(x=xlegend, y=ylegend, bty="n", legend=legendtext, lty=fitlty, col=fitcol, lwd=fitlwd, ...) } invisible() } else if (!requireNamespace("ggplot2", quietly = TRUE)) { stop("ggplot2 needed for this function to work with plotstyle = 'ggplot'. Please install it", call.
= FALSE) } else { ######## plot if plotstyle=='ggplot' ######## if ((NPMLE.method == "Wang") | (NPMLE.method == "Turnbull.intervals")) { # recode the legend position according to available positions in ggplot2 if(xlegend %in% c("topleft", "bottomleft")) xlegend <- "left" if(xlegend %in% c("topright", "bottomright")) xlegend <- "right" if(xlegend == "center") xlegend <- "right" # the line at right of the rectangles dsegmright <- cbind(dright, Fright) dsegmright <- cbind(dsegmright[seq(1, nrow(dsegmright), by = 2), ], dsegmright[seq(2, nrow(dsegmright), by = 2), ]) dsegmright <- as.data.frame(dsegmright) colnames(dsegmright) <- c("x1", "y1", "x2", "y2") # the line at left of the rectangles dsegmleft <- cbind(dleft, Fleft) dsegmleft <- cbind(dsegmleft[seq(1, nrow(dsegmleft), by = 2), ], dsegmleft[seq(2, nrow(dsegmleft), by = 2), ]) dsegmleft <- as.data.frame(dsegmleft) colnames(dsegmleft) <- c("x1", "y1", "x2", "y2") drect <- data.frame(x1=Qi.left4plot, x2=Qi.right4plot, y1=Pi.low, y2=Pi.up) if (xlogscale == TRUE) { sfin <- rep(10^seq(log10(xmin), log10(xmax), by=(log10(xmax)-log10(xmin))/100), times = nft) } else { sfin <- rep(seq(xmin, xmax, by=(xmax-xmin)/100), times = nft) } theopfin <- vector(mode = "numeric", length = length(sfin)) ind <- vector(mode = "character", length = length(sfin)) len <- length(sfin) / nft for(i in 1:nft) { fti <- ft[[i]] para <- c(as.list(fti$estimate), as.list(fti$fix.arg)) distname <- fti$distname if (is.element(distname, c("binom", "nbinom", "geom", "hyper", "pois"))) warning(" Be careful, variables are considered continuous in this function!") pdistname <- paste("p", distname, sep="") theopfin[((i - 1) * len + 1):(i * len)] <- do.call(pdistname, c(list(sfin[((i - 1) * len + 1):(i * len)]), as.list(para))) ind[((i - 1) * len + 1):(i * len)] <- distname } dline <- data.frame(x = sfin, y = theopfin, ind = ind) dline$ind <- factor(dline$ind, levels = unique(dline$ind)) # reorder levels in the appearance order of the input ggcdfcompcens <- ggplot2::ggplot() + ggplot2::coord_cartesian(xlim = xlim, ylim = ylim) + ggplot2::ggtitle(main) + ggplot2::xlab(xlab) + ggplot2::ylab(ylab) + {if(lines01) ggplot2::geom_hline(ggplot2::aes(yintercept=0), color="grey", linetype="dashed")} + {if(lines01) ggplot2::geom_hline(ggplot2::aes(yintercept=1), color="grey", linetype="dashed")} + ggplot2::geom_rect(data=drect, mapping=ggplot2::aes_(xmin=quote(x1), xmax=quote(x2), ymin=quote(y1), ymax=quote(y2)), colour = datacol, fill = fillrect, alpha=0.5) + ggplot2::geom_segment(data=dsegmright, mapping=ggplot2::aes_(x=quote(x1), y=quote(y1), xend=quote(x2), yend=quote(y2)), colour = datacol) + ggplot2::geom_segment(data=dsegmleft, mapping=ggplot2::aes_(x=quote(x1), y=quote(y1), xend=quote(x2), yend=quote(y2)), colour = datacol) + ggplot2::geom_line(data=dline, ggplot2::aes_(quote(x), quote(y), group = quote(ind), colour = quote(ind), linetype = quote(ind), size = quote(ind))) + ggplot2::theme_bw() + {if(addlegend) ggplot2::theme(legend.position = c(xlegend, ylegend), plot.title = ggplot2::element_text(hjust = 0.5)) else ggplot2::theme(legend.position = "none", plot.title = ggplot2::element_text(hjust = 0.5))} + ggplot2::scale_color_manual(values = fitcol, labels = legendtext) + ggplot2::scale_linetype_manual(values = fitlty, labels = legendtext) + ggplot2::scale_size_manual(values = fitlwd, labels = legendtext) + ggplot2::guides(colour = ggplot2::guide_legend(title = NULL)) + ggplot2::guides(linetype = ggplot2::guide_legend(title = NULL)) + ggplot2::guides(size = 
ggplot2::guide_legend(title = NULL)) + {if(xlogscale) ggplot2::scale_x_continuous(trans='log10')} + {if(ylogscale) ggplot2::scale_y_continuous(trans='log10')} return(ggcdfcompcens) } } } fitdistrplus/R/mledist.R0000644000176200001440000004623314421725502015005 0ustar liggesusers############################################################################# # Copyright (c) 2009 Marie Laure Delignette-Muller, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### maximum likelihood estimation for censored or non-censored data ### ### R functions ### ### many ideas are taken from the fitdistr function of the MASS package and ### the mle function of the stat package. mledist <- function (data, distr, start=NULL, fix.arg=NULL, optim.method="default", lower=-Inf, upper=Inf, custom.optim=NULL, weights=NULL, silent=TRUE, gradient=NULL, checkstartfix=FALSE, ...) # data may correspond to a vector for non censored data or to # a dataframe of two columns named left and right for censored data { if (!is.character(distr)) stop("distr must be a character string naming a distribution") else distname <- distr ddistname <- paste("d", distname, sep="") argddistname <- names(formals(ddistname)) if (!exists(ddistname, mode="function")) stop(paste("The ", ddistname, " function must be defined")) if(is.null(custom.optim)) optim.method <- match.arg(optim.method, c("default", "Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN", "Brent")) start.arg <- start #to avoid confusion with the start() function of stats pkg (check is done lines 87-100) if(is.vector(start.arg)) #backward compatibility start.arg <- as.list(start.arg) txt1 <- "data must be a numeric vector of length greater than 1 for non censored data" txt2 <- "or a dataframe with two columns named left and right and more than one line for censored data" if(!is.null(weights)) { if(any(weights < 0)) stop("weights should be a vector of integers greater than 0") if(!isallintw(weights)) stop("weights should be a vector of (strictly) positive integers") if(length(weights) != NROW(data)) stop("weights should be a vector with a length equal to the observation number") warning("weights are not taken into account in the default initial values") } if (is.vector(data)) { cens <- FALSE if (!(is.numeric(data) & length(data)>1)) stop(paste(txt1, txt2)) } else { cens <- TRUE censdata <- data if (!(is.vector(censdata$left) & is.vector(censdata$right) & length(censdata[, 1])>1)) stop(paste(txt1, txt2)) pdistname<-paste("p", distname, sep="") if (!exists(pdistname, mode="function")) stop(paste("The ", pdistname, " function must be defined to apply maximum likelihood to censored data")) } if (cens) { #format data for calculation of starting values and fitting process dataformat <- cens2pseudo(censdata) data <- dataformat$pseudo rcens <- 
dataformat$rcens; lcens <- dataformat$lcens icens <- dataformat$icens; ncens <- dataformat$ncens irow <- cens2idxrow(censdata) irow.rcens <- irow$rcens; irow.lcens <- irow$lcens irow.icens <- irow$icens; irow.ncens <- irow$ncens } if(!checkstartfix) #pre-check has not been done by fitdist() or bootdist() { # manage starting/fixed values: may raise errors or return two named list arg_startfix <- manageparam(start.arg=start, fix.arg=fix.arg, obs=data, distname=distname) #check inconsistent parameters hasnodefaultval <- sapply(formals(ddistname), is.name) arg_startfix <- checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, argddistname, hasnodefaultval) #arg_startfix contains two names list (no longer NULL nor function) #set fix.arg.fun if(is.function(fix.arg)) fix.arg.fun <- fix.arg else fix.arg.fun <- NULL }else #pre-check has been done by fitdist() or bootdist() { arg_startfix <- list(start.arg=start, fix.arg=fix.arg) fix.arg.fun <- NULL } #unlist starting values as needed in optim() vstart <- unlist(arg_startfix$start.arg) #sanity check if(is.null(vstart)) stop("Starting values could not be NULL with checkstartfix=TRUE") #erase user value #(cannot coerce to vector as there might be different modes: numeric, character...) fix.arg <- arg_startfix$fix.arg ############# closed-form formula for uniform distribution ########## if(distname == "unif") { if(length(fix.arg) >= 2) { stop("'fix.arg' sets all distribution parameters without any parameter to estimate.") }else if(length(fix.arg) == 1) { if(names(fix.arg) == "min") par <- c(max=max(data)) else if(names(fix.arg) == "max") par <- c(min=min(data)) else stop("'fix.arg' must specify names which are arguments to 'distr'.") }else par <- c(min=min(data), max=max(data)) myarg <- c(list(data), as.list(par), as.list(fix.arg)) loglikval <- sum(log(do.call(dunif, myarg))) res <- list(estimate = par[!names(par) %in% names(fix.arg)], convergence = 0, loglik = loglikval, meth = "closed formula", hessian = NA, optim.function= NA, fix.arg = fix.arg) return(res) } ############# MLE fit using optim or custom.optim ########## # definition of the function to minimize : - log likelihood # for non censored data if (!cens && is.null(weights)) { # the argument names are: # - par for parameters (like in optim function) # - fix.arg for optional fixed parameters # - obs for observations (previously dat but conflicts with genoud data.type.int argument) # - ddistnam for distribution name if ("log" %in% argddistname){ fnobj <- function(par, fix.arg, obs, ddistnam){ -sum(do.call(ddistnam, c(list(obs), as.list(par), as.list(fix.arg), log=TRUE) ) ) } } else{ fnobj <- function(par, fix.arg, obs, ddistnam) { -sum(log(do.call(ddistnam, c(list(obs), as.list(par), as.list(fix.arg)) ) ) ) } } } else if(cens && is.null(weights)) #censored data { argpdistname<-names(formals(pdistname)) if (("log" %in% argddistname) & ("log.p" %in% argpdistname)) fnobjcens <- function(par, fix.arg, rcens, lcens, icens, ncens, ddistnam, pdistnam) -sum(do.call(ddistnam, c(list(ncens), as.list(par), as.list(fix.arg), list(log=TRUE)))) - sum(do.call(pdistnam, c(list(lcens), as.list(par), as.list(fix.arg), list(log=TRUE)))) - sum(do.call(pdistnam, c(list(rcens), as.list(par), as.list(fix.arg), list(lower.tail=FALSE), list(log=TRUE)))) - sum(log(do.call(pdistnam, c(list(icens$right), as.list(par), as.list(fix.arg))) - # without log=TRUE here do.call(pdistnam, c(list(icens$left), as.list(par), as.list(fix.arg))) )) # without log=TRUE here else fnobjcens <- function(par, fix.arg, rcens, lcens, 
icens, ncens, ddistnam, pdistnam) -sum(log(do.call(ddistnam, c(list(ncens), as.list(par), as.list(fix.arg))))) - sum(log(do.call(pdistnam, c(list(lcens), as.list(par), as.list(fix.arg))))) - sum(log(1-do.call(pdistnam, c(list(rcens), as.list(par), as.list(fix.arg))))) - sum(log(do.call(pdistnam, c(list(icens$right), as.list(par), as.list(fix.arg))) - do.call(pdistnam, c(list(icens$left), as.list(par), as.list(fix.arg))) )) }else if(!cens && !is.null(weights)) { fnobj <- function(par, fix.arg, obs, ddistnam) { -sum(weights * log(do.call(ddistnam, c(list(obs), as.list(par), as.list(fix.arg)) ) ) ) } }else if(cens && !is.null(weights)) { fnobjcens <- function(par, fix.arg, rcens, lcens, icens, ncens, ddistnam, pdistnam) { p1 <- log(do.call(ddistnam, c(list(ncens), as.list(par), as.list(fix.arg)))) p2 <- log(do.call(pdistnam, c(list(lcens), as.list(par), as.list(fix.arg)))) p3 <- log(1-do.call(pdistnam, c(list(rcens), as.list(par), as.list(fix.arg)))) p4 <- log(do.call(pdistnam, c(list(icens$right), as.list(par), as.list(fix.arg))) - do.call(pdistnam, c(list(icens$left), as.list(par), as.list(fix.arg))) ) -sum(weights[irow.ncens] * p1) - sum(weights[irow.lcens] * p2) - sum(weights[irow.rcens] * p3) - sum(weights[irow.icens] * p4) } } #get warning value owarn <- getOption("warn") # Try to minimize the minus (log-)likelihood using the base R optim function if(is.null(custom.optim)) { hasbound <- any(is.finite(lower) | is.finite(upper)) # Choice of the optimization method if (optim.method == "default") { meth <- ifelse(length(vstart) > 1, "Nelder-Mead", "BFGS") }else meth <- optim.method if(meth == "BFGS" && hasbound && is.null(gradient)) { meth <- "L-BFGS-B" txt1 <- "The BFGS method cannot be used with bounds without providing the gradient." txt2 <- "The method is changed to L-BFGS-B." warning(paste(txt1, txt2)) } options(warn=ifelse(silent, -1, 0)) #select optim or constrOptim if(hasbound) #finite bounds are provided { if(!is.null(gradient)) { opt.fun <- "constrOptim" }else #gradient == NULL { if(meth == "Nelder-Mead") opt.fun <- "constrOptim" else if(meth %in% c("L-BFGS-B", "Brent")) opt.fun <- "optim" else { txt1 <- paste("The method", meth, "cannot be used by constrOptim() nor optim() without gradient and bounds.") txt2 <- "Only optimization methods L-BFGS-B, Brent and Nelder-Mead can be used in such a case." stop(paste(txt1, txt2)) } } if(opt.fun == "constrOptim") { #recycle parameters npar <- length(vstart) #as in optim() line 34 lower <- as.double(rep_len(lower, npar)) #as in optim() line 64 upper <- as.double(rep_len(upper, npar)) # constraints are : Mat %*% theta >= Bnd, i.e.
# +1 * theta[i] >= lower[i]; # -1 * theta[i] >= -upper[i] #select rows from the identity matrix haslow <- is.finite(lower) Mat <- diag(npar)[haslow, ] #select rows from the opposite of the identity matrix hasupp <- is.finite(upper) Mat <- rbind(Mat, -diag(npar)[hasupp, ]) colnames(Mat) <- names(vstart) rownames(Mat) <- paste0("constr", 1:NROW(Mat)) #select the bounds Bnd <- c(lower[is.finite(lower)], -upper[is.finite(upper)]) names(Bnd) <- paste0("constr", 1:length(Bnd)) initconstr <- Mat %*% vstart - Bnd if(any(initconstr < 0)) stop("Starting values must be in the feasible region.") if(!cens) { opttryerror <- try(opt <- constrOptim(theta=vstart, f=fnobj, ui=Mat, ci=Bnd, grad=gradient, fix.arg=fix.arg, obs=data, ddistnam=ddistname, hessian=!is.null(gradient), method=meth, ...), silent=TRUE) } else #cens == TRUE opttryerror <- try(opt <- constrOptim(theta=vstart, f=fnobjcens, ui=Mat, ci=Bnd, grad=gradient, ddistnam=ddistname, rcens=rcens, lcens=lcens, icens=icens, ncens=ncens, pdistnam=pdistname, fix.arg=fix.arg, hessian=!is.null(gradient), method=meth, ...), silent=TRUE) if(!inherits(opttryerror, "try-error")) if(length(opt$counts) == 1) #appears when the initial point is a solution opt$counts <- c(opt$counts, NA) }else #opt.fun == "optim" { if(!cens) opttryerror <- try(opt <- optim(par=vstart, fn=fnobj, fix.arg=fix.arg, obs=data, gr=gradient, ddistnam=ddistname, hessian=TRUE, method=meth, lower=lower, upper=upper, ...), silent=TRUE) else #cens == TRUE opttryerror <- try(opt <- optim(par=vstart, fn=fnobjcens, fix.arg=fix.arg, gr=gradient, rcens=rcens, lcens=lcens, icens=icens, ncens=ncens, ddistnam=ddistname, pdistnam=pdistname, hessian=TRUE, method=meth, lower=lower, upper=upper, ...), silent=TRUE) } }else #hasbound == FALSE { opt.fun <- "optim" if(!cens) opttryerror <- try(opt <- optim(par=vstart, fn=fnobj, fix.arg=fix.arg, obs=data, gr=gradient, ddistnam=ddistname, hessian=TRUE, method=meth, lower=lower, upper=upper, ...), silent=TRUE) else #cens == TRUE opttryerror <- try(opt <- optim(par=vstart, fn=fnobjcens, fix.arg=fix.arg, gr=gradient, rcens=rcens, lcens=lcens, icens=icens, ncens=ncens, ddistnam=ddistname, pdistnam=pdistname, hessian=TRUE, method=meth, lower=lower, upper=upper, ...), silent=TRUE) } options(warn=owarn) if (inherits(opttryerror, "try-error")) { warning("The function optim encountered an error and stopped.") if(getOption("show.error.messages")) print(attr(opttryerror, "condition")) return(list(estimate = rep(NA, length(vstart)), convergence = 100, loglik = NA, hessian = NA, optim.function=opt.fun, fix.arg = fix.arg, optim.method=meth, fix.arg.fun = fix.arg.fun, counts=c(NA, NA))) } if (opt$convergence>0) { warning("The function optim failed to converge, with the error code ", opt$convergence) } if(is.null(names(opt$par))) names(opt$par) <- names(vstart) res <- list(estimate = opt$par, convergence = opt$convergence, value=opt$value, hessian = opt$hessian, optim.function=opt.fun, optim.method=meth, fix.arg = fix.arg, fix.arg.fun = fix.arg.fun, weights = weights, counts=opt$counts, optim.message=opt$message, loglik = -opt$value) } else # Try to minimize the minus (log-)likelihood using a user-supplied optim function { options(warn=ifelse(silent, -1, 0)) if (!cens) opttryerror <- try(opt <- custom.optim(fn=fnobj, fix.arg=fix.arg, obs=data, ddistnam=ddistname, par=vstart, ...), silent=TRUE) else opttryerror <- try(opt <- custom.optim(fn=fnobjcens, fix.arg=fix.arg, rcens=rcens, lcens=lcens, icens=icens, ncens=ncens, ddistnam=ddistname, pdistnam=pdistname, par=vstart, ...),
silent=TRUE) options(warn=owarn) if (inherits(opttryerror, "try-error")) { warning("The customized optimization function encountered an error and stopped.") if(getOption("show.error.messages")) print(attr(opttryerror, "condition")) return(list(estimate = rep(NA, length(vstart)), convergence = 100, loglik = NA, hessian = NA, optim.function=custom.optim, fix.arg = fix.arg, fix.arg.fun = fix.arg.fun, counts=c(NA, NA))) } if (opt$convergence>0) { warning("The customized optimization function failed to converge, with the error code ", opt$convergence) } if(is.null(names(opt$par))) names(opt$par) <- names(vstart) argdot <- list(...) method.cust <- argdot$method res <- list(estimate = opt$par, convergence = opt$convergence, value=opt$value, hessian = opt$hessian, optim.function = custom.optim, optim.method = method.cust, fix.arg = fix.arg, fix.arg.fun = fix.arg.fun, weights = weights, counts=opt$counts, optim.message=opt$message, loglik = -opt$value) } return(res) } ## old function with previous name for censored data mledistcens <- function(censdata, distr, start=NULL, optim.method="default", lower=-Inf, upper=Inf) { stop("The function \"mledistcens\" is no longer used. Now the same function \"mledist\" must be used both for censored and non-censored data.") } fitdistrplus/R/quantiles.R0000644000176200001440000002451214421716571015353 0ustar liggesusers############################################################################# # Copyright (c) 2012 Marie Laure Delignette-Muller, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### calculation of theoretical quantiles from a parametric distribution ### fitted on censored or non-censored data ### ### R functions ### #quantile function for fitdist objects quantile.fitdist <- function(x, probs = seq(0.1, 0.9, by=0.1), ...) { if (!inherits(x, "fitdist")) stop("Use only with 'fitdist' objects") myquantilesfitdist(f = x, probs = probs, cens = FALSE) } #quantile function for fitdistcens objects quantile.fitdistcens <- function(x, probs = seq(0.1, 0.9, by=0.1), ...)
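# (same computation as quantile.fitdist: the result is simply flagged as
# censored so that the print method labels the quantiles accordingly)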
{ if (!inherits(x, "fitdistcens")) stop("Use only with 'fitdistcens' objects") myquantilesfitdist(f = x, probs = probs, cens = TRUE) } #internal quantile function for fitdist myquantilesfitdist <- function(f, probs, cens) { qdistname<-paste("q", f$distname, sep="") if (!exists(qdistname, mode="function")) stop(paste("The ", qdistname, " function must be defined")) # computation and print of quantiles using estimations of parameters para=c(as.list(f$estimate), as.list(f$fix.arg)) quantiles <- do.call(qdistname, c(list(probs), as.list(para))) if (length(probs)>1) quantiles <- as.data.frame(t(quantiles)) else quantiles <- as.data.frame(quantiles) colnames(quantiles) <- paste("p=", probs, sep="") rownames(quantiles) <- "estimate" reslist <- list(quantiles = quantiles, probs = probs) if(!cens) class(reslist) <- "quantile.fitdist" else class(reslist) <- "quantile.fitdistcens" reslist } print.quantile.fitdist <- function(x, ...) { if (!inherits(x, "quantile.fitdist")) stop("Use only with 'quantile.fitdist' objects") typedata <- "(non-censored data)" cat("Estimated quantiles for each specified probability ", typedata,"\n", sep="") print(x$quantiles) invisible(x) } print.quantile.fitdistcens <- function(x, ...) { if (!inherits(x, "quantile.fitdistcens")) stop("Use only with 'quantile.fitdistcens' objects") typedata <- "(censored data)" cat("Estimated quantiles for each specified probability ", typedata,"\n", sep="") print(x$quantiles) invisible(x) } ############################################################################# ### calculation of theoretical quantiles from a parametric distribution ### fitted on censored or non-censored data ### and associated bootstrap confidence intervals ### ### R functions ### #quantile function for bootdist objects quantile.bootdist <- function(x, probs = seq(0.1, 0.9, by=0.1), CI.type = "two.sided", CI.level = 0.95, ...) { if (!inherits(x, "bootdist")) stop("Use only with 'bootdist' objects") myquantilesbootdist(b = x, probs = probs, CI.type = CI.type, CI.level = CI.level, cens = FALSE) } #quantile function for bootdistcens objects quantile.bootdistcens <- function(x, probs = seq(0.1, 0.9, by=0.1), CI.type = "two.sided", CI.level = 0.95, ...) 
{ if (!inherits(x, "bootdistcens")) stop("Use only with 'bootdistcens' objects") myquantilesbootdist(b = x, probs = probs, CI.type = CI.type, CI.level = CI.level, cens = TRUE) } #internal quantile function for bootdist myquantilesbootdist <- function(b, probs, CI.type, CI.level, cens) { CI.type <- match.arg(CI.type, c("two.sided", "less", "greater")) if(!is.logical(cens)) stop("wrong argument cens.") CI.level <- CI.level[1] # 1/ computation of quantiles using quantile.fitdist basequant <- quantile(b$fitpart, probs=probs) # 2/ computation of bootstraped quantiles and alpha-percent CI of quantiles qdistname <- paste("q", b$fitpart$distname, sep="") calcquant <- function(i) { parai <- c(as.list(b$estim[i, ]), as.list(b$fitpart$fix.arg)) do.call(qdistname, c(list(probs), as.list(parai))) } bootquant <- sapply(1:b$nbboot, calcquant) if (length(probs)>1) bootquant <- as.data.frame(t(bootquant)) else bootquant <- as.data.frame(bootquant) colnames(bootquant) <- paste("p=", probs, sep="") quantmedian <- rbind(apply(bootquant, MARGIN=2, median, na.rm=TRUE)) colnames(quantmedian) <- paste("p=", probs, sep="") rownames(quantmedian) <- "estimate" if (CI.type == "two.sided") { alpha <- (1-CI.level)/2 quantCI <- rbind( apply(bootquant, MARGIN=2, quantile, alpha, na.rm=TRUE), apply(bootquant, MARGIN=2, quantile, 1-alpha, na.rm=TRUE)) rownames(quantCI) <- formatperc(c(alpha, 1-alpha), 3) }else if (CI.type == "less") { quantCI <- t(apply(bootquant, MARGIN=2, quantile, CI.level, na.rm=TRUE)) rownames(quantCI) <- formatperc(CI.level, 3) }else { quantCI <- t(apply(bootquant, MARGIN=2, quantile, 1-CI.level, na.rm=TRUE)) rownames(quantCI) <- formatperc(1-CI.level, 3) } # message when lack of convergence nbconverg <- length(b$converg[b$converg == 0]) reslist <- list(quantiles = basequant$quantiles, probs=probs, bootquant = bootquant, quantCI = as.data.frame(quantCI), quantmedian = quantmedian, CI.type = CI.type, CI.level = CI.level, nbboot = b$nbboot, nbconverg = nbconverg) if(!cens) class(reslist) <- "quantile.bootdist" else class(reslist) <- "quantile.bootdistcens" reslist } print.quantile.bootdist <- function(x, ...) { if (!inherits(x, "quantile.bootdist")) stop("Use only with 'quantile.bootdist' objects") typedata <- "(non-censored data)" #base quantiles cat("(original) estimated quantiles for each specified probability ", typedata,"\n", sep="") print(x$quantiles) cat("Median of bootstrap estimates\n") print(x$quantmedian) #confidence intervals cat("\n") if (x$CI.type == "two.sided") { cat("two-sided ", formatperc(x$CI.level, 3)," CI of each quantile\n", sep="") print(x$quantCI) }else if (x$CI.type == "less") { cat("right bound of one-sided ", formatperc(x$CI.level, 3)," CI of each quantile\n") print(x$quantCI) }else { cat("left bound of one-sided ", formatperc(x$CI.level, 3)," CI of each quantile\n") print(x$quantCI) } if (x$nbconverg < x$nbboot) { cat("\n") cat("The estimation method converged only for", x$nbconverg, "among", x$nbboot, "bootstrap iterations.\n") } invisible(x) } print.quantile.bootdistcens <- function(x, ...) 
{ if (!inherits(x, "quantile.bootdistcens")) stop("Use only with 'quantile.bootdistcens' objects") typedata <- "(censored data)" #base quantiles cat("(original) estimated quantiles for each specified probability ", typedata,"\n", sep="") print(x$quantiles) cat("Median of bootstrap estimates\n") print(x$quantmedian) #confidence intervals cat("\n") if (x$CI.type == "two.sided") { cat("two-sided ", formatperc(x$CI.level, 3)," CI of each quantile\n", sep="") print(x$quantCI) }else if (x$CI.type == "less") { cat("right bound of one-sided ", formatperc(x$CI.level, 3)," CI of each quantile\n") print(x$quantCI) }else { cat("left bound of one-sided ", formatperc(x$CI.level, 3)," CI of each quantile\n") print(x$quantCI) } if (x$nbconverg < x$nbboot) { cat("\n") cat("The estimation method converged only for", x$nbconverg, "among", x$nbboot, "bootstrap iterations.\n") } invisible(x) } #from the stat package (not exported in fitdistrplus) formatperc <- function(probs, digits) ## Not yet exported, maybe useful in other contexts: ## quantile.default() sometimes uses a version of it paste(format(100 * probs, trim = TRUE, scientific = FALSE, digits = digits), "%") fitdistrplus/R/gofstat.R0000644000176200001440000003322014421716455015012 0ustar liggesusers############################################################################# # Copyright (c) 2009 Marie Laure Delignette-Muller, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### calculate goodness-of-fit statistics for ### fit of a parametric distribution on non-censored data ### ### R functions ### gofstat <- function (f, chisqbreaks, meancount, discrete, fitnames=NULL) { if(inherits(f, "fitdist")) { f <- list(f) }else if(length(f) == 1) { if(!inherits(f, "fitdist")) stop("argument f must be a 'fitdist' object or a list of 'fitdist' objects.") }else if(!is.list(f)) { stop("argument f must be a list of 'fitdist' objects") }else { if(any(sapply(f, function(x) !inherits(x, "fitdist")))) stop("argument f must be a list of 'fitdist' objects") } # In future developments, it will be necessary to check that all the fits share the same weights if(!is.null(f[[1]]$weights)) stop("gofstat is not yet available when using weights") odata <- f[[1]]$data sdata <- sort(odata) n <- f[[1]]$n distname <- f[[1]]$distname pdistname <- paste("p", distname, sep="") estimate <- f[[1]]$estimate fix.arg <- f[[1]]$fix.arg verif.ftidata <- function(fti) { if (any(fti$data != odata)) stop("All compared fits must have been obtained with the same dataset") invisible() } lapply(f, verif.ftidata) # initiate discrete if not given if(missing(discrete)) { discrete <- f[[1]]$discrete } if(!is.logical(discrete)) stop("wrong argument 'discrete'.") #define chisqbreaks if not defined if (missing(chisqbreaks)) { if (missing(meancount)) meancount <- round( n / ((4*n)^(2/5)) ) if (length(sdata)>ceiling(1.5*meancount)) { limit <- sdata[meancount] sdata <- sdata[sdata>limit] chisqbreaks <- limit }else { warning("The sample is too small to automatically define chisqbreaks") chisq <- NULL chisqbreaks <- NULL chisqpvalue <- NULL chisqtable <- NULL chisqdf <- NULL } while (length(sdata)>ceiling(1.5*meancount)) { limit <- sdata[meancount] sdata <- sdata[sdata>limit] chisqbreaks <- c(chisqbreaks,limit) } sdata <- sort(odata) } nbfit <- length(f) if(is.null(fitnames)) fitnames <- paste(1:nbfit, sapply(f, function(x) x$method), sapply(f, function(x) x$distname), sep="-") Chi2 <- computegofstatChi2(sdata, n, distname, pdistname, estimate, fix.arg, chisqbreaks) if(length(f) > 1) { #renaming names(Chi2$chisq) <- names(Chi2$chisqpvalue) <- names(Chi2$chisqdf) <- fitnames[1] colnames(Chi2$chisqtable)[2] <- paste("theo", fitnames[1], sep=" ") #computation and storing for(i in 2:nbfit) { Chi2temp <- computegofstatChi2(sdata, n, f[[i]]$distname, paste("p", f[[i]]$distname, sep=""), f[[i]]$estimate, f[[i]]$fix.arg, chisqbreaks) names(Chi2temp$chisq) <- names(Chi2temp$chisqpvalue) <- names(Chi2temp$chisqdf) <- fitnames[i] Chi2$chisq <- c(Chi2$chisq, Chi2temp$chisq) Chi2$chisqpvalue <- c(Chi2$chisqpvalue, Chi2temp$chisqpvalue) Chi2$chisqdf <- c(Chi2$chisqdf, Chi2temp$chisqdf) Chi2$chisqtable <- cbind(Chi2$chisqtable, Chi2temp$chisqtable[,2]) colnames(Chi2$chisqtable)[NCOL(Chi2$chisqtable)] <- paste("theo", fitnames[i], sep=" ") } } if(discrete) { addres <- Chi2 }else { KSCvMAD <- computegofstatKSCvMAD(sdata, n, distname, pdistname, estimate, fix.arg, f[[1]]$method) #renaming names(KSCvMAD$cvm) <- names(KSCvMAD$ad) <- names(KSCvMAD$ks) <- fitnames[1] if(!is.null(KSCvMAD$cvmtest)) names(KSCvMAD$cvmtest) <- names(KSCvMAD$adtest) <- names(KSCvMAD$kstest) <- fitnames[1] if(length(f) > 1) { #computation and storing for(i in 2:nbfit) {
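# recompute the KS, CvM and AD statistics for the i-th fit on the same data
# and append them to the results of the first fit, named after fitnames[i]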
KSCvMADtemp <- computegofstatKSCvMAD(sdata, n, f[[i]]$distname, paste("p", f[[i]]$distname, sep=""), f[[i]]$estimate, f[[i]]$fix.arg, f[[i]]$method) names(KSCvMADtemp$cvm) <- names(KSCvMADtemp$ad) <- names(KSCvMADtemp$ks) <- fitnames[i] if(!is.null(KSCvMADtemp$cvmtest)) names(KSCvMADtemp$cvmtest) <- names(KSCvMADtemp$adtest) <- names(KSCvMADtemp$kstest) <- fitnames[i] KSCvMAD$cvm <- c(KSCvMAD$cvm, KSCvMADtemp$cvm) KSCvMAD$cvmtest <- c(KSCvMAD$cvmtest, KSCvMADtemp$cvmtest) KSCvMAD$ad <- c(KSCvMAD$ad, KSCvMADtemp$ad) KSCvMAD$adtest <- c(KSCvMAD$adtest, KSCvMADtemp$adtest) KSCvMAD$ks <- c(KSCvMAD$ks, KSCvMADtemp$ks) KSCvMAD$kstest <- c(KSCvMAD$kstest, KSCvMADtemp$kstest) } } addres <- c(Chi2, KSCvMAD) } aics <- sapply(f, function(x) x$aic) names(aics) <- fitnames bics <- sapply(f, function(x) x$bic) names(bics) <- fitnames res <- c(addres, aic=list(aics), bic=list(bics), discrete=discrete, nbfit=nbfit) class(res) <- c("gofstat.fitdist", "fitdist") res } #---------------------------------------------------------------------- #KS, CvM, AD statistics : only for continuous distributions computegofstatKSCvMAD <- function(sdata, n, distname, pdistname, estimate, fix.arg, method) { obspu <- seq(1,n)/n obspl <- seq(0,n-1)/n theop <- do.call(pdistname, c(list(sdata), as.list(estimate), fix.arg)) # Kolmogorov-Smirnov statistic ks <- max(pmax(abs(theop-obspu), abs(theop-obspl))) Dmod <- ks*(sqrt(n)+0.12+0.11/sqrt(n)) # Kolmogorov-Smirnov test if (n>=30) kstest <- ifelse(Dmod>1.358,"rejected","not rejected") else kstest <- "not computed" # Anderson-Darling statistic ad <- - n - mean( (2 * 1:n - 1) * (log(theop) + log(1 - rev(theop))) ) # ad <- -n-sum((2*(1:n)-1)*log(theop) + (2*n+1-2*(1:n))*log(1-theop))/n # Anderson-Darling test if (is.null(fix.arg) & method == "mle") { # the following test does not correspond to MLE estimate but to unbiased # estimate of the variance #if ((distname == "norm" | distname == "lnorm") & n>=5) { # a2mod <- ad*(1+0.75/n+2.25/n^2) # adtest <- ifelse(a2mod>0.752,"rejected","not rejected") #} #else if (distname == "exp" & n>=5) { a2mod <- ad*(1+0.6/n) adtest <- ifelse(a2mod>1.321, "rejected", "not rejected") }else if (distname == "gamma" & n>=5) { m <- as.list(estimate)$shape interp <- approxfun(c(1,2,3,4,5,6,8,10,12,15,20), c(0.786,0.768,0.762,0.759,0.758,0.757,0.755,0.754,0.754,0.754,0.753), yright=0.752) adtest <- ifelse(ad>interp(m), "rejected", "not rejected") }else if (distname == "weibull" & n>=5) { a2mod <- ad*(1+0.2/sqrt(n)) adtest <- ifelse(a2mod>0.757, "rejected", "not rejected") }else if (distname == "logis" & n>=5) { a2mod <- ad*(1+0.25/n) adtest <- ifelse(a2mod>0.66,"rejected","not rejected") }else adtest <- "not computed" } else # if (is.null(fix.arg)...) 
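# the modified AD statistics above rely on tabulated critical values that are
# only valid when every parameter is estimated by MLE on the complete sample;
# with fixed parameters or another estimation method no valid threshold applies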
adtest <- "not computed" # Cramer-von Mises statistic cvm <- 1/(12*n) + sum( ( theop - (2 * 1:n - 1)/(2 * n) )^2 ) # Cramer-von Mises test if (is.null(fix.arg) & method == "mle") { # the following test does not correspond to MLE estimate but to unbiased # estimate of the variance # if ((distname == "norm" | distname == "lnorm") & n>=5) { # w2mod <- cvm*(1+0.5/n) # cvmtest <- ifelse(w2mod>0.126,"rejected","not rejected") # } # else if (distname == "exp" & n>=5) { w2mod <- cvm*(1+0.16/n) cvmtest <- ifelse(w2mod>0.222,"rejected","not rejected") }else if (distname == "gamma" & n>=5) { m <- as.list(estimate)$shape interp <- approxfun(c(1,2,3,4,5,6,8,10,12,15,20), c(0.136,0.131,0.129,0.128,0.128,0.128,0.127,0.127,0.127,0.127,0.126), yright=0.126) cvmtest <- ifelse(cvm>interp(m),"rejected","not rejected") }else if (distname == "weibull" & n>=5) { w2mod <- cvm*(1+0.2/sqrt(n)) cvmtest <- ifelse(w2mod>0.124,"rejected","not rejected") }else if (distname == "logis" & n>=5) { w2mod <- (n*cvm - 0.08)/(n - 1) cvmtest <- ifelse(w2mod>0.098,"rejected","not rejected") }else cvmtest <- "not computed" } else # if (is.null(fix.arg)) cvmtest <- "not computed" if (length(table(sdata)) != length(sdata)) warning("Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling statistics may not be correct with ties") list(cvm = cvm, cvmtest = cvmtest, ad = ad, adtest = adtest, ks = ks, kstest=kstest) } #---------------------------------------------------------------------- #chi-squared statistic : both for continuous and discrete distributions computegofstatChi2 <- function(sdata, n, distname, pdistname, estimate, fix.arg, chisqbreaks) { # chi-squared statistic and pvalues if (!is.null(chisqbreaks)) { if(!is.numeric(chisqbreaks)) stop("chisqbreaks must be a numeric vector defining the cell boundaries") nbreaks <- length(chisqbreaks) pbreaks <- do.call(pdistname, c(list(chisqbreaks), as.list(estimate), fix.arg)) Fobsbreaks <- ecdf(sdata)(chisqbreaks) Fobsunder <- c(0, Fobsbreaks[1:nbreaks-1]) punder <- c(0, pbreaks[1:nbreaks-1]) if (pbreaks[nbreaks]==1 & Fobsbreaks[nbreaks]==1) { p <- pbreaks-punder Fobs <- Fobsbreaks-Fobsunder }else { p <- c(pbreaks-punder, 1-pbreaks[nbreaks]) Fobs <- c(Fobsbreaks-Fobsunder, 1-Fobsbreaks[nbreaks]) } obscounts <- round(Fobs*n) theocounts <- p*n chisq <- sum(((obscounts-theocounts)^2)/theocounts) chisqdf <- length(obscounts)-1-length(estimate) # replacing of the line below which causes an error message for chisqdf <=0 # chisqpvalue <- ifelse(chisqdf>0, pchisq(chisq, df=chisqdf, lower.tail=FALSE), NULL) if (chisqdf>0) { chisqpvalue <- pchisq(chisq, df=chisqdf, lower.tail=FALSE) } else { chisqpvalue <- NULL } chisqtable <- as.table(cbind(obscounts, theocounts)) for (i in 1:(length(obscounts)-1)) rownames(chisqtable)[i] <- paste("<=", signif(chisqbreaks[i], digits=4)) rownames(chisqtable)[length(obscounts)] <- paste(">", signif(chisqbreaks[i], digits=4)) return( list(chisq = chisq, chisqbreaks=chisqbreaks, chisqpvalue = chisqpvalue, chisqdf = chisqdf, chisqtable = chisqtable) ) }else return(NULL) } print.gofstat.fitdist <- function(x, ...) { if (!inherits(x, "gofstat.fitdist")) stop("Use only with 'gofstat.fitdist' objects") if (x$discrete) #discrete distribution { if(!is.null(x$chisq)) { cat("Chi-squared statistic: ",x$chisq,"\n") cat("Degree of freedom of the Chi-squared distribution: ",x$chisqdf,"\n") if (any(x$chisqdf <= 0)) { cat(" The degree of freedom of the chi-squared distribution is less than 1 \n") cat(" The number of cells is insufficient to calculate the p-value.
\n") }else { cat("Chi-squared p-value: ",x$chisqpvalue,"\n") if (any(x$chisqtable[,-1] < 5)) cat(" the p-value may be wrong with some theoretical counts < 5 \n") } cat("Chi-squared table:\n") print(x$chisqtable) cat("\nGoodness-of-fit criteria\n") mm <- rbind(AIC=x$aic, BIC=x$bic) rownames(mm) <- c("Akaike's Information Criterion", "Bayesian Information Criterion") print(mm) }else cat("The sample is too small to automatically define cells for Chi-squared test \n") }else # continuous distribution { cat("Goodness-of-fit statistics\n") mm <- rbind(KS=x$ks, CvM=x$cvm, AD=x$ad) rownames(mm) <- c("Kolmogorov-Smirnov statistic", "Cramer-von Mises statistic", "Anderson-Darling statistic") print(mm) cat("\nGoodness-of-fit criteria\n") mm <- rbind(AIC=x$aic, BIC=x$bic) rownames(mm) <- c("Akaike's Information Criterion", "Bayesian Information Criterion") print(mm) # } } invisible(x) }fitdistrplus/R/util-npsurv-NNLS.R0000644000176200001440000001111313742313702016411 0ustar liggesusers# ----------------------------------------------------------------------- # # Nonparametric maximum likelihood estimation from interval-censored data # # ----------------------------------------------------------------------- # # Internal optimization functions for nonnegative least square method # # ----------------------------------------------------------------------- # ## ========================================================================== ## Nonnegative least squares (NNLS): ## ## Minimize ||ax - b|| ## subject to non-negativity for all components ## x >= 0 i.e. diag(ncol(x)) x - 0 >= 0 ## sum up to one ## sum(x) = 1, ie. - 1 >= 0, + 1 >= 0 ## ========================================================================== NNLS_constrSum <- function(a, b, control=list(), pkg="stats", sumtotal=1, ...) { pkg <- match.arg(pkg, c("stats")) controlnames <- c("trace", "fnscale", "parscale", "ndeps", "maxit", "abstol", "reltol", "alpha", "beta", "gamma", "REPORT", "warn.1d.NelderMead", "type", "lmm", "factr", "pgtol", "tmax", "temp") if(length(control) >= 1) control <- control[names(control) %in% controlnames] else control <- list(maxit=500) if (pkg == "stats") #control parameter for BFGS used below { control$maxit <- max(control$maxit, 1000) control$reltol <- 1e-6 } #sanity check if(!is.vector(b)) b = drop(b) if(!is.matrix(a)) stop("a not matrix") if(length(sumtotal) > 1) stop("sumtotal must be a positive scalar") if(!is.numeric(sumtotal)) stop("sumtotal must be a positive scalar") if(sumtotal <= 0) stop("sumtotal must be a positive scalar") m <- NROW(a) n <- NCOL(a) if (pkg == "stats") { if(control$trace >= 2) cat("nb of optimized variables", n, "\n") if(n > 2) { #residual least square sum with theta=x[1:(n-1)] ; x[n] = 1-sum(theta) RLS <- function(theta) { x <- c(theta, sumtotal-sum(theta)) y <- a %*% x - b sum(y^2) } gradRLS <- function(theta) { x <- c(theta, sumtotal-sum(theta)) diffa <- a[, 1:(NCOL(a)-1)] - a[, NCOL(a)] y <- a %*% x - b 2*crossprod(diffa, y) } # non negativity constraint one_n <- rep(1, n-1) ui <- diag(n-1) rownames(ui) <- c(paste0("theta", 1:(n-1))) ci <- rep(0, n-1) #initial guess x0 <- rep(1/(n-1), n-1) #call to constrOptim control$trace <- control$trace >=5 res <- constrOptim(theta=x0, f=RLS, grad=gradRLS, ui=ui, ci=ci, method="BFGS", control=control, ...) 
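# note: the equality constraint sum(x) = sumtotal has been eliminated by
# optimizing only theta = x[1:(n-1)] and recovering x[n] = sumtotal - sum(theta);
# x[n] >= 0 is not imposed during the optimization, hence the pmax()/rescaling
# sanity check applied to res$prob below.
# Illustrative call (hypothetical inputs, not run):
# NNLS_constrSum(a = matrix(runif(12), 4, 3), b = runif(4), control = list(trace = 0))
# should return, in res$prob, three nonnegative weights summing to 1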
#warning : convergence is not reached if(res$convergence != 0) res$prob <- c(res$par, sumtotal-sum(res$par)) #sanity check res$prob <- pmax(res$prob, 0) res$prob <- res$prob/sum(res$prob)*sumtotal }else if(n == 2) { #residual least square sum with theta=x[1:(n-1)] ; x[n] = 1-sum(theta) RLS <- function(theta) { x <- c(theta, sumtotal-sum(theta)) y <- a %*% x - b sum(y^2) } gradRLS <- function(theta) { x <- c(theta, sumtotal-sum(theta)) diffa <- a[,1]-a[,2] 2*crossprod(diffa, a %*% x - b) } # non negativity constraint one_n <- rep(1, n-1) ui <- diag(n-1) rownames(ui) <- c(paste0("theta", 1:(n-1))) ci <- rep(0, n-1) #initial guess x0 <- rep(1/(n-1), n-1) #call to constrOptim res <- constrOptim(theta=x0, f=RLS, grad=gradRLS, ui=ui, ci=ci, method="BFGS", control=control, ...) #warning : convergence is not reached if(res$convergence != 0) res$prob <- c(res$par, sumtotal-sum(res$par)) #sanity check res$prob <- pmax(res$prob, 0) res$prob <- res$prob/sum(res$prob)*sumtotal }else if(n == 1) { xstar <- sumtotal res <- list(prob=xstar, value=(a*xstar-b)^2, counts=0, par=NA, convergence=1*(xstar < 0), message="no optimization, because fully constrained problem") }else stop("wrong argument a for NNLS_constrSum()") }else # if (pkg == "limSolve") # { # #TODO : pass control argument to limSolve::lsei # require(limSolve) # res <- limSolve::lsei(A=a, B=b, E=rep(1,n), F=1, G=diag(n), H=rep(0, n), ...) # if(res$IsError) # stop("error in limSolve::lsei when computing NNLS") # xstar <- res$X # }else stop("wrong package") res } fitdistrplus/R/util-getparam.R0000644000176200001440000000227313742313702016113 0ustar liggesusers# INPUTS # argdistname : argument names of the distribution from names(formals()) # OUTPUTS # parameter names (as a vector) of the distribution (excluding non parameter argument) computegetparam <- function(argdistname) { #remove first argument, that should be "x", "p", "q", or "n", see ?dgamma, pgamma, qgamma argdistname <- argdistname[-1] nonparaminR <- c("x", "p", "q", "n") #defensive programming #remove other arguments, see ?dgamma, pgamma, qgamma, dbeta nonparaminR <- c(nonparaminR, "log", "log.p", "lower.tail", "ncp") nonparaminActuar <- c("limit", "order", "t") nonparaminGamlssdist <- "fast" nonparamspecial <- c("...", "..1", "..2") #see ?dnig, dhyperb, dskewlap, dgig,... nonparaminGenHyperbolic <- c("param", "KOmega", "ibfTol", "nmax", "method", "intTol", "valueOnly", "nInterpol", "uniTol", "subdivisions", "logPars") #see ?dsn nonparamsn <- "dp" plist <- setdiff(argdistname, nonparaminR) plist <- setdiff(plist, nonparaminActuar) plist <- setdiff(plist, nonparaminGamlssdist) plist <- setdiff(plist, nonparamspecial) plist <- setdiff(plist, nonparaminGenHyperbolic) plist <- setdiff(plist, nonparamsn) plist } fitdistrplus/R/util-wtdstat.R0000644000176200001440000001516114421726133016006 0ustar liggesusers############################################################################# # Copyright (c) 2015 Frank E Harrell Jr # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# #some functions from Hmisc also under GPL #From wtd.stats.s (line 1) of the Hmisc package wtdmean <- function(x, weights=NULL, normwt='ignored', na.rm=TRUE) { if(!length(weights)) return(mean(x, na.rm=na.rm)) if(na.rm) { s <- !is.na(x + weights) x <- x[s] weights <- weights[s] } sum(weights*x)/sum(weights) } #From wtd.stats.s (line 15) of the Hmisc package wtdvar <- function(x, weights=NULL, normwt=FALSE, na.rm=TRUE, method = c('unbiased', 'ML')) { method <- match.arg(method) if(!length(weights)) { if(na.rm) x <- x[!is.na(x)] return(var(x)) } if(na.rm) { s <- !is.na(x + weights) x <- x[s] weights <- weights[s] } if(normwt) weights <- weights * length(x) / sum(weights) if(method == 'ML') return(as.numeric(stats::cov.wt(cbind(x), weights, method = "ML")$cov)) sw <- sum(weights) xbar <- sum(weights * x) / sw sum(weights*((x - xbar)^2)) / (sw - (if(normwt) sum(weights ^ 2) / sw else 1)) } #From wtd.stats.s (line 43) of the Hmisc package wtdquantile <- function(x, weights=NULL, probs=c(0, .25, .5, .75, 1), type='quantile', normwt=FALSE, na.rm=TRUE) { if(!length(weights)) return(quantile(x, probs=probs, na.rm=na.rm)) type <- match.arg(type) if(any(probs < 0 | probs > 1)) stop("Probabilities must be between 0 and 1 inclusive") nams <- paste(format(round(probs * 100, if(length(probs) > 1) 2 - log10(diff(range(probs))) else 2)), "%", sep = "") w <- wtdtable(x, weights, na.rm=na.rm, normwt=normwt, type='list') x <- w$x wts <- w$sum.of.weights n <- sum(wts) order <- 1 + (n - 1) * probs low <- pmax(floor(order), 1) high <- pmin(low + 1, n) order <- order %% 1 ## Find low and high order statistics ## These are minimum values of x such that the cum. 
freqs >= c(low,high) allq <- approx(cumsum(wts), x, xout=c(low,high), method='constant', f=1, rule=2)$y k <- length(probs) quantiles <- (1 - order)*allq[1:k] + order*allq[-(1:k)] names(quantiles) <- nams return(quantiles) } #From wtd.stats.s (line 119) of the Hmisc package wtdtable <- function(x, weights=NULL, type=c('list','table'), normwt=FALSE, na.rm=TRUE) { type <- match.arg(type) if(!length(weights)) weights <- rep(1, length(x)) #isdate <- testDateTime(x) ## 31aug02 + next 2 ax <- attributes(x) ax$names <- NULL if(is.character(x)) x <- as.factor(x) lev <- levels(x) x <- unclass(x) if(na.rm) { s <- !is.na(x + weights) x <- x[s, drop=FALSE] ## drop is for factor class weights <- weights[s] } n <- length(x) if(normwt) weights <- weights * length(x) / sum(weights) i <- order(x) # R does not preserve levels here x <- x[i]; weights <- weights[i] if(anyDuplicated(x)) { ## diff(x) == 0 faster but doesn't handle Inf weights <- tapply(weights, x, sum) if(length(lev)) { levused <- lev[sort(unique(x))] if((length(weights) > length(levused)) && any(is.na(weights))) weights <- weights[!is.na(weights)] if(length(weights) != length(levused)) stop('program logic error') names(weights) <- levused } if(!length(names(weights))) stop('program logic error') if(type=='table') return(weights) x <- allisnumeric(names(weights), 'vector') #if(isdate) # attributes(x) <- c(attributes(x),ax) names(weights) <- NULL return(list(x=x, sum.of.weights=weights)) } xx <- x #if(isdate) # attributes(xx) <- c(attributes(xx),ax) if(type=='list') list(x=if(length(lev))lev[x] else xx, sum.of.weights=weights) else { names(weights) <- if(length(lev)) lev[x] else xx weights } } #From Misc.s (line 241) of the Hmisc package allisnumeric <- function(x, what=c('test','vector'), extras=c('.','NA')) { what <- match.arg(what) x <- sub('[[:space:]]+$', '', x) x <- sub('^[[:space:]]+', '', x) xs <- x[!x %in% c('',extras)] #originally %nin% isnum <- suppressWarnings(!any(is.na(as.numeric(xs)))) if(what=='test') isnum else if(isnum) as.numeric(x) else x } fitdistrplus/R/vcov.R0000644000176200001440000000602713742313702014316 0ustar liggesusers############################################################################# # Copyright (c) 2009 Marie Laure Delignette-Muller, Regis Pouillot, Jean-Baptiste Denis, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### var-covariance matrix ### ### R functions ### #already in R # #vcov <- function(object, ...) # UseMethod("vcov") # #vcov.default <- function(object, ...) # return(object) vcov.fitdist <- function(object, ...) { stopifnot(inherits(object, "fitdist")) if (object$method != "mle") warning("The variance-covariance matrix can only be calculated for fits using the mle method") return(object$vcov) } vcov.fitdistcens <- function(object, ...) 
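# censored fits are obtained by maximum likelihood only, so the asymptotic
# variance-covariance matrix stored at fitting time can be returned directly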
{ stopifnot(inherits(object, "fitdistcens")) return(object$vcov) } fitdistrplus/R/cdfcomp.R0000644000176200001440000002753614102171104014751 0ustar liggesusers############################################################################# # Copyright (c) 2011 Marie Laure Delignette-Muller, Christophe Dutang, Aurelie Siberchicot # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### plot cumulative distribution functions for various fits ### of continuous distribution(s) (fitdist results) ### on a same dataset ### ### R functions ### cdfcomp <- function(ft, xlim, ylim, xlogscale = FALSE, ylogscale = FALSE, main, xlab, ylab, datapch, datacol, fitlty, fitcol, fitlwd, addlegend = TRUE, legendtext, xlegend = "bottomright", ylegend = NULL, horizontals = TRUE, verticals = FALSE, do.points = TRUE, use.ppoints = TRUE, a.ppoints = 0.5, name.points = NULL, lines01 = FALSE, discrete, add = FALSE, plotstyle = "graphics", fitnbpts = 101, ...) { if(inherits(ft, "fitdist")) { ft <- list(ft) }else if(!is.list(ft)) { stop("argument ft must be a list of 'fitdist' objects") }else { if(any(sapply(ft, function(x) !inherits(x, "fitdist")))) stop("argument ft must be a list of 'fitdist' objects") } # In the future developments, it will be necessary to check that all the fits share the same weights if(!is.null(ft[[1]]$weights)) stop("cdfcomp is not yet available when using weights") # check the 'plotstyle' argument plotstyle <- match.arg(plotstyle[1], choices = c("graphics", "ggplot"), several.ok = FALSE) # manage default parameters nft <- length(ft) if (missing(datapch)) datapch <- 16 if (missing(datacol)) datacol <- "black" if (missing(fitcol)) fitcol <- 2:(nft+1) if (missing(fitlty)) fitlty <- 1:nft if (missing(fitlwd)) fitlwd <- 1 fitcol <- rep(fitcol, length.out=nft) fitlty <- rep(fitlty, length.out=nft) fitlwd <- rep(fitlwd, length.out=nft) if (missing(xlab)) xlab <- ifelse(xlogscale, "data in log scale", "data") if (missing(ylab)) ylab <- "CDF" if (missing(main)) main <- paste("Empirical and theoretical CDFs") # check legend parameters if added if(missing(legendtext)) { legendtext <- sapply(ft, function(x) x$distname) if(length(legendtext) != length(unique(legendtext))) legendtext <- paste(legendtext, sapply(ft, function(x) toupper(x$method)), sep="-") if(length(legendtext) != length(unique(legendtext))) legendtext <- paste(legendtext, 1:nft, sep="-") } # initiate discrete if not given if(missing(discrete)) { discrete <- any(sapply(ft, function(x) x$discrete)) } if(!is.logical(discrete)) stop("wrong argument 'discrete'.") # check data mydata <- ft[[1]]$data verif.ftidata <- function(fti) { if (any(fti$data != mydata)) stop("All compared fits must have been obtained with the same dataset") invisible() } lapply(ft, verif.ftidata) # check xlim if(missing(xlim)) { xmin <- min(mydata) xmax <- 
max(mydata) xlim <- c(xmin, xmax) } else { xmin <- xlim[1] xmax <- xlim[2] } # some variable definitions distname <- ft[[1]]$distname n <- length(mydata) sdata <- sort(mydata) largedata <- (n > 1e4) logxy <- paste(ifelse(xlogscale,"x",""), ifelse(ylogscale,"y",""), sep="") if ((xlogscale == TRUE) & min(mydata) <= 0) stop("log transformation of data requires only positive values") # plot of data (ecdf) if(xlogscale && !discrete) sfin <- seq(log10(xmin), log10(xmax), by=(log10(xmax)-log10(xmin))/fitnbpts[1]) else # (!xlogscale && !discrete) and discrete sfin <- seq(xmin, xmax, length.out=fitnbpts[1]) # previous version with no vizualisation of ex-aequos # obsp <- ecdf(sdata) if (use.ppoints && !discrete) obsp <- ppoints(n, a = a.ppoints) else obsp <- (1:n) / n # computation of each fitted distribution comput.fti <- function(i) { fti <- ft[[i]] para <- c(as.list(fti$estimate), as.list(fti$fix.arg)) distname <- fti$distname pdistname <- paste("p", distname, sep = "") if(xlogscale && !discrete) { do.call(pdistname, c(list(10^sfin), as.list(para))) }else { do.call(pdistname, c(list(sfin), as.list(para))) } } fittedprob <- sapply(1:nft, comput.fti) if(NCOL(fittedprob) != nft || NROW(fittedprob) != length(sfin)) stop("problem when computing fitted CDFs.") # check ylim if(missing(ylim)) ylim <- range(obsp, fittedprob) else ylim <- range(ylim) #in case of users enter a bad ylim # optional add of horizontal and vertical lines for step function xhleft <- sdata[-length(sdata)] xhright <- sdata[-1L] yh <- obsp[-length(sdata)] xv <- xhright yvdown <- yh yvup <- obsp[-1L] if(xlogscale) sfin <- 10^sfin if(plotstyle == "graphics") { ######## plot if plotstyle=='graphics' ######## #main plot if(!add) #create a new graphic { if(!largedata && do.points) plot(sdata, obsp, main=main, xlab=xlab, ylab=ylab, xlim=xlim, ylim=ylim, log=logxy, pch=datapch, col=datacol, type="p", ...) else if(largedata) plot(sdata, obsp, main=main, xlab=xlab, ylab=ylab, xlim=xlim, ylim=ylim, log=logxy, col=datacol, type="s", ...) else if(!do.points) plot(sdata, obsp, main=main, xlab=xlab, ylab=ylab, xlim=xlim, ylim=ylim, log=logxy, col=datacol, type="n", ...) else stop("internal error in cdfcomp().") }else #add to the current graphic { #do not need parameters: main=main, xlab=xlab, ylab=ylab, xlim=xlim, ylim=ylim, log=logxy, if(!largedata && do.points) points(sdata, obsp, pch=datapch, col=datacol, type="p", ...) else if(largedata) points(sdata, obsp, col=datacol, type="s", ...) #else if(!do.points) nothing to plot } if(!largedata && do.points && !is.null(name.points)) text(sdata, obsp, labels = name.points, pos = 2) # optional add of horizontal and vertical lines for step function if (!largedata && horizontals) { segments(xhleft, yh, xhright, yh, col=datacol,...) segments(sdata[length(sdata)], 1, xmax, 1, col=datacol, lty = 2, ...) segments(xmin, 0, sdata[1], 0, col=datacol, lty = 2, ...) if (verticals) { segments(xv, yvdown, xv, yvup, col=datacol,...) segments(sdata[1], 0, sdata[1], obsp[1], col=datacol, ...) } } # plot fitted cdfs for(i in 1:nft) lines(sfin, fittedprob[,i], lty=fitlty[i], col=fitcol[i], lwd=fitlwd[i], type=ifelse(discrete, "s", "l"), ...) if(lines01) abline(h=c(0, 1), lty="dashed", col="grey") if(addlegend) legend(x=xlegend, y=ylegend, bty="n", legend=legendtext, lty=fitlty, col=fitcol, lwd=fitlwd, ...) invisible() } else if (!requireNamespace("ggplot2", quietly = TRUE)) { stop("ggplot2 needed for this function to work with plotstyle = 'ggplot'. Please install it", call. 
= FALSE) } else { ######## plot if plotstyle=='ggplot' ######## # recode the legend position according to available positions in ggplot2 if(xlegend %in% c("topleft", "bottomleft")) xlegend <- "left" if(xlegend %in% c("topright", "bottomright")) xlegend <- "right" if(xlegend == "center") xlegend <- "right" # structure the fittedprob in a relevant data.frame fittedprob <- as.data.frame(fittedprob) colnames(fittedprob) <- unlist(lapply(ft, function(X) X["distname"])) fittedprob <- stack(fittedprob) fittedprob$sfin <- sfin # sfin is recycled in the standard fashion fittedprob$ind <- factor(fittedprob$ind, levels = unique(fittedprob$ind)) # reorder levels in the appearance order of the input step <- data.frame(values = obsp, ind = "step", sfin = sdata) horiz <- data.frame(x = xhleft, y = yh, xend = xhright, yend = yh, ind = "horiz") horiz0 <- data.frame(x = xmin, y = 0, xend = sdata[1], yend = 0, ind = "horiz0") horiz1 <- data.frame(x = sdata[length(sdata)], y = 1, xend = xmax, yend = 1, ind = "horiz1") verti <- data.frame(x = sdata[1], y = 0, xend = sdata[1], yend = obsp[1], ind = "verti") ggcdfcomp <- ggplot2::ggplot(data = fittedprob, ggplot2::aes_(quote(sfin), quote(values), group = quote(ind), colour = quote(ind))) + ggplot2::xlab(xlab) + ggplot2::ylab(ylab) + ggplot2::ggtitle(main) + ggplot2::coord_cartesian(xlim = c(xlim[1], xlim[2]), ylim = c(ylim[1], ylim[2])) + {if(!largedata && do.points) ggplot2::geom_point(data = step, ggplot2::aes_(quote(sfin), quote(values)), show.legend = FALSE, colour = datacol, shape = datapch)} + {if(!largedata && do.points && !is.null(name.points)) ggplot2::geom_text(data = step, ggplot2::aes_(quote(sfin), quote(values), label = name.points), hjust = "right", nudge_x = -0.05, inherit.aes = FALSE, show.legend = FALSE)} + {if(largedata) ggplot2::geom_step(data = step, ggplot2::aes_(quote(sfin), quote(values)), show.legend = FALSE, colour = datacol, shape = datapch)} + {if(!largedata && horizontals && !verticals) ggplot2::geom_segment(data = horiz, ggplot2::aes_(x=quote(x), y=quote(y), xend=quote(xend), yend=quote(yend)), show.legend = FALSE, colour = datacol)} + {if(!largedata && horizontals && verticals) ggplot2::geom_step(data = step, ggplot2::aes_(quote(sfin), quote(values)), show.legend = FALSE, colour = datacol)} + {if(!largedata && horizontals) ggplot2::geom_segment(data = horiz1, ggplot2::aes_(x=quote(x), y=quote(y), xend=quote(xend), yend=quote(yend)), show.legend = FALSE, colour = datacol, linetype = 2)} + {if(!largedata && horizontals) ggplot2::geom_segment(data = horiz0, ggplot2::aes_(x=quote(x), y=quote(y), xend=quote(xend), yend=quote(yend)), show.legend = FALSE, colour = datacol, linetype = 2)} + {if(!largedata && horizontals && verticals) ggplot2::geom_segment(data = verti, ggplot2::aes_(x=quote(x), y=quote(y), xend=quote(xend), yend=quote(yend)), show.legend = FALSE, colour = datacol)} + {if(discrete) ggplot2::geom_step(data = fittedprob, ggplot2::aes_(linetype = quote(ind), colour = quote(ind)), size = 0.4)} + {if(!discrete) ggplot2::geom_line(data = fittedprob, ggplot2::aes_(linetype = quote(ind), colour = quote(ind), size = quote(ind)))} + ggplot2::theme_bw() + {if(addlegend) ggplot2::theme(legend.position = c(xlegend, ylegend)) else ggplot2::theme(legend.position = "none")} + ggplot2::scale_color_manual(values = fitcol, labels = legendtext) + ggplot2::scale_linetype_manual(values = fitlty, labels = legendtext) + ggplot2::scale_size_manual(values = fitlwd, labels = legendtext) + ggplot2::guides(colour = ggplot2::guide_legend(title = NULL)) + 
ggplot2::guides(linetype = ggplot2::guide_legend(title = NULL)) + ggplot2::guides(size = ggplot2::guide_legend(title = NULL)) + {if(lines01) ggplot2::geom_hline(ggplot2::aes(yintercept=0), color="grey", linetype="dashed")} + {if(lines01) ggplot2::geom_hline(ggplot2::aes(yintercept=1), color="grey", linetype="dashed")} + {if(xlogscale) ggplot2::scale_x_continuous(trans='log10')} + {if(ylogscale) ggplot2::scale_y_continuous(trans='log10')} return(ggcdfcomp) } } fitdistrplus/R/ppcomp.R0000644000176200001440000002056413747741345014656 0ustar liggesusers############################################################################# # Copyright (c) 2012 Christophe Dutang, Aurelie Siberchicot # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### P-P plot for various fits ### of continuous distribution(s) (fitdist results) ### on a same dataset ### ### R functions ### ppcomp <- function(ft, xlim, ylim, xlogscale = FALSE, ylogscale = FALSE, main, xlab, ylab, fitpch, fitcol, fitlwd, addlegend = TRUE, legendtext, xlegend = "bottomright", ylegend = NULL, use.ppoints = TRUE, a.ppoints = 0.5, line01 = TRUE, line01col = "black", line01lty = 1, ynoise = TRUE, plotstyle = "graphics", ...) 
{ if(inherits(ft, "fitdist")) { ft <- list(ft) }else if(!is.list(ft)) { stop("argument ft must be a list of 'fitdist' objects") }else { if(any(sapply(ft, function(x) !inherits(x, "fitdist")))) stop("argument ft must be a list of 'fitdist' objects") } # In the future developments, it will be necessary to check that all the fits share the same weights if(!is.null(ft[[1]]$weights)) stop("ppcomp is not yet available when using weights") # check the 'plotstyle' argument plotstyle <- match.arg(plotstyle[1], choices = c("graphics", "ggplot"), several.ok = FALSE) # check data mydata <- ft[[1]]$data verif.ftidata <- function(fti) { if (any(fti$data != mydata)) stop("All compared fits must have been obtained with the same dataset") invisible() } lapply(ft, verif.ftidata) n <- length(mydata) sdata <- sort(mydata) largedata <- (n > 1e4) if (xlogscale != ylogscale) { warning("As a P-P plot should use the same scale on x and y axes, both or none of the axes should be put in a logarithmic scale.") } logxy <- paste(ifelse(xlogscale,"x",""), ifelse(ylogscale,"y",""), sep="") # manage default parameters nft <- length(ft) if (missing(fitcol)) fitcol <- 2:(nft+1) if (missing(fitpch)) fitpch <- ifelse(largedata, 1, 21) if (missing(fitlwd)) fitlwd <- 1 fitcol <- rep(fitcol, length.out=nft) fitpch <- rep(fitpch, length.out=nft) fitlwd <- rep(fitlwd, length.out=nft) # check legend parameters if added if(missing(legendtext)) { legendtext <- sapply(ft, function(x) x$distname) if(length(legendtext) != length(unique(legendtext))) legendtext <- paste(legendtext, sapply(ft, function(x) toupper(x$method)), sep="-") if(length(legendtext) != length(unique(legendtext))) legendtext <- paste(legendtext, 1:nft, sep="-") } if (missing(xlab)) xlab <- "Theoretical probabilities" if (missing(ylab)) ylab <- "Empirical probabilities" if (missing(main)) main <- "P-P plot" if (use.ppoints) obsp <- ppoints(n, a = a.ppoints) else obsp <- (1:n) / n # computation of each fitted distribution comput.fti <- function(i) { fti <- ft[[i]] para <- c(as.list(fti$estimate), as.list(fti$fix.arg)) distname <- fti$distname pdistname <- paste("p", distname, sep="") do.call(pdistname, c(list(sdata), as.list(para))) } fittedprob <- sapply(1:nft, comput.fti) if(NCOL(fittedprob) != nft || NROW(fittedprob) != length(sdata)) stop("problem when computing fitted probabilities.") # check limits if (missing(xlim)) xlim <- range(fittedprob) if (missing(ylim)) ylim <- range(obsp) if(plotstyle == "graphics") { ######## plot if plotstyle=='graphics' ######## #main plotting if(!largedata) resquant <- plot(fittedprob[,1], obsp, main=main, xlab=xlab, ylab=ylab, log=logxy, pch = fitpch[1], xlim=xlim, ylim=ylim, col=fitcol[1], type="p", ...) else resquant <- plot(fittedprob[,1], obsp, main=main, xlab=xlab, ylab=ylab, log=logxy, lty = fitpch[1], xlim=xlim, ylim=ylim, col=fitcol[1], type="l", lwd = fitlwd[1], ...) #plot other fitted probabilities if(nft > 1 && !ynoise && !largedata) for(i in 2:nft) points(fittedprob[,i], obsp, pch=fitpch[i], col=fitcol[i], ...) if(nft > 1 && ynoise && !largedata) for(i in 2:nft) { if (ylogscale) { noise2mult <- runif(n, 0.95, 1.05) points(fittedprob[,i], obsp*noise2mult, pch=fitpch[i], col=fitcol[i], ...) }else { noise2add <- runif(n, -0.02, 0.02) points(fittedprob[,i], obsp+noise2add, pch=fitpch[i], col=fitcol[i], ...) } } if(nft > 1 && largedata) for(i in 2:nft) lines(fittedprob[,i], obsp, col=fitcol[i], lty = fitpch[i], lwd = fitlwd[i], ...) 
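# finish the base-graphics panel: the y = x reference line and, if requested,
# the legend. A hedged call sketch (object names are illustrative only,
# not taken from the package examples):
#   ppcomp(list(fitdist(x, "gamma"), fitdist(x, "weibull")), ynoise = TRUE)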
if(line01) abline(0, 1, lty=line01lty, col=line01col) if(addlegend) { if(!largedata) legend(x=xlegend, y=ylegend, bty="n", legend=legendtext, col=fitcol, pch = fitpch, ...) else legend(x=xlegend, y=ylegend, bty="n", legend=legendtext, col=fitcol, lty = fitpch, lwd = fitlwd, ...) } invisible() } else if (!requireNamespace("ggplot2", quietly = TRUE)) { stop("ggplot2 needed for this function to work with plotstyle = 'ggplot'. Please install it", call. = FALSE) } else { ######## plot if plotstyle=='ggplot' ######## # recode the legend position according to available positions in ggplot2 if(xlegend %in% c("topleft", "bottomleft")) xlegend <- "left" if(xlegend %in% c("topright", "bottomright")) xlegend <- "right" # structure the fittedprob in a relevant data.frame fittedprob <- as.data.frame(fittedprob) colnames(fittedprob) <- unlist(lapply(ft, function(X) X["distname"])) fittedprob <- stack(fittedprob) nfp <- nrow(fittedprob) fittedprob$obsp <- obsp # obsp is recycled in the standard fashion fittedprob$ind <- factor(fittedprob$ind, levels = unique(fittedprob$ind)) # reorder levels in the appearance order of the input if(nft > 1 && ynoise && !largedata) { if (ylogscale) { noise2mult <- runif(nfp, 0.95, 1.05) fittedprob$obsp <- fittedprob$obsp*noise2mult }else { noise2add <- runif(nfp, -0.02, 0.02) fittedprob$obsp <- fittedprob$obsp+noise2add } } ggppcomp <- ggplot2::ggplot(data = fittedprob, ggplot2::aes_(quote(values), quote(obsp), group = quote(ind), colour = quote(ind), shape = quote(ind), size = quote(ind))) + ggplot2::xlab(xlab) + ggplot2::ylab(ylab) + ggplot2::ggtitle(main) + ggplot2::coord_cartesian(xlim = c(xlim[1], xlim[2]), ylim = c(ylim[1], ylim[2])) + {if(!largedata) ggplot2::geom_point() else ggplot2::geom_line(ggplot2::aes_(linetype = quote(ind), size = quote(ind)))} + {if(addlegend) ggplot2::theme(legend.position = c(xlegend, ylegend), plot.title = ggplot2::element_text(hjust = 0.5)) else ggplot2::theme(legend.position = "none", plot.title = ggplot2::element_text(hjust = 0.5))} + ggplot2::scale_color_manual(values = fitcol, labels = legendtext) + ggplot2::scale_shape_manual(values = fitpch, labels = legendtext) + ggplot2::scale_linetype_manual(values = fitpch, labels = legendtext) + ggplot2::scale_size_manual(values = fitlwd, labels = legendtext) + ggplot2::guides(colour = ggplot2::guide_legend(title = NULL)) + ggplot2::guides(shape = ggplot2::guide_legend(title = NULL)) + ggplot2::guides(linetype = ggplot2::guide_legend(title = NULL)) + ggplot2::guides(size = ggplot2::guide_legend(title = NULL)) + {if(line01) ggplot2::geom_abline(intercept = 0, slope = 1)} + {if(xlogscale) ggplot2::scale_x_continuous(trans='log10')} + {if(ylogscale) ggplot2::scale_y_continuous(trans='log10')} return(ggppcomp) } } fitdistrplus/R/fitbench.R0000644000176200001440000001327213742313702015123 0ustar liggesusers############################################################################# # Copyright (c) 2016 Marie Laure Delignette-Muller, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### Benchmark of optimization algorithms to compute estimate ### ### R functions ### fitbench <- function(data, distr, method, grad=NULL, control=list(trace=0, REPORT=1, maxit=1000), lower=-Inf, upper=+Inf, ...) { if(method != "mle") stop("not supported") hasbound <- any(is.finite(lower) | is.finite(upper)) hasgrad <- !is.null(grad) reslist <- NULL # 1 - methods without gradient without constraint (always possible) for(meth in c("BFGS", "Nelder", "CG")) #CG with FR update { res1fit$time <- system.time(res1fit <- mledist(data, distr=distr, optim.method=meth, control=control, ...))[3] reslist <- c(reslist, list(res1fit)) } for(type in 2:3) #CG with PR or BS updates { res1fit$time <- system.time(res1fit <- mledist(data, distr=distr, optim.method="CG", control=c(control, type=type), ...))[3] reslist <- c(reslist, list(res1fit)) } fullname <- c("BFGS", "NM", paste0("CG", c("FR", "PR", "BS"))) # 2 - methods without gradient with constraints if(hasbound) { for(meth in c("L-BFGS-B", "Nelder")) { res1fit$time <- system.time(res1fit <- mledist(data, distr=distr, optim.method=meth, control=control, lower=lower, upper=upper, ...))[3] reslist <- c(reslist, list(res1fit)) } fullname <- c(fullname, "L-BFGS-B", "NM-B") } # 3 - methods with gradient without constraint if(hasgrad) { for(meth in c("BFGS", "CG")) #CG with FR update { res1fit$time <- system.time(res1fit <- mledist(data, distr=distr, gradient=grad, optim.method=meth, control=control, ...))[3] reslist <- c(reslist, list(res1fit)) } for(type in 2:3) #CG with PR or BS updates { res1fit$time <- system.time(res1fit <- mledist(data, distr=distr, gradient=grad, optim.method="CG", control=c(control, type=type), ...))[3] reslist <- c(reslist, list(res1fit)) } fullname <- c(fullname, paste0("G-",c("BFGS", paste0("CG", c("FR", "PR", "BS")))) ) } # 4 - methods with gradient with constraints if(hasbound && hasgrad) { for(meth in c("BFGS", "Nelder", "CG")) #CG with FR update { res1fit$time <- system.time(res1fit <- mledist(data, distr=distr, optim.method=meth, control=control, lower=lower, upper=upper, gradient=grad, ...))[3] reslist <- c(reslist, list(res1fit)) } for(type in 2:3) #CG with PR or BS updates { res1fit$time <- system.time(res1fit <- mledist(data, distr=distr, optim.method="CG", control=c(control, type=type), lower=lower, upper=upper, gradient=grad, ...))[3] reslist <- c(reslist, list(res1fit)) } fullname <- c(fullname, paste0("G-", c("BFGS", "NM", paste0("CG", c("FR", "PR", "BS"))), "-B") ) } names(reslist) <- fullname getval <- function(x) c(x$estimate, loglik=x$loglik, x$counts, x$time) #suspect behavior if(any(sapply(reslist, length) != 12)) { print(sapply(reslist, length)) } resmat <- sapply(reslist, getval) if(is.null(dim(resmat))) stop("wrong extract") allname <- c(paste("fitted", names(reslist[[1]]$estimate)), "fitted loglik", "func. eval. nb.", "grad. eval. 
nb.", "time (sec)") if(NROW(resmat) != length(allname)) stop("wrong extract") else rownames(resmat) <- allname resmat }fitdistrplus/R/util-isint.R0000644000176200001440000000533714421725463015453 0ustar liggesusers############################################################################# # Copyright (c) 2009 Marie Laure Delignette-Muller, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### test if integers ### ### R functions ### #test if the weight vector contains integer values isintw <- function(x) { x <- x - floor(x) abs(x) < .Machine$double.eps } #test if the weight vector contains only integer values isallintw <- function(x) all(isintw(x)) fitdistrplus/R/descdist.R0000644000176200001440000002301414406552254015143 0ustar liggesusers############################################################################# # Copyright (c) 2009 Marie Laure Delignette-Muller # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### Description of an empirical distribution ### ### R functions ### descdist <- function(data, discrete = FALSE, boot = NULL, method = "unbiased", graph = TRUE, print = TRUE, obs.col = "darkblue", obs.pch = 16, boot.col = "orange") { #if(is.mcnode(data)) data <- as.vector(data) if (missing(data) || !is.vector(data,mode="numeric")) stop("data must be a numeric vector") if (length(data) < 4) stop("data must be a numeric vector containing at least four values") moment <- function(data,k){ m1 <- mean(data) return(sum((data-m1)^k)/length(data)) } if (method=="unbiased") { skewness <- function(data){ # unbiased estimation (Fisher 1930) sd <- sqrt(moment(data,2)) n <- length(data) gamma1 <- moment(data,3)/sd^3 unbiased.skewness <- sqrt(n*(n-1)) * gamma1 / (n-2) return(unbiased.skewness) } kurtosis <- function(data){ # unbiased estimation (Fisher 1930) n <- length(data) var<-moment(data,2) gamma2 <- moment(data,4)/var^2 unbiased.kurtosis <- (n-1)/ ((n-2)*(n-3)) * ((n+1)*gamma2 -3*(n-1) ) + 3 return(unbiased.kurtosis) } standdev <- function(data){ sd(data) } } else if (method=="sample") { skewness <- function(data) { sd <- sqrt(moment(data, 2)) return(moment(data, 3)/sd^3) } kurtosis <- function(data) { var <- moment(data, 2) return(moment(data, 4)/var^2) } standdev <- function(data){ sqrt(moment(data,2)) } } else stop("The only possible value for the argument method are 'unbiased' or 'sample'") res <- list(min=min(data),max=max(data),median=median(data), mean=mean(data),sd=standdev(data), skewness=skewness(data),kurtosis=kurtosis(data), method = method) skewdata<-res$skewness kurtdata<-res$kurtosis # Cullen and Frey graph if (graph) { # bootstrap sample for observed distribution # and computation of kurtmax from this sample if (!is.null(boot)) { if (!is.numeric(boot) || boot<10) { stop("boot must be NULL or a integer above 10") } n<-length(data) databoot<-matrix(sample(data,size=n*boot,replace=TRUE),nrow=n,ncol=boot) s2boot<-sapply(1:boot,function(iter) skewness(databoot[,iter])^2) kurtboot<-sapply(1:boot,function(iter) kurtosis(databoot[,iter])) kurtmax<-max(10,ceiling(max(kurtboot))) xmax<-max(4,ceiling(max(s2boot))) } else{ kurtmax<-max(10,ceiling(kurtdata)) xmax<-max(4,ceiling(skewdata^2)) } ymax<-kurtmax-1 plot(skewdata^2,kurtmax-kurtdata,pch=obs.pch,xlim=c(0,xmax),ylim=c(0,ymax), yaxt="n",xlab="square of skewness",ylab="kurtosis",main="Cullen and Frey graph") yax<-as.character(kurtmax-0:ymax) axis(side=2,at=0:ymax,labels=yax) if (!discrete) { # beta dist p<-exp(-100) lq<-seq(-100,100,0.1) q<-exp(lq) s2a<-(4*(q-p)^2*(p+q+1))/((p+q+2)^2*p*q) ya<-kurtmax-(3*(p+q+1)*(p*q*(p+q-6)+2*(p+q)^2)/(p*q*(p+q+2)*(p+q+3))) p<-exp(100) lq<-seq(-100,100,0.1) q<-exp(lq) s2b<-(4*(q-p)^2*(p+q+1))/((p+q+2)^2*p*q) yb<-kurtmax-(3*(p+q+1)*(p*q*(p+q-6)+2*(p+q)^2)/(p*q*(p+q+2)*(p+q+3))) s2<-c(s2a,s2b) y<-c(ya,yb) polygon(s2,y,col="lightgrey",border="lightgrey") # gamma dist lshape<-seq(-100,100,0.1) shape<-exp(lshape) s2<-4/shape y<-kurtmax-(3+6/shape) lines(s2,y,lty=2) # lnorm dist lshape<-seq(-100,100,0.1) shape<-exp(lshape) es2<-exp(shape^2) s2<-(es2+2)^2*(es2-1) y<-kurtmax-(es2^4+2*es2^3+3*es2^2-3) lines(s2,y,lty=3) legend(xmax*0.2,ymax*1.03,pch=obs.pch,legend="Observation",bty="n",cex=0.8,pt.cex=1.2,col=obs.col) if 
(!is.null(boot)) { legend(xmax*0.2,ymax*0.98,pch=1,legend="bootstrapped values", bty="n",cex=0.8,col=boot.col) } legend(xmax*0.55,ymax*1.03,legend="Theoretical distributions",bty="n",cex=0.8) legend(xmax*0.6,0.98*ymax,pch=8,legend="normal",bty="n",cex=0.8) legend(xmax*0.6,0.94*ymax,pch=2,legend="uniform",bty="n",cex=0.8) legend(xmax*0.6,0.90*ymax,pch=7,legend="exponential",bty="n",cex=0.8) legend(xmax*0.6,0.86*ymax,pch=3,legend="logistic",bty="n",cex=0.8) legend(xmax*0.6,0.82*ymax,fill="grey80",legend="beta",bty="n",cex=0.8) legend(xmax*0.6,0.78*ymax,lty=3,legend="lognormal",bty="n",cex=0.8) legend(xmax*0.6,0.74*ymax,lty=2,legend="gamma",bty="n",cex=0.8) legend(xmax*0.58,0.69*ymax,legend=c("(Weibull is close to gamma and lognormal)"), bty="n",cex=0.6) } else { # negbin dist p<-exp(-10) lr<-seq(-100,100,0.1) r<-exp(lr) s2a<-(2-p)^2/(r*(1-p)) ya<-kurtmax-(3+6/r+p^2/(r*(1-p))) p<-1-exp(-10) lr<-seq(100,-100,-0.1) r<-exp(lr) s2b<-(2-p)^2/(r*(1-p)) yb<-kurtmax-(3+6/r+p^2/(r*(1-p))) s2<-c(s2a,s2b) y<-c(ya,yb) polygon(s2,y,col="grey80",border="grey80") legend(xmax*0.2,ymax*1.03,pch=obs.pch,legend="Observation",bty="n",cex=0.8,pt.cex=1.2, col = obs.col) if (!is.null(boot)) { legend(xmax*0.2,ymax*0.98,pch=1,legend="bootstrapped values", bty="n",cex=0.8,col=boot.col) } legend(xmax*0.55,ymax*1.03,legend="Theoretical distributions",bty="n",cex=0.8) legend(xmax*0.6,0.98*ymax,pch=8,legend="normal",bty="n",cex=0.8) legend(xmax*0.6,0.94*ymax,fill="grey80",legend="negative binomial", bty="n",cex=0.8) legend(xmax*0.6,0.90*ymax,lty=2,legend="Poisson",bty="n",cex=0.8) # poisson dist llambda<-seq(-100,100,0.1) lambda<-exp(llambda) s2<-1/lambda y<-kurtmax-(3+1/lambda) lines(s2,y,lty=2) } # bootstrap sample for observed distribution if (!is.null(boot)) { points(s2boot,kurtmax-kurtboot,pch=1,col=boot.col,cex=0.5) } # observed distribution points(skewness(data)^2,kurtmax-kurtosis(data),pch=obs.pch,cex=2,col=obs.col) # norm dist points(0,kurtmax-3,pch=8,cex=1.5,lwd=2) if (!discrete) { # unif dist points(0,kurtmax-9/5,pch=2,cex=1.5,lwd=2) # exp dist points(2^2,kurtmax-9,pch=7,cex=1.5,lwd=2) # logistic dist points(0,kurtmax-4.2,pch=3,cex=1.5,lwd=2) } } # end of is (graph) if (!print) invisible(structure(res, class = "descdist")) else structure(res, class = "descdist") } print.descdist <- function(x, ...) { if (!inherits(x, "descdist")) stop("Use only with 'descdist' objects") cat("summary statistics\n") cat("------\n") cat("min: ",x$min," max: ",x$max,"\n") cat("median: ",x$median,"\n") cat("mean: ",x$mean,"\n") if (x$method=="sample") { cat("sample sd: ",x$sd,"\n") cat("sample skewness: ",x$skewness,"\n") cat("sample kurtosis: ",x$kurtosis,"\n") } else if (x$method=="unbiased") { cat("estimated sd: ",x$sd,"\n") cat("estimated skewness: ",x$skewness,"\n") cat("estimated kurtosis: ",x$kurtosis,"\n") } } fitdistrplus/R/mgedist.R0000644000176200001440000004246213742313702015000 0ustar liggesusers############################################################################# # Copyright (c) 2009 Marie Laure Delignette-Muller # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### maximum goodness-of-fit estimation for censored or non-censored data ### and continuous distributions ### (at this time only available for non censored data) ### ### R functions ### mgedist <- function (data, distr, gof = "CvM", start=NULL, fix.arg=NULL, optim.method="default", lower=-Inf, upper=Inf, custom.optim=NULL, silent=TRUE, gradient=NULL, checkstartfix=FALSE, ...) # data may correspond to a vector for non censored data or to # a dataframe of two columns named left and right for censored data { if (!is.character(distr)) stop("distr must be a character string naming a distribution") else distname <- distr if (is.element(distname,c("binom","nbinom","geom","hyper","pois"))) stop("Maximum goodness-of-fit estimation method is not intended to fit discrete distributions") pdistname <- paste("p",distname,sep="") if (!exists(pdistname, mode="function")) stop(paste("The ", pdistname, " function must be defined")) ddistname <- paste("d",distname,sep="") if (!exists(ddistname, mode="function")) stop(paste("The ", ddistname, " function must be defined")) argddistname <- names(formals(ddistname)) if(is.null(custom.optim)) optim.method <- match.arg(optim.method, c("default", "Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN", "Brent")) gof <- match.arg(gof, c("CvM", "KS", "AD", "ADR", "ADL", "AD2R", "AD2L", "AD2")) start.arg <- start #to avoid confusion with the start() function of stats pkg (check is done lines 87-100) if(is.vector(start.arg)) #backward compatibility start.arg <- as.list(start.arg) my3dots <- list(...) 
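# a hedged usage sketch (x is an assumed numeric sample of continuous,
# non-censored observations, not package data):
#   mgedist(x, "weibull", gof = "CvM")  # Cramer-von Mises distance
#   mgedist(x, "weibull", gof = "AD2")  # tail-weighted Anderson-Darling variant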
if ("weights" %in% names(my3dots)) stop("Weights is not allowed for maximum GOF estimation") if (is.vector(data)) { cens <- FALSE if (!(is.numeric(data) & length(data)>1)) stop("data must be a numeric vector of length greater than 1 for non censored data or a dataframe with two columns named left and right and more than one line for censored data") } else { cens <- TRUE censdata <- data if (!(is.vector(censdata$left) & is.vector(censdata$right) & length(censdata[,1])>1)) stop("data must be a numeric vector of length greater than 1 for non censored data or a dataframe with two columns named left and right and more than one line for censored data") pdistname<-paste("p",distname,sep="") if (!exists(pdistname,mode="function")) stop(paste("The ",pdistname," function must be defined to apply maximum likelihood to censored data")) } if (cens) { # Definition of datasets lcens (left censored)=vector, rcens (right censored)= vector, # icens (interval censored) = dataframe with left and right # and ncens (not censored) = vector lcens<-censdata[is.na(censdata$left),]$right if (any(is.na(lcens)) ) stop("An observation cannot be both right and left censored, coded with two NA values") rcens<-censdata[is.na(censdata$right),]$left ncens<-censdata[censdata$left==censdata$right & !is.na(censdata$left) & !is.na(censdata$right),]$left icens<-censdata[censdata$left!=censdata$right & !is.na(censdata$left) & !is.na(censdata$right),] # Definition of a data set for calculation of starting values data<-c(rcens,lcens,ncens,(icens$left+icens$right)/2) } if(!checkstartfix) #pre-check has not been done by fitdist() or bootdist() { # manage starting/fixed values: may raise errors or return two named list arg_startfix <- manageparam(start.arg=start, fix.arg=fix.arg, obs=data, distname=distname) #check inconsistent parameters hasnodefaultval <- sapply(formals(ddistname), is.name) arg_startfix <- checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, argddistname, hasnodefaultval) #arg_startfix contains two names list (no longer NULL nor function) #set fix.arg.fun if(is.function(fix.arg)) fix.arg.fun <- fix.arg else fix.arg.fun <- NULL }else #pre-check has been done by fitdist() or bootdist() { arg_startfix <- list(start.arg=start, fix.arg=fix.arg) fix.arg.fun <- NULL } #unlist starting values as needed in optim() vstart <- unlist(arg_startfix$start.arg) #sanity check if(is.null(vstart)) stop("Starting values could not be NULL with checkstartfix=TRUE") #erase user value #(cannot coerce to vector as there might be different modes: numeric, character...) 
fix.arg <- arg_startfix$fix.arg ############# MGE fit using optim or custom.optim ########## # definition of the function to minimize depending on the argument gof # for non censored data if (!cens) { # the argument names are: # - par for parameters (like in optim function) # - fix.arg for optional fixed parameters # - obs for observations (previously dat but conflicts with genoud data.type.int argument) # - pdistnam for distribution name if (gof == "CvM") fnobj <- function(par, fix.arg, obs, pdistnam) { n <- length(obs) s <- sort(obs) theop <- do.call(pdistnam,c(list(s),as.list(par),as.list(fix.arg))) 1/(12*n) + sum( ( theop - (2 * 1:n - 1)/(2 * n) )^2 ) } else if (gof == "KS") fnobj <- function(par, fix.arg, obs, pdistnam) { n <- length(obs) s <- sort(obs) obspu <- seq(1,n)/n obspl <- seq(0,n-1)/n theop <- do.call(pdistnam,c(list(s),as.list(par),as.list(fix.arg))) max(pmax(abs(theop-obspu),abs(theop-obspl))) } else if (gof == "AD") fnobj <- function(par, fix.arg, obs, pdistnam) { n <- length(obs) s <- sort(obs) theop <- do.call(pdistnam,c(list(s),as.list(par),as.list(fix.arg))) - n - mean( (2 * 1:n - 1) * (log(theop) + log(1 - rev(theop))) ) } else if (gof == "ADR") fnobj <- function(par, fix.arg, obs, pdistnam) { n <- length(obs) s <- sort(obs) theop <- do.call(pdistnam,c(list(s),as.list(par),as.list(fix.arg))) n/2 - 2 * sum(theop) - mean ( (2 * 1:n - 1) * log(1 - rev(theop)) ) } else if (gof == "ADL") fnobj <- function(par, fix.arg, obs, pdistnam) { n <- length(obs) s <- sort(obs) theop <- do.call(pdistnam,c(list(s),as.list(par),as.list(fix.arg))) -3*n/2 + 2 * sum(theop) - mean ( (2 * 1:n - 1) * log(theop) ) } else if (gof == "AD2R") fnobj <- function(par, fix.arg, obs, pdistnam) { n <- length(obs) s <- sort(obs) theop <- do.call(pdistnam,c(list(s),as.list(par),as.list(fix.arg))) 2 * sum(log(1 - theop)) + mean ( (2 * 1:n - 1) / (1 - rev(theop)) ) } else if (gof == "AD2L") fnobj <- function(par, fix.arg, obs, pdistnam) { n <- length(obs) s <- sort(obs) theop <- do.call(pdistnam,c(list(s),as.list(par),as.list(fix.arg))) 2 * sum(log(theop)) + mean ( (2 * 1:n - 1) / theop ) } else if (gof == "AD2") fnobj <- function(par, fix.arg, obs, pdistnam) { n <- length(obs) s <- sort(obs) theop <- do.call(pdistnam,c(list(s),as.list(par),as.list(fix.arg))) 2 * sum(log(theop) + log(1 - theop) ) + mean ( ((2 * 1:n - 1) / theop) + ((2 * 1:n - 1) / (1 - rev(theop))) ) } } else # if (!cens) stop("Maximum goodness-of-fit estimation is not yet available for censored data.") # Function to calculate the loglikelihood to return loglik <- function(par, fix.arg, obs, ddistnam) { sum(log(do.call(ddistnam, c(list(obs), as.list(par), as.list(fix.arg)) ) ) ) } owarn <- getOption("warn") # Try to minimize the gof distance using the base R optim function if(is.null(custom.optim)) { hasbound <- any(is.finite(lower) | is.finite(upper)) # Choice of the optimization method if (optim.method == "default") { meth <- ifelse(length(vstart) > 1, "Nelder-Mead", "BFGS") }else meth <- optim.method if(meth == "BFGS" && hasbound && is.null(gradient)) { meth <- "L-BFGS-B" txt1 <- "The BFGS method cannot be used with bounds without provided the gradient." txt2 <- "The method is changed to L-BFGS-B." 
warning(paste(txt1, txt2))
    }
    options(warn=ifelse(silent, -1, 0))
    #select optim or constrOptim
    if(hasbound) #finite bounds are provided
    {
      if(!is.null(gradient))
      {
        opt.fun <- "constrOptim"
      }else #gradient == NULL
      {
        if(meth == "Nelder-Mead")
          opt.fun <- "constrOptim"
        else if(meth %in% c("L-BFGS-B", "Brent"))
          opt.fun <- "optim"
        else
        {
          txt1 <- paste("The method", meth, "cannot be used by constrOptim() nor optim() without gradient and bounds.")
          txt2 <- "Only optimization methods L-BFGS-B, Brent and Nelder-Mead can be used in such a case."
          stop(paste(txt1, txt2))
        }
      }
      if(opt.fun == "constrOptim")
      {
        #recycle parameters
        npar <- length(vstart) #as in optim() line 34
        lower <- as.double(rep_len(lower, npar)) #as in optim() line 64
        upper <- as.double(rep_len(upper, npar))
        # constraints are : Mat %*% theta >= Bnd, i.e.
        # +1 * theta[i] >= lower[i];
        # -1 * theta[i] >= -upper[i]
        #select rows from the identity matrix
        haslow <- is.finite(lower)
        Mat <- diag(npar)[haslow, ]
        #select rows from the opposite of the identity matrix
        hasupp <- is.finite(upper)
        Mat <- rbind(Mat, -diag(npar)[hasupp, ])
        colnames(Mat) <- names(vstart)
        rownames(Mat) <- paste0("constr", 1:NROW(Mat))
        #select the bounds
        Bnd <- c(lower[is.finite(lower)], -upper[is.finite(upper)])
        names(Bnd) <- paste0("constr", 1:length(Bnd))
        initconstr <- Mat %*% vstart - Bnd
        if(any(initconstr < 0))
          stop("Starting values must be in the feasible region.")
        opttryerror <- try(opt <- constrOptim(theta=vstart, f=fnobj, ui=Mat, ci=Bnd, grad=gradient,
                                              fix.arg=fix.arg, obs=data, pdistnam=pdistname,
                                              hessian=!is.null(gradient), method=meth, ...), silent=TRUE)
        if(!inherits(opttryerror, "try-error"))
          if(length(opt$counts) == 1) #appears when the initial point is a solution
            opt$counts <- c(opt$counts, NA)
      }else #opt.fun == "optim"
      {
        opttryerror <- try(opt <- optim(par=vstart, fn=fnobj, fix.arg=fix.arg, obs=data, gr=gradient,
                                        pdistnam=pdistname, hessian=TRUE, method=meth,
                                        lower=lower, upper=upper, ...), silent=TRUE)
      }
    }else #hasbound == FALSE
    {
      opt.fun <- "optim"
      opttryerror <- try(opt <- optim(par=vstart, fn=fnobj, fix.arg=fix.arg, obs=data, gr=gradient,
                                      pdistnam=pdistname, hessian=TRUE, method=meth,
                                      lower=lower, upper=upper, ...), silent=TRUE)
    }
    options(warn=owarn)
    if (inherits(opttryerror,"try-error"))
    {
      warning("The function optim encountered an error and stopped.")
      if(getOption("show.error.messages")) print(attr(opttryerror, "condition"))
      return(list(estimate = rep(NA,length(vstart)), convergence = 100, loglik = NA, hessian = NA))
    }
    if (opt$convergence>0)
    {
      warning("The function optim failed to converge, with the error code ", opt$convergence)
    }
    if(is.null(names(opt$par)))
      names(opt$par) <- names(vstart)
    res <- list(estimate = opt$par, convergence = opt$convergence, value = opt$value,
                hessian = opt$hessian, optim.function = opt.fun, optim.method = meth,
                fix.arg = fix.arg, fix.arg.fun = fix.arg.fun, weights = NULL,
                counts = opt$counts, optim.message = opt$message,
                loglik = loglik(opt$par, fix.arg, data, ddistname), gof = gof)
  }
  else # Try to minimize the gof distance using a user-supplied optim function
  {
    options(warn=ifelse(silent, -1, 0))
    if (!cens)
      opttryerror <- try(opt <- custom.optim(fn=fnobj, fix.arg=fix.arg, obs=data,
                                             pdistnam=pdistname, par=vstart, ...), silent=TRUE)
    else
      stop("Maximum goodness-of-fit estimation is not yet available for censored data.")
    options(warn=owarn)
    if (inherits(opttryerror,"try-error"))
    {
      warning("The customized optimization function encountered an error and stopped.")
      if(getOption("show.error.messages")) print(attr(opttryerror, "condition"))
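      # on failure, return the usual result shape filled with NA and the
      # sentinel convergence code 100, so that callers can test convergence
      # instead of catching errors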
      return(list(estimate = rep(NA,length(vstart)), convergence = 100, value = NA, hessian = NA))
    }
    if (opt$convergence>0)
    {
      warning("The customized optimization function failed to converge, with the error code ", opt$convergence)
    }
    if(is.null(names(opt$par)))
      names(opt$par) <- names(vstart)
    argdot <- list(...)
    method.cust <- argdot$method
    res <- list(estimate = opt$par, convergence = opt$convergence, value = opt$value,
                hessian = opt$hessian, optim.function = custom.optim, optim.method = method.cust,
                fix.arg = fix.arg, fix.arg.fun = fix.arg.fun, weights = NULL,
                counts = opt$counts, optim.message = opt$message,
                loglik = loglik(opt$par, fix.arg, data, ddistname), gof = gof)
  }
  return(res)
}
fitdistrplus/R/util-npsurv-intercens.R0000644000176200001440000000736114421725553017651 0ustar liggesusers# ----------------------------------------------------------------------- #
# Nonparametric maximum likelihood estimation from interval-censored data #
# ----------------------------------------------------------------------- #
# Internal functions for handling interval censored data                  #
# ----------------------------------------------------------------------- #
# Original code from Yong Wang, 2020                                      #
# ----------------------------------------------------------------------- #

# Delta matrix = incidence matrix of maximal intersection intervals
# Delta[i,j] = 1 indicates that ith observation is within jth interval
#
# see section 2.2, p3 of Wang & Taylor : Delta matrix is denoted S
# An interval is either (Li, Ri] if Li < Ri, or [Li, Ri] if Li = Ri.
Deltamatrix = function(LR)
{
  #remove NAs
  id_noNA <- rowSums(is.na(LR)) == 0
  LR <- LR[id_noNA,]
  L = LR[,1]
  R = LR[,2]
  ic = L != R # interval-censored
  nc = sum(ic)
  # tol = max(R[R!=Inf]) * 1e-8
  if(nc > 0)
  {
    L1 = L[ic] + max(R[R!=Inf]) * 1e-8 # open left endpoints
    LRc = cbind(c(L1, R[ic]), c(rep(0,nc), rep(1,nc)), rep(1:nc, 2))
    LRc.o = LRc[order(LRc[,1]),]
    j = which(diff(LRc.o[,2]) == 1)
    left = L[ic][LRc.o[j,3]]
    right = R[ic][LRc.o[j+1,3]]
  }
  else left = right = numeric(0)
  if(nrow(LR) - nc > 0)
  {
    ut = unique(L[!ic])
    jin = colSums(outer(ut, left, ">") & outer(ut, right, "<=")) > 0
    left = c(ut, left[!jin]) # remove those that contain exact obs.
    right = c(ut, right[!jin])
    o = order(left, right)
    left = left[o]
    right = right[o]
  }
  ## D = outer(L, left, "<=") & outer(R, right, ">=")
  D = outer(L, left, "<=") & outer(R, right, ">=") & (outer(L, right, "<") | outer(R, left, "=="))
  colnames(D) <- paste0("left=", round(left, 1), "-right=", round(right, 1))
  rownames(D) <- paste0("obs", 1:length(L))
  names(left) = names(right) = NULL
  list(left=left, right=right, Delta=D)
}

# interval distribution function, i.e., a distribution function defined on
# a set of intervals.
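# a hedged toy example (values chosen for illustration only):
#   idf(left = c(0, 2), right = c(1, 3), p = c(2, 1))
# normalises the masses, returning p = c(2/3, 1/3) over the two intervals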
# left Left endpoints of the intervals
# right Right endpoints of the intervals
# p Probability masses allocated to the intervals
idf = function(left, right, p)
{
  if(length(left) != length(right)) stop("length(left) != length(right)")
  names(left) = names(right) = names(p) = NULL
  p = rep(p, length=length(left))
  f = list(left=left, right=right, p=p/sum(p))
  f
}

idf2dataframe <- function(object)
{
  if(inherits(object, "idf"))
    data.frame(left=object$left, right=object$right, p=object$p)
  else
    as.data.frame(object)
}

icendata = function(x, w=1)
{
  if(is.null(x)) return(NULL)
  if(is.icendata(x))
  {
    if(all(w == 1)) return(x)
    w = rep(w, length = length(x$t) + nrow(x$o))
    if(length(x$t) > 0) x$wt = x$wt * w[1:length(x$wt)]
    if(nrow(x$o) > 0) x$wo = x$wo * w[length(x$wt)+1:nrow(x$o)]
    return(x)
  }
  z = vector("list", 7)
  names(z) = c("t", "wt", "o", "wo", "i1", "upper", "u")
  if(is.vector(x)) x = cbind(x, x)
  if(!is.matrix(x)) x = as.matrix(x)
  if(ncol(x) == 3) {w = w * x[,3]; x = x[,1:2]}
  if(length(w) != nrow(x)) w = rep(w, len=nrow(x))
  iw = w > 0
  w = w[iw]
  x = x[iw,,drop=FALSE]
  o = order(x[,1], x[,2])
  x = x[o,]
  w = w[o]
  id = c(TRUE, diff(x[,1]) > 0 | diff(x[,2]) > 0)
  id[is.na(id)] = FALSE # for Inf's
  w = aggregate(w, by=list(group=cumsum(id)), sum)[,2]
  x = x[id,]
  i = x[,1] == x[,2]
  z$t = x[i,1]
  names(z$t) = NULL
  z$wt = w[i]
  z$o = x[!i,1:2,drop=FALSE]
  dimnames(z$o) = list(NULL, c("L","R"))
  z$wo = w[!i]
  z$upper = max(x[,1])
  z$i1 = z$t != z$upper
  z$u = sort(unique(c(0, pmin(c(x[,1], x[,2]), z$upper))))
  class(z) = "icendata"
  z
}

is.icendata = function(x) "icendata" %in% class(x)
fitdistrplus/R/plotdist.R0000644000176200001440000003023013742313702015176 0ustar liggesusers#############################################################################
# Copyright (c) 2009 Marie Laure Delignette-Muller
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the
# Free Software Foundation, Inc.,
# 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA
#
#############################################################################
### plot functions for non-censored data
###
### R functions
###

plotdist <- function(data, distr, para, histo = TRUE, breaks="default",
                     demp = FALSE, discrete, ...)
{
  def.par <- par(no.readonly = TRUE)
  if (missing(data) || !is.vector(data, mode="numeric"))
    stop("data must be a numeric vector")
  if ((missing(distr) & !missing(para)) || (!missing(distr) & missing(para)))
    stop("distr and para must both be defined")
  if (!histo & !demp)
    stop("at least one of the arguments histo and demp must be set to TRUE")

  xlim <- c(min(data), max(data)) # for plot of discrete distributions
  s <- sort(data)
  n <- length(data)
  if (missing(distr))
  { ## Plot of data only
    par(mfrow=c(1, 2))
    if(missing(discrete))
      discrete <- FALSE
    if (!discrete)
    {
      # plot for continuous data alone
      obsp <- ppoints(s)
      # PLOT 1 ---
      if (histo)
      {
        if(demp)
        {
          if (breaks=="default")
            h <- hist(data, freq=FALSE, xlab="Data", main="Empirical density", ...)
else h <- hist(data, freq=FALSE, xlab="Data", main="Empirical density", breaks=breaks, ...) lines(density(data)$x, density(data)$y, lty=2, col="black") } else { if (breaks=="default") h <- hist(data, freq=FALSE, xlab="Data", main="Histogram", ...) else h <- hist(data, freq=FALSE, xlab="Data", main="Histogram", breaks=breaks, ...) } } else { h <- hist(data, freq=FALSE, xlab="Data", main="Histogram", plot = FALSE, ...) plot(density(data)$x, density(data)$y, lty=1, col="black", type = "l", xlab="Data", main=paste("Empirical density"), ylab = "Density",...) } # PLOT 2 --- plot(s, obsp, main=paste("Cumulative distribution"), xlab="Data", xlim=c(h$breaks[1], h$breaks[length(h$breaks)]), ylab="CDF", ...) } else { # plot for discrete data alone if (breaks!="default") warning("Breaks are not taken into account for discrete data") # plot of empirical distribution t <- table(data) xval <- as.numeric(names(t)) # xvalfin <- seq(min(xval), max(xval),by=1) ydobs <- as.vector(t)/n ydmax <- max(ydobs) plot(xval, ydobs, type="h", xlim=xlim, ylim=c(0, ydmax), main="Empirical distribution", xlab="Data", ylab="Density", ...) # plot of the cumulative probability distributions ycdfobs <- cumsum(ydobs) plot(xval, ycdfobs, type="p", xlim=xlim, ylim=c(0, 1), main="Empirical CDFs", xlab="Data", ylab="CDF", ...) } } #end of if (missing(distr)) else { # plot of data and distribution if (!is.character(distr)) distname <- substring(as.character(match.call()$distr), 2) else distname <- distr if (!is.list(para)) stop("'para' must be a named list") ddistname <- paste("d", distname, sep="") if (!exists(ddistname, mode="function")) stop(paste("The ", ddistname, " function must be defined")) pdistname <- paste("p", distname, sep="") if (!exists(pdistname, mode="function")) stop(paste("The ", pdistname, " function must be defined")) qdistname <- paste("q", distname, sep="") if (!exists(qdistname, mode="function")) stop(paste("The ", qdistname, " function must be defined")) densfun <- get(ddistname, mode="function") nm <- names(para) f <- formals(densfun) args <- names(f) m <- match(nm, args) if (any(is.na(m))) stop(paste("'para' specifies names which are not arguments to ", ddistname)) if(missing(discrete)) { if (is.element(distname, c("binom", "nbinom", "geom", "hyper", "pois"))) discrete <- TRUE else discrete <- FALSE } if (!discrete) { # plot of continuous data with theoretical distribution par(mfrow=c(2, 2)) obsp <- ppoints(s) # plot of empirical and theoretical density # computes densities in order to define limits for y-axis if (breaks=="default") h <- hist(data, plot=FALSE) else h <- hist(data, breaks=breaks, plot=FALSE, ...) xhist <- seq(min(h$breaks), max(h$breaks), length=1000) yhist <- do.call(ddistname, c(list(xhist), as.list(para))) if(length(yhist) != length(xhist)) stop("problem when computing densities.") ymax <- ifelse(is.finite(max(yhist)), max(max(h$density), max(yhist)), max(h$density)) # PLOT 1 - plot of empirical and theoretical density # empirical density if (histo) { hist(data, freq=FALSE, xlab="Data", ylim=c(0, ymax), breaks=h$breaks, main=paste("Empirical and theoretical dens."), ...) if(demp) { lines(density(data)$x, density(data)$y, lty=2, col="black") } } else plot(density(data)$x, density(data)$y, lty=2, col="black", type = "l", xlab="Data", main=paste("Empirical and theoretical dens."), ylab = "Density", xlim = c(min(h$breaks), max(h$breaks)), ...) 
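# a hedged usage sketch of this four-panel display (parameter values illustrative):
#   plotdist(rgamma(100, shape = 2, rate = 1), "gamma", para = list(shape = 2, rate = 1))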
if (demp) legend("topright",bty="n",lty=c(2,1),col=c("black","red"), legend=c("empirical","theoretical"), bg="white",cex=0.7) # Add of theoretical density lines(xhist, yhist,lty=1,col="red") # PLOT 2 - plot of the qqplot theoq <- do.call(qdistname, c(list(obsp), as.list(para))) if(length(theoq) != length(obsp)) stop("problem when computing quantities.") plot(theoq, s, main=" Q-Q plot", xlab="Theoretical quantiles", ylab="Empirical quantiles", ...) abline(0, 1) # PLOT 3 - plot of the cumulative probability distributions xmin <- h$breaks[1] xmax <- h$breaks[length(h$breaks)] if(length(s) != length(obsp)) stop("problem when computing probabilities.") plot(s, obsp, main=paste("Empirical and theoretical CDFs"), xlab="Data", ylab="CDF", xlim=c(xmin, xmax), ...) sfin <- seq(xmin, xmax, by=(xmax-xmin)/100) theopfin <- do.call(pdistname, c(list(sfin), as.list(para))) lines(sfin, theopfin, lty=1,col="red") # PLOT 4 - plot of the ppplot theop <- do.call(pdistname, c(list(s), as.list(para))) if(length(theop) != length(obsp)) stop("problem when computing probabilities.") plot(theop, obsp, main="P-P plot", xlab="Theoretical probabilities", ylab="Empirical probabilities", ...) abline(0, 1) } else { # plot of discrete data with theoretical distribution par(mfrow=c(1, 2)) if (breaks!="default") warning("Breaks are not taken into account for discrete distributions") # plot of empirical and theoretical distributions t <- table(data) xval <- as.numeric(names(t)) xvalfin <- seq(min(xval), max(xval),by=1) xlinesdec <- min((max(xval)-min(xval))/30, 0.4) yd <- do.call(ddistname, c(list(xvalfin), as.list(para))) if(length(yd) != length(xvalfin)) stop("problem when computing density points.") ydobs <- as.vector(t)/n ydmax <- max(yd, ydobs) plot(xvalfin+xlinesdec, yd, type='h', xlim=c(min(xval), max(xval)+xlinesdec), ylim=c(0, ydmax), lty=1, col="red", main="Emp. and theo. distr.", xlab="Data", ylab="Density", ...) points(xval, ydobs, type='h', lty=1, col="black",...) legend("topright", lty=c(1, 1), col=c("black","red"), legend=c("empirical", paste("theoretical")), bty="o", bg="white",cex=0.6,...) # plot of the cumulative probability distributions ycdf <- do.call(pdistname, c(list(xvalfin), as.list(para))) if(length(ycdf) != length(xvalfin)) stop("problem when computing probabilities.") plot(xvalfin, ycdf, type="s", xlim=c(min(xval), max(xval)+xlinesdec), ylim=c(0, 1), lty=1, col="red", main="Emp. and theo. CDFs", xlab="Data", ylab="CDF", ...) # plot(xvalfin+xlinesdec, ycdf, type="h", xlim=c(min(xval), max(xval)+xlinesdec), # ylim=c(0, 1), lty=3, col="red", # main="Emp. and theo. CDFs", xlab="Data", # ylab="CDF", ...) ycdfobs <- cumsum(ydobs) points(xval,ycdfobs, type="p", col="black",...) legend("bottomright", lty=c(1, 1), col=c("black","red"), legend=c("empirical", paste("theoretical")), bty="o", bg ="white",cex=0.6,...) } } par(def.par) invisible() } fitdistrplus/R/prefit.R0000644000176200001440000003071114421726030014624 0ustar liggesusers############################################################################# # Copyright (c) 2016 Marie Laure Delignette-Muller, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. 
# # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### prefit for maximum likelihood estimation ### ### R functions ### #search good starting values prefit <- function(data, distr, method = c("mle", "mme", "qme", "mge"), feasible.par, memp=NULL, order=NULL, probs=NULL, qtype=7, gof=NULL, fix.arg=NULL, lower, upper, weights=NULL, silent=TRUE, ...) { if (!is.character(distr)) distname <- substring(as.character(match.call()$distr), 2) else distname <- distr method <- match.arg(method, c("mle", "mme", "qme", "mge")) if(method != "qme" && !is.null(probs)) stop("probs is not needed") if(method != "mme" && (!is.null(memp) || !is.null(order))) stop("memp, order are not needed") if(method != "mge" && !is.null(gof)) stop("gof is not needed") ddistname <- paste0("d", distname) if (!exists(ddistname, mode="function")) stop(paste("The ", ddistname, " function must be defined")) pdistname <- paste0("p", distname) if (!exists(pdistname, mode="function") && method == "mge") stop(paste("The ", pdistname, " function must be defined")) qdistname <- paste0("q",distname) if (!exists(qdistname, mode="function") && method == "qme") stop(paste("The ", qdistname, " function must be defined")) mdistname <- paste0("m",distname) if (!exists(mdistname, mode="function") && method == "mme") stop(paste("The ", mdistname, " function must be defined")) if(is.null(probs) && method == "qme") stop("probs must be provided") if(is.null(order) && method == "mme") stop("order must be provided") if(missing(feasible.par)) stop("feasible values must be provided") if(missing(lower) || missing(upper)) stop("bounds (yet infinite) must be provided") #recycle parameters if(is.list(feasible.par)) feasible.par <- unlist(feasible.par) npar <- length(feasible.par) #as in optim() line 34 lower <- as.double(rep_len(lower, npar)) #as in optim() line 64 upper <- as.double(rep_len(upper, npar)) if(all(is.infinite(lower)) && all(is.infinite(upper))) { bnd <- detectbound(distname, feasible.par, data, fix.arg=fix.arg) }else { bnd <- rbind(lower, upper) colnames(bnd) <- names(feasible.par) rownames(bnd) <- c("lowb", "uppb") } if(!silent) print(bnd) translist <- invlist <- NULL for(i in 1:NCOL(bnd)) { if(bnd["lowb", i] == -Inf && bnd["uppb", i] == Inf) { translist <- c(translist, list(function(x) x)) invlist <- c(invlist, list(function(x) x)) }else if(bnd["lowb", i] == 0 && bnd["uppb", i] == Inf) { translist <- c(translist, list(T0Inf)) invlist <- c(invlist, list(iT0Inf)) }else if(bnd["lowb", i] == 1 && bnd["uppb", i] == Inf) { translist <- c(translist, list(T1Inf)) invlist <- c(invlist, list(iT1Inf)) }else if(bnd["lowb", i] == 0 && bnd["uppb", i] == 1) { translist <- c(translist, list(T01)) invlist <- c(invlist, list(iT01)) }else if(bnd["lowb", i] == -1 && bnd["uppb", i] == 0) { translist <- c(translist, list(Tm10)) invlist <- c(invlist, list(iTm10)) }else { print(bnd) stop("unknown parameter domain") } } if(!silent) print(translist) if(!is.null(weights)) { if(any(weights < 0)) stop("weights should be a vector of numerics greater than 0") if(length(weights) != NROW(data)) 
stop("weights should be a vector with a length equal to the observation number") if(method == "mge") stop("weights is not allowed for maximum GOF estimation") } #maximum likelihood if(method == "mle") { if(is.null(weights)) weights <- rep(1, NROW(data)) fnobj <- function(par, fix.arg, obs, ddistnam, qdistnam, pdistnam, mdistnam, qtype, memp, gof) { if(!is.list(par)) par <- as.list(par) lpar <- lapply(1:length(par), function(i) translist[[i]](par[[i]])) -sum( weights * log(do.call(ddistnam, c(list(obs), lpar, as.list(fix.arg)) ) ) ) } } #quantile matching if(method == "qme" && is.null(weights)) { DIFF2Q <- function(par, fix.arg, prob, obs, qdistnam, qtype) { if(!is.list(par)) par <- as.list(par) lpar <- lapply(1:length(par), function(i) translist[[i]](par[[i]])) qtheo <- do.call(qdistnam, c(list(prob), lpar, as.list(fix.arg)) ) qemp <- as.numeric(quantile(obs, probs=prob, type=qtype)) (qemp - qtheo)^2 } fnobj <- function(par, fix.arg, obs, ddistnam, qdistnam, pdistnam, mdistnam, qtype, memp, gof) sum( sapply(probs, function(p) DIFF2Q(par, fix.arg, p, obs, qdistnam, qtype)) ) } if(method == "qme" && !is.null(weights)) { DIFF2Q <- function(par, fix.arg, prob, obs, qdistnam, qtype) { if(!is.list(par)) par <- as.list(par) lpar <- lapply(1:length(par), function(i) translist[[i]](par[[i]])) qtheo <- do.call(qdistnam, c(list(prob), lpar, as.list(fix.arg)) ) qemp <- as.numeric(wtdquantile(x=obs, weights=weights, probs=prob)) (qemp - qtheo)^2 } fnobj <- function(par, fix.arg, obs, ddistnam, qdistnam, pdistnam, mdistnam, qtype, memp, gof) sum( sapply(probs, function(p) DIFF2Q(par, fix.arg, p, obs, qdistnam, qtype)) ) } #moment matching if(method == "mme" && is.null(weights)) { DIFF2 <- function(par, fix.arg, order, obs, mdistnam, memp, weights) { if(!is.list(par)) par <- as.list(par) lpar <- lapply(1:length(par), function(i) translist[[i]](par[[i]])) momtheo <- do.call(mdistnam, c(list(order), lpar, as.list(fix.arg)) ) momemp <- as.numeric(memp(obs, order)) (momemp - momtheo)^2 } fnobj <- function(par, fix.arg, obs, ddistnam, qdistnam, pdistnam, mdistnam, qtype, memp, gof) sum( sapply(order, function(o) DIFF2(par, fix.arg, o, obs, mdistnam, memp)) ) } if(method == "mme" && !is.null(weights)) { DIFF2 <- function(par, fix.arg, order, obs, mdistnam, memp, weights) { if(!is.list(par)) par <- as.list(par) lpar <- lapply(1:length(par), function(i) translist[[i]](par[[i]])) momtheo <- do.call(mdistnam, c(list(order), lpar, as.list(fix.arg)) ) momemp <- as.numeric(memp(obs, order, weights)) (momemp - momtheo)^2 } fnobj <- function(par, fix.arg, obs, ddistnam, qdistnam, pdistnam, mdistnam, qtype, memp, gof) sum( sapply(order, function(o) DIFF2(par, fix.arg, o, obs, mdistnam, memp, weights)) ) } #gof matching if(method == "mge") { fnobj <- function(par, fix.arg, obs, ddistnam, qdistnam, pdistnam, mdistnam, qtype, memp, gof) { if(!is.list(par)) par <- as.list(par) lpar <- lapply(1:length(par), function(i) translist[[i]](par[[i]])) n <- length(obs) s <- sort(obs) theop <- do.call(pdistnam, c(list(s), lpar, as.list(fix.arg)) ) obspu <- seq(1,n)/n obspl <- seq(0,n-1)/n if (gof == "CvM") 1/(12*n) + sum( ( theop - (2 * 1:n - 1)/(2 * n) )^2 ) else if (gof == "KS") max(pmax(abs(theop-obspu),abs(theop-obspl))) else if (gof == "AD") - n - mean( (2 * 1:n - 1) * (log(theop) + log(1 - rev(theop))) ) else if (gof == "ADR") n/2 - 2 * sum(theop) - mean ( (2 * 1:n - 1) * log(1 - rev(theop)) ) else if (gof == "ADL") -3*n/2 + 2 * sum(theop) - mean ( (2 * 1:n - 1) * log(theop) ) else if (gof == "AD2R") 2 * sum(log(1 - theop)) + 
mean ( (2 * 1:n - 1) / (1 - rev(theop)) ) else if (gof == "AD2L") 2 * sum(log(theop)) + mean ( (2 * 1:n - 1) / theop ) else if (gof == "AD2") 2*sum(log(theop) + log(1 - theop)) + mean(((2*1:n - 1) / theop) + ((2*1:n - 1) / (1 - rev(theop)))) } } ltrans.par <- sapply(1:length(feasible.par), function(i) invlist[[i]](feasible.par[[i]])) if(!silent) { cat("before transform\n") print(unlist(feasible.par)) cat("after transform\n") print(unlist(ltrans.par)) } if(method == "mle") test1 <- try(fnobj(par=ltrans.par, fix.arg = fix.arg, obs=data, ddistnam = ddistname), silent=silent) if(method == "qme") test1 <- try(fnobj(par=ltrans.par, fix.arg = fix.arg, obs=data, qdistnam=qdistname, qtype=qtype), silent=silent) if(method == "mme") test1 <- try(fnobj(par=ltrans.par, fix.arg = fix.arg, obs=data, mdistnam=mdistname, memp=memp), silent=silent) if(method == "mge") test1 <- try(fnobj(par=ltrans.par, fix.arg = fix.arg, obs=data, pdistnam=pdistname, gof=gof), silent=silent) if(inherits(test1, "try-error") || silent == FALSE) print(test1) #get old warning value and set it owarn <- options(warn=ifelse(silent, -1, 0)) if(method == "mle") opttryerror <- try(opt <- optim(par=ltrans.par, fn=fnobj, fix.arg=fix.arg, obs=data, ddistnam=ddistname, hessian=FALSE, method="BFGS", ...), silent=silent) if(method == "qme") opttryerror <- try(opt <- optim(par=ltrans.par, fn=fnobj, fix.arg=fix.arg, obs=data, qdistnam=qdistname, qtype=qtype, hessian=FALSE, method="BFGS", ...), silent=silent) if(method == "mme") opttryerror <- try(opt <- optim(par=ltrans.par, fn=fnobj, fix.arg=fix.arg, obs=data, mdistnam=mdistname, memp=memp, hessian=FALSE, method="BFGS", ...), silent=silent) if(method == "mge") opttryerror <- try(opt <- optim(par=ltrans.par, fn=fnobj, fix.arg=fix.arg, obs=data, pdistnam=pdistname, gof=gof, hessian=FALSE, method="BFGS", ...), silent=silent) #get back to old warning value on.exit(options(owarn), add=TRUE) if(inherits(opttryerror, "try-error")) stop("unsuccessful pre-fitting process") if(!silent) print(opt) if(opt$convergence %in% 0:1) #either successful or reached the iteration limit (see ?optim) { prefitpar <- unlist(sapply(1:length(opt$par), function(i) translist[[i]](opt$par[i]))) }else { prefitpar <- rep(NA, length(opt$par)) } names(prefitpar) <- names(feasible.par) as.list(prefitpar) } fitdistrplus/R/util-startarg.R0000644000176200001440000001146614421750607016152 0ustar liggesusers# startargdefault function returns initial values of parameters generally using moments or quantiles # INPUTS #x : data vector or matrix #distr : the distribution name # OUTPUTS # a named list or raises an error startargdefault <- function(x, distr) { if (distr == "norm") { n <- length(x) sd0 <- sqrt((n - 1)/n) * sd(x) mx <- mean(x) start <- list(mean=mx, sd=sd0) }else if (distr == "lnorm") { if (any(x <= 0)) stop("values must be positive to fit a lognormal distribution") n <- length(x) lx <- log(x) sd0 <- sqrt((n - 1)/n) * sd(lx) ml <- mean(lx) start <- list(meanlog=ml, sdlog=sd0) }else if (distr == "pois") { start <- list(lambda=mean(x)) }else if (distr == "exp") { if (any(x < 0)) stop("values must be positive to fit an exponential distribution") start <- list(rate=1/mean(x)) }else if (distr == "gamma") { if (any(x < 0)) stop("values must be positive to fit an gamma distribution") n <- length(x) m <- mean(x) v <- (n - 1)/n*var(x) start <- list(shape=m^2/v, rate=m/v) }else if (distr == "nbinom") { n <- length(x) m <- mean(x) v <- (n - 1)/n*var(x) size <- ifelse(v > m, m^2/(v - m), 100) start <- list(size = size, mu = m) }else 
if (distr == "geom" ) { m <- mean(x) prob <- ifelse(m>0, 1/(1+m), 1) start <- list(prob=prob) }else if (distr == "beta") { if (any(x < 0) | any(x > 1)) stop("values must be in [0-1] to fit a beta distribution") n <- length(x) m <- mean(x) v <- (n - 1)/n*var(x) aux <- m*(1-m)/v - 1 start <- list(shape1=m*aux, shape2=(1-m)*aux) }else if (distr == "weibull") { if (any(x < 0)) stop("values must be positive to fit an Weibull distribution") m <- mean(log(x)) v <- var(log(x)) shape <- 1.2/sqrt(v) scale <- exp(m + 0.572/shape) start <- list(shape = shape, scale = scale) }else if (distr == "logis") { n <- length(x) m <- mean(x) v <- (n - 1)/n*var(x) start <- list(location=m, scale=sqrt(3*v)/pi) }else if (distr == "cauchy") { start <- list(location=median(x), scale=IQR(x)/2) }else if (distr == "unif"){ start <- list(min=0, max=1) }else if (distr == "invgamma") { if (any(x < 0)) stop("values must be positive to fit an inverse gamma distribution") #http://en.wikipedia.org/wiki/Inverse-gamma_distribution m1 <- mean(x) m2 <- mean(x^2) shape <- (2*m2-m1^2)/(m2-m1^2) scale <- m1*m2/(m2-m1^2) start <- list(shape=shape, scale=scale) }else if (distr == "llogis") { if (any(x < 0)) stop("values must be positive to fit a log-logistic distribution") p25 <- as.numeric(quantile(x, 0.25)) p75 <- as.numeric(quantile(x, 0.75)) shape <- 2*log(3)/(log(p75)-log(p25)) scale <- exp(log(p75)+log(p25))/2 start <- list(shape=shape, scale=scale) }else if (distr == "invweibull") { if (any(x < 0)) stop("values must be positive to fit an inverse Weibull distribution") g <- log(log(4))/(log(log(4/3))) p25 <- as.numeric(quantile(x, 0.25)) p75 <- as.numeric(quantile(x, 0.75)) shape <- exp((g*log(p75)-log(p25))/(g-1)) scale <-log(log(4))/(log(shape)-log(p25)) start <- list(shape=shape, scale=max(scale, 1e-9)) }else if (distr == "pareto1") { if (any(x < 0)) stop("values must be positive to fit a Pareto distribution") #http://www.math.umt.edu/gideon/pareto.pdf x1 <- min(x) m1 <- mean(x) n <- length(x) shape <- (n*m1-x1)/(n*(m1-x1)) min <- x1*(n*shape - 1)/(n*shape) start <- list(shape=shape, min=min) }else if (distr == "pareto") { if (any(x < 0)) stop("values must be positive to fit a Pareto distribution") m1 <- mean(x) m2 <- mean(x^2) scale <- (m1*m2)/(m2-2*m1^2) shape <- 2*(m2-m1^2)/(m2-2*m1^2) start <- list(shape=shape, scale=scale) }else if (distr == "lgamma") { if (any(x < 0)) stop("values must be positive to fit a log-gamma distribution") #p228 of Klugmann and Hogg (1984) m1 <- mean(log(x)) m2 <- mean(log(x)^2) alpha <- m1^2/(m2-m1^2) lambda <- m1/(m2-m1^2) start <- list(shapelog=alpha, ratelog=lambda) }else if (distr == "trgamma") { if (any(x < 0)) stop("values must be positive to fit an trans-gamma distribution") #same as gamma with shape2=tau=1 n <- length(x) m <- mean(x) v <- (n - 1)/n*var(x) start <- list(shape1=m^2/v, shape2=1, rate=m/v) }else if (distr == "invtrgamma") { if (any(x < 0)) stop("values must be positive to fit an inverse trans-gamma distribution") #same as gamma with shape2=tau=1 n <- length(1/x) m <- mean(1/x) v <- (n - 1)/n*var(1/x) start <- list(shape1=m^2/v, shape2=1, rate=m/v) }else stop(paste0("Unknown starting values for distribution ", distr, ".")) return(start) } fitdistrplus/R/ppcompcens.R0000644000176200001440000002065713742313702015515 0ustar liggesusers############################################################################# # Copyright (c) 2018 Marie Laure Delignette-Muller, Christophe Dutang, # Aurelie Siberchicot # # This program is free software; you can redistribute it and/or modify # it 
under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### PP plot for various fits ### of continuous distribution(s) (fitdistcens results) ### on the same dataset ### ### R functions ### ppcompcens <- function(ft, xlim, ylim, xlogscale = FALSE, ylogscale = FALSE, main, xlab, ylab, fillrect, fitcol, fitlwd, addlegend = TRUE, legendtext, xlegend = "bottomright", ylegend = NULL, line01 = TRUE, line01col = "black", line01lty = 1, ynoise = TRUE, NPMLE.method = "Wang", plotstyle = "graphics", ...) { if(inherits(ft, "fitdistcens")) { ft <- list(ft) }else if(!is.list(ft)) { stop("argument ft must be a list of 'fitdistcens' objects") }else { if(any(sapply(ft, function(x) !inherits(x, "fitdistcens")))) stop("argument ft must be a list of 'fitdistcens' objects") } NPMLE.method <- match.arg(NPMLE.method, c("Wang", "Turnbull.intervals", "Turnbull.middlepoints")) if (NPMLE.method == "Turnbull.middlepoints") { warning("The P-P plot for censored data is not available with NPMLE.method set to Turnbull.middlepoints. Turnbull.intervals will be used instead.") NPMLE.method <- "Turnbull.intervals" } # check the 'plotstyle' argument plotstyle <- match.arg(plotstyle[1], choices = c("graphics", "ggplot"), several.ok = FALSE) # In future developments, it will be necessary to check that all the fits share the same weights if(!is.null(ft[[1]]$weights)) stop("ppcompcens is not yet available when using weights") censdata <- ft[[1]]$censdata # check data verif.ftidata <- function(fti) { if (any(fti$censdata$left != censdata$left, na.rm=TRUE) | any(fti$censdata$right != censdata$right, na.rm=TRUE)) stop("All compared fits must have been obtained with the same dataset") } l <- lapply( ft, verif.ftidata) rm(l) if (xlogscale != ylogscale) { xlogscale <- ylogscale <- TRUE warning("As a P-P plot should use the same scale on x and y axes, both axes were put on a logarithmic scale.") } logxy <- paste(ifelse(xlogscale,"x",""), ifelse(ylogscale,"y",""), sep="") # manage default parameters nft <- length(ft) if (missing(fitcol)) fitcol <- 2:(nft+1) if (missing(fitlwd)) fitlwd <- 1 fitcol <- rep(fitcol, length.out=nft) fitlwd <- rep(fitlwd, length.out=nft) if (missing(fillrect)) if ((nft == 1) | plotstyle == "ggplot") fillrect <- "lightgrey" else fillrect <- NA # check legend parameters if added if(missing(legendtext)) { legendtext <- sapply(ft, function(x) x$distname) if(length(legendtext) != length(unique(legendtext))) legendtext <- paste(legendtext, sapply(ft, function(x) toupper(x$method)), sep="-") if(length(legendtext) != length(unique(legendtext))) legendtext <- paste(legendtext, 1:nft, sep="-") } if (missing(xlab)) xlab <- "Theoretical probabilities" if (missing(ylab)) ylab <- "Empirical probabilities" if (missing(main)) main <- "P-P plot" # computation from censdata f <- npmle(censdata, method = NPMLE.method) if(missing(xlim) & missing(ylim)) { if (xlogscale
== TRUE) { stop("You must define the limits of axes using xlim or ylim in order to use a logarithmic scale for a P-P plot with censored data.") } else { xlim <- ylim <- c(0, 1) } } else # at least xlim or ylim are specified { if (missing(xlim) | missing(ylim)) { warning("By default the same limits are applied to x and y axes. You should specifiy both if you want different x and y limits") if (missing(xlim)) xlim <- ylim else ylim <- xlim } } k <- length(f$left) Fnpsurv <- cumsum(f$p) Fbefore <- c(0, Fnpsurv[-k]) df <- data.frame(left = f$left, right = f$right) # Definition of vertices of each rectangle Qi.left <- df$left # dim k Qi.right <- df$right nQi <- length(Qi.left) Pi.low <- Fbefore Pi.up <- Fnpsurv lrect <- vector(mode = "list", length = nft) for(i in 1:nft) { fti <- ft[[i]] para <- c(as.list(fti$estimate), as.list(fti$fix.arg)) distname <- fti$distname pdistname <- paste("p", distname, sep="") if (is.element(distname, c("binom", "nbinom", "geom", "hyper", "pois"))) warning(" Be careful, variables are considered continuous in this function!") Pitheo.low <- do.call(pdistname, c(list(Qi.left), as.list(para))) Pitheo.up <- do.call(pdistname, c(list(Qi.right), as.list(para))) lrect[[i]] <- data.frame(Pitheo.low = Pitheo.low, Pitheo.up = Pitheo.up, Pi.low = Pi.low, Pi.up = Pi.up, ind = legendtext[i]) } if(plotstyle == "graphics") { ######## plot if plotstyle=='graphics' ######## # main plot plot(1, 1, type = "n", main = main, xlim = xlim, ylim = ylim, xlab = xlab, ylab = ylab, log = logxy) # plot of rectangles plot.fti <- function(i, ...) { Pitheo.low <- lrect[[i]]$Pitheo.low Pi.low <- lrect[[i]]$Pi.low Pitheo.up <- lrect[[i]]$Pitheo.up Pi.up <- lrect[[i]]$Pi.up if (ynoise & nft > 1) { if (xlogscale == TRUE) { noise2mult <- runif(nQi, 0.99, 1.01) rect(xleft = Pitheo.low, ybottom = Pi.low * noise2mult, xright = Pitheo.up, ytop = Pi.up * noise2mult, border = fitcol[i], col = fillrect[i], lwd = fitlwd[i]) } else { noise2add <- runif(nQi, -0.01, 0.01) rect(xleft = Pitheo.low, ybottom = Pi.low + noise2add, xright = Pitheo.up, ytop = Pi.up + noise2add, border = fitcol[i], col = fillrect[i], lwd = fitlwd[i]) } } else # ! ynoise { rect(xleft = Pitheo.low, ybottom = Pi.low, xright = Pitheo.up, ytop = Pi.up, border = fitcol[i], col = fillrect[i], lwd = fitlwd[i]) } } s <- sapply(1:nft, plot.fti, ...) rm(s) if(line01) abline(0, 1, lty = line01lty, col = line01col) if (addlegend) { legend(x=xlegend, y=ylegend, bty="n", legend=legendtext, col=fitcol, lty = 1, lwd = fitlwd, ...) } invisible() } else if (!requireNamespace("ggplot2", quietly = TRUE)) { stop("ggplot2 needed for this function to work with plotstyle = 'ggplot'. Please install it", call. 
= FALSE) } else { ######## plot if plotstyle=='ggplot' ######## drect <- do.call("rbind", lrect) ind <- as.factor(drect$ind) fitcol <- rep(fitcol, table(ind)) fitlwd <- rep(fitlwd, table(ind)) fillrect <- if(length(fillrect) > 1) {rep(fillrect, table(ind))} else {fillrect} ggppcompcens <- ggplot2::ggplot(drect) + ggplot2::coord_cartesian(xlim = xlim, ylim = ylim) + ggplot2::ggtitle(main) + ggplot2::xlab(xlab) + ggplot2::ylab(ylab) + ggplot2::geom_rect(data=drect, mapping=ggplot2::aes_(xmin=quote(Pitheo.low), xmax=quote(Pitheo.up), ymin=quote(Pi.low), ymax=quote(Pi.up)), colour = fitcol, fill = fillrect, alpha=0.5, size = fitlwd) + ggplot2::theme_bw() + ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5)) + {if(line01) ggplot2::geom_abline(ggplot2::aes(slope = 1, intercept = 0), color = line01col, linetype = line01lty)} + {if(xlogscale) ggplot2::scale_x_continuous(trans='log10')} + {if(ylogscale) ggplot2::scale_y_continuous(trans='log10')} + ggplot2::facet_wrap(~ind) return(ggppcompcens) } } fitdistrplus/R/util-npmle.R0000644000176200001440000000230414421725603015423 0ustar liggesusers# ----------------------------------------------------------------------- # # Copyright (c) 2020 Marie Laure Delignette-Muller # # and Christophe Dutang # # # Nonparametric maximum likelihood estimation from interval-censored data # # ----------------------------------------------------------------------- # # Function calling alternatives to npsurv() from the npsurv package # # ----------------------------------------------------------------------- # npmle <- function(censdata, method = "Wang") { method <- match.arg(method, c("Wang", "Turnbull.intervals")) if (method == "Wang") { db <- censdata db$left[is.na(db$left)] <- -Inf db$right[is.na(db$right)] <- Inf r <- npsurvminimal(db, pkg="stats") if (r$convergence) { f <- as.data.frame(r$f) } else { warning("Due to lack of convergence of Wang algorithm, method Turnbull.intervals was used instead for NPMLE") f <- Turnbull.intervals(censdata) } } else if (method == "Turnbull.intervals") { f <- Turnbull.intervals(censdata) } return(f) } fitdistrplus/R/bootdist.R0000644000176200001440000002450714212141104015157 0ustar liggesusers############################################################################# # Copyright (c) 2009 Marie Laure Delignette-Muller, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### bootstrap in fitdistrplus ### ### R functions ### bootdist <- function (f, bootmethod="param", niter=1001, silent=TRUE, parallel = c("no", "snow", "multicore"), ncpus) { if (niter<10) stop("niter must be an integer above 10") bootmethod <- match.arg(bootmethod, c("param", "nonparam")) parallel <- match.arg(parallel, c("no", "snow", "multicore")) if (parallel == "multicore" & .Platform$OS.type == "windows") { parallel <- "snow" warning("As the multicore option is not supported on Windows it was replaced by snow") } if ((parallel == "snow" | parallel == "multicore") & missing(ncpus)) stop("You have to specify the number of available processors to parallelize the bootstrap") if (!inherits(f, "fitdist")) stop("Use only with 'fitdist' objects") if(!is.null(f$weights)) stop("Bootstrap is not yet available when using weights") #simulate bootstrap data if (bootmethod == "param") { # parametric bootstrap rdistname <- paste("r", f$distname, sep="") if (!exists(rdistname, mode="function")) stop(paste("The ", rdistname, " function must be defined")) rdata <- do.call(rdistname, c(list(niter*f$n), as.list(f$estimate), f$fix.arg)) dim(rdata) <- c(f$n, niter) } else { # non parametric bootstrap rdata <- sample(f$data, size=niter*f$n, replace=TRUE) dim(rdata) <- c(f$n, niter) } #compute bootstrap estimates foncestim <- switch(f$method, "mle"=mledist, "qme"=qmedist, "mme"=mmedist, "mge"=mgedist, "mse"=msedist) start <- as.list(f$estimate) #a named vector is no longer is accepted as starting values. 
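## Hedged illustration of the simulation step above (hypothetical fit 'f1',
## kept commented out so nothing runs at source time): for an exponential
## fit, the parametric branch amounts to
##   f1 <- fitdist(rexp(50, rate = 2), "exp")
##   rdata <- do.call("rexp", c(list(1001 * f1$n), as.list(f1$estimate)))
##   dim(rdata) <- c(f1$n, 1001)   # column j holds the j-th bootstrap sample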
if(is.function(f$fix.arg.fun)) fix.arg <- f$fix.arg.fun else fix.arg <- f$fix.arg if (is.null(f$dots) && !is.function(fix.arg)) { func <- function(iter) { res <- try(do.call(foncestim, list(data=rdata[, iter], distr=f$distname, start=start, fix.arg=fix.arg, checkstartfix=TRUE)), silent=silent) if(inherits(res, "try-error")) return(c(rep(NA, length(start)), 100)) else return(c(res$estimate, res$convergence)) } }else if (is.null(f$dots) && is.function(fix.arg)) { func <- function(iter) { fix.arg.iter <- fix.arg(rdata[, iter]) res <- try(do.call(foncestim, list(data=rdata[, iter], distr=f$distname, start=start, fix.arg=fix.arg.iter, checkstartfix=TRUE)), silent=silent) if(inherits(res, "try-error")) return(c(rep(NA, length(start)), 100)) else return(c(res$estimate, res$convergence)) } }else if(!is.null(f$dots) && !is.function(fix.arg)) { func <- function(iter) { res <- try(do.call(foncestim, c(list(data=rdata[, iter], distr=f$distname, start=start, fix.arg=fix.arg, checkstartfix=TRUE), f$dots)), silent=silent) if(inherits(res, "try-error")) return(c(rep(NA, length(start)), 100)) else return(c(res$estimate, res$convergence)) } }else if(!is.null(f$dots) && is.function(fix.arg)) { func <- function(iter) { fix.arg.iter <- fix.arg(rdata[, iter]) res <- try(do.call(foncestim, c(list(data=rdata[, iter], distr=f$distname, start=start, fix.arg=fix.arg.iter, checkstartfix=TRUE), f$dots)), silent=silent) if(inherits(res, "try-error")) return(c(rep(NA, length(start)), 100)) else return(c(res$estimate, res$convergence)) } }else stop("wrong implementation in bootdist") owarn <- getOption("warn") oerr <- getOption("show.error.messages") options(warn=ifelse(silent, -1, 0), show.error.messages=!silent) # parallel or sequential computation if (parallel != "no") { if (parallel == "snow") type <- "PSOCK" else if (parallel == "multicore") type <- "FORK" clus <- parallel::makeCluster(ncpus, type = type) resboot <- parallel::parSapply(clus, 1:niter, func) parallel::stopCluster(clus) } else { resboot <- sapply(1:niter, func) } options(warn=owarn, show.error.messages=oerr) rownames(resboot) <- c(names(start), "convergence") if (length(resboot[, 1])>2) { estim <- data.frame(t(resboot)[, -length(resboot[, 1])]) bootCI <- cbind(apply(resboot[-length(resboot[, 1]), ], 1, median, na.rm=TRUE), apply(resboot[-length(resboot[, 1]), ], 1, quantile, 0.025, na.rm=TRUE), apply(resboot[-length(resboot[, 1]), ], 1, quantile, 0.975, na.rm=TRUE)) colnames(bootCI) <- c("Median", "2.5%", "97.5%") } else { estim <- as.data.frame(t(resboot)[, -length(resboot[, 1])]) names(estim) <- names(f$estimate) bootCI <- c(median(resboot[-length(resboot[, 1]), ], na.rm=TRUE), quantile(resboot[-length(resboot[, 1]), ], 0.025, na.rm=TRUE), quantile(resboot[-length(resboot[, 1]), ], 0.975, na.rm=TRUE)) names(bootCI) <- c("Median", "2.5%", "97.5%") } # code of convergence of the optimization function for each iteration converg <- t(resboot)[, length(resboot[, 1])] res <- structure(list(estim=estim, converg=converg, method=bootmethod, nbboot=niter, CI=bootCI, fitpart=f), class="bootdist") res } print.bootdist <- function(x, ...){ if (!inherits(x, "bootdist")) stop("Use only with 'bootdist' objects") if (x$method=="param") cat("Parameter values obtained with parametric bootstrap \n") else cat("Parameter values obtained with nonparametric bootstrap \n") print(head(x$estim), ...) 
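# head() deliberately shows only the first six bootstrap replicates;
# the full set of parameter draws remains available in x$estim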
nconverg <- length(x$converg[x$converg==0]) if (nconverg < length(x$converg)) { cat("\n") cat("The estimation method converged only for", nconverg, "among", length(x$converg), "iterations \n") } } plot.bootdist <- function(x, main="Bootstrapped values of parameters", enhance=FALSE, trueval=NULL, rampcol=NULL, nbgrid = 100, nbcol = 100, ...){ if (!inherits(x, "bootdist")) stop("Use only with 'bootdist' objects") if (dim(x$estim)[2]==1) { stripchart(x$estim, method="jitter", main=main, xlab="Bootstrapped values of the parameter", ...) } else { if(!is.logical(enhance)) stop("wrong argument enhance for plot.bootdist.") if (!enhance) { if(is.null(trueval)) #no true value supplied pairs(data.matrix(x$estim), main=main, ...) else #some true value supplied pairs4boot(x$estim, main=main, trueval=trueval, enhance=FALSE, ...) } else { if(is.null(rampcol)) rampcol <- c("green", "yellow", "orange", "red") pairs4boot(x$estim, main=main, trueval=trueval, col4ramp = rampcol, nbgrid = nbgrid, nbcol = nbcol, ...) } } } summary.bootdist <- function(object, ...){ if (!inherits(object, "bootdist")) stop("Use only with 'bootdist' objects") class(object) <- c("summary.bootdist", class(object)) object } print.summary.bootdist <- function(x, ...){ if (!inherits(x, "summary.bootdist")) stop("Use only with 'summary.bootdist' objects") if (x$method=="param") cat("Parametric bootstrap medians and 95% percentile CI \n") else cat("Nonparametric bootstrap medians and 95% percentile CI \n") print(x$CI) nconverg <- length(x$converg[x$converg==0]) if (nconverg < length(x$converg)) { cat("\n") cat("The estimation method converged only for", nconverg, "among", length(x$converg), "iterations \n") } } fitdistrplus/R/fitdistcens.R0000644000176200001440000002277713742313702015672 0ustar liggesusers############################################################################# # Copyright (c) 2009 Marie Laure Delignette-Muller # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### fit parametric distributions for censored data ### ### R functions ### fitdistcens <- function (censdata, distr, start=NULL, fix.arg=NULL, keepdata = TRUE, keepdata.nb=100, ...) 
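# Usage sketch (hypothetical toy data, not run): NA in 'left' codes a
# left-censored value, NA in 'right' a right-censored one, left == right
# an exact observation, and left < right an interval:
#   d <- data.frame(left = c(1, NA, 3), right = c(2, 2, 3))
#   fitdistcens(d, "lnorm")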
{ if (missing(censdata) || !(is.vector(censdata$left) & is.vector(censdata$right) & length(censdata[, 1])>1)) stop("datacens must be a dataframe with two columns named left and right and more than one line") leftsupright <- censdata$left>censdata$right leftsupright <- leftsupright[!is.na(leftsupright)] if (any(leftsupright)) stop("each left value must be less or equal to the corresponding right value") if (!is.character(distr)) distname <- substring(as.character(match.call()$distr), 2) else distname <- distr ddistname <- paste("d", distname, sep="") if (!exists(ddistname, mode="function")) stop(paste("The ", ddistname, " function must be defined")) pdistname <- paste("p", distname, sep="") if (!exists(pdistname, mode="function")) stop(paste("The ", pdistname, " function must be defined")) if(!is.logical(keepdata) || !is.numeric(keepdata.nb) || keepdata.nb < 3) stop("wrong arguments 'keepdata' and 'keepdata.nb'.") #encapsulate three dots arguments my3dots <- list(...) if (length(my3dots) == 0) my3dots <- NULL #format data for calculation of starting values pseudodata <- cens2pseudo(censdata)$pseudo # manage starting/fixed values: may raise errors or return two named list arg_startfix <- manageparam(start.arg=start, fix.arg=fix.arg, obs=pseudodata, distname=distname) #check inconsistent parameters argddistname <- names(formals(ddistname)) hasnodefaultval <- sapply(formals(ddistname), is.name) arg_startfix <- checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, argddistname, hasnodefaultval) #arg_startfix contains two names list (no longer NULL nor function) #store fix.arg.fun if supplied by the user if(is.function(fix.arg)) fix.arg.fun <- fix.arg else fix.arg.fun <- NULL # check d, p, q, functions of distname dpq2test <- c("d", "p") resdpq <- testdpqfun(distname, dpq2test, start.arg=arg_startfix$start.arg, fix.arg=arg_startfix$fix.arg, discrete=FALSE) if(any(!resdpq$ok)) { for(x in resdpq[!resdpq$ok, "txt"]) warning(x) } # MLE fit with mledist mle <- mledist(censdata, distname, start=arg_startfix$start.arg, fix.arg=arg_startfix$fix.arg, checkstartfix=TRUE, ...) 
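# (further below, the information criteria follow the usual definitions
# AIC = -2*logLik + 2*npar and BIC = -2*logLik + log(n)*npar; e.g. with
# logLik = -100, npar = 2, n = 50: AIC = 204 and BIC = 200 + 2*log(50) ~ 207.8)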
if (mle$convergence>0) stop("the function mle failed to estimate the parameters, with the error code ", mle$convergence) estimate <- mle$estimate if(!is.null(mle$hessian)){ #check for NA values and invertible Hessian if(all(!is.na(mle$hessian)) && qr(mle$hessian)$rank == NCOL(mle$hessian)){ varcovar <- solve(mle$hessian) sd <- sqrt(diag(varcovar)) correl <- cov2cor(varcovar) }else{ varcovar <- NA sd <- NA correl <- NA } }else{ varcovar <- NA sd <- NA correl <- NA } loglik <- mle$loglik n <- nrow(censdata) npar <- length(estimate) aic <- -2*loglik+2*npar bic <- -2*loglik+log(n)*npar fix.arg <- mle$fix.arg weights <- mle$weights #needed for bootstrap if (!is.null(fix.arg)) fix.arg <- as.list(fix.arg) if(keepdata) { reslist <- list(estimate = estimate, method="mle", sd = sd, cor = correl, vcov = varcovar, loglik = loglik, aic=aic, bic=bic, n=n, censdata=censdata, distname=distname, fix.arg=fix.arg, fix.arg.fun = fix.arg.fun, dots=my3dots, convergence=mle$convergence, discrete=FALSE, weights = weights) }else { n2keep <- min(keepdata.nb, n)-4 imin <- unique(apply(censdata, 2, which.min)) imax <- unique(apply(censdata, 2, which.max)) subdata <- censdata[sample((1:n)[-c(imin, imax)], size=n2keep, replace=FALSE), ] subdata <- rbind.data.frame(subdata, censdata[c(imin, imax), ]) reslist <- list(estimate = estimate, method="mle", sd = sd, cor = correl, vcov = varcovar, loglik = loglik, aic=aic, bic=bic, n=n, censdata=subdata, distname=distname, fix.arg=fix.arg, fix.arg.fun = fix.arg.fun, dots=my3dots, convergence=mle$convergence, discrete=FALSE, weights = weights) } return(structure(reslist, class = "fitdistcens")) } print.fitdistcens <- function(x, ...){ if (!inherits(x, "fitdistcens")) stop("Use only with 'fitdistcens' objects") cat("Fitting of the distribution '", x$distname, "' on censored data by maximum likelihood \n") cat("Parameters:\n") print(cbind.data.frame("estimate" = x$estimate), ...) if(!is.null(x$fix.arg)) { if(is.null(x$fix.arg.fun)) { cat("Fixed parameters:\n") }else { cat("Fixed parameters (computed by a user-supplied function):\n") } print(cbind.data.frame("value" = unlist(x$fix.arg)), ...) } } plot.fitdistcens <- function(x, ...){ if (!inherits(x, "fitdistcens")) stop("Use only with 'fitdistcens' objects") if(!is.null(x$weights)) stop("The plot of the fit is not yet available when using weights") plotdistcens(censdata=x$censdata, distr=x$distname, para=c(as.list(x$estimate), as.list(x$fix.arg)), ...) } summary.fitdistcens <- function(object, ...){ if (!inherits(object, "fitdistcens")) stop("Use only with 'fitdistcens' objects") object$ddistname <- paste("d", object$distname, sep="") object$pdistname <- paste("p", object$distname, sep="") class(object) <- c("summary.fitdistcens", class(object)) object } print.summary.fitdistcens <- function(x, ...){ if (!inherits(x, "summary.fitdistcens")) stop("Use only with 'fitdistcens' objects") ddistname <- x$ddistname pdistname <- x$pdistname cat("Fitting of the distribution '", x$distname, "' By maximum likelihood on censored data \n") cat("Parameters\n") print(cbind.data.frame("estimate" = x$estimate, "Std. Error" = x$sd), ...) if(!is.null(x$fix.arg)) { if(is.null(x$fix.arg.fun)) { cat("Fixed parameters:\n") }else { cat("Fixed parameters (computed by a user-supplied function):\n") } print(cbind.data.frame("value" = unlist(x$fix.arg)), ...) 
} cat("Loglikelihood: ", x$loglik, " ") cat("AIC: ", x$aic, " ") cat("BIC: ", x$bic, "\n") if (length(x$estimate) > 1) { cat("Correlation matrix:\n") print(x$cor) cat("\n") } invisible(x) } #see quantiles.R for quantile.fitdistcens #see logLik.R for loglik.fitdistcens #see vcov.R for vcov.fitdistcens #see coef.R for coef.fitdistcens fitdistrplus/R/util-testdensity.R0000644000176200001440000000774414205427136016704 0ustar liggesusers# testdpqfun function returns a vector of TRUE when the d,p,q functions exist # and behave like in R (e.g. d,p,qexp()), otherwise a list of messages. # INPUTS #distr : the distribution name #fun: a character vector with letters among d, p, q #start.arg: an initial value list #fix.arg: a fixed value list #discrete: a logical whether the distribution is discrete # OUTPUTS # a vector of logical TRUE or a vector of text messages testdpqfun <- function(distr, fun=c("d","p","q"), start.arg, fix.arg=NULL, discrete=FALSE) { stopifnot(all(is.character(fun))) fun <- fun[fun %in% c("d","p","q")] stopifnot(length(fun) > 0) if(is.vector(start.arg)) start.arg <- as.list(start.arg) if(is.function(fix.arg)) stop("fix.arg should be either a named list or NULL but not a function") op <- options() #get current options #print(getOption("warn")) options(warn=-1) res <- NULL if("d" %in% fun) res <- rbind(res, test1fun(paste0("d", distr), start.arg, fix.arg)) if("p" %in% fun) res <- rbind(res, test1fun(paste0("p", distr), start.arg, fix.arg)) if("q" %in% fun) res <- rbind(res, test1fun(paste0("q", distr), start.arg, fix.arg)) options(op) # reset (all) initial options res } test1fun <- function(fn, start.arg, fix.arg, dpqr) { res <- data.frame(ok=FALSE, txt="") stopifnot(is.list(start.arg)) if(!is.null(fix.arg)) stopifnot(is.list(fix.arg)) #does the function exist? 
if(!exists(fn, mode="function")) { res$txt <- paste("The", fn, "function must be defined") return(res) } #naming convention if(missing(dpqr)) dpqr <- substr(fn, 1, 1) firstarg_theo <- switch(dpqr, "d"="x", "p"="q", "q"="p", "r"="n") firstarg_found <- names(formals(fn))[1] if(firstarg_found != firstarg_theo) { t0 <- paste("The", fn, "function should have its first argument named:", firstarg_theo) res$txt <- paste(t0, "as in base R") return(res) } #zero-component vector res0 <- try(do.call(fn, c(list(numeric(0)), start.arg, fix.arg)), silent=TRUE) t0 <- paste("The", fn, "function should return a zero-length vector when input has length zero and not raise an error") t1 <- paste("The", fn, "function should return a zero-length vector when input has length zero") if(inherits(res0, "try-error")) { res$txt <- t0 return(res) } if(length(res0) != 0) { res$txt <- t1 return(res) } #inconsistent value x <- c(0, 1, Inf, NaN, -1) res1 <- try(do.call(fn, c(list(x), start.arg, fix.arg)), silent=TRUE) t2 <- paste("The", fn, "function should return a vector of with NaN values when input has inconsistent values and not raise an error") if(inherits(res1, "try-error")) { res$txt <- t2 return(res) } #missing value x <- c(0, 1, NA) res2 <- try(do.call(fn, c(list(x), start.arg, fix.arg)), silent=TRUE) t4 <- paste("The", fn, "function should return a vector of with NA values when input has missing values and not raise an error") t5 <- paste("The", fn, "function should return a vector of with NA values when input has missing values and not remove missing values") if(inherits(res2, "try-error")) { res$txt <- t4 return(res) } if(length(res2) != length(x)) { res$txt <- t5 return(res) } #inconsistent parameter x <- 0:1 start.arg <- lapply(start.arg, function(x) -x) res3 <- try(do.call(fn, c(list(x), start.arg, fix.arg)), silent=TRUE) t6 <- paste("The", fn, "function should return a vector of with NaN values when input has inconsistent parameters and not raise an error") if(inherits(res3, "try-error")) { res$txt <- t6 return(res) } #wrong parameter name x <- 0:1 names(start.arg) <- paste0(names(start.arg), "_") res4 <- try(do.call(fn, c(list(x), start.arg, fix.arg)), silent=TRUE) t8 <- paste("The", fn, "function should raise an error when names are incorrectly named") if(!inherits(res4, "try-error")) { res$txt <- t8 return(res) } return(data.frame(ok=TRUE, txt="")) } fitdistrplus/R/bootdistcens.R0000644000176200001440000002267213742313702016045 0ustar liggesusers############################################################################# # Copyright (c) 2009 Marie Laure Delignette-Muller, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### bootstrap in fitdistrplus with censored data ### ### R functions ### bootdistcens <- function (f, niter=1001, silent=TRUE, parallel = c("no", "snow", "multicore"), ncpus) { if (niter<10) stop("niter must be an integer above 10") parallel <- match.arg(parallel, c("no", "snow", "multicore")) if (parallel == "multicore" & .Platform$OS.type == "windows") { parallel <- "snow" warning("As the multicore option is not supported on Windows it was replaced by snow") } if ((parallel == "snow" | parallel == "multicore") & missing(ncpus)) stop("You have to specify the number of available processors to parallelize the bootstrap") if (!inherits(f, "fitdistcens")) stop("Use only with 'fitdistcens' objects") if(!is.null(f$weights)) stop("Bootstrap is not yet available when using weights") # non parametric bootstrap n <- length(f$censdata[, 1]) numrow <- seq(1, n) rnumrow <- sample(numrow, size=niter*n, replace=TRUE) dim(rnumrow) <- c(n, niter) start <- as.list(f$estimate) #a named vector is no longer is accepted as starting values. if(is.function(f$fix.arg.fun)) fix.arg <- f$fix.arg.fun else fix.arg <- f$fix.arg if (is.null(f$dots) && !is.function(fix.arg)) { funcmle <- function(iter) { mle <- try(do.call(mledist, list(data=data.frame(left=f$censdata[rnumrow[, iter], ]$left, right=f$censdata[rnumrow[, iter], ]$right), distr=f$distname, start=start, fix.arg=fix.arg, checkstartfix=TRUE)), silent=silent) if(inherits(mle, "try-error")) return(c(rep(NA, length(start)), 100)) else return(c(mle$estimate, mle$convergence)) } }else if (is.null(f$dots) && is.function(fix.arg)) { funcmle <- function(iter) { bootdata <- data.frame(left=f$censdata[rnumrow[, iter], ]$left, right=f$censdata[rnumrow[, iter], ]$right) fix.arg.iter <- fix.arg(cens2pseudo(bootdata)$pseudo) mle <- try(do.call(mledist, list(data=bootdata, distr=f$distname, start=start, fix.arg=fix.arg.iter, checkstartfix=TRUE)), silent=silent) if(inherits(mle, "try-error")) return(c(rep(NA, length(start)), 100)) else return(c(mle$estimate, mle$convergence)) } }else if(!is.null(f$dots) && !is.function(fix.arg)) { funcmle <- function(iter) { mle <- try(do.call(mledist, c(list(data=data.frame(left=f$censdata[rnumrow[, iter], ]$left, right=f$censdata[rnumrow[, iter], ]$right), distr=f$distname, start=start), fix.arg=fix.arg, f$dots, checkstartfix=TRUE)), silent=silent) if(inherits(mle, "try-error")) return(c(rep(NA, length(start)), 100)) else return(c(mle$estimate, mle$convergence)) } }else if(!is.null(f$dots) && is.function(fix.arg)) { funcmle <- function(iter) { bootdata <- data.frame(left=f$censdata[rnumrow[, iter], ]$left, right=f$censdata[rnumrow[, iter], ]$right) fix.arg.iter <- fix.arg(cens2pseudo(bootdata)$pseudo) mle <- try(do.call(mledist, c(list(data=bootdata, distr=f$distname, start=start), fix.arg=fix.arg.iter, f$dots, checkstartfix=TRUE)), silent=silent) if(inherits(mle, "try-error")) return(c(rep(NA, length(start)), 100)) else return(c(mle$estimate, mle$convergence)) } }else stop("wrong implementation in bootdistcens") owarn <- getOption("warn") oerr <- getOption("show.error.messages") options(warn=ifelse(silent, -1, 0), show.error.messages=!silent) # parallel or sequential computation if (parallel != "no") { if (parallel == "snow") type <- "PSOCK" else if 
(parallel == "multicore") type <- "FORK" clus <- parallel::makeCluster(ncpus, type = type) resboot <- parallel::parSapply(clus, 1:niter, funcmle) parallel::stopCluster(clus) } else { resboot <- sapply(1:niter, funcmle) } options(warn=owarn, show.error.messages=oerr) rownames(resboot) <- c(names(start), "convergence") if (length(resboot[, 1])>2) { estim <- data.frame(t(resboot)[, -length(resboot[, 1])]) bootCI <- cbind(apply(resboot[-length(resboot[, 1]), ], 1, median, na.rm=TRUE), apply(resboot[-length(resboot[, 1]), ], 1, quantile, 0.025, na.rm=TRUE), apply(resboot[-length(resboot[, 1]), ], 1, quantile, 0.975, na.rm=TRUE)) colnames(bootCI) <- c("Median", "2.5%", "97.5%") } else { estim <- as.data.frame(t(resboot)[, -length(resboot[, 1])]) names(estim) <- names(f$estimate) bootCI <- c(median(resboot[-length(resboot[, 1]), ], na.rm=TRUE), quantile(resboot[-length(resboot[, 1]), ], 0.025, na.rm=TRUE), quantile(resboot[-length(resboot[, 1]), ], 0.975, na.rm=TRUE)) names(bootCI) <- c("Median", "2.5%", "97.5%") } res <- structure(list(estim=estim, converg=t(resboot)[, length(resboot[, 1])], method="nonparam", nbboot=niter, CI=bootCI, fitpart=f), class="bootdistcens") res } print.bootdistcens <- function(x, ...){ if (!inherits(x, "bootdistcens")) stop("Use only with 'bootdistcens' objects") cat("Parameter values obtained with nonparametric bootstrap \n") print(x$estim, ...) nconverg <- length(x$converg[x$converg==0]) if (nconverg < length(x$converg)) { cat("\n") cat("The estimation method converged only for", nconverg, "among", length(x$converg), "iterations \n") } } plot.bootdistcens <- function(x, ...){ if (!inherits(x, "bootdistcens")) stop("Use only with 'bootdistcens' objects") if (dim(x$estim)[2]==1) { stripchart(x$estim, method="jitter", xlab="Bootstrapped values of the parameter", ...) } else { if (dim(x$estim)[2]==2) plot(x$estim, main="Bootstrapped values of the two parameters", ...) else plot(x$estim, main="Bootstrapped values of parameters", ...) 
} } summary.bootdistcens <- function(object, ...){ if (!inherits(object, "bootdistcens")) stop("Use only with 'bootdistcens' objects") class(object) <- c("summary.bootdistcens", class(object)) object } print.summary.bootdistcens <- function(x, ...){ if (!inherits(x, "summary.bootdistcens")) stop("Use only with 'summary.bootdistcens' objects") cat("Nonparametric bootstrap medians and 95% percentile CI \n") print(x$CI) nconverg <- length(x$converg[x$converg==0]) if (nconverg < length(x$converg)) { cat("\n") cat("The estimation method converged only for", nconverg, "among", length(x$converg), "iterations \n") } } fitdistrplus/R/util-npsurv-main.R0000644000176200001440000001217314421725626016601 0ustar liggesusers# ----------------------------------------------------------------------- # # Nonparametric maximum likelihood estimation from interval-censored data # # ----------------------------------------------------------------------- # # Main function : npsurvminimal # # ----------------------------------------------------------------------- # # Original code from Yong Wang, 2020 # # ----------------------------------------------------------------------- # ## Arguments: ## data: Data ## w: Weights ## maxit: Maximum number of iterations to perform ## tol: Tolerance for the stopping condition (in log-likelihood value) ## verb: For internal use only: depth of recursion (up to 4) ## pkg: package used in NNLS_constrSum() ## probtol: Tolerance for keeping output probabilities (default value has ## been chosen so that the output is similar to original npsurv()) npsurvminimal <- function(data, w=1, maxit=100, tol=1e-6, verb=0, pkg="stats", probtol=2e-4, ...) { #sanity checks pkg <- match.arg(pkg, c("stats")) if(length(maxit) > 1) stop("maxit should be a positive scalar") if(maxit <= 0) stop("maxit should be a positive scalar") if(length(tol) > 1) stop("maxit should be a positive scalar") if(tol <= 0) stop("maxit should be a positive scalar") if(length(verb) > 1) stop("maxit should be a non-negative scalar") if(verb < 0) stop("maxit should be a non-negative scalar") if(length(probtol) > 1) stop("maxit should be a positive probability") if(probtol <= 0 || probtol >= 1) stop("maxit should be a positive probability") if(sum(is.na(data))) stop("data should have no NA values") x2 = icendata(data, w) #see npsurv-intercens.R #sanity checks if(sum(apply(x2$o, 1, function(x) x[1] > x[2])) > 0) stop("some intervals have left bounds strictly greater than right bounds") # exact or right-censored only if(nrow(x2$o) == 0 || all(x2$o[,2] == Inf)) { if(verb > 0) cat("call to km()\n") r0 = km(x2) #see npsurv-km.R r = list(f=r0$f, upper=max(x2$t, x2$o[,1]), convergence=TRUE, ll=r0$ll, maxgrad=0, numiter=1, method="km") return(r) } #get left/right interval bounds x = rbind(cbind(x2$t, x2$t), x2$o) nx = nrow(x) w = c(x2$wt, x2$wo) wr = sqrt(w) n = sum(w) upper = x2$upper # compute incidence matrix of maximal intersection intervals # see section 2.2, p3 of Wang & Taylor : Delta matrix is denoted S dmat = Deltamatrix(x) #see npsurv-intercens.R left = dmat$left right = dmat$right D = dmat$Delta m = length(left) p = double(m) if(verb > 0) cat("nb row orig data=", nx, "\nnb of maximal intersection intervals=", m, "\n") i = rowSums(D) != 1 j = colSums(D[!i,,drop=FALSE]) > 0 j[c(1,m)] = TRUE # Initial p must ensure P > 0 repeat { jm = which.max(colSums(D[i,,drop=FALSE])) j[jm] = TRUE i[D[,jm]] = FALSE if( sum(i) == 0 ) break } p = colSums(w * D) * j p = p / sum(p) # interval censored and large dataset if(m >= 30) { ## Turn to 
HCNM if(verb > 0) cat("call to hcnm()\n") r = hcnm(w=w, D=D, p0=p, maxit=maxit, tol=tol, verb=verb) #see npsurv-hcnm.R j = r$pf > probtol f = idf(left=left[j], right=right[j], p=r$pf[j]) #normalize prob vector f$p <- f$p/sum(f$p) r = list(f=f, upper=upper, convergence=r$convergence, method="hcnm", ll=r$ll, maxgrad=r$maxgrad, numiter=r$numiter, m=m) return(r) } # interval censored and small dataset if(verb > 0) cat("body of npsurvminimal()\n\n") P = drop(D %*% p) ll = sum( w * log(P) ) converge = FALSE for(i in 1:maxit) { p.old = p ll.old = ll S = D / P d = colSums(w * S) dmax = max(d) - n if(verb > 0) { cat("\n##### Iteration", i, "#####\n") cat("Log-likelihood: ", signif(ll, 6), "\n") } if(verb > 1) cat("Maximum gradient: ", signif(dmax, 6), "\n") if(verb > 2) {cat("Probability vector:\n"); print(p)} j[which(j)-1 + aggregate(d, by=list(group=cumsum(j)), which.max)[,2]] = TRUE #orignal call #pj = pnnls(wr * S[,j,drop=FALSE], wr * 2, sum=1)$x #new call resNNLS <- NNLS_constrSum(a=wr * S[,j,drop=FALSE], b=wr * 2, pkg=pkg, sumtotal=1, control=list(trace=verb), ...) #see npsurv-NNLS.R if(resNNLS$convergence != 0) break else pj <- resNNLS$prob p[j] = pj / sum(pj) # line search alpha = 1 pd = p - p.old lld = sum(d * pd) p.alpha = p repeat { P.alpha = drop(D %*% p.alpha) ll.alpha = sum(w * log(P.alpha)) if(ll.alpha >= ll + alpha * lld * .33) { p = p.alpha; P = P.alpha; ll = ll.alpha; break } if((alpha <- alpha * .5) < 1e-10) break p.alpha = p.old + alpha * pd } j = p > 0 if( ll <= ll.old + tol ) #convergence attained { converge=TRUE; break } } f <- idf(left[j], right[j], p[j]) #see npsurv-intercens.R res <- list(f=f, upper=upper, convergence=converge, method="cnm", ll=ll, maxgrad=max(crossprod(w/P, D))-n, numiter=i, m=m) res } fitdistrplus/R/qmedist.R0000644000176200001440000003416014421726036015011 0ustar liggesusers############################################################################# # Copyright (c) 2010 Christophe Dutang and Marie Laure Delignette-Muller # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### quantile matching estimation for censored or non-censored data ### ### R functions ### qmedist <- function (data, distr, probs, start=NULL, fix.arg=NULL, qtype=7, optim.method="default", lower=-Inf, upper=Inf, custom.optim=NULL, weights=NULL, silent=TRUE, gradient=NULL, checkstartfix=FALSE, ...) 
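# Usage sketch (assumed toy data, not run): supply as many probabilities
# as there are parameters to estimate, e.g. the quartiles for the
# two-parameter lognormal:
#   set.seed(1); x <- rlnorm(100)
#   qmedist(x, "lnorm", probs = c(0.25, 0.75))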
# data may correspond to a vector for non censored data or to # a dataframe of two columns named left and right for censored data { if (!is.character(distr)) # distname <- substring(as.character(match.call()$distr), 2) stop("distr must be a character string naming a distribution") else distname <- distr qdistname <- paste("q",distname,sep="") ddistname <- paste("d",distname,sep="") argddistname <- names(formals(ddistname)) if (!exists(qdistname, mode="function")) stop(paste("The ", qdistname, " function must be defined")) if (!exists(ddistname, mode="function")) stop(paste("The ", ddistname, " function must be defined")) if (missing(probs)) stop("missing probs argument for quantile matching estimation") if(is.null(custom.optim)) optim.method <- match.arg(optim.method, c("default", "Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN", "Brent")) start.arg <- start #to avoid confusion with the start() function of stats pkg (check is done lines 87-100) if(is.vector(start.arg)) #backward compatibility start.arg <- as.list(start.arg) if(qtype < 1 || qtype > 9) stop("wrong type for the R quantile function") if(!is.null(weights)) { if(any(weights < 0)) stop("weights should be a vector of integers greater than 0") if(!isallintw(weights)) stop("weights should be a vector of (strictly) positive integers") if(length(weights) != NROW(data)) stop("weights should be a vector with a length equal to the observation number") warning("weights are not taken into account in the default initial values") } if (is.vector(data)) { cens <- FALSE if (!(is.numeric(data) & length(data)>1)) stop("data must be a numeric vector of length greater than 1 for non censored data or a dataframe with two columns named left and right and more than one line for censored data") } else { cens <- TRUE stop("Quantile matching estimation is not yet available for censored data.") } if(!checkstartfix) #pre-check has not been done by fitdist() or bootdist() { # manage starting/fixed values: may raise errors or return two named list arg_startfix <- manageparam(start.arg=start, fix.arg=fix.arg, obs=data, distname=distname) #check inconsistent parameters hasnodefaultval <- sapply(formals(ddistname), is.name) arg_startfix <- checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, argddistname, hasnodefaultval) #arg_startfix contains two names list (no longer NULL nor function) #set fix.arg.fun if(is.function(fix.arg)) fix.arg.fun <- fix.arg else fix.arg.fun <- NULL }else #pre-check has been done by fitdist() or bootdist() { arg_startfix <- list(start.arg=start, fix.arg=fix.arg) fix.arg.fun <- NULL } #unlist starting values as needed in optim() vstart <- unlist(arg_startfix$start.arg) #sanity check if(is.null(vstart)) stop("Starting values could not be NULL with checkstartfix=TRUE") #erase user value #(cannot coerce to vector as there might be different modes: numeric, character...) 
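#so the retained fix.arg is the checked named list built above, not the
#raw user-supplied argument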
fix.arg <- arg_startfix$fix.arg if(length(vstart) != length(probs)) stop("wrong dimension for the quantiles to match.") ############# QME fit using optim or custom.optim ########## # definition of the function to minimize : # for non censored data if (!cens && is.null(weights)) { # the argument names are: # - par for parameters (like in optim function) # - fix.arg for optional fixed parameters # - obs for observations (previously dat but conflicts with genoud data.type.int argument) # - qdistnam for distribution name DIFF2Q <- function(par, fix.arg, prob, obs, qdistnam, qtype) { qtheo <- do.call(qdistnam, c(as.list(prob), as.list(par), as.list(fix.arg)) ) qemp <- as.numeric(quantile(obs, probs=prob, type=qtype)) (qemp - qtheo)^2 } fnobj <- function(par, fix.arg, obs, qdistnam, qtype) sum( sapply(probs, function(p) DIFF2Q(par, fix.arg, p, obs, qdistnam, qtype)) ) }else if (!cens && !is.null(weights)) { DIFF2Q <- function(par, fix.arg, prob, obs, qdistnam, qtype) { qtheo <- do.call(qdistnam, c(as.list(prob), as.list(par), as.list(fix.arg)) ) qemp <- as.numeric(wtdquantile(x=obs, weights=weights, probs=prob)) (qemp - qtheo)^2 } fnobj <- function(par, fix.arg, obs, qdistnam, qtype) sum( sapply(probs, function(p) DIFF2Q(par, fix.arg, p, obs, qdistnam, qtype)) ) } # Function to calculate the loglikelihood to return if(is.null(weights)) { loglik <- function(par, fix.arg, obs, ddistnam) sum(log(do.call(ddistnam, c(list(obs), as.list(par), as.list(fix.arg)) ) ) ) }else { loglik <- function(par, fix.arg, obs, ddistnam) sum(weights * log(do.call(ddistnam, c(list(obs), as.list(par), as.list(fix.arg)) ) ) ) } owarn <- getOption("warn") # Try to minimize the stat distance using the base R optim function if(is.null(custom.optim)) { hasbound <- any(is.finite(lower) | is.finite(upper)) # Choice of the optimization method if (optim.method == "default") { meth <- ifelse(length(vstart) > 1, "Nelder-Mead", "BFGS") }else meth <- optim.method if(meth == "BFGS" && hasbound && is.null(gradient)) { meth <- "L-BFGS-B" txt1 <- "The BFGS method cannot be used with bounds without provided the gradient." txt2 <- "The method is changed to L-BFGS-B." warning(paste(txt1, txt2)) } options(warn=ifelse(silent, -1, 0)) #select optim or constrOptim if(hasbound) #finite bounds are provided { if(!is.null(gradient)) { opt.fun <- "constrOptim" }else #gradient == NULL { if(meth == "Nelder-Mead") opt.fun <- "constrOptim" else if(meth %in% c("L-BFGS-B", "Brent")) opt.fun <- "optim" else { txt1 <- paste("The method", meth, "cannot be used by constrOptim() nor optim() without gradient and bounds.") txt2 <- "Only optimization methods L-BFGS-B, Brent and Nelder-Mead can be used in such case." stop(paste(txt1, txt2)) } } if(opt.fun == "constrOptim") { #recycle parameters npar <- length(vstart) #as in optim() line 34 lower <- as.double(rep_len(lower, npar)) #as in optim() line 64 upper <- as.double(rep_len(upper, npar)) # constraints are : Mat %*% theta >= Bnd, i.e. 
# +1 * theta[i] >= lower[i]; # -1 * theta[i] >= -upper[i] #select rows from the identity matrix haslow <- is.finite(lower) Mat <- diag(npar)[haslow, ] #select rows from the opposite of the identity matrix hasupp <- is.finite(upper) Mat <- rbind(Mat, -diag(npar)[hasupp, ]) colnames(Mat) <- names(vstart) rownames(Mat) <- paste0("constr", 1:NROW(Mat)) #select the bounds Bnd <- c(lower[is.finite(lower)], -upper[is.finite(upper)]) names(Bnd) <- paste0("constr", 1:length(Bnd)) initconstr <- Mat %*% vstart - Bnd if(any(initconstr < 0)) stop("Starting values must be in the feasible region.") opttryerror <- try(opt <- constrOptim(theta=vstart, f=fnobj, ui=Mat, ci=Bnd, grad=gradient, fix.arg=fix.arg, obs=data, qdistnam=qdistname, qtype=qtype, hessian=!is.null(gradient), method=meth, ...), silent=TRUE) if(!inherits(opttryerror, "try-error")) if(length(opt$counts) == 1) #appears when the initial point is a solution opt$counts <- c(opt$counts, NA) }else #opt.fun == "optim" { opttryerror <- try(opt <- optim(par=vstart, fn=fnobj, fix.arg=fix.arg, obs=data, gr=gradient, qdistnam=qdistname, qtype=qtype, hessian=TRUE, method=meth, lower=lower, upper=upper, ...), silent=TRUE) } }else #hasbound == FALSE { opt.fun <- "optim" opttryerror <- try(opt <- optim(par=vstart, fn=fnobj, fix.arg=fix.arg, obs=data, gr=gradient, qdistnam=qdistname, qtype=qtype, hessian=TRUE, method=meth, lower=lower, upper=upper, ...), silent=TRUE) } options(warn=owarn) if (inherits(opttryerror,"try-error")) { warnings("The function optim encountered an error and stopped.") if(getOption("show.error.messages")) print(attr(opttryerror, "condition")) return(list(estimate = rep(NA,length(vstart)), convergence = 100, value = NA, hessian = NA)) } if (opt$convergence>0) { warnings("The function optim failed to converge, with the error code ", opt$convergence) } if(is.null(names(opt$par))) names(opt$par) <- names(vstart) res <- list(estimate = opt$par, convergence = opt$convergence, value = opt$value, hessian = opt$hessian, optim.function=opt.fun, optim.method=meth, fix.arg=fix.arg, fix.arg.fun=fix.arg.fun, weights = weights, counts=opt$counts, optim.message=opt$message, loglik=loglik(opt$par, fix.arg, data, ddistname), probs=probs) } else # Try to minimize the stat distance using a user-supplied optim function { if (!cens) opttryerror <- try(opt <- custom.optim(fn=fnobj, fix.arg=fix.arg, obs=data, qdistnam=qdistname, qtype=qtype, par=vstart, ...), silent=TRUE) else stop("Quantile matching estimation is not yet available for censored data.") if (inherits(opttryerror,"try-error")) { warnings("The customized optimization function encountered an error and stopped.") if(getOption("show.error.messages")) print(attr(opttryerror, "condition")) return(list(estimate = rep(NA,length(vstart)), convergence = 100, value = NA, hessian = NA)) } if (opt$convergence>0) { warnings("The customized optimization function failed to converge, with the error code ", opt$convergence) } if(is.null(names(opt$par))) names(opt$par) <- names(vstart) argdot <- list(...) 
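# record the optimization method possibly passed through '...' so that
# it is reported in the returned fit object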
method.cust <- argdot$method res <- list(estimate = opt$par, convergence = opt$convergence, value = opt$value, hessian = opt$hessian, optim.function=custom.optim, optim.method=method.cust, fix.arg=fix.arg, fix.arg.fun=fix.arg.fun, weights = weights, counts=opt$counts, optim.message=opt$message, loglik=loglik(opt$par, fix.arg, data, ddistname), probs=probs) } return(res) } fitdistrplus/R/qqcomp.R0000644000176200001440000002063713747741345014661 0ustar liggesusers############################################################################# # Copyright (c) 2012 Christophe Dutang, Aurelie Siberchicot # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### plot density functions for various fits ### of continuous distribution(s) (fitdist results) ### on a same dataset ### ### R functions ### qqcomp <- function(ft, xlim, ylim, xlogscale = FALSE, ylogscale = FALSE, main, xlab, ylab, fitpch, fitcol, fitlwd, addlegend = TRUE, legendtext, xlegend = "bottomright", ylegend = NULL, use.ppoints = TRUE, a.ppoints = 0.5, line01 = TRUE, line01col = "black", line01lty = 1, ynoise = TRUE, plotstyle = "graphics", ...) 
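# Usage sketch (hypothetical fits of the same data, not run):
#   x <- rgamma(100, shape = 2)
#   qqcomp(list(fitdist(x, "gamma"), fitdist(x, "lnorm")))
# all compared fits must share the dataset, which is checked below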
{ if(inherits(ft, "fitdist")) { ft <- list(ft) }else if(!is.list(ft)) { stop("argument ft must be a list of 'fitdist' objects") }else { if(any(sapply(ft, function(x) !inherits(x, "fitdist")))) stop("argument ft must be a list of 'fitdist' objects") } # In the future developments, it will be necessary to check that all the fits share the same weights if(!is.null(ft[[1]]$weights)) stop("qqcomp is not yet available when using weights") # check the 'plotstyle' argument plotstyle <- match.arg(plotstyle[1], choices = c("graphics", "ggplot"), several.ok = FALSE) # check data mydata <- ft[[1]]$data verif.ftidata <- function(fti) { if (any(fti$data != mydata)) stop("All compared fits must have been obtained with the same dataset") invisible() } lapply(ft, verif.ftidata) n <- length(mydata) sdata <- sort(mydata) largedata <- (n > 1e4) if (xlogscale != ylogscale) { warning("As a Q-Q plot should use the same scale on x and y axes, both or none of the axes should be put in a logarithmic scale.") } logxy <- paste(ifelse(xlogscale,"x",""), ifelse(ylogscale,"y",""), sep="") # manage default parameters nft <- length(ft) if (missing(fitcol)) fitcol <- 2:(nft+1) if (missing(fitpch)) fitpch <- ifelse(largedata, 1, 21) if (missing(fitlwd)) fitlwd <- 1 fitcol <- rep(fitcol, length.out=nft) fitpch <- rep(fitpch, length.out=nft) fitlwd <- rep(fitlwd, length.out=nft) # check legend parameters if added if(missing(legendtext)) { legendtext <- sapply(ft, function(x) x$distname) if(length(legendtext) != length(unique(legendtext))) legendtext <- paste(legendtext, sapply(ft, function(x) toupper(x$method)), sep="-") if(length(legendtext) != length(unique(legendtext))) legendtext <- paste(legendtext, 1:nft, sep="-") } if (missing(xlab)) xlab <- "Theoretical quantiles" if (missing(ylab)) ylab <- "Empirical quantiles" if (missing(main)) main <- "Q-Q plot" if (use.ppoints) obsp <- ppoints(n, a = a.ppoints) else obsp <- (1:n) / n # computation of each fitted distribution comput.fti <- function(i) { fti <- ft[[i]] para <- c(as.list(fti$estimate), as.list(fti$fix.arg)) distname <- fti$distname qdistname <- paste("q", distname, sep="") do.call(qdistname, c(list(obsp), as.list(para))) } fittedquant <- sapply(1:nft, comput.fti) if(NCOL(fittedquant) != nft || NROW(fittedquant) != length(obsp)) stop("problem when computing fitted CDFs.") # check limits if(missing(xlim)) xlim <- range(fittedquant) if(missing(ylim)) ylim <- range(mydata) if(plotstyle == "graphics") { ######## plot if plotstyle=='graphics' ######## # main plot if(!largedata) resquant <- plot(fittedquant[,1], sdata, main=main, xlab=xlab, ylab=ylab, log=logxy, pch = fitpch[1], xlim=xlim, ylim=ylim, col=fitcol[1], type="p", ...) else resquant <- plot(fittedquant[,1], sdata, main=main, xlab=xlab, ylab=ylab, log=logxy, lty = fitpch[1], xlim=xlim, ylim=ylim, col=fitcol[1], type="l", lwd = fitlwd[1], ...) #plot of other fitted quantiles if(nft > 1 && !ynoise && !largedata) for(i in 2:nft) points(fittedquant[,i], sdata, pch=fitpch[i], col=fitcol[i], ...) if(nft > 1 && ynoise && !largedata) for(i in 2:nft) if (ylogscale) { noise2mult <- runif(n, 0.95, 1.05) points(fittedquant[,i], sdata*noise2mult, pch=fitpch[i], col=fitcol[i], ...) }else { noise2add <- runif(n, -0.02, 0.02) points(fittedquant[,i], sdata+noise2add, pch=fitpch[i], col=fitcol[i], ...) } if(nft > 1 && largedata) for(i in 2:nft) lines(fittedquant[,i], sdata, col=fitcol[i], lty = fitpch[i], lwd = fitlwd[i], ...) 
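# the y = x line drawn below is the reference: points lying on it indicate
# a perfect match between empirical and fitted theoretical quantiles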
if(line01) abline(0, 1, lty=line01lty, col=line01col) if (addlegend) { if(!largedata) legend(x=xlegend, y=ylegend, bty="n", legend=legendtext, col=fitcol, pch = fitpch, ...) else legend(x=xlegend, y=ylegend, bty="n", legend=legendtext, col=fitcol, lty = fitpch, lwd = fitlwd, ...) } invisible() } else if (!requireNamespace("ggplot2", quietly = TRUE)) { stop("ggplot2 needed for this function to work with plotstyle = 'ggplot'. Please install it", call. = FALSE) } else { ######## plot if plotstyle=='ggplot' ######## # recode the legend position according to available positions in ggplot2 if(xlegend %in% c("topleft", "bottomleft")) xlegend <- "left" if(xlegend %in% c("topright", "bottomright")) xlegend <- "right" # structure the fittedquant in a relevant data.frame fittedquant <- as.data.frame(fittedquant) colnames(fittedquant) <- unlist(lapply(ft, function(X) X["distname"])) fittedquant <- stack(fittedquant) nfq <- nrow(fittedquant) fittedquant$sdata <- sdata # sdata is recycled in the standard fashion fittedquant$ind <- factor(fittedquant$ind, levels = unique(fittedquant$ind)) # reorder levels in the appearance order of the input if(nft > 1 && ynoise && !largedata) { if (ylogscale) { noise2mult <- runif(nfq, 0.95, 1.05) fittedquant$sdata <- fittedquant$sdata*noise2mult }else { noise2add <- runif(nfq, -0.02, 0.02) fittedquant$sdata <- fittedquant$sdata+noise2add } } ggqqcomp <- ggplot2::ggplot(data = fittedquant, ggplot2::aes_(quote(values), quote(sdata), group = quote(ind), colour = quote(ind), shape = quote(ind), size = quote(ind))) + ggplot2::xlab(xlab) + ggplot2::ylab(ylab) + ggplot2::ggtitle(main) + ggplot2::coord_cartesian(xlim = c(xlim[1], xlim[2]), ylim = c(ylim[1], ylim[2])) + {if(!largedata) ggplot2::geom_point() else ggplot2::geom_line(ggplot2::aes_(linetype = quote(ind), size = quote(ind)))} + {if(addlegend) ggplot2::theme(legend.position = c(xlegend, ylegend), plot.title = ggplot2::element_text(hjust = 0.5)) else ggplot2::theme(legend.position = "none", plot.title = ggplot2::element_text(hjust = 0.5))} + ggplot2::scale_color_manual(values = fitcol, labels = legendtext) + ggplot2::scale_shape_manual(values = fitpch, labels = legendtext) + ggplot2::scale_linetype_manual(values = fitpch, labels = legendtext) + ggplot2::scale_size_manual(values = fitlwd, labels = legendtext) + ggplot2::guides(colour = ggplot2::guide_legend(title = NULL)) + ggplot2::guides(shape = ggplot2::guide_legend(title = NULL)) + ggplot2::guides(linetype = ggplot2::guide_legend(title = NULL)) + ggplot2::guides(size = ggplot2::guide_legend(title = NULL)) + {if(line01) ggplot2::geom_abline(intercept = 0, slope = 1)} + {if(xlogscale) ggplot2::scale_x_continuous(trans='log10')} + {if(ylogscale) ggplot2::scale_y_continuous(trans='log10')} return(ggqqcomp) } } fitdistrplus/R/detectbound.R0000644000176200001440000001121514205427010015625 0ustar liggesusers############################################################################# # Copyright (c) 2016 Marie Laure Delignette-Muller, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### detect boundaries detectbound <- function(distname, vstart, obs, fix.arg=NULL, echo=FALSE) { ddistname <- paste("d", distname, sep="") argdist <- formalArgs(ddistname) argdist <- argdist[!argdist %in% c("x", "log")] stopifnot(all(names(vstart) %in% argdist)) if("scale" %in% argdist && "rate" %in% argdist) { if(length(grep("rate", as.character(formals(ddistname)$scale))) > 0) { argdist <- argdist[argdist != "rate"] #remove rate for parameter list if("rate" %in% names(vstart)) #update value if rate is supplied { vstart["rate"] <- 1/vstart["rate"] names(vstart)[names(vstart) == "rate"] <- "scale" } } } argdist <- argdist[!argdist %in% names(fix.arg)] if(length(argdist) == 0) return(NULL) if(echo) { print(argdist) print(vstart) } lowb <- rep(-Inf, length(argdist)) uppb <- -lowb names(lowb) <- names(uppb) <- argdist eps <- sqrt(.Machine$double.eps) owarn <- getOption("warn") oerr <- getOption("show.error.messages") options(warn=-1, show.error.messages=FALSE) for(a in argdist) { if(echo) cat(a, "\n") dx <- do.call(ddistname, c(list(obs), as.list(vstart), as.list(fix.arg))) if(any(is.nan(dx))) stop("wrong init param") vstarta <- vstart aval <- -1:1 for(i in 1:length(aval)) { vstarta[a] <- aval[i]-eps dx1 <- try(do.call(ddistname, c(list(obs), as.list(vstarta), as.list(fix.arg))), silent=TRUE) vstarta[a] <- aval[i]+eps dx2 <- try(do.call(ddistname, c(list(obs), as.list(vstarta), as.list(fix.arg))), silent=TRUE) if(echo) { cat(i, "\ttested value", vstarta, "\n") print(dx1) print(dx2) } if(inherits(dx1, "try-error") && !inherits(dx2, "try-error")) { lowb[a] <- aval[i] } if(any(is.nan(dx1)) && any(!is.nan(dx2))) { lowb[a] <- aval[i] } if(!inherits(dx1, "try-error") && inherits(dx2, "try-error")) { uppb[a] <- aval[i] } if(any(!is.nan(dx1)) && any(is.nan(dx2))) { uppb[a] <- aval[i] } } } options(warn=owarn, show.error.messages=oerr) rbind(lowb, uppb) } fitdistrplus/R/logLik.R0000644000176200001440000000615313742313702014562 0ustar liggesusers############################################################################# # Copyright (c) 2009 Marie Laure Delignette-Muller, Regis Pouillot, Jean-Baptiste Denis, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### log likelihood ### ### R functions ### #logLik already defined in R # #logLik <- function(object, ...) # UseMethod("logLik") # #logLik.default <- function(object, ...) # return(object) logLik.fitdist <- function(object, ...) 
{ stopifnot(inherits(object, "fitdist")) if(is.null(object$loglik)) stop("Internal error in logLik.fitdist") else return(object$loglik) } logLik.fitdistcens <- function(object, ...) { stopifnot(inherits(object, "fitdistcens")) if(is.null(object$loglik)) stop("Internal error in logLik.fitdistcens") else return(object$loglik) } fitdistrplus/R/plotdistcens.R0000644000176200001440000003526513742313702016062 0ustar liggesusers############################################################################# # Copyright (c) 2009 Marie Laure Delignette-Muller, Regis Pouillot # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### plot functions for censored data ### ### R functions ### plotdistcens <- function(censdata, distr, para, leftNA = -Inf, rightNA = Inf, NPMLE = TRUE, Turnbull.confint = FALSE, NPMLE.method = "Wang", ...) { if (missing(censdata) || !(is.vector(censdata$left) & is.vector(censdata$right) & length(censdata[,1])>1)) stop("censdata must be a dataframe with two columns named left and right and more than one line") if ((missing(distr) & !missing(para)) || (!missing(distr) & missing(para))) stop("distr and para must both be defined") my3dots <- list(...) if("main" %in% names(my3dots)) specifytitle <- FALSE else specifytitle <- TRUE if (missing(distr)) { onlyCDFplot <- TRUE titleCDF <- "Cumulative distribution" } else { if (!is.character(distr)) distname <- substring(as.character(match.call()$distr), 2) else distname <- distr if (!is.list(para)) stop("'para' must be a named list") ddistname <- paste("d", distname, sep="") if (!exists(ddistname, mode="function")) stop(paste("The ", ddistname, " function must be defined")) onlyCDFplot <- FALSE titleCDF <- "Empirical and theoretical CDFs" } # definition of xlim or lim for data and of xrange, xmininf and xmaxinf ibounds <- c(censdata$right, censdata$left) iboundsnotNA <- ibounds[!is.na(ibounds)] xmin <- min(iboundsnotNA) xmax <- max(iboundsnotNA) xrange <- xmax - xmin xmin <- xmin - 0.1 * xrange xmax <- xmax + 0.1 * xrange xmininf <- xmin - 100 * xrange xmaxinf <- xmax + 100 * xrange xlim <- c(xmin, xmax) #definition of ylim or lim for ECDF ylim <- c(0,1) # Suppression of the deprecated argument Turnbull ############################################### # if (!missing(Turnbull)) # { # warning("The argument Turnbull is deprecated and should not be used any more.
# Now use the argument NPMLE to tell if you want to compute a nonparametric # maximum likelihood estimation of the cumulative distribution, and the argument NPMLE.method # to define the method chosen for the computation (Turnbull or Wang). # if (missing(NPMLE) & missing(NPMLE.method)) # { # if (Turnbull == TRUE) # { # NPMLE <- TRUE # NPMLE.method <- "Turnbull" # } else # { # NPMLE <- FALSE # } # } # } NPMLE.method <- match.arg(NPMLE.method, c("Wang", "Turnbull.intervals", "Turnbull.middlepoints", "Turnbull")) if (NPMLE.method == "Turnbull") { warning("Turnbull is now a deprecated option for NPMLE.method. You should use Turnbull.middlepoints or Turnbull.intervals. It was here set to Turnbull.middlepoints, equivalent to the former Turnbull.") NPMLE.method <- "Turnbull.middlepoints" } if ((Turnbull.confint == TRUE) & ((NPMLE.method == "Wang") | (NPMLE.method == "Turnbull.intervals"))) { warning("When Turnbull.confint is TRUE, NPMLE.method is forced to Turnbull.middlepoints." ) NPMLE.method <- "Turnbull.middlepoints" # so the second part of the message will be printed in the following if needed onlyCDFplot <- TRUE } if ((NPMLE.method == "Turnbull.middlepoints") & !missing(distr)) { warning("Q-Q plot and P-P plot are available only with the argument NPMLE.method set to Wang (default value) or Turnbull.intervals." ) onlyCDFplot <- TRUE } if ((NPMLE == FALSE) & !missing(distr)) { warning("When NPMLE is FALSE the nonparametric maximum likelihood estimation of the cumulative distribution function is not computed. Q-Q plot and P-P plot are available only with the argument NPMLE.method set to Wang (default value) or Turnbull.intervals." ) onlyCDFplot <- TRUE } if (!onlyCDFplot) { pdistname <- paste("p", distname, sep="") if (!exists(pdistname, mode="function")) stop(paste("The ", pdistname, " function must be defined")) qdistname <- paste("q", distname, sep="") if (!exists(qdistname, mode="function")) stop(paste("The ", qdistname, " function must be defined")) densfun <- get(ddistname, mode="function") nm <- names(para) f <- formals(densfun) args <- names(f) m <- match(nm, args) if (any(is.na(m))) stop(paste("'para' specifies names which are not arguments to ", ddistname)) def.par <- par(no.readonly = TRUE) par(mfrow = c(2, 2)) } # Plot of the empirical distribution as an ECDF if (NPMLE) { if (NPMLE.method == "Wang" | NPMLE.method =="Turnbull.intervals") { f <- npmle(censdata, method = NPMLE.method) # New xlim calculation from equivalence intervals bounds <- c(f$right, f$left) finitebounds <- bounds[is.finite(bounds)] upper <- max(finitebounds) lower <- min(finitebounds) width <- upper - lower xmin.Wang.cdf <- lower - width * 0.1 xmax.Wang.cdf <- upper + width * 0.1 xlim.Wang.cdf <- c(xmin.Wang.cdf, xmax.Wang.cdf) ylim.Wang.cdf <- c(0,1) k <- length(f$left) Fnpsurv <- cumsum(f$p) ## computation of the points for the Q-Q and P-P plots in the GOF statistics and graphs Fbefore <- c(0, Fnpsurv[-k]) df <- data.frame(left = f$left, right = f$right) # Definition of vertices of each rectangle Qi.left <- df$left # dim k Qi.left4plot <- Qi.left if (is.infinite(Qi.left4plot[1]) | is.nan(Qi.left4plot[1])) Qi.left4plot[1] <- xmininf Qi.right <- df$right Qi.right4plot <- Qi.right if (is.infinite(Qi.right4plot[k]) | is.nan(Qi.right4plot[k])) Qi.right4plot[k] <- xmaxinf # keep only 16 significant digits for R configured with noLD (--disable-long-double) Pi.low <- signif(Fbefore, 16) Pi.up <- signif(Fnpsurv, 16) # Plot of the ECDF if (specifytitle) { plot(1, 1, type = "n", xlim = xlim.Wang.cdf, ylim = ylim.Wang.cdf, xlab = "Censored data", ylab
= "CDF", main = titleCDF, ...) } else { plot(1, 1, type = "n", xlim = xlim.Wang.cdf, ylim = ylim.Wang.cdf, xlab = "Censored data", ylab = "CDF", ...) } xmin <- par("usr")[1] xmax <- par("usr")[2] # the line at right of the rectangles dright <- c(f$left[1], rep(f$right, rep(2,k)), f$right[k]) Fright <- rep(c(0,Fnpsurv), rep(2,k+1)) lines(dright, Fright, ...) ### the line at left of the rectangles dleft <- rep(c(f$left,f$right[k]), rep(2,k+1)) Fleft <- c(0,rep(Fnpsurv, rep(2,k)),1) lines(dleft, Fleft, ...) # Add of the filled rectangles for(i in 1:k) { rect(xleft = Qi.left4plot, ybottom = Pi.low, xright = Qi.right4plot, ytop = Pi.up, border = "black", col = "lightgrey", ...) } } else # plot using package survival { survdata <- Surv(time = censdata$left, time2 = censdata$right, type="interval2") survfitted <- survfit(survdata ~ 1) if (Turnbull.confint) { if (specifytitle) { plot(survfitted,fun="event",xlab="Censored data", ylab="CDF",ylim = c(0,1), main = titleCDF, ...) } else { plot(survfitted,fun="event",xlab="Censored data", ylab="CDF", ylim = c(0,1), ...) } } else { if (specifytitle) { plot(survfitted,fun="event",xlab="Censored data", ylab="CDF", conf.int = FALSE, main = titleCDF, ylim = c(0,1), ...) } else { plot(survfitted,fun="event",xlab="Censored data", ylab="CDF", conf.int = FALSE, ylim = c(0,1), ...) } } xmin <- par("usr")[1] xmax <- par("usr")[2] } }else # if !NPMLE { if (is.finite(leftNA) & any(is.na(censdata$left))) censdata[is.na(censdata$left),]$left<-leftNA if (is.finite(rightNA) & any(is.na(censdata$right))) censdata[is.na(censdata$right),]$right<-rightNA lcens<-censdata[is.na(censdata$left),]$right if (any(is.na(lcens)) ) stop("An observation cannot be both right and left censored, coded with two NA values") rcens<-censdata[is.na(censdata$right),]$left noricens<-censdata[!is.na(censdata$left) & !is.na(censdata$right),] # definition of mid point for each observation (if not NA) # in order to have the order of plot of each observation # and order of left and right bounds for censored observations midnoricens<-(noricens$left+noricens$right)/2 ordmid<-order(midnoricens) ordlcens<-order(lcens) ordrcens<-order(rcens) nlcens<-length(lcens) nrcens<-length(rcens) nnoricens<-length(noricens$left) n<-length(censdata$left) if (specifytitle) { plot(c(0,0),c(0,0),type="n",xlim=xlim,ylim=ylim,xlab="Censored data", ylab="CDF",main=titleCDF, ...) } else { plot(c(0,0),c(0,0),type="n",xlim=xlim,ylim=ylim,xlab="Censored data", ylab="CDF", ...) } # functions to plot one interval or point for each observation for # observation ordered i out of n plotlcens <- function(i) { y <- i/n lines(c(xmininf, lcens[ordlcens[i]]), c(y, y), ...) } if (nlcens>=1) toto<-sapply(1:nlcens,plotlcens) plotnoricens<-function(i) { y<-(i+nlcens)/n if (noricens[ordmid[i],]$left!=noricens[ordmid[i],]$right) lines(c(noricens[ordmid[i],]$left,noricens[ordmid[i],]$right),c(y,y), ...) else points(noricens[ordmid[i],]$left,y,pch=4, ...) } if (nnoricens>=1) toto<-sapply(1:nnoricens,plotnoricens) plotrcens<-function(i) { y<-(i+nlcens+nnoricens)/n lines(c(rcens[ordrcens[i]],xmaxinf),c(y,y), ...) 
} if (nrcens>=1) toto <- sapply(1:nrcens,plotrcens) } # end of else if NPMLE if (!missing(distr)){ # plot of the theoretical cumulative function if (!is.character(distr)) distname<-substring(as.character(match.call()$distr),2) else distname<-distr if (!is.list(para)) stop("'para' must be a named list") ddistname <- paste0("d", distname) if (!exists(ddistname, mode="function")) stop(paste("The ", ddistname, " function must be defined")) pdistname <- paste0("p", distname) if (!exists(pdistname, mode="function")) stop(paste("The ", pdistname, " function must be defined")) densfun <- get(ddistname,mode="function") nm <- names(para) f <- formals(densfun) args <- names(f) m <- match(nm, args) if (any(is.na(m))) stop(paste("'para' specifies names which are not arguments to ",ddistname)) # plot of continuous data with theoretical distribution s <- seq(xmin, xmax, by=(xmax-xmin)/100) theop <- do.call(pdistname, c(list(s), as.list(para))) lines(s, theop, col="red") } if (!onlyCDFplot) { # definition of rectangles and limits Qitheo.left <- do.call(qdistname, c(list(Pi.low), as.list(para))) Qitheo.right <- do.call(qdistname, c(list(Pi.up), as.list(para))) xmin.Wang.qq <- min(xmin.Wang.cdf, Qitheo.right[-k]) xmax.Wang.qq <- max(xmax.Wang.cdf, Qitheo.left[-1]) xlim.Wang.qq <- c(xmin.Wang.qq, xmax.Wang.qq) Qitheo.left4plot <- Qitheo.left if (is.infinite(Qitheo.left4plot[1]) | is.nan(Qitheo.left4plot[1])) Qitheo.left4plot[1] <- xmininf Qitheo.right4plot <- Qitheo.right if (is.infinite(Qitheo.right4plot[k]) | is.nan(Qitheo.right4plot[k])) Qitheo.right4plot[k] <- xmaxinf ## Q-Q plot plot(1, 1, type = "n", main = "Q-Q plot", xlim = xlim.Wang.qq, ylim = xlim.Wang.qq, xlab = "Theoretical quantiles", ylab = "Empirical quantiles") rect(xleft = Qitheo.left4plot, ybottom = Qi.left4plot, xright = Qitheo.right4plot, ytop = Qi.right4plot, border = "black", col = "lightgrey") abline(0,1) ## P-P plot plot(1, 1, type = "n", main = "P-P plot", xlim = ylim, ylim = ylim, xlab = "Theoretical probabilities", ylab = "Empirical probabilities") # plot of rectangles Pitheo.low <- do.call(pdistname, c(list(Qi.left), as.list(para))) Pitheo.up <- do.call(pdistname, c(list(Qi.right), as.list(para))) rect(xleft = Pitheo.low, ybottom = Pi.low, xright = Pitheo.up, ytop = Pi.up, border = "black", col = "lightgrey") abline(0,1) par(def.par) } invisible() } fitdistrplus/R/util-transform.R0000644000176200001440000000611213742313702016322 0ustar liggesusers############################################################################# # Copyright (c) 2009 Marie Laure Delignette-Muller, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### Infinitely differentiable transformations of R into a bounded or half-bounded interval ### ### R functions ### #are the inverse functions useless?
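# Illustrative sketch (not run): quick round-trip checks of the transformations
# defined just below; e.g. T01 maps (-Inf, +Inf) onto (0, 1) and iT01 inverts it.
if (FALSE) {
  x <- c(-5, 0, 5)
  all.equal(iT01(T01(x)), x)      # logistic pair for (0, 1)
  all.equal(iT0Inf(T0Inf(x)), x)  # exp/log pair for (0, +Inf)
  all.equal(iT1Inf(T1Inf(x)), x)  # shifted exp/log pair for (1, +Inf)
}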
#Transformation from (-Inf, +Inf) to (-1, 0) Tm10 <- function(x) -1/(1+exp(-x)) #Inverse iTm10 <- function(x) log(-x/(1+x)) #Transformation from (-Inf, +Inf) to (0, 1) T01 <- function(x) 1/(1+exp(-x)) #Inverse iT01 <- function(x) log(x/(1-x)) #Transformation from (-Inf, +Inf) to (0, +Inf) T0Inf <- function(x) exp(x) #Inverse iT0Inf <- function(x) log(x) #Transformation from (-Inf, +Inf) to (1, +Inf) T1Inf <- function(x) 1+exp(x) #Inverse iT1Inf <- function(x) log(x-1) fitdistrplus/R/bootdist-graph.R0000644000176200001440000000177013742313702016267 0ustar liggesusers pairs4boot <- function(x, trueval, col4ramp = c("green", "yellow", "orange", "red"), nbgrid = 100, nbcol = 100, enhance=TRUE, ...) { x <- data.matrix(rbind(x, trueval)) n <- NROW(x) if(is.null(trueval)) id1 <- 1:n else id1 <- 1:(n-1) panel.points <- function(x, y, ...) { points(x[id1], y[id1], xlim=range(x, na.rm=TRUE), ylim=range(y, na.rm=TRUE)) if(!is.null(trueval)) abline(v=x[n], h=y[n], col="red", lwd=2) } panel.density <- function(x, y, ...) { id2 <- id1[!is.na(x[id1])] #require(MASS) k <- kde2d(x[id2], y[id2], n=nbgrid) image(k, col=colorRampPalette(col4ramp)(nbcol), add=TRUE, xlim=range(x, na.rm=TRUE), ylim=range(y, na.rm=TRUE)) if(!is.null(trueval)) abline(v=x[n], h=y[n], col="black", lty="dashed") } if(enhance) pairs(x, upper.panel=panel.points, lower.panel=panel.density, ...) else pairs(x, upper.panel=panel.points, lower.panel=panel.points, ...) invisible() } fitdistrplus/R/util-manageparam.R0000644000176200001440000000544114421750603016564 0ustar liggesusers# manageparam function checks start.arg and fix.arg so that parameters are named correctly # INPUTS # start.arg : starting values for optimization or the function to compute them from data or NULL # fix.arg : fixed values of parameters or the function to compute them from data or NULL # obs : the full dataset # distname : name of the distribution # OUTPUTS # two named lists with untested components manageparam <- function(start.arg, fix.arg, obs, distname) { #if clause with 3 different cases: #start.arg : NULL | named list | a function if(is.null(start.arg)) { trystart <- try(startargdefault(obs, distname), silent = TRUE) if(inherits(trystart, "try-error")) { cat("Error in computing default starting values.\n") stop(trystart) } lstart <- trystart #lstart should be a named list but check it hasnoname <- is.null(names(lstart)) || !is.list(lstart) if(hasnoname) stop("Starting values must be a named list, error in default starting value.") }else if(is.list(start.arg)) { hasnoname <- is.null(names(start.arg)) if(hasnoname) stop("Starting values must be a named list (or a function returning a named list).") lstart <- start.arg }else if(is.function(start.arg)) { trystart <- try(start.arg(obs), silent = TRUE) if(inherits(trystart, "try-error")) { cat("Error in computing starting values with your function.\n") stop(trystart) } lstart <- trystart hasnoname <- is.null(names(lstart)) || !is.list(lstart) if(hasnoname) stop("Starting values must be a named list, your function does not return that.") }else stop("Wrong type of argument for start") #if clause with 3 different cases: #fix.arg : NULL | named list | a function if(is.null(fix.arg)) { lfix <- NULL }else if(is.list(fix.arg)) { hasnoname <- is.null(names(fix.arg)) if(hasnoname) stop("Fixed parameter values must be a named list (or a function returning a named list).") lfix <- fix.arg }else if(is.function(fix.arg)) { tryfix <- try(fix.arg(obs), silent = TRUE) if(inherits(tryfix, "try-error")) { cat("Error in computing fixed
parameter values with your function.\n") stop(tryfix) } lfix <- tryfix hasnoname <- is.null(names(lfix)) || !is.list(lfix) if(hasnoname) stop("Fixed parameter values must be a named list, your function does not return that.") }else stop("Wrong type of argument for fix.arg") #eliminate arguments both in lstart and lfix (when start.arg was NULL) if(is.null(start.arg) && !is.null(lfix)) { lstart <- lstart[!names(lstart) %in% names(lfix)] if(length(lstart) == 0) stop("Don't need to use fitdist() if all parameters have fixed values") } list("start.arg"=lstart, "fix.arg"=lfix) } fitdistrplus/R/mmedist.R0000644000176200001440000004572514421725713015017 0ustar liggesusers############################################################################# # Copyright (c) 2010 Marie Laure Delignette-Muller, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### Matching moment estimation for non-censored data ### ### R functions ### mmedist <- function (data, distr, order, memp, start=NULL, fix.arg=NULL, optim.method="default", lower=-Inf, upper=Inf, custom.optim=NULL, weights=NULL, silent=TRUE, gradient=NULL, checkstartfix=FALSE, ...) 
{ if (!is.character(distr)) stop("distr must be a character string naming a distribution") else distname <- distr if (is.element(distname, c("norm", "lnorm", "pois", "exp", "gamma", "nbinom", "geom", "beta", "unif", "logis"))) meth <- "closed formula" else meth <- optim.method mdistname <- paste("m", distname, sep="") ddistname <- paste("d", distname, sep="") argddistname <- names(formals(ddistname)) if(is.null(custom.optim)) optim.method <- match.arg(optim.method, c("default", "Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN", "Brent")) if(!is.null(weights)) { if(any(weights < 0)) stop("weights should be a vector of integers greater than 0") if(!isallintw(weights)) stop("weights should be a vector of (strictly) positive integers") if(length(weights) != NROW(data)) stop("weights should be a vector with a length equal to the observation number") warning("weights are not taken into account in the default initial values") } if(meth != "closed formula") { if (!exists(mdistname, mode="function")) stop(paste0("The moment ", mdistname, " function must be defined.")) # mdistname contains the proper name of the theoretical moment function } if (!(is.numeric(data) & length(data)>1)) stop("data must be a numeric vector of length greater than 1") if(is.null(weights)) { loglik <- function(par, fix.arg, obs, ddistnam) sum(log(do.call(ddistnam, c(list(obs), as.list(par), as.list(fix.arg)) ) ) ) }else { loglik <- function(par, fix.arg, obs, ddistnam) sum(weights * log(do.call(ddistnam, c(list(obs), as.list(par), as.list(fix.arg)) ) ) ) } if(meth == "closed formula") { n <- length(data) if(is.null(weights)) { m <- mean(data) v <- (n - 1)/n*var(data) }else #weighted version from util-wtdstat.R { m <- wtdmean(data, weights=weights) v <- wtdvar(data, weights=weights) } if (!is.null(fix.arg)) stop("argument fix.arg cannot be used when a closed formula is used.") # Fitting by matching moments if (!(is.vector(data) & is.numeric(data) & length(data)>1)) stop("data must be a numeric vector of length greater than 1") if (distname == "norm") { estimate <- c(mean=m, sd=sqrt(v)) order <- 1:2 } if (distname == "lnorm") { if (any(data <= 0)) stop("values must be positive to fit a lognormal distribution") sd2 <- log(1+v/m^2) estimate <- c(meanlog=log(m) - sd2/2, sdlog=sqrt(sd2)) order <- 1:2 } if (distname == "pois") { estimate <- c(lambda=m) order <- 1 } if (distname == "exp") { estimate <- c(rate=1/m) order <- 1 } if (distname == "gamma" ) { shape <- m^2/v rate <- m/v estimate<-c(shape=shape, rate=rate) order <- 1:2 } if (distname == "nbinom" ) { size <- if (v > m) m^2/(v - m) else NaN estimate<-c(size=size, mu=m) order <- 1:2 } if (distname == "geom" ) { prob<-if (m>0) 1/(1+m) else NaN estimate<-c(prob=prob) order <- 1 } if (distname == "beta" ) { if (any(data < 0) | any(data > 1)) stop("values must be in [0, 1] to fit a beta distribution") aux<-m*(1-m)/v - 1 shape1 <- m*aux shape2 <- (1-m)*aux estimate<-c(shape1=shape1, shape2=shape2) order <- 1:2 } if (distname == "unif" ) { min1 <- m-sqrt(3*v) max1 <- m+sqrt(3*v) estimate<-c(min=min1, max=max1) order <- 1:2 } if (distname == "logis" ) { scale <- sqrt(3*v)/pi estimate<-c(location=m, scale=scale) order <- 1:2 } if (exists(ddistname)) loglikval <- loglik(estimate, fix.arg, data, ddistname) else loglikval <- NULL res <- list(estimate=estimate, convergence=0, value=NULL, hessian=NULL, optim.function=NULL, opt.meth=NULL, fix.arg=NULL, fix.arg.fun=NULL, weights=weights, counts=NULL, optim.message=NULL, loglik= loglikval, method=meth, order=order, memp=NULL) }else #an optimisation has
to be done, where fix.arg and start can be a function { if(is.vector(start)) #backward compatibility start <- as.list(start) if(!checkstartfix) #pre-check has not been done by fitdist() or bootdist() { cat("checkstartfix is carried out\n") # manage starting/fixed values: may raise errors or return two named list arg_startfix <- manageparam(start.arg=start, fix.arg=fix.arg, obs=data, distname=distname) #check inconsistent parameters hasnodefaultval <- sapply(formals(ddistname), is.name) arg_startfix <- checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, argddistname, hasnodefaultval) #arg_startfix contains two names list (no longer NULL nor function) #set fix.arg.fun if(is.function(fix.arg)) fix.arg.fun <- fix.arg else fix.arg.fun <- NULL }else #pre-check has been done by fitdist() or bootdist() { arg_startfix <- list(start.arg=start, fix.arg=fix.arg) fix.arg.fun <- NULL } #unlist starting values as needed in optim() vstart <- unlist(arg_startfix$start.arg) #sanity check if(is.null(vstart)) stop("Starting values could not be NULL with checkstartfix=TRUE") #erase user value #(cannot coerce to vector as there might be different modes: numeric, character...) fix.arg <- arg_startfix$fix.arg if(length(vstart) != length(order)) stop("wrong dimension for the moment order to match") if(missing(memp)) stop("the empirical moment function must be defined") #backward compatibility when memp is the name of the function and not the function itself if(is.character(memp)) memp <- get0(memp, envir=pos.to.env(1)) #check the memp function if(!is.function(memp)) stop("the empirical moment must be defined as a function") if(is.null(weights)) { txt <- "the empirical moment function must be a two-argument function of 'x', 'order'" if(length(formals(memp)) != 2) stop(txt) if(any(names(formals(memp)) != c("x", "order"))) stop(txt) }else { txt <- "the empirical moment function must be a three-argument function of 'x', 'order', 'weights'" if(length(formals(memp)) != 3) stop(txt) if(any(names(formals(memp)) != c("x", "order", "weights"))) stop(txt) } ############# MME fit using optim or custom.optim ########## # definition of the function to minimize : least squares (Cramer - von Mises type) if(is.null(weights)) { DIFF2 <- function(par, fix.arg, order, obs, mdistnam, memp, weights) { momtheo <- do.call(mdistnam, c(as.list(order), as.list(par), as.list(fix.arg)) ) momemp <- as.numeric(memp(obs, order)) (momemp - momtheo)^2 } fnobj <- function(par, fix.arg, obs, mdistnam, memp, weights) sum( sapply(order, function(o) DIFF2(par, fix.arg, o, obs, mdistnam, memp)) ) }else { DIFF2 <- function(par, fix.arg, order, obs, mdistnam, memp, weights) { momtheo <- do.call(mdistnam, c(as.list(order), as.list(par), as.list(fix.arg)) ) momemp <- as.numeric(memp(obs, order, weights)) (momemp - momtheo)^2 } fnobj <- function(par, fix.arg, obs, mdistnam, memp, weights) sum( sapply(order, function(o) DIFF2(par, fix.arg, o, obs, mdistnam, memp, weights)) ) } cens <- FALSE if(cens) stop("Moment matching estimation for censored data is not yet available.") owarn <- getOption("warn") # Try to minimize the stat distance using the base R optim function if(is.null(custom.optim)) { hasbound <- any(is.finite(lower) | is.finite(upper)) # Choice of the optimization method if (optim.method == "default") { opt.meth <- ifelse(length(vstart) > 1, "Nelder-Mead", "BFGS") }else opt.meth <- optim.method if(opt.meth == "BFGS" && hasbound && is.null(gradient)) { opt.meth <- "L-BFGS-B" txt1 <- "The BFGS method cannot be used with bounds without providing
the gradient." txt2 <- "The method is changed to L-BFGS-B." warning(paste(txt1, txt2)) } options(warn=ifelse(silent, -1, 0)) #select optim or constrOptim if(hasbound) #finite bounds are provided { if(!is.null(gradient)) { opt.fun <- "constrOptim" }else #gradient == NULL { if(opt.meth == "Nelder-Mead") opt.fun <- "constrOptim" else if(opt.meth %in% c("L-BFGS-B", "Brent")) opt.fun <- "optim" else { txt1 <- paste("The method", opt.meth, "cannot be used by constrOptim() nor optim() without gradient and bounds.") txt2 <- "Only optimization methods L-BFGS-B, Brent and Nelder-Mead can be used in such a case." stop(paste(txt1, txt2)) } } if(opt.fun == "constrOptim") { #recycle parameters npar <- length(vstart) #as in optim() line 34 lower <- as.double(rep_len(lower, npar)) #as in optim() line 64 upper <- as.double(rep_len(upper, npar)) # constraints are : Mat %*% theta >= Bnd, i.e. # +1 * theta[i] >= lower[i]; # -1 * theta[i] >= -upper[i] #select rows from the identity matrix haslow <- is.finite(lower) Mat <- diag(npar)[haslow, ] #select rows from the opposite of the identity matrix hasupp <- is.finite(upper) Mat <- rbind(Mat, -diag(npar)[hasupp, ]) colnames(Mat) <- names(vstart) rownames(Mat) <- paste0("constr", 1:NROW(Mat)) #select the bounds Bnd <- c(lower[is.finite(lower)], -upper[is.finite(upper)]) names(Bnd) <- paste0("constr", 1:length(Bnd)) initconstr <- Mat %*% vstart - Bnd if(any(initconstr < 0)) stop("Starting values must be in the feasible region.") opttryerror <- try(opt <- constrOptim(theta=vstart, f=fnobj, ui=Mat, ci=Bnd, grad=gradient, fix.arg=fix.arg, obs=data, mdistnam=mdistname, memp=memp, hessian=!is.null(gradient), method=opt.meth, weights=weights, ...), silent=TRUE) if(!inherits(opttryerror, "try-error")) if(length(opt$counts) == 1) #appears when the initial point is a solution opt$counts <- c(opt$counts, NA) }else #opt.fun == "optim" { opttryerror <- try(opt <- optim(par=vstart, fn=fnobj, fix.arg=fix.arg, obs=data, gr=gradient, mdistnam=mdistname, memp=memp, hessian=TRUE, method=opt.meth, lower=lower, upper=upper, weights=weights, ...), silent=TRUE) } }else #hasbound == FALSE { opt.fun <- "optim" opttryerror <- try(opt <- optim(par=vstart, fn=fnobj, fix.arg=fix.arg, obs=data, gr=gradient, mdistnam=mdistname, memp=memp, hessian=TRUE, method=opt.meth, lower=lower, upper=upper, ...), silent=TRUE) } options(warn=owarn) if (inherits(opttryerror,"try-error")) { warning("The function optim encountered an error and stopped.") if(getOption("show.error.messages")) print(attr(opttryerror, "condition")) return(list(estimate = rep(NA,length(vstart)), convergence = 100, value = NA, hessian = NA)) } if (opt$convergence>0) { warning("The function optim failed to converge, with the error code ", opt$convergence) } if(is.null(names(opt$par))) names(opt$par) <- names(vstart) res <- list(estimate = opt$par, convergence = opt$convergence, value = opt$value, hessian = opt$hessian, optim.function=opt.fun, optim.method=opt.meth, fix.arg=fix.arg, fix.arg.fun=fix.arg.fun, weights=weights, counts=opt$counts, optim.message=opt$message, loglik=ifelse(exists(ddistname), loglik(opt$par, fix.arg, data, ddistname), NULL), method=meth, order=order, memp=memp) }else # Try to minimize the stat distance using a user-supplied optim function { opt.meth <- NULL if (!cens) { options(warn=ifelse(silent, -1, 0)) opttryerror <- try(opt <- custom.optim(fn=fnobj, fix.arg=fix.arg, obs=data, mdistnam=mdistname, memp=memp, par=vstart, weights=weights, ...), silent=TRUE) options(warn=owarn) }else stop("Moment matching
estimation for censored data is not yet available.") if (inherits(opttryerror,"try-error")) { warning("The customized optimization function encountered an error and stopped.") if(getOption("show.error.messages")) print(attr(opttryerror, "condition")) return(list(estimate = rep(NA,length(vstart)), convergence = 100, value = NA, hessian = NA)) } if (opt$convergence>0) { warning("The customized optimization function failed to converge, with the error code ", opt$convergence) } if(is.null(names(opt$par))) names(opt$par) <- names(vstart) argdot <- list(...) method.cust <- argdot$method res <- list(estimate = opt$par, convergence = opt$convergence, value = opt$value, hessian = opt$hessian, optim.function=custom.optim, optim.method=method.cust, fix.arg=fix.arg, fix.arg.fun=fix.arg.fun, weights=weights, counts=opt$counts, optim.message=opt$message, loglik=ifelse(exists(ddistname), loglik(opt$par, fix.arg, data, ddistname), NULL), method=meth, order=order, memp=memp) } } return(res) } ## old function with previous name momdist<-function (data, distr) { stop("the name \"momdist\" for the matching moments function is no longer used and has been replaced by \"mmedist\".") } fitdistrplus/R/util-cens2pseudo.R0000644000176200001440000000460513742313702016546 0ustar liggesusers# compute pseudo data from a censored dataset # INPUTS # censdata : a two-column matrix with left, right and/or interval censored data # OUTPUTS # a vector of pseudo data (needed to compute starting values) # a vector of right censored data # a vector of left censored data # a vector of non censored data # a vector of interval censored data cens2pseudo <- function(censdata) { # Definition of datasets lcens (left censored)=vector, rcens (right censored)= vector, # icens (interval censored) = dataframe with left and right # ncens (not censored) = vector and # pseudo (data transformed in non censored pseudo data by taking the # mean of bounds for interval censored data and the left or right bound for the other # censored data) = vector irow.lcens <- is.na(censdata$left) # rows corresponding to lcens data lcens <- censdata[irow.lcens, ]$right irow.rcens <- is.na(censdata$right) # rows corresponding to rcens data rcens <- censdata[irow.rcens, ]$left if (any(is.na(lcens)) ) stop("An observation cannot be both right and left censored, coded with two NA values") irow.ncens <- censdata$left==censdata$right & !is.na(censdata$left) & !is.na(censdata$right) # rows corresponding to ncens data ncens<-censdata[irow.ncens, ]$left irow.icens <- censdata$left!=censdata$right & !is.na(censdata$left) & !is.na(censdata$right) # rows corresponding to icens data icens<-censdata[irow.icens, ] pseudo <- c(rcens, lcens, ncens, (icens$left+icens$right)/2) list(pseudo=pseudo, rcens=rcens, lcens=lcens, ncens=ncens, icens=icens) } # compute row indexes from a censored dataset cens2idxrow <- function(censdata) { # Definition of datasets lcens (left censored)=vector, rcens (right censored)= vector, # icens (interval censored) = dataframe with left and right # and ncens (not censored) = vector irow.lcens <- is.na(censdata$left) # rows corresponding to lcens data irow.rcens <- is.na(censdata$right) # rows corresponding to rcens data irow.ncens <- censdata$left==censdata$right & !is.na(censdata$left) & !is.na(censdata$right) # rows corresponding to ncens data irow.icens <- censdata$left!=censdata$right & !is.na(censdata$left) & !is.na(censdata$right) # rows corresponding to icens data list(lcens=irow.lcens, rcens=irow.rcens, ncens=irow.ncens, icens=irow.icens)
}fitdistrplus/R/util-npsurv-km.R0000644000176200001440000000270313742313702016253 0ustar liggesusers# ----------------------------------------------------------------------- # # Nonparametric maximum likelihood estimation from interval-censored data # # ----------------------------------------------------------------------- # # Internal kaplan meier function # # ----------------------------------------------------------------------- # # Original code from Yong Wang, 2020 # # ----------------------------------------------------------------------- # # Kaplan-Meier estimate of the survival function for right-censored data km = function(data, w=1) { x = icendata(data, w) if(any(x$o[,2] != Inf)) stop("Not all observations are exact or right-censored") if(nrow(x$o) == 0) { # no right-censored observations f = idf(x$t, x$t, x$wt) ll = sum(x$wt * log(f$p)) return(list(f=f, ll=ll)) } c = colSums(x$wo * outer(x$o[,1], x$t, "<")) n = sum(x$wt, x$wo) # number of observations r = n - c - c(0,cumsum(x$wt))[1:length(x$t)] # no. at risk S = cumprod(1 - x$wt/r) # survival prob. # tab = cbind(x$t, x$wt, c, r, S) p = rev(diff(rev(c(1,S,0)))) dc = x$wt + c if(max(x$t) > max(x$o[,1])) { f = idf(x$t, x$t, p[-length(p)]) ll = sum( x$wt * log(f$p) ) } else { f = idf(c(x$t,max(x$o[,1])), c(x$t,Inf), p) ll = sum(c(x$wt, n - sum(x$wt)) * log(f$p)) } list(f=f, ll=ll) } fitdistrplus/R/Surv2fitdistcens.R0000644000176200001440000001117714075737424016637 0ustar liggesusers############################################################################# # Copyright (c) 2021 Christophe Dutang and Marie Laure Delignette-Muller # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### data handling functions ### ### R functions ### Surv2fitdistcens <- function(time, time2, event, type=c('right', 'left', 'interval', 'interval2')) { type <- match.arg(type, c('right', 'left', 'interval', 'interval2')) stopifnot(is.numeric(time)) stopifnot(length(time) >= 1) if (type %in% c('right', 'left')) { if (!missing(time2)) warning("time2 is not used with type 'right' and 'left' ") } else { stopifnot(is.numeric(time2)) stopifnot(length(time) == length(time2)) if (type == 'interval2' & !missing(event)) warning("event is not used with type 'interval2' ") } if (type != 'interval2') { stopifnot(length(time) == length(event)) if(is.logical(event)) event <- 1*(event == TRUE) else if(all(event == 1 | event == 2) & type != 'interval') { event <- 1*(event == 2) }else if(all(event %in% 0:3) && sum(event >= 3) > 0) { stopifnot(type %in% 'interval') }else if(is.factor(event)) { stopifnot(length(levels(event)) == 2) event <- 1*(event == levels(event)[2]) } if(any(!event %in% 0:1) && type != "interval") stop("wrong 'event' argument") } #compute data.frame if(type == "right") { out <- cbind.data.frame(left=time, right=NA) out$right[event == 1] <- time[event == 1] }else if(type == "left") { out <- cbind.data.frame(left=NA, right=time) out$left[event == 1] <- time[event == 1] }else if(type == "interval") { out <- cbind.data.frame(left=rep(NA, length(time)), right=rep(NA, length(time2))) #0=right censored, out$left[event == 0] <- time[event == 0] #1=event at time, out$left[event == 1] <- time[event == 1] out$right[event == 1] <- time[event == 1] #2=left censored, out$right[event == 2] <- time[event == 2] #3=interval censored out$left[event == 3] <- time[event == 3] out$right[event == 3] <- time2[event == 3] } else # type "interval2 { out <- cbind.data.frame(left=time,right=time2) out$left[!is.finite(time)] <- NA out$right[!is.finite(time2)] <- NA } out } fitdistrplus/R/CIcdfplot.R0000644000176200001440000002536114421716550015215 0ustar liggesusers############################################################################# # Copyright (c) 2016 Marie Laure Delignette-Muller, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### plot cumulative distribution functions with confidence interval band ### ### R functions ### CIcdfplot <- function(b, CI.output, CI.type = "two.sided", CI.level = 0.95, CI.col = "red", CI.lty = 2, CI.fill = NULL, CI.only = FALSE, xlim, ylim, xlogscale = FALSE, ylogscale = FALSE, main, xlab, ylab, datapch, datacol, fitlty, fitcol, fitlwd, horizontals = TRUE, verticals = FALSE, do.points = TRUE, use.ppoints = TRUE, a.ppoints = 0.5, name.points = NULL, lines01 = FALSE, plotstyle = "graphics", ...) { if(inherits(b, "bootdist")) { cens <- FALSE } else if(inherits(b, "bootdistcens")) { cens <- TRUE } else { stop("argument b must be a 'bootdist' or a `bootdistcens` object") } if(missing(CI.output)) stop("argument CI.output must be specified: either 'probability' or 'quantile'.") CI.output <- match.arg(CI.output, c("probability", "quantile")) CI.type <- match.arg(CI.type, c("two.sided", "less", "greater")) CI.level <- CI.level[1] #compute lower and upper value for the area if (!cens) { mydat <- b$fitpart$data n <- length(mydat) xmin <- min(mydat) xmax <- max(mydat) } else { censdata <- b$fitpart$censdata n <- nrow(censdata) xmin <- min(c(censdata$left, censdata$right), na.rm=TRUE) xmax <- max(c(censdata$left, censdata$right), na.rm=TRUE) } if (missing(xlim)) xlim <- c(xmin, xmax) lowx <- min(xlim[1], ifelse(xmin < 0, xmin*1.5, xmin*.5)) uppx <- max(xlim[2], ifelse(xmax < 0, xmax*.5, xmax*1.5)) if(missing(ylim)) ylim <- c(0, 1) if(!is.logical(CI.only)) stop("argument CI.only must be a logical") # check the 'plotstyle' argument plotstyle <- match.arg(plotstyle[1], choices = c("graphics", "ggplot"), several.ok = FALSE) #default values (same as cdfcomp()) if (missing(datapch)) datapch <- 16 if (missing(datacol)) datacol <- "black" if (missing(fitcol)) fitcol <- 2 if (missing(fitlty)) fitlty <- 1 if (missing(fitlwd)) fitlwd <- 1 if (missing(xlab)) { if (!cens) xlab <- ifelse(xlogscale, "data in log scale", "data") else xlab <- ifelse(xlogscale, "censored data in log scale", "censored data") } if (missing(ylab)) ylab <- "CDF" if (missing(main)) main <- ifelse(CI.only, "Theoretical CDF with CI", "Empirical and theoretical CDF with CI") #get name and cdf name distname <- b$fitpart$distname pdistname <- paste("p",distname,sep="") qdistname <- paste("q",distname,sep="") if (!exists(pdistname, mode="function") && CI.output == "probability") stop(paste("The ", pdistname, " function must be defined")) if (!exists(qdistname, mode="function") && CI.output == "quantile") stop(paste("The ", qdistname, " function must be defined")) #compute c.d.f. values on bootstraped parameters if(CI.output == "probability") { cdfval <- function(x) { calcp <- function(i) { parai <- c(as.list(b$estim[i, ]), as.list(b$fitpart$fix.arg)) do.call(pdistname, c(list(x), as.list(parai))) } res <- t(sapply(1:b$nbboot, calcp)) rownames(res) <- 1:b$nbboot colnames(res) <- paste0("x=", x) res } x <- seq(lowx, uppx, length=501) #compute quantiles on c.d.f. 
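# e.g. with CI.level = 0.95 and CI.type = "two.sided", alpha = 0.025 below and
# the band is made of the pointwise 2.5% and 97.5% quantiles computed across
# the nbboot bootstrap curves (one column of qval(p)/cdfval(x) per grid point)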
if (CI.type == "two.sided") { alpha <- (1-CI.level)/2 CIband <- t(apply(cdfval(x), MARGIN=2, quantile, probs=c(alpha, 1-alpha), na.rm=TRUE)) colnames(CIband) <- formatperc(c(alpha, 1-alpha), 3) }else if (CI.type == "less") { CIband <- as.matrix(apply(cdfval(x), MARGIN=2, quantile, probs=CI.level, na.rm=TRUE)) colnames(CIband) <- formatperc(CI.level, 3) }else { CIband <- as.matrix(apply(cdfval(x), MARGIN=2, quantile, probs=1-CI.level, na.rm=TRUE)) colnames(CIband) <- formatperc(1-CI.level, 3) } }else #CI.output == "quantile" { qval <- function(p) { calcp <- function(i) { parai <- c(as.list(b$estim[i, ]), as.list(b$fitpart$fix.arg)) do.call(qdistname, c(list(p), as.list(parai))) } res <- t(sapply(1:b$nbboot, calcp)) rownames(res) <- 1:b$nbboot colnames(res) <- paste0("p=", p) res } #compute lower and upper value for the area # p <- seq(sqrt(.Machine$double.eps), 1- sqrt(.Machine$double.eps), length=101) p <- seq(0.0001, 1- 0.0001, length=501) #compute quantiles on c.d.f. if (CI.type == "two.sided") { alpha <- (1-CI.level)/2 CIband <- t(apply(qval(p), MARGIN=2, quantile, probs=c(alpha, 1-alpha), na.rm=TRUE)) colnames(CIband) <- formatperc(c(alpha, 1-alpha), 3) }else if (CI.type == "less") { CIband <- as.matrix(apply(qval(p), MARGIN=2, quantile, probs=1-CI.level, na.rm=TRUE)) colnames(CIband) <- formatperc(CI.level, 3) }else { CIband <- as.matrix(apply(qval(p), MARGIN=2, quantile, probs=CI.level, na.rm=TRUE)) colnames(CIband) <- formatperc(1-CI.level, 3) } } #temp var to open a graphic (if needed) logxy <- paste0(ifelse(xlogscale,"x",""), ifelse(ylogscale,"y","")) ##### plot #### if(plotstyle == "graphics") { ######## plot if plotstyle=='graphics' ######## #open graphic window plot(0, 0, main=main, xlab=xlab, ylab=ylab, xlim=xlim, ylim=ylim, log=logxy, type="n") if (!is.null(CI.fill)) # first fill the band { if(CI.output == "probability") { if(CI.type == "two.sided") polygon(c(x, rev(x)), c(CIband[,2], rev(CIband[,1])), col=CI.fill, border=CI.fill, ...) else if(CI.type == "less") polygon(c(x, uppx, uppx), c(CIband, 1, 0), col=CI.fill, border=CI.fill, ...) else #if(CI.type == "greater") polygon(c(x, lowx, lowx), c(CIband, 1, 0), col=CI.fill, border=CI.fill, ...) }else #CI.output == "quantile" { if(CI.type == "two.sided") polygon(c(CIband[,2], rev(CIband[,1])), c(p, rev(p)), col=CI.fill, border=CI.fill, ...) else if(CI.type == "less") polygon(c(CIband, uppx, uppx), c(p, 1, 0), col=CI.fill, border=CI.fill, ...) else #if(CI.type == "greater") polygon(c(CIband, lowx, lowx), c(p, 1, 0), col=CI.fill, border=CI.fill, ...) } } # add lines for the bounds of the CI if(CI.output == "probability") { matlines(x, CIband, col=CI.col, lty=CI.lty, ...) }else #CI.output == "quantile" { matlines(CIband, p, col=CI.col, lty=CI.lty, ...) 
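# for CI.output == "quantile" the axes are swapped: the bootstrap CI bounds
# are drawn on the x axis against the probability grid p on the y axis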
} if(!CI.only) # add the empirical and fitted distributions { if (!cens) { cdfcomp(b$fitpart, xlim=xlim, ylim=ylim, xlogscale = xlogscale, ylogscale = ylogscale, main=main, xlab=xlab, ylab=ylab, datapch=datapch, datacol=datacol, fitlty=fitlty, fitlwd=fitlwd, fitcol=fitcol, horizontals = horizontals, verticals = verticals, do.points = do.points, use.ppoints = use.ppoints, a.ppoints = a.ppoints, name.points = name.points, lines01 = lines01, addlegend = FALSE, add=TRUE) } else { cdfcompcens(b$fitpart, xlim=xlim, ylim=ylim, xlogscale = xlogscale, ylogscale = ylogscale, main=main, xlab=xlab, ylab=ylab, datacol=datacol, fitlty=fitlty, fitlwd=fitlwd, fillrect = NA, fitcol=fitcol, lines01 = lines01, Turnbull.confint = FALSE, addlegend = FALSE, add=TRUE) } } } else if (!requireNamespace("ggplot2", quietly = TRUE)) { stop("ggplot2 needed for this function to work with plotstyle = 'ggplot'. Please install it", call. = FALSE) } else { ######## plot if plotstyle=='ggplot' ######## if(CI.output == "probability") { if(CI.type == "less") CIband <- cbind(rep(0, NROW(CIband)), CIband) if(CI.type == "greater") CIband <- cbind(CIband, rep(1, NROW(CIband))) dd <- as.data.frame(cbind(x, x, CIband)) } if(CI.output == "quantile") { if(CI.type == "less") CIband <- cbind(rep(uppx, NROW(CIband)), CIband) if(CI.type == "greater") CIband <- cbind(rep(lowx, NROW(CIband)), CIband) dd <- as.data.frame(cbind(CIband, p, p)) } colnames(dd) <- c("x1", "x2", "y1", "y2") {if (!cens) { cdfcomp(b$fitpart, xlim=xlim, ylim=ylim, xlogscale = xlogscale, ylogscale = ylogscale, main=main, xlab=xlab, ylab=ylab, datapch=datapch, datacol=datacol, fitlty={if(!CI.only) fitlty else 0}, fitlwd=fitlwd, fitcol=fitcol, horizontals = {if(!CI.only) horizontals else FALSE}, verticals = {if(!CI.only) verticals else FALSE}, do.points = {if(!CI.only) do.points else FALSE}, use.ppoints = use.ppoints, a.ppoints = a.ppoints, name.points = name.points, lines01 = lines01, addlegend = FALSE, add=TRUE, plotstyle = "ggplot") } else { cdfcompcens(b$fitpart, xlim=xlim, ylim=ylim, xlogscale = xlogscale, ylogscale = ylogscale, main=main, xlab=xlab, ylab=ylab, datacol=datacol, fitlty=fitlty, fitlwd=fitlwd, fillrect = NA, fitcol=fitcol, lines01 = lines01, Turnbull.confint = FALSE, addlegend = FALSE, add=TRUE, plotstyle = "ggplot") }} + ggplot2::geom_line(data = dd, ggplot2::aes_(x=quote(x1), y=quote(y1)), inherit.aes = FALSE, color = CI.col, lty = CI.lty, alpha = 0.5) + ggplot2::geom_line(data = dd, ggplot2::aes_(x=quote(x2), y=quote(y2)), inherit.aes = FALSE, color = CI.col, lty = CI.lty, alpha = 0.5) + {if(!is.null(CI.fill) & CI.output == "probability") ggplot2::geom_ribbon(data = dd, ggplot2::aes_(x = quote(x1), ymin=quote(y1), ymax=quote(y2)), inherit.aes = FALSE, fill = CI.fill, alpha = 0.5)} + {if(!is.null(CI.fill) & CI.output == "quantile") ggplot2::geom_ribbon(data = dd, ggplot2::aes_(xmin = quote(x1), xmax = quote(x2), y = quote(y1)), inherit.aes = FALSE, fill = CI.fill, alpha = 0.5)} } }fitdistrplus/R/util-Turnbull-intervals.R0000644000176200001440000000470514421726567020145 0ustar liggesusers# ----------------------------------------------------------------------- # # Copyright (c) 2020 Marie Laure Delignette-Muller # # # # Extrapolation of bounds of the Turnbull intervals # # form the outputs of survfit (pakage survival) # # ----------------------------------------------------------------------- # Turnbull.intervals <- function(censdata, threshold.par = 0.001) { survdata <- Surv(time = censdata$left, time2 = censdata$right, type="interval2") survfitted 
<- survfit(survdata ~ 1) # calculation of mass s <- survfitted$surv ns <- length(s) savant <- c(1, s[-ns]) mass <- savant - s # calculation of bounds of Turnbull intervals (equivalence classes) middletime <- survfitted$time lem <- length(middletime) leftbounds <- numeric(length = lem) rightbounds <- numeric(length = lem) db <- censdata db$left[is.na(db$left)] <- -Inf db$right[is.na(db$right)] <- Inf bounds <- sort(unique(c(db$right, db$left))) leb <- length(bounds) finitebounds <- bounds[is.finite(bounds)] minbounds <- min(finitebounds) maxbounds <- max(finitebounds) uncensoredvalues <- censdata[censdata$left == censdata$right,]$left j <- 1 for (i in 1:lem) { while(bounds[j] < middletime[i]) j <- j + 1 if (isTRUE(all.equal(middletime[i], bounds[j]))) { leftbounds[i] <- bounds[j] rightbounds[i] <- bounds[j] } else { leftbounds[i] <- bounds[j - 1] rightbounds[i] <- bounds[j] } } # try to rewrite it with a min(which()) and apply # correction for first and last bounds if needed if (!is.finite(bounds[1]) & isTRUE(all.equal(leftbounds[1], rightbounds[1])) & isTRUE(all.equal(leftbounds[1], minbounds)) & !is.element(leftbounds[1], uncensoredvalues)) { leftbounds[1] <- -Inf } if (!is.finite(bounds[leb]) & isTRUE(all.equal(leftbounds[lem], rightbounds[lem])) & isTRUE(all.equal(leftbounds[lem], maxbounds)) & !is.element(leftbounds[lem], uncensoredvalues)) { rightbounds[lem] <- Inf } mass[lem] <- 1 - sum(mass[1:(lem - 1)]) f <- data.frame(left = leftbounds, right = rightbounds, p = mass, middletime = middletime) # elimination of negligible masses threshold <- threshold.par / nrow(censdata) f <- f[f$p > threshold, ] nf <- nrow(f) f$p[nf] <- 1 - sum(f$p[1:(nf - 1)]) return(f) } fitdistrplus/R/msedist.R0000644000176200001440000004331514421725526015020 0ustar liggesusers############################################################################# # Copyright (c) 2019 Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### maximum goodness-of-fit estimation for censored or non-censored data ### and continuous distributions ### (at this time only available for non censored data) ### ### R functions ### msedist <- function (data, distr, phidiv="KL", power.phidiv=NULL, start=NULL, fix.arg=NULL, optim.method="default", lower=-Inf, upper=Inf, custom.optim=NULL, weights=NULL, silent=TRUE, gradient=NULL, checkstartfix=FALSE, ...)
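# maximum spacing estimation: given the order statistics x(1) <= ... <= x(n),
# the spacings D_i = F(x(i)) - F(x(i-1)) (with F(x(0)) = 0 and F(x(n+1)) = 1)
# are computed from the fitted cdf, and a phi-divergence of these spacings is
# minimized, e.g. -mean(log(D_i)) for phidiv = "KL", the classical maximum
# spacing criterion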
# data may correspond to a vector for non censored data or to # a dataframe of two columns named left and right for censored data { if (!is.character(distr)) stop("distr must be a character string naming a distribution") else distname <- distr pdistname <- paste("p",distname,sep="") if (!exists(pdistname, mode="function")) stop(paste("The ", pdistname, " function must be defined")) ddistname <- paste("d",distname,sep="") if (!exists(ddistname, mode="function")) stop(paste("The ", ddistname, " function must be defined")) argddistname <- names(formals(ddistname)) if(is.null(custom.optim)) optim.method <- match.arg(optim.method, c("default", "Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN", "Brent")) phidiv <- match.arg(phidiv, c("KL", "J", "R", "H", "V")) if(phidiv == "KL") #Kullback-Leibler information stopifnot(is.null(power.phidiv)) if(phidiv == "J") #Jeffreys divergence stopifnot(is.null(power.phidiv)) if(phidiv == "R") #Renyi divergence { stopifnot(length(power.phidiv) == 1) stopifnot(power.phidiv > 0 && power.phidiv != 1) } if(phidiv %in% c("H", "V")) #Hellinger distance or Vajda information { stopifnot(length(power.phidiv) == 1) stopifnot(power.phidiv >= 1) } start.arg <- start #to avoid confusion with the start() function of stats pkg (check is done lines 87-100) if(is.vector(start.arg)) #backward compatibility start.arg <- as.list(start.arg) if(!is.null(weights)) { if(any(weights < 0)) stop("weights should be a vector of integers greater than 0") if(!isallintw(weights)) stop("weights should be a vector of (strictly) positive integers") if(length(weights) != NROW(data)) stop("weights should be a vector with a length equal to the observation number") warning("weights are not taken into account in the default initial values") } if (is.vector(data)) { cens <- FALSE if (!(is.numeric(data) & length(data)>1)) stop("data must be a numeric vector of length greater than 1 for non censored data or a dataframe with two columns named left and right and more than one line for censored data") } else { cens <- TRUE censdata <- data stop("not yet implemented") } if (cens) { stop("not yet implemented") } if(!checkstartfix) #pre-check has not been done by fitdist() or bootdist() { # manage starting/fixed values: may raise errors or return two named list arg_startfix <- manageparam(start.arg=start, fix.arg=fix.arg, obs=data, distname=distname) #check inconsistent parameters hasnodefaultval <- sapply(formals(ddistname), is.name) arg_startfix <- checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, argddistname, hasnodefaultval) #arg_startfix contains two names list (no longer NULL nor function) #set fix.arg.fun if(is.function(fix.arg)) fix.arg.fun <- fix.arg else fix.arg.fun <- NULL }else #pre-check has been done by fitdist() or bootdist() { arg_startfix <- list(start.arg=start, fix.arg=fix.arg) fix.arg.fun <- NULL } #unlist starting values as needed in optim() vstart <- unlist(arg_startfix$start.arg) #sanity check if(is.null(vstart)) stop("Starting values could not be NULL with checkstartfix=TRUE") #erase user value #(cannot coerce to vector as there might be different modes: numeric, character...) 
fix.arg <- arg_startfix$fix.arg
  
  ############# MSE fit using optim or custom.optim ##########
  
  # definition of the function to minimize depending on the argument phidiv
  # for non censored data
  if (!cens)
  {
    if(is.null(weights))
      weights <- rep(1, NROW(data))
    if(length(weights) != NROW(data))
      stop("weights should be a vector with a length equal to the observation number")
    
    # objective function (to minimize) is the opposite of the weighted average of the log spacings
    # the argument names are:
    # - par for parameters (like in optim function)
    # - fix.arg for optional fixed parameters
    # - obs for observations (previously dat but conflicts with genoud data.type.int argument)
    # - pdistnam for distribution name
    #NB1: by keeping only positive values, we take into account cdfs that are step functions
    #NB2: weight has a duplicate first value to have same length as spacings diffFx
    if(phidiv == "KL")
    {
      fnobj <- function(par, fix.arg, obs, pdistnam)
      {
        sx <- c(-Inf, sort(obs), Inf)
        n <- length(sx)
        Fxi <- do.call(pdistnam, c(list(sx[-1]), as.list(par), as.list(fix.arg)))
        Fxim1 <- do.call(pdistnam, c(list(sx[-n]), as.list(par), as.list(fix.arg)))
        diffFx <- Fxi - Fxim1
        idxPositive <- diffFx > 0
        if(sum(idxPositive, na.rm = TRUE) == 0)
          return(NaN)
        wx <- c(weights[1], weights)
        - mean(wx[idxPositive] * log(diffFx[idxPositive]))
      }
    }else if(phidiv == "J")
    {
      fnobj <- function(par, fix.arg, obs, pdistnam)
      {
        sx <- c(-Inf, sort(obs), Inf)
        n <- length(sx)
        Fxi <- do.call(pdistnam, c(list(sx[-1]), as.list(par), as.list(fix.arg)))
        Fxim1 <- do.call(pdistnam, c(list(sx[-n]), as.list(par), as.list(fix.arg)))
        diffFx <- Fxi - Fxim1
        idxPositive <- diffFx > 0
        if(sum(idxPositive, na.rm = TRUE) == 0)
          return(NaN)
        wx <- c(weights[1], weights)
        - mean(wx[idxPositive] * log(diffFx[idxPositive]) * (1-diffFx[idxPositive]))
      }
    }else if(phidiv == "R")
    {
      fnobj <- function(par, fix.arg, obs, pdistnam)
      {
        sx <- c(-Inf, sort(obs), Inf)
        n <- length(sx)
        Fxi <- do.call(pdistnam, c(list(sx[-1]), as.list(par), as.list(fix.arg)))
        Fxim1 <- do.call(pdistnam, c(list(sx[-n]), as.list(par), as.list(fix.arg)))
        diffFx <- Fxi - Fxim1
        idxPositive <- diffFx > 0
        if(sum(idxPositive, na.rm = TRUE) == 0)
          return(NaN)
        wx <- c(weights[1], weights)
        - mean(wx[idxPositive] * diffFx[idxPositive]^power.phidiv * sign(1-power.phidiv))
      }
    }else if(phidiv == "H")
    {
      fnobj <- function(par, fix.arg, obs, pdistnam)
      {
        sx <- c(-Inf, sort(obs), Inf)
        n <- length(sx)
        Fxi <- do.call(pdistnam, c(list(sx[-1]), as.list(par), as.list(fix.arg)))
        Fxim1 <- do.call(pdistnam, c(list(sx[-n]), as.list(par), as.list(fix.arg)))
        diffFx <- Fxi - Fxim1
        idxPositive <- diffFx > 0
        if(sum(idxPositive, na.rm = TRUE) == 0)
          return(NaN)
        wx <- c(weights[1], weights)
        mean(wx[idxPositive] * abs( 1-diffFx[idxPositive]^(1/power.phidiv) )^power.phidiv )
      }
    }else if(phidiv == "V")
    {
      fnobj <- function(par, fix.arg, obs, pdistnam)
      {
        sx <- c(-Inf, sort(obs), Inf)
        n <- length(sx)
        Fxi <- do.call(pdistnam, c(list(sx[-1]), as.list(par), as.list(fix.arg)))
        Fxim1 <- do.call(pdistnam, c(list(sx[-n]), as.list(par), as.list(fix.arg)))
        diffFx <- Fxi - Fxim1
        idxPositive <- diffFx > 0
        if(sum(idxPositive, na.rm = TRUE) == 0)
          return(NaN)
        wx <- c(weights[1], weights)
        mean(wx[idxPositive] * abs( 1-diffFx[idxPositive] )^power.phidiv )
      }
    }else
      stop("error: wrong phidiv")
  } else # if (!cens)
    stop("Maximum spacing estimation is not yet available for censored data.")
  
  # Function to calculate the loglikelihood to return
  loglik <- function(par, fix.arg, obs, ddistnam)
  {
    sum(log(do.call(ddistnam, c(list(obs), as.list(par), as.list(fix.arg)) ) ) )
  }
  
  owarn <- getOption("warn")
<- getOption("warn") # Try to minimize the gof distance using the base R optim function if(is.null(custom.optim)) { hasbound <- any(is.finite(lower) | is.finite(upper)) # Choice of the optimization method if (optim.method == "default") { meth <- ifelse(length(vstart) > 1, "Nelder-Mead", "BFGS") }else meth <- optim.method if(meth == "BFGS" && hasbound && is.null(gradient)) { meth <- "L-BFGS-B" txt1 <- "The BFGS method cannot be used with bounds without provided the gradient." txt2 <- "The method is changed to L-BFGS-B." warning(paste(txt1, txt2)) } options(warn=ifelse(silent, -1, 0)) #select optim or constrOptim if(hasbound) #finite bounds are provided { if(!is.null(gradient)) { opt.fun <- "constrOptim" }else #gradient == NULL { if(meth == "Nelder-Mead") opt.fun <- "constrOptim" else if(meth %in% c("L-BFGS-B", "Brent")) opt.fun <- "optim" else { txt1 <- paste("The method", meth, "cannot be used by constrOptim() nor optim() without gradient and bounds.") txt2 <- "Only optimization methods L-BFGS-B, Brent and Nelder-Mead can be used in such case." stop(paste(txt1, txt2)) } } if(opt.fun == "constrOptim") { #recycle parameters npar <- length(vstart) #as in optim() line 34 lower <- as.double(rep_len(lower, npar)) #as in optim() line 64 upper <- as.double(rep_len(upper, npar)) # constraints are : Mat %*% theta >= Bnd, i.e. # +1 * theta[i] >= lower[i]; # -1 * theta[i] >= -upper[i] #select rows from the identity matrix haslow <- is.finite(lower) Mat <- diag(npar)[haslow, ] #select rows from the opposite of the identity matrix hasupp <- is.finite(upper) Mat <- rbind(Mat, -diag(npar)[hasupp, ]) colnames(Mat) <- names(vstart) rownames(Mat) <- paste0("constr", 1:NROW(Mat)) #select the bounds Bnd <- c(lower[is.finite(lower)], -upper[is.finite(upper)]) names(Bnd) <- paste0("constr", 1:length(Bnd)) initconstr <- Mat %*% vstart - Bnd if(any(initconstr < 0)) stop("Starting values must be in the feasible region.") opttryerror <- try(opt <- constrOptim(theta=vstart, f=fnobj, ui=Mat, ci=Bnd, grad=gradient, fix.arg=fix.arg, obs=data, pdistnam=pdistname, hessian=!is.null(gradient), method=meth, ...), silent=TRUE) if(!inherits(opttryerror, "try-error")) if(length(opt$counts) == 1) #appears when the initial point is a solution opt$counts <- c(opt$counts, NA) }else #opt.fun == "optim" { opttryerror <- try(opt <- optim(par=vstart, fn=fnobj, fix.arg=fix.arg, obs=data, gr=gradient, pdistnam=pdistname, hessian=TRUE, method=meth, lower=lower, upper=upper, ...), silent=TRUE) } }else #hasbound == FALSE { opt.fun <- "optim" opttryerror <- try(opt <- optim(par=vstart, fn=fnobj, fix.arg=fix.arg, obs=data, gr=gradient, pdistnam=pdistname, hessian=TRUE, method=meth, lower=lower, upper=upper, ...), silent=TRUE) } options(warn=owarn) if (inherits(opttryerror,"try-error")) { warnings("The function optim encountered an error and stopped.") if(getOption("show.error.messages")) print(attr(opttryerror, "condition")) return(list(estimate = rep(NA,length(vstart)), convergence = 100, loglik = NA, hessian = NA)) } if (opt$convergence>0) { warnings("The function optim failed to converge, with the error code ", opt$convergence) } if(is.null(names(opt$par))) names(opt$par) <- names(vstart) res <- list(estimate = opt$par, convergence = opt$convergence, value = opt$value, hessian = opt$hessian, optim.function=opt.fun, optim.method=meth, fix.arg = fix.arg, fix.arg.fun = fix.arg.fun, weights=NULL, counts=opt$counts, optim.message=opt$message, loglik=loglik(opt$par, fix.arg, data, ddistname), phidiv=phidiv, power.phidiv=power.phidiv) } else # 
  {
    options(warn=ifelse(silent, -1, 0))
    if (!cens)
      opttryerror <- try(opt <- custom.optim(fn=fnobj, fix.arg=fix.arg, obs=data, pdistnam=pdistname,
                                             par=vstart, ...), silent=TRUE)
    else
      stop("Maximum spacing estimation is not yet available for censored data.")
    options(warn=owarn)
    
    if (inherits(opttryerror,"try-error"))
    {
      warning("The customized optimization function encountered an error and stopped.")
      if(getOption("show.error.messages")) print(attr(opttryerror, "condition"))
      return(list(estimate = rep(NA,length(vstart)), convergence = 100, value = NA,
                  hessian = NA))
    }
    
    if (opt$convergence>0)
    {
      warning("The customized optimization function failed to converge, with the error code ", opt$convergence)
    }
    if(is.null(names(opt$par)))
      names(opt$par) <- names(vstart)
    argdot <- list(...)
    method.cust <- argdot$method
    res <- list(estimate = opt$par, convergence = opt$convergence, value = opt$value,
                hessian = opt$hessian, optim.function=custom.optim, optim.method=method.cust,
                fix.arg = fix.arg, fix.arg.fun = fix.arg.fun, weights=weights,
                counts=opt$counts, optim.message=opt$message,
                loglik=loglik(opt$par, fix.arg, data, ddistname),
                phidiv=phidiv, power.phidiv=power.phidiv)
  }
  return(res)
}
fitdistrplus/R/denscomp.R0000644000176200001440000003306014373435663015162 0ustar liggesusers#############################################################################
#   Copyright (c) 2012 Christophe Dutang, Aurelie Siberchicot,
#                      Marie Laure Delignette-Muller
#
#   This program is free software; you can redistribute it and/or modify
#   it under the terms of the GNU General Public License as published by
#   the Free Software Foundation; either version 2 of the License, or
#   (at your option) any later version.
#
#   This program is distributed in the hope that it will be useful,
#   but WITHOUT ANY WARRANTY; without even the implied warranty of
#   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
#   GNU General Public License for more details.
#
#   You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the
#   Free Software Foundation, Inc.,
#   59 Temple Place, Suite 330, Boston, MA 02111-1307, USA
#
#############################################################################
### plot density functions for various fits
### of continuous distribution(s) (fitdist results)
### on a same dataset
###
###         R functions
###

denscomp <- function(ft, xlim, ylim, probability = TRUE, main, xlab, ylab,
                     datacol, fitlty, fitcol, fitlwd, addlegend = TRUE, legendtext,
                     xlegend = "topright", ylegend = NULL, demp = FALSE, dempcol = "black",
                     plotstyle = "graphics", discrete, fitnbpts = 101, fittype="l", ...)
{
  if(inherits(ft, "fitdist"))
  {
    ft <- list(ft)
  }else if(!is.list(ft))
  {
    stop("argument ft must be a list of 'fitdist' objects")
  }else
  {
    if(any(sapply(ft, function(x) !inherits(x, "fitdist"))))
      stop("argument ft must be a list of 'fitdist' objects")
  }
  
  # In the future developments, it will be necessary to check that all the fits share the same weights
  if(!is.null(ft[[1]]$weights))
    stop("denscomp is not yet available when using weights")
  
  # check the 'plotstyle' argument
  plotstyle <- match.arg(plotstyle[1], choices = c("graphics", "ggplot"), several.ok = FALSE)
  
  # parameters accepted by the 'hist' function (dev note: check whether these are still used)
argshistPlotFalse <- c("breaks", "nclass", "include.lowest", "right") argshistPlotTrue <- c(argshistPlotFalse, "density", "angle", "border", "axes", "labels") # manage default parameters nft <- length(ft) if (missing(datacol)) datacol <- NULL if (missing(fitcol)) fitcol <- 2:(nft+1) if (missing(fitlty)) fitlty <- 1:nft if (missing(fitlwd)) fitlwd <- 1 fitcol <- rep(fitcol, length.out=nft) fitlty <- rep(fitlty, length.out=nft) fitlwd <- rep(fitlwd, length.out=nft) fittype <- match.arg(fittype[1], c("p", "l", "o")) if (missing(xlab)) xlab <- "data" if (missing(ylab)) ylab <- ifelse(probability, "Density", "Frequency") if (missing(main)) main <- ifelse(probability, "Histogram and theoretical densities", "Histogram and theoretical frequencies") # check data mydata <- ft[[1]]$data verif.ftidata <- function(fti) { if (any(fti$data != mydata)) stop("All compared fits must have been obtained with the same dataset") invisible() } lapply(ft, verif.ftidata) # check xlim if(missing(xlim)) { xmin <- min(mydata) xmax <- max(mydata) xlim <- range(mydata) }else { xmin <- xlim[1] xmax <- xlim[2] } # initiate discrete if not given if(missing(discrete)) { discrete <- any(sapply(ft, function(x) x$discrete)) } if(!is.logical(discrete)) stop("wrong argument 'discrete'.") if(!is.logical(demp)) stop("wrong argument 'discrete'.") # some variable definitions n <- length(mydata) if(!discrete) sfin <- seq(xmin, xmax, length.out = fitnbpts[1]) else sfin <- unique(round(seq(xmin, xmax, length.out = fitnbpts[1]), digits = 0)) reshist <- hist(mydata, plot = FALSE, ...) if (!discrete) { if (probability) { scalefactor <- 1 } else { if (length(unique(diff(reshist$breaks))) > 1) # wrong histogram and not possible to compute a scale factor stop("You should not use probability = FALSE with non-equidistant breaks for the histogram !") else scalefactor <- n * diff(reshist$breaks)[1] } # previous writing that gave incorrect output in case of probability = 1 and non-equidistant breaks # scalefactor <- ifelse(probability, 1, n * diff(reshist$breaks)) } else { scalefactor <- ifelse(probability, 1, n) } # binwidth <- min(diff(reshist$breaks)) # computation of each fitted distribution comput.fti <- function(i) { fti <- ft[[i]] para <- c(as.list(fti$estimate), as.list(fti$fix.arg)) distname <- fti$distname ddistname <- paste("d", distname, sep="") do.call(ddistname, c(list(sfin), as.list(para))) * scalefactor } fitteddens <- sapply(1:nft, comput.fti) if(NCOL(fitteddens) != nft || NROW(fitteddens) != length(sfin)) stop("problem when computing fitted densities.") # check ylim if (missing(ylim)) { if(!probability) if (discrete) { ylim <- c(0, max(as.numeric(table(mydata)))) } else { ylim <- c(0, max(reshist$counts)) } else # so if probability { if (discrete) { ylim <- c(0, max(as.numeric(table(mydata))/length(mydata))) } else { ylim <- c(0, max(reshist$density)) } } ylim <- range(ylim, fitteddens) }else ylim <- range(ylim) # in case of users enter a bad ylim # check legend parameters if added if(missing(legendtext)) { legendtext <- sapply(ft, function(x) x$distname) if(length(legendtext) != length(unique(legendtext))) legendtext <- paste(legendtext, sapply(ft, function(x) toupper(x$method)), sep="-") if(length(legendtext) != length(unique(legendtext))) legendtext <- paste(legendtext, 1:nft, sep="-") } # forces demp to TRUE if discrete is TRUE if(discrete) demp <- TRUE #add empirical density/fmp to legend vectors if(demp) { legendtext <- c(legendtext, "emp.") fitlty <- c(fitlty, 1) fitlwd <- c(fitlwd, 1) fitcol <- c(fitcol, dempcol) } 
if(plotstyle == "graphics") { ######## plot if plotstyle=='graphics' ######## if(!discrete) { #main plotting reshist <- hist(mydata, main = main, xlab = xlab, ylab = ylab, xlim = xlim, ylim = ylim, col = datacol, probability = probability, ...) #plot fitted densities (line) for(i in 1:nft) lines(sfin, fitteddens[,i], lty=fitlty[i], col=fitcol[i], lwd=fitlwd[i], ...) #plot empirical density if(demp) lines(density(mydata)$x, density(mydata)$y * scalefactor, col=dempcol) if (addlegend) legend(x=xlegend, y=ylegend, bty="n", legend=legendtext, lty=fitlty, col=fitcol, lwd=fitlwd, ...) }else # so if discrete { #main plotting # plotting of an empty histogram reshist <- hist(mydata, main = main, xlab = xlab, ylab = ylab, xlim = xlim, ylim = ylim, border = "white", probability = probability, ...) eps <- diff(range(sfin))/200 if(fittype %in% c("l", "o")) { #plot fitted mass probability functions (line) for(i in 1:nft) lines(sfin+(i)*eps, fitteddens[,i], lty=fitlty[i], col=fitcol[i], lwd=fitlwd[i], type="h", ...) #plot empirical mass probability function if(demp) { empval <- sort(unique(mydata)) empprob <- as.numeric(table(mydata))/length(mydata) * scalefactor lines(empval, empprob, col=dempcol, type="h") } } if(fittype %in% c("p", "o")) { #plot fitted mass probability functions (point) for(i in 1:nft) points(sfin+(i)*eps, fitteddens[,i], col=fitcol[i], pch=1) #plot empirical density if(demp) { empval <- sort(unique(mydata)) empprob <- as.numeric(table(mydata))/length(mydata) * scalefactor points(empval, empprob, col=dempcol, pch=1) } } if (addlegend && fittype %in% c("l", "o")) legend(x=xlegend, y=ylegend, bty="n", legend=legendtext, lty=fitlty, col=fitcol, lwd=fitlwd, ...) if (addlegend && fittype == "p") legend(x=xlegend, y=ylegend, bty="n", legend=legendtext, pch=1, col=fitcol, ...) } invisible() } else if (!requireNamespace("ggplot2", quietly = TRUE)) { stop("ggplot2 needed for this function to work with plotstyle = 'ggplot'. Please install it", call. 
= FALSE) } else { ######## plot if plotstyle=='ggplot' ######## # recode the legend position according to available positions in ggplot2 if(xlegend %in% c("topleft", "bottomleft")) xlegend <- "left" if(xlegend %in% c("topright", "bottomright")) xlegend <- "right" # the default colors of the bars is the same as panel.background.fill in theme_grey() if(is.null(datacol)) datacol <- "grey92" if (!discrete) { # structure the fitteddens in a relevant data.frame fitteddens <- as.data.frame(fitteddens) colnames(fitteddens) <- unlist(lapply(ft, function(X) X["distname"])) fitteddens <- stack(fitteddens) fitteddens$sfin <- sfin # sfin is recycled in the standard fashion fitteddens$ind <- factor(fitteddens$ind, levels = unique(fitteddens$ind)) # reorder levels in the appearance order of the input if(demp) # bind empirical data if demp is TRUE fitteddens <- rbind(fitteddens, data.frame(values = density(mydata)$y * scalefactor, ind = "demp", sfin = density(mydata)$x)) histdata <- data.frame(values = mydata, ind = "hist", sfin = mydata) # the added data must have the same column names as the main data to be compatible with ggplot ggdenscomp <- ggplot2::ggplot(fitteddens, ggplot2::aes_(quote(sfin), quote(values), group = quote(ind), colour = quote(ind))) + ggplot2::xlab(xlab) + ggplot2::ylab(ylab) + ggplot2::ggtitle(main) + ggplot2::coord_cartesian(xlim = c(xlim[1], xlim[2]), ylim = c(ylim[1], ylim[2])) + {if(probability) ggplot2::geom_histogram(data = histdata, ggplot2::aes_(quote(values), quote(ggplot2::after_stat(density))), breaks = reshist$breaks, boundary = 0, show.legend = FALSE, col = "black", alpha = 1, fill = datacol) else ggplot2::geom_histogram(data = histdata, ggplot2::aes_(quote(values), quote(..count..)), breaks = reshist$breaks, boundary = 0, show.legend = FALSE, col = "black", alpha = 1, fill = datacol)} + ggplot2::geom_line(data = fitteddens, ggplot2::aes_(linetype = quote(ind), colour = quote(ind), size = quote(ind))) + ggplot2::guides(colour = ggplot2::guide_legend(title = NULL)) + ggplot2::guides(linetype = ggplot2::guide_legend(title = NULL)) + ggplot2::guides(size = ggplot2::guide_legend(title = NULL)) + ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5)) + {if(addlegend) ggplot2::theme(legend.position = c(xlegend, ylegend)) else ggplot2::theme(legend.position = "none")} + ggplot2::scale_color_manual(values = fitcol, labels = legendtext) + ggplot2::scale_linetype_manual(values = fitlty, labels = legendtext) + ggplot2::scale_size_manual(values = fitlwd, labels = legendtext) return(ggdenscomp) } else { eps <- diff(range(sfin))/200 # structure the fitteddens in a relevant data.frame fitteddens <- as.data.frame(fitteddens) colnames(fitteddens) <- unlist(lapply(ft, function(X) X["distname"])) fitteddens <- stack(fitteddens) fitteddens$ind <- factor(fitteddens$ind, levels = unique(fitteddens$ind)) # reorder levels in the appearance order of the input fitteddens$sfin <- sfin + sapply(fitteddens$ind, function(X) which(X == levels(fitteddens$ind))) *eps # sfin is recycled in the standard fashion if(demp) # bind empirical data if demp is TRUE fitteddens <- rbind(fitteddens, data.frame(values = as.numeric(table(mydata))/length(mydata) * scalefactor, ind = "demp", sfin = as.numeric(names(table(mydata))))) ggdenscomp <- ggplot2::ggplot(fitteddens, ggplot2::aes_(quote(sfin), quote(values), group = quote(ind), colour = quote(ind))) + ggplot2::xlab(xlab) + ggplot2::ylab(ylab) + ggplot2::ggtitle(main) + ggplot2::coord_cartesian(xlim = c(xlim[1], xlim[2]), ylim = c(ylim[1], ylim[2])) + 
{if(fittype %in% c("l", "o")) ggplot2::geom_segment(data = fitteddens, ggplot2::aes_(x = quote(sfin), xend = quote(sfin), y = 0, yend = quote(values), linetype = quote(ind), size = quote(ind)))} + {if(fittype %in% c("p", "o")) ggplot2::geom_point(data = fitteddens, ggplot2::aes_(x = quote(sfin), y = quote(values), colour = quote(ind)), shape = 1)} + ggplot2::guides(colour = ggplot2::guide_legend(title = NULL)) + ggplot2::guides(linetype = ggplot2::guide_legend(title = NULL)) + ggplot2::guides(size = ggplot2::guide_legend(title = NULL)) + ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5)) + {if(addlegend) ggplot2::theme(legend.position = c(xlegend, ylegend)) else ggplot2::theme(legend.position = "none")} + ggplot2::scale_color_manual(values = fitcol, labels = legendtext) + ggplot2::scale_linetype_manual(values = fitlty, labels = legendtext) + ggplot2::scale_size_manual(values = fitlwd, labels = legendtext) return(ggdenscomp) } } } fitdistrplus/R/coef.R0000644000176200001440000000614513742313702014256 0ustar liggesusers############################################################################# # Copyright (c) 2009 Marie Laure Delignette-Muller, Regis Pouillot, Jean-Baptiste Denis, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### extract model coefficients ### ### R functions ### #already in R # #coef <- function(object, ...) # UseMethod("coef") # #coef.default <- function(object, ...) # return(object) coef.fitdist <- function(object, ...) { stopifnot(inherits(object, "fitdist")) if(is.null(object$estimate)) stop("Internal error in coef.fitdist") else return(object$estimate) } coef.fitdistcens <- function(object, ...) { stopifnot(inherits(object, "fitdistcens")) if(is.null(object$estimate)) stop("Internal error in coef.fitdistcens") else return(object$estimate) } fitdistrplus/R/gradlogLik.R0000644000176200001440000000614313742313702015417 0ustar liggesusers############################################################################# # Copyright (c) 2016 Marie Laure Delignette-Muller, Christophe Dutang # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
#
#   You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the
#   Free Software Foundation, Inc.,
#   59 Temple Place, Suite 330, Boston, MA 02111-1307, USA
#
#############################################################################
### gradient of log-likelihood
###
###         R functions (not exported)
###

#Beta distribution
grdbeta <- function(x, d1, d2) #individual contribution
  c(log(x)-digamma(d1)+digamma(d1+d2), log(1-x)-digamma(d2)+digamma(d1+d2))
grlnlbeta <- function(par, obs, ...) #total grad loglik
  -rowSums(sapply(obs, function(x) grdbeta(x, d1=par[1], d2=par[2])))

#Gamma distribution
grdgamma <- function(x, shape, rate) #individual contribution
  c(log(x)+log(rate)-digamma(shape), shape/rate-x)
grlnlgamma <- function(par, obs, ...) #total grad loglik
{
  n <- length(obs)
  res <- grdgamma(obs, shape=par[1], rate=par[2])
  c(-sum(res[1:n]), -sum(res[1:n+n]))
}
fitdistrplus/R/fitdist.R0000644000176200001440000003346414067266732015022 0ustar liggesusers#############################################################################
#   Copyright (c) 2009 Marie Laure Delignette-Muller, Regis Pouillot, Jean-Baptiste Denis, Christophe Dutang
#
#   This program is free software; you can redistribute it and/or modify
#   it under the terms of the GNU General Public License as published by
#   the Free Software Foundation; either version 2 of the License, or
#   (at your option) any later version.
#
#   This program is distributed in the hope that it will be useful,
#   but WITHOUT ANY WARRANTY; without even the implied warranty of
#   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
#   GNU General Public License for more details.
#
#   You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the
#   Free Software Foundation, Inc.,
#   59 Temple Place, Suite 330, Boston, MA 02111-1307, USA
#
#############################################################################
### fit parametric distributions for non-censored data
###
###         R functions
###

fitdist <- function (data, distr, method = c("mle", "mme", "qme", "mge", "mse"),
                     start=NULL, fix.arg=NULL, discrete, keepdata = TRUE, keepdata.nb=100, ...)
{
  #check argument distr
  if (!is.character(distr))
    distname <- substring(as.character(match.call()$distr), 2)
  else
    distname <- distr
  ddistname <- paste("d", distname, sep="")
  if (!exists(ddistname, mode="function"))
    stop(paste("The ", ddistname, " function must be defined"))
  #pdistname <- paste("p", distname, sep="")
  #if (!exists(pdistname, mode="function"))
  #  stop(paste("The ", pdistname, " function must be defined"))
  
  #check argument discrete
  if(missing(discrete))
  {
    if (is.element(distname, c("binom", "nbinom", "geom", "hyper", "pois")))
      discrete <- TRUE
    else
      discrete <- FALSE
  }
  if(!is.logical(discrete))
    stop("wrong argument 'discrete'.")
  if(!is.logical(keepdata) || !is.numeric(keepdata.nb) || keepdata.nb < 2)
    stop("wrong arguments 'keepdata' and 'keepdata.nb'")
  
  #check argument method
  if(any(method == "mom"))
    warning("the name \"mom\" for matching moments is no longer used and is replaced by \"mme\"")
  method <- match.arg(method, c("mle", "mme", "qme", "mge", "mse"))
  
  if(method %in% c("mle", "mme", "mge", "mse"))
    dpq2test <- c("d", "p")
  else
    dpq2test <- c("d", "p", "q")
  
  #check argument data
  if (!(is.vector(data) & is.numeric(data) & length(data)>1))
    stop("data must be a numeric vector of length greater than 1")
  
  #encapsulate three dots arguments
  my3dots <- list(...)
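  # Hedged usage sketch of the five estimation methods (simulated data; kept
  # commented out so that sourcing this file is unaffected):
  # if (FALSE) {
  #   x <- rnorm(100, mean = 5, sd = 2)
  #   fitdist(x, "norm")                                          # MLE (default)
  #   fitdist(x, "norm", method = "mme")                          # moment matching
  #   fitdist(x, "norm", method = "qme", probs = c(0.25, 0.75))   # quantile matching
  #   fitdist(x, "norm", method = "mge", gof = "CvM")             # maximum goodness-of-fit
  #   fitdist(x, "norm", method = "mse")                          # maximum spacing
  # }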
  if (length(my3dots) == 0)
    my3dots <- NULL
  
  n <- length(data)
  
  # manage starting/fixed values: may raise errors or return two named list
  arg_startfix <- manageparam(start.arg=start, fix.arg=fix.arg, obs=data, distname=distname)
  
  #check inconsistent parameters
  argddistname <- names(formals(ddistname))
  hasnodefaultval <- sapply(formals(ddistname), is.name)
  arg_startfix <- checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg,
                                 argddistname, hasnodefaultval)
  #arg_startfix contains two named lists (no longer NULL nor function)
  
  #store fix.arg.fun if supplied by the user
  if(is.function(fix.arg))
    fix.arg.fun <- fix.arg
  else
    fix.arg.fun <- NULL
  
  # check d, p, q, functions of distname
  resdpq <- testdpqfun(distname, dpq2test, start.arg=arg_startfix$start.arg,
                       fix.arg=arg_startfix$fix.arg, discrete=discrete)
  if(any(!resdpq$ok))
  {
    for(x in resdpq[!resdpq$ok, "txt"])
      warning(x)
  }
  
  # Fit with mledist, qmedist, mgedist or mmedist
  if (method == "mme")
  {
    if (!is.element(distname, c("norm", "lnorm", "pois", "exp", "gamma", "nbinom", "geom",
                                "beta", "unif", "logis")))
      if (!"order" %in% names(my3dots))
        stop("moment matching estimation needs an 'order' argument")
    
    mme <- mmedist(data, distname, start=arg_startfix$start.arg, fix.arg=arg_startfix$fix.arg,
                   checkstartfix=TRUE, ...)
    
    sd <- NULL
    correl <- varcovar <- NULL
    
    estimate <- mme$estimate
    loglik <- mme$loglik
    npar <- length(estimate)
    aic <- -2*loglik+2*npar
    bic <- -2*loglik+log(n)*npar
    convergence <- mme$convergence
    fix.arg <- mme$fix.arg
    weights <- mme$weights
  }else if (method == "mle")
  {
    mle <- mledist(data, distname, start=arg_startfix$start.arg, fix.arg=arg_startfix$fix.arg,
                   checkstartfix=TRUE, ...)
    if (mle$convergence>0)
      stop("the function mle failed to estimate the parameters, with the error code ",
           mle$convergence, "\n")
    estimate <- mle$estimate
    if(!is.null(mle$hessian)){
      #check for NA values and invertible Hessian
      if(all(!is.na(mle$hessian)) && qr(mle$hessian)$rank == NCOL(mle$hessian)){
        varcovar <- solve(mle$hessian)
        sd <- sqrt(diag(varcovar))
        correl <- cov2cor(varcovar)
      }else{
        varcovar <- NA
        sd <- NA
        correl <- NA
      }
    }else{
      varcovar <- NA
      sd <- NA
      correl <- NA
    }
    loglik <- mle$loglik
    npar <- length(estimate)
    aic <- -2*loglik+2*npar
    bic <- -2*loglik+log(n)*npar
    convergence <- mle$convergence
    fix.arg <- mle$fix.arg
    weights <- mle$weights
  }else if (method == "qme")
  {
    if (!"probs" %in% names(my3dots))
      stop("quantile matching estimation needs a 'probs' argument")
    
    qme <- qmedist(data, distname, start=arg_startfix$start.arg, fix.arg=arg_startfix$fix.arg,
                   checkstartfix=TRUE, ...)
    
    estimate <- qme$estimate
    sd <- NULL
    loglik <- qme$loglik
    npar <- length(estimate)
    aic <- -2*loglik+2*npar
    bic <- -2*loglik+log(n)*npar
    correl <- varcovar <- NULL
    convergence <- qme$convergence
    fix.arg <- qme$fix.arg
    weights <- qme$weights
  }else if (method == "mge")
  {
    if (!"gof" %in% names(my3dots))
      warning("maximum GOF estimation has a default 'gof' argument set to 'CvM'")
    
    mge <- mgedist(data, distname, start=arg_startfix$start.arg, fix.arg=arg_startfix$fix.arg,
                   checkstartfix=TRUE, ...)
    
    estimate <- mge$estimate
    sd <- NULL
    loglik <- mge$loglik
    npar <- length(estimate)
    aic <- -2*loglik+2*npar
    bic <- -2*loglik+log(n)*npar
    correl <- varcovar <- NULL
    convergence <- mge$convergence
    fix.arg <- mge$fix.arg
    weights <- NULL
  }else if (method == "mse")
  {
    mse <- msedist(data, distname, start=arg_startfix$start.arg, fix.arg=arg_startfix$fix.arg,
                   checkstartfix=TRUE, ...)
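    # With the default phidiv = "KL" (see msedist.R), this is classical
    # maximum spacing estimation; other phi-divergences of the spacings can be
    # requested through the ... arguments (phidiv, power.phidiv).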
    estimate <- mse$estimate
    sd <- NULL
    loglik <- mse$loglik
    npar <- length(estimate)
    aic <- -2*loglik+2*npar
    bic <- -2*loglik+log(n)*npar
    correl <- varcovar <- NULL
    convergence <- mse$convergence
    fix.arg <- mse$fix.arg
    weights <- mse$weights
  }else
  {
    stop("match.arg() did not work correctly")
  }
  
  #needed for bootstrap
  if (!is.null(fix.arg))
    fix.arg <- as.list(fix.arg)
  
  if(keepdata)
  {
    reslist <- list(estimate = estimate, method = method, sd = sd, cor = correl,
                    vcov = varcovar, loglik = loglik, aic=aic, bic=bic, n = n, data=data,
                    distname = distname, fix.arg = fix.arg, fix.arg.fun = fix.arg.fun,
                    dots = my3dots, convergence = convergence, discrete = discrete,
                    weights = weights)
  }else #just keep a sample set of all observations
  {
    n2keep <- min(keepdata.nb, n)-2
    imin <- which.min(data)
    imax <- which.max(data)
    subdata <- data[sample((1:n)[-c(imin, imax)], size=n2keep, replace=FALSE)]
    subdata <- c(subdata, data[c(imin, imax)])
    reslist <- list(estimate = estimate, method = method, sd = sd, cor = correl,
                    vcov = varcovar, loglik = loglik, aic=aic, bic=bic, n = n, data=subdata,
                    distname = distname, fix.arg = fix.arg, fix.arg.fun = fix.arg.fun,
                    dots = my3dots, convergence = convergence, discrete = discrete,
                    weights = weights)
  }
  
  return(structure(reslist, class = "fitdist"))
}

print.fitdist <- function(x, ...)
{
  if (!inherits(x, "fitdist"))
    stop("Use only with 'fitdist' objects")
  if (x$method=="mme")
    cat("Fitting of the distribution '", x$distname, "' by matching moments \n")
  else if (x$method=="mle")
    cat("Fitting of the distribution '", x$distname, "' by maximum likelihood \n")
  else if (x$method=="qme")
    cat("Fitting of the distribution '", x$distname, "' by matching quantiles \n")
  else if (x$method=="mge")
    cat("Fitting of the distribution '", x$distname, "' by maximum goodness-of-fit \n")
  else if (x$method=="mse")
    cat("Fitting of the distribution '", x$distname, "' by maximum spacing estimation \n")
  
  cat("Parameters:\n")
  if (x$method=="mle")
    print(cbind.data.frame("estimate" = x$estimate, "Std. Error" = x$sd), ...)
  else
    print(cbind.data.frame("estimate" = x$estimate), ...)
  
  if(!is.null(x$fix.arg))
  {
    if(is.null(x$fix.arg.fun))
    {
      cat("Fixed parameters:\n")
    }else
    {
      cat("Fixed parameters (computed by a user-supplied function):\n")
    }
    print(cbind.data.frame("value" = unlist(x$fix.arg)), ...)
  }
}

plot.fitdist <- function(x, breaks="default", ...)
{
  if (!inherits(x, "fitdist"))
    stop("Use only with 'fitdist' objects")
  if(!is.null(x$weights))
    stop("The plot of the fit is not yet available when using weights")
  if(!is.null(x$data))
    plotdist(data=x$data, distr=x$distname,
             para=c(as.list(x$estimate), as.list(x$fix.arg)), breaks=breaks,
             discrete = x$discrete, ...)
}

summary.fitdist <- function(object, ...)
{
  if (!inherits(object, "fitdist"))
    stop("Use only with 'fitdist' objects")
  object$ddistname <- paste("d", object$distname, sep="")
  object$pdistname <- paste("p", object$distname, sep="")
  object$qdistname <- paste("q", object$distname, sep="")
  class(object) <- c("summary.fitdist", class(object))
  object
}

print.summary.fitdist <- function(x, ...)
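### Hedged sketch of what gets printed (simulated data; kept commented out):
# if (FALSE) {
#   summary(fitdist(rnorm(100), "norm"))
#   # prints the fitting method, the estimates with their standard errors,
#   # any fixed parameters, then Loglikelihood / AIC / BIC and, for MLE fits
#   # with several parameters, the correlation matrix of the estimates.
# }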
{ if (!inherits(x, "summary.fitdist")) stop("Use only with 'summary.fitdist' objects") ddistname <- x$ddistname pdistname <- x$pdistname if (x$method=="mme") cat("Fitting of the distribution '", x$distname, "' by matching moments \n") else if (x$method=="mle") cat("Fitting of the distribution '", x$distname, "' by maximum likelihood \n") else if (x$method=="qme") cat("Fitting of the distribution '", x$distname, "' by matching quantiles \n") else if (x$method=="mge") cat("Fitting of the distribution '", x$distname, "' by maximum goodness-of-fit \n") cat("Parameters : \n") if (x$method == "mle") print(cbind.data.frame("estimate" = x$estimate, "Std. Error" = x$sd), ...) else print(cbind.data.frame("estimate" = x$estimate), ...) if(!is.null(x$fix.arg)) { if(is.null(x$fix.arg.fun)) { cat("Fixed parameters:\n") }else { cat("Fixed parameters (computed by a user-supplied function):\n") } print(cbind.data.frame("value" = unlist(x$fix.arg)), ...) } cat("Loglikelihood: ", x$loglik, " ") cat("AIC: ", x$aic, " ") cat("BIC: ", x$bic, "\n") if (x$method=="mle") { if (length(x$estimate) > 1) { cat("Correlation matrix:\n") print(x$cor) cat("\n") } } invisible(x) } #see quantiles.R for quantile.fitdist #see logLik.R for loglik.fitdist #see vcov.R for vcov.fitdist #see coef.R for coef.fitdist fitdistrplus/R/util-npsurv-hcnm.R0000644000176200001440000002036513742313702016575 0ustar liggesusers# ----------------------------------------------------------------------- # # Nonparametric maximum likelihood estimation from interval-censored data # # ----------------------------------------------------------------------- # # Internal optimi functions for Hierarchical constrained Newton Method # # ----------------------------------------------------------------------- # # Original code from Yong Wang, 2020 # # ----------------------------------------------------------------------- # ## ========================================================================== ## Hierarchical CNM: a variant of the Constrained Newton Method for finding ## the NPMLE survival function of a data set containing interval censoring. ## This is a new method to build on those in the Icens and MLEcens ## packages. It uses the idea of block subsets of the S matrix to move ## probability mass among blocks of candidate support intervals. ## ## Usage (parameters and return value) is similar to the methods in package ## Icens, although note the transposed clique matrix. ## ## Arguments: ## data: Data ## w: Weights ## D: Clique matrix, n*m (note, transposed c.f. Icens::EMICM, ## MLEcens::reduc). The clique matrix may contain conditional ## probabilities rather than just membership flags, for use in HCNM ## recursively calling itself. ## p0: Vector (length m) of initial estimates for the probabilities of ## the support intervals. ## maxit: Maximum number of iterations to perform ## tol: Tolerance for the stopping condition (in log-likelihood value) ## blockpar: ## NA or NULL means choose a value based on the data (using n and r) ## ==0 means same as cnm (don't do blocks) ## <1 means nblocks is this power of sj, e.g. 0.5 for sqrt ## >1 means exactly this block size (e.g. 40) ## recurs.maxit: For internal use only: maximum number of iterations in ## recursive calls ## depth: For internal use only: depth of recursion ## verb: For internal use only: depth of recursion ## pkg: package used in NNLS_constrSum() ## Author: Stephen S. Taylor and Yong Wang ## Reference: Wang, Y. and Taylor, S. M. (2013). 
Efficient computation of ## nonparametric survival functions via a hierarchical mixture ## formulation. Statistics and Computing, 23, 713-725. ## ========================================================================== hcnm = function(data, w=1, D=NULL, p0=NULL, maxit=100, tol=1e-6, blockpar=NULL, recurs.maxit=2, depth=1, verb=0, pkg="stats", ...) { if(missing(D)) { x2 = icendata(data, w) #see npsurv-intercens.R if(nrow(x2$o) == 0 || all(x2$o[,2] == Inf)) { # exact or right-censored only r0 = km(x2) r = list(f=r0$f, convergence=TRUE, ll=r0$ll, maxgrad=0, numiter=1) class(r) = "npsurv" return(r) } x = rbind(cbind(x2$t, x2$t), x2$o) nx = nrow(x) w = c(x2$wt, x2$wo) dmat = Deltamatrix(x) #see npsurv-intercens.R left = dmat$left right = dmat$right intervals = cbind(left, right) D = dmat$Delta } else { if (missing(p0)) stop("Must provide 'p0' with D.") if (!missing(data)) warning("D and data both provided. LR ignored!") nx = nrow(D) w = rep(w, length=nx) intervals = NULL } n = sum(w) wr = sqrt(w) converge = FALSE m = ncol(D) m1 = 1:m nblocks = 1 maxdepth = depth i = rowSums(D) == 1 r = mean(i) # Proportion of exact observations if(is.null(p0)) { ## Derive an initial p vector. j = colSums(D[i,,drop=FALSE]) > 0 while(any(c(FALSE,(i <- rowSums(D[,j,drop=FALSE])==0)))) { j[which.max(colSums(D[i,,drop=FALSE]))] = TRUE } p = colSums(w * D) * j } else { if(length(p <- p0) != m) stop("Argument 'p0' is the wrong length.") } p = p / sum(p) P = drop(D %*% p) ll = sum(w * log(P)) evenstep = FALSE for(iter in 1:maxit) { p.old = p ll.old = ll S = D / P g = colSums(w * S) dmax = max(g) - n if(verb > 0) { cat("\n##### Iteration", iter, "#####\n") cat("Log-likelihood: ", signif(ll, 6), "\n") } if(verb > 1) cat("Maximum gradient: ", signif(dmax, 6), "\n") if(verb > 2) { cat("Probability vector of length ", length(p), " :\n") if(length(p) > 12) cat(head(as.numeric(p)), "\t ...\t", tail(as.numeric(p)), "\n") else cat(as.numeric(p), "\n") } j = p > 0 if(depth==1) { s = unique(c(1,m1[j],m)) if (length(s) > 1) for (l in 2:length(s)) { j[s[l-1] + which.max(g[s[l-1]:s[l]]) - 1] = TRUE } } sj = sum(j) ## BW: matrix of block weights: sj rows, nblocks columns if(is.null(blockpar) || is.na(blockpar)) ## Default blockpar based on log(sj), Equation (14) p6 of Wang & Taylor iter.blockpar = ifelse(sj < 30, 0, 1 - log(max(20,10*log(sj/100)))/log(sj)) else iter.blockpar = blockpar if(iter.blockpar==0 | sj < 30) { nblocks = 1 BW = matrix(1, nrow=sj, ncol=1) } else { nblocks = max(1, if(iter.blockpar>1) round(sj/iter.blockpar) else floor(min(sj/2, sj^iter.blockpar))) i = seq(0, nblocks, length=sj+1)[-1] if(evenstep) { nblocks = nblocks + 1 BW = outer(round(i)+1, 1:nblocks, "==") } else BW = outer(ceiling(i), 1:nblocks, "==") storage.mode(BW) = "numeric" } for(block in 1:nblocks) { jj = logical(m) jj[j] = BW[,block] > 0 sjj = sum(jj) if (sjj > 1 && (delta <- sum(p.old[jj])) > 0) { Sj = S[,jj] #original call #res = pnnls(wr * Sj, wr * drop(Sj %*% p.old[jj]) + wr, sum=delta) #new call resNNLS <- NNLS_constrSum(a=wr * Sj, b=wr * drop(Sj %*% p.old[jj]) + wr, pkg=pkg, sumtotal=delta, control=list(trace=verb, maxit=5000), ...) 
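            # The block update above solves a sum-constrained non-negative
            # least-squares subproblem: minimize
            #   || wr * (Sj %*% x - (Sj %*% p.old[jj] + 1)) ||^2
            # over x >= 0 with sum(x) = delta (cf. the original pnnls() call
            # kept in comments above).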
#see npsurv-NNLS.R if(resNNLS$convergence != 0) warning("Problem NNLS_constrSum(a,b)") xj <- resNNLS$prob if(verb > 3) {cat("Block:", block, "\t delta", delta, "\t Optimized vector by NNLS:\n"); print(xj)} p[jj] = p[jj] + BW[jj[j],block] * (xj * (delta / sum(xj)) - p.old[jj]) } } ## Maximise likelihood along the line between p and p.old p.gap = p - p.old # vector from old to new estimate ## extrapolated rise in ll, based on gradient at old estimate ll.rise.gap = sum(g * p.gap) alpha = 1 p.alpha = p ll.rise.alpha = ll.rise.gap repeat { P = drop(D %*% p.alpha) ll = sum(w * log(P)) if(ll >= ll.old && ll + ll.rise.alpha <= ll.old) { p = p.alpha # flat land reached converge = TRUE break } if(ll > ll.old && ll >= ll.old + ll.rise.alpha * .33) { p = p.alpha # Normal situation: new ll is higher break } if((alpha <- alpha * 0.5) < 1e-10) { p = p.old P = drop(D %*% p) ll = ll.old converge = TRUE break } p.alpha = p.old + alpha * p.gap ll.rise.alpha = alpha * ll.rise.gap } if(converge) break if (nblocks > 1) { ## Now jiggle p around among the blocks Q = sweep(BW,1,p[j],"*") # Matrix of weighted probabilities: [sj,nblocks] q = colSums(Q) # its column sums (total in each block) ## Now Q is n*nblocks Matrix of probabilities for mixture components Q = sweep(D[,j] %*% Q, 2, q, "/") if (any(q == 0)) { warning("A block has zero probability!") } else { if(verb >= 4) cat("recursion\n") ## Recursively call HCNM to allocate probability among the blocks res = hcnm(w=w, D=Q, p0=q, blockpar=iter.blockpar, maxit=recurs.maxit, recurs.maxit=recurs.maxit, depth=depth+1, pkg=pkg, ...) maxdepth = max(maxdepth, res$maxdepth) if (res$ll > ll) { p[j] = p[j] * (BW %*% (res$pf / q)) P = drop(D %*% p) ll = sum(w * log(P)) # should match res$lval } } } if(iter > 2) if( ll <= ll.old + tol ) {converge=TRUE; break} evenstep = !evenstep } list(pf=p, intervals=intervals, convergence=converge, method="hcnm", ll=ll, maxgrad=max(crossprod(w/P, D))-n, numiter=iter) } fitdistrplus/R/logLik-surface.R0000644000176200001440000003700013742313702016203 0ustar liggesusers############################################################################# # Copyright (c) 2015 Christophe Dutang and Marie Laure Delignette-Muller # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### (log)-likelihood surface/line ### ### R functions ### llplot <- function(mlefit, loglik = TRUE, expansion = 1, lseq = 50, back.col = TRUE, nlev = 10, pal.col = terrain.colors(100), fit.show = FALSE, fit.pch = 4, ...) 
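### Hedged usage sketch (simulated data; kept commented out). With more than
### two estimated parameters, one surface is drawn for each pair of
### parameters, the remaining ones being fixed at their estimated values.
# if (FALSE) {
#   f <- fitdist(rgamma(50, shape = 2, rate = 1), "gamma")
#   llplot(f)                  # log-likelihood surface around the MLE
#   llplot(f, expansion = 2)   # explore a wider parameter range
# }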
{ if (!inherits(mlefit, c("fitdist", "fitdistcens"))) stop("Use only with 'fitdist' or 'fitdistcens' objects") if(inherits(mlefit, "fitdist")) { if(mlefit$method !="mle") stop("This plot is only available for distribution fits using maximum likelihood") data <- mlefit$data } else # censored data { data <- mlefit$censdata } distr <- mlefit$distname np <- length(mlefit$estimate) if (np == 1) { estim.value <- mlefit$estimate estim.sd <- mlefit$sd plot.arg <- names(mlefit$estimate) fix.arg <- mlefit$fix.arg llcurve(data, distr, plot.arg = plot.arg, min.arg = estim.value - estim.sd * 2 *expansion, max.arg = estim.value + estim.sd * 2 *expansion, lseq = lseq, fix.arg = fix.arg, loglik = loglik, weights = mlefit$weights, ...) if (fit.show) points(estim.value, ifelse(loglik, mlefit$loglik, exp(mlefit$loglik)), pch = fit.pch) } else # so if np > 1 if (np == 2) { estim.value <- mlefit$estimate estim.sd <- mlefit$sd plot.arg <- names(mlefit$estimate) fix.arg <- mlefit$fix.arg llsurface(data, distr, plot.arg = plot.arg, min.arg = estim.value - estim.sd * 2 *expansion, max.arg = estim.value + estim.sd * 2 *expansion, lseq = lseq, fix.arg = fix.arg, loglik = loglik, back.col = back.col, nlev = nlev, pal.col = pal.col, weights = mlefit$weights, ...) if (fit.show) points(estim.value[1], estim.value[2], pch = fit.pch) } else # so if np > 2 { def.par <- par(no.readonly = TRUE) ncombi <- choose(np, 2) lay <- lower.tri(matrix(0, (np - 1), (np - 1)), TRUE) lay[which(lay, TRUE)] <- 1:ncombi layout(lay) par(mar = c(5, 4, 0.2, 0.2)) for (i in 1:(np - 1)) for (j in (i+1):np) { plot.arg <- names(mlefit$estimate)[c(i, j)] estim.value <- mlefit$estimate[c(i, j)] estim.sd <- mlefit$sd[c(i, j)] fix.arg <- c(mlefit$fix.arg, as.list(mlefit$estimate[-c(i,j)])) llsurface(data, distr, plot.arg = plot.arg, min.arg = estim.value - estim.sd * 2 *expansion, max.arg = estim.value + estim.sd * 2 *expansion, lseq = lseq, fix.arg = fix.arg, loglik = loglik, back.col = back.col, nlev = nlev, pal.col = pal.col, weights = mlefit$weights, ...) if (fit.show) points(estim.value[1], estim.value[2], pch = fit.pch) } par(def.par) } invisible() } llsurface <- function(data, distr, plot.arg, min.arg, max.arg, lseq = 50, fix.arg = NULL, loglik = TRUE, back.col = TRUE, nlev = 10, pal.col = terrain.colors(100), weights = NULL, ...) 
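### Hedged usage sketch (simulated data and arbitrary bounds; kept commented out):
# if (FALSE) {
#   x <- rnorm(100, mean = 5, sd = 2)
#   llsurface(x, "norm", plot.arg = c("mean", "sd"),
#             min.arg = c(4, 1.5), max.arg = c(6, 2.5), back.col = TRUE)
# }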
{ stopifnot(is.vector(plot.arg) || length(plot.arg) == 2) stopifnot(is.list(fix.arg) || is.null(fix.arg)) if(!is.null(weights)) { if(any(weights < 0)) stop("weights should be a vector of numerics greater than 0") if(length(weights) != NROW(data)) stop("weights should be a vector with a length equal to the observation number") } else { weights <- rep(1, NROW(data)) } if (is.vector(data)) { cens <- FALSE } else { cens <- TRUE # Definition of datasets lcens (left censored)=vector, rcens (right censored)= vector, # icens (interval censored) = dataframe with left and right # and ncens (not censored) = vector censdata <- data irow.lcens <- is.na(censdata$left) # rows corresponding to lcens data lcens <- censdata[irow.lcens, ]$right if (any(is.na(lcens)) ) stop("An observation cannot be both right and left censored, coded with two NA values") irow.rcens <- is.na(censdata$right) # rows corresponding to rcens data rcens <- censdata[irow.rcens, ]$left irow.ncens <- censdata$left==censdata$right & !is.na(censdata$left) & !is.na(censdata$right) # rows corresponding to ncens data ncens<-censdata[irow.ncens, ]$left irow.icens <- censdata$left!=censdata$right & !is.na(censdata$left) & !is.na(censdata$right) # rows corresponding to icens data icens<-censdata[irow.icens, ] } #get distrib name if (!is.character(distr)) stop("distr must be a character string naming a distribution") else distname <- distr ddistname <- paste("d", distname, sep="") if (!exists(ddistname, mode="function")) stop(paste("The ", ddistname, " function must be defined")) if (cens) { pdistname <- paste("p", distname, sep="") if (!exists(pdistname, mode="function")) stop(paste("The ", pdistname, " function must be defined")) } #sanity check for argument names argdistname <- names(formals(ddistname)) m <- match(plot.arg, argdistname) if (any(is.na(m))) #check unexpected names stop("'plot.arg' must specify names which are arguments to 'distr'.") m <- match(names(fix.arg), argdistname) if (any(is.na(m))) #check unexpected names stop("'fix.arg' must specify names which are arguments to 'distr'.") #function to plot if (!cens) { if(loglik) { f2plot <- function(x, y) { par <- list(x,y) names(par) <- plot.arg loglikelihood(par, fix.arg = fix.arg, obs = data, ddistnam = ddistname, weights = weights) } }else { f2plot <- function(x, y) { par <- list(x,y) names(par) <- plot.arg likelihood(par, fix.arg = fix.arg, obs= data, ddistnam = ddistname, weights = weights) } } } else # for censored data { if(loglik) { f2plot <- function(x, y) { par <- list(x,y) names(par) <- plot.arg loglikelihoodcens(par, fix.arg = fix.arg, rcens = rcens, lcens = lcens, icens = icens, ncens = ncens, ddistnam = ddistname, pdistnam = pdistname, weights = weights, irow.ncens = irow.ncens, irow.lcens = irow.lcens, irow.rcens = irow.rcens, irow.icens = irow.icens) } }else { f2plot <- function(x, y) { par <- list(x,y) names(par) <- plot.arg likelihoodcens(par, fix.arg = fix.arg, rcens = rcens, lcens = lcens, icens = icens, ncens = ncens, ddistnam = ddistname, pdistnam = pdistname, weights = weights, irow.ncens = irow.ncens, irow.lcens = irow.lcens, irow.rcens = irow.rcens, irow.icens = irow.icens) } } } #create x, y and z matrix. p1 <- seq(min.arg[1], max.arg[1], length=lseq) p2 <- seq(min.arg[2], max.arg[2], length=lseq) z <- outer(p1, p2, Vectorize(f2plot, c("x","y"))) # vectorize is necessary to vectorize the function f2plot if (back.col) { image(p1, p2, z, col = pal.col, xlab = plot.arg[1], ylab = plot.arg[2], ...) 
if (nlev > 0) contour(p1, p2, z, nlevels = nlev, add = TRUE, ...) } else { contour(p1, p2, z, nlevels = nlev, xlab = plot.arg[1], ylab = plot.arg[2], ...) } invisible() } llcurve <- function(data, distr, plot.arg, min.arg, max.arg, lseq = 50, fix.arg = NULL, loglik = TRUE, weights = NULL, ...) { stopifnot(is.vector(plot.arg) || length(plot.arg) == 1) stopifnot(is.list(fix.arg) || is.null(fix.arg)) if(!is.null(weights)) { if(any(weights < 0)) stop("weights should be a vector of numerics greater than 0") if(length(weights) != NROW(data)) stop("weights should be a vector with a length equal to the observation number") } else { weights <- rep(1, NROW(data)) } if (is.vector(data)) { cens <- FALSE } else { cens <- TRUE # Definition of datasets lcens (left censored)=vector, rcens (right censored)= vector, # icens (interval censored) = dataframe with left and right # and ncens (not censored) = vector censdata <- data irow.lcens <- is.na(censdata$left) # rows corresponding to lcens data lcens <- censdata[irow.lcens, ]$right if (any(is.na(lcens)) ) stop("An observation cannot be both right and left censored, coded with two NA values") irow.rcens <- is.na(censdata$right) # rows corresponding to rcens data rcens <- censdata[irow.rcens, ]$left irow.ncens <- censdata$left==censdata$right & !is.na(censdata$left) & !is.na(censdata$right) # rows corresponding to ncens data ncens<-censdata[irow.ncens, ]$left irow.icens <- censdata$left!=censdata$right & !is.na(censdata$left) & !is.na(censdata$right) # rows corresponding to icens data icens<-censdata[irow.icens, ] } if (!is.character(distr)) stop("distr must be a character string naming a distribution") else distname <- distr ddistname <- paste("d", distname, sep="") if (!exists(ddistname, mode="function")) stop(paste("The ", ddistname, " function must be defined")) if (cens) { pdistname <- paste("p", distname, sep="") if (!exists(pdistname, mode="function")) stop(paste("The ", pdistname, " function must be defined")) } #sanity check for argument names argdistname <- names(formals(ddistname)) m <- match(plot.arg, argdistname) if (any(is.na(m))) #check unexpected names stop("'plot.arg' must specify names which are arguments to 'distr'.") m <- match(names(fix.arg), argdistname) if (any(is.na(m))) #check unexpected names stop("'fix.arg' must specify names which are arguments to 'distr'.") if (!cens) { #function to plot if(loglik) { f2plot <- function(x) { par <- list(x) names(par) <- plot.arg loglikelihood(par, fix.arg=fix.arg, obs = data, ddistnam = ddistname, weights = weights) } }else { f2plot <- function(x) { par <- list(x) names(par) <- plot.arg likelihood(par, fix.arg=fix.arg, obs = data, ddistnam = ddistname, weights = weights) } } } else # for censored data { if(loglik) { f2plot <- function(x) { par <- list(x) names(par) <- plot.arg loglikelihoodcens(par, fix.arg = fix.arg, rcens = rcens, lcens = lcens, icens = icens, ncens = ncens, ddistnam = ddistname, pdistnam = pdistname, weights = weights, irow.ncens = irow.ncens, irow.lcens = irow.lcens, irow.rcens = irow.rcens, irow.icens = irow.icens) } }else { f2plot <- function(x) { par <- list(x) names(par) <- plot.arg likelihoodcens(par, fix.arg = fix.arg, rcens = rcens, lcens = lcens, icens = icens, ncens = ncens, ddistnam = ddistname, pdistnam = pdistname, weights = weights, irow.ncens = irow.ncens, irow.lcens = irow.lcens, irow.rcens = irow.rcens, irow.icens = irow.icens) } } } #create x, y matrix. 
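  # Each value in p1 is mapped through f2plot(), i.e. the (log)likelihood of
  # the data with the plotted parameter set to that value and the remaining
  # parameters held at fix.arg.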
p1 <- seq(min.arg[1], max.arg[1], length = lseq) y <- sapply(p1, function(x) f2plot(x)) plot(p1, y, type="l", xlab = plot.arg, ylab = ifelse(loglik, "loglikelihood", "likelihood"), ...) invisible() } #local definition of loglikelihood loglikelihood <- function(par, fix.arg, obs, ddistnam, weights) sum(weights * log(do.call(ddistnam, c(list(obs), as.list(par), as.list(fix.arg)) ) ) ) #local definition of likelihood likelihood <- function(par, fix.arg, obs, ddistnam, weights) prod(do.call(ddistnam, c(list(obs), as.list(par), as.list(fix.arg)) )^weights ) #local definition of loglikelihood for censored data loglikelihoodcens <- function(par, fix.arg, rcens, lcens, icens, ncens, ddistnam, pdistnam, weights, irow.ncens, irow.lcens, irow.rcens, irow.icens) { p1 <- log(do.call(ddistnam, c(list(ncens), as.list(par), as.list(fix.arg)))) p2 <- log(do.call(pdistnam, c(list(lcens), as.list(par), as.list(fix.arg)))) p3 <- log(1-do.call(pdistnam, c(list(rcens), as.list(par), as.list(fix.arg)))) p4 <- log(do.call(pdistnam, c(list(icens$right), as.list(par), as.list(fix.arg))) - do.call(pdistnam, c(list(icens$left), as.list(par), as.list(fix.arg))) ) sum(weights[irow.ncens] * p1) + sum(weights[irow.lcens] * p2) + sum(weights[irow.rcens] * p3) + sum(weights[irow.icens] * p4) } #local definition of likelihood for censored data likelihoodcens <- function(par, fix.arg, rcens, lcens, icens, ncens, ddistnam, pdistnam, weights, irow.ncens, irow.lcens, irow.rcens, irow.icens) { p1 <- do.call(ddistnam, c(list(ncens), as.list(par), as.list(fix.arg))) p2 <- do.call(pdistnam, c(list(lcens), as.list(par), as.list(fix.arg))) p3 <- 1-do.call(pdistnam, c(list(rcens), as.list(par), as.list(fix.arg))) p4 <- do.call(pdistnam, c(list(icens$right), as.list(par), as.list(fix.arg))) - do.call(pdistnam, c(list(icens$left), as.list(par), as.list(fix.arg))) prod(p1^weights[irow.ncens]) * prod(p2^weights[irow.lcens]) * prod(p3^weights[irow.rcens]) * prod(p4^weights[irow.icens]) } fitdistrplus/R/qqcompcens.R0000644000176200001440000002520213742313702015506 0ustar liggesusers############################################################################# # Copyright (c) 2018 Marie Laure Delignette-Muller, Christophe Dutang, # Aurelie Siberchicot # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the # Free Software Foundation, Inc., # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA # ############################################################################# ### QQ plot for various fits ### of continuous distribution(s) (fitdistcens results) ### on a same dataset ### ### R functions ### qqcompcens <- function(ft, xlim, ylim, xlogscale = FALSE, ylogscale = FALSE, main, xlab, ylab, fillrect, fitcol, fitlwd, addlegend = TRUE, legendtext, xlegend = "bottomright", ylegend = NULL, line01 = TRUE, line01col = "black", line01lty = 1, ynoise = TRUE, NPMLE.method = "Wang", plotstyle = "graphics", ...) 
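### Hedged usage sketch (a small hypothetical censored dataset with 'left' and
### 'right' columns, NA coding an open bound; kept commented out):
# if (FALSE) {
#   d <- data.frame(left  = c(1, NA, 3, 5, 6),
#                   right = c(2,  2, 3, NA, 8))
#   f1 <- fitdistcens(d, "lnorm")
#   f2 <- fitdistcens(d, "weibull")
#   qqcompcens(list(f1, f2), legendtext = c("lognormal", "Weibull"))
# }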
{
  if(inherits(ft, "fitdistcens"))
  {
    ft <- list(ft)
  }else if(!is.list(ft))
  {
    stop("argument ft must be a list of 'fitdistcens' objects")
  }else
  {
    if(any(sapply(ft, function(x) !inherits(x, "fitdistcens"))))
      stop("argument ft must be a list of 'fitdistcens' objects")
  }
  NPMLE.method <- match.arg(NPMLE.method, c("Wang", "Turnbull.intervals", "Turnbull.middlepoints"))
  if (NPMLE.method == "Turnbull.middlepoints")
  {
    warning("The Q-Q plot for censored data is not available with NPMLE.method set to Turnbull.middlepoints. Turnbull.intervals will be used instead.")
    NPMLE.method <- "Turnbull.intervals"
  }
  
  # check the 'plotstyle' argument
  plotstyle <- match.arg(plotstyle[1], choices = c("graphics", "ggplot"), several.ok = FALSE)
  
  # In future developments, it will be necessary to check that all the fits share the same weights
  if(!is.null(ft[[1]]$weights))
    stop("qqcompcens is not yet available when using weights")

  censdata <- ft[[1]]$censdata
  
  # check data
  verif.ftidata <- function(fti)
  {
    if (any(fti$censdata$left != censdata$left, na.rm=TRUE) | 
        any(fti$censdata$right != censdata$right, na.rm=TRUE))
      stop("All compared fits must have been obtained with the same dataset")
  }
  l <- lapply(ft, verif.ftidata)
  rm(l)
  
  if (xlogscale != ylogscale)
  {
    xlogscale <- ylogscale <- TRUE
    warning("As a Q-Q plot should use the same scale on x and y axes, both axes were put in a logarithmic scale.")
  }
  logxy <- paste(ifelse(xlogscale,"x",""), ifelse(ylogscale,"y",""), sep="")
  
  # manage default parameters
  nft <- length(ft)
  if (missing(fitcol)) fitcol <- 2:(nft+1)
  if (missing(fitlwd)) fitlwd <- 1
  fitcol <- rep(fitcol, length.out=nft)
  fitlwd <- rep(fitlwd, length.out=nft)
  if (missing(fillrect)) 
    if ((nft == 1) | plotstyle == "ggplot") fillrect <- "lightgrey" else fillrect <- NA
  
  # check legend parameters if added
  if(missing(legendtext)) 
  {
    legendtext <- sapply(ft, function(x) x$distname)
    if(length(legendtext) != length(unique(legendtext)))
      legendtext <- paste(legendtext, sapply(ft, function(x) toupper(x$method)), sep="-")
    if(length(legendtext) != length(unique(legendtext)))
      legendtext <- paste(legendtext, 1:nft, sep="-")
  }
  
  if (missing(xlab)) xlab <- "Theoretical quantiles"
  if (missing(ylab)) ylab <- "Empirical quantiles"
  if (missing(main)) main <- "Q-Q plot"
  
  # computation from censdata
  f <- npmle(censdata, method = NPMLE.method)
  
  bounds <- c(f$right, f$left)
  finitebounds <- bounds[is.finite(bounds)]
  
  if(missing(xlim) & missing(ylim))
  {
    user.defined.lim <- FALSE
    upper <- max(finitebounds)
    lower <- min(finitebounds)
    width <- upper - lower
    if (xlogscale == TRUE)
    {
      xmin <- lower * (upper / lower)^(-0.1)
      xmax <- upper * (upper / lower)^0.1
      xmininf <- lower * (upper / lower)^(-100) # 100 to be very large
      xmaxinf <- upper * (upper / lower)^100
    } else
    {
      xmin <- lower - width * 0.1
      xmax <- upper + width * 0.1
      xmininf <- lower - width * 100
      xmaxinf <- upper + width * 100
    }
    xlim <- c(xmin, xmax)
    ylim <- c(xmin, xmax)
  } else # at least xlim or ylim are specified
  {
    user.defined.lim <- TRUE
    if (missing(xlim) | missing(ylim))
    {
      warning("By default the same limits are applied to x and y axes. 
              You should specify both if you want different x and y limits")
      if (missing(xlim)) xlim <- ylim else ylim <- xlim
    }
    lower <- min(c(xlim, ylim))
    upper <- max(c(xlim, ylim))
    width <- upper - lower
    if (xlogscale == TRUE)
    {
      xmininf <- lower * (upper / lower)^(-100) # 100 to be very large
      xmaxinf <- upper * (upper / lower)^100
    } else
    {
      xmininf <- lower - width * 100
      xmaxinf <- upper + width * 100
    }
  }
  
  k <- length(f$left)
  Fnpsurv <- cumsum(f$p)
  Fbefore <- c(0, Fnpsurv[-k])
  df <- data.frame(left = f$left, right = f$right)
  
  # Definition of vertices of each rectangle
  Qi.left <- df$left # dim k
  Qi.left4plot <- Qi.left
  # when R is configured with noLD (--disable-long-double), qnorm and other 'q' functions
  # produce NaN values instead of Inf values for 0 and 1 as first argument.
  if (is.infinite(Qi.left4plot[1]) | is.nan(Qi.left4plot[1])) Qi.left4plot[1] <- xmininf
  Qi.right <- df$right
  Qi.right4plot <- Qi.right
  if (is.infinite(Qi.right4plot[k]) | is.nan(Qi.right4plot[k])) Qi.right4plot[k] <- xmaxinf
  
  # keep only 16 significant digits for R configured with noLD (--disable-long-double)
  Pi.low <- signif(Fbefore, 16)
  Pi.up <- signif(Fnpsurv, 16)
  nPi <- length(Pi.low)
  
  lrect <- vector(mode = "list", length = nft)
  theo.xmin <- xlim[1]
  theo.xmax <- xlim[2]
  for(i in 1:nft)
  {
    fti <- ft[[i]]
    para <- c(as.list(fti$estimate), as.list(fti$fix.arg))
    distname <- fti$distname
    qdistname <- paste("q", distname, sep="")
    if (is.element(distname, c("binom", "nbinom", "geom", "hyper", "pois"))) 
      warning(" Be careful, variables are considered continuous in this function!")
    
    Qitheo.left <- do.call(qdistname, c(list(Pi.low), as.list(para)))
    Qitheo.right <- do.call(qdistname, c(list(Pi.up), as.list(para)))
    theo.xmin <- min(theo.xmin, Qitheo.right[-k])
    Qitheo.left4plot <- Qitheo.left
    theo.xmax <- max(theo.xmax, Qitheo.left[-1])
    if (is.infinite(Qitheo.left4plot[1]) | is.nan(Qitheo.left4plot[1])) 
      Qitheo.left4plot[1] <- xmininf
    Qitheo.right4plot <- Qitheo.right
    if (is.infinite(Qitheo.right4plot[k]) | is.nan(Qitheo.right4plot[k])) 
      Qitheo.right4plot[k] <- xmaxinf
    lrect[[i]] <- data.frame(Qitheo.left4plot = Qitheo.left4plot, Qi.left4plot = Qi.left4plot, 
                             Qitheo.right4plot = Qitheo.right4plot, Qi.right4plot = Qi.right4plot, 
                             ind = legendtext[i])
  }
  
  # insert here a check of limits in order to enlarge xlim and ylim if needed
  # in order to be sure to visualize each interval, for all the fitted distributions
  if (!user.defined.lim)
  {
    xlim <- c(theo.xmin, theo.xmax)
    ylim <- c(theo.xmin, theo.xmax)
  }
  
  if(plotstyle == "graphics")
  {
    ######## plot if plotstyle=='graphics' ########
    # main plot
    plot(1, 1, type = "n", main = main, xlim = xlim, ylim = ylim,
         xlab = xlab, ylab = ylab, log = logxy)
    
    # plot of rectangles
    plot.fti <- function(i, ...)
    {
      Qitheo.left4plot <- lrect[[i]]$Qitheo.left4plot
      Qi.left4plot <- lrect[[i]]$Qi.left4plot
      Qitheo.right4plot <- lrect[[i]]$Qitheo.right4plot
      Qi.right4plot <- lrect[[i]]$Qi.right4plot
      if (ynoise & nft > 1)
      {
        if (xlogscale == TRUE)
        {
          noise2mult <- runif(nPi, 0.99, 1.01)
          rect(xleft = Qitheo.left4plot, ybottom = Qi.left4plot * noise2mult, 
               xright = Qitheo.right4plot, ytop = Qi.right4plot * noise2mult, 
               border = fitcol[i], col = fillrect[i], lwd = fitlwd[i])
        } else
        {
          noise2add <- runif(nPi, -width*0.01, width*0.01)
          rect(xleft = Qitheo.left4plot, ybottom = Qi.left4plot + noise2add, 
               xright = Qitheo.right4plot, ytop = Qi.right4plot + noise2add, 
               border = fitcol[i], col = fillrect[i], lwd = fitlwd[i])
        }
      } else # !ynoise
      {
        rect(xleft = Qitheo.left4plot, ybottom = Qi.left4plot, 
             xright = Qitheo.right4plot, ytop = Qi.right4plot, 
             border = fitcol[i], col = fillrect[i], lwd = fitlwd[i])
      }
    }
    s <- sapply(1:nft, plot.fti, ...)
    rm(s)
    
    if(line01) abline(0, 1, lty = line01lty, col = line01col)
    
    if (addlegend)
    {
      legend(x=xlegend, y=ylegend, bty="n", legend=legendtext, col=fitcol, lty=1, lwd=fitlwd, ...)
    }
    invisible()
    
  } else if (!requireNamespace("ggplot2", quietly = TRUE)) {
    stop("ggplot2 needed for this function to work with plotstyle = 'ggplot'. Please install it", call. = FALSE)
  } else {
    ######## plot if plotstyle=='ggplot' ########
    drect <- do.call("rbind", lrect)
    ind <- as.factor(drect$ind)
    fitcol <- rep(fitcol, table(ind))
    fitlwd <- rep(fitlwd, table(ind))
    fillrect <- if(length(fillrect) > 1) {rep(fillrect, table(ind))} else {fillrect}
    
    ggqqcompcens <- ggplot2::ggplot(drect) + 
      ggplot2::coord_cartesian(xlim = xlim, ylim = ylim) +
      ggplot2::ggtitle(main) + ggplot2::xlab(xlab) + ggplot2::ylab(ylab) +
      ggplot2::geom_rect(data=drect, mapping=ggplot2::aes_(xmin=quote(Qitheo.left4plot), xmax=quote(Qitheo.right4plot), 
                                                           ymin=quote(Qi.left4plot), ymax=quote(Qi.right4plot)), 
                         colour = fitcol, fill = fillrect, size = fitlwd, alpha=0.5) +
      ggplot2::theme_bw() +
      ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5)) +
      {if(line01) ggplot2::geom_abline(ggplot2::aes(slope = 1, intercept = 0), color = line01col, linetype = line01lty)} +
      {if(xlogscale) ggplot2::scale_x_continuous(trans='log10')} +
      {if(ylogscale) ggplot2::scale_y_continuous(trans='log10')} +
      ggplot2::facet_wrap(~ind)
    
    return(ggqqcompcens)
  }
}
fitdistrplus/R/util-checkparam.R0000644000176200001440000001124414421750641016411 0ustar liggesusers# checkparam checks that the parameters in start.arg and fix.arg are named correctly
# INPUTS 
# start.arg : starting values for optimization or the function to compute them from data
# fix.arg : fixed values of parameters or the function to compute them from data
# argdistname : parameter names of the distribution
# errtxt : error text messages
# data10 : the first ten values of data
# distname : name of the distribution
# OUTPUTS 
# a named list with components: ok (TRUE or FALSE), txt (NULL or the error message), 
# start.arg : a named list of starting values for optimization 
# or a function to compute them from data
checkparam <- function(start.arg, fix.arg, argdistname, errtxt=NULL, data10, distname)
{
  if(is.null(errtxt))
    errtxt <- list(t0="Fixed values must be either a named list or a function returning a named list.",
                   t1="Starting values must be either a named list or a function returning a named list.",
                   t2="Starting and fixed values must be either a named list or a function returning a named list.",
                   t3="'start' must specify names which are arguments to 'distr'.",
                   t4="'fix.arg' must specify names which are arguments to 'distr'.",
                   t5="A distribution parameter cannot be specified both in 'start' and 'fix.arg'.")
  #t6 = "Unknown starting values..."
  
  #before any treatment
  start.arg.was.null <- is.null(start.arg)
  
  #if clause with 4 different cases:
  #start.arg \ fix.arg | NULL | non NULL
  #        NULL        |   1  |    2
  #      non NULL      |   3  |    4
  if(is.null(start.arg) && is.null(fix.arg)) #1
  {
    #default case from fitdist, mledist,...
    start.arg <- startargdefault(data10, distr=distname)
  }else if(is.null(start.arg) && !is.null(fix.arg)) #2
  {
    #fix.arg should be a function or a named list
    if(!is.list(fix.arg) && !is.function(fix.arg))
      return(list(ok=FALSE, txt=errtxt$t0))
    #get param names
    if(is.function(fix.arg))
      namarg <- names(fix.arg(data10))
    else
      namarg <- names(fix.arg)
    start.arg <- startargdefault(data10, distr=distname) #could return "Unknown starting values..."
    start.arg <- start.arg[!names(start.arg) %in% namarg]
  }else if(!is.null(start.arg) && is.null(fix.arg)) #3
  {
    #start should be a function or a named list
    if(!is.list(start.arg) && !is.function(start.arg))
      return(list(ok=FALSE, txt=errtxt$t1))
  }else if(!is.null(start.arg) && !is.null(fix.arg)) #4
  {
    #fix.arg and start should be a function or a named list
    if( (!is.list(fix.arg) && !is.function(fix.arg)) || (!is.list(start.arg) && !is.function(start.arg)) )
      return(list(ok=FALSE, txt=errtxt$t2))
  }else
    stop("wrong implementation")
  
  #check start
  #start.arg : function() | list()
  #start.arg cannot be null because set to a named list (by startargdefault) when NULL
  if(is.function(start.arg)) #a function
  {
    start2 <- start.arg(data10)
    if(!is.list(start2) && is.null(names(start2))) #check a named list
      return(list(ok=FALSE, txt=errtxt$t3))
    vstart <- unlist(start2)
  }else #a list
    vstart <- unlist(start.arg)
  m <- match(names(vstart), argdistname)
  if (any(is.na(m))) #check unexpected names
    return(list(ok=FALSE, txt=errtxt$t3))
  
  #check fix.arg
  #fix.arg : function() | list() | NULL
  if(is.function(fix.arg)) #a function
  {
    fix.arg2 <- fix.arg(data10)
    if(!is.list(fix.arg2) && is.null(names(fix.arg2))) #check a named list
      return(list(ok=FALSE, txt=errtxt$t4))
    vfix.arg <- unlist(fix.arg2)
  }else if(is.list(fix.arg)) #a list
    vfix.arg <- unlist(fix.arg)
  else
    vfix.arg <- NULL
  mfix <- match(names(vfix.arg), argdistname)
  if (any(is.na(mfix))) #check unexpected names
    return(list(ok=FALSE, txt=errtxt$t4))
  
  # check that some parameters are not both in fix.arg and start
  minter <- match(names(vstart), names(vfix.arg))
  if (any(!is.na(minter)))
    return(list(ok=FALSE, txt=errtxt$t5))
  
  #prepare the start.arg for outputs, i.e. when start.arg=NULL,
  # returns startargdefault if no fixed param
  # returns a subset of startargdefault if fixed param
  if(start.arg.was.null && is.null(fix.arg))
    start.arg <- function(x) startargdefault(x, distr=distname) #could return "Unknown starting values..."
  else if(start.arg.was.null && !is.null(fix.arg))
  {
    if(is.function(fix.arg))
      namarg <- names(fix.arg(data10))
    else
      namarg <- names(fix.arg)
    start.arg <- function(x){
      start.arg <- startargdefault(x, distr=distname) #could return "Unknown starting values..."
      start.arg[!names(start.arg) %in% namarg]
    }
  }
  #otherwise start.arg is a named list or a function
  
  return(list(ok=TRUE, txt=NULL, start.arg=start.arg))
}
fitdistrplus/NEWS.md0000644000176200001440000003375214406571077014121 0ustar liggesusers# fitdistrplus 1.1-9

BUG FIX

- the use of some deprecated ggplot2 functions is updated
- the use of some deprecated BibTeX entries is updated
- bug fixed in drawing CI lines in CIcdfplot when ggplot2 is called
- bug fixed in drawing horizontal lines in cdfcompcens

NEW FEATURES

- add a 'print' argument in the 'descdist' function to allow plotting only the skewness-kurtosis graph, without printing the computed parameters

# fitdistrplus 1.1-8

WARNING FIX

- update a URL in fitdistrplus.Rd from \href{https://doi.org/10.18637/jss.v064.i04}{} to \doi{doi.org/10.18637/jss.v064.i04}
- replace 'if(class(x) == XX)' by 'if(inherits(x, XX))'
- replace all 'dontrun' tags by 'donttest' in examples in rd files

BUG FIX

- fix an error in t-detectbound.R producing "failure: length > 1 in coercion to logical" reported by Brian Ripley

# fitdistrplus 1.1-6

NEW FEATURES

- new function Surv2fitdistcens() to format data for use in fitdistcens() from a format used in the survival package
- new dataset fremale in order to illustrate Surv2fitdistcens()
- support the use of ggplot2 for CIcdfplot
- add the taxon names to the endosulfan dataset
- new argument name.points in cdfcomp and CIcdfplot to add labels next to points

# fitdistrplus 1.1-5

WARNING FIX

- reduce testing times in test files

# fitdistrplus 1.1-3

NEW FEATURE

- take into account fix.arg for uniform distribution

BUG FIXES

- add the loglikelihood value for uniform distribution (in mledist())
- correct usage of triple dots argument in llsurface()
- fix an error in ppcomp() and qqcomp() raised for large datasets

# fitdistrplus 1.1-1

NEW FEATURES

- add of internal functions to cope with problems of lack of maintenance of the package npsurv and remove the dependence on this package
- remove of the deprecated argument Turnbull of plotdistcens()

# fitdistrplus 1.0-14

NEW FEATURES

- add a new estimation method called maximum spacing estimation via msedist()

# fitdistrplus 1.0-13

BUG FIXES

- fix issues coming from the noLD (--disable-long-double) configuration of R

# fitdistrplus 1.0-12

BUG FIXES

- bug fixed in qmedist() and fitdistcens() which raised an error in checkparamlist().
- bug fixed in testdpqfun() which assumes the first argument of d,p,q,r functions are exactly the same as in base R.

# fitdistrplus 1.0-11

NEW FEATURES

- update the FAQ with beta(a,a).
- improve graphics for discrete distributions in denscomp().
- improve automatic naming of legends in xxxcomp().
- harmonize outputs in mledist(), qmedist(), mmedist(), mgedist(), fitdist() and fitdistcens().
- automatic test of d, p, q functions in fitdist() and raise warnings.
- improve test for starting and fixed values.
- add new default starting values for distributions in actuar.
- change of the default CDF plot for censored data, using the Wang NPMLE algorithm provided in the package npsurv (in plotdistcens() and cdfcompcens())
- add of two new goodness-of-fit plots (QQ-plot and PP-plot) for censored data (cf. plotdistcens, qqcompcens and ppcompcens).
- add of a part dedicated to censored data in the FAQ vignette.
- homogenization of xlim and ylim default definition in plotdistcens.
- Removing of the name of the first argument in calls to dpq functions in order to make the package compatible with distributions defined with a non classical name for their first argument (resp. x, q, p for d, p, q functions).
- add the possibility to change the title of the CDF plot in plotdistcens() using the argument main.
- support the use of ggplot2 for cdfcompcens, qqcompcens, ppcompcens.

BUG FIXES

- bug fixed concerning the use of gofstat with a chi squared df <=0 (error message blocking the other functions)
- bug fix in mledist() when bounds were set (so not NULL) for censored MLE
- enable a correct use of non-equidistant breaks in denscomp for the histogram when plotstyle = "ggplot", and prohibit the use of non-equidistant breaks with probability = FALSE (adding a stop in this case).

# fitdistrplus 1.0-9

- update the FAQ with linear inequality constraints.

# fitdistrplus 1.0-8

NEW FEATURES

- support the use of ggplot2 for cdfcomp, denscomp, qqcomp, ppcomp.

BUG FIXES

- correct legend for qqcomp and ppcomp on large data.
- correct weights in mmedist.
- correct the name Akaike in gofstat.
- correct the use of trueval in plot.bootdist.
- correct the vignette on truncated (inflated) distributions.

# fitdistrplus 1.0-7

NEW FEATURES

- keep the JSS vignette as a pdf.
- start the FAQ vignette and add datasets (?dataFAQ) for it.
- provide likelihood plot/surface/curve: llplot, llcurve, llsurface.
- provide parallelization of bootstrap in bootdist and bootdistcens.
- provide graphic of (e)cdf with bootstrapped confidence interval/area: CIcdfplot.
- allow the use of constrOptim() in mledist, mmedist, mgedist, qmedist functions.
- add a possible pre-fitting procedure: prefit.

BUG FIXES

- add invisible() for all graphical functions.
- bug fixed concerning the use of weights on censored data.

# fitdistrplus 1.0-6

BUG FIXES

- automatic definition of starting values for distributions "llogis" and "invweibull" is now working.

# fitdistrplus 1.0-5

NEW FEATURES

- update starting/fixing values in mledist, mmedist, mgedist, qmedist functions.
- update graphics for bootstrap procedure.
- add argument do.points in cdfcomp.
- add argument weights in mledist, qmedist, mmedist, fitdist, fitdistcens.
- add argument keepdata in fitdist, fitdistcens.
- suppress warnings/errors in fitdist(cens), bootdist(cens).

BUG FIXES

- defensive programming in plotdist, cdfcomp,...
- simplify plotting curves in cdfcomp where seq(xmin, xmax, by=1) was used.

# fitdistrplus 1.0-4

- release for the JSS publication.

# fitdistrplus 1.0-3

NEW FEATURES

- new generic functions for fitdist(cens): loglik, vcov and coef.
- vignette updated to the version of a paper accepted by the Journal of Statistical Software.
- add of an argument discrete in fitdist in order to be able to take into account non classical discrete distributions while plotting the fit with plot.fitdist or cdfcomp and while calculating goodness-of-fit statistics with gofstat (add of an example: fit of a zero-inflated Poisson distribution).
- add of an S3 class for descdist and a print method.

BUG FIXES

- fitdist can handle non-invertible Hessian matrices.

# fitdistrplus 1.0-2

NEW FEATURES

- plotdist can plot empirical density as a histogram, a density plot or both superimposed.
- a strong warning was added to the documentation of function descdist about the problematic high variance of skewness and kurtosis.

BUG FIXES

- bug fixed in bootdistcens: argument fix.arg is now correctly passed to mle.
# fitdistrplus 1.0-1

NEW FEATURES

- gofstat can handle multiple 'fitdist' objects.
- plotdist for discrete data is slightly enhanced.

# fitdistrplus 1.0-0

NEW FEATURES

- update cdfcomp and add denscomp, ppcomp and qqcomp functions.
- add of an argument Turnbull.confint to functions plotdistcens and cdfcompcens in order to draw confidence intervals on the empirical distribution only if requested.
- ppoints now used in "fitdist" for QQ plot, PP plot and cdf plot for continuous data (was used only for QQ plot in previous versions) to enable Blom type plotting position (using by default Hazen plotting position that can be changed using arguments use.ppoints and a.ppoints)
- many changes in the examples given in the reference manual.
- the vignette was removed, to be transformed into a paper that we will soon submit to a journal.
- add of four data sets: fluazinam, salinity, danishuni and danishmulti.
- add of functions to calculate quantiles of the fitted distribution, with 95 percent CI calculated by bootstrap: quantile generic function is available both for "fitdist" and "bootdist" objects and quantile generic function is available both for "fitdistcens" and "bootdistcens" objects.

BUG FIXES

- correction of the formula for the CvM test for Weibull distribution.
- elimination of CvM and AD tests for normal, lognormal and logistic distributions: formulas previously used (given by Stephens 1986) do not use exactly MLE estimates and thus results were only approximate.
- make arguments xlim and ylim functional in cdfcompcens.
- bug fix in the closed formula in mmedist for lognormal distributions.

# fitdistrplus 0.3-4

NEW FEATURES

- possibility to fix xlegend to a keyword (e.g. "bottomright") in "cdfcomp" and "cdfcompdens".
- improvement of the new vignette.

BUG FIXES

- correction of the NAMESPACE file in order to enable the correct print of a summary of a fitdistcens object (with the correlation matrix, the loglikelihood and AIC and BIC statistics).

# fitdistrplus 0.3-3

NEW FEATURES

- a new function ("cdfcompcens") to plot cumulative distributions corresponding to various fits using the same censored data set.
- add an example with a scaling problem in man pages.

# fitdistrplus 0.3-2

NEW FEATURES

- new plot of the empirical cdf curve in plotdistcens, using the Turnbull algorithm by a call to function survfit{survival}.
- new arguments to function "cdfcomp": verticals, horizontals and xlim.

# fitdistrplus 0.3-1

NEW FEATURES

- add of a draft of a new version of the vignette.

# fitdistrplus 0.3-0

NEW FEATURES

- a new function ("cdfcomp") to plot cumulative distributions corresponding to various fits using the same non censored data set.
- add of two data sets: "endosulfan" and "toxocara".

# fitdistrplus 0.2-2

BUG FIXES

- elimination of NON-ASCII characters in the vignette.

# fitdistrplus 0.2-1

NEW FEATURES

- a new fitting method was implemented for continuous distributions: the maximum goodness-of-fit estimation (function "mgedist") (for the moment only available for non censored data).

# fitdistrplus 0.1-5

NEW FEATURES

- a new goodness-of-fit statistic was added in gofstat, with corresponding test: the Cramer-von Mises distance.
- a new fitting method has been implemented: the quantile matching estimation (function "qmedist"). For the moment, only available for non censored data.
- the moment matching estimation has been extended (in function mmedist) to enable numerical matching when closed formulas are not available.
BUG FIXES

- correction of a bug inserted while adding the argument "fix.arg" which prevented the printing of the results of goodness-of-fit tests.

# fitdistrplus 0.1-4

NEW FEATURES

- a component named dots added to the list returned by fitdist and fitdistcens in order to pass optional arguments for the control of optimization in mledist to bootdist and bootdistcens. bootdist and bootdistcens changed to take into account these optional arguments if they are defined in the call to fitdist or fitdistcens.
- an argument added to fitdist, fitdistcens and mledist, named fix.arg, and giving the possibility to fix some of the distribution parameters while maximizing the likelihood. Functions bootdist, bootdistcens and gofstat were also changed in order to take this new argument into account.
- a new data file of bacterial contamination censored data extracted from Busschaert et al. 2000 and examples corresponding to analysis of this dataset.

BUG FIXES

- correction of a bug in the print and the plot of bootstrapped samples using bootdist or bootdistcens when there was only one parameter estimated by maximum likelihood.

# fitdistrplus 0.1-3

NEW FEATURES

- new data file "groundbeef" (groundbeef.rda and groundbeef.Rd) and new use of this dataset in some examples.
- new function gofstat. Goodness-of-fit statistics are no longer computed by fitdist but may be computed and printed by the use of the function gofstat. In this new function, the whole results computed are not printed: results of tests are printed only if the argument print.test==TRUE and for continuous distributions only Anderson-Darling and Kolmogorov-Smirnov statistics are printed by default (but complete results are returned by gofstat).
- modifications in descdist: three arguments were added in descdist 1/ method, to choose between unbiased estimations of standard deviation, skewness and kurtosis (default choice) and sample values. 2/ obs.col to choose the color used to plot the observed point on the graph. 3/ boot.col to choose the color used to plot the bootstrap sample of points on the graph.
- modifications in plotfit: minor changes were performed in order to facilitate the use of the argument ... to personalize the plots (see examples in plotdist.Rd)
- modification of the vignette

BUG FIXES

- correction of a bug in plotdist due to the redefinition in the previous version of the parameter "ylim" for the plot of a histogram with theoretical density function (there was a problem with infinite values of theoretical density function).

# fitdistrplus 0.1-2

NEW FEATURES

- deletion of mledistcens and modification of mledist in order to maximize likelihood for both censored and non censored data.
- possibility to choose the optimization method used for maximum likelihood estimation (MLE) of distribution parameters using the new argument "optim.method" of mledist.
- possibility to specify constraints on distribution parameters using the new arguments "lower" and "upper" of mledist.
- possibility to use a custom optimization function for MLE using the new argument "custom.optim".
- moment matching estimation is no longer done with argument method set to "mom" but set to "mme" in fitdist.
- renaming of momdist in mmedist.
- calculation of AIC and BIC criteria after maximum likelihood estimation of distribution parameters
- change of the default number of iterations from 999 to 1001 for bootstrap in order to avoid interpolation using the quantile function
- use of the argument "log" (resp. "log.p") of density (resp.
distribution) when available to compute the loglikelihood. BUG FIXES - omitting the name of the first argument in calls to the density function during maximization of the likelihood in order to enable the use of a density function defined with a first parameter (the vector of quantiles) with a name differing from "x" (classical name for density distributions defined in R), such as the density function dexGAUS from the package gamlss.dist. # fitdistrplus 0.1-1 - Initial release. fitdistrplus/MD50000644000176200001440000002033314421764232013324 0ustar liggesusersb272360078a06c2bd5f189b974b76fb8 *DESCRIPTION de394941018816ccc72c06e4fc32517d *NAMESPACE 31724e1395b44b4a5f6362736c646897 *NEWS.md ad8499551d64710052f3eeb629fdc97e *R/CIcdfplot.R 92007c3f790b5991302bfc8949aa3c59 *R/Surv2fitdistcens.R e51b1ec6ecfd544336918fa128d4d2cc *R/bootdist-graph.R 3c26daea244f61f71595ed71909bcb20 *R/bootdist.R 9c12eac5e8b4ca4a70c5f77f8cabdcde *R/bootdistcens.R 529c6a524d5d6caeb594477b5f99a833 *R/cdfcomp.R 21691572b580df0ac948edf384e74287 *R/cdfcompcens.R a1885db7833d7335b8583357c8ac61e7 *R/coef.R 5e449af40fd2916db2ce700c7d7a60da *R/denscomp.R 63b3c858efc3ef60e98229231f152d9b *R/descdist.R b997f308a2c8821804577162a620c231 *R/detectbound.R d76db67eade3aa3a6e8684b57ed232d7 *R/fitbench.R e10db1ae53bceee0faa185cdd4d0a37b *R/fitdist.R 8b7fe0d5f65c319686b848497d007f28 *R/fitdistcens.R 55af8b871f5ef894a45a8ff9cd2ce750 *R/gofstat.R 82d3e8c7e8cbef28e7e5e34aded990b4 *R/gradlogLik.R e16afa96315460129b7be6a16ca509f3 *R/logLik-surface.R 63e5f7d6b4fd5e3146c8f54521a376b8 *R/logLik.R 7270db811a7166f9d38a73b7a61c5ebb *R/mgedist.R 4c2b0eee71197aceb19657c3fc4a4414 *R/mledist.R 95a01a8c51f51d88e206f647572443fd *R/mmedist.R f41687162b09533d17286a25fdacba38 *R/msedist.R bd2851f3db6111d020917249af8206f0 *R/plotdist.R 56f269286f3ad932f6e595096039be69 *R/plotdistcens.R 2d611928867a1b6fd63209c28560633c *R/ppcomp.R 78f22dab7165c64b948941045615b6c9 *R/ppcompcens.R f9264f2e0c0e527621e3fc8e66dc43b4 *R/prefit.R 00cd5f63089828d6777464e29489ef21 *R/qmedist.R b84a413520a05d9c2ea6fabcc922bbf3 *R/qqcomp.R 85102230bfe77b081d7944336c075cda *R/qqcompcens.R 95c34e49e732d89555159e98f97b3346 *R/quantiles.R c3cefdb7c45178a6480c8a6a1c5bae4f *R/util-Turnbull-intervals.R 3a2ef8f74dd2eddc677e94f43db8748c *R/util-cens2pseudo.R bc91229d1ea631a569704234590b4e1b *R/util-checkparam.R 632a9fdc83759dd7094409f7de17fa01 *R/util-checkparamlist.R 0776b6e417ce2d77b759c6d97ca19bd8 *R/util-getparam.R 75d88a28ca7b47e532c366765bc5da18 *R/util-isint.R 6eecbc55dc862b469ff849517afa640c *R/util-manageparam.R e4a0c6f53e7321928286a83a36a9091b *R/util-npmle.R 3703ec7cbdabb68bd71760a81a6a6346 *R/util-npsurv-NNLS.R 04e5164ab6d088884353e48c356e0cf1 *R/util-npsurv-hcnm.R 8df4c0265680f490bead14b30947075e *R/util-npsurv-intercens.R 49fb006e703dce03b8004ddf43103e10 *R/util-npsurv-km.R e83249d24ae4bd52e9ec24f17f684ae4 *R/util-npsurv-main.R ebffdfe9b570d951a167da9157e8d323 *R/util-startarg.R 9374854a7cdd53e40e5289201a209aab *R/util-testdensity.R efcc748256f6f7912a7815e952e2c5c2 *R/util-transform.R 1b12ca9979562058332e8d3de65224b5 *R/util-wtdstat.R 3e809f399b41fa5574e15d86c7544ae5 *R/vcov.R b7293331911822da10b2c02377e855a3 *README.md 439bf689fa27cf9affd0335332142165 *build/partial.rdb 5553ff33daa4de5117337a2440321b72 *build/vignette.rds 8aeb7eb0fb1218a9eac135e7e0200639 *data/danishmulti.rda 54ef81c0377c3e6ac8479ffd85499bba *data/danishuni.rda 1c644a363be0660d5db8f587cfe8938a *data/dataFAQlog1.rda c7efd1d1b67d6a4348f65dff31fe4d8c *data/dataFAQscale1.rda 66166ff91b40cc18fc95b03ee5f45ba6 
*data/dataFAQscale2.rda b97c01a6db75501fc8c0581a28f51bb0 *data/endosulfan.rda c5561b18f5f5ade0e9ab40055094a26a *data/fluazinam.rda d171a7a2267a0ffaf960f64b657d609f *data/fremale.rda 174c5238de9d810f191b011b4ae8dc8f *data/groundbeef.rda 044cf443a7bf2d86bd8352665ed3f068 *data/salinity.rda 594ad11436b821c73706f8dc5c4d95d3 *data/smokedfish.rda e0d960a9181b15cad54f95de005532e6 *data/toxocara.rda c47d79d72d34bddb39f81a6423e3233c *inst/CITATION 6fd6dca66da53a0137f51c9773c8c068 *inst/doc/FAQ.R ac34cc90a1421b869196a4affb477286 *inst/doc/FAQ.Rmd 1ce9b51e218fe4a17231d19b99dc9f4e *inst/doc/FAQ.html 0ef249017fac6ddff76f893af43721b1 *inst/doc/Optimalgo.R ca733bd4e54fe3c494fa695cd5242d64 *inst/doc/Optimalgo.Rmd 49fec5e9808ad939435809ec4c8d7d0c *inst/doc/Optimalgo.html da4d0af2fdadac4ba4a4e00bf86d37a7 *inst/doc/fitdistrplus_vignette.R df9f79f7b5e25c2bb55da6ced61ab4e7 *inst/doc/fitdistrplus_vignette.Rmd 352c04a684891eab4c0bb962c43e6bbf *inst/doc/fitdistrplus_vignette.html 527925704344e97dedffba1617f8a712 *man/CIcdfplot.Rd 538f62bc9df34c1cc725dad801f877fe *man/Surv2fitdistcens.Rd d153f05bb06fa0bf13c76d0d601385a2 *man/bootdist.Rd 0f34d195b8d179aa2cea79cc9e7d3296 *man/bootdistcens.Rd 638b4ab2997f4943fc2d94a3a37e308b *man/danish.Rd 3c2cd14460583c52d747adadecc0a5dc *man/dataFAQ.Rd c26890adb04cdb8e379dd98cf1de5cd7 *man/descdist.Rd 5de9438ca0080ea5bf2f9c4d07033685 *man/detectbound.Rd 1a805ed619fc6c79b3c67c4fc9c686bb *man/endosulfan.Rd d42586eff5ea948c28da4f5c181ba195 *man/fitdist.Rd c1185ae7a587de0787f35c21588e9bca *man/fitdistcens.Rd 7c25a99627cb1d58c1599faeb8e19b83 *man/fitdistrplus.Rd 32bd56570f397a0b7b4d2512b909105e *man/fluazinam.Rd 788540f8ed3cd11c0ab1a19d1b5d6232 *man/fremale.Rd 63eeea9aa01a89d47e38a12328cc0a82 *man/gofstat.Rd b3be3448df16749b3dcd7e3fa2fd4faa *man/graphcomp.Rd 3558e56db90d451b223d10c690b4cc36 *man/graphcompcens.Rd 732b38c5b981b09d6c42aed52c25cfc7 *man/groundbeef.Rd 9e1034d360cd8c43246f483f1389196b *man/logLik-plot.Rd 5933a05cfe4ed6b5580b917c4bdbb1af *man/logLik-surface.Rd e03ed5142f2b42056c8763710c2b8a7a *man/mgedist.Rd 7343f5e88bef0fabd830e43182787575 *man/mledist.Rd 90b4f6f9540c8505f3cebb2cd393ed43 *man/mmedist.Rd a2a4fdf082c8579532b6184860012652 *man/msedist.Rd 15d16710e1b65d5affd32c090ddc09ba *man/plotdist.Rd be70efea67d1a12586bc390890c8e7fa *man/plotdistcens.Rd 783ac1a62b6a8b26b02ade7cd3b178c9 *man/prefit.Rd 7cdf7bd88e073dbbf7b3d75572285a80 *man/qmedist.Rd 781ba62869fe4ccd3f6823dc9d0e349a *man/quantile.Rd 01c0d483abefaba002d96a2017ded5da *man/salinity.Rd c3af7940533d7908747dd8c9e18da3b9 *man/smokedfish.Rd 76180ba1145189d2134f2e23cbcc56a6 *man/toxocara.Rd c3b1628a97761d8ac9037f526b1e5026 *tests/t-CIcdfplot.R 387be0138a2022ca190d7efffefa15b5 *tests/t-Surv2fitdistcens.R 8a943a8da4d2532e47eb0c1c67903030 *tests/t-bootdist.R 6de295af895d219b115b66e2a5c1fc6c *tests/t-bootdistcens.R 11ecf2496d760cfb6ace050d6147e2b1 *tests/t-cdfcomp.R 40d1b5678c3cb09fd3bbdb39c2f9526d *tests/t-cdfcompcens.R 590bd6107a2777b36551f4f5863125bb *tests/t-cvg-algo.R 59333d1bdd1052f4a32db0a07cec54ab *tests/t-denscomp.R 091922cbb3188dc916f60c0729b1d001 *tests/t-descdist.R 42b44a61b0efde76615863087c86986e *tests/t-detectbound.R ad3a1d87d40f091a6998118c3c8a8c21 *tests/t-fitbench.R 4b14d31978044e50018eb9c58d7939ac *tests/t-fitdist-customoptim.R 62303c2fe2fcda5b135459205ac25217 *tests/t-fitdist.R dddfaf8ea8626c1e58942b1b4f71686b *tests/t-fitdistcens.R 10204028cb11cb6b8430d885b77430b5 *tests/t-gen-max-spacing-estim.R 5b7e2f4e1b970622d8ee100c76c99554 *tests/t-getparam.R 8d2caabb07d60ae9fdea976f77ba625d *tests/t-gofstat.R 
ea7d90e1ca86d20cad35dc3484934f58 *tests/t-init-actuar.R
3c8a93146ee8d4b39acbd54035123d31 *tests/t-llplot.R
986f32254a74010c6ebd151598744a0f *tests/t-lnL-surf.R
bf445663f525ef8c9ab1baeb9b5677c5 *tests/t-logLik-vcov-coef.R
86c01974a09c93f1697d0c1036cf35cd *tests/t-manageparam.R
b84179b08b018be21b2b72ff9b6b615b *tests/t-mgedist.R
6717be50d96c5f4f3a42658109d13a30 *tests/t-mledist-cens.R
e28d137a329ddafef40fe6f888be7bd3 *tests/t-mledist-nocens.R
1f3d2ce43f1c338a3b54d5ef694c12ba *tests/t-mledist-paramsupport.R
05fa231be8c97ae4bf722f4d7fcb0c3c *tests/t-mmedist.R
b88f49725eb084a55b10cb455c85da00 *tests/t-msedist.R
82e3b99e3ec105e1bac5792f64c5c583 *tests/t-parallel.R
41859be68197676a2d1f2cdb75c49962 *tests/t-plotdist.R
61dcbc90c4785098b68c6213085d4b9a *tests/t-plotdistcens.R
be4ec8bc492de8bc6f704452030deeda *tests/t-ppcomp.R
0ba0d48a32291b4d359f6477a32eda10 *tests/t-ppcompcens.R
e680266064f0edee51b3ea4bc1981763 *tests/t-prefit.R
01392f301e431a4a04c19982a2c6b25e *tests/t-qme-discrete.R
9989c73af7776c125edcb2a95bd9f2e9 *tests/t-qmedist.R
62c41606c29ba212b4fa0aa9be7a3f96 *tests/t-qqcomp.R
6ee7dfba17f45f6889018e909525f7fd *tests/t-qqcompcens.R
0091d2e1b68e2b57b1b8d4ec26f46461 *tests/t-quantiledist.R
94332098f8609f629854eaf842b38451 *tests/t-startfixarg-overall.R
7d1e7a2ada0542a3cad30c88f2420b90 *tests/t-startingvalues.R
b85ca3d012cfa68d996b1a7ac24709e9 *tests/t-util-npmle.R
0e99feaad2b5a4295d9bfdb004917725 *tests/t-util-npsurv-mainfunction.R
4a6da9856d5656fe52ea7e9a43d2eb2e *tests/t-util-testdensity.R
a6e91ae2d5083bf94697ab0468fe7423 *tests/t-weird-ppcomp-cens.R
cb0be7f5e63736f49f2cc6a4aa575566 *tests/t-weird-qqcomp-cens.R
ac34cc90a1421b869196a4affb477286 *vignettes/FAQ.Rmd
ca733bd4e54fe3c494fa695cd5242d64 *vignettes/Optimalgo.Rmd
07448abba61581aa87b8437be5a3b99c *vignettes/fitdistrplus.bib
df9f79f7b5e25c2bb55da6ced61ab4e7 *vignettes/fitdistrplus_vignette.Rmd
fitdistrplus/inst/0000755000176200001440000000000014421751624013771 5ustar liggesusersfitdistrplus/inst/doc/0000755000176200001440000000000014421751624014536 5ustar liggesusersfitdistrplus/inst/doc/FAQ.Rmd0000644000176200001440000017036414411504652015620 0ustar liggesusers---
title: Frequently Asked Questions
author: Marie Laure Delignette Muller, Christophe Dutang
date: '`r Sys.Date()`'
output: 
  bookdown::html_document2:
    base_format: rmarkdown::html_vignette
    fig_caption: yes
    toc: true
    number_sections: yes
link-citations: true
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Frequently Asked Questions}
  %!\VignetteEncoding{UTF-8}
  \usepackage[utf8]{inputenc}
pkgdown:
  as_is: true
---

```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
require(fitdistrplus)
set.seed(1234)
options(digits = 3)
```

# Questions regarding distributions

## How do I know the root name of a distribution?

The root name of a probability distribution is the `name` which is used in the `d`, `p`, `q`, `r` functions. For base R distributions, the root names are given in R-intro: https://cran.r-project.org/doc/manuals/R-intro.html#Probability-distributions. For example, you must use `"pois"` for the Poisson distribution and **not** `"poisson"`.

## How do I find "non standard" distributions?

For non-standard distributions, you can either find a package implementing them or define them yourself. A comprehensive list of non-standard distributions is given in the Distributions task view https://CRAN.R-project.org/view=Distributions. Here are two examples of user-defined distributions. A third example (shifted exponential) is given in FAQ 3.5.4.
* The Gumbel distribution

```{r, eval=FALSE}
dgumbel <- function(x, a, b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b))
pgumbel <- function(q, a, b) exp(-exp((a-q)/b))
qgumbel <- function(p, a, b) a-b*log(-log(p))
data(groundbeef)
fitgumbel <- fitdist(groundbeef$serving, "gumbel", start=list(a=10, b=10))
```

* The zero-modified geometric distribution

```{r, eval=FALSE}
dzmgeom <- function(x, p1, p2) p1 * (x == 0) + (1-p1)*dgeom(x-1, p2)
pzmgeom <- function(q, p1, p2) p1 * (q >= 0) + (1-p1)*pgeom(q-1, p2)
rzmgeom <- function(n, p1, p2) 
{
  u <- rbinom(n, 1, 1-p1) #prob to get zero is p1
  u[u != 0] <- rgeom(sum(u != 0), p2)+1
  u
}
x2 <- rzmgeom(1000, 1/2, 1/10)
fitdist(x2, "zmgeom", start=list(p1=1/2, p2=1/2))
```

## How do I set (or find) initial values for non standard distributions?

As documented, we provide initial values for the following distributions: `"norm"`, `"lnorm"`, `"exp"`, `"pois"`, `"cauchy"`, `"gamma"`, `"logis"`, `"nbinom"`, `"geom"`, `"beta"`, `"weibull"` from the `stats` package; `"invgamma"`, `"llogis"`, `"invweibull"`, `"pareto1"`, `"pareto"`, `"lgamma"`, `"trgamma"`, `"invtrgamma"` from the `actuar` package.

Look first at statistics and probability books such as

* different volumes of N. L. Johnson, S. Kotz and N. Balakrishnan books, e.g. **Continuous Univariate Distributions, Vol. 1**,
* **Thesaurus of univariate discrete probability distributions** by G. Wimmer and G. Altmann.
* **Statistical Distributions** by M. Evans, N. Hastings, B. Peacock.
* **Distributional Analysis with L-moment Statistics using the R Environment for Statistical Computing** by W. Asquith.

If not available, find initial values by equating theoretical and empirical quartiles. The graphical functions `plotdist()` and `plotdistcens()` can also be used to assess the suitability of starting values: by an iterative manual process you can move parameter values so as to obtain a distribution that roughly fits the data and take these parameter values as starting values for the real fit. You may also consider the `prefit()` function to find initial values, especially in cases where parameters are constrained.

## Is it possible to fit a distribution with at least 3 parameters?

Yes, an example with the Burr distribution is detailed in the JSS paper. We reproduce it very quickly here.

```{r, message=FALSE}
data("endosulfan")
library("actuar")
fendo.B <- fitdist(endosulfan$ATV, "burr", start = list(shape1 = 0.3, shape2 = 1, rate = 1))
summary(fendo.B)
```

## Why are there differences between MLE and MME for the lognormal distribution?

We recall that the lognormal distribution function is given by
$$
F_X(x) = \Phi\left(\frac{\log(x)-\mu}{\sigma} \right),
$$
where $\Phi$ denotes the distribution function of the standard normal distribution. We know that $E(X) = \exp\left( \mu+\frac{1}{2} \sigma^2 \right)$ and $Var(X) = \exp\left( 2\mu+\sigma^2\right) (e^{\sigma^2} -1)$. The MME is obtained by inverting the previous formulas, whereas the MLE has the following explicit solution
$$
\hat\mu_{MLE} = \frac{1}{n}\sum_{i=1}^n \log(x_i),~~ \hat\sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (\log(x_i) - \hat\mu_{MLE})^2.
$$
Let us fit a sample by MLE and MME. The fit looks particularly good in both cases.
```{r, fig.height=3, fig.width=6}
x3 <- rlnorm(1000)
f1 <- fitdist(x3, "lnorm", method="mle") 
f2 <- fitdist(x3, "lnorm", method="mme")
par(mfrow=1:2)
cdfcomp(list(f1, f2), do.points=FALSE, xlogscale = TRUE, main = "CDF plot")
denscomp(list(f1, f2), demp=TRUE, main = "Density plot")
```

Let us compare the theoretical moments (mean and variance) given the fitted values ($\hat\mu,\hat\sigma$), that is
$$
E(X) = \exp\left( \hat\mu+\frac{1}{2} \hat\sigma^2 \right), Var(X) = \exp\left( 2\hat\mu+\hat\sigma^2\right) (e^{\hat\sigma^2} -1).
$$

```{r}
c("E(X) by MME"=as.numeric(exp(f2$estimate["meanlog"]+f2$estimate["sdlog"]^2/2)), 
  "E(X) by MLE"=as.numeric(exp(f1$estimate["meanlog"]+f1$estimate["sdlog"]^2/2)), 
  "empirical"=mean(x3))
c("Var(X) by MME"=as.numeric(exp(2*f2$estimate["meanlog"]+f2$estimate["sdlog"]^2) * (exp(f2$estimate["sdlog"]^2)-1)), 
  "Var(X) by MLE"=as.numeric(exp(2*f1$estimate["meanlog"]+f1$estimate["sdlog"]^2) * (exp(f1$estimate["sdlog"]^2)-1)), 
  "empirical"=var(x3))
```

From an MLE point of view, handling a lognormal sample $x_1,\dots,x_n$ is equivalent to handling a normal sample $\log(x_1),\dots,\log(x_n)$. However, it is well known by the Jensen inequality that $E(X) = E(\exp(\log(X))) \geq \exp(E(\log(X)))$, implying that the MME provides better moment estimates than the MLE.

## Can I fit a distribution with positive support when data contains negative values?

The answer is no: you cannot fit a distribution with positive support (say gamma distribution) when data contains negative values.

```{r}
set.seed(1234)
x <- rnorm(100, mean = 1, sd = 0.5)
(try(fitdist(x, "exp")))
```

It is irrelevant to do such a fit. If you really need to use that distribution, you have two options: either to remove negative values (not recommended) or to shift the data.

```{r}
fitdist(x[x >= 0], "exp")
fitdist(x - min(x), "exp")
```

## Can I fit a finite-support distribution when data is outside that support?

The answer is no: you cannot fit a distribution with finite support (say beta distribution) when data is outside $[0,1]$.

```{r}
set.seed(1234)
x <- rnorm(100, mean = 0.5, sd = 0.25)
(try(fitdist(x, "beta")))
```

It is irrelevant to do such a fit. If you really need to use that distribution, you have two ways to tackle this issue: either to remove impossible values (not recommended) or to shift/scale the data.

```{r}
fitdist(x[x > 0 & x < 1], "beta")
fitdist((x - min(x)*1.01) / (max(x) * 1.01 - min(x) * 1.01), "beta")
```

## Can I fit truncated distributions?

The answer is yes: but the fitting procedure must be carried out carefully. Let $X$ be the original untruncated random variable. The truncated variable is the conditional random variable $Y = X ~\vert~ l < X < u$ with $l<u$ the lower and upper bounds. The density of $Y$ (w.r.t. the Lebesgue measure) is $f_Y(y) = f_X(y)/(F_X(u)-F_X(l))$ for $l \leq y \leq u$ and $0$ otherwise. Here is an example with the truncated exponential distribution.

```{r, fig.height=4, fig.width=6}
dtexp <- function(x, rate, low, upp)
{
  PU <- pexp(upp, rate=rate)
  PL <- pexp(low, rate=rate)
  dexp(x, rate) / (PU-PL) * (x >= low) * (x <= upp) 
}
ptexp <- function(q, rate, low, upp)
{
  PU <- pexp(upp, rate=rate)
  PL <- pexp(low, rate=rate)
  (pexp(q, rate)-PL) / (PU-PL) * (q >= low) * (q <= upp) + 1 * (q > upp)
}
n <- 200
x <- rexp(n); x <- x[x > .5 & x < 3]
f1 <- fitdist(x, "texp", method="mle", start=list(rate=3), fix.arg=list(low=min(x), upp=max(x)))
f2 <- fitdist(x, "texp", method="mle", start=list(rate=3), fix.arg=list(low=.5, upp=3))
gofstat(list(f1, f2))
cdfcomp(list(f1, f2), do.points = FALSE, xlim=c(0, 3.5))
```

## Can I fit truncated inflated distributions?

The answer is yes: but the fitting procedure must be carried out carefully. Let $X$ be the original untruncated random variable. The truncated variable is $Y = \max(\min(X, u), l)$ with $l<u$ the lower and upper bounds. The cdf of $Y$ is $F_Y(y) = F_X(y)1_{u>y>l} + 1_{y>u}$. There is no density (w.r.t.
the Lebesgue measure) since there are two probability masses $P(Y=l)= P(X\leq l)>0$ and $P(Y=u)=P(X>u)>0$. However, the density function with respect to the measure $m(x)= \delta_l(x)+\delta_u(x)+\lambda(x)$ is
$$
f_Y(y) = \left\{\begin{array}{ll}
F_X(l) & \text{if } y=l \\
f_X(y) & \text{if } l<y<u \\
1-F_X(u) & \text{if } y=u \\
0 & \text{otherwise}
\end{array}\right.
$$
The likelihood $L(l, \theta, u)$ of an i.i.d. sample $(y_i)_i$ of $Y$, where $\theta$ denotes the parameters of $X$, is zero if $l>\min_i y_i$ or $u<\max_i y_i$, and increasing with respect to $l$ in $]-\infty, \min_i y_i]$ and decreasing with respect to $u$ in $[\max_i y_i,+\infty[$. So the maximum of $L$ is reached at $l=\min_i y_i$ and $u=\max_i y_i$. The MLE of $\theta$ is then obtained by maximizing the log-likelihood $\log(L(l, \theta, u))$ with $u=\max_i y_i$ and $l=\min_i y_i$.

Let us illustrate truncated inflated distributions with the truncated inflated exponential distribution. The log-likelihood is particularly bad-shaped.

```{r, message=FALSE, fig.height=4, fig.width=8}
dtiexp <- function(x, rate, low, upp)
{
  PU <- pexp(upp, rate=rate, lower.tail = FALSE)
  PL <- pexp(low, rate=rate)
  dexp(x, rate) * (x >= low) * (x <= upp) + PL * (x == low) + PU * (x == upp)
}
ptiexp <- function(q, rate, low, upp)
  pexp(q, rate) * (q >= low) * (q <= upp) + 1 * (q > upp)
n <- 100; x <- pmax(pmin(rexp(n), 3), .5)
# the log-likelihood has a discontinuous point at the solution
par(mar=c(4,4,2,1), mfrow=1:2)
llcurve(x, "tiexp", plot.arg="low", fix.arg = list(rate=2, upp=5), min.arg=0, max.arg=.5, lseq=200)
llcurve(x, "tiexp", plot.arg="upp", fix.arg = list(rate=2, low=0), min.arg=3, max.arg=4, lseq=200)
```

The first method directly maximizes the log-likelihood $L(l, \theta, u)$; the second method maximizes the log-likelihood $L(\theta)$ assuming that $u=\hat u$ and $l=\hat l$ are known. Inside $[0.5,3]$, the CDF is correctly estimated by both methods, but the first method does not succeed in estimating the true values of the bounds $l,u$.

```{r, fig.height=4, fig.width=6}
(f1 <- fitdist(x, "tiexp", method="mle", start=list(rate=3, low=0, upp=20)))
(f2 <- fitdist(x, "tiexp", method="mle", start=list(rate=3), fix.arg=list(low=min(x), upp=max(x))))
gofstat(list(f1, f2))
cdfcomp(list(f1, f2), do.points = FALSE, addlegend=FALSE, xlim=c(0, 3.5))
curve(ptiexp(x, 1, .5, 3), add=TRUE, col="blue", lty=3)
legend("bottomright", lty=1:3, col=c("red", "green", "blue", "black"), 
       leg=c("full MLE", "MLE fixed arg", "true CDF", "emp. CDF"))
```

## Can I fit a uniform distribution?

The uniform distribution $\mathcal U(a,b)$ has only support parameters since the density does not have a scale or a shape parameter: $f_U(u) = \frac{1}{b-a}1_{[a,b]}(u)$. For this distribution, we should not maximize the log-likelihood but only the likelihood. Let $(x_i)_i$ be i.i.d. observations from the $\mathcal U(a,b)$ distribution. The likelihood is
$$
L(a,b) = \prod_{i=1}^n \frac{1}{b-a} 1_{[a,b]}(x_i)
= 1_{a\leq x_i \leq b, i=1,\dots,n} \left(\frac{1}{b-a}\right)^n
= 1_{a\leq \min_i x_i} 1_{\max_i x_i \leq b} \left(\frac{1}{b-a}\right)^n
$$
Hence $a\mapsto L(a,b)$ for any fixed $b\in]\max_i x_i, +\infty[$ is increasing on $]-\infty, \min_i x_i]$; similarly $b\mapsto L(a,b)$ is decreasing for any fixed $a$. This leads to $\min_i x_i$ and $\max_i x_i$ being the MLE of the uniform distribution. We should notice that the likelihood function $L$ is defined on $\mathbb R^2$, yet it vanishes outside $S=]-\infty, \min_i x_i]\times]\max_i x_i, +\infty[$. Hence, the log-likelihood is undefined outside $S$, which is an issue when maximizing the log-likelihood. For these reasons, `fitdist(data, dist="unif", method="mle")` uses the explicit form of the MLE for this distribution.
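In code, that closed form needs no numerical optimization at all; a minimal sketch (the sample `x0` is simulated here purely for illustration):

```{r}
# closed-form MLE of a uniform sample: just the sample extrema
x0 <- runif(100, min = 3, max = 5)
c(min = min(x0), max = max(x0)) # should coincide with coef(fitdist(x0, "unif"))
```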
Here is a fuller example below, visualizing the likelihood surface.

```{r, fig.height=4, fig.width=6}
trueval <- c("min"=3, "max"=5)
x <- runif(n=500, trueval[1], trueval[2])

f1 <- fitdist(x, "unif")
delta <- .01
llsurface(x, "unif", plot.arg = c("min", "max"), min.arg=c(min(x)-2*delta, max(x)-delta), 
          max.arg=c(min(x)+delta, max(x)+2*delta), main="likelihood surface for uniform",
          loglik=FALSE)
abline(v=min(x), h=max(x), col="grey", lty=2)
points(f1$estimate[1], f1$estimate[2], pch="x", col="red")
points(trueval[1], trueval[2], pch="+", col="blue")
legend("bottomright", pch=c("+","x"), col=c("blue","red"), c("true", "fitted"))

delta <- .2
llsurface(x, "unif", plot.arg = c("min", "max"), min.arg=c(3-2*delta, 5-delta), 
          max.arg=c(3+delta, 5+2*delta), main="log-likelihood surface for uniform")
abline(v=min(x), h=max(x), col="grey", lty=2)
points(f1$estimate[1], f1$estimate[2], pch="x", col="red")
points(trueval[1], trueval[2], pch="+", col="blue")
legend("bottomright", pch=c("+","x"), col=c("blue","red"), c("true", "fitted"))
```

Maximizing the log-likelihood is harder and can be done by defining a new density function. Appropriate starting values and parameter bounds must be supplied. Using the closed-form expression (as in `fitdist()`) or maximizing the log-likelihood (with `unif2`) leads to very similar results.

```{r}
dunif2 <- function(x, min, max) dunif(x, min, max)
punif2 <- function(q, min, max) punif(q, min, max)
f2 <- fitdist(x, "unif2", start=list(min=0, max=10), lower=c(-Inf, max(x)), upper=c(min(x), Inf))
print(c(logLik(f1), logLik(f2)), digits=7)
print(cbind(coef(f1), coef(f2)), digits=7)
```

## Can I fit a beta distribution with the same shape parameter?

Yes, you can wrap the density function of the beta distribution so that there is only one shape parameter. Here is an example of a concave density.

```{r}
x <- rbeta(1000, 3, 3)
dbeta2 <- function(x, shape, ...) dbeta(x, shape, shape, ...)
pbeta2 <- function(q, shape, ...) pbeta(q, shape, shape, ...)
fitdist(x, "beta2", start=list(shape=1/2))
```

Another example with a U-shaped density.

```{r}
x <- rbeta(1000, .3, .3)
fitdist(x, "beta2", start=list(shape=1/2), optim.method="L-BFGS-B", lower=1e-2)
```

## How to estimate support parameters? The case of the four-parameter beta

Let us consider the four-parameter beta distribution, also known as the PERT distribution, defined by the following density for $x\in [a,c]$: $f_X(x) = (x-a)^{\alpha-1} (c-x)^{\beta-1}/C_N$ with $C_N$ a normalizing constant and $\alpha=1+d(b-a)/(c-a)$, $\beta=1+d(c-b)/(c-a)$. $a,c$ are support parameters, $b\in]a,c[$ is the mode and $d$ the shape parameter.

As for the uniform distribution, one can show that the MLE of $a$ and $c$ are respectively the sample minimum and maximum. The code below illustrates the strategy using a partial closed formula with `fix.arg` and the full numerical search of the MLE. NB: on small sample sizes, the latter generally has better goodness-of-fit statistics; a small positive number is added to or subtracted from the sample minimum and maximum when fixing the support parameters $a$ and $c$.
```{r, message=FALSE, fig.height=4, fig.width=6}
require(mc2d)
x2 <- rpert(n=2e2, min=0, mode=1, max=2, shape=3/4)
eps <- sqrt(.Machine$double.eps)
f1 <- fitdist(x2, "pert", start=list(min=-1, mode=0, max=10, shape=1),
              lower=c(-Inf, -Inf, -Inf, 0), upper=c(Inf, Inf, Inf, Inf))
f2 <- fitdist(x2, "pert", start=list(mode=1, shape=1), 
              fix.arg=list(min=min(x2)-eps, max=max(x2)+eps),
              lower=c(min(x2), 0), upper=c(max(x2), Inf))
gofstat(list(f1,f2))
cdfcomp(list(f1,f2))
print(cbind(coef(f1), c(f2$fix.arg["min"], coef(f2)["mode"], f2$fix.arg["max"], coef(f2)["shape"])), digits=7)
```

# Questions regarding goodness-of-fit tests and statistics

## Where can we find the results of goodness-of-fit tests ?

Results of goodness-of-fit tests are not printed but are given in the object returned by `gofstat()` and you can have access to them as described in the example below. Nevertheless, p-values are not given for every test. For Anderson-Darling (ad), Cramer von Mises (cvm) and Kolmogorov (ks), only the decision (rejection of H0 or not) is given, when available (see FAQ 2.3 for more details).
For example when a variable cannot be negative, one would be very cautious while fitting a normal distribution, that potentially gives negative values, even if the observed data of this variable seem well fitted by a normal distribution. ### Should I reject a distribution because a goodness-of-fit test rejects it ? No it would not be reasonable at all to reject a distribution just because a goodness-of-fit test rejects it, especially in the case of big samples. In the real life, as soon as you have a sufficient amount of data, you will reject the fitted distribution. We know that a model cannot perfectly describe real data, and generally the true question is to find the better distribution among a pool of simple parametric distributions to describe the data, so to compare different models (see FAQ 2.4 and 2.5 for corresponding questions). To illustre this point let us comment the example presented below. We drew two samples from the same Poisson distribution with a mean parameter equal to 100. In many applications, for this value of its parameter, the Poisson distribution would be considered to be well approximated by a normal distribution. Testing the fit (here using a Kolmogorov-Smirnov test ) of the normal distribution on a sample of 100 observations would not reject the normal fit, while testing it on a sample of 10000 observations would reject it, while both samples come from the same distribution. ```{r, fig.height=3, fig.width=6} set.seed(1234) x1 <- rpois(n = 100, lambda = 100) f1 <- fitdist(x1, "norm") g1 <- gofstat(f1) g1$kstest x2 <- rpois(n = 10000, lambda = 100) f2 <- fitdist(x2, "norm") g2 <- gofstat(f2) g2$kstest par(mfrow=1:2) denscomp(f1, demp = TRUE, addlegend = FALSE, main = "small sample") denscomp(f2, demp = TRUE, addlegend = FALSE, main = "big sample") ``` ### Should I accept a distribution because goodness-of-fit tests do not reject it ? No, it would not be reasonable at all to validate a distribution because goodness-of-fit tests do not reject it. Like all the other hypothesis tests, goodness-of-fit tests lack of statistical power when the sample size is not so high. And the different goodness-of-fit tests are not equally sensitive to different types of deviation between empirical and fitted distributions. For example the Kolmogorov-Smirnov test is sensitive when distributions differ in a global fashion near the centre of the distribution. The Anderson-Darling test will be more sensitive when distributions differ in their tails, and the Cramer von Mises will be more sensitive when there are small but repetitive differences between empirical and theoretical distribution functions. The sensitivity of a chi square test will depend on the definition of classes, and even if we propose a default definition of classes when the user does not provide classes, this choice is not obvious and could impact the results of the test. This test is more appropriate when data are discrete, even if they are modelled by a continuous distribution, as in the following example. Two samples of respective sizes 500 and 50 are drawn from a Poisson distribution of mean parameter equal to 1 (not a sufficiently high value to consider that the Poisson distribution could be approximated by a normal one). Using a Kolmogorov-Smirnov test, for the small sample the normal fit is rejected only for the bigger sample. It is not rejected with the smaller sample even if the fit could be rejected after a simple visual confrontation of the distributions. 
In that particular case, the chi square test with classes defined by default would have rejected te normal fit for both samples. ```{r, fig.height=3, fig.width=6} set.seed(1234) x3 <- rpois(n = 500, lambda = 1) f3 <- fitdist(x3, "norm") g3 <- gofstat(f3) g3$kstest x4 <- rpois(n = 50, lambda = 1) f4 <- fitdist(x4, "norm") g4 <- gofstat(f4) g4$kstest par(mfrow=1:2) denscomp(f3, addlegend = FALSE, main = "big sample") denscomp(f4, addlegend = FALSE, main = "small sample") ``` ```{r} g3$chisqtable g3$chisqpvalue g4$chisqtable g4$chisqpvalue ``` ## Why all goodness-of-fit tests are not available for every distribution ? The Chi-squared test is available for any distribution but one must be conscious that its result depends on the definition of cells in which observed data are grouped, and a correct definition is not possible with a too small sample. Concerning the Kolmogorov-Smirnov test, it is proposed for any continuous distribution, but with a critical value corresponding to the comparison of the empirical distribution to a fully specified distribution. As the distribution is not fully known for a fitted distribution, the result of this test is subject to caution, but there is no general asymptotic theory for the Kolmogorov-Smirnov statistics in case of a fitted distribution. Nevertheless, one can use Monte Carlo methods to conduct Kolmgorov-Smirnov goodness-of-fit tests in cases when the sample is used to estimate model parameters. Such a method is implemented in the R package `KScorrect` for a variety of continuous distributions. Such an asymptotic theory was proposed for quadratic statistics for some distributions (Anderson-Darling, Cramer von Mises). The reference book we used on this subject (**Tests based on edf statistics** by Stephens MA in **Goodness-of-fit techniques** by D'Agostino RB and Stephens MA) proposes critical values of those statistics for a some classical distributions (exponential, gamma, Weibull, logistic, Cauchy, normal and lognormal). But the asymptotic theory about these statistics also depends on the way the parameters are estimated. And as they were not estimated by maximum likelihood for Cauchy, normal and lognormal distributions in the results reported by Stephens, we only propose the results of the Anderson-Darling and Cramer von Mises using those results for exponential, gamma, Weibull, logistic distributions. The user can refer to the cited books and use the proposed formula to estimate the parameters of Cauchy, normal and lognormal distributions and apply the tests using critical values given in the book. R packages `goftest` and `ADGofTest` could also be explored by users who would like to apply Anderson-Darling and Cramer von Mises tests on other distributions. But at this time we are not sure that the case where parameters are unknown (estimated by maximum likelihood) is tackled in those two packages. Concerning the development of our package, rather than develoing further more goodness-of-fit tests we made the choice to develop graphical tools to help to appreciate the quality of a fit and to compare the fits of different distributions on a same data set (see FAQ 2.2 for argumentation). ## How can we use goodness-of-fit statistics to compare the fit of different distributions on a same data set ? Goodness-of-fit statistics based on the empirical distribution function (Kolmogorov-Smirnov, Anderson-Darling and Cramer von Mises) may be used to measure a distance between the fitted distribution and the empirical distribution. 
So if one wants to compare the fit of various distributions on the same data set, the smaller those statistics, the better. The Kolmogorov-Smirnov statistic will be sensitive when distributions differ in a global fashion near the centre of the distribution, while the Anderson-Darling statistic will be more sensitive when distributions differ in their tails, and the Cramer von Mises statistic will be more sensitive when there are small but repetitive differences between the empirical and theoretical distribution functions.

But as mentioned in the main vignette of our package, the use of the Anderson-Darling statistic to compare the fit of different distributions is subject to caution due to the weighting of the quadratic distance between the fitted and empirical distribution functions, which depends on the parametric distribution. Moreover, statistics based on the empirical distribution function do not penalize distributions with a greater number of parameters, and as those are generally more flexible, this could induce over-fitting.

Goodness-of-fit statistics based on information criteria (AIC, BIC) correspond to deviance penalized by the complexity of the model (the number of parameters of the distribution), and so the smaller, the better. As more generic statistics, they are not adapted to focus on a part of the fitted distribution, but they take into account the complexity of the distribution and thus could help to prevent overfitting.

## Can we use a test to compare the fit of two distributions on the same data set?

In our package we did not implement such a test, but for **two nested distributions** (when one is a special case of the other, e.g. the exponential and gamma distributions) a likelihood ratio test can be easily implemented using the log-likelihood provided by `fitdist` or `fitdistcens`. Denoting $L$ the maximum likelihood obtained with the complete distribution and $L_0$ the one obtained with the simplified distribution, when the sample size increases, $- 2 \ln(\frac{L_0}{L}) = 2 \ln(L) - 2 \ln(L_0)$ tends to a chi-squared distribution with degrees of freedom equal to the difference in the numbers of parameters characterizing the **two nested distributions**. You will find below an example of such a test.

```{r}
set.seed(1234)
g <- rgamma(100, shape = 2, rate = 1)
(f <- fitdist(g, "gamma"))
(f0 <- fitdist(g, "exp"))
L <- logLik(f)
k <- length(f$estimate) # number of parameters of the complete distribution
L0 <- logLik(f0)
k0 <- length(f0$estimate) # number of parameters of the simplified distribution
(stat <- 2*L - 2*L0)
(critical_value <- qchisq(0.95, df = k - k0))
(rejected <- stat > critical_value)
```

Such a test can also be used for fits on censored data.

## Can we get goodness-of-fit statistics for a fit on censored data?

Function `gofstat` is not yet proposed in our package for fits on censored data, but developing it is among our objectives for the future. Published works on goodness-of-fit statistics based on the empirical distribution function for censored data generally focused on data containing only one type of censoring (e.g. right-censored data in survival analysis). Building such statistics in the general case, with data containing right-, left- and interval-censored values at the same time, remains tricky.

Nevertheless, it is possible, for any type of censored data, to use information criteria (AIC and BIC given in the summary of an object of class `fitdistcens`) to compare the fits of various distributions to the same data set.
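To make the last two points concrete, here is a minimal sketch on simulated interval-censored data (a hypothetical example built for illustration, not a package dataset): a likelihood ratio test of the exponential against the gamma distribution, followed by a comparison of information criteria.

```{r}
# hypothetical interval-censored sample: each value is only known up to
# the integers surrounding it
set.seed(1234)
r <- rgamma(50, shape = 2, rate = 1)
dcens <- data.frame(left = floor(r), right = ceiling(r))
fg <- fitdistcens(dcens, "gamma") # complete distribution
fe <- fitdistcens(dcens, "exp")   # simplified (nested) distribution
# likelihood ratio test, df = 2 - 1 = 1
(stat <- 2*logLik(fg) - 2*logLik(fe))
stat > qchisq(0.95, df = 1)
# information criteria: the smaller, the better
c(AIC_gamma = summary(fg)$aic, AIC_exp = summary(fe)$aic)
```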
# Questions regarding optimization procedures

## How to choose the optimization method?

If you want to perform optimization without bounds, `optim()` is used. You can try the derivative-free method Nelder-Mead and the Hessian-free method BFGS. If you want to perform optimization with bounds, only two methods are available without providing the gradient of the objective function: Nelder-Mead via `constrOptim()` and bounded BFGS via `optim()`. In both cases, see the help of `mledist()` and the vignette on optimization algorithms.

## The optimization algorithm stops with error code 100. What shall I do?

First, add traces by adding `control=list(trace=1, REPORT=1)`. Second, try to set bounds for parameters. Third, find better starting values (see FAQ 1.3).

## Why may a distribution with a `log` argument converge better?

Say we study the shifted lognormal distribution defined by the following density
$$
f(x) = \frac{1}{(x+\delta) \sigma \sqrt{2 \pi}} \exp\left(- \frac{(\ln (x+\delta)- \mu)^2}{2\sigma^2}\right)
$$
for $x>-\delta$, where $\mu$ is a location parameter, $\sigma$ a scale parameter and $\delta$ a boundary parameter. Let us fit this distribution on the dataset `y` by MLE. We define two functions for the densities, with and without a `log` argument.

```{r}
dshiftlnorm <- function(x, mean, sigma, shift, log = FALSE) dlnorm(x+shift, mean, sigma, log=log)
pshiftlnorm <- function(q, mean, sigma, shift, log.p = FALSE) plnorm(q+shift, mean, sigma, log.p=log.p)
qshiftlnorm <- function(p, mean, sigma, shift, log.p = FALSE) qlnorm(p, mean, sigma, log.p=log.p)-shift
dshiftlnorm_no <- function(x, mean, sigma, shift) dshiftlnorm(x, mean, sigma, shift)
pshiftlnorm_no <- function(q, mean, sigma, shift) pshiftlnorm(q, mean, sigma, shift)
```

We now optimize the minus log-likelihood.

```{r}
data(dataFAQlog1)
y <- dataFAQlog1
D <- 1-min(y)
f0 <- fitdist(y+D, "lnorm")
start <- list(mean=as.numeric(f0$estimate["meanlog"]),
              sigma=as.numeric(f0$estimate["sdlog"]), shift=D)
# works with BFGS, but not Nelder-Mead
f <- fitdist(y, "shiftlnorm", start=start, optim.method="BFGS")
summary(f)
```

If we do not use the `log` argument, the algorithm stalls.

```{r, error=FALSE}
f2 <- try(fitdist(y, "shiftlnorm_no", start=start, optim.method="BFGS"))
print(attr(f2, "condition"))
```

Indeed, the algorithm stops because the log-likelihood is infinite at the following parameter values.

```{r}
sum(log(dshiftlnorm_no(y, 0.16383978, 0.01679231, 1.17586600 )))
log(prod(dshiftlnorm_no(y, 0.16383978, 0.01679231, 1.17586600 )))
sum(dshiftlnorm(y, 0.16383978, 0.01679231, 1.17586600, TRUE ))
```

There is something wrong in the computation. Only the base-R implementation using the `log` argument seems reliable. This happens because the C-level implementation of `dlnorm` takes care of the log value. In the file `../src/nmath/dlnorm.c` in the R sources, we find the C code for `dlnorm`:

```{r, eval=FALSE, echo=TRUE}
double dlnorm(double x, double meanlog, double sdlog, int give_log)
{
    double y;

#ifdef IEEE_754
    if (ISNAN(x) || ISNAN(meanlog) || ISNAN(sdlog))
        return x + meanlog + sdlog;
#endif
    if(sdlog <= 0) {
        if(sdlog < 0) ML_ERR_return_NAN;
        // sdlog == 0 :
        return (log(x) == meanlog) ? ML_POSINF : R_D__0;
    }
    if(x <= 0) return R_D__0;
    y = (log(x) - meanlog) / sdlog;
    return (give_log ?
            -(M_LN_SQRT_2PI + 0.5 * y * y + log(x * sdlog)) :
            M_1_SQRT_2PI * exp(-0.5 * y * y) / (x * sdlog));
    /* M_1_SQRT_2PI = 1 / sqrt(2 * pi) */
}
```

In the last four lines, with the logical condition `give_log ?`, we see how the `log` argument is handled:

* when `log=TRUE`, we use $-(\log(\sqrt{2\pi}) + y^2/2+\log(x\sigma))$

```{r, eval=FALSE, echo=TRUE}
-(M_LN_SQRT_2PI + 0.5 * y * y + log(x * sdlog))
```

* when `log=FALSE`, we use $\frac{1}{\sqrt{2\pi}} \exp(-y^2/2)/(x\sigma)$ (and then the logarithm is taken outside `dlnorm`)

```{r, eval=FALSE, echo=TRUE}
M_1_SQRT_2PI * exp(-0.5 * y * y) / (x * sdlog)
```

Note that the constant $\log(\sqrt{2\pi})$ is pre-computed in the C macro `M_LN_SQRT_2PI`.

In order to sort out this problem, we use `constrOptim`, which wraps `optim` to take linear constraints into account. This also allows the use of optimization methods other than L-BFGS-B (low-memory bounded BFGS) used in `optim`.

```{r}
f2 <- fitdist(y, "shiftlnorm", start=start, lower=c(-Inf, 0, -min(y)), optim.method="Nelder-Mead")
summary(f2)
print(cbind(BFGS=f$estimate, NelderMead=f2$estimate))
```

Another possibility would be to perform all computations with higher-precision arithmetic, as implemented in the package `Rmpfr` using the MPFR library.

## What to do when there is a scaling issue?

Let us consider a dataset which has particularly small values.

```{r}
data(dataFAQscale1)
head(dataFAQscale1)
summary(dataFAQscale1)
```

The only way to sort this out is to multiply the dataset by a large value.

```{r}
for(i in 6:0) cat(10^i, try(mledist(dataFAQscale1*10^i, "cauchy")$estimate), "\n")
```

Let us consider a dataset which has particularly large values.

```{r}
data(dataFAQscale2)
head(dataFAQscale2)
summary(dataFAQscale2)
```

The only way to sort this out is to multiply the dataset by a small value.

```{r}
for(i in 0:5) cat(10^(-2*i), try(mledist(dataFAQscale2*10^(-2*i), "cauchy")$estimate), "\n")
```

## How do I set bounds on parameters when optimizing?

### Setting bounds for scale parameters

Consider the normal distribution $\mathcal{N}(\mu, \sigma^2)$ defined by the density
$$
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), x\in\mathbb{R},
$$
where $\mu$ is a location parameter such that $\mu\in\mathbb{R}$ and $\sigma^2$ is a scale parameter such that $\sigma^2>0$. Therefore, when optimizing the log-likelihood, the squared differences or the GoF statistics, the scale parameter must be constrained to be positive. Setting a lower bound for the scale parameter is easy with `fitdist`: just use the `lower` argument.

```{r scalenormal, echo=TRUE, warning=FALSE}
set.seed(1234)
x <- rnorm(1000, 1, 2)
fitdist(x, "norm", lower=c(-Inf, 0))
```

### Setting bounds for shape parameters

Consider the Burr distribution $\mathcal B(a, b, s)$ defined for $x>0$ by the density
$$
f(x) = \frac{a b (x/s)^b}{x [1 + (x/s)^b]^{a + 1}},
$$
where $a,b$ are shape parameters such that $a,b>0$ and $s$ is a scale parameter such that $s>0$.

```{r shapeburr, echo=TRUE, warning=FALSE}
x <- rburr(1000, 1, 2, 3)
fitdist(x, "burr", lower=c(0, 0, 0), start=list(shape1 = 1, shape2 = 1, rate = 1))
```

### Setting bounds for probability parameters

Consider the geometric distribution $\mathcal G(p)$ defined by the probability mass function
$$
f(x) = p(1-p)^x, x\in\mathbb{N},
$$
where $p$ is a probability parameter such that $p\in[0,1]$.
```{r probgeom, echo=TRUE, warning=FALSE}
x <- rgeom(1000, 1/4)
fitdist(x, "geom", lower=0, upper=1)
```

### Setting bounds for boundary parameters

Consider the shifted exponential distribution $\mathcal E(\mu,\lambda)$ defined by the density
$$
f(x) = \lambda \exp(-\lambda(x-\mu)), x>\mu,
$$
where $\lambda$ is a scale parameter such that $\lambda>0$ and $\mu$ is a boundary (or shift) parameter such that $\mu\in\mathbb{R}$. When optimizing the log-likelihood, the boundary constraint is
$$
\forall i=1,\dots,n, \ x_i>\mu \Leftrightarrow \mu < \min_{i=1,\dots,n} x_i.
$$
Note that when optimizing the squared differences or the GoF statistics, this constraint may not be necessary. Let us do it in R.

```{r shiftexp, echo=TRUE, warning=FALSE}
dsexp <- function(x, rate, shift) dexp(x-shift, rate=rate)
psexp <- function(x, rate, shift) pexp(x-shift, rate=rate)
rsexp <- function(n, rate, shift) rexp(n, rate=rate)+shift
x <- rsexp(1000, 1/4, 1)
# the shift parameter must stay below the sample minimum
fitdist(x, "sexp", start=list(rate=1, shift=0), lower=c(0, -Inf), upper=c(Inf, min(x)))
```

### Setting linear inequality bounds

For some distributions, bounds between parameters are not independent. For instance, the normal inverse Gaussian distribution ($\mu, \delta, \alpha, \beta$ parametrization) has the following parameter constraints, which can be reformulated as a linear inequality:
$$
\left\{
\begin{array}{l}\alpha > 0\\ \delta >0\\ \alpha > |\beta|\end{array}
\right.
\Leftrightarrow
\underbrace{
\left(
\begin{matrix}
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 1 & -1 \\
0 & 0 & 1 & 1 \\
\end{matrix}
\right)
}_{ui}
\left(
\begin{matrix}
\mu\\ \delta\\ \alpha \\ \beta \\
\end{matrix}
\right)
\geq
\underbrace{
\left(
\begin{matrix}
0\\ 0\\ 0 \\ 0 \\
\end{matrix}
\right)}_{ci}.
$$
These constraints can be enforced via `constrOptim()` and the arguments `ci` and `ui`. Here is an example.

```{r, message=FALSE}
library(GeneralizedHyperbolic)
myoptim <- function(fn, par, ui, ci, ...)
{
  res <- constrOptim(f=fn, theta=par, method="Nelder-Mead", ui=ui, ci=ci, ...)
  c(res, convergence=res$convergence, value=res$objective, par=res$minimum, hessian=res$hessian)
}
x <- rnig(1000, 3, 1/2, 1/2, 1/4)
ui <- rbind(c(0,1,0,0), c(0,0,1,0), c(0,0,1,-1), c(0,0,1,1))
ci <- c(0,0,0,0)
fitdist(x, "nig", custom.optim=myoptim, ui=ui, ci=ci, start=list(mu = 0, delta = 1, alpha = 1, beta = 0))
```

## How does quantile matching estimation work for discrete distributions?

Let us consider the geometric distribution with values in $\{0,1,2,3,\dots\}$. The probability mass function, the cumulative distribution function and the quantile function are
$$
P(X=x)= p (1-p)^{x}, \quad
F_X(x) = 1- (1-p)^{\lfloor x\rfloor+1}, \quad
F_X^{-1}(q) = \left\lceil\frac{\log(1-q)}{\log(1-p)}\right\rceil-1.
$$
Due to the integer parts (floor and ceiling functions), both the distribution function and the quantile function are step functions.

```{r, fig.height=3, fig.width=6}
pgeom(0:3, prob=1/2)
qgeom(c(0.3, 0.6, 0.9), prob=1/2)
par(mar=c(4,4,2,1), mfrow=1:2)
curve(pgeom(x, prob=1/2), 0, 10, n=301, main="c.d.f.")
curve(qgeom(x, prob=1/2), 0, 1, n=301, main="q.f.")
```

Now we study the QME for the geometric distribution. Since we have only one parameter, we use a single probability and choose the median, $q=1/2$. The theoretical median is the following integer
$$
F_X^{-1}(1/2) = \left\lceil\frac{\log(1/2)}{\log(1-p)}\right\rceil-1.
$$
Note that the theoretical median of a discrete distribution is an integer. Empirically, the median may not be an integer.
Indeed, for an even-length dataset, the empirical median is
$$
q_{n,1/2} = \frac{x_{n/2}^\star + x_{n/2+1}^\star}{2},
$$
where $x_{1}^\star<\dots<x_{n}^\star$ denotes the sorted sample, and this is generally not an integer.

## Is it possible to add the names of the observations in a goodness-of-fit plot, e.g. the names of the species in the plot of the Species Sensitivity Distribution (SSD) classically used in ecotoxicology?

An argument named `name.points` can be used in the functions `cdfcomp` or `CIcdfplot` to pass a vector of labels for the observed points, so that the name of each point is added on its left. This option is available only for ECDF goodness-of-fit plots and only for non-censored data. It can be used as below, for example, to name the species in the classical plot of Species Sensitivity Distributions (SSD) in ecotoxicology.

```{r, fig.height= 6, fig.width= 6, warning = FALSE}
data(endosulfan)
ATV <- subset(endosulfan, group == "NonArthroInvert")$ATV
taxaATV <- subset(endosulfan, group == "NonArthroInvert")$taxa
f <- fitdist(ATV, "lnorm")
cdfcomp(f, xlogscale = TRUE, main = "Species Sensitivity Distribution",
    xlim = c(1, 100000), name.points = taxaATV, addlegend = FALSE, plotstyle = "ggplot")
```

# Questions regarding (left, right and/or interval) censored data

## How to code censored data in `fitdistrplus`?

Censored data must be represented in the package by a data frame of two columns respectively named `left` and `right`, describing each observed value as an interval. The `left` column contains either `NA` for left-censored observations, the left bound of the interval for interval-censored observations, or the observed value for non-censored observations. The `right` column contains either `NA` for right-censored observations, the right bound of the interval for interval-censored observations, or the observed value for non-censored observations. This type of representation corresponds to the coding name `"interval2"` in the function `Surv` of the package `survival`. There is no other way to represent censored data in `fitdistrplus`, but the function `Surv2fitdistcens()` can be used to help you format data for use in `fitdistcens()` from one of the formats used in the `survival` package (see the help page of `Surv2fitdistcens()`). A toy example is given below.

```{r}
dtoy <- data.frame(left = c(NA, 2, 4, 6, 9.7, 10), right = c(1, 3, 7, 8, 9.7, NA))
dtoy
```

## How do I prepare the input of `fitdistcens()` with `Surv2fitdistcens()`?

Let us consider a classical right-censored dataset for human life: twenty values randomly chosen from the `canlifins` dataset of the `CASdatasets` package. We refer to the help of `Surv2fitdistcens()` for other censoring types.

```{r}
exitage <- c(81.1,78.9,72.6,67.9,60.1,78.3,83.4,66.9,74.8,80.5,75.6,67.1,
             75.3,82.8,70.1,85.4,74,70,71.6,76.5)
death <- c(0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0)
```

When performing survival analysis, it is very common to use the `Surv()` function from the package `survival` to handle different types of censoring. In order to ease the use of `fitdistcens()`, a dedicated function `Surv2fitdistcens()` has been implemented with arguments similar to the ones of `Surv()`.

```{r}
svdata <- Surv2fitdistcens(exitage, event=death)
```

Let us now fit two simple distributions.

```{r, fig.height= 4, fig.width= 6}
flnormc <- fitdistcens(svdata, "lnorm")
fweic <- fitdistcens(svdata, "weibull")
cdfcompcens(list(fweic, flnormc), xlim=range(exitage), xlegend = "topleft")
```

## How to represent an empirical distribution from censored data?
The representation of an empirical distribution from censored data is not a trivial problem. One can simply represent each observation as an interval at a y-value defined by the rank of the observation, as done below using the function `plotdistcens`. This representation can be interesting to visualize the raw data, but it remains difficult to order the observations correctly in some cases (see the example below on the right using the data `smokedfish`).

```{r, fig.height= 4, fig.width= 8}
par(mfrow = c(1,2), mar = c(3, 4, 3, 0.5))
plotdistcens(dtoy, NPMLE = FALSE)
data(smokedfish)
dsmo <- log10(smokedfish)
plotdistcens(dsmo, NPMLE = FALSE)
```

Many authors have worked on the development of algorithms for **non parametric maximum likelihood estimation (NPMLE)** of the **empirical cumulative distribution function (ECDF)** from interval-censored data (including left- and right-censored data, which can be considered as interval-censored data with one bound at infinity). In old versions of `fitdistrplus` we used the Turnbull algorithm, with calls to functions of the package `survival`. Even if this Turnbull algorithm is still available in the package, the default plot now uses the function `npsurv` of the package `npsurv`. This package provides more efficient algorithms developed by Yong Wang (see the references cited in the help page of `plotdistcens`). Due to the lack of maintenance of this package, we were forced to rewrite its main functions in our package, using another optimization function. The same ECDF plot was also implemented in our package using the Turnbull algorithm of `survival` (see below).

```{r, fig.height= 6, fig.width= 6}
par(mfrow = c(2, 2), mar = c(3, 4, 3, 0.5))
# Turnbull algorithm with representation of middle points of equivalence classes
plotdistcens(dsmo, NPMLE.method = "Turnbull.middlepoints", xlim = c(-1.8, 2.4))
# Turnbull algorithm with representation of equivalence classes as intervals
plotdistcens(dsmo, NPMLE.method = "Turnbull.intervals")
# Wang algorithm with representation of equivalence classes as intervals
plotdistcens(dsmo, NPMLE.method = "Wang")
```

As you can see in the above example, the new implementation of the NPMLE provides a different type of plot for the ECDF, representing by filled rectangles the zones of non-uniqueness of the NPMLE ECDF. Indeed, an NPMLE algorithm generally proceeds in two steps.

1. The first step aims at identifying **equivalence classes** (also named in the literature **Turnbull intervals**, **maximal intersection intervals**, **innermost intervals** or **maximal cliques** of the data). Equivalence classes are points/intervals under which the NPMLE ECDF may change. Equivalence classes have been shown to correspond to regions between a left bound of an interval (named L in the following plot, based on the previous toy example) immediately followed by a right bound of an interval (named R in the following plot). An equivalence class may be of null length (for example at each non-censored value).

2. The second step aims at assigning a **probability mass** to each equivalence class, which may be zero on some classes. The NPMLE is unique only up to these equivalence classes, and this **non-uniqueness** of the NPMLE ECDF is represented by **filled rectangles**.

Various NPMLE algorithms are implemented in the packages **Icens**, **interval** and **npsurv**. They are more or less efficient, and not all of them can handle data other than survival data, in particular data with left-censored observations.
```{r, echo = FALSE, fig.height= 4, fig.width= 8}
d <- data.frame(left = c(NA, 2, 4, 6, 9.5, 10), right = c(1, 3, 7, 8, 9.5, NA))
addbounds <- function(d)
{
  xbounds <- c(d$left, d$right)
  xboundsnotNA <- xbounds[!is.na(xbounds)]
  abline(v = xboundsnotNA, col = "grey")
}
addLR <- function(d)
{
  Lbounds <- d$left[!is.na(d$left)]
  Rbounds <- d$right[!is.na(d$right)]
  range <- range(c(Lbounds,Rbounds))
  eps <- (range[2] - range[1]) * 0.01
  text(x = Lbounds-eps, y = 0.05, labels = "L", col = "red", cex = 0.75)
  text(x = Rbounds+eps, y = 0.05, labels = "R", col = "red", cex = 0.75)
}
addeq <- function(deq)
{
  left <- deq$left
  left[is.na(left)] <- -100
  right <- deq$right
  right[is.na(right)] <- 100
  rect(left, -2, right, 2, density = 10)
}
par(mfrow = c(2,1), mar = c(2, 4, 3, 0.5))
# First step
plotdistcens(d, NPMLE = FALSE, lwd = 2, col = "blue",
             main = "Step 1: identification of equivalence classes")
addbounds(d)
addLR(d)
deq <- data.frame(left = c(NA, 2, 6, 9.5, 10), right = c(1, 3, 7, 9.5, NA))
addeq(deq)
# Second step
plotdistcens(d, lwd = 2, main = "Step 2: estimation of mass probabilities")
```

## How to assess the goodness-of-fit of a distribution fitted on censored data?

The only available method in `fitdistrplus` to fit distributions on censored data is maximum likelihood estimation (MLE). Once a distribution is fitted using `fitdistcens`, AIC and BIC values can be found in the `summary` of the object of class `fitdistcens` returned by the function. Those values can be used to compare the fit of various distributions on the same dataset. Function `gofstat` is not yet proposed in our package for fits on censored data, but we plan to develop it in the future with the calculation of other goodness-of-fit statistics for censored data.

```{r}
fnorm <- fitdistcens(dsmo,"norm")
flogis <- fitdistcens(dsmo,"logis")
# comparison of AIC values
summary(fnorm)$aic
summary(flogis)$aic
```

Considering goodness-of-fit plots, the generic `plot` function of an object of class `fitdistcens` provides three plots: one in CDF using the NPMLE ECDF plot (by default using the Wang representation, see the previous part for details), a Q-Q plot and a P-P plot simply derived from the Wang plot of the ECDF, with filled rectangles indicating the non-uniqueness of the NPMLE ECDF.

```{r, fig.height= 6, fig.width= 6}
par(mar = c(2, 4, 3, 0.5))
plot(fnorm)
```

Functions `cdfcompcens()`, `qqcompcens()` and `ppcompcens()` can be used to individualize and personalize CDF, Q-Q and P-P goodness-of-fit plots and/or to compare the fit of various distributions on the same dataset.

```{r, fig.height= 4, fig.width= 4}
cdfcompcens(list(fnorm, flogis), fitlty = 1)
qqcompcens(list(fnorm, flogis))
ppcompcens(list(fnorm, flogis))
```

Considering Q-Q plots and P-P plots, it may be easier to compare various fits by splitting the plots as below, which is done automatically using the `plotstyle` `ggplot` in `qqcompcens()` and `ppcompcens()`, but can also be done manually with the `plotstyle` `graphics`.

```{r, fig.height= 4, fig.width= 8}
qqcompcens(list(fnorm, flogis), lwd = 2, plotstyle = "ggplot",
           fitcol = c("red", "green"), fillrect = c("pink", "lightgreen"),
           legendtext = c("normal distribution", "logistic distribution"))
```

fitdistrplus/inst/doc/FAQ.html0000644000176200001440000343734514421751577016055 0ustar liggesusers Frequently Asked Questions

Frequently Asked Questions

Marie Laure Delignette Muller, Christophe Dutang

2023-04-25

1 Questions regarding distributions

1.1 How do I know the root name of a distribution?

The root name of a probability distribution is the name which is used in the d, p, q, r functions. For base R distributions, the root names are given in R-intro: https://cran.r-project.org/doc/manuals/R-intro.html#Probability-distributions. For example, you must use "pois" for the Poisson distribution and not "poisson".
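For instance, with the root name "pois", the four related functions are directly available (a minimal illustration):

# d, p, q, r functions built from the root name "pois"
dpois(x = 0:2, lambda = 1)   # probability mass function
ppois(q = 1, lambda = 1)     # cumulative distribution function
qpois(p = 0.5, lambda = 1)   # quantile function
rpois(n = 3, lambda = 1)     # random generation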

1.2 How do I find “non standard” distributions?

For non-standard distributions, you can either find a package implementing them or define them yourself. A comprehensive list of non-standard distributions is given in the Distributions task view https://CRAN.R-project.org/view=Distributions. Here are two examples of user-defined distributions. A third example (shifted exponential) is given in FAQ 3.5.4.

dgumbel <- function(x, a, b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b))
pgumbel <- function(q, a, b) exp(-exp((a-q)/b))
qgumbel <- function(p, a, b) a-b*log(-log(p))
data(groundbeef)
fitgumbel <- fitdist(groundbeef$serving, "gumbel", start=list(a=10, b=10))
dzmgeom <- function(x, p1, p2) p1 * (x == 0) + (1-p1)*dgeom(x-1, p2)
pzmgeom <- function(q, p1, p2) p1 * (q >= 0) + (1-p1)*pgeom(q-1, p2)
rzmgeom <- function(n, p1, p2) 
{
  u <- rbinom(n, 1, 1-p1) #prob to get zero is p1
  u[u != 0] <- rgeom(sum(u != 0), p2)+1
  u
}
x2 <- rzmgeom(1000, 1/2, 1/10)
fitdist(x2, "zmgeom", start=list(p1=1/2, p2=1/2))

1.3 How do I set (or find) initial values for non standard distributions?

As documented, we provide initial values for the following distributions: "norm", "lnorm", "exp", "pois", "cauchy", "gamma", "logis", "nbinom", "geom", "beta", "weibull" from the stats package; "invgamma", "llogis", "invweibull", "pareto1", "pareto", "lgamma", "trgamma", "invtrgamma" from the actuar package.

Look first at statistics and probability books.

If not available there, find initial values by equating theoretical and empirical quartiles. The graphical functions plotdist() and plotdistcens() can also be used to assess the suitability of starting values: by an iterative manual process, you can move parameter values so as to obtain a distribution that roughly fits the data, and take these parameter values as starting values for the real fit. You may also consider the prefit() function to find initial values, especially in the case where parameters are constrained.
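For example, here is a minimal sketch (with simulated data) of moment-based starting values for a gamma fit, obtained by equating the theoretical mean shape/rate and variance shape/rate^2 with their empirical counterparts:

set.seed(1234)
x <- rgamma(100, shape = 3, rate = 2)
# moment-based starting values: shape = mean^2/variance, rate = mean/variance
start0 <- list(shape = mean(x)^2/var(x), rate = mean(x)/var(x))
fitdist(x, "gamma", start = start0)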

1.4 Is it possible to fit a distribution with at least 3 parameters?

Yes, an example with the Burr distribution is detailed in the JSS paper. We reproduce it very quickly here.

data("endosulfan")
library("actuar")
fendo.B <- fitdist(endosulfan$ATV, "burr", start = list(shape1 = 0.3, shape2 = 1, rate = 1))
summary(fendo.B)
## Fitting of the distribution ' burr ' by maximum likelihood 
## Parameters : 
##        estimate Std. Error
## shape1    0.206     0.0561
## shape2    1.540     0.3188
## rate      1.497     0.4683
## Loglikelihood:  -520   AIC:  1046   BIC:  1054 
## Correlation matrix:
##        shape1 shape2   rate
## shape1  1.000 -0.900 -0.727
## shape2 -0.900  1.000  0.588
## rate   -0.727  0.588  1.000

1.5 Why are there differences between MLE and MME for the lognormal distribution?

We recall that the lognormal distribution function is given by \[ F_X(x) = \Phi\left(\frac{\log(x)-\mu}{\sigma} \right), \] where \(\Phi\) denotes the distribution function of the standard normal distribution. We know that \(E(X) = \exp\left( \mu+\frac{1}{2} \sigma^2 \right)\) and \(Var(X) = \exp\left( 2\mu+\sigma^2\right) (e^{\sigma^2} -1)\). The MME is obtained by inverting the previous formulas, whereas the MLE has the following explicit solution \[ \hat\mu_{MLE} = \frac{1}{n}\sum_{i=1}^n \log(x_i),~~ \hat\sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (\log(x_i) - \hat\mu_{MLE})^2. \] Let us fit a sample by MLE and MME. The fit looks particularly good in both cases.

x3 <- rlnorm(1000)
f1 <- fitdist(x3, "lnorm", method="mle") 
f2 <- fitdist(x3, "lnorm", method="mme")
par(mfrow=1:2)
cdfcomp(list(f1, f2), do.points=FALSE, xlogscale = TRUE, main = "CDF plot")
denscomp(list(f1, f2), demp=TRUE, main = "Density plot")

Let us compare the theoretical moments (mean and variance) given the fitted values (\(\hat\mu,\hat\sigma\)), that is \[ E(X) = \exp\left( \hat\mu+\frac{1}{2} \hat\sigma^2 \right), Var(X) = \exp\left( 2\hat\mu+\hat\sigma^2\right) (e^{\hat\sigma^2} -1). \]

c("E(X) by MME"=as.numeric(exp(f2$estimate["meanlog"]+f2$estimate["sdlog"]^2/2)), 
    "E(X) by MLE"=as.numeric(exp(f1$estimate["meanlog"]+f1$estimate["sdlog"]^2/2)), 
    "empirical"=mean(x3))
## E(X) by MME E(X) by MLE   empirical 
##        1.61        1.60        1.61
c("Var(X) by MME"=as.numeric(exp(2*f2$estimate["meanlog"]+f2$estimate["sdlog"]^2) * 
                               (exp(f2$estimate["sdlog"]^2)-1)), 
    "Var(X) by MLE"=as.numeric(exp(2*f1$estimate["meanlog"]+f1$estimate["sdlog"]^2) * 
                                 (exp(f1$estimate["sdlog"]^2)-1)), 
    "empirical"=var(x3))
## Var(X) by MME Var(X) by MLE     empirical 
##          4.30          4.36          4.30

From an MLE point of view, a lognormal sample \(x_1,\dots,x_n\) is equivalent to handling a normal sample \(\log(x_1),\dots,\log(x_n)\). However, it is well known from the Jensen inequality that \(E(X) = E(\exp(\log(X))) \geq \exp(E(\log(X)))\), implying that the MME provides better moment estimates than the MLE.
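A quick numerical check of this inequality on the sample above (a minimal sketch, assuming x3 from the previous chunk):

# exp(E(log X)) underestimates E(X), in line with the Jensen inequality
c(exp_mean_log = exp(mean(log(x3))), empirical_mean = mean(x3))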

1.6 Can I fit a distribution with positive support when data contains negative values?

The answer is no: you cannot fit a distribution with positive support (say gamma distribution) when data contains negative values.

set.seed(1234)
x <- rnorm(100, mean = 1, sd = 0.5)
(try(fitdist(x, "exp")))
## Error in computing default starting values.
## Error in manageparam(start.arg = start, fix.arg = fix.arg, obs = data,  : 
##   Error in startargdefault(obs, distname) : 
##   values must be positive to fit an exponential  distribution
## [1] "Error in manageparam(start.arg = start, fix.arg = fix.arg, obs = data,  : \n  Error in startargdefault(obs, distname) : \n  values must be positive to fit an exponential  distribution\n\n"
## attr(,"class")
## [1] "try-error"
## attr(,"condition")
## <simpleError in manageparam(start.arg = start, fix.arg = fix.arg, obs = data,     distname = distname): Error in startargdefault(obs, distname) : 
##   values must be positive to fit an exponential  distribution
## >

It is irrelevant to do such a fit. If you really need to use that distribution, you have two options: either to remove negative values (not recommended) or to shift the data.

fitdist(x[x >= 0], "exp")
## Fitting of the distribution ' exp ' by maximum likelihood 
## Parameters:
##      estimate Std. Error
## rate     1.06      0.107
fitdist(x - min(x), "exp")
## Fitting of the distribution ' exp ' by maximum likelihood 
## Parameters:
##      estimate Std. Error
## rate    0.914     0.0914

1.7 Can I fit a finite-support distribution when data is outside that support?

The answer is no: you cannot fit a distribution with finite support (say the beta distribution) when data is outside \([0,1]\).

set.seed(1234)
x <- rnorm(100, mean = 0.5, sd = 0.25)
(try(fitdist(x, "beta")))
## Error in computing default starting values.
## Error in manageparam(start.arg = start, fix.arg = fix.arg, obs = data,  : 
##   Error in startargdefault(obs, distname) : 
##   values must be in [0-1] to fit a beta distribution
## [1] "Error in manageparam(start.arg = start, fix.arg = fix.arg, obs = data,  : \n  Error in startargdefault(obs, distname) : \n  values must be in [0-1] to fit a beta distribution\n\n"
## attr(,"class")
## [1] "try-error"
## attr(,"condition")
## <simpleError in manageparam(start.arg = start, fix.arg = fix.arg, obs = data,     distname = distname): Error in startargdefault(obs, distname) : 
##   values must be in [0-1] to fit a beta distribution
## >

It is irrelevant to do such a fit. If you really need to use that distribution, you have two ways to tackle this issue: either to remove impossible values (not recommended) or to shift/scale the data.

fitdist(x[x > 0 & x < 1], "beta")
## Fitting of the distribution ' beta ' by maximum likelihood 
## Parameters:
##        estimate Std. Error
## shape1     2.08      0.288
## shape2     2.50      0.352
fitdist((x - min(x)*1.01) / (max(x) * 1.01 - min(x) * 1.01), "beta")
## Fitting of the distribution ' beta ' by maximum likelihood 
## Parameters:
##        estimate Std. Error
## shape1     1.77      0.236
## shape2     2.17      0.296

1.8 Can I fit truncated distributions?

The answer is yes: but the fitting procedure must be carried out carefully. Let \(X\) be the original untruncated random variable. The truncated variable is the conditional random variable \(Y = X ~\vert~ l< X <u\) with \(l<u\) the lower and upper bounds. The cdf of \(Y\) is \(F_Y(y)=\frac{F_X(y) - F_X(l)}{F_X(u)-F_X(l)}\). There is a density (w.r.t. the Lebesgue measure) given by \[ f_Y(y) = \left\{\begin{array}{ll} \frac{f_X(y)}{F_X(u)-F_X(l)} & \text{if } l < y < u \\ 0 & \text{otherwise }\\ \end{array}\right. \]

dtexp <- function(x, rate, low, upp)
{
  PU <- pexp(upp, rate=rate)
  PL <- pexp(low, rate=rate)
  dexp(x, rate) / (PU-PL) * (x >= low) * (x <= upp) 
}
ptexp <- function(q, rate, low, upp)
{
  PU <- pexp(upp, rate=rate)
  PL <- pexp(low, rate=rate)
  (pexp(q, rate)-PL) / (PU-PL) * (q >= low) * (q <= upp) + 1 * (q > upp)
}
n <- 200
x <- rexp(n); x <- x[x > .5 & x < 3]
f1 <- fitdist(x, "texp", method="mle", start=list(rate=3), fix.arg=list(low=min(x), upp=max(x)))
f2 <- fitdist(x, "texp", method="mle", start=list(rate=3), fix.arg=list(low=.5, upp=3))
gofstat(list(f1, f2))
## Goodness-of-fit statistics
##                              1-mle-texp 2-mle-texp
## Kolmogorov-Smirnov statistic     0.0952      0.084
## Cramer-von Mises statistic       0.1343      0.104
## Anderson-Darling statistic          Inf      1.045
## 
## Goodness-of-fit criteria
##                                1-mle-texp 2-mle-texp
## Akaike's Information Criterion        127        132
## Bayesian Information Criterion        130        135
cdfcomp(list(f1, f2), do.points = FALSE, xlim=c(0, 3.5))

1.9 Can I fit truncated inflated distributions?

The answer is yes: but the fitting procedure must be carried out carefully. Let \(X\) be the original untruncated random variable. The truncated variable is \(Y = \max(\min(X, u), l)\) with \(l<u\) the lower and upper bounds. The cdf of \(Y\) is \(F_Y(y)=F_X(y)1_{u>y>l} + 1_{y>u}\). There is no density (w.r.t. the Lebesgue measure) since there are two probability masses \(P(Y=l)= P(X\leq l)>0\) and \(P(Y=u)=P(X>u)>0\). However, the density function with respect to the measure \(m(x)= \delta_l(x)+\delta_u(x)+\lambda(x)\) is \[ f_Y(y) = \left\{\begin{array}{ll} F_X(l) & \text{if } y=l \\ f_X(y) & \text{if } l<y<u \\ 1-F_X(u) & \text{if } y=u \\ \end{array}\right. \] Let \(\theta\) be the parameter of the untruncated distribution. Since the likelihood can be factorized, the maximization can be done separately \[ L(l, \theta, u) = 1_{\forall i, l\leq y_i\leq u} \prod_{i=1, y_i=l}^n F_X(l, \theta) \times \prod_{i=1,l<y_i<u}^n f_X(y_i, \theta) \times \prod_{i=1,y_i=u}^n (1-F_X(u, \theta)), \] Furthermore, using \(\forall i, l\leq y_i\leq u\Leftrightarrow l\leq \min_i y_i\leq \max_i y_i\leq u\), the likelihood is zero for \(l>\min_i y_i\) or \(u<\max_i y_i\) and increasing with respect to \(l\) in \(]-\infty, \min_i y_i]\) and decreasing with respect to \(u\) in \([\max_i y_i,+\infty[\). So the maximum of \(L\) is reached at \(l=\min_i y_i\) and \(u=\max_i y_i\). The MLE of \(\theta\) is then obtained by maximizing the log-likelihood \(\log(L(l, \theta, u))\) with \(u=\max_i Y_i\) and \(l=\min_i Y_i\).

Let us illustrate this with the truncated inflated exponential distribution. The log-likelihood is particularly badly shaped.

dtiexp <- function(x, rate, low, upp)
{
  PU <- pexp(upp, rate=rate, lower.tail = FALSE)
  PL <- pexp(low, rate=rate)
  dexp(x, rate) * (x >= low) * (x <= upp) + PL * (x == low) + PU * (x == upp)
}
ptiexp <- function(q, rate, low, upp)
  pexp(q, rate) * (q >= low) * (q <= upp) + 1 * (q > upp)
n <- 100; x <- pmax(pmin(rexp(n), 3), .5)
# the log-likelihood has a discontinuous point at the solution
par(mar=c(4,4,2,1), mfrow=1:2)
llcurve(x, "tiexp", plot.arg="low", fix.arg = list(rate=2, upp=5), min.arg=0, max.arg=.5, lseq=200)
llcurve(x, "tiexp", plot.arg="upp", fix.arg = list(rate=2, low=0), min.arg=3, max.arg=4, lseq=200)

The first method directly maximizes the log-likelihood \(L(l, \theta, u)\); the second method maximizes the log-likelihood \(L(\theta)\) assuming that \(u=\hat u\) and \(l=\hat l\) are known. Inside \([0.5,3]\), the CDF is correctly estimated by both methods, but the first method does not succeed in estimating the true values of the bounds \(l,u\).

(f1 <- fitdist(x, "tiexp", method="mle", start=list(rate=3, low=0, upp=20)))
## Fitting of the distribution ' tiexp ' by maximum likelihood 
## Parameters:
##      estimate Std. Error
## rate    0.949         NA
## low    -0.502         NA
## upp    23.072         NA
(f2 <- fitdist(x, "tiexp", method="mle", start=list(rate=3), fix.arg=list(low=min(x), upp=max(x))))
## Fitting of the distribution ' tiexp ' by maximum likelihood 
## Parameters:
##      estimate Std. Error
## rate    0.947     0.0982
## Fixed parameters:
##     value
## low   0.5
## upp   3.0
gofstat(list(f1, f2))
## Goodness-of-fit statistics
##                              1-mle-tiexp 2-mle-tiexp
## Kolmogorov-Smirnov statistic       0.378       0.377
## Cramer-von Mises statistic         1.890       1.882
## Anderson-Darling statistic        10.222      10.193
## 
## Goodness-of-fit criteria
##                                1-mle-tiexp 2-mle-tiexp
## Akaike's Information Criterion         216         162
## Bayesian Information Criterion         224         165
cdfcomp(list(f1, f2), do.points = FALSE, addlegend=FALSE, xlim=c(0, 3.5))
curve(ptiexp(x, 1, .5, 3), add=TRUE, col="blue", lty=3)
legend("bottomright", lty=1:3, col=c("red", "green", "blue", "black"), 
        leg=c("full MLE", "MLE fixed arg", "true CDF", "emp. CDF"))

1.10 Can I fit a uniform distribution?

The uniform distribution \(\mathcal U(a,b)\) has only support parameters since the density does not have a scale or a shape parameter: \(f_U(u) = \frac{1}{b-a}1_{[a,b]}(u)\). For this distribution, we should not maximize the log-likelihood but only the likelihood. Let \((x_i)_i\) be i.i.d. observations from the \(\mathcal U(a,b)\) distribution. The likelihood is \[ L(a,b) = \prod_{i=1}^n \frac{1}{b-a} 1_{[a,b]}(x_i) = 1_{a\leq x_i \leq b, i=1,\dots,n} \left(\frac{1}{b-a}\right)^n = 1_{a\leq \min_i x_i} 1_{\max_i x_i \leq b} \left(\frac{1}{b-a}\right)^n \] Hence \(a\mapsto L(a,b)\) for any fixed \(b\in]\max_i x_i, +\infty[\) is increasing on \(]-\infty, \min_i x_i]\); similarly \(b\mapsto L(a,b)\) is decreasing for any fixed \(a\). This leads to \(\min_i x_i\) and \(\max_i x_i\) being the MLE of the uniform distribution.

We should notice that the likelihood function \(L\) is defined on \(\mathbb R^2\), yet it vanishes outside \(S=]-\infty, \min_i x_i]\times[\max_i x_i, +\infty[\). Hence, the log-likelihood is undefined outside \(S\), which is an issue when maximizing the log-likelihood.

For these reasons, fitdist(data, dist="unif", method="mle") uses the explicit form of the MLE for this distribution. Here is an example below

trueval <- c("min"=3, "max"=5)
x <- runif(n=500, trueval[1], trueval[2])

f1 <- fitdist(x, "unif")
delta <- .01
llsurface(x, "unif", plot.arg = c("min", "max"), min.arg=c(min(x)-2*delta, max(x)-delta),
          max.arg=c(min(x)+delta, max(x)+2*delta), main="likelihood surface for uniform",
        loglik=FALSE)
abline(v=min(x), h=max(x), col="grey", lty=2)
points(f1$estimate[1], f1$estimate[2], pch="x", col="red")
points(trueval[1], trueval[2], pch="+", col="blue")
legend("bottomright", pch=c("+","x"), col=c("blue","red"), c("true", "fitted"))

delta <- .2
llsurface(x, "unif", plot.arg = c("min", "max"), min.arg=c(3-2*delta, 5-delta),
          max.arg=c(3+delta, 5+2*delta), main="log-likelihood surface for uniform")
abline(v=min(x), h=max(x), col="grey", lty=2)
points(f1$estimate[1], f1$estimate[2], pch="x", col="red")
points(trueval[1], trueval[2], pch="+", col="blue")
legend("bottomright", pch=c("+","x"), col=c("blue","red"), c("true", "fitted"))

Maximizing the log-likelihood is harder and can be done by defining a new density function. Appropriate starting values and parameter bounds must be supplied. Using the closed-form expression (as in fitdist()) or maximizing the log-likelihood (with unif2) leads to very similar results.

dunif2 <- function(x, min, max) dunif(x, min, max)
punif2 <- function(q, min, max) punif(q, min, max)
f2 <- fitdist(x, "unif2", start=list(min=0, max=10), lower=c(-Inf, max(x)),
              upper=c(min(x), Inf))
print(c(logLik(f1), logLik(f2)), digits=7)
## [1] -346.0539 -346.0540
print(cbind(coef(f1), coef(f2)), digits=7)
##         [,1]     [,2]
## min 3.000684 3.000683
## max 4.998606 4.998606

1.11 Can I fit a beta distribution with the same shape parameter?

Yes, you can wrap the density function of the beta distribution so that there is only one shape parameter. Here is an example of a concave density.

x <- rbeta(1000, 3, 3)
dbeta2 <- function(x, shape, ...)
    dbeta(x, shape, shape, ...)
pbeta2 <- function(q, shape, ...)
    pbeta(q, shape, shape, ...) 
fitdist(x, "beta2", start=list(shape=1/2))
## Fitting of the distribution ' beta2 ' by maximum likelihood 
## Parameters:
##       estimate Std. Error
## shape     3.24      0.135

Another example with a U-shaped density.

x <- rbeta(1000, .3, .3)
fitdist(x, "beta2", start=list(shape=1/2), optim.method="L-BFGS-B", lower=1e-2) 
## Fitting of the distribution ' beta2 ' by maximum likelihood 
## Parameters:
##       estimate Std. Error
## shape    0.295    0.00986

1.12 How to estimate support parameters? The case of the four-parameter beta

Let us consider the four-parameter beta distribution, also known as the PERT distribution, defined by the following density for \(x\in [a,c]\) \(f_X(x) = (x-a)^{\alpha-1} (c-x)^{\beta-1}/C_N\) with \(C_N\) a normalizing constant and \(\alpha=1+d(b-a)/(c-a)\), \(\beta=1+d(c-b)/(c-a)\). \(a,c\) are support parameters, \(b\in]a,c[\) is the mode and \(d\) the shape parameter.

As for the uniform distribution, one can show that the MLEs of \(a\) and \(c\) are respectively the sample minimum and maximum. The code below illustrates two strategies: using the partial closed-form expressions via fix.arg, and the full numerical search of the MLE. NB: on small sample sizes, the latter generally has better goodness-of-fit statistics; a small positive number is subtracted from the sample minimum and added to the sample maximum when fixing the support parameters \(a\) and \(c\).

require(mc2d)
x2 <- rpert(n=2e2, min=0, mode=1, max=2, shape=3/4)
eps <- sqrt(.Machine$double.eps)
f1 <- fitdist(x2, "pert", start=list(min=-1, mode=0, max=10, shape=1),
              lower=c(-Inf, -Inf, -Inf, 0), upper=c(Inf, Inf, Inf, Inf))
## Warning in checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, : Some
## parameter names have no starting/fixed value but have a default value: mean.
f2 <- fitdist(x2, "pert", start=list(mode=1, shape=1), 
              fix.arg=list(min=min(x2)-eps, max=max(x2)+eps),
              lower=c(min(x2), 0), upper=c(max(x2), Inf))
## Warning in checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, : Some
## parameter names have no starting/fixed value but have a default value: mean.
gofstat(list(f1,f2))
## Goodness-of-fit statistics
##                              1-mle-pert 2-mle-pert
## Kolmogorov-Smirnov statistic     0.0452     0.0584
## Cramer-von Mises statistic       0.0823     0.1834
## Anderson-Darling statistic       0.5325     1.2776
## 
## Goodness-of-fit criteria
##                                1-mle-pert 2-mle-pert
## Akaike's Information Criterion        268        265
## Bayesian Information Criterion        281        272
cdfcomp(list(f1,f2))

print(cbind(coef(f1), c(f2$fix.arg["min"], coef(f2)["mode"], 
                        f2$fix.arg["max"], coef(f2)["shape"])), digits=7)
##       [,1]       [,2]       
## min   0.03106317 0.03395487 
## mode  1.120283   1.956149   
## max   1.9595     1.956234   
## shape 0.3056077  0.008646087

2 Questions regarding goodness-of-fit tests and statistics

2.1 Where can we find the results of goodness-of-fit tests?

Results of goodness-of-fit tests are not printed but are given in the object returned by gofstat(), and you can access them as described in the example below. Nevertheless, p-values are not given for every test. For the Anderson-Darling (ad), Cramer von Mises (cvm) and Kolmogorov-Smirnov (ks) tests, only the decision (rejection of H0 or not) is given, when available (see FAQ 2.3 for more details).

set.seed(1234)
x <- rgamma(n = 100, shape = 2, scale = 1)
# fit of the good distribution
fgamma <- fitdist(x, "gamma")
# fit of a bad distribution
fexp <- fitdist(x, "exp")
g <- gofstat(list(fgamma, fexp), fitnames = c("gamma", "exp"))
denscomp(list(fgamma, fexp), legendtext = c("gamma", "exp"))

# results of the tests
## chi square test (with corresponding table with theoretical and observed counts)
g$chisqpvalue
##    gamma      exp 
## 1.89e-01 7.73e-05
g$chisqtable
##           obscounts theo gamma theo exp
## <= 0.5483         9      10.06    23.66
## <= 0.8122         9       8.82     9.30
## <= 0.9592         9       5.27     4.68
## <= 1.368          9      14.63    11.37
## <= 1.523          9       5.24     3.74
## <= 1.701          9       5.73     3.97
## <= 1.94           9       7.09     4.82
## <= 2.381          9      11.08     7.50
## <= 2.842          9       9.00     6.29
## <= 3.801          9      11.93     9.28
## > 3.801          10      11.16    15.40
## Anderson-Darling test
g$adtest
##          gamma            exp 
## "not rejected"     "rejected"
## Cramer von  Mises test
g$cvmtest
##          gamma            exp 
## "not rejected"     "rejected"
## Kolmogorov-Smirnov test
g$kstest
##          gamma            exp 
## "not rejected"     "rejected"

2.2 Is it reasonable to use goodness-of-fit tests to validate the fit of a distribution?

In the first versions of fitdistrplus, when they were available, the results of the GOF tests (AD, KS, CvM) were automatically printed. We decided to suppress this automatic printing when we realized that some users had difficulties interpreting the results of those tests and sometimes misused them.

Goodness-of-fit tests often appear as objective tools to decide whether a fitted distribution describes a data set well. But they are not! It would not be reasonable at all to reject a distribution just because a goodness-of-fit test rejects it (see FAQ 2.2.1). Nor would it be reasonable to validate a distribution because goodness-of-fit tests do not reject it (see FAQ 2.2.2).

A fitted distribution should be evaluated using graphical methods: the goodness-of-fit graphs automatically provided in our package by plotting the result of the fit (output of fitdist() or fitdistcens()), and the complementary graphs that help to compare different fits - see ?graphcomp. We really think it is the most appropriate way to evaluate the adequacy of a fit, and we are not the only ones to recommend it. You can find the same type of recommendations in reference books.
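As a minimal sketch of this graphical approach (reusing fgamma and fexp fitted in FAQ 2.1):

plot(fgamma)   # density, CDF, Q-Q and P-P goodness-of-fit plots for one fit
# graphical comparison of several fits on the same data set
cdfcomp(list(fgamma, fexp), legendtext = c("gamma", "exp"))
qqcomp(list(fgamma, fexp), legendtext = c("gamma", "exp"))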

Moreover, the selection of a distribution should also be driven by knowledge of underlying processes when available. For example when a variable cannot be negative, one would be very cautious while fitting a normal distribution, that potentially gives negative values, even if the observed data of this variable seem well fitted by a normal distribution.

2.2.1 Should I reject a distribution because a goodness-of-fit test rejects it?

No, it would not be reasonable at all to reject a distribution just because a goodness-of-fit test rejects it, especially in the case of big samples. In real life, as soon as you have a sufficient amount of data, you will reject the fitted distribution. We know that a model cannot perfectly describe real data, and generally the true question is to find the best distribution among a pool of simple parametric distributions to describe the data, that is, to compare different models (see FAQ 2.4 and 2.5 for corresponding questions).

To illustrate this point, let us comment on the example presented below. We drew two samples from the same Poisson distribution with a mean parameter equal to 100. In many applications, for this value of its parameter, the Poisson distribution would be considered to be well approximated by a normal distribution. Testing the fit of the normal distribution (here using a Kolmogorov-Smirnov test) on a sample of 100 observations would not reject the normal fit, while testing it on a sample of 10000 observations would reject it, even though both samples come from the same distribution.

set.seed(1234)
x1 <- rpois(n = 100, lambda = 100)
f1 <- fitdist(x1, "norm")
g1 <- gofstat(f1)
g1$kstest
##     1-mle-norm 
## "not rejected"
x2 <- rpois(n = 10000, lambda = 100)
f2 <- fitdist(x2, "norm")
g2 <- gofstat(f2)
g2$kstest
## 1-mle-norm 
## "rejected"
par(mfrow=1:2)
denscomp(f1, demp = TRUE, addlegend = FALSE, main = "small sample")
denscomp(f2, demp = TRUE, addlegend = FALSE, main = "big sample")

2.2.2 Should I accept a distribution because goodness-of-fit tests do not reject it?

No, it would not be reasonable at all to validate a distribution because goodness-of-fit tests do not reject it. Like all other hypothesis tests, goodness-of-fit tests lack statistical power when the sample size is small. And the different goodness-of-fit tests are not equally sensitive to different types of deviation between empirical and fitted distributions. For example, the Kolmogorov-Smirnov test is sensitive when distributions differ in a global fashion near the centre of the distribution. The Anderson-Darling test will be more sensitive when distributions differ in their tails, and the Cramer von Mises test will be more sensitive when there are small but repetitive differences between the empirical and theoretical distribution functions.

The sensitivity of a chi-squared test will depend on the definition of classes, and even if we propose a default definition of classes when the user does not provide them, this choice is not obvious and could impact the results of the test. This test is more appropriate when data are discrete, even if they are modelled by a continuous distribution, as in the following example. Two samples of respective sizes 500 and 50 are drawn from a Poisson distribution of mean parameter equal to 1 (not a sufficiently high value to consider that the Poisson distribution could be approximated by a normal one). Using a Kolmogorov-Smirnov test, the normal fit is rejected only for the bigger sample. It is not rejected for the smaller sample, even though the fit could be rejected after a simple visual confrontation of the distributions. In that particular case, the chi-squared test with classes defined by default would have rejected the normal fit for both samples.

set.seed(1234)
x3 <- rpois(n = 500, lambda = 1)
f3 <- fitdist(x3, "norm")
g3 <- gofstat(f3)
g3$kstest
## 1-mle-norm 
## "rejected"
x4 <- rpois(n = 50, lambda = 1)
f4 <- fitdist(x4, "norm")
g4 <- gofstat(f4)
g4$kstest
##     1-mle-norm 
## "not rejected"
par(mfrow=1:2)
denscomp(f3, addlegend = FALSE, main = "big sample") 
denscomp(f4, addlegend = FALSE, main = "small sample")

g3$chisqtable
##      obscounts theocounts
## <= 0     180.0       80.3
## <= 1     187.0      163.5
## <= 2      87.0      168.1
## <= 3      32.0       73.4
## > 3       14.0       14.7
g3$chisqpvalue
## [1] 7.11e-42
g4$chisqtable
##      obscounts theocounts
## <= 0     14.00       5.46
## <= 1     15.00      14.23
## <= 2     15.00      18.09
## > 2       6.00      12.22
g4$chisqpvalue
## [1] 3.57e-05

2.3 Why are all goodness-of-fit tests not available for every distribution?

The chi-squared test is available for any distribution, but one must be conscious that its result depends on the definition of the cells in which observed data are grouped, and a correct definition is not possible with too small a sample.

Concerning the Kolmogorov-Smirnov test, it is proposed for any continuous distribution, but with a critical value corresponding to the comparison of the empirical distribution to a fully specified distribution. As the distribution is not fully known for a fitted distribution, the result of this test is subject to caution, since there is no general asymptotic theory for the Kolmogorov-Smirnov statistic in the case of a fitted distribution. Nevertheless, one can use Monte Carlo methods to conduct Kolmogorov-Smirnov goodness-of-fit tests in cases where the sample is used to estimate model parameters. Such a method is implemented in the R package KScorrect for a variety of continuous distributions.

Such an asymptotic theory was proposed for quadratic statistics for some distributions (Anderson-Darling, Cramer von Mises). The reference book we used on this subject (Tests based on edf statistics by Stephens MA in Goodness-of-fit techniques by D’Agostino RB and Stephens MA) proposes critical values of those statistics for some classical distributions (exponential, gamma, Weibull, logistic, Cauchy, normal and lognormal). But the asymptotic theory about these statistics also depends on the way the parameters are estimated. And as they were not estimated by maximum likelihood for the Cauchy, normal and lognormal distributions in the results reported by Stephens, we only propose the results of the Anderson-Darling and Cramer von Mises tests for the exponential, gamma, Weibull and logistic distributions.

The user can refer to the cited books and use the proposed formulas to estimate the parameters of the Cauchy, normal and lognormal distributions, then apply the tests using the critical values given in the book. The R packages goftest and ADGofTest could also be explored by users who would like to apply Anderson-Darling and Cramer von Mises tests to other distributions. But at this time we are not sure that the case where parameters are unknown (estimated by maximum likelihood) is tackled in those two packages.

Concerning the development of our package, rather than developing further goodness-of-fit tests, we chose to develop graphical tools to help appreciate the quality of a fit and to compare the fits of different distributions on the same data set (see FAQ 2.2 for argumentation).

2.4 How can we use goodness-of-fit statistics to compare the fit of different distributions on the same data set?

Goodness-of-fit statistics based on the empirical distribution function (Kolmogorov-Smirnov, Anderson-Darling and Cramer von Mises) may be used to measure a distance between the fitted distribution and the empirical distribution. So if one wants to compare the fit of various distributions on the same data set, the smaller those statistics, the better. The Kolmogorov-Smirnov statistic will be sensitive when distributions differ in a global fashion near the centre of the distribution, while the Anderson-Darling statistic will be more sensitive when distributions differ in their tails, and the Cramer von Mises statistic will be more sensitive when there are small but repetitive differences between the empirical and theoretical distribution functions.

But as mentioned in the main vignette of our package, the use of the Anderson-Darling statistic to compare the fit of different distributions is subject to caution due to the weighting of the quadratic distance between the fitted and empirical distribution functions, which depends on the parametric distribution. Moreover, statistics based on the empirical distribution function do not penalize distributions with a greater number of parameters, and as those are generally more flexible, this could induce over-fitting.

Goodness-of-fit statistics based on information criteria (AIC, BIC) correspond to deviance penalized by the complexity of the model (the number of parameters of the distribution), and so the smaller, the better. As more generic statistics, they are not adapted to focus on a part of the fitted distribution, but they take into account the complexity of the distribution and thus could help to prevent overfitting.
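As a minimal sketch (again reusing fgamma and fexp from FAQ 2.1), all these statistics and criteria are returned by gofstat():

g <- gofstat(list(fgamma, fexp), fitnames = c("gamma", "exp"))
# EDF-based statistics: the smaller, the better
g$ks
g$cvm
g$ad
# information criteria, penalizing the number of parameters
g$aic
g$bic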

2.5 Can we use a test to compare the fit of two distributions on the same data set?

In our package we did not implement such a test, but for two nested distributions (when one is a special case of the other, e.g. the exponential and gamma distributions) a likelihood ratio test can be easily implemented using the log-likelihood provided by fitdist or fitdistcens. Denoting \(L\) the maximum likelihood obtained with the complete distribution and \(L_0\) the one obtained with the simplified distribution, when the sample size increases, \(- 2 \ln(\frac{L_0}{L}) = 2 \ln(L) - 2 \ln(L_0)\) tends to a chi-squared distribution with degrees of freedom equal to the difference in the numbers of parameters characterizing the two nested distributions. You will find below an example of such a test.

set.seed(1234)
g <- rgamma(100, shape = 2, rate = 1)
(f <- fitdist(g, "gamma"))
## Fitting of the distribution ' gamma ' by maximum likelihood 
## Parameters:
##       estimate Std. Error
## shape    2.025      0.266
## rate     0.997      0.149
(f0 <- fitdist(g, "exp"))
## Fitting of the distribution ' exp ' by maximum likelihood 
## Parameters:
##      estimate Std. Error
## rate    0.492     0.0492
L <- logLik(f)
k <- length(f$estimate) # number of parameters of the complete distribution
L0 <- logLik(f0)
k0 <- length(f0$estimate) # number of parameters of the simplified distribution
(stat <- 2*L - 2*L0)
## [1] 23.9
(critical_value <- qchisq(0.95, df = k - k0))
## [1] 3.84
(rejected <- stat > critical_value)
## [1] TRUE
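
Equivalently, a p-value can be derived from the Chi squared distribution; a minimal sketch reusing the objects computed above:

# p-value of the likelihood ratio test; here it is far below 0.05,
# so the exponential simplification is rejected
pchisq(stat, df = k - k0, lower.tail = FALSE)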

Such a test can also be used for fits on censored data.

2.6 Can we get goodness-of-fit statistics for a fit on censored data ?

Function gofstat is not yet proposed in our package for fits on censored data, but developing it is one of our objectives for the future. Published works on goodness-of-fit statistics based on the empirical distribution function for censored data generally focused on data containing only one type of censoring (e.g. right censored data in survival analysis). Building such statistics in the general case, with data combining several types of censoring at the same time (right, left and interval censoring), remains tricky.

Nevertheless, it is possible, for any type of censored data, to use information criteria (AIC and BIC given in the summary of an object of class fitdistcens) to compare the fits of various distributions to a same data set.

3 Questions regarding optimization procedures

3.1 How to choose optimization method?

If you want to perform optimization without bounds, optim() is used. You can try the derivative-free Nelder-Mead method and the Hessian-free BFGS method. If you want to perform optimization with bounds, only two methods are available without providing the gradient of the objective function: Nelder-Mead via constrOptim() and bounded BFGS (L-BFGS-B) via optim(). In both cases, see the help of mledist() and the vignette on optimization algorithms.
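
As a minimal sketch (with simulated data and bounds chosen only for illustration), the method and the bounds are selected through the optim.method, lower and upper arguments of fitdist().

set.seed(1234)
x <- rgamma(100, shape = 2, rate = 1)
# unconstrained fit with the derivative-free Nelder-Mead method
f1 <- fitdist(x, "gamma", optim.method = "Nelder-Mead")
# bounded fit with L-BFGS-B via optim()
f2 <- fitdist(x, "gamma", optim.method = "L-BFGS-B", lower = c(0, 0))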

3.2 The optimization algorithm stops with error code 100. What shall I do?

First, add traces by adding control=list(trace=1, REPORT=1) in the call. Second, try to set bounds for the parameters. Third, find better starting values (see FAQ 1.3).
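
These three tips can be combined in a single call; a minimal sketch, assuming a data vector x and a gamma candidate (both hypothetical):

# trace the objective at each iteration, bound the parameters
# and supply explicit starting values
f <- fitdist(x, "gamma", optim.method = "L-BFGS-B", lower = c(0, 0),
             start = list(shape = 1, rate = 1),
             control = list(trace = 1, REPORT = 1))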

3.3 Why may a distribution with a log argument converge better?

Say we study the shifted lognormal distribution defined by the following density \[ f(x) = \frac{1}{(x+\delta) \sigma \sqrt{2 \pi}} \exp\left(- \frac{(\ln (x+\delta)- \mu)^2}{2\sigma^2}\right) \] for \(x>-\delta\), where \(\mu\) is a location parameter, \(\sigma\) a scale parameter and \(\delta\) a boundary parameter. Let us fit this distribution on the dataset y by MLE. We define two functions for the density, with and without a log argument.

dshiftlnorm <- function(x, mean, sigma, shift, log = FALSE) dlnorm(x+shift, mean, sigma, log=log)
pshiftlnorm <- function(q, mean, sigma, shift, log.p = FALSE) plnorm(q+shift, mean, sigma, log.p=log.p)
qshiftlnorm <- function(p, mean, sigma, shift, log.p = FALSE) qlnorm(p, mean, sigma, log.p=log.p)-shift
dshiftlnorm_no <- function(x, mean, sigma, shift) dshiftlnorm(x, mean, sigma, shift)
pshiftlnorm_no <- function(q, mean, sigma, shift) pshiftlnorm(q, mean, sigma, shift)

We now optimize the minus log-likelihood.

data(dataFAQlog1)
y <- dataFAQlog1
D <- 1-min(y)
f0 <- fitdist(y+D, "lnorm")
start <- list(mean=as.numeric(f0$estimate["meanlog"]),  
              sigma=as.numeric(f0$estimate["sdlog"]), shift=D)
# works with BFGS, but not Nelder-Mead
f <- fitdist(y, "shiftlnorm", start=start, optim.method="BFGS")
summary(f)
## Fitting of the distribution ' shiftlnorm ' by maximum likelihood 
## Parameters : 
##       estimate Std. Error
## mean    -1.386    0.02401
## sigma    0.071    0.00192
## shift    0.248    0.00598
## Loglikelihood:  8299   AIC:  -16591   BIC:  -16573 
## Correlation matrix:
##         mean  sigma  shift
## mean   1.000 -0.885  0.999
## sigma -0.885  1.000 -0.886
## shift  0.999 -0.886  1.000

If we don't use the log argument, the algorithm stalls.

f2 <- try(fitdist(y, "shiftlnorm_no", start=start, optim.method="BFGS"))
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, ddistnam = ddistname, hessian = TRUE, method = meth,     lower = lower, upper = upper, ...): différences finies ayant des valeurs infinies [2]>
## Error in fitdist(y, "shiftlnorm_no", start = start, optim.method = "BFGS") : 
##   the function mle failed to estimate the parameters, 
##                 with the error code 100
print(attr(f2, "condition"))
## <simpleError in fitdist(y, "shiftlnorm_no", start = start, optim.method = "BFGS"): the function mle failed to estimate the parameters, 
##                 with the error code 100
## >

Indeed the algorithm stops because the log-likelihood is infinite at the following parameter values (the French error message reports infinite values in the finite-difference approximation of the gradient).

sum(log(dshiftlnorm_no(y, 0.16383978, 0.01679231, 1.17586600 )))
## [1] -Inf
log(prod(dshiftlnorm_no(y, 0.16383978, 0.01679231, 1.17586600 )))
## [1] -Inf
sum(dshiftlnorm(y, 0.16383978, 0.01679231, 1.17586600, TRUE ))
## [1] 7761

There is something wrong in the computation: without the log argument, some density values underflow to zero, so their logarithm is \(-\infty\), whereas the log-density computed directly remains finite.

Only the implementation using the log argument seems reliable. This happens because the C-level implementation of dlnorm directly handles the log value. In the file ../src/nmath/dlnorm.c of the R sources, we find the C code for dlnorm:

double dlnorm(double x, double meanlog, double sdlog, int give_log)
{
    double y;

#ifdef IEEE_754
    if (ISNAN(x) || ISNAN(meanlog) || ISNAN(sdlog))
    return x + meanlog + sdlog;
#endif
    if(sdlog <= 0) {
    if(sdlog < 0) ML_ERR_return_NAN;
    // sdlog == 0 :
    return (log(x) == meanlog) ? ML_POSINF : R_D__0;
    }
    if(x <= 0) return R_D__0;

    y = (log(x) - meanlog) / sdlog;
    return (give_log ?
        -(M_LN_SQRT_2PI   + 0.5 * y * y + log(x * sdlog)) :
        M_1_SQRT_2PI * exp(-0.5 * y * y)  /  (x * sdlog));
    /* M_1_SQRT_2PI = 1 / sqrt(2 * pi) */

}

In the last four lines with the logical condition give_log?, we see how the log argument is handled:

-(M_LN_SQRT_2PI   + 0.5 * y * y + log(x * sdlog))
M_1_SQRT_2PI * exp(-0.5 * y * y)  /  (x * sdlog)

Note that the constant \(\log(\sqrt{2\pi})\) is pre-computed in the C macro M_LN_SQRT_2PI.

In order to sort out this problem, we use constrOptim(), which wraps optim() to take linear constraints into account. This also allows the use of optimization methods other than L-BFGS-B (limited-memory bounded BFGS), the bounded method used by optim().

f2 <- fitdist(y, "shiftlnorm", start=start, lower=c(-Inf, 0, -min(y)), optim.method="Nelder-Mead")
summary(f2)
## Fitting of the distribution ' shiftlnorm ' by maximum likelihood 
## Parameters : 
##       estimate Std. Error
## mean   -1.3873         NA
## sigma   0.0711         NA
## shift   0.2481         NA
## Loglikelihood:  8299   AIC:  -16591   BIC:  -16573 
## Correlation matrix:
## [1] NA
print(cbind(BFGS=f$estimate, NelderMead=f2$estimate))
##         BFGS NelderMead
## mean  -1.386    -1.3873
## sigma  0.071     0.0711
## shift  0.248     0.2481

Another possible solution would be to perform all computations with higher precision arithmetic as implemented in the package Rmpfr using the MPFR library.

3.4 What to do when there is a scaling issue?

Let us consider a dataset which has particularly small values.

data(dataFAQscale1)
head(dataFAQscale1)
## [1] -0.007077 -0.000947 -0.001898 -0.000475 -0.001902 -0.000476
summary(dataFAQscale1)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.00708 -0.00143 -0.00047 -0.00031  0.00096  0.00428

The only way to sort this out is to rescale the dataset, multiplying it by a large value.

for(i in 6:0)
cat(10^i, try(mledist(dataFAQscale1*10^i, "cauchy")$estimate), "\n")
## 1e+06 -290 1194 
## 1e+05 -29 119 
## 10000 -2.9 11.9 
## 1000 -0.29 1.19 
## 100 -0.029 0.119 
## 10 -0.0029 0.0119 
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, ddistnam = ddistname, hessian = TRUE, method = meth,     lower = lower, upper = upper, ...): différences finies ayant des valeurs infinies [2]>
## 1 NA NA

Let us consider a dataset which has particularly large values.

data(dataFAQscale2)
head(dataFAQscale2)
## [1] 1.40e+09 1.41e+09 1.43e+09 1.44e+09 1.49e+09 1.57e+09
summary(dataFAQscale2)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 1.40e+09 1.58e+09 2.24e+09 2.55e+09 3.39e+09 4.49e+09

The only way to sort this out is to rescale the dataset, multiplying it by a small value.

for(i in 0:5)
cat(10^(-2*i), try(mledist(dataFAQscale2*10^(-2*i), "cauchy")$estimate), "\n")
## 1 2.03e+09 6.59e+08 
## 0.01 20283641 6594932 
## 1e-04 202836 65949 
## 1e-06 2028 659 
## 1e-08 20.3 6.59 
## 1e-10 0.203 0.0659
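
Do not forget to transform the estimates back afterwards. A minimal sketch, assuming a location-scale family such as the Cauchy (both parameters then scale linearly with the data):

scaling <- 1e-8
hat <- mledist(dataFAQscale2 * scaling, "cauchy")$estimate
# back-transform the location and scale estimates to the original scale
hat / scaling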

3.5 How do I set bounds on parameters when optimizing?

3.5.1 Setting bounds for scale parameters

Consider the normal distribution \(\mathcal{N}(\mu, \sigma^2)\) defined by the density \[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), x\in\mathbb{R}, \] where \(\mu\) is a location parameter such that \(\mu\in\mathbb{R}\) and \(\sigma^2\) is a scale parameter such that \(\sigma^2>0\). Therefore the constraint \(\sigma>0\) must be satisfied when optimizing the log-likelihood, the squared differences or the GoF statistics. Setting a lower bound for the scale parameter is easy with fitdist: just use the lower argument.

set.seed(1234)
x <- rnorm(1000, 1, 2)
fitdist(x, "norm", lower=c(-Inf, 0))
## Fitting of the distribution ' norm ' by maximum likelihood 
## Parameters:
##      estimate Std. Error
## mean    0.947         NA
## sd      1.994         NA

3.5.2 Setting bounds for shape parameters

Consider the Burr distribution \(\mathcal B(a, b, s)\) defined by the density \[ f(x) = \frac{a b (x/s)^b}{x [1 + (x/s)^b]^{a + 1}}, x>0, \] where \(a,b\) are shape parameters such that \(a,b>0\) and \(s\) is a scale parameter such that \(s>0\). All three parameters can thus be bounded below by zero via the lower argument.

library(actuar) # provides the d, p, q, r functions for the Burr distribution
x <- rburr(1000, 1, 2, 3)
fitdist(x, "burr", lower=c(0, 0, 0), start=list(shape1 = 1, shape2 = 1, 
  rate = 1))
## Fitting of the distribution ' burr ' by maximum likelihood 
## Parameters:
##        estimate Std. Error
## shape1    0.969         NA
## shape2    2.051         NA
## rate      3.180         NA

3.5.3 Setting bounds for probability parameters

Consider the geometric distribution \(\mathcal G(p)\) defined by the probability mass function \[ f(x) = p(1-p)^x, x\in\mathbb{N}, \] where \(p\) is a probability parameter such that \(p\in[0,1]\).

x <- rgeom(1000, 1/4)
fitdist(x, "geom", lower=0, upper=1)
## Fitting of the distribution ' geom ' by maximum likelihood 
## Parameters:
##      estimate Std. Error
## prob    0.242    0.00666

3.5.4 Setting bounds for boundary parameters

Consider the shifted exponential distribution \(\mathcal E(\mu,\lambda)\) defined by the density \[ f(x) = \lambda \exp(-\lambda(x-\mu)), x>\mu, \] where \(\lambda\) is a rate parameter such that \(\lambda>0\) and \(\mu\) is a boundary (or shift) parameter such that \(\mu\in\mathbb{R}\). When optimizing the log-likelihood, the boundary constraint is \[ \forall i=1,\dots,n,\ x_i>\mu \Leftrightarrow \min_{i=1,\dots,n} x_i > \mu \Leftrightarrow \mu < \min_{i=1,\dots,n} x_i, \] that is, an upper bound on \(\mu\). Note that when optimizing the squared differences or the GoF statistics, this constraint may not be necessary. Let us do it in R.

dsexp <- function(x, rate, shift)
  dexp(x-shift, rate=rate)
psexp <- function(x, rate, shift)
  pexp(x-shift, rate=rate)
rsexp <- function(n, rate, shift)
  rexp(n, rate=rate)+shift
x <- rsexp(1000, 1/4, 1)
fitdist(x, "sexp", start=list(rate=1, shift=0), lower= c(0, -min(x)))
## Fitting of the distribution ' sexp ' by maximum likelihood 
## Parameters:
##       estimate Std. Error
## rate     0.248         NA
## shift    1.005         NA

3.5.5 Setting linear inequality bounds

For some distributions, bounds between parameters are not independent. For instance, the normal inverse Gaussian distribution (\(\mu, \delta, \alpha, \beta\) parametrization) has the following parameter constraints, which can be reformulated as a linear inequality: \[ \left\{ \begin{array}{l}\alpha > 0\\ \delta >0\\ \alpha > |\beta|\end{array} \right. \Leftrightarrow \underbrace{ \left( \begin{matrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 1 & 1 \\ \end{matrix} \right) }_{ui} \left( \begin{matrix} \mu\\ \delta\\ \alpha \\ \beta \\ \end{matrix} \right) \geq \underbrace{ \left( \begin{matrix} 0\\ 0\\ 0 \\ 0 \\ \end{matrix} \right)}_{ci}. \] These constraints can be carried out via constrOptim() and the arguments ci and ui. Here is an example

library(GeneralizedHyperbolic)
myoptim <- function(fn, par, ui, ci, ...)
{
  res <- constrOptim(f=fn, theta=par, method="Nelder-Mead", ui=ui, ci=ci, ...)
  c(res, convergence=res$convergence, value=res$objective, par=res$minimum, hessian=res$hessian)
}
x <- rnig(1000, 3, 1/2, 1/2, 1/4)
ui <- rbind(c(0,1,0,0), c(0,0,1,0), c(0,0,1,-1), c(0,0,1,1))
ci <- c(0,0,0,0)
fitdist(x, "nig", custom.optim=myoptim, ui=ui, ci=ci, start=list(mu = 0, delta = 1, alpha = 1, beta = 0))
## Warning in fitdist(x, "nig", custom.optim = myoptim, ui = ui, ci = ci, start =
## list(mu = 0, : The dnig function should return a vector of with NaN values when
## input has inconsistent parameters and not raise an error
## Warning in fitdist(x, "nig", custom.optim = myoptim, ui = ui, ci = ci, start =
## list(mu = 0, : The pnig function should return a vector of with NaN values when
## input has inconsistent values and not raise an error
## Fitting of the distribution ' nig ' by maximum likelihood 
## Parameters:
##       estimate Std. Error
## mu       2.985         NA
## delta    0.457         NA
## alpha    0.466         NA
## beta     0.237         NA

3.6 How does quantile matching estimation work for discrete distributions?

Let us consider the geometric distribution with values in \(\{0,1,2,3,\dots\}\). The probability mass function, the cumulative distribution function and the quantile function are \[ P(X=x)= p (1-p)^{\lfloor x\rfloor}, F_X(x) = 1- (1-p)^{\lfloor x\rfloor}, F_X^{-1}(q) = \left\lfloor\frac{\log(1-q)}{\log(1-p)}\right\rfloor. \] Due to the integer part (floor function), both the distribution function and the quantile function are step functions.

pgeom(0:3, prob=1/2)
## [1] 0.500 0.750 0.875 0.938
qgeom(c(0.3, 0.6, 0.9), prob=1/2)
## [1] 0 1 3
par(mar=c(4,4,2,1), mfrow=1:2)
curve(pgeom(x, prob=1/2), 0, 10, n=301, main="c.d.f.")
curve(qgeom(x, prob=1/2), 0, 1, n=301, main="q.f.")

Now we study the QME for the geometric distribution. Since we have only one parameter, we match a single quantile, the median (probability \(1/2\)). The theoretical median is the following integer \[ F_X^{-1}(1/2) = \left\lfloor\frac{\log(1/2)}{\log(1-p)}\right\rfloor. \] Note that the theoretical median of a discrete distribution is an integer. Empirically, the median may not be an integer: for an even-length dataset, the empirical median is \[ q_{n,1/2} = \frac{x_{n/2}^\star + x_{n/2+1}^\star}{2}, \] where \(x_{1}^\star<\dots<x_{n}^\star\) is the sorted sample, which is not an integer value if \(x_{n/2}^\star + x_{n/2+1}^\star\) is not an even number. However, for an odd-length dataset, the empirical median is the integer \(q_{n,1/2}=x_{(n+1)/2}^\star\).

x <- c(0, 0, 0, 0, 1, 1, 3, 2, 1, 0, 0)
median(x[-1]) #sample size 10
## [1] 0.5
median(x) #sample size 11
## [1] 0

Therefore, a first issue arises: if the median is not an integer, it is impossible to match exactly the empirical median with the theoretical quantile.

Furthermore, the second issue is the non-uniqueness of the solution. Assuming that the empirical median \(q_{n,1/2}\) is an integer, QME aims to find some \(p\) such that \[ \left\lfloor\frac{\log(1/2)}{\log(1-p)}\right\rfloor = q_{n,1/2} \Leftrightarrow q_{n,1/2} \leq \frac{\log(1/2)}{\log(1-p)} < q_{n,1/2} +1. \] Let us plot the squared difference \((F_X^{-1}(1/2) - q_{n,1/2})^2\).

x <- rgeom(100, 1/3)
L2 <- function(p)
  (qgeom(1/2, p) - median(x))^2
L2(1/3) #theoretical value
## [1] 0
curve(L2(x), 0.10, 0.95, xlab=expression(p), ylab=expression(L2(p)), main="squared differences", n=301)

Any value in the interval [1/3, 5/9] minimizes the squared difference. Therefore, fitdist() may be sensitive to the chosen initial value with a deterministic optimization algorithm.

fitdist(x, "geom", method="qme", probs=1/2, start=list(prob=1/2), control=list(trace=1, REPORT=1))
## initial  value 1.000000 
## iter   2 value 0.000000
## iter   2 value 0.000000
## iter   2 value 0.000000
## final  value 0.000000 
## converged
## Fitting of the distribution ' geom ' by matching quantiles 
## Parameters:
##      estimate
## prob     0.34
fitdist(x, "geom", method="qme", probs=1/2, start=list(prob=1/20), control=list(trace=1, REPORT=1))
## initial  value 144.000000 
## iter   1 value 144.000000
## final  value 144.000000 
## converged
## Fitting of the distribution ' geom ' by matching quantiles 
## Parameters:
##      estimate
## prob     0.05

The solution is to use a stochastic algorithm such as simulated annealing (SANN).

fitdist(x, "geom", method="qme", probs=1/2, optim.method="SANN", start=list(prob=1/20))
## Fitting of the distribution ' geom ' by matching quantiles 
## Parameters:
##      estimate
## prob    0.497
fitdist(x, "geom", method="qme", probs=1/2, optim.method="SANN", start=list(prob=1/2))
## Fitting of the distribution ' geom ' by matching quantiles 
## Parameters:
##      estimate
## prob    0.401

Let us consider the Poisson distribution defined by the following mass probability and the cumulative distribution functions \[ P(X=k)=\frac{\lambda^k}{k!}\exp(-\lambda),~ F_X(x) = \exp(-\lambda)\sum_{k=0}^{\lfloor x \rfloor}\frac{\lambda^k}{k!},~ x\geq 0. \] The quantile function \(F_X^{-1}(p)=\inf(x, F_X(x)\geq p)\) simplifies to \[ F_X^{-1}(p) = i \text{ such that } \sum_{k=0}^{i-1} P(X=k) \leq p < \sum_{k=0}^{i} P(X=k). \] Again, the quantile function is a step function \[ F_X^{-1}(p) = \left\{ \begin{array}{ll} 0 & \text{if } p < P(X=0) \\ 1 & \text{if } P(X=0) \leq p < P(X=0)+P(X=1) \\ 2 & \text{if } P(X=0)+P(X=1) \leq p < P(X=0)+P(X=1)+P(X=2) \\ \dots \\ i & \text{if } \sum_{k=0}^{i-1} P(X=k) \leq p < \sum_{k=0}^{i} P(X=k) \\ \dots \\ \end{array} \right. \]

Again, the squared difference \((F_X^{-1}(1/2) - q_{n,1/2})^2\) is a step function of \(\lambda\).

par(mar=c(4,4,2,1))
x <- rpois(100, lambda=7.5)
L2 <- function(lam)
  (qpois(1/2, lambda = lam) - median(x))^2
curve(L2(x), 6, 9, xlab=expression(lambda), ylab=expression(L2(lambda)), main="squared differences", n=201)

Therefore, the result of fitdist() may be sensitive to the chosen initial value.

fitdist(x, "pois", method="qme", probs=1/2, start=list(lambda=2))
## Fitting of the distribution ' pois ' by matching quantiles 
## Parameters:
##        estimate
## lambda        2
fitdist(x, "pois", method="qme", probs=1/2, optim.method="SANN", start=list(lambda=2))
## Fitting of the distribution ' pois ' by matching quantiles 
## Parameters:
##        estimate
## lambda     6.73

4 Questions regarding uncertainty

4.1 Can we compute marginal confidence intervals on parameter estimates from their reported standard error ?

In statistics, deriving marginal confidence intervals on MLE parameter estimates using the approximation of their standard errors (calculated from the Hessian) is a quite common procedure. It is based on the Wald approximation, which states that when the sample size \(n\) is sufficiently high, the marginal \(95\%\) confidence interval on the ith component \(\theta_i\) of a model parameter \(\theta\) estimated by maximum likelihood (estimate denoted \(\hat \theta\)) can be approximated by \(\hat \theta_i \pm 1.96 \times SE(\hat \theta_i )\), with \(SE(\hat \theta_i )=\sqrt{V_{ii}}\) the square root of the ith diagonal term of the covariance matrix \(V\) of the estimates. \(V\) is generally approximated by the inverse of the Fisher information matrix \(I(\hat \theta)\). The (observed) Fisher information matrix corresponds to the negative of the Hessian matrix of the loglikelihood evaluated at the MLE. Let us recall that the Hessian matrix is defined by \(H_{ij}(y, \theta) = \frac{\partial^2 L(y, \theta)}{\partial \theta_i \partial \theta_j}\), with \(L(y, \theta)\) the loglikelihood function for data \(y\) and parameter \(\theta\).

Before using this approximation, one must keep in mind that its validity does not only depend on the sample size. It also strongly depends on the data, on the distribution, and on the parameterization of the distribution. For this reason we recommend potential users of the Wald approximation to compare its results to the ones obtained using the bootstrap procedure (see below) before using it. A look at the loglikelihood contours is also interesting, as the Wald approximation assumes elliptical contours. In a more general context, we recommend the use of bootstrap to compute confidence intervals on parameters or on any function of those parameters.

Below you will find two examples, one for which the Wald confidence intervals seem correct and one for which they give wrong results, with bounds even outside the possible parameter range (a negative lower bound for the rate of the gamma distribution).

set.seed(1234)
n <- rnorm(30, mean = 10, sd = 2)
fn <- fitdist(n, "norm")
bn <- bootdist(fn)
bn$CI
##      Median 2.5% 97.5%
## mean   9.41 8.78 10.02
## sd     1.73 1.33  2.15
fn$estimate + cbind("estimate"= 0, "2.5%"= -1.96*fn$sd, "97.5%"= 1.96*fn$sd)
##      estimate 2.5% 97.5%
## mean     9.41 8.77 10.04
## sd       1.78 1.33  2.22
llplot(fn, back.col = FALSE)

set.seed(1234)
g <- rgamma(30, shape = 0.1, rate = 10)
fg <- fitdist(g, "gamma")
bg <- bootdist(fg)
bg$CI
##        Median   2.5%   97.5%
## shape  0.0923 0.0636   0.145
## rate  30.0782 9.6306 146.660
fg$estimate + cbind("estimate"= 0, "2.5%"= -1.96*fg$sd, "97.5%"= 1.96*fg$sd)
##       estimate    2.5%  97.5%
## shape   0.0882  0.0553  0.121
## rate   24.2965 -6.3504 54.943
llplot(fg, back.col = FALSE)

4.2 How can we compute confidence intervals on quantiles from the fit of a distribution ?

The quantile() function can be used to calculate any quantile from a fitted distribution when it is called with an object of class fitdist or fitdistcens as the first argument. When called with an object of class bootdist or bootdistcens as the first argument, quantiles are returned together with a confidence interval calculated using the bootstrapped sample of parameters. Moreover, you can use the CIcdfplot() function to plot the fitted distribution as a CDF curve surrounded by a band corresponding to pointwise intervals on the quantiles. See an example below on censored data corresponding to 72-hour acute salinity tolerance (LC50 values) of riverine macro-invertebrates.

data(salinity)
log10LC50 <-log10(salinity)
fit <- fitdistcens(log10LC50, "norm")
# Bootstrap 
bootsample <- bootdistcens(fit, niter = 101)
#### We used only 101 iterations in that example to limit the calculation time but
#### in practice you should take at least 1001 bootstrap iterations
# Calculation of the quantile of interest (here the 5 percent hazard concentration)
(HC5 <- quantile(bootsample, probs = 0.05))
## (original) estimated quantiles for each specified probability (censored data)
##          p=0.05
## estimate   1.12
## Median of bootstrap estimates
##          p=0.05
## estimate   1.12
## 
## two-sided 95 % CI of each quantile
##        p=0.05
## 2.5 %    1.05
## 97.5 %   1.20
# visualizing pointwise confidence intervals on other quantiles
CIcdfplot(bootsample, CI.output = "quantile", CI.fill = "pink", xlim = c(0.5,2), main = "")

4.3 How can we compute confidence intervals on any function of the parameters of the fitted distribution ?

The bootstrap sample of parameter estimates can be used to calculate a bootstrap sample of any variable defined as a function of the parameters of the fitted distribution. From such a bootstrap sample you can easily compute a confidence interval using percentiles. Below is an example that uses the bootstrap sample of parameters from the previous example (FAQ 4.2) to calculate a 95 percent confidence interval on the Potentially Affected Fraction (PAF) of species at a given exposure to salinity (fixed at 1.2 in log10 in this example).

exposure <- 1.2
# Bootstrap sample of the PAF at this exposure
PAF <- pnorm(exposure, mean = bootsample$estim$mean, sd = bootsample$estim$sd)
# confidence interval from 2.5 and 97.5 percentiles
quantile(PAF, probs = c(0.025, 0.975))
##   2.5%  97.5% 
## 0.0487 0.1470

For more complex calculations, especially to transfer uncertainty within a quantitative risk assessment, we recommend the use of the package mc2d, which aims at making such calculations easy and gives extensive examples of use of such bootstrap samples of parameters estimated using functions of the package fitdistrplus.

5 How to personalize plots

5.1 Can I personalize the default plot given for an object of class fitdist or fitdistcens?

The default plot given by using the plot() function on an object of class fitdist or fitdistcens is hard to personalize. Indeed this plot was designed only to give a quick overview of the fit, not to be used as a graph in a manuscript or a formal presentation. To personalize some (or all) of the goodness-of-fit plots, you should rather use the specific graphical functions denscomp, cdfcomp, ppcomp, qqcomp or cdfcompcens (see how in the following paragraphs).

5.2 How to personalize goodness-of-fit plots ?

The default plot of an object of class fitdist can be easily reproduced and personalized using denscomp, cdfcomp, ppcomp and qqcomp.

data(groundbeef)
serving <- groundbeef$serving
fit <- fitdist(serving, "gamma")
par(mfrow = c(2,2), mar = c(4, 4, 1, 1))
denscomp(fit, addlegend = FALSE, main = "", xlab = "serving sizes (g)", fitcol = "orange")
qqcomp(fit, addlegend = FALSE, main = "", fitpch = 16, fitcol = "grey", line01lty = 2)
cdfcomp(fit, addlegend = FALSE, main = "", xlab = "serving sizes (g)", fitcol = "orange", lines01 = TRUE)
ppcomp(fit, addlegend = FALSE, main = "", fitpch = 16, fitcol = "grey", line01lty = 2)

In a similar way, the default plot of an object of class fitdistcens can be easily personalized using cdfcompcens.

5.3 Is it possible to obtain ggplot2 plots ?

An argument plotstyle was added to the functions denscomp, cdfcomp, ppcomp, qqcomp and cdfcompcens, ppcompcens, qqcompcens to enable the generation of plots using the ggplot2 package. This argument, set to "graphics" by default, must simply be set to "ggplot" for this purpose, as in the following example. In that case the graphical functions return a graphic object that can be further personalized using ggplot2 functions.

library(ggplot2)
fitW <- fitdist(serving, "weibull")
fitln <- fitdist(serving, "lnorm")
fitg <- fitdist(serving, "gamma")
dcomp <- denscomp(list(fitW, fitln, fitg), legendtext = c("Weibull", "lognormal", "gamma"),
    xlab = "serving sizes (g)", xlim = c(0, 250), 
    fitcol = c("red", "green", "orange"), fitlty = 1, fitlwd = 1:3, 
    xlegend = "topright", plotstyle = "ggplot", addlegend = FALSE)
dcomp + ggplot2::theme_minimal() + ggplot2::ggtitle("Ground beef fits")

5.4 Is it possible to add the names of the observations in a goodness-of-fit plot, e.g. the names of the species in the plot of the Species Sensitivity Distribution (SSD) classically used in ecotoxicology ?

An argument named name.points can be used in the functions cdfcomp or CIcdfplot to pass a label vector for observed points, so that the name of each point is written on its left. This option is available only for ECDF goodness-of-fit plots and only for non-censored data. It can be used as below, for example, to name the species in the classical plot of the Species Sensitivity Distribution (SSD) in ecotoxicology.

data(endosulfan)
ATV <- subset(endosulfan, group == "NonArthroInvert")$ATV
taxaATV <- subset(endosulfan, group == "NonArthroInvert")$taxa
f <- fitdist(ATV, "lnorm")
cdfcomp(f, xlogscale = TRUE, main = "Species Sensitivity Distribution", 
    xlim = c(1, 100000), name.points = taxaATV, 
    addlegend = FALSE, plotstyle = "ggplot")

6 Questions regarding (left, right and/or interval) censored data

6.1 How to code censored data in fitdistrplus ?

Censored data must be represented in the package by a dataframe of two columns respectively named left and right, describing each observed value as an interval. The left column contains either NA for left censored observations, the left bound of the interval for interval censored observations, or the observed value for non-censored observations. The right column contains either NA for right censored observations, the right bound of the interval for interval censored observations, or the observed value for non-censored observations. This type of representation corresponds to the coding name "interval2" in the function Surv of the package survival. There is no other way to represent censored data in fitdistrplus, but the function Surv2fitdistcens() can be used to help you format data for use in fitdistcens() from one of the formats used in the survival package (see the help page of Surv2fitdistcens()). You have a toy example below.

dtoy <- data.frame(left = c(NA, 2, 4, 6, 9.7, 10), right = c(1, 3, 7, 8, 9.7, NA))
dtoy
##   left right
## 1   NA   1.0
## 2  2.0   3.0
## 3  4.0   7.0
## 4  6.0   8.0
## 5  9.7   9.7
## 6 10.0    NA
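
Such a data frame can then be passed directly to fitdistcens(); a minimal sketch, with a normal distribution chosen arbitrarily for this toy example:

fitdistcens(dtoy, "norm")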

6.2 How do I prepare the input of fitdistcens() with Surv2fitdistcens()?

Let us consider a classical right-censored dataset for human life: twenty values randomly chosen from the canlifins dataset of the CASdatasets package. We refer to the help of Surv2fitdistcens() for other censoring types.

exitage <- c(81.1,78.9,72.6,67.9,60.1,78.3,83.4,66.9,74.8,80.5,75.6,67.1,
             75.3,82.8,70.1,85.4,74,70,71.6,76.5)
death <- c(0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0)

When performing survival analysis, it is very common to use the Surv() function from the package survival to handle different types of censoring. In order to ease the use of fitdistcens(), a dedicated function Surv2fitdistcens() has been implemented with arguments similar to the ones of Surv().

svdata <- Surv2fitdistcens(exitage, event=death)

Let us now fit two simple distributions.

flnormc <- fitdistcens(svdata, "lnorm")
fweic <- fitdistcens(svdata, "weibull")
cdfcompcens(list(fweic, flnormc), xlim=range(exitage), xlegend = "topleft")

6.3 How to represent an empirical distribution from censored data ?

The representation of an empirical distribution from censored data is not a trivial problem. One can simply represent each observation as an interval at a y-value defined by the rank of the observation, as done below using the function plotdistcens. This representation can be interesting to visualize the raw data, but it remains difficult to correctly order the observations in the general case (see the example below on the right using the data smokedfish).

par(mfrow = c(1,2), mar = c(3, 4, 3, 0.5))
plotdistcens(dtoy, NPMLE = FALSE)
data(smokedfish)
dsmo <-  log10(smokedfish)
plotdistcens(dsmo, NPMLE = FALSE)

Many authors have worked on the development of algorithms for non-parametric maximum likelihood estimation (NPMLE) of the empirical cumulative distribution function (ECDF) from interval censored data (including left and right censored data, which can be considered as interval censored data with one bound at infinity). In old versions of fitdistrplus we used the Turnbull algorithm through calls to functions of the package survival. Even if this Turnbull algorithm is still available in the package, the default plot now uses the function npsurv of the package npsurv. This package provides more performant algorithms developed by Yong Wang (see the references cited in the help page of plotdistcens). Due to the lack of maintenance of this package we were forced to rewrite its main functions in our package, using another optimization function. The same ECDF plot was also implemented in our package using the Turnbull algorithm of survival (see below).

par(mfrow = c(2, 2),  mar = c(3, 4, 3, 0.5))
# Turnbull algorithm with representation of middle points of equivalence classes
plotdistcens(dsmo, NPMLE.method = "Turnbull.middlepoints", xlim = c(-1.8, 2.4))
# Turnbull algorithm with representation of equivalence classes as intervals
plotdistcens(dsmo, NPMLE.method = "Turnbull.intervals")
# Wang algorithm with representation of equivalence classes as intervals
plotdistcens(dsmo, NPMLE.method = "Wang")

As you can see in the above example, the new implementation of NPMLE provides a different type of plot for the ECDF, representing by filled rectangles the zones of non-uniqueness of the NPMLE ECDF. Indeed an NPMLE algorithm generally proceeds in two steps.

  1. The first step aims at identifying equivalence classes (also named in the literature Turnbull intervals, maximal intersection intervals, innermost intervals or maximal cliques of the data). Equivalence classes are points/intervals under which the NPMLE ECDF may change. Equivalence classes have been shown to correspond to regions between a left bound of an interval (named L in the following plot on the previous toy example) immediately followed by a right bound of an interval (named R in the following plot). An equivalence class may be of null length (for example at each non-censored value).

  2. The second step aims at assigning a probability mass to each equivalence class, which may be zero on some classes. The NPMLE is unique only up to these equivalence classes, and this non-uniqueness of the NPMLE ECDF is represented by filled rectangles.

Various NPMLE algorithms are implemented in the packages Icens, interval and npsurv. They are more or less performant, and not all of them can handle data other than survival data, especially data with left censored observations.

6.4 How to assess the goodness-of-fit of a distribution fitted on censored data ?

The only available method in fitdistrplus to fit distributions on censored data is maximum likelihood estimation (MLE). Once a distribution is fitted using fitdistcens, AIC and BIC values can be found in the summary of the object of class fitdistcens returned by the function. Those values can be used to compare the fit of various distributions on a same dataset. Function gofstat is not yet proposed in our package for fits on censored data, but we plan to develop it in the future with the calculation of other goodness-of-fit statistics for censored data.

fnorm <- fitdistcens(dsmo,"norm")
flogis <- fitdistcens(dsmo,"logis")
# comparison of AIC values
summary(fnorm)$aic
## [1] 178
summary(flogis)$aic
## [1] 177

Considering goodness-of-fit plots, the generic plot function of an object of class fitdistcens provides three plots: one in CDF using the NPMLE ECDF plot (by default using the Wang representation, see the previous part for details), a Q-Q plot and a P-P plot simply derived from the Wang plot of the ECDF, with filled rectangles indicating the non-uniqueness of the NPMLE ECDF.

par(mar = c(2, 4, 3, 0.5))
plot(fnorm)

Functions cdfcompcens(), qqcompcens() and ppcompcens() can be used to individualize and personalize the CDF, Q-Q and P-P goodness-of-fit plots and/or to compare the fit of various distributions on a same dataset.

cdfcompcens(list(fnorm, flogis), fitlty = 1)

qqcompcens(list(fnorm, flogis))

ppcompcens(list(fnorm, flogis))

Considering Q-Q plots and P-P plots, it may be easier to compare various fits by splitting the plots as below, which is done automatically using the plotstyle ggplot in qqcompcens() and ppcompcens() but can also be done manually with the plotstyle graphics.

qqcompcens(list(fnorm, flogis), lwd = 2, plotstyle = "ggplot",
  fitcol = c("red", "green"), fillrect = c("pink", "lightgreen"),
  legendtext = c("normal distribution", "logistic distribution"))

fitdistrplus/inst/doc/fitdistrplus_vignette.Rmd0000644000176200001440000016775514411504334021651 0ustar liggesusers---
title: Overview of the fitdistrplus package
author: Marie Laure Delignette Muller, Christophe Dutang
date: '`r Sys.Date()`'
output:
  bookdown::html_document2:
    base_format: rmarkdown::html_vignette
    fig_caption: yes
    toc: true
    number_sections: yes
bibliography: fitdistrplus.bib
link-citations: true
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Overview of the fitdistrplus package}
  %\VignetteEncoding{UTF-8}
  \usepackage[utf8]{inputenc}
  \usepackage{amsmath}
pkgdown:
  as_is: true
---

```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
options(digits = 4)
set.seed(1234)
```
Based on the article *fitdistrplus: an R Package for Fitting Distributions* (Marie Laure Delignette-Muller and Christophe Dutang, 2015, Journal of Statistical Software, DOI 10.18637/jss.v064.i04)
***Keywords**: probability distribution fitting, bootstrap, censored data, maximum likelihood, moment matching, quantile matching, maximum goodness-of-fit, distributions, R*
# Introduction {#Introduction}

Fitting distributions to data is a very common task in statistics and consists in choosing a probability distribution modelling the random variable, as well as finding parameter estimates for that distribution. This requires judgment and expertise and generally needs an iterative process of distribution choice, parameter estimation, and quality of fit assessment. In the R [@R13] package **MASS** [@MASS], maximum likelihood estimation is available via the `fitdistr` function; other steps of the fitting process can be done using other R functions [@Ricci05]. In this paper, we present the R package **fitdistrplus** [@fitdistrplus] implementing several methods for fitting univariate parametric distributions. A first objective in developing this package was to provide R users a set of functions dedicated to help this overall process.

The `fitdistr` function estimates distribution parameters by maximizing the likelihood function using the `optim` function. No distinction between parameters with different roles (e.g., main parameter and nuisance parameter) is made, as this paper focuses on parameter estimation from a general point-of-view. In some cases, other estimation methods could be preferred, such as maximum goodness-of-fit estimation (also called minimum distance estimation), as proposed in the R package **actuar** with three different goodness-of-fit distances [@actuarJSS]. While developing the **fitdistrplus** package, a second objective was to consider various estimation methods in addition to maximum likelihood estimation (MLE). Functions were developed to enable moment matching estimation (MME), quantile matching estimation (QME), and maximum goodness-of-fit estimation (MGE) using eight different distances. Moreover, the **fitdistrplus** package offers the possibility to specify a user-supplied function for optimization, useful in cases where classical optimization techniques, not included in `optim`, are more adequate.

In applied statistics, it is frequent to have to fit distributions to censored data [@kleinmoeschberger03, @helsel05, @busschaertetal10, @lehaetal11, @commeauetal12]. The **MASS** `fitdistr` function does not enable maximum likelihood estimation with this type of data. Some packages can be used to work with censored data, especially survival data [@survival, @hiranoetal94, @jordan05], but those packages generally focus on specific models, enabling the fit of a restricted set of distributions. A third objective is thus to provide R users a function to estimate univariate distribution parameters from right-, left- and interval-censored data.

Few packages on CRAN provide estimation procedures for any user-supplied parametric distribution and support different types of data. The **distrMod** package [@distrModJSS] provides an object-oriented (S4) implementation of probability models and includes distribution fitting procedures for a given minimization criterion. This criterion is a user-supplied function which is sufficiently flexible to handle censored data, yet not in a trivial way, see Example M4 of the **distrMod** vignette. The fitting functions `MLEstimator` and `MDEstimator` return an S4 class for which a coercion method to class `mle` is provided so that the respective functionalities (e.g., `confint` and `logLik`) from package **stats4** are available, too. In **fitdistrplus**, we chose to use the standard S3 class system for its understanding by most R users.
When designing the **fitdistrplus** package, we did not forget to implement generic functions also available for S3 classes. Finally, various other packages provide functions to estimate the mode, the moments or the L-moments of a distribution, see the reference manuals of **modeest**, **lmomco** and **Lmoments** packages.

The package is available from the Comprehensive R Archive Network at \url{https://cran.r-project.org/package=fitdistrplus}. The paper is organized as follows: Section \@ref(fitnoncenscont) presents tools for fitting continuous distributions to classic non-censored data. Section \@ref(advtopic) deals with other estimation methods and other types of data, before Section \@ref(ccl) concludes.
# Fitting distributions to continuous non-censored data {#fitnoncenscont}

## Choice of candidate distributions {#Choice}

For illustrating the use of various functions of the **fitdistrplus** package with continuous non-censored data, we will first use a data set named `groundbeef` which is included in our package. This data set contains pointwise values of serving sizes in grams, collected in a French survey, for ground beef patties consumed by children under 5 years old. It was used in a quantitative risk assessment published by @Delignette08.

```{r datgroundbeef, echo=TRUE}
library("fitdistrplus")
data("groundbeef")
str(groundbeef)
```

Before fitting one or more distributions to a data set, it is generally necessary to choose good candidates among a predefined set of distributions. This choice may be guided by the knowledge of stochastic processes governing the modeled variable, or, in the absence of knowledge regarding the underlying process, by the observation of its empirical distribution. To help the user in this choice, we developed functions to plot and characterize the empirical distribution.

First of all, it is common to start with plots of the empirical distribution function and the histogram (or density plot), which can be obtained with the `plotdist` function of the **fitdistrplus** package. This function provides two plots (see Figure \@ref(fig:figgroundbeef)): the left-hand plot is by default the histogram on a density scale (or density plot of both, according to values of arguments `histo` and `demp`) and the right-hand plot the empirical cumulative distribution function (CDF).

```{r figgroundbeef, fig.align='center', fig.width=7, fig.height=4, fig.cap="Histogram and CDF plots of an empirical distribution for a continuous variable (serving size from the `groundbeef` data set) as provided by the `plotdist` function."}
plotdist(groundbeef$serving, histo = TRUE, demp = TRUE)
```
In addition to empirical plots, descriptive statistics may help to choose candidates to describe a distribution among a set of parametric distributions. Especially the skewness and kurtosis, linked to the third and fourth moments, are useful for this purpose. A non-zero skewness reveals a lack of symmetry of the empirical distribution, while the kurtosis value quantifies the weight of tails in comparison to the normal distribution for which the kurtosis equals 3. The skewness and kurtosis and their corresponding unbiased estimator [@casellaberger02] from a sample $(X_i)_i \stackrel{\text{i.i.d.}}{\sim} X$ with observations $(x_i)_i$ are given by:

\begin{equation}
sk(X) = \frac{E[(X-E(X))^3]}{Var(X)^{\frac{3}{2}}}~,~\widehat{sk}=\frac{\sqrt{n(n-1)}}{n-2}\times\frac{m_{3}}{m_{2}^{\frac{3}{2}}},(\#eq:eq1)
\end{equation}

\begin{equation}
kr(X) = \frac{E[(X-E(X))^4]}{Var(X)^{2}}~,~\widehat{kr}=\frac{n-1}{(n-2)(n-3)}((n+1) \times \frac{m_{4}}{m_{2}^{2}}-3(n-1)) + 3,(\#eq:eq2)
\end{equation}

where $m_{2}$, $m_{3}$, $m_{4}$ denote empirical moments defined by $m_{k}=\frac{1}{n}\sum_{i=1}^n(x_{i}-\overline{x})^{k}$, with $x_{i}$ the $n$ observations of variable $x$ and $\overline{x}$ their mean value.

The `descdist` function provides classical descriptive statistics (minimum, maximum, median, mean, standard deviation), skewness and kurtosis. By default, unbiased estimations of the three last statistics are provided. Nevertheless, the argument `method` can be changed from `"unbiased"` (default) to `"sample"` to obtain them without correction for bias. A skewness-kurtosis plot such as the one proposed by @Cullen99 is provided by the `descdist` function for the empirical distribution (see Figure \@ref(fig:descgroundbeefplot) for the `groundbeef` data set). On this plot, values for common distributions are displayed in order to help the choice of distributions to fit to data. For some distributions (normal, uniform, logistic, exponential), there is only one possible value for the skewness and the kurtosis. Thus, the distribution is represented by a single point on the plot. For other distributions, areas of possible values are represented, consisting in lines (as for gamma and lognormal distributions), or larger areas (as for beta distribution).

Skewness and kurtosis are known not to be robust. In order to take into account the uncertainty of the estimated values of kurtosis and skewness from data, a nonparametric bootstrap procedure [@efrontibshirani94] can be performed by using the argument `boot`. Values of skewness and kurtosis are computed on bootstrap samples (constructed by random sampling with replacement from the original data set) and reported on the skewness-kurtosis plot. Nevertheless, the user needs to know that skewness and kurtosis, like all higher moments, have a very high variance. This is a problem which cannot be completely solved by the use of bootstrap. The skewness-kurtosis plot should then be regarded as indicative only. The properties of the random variable should be considered, notably its expected value and its range, as a complement to the use of the `plotdist` and `descdist` functions. Below is a call to the `descdist` function to describe the distribution of the serving size from the `groundbeef` data set and to draw the corresponding skewness-kurtosis plot (see Figure \@ref(fig:descgroundbeefplot)).
Looking at the results on this example with a positive skewness and a kurtosis not far from 3, the fit of three common right-skewed distributions could be considered, Weibull, gamma and lognormal distributions.

```{r descgroundbeefplot, fig.align='center', fig.width=5, fig.height=5, fig.cap="Skewness-kurtosis plot for a continuous variable (serving size from the `groundbeef` data set) as provided by the `descdist` function."}
descdist(groundbeef$serving, boot = 1000)
```

## Fit of distributions by maximum likelihood estimation {#FIT}

Once selected, one or more parametric distributions $f(.\vert \theta)$ (with parameter $\theta\in\mathbb{R}^d$) may be fitted to the data set, one at a time, using the `fitdist` function. Under the i.i.d. sample assumption, distribution parameters $\theta$ are by default estimated by maximizing the likelihood function defined as:

\begin{equation}
L(\theta)=\prod_{i=1}^n f(x_{i}\vert \theta)(\#eq:eq3)
\end{equation}

with $x_{i}$ the $n$ observations of variable $X$ and $f(.\vert \theta)$ the density function of the parametric distribution. The other proposed estimation methods are described in Section \@ref(Alternatives).

The `fitdist` function returns an S3 object of class `fitdist` for which `print`, `summary` and `plot` functions are provided. The fit of a distribution using `fitdist` assumes that the corresponding `d`, `p`, `q` functions (standing respectively for the density, the distribution and the quantile functions) are defined. Classical distributions are already defined in that way in the **stats** package, e.g., `dnorm`, `pnorm` and `qnorm` for the normal distribution (see `?Distributions`). Others may be found in various packages (see the CRAN task view: Probability Distributions at \url{https://cran.r-project.org/web/views/Distributions.html}). Distributions not found in any package must be implemented by the user as `d`, `p`, `q` functions. In the call to `fitdist`, a distribution has to be specified via the argument `dist` either by the character string corresponding to its common root name used in the names of `d`, `p`, `q` functions (e.g., `"norm"` for the normal distribution) or by the density function itself, from which the root name is extracted (e.g., `dnorm` for the normal distribution). Numerical results returned by the `fitdist` function are (1) the parameter estimates, (2) the estimated standard errors (computed from the estimate of the Hessian matrix at the maximum likelihood solution), (3) the loglikelihood, (4) Akaike and Bayesian information criteria (the so-called AIC and BIC), and (5) the correlation matrix between parameter estimates. Below is a call to the `fitdist` function to fit a Weibull distribution to the serving size from the `groundbeef` data set.

```{r fitgroundbeef.weibull}
fw <- fitdist(groundbeef$serving, "weibull")
summary(fw)
```

The plot of an object of class `fitdist` provides four classical goodness-of-fit plots [@Cullen99] presented on Figure \@ref(fig:groundbeefcomp):

- a density plot representing the density function of the fitted distribution along with the histogram of the empirical distribution,
- a CDF plot of both the empirical distribution and the fitted distribution,
- a Q-Q plot representing the empirical quantiles (y-axis) against the theoretical quantiles (x-axis),
- a P-P plot representing the empirical distribution function evaluated at each data point (y-axis) against the fitted distribution function (x-axis).
For CDF, Q-Q and P-P plots, the probability plotting position is defined by default using Hazen's rule, with probability points of the empirical distribution calculated as `(1:n - 0.5)/n`, as recommended by @Blom. This plotting position can be easily changed (see the reference manual for details [@fitdistrplus]).

Unlike the generic `plot` function, the `denscomp`, `cdfcomp`, `qqcomp` and `ppcomp` functions enable to draw separately each of these four plots, in order to compare the empirical distribution and multiple parametric distributions fitted on a same data set. These functions must be called with a first argument corresponding to a list of objects of class `fitdist`, and optionally further arguments to customize the plot (see the reference manual for lists of arguments that may be specific to each plot [@fitdistrplus]). In the following example, we compare the fit of a Weibull, a lognormal and a gamma distributions to the `groundbeef` data set (Figure \@ref(fig:groundbeefcomp)).

```{r groundbeefcomp, fig.align='center', fig.width=7, fig.height=7, fig.cap="Four Goodness-of-fit plots for various distributions fitted to continuous data (Weibull, gamma and lognormal distributions fitted to serving sizes from the `groundbeef` data set) as provided by functions `denscomp`, `qqcomp`, `cdfcomp` and `ppcomp`."}
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
fg <- fitdist(groundbeef$serving, "gamma")
fln <- fitdist(groundbeef$serving, "lnorm")
plot.legend <- c("Weibull", "lognormal", "gamma")
denscomp(list(fw, fln, fg), legendtext = plot.legend)
qqcomp(list(fw, fln, fg), legendtext = plot.legend)
cdfcomp(list(fw, fln, fg), legendtext = plot.legend)
ppcomp(list(fw, fln, fg), legendtext = plot.legend)
```
The density plot and the CDF plot may be considered as the basic classical goodness-of-fit plots. The two other plots are complementary and can be very informative in some cases. The Q-Q plot emphasizes the lack-of-fit at the distribution tails while the P-P plot emphasizes the lack-of-fit at the distribution center. In the present example (in Figure \@ref(fig:groundbeefcomp)), none of the three fitted distributions correctly describes the center of the distribution, but the Weibull and gamma distributions could be preferred for their better description of the right tail of the empirical distribution, especially if this tail is important in the use of the fitted distribution, as it is in the context of food risk assessment.

The data set named `endosulfan` will now be used to illustrate other features of the **fitdistrplus** package. This data set contains acute toxicity values for the organochlorine pesticide endosulfan (geometric mean of LC50 or EC50 values in $\mu g.L^{-1}$), tested on Australian and non-Australian laboratory-species [@Hose04]. In ecotoxicology, a lognormal or a loglogistic distribution is often fitted to such a data set in order to characterize the species sensitivity distribution (SSD) for a pollutant. A low percentile of the fitted distribution, generally the 5% percentile, is then calculated and named the hazardous concentration 5% (HC5). It is interpreted as the value of the pollutant concentration protecting 95% of the species [@Posthuma2010].

But the fit of a lognormal or a loglogistic distribution to the whole `endosulfan` data set is rather bad (Figure \@ref(fig:fitendo)), especially due to a minority of very high values. The two-parameter Pareto distribution and the three-parameter Burr distribution (which is an extension of both the loglogistic and the Pareto distributions) have been fitted. Pareto and Burr distributions are provided in the package **actuar**. Until here, we did not have to define starting values (in the optimization process) as reasonable starting values are implicitly defined within the `fitdist` function for most of the distributions defined in R (see `?fitdist` for details). For other distributions like the Pareto and the Burr distribution, initial values for the distribution parameters have to be supplied in the argument `start`, as a named list with initial values for each parameter (as they appear in the `d`, `p`, `q` functions). Having defined reasonable starting values[^1] various distributions can be fitted and graphically compared. On this example, the function `cdfcomp` can be used to report CDF values in a logscale so as to emphasize discrepancies on the tail of interest while defining an HC5 value (Figure \@ref(fig:fitendo)).

[^1]: The `plotdist` function can plot any parametric distribution with specified parameter values in argument `para`. It can thus help to find correct initial values for the distribution parameters in non trivial cases, by iterative calls if necessary (see the reference manual for examples [@fitdistrplus]).
```{r fitendo, fig.align='center', fig.width=6, fig.height=6, fig.cap="CDF plot to compare the fit of four distributions to acute toxicity values of various organisms for the organochlorine pesticide endosulfan (`endosulfan` data set) as provided by the `cdfcomp` function, with CDF values in a logscale to emphasize discrepancies on the left tail."}
library(actuar)
data("endosulfan")
ATV <- endosulfan$ATV
fendo.ln <- fitdist(ATV, "lnorm")
fendo.ll <- fitdist(ATV, "llogis", start = list(shape = 1, scale = 500))
fendo.P <- fitdist(ATV, "pareto", start = list(shape = 1, scale = 500))
fendo.B <- fitdist(ATV, "burr", start = list(shape1 = 0.3, shape2 = 1, rate = 1))
cdfcomp(list(fendo.ln, fendo.ll, fendo.P, fendo.B), xlogscale = TRUE, ylogscale = TRUE, legendtext = c("lognormal", "loglogistic", "Pareto", "Burr"))
```
None of the fitted distributions correctly describes the right tail observed in the data set, but as shown in Figure \@ref(fig:fitendo), the left-tail seems to be better described by the Burr distribution. Its use could then be considered to estimate the HC5 value as the 5% quantile of the distribution. This can be easily done using the `quantile` generic function defined for an object of class `fitdist`. Below is this calculation together with the calculation of the empirical quantile for comparison.

```{r quantilefitdist, echo=TRUE, fig=FALSE}
quantile(fendo.B, probs = 0.05)
quantile(ATV, probs = 0.05)
```

In addition to the ecotoxicology context, the `quantile` generic function is also attractive in the actuarial-financial context. In fact, the value-at-risk $VAR_\alpha$ is defined as the $1-\alpha$-quantile of the loss distribution and can be computed with `quantile` on a `fitdist` object.

The computation of different goodness-of-fit statistics is proposed in the **fitdistrplus** package in order to further compare fitted distributions. The purpose of goodness-of-fit statistics aims to measure the distance between the fitted parametric distribution and the empirical distribution: e.g., the distance between the fitted cumulative distribution function $F$ and the empirical distribution function $F_{n}$. When fitting continuous distributions, three goodness-of-fit statistics are classically considered: Cramer-von Mises, Kolmogorov-Smirnov and Anderson-Darling statistics [@Stephens86]. Naming $x_{i}$ the $n$ observations of a continuous variable $X$ arranged in an ascending order, Table \@ref(tab:tabKSCvMAD) gives the definition and the empirical estimate of the three considered goodness-of-fit statistics. They can be computed using the function `gofstat` as defined by Stephens [@Stephens86].

```{r fendo.gof.print, echo=TRUE, fig=FALSE}
gofstat(list(fendo.ln, fendo.ll, fendo.P, fendo.B), fitnames = c("lnorm", "llogis", "Pareto", "Burr"))
```
| Statistic | General formula | Computational formula |
|-------------------------|-------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Kolmogorov-Smirnov (KS) | $\sup|F_{n}(x) - F(x)|$ | $\max(D^{+},D^{-})$ with $D^{+}=\max\limits_{i=1,\dots,n}\left(\frac{i}{n} - F_i\right)$ and $D^{-}=\max\limits_{i=1,\dots,n}\left(F_{i}-\frac{i-1}{n}\right)$ |
| Cramer-von Mises (CvM) | $n \int_{-\infty}^{\infty}(F_{n}(x) - F(x))^2 dx$ | $\frac{1}{12n} + \sum\limits_{i=1}^n \left(F_i-\frac{2i-1}{2n} \right)^{2}$ |
| Anderson-Darling (AD) | $n \int_{-\infty}^{\infty}\frac{(F_{n}(x) - F(x))^2}{F(x) (1 - F(x))} dx$ | $-n -\frac{1}{n}\sum\limits_{i=1}^n (2i-1)\log(F_i(1-F_{n+1-i}))$ |

Table: (\#tab:tabKSCvMAD) Goodness-of-fit statistics as defined by Stephens [@Stephens86], where $F_i\stackrel{\triangle}{=} F(x_i)$.
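To make the computational formulas of Table \@ref(tab:tabKSCvMAD) concrete, the Kolmogorov-Smirnov statistic of the previous lognormal fit can be recomputed by hand; a minimal sketch, which should agree with the `gofstat` output up to rounding:

```{r, eval=FALSE}
# sorted fitted CDF values F_i = F(x_(i)) under the lognormal fit
Fi <- sort(plnorm(ATV, meanlog = coef(fendo.ln)["meanlog"],
                  sdlog = coef(fendo.ln)["sdlog"]))
n <- length(ATV)
Dplus <- max((1:n)/n - Fi)       # D+
Dminus <- max(Fi - (0:(n-1))/n)  # D-
max(Dplus, Dminus)               # KS statistic
```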
As it gives more weight to distribution tails, the Anderson-Darling statistic is of special interest when the tails of the distribution matter as much as its main body. This is often the case in risk assessment [@Cullen99, @Vose10]. For this reason, this statistic is often used to select the best distribution among those fitted. Nevertheless, this statistic should be used cautiously when comparing fits of various distributions. Keeping in mind that the weighting of each CDF quadratic difference depends on the parametric distribution in its definition (see Table \@ref(tab:tabKSCvMAD)), Anderson-Darling statistics computed for several distributions fitted to the same data set are theoretically difficult to compare. Moreover, such a statistic, like the Cramer-von Mises and Kolmogorov-Smirnov ones, does not take into account the complexity of the model (i.e., the number of parameters). This is not a problem when the compared distributions are characterized by the same number of parameters, but it could systematically promote the selection of the more complex distributions otherwise. Looking at classical penalized criteria based on the log-likelihood (AIC, BIC) thus also seems interesting, especially to discourage overfitting. In the previous example, all the goodness-of-fit statistics based on the CDF distance are in favor of the Burr distribution, the only one characterized by three parameters, while the AIC and BIC values respectively give preference to the Burr distribution and to the Pareto distribution. The choice between these two distributions is thus less obvious and could be discussed. Even if specifically recommended for discrete distributions, the Chi-squared statistic may also be used for continuous distributions (see Section \@ref(otherdata) and the reference manual for examples [@fitdistrplus]).

## Uncertainty in parameter estimates {#Uncertainty}

The uncertainty in the parameters of the fitted distribution can be estimated by parametric or nonparametric bootstraps using the `bootdist` function for non-censored data [@efrontibshirani94]. This function returns the bootstrapped values of the parameters in an S3 class object which can be plotted to visualize the bootstrap region. The medians and the 95% confidence intervals of the parameters (2.5 and 97.5 percentiles) are printed in the summary. When it is smaller than the total number of iterations (due to lack of convergence of the optimization algorithm for some bootstrapped data sets), the number of iterations for which the estimation converged is also printed in the summary. The plot of an object of class `bootdist` consists of a scatterplot (or a matrix of scatterplots) of the bootstrapped values of the parameters, providing a representation of the joint uncertainty distribution of the fitted parameters. Below is an example of the use of the `bootdist` function with the previous fit of the Burr distribution to the `endosulfan` data set (Figure \@ref(fig:bootstrap)).

```{r fitBurr.boot.echo, echo=TRUE}
bendo.B <- bootdist(fendo.B, niter = 1001)
summary(bendo.B)
```

```{r bootstrap, fig.align='center', fig.width=6, fig.height=6, fig.cap="Bootstrapped values of parameters for a fit of the Burr distribution characterized by three parameters (example on the `endosulfan` data set) as provided by the plot of an object of class `bootdist`."}
plot(bendo.B)
```
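As mentioned above, a nonparametric bootstrap can be requested instead of the default parametric one; a minimal sketch using the `bootmethod` argument of `bootdist`:

```{r, eval=FALSE}
# resample the data themselves instead of simulating
# from the fitted Burr distribution
bendo.B.np <- bootdist(fendo.B, bootmethod = "nonparam", niter = 1001)
summary(bendo.B.np)
```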
Bootstrap samples of parameter estimates are especially useful to calculate confidence intervals on each parameter of the fitted distribution, from the marginal distribution of the bootstrapped values. It is also interesting to look at the joint distribution of the bootstrapped values in a scatterplot (or a matrix of scatterplots if the number of parameters exceeds two) in order to understand the potential structural correlation between parameters (see Figure \@ref(fig:bootstrap)).

The use of the whole bootstrap sample is also of interest in the risk assessment field. Its use enables the characterization of uncertainty in distribution parameters. It can be directly used within a second-order Monte Carlo simulation framework, especially within the package **mc2d** [@mc2d]. One could refer to @Pouillot10 for an introduction to the use of the **mc2d** and **fitdistrplus** packages in the context of quantitative risk assessment.

The bootstrap method can also be used to calculate confidence intervals on quantiles of the fitted distribution. For this purpose, a generic `quantile` function is provided for class `bootdist`. By default, 95% percentile bootstrap confidence intervals of quantiles are provided. Going back to the previous example from ecotoxicology, this function can be used to estimate the uncertainty associated with the HC5 estimation, for example from the previously fitted Burr distribution to the `endosulfan` data set.

```{r fitATV.lnorm.quantile, echo=TRUE}
quantile(bendo.B, probs = 0.05)
```

# Advanced topics {#advtopic}

## Alternative methods for parameter estimation {#Alternatives}

This subsection focuses on alternative estimation methods. One of the alternatives for continuous distributions is the maximum goodness-of-fit estimation method, also called the minimum distance estimation method [@Stephens86, @actuarJSS]. In this package this method is proposed with eight different distances: the three classical distances defined in Table \@ref(tab:tabKSCvMAD), or one of the variants of the Anderson-Darling distance proposed by @Luceno06 and defined in Table \@ref(tab:modifiedAD). The right-tail AD gives more weight to the right tail, while the left-tail AD gives more weight only to the left tail. Either of the tails, or both of them, can receive even larger weights by using second-order Anderson-Darling statistics.
| Statistic | General formula | Computational formula |
|---------------------------------|-------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|
| Right-tail AD (ADR) | $\int_{-\infty}^{\infty}\frac{(F_{n}(x) - F(x))^2 }{1 - F(x)} dx$ | $\frac{n}{2} -2\sum\limits_{i=1}^nF_i -\frac{1}{n}\sum\limits_{i=1}^n(2i-1)\ln(\overline F_{n+1-i})$ |
| Left-tail AD (ADL) | $\int_{-\infty}^{\infty}\frac{(F_{n}(x) - F(x))^2 }{F(x)} dx$ | $-\frac{3n}{2} +2\sum\limits_{i=1}^nF_i -\frac{1}{n}\sum\limits_{i=1}^n(2i-1)\ln(F_i)$ |
| Right-tail AD 2nd order (AD2R) | $ad2r=\int_{-\infty}^{\infty}\frac{(F_{n}(x) - F(x))^2 }{(1 - F(x))^{2}} dx$ | $ad2r=2\sum\limits_{i=1}^n\ln(\overline F_i) +\frac{1}{n}\sum\limits_{i=1}^n \frac{2i-1}{\overline F_{n+1-i}}$ |
| Left-tail AD 2nd order (AD2L) | $ad2l=\int_{-\infty}^{\infty}\frac{(F_{n}(x) - F(x))^2 }{(F(x))^{2}} dx$ | $ad2l=2\sum\limits_{i=1}^n\ln(F_i) +\frac{1}{n}\sum\limits_{i=1}^n\frac{2i-1}{F_i}$ |
| AD 2nd order (AD2) | $ad2r+ad2l$ | $ad2r+ad2l$ |

Table: (\#tab:modifiedAD) Modified Anderson-Darling statistics as defined by @Luceno06, where $F_i\stackrel{\triangle}{=} F(x_{i})$ and $\overline F_i\stackrel{\triangle}{=}1-F(x_{i})$.
To fit a distribution by maximum goodness-of-fit estimation, one needs to fix the argument `method` to `"mge"` in the call to `fitdist` and to specify the argument `gof` coding for the chosen goodness-of-fit distance. This estimation method is intended to be used only with continuous non-censored data. Maximum goodness-of-fit estimation may be useful to give more weight to data at one tail of the distribution. In the previous example from ecotoxicology, we used a non-classical distribution (the Burr distribution) to correctly fit the empirical distribution, especially on its left tail. In order to correctly estimate the 5$\%$ percentile, we could also consider the fit of the classical lognormal distribution, but minimizing a goodness-of-fit distance that gives more weight to the left tail of the empirical distribution. In what follows, the left-tail Anderson-Darling distances of first or second order are used to fit a lognormal distribution to the `endosulfan` data set (see Figure \@ref(fig:plotfitMGE)).

```{r plotfitMGE, fig.align='center', fig.width=6, fig.height=6, fig.cap="Comparison of a lognormal distribution fitted by MLE and by MGE using two different goodness-of-fit distances: left-tail Anderson-Darling and left-tail Anderson-Darling of second order (example with the `endosulfan` data set) as provided by the `cdfcomp` function, with CDF values in a logscale to emphasize discrepancies on the left tail."}
fendo.ln.ADL <- fitdist(ATV, "lnorm", method = "mge", gof = "ADL")
fendo.ln.AD2L <- fitdist(ATV, "lnorm", method = "mge", gof = "AD2L")
cdfcomp(list(fendo.ln, fendo.ln.ADL, fendo.ln.AD2L),
        xlogscale = TRUE, ylogscale = TRUE, 
        main = "Fitting a lognormal distribution", 
        xlegend = "bottomright", 
        legendtext = c("MLE", "Left-tail AD", "Left-tail AD 2nd order"))
```
Comparing the 5% percentiles (HC5) calculated from these three fits to the one calculated from the MLE fit of the Burr distribution, we can observe, in this example, that fitting the lognormal distribution by minimizing the left-tail Anderson-Darling distance of first or second order makes it possible to approach the value obtained by fitting the Burr distribution by MLE.

```{r quantilefitdist2, echo=TRUE, fig=FALSE}
(HC5.estimates <- c(
  empirical = as.numeric(quantile(ATV, probs = 0.05)), 
  Burr = as.numeric(quantile(fendo.B, probs = 0.05)$quantiles),
  lognormal_MLE = as.numeric(quantile(fendo.ln, probs = 0.05)$quantiles),
  lognormal_ADL = as.numeric(quantile(fendo.ln.ADL, probs = 0.05)$quantiles),
  lognormal_AD2L = as.numeric(quantile(fendo.ln.AD2L, probs = 0.05)$quantiles)))
```

The moment matching estimation (MME) is another method commonly used to fit parametric distributions [@Vose10]. MME consists in finding the value of the parameter $\theta$ that equalizes the first theoretical raw moments of the parametric distribution to the corresponding empirical raw moments, as in Equation \@ref(eq:eq4):

\begin{equation}
E(X^{k}|\theta)=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{k},(\#eq:eq4)
\end{equation}

for $k=1,\ldots,d$, with $d$ the number of parameters to estimate and $x_{i}$ the $n$ observations of variable $X$. For moments of order greater than or equal to 2, it may also be relevant to match centered moments. Therefore, we match the moments given in Equation \@ref(eq:eq5):

\begin{equation}
E(X\vert \theta) = \overline{x} ~,~E\left((X-E(X))^{k}|\theta\right)=m_k, \text{ for } k=2,\ldots,d,(\#eq:eq5)
\end{equation}

where $m_k$ denotes the empirical centered moments. This method can be performed by setting the argument `method` to `"mme"` in the call to `fitdist`. The estimate is computed by a closed-form formula for the following distributions: normal, lognormal, exponential, Poisson, gamma, logistic, negative binomial, geometric, beta and uniform. In this case, for distributions characterized by one parameter (geometric, Poisson and exponential), this parameter is simply estimated by matching theoretical and observed means, and for distributions characterized by two parameters, these parameters are estimated by matching theoretical and observed means and variances [@Vose10]. For other distributions, the equation of moments is solved numerically using the `optim` function, by minimizing the sum of squared differences between observed and theoretical moments (see the **fitdistrplus** reference manual for technical details [@fitdistrplus]).

A classical data set from the Danish insurance industry published in @mcneil97 will be used to illustrate this method. In **fitdistrplus**, the data set is stored in `danishuni` for the univariate version and contains the loss amounts collected at Copenhagen Reinsurance between 1980 and 1990. In actuarial science, it is standard to consider positive heavy-tailed distributions and to have a special focus on the right tail of the distributions. In this numerical experiment, we choose classic actuarial distributions for loss modelling: the lognormal distribution and the Pareto type II distribution [@Klugmanetal09]. The lognormal distribution is fitted to the `danishuni` data set by matching moments implemented as a closed-form formula. On the left-hand graph of Figure \@ref(fig:danishmme), the fitted distribution functions obtained using the moment matching estimation (MME) and maximum likelihood estimation (MLE) methods are compared.
The MME method provides a more cautious estimation of the insurance risk, as the MME-fitted distribution function underestimates the empirical distribution function for large values of claim amounts, whereas the MLE-fitted one overestimates it.

```{r danish.mme, echo=TRUE, eval=TRUE}
data("danishuni")
str(danishuni)
```
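For the lognormal distribution, the closed-form MME mentioned above simply inverts the first two moment equations; a minimal sketch of the underlying computation (up to the denominator convention used for the empirical variance):

```{r, eval=FALSE}
# matching E(X) = exp(meanlog + sdlog^2/2) and
# Var(X) = (exp(sdlog^2) - 1) * exp(2*meanlog + sdlog^2)
# with the empirical mean m and variance v gives
m <- mean(danishuni$Loss)
v <- mean((danishuni$Loss - m)^2)
sdlog <- sqrt(log(1 + v / m^2))
meanlog <- log(m) - sdlog^2 / 2
c(meanlog = meanlog, sdlog = sdlog)
```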
```{r danishmme, fig.align='center', fig.width=7, fig.height=4, fig.cap="Comparison between MME and MLE when fitting a lognormal or a Pareto distribution to loss data from the `danishuni` data set."} fdanish.ln.MLE <- fitdist(danishuni$Loss, "lnorm") fdanish.ln.MME <- fitdist(danishuni$Loss, "lnorm", method = "mme", order = 1:2) library(actuar) fdanish.P.MLE <- fitdist(danishuni$Loss, "pareto", start = list(shape = 10, scale = 10), lower = 2+1e-6, upper = Inf) memp <- function(x, order) sum(x^order) / length(x) fdanish.P.MME <- fitdist(danishuni$Loss, "pareto", method = "mme", order = 1:2, memp = "memp", start = list(shape = 10, scale = 10), lower = c(2+1e-6, 2+1e-6), upper = c(Inf, Inf)) par(mfrow = c(1, 2)) cdfcomp(list(fdanish.ln.MLE, fdanish.ln.MME), legend = c("lognormal MLE", "lognormal MME"), main = "Fitting a lognormal distribution", xlogscale = TRUE, datapch = 20) cdfcomp(list(fdanish.P.MLE, fdanish.P.MME), legend = c("Pareto MLE", "Pareto MME"), main = "Fitting a Pareto distribution", xlogscale = TRUE, datapch = 20) ```
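Since the **actuar** package also provides theoretical raw moments (functions prefixed by `m`, such as `mlnorm`), one can check that the MME fit indeed matches the first two empirical raw moments; a minimal sketch:

```{r, eval=FALSE}
# empirical raw moments of order 1 and 2 versus the theoretical
# raw moments of the MME-fitted lognormal distribution
c(mean(danishuni$Loss), mean(danishuni$Loss^2))
mlnorm(1:2, meanlog = coef(fdanish.ln.MME)["meanlog"],
       sdlog = coef(fdanish.ln.MME)["sdlog"])
```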
Next, a Pareto distribution, which gives more weight to the right tail of the distribution, is fitted. Like the lognormal distribution, the Pareto distribution has two parameters, which allows a fair comparison. We use the implementation of the **actuar** package providing raw and centered moments for that distribution (in addition to the `d`, `p`, `q` and `r` functions) [@actuar12]. Fitting a heavy-tailed distribution for which the first and the second moments do not exist for certain values of the shape parameter requires some caution. This is carried out by providing, for the optimization process, a lower and an upper bound for each parameter. The code below calls the L-BFGS-B optimization method in `optim`, since this quasi-Newton method allows box constraints [^2]. We choose to match the moments defined in Equation \@ref(eq:eq4), and so a function for computing the empirical raw moments (called `memp` in our example) is passed to `fitdist`. For two-parameter distributions (i.e., $d=2$), Equations \@ref(eq:eq4) and \@ref(eq:eq5) are equivalent.

[^2]: That is what the B stands for.

```{r danish.mme.pareto, echo=TRUE, fig=FALSE}
gofstat(list(fdanish.ln.MLE, fdanish.P.MLE, fdanish.ln.MME, fdanish.P.MME),
        fitnames = c("lnorm.mle", "Pareto.mle", "lnorm.mme", "Pareto.mme"))
```

As shown in Figure \@ref(fig:danishmme), the MME and MLE fits are far less distant (when looking at the right tail) for the Pareto distribution than for the lognormal distribution on this data set. Furthermore, for these two distributions, the MME method better fits the right tail of the distribution from a visual point of view. This seems logical since empirical moments are influenced by large observed values. In the previous traces, we gave the values of the goodness-of-fit statistics: whatever the statistic considered, the MLE-fitted lognormal always provides the best fit to the observed data.

Maximum likelihood and moment matching estimation are certainly the most commonly used methods for fitting distributions [@Cullen99]. Keeping in mind that these two methods may produce very different results, the user should be aware of the great sensitivity of moment matching estimation to outliers when choosing that method. This may be seen as an advantage in our example if the objective is to better describe the right tail of the distribution, but it may be seen as a drawback if the objective is different.

Fitting of a parametric distribution may also be done by matching theoretical quantiles of the parametric distribution (for specified probabilities) against the empirical quantiles [@Tse2009]. The equality of theoretical and empirical quantiles is expressed by Equation \@ref(eq:eq6) below, which is very similar to Equations \@ref(eq:eq4) and \@ref(eq:eq5):

\begin{equation}
F^{-1}(p_{k}|\theta)=Q_{n,p_{k}}(\#eq:eq6)
\end{equation}

for $k=1,\ldots,d$, with $d$ the number of parameters to estimate (the dimension of $\theta$ if there are no fixed parameters) and $Q_{n,p_{k}}$ the empirical quantiles calculated from data for specified probabilities $p_{k}$. Quantile matching estimation (QME) is performed by setting the argument `method` to `"qme"` in the call to `fitdist` and adding an argument `probs` defining the probabilities for which the quantile matching is performed (see Figure \@ref(fig:danishqme)). The length of this vector must be equal to the number of parameters to estimate (as the vector of moment orders for MME).
Empirical quantiles are computed using the `quantile` function of the **stats** package, with `type = 7` by default (see `?quantile` and @hyndmanfan96). But the quantile type can easily be changed via the `qtype` argument in the call to the `qmedist` function. The quantile matching is carried out numerically, by minimizing the sum of squared differences between observed and theoretical quantiles.

```{r danishqme, fig.align='center', fig.width=6, fig.height=6, fig.cap="Comparison between QME and MLE when fitting a lognormal distribution to loss data from the `danishuni` data set."}
fdanish.ln.QME1 <- fitdist(danishuni$Loss, "lnorm", method = "qme", probs = c(1/3, 2/3))
fdanish.ln.QME2 <- fitdist(danishuni$Loss, "lnorm", method = "qme", probs = c(8/10, 9/10))
cdfcomp(list(fdanish.ln.MLE, fdanish.ln.QME1, fdanish.ln.QME2),
        legend = c("MLE", "QME(1/3, 2/3)", "QME(8/10, 9/10)"),
        main = "Fitting a lognormal distribution", xlogscale = TRUE, datapch = 20)
```
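A minimal sketch of changing the quantile type (assuming, as stated above, that the `qtype` argument is passed through to the underlying `qmedist` function):

```{r, eval=FALSE}
# same QME fit but with type-5 empirical quantiles
# instead of the default type 7
fdanish.ln.QME3 <- fitdist(danishuni$Loss, "lnorm", method = "qme",
                           probs = c(1/3, 2/3), qtype = 5)
```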
Above is an example of the fitting of a lognormal distribution to the `danishuni` data set by matching probabilities $(p_1= 1/3, p_2=2/3)$ and $(p_1= 8/10, p_2=9/10)$. As expected, the second QME fit gives more weight to the right tail of the distribution. Compared to the maximum likelihood estimation, the second QME fit best suits the right tail of the distribution, whereas the first QME fit best models the body of the distribution. Quantile matching estimation is of particular interest when we need to focus on particular quantiles, e.g., $p=99.5\%$ in the Solvency II insurance context or $p=5\%$ for the HC5 estimation in the ecotoxicology context.

## Customization of the optimization algorithm {#Customization}

Each time a numerical minimization is carried out in the **fitdistrplus** package, the `optim` function of the **stats** package is used by default, with the `Nelder-Mead` method for distributions characterized by more than one parameter and the `BFGS` method for distributions characterized by only one parameter. Sometimes the default algorithm fails to converge. It is then interesting to change some options of the `optim` function or to use another optimization function than `optim` to minimize the objective function. The argument `optim.method` can be used in the call to `fitdist` or `fitdistcens`. It is internally passed to `mledist`, `mmedist`, `mgedist` or `qmedist`, and then to `optim` (see `?optim` for details about the different algorithms available).

Even if no error is raised during the optimization, changing the algorithm is of particular interest to enforce bounds on some parameters. For instance, a volatility parameter $\sigma$ is strictly positive ($\sigma>0$) and a probability parameter $p$ lies in $[0,1]$. This is possible with the arguments `lower` and/or `upper`, whose use automatically forces `optim.method = "L-BFGS-B"`.

Below are examples of fits of a gamma distribution $\mathcal{G}(\alpha, \lambda)$ to the `groundbeef` data set with various algorithms. Note that the conjugate gradient algorithm (`CG`) needs far more iterations to converge (around 2500 iterations) than the other algorithms (which converge in less than 100 iterations).

```{r optimmethod.gamma, echo=TRUE}
data("groundbeef")
fNM <- fitdist(groundbeef$serving, "gamma", optim.method = "Nelder-Mead")
fBFGS <- fitdist(groundbeef$serving, "gamma", optim.method = "BFGS")
fSANN <- fitdist(groundbeef$serving, "gamma", optim.method = "SANN")
fCG <- try(fitdist(groundbeef$serving, "gamma", optim.method = "CG",
                   control = list(maxit = 10000)))
if(inherits(fCG, "try-error")) {fCG <- list(estimate = NA)}
```

It is also possible to use a function other than `optim` to minimize the objective function, by specifying it via the argument `custom.optim` in the call to `fitdist`. It may be necessary to customize this optimization function to meet the following requirements: (1) `custom.optim` must have the arguments `fn` for the function to be optimized and `par` for the initialized parameters; (2) `custom.optim` should carry out a MINIMIZATION and must return the following components: `par` for the estimate, `convergence` for the convergence code, `value = fn(par)` and `hessian`. Below is an example of code written to wrap the `genoud` function from the **rgenoud** package in order to respect our optimization "template". The **rgenoud** package implements a genetic (stochastic) algorithm.

```{r optimmethod.customgenoud, echo=TRUE}
mygenoud <- function(fn, par, ...) 
{
  require(rgenoud)
  res <- genoud(fn, starting.values = par, ...)
  standardres <- c(res, convergence = 0)
  return(standardres)
}
```

The customized optimization function can then be passed as the argument `custom.optim` in the call to `fitdist` or `fitdistcens`. The following code can for example be used to fit a gamma distribution to the `groundbeef` data set. Note that in this example various arguments are also passed from `fitdist` to `genoud`: `nvars`, `Domains`, `boundary.enforcement`, `print.level` and `hessian`. The code below compares the parameter estimates ($\hat\alpha$, $\hat\lambda$) obtained by the different algorithms: the shape $\alpha$ and rate $\lambda$ estimates are relatively similar in this example, roughly 4.00 and 0.05, respectively.

```{r optimmethod.customgenoud.fitdist, echo=TRUE, eval=TRUE}
fgenoud <- mledist(groundbeef$serving, "gamma", custom.optim = mygenoud,
                   nvars = 2, max.generations = 10, Domains = cbind(c(0, 0), c(10, 10)),
                   boundary.enforcement = 1, hessian = TRUE, print.level = 0, P9 = 10)
cbind(NM = fNM$estimate, BFGS = fBFGS$estimate, SANN = fSANN$estimate,
      CG = fCG$estimate, fgenoud = fgenoud$estimate)
```

## Fitting distributions to other types of data {#otherdata}

*This section was modified since the publication of this vignette in the Journal of Statistical Software in order to include new goodness-of-fit plots for censored and discrete data.*

Analytical methods often lead to semi-quantitative results which are referred to as censored data. Observations only known to be under a limit of detection are left-censored data. Observations only known to be above a limit of quantification are right-censored data. Results known to lie between two bounds are interval-censored data. These two bounds may correspond to a limit of detection and a limit of quantification, or more generally to uncertainty bounds around the observation. Right-censored data are also commonly encountered with survival data [@kleinmoeschberger03]. A data set may thus contain right-, left-, or interval-censored data, or may be a mixture of these categories, possibly with different upper and lower bounds. Censored data are sometimes excluded from the data analysis or replaced by a fixed value, which in both cases may lead to biased results. A more recommended approach to correctly model such data is based upon maximum likelihood [@kleinmoeschberger03, @helsel05].

Censored data may thus contain left-censored, right-censored and interval-censored values, with several lower and upper bounds. Before their use in package **fitdistrplus**, such data must be coded into a data frame with two columns, respectively named `left` and `right`, describing each observed value as an interval. The `left` column contains either `NA` for left-censored observations, the left bound of the interval for interval-censored observations, or the observed value for non-censored observations. The `right` column contains either `NA` for right-censored observations, the right bound of the interval for interval-censored observations, or the observed value for non-censored observations.

To illustrate the use of package **fitdistrplus** to fit distributions to censored continuous data, we will use another data set from ecotoxicology, included in our package and named `salinity`. This data set contains acute salinity tolerance (LC50 values in electrical conductivity, $mS$.$cm^{-1}$) of riverine macro-invertebrate taxa from the southern Murray-Darling Basin in Central Victoria, Australia [@kefford07].
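Before turning to this example, here is a toy illustration (with hypothetical values) of the `left`/`right` coding just described:

```{r, eval=FALSE}
# 5 is observed exactly, "< 2" is left-censored,
# "> 7" is right-censored and "[3, 4]" is interval-censored
dtoy <- data.frame(left  = c(5, NA, 7, 3),
                   right = c(5, 2,  NA, 4))
dtoy
```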
```{r datsalinity, echo=TRUE}
data("salinity")
str(salinity)
```

Using censored data such as those coded in the `salinity` data set, the empirical distribution can be plotted using the `plotdistcens` function. In older versions of the package, by default this function used the Expectation-Maximization approach of @Turnbull74 to compute the overall empirical cdf curve with optional confidence intervals, by calls to the `survfit` and `plot.survfit` functions from the **survival** package. Even if this representation is still available (by fixing the argument `NPMLE.method` to `"Turnbull.middlepoints"`), the default plot of the empirical cumulative distribution function (ECDF) now explicitly represents the regions of non-uniqueness of the NPMLE ECDF. The default computation of those regions of non-uniqueness and their associated masses uses the non-parametric maximum likelihood estimation (NPMLE) approach developed by Wang [@Wang2007, @Wang2008, @Wang2013, @Wang2018]. Figure \@ref(fig:cdfcompcens) shows on the top left the new plot of data together with two fitted distributions. Grey filled rectangles in such a plot represent the regions of non-uniqueness of the NPMLE ECDF.

A less rigorous but sometimes more illustrative plot can be obtained by fixing the argument `NPMLE` to `FALSE` in the call to `plotdistcens` (see Figure \@ref(fig:plotsalinity2) for an example and the help page of the `plotdistcens` function for details). This plot makes it possible to see the real nature of censored data, as points and intervals, but the difficulty in building such a plot is to define a relevant ordering of observations.

```{r plotsalinity2, fig.align='center', fig.width=6, fig.height=6, fig.cap="Simple plot of censored raw data (72-hour acute salinity tolerance of riverine macro-invertebrates from the `salinity` data set) as ordered points and intervals."}
plotdistcens(salinity, NPMLE = FALSE)
```
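For comparison, the former Turnbull-based representation mentioned above remains available; a minimal sketch:

```{r, eval=FALSE}
plotdistcens(salinity, NPMLE.method = "Turnbull.middlepoints")
```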
As for non-censored data, one or more parametric distributions can be fitted to the censored data set, one at a time, but using in this case the `fitdistcens` function. This function estimates the vector of distribution parameters $\theta$ by maximizing the likelihood for censored data defined as:

\begin{equation}
L(\theta) = \prod_{i=1}^{N_{nonC}} f(x_{i}|\theta)\times \prod_{j=1}^{N_{leftC}} F(x^{upper}_{j}|\theta) \\
\times \prod_{k=1}^{N_{rightC}} (1- F(x^{lower}_{k}|\theta))\times \prod_{m=1}^{N_{intC}} (F(x^{upper}_{m}|\theta)- F(x^{lower}_{m}|\theta))(\#eq:eq7)
\end{equation}

with $x_{i}$ the $N_{nonC}$ non-censored observations, $x^{upper}_{j}$ the upper values defining the $N_{leftC}$ left-censored observations, $x^{lower}_{k}$ the lower values defining the $N_{rightC}$ right-censored observations, $[x^{lower}_{m} ; x^{upper}_{m}]$ the intervals defining the $N_{intC}$ interval-censored observations, and $F$ the cumulative distribution function of the parametric distribution [@kleinmoeschberger03, @helsel05].

Like `fitdist`, `fitdistcens` returns the results of the fit of any parametric distribution to a data set as an S3 class object that can be easily printed, summarized or plotted. For the `salinity` data set, a lognormal or a loglogistic distribution can be fitted, as commonly done in ecotoxicology for such data. As with `fitdist`, for some distributions (see @fitdistrplus for details), it is necessary to specify initial values for the distribution parameters in the argument `start`. The `plotdistcens` function can help to find correct initial values for the distribution parameters in non-trivial cases, by a manual iterative use if necessary.

```{r plotdistcens, echo=TRUE, fig=FALSE}
fsal.ln <- fitdistcens(salinity, "lnorm")
fsal.ll <- fitdistcens(salinity, "llogis", start = list(shape = 5, scale = 40))
summary(fsal.ln)
summary(fsal.ll)
```

Computations of goodness-of-fit statistics have not yet been developed for fits using censored data, but the quality of fit can be judged using Akaike and Schwarz's Bayesian information criteria (AIC and BIC) and the goodness-of-fit CDF plot, respectively provided when summarizing or plotting an object of class `fitdistcens`. Functions `cdfcompcens`, `qqcompcens` and `ppcompcens` can also be used to compare the fit of various distributions to the same censored data set. Their calls are similar to those of `cdfcomp`, `qqcomp` and `ppcomp`. Below are examples of the use of those functions with the two fitted distributions to the `salinity` data set (see Figure \@ref(fig:cdfcompcens)). When `qqcompcens` and `ppcompcens` are used with more than one fitted distribution, the non-uniqueness rectangles are not filled and a small noise is added on the y-axis in order to help the visualization of the various fits. But we rather recommend using `plotstyle = "ggplot"` in `qqcompcens` and `ppcompcens` to compare the fits of various distributions, as it provides a clearer plot split into facets (see `?graphcompcens`).

```{r cdfcompcens, fig.align='center', fig.width=7, fig.height=7, fig.cap="Some goodness-of-fit plots for fits of a lognormal and a loglogistic distribution to censored data: LC50 values from the `salinity` data set."}
par(mfrow = c(2, 2))
cdfcompcens(list(fsal.ln, fsal.ll), legendtext = c("lognormal", "loglogistic "))
qqcompcens(fsal.ln, legendtext = "lognormal")
ppcompcens(fsal.ln, legendtext = "lognormal")
qqcompcens(list(fsal.ln, fsal.ll), legendtext = c("lognormal", "loglogistic "),
           main = "Q-Q plot with 2 dist.")
```
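A minimal sketch of the `ggplot` variant recommended above (it requires the **ggplot2** package to be installed):

```{r, eval=FALSE}
qqcompcens(list(fsal.ln, fsal.ll), legendtext = c("lognormal", "loglogistic"),
           plotstyle = "ggplot")
```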
Function `bootdistcens` is the equivalent of `bootdist` for censored data, except that it only proposes nonparametric bootstrap. Indeed, it is not obvious how to simulate censoring within a parametric bootstrap resampling procedure. The generic function `quantile` can also be applied to an object of class `fitdistcens` or `bootdistcens`, as for continuous non-censored data.

In addition to the fit of distributions to censored or non-censored continuous data, our package can also accommodate discrete variables, such as count numbers, using the functions developed for continuous non-censored data. These functions will provide somewhat different graphs and statistics, taking into account the discrete nature of the modeled variable. The discrete nature of the variable is automatically recognized when a classical distribution is fitted to data (binomial, negative binomial, geometric, hypergeometric and Poisson distributions) but must be indicated by fixing the argument `discrete` to `TRUE` in the call to functions in other cases. The `toxocara` data set included in the package corresponds to the observation of such a discrete variable. Numbers of *Toxocara cati* parasites present in the digestive tract are reported from a random sampling of feral cats living on Kerguelen island [@Fromont01]. We will use it to illustrate the case of discrete data.

```{r dattoxocara, echo=TRUE}
data("toxocara")
str(toxocara)
```

The fit of a discrete distribution to discrete data by maximum likelihood estimation requires the same procedure as for continuous non-censored data. As an example, using the `toxocara` data set, Poisson and negative binomial distributions can easily be fitted.

```{r fittoxocara.poisnbinom, echo = TRUE, fig = FALSE}
(ftoxo.P <- fitdist(toxocara$number, "pois"))
(ftoxo.nb <- fitdist(toxocara$number, "nbinom"))
```

For discrete distributions, the plot of an object of class `fitdist` simply provides two goodness-of-fit plots comparing empirical and theoretical distributions in density and in CDF. Functions `cdfcomp` and `denscomp` can also be used to compare several fits to the same data set, as follows for the previous fits (Figure \@ref(fig:fittoxocarapoisnbinom)).

```{r fittoxocarapoisnbinom, fig.align='center', fig.width=7, fig.height=4, fig.cap="Comparison of the fits of a negative binomial and a Poisson distribution to numbers of *Toxocara cati* parasites from the `toxocara` data set."}
par(mfrow = c(1, 2))
denscomp(list(ftoxo.P, ftoxo.nb), legendtext = c("Poisson", "negative binomial"), fitlty = 1)
cdfcomp(list(ftoxo.P, ftoxo.nb), legendtext = c("Poisson", "negative binomial"), fitlty = 1)
```
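As noted above, for a discrete distribution that is not automatically recognized as such, the argument `discrete` must be set explicitly. A minimal sketch using the logarithmic distribution of the **actuar** package on a simulated sample (an illustration only, not used elsewhere in this vignette):

```{r, eval=FALSE}
library(actuar)
xlog <- rlogarithmic(100, prob = 0.5)
# "logarithmic" is not in the list of automatically recognized
# discrete distributions, hence discrete = TRUE
fitdist(xlog, "logarithmic", start = list(prob = 0.5), discrete = TRUE)
```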
When fitting discrete distributions, the Chi-squared statistic is computed by the `gofstat` function using cells defined by the argument `chisqbreaks`, or cells automatically defined from the data so as to reach roughly the same number of observations per cell. This number is roughly equal to the argument `meancount`, or slightly greater if there are some ties. The choice to define cells from the empirical distribution (data), and not from the theoretical distribution, was made to enable the comparison of Chi-squared values obtained with different distributions fitted to the same data set. If the arguments `chisqbreaks` and `meancount` are both omitted, `meancount` is fixed in order to obtain roughly $(4n)^{2/5}$ cells, with $n$ the length of the data set [@Vose10]. Using this default option, the two previous fits are compared as follows, giving preference to the negative binomial distribution, from both the Chi-squared statistics and the information criteria:

```{r fittoxocara.poisnbinom.gof}
gofstat(list(ftoxo.P, ftoxo.nb), fitnames = c("Poisson", "negative binomial"))
```

# Conclusion {#ccl}

The R package **fitdistrplus** makes it easy to fit distributions. Our main objective while developing this package was to provide tools for helping R users to fit distributions to data. We have been encouraged to pursue our work by feedback from users of our package in various areas such as food or environmental risk assessment, epidemiology, ecology, molecular biology, genomics, bioinformatics, hydraulics, mechanics, financial and actuarial mathematics or operations research. Indeed, this package is already used by many practitioners and academics for simple MLE fits [@jaloustreetal11, @saketal11, @kochetal12, @marquetouxetal12, @scholletal12, @suuronenetal12, @varoetal12, @mandletal13, @mala2013publi, @nadarajahbakar2013, @simoetal13, @westphalfitch13, @wayland13, @vianaetal13, @wu13, @staggeetal13, @fiorellietal13, @tikoleetal13, @voigtetal14], for MLE fits and goodness-of-fit statistics [@tarnczi11, @bagariaetal12, @benavidesetal12, @breitbach12, @Pouillot10, @vaninsky13], for MLE fits and bootstrap [@croucheretal12, @meheustetal12, @orellanoetal12, @telloetal12, @hoelzeretal12, @prosseretal13, @Zhang2013, @Rigaux2014], for MLE fits, bootstrap and goodness-of-fit statistics [@larrasetal13], for MME fits [@luangkesornetal12, @callauetal13, @satoetal13], for censored MLE and bootstrap [@lehaetal11, @poulliotetal12, @jongenburgeretal12, @commeauetal12, @contrerasetal2013], for graphical analysis [@anandetal12], for grouped-data fitting methods [@fusteinercostafreda12] or more generally [@busschaertetal10, @eling12, @sosaetal2013, @srinivasanetal2013, @meyeretal13, @Guillier2013471, @Daelmanetal13, @eiketal13, @Wu2013b, @drakeetal2014].

The **fitdistrplus** package is complementary to the **distrMod** package [@distrModJSS]. **distrMod** provides an even more flexible way to estimate distribution parameters, but its use requires a greater initial investment to learn how to manipulate the S4 classes and methods developed in the `distr`-family packages. Many extensions of the **fitdistrplus** package are planned in the future: we aim to extend to censored data some methods for the moment only available for non-censored data, especially concerning goodness-of-fit evaluation and fitting methods.
We will also enlarge the choice of fitting methods for non-censored data, by proposing new goodness-of-fit distances (e.g., distances based on quantiles) for maximum goodness-of-fit estimation and new types of moments (e.g., limited expected values) for moment matching estimation. Finally, we will consider the case of multivariate distribution fitting.
# Acknowledgments {-}

The package would not have been at this stage without the stimulating contribution of Régis Pouillot and Jean-Baptiste Denis, especially for its conceptualization. We also want to thank Régis Pouillot for his very valuable comments on the first version of this paper. The authors gratefully acknowledge the two anonymous referees and the Editor for useful and constructive comments. The remaining errors, of course, should be attributed to the authors alone.
# References {-}

fitdistrplus/inst/doc/Optimalgo.Rmd0000644000176200001440000004373214421750634017146 0ustar liggesusers---
title: Which optimization algorithm to choose?
author: Marie Laure Delignette Muller, Christophe Dutang
date: '`r Sys.Date()`'
output:
  bookdown::html_document2:
    base_format: rmarkdown::html_vignette
    fig_caption: yes
    toc: true
    number_sections: yes
    link-citations: true
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Which optimization algorithm to choose?}
  %\VignetteEncoding{UTF-8}
  \usepackage[utf8]{inputenc}
pkgdown:
  as_is: true
---

```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
require(fitdistrplus)
require(knitr) #for kable() function
set.seed(12345)
options(digits = 3)
```

# Quick overview of main optimization methods

We present very quickly the main optimization methods. Please refer to **Numerical Optimization (Nocedal \& Wright, 2006)** or **Numerical Optimization: theoretical and practical aspects (Bonnans, Gilbert, Lemarechal \& Sagastizabal, 2006)** for a good introduction. We consider the following problem $\min_x f(x)$ for $x\in\mathbb{R}^n$.

## Derivative-free optimization methods

The Nelder-Mead method is one of the most well known derivative-free methods that use only values of $f$ to search for the minimum. It consists in building a simplex of $n+1$ points and moving/shrinking this simplex into the good direction.

1. set initial points $x_1, \dots, x_{n+1}$.
2. order points such that $f(x_1)\leq f(x_2)\leq\dots\leq f(x_{n+1})$.
3. compute $x_o$ as the centroid of $x_1, \dots, x_{n}$.
4. Reflection:
    + compute the reflected point $x_r = x_o + \alpha(x_o-x_{n+1})$.
    + **if** $f(x_1)\leq f(x_r)<f(x_n)$ **then** replace $x_{n+1}$ by $x_r$, go to step 2.
    + **else** go to step 5.
5. Expansion:
    + **if** $f(x_r)<f(x_1)$ **then** compute the expansion point $x_e = x_o + \gamma(x_o-x_{n+1})$; replace $x_{n+1}$ by $x_e$ if $f(x_e)<f(x_r)$ and by $x_r$ otherwise; go to step 2.
    + **else** go to step 6.
6. Contraction:
    + compute the contracted point $x_c = x_o + \beta(x_o-x_{n+1})$.
    + **if** $f(x_c)<f(x_{n+1})$ **then** replace $x_{n+1}$ by $x_c$, go to step 2.
    + **else** go to step 7.
7. Shrink: for $i=2,\dots,n+1$, replace $x_i$ by $x_1+\sigma(x_i-x_1)$; go to step 2.

The Nelder-Mead method is available in `optim`.

## Hessian-free optimization methods

For a smooth non-linear function, such methods work on the scheme $x_{k+1} = x_k + t_k d_k$, where the local model provides the direction $d_k$ and a line search provides the stepsize $t_k$.

### Computing the direction $d_k$

A desirable property of $d_k$ is to ensure a descent, i.e., $f(x_{k+1}) < f(x_k)$. Quasi-Newton methods (such as BFGS, available in `optim`) obtain $d_k$ from a local quadratic approximation of $f$ in which the Hessian is replaced by an approximation updated at each iteration. The (non-linear) conjugate gradient method uses directions $d_k = -g(x_k) + \beta_k d_{k-1}$ for $k>1$, once initiated by $d_1 = -g(x_1)$, where $g$ denotes the gradient of $f$. The coefficients $\beta_k$ are updated according to a scheme:

* $\beta_k = \frac{ g_k^T g_k}{g_{k-1}^T g_{k-1} }$: Fletcher-Reeves update,
* $\beta_k = \frac{ g_k^T (g_k-g_{k-1} )}{g_{k-1}^T g_{k-1}}$: Polak-Ribiere update.

There also exists a three-term formula for computing the direction, $d_k = -g(x_k) + \beta_k d_{k-1}+\gamma_{k} d_t$ for $t<k$ (the Beale-Sorenson update), with $\gamma_k \neq 0$ if $k>t+1$ and $\gamma_k=0$ if $k=t$. See Yuan (2006) for other well-known schemes such as Hestenes-Stiefel, Dixon or Conjugate-Descent. The three updates (Fletcher-Reeves, Polak-Ribiere, Beale-Sorenson) of the (non-linear) conjugate gradient are available in `optim`.

### Computing the stepsize $t_k$

Let $\phi_k(t) = f(x_k + t d_k)$ for a given direction/iterate $(d_k, x_k)$. We need conditions to find a satisfactory stepsize $t_k$. In the literature, we consider the descent condition $\phi_k'(0) < 0$ and the Armijo condition $\phi_k(t) \leq \phi_k(0) + t c_1 \phi_k'(0)$, which ensures a decrease of $f$. Nocedal \& Wright (2006) present a backtracking (or geometric) approach satisfying the Armijo condition and a minimal-decrease condition, i.e., the Goldstein and Price condition.

* set $t_{k,0}$, e.g., 1, and $0 < \alpha < 1$,
* **Repeat** until the Armijo condition is satisfied,
    + $t_{k,i+1} = \alpha \times t_{k,i}$.
* **end Repeat**

This backtracking linesearch is available in `optim`.

## Benchmark

To simplify the benchmark of optimization methods, we create a `fitbench` function that computes the desired estimation method for all optimization methods. This function is currently not exported in the package.

```{r, echo=TRUE, eval=FALSE}
fitbench <- function(data, distr, method, grad = NULL,
                     control = list(trace = 0, REPORT = 1, maxit = 1000),
                     lower = -Inf, upper = +Inf, ...)
```

```{r, echo=FALSE}
fitbench <- fitdistrplus:::fitbench
```

# Numerical illustration with the beta distribution

## Log-likelihood function and its gradient for beta distribution

### Theoretical value

The density of the beta distribution is given by
$$
f(x; \delta_1,\delta_2) = \frac{x^{\delta_1-1}(1-x)^{\delta_2-1}}{\beta(\delta_1,\delta_2)},
$$
where $\beta$ denotes the beta function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/. We recall that $\beta(a,b)=\Gamma(a)\Gamma(b)/\Gamma(a+b)$. Hence the log-likelihood for a set of observations $(x_1,\dots,x_n)$ is
$$
\log L(\delta_1,\delta_2) = (\delta_1-1)\sum_{i=1}^n\log(x_i)+ (\delta_2-1)\sum_{i=1}^n\log(1-x_i)- n \log(\beta(\delta_1,\delta_2))
$$
The gradient with respect to $\delta_1$ and $\delta_2$ is
$$
\nabla \log L(\delta_1,\delta_2) =
\left(\begin{matrix}
\sum\limits_{i=1}^n\ln(x_i) - n\psi(\delta_1)+n\psi( \delta_1+\delta_2) \\
\sum\limits_{i=1}^n\ln(1-x_i)- n\psi(\delta_2)+n\psi( \delta_1+\delta_2)
\end{matrix}\right),
$$
where $\psi(x)=\Gamma'(x)/\Gamma(x)$ is the digamma function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/.

### `R` implementation

As in the `fitdistrplus` package, we minimize the opposite of the log-likelihood: we implement the opposite of the gradient in `grlnL`. Neither the log-likelihood nor its gradient is exported.

```{r}
lnL <- function(par, fix.arg, obs, ddistnam) 
  fitdistrplus:::loglikelihood(par, fix.arg, obs, ddistnam) 
grlnlbeta <- fitdistrplus:::grlnlbeta
```

## Random generation of a sample

```{r, fig.height=4, fig.width=4}
#(1) beta distribution
n <- 200
x <- rbeta(n, 3, 3/4)
grlnlbeta(c(3, 4), x) #test
hist(x, prob=TRUE)
lines(density(x), col="red")
curve(dbeta(x, 3, 3/4), col="green", add=TRUE)
legend("topleft", lty=1, col=c("red","green"), leg=c("empirical", "theoretical"))
```

## Fit Beta distribution

Define control parameters.

```{r}
ctr <- list(trace=0, REPORT=1, maxit=1000)
```

Call `mledist` with the default optimization function (`optim` implemented in the `stats` package), with and without the gradient, for the different optimization methods.

```{r}
unconstropt <- fitbench(x, "beta", "mle", grad=grlnlbeta, lower=0)
```

In the case of constrained optimization, `mledist` permits the direct use of the `constrOptim` function (also implemented in the `stats` package) that allows linear inequality constraints by using a logarithmic barrier. We use an exp/log transformation of the shape parameters $\delta_1$ and $\delta_2$ to ensure that the shape parameters are strictly positive.

```{r}
dbeta2 <- function(x, shape1, shape2, log)
  dbeta(x, exp(shape1), exp(shape2), log=log)
#take the log of the starting values
startarg <- lapply(fitdistrplus:::startargdefault(x, "beta"), log)
#redefine the gradient for the new parametrization
grbetaexp <- function(par, obs, ...) 
  grlnlbeta(exp(par), obs) * exp(par)

expopt <- fitbench(x, distr="beta2", method="mle", grad=grbetaexp, start=startarg) 
#get back to original parametrization
expopt[c("fitted shape1", "fitted shape2"), ] <- 
  exp(expopt[c("fitted shape1", "fitted shape2"), ])
```

Then we extract the values of the fitted parameters, the value of the corresponding log-likelihood, and the number of calls to the function to minimize and to its gradient (whether it is the theoretical gradient or the numerically approximated one).
## Results of the numerical investigation

Results are displayed in the following tables: (1) the original parametrization without specifying the gradient (`-B` stands for bounded version), (2) the original parametrization with the (true) gradient (`-B` stands for bounded version and `-G` for gradient), (3) the log-transformed parametrization without specifying the gradient, (4) the log-transformed parametrization with the (true) gradient (`-G` stands for gradient).

```{r, results='asis', echo=FALSE}
kable(unconstropt[, grep("G-", colnames(unconstropt), invert=TRUE)], digits=3)
```

```{r, results='asis', echo=FALSE}
kable(unconstropt[, grep("G-", colnames(unconstropt))], digits=3)
```

```{r, results='asis', echo=FALSE}
kable(expopt[, grep("G-", colnames(expopt), invert=TRUE)], digits=3)
```

```{r, results='asis', echo=FALSE}
kable(expopt[, grep("G-", colnames(expopt))], digits=3)
```

Using `llsurface`, we plot the log-likelihood surface around the true value (green) and the fitted parameters (red).

```{r, fig.width=4, fig.height=4}
llsurface(min.arg=c(0.1, 0.1), max.arg=c(7, 3), plot.arg=c("shape1", "shape2"),
          nlev=25, plot.np=50, data=x, distr="beta", back.col = FALSE)
points(unconstropt[1,"BFGS"], unconstropt[2,"BFGS"], pch="+", col="red")
points(3, 3/4, pch="x", col="green")
```

We can simulate bootstrap replicates using the `bootdist` function.

```{r, fig.width=4, fig.height=4}
b1 <- bootdist(fitdist(x, "beta", method = "mle", optim.method = "BFGS"), 
               niter = 100, parallel = "snow", ncpus = 2)
summary(b1)
plot(b1)
abline(v = 3, h = 3/4, col = "red", lwd = 1.5)
```

# Numerical illustration with the negative binomial distribution

## Log-likelihood function and its gradient for negative binomial distribution

### Theoretical value

The p.m.f. of the negative binomial distribution is given by
$$
f(x; m,p) = \frac{\Gamma(x+m)}{\Gamma(m)x!} p^m (1-p)^x,
$$
where $\Gamma$ denotes the gamma function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/. There exists an alternative representation where $\mu=m (1-p)/p$ or equivalently $p=m/(m+\mu)$. Thus, the log-likelihood for a set of observations $(x_1,\dots,x_n)$ is
$$
\log L(m,p) = \sum_{i=1}^{n} \log\Gamma(x_i+m) -n\log\Gamma(m) -\sum_{i=1}^{n} \log(x_i!) + mn\log(p) +\sum_{i=1}^{n} {x_i}\log(1-p)
$$
The gradient with respect to $m$ and $p$ is
$$
\nabla \log L(m,p) =
\left(\begin{matrix}
\sum_{i=1}^{n} \psi(x_i+m) -n \psi(m) + n\log(p) \\
mn/p -\sum_{i=1}^{n} {x_i}/(1-p)
\end{matrix}\right),
$$
where $\psi(x)=\Gamma'(x)/\Gamma(x)$ is the digamma function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/.

### `R` implementation

As in the `fitdistrplus` package, we minimize the opposite of the log-likelihood: we implement the opposite of the gradient in `grlnL`.

```{r}
grlnlNB <- function(x, obs, ...)
{
  m <- x[1]
  p <- x[2]
  n <- length(obs)
  c(sum(psigamma(obs+m)) - n*psigamma(m) + n*log(p),
    m*n/p - sum(obs)/(1-p))
}
```

## Random generation of a sample

```{r, fig.height=4, fig.width=4}
#(2) negative binomial distribution
n <- 200
trueval <- c("size"=10, "prob"=3/4, "mu"=10/3)
x <- rnbinom(n, trueval["size"], trueval["prob"])

hist(x, prob=TRUE, ylim=c(0, .3))
lines(density(x), col="red")
points(min(x):max(x), dnbinom(min(x):max(x), trueval["size"], trueval["prob"]),
       col = "green")
legend("topleft", lty = 1, col = c("red", "green"), leg = c("empirical", "theoretical"))
```

## Fit a negative binomial distribution

Define control parameters and make the benchmark.
```{r}
ctr <- list(trace = 0, REPORT = 1, maxit = 1000)
unconstropt <- fitbench(x, "nbinom", "mle", grad = grlnlNB, lower = 0)
unconstropt <- rbind(unconstropt, 
                     "fitted prob" = unconstropt["fitted size", ] / 
                       (unconstropt["fitted size", ] + unconstropt["fitted mu", ]))
```

In the case of constrained optimization, `mledist` permits the direct use of the `constrOptim` function (also implemented in the `stats` package) that allows linear inequality constraints by using a logarithmic barrier. Here we use an exp transformation of the size parameter and a logit transformation of the probability parameter to ensure that both parameters stay within their respective domains.

```{r}
dnbinom2 <- function(x, size, prob, log)
  dnbinom(x, exp(size), 1 / (1 + exp(-prob)), log = log)
# transform starting values
startarg <- fitdistrplus:::startargdefault(x, "nbinom")
startarg$mu <- startarg$size / (startarg$size + startarg$mu)
startarg <- list(size = log(startarg[[1]]), 
                 prob = log(startarg[[2]] / (1 - startarg[[2]])))
# redefine the gradient for the new parametrization
Trans <- function(x) c(exp(x[1]), plogis(x[2]))
grNBexp <- function(par, obs, ...) 
  grlnlNB(Trans(par), obs) * c(exp(par[1]), plogis(par[2]) * (1 - plogis(par[2])))

expopt <- fitbench(x, distr = "nbinom2", method = "mle", grad = grNBexp, start = startarg) 
# get back to original parametrization
expopt[c("fitted size", "fitted prob"), ] <- 
  apply(expopt[c("fitted size", "fitted prob"), ], 2, Trans)
```

Then we extract the values of the fitted parameters, the value of the corresponding log-likelihood, and the number of calls to the function to minimize and to its gradient (whether it is the theoretical gradient or the numerically approximated one).

## Results of the numerical investigation

Results are displayed in the following tables: (1) the original parametrization without specifying the gradient (`-B` stands for bounded version), (2) the original parametrization with the (true) gradient (`-B` stands for bounded version and `-G` for gradient), (3) the log-transformed parametrization without specifying the gradient, (4) the log-transformed parametrization with the (true) gradient (`-G` stands for gradient).

```{r, results='asis', echo=FALSE}
kable(unconstropt[, grep("G-", colnames(unconstropt), invert=TRUE)], digits=3)
```

```{r, results='asis', echo=FALSE}
kable(unconstropt[, grep("G-", colnames(unconstropt))], digits=3)
```

```{r, results='asis', echo=FALSE}
kable(expopt[, grep("G-", colnames(expopt), invert=TRUE)], digits=3)
```

```{r, results='asis', echo=FALSE}
kable(expopt[, grep("G-", colnames(expopt))], digits=3)
```

Using `llsurface`, we plot the log-likelihood surface around the true value (green) and the fitted parameters (red).

```{r, fig.width=4, fig.height=4}
llsurface(min.arg = c(5, 0.3), max.arg = c(15, 1), plot.arg = c("size", "prob"),
          nlev = 25, plot.np = 50, data = x, distr = "nbinom", back.col = FALSE)
points(unconstropt["fitted size", "BFGS"], unconstropt["fitted prob", "BFGS"],
       pch = "+", col = "red")
points(trueval["size"], trueval["prob"], pch = "x", col = "green")
```

We can simulate bootstrap replicates using the `bootdist` function.

```{r, fig.width=4, fig.height=4}
b1 <- bootdist(fitdist(x, "nbinom", method = "mle", optim.method = "BFGS"), 
               niter = 100, parallel = "snow", ncpus = 2)
summary(b1)
plot(b1)
abline(v = trueval["size"], h = trueval["mu"], col = "red", lwd = 1.5)
```

# Conclusion

Based on the two previous examples, we observe that all methods converge to the same point. This is reassuring. However, the number of function evaluations (and of gradient evaluations) varies considerably from one method to another.
Furthermore, specifying the true gradient of the log-likelihood does not help at all the fitting procedure and generally slows down the convergence. Generally, the best method is the standard BFGS method or the BFGS method with the exponential transformation of the parameters. Since the exponential function is differentiable, the asymptotic properties are still preserved (by the Delta method) but for finite-sample this may produce a small bias. fitdistrplus/inst/doc/FAQ.R0000644000176200001440000005370314421751577015307 0ustar liggesusers## ----setup, echo=FALSE, message=FALSE, warning=FALSE-------------------------- require(fitdistrplus) set.seed(1234) options(digits = 3) ## ---- eval=FALSE-------------------------------------------------------------- # dgumbel <- function(x, a, b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b)) # pgumbel <- function(q, a, b) exp(-exp((a-q)/b)) # qgumbel <- function(p, a, b) a-b*log(-log(p)) # data(groundbeef) # fitgumbel <- fitdist(groundbeef$serving, "gumbel", start=list(a=10, b=10)) ## ---- eval=FALSE-------------------------------------------------------------- # dzmgeom <- function(x, p1, p2) p1 * (x == 0) + (1-p1)*dgeom(x-1, p2) # pzmgeom <- function(q, p1, p2) p1 * (q >= 0) + (1-p1)*pgeom(q-1, p2) # rzmgeom <- function(n, p1, p2) # { # u <- rbinom(n, 1, 1-p1) #prob to get zero is p1 # u[u != 0] <- rgeom(sum(u != 0), p2)+1 # u # } # x2 <- rzmgeom(1000, 1/2, 1/10) # fitdist(x2, "zmgeom", start=list(p1=1/2, p2=1/2)) ## ---- message=FALSE----------------------------------------------------------- data("endosulfan") library("actuar") fendo.B <- fitdist(endosulfan$ATV, "burr", start = list(shape1 = 0.3, shape2 = 1, rate = 1)) summary(fendo.B) ## ---- fig.height=3, fig.width=6----------------------------------------------- x3 <- rlnorm(1000) f1 <- fitdist(x3, "lnorm", method="mle") f2 <- fitdist(x3, "lnorm", method="mme") par(mfrow=1:2) cdfcomp(list(f1, f2), do.points=FALSE, xlogscale = TRUE, main = "CDF plot") denscomp(list(f1, f2), demp=TRUE, main = "Density plot") ## ----------------------------------------------------------------------------- c("E(X) by MME"=as.numeric(exp(f2$estimate["meanlog"]+f2$estimate["sdlog"]^2/2)), "E(X) by MLE"=as.numeric(exp(f1$estimate["meanlog"]+f1$estimate["sdlog"]^2/2)), "empirical"=mean(x3)) c("Var(X) by MME"=as.numeric(exp(2*f2$estimate["meanlog"]+f2$estimate["sdlog"]^2) * (exp(f2$estimate["sdlog"]^2)-1)), "Var(X) by MLE"=as.numeric(exp(2*f1$estimate["meanlog"]+f1$estimate["sdlog"]^2) * (exp(f1$estimate["sdlog"]^2)-1)), "empirical"=var(x3)) ## ----------------------------------------------------------------------------- set.seed(1234) x <- rnorm(100, mean = 1, sd = 0.5) (try(fitdist(x, "exp"))) ## ----------------------------------------------------------------------------- fitdist(x[x >= 0], "exp") fitdist(x - min(x), "exp") ## ----------------------------------------------------------------------------- set.seed(1234) x <- rnorm(100, mean = 0.5, sd = 0.25) (try(fitdist(x, "beta"))) ## ----------------------------------------------------------------------------- fitdist(x[x > 0 & x < 1], "beta") fitdist((x - min(x)*1.01) / (max(x) * 1.01 - min(x) * 1.01), "beta") ## ---- message=FALSE, fig.height=4, fig.width=8-------------------------------- dtexp <- function(x, rate, low, upp) { PU <- pexp(upp, rate=rate) PL <- pexp(low, rate=rate) dexp(x, rate) / (PU-PL) * (x >= low) * (x <= upp) } ptexp <- function(q, rate, low, upp) { PU <- pexp(upp, rate=rate) PL <- pexp(low, rate=rate) (pexp(q, rate)-PL) / (PU-PL) * (q >= low) * (q <= upp) 
+ 1 * (q > upp) } n <- 200 x <- rexp(n); x <- x[x > .5 & x < 3] f1 <- fitdist(x, "texp", method="mle", start=list(rate=3), fix.arg=list(low=min(x), upp=max(x))) f2 <- fitdist(x, "texp", method="mle", start=list(rate=3), fix.arg=list(low=.5, upp=3)) gofstat(list(f1, f2)) cdfcomp(list(f1, f2), do.points = FALSE, xlim=c(0, 3.5)) ## ---- message=FALSE, fig.height=4, fig.width=8-------------------------------- dtiexp <- function(x, rate, low, upp) { PU <- pexp(upp, rate=rate, lower.tail = FALSE) PL <- pexp(low, rate=rate) dexp(x, rate) * (x >= low) * (x <= upp) + PL * (x == low) + PU * (x == upp) } ptiexp <- function(q, rate, low, upp) pexp(q, rate) * (q >= low) * (q <= upp) + 1 * (q > upp) n <- 100; x <- pmax(pmin(rexp(n), 3), .5) # the loglikelihood has a discontinous point at the solution par(mar=c(4,4,2,1), mfrow=1:2) llcurve(x, "tiexp", plot.arg="low", fix.arg = list(rate=2, upp=5), min.arg=0, max.arg=.5, lseq=200) llcurve(x, "tiexp", plot.arg="upp", fix.arg = list(rate=2, low=0), min.arg=3, max.arg=4, lseq=200) ## ---- fig.height=4, fig.width=6----------------------------------------------- (f1 <- fitdist(x, "tiexp", method="mle", start=list(rate=3, low=0, upp=20))) (f2 <- fitdist(x, "tiexp", method="mle", start=list(rate=3), fix.arg=list(low=min(x), upp=max(x)))) gofstat(list(f1, f2)) cdfcomp(list(f1, f2), do.points = FALSE, addlegend=FALSE, xlim=c(0, 3.5)) curve(ptiexp(x, 1, .5, 3), add=TRUE, col="blue", lty=3) legend("bottomright", lty=1:3, col=c("red", "green", "blue", "black"), leg=c("full MLE", "MLE fixed arg", "true CDF", "emp. CDF")) ## ---- fig.height=4, fig.width=6----------------------------------------------- trueval <- c("min"=3, "max"=5) x <- runif(n=500, trueval[1], trueval[2]) f1 <- fitdist(x, "unif") delta <- .01 llsurface(x, "unif", plot.arg = c("min", "max"), min.arg=c(min(x)-2*delta, max(x)-delta), max.arg=c(min(x)+delta, max(x)+2*delta), main="likelihood surface for uniform", loglik=FALSE) abline(v=min(x), h=max(x), col="grey", lty=2) points(f1$estimate[1], f1$estimate[2], pch="x", col="red") points(trueval[1], trueval[2], pch="+", col="blue") legend("bottomright", pch=c("+","x"), col=c("blue","red"), c("true", "fitted")) delta <- .2 llsurface(x, "unif", plot.arg = c("min", "max"), min.arg=c(3-2*delta, 5-delta), max.arg=c(3+delta, 5+2*delta), main="log-likelihood surface for uniform") abline(v=min(x), h=max(x), col="grey", lty=2) points(f1$estimate[1], f1$estimate[2], pch="x", col="red") points(trueval[1], trueval[2], pch="+", col="blue") legend("bottomright", pch=c("+","x"), col=c("blue","red"), c("true", "fitted")) ## ----------------------------------------------------------------------------- dunif2 <- function(x, min, max) dunif(x, min, max) punif2 <- function(q, min, max) punif(q, min, max) f2 <- fitdist(x, "unif2", start=list(min=0, max=10), lower=c(-Inf, max(x)), upper=c(min(x), Inf)) print(c(logLik(f1), logLik(f2)), digits=7) print(cbind(coef(f1), coef(f2)), digits=7) ## ----------------------------------------------------------------------------- x <- rbeta(1000, 3, 3) dbeta2 <- function(x, shape, ...) dbeta(x, shape, shape, ...) pbeta2 <- function(q, shape, ...) pbeta(q, shape, shape, ...) 
fitdist(x, "beta2", start=list(shape=1/2)) ## ----------------------------------------------------------------------------- x <- rbeta(1000, .3, .3) fitdist(x, "beta2", start=list(shape=1/2), optim.method="L-BFGS-B", lower=1e-2) ## ---- message=FALSE, fig.height=4, fig.width=6-------------------------------- require(mc2d) x2 <- rpert(n=2e2, min=0, mode=1, max=2, shape=3/4) eps <- sqrt(.Machine$double.eps) f1 <- fitdist(x2, "pert", start=list(min=-1, mode=0, max=10, shape=1), lower=c(-Inf, -Inf, -Inf, 0), upper=c(Inf, Inf, Inf, Inf)) f2 <- fitdist(x2, "pert", start=list(mode=1, shape=1), fix.arg=list(min=min(x2)-eps, max=max(x2)+eps), lower=c(min(x2), 0), upper=c(max(x2), Inf)) gofstat(list(f1,f2)) cdfcomp(list(f1,f2)) print(cbind(coef(f1), c(f2$fix.arg["min"], coef(f2)["mode"], f2$fix.arg["max"], coef(f2)["shape"])), digits=7) ## ---- fig.height=3, fig.width=6----------------------------------------------- set.seed(1234) x <- rgamma(n = 100, shape = 2, scale = 1) # fit of the good distribution fgamma <- fitdist(x, "gamma") # fit of a bad distribution fexp <- fitdist(x, "exp") g <- gofstat(list(fgamma, fexp), fitnames = c("gamma", "exp")) denscomp(list(fgamma, fexp), legendtext = c("gamma", "exp")) # results of the tests ## chi square test (with corresponding table with theoretical and observed counts) g$chisqpvalue g$chisqtable ## Anderson-Darling test g$adtest ## Cramer von Mises test g$cvmtest ## Kolmogorov-Smirnov test g$kstest ## ---- fig.height=3, fig.width=6----------------------------------------------- set.seed(1234) x1 <- rpois(n = 100, lambda = 100) f1 <- fitdist(x1, "norm") g1 <- gofstat(f1) g1$kstest x2 <- rpois(n = 10000, lambda = 100) f2 <- fitdist(x2, "norm") g2 <- gofstat(f2) g2$kstest par(mfrow=1:2) denscomp(f1, demp = TRUE, addlegend = FALSE, main = "small sample") denscomp(f2, demp = TRUE, addlegend = FALSE, main = "big sample") ## ---- fig.height=3, fig.width=6----------------------------------------------- set.seed(1234) x3 <- rpois(n = 500, lambda = 1) f3 <- fitdist(x3, "norm") g3 <- gofstat(f3) g3$kstest x4 <- rpois(n = 50, lambda = 1) f4 <- fitdist(x4, "norm") g4 <- gofstat(f4) g4$kstest par(mfrow=1:2) denscomp(f3, addlegend = FALSE, main = "big sample") denscomp(f4, addlegend = FALSE, main = "small sample") ## ----------------------------------------------------------------------------- g3$chisqtable g3$chisqpvalue g4$chisqtable g4$chisqpvalue ## ----------------------------------------------------------------------------- set.seed(1234) g <- rgamma(100, shape = 2, rate = 1) (f <- fitdist(g, "gamma")) (f0 <- fitdist(g, "exp")) L <- logLik(f) k <- length(f$estimate) # number of parameters of the complete distribution L0 <- logLik(f0) k0 <- length(f0$estimate) # number of parameters of the simplified distribution (stat <- 2*L - 2*L0) (critical_value <- qchisq(0.95, df = k - k0)) (rejected <- stat > critical_value) ## ----------------------------------------------------------------------------- dshiftlnorm <- function(x, mean, sigma, shift, log = FALSE) dlnorm(x+shift, mean, sigma, log=log) pshiftlnorm <- function(q, mean, sigma, shift, log.p = FALSE) plnorm(q+shift, mean, sigma, log.p=log.p) qshiftlnorm <- function(p, mean, sigma, shift, log.p = FALSE) qlnorm(p, mean, sigma, log.p=log.p)-shift dshiftlnorm_no <- function(x, mean, sigma, shift) dshiftlnorm(x, mean, sigma, shift) pshiftlnorm_no <- function(q, mean, sigma, shift) pshiftlnorm(q, mean, sigma, shift) ## ----------------------------------------------------------------------------- data(dataFAQlog1) y <- 
dataFAQlog1 D <- 1-min(y) f0 <- fitdist(y+D, "lnorm") start <- list(mean=as.numeric(f0$estimate["meanlog"]), sigma=as.numeric(f0$estimate["sdlog"]), shift=D) # works with BFGS, but not Nelder-Mead f <- fitdist(y, "shiftlnorm", start=start, optim.method="BFGS") summary(f) ## ---- error=FALSE------------------------------------------------------------- f2 <- try(fitdist(y, "shiftlnorm_no", start=start, optim.method="BFGS")) print(attr(f2, "condition")) ## ----------------------------------------------------------------------------- sum(log(dshiftlnorm_no(y, 0.16383978, 0.01679231, 1.17586600 ))) log(prod(dshiftlnorm_no(y, 0.16383978, 0.01679231, 1.17586600 ))) sum(dshiftlnorm(y, 0.16383978, 0.01679231, 1.17586600, TRUE )) ## ---- eval=FALSE, echo=TRUE--------------------------------------------------- # double dlnorm(double x, double meanlog, double sdlog, int give_log) # { # double y; # # #ifdef IEEE_754 # if (ISNAN(x) || ISNAN(meanlog) || ISNAN(sdlog)) # return x + meanlog + sdlog; # #endif # if(sdlog <= 0) { # if(sdlog < 0) ML_ERR_return_NAN; # // sdlog == 0 : # return (log(x) == meanlog) ? ML_POSINF : R_D__0; # } # if(x <= 0) return R_D__0; # # y = (log(x) - meanlog) / sdlog; # return (give_log ? # -(M_LN_SQRT_2PI + 0.5 * y * y + log(x * sdlog)) : # M_1_SQRT_2PI * exp(-0.5 * y * y) / (x * sdlog)); # /* M_1_SQRT_2PI = 1 / sqrt(2 * pi) */ # # } ## ---- eval=FALSE, echo=TRUE--------------------------------------------------- # -(M_LN_SQRT_2PI + 0.5 * y * y + log(x * sdlog)) ## ---- eval=FALSE, echo=TRUE--------------------------------------------------- # M_1_SQRT_2PI * exp(-0.5 * y * y) / (x * sdlog)) ## ----------------------------------------------------------------------------- f2 <- fitdist(y, "shiftlnorm", start=start, lower=c(-Inf, 0, -min(y)), optim.method="Nelder-Mead") summary(f2) print(cbind(BFGS=f$estimate, NelderMead=f2$estimate)) ## ----------------------------------------------------------------------------- data(dataFAQscale1) head(dataFAQscale1) summary(dataFAQscale1) ## ----------------------------------------------------------------------------- for(i in 6:0) cat(10^i, try(mledist(dataFAQscale1*10^i, "cauchy")$estimate), "\n") ## ----------------------------------------------------------------------------- data(dataFAQscale2) head(dataFAQscale2) summary(dataFAQscale2) ## ----------------------------------------------------------------------------- for(i in 0:5) cat(10^(-2*i), try(mledist(dataFAQscale2*10^(-2*i), "cauchy")$estimate), "\n") ## ----scalenormal, echo=TRUE, warning=FALSE------------------------------------ set.seed(1234) x <- rnorm(1000, 1, 2) fitdist(x, "norm", lower=c(-Inf, 0)) ## ----shapeburr, echo=TRUE, warning=FALSE-------------------------------------- x <- rburr(1000, 1, 2, 3) fitdist(x, "burr", lower=c(0, 0, 0), start=list(shape1 = 1, shape2 = 1, rate = 1)) ## ----probgeom, echo=TRUE, warning=FALSE--------------------------------------- x <- rgeom(1000, 1/4) fitdist(x, "geom", lower=0, upper=1) ## ----shiftexp, echo=TRUE, warning=FALSE--------------------------------------- dsexp <- function(x, rate, shift) dexp(x-shift, rate=rate) psexp <- function(x, rate, shift) pexp(x-shift, rate=rate) rsexp <- function(n, rate, shift) rexp(n, rate=rate)+shift x <- rsexp(1000, 1/4, 1) fitdist(x, "sexp", start=list(rate=1, shift=0), lower= c(0, -min(x))) ## ---- message=FALSE----------------------------------------------------------- library(GeneralizedHyperbolic) myoptim <- function(fn, par, ui, ci, ...) 
{ res <- constrOptim(f=fn, theta=par, method="Nelder-Mead", ui=ui, ci=ci, ...) c(res, convergence=res$convergence, value=res$objective, par=res$minimum, hessian=res$hessian) } x <- rnig(1000, 3, 1/2, 1/2, 1/4) ui <- rbind(c(0,1,0,0), c(0,0,1,0), c(0,0,1,-1), c(0,0,1,1)) ci <- c(0,0,0,0) fitdist(x, "nig", custom.optim=myoptim, ui=ui, ci=ci, start=list(mu = 0, delta = 1, alpha = 1, beta = 0)) ## ---- fig.height=3, fig.width=6----------------------------------------------- pgeom(0:3, prob=1/2) qgeom(c(0.3, 0.6, 0.9), prob=1/2) par(mar=c(4,4,2,1), mfrow=1:2) curve(pgeom(x, prob=1/2), 0, 10, n=301, main="c.d.f.") curve(qgeom(x, prob=1/2), 0, 1, n=301, main="q.f.") ## ----------------------------------------------------------------------------- x <- c(0, 0, 0, 0, 1, 1, 3, 2, 1, 0, 0) median(x[-1]) #sample size 10 median(x) #sample size 11 ## ---- fig.height=4, fig.width=6----------------------------------------------- x <- rgeom(100, 1/3) L2 <- function(p) (qgeom(1/2, p) - median(x))^2 L2(1/3) #theoretical value curve(L2(x), 0.10, 0.95, xlab=expression(p), ylab=expression(L2(p)), main="squared differences", n=301) ## ----------------------------------------------------------------------------- fitdist(x, "geom", method="qme", probs=1/2, start=list(prob=1/2), control=list(trace=1, REPORT=1)) fitdist(x, "geom", method="qme", probs=1/2, start=list(prob=1/20), control=list(trace=1, REPORT=1)) ## ----------------------------------------------------------------------------- fitdist(x, "geom", method="qme", probs=1/2, optim.method="SANN", start=list(prob=1/20)) fitdist(x, "geom", method="qme", probs=1/2, optim.method="SANN", start=list(prob=1/2)) ## ---- fig.height=4, fig.width=6----------------------------------------------- par(mar=c(4,4,2,1)) x <- rpois(100, lambda=7.5) L2 <- function(lam) (qpois(1/2, lambda = lam) - median(x))^2 curve(L2(x), 6, 9, xlab=expression(lambda), ylab=expression(L2(lambda)), main="squared differences", n=201) ## ----------------------------------------------------------------------------- fitdist(x, "pois", method="qme", probs=1/2, start=list(lambda=2)) fitdist(x, "pois", method="qme", probs=1/2, optim.method="SANN", start=list(lambda=2)) ## ---- fig.height=4, fig.width=4, warning = FALSE------------------------------ set.seed(1234) n <- rnorm(30, mean = 10, sd = 2) fn <- fitdist(n, "norm") bn <- bootdist(fn) bn$CI fn$estimate + cbind("estimate"= 0, "2.5%"= -1.96*fn$sd, "97.5%"= 1.96*fn$sd) llplot(fn, back.col = FALSE) ## ---- fig.height=4, fig.width=4, warning = FALSE------------------------------ set.seed(1234) g <- rgamma(30, shape = 0.1, rate = 10) fg <- fitdist(g, "gamma") bg <- bootdist(fg) bg$CI fg$estimate + cbind("estimate"= 0, "2.5%"= -1.96*fg$sd, "97.5%"= 1.96*fg$sd) llplot(fg, back.col = FALSE) ## ---- fig.height=3, fig.width=4, warning = FALSE------------------------------ data(salinity) log10LC50 <-log10(salinity) fit <- fitdistcens(log10LC50, "norm") # Bootstrap bootsample <- bootdistcens(fit, niter = 101) #### We used only 101 iterations in that example to limit the calculation time but #### in practice you should take at least 1001 bootstrap iterations # Calculation of the quantile of interest (here the 5 percent hazard concentration) (HC5 <- quantile(bootsample, probs = 0.05)) # visualizing pointwise confidence intervals on other quantiles CIcdfplot(bootsample, CI.output = "quantile", CI.fill = "pink", xlim = c(0.5,2), main = "") ## ----------------------------------------------------------------------------- exposure <- 1.2 # Bootstrap sample of the PAF 
at this exposure PAF <- pnorm(exposure, mean = bootsample$estim$mean, sd = bootsample$estim$sd) # confidence interval from 2.5 and 97.5 percentiles quantile(PAF, probs = c(0.025, 0.975)) ## ---- fig.height=6, fig.width=6, warning = FALSE------------------------------ data(groundbeef) serving <- groundbeef$serving fit <- fitdist(serving, "gamma") par(mfrow = c(2,2), mar = c(4, 4, 1, 1)) denscomp(fit, addlegend = FALSE, main = "", xlab = "serving sizes (g)", fitcol = "orange") qqcomp(fit, addlegend = FALSE, main = "", fitpch = 16, fitcol = "grey", line01lty = 2) cdfcomp(fit, addlegend = FALSE, main = "", xlab = "serving sizes (g)", fitcol = "orange", lines01 = TRUE) ppcomp(fit, addlegend = FALSE, main = "", fitpch = 16, fitcol = "grey", line01lty = 2) ## ---- fig.height= 4, fig.width= 6, warning = FALSE---------------------------- library(ggplot2) fitW <- fitdist(serving, "weibull") fitln <- fitdist(serving, "lnorm") fitg <- fitdist(serving, "gamma") dcomp <- denscomp(list(fitW, fitln, fitg), legendtext = c("Weibull", "lognormal", "gamma"), xlab = "serving sizes (g)", xlim = c(0, 250), fitcol = c("red", "green", "orange"), fitlty = 1, fitlwd = 1:3, xlegend = "topright", plotstyle = "ggplot", addlegend = FALSE) dcomp + ggplot2::theme_minimal() + ggplot2::ggtitle("Ground beef fits") ## ---- fig.height= 6, fig.width= 6, warning = FALSE---------------------------- data(endosulfan) ATV <- subset(endosulfan, group == "NonArthroInvert")$ATV taxaATV <- subset(endosulfan, group == "NonArthroInvert")$taxa f <- fitdist(ATV, "lnorm") cdfcomp(f, xlogscale = TRUE, main = "Species Sensitivty Distribution", xlim = c(1, 100000), name.points = taxaATV, addlegend = FALSE, plotstyle = "ggplot") ## ----------------------------------------------------------------------------- dtoy <- data.frame(left = c(NA, 2, 4, 6, 9.7, 10), right = c(1, 3, 7, 8, 9.7, NA)) dtoy ## ----------------------------------------------------------------------------- exitage <- c(81.1,78.9,72.6,67.9,60.1,78.3,83.4,66.9,74.8,80.5,75.6,67.1, 75.3,82.8,70.1,85.4,74,70,71.6,76.5) death <- c(0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0) ## ----------------------------------------------------------------------------- svdata <- Surv2fitdistcens(exitage, event=death) ## ---- fig.height= 4, fig.width= 6--------------------------------------------- flnormc <- fitdistcens(svdata, "lnorm") fweic <- fitdistcens(svdata, "weibull") cdfcompcens(list(fweic, flnormc), xlim=range(exitage), xlegend = "topleft") ## ---- fig.height= 4, fig.width= 8--------------------------------------------- par(mfrow = c(1,2), mar = c(3, 4, 3, 0.5)) plotdistcens(dtoy, NPMLE = FALSE) data(smokedfish) dsmo <- log10(smokedfish) plotdistcens(dsmo, NPMLE = FALSE) ## ---- fig.height= 6, fig.width= 6--------------------------------------------- par(mfrow = c(2, 2), mar = c(3, 4, 3, 0.5)) # Turnbull algorithm with representation of middle points of equivalence classes plotdistcens(dsmo, NPMLE.method = "Turnbull.middlepoints", xlim = c(-1.8, 2.4)) # Turnbull algorithm with representation of equivalence classes as intervals plotdistcens(dsmo, NPMLE.method = "Turnbull.intervals") # Wang algorithm with representation of equivalence classes as intervals plotdistcens(dsmo, NPMLE.method = "Wang") ## ---- echo = FALSE, fig.height= 4, fig.width= 8------------------------------- d <- data.frame(left = c(NA, 2, 4, 6, 9.5, 10), right = c(1, 3, 7, 8, 9.5, NA)) addbounds <- function(d) { xbounds <- c(d$left, d$right) xboundsnotNA <- xbounds[!is.na(xbounds)] abline(v = xboundsnotNA, col = "grey") } 
addLR <- function(d) { Lbounds <- d$left[!is.na(d$left)] Rbounds <- d$right[!is.na(d$right)] range <- range(c(Lbounds,Rbounds)) eps <- (range[2] - range[1]) * 0.01 text(x = Lbounds-eps, y = 0.05, labels = "L", col = "red", cex = 0.75) text(x = Rbounds+eps, y = 0.05, labels = "R", col = "red", cex = 0.75) } addeq <- function(deq) { left <- deq$left left[is.na(left)] <- -100 right <- deq$right right[is.na(right)] <- 100 rect(left, -2, right, 2, density = 10) } par(mfrow = c(2,1), mar = c(2, 4, 3, 0.5)) # First step plotdistcens(d, NPMLE = FALSE, lwd = 2, col = "blue", main = "Step 1 : identification of equivalence classes") addbounds(d) addLR(d) deq <- data.frame(left = c(NA, 2, 6, 9.5, 10), right = c(1, 3, 7,9.5, NA)) addeq(deq) # Second step plotdistcens(d, lwd = 2, main = "Step 2 : estimation of mass probabilities") ## ----------------------------------------------------------------------------- fnorm <- fitdistcens(dsmo,"norm") flogis <- fitdistcens(dsmo,"logis") # comparison of AIC values summary(fnorm)$aic summary(flogis)$aic ## ---- fig.height= 6, fig.width= 6--------------------------------------------- par(mar = c(2, 4, 3, 0.5)) plot(fnorm) ## ---- fig.height= 4, fig.width= 4--------------------------------------------- cdfcompcens(list(fnorm, flogis), fitlty = 1) qqcompcens(list(fnorm, flogis)) ppcompcens(list(fnorm, flogis)) ## ---- fig.height= 4, fig.width= 8--------------------------------------------- qqcompcens(list(fnorm, flogis), lwd = 2, plotstyle = "ggplot", fitcol = c("red", "green"), fillrect = c("pink", "lightgreen"), legendtext = c("normal distribution", "logistic distribution")) fitdistrplus/inst/doc/fitdistrplus_vignette.R0000644000176200001440000002345214421751622021326 0ustar liggesusers## ----setup, echo=FALSE, message=FALSE, warning=FALSE-------------------------- options(digits = 4) set.seed(1234) ## ----datgroundbeef, echo=TRUE------------------------------------------------- library("fitdistrplus") data("groundbeef") str(groundbeef) ## ----figgroundbeef, fig.align='center', fig.width=7, fig.height=4, fig.cap="Histogram and CDF plots of an empirical distribution for a continuous variable (serving size from the `groundbeef` data set) as provided by the `plotdist` function."---- plotdist(groundbeef$serving, histo = TRUE, demp = TRUE) ## ----descgroundbeefplot, fig.align='center', fig.width=5, fig.height=5, fig.cap="Skewness-kurtosis plot for a continuous variable (serving size from the `groundbeef` data set) as provided by the `descdist` function."---- descdist(groundbeef$serving, boot = 1000) ## ----fitgroundbeef.weibull---------------------------------------------------- fw <- fitdist(groundbeef$serving, "weibull") summary(fw) ## ----groundbeefcomp, fig.align='center', fig.width=7, fig.height=7, fig.cap="Four Goodness-of-fit plots for various distributions fitted to continuous data (Weibull, gamma and lognormal distributions fitted to serving sizes from the `groundbeef` data set) as provided by functions `denscomp`, `qqcomp`, `cdfcomp` and `ppcomp`."---- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1)) fg <- fitdist(groundbeef$serving, "gamma") fln <- fitdist(groundbeef$serving, "lnorm") plot.legend <- c("Weibull", "lognormal", "gamma") denscomp(list(fw, fln, fg), legendtext = plot.legend) qqcomp(list(fw, fln, fg), legendtext = plot.legend) cdfcomp(list(fw, fln, fg), legendtext = plot.legend) ppcomp(list(fw, fln, fg), legendtext = plot.legend) ## ----fitendo, fig.align='center', fig.width=6, fig.height=6, fig.cap="CDF plot to compare the fit of four distributions to 
acute toxicity values of various organisms for the organochlorine pesticide endosulfan (`endosulfan` data set) as provided by the `cdfcomp` function, with CDF values in a logscale to emphasize discrepancies on the left tail."---- library(actuar) data("endosulfan") ATV <- endosulfan$ATV fendo.ln <- fitdist(ATV, "lnorm") fendo.ll <- fitdist(ATV, "llogis", start = list(shape = 1, scale = 500)) fendo.P <- fitdist(ATV, "pareto", start = list(shape = 1, scale = 500)) fendo.B <- fitdist(ATV, "burr", start = list(shape1 = 0.3, shape2 = 1, rate = 1)) cdfcomp(list(fendo.ln, fendo.ll, fendo.P, fendo.B), xlogscale = TRUE, ylogscale = TRUE, legendtext = c("lognormal", "loglogistic", "Pareto", "Burr")) ## ----quantilefitdist, echo=TRUE, fig=FALSE------------------------------------ quantile(fendo.B, probs = 0.05) quantile(ATV, probs = 0.05) ## ----fendo.gof.print, echo=TRUE, fig=FALSE------------------------------------ gofstat(list(fendo.ln, fendo.ll, fendo.P, fendo.B), fitnames = c("lnorm", "llogis", "Pareto", "Burr")) ## ----fitBurr.boot.echo, echo=TRUE--------------------------------------------- bendo.B <- bootdist(fendo.B, niter = 1001) summary(bendo.B) ## ----bootstrap, fig.align='center', fig.width=6, fig.height=6, fig.cap="Bootstrappped values of parameters for a fit of the Burr distribution characterized by three parameters (example on the `endosulfan` data set) as provided by the plot of an object of class `bootdist`."---- plot(bendo.B) ## ----fitATV.lnorm.quantile, echo=TRUE----------------------------------------- quantile(bendo.B, probs = 0.05) ## ----plotfitMGE, fig.align='center', fig.width=6, fig.height=6, fig.cap="Comparison of a lognormal distribution fitted by MLE and by MGE using two different goodness-of-fit distances: left-tail Anderson-Darling and left-tail Anderson Darling of second order (example with the `endosulfan` data set) as provided by the `cdfcomp` function, with CDF values in a logscale to emphasize discrepancies on the left tail."---- fendo.ln.ADL <- fitdist(ATV, "lnorm", method = "mge", gof = "ADL") fendo.ln.AD2L <- fitdist(ATV, "lnorm", method = "mge", gof = "AD2L") cdfcomp(list(fendo.ln, fendo.ln.ADL, fendo.ln.AD2L), xlogscale = TRUE, ylogscale = TRUE, main = "Fitting a lognormal distribution", xlegend = "bottomright", legendtext = c("MLE", "Left-tail AD", "Left-tail AD 2nd order")) ## ----quantilefitdist2, echo=TRUE, fig=FALSE----------------------------------- (HC5.estimates <- c( empirical = as.numeric(quantile(ATV, probs = 0.05)), Burr = as.numeric(quantile(fendo.B, probs = 0.05)$quantiles), lognormal_MLE = as.numeric(quantile(fendo.ln, probs = 0.05)$quantiles), lognormal_AD2 = as.numeric(quantile(fendo.ln.ADL, probs = 0.05)$quantiles), lognormal_AD2L = as.numeric(quantile(fendo.ln.AD2L, probs = 0.05)$quantiles))) ## ----danish.mme, echo=TRUE, eval=TRUE----------------------------------------- data("danishuni") str(danishuni) ## ----danishmme, fig.align='center', fig.width=7, fig.height=4, fig.cap="Comparison between MME and MLE when fitting a lognormal or a Pareto distribution to loss data from the `danishuni` data set."---- fdanish.ln.MLE <- fitdist(danishuni$Loss, "lnorm") fdanish.ln.MME <- fitdist(danishuni$Loss, "lnorm", method = "mme", order = 1:2) library(actuar) fdanish.P.MLE <- fitdist(danishuni$Loss, "pareto", start = list(shape = 10, scale = 10), lower = 2+1e-6, upper = Inf) memp <- function(x, order) sum(x^order) / length(x) fdanish.P.MME <- fitdist(danishuni$Loss, "pareto", method = "mme", order = 1:2, memp = "memp", start = list(shape = 10, scale 
= 10), lower = c(2+1e-6, 2+1e-6), upper = c(Inf, Inf)) par(mfrow = c(1, 2)) cdfcomp(list(fdanish.ln.MLE, fdanish.ln.MME), legend = c("lognormal MLE", "lognormal MME"), main = "Fitting a lognormal distribution", xlogscale = TRUE, datapch = 20) cdfcomp(list(fdanish.P.MLE, fdanish.P.MME), legend = c("Pareto MLE", "Pareto MME"), main = "Fitting a Pareto distribution", xlogscale = TRUE, datapch = 20) ## ----danish.mme.pareto, echo=TRUE, fig=FALSE---------------------------------- gofstat(list(fdanish.ln.MLE, fdanish.P.MLE, fdanish.ln.MME, fdanish.P.MME), fitnames = c("lnorm.mle", "Pareto.mle", "lnorm.mme", "Pareto.mme")) ## ----danishqme, fig.align='center', fig.width=6, fig.height=6, fig.cap="Comparison between QME and MLE when fitting a lognormal distribution to loss data from the `danishuni` data set."---- fdanish.ln.QME1 <- fitdist(danishuni$Loss, "lnorm", method = "qme", probs = c(1/3, 2/3)) fdanish.ln.QME2 <- fitdist(danishuni$Loss, "lnorm", method = "qme", probs = c(8/10, 9/10)) cdfcomp(list(fdanish.ln.MLE, fdanish.ln.QME1, fdanish.ln.QME2), legend = c("MLE", "QME(1/3, 2/3)", "QME(8/10, 9/10)"), main = "Fitting a lognormal distribution", xlogscale = TRUE, datapch = 20) ## ----optimmethod.gamma, echo=TRUE--------------------------------------------- data("groundbeef") fNM <- fitdist(groundbeef$serving, "gamma", optim.method = "Nelder-Mead") fBFGS <- fitdist(groundbeef$serving, "gamma", optim.method = "BFGS") fSANN <- fitdist(groundbeef$serving, "gamma", optim.method = "SANN") fCG <- try(fitdist(groundbeef$serving, "gamma", optim.method = "CG", control = list(maxit = 10000))) if(inherits(fCG, "try-error")) {fCG <- list(estimate = NA)} ## ----optimmethod.customgenoud, echo=TRUE-------------------------------------- mygenoud <- function(fn, par, ...) { require(rgenoud) res <- genoud(fn, starting.values = par, ...) 
standardres <- c(res, convergence = 0) return(standardres) } ## ----optimmethod.customgenoud.fitdist, echo=TRUE, eval=TRUE------------------- fgenoud <- mledist(groundbeef$serving, "gamma", custom.optim = mygenoud, nvars = 2, max.generations = 10, Domains = cbind(c(0, 0), c(10, 10)), boundary.enforcement = 1, hessian = TRUE, print.level = 0, P9 = 10) cbind(NM = fNM$estimate, BFGS = fBFGS$estimate, SANN = fSANN$estimate, CG = fCG$estimate, fgenoud = fgenoud$estimate) ## ----datsalinity, echo=TRUE--------------------------------------------------- data("salinity") str(salinity) ## ----plotsalinity2, fig.align='center', fig.width=6, fig.height=6, fig.cap="Simple plot of censored raw data (72-hour acute salinity tolerance of riverine macro-invertebrates from the `salinity` data set) as ordered points and intervals."---- plotdistcens(salinity, NPMLE = FALSE) ## ----plotdistcens, echo=TRUE, fig=FALSE--------------------------------------- fsal.ln <- fitdistcens(salinity, "lnorm") fsal.ll <- fitdistcens(salinity, "llogis", start = list(shape = 5, scale = 40)) summary(fsal.ln) summary(fsal.ll) ## ----cdfcompcens, fig.align='center', fig.width=7, fig.height=7, fig.cap="Some goodness-of-fit plots for fits of a lognormal and a loglogistic distribution to censored data: LC50 values from the `salinity` data set."---- par(mfrow = c(2, 2)) cdfcompcens(list(fsal.ln, fsal.ll), legendtext = c("lognormal", "loglogistic ")) qqcompcens(fsal.ln, legendtext = "lognormal") ppcompcens(fsal.ln, legendtext = "lognormal") qqcompcens(list(fsal.ln, fsal.ll), legendtext = c("lognormal", "loglogistic "), main = "Q-Q plot with 2 dist.") ## ----dattoxocara, echo=TRUE--------------------------------------------------- data("toxocara") str(toxocara) ## ----fittoxocara.poisnbinom, echo = TRUE, fig = FALSE------------------------- (ftoxo.P <- fitdist(toxocara$number, "pois")) (ftoxo.nb <- fitdist(toxocara$number, "nbinom")) ## ----fittoxocarapoisnbinom, fig.align='center', fig.width=7, fig.height=4, fig.cap="Comparison of the fits of a negative binomial and a Poisson distribution to numbers of *Toxocara cati* parasites from the `toxocara` data set."---- par(mfrow = c(1, 2)) denscomp(list(ftoxo.P, ftoxo.nb), legendtext = c("Poisson", "negative binomial"), fitlty = 1) cdfcomp(list(ftoxo.P, ftoxo.nb), legendtext = c("Poisson", "negative binomial"), fitlty = 1) ## ----fittoxocara.poisnbinom.gof----------------------------------------------- gofstat(list(ftoxo.P, ftoxo.nb), fitnames = c("Poisson", "negative binomial")) fitdistrplus/inst/doc/Optimalgo.R0000644000176200001440000001463414421751612016621 0ustar liggesusers## ----setup, echo=FALSE, message=FALSE, warning=FALSE-------------------------- require(fitdistrplus) require(knitr) #for kable() function set.seed(12345) options(digits = 3) ## ---- echo=TRUE, eval=FALSE--------------------------------------------------- # fitbench <- function(data, distr, method, grad = NULL, # control = list(trace = 0, REPORT = 1, maxit = 1000), # lower = -Inf, upper = +Inf, ...) 
## ---- echo=FALSE-------------------------------------------------------------- fitbench <- fitdistrplus:::fitbench ## ----------------------------------------------------------------------------- lnL <- function(par, fix.arg, obs, ddistnam) fitdistrplus:::loglikelihood(par, fix.arg, obs, ddistnam) grlnlbeta <- fitdistrplus:::grlnlbeta ## ---- fig.height=4, fig.width=4----------------------------------------------- #(1) beta distribution n <- 200 x <- rbeta(n, 3, 3/4) grlnlbeta(c(3, 4), x) #test hist(x, prob=TRUE) lines(density(x), col="red") curve(dbeta(x, 3, 3/4), col="green", add=TRUE) legend("topleft", lty=1, col=c("red","green"), leg=c("empirical", "theoretical")) ## ----------------------------------------------------------------------------- ctr <- list(trace=0, REPORT=1, maxit=1000) ## ----------------------------------------------------------------------------- unconstropt <- fitbench(x, "beta", "mle", grad=grlnlbeta, lower=0) ## ----------------------------------------------------------------------------- dbeta2 <- function(x, shape1, shape2, log) dbeta(x, exp(shape1), exp(shape2), log=log) #take the log of the starting values startarg <- lapply(fitdistrplus:::startargdefault(x, "beta"), log) #redefine the gradient for the new parametrization grbetaexp <- function(par, obs, ...) grlnlbeta(exp(par), obs) * exp(par) expopt <- fitbench(x, distr="beta2", method="mle", grad=grbetaexp, start=startarg) #get back to original parametrization expopt[c("fitted shape1", "fitted shape2"), ] <- exp(expopt[c("fitted shape1", "fitted shape2"), ]) ## ---- results='asis', echo=FALSE---------------------------------------------- kable(unconstropt[, grep("G-", colnames(unconstropt), invert=TRUE)], digits=3) ## ---- results='asis', echo=FALSE---------------------------------------------- kable(unconstropt[, grep("G-", colnames(unconstropt))], digits=3) ## ---- results='asis', echo=FALSE---------------------------------------------- kable(expopt[, grep("G-", colnames(expopt), invert=TRUE)], digits=3) ## ---- results='asis', echo=FALSE---------------------------------------------- kable(expopt[, grep("G-", colnames(expopt))], digits=3) ## ---- fig.width=4, fig.height=4----------------------------------------------- llsurface(min.arg=c(0.1, 0.1), max.arg=c(7, 3), plot.arg=c("shape1", "shape2"), nlev=25, plot.np=50, data=x, distr="beta", back.col = FALSE) points(unconstropt[1,"BFGS"], unconstropt[2,"BFGS"], pch="+", col="red") points(3, 3/4, pch="x", col="green") ## ---- fig.width=4, fig.height=4----------------------------------------------- b1 <- bootdist(fitdist(x, "beta", method = "mle", optim.method = "BFGS"), niter = 100, parallel = "snow", ncpus = 2) summary(b1) plot(b1) abline(v = 3, h = 3/4, col = "red", lwd = 1.5) ## ----------------------------------------------------------------------------- grlnlNB <- function(x, obs, ...) 
{ m <- x[1] p <- x[2] n <- length(obs) c(sum(psigamma(obs+m)) - n*psigamma(m) + n*log(p), m*n/p - sum(obs)/(1-p)) } ## ---- fig.height=4, fig.width=4----------------------------------------------- #(1) beta distribution n <- 200 trueval <- c("size"=10, "prob"=3/4, "mu"=10/3) x <- rnbinom(n, trueval["size"], trueval["prob"]) hist(x, prob=TRUE, ylim=c(0, .3)) lines(density(x), col="red") points(min(x):max(x), dnbinom(min(x):max(x), trueval["size"], trueval["prob"]), col = "green") legend("topleft", lty = 1, col = c("red", "green"), leg = c("empirical", "theoretical")) ## ----------------------------------------------------------------------------- ctr <- list(trace = 0, REPORT = 1, maxit = 1000) unconstropt <- fitbench(x, "nbinom", "mle", grad = grlnlNB, lower = 0) unconstropt <- rbind(unconstropt, "fitted prob" = unconstropt["fitted mu", ] / (1 + unconstropt["fitted mu", ])) ## ----------------------------------------------------------------------------- dnbinom2 <- function(x, size, prob, log) dnbinom(x, exp(size), 1 / (1 + exp(-prob)), log = log) # transform starting values startarg <- fitdistrplus:::startargdefault(x, "nbinom") startarg$mu <- startarg$size / (startarg$size + startarg$mu) startarg <- list(size = log(startarg[[1]]), prob = log(startarg[[2]] / (1 - startarg[[2]]))) # redefine the gradient for the new parametrization Trans <- function(x) c(exp(x[1]), plogis(x[2])) grNBexp <- function(par, obs, ...) grlnlNB(Trans(par), obs) * c(exp(par[1]), plogis(x[2])*(1-plogis(x[2]))) expopt <- fitbench(x, distr="nbinom2", method="mle", grad=grNBexp, start=startarg) # get back to original parametrization expopt[c("fitted size", "fitted prob"), ] <- apply(expopt[c("fitted size", "fitted prob"), ], 2, Trans) ## ---- results='asis', echo=FALSE---------------------------------------------- kable(unconstropt[, grep("G-", colnames(unconstropt), invert=TRUE)], digits=3) ## ---- results='asis', echo=FALSE---------------------------------------------- kable(unconstropt[, grep("G-", colnames(unconstropt))], digits=3) ## ---- results='asis', echo=FALSE---------------------------------------------- kable(expopt[, grep("G-", colnames(expopt), invert=TRUE)], digits=3) ## ---- results='asis', echo=FALSE---------------------------------------------- kable(expopt[, grep("G-", colnames(expopt))], digits=3) ## ---- fig.width=4, fig.height=4----------------------------------------------- llsurface(min.arg = c(5, 0.3), max.arg = c(15, 1), plot.arg = c("size", "prob"), nlev = 25, plot.np = 50, data = x, distr = "nbinom", back.col = FALSE) points(unconstropt["fitted size", "BFGS"], unconstropt["fitted prob", "BFGS"], pch = "+", col = "red") points(trueval["size"], trueval["prob"], pch = "x", col = "green") ## ---- fig.width=4, fig.height=4----------------------------------------------- b1 <- bootdist(fitdist(x, "nbinom", method = "mle", optim.method = "BFGS"), niter = 100, parallel = "snow", ncpus = 2) summary(b1) plot(b1) abline(v = trueval["size"], h = trueval["mu"], col = "red", lwd = 1.5) fitdistrplus/inst/doc/Optimalgo.html0000644000176200001440000107467114421751612017374 0ustar liggesusers Which optimization algorithm to choose?

Which optimization algorithm to choose?

Marie Laure Delignette Muller, Christophe Dutang

2023-04-25

1 Quick overview of main optimization methods

We briefly present the main optimization methods. Please refer to Numerical Optimization (Nocedal & Wright, 2006) or Numerical Optimization: theoretical and practical aspects (Bonnans, Gilbert, Lemarechal & Sagastizabal, 2006) for a good introduction. We consider the following problem \(\min_x f(x)\) for \(x\in\mathbb{R}^n\).

1.1 Derivative-free optimization methods

The Nelder-Mead method is one of the most well known derivative-free methods, using only values of \(f\) to search for the minimum. It consists in building a simplex of \(n+1\) points and moving/shrinking this simplex toward the minimum.

  1. set initial points \(x_1, \dots, x_{n+1}\).
  2. order points such that \(f(x_1)\leq f(x_2)\leq\dots\leq f(x_{n+1})\).
  3. compute \(x_o\) as the centroid of \(x_1, \dots, x_{n}\).
  4. Reflection:
    • compute the reflected point \(x_r = x_o + \alpha(x_o-x_{n+1})\).
    • if \(f(x_1)\leq f(x_r)<f(x_n)\), then replace \(x_{n+1}\) by \(x_r\), go to step 2.
    • else go step 5.
  5. Expansion:
    • if \(f(x_r)<f(x_1)\), then compute the expansion point \(x_e= x_o+\gamma(x_o-x_{n+1})\).
    • if \(f(x_e) <f(x_r)\), then replace \(x_{n+1}\) by \(x_e\), go to step 2.
    • else replace \(x_{n+1}\) by \(x_r\), go to step 2.
    • else go to step 6.
  6. Contraction:
    • compute the contracted point \(x_c = x_o + \beta(x_o-x_{n+1})\).
    • if \(f(x_c)<f(x_{n+1})\), then replace \(x_{n+1}\) by \(x_c\), go to step 2.
    • else go step 7.
  7. Reduction:
    • for \(i=2,\dots, n+1\), compute \(x_i = x_1+\sigma(x_i-x_{1})\).

The Nelder-Mead method is available in optim. By default, in optim, \(\alpha=1\), \(\beta=1/2\), \(\gamma=2\) and \(\sigma=1/2\).
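For instance, a direct call to optim with the Nelder-Mead method on a toy objective looks as follows (an illustrative sketch, not part of the benchmark below; the Rosenbrock function, the starting point and the control values are arbitrary choices).

# minimize the 2D Rosenbrock function with Nelder-Mead; alpha, beta, gamma
# are the reflection, contraction and expansion factors described above
rosenbrock <- function(x) (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
optim(par = c(-1.2, 1), fn = rosenbrock, method = "Nelder-Mead",
      control = list(alpha = 1, beta = 0.5, gamma = 2))$par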

1.2 Hessian-free optimization methods

For smooth non-linear functions, the following approach is generally used: a local method combined with a line search works on the scheme \(x_{k+1} =x_k + t_k d_{k}\), where the local method specifies the direction \(d_k\) and the line search specifies the step size \(t_k \in \mathbb{R}\).

1.2.1 Computing the direction \(d_k\)

A desirable property for \(d_k\) is that \(d_k\) ensures a descent \(f(x_{k+1}) < f(x_{k})\). Newton methods are such that \(d_k\) minimizes a local quadratic approximation of \(f\) based on a Taylor expansion, that is \(q_f(d) = f(x_k) + g(x_k)^Td +\frac{1}{2} d^T H(x_k) d\) where \(g\) denotes the gradient and \(H\) denotes the Hessian.

The Newton method consists in using the exact solution of the local minimization problem, \(d_k = - H(x_k)^{-1} g(x_k)\).
In practice, other methods are preferred (at least to ensure positive definiteness). The quasi-Newton method approximates the Hessian by a matrix \(H_k\) as a function of \(H_{k-1}\), \(x_k\), \(f(x_k)\), and then \(d_k\) solves the system \(H_k d = - g(x_k)\). Some implementations may also directly approximate the inverse of the Hessian by a matrix \(W_k\) in order to compute \(d_k = -W_k g(x_k)\). Using the Sherman-Morrison-Woodbury formula, we can switch between \(W_k\) and \(H_k\).

To determine \(H_k\) (or \(W_k\)), the update must first verify the secant equation \(H_k s_k = y_k\), or equivalently \(s_k = W_k y_k\), where \(y_k = g_{k+1}-g_k\) and \(s_k=x_{k+1}-x_k\). Since a symmetric matrix has \(n(n+1)/2\) free terms and the secant equation provides only \(n\) conditions, the remaining terms are generally pinned down by imposing a symmetry condition and a minimum-distance condition. We say we have a rank-2 update if \(H_k = H_{k-1} + a u u^T + b v v^T\) and a rank-1 update if \(H_k = H_{k-1} + a u u^T\). Rank-\(n\) updates are justified by the spectral decomposition theorem.

There are two rank-2 updates which are symmetric and preserve positive definiteness

  • DFP minimizes \(\min || H - H_k ||_F\) such that \(H=H^T\): \[ H_{k+1} = \left (I-\frac {y_k s_k^T} {y_k^T s_k} \right ) H_k \left (I-\frac {s_k y_k^T} {y_k^T s_k} \right )+\frac{y_k y_k^T} {y_k^T s_k} \Leftrightarrow W_{k+1} = W_k + \frac{s_k s_k^T}{y_k^{T} s_k} - \frac {W_k y_k y_k^T W_k^T} {y_k^T W_k y_k} . \]
  • BFGS minimizes \(\min || W - W_k ||_F\) such that \(W=W^T\): \[ H_{k+1} = H_k - \frac{ H_k s_k s_k^T H_k }{ s_k^T H_k s_k } + \frac{ y_k y_k^T }{ y_k^T s_k } \Leftrightarrow W_{k+1} = \left (I-\frac {y_k s_k^T} {y_k^T s_k} \right )^T W_k \left (I-\frac { y_k s_k^T} {y_k^T s_k} \right )+\frac{s_k s_k^T} {y_k^T s_k} . \]

In R, the so-called BFGS scheme is implemented in optim.
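For instance, the objective and its analytical gradient can be supplied directly to optim (an illustrative sketch on a toy quadratic, not taken from the benchmark below):

# BFGS with a user-supplied gradient on a simple quadratic
f <- function(x) sum((x - c(2, -1))^2)
gr <- function(x) 2 * (x - c(2, -1))
optim(par = c(0, 0), fn = f, gr = gr, method = "BFGS")$par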

Another possible method (which originally arose from quadratic problems) is the nonlinear conjugate gradient method. This consists in computing directions \((d_0, \dots, d_k)\) that are conjugate with respect to a matrix close to the true Hessian \(H(x_k)\). Directions are computed iteratively by \(d_k = -g(x_k) + \beta_k d_{k-1}\) for \(k>1\), once initiated by \(d_1 = -g(x_1)\). The coefficients \(\beta_k\) are updated according to one of the following schemes:

  • \(\beta_k = \frac{ g_k^T g_k}{g_{k-1}^T g_{k-1} }\): Fletcher-Reeves update,
  • \(\beta_k = \frac{ g_k^T (g_k-g_{k-1} )}{g_{k-1}^T g_{k-1}}\): Polak-Ribiere update.

Three-term formulas for computing the direction, \(d_k = -g(x_k) + \beta_k d_{k-1}+\gamma_{k} d_t\) for \(t<k\), also exist. A possible scheme is the Beale-Sorenson update, defined as \(\beta_k = \frac{ g_k^T (g_k-g_{k-1} )}{d^T_{k-1}(g_{k}- g_{k-1})}\) and \(\gamma_k = \frac{ g_k^T (g_{t+1}-g_{t} )}{d^T_{t}(g_{t+1}- g_{t})}\) if \(k>t+1\), and \(\gamma_k=0\) if \(k=t\). See Yuan (2006) for other well-known schemes such as Hestenes-Stiefel, Dixon or Conjugate-Descent. The three updates (Fletcher-Reeves, Polak-Ribiere, Beale-Sorenson) of the (non-linear) conjugate gradient are available in optim.
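In optim, the update is selected through the type component of the control list: 1 for Fletcher-Reeves, 2 for Polak-Ribiere and 3 for Beale-Sorenson, as in the following illustrative sketch (same toy quadratic as above):

# compare the three conjugate-gradient updates on a simple quadratic
f <- function(x) sum((x - c(2, -1))^2)
for (type in 1:3)
  print(optim(par = c(0, 0), fn = f, method = "CG",
              control = list(type = type))$par)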

1.2.2 Computing the stepsize \(t_k\)

Let \(\phi_k(t) = f(x_k + t d_k)\) for a given direction/iterate \((d_k, x_k)\). We need conditions under which a stepsize \(t_k\) is satisfactory. In the literature, we consider the descent condition \(\phi_k'(0) < 0\) and the Armijo condition \(\phi_k(t) \leq \phi_k(0) + t c_1 \phi_k'(0)\), which ensures a sufficient decrease of \(f\). Nocedal & Wright (2006) present a backtracking (or geometric) approach satisfying the Armijo condition and a minimal-decrease condition, i.e., the Goldstein and Price condition:

  • set \(t_{k,0}\) e.g. 1, \(0 < \alpha < 1\),
  • Repeat until Armijo satisfied,
    • \(t_{k,i+1} = \alpha \times t_{k,i}\).
  • end Repeat

This backtracking linesearch is available in optim.
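As an illustration, the backtracking loop above can be coded in a few lines (a minimal sketch; the constant c1 and the shrinking factor alpha are typical but arbitrary choices):

# backtracking line search satisfying the Armijo condition
backtrack <- function(f, x, d, g, t0 = 1, alpha = 0.5, c1 = 1e-4) {
  t <- t0
  # shrink the stepsize geometrically until the Armijo condition holds
  while (f(x + t * d) > f(x) + t * c1 * sum(g * d))
    t <- alpha * t
  t
}
# usage along the steepest-descent direction of a toy quadratic
f0 <- function(x) sum(x^2); x0 <- c(3, -2); g0 <- 2 * x0
backtrack(f0, x0, d = -g0, g = g0)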

1.3 Benchmark

To simplify the benchmark of optimization methods, we create a fitbench function that computes the desired estimation method for all optimization methods. This function is currently not exported in the package.

fitbench <- function(data, distr, method, grad = NULL, 
                     control = list(trace = 0, REPORT = 1, maxit = 1000), 
                     lower = -Inf, upper = +Inf, ...) 
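Although the actual implementation is internal, the following much-simplified sketch (hypothetical code) conveys the idea: loop over some optim methods, fit by maximum likelihood via mledist and record the estimates, the log-likelihood and the elapsed time.

fitbench_sketch <- function(data, distr, methods = c("BFGS", "Nelder-Mead", "CG"), ...)
  sapply(methods, function(m) {
    # time a single MLE fit with the given optimization method
    elapsed <- system.time(fit <- mledist(data, distr, optim.method = m, ...))[["elapsed"]]
    c(fit$estimate, loglik = fit$loglik, time = elapsed)
  })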

2 Numerical illustration with the beta distribution

2.1 Log-likelihood function and its gradient for beta distribution

2.1.1 Theoretical value

The density of the beta distribution is given by \[ f(x; \delta_1,\delta_2) = \frac{x^{\delta_1-1}(1-x)^{\delta_2-1}}{\beta(\delta_1,\delta_2)}, \] where \(\beta\) denotes the beta function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/. We recall that \(\beta(a,b)=\Gamma(a)\Gamma(b)/\Gamma(a+b)\). The log-likelihood for a set of observations \((x_1,\dots,x_n)\) is thus \[ \log L(\delta_1,\delta_2) = (\delta_1-1)\sum_{i=1}^n\log(x_i)+ (\delta_2-1)\sum_{i=1}^n\log(1-x_i)- n \log(\beta(\delta_1,\delta_2)). \] The gradient with respect to \(\delta_1\) and \(\delta_2\) is \[ \nabla \log L(\delta_1,\delta_2) = \left(\begin{matrix} \sum\limits_{i=1}^n\log(x_i) - n\psi(\delta_1)+n\psi( \delta_1+\delta_2) \\ \sum\limits_{i=1}^n\log(1-x_i)- n\psi(\delta_2)+n\psi( \delta_1+\delta_2) \end{matrix}\right), \] where \(\psi(x)=\Gamma'(x)/\Gamma(x)\) is the digamma function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/.

2.1.2 R implementation

As in the fitdistrplus package, we minimize the opposite of the log-likelihood; the opposite of the gradient is implemented in grlnlbeta. Neither the log-likelihood nor its gradient is exported.

lnL <- function(par, fix.arg, obs, ddistnam) 
  fitdistrplus:::loglikelihood(par, fix.arg, obs, ddistnam) 
grlnlbeta <- fitdistrplus:::grlnlbeta
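For illustration, the opposite of the gradient derived above could be coded directly as follows (a sketch that should agree with the internal grlnlbeta; the commented check assumes the numDeriv package is available):

grlnlbeta_sketch <- function(par, obs, ...)
{
  a <- par[1]; b <- par[2]; n <- length(obs)
  # opposite of the gradient of the beta log-likelihood (formula above)
  -c(sum(log(obs)) - n * digamma(a) + n * digamma(a + b),
     sum(log(1 - obs)) - n * digamma(b) + n * digamma(a + b))
}
# finite-difference check, e.g.
# numDeriv::grad(function(p) -sum(dbeta(obs, p[1], p[2], log = TRUE)), c(3, 4))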

2.2 Random generation of a sample

#(1) beta distribution
n <- 200
x <- rbeta(n, 3, 3/4)
grlnlbeta(c(3, 4), x) #test
## [1] -133  317
hist(x, prob=TRUE)
lines(density(x), col="red")
curve(dbeta(x, 3, 3/4), col="green", add=TRUE)
legend("topleft", lty=1, col=c("red","green"), leg=c("empirical", "theoretical"))

2.3 Fit Beta distribution

Define control parameters.

ctr <- list(trace=0, REPORT=1, maxit=1000)

Call mledist with the default optimization function (optim, implemented in the stats package), with and without the gradient, for the different optimization methods.

unconstropt <- fitbench(x, "beta", "mle", grad=grlnlbeta, lower=0)
##     BFGS       NM     CGFR     CGPR     CGBS L-BFGS-B     NM-B   G-BFGS 
##       13       13       13       13       13       13       13       13 
##   G-CGFR   G-CGPR   G-CGBS G-BFGS-B   G-NM-B G-CGFR-B G-CGPR-B G-CGBS-B 
##       13       13       13       13       13       13       13       13

In the case of constrained optimization, mledist permits the direct use of the constrOptim function (also implemented in the stats package), which handles linear inequality constraints by using a logarithmic barrier.
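For instance, positivity of both shape parameters corresponds to the constraint \(ui \, \theta \geq ci\) with \(ui\) the identity matrix and \(ci = 0\); it can be passed to constrOptim directly (an illustrative sketch reusing the sample x simulated above):

# MLE of the beta parameters under the constraints shape1 > 0, shape2 > 0
nllbeta <- function(par, obs) -sum(dbeta(obs, par[1], par[2], log = TRUE))
constrOptim(theta = c(1, 1), f = nllbeta, grad = NULL,
            ui = diag(2), ci = c(0, 0), obs = x)$par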

Use an exp/log transformation of the shape parameters \(\delta_1\) and \(\delta_2\) to ensure that they are strictly positive.

dbeta2 <- function(x, shape1, shape2, log)
  dbeta(x, exp(shape1), exp(shape2), log=log)
#take the log of the starting values
startarg <- lapply(fitdistrplus:::startargdefault(x, "beta"), log)
#redefine the gradient for the new parametrization
grbetaexp <- function(par, obs, ...) 
    grlnlbeta(exp(par), obs) * exp(par)
    

expopt <- fitbench(x, distr="beta2", method="mle", grad=grbetaexp, start=startarg) 
##   BFGS     NM   CGFR   CGPR   CGBS G-BFGS G-CGFR G-CGPR G-CGBS 
##     13     13     13     13     13     13     13     13     13
#get back to original parametrization
expopt[c("fitted shape1", "fitted shape2"), ] <- exp(expopt[c("fitted shape1", "fitted shape2"), ])

Then we extract the values of the fitted parameters, the value of the corresponding log-likelihood, and the number of calls to the function being minimized and to its gradient (whether the theoretical gradient or the numerically approximated one).

2.4 Results of the numerical investigation

Results are displayed in the following tables: (1) the original parametrization without specifying the gradient (-B stands for bounded version), (2) the original parametrization with the (true) gradient (-B stands for bounded version and -G for gradient), (3) the log-transformed parametrization without specifying the gradient, (4) the log-transformed parametrization with the (true) gradient (-G stands for gradient).

|                 |    BFGS |      NM |    CGFR |    CGPR |    CGBS | L-BFGS-B |    NM-B |
|-----------------|--------:|--------:|--------:|--------:|--------:|---------:|--------:|
| fitted shape1   |   2.665 |   2.664 |   2.665 |   2.665 |   2.665 |    2.665 |   2.665 |
| fitted shape2   |   0.731 |   0.731 |   0.731 |   0.731 |   0.731 |    0.731 |   0.731 |
| fitted loglik   | 114.165 | 114.165 | 114.165 | 114.165 | 114.165 |  114.165 | 114.165 |
| func. eval. nb. |  23.000 |  47.000 | 211.000 | 263.000 | 183.000 |   11.000 |  47.000 |
| grad. eval. nb. |   5.000 |      NA |  53.000 |  69.000 |  47.000 |   11.000 |      NA |
| time (sec)      |   0.004 |   0.004 |   0.024 |   0.029 |   0.021 |    0.004 |   0.006 |
|                 |  G-BFGS |  G-CGFR |  G-CGPR |  G-CGBS | G-BFGS-B |  G-NM-B | G-CGFR-B | G-CGPR-B | G-CGBS-B |
|-----------------|--------:|--------:|--------:|--------:|---------:|--------:|---------:|---------:|---------:|
| fitted shape1   |   2.665 |   2.665 |   2.665 |   2.665 |    2.665 |   2.665 |    2.665 |    2.665 |    2.665 |
| fitted shape2   |   0.731 |   0.731 |   0.731 |   0.731 |    0.731 |   0.731 |    0.731 |    0.731 |    0.731 |
| fitted loglik   | 114.165 | 114.165 | 114.165 | 114.165 |  114.165 | 114.165 |  114.165 |  114.165 |  114.165 |
| func. eval. nb. |  20.000 | 249.000 | 225.000 | 138.000 |   25.000 |  47.000 |  263.000 |  188.000 |  176.000 |
| grad. eval. nb. |   5.000 |  71.000 |  69.000 |  43.000 |    5.000 |      NA |   69.000 |   59.000 |   47.000 |
| time (sec)      |   0.009 |   0.072 |   0.070 |   0.045 |    0.014 |   0.013 |    0.083 |    0.075 |    0.063 |
|                 |    BFGS |      NM |    CGFR |    CGPR |    CGBS |
|-----------------|--------:|--------:|--------:|--------:|--------:|
| fitted shape1   |   2.665 |   2.664 |   2.665 |   2.665 |   2.665 |
| fitted shape2   |   0.731 |   0.731 |   0.731 |   0.731 |   0.731 |
| fitted loglik   | 114.165 | 114.165 | 114.165 | 114.165 | 114.165 |
| func. eval. nb. |  18.000 |  41.000 | 131.000 | 116.000 | 134.000 |
| grad. eval. nb. |   5.000 |      NA |  27.000 |  29.000 |  35.000 |
| time (sec)      |   0.005 |   0.003 |   0.014 |   0.013 |   0.016 |
|                 |  G-BFGS |  G-CGFR |  G-CGPR |  G-CGBS |
|-----------------|--------:|--------:|--------:|--------:|
| fitted shape1   |   2.665 |   2.665 |   2.665 |   2.665 |
| fitted shape2   |   0.731 |   0.731 |   0.731 |   0.731 |
| fitted loglik   | 114.165 | 114.165 | 114.165 | 114.165 |
| func. eval. nb. |  20.000 | 175.000 | 125.000 | 112.000 |
| grad. eval. nb. |   5.000 |  39.000 |  41.000 |  35.000 |
| time (sec)      |   0.010 |   0.043 |   0.042 |   0.036 |

Using llsurface, we plot the log-likelihood surface around the true value (green) and the fitted parameters (red).

llsurface(min.arg=c(0.1, 0.1), max.arg=c(7, 3), 
          plot.arg=c("shape1", "shape2"), nlev=25,
          plot.np=50, data=x, distr="beta", back.col = FALSE)
## Warning in plot.window(xlim, ylim, ...): "plot.np" is not a graphical
## parameter
## Warning in title(...): "plot.np" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "plot.np" is not
## a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "plot.np" is not
## a graphical parameter
## Warning in box(...): "plot.np" is not a graphical parameter
points(unconstropt[1,"BFGS"], unconstropt[2,"BFGS"], pch="+", col="red")
points(3, 3/4, pch="x", col="green")

We can simulate bootstrap replicates using the bootdist function.

b1 <- bootdist(fitdist(x, "beta", method = "mle", optim.method = "BFGS"), 
               niter = 100, parallel = "snow", ncpus = 2)
summary(b1)
## Parametric bootstrap medians and 95% percentile CI 
##        Median  2.5% 97.5%
## shape1   2.73 2.272 3.283
## shape2   0.75 0.652 0.888
plot(b1)
abline(v = 3, h = 3/4, col = "red", lwd = 1.5)

3 Numerical illustration with the negative binomial distribution

3.1 Log-likelihood function and its gradient for negative binomial distribution

3.1.1 Theoretical value

The p.m.f. of the negative binomial distribution is given by \[ f(x; m,p) = \frac{\Gamma(x+m)}{\Gamma(m)x!} p^m (1-p)^x, \] where \(\Gamma\) denotes the gamma function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/. There exists an alternative representation where \(\mu=m (1-p)/p\) or equivalently \(p=m/(m+\mu)\). Thus, the log-likelihood for a set of observations \((x_1,\dots,x_n)\) is \[ \log L(m,p) = \sum_{i=1}^{n} \log\Gamma(x_i+m) -n\log\Gamma(m) -\sum_{i=1}^{n} \log(x_i!) + mn\log(p) +\sum_{i=1}^{n} {x_i}\log(1-p). \] The gradient with respect to \(m\) and \(p\) is \[ \nabla \log L(m,p) = \left(\begin{matrix} \sum_{i=1}^{n} \psi(x_i+m) -n \psi(m) + n\log(p) \\ mn/p -\sum_{i=1}^{n} {x_i}/(1-p) \end{matrix}\right), \] where \(\psi(x)=\Gamma'(x)/\Gamma(x)\) is the digamma function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/.

3.1.2 R implementation

As in the fitdistrplus package, we minimize the opposite of the log-likelihood; the corresponding gradient is implemented in grlnlNB below.

grlnlNB <- function(x, obs, ...)
{
  m <- x[1]
  p <- x[2]
  n <- length(obs)
  c(sum(psigamma(obs+m)) - n*psigamma(m) + n*log(p),
    m*n/p - sum(obs)/(1-p))
}
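As a quick sanity check (a sketch assuming the numDeriv package is installed), this analytical gradient can be compared to a finite-difference approximation of the log-likelihood on a small simulated sample:

# compare the analytical gradient with a numerical one (sketch)
obs0 <- rnbinom(50, size = 10, prob = 3/4)
ll <- function(par) sum(dnbinom(obs0, size = par[1], prob = par[2], log = TRUE))
cbind(analytical = grlnlNB(c(10, 3/4), obs0),
      numerical = numDeriv::grad(ll, c(10, 3/4)))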

3.2 Random generation of a sample

#(1) negative binomial distribution
n <- 200
trueval <- c("size"=10, "prob"=3/4, "mu"=10/3)
x <- rnbinom(n, trueval["size"], trueval["prob"])

hist(x, prob=TRUE, ylim=c(0, .3))
lines(density(x), col="red")
points(min(x):max(x), dnbinom(min(x):max(x), trueval["size"], trueval["prob"]), 
       col = "green")
legend("topleft", lty = 1, col = c("red", "green"), 
       leg = c("empirical", "theoretical"))

3.3 Fit a negative binomial distribution

Define control parameters and make the benchmark.

ctr <- list(trace = 0, REPORT = 1, maxit = 1000)
unconstropt <- fitbench(x, "nbinom", "mle", grad = grlnlNB, lower = 0)
##     BFGS       NM     CGFR     CGPR     CGBS L-BFGS-B     NM-B   G-BFGS 
##       13       13       13       13       13       13       13       13 
##   G-CGFR   G-CGPR   G-CGBS G-BFGS-B   G-NM-B G-CGFR-B G-CGPR-B G-CGBS-B 
##       13       13       13       13       13       13       13       13
# back-transform to the probability parametrization: prob = size / (size + mu)
unconstropt <- rbind(unconstropt, 
                     "fitted prob" = unconstropt["fitted size", ] /
                       (unconstropt["fitted size", ] + unconstropt["fitted mu", ]))

In the case of constrained optimization, mledist permits the direct use of the constrOptim function (also implemented in the stats package), which handles linear inequality constraints by using a logarithmic barrier.

Use a log transformation of the size parameter \(m\) and a logit transformation of the probability parameter \(p\) to ensure that \(m\) is strictly positive and \(p\) lies in \((0,1)\).

dnbinom2 <- function(x, size, prob, log)
  dnbinom(x, exp(size), 1 / (1 + exp(-prob)), log = log)
# transform starting values
startarg <- fitdistrplus:::startargdefault(x, "nbinom")
startarg$mu <- startarg$size / (startarg$size + startarg$mu)
startarg <- list(size = log(startarg[[1]]), 
                 prob = log(startarg[[2]] / (1 - startarg[[2]])))

# redefine the gradient for the new parametrization
Trans <- function(x)
  c(exp(x[1]), plogis(x[2]))
grNBexp <- function(par, obs, ...) 
    grlnlNB(Trans(par), obs) * c(exp(par[1]), plogis(par[2])*(1-plogis(par[2])))

expopt <- fitbench(x, distr="nbinom2", method="mle", grad=grNBexp, start=startarg) 
##   BFGS     NM   CGFR   CGPR   CGBS G-BFGS G-CGFR G-CGPR G-CGBS 
##     13     13     13     13     13     13     13     13     13
# get back to original parametrization
expopt[c("fitted size", "fitted prob"), ] <- 
  apply(expopt[c("fitted size", "fitted prob"), ], 2, Trans)

Then we extract the values of the fitted parameters, the value of the corresponding log-likelihood, and the number of calls to the function being minimized and to its gradient (whether the theoretical gradient or the numerically approximated one).

3.4 Results of the numerical investigation

Results are displayed in the following tables: (1) the original parametrization without specifying the gradient (-B stands for bounded version), (2) the original parametrization with the (true) gradient (-B stands for bounded version and -G for gradient), (3) the log-transformed parametrization without specifying the gradient, (4) the log-transformed parametrization with the (true) gradient (-G stands for gradient).

|                 |     BFGS |       NM |     CGFR |     CGPR |     CGBS | L-BFGS-B |     NM-B |
|-----------------|---------:|---------:|---------:|---------:|---------:|---------:|---------:|
| fitted size     |   61.944 |   68.138 |   61.969 |   62.475 |   62.878 |   61.944 |   67.397 |
| fitted mu       |    3.425 |    3.425 |    3.425 |    3.425 |    3.425 |    3.425 |    3.425 |
| fitted loglik   | -401.612 | -401.611 | -401.612 | -401.612 | -401.612 | -401.612 | -401.611 |
| func. eval. nb. |    2.000 |   37.000 | 2999.000 | 2663.000 | 2610.000 |    2.000 |    0.000 |
| grad. eval. nb. |    1.000 |       NA | 1001.000 | 1001.000 | 1001.000 |    2.000 |       NA |
| time (sec)      |    0.001 |    0.002 |    0.240 |    0.244 |    0.234 |    0.001 |    0.002 |
| fitted prob     |    0.948 |    0.952 |    0.948 |    0.948 |    0.948 |    0.948 |    0.952 |
|                 |   G-BFGS |   G-CGFR |   G-CGPR |   G-CGBS | G-BFGS-B |   G-NM-B | G-CGFR-B | G-CGPR-B | G-CGBS-B |
|-----------------|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|
| fitted size     |   61.944 |   61.944 |   61.944 |   61.944 |   61.944 |   67.397 |   61.944 |   61.944 |   61.944 |
| fitted mu       |    3.425 |    3.425 |    3.425 |    3.425 |    3.425 |    3.425 |    3.425 |    3.425 |    3.425 |
| fitted loglik   | -401.612 | -401.612 | -401.612 | -401.612 | -401.612 | -401.611 | -401.612 | -401.612 | -401.612 |
| func. eval. nb. |   26.000 |  233.000 |  174.000 |  233.000 |    0.000 |    0.000 |    0.000 |    0.000 |    0.000 |
| grad. eval. nb. |    1.000 |   15.000 |   11.000 |   15.000 |       NA |       NA |       NA |       NA |       NA |
| time (sec)      |    0.008 |    0.010 |    0.008 |    0.009 |    0.002 |    0.002 |    0.009 |    0.007 |    0.009 |
| fitted prob     |    0.948 |    0.948 |    0.948 |    0.948 |    0.948 |    0.952 |    0.948 |    0.948 |    0.948 |
|                 |     BFGS |       NM |     CGFR |     CGPR |     CGBS |
|-----------------|---------:|---------:|---------:|---------:|---------:|
| fitted size     |   61.946 |   67.787 |   63.450 |   67.940 |   67.885 |
| fitted prob     |    0.948 |    0.952 |    0.949 |    0.952 |    0.952 |
| fitted loglik   | -401.612 | -401.611 | -401.612 | -401.611 | -401.611 |
| func. eval. nb. |    6.000 |   47.000 | 4001.000 | 3728.000 |  327.000 |
| grad. eval. nb. |    1.000 |       NA | 1001.000 | 1001.000 |   87.000 |
| time (sec)      |    0.004 |    0.002 |    0.290 |    0.228 |    0.023 |
|                 |   G-BFGS |   G-CGFR |   G-CGPR |   G-CGBS |
|-----------------|---------:|---------:|---------:|---------:|
| fitted size     |   61.944 |   61.944 |   61.944 |   61.944 |
| fitted prob     |    0.948 |    0.948 |    0.948 |    0.948 |
| fitted loglik   | -401.612 | -401.612 | -401.612 | -401.612 |
| func. eval. nb. |   21.000 |   43.000 |   42.000 |   42.000 |
| grad. eval. nb. |    1.000 |    3.000 |    3.000 |    3.000 |
| time (sec)      |    0.007 |    0.002 |    0.002 |    0.002 |

Using llsurface, we plot the log-likelihood surface around the true value (green) and the fitted parameters (red).

llsurface(min.arg = c(5, 0.3), max.arg = c(15, 1), 
          plot.arg = c("size", "prob"), nlev = 25,
          plot.np = 50, data = x, distr = "nbinom", back.col = FALSE)
## Warning in plot.window(xlim, ylim, ...): "plot.np" is not a graphical
## parameter
## Warning in title(...): "plot.np" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "plot.np" is not
## a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "plot.np" is not
## a graphical parameter
## Warning in box(...): "plot.np" is not a graphical parameter
points(unconstropt["fitted size", "BFGS"], unconstropt["fitted prob", "BFGS"], 
       pch = "+", col = "red")
points(trueval["size"], trueval["prob"], pch = "x", col = "green")

We can simulate bootstrap replicates using the bootdist function.

b1 <- bootdist(fitdist(x, "nbinom", method = "mle", optim.method = "BFGS"), 
               niter = 100, parallel = "snow", ncpus = 2)
summary(b1)
## Parametric bootstrap medians and 95% percentile CI 
##      Median  2.5%  97.5%
## size  61.95 11.05 118.32
## mu     3.43  3.17   3.72
## 
## The estimation method converged only for 76 among 100 iterations
plot(b1)
abline(v = trueval["size"], h = trueval["mu"], col = "red", lwd = 1.5)

4 Conclusion

Based on the two previous examples, we observe that all methods converge to the same point, which is reassuring.
However, the number of function evaluations (and of gradient evaluations) differs substantially from one method to another. Furthermore, specifying the true gradient of the log-likelihood does not help the fitting procedure at all and generally slows down the convergence. Overall, the best method is the standard BFGS method or the BFGS method with the exponential transformation of the parameters. Since the exponential function is differentiable, the asymptotic properties are still preserved (by the delta method), but in finite samples this may produce a small bias.

fitdistrplus/inst/doc/fitdistrplus_vignette.html0000644000176200001440000233515114421751624022077 0ustar liggesusers Overview of the fitdistrplus package

Overview of the fitdistrplus package

Marie Laure Delignette Muller, Christophe Dutang

2023-04-25


Based on the article fitdistrplus: an R Package for Fitting Distributions (Marie Laure Delignette-Muller and Christophe Dutang, 2015, Journal of Statistical Software, DOI 10.18637/jss.v064.i04)


Keywords: probability distribution fitting, bootstrap, censored data, maximum likelihood, moment matching, quantile matching, maximum goodness-of-fit, distributions, R


1 Introduction

Fitting distributions to data is a very common task in statistics and consists in choosing a probability distribution modelling the random variable, as well as finding parameter estimates for that distribution. This requires judgment and expertise and generally needs an iterative process of distribution choice, parameter estimation, and quality-of-fit assessment. In the R (R Development Core Team 2013) package MASS (Venables and Ripley 2010), maximum likelihood estimation is available via the fitdistr function; other steps of the fitting process can be done using other R functions (Ricci 2005). In this paper, we present the R package fitdistrplus (Delignette-Muller et al. 2014) implementing several methods for fitting univariate parametric distributions. A first objective in developing this package was to provide R users a set of functions dedicated to helping with this overall process.

The fitdistr function estimates distribution parameters by maximizing the likelihood function using the optim function. No distinction between parameters with different roles (e.g., main parameter and nuisance parameter) is made, as this paper focuses on parameter estimation from a general point-of-view. In some cases, other estimation methods could be preferred, such as maximum goodness-of-fit estimation (also called minimum distance estimation), as proposed in the R package actuar with three different goodness-of-fit distances (Dutang, Goulet, and Pigeon 2008). While developing the fitdistrplus package, a second objective was to consider various estimation methods in addition to maximum likelihood estimation (MLE). Functions were developed to enable moment matching estimation (MME), quantile matching estimation (QME), and maximum goodness-of-fit estimation (MGE) using eight different distances. Moreover, the fitdistrplus package offers the possibility to specify a user-supplied function for optimization, useful in cases where classical optimization techniques, not included in optim, are more adequate.
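For instance, the estimation method is selected through the method argument of fitdist; the following illustrative sketch fits a lognormal distribution to a simulated sample with each of the four methods (the sample x0, the matched probabilities and the chosen distance are arbitrary choices):

x0 <- rlnorm(100)
fitdist(x0, "lnorm", method = "mle")
fitdist(x0, "lnorm", method = "mme")
fitdist(x0, "lnorm", method = "qme", probs = c(1/3, 2/3))
fitdist(x0, "lnorm", method = "mge", gof = "CvM")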

In applied statistics, it is frequent to have to fit distributions to censored data (Klein and Moeschberger 2003; Helsel 2005; Busschaert et al. 2010; Leha et al. 2011; Commeau et al. 2012). The MASS fitdistr function does not enable maximum likelihood estimation with this type of data. Some packages can be used to work with censored data, especially survival data (Therneau 2011; Hirano et al. 1994; Jordan 2005), but those packages generally focus on specific models, enabling the fit of a restricted set of distributions. A third objective is thus to provide R users a function to estimate univariate distribution parameters from right-, left- and interval-censored data.

Few packages on CRAN provide estimation procedures for any user-supplied parametric distribution and support different types of data. The distrMod package (Kohl and Ruckdeschel 2010) provides an object-oriented (S4) implementation of probability models and includes distribution fitting procedures for a given minimization criterion. This criterion is a user-supplied function which is sufficiently flexible to handle censored data, yet not in a trivial way, see Example M4 of the distrMod vignette. The fitting functions MLEstimator and MDEstimator return an S4 class for which a coercion method to class mle is provided so that the respective functionalities (e.g., confint and logLik) from package stats4 are available, too. In fitdistrplus, we chose to use the standard S3 class system because it is familiar to most R users, and we implemented the usual generic functions for these S3 classes. Finally, various other packages provide functions to estimate the mode, the moments or the L-moments of a distribution, see the reference manuals of the modeest, lmomco and Lmoments packages.

The package is available from the Comprehensive R Archive Network at https://cran.r-project.org/package=fitdistrplus. The paper is organized as follows: Section 2 presents tools for fitting continuous distributions to classic non-censored data. Section 3 deals with other estimation methods and other types of data, before Section 4 concludes.


2 Fitting distributions to continuous non-censored data

2.1 Choice of candidate distributions

For illustrating the use of various functions of the fitdistrplus package with continuous non-censored data, we will first use a data set named groundbeef which is included in our package. This data set contains pointwise values of serving sizes in grams, collected in a French survey, for ground beef patties consumed by children under 5 years old. It was used in a quantitative risk assessment published by Delignette-Muller and Cornu (2008).

library("fitdistrplus")
data("groundbeef")
str(groundbeef)
## 'data.frame':    254 obs. of  1 variable:
##  $ serving: num  30 10 20 24 20 24 40 20 50 30 ...

Before fitting one or more distributions to a data set, it is generally necessary to choose good candidates among a predefined set of distributions. This choice may be guided by the knowledge of stochastic processes governing the modeled variable, or, in the absence of knowledge regarding the underlying process, by the observation of its empirical distribution. To help the user in this choice, we developed functions to plot and characterize the empirical distribution.

First of all, it is common to start with plots of the empirical distribution function and the histogram (or density plot), which can be obtained with the plotdist function of the fitdistrplus package. This function provides two plots (see Figure 2.1): the left-hand plot is by default the histogram on a density scale (or the density plot, or both, according to the values of the arguments histo and demp), and the right-hand plot is the empirical cumulative distribution function (CDF).

plotdist(groundbeef$serving, histo = TRUE, demp = TRUE)
Histogram and CDF plots of an empirical distribution for a continuous variable (serving size from the `groundbeef` data set) as provided by the `plotdist` function.

Figure 2.1: Histogram and CDF plots of an empirical distribution for a continuous variable (serving size from the groundbeef data set) as provided by the plotdist function.


In addition to empirical plots, descriptive statistics may help to choose candidates to describe a distribution among a set of parametric distributions. In particular, the skewness and kurtosis, linked to the third and fourth moments, are useful for this purpose. A non-zero skewness reveals a lack of symmetry of the empirical distribution, while the kurtosis value quantifies the weight of the tails in comparison to the normal distribution, for which the kurtosis equals 3. The skewness and kurtosis and their corresponding unbiased estimators (Casella and Berger 2002) from a sample \((X_i)_i \stackrel{\text{i.i.d.}}{\sim} X\) with observations \((x_i)_i\) are given by:

\[\begin{equation} sk(X) = \frac{E[(X-E(X))^3]}{Var(X)^{\frac{3}{2}}}~,~\widehat{sk}=\frac{\sqrt{n(n-1)}}{n-2}\times\frac{m_{3}}{m_{2}^{\frac{3}{2}}},\tag{2.1} \end{equation}\]

\[\begin{equation} kr(X) = \frac{E[(X-E(X))^4]}{Var(X)^{2}}~,~\widehat{kr}=\frac{n-1}{(n-2)(n-3)}((n+1) \times \frac{m_{4}}{m_{2}^{2}}-3(n-1)) + 3,\tag{2.2} \end{equation}\]

where \(m_{2}\), \(m_{3}\), \(m_{4}\) denote empirical moments defined by \(m_{k}=\frac{1}{n}\sum_{i=1}^n(x_{i}-\overline{x})^{k}\), with \(x_{i}\) the \(n\) observations of variable \(x\) and \(\overline{x}\) their mean value.
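
As a small illustration (a minimal sketch, not package code), these estimators can be computed directly from the empirical centered moments and compared with the descdist output shown below:

# Sketch: unbiased skewness and kurtosis estimators of Equations (2.1)-(2.2).
x <- groundbeef$serving
n <- length(x)
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
m4 <- mean((x - mean(x))^4)
sk.hat <- sqrt(n * (n - 1)) / (n - 2) * m3 / m2^(3/2)
kr.hat <- (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * m4 / m2^2 - 3 * (n - 1)) + 3
c(skewness = sk.hat, kurtosis = kr.hat)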

The descdist function provides classical descriptive statistics (minimum, maximum, median, mean, standard deviation), skewness and kurtosis. By default, unbiased estimations of the three last statistics are provided. Nevertheless, the argument method can be changed from "unbiased" (default) to "sample" to obtain them without correction for bias. A skewness-kurtosis plot such as the one proposed by Cullen and Frey (1999) is provided by the descdist function for the empirical distribution (see Figure 2.2 for the groundbeef data set). On this plot, values for common distributions are displayed in order to help the choice of distributions to fit to data. For some distributions (normal, uniform, logistic, exponential), there is only one possible value for the skewness and the kurtosis: the distribution is thus represented by a single point on the plot. For other distributions, areas of possible values are represented, consisting of lines (as for the gamma and lognormal distributions) or larger areas (as for the beta distribution).

Skewness and kurtosis are known not to be robust. In order to take into account the uncertainty of the estimated values of kurtosis and skewness from data, a nonparametric bootstrap procedure (Efron and Tibshirani 1994) can be performed by using the argument boot. Values of skewness and kurtosis are computed on bootstrap samples (constructed by random sampling with replacement from the original data set) and reported on the skewness-kurtosis plot. Nevertheless, the user needs to know that skewness and kurtosis, like all higher moments, have a very high variance. This problem cannot be completely solved by the use of bootstrap. The skewness-kurtosis plot should therefore be regarded as indicative only. The properties of the random variable should be considered, notably its expected value and its range, as a complement to the use of the plotdist and descdist functions. Below is a call to the descdist function to describe the distribution of the serving size from the groundbeef data set and to draw the corresponding skewness-kurtosis plot (see Figure 2.2). Looking at the results on this example, with a positive skewness and a kurtosis not far from 3, the fit of three common right-skewed distributions could be considered: the Weibull, gamma and lognormal distributions.

descdist(groundbeef$serving, boot = 1000)
Skewness-kurtosis plot for a continuous variable (serving size from the `groundbeef` data set) as provided by the `descdist` function.

Figure 2.2: Skewness-kurtosis plot for a continuous variable (serving size from the groundbeef data set) as provided by the descdist function.

## summary statistics
## ------
## min:  10   max:  200 
## median:  79 
## mean:  73.65 
## estimated sd:  35.88 
## estimated skewness:  0.7353 
## estimated kurtosis:  3.551

2.2 Fit of distributions by maximum likelihood estimation

Once selected, one or more parametric distributions \(f(.\vert \theta)\) (with parameter \(\theta\in\mathbb{R}^d\)) may be fitted to the data set, one at a time, using the fitdist function. Under the i.i.d. sample assumption, distribution parameters \(\theta\) are by default estimated by maximizing the likelihood function defined as:

\[\begin{equation} L(\theta)=\prod_{i=1}^n f(x_{i}\vert \theta)\tag{2.3} \end{equation}\]

with \(x_{i}\) the \(n\) observations of variable \(X\) and \(f(.\vert \theta)\) the density function of the parametric distribution. The other proposed estimation methods are described in Section 3.1.

The fitdist function returns an S3 object of class fitdist for which print, summary and plot functions are provided. The fit of a distribution using fitdist assumes that the corresponding d, p, q functions (standing respectively for the density, the distribution and the quantile functions) are defined. Classical distributions are already defined in that way in the stats package, e.g., dnorm, pnorm and qnorm for the normal distribution (see ?Distributions). Others may be found in various packages (see the CRAN task view: Probability Distributions at https://CRAN.R-project.org/view=Distributions). Distributions not found in any package must be implemented by the user as d, p, q functions. In the call to fitdist, a distribution has to be specified via the argument dist either by the character string corresponding to its common root name used in the names of d, p, q functions (e.g., "norm" for the normal distribution) or by the density function itself, from which the root name is extracted (e.g., dnorm for the normal distribution). Numerical results returned by the fitdist function are (1) the parameter estimates, (2) the estimated standard errors (computed from the estimate of the Hessian matrix at the maximum likelihood solution), (3) the loglikelihood, (4) Akaike and Bayesian information criteria (the so-called AIC and BIC), and (5) the correlation matrix between parameter estimates. Below is a call to the fitdist function to fit a Weibull distribution to the serving size from the groundbeef data set.

fw <- fitdist(groundbeef$serving, "weibull")
summary(fw)
## Fitting of the distribution ' weibull ' by maximum likelihood 
## Parameters : 
##       estimate Std. Error
## shape    2.186     0.1046
## scale   83.348     2.5269
## Loglikelihood:  -1255   AIC:  2514   BIC:  2522 
## Correlation matrix:
##        shape  scale
## shape 1.0000 0.3218
## scale 0.3218 1.0000

The plot of an object of class fitdist provides four classical goodness-of-fit plots (Cullen and Frey 1999) presented on Figure 2.3:

  • a density plot representing the density function of the fitted distribution along with the histogram of the empirical distribution,
  • a CDF plot of both the empirical distribution and the fitted distribution,
  • a Q-Q plot representing the empirical quantiles (y-axis) against the theoretical quantiles (x-axis),
  • a P-P plot representing the empirical distribution function evaluated at each data point (y-axis) against the fitted distribution function (x-axis).

For CDF, Q-Q and P-P plots, the probability plotting position is defined by default using Hazen’s rule, with probability points of the empirical distribution calculated as (1:n - 0.5)/n, as recommended by Blom (1959). This plotting position can be easily changed (see the reference manual for details (Delignette-Muller et al. 2014)).
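
For instance, assuming the use.ppoints and a.ppoints arguments documented for these plots (plotting position \((i - a)/(n + 1 - 2a)\), with Hazen's rule corresponding to a = 0.5), the rule \(i/(n+1)\) could be obtained as in this sketch:

# Sketch: changing the probability plotting position of the Q-Q plot
# from Hazen's rule (a.ppoints = 0.5, default) to i/(n+1).
qqcomp(fw, use.ppoints = TRUE, a.ppoints = 0)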

Unlike the generic plot function, the denscomp, cdfcomp, qqcomp and ppcomp functions make it possible to draw each of these four plots separately, in order to compare the empirical distribution and multiple parametric distributions fitted to the same data set. These functions must be called with a first argument corresponding to a list of objects of class fitdist, and optionally further arguments to customize the plot (see the reference manual for lists of arguments that may be specific to each plot (Delignette-Muller et al. 2014)). In the following example, we compare the fits of a Weibull, a lognormal and a gamma distribution to the groundbeef data set (Figure 2.3).

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
fg <- fitdist(groundbeef$serving, "gamma")
fln <- fitdist(groundbeef$serving, "lnorm")
plot.legend <- c("Weibull", "lognormal", "gamma")
denscomp(list(fw, fln, fg), legendtext = plot.legend)
qqcomp(list(fw, fln, fg), legendtext = plot.legend)
cdfcomp(list(fw, fln, fg), legendtext = plot.legend)
ppcomp(list(fw, fln, fg), legendtext = plot.legend)
Four Goodness-of-fit plots for various distributions fitted to continuous data (Weibull, gamma and lognormal distributions fitted to serving sizes from the `groundbeef` data set) as provided by functions `denscomp`, `qqcomp`, `cdfcomp` and `ppcomp`.

Figure 2.3: Four Goodness-of-fit plots for various distributions fitted to continuous data (Weibull, gamma and lognormal distributions fitted to serving sizes from the groundbeef data set) as provided by functions denscomp, qqcomp, cdfcomp and ppcomp.


The density plot and the CDF plot may be considered as the basic classical goodness-of-fit plots. The two other plots are complementary and can be very informative in some cases. The Q-Q plot emphasizes the lack-of-fit at the distribution tails while the P-P plot emphasizes the lack-of-fit at the distribution center. In the present example (Figure 2.3), none of the three fitted distributions correctly describes the center of the distribution, but the Weibull and gamma distributions could be preferred for their better description of the right tail of the empirical distribution, especially if this tail is important in the use of the fitted distribution, as it is in the context of food risk assessment.

The data set named endosulfan will now be used to illustrate other features of the fitdistrplus package. This data set contains acute toxicity values for the organochlorine pesticide endosulfan (geometric mean of LC50 or EC50 values in \(\mu g.L^{-1}\)), tested on Australian and non-Australian laboratory species (Hose and Van den Brink, n.d.). In ecotoxicology, a lognormal or a loglogistic distribution is often fitted to such a data set in order to characterize the species sensitivity distribution (SSD) for a pollutant. A low percentile of the fitted distribution, generally the 5% percentile, is then calculated and named the hazardous concentration 5% (HC5). It is interpreted as the value of the pollutant concentration protecting 95% of the species (Posthuma, Suter, and Traas 2010). But the fit of a lognormal or a loglogistic distribution to the whole endosulfan data set is rather bad (Figure 2.4), especially due to a minority of very high values. The two-parameter Pareto distribution and the three-parameter Burr distribution (which is an extension of both the loglogistic and the Pareto distributions) have therefore been fitted. Pareto and Burr distributions are provided in the package actuar. Up to here, we did not have to define starting values (for the optimization process) as reasonable starting values are implicitly defined within the fitdist function for most of the distributions defined in R (see ?fitdist for details). For other distributions like the Pareto and the Burr distributions, initial values for the distribution parameters have to be supplied in the argument start, as a named list with initial values for each parameter (as they appear in the d, p, q functions). Having defined reasonable starting values, various distributions can be fitted and graphically compared. On this example, the function cdfcomp can be used to report CDF values on a log scale so as to emphasize discrepancies on the tail of interest when estimating an HC5 value (Figure 2.4).

library(actuar)
data("endosulfan")
ATV <- endosulfan$ATV
fendo.ln <- fitdist(ATV, "lnorm")
fendo.ll <- fitdist(ATV, "llogis", start = list(shape = 1, scale = 500))
fendo.P <- fitdist(ATV, "pareto", start = list(shape = 1, scale = 500))
fendo.B <- fitdist(ATV, "burr", start = list(shape1 = 0.3, shape2 = 1, rate = 1))
cdfcomp(list(fendo.ln, fendo.ll, fendo.P, fendo.B), xlogscale = TRUE, 
        ylogscale = TRUE, legendtext = c("lognormal", "loglogistic", "Pareto", "Burr"))
CDF plot to compare the fit of four distributions to acute toxicity values of various organisms for the organochlorine pesticide endosulfan (`endosulfan` data set) as provided by the `cdfcomp` function, with CDF values in a logscale to emphasize discrepancies on the left tail.

Figure 2.4: CDF plot to compare the fit of four distributions to acute toxicity values of various organisms for the organochlorine pesticide endosulfan (endosulfan data set) as provided by the cdfcomp function, with CDF values in a logscale to emphasize discrepancies on the left tail.


None of the fitted distributions correctly describes the right tail observed in the data set, but as shown in Figure 2.4, the left tail seems to be better described by the Burr distribution. Its use could then be considered to estimate the HC5 value as the 5% quantile of the distribution. This can be easily done using the quantile generic function defined for an object of class fitdist. Below is this calculation together with the calculation of the empirical quantile for comparison.

quantile(fendo.B, probs = 0.05)
## Estimated quantiles for each specified probability (non-censored data)
##          p=0.05
## estimate 0.2939
quantile(ATV, probs = 0.05)
##  5% 
## 0.2

In addition to the ecotoxicology context, the quantile generic function is also attractive in the actuarial-financial context. In fact, the value-at-risk \(VAR_\alpha\) is defined as the \(1-\alpha\)-quantile of the loss distribution and can be computed with quantile on a fitdist object.
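
As a brief hedged illustration (the loss vector below is simulated placeholder data, not a package data set), a 99.5% value-at-risk could be obtained from a fitted loss distribution as follows:

# Sketch: VaR at the 99.5% level from a lognormal fitted to
# simulated (placeholder) loss data.
set.seed(123)
loss <- rlnorm(500, meanlog = 0, sdlog = 1.5)
floss <- fitdist(loss, "lnorm")
quantile(floss, probs = 0.995)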

The computation of different goodness-of-fit statistics is proposed in the fitdistrplus package in order to further compare fitted distributions. Goodness-of-fit statistics aim to measure the distance between the fitted parametric distribution and the empirical distribution: e.g., the distance between the fitted cumulative distribution function \(F\) and the empirical distribution function \(F_{n}\). When fitting continuous distributions, three goodness-of-fit statistics are classically considered: the Cramer-von Mises, Kolmogorov-Smirnov and Anderson-Darling statistics (D’Agostino and Stephens 1986). Naming \(x_{i}\) the \(n\) observations of a continuous variable \(X\) arranged in ascending order, Table 2.1 gives the definition and the empirical estimate of the three considered goodness-of-fit statistics. They can be computed using the function gofstat as defined by Stephens (D’Agostino and Stephens 1986).

gofstat(list(fendo.ln, fendo.ll, fendo.P, fendo.B), 
        fitnames = c("lnorm", "llogis", "Pareto", "Burr"))
## Goodness-of-fit statistics
##                               lnorm llogis  Pareto    Burr
## Kolmogorov-Smirnov statistic 0.1672 0.1196 0.08488 0.06155
## Cramer-von Mises statistic   0.6374 0.3827 0.13926 0.06803
## Anderson-Darling statistic   3.4721 2.8316 0.89206 0.52393
## 
## Goodness-of-fit criteria
##                                lnorm llogis Pareto Burr
## Akaike's Information Criterion  1069   1069   1048 1046
## Bayesian Information Criterion  1074   1075   1053 1054


Table 2.1: Goodness-of-fit statistics as defined by Stephens (D’Agostino and Stephens 1986).
Statistic General formula Computational formula
Kolmogorov-Smirnov (KS) \(\sup|F_{n}(x) - F(x)|\) \(\max(D^{+},D^{-})\) with \(D^{+}=\max\limits_{i=1,\dots,n}\left(\frac{i}{n} - F_i\right)\) and \(D^{-}=\max\limits_{i=1,\dots,n}\left(F_{i}-\frac{i-1}{n}\right)\)
Cramer-von Mises (CvM) \(n \int_{-\infty}^{\infty}(F_{n}(x) - F(x))^2 dx\) \(\frac{1}{12n} + \sum\limits_{i=1}^n \left(F_i-\frac{2i-1}{2n} \right)^{2}\)
Anderson-Darling (AD) \(n \int_{-\infty}^{\infty}\frac{(F_{n}(x) - F(x))^2}{F(x) (1 - F(x))} dx\) \(-n -\frac{1}{n}\sum\limits_{i=1}^n (2i-1)\log(F_i(1-F_{n+1-i}))\)

where \(F_i\stackrel{\triangle}{=} F(x_i)\)


Because it gives more weight to the distribution tails, the Anderson-Darling statistic is of special interest when it matters to emphasize the tails as much as the main body of a distribution. This is often the case in risk assessment (Cullen and Frey 1999, @Vose10). For this reason, this statistic is often used to select the best distribution among those fitted. Nevertheless, this statistic should be used cautiously when comparing fits of various distributions. Keeping in mind that the weighting of each CDF quadratic difference depends on the parametric distribution in its definition (see Table 2.1), Anderson-Darling statistics computed for several distributions fitted to the same data set are theoretically difficult to compare. Moreover, such a statistic, like the Cramer-von Mises and Kolmogorov-Smirnov ones, does not take into account the complexity of the model (i.e., its number of parameters). This is not a problem when the compared distributions are characterized by the same number of parameters, but it could systematically promote the selection of more complex distributions otherwise. Looking at classical penalized criteria based on the loglikelihood (AIC, BIC) thus also seems interesting, especially to discourage overfitting.

In the previous example, all the goodness-of-fit statistics based on the CDF distance are in favor of the Burr distribution, the only one characterized by three parameters, while AIC and BIC values respectively give the preference to the Burr distribution or the Pareto distribution. The choice between these two distributions seems thus less obvious and could be discussed. Even if specifically recommended for discrete distributions, the Chi-squared statistic may also be used for continuous distributions (see Section 3.3 and the reference manual for examples (Delignette-Muller et al. 2014)).

2.3 Uncertainty in parameter estimates

The uncertainty in the parameters of the fitted distribution can be estimated by parametric or nonparametric bootstraps using the bootdist function for non-censored data (Efron and Tibshirani 1994). This function returns the bootstrapped values of parameters in an S3 class object which can be plotted to visualize the bootstrap region. The medians and the 95% confidence intervals of parameters (2.5 and 97.5 percentiles) are printed in the summary. When it is smaller than the requested number of iterations (due to lack of convergence of the optimization algorithm for some bootstrapped data sets), the number of iterations for which the estimation converged is also printed in the summary.

The plot of an object of class bootdist consists of a scatterplot or a matrix of scatterplots of the bootstrapped values of parameters, providing a representation of the joint uncertainty distribution of the fitted parameters. Below is an example of the use of the bootdist function with the previous fit of the Burr distribution to the endosulfan data set (Figure 2.5).

bendo.B <- bootdist(fendo.B, niter = 1001)
summary(bendo.B)
## Parametric bootstrap medians and 95% percentile CI 
##        Median    2.5%  97.5%
## shape1 0.1983 0.09283 0.3606
## shape2 1.5863 1.05306 3.0629
## rate   1.4907 0.70828 2.7775
plot(bendo.B)
Bootstrapped values of parameters for a fit of the Burr distribution characterized by three parameters (example on the `endosulfan` data set) as provided by the plot of an object of class `bootdist`.

Figure 2.5: Bootstrapped values of parameters for a fit of the Burr distribution characterized by three parameters (example on the endosulfan data set) as provided by the plot of an object of class bootdist.


Bootstrap samples of parameter estimates are useful especially to calculate confidence intervals on each parameter of the fitted distribution from the marginal distribution of the bootstrapped values. It is also interesting to look at the joint distribution of the bootstrapped values in a scatterplot (or a matrix of scatterplots if the number of parameters exceeds two) in order to understand the potential structural correlation between parameters (see Figure 2.5).

The use of the whole bootstrap sample is also of interest in the risk assessment field. Its use enables the characterization of uncertainty in distribution parameters. It can be directly used within a second-order Monte Carlo simulation framework, especially within the package mc2d (Pouillot, Delignette-Muller, and Denis 2011). One could refer to Pouillot and Delignette-Muller (n.d.) for an introduction to the use of mc2d and fitdistrplus packages in the context of quantitative risk assessment.

The bootstrap method can also be used to calculate confidence intervals on quantiles of the fitted distribution. For this purpose, a generic quantile function is provided for class bootdist. By default, 95% percentile bootstrap confidence intervals of quantiles are provided. Going back to the previous example from ecotoxicology, this function can be used to estimate the uncertainty associated with the HC5 estimation, for example from the previously fitted Burr distribution to the endosulfan data set.

quantile(bendo.B, probs = 0.05)
## (original) estimated quantiles for each specified probability (non-censored data)
##          p=0.05
## estimate 0.2939
## Median of bootstrap estimates
##          p=0.05
## estimate 0.2994
## 
## two-sided 95 % CI of each quantile
##        p=0.05
## 2.5 %  0.1792
## 97.5 % 0.4999

3 Advanced topics

3.1 Alternative methods for parameter estimation

This subsection focuses on alternative estimation methods. One of the alternatives for continuous distributions is the maximum goodness-of-fit estimation method, also called minimum distance estimation method (D’Agostino and Stephens 1986, @actuarJSS). In this package this method is proposed with eight different distances: the three classical distances defined in Table 2.1, or one of the variants of the Anderson-Darling distance proposed by Luceno (n.d.) and defined in Table 3.1. The right-tail AD gives more weight to the right tail, while the left-tail AD gives more weight to the left tail. Either of the tails, or both of them, can receive even larger weights by using the second-order Anderson-Darling statistics.


Table 3.1: Modified Anderson-Darling statistics as defined by Luceno (n.d.).
Statistic General formula Computational formula
Right-tail AD (ADR) \(\int_{-\infty}^{\infty}\frac{(F_{n}(x) - F(x))^2 }{1 - F(x)} dx\) \(\frac{n}{2} -2\sum\limits_{i=1}^nF_i -\frac{1}{n}\sum\limits_{i=1}^n(2i-1)ln(\overline F_{n+1-i})\)
Left-tail AD (ADL) \(\int_{-\infty}^{\infty}\frac{(F_{n}(x) - F(x))^2 }{(F(x))} dx\) \(-\frac{3n}{2} +2\sum\limits_{i=1}^nF_i -\frac{1}{n}\sum\limits_{i=1}^n(2i-1)ln(F_i)\)
Right-tail AD 2nd order (AD2R) \(ad2r=\int_{-\infty}^{\infty}\frac{(F_{n}(x) - F(x))^2 }{(1 - F(x))^{2}} dx\) \(ad2r=2\sum\limits_{i=1}^nln(\overline F_i) +\frac{1}{n}\sum\limits_{i=1}^n \frac{2i-1}{\overline F_{n+1-i}}\)
Left-tail AD 2nd order (AD2L) \(ad2l=\int_{-\infty}^{\infty}\frac{(F_{n}(x) - F(x))^2 }{(F(x))^{2}} dx\) \(ad2l=2\sum\limits_{i=1}^nln(F_i) +\frac{1}{n}\sum\limits_{i=1}^n\frac{2i-1}{F_i}\)
AD 2nd order (AD2) \(ad2r+ad2l\) \(ad2r+ad2l\)

where \(F_i\stackrel{\triangle}{=} F(x_{i})\) and \(\overline F_i\stackrel{\triangle}{=}1-F(x_{i})\)


To fit a distribution by maximum goodness-of-fit estimation, one needs to set the argument method to "mge" in the call to fitdist and to specify the argument gof coding for the chosen goodness-of-fit distance. This function is intended to be used only with continuous non-censored data.

Maximum goodness-of-fit estimation may be useful to give more weight to data at one tail of the distribution. In the previous example from ecotoxicology, we used a non-classical distribution (the Burr distribution) to correctly fit the empirical distribution, especially on its left tail. In order to correctly estimate the 5\(\%\) percentile, we could also consider the fit of the classical lognormal distribution, but minimizing a goodness-of-fit distance giving more weight to the left tail of the empirical distribution. In what follows, the left-tail Anderson-Darling distances of first or second order are used to fit a lognormal distribution to the endosulfan data set (see Figure 3.1).

fendo.ln.ADL <- fitdist(ATV, "lnorm", method = "mge", gof = "ADL")
fendo.ln.AD2L <- fitdist(ATV, "lnorm", method = "mge", gof = "AD2L")
cdfcomp(list(fendo.ln, fendo.ln.ADL, fendo.ln.AD2L), 
  xlogscale = TRUE, ylogscale = TRUE, 
  main = "Fitting a lognormal distribution",
  xlegend = "bottomright", 
  legendtext = c("MLE", "Left-tail AD", "Left-tail AD 2nd order"))
Comparison of a lognormal distribution fitted by MLE and by MGE using two different goodness-of-fit distances: left-tail Anderson-Darling and left-tail Anderson Darling of second order (example with the `endosulfan` data set) as provided by the `cdfcomp` function, with CDF values in a logscale to emphasize discrepancies on the left tail.

Figure 3.1: Comparison of a lognormal distribution fitted by MLE and by MGE using two different goodness-of-fit distances: left-tail Anderson-Darling and left-tail Anderson Darling of second order (example with the endosulfan data set) as provided by the cdfcomp function, with CDF values in a logscale to emphasize discrepancies on the left tail.


Comparing the 5% percentiles (HC5) calculated using these three fits to the one calculated from the MLE fit of the Burr distribution, we can observe, on this example, that fitting the lognormal distribution by minimizing the left-tail Anderson-Darling distances of first or second order makes it possible to approach the value obtained by fitting the Burr distribution by MLE.

(HC5.estimates <- c(
  empirical = as.numeric(quantile(ATV, probs = 0.05)), 
  Burr = as.numeric(quantile(fendo.B, probs = 0.05)$quantiles), 
  lognormal_MLE = as.numeric(quantile(fendo.ln, probs = 0.05)$quantiles), 
  lognormal_ADL = as.numeric(quantile(fendo.ln.ADL, probs = 0.05)$quantiles), 
  lognormal_AD2L = as.numeric(quantile(fendo.ln.AD2L, probs = 0.05)$quantiles)))
##      empirical           Burr  lognormal_MLE  lognormal_ADL lognormal_AD2L 
##        0.20000        0.29393        0.07259        0.19591        0.25877

The moment matching estimation (MME) is another method commonly used to fit parametric distributions (Vose 2010). MME consists in finding the value of the parameter \(\theta\) that equalizes the first theoretical raw moments of the parametric distribution to the corresponding empirical raw moments as in Equation (3.1):

\[\begin{equation} E(X^{k}|\theta)=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{k},\tag{3.1} \end{equation}\] for \(k=1,\ldots,d\), with \(d\) the number of parameters to estimate and \(x_{i}\) the \(n\) observations of variable \(X\). For moments of order greater than or equal to 2, it may also be relevant to match centered moments. Therefore, we match the moments given in Equation (3.2):

\[\begin{equation} E(X\vert \theta) = \overline{x} ~,~E\left((X-E(X))^{k}|\theta\right)=m_k, \text{ for } k=2,\ldots,d,\tag{3.2} \end{equation}\]

where \(m_k\) denotes the empirical centered moments. This method can be performed by setting the argument method to "mme" in the call to fitdist. The estimate is computed by a closed-form formula for the following distributions: normal, lognormal, exponential, Poisson, gamma, logistic, negative binomial, geometric, beta and uniform distributions. In this case, for distributions characterized by one parameter (geometric, Poisson and exponential), this parameter is simply estimated by matching theoretical and observed means, and for distributions characterized by two parameters, these parameters are estimated by matching theoretical and observed means and variances (Vose 2010). For other distributions, the equation of moments is solved numerically using the optim function by minimizing the sum of squared differences between observed and theoretical moments (see the fitdistrplus reference manual for technical details (Delignette-Muller et al. 2014)).
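
As a small illustration (a sketch; the exact denominator convention for the variance is the one implemented in the package), the closed-form MME for the normal distribution matches the empirical mean and variance:

# Sketch: for the normal distribution, MME matches the empirical mean
# and the empirical (centered second-moment) standard deviation.
set.seed(1234)
y <- rnorm(100, mean = 5, sd = 2)
fmme.norm <- fitdist(y, "norm", method = "mme")
coef(fmme.norm)
c(mean = mean(y), sd = sqrt(mean((y - mean(y))^2)))   # matched moments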

A classical data set from the Danish insurance industry published in McNeil (1997) will be used to illustrate this method. In fitdistrplus, the data set is stored in danishuni for the univariate version and contains the loss amounts collected at Copenhagen Reinsurance between 1980 and 1990. In actuarial science, it is standard to consider positive heavy-tailed distributions and have a special focus on the right-tail of the distributions. In this numerical experiment, we choose classic actuarial distributions for loss modelling: the lognormal distribution and the Pareto type II distribution (Klugman, Panjer, and Willmot 2009).

The lognormal distribution is fitted to the danishuni data set by matching moments implemented as a closed-form formula. On the left-hand graph of Figure 3.2, the fitted distribution functions obtained using the moment matching estimation (MME) and maximum likelihood estimation (MLE) methods are compared. The MME method provides a more cautious estimation of the insurance risk, as the MME-fitted distribution function underestimates the empirical distribution function for large values of claim amounts, whereas the MLE-fitted one overestimates it.

data("danishuni")
str(danishuni)
## 'data.frame':    2167 obs. of  2 variables:
##  $ Date: Date, format: "1980-01-03" "1980-01-04" ...
##  $ Loss: num  1.68 2.09 1.73 1.78 4.61 ...


fdanish.ln.MLE <- fitdist(danishuni$Loss, "lnorm")
fdanish.ln.MME <- fitdist(danishuni$Loss, "lnorm", method = "mme", order = 1:2)
library(actuar)
fdanish.P.MLE <- fitdist(danishuni$Loss, "pareto", start = list(shape = 10, scale = 10), 
                         lower = 2+1e-6, upper = Inf)
memp <- function(x, order) sum(x^order) / length(x)
fdanish.P.MME <- fitdist(danishuni$Loss, "pareto", method = "mme", order = 1:2, memp = "memp", 
                         start = list(shape = 10, scale = 10), lower = c(2+1e-6, 2+1e-6), 
                         upper = c(Inf, Inf))

par(mfrow = c(1, 2))
cdfcomp(list(fdanish.ln.MLE, fdanish.ln.MME), legend = c("lognormal MLE", "lognormal MME"),
        main = "Fitting a lognormal distribution", xlogscale = TRUE, datapch = 20)
cdfcomp(list(fdanish.P.MLE, fdanish.P.MME), legend = c("Pareto MLE", "Pareto MME"), 
        main = "Fitting a Pareto distribution", xlogscale = TRUE, datapch = 20)
Comparison between MME and MLE when fitting a lognormal or a Pareto distribution to loss data from the `danishuni` data set.

Figure 3.2: Comparison between MME and MLE when fitting a lognormal or a Pareto distribution to loss data from the danishuni data set.


Next, a Pareto distribution, which gives more weight to the right tail of the distribution, is fitted. Like the lognormal distribution, the Pareto distribution has two parameters, which allows a fair comparison.

We use the implementation of the actuar package providing raw and centered moments for that distribution (in addition to the d, p, q and r functions) (Goulet 2012). Fitting a heavy-tailed distribution for which the first and the second moments do not exist for certain values of the shape parameter requires some caution. This is carried out by providing, for the optimization process, a lower and an upper bound for each parameter. The code below calls the L-BFGS-B optimization method in optim, since this quasi-Newton method allows box constraints. We choose to match the moments defined in Equation (3.1), and so a function computing the empirical raw moments (called memp in our example) is passed to fitdist. For two-parameter distributions (i.e., \(d=2\)), Equations (3.1) and (3.2) are equivalent.

gofstat(list(fdanish.ln.MLE, fdanish.P.MLE, fdanish.ln.MME, fdanish.P.MME), 
        fitnames = c("lnorm.mle", "Pareto.mle", "lnorm.mme", "Pareto.mme"))
## Goodness-of-fit statistics
##                              lnorm.mle Pareto.mle lnorm.mme Pareto.mme
## Kolmogorov-Smirnov statistic    0.1375     0.3124    0.4368       0.37
## Cramer-von Mises statistic     14.7911    37.7227   88.9503      55.43
## Anderson-Darling statistic     87.1933   208.3388  416.2567     281.58
## 
## Goodness-of-fit criteria
##                                lnorm.mle Pareto.mle lnorm.mme Pareto.mme
## Akaike's Information Criterion      8120       9250      9792       9409
## Bayesian Information Criterion      8131       9261      9803       9420

As shown in Figure 3.2, the MME and MLE fits are far less distant (when looking at the right tail) for the Pareto distribution than for the lognormal distribution on this data set. Furthermore, for these two distributions, the MME method better fits the right tail of the distribution from a visual point of view. This seems logical since empirical moments are influenced by large observed values. The output above also gives the values of the goodness-of-fit statistics: whatever the statistic considered, the MLE-fitted lognormal always provides the best fit to the observed data.

Maximum likelihood and moment matching estimations are certainly the most commonly used methods for fitting distributions (Cullen and Frey 1999). Keeping in mind that these two methods may produce very different results, the user should be aware of the great sensitivity of moment matching estimation to outliers when choosing that method. This may be seen as an advantage in our example if the objective is to better describe the right tail of the distribution, but it may be seen as a drawback if the objective is different.

Fitting of a parametric distribution may also be done by matching theoretical quantiles of the parametric distributions (for specified probabilities) against the empirical quantiles (Tse 2009). The equality of theoretical and empirical quantiles is expressed by Equation (3.3) below, which is very similar to Equations (3.1) and (3.2):

\[\begin{equation} F^{-1}(p_{k}|\theta)=Q_{n,p_{k}}\tag{3.3} \end{equation}\] for \(k=1,\ldots,d\), with \(d\) the number of parameters to estimate (dimension of \(\theta\) if there is no fixed parameters) and \(Q_{n,p_{k}}\) the empirical quantiles calculated from data for specified probabilities \(p_{k}\).

Quantile matching estimation (QME) is performed by setting the argument method to "qme" in the call to fitdist and adding an argument probs defining the probabilities for which the quantile matching is performed (see Figure 3.3). The length of this vector must be equal to the number of parameters to estimate (as for the vector of moment orders in MME). Empirical quantiles are computed using the quantile function of the stats package with type = 7 by default (see ?quantile and Hyndman and Fan (1996)), but the quantile type can be easily changed via the qtype argument of the qmedist function. The quantile matching is carried out numerically, by minimizing the sum of squared differences between observed and theoretical quantiles (made explicit in the sketch after Figure 3.3).

fdanish.ln.QME1 <- fitdist(danishuni$Loss, "lnorm", method = "qme", probs = c(1/3, 2/3))
fdanish.ln.QME2 <- fitdist(danishuni$Loss, "lnorm", method = "qme", probs = c(8/10, 9/10))
cdfcomp(list(fdanish.ln.MLE, fdanish.ln.QME1, fdanish.ln.QME2), 
        legend = c("MLE", "QME(1/3, 2/3)", "QME(8/10, 9/10)"), 
        main = "Fitting a lognormal distribution", xlogscale = TRUE, datapch = 20)
Comparison between QME and MLE when fitting a lognormal distribution to loss data from the `danishuni` data set.

Figure 3.3: Comparison between QME and MLE when fitting a lognormal distribution to loss data from the danishuni data set.


Above is an example of fitting of a lognormal distribution to the danishuni data set by matching probabilities \((p_1= 1/3, p_2=2/3)\) and \((p_1= 8/10, p_2=9/10)\). As expected, the second QME fit gives more weight to the right tail of the distribution. Compared to the maximum likelihood estimation, the second QME fit better suits the right tail of the distribution, whereas the first QME fit better models the body of the distribution. The quantile matching estimation is of particular interest when we need to focus around particular quantiles, e.g., \(p=99.5\%\) in the Solvency II insurance context or \(p=5\%\) for the HC5 estimation in the ecotoxicology context.
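
To make the criterion of Equation (3.3) explicit, the following minimal sketch (written for this illustration; not the package's internal code, which handles starting values and quantile types more generally) reproduces the first QME fit by direct minimization:

# Sketch of the criterion behind QME for a lognormal distribution:
# sum of squared differences between theoretical and empirical quantiles.
qme.obj <- function(par, x, probs) {
  if (par[2] <= 0) return(Inf)   # keep sdlog positive
  sum((qlnorm(probs, meanlog = par[1], sdlog = par[2]) -
       quantile(x, probs = probs, type = 7, names = FALSE))^2)
}
optim(par = c(0, 1), fn = qme.obj, x = danishuni$Loss, probs = c(1/3, 2/3))$par
coef(fdanish.ln.QME1)   # should be close to the optim result above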

3.2 Customization of the optimization algorithm

Each time a numerical minimization is carried out in the fitdistrplus package, the optim function of the stats package is used by default, with the Nelder-Mead method for distributions characterized by more than one parameter and the BFGS method for distributions characterized by only one parameter. Sometimes the default algorithm fails to converge. It is then interesting to change some options of the optim function or to use an optimization function other than optim to minimize the objective function. The argument optim.method can be used in the call to fitdist or fitdistcens. It is internally passed to mledist, mmedist, mgedist or qmedist, and then to optim (see ?optim for details about the different algorithms available).

Even when the optimization raises no error, changing the algorithm is of particular interest to enforce bounds on some parameters. For instance, a volatility parameter \(\sigma\) is strictly positive (\(\sigma>0\)) and a probability parameter \(p\) lies in \([0,1]\). This is possible by using the arguments lower and/or upper, whose use automatically forces optim.method = "L-BFGS-B".
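
For example (a hedged sketch), strict positivity of the gamma shape and rate parameters can be enforced as follows:

# Sketch: box constraints on the gamma parameters; supplying lower
# (and/or upper) automatically selects optim.method = "L-BFGS-B".
fgamma.LBFGSB <- fitdist(groundbeef$serving, "gamma", lower = c(1e-6, 1e-6))
coef(fgamma.LBFGSB)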

Below are examples of fits of a gamma distribution \(\mathcal{G}(\alpha, \lambda)\) to the groundbeef data set with various algorithms. Note that the conjugate gradient algorithm (CG) needs far more iterations to converge (around 2500 iterations) compared to other algorithms (converging in less than 100 iterations).

data("groundbeef")
fNM <- fitdist(groundbeef$serving, "gamma", optim.method = "Nelder-Mead")
fBFGS <- fitdist(groundbeef$serving, "gamma", optim.method = "BFGS") 
fSANN <- fitdist(groundbeef$serving, "gamma", optim.method = "SANN")
fCG <- try(fitdist(groundbeef$serving, "gamma", optim.method = "CG",
                   control = list(maxit = 10000)))
if(inherits(fCG, "try-error")) {fCG <- list(estimate = NA)}

It is also possible to use a function other than optim to minimize the objective function, by specifying the argument custom.optim in the call to fitdist. It may be necessary to customize this optimization function to meet the following requirements. (1) The custom.optim function must have the following arguments: fn for the function to be optimized and par for the initialized parameters. (2) custom.optim should carry out a MINIMIZATION and must return the following components: par for the estimate, convergence for the convergence code, value = fn(par) and hessian. Below is an example of code written to wrap the genoud function from the rgenoud package in order to respect our optimization "template". The rgenoud package implements a genetic (stochastic) optimization algorithm.

mygenoud <- function(fn, par, ...) 
{
   require(rgenoud)
   # minimize fn, starting from the initial parameters par
   res <- genoud(fn, starting.values = par, ...)        
   # append the convergence code expected by fitdist/mledist
   standardres <- c(res, convergence = 0)
   return(standardres)
}

The customized optimization function can then be passed as the argument custom.optim in the call to fitdist or fitdistcens. The following code can for example be used to fit a gamma distribution to the groundbeef data set. Note that in this example various arguments are also passed through mledist to genoud: nvars, Domains, boundary.enforcement, print.level and hessian. The code below compares the parameter estimates (\(\hat\alpha\), \(\hat\lambda\)) obtained by the different algorithms: the shape \(\alpha\) and rate \(\lambda\) parameters are relatively similar on this example, roughly 4.00 and 0.05, respectively.

fgenoud <- mledist(groundbeef$serving, "gamma", custom.optim = mygenoud, nvars = 2, 
                   max.generations = 10, Domains = cbind(c(0, 0), c(10, 10)), 
                   boundary.enforcement = 1, hessian = TRUE, print.level = 0, P9 = 10)
## Loading required package: rgenoud
## ##  rgenoud (Version 5.9-0.3, Build Date: 2022-04-19)
## ##  See http://sekhon.berkeley.edu/rgenoud for additional documentation.
## ##  Please cite software as:
## ##   Walter Mebane, Jr. and Jasjeet S. Sekhon. 2011.
## ##   ``Genetic Optimization Using Derivatives: The rgenoud package for R.''
## ##   Journal of Statistical Software, 42(11): 1-26. 
## ##
cbind(NM = fNM$estimate, BFGS = fBFGS$estimate, SANN = fSANN$estimate, CG = fCG$estimate, 
      fgenoud = fgenoud$estimate)
##            NM    BFGS    SANN      CG fgenoud
## shape 4.00825 4.22848 3.96743 4.12850 4.00834
## rate  0.05442 0.05742 0.05385 0.05606 0.05443

3.3 Fitting distributions to other types of data

This section has been modified since the publication of this vignette in the Journal of Statistical Software, in order to include new goodness-of-fit plots for censored and discrete data.

Analytical methods often lead to semi-quantitative results which are referred to as censored data. Observations only known to be under a limit of detection are left-censored data. Observations only known to be above a limit of quantification are right-censored data. Results known to lie between two bounds are interval-censored data. These two bounds may correspond to a limit of detection and a limit of quantification, or more generally to uncertainty bounds around the observation. Right-censored data are also commonly encountered with survival data (Klein and Moeschberger 2003). A data set may thus contain right-, left-, or interval-censored data, or may be a mixture of these categories, possibly with different upper and lower bounds. Censored data are sometimes excluded from the data analysis or replaced by a fixed value, which in both cases may lead to biased results. A more recommended approach to correctly model such data is based upon maximum likelihood (Klein and Moeschberger 2003, @helsel05).

Censored data may thus contain left-censored, right-censored and interval-censored values, with several lower and upper bounds. Before their use in package fitdistrplus, such data must be coded into a dataframe with two columns, respectively named left and right, describing each observed value as an interval. The left column contains either NA for left-censored observations, the left bound of the interval for interval-censored observations, or the observed value for non-censored observations. The right column contains either NA for right-censored observations, the right bound of the interval for interval-censored observations, or the observed value for non-censored observations. To illustrate the use of package fitdistrplus to fit distributions to censored continuous data, we will use another data set from ecotoxicology, included in our package and named salinity. This data set contains acute salinity tolerance (LC50 values in electrical conductivity, \(mS\).\(cm^{-1}\)) of riverine macro-invertebrate taxa from the southern Murray-Darling Basin in Central Victoria, Australia (Kefford et al. 2007).
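
As a small illustration with hypothetical values (not from the salinity data set), a data set mixing the four kinds of observations would be coded as:

# One observation per row: left-censored (< 5), interval-censored (2, 7),
# non-censored (4.3), and right-censored (> 10).
dcens <- data.frame(left  = c(NA, 2, 4.3, 10),
                    right = c( 5, 7, 4.3, NA))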

data("salinity")
str(salinity)
## 'data.frame':    108 obs. of  2 variables:
##  $ left : num  20 20 20 20 20 21.5 15 20 23.7 25 ...
##  $ right: num  NA NA NA NA NA 21.5 30 25 23.7 NA ...

Using censored data such as those coded in the salinity data set, the empirical distribution can be plotted using the plotdistcens function. In older versions of the package, by default this function used the Expectation-Maximization approach of Turnbull (n.d.) to compute the overall empirical cdf curve with optional confidence intervals, by calls to the survfit and plot.survfit functions from the survival package. Even if this representation is still available (by fixing the argument NPMLE.method to "Turnbull.middlepoints"), the default plot of the empirical cumulative distribution function (ECDF) now explicitly represents the regions of non-uniqueness of the NPMLE ECDF. The default computation of those regions of non-uniqueness and their associated masses uses the nonparametric maximum likelihood estimation (NPMLE) approach developed by Wang (Wang 2007, @Wang2008, @Wang2013, @Wang2018).
Figure 3.5 shows on the top left the new plot of data together with two fitted distributions. Grey filled rectangles in such a plot represent the regions of non-uniqueness of the NPMLE ECDF.

A less rigorous but sometimes more illustrative plot can be obtained by setting the argument NPMLE to FALSE in the call to plotdistcens (see Figure 3.4 for an example and the help page of the plotdistcens function for details). This plot makes it possible to see the real nature of censored data, as points and intervals, but the difficulty in building such a plot is to define a relevant ordering of observations.

plotdistcens(salinity, NPMLE = FALSE)
Simple plot of censored raw data (72-hour acute salinity tolerance of riverine macro-invertebrates from the `salinity` data set) as ordered points and intervals.

Figure 3.4: Simple plot of censored raw data (72-hour acute salinity tolerance of riverine macro-invertebrates from the salinity data set) as ordered points and intervals.


As for non-censored data, one or more parametric distributions can be fitted to the censored data set, one at a time, using in this case the fitdistcens function. This function estimates the vector of distribution parameters \(\theta\) by maximizing the likelihood for censored data, defined as:

\[\begin{equation} L(\theta) = \prod_{i=1}^{N_{nonC}} f(x_{i}|\theta)\times \prod_{j=1}^{N_{leftC}} F(x^{upper}_{j}|\theta) \\ \times \prod_{k=1}^{N_{rightC}} (1- F(x^{lower}_{k}|\theta))\times \prod_{m=1}^{N_{intC}} (F(x^{upper}_{m}|\theta)- F(x^{lower}_{m}|\theta))\tag{3.4} \end{equation}\]

with \(x_{i}\) the \(N_{nonC}\) non-censored observations, \(x^{upper}_{j}\) upper values defining the \(N_{leftC}\) left-censored observations, \(x^{lower}_{k}\) lower values defining the \(N_{rightC}\) right-censored observations, \([x^{lower}_{m} ; x^{upper}_{m}]\) the intervals defining the \(N_{intC}\) interval-censored observations, and F the cumulative distribution function of the parametric distribution (Klein and Moeschberger 2003, @helsel05).

Like fitdist, fitdistcens returns the results of the fit of any parametric distribution to a data set as an S3 class object that can be easily printed, summarized or plotted. For the salinity data set, a lognormal or a loglogistic distribution can be fitted, as commonly done in ecotoxicology for such data. As with fitdist, for some distributions (see Delignette-Muller et al. (2014) for details), it is necessary to specify initial values for the distribution parameters in the argument start. The plotdistcens function can help to find correct initial values for the distribution parameters in non-trivial cases, by a manual iterative use if necessary.

fsal.ln <- fitdistcens(salinity, "lnorm")
fsal.ll <- fitdistcens(salinity, "llogis", start = list(shape = 5, scale = 40))
summary(fsal.ln)
## Fitting of the distribution ' lnorm ' By maximum likelihood on censored data 
## Parameters
##         estimate Std. Error
## meanlog   3.3854    0.06487
## sdlog     0.4961    0.05455
## Loglikelihood:  -139.1   AIC:  282.1   BIC:  287.5 
## Correlation matrix:
##         meanlog  sdlog
## meanlog  1.0000 0.2938
## sdlog    0.2938 1.0000
summary(fsal.ll)
## Fitting of the distribution ' llogis ' By maximum likelihood on censored data 
## Parameters
##       estimate Std. Error
## shape    3.421     0.4158
## scale   29.930     1.9447
## Loglikelihood:  -140.1   AIC:  284.1   BIC:  289.5 
## Correlation matrix:
##         shape   scale
## shape  1.0000 -0.2022
## scale -0.2022  1.0000
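
As a check (a sketch written for this illustration, not package code), the loglikelihood printed above for the lognormal fit is Equation (3.4) evaluated on the log scale:

# Equation (3.4) on the log scale: exact values contribute log f(x),
# left-censored values log F(upper), right-censored values log(1 - F(lower)),
# and interval-censored values log(F(upper) - F(lower)).
L <- salinity$left; R <- salinity$right
p <- coef(fsal.ln)
ll <- ifelse(is.na(L), plnorm(R, p[1], p[2], log.p = TRUE),
      ifelse(is.na(R), plnorm(L, p[1], p[2], lower.tail = FALSE, log.p = TRUE),
      ifelse(L == R, dlnorm(L, p[1], p[2], log = TRUE),
             log(plnorm(R, p[1], p[2]) - plnorm(L, p[1], p[2])))))
sum(ll)          # should match logLik(fsal.ln), about -139.1
logLik(fsal.ln)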

Computations of goodness-of-fit statistics have not yet been developed for fits using censored data, but the quality of fit can be judged using the Akaike and Schwarz's Bayesian information criteria (AIC and BIC) and the goodness-of-fit CDF plot, respectively provided when summarizing or plotting an object of class fitdistcens. The functions cdfcompcens, qqcompcens and ppcompcens can also be used to compare the fit of various distributions to the same censored data set. Their calls are similar to those of cdfcomp, qqcomp and ppcomp. Below are examples of the use of those functions with the two distributions fitted to the salinity data set (see Figure 3.5). When qqcompcens and ppcompcens are used with more than one fitted distribution, the non-uniqueness rectangles are not filled and small noise is added on the y-axis in order to help the visualization of the various fits. We rather recommend the "ggplot" plotstyle of qqcompcens and ppcompcens to compare the fits of various distributions, as it provides a clearer plot split into facets (see ?graphcompcens); an example is given after Figure 3.5.

par(mfrow = c(2, 2))
cdfcompcens(list(fsal.ln, fsal.ll), legendtext = c("lognormal", "loglogistic "))
qqcompcens(fsal.ln, legendtext = "lognormal")
ppcompcens(fsal.ln, legendtext = "lognormal")
qqcompcens(list(fsal.ln, fsal.ll), legendtext = c("lognormal", "loglogistic "),
           main = "Q-Q plot with 2 dist.")
Some goodness-of-fit plots for fits of a lognormal and a loglogistic distribution to censored data: LC50 values from the `salinity` data set.

Figure 3.5: Some goodness-of-fit plots for fits of a lognormal and a loglogistic distribution to censored data: LC50 values from the salinity data set.
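
Following the recommendation above, the same comparison can be drawn with facets using the "ggplot" plot style (assuming the ggplot2 package is installed):

# Faceted Q-Q comparison of the two censored fits (requires ggplot2).
qqcompcens(list(fsal.ln, fsal.ll), legendtext = c("lognormal", "loglogistic"),
           plotstyle = "ggplot")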


Function bootdistcens is the equivalent of bootdist for censored data, except that it only proposes nonparametric bootstrap. Indeed, it is not obvious how to simulate censoring within a parametric bootstrap resampling procedure. The generic function quantile can also be applied to an object of class fitdistcens or bootdistcens, as for continuous non-censored data.

In addition to the fit of distributions to censored or non-censored continuous data, our package can also accommodate discrete variables, such as count numbers, using the functions developed for continuous non-censored data. These functions provide somewhat different graphs and statistics, taking into account the discrete nature of the modeled variable. The discrete nature of the variable is automatically recognized when a classical discrete distribution is fitted to data (binomial, negative binomial, geometric, hypergeometric and Poisson distributions) but must be indicated by setting the argument discrete to TRUE in the call to functions in other cases. The toxocara data set included in the package corresponds to the observation of such a discrete variable. Numbers of Toxocara cati parasites present in the digestive tract are reported from a random sampling of feral cats living on Kerguelen island (Fromont et al., n.d.). We will use it to illustrate the case of discrete data.

data("toxocara")
str(toxocara)
## 'data.frame':    53 obs. of  1 variable:
##  $ number: int  0 0 0 0 0 0 0 0 0 0 ...

The fit of a discrete distribution to discrete data by maximum likelihood estimation requires the same procedure as for continuous non-censored data. As an example, using the toxocara data set, Poisson and negative binomial distributions can be easily fitted.

(ftoxo.P <- fitdist(toxocara$number, "pois"))
## Fitting of the distribution ' pois ' by maximum likelihood 
## Parameters:
##        estimate Std. Error
## lambda    8.679     0.4047
(ftoxo.nb <- fitdist(toxocara$number, "nbinom"))
## Fitting of the distribution ' nbinom ' by maximum likelihood 
## Parameters:
##      estimate Std. Error
## size   0.3971    0.08289
## mu     8.6803    1.93501

For discrete distributions, the plot of an object of class fitdist simply provides two goodness-of-fit plots comparing empirical and theoretical distributions in density and in CDF. The functions cdfcomp and denscomp can also be used to compare several fits to the same data set, as follows for the previous fits (Figure 3.6).

par(mfrow = c(1, 2))
denscomp(list(ftoxo.P, ftoxo.nb), legendtext = c("Poisson", "negative binomial"), fitlty = 1)
cdfcomp(list(ftoxo.P, ftoxo.nb), legendtext = c("Poisson", "negative binomial"), fitlty = 1)
Comparison of the fits of a negative binomial and a Poisson distribution to numbers of *Toxocara cati* parasites from the `toxocara` data set.

Figure 3.6: Comparison of the fits of a negative binomial and a Poisson distribution to numbers of Toxocara cati parasites from the toxocara data set.


When fitting discrete distributions, the Chi-squared statistic is computed by the gofstat function using cells defined by the argument chisqbreaks, or cells automatically defined from the data so as to reach roughly the same number of observations per cell. This number is roughly equal to the argument meancount, or slightly greater if there are ties. The choice to define cells from the empirical distribution (data), and not from the theoretical distribution, was made to enable the comparison of Chi-squared values obtained with different distributions fitted to the same data set. If the arguments chisqbreaks and meancount are both omitted, meancount is fixed in order to obtain roughly \((4n)^{2/5}\) cells, with \(n\) the length of the data set (Vose 2010). Using this default option, the two previous fits are compared as follows, giving the preference to the negative binomial distribution, from both the Chi-squared statistics and the information criteria:

gofstat(list(ftoxo.P, ftoxo.nb), fitnames = c("Poisson", "negative binomial"))
## Chi-squared statistic:  31257 7.486 
## Degree of freedom of the Chi-squared distribution:  5 4 
## Chi-squared p-value:  0 0.1123 
##    the p-value may be wrong with some theoretical counts < 5  
## Chi-squared table:
##       obscounts theo Poisson theo negative binomial
## <= 0         14     0.009014                 15.295
## <= 1          8     0.078237                  5.809
## <= 3          6     1.321767                  6.845
## <= 4          6     2.131298                  2.408
## <= 9          6    29.827829                  7.835
## <= 21         6    19.626224                  8.271
## > 21          7     0.005631                  6.537
## 
## Goodness-of-fit criteria
##                                Poisson negative binomial
## Akaike's Information Criterion    1017             322.7
## Bayesian Information Criterion    1019             326.6
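
The cell definition can also be controlled explicitly through these arguments. A minimal sketch follows; the break points and the target mean count are illustrative choices, not recommendations.

# cells <= 0, <= 1, <= 5, <= 10, <= 20 and > 20, from user-chosen breaks
gofstat(ftoxo.nb, chisqbreaks = c(0, 1, 5, 10, 20))
# or aim at roughly 15 observations per automatically defined cell
gofstat(ftoxo.nb, meancount = 15)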

4 Conclusion

The R package fitdistrplus allows distributions to be fitted easily. Our main objective in developing this package was to provide tools to help R users fit distributions to data. We have been encouraged to pursue our work by feedback from users of our package in various areas, such as food or environmental risk assessment, epidemiology, ecology, molecular biology, genomics, bioinformatics, hydraulics, mechanics, financial and actuarial mathematics, and operations research. Indeed, this package is already used by many practitioners and academics for simple MLE fits (Jaloustre et al. 2011, @saketal11, @kochetal12, @marquetouxetal12, @scholletal12, @suuronenetal12, @varoetal12, @mandletal13, @mala2013publi, @nadarajahbakar2013, @simoetal13, @westphalfitch13, @wayland13, @vianaetal13, @wu13, @staggeetal13, @fiorellietal13, @tikoleetal13, @voigtetal14), for MLE fits and goodness-of-fit statistics (Tarnóczi, Fenyves, and Bács 2011, @bagariaetal12, @benavidesetal12, @breitbach12, @Pouillot10, @vaninsky13), for MLE fits and bootstrap (Croucher et al. 2012, @meheustetal12, @orellanoetal12, @telloetal12, @hoelzeretal12, @prosseretal13, @Zhang2013, @Rigaux2014), for MLE fits, bootstrap and goodness-of-fit statistics (Larras, Montuelle, and Bouchez 2013), for MME fits (Luangkesorn et al. 2012, @callauetal13, @satoetal13), for censored MLE and bootstrap (Leha, Beissbarth, and Jung 2011, @poulliotetal12, @jongenburgeretal12, @commeauetal12, @contrerasetal2013), for graphical analysis (Anand, Yeturu, and Chandra 2012), for grouped-data fitting methods (Fu, Steiner, and Costafreda 2012), or more generally (Busschaert et al. 2010, @eling12, @sosaetal2013, @srinivasanetal2013, @meyeretal13, @Guillier2013471, @Daelmanetal13, @eiketal13, @Wu2013b, @drakeetal2014).

The fitdistrplus package is complementary to the distrMod package (Kohl and Ruckdeschel 2010). distrMod provides an even more flexible way to estimate distribution parameters, but its use requires a greater initial investment to learn how to manipulate the S4 classes and methods developed in the distr-family packages.

Many extensions of the fitdistrplus package are planned for the future: we aim to extend to censored data some methods currently available only for non-censored data, especially concerning goodness-of-fit evaluation and fitting methods. We will also enlarge the choice of fitting methods for non-censored data by proposing new goodness-of-fit distances (e.g., distances based on quantiles) for maximum goodness-of-fit estimation and new types of moments (e.g., limited expected values) for moment matching estimation. Finally, we will consider the case of multivariate distribution fitting.


Acknowledgments

The package would not have reached this stage without the stimulating contributions of Régis Pouillot and Jean-Baptiste Denis, especially regarding its conceptualization. We also want to thank Régis Pouillot for his very valuable comments on the first version of this paper.

The authors gratefully acknowledge the two anonymous referees and the Editor for their useful and constructive comments. The remaining errors, of course, should be attributed to the authors alone.


References

Anand, P., K. Yeturu, and N. Chandra. 2012. “PocketAnnotate: Towards Site-Based Function Annotation.” Nucleic Acids Research 40: 1–9.

Bagaria, A., V. Jaravine, Y. J. Huang, G. T. Montelione, and P. Güntert. 2012. “Protein Structure Validation by Generalized Linear Model Root-Mean-Square Deviation Prediction.” Protein Science 21 (2): 229–38.

Benavides-Piccione, R., I. Fernaud-Espinosa, V. Robles, R. Yuste, and J. DeFelipe. 2012. “Age-Based Comparison of Human Dendritic Spine Structure Using Complete Three-Dimensional Reconstructions.” Cerebral Cortex 23 (8): 1798–1810.

Blom, G. 1959. Statistical Estimates and Transformed Beta Variables. 1st ed. John Wiley & Sons.

Breitbach, N., K. Böhning-Gaese, I. Laube, and M. Schleuning. 2012. “Short Seed-Dispersal Distances and Low Seedling Recruitment in Farmland Populations of Bird-Dispersed Cherry Trees.” Journal of Ecology 100 (6): 1349–58.

Busschaert, P., A. H. Geeraerd, M. Uyttendaele, and J. F. VanImpe. 2010. “Estimating Distributions Out of Qualitative and (Semi)Quantitative Microbiological Contamination Data for Use in Risk Assessment.” International Journal of Food Microbiology 138: 260–69.

Callau Poduje, Ana Claudia, Aslan Belli, and Uwe Haberlandt. 2013. “Dam Risk Assessment Based on Univariate Versus Bivariate Statistical Approaches - a Case Study for Argentina.” Hydrological Sciences Journal. https://doi.org/10.1080/02626667.2013.871014.

Casella, G., and R. L. Berger. 2002. Statistical Inference. 2nd ed. Duxbury Thomson Learning.

Commeau, N., E. Parent, M.-L. Delignette-Muller, and M. Cornu. 2012. “Fitting a Lognormal Distribution to Enumeration and Absence/Presence Data.” International Journal of Food Microbiology 155: 146–52.

Contreras, V. De La Huerta, H. Vaquera Huerta, and B. C. Arnold. 2013. “A Test for Equality of Variance with Censored Samples.” Journal of Statistical Computation and Simulation. https://doi.org/10.1080/00949655.2013.825095.

Croucher, N. J., S. R. Harris, L. Barquist, J. Parkhill, and S. D. Bentley. 2012. “A High-Resolution View of Genome-Wide Pneumococcal Transformation.” PLoS Pathogens 8 (6): e1002745.

Cullen, A. C., and H. C. Frey. 1999. Probabilistic Techniques in Exposure Assessment. 1st ed. Plenum Publishing Co.

Daelman, Jeff, Jeanne-Marie Membré, Liesbeth Jacxsens, An Vermeulen, Frank Devlieghere, and Mieke Uyttendaele. 2013. “A Quantitative Microbiological Exposure Assessment Model for Bacillus Cereus in Repfeds.” International Journal of Food Microbiology 166 (3): 433–49.

D’Agostino, R. B., and M. A. Stephens. 1986. Goodness-of-Fit Techniques. 1st ed. Dekker.

Delignette-Muller, M. L., and M. Cornu. 2008. “Quantitative Risk Assessment for Escherichia Coli O157:H7 in Frozen Ground Beef Patties Consumed by Young Children in French Households.” International Journal of Food Microbiology 128 (1): 158–64. https://doi.org/https://doi.org/10.1016/j.ijfoodmicro.2008.05.040.

Delignette-Muller, M. L., R. Pouillot, J. B. Denis, and C. Dutang. 2014. Fitdistrplus: Help to Fit of a Parametric Distribution to Non-Censored or Censored Data. https://cran.r-project.org/package=fitdistrplus.

Drake, T., Z. Chalabi, and R. Coker. 2014. “Buy Now, Saved Later? The Critical Impact of Time-to-Pandemic Uncertainty on Pandemic Cost-Effectiveness Analyses.” Health Policy and Planning. https://doi.org/10.1093/heapol/czt101.

Dutang, C., V. Goulet, and M. Pigeon. 2008. “actuar: an R Package for Actuarial Science.” Journal of Statistical Software 25 (7): 1–37.

Efron, B., and R. J. Tibshirani. 1994. An Introduction to the Bootstrap. 1st ed. Chapman & Hall.

Eik, M., K. Luhmus, M. Tigasson, M. Listak, J. Puttonen, and H. Herrmann. 2013. “DC-Conductivity Testing Combined with Photometry for Measuring Fibre Orientations in SFRC.” Journal of Materials Science 48 (10): 3745–59.

Eling, M. 2012. “Fitting Insurance Claims to Skewed Distributions: Are the Skew-normal and the Skew-student Good Models?” Insurance: Mathematics and Economics 51 (2): 239–48.

Fiorelli, L. E., M. D. Ezcurra, E. M. Hechenleitner, E. Argañaraz, R. Jeremias, A. Taborda, M. J. Trotteyn, M. Belén von Baczko, and J. B. Desojo. 2013. “The Oldest Known Communal Latrines Provide Evidence of Gregarism in Triassic Megaherbivores.” Scientific Reports 3 (3348): 1–7.

Fromont, E., L. Morvilliers, M. Artois, and D. Pontier. 2001. “Parasite Richness and Abundance in Insular and Mainland Feral Cats: Insularity or Density?” Parasitology 123 (Part 2): 143–51.

Fu, C. H. Y., H. Steiner, and S. G. Costafreda. 2012. “Predictive Neural Biomarkers of Clinical Response in Depression: A Meta-Analysis of Functional and Structural Neuroimaging Studies of Pharmacological and Psychological Therapies.” Neurobiology of Disease 52: 75–83.

González-Varo, J. P., J. V. López-Bao, and J. Guitián. 2012. “Functional Diversity Among Seed Dispersal Kernels Generated by Carnivorous Mammals.” Journal of Animal Ecology 82: 562–71.

Goulet, V. 2012. Actuar: An R Package for Actuarial Science. https://cran.r-project.org/package=actuar.

Guillier, Laurent, Corinne Danan, Hélène Bergis, Marie-Laure Delignette-Muller, Sophie Granier, Sylvie Rudelle, Annie Beaufort, and Anne Brisabois. 2013. “Use of Quantitative Microbial Risk Assessment when Investigating Foodborne Illness Outbreaks: the Example of a Monophasic Salmonella Typhimurium 4,5,12:i:- Outbreak Implicating Beef Burgers.” International Journal of Food Microbiology 166 (3): 471–78.

Helsel, D. R. 2005. Nondetects and Data Analysis: Statistics for Censored Environmental Data. 1st ed. John Wiley & Sons.

Hirano, S. S., M. K. Clayton, and C. D. Upper. 1994. “Estimation of and Temporal Changes in Means and Variances of Populations of Pseudomonas syringae on Snap Bean Leaflets.” Phytopathology 84 (9): 934–40.

Hoelzer, K., R. Pouillot, D. Gallagher, M. B. Silverman, J. Kause, and S. Dennis. 2012. “Estimation of Listeria Monocytogenes Transfer Coefficients and Efficacy of Bacterial Removal Through Cleaning and Sanitation.” International Journal of Food Microbiology 157 (2): 267–77.

Hose, G. C., and P. J. Van den Brink. 2004. “Confirming the Species-Sensitivity Distribution Concept for Endosulfan Using Laboratory, Mesocosm, and Field Data.” Archives of Environmental Contamination and Toxicology 47 (4): 511–20.

Hyndman, R. J., and Y. Fan. 1996. “Sample Quantiles in Statistical Packages.” The American Statistician 50: 361–65.

Jaloustre, S., M. Cornu, E. Morelli, V. Noel, and M. L. Delignette-Muller. 2011. “Bayesian Modeling of Clostridium perfringens Growth in Beef-in-Sauce Products.” Food Microbiology 28 (2): 311–20.

Jongenburger, I., M. W. Reij, E. P. J. Boer, M. H. Zwietering, and L. G. M. Gorris. 2012. “Modelling Homogeneous and Heterogeneous Microbial Contaminations in a Powdered Food Product.” International Journal of Food Microbiology 157 (1): 35–44.

Jordan, D. 2005. “Simulating the Sensitivity of Pooled-Sample Herd Tests for Fecal Salmonella in Cattle.” Preventive Veterinary Medicine 70 (1-2): 59–73.

Kefford, B. J., E. J. Fields, C. Clay, and D. Nugegoda. 2007. “Salinity Tolerance of Riverine Macroinvertebrates from the Southern Murray-Darling Basin.” Marine and Freshwater Research 58: 1019–31.

Klein, J. P., and M. L. Moeschberger. 2003. Survival Analysis: Techniques for Censored and Truncated Data. 2nd ed. Springer-Verlag.

Klugman, S. A., H. H. Panjer, and G. E. Willmot. 2009. Loss Models: From Data to Decisions. 3rd ed. John Wiley & Sons.

Koch, F. H., D. Yemshanov, R. D. Magarey, and W. D. Smith. 2012. “Dispersal of Invasive Forest Insects via Recreational Firewood: A Quantitative Analysis.” Journal of Economic Entomology 105 (2): 438–50.

Kohl, M., and P. Ruckdeschel. 2010. “R Package distrMod: S4 Classes and Methods for Probability Models.” Journal of Statistical Software 35 (10): 1–27.

Larras, Floriane, Bernard Montuelle, and Agnès Bouchez. 2013. “Assessment of Toxicity Thresholds in Aquatic Environments: Does Benthic Growth of Diatoms Affect their Exposure and Sensitivity to Herbicides?” Science of the Total Environment 463-464: 469–77.

Leha, A., T. Beissbarth, and K. Jung. 2011. “Sequential Interim Analyses of Survival Data in DNA Microarray Experiments.” BMC Bioinformatics 12 (127): 1–14.

Luangkesorn, K. L., B. A. Norman, Y. Zhuang, M. Falbo, and J. Sysko. 2012. “Practice Summaries: Designing Disease Prevention and Screening Centers in Abu Dhabi.” Interfaces 42 (4): 406–9.

Luceño, A. 2006. “Fitting the Generalized Pareto Distribution to Data Using Maximum Goodness-of-fit Estimators.” Computational Statistics and Data Analysis 51 (2): 904–17.

Malá, I. 2013. “The Use of Finite Mixtures of Lognormal and Gamma Distributions.” Research Journal of Economics, Business and ICT 8 (2): 55–61.

Mandl, J. N., J. P. Monteiro, N. Vrisekoop, and R. N. Germain. 2013. “T Cell-Positive Selection Uses Self-Ligand Binding Strength to Optimize Repertoire Recognition of Foreign Antigens.” Immunity 38 (2): 263–74.

Marquetoux, N., M. Paul, S. Wongnarkpet, C. Poolkhet, W. Thanapongtham, F. Roger, C. Ducrot, and K. Chalvet-Monfray. 2012. “Estimating Spatial and Temporal Variations of the Reproduction Number for Highly Pathogenic Avian Influenza H5N1 Epidemic in Thailand.” Preventive Veterinary Medicine 106 (2): 143–51.

McNeil, A. J. 1997. “Estimating the Tails of Loss Severity Distributions Using Extreme Value Theory.” ASTIN Bulletin 27 (1): 117–37.

Méheust, D., P. Le Cann, T. Reponen, J. Wakefield, and S. Vesper. 2012. “Possible Application of the Environmental Relative Moldiness Index in France: a Pilot Study in Brittany.” International Journal of Hygiene and Environmental Health 216 (3): 333–40.

Meyer, W. K., S. Zhang, S. Hayakawa, H. Imai, and M. Przeworski. 2013. “The Convergent Evolution of Blue Iris Pigmentation in Primates Took Distinct Molecular Paths.” American Journal of Physical Anthropology 151 (3): 398–407.

Nadarajah, S., and S. A. A. Bakar. 2013. “CompLognormal: An R Package for Composite Lognormal Distributions.” R Journal 5 (2): 98–104.

Orellano, P. W., J. I. Reynoso, A. Grassi, A. Palmieri, O. Uez, and O. Carlino. 2012. “Estimation of the Serial Interval for Pandemic Influenza (pH1N1) in the Most Southern Province of Argentina.” Iranian Journal of Public Health 41 (12): 26–29.

Posthuma, L., G. W. Suter, and T. P. Traas. 2010. Species Sensitivity Distributions in Ecotoxicology. Environmental and Ecological Risk Assessment Series. Taylor & Francis.

Pouillot, R., and M. L. Delignette-Muller. 2010. “Evaluating Variability and Uncertainty Separately in Microbial Quantitative Risk Assessment Using Two R Packages.” International Journal of Food Microbiology 142 (3): 330–40.

Pouillot, R., M. L. Delignette-Muller, and J. B. Denis. 2011. Mc2d: Tools for Two-Dimensional Monte-Carlo Simulations. https://cran.r-project.org/package=mc2d.

Pouillot, R., K. Hoelzer, Y. Chen, and S. Dennis. 2012. “Estimating Probability Distributions of Bacterial Concentrations in Food Based on Data Generated Using the Most Probable Number (MPN) Method for Use in Risk Assessment.” Food Control 29 (2): 350–57.

Prosser, D. J., L. L. Hungerford, R. M. Erwin, M. A. Ottinger, J. Y. Takekawa, and E. C. Ellis. 2013. “Mapping Avian Influenza Transmission Risk at the Interface of Domestic Poultry and Wild Birds.” Frontiers in Public Health 1 (28): 1–11.

R Development Core Team. 2013. R: A Language and Environment for Statistical Computing. Vienna, Austria. https://www.r-project.org/.

Ricci, V. 2005. “Fitting Distributions with R.” https://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf.

Rigaux, Clémence, Stéphane André, Isabelle Albert, and Frédéric Carlin. 2014. “Quantitative Assessment of the Risk of Microbial Spoilage in Foods. Prediction of Non-Stability at 55\(\,^{\circ}\)C Caused by Geobacillus Stearothermophilus in Canned Green Beans.” International Journal of Food Microbiology 171: 119–28.

Sak, H., and C. Haksoz. 2011. “A Copula-Based Simulation Model for Supply Portfolio Risk.” Journal of Operational Risk 6 (3): 15–38.

Samuel-Rosa, A., R. Simão Diniz Dalmolin, and P. Miguel. 2013. “Building Predictive Models of Soil Particle-Size Distribution.” Revista Brasileira de Ciência do Solo 37: 422–30.

Sato, Maria Ines Z., Ana Tereza Galvani, Jose Antonio Padula, Adelaide Cassia Nardocci, Marcelo de Souza Lauretto, Maria Tereza Pepe Razzolini, and Elayse Maria Hachich. 2013. “Assessing the Infection Risk of Giardia and Cryptosporidium in Public Drinking Water Delivered by Surface Water Systems in Sao Paulo State, Brazil.” Science of the Total Environment 442: 389–96.

Scholl, C. F., C. C. Nice, J. A. Fordyce, Z. Gompert, and M. L. Forister. 2012. “Larval Performance in the Context of Ecological Diversification and Speciation in Lycaeides Butterflies.” International Journal of Ecology 2012 (ID 242154): 1–13.

Simó, J., Francesc Casaña, and J. Sabaté. 2013. “Modelling ‘calçots’ (Allium cepa L.) Growth by Gompertz Function.” Statistics and Operations Research Transactions 37 (1): 95–106.

Srinivasan, S., T. P. Sorrell, J. P. Brooks, D. J. Edwards, and R. Diehl McDougle. 2013. “Workforce Assessment Method for an Urban Police Department: Using Analytics to Estimate Patrol Staffing.” Policing: An International Journal of Police Strategies & Management 36 (4): 702–18.

Stagge, J. H., and G. E. Moglen. 2013. “A Nonparametric Stochastic Method for Generating Daily Climate-Adjusted Streamflows.” Water Resources Research 49 (10): 6179–93.

Suuronen, J. P., A. Kallonen, M. Eik, J. Puttonen, Ritva Serimaa, and Heiko Herrmann. 2012. “Analysis of Short Fibres Orientation in Steel Fibre-Reinforced Concrete (SFRC) by X-Ray Tomography.” Journal of Materials Science 48 (3): 1358–67.

Tarnóczi, T., V. Fenyves, and Z. Bács. 2011. “The Business Uncertainty and Variability Management with Real Options Models Combined Two Dimensional Simulation.” International Journal of Management Cases 13 (3): 159–67.

Tello, A., B. Austin, and T. C. Telfer. 2012. “Selective Pressure of Antibiotic Pollution on Bacteria of Importance to Public Health.” Environmental Health Perspectives 120 (8): 1100–1106.

Therneau, T. 2011. Survival: Survival Analysis, Including Penalized Likelihood. https://cran.r-project.org/package=survival.

Tikole, S., V. Jaravine, V. Yu Orekhov, and P. Guentert. 2013. “Effects of NMR spectral resolution on protein structure calculation.” PloS One 8 (7): e68567.

Tse, Y. K. 2009. Nonlife Actuarial Models: Theory, Methods and Evaluation. 1st ed. International Series on Actuarial Science. Cambridge University Press.

Turnbull, B. W. 1974. “Nonparametric Estimation of a Survivorship Function with Doubly Censored Data.” Journal of the American Statistical Association 69 (345): 169–73.

Vaninsky, A. Y. 2013. “Stochastic DEA with a Perfect Object and Its Application to Analysis of Environmental Efficiency.” American Journal of Applied Mathematics and Statistics 1 (4): 57–63.

Venables, W. N., and B. D. Ripley. 2010. Modern Applied Statistics with S. 4th ed. Springer-Verlag.

Viana, D. S., L. Santamará, T. C. Michot, and J. Figuerola. 2013. “Allometric Scaling of Long-Distance Seed Dispersal by Migratory Birds.” The American Naturalist 181 (5): 649–62.

Voigt, Christian C., Linn S. Lehnert, Ana G. Popa-Lisseanu, Mateusz Ciechanowski, Péter Estók, Florian Gloza-Rausch, Tamás Goerfoel, et al. 2014. “The Trans-Boundary Importance of Artificial Bat hibernacula in Managed European Forests.” Biodiversity and Conservation 23: 617–31.

Vose, D. 2010. Quantitative Risk Analysis. A Guide to Monte Carlo Simulation Modelling. 1st ed. John Wiley & Sons.

Wang, Yong. 2007. “On Fast Computation of the Non-Parametric Maximum Likelihood Estimate of a Mixing Distribution.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69 (2): 185–98.

———. 2008. “Dimension-Reduced Nonparametric Maximum Likelihood Computation for Interval-Censored Data.” Computational Statistics & Data Analysis 52 (5): 2388–2402.

Wang, Yong, and Shabnam Fani. 2018. “Nonparametric Maximum Likelihood Computation of a U-Shaped Hazard Function.” Statistics and Computing 28 (1): 187–200.

Wang, Yong, and Stephen M Taylor. 2013. “Efficient Computation of Nonparametric Survival Functions via a Hierarchical Mixture Formulation.” Statistics and Computing 23 (6): 713–25.

Wayland, M. T. 2013. “Morphological Variation in Echinorhynchus truttae Schrank, 1788 and the Echinorhynchus bothniensis Zdzitowiecki & Valtonen, 1987 species complex from freshwater fishes of northern Europe.” Biodiversity Data Journal 1: e975.

Westphal-Fitch, G., and W. T. Fitch. 2013. “Spatial Analysis of ‘Crazy Quilts’, a Class of Potentially Random Aesthetic Artefacts.” PloS One 8 (9): e74055.

Wu, Xing Zheng. 2013a. “Probabilistic Slope Stability Analysis by a Copula-Based Sampling Method.” Computational Geosciences 17 (5): 739–55.

———. 2013b. “Trivariate Analysis of Soil Ranking-Correlated Characteristics and Its Application to Probabilistic Stability Assessments in Geotechnical Engineering Problems.” Soils and Foundations 53 (4): 540–56.

Zhang, Yu, Emad Habib, Robert J. Kuligowski, and Dongsoo Kim. 2013. “Joint Distribution of Multiplicative Errors in Radar and Satellite QPEs and Its Use in Estimating the Conditional Exceedance Probability.” Advances in Water Resources 59: 133–45.


  1. The plotdist function can plot any parametric distribution with parameter values specified in the argument para. It can thus help to find correct initial values for the distribution parameters in non-trivial cases, by iterative calls if necessary (see the reference manual for examples (Delignette-Muller et al. 2014)).
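
For instance, a minimal sketch of this use on the toxocara data; the parameter values below are illustrative guesses, not fitted estimates.

data("toxocara")
# overlay a candidate negative binomial distribution on the empirical one
plotdist(toxocara$number, "nbinom", para = list(size = 0.5, mu = 9), discrete = TRUE)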

  2. That is what the B stands for.

fitdistrplus/inst/CITATION0000644000176200001440000000146114374641666015143 0ustar liggesuserscitHeader("To cite fitdistrplus in publications use:") bibentry(bibtype = "Article", title = "{fitdistrplus}: An {R} Package for Fitting Distributions", author = c("Marie Laure Delignette-Muller", "Christophe Dutang"), journal = "Journal of Statistical Software", year = "2015", volume = "64", number = "4", pages = "1--34", doi = "10.18637/jss.v064.i04", textVersion = paste("Marie Laure Delignette-Muller, Christophe Dutang (2015).", "fitdistrplus: An R Package for Fitting Distributions.", "Journal of Statistical Software, 64(4), 1-34.", "DOI 10.18637/jss.v064.i04.") ) citFooter("Please cite both the package and R when using them for data analysis.", "See also", sQuote("citation()"), "for citing R.")