Title: | Data Supporting the 'VGAM' Package |
---|---|
Description: | Mainly data sets to accompany the VGAM package and the book "Vector Generalized Linear and Additive Models: With an Implementation in R" (Yee, 2015) <DOI:10.1007/978-1-4939-2818-7>. These are used to illustrate vector generalized linear and additive models (VGLMs/VGAMs), and associated models (Reduced-Rank VGLMs, Quadratic RR-VGLMs, Row-Column Interaction Models, and constrained and unconstrained ordination models in ecology). This package now contains some old VGAM family functions which have been replaced by newer ones (often because they are now special cases). |
Authors: | Thomas Yee [aut, cre, cph] , James Gray [dtc] |
Maintainer: | Thomas Yee <[email protected]> |
License: | GPL-2 |
Version: | 1.1-12 |
Built: | 2024-11-17 03:54:16 UTC |
Source: | https://github.com/cran/VGAMdata |
VGAMdata is mainly an assortment of larger data sets which are a useful resource for the VGAM package.
This package contains some of the larger data sets originally distributed with the VGAM package. Ideally both packages should be installed and loaded to be fully functional. The main intent was to keep the size of VGAM down to the bare essentials. Many examples in my monograph refer to data sets in either package. Recently, some older or less-used VGAM family functions have been shifted into VGAMdata, and this is likely to continue in the future.
Thomas W. Yee, [email protected].
Maintainer: Thomas Yee [email protected].
Yee, T. W. (2015). Vector Generalized Linear and Additive Models: With an Implementation in R. New York, USA: Springer.
# Example 1; xs.nz head(xs.nz) summary(xs.nz) # Example 2; ugss head(ugss) summary(ugss)
The airbnb.ac
data frame has 18159 rows and 9 columns.
data(airbnb.ac)
This data frame contains the following columns:
length of stay, in days.
Multiplicative factor for the discount for booking one week:
(published weekly rate) / (7 * published nightly rate),
e.g., 0.2 means a 20 percent savings off the regular price.
Number of reviews on the website.
Average price per night per person, in Euros.
Number of bedrooms.
Logical. Superhost?
Minimum stay period, in days.
Maximum number of guests.
Character, values are "Alghero"
and "Cagliari"
.
The data frame comprises Airbnb bookings in two
cities located in Sardinia, Italy.
The stays were during the whole of 2016.
Stays of 30 days or longer
and any rows with missing variables
were deleted from the original source.
Variable LOS
exhibits heaping at the values 7 and 14 days.
The data was obtained with permission from Luca Frigau, University of Cagliari, from https://www.airbnb.com.
## Not run: mytab <- with(subset(airbnb.ac, City == "Alghero"), table(LOS)) plot(prop.table(mytab), col = "blue", ylab = "Proportion") ## End(Not run)
Maximum likelihood estimation of the 1, 2 and 3-parameter asymmetric Laplace distributions (ALDs). The 2-parameter ALD may, with trepidation and lots of skill, sometimes be used as an approximation of quantile regression.
alaplace1(tau = NULL, llocation = "identitylink", ilocation = NULL, kappa = sqrt(tau/(1 - tau)), Scale.arg = 1, ishrinkage = 0.95, parallel.locat = TRUE ~ 0, digt = 4, idf.mu = 3, zero = NULL, imethod = 1) alaplace2(tau = NULL, llocation = "identitylink", lscale = "loglink", ilocation = NULL, iscale = NULL, kappa = sqrt(tau/(1 - tau)), ishrinkage = 0.95, parallel.locat = TRUE ~ 0, parallel.scale = FALSE ~ 0, digt = 4, idf.mu = 3, imethod = 1, zero = "scale") alaplace3(llocation = "identitylink", lscale = "loglink", lkappa = "loglink", ilocation = NULL, iscale = NULL, ikappa = 1, imethod = 1, zero = c("scale", "kappa"))
tau , kappa
|
Numeric vectors with
|
llocation , lscale , lkappa
|
Character.
Parameter link functions for
location parameter |
ilocation , iscale , ikappa
|
Optional initial values. If given, it must be numeric and values are recycled to the appropriate length. The default is to choose the value internally. |
parallel.locat , parallel.scale
|
See the |
imethod |
Initialization method. Either the value 1, 2, 3 or 4. |
idf.mu |
Degrees of freedom for the cubic smoothing spline fit
applied to get an initial estimate of the location parameter.
See |
ishrinkage |
How much shrinkage is used when initializing |
Scale.arg |
The value of the scale parameter |
digt |
Passed into |
zero |
See |
These VGAM family functions implement one variant of asymmetric Laplace distributions (ALDs) suitable for quantile regression. Kotz et al. (2001) call it the ALD. Its density function is
f(y; xi, sigma, kappa) = (sqrt(2) / sigma) * (kappa / (1 + kappa^2)) * exp(-(sqrt(2) * kappa / sigma) * |y - xi|) for y >= xi, and
f(y; xi, sigma, kappa) = (sqrt(2) / sigma) * (kappa / (1 + kappa^2)) * exp(-(sqrt(2) / (sigma * kappa)) * |y - xi|) for y < xi.
Here, the ranges are for all real y and xi, positive sigma and positive kappa.
The special case kappa = 1 corresponds to the
(symmetric) Laplace distribution of Kotz et al. (2001).
The mean is xi + sigma * (1/kappa - kappa) / sqrt(2)
and the variance is sigma^2 * (kappa^2 + 1/kappa^2) / 2.
The enumeration of the linear/additive predictors used for
alaplace2()
is
the first location parameter followed by the first scale parameter,
then the second location parameter followed by the
second scale parameter, etc.
For alaplace3()
, only a vector response is handled
and the last (third) linear/additive predictor is for
the asymmetry parameter.
It is known that the maximum likelihood estimate of the
location parameter corresponds to the regression
quantile estimate of the classical quantile regression approach
of Koenker and Bassett (1978). An important property of the
ALD is that P(Y <= xi) = tau where tau = kappa^2 / (1 + kappa^2), so that kappa = sqrt(tau / (1 - tau)).
Thus
alaplace2()
might be used as an alternative to rq
in the quantreg package, although scoring is really
an unsuitable algorithm for estimation here.
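A minimal numerical sketch of this property, using palap() (documented below) and assuming the VGAM package is loaded:
mytau <- 0.25; mykap <- sqrt(mytau / (1 - mytau))
palap(0, location = 0, scale = 1, kappa = mykap)  # Should equal mytau, i.e., 0.25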
Both alaplace1()
and alaplace2()
can handle
multiple responses, and the number of linear/additive
predictors is dictated by the length of tau
or
kappa
. The functions alaplace1()
and alaplace2()
can also
handle multiple responses (i.e., a matrix response)
but only with a single-valued tau
or kappa
.
An object of class "vglmff"
(see
vglmff-class
). The object is used by modelling
functions such as vglm
and vgam
.
In the extra
slot of the fitted object are some list
components which are useful, e.g., the sample proportion of
values which are less than the fitted quantile curves.
These functions are experimental and especially subject to
change or withdrawal.
The usual MLE regularity conditions do not hold for
this distribution so that misleading inferences may result,
e.g., in the summary
and vcov
of the object. The
1-parameter ALD can be approximated by extlogF1
which has continuous derivatives and is recommended over
alaplace1
.
Care is needed with tau
values which are too small,
e.g., for count data with llocation = "loglink"
and if
the sample proportion of zeros is greater than tau
.
These VGAM family functions use Fisher scoring.
Convergence may be slow and half-stepping is usual
(although one can use trace = TRUE
to see which
is the best model and then use maxit
to choose
that model)
due to the regularity conditions not holding.
Often the iterations slowly crawl towards the solution so
monitoring the convergence (set trace = TRUE
) is highly
recommended.
Instead, extlogF1
is recommended.
For large data sets it is a very good idea to keep the length of
tau
/kappa
low to avoid large memory requirements.
Then for parallel.locat = FALSE
one can repeatedly fit a model with
alaplace1()
with one tau value at a time;
and for parallel.locat = TRUE
one can refit a model with
alaplace1()
with one tau value at a time but
using offsets and an intercept-only model.
A second method for solving the noncrossing quantile problem is illustrated below in Example 3. This is called the accumulative quantile method (AQM) and details are in Yee (2015). It does not make the strong parallelism assumption.
The functions alaplace2()
and laplace
differ slightly in terms of the parameterizations.
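The corresponding density functions illustrate the difference: under the parameterizations shown on their respective pages, dalap() with kappa = 1 should agree with dlaplace() once the scale argument is divided by sqrt(2). This mapping is an inference from those formulas, not stated in the source; a small sketch:
xx <- seq(-3, 3, by = 0.5)
max(abs(dalap(xx, scale = 2, kappa = 1) - dlaplace(xx, scale = 2 / sqrt(2))))  # Should be about 0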
Thomas W. Yee
Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica, 46, 33–50.
Kotz, S., Kozubowski, T. J. and Podgorski, K. (2001). The Laplace distribution and generalizations: a revisit with applications to communications, economics, engineering, and finance, Boston: Birkhauser.
ralap
,
laplace
,
extlogF1
,
CommonVGAMffArguments
,
lms.bcn
,
amlnormal
,
sc.studentt2
,
simulate.vlm
.
## Not run: # Example 1: quantile regression with smoothing splines set.seed(123); adata <- data.frame(x2 = sort(runif(n <- 500))) mymu <- function(x) exp(-2 + 6*sin(2*x-0.2) / (x+0.5)^2) adata <- transform(adata, y = rpois(n, lambda = mymu(x2))) mytau <- c(0.25, 0.75); mydof <- 4 fit <- vgam(y ~ s(x2, df = mydof), data=adata, trace=TRUE, maxit = 900, alaplace2(tau = mytau, llocat = "loglink", parallel.locat = FALSE)) fitp <- vgam(y ~ s(x2, df = mydof), data = adata, trace=TRUE, maxit=900, alaplace2(tau = mytau, llocat = "loglink", parallel.locat = TRUE)) par(las = 1); mylwd <- 1.5 with(adata, plot(x2, jitter(y, factor = 0.5), col = "orange", main = "Example 1; green: parallel.locat = TRUE", ylab = "y", pch = "o", cex = 0.75)) with(adata, matlines(x2, fitted(fit ), col = "blue", lty = "solid", lwd = mylwd)) with(adata, matlines(x2, fitted(fitp), col = "green", lty = "solid", lwd = mylwd)) finexgrid <- seq(0, 1, len = 1001) for (ii in 1:length(mytau)) lines(finexgrid, qpois(p = mytau[ii], lambda = mymu(finexgrid)), col = "blue", lwd = mylwd) fit@extra # Contains useful information # Example 2: regression quantile at a new tau value from an existing fit # Nb. regression splines are used here since it is easier. fitp2 <- vglm(y ~ sm.bs(x2, df = mydof), data = adata, trace = TRUE, alaplace1(tau = mytau, llocation = "loglink", parallel.locat = TRUE)) newtau <- 0.5 # Want to refit the model with this tau value fitp3 <- vglm(y ~ 1 + offset(predict(fitp2)[, 1]), alaplace1(tau = newtau, llocation = "loglink"), adata) with(adata, plot(x2, jitter(y, factor = 0.5), col = "orange", pch = "o", cex = 0.75, ylab = "y", main = "Example 2; parallel.locat = TRUE")) with(adata, matlines(x2, fitted(fitp2), col = "blue", lty = 1, lwd = mylwd)) with(adata, matlines(x2, fitted(fitp3), col = "black", lty = 1, lwd = mylwd)) # Example 3: noncrossing regression quantiles using a trick: obtain # successive solutions which are added to previous solutions; use a log # link to ensure an increasing quantiles at any value of x. mytau <- seq(0.2, 0.9, by = 0.1) answer <- matrix(0, nrow(adata), length(mytau)) # Stores the quantiles adata <- transform(adata, offsety = y*0) usetau <- mytau for (ii in 1:length(mytau)) { # cat("\n\nii = ", ii, "\n") adata <- transform(adata, usey = y-offsety) iloc <- ifelse(ii == 1, with(adata, median(y)), 1.0) # Well-chosen! mydf <- ifelse(ii == 1, 5, 3) # Maybe less smoothing will help fit3 <- vglm(usey ~ sm.ns(x2, df = mydf), data = adata, trace = TRUE, alaplace2(tau = usetau[ii], lloc = "loglink", iloc = iloc)) answer[, ii] <- (if(ii == 1) 0 else answer[, ii-1]) + fitted(fit3) adata <- transform(adata, offsety = answer[, ii]) } # Plot the results. with(adata, plot(x2, y, col = "blue", main = paste("Noncrossing and nonparallel; tau = ", paste(mytau, collapse = ", ")))) with(adata, matlines(x2, answer, col = "orange", lty = 1)) # Zoom in near the origin. with(adata, plot(x2, y, col = "blue", xlim = c(0, 0.2), ylim = 0:1, main = paste("Noncrossing and nonparallel; tau = ", paste(mytau, collapse = ", ")))) with(adata, matlines(x2, answer, col = "orange", lty = 1)) ## End(Not run)
Density, distribution function, quantile function and random
generation for the 3-parameter asymmetric Laplace distribution
with location parameter location
, scale parameter
scale
, and asymmetry parameter kappa
.
dalap(x, location = 0, scale = 1, tau = 0.5, kappa = sqrt(tau/(1-tau)), log = FALSE) palap(q, location = 0, scale = 1, tau = 0.5, kappa = sqrt(tau/(1-tau)), lower.tail = TRUE, log.p = FALSE) qalap(p, location = 0, scale = 1, tau = 0.5, kappa = sqrt(tau/(1-tau)), lower.tail = TRUE, log.p = FALSE) ralap(n, location = 0, scale = 1, tau = 0.5, kappa = sqrt(tau/(1-tau)))
x , q
|
vector of quantiles. |
p |
vector of probabilities. |
n |
number of observations.
If |
location |
the location parameter |
scale |
the scale parameter |
tau |
the quantile parameter |
kappa |
the asymmetry parameter |
log |
if |
lower.tail , log.p
|
There are many variants of asymmetric Laplace distributions
(ALDs) and this one is known as the ALD by Kotz et
al. (2001). See alaplace3
, the VGAM
family function for estimating the three parameters by maximum
likelihood estimation, for formulae and details. The ALD
density may be approximated by dextlogF
.
dalap
gives the density,
palap
gives the distribution function,
qalap
gives the quantile function, and
ralap
generates random deviates.
T. W. Yee and Kai Huang
Kotz, S., Kozubowski, T. J. and Podgorski, K. (2001). The Laplace distribution and generalizations: a revisit with applications to communications, economics, engineering, and finance, Boston: Birkhauser.
alaplace3
,
dextlogF
,
extlogF1
.
x <- seq(-5, 5, by = 0.01) loc <- 0; sigma <- 1.5; kappa <- 2 ## Not run: plot(x, dalap(x, loc, sigma, kappa = kappa), type = "l", main = "Blue is density, orange is the CDF", ylim = c(0, 1), sub = "Purple are 5, 10, ..., 95 percentiles", las = 1, ylab = "", cex.main = 0.5, col = "blue") abline(h = 0, col = "blue", lty = 2) lines(qalap(seq(0.05, 0.95, by = 0.05), loc, sigma, kappa = kappa), dalap(qalap(seq(0.05, 0.95, by = 0.05), loc, sigma, kappa = kappa), loc, sigma, kappa = kappa), col="purple", lty=3, type = "h") lines(x, palap(x, loc, sigma, kappa = kappa), type = "l", col = "orange") abline(h = 0, lty = 2) ## End(Not run) pp <- seq(0.05, 0.95, by = 0.05) # Test two functions max(abs(palap(qalap(pp, loc, sigma, kappa = kappa), loc, sigma, kappa = kappa) - pp)) # Should be 0
Luftwaffe losses during a subset of the Battle of Britain.
data(bb.de)
The format is a three-dimensional array.
The first dimension is the event
(in order:
shot down or failed to return,
written off,
seriously damaged),
the second dimension is the day,
the third is the aircraft type.
This is a
Battle of Britain data set of Luftwaffe losses during operations
26 August–31 August 1940 continued on to 1–7 September 1940.
The aircraft types are prefixed
Bf
for Messerschmitt (Bayerische Flugzeugwerke),
Do
for Dornier,
He
for Heinkel,
Ju
for Junkers.
Note that p.151 and p.165 of Bowyer (1990) contain tables (covering the first week of September) of almost the same data; however, the former is labelled "shot down" whereas the latter is "shot down or failed to return". The latter is used here. Also, there are some other small discrepancies.
Yet to do: add the data available at other dates, and include the RAF data.
Bowyer, M. J. F. (1990) The Battle of Britain: 50 years On. Patrick Stephens Limited, Northamptonshire, U.K.
data(bb.de) bb.de[,, "Bf109"] ## Not run: plot(bb.de["sdown",, "Bf109"] ~ as.Date(dimnames(bb.de)[[2]]), type = "h", col = "blue", las = 1, lwd = 3, ylab = "Frequency", xlab = "1940", main = "Numbers shot down (Bf 109)") abline(h = c(5, 10, 15, 20), lty = "dashed", col = "grey") points(bb.de["sdown",,"Bf109"] ~ as.Date(dimnames(bb.de)[[2]]), col = "blue") ## End(Not run)
A 12 x 12 table of the Births and Deaths of 348 Notable Americans. The rows and columns are for each month.
data(bd.us)
The format is: chr "bd.us"
Rows denote the month of birth; columns for the month of death. These data appear as Table 1 of Phillips and Feldman (1973), who collected the data from Morris (1965). Not all of the 400 people were used because some had not died by that time and other individuals lacked the information.
Phillips, D. P. and Feldman, K. A. (1973) A dip in deaths before ceremonial occasions: Some new relationships between social integration and mortality, American Sociological Review, 38, 678–696.
Morris, R. B. (Ed.) (1965) Four Hundred Notable Americans. Harper and Row: New York, USA.
rcim
.
print(bd.us) sum(bd.us) rowSums(bd.us) colSums(bd.us)
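Since the See Also points to rcim, a hedged sketch of fitting a row-column main-effects Poisson model to this table, assuming the VGAM package is loaded and that bd.us is coercible to a numeric matrix:
## Not run: rcfit <- rcim(as.matrix(bd.us), family = poissonff, trace = TRUE) coef(rcfit) # Row and column main effects ## End(Not run)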
A prospective data set containing the DMFT index of children in Belo Horizonte at the beginning and end of the BELCAP study.
data(belcap)
A data frame with 797 observations on the following 5 variables.
dmftb
a numeric vector, DMFT-Index at the beginning of the study.
dmfte
a numeric vector, DMFT-Index at the end of the study.
gender
a factor with levels 0
= female, 1
= male.
ethnic
a factor with levels 1
= dark, 2
= white,
3
= black.
school
the kind of prevention measure.
A factor with levels 1
=
oral health education, 2
= all four methods, 3
= control group,
4
= enrichment of the school diet with ricebran, 5
= mouthrinse
with 0.2% NaF-solution, 6
= oral hygiene.
This data set is from the Belo Horizonte Caries Prevention (BELCAP) study. The data were collected from children in Belo Horizonte (Brazil) aged seven years at the start of the study. In order to determine which method(s) were best for preventing tooth decay, six treatments were randomized to six separate schools. The measure used is the decayed, missing and filled teeth (DMFT) index, a well-known and important measure of a person's dental health. Only the eight deciduous molars are considered, so the lowest value is 0 and the highest is 8.
https://onlinelibrary.wiley.com/
contains the data file
(a supplement of the JRSSA article).
Downloaded in January 2014 and formatted into R by J. T. Gray,
[email protected]
.
Bohning, D., Dietz, E., Schlattmann, P., Mendonca, L. and Kirchner, U. (1999). The Zero-Inflated Poisson Model and the Decayed, Missing and Filled Teeth Index in Dental Epidemiology, Journal of the Royal Statistical Society, Series A, 162(2), 195–209.
data(belcap) ## maybe str(belcap) ; plot(belcap) ...
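Given the zero-inflation theme of the source article, a hedged sketch of one possible analysis, assuming the VGAM package is loaded; zipoisson() is VGAM's zero-inflated Poisson family and its use here is illustrative only:
## Not run: zfit <- vglm(dmfte ~ gender + school, zipoisson, data = belcap, trace = TRUE) coef(zfit, matrix = TRUE) ## End(Not run)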
Density and random generation for the Bell distribution.
dbell(x, shape, log = FALSE) rbell(n, shape)
x , n
|
Same as |
shape |
the (shape) parameter, which is positive. |
log |
Logical.
If |
See bellff
, the VGAM family
function for estimating the parameter by maximum
likelihood estimation.
dbell
gives the density,
rbell
generates random deviates.
If shape
is large then rbell
will become
computationally expensive.
bell
.
## Not run: plot(0:15, dbell(0:15, shape = 1.5), type = "h", col = "blue")
Estimating the shape parameter of the Bell distribution by maximum likelihood estimation.
bellff(lshape = "loglink", zero = NULL, gshape = expm1(1.6 * ppoints(7)))
lshape , zero , gshape
|
More information is at |
The Bell distribution has a probability density function that can be written
f(y; s) = s^y * exp(1 - e^s) * B_y / y!
for y = 0, 1, 2, ... and shape parameter s > 0,
where B_y denotes the yth Bell number.
The mean of Y is s * exp(s)
(returned as the fitted values).
Fisher-scoring is used.
This VGAM family function handles multiple responses.
The function bell
returns the first 218 Bell
numbers as finite numbers, and
returns Inf
when its argument has a higher value.
Hence this VGAM family function can only handle low-value
counts of less than 219.
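A quick sanity check of the mean formula given above (a sketch; rbell() is documented earlier, and the shape parameter is denoted s here):
set.seed(1); s <- 1.2
c(mean(rbell(10000, s)), s * exp(s))  # Both values should be roughly equal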
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
,
and vgam
.
T. W. Yee
Castellares, F. and Ferrari, S. L. P. and Lemonte, A. J. (2018). On the Bell distribution and its associated regression model for count data. Applied Mathematical Modelling, 56, 172–185.
bdata <- data.frame(y = rbell(1000, loglink(0.5, inverse = TRUE))) bfit <- vglm(y ~ 1, bellff, bdata, trace = TRUE, crit = "coef") coef(bfit, matrix = TRUE) Coef(bfit)
Estimate the three parameters of McKay's bivariate gamma distribution by maximum likelihood estimation.
bigamma.mckay(lscale = "loglink", lshape1 = "loglink", lshape2 = "loglink", iscale = NULL, ishape1 = NULL, ishape2 = NULL, imethod = 1, zero = "shape")
lscale , lshape1 , lshape2
|
Link functions applied to the (positive)
parameters |
iscale , ishape1 , ishape2
|
Optional initial values for |
imethod , zero
|
One of the earliest forms of the bivariate gamma distribution has a joint probability density function given by
f(y1, y2; a, p, q) = [a^(-p-q) / (Gamma(p) * Gamma(q))] * y1^(p-1) * (y2 - y1)^(q-1) * exp(-y2 / a)
for 0 < y1 < y2, a > 0, p > 0 and q > 0
(McKay, 1934),
where a is the scale parameter and p and q are the shape1 and shape2 parameters.
Here, Gamma() is the gamma
function, as in
gamma.
By default, the linear/additive predictors are
eta1 = log(a),
eta2 = log(p),
eta3 = log(q).
The marginal distributions are gamma,
with shape parameters p and p + q
respectively, but they have a
common scale parameter a.
Pearson's product-moment correlation coefficient
of Y1 and Y2
is sqrt(p / (p + q)).
This distribution is also
known as the bivariate Pearson type III distribution.
Also, Y2 - y1,
conditional on Y1 = y1,
has a gamma distribution with shape parameter q.
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
and vgam
.
The response must be a two-column matrix where
the first column is y1 and the
second is y2.
It is necessary that each element of the
vectors y1 and y2 - y1
be positive, so that 0 < y1 < y2.
Currently, the fitted value is a matrix with
two columns;
the first column has values a*p for the
marginal mean of Y1,
while the second column
has values a*(p + q) for the marginal mean of Y2
(all evaluated at the final iteration).
T. W. Yee
McKay, A. T. (1934). Sampling from batches. Journal of the Royal Statistical Society—Supplement, 1, 207–216.
Kotz, S. and Balakrishnan, N. and Johnson, N. L. (2000). Continuous Multivariate Distributions Volume 1: Models and Applications, 2nd edition, New York: Wiley.
Balakrishnan, N. and Lai, C.-D. (2009). Continuous Bivariate Distributions, 2nd edition. New York: Springer.
shape1 <- exp(1); shape2 <- exp(2); scalepar <- exp(3) nn <- 1000 mdata <- data.frame(y1 = rgamma(nn, shape1, scale = scalepar), z2 = rgamma(nn, shape2, scale = scalepar)) mdata <- transform(mdata, y2 = y1 + z2) # z2 \equiv Y2-y1|Y1=y1 fit <- vglm(cbind(y1, y2) ~ 1, bigamma.mckay, mdata, trace = TRUE) coef(fit, matrix = TRUE) Coef(fit) vcov(fit) colMeans(depvar(fit)) # Check moments head(fitted(fit), 1)
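A numerical check of the correlation formula sqrt(shape1 / (shape1 + shape2)) given above, reusing mdata from the example (a sketch):
with(mdata, cor(y1, y2)); sqrt(shape1 / (shape1 + shape2))  # Both should be close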
The covid19.nz
data frame has 69 rows and 3 columns.
The number of new cases is tracked daily.
data(covid19.nz)
This data frame contains the following columns:
output from as.Date
, is
the day of year.
a numeric vector, counts, the number of new cases.
a numeric vector, 0 for when the first case occurred; incrementally daily after that.
These were collected from https://www.nzherald.co.nz/ during the first month or so after the first case.
The NZ Herald states their source was Johns Hopkins University.
## Not run: plot(newcases ~ doy, covid19.nz, col = "blue", type = "h", las = 1, xlab = "Date") ## End(Not run)
The number of fatal road crashes on Australian roads during 2010–2012. They are cross-classified by time of day (in 6 hour blocks) and day of the week.
data(crashf.au)
A data frame with 4 observations on the following 7 variables.
Day of the week.
Each cell is the aggregate number of crashes reported in Australia
during each six hour time block throughout the years 2010–2012.
The rownames
are the time period the crashes took place in.
Morning is from 3:00am to 8:59am,
midday is from 9:00am to 2:59pm,
evening is from 3:00pm to 8:59pm and
night is from 9:00pm to 2:59am.
http://www.bitre.gov.au/publications/ongoing/files/RDA_Summary_2012_June.pdf
Road Deaths Australia; 2012 Statistical Summary. Department of Infrastructure and Transport, Australian Government; ISSN: 1323–3688
Downloaded by J. T. Gray, April 2014.
crashf.au
These data were collected from the New Zealand Ministry of Justice and comprise selected years from 2001 to 2022.
data("crim.nz")
data("crim.nz")
A data frame with 257600 observations on the following 7 variables.
prison
a logical vector;
FALSE
if and only if the sentence was either
"Imprisonment"
,
"Imprisonment sentences"
or
"Life Imprisonment"
.
So the other sentences were not imprisonment.
agegp
an ordered factor with
levels 17-19
< 20-24
<
25-29
< 30-39
<
40+
.
The age of the offender.
offence
a factor with
levels
abduction
,
injury
,
endanger
,
fraud
,
homicide
,
drugs
,
miscoff
,
antigovt
,
weapons
,
property
,
order
,
robbery
,
sexoff
,
theft
,
burglary
.
The main offence.
In more detail,
the levels correspond to
"Abduction, harassment and other offences against the person"
,
"Acts intended to cause injury"
,
"Dangerous or negligent acts endangering persons"
,
"Fraud, deception and related offences"
,
"Homicide and related offences"
,
"Illicit drug offences"
,
"Miscellaneous offences"
,
"Offences against justice procedures, gov. security and gov. operations"
,
"Prohibited and regulated weapons and explosives offences"
,
"Property damage and environmental pollution"
,
"Public order offences"
,
"Robbery, extortion and related offences"
,
"Sexual assault and related offences"
,
"Theft and related offences"
,
"Unlawful entry with intent/burglary, break and enter"
,
respectively.
gender
a factor with levels
F
and M
.
ethnicity
a factor with levels
Asian
,
European
,
Maori
,
Other
,
Polynesian
.
year
a numeric vector. The calendar year when the crime occurred.
sentence
a factor with levels
ComDet
,
ComSent
,
ComWork
,
HomeDet
,
Prison
,
PrisonSent
,
IntSupv
,
Life
,
Money
,
Other
,
PrevDet
,
Supv
.
In more detail,
the levels correspond to
"Community Detention"
,
"Community sentences"
,
"Community Work"
,
"Home Detention"
,
"Imprisonment"
,
"Imprisonment sentences"
,
"Intensive Supervision"
,
"Life Imprisonment"
,
"Monetary "
,
"Other"
,
"Preventive Detention"
,
"Supervision"
respectively.
The data were collected in late 2023 and are described in detail in Garcia et al. (2023). Almost all the information here comes from that document. The original data comprised each year from 2001 to 2022 inclusive; however, only the years 2001, 2010, 2019 and 2022 are included here for brevity.
Variables with values
"Unknown/Organisation"
were treated as missing and deleted.
The rownames
were
stripped off to make the data frame
much smaller.
A matching New Zealand Police data set on proceedings against offenders was also collected but not included in VGAMdata because of its size.
Offences are categorised using the Australian and New Zealand Standard Offence Classification (ANZSOC). There are sixteen top-level ‘divisions’ which are further subdivided into ‘subdivisions’ and ‘groups’. One of the sixteen main divisions (Traffic Offences) is excluded. Only the most serious offence by the offender is recorded here.
Each row contains the count of sentences received by offenders with particular demographic information in the given period, along with the ANZSOC offence code. However, since ANZSOC is a hierarchical classification, a unique offender is represented in three separate rows: one for the ANZSOC Group of the offence, one for the Subdivision of the offence, and one for the Division of the offence.
https://datainfoplus.stats.govt.nz/
Garcia, F., Omidyar, P., Rodrigues, L. and Xie, J. (2023). Fairness analysis and machine learning model implementation using New Zealand prosecution and conviction data. Postgraduate Report, Computer Science Department, University of Auckland.
Only ANZSOC Divisions 01–16 were collected. More details can be found at https://www.police.govt.nz/about-us/publications-statistics/data-and-statistics/policedatanz/proceedings-offender-demographics and https://www.abs.gov.au/statistics/classifications/australian-and-new-zealand-standard-offence-classification-anzsoc/latest-release
data(crim.nz) summary(crim.nz)
Crime totals and rates, cross-classified by US state, during 2009.
data(crime.us)
A data frame with 50 observations on the following 22 variables.
State
a character vector. White spaces have been replaced by underscores.
Population
a numeric vector
ViolentCrimeTotal
a numeric vector
Murder
a numeric vector
Rape
a numeric vector
Robbery
a numeric vector
Assault
a numeric vector
PropertyCrimeTotal
a numeric vector
Burglary
a numeric vector
LarcenyTheft
a numeric vector
MotorVehicleTheft
a numeric vector
ViolentCrimeRate
a numeric vector
MurderRate
a numeric vector
RapeRate
a numeric vector
RobberyRate
a numeric vector
AssaultRate
a numeric vector
PropertyCrimeRate
a numeric vector
BurglaryRate
a numeric vector
LarcenyTheftRate
a numeric vector
MotorVehicleTheftRate
a numeric vector
stateNumber
a numeric vector, running from 1 to 50.
abbrev
State name as a character vector
Each row is a state of the United States of America. The first half of the columns tend to be totals, and the second half are crime rates per 100,000 population.
The data frame was downloaded as a .csv
file and edited.
The full column names are:
State, Population, Violent crime total, Murder and nonnegligent
Manslaughter, Forcible rape, Robbery, Aggravated assault, Property
crime total, Burglary, Larceny-theft, Motor vehicle theft, Violent
Crime rate, Murder and nonnegligent manslaughter rate, Forcible
rape rate, Robbery rate, Aggravated assault rate, Property crime
rate, Burglary rate, Larceny-theft rate, Motor vehicle theft rate,
state Number, abbreviation.
Technical details governing the data set are given in the URL.
http://www.ucrdatatool.gov
,
http://www.ucrdatatool.gov/Search/Crime/State/RunCrimeOneYearofData.cfm
## Not run: # Louisiana is the one outlier plot(MurderRate ~ stateNumber, crime.us, axes = FALSE, type = "h", col = 1:6, main = "USA murder rates in 2009 (per 100,000 population)") axis(1, with(crime.us, abbrev), at = with(crime.us, stateNumber), col = 1:6, col.tick = 1:6, cex.lab = 0.5) axis(2) ## End(Not run) tail(crime.us[ sort.list(with(crime.us, MurderRate)), ])
Computes DeLury's method or Leslie's method for estimating a biological population size.
DeLury(catch, effort, type = c("DeLury","Leslie"), ricker = FALSE)
catch , effort
|
Catch and effort. These should be numeric vectors of equal length. |
type |
Character specifying which of the DeLury or Leslie models is to be fitted. The default is the first value. |
ricker |
Logical. If |
This simple function implements the methods of DeLury (1947).
These are called the DeLury and Leslie models.
Note that there are many assumptions.
These include: (i) Catch and effort records are available for a series
of consecutive time intervals. The catch for a given time
interval, specified by t = 1, ..., n, is c(t), and the
corresponding effort by e(t).
The catch per unit effort (CPUE) for the time interval t
is C(t) = c(t) / e(t).
Let P(t)
represent the proportion of the population
captured during the time interval t.
Then P(t) = k * e(t), so that k is the
proportion of the population captured during interval t by one unit
of effort. Then k is called the catchability,
and the intensity of effort is e(t).
Let E(t) and K(t)
be the total effort and total catch
up to interval t, and N(t)
be the number of individuals
in the population at time t.
It is a good idea to plot log(C(t))
against E(t) for
type = "DeLury"
and C(t) versus K(t)
for
type = "Leslie".
The other assumptions are as follows. (ii) The population is closed—the population must be closed to sources of animals such as recruitment and immigration and losses of animals due to natural mortality and emigration. (iii) Catchability is constant over the period of removals. (iv) The units of effort are independent, i.e., the individual units of the method of capture (i.e., nets, traps, etc) do not compete with each other. (v) All fish are equally vulnerable to the method of capture—source of error may include gear saturation and trap-happy or trap-shy individuals. (vi) Enough fish must be removed to substantially reduce the CPUE. (vii) The catches may remove less than 2% of the population. Also, the usual assumptions of simple regression such as (viii) random sampling, (ix) the independent variable(s) are measured without error—both catches and effort should be known, not estimated, (x) a line describes the data, (xi) the errors are independent and normally distributed.
A list with the following components.
catch , effort
|
Catch and effort. Same as the original vectors.
These correspond to |
type , ricker
|
Same as input. |
N0 |
an estimate of the population size at time 0. Only valid if the assumptions are satisfied. |
CPUE |
Catch Per Unit Effort |
K , E
|
|
lmfit |
the |
The data in the example below comes from DeLury (1947), and some plots of his are reproduced. Note that he used log to base 10 whereas natural logs are used here. His plots had some observations obscured by the y-axis!
The DeLury method is not applicable to the data frame
wffc.nc
since the 2008 World Fly Fishing Competition was
strictly catch-and-release.
T. W. Yee.
DeLury, D. B. (1947). On the estimation of biological populations. Biometrics, 3, 145–167.
Ricker, W. E. (1975). Computation and interpretation of biological statistics of fish populations. Bull. Fish. Res. Bd. Can., 191, 382–
Yee, T. W. (2010) VGLMs and VGAMs: an overview for applications in fisheries research. Fisheries Research, 101, 116–126.
pounds <- c( 147, 2796, 6888, 7723, 5330, 8839, 6324, 3569, 8120, 8084, 8252, 8411, 6757, 1152, 1500, 11945, 6995, 5851, 3221, 6345, 3035, 6271, 5567, 3017, 4559, 4721, 3613, 473, 928, 2784, 2375, 2640, 3569) traps <- c( 200, 3780, 7174, 8850, 5793, 9504, 6655, 3685, 8202, 8585, 9105, 9069, 7920, 1215, 1471, 11597, 8470, 7770, 3430, 7970, 4740, 8144, 7965, 5198, 7115, 8585, 6935, 1060, 2070, 5725, 5235, 5480, 8300) table1 <- DeLury(pounds/1000, traps/1000) ## Not run: with(table1, plot(1+log(CPUE) ~ E, las = 1, pch = 19, main = "DeLury method", xlab = "E(t)", ylab = "1 + log(C(t))", col = "blue")) ## End(Not run) omitIndices <- -(1:16) table1b <- DeLury(pounds[omitIndices]/1000, traps[omitIndices]/1000) ## Not run: with(table1b, plot(1+log(CPUE) ~ E, las = 1, pch = 19, main = "DeLury method", xlab = "E(t)", ylab = "1 + log(C(t))", col = "blue")) mylmfit <- with(table1b, lmfit) lines(mylmfit$x[, 2], 1 + predict.lm(mylmfit), col = "red", lty = "dashed") ## End(Not run) omitIndices <- -(1:16) table2 <- DeLury(pounds[omitIndices]/1000, traps[omitIndices]/1000, type = "L") ## Not run: with(table2, plot(CPUE ~ K, las = 1, pch = 19, main = "Leslie method; Fig. III", xlab = "K(t)", ylab = "C(t)", col = "blue")) mylmfit <- with(table2, lmfit) abline(a = coef(mylmfit)[1], b = coef(mylmfit)[2], col = "orange", lty = "dashed") ## End(Not run)
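The estimated initial population size is available as the N0 list component described above; continuing the example (a sketch, and only meaningful if the method's assumptions hold):
c(DeLury = table1$N0, Leslie = table2$N0)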
Part of the data collected at two time points (2006 and 2014) by the Bank of Italy, as part of the European Central Bank's Eurosystem collection of statistics, within the periodic sample surveys on households, businesses and selected intermediaries.
Data frame with the following 20 variables:
ID
a numeric vector, a unique identification number of the household.
area
a factor with 5 levels,
the Italian geographic area or region
in which the household lives:
NW
= North-West,
NE
= North-East,
C
= Center,
S
= South,
I
= Islands.
For users wanting a North–South contrast,
this variable might be coded as
NC
= North and Center (NW, NE and C),
SI
= South and Islands (S and I).
sex
a factor with 2 levels,
the gender of the head householder:
M
= Male,
F
= Female.
age
a numeric vector, age in years of the head householder.
marital
a factor with 4 levels, marital status of the head
householder:
married
= Married,
single
= Single,
separated
= Separated or divorced,
widowed
= Widowed.
education
an ordered factor with 8 levels,
the education level of the head householder:
none
= No education,
primaryschool
= Primary school,
middleschool
= Middle school,
profschool
= Professional school,
highschool
= High school,
bachelors
= Bachelor's degree,
masters
= Master's degree,
higherdegree
= Higher degree.
job
a factor with 7 levels,
the type of job done by the head householder:
worker
= Worker,
employee
= Employee,
manager
= Manager,
business
= Business person,
other
= Other kind of independent job,
retired
= Retired,
unemployed
= Unemployed.
occupants
a numeric vector, the number of people living in the same house.
children
a numeric vector, the number of children of the head householder living with him/her.
other.children
a numeric vector, the number of children of the head householder not living with the household.
house.owned
a numeric vector, the ownership of the house in
which the householder lives;
0
= The house is not owned,
1
= The house is owned.
houses
a numeric vector, the number of houses owned by the family, including the one in which the family lives.
earners
a numeric vector, the number of people in the house who have salary or some kind of earnings.
accounts
a numeric vector, the number of bank accounts collectively owned by the household.
ccards
a numeric vector, the number of credit cards collectively owned by the household.
tot.income
,
dep.income
,
pens.income
,
self.income
,
cap.income
numeric vectors, the amount of income (in Euros) collectively earned by the household through different activities. The variables can be negative if the household has installments. In order, they are the total amount of income, the amount of income earned through dependent working, the amount of income earned by the household through pensions, the amount of income earned by the household through self-dependent working, the amount of income earned by the household through capital investments.
The European Central Bank's (ECB) Eurosystem requests and helps organize each country within the European Union to routinely collect statistical information via their central banks. These data frames are a subset from data collected by the Bank of Italy. Each year can be considered a cross-sectional study, although there are some people common to each year. Hence the data collectively can be considered partly a longitudinal study too.
Data were downloaded from https://www.bancaditalia.it
in May 2016 by Lucia Pilleri.
Supplements to the Statistical Bulletin,
Sample Surveys, Household Income and Wealth in 2006,
New series, Volume XVIII, Number 7–28, January 2008.
Banca D'Italia, Centro Stampa, Roma, Pubbl. Mensile,
https://www.bancaditalia.it
.
data(ecb06.it); data(ecb14.it) summary(ecb06.it) summary(ecb14.it) ## Not run: with(ecb14.it, table(house.owned)) with(ecb14.it, barplot(table(education), col = "lightblue")) ## End(Not run)
Exam results of 35 students on 18 questions.
data(exam1)
A data frame with 35 observations on the following 18 variables.
binary response
binary response
binary response
For each question, a 1 means correct, a 0 means incorrect.
A simple Rasch model may be fitted to this data frame using
rcim
and binomialff
.
Taken from William Revelle's Short Guide to R,
http://www.unt.edu/rss/rasch_models.htm
,
http://www.personality-project.org/r/.
Downloaded in October 2013.
summary(exam1) # The names of the students are the row names # Fit a simple Rasch model. # First, remove all questions and people who were totally correct or wrong exam1.1 <- exam1 [, colMeans(exam1 ) > 0] exam1.1 <- exam1.1[, colMeans(exam1.1) < 1] exam1.1 <- exam1.1[rowMeans(exam1.1) > 0, ] exam1.1 <- exam1.1[rowMeans(exam1.1) < 1, ] Y.matrix <- rdata <- exam1.1 ## Not run: # The following needs: library(VGAM) rfit <- rcim(Y.matrix, family = binomialff(multiple.responses = TRUE), trace = TRUE) coef(rfit) # Row and column effects constraints(rfit, matrix = TRUE) # Constraint matrices side-by-side dim(model.matrix(rfit, type = "vlm")) # 'Big' VLM matrix ## End(Not run) ## Not run: # This plot shows the (main) row and column effects par(mfrow = c(1, 2), las = 1, mar = c(4.5, 4.4, 2, 0.9) + 0.1) saved <- plot(rfit, rcol = "blue", ccol = "orange", cylab = "Item effects", rylab = "Person effects", rxlab = "", cxlab = "") names(saved@post) # Some useful output put here cbind(saved@post$row.effects) cbind(saved@post$raw.row.effects) round(cbind(-saved@post$col.effects), dig = 3) round(cbind(-saved@post$raw.col.effects), dig = 3) round(matrix(-saved@post$raw.col.effects, ncol = 1, # Rename for humans dimnames = list(colnames(Y.matrix), NULL)), dig = 3) ## End(Not run)
The flamingo
data frame has 4871 rows and 12 columns.
data(flamingo)
This data frame contains the following columns:
length of stay, in days.
year of the stay.
date of arrival.
date of departure.
seasonality index of the period of the stay, a low value means low season, a high value means very high (peak) season.
if the guests booked
the room through the hotel's
internet website (internet
), the hotel's
telephone (direct
) or
a tour operator (agency
).
the 16 hotel's room types.
categorization in 3 groups of
roomtype
.
number of guests.
Does not include any children;
see the two kids
variables below.
arrangement type:
Bed and Breakfast (BB
),
Half Board (HB
) and Full Board (FB
).
number of kids between 0 and 2 years-old.
number of kids between 3 and 11 years-old.
The Flamingo Hotel is located on the beach in southern Sardinia,
about 40 kilometers from Cagliari.
These data concern stays from early summer 2019 to
late summer 2020,
with no stays during the winter period in between.
Stays longer than 30 days were deleted from the original source.
Variable LOS
exhibits heaping at the values 7 and 14 days.
The data was obtained with permission from Luca Frigau, University of Cagliari.
## Not run: with(flamingo, spikeplot(LOS, col = "blue", ylab = "Proportion"))
Estimation of the two-parameter generalized Poisson distribution.
genpoisson(llambda = "rhobitlink", ltheta = "loglink", ilambda = NULL, itheta = NULL, imethod = 1, ishrinkage = 0.95, zero = "lambda")
llambda , ltheta
|
Parameter link functions for |
ilambda , itheta
|
Optional initial values for |
imethod |
An integer with value |
ishrinkage , zero
|
See |
This family function is not recommended for use;
instead try
genpoisson1
or
genpoisson2
.
For underdispersion with respect to the Poisson
try the GTE (generally-truncated expansion) method
described by Yee and Ma (2024).
The generalized Poisson distribution has density
f(y; theta, lambda) = theta * (theta + lambda * y)^(y-1) * exp(-theta - lambda * y) / y!
for y = 0, 1, 2, ..., theta > 0 and
max(-1, -theta/m) <= lambda <= 1.
Now m (>= 4)
is the greatest positive
integer satisfying theta + m * lambda > 0
when lambda < 0
[and then f(y) = 0
for y > m].
Note the complicated support for this distribution means,
for some data sets,
the default link for
llambda
will not always work, and
some tinkering may be required to get it running.
As Consul and Famoye (2006) state on p.165, the lower limits
on lambda and m (>= 4)
are imposed
to ensure that there are at least 5 classes with nonzero
probability when lambda
is negative.
An ordinary Poisson distribution corresponds
to lambda = 0.
The mean (returned as the fitted values) is
E(Y) = theta / (1 - lambda)
and the variance is theta / (1 - lambda)^3.
For more information see Consul and Famoye (2006) for a summary and Consul (1989) for full details.
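A small numerical sketch of the mean formula above, using dgenpois0() (listed in the See Also; the argument names theta and lambda are assumed here):
theta <- 2; lambda <- 0.3; yy <- 0:300
c(sum(yy * dgenpois0(yy, theta = theta, lambda = lambda)), theta / (1 - lambda))  # Should agree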
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions such
as vglm
,
and vgam
.
Monitor convergence!
This family function is fragile.
Don't get confused because theta
(and not lambda
) here really
matches more closely with lambda
of
dpois
.
This family function handles multiple responses.
This distribution is potentially useful for dispersion modelling.
Convergence problems may occur when lambda
is very close
to 0 or 1.
If a failure occurs then you might want to try something like
llambda = extlogitlink(min = -0.9, max = 1)
to handle the LHS complicated constraint,
and if that doesn't work, try
llambda = extlogitlink(min = -0.8, max = 1)
, etc.
T. W. Yee.
Easton Huch derived the EIM and it has been implemented
in the weights
slot.
Consul, P. C. (1989). Generalized Poisson Distributions: Properties and Applications. New York, USA: Marcel Dekker.
Consul, P. C. and Famoye, F. (2006). Lagrangian Probability Distributions, Boston, USA: Birkhauser.
Jorgensen, B. (1997). The Theory of Dispersion Models. London: Chapman & Hall
Yee, T. W. and Ma, C. (2024). Generally altered, inflated, truncated and deflated regression. Statistical Science, 39 (in press).
genpoisson1
,
genpoisson2
,
poissonff
,
dpois
.
dgenpois0
,
rhobitlink
,
extlogitlink
.
## Not run: gdata <- data.frame(x2 = runif(nn <- 500)) # NBD data: gdata <- transform(gdata, y1 = rnbinom(nn, exp(1), mu = exp(2 - x2))) fit <- vglm(y1 ~ x2, genpoisson, data = gdata, trace = TRUE) coef(fit, matrix = TRUE) summary(fit) ## End(Not run)
A two-way table of counts; there are 7 ethnic groups by 12 degrees.
data(hued)
The format is: chr "hued"
The rownames and colnames have been edited. The full names are: Asian/Pacific Islander, Black/Non-Hispanic, Hispanic, International Students, Native American, White/Non-Hispanic, Unknown/Other. The academic year was 2009–2010. GSAS stands for Graduate School of Arts and Sciences. The Other group includes students reported as Two or More Races. See the URL below for more technical details supporting the data.
https://oir.harvard.edu/fact-book
print(hued)
A two-way table of counts; there are 12 degrees and 8 areas of the world.
data(huie)
The format is: chr "huie"
The rownames and colnames have been edited. The full colnames are: Africa, Asia, Europe, Caribbean and Central and South America, Middle East, North America, Oceania, Stateless.
The data was for the autumn (Fall) of 2010. GSAS stands for Graduate School of Arts and Sciences. See the URL below for more technical details supporting the data.
https://oir.harvard.edu/fact-book
print(huie) ## maybe str(huie) ; plot(huie) ...
A two-way table of counts; there are 14 schools and 5 race/ethnicities.
data(huse)
The format is: chr "huse"
Ladder faculty members of Harvard University are cross-classified by their school and their race/ethnicity. This was for the period 2010–1. Ladder Faculty are defined as Assistant Professors or Convertible Instructors, Associate Professors, and Professors that have been appointed in certain Schools.
Abbreviations: FAS = Faculty of Arts and Sciences = Humanities + Social Sciences + Natural Sciences + SEAS, Natural Sciences = Life Sciences + Physical Sciences, SEAS = School of Engineering and Applied Sciences, HBS = Harvard Business School, HMS = Harvard Medical School, HSPH = Harvard School of Public Health, HLS = Harvard Law School, HKS = Harvard Kennedy School, HGSE = Harvard Graduate School of Education, GSD = Graduate School of Design , HDS = Harvard Divinity School, HSDM = Harvard School of Dental Medicine.
See the URL below for many technical details supporting the data. The table was constructed from pp.31–2 from the source.
https://oir.harvard.edu/fact-book
Harvard University Office of the Senior Vice Provost Faculty Development and Diversity: 2010 Annual Report.
print(huse) ## maybe str(huse) ; plot(huse) ...
Maximum likelihood estimation of the 2-parameter classical Laplace distribution.
laplace(llocation = "identitylink", lscale = "loglink", ilocation = NULL, iscale = NULL, imethod = 1, zero = "scale")
llocation , lscale
|
Character.
Parameter link functions for location parameter |
ilocation , iscale
|
Optional initial values. If given, it must be numeric and values are recycled to the appropriate length. The default is to choose the value internally. |
imethod |
Initialization method. Either the value 1 or 2. |
zero |
See |
The Laplace distribution is often known as the double-exponential distribution and, for modelling, has heavier tails than the normal distribution. The Laplace density function is

f(y; a, b) = (1 / (2b)) exp(-|y - a| / b),

where -Inf < y < Inf, -Inf < a < Inf and b > 0. Its mean is a and its variance is 2 b^2. This parameterization is called the classical Laplace distribution by Kotz et al. (2001), and the density is symmetric about a.
For y ~ 1 (where y is the response), the maximum likelihood estimate (MLE) of the location parameter a is the sample median, and the MLE of the scale parameter b is mean(abs(y - location)) (replace location by its MLE if unknown).
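For an intercept-only fit these closed-form MLEs are easy to check by hand; a minimal sketch (rlaplace() is documented below):

set.seed(1)
y <- rlaplace(500, location = 2, scale = exp(1))
loc.mle <- median(y)              # MLE of the location parameter a
b.mle <- mean(abs(y - loc.mle))   # MLE of the scale parameter b
c(loc.mle, b.mle)                 # Should be close to 2 and exp(1)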
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
and vgam
.
This family function has not been fully tested.
The MLE regularity conditions do not hold for this
distribution, therefore misleading inferences may result, e.g.,
in the summary
and vcov
of the object. Hence this
family function might be withdrawn from VGAM in the future.
This family function uses Fisher scoring. Convergence may be slow for non-intercept-only models; half-stepping is frequently required.
T. W. Yee
Kotz, S., Kozubowski, T. J. and Podgorski, K. (2001). The Laplace distribution and generalizations: a revisit with applications to communications, economics, engineering, and finance, Boston: Birkhauser.
rlaplace
,
alaplace2
(which differs slightly from this parameterization),
exponential
,
median
.
ldata <- data.frame(y = rlaplace(nn <- 100, 2, scale = exp(1))) fit <- vglm(y ~ 1, laplace, ldata, trace = TRUE) coef(fit, matrix = TRUE) Coef(fit) with(ldata, median(y)) ldata <- data.frame(x = runif(nn <- 1001)) ldata <- transform(ldata, y = rlaplace(nn, 2, scale = exp(-1 + 1*x))) coef(vglm(y ~ x, laplace(iloc = 0.2, imethod = 2, zero = 1), ldata, trace = TRUE), matrix = TRUE)
Density, distribution function, quantile function and random
generation for the Laplace distribution with location parameter
location
and scale parameter scale
.
dlaplace(x, location = 0, scale = 1, log = FALSE) plaplace(q, location = 0, scale = 1, lower.tail = TRUE, log.p = FALSE) qlaplace(p, location = 0, scale = 1, lower.tail = TRUE, log.p = FALSE) rlaplace(n, location = 0, scale = 1)
x , q
|
vector of quantiles. |
p |
vector of probabilities. |
n |
number of observations.
Same as in |
location |
the location parameter |
scale |
the scale parameter |
log |
Logical.
If |
lower.tail , log.p
|
The Laplace distribution is often known as the double-exponential distribution and, for modelling, has heavier tails than the normal distribution. The Laplace density function is

f(y; a, b) = (1 / (2b)) exp(-|y - a| / b),

where -Inf < y < Inf, -Inf < a < Inf and b > 0. The mean is a and the variance is 2 b^2.
See laplace
, the VGAM family function
for estimating the two parameters by maximum likelihood estimation,
for formulae and details.
Apart from n, all the above arguments may be vectors and are recycled to the appropriate length if necessary.
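As a check of the formulae above, the density and distribution function can also be computed by hand; a minimal sketch:

a <- 1; b <- 2; x <- seq(-4, 6, by = 0.5)
max(abs(dlaplace(x, a, b) - exp(-abs(x - a) / b) / (2 * b)))   # Should be 0
# CDF: 0.5 * exp((x - a)/b) below a, and 1 - 0.5 * exp(-(x - a)/b) above a
Fx <- ifelse(x < a, 0.5 * exp((x - a) / b), 1 - 0.5 * exp(-(x - a) / b))
max(abs(plaplace(x, a, b) - Fx))                               # Should be 0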
dlaplace
gives the density,
plaplace
gives the distribution function,
qlaplace
gives the quantile function, and
rlaplace
generates random deviates.
T. W. Yee and Kai Huang
Forbes, C., Evans, M., Hastings, N. and Peacock, B. (2011). Statistical Distributions, Hoboken, NJ, USA: John Wiley and Sons, Fourth edition.
loc <- 1; b <- 2 y <- rlaplace(n = 100, loc = loc, scale = b) mean(y) # sample mean loc # population mean var(y) # sample variance 2 * b^2 # population variance ## Not run: loc <- 0; b <- 1.5; x <- seq(-5, 5, by = 0.01) plot(x, dlaplace(x, loc, b), type = "l", col = "blue", main = "Blue is density, orange is the CDF", ylim = c(0,1), sub = "Purple are 5,10,...,95 percentiles", las = 1, ylab = "") abline(h = 0, col = "blue", lty = 2) lines(qlaplace(seq(0.05,0.95,by = 0.05), loc, b), dlaplace(qlaplace(seq(0.05, 0.95, by = 0.05), loc, b), loc, b), col = "purple", lty = 3, type = "h") lines(x, plaplace(x, loc, b), type = "l", col = "orange") abline(h = 0, lty = 2) ## End(Not run) plaplace(qlaplace(seq(0.05, 0.95, by = 0.05), loc, b), loc, b)
Maximum likelihood estimation of the 1-parameter log-Laplace and the 1-parameter logit-Laplace distributions. These may be used for quantile regression for counts and proportions respectively.
loglaplace1(tau = NULL, llocation = "loglink", ilocation = NULL, kappa = sqrt(tau/(1 - tau)), Scale.arg = 1, ishrinkage = 0.95, parallel.locat = FALSE, digt = 4, idf.mu = 3, rep0 = 0.5, minquantile = 0, maxquantile = Inf, imethod = 1, zero = NULL) logitlaplace1(tau = NULL, llocation = "logitlink", ilocation = NULL, kappa = sqrt(tau/(1 - tau)), Scale.arg = 1, ishrinkage = 0.95, parallel.locat = FALSE, digt = 4, idf.mu = 3, rep01 = 0.5, imethod = 1, zero = NULL)
tau , kappa
|
See |
llocation |
Character.
Parameter link functions for
location parameter |
ilocation |
Optional initial values. If given, it must be numeric and values are recycled to the appropriate length. The default is to choose the value internally. |
parallel.locat |
Logical.
Should the quantiles be parallel on the transformed scale
(argument |
imethod |
Initialization method. Either the value 1, 2, or .... |
idf.mu , ishrinkage , Scale.arg , digt , zero
|
See |
rep0 , rep01
|
Numeric, positive.
Replacement values for 0s and 1s respectively.
For count data, values of the response whose value is 0 are
replaced by |
minquantile , maxquantile
|
Numeric.
The minimum and maximum values possible in the quantiles.
These arguments are effectively ignored by default since
|
These VGAM family functions implement translations of the asymmetric Laplace distribution (ALD). The resulting variants may be suitable for quantile regression for count data or sample proportions. For example, count data transformed by a log link are assumed to follow an ALD; similarly, proportions data transformed by a logit link are assumed to follow an ALD.
A positive random variable Y is said to have a log-Laplace distribution if Y = e^W where W has an ALD. There are many variants of ALDs and the one used here is described in alaplace1.
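This relationship is easily seen by simulation; a minimal sketch (it assumes VGAM's ralap() for ALD random variates; with kappa = 1 the ALD location parameter equals the median):

set.seed(1)
w <- ralap(2000, location = 1, scale = 0.5, kappa = 1)  # ALD on the log scale
y <- exp(w)       # y then has a log-Laplace distribution
median(y)         # Sample median of y
exp(1)            # exp(location), the population median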
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
and vgam
.
In the extra
slot of the fitted object are some list
components which are useful.
For example, the sample proportion of
values which are less than the fitted quantile curves, which
is sum(wprior[y <= location]) / sum(wprior)
internally.
Here, wprior
are the prior weights (called ssize
below), y
is the response and location
is a
fitted quantile curve. This definition comes about naturally
from the transformed ALD data.
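A minimal self-contained sketch of such a check (the data here are artificial, and convergence may involve half-stepping):

set.seed(1)
adata <- data.frame(x2 = runif(200))
adata <- transform(adata, y = rnbinom(200, mu = exp(1 + x2), size = 3))
fitp <- vglm(y ~ x2, loglaplace1(tau = c(0.25, 0.75)), data = adata)
names(fitp@extra)                          # Components stored by the family function
colMeans(c(depvar(fitp)) <= fitted(fitp))  # Roughly 0.25 and 0.75 (discreteness aside)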
The VGAM family function logitlaplace1
will not handle a vector of just 0s and 1s as the response; it
will only work satisfactorily if the number of trials is large.
See alaplace1
for other warnings.
Care is needed with tau
values which are too small,
e.g., for count data the sample proportion of zeros must be less
than all values in tau
. Similarly, this also holds with
logitlaplace1
, which also requires all tau
values to be less than the sample proportion of ones.
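A minimal sketch of the kind of pre-check this warning suggests (the response y here is hypothetical):

y <- c(rep(0, 15), rpois(85, lambda = 4))   # Hypothetical count response
mytau <- c(0.25, 0.5, 0.75)
mean(y == 0)                 # Sample proportion of zeros
all(mytau > mean(y == 0))    # Should be TRUE before using loglaplace1(tau = mytau)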
The form of input for logitlaplace1 as the response is a vector of proportions (values in [0, 1]), and the number of trials is entered into the weights argument of vglm/vgam. See Example 2 below.
See alaplace1
for other notes in general.
Thomas W. Yee
Kotz, S., Kozubowski, T. J. and Podgorski, K. (2001). The Laplace distribution and generalizations: a revisit with applications to communications, economics, engineering, and finance, Boston: Birkhauser.
Kozubowski, T. J. and Podgorski, K. (2003). Log-Laplace distributions. International Mathematical Journal, 3, 467–495.
Yee, T. W. (2020). Quantile regression for counts and proportions. In preparation.
# Example 1: quantile regression of counts with regression splines set.seed(123); my.k <- exp(0) adata <- data.frame(x2 = sort(runif(n <- 500))) mymu <- function(x) exp( 1 + 3*sin(2*x) / (x+0.5)^2) adata <- transform(adata, y = rnbinom(n, mu = mymu(x2), my.k)) mytau <- c(0.1, 0.25, 0.5, 0.75, 0.9); mydof = 3 # halfstepping is usual: fitp <- vglm(y ~ sm.bs(x2, df = mydof), data = adata, trace = TRUE, loglaplace1(tau = mytau, parallel.locat = TRUE)) ## Not run: par(las = 1) # Plot on a log1p() scale mylwd <- 1.5 plot(jitter(log1p(y), factor = 1.5) ~ x2, adata, col = "red", pch = "o", cex = 0.75, main = "Example 1; green=truth, blue=estimated") with(adata, matlines(x2, log1p(fitted(fitp)), col = "blue", lty = 1, lwd = mylwd)) finexgrid <- seq(0, 1, len = 201) for (ii in 1:length(mytau)) lines(finexgrid, col = "green", lwd = mylwd, log1p(qnbinom(mytau[ii], mu = mymu(finexgrid), my.k))) ## End(Not run) fitp@extra # Contains useful information # Example 2: sample proportions set.seed(123); nnn <- 1000; ssize <- 100 # ssize = 1 wont work! adata <- data.frame(x2 = sort(runif(nnn))) mymu <- function(x) logitlink( 1.0 + 4*x, inv = TRUE) adata <- transform(adata, ssize = ssize, y2 = rbinom(nnn, ssize, prob = mymu(x2)) / ssize) mytau <- c(0.25, 0.50, 0.75) fit1 <- vglm(y2 ~ sm.bs(x2, df = 3), logitlaplace1(tau = mytau, lloc = "clogloglink", paral = TRUE), data = adata, weights = ssize, trace = TRUE) ## Not run: # Check the solution. Note: this is like comparing apples with oranges. plotvgam(fit1, se = TRUE, scol = "red", lcol = "blue", main = "Truth = 'green'") # Centered approximately ! linkFunctionChar <- as.character(fit1@misc$link) adata <- transform(adata, trueFunction = theta2eta(theta = mymu(x2), link = linkFunctionChar)) with(adata, lines(x2, trueFunction - mean(trueFunction), col = "green")) # Plot the data + fitted quantiles (on the original scale) myylim <- with(adata, range(y2)) plot(y2 ~ x2, adata, col = "blue", ylim = myylim, las = 1, pch = ".", cex = 2.5) with(adata, matplot(x2, fitted(fit1), add = TRUE, lwd = 3, type = "l")) truecol <- rep(1:3, len = fit1@misc$M) # Add the 'truth' smallxgrid <- seq(0, 1, len = 501) for (ii in 1:length(mytau)) lines(smallxgrid, col = truecol[ii], lwd = 2, qbinom(mytau[ii], pr = mymu(smallxgrid), si = ssize) / ssize) # Plot on the eta (== logitlink()/probitlink()/...) scale with(adata, matplot(x2, predict(fit1), lwd = 3, type = "l")) # Add the 'truth' for (ii in 1:length(mytau)) { true.quant <- qbinom(mytau[ii], prob = mymu(smallxgrid), size = ssize) / ssize lines(smallxgrid, theta2eta(true.quant, link = linkFunctionChar), col = truecol[ii], lwd = 2) } ## End(Not run)
Flood peak and flood volume in the Madawaska River basin, Quebec, Canada
data(mbflood)
A data frame with the following variables.
Numeric.
Numeric. Duration, in days.
Numeric. Volume, in (cubic metres per second) x days.
Numeric. Flood peak, in cubic metres per second (m^3/s).
From Yue (2001), the basin is located in the province of Quebec, Canada, and it covers an area of 2690 square km. The data comprise 77 years of daily streamflow data, from 1919 to 1995, from the HYDAT CD (Environment Canada, 1998). In the basin, winter lasts for about 4 months and precipitation is mainly in the form of snowfall during this period. Spring represents the high-flow season due to the contribution of spring snowmelt to river runoff, and the spring flood is the annual maximum flood in both flood peak and volume. A typical flood hydrograph is characterized by its flood peak (Q), flood volume (V) and flood duration (D).
These flood characteristics are determined by first identifying
the flood duration: the dates of start and end of flood runoff
using the approach proposed by Yue (2000).
Generally, time boundaries of a flood are marked by a rise
in stage and discharge from base flow (start of flood runoff)
and a return to base flow (end of flood runoff).
Table 1 of Yue, S. (2001). A bivariate gamma distribution for use in multivariate flood frequency analysis. Hydrological Processes, 15, 1033–45.
Yue, S. (2000). The bivariate lognormal distribution to model a multivariate flood episode. Hydrological Processes, 14, 2575–88.
summary(mbflood)
Fits a one-altered logarithmic distribution based on a conditional model involving a Bernoulli distribution and a 1-truncated logarithmic distribution.
oalog(lpobs1 = "logitlink", lshape = "logitlink", type.fitted = c("mean", "shape", "pobs1", "onempobs1"), ipobs1 = NULL, gshape = ppoints(8), zero = NULL)
lpobs1 |
Link function for the parameter |
lshape |
See |
gshape , type.fitted
|
See |
ipobs1 , zero
|
See |
The response Y is one with probability pobs1, or Y has a 1-truncated logarithmic distribution with probability 1 - pobs1. Thus 0 < pobs1 < 1, which is modelled as a function of the covariates. The one-altered logarithmic distribution differs from the one-inflated logarithmic distribution in that the former has ones coming from one source, whereas the latter has ones coming from the logarithmic distribution too. The one-inflated logarithmic distribution is implemented in the VGAM package. Some people call the one-altered logarithmic a hurdle model.
The input can be a matrix (multiple responses). By default, the two linear/additive predictors of oalog are (logitlink(pobs1), logitlink(shape))^T.
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
,
and vgam
.
The fitted.values slot of the fitted object, which should be extracted by the generic function fitted, returns the mean (default), which is given by

mu = pobs1 + (1 - pobs1) A,

where A is the mean of the one-truncated logarithmic distribution. If type.fitted = "pobs1" then pobs1 is returned.
This family function effectively combines
binomialff
and
otlog
into
one family function.
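The mean given above can be verified numerically from the density functions; a minimal sketch, where dotlog() (see Otlog) is the 1-truncated logarithmic density:

shape <- 0.5; pobs1 <- 0.3; yy <- 1:1000      # Tail beyond 1000 is negligible here
sum(yy * doalog(yy, shape, pobs1 = pobs1))    # E(Y) by direct summation
A <- sum((2:1000) * dotlog(2:1000, shape))    # Mean of the 1-truncated logarithmic
pobs1 + (1 - pobs1) * A                       # Agrees with the fitted-mean formula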
T. W. Yee
Gaitdlog
,
Oalog
,
logff
,
oilog
,
CommonVGAMffArguments
,
simulate.vlm
.
## Not run: odata <- data.frame(x2 = runif(nn <- 1000)) odata <- transform(odata, pobs1 = logitlink(-1 + 2*x2, inv = TRUE), shape = logitlink(-2 + 3*x2, inv = TRUE)) odata <- transform(odata, y1 = roalog(nn, shape, pobs1 = pobs1), y2 = roalog(nn, shape, pobs1 = pobs1)) with(odata, table(y1)) ofit <- vglm(cbind(y1, y2) ~ x2, oalog, data = odata, trace = TRUE) coef(ofit, matrix = TRUE) head(fitted(ofit)) head(predict(ofit)) summary(ofit) ## End(Not run)
Density, distribution function, quantile function and random
generation for the one-altered logarithmic distribution with
parameter pobs1
.
doalog(x, shape, pobs1 = 0, log = FALSE) poalog(q, shape, pobs1 = 0) qoalog(p, shape, pobs1 = 0) roalog(n, shape, pobs1 = 0)
x , q , n , p
|
Same |
shape , log
|
Same as |
pobs1 |
Probability of (an observed) one, called |
The probability function of Y is 1 with probability pobs1, else a 1-truncated logarithmic(shape) distribution.
doalog
gives the density and
poalog
gives the distribution function,
qoalog
gives the quantile function, and
roalog
generates random deviates.
The argument pobs1 is recycled to the required length, and must have values which lie in the interval [0, 1].
T. W. Yee
Gaitdlog
,
oalog
,
oilog
,
Otlog
.
shape <- 0.75; pobs1 <- 0.10; x <- (-1):7 doalog(x, shape = shape, pobs1 = pobs1) table(roalog(100, shape = shape, pobs1 = pobs1)) ## Not run: x <- 0:10 barplot(rbind(doalog(x, shape = shape, pobs1 = pobs1), dlog(x, shape = shape)), beside = TRUE, col = c("blue", "orange"), cex.main = 0.7, las = 1, ylab = "Probability", names.arg = as.character(x), main = paste("OAL(shape = ", shape, ", pobs1 = ", pobs1, ") [blue] vs", " Logarithmic(shape = ", shape, ") [orange] densities", sep = "")) ## End(Not run)
Density, distribution function, quantile function and random
generation for the one-altered positive-Poisson distribution
with parameter pobs1
.
doapospois(x, lambda, pobs1 = 0, log = FALSE) poapospois(q, lambda, pobs1 = 0) qoapospois(p, lambda, pobs1 = 0) roapospois(n, lambda, pobs1 = 0)
x , q , n , p
|
Same |
lambda , log
|
Same as |
pobs1 |
Probability of (an observed) one, called |
The probability function of Y is 1 with probability pobs1, else a 1-truncated positive-Poisson(lambda) distribution.
doapospois
gives the density and
poapospois
gives the distribution function,
qoapospois
gives the quantile function, and
roapospois
generates random deviates.
The argument pobs1 is recycled to the required length, and must have values which lie in the interval [0, 1].
T. W. Yee
oapospoisson
,
Oipospois
,
Otpospois
.
lambda <- 3; pobs1 <- 0.30; x <- (-1):7 doapospois(x, lambda = lambda, pobs1 = pobs1) table(roapospois(100, lambda = lambda, pobs1 = pobs1)) ## Not run: x <- 0:10 barplot(rbind(doapospois(x, lambda = lambda, pobs1 = pobs1), dpospois(x, lambda = lambda)), beside = TRUE, col = c("blue", "orange"), cex.main = 0.7, las = 1, ylab = "Probability", names.arg = as.character(x), main = paste("OAPP(lambda = ", lambda, ", pobs1 = ", pobs1, ") [blue] vs", " PosPoisson(lambda = ", lambda, ") [orange] densities", sep = "")) ## End(Not run)
Fits a one-altered positive-Poisson distribution based on a conditional model involving a Bernoulli distribution and a 1-truncated positive-Poisson distribution.
oapospoisson(lpobs1 = "logitlink", llambda = "loglink", type.fitted = c("mean", "lambda", "pobs1", "onempobs1"), ipobs1 = NULL, zero = NULL)
lpobs1 |
Link function for the parameter |
llambda |
See |
type.fitted |
See |
ipobs1 , zero
|
See |
The response Y is one with probability pobs1, or Y has a 1-truncated positive-Poisson distribution with probability 1 - pobs1. Thus 0 < pobs1 < 1, which is modelled as a function of the covariates. The one-altered positive-Poisson distribution differs from the one-inflated positive-Poisson distribution in that the former has ones coming from one source, whereas the latter has ones coming from the positive-Poisson distribution too. The one-inflated positive-Poisson distribution is implemented in the VGAM package. Some people call the one-altered positive-Poisson a hurdle model.
The input can be a matrix (multiple responses). By default, the two linear/additive predictors of oapospoisson are (logitlink(pobs1), loglink(lambda))^T.
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
,
and vgam
.
The fitted.values slot of the fitted object, which should be extracted by the generic function fitted, returns the mean (default), which is given by

mu = pobs1 + (1 - pobs1) A,

where A is the mean of the one-truncated positive-Poisson distribution. If type.fitted = "pobs1" then pobs1 is returned.
This family function effectively combines
binomialff
and
otpospoisson
into
one family function.
T. W. Yee
Oapospois
,
pospoisson
,
oipospoisson
,
CommonVGAMffArguments
,
simulate.vlm
.
## Not run: odata <- data.frame(x2 = runif(nn <- 1000)) odata <- transform(odata, pobs1 = logitlink(-1 + 2*x2, inv = TRUE), lambda = loglink( 1 + 1*x2, inv = TRUE)) odata <- transform(odata, y1 = roapospois(nn, lambda, pobs1 = pobs1), y2 = roapospois(nn, lambda, pobs1 = pobs1)) with(odata, table(y1)) ofit <- vglm(cbind(y1, y2) ~ x2, oapospoisson, odata, trace = TRUE) coef(ofit, matrix = TRUE) head(fitted(ofit)) head(predict(ofit)) summary(ofit) ## End(Not run)
Fits a one-altered zeta distribution based on a conditional model involving a Bernoulli distribution and a 1-truncated zeta distribution.
oazeta(lpobs1 = "logitlink", lshape = "loglink", type.fitted = c("mean", "shape", "pobs1", "onempobs1"), gshape = exp((-4:3)/4), ishape = NULL, ipobs1 = NULL, zero = NULL)
lpobs1 |
Link function for the parameter |
lshape |
See |
type.fitted |
See |
gshape , ishape , ipobs1 , zero
|
See |
The response Y is one with probability pobs1, or Y has a 1-truncated zeta distribution with probability 1 - pobs1. Thus 0 < pobs1 < 1, which is modelled as a function of the covariates. The one-altered zeta distribution differs from the one-inflated zeta distribution in that the former has ones coming from one source, whereas the latter has ones coming from the zeta distribution too. The one-inflated zeta distribution is implemented in the VGAM package. Some people call the one-altered zeta a hurdle model.
The input can be a matrix (multiple responses). By default, the two linear/additive predictors of oazeta are (logitlink(pobs1), loglink(shape))^T.
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions such as vglm
,
and vgam
.
The fitted.values slot of the fitted object, which should be extracted by the generic function fitted, returns the mean (default), which is given by

mu = pobs1 + (1 - pobs1) A,

where A is the mean of the one-truncated zeta distribution. If type.fitted = "pobs1" then pobs1 is returned.
This family function effectively combines
binomialff
and
otzeta
into
one family function.
T. W. Yee
Oazeta
,
zetaff
,
oizeta
,
otzeta
,
CommonVGAMffArguments
,
simulate.vlm
.
## Not run: odata <- data.frame(x2 = runif(nn <- 1000)) odata <- transform(odata, pobs1 = logitlink(-1 + 2*x2, inverse = TRUE), shape = loglink( 1 + 1*x2, inverse = TRUE)) odata <- transform(odata, y1 = roazeta(nn, shape = shape, pobs1 = pobs1), y2 = roazeta(nn, shape = shape, pobs1 = pobs1)) with(odata, table(y1)) ofit <- vglm(cbind(y1, y2) ~ x2, oazeta, data = odata, trace = TRUE) coef(ofit, matrix = TRUE) head(fitted(ofit)) head(predict(ofit)) summary(ofit) ## End(Not run)
Density, distribution function, quantile function and random
generation for the one-altered zeta distribution with parameter
pobs1
.
doazeta(x, shape, pobs1 = 0, log = FALSE) poazeta(q, shape, pobs1 = 0) qoazeta(p, shape, pobs1 = 0) roazeta(n, shape, pobs1 = 0)
x , q , n , p
|
Same |
shape , log
|
Same as |
pobs1 |
Probability of (an observed) one, called |
The probability function of Y is 1 with probability pobs1, else a 1-truncated zeta distribution.
doazeta
gives the density and
poazeta
gives the distribution function,
qoazeta
gives the quantile function, and
roazeta
generates random deviates.
The argument pobs1 is recycled to the required length, and must have values which lie in the interval [0, 1].
T. W. Yee
shape <- 1.1; pobs1 <- 0.10; x <- (-1):7 doazeta(x, shape = shape, pobs1 = pobs1) table(roazeta(100, shape = shape, pobs1 = pobs1)) ## Not run: x <- 0:10 barplot(rbind(doazeta(x, shape = shape, pobs1 = pobs1), dzeta(x, shape = shape)), beside = TRUE, col = c("blue", "orange"), cex.main = 0.7, las = 1, ylab = "Probability", names.arg = as.character(x), main = paste("OAZ(shape = ", shape, ", pobs1 = ", pobs1, ") [blue] vs", " zeta(shape = ", shape, ") [orange] densities", sep = "")) ## End(Not run)
Fits a 1-inflated logarithmic distribution.
oilog(lpstr1 = "logitlink", lshape = "logitlink", type.fitted = c("mean", "shape", "pobs1", "pstr1", "onempstr1"), ishape = NULL, gpstr1 = ppoints(8), gshape = ppoints(8), zero = NULL)
lpstr1 , lshape
|
Link functions.
For |
gpstr1 , gshape , ishape
|
For initial values.
See |
type.fitted , zero
|
See |
The 1-inflated logarithmic distribution is a mixture distribution of the logarithmic distribution with some probability of obtaining a (structural) 1. Thus there are two sources for obtaining the value 1. This distribution is written here in a way that retains a similar notation to the one-inflated positive-Poisson, i.e., the probability P(Y = 1) involves another parameter pstr1. See oipospoisson.
This family function can handle multiple responses.
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
,
rrvglm
and vgam
.
Thomas W. Yee
Gaitdlog
,
Oilog
,
logff
,
Oizeta
.
## Not run: odata <- data.frame(x2 = runif(nn <- 1000)) # Artificial data odata <- transform(odata, pstr1 = logitlink(-1 + x2, inv = TRUE), shape = 0.5) odata <- transform(odata, y1 = roilog(nn, shape, pstr1 = pstr1)) with(odata, table(y1)) fit1 <- vglm(y1 ~ x2, oilog(zero = "shape"), odata, trace = TRUE) coef(fit1, matrix = TRUE) ## End(Not run)
Density, distribution function, quantile function and random
generation for the one-inflated logarithmic distribution with
parameter pstr1
.
doilog(x, shape, pstr1 = 0, log = FALSE) poilog(q, shape, pstr1 = 0) qoilog(p, shape, pstr1 = 0) roilog(n, shape, pstr1 = 0)
x , q , p , n
|
Same as |
shape |
Vector of parameters that lie in (0, 1). |
pstr1 |
Probability of a structural one
(i.e., ignoring the logarithmic distribution),
called |
log |
Same as |
The probability function of Y is 1 with probability pstr1, and logarithmic(shape) with probability 1 - pstr1. Thus

P(Y = 1) = pstr1 + (1 - pstr1) P(W = 1),

where W is distributed as a logarithmic(shape) random variable. The VGAM family function oilog estimates the parameters by MLE.
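This mixture identity is simple to verify numerically; a minimal sketch (dlog() is the ordinary logarithmic density):

shape <- 0.5; pstr1 <- 0.3
doilog(1, shape, pstr1 = pstr1)             # P(Y = 1)
pstr1 + (1 - pstr1) * dlog(1, shape)        # Same value, via the mixture formula
all.equal(doilog(2:6, shape, pstr1 = pstr1),
          (1 - pstr1) * dlog(2:6, shape))   # For y > 1 only the logarithmic part remains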
doilog
gives the density,
poilog
gives the distribution function,
qoilog
gives the quantile function, and
roilog
generates random deviates.
The argument pstr1 is recycled to the required length, and usually has values which lie in the interval [0, 1].
These functions actually allow for the one-deflated logarithmic distribution. Here, pstr1 is also permitted to lie in the interval [-dlog(1, shape) / (1 - dlog(1, shape)), 0]. The resulting probability of a unit count is less than the nominal logarithmic value, and the use of pstr1 to stand for the probability of a structural 1 loses its meaning. When pstr1 equals -dlog(1, shape) / (1 - dlog(1, shape)) this corresponds to the 1-truncated logarithmic distribution.
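A small check of this boundary case: at the deflation limit the probability of a one vanishes, leaving the 1-truncated logarithmic distribution.

shape <- 0.5
deflat.limit <- -dlog(1, shape) / (1 - dlog(1, shape))
doilog(1, shape, pstr1 = deflat.limit)                # Effectively zero
all.equal(doilog(2:6, shape, pstr1 = deflat.limit),
          dlog(2:6, shape) / (1 - dlog(1, shape)))    # The 1-truncated logarithmic pmf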
T. W. Yee
Gaitdlog
,
oilog
,
rlog
,
logff
,
Otlog
.
shape <- 0.5; pstr1 <- 0.3; x <- (-1):7 (ii <- doilog(x, shape, pstr1 = pstr1)) max(abs(poilog(1:200, shape) - cumsum(shape^(1:200) / (-(1:200) * log1p(-shape))))) # Should be 0 ## Not run: x <- 0:10 par(mfrow = c(2, 1)) # One-Inflated logarithmic barplot(rbind(doilog(x, shape, pstr1 = pstr1), dlog(x, shape)), beside = TRUE, col = c("blue", "orange"), main = paste0("OILogff(", shape, ", pstr1 = ", pstr1, ") (blue) vs Logff(", shape, ") (orange)"), names.arg = as.character(x)) deflat.limit <- -dlog(1, shape) / plog(1, shape, lower.tail = FALSE) newpstr1 <- round(deflat.limit, 3) + 0.001 # Near the boundary barplot(rbind(doilog(x, shape, pstr1 = newpstr1), dlog(x, shape)), beside = TRUE, col = c("blue","orange"), main = paste0("ODLogff(", shape, ", pstr1 = ", newpstr1, ") (blue) vs Logff(", shape, ") (orange)"), names.arg = as.character(x)) ## End(Not run)
Density,
distribution function,
quantile function and random generation
for the one-inflated positive
binomial distribution with parameter pstr1
.
doiposbinom(x, size, prob, pstr1 = 0, log = FALSE) poiposbinom(q, size, prob, pstr1 = 0) qoiposbinom(p, size, prob, pstr1 = 0) roiposbinom(n, size, prob, pstr1 = 0)
x , p , q , n
|
Same as |
size , prob
|
Same as |
pstr1 |
Probability of a structural one
(i.e., ignoring the positive binomial distribution),
called |
log |
Logical. Return the logarithm of the answer? |
The probability function of Y is 1 with probability pstr1, and positive-binomial(size, prob) with probability 1 - pstr1. Thus

P(Y = 1) = pstr1 + (1 - pstr1) P(W = 1),

where W is distributed as a positive-binomial(size, prob) random variable.
doiposbinom
gives the density,
poiposbinom
gives the distribution function,
qoiposbinom
gives the quantile function, and
roiposbinom
generates random deviates.
The argument pstr1 is recycled to the required length, and usually has values which lie in the interval [0, 1].
These functions actually allow for the one-deflated positive binomial distribution. Here, pstr1 is also permitted to lie in the interval [-A, 0], where A = dposbinom(1, size, prob) / (1 - dposbinom(1, size, prob)) (the deflation limit computed in the example below). The resulting probability of a unit value is less than the nominal positive binomial value, and the use of pstr1 to stand for the probability of a structural 1 loses its meaning. If pstr1 equals -A then this corresponds to the 0- and 1-truncated binomial distribution.
T. W. Yee
posbinomial
,
dbinom
,
binomialff
.
size <- 10; prob <- 0.2; pstr1 <- 0.4; x <- (-1):size (ii <- doiposbinom(x, size, prob, pstr1 = pstr1)) table(roiposbinom(100, size, prob, pstr1 = pstr1)) round(doiposbinom(x, size, prob, pstr1 = pstr1) * 100) # Similar? ## Not run: x <- 0:size par(mfrow = c(2, 1)) # One-Inflated Positive Binomial barplot(rbind(doiposbinom(x, size, prob, pstr1 = pstr1), dposbinom(x, size, prob)), beside = TRUE, col = c("blue", "orange"), main = paste0("OIPB(", size, ",", prob, ", pstr1 = ", pstr1, ") (blue) vs PosBinomial(", size, ",", prob, ") (orange)"), names.arg = as.character(x)) # Zero-deflated Pos Binomial: def.limit <- -dposbinom(1, size, prob) / (1 - dposbinom(1, size, prob)) def.limit <- size * prob / (1 + (size-1) * prob-1 / (1-prob)^(size-1)) newpstr1 <- round(def.limit, 3) + 0.001 # A little from the boundary barplot(rbind(doiposbinom(x, size, prob, pstr1 = newpstr1), dposbinom(x, size, prob)), beside = TRUE, col = c("blue","orange"), main = paste0("ODPB(", size, ",", prob, ", pstr1 = ", newpstr1, ") (blue) vs PosBinomial(", size, ",", prob, ") (orange)"), names.arg = as.character(x)) ## End(Not run)
Fits a one-inflated positive binomial distribution by maximum likelihood estimation.
oiposbinomial(lpstr1 = "logitlink", lprob = "logitlink", type.fitted = c("mean", "prob", "pobs1", "pstr1", "onempstr1"), iprob = NULL, gpstr1 = ppoints(9), gprob = ppoints(9), multiple.responses = FALSE, zero = NULL)
lpstr1 , lprob
|
Link functions for the parameter |
type.fitted |
See |
iprob , gpstr1 , gprob
|
For initial values;
see |
multiple.responses |
Logical.
See |
zero |
See |
These functions are based on

P(Y = 1/N) = pstr1 + (1 - pstr1) P(W = 1),
P(Y = y) = (1 - pstr1) P(W = N y)   for y = 2/N, ..., 1,

where W has a positive-binomial(N, prob) distribution. That is, the response is a sample proportion out of N trials, and the argument size in roiposbinom is N here. Ideally N should be reasonably large. The parameter pstr1 is the probability of a structural one, and it satisfies 0 < pstr1 < 1 (usually). The mean of Y is

pstr1 / N + (1 - pstr1) prob / (1 - (1 - prob)^N),

and these are returned as the default fitted values. By default, the two linear/additive predictors for oiposbinomial() are (logitlink(pstr1), logitlink(prob))^T.
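A minimal numerical check of the mean given above (doiposbinom() works on the count scale, so divide by N to obtain the proportion scale):

N <- 10; prob <- 0.4; pstr1 <- 0.2; yy <- 1:N
sum(yy * doiposbinom(yy, N, prob, pstr1 = pstr1)) / N  # E(Y) by direct summation
pstr1 / N + (1 - pstr1) * prob / (1 - (1 - prob)^N)    # The formula given above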
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
and vgam
.
The response variable should have one of the formats described by binomialff, e.g., a factor, a two-column matrix, or a vector of sample proportions with the weights argument specifying the values of N. To work well, one ideally needs large values of N and prob much greater than 0, i.e., the larger N and prob are, the better. If N = 1 then the model is unidentifiable since the number of parameters is excessive.
Estimated probabilities of a structural one and an
observed one are returned, as in zipoisson
.
The one-deflated positive binomial distribution might be fitted by setting lpstr1 = "identitylink", albeit not entirely reliably. See zipoisson for information that can be applied here.
T. W. Yee
roiposbinom
,
posbinomial
,
binomialff
,
rbinom
.
size <- 10 # Number of trials; N in the notation above nn <- 200 odata <- data.frame(pstr1 = logitlink( 0, inv = TRUE), # 0.50 mubin1 = logitlink(-1, inv = TRUE), # Binomial mean svec = rep(size, length = nn), x2 = runif(nn)) odata <- transform(odata, mubin2 = logitlink(-1 + x2, inv = TRUE)) odata <- transform(odata, y1 = roiposbinom(nn, svec, pr = mubin1, pstr1 = pstr1), y2 = roiposbinom(nn, svec, pr = mubin2, pstr1 = pstr1)) with(odata, table(y1)) fit1 <- vglm(y1 / svec ~ 1, oiposbinomial, data = odata, weights = svec, trace = TRUE, crit = "coef") fit2 <- vglm(y2 / svec ~ x2, oiposbinomial, data = odata, weights = svec, trace = TRUE) coef(fit1, matrix = TRUE) Coef(fit1) # Useful for intercept-only models head(fitted(fit1, type = "pobs1")) # Estimate of P(Y = 1) head(fitted(fit1)) with(odata, mean(y1)) # Compare this with fitted(fit1) summary(fit1)
Density,
distribution function,
quantile function and random generation
for the one-inflated positive
Poisson distribution with parameter pstr1
.
doipospois(x, lambda, pstr1 = 0, log = FALSE) poipospois(q, lambda, pstr1 = 0) qoipospois(p, lambda, pstr1 = 0) roipospois(n, lambda, pstr1 = 0)
x , p , q , n
|
Same as |
lambda |
Vector of positive means. |
pstr1 |
Probability of a structural one
(i.e., ignoring the positive Poisson distribution),
called |
log |
Logical. Return the logarithm of the answer? |
The probability function of Y is 1 with probability pstr1, and positive-Poisson(lambda) with probability 1 - pstr1. Thus

P(Y = 1) = pstr1 + (1 - pstr1) P(W = 1),

where W is distributed as a positive-Poisson(lambda) random variate.
doipospois
gives the density,
poipospois
gives the distribution function,
qoipospois
gives the quantile function, and
roipospois
generates random deviates.
The argument pstr1 is recycled to the required length, and usually has values which lie in the interval [0, 1].
These functions actually allow for the one-deflated positive Poisson distribution. Here, pstr1 is also permitted to lie in the interval [-lambda / (expm1(lambda) - lambda), 0]. The resulting probability of a unit count is less than the nominal positive Poisson value, and the use of pstr1 to stand for the probability of a structural 1 loses its meaning. When pstr1 equals -lambda / (expm1(lambda) - lambda) this corresponds to the 0- and 1-truncated Poisson distribution.
T. W. Yee
Pospois
,
oapospoisson
,
oipospoisson
,
otpospoisson
,
pospoisson
,
dpois
,
poissonff
.
lambda <- 3; pstr1 <- 0.2; x <- (-1):7 (ii <- doipospois(x, lambda, pstr1 = pstr1)) table(roipospois(100, lambda, pstr1 = pstr1)) round(doipospois(1:10, lambda, pstr1 = pstr1) * 100) # Similar? ## Not run: x <- 0:10 par(mfrow = c(2, 1)) # One-Inflated Positive Poisson barplot(rbind(doipospois(x, lambda, pstr1 = pstr1), dpospois(x, lambda)), beside = TRUE, col = c("blue", "orange"), main = paste0("OIPP(", lambda, ", pstr1 = ", pstr1, ") (blue) vs PosPoisson(", lambda, ") (orange)"), names.arg = as.character(x)) # 0-deflated Pos Poisson: deflat.limit <- -lambda / (expm1(lambda) - lambda) newpstr1 <- round(deflat.limit, 3) + 0.001 # Near the boundary barplot(rbind(doipospois(x, lambda, pstr1 = newpstr1), dpospois(x, lambda)), beside = TRUE, col = c("blue","orange"), main = paste0("ODPP(", lambda, ", pstr1 = ", newpstr1, ") (blue) vs PosPoisson(", lambda, ") (orange)"), names.arg = as.character(x)) ## End(Not run)
Fits a 1-inflated positive Poisson distribution.
oipospoisson(lpstr1 = "logitlink", llambda = "loglink", type.fitted = c("mean", "lambda", "pobs1", "pstr1", "onempstr1"), ilambda = NULL, gpstr1 = (1:19)/20, gprobs.y = (1:19)/20, imethod = 1, zero = NULL)
lpstr1 , llambda
|
For |
ilambda , gpstr1 , gprobs.y , imethod
|
For initial values.
See |
type.fitted , zero
|
See |
The 1-inflated positive Poisson distribution is a mixture
distribution of the
positive (0-truncated) Poisson
distribution with some probability of obtaining a (structural) 1.
Thus there are two sources for obtaining the value 1.
It is similar to a zero-inflated Poisson model, except
the Poisson is replaced by a positive Poisson and the 0 is replaced
by 1.
This distribution is written here in a way that retains a similar notation to the zero-inflated Poisson, i.e., the probability P(Y = 1) involves another parameter pstr1. See zipoisson.
This family function can handle multiple responses.
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions such as vglm
,
rrvglm
and vgam
.
Under- or over-flow may occur if the data is ill-conditioned.
Thomas W. Yee
Oipospois
,
pospoisson
,
oapospoisson
,
otpospoisson
,
zipoisson
,
poissonff
,
simulate.vlm
.
## Not run: set.seed(1) pdata <- data.frame(x2 = runif(nn <- 1000)) # Artificial data pdata <- transform(pdata, pstr1 = 0.5, lambda = exp(3 - x2)) pdata <- transform(pdata, y1 = roipospois(nn, lambda, pstr1 = pstr1)) with(pdata, table(y1)) fit1 <- vglm(y1 ~ x2, oipospoisson, data = pdata, trace = TRUE) coef(fit1, matrix = TRUE) ## End(Not run)
Fits a 1-inflated zeta distribution.
oizeta(lpstr1 = "logitlink", lshape = "loglink", type.fitted = c("mean", "shape", "pobs1", "pstr1", "onempstr1"), ishape = NULL, gpstr1 = ppoints(8), gshape = exp((-3:3) / 4), zero = NULL)
lpstr1 , lshape
|
For |
gpstr1 , gshape , ishape
|
For initial values.
See |
type.fitted , zero
|
See |
The 1-inflated zeta distribution is a mixture
distribution of the
zeta
distribution with some probability of obtaining a (structural) 1.
Thus there are two sources for obtaining the value 1.
This distribution is written here in a way that retains a similar notation to the zero-inflated Poisson, i.e., the probability P(Y = 1) involves another parameter pstr1. See zipoisson.
This family function can handle multiple responses.
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
,
rrvglm
and vgam
.
Under- or over-flow may occur if the data is ill-conditioned. Lots of data is needed to estimate the parameters accurately. Usually the shape parameter is best modelled as intercept-only.
Thomas W. Yee
Oizeta
,
zetaff
,
oazeta
,
otzeta
,
diffzeta
,
zeta
,
Oizipf
.
## Not run: odata <- data.frame(x2 = runif(nn <- 1000)) # Artificial data odata <- transform(odata, pstr1 = logitlink(-1 + x2, inv = TRUE), shape = exp(-0.5)) odata <- transform(odata, y1 = roizeta(nn, shape, pstr1 = pstr1)) with(odata, table(y1)) fit1 <- vglm(y1 ~ x2, oizeta(zero = "shape"), odata, trace = TRUE) coef(fit1, matrix = TRUE) ## End(Not run)
Density, distribution function, quantile function and random
generation for the one-inflated zeta distribution with parameter
pstr1
.
doizeta(x, shape, pstr1 = 0, log = FALSE) poizeta(q, shape, pstr1 = 0) qoizeta(p, shape, pstr1 = 0) roizeta(n, shape, pstr1 = 0)
x , q , p , n
|
Same as |
shape |
Vector of positive shape parameters. |
pstr1 |
Probability of a structural one
(i.e., ignoring the zeta distribution),
called |
log |
Same as |
The probability function of Y is 1 with probability pstr1, and zeta(shape) with probability 1 - pstr1. Thus

P(Y = 1) = pstr1 + (1 - pstr1) P(W = 1),

where W is distributed as a zeta(shape) random variable.
doizeta
gives the density,
poizeta
gives the distribution function,
qoizeta
gives the quantile function, and
roizeta
generates random deviates.
The argument pstr1 is recycled to the required length, and usually has values which lie in the interval [0, 1].
These functions actually allow for the one-deflated zeta distribution. Here, pstr1 is also permitted to lie in the interval [-dzeta(1, shape) / (1 - dzeta(1, shape)), 0]. The resulting probability of a unit count is less than the nominal zeta value, and the use of pstr1 to stand for the probability of a structural 1 loses its meaning. When pstr1 equals -dzeta(1, shape) / (1 - dzeta(1, shape)) this corresponds to the 1-truncated zeta distribution.
T. W. Yee
shape <- 1.5; pstr1 <- 0.3; x <- (-1):7 (ii <- doizeta(x, shape, pstr1 = pstr1)) max(abs(poizeta(1:200, shape) - cumsum(1/(1:200)^(1+shape)) / zeta(shape+1))) # Should be 0 ## Not run: x <- 0:10 par(mfrow = c(2, 1)) # One-Inflated zeta barplot(rbind(doizeta(x, shape, pstr1 = pstr1), dzeta(x, shape)), beside = TRUE, col = c("blue", "orange"), main = paste0("OIZeta(", shape, ", pstr1 = ", pstr1, ") (blue) vs Zeta(", shape, ") (orange)"), names.arg = as.character(x)) deflat.limit <- -dzeta(1, shape) / pzeta(1, shape, lower.tail = FALSE) newpstr1 <- round(deflat.limit, 3) + 0.001 # Near the boundary barplot(rbind(doizeta(x, shape, pstr1 = newpstr1), dzeta(x, shape)), beside = TRUE, col = c("blue","orange"), main = paste0("ODZeta(", shape, ", pstr1 = ", newpstr1, ") (blue) vs Zeta(", shape, ") (orange)"), names.arg = as.character(x)) ## End(Not run)
Fits a 1-inflated Zipf distribution.
oizipf(N = NULL, lpstr1 = "logitlink", lshape = "loglink", type.fitted = c("mean", "shape", "pobs1", "pstr1", "onempstr1"), ishape = NULL, gpstr1 = ppoints(8), gshape = exp((-3:3) / 4), zero = NULL)
N |
Same as |
lpstr1 , lshape
|
For |
gpstr1 , gshape , ishape
|
For initial values.
See |
type.fitted , zero
|
See |
The 1-inflated Zipf distribution is a mixture
distribution of the
Zipf
distribution with some probability of obtaining a (structural) 1.
Thus there are two sources for obtaining the value 1.
This distribution is written here in a way that retains a similar notation to the zero-inflated Poisson, i.e., the probability P(Y = 1) involves another parameter pstr1. See zipoisson.
This family function can handle multiple responses.
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
,
rrvglm
and vgam
.
Under- or over-flow may occur if the data is ill-conditioned. Lots of data is needed to estimate the parameters accurately. Usually the shape parameter is best modelled as intercept-only.
Thomas W. Yee
## Not run: odata <- data.frame(x2 = runif(nn <- 1000)) # Artificial data odata <- transform(odata, pstr1 = logitlink(-1 + x2, inv = TRUE), myN = 10, shape = exp(-0.5)) odata <- transform(odata, y1 = roizipf(nn, N = myN, shape, pstr1 = pstr1)) with(odata, table(y1)) fit1 <- vglm(y1 ~ x2, oizipf(zero = "shape"), odata, trace = TRUE) coef(fit1, matrix = TRUE) ## End(Not run)
Density, distribution function, quantile function and random
generation for the one-inflated Zipf distribution with parameter
pstr1
.
doizipf(x, N, shape, pstr1 = 0, log = FALSE) poizipf(q, N, shape, pstr1 = 0) qoizipf(p, N, shape, pstr1 = 0) roizipf(n, N, shape, pstr1 = 0)
x , q , p , n
|
Same as |
N , shape
|
See |
pstr1 |
Probability of a structural one
(i.e., ignoring the Zipf distribution),
called |
log |
Same as |
The probability function of Y is 1 with probability pstr1, and Zipf(N, shape) with probability 1 - pstr1. Thus

P(Y = 1) = pstr1 + (1 - pstr1) P(W = 1),

where W is distributed as a Zipf(N, shape) random variable. The VGAM family function oizipf estimates the two parameters of this model by Fisher scoring.
doizipf
gives the density,
poizipf
gives the distribution function,
qoizipf
gives the quantile function, and
roizipf
generates random deviates.
The argument pstr1 is recycled to the required length, and usually has values which lie in the interval [0, 1].
These functions actually allow for the one-deflated Zipf distribution. Here, pstr1 is also permitted to lie in the interval [-dzipf(1, N, s) / (1 - dzipf(1, N, s)), 0]. The resulting probability of a unit count is less than the nominal Zipf value, and the use of pstr1 to stand for the probability of a structural 1 loses its meaning. When pstr1 equals -dzipf(1, N, s) / (1 - dzipf(1, N, s)) this corresponds to the 1-truncated Zipf distribution.
T. W. Yee
N <- 10; shape <- 1.5; pstr1 <- 0.3; x <- (-1):N (ii <- doizipf(x, N, shape, pstr1 = pstr1)) ## Not run: x <- 0:10 par(mfrow = c(2, 1)) # One-Inflated zipf barplot(rbind(doizipf(x, N, shape, pstr1 = pstr1), dzipf(x, N, shape)), beside = TRUE, col = c("blue", "orange"), main = paste0("OIZipf(", N, ", ", shape, ", pstr1 = ", pstr1, ") (blue) vs Zipf(", N, ", ", shape, ") (orange)"), names.arg = as.character(x)) deflat.limit <- -dzipf(1, N, shape) / (1 - dzipf(1, N, shape)) newpstr1 <- round(deflat.limit, 3) + 0.001 # Near the boundary barplot(rbind(doizipf(x, N, shape, pstr1 = newpstr1), dzipf(x, N, shape)), beside = TRUE, col = c("blue", "orange"), main = paste0("ODZipf(", N, ", ", shape, ", pstr1 = ", newpstr1, ") (blue) vs Zipf(", N, ", ", shape, ") (orange)"), names.arg = as.character(x)) ## End(Not run)
Individual data for the Summer 2012 Olympic Games.
data(oly12)
A data frame with 10384 observations on the following 14 variables.
Name
The individual competitor's name.
Country
Country.
Age
A numeric vector, age in years.
Height
A numeric vector, height in m.
Weight
A numeric vector, weight in kg.
Sex
A factor with levels F
and M
.
DOB
A Date, date of birth.
PlaceOB
Place of birth.
Gold
Numeric vector, number of such medals won.
Silver
Similar to Gold
.
Bronze
Similar to Gold
.
Total
A numeric vector, total number of medals.
Sport
A factor with levels Archery, Athletics, "Athletics, Triathlon", Badminton, etc.
Event
The sporting event.
This data set represents a very small modification of a
.csv
spreadsheet from the source below.
Height has been converted to meters,
and date of birth is of a "Date"
class
(see as.Date
).
A few non-ASCII characters have been replaced by some ASCII sequence
(yet to be fixed up properly).
Some competitors share the same name. Some errors in the data are likely to exist.
Downloaded from
http://www.guardian.co.uk/sport/series/london-2012-olympics-data
in 2013-03; more recently it has changed to
https://www.theguardian.com/sport/series/london-2012-olympics-data.
data(oly12) mtab <- with(oly12, table(Country, Gold)) (mtab <- head(sort(mtab[, "1"] + 2 * mtab[, "2"], decreasing = TRUE), 10)) ## Not run: barplot(mtab, col = "gold", cex.names = 0.8, names = abbreviate(names(mtab)), beside = TRUE, main = "2012 Summer Olympic Final Gold Medal Count", ylab = "Gold medal count", las = 1, sub = "Top 10 countries") ## End(Not run)
Estimating the (single) parameter of the 1-truncated logarithmic distribution.
otlog(lshape = "logitlink", gshape = ppoints(8), zero = NULL)
lshape , gshape , zero
|
Same as
|
The 1-truncated logarithmic distribution is a logarithmic
distribution but with the probability of a one being zero. The
other probabilities are scaled to add to unity.
Some more details can be found at logff
.
Multiple responses are permitted.
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
,
and vgam
.
T. W. Yee
Gaitdlog
,
Otlog
,
logff
,
oalog
,
oilog
,
simulate.vlm
.
## Not run: odata <- data.frame(y1 = rotlog(1000, shape = logitlink(1/3, inv = TRUE))) ofit <- vglm(y1 ~ 1, otlog, data = odata, trace = TRUE, crit = "c") coef(ofit, matrix = TRUE) Coef(ofit) with(odata, hist(y1, prob = TRUE, breaks = seq(0.5, max(y1) + 0.5, by = 1), border = "blue")) x <- seq(1, with(odata, max(y1)), by = 1) with(odata, lines(x, dotlog(x, Coef(ofit)[1]), col = "orange", type = "h", lwd = 2)) ## End(Not run)
Density, distribution function, quantile function, and random generation for the one-truncated logarithmic distribution.
dotlog(x, shape, log = FALSE) potlog(q, shape, log.p = FALSE) qotlog(p, shape) rotlog(n, shape)
x , q
|
Vector of quantiles. For the density, it should be a vector
with integer values |
p |
vector of probabilities. |
n |
number of observations.
Same as in |
shape |
The parameter value |
log , log.p
|
Logical.
If |
The one-truncated logarithmic distribution is a logarithmic
distribution but with the probability of a one being zero. The
other probabilities are scaled to add to unity.
Some more details are given in logff
.
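The rescaling can be verified directly; a minimal sketch (it assumes VGAM's dlog(), the ordinary logarithmic PMF, is available):

shape <- 0.5
dotlog(1, shape)   # 0: a one is impossible
max(abs(dotlog(2:30, shape) - dlog(2:30, shape) / (1 - dlog(1, shape))))  # ~0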
dotlog
gives the density,
potlog
gives the distribution function,
qotlog
gives the quantile function, and
rotlog
generates random deviates.
Given some response data, the VGAM family function
otlog
estimates the parameter shape
.
Function potlog()
suffers from the problems that
plog
sometimes has.
T. W. Yee
dotlog(1:20, 0.5) rotlog(20, 0.5) ## Not run: shape <- 0.8; x <- 1:10 plot(x, dotlog(x, shape = shape), type = "h", ylim = 0:1, sub = "shape=0.8", las = 1, col = "blue", ylab = "Probability", main = "1-truncated logarithmic distn: blue=PMF; orange=CDF") lines(x+0.1, potlog(x, shape), col = "orange", lty = 3, type = "h") ## End(Not run)
Density, distribution function, quantile function, and random generation for the one-truncated positive-Poisson distribution.
dotpospois(x, lambda, log = FALSE) potpospois(q, lambda, log.p = FALSE) qotpospois(p, lambda) rotpospois(n, lambda)
x , q , p , n
|
Same as |
lambda , log , log.p
|
Same as |
The one-truncated positive-Poisson is a Poisson distribution
but with the probability of a one and a zero being 0.
That is, its support is 2, 3, ....
The other probabilities are scaled to add to unity.
Some more details are given in pospoisson
.
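The rescaling can be verified against the ordinary Poisson PMF; a minimal sketch using base R's dpois() and ppois():

lambda <- 4
dotpospois(1, lambda)   # 0: the support starts at 2
max(abs(dotpospois(2:25, lambda) -
        dpois(2:25, lambda) / ppois(1, lambda, lower.tail = FALSE)))  # ~0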
dotpospois
gives the density,
potpospois
gives the distribution function,
qotpospois
gives the quantile function, and
rotpospois
generates random deviates.
Given some response data, the VGAM family function
otpospoisson
estimates the
parameter lambda
.
T. W. Yee
otpospoisson
,
Pospois
,
Oipospois
.
dotpospois(1:20, 0.5) rotpospois(20, 0.5) ## Not run: lambda <- 4; x <- 1:10 plot(x, dotpospois(x, lambda = lambda), type = "h", ylim = 0:1, sub = "lambda=4", las = 1, col = "blue", ylab = "Probability", main = "1-truncated positive-Poisson distn: blue=PMF; orange=CDF") lines(x+0.1, potpospois(x, lambda), col = "orange", lty=3, type = "h") ## End(Not run)
Estimating the (single) parameter of the 1-truncated positive Poisson distribution.
otpospoisson(llambda = "loglink", type.fitted = c("mean", "lambda", "prob0", "prob1"), ilambda = NULL, imethod = 1, zero = NULL)
llambda , type.fitted , ilambda
|
Same as |
imethod , zero
|
Same as |
The 1-truncated positive Poisson distribution has support on 2, 3,
....
It is a Poisson distribution but with
the probability of a one or zero being 0. The other
probabilities are scaled to add to unity.
Some more details can be found at pospoisson
.
Multiple responses are permitted.
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
,
and vgam
.
T. W. Yee
Otpospois
,
oipospoisson
,
simulate.vlm
.
## Not run: odata <- data.frame(y1 = rotpospois(1000, lambda = loglink(1, inv = TRUE))) ofit <- vglm(y1 ~ 1, otpospoisson, data = odata, trace = TRUE, crit = "c") coef(ofit, matrix = TRUE) Coef(ofit) with(odata, hist(y1, prob = TRUE, breaks = seq(0.5, max(y1) + 0.5, by = 1), border = "blue")) x <- seq(1, with(odata, max(y1)), by = 1) with(odata, lines(x, dotpospois(x, Coef(ofit)[1]), col = "orange", type = "h", lwd = 2)) ## End(Not run)
Estimates the parameter of the 1-truncated zeta distribution.
otzeta(lshape = "loglink", ishape = NULL, gshape = exp((-4:3)/4), zero = NULL)
lshape , ishape , gshape , zero
|
Same as |
The 1-truncated zeta distribution is the ordinary zeta distribution but with the probability of one being 0. Thus the other probabilities are scaled up (i.e., divided by 1 - P(Y = 1)). The mean is returned by default as the fitted values. More details can be found at zetaff.
Multiple responses are handled.
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
,
and vgam
.
T. W. Yee
Otzeta
,
zetaff
,
oizeta
,
diffzeta
,
zeta
,
dzeta
,
hzeta
,
zipf
.
## Not run: odata <- data.frame(x2 = runif(nn <- 1000)) # Artificial data odata <- transform(odata, shape = loglink(-0.25 + x2, inverse = TRUE)) odata <- transform(odata, y1 = rotzeta(nn, shape)) with(odata, table(y1)) ofit <- vglm(y1 ~ x2, otzeta, data = odata, trace = TRUE, crit = "coef") coef(ofit, matrix = TRUE) ## End(Not run)
Density, distribution function, quantile function, and random generation for the one-truncated zeta distribution.
dotzeta(x, shape, log = FALSE) potzeta(q, shape, log.p = FALSE) qotzeta(p, shape) rotzeta(n, shape)
x , q , p , n
|
Same as in |
shape |
The positive shape parameter described in zetaff.
|
log , log.p
|
Same as in |
The one-truncated zeta distribution is a zeta distribution but
with the probability of a one being zero. The other probabilities
are scaled to add to unity.
Some more details are given in zetaff
.
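A minimal sketch of that rescaling (it assumes VGAM's dzeta(), the ordinary zeta PMF, is available):

shape <- 0.8
dotzeta(1, shape)   # 0: a one is impossible
max(abs(dotzeta(2:20, shape) - dzeta(2:20, shape) / (1 - dzeta(1, shape))))  # ~0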
dotzeta
gives the density,
potzeta
gives the distribution function,
qotzeta
gives the quantile function, and
rotzeta
generates random deviates.
Given some response data, the VGAM family function
otzeta
estimates the parameter shape
.
T. W. Yee
dotzeta(1:20, 0.5) rotzeta(20, 0.5) ## Not run: shape <- 0.8; x <- 1:10 plot(x, dotzeta(x, shape = shape), type = "h", ylim = 0:1, sub = "shape=0.8", las = 1, col = "blue", ylab = "Probability", main = "1-truncated zeta distn: blue=PMF; orange=CDF") lines(x + 0.1, potzeta(x, shape), col = "orange", lty = 3, type = "h") ## End(Not run)
The age, names and habitation of 52 pirates who were found guilty of piracy and executed, after the ships associated with Bartholomew Roberts were captured.
data(pirates1)
A data frame with the following 3 variables.
age
a numeric vector, their age in years at the time of trial. Bartholomew Roberts himself was 39 years old at his death.
name
character.
habitation
character.
According to Wikipedia, in February 1722 Captain Ogle was sent by the British Government to find and capture the notorious pirate Bartholomew Roberts (real name: John Roberts, but also known later as Black Bart). When his warship caught up with the Royal Fortune he attacked, and Bartholomew Roberts was the first to fall, followed by 2 others. The remaining pirates surrendered soon afterwards. A total of 272 men were captured; of these, 65 were black and were sold into slavery. The remainder were taken to Cape Coast Castle, apart from those who died on the voyage back. The trial was held in April, 1722, and 54 were condemned to death, of whom 52 were hanged and two were reprieved. Of those executed, their personal data (name, age, habitation) were recorded.
Pages 248–249 of Johnson, Captain Charles, (1955) (Editor: Arthur L. Hayward). A General History of the Robberies and Murders of the Most Notorious Pirates, London: Routledge and Kegan Paul Ltd. This edition was first published in 1926. The earliest manuscript of the book dates back to 1724.
This data was entered into R by Lucia Pilleri.
summary(pirates1)
A data frame containing the age, name, birthplace and verdict of 35 members of a pirate ship associated with Edward Low, who were taken to trial on 10th to 12th July, 1723.
data(pirates2)
The variables age
and name
are
analogous to pirates1
.
The variable guilty
is binary and 1 means yes,
0 means not guilty. Guilty crew members were executed
except for two: John Brown and Patrick Cunningham;
they were respited for one year and recommended to the
King's favour.
Starting on the 10th July, 1723, the crew of the Ranger were judged. The captain of the ship was Charles Harris, and this ship was one of two pirate ships under Captain Edward Low. Their personal data (name, age, place of birth) and verdicts are recorded in Johnson (1955). This data was constructed from pp.295–296 of that book and includes those who were not found guilty (and therefore were not executed). The executions of the 25 men were carried out on 19 July near Newport, Rhode Island, USA. The notorious pirate Edward Low himself was brought to trial in 1724 under different circumstances and was hanged in Martinique.
Same as pirates1
.
This data was entered into R by Lucia Pilleri.
summary(pirates2)
Density, distribution function, quantile function and random generation for the positive-binomial distribution.
dposbinom(x, size, prob, log = FALSE) pposbinom(q, size, prob) qposbinom(p, size, prob) rposbinom(n, size, prob)
x , q
|
vector of quantiles. |
p |
vector of probabilities. |
n |
number of observations.
Fed into |
size |
number of trials.
It is the |
prob |
probability of success on each trial.
Should be in |
log |
See
|
The positive-binomial distribution is a binomial distribution but with the probability of a zero being zero. The other probabilities are scaled to add to unity. The mean therefore is size * prob / (1 - (1 - prob)^size), where prob is the argument prob above.
As size or prob increases, the positive-binomial and binomial distributions become more similar. Unlike similar functions for the binomial distribution, a zero value of prob is not permitted here.
dposbinom
gives the density,
pposbinom
gives the distribution function,
qposbinom
gives the quantile function, and
rposbinom
generates random deviates.
These functions are or are likely to be deprecated.
Use Gaitdbinom
instead.
For dposbinom()
, if arguments size
or prob
equal 0 then a NaN
is returned.
The family function posbinomial
estimates
the parameters by maximum likelihood estimation.
T. W. Yee.
posbinomial
,
dposbern
,
Gaitdbinom
,
zabinomial
,
zibinomial
,
Binomial
.
prob <- 0.2; size <- 10 table(y <- rposbinom(n = 1000, size, prob)) mean(y) # Sample mean size * prob / (1 - (1 - prob)^size) # Population mean (ii <- dposbinom(0:size, size, prob)) cumsum(ii) - pposbinom(0:size, size, prob) # Should be 0s table(rposbinom(100, size, prob)) table(qposbinom(runif(1000), size, prob)) round(dposbinom(1:10, size, prob) * 1000) # Should be similar ## Not run: barplot(rbind(dposbinom(x = 0:size, size, prob), dbinom(x = 0:size, size, prob)), beside = TRUE, col = c("blue", "green"), main = paste("Positive-binomial(", size, ",", prob, ") (blue) vs", " Binomial(", size, ",", prob, ") (green)", sep = ""), names.arg = as.character(0:size), las = 1) ## End(Not run) # Simulated data example nn <- 1000; sizeval1 <- 10; sizeval2 <- 20 pdata <- data.frame(x2 = seq(0, 1, length = nn)) pdata <- transform(pdata, prob1 = logitlink(-2 + 2 * x2, inv = TRUE), prob2 = logitlink(-1 + 1 * x2, inv = TRUE), sizev1 = rep(sizeval1, len = nn), sizev2 = rep(sizeval2, len = nn)) pdata <- transform(pdata, y1 = rposbinom(nn, sizev1, prob = prob1), y2 = rposbinom(nn, sizev2, prob = prob2)) with(pdata, table(y1)) with(pdata, table(y2)) # Multiple responses fit2 <- vglm(cbind(y1, y2) ~ x2, posbinomial(multip = TRUE), trace = TRUE, pdata, weight = cbind(sizev1, sizev2)) coef(fit2, matrix = TRUE)
Density, distribution function, quantile function and random generation for the positive-negative binomial distribution.
dposnegbin(x, size, prob = NULL, munb = NULL, log = FALSE) pposnegbin(q, size, prob = NULL, munb = NULL, lower.tail = TRUE, log.p = FALSE) qposnegbin(p, size, prob = NULL, munb = NULL) rposnegbin(n, size, prob = NULL, munb = NULL)
x , q
|
vector of quantiles. |
p |
vector of probabilities. |
n |
number of observations.
Fed into |
size , prob , munb , log
|
Same arguments as that of an ordinary negative binomial
distribution
(see Short vectors are recycled.
The parameter Note that |
log.p , lower.tail
|
Same arguments as that of an ordinary negative binomial
distribution (see |
The positive-negative binomial distribution is a negative binomial distribution but with the probability of a zero being zero. The other probabilities are scaled to add to unity. The mean therefore is munb / (1 - (size / (size + munb))^size), where munb is the mean of an ordinary negative binomial distribution.
dposnegbin
gives the density,
pposnegbin
gives the distribution function,
qposnegbin
gives the quantile function, and
rposnegbin
generates random deviates.
These functions are or are likely to be deprecated.
Use Gaitdnbinom
instead.
T. W. Yee
Welsh, A. H., Cunningham, R. B., Donnelly, C. F. and Lindenmayer, D. B. (1996). Modelling the abundances of rare species: statistical models for counts with extra zeros. Ecological Modelling, 88, 297–308.
Gaitdnbinom
,
posnegbinomial
,
zanegbinomial
,
zinegbinomial
,
rnbinom
.
munb <- 5; size <- 4; n <- 1000 table(y <- rposnegbin(n, munb = munb, size = size)) mean(y) # Sample mean munb / (1 - (size / (size + munb))^size) # Population mean munb / pnbinom(0, mu = munb, size, lower.tail = FALSE) # Same x <- (-1):17 (ii <- dposnegbin(x, munb = munb, size = size)) max(abs(cumsum(ii) - pposnegbin(x, munb = munb, size))) # 0? ## Not run: x <- 0:10 barplot(rbind(dposnegbin(x, munb = munb, size = size), dnbinom(x, mu = munb, size = size)), beside = TRUE, col = c("blue","green"), main = paste0("dposnegbin(munb = ", munb, ", size = ", size, ") (blue) vs dnbinom(mu = ", munb, ", size = ", size, ") (green)"), names.arg = as.character(x)) ## End(Not run) # Another test for pposnegbin() nn <- 5000 mytab <- cumsum(table(rposnegbin(nn, munb = munb, size))) / nn myans <- pposnegbin(sort(as.numeric(names(mytab))), munb = munb, size) max(abs(mytab - myans)) # Should be 0
Density, distribution function, quantile function and random generation for the positive-Poisson distribution.
dpospois(x, lambda, log = FALSE) ppospois(q, lambda) qpospois(p, lambda) rpospois(n, lambda)
x , q
|
vector of quantiles. |
p |
vector of probabilities. |
n |
number of observations.
Fed into |
lambda |
vector of positive means (of an ordinary Poisson distribution). Short vectors are recycled. |
log |
logical. |
The positive-Poisson distribution is a Poisson distribution but with the probability of a zero being zero. The other probabilities are scaled to add to unity. The mean therefore is lambda / (1 - exp(-lambda)). As lambda increases, the positive-Poisson and Poisson distributions become more similar.
Unlike similar functions for the Poisson distribution, a zero value
of
lambda
returns a NaN
.
dpospois
gives the density,
ppospois
gives the distribution function,
qpospois
gives the quantile function, and
rpospois
generates random deviates.
These functions are or are likely to be deprecated.
Use Gaitdpois
instead.
The family function pospoisson estimates lambda by maximum likelihood estimation.
T. W. Yee
Gaitdpois
,
pospoisson
,
zapoisson
,
zipoisson
,
rpois
.
lambda <- 2; y = rpospois(n = 1000, lambda) table(y) mean(y) # Sample mean lambda / (1 - exp(-lambda)) # Population mean (ii <- dpospois(0:7, lambda)) cumsum(ii) - ppospois(0:7, lambda) # Should be 0s table(rpospois(100, lambda)) table(qpospois(runif(1000), lambda)) round(dpospois(1:10, lambda) * 1000) # Should be similar ## Not run: x <- 0:7 barplot(rbind(dpospois(x, lambda), dpois(x, lambda)), beside = TRUE, col = c("blue", "orange"), main = paste("Positive Poisson(", lambda, ") (blue) vs", " Poisson(", lambda, ") (orange)", sep = ""), names.arg = as.character(x), las = 1, lwd = 2) ## End(Not run)
Number of prisoners in each US state, and the populations of those states, from 1977 to 2010.
data(prison.us)
A data frame with 34 observations on the following 103 variables.
Year: a numeric vector, the year.
The remaining columns are numeric vectors giving, for each state, the number of prisoners and the estimated state population, with the final columns giving overall counts for the whole country (e.g., US.pop).
This is a data set of the number of prisoners in each American state and the populations of those states, from 1977 to 2010. The numbers of prisoners are taken from December 31st, while the populations are estimates taken from July 1st based on the previous Census, except for pop.1980, which uses exact census data from 1980/04/01.
Warning:
a scatterplot of US.pop
shows a discontinuity around 2000.
The prisoner data was compiled from:
Bureau of Justice Statistics,
http://www.bjs.gov/index.cfm.
Downloaded in September 2013 and formatted into R by J. T. Gray,
[email protected]
.
The population data was compiled from:
United States Census Bureau,
http://www.census.gov/popest/data
.
Downloaded in September 2013 by J. T. Gray.
This site may have become stale.
summary(prison.us) ## Not run: # This plot shows a discontinuity around 2000. plot(US.pop / 1e6 ~ Year, prison.us, main = "US population (millions)", las = 1, type = "b", col = "blue") ## End(Not run)
This data set contains information on 22 past or present professors of statistics in New Zealand universities.
data(profs.nz)
A data frame with 22 observations on the following 9 variables.
pubtotal
a numeric vector, the total number of publications.
cites
a numeric vector, the number of citations.
initials
character, first and middle and surname initials.
Surname
character, the surname.
firstyear
a numeric vector, the earliest indexed publication.
ID
a numeric vector, the unique MR Author ID for each professor.
pub1stAuthor
a numeric vector, the total number of publications which are first authored by the person.
ARPtotal
a numeric vector, the total number of author/related publications.
institution
character,
with values "MU"
, "UA"
,
"UC"
, "UO"
, "UW"
, "VU"
,
the university affiliation.
The abbreviations are for:
Massey University,
University of Auckland,
University of Canterbury,
University of Otago,
University of Waikato and
Victoria University Wellington.
This data set contains information taken from the MathSciNet database on professors of statistics (and some related fields) affiliated with New Zealand universities.
In the future the following names may be added: C. F. Ansley, P. C. B. Phillips, B. S. Weir, C. S. Withers.
The data was compiled from the equivalent of https://mathscinet.ams.org/mathscinet/publications-search by J. T. Gray in April 2014.
profs.nz profs.nz[order(with(profs.nz, pubtotal), decreasing = TRUE), ] ## Not run: plot(pub1stAuthor / pubtotal ~ pubtotal, main = "Professors of Statistics in NZ", xlab = "Number of publications in MathSciNet", ylab = "Proportion of first-authored papers", data = profs.nz, col = "blue", las = 1, type = "n") with(profs.nz, text(pubtotal, y = pub1stAuthor / pubtotal, labels = initials, col = "blue", las = 1)) ## End(Not run)
Ages at death in rock and roll.
data(rar.df)
A data frame with the following variables.
Numeric. Age when the person died, in years. A handful of values are approximate or missing.
Numeric. The year that the person died.
This type of data is very subjective because many definitions can never be finalized. The Wikipedia page was written anonymously, the data were collected informally, and there is no quality assurance. Of course, it could be debated whether a particular person should be included, and there will be many people who should be there but are not. Nevertheless, it can be considered a fun data set that should be of interest to anybody who likes modern music or its history.
In the source, some people's ages were listed imprecisely, e.g., as 40s or 50--51. These were either replaced by NA or the mean of the endpoints was taken.
The lists at https://en.wikipedia.org/wiki/List_of_deaths_in_rock_and_roll were downloaded and edited.
summary(rar.df)
The number of wins, losses and draws for each of the top 10 rugby teams against each other.
data(rugby) data(rugby.ties)
The format is as two matrices.
The first matrix gives the number of games won by each team against each of the other teams. The other matrix gives the number of draws (ties) between each pair of teams. This data is current as of 2013-10-07.
The match statistics are compiled from
http://www.rugbydata.com/
on 2013-10-07 by J. T. Gray,
[email protected]
.
The top ten teams are determined by the International Rugby Board world rankings, https://www.world.rugby/.
data(rugby); data(rugby.ties) rugby rugby.ties
This data set contains information and satisfaction scores appearing on the TripAdvisor website between the years 2008 and 2016 regarding hotels in Sardinia, Italy.
The satisfaction data refer to the reputation of hotels located along the Sardinian coast, as expressed by clients with respect to different services (e.g., breakfast, restaurant, swimming pool) offered by the hotel.
data(SardiniaHotels)
A data frame with 518 rows and 43 columns (variables). Each row refers to a single hotel.
The following variables are included in the dataset:
municipality
a factor, the municipality where the hotel is located.
stars
an ordered factor with levels:
1OR2stars for 1 star or 2 star hotels,
3stars for 3 star hotels,
residence,
4stars for 4 star hotels,
5starsORresort for 5 star hotels or resorts.
area
a factor with levels related to the area of the Sardinian coast where each single hotel is located:
AlgheroSassari
,
CagliariVillasimius
, CostaSmeralda
,
DorgaliOrosei
, Gallura
, NurraAnglona
,
Ogliastra
, Olbia
, OristanoBosa
,
PulaChia
, Sarrabus
,
Sulcis
.
seaLocation
a factor with levels
yes
(if the hotel is located close to the sea)
and no
(otherwise).
excellent
a numeric vector, the number of people that expressed the highest level of satisfaction.
good
a numeric vector, the number of people that expressed a good level of satisfaction.
average
a numeric vector, the number of people that expressed an average level of satisfaction.
bad
a numeric vector, the number of people that expressed a bad level of satisfaction.
poor
a numeric vector, the number of people that expressed the lowest level of satisfaction.
family
a numeric vector, the number of people travelling with family.
couple
a numeric vector, the number of people travelling with their partner.
single
a numeric vector, the number of people travelling alone.
business
a numeric vector, the number of people travelling for work.
MarMay
a numeric vector, the number of people travelling during the period March to May.
JunAug
a numeric vector, the number of people travelling during the period June to August.
SepNov
a numeric vector, the number of people travelling during the period September to November.
DecFeb
a numeric vector, the number of people travelling during the period December to February.
location
a numeric vector, the satisfaction score expressed by tourists towards the location.
sleepQuality
a numeric vector, the satisfaction score expressed by tourists towards the sleep quality.
room
a numeric vector, the satisfaction score expressed by tourists towards the comfort and quality of the room.
services
a numeric vector, the satisfaction score expressed by tourists towards the quality of the services.
priceQualityRate
a numeric vector, the satisfaction score expressed by tourists towards ratio between price and quality.
cleaning
a numeric vector, the satisfaction score expressed by tourists towards level of room and hotel cleaning.
bt1
a factor with levels breakfast
,
cleaning
, location
, overall
,
price
, restaurant
, room
,
services
, staff
, structure
and
Wi-Fi
.
It expresses the 1st most used word in reviews for a hotel.
ratebt1
a factor with levels -1 (if the satisfaction score expressed in bt1 is prevalently negative) and 1 (if the satisfaction score expressed in bt1 is prevalently positive).
bt2
a factor with levels breakfast
,
cleaning
, location
, overall
,
price
, restaurant
, room
,
services
, staff
, structure
and
Wi-Fi
.
It expresses the 2nd most used word in reviews for a hotel.
ratebt2
a factor with levels -1 (if the satisfaction score expressed in bt2 is prevalently negative) and 1 (if the satisfaction score expressed in bt2 is prevalently positive).
bt3
similar to bt1
and bt2
,
but with a corresponding different ranking.
bt4
similar to bt1
and bt2
,
but with a corresponding different ranking.
bt5
similar to bt1
and bt2
,
but with a corresponding different ranking.
bt6
similar to bt1
and bt2
,
but with a corresponding different ranking.
bt7
similar to bt1
and bt2
,
but with a corresponding different ranking.
bt8
similar to bt1
and bt2
,
but with a corresponding different ranking.
bt9
similar to bt1
and bt2
,
but with a corresponding different ranking.
bt10
similar to bt1
and bt2
,
but with a corresponding different ranking.
ratebt3
similar
to ratebt1
and ratebt2
, but with a
corresponding different ranking.
ratebt4
similar
to ratebt1
and ratebt2
, but with a
corresponding different ranking.
ratebt5
similar
to ratebt1
and ratebt2
, but with a
corresponding different ranking.
ratebt6
similar
to ratebt1
and ratebt2
, but with a
corresponding different ranking.
ratebt7
similar
to ratebt1
and ratebt2
, but with a
corresponding different ranking.
ratebt8
similar
to ratebt1
and ratebt2
, but with a
corresponding different ranking.
ratebt9
similar
to ratebt1
and ratebt2
, but with a
corresponding different ranking.
ratebt10
similar
to ratebt1
and ratebt2
, but with a
corresponding different ranking.
These data were manually collected during March–June 2016 by students of the class of "Statistics for Tourism" at the University of Cagliari, Italy (Bachelor's degree in Tourism Economics and Management), under the supervision of Prof. Claudio Conversano and Dr. Giulia Contu.
Many of the variables fall into several natural groups, e.g.,
[municipality
,
stars
,
area
,
seaLocation
];
[excellent
,
good
,
average
,
bad
,
poor
];
[MarMay
,
JunAug
,
SepNov
,
DecFeb
];
[family
,
couple
,
single
,
business
];
[location
,...cleaning
];
[bt1
,...bt10
];
[ratebt1
,...ratebt10
].
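For example, the five satisfaction counts can be combined to give the total number of reviews per hotel; a minimal sketch:

data(SardiniaHotels)
nreviews <- with(SardiniaHotels, excellent + good + average + bad + poor)
summary(nreviews)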
TripAdvisor,
https://www.tripadvisor.it/
.
data(SardiniaHotels) summary(SardiniaHotels)
Selected variables mainly focussed on the smoking questionnaire of the National Health and Nutrition Examination Survey collected during 1.5 cycles just prior to Covid-19.
data(smqP)
A data frame with the following variables.
Identifier for individuals that can be used to merge in other data sets from the same cycle.
Numeric.
Tobacco consumption:
average number of cigarettes/day during past 30 days.
Aka SMD650
with some preprocessing.
Smoking cessation age (SCA).
Is missing for those who have not quit.
Computing this involved variables
SMQ050Q
, SMQ050U
and age
.
The variable SMQ050Q
is for
How long since (you) quit smoking cigarettes?
and
SMQ050U
are the units
(e.g., years, months, days).
The variable SMQ050Q
is right-censored
at 50 years (66666 means 50 or more years)
and for such people SCA was set to NA
.
Tobacco consumption:
number of cigarettes smoked per day when quit.
Aka SMD057
with some preprocessing.
Smoking initiation age (SIA):
age when individuals started smoking cigarettes regularly.
Aka SMD030
with some preprocessing.
Age (RIDAGEYR
) when surveyed;
the value 80 is right-censored,
i.e., subjects 80 years or older were recorded as 80.
Gender is RIAGENDR
.
Race (RIDRETH1
) and a
binary simplification of race
("Non-Hispanic White"
versus
"Others"
).
Education (DMDEDUC2
) and
marital status (DMDMARTZ
).
Both variables were not collected for those
aged 12–19, hence those are NA
s.
Ratio of family income to poverty (INDFMPIR
).
For example,
for the first one, this is
the number of meals from a fast food or pizza place
(DBD900
) during the last 7 days, where
the value 5555 means more than 21 meals per week,
the value 7777 means the person refused to answer, and
the value 9999 means the person didn't know.
The other variables may be described later in more detail.
To be described later.
Numeric.
For people_fam_smoking
, this is
Number of people who live here smoke tobacco?
(SMD460
). For this,
0 is: No one in household is a smoker;
1 is: 1 household member is a smoker;
2 is: 2 or more household members are smokers.
For people_home_smoking
, this is
the number of people who smoke inside this home?
(SMD470
). For this,
0 is: No one smokes inside the house;
1 is: 1 household member smokes inside the house;
2 is: 2 or more household members smoke inside the house.
Use during the last 5 days of smoking variants.
The codes are SMQ690A
, SMQ690B
, SMQ690C
.
The codes are SMQ690G
, SMQ690H
.
The codes are SMQ690E
, SMQ690K
.
Binary 0 (no) or 1 (yes)
measuring exposure to passive smoke at
certain places in the past 7 days,
e.g.,
While you were working at a job or business outside
of the home,
did someone else smoke cigarettes or other tobacco
products indoors?
The codes are SMQ858
, SMQ862
.
See above.
The codes are SMQ868
, SMQ872
.
See above.
The codes are SMQ876
, SMQ880
.
See above.
The code is SMQ940
.
The National Health and Nutrition Examination Survey
(NHANES)
is a well-known longitudinal study located in USA
(a country just north of Mexico).
This data frame shares a selection of variables mainly
to do with the smoking questionnaire
(codeword: SMQ
);
some demographic and anthropometric variables
might also be included and/or added later.
The significance of P
is that the
2019–2020 cycle was not completed due to Covid-19,
hence this data concerns 2017–2020 and is a merging of
the 2017–2018 cycle (data codenamed J
) with
further data collected just prior to the pandemic.
The original data has been preprocessed and/or simplified.
For example, "Don't know"
and "Refused"
usually have been converted to a NA
.
This data frame is subject to change, especially with the addition of new variables.
The URL https://wwwn.cdc.gov/Nchs/Nhanes/ provides an entry point from which many data sets may be downloaded. Comprehensive documentation is available there too.
The data was downloaded in late 2022 by Luca Frigau, University of Cagliari, and some subsequent edits were made by Thomas Yee.
summary(smqP)
This data is a subset from a survey of 49609 first-year college students in Taiwan collected in the year 2003 about their preferences for college study.
data(students.tw)
A data frame with 49609 observations on the following 12 response
variables.
For binary variables, a "1"
means yes
,
and "0"
means no
.
See below for exact wording (translated from the original Chinese).
ID
a numeric vector, a unique identification number for each student in the survey.
read
Read Chinese and foreign classics.
t.travel
Travel around Taiwan.
conference
Present academic papers in conferences.
act.leader
Lead large-scale activities.
team
Be on a school team.
stu.leader
Be a student association leader.
intern
Participate internship programs.
love
Fall in love.
sex
Have sexual experience.
o.travel
Travel around the world.
friends
Make many friends.
other
Other experience which is not included in the survey.
This data frame is a subset of a larger data set where any student with any missing value was deleted. The remaining data set contains 32792 students. Unfortunately, other variables such as age and sex were not made available.
Each student was asked the following multiple response question.
Question : What kind of experience do you expect to receive during the period of college study? (Select at least one response)
1. Read Chinese and foreign classics
2. Travel around Taiwan
3. Present academic papers in conferences
4. Lead large-scale activities
5. Be on a school team
6. Be a student association leader
7. Participate internship programs
8. Fall in love
9. Have sexual experience
10. Travel around the world
11. Make many friends
12. Other
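Because each option is stored as a 0/1 indicator, the popularity of each experience can be summarised directly; a minimal sketch (it assumes the identifier ID is the first column, as listed above):

data(students.tw)
round(sort(colMeans(students.tw[, -1]), decreasing = TRUE), 3)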
Originally, the data set was downloaded from a survey center of Academia Sinica, https://srda.sinica.edu.tw/news. It now seems unavailable.
Wang, H. and Huang, W. H. (2013) Bayesian Ranking Responses in Multiple Response Questions. Journal of the Royal Statistical Society, Series A, (to appear).
Help from Viet Hoang Quoc is gratefully acknowledged.
data(students.tw) summary(students.tw) with(students.tw, table(love, sex)) ## Not run: plot(jitter(sex) ~ jitter(love), data = students.tw, col = "blue", main = "Taiwanese students") ## End(Not run)
Fits the short-tailed symmetric distribution of Tiku and Vaughan (1999).
tikuv(d, lmean = "identitylink", lsigma = "loglink", isigma = NULL, zero = "sigma")
d |
The |
lmean , lsigma
|
Link functions for the mean and standard
deviation parameters of the usual univariate normal distribution
(see Details below).
They are |
isigma |
Optional initial value for |
zero |
A vector specifying which
linear/additive predictors are modelled as intercept-only.
The values can be from the set {1,2}, corresponding
respectively to |
The short-tailed symmetric distribution of Tiku and Vaughan (1999) has a probability density function that can be written
f(y; mu, sigma) = [K / (sqrt(2 * pi) * sigma)] * [1 + ((y - mu)^2) / (2 * h * sigma^2)]^2 * exp(-((y - mu)^2) / (2 * sigma^2))
where h = 2 - d > 0, K is a normalizing constant depending on h, -Inf < y < Inf, and sigma > 0. The mean of Y is E(Y) = mu and this is returned as the fitted values.
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
,
and vgam
.
Under- or over-flow may occur if the data is ill-conditioned, e.g., when d is very close to 2 or approaches -Inf.
The density function is the product of a univariate normal density and a polynomial in the response y. The distribution is bimodal if d > 0, else it is unimodal (a quick numerical check is sketched below). A normal distribution arises as the limit as d approaches -Inf, i.e., as h approaches Inf.
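A minimal sketch of that bimodality claim, comparing the density at the centre with the density slightly away from it, and checking that the density integrates to one:

dtikuv(0, d = 1) < dtikuv(1.4, d = 1)      # TRUE: a dip at the mean (bimodal)
dtikuv(0, d = -1) > dtikuv(1.4, d = -1)    # TRUE: a single central mode
integrate(dtikuv, -Inf, Inf, d = 1)$value  # Should be close to 1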
Fisher scoring is implemented.
After fitting the value of
d
is
stored in @misc
with
component name d
.
Thomas W. Yee
Akkaya, A. D. and Tiku, M. L. (2008). Short-tailed distributions and inliers. Test, 17, 282–296.
Tiku, M. L. and Vaughan, D. C. (1999). A family of short-tailed symmetric distributions. Technical report, McMaster University, Canada.
m <- 1.0; sigma <- exp(0.5) tdata <- data.frame(y = rtikuv(1000, d = 1, m = m, s = sigma)) tdata <- transform(tdata, sy = sort(y)) fit <- vglm(y ~ 1, tikuv(d = 1), data = tdata, trace = TRUE) coef(fit, matrix = TRUE) (Cfit <- Coef(fit)) with(tdata, mean(y)) ## Not run: with(tdata, hist(y, prob = TRUE)) lines(dtikuv(sy, d = 1, m = Cfit[1], s = Cfit[2]) ~ sy, data = tdata, col = "orange") ## End(Not run)
Density, cumulative distribution function, quantile function and random generation for the short-tailed symmetric distribution of Tiku and Vaughan (1999).
dtikuv(x, d, mean = 0, sigma = 1, log = FALSE) ptikuv(q, d, mean = 0, sigma = 1, lower.tail = TRUE, log.p = FALSE) qtikuv(p, d, mean = 0, sigma = 1, lower.tail = TRUE, log.p = FALSE, ...) rtikuv(n, d, mean = 0, sigma = 1, Smallno = 1.0e-6)
x , q
|
vector of quantiles. |
p |
vector of probabilities. |
n |
number of observations.
Same as in |
d , mean , sigma
|
arguments for the parameters of the distribution.
See |
Smallno |
Numeric, a small value used by the rejection method for determining
the lower and upper limits of the distribution.
That is, |
... |
Arguments that can be passed into |
log |
Logical.
If |
lower.tail , log.p
|
See tikuv
for more details.
dtikuv
gives the density,
ptikuv
gives the cumulative distribution function,
qtikuv
gives the quantile function, and
rtikuv
generates random deviates.
T. W. Yee and Kai Huang
## Not run: par(mfrow = c(2, 1)) x <- seq(-5, 5, len = 401) plot(x, dnorm(x), type = "l", col = "black", ylab = "", las = 1, main = "Black is standard normal, others are dtikuv(x, d)") lines(x, dtikuv(x, d = -10), col = "orange") lines(x, dtikuv(x, d = -1 ), col = "blue") lines(x, dtikuv(x, d = 1 ), col = "green") legend("topleft", col = c("orange","blue","green"), lty = rep(1, len = 3), legend = paste("d =", c(-10, -1, 1))) plot(x, pnorm(x), type = "l", col = "black", ylab = "", las = 1, main = "Black is standard normal, others are ptikuv(x, d)") lines(x, ptikuv(x, d = -10), col = "orange") lines(x, ptikuv(x, d = -1 ), col = "blue") lines(x, ptikuv(x, d = 1 ), col = "green") legend("topleft", col = c("orange","blue","green"), lty = rep(1, len = 3), legend = paste("d =", c(-10, -1, 1))) ## End(Not run) probs <- seq(0.1, 0.9, by = 0.1) ptikuv(qtikuv(p = probs, d = 1), d = 1) - probs # Should be all 0
Rainbow and brown trout trapped at the Te Whaiau Trap at Lake Otamangakau in the central North Island of New Zealand. The data were collected by the Department of Conservation.
data(trapO)
A data frame with 1226 observations on the following 15 variables.
Date as a class "Date"
variable.
numeric vectors,
the number of fish trapped daily.
B
/R
is for brown/rainbow trout.
F
/M
is for female/male.
TW
is for the Te Whaiau trap location
(there was another trap just off the Tongariro River).
numeric vectors, daily minimum and maximum ambient temperatures in Celsius.
numeric vector, daily rainfall that has been scaled between 0 (none) and 100 (flooding situation).
numeric vector, water level of the stream that has been scaled between 0 (none) and 100 (flooding situation). In a flooding situation it is possible that some fish going upstream were not caught.
numeric vectors,
extracted from Date
.
a numeric vector, Julian day of year. The value 1 means 1st of January, and so on up to 365.
a factor vector, the year as a factor.
similar to Date
but a fictitious year
is used for all the data. This allows all the
data to be plotted along one calendar year.
These are the daily numbers of fish trapped at the Te Whaiau trap near Lake Otamangakau, during the winter months when spawning is at its peak. These fish were all going upstream. There are two species of trout, split up by males and females, in the data set. The first is brown trout (Salmo trutta) and the second is rainbow trout (Oncorhynchus mykiss). Information on the movement patterns of brown and rainbow trout in Lake Otamangakau and Lake Te Whaiau can be found in Dedual et al. (2000).
Brown trout are more sedentary compared with rainbow trout, and spawning activities of brown trout occur between May and June whilst peak spawning of rainbow trout occurs between July and August. Furthermore, brown trout have been observed avoiding water above 19 degrees Celsius and optimum temperatures for growth are between 10–15 degrees for brown trout and 16.5–17.2 degrees for rainbow trout.
See also lakeO
.
Many thanks to Dr Michel Dedual (http://www.doc.govt.nz) for making this data available. Help from Simeon Pattenwise is also acknowledged.
Dedual, M. and Maxwell, I. D. and Hayes, J. W. and Strickland, R. R. (2000). Distribution and movements of brown (Salmo trutta) and rainbow trout (Oncorhynchus mykiss) in Lake Otamangakau, central North Island, New Zealand. New Zealand Journal of Marine and Freshwater Research, 34: 615–627.
data("trapO") summary(trapO)
data("trapO") summary(trapO)
Estimating the parameter of the triangle distribution by maximum likelihood estimation.
triangle(lower = 0, upper = 1, link = extlogitlink(min = 0, max = 1), itheta = NULL)
lower , upper
|
lower and upper limits of the distribution.
Must be finite.
Called |
link |
Parameter link function applied to the
parameter |
itheta |
Optional initial value for the parameter. The default is to compute the value internally. |
The triangle distribution has a probability density function that consists of two lines joined at theta, which is the location of the mode. The lines intersect the y = 0 axis at lower and upper. Here, Fisher scoring is used.
On fitting, the extra slot has components called lower and upper which contain the values of the above arguments (recycled to the right length).
The fitted values are the mean of the distribution, which is (lower + theta + upper) / 3.
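A minimal sketch checking the stated mean against simulated draws (rtriangle() is documented below; the defaults lower = 0 and upper = 1 are used):

theta <- 3/4
mean(rtriangle(5000, theta = theta))  # Should be near (0 + 3/4 + 1) / 3
(0 + theta + 1) / 3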
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
and vgam
.
The MLE regularity conditions do not hold for this
distribution
(e.g., the first derivative evaluated at the mode
does not exist because it is not continuous)
so that misleading inferences may result, e.g.,
in the summary
and vcov
of the object.
Additionally, convergence to the MLE often appears to fail.
The response must contain values in the interval (lower, upper).
For most data sets (especially small ones) it is very
common for half-stepping to occur.
Arguments lower
and upper
and link
must match.
For example, setting
lower = 0.2
and upper = 4
and
link = extlogitlink(min = 0.2, max = 4.1)
will result in an error.
Ideally link = extlogitlink(min = lower, max = upper)
ought to work but it does not (yet)!
Minimal error checking is done for this deficiency.
T. W. Yee
Kotz, S. and van Dorp, J. R. (2004). Beyond Beta: Other Continuous Families of Distributions with Bounded Support and Applications. Chapter 1. World Scientific: Singapore.
Nguyen, H. D. and McLachlan, G. J. (2016). Maximum likelihood estimation of triangular and polygon distributions. Computational Statistics and Data Analysis, 102, 23–36.
Triangle
,
Topple
,
simulate.vlm
.
## Not run: # Example 1 tdata <- data.frame(y = rtriangle(n <- 3000, theta = 3/4)) fit <- vglm(y ~ 1, triangle(link = "identitylink"), tdata, trace = TRUE) coef(fit, matrix = TRUE) Coef(fit) head(fit@extra$lower) head(fitted(fit)) with(tdata, mean(y)) # Example 2; Kotz and van Dorp (2004), p.14 rdata <- data.frame(y = c(0.1,0.25,0.3,0.4,0.45, 0.6, 0.75, 0.8)) fit <- vglm(y ~ 1, triangle(link = "identitylink"), rdata, trace = TRUE, crit = "coef", maxit = 1000) Coef(fit) # The MLE is the 3rd order statistic, which is 0.3. fit <- vglm(y ~ 1, triangle(link = "identitylink"), rdata, trace = TRUE, crit = "coef", maxit = 1001) Coef(fit) # The MLE is the 3rd order statistic, which is 0.3. ## End(Not run)
Density, distribution function, quantile function and random
generation for the Triangle distribution with parameter
theta
.
dtriangle(x, theta, lower = 0, upper = 1, log = FALSE) ptriangle(q, theta, lower = 0, upper = 1, lower.tail = TRUE, log.p = FALSE) qtriangle(p, theta, lower = 0, upper = 1, lower.tail = TRUE, log.p = FALSE) rtriangle(n, theta, lower = 0, upper = 1)
x , q
|
vector of quantiles. |
p |
vector of probabilities. |
n |
number of observations.
Same as |
theta |
the theta parameter which lies between lower and upper. |
lower , upper
|
lower and upper limits of the distribution. Must be finite. |
log |
Logical.
If |
lower.tail , log.p
|
See triangle, the VGAM family function for estimating the parameter theta by maximum likelihood estimation; however, the regularity conditions do not hold, so the algorithm crawls to the solution if lucky.
dtriangle
gives the density,
ptriangle
gives the distribution function,
qtriangle
gives the quantile function, and
rtriangle
generates random deviates.
T. W. Yee and Kai Huang
## Not run: x <- seq(-0.1, 1.1, by = 0.01); theta <- 0.75 plot(x, dtriangle(x, theta = theta), type = "l", col = "blue", las = 1, main = "Blue is density, orange is the CDF", sub = "Purple lines are the 10,20,...,90 percentiles", ylim = c(0,2), ylab = "") abline(h = 0, col = "blue", lty = 2) lines(x, ptriangle(x, theta = theta), col = "orange") probs <- seq(0.1, 0.9, by = 0.1) Q <- qtriangle(probs, theta = theta) lines(Q, dtriangle(Q, theta = theta), col = "purple", lty = 3, type = "h") ptriangle(Q, theta = theta) - probs # Should be all zero abline(h = probs, col = "purple", lty = 3) ## End(Not run)
The data set contains counts of the number of passengers entering London Underground stations (also known as the Tube) on a typical day of November 2010 in quarter-hour time blocks.
data("tube10")
data("tube10")
A data frame with 100 observations on the following 270 variables.
ActonTown
a numeric vector
Aldgate
a numeric vector
AldgateEast
a numeric vector
Alperton
a numeric vector
Amersham
a numeric vector
Angel
a numeric vector
Archway
a numeric vector
ArnosGrove
a numeric vector
Arsenal
a numeric vector
BakerStreet
a numeric vector
Balham
a numeric vector
BankAndMonument
a numeric vector
Barbican
a numeric vector
Barking
a numeric vector
Barkingside
a numeric vector
BaronsCourt
a numeric vector
Bayswater
a numeric vector
Becontree
a numeric vector
BelsizePark
a numeric vector
Bermondsey
a numeric vector
BethnalGreen
a numeric vector
Blackfriars
a numeric vector
BlackhorseRoad
a numeric vector
BondStreet
a numeric vector
Borough
a numeric vector
BostonManor
a numeric vector
BoundsGreen
a numeric vector
BowRoad
a numeric vector
BrentCross
a numeric vector
Brixton
a numeric vector
BromleyByBow
a numeric vector
BuckhurstHill
a numeric vector
BurntOak
a numeric vector
CaledonianRoad
a numeric vector
CamdenTown
a numeric vector
CanadaWater
a numeric vector
CanaryWharf
a numeric vector
CanningTown
a numeric vector
CannonStreet
a numeric vector
CanonsPark
a numeric vector
ChalfontAndLatimer
a numeric vector
ChalkFarm
a numeric vector
ChanceryLane
a numeric vector
CharingCross
a numeric vector
Chesham
a numeric vector
Chigwell
a numeric vector
ChiswickPark
a numeric vector
Chorleywood
a numeric vector
ClaphamCommon
a numeric vector
ClaphamNorth
a numeric vector
ClaphamSouth
a numeric vector
Cockfosters
a numeric vector
Colindale
a numeric vector
ColliersWood
a numeric vector
CoventGarden
a numeric vector
Croxley
a numeric vector
DagenhamEast
a numeric vector
DagenhamHeathway
a numeric vector
Debden
a numeric vector
DollisHill
a numeric vector
EalingBroadway
a numeric vector
EalingCommon
a numeric vector
EarlsCourt
a numeric vector
EastActon
a numeric vector
EastFinchley
a numeric vector
EastHam
a numeric vector
EastPutney
a numeric vector
Eastcote
a numeric vector
Edgware
a numeric vector
EdgwareRoadBak
a numeric vector
EdgwareRoadCir
a numeric vector
ElephantAndCastle
a numeric vector
ElmPark
a numeric vector
Embankment
a numeric vector
Epping
a numeric vector
Euston
a numeric vector
EustonSquare
a numeric vector
Fairlop
a numeric vector
Farringdon
a numeric vector
FinchleyCentral
a numeric vector
FinchleyRoad
a numeric vector
FinsburyPark
a numeric vector
FulhamBroadway
a numeric vector
GantsHill
a numeric vector
GloucesterRoad
a numeric vector
GoldersGreen
a numeric vector
GoldhawkRoad
a numeric vector
GoodgeStreet
a numeric vector
GrangeHill
a numeric vector
GreatPortlandStreet
a numeric vector
GreenPark
a numeric vector
Greenford
a numeric vector
Gunnersbury
a numeric vector
Hainault
a numeric vector
HammersmithDis
a numeric vector
HammersmithHC
a numeric vector
Hampstead
a numeric vector
HangerLane
a numeric vector
Harlesden
a numeric vector
HarrowAndWealdstone
a numeric vector
HarrowOnTheHill
a numeric vector
HattonCross
a numeric vector
HeathrowTerminals123
a numeric vector
HeathrowTerminal4
a numeric vector
HeathrowTerminal5
a numeric vector
HendonCentral
a numeric vector
HighBarnet
a numeric vector
HighStreetKensington
a numeric vector
HighburyAndIslington
a numeric vector
Highgate
a numeric vector
Hillingdon
a numeric vector
Holborn
a numeric vector
HollandPark
a numeric vector
HollowayRoad
a numeric vector
Hornchurch
a numeric vector
HounslowCentral
a numeric vector
HounslowEast
a numeric vector
HounslowWest
a numeric vector
HydeParkCorner
a numeric vector
Ickenham
a numeric vector
Kennington
a numeric vector
KensalGreen
a numeric vector
KensingtonOlympia
a numeric vector
KentishTown
a numeric vector
Kenton
a numeric vector
KewGardens
a numeric vector
Kilburn
a numeric vector
KilburnPark
a numeric vector
KingsCrossStPancras
a numeric vector
Kingsbury
a numeric vector
Knightsbridge
a numeric vector
LadbrokeGrove
a numeric vector
LambethNorth
a numeric vector
LancasterGate
a numeric vector
LatimerRoad
a numeric vector
LeicesterSquare
a numeric vector
Leyton
a numeric vector
Leytonstone
a numeric vector
LiverpoolStreet
a numeric vector
LondonBridge
a numeric vector
Loughton
a numeric vector
MaidaVale
a numeric vector
ManorHouse
a numeric vector
MansionHouse
a numeric vector
MarbleArch
a numeric vector
Marylebone
a numeric vector
MileEnd
a numeric vector
MillHillEast
a numeric vector
MoorPark
a numeric vector
Moorgate
a numeric vector
Morden
a numeric vector
MorningtonCrescent
a numeric vector
Neasden
a numeric vector
NewburyPark
a numeric vector
NorthActon
a numeric vector
NorthEaling
a numeric vector
NorthGreenwich
a numeric vector
NorthHarrow
a numeric vector
NorthWembley
a numeric vector
Northfields
a numeric vector
Northolt
a numeric vector
NorthwickPark
a numeric vector
Northwood
a numeric vector
NorthwoodHills
a numeric vector
NottingHillGate
a numeric vector
Oakwood
a numeric vector
OldStreet
a numeric vector
Osterley
a numeric vector
Oval
a numeric vector
OxfordCircus
a numeric vector
Paddington
a numeric vector
ParkRoyal
a numeric vector
ParsonsGreen
a numeric vector
Perivale
a numeric vector
PiccadillyCircus
a numeric vector
Pimlico
a numeric vector
Pinner
a numeric vector
Plaistow
a numeric vector
PrestonRoad
a numeric vector
PutneyBridge
a numeric vector
QueensPark
a numeric vector
Queensbury
a numeric vector
Queensway
a numeric vector
RavenscourtPark
a numeric vector
RaynersLane
a numeric vector
Redbridge
a numeric vector
RegentsPark
a numeric vector
Richmond
a numeric vector
Rickmansworth
a numeric vector
RodingValley
a numeric vector
RoyalOak
a numeric vector
Ruislip
a numeric vector
RuislipGardens
a numeric vector
RuislipManor
a numeric vector
RussellSquare
a numeric vector
SevenSisters
a numeric vector
ShepherdsBushCen
a numeric vector
ShepherdsBushHC
a numeric vector
SloaneSquare
a numeric vector
Snaresbrook
a numeric vector
SouthEaling
a numeric vector
SouthHarrow
a numeric vector
SouthKensington
a numeric vector
SouthKenton
a numeric vector
SouthRuislip
a numeric vector
SouthWimbledon
a numeric vector
SouthWoodford
a numeric vector
Southfields
a numeric vector
Southgate
a numeric vector
Southwark
a numeric vector
StJamessPark
a numeric vector
StJohnsWood
a numeric vector
StPauls
a numeric vector
StamfordBrook
a numeric vector
Stanmore
a numeric vector
StepneyGreen
a numeric vector
Stockwell
a numeric vector
StonebridgePark
a numeric vector
Stratford
a numeric vector
SudburyHill
a numeric vector
SudburyTown
a numeric vector
SwissCottage
a numeric vector
Temple
a numeric vector
TheydonBois
a numeric vector
TootingBec
a numeric vector
TootingBroadway
a numeric vector
TottenhamCourtRoad
a numeric vector
TottenhamHale
a numeric vector
TotteridgeAndWhetstone
a numeric vector
TowerHill
a numeric vector
TufnellPark
a numeric vector
TurnhamGreen
a numeric vector
TurnpikeLane
a numeric vector
Upminster
a numeric vector
UpminsterBridge
a numeric vector
Upney
a numeric vector
UptonPark
a numeric vector
Uxbridge
a numeric vector
Vauxhall
a numeric vector
Victoria
a numeric vector
WalthamstowCentral
a numeric vector
Wanstead
a numeric vector
WarrenStreet
a numeric vector
WarwickAvenue
a numeric vector
Waterloo
a numeric vector
Watford
a numeric vector
WembleyCentral
a numeric vector
WembleyPark
a numeric vector
WestActon
a numeric vector
WestBrompton
a numeric vector
WestFinchley
a numeric vector
WestHam
a numeric vector
WestHampstead
a numeric vector
WestHarrow
a numeric vector
WestKensington
a numeric vector
WestRuislip
a numeric vector
WestbournePark
a numeric vector
Westminster
a numeric vector
WhiteCity
a numeric vector
Whitechapel
a numeric vector
WillesdenGreen
a numeric vector
WillesdenJunction
a numeric vector
Wimbledon
a numeric vector
WimbledonPark
a numeric vector
WoodGreen
a numeric vector
WoodLane
a numeric vector
Woodford
a numeric vector
WoodsidePark
a numeric vector
Total
a numeric vector; the total over all stations.
mins24
a numeric vector; minutes on a 24 hour clock, e.g., 0 is midnight, 120 is 2am.
Each cell contains the number of passengers entering a station
during a quarter-hour period of time on a
typical day during November 2010.
The column names of the data frame are the station names, and most of the rows are the start time of each quarter-hour time block given in 24-hour time, e.g., 2215 means 10:15pm to 10:29pm.
The last four rows are
"Total"
,
"A.M.Peak"
,
"Interpeak"
,
"P.M.Peak"
.
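As a brief sketch of how this layout can be used (only the documented columns Total and mins24 are assumed), one can drop the four summary rows and plot the system-wide totals against the time of day:
data(tube10)
tube.blocks <- tube10[1:(4 * 24), ]  # keep the 96 quarter-hour rows only
with(tube.blocks,
     plot(mins24 / 60, Total, type = "h", col = "blue", las = 1,
          xlab = "Hour of day", ylab = "Entries per quarter hour"))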
The data has been adjusted to remove the effect of abnormal circumstances that may affect passenger numbers, such as industrial action.
The data came from the Transport for London (UK) website, https://www.tfl.gov.uk. It was downloaded in December 2013 and formatted into R by J. T. Gray (and slightly edited by T. W. Yee).
## Not run: data(tube10) waterloo <- tube10[1:(4*24), "Waterloo"] # Omit the totals and the peaks barplot(unlist(waterloo)) barplot(log10(1 + unlist(waterloo)), col = "lightblue", ylab = "log10(1+.)", las = 1) ## End(Not run)
About 800 students studying undergraduate statistics were asked many lifestyle questions.
data(ugss)
A data frame with 804 observations on the following 29 variables.
sex
Gender, a factor, (female or male)
age
age in years, a numeric vector
eyes
eye colour, a factor, (blue, brown, green, hazel or other)
piercings
Number of body piercings, a numeric vector
pierced
Any body piercings? a factor, (Yes or No)
tattoos
Number of tattoos, a numeric vector
tattooed
Any tattoos? a factor, (Yes or No)
glasses
Wears glasses etc.? a factor, (Yes or No)
sleep
Average number of hours of sleep per night, a numeric vector
study
Average number of hours of study per week, a numeric vector
tv
Average number of hours watching TV per week, a numeric vector
movies
Number of movies seen at a cinema during the last 3 months, a numeric vector
movies3m
Seen movies in last 3 months? a factor, (Yes or No)
sport
Favourite sport, a factor, about 19 of them
entertainment
Favourite entertainment, a factor, about 15 of them
fruit
Favourite fruit, a factor, about 13 of them
income
Average income during semester per week, a numeric vector
rent
Amount spent on rent or room and board per week, a numeric vector
clothes
Average amount spent on clothes per month, a numeric vector
hair
Average cost to get a hair-cut, a numeric vector
tobacco
Average amount spent on tobacco per week, a numeric vector
smokes
Smokes? a factor, (Yes or No)
alcohol
Average amount spent on alcohol per week, a numeric vector
buy.alcohol
Buys (purchases) alcohol? a factor, (Yes or No)
sendtxt
Average number of text messages sent per day, a numeric vector.
receivetxt
Average number of text messages received per day, a numeric vector.
txts
Uses text messaging? a factor, (Yes or No)
country
Country of birth, a factor, about 54 of them
status
Student status, a factor, (International, NZ.Citizen, NZ.Resident)
This data was collected online and anonymously in 2010. The respondents were students taking an undergraduate statistics course at a New Zealand university. Possibly there are duplicate students (due to failing and re-enrolling). All monies are in NZD. Note that the data has had only minimal checking. Most numerical variables tend to have measurement error, and all of them happen to be integer-valued.
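As a small sketch of how the data might be explored (it uses only variables documented above), one could cross-tabulate current smoking against sex and look at the distribution of nightly sleep:
with(ugss, table(sex, smokes))
with(ugss, hist(sleep, col = "lightblue", main = "",
                xlab = "Hours of sleep per night"))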
summary(ugss)
Information on inpatients discharged from hospitals in Vermont, USA, 2012.
data(vtinpat)
A data frame with 52206 observations on the following 7 variables.
hospital
a factor with levels 1
= Northwestern Medical Center, 2
= North Country Hospital and Health Center,
3
= Northeastern Vermont Regional Hospital, 4
= Copley Hospital, 5
= Fletcher Allen Health Care,
6
= Central Vermont Hospital, 8
= Rutland Regional Medical Center, 9
= Porter Medical Center,
10
= Gifford Memorial Hospital, 11
= Mount Ascutney Hospital and Health Center, 12
= Springfield Hospital,
14
= Grace Cottage Hospital, 15
= Brattleboro Memorial Hospital, 16
= Southwestern Vermont Medical Center
admit
a factor with levels 1
= Emergency, 2
= Urgent, 3
= Elective, 4
= Newborn,
5
= Trauma
age.group
a factor with levels 1
= Under 1, 2
= 1-17, 3
= 18-24, 4
= 25-29, 5
= 30-34,
6
= 35-39, 7
= 40-44, 8
= 45-49, 9
= 50-54, 10
= 55-59, 11
= 60-64, 12
= 65-69,
13
= 70-74, 14
= 75+
sex
a factor with levels 1
= Male, 2
= Female
discharge
a factor with levels 1
= To another medical facility, 2
= home, 3
= against medical advice,
4
= Died, 5
= To court or law enforcement, 6
= still a patient
diagnosis
a factor with levels 1
= Brain And C.N.S., 2
= Eye, 3
= Ear, Nose & Throat,
4
= Respiratory, 5
= Heart & Circulatory, 6
= Digestive, 7
= Liver & Pancreas, 8
= Musculoskeletal,
9
= Skin and Breast, 10
= Endocrine, 11
= Kidney & Urinary, 12
= Male Reproductive,
13
= Female Reproductive, 14
= Pregnancy, Childbirth, 15
= Neonatal, 16
= Spleen & Blood,
17
= Lymphatic, 18
= Infection, 19
= Mental Illness, 20
= Substance Abuse, 21
= Injury, Toxic Effects,
22
= Burns, 23
= Other, 24
= Trauma, 25
= H.I.V.
los
a numeric vector, number of days spent in hospital
This data set contains details on inpatients discharged from hospitals in Vermont, USA, in 2012 as part of the Vermont Uniform Hospital Discharge Data Set. The Vermont Department of Financial Regulation administers this program and the Vermont Department of Health manages the data set.
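A minimal sketch, using only the variables documented above, comparing length of stay across admission types:
with(vtinpat, table(admit))
with(vtinpat, boxplot(los ~ admit, col = "lightblue", las = 1,
                      xlab = "Admission type", ylab = "Length of stay (days)"))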
Vermont Department of Health, https://www.healthvermont.gov/stats/systems; formatted into R by J. T. Gray in mid-2014.
summary(vtinpat)
Capture records of the 2008 FIPS-MOUCHE World Fly Fishing Championships held in Rotorua, New Zealand during 22–30 March 2008.
data(wffc)
A data frame with 4267 observations on the following 8 variables. Each row is a recorded capture.
length
a numeric vector; length of fish in mm.
water
a factor with levels
Waihou
, Waimakariri
, Whanganui
,
Otamangakau
, Rotoaira
.
These are known as Sectors IV, V, I, II, III respectively,
and are also represented by the variable sector
.
session
a numeric vector; a value from the set 1,2,...,6. These are time ordered, and there were two sessions per competition day.
sector
a numeric vector; a value from the set 1,2,...,5. Ideally these should be converted to a factor.
beatboat
a numeric vector; beat or boat number, a value from the set 1,2,...,19. Ideally these should be converted to a factor. For a river though, they are contiguous whereas on a lake it is less so.
comid
a numeric vector; the competitor's ID number. Uniquely identifies each competitor. These ID numbers actually correspond to their rankings at the individual level.
iname
a character vector; the individual competitor's name.
country
a character vector; what country the competitors represented. The countries represented were Australia (AUS), Canada (CAN), Croatia (CRO), Czech Republic (CZE), England (ENG), Finland (FIN), France (FRA), Holland (NED), Ireland (IRE), Italy (ITA), Japan (JPN), Malta (MAL), New Zealand (NZL), Poland (POL), Portugal (POR), South Africa (RSA), Slovakia (SVK), USA (USA), Wales (WAL).
Details may be obtained at Yee (2010) and Yee (2014). Here is a brief summary. The three competition days were 28–30 March. Each session was fixed at 9.00am–12.00pm and 2.30–5.30pm daily. One of the sessions was a rest session. Each of 19 teams had 5 members, called A, B, C, D and E (there was a composite team, actually). The scoring system allocated 100 points to each eligible fish (minimum length was 18 cm) and 20 points for each cm of its length (rounded up to the nearest centimeter). Thus a 181mm or 190mm fish was worth 480 points. Each river was divided into 19 contiguous downstream beats labelled 1,2,...,19. Each lake was fished by 9 boats, each with two competitors except for one boat which only had one. Each competitor was randomly assigned to a beat/boat.
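This point system is implemented by wffc.P1() (documented elsewhere in this package), so the arithmetic can be checked with a small sketch, assuming the default arguments and lengths given in meters:
wffc.P1(c(0.181, 0.190))  # both round up to 19 cm: 100 + 20 * 19 = 480 points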
Competitors were ranked according to their placings at each sector-session combination, and then these placings were summed. Those with the minimum total placings were the winners, thus it was not necessarily those who had the maximum points who won. For example, in Session 1 at the Waihou River, each of the 19 competitors was ranked 1 (best) to 19 (worst) according to the point system. This is the “placing” for that session. These placings were added up over the 5 sessions to give the “total placings”.
All sectors have naturally wild Rainbow trout (Oncorhynchus mykiss) while Lake Otamangakau and the Whanganui River also hold Brown trout (Salmo trutta). Only these two species were targeted. The species was not recorded electronically; however, a post-analysis of the paper score sheets from the two lakes showed that, approximately, less than 5 percent were Brown trout. It may be safely assumed that all the Waihou and Waimakariri fish were Rainbow trout. The gender of the fish was also not recorded electronically, and anyway, distinguishing between male and female was very difficult for small fish.
Although species and gender data were supposed to have been collected at the time of capture, the quality of these variables is rather poor and, furthermore, they were not recorded electronically.
Note that some fish may have been caught more than once, hence these data do not represent individual fish but rather recorded captures.
Note also that a few internal discrepancies
may be found within
and between the data frames
wffc
,
wffc.nc
,
wffc.indiv
,
wffc.teams
.
This is due to various reasons, such as
competitors being replaced by reserves when sick,
fish that were included or excluded upon the
local judge's decision,
competitors who fished two hours instead of
three by mistake, etc.
The data has already been cleaned of errors
and internal inconsistencies
but a few may remain.
This data frame was adapted from the WFFC's spreadsheet. Special thanks goes to Paul Dewar, Jill Mandeno, Ilkka Pirinen, and the other members of the Organising Committee of the 28th FIPS-Mouche World Fly Fishing Championships for access to the data. The assistance and feedback of Colin Shepherd is gratefully acknowledged.
Yee, T. W. (2010). VGLMs and VGAMs: an overview for applications in fisheries research. Fisheries Research, 101, 116–126.
Yee, T. W. (2014). Scoring rules, and the role of chance: analysis of the 2008 World Fly Fishing Championships. Journal of Quantitative Analysis in Sports. 10, 397–409.
wffc.indiv
,
wffc.teams
,
wffc.nc
,
wffc.P1
,
lakeO
.
summary(wffc) with(wffc, table(water, session)) # Obtain some simple plots waihou <- subset(wffc, water == "Waihou") waimak <- subset(wffc, water == "Waimakariri") whang <- subset(wffc, water == "Whanganui") otam <- subset(wffc, water == "Otamangakau") roto <- subset(wffc, water == "Rotoaira") minlength <- min(wffc[, "length"]) maxlength <- max(wffc[, "length"]) nwater <- c("Waihou" = nrow(waihou), "Waimakariri" = nrow(waimak), "Whanganui" = nrow(whang), "Otamangakau" = nrow(otam), "Rotoaira" = nrow(roto)) ## Not run: par(mfrow = c(2, 3), las = 1) # Overall distribution of length with(wffc, boxplot(length/10 ~ water, ylim = c(minlength, maxlength)/10, border = "blue", main = "Length (cm)", cex.axis = 0.5)) # Overall distribution of LOG length with(wffc, boxplot(length/10 ~ water, ylim = c(minlength, maxlength)/10, border = "blue", log = "y", cex.axis = 0.5, main = "Length (cm) on a log scale")) # Overall distribution of number of captures pie(nwater, border = "blue", main = "Proportion of captures", labels = names(nwater), density = 10, col = 1:length(nwater), angle = 85+30* 1:length(nwater)) # Overall distribution of number of captures with(wffc, barplot(nwater, main = "Number of captures", cex.names = 0.5, col = "lightblue")) # Overall distribution of proportion of number of captures with(wffc, barplot(nwater / sum(nwater), cex.names = 0.5, col = "lightblue", main = "Proportion of captures")) # An interesting lake with(roto, hist(length/10, xlab = "Fish length (cm)", col = "lightblue", breaks = seq(18, 70, by = 3), prob = TRUE, ylim = c(0, 0.08), border = "blue", ylab = "", main = "Lake Rotoaira", lwd = 2)) ## End(Not run)
Individual competitors' results of the 2008 FIPS-MOUCHE World Fly Fishing Championships held in Rotorua, New Zealand during 22–30 March 2008.
data(wffc.indiv)
A data frame with 99 observations on the
following 8 variables.
Some of these variables are described
in wffc
.
totalPlacings
a numeric vector; these are the summed placings over the 5 sessions.
points
a numeric vector.
noofcaptures
a numeric vector.
longest
a numeric vector.
individual
a numeric vector; did the competitor fish in a team or as an individual? (one team was made of composite countries due to low numbers).
country
a character vector.
iname
a character vector.
comid
a numeric vector.
This data frame gives the individual results
of the competition.
See also wffc
and wffc.teams
for more
details and links.
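A minimal sketch using only the variables documented above: since the winners are those with the smallest summed placings rather than the most points, plotting points against totalPlacings shows how closely the two criteria agree.
with(wffc.indiv, plot(totalPlacings, points, col = "blue", las = 1,
                      xlab = "Total placings (smaller is better)",
                      ylab = "Points"))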
Yee, T. W. (2010). VGLMs and VGAMs: an overview for applications in fisheries research. Fisheries Research, 101, 116–126.
summary(wffc.indiv)
Number of captures in the 2008 FIPS-MOUCHE World Fly Fishing Championships held in Rotorua, New Zealand during 22–30 March 2008.
data(wffc.nc)
A data frame with 475 observations on the
following 7 variables.
Most of these variables are described
in wffc
.
Each row is sorted by sector, session and beat.
sector
a numeric vector.
session
a numeric vector.
beatboat
a numeric vector.
numbers
a numeric vector.
comid
a numeric vector.
iname
a character vector.
country
a character vector.
This data frame was obtained by
processing wffc
.
The key variable is numbers
, which is
sector-session-beat specific.
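A brief sketch: because numbers is sector-session-beat specific, it can be aggregated, e.g., to give the total number of captures for every sector-session combination (cf. the row counts in the example below).
xtabs(numbers ~ sector + session, data = wffc.nc)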
Note that some fish may have been caught more than once, hence these data do not represent individual fish.
Yee, T. W. (2010). VGLMs and VGAMs: an overview for applications in fisheries research. Fisheries Research, 101, 116–126.
xtabs( ~ sector + session, wffc.nc)
Point system for the 2008 World Fly Fishing Championships: current and some proposals.
wffc.P1(length, c1 = 100, min.eligible = 0.18, ppm = 2000) wffc.P2(length, c1 = 100, min.eligible = 0.18, ppm = 2000, c.quad = 12700) wffc.P3(length, c1 = 100, min.eligible = 0.18, ppm = 2000) wffc.P1star(length, c1 = 100, min.eligible = 0.18, ppm = 2000) wffc.P2star(length, c1 = 100, min.eligible = 0.18, ppm = 2000, c.quad = 12700) wffc.P3star(length, c1 = 100, min.eligible = 0.18, ppm = 2000)
length |
Length of the fish, in meters. Numeric vector. |
c1 |
Points added to each eligible fish. |
min.eligible |
The 2008 WFFC regulations stipulated that the smallest eligible fish was 0.180 m, which is 180 mm. |
ppm |
Points per meter of length of the fish. |
c.quad |
Constants for the quadratic terms.
The defaults
mean that a fish twice the minimum legal size will award
about 50 percent
more points compared to wffc.P1(). |
The official website contains a document with
the official rules and
regulations of the competition.
The function wffc.P1()
implements the
current WFFC point system,
and is ‘discrete’ in that fish lengths are
rounded up to the nearest
centimeter (provided it is greater than or equal
to min.eligible
m).
wffc.P1star()
is a continuous version
of it in that it is
piecewise linear with two pieces and it is discontinuous at
min.eligible
.
The function wffc.P2()
is a new proposal which
rewards catching bigger fish.
It is based on a quadratic polynomial.
wffc.P2star()
is a continuous version of it.
The function wffc.P3()
is another new proposal which
rewards catching bigger fish.
Named a cumulative linear proposal,
its slope is ppm
between min.eligible
and
2 * min.eligible
,
its slope is 2 * ppm
between 2 * min.eligible
and
3 * min.eligible
,
its slope is 3 * ppm
between 3 * min.eligible
and
4 * min.eligible
, etc.
One adds the usual c1
to each eligible fish.
wffc.P3star()
is a continuous
version of wffc.P3()
.
The function wffc.P4()
is another new proposal which
rewards catching bigger fish.
Named a cumulative linear proposal,
its slope is ppm between min.eligible and 1.5 * min.eligible, its slope is 2 * ppm between 1.5 * min.eligible and 2 * min.eligible, its slope is 3 * ppm between 2 * min.eligible and 2.5 * min.eligible, etc.
One adds the usual c1
to each eligible fish.
wffc.P4star()
is a continuous
version of wffc.P4()
.
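A small sketch comparing the point systems at a few lengths (in meters) with the default arguments; only functions listed in the usage above are used:
lens <- c(0.18, 0.27, 0.36, 0.45)
cbind(length = lens,
      P1 = wffc.P1(lens), P2 = wffc.P2(lens), P3 = wffc.P3(lens))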
A vector with the number of points.
wffc.P2
and wffc.P2star
may
change in the future,
as well as possibly
wffc.P3
and wffc.P3star
and
wffc.P4
and wffc.P4star
.
T. W. Yee.
Yee, T. W. (2014). Scoring rules, and the role of chance: analysis of the 2008 World Fly Fishing Championships. Journal of Quantitative Analysis in Sports. 10, 397–409.
wffc
.
## Not run: fishlength <- seq(0.0, 0.72, by = 0.001) plot(fishlength, wffc.P2star(fishlength), type = "l", col = "blue", las = 1, lty = "dashed", lwd = 2, las = 1, cex.main = 0.8, xlab = "Fish length (m)", ylab = "Competition points", main = "Current (red) & proposed (blue&green) WFFC point system") lines(fishlength, wffc.P1star(fishlength), col = "red", lwd = 2) lines(fishlength, wffc.P3star(fishlength), col = "darkgreen", lwd = 2, lty = "dashed") lines(fishlength, wffc.P4star(fishlength), col = "orange", lwd = 2, lty = "dashed") abline(v = (1:4) * 0.18, lty = "dotted") abline(h = (1:13) * wffc.P1star(0.18), lty = "dotted") ## End(Not run) # Successive slopes: (wffc.P1star((2:8)*0.18) - wffc.P1star((1:7)*0.18)) / (0.18 * 2000) (wffc.P3star((2:8)*0.18) - wffc.P3star((1:7)*0.18)) / (0.18 * 2000) (wffc.P4star((2:8)*0.18) - wffc.P4star((1:7)*0.18)) / (0.18 * 2000) # More successive slopes: MM2 <- 0.18 / 2 ind1 <- 2:12 (wffc.P4star((ind1)*MM2) - wffc.P4star((ind1-1)*MM2)) / (MM2 * 2000) # About 50 percent more points: wffc.P2 (2 * 0.18) / wffc.P1 (2 * 0.18) wffc.P2star(2 * 0.18) / wffc.P1star(2 * 0.18)
Team results of the 2008 FIPS-MOUCHE World Fly Fishing Championships held in Rotorua, New Zealand during 22–30 March 2008.
data(wffc.teams)
A data frame with 18 observations on the
following 5 variables.
Some of these variables are described
in wffc
.
country
a character vector.
totalPlacings
a numeric vector; these are the summed placings over the 5 sessions and 5 team members.
points
a numeric vector;
see wffc
.
noofcaptures
a numeric vector.
longestfish
a numeric vector.
This data frame gives the team results of the competition.
See also wffc
and wffc.indiv
for more
details and links.
wffc.teams
A cross-sectional data set from a New Zealand company workforce study, plus another health survey, collected during the 1990s.
data(xs.nz)
A data frame with 10529 observations on the
following 64 variables.
For binary variables, a "1"
or TRUE
means yes
,
and "0"
or FALSE
means no
.
Also, "D"
means don't know,
and "-"
means not applicable.
The pregnancy questions were administered to women only.
regnum
a numeric vector, a unique registration number. This differs from their original registration number, and the rows are sorted by their new registration number.
study1
a logical vector, Study 1 (workforce) or Study 2?
age
a numeric vector, age in years.
sex
a factor with levels F
and M
.
pulse
a numeric vector, beats per minute.
sbp
a numeric vector, systolic blood pressure (mm Hg).
dbp
a numeric vector, diastolic blood pressure (mm Hg).
cholest
a numeric vector, cholesterol (mmol/L).
height
a numeric vector, in m.
weight
a numeric vector, in kg.
fh.heartdisease
a factor
with levels 0
, 1
,
D
.
Has a family history of heart disease
(heart attack, angina, or
had a heart bypass operation) within the immediate
family (brother, sister, father or mother,
blood relatives only)?
Note that D
means: do not know.
fh.age
a factor, following
from fh.heartdisease
,
if yes, how old was the family member when
it happened (if
more than one family member, give the age of the
youngest person)?
fh.cancer
a factor with levels 0
, 1
,
D
.
Has a family history of cancer within the immediate
family (blood relatives only)?
Note that D
means: do not know.
heartattack
a numeric vector, have you ever been told by a doctor that you have had a heart attack ("coronary")?
stroke
a numeric vector, have you ever been told by a doctor that you have had a stroke?
diabetes
a numeric vector, have you ever been told by a doctor that you have had diabetes?
hypertension
a numeric vector, have you ever been told by a doctor that you have had high blood pressure (hypertension)?
highchol
a numeric vector, have you ever been told by a doctor that you have had high cholesterol?
asthma
a numeric vector, have you ever been told by a doctor that you have had asthma?
cancer
a numeric vector, have you ever been told by a doctor that you have had cancer?
acne
a numeric vector, have you ever received treatment from a doctor for acne (pimples)?
sunburn
a numeric vector, have you ever received treatment from a doctor for sunburn?
smokepassive
a numeric vector, on average,
how many hours each week (at work and at home) would you
spend near someone who is smoking?
(put "0"
if none)
smokeever
a numeric vector, have you ever smoked tailor-made or roll-your-own cigarettes once a week or more? A 1 means yes and 0 means no.
smokenow
a numeric vector, do you smoke tailor-made or roll-your-own cigarettes now? A 1 means yes and 0 means no.
smokeagequit
a factor,
if no to smokenow
, how old were you when
you stopped smoking?
Using as.numeric(as.character(smokeagequit))
will work for those values which are not
as.character(smokeagequit) == "-"
.
smokeyears
a numeric vector,
if yes to smokeever
, for how many years altogether
have you smoked tailor-made or roll-you-own cigarettes?
smoketailormade
a numeric vector, how many tailor-made cigarettes do you smoke each day?
smokeweekpack
a numeric vector, how many
packets of roll-your-own tobacco do you use
each week?
(put "0"
if none)
smokepacketsize
a numeric vector,
what size packets of roll-your-own tobacco do you
usually buy?
("0"
means don't smoke roll-your-owns,
else 25g or 30g or 35g or 50g)
drinkmonth
a numeric vector, do you drink alcohol once a month or more?
drinkfreqweek
a numeric vector,
if yes to drinkmonth
, about how often do you
drink alcohol (days per week)?
Note: 0.25 is once a month,
0.5 is once every two weeks,
1 is once a week,
2.5 is 2-3 days a week,
4.5 is 4-5 days a week,
6.5 is 6-7 days a week.
Further note: 1 can, small bottle or handle of beer or home brew = 1 drink, 1 quart bottle of beer = 2 drinks, 1 jug of beer = 3 drinks, 1 flagon/peter of beer = 6 drinks, 1 glass of wine, sherry = 1 drink, 1 bottle of wine = 6 drinks, 1 double nip of spirits = 1 drink.
drinkweek
a numeric vector,
how many drinks per week, on average.
This is the average daily amount of drinks multiplied
by the frequency of drinking per week.
See drinkfreqweek
on what constitutes a 'drink'.
drinkmaxday
a numeric vector, in the last three months, what is the largest number of drinks that you had on any one day? Warning: some values are considered unrealistically excessive.
eggs
a numeric vector, how many eggs do you eat a week (raw, boiled, scrambled, poached, or in quiche)?
chocbiscuits
a numeric vector, how many chocolate biscuits do you usually eat in a week?
pregnant
a factor, have you ever been pregnant for more than 5 months?
pregfirst
a factor, if
yes to pregnant
, how old were you when your first
baby was born (or you had a miscarriage after 5 months)?
preglast
a factor, how old were you when your last baby was born (or you had a miscarriage after 5 months)?
babies
numeric, how many babies have you given birth to?
moody
a numeric vector, does your mood often go up or down?
miserable
a numeric vector, do you ever feel 'just miserable' for no reason?
hurt
a numeric vector, are your feelings easily hurt?
fedup
a numeric vector, do you often feel 'fed up'?
nervous
a numeric vector, would you call yourself a nervous person?
worrier
a numeric vector, are you a worrier?
worry
a numeric vector, do you worry about awful things that might happen?
tense
a numeric vector, would you call yourself tense or 'highly strung'?
embarrassed
a numeric vector, do you worry too long after an embarrassing experience?
nerves
a numeric vector, do you suffer from 'nerves'?
nofriend
a numeric vector,
do you have a friend or family member that you
can talk to about problems or worries that you may have?
The value 1 effectively means "no"
,
i.e., s/he has no friend or friends.
depressed
a numeric vector, in your lifetime, have you ever had two weeks or more when nearly every day you felt sad or depressed?
exervig
a numeric vector, how many hours per week would you do any vigorous activity or exercise either at work or away from work that makes you breathe hard and sweat? Values here ought to be less than 168.
exermod
a numeric vector, how many hours per week would you do any moderate activity or exercise such as brisk walking, cycling or mowing the lawn? Values here ought to be less than 168.
feethour
a numeric vector, on an average work day, how long would you spend on your feet, either standing or moving about?
ethnicity
a factor with 4 levels,
what ethnic group do you belong to?
European
= European (NZ European or
British or other European),
Maori
= Maori,
Polynesian
= Pacific Island Polynesian,
Other
= Other (Chinese, Indian, Other).
sleep
a numeric vector, how many hours do you usually sleep each night?
snore
a factor with levels 0
, 1
,
D
.
Do you usually snore?
Note that D
means: do not know.
cat
a numeric vector, do you have a household pet cat?
dog
a numeric vector, do you have a household pet dog?
hand
a factor with levels
right
= right,
left
= left,
both
= either.
Are you right-handed,
left-handed,
or no preference for left or right?
numhouse
an ordered factor with 4 levels:
1
= 1,
2
= 2,
3
= 3,
4+
= four or more;
how many people (including yourself)
usually live in your house?
marital
a factor with 4 levels:
single
= single,
married
= married or living with a partner,
separated
= separated or divorced,
widowed
= widowed.
educ
an ordered factor with 4 levels:
primary
= Primary school,
secondary
= High school/secondary school,
polytechnic
= Polytechnic or similar,
university
= University.
What was the highest level of education you received?
The data frame is a subset of the entire data set which was collected from a confidential self-administered questionnaire administered in a large New Zealand workforce observational study conducted during 1992–3. The data were augmented by a second study consisting of retirees. The data can be considered a reasonable representation of the white male New Zealand population in the early 1990s. There were physical, lifestyle and psychological variables that were measured. The psychological variables were headed "Questions about your feelings".
Although some data cleaning was performed and logic checks
conducted, anomalies remain. Some variables, of course,
are subject to a lot of measurement error and bias. It is
conceivable that some participants had poor reading skills!
In particular, the smoking variables contain a small
percentage of conflicting values, and when NA
s are taken
into account then there would be several different ways
the data might be cleaned.
If smokeever == 0 then, strictly speaking, only smokepassive remains relevant; the other smoking variables should either be NA or 0.
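For example, a minimal consistency check along these lines, using only the documented variables:
# Ever-smokers vs current smokers, keeping NAs visible; cells with
# smokeever == 0 but smokenow == 1 indicate conflicting records.
with(xs.nz, table(smokeever, smokenow, useNA = "ifany"))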
More variables may be added in the future and these may
be placed in any column position. Therefore references
such as xs.nz[, 12]
are dangerous.
Also, variable names may change in the future as well as
their format or internal structure,
e.g., factor
versus numeric
.
More error checking is needed for the pregnancy and smoking variables.
Originally,
Clinical Trials Research Unit,
University of Auckland, New Zealand,
http://www.ctru.auckland.ac.nz
.
Originally much of the error checking and formatting was
performed by Stephen Vander Hoorn.
Lately (2014), more changes and error checks were made to the
data by James T. Gray.
MacMahon, S., Norton, R., Jackson, R., Mackie, M. J., Cheng, A., Vander Hoorn, S., Milne, A., McCulloch, A. (1995). Fletcher Challenge-University of Auckland Heart & Health Study: design and baseline findings. New Zealand Medical Journal, 108, 499–502.
data(xs.nz) summary(xs.nz)
Fits a zero-inflated Poisson distribution based on Yip (1988).
yip88(link = "loglink", n.arg = NULL, imethod = 1)
link |
Link function for the usual lambda parameter. |
n.arg |
The total number of observations in the data set. Needed when the response variable has all the zeros deleted from it, so that the number of zeros can be determined. |
imethod |
Details at |
The method implemented here, Yip (1988), maximizes
a conditional
likelihood. Consequently, the methodology used here
deletes the
zeros from the data set, and is thus related to the
positive Poisson
distribution (where P(Y = 0) = 0).
The probability function of Y is 0 with probability phi, and Poisson(lambda) with probability 1 - phi. Thus
P(Y = 0) = phi + (1 - phi) * P(W = 0)
where W is Poisson(lambda).
The mean, (1 - phi) * lambda, can be obtained by the extractor function fitted applied to the object.
This family function treats phi as a scalar. If you want to model both phi and lambda as a function of covariates, try zipoisson.
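As a small sanity check of the zero probability above (a sketch using dzipois() from the See Also list, where pstr0 plays the role of phi):
phi <- 0.35; lambda <- 2
dzipois(0, lambda = lambda, pstr0 = phi)  # P(Y = 0)
phi + (1 - phi) * dpois(0, lambda)        # same value, from the formula above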
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
,
rrvglm
and vgam
.
Under- or over-flow may occur if the data is
ill-conditioned.
Yip (1988) only considered
phi being a scalar and not
modelled as a function of covariates. To get
around this limitation,
try
zipoisson
.
Inference obtained from summary.vglm
and summary.vgam
may or may not be correct. In particular,
the p-values, standard
errors and degrees of freedom may need adjustment.
Use simulation on
artificial data to check that these are reasonable.
The data may be inputted in two ways.
The first is when the response is
a vector of positive values, with the
argument n
in yip88
specifying the total number of observations.
The second is to simply
include all the data in the response.
In this case, the zeros are
trimmed off during the computation, and
the x
and y
slots of the object, if assigned, will reflect this.
The estimate of phi is placed in
the
misc
slot as
@misc$pstr0
. However, this estimate is
computed only for intercept
models, i.e., the formula is of the form y ~ 1
.
Thomas W. Yee
Yip, P. (1988). Inference about the mean of a Poisson distribution in the presence of a nuisance parameter. The Australian Journal of Statistics, 30, 299–306.
Angers, J-F. and Biswas, A. (2003). A Bayesian analysis of zero-inflated generalized Poisson model. Computational Statistics & Data Analysis, 42, 37–46.
zipoisson
,
Zipois
,
zapoisson
,
pospoisson
,
poissonff
,
dzipois
.
phi <- 0.35; lambda <- 2 # Generate some artificial data y <- rzipois(n <- 1000, lambda, phi) table(y) # Two equivalent ways of fitting the same model fit1 <- vglm(y ~ 1, yip88(n = length(y)), subset = y > 0) fit2 <- vglm(y ~ 1, yip88, trace = TRUE, crit = "coef") (true.mean <- (1-phi) * lambda) mean(y) head(fitted(fit1)) fit1@misc$pstr0 # The estimate of phi # Compare the ZIP with the positive Poisson distribution pp <- vglm(y ~ 1, pospoisson, subset = y > 0, crit = "c") coef(pp) Coef(pp) coef(fit1) - coef(pp) # Same head(fitted(fit1) - fitted(pp)) # Different # Another example (Angers and Biswas, 2003) --------------------- abdata <- data.frame(y = 0:7, w = c(182, 41, 12, 2, 2, 0, 0, 1)) abdata <- subset(abdata, w > 0) yy <- with(abdata, rep(y, w)) fit3 <- vglm(yy ~ 1, yip88(n = length(yy)), subset = yy > 0) fit3@misc$pstr0 # phi estimate (they get 0.5154 with SE 0.0707) coef(fit3, matrix = TRUE) Coef(fit3) # Estimate of lambda (they get 0.6997 with SE 0.1520) head(fitted(fit3)) mean(yy) # Compare this with fitted(fit3)