Presence‐Only Data and the EM Algorithm

28 May 2009

journal article
Published by Oxford University Press (OUP) in Biometrics

Vol. 65 (2), 554-563
https://doi.org/10.1111/j.1541-0420.2008.01116.x

Abstract

In ecological modeling of the habitat of a species, it can be prohibitively expensive to determine species absence. Presence-only data consist of a sample of locations with observed presences and a separate group of locations sampled from the full landscape, with unknown presences. We propose an expectation–maximization algorithm to estimate the underlying presence–absence logistic model for presence-only data. This algorithm can be used with any off-the-shelf logistic model. For models with stepwise fitting procedures, such as boosted trees, the fitting process can be accelerated by interleaving expectation steps within the procedure. Preliminary analyses based on sampling from presence–absence records of fish in New Zealand rivers illustrate that this new procedure can reduce both deviance and the shrinkage of marginal effect estimates that occur in the naive model often used in practice. Finally, it is shown that the population prevalence of a species is only identifiable when there is some unrealistic constraint on the structure of the logistic model. In practice, it is strongly recommended that an estimate of population prevalence be provided.
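
To make the idea in the abstract concrete, the following is a minimal sketch of an EM loop for presence-only data, written in Python with NumPy. It is not the authors' implementation. The function names (`sigmoid`, `fit_logistic`, `em_presence_only`), the plain linear logistic model fitted by Newton steps, and the case-control intercept offset log((n_p + pi * n_u) / (pi * n_u)) are all assumptions of this sketch; the abstract itself only states that an EM algorithm is wrapped around a logistic model and that a prevalence estimate pi should be supplied.

```python
import numpy as np


def sigmoid(t):
    # Clip the linear predictor so exp() cannot overflow.
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30.0, 30.0)))


def fit_logistic(X, y, n_iter=25):
    """Logistic regression by Newton steps; y may hold fractional labels in [0, 1]."""
    beta = np.zeros(X.shape[1])
    ridge = 1e-8 * np.eye(X.shape[1])  # tiny ridge term for numerical stability
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None]) + ridge
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def em_presence_only(X_p, X_u, prevalence, n_em=50):
    """Hypothetical EM sketch.

    X_p: covariates at observed presences, shape (n_p, d).
    X_u: covariates at background locations with unknown presence, shape (n_u, d).
    prevalence: user-supplied estimate of population prevalence (pi).
    Returns (beta, y_u): coefficients on the population scale (intercept first)
    and the imputed presence probabilities for the background locations.
    """
    n_p, n_u = len(X_p), len(X_u)
    X = np.vstack([X_p, X_u])
    X = np.hstack([np.ones((n_p + n_u, 1)), X])  # prepend an intercept column

    # Assumed case-control correction: presences are over-represented in the
    # combined sample relative to the landscape.
    offset = np.log((n_p + prevalence * n_u) / (prevalence * n_u))

    y_u = np.full(n_u, prevalence)  # initial guess for the unknown labels
    for _ in range(n_em):
        # M-step: fit the logistic model to the completed data.
        y = np.concatenate([np.ones(n_p), y_u])
        beta_star = fit_logistic(X, y)
        # E-step: update expected presence at the background locations,
        # moving the linear predictor back to the population scale.
        y_u = sigmoid(X[n_p:] @ beta_star - offset)

    beta = beta_star.copy()
    beta[0] -= offset
    return beta, y_u
```

In this sketch the M-step could be swapped for any logistic fitting routine that accepts fractional responses or case weights, which is one way to read the abstract's remark that the algorithm can be used with any off-the-shelf logistic model. The naive model mentioned in the abstract corresponds to fitting once with all background labels set to 0 and never running the E-step.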


Cited by 207 articles