*Updated*, 20^{th} October 2020


This reports a reanalysis of data from the deployment of a mobile phone app, as reported in:

M. Yauck, L.-P. Rivest, G. Rothman, “Capture-recapture methods for data on the activation of applications on mobile phones”, *Journal of the American Statistical Association*, 2019, 114:525, 105-114, DOI: 10.1080/01621459.2018.1469991.

The article is as linked. There is supplementary information and most datasets are freely available.

The data set analyzed in the paper was provided by Ninth Decimals, 625 Ellis St., Ste. 301, Mountain View, CA 94043, a marketing platform using location data, as indicated in the documentation of the original paper.

Their Abstract reads:

This work is concerned with the analysis of marketing data on the activation of applications (apps) on mobile devices. Each application has a hashed identification number that is specific to the device on which it has been installed. This number can be registered by a platform at each activation of the application. Activations on the same device are linked together using the identification number. By focusing on activations that took place at a business location, one can create a capture-recapture dataset about devices, that is, users, that “visited” the business: the units are owners of mobile devices and the capture occasions are time intervals such as days. A unit is captured when she activates an application, provided that this activation is recorded by the platform providing the data. Statistical capture-recapture techniques can be applied to the app data to estimate the total number of users that visited the business over a time period, thereby providing an indirect estimate of foot traffic. This article argues that the robust design, a method for dealing with a nested mark-recapture experiment, can be used in this context. A new algorithm for estimating the parameters of a robust design with a fairly large number of capture occasions and a simple parametric bootstrap variance estimator are proposed. Moreover, new estimation methods and new theoretical results are introduced for a wider application of the robust design. This is used to analyze a dataset about the mobile devices that visited the auto-dealerships of a major auto brand in a U.S. metropolitan area over a period of 1 year and a half. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.

The paper applies mark-recapture methods in a digital context, estimating the market size for a particular mobile phone application week by week, and offering a new means for doing so based upon estimating equations.

My interest was to use the freely available data to check their overall estimates, but also to use it as the keystone of a semi-mathematical tutorial introducing mark-recapture methods and illustrating modern non-parametric varieties of mark-recapture inference. In particular, the tutorial highlights little-known, or at least infrequently cited, work by R. Tanaka:

- R. Tanaka, “Estimation of Vole and Mouse Populations on Mt Ishizuchi and on the Uplands of Southern Shikoku”, *Journal of Mammalogy*, November 1951, 32(4), 450-458.
- R. Tanaka, “Safe approach to natural home ranges, as applied to the solution of edge effect subjects, using capture-recapture data in vole populations”, *Proceedings of the 6^{th} Vertebrate Pest Conference*, 1974, Vertebrate Pest Conference Proceedings Collection, University of Nebraska, Lincoln.
- R. Tanaka, “Controversial problems in advanced research on estimating population densities of small rodents”, *Researches on Population Ecology*, 1980, **22**, 1-67.

Seber cites Tanaka in:

G. A. F. Seber, *The Estimation of Animal Abundance and Related Parameters*, 2^{nd} edition, 1982, Griffin, London, page 145.
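To give a flavor of the regression idea behind Tanaka's approach, here is a minimal sketch in Python. It is not the implementation used for this analysis. It assumes the log-log form usually attributed to Tanaka, log(n_i/m_i) = γ(log N − log M_i), where n_i is the number caught on occasion i, m_i the number of those already marked, and M_i the number marked before occasion i. The function names and the synthetic numbers are hypothetical, chosen only so the fit can be checked against a known answer.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def tanaka_estimate(caught, recaptured, marked_before):
    """Fit log(n/m) against log(M); assumed model: log(n/m) = gamma * (log N - log M)."""
    xs = [math.log(M) for M in marked_before]
    ys = [math.log(n / m) for n, m in zip(caught, recaptured)]
    intercept, slope = fit_line(xs, ys)
    gamma = -slope                        # slope of the fitted line is -gamma
    n_hat = math.exp(intercept / gamma)   # intercept is gamma * log N
    return n_hat, gamma

# Synthetic, noise-free data generated from the assumed model (not the app data).
TRUE_N, TRUE_GAMMA = 2000.0, 1.2
marked_before = [100, 300, 600, 900, 1200]
caught = [200.0] * 5
recaptured = [n * (M / TRUE_N) ** TRUE_GAMMA for n, M in zip(caught, marked_before)]

n_hat, gamma_hat = tanaka_estimate(caught, recaptured, marked_before)
print(round(n_hat), round(gamma_hat, 3))
```

With noise-free input the fit simply recovers the values used to generate the data; real capture histories would scatter around the line, and the spread of that scatter is what drives low and high estimates.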

The number of first encounters of phone apps, of which there are a total of 9316 distinct ones, looks like:

Only 1654 are observed two or more times during the 77 week experiment. For the total population seen, the minimum number seen per week was 50 and the maximum was 509. The profile of the number of first observed apps, that first observation constituting a “marking”, looks like:

The technique by Tanaka, reprised in the companion paper and extended there to populations with dramatically varying size and, separately, to populations with dramatically varying capture probabilities, found no basis for appreciable variation in the population size or capture probability during the experiment. Accordingly, its estimate of the total population is done as follows:

Because Yauck, Rivest, and Rothman were principally interested in the marketing-oriented determination of visits per week, they do not offer an estimate of the overall population, so their results are difficult to compare with the result from the Tanaka extension. However, the authors do offer estimates for the first 20 weeks. Assuming the overlap among weekly visitors is smaller over a shorter interval, and smaller still because, by all accounts, the number of deployed apps is small, those weekly estimates can be tallied for comparison. This subset was also estimated by the Tanaka method, and produced:

A tally of the 20-week estimates from Yauck, Rivest, and Rothman gives a total of 1191 for the closed population model, 1202 for the “robust” model, and 1132 for the Jolly-Seber model. These compare with the Tanaka method, which gives a population estimate of 2141, with a low estimate of 1896 and a high estimate of 2459.
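The size of the gap is easy to check by hand. A small Python sketch, using only the totals quoted above (the variable and function names are mine):

```python
# Totals quoted in the text for the first 20 weeks.
tanaka_point, tanaka_low, tanaka_high = 2141, 1896, 2459
yrr_totals = {
    "closed population": 1191,
    "robust design": 1202,
    "Jolly-Seber": 1132,
}

def compare(total):
    """Return the ratio to the Tanaka point estimate and whether
    the total falls between the Tanaka low and high estimates."""
    ratio = total / tanaka_point
    inside = tanaka_low <= total <= tanaka_high
    return ratio, inside

for name, total in yrr_totals.items():
    ratio, inside = compare(total)
    print(f"{name}: {total} is {ratio:.1%} of 2141; within [1896, 2459]: {inside}")
```

All three tallies come out at a bit over half of the Tanaka point estimate, and none reaches even the low estimate.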

There are several possible reasons for the discrepancy. For one, Yauck, Rivest, and Rothman are not trying to estimate the overall population, or at least they do not report such an estimate. Their overall profile of population per week, taken from their paper, is shown below:

They do, however, report a seasonal variation in capture probability, as shown in the following chart from their work:

Such a variation could explain a shortfall. It remains a puzzle, though, why the segmentation of the Tanaka fit doesn’t pick up such variations in capture probability.

The implementation of the Tanaka extension is done using **R** code, with related files being available online. The segmentation is overseen using the facilities of the `segmented` package for **R** developed by Professor V. M. R. Muggeo.
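For readers unfamiliar with segmentation, the following Python sketch shows the basic idea of looking for a change point in a linear relationship. It is deliberately not what `segmented` does: that package fits continuous piecewise-linear relationships and estimates breakpoints iteratively, whereas this sketch simply tries every split of the data into two separately fitted lines and keeps the one with the smallest residual sum of squares. The function names and the synthetic data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line; returns (intercept, slope, residual sum of squares)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, sse

def best_split(xs, ys, min_points=3):
    """Brute-force search over split indices; xs must be sorted and distinct.

    Returns (total_sse, split_index, left_slope, right_slope), where the left
    segment is xs[:split_index] and the right segment is xs[split_index:].
    """
    best = None
    for k in range(min_points, len(xs) - min_points + 1):
        _, left_slope, left_sse = fit_line(xs[:k], ys[:k])
        _, right_slope, right_sse = fit_line(xs[k:], ys[k:])
        total = left_sse + right_sse
        if best is None or total < best[0]:
            best = (total, k, left_slope, right_slope)
    return best

# Synthetic data with an abrupt change of regime at x = 10 (not the app data).
xs = [float(x) for x in range(20)]
ys = [2.0 * x if x < 10 else 50.0 - x for x in xs]

total_sse, split_index, left_slope, right_slope = best_split(xs, ys)
print(split_index, round(left_slope, 3), round(right_slope, 3))
```

If no split improves much on a single line through all the data, that is a hint the relationship did not change appreciably, which is the situation reported above for the Tanaka fit.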

If you think you have one or more problems which might benefit from this kind of insight and technique, be sure to reach out. You can find my contact information under the “Above” tab at the top of the blog. Or just leave a comment on this post below. All comments are moderated and kept private until I review them, so if you’d prefer to keep your message non-public, just say so, and I will.

^{(**)} ~~And stay tuned for other blog posts. After mark-recapture, in the beginning of ~~~~March~~ **May**^{*}, I’ll be showing how causal inference and techniques like propensity scoring can be used in scientific research as well as in policy and marketing assessments.
