From Professor Ewan Cameron at his Another Astrostatistics Blog.
This reports a reanalysis of data from the deployment of a mobile phone app, as reported in:
M. Yauck, L.-P. Rivest, G. Rothman, “Capture-recapture methods for data on the activation of applications on mobile phones”, Journal of the American Statistical Association, 2019, 114:525, 105-114, DOI: 10.1080/01621459.2018.1469991.
The data set analyzed in the paper was provided by Ninth Decimals, 625 Ellis St., Ste. 301, Mountain View, CA 94043, a marketing platform using location data, as indicated in the documentation of the original paper.
Their Abstract reads:
This work is concerned with the analysis of marketing data on the activation of applications (apps) on mobile devices. Each application has a hashed identification number that is specific to the device on which it has been installed. This number can be registered by a platform at each activation of the application. Activations on the same device are linked together using the identification number. By focusing on activations that took place at a business location, one can create a capture-recapture dataset about devices, that is, users, that “visited” the business: the units are owners of mobile devices and the capture occasions are time intervals such as days. A unit is captured when she activates an application, provided that this activation is recorded by the platform providing the data. Statistical capture-recapture techniques can be applied to the app data to estimate the total number of users that visited the business over a time period, thereby providing an indirect estimate of foot traffic. This article argues that the robust design, a method for dealing with a nested mark-recapture experiment, can be used in this context. A new algorithm for estimating the parameters of a robust design with a fairly large number of capture occasions and a simple parametric bootstrap variance estimator are proposed. Moreover, new estimation methods and new theoretical results are introduced for a wider application of the robust design. This is used to analyze a dataset about the mobile devices that visited the auto-dealerships of a major auto brand in a U.S. metropolitan area over a period of 1 year and a half. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
The paper applies mark-recapture methods in a digital context, estimating the market size for a particular mobile phone application week by week, and offering a new means of doing so based upon estimating equations.
My interest was to use the freely available data to check their overall estimates, but also to use it as the keystone of a semi-mathematical tutorial introducing mark-recapture methods and illustrating modern non-parametric varieties of mark-recapture inference. In particular, the tutorial highlights little-known, or at least infrequently cited, work by R. Tanaka:
- R. Tanaka, “Estimation of Vole and Mouse Populations on Mt Ishizuchi and on the Uplands of Southern Shikoku”, Journal of Mammalogy, November 1951, 32(4), 450-458.
- R. Tanaka, “Safe approach to natural home ranges, as applied to the solution of edge effect subjects, using capture-recapture data in vole populations”, Proceedings of the 6th Vertebrate Pest Conference, 1974, Vertebrate Pest Conference Proceedings Collection, University of Nebraska, Lincoln.
- R. Tanaka, “Controversial problems in advanced research on estimating population densities of small rodents”, Researches on Population Ecology, 1980, 22, 1-67.
Seber cites Tanaka in:
G. A. F. Seber, The Estimation of Animal Abundance and Related Parameters, 2nd edition, 1982, Griffin, London, page 145.
The number of first encounters of phone apps, of which there are a total of 9316 distinct ones, looks like:
Only 1654 are observed two or more times during the 77-week experiment. Of the total population seen, the minimum number seen per week was 50 and the maximum was 509. The profile of the number of first-observed apps, each first observation constituting a “marking”, looks like:
The Tanaka technique, reprised in the companion paper and extended there to populations with dramatically varying size and, separately, to populations with dramatically varying capture probabilities, found no basis for appreciable variation in either population size or capture probability during the experiment. Accordingly, its estimate of the total population is done as follows:
Because Yauck, Rivest, and Rothman were principally interested in the marketing-oriented determination of visits per week, they do not offer an estimate of the overall population, so their results are difficult to compare with the result from the Tanaka extension. However, the authors do offer an estimate for the first 20 weeks, assuming the degree of overlap among visits to the business is smaller over a shorter interval, and smaller, too, because, by all accounts, the number of deployed apps is small. This subset was also estimated by the Tanaka method, producing:
A tally of the 20-week estimates from Yauck, Rivest, and Rothman gives, for the closed-population model, a total of 1191; for the “robust” model, 1202; and for the Jolly-Seber model, 1132. These compare with the Tanaka method, which gives a population estimate of 2141, with a low estimate of 1896 and a high estimate of 2459.
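For readers who would like to see the flavor of a closed-population calculation, here is a minimal sketch of the classical Schnabel census estimator. To be clear, this is neither the Tanaka extension nor the method of Yauck, Rivest, and Rothman (whose R implementations are discussed above); it is the standard textbook estimator, written in Python for illustration, and the weekly counts below are made up, not drawn from the Ninth Decimals data set.

```python
# Classical Schnabel census estimator for a closed population:
#   N-hat = sum_t(C_t * M_t) / sum_t(R_t)
# where C_t is the catch on occasion t, M_t the number of marked
# individuals at large just before occasion t, and R_t the number
# of recaptures (already-marked individuals) in the catch.

def schnabel(catches, recaptures):
    marked = 0      # marked individuals at large before occasion t
    num = 0.0       # running sum of C_t * M_t
    den = 0.0       # running sum of R_t
    for c, r in zip(catches, recaptures):
        num += c * marked
        den += r
        marked += c - r   # newly marked = caught minus recaptured
    return num / den if den > 0 else float("inf")

# Toy data: weekly "captures" of app identifiers (illustrative only).
catches    = [50, 60, 55, 70, 65]
recaptures = [0, 3, 4, 6, 7]
print(schnabel(catches, recaptures))  # → 1718.75
```

As with all closed-population estimators, the estimate is only as good as the closure and equal-catchability assumptions behind it, which is precisely what the robust design and the Tanaka extension try to relax.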
Reasons for discrepancies may vary. For one, Yauck, Rivest, and Rothman are not trying to estimate overall population, or at least they do not report these. Their overall profile of population per week, taken from their paper, is shown below:
But they also do report a seasonal variation in capture probability, as shown in the following chart from their work:
Such a variation could explain a shortfall. It remains puzzling, though, why the segmentation of the Tanaka fit does not detect such variations in capture probability.
The implementation of the Tanaka extension is done using R code, with related files being available online. The segmentation is overseen using the facilities of the
segmented package for R developed by Professor V. M. R. Muggeo.
If you think you have one or more problems which might benefit from this kind of insight and technique, be sure to reach out. You can find my contact information under the “Above” tab at the top of the blog. Or just leave a comment to this post below. All comments are moderated and kept private until I review them, so if you’d prefer to keep the reach-out non-public, just say so, and I will.
And stay tuned for other blog posts. After mark-recapture, in the beginning of March, I’ll be showing how causal inference and techniques like propensity scoring can be used in scientific research as well as in policy and marketing assessments.
Repost of “The truly common core”, from Ben Orlin’s Math with Bad Drawings blog.
Many scholars today expect to find data as datasets. When I took some courses in Geology at Binghamton University, specifically in Tectonics and Paleomagnetism, I learned that libraries served, in many cases, as Geologists’ repositories of data. No, the libraries hadn’t any servers or big RAIDed disks; they had books and journals. Geologists published maps and charts with contour lines, and both synthetic and actual images. But they most often preferred greyscale for the images. At first I did not appreciate why.
I later learned, as a graduate student, that this was because the way to get data out of a published paper was to take the graphics in the paper and digitize them. This is done even today, such as when I took relative visit histograms from Google Maps to inform a study of plastic versus paper bag usage. I have used Engauge Digitizer as my main tool. I once estimated the volume of Fenway Park in Boston using a plan and profile of it, and using Engauge Digitizer. But I’ve also just used Inkscape, pulling in an image and measuring positions of features on a grid of pixels.
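For what it’s worth, the core of what a digitizer like Engauge does for linear axes is just an affine calibration from pixel coordinates to data coordinates: pick two reference points per axis with known values, then map every clicked pixel through the resulting line. A minimal Python sketch, with entirely made-up calibration points:

```python
# Sketch of graph digitization for linear axes: calibrate an affine
# map from pixel coordinates to data coordinates using two reference
# points per axis, then convert any clicked pixel. All numbers below
# are invented for illustration.

def make_axis_map(pix_a, val_a, pix_b, val_b):
    """Return a function mapping a pixel coordinate to a data value,
    given two calibration points (pixel, value) on one axis."""
    slope = (val_b - val_a) / (pix_b - pix_a)
    return lambda pix: val_a + slope * (pix - pix_a)

# Suppose the x axis runs from 0 (at pixel 100) to 50 (at pixel 600),
# and the y axis from 0 (at pixel 400) to 200 (at pixel 50); note
# that y pixels decrease upward in image coordinates.
to_x = make_axis_map(100, 0.0, 600, 50.0)
to_y = make_axis_map(400, 0.0, 50, 200.0)

# A point clicked at pixel (350, 225) maps to data coordinates:
print(to_x(350), to_y(225))  # approximately (25.0, 100.0)
```

Logarithmic axes work the same way after taking logs of the calibration values, which is why tools like Engauge ask for the axis type up front.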
Often the data need to be checked for smudges and corrections. This is an experiment-within-an-experiment. But it differs little from when laboratory equipment is used, although there people tend to calibrate and recalibrate.
Accordingly it was refreshing to read Dr El-Ad David Amir’s piece on recovering heat map values from a figure, all the more because he took on the challenge of decoding a false color image, something the geologists eschewed.
Dr Amir also put together a YouTube video of his experience, linked below:
H. Holden Thorp, writing in Science, an excerpt:
The scientific community needs to step out of its labs and support evidence-based decision-making in a much more public way. The good news is that over the past few years, scientists have increasingly engaged with the public and policy-makers on all levels, from participating in local science cafes, to contacting local representatives and protesting in the international March for Science in 2017 and 2018. Science and the American Association for the Advancement of Science (AAAS, the publisher of Science) will continue to advocate for science and its objective application to policy in the United States and around the world, but we too must do more.
Scientists must speak up. In June 2019, Patrick Gonzalez, the principal climate change scientist of the U.S. National Park Service, testified to Congress on the risks of climate change even after he was sent a cease-and-desist letter by the administration (which later agreed that he was free to testify as a private citizen). That’s the kind of gumption that deserves the attention of the greater scientific community. There are many more examples of folks leading federal agencies and working on science throughout the government. When their roles in promoting science to support decision-making are diminished, the scientific community needs to raise its voice in loud objection.
I would add that, from what I have seen, efforts to “remain objective and detached” from the public discourse, even when, objectively, an individual only has the public’s interest at heart, are nearly always met by derision and dismissal by people whose interests are challenged, and, increasingly, in at least the United States, by a public which detests scholarship and expertise. Accordingly, the only path left is speaking out.
And lest readers think this is only directed towards conservatives and Republicans, there are many instances where, say, environmental progressives have departed from evidence-based, scientific considerations and knowledge. Surely not regarding climate change — although the characterization of a cliff edge in 12 years or something is obviously just wrong — but many aspects regarding plastics, potential for afforestation, and on how to implement large scale climate change mitigation and what it will cost.
Or, in other words, borrowing from a bookstore in Cobargo, New South Wales: “Post-Apocalyptic Fiction has been moved to Current Affairs.”
- Svante Thunberg
- Sir David Attenborough
- Mark Carney
- Robert Del Naja
- Maarten Wetselaar
Svante Thunberg and Greta speaking to Sir David Attenborough for the first time. Also, outgoing Bank of England chief Mark Carney on how the financial sector can tackle climate change, Massive Attack’s Robert Del Naja on reducing the music industry’s carbon footprint, and Shell’s Maarten Wetselaar on big energy’s environmental impact.
Quoting Ms Thunberg, prompted by interviewer Mishal Husain:
MH: What would you say we should do as individuals? … What should other people do?
GT: Of course, I’m not telling anyone else that they need to stop flying or become vegan. Um, but if I were to give one advice, it would be to read up, to inform yourself about the actual science, about the situation, about what is being done and what is not being done. Because if you understand you will know what you can do yourself, and also, of course, to be an active democratic citizen, because democracy is not only on election day, it’s happening all the time. If people in general decided that this is enough, that would have to make the politicians and the people in power change their policies.
The Massachusetts Transportation and Climate Initiative (TCI), or something very much like it, perhaps stronger, is needed for one simple reason.
The false color heatmap below shows the Carbon Dioxide (CO2) emissions from roadways in Southern New England in 2017, based upon data from the NASA ORNL DAAC.
Period. We cannot get to the targets of the Massachusetts Global Warming Solutions Act (GWSA) by decarbonizing electricity production alone, and, with NIMBYism opposing things like solar farms and land-based wind turbines, even that’s a stretch. Moreover, next will come decarbonizing heating and cooling. Fugitive emissions from natural gas pipelines still have not been addressed.
Vehicle emissions are part of the fossil fuel infrastructure.
We shouldn’t forget where we are on the course towards climate disruption. We shouldn’t forget we’ve already disrupted. Emissions are still increasing. This means it’s getting worse every year. It is not something which is in the future. It’s here now, and it will develop.
Professor Eric Rignot from 2014:
We have yet to apply the brakes.
The new rules, approved by the Federal Energy Regulatory Commission, are designed to counteract state subsidies that support the growth of renewable energy and use of nuclear power. The rules involve what are known as “capacity markets,” where power plants bid to provide electricity to the grid. The change would require higher minimum bids for power plants that receive such subsidies, giving fossil fuel plants an advantage.
The FERC order, passed 2-1, is a response to complaints from operators of coal and natural gas power plants who say that state subsidies have led to unfair competition in the grid region managed by PJM Interconnection.
Richard Glick, the panel’s lone Democrat, cast the dissenting vote and said during the commission meeting that his Republican colleagues were trying to “stunt transition to a clean energy future that states are pursuing and consumers are pursuing.”
In his written dissent, he called the order “illegal, illogical and truly bad public policy.”
Continuing, the ICN report notes:
The Trump administration has taken other high-profile steps to try to boost the coal industry, but many of them are tied up in legal challenges. The new FERC order accomplishes many of the same goals.
But FERC’s action also is likely heading to court, where opponents will argue that the regulator has overstepped its authority and is now dictating state policy.
One issue going forward is that the order has a broad definition of “subsidy,” saying this includes direct or indirect payments, concessions and rebates, among other things. Glick said the definition is so broad that it may end up affecting many more power plants than the other commissioners intended.
In the meantime, PJM has 90 days to say how it will implement the rules, and power plant operators will need to figure out what this means for them.
Such authority would not exist if the grid were not centralized. In particular, if it were instead a loose aggregation of power islands or microgrids which had substantial authority to trade among themselves, political power would not be concentrated in organizations like PJM or, for that matter, ISO-NE or the FERC.
The economic consequences of artificial propping up of coal and natural gas are pretty straightforward: They make utility-scale zero Carbon generation more expensive, disincentivizing utilities from pursuing these options. There are other disincentives being mounted in the form of public pressure against, for example, Warren Buffett’s PacifiCorp electric utility owned by Berkshire Hathaway. There, in Wyoming, PacifiCorp has filed plans to move to wind and solar and shut down coal-fired electricity generators, raising the ire of Wyoming’s pro-coal governor. (Note I originally read this at The Financial Times and would love to link and credit them, but they have a restrictive paywall.) Specifically,
PacifiCorp this year accelerated plans to install wind turbines, solar panels and battery storage, while retiring coal-fired generators in the US west. The announcement was not received well in Wyoming, which mines 40 per cent of US coal.
It is interesting, too, that businessmen as astute as Mr Buffett and PacifiCorp’s CEO, Greg Abel, seem not in the least bit worried about the intermittency which, as some diehard Carbon worshippers who defend utilities claim:
Wind and PV in large amounts are inherently unfit for purpose; they cannot supply energy as needed, nor can they decarbonize even an electric grid by themselves.
completely ignoring the reality of utility scale battery storage.
The effect, of course, will be to raise prices to consumers of electricity, something which, no doubt, as they have in Massachusetts, utilities will claim is the fault of zero Carbon upstarts. Indirectly that’s true, but only because, as with FERC, the fossil fuel worshippers cannot compete and, so, need the price floor on subsidized bids raised to make renewables artificially expensive.
Getting generation on your own, if you own a residence and have the means, or building your own microgrid, if you are a major consumer of electricity, such as a manufacturing facility or a university campus, has a marginally higher return as a consequence of being a customer of PJM. I don’t doubt that, as a consequence, other capacity markets will be tempted to set higher floors.
This drives the dance of electricity generation and consumption in the direction of balkanization which I’ve written about previously, and which seems to be the fate of the United States energy grid. In retrospect, how else could it be, with its collective over-optimization of measures of economic growth at the expense of other risks, those accepted in its embrace of such costs of anarchy? (The price of anarchy has been studied extensively.)
And this is why, in part, Claire and I have configured our home as we have. We are presently participants in the local grid’s marketplace, following a rubric nearly shouted by a roundtable speaker and environmental advocate at a conference I once attended, that “You should not hoard electrons”. But that is truly a value only when said grid respects you in turn. If, for economic or environmental reasons, it turns out we’re not respected so much, then to the degree that happens we have lots of options to minimize our participation and increase our hoarding. Yes, we are somewhat limited by the silly bylaws of the Town of Westwood where we live. (See Section 4.3.2.) But technology is flowing ever onwards, and there will be increasingly more options down the river, ranging from ever cheaper battery storage, to dynamic in-home digital management of electricity flows (fans don’t need high quality voltage and power), to the ability to draw power from our Tesla Model 3 back into the home, to ever more efficient solar PV panels.
This is a contest which PJM, carbon worshippers, social capital anarchists, and even FERC will lose, for economic and environmental reasons. PJM may have more coal plants. But to keep electricity inexpensive enough to support their agriculture and manufacturing, those players will either need to move, or they’ll need to microgrid, and the PJM network will have fewer customers over time.
There is proper concern regarding the relative disadvantage which people of color and low incomes have with respect to climate impacts and environmental harms. Setting aside scientific exaggerations such as quoted in the Vimeo link there,
In a recent United Nations report, experts predict only 12 years remain to prevent unimaginable global devastation.
I’m no lukewarmer, but that’s just wrong.
But, as I said, setting that aside, much more needs to be done to provide greater equality and opportunity to reap the benefits of zero Carbon energy sources. Some of these can be had by subsidizing such energy for communities of color and other low-income communities, as our Commonwealth has done, and more can be had by insisting that communities which consume much electricity otherwise generated in dirty centralized facilities, such as the generating facilities on the Mystic River, MA, reallocate some of their own public and other lands to the purpose of doing that generation in a clean manner. They otherwise put the burden of dirty impacts upon these disadvantaged communities.
But, in my opinion, the role of relatively wealthier members of our community and region should not be minimized. As noted above, there are economic forces trying to reset the competitive landscape, and, being entrenched, vested, and engaging in regulatory capture, these are formidable. So while no one can expect low-income people and many communities of color to fight back, people with means and purpose can do so, and it continues to be important to encourage them. That may or may not mean retaining subsidies. As implied above, abandonment of the grid would be accelerated if subsidies were withdrawn or electricity prices directed at them were increased. (In some utilities, such price increases have even been punitively targeted at solar adopters, for example.) But I think the role should be appreciated and, in particular, it is not constructive to dismiss their, and, frankly, our, participation as unimportant merely because we can afford it.
People talk about “thousand year storms”. Rather than being a storm having a recurrence time of once in a thousand years, these are storms which have a 0.001 chance per year of occurring. Storms aren’t the only weather events of significance which have probabilities of occurrence like these. Consider current precipitation risks for the Town of Westwood, Massachusetts, where I live:
I have highlighted events which have a 0.01 chance per year of occurring, including things like a rainfall of almost an inch in 5 minutes, or 8 inches in a day. Again, the recurrence time is not once in 100 years. And, note, these are not based upon expected climate change, although there already is some change in the climate baked into these. These are current risks.
So what does 0.01 per year mean? Well, as Radley Horton explains in part below, think of it as rolling a dice (*) having 100 faces, and looking for the event “It rolled the number 10”.
(Figure is from Statistics How To.)
What’s that mean?
It means that, for each successive number of years, the chances of the event happening at least once is as in the following table:
| number of years | chance of event happening 1 or more times |
|---|---|
| 1 | 0.010 |
| 2 | 0.020 |
| 5 | 0.049 |
| 10 | 0.096 |
| 20 | 0.182 |
So by the time 10 years roll by, the 0.01 event has an almost ten times greater chance of happening. If a stormwater management system in the Town of Westwood is effectively destroyed by an 8 inch rain, there’s a 1-in-10 chance of that happening in a 10 year stretch.
As climate changes, extreme precipitation events become more likely. If the 8 inch rain has a 0.01 chance per year now, it will soon have a 1-in-50 chance per year, or 0.02 per year. How does the risk table change for that?
| number of years | chance of event happening 1 or more times |
|---|---|
| 1 | 0.020 |
| 2 | 0.040 |
| 5 | 0.096 |
| 10 | 0.183 |
| 20 | 0.332 |
Unsurprisingly, that 1-in-10 chance now takes only 5 years to realize, and in 10 years there’s a slightly less than 1-in-5 chance of it happening. If the stormwater management exceedance costs $10 million to repair, that means, in the first case, an expected cost of $100,000 per year, and, over 10 years, about a million dollars. When climate changes to the 1-in-50, these expected losses double.
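The arithmetic behind these risk tables is just the complement rule, assuming independent years: the chance of at least one occurrence in n years of an event with annual probability p is 1 − (1 − p)ⁿ. A minimal Python sketch:

```python
# Chance of at least one occurrence in n years of an event with
# annual probability p, assuming years are independent:
#   P(at least one) = 1 - (1 - p)**n

def chance_at_least_once(p, n):
    return 1.0 - (1.0 - p) ** n

for p in (0.01, 0.02):
    for n in (1, 5, 10):
        print(p, n, round(chance_at_least_once(p, n), 3))

# Expected annual loss if exceedance costs $10 million to repair:
repair_cost = 10_000_000
print(0.01 * repair_cost)  # $100,000 per year at the 1-in-100 rate
print(0.02 * repair_cost)  # doubles at the 1-in-50 rate
```

At p = 0.01 and n = 10 this gives about 0.096, the “almost ten times greater chance” noted above; at p = 0.02 the same level is reached in only 5 years.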
In the case of weather events, they may not be entirely independent. Events might “bunch up” due to ENSO or other influences. Similarly a big volcanic explosion can affect global weather for a year or two, and depress probabilities of weather events.
When estimating risks of events like these directly from data on occurrences, it’s important to note that Gaussian approximations to their distributions, or even Poisson ones, will underestimate risk. What’s needed is a Generalized Extreme Value (GEV) distribution. Lee Fawcett, in his article “A severe forecast” in the current issue of Significance Magazine (December 2019), explains in greater detail. A good book explaining use of the GEV distribution is:
E. Castillo, A. S. Hadi, N. Balakrishnan, J. M. Sarabia, Extreme Value and Related Models with Applications in Engineering and Science, Wiley, 2005.
The R statistical programming language offers a number of packages for doing inference with this distribution.
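To give the flavor, once GEV parameters are fitted, the “T-year” level has a closed form. The Python sketch below uses made-up parameter values, not a fit to the Westwood precipitation data; in practice one would fit annual maxima with, say, R’s evd or ismev packages, or scipy.stats.genextreme.

```python
# The GEV return level in closed form. Given fitted GEV parameters
# (location mu, scale sigma, shape xi), the level exceeded with
# annual probability 1/T (the "T-year" level) is
#   z_T = mu + (sigma/xi) * ((-log(1 - 1/T))**(-xi) - 1),  xi != 0,
# with the Gumbel limit  z_T = mu - sigma*log(-log(1 - 1/T))  as xi -> 0.
import math

def gev_return_level(mu, sigma, xi, T):
    y = -math.log(1.0 - 1.0 / T)   # reduced variate
    if abs(xi) < 1e-9:             # Gumbel limit as xi -> 0
        return mu - sigma * math.log(y)
    return mu + (sigma / xi) * (y ** (-xi) - 1.0)

# Hypothetical fit to annual maximum daily rainfall (inches):
mu, sigma, xi = 3.0, 0.8, 0.1
for T in (10, 50, 100):
    print(T, round(gev_return_level(mu, sigma, xi, T), 2))
```

A positive shape parameter, as in this made-up example, gives the heavy upper tail that makes Gaussian and Poisson approximations underestimate exactly these risks.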
(*) According to the Oxford dictionary, the singular of the plural “dice” is still “dice”, although the older “die” is acceptable.
S. M. Stigler, "Data have a limited shelf life", Harvard Data Science Review, November 2019.
Data, unlike some wines, do not improve with age. The contrary view, that data are immortal, a view that may underlie the often-observed tendency to recycle old examples in texts and presentations, is illustrated with three classical examples and rebutted by further examination. Some general lessons for data science are noted, as well as some history of statistical worries about the effect of data selection on induction and related themes in recent histories of science.
Keywords: dead data, zombie data, post-selection inference, history
Of particular historical interest is whether or not modern scholars can ever properly interpret classic experiments, with their defects, like the Millikan oil drop experiment, or Eddington’s measurement of light deflection to confirm General Relativity.
Also of interest is whether enough metadata about old datasets in business, such as insurance or operations, or even scientific observation, is kept to be able to properly reconstruct the provenance.
- D. Engwirda, 2017: JIGSAW-GEO (1.0): Locally orthogonal staggered unstructured grid generation for general circulation modelling on the sphere, Geosci. Model Dev., 10, 2117-2140, doi:10.5194/gmd-10-2117-2017
and a general description at NASA. The figure below is copied from there.
Let it be said, apart from his so-called base, 45 is not a popular guy. Even his bud, Boris Johnson, is making moves to avoid his endorsement.
Yeah, that’s a popular, well respected guy.
(Click on figure to see a larger image. It will open in a new browser tab.)
Yes, I know, this is from Orsted, a public company which, primarily, builds offshore wind farms. And, as a result, you out there (which is, frankly, an infinitesimal fraction of the world, because, basically, no one follows me), will critique me for promoting a specific company.
Think of it.
Someone has a good idea. They pursue it. They promote it. They find a way of moving it into people’s lives. Great. What do they do? They found a company which has that as its purpose.
But, oh no, say the Environmental Purists, this is now “corporate greed” and we can’t have anything to do with that. It’s not us!
So, given a group of folks, the Environmental Purists, who want to advance a cause but then deny the means of achieving it, they are either masochists, or they eternally want to be guaranteed an opposition to fight, one they can never dominate and defeat.
I’m sick of this nonsense, whether it be Sierra Club or Extinction Rebellion. I want answers and programs, not sham policies which hijack the hugely important issue of climate disruption to achieve long-sought social objectives. I do not say the latter aren’t important. I say holding the rest of society to ransom for their objectives is cruel, heartless, uncharitable, and downright stupid.
Who do you think carries most of the burden for fixing the problem?
Action. “We have work to do.” (Bill Nye)
Hat tip to PV Magazine:
- B. Frew, W. Cole, P. Denholm, W. Frazier, N. Vincent, R. Margolis, “Sunny with a Chance of Curtailment: Operating the US Grid with very high levels of solar photovoltaics”, (open access) iScience, 21, 22 November 2019, Pages 436-447.
- J. Weaver, “Designing for and monetizing curtailed solar power”, PV Magazine, May 2019.
- V. Gevorgian, “Highly accurate method for real-time active power reserve estimation for utility-scale photovoltaic power plants”, NREL, 2019.
Highlights of Frew, Cole, Denholm, Frazier, Vincent, Margolis
- Load and operating reserves can be met in US grid with up to 55% PV with storage
- Power system must rapidly transition between synchronous and inverter-based generation
- Significant curtailment is seen, with hours of >40% economic curtailment
- Hours with very low energy prices become more frequent, up to 36% of hours
With rapid declines in solar photovoltaic (PV) and energy storage costs, futures with PV penetrations approaching or exceeding 50% of total annual US generation are becoming conceivable. The operational merits of such a national-scale system have not been evaluated sufficiently. Here, we analyze in detail the operational impacts of a future US power system with very high annual levels of PV (>50%) with storage. We show that load and operating reserve requirements can be met for all hours while considering key generator operational constraints. Storage plays an active role in maintaining the balance of supply and demand during sunset hours. Under the highest PV penetration scenario, hours with >90% PV penetration are relatively common, which require rapid transitions between predominately conventional synchronous generation and mostly inverter-based generation. We observe hours with almost 400 GW (over 40%) of economic curtailment and frequent (up to 36%) hours with very low energy prices.
(Emphasis added in the above.)
Even without environmental incentives, the United States has moved towards greater electrification.
Note, however, that Massachusetts is not numbered amongst the Enlightened.
Update, 2019-10-28 00:34 ET
Note the citing of how talent migrated from the fossil fuel industry to offshore wind energy.
Check out the thoughts of the late Professor Martin Weitzman as well, in “The man who got economists to take climate nightmares seriously”.