Category Archives: data science

a song in praise of data scientist Rebekah Jones

I linked to Rebekah Jones’ keynote address at the August 2020 Data Science Conference on COVID-19 sponsored by the National Institute of Statistical Sciences. Below is a song in tribute to her, wishing her well. (h/t Bill McKibben) We’re doing … Continue reading

Posted in American Association for the Advancement of Science, American Mathematical Society, American Statistical Association, Boston Ethical Society, children as political casualties, Data for Good, data science, geographic, geographic information systems, International Society for Bayesian Statistics, journalism, mathematics, New England Statistical Society, pandemic, Rebekah Jones, Risky Talk, science, Significance, statistical ecology, statistics, the problem of evil, whistleblowing, ``The tide is risin'/And so are we'' | Leave a comment

What happens when time sampling density of a series matches its growth

This is the newly updated map of COVID-19 cases in the United States, updated, presumably, because of the new emphasis upon testing: How do we know this is the result of recent testing? Look at the map of active cases: … Continue reading

Posted in American Association for the Advancement of Science, American Statistical Association, anti-intellectualism, anti-science, climate denial, corruption, data science, data visualization, Donald Trump, dump Trump, epidemiology, experimental science, exponential growth, forecasting, Kalman filter, model-free forecasting, nonlinear systems, open data, penalized spline regression, population dynamics, sampling algorithms, statistical ecology, statistical models, statistical regression, statistical series, statistics, sustainability, the right to know, the stack of lies | 1 Comment

R ecosystem package coronavirus

Dr Rami Krispin of the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) has just released the R package coronavirus, which “provides a daily summary of the Coronavirus (COVID-19) cases by state/province”, caused by 2019-nCoV. (update 2020-03-12 … Continue reading

Posted in data presentation, data science, epidemiology | 1 Comment

Temperatures, Summers, Germany, ≈ 50.5N to 57.5N latitude

(Click on figure for larger image and use browser Back Button to return to blog.) Hat tip to Gregor Aisch, Adam Pearce, and Steve Hoey, and sourced from the mashup dataset and visuals by Lisa Charlotte Rost. Mr Aisch’s innovation … Continue reading

Posted in Anthropocene, climate change, data presentation, data science, data visualization, digital art, drought, Germany, global warming, loess, p-spline, penalized spline regression | Leave a comment

Procrustes tangent distance is better than SNCD

I’ve written two posts here on using a Symmetrized Normalized Compression Divergence or SNCD for comparing time series. One introduced the SNCD and described its relationship to compression distance, and the other applied the SNCD to clustering days at a … Continue reading

Posted in data science, dependent data, descriptive statistics, divergence measures, hydrology, Ian Dryden, information theoretic statistics, J.T.Kent, Kanti Mardia, non-parametric statistics, normalized compression divergence, quantitative ecology, R statistical programming language, spatial statistics, statistical series, time series | Leave a comment

Stream flow and P-splines: Using built-in estimates for smoothing

Mother Brook in Dedham Massachusetts was the first man-made canal in the United States. Dug in 1639, it connects the Charles River at Dedham, to the Neponset River in the Hyde Park section of Boston. It was originally an important … Continue reading

Posted in American Statistical Association, citizen data, citizen science, Clausius-Clapeyron equation, Commonwealth of Massachusetts, cross-validation, data science, dependent data, descriptive statistics, dynamic linear models, empirical likelihood, environment, flooding, floods, Grant Foster, hydrology, likelihood-free, meteorological models, model-free forecasting, non-mechanistic modeling, non-parametric, non-parametric model, non-parametric statistics, numerical algorithms, precipitation, quantitative ecology, statistical dependence, statistical series, stream flow, Tamino, the bootstrap, time series, water vapor | 2 Comments

Series, symmetrized Normalized Compressed Divergences and their logit transforms

(Major update on 11th January 2019. Minor update on 16th January 2019.) On comparing things The idea of calculating a distance between series for various purposes has received scholarly attention for quite some time. The most common application is … Continue reading

Posted in Akaike Information Criterion, bridge to somewhere, computation, content-free inference, data science, descriptive statistics, divergence measures, engineering, George Sughihara, information theoretic statistics, likelihood-free, machine learning, mathematics, model comparison, model-free forecasting, multivariate statistics, non-mechanistic modeling, non-parametric statistics, numerical algorithms, statistics, theoretical physics, thermodynamics, time series | 4 Comments

The Johnson-Lindenstrauss Lemma, and the paradoxical power of random linear operators. Part 1.

Updated, 2018-12-04 I’ll be discussing the ramifications of: William B. Johnson and Joram Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space”, Contemporary Mathematics, 26:189–206, 1984. for several posts here. Some introduction and links to proofs and explications will be … Continue reading

Posted in clustering, data science, dimension reduction, information theoretic statistics, Johnson-Lindenstrauss Lemma, k-NN, Locality Sensitive Hashing, mathematics, maths, multivariate statistics, non-parametric model, numerical algorithms, numerical linear algebra, point pattern analysis, random projections, recommender systems, science, stochastic algorithms, stochastics, subspace projection methods | 1 Comment

Sampling: Rejection, Reservoir, and Slice

An article by Suilou Huang for catastrophe modeler AIR-WorldWide of Boston about rejection sampling in CAT modeling got me thinking about pulling together some notes about sampling algorithms of various kinds. There are, of course, books written about this subject, … Continue reading

Posted in accept-reject methods, American Statistical Association, Bayesian computational methods, catastrophe modeling, data science, diffusion processes, empirical likelihood, Gibbs Sampling, insurance, Markov Chain Monte Carlo, mathematics, Mathematics and Climate Research Network, maths, Monte Carlo Statistical Methods, multivariate statistics, numerical algorithms, numerical analysis, numerical software, numerics, percolation theory, Python 3 programming language, R statistical programming language, Radford Neal, sampling, slice sampling, spatial statistics, statistics, stochastic algorithms, stochastic search | Leave a comment

Erin Gallagher’s “#QAnon network visualizations”

See her most excellent blog post, a delve into true Data Science. (Click on figure to see a full-size image. It is large. Use your browser Back Button to return to this blog afterwards.) Hat tip to Bob Calder and … Continue reading

Posted in data science, jibber jabber, networks | Leave a comment

Senn’s `… never having to say you are certain’ guest post from Mayo’s blog

via S. Senn: Being a statistician means never having to say you are certain (Guest Post) See also: E. Cai’s blog post “Applied Statistics Lesson of the Day – The Matched Pairs Experimental Design”, from February 2014 A. Deaton, N. … Continue reading

Posted in abstraction, American Association for the Advancement of Science, American Statistical Association, cancer research, data science, ecology, experimental design, generalized linear mixed models, generalized linear models, Mathematics and Climate Research Network, medicine, sampling, statistics, the right to know | Leave a comment

Eli on “Tom [Karl]’s trick and experimental design”

A very fine post at Eli’s blog for students of statistics, meteorology, and climate (like myself) titled: Tom’s trick and experimental design Excerpt: This and the graph from Menne at the top shows that Karl’s trick is working. Although we … Continue reading

Posted in American Meteorological Association, American Statistical Association, AMETSOC, anomaly detection, climate, climate change, climate data, data science, evidence, experimental design, generalized linear mixed models, GISTEMP, GLMMs, global warming, model comparison, model-free forecasting, reblog, sampling, sampling networks | Leave a comment

Just because the data lies sometimes doesn’t mean it’s okay to censor it

Or, there’s no such thing as an outlier … Eli put up a post titled “The Data Lies. The Crisis in Observational Science and the Virtue of Strong Theory” at his lagomorph blog. Think of it: Data lying. Obviously this … Continue reading

Posted in Akaike Information Criterion, American Association for the Advancement of Science, American Meteorological Association, American Statistical Association, AMETSOC, Anthropocene, Bayes, Bayesian, climate, climate change, climate models, data science, dynamical systems, ecology, Eli Rabett, environment, Ethan Deyle, George Sughihara, Hao Ye, Hyper Anthropocene, information theoretic statistics, IPCC, Kalman filter, kriging, Lenny Smith, maximum likelihood, model comparison, model-free forecasting, physics, quantitative ecology, random walk processes, random walks, science, smart data, state-space models, statistics, Takens embedding theorem, the right to know, Timothy Lenton, Victor Brovkin | 1 Comment

“Hadoop is NOT ‘Big Data’ is NOT Analytics”

Arun Krishnan, CEO & Founder at Analytical Sciences, comments on this serious problem with the field. Short excerpt: … A person who is able to write code using Hadoop and the associated frameworks is not necessarily someone who can understand … Continue reading

Posted in alchemy, American Statistical Association, artificial intelligence, big data, data science, engineering, Internet, jibber jabber, machine learning, natural language processing, NLTK, sociology, superstition | Leave a comment

Is the answer to the democratization of Science doing more Citizen Science?

I have been following, with keen interest, the post and comment thread pertaining to “Democratising science” at the blog I monitor daily, … and Then There’s Physics. I think the core subject being discussed is a little different from my … Continue reading

Posted in American Association for the Advancement of Science, American Meteorological Association, American Statistical Association, AMETSOC, astronomy, astrophysics, biology, citizen data, citizen science, citizenship, data science, ecology, education, environment, evidence, life purpose, local self reliance, marine biology, mathematics, mathematics education, maths, moral leadership, new forms of scientific peer review, open source scientific software, science, science education, statistics, the green century, the right to know | Leave a comment

A new feature: Technical publications of the week

I’m beginning a new style of column, called technical publications of the week. While I can’t promise these will be weekly, I will, from time to time, highlight technical publications I’ve recently read which I consider to be noteworthy. I … Continue reading

Posted in Anthropocene, big data, climate change, climate disruption, data science, data streams, earthquakes, geophysics, global warming, Hyper Anthropocene, Locality Sensitive Hashing, LSH, MinHash, numerical algorithms, numerical analysis, random projections, seismology, subspace projection methods, SVD, the right to be and act stupid, the tragedy of our present civilization, the value of financial assets | 1 Comment

Why scientific measurements need to be adjusted

There is an excellent piece in Ars Technica about why scientific measurements need to be adjusted, and the implications of this for climate data. It is written by Scott K Johnson and is called “Thorough, not thoroughly fabricated: The truth … Continue reading

Posted in American Association for the Advancement of Science, American Meteorological Association, American Statistical Association, AMETSOC, Berkeley Earth Surface Temperature project, Canettes Blues Band, citizen data, climate data, data science, environment, evidence, geophysics, GISTEMP, HadCRUT4, mathematics education, meteorological models, obfuscating data, open data, physics, science, spatial statistics, Tamino, the right to know, the tragedy of our present civilization, Variable Variability | Leave a comment

Sleeping Giant Awakening

Originally posted on Climate Denial Crock of the Week:
https://twitter.com/johnmyers/status/809097380456865792 Wikipedia: Isoroku Yamamoto’s sleeping giant quotation is a quote by the Japanese Admiral Isoroku Yamamoto regarding the 1941 attack on Pearl Harbor by forces of Imperial Japan. The quotation is portrayed at the very end of…

Posted in adaptation, American Association for the Advancement of Science, American Meteorological Association, American Solar Energy Society, American Statistical Association, AMETSOC, Anthropocene, California, Carbon Worshipers, citizen data, citizen science, climate, climate change, climate data, climate disruption, data science, Donald Trump, ecology, Ecology Action, geophysics, global warming, Hyper Anthropocene, ignorance, Jerry Brown, science, sustainability, the right to be and act stupid, the right to know, the stack of lies, the tragedy of our present civilization | Leave a comment

Cathy O’Neil’s WEAPONS OF MATH DESTRUCTION: A Review

(Revised and updated Monday, 24th October 2016.) Weapons of Math Destruction, Cathy O’Neil, published by Crown Random House, 2016. This is a thoughtful and very approachable introduction and review to the societal and personal consequences of data mining, data science, … Continue reading

Posted in citizen data, citizen science, citizenship, civilization, compassion, complex systems, criminal justice, Daniel Kahneman, data science, deep recurrent neural networks, destructive economic development, economics, education, engineering, ethics, Google, ignorance, Joseph Schumpeter, life purpose, machine learning, Mathbabe, mathematics, mathematics education, maths, model comparison, model-free forecasting, numerical analysis, numerical software, open data, optimization, organizational failures, planning, politics, prediction, prediction markets, privacy, rationality, reason, reasonableness, risk, silly tech devices, smart data, sociology, Techno Utopias, testing, the value of financial assets, transparency | Leave a comment

NextGen VOICES: `On data’, `On setbacks’, and `On discovery’

Science Magazine has a periodic column called Science in brief and occasionally that column features a set of what they call “NextGen VOICES”, meaning young scientists. They gather the survey using Twitter (of course) via the hashtag #NextGenSci. For the … Continue reading

Posted in American Association for the Advancement of Science, big data, data science, evidence, Mathbabe, maths, maxims, science, Science magazine, smart data, statistics | Leave a comment

“Holy crap – an actual book!”

Originally posted on mathbabe:
Yo, everyone! The final version of my book now exists, and I have exactly one copy! Here’s my editor, Amanda Cook, holding it yesterday when we met for beers: Here’s my son holding it: He’s offered…

Posted in American Association for the Advancement of Science, Buckminster Fuller, business, citizen science, citizenship, civilization, complex systems, confirmation bias, data science, data streams, deep recurrent neural networks, denial, economics, education, engineering, ethics, evidence, Internet, investing, life purpose, machine learning, mathematical publishing, mathematics, mathematics education, maths, moral leadership, multivariate statistics, numerical software, numerics, obfuscating data, organizational failures, politics, population biology, prediction, prediction markets, privacy, quantitative biology, quantitative ecology, rationality, reason, reasonableness, rhetoric, risk, Schnabel census, smart data, sociology, statistical dependence, statistics, the right to be and act stupid, the right to know, the value of financial assets, transparency, UU Humanists | Leave a comment

“Stochastic Parameterization: Towards a new view of weather and climate models”

Judith Berner, Ulrich Achatz, Lauriane Batté, Lisa Bengtsson, Alvaro De La Cámara, Hannah M. Christensen, Matteo Colangeli, Danielle R. B. Coleman, Daan Crommelin, Stamen I. Dolaptchiev, Christian L.E. Franzke, Petra Friederichs, Peter Imkeller, Heikki Järvinen, Stephan Juricke, Vassili Kitsios, François … Continue reading

Posted in biology, climate models, complex systems, convergent cross-mapping, data science, dynamical systems, ecology, Ethan Deyle, Floris Takens, George Sughihara, Hao Ye, likelihood-free, Lorenz, mathematics, meteorological models, model-free forecasting, physics, population biology, population dynamics, quantitative biology, quantitative ecology, Scripps Institution of Oceanography, state-space models, statistical dependence, statistics, stochastic algorithms, stochastic search, stochastics, Takens embedding theorem, time series, Victor Brovkin | 4 Comments

data.table

R provides a helpful data structure called the “data frame” that gives the user an intuitive way to organize, view, and access data. Many of the functions that you would us… Source: Intro to The data.table Package

Posted in big data, data science, engineering, numerical analysis, numerical software, numerics, open source scientific software, R, smart data, statistics | Leave a comment

On Smart Data

One of the things I find surprising, if not astonishing, is that in the rush to embrace Big Data, a lot of learning and statistical technique has been left apparently discarded along the way. I’m hardly the first to point … Continue reading

Posted in Akaike Information Criterion, Bayes, Bayesian, Bayesian inversion, big data, bigmemory package for R, changepoint detection, data science, data streams, dlm package, dynamic generalized linear models, dynamic linear models, dynamical systems, Generalize Additive Models, generalized linear models, information theoretic statistics, Kalman filter, linear algebra, logistic regression, machine learning, Markov Chain Monte Carlo, mathematics, mathematics education, maths, maximum likelihood, MCMC, Monte Carlo Statistical Methods, multivariate statistics, numerical analysis, numerical software, numerics, quantitative biology, quantitative ecology, rationality, reasonableness, sampling, smart data, state-space models, statistical dependence, statistics, the right to know, time series | Leave a comment

“Catching long tail distribution” (Ted Dunning)

One of the best presentations on what can happen if someone takes a naive approach to network data. It also highlights what is, to my mind, the greatly underappreciated t-distribution, which is typically only used in connection with frequentist Student … Continue reading

Posted in Cauchy distribution, complex systems, data science, Lévy flights, leptokurtic, mathematics, maths, networks, physics, population biology, population dynamics, regime shifts, sampling, statistics, Student t distribution, time series | Leave a comment

Climate Denial Fails Pepsi Challenge

Originally posted on Climate Denial Crock of the Week:
Stephen Lewandowsky specializes in conducting research that pulls back the curtain on climate denial psychology. He’s done it again. Washington Post: Researchers have designed an inventive test suggesting that the arguments commonly used…

Posted in American Association for the Advancement of Science, American Statistical Association, card draws, card games, chance, climate, climate change, climate data, climate education, confirmation bias, data science, denial, disingenuity, education, false advertising, fear uncertainty and doubt, fossil fuels, games of chance, geophysics, global warming, ignorance, mathematics, mathematics education, maths, obfuscating data, rationality, reasonableness, risk, science, science education, sociology, the right to know | Leave a comment

A Sankey diagram showing influence of big oil on climate policy

I’ve written about Sankey diagrams before. Here’s a novel use: InfluenceMap has used a Sankey diagram to demonstrate “How much big oil spends on obstructive climate lobbying”. The figure that’s available for media is shown below. (Click on image to … Continue reading

Posted in American Petroleum Institute, Anthropocene, Bloomberg, Bloomberg New Energy Finance, BNEF, carbon dioxide, Carbon Worshipers, Chevron, citizenship, climate, climate change, climate disruption, climate education, climate justice, corporate litigation on damage from fossil fuel emissions, data science, destructive economic development, disingenuity, economics, education, energy, Exxon, false advertising, fear uncertainty and doubt, fossil fuels, global warming, greenhouse gases, Gulf Oil, Hyper Anthropocene, ignorance, lobbying, methane, natural gas, pipelines, politics, rationality, reasonableness, risk, Sankey diagram, Standard Oil of California, Texaco, the value of financial assets | 1 Comment

Of my favorite things …

(Clarifying language added 4 Apr 2016, 12:26 EDT.) I just watched an episode from the last season of Star Trek: The Next Generation entitled “Force of Nature.” As anyone who pays the least attention to this blog knows, opposing human … Continue reading

Posted in Anthropocene, bridge to somewhere, bucket list, Buckminster Fuller, Carl Sagan, climate, climate change, climate disruption, climate education, compassion, data science, Earle Wilson, ecology, Ecology Action, environment, evolution, geophysics, George Sughihara, global warming, Hyper Anthropocene, life purpose, mathematics, mathematics education, maths, numerical analysis, optimization, philosophy, physical materialism, physics, population biology, population dynamics, proud dad, quantitative biology, quantitative ecology, rationality, reasonableness, science, sociology, statistics, stochastic algorithms | 5 Comments

HadCRUT4 and GISTEMP series filtered and estimated with simple RTS model

Happy Vernal Equinox! This post has been updated today with some of the equations which correspond to the models. An assessment of whether or not there was a meaningful slowdown or “hiatus” in global warming was recently discussed by Tamino … Continue reading

Posted in AMETSOC, anemic data, Bayesian, boosting, bridge to somewhere, cat1, changepoint detection, climate, climate change, climate data, climate disruption, climate models, complex systems, computation, data science, dynamical systems, geophysics, George Sughihara, global warming, hiatus, information theoretic statistics, machine learning, maths, meteorology, MIchael Mann, multivariate statistics, physics, prediction, Principles of Planetary Climate, rationality, reasonableness, regime shifts, sea level rise, time series | 5 Comments

p-values and hypothesis tests: the Bayesian(s) rule

The American Statistical Association, of which I am a longtime member, issued an important statement today which will hopefully move statistical practice in engineering and especially in the sciences away from the misleading practice of using p-values and hypothesis tests. … Continue reading

Posted in approximate Bayesian computation, arXiv, Bayes, Bayesian, Bayesian inversion, bollocks, Christian Robert, climate, complex systems, data science, Frequentist, information theoretic statistics, likelihood-free, Markov Chain Monte Carlo, MCMC, Monte Carlo Statistical Methods, population biology, rationality, reasonableness, science, scientific publishing, statistical dependence, statistics, stochastics, Student t distribution | Leave a comment