• Correlation and R-Squared for Big Data

    Originally posted on AnalyticBridge, by Dr. Granville. Click here to read original article and comments. With big data, one sometimes has to compute correlations involving thousands of buckets of paired observations or time series. For instance, a data bucket corresponds to a node in a decision tree, a customer segment, or a subset of observations...

  • How to detect spurious correlations, and how to find the real ones

    Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. Specifically designed in the context of big data in our research lab, the new and simple strong correlation synthetic metric proposed in this article should be used whenever you want to check if there is a real association between two variables, especially...

  • Practical illustration of Map-Reduce (Hadoop-style), on real data

    Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. Here I will discuss a general framework to process web traffic data. The concept of Map-Reduce will be naturally introduced. Let’s say you want to design a system to score Internet clicks, to measure the chance for a click to...

  • Jackknife logistic and linear regression for clustering and predictions

    Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. This article discusses a far more general version of the technique described in our article The best kept secret about regression. Here we adapt our methodology so that it applies to data sets with a more complex structure, in particular with...

  • A synthetic variance designed for Hadoop and big data

    Originally posted on Hadoop360, by Dr. Granville. Click here to read original article and comments. The new variance introduced in this article fixes two big data problems associated with the traditional variance and the way it is computed in Hadoop, using a numerically unstable formula. Synthetic Metrics. This new metric is synthetic: it was not derived naturally from...

  • Fast Combinatorial Feature Selection with New Definition of Predictive Power

    Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. In this article, I propose a simple metric to measure predictive power. It is used for combinatorial feature selection, where a large number of feature combinations need to be ranked automatically and very fast, for instance in the context of transaction...

  • Internet topology mapping

    Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. This is a component often missing, yet valuable for most systems, algorithms and architectures that are dealing with online or mobile data, known as digital data: be it transaction scoring, fraud detection, online marketing, marketing mix and advertising optimization, online...

  • Hidden decision trees revisited

    Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. Hidden decision trees (HDT) is a technique patented by Dr. Granville to score large volumes of transaction data. It blends robust logistic regression with hundreds of small decision trees (each one representing, for instance, a specific type of fraudulent transaction) and offers significant...