• Virtual Special Issue Big Data@Elsevier.Computer Science – Virtual Special Issue

    To celebrate the IEEE Big Data Conference in Washington on 5-8 December 2016, Elsevier Computer Science presents a virtual special issue on some of the most cited articles on Big Data across all our Computer Science Journals. This virtual special issue highlights papers published between 2014 and 2015 that...

  • Black-box Confidence Intervals: Excel and Perl Implementation

    Originally posted here. Check original article for most recent updates. Confidence interval is abbreviated as CI. In this new article (part of our series on robust techniques for automated data science) we describe an implementation in both Excel and Perl, and discuss our popular model-free confidence interval technique introduced in our original...

  • Correlation and R-Squared for Big Data

    Originally posted on Analyticbridge, by Dr. Granville. Click here to read original article and comments. With big data, one sometimes has to compute correlations involving thousands of buckets of paired observations or time series. For instance a data bucket corresponds to a node in a decision tree, a customer segment, or a subset of observations...

  • How to detect spurious correlations, and how to find the real ones

    Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. Specifically designed in the context of big data in our research lab, the new and simple strong correlation synthetic metric proposed in this article should be used whenever you want to check if there is a real association between two variables, especially...

  • Practical illustration of Map-Reduce (Hadoop-style), on real data

    Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. Here I will discuss a general framework to process web traffic data. The concept of Map-Reduce will be naturally introduced. Let’s say you want to design a system to score Internet clicks, to measure the chance for a click to...

  • Jackknife logistic and linear regression for clustering and predictions

    Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. This article discusses a far more general version of the technique described in our article The best kept secret about regression. Here we adapt our methodology so that it applies to data sets with a more complex structure, in particular with...

  • A synthetic variance designed for Hadoop and big data

    Originally posted on Hadoop360, by Dr. Granville. Click here to read original article and comments. The new variance introduced in this article fixes two big data problems associated with the traditional variance and the way it is computed in Hadoop, using a numerically unstable formula. Synthetic Metrics: This new metric is synthetic: It was not derived naturally from...

  • Fast Combinatorial Feature Selection with New Definition of Predictive Power

    Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. In this article, I propose a simple metric to measure predictive power. It is used for combinatorial feature selection, where a large number of feature combinations need to be ranked automatically and very fast, for instance in the context of transaction...

  • Internet topology mapping

    Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. This is a component often missing, yet valuable for most systems, algorithms and architectures that are dealing with online or mobile data, known as digital data: be it transaction scoring, fraud detection, online marketing, marketing mix and advertising optimization, online...

  • Hidden decision trees revisited

    Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. Hidden decision trees (HDT) is a technique patented by Dr. Granville to score large volumes of transaction data. It blends robust logistic regression with hundreds of small decision trees (each one representing for instance a specific type of fraudulent transaction) and offers significant...
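
The synthetic variance excerpt above mentions a numerically unstable formula for the traditional variance. The excerpt does not say which formula is meant, so the sketch below assumes it is the common one-pass shortcut E[X²] − (E[X])², and contrasts it with Welford's update, a well-known stable alternative. This is not the synthetic variance proposed in that article; the function names and sample data are hypothetical and chosen only for illustration.

```python
def naive_variance(values):
    """Population variance via the shortcut E[X^2] - (E[X])^2.

    Assumed here to be the 'numerically unstable formula'; it loses
    precision when the mean is large relative to the spread.
    """
    n = len(values)
    total = 0.0
    total_sq = 0.0
    for x in values:
        total += x
        total_sq += x * x
    mean = total / n
    return total_sq / n - mean * mean


def welford_variance(values):
    """Population variance via Welford's one-pass update (stable)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n


# Hypothetical data: small spread around a very large offset.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print("naive  :", naive_variance(data))
print("welford:", welford_variance(data))
```

The exact population variance of this sample is 22.5. With double-precision floats, the shortcut formula subtracts two numbers near 10^18 and cannot resolve a difference that small, while the Welford update only ever works with small deviations from the running mean.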