
March Machine Learning Mayhem

Machine Learning and the NCAA Men’s Basketball Tournament Methodology

 <<This article is the technical companion to the overview article. Please read that article before continuing.>>

“The past may not be the best predictor of the future, but it is really the only tool we have”

 
Before we delve into the “how” of the methodology, it is important to understand the “what”: a set of characteristics that would indicate that a lower seed will win. We use machine learning to search a large collection of characteristics, and it finds a result set of characteristics that maximizes the number of lower-seed wins while simultaneously minimizing lower-seed losses. We then apply the result set as a filter to new games. The new games that make it through the filter are predicted as more likely to have the lower seed win. What we have achieved is a set of criteria that is most predictive of a lower seed winning.
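As an illustration, applying a result set as a filter takes only a few lines of Python. The characteristic names and threshold values below are hypothetical placeholders, not the criteria our tool actually chose:

```python
# Each game is described by differences between the two teams'
# characteristics (lower seed minus higher seed). Names and values
# here are invented for illustration.
games = [
    {"def_eff_rank_diff": 12, "turnover_margin_diff": 1.5},
    {"def_eff_rank_diff": 85, "turnover_margin_diff": -0.3},
    {"def_eff_rank_diff": 30, "turnover_margin_diff": 0.8},
]

# A result set is a mapping: characteristic -> (lower bound, upper bound),
# where None means "no bound on this side".
result_set = {
    "def_eff_rank_diff": (None, 50),      # defense not ranked too far behind
    "turnover_margin_diff": (0.0, None),  # at least even on turnovers
}

def passes_filter(game, criteria):
    """True if the game satisfies every bound in the result set."""
    for key, (lo, hi) in criteria.items():
        value = game[key]
        if lo is not None and value < lo:
            return False
        if hi is not None and value > hi:
            return False
    return True

# Games that survive the filter are the predicted lower-seed wins.
predicted_upsets = [g for g in games if passes_filter(g, result_set)]
```

The filter says nothing about games that fail it; those are simply not predicted either way.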
 
This result set is fundamentally different from an approach that tries to determine the results of all new games, whereby an attempt is made to find a result set that applies universally. A universal model carries a level of complexity and ambiguity that is another discussion entirely. By focusing on one outcome (a lower-seed win), we can get a result that is more predictive than attempting to predict all games.
 
This type of predictive result set has great applications in business. What is the combination of characteristics that best predicts a repeat customer? A more profitable customer? An on-time delivery? This is different from forecasting demand by combining a demand signal with additional data. Think of it as the difference between a stock picker who picks the stocks most likely to rise and a forecaster who estimates how far up or down a specific stock will go. The former is key for choosing stocks, the latter for rating stocks you already own.
 
One of the reasons we chose “lower seed wins” is that almost every game played in the NCAA tournament provides a data point. The exceptions are games between identical seeds: the First Four games involve identical seeds, and the Final Four can as well. That still gives us roughly 60 games a year, and the more data we have, the better the predictions we get.
 
The second requirement is characteristics, and plenty of them. For our lower-seed-win analysis we had more than 200 different characteristics for the years 2012-2015. We used the difference between the two teams’ characteristics as the input; we could have used each team’s absolute characteristics as well. As the analysis executes, any characteristic that is not needed is ignored. What the ML creates is a combination of characteristics. We call our tool “Evolutionary Analysis”. It works by adjusting the combinations in an ever-improving manner to reach a result. There is a little more in the logic that allows for other aspects of optimization, but the core of Evolutionary Analysis is finding a result set.
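The core loop of an evolutionary search like this can be sketched in miniature. The data below is synthetic, and the single characteristic, mutation size, and fitness function are all simplifying assumptions; Evolutionary Analysis itself searches over 200-plus characteristics with more machinery around it:

```python
import random

random.seed(42)

# Synthetic training games: one characteristic difference per game,
# with upsets somewhat more likely when the difference is large.
# These probabilities are arbitrary, for demonstration only.
def make_game():
    diff = random.uniform(-1, 1)
    upset = random.random() < (0.5 if diff > 0.3 else 0.15)
    return diff, upset

games = [make_game() for _ in range(400)]

def fitness(threshold):
    """Lower-seed wins minus lower-seed losses among games passing the filter."""
    picked = [upset for diff, upset in games if diff >= threshold]
    wins = sum(picked)
    return wins - (len(picked) - wins)

# Evolutionary step: mutate the current best threshold, keep improvements.
best = 0.0
for _ in range(1000):
    candidate = best + random.gauss(0, 0.1)
    if fitness(candidate) > fitness(best):
        best = candidate
```

The loop only ever replaces the current filter with a strictly better one, which is the “ever-improving” behavior described above, on one dial instead of hundreds.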
The result set was then used as a filter on 2016 games to confirm that it is predictive; it was possible that a result set built from 2012-2015 wouldn’t actually predict 2016 results. Applied to the 2016 data, our result set produced 47% lower-seed wins, against a historic average of 26%. By chance alone, a 47% result would happen only about 3.4% of the time, so our result set is very likely to be genuinely predictive.
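A sanity check like this can be done with a binomial tail probability: the chance of seeing at least k lower-seed wins in n filtered games if the true rate were the 26% baseline. The sketch below uses 7 wins in 16 games as illustrative inputs; the 3.4% figure above was computed over the article’s own sample, so these numbers demonstrate the method rather than reproduce it:

```python
from math import comb

def tail_prob(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): probability of doing at
    least this well by pure chance."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Illustrative inputs: 16 filtered games, 7 lower-seed wins,
# 26% historic base rate for lower-seed wins.
p_value = tail_prob(16, 7, 0.26)
```

The smaller this tail probability, the harder it is to attribute the filter’s performance to luck.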
 
The last step in the process is to look at the filter criteria that were chosen and check whether they are believable. For example, one of the chosen criteria was Defensive Efficiency Rank. Evolutionary Analysis chose a lower limit of … well, it set a lower limit, let’s just say that. This makes sense: if a lower seed has a defense ranked far inferior to the higher seed’s, it is unlikely to prevail. As a counterexample, blocks per game was not a criterion that was chosen. In fact, most of the 200-plus criteria were not used, but a handful of around ten criteria set the filter that chooses a population of games more likely to contain a lower-seed win.
 
And that is one of the powerful aspects of this type of analysis: you don’t get one key driver, or even two metrics that have a correlation. You get a whole set of filters that points to a collection of results that deviates from the “normal”.
 
Please join us as we test our result set this year. We’ll see if we get around 47%. Should be interesting!
 
If you have questions on this type of analysis or machine learning in general, please don’t hesitate to contact Gordon Summers of Cabri Group (Gordon.Summers@CabriGroup.com) or Nate Watson at CAN (nate@canworksmart.com).
**Disclaimer: Any handicapping sports odds information contained herein is for entertainment purposes only. Neither CAN nor Cabri Group condone using this information to contravene any law or statute; it’s up to you to determine whether gambling is legal in your jurisdiction. This information is not associated with nor is it endorsed by any professional or collegiate league, association or team. Machine Learning can be done by anyone, but is done best with professional guidance.
 
 
 

Predicting the upsets for the NCAA Men’s Basketball Tournament using machine learning

Contemporary Analysis (CAN) and Cabri Group have teamed up to use Machine Learning to predict the upsets in the NCAA Men’s Basketball Tournament. By demonstrating the power of ML through our results, we believe more people can give direction to their ML projects.
 
Machine Learning (ML) is a powerful technology and many companies rightly guess that they need to begin to leverage ML. Because there are so few successful ML people and projects to learn from, there is a gap between desire and direction. 
 
We will be publishing a selection of games in the 2017 NCAA Men’s Basketball Tournament. Our prediction tool identifies games where the lower seed has a better-than-average chance of winning against the higher seed. We will predict about 16 games from various rounds of the tournament. The historical baseline for lower seeds winning is 26%. Our model predicted 16 upsets for the 2016 tournament and was correct on 7 of them (47%), which in simulated gambling gave the simulated gambler an ROI of 10% (because of the odds). Our target for the 2017 tournament is to get 48% right.
 
Remember, our analysis isn’t meant to support gambling, but to prove the ability of ML. However, we will be keeping score with virtual dollars, “betting” on the lower seed to win. We aren’t taking the odds into consideration in our decisions, only using them to help score our results.
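A minimal sketch of that virtual scoring, assuming a fixed stake per game and American moneyline odds (the bets and odds below are invented for illustration):

```python
def payout(stake, moneyline, won):
    """Net profit for one bet at American odds: lose the stake on a
    loss; on a win, +150 pays 1.5x the stake, -200 pays 0.5x, etc."""
    if not won:
        return -stake
    if moneyline > 0:
        return stake * moneyline / 100
    return stake * 100 / -moneyline

# One (stake, moneyline, result) triple per predicted lower seed.
bets = [
    (100, +180, True),   # underdog at +180, won
    (100, +220, False),  # underdog at +220, lost
    (100, +150, True),   # underdog at +150, won
]

profit = sum(payout(stake, line, won) for stake, line, won in bets)
roi = profit / sum(stake for stake, _, _ in bets)
```

Because underdogs pay better than even money, a strategy can show a positive ROI even while winning fewer than half of its bets.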
 
We will publish our first games on Wednesday the 15th, after the First Four games are played. We won’t have any selections for the First Four, as those games are played by teams with identical seeds. Prior to each round, we will publish all games that our tool thinks have the best chance of the lower seed winning. We’ll also publish weekly recaps with comments on how well our predictions are doing.
 
Understand that the technique that finds a group of winners (or losers) in NCAA data can be used on any metric. Our goal is to open people’s minds to the possibilities of leveraging Machine Learning for their businesses. If we can predict something as seemingly complex as a basketball tournament (something that has never been perfectly predicted), imagine what we could do with the data that drives your decisions.
 
If you have questions on this type of analysis or machine learning in general, please don’t hesitate to contact Gordon Summers of Cabri Group (Gordon.Summers@CabriGroup.com) or Nate Watson at CAN (nate@canworksmart.com).
 
Those interested in the detailed description of our analysis methodology can read the technical version of the article found here.
**Disclaimer: Any handicapping sports odds information contained herein is for entertainment purposes only. Neither CAN nor Cabri Group condone using this information to contravene any law or statute; it’s up to you to determine whether gambling is legal in your jurisdiction. This information is not associated with nor is it endorsed by any professional or collegiate league, association or team. Machine Learning can be done by anyone, but is done best with professional guidance.

Why you should update predictive models

After writing my previous post, “How Often Should You Update Predictive Models”, it seemed appropriate to follow up with a post on the consequences of not updating predictive models.
Predictive models use the patterns in historical and transactional data to identify risks and opportunities. Since conditions and the environment are constantly changing, the accuracy of predictive models needs to be monitored. Once a predictive model no longer reflects reality, it needs to be updated. Most of the time this is because the assumptions behind the model need to be updated.
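As a sketch of what “monitoring” can mean in practice, the toy class below tracks a model’s rolling accuracy and flags when it falls below a tolerance. The window size and threshold are arbitrary assumptions, not recommendations:

```python
from collections import deque

class AccuracyMonitor:
    """Track a model's accuracy over a rolling window of predictions
    and flag when it drops below a chosen threshold."""

    def __init__(self, window=100, threshold=0.70):
        self.results = deque(maxlen=window)  # 1 = correct, 0 = wrong
        self.threshold = threshold

    def record(self, correct):
        self.results.append(1 if correct else 0)

    def needs_update(self):
        # Wait for a full window before judging the model.
        if len(self.results) < self.results.maxlen:
            return False
        return sum(self.results) / len(self.results) < self.threshold

# Feed in a stretch where only 5 of 10 predictions were correct.
monitor = AccuracyMonitor(window=10, threshold=0.7)
for outcome in [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]:
    monitor.record(outcome)
```

When `needs_update()` turns true, that is the signal that the model’s assumptions no longer match reality and retraining is due.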
Take for example a community bank. Internally, every new transaction, deposit, withdrawal, application, or transfer creates new data. For most individuals, these transactions occur several times every day, which means the bank is compiling thousands of new data points. Over time the customer’s environment changes, and this is reflected in each data point collected. Did they get a raise or a new job? Is their car breaking down? So although this community bank may have a relatively modest customer base, its customers are experiencing change all the time.
There are also external changes that impact a customer’s behavior: interest rates change, new competitors enter markets, competitors invest in marketing, consumer confidence shifts, and competitors merge. It makes sense, then, that the bank would need to update its predictive models to keep up with all of these changes. When the changes start to represent structural changes, a new model needs to be developed.
For a typical community bank, strategic sales, marketing, and planning decisions happen at least once a quarter. If a bank doesn’t update its predictive models in preparation for these events, it is at high risk of using obsolete information when making decisions.
What are the consequences of using this obsolete information?

  • Your pricing models don’t reflect changes in the competitive environment.
  • You recommend outdated products.
  • Your marketing material isn’t targeted at the right groups, which might not exist anymore.
  • Your business development team begins chasing the wrong types of leads. For example, it might no longer be a profitable environment to pursue new home mortgages.

So if you’re planning to invest in predictive analytics, make sure you consider how you will keep your models current as well as the consequences of using outdated information.

How Often to Update Predictive Models

Every day, new information is created in your business. Your customers are buying more, subscribing or unsubscribing, and before you know it the customers you have today seem different from the customers you had the day before.
As these new patterns emerge, it’s important to periodically take time to investigate your data, update your predictive models, and challenge your assumptions about the business going forward. But how often should you do this? To answer that question, consider the following:

  • How often is my data changing?
  • How often do I plan on making decisions with the data?


Predictive Analytics is Not a Crystal Ball

It’s common to see predictive analytics as a sort of “crystal ball” for your business. The crystal ball image makes for great marketing. Unfortunately, predictive analytics is not a crystal ball.
It will not provide the correct prediction every time. Its primary purpose is to help you make better decisions by giving you the power to unlock the patterns inside your data. When performed correctly, this gives you the ability to simplify decisions. When performed incorrectly, it can spell disaster for your company.
Predictive analytics is both an art and a science. It requires a combination of empirical and subjective experience to verify that models reflect reality. This is why CAN takes three main aspects into consideration when building predictive models: data, theory, and math. In our experience, your predictive models will not reflect reality if any of these three aspects is not upheld.

The Friendship Paradox

About 20 years ago, a sociologist named Scott Feld discovered an interesting phenomenon: on average, people have fewer friends than their friends do. However, most people believe they have more friends than their friends do. This is the paradox. The friendship paradox is a form of sampling bias.
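The paradox is easy to reproduce on a toy friendship graph: the average person’s friend count comes out lower than the average friend’s friend count, because popular people appear on many friend lists and so are over-sampled. The graph below is invented for illustration:

```python
# A tiny symmetric friendship graph: person -> list of friends.
friends = {
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a"],
    "d": ["a", "e"],
    "e": ["d"],
}

# Average number of friends per person.
mean_degree = sum(len(f) for f in friends.values()) / len(friends)

# Average, over every (person, friend) pair, of that friend's
# friend count -- popular "a" is counted once per friend list
# it appears on.
friend_degrees = [len(friends[f]) for fs in friends.values() for f in fs]
mean_friend_degree = sum(friend_degrees) / len(friend_degrees)
```

Here the average person has 1.6 friends, but the average friend has 2.0, exactly the sampling bias Feld described.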

How Much Data Do I Need For Predictive Analytics?

Before beginning any predictive analytics project, it’s essential to investigate the breadth and depth of the data available. At what point, however, is it acceptable to say you have enough data to start?
The politically correct answer is that it depends. Depends on what, though?
Well, for starters, certain types of data science and predictive analytics projects have more specific data requirements. In an extreme case, predicting survival rates of people or machines may require data spanning their entire lifespans. In most cases, however, data requirements are less stringent.
In most cases, a snapshot of 3 to 5 years’ worth of data can reveal a breadth of patterns in consumer and business behavior. Why?
