Hello world, it's Siraj, and our task today is to try to predict whether a team is gonna win a game or not. Now, this is for football or, as Americans call it, soccer, which is the most popular sport globally, and of all the domestic leagues out there, the English Premier League is the most popular of all of them. So we're gonna predict the outcome for an English Premier League team using a data set of past games, and this data set, I'll show it to you right now, has a bunch of different statistics. This is what the data set looks like right here. You've got a home team and you've got an away team right here, so it could be Arsenal, it could be Chelsea, Brighton, Manchester City, and then you've got a bunch of statistics.
So these are all acronyms, but I have definitions for all these acronyms that we can look at right over here. We have acronyms for the full-time home team goals, the home team, the away team, the shots, the shots on target, the corners, the number of yellow cards, the number of red cards. So there's a lot of different statistics here, right? There are so many things that go into what makes a team win or lose. And so we're going to take all of these features and use them to try to predict the target, or the label, and the label in our case is going to be the FTR, which is the full-time result. So the FTR is right here: H, A, H, D. It could be either the home team, H, the away team, A, or a draw, D. So it's a multi-class classification problem. This is not a binary classification problem.
It's not just whether the home team wins or loses; it's multi-class, because there are three possible labels: home team, away team or draw. So that's what we're going to try to predict, given all of those features in the data set. Before I show you the steps, let me just demo this really quickly. I can just say X_test and then take the first row from this, and the labels are gone. This is just all the features, given no label, and we can see that it says home, right? So it's able to predict, given all those other features, whether a team is going to win, lose or tie the game. Okay.
So back up to this. We're gonna try to predict the winning football team, and it's a four-step process. Our first step is going to be to clean our data set and make sure that we only use the features that we need.
What do I mean by that? When it comes to predicting who's gonna win a match, there's an entire industry around this, right? There are pregame analyses by commentators and postgame analyses by commentators. Entire channels, like ESPN, are dedicated to trying to predict who's gonna win a match, and in fact, even during the game, there are commentators trying to predict, like during halftime, who's gonna win the full game. So this is something that's been going on forever, right? Since gladiator Roman days or whatever, it's been going on.
People have been trying to predict who's gonna win a match, but we're gonna do something that people don't do often, and that is using statistical analysis, otherwise known as machine learning or mathematical optimization, to try to predict who's going to win. If you think about it, this is like one of the most perfect machine learning problems out there.
Think of all the features out there, and those features don't necessarily have to do with the game. They could be the sentiment of the audience, the sentiment of the crowd, of news articles. How are people talking about a team? What hashtags related to the team are trending on Twitter? Are they home? Are they away?
What's the weather like that day? What are the forecast predictions? So there are so many different data points from across the web that could potentially go into telling us whether a team is gonna win or lose. But since I've never talked about this topic before, I'm just gonna start off from a very basic level, and based on your feedback and how you feel about this topic, I can talk about it more and do more advanced things later. Okay, so we're gonna clean our data set, then we're gonna split it into a training and a testing set, and we're gonna use scikit-learn to do that. I have yet to find a better library for splitting training and testing data than scikit-learn.
It is still like the best out there. Even if I'm using TensorFlow or PyTorch to build my model, I'll still use scikit-learn to split my training and testing data. It's just a one-liner, super simple. And then once we split it, we're gonna train three different classifiers on it. So remember, this is a classification problem, a multi-class classification problem, and so we're gonna use logistic regression and a support vector machine, and I've talked about both of those in my Math of Intelligence series, links to those in the description, but I'll also talk about them a little bit in this video just as a refresher. And the third one is a model that I haven't talked about before, and that's called XGBoost. Well, you could think of it as a technique or a model, same thing. So we're gonna use those three as our classifiers.
We're gonna train all three of them on the data set, and then we're gonna pick the classifier that has the best result, and that is gonna be the classifier that we use to predict the winning team. We're also going to optimize its hyperparameters using grid search. So we're trying out a set of machine learning methods, which scikit-learn makes very easy to do. Once we pick the right one, we'll optimize that model, and then we'll take that optimized model and use it to predict the winning team. And the history of this is, like I said, it's been going on for a long time, and sports betting has just been increasing in popularity for many years. If you look at the past five years, it's growing at double-digit rates, and there are a lot of reasons for this. Number one is just the accessibility of the internet: more people have internet access, and betting on the internet is easier than in person. Another reason is that machine learning is becoming democratized, and so everybody's able to build these predictive models to try to predict these scores. So this is definitely a field that's increasing in popularity, and this is not something that's happening on the fringe of society. This is a very mainstream task.
Kaggle, the data science community, hosts this yearly competition called March Madness, or Machine Learning Mania, whatever you want to call it, to try to predict the scores for the NCAA, that is, basketball. And you have an entire community around this, and people are trying out different models and discussing them, so definitely check that link out as well. So this is something that's happening, and I also found several papers talking about this.
So it's not just something that people who want to make money do. This is something that legitimate researchers at academic institutions look into and try to predict. So from this paper, I'm quoting verbatim: it is possible to predict the winner of English county Twenty20 cricket games in almost two-thirds of instances. And then from this other paper right here: something that becomes clear from the results is that Twitter contains enough information to be useful for predicting outcomes for the Premier League. That's for Oncasinogames, right here. So they use Twitter sentiment, just Twitter alone, to try to predict who's gonna win. So there are a lot of different angles we can look at here, right?
We could use sentiment analysis. We could use the past score history. We could use a whole bunch of different things. We're gonna use the score history, but you could try to simulate the game in a simulation and then try to see from that. There are a lot of different possibilities here. And check this out: in 2014, Bing, which is owned by Microsoft, correctly predicted the outcomes for all 15 games in the knockout round of the 2014 World Cup. Every single game, 15 of them, a hundred percent accuracy. So you can be sure that Bing's model is really good.
However, they are not going to share it with us, because it's kind of like financial analysts at JP Morgan or Chase: if they know how to predict stock prices, they're not gonna tell us. Why would they share their profits with us? So what we've got to do is figure it out for ourselves and try to reverse-engineer the techniques so that we can benefit from them.
Okay, so that was a little primer on the background. So, back to the data set. This data set that I got is from Football-Data UK. You can find it right here, if you go to /data.php. What I did was select the England football results, and luckily for us, they've got data sets for every season going back like two decades. So it's perfect, and if you want one, you could just click on Premier League and boom, it downloads just like that. And I showed you the data set. So one thing we can notice right off the bat is that if we were to just graph the results, and I've already done this beforehand for us, it's in Markdown right here, we'll see that the home team has the largest share of this graph.
So that means right off the bat, without doing any machine learning, we already know that if you are the home team, you have an advantage. Probabilistically speaking, if you're the home team, you're more likely to win than if you're not, just from that alone. And we can reason about this a couple of ways. We could say, well, if you're the home team, then, you know, football is a team sport and a cheering crowd helps you, and you travel less, so you're less fatigued. You're familiar with the pitch and the weather conditions. All these things. You had a hot dog from the stand and it tastes really good. Just kidding, baseball food or any kind of stadium food is never good, you know what I'm saying? I've got two great repositories for us. I'm about to start the code here, but I've got one for another EPL prediction, a great IPython notebook, or Jupyter notebook,
and I've got one for that Kaggle competition that I just talked about, for NCAA prediction. Definitely check them both out. And this guy Adit Deshpande has really great tutorials and software on his GitHub, so just check out all of his repositories, because he has some really great example code. So what we're gonna do is, I'm just going to code out a good part of this from the start, and then we're going to go over the rest. Okay, so don't save. All right, move, move, move. Okay, so first things first. Our dependencies in this case: we want to import pandas for data pre-processing, because that's like the most popular data processing library. And we also talked about XGBoost, right?
That is one of the machine learning models that we want to use. It forms a prediction model based on an ensemble of decision trees, and I've talked about decision trees as well. Another thing we're gonna do is import logistic regression. That's model two of three; there are three different models that we're going to train on our data set. One of them is XGBoost. The other is logistic regression, which is used whenever the response variable is categorical: either yes or no, or some kind of non-continuous, discrete value, you know, black, white, red, green, things like that. That's perfect for us here: win, lose or draw. So we have logistic regression, and then we have one more, which is going to be the support vector machine. I'll talk about that as well. And then finally, we're going to import this display library, because we are going to display our results.
Okay, so that's it for our dependencies. Now we can read our data set. We're gonna go ahead and use pandas, and pandas is gonna let us read from the CSV file that we downloaded, which I've called final data set CSV. Once we have that, we're going to preview the data. So I'm gonna say, okay, just go ahead and display the data that I've just pulled into memory as a pandas DataFrame object. I'll look at its head, which is just the first few rows of that data set, and once I have that, I can go ahead and print it, and now we can see what this data set looks like.
Notice there's a whole bunch of acronyms here. Lots of data sets have acronyms like this, and that can be confusing. But, like I said, I've got this legend of what each acronym means: the home team goal difference, the difference in points, the difference in last year's prediction, the wins for the past three games for the home team, the number of wins for the past three games for the away team. So I've kind of aggregated this data and made it into something a little more consumable. And remember that we still have one single target that we're trying to predict, and that is FTR, the full-time result for the game.
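The loading and preview step described above could be sketched roughly like this. The rows are made up so the snippet runs on its own, and the column names HTGD and ATGD as well as the file name in the comment are assumptions for illustration, not necessarily what the real file uses.

```python
import io

import pandas as pd

# Made-up rows standing in for the real file. With the file on disk you
# would call something like pd.read_csv("final_dataset.csv") instead
# (that file name is an assumption).
csv_text = """HomeTeam,AwayTeam,HTGD,ATGD,FTR
Arsenal,Chelsea,0.5,-0.25,H
Brighton,Man City,-1.0,1.5,A
Chelsea,Arsenal,0.25,0.5,D
"""

data = pd.read_csv(io.StringIO(csv_text))

# head() returns the first rows of the DataFrame (five by default).
preview = data.head()
print(preview)
```

In a notebook, passing `preview` to IPython's `display` would render it as a table instead of plain text.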
Which team won: the home team, the away team, or was it a draw? That's our target that we're trying to predict. So before we get into building this model, let's first explore this data set. First of all, let's think about what the win rate is for the home team. How often does the home team win, aside from anything else? This is kind of what we just talked about, right? How do we do this programmatically?
Well, we say, okay, get the total number of matches, and that's gonna be the first dimension of the DataFrame object, and then calculate the number of features from it. We want the number of columns, and we'll subtract one, because one of them is gonna be the label, the FTR, and that's not a feature. Then we're gonna calculate the matches won by the home team, which is going to be the length of the data where FTR is H, for the home team. So that's the number of matches that were won by the home team. And finally, we'll calculate the win rate for the home team as well, and once we have that, we can print out the results, and it's gonna tell us exactly how many times the home team has won as a percentage of all the matches.
So I can go ahead and print that. I've got this print statement right here, and then we can go ahead and see the result. Okay, so this is the graph that I showed at the beginning.
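Here is a minimal sketch of that win-rate calculation, assuming the label column is called FTR and a home win is coded as "H". The five rows are invented purely so the arithmetic is easy to follow; they are not real match data.

```python
import pandas as pd

# Tiny invented data set: two feature columns plus the FTR label.
data = pd.DataFrame({
    "HTGD": [0.5, -1.0, 0.25, 1.0, 0.0],
    "ATGD": [-0.25, 1.5, 0.5, -0.5, 0.75],
    "FTR": ["H", "A", "D", "H", "A"],
})

# Total number of matches is the number of rows.
n_matches = data.shape[0]

# Every column except the label counts as a feature.
n_features = data.shape[1] - 1

# Matches won by the home team are the rows where FTR is "H".
n_homewins = len(data[data["FTR"] == "H"])

# Home win rate as a percentage of all matches.
win_rate = n_homewins / n_matches * 100

print(f"Total number of matches: {n_matches}")
print(f"Number of features: {n_features}")
print(f"Matches won by home team: {n_homewins}")
print(f"Win rate of home team: {win_rate:.2f}%")
```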
About 46 percent of matches are won by the team that is home. Just right off the bat, that's something for us to know. While we're exploring the data, we're trying to think about which features matter the most, right? Feature selection: that's the process that we're going through now. So remember, when it comes to deep learning, we don't really have to think about what the ideal features are. Deep learning learns those features for us.
However, that's like a next step. We're just gonna try to build some more basic models first, and then, based on feedback on how you guys like this topic, I might do a deep learning video on sports analytics later. But right now we're just gonna build these three simple models, and thinking about feature selection is a really important skill to have as a data scientist. If you go with deep learning, you don't have to do that, but again, you've got to have a lot of GPUs and, crucially, you have to have a lot of data to be able to do that.
In this case, we don't have that much data. This data set downloaded in, like, two seconds, and of course it's only about 500 data points. For deep learning we'd want a huge amount of data, at least a hundred thousand. If we had at least a hundred thousand data points, then this would be something to use deep learning for. If we were trying to aggregate a bunch of different signals, sentiment from Twitter, past team scores, different talking points from other people, then we would use something like deep learning. But in this case, we want to try to visualize the distribution of this data.
So what we'll do is, we'll say, okay, from pandas there's this great tool that lets us create what's called a scatter matrix, and the scatter matrix basically shows how each pair of variables relates to each other. So we're gonna build a scatter matrix for a set of our features to see, just visually, what the correlation is between these different features. This will help us pick the relevant features that we want to use.
So we have the home team goal difference, the away team goal difference, the home team points, the away team points, the difference in points and the difference in last year's prediction. Once we visualize this, some of them have a positive correlation, where the line is going up, and some of them have a negative correlation. That means, for example, that if the goals increase for the home team, then maybe the points decrease for the away team. So we can look at the positive versus negative correlations; that's an indicator of how features are related to each other. This doesn't have a direct relation to what we're about to do, but it's just good practice to think about ways of visualizing our data, seeing the relationships between different features, and then trying to predict what the best features are for our model.
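As a rough illustration of the scatter matrix idea, the sketch below builds one with `pandas.plotting.scatter_matrix` on synthetic data. The column names HTGD, ATGD and HTP are assumed stand-ins for the goal-difference and points features, and the numbers are randomly generated with a built-in positive and negative relationship, so nothing here reflects real match data.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic features: HTP rises with HTGD, ATGD falls with it.
rng = np.random.default_rng(0)
n = 200
htgd = rng.normal(size=n)
data = pd.DataFrame({
    "HTGD": htgd,
    "ATGD": -0.5 * htgd + rng.normal(scale=0.5, size=n),
    "HTP": 0.8 * htgd + rng.normal(scale=0.5, size=n),
})

# One scatter plot per pair of features, histograms on the diagonal.
axes = scatter_matrix(data, figsize=(6, 6))

# The same relationships as numbers: sign shows the direction,
# magnitude shows how strong the linear relationship is.
corr = data.corr()
print(corr.round(2))
```

The correlation table is a handy companion to the picture, since a cloud of points that looks vaguely upward can be hard to judge by eye.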
Okay, so then, once we've explored our data, we're gonna prepare it. Remember, we have one single target variable, one single objective, or label, as we like to call it, and that is the FTR, the full-time result. So what we want to do is say, given all of those other features, try to predict the FTR. Okay, and make us some money. Yeah.
No, I'm just kidding. I mean, yes, actually, you probably want to make some money. We're trying to predict the full-time result, so we're gonna split the data into the FTR and then everything else. Then we'll standardize it, which means it's all gonna be on the same scale.
That means we want all of our data to be in a numeric format, and we want it all to be on the same scale. So it's not like one feature is in the hundreds of thousands and another feature is between 1 and 10. If they're gonna be small values, we want them all to be small values, and what this does is improve the prediction capability of our model.
So once we've standardized our data, we're gonna add these three features, which are the last three wins for both sides, and we looked at those before, right? HM1, HM2 and HM3, and then AM1, AM2 and AM3. If we look back at the data, some of the data was categorical. If we look at this data set, we have the referee, we have HTR; we don't want any of that, right? We want all of our data to be numbers.
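A minimal sketch of separating the label from the features and then standardizing, assuming scikit-learn's `scale` function is used for the standardization (centering each column at zero with unit variance). The column names HTGD and HTP and the values are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import scale

# Invented rows: two numeric features on different scales plus the label.
data = pd.DataFrame({
    "HTGD": [0.5, -1.0, 0.25, 1.0],
    "HTP": [12.0, 3.0, 7.0, 10.0],
    "FTR": ["H", "A", "D", "H"],
})

# Split into the label (FTR) and everything else.
X_all = data.drop(columns=["FTR"])
y_all = data["FTR"]

# Standardize the numeric columns so each has mean 0 and unit variance.
cols = ["HTGD", "HTP"]
X_all[cols] = scale(X_all[cols])

print(X_all)
```

After this step a feature measured in goals and a feature measured in points sit on the same scale, so neither dominates just because its raw numbers are bigger.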
We want every feature to be numeric, with no text categories, so we're gonna pre-process those features by saying: create a new DataFrame, find the feature columns that are categorical by checking if the data type is equal to object instead of a number, and then convert them into integers. That way we remove all the categorical features.
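One common way to do this kind of conversion is one-hot encoding with `pd.get_dummies`; the sketch below assumes that approach, which may or may not match the exact code on screen. The helper name `preprocess_features` and the W/L/D values in HM1 are illustrative assumptions, and the check uses "not numeric" rather than a strict object-dtype comparison so it behaves the same across pandas versions.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def preprocess_features(X):
    """Return a copy of X where text columns become 0/1 integer columns."""
    output = pd.DataFrame(index=X.index)
    for col, col_data in X.items():
        # Text (categorical) columns are expanded into one indicator
        # column per category, e.g. HM1 -> HM1_D, HM1_L, HM1_W.
        if not is_numeric_dtype(col_data):
            col_data = pd.get_dummies(col_data, prefix=col, dtype=int)
        output = output.join(col_data)
    return output


# Invented example: one numeric feature and one categorical feature.
X_all = pd.DataFrame({
    "HTGD": [0.5, -1.0, 0.25],
    "HM1": ["W", "L", "D"],
})

processed = preprocess_features(X_all)
print(processed)
```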
We only have one categorical variable, and that is our label, the FTR. We don't want our features to be categorical; those are gonna be numeric variables. And so once we have that, we've pre-processed our data, we've explored it, we've added the features that we thought were most relevant, and we can see them all here, right? No more categorical features, they're all numbers. Now we can split our data into a training and a testing set with a very easy one-liner from scikit-learn.
This is gonna split our data with the train_test_split function. It's gonna split that DataFrame object into a training and a testing set. We pass it the labels along with the features, and it's going to put all of those labels, the FTR results for each of the associated inputs, in a one-dimensional array. And we have 12 features for a single input. So for the next step, now we're gonna actually build this model. I'm gonna come back to these helper functions that are gonna help us train the model, but right now let's just build this model. So I'll go down here.
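The split itself might look something like the sketch below. The data is random noise with 12 made-up feature columns, and the test size, seed and stratification are assumptions chosen for illustration rather than the settings used on screen.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in data: 100 matches, 12 features each, plus an FTR label.
rng = np.random.default_rng(2)
X_all = pd.DataFrame(
    rng.normal(size=(100, 12)),
    columns=[f"feature_{i}" for i in range(12)],
)
y_all = pd.Series(rng.choice(["H", "A", "D"], size=100), name="FTR")

# Features and labels are passed in separately and split the same way.
# stratify keeps the H/A/D proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=2, stratify=y_all
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```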
So let's just write this out right now. We know that the first model that we want to try out, or at least one of the models that we want to try out, is logistic regression. I'll give it some random state as a seed. This could be any number, right? I'm just gonna say 40, I could say 42, it doesn't matter, just some seed number, and we could try out different seeds to see how the results vary. But I'm just gonna put some magic numbers down right now to get some result out. The next classifier we're gonna build is a support vector machine. The order in which I initialize the classifiers doesn't really matter, so that's irrelevant, but the fact that I am initializing them is important, because it means these are the three models that we are using. And so my third classifier is going to be XGBoost.
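Initializing the three classifiers could be sketched as follows. The seed values, the RBF kernel, `max_iter` and the integer label encoding are all assumptions for illustration, the data is synthetic, and XGBoost is imported defensively so the sketch still runs if that package is not installed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

try:
    from xgboost import XGBClassifier
except ImportError:  # xgboost is a separate install
    XGBClassifier = None

# Synthetic data: 12 features, label decided by which of the first three
# features is largest (0 = home, 1 = away, 2 = draw; encoding assumed).
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 12))
y = X[:, :3].argmax(axis=1)
X_train, X_test = X[:120], X[120:]
y_train, y_test = y[:120], y[120:]

# The order of initialization does not matter; the seeds are arbitrary.
classifiers = {
    "logistic_regression": LogisticRegression(random_state=42, max_iter=1000),
    "svm": SVC(kernel="rbf", random_state=42),
}
if XGBClassifier is not None:
    classifiers["xgboost"] = XGBClassifier(random_state=42)

# Train each model on the same data and record its test accuracy.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)

print(scores)
```

Keeping the models in a dictionary makes it easy to loop over them and pick whichever scores best before tuning it further.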