Advanced Operations: Data Analysis with Python & Pandas Tutorial


7.1 Section Intro

What is going on, everyone? Welcome to section seven of our Pandas with Python for data analysis tutorial series. In this section we're going to be talking about how we handle larger data sets. What we started talking about in the outro of section six was resampling. So, resampling is going to let you downsize a data set really quickly. Another reason why you might downsize a dataset is just for visualization purposes. If you're going to plot a graph, your eyes are not going to see a difference between, say, 100,000 data points and 10,000 data points. So you might as well just do the 10,000 data points, all right? There's no reason to do 100,000; that's just going to make your graph lag and your computer cry. So that's resampling, but also in this section we're going to be talking about some more advanced concepts, like doing more analysis. One cool thing that we can do is correlation and covariance tables with Pandas. And this is, again, like some of the other stuff we can do with Pandas, something that would be really computationally intense if you had a big data set. With Pandas, though, it's fast, and you just don't have to code it yourself, and that's great. Covariance tables and correlation tables are what we use to measure whether things are correlated or have any sort of covariance; we'll talk a little more about that when we get there. Creating these tables by hand is really quite the challenge, but with Pandas you can do it just instantly, right? So it's really nice. We'll be talking about that, and doing various statistics. And then we'll talk about buffering. So, how do you buffer data into a Pandas data frame, or even buffer a data frame out? An example of why you might do this is if your data set is larger than the RAM you have.
Okay, so a lot of databases can become many, many gigabytes or even terabytes. Just because you maybe have four gigs of RAM doesn't mean you can't work with databases that are many, many terabytes. But the first step to dealing with that is buffering. So that's what we're going to be talking about in this coming section: basically just working with larger datasets, and then also we'll talk about mapping functions. So it's basically more advanced operations, and this will be the last major section before we go into the final section, where we talk about truly working with a database. We'll actually set up a database, work with that database, and show how that works. So anyways, stay tuned for all of that, and let's go ahead and get started with section seven.

7.2 Basic Sorting

Hello everybody, and welcome to Part Two of Section Seven of our Pandas with Python for Data Analysis tutorial series. In this tutorial, we're going to be talking about sorting, so we can use sorting to sort data frames by columns. We can sort by a single column, but we can actually also sort by multiple columns, and with multiple rules, like ascending and descending. First we're going to need some data to work with here, and in line with covering multiple different types of data, let's do another way of acquiring data. What I want us to do is pull some data from Quandl, this time without manually going to download it. So, with Quandl, let me pull Quandl over real quick, over here. With Quandl you can do what we did before and search for a data set, but they also have an API, which you can see down here. You can query their API, and there's even a module for Python to connect and all that, and if you want to check that out, I recommend it. But at the very most basic level, you can usually query like you're seeing here in multiple forms. One thing that we might want to do is, let's say we want to get the price for Apple, so we'll search for AAPL and we'll just get Apple Inc. right here. And from here, there's a few things that you can do. One thing you can do is click on Python library. And this tells you, if you're using the Quandl Python package, that what you need to do is just quandl.get and then you put in this little ticker-like thing here, and that's just to help build this URL. But the other thing you can do is go to download, choose CSV, right-click on it, and copy the link address. So when we move that over here, what we get is that, right? So, that's the link we actually want. Now, of course, to actually visit links with Python we need to import urllib.request.
Now, let's build a pickle. So, we're going to define pickle_data, with empty parameters. First will be read: read will equal urllib.request.urlopen, and the URL we want to open is that URL, the Apple prices. So, we open that URL and then we read that into a data frame using read_csv. We can do that because this is just CSV structure. But actually, I mean, it's from a .csv, but it's being saved purely as, you know, a string in read. If we printed read out, it's not going to look like a CSV very well. So df, I'm just showing you because it's the structure that counts: when we say df = pd.read_csv it's just going to read it in knowing it is CSV-structured, but it's not like we somehow downloaded a .csv file and then opened it that way. It's just the structure. So, when we go to pickle, we'll also print df.head. And then finally we'll save it: df.to_pickle, and we're just going to save this pickle as df.pickle. This will save it in our local directory. Then we'll go ahead and call pickle_data, and let's go ahead and run that. It's taking a moment, sometimes Quandl takes a little bit, but anyway there it is. And now we should have a df.pickle, so if we pull down here, there it is: df.pickle. So we have that and we're ready to continue on. We don't want to do this every time, so I'm just going to comment that out for now. Now to read it in, we're going to have to go from the pickle, so we would say df = pd.read_pickle, and then we specify that name. And since it's local we can usually get away with doing this. But if you're on Linux you may not get away with that; I'm not sure about Mac. You might have to give the full path. Now we've got the data frame, so let's go ahead and print df.head really quick, and look at that one more time. You can see that we're starting with 2015, and it appears to be going in descending order.
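Pieced together, the fetch-and-pickle pattern described above might be sketched like this. Note this is a sketch, not the exact video script: the Quandl link is long gone, so an io.StringIO stands in for the urllib.request.urlopen response (pd.read_csv only cares that what it reads is CSV-structured), and the sample prices are made up.

```python
import io
import pandas as pd

def pickle_data(source):
    # In the real script, source is urllib.request.urlopen(quandl_csv_url);
    # any file-like object containing CSV-structured text works the same way.
    df = pd.read_csv(source)
    print(df.head())
    df.to_pickle('df.pickle')  # saved in the local directory

# Offline stand-in for the HTTP response body:
csv_text = io.StringIO("Date,Adjusted Close\n2015-01-02,109.33\n2015-01-05,106.25\n")
pickle_data(csv_text)

# On later runs, skip the download and just load the pickle back in:
df = pd.read_pickle('df.pickle')
```

After the first run you would comment out the pickle_data call, exactly as described above, and load from df.pickle from then on.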
So, we'll close that, and one thing that we can do is sort by date. Right now, when we look at it, let's pull that back up one more time, we can see that first of all there is no index yet, there are just columns, so first we're going to sort by date, and sorting is pretty simple. We can just do df.sort, and then we say what we want to sort: we want to sort the Date column. And then we're going to say inplace=True. Now let's look at that one more time. So now it starts with the oldest date, and it's in ascending order. Next what we want to do is set the index. So we're going to say df.set_index, we're going to set that index to be 'Date', and inplace=True. And then we're going to shorten this dataframe by saying df = df['Adjusted Close'], and then we'll print the head, so we'll see where we're at now. Whoops, let me bring this down. Okay, so we have this, and it's just a series without really a title over here. Okay, well, we do have our index at least. So now what we can do is, let's go ahead and run df.plot and plt.show. We need to bring in Matplotlib for that, so import matplotlib.pyplot as plt. We'll just run with the defaults for now. So there is our default graph, the price, and we've got the date data and all that stuff. Let's close this out, and what if we had not sorted? What if we got rid of this and ran that? What we get is the reversal, right, because Matplotlib plots in order, so while we have date data down here, it's obviously reversed. Right? It's flipped data. So that's why we wanted to flip it around and actually run that sort. Now what if maybe you don't want to sort by date, maybe you want to sort by something else? You don't have to use something like a date; what's neat about Pandas is it does a lot of stuff inherently, with some magical understanding. So we'll stop the plotting for now.
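The sort-then-index-then-slim sequence above might look like this on a tiny stand-in frame (the prices are made up; note the video's df.sort method was later replaced by sort_values in current pandas releases):

```python
import pandas as pd

# A tiny stand-in for the Quandl frame, deliberately in descending date order:
df = pd.DataFrame({'Date': ['2015-01-06', '2015-01-05', '2015-01-02'],
                   'Adjusted Close': [106.26, 106.25, 109.33]})

# The video calls df.sort('Date', inplace=True); in current pandas
# that same operation is sort_values:
df.sort_values('Date', inplace=True)
df.set_index('Date', inplace=True)

# Slim the frame down to just the adjusted close series:
df = df['Adjusted Close']
print(df.head())
```

With the rows now in ascending date order, df.plot() would draw the line left to right in time, rather than flipped.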
We'll just comment that out. And what if instead you did df = pd.read_pickle, and we'll read in df.pickle again. And we'll print df.head, and then what we'll do is df.sort, and this time we'll sort by the Open price, and then, again, whoops, we'll do comma inplace=True, and then print df.head. So, that's a lot of df.heads, actually, but the last two are the ones we're mostly interested in. The earlier one was the adjusted close only, and then this one is the full frame, sorted by the Open column. So kind of keep in mind that the adjusted close is including stock splits, whereas the regular open just wasn't. But, of course, you can sort by any column you want. Now, that's it for this tutorial, but in the next tutorial we're going to talk about sorting based on multiple rules at a time. That gets a little more interesting when you want to sort by one thing and then by another thing after that. So, we'll be talking about multiple sorting in the next tutorial, so stay tuned for that.

7.3 Sorting by Multiple Rules

Hello everybody, and welcome to part three of section seven of our Python with Pandas for data analysis tutorial series. In this tutorial we're going to be talking about sorting based on multiple rules. To start, I'm going to go ahead and delete everything except for the imports up here. We won't be using Matplotlib, but we'll just leave that for now. So, what if you wanted to sort something alphabetically, and then after you sorted it alphabetically, sort it numerically, or maybe numerically then alphabetically? For example, what if you're a school and you want to sort people by their grade, in school I mean, not their grade in the class, but you know, tenth grade, ninth grade and so on. You want to sort people by their grade and then by their name, or something like that. How would you do that? Well, we can do that with Pandas relatively simply, but try to imagine how you might do it without Pandas. That's another decent challenge. First let's actually just create this kind of school dictionary. So, we'll say school equals, and it's a dictionary, and then we're going to say we've got a few things. We'll have a string and a list, and we'll do that one, two, three times: this one will be Name for the student, then we'll have Age for the student, and then we'll have Grade. And again, that's their grade in school, like ninth grade, tenth grade, and so on. Let's try and do ten students. So, one, two, three, four, five, six, seven, eight, nine, ten. We want to have ten just so we have enough to actually mean something. We'll say we've got Jeff, and then we're going to have a Carrol, and then a Kyle, an Adrian, a Jessica, we'll have a Scott, a Tanner, a Kelly, a Brittney, and a Joe. And then age, this will be a little easier, let's just add some ages: 18, 17, 15, 15, 16, 17, 18, 19, 18, 14. Now let's see, one, two, three, four, five, six, seven, eight, nine, ten. Just make sure you have ten.
Put whatever you want in there. Make them kind of close ages at least, so when we sort by them it'll make sense, and then for grades, just do something between nine and twelve. So 12, 11, 9, 10, 11, 12, 12, 11, 10, and 9 is what I'll do. Again, you can copy me if you want; it's not that important if you don't. Now we're going to load in that data frame, so we're going to say df = pd.DataFrame(school). So we load that in, we'll print that dataframe, and let's make sure we don't have a syntax error up to this point. We don't. So there we have it, we've got Age, Grade and Name. It's all here and we're ready to rumble. First of all, as you've already seen, we can do df2 = df.sort and we can sort by the students' grades, so we can print(df2), save and run that, and here we have grade 9, 9, 10, 10, 11, 11, 11, 12, 12, 12. So we've sorted by grade relatively successfully. We can also sort differently. We can sort by age. Run that, there you have it: we've sorted by age, 14, 15, and so on. But what if you wanted to sort by grade, and then by age within that grade? Well, what we can do is say df2 = df.sort, and instead of putting just one thing, we put a list of things. So first we'll sort by grade, and then after grade is sorted, we'll sort by age. If we have people that are in the exact same grade, we'll make sure that they are subsequently sorted by their age. So, we'll print df2. And here you go: we have grade 9 and 9, with ages 14 then 15. Then you've got grade 10 with 15 and 18, grade 11 with 16, 17, and 19, and then grade 12 with 17, 18, and 18. And the 18 and 18 are the same, so nothing more really happens there. Subsequently, we could do all three, actually. You don't have to just do two; you could say df2 = df.sort, and then we sort by grade, then by age, and then we can actually sort by name. So print(df2), save and run that, and there you have it.
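The multi-column sort above might be sketched like this, with a smaller version of the school dictionary (the video's df.sort method is sort_values in current pandas):

```python
import pandas as pd

# A smaller version of the school dictionary from above:
school = {'Name':  ['Jeff', 'Carrol', 'Kyle', 'Adrian'],
          'Age':   [18, 17, 15, 15],
          'Grade': [12, 11, 9, 10]}
df = pd.DataFrame(school)

# Pass a list instead of a single column name: sort by Grade first,
# then break any ties within a grade by Age:
df2 = df.sort_values(['Grade', 'Age'])
print(df2)
```

With only one column in the list this degrades to the simple single-column sort; the list just adds tie-breakers in order.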
We have everything being sorted now: grade, then age, and then, if the age happens to be identical, it sorts finally alphabetically. Remember, Jeff and Tanner, for example, are both 18 and in 12th grade. So first it was sorted by grade 12, and they were the same. Then we tried to settle that tie by age, but the age is identical. So then we settled that tie by name, and it turns out that Jeff comes before Tanner. Cool. But what if you wanted to sort by Grade descending, rather than ascending, while Age and Name stay ascending? Because ascending is obviously the default. How would we do that? Well, we can do that too, with another parameter. So we do df2 = df.sort, and again we sort by the same columns as before, so I'll just fill these in here; I could have copied and pasted. So Grade, Age, Name, and then we have a parameter called ascending. It's just literally a parameter called ascending, and the default is True. But we could set this to be False, and now everything would be descending. So we copy that, come down here and print that. We should get away with that. Sure enough, now everything is descending. But what if we just want one thing to be descending? So we'll copy this, come down here, paste, and instead of just a single value we can turn this into a list. We want Grade to be descending, but then we want Age and Name to be ascending. So False, True, True, right? It applies in order: False is the zeroth element and applies to the zeroth column, Grade; the first element, True, applies to Age; and the second element, True, applies to Name. So then we can save and run that, and now you see that first we have Grade descending, so we start with 12, and then after Grade we sort by Age, ascending, yes, and then after that we sort by Name, ascending. So pretty cool. Anyway, that's sorting.
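The mixed ascending/descending sort just described might look like this on a three-student slice of the data (again, sort_values is the current name for the video's df.sort):

```python
import pandas as pd

df = pd.DataFrame({'Name':  ['Tanner', 'Jeff', 'Kelly'],
                   'Age':   [18, 18, 19],
                   'Grade': [12, 12, 11]})

# One ascending flag per sort column, applied in order:
# Grade descending, then Age ascending, then Name ascending.
df2 = df.sort_values(['Grade', 'Age', 'Name'],
                     ascending=[False, True, True])
print(df2)
```

Grade 12 comes first, Jeff and Tanner tie on age 18, and the Name tie-breaker puts Jeff before Tanner, just as in the walkthrough.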
Leading into this, you were probably asking yourself: how many ways can we actually sort something? Don't we just use .sort and that's it? But note we actually had quite a few examples and things that we could customize, so it's actually a lot more complex than you probably thought it was. At least that's how I felt. So anyways, that's sorting, and all the various ways you can sort things. In the next tutorial, we're going to be talking about resampling. We'll cover just some of the basics of resampling, where, usually, we pare down the size of a data set by doing things like taking the mean of five rows and converting those five rows to one row, or something like that. But you'll see what I mean in the next tutorial. So, stay tuned for that.

7.4 Resampling Basics: Time and How (mean, sum, etc.)

What's going on everybody? Welcome to the fourth part of section seven of our data analysis with Python and Pandas tutorial series. In this tutorial, we're going to be covering how to resample a data frame. The idea of resampling is to take a data frame, usually in some form of a time series, though you can do non-time series, where it's a little more in depth. But generally this would be with time series data, and what you'll do is you'll take a time series, let's say it's something that's updated every millisecond, and you can say: instead of every millisecond, I'd like to update this every minute. And then, if you've got something updated every minute, but you actually just want it updated once a week, you can do that too. And what it does for you is you can update that to be maybe the sum of that entire period, the average or the mean of that entire period, or even open high low close, which we'll talk about when we get there. So, with that, let's go ahead and get started. The first thing we're going to do here is clear this all the way up to the matplotlib stuff. And then also, from matplotlib we're going to import style, and then we'll do style.use, and we'll use fivethirtyeight. Next, we're going to load in our pickle, so we're going to say df = pd.read_pickle, and we're going to read in df.pickle. Now, we'll sort by date, because we knew that was a problem before, so df.sort, and we want to sort by Date, inplace=True. And then what we need to do is convert to datetime. If we don't convert to datetime, we're going to have a problem; there's a difference between a date stamp and datetime, like Pandas' datetime format. Generally, you can take just about any date and convert it to a pd datetime, so let's go ahead and do that.
We're going to say df['Date'] = pd.to_datetime(df['Date']), so that just converts it to datetime. Next we're going to say df.set_index, and then we say we want to set the index to be Date, and we'll also say inplace=True. Now that we've done all that, let's go ahead and print df.head and see what we've got now. Okay, so now we can resample this data. This data is currently pretty much one-day data. It comes as open, high, low, close, and it's updated about once every day, but that's a trading day. Over the weekend there will be no trading going on, so there will be gaps. Now, the next thing is how do we resample, and what are the options? First of all, to resample, we can do something like this. We'll do df2 = df.resample, and the first thing is the by-what; it's kind of a measurement. Capital D is for day, and you could even say 3D for 3 days. And then you have another one, the how: how do you want to resample that? That might be sum, mean, or open high low close. We'll use the mean. Now, the question is, how do you know what all these values are? You can think of a few of them and they make a little bit of sense, but there are a lot of them that actually come into play, so let me drag this window over. These are all under offset aliases, and here's the URL for anyone who's interested, but these are all your choices. You could resample by business day, a custom business day, calendar day, weekly frequency, business month start, right? Quarterly, yearly, hourly, minute, and so on. So you can go pretty advanced here if you wanted. And also, T is minute, but you can do multiples of minutes if you wanted: you could do 3min, and that would also be minutes, just for the record. I've actually never used the T; I've always used min. Anyway.
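The datetime conversion step above might be sketched like this (made-up dates and prices; the point is that resampling wants a DatetimeIndex, not plain strings):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2015-01-02', '2015-01-05', '2015-01-06'],
                   'Adjusted Close': [109.33, 106.25, 106.26]})

# Resampling needs a DatetimeIndex, so convert the plain date strings
# first, then move them into the index:
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
print(df.head())
```

pd.to_datetime will parse most common date formats on its own, which is why just about any date column can be converted this way.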
We'll do 3D for three days, and then after that we'll print df2.head. And let's run that real quick. So we can see here that basically it's every three days, right? It's just resampled every three days. One thing to note when you are resampling: if you resample by three days and there's no data for that three-day period, you will, no, sorry, you won't throw an error, but you'll get a not-a-number, like there will be no value. So a lot of times when you are resampling, you'll have to drop not-available information. Now, let's go ahead and do df2['Adjusted Close'].plot. And while we're at it, let's go ahead and plot the original one as well, so let's plot them both and see how major of a difference that made. And then we'll call it here. Save and run that. Where is my graph? There it is. Okay, so you can see mostly you're just seeing one line. You almost can't even tell that they're drawn over each other, but let's zoom in to a specific area, let's do this. Still can't really tell. Still can't really tell. Only when you zoom in really far do we start to see, okay, yeah, this red line is slightly lagging the blue line a little bit there. And that's because it's a mean: for three days, what was the average price? It was down here, and so on. So close out of that. Now that we've done that, we did three days; we could also do months. So, we could take this here, and we could resample by one month instead. Copy that, paste that, and instead of 3D we'll do 1M for one month, and we'll call this df3. And then we'll just take this copy, paste, and make that df3, so we'll plot all that. And then we'll wait for the chart again, I suppose. For some reason it's taking a really long time; it's not that much data, it's two thousand rows or something. Anyway, so there you have that, and then we can see again.
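The three-day mean resample above might look like this on nine days of made-up prices. Note the video's older syntax df.resample('3D', how='mean') became a chained method call, .resample('3D').mean(), in current pandas:

```python
import pandas as pd

# Nine days of made-up daily prices:
idx = pd.date_range('2015-01-01', periods=9, freq='D')
df = pd.DataFrame({'Adjusted Close': [1., 2., 3., 4., 5., 6., 7., 8., 9.]},
                  index=idx)

# One mean row per three-day bin (how='mean' in the video's pandas):
df2 = df.resample('3D').mean()
print(df2)
```

Nine daily rows collapse into three rows, one per three-day bin, which is exactly the kind of paring-down that makes big plots cheaper.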
You know, which lines lag the others. But on a larger scale you really don't see much difference; it's only when you start zooming in. And then we could do even wider resampling. Instead of one month, for df4 we could do 12 months. So df4 there, and we can copy this again, paste, and make that df4 here, and so on. The list could go on, and we could use all of those options from the list I showed you before. And finally, in this one we can definitely see that the yellow line stands out from the crowd, even fully zoomed out. Resampling is pretty good, and I think showing it visually is also pretty useful, because it really shows you how granularity can actually matter. If you're looking at a data set this far out, yeah, the year makes a difference. But what about the other granularities? Not really: one month, three day, and the raw data are almost indistinguishable this far out. When we zoom in, again, clearly one year is too much, and then we're starting to distinguish one month, but there's really no difference between three days and one day, right? And then we can continue zooming in, and it becomes a little more obvious, but it takes a lot of zooming in, from the full dataset, to actually see the differences between one day and three day. But it's there. It just depends on how much data you're actually looking at. A graph is a great example, because it takes a lot of resources to graph information. So if you're not losing anything visually by plotting a graph this big, and we can't really tell any difference between one-day and one-month data, well, the one-month data is like 30 times less data, right? So it's a big deal. It can be useful to pare down your datasets that way. Now, another useful way to pare down data is by resampling to open high low close. The idea there is that you resample your data, and then, for that time period,
let's say you resampled for one month to open high low close: what you would get returned is the open, so the starting value, the highest value in that one-month period, the lowest value in that one-month period, and then the last value in that one-month period. So, it's a little more descriptive for you, but it still pares down your data, while you still see the maximum and minimum fluctuations and stuff. So, anyway, that's what we're going to be covering in the next tutorial. So, stay tuned for that.

7.5 Resampling to OHLC

Hello everybody, and welcome to part five of section seven of our data analysis with Python and Pandas tutorial series. In this tutorial we're going to be talking about resampling, only this time resampling to open high low close. First of all, we're going to go ahead and delete everything up to basically the resetting of the index here. So go back up like this. Delete. Now what we want to do is resample to open-high-low-close. And again, the idea behind open-high-low-close is that we can take a time period and resample that time period. Let's say normally you've got daily information coming in, and you resample that time period to ten days. What we end up doing there with open high low close is, over the course of those ten days, we report what was the first price, what was the last price, what was the high price, and what was the low price. Actually, the default data we're getting right now is already open high low close. We can print out df.head, and we can see that this is actually open high low close, because stocks are being traded, you know, by the millisecond, and this is data updated once a day, right? But even stock data that's updated once a minute is usually open high low close data, because, again, stocks are being traded many times a millisecond, so even second data is usually open high low close. So anyway, what we're going to do is show how we can use Pandas to create open high low close data. We have df.head here, and we'll just leave that, I suppose. And then what we're going to say now is df2 = df, and we'll do this against the adjusted close, so Adjusted Close, and then we'll do .resample as always, and then we'll say we want to resample to ten days, and then finally the how is going to be equal to ohlc, which is just short for open high low close. And actually this should be lowercase ohlc, not uppercase.
So now, to see what we've got here, let's print df2.head. Save and run that. And there we have some data. Now again, when you go to resample, you're going to find that fairly frequently you've got not-available data. The way that we would check for not-available data would be something like this: if df2.isnull().values.sum() is greater than zero, we print "We contain NAN data!". So let's run that and just see if we got any on that data frame. We didn't. That's great. What if we resampled to two days? Ran that. And we check the bottom, and darn, we contain NAN data. Why do you think we contain not-a-number data when we resample to two days? Well, the reason is that, with two-day periods, say a period lands on a Saturday and a Sunday: that's going to be a problem, because there's no data. By that same token, we could just do D, and we would also have a problem because of the weekend. The same thing is true for anything less than a day, basically, because it's not accounting for the weekend. Pandas is kind of blind in its dishing out of these values. But what Yahoo is doing, well, they have stock prices, or actually whoever's providing them; I doubt Yahoo is the one that produces their prices, probably Reuters or Bloomberg or something, and Yahoo is just serving them. They perform the open, high, low, close calculation. And what they're probably doing is, when they resample by day, something like this: it's going to have not-available data, so the next thing that they probably do is df2.dropna, and then they just say inplace=True. And then they're probably not going to have any of these problems, so I'll cut that there, paste there, save and run it. And sure enough, we don't have any issues.
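The OHLC resample, NaN check, and dropna steps above might be sketched like this. The business-day index and prices are made up, but they reproduce the weekend gap described in the walkthrough, and the video's how='ohlc' argument is the .ohlc() method in current pandas:

```python
import pandas as pd

# Ten business days of made-up prices -- no weekend rows in the index:
idx = pd.bdate_range('2015-01-05', periods=10)
prices = pd.Series(range(10), index=idx, dtype=float)

# how='ohlc' in the video's pandas is a chained method call today:
df2 = prices.resample('D').ohlc()

# The Saturday/Sunday bins had no trades, so they come back as NaN rows:
if df2.isnull().values.sum() > 0:
    print('We contain NAN data!')

df2.dropna(inplace=True)  # what a price vendor would likely do
print(df2.head())
```

After dropna, only the ten business days survive: the two empty weekend bins are gone, and every remaining row has open, high, low, and close columns.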
By that same token, we can also fill when we resample. Remember, we talked about fills back when we covered dropna; we talked about what we could do there. Well, instead of doing dropna, we can comment that out, and in the resampling, as another parameter, we can actually add a fill_method, and that fill method could be any of a variety of methods: bfill, forward fill, whatever. Now, as for which method you ought to use, think for yourself. What fill method should you use with stock prices? Because, think about this: people use stock prices to do data analysis, so what are people doing when they do data analysis on stock prices? They're generally trying to predict the prices. So when you have a fill method, should you take future prices and push them back, or should you take previous prices and push them forward? If you're going to go about filling, that is; you could also just remove them, but what if you wanted to fill? Well, you would use a forward fill. You would not back fill, because that's like seeing into the future, so you definitely don't want to do that. So we use a forward fill. Now, generally, I mean, with stock data, I wouldn't even fill at all. You shouldn't do that. But there is some data where, if you want to resample but you're going to wind up with some not-available data, you're going to want to fill it in, for whatever reason. So, we'll go ahead and do that. We'll exercise our fill, and, sure enough, leaving behind that dropna that we did before, we can see that we're not having any issues here. It is being resampled into open high low close and everything; everyone is a winner today. So that is resampling to open high low close. In the next tutorial we are going to be talking about correlation and covariance, and doing it within Pandas. Both of these are actually kind of complex
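The forward-fill variant above might look like this (same made-up business-day series as before; the video's fill_method='ffill' parameter was later replaced by chaining .ffill() after the aggregation in current pandas):

```python
import pandas as pd

idx = pd.bdate_range('2015-01-05', periods=10)  # business days, no weekends
prices = pd.Series(range(10), index=idx, dtype=float)

# fill_method='ffill' in the video's resample call; current pandas chains
# the fill after the aggregation instead. Forward fill pushes Friday's row
# over the empty weekend bins -- never bfill here, since that would leak
# future prices backwards:
df2 = prices.resample('D').ohlc().ffill()
print(df2.head(8))
```

Every calendar day now has a row and none are NaN; the weekend rows simply repeat Friday's open/high/low/close.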
to do all on your own, especially when we start comparing correlations and building correlation tables and stuff like that. But they're actually pretty powerful to have at your disposal. So, that's what we're going to be covering in the next tutorial, so stay tuned for that.

7.6 Correlation and Covariance, Part 1

Hello everybody, and welcome to part six of section seven of our Python with Pandas for data analysis tutorial series. In this part, what we're going to be talking about is correlation and covariance; mainly, how do we do this on an entire data frame, and then even build tables of correlation and covariance. To start, we're going to need a few different types of data sets, or at least multiple columns, because the idea of correlation and covariance is to measure similarities in movement between two entities. An example here might be comparing yearly crime rates, employment rates, and income or something like this, right? And the idea is to hopefully find relationships, either a positive correlation or a negative correlation, and so on. Another example is with finance, where we find a few assets that are either positively correlated or negatively correlated, and the idea is that when one goes up, the other one should go up, or when one goes up, the other should go down. Or another method for using this is when they start to diverge: if you have two assets that are highly correlated, and one starts to go up and the other one goes down, you short the one that went up and you long, or buy, the one that went down, and it's a market-neutral trading strategy. So anyway, what we're going to be doing here is working on how we at least calculate correlation and covariance, and then from there you can do whatever it is you want to do with that data. First of all, we're going to clear out this existing information all the way up to urllib.request. And now we need to grab some data from Quandl that we're going to use. First let's grab a crime rate one. Just for the record, you can create an account on Quandl and you get like a ton of calls; I think it's even unlimited, actually, if you have an account.
Otherwise it's something like five hundred API calls per IP a day, so just keep that in mind. There, now we can get the crime rate database. I'm trying to think if it's violent crimes total. Let's see, US crime rate. I'm trying to find the one that we found before; it was crime 11 or something like that. Yeah, this one, I think. This will do, so let's just right-click and Copy Link Address on the CSV download. So this will be one of them. Yeah, that's the one I wanted. Then we'll come back to Quandl and grab the US Employment Rate, and let's get the annual one. So, the USA Employment Rate, annual. We'll download that and grab the link for the CSV. Next, let's grab US income. Let's see: US high income, personal income. We could do personal income; I'm just trying to see which one we can get here. It's a premium dataset, hold on, let's click on free here. Jerks. That's weird, I picked free, why are you still showing me premium? What I'm looking for is a very specific income data set; usually you can find one. Right, here's the one I want. So, type in average family income, and we're going to go with the average family income growth by income group. We'll take that one. Any time you want to trim anything, you can trim by coming down here and setting this to whatever you want. Let's set it to start at 1980 or something, and then when we go to download it, it'll hopefully apply that trim. So, copy that link, and the trimming should append to the end of the link. Finally, we want GDP, so USA GDP. Feel free to use a different country than the USA for all of these if you're from somewhere else. I'm trying to find GDP per capita at current prices, and then at constant prices. We'll just do this one, I guess.
I was hoping it would say GDP per capita; I guess we'll do per capita here at current prices. That'll do for now. Then we'll go Download > CSV, right-click, Copy Link Address. So we'll just save all of these. Okay. Now we're going to create a function like we've seen before to pickle all this data. So we'll define pickle_data, with no parameters. Then we're going to read in each of these URLs and save each to a pickle. We could probably cycle through the URLs, but the file names would give us some trouble, so we'll just write it out. We'll say read = urllib.request.urlopen(), and we want to put this first link in there, in quotes. Then df = pd.read_csv(read), then we'll print df.head(), and then we'll do df.to_pickle, and the pickle that we're sending this to is literally called df.pickle. Okay, so we want to do that for every link, so just copy this, paste, paste and paste. This was our second link, so we'll put that into here, then the third link, copy, paste, and the fourth link, copy, paste. And we want these to be df2, df3, and df4. That should be good enough, and now we'll go ahead and quickly pickle the data: save that and run, hopefully. There we go. It might take a second to run all of this against Quandl. Okay, there we go; you should now have a df.pickle, df2.pickle, df3.pickle, and df4.pickle. So if I come down here, here we go, these are the files you should have (not that one). Okay, so now that we have those, we're ready to carry on. The first thing we want to do now is read in these pickles. So let's create some space here, so everybody can see. I'm going to say df = pd.read_pickle, and we're reading df.pickle, and then we'll just multiply this again for two, three, and four: df2 = pd.read_pickle('df2.pickle'), and the same for df3 and df4. So we bring those in, and now we can do simple correlation.
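Pulled together, the pickling routine described above might look like the following sketch. The Quandl links below are placeholders, not the real URLs from the video, so substitute the ones you copied yourself; the pickle_name helper is just a small convenience introduced here to keep the file naming in one place.

```python
import urllib.request

import pandas as pd

# Placeholder Quandl CSV links -- swap in the real ones you copied.
URLS = [
    "https://www.quandl.com/api/v3/datasets/EXAMPLE/CRIME.csv",       # crime rate
    "https://www.quandl.com/api/v3/datasets/EXAMPLE/EMPLOYMENT.csv",  # employment rate
    "https://www.quandl.com/api/v3/datasets/EXAMPLE/INCOME.csv",      # family income
    "https://www.quandl.com/api/v3/datasets/EXAMPLE/GDP.csv",         # GDP per capita
]

def pickle_name(i):
    # df.pickle for the first frame, then df2.pickle, df3.pickle, df4.pickle
    return "df.pickle" if i == 1 else "df{}.pickle".format(i)

def pickle_data():
    # Download each CSV, preview it, and save it to its own pickle file.
    for i, url in enumerate(URLS, start=1):
        read = urllib.request.urlopen(url)
        df = pd.read_csv(read)
        print(df.head())
        df.to_pickle(pickle_name(i))

# pickle_data()  # uncomment to actually download and pickle
```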
A really simple correlation would be print(df3.corr()). That's literally it, and you get the correlation there. Also, let's stop pickling here; we don't need to re-pickle every time, just once will do. Save and run that. And this would be the correlation for df3, so basically this is a correlation table comparing movement between the various income classes here. So the bottom fifth always moves with itself in a perfect correlation, right? Second and second, of course. But then we can compare how the bottom fifth moves in relation to the second fifth. Pretty close. How about to the middle fifth? Not as close. How about the fourth fifth? A little further away still, and then the top fifth moves even further away, so we can see there's quite a bit of a difference there. Then we can ask, okay, how does the top fifth move in relation to the bottom fifth? Well, it should be the exact same value, right? So we can keep going and see that the top fifth moves with about a 90% or 91% correlation there. So creating a correlation table like that is actually really, really simple. Another one: which one was the crime data? Let's see, yes, the crime data was the first one. So we could print(df.corr()) and get a correlation table for the crime rate, too. Now, this one is a huge table, so it takes up a lot of space and has columns that are really long, but we can see, for example, that the larceny theft rate and the violent crime rate are very highly correlated. Another one that's interesting is the negative correlation with population; it's almost a perfect negative correlation. So between this video and the next video, think about what might actually be causing that. Anyway, we'll talk more about covariance and correlation and continue along here in the next tutorial, so stay tuned for that. Thanks for watching.
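Since the Quandl income data isn't bundled here, a tiny synthetic frame can stand in for df3 to show what .corr() produces. The column names echo the video's, but the data below is made up: two series built to move in opposite directions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df3: a random walk plus noise, and its mirror image.
rng = np.random.RandomState(0)
base = rng.randn(100).cumsum()
df3 = pd.DataFrame({
    "Bottom 5th": base + rng.randn(100) * 0.1,   # tracks the walk closely
    "Top 5th":    -base + rng.randn(100) * 0.1,  # moves opposite to the walk
})

corr = df3.corr()
print(corr)
# The diagonal is 1.0 (each column with itself), the matrix is symmetric,
# and the off-diagonal entries here come out strongly negative.
```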

7.7 Correlation and Covariance Part 2

Hello everybody, what is going on? Welcome to part seven of section seven of our Python with pandas for data analysis tutorial series. In this part, we're going to be building on the last part, where we built these data frames, pulled the data, pickled it, accessed the pickled data, and then showed how to do correlation on a single data frame. But our real curiosity is to combine all of these data frames. So, bringing up what we have so far, one thing I said I was going to talk about is that one of the most highly correlated relationships is the comparison of population to a bunch of these crimes. No matter what the crime is, population is pretty highly correlated, especially here: larceny theft rate, at -.985, I mean, that's huge. So the question is, when you use correlation and covariance... I'm sure you've all heard it plenty of times in school, but it's always good to reiterate that correlation is not causation. In the case of population and this extremely negative correlation, it's probably completely unrelated; if anything, more highly populated areas tend to have higher crime rates. What's happening here is just the plain fact that the population of the United States of America has been going up the whole time, while our policing procedures and practices and all that have helped with the crime rate, which has gone down over time. We could also make the argument that maybe reporting of crimes has changed, and maybe that's why it's going down, but regardless, the crime rate has been in steady decline this entire time, and the population has been in a steady increase. Because of that, we have this high degree of negative correlation, but that doesn't necessarily mean that the more population the United States has, the less crime it has. That's definitely not true. These are just two trends that are negatively correlated, basically, but that doesn't mean much.
You can't really extrapolate much meaning there. Moving along, one of the things we can also do is covariance, so we can print(df.cov()), and .cov() is for covariance. What covariance is going to tell you is basically: when one thing tends to move in one direction, how much tendency does another have to move in that direction? That sounds pretty vague, and that's because it is pretty vague. Correlation is bounded, right? It runs from negative one to positive one. Covariance, on the other hand, is completely unbounded; it's not normalized. So you can have a positive 100 covariance and a negative 20 covariance. Also, if your data itself is not normalized, covariance is going to be pretty wild. For example, if you have data sets measured in inches and you compute covariance on the inch measurements, and then you switch to feet and compute covariance again, you're going to get a different result. The sign of the covariance should stay the same, unless it was really close to zero, like 0.00001 or something, but if you were getting a covariance of ten with inches and you convert to feet, you'll probably still get a high covariance, just not 10; you might get 30, you might get five, or something like that. Anyway, let's run a covariance real quick, and this will just be on the original data frame. As you can see, we have some widely varying numbers. This is just because the columns aren't on the same scale. One thing people will do is compute correlation and then apply covariance to the correlation, so you could do something like that, too. Now, the next thing I want us to do is show how to combine all of these data frames and then do a correlation table, let's say, of all of this data. That's going to require us to do a couple of things first.
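The inches-versus-feet point can be checked directly: rescaling the data changes .cov() but leaves .corr() untouched. The numbers below are made up for illustration.

```python
import pandas as pd

# Two made-up measurements recorded in inches.
inches = pd.DataFrame({"a": [60.0, 62.0, 65.0, 70.0, 72.0],
                       "b": [30.0, 31.0, 33.0, 35.0, 36.0]})
feet = inches / 12.0  # identical data, different units

print(inches.cov())   # covariance in inches-squared
print(feet.cov())     # same data, covariance 144x smaller (12 * 12)
print(inches.corr())  # correlation is normalized: identical either way
```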
So, some of these data frames, for example the original data frame, have the dates in this capital-Y 'Year' column, and that's the same with df3 and df4, but df2 actually has a 'Date' column. So we need to change that one. The other downside is that df2 and df4 don't actually have proper column titles; they're just called 'Value'. We need to rename those to something useful, so that when we combine everything we don't just have two random columns called 'Value' that we have no idea about. So first off, let's handle df2 and df4. Let me comment this out, and let me print these: print(df2.head()), and then we'll print(df4.head()), just so you guys can see what I mean. So this has 'Date' and then 'Value' here; actually, they both have 'Date' and 'Value'. Let's also do print(df.head()); that one has 'Year'. And df3, let's type it down here: print(df3.head()). This one has 'Year' too. So we basically have to make a choice, but we're going to convert them all to 'Year', I think; it's the most reasonable. So, one thing you can do if you want to change all of the columns in a data frame (I'm just going to delete all of these prints because we don't need them right now) is something like this: you can say df2.columns = [...], and this allows you to rename all of the columns. The columns we want here are going to be 'Year' with a capital Y, and then, since df2 is the employment rate, 'Employment_Rate'. Let's use an underscore; you should always use underscores rather than spaces, because that way you can use dot notation too, treating the column like an attribute. So that renames all of the columns. Well, what if you just wanted to rename one or two of the columns? What you would do then is df4 = df4.rename(), and then you can rename specific ones.
In there, you put a dictionary of the values we want to change, so columns needs to equal that dictionary. This one had 'Date', and we want to change that to 'Year', and it also had 'Value', which we want to change to something useful. This was GDP per capita, so 'GDP_Cap'. Okay, that should rename them; let's print these: print(df2.head()), and then we'll print(df4.head()), and now they all have the same 'Year' column, cool. So now that we've got everything normalized like this, we're going to set all the data frame indexes to year. To do that, we'll just do df.set_index on the 'Year' column, and we'll say inplace=True. Easy enough. Copy that, paste, paste, paste, and then make those df2, df3, df4. Easy. Now we want to join all of these data frames together, so we'd do something like this: we'll say joined = df.join(). Before, I showed you how to join a single data frame, but you can actually join a whole list of data frames right out of the gate. So we'll say df joins [df2, df3, df4], because that's what we want to do. Then we can print(joined.head()), and comment out the others since we don't need them anymore; save and run that, and now this entire thing is just one big data frame with crime rates, income information, somewhere in here the employment rate, and then somewhere in here we should also have... I'm not seeing it, but maybe I'm blind... yeah, there we go, GDP per capita. Cool. So now we do. One last thing we might want is a dropna: joined.dropna(inplace=True). Cool, let's do that, awesome.
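The whole combine step, from renaming through joining and dropping missing rows, can be sketched with toy frames standing in for the pickled Quandl data. The numbers below are invented; the column names mirror the ones in the video.

```python
import pandas as pd

# Toy stand-ins for the pickled frames.
df = pd.DataFrame({"Year": [2000, 2001, 2002],
                   "Violent Crime Rate": [506.5, 504.5, 494.4]})
df2 = pd.DataFrame({"Date": [2000, 2001, 2002],
                    "Value": [64.4, 63.7, 62.7]})
df4 = pd.DataFrame({"Date": [2000, 2001, 2002],
                    "Value": [36.3, 37.1, 38.0]})

# Rename every column at once by assigning a list to .columns ...
df2.columns = ["Year", "Employment_Rate"]
# ... or rename selectively by passing a dict to .rename():
df4 = df4.rename(columns={"Date": "Year", "Value": "GDP_Cap"})

# Give every frame the same index so they line up on join.
for frame in (df, df2, df4):
    frame.set_index("Year", inplace=True)

joined = df.join([df2, df4])   # .join() accepts a whole list of frames
joined.dropna(inplace=True)
print(joined.corr())
```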
Now we don't have those missing values, and finally, you could print(joined.head()) if you wanted, but instead we'll print(joined.corr()). Now you can see the correlation for all of these things: how does GDP per capita correlate with the various crimes? Well, no surprise there; the more money people are making, the better everyone's socioeconomic status is, the more comfortable people probably are, and they don't really need to commit crimes. Interestingly enough, one that isn't so correlated with having more money is burglary. I suppose if other people are having more money, then maybe they're more likely to be burglarized. Who knows? Then you've got income: how are income changes reflected in, say, violent crime? Not very highly correlated, but if we look at the middle fifth, their income changes are more highly correlated, and then we could look at the bottom fifth, whose changes are very correlated. That's interesting, because you would think that as bottom-fifth income went up, violent crime would go down, maybe; most people would think, oh, it's the poor people committing all the crimes. So you would almost expect this to be negatively correlated, right, where the bottom fifth income goes up and violent crime goes down, or vice versa. But we're not actually seeing that. If anything, looking at the top fifth... yeah, really the top 5%, that's the closest to zero, actually. So it's interesting that those are just completely unrelated. Anyway, you can scroll through here and look at the correlations of basically all of these comparisons. I think crime-to-crime doesn't really yield that much interesting stuff, since most of those move together, but a lot of these correlations are kind of cool to look through and just see.
So this is how you would perform correlation tables on a bunch of data frames and also bring those data frames together. Now, in the next tutorial, we're going to shift gears a bit. We're going to be talking about how we can map custom functions that we build to a pandas data frame, which we'll end up using a little more later on in this tutorial series. It's a super useful thing to be able to do, because if something isn't available to you in pandas outright, you can usually write a really quick function that handles the problem for you, and these functions can then be applied to an entire column of a data frame. So anyways, that's what we're going to be talking about in the next tutorial. So stay tuned for that.

7.8 Mapping Custom Functions

Hello everybody, and welcome to part eight of section seven of our Python with pandas data analysis tutorial series. In this tutorial, what we're going to be talking about is how to write a custom function that can be mapped to a column on a data frame. This is kind of pandas' way of letting us fully customize our pandas use, even if what we want to do isn't included in pandas, or at least not in the form we want. Sometimes it might be there and I might not even realize it, or maybe you can just think of a clearer way to do something; there are a whole lot of reasons. But generally, there's just not a built-in function to do what you want to do, so you can write your own, and this way you don't have to move your data out of pandas just to perform some operation and move it back in. So, what we're going to do here is revisit the income by segment. We have this data on the segments of income by fifths, as well as some percentage points, and for now, at least initially, we'll focus on the bottom fifth and the top fifth, let's say. A lot of times, at least in the United States, people compare the upper class to the middle class, or the upper class to the lower class, and say, hey, the income gap is getting larger. But almost always, people are focusing on the dollar figure of the gap, and the dollar figure is completely explained away with a basic education in statistics. Right? If one person has $10,000 a year and another person has $100,000 a year, and both grow by an equal ten percent, then the first person has only gained one thousand, so they have eleven thousand a year now, while the other person has a hundred and ten thousand, because they grew by ten thousand. So the dollar gap between them has increased, but this was a fair increase, percentage-wise.
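The arithmetic in that example is easy to check: equal percentage growth widens the dollar gap while leaving the ratio between the two incomes unchanged.

```python
# Two incomes, both growing by the same 10%.
low, high = 10_000, 100_000
low2 = low + low * 10 // 100     # 10,000 -> 11,000
high2 = high + high * 10 // 100  # 100,000 -> 110,000

print(high - low, high2 - low2)    # dollar gap: 90,000 -> 99,000
print(high // low, high2 // low2)  # ratio: 10 -> 10, unchanged
```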
You could argue that this is still a red flag; if you continued along this path, both people should grow exponentially based on these gains, so the real concern is when the rate of growth in percentage terms is different. So let's not compare the gap in dollars, but compare the gap in percentage, to get a much better grasp on income distribution and whether or not the gap is actually widening. For this, we'll create our own little function that calculates the percentage. Obviously we don't need to; this could be done in pandas directly, but we're just going to use it as an example. We're also going to graph our results, because it makes things easier to read and visualize, so we'll import matplotlib.pyplot as plt and from matplotlib import style, and then style.use('fivethirtyeight') as usual. First things first, let's bring in our data frame. So df will be pd.read_pickle, and the pickle that we saved the income stuff to was df3.pickle, so we'll just load that one in. Next, we're going to say df.set_index on the 'Year' column, with inplace=True. Then we want to sort the index so it's in chronological order. Otherwise the latest date is first, the graph plots in reverse and looks silly, and we'd also be computing percentage change in reverse. So: df.sort_index(inplace=True). Now, at this point, let's print(df.head()) and see where we are, make sure we're good. Cool. We're starting in 1980, moving chronologically, and we've got all the information here. So the next thing we want to do is define a new column. We're going to say df, and the column name will be Pct_Diff_Bot_5_vs_Top_5, with Pct for percent. And we're going to say this is equal to...
...what we get when we run a function against these two columns. So let's make that function first. We're going to define pct_diff. To calculate a percentage difference, it's the new minus the old, divided by the old, times 100. So we have a new and an old, right? We're going to have value one and value two. Then the percentage would be equal to v2, the new, minus v1, the old, divided by the old, and then times 100.0. Just to make sure everything is calculated as expected, the whole new-minus-old-over-old part is wrapped in parentheses and multiplied by a hundred. And then finally, return pct. So this is our percentage difference function; it just calculates the percentage difference between two values. That's it. Now, how do we apply that to an entire column? Typically, you would do what's called mapping, but because we're in Python 3, we also need to wrap the map in a list. Before we get there, let's just write what we're planning to do. In here, we want to run pct_diff, comma, and then all the values we want to pass through. Right now we've got two values: one would be df['Bottom 5th'], and the other would be df['Top 5th']. So those are value one and value two. The bottom fifth will be the old, and the top fifth the new, so the result should, in theory, be positive. Sorry, I've been saying percentage change, but what I've been writing out, and what we're looking for, is the percentage difference between these two groups. So we'll calculate that difference, and it should always be positive, because we're comparing, basically, what's the percentage difference from bottom to top.
Now, the way we apply this is that we map it, and we also wrap it in a list. So: list(map(...)), with the function name and then all of the parameters that are going to be passed through. That looks a little different from how you normally run a function, where you have the function and then the parameters in parentheses; here you've got your function, comma, parameters. It's a lot like using lambda to do something. So now that we have that, we can copy this, paste, then do .plot(), and save and run that. And here we go. This is the percentage difference between these two groups over time, so not only has the dollar gap gone up, the percentage difference has gone up too. Like I said, most people are thinking in terms of the dollar figure of the difference between lower income people and upper income people, and they're totally ignoring that with a simple understanding of math, that's explainable really quickly, because of what I told you guys before with the 10% change on $10,000 and $100,000. But this is what really matters: the percentage difference between these two groups, and how it's just continuously going up. The top fifth is pulling away percentage-wise, as well as, obviously, dollar-wise, but percentage-wise is a much larger deal. So next, what I want us to do is take all of our other groups and perform the exact same calculation. And then also, I wouldn't mind showing you guys that you can do this really simply without writing your own function; I just wanted to show you running your own functions first, so that you knew and understood how we map functions. So in the next tutorial we'll continue building on this one, and we'll plot some of the other stuff, just because it's interesting information.
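Putting those pieces together, a minimal sketch of mapping the custom function across two columns might look like this. The income numbers below are made up stand-ins; in the video the data comes from df3.pickle instead.

```python
import pandas as pd

# Made-up stand-in for the income frame loaded from df3.pickle.
df = pd.DataFrame({"Bottom 5th": [10000.0, 10500.0, 11000.0],
                   "Top 5th":    [100000.0, 112000.0, 125000.0]},
                  index=[1980, 1981, 1982])
df.index.name = "Year"

def pct_diff(v1, v2):
    # percentage difference: (new - old) / old * 100
    return (v2 - v1) / v1 * 100.0

# map() pairs the two columns element-wise; list() realizes it in Python 3.
df["Pct_Diff_Bot_5_vs_Top_5"] = list(map(pct_diff,
                                         df["Bottom 5th"], df["Top 5th"]))
print(df["Pct_Diff_Bot_5_vs_Top_5"])
# df["Pct_Diff_Bot_5_vs_Top_5"].plot() would graph it, as in the video.
```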
We'll look at that, and then we'll also show how you can calculate this yourself, if you wanted, without using a function. But again, mapping functions is a lot like the rolling statistics and so on, where you might think, oh, I'll never use that, but as I'll show you later on, there are reasons why you will. So, anyways, that's it for this tutorial. Stay tuned for the next tutorial. Thanks for watching.

7.9 Graphing Percent Change of Income Groups

Hello everybody, and welcome to part nine of section seven of our Python with pandas for data analysis tutorial series. In this tutorial, we're going to be building on the last one, graphing some of these other groups, as well as showing that you can do this as a one-liner, without writing your own function. To get started: when it comes to graphing, I always like to separate, generally, the calculations, the plotting, and the show, and sometimes after plotting you also have axis customization. This isn't really an in-depth matplotlib tutorial, but generally you want to separate these; you wouldn't want to do calculations, plot, calculations, plot, calculations, plot. I mean, you might, and if that's how you want to organize things you could, but generally I do the calculations in a group and then the plots in a group. So now that we've done that, what I'd like us to do first is show how we could do this exact calculation without having our own function. I just feel it should be shown, because we did use a function for it, but it's not necessary. You could actually do something like this: df['Top 5th'] minus df['Bottom 5th'] — that's new minus old — divided by df['Bottom 5th'], the old, and then times a hundred, with the numerator wrapped in its own set of parentheses, one here and one here. And that does the exact same thing. So we can copy this, paste, and we should see that these columns are identical. Let's pull this up. Right, so this is the same data, okay? Mostly, you would write your own function either when you literally can't do the identical thing in one line — I'll show you an example of why that can't be done sometimes, and we'll map instead — or when there's a necessary conversion, and again, that will come down the line. But this is just a basic example anyway.
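The one-liner described above is just vectorized column arithmetic; no custom function or map() needed. Made-up numbers stand in for the income data here.

```python
import pandas as pd

# Stand-in data; the real frame comes from df3.pickle in the video.
df = pd.DataFrame({"Bottom 5th": [10000.0, 10500.0],
                   "Top 5th":    [100000.0, 112000.0]})

# (new - old) / old * 100, applied to whole columns at once.
df["Pct_Diff"] = (df["Top 5th"] - df["Bottom 5th"]) / df["Bottom 5th"] * 100
print(df["Pct_Diff"])
```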
I just didn't want everybody leaving thinking that something like percentage change, basically a one-liner, is something you'd have to write a new function for. So that is that. I'm going to go ahead and comment this out, because we don't need it, and now let's just add the other calculations, just for kicks. The next one we could do is Pct_Diff_Mid_5_vs_Top_5. This will be equal to... let's just take this, copy, paste; instead of 'Bottom 5th', this would be 'Middle 5th', and 'Top 5th' still remains. Let's take this, copy, and paste it a few more times. That was mid, then let's do the fourth fifth — 'Fourth 5th', which is very difficult for me to say — and we also need the second fifth, 'Second 5th'. My bad, I pasted too many of these. Okay, so middle fifth, here's the fourth fifth, and then second fifth, so now we have all that information. Then we come down here and plot all of it, so I'm just going to paste, paste, paste, and take the mid fifth, copy, paste, the fourth fifth, copy, paste, and the second fifth, copy, paste. Then we can do plt.legend(), and in each plot we'll put label equals 'Bottom 5th' and so on. Let me just run that real quick and make sure it's going to give us a label. Cool. I want to shorten the other labels, because the legend is kind of in the way. Also, I'm thinking location four will be good for us. Let's copy this, paste, paste, paste; instead of bottom, this was middle, this was fourth, and this was second. Save and run that, and the legend's still in the way. I forget all the location codes; let's try six. I'm just trying to get one that's over a little bit. That's good enough. So we can see here that basically all of these have been moving; obviously the difference between the bottom fifth and the top fifth occupies the largest position.
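The multi-line plot with shortened legend labels can be sketched as follows. The values are just the rough start and end figures read off the real graph, used here as stand-in data, and the Agg backend is selected so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: rough start/end readings from the real graph.
df = pd.DataFrame({
    "Pct_Diff_Bot_5_vs_Top_5":    [677, 800, 950, 1144],
    "Pct_Diff_Mid_5_vs_Top_5":    [260, 300, 350, 401],
    "Pct_Diff_Fourth_5_vs_Top_5": [133, 160, 190, 213],
    "Pct_Diff_Second_5_vs_Top_5": [72, 85, 95, 107],
}, index=[1980, 1990, 2000, 2010])

labels = ["Bottom 5th", "Middle 5th", "Fourth 5th", "Second 5th"]
for col, label in zip(df.columns, labels):
    plt.plot(df.index, df[col], label=label)  # short label per line
plt.legend(loc=4)  # loc=4 is lower right; try other codes if it overlaps
# plt.show() would display it interactively
```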
It's also useful to look at the actual differences. This one goes from 677 to 1144, so a little less than double. This one starts at about 260 and ends at 401, so even less than double. Then 133 to about 213, and this one goes from 72 to 107. So it looks like the first one truly has the highest degree of variance, but of course we're making almost the same mistake we were warning about earlier: it's not the raw figures, which make the graph look bigger, that matter, it's the percentage change. So again, you'd want to compare the percentage change from start to finish of this one to the percentage change, start to finish, of all the others before passing judgment. But anyway, I think the graph is pretty telling, personally. We can see that the percentage difference for lower income people is just massive, whereas middle income is a little closer, and then basically these two are close; you can even see that the starting points are way different, while looking here they're much closer. And the percentage difference between basically all of them has been going up, which is pretty crazy, including for the top fifth. Anyway, feel free to play around. We also have the top 5% and so on in this data set, so you can play with those too and figure out who it really is. Most people are saying the top 1% is what's moving away from the group the most, and as we've seen, that's not just based on a dollar figure; it's also based on percentage change, which is compounding. So over time that dollar figure, based on that growing percentage change, would be compounding even if it was just the percentage change. But this is compounding and growing at the same time, so that's pretty crazy.
Anyway, that's it for this tutorial. In the next tutorial, we're going to be talking about buffering. Buffering is a pretty important thing when you're dealing with larger data sets, and I'll show you guys how we can do a form of buffering within pandas, as well as how you can buffer outside of pandas, just in case, depending on what you're doing. So that's what we're going to be talking about in the next tutorial, so stay tuned for that.

7.10 Buffering Basics

What is going on everybody? Welcome to part 10 of our Python with pandas for data analysis tutorial series. In this part, what we're going to be talking about is buffering. So the idea of buffering: let's say you've got a data set that is 40 terabytes in size. Well, you're going to have a really hard time opening that data set, right? You're not going to have enough RAM. But that doesn't mean you can't deal with that data. The question is, how would you handle a database, or a CSV file, that was forty terabytes? Or what if you're just trying to store some data into a CSV file that's forty terabytes? How do you go about doing something like that? That's what buffering is for. First, I want to show you guys a simple version of buffering and how it would be done, but before that, we actually need some sort of data set to use. Now, the data set I'd like us to use is Bitcoin prices. I like to use Bitcoin prices simply because the granularity of the data, time-wise, is very high, and we can actually use this data in section eight as well, showing how to store it to a database and then use pandas to do a lot of the manipulation of the data after the fact. So the first thing we're going to do is head to the site with this data; everything will be free here, you don't have to worry about funding anything. Come to the bottom, hit okay on that, and click on Markets API. Once you're here, scroll down; what we're looking for is the CSV pages, so let me open that in a new tab. Yes, this is what we want.
So now we can grab anything here, and you can grab any of these if you want, but I'm going to do Ctrl+F, "btce". I see it, and I hope everybody else can see it: btceusd. So we're going to take this one here, that would be this file. We'll click that and let it download. It's 122 megabytes, so it'll take a little bit, so I'm going to pause. You can pause yours while you wait for it to complete. When it's done, click and drag it into the directory that you're working in. Okay, so mine's all done. Here is my directory; you can just click and drag it into there. And there it is, btceUSD.csv, and you'll notice it's .gz. Let's close that. Depending on what you're on, if you're on Mac or Linux, you should understand how to use a .gz, and you can unzip those with gzip or gunzip. Now, if you're on Windows like me, the best thing to use is 7-Zip, or at least that's what I would recommend. You can find 7-Zip online, and you can download whichever version suits your operating system. Once you have 7-Zip, or if you're on Linux or Mac or whatever, go ahead and gunzip this. On Windows we'll do 7-Zip and then Extract Here. So you wait for that, and when that's all done, you should just have a pure .csv. So, where are you? There it is: btceUSD.csv, and this file is about 880 megabytes. I'm going to go ahead and remove that leading period as well; you can leave it if you want. Now that we've got that file, naturally I can't open it in Notepad. If you have an editor like Vim, you can probably open this; Vim uses a form of buffering to let you open these files. Now, what we're going to do first is a simple buffering example. So, I'm going to do def buffering_example():. And this is just a basic buffering example. To buffer even without pandas, what you could do is something like this.
with open, and then you specify the file name and location. Because this is local for me, I can do btceUSD.csv. If you're on Linux or Mac, you might still have to give the full path, so if it's saying it's not found, that could be it. You could just say with open('btceUSD.csv') as f, for file, and then you could print out the file. But what we're going to do instead is add another parameter here, and that parameter is buffering. We'll say buffering=1500 there. And then for line in f: print(line). So what you should find is that we open this and start reading out lines very fast. So buffering_example(). Save and run that, and there you go. I mean, we're printing out lines extremely fast. Normally, it would have taken a moment to read all of that into memory, but we're using buffering, and a pretty small buffer there, in order to read this data in. So that's buffering without using pandas at all. Then you actually can buffer with pandas. I'm going to comment this out because we don't need it anymore. It's kind of like buffering, but it sounds a bit stranger. You can say def buffering_with_pandas(): with empty parms, and in this one we'll do chunks. We'll say chunks = pd.read_csv, and the CSV that we're going to be using is btceUSD.csv. And then instead of buffering= something, we say chunksize=, and we're going to say this is four, okay? And what that is, is literally four lines, or four rows, in your pandas dataframe. And then to access those, you say for c in chunks: print(c), okay? So let's take this, we'll just copy that, put it over here, paste. Now, let's run that. And bam. So now, excusing the column headers here, because the first row of data is being used as the column names, excusing that, we have great success here.
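The plain-Python version described above can be sketched like this. This is a minimal, self-contained version: the tiny sample file is just a stand-in so the snippet runs on its own, where the tutorial uses the ~880 MB btceUSD.csv download.

```python
import os
import tempfile

# A tiny stand-in file so this sketch is self-contained; in the tutorial
# this would be the ~880 MB btceUSD.csv sitting in your working directory.
path = os.path.join(tempfile.gettempdir(), "btceUSD_sample.csv")
with open(path, "w") as f:
    f.write("1325317920,4.39,0.455\n1325317921,4.39,48.0\n")

def buffering_example():
    lines = []
    # buffering=1500 gives the file object a roughly 1500-byte buffer,
    # so lines stream through without the whole file sitting in RAM.
    with open(path, buffering=1500) as f:
        for line in f:
            print(line, end="")
            lines.append(line)
    return lines

lines = buffering_example()
```

On the real 880 MB file, this starts printing lines almost immediately, because only a small buffer's worth of data is ever held in memory at a time.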
That's all the information slowly coming through in chunks of four, basically. So, you could do something like this, right? So that was buffering with pandas. Again, note how fast: bang, there's our data, coming in instantly. But what if we didn't buffer? Let's comment this out. def no_buffering_pandas():, empty parms. And then we would just say df = pd.read_csv, and that CSV is again btceUSD.csv, and then print(df.head()). So then we can run this: copy, paste, save and run that. And as you can see, it's taking a moment, because this entire file is currently being stuffed into my RAM. Okay, so it's going to take a second here. Still waiting. This is actually taking a lot longer than I thought; it should come in pretty quick. While we're waiting, let's pull up the task manager. It came up as I was pulling it up, but anyway, here's our task manager. Our data is loaded in, but all of it is loaded into RAM, so let's close that again. And you should see the RAM just dropped a little bit because of the close. But let's run that again, and then we can see, as we're loading it in, the RAM is slowly going up to basically consume this entire 880 megabyte file that we're asking it to bring in. Whereas when you're buffering, you're not bringing in the entire dataset; you're just bringing in a portion of the data. That's why it doesn't take anywhere near as long: here we're loading in millions of rows, as opposed to when we had a chunk size of four, when we were only loading in four rows at a time. So it was easy. Anyway, now that we've done that, the next question you would have for yourself is: okay, how would we actually interact with this data? Now, there are two options here. We can interact with a database, or we could go back to our HDF files, specifically HDF5, and re-talk about that. One more thing: HDF does not care about the size of the file very much.
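The pandas side of the comparison above can be sketched as follows. A small generated file stands in for btceUSD.csv so the snippet is runnable on its own; the chunksize=4 value is the tutorial's deliberately tiny example.

```python
import os
import tempfile
import pandas as pd

# Stand-in for the big btceUSD.csv so the sketch runs on its own.
path = os.path.join(tempfile.gettempdir(), "btceUSD_chunks.csv")
with open(path, "w") as f:
    for i in range(10):
        f.write(f"{1325317920 + i},4.39,0.455\n")

# chunksize=4 makes read_csv return an iterator that yields DataFrames
# of up to 4 rows each, instead of loading the whole file at once.
chunks = pd.read_csv(path, chunksize=4, header=None,
                     names=["unix", "price", "volume"])

total_rows = 0
for c in chunks:
    assert len(c) <= 4          # each chunk is at most 4 rows
    total_rows += len(c)
print(total_rows)  # → 10
```

Passing names= here also avoids the column-header problem mentioned above, where the first row of data gets used as the header.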
So you will see that our 880 megabyte file suddenly gets much, much, much larger. But the idea is that storage is cheap; RAM is not. So anyways, that's what we're going to talk about in the next tutorial: buffering into, and then reading out of, HDF5 files. So stay tuned for that. Thanks for watching.

7.11 Buffering Into And Out Of Hdf5

What's going on, everybody? Welcome to part 11 of section 7 of our Python with pandas for data analysis tutorial series. In this tutorial we're going to be building on the last tutorial, which was all about buffering: doing basic examples of buffering, showing how we would do it reading in via a CSV, and then the ramifications of, say, not buffering, and the problems that we might run into. So now what we're going to do is talk about actually buffering into something. We can either buffer into something like a database with SQLite, or we could buffer into an HDF5 file. The point of HDF files is to sacrifice storage for memory. The idea is to use as little memory as possible and be very, very fast at IO, which is input/output, but very little care is given to how much space an HDF file takes up. What you will see is that a database is very, very similar. With SQLite, for example, in the next section, section eight, we'll be doing this: we'll be putting the data into a database rather than an HDF file. And you'll see that databases become much larger than eight hundred and eighty megabytes as well. And the reason is that the database comes with the functionality of being easily manipulated, sliced, and all that. It's not just a static dataset like a CSV, right? Although SQLite is very close to being a flat file. Anyway, moving on. Let's talk about buffering with HDF files. So I'm going to comment this out and move this stuff up so we're not using it at the moment, but I want to leave it there, and then we're going to make some space here just for video purposes, and we're ready to rumble. So first of all, we'll create our store, and that's going to be equal to pd.HDFStore, so HDF Store. And then we specify the HDF file within parentheses, and we'll call it 'hdfbuffering.h5'. And then what we're going to say is, basically, we need to create the chunks here.
So, we would say something like this. We would say chunks = pd.read_csv, and the CSV is that btceUSD.csv. So we read those chunks, and then we'll say chunksize=. In here you might want to do something bigger. Before, we said chunk size four; that's outrageously small. We could do something like 400,000, or even four million would be totally fine. The problem is, really, I don't know how many lines this btceUSD file is. Let's see if we can open it up in Vim real quick. Hopefully it'll load. Is it telling me what row we're on? Actually, I think it is. So, if we scroll all the way to the very bottom, we have what appears to be basically twenty million rows in here. So each row will take x amount of time to bash through, but you know, four million is going to cut that up into little pieces. You can do four hundred thousand or forty thousand or whatever you want. I assume most people watching this have more than a gigabyte of RAM, okay? So you can probably load this whole file, no problem. I just don't really know of a good multi-terabyte file to use as an example, plus that would waste a lot of people's time and space. But feel free, you can take this file and make it as big as you want, just by copying everything in it and duplicating the file inside itself. Anyway, so we read with that chunk size, whatever you want, and then what we'll go ahead and do is name the columns, because remember when we were reading in that CSV before, the first row of data was being labeled as the column headers. We don't really want that; we want to assign it something. So we'll say names, for the column names, equals: this will be unix, because it's a Unix timestamp; price, because the price was next; and then finally volume. And then, outside of that, we'll specify the index column; it's just that zeroth column there. So now we've got the chunks in there.
And we are ready to basically start. Now, first of all, how would you add the data? You could either store or append. So you could do something like this. You could say for c in chunks. What do we want to do? Well, we could try... actually, let's skip the try and except for now. We can say store.append. We want to append to 'df'. What do we want to append? The c chunk. And then, finally, format='table'. We have to say this is a table format, otherwise you won't be able to append at all. So we'll do that, and in theory that should work, but let's go ahead and run it and see how we wind up. Let's see. I'm trying to find our file, is it hdfbuffering? I can't remember. Yes. Okay, so here's our file; I'll come down here, hdfbuffering, there it is. It's going to take a while. You can always press F5 if you're on Windows to see it kind of refreshing in size. It is constantly going up in size. This process will probably take a while, so I'm going to pause for now and let it go on. It's going to take a while for you too, depending on how fast you're running through those chunks. So feel free to pause this video while you're waiting for that file to build. On my computer, at least, it'll probably take about five minutes to build this H5 file; it might take you more or less. So anyway, pause the video and restart it when it's all done loading up that file. All right, mine's done. That was actually a lot faster than I thought, and the file size is smaller than I thought. That's not always the case with HDF5; I'm pretty surprised. Our starting file size was 880 megabytes and our ending size is 507. That's a pretty significant change there. Anyway, I wouldn't expect that every time; a lot of times my HDF files wind up larger. But what generally is happening is that the read/write process is faster and uses less memory. Anyway, we have our table created. So, the next thing that we would do, well, we've already done this part.
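The whole buffer-into-HDF5 flow above can be sketched as one self-contained snippet. This is a minimal sketch, assuming PyTables (the `tables` package, pandas' HDF5 backend) is installed; the small generated CSV and the file paths are stand-ins for the tutorial's btceUSD.csv and hdfbuffering.h5, and the tiny chunksize is only for demonstration.

```python
import os
import tempfile
import pandas as pd  # HDF5 support also requires PyTables ("tables")

# Stand-in paths; in the tutorial these are btceUSD.csv / hdfbuffering.h5.
csv_path = os.path.join(tempfile.gettempdir(), "btceUSD_hdf_demo.csv")
h5_path = os.path.join(tempfile.gettempdir(), "hdfbuffering_demo.h5")
with open(csv_path, "w") as f:
    for i in range(10):
        f.write(f"{1325317920 + i},4.39,0.455\n")

# mode="w" truncates any previous file so repeated runs don't double up.
store = pd.HDFStore(h5_path, mode="w")
chunks = pd.read_csv(csv_path, chunksize=4, header=None,
                     names=["unix", "price", "volume"], index_col=0)
for c in chunks:
    # format="table" is required: the default "fixed" format
    # cannot be appended to once it has been written.
    store.append("df", c, format="table")
store.close()

print(len(pd.read_hdf(h5_path, "df")))  # → 10
```

Because each chunk is written and released before the next is read, peak memory use stays at roughly one chunk, no matter how large the source CSV is.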
So, in fact, what I'm going to do is def create_hdf_file, so that will be a function. We don't need to run that again, so we're just going to let that go. And then what we'll do is come down here, and now we can load in the HDF: hdf = pd.read_hdf. And again, that file is this one here, so we'll just highlight that, copy, paste. And the table we're interested in, we called 'df'. And then we can read it back in with chunks if we wanted; we could say chunksize=4000 or something like that. And then print: well, we could say print(hdf) like this if you wanted, but this is going to be an object that still needs to be iterated through, because we're using chunk sizes, right? So just like we had to iterate right here, we'd have to do that here. So I'm going to comment this out, and instead we can do something like this: copy this, come down here, paste, close that off. And then we will print(hdf.head()). Save and run. And I'm just going to bring this over. And again, obviously, for buffering purposes, you would do this and you would iterate through. And even then, actually, let's do this: for c in hdf: print(c). You would do that if you needed to buffer, right? But we just don't need to buffer here. The reason why I did this is to show you that indeed our HDF file is functional, right? Because that's kind of important, to show that, hey, it actually did work. Because, again, if you try to open up an HDF5 file... actually, we're probably going to get an error when we try to open this. Oh, it worked. Look at that. Anyway, it looks like this. So, yeah, not very useful. So, we'll close out of this. And so that is buffering with both Python only and buffering with pandas: what you need to do, how to do it, and how to do it with maybe an HDF file, something like that. In the next section, what we're going to end up doing is working with a database, SQLite, and we'll be getting into doing more.
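The read-back side described above can be sketched like this. Again a minimal sketch, assuming PyTables is installed; the file path is a stand-in, but the key 'df' matches the tutorial's table name, and a small table is built first so the snippet runs on its own.

```python
import os
import tempfile
import pandas as pd  # HDF5 support also requires PyTables ("tables")

# Build a small table-format HDF5 file so the read-back sketch is
# self-contained; the key "df" matches the tutorial's table name.
h5_path = os.path.join(tempfile.gettempdir(), "hdf_read_demo.h5")
df = pd.DataFrame({"price": [4.39, 4.50, 4.57, 4.58, 4.60],
                   "volume": [0.45, 48.0, 2.3, 0.1, 7.7]},
                  index=pd.Index(range(5), name="unix"))
df.to_hdf(h5_path, key="df", mode="w", format="table")

# Read the whole table back in one go...
hdf = pd.read_hdf(h5_path, "df")
print(hdf.head())

# ...or buffered: with chunksize, read_hdf returns an iterator of
# DataFrames, just like read_csv with chunksize did.
rows = 0
for c in pd.read_hdf(h5_path, "df", chunksize=2):
    rows += len(c)
print(rows)  # → 5
```

Note that the chunked read only works because the table was written with format="table"; a "fixed"-format store cannot be iterated this way.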
Like right here, we've just kind of buffered in and out of storage. In the next section, we'll actually be doing something with the data, really to highlight the power of pandas to work alongside a database, or even a large file, and do some quick manipulation that would otherwise make your database four or five times the size; we'll do the manipulation and then output some information. So that's what we're going to be doing in the next section, to kind of round out this series. So stay tuned for that.

7.12 Section Outro

So that's it for section seven. In this section, we covered a lot of more advanced operations with pandas. We started with resampling: basically how to downsize our database, or our dataframe in our case, doing things like resampling by sum or by average, or even open-high-low-close. From there, we talked about applying basic statistical operations to our entire dataframe, doing stuff like that. And then we talked about buffering, and, you know, basically how we work with data sets that are possibly larger than the amount of memory that we actually have at our disposal. So that's it for section seven. Now, in section eight, what we're going to be talking about is actually dealing with pandas and a real database, and how we can interact with it. In the IO section we didn't talk about to and from SQL, because I wanted to save that for its own section entirely. Pretty much, you know, JSON, pickle, CSV, HDF, these are really basic kinds of things; there's nothing that advanced about them. But when it comes to SQL, first of all, SQL is its own programming language entirely. So the fact that we can interact with SQL is really phenomenal. And an SQL database has a lot more attributes than, say, a JSON file does, or a CSV. So it's really important, I think, or it's really cool, that we can use pandas with SQL. And, at least to me personally, this is where I get the most value out of pandas, this marriage between pandas and SQL, especially as a Python programmer who works with web development a lot; this is extremely powerful. But even if you don't do web development, even if you're just doing data analysis on larger data sets, you're going to have to store that data somewhere. And when people are, say, collecting data through sensors or whatever, they're probably putting it into a database. And so, working with a database, obviously you don't have to work with a MySQL database, but most databases, for example, have a method to output to CSV. So even if it's not MySQL, you know, it won't be quite as easy as going straight to and from SQL, but it will be fairly simple to get to a CSV and then load that CSV in. So the ability to do that, work with it with pandas, and output back is, I think, really, really powerful and definitely deserves quite a bit of attention. So we're going to kind of dedicate the whole last section to working with a database with pandas and showing the kinds of things that we can do there. So even if you're not using it for web development, even if you're just simply doing data analysis, you probably have a database. So anyways, that's what you guys have to look forward to in section eight, so stay tuned for that.