More DataFrame Manipulation using Pandas — Part I

Javier Herbas · Analytics Vidhya · Dec 14, 2020


Image by author

This blog follows an earlier one, “Basic DataFrame Manipulation using Pandas”, which I published a few months ago and where I gave some tips and showed short workflows that I use on a regular basis. Most, if not all, of that content was for beginners, so I decided to write a second one with slightly more complex functions and methods, which I broke into two parts so that it doesn’t become a super long blog that is tedious to follow.

Once again, I will approach this as an exercise, starting by creating my own DataFrame (df) and then playing with it as much as I can while keeping it to, at most, an intermediate level. Later on, after parts I and II of this blog are published, I will focus on writing an advanced one following a project with real data.

The content that I will cover is the following:

  1. Creating a DataFrame from a dictionary
  2. Configurations and Settings
  3. Combining DataFrames
  4. Reshaping your DataFrame
  5. Using the Datetime module
  6. Using the groupby() method/function
  7. Slicing using .loc and .iloc

If, by the time I finish the second part, I find that the level is not high enough, I will revise and edit this blog as well so that both parts stay coherent.

Because I am a big basketball fan, I will once again use NBA data, but this time I will take stats from the recently finished 2019–2020 season and use the top 25 scorers. With that said, let’s jump straight into the content.

1. Creating a DataFrame from a dictionary

With the basketball stats stored as lists, the easiest way to create your DataFrame, in my view, is via a dictionary, assuming of course that you are not dealing with a large number of lists, which is the case here. This is clearly not the most efficient way to create a df for large datasets, but for something small such as 25 rows (one per player), it does the trick.

You can see in the gist below that I have created 10 lists corresponding to player names and a series of stats that I took from the NBA main webpage. Then I created a dictionary called “stats” where I assigned keys and values, with the keys being the names of the columns/features of my future df. I followed this by calling the pandas DataFrame constructor and named the output “nba”. In the last line of the gist you can see that I pass a “25” so that all 25 players I will be working with are displayed.

DataFrame creation from a series of lists
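The gist is embedded as an image in the original post, so here is a minimal sketch of the same idea with only three players and a handful of columns. The column names ‘TEAM’ and ‘PPG’ and the shortened lists are my own assumptions; the real gist uses 10 lists and all 25 players.

```python
import pandas as pd

# Hypothetical, shortened versions of the lists in the gist (illustrative only)
names = ['James Harden', 'Bradley Beal', 'Damian Lillard']
teams = ['HOU', 'WAS', 'POR']
positions = ['SG', 'SG', 'PG']
points_per_game = [34.3, 30.5, 30.0]

# The dictionary keys become the column names of the future df
stats = {'NAME': names, 'TEAM': teams, 'POS': positions, 'PPG': points_per_game}

nba = pd.DataFrame(stats)
nba.head(25)   # display up to 25 rows, one per player
```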

Running that code gave me my df called “nba”, which you can see in Figure 1 below with all the stats that I chose. On the link that I shared above you will find plenty more information that you can either scrape or copy manually if you wish to create bigger dataframes to play with.

Figure 1. DataFrame with the NBA 2019–2020 top 25 scorers

There is a minor detail that I don’t like about the output df, and that is the starting index number (left-most column in Figure 1). By default it starts at “0”, but because I am using the top scorers it doesn’t feel right to give “James Harden” a “0” instead of a “1”. I will therefore modify the index to start at 1 with the code below:

nba.index = nba.index + 1

2. Configurations and Settings

One thing that I do a lot is configure my dataframes depending on the project that I am running. Sometimes I want to show more or less information, or I just want to make all my dataframes look the same for reports. Doing this is simple, and I consider it beginner level, but I will still add it here since I didn’t cover it in my previous blog and it is something I do often.

You will find that Pandas gives you plenty of options to customise your dataframes. I will touch on a few of the common ones with examples and will leave you this link to read if you want to explore further.

2.1 max_rows & max_columns

Setting the maximum number of rows and columns comes in quite handy. If you have too many rows and/or columns, for example, you can set the maximum to a high number so that they are all displayed, or on the contrary you might want to see just a few of them, as in the example in Figure 2. If you don’t modify them, the defaults are 60 rows and 20 columns. The code to do this is straightforward:

pd.set_option('display.max_rows', 6, 'display.max_columns', 6)

The limitation here is that if you reduce the number, you are potentially missing out on a lot of information, so this is most useful for reports or other documents where you want to keep things tidy, or when you are very familiar with your dataframe.

2.2 Precision

Precision sets the number of decimals shown in the numeric output. With the data that we have available we only have 1 decimal, so let’s force it to “0” so that you can see its effect (Figure 2):

pd.set_option('display.precision', 0)

2.3 dimensions

Dimensions is straightforward as well: it prints out the size/shape of your DataFrame at its lower-left corner (Figure 2). The code is just like the previous ones, but takes a True or False value:

pd.set_option('display.show_dimensions', True)

2.4 max_colwidth

This one corresponds to the maximum width of a column in the repr of a Pandas data structure. When a column overflows, a “…” placeholder is embedded in the output (Figure 2), which you might have seen with dataframes with over 20 columns. A ‘None’ value means unlimited. Once again, the code is straightforward:

pd.set_option('display.max_colwidth', 8)

As you can imagine, you don’t need to write one line per parameter/option that you want to set; you can add all of them in one call, as shown in the gist below, and any modifications that you have made can always be reset back to their defaults with the code below (above the gist):

pd.reset_option('^display')
Gist for DataFrame option settings
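As a rough sketch of what that settings gist might look like, here are the same options from above set in a single call (set_option accepts pattern/value pairs):

```python
# Set several display options in one call (pattern/value pairs)
pd.set_option('display.max_rows', 6,
              'display.max_columns', 6,
              'display.precision', 0,
              'display.show_dimensions', True,
              'display.max_colwidth', 8)

nba   # truncated display: 6 rows and 6 columns, no decimals, shape shown at the bottom
```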
Figure 2. DataFrame with parameters set for a maximum of 6 columns and rows, no decimals, shape displayed at the bottom left, and a maximum column width of 8. The row numbers have also been modified to start at 1.

Figure 2 shows the output of all the options that we set above. Remember that if you want to follow this workflow, at this point you will need to reset the options back to their defaults so that you can follow the rest of the methods that I will use.

3. Combining DataFrames

Pandas offers various ways to easily combine DataFrames, with different kinds of set logic for the indexes and relational-algebra functionality in the case of join/merge-type operations. Let’s look at what I believe are three of the most useful ones:

3.1 Concatenating

Concatenating is probably the most commonly used one, and it can be seen as simple stacking that can be done horizontally or vertically. To show you how it really works, I have created individual, fit-for-purpose DataFrames for “concatenating” and also for “merging” and “appending”. For each method I will add a gist so that you can recreate them from the original df as well, starting with the gist below, which will give you the dataframes displayed in Figure 3.

DataFrame creation for the concatenating exercise
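The gist itself is an image in the original post; a minimal sketch of the slicing it performs could look like this (the exact rows taken for nba_1 and nba_2 are my assumption):

```python
# Two row slices of the original df to stack back together later
nba_1 = nba.iloc[0:3, :]   # e.g. the first three players
nba_2 = nba.iloc[3:6, :]   # e.g. the next three players
```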
Figure 3. Two DataFrames created by slicing the original one. The upper one corresponds to nba_1 and the lower one to nba_2

Concatenating nba_1 with nba_2 vertically is rather simple with the code below:

nba_1_2 = pd.concat([nba_1, nba_2], axis = 0)
nba_1_2.head()
Figure 4. Resulting DataFrame from concatenating nba_1 with nba_2 vertically

Now that we have seen how vertical concatenation is done, let’s look at the horizontal one, which is basically the same code. Once again, below you will find the gist with the necessary code to reproduce the dataframes that I will use, which are displayed in Figure 5.
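Again, the gist is an image in the original post. A sketch of the column slicing that produces nba_3 and nba_4 might look like the following (the exact column split is my assumption). Note that both slices keep the same row index, which is what lets the horizontal concat line them back up:

```python
# Split the same rows into two column groups
nba_3 = nba.iloc[0:5, 0:3]   # first five players, first three columns
nba_4 = nba.iloc[0:5, 3:]    # same five players, the remaining columns
```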

If you look at the dataframes in Figure 5, you can see that it is as if I had divided the original one into two separate ones. In fact, that is what I did, using the slicing illustrated above. So, horizontally concatenating them (code below) basically ties them back together, as Figure 6 shows.

Figure 5. DataFrames created via slicing for horizontal concatenation exercise. Left one corresponds to nba_3 and right one to nba_4
nba_3_4 = pd.concat([nba_3, nba_4], axis=1)
nba_3_4.head()
Figure 6. Result of horizontally concatenating nba_3 with nba_4

3.2 Merging

Concatenating was simple and straightforward; merging is slightly harder but more powerful as well. For it to work, you need a common attribute/feature between both dataframes to be used as the joining point. In many cases you might even have to engineer it if it is not present in both.

Below is the gist to re-create the dataframes for the merge and, as you can see, they both have the column ‘NAME’ in common. The two inputs are shown in Figure 7, and the merged result in Figure 8.
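Since that gist is also an image, here is a hedged sketch of how nba_6 and nba_7 could be built so that they share only the ‘NAME’ column (the other column names are hypothetical):

```python
# Two column subsets of the top five players that overlap only on 'NAME'
top5 = nba.head()
nba_6 = top5[['NAME', 'TEAM', 'POS']]   # 'TEAM' is an assumed column name
nba_7 = top5[['NAME', 'PPG']]           # 'PPG' is an assumed column name
```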

Figure 7. DataFrames created for merging exercise. Left one corresponds to nba_6 and right one to nba_7.

In the code below, which merges the dataframes, you will notice that there are three parameters to fill in: the first is the df that will be on the left, the second is the df on the right, and the third corresponds to the column name used as the join point.

nba_8 = pd.merge(left=nba_6, right=nba_7, on='NAME')
nba_8.head()

Figure 8 shows the merge output and, as you can see, the dataframes have been merged using the “NAME” column, respecting the selected left and right df inputs.

Figure 8. Merge output from nba_6 and nba_7

3.3 Appending

Concatenating and merging were simple enough as long as you don’t need to engineer your “join” for the merge, but .append() is probably the easiest of all. “append()” appends the rows of another dataframe to the end of a given df, returning a new one. If by any chance there is a column not present in the rows that you are trying to append, it will be filled with NaN values (Figure 9).

For this case I won’t add a gist; as with the previous ones, you will be able to recreate the dataframes needed.
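Still, as a rough sketch, here is one way the two inputs could be sliced so that nba_10 is missing a column, which is what produces the NaN values in Figure 9 (the exact rows and columns are my assumption):

```python
# nba_10 has one column fewer than nba_9, so after the append
# that missing column is filled with NaN for the appended rows
nba_9 = nba.iloc[0:3, 0:4]    # first three players, first four columns
nba_10 = nba.iloc[3:6, 0:3]   # next three players, first three columns only
```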

Figure 9. DataFrames to be appended (above, left and right) and the resulting DataFrame (below)

The code for the append is, again, probably the easiest one (below). You choose your output name (nba_11) and then the upper df (nba_9) and the lower df (nba_10):

nba_11 = nba_9.append(nba_10)
nba_11

4. Reshaping your DataFrame

There are several ways that a df can be reshaped, some more complex and powerful than others. Once again, I will focus on the four that I use the most, and if you want additional information I would recommend you explore this link.

As with the previous examples, I have created smaller dataframes to show the effect of each reshaping function/method individually.

4.1 Reshaping by melt()

This is quite an interesting function that massages the df into a format where one or more columns are identifier variables, while the others are unpivoted to the row axis, leaving just two non-identifier columns, “variable” and “value”. You can rename these new columns if you wish, as I did (code below).

First, I will slice the original nba df and show you its output (Figure 10):

nba_12 = nba.iloc[0:3, 0:5]
nba_12.head(3)
Figure 10. Sliced reduced df from the original nba one

With the df above (Figure 10), I will choose to keep the variables “NAME” and “POS” and melt the rest. I have also changed the variable and value names to ‘Stats’ and ‘Values’ respectively (see code below). The result can be seen in Figure 11.

nba_12 = nba_12.melt(id_vars=['NAME', 'POS'], var_name='Stats', value_name='Values')
Figure 11. Result of melting dataframe from Figure 10

As you can see, the melting gathers the 3 stats/features into a single column, and we end up with a rather long df, but with all the stats in just one column.

4.2 Reshape by pivot()

Pivot is one of the most used functions to reshape dataframes and one of my favorites. Basically, it reorganises the data across two dimensions. In cases such as the one we created above after applying melt(), pivot becomes very handy: it allows us to regroup the categories described in the “Stats” column and reorganise the df.

Applying the pivot is rather simple: you will need to define your new “index”, “columns” and “values”, as illustrated in the code below:

nba_13a = nba_12.pivot(index='Stats', columns='NAME', values='Values')

The output can be seen in Figure 12.

Figure 12. Result of pivoting the df that resulted from the melt() on the previous exercise

I’m sure you noticed that you can easily play with your index, columns and values and do interesting things in a very efficient and fast way. With the pivot, you could easily do a transpose if you wanted to. Look at Figure 13; this time I won’t give you the answer, as it is pretty straightforward.

Figure 13. A different result of pivoting the df that resulted from the melt()

4.3 Reshaping by stacking and unstacking

The stack() and unstack() methods are closely related to pivot(). stack() is designed to work together with MultiIndex objects: it pivots a level of the column labels, returning a df whose index has a new inner-most level of row labels. unstack() is simply the inverse operation: it pivots a level of the row index, producing a reshaped df with a new inner-most level of column labels.

For this exercise, let’s again slice a new df from the original nba one with the code below (output in Figure 14):

nba_14 = nba.iloc[0:5, 0:4]
Figure 14. Sliced new df from the original nba one

Stacking this df is as simple as the code below (results in Figure 15):

nba_14_stack = nba_14.stack()
Figure 15. Results of stacking df displayed on Figure 14.

You can see that it basically breaks down all the stats/categories per player/NAME and produces a completely different display (Figure 15). Reverting this is easy, and I’m sure you can imagine how… we use the unstack() method (code below). I won’t show the results, as they are the same as what is displayed in Figure 14.

nba_14_unstacked = nba_14_stack.unstack()

4.4 Reshape by transpose

Transposing a df can be quite useful, and it is even easier than all the previous methods. All you need to do is use the letter “T” or .transpose(). The code below shows how a df has been created by slicing the original nba one, and how it has been transposed using T (Figure 16).

nba_15 = nba.iloc[0:10, :]
nba_15 = nba_15.set_index('NAME')
nba_15 = nba_15.T  # or nba_15.transpose()
Figure 16. Results of transposing a df created by slicing the original nba one

The result of the transpose is basically a swap of axes, where the column names have changed position with the newly defined index (“NAME”). If I had not defined a new index, you would end up with the default numeric one in place of your column names.

So that’s it for Part I of manipulating dataframes. I hope you have enjoyed it. As always, feel free to message me if you have any feedback or questions, or even if there is anything that you would like to see in Part II, which I’m hoping to publish a week from now. Here are my LinkedIn and GitHub, where you will find more things that I have been playing with.


Data Scientist, formerly an engineer with 20 years of experience in the energy industry. Fan of ML and DL and everything related to AI.