Basic Operations on Pandas DataFrame
In the previous tutorial, we covered the basic concept of the pandas DataFrame data structure and how to load a dataset into a DataFrame from files such as CSV and Excel sheets. We also saw an example where we created a pandas DataFrame from a Python dictionary.
Now we will look at a few basic operations that we can perform on a dataset once it has been loaded into a DataFrame object.
Here you can check the complete code: collab.google.com
Find Last and First rows of the DataFrame:
To access the first or last few rows of the DataFrame, we use the .head() and .tail() methods. If called without any arguments, these methods return the first 5 or the last 5 rows respectively. If we pass an integer as an argument, that number of rows is returned instead. For example,
import pandas as pd
# using dictionary to create a dataframe
data = {'Fruit': ['Apple','Banana','Orange','Mango'], 'Weight':[200,150,300,250], 'Price':[90,40,100,50]}
studyTonight_df = pd.DataFrame(data)
# using the head function to get first two entries
studyTonight_df.head(2)
# using the tail function to get last two entries
studyTonight_df.tail(2)
Output:
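Run as a self-contained script, the example above might look like the sketch below. The names first_two and last_two are introduced here only for illustration, and the comments show the values each call returns for this four-row frame.

```python
import pandas as pd

data = {
    "Fruit": ["Apple", "Banana", "Orange", "Mango"],
    "Weight": [200, 150, 300, 250],
    "Price": [90, 40, 100, 50],
}
studyTonight_df = pd.DataFrame(data)

first_two = studyTonight_df.head(2)  # rows with index labels 0 and 1
last_two = studyTonight_df.tail(2)   # rows with index labels 2 and 3

print(first_two["Fruit"].tolist())  # ['Apple', 'Banana']
print(last_two["Fruit"].tolist())   # ['Orange', 'Mango']

# With no argument both methods ask for 5 rows; this frame has only 4,
# so head() returns all of them.
print(len(studyTonight_df.head()))  # 4
```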
Accessing Columns in a DataFrame:
We can access the individual columns that make up the DataFrame. To do that, we put the name of the column in square brackets, much like indexing an array. For example, to get the values stored in the column Weight in the above DataFrame, we can use the following code:
studyTonight_df['Weight']
This will give us the output:
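Another way to access a column is to use its name as an attribute. This works only when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method: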
studyTonight_df.Fruit
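To compare the two styles side by side, here is a minimal sketch using the same fruit data. The variable names weights and fruits are assumptions made for this example. Both forms return a pandas Series.

```python
import pandas as pd

studyTonight_df = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Orange", "Mango"],
    "Weight": [200, 150, 300, 250],
    "Price": [90, 40, 100, 50],
})

weights = studyTonight_df["Weight"]  # bracket access works for any column name
fruits = studyTonight_df.Fruit       # attribute access needs an identifier-like name

print(type(weights).__name__)  # Series
print(weights.tolist())        # [200, 150, 300, 250]

# Both styles return the same data for the same column.
print(fruits.equals(studyTonight_df["Fruit"]))  # True
```

Bracket access is the safer default, since it keeps working for column names that contain spaces or that match the name of a DataFrame method.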
Accessing Rows in a DataFrame:
Using the .loc[] indexer, we can access a row by its index label, which is passed inside the square brackets. For example:
studyTonight_df.loc[2]
Output:
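The lookup returns the selected row as a Series whose labels are the column names. The sketch below shows this for the default integer index, and then, as an assumption made purely for illustration, with the Fruit column turned into the index so that rows can be looked up by name (by_fruit is a hypothetical variable).

```python
import pandas as pd

studyTonight_df = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Orange", "Mango"],
    "Weight": [200, 150, 300, 250],
    "Price": [90, 40, 100, 50],
})

# With the default index, the row labels are the integers 0 to 3.
row = studyTonight_df.loc[2]
print(row["Fruit"])   # Orange
print(row["Weight"])  # 300

# Illustration only: use the Fruit column as the row index instead.
by_fruit = studyTonight_df.set_index("Fruit")
print(by_fruit.loc["Mango", "Price"])  # 50
```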
Various Assignments and Operations on a DataFrame:
To demonstrate the role of NaN in our DataFrame, we will add a column that has no values. To do this, we use the columns parameter of the DataFrame() constructor and pass it a list of column names, including one that is not present in the data.
data = {'Fruit': ['Apple','Banana','Orange','Mango'], 'Weight':[200,150,300,250], 'Price':[90,40,100,50]}
studyTonight_df2 = pd.DataFrame(data, columns=['Fruit','Weight','Price','Kind'])
print(studyTonight_df2)
The column we just added, called Kind, didn't exist in our data. There are therefore no values corresponding to it, so pandas treats every entry as a missing value and places NaN under the Kind column. Below is the output for the above code:
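The missing values can also be checked programmatically rather than by eye. A minimal sketch, assuming the same data dictionary as above:

```python
import pandas as pd

data = {
    "Fruit": ["Apple", "Banana", "Orange", "Mango"],
    "Weight": [200, 150, 300, 250],
    "Price": [90, 40, 100, 50],
}

# 'Kind' is not a key in data, so the whole column is filled with NaN.
studyTonight_df2 = pd.DataFrame(data, columns=["Fruit", "Weight", "Price", "Kind"])

print(studyTonight_df2.shape)                 # (4, 4)
print(studyTonight_df2["Kind"].isna().all())  # True
print(studyTonight_df2["Price"].isna().any()) # False
```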
If we want to fill this column, we can assign a single constant value to all the rows. To do this, select the column as shown below and set it equal to the constant.
studyTonight_df2['Kind'] = 'Round'
print(studyTonight_df2)
As we can see in the output below, all the values in the column Kind have been changed to the value Round.
A Series can also be assigned to a DataFrame column. Its values are matched to rows by index label, which illustrates that a DataFrame is a collection of Series sharing a common index.
st_ser = pd.Series(["Round", "Long", "Round", "Oval-ish"])
Let's assign this Series to our column Kind:
studyTonight_df2['Kind'] = st_ser
print(studyTonight_df2)
For this we will get the following output:
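Assigning a Series matches values to rows by index label, not by position. The sketch below first repeats the assignment with the default index and then, using a hypothetical column Kind2 and a made-up label 5, shows that rows without a matching label are left as NaN.

```python
import pandas as pd

studyTonight_df2 = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Orange", "Mango"],
    "Price": [90, 40, 100, 50],
})

# The Series gets the default index 0..3, which matches the frame row for row.
st_ser = pd.Series(["Round", "Long", "Round", "Oval-ish"])
studyTonight_df2["Kind"] = st_ser
print(studyTonight_df2["Kind"].tolist())  # ['Round', 'Long', 'Round', 'Oval-ish']

# Illustration only: a Series whose labels overlap the frame only at label 0.
partial = pd.Series(["Round", "Long"], index=[0, 5])
studyTonight_df2["Kind2"] = partial
print(studyTonight_df2["Kind2"].isna().tolist())  # [False, True, True, True]
```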
More Operations on Dataframes:
DataFrames support a wide range of operations. To start off, let's perform a boolean comparison on a DataFrame column and use the result to fill another DataFrame column.
1. Using Expressions to fill value in Column
studyTonight_df2['costly'] = (studyTonight_df2.Price > 60)
print(studyTonight_df2)
This creates a new column named costly and fills it with the boolean result of the condition we set. The value is therefore True for every item whose Price is more than 60 and False otherwise.
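As a self-contained sketch of the same idea, the comparison below produces one boolean per row. The second step, which uses the boolean column to pick out the matching rows, goes slightly beyond the original example; costly_fruits is a hypothetical name.

```python
import pandas as pd

studyTonight_df2 = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Orange", "Mango"],
    "Price": [90, 40, 100, 50],
})

# The comparison is applied element-wise and yields a boolean Series.
studyTonight_df2["costly"] = studyTonight_df2["Price"] > 60
print(studyTonight_df2["costly"].tolist())  # [True, False, True, False]

# Illustration only: the boolean column can also select the matching rows.
costly_fruits = studyTonight_df2.loc[studyTonight_df2["costly"], "Fruit"]
print(costly_fruits.tolist())  # ['Apple', 'Orange']
```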
2. Delete Column
The del statement lets us delete any column by its name.
del studyTonight_df2['costly']
print(studyTonight_df2)
As we can see in our output, the costly column was deleted.
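Note that del changes the DataFrame in place. When the original should be kept, one alternative is the drop() method, which returns a new DataFrame. The sketch below contrasts the two; the names df and trimmed are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Orange", "Mango"],
    "Price": [90, 40, 100, 50],
    "costly": [True, False, True, False],
})

# drop() returns a new frame and leaves df untouched.
trimmed = df.drop(columns=["costly"])
print(list(trimmed.columns))   # ['Fruit', 'Price']
print("costly" in df.columns)  # True

# del removes the column from df itself.
del df["costly"]
print(list(df.columns))        # ['Fruit', 'Price']
```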
3. DataFrames made out of Nested Dictionaries:
Let us consider a nested dictionary:
# the variable is not named dict, since that would shadow the built-in type
nested_data = {'Items':{'pen':10, 'pencil':5, 'eraser':3, 'ruler':15}, 'Food':{'chips':20, 'coke':16, 'sandwiches':30, 'nachos':25}}
studyTonight_df3 = pd.DataFrame(nested_data)
print(studyTonight_df3)
This will give us the following output:
As you can see, the outer keys (Items and Food) become the columns, while all the keys of the inner dictionaries are combined into a single row index. Wherever a column has no value for a given index label, we get a missing value (NaN). For example, the index contains chips, sandwiches, pen and pencil together, so pen has a value under Items but NaN under Food. No distinction is made between the two groups of keys. This behaviour can be surprising and must be kept in mind while making DataFrames from a nested dictionary.
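The shape of the result makes this concrete: eight distinct inner keys give eight rows, and two outer keys give two columns. A minimal sketch, with nested_data as an assumed variable name:

```python
import pandas as pd

nested_data = {
    "Items": {"pen": 10, "pencil": 5, "eraser": 3, "ruler": 15},
    "Food": {"chips": 20, "coke": 16, "sandwiches": 30, "nachos": 25},
}
studyTonight_df3 = pd.DataFrame(nested_data)

print(studyTonight_df3.shape)          # (8, 2)
print(list(studyTonight_df3.columns))  # ['Items', 'Food']

# 'pen' exists only in the Items dictionary, so its Food entry is missing.
# The Items column holds floats because NaN forces a floating-point dtype.
print(studyTonight_df3.loc["pen", "Items"])          # 10.0
print(pd.isna(studyTonight_df3.loc["pen", "Food"]))  # True
```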
4. DataFrame Transpose:
A transpose essentially means flipping the DataFrame so that rows become columns and columns become rows. This can be achieved with the T attribute.
studyTonight_df3.T
Output:
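The effect is easiest to see in the shape and labels of the result. In the sketch below, transposed is a hypothetical name, and the nested dictionary is repeated so that the example runs on its own.

```python
import pandas as pd

studyTonight_df3 = pd.DataFrame({
    "Items": {"pen": 10, "pencil": 5, "eraser": 3, "ruler": 15},
    "Food": {"chips": 20, "coke": 16, "sandwiches": 30, "nachos": 25},
})

transposed = studyTonight_df3.T

# The 8 rows and 2 columns swap places.
print(studyTonight_df3.shape)   # (8, 2)
print(transposed.shape)         # (2, 8)
print(list(transposed.index))   # ['Items', 'Food']

# The same value is now addressed with the labels reversed.
print(transposed.loc["Items", "pen"])  # 10.0
```

T returns a new DataFrame, so studyTonight_df3 itself keeps its original orientation.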
Conclusion
DataFrames are the core building block of the pandas library, so it is important to have a good grasp of them. This tutorial covered several basic operations on DataFrames in pandas. Whenever in doubt, revisit the methods mentioned in this tutorial.