
Indexing in Pandas

Selecting and manipulating data is one of the most important features of the pandas DataFrame. Indexing in pandas lets us work with row and column labels directly, which makes selecting and manipulating data simple.

Custom indices not only help us manage our data efficiently but also let us access exactly the rows we need.

Getting started with indexing in pandas

To follow along and run the code in this tutorial, you can use a notebook environment such as Google Colab (colab.research.google.com).

As always, let us import pandas and create a pandas dataframe by loading some sample data from a CSV file:

import pandas as pd

studyTonight_df = pd.read_csv("https://sds-platform-private.s3-us-east-2.amazonaws.com/uploads/P4-Demographic-Data.csv")
print(studyTonight_df)

The above code prints the dataframe, which has 195 rows and the columns Country Name, Country Code, Birth rate, Internet users and Income Group.

Every pandas DataFrame has an index attribute (not a method, so it is written without parentheses) that describes the index associated with the data structure. The code below shows the range of the default index,

studyTonight_df.index

Output:

RangeIndex(start=0, stop=195, step=1)

The output gives the start point of the index, the endpoint, and the step by which it increments.
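As a minimal sketch of how these three values can be read programmatically, the snippet below builds a tiny hypothetical dataframe (the rows and birth rates are made-up sample values, not taken from the CSV) and inspects its default index. Note that the stop value is exclusive, so it equals the number of rows.

```python
import pandas as pd

# Hypothetical three-row sample; the values are illustrative only.
df = pd.DataFrame({
    "Country Name": ["Aruba", "Angola", "Albania"],
    "Birth rate": [10.2, 45.9, 12.9],
})

idx = df.index
print(idx)                            # RangeIndex(start=0, stop=3, step=1)
print(idx.start, idx.stop, idx.step)  # 0 3 1

# stop is exclusive, so the actual row labels are 0, 1 and 2.
labels = list(idx)
print(labels)                         # [0, 1, 2]
```

The same reasoning explains stop=195 above: the dataframe has 195 rows, labelled 0 through 194.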

1. Using the square brackets []

The Dataframe_name[] syntax, where Dataframe_name is the name of your dataframe, returns the values of the column whose name you provide within the square brackets.

This indexing operator lets us select any column x in our dataframe using the code Dataframe_name["x"]. Note that there is no dot between the dataframe name and the brackets.

Let's use this on our dataframe and see the output,

studyTonight_df["Country Name"]

Output:

The square brackets also allow us to select more than one column at once. To do so, pass a list of the column names instead of a single name, which is why the brackets are doubled. For example,

studyTonight_df[["Country Name","Country Code"]]

Output:
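To try both forms without downloading the CSV, here is a minimal sketch on a small hypothetical dataframe (the birth rates are made-up sample values). It illustrates that a single column name returns a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Hypothetical sample data standing in for the demographic CSV.
df = pd.DataFrame({
    "Country Name": ["Aruba", "Angola", "Albania"],
    "Country Code": ["ABW", "AGO", "ALB"],
    "Birth rate": [10.2, 45.9, 12.9],
})

one_col = df["Country Name"]                     # a string selects one column
two_cols = df[["Country Name", "Country Code"]]  # a list selects several

print(type(one_col).__name__)    # Series
print(type(two_cols).__name__)   # DataFrame
print(two_cols.shape)            # (3, 2)
```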

2. Using the loc function

Let's select rows from our dataset using the index labels of the dataframe. For this, let us re-import the data with an extra parameter:

studyTonight_df = pd.read_csv("https://sds-platform-private.s3-us-east-2.amazonaws.com/uploads/P4-Demographic-Data.csv", index_col = "Country Name")

The index_col parameter allows us to choose the column which is to be set as an index.

Inspecting the index attribute now gives a different result from the previous one. Let's see:

studyTonight_df.index

Output:

Index(['Aruba', 'Afghanistan', 'Angola', 'Albania', 'United Arab Emirates', 'Argentina', 'Armenia', 'Antigua and Barbuda', 'Australia', 'Austria', ... 'Virgin Islands (U.S.)', 'Vietnam', 'Vanuatu', 'West Bank and Gaza', 'Samoa', 'Yemen, Rep.', 'South Africa', 'Congo, Dem. Rep.', 'Zambia', 'Zimbabwe'], dtype='object', name='Country Name', length=195)
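The effect of index_col can also be reproduced offline. The sketch below reads a short CSV from an in-memory string instead of the URL (the three rows and their birth rates are hypothetical sample values) and compares the index with and without the parameter.

```python
import io
import pandas as pd

# Hypothetical CSV text used in place of the remote file.
csv_text = """Country Name,Country Code,Birth rate
Aruba,ABW,10.2
Angola,AGO,45.9
Albania,ALB,12.9
"""

default_df = pd.read_csv(io.StringIO(csv_text))
named_df = pd.read_csv(io.StringIO(csv_text), index_col="Country Name")

print(default_df.index)        # RangeIndex(start=0, stop=3, step=1)
print(list(named_df.index))    # ['Aruba', 'Angola', 'Albania']

# The index column is no longer an ordinary column of the dataframe.
print(list(named_df.columns))  # ['Country Code', 'Birth rate']
```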

We can now use the loc[] indexer to select rows by their index label, for example,

studyTonight_df.loc[ "Tunisia" ]

Output:

Country Code                      TUN
Birth rate                       19.8
Internet users                   43.8
Income Group      Upper middle income
Name: Tunisia, dtype: object

To select multiple rows by label and also restrict the result to specific columns, pass a list of row labels followed by a list of column names:

studyTonight_df.loc[ ["Tunisia", "Togo", "Bhutan"] , ["Birth rate", "Internet users"]]

Output:

To select all the rows for a particular list of columns, use a colon (:) in place of the row labels:

studyTonight_df.loc[ : , ["Birth rate", "Internet users"]]

Output:
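The three loc[] patterns from this section can be tried on a small label-indexed dataframe. This is a minimal sketch: the Tunisia figures come from the output shown earlier, while the Togo and Bhutan figures are made-up sample values.

```python
import pandas as pd

# Small stand-in for the dataframe loaded with index_col="Country Name".
# Togo and Bhutan values are hypothetical.
df = pd.DataFrame(
    {
        "Country Code": ["TUN", "TGO", "BTN"],
        "Birth rate": [19.8, 36.1, 18.1],
        "Internet users": [43.8, 4.5, 29.9],
    },
    index=pd.Index(["Tunisia", "Togo", "Bhutan"], name="Country Name"),
)

row = df.loc["Tunisia"]                                 # one label: a Series
subset = df.loc[["Tunisia", "Bhutan"], ["Birth rate"]]  # labels and columns
all_rows = df.loc[:, ["Birth rate", "Internet users"]]  # every row

print(row["Country Code"])   # TUN
print(subset.shape)          # (2, 1)
print(all_rows.shape)        # (3, 2)
```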

3. Using the iloc function

By passing the integer position of a row to the .iloc[] indexer, we can access any row of our dataset, regardless of its index label. For example,

studyTonight_df.iloc[4]

Output:

This returns the data for United Arab Emirates, which is the 5th country and is therefore accessed using position 4, since positions start at 0.

Using the colon (:) we can define a range of positions and select those rows from the dataset. For example, in the code below we have specified [:4], which means from the beginning up to, but not including, position 4, so it returns the first four rows,

studyTonight_df.iloc[:4]

Output:
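The difference between iloc[4] and iloc[:4] can be checked on a short sketch. The dataframe below keeps only the first six countries and their codes as listed in the index output earlier; everything else about it is an assumption made for illustration.

```python
import pandas as pd

# First six countries from the index shown earlier, with their codes.
df = pd.DataFrame(
    {"Country Code": ["ABW", "AFG", "AGO", "ALB", "ARE", "ARG"]},
    index=pd.Index(
        ["Aruba", "Afghanistan", "Angola", "Albania",
         "United Arab Emirates", "Argentina"],
        name="Country Name",
    ),
)

fifth = df.iloc[4]        # a single position returns that row as a Series
first_four = df.iloc[:4]  # a slice stops before position 4

print(fifth.name)                # United Arab Emirates
print(list(first_four.index))    # ['Aruba', 'Afghanistan', 'Angola', 'Albania']
```

So iloc[4] returns the fifth row, while iloc[:4] returns positions 0 to 3 and leaves the fifth row out.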

To select several rows by their positions, we can pass a list of the numbers to the .iloc[] indexer, for example,

studyTonight_df.iloc[[4, 6, 8]]

Output:

If we pass two lists to the .iloc[] indexer, the first gives the row positions and the second gives the column positions that we want to select, for example:

studyTonight_df.iloc[[2,3],[1,3]]

Output:

To select all the rows corresponding to a list of particular columns, we can replace the list for row index values with a colon (:),

studyTonight_df.iloc[:,[1,3]]

Output:
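A minimal sketch of these list-based selections follows. The country names, codes and income groups match the first four entries of the dataset as shown in this article, while the birth rate and internet user figures are made-up sample values.

```python
import pandas as pd

# Four-row stand-in; numeric values are hypothetical.
df = pd.DataFrame(
    {
        "Country Code": ["ABW", "AFG", "AGO", "ALB"],
        "Birth rate": [10.2, 35.3, 45.9, 12.9],
        "Internet users": [78.9, 5.9, 19.1, 57.2],
        "Income Group": ["High income", "Low income",
                         "Upper middle income", "Upper middle income"],
    },
    index=pd.Index(["Aruba", "Afghanistan", "Angola", "Albania"],
                   name="Country Name"),
)

picked_rows = df.iloc[[0, 2]]        # rows at positions 0 and 2
block = df.iloc[[2, 3], [1, 3]]      # rows 2-3, columns at positions 1 and 3
all_rows = df.iloc[:, [1, 3]]        # every row, columns at positions 1 and 3

print(list(picked_rows.index))   # ['Aruba', 'Angola']
print(list(block.columns))       # ['Birth rate', 'Income Group']
print(all_rows.shape)            # (4, 2)
```

Column positions are counted from 0 and do not include the index, which is why position 1 is Birth rate here rather than Country Code.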

4. Using the ix function

The ix[] indexer accepted both labels and integer positions. It was deprecated in pandas 0.20 and removed in pandas 1.0, so the examples below only run on older versions; use loc[] and iloc[] instead. Let's see a simple example of how it was used,

studyTonight_df.ix["Tajikistan"]

Output:

Or using the integer position,

studyTonight_df.ix[23]

Output:

In the examples, we also see that there is a warning telling us about the deprecation.
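On current pandas versions the same lookups are written with loc[] and iloc[]. The sketch below shows one possible translation on a small hypothetical dataframe (the Tajikistan and Togo figures are made up), including a mixed label and position lookup, which is the case ix[] used to handle in a single call.

```python
import pandas as pd

# Hypothetical three-row sample indexed by country name.
df = pd.DataFrame(
    {"Country Code": ["TJK", "TUN", "TGO"], "Birth rate": [30.8, 19.8, 36.1]},
    index=pd.Index(["Tajikistan", "Tunisia", "Togo"], name="Country Name"),
)

by_label = df.loc["Tajikistan"]   # instead of df.ix["Tajikistan"]
by_position = df.iloc[0]          # instead of df.ix[0]

# Mixed access: convert the label to a position, then use iloc.
row_pos = df.index.get_loc("Tunisia")         # label -> integer position
col_pos = df.columns.get_loc("Birth rate")
value = df.iloc[row_pos, col_pos]

print(by_label.equals(by_position))   # True
print(value)                          # 19.8
```

Converting explicitly with get_loc() keeps it clear whether each argument is a label or a position, which was the main source of ambiguity with ix[].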

Multi Indexing in Pandas

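Multi indexing (a hierarchical index) lets us label each row with more than one key, which makes it possible to organize and select data across multiple dimensions within a single DataFrame.

By passing a list of column names to the set_index() method we can set a hierarchical index for our dataframe. Note that set_index() replaces the existing index, so the Country Name index set earlier is discarded unless it is first restored as a column with reset_index().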

studyTonight_df.set_index(["Country Code","Income Group"], inplace=True)
print(studyTonight_df)

Output:

If we check the index attribute, we will notice that it now holds a MultiIndex with multiple levels, unlike the single-level index we saw before:

print(studyTonight_df.index)

We get,

MultiIndex(levels=[['ABW', 'AFG', 'AGO', 'ALB', 'ARE', 'ARG', 'ARM', 'ATG', 'AUS', 'AUT', 'AZE', 'BDI', 'BEL', 'BEN', 'BFA', 'BGD', 'BGR', 'BHR', 'BHS', 'BIH', 'BLR', 'BLZ', 'BMU', 'BOL', 'BRA', 'BRB', 'BRN', 'BTN', 'BWA', 'CAF', 'CAN', 'CHE', 'CHL', 'CHN', 'CIV', 'CMR', 'COD', 'COG', 'COL', 'COM', 'CPV', 'CRI', 'CUB', 'CYM', 'CYP', 'CZE', 'DEU', 'DJI', 'DNK', 'DOM', 'DZA', 'ECU', 'EGY', 'ERI', 'ESP', 'EST', 'ETH', 'FIN', 'FJI', 'FRA', 'FSM', 'GAB', 'GBR', 'GEO', 'GHA', 'GIN', 'GMB', 'GNB', 'GNQ', 'GRC', 'GRD', 'GRL', 'GTM', 'GUM', 'GUY', 'HKG', 'HND', 'HRV', 'HTI', 'HUN', 'IDN', 'IND', 'IRL', 'IRN', 'IRQ', 'ISL', 'ISR', 'ITA', 'JAM', 'JOR', 'JPN', 'KAZ', 'KEN', 'KGZ', 'KHM', 'KIR', 'KOR', 'KWT', 'LAO', 'LBN', 'LBR', 'LBY', 'LCA', 'LIE', 'LKA', 'LSO', 'LTU', 'LUX', 'LVA', 'MAC', 'MAR', 'MDA', 'MDG', 'MDV', 'MEX', 'MKD', 'MLI', 'MLT', 'MMR', 'MNE', 'MNG', 'MOZ', 'MRT', 'MUS', 'MWI', 'MYS', 'NAM', 'NCL', 'NER', 'NGA', 'NIC', 'NLD', 'NOR', 'NPL', 'NZL', 'OMN', 'PAK', 'PAN', 'PER', 'PHL', 'PNG', 'POL', 'PRI', 'PRT', 'PRY', 'PSE', 'PYF', 'QAT', 'ROU', 'RUS', 'RWA', 'SAU', 'SDN', 'SEN', 'SGP', 'SLB', 'SLE', 'SLV', 'SOM', 'SRB', 'SSD', 'STP', 'SUR', 'SVK', 'SVN', 'SWE', 'SWZ', 'SYC', 'SYR', 'TCD', 'TGO', 'THA', 'TJK', 'TKM', 'TLS', 'TON', 'TTO', 'TUN', 'TUR', 'TZA', 'UGA', 'UKR', 'URY', 'USA', 'UZB', 'VCT', 'VEN', 'VIR', 'VNM', 'VUT', 'WSM', 'YEM', 'ZAF', 'ZMB', 'ZWE'], ['High income', 'Low income', 'Lower middle income', 'Upper middle income']],
codes=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 145, 190, 191, 192, 36, 193, 194], [0, 1, 3, 3, 0, 0, 2, 0, 0, 0, 3, 1, 0, 1, 1, 2, 3, 0, 0, 3, 3, 3, 0, 2, 3, 0, 0, 2, 3, 1, 0, 0, 0, 3, 2, 2, 2, 3, 1, 2, 3, 3, 0, 0, 0, 0, 2, 0, 3, 3, 3, 2, 1, 0, 0, 1, 0, 3, 0, 2, 3, 0, 2, 2, 1, 1, 1, 0, 0, 3, 0, 2, 0, 2, 0, 2, 0, 1, 0, 2, 2, 0, 3, 3, 0, 0, 0, 3, 3, 0, 3, 2, 2, 1, 2, 0, 0, 2, 3, 1, 3, 3, 0, 2, 2, 0, 0, 0, 0, 2, 2, 1, 3, 3, 3, 1, 0, 2, 3, 3, 1, 2, 3, 1, 3, 3, 0, 1, 2, 2, 0, 0, 1, 0, 0, 2, 3, 3, 2, 2, 0, 0, 0, 3, 0, 0, 3, 0, 1, 0, 2, 2, 0, 2, 1, 2, 1, 3, 1, 2, 3, 0, 0, 0, 2, 0, 2, 1, 1, 3, 2, 3, 2, 3, 0, 3, 3, 1, 1, 2, 0, 0, 2, 3, 0, 0, 2, 2, 2, 2, 2, 3, 1, 2, 1]],
names=['Country Code', 'Income Group'])
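To see how such an index is used for selection, here is a minimal sketch on four rows. The codes and income groups match the first four entries above; the birth rates are made-up sample values. It uses set_index() without inplace=True, which returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

# Four-row stand-in; birth rates are hypothetical.
df = pd.DataFrame({
    "Country Code": ["ABW", "AFG", "AGO", "ALB"],
    "Income Group": ["High income", "Low income",
                     "Upper middle income", "Upper middle income"],
    "Birth rate": [10.2, 35.3, 45.9, 12.9],
})

multi_df = df.set_index(["Country Code", "Income Group"])

print(multi_df.index.nlevels)       # 2
print(list(multi_df.index.names))   # ['Country Code', 'Income Group']

# A full key (one value per level) plus a column name selects one cell.
cell = multi_df.loc[("AGO", "Upper middle income"), "Birth rate"]
print(cell)                         # 45.9

# A key for the first level only returns the matching rows,
# indexed by the remaining level.
partial = multi_df.loc["ABW"]
print(list(partial.index))          # ['High income']
```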

This shows that there is more than one level of indices in this DataFrame. Dataframes with multiple indices can be sorted too, in the same way as a normal dataframe:

studyTonight_df.sort_index(inplace=True)
print(studyTonight_df)

Output:

Here we see that the DataFrame has been sorted according to the alphabetical order of the first index.
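The sorting order can be checked on a small, deliberately shuffled MultiIndex. In this sketch the codes and income groups are taken from the index shown earlier, the birth rates are made-up sample values, and the level argument of sort_index() is used to sort by the second level instead of the first.

```python
import pandas as pd

# Five shuffled rows; birth rates are hypothetical.
df = pd.DataFrame(
    {"Birth rate": [12.9, 11.0, 10.2, 45.9, 35.3]},
    index=pd.MultiIndex.from_tuples(
        [
            ("ALB", "Upper middle income"),
            ("ARE", "High income"),
            ("ABW", "High income"),
            ("AGO", "Upper middle income"),
            ("AFG", "Low income"),
        ],
        names=["Country Code", "Income Group"],
    ),
)

by_code = df.sort_index()                        # first level, then second
by_group = df.sort_index(level="Income Group")   # start from the second level

codes_sorted = list(by_code.index.get_level_values("Country Code"))
groups_sorted = list(by_group.index.get_level_values("Income Group"))

print(codes_sorted)    # ['ABW', 'AFG', 'AGO', 'ALB', 'ARE']
print(groups_sorted)
```

Without inplace=True, sort_index() returns a sorted copy, so df itself keeps its original row order.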

Conclusion

In this article, we covered the main ways to select data through the index of a pandas DataFrame: square brackets, loc[], iloc[], the deprecated ix[], and hierarchical indexing with set_index() and sort_index(). Hopefully, you will now be able to apply what you learned in this article in your own programs.



About the author:
I like writing about Python, and frameworks like Pandas, Numpy, Scikit, etc. I am still learning Python. I like sharing what I learn with others through my content.