pandas is a Python library that provides data structures and tools for data analysis. It is built on top of NumPy and is used for manipulating and analyzing data.
Installation
Install pandas using conda:
conda install -c conda-forge pandas
Install pandas using pip:
pip install pandas
Import pandas
To use pandas, import the pandas module:
import pandas as pd
DataFrame
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It is similar to an Excel spreadsheet.
Create a DataFrame:
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)
print(df)
The results will be:
                       Name  Age     Sex
0   Braund, Mr. Owen Harris   22    male
1  Allen, Mr. William Henry   35    male
2  Bonnell, Miss. Elizabeth   58  female
In this example, the DataFrame has three columns: Name, Age, and Sex. The Name column contains strings, the Age column contains integers, and the Sex column contains strings.
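One way to check the column types described above is the dtypes attribute. The sketch below rebuilds the same DataFrame so that it runs on its own. The dtype reported for the string columns depends on the pandas version (older releases report object), so no expected output is shown.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# dtypes holds one entry per column, labeled by column name.
column_types = df.dtypes
print(column_types)

# Check a single column programmatically: Age is inferred as an integer type.
age_is_integer = pd.api.types.is_integer_dtype(df["Age"])
print(age_is_integer)
```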
The DataFrame can be indexed using the column names:
print(df.columns)
print(df["Name"])
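As a small illustration of column indexing, the sketch below shows that a single column label returns a Series, while a list of labels returns a smaller DataFrame. The variable names names and subset are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# A single label selects one column as a Series.
names = df["Name"]
print(type(names).__name__)

# A list of labels selects several columns as a DataFrame.
subset = df[["Name", "Age"]]
print(type(subset).__name__)
print(subset.shape)
```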
The DataFrame can be indexed using the row index. In this example, the row index is 0, 1, and 2:
print(df.index)
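To actually select rows through the row index, pandas offers loc (label-based) and iloc (position-based). The following is a minimal sketch on the same example data. With the default index the labels and positions coincide, but slices behave differently: loc includes the end label, while iloc excludes the end position.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# Select one row by index label and by integer position.
first_by_label = df.loc[0]
first_by_position = df.iloc[0]
print(first_by_label["Name"])
print(first_by_position["Age"])

# Label slices include the end label; position slices exclude the end.
rows_by_label = df.loc[0:1]
rows_by_position = df.iloc[0:1]
print(len(rows_by_label), len(rows_by_position))
```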
The values attribute returns the data in the DataFrame as a 2D NumPy array:
print(df.values)
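The pandas documentation recommends the to_numpy() method over the values attribute for new code. A short sketch on the same example data follows. Because the columns mix strings and integers, the full array typically falls back to a general-purpose object dtype, whereas a single numeric column converts to a numeric array.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# Convert the whole DataFrame: one row per record, one column per field.
array = df.to_numpy()
print(array.shape)

# Convert a single numeric column to get a numeric array.
ages = df["Age"].to_numpy()
print(ages.sum())
```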
Read Data
pandas supports reading and writing data in various formats, including CSV, Excel, and SQL databases.
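As a rough sketch of the read/write pairing, the example below writes a small DataFrame to CSV text and reads it back. An in-memory buffer (io.StringIO) stands in for a file so that the example runs without touching the disk; this is an assumption of the sketch, not something the tutorial requires. Excel and SQL are handled by pd.read_excel and pd.read_sql, which need extra dependencies and are not shown here.

```python
import io

import pandas as pd

df = pd.DataFrame({"Age": [22, 35, 58], "Sex": ["male", "male", "female"]})

# Write the DataFrame as CSV text; index=False omits the row index column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```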
Read CSV
We use the iris dataset as an example. This dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed.
First, we create a CSV file named iris.csv. For demonstration, we use only 15 samples. The content of the file is:
In the expression iris[iris["sepal_length"] > 5], the comparison produces a Series of True/False values, one per row, and only the rows where the value is True are selected.
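To make the intermediate True/False values visible, the sketch below builds the mask as a separate step. The four rows are illustrative stand-in values, not the contents of iris.csv; only the column names sepal_length and class are taken from the text.

```python
import pandas as pd

# Illustrative stand-in rows, NOT the contents of iris.csv.
iris = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 7.0, 6.3],
        "class": [
            "Iris-setosa",
            "Iris-setosa",
            "Iris-versicolor",
            "Iris-virginica",
        ],
    }
)

# The comparison yields one boolean per row.
mask = iris["sepal_length"] > 5
print(mask.tolist())  # [True, False, True, True]

# Indexing with the mask keeps only the rows where it is True.
selected = iris[mask]
print(selected)
```

Note that the selected rows keep their original index labels, so the result here is labeled 0, 2, 3 rather than renumbered.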
It is also possible to filter rows based on multiple conditions. For example, to select the rows where sepal_length is greater than 5 and class is Iris-setosa:
import pandas as pd

iris = pd.read_csv("iris.csv")

# Filter rows based on multiple conditions
selected_rows = iris[(iris["sepal_length"] > 5) & (iris["class"] == "Iris-setosa")]
print(selected_rows)
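The parentheses around each condition matter: & binds more tightly than > and ==, and Python's and keyword does not work element-wise on a Series. The sketch below names each condition first and then combines them with & (and), | (or), and ~ (not). The CSV text is a hypothetical stand-in for iris.csv, read from an in-memory buffer so the example runs without the file.

```python
import io

import pandas as pd

# Hypothetical stand-in for iris.csv (illustrative rows only).
csv_text = """sepal_length,class
5.1,Iris-setosa
4.9,Iris-setosa
7.0,Iris-versicolor
6.3,Iris-virginica
"""
iris = pd.read_csv(io.StringIO(csv_text))

# Name each condition so the combinations stay readable.
long_sepal = iris["sepal_length"] > 5
is_setosa = iris["class"] == "Iris-setosa"

both = iris[long_sepal & is_setosa]    # rows meeting both conditions
either = iris[long_sepal | is_setosa]  # rows meeting at least one
not_setosa = iris[~is_setosa]          # rows failing the second condition

print(len(both), len(either), len(not_setosa))
```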