pandas

pandas is a Python library that provides labeled data structures and tools for data manipulation and analysis. It is built on top of NumPy.

Installation

Install pandas using conda:

conda install -c conda-forge pandas

Install pandas using pip:

pip install pandas

Import pandas

To use pandas, import the pandas module:

import pandas as pd

DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It is similar to an Excel spreadsheet.

Create a DataFrame:

import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

print(df)

The results will be:

                       Name  Age     Sex
0   Braund, Mr. Owen Harris   22    male
1  Allen, Mr. William Henry   35    male
2  Bonnell, Miss. Elizabeth   58  female

In this example, the DataFrame has three columns: Name, Age, and Sex. The Name column contains strings, the Age column contains integers, and the Sex column contains strings.
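
The column types can also be checked in code. The following is a minimal sketch that rebuilds the same DataFrame and inspects its dtypes attribute. The name pandas reports for string columns depends on the installed version, so the sketch uses the helpers in pandas.api.types rather than comparing against a fixed dtype name. The variable names age_is_integer and name_is_string are illustrative only.

```python
import pandas as pd
from pandas.api import types as ptypes

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# One dtype per column
print(df.dtypes)

# Version-independent checks of what kind of data each column holds
age_is_integer = ptypes.is_integer_dtype(df["Age"])
name_is_string = ptypes.is_string_dtype(df["Name"])
print(age_is_integer, name_is_string)
```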

The column labels are available through the columns attribute, and a single column can be selected by its name:

print(df.columns)
print(df["Name"])

Rows are identified by the row index. In this example, the row index labels are 0, 1, and 2:

print(df.index)
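
As a short sketch of how the row index is used for selection, the loc indexer (covered in more detail under Filter Rows and Columns) looks rows up by their index label. The example rebuilds the DataFrame from above; the variable names row and first_two are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# Select the row whose index label is 1; a single row comes back as a Series
row = df.loc[1]
print(row)

# Select the rows labeled 0 through 1; label slices include the end label
first_two = df.loc[0:1]
print(first_two)
```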

The values attribute returns the data in the DataFrame as a 2D NumPy array:

print(df.values)
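
The pandas documentation recommends the to_numpy() method over the values attribute for new code. A brief sketch, using the same DataFrame, of how the two cases differ: converting all columns at once versus converting only a numeric column. The variable names all_values and ages are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# Columns of mixed types are converted to one common array type
all_values = df.to_numpy()
print(all_values.shape)

# A purely numeric selection keeps a numeric dtype
ages = df[["Age"]].to_numpy()
print(ages.dtype)
```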

Read Data

pandas supports reading and writing data in various formats, including CSV, Excel, and SQL databases.
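
To illustrate the write and read directions together, here is a minimal round-trip sketch for CSV. It writes to an in-memory io.StringIO buffer instead of a file on disk, which is an assumption made only so that the example runs anywhere; a file path works the same way.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry"],
        "Age": [22, 35],
    }
)

# Write CSV text to an in-memory buffer; index=False omits the row index column
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

The names contain commas, so to_csv quotes them, and read_csv restores them as single values.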

Read CSV

We use the iris dataset as an example. This dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed.

First, we create a CSV file named iris.csv. For demonstration, we use only 15 samples. The content of the file is:

sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica

Next, we read the CSV file using the read_csv function:

import pandas as pd

df = pd.read_csv("iris.csv")

# Display the first 6 rows
print(df.head(6))

The results will be:

   sepal_length  sepal_width  petal_length  petal_width            class
0           5.1          3.5           1.4          0.2      Iris-setosa
1           4.9          3.0           1.4          0.2      Iris-setosa
2           5.0          3.4           1.5          0.2      Iris-setosa
3           4.4          2.9           1.4          0.2      Iris-setosa
4           4.9          3.1           1.5          0.1      Iris-setosa
5           7.0          3.2           4.7          1.4  Iris-versicolor
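
After loading a file, it is often useful to check its size and contents before working with it. The sketch below assumes the same 15 rows as iris.csv, embedded as a string and read through io.StringIO so that it runs without the file on disk.

```python
import io

import pandas as pd

# Inline copy of iris.csv so the example is self-contained
csv_text = """sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                    # (number of rows, number of columns)
print(df["class"].value_counts())  # number of observations per species
print(df.describe())               # summary statistics for the numeric columns
```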

Filter Data

Filter Rows

To select rows where sepal_length is greater than 5:

import pandas as pd

iris = pd.read_csv("iris.csv")

# Filter rows based on condition
selected_rows = iris[iris["sepal_length"] > 5]

print(selected_rows)

The results will be:

    sepal_length  sepal_width  petal_length  petal_width            class
0            5.1          3.5           1.4          0.2      Iris-setosa
5            7.0          3.2           4.7          1.4  Iris-versicolor
6            6.4          3.2           4.5          1.5  Iris-versicolor
7            6.9          3.1           4.9          1.5  Iris-versicolor
8            5.5          2.3           4.0          1.3  Iris-versicolor
9            6.5          2.8           4.6          1.5  Iris-versicolor
10           6.3          3.3           6.0          2.5   Iris-virginica
11           5.8          2.7           5.1          1.9   Iris-virginica
12           7.1          3.0           5.9          2.1   Iris-virginica
13           6.3          2.5           5.0          1.9   Iris-virginica
14           6.5          3.0           5.2          2.0   Iris-virginica

You may wonder how the code works. Let's break it down:

import pandas as pd

iris = pd.read_csv("iris.csv")

print(iris["sepal_length"] > 5)

This code outputs a Series of Boolean values:


0      True
1     False
2     False
3     False
4     False
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
Name: sepal_length, dtype: bool

In the code iris[iris["sepal_length"] > 5], only the rows with True values are selected.
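
The same idea can be sketched with the Boolean Series stored in a variable first. The example below uses five rows taken from iris.csv (the first three Iris-setosa rows and the first two Iris-versicolor rows), built inline so that it runs without the file; the names mask and n_matching are illustrative.

```python
import pandas as pd

iris = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 5.0, 7.0, 6.4],
        "class": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 2,
    }
)

# Boolean Series: one True/False value per row
mask = iris["sepal_length"] > 5

# True counts as 1, so the sum is the number of matching rows
n_matching = int(mask.sum())
print(n_matching)

# Indexing with the mask keeps only the rows where it is True
selected = iris[mask]
print(selected)
```

Note that the selected rows keep their original index labels (0, 3, and 4 here) rather than being renumbered.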

It is also possible to filter rows based on multiple conditions. For example, to select rows where sepal_length is greater than 5 and class is Iris-setosa:

import pandas as pd

iris = pd.read_csv("iris.csv")

# Filter rows based on multiple conditions
selected_rows = iris[(iris["sepal_length"] > 5) & (iris["class"] == "Iris-setosa")]

print(selected_rows)

The results will be:

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa

Note that each condition must be enclosed in parentheses, because & and | bind more tightly than comparison operators such as > and ==. Use & for and, and | for or.
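
A short sketch of the other combinations, using five rows taken from iris.csv and built inline so that the example runs without the file. Besides | for or, it also shows ~, which negates a condition. The variable names either and not_setosa are illustrative.

```python
import pandas as pd

iris = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 5.0, 7.0, 6.4],
        "class": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 2,
    }
)

# OR: rows that satisfy at least one of the two conditions
either = iris[(iris["sepal_length"] > 6.5) | (iris["class"] == "Iris-setosa")]
print(either)

# NOT: rows that do not satisfy the condition
not_setosa = iris[~(iris["class"] == "Iris-setosa")]
print(not_setosa)
```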

Filter Rows and Columns

The loc and iloc indexers are used to select rows and columns. They are attributes used with square brackets, not methods.

loc is label-based: you specify rows and columns by their labels. It also accepts a Boolean mask for the rows, as in the next example.

For example, to select rows where sepal_length is greater than 4.5 and class is Iris-setosa, and display only the sepal_length and class columns:

import pandas as pd

iris = pd.read_csv("iris.csv")

# Filter rows and columns
selected_data = iris.loc[(iris["sepal_length"] > 4.5) & (iris["class"] == "Iris-setosa"), ["sepal_length", "class"]]

print(selected_data)

This code will output:

   sepal_length        class
0           5.1  Iris-setosa
1           4.9  Iris-setosa
2           5.0  Iris-setosa
4           4.9  Iris-setosa

iloc is integer-location based: you specify rows and columns by their integer position, starting from 0.

For example, to select the first 3 rows and the first 2 columns:

import pandas as pd

iris = pd.read_csv("iris.csv")

# Filter rows and columns
selected_data = iris.iloc[:3, :2]

print(selected_data)

The results will be:

   sepal_length  sepal_width
0           5.1          3.5
1           4.9          3.0
2           5.0          3.4
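
The difference between labels and positions becomes visible after filtering, because the remaining rows keep their original labels. A minimal sketch, using five rows taken from iris.csv and built inline so that it runs without the file; the names filtered, by_position, and by_label are illustrative.

```python
import pandas as pd

iris = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 5.0, 7.0, 6.4],
        "class": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 2,
    }
)

# After filtering, the remaining rows have the labels 0, 3, and 4
filtered = iris[iris["sepal_length"] > 5]

# iloc counts positions within the filtered result: position 1 is the second row
by_position = filtered.iloc[1]

# loc looks up the original label: label 3 is that same row
by_label = filtered.loc[3]

print(by_position)
print(by_label)
```

Here filtered.loc[1] would raise a KeyError, because no row with the label 1 remains after filtering.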

Plot Data

First, we download the full iris dataset from the internet. The file has no header row, so we pass header=None and set the column names ourselves:

import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
iris = pd.read_csv(url, header=None)
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
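
As an alternative, read_csv accepts a names argument that assigns the column names while reading. The sketch below assumes two sample rows in the same headerless layout, read from an in-memory string rather than from the URL so that it runs without network access; raw_text and column_names are illustrative names.

```python
import io

import pandas as pd

# Two rows in the same headerless layout as the downloaded file
raw_text = "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n"

column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# header=None: the first line is data; names=...: use these column labels
iris = pd.read_csv(io.StringIO(raw_text), header=None, names=column_names)
print(iris)
```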

Scatter Plot

To create a scatter plot of sepal_length against petal_length for each class:

import pandas as pd
import matplotlib.pyplot as plt

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
iris = pd.read_csv(url, header=None)
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

classes = iris["class"].unique()

for class_name in classes:
    # Select the rows for this class once and reuse them for both axes
    subset = iris[iris["class"] == class_name]
    plt.scatter(subset["sepal_length"],
                subset["petal_length"],
                alpha=0.5,
                label=class_name)

plt.legend()
plt.xlabel("sepal_length")
plt.ylabel("petal_length")
plt.show()

The results will be:

[Figure: scatter plot of sepal_length against petal_length, one color per class]

Histogram

To create a histogram of sepal_length for each class:

import pandas as pd
import matplotlib.pyplot as plt

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
iris = pd.read_csv(url, header=None)
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

iris.plot.hist(column="sepal_length", by="class", bins=10, alpha=0.5, figsize=(6, 12))
plt.xlabel("sepal_length")
plt.show()

The results will be:

[Figure: histograms of sepal_length, one panel per class]
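
As a numeric complement to the plot, the bin counts that a histogram displays can be computed directly. This is a rough sketch using numpy.histogram on ten sepal_length values taken from iris.csv, not a reproduction of what pandas does internally; in particular, it bins all values together rather than per class.

```python
import numpy as np
import pandas as pd

# Ten sepal_length values from iris.csv (five setosa, five versicolor)
sepal_length = pd.Series(
    [5.1, 4.9, 5.0, 4.4, 4.9, 7.0, 6.4, 6.9, 5.5, 6.5],
    name="sepal_length",
)

# Split the range from the minimum to the maximum into 10 equal-width bins
counts, edges = np.histogram(sepal_length, bins=10)

print(counts)  # number of values falling into each bin
print(edges)   # 11 bin edges for 10 bins
```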

Box Plot

To create a box plot of sepal_length for each class:

import pandas as pd
import matplotlib.pyplot as plt

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
iris = pd.read_csv(url, header=None)
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

iris.boxplot(column="sepal_length", by="class")
plt.ylabel("sepal_length")
plt.show()

The results will be:

[Figure: box plots of sepal_length, one box per class]
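
As a numeric complement to the plot, the quartiles that each box summarizes can be computed with groupby. In a box plot, the box spans the first to the third quartile and the line inside it marks the median. The sketch uses the ten Iris-setosa and Iris-versicolor rows from iris.csv, built inline so that it runs without network access; the name quartiles is illustrative.

```python
import pandas as pd

iris = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 5.0, 4.4, 4.9, 7.0, 6.4, 6.9, 5.5, 6.5],
        "class": ["Iris-setosa"] * 5 + ["Iris-versicolor"] * 5,
    }
)

# First quartile, median, and third quartile of sepal_length per class;
# unstack() turns the quantile levels into columns
quartiles = (
    iris.groupby("class")["sepal_length"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```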

