Let’s import our go-to tool for data cleaning and data wrangling:
import pandas as pd
Reading a CSV file in pandas:
df = pd.read_csv("data/sample-data.csv")
The df.head() method returns the first 5 rows by default and accepts an argument n to display a different number of rows. The df.info() method prints column names, non-null counts, data types, and memory usage. The df.size attribute returns the total number of elements in the DataFrame (rows times columns).
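To make the three inspection tools concrete, here is a small sketch. The CSV text is made up for illustration and read from memory, so it only stands in for a real file like data/sample-data.csv; the column names are assumptions too.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a real file on disk
csv_text = """name,price_abc,qty
apple,120.5,3
banana,,12
cherry,310.0,
date,45.25,7
elder,99.0,1
fig,150.0,4
"""
df = pd.read_csv(io.StringIO(csv_text))

first_five = df.head()    # default: first 5 rows
first_two = df.head(2)    # n=2: only the first 2 rows
df.info()                 # prints the summary to stdout and returns None
total_cells = df.size     # rows times columns, missing cells included

print(df.shape, total_cells)
```

Note that df.size counts every cell, including empty ones, so it is not a count of valid values; the non-null counts in df.info() are the place to look for that.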
A common first step in data cleaning is handling null values. You can drop every row that contains at least one NaN value using:
df = df.dropna()
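Before dropping anything, it can help to see how much data would be lost. The sketch below uses an invented DataFrame (the column names are assumptions) to count missing values per column and to compare a plain dropna() with one restricted to a single column through the subset parameter.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing price and one missing quantity
df = pd.DataFrame({
    "name": ["apple", "banana", "cherry", "date"],
    "price_abc": [120.5, np.nan, 310.0, 45.25],
    "qty": [3, 12, np.nan, 7],
})

missing_per_column = df.isna().sum()              # NaN count for each column
dropped_any = df.dropna()                         # drops rows with a NaN in any column
dropped_price = df.dropna(subset=["price_abc"])   # drops rows only when price_abc is NaN

print(missing_per_column)
print(len(df), len(dropped_any), len(dropped_price))
```

If only some columns matter for your analysis, the subset form keeps more rows than a blanket dropna().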
Pandas has built-in string methods like split and replace, available on text columns through the .str accessor. To change the data type of a column, use:
df1["col"] = df1["col"].astype(float)
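Here is a rough sketch of how the string methods and astype can work together. The data is invented: "col" holds numbers stored as text with thousands separators, and "full_name" is a hypothetical column added just to show split.

```python
import pandas as pd

# Hypothetical data: numbers stored as text, plus a name column to split
df1 = pd.DataFrame({
    "col": ["1,200.50", "87.20", "3,000"],
    "full_name": ["Ada Lovelace", "Alan Turing", "Grace Hopper"],
})

# Strip the thousands separator as plain text, then convert to float
df1["col"] = df1["col"].str.replace(",", "", regex=False).astype(float)

# Split on the first space only and spread the parts over two new columns
df1[["first", "last"]] = df1["full_name"].str.split(" ", n=1, expand=True)

print(df1.dtypes)
print(df1)
```

Calling astype(float) directly on the original text would raise an error because of the commas, which is why the replace step comes first.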
To compute a column from another column, pandas applies arithmetic to every row at once, similar to a NumPy array. For example:
df2["price_usd"] = (df2["price_abc"] / 87.2).round(2)
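A tiny worked example of the same idea follows. The prices are invented, and reading 87.2 as "ABC units per US dollar" is my assumption based on the division above.

```python
import pandas as pd

# Assumption: 87.2 is the number of ABC currency units per US dollar
ABC_PER_USD = 87.2

# Hypothetical prices in the ABC currency
df2 = pd.DataFrame({"price_abc": [872.0, 436.0, 100.0]})

# The division and rounding are applied to every row at once, no loop needed
df2["price_usd"] = (df2["price_abc"] / ABC_PER_USD).round(2)

print(df2)
```

Giving the rate a name keeps the conversion in one place, so it is easy to update if the rate changes.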
This way, you can clean the DataFrames loaded from separate CSV files individually and then concatenate them into one:
df = pd.concat([df1, df2])
df.info()
You can confirm the cleaned data with df.info().
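One detail worth knowing about pd.concat: by default each DataFrame keeps its own row labels, so the combined index can contain repeats. A minimal sketch with two invented DataFrames shows the difference that ignore_index=True makes.

```python
import pandas as pd

# Two hypothetical cleaned DataFrames with the same columns
df1 = pd.DataFrame({"name": ["apple", "banana"], "price_usd": [1.38, 0.52]})
df2 = pd.DataFrame({"name": ["cherry", "date"], "price_usd": [3.56, 0.52]})

stacked = pd.concat([df1, df2])                          # keeps labels 0, 1, 0, 1
renumbered = pd.concat([df1, df2], ignore_index=True)    # fresh labels 0, 1, 2, 3

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels are harmless for many tasks, but renumbering avoids surprises when you later select rows by label with df.loc.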
I will continue with more topics on EDA and data analysis with pandas :D
Pick a simple CSV file for practice.
Thanks for reading!