Pandas (Python ≥ 3.9) is the de facto tabular-data toolkit for analytical pipelines. It offers two primary labelled containers, Series and DataFrame, designed around NumPy vectorisation for speed.
Note: current stable is pandas 2.2 (as of 2025-03).
Install/upgrade:
pip install --upgrade pandas                 # classic
python -m pip install pandas numpy pyarrow   # adds Parquet/Feather support
# conda (recommended for scientific stack)
conda install -c conda-forge pandas
If you want Polars-style back-end speed for string and nested dtypes, request the experimental PyArrow backend per call (dtype_backend="pyarrow" in readers, or convert_dtypes, both shown below); pandas 2.x has no stable global switch for it. Separately, opting in to copy-on-write removes most SettingWithCopyWarning cases:
pd.set_option("mode.copy_on_write", True)
One‑dimensional labelled array of homogeneous dtype.
import pandas as pd
s = pd.Series([4, 7, -3], index=["a","b","c"], name="score")
Two‑dimensional tabular data structure with labelled axes.
df = pd.DataFrame({
    "name": ["Ada", "Ben", "Cara"],
    "age": [32, 34, 29],
    "join": pd.to_datetime(["2023-01-10", "2022-11-23", "2024-07-05"])
}, index=[1001, 1002, 1003])
df.set_index("city", inplace=True)
MultiIndex lets you represent higher‑dimensional data within 2‑D tabular form.
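A minimal sketch (the level and column names here are illustrative):
idx = pd.MultiIndex.from_product([[2023, 2024], ["Q1", "Q2"]], names=["year", "quarter"])
sales = pd.Series([10, 12, 9, 14], index=idx, name="units")
sales.loc[2024]             # slice the outer level
sales.unstack("quarter")    # pivot the inner level into columns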
| Format | Reader read_* | Writer to_* |
|---|---|---|
| CSV / TSV | read_csv | to_csv |
| Parquet | read_parquet | to_parquet |
| SQL | read_sql | to_sql |
| Excel | read_excel | to_excel |
| JSON | read_json | to_json |
df = pd.read_csv("sales_2024.csv", parse_dates=["timestamp"])
df.to_parquet("sales_2024.parquet", index=False)
df = pd.read_csv("sales.csv", parse_dates=["date"], dtype_backend="pyarrow")
read_csv supports dozens of parameters: compression, chunked iteration, Arrow dtypes, remote URLs, and more. For example:
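A hedged sample of common options (the file name and column names are illustrative):
df = pd.read_csv(
    "sales_2024.csv.gz",               # compression inferred from the extension
    usecols=["timestamp", "city", "amount"],
    parse_dates=["timestamp"],
    dtype={"city": "category"},
    nrows=1_000,                       # sample the first rows while prototyping
)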
Tip — Parquet is columnar and compressed; expect 5-10× smaller files and faster loads.
df.to_parquet("sales.parquet", engine="pyarrow", compression="zstd")
Parquet preserves schema and nullability; the Arrow engine makes read/write round-trips roughly 30× faster than CSV.
Use label-based selection (df.loc["row_label", "col"]) and position-based selection (df.iloc[3, 0]).
4.2.1 Single cell
df.at[1002, "age"] = 35
4.2.2 Conditional assignment
df.loc[df["age"] < 30, "segment"] = "young"
df[df["pop"] > 4]
df.query("pop > 4 & city == 'Joburg'")
Avoid SettingWithCopyWarning by filtering and assigning in a single .loc call rather than through chained selections, as sketched below.
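A sketch of the difference, reusing the age/segment columns from above:
# chained selection: may write to a temporary copy and warn
df[df["age"] < 30]["segment"] = "young"

# single .loc call: filters and assigns on the original frame
df.loc[df["age"] < 30, "segment"] = "young"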
df.isna().sum()
df.fillna({"pop": df["pop"].median()})
The isna / fillna duo forms pandas’ core missing‑data toolkit.
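A tiny illustration on a throwaway Series:
s = pd.Series([1.0, None, 3.0, None])
s.isna()        # boolean mask of missing values
s.fillna(0)     # replace NA with a constant
s.ffill()       # propagate the last valid value forward
s.dropna()      # or drop missing entries entirely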
df["rating"] = df["rating"].astype("Int64") # preserves NA
df.convert_dtypes(dtype_backend="pyarrow")
Arrow‑backed extension arrays deliver first‑class nullability and memory savings.
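A quick way to see the effect; the column contents here are made up:
obj = pd.Series(["alpha", "beta", None] * 100_000)       # object dtype
arrow = obj.convert_dtypes(dtype_backend="pyarrow")      # NA-aware Arrow-backed string dtype
obj.memory_usage(deep=True), arrow.memory_usage(deep=True)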
Core quick-stats (count, mean, std, min, quartiles, max) for numeric columns:
df.describe(numeric_only=True)
GroupBy orchestrates splitting, applying, and combining in a single fluent chain.
# average age by segment
df.groupby("segment", observed=True)["age"].mean()
# multi‑agg with named columns
g = df.groupby("segment", observed=True).agg(
    avg_age=("age", "mean"),
    n=("age", "size"),
)
With observed=True, unused categories of a categorical grouper are omitted from the result, as in the sketch below.
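For example, with a grouper that declares a category no row actually uses (values illustrative):
seg = pd.Categorical(["young", "young", "adult"], categories=["young", "adult", "senior"])
tmp = pd.DataFrame({"segment": seg, "age": [25, 28, 40]})
tmp.groupby("segment", observed=True)["age"].mean()    # result covers young and adult only
tmp.groupby("segment", observed=False)["age"].mean()   # also lists senior, with NaN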
pd.concat([q1_df, q2_df], axis=0, ignore_index=True)
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
merged = orders.merge(
    customers,
    on="customer_id",
    how="left",
    indicator=True,   # adds a _merge column showing each row's join result
)
pd.merge(df_a, df_b, on="id", how="inner")
pd.concat([df1, df2, df3], axis=0, ignore_index=True)
Use concat for stacking objects; DataFrame.append was deprecated and has been removed in pandas 2.0.
# wide → long
tidy = pd.melt(df,
               id_vars=["name"],
               value_vars=["jan", "feb", "mar"],
               var_name="month",
               value_name="sales")
# long → wide
wide = tidy.pivot(index="name", columns="month", values="sales")
# DatetimeIndex
ts = df.set_index("join").sort_index()
# resample to monthly start
monthly = ts.resample("MS")["age"].mean()
# window rolling mean
ts["age"].rolling(window="30D", min_periods=5).mean()
Time zone localisation: ts.tz_localize("Africa/Johannesburg")
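tz_localize attaches a zone to a naive index, while tz_convert translates between zones; a short sketch:
ts = ts.tz_localize("Africa/Johannesburg")   # naive timestamps → zone-aware
ts.tz_convert("UTC")                         # same instants, expressed in UTC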
# forward-fill missing values (fillna(method="ffill") is deprecated; use ffill)
df["close"] = df["close"].ffill()
emails = pd.Series(["a@x.co","b@y.org ",None])
emails.str.strip().str.split("@", expand=True)
Common string methods: contains(), replace(), extract(), len(), upper()
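All of these live under the .str accessor; a sketch on the emails Series above (the pattern is illustrative):
emails.str.contains("@x", na=False)           # boolean mask; None maps to False
emails.str.replace(".org", ".net", regex=False)
emails.str.extract(r"@(?P<domain>[\w.]+)")    # capture group becomes a column
emails.str.len()
emails.str.upper()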
pd.set_option("compute.use_bottleneck", True)
- pd.read_csv(..., chunksize=100_000)        # stream large files in chunks
- df.info(memory_usage="deep")               # inspect the true memory footprint
- df["ratio"] = df["sales"] / df["cost"]     # vectorised arithmetic, no Python loop
Numba UDFs:
import numba

@numba.jit(nopython=True)
def fast_z(x):
    return (x - x.mean()) / x.std()

df["z"] = fast_z(df["pop"].to_numpy())
df = df.convert_dtypes(dtype_backend="pyarrow")   # opt in to Arrow-backed dtypes (no global option in pandas 2.x)
The Arrow backend reduces memory use and speeds arithmetic by roughly 30 % on medium-wide tables.
import duckdb, pandas as pd
duckdb.query("SELECT city, SUM(pop) FROM df GROUP BY 1").to_df()
DuckDB executes SQL directly on in‑memory DataFrames without conversion overhead.
df.info(memory_usage="deep")
Downcast numeric columns (pd.to_numeric(..., downcast="integer")) to trim RAM usage; community benchmarks commonly report savings above 50 %.
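A minimal downcasting sketch, assuming numeric columns named pop and ratio:
df["pop"] = pd.to_numeric(df["pop"], downcast="integer")    # int64 → smallest fitting integer
df["ratio"] = pd.to_numeric(df["ratio"], downcast="float")  # float64 → float32 (smallest float dtype)
df.info(memory_usage="deep")                                # verify the saving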
NumPy — vector operations are backed by ndarray.
Matplotlib/Seaborn — plotting (df.plot()).
Polars — from_pandas for lightning-fast queries.
DuckDB — SQL over pandas DataFrames with duckdb.query().
scikit-learn — directly consumes DataFrames (fit/transform).
df.to_sql("table", "sqlite:///db.sqlite", if_exists="replace")
duckdb.connect("db.duckdb").execute("CREATE TABLE tbl AS SELECT * FROM df")   # pandas has no to_duckdb writer; DuckDB picks up df by name
You can move between pandas and Spark, Polars, RAPIDS, or Excel via the respective to_* / from_* / read_* helpers; for example:
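A Polars round-trip might look like this (assuming polars is installed; the filter is illustrative):
import polars as pl

pl_df = pl.from_pandas(df)                    # pandas → Polars
result = pl_df.filter(pl.col("age") > 30)     # query in Polars
back = result.to_pandas()                     # and back to pandas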
Prefer the categorical dtype for low-cardinality strings; a short sketch follows.
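A sketch of the conversion, assuming a string column named city:
df["city"] = df["city"].astype("category")
df["city"].memory_usage(deep=True)    # typically much smaller than the object version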
# create
pd.DataFrame(data) # constructor
pd.read_csv("file.csv") # IO
# inspect
df.head(); df.tail()
df.info(); df.describe()
# select
df["col"]; df.loc[r, c]; df.iloc[i, j]
# filter
df[df["age"] > 30]
# mutate
df.assign(bonus = df["salary"]*0.1)
# group / aggregate
df.groupby("team")["pts"].sum()
# join
df1.merge(df2, on="id", how="inner")
# reshape
pd.melt(...); df.pivot(...)
# time‑series
ts.resample("W").mean()
# output
df.to_parquet("out.parquet")