0 · Overview & Philosophy

Pandas (Python ≥ 3.9) is the de‑facto tabular‑data toolkit for analytical pipelines. It offers two primary labelled containers, Series and DataFrame, built around labelled, automatically aligned axes and fast vectorised columnar operations.

Note — Current stable: pandas 2.2 (2025‑03). Install/upgrade: pip install --upgrade pandas

1 · Installation & Environment

# classic
python -m pip install pandas numpy pyarrow          # parquet/feather

# conda (recommended for scientific stack)
conda install -c conda-forge pandas

If you require polars‑style back‑end speed, opt in to Copy‑on‑Write and request Arrow‑backed dtypes (fast strings and nested types). In pandas 2.x the Arrow backend is selected per call, via the dtype_backend= keyword on readers and convert_dtypes, rather than through a global option:

pd.set_option("mode.copy_on_write", True)
df = pd.read_csv("sales.csv", dtype_backend="pyarrow")

2 · Core Data Structures (Series & DataFrame)

2.1 Series

One‑dimensional labelled array of homogeneous dtype.

import pandas as pd
s = pd.Series([4, 7, -3], index=["a","b","c"], name="score")

2.2 DataFrame

Two‑dimensional tabular data structure with labelled axes.

df = pd.DataFrame({
    "name": ["Ada","Ben","Cara"],
    "age": [32, 34, 29],
    "join": pd.to_datetime(["2023‑01‑10","2022‑11‑23","2024‑07‑05"])
}, index=[1001,1002,1003])

2.3  Index & MultiIndex

df = df.set_index("city")        # prefer reassignment over inplace=True

MultiIndex lets you represent higher‑dimensional data within 2‑D tabular form.
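
For example, a small sketch building a two‑level index (the city/year labels and values are illustrative):

idx = pd.MultiIndex.from_product(
    [["Joburg", "Cape Town"], [2023, 2024]],
    names=["city", "year"])
panel = pd.DataFrame({"pop": [5.6, 5.7, 4.6, 4.7]}, index=idx)
panel.loc["Joburg"]              # all years for one city
panel.xs(2024, level="year")     # cross-section on the inner level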

3 · Data Input / Output (IO)

Format     | Reader (read_*) | Writer (to_*)
CSV / TSV  | read_csv        | to_csv
Parquet    | read_parquet    | to_parquet
SQL        | read_sql        | to_sql
Excel      | read_excel      | to_excel
JSON       | read_json       | to_json

Example – CSV ↔ Parquet

df = pd.read_csv("sales_2024.csv", parse_dates=["timestamp"])
df.to_parquet("sales_2024.parquet", index=False)

3.1  Reading

df = pd.read_csv("sales.csv", parse_dates=["date"], dtype_backend="pyarrow")

read_csv supports hundreds of parameters—compression, chunked iteration, Arrow dtypes, remote URLs, and more.
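
For files that do not fit in memory, chunked iteration is a common pattern; a sketch (the region and amount column names are assumptions):

totals = []
for chunk in pd.read_csv("sales.csv", chunksize=100_000, parse_dates=["date"]):
    totals.append(chunk.groupby("region")["amount"].sum())   # aggregate each chunk
pd.concat(totals).groupby(level=0).sum()                      # combine partial results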

Tip — Parquet is columnar and compressed; files are typically several times smaller than CSV and load faster.

3.2  Writing

df.to_parquet("sales.parquet", engine="pyarrow", compression="zstd")

Parquet preserves schema and nullability; with the Arrow engine, round‑trips are typically an order of magnitude faster than CSV.

4 · Indexing & Selection

4.1 Label vs. Positional

.loc selects by label (index and column names; slices are end‑inclusive), while .iloc selects by integer position (end‑exclusive, NumPy‑style).
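
A quick sketch using the DataFrame from 2.2:

df.loc[1002, "age"]                   # by label
df.iloc[1, 1]                         # by position (second row, second column)
df.loc[1001:1002, ["name", "age"]]    # label slices include the end point
df.iloc[0:2, 0:2]                     # positional slices exclude it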

4.2 Setting Values

4.2.1 Single cell

df.at[1002, "age"] = 35

4.2.2 Conditional assignment

df.loc[df["age"] < 30, "segment"] = "young"

4.3  Boolean masks

				df[df["pop"] > 4]
			

4.4  Query strings

				df.query("pop > 4 & city == 'Joburg'")
			

Avoid SettingWithCopyWarning by assigning through a single .loc call instead of chained indexing (or enable Copy‑on‑Write in pandas ≥ 2.0).
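
For instance, the assignment from 4.2.2 written safely versus the chained form that triggers the warning:

# chained indexing may write to a temporary copy
# df[df["age"] < 30]["segment"] = "young"

# a single .loc call always targets df itself
df.loc[df["age"] < 30, "segment"] = "young"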

5 · Data Cleaning & Preparation

5.1  Missing values

df.isna().sum()
df.fillna({"pop": df["pop"].median()})

The isna / fillna duo forms pandas’ core missing‑data toolkit.

5.2  Type conversion (+ nullable dtypes)

df["rating"] = df["rating"].astype("Int64")   # preserves NA
df = df.convert_dtypes(dtype_backend="pyarrow")

Arrow‑backed extension arrays deliver first‑class nullability and memory savings.

6 · Descriptive Statistics

df.describe(numeric_only=True)

Core quick‑stats: mean(), median(), std(), min() / max(), quantile(), value_counts(), nunique(), corr().
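
A few of these applied to the example table from 2.2:

df["age"].mean()
df["age"].quantile([0.25, 0.5, 0.75])
df["name"].nunique()
df.select_dtypes("number").corr()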

7 · GroupBy (“split‑apply‑combine”)

GroupBy orchestrates splitting, applying, and combining in a single fluent chain.

# average age by segment
df.groupby("segment", observed=True)["age"].mean()

# multi‑agg with named columns
g = df.groupby("segment", observed=True).agg(
	avg_age = ("age","mean"),
	n = ("age","size")
)

With observed=True, unused categories of a categorical grouper are omitted from the result.

8 · Merging, Joining, Concatenation

8.1 Concatenate (stack)

pd.concat([q1_df, q2_df], axis=0, ignore_index=True)

8.2 SQL‑style Merge

orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

merged = orders.merge(
	customers,
	on="customer_id",
	how="left",
	indicator=True)       # _merge column shows join result

Use concat for stacking objects; DataFrame.append was deprecated in pandas 1.4 and removed in 2.0.

9 · Reshaping (Pivot, Melt, Stack)

# wide → long
tidy = pd.melt(df,
	id_vars=["name"],
	value_vars=["jan","feb","mar"],
	var_name="month",
	value_name="sales")
# long → wide
wide = tidy.pivot(index="name", columns="month", values="sales")

10 · Time‑Series Toolkit

# DatetimeIndex
ts = df.set_index("join").sort_index()

# resample to monthly start
monthly = ts.resample("MS")["age"].mean()

# window rolling mean
ts["age"].rolling(window="30D", min_periods=5).mean()

Time zone localisation: ts.tz_localize("Africa/Johannesburg")
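
tz_localize attaches a zone to a naive DatetimeIndex; tz_convert then translates between zones. A minimal sketch:

ts = ts.tz_localize("Africa/Johannesburg")   # naive -> aware (index must be tz-naive)
ts.tz_convert("UTC")                         # aware -> another zone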

11 · Handling Missing Data

# forward-fill missing values (fillna(method="ffill") is deprecated; use .ffill())
df["close"] = df["close"].ffill()

12 · Vectorised String Operations

emails = pd.Series(["a@x.co","b@y.org ",None])
emails.str.strip().str.split("@", expand=True)

Common string methods: str.contains(), str.replace(), str.extract(), str.len(), str.upper().
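
For example, a quick sketch on the emails Series above (the regex is illustrative):

emails.str.contains("@", na=False)          # boolean mask; missing values become False
emails.str.extract(r"@(?P<domain>[\w.]+)")  # named group -> a "domain" column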

13 · Performance Tips

13.1  Vectorise & Avoid Loops

df["ratio"] = df["sales"] / df["cost"]   # no Python loop

13.2  numba UDFs

import numba

@numba.jit(nopython=True)
def fast_z(x): return (x - x.mean()) / x.std()
df["z"] = fast_z(df["pop"].to_numpy())

13.3  Arrow engine

df = df.convert_dtypes(dtype_backend="pyarrow")   # or pass dtype_backend="pyarrow" to read_csv / read_parquet

The Arrow backend can cut memory use and speed up string‑heavy operations; the gain depends on the dtype mix and workload.

13.4  DuckDB queries

import duckdb, pandas as pd
duckdb.query("SELECT city, SUM(pop) FROM df GROUP BY 1").to_df()

DuckDB executes SQL directly on in‑memory DataFrames without first loading them into a database.

13.5  Memory profiling

df.info(memory_usage="deep")

Downcast numeric columns (pd.to_numeric(..., downcast="integer")) to trim RAM usage; savings can be substantial when 64‑bit columns fit in smaller widths.
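
A sketch of downcasting every integer column and re‑checking the footprint:

for col in df.select_dtypes("integer").columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")   # smallest int width that fits
df.info(memory_usage="deep")                                # compare before/after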

14 · Ecosystem/Export Interoperability

NumPy — vector operations are backed by ndarray.
Matplotlib/Seaborn — plotting (df.plot()).
Polars — pl.from_pandas() for fast columnar queries.
DuckDB — SQL over Pandas with duckdb.query().
scikit‑learn — directly consumes DataFrames (fit/transform).

df.to_sql("table", "sqlite:///db.sqlite", if_exists="replace")
import duckdb
duckdb.connect("db.duckdb").execute("CREATE TABLE sales AS SELECT * FROM df")   # pandas has no to_duckdb; write via DuckDB itself

You can move between pandas and Spark, Polars, RAPIDS, or Excel via the libraries' conversion helpers and pandas' read_* / to_* functions.
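
For example, the pandas/Polars round trip (assumes the polars package is installed):

import polars as pl
pl_df = pl.from_pandas(df)    # pandas -> Polars
pl_df.to_pandas()             # and back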

15 · Best Practices & Common Pitfalls

Assign through a single .loc call instead of chained indexing (avoids SettingWithCopyWarning).
Prefer reassignment over inplace=True.
Pass observed=True when grouping on categorical columns.
Use pd.concat (DataFrame.append was removed) and .ffill()/.bfill() (fillna(method=...) is deprecated).
Keep dtypes explicit; nullable or Arrow‑backed dtypes and downcast numerics save memory.

16 · Quick‑Reference Cheat Sheet

# create
pd.DataFrame(data)                # constructor
pd.read_csv("file.csv")           # IO

# inspect
df.head(); df.tail()
df.info(); df.describe()

# select
df["col"]; df.loc[r, c]; df.iloc[i, j]

# filter
df[df["age"] > 30]

# mutate
df.assign(bonus = df["salary"]*0.1)

# group / aggregate
df.groupby("team")["pts"].sum()

# join
df1.merge(df2, on="id", how="inner")

# reshape
pd.melt(...); df.pivot(...)

# time‑series
ts.resample("W").mean()

# output
df.to_parquet("out.parquet")


17 · Further Resources