Data Intelligence: NumPy & Pandas
Module 10 of 15

10. Real-World ETL & Big Data

1. The RAM Problem

You have 16GB RAM. The CSV is 50GB. Call pd.read_csv("huge.csv") and pandas tries to pull the entire file into memory at once; the process runs out of memory and gets killed.
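
You don't have to guess whether a file fits. One rough approach is to load a small sample, measure its memory footprint, and extrapolate from the on-disk size. This is a sketch, not a guarantee: huge.csv and the 10,000-row sample size are placeholders.

python
import os
import pandas as pd

# Load a small sample to measure the in-memory cost per row.
sample = pd.read_csv("huge.csv", nrows=10_000)
mem_per_row = sample.memory_usage(deep=True).sum() / len(sample)

# Estimate the total row count from the file size and average line length.
with open("huge.csv", "rb") as f:
    header = f.readline()
    sample_bytes = sum(len(f.readline()) for _ in range(10_000))
est_rows = (os.path.getsize("huge.csv") - len(header)) / (sample_bytes / 10_000)

print(f"Estimated in-memory size: {est_rows * mem_per_row / 1e9:.1f} GB")

Keep in mind that a DataFrame is often several times larger in memory than the CSV is on disk (string columns are stored as Python objects), so leave a wide margin.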

2. Chunking

Process the file in pieces. Passing chunksize to pd.read_csv returns an iterator that yields one DataFrame of that many rows at a time, so only one piece is ever in memory.

python
import pandas as pd

chunk_size = 10_000  # rows per chunk

# Stream the file in 10k-row pieces instead of loading it all at once.
for chunk in pd.read_csv("huge.csv", chunksize=chunk_size):
    process(chunk)  # placeholder: transform the chunk, save results to a DB
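
Chunking also handles whole-file aggregates, as long as you can compute them incrementally. A minimal sketch that averages one column across every chunk (the column name amount is a placeholder):

python
import pandas as pd

total, count = 0.0, 0
for chunk in pd.read_csv("huge.csv", chunksize=100_000):
    # Keep only a running sum and row count; each chunk is then discarded.
    total += chunk["amount"].sum()
    count += len(chunk)

print(f"Mean amount: {total / count:.2f}")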

3. Parquet vs CSV

  • CSV: plain text. Every load re-parses every value, row by row. Large on disk and slow to read.
  • Parquet: binary, columnar, and compressed. Column types are stored in the file, and readers can fetch only the columns they need. Always convert your CSVs to Parquet for analysis.
python
# Requires a Parquet engine (pyarrow or fastparquet) to be installed.
df.to_parquet("data.parquet")
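
The payoff of the columnar layout: you can read just the columns you need, and the rest of the file is never parsed. A sketch, where the column names are placeholders (like to_parquet, this needs pyarrow or fastparquet installed):

python
import pandas as pd

# Only these two columns are read from disk; all others are skipped entirely.
df = pd.read_parquet("data.parquet", columns=["user_id", "amount"])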

