Data Intelligence: NumPy & Pandas
10. Real-World ETL & Big Data
1. The RAM Problem
You have 16 GB of RAM. The CSV is 50 GB.
If you call pd.read_csv("huge.csv"), pandas tries to pull every row into memory at once, and the read dies with a MemoryError (or the operating system kills the process).
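Before reaching for chunking, it can help to measure how bad the problem actually is. Here is a minimal sketch (the 10,000-row sample size is arbitrary): load a small slice of the file and check its in-memory footprint.

```python
import pandas as pd

# Read only the first 10,000 rows -- a small, safe sample
sample = pd.read_csv("huge.csv", nrows=10_000)

# deep=True counts the real size of object (string) columns too
sample_bytes = sample.memory_usage(deep=True).sum()
print(f"10,000 rows use about {sample_bytes / 1e6:.1f} MB in memory")

# Scale this by the file's real row count to judge whether a plain
# pd.read_csv could ever fit in 16 GB of RAM.
```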
2. Chunking
Process the file in pieces: read a chunk, work on it, discard it, move on. Only one chunk is ever in memory at a time.
```python
import pandas as pd

chunk_size = 10_000

for chunk in pd.read_csv("huge.csv", chunksize=chunk_size):
    # Only 10k rows are in memory at a time
    process(chunk)  # e.g. aggregate, or save results to a database
```
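What process(chunk) does depends on the job. A common pattern is to compute a partial result per chunk and fold it into a running total. Here is a sketch that counts rows per category, assuming huge.csv has a "category" column (a placeholder name, not something defined in this course's data):

```python
import pandas as pd

counts = pd.Series(dtype="int64")

for chunk in pd.read_csv("huge.csv", chunksize=10_000):
    # Partial counts for this chunk only
    chunk_counts = chunk["category"].value_counts()
    # Fold into the running total; fill_value=0 handles unseen categories
    counts = counts.add(chunk_counts, fill_value=0)

print(counts.sort_values(ascending=False).head())
```

Only one chunk plus the small running total ever lives in memory, which is why this scales to files far larger than RAM.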
3. Parquet vs CSV
- CSV: plain text, row-oriented. Slow to parse and heavy on disk.
- Parquet: binary, columnar, compressed. Far faster to read, and you can load individual columns. If you analyze the same data more than once, convert the CSV to Parquet.
```python
# Requires the pyarrow or fastparquet package for Parquet I/O
df.to_parquet("data.parquet")
```
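The columnar layout is what pays off at read time: you can pull just the columns you need instead of parsing every field of every row. A small sketch (the column names are placeholders for whatever your dataset actually contains):

```python
import pandas as pd

# Reads only these two columns from disk, not the whole table
df = pd.read_parquet("data.parquet", columns=["user_id", "amount"])
print(df.head())
```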