Problem Statement
How do you process a 10 GB CSV file with pandas on a machine with only 4 GB of RAM?
Explanation
Stream it in chunks. Use read_csv with chunksize to process the file piece by piece, aggregate partial results per chunk, and combine them at the end. Restrict columns with usecols and specify compact dtypes (for example int32 and float32) so each chunk takes far less RAM.
If you need dates, pass parse_dates so conversion happens during the read rather than as a post-processing step. For compute-heavy pipelines, push the aggregation into a database or use Dask for out-of-core scaling. A sketch of the chunked aggregate-then-combine pattern follows.
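A minimal sketch of that pattern, assuming a hypothetical big.csv with id, amt, and ts columns and a per-id sum as the target aggregation:

import pandas as pd

# Hypothetical file and columns: sum amt per id over a large CSV in bounded memory.
reader = pd.read_csv('big.csv', chunksize=200_000,
                     usecols=['id', 'amt', 'ts'],
                     dtype={'id': 'int32', 'amt': 'float32'},
                     parse_dates=['ts'])

partials = []
for chunk in reader:
    # Aggregate each chunk independently; only the small partial results are kept.
    partials.append(chunk.groupby('id')['amt'].sum())

# Combine the per-chunk partials into the final per-id totals.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals.head())

Only the partial aggregates live in memory between chunks, so peak usage is governed by the chunk size rather than the file size.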
Code Solution
import pandas as pd

# Stream in 200k-row chunks with only the needed columns and compact dtypes.
it = pd.read_csv('big.csv', chunksize=200_000, usecols=['id', 'amt'],
                 dtype={'id': 'int32', 'amt': 'float32'})
for chunk in it:
    process(chunk)  # aggregate per-chunk results; combine after the loop
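If the per-chunk work grows beyond simple loops, the same aggregation can be expressed with Dask, which partitions the CSV and runs the computation out of core. A hedged sketch, assuming the dask dataframe extra is installed and the same hypothetical columns:

import dask.dataframe as dd

# Dask reads the CSV in partitions and schedules the groupby out of core.
ddf = dd.read_csv('big.csv', usecols=['id', 'amt'],
                  dtype={'id': 'int32', 'amt': 'float32'})
totals = ddf.groupby('id')['amt'].sum().compute()  # .compute() triggers the work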