49 - Improving on Dask, but failing tiling
Today I spent a lot of time troubleshooting how to speed up my tiling program with Dask. The primary issue I had to overcome was that the tiling function takes in a relatively small parquet file of WKT polygons (0.5 - 50 MB) and then consumes a HUGE amount of memory in the tiling process with `h3.polyfill`. Dask would naively split each file into its own partition, assuming the processing would shrink the data rather than expand it. Then, when a single file blew up to several GB in memory, Dask would inevitably throw a memory error and stop the process.
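Roughly, the tiling step looks something like the sketch below. This is my reconstruction rather than the real code: the `tile_tracts` name, the `tract_id` and `geometry` columns, and the H3 resolution are all assumptions, and I'm using h3-py's v3 `h3.polyfill` with Shapely to parse the WKT.

```python
import h3                       # h3-py v3 API
import pandas as pd
from shapely import wkt
from shapely.geometry import mapping


def tile_tracts(df: pd.DataFrame, resolution: int = 9) -> pd.DataFrame:
    """Explode each WKT polygon into the H3 cells that cover it.

    This is where a 0.5-50 MB parquet of polygons can balloon to several
    GB, because every polygon becomes hundreds or thousands of hex IDs.
    """
    records = []
    for tract_id, wkt_str in zip(df["tract_id"], df["geometry"]):
        geojson = mapping(wkt.loads(wkt_str))  # WKT -> GeoJSON-style dict
        cells = h3.polyfill(geojson, resolution, geo_json_conformant=True)
        records.extend((tract_id, cell) for cell in cells)
    return pd.DataFrame(records, columns=["tract_id", "h3_cell"])
```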
The smartest solution would be the following (a sketch of the full pipeline follows the list):

1. Load all of the data files and combine them into a single Pandas DataFrame.
2. Do all transformations, like creating the `geom_swap_geojson` column.
3. Create a Dask DataFrame from the Pandas DataFrame with `dd = dask.dataframe.from_pandas(df, npartitions=1)`.
4. Split the DataFrame into enough partitions that the tiling process won't run out of memory with `dd.repartition(npartitions=N)`.
5. Perform the tiling method on each partition.
6. Join the files together and save to one big Parquet file.
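Here is a minimal sketch of that pipeline, assuming the hypothetical `tile_tracts` function above and parquet inputs under a made-up `tracts/` directory:

```python
import glob

import pandas as pd
import dask.dataframe as dd

# 1. Load every input file and combine into one Pandas DataFrame.
frames = [pd.read_parquet(path) for path in glob.glob("tracts/*.parquet")]
df = pd.concat(frames, ignore_index=True)

# 2. Any cheap pandas-side transformations (e.g. building the
#    geom_swap_geojson column) would happen here, before Dask is involved.

# 3-4. Hand the combined frame to Dask, then split it into enough
#      partitions that no single partition should exhaust memory.
ddf = dd.from_pandas(df, npartitions=1).repartition(npartitions=2000)

# 5. Run the memory-hungry tiling step one partition at a time.
tiled = ddf.map_partitions(
    tile_tracts,
    meta={"tract_id": "object", "h3_cell": "object"},
)

# 6. Pull the results back together and write a single Parquet file.
#    (compute() materializes everything in memory on the client, so this
#    only works if the final result fits.)
tiled.compute().to_parquet("tiled_tracts.parquet")
```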
In practice, I stumbled on step 4: I had no way to know how many partitions to make except through trial and error. Since I had about 86,000 tracts to tile, I set `npartitions=50000`. Even with that many partitions, one of them consumed too much memory and stopped the whole process.
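In hindsight, a back-of-the-envelope sizing calculation would at least have replaced the guess, though it wouldn't have fixed the real problem of skew. The expansion factor and memory budget below are made-up numbers, not measurements:

```python
# Rough partition sizing instead of guessing npartitions=50000.
# Assumed numbers: each tract expands to ~5,000 H3 cells, each output
# row costs ~100 bytes, and each partition gets a ~2 GB budget.
n_tracts = 86_000
bytes_per_tract = 5_000 * 100
budget_per_partition = 2 * 1024**3
tracts_per_partition = max(1, budget_per_partition // bytes_per_tract)
npartitions = -(-n_tracts // tracts_per_partition)  # ceiling division
```

Even a calculation like this assumes every tract blows up by roughly the same amount; one pathologically large polygon can still bust the budget, which is exactly what happened here.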
Unfortunately, I think this is the death knell for my hackathon project, considering the submission deadline is this Monday (11/21/22). I don’t think I will be following through with this project, but I’m going to make a blog post about it.
I'm a freelance software developer located in Denver, Colorado. If you're interested in working together or would just like to say hi, you can reach me at me@ this domain.