49 - Improving on Dask, but failing tiling
Today I spent a lot of time troubleshooting how to speed up my tiling program with Dask. The primary issue I had to overcome was that the tiling function takes in a relatively small parquet file of WKT polygons (0.5 - 50 MB) and then consumes a HUGE amount of memory in the tiling process with `h3.polyfill`. Dask would naively split each file into its own partition, assuming the processing would shrink the data rather than expand it. Then, when a single file blew up to several GB in memory, Dask would inevitably throw a memory error and stop the process.
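Roughly, the tiling step looks something like the sketch below. This is my reconstruction rather than the real code: the `tile_tracts` name, the `tract_id` and `geometry` columns, and the H3 resolution are all assumptions, and I'm using h3-py's v3 `h3.polyfill` with Shapely to parse the WKT.

```python
import h3                       # h3-py v3 API
import pandas as pd
from shapely import wkt
from shapely.geometry import mapping


def tile_tracts(df: pd.DataFrame, resolution: int = 9) -> pd.DataFrame:
    """Explode each WKT polygon into the H3 cells that cover it.

    This is where a 0.5-50 MB parquet of polygons can balloon to several
    GB, because every polygon becomes hundreds or thousands of hex IDs.
    """
    records = []
    for tract_id, wkt_str in zip(df["tract_id"], df["geometry"]):
        geojson = mapping(wkt.loads(wkt_str))  # WKT -> GeoJSON-style dict
        cells = h3.polyfill(geojson, resolution, geo_json_conformant=True)
        records.extend((tract_id, cell) for cell in cells)
    return pd.DataFrame(records, columns=["tract_id", "h3_cell"])
```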
The smartest solution would be the following (a sketch of the full pipeline follows the list):

1. Load all of the data files and combine them into a single Pandas DataFrame.
2. Do all transformations, like creating the `geom_swap_geojson` column.
3. Create a Dask DataFrame from the Pandas DataFrame with `dd = dask.dataframe.from_pandas(df, npartitions=1)`.
4. Split the DataFrame into enough partitions that the tiling process won't run out of memory with `dd.repartition(npartitions=N)`.
5. Perform the tiling method on each partition.
6. Join the files together and save to one big Parquet file.
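Here is a minimal sketch of that pipeline, assuming the hypothetical `tile_tracts` function above and parquet inputs under a made-up `tracts/` directory:

```python
import glob

import pandas as pd
import dask.dataframe as dd

# 1. Load every input file and combine into one Pandas DataFrame.
frames = [pd.read_parquet(path) for path in glob.glob("tracts/*.parquet")]
df = pd.concat(frames, ignore_index=True)

# 2. Any cheap pandas-side transformations (e.g. building the
#    geom_swap_geojson column) would happen here, before Dask is involved.

# 3-4. Hand the combined frame to Dask, then split it into enough
#      partitions that no single partition should exhaust memory.
ddf = dd.from_pandas(df, npartitions=1).repartition(npartitions=2000)

# 5. Run the memory-hungry tiling step one partition at a time.
tiled = ddf.map_partitions(
    tile_tracts,
    meta={"tract_id": "object", "h3_cell": "object"},
)

# 6. Pull the results back together and write a single Parquet file.
#    (compute() materializes everything in memory on the client, so this
#    only works if the final result fits.)
tiled.compute().to_parquet("tiled_tracts.parquet")
```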
In practice, I stumbled on step 4: I had no way to know how many partitions to make except through trial and error. Since I had about 86,000 tracts to tile, I set `npartitions=50000`. Even with that many partitions, one of them consumed too much memory and stopped the whole process.
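In hindsight, a back-of-the-envelope sizing calculation would at least have replaced the guess, though it wouldn't have fixed the real problem of skew. The expansion factor and memory budget below are made-up numbers, not measurements:

```python
# Rough partition sizing instead of guessing npartitions=50000.
# Assumed numbers: each tract expands to ~5,000 H3 cells, each output
# row costs ~100 bytes, and each partition gets a ~2 GB budget.
n_tracts = 86_000
bytes_per_tract = 5_000 * 100
budget_per_partition = 2 * 1024**3
tracts_per_partition = max(1, budget_per_partition // bytes_per_tract)
npartitions = -(-n_tracts // tracts_per_partition)  # ceiling division
```

Even a calculation like this assumes every tract blows up by roughly the same amount; one pathologically large polygon can still bust the budget, which is exactly what happened here.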
Unfortunately, I think this is the death knell for my hackathon project, considering the submission deadline is this Monday (11/21/22). I don’t think I will be following through with this project, but I’m going to make a blog post about it.
I'm a freelance software developer located in Denver, Colorado. If you're interested in working together or would just like to say hi, you can reach me at me@ this domain.