Fix Skew In Apache Spark • Joshita Mishra's Blog

Possible ways to fix data skew in Apache Spark

If you have worked with big data in Apache Spark, you have probably run into data skew.

How does data skew occur

One of the most common causes of skew is an uneven distribution of data across partitions. The image below shows how skew can be identified.

Data skew in Spark shows up as a stage that does not complete for a long time while only one or a very few tasks are still pending.

Fix data skew

Click on the running stage and check the distribution metrics of its tasks. The usual fix is to repartition the data, using columns in your dataset that spread the rows more evenly.
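The same check can also be made in code. The sketch below is only an illustration, assuming a DataFrame with a `year` column: the object name `SkewCheck`, its two helpers, and the synthetic data are hypothetical and not part of any Spark API. It counts rows per key and rows per partition (using Spark's `spark_partition_id()` function), so that one dominant key or partition stands out.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, spark_partition_id}

// Hypothetical helper object, introduced only for this example.
object SkewCheck {
  // Rows per key value, largest first: shows which keys carry most of the data.
  def rowsPerKey(df: DataFrame, key: String): DataFrame =
    df.groupBy(col(key)).count().orderBy(col("count").desc)

  // Rows per partition, largest first: one dominant partition suggests skew.
  def rowsPerPartition(df: DataFrame): DataFrame =
    df.groupBy(spark_partition_id().as("partition_id"))
      .count()
      .orderBy(col("count").desc)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skew-check")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Synthetic, deliberately uneven data: 2020 has far more rows than other years.
    val df = (Seq.fill(9000)(2020) ++ Seq.fill(500)(2019) ++ Seq.fill(500)(2018))
      .toDF("year")

    rowsPerKey(df, "year").show()
    // Partitioning by year alone keeps all 2020 rows in a single partition.
    rowsPerPartition(df.repartition(8, $"year")).show()

    spark.stop()
  }
}
```

If one partition holds most of the rows, the task that processes it is likely to be the one still pending at the end of the stage.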

Example: suppose your application holds 10 years of data. If you load or partition the dataframes by year only, a year with far more data than the others can become a point of contention for your Spark application. A better approach could be to load the data by year and month, or to repartition it by year and month.

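// The $"..." column syntax needs: import spark.implicits._
// With an explicit partition count (the count is the first argument):
dataframe.repartition(200, $"year", $"month")

// Or with the default number of shuffle partitions:
dataframe.repartition($"year", $"month")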

This way the data is spread more evenly across partitions, which reduces the skew.
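To see the effect, the two partitioning choices can be compared on a small synthetic dataset. This is a minimal sketch under stated assumptions: the object `RepartitionByMonth`, the helper `largestPartition`, and the generated data (10 years, 12 months, 100 rows per month) are hypothetical and exist only for illustration. It repartitions once by `year` and once by `year` and `month`, then reports the size of the largest partition in each case.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, spark_partition_id}

// Hypothetical example object, introduced only for this sketch.
object RepartitionByMonth {
  // Row count of the largest partition: a rough proxy for the slowest task.
  def largestPartition(df: DataFrame): Long =
    df.groupBy(spark_partition_id().as("partition_id"))
      .count()
      .agg(max(col("count")))
      .head()
      .getLong(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition-by-month")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Synthetic data: 10 years x 12 months x 100 rows per month.
    val rows = for {
      year  <- 2011 to 2020
      month <- 1 to 12
      _     <- 1 to 100
    } yield (year, month)
    val df = rows.toDF("year", "month")

    // Only 10 distinct keys: at most 10 of the 200 partitions receive data.
    val byYear = df.repartition(200, $"year")
    // 120 distinct keys: rows can spread over many more partitions.
    val byYearMonth = df.repartition(200, $"year", $"month")

    println(s"largest partition by year:       ${largestPartition(byYear)}")
    println(s"largest partition by year+month: ${largestPartition(byYearMonth)}")

    spark.stop()
  }
}
```

Because every row of a given year lands in the same partition, the first number can never drop below one full year of data. With the month added to the key, the largest partition is expected to be much smaller, although the exact sizes depend on how the keys hash.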