Encountering a ModuleNotFoundError while running a PySpark job can be frustrating, especially when the module exists and works perfectly outside of Spark. I recently ran into this exact issue while working with a Spark job that imported a function from a local utils module. Everything seemed fine until I tried to use a PySpark UDF, and suddenly the module was nowhere to be found.
After some debugging, I found a straightforward way to fix it that worked perfectly for me.
Understanding the ModuleNotFoundError Problem
PySpark runs code on distributed nodes, and those nodes might not have access to the same Python environment as the driver (the machine running the job). When a Python UDF is executed on worker nodes, it gets serialized and shipped over the network. If the UDF relies on a module that isn't available on the workers, PySpark throws a ModuleNotFoundError.
That’s why an import statement might work fine at the top of a script but fail when running inside a UDF. PySpark workers simply don’t know where to find the custom module unless explicitly told.
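To make the failure concrete, here is a minimal sketch of how such a UDF is typically wired up. The utils.formatting module, the sample DataFrame, and the column name "text" are assumptions for illustration, not part of the original job:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    from utils.formatting import format_text  # resolves fine on the driver

    spark = SparkSession.builder.appName("udf-import-demo").getOrCreate()
    df = spark.createDataFrame([("hello",), ("world",)], ["text"])

    # The lambda is pickled and shipped to the workers; when they unpickle it,
    # they try to import utils.formatting themselves and, if the module isn't
    # on their path, the job fails with ModuleNotFoundError.
    format_udf = udf(lambda s: format_text(s), StringType())
    df.withColumn("formatted", format_udf(col("text"))).show()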
Fixing the Import Error in PySpark UDFs
Import Inside the UDF
The solution that worked for me was to move the import statement inside the function being used as a UDF. This ensures that the necessary module is available when the function is executed on worker nodes.
Before (Fails with ModuleNotFoundError):
    from utils.formatting import format_text  # This works on the driver, but fails inside the UDF

    def format_text_local(text):
        return format_text(text)  # Might break when run inside Spark
After (Works with Spark UDFs):
    def format_text(text):
        import sys
        import os
        # Add the directory containing this script to the import path so the
        # local utils package can be found wherever the function executes
        sys.path.append(os.path.abspath(os.path.dirname(__file__)))
        from utils.formatting import format_text as format_text2
        return format_text2(text)
By placing the import inside the function, it gets executed on the worker nodes where the function actually runs.
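Wiring the fixed function into Spark then looks roughly like this, continuing from the same hypothetical SparkSession and sample data as in the earlier sketch:

    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    # Wrap the self-contained function (the import happens inside it) as a UDF
    format_text_udf = udf(format_text, StringType())

    df = spark.createDataFrame([("hello",), ("world",)], ["text"])
    df.withColumn("formatted", format_text_udf(col("text"))).show()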
Conclusion
PySpark’s distributed nature makes dependency management tricky, but in my case, simply importing the module inside the UDF was enough to resolve the issue. If you’re running into a ModuleNotFoundError when using a PySpark UDF, this approach is worth trying first before considering more complex solutions.