I want to know what algorithm Spark uses to make task scheduling data-locality aware. Does it need a cluster manager such as YARN? If so, what is the underlying algorithm used to schedule tasks?
It depends. If the data is in the form of key-value pairs, Spark handles data locality through partitioners (usually by hashing the key, though you can define custom partitioners or use a RangePartitioner to optimize locality depending on your data). If the data is not keyed, Spark holds on to data on a per-file basis, which can be problematic if you have a few large files, since you might not be working at optimal parallelism. If the data is either too spread out or too concentrated, you can respectively use repartition(numPartitions) and coalesce(numPartitions) to control how many partitions you want to work with, as shown in the sketch below.
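As a rough sketch of that last point (assuming an existing SparkContext named `sc` and a hypothetical HDFS input path, both of which are illustrative, not from the original):

```scala
// Partitions initially follow the input splits (e.g. HDFS blocks).
val rdd = sc.textFile("hdfs:///data/input")

// Too few partitions (a handful of large files): repartition performs
// a full shuffle to spread records evenly across more partitions.
val spread = rdd.repartition(64)

// Too many small partitions: coalesce merges them without a full
// shuffle, preserving locality where it can.
val merged = spread.coalesce(16)
```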
Here is an example of how you can create a custom partitioner:
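The original code was not included, so the following is a minimal sketch of one. The partitioning rule (routing string keys by their first character) is a made-up example; only the `Partitioner` API itself (`numPartitions`, `getPartition`) comes from Spark.

```scala
import org.apache.spark.Partitioner

// A hypothetical custom partitioner that assigns keys to partitions
// based on the first character of the key's string form.
class FirstLetterPartitioner(override val numPartitions: Int) extends Partitioner {

  override def getPartition(key: Any): Int = {
    val k = key.toString
    if (k.isEmpty) 0
    else math.abs(k.charAt(0).toInt) % numPartitions
  }

  // Spark compares partitioners to decide whether two RDDs are
  // co-partitioned (and can thus avoid a shuffle on joins), so
  // override equals and hashCode consistently.
  override def equals(other: Any): Boolean = other match {
    case p: FirstLetterPartitioner => p.numPartitions == numPartitions
    case _                         => false
  }

  override def hashCode: Int = numPartitions
}
```

You would then apply it to a pair RDD with something like `pairRdd.partitionBy(new FirstLetterPartitioner(8))`; subsequent keyed operations that use the same partitioner can avoid reshuffling.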