When setting task parallelism in Spark, two parameters come up frequently: spark.sql.shuffle.partitions and
spark.default.parallelism. What is the difference between them?
First, let's look at their definitions in the Spark configuration documentation:
* spark.sql.shuffle.partitions (default: 200): Configures the number of partitions to use when
shuffling data for joins or aggregations.
* spark.default.parallelism (default: for distributed shuffle operations like reduceByKey
and join, the largest number of partitions in a parent RDD;
for operations like parallelize with no parent RDDs, it depends on the cluster manager:
  - Local mode: number of cores on the local machine
  - Mesos fine-grained mode: 8
  - Others: total number of cores on all executor nodes or 2, whichever is larger

  Meaning: default number of partitions in RDDs returned by transformations like join,
reduceByKey, and parallelize when not set by the user.
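The defaults above can be observed directly in spark-shell. This is a minimal sketch, assuming a SparkSession named `spark` (as spark-shell provides) started in local mode; the exact numbers depend on your machine and settings.

```scala
// spark-shell sketch: observing the documented defaults.
val sc = spark.sparkContext

// With no user setting, parallelize falls back to spark.default.parallelism;
// in local mode this is the number of local cores.
println(sc.defaultParallelism)
println(sc.parallelize(1 to 100).getNumPartitions)  // same value as above

// The Spark SQL shuffle partition count defaults to 200:
println(spark.conf.get("spark.sql.shuffle.partitions"))
```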
Their definitions look similar, but in actual tests:
* spark.default.parallelism only takes effect when processing RDDs; it has no effect on Spark SQL.
* spark.sql.shuffle.partitions is a Spark SQL-specific setting.
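The split between the two APIs can be sketched as follows. This is an illustrative spark-shell example, assuming the session was started with spark.default.parallelism=10 and spark.sql.shuffle.partitions=20; it is not a definitive test harness.

```scala
// Assumes spark-shell started with:
//   --conf spark.default.parallelism=10 --conf spark.sql.shuffle.partitions=20
val sc = spark.sparkContext
import spark.implicits._

// RDD API: the shuffle in reduceByKey uses spark.default.parallelism
// when no numPartitions argument is given.
val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .reduceByKey(_ + _)
println(rdd.getNumPartitions)  // 10, from spark.default.parallelism

// DataFrame / Spark SQL: the shuffle introduced by groupBy uses
// spark.sql.shuffle.partitions instead. Note that adaptive query
// execution (on by default in Spark 3.x) may coalesce this number down.
val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("k", "v")
  .groupBy("k").sum("v")
println(df.rdd.getNumPartitions)  // 20 with AQE disabled
```

A related practical difference: spark.sql.shuffle.partitions can also be changed at runtime with spark.conf.set, whereas spark.default.parallelism is fixed once the SparkContext has been created.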
We can modify the values of these two settings at job submission time via --conf, as follows:
spark-submit --conf spark.sql.shuffle.partitions=20 --conf