When setting task parallelism in Spark, two parameters come up again and again: spark.sql.shuffle.partitions and spark.default.parallelism. What is the difference between them?

First, let's look at their definitions:

| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations. |
| spark.default.parallelism | For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager: local mode: number of cores on the local machine; Mesos fine-grained mode: 8; others: total number of cores on all executor nodes or 2, whichever is larger. | Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by the user. |
Their definitions sound similar, but actual testing shows a clear split:

* spark.default.parallelism only takes effect when processing RDDs; it has no effect on Spark SQL.
* spark.sql.shuffle.partitions is a setting dedicated to Spark SQL.
We can modify both values when submitting a job with --conf, like this:

```shell
spark-submit --conf spark.sql.shuffle.partitions=20 --conf spark.default.parallelism=20
```

©2020 ioDraw All rights reserved