Spark is an essential skill for big data development. A common interview question is "What is Spark?" or "Please introduce Spark." This article addresses that question. Many people's answers are not precise enough; the most accurate description can be found on the official website.

<>1. Overall introduction

Open the official website and you will see a prominent headline:

Unified engine for large-scale data analytics

In other words, Spark is a unified engine for large-scale data analytics. Reading further down the page:

What is Apache Spark™?
Apache Spark™ is a multi-language engine for executing data engineering, data
science, and machine learning on single-node machines or clusters.

This answers our question: Apache Spark™ is an engine that supports multiple languages and runs data engineering, data science, and machine learning workloads on a single machine or on a cluster.

To summarize the main points: Spark is a computing engine for large-scale data processing, and it supports multiple programming languages.

<>2. Features

The above is a general description. The official website also lists more specific features:

Key features
Simple. Fast. Scalable. Unified.

Spark's characteristics are summed up in four words: simple, fast, scalable, and unified. The official website also gives a more specific description of each area:

Batch/streaming data

Unify the processing of your data in batches and real-time streaming, using
your preferred language: Python, SQL, Scala, Java or R.

Batch/stream processing: process batch data and real-time streaming data in a unified way, using your preferred language: Python, SQL, Scala, Java, or R.

SQL analytics

Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc
reporting. Runs faster than most data warehouses.

SQL analytics: execute fast, distributed ANSI SQL queries for dashboards and ad-hoc reports. According to the website, this runs faster than most data warehouses.

Data science at scale

Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having
to resort to downsampling

Data science at scale: perform exploratory data analysis (EDA) on petabyte-scale data without resorting to downsampling.

Machine learning

Train machine learning algorithms on a laptop and use the same code to scale
to fault-tolerant clusters of thousands of machines.

Machine learning: train machine learning algorithms on a laptop, then use the same code to scale out to fault-tolerant clusters of thousands of machines.

<>3. Ecosystem

Apache Spark™ integrates with your favorite frameworks, helping to scale them
to thousands of machines.

Data science and Machine learning

SQL analytics and BI

Storage and Infrastructure

Spark integrates with many frameworks and helps scale them to thousands of machines. These frameworks include:

* Data science and machine learning: scikit-learn, pandas, TensorFlow, PyTorch, MLflow, R
* SQL analytics and BI: Superset, Power BI, Looker, Redash, Tableau, dbt
* Storage and infrastructure: Elasticsearch, MongoDB, Kafka, Delta Lake, Kubernetes, Airflow, Parquet, SQL Server, Cassandra, ORC

<>4. Core modules

Spark Core: provides Spark's most basic, core functionality. Spark's other components, such as Spark SQL, Spark Streaming, GraphX, and MLlib, are all built on top of Spark Core.

Spark SQL: the Spark component for working with structured data. With Spark SQL, users can query data using SQL or HQL, the Apache Hive dialect of SQL.

Spark Streaming: the Spark component for stream computing over real-time data. It provides a rich API for processing data streams.

Spark MLlib: MLlib is the machine learning library provided by Spark. Besides the algorithms themselves, it offers supporting functionality such as model evaluation and data import, as well as some lower-level machine learning primitives.

Spark GraphX: GraphX is Spark's framework and algorithm library for graph computation.

<>5. Summary

To close the article, here is a summary of the answer to "What is Spark?":

* Spark is a fast, general-purpose, scalable, memory-based engine for big data analysis and computation.
* Spark Core provides Spark's most basic, core functionality.
* Spark SQL is the Spark component for working with structured data. With Spark SQL, users can query data using SQL or HQL, the Apache Hive dialect of SQL.
* Spark Streaming is the Spark component for stream computing over real-time data. It provides a rich API for processing data streams.
