[Spark] Run Spark Job on Amazon EMR
Amazon Elastic MapReduce (EMR) is a managed cluster platform on Amazon Web Services (AWS) for big data processing and analysis. It provides a simplified way to run big data frameworks such as Apache Hadoop and Apache Spark.
This post will focus on running Apache Spark on EMR, and will cover:
- Create a cluster on Amazon EMR
- Submit the Spark Job
- Load/Store data from/to S3
Prerequisites
- A working Spark application (packaged as a JAR file)
- Input files
- An AWS account
- An Amazon S3 bucket to store input/output files, logs, and the Spark application JAR file
Before we create a cluster on EMR, the Spark application JAR and the input files should be uploaded to the S3 bucket.
Create a cluster on Amazon EMR
The cluster is the core component of EMR. A cluster is a collection of nodes; each node is an Amazon Elastic Compute Cloud (Amazon EC2) instance. Each node has a role within the cluster, referred to as the node type: master node, core node, or task node.
There are several ways to create a cluster on EMR. In this section, we are going to demonstrate two of them: the AWS CLI and the SDK for Python (Boto3).
Create a cluster using the AWS CLI
The following `create-cluster` example creates a simple EMR cluster to run Spark.
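A minimal version might look like this (the cluster name, release label, instance type and count, and log location are placeholder values to adjust to your own setup):

```sh
aws emr create-cluster \
    --name "Spark Cluster" \
    --release-label emr-6.3.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --log-uri s3://my-bucket/logs/
```

The main options are: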
- `--name`: The name of the cluster. If not provided, the default is “Development Cluster”.
- `--release-label`: The Amazon EMR release version.
- `--instance-type`: The type of Amazon EC2 instance to use in the cluster.
- `--instance-count`: The number of Amazon EC2 instances to create for the cluster. One instance is used for the master node, and the remainder are used for the core node type.
- `--applications`: The applications to install on the cluster.
For more examples and details about the options, you can read the AWS CLI documentation for `create-cluster`.
Create a cluster using the SDK for Python
In Boto3, `run_job_flow` creates and starts running a new cluster (job flow).
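A minimal sketch follows; the region, instance types, and IAM role names here are assumptions and should be replaced with your own values.

```python
import boto3

# Create an EMR client (the region is a placeholder)
emr = boto3.client('emr', region_name='us-east-1')

response = emr.run_job_flow(
    Name='Spark Cluster',
    ReleaseLabel='emr-6.3.0',
    Applications=[{'Name': 'Spark'}],
    Instances={
        'InstanceGroups': [
            {
                'Name': 'Master node',
                'InstanceRole': 'MASTER',
                'InstanceType': 'm5.xlarge',
                'InstanceCount': 1,
            },
            {
                'Name': 'Core nodes',
                'InstanceRole': 'CORE',
                'InstanceType': 'm5.xlarge',
                'InstanceCount': 2,
            },
        ],
        # Keep the cluster running after steps finish
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    LogUri='s3://my-bucket/logs/',
    JobFlowRole='EMR_EC2_DefaultRole',  # default EMR roles, assumed to exist
    ServiceRole='EMR_DefaultRole',
)

print(response['JobFlowId'])  # the ID of the new cluster
```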
Note that `Name` is a required parameter here, while `--name` is not required when you use the AWS CLI to create a cluster.
Submit the Spark Job
After a cluster is created, nothing runs on it until you submit an actual Spark job. In EMR, these jobs are called steps. Each step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster. So, all you need to do next is add a `--steps` option to the command above.
In a step, you have to define its type and name, tell EMR where your JAR file is located, and pass all the parameters your script needs.
Suppose we have a Spark application whose purpose is to transform files from JSON to Parquet. We have to pass the input location and the output location to the main class `file.transform.Main`.
For this Spark job, the inputs and outputs are:
- Inputs: some JSON files on S3 at `s3://my-bucket/inputs`.
- Outputs: some Parquet files written to another S3 location, `s3://my-bucket/outputs`.
Note that EMR needs read and write permissions to this S3 bucket.
Now we have packaged the application into a JAR file and uploaded it to S3 at `s3://my-bucket/jars/file-transform-script.jar`.
Submit a step using the AWS CLI
You can use a shorthand syntax:
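Appended to the `create-cluster` command above, the `--steps` option might look like this (the step name is arbitrary; `command-runner.jar` and `spark-submit` are explained further below):

```sh
--steps Type=CUSTOM_JAR,Name="File Transform",ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[spark-submit,--class,file.transform.Main,s3://my-bucket/jars/file-transform-script.jar,s3://my-bucket/inputs,s3://my-bucket/outputs]
```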
Or in JSON syntax, which is clearer:
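A `step.json` along these lines defines the same step (the file name and step name are just examples):

```json
[
  {
    "Type": "CUSTOM_JAR",
    "Name": "File Transform",
    "ActionOnFailure": "CONTINUE",
    "Jar": "command-runner.jar",
    "Args": [
      "spark-submit",
      "--class", "file.transform.Main",
      "s3://my-bucket/jars/file-transform-script.jar",
      "s3://my-bucket/inputs",
      "s3://my-bucket/outputs"
    ]
  }
]
```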
With `step.json`, you can create a cluster with one step using the following command:
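For example, reusing the placeholder cluster options from before:

```sh
aws emr create-cluster \
    --name "Spark Cluster" \
    --release-label emr-6.3.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --log-uri s3://my-bucket/logs/ \
    --steps file://./step.json
```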
You can also add one or more steps to an existing cluster:
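A sketch with a placeholder cluster ID:

```sh
aws emr add-steps \
    --cluster-id j-XXXXXXXXXXXXX \
    --steps file://./step.json
```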
The cluster ID can be found on the EMR console or by running `aws emr list-clusters`.
Submit a step using the SDK for Python
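In Boto3, the same step is described as a dictionary and passed to `run_job_flow` through the `Steps` parameter. A sketch (the step name is arbitrary):

```python
step = {
    'Name': 'File Transform',
    'ActionOnFailure': 'CONTINUE',
    'HadoopJarStep': {
        'Jar': 'command-runner.jar',
        'Args': [
            'spark-submit',
            '--class', 'file.transform.Main',
            's3://my-bucket/jars/file-transform-script.jar',
            's3://my-bucket/inputs',
            's3://my-bucket/outputs',
        ],
    },
}

# Include the step when creating the cluster by adding Steps=[step]
# to the run_job_flow call shown earlier.
```

The fields are: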
- `Name`: The name of the step.
- `ActionOnFailure`: The action to take when a step fails. There are three possible values: TERMINATE_CLUSTER, CANCEL_AND_WAIT, and CONTINUE.
- `HadoopJarStep`: The JAR file used for the step.
- `Jar`: A path to a JAR file run during the step.
- `Args`: A list of command line arguments passed to the JAR file’s main function when executed.
With `command-runner.jar` you can execute many scripts or programs, and you do not have to specify its full path. In the case above, `spark-submit` is the command to run.
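In other words, the step above amounts to running something like this on the cluster:

```sh
spark-submit --class file.transform.Main \
    s3://my-bucket/jars/file-transform-script.jar \
    s3://my-bucket/inputs \
    s3://my-bucket/outputs
```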
Use `add_job_flow_steps` to add steps to an existing cluster:
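A sketch, assuming the `emr` client and the `step` dictionary defined above (the cluster ID is a placeholder):

```python
response = emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # the ID of the existing cluster
    Steps=[step],                 # the step dictionary defined above
)

print(response['StepIds'])        # IDs of the newly added steps
```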
The job will consume all of the data in the input directory `s3://my-bucket/inputs` and write the results to the output directory `s3://my-bucket/outputs`.
Above are the steps to run a Spark Job on Amazon EMR.