[AWS] Create a Glue Catalog Table using AWS CDK
AWS CDK is a framework to manage cloud resources based on AWS CloudFormation. In this post, I will focus on how to create a Glue Catalog Table using AWS CDK.
What is AWS CDK?
The AWS Cloud Development Kit (AWS CDK) is an open source software development framework to model and provision your cloud application resources based on AWS CloudFormation. You can define the infrastructure of cloud applications using familiar programming languages, including Java, JavaScript, TypeScript, Python, and C#.
For more information, you can read the Developer Guide. You can also find a set of example projects on GitHub.
Start a CDK Project
Follow the Getting Started Guide from AWS.
In this post, we are going to work with CDK in Python. Python AWS CDK applications require Python 3.6 or later.
After installing CDK, you can create a new AWS CDK project by invoking cdk init in an empty directory. Note that cdk init cannot be run in a non-empty directory.

    cdk init app --language python
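cdk init prints a confirmation message, followed by a list of useful commands.
The cdk init command also creates a virtualenv for you. To activate it manually, use the following command (the directory is named .env in older CDK releases and .venv in newer ones; the generated README.md shows which one applies):

    source .env/bin/activate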
The next step is to install the AWS Construct Library modules that the app uses. AWS Construct Library modules are named like aws-cdk.SERVICE-NAME. In our case, which is to create a Glue catalog table, we need the modules for Amazon S3 and AWS Glue.

    pip install aws-cdk.aws-s3 aws-cdk.aws-glue
After initializing the project, there are two files for developers to pay attention to:
app.py: The entry point of the CDK app.
cdk_glue_table/cdk_glue_table_stack.py: The stack definition. A stack is the unit of deployment in the AWS CDK; all AWS resources are defined within the scope of a stack.
Create a Glue Catalog Table using CDK
In this case, the data is stored in an existing S3 bucket, so the task is divided into the following steps:
Import an existing S3 bucket for data storage
If there is an existing resource that you want to use in your CDK app, you can import it through the resource’s ARN or other identifying attributes by calling a static factory method on the resource’s class. Go to this document for more examples.
The following example shows how to import an existing S3 bucket.

    from aws_cdk import aws_s3 as s3

    # Reference an existing bucket by name; no new bucket is created.
    bucket = s3.Bucket.from_bucket_name(self, 'my_bucket_id', 'my_bucket')

If you want to create a new bucket via CDK, you can visit the documentation for more information.
Create a Glue database
    from aws_cdk import aws_glue as glue

    database = glue.Database(
        self,
        id='my_database_id',
        database_name='poc'
    )

You can also import an existing database by calling the method [`from_database_arn`](https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aws_glue/Database.html#aws_cdk.aws_glue.Database.from_database_arn). This [page](https://docs.aws.amazon.com/glue/latest/dg/glue-specifying-resource-arns.html) tells you how to specify Glue resource ARNs.
Create a new Glue table
The following example illustrates how to create a Glue table with the S3 bucket and database we just imported/created.
    table = glue.Table(
        self,
        id='my_table_id',
        database=database,
        table_name='my_table',
        # Regular data columns.
        columns=[
            glue.Column(
                name='col1',
                type=glue.Type(
                    input_string='string',
                    is_primitive=True
                )
            ),
            glue.Column(
                name='col2',
                type=glue.Type(
                    input_string='int',
                    is_primitive=True
                )
            )
        ],
        # Partition columns.
        partition_keys=[
            glue.Column(
                name='dt',
                type=glue.Type(
                    input_string='string',
                    is_primitive=True
                )
            )
        ],
        # The data lives under the 'test_data' prefix of the imported bucket.
        bucket=bucket,
        s3_prefix='test_data',
        # Use fully qualified class names here.
        data_format=glue.DataFormat(
            input_format=glue.InputFormat('org.apache.hadoop.mapred.TextInputFormat'),
            output_format=glue.OutputFormat('org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'),
            serialization_library=glue.SerializationLibrary('org.openx.data.jsonserde.JsonSerDe')
        )
    )

Don’t forget to give the fully qualified class names for glue.InputFormat, glue.OutputFormat and glue.SerializationLibrary; otherwise it will fail when you try to create the table, with this error message: HIVE_UNKNOWN_ERROR: Unable to create input format TextInputFormat.
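As a quick reference, the fully qualified class names for two common formats can be kept in a small lookup table. The sketch below is illustrative only: `FORMAT_CLASSES`, `is_fully_qualified` and `classes_for` are hypothetical names, not part of the CDK. The class names themselves are the standard Hive ones for JSON and Parquet.

```python
# Hypothetical helper (not part of the AWS CDK): fully qualified Hive class
# names for two common data formats, plus a guard against short names.
FORMAT_CLASSES = {
    "json": {
        "input_format": "org.apache.hadoop.mapred.TextInputFormat",
        "output_format": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "serialization_library": "org.openx.data.jsonserde.JsonSerDe",
    },
    "parquet": {
        "input_format": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "output_format": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        "serialization_library": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
    },
}


def is_fully_qualified(class_name: str) -> bool:
    """Return True if the name carries a package prefix, e.g. 'org.x.Y'."""
    parts = class_name.split(".")
    return len(parts) > 1 and all(part.isidentifier() for part in parts)


def classes_for(data_format: str) -> dict:
    """Look up the class names for a format and reject short names."""
    classes = FORMAT_CLASSES[data_format]
    for role, class_name in classes.items():
        if not is_fully_qualified(class_name):
            raise ValueError(f"{role} must be fully qualified, got {class_name!r}")
    return classes


print(classes_for("json")["input_format"])
```

The returned strings can then be passed to glue.InputFormat, glue.OutputFormat and glue.SerializationLibrary as in the table definition above.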
Deploy the CDK App
Before deployment, you can run this command first. It emits the synthesized CloudFormation template.

    cdk synth
Then, deploy the app by running:

    cdk deploy
In this step, you need to be authorized to perform "cloudformation:CreateChangeSet" and "cloudformation:ExecuteChangeSet" on the stack.
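As an illustration, those two permissions could be granted with an IAM policy like the one built below. This is a minimal sketch: the region, account ID and stack name are placeholders, `change_set_policy` is a hypothetical helper, and a real deployment role will likely need further permissions that this post does not cover.

```python
import json

# Placeholder values (assumptions for illustration only).
REGION = "us-east-1"
ACCOUNT_ID = "123456789012"
STACK_NAME = "cdk-glue-table"


def change_set_policy(region: str, account_id: str, stack_name: str) -> dict:
    """Build an IAM policy document allowing change sets on a single stack."""
    stack_arn = f"arn:aws:cloudformation:{region}:{account_id}:stack/{stack_name}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "cloudformation:CreateChangeSet",
                    "cloudformation:ExecuteChangeSet",
                ],
                "Resource": stack_arn,
            }
        ],
    }


policy_json = json.dumps(change_set_policy(REGION, ACCOUNT_ID, STACK_NAME), indent=2)
print(policy_json)
```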
If there’s no error message, you can now check in the AWS Glue console whether the table has been created. Or you can check it through a CLI command:

    aws glue get-table --database-name poc --name my_table
If the table was created successfully, the command shows you the metadata of the table.
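The same check can be scripted. The sketch below assumes a boto3 Glue client (boto3.client('glue')) and its get_table call; `summarize_table` is a hypothetical helper name. It accepts any object with a compatible get_table method, so the example uses a stand-in client with a canned response and can be tried without AWS credentials.

```python
def summarize_table(glue_client, database_name: str, table_name: str) -> dict:
    """Fetch a table definition and return the fields worth checking."""
    response = glue_client.get_table(DatabaseName=database_name, Name=table_name)
    table = response["Table"]
    storage = table.get("StorageDescriptor", {})
    serde = storage.get("SerdeInfo", {})
    return {
        "columns": [(c["Name"], c["Type"]) for c in storage.get("Columns", [])],
        "partition_keys": [(c["Name"], c["Type"]) for c in table.get("PartitionKeys", [])],
        "location": storage.get("Location"),
        "serde_library": serde.get("SerializationLibrary"),
        "serde_parameters": serde.get("Parameters", {}),
    }


class FakeGlueClient:
    """Stand-in for boto3.client('glue'); returns a canned, made-up response."""

    def get_table(self, DatabaseName, Name):
        return {
            "Table": {
                "Name": Name,
                "DatabaseName": DatabaseName,
                "PartitionKeys": [{"Name": "dt", "Type": "string"}],
                "StorageDescriptor": {
                    "Columns": [
                        {"Name": "col1", "Type": "string"},
                        {"Name": "col2", "Type": "int"},
                    ],
                    "Location": "s3://my_bucket/test_data",
                    "SerdeInfo": {
                        "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
                    },
                },
            }
        }


summary = summarize_table(FakeGlueClient(), "poc", "my_table")
print(summary["columns"])
print(summary["serde_parameters"])
```

With real credentials, the call would be summarize_table(boto3.client('glue'), 'poc', 'my_table').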
Play with the Table on AWS Athena
Now that the table is created, we can execute some queries on Athena.
However, a problem occurred when I was trying to generate the DDL for table creation: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.NullPointerException.
In this case, I created the table using CDK, and it was created successfully on AWS Glue. The table looked the same as other tables: I could add partitions to it, and the data could be reached. But when I tried to get the table description by running SHOW CREATE TABLE my_table in Athena, it failed with an Execution Error.
I asked AWS Support for guidance, and they told me that a parameter was missing: the parameter "serialization.format" has to be added to the Serde Info of the table definition. However, CDK offers no parameter for adding "serialization.format" to the table.
This error happens because the table input was missing some parameters that Athena needs. A table defined this way, for example by creating a new definition in the Data Catalog with aws glue create-table, works with Glue but does not work well with Athena.
The property "serialization.format" is set in the table definition for all tables created using the Athena console, irrespective of the input format, and its value is the same for different input formats. That is, you don’t have to change the value of this property.
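One possible workaround is the CDK escape hatch: reach the low-level CloudFormation resource behind the table and override the raw property. The sketch below is an assumption rather than a confirmed fix. It assumes that table.node.default_child is the underlying CfnTable, and that '1' (the value that Athena-created tables typically carry) is the right value. `add_serialization_format` is a hypothetical helper name; `add_property_override` is the CDK method it relies on.

```python
# Property path of the SerDe parameters in the AWS::Glue::Table resource.
SERDE_PARAMETERS_PATH = "TableInput.StorageDescriptor.SerdeInfo.Parameters"


def add_serialization_format(cfn_table, value: str = "1") -> None:
    """Override the SerDe parameters on the low-level CloudFormation resource.

    `cfn_table` is expected to be the CfnTable behind a glue.Table, or any
    object exposing CDK's `add_property_override(path, value)` method.
    Assumption: '1' is the value Athena expects; adjust if AWS advises otherwise.
    Note that this replaces the whole Parameters map rather than merging into it.
    """
    cfn_table.add_property_override(
        SERDE_PARAMETERS_PATH, {"serialization.format": value}
    )


class RecordingResource:
    """Test double that records overrides instead of touching CloudFormation."""

    def __init__(self):
        self.overrides = {}

    def add_property_override(self, property_path, value):
        self.overrides[property_path] = value


resource = RecordingResource()
add_serialization_format(resource)
print(resource.overrides)
```

Inside the stack, the call would be add_serialization_format(table.node.default_child), placed after the glue.Table definition. Running cdk synth afterwards shows whether the parameter appears in the synthesized template.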