[AWS] Create a Glue Catalog Table using AWS CDK

AWS CDK is a framework to manage cloud resources based on AWS CloudFormation. In this post, I will focus on how to create a Glue Catalog Table using AWS CDK.

What is AWS CDK?

The AWS Cloud Development Kit (AWS CDK) is an open source software development framework to model and provision your cloud application resources based on AWS CloudFormation. You can define the infrastructure of cloud applications using familiar programming languages, including Java, JavaScript, TypeScript, Python, and C#.

For more information, you can read the Developer Guide. You can also find a set of example projects on GitHub.

Start a CDK Project

Follow the Getting Started Guide from AWS.

In this post, we are going to work with CDK in Python. Python AWS CDK applications require Python 3.6 or later.

After installing CDK you will be able to create a new AWS CDK project by invoking cdk init in an empty directory. Note that cdk init cannot be run in a non-empty directory.

$ cd cdk-glue-table
$ cdk init app --language python

And you will see the following message:

Applying project template app for python
Initializing a new git repository...
Executing Creating virtualenv...
# Welcome to your CDK Python project!
This is a blank project for Python development with CDK.

It will also show you some useful commands:

## Useful commands
 * `cdk ls`          list all stacks in the app
 * `cdk synth`       emits the synthesized CloudFormation template
 * `cdk deploy`      deploy this stack to your default AWS account/region
 * `cdk diff`        compare deployed stack with current state
 * `cdk docs`        open CDK documentation

The cdk init command has created a virtualenv for you. To activate it manually, run:

$ source .env/bin/activate

The next step is to install AWS Construct Library modules for the app to use. AWS Construct Library modules are named like aws-cdk.SERVICE-NAME. In our case, which is to create a Glue catalog table, we need the modules for Amazon S3 and AWS Glue.

$ pip install aws-cdk.aws-s3 aws-cdk.aws-glue

After initializing the project, the directory will look like this:

.
├── .env
│   ├── bin
...
│   ├── include
│   └── pyvenv.cfg
├── .gitignore
├── README.md
├── app.py
├── cdk.json
├── cdk_glue_table
│   ├── __init__.py
│   └── cdk_glue_table_stack.py
├── requirements.txt
├── setup.py
└── source.bat

There are two files for developers to pay attention to:

app.py: The entry point of the CDK app
cdk_glue_table/cdk_glue_table_stack.py: A stack is the unit of deployment in the AWS CDK. All AWS resources are defined within the scope of a stack.

Create a Glue Catalog Table using CDK

In this case, the data is stored in an existing S3 bucket, so the task is divided into the following steps:

Import an existing S3 bucket for data storage

If you want to use an existing resource in your CDK app, you can import it through the resource’s ARN or other identifying attributes by calling a static factory method on the resource’s class. Go to this document for more examples.

The following example shows how to import an existing S3 bucket.

from aws_cdk import aws_s3 as s3
bucket = s3.Bucket.from_bucket_name(self, 'my_bucket_id', 'my_bucket')

If you want to create a new bucket via CDK, you can visit the document for more information.

Create a Glue database

from aws_cdk import aws_glue as glue
database = glue.Database(
    self,
    id='my_database_id',
    database_name='poc'
)

Of course you can import an existing database by calling the method [`from_database_arn`](https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aws_glue/Database.html#aws_cdk.aws_glue.Database.from_database_arn). This [page](https://docs.aws.amazon.com/glue/latest/dg/glue-specifying-resource-arns.html) tells you how to specify Glue resource ARNs.
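
As a minimal sketch of the ARN pattern described on that page (arn:aws:glue:region:account-id:database/database-name), the helper below builds a database ARN from its parts. The function name, the region, and the account ID are placeholders chosen for illustration, not values from this project.

```python
def glue_database_arn(region: str, account_id: str, database_name: str,
                      partition: str = "aws") -> str:
    """Build a Glue database ARN following the documented pattern."""
    return f"arn:{partition}:glue:{region}:{account_id}:database/{database_name}"


# Placeholder region and account ID, for illustration only.
arn = glue_database_arn("us-east-1", "123456789012", "poc")
print(arn)
```

Inside the stack, a string like this could then be passed as the third argument to glue.Database.from_database_arn(self, 'my_database_id', arn).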

Create a new Glue table

The following example illustrates how to create a Glue table with the S3 bucket and database we just imported/created.

table = glue.Table(
        self,
        id='my_table_id',
        database=database,
        table_name='my_table',
        columns=[
            glue.Column(
                name='col1',
                type=glue.Type(
                    input_string='string',
                    is_primitive=True
                )
            ),
            glue.Column(
                name='col2',
                type=glue.Type(
                    input_string='int',
                    is_primitive=True
                )
            )
        ],
        partition_keys=[
            glue.Column(
                name='dt',
                type=glue.Type(
                    input_string='string',
                    is_primitive=True
                )
            )
        ],
        bucket=bucket,
        s3_prefix='test_data',
        data_format=glue.DataFormat(
            input_format=glue.InputFormat('org.apache.hadoop.mapred.TextInputFormat'),
            output_format=glue.OutputFormat('org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'),
            serialization_library=glue.SerializationLibrary('org.openx.data.jsonserde.JsonSerDe')
        )
    )

Don’t forget to give the fully qualified class name for glue.InputFormat, glue.OutputFormat and glue.SerializationLibrary. Otherwise, it will fail when you try to create the table, with this error message: HIVE_UNKNOWN_ERROR: Unable to create input format TextInputFormat.
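
One way to catch this early is a small check that runs before the table is defined. The sketch below is not part of CDK; both helper names are hypothetical, and "fully qualified" is approximated as "has at least one package segment before the class name".

```python
import re

# At least one "package." segment followed by a class name,
# e.g. "org.apache.hadoop.mapred.TextInputFormat".
_FQCN = re.compile(r"^(?:[A-Za-z_$][\w$]*\.)+[A-Za-z_$][\w$]*$")


def is_fully_qualified(class_name: str) -> bool:
    """Return True if class_name carries a package prefix."""
    return bool(_FQCN.match(class_name))


def unqualified_fields(input_format: str, output_format: str,
                       serialization_library: str) -> list:
    """Return the names of the fields whose value lacks a package prefix."""
    fields = {
        "input_format": input_format,
        "output_format": output_format,
        "serialization_library": serialization_library,
    }
    return [name for name, value in fields.items()
            if not is_fully_qualified(value)]


# The short class name is flagged; the other two pass.
print(unqualified_fields(
    "TextInputFormat",
    "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "org.openx.data.jsonserde.JsonSerDe",
))
```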

Deploy the CDK App

Before deployment, you can run this command first:

$ cdk synth

This command emits the synthesized CloudFormation template.

Resources:
  mytableidTableD0000000:
    Type: AWS::Glue::Table
    Properties:
      CatalogId:
        Ref: AWS::AccountId
      DatabaseName: poc
      TableInput:
        Description: my_table generated by CDK
        Name: my_table
        Parameters:
          has_encrypted_data: false
        PartitionKeys:
          - Name: dt
            Type: string
        StorageDescriptor:
          Columns:
            - Name: col1
              Type: string
            - Name: col2
              Type: int
          Compressed: false
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://my_bucket/test_data
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          SerdeInfo:
            SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
          StoredAsSubDirectories: false
        TableType: EXTERNAL_TABLE
    Metadata:
      aws:cdk:path: cdk-glue-table/my_table_id/Table
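
If you would rather check the template programmatically than read it by eye, the following sketch pulls the three class names out of every AWS::Glue::Table resource. It assumes the template has already been loaded into a dict (cdk synth also writes a JSON copy under the cdk.out directory); the function name is hypothetical and the sample dict is a trimmed copy of the template above.

```python
def glue_table_formats(template: dict) -> dict:
    """Map each Glue table's logical ID to (input format, output format, SerDe)."""
    formats = {}
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::Glue::Table":
            continue  # skip every other resource type
        descriptor = resource["Properties"]["TableInput"]["StorageDescriptor"]
        formats[logical_id] = (
            descriptor.get("InputFormat"),
            descriptor.get("OutputFormat"),
            descriptor.get("SerdeInfo", {}).get("SerializationLibrary"),
        )
    return formats


# Trimmed copy of the synthesized template shown above.
template = {
    "Resources": {
        "mytableidTableD0000000": {
            "Type": "AWS::Glue::Table",
            "Properties": {
                "TableInput": {
                    "StorageDescriptor": {
                        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                        "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                        "SerdeInfo": {
                            "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
                        },
                    }
                }
            },
        }
    }
}

print(glue_table_formats(template))
```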

Then, deploy the app by running:

$ cdk deploy

In this step, you need to be authorized to perform "cloudformation:CreateChangeSet" and "cloudformation:ExecuteChangeSet" on the stack.

If there’s no error message, you can now check if the table is created on AWS Glue. Or you can check it through a CLI command:

$ aws glue get-table \
    --database-name poc \
    --name my_table

If the table was created successfully, the command returns the metadata of the table.

{
    "Table": {
        "Name": "my_table",
        "DatabaseName": "poc",
        "Description": "my_table generated by CDK",
        "CreateTime": 1592295011.0,
        "UpdateTime": 1592295011.0,
        "Retention": 0,
        "StorageDescriptor": {
            "Columns": [
                {
                    "Name": "col1",
                    "Type": "string"
                },
                {
                    "Name": "col2",
                    "Type": "int"
                }
            ],
            "Location": "s3://my_bucket/test_data",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "Compressed": true,
            "NumberOfBuckets": 0,
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
            "SortColumns": [],
            "StoredAsSubDirectories": false
        },
        "PartitionKeys": [
            {
                "Name": "dt",
                "Type": "string"
            }
        ],
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "has_encrypted_data": "false"
        },
        "CreatedBy": "arn:aws:iam::aws_account_id:user/user_name",
        "IsRegisteredWithLakeFormation": false
    }
}
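
To compare the deployed schema against what was declared in the stack, the response can be reduced to a flat list of columns and partition keys. This is only a sketch: the function name is hypothetical, and the dict is a trimmed copy of the response above, as you would get from json.loads on the CLI output.

```python
def table_schema(response: dict) -> list:
    """Return (name, type, kind) tuples for data columns and partition keys."""
    table = response["Table"]
    columns = table["StorageDescriptor"]["Columns"]
    partitions = table.get("PartitionKeys", [])
    return (
        [(c["Name"], c["Type"], "column") for c in columns]
        + [(p["Name"], p["Type"], "partition") for p in partitions]
    )


# Trimmed copy of the get-table response shown above.
response = {
    "Table": {
        "Name": "my_table",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "col1", "Type": "string"},
                {"Name": "col2", "Type": "int"},
            ]
        },
        "PartitionKeys": [{"Name": "dt", "Type": "string"}],
    }
}

for name, col_type, kind in table_schema(response):
    print(f"{kind:<10} {name:<5} {col_type}")
```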

Play with the Table on AWS Athena

Now that the table is created, we can execute some queries on Athena.

However, a problem occurred when I tried to generate the DDL for table creation: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.NullPointerException.

In this case, I created the table using CDK, and it was created successfully on AWS Glue. The table looked the same as other tables: I could add partitions to it, and the data could be reached. But when I tried to get the table description by running SHOW CREATE TABLE my_table in Athena, it failed with an Execution Error.

I asked AWS Support for guidance, and they told me that a parameter was missing: I had to add the parameter "serialization.format" to the SerDe info of the table definition. However, the CDK table construct has no parameter for adding "serialization.format" to the table.

This error happens because the table input lacks some parameters that Athena needs. A table definition created directly in the Data Catalog, for example with $ aws glue create-table, will work with Glue, but it will not work well with Athena.

The property "serialization.format" is set in the table definition of every table created from the Athena console, irrespective of the input format, and its value is the same for different input formats. That is, you don’t have to change the value of this property.
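
One possible workaround, which is not part of the original setup, is CDK's escape hatch: reach the underlying CloudFormation resource through node.default_child and override the raw property. The sketch below assumes table is the glue.Table created earlier. The helper name is hypothetical, and the value "1" is an assumption based on what Athena-generated DDL commonly shows, so compare it with a table created from the Athena console before relying on it.

```python
def add_serialization_format(table, value: str = "1") -> None:
    """Set SerdeInfo.Parameters on the CloudFormation resource behind a glue.Table.

    `table` is expected to be a CDK construct whose node.default_child is the
    low-level AWS::Glue::Table resource. The default value "1" is an assumption.
    """
    cfn_table = table.node.default_child
    cfn_table.add_property_override(
        "TableInput.StorageDescriptor.SerdeInfo.Parameters",
        {"serialization.format": value},
    )


# Inside the stack's __init__, after the glue.Table(...) call:
#     add_serialization_format(table)
```

Running cdk synth afterwards should show a Parameters entry under SerdeInfo, which is a quick way to confirm the override took effect before deploying.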
