The Airflow scheduler monitors all tasks and DAGs, triggers tasks, and provides tools to check their status. However, scheduling these tasks can be tricky.

Keep in mind:

The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.

Set the schedule interval

schedule_interval is one of the DAG parameters; it defines how often the DAG runs. The scheduler adds it to the execution_date of the latest task instance to figure out the next schedule.

You can provide one of these values to schedule_interval:

  • A cron preset such as @daily or @monthly (find out more in the documentation)
  • A cron expression, for example 0 10 * * *
  • A datetime.timedelta object
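The rule above — the scheduler adds the interval to the execution_date of the latest run — can be sketched with plain datetime arithmetic. This is a stdlib illustration, not Airflow code; the dates are hypothetical and a timedelta stands in for '@daily':

```python
from datetime import datetime, timedelta

# '@daily' behaves like a one-day interval
interval = timedelta(days=1)

# execution_date of the latest run (hypothetical)
latest_execution_date = datetime(2020, 1, 5)

# the scheduler adds the interval to get the next execution_date,
# and only triggers that run at the END of its interval
next_execution_date = latest_execution_date + interval
next_trigger_time = next_execution_date + interval

print(next_execution_date)  # 2020-01-06 00:00:00
print(next_trigger_time)    # 2020-01-07 00:00:00
```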

execution_date vs start_date

There are two dates in Airflow you need to take extra time to understand:

📍Note: Here we are talking about the execution_date and start_date of a task instance, so this start_date is not the same as the one you defined in the DAG.

  • execution_date is the date and time when you expect a DAG to be executed; it marks the start of the scheduled interval.
  • start_date is the date and time when a DAG actually executes.

Here are some examples.

Case 1: A daily task

Suppose you deploy this DAG on 2020-01-01.

from airflow import DAG
from datetime import datetime

default_args = {
    'start_date': datetime(2020, 1, 1),
}

dag = DAG(
    dag_id='daily_example',
    schedule_interval='@daily',
    default_args=default_args,
)

The start_date of the first execution will be 2020-01-02, and the execution_date will be 2020-01-01.

Case 2: A weekly task

Suppose you deploy this DAG on 2020-03-01 (a Sunday).

from airflow import DAG
from datetime import datetime

default_args = {
    'start_date': datetime(2020, 3, 1),
}

dag = DAG(
    dag_id='weekly_example',
    schedule_interval='0 10 * * 1',  # run at 10:00 on Monday
    default_args=default_args,
)

The scheduler will trigger the first run one schedule_interval after the start_date, at the end of the first Monday-to-Monday period. Therefore, the start_date of the first execution will be 2020-03-09 10:00, and the execution_date will be 2020-03-02 10:00.
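A stdlib sketch of how the first weekly run lands on these dates. The cron parsing is replaced by simple weekday arithmetic, for illustration only:

```python
from datetime import datetime, timedelta

start_date = datetime(2020, 3, 1)  # from default_args
interval = timedelta(weeks=1)      # '0 10 * * 1' fires weekly

# first Monday 10:00 at or after start_date (Monday has weekday() == 0)
candidate = start_date.replace(hour=10, minute=0)
while candidate.weekday() != 0 or candidate < start_date:
    candidate += timedelta(days=1)

first_execution_date = candidate                   # start of first interval
first_run_start = first_execution_date + interval  # triggered at interval end

print(first_execution_date)  # 2020-03-02 10:00:00
print(first_run_start)       # 2020-03-09 10:00:00
```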

It is important to understand the difference between execution_date and start_date when you apply date and time logic in your DAG and use a macro like ds. (ref: Macros)
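For instance, the ds macro renders the execution_date (not the start_date) as a YYYY-MM-DD string. A stdlib approximation of what the template engine gives you, using the dates from Case 2:

```python
from datetime import datetime

execution_date = datetime(2020, 3, 2, 10, 0)  # logical date of the run
run_start = datetime(2020, 3, 9, 10, 0)       # when the run actually started

# {{ ds }} renders the execution_date in YYYY-MM-DD form
ds = execution_date.strftime('%Y-%m-%d')

print(ds)  # 2020-03-02 -- one interval behind the wall-clock run time
```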

Backfill and Catchup

So, what is the start_date defined in the DAG? It’s one of the DAG parameters; it defines the timestamp from which the scheduler will attempt to backfill. That is, if you deploy the daily task DAG above on 2020-03-01, the Airflow scheduler will create a DAG run for each interval between 2020-01-01 and 2020-03-01.

In other words, the Airflow scheduler finds every interval that has not been run or has been cleared (from start_date up to now, or up to the end date) and creates a DAG run for each of them. This catch-up behavior is on by default; you can turn it off with dag.catchup = False (or catchup=False in the DAG constructor).
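A stdlib sketch of that catch-up behavior for the daily DAG above, assuming a hypothetical deploy date of 2020-03-01 and enumerating one run per completed interval:

```python
from datetime import datetime, timedelta

start_date = datetime(2020, 1, 1)   # DAG start_date
deploy_date = datetime(2020, 3, 1)  # when the DAG is first deployed
interval = timedelta(days=1)        # '@daily'

# one DAG run per completed interval; each run's execution_date
# is the start of its interval
backfill_runs = []
execution_date = start_date
while execution_date + interval <= deploy_date:
    backfill_runs.append(execution_date)
    execution_date += interval

print(len(backfill_runs))  # 60 runs: 2020-01-01 .. 2020-02-29 (leap year)
```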

Reference