[Airflow] Scheduling
The Airflow scheduler monitors all tasks and DAGs, triggers tasks, and provides tools to check their status. However, scheduling these tasks can be tricky.
Keep in mind:
The scheduler runs your job one `schedule_interval` AFTER the start date, at the END of the period.
Set the schedule interval
`schedule_interval` is one of the DAG parameters; it defines how often that DAG runs. Airflow adds it to the `execution_date` of the latest task instance to figure out the next schedule.
You can provide one of these values to `schedule_interval`:
- A cron preset such as `@daily` and `@monthly` (find out more in the documentation)
- A cron expression, for example `0 10 * * *`
- A `datetime.timedelta` object
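The rule above can be sketched with plain `datetime` objects: Airflow adds the `schedule_interval` to the `execution_date` of the latest run to determine the next schedule (a simplified sketch; real cron expressions are handled by the scheduler itself):

```python
from datetime import datetime, timedelta

# Simplified sketch: the next schedule is the latest execution_date
# plus the schedule_interval.
schedule_interval = timedelta(days=1)        # equivalent to "@daily"
latest_execution_date = datetime(2020, 1, 1)

next_execution_date = latest_execution_date + schedule_interval
print(next_execution_date)  # 2020-01-02 00:00:00
```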
`execution_date` vs `start_date`
There are two dates in Airflow that you need to spend some extra time to understand:
📍Note: Here we are talking about the `execution_date` and `start_date` of a task instance, so this `start_date` is not the same as the date you defined in the DAG.
`execution_date` is the date and time when you expect a DAG to be executed. `start_date` is the date and time when a DAG actually executes.
Here are some examples.
Case 1: A daily task
Suppose you deploy this DAG on 2020-01-01.
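A minimal daily DAG matching this case might look like the following sketch (the `dag_id`, the task, and the Airflow 2 import paths are illustrative assumptions):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative daily DAG: start_date 2020-01-01, runs once per day.
with DAG(
    dag_id="daily_example",            # assumed name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    task = BashOperator(
        task_id="print_date",
        bash_command="date",
    )
```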
The `start_date` of the first execution will be `2020-01-02`, and the `execution_date` will be `2020-01-01`.
Case 2: A weekly task
Suppose you deploy this DAG on 2020-03-01.
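A weekly DAG consistent with the dates below might look like this sketch (the `dag_id`, the task, and the Monday-at-10:00 cron expression are illustrative assumptions inferred from the example):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative weekly DAG: runs every Monday at 10:00,
# with a start_date of 2020-03-01.
with DAG(
    dag_id="weekly_example",           # assumed name
    start_date=datetime(2020, 3, 1),
    schedule_interval="0 10 * * 1",    # every Monday at 10:00
) as dag:
    task = BashOperator(
        task_id="print_date",
        bash_command="date",
    )
```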
The scheduler triggers the first task one `schedule_interval` after the `start_date`; that is, the task executes after one `schedule_interval` period has passed since the `start_date`. Therefore, the `start_date` of the first execution will be `2020-03-09 10:00`, and the `execution_date` will be `2020-03-02 10:00`.
Understanding the difference between `execution_date` and `start_date` becomes very important when you need to apply date and time logic in your DAG and use a macro like `ds`. (ref: Macros)
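For instance, the `ds` macro renders the `execution_date` (not the actual run date) as `YYYY-MM-DD` inside templated fields. A hedged sketch, assuming Airflow 2 import paths (the `dag_id` and task are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="macro_example",            # assumed name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    # {{ ds }} is templated to the execution_date, e.g. "2020-01-01"
    # for the daily run that actually starts on 2020-01-02.
    echo_ds = BashOperator(
        task_id="echo_ds",
        bash_command="echo {{ ds }}",
    )
```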
Backfill and Catchup
So, what is the `start_date` defined in the DAG? It's one of the DAG parameters; it defines the timestamp from which the scheduler will attempt to backfill. That is, if you deploy the daily task DAG above on 2020-03-01 with a `start_date` of 2020-01-01, the Airflow scheduler will create a DAG run for each interval between 2020-01-01 and 2020-03-01.
In other words, the Airflow scheduler will find any tasks that have not been run or have been cleared (from the `start_date` to now, or to the end date) and create a DAG run for each of them. By default, each DAG handles this catchup; you can turn it off with `dag.catchup = False`.
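Catchup can also be disabled directly in the DAG definition. A minimal sketch, assuming Airflow 2 (the `dag_id` is illustrative):

```python
from datetime import datetime

from airflow import DAG

# With catchup=False, the scheduler only creates a DAG run for the
# most recent interval instead of backfilling from start_date.
with DAG(
    dag_id="no_catchup_example",       # assumed name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    pass  # add tasks here
```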