Scheduling Jupyter Notebooks at Meta

At Meta, Bento is our internal Jupyter notebooks platform, used by many internal users. Notebooks are also widely used for creating reports and workflows (for example, performing data ETL) that need to be repeated at certain intervals. Users with such notebooks have to remember to run them manually at the required cadence, a process that is easy to forget and does not scale with the number of notebooks used.

To address this problem, we invested in building a scheduled notebooks infrastructure that fits in seamlessly with the rest of the internal tooling available at Meta. Investing in infrastructure helps ensure that privacy is inherent in everything we build. It enables us to continue building innovative, valuable solutions in a privacy-safe way.

The ability to transparently answer questions about data flow through Meta systems, for purposes of data privacy and complying with regulations, differentiates our scheduled notebooks implementation from the rest of the industry.

In this post, we’ll explain how we married Bento with our batch ETL pipeline framework called Dataswarm (think Apache Airflow) in a privacy- and lineage-aware manner.

The challenge around doing scheduled notebooks at Meta

At Meta, we’re committed to improving confidence in production by performing static analysis on scheduled artifacts and maintaining coherent narratives around dataflows by leveraging transparent Dataswarm Operators and data annotations. Notebooks pose a special challenge because:

  • Due to dynamic code content (think table names created via f-strings, for instance), static analysis won’t work, making it harder to understand data lineage.
  • Since notebooks can contain arbitrary code, their execution in production is considered “opaque” because data lineage cannot be determined, validated, or recorded.
  • Scheduled notebooks are considered to be on the production side of the production-development barrier. Before anything runs in production, it needs to be reviewed, and reviewing notebook code is non-trivial.
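To make the first point concrete, here is a minimal sketch of a static lineage check built on Python's standard `ast` module. The `read_table` function name and the two sample cells are hypothetical and exist only for illustration; this is not Meta's analyzer. It shows that a literal table name can be read straight off the syntax tree, while an f-string name cannot be resolved without running the code.

```python
import ast


def literal_table_names(source: str, reader: str = "read_table") -> list:
    """Collect the first argument of every call to `reader`.

    Plain string literals are returned as-is. Anything else (such as an
    f-string) is recorded as None, because its value only exists at runtime.
    `reader` is a hypothetical function name used for this illustration.
    """
    names = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handle both `read_table(...)` and `something.read_table(...)`.
        called = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if called != reader or not node.args:
            continue
        arg = node.args[0]
        if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
            names.append(arg.value)
        else:
            # f-strings parse as ast.JoinedStr: no static value is available.
            names.append(None)
    return names


static_cell = 'df = read_table("sales_daily")'
dynamic_cell = 'region = pick_region()\ndf = read_table(f"sales_{region}")'

print(literal_table_names(static_cell))   # ['sales_daily']
print(literal_table_names(dynamic_cell))  # [None]
```

The `None` entry is the lineage gap: the analyzer knows a table is read, but not which one.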

These three considerations shaped and influenced our design decisions. In particular, we limited the notebooks that can be scheduled to those primarily performing ETL and those performing data transformations and displaying visualizations. Notebooks with any other side effects are currently out of scope and are not eligible to be scheduled.

How scheduled notebooks work at Meta

There are three main components for supporting scheduled notebooks:

  1. The UI for setting up a schedule and creating a diff (Meta’s pull request equivalent) that needs to be reviewed before the notebook and its associated Dataswarm pipeline get checked into source control.
  2. The debugging interface once a notebook has been scheduled.
  3. The integration point (a custom Operator) with Meta’s internal scheduler to actually run the notebook. We’re calling this BentoOperator.

How BentoOperator works

In order to address the majority of the concerns highlighted above, we execute the notebook in a container without access to the network. We also leverage input and output data annotations to show the flow of data.

The overall design for BentoOperator.

For ETL, we fetch data and write it out in a novel way:

  • Supported notebooks perform data fetches in a structured manner via custom cells that we’ve built. An example of this is the SQL cell. When BentoOperator runs, the first step involves parsing the metadata associated with these cells, fetching the data using transparent Dataswarm Operators, and persisting it in local csv files on the ephemeral remote hosts.
  • Instances of these custom cells are then replaced with a call to pandas.read_csv() to load that data in the notebook, unlocking the ability to execute the notebook without any access to the network.
  • Data writes also leverage a custom cell, which we replace with a call to pandas.DataFrame.to_csv() to persist to a local csv file. We process that file after the actual notebook execution is complete and upload the data to the warehouse using transparent Dataswarm Operators.
  • After this step, the temporary csv files are garbage-collected, the resulting notebook version with outputs is uploaded, and the ephemeral execution host is deallocated.
Custom SQL cell supported for scheduled notebooks.
Structured custom cell for data uploads.
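The cell-replacement steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not BentoOperator's code: the notebook is treated as a plain dict in the Jupyter JSON shape, and the `custom_cell` metadata key, its `kind`, `query`, `output_var`, `input_var`, and `table` fields, and the `/tmp/nb_io` directory are all hypothetical names invented for the example. The function rewrites custom cells into pandas calls against local csv files and returns the fetch and upload work that would happen outside the sandbox.

```python
import copy


def rewrite_for_offline_run(notebook: dict, csv_dir: str = "/tmp/nb_io"):
    """Return (rewritten notebook, fetch plan, upload plan).

    The input notebook is not mutated. The metadata schema used here is a
    hypothetical stand-in for whatever the real custom cells store.
    """
    nb = copy.deepcopy(notebook)
    fetches, uploads = [], []
    for index, cell in enumerate(nb["cells"]):
        spec = cell.get("metadata", {}).get("custom_cell")
        if spec is None:
            continue  # ordinary cell: leave it untouched
        path = f"{csv_dir}/cell_{index}.csv"
        if spec["kind"] == "sql":
            # The query runs before execution, outside the sandbox.
            fetches.append({"query": spec["query"], "csv_path": path})
            cell["source"] = f"import pandas as pd\n{spec['output_var']} = pd.read_csv({path!r})"
        elif spec["kind"] == "upload":
            # The csv is uploaded after execution, again outside the sandbox.
            uploads.append({"table": spec["table"], "csv_path": path})
            cell["source"] = f"{spec['input_var']}.to_csv({path!r}, index=False)"
    return nb, fetches, uploads


notebook = {
    "cells": [
        {"cell_type": "code", "source": "",
         "metadata": {"custom_cell": {"kind": "sql", "query": "SELECT * FROM events",
                                      "output_var": "events"}}},
        {"cell_type": "code", "metadata": {},
         "source": "daily = events.groupby('day').size().reset_index(name='n')"},
        {"cell_type": "code", "source": "",
         "metadata": {"custom_cell": {"kind": "upload", "input_var": "daily",
                                      "table": "events_daily"}}},
    ]
}

offline_nb, fetch_plan, upload_plan = rewrite_for_offline_run(notebook)
for cell in offline_nb["cells"]:
    print(cell["source"], end="\n\n")
```

Because every read and write in the rewritten notebook targets a local file, the only network-facing work is the fetch and upload plans, and their inputs and outputs are known before any notebook code runs.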

Our approach to privacy with BentoOperator

We have integrated BentoOperator with Meta’s data purpose framework to ensure that data is used only for the purpose for which it was intended. This framework ensures that the data usage purpose is respected as data flows and transmutes across Meta’s stack. As part of scheduling a notebook, the user supplies a “purpose policy zone”, which serves as the integration point with the data purpose framework.

Overall user workflow

Let’s now explore the workflow for scheduling a notebook:

We’ve exposed the scheduling entry point directly from the notebook header, so all users have to do is hit a button to get started.

The first step in the workflow is setting up some parameters that will be used for automatically generating the pipeline for the schedule.

The next step involves previewing the generated pipeline before a Phabricator (Meta’s diff review tool) diff is created.

In addition to the pipeline code for running the notebook, the notebook itself is also checked into source control so it can be reviewed. The results of attempting to run the notebook in a scheduled setup are also included in the test plan.

Once the diff has been reviewed and landed, the schedule starts running the next day. If the notebook execution fails for whatever reason, the schedule owner is automatically notified. We’ve also built a context pane extension directly in Bento to help with debugging notebook runs.

What’s next for scheduled notebooks

While we’ve addressed the challenge of supporting scheduled notebooks in a privacy-aware manner, the notebooks that are in scope for scheduling are limited to those performing ETL or those performing data analysis with no other side effects. This is only a fraction of the notebooks that users eventually want to schedule. In order to increase the number of use cases, we’ll be investing in supporting other transparent data sources in addition to the SQL cell.

We have also begun work on supporting parameterized notebooks in a scheduled setup. The idea is to support cases where, instead of checking many notebooks into source control that differ only by a few variables, we check in a single notebook and inject the differentiating parameters at runtime.
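One way runtime parameter injection could look is sketched below. It borrows Papermill's tagging convention, where a cell tagged `parameters` holds defaults and a new cell tagged `injected-parameters` is placed right after it to override them. Whether Bento adopts this convention is an assumption; the notebook dict, variable names, and `inject_parameters` helper are hypothetical and only illustrate the idea.

```python
import copy


def inject_parameters(notebook: dict, parameters: dict) -> dict:
    """Return a copy of `notebook` with an extra cell assigning `parameters`.

    The new cell goes right after the first cell tagged "parameters", so its
    assignments override the defaults; with no such cell it goes at the top.
    """
    nb = copy.deepcopy(notebook)
    injected = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        # repr() produces valid Python literals for simple values.
        "source": "\n".join(f"{name} = {value!r}" for name, value in parameters.items()),
        "outputs": [],
        "execution_count": None,
    }
    position = 0
    for index, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            position = index + 1
            break
    nb["cells"].insert(position, injected)
    return nb


template = {
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["parameters"]},
         "source": "region = 'us'\nlookback_days = 7"},
        {"cell_type": "code", "metadata": {},
         "source": "print(region, lookback_days)"},
    ]
}

eu_run = inject_parameters(template, {"region": "eu", "lookback_days": 30})
print(eu_run["cells"][1]["source"])
```

With this shape, a single checked-in template can back many schedules, and each schedule only needs to carry its own small parameter dict.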

Finally, we’ll be working on event-based scheduling (in addition to the time-based approach we have here) so that a scheduled notebook can also wait for predefined events before running. This would include, for example, the ability to wait until all data sources the notebook depends on have landed before notebook execution can begin.


Some of the approaches we took were directly inspired by the work done on Papermill.