MLEnv: Standardizing ML at Pinterest Under One ML Engine to Accelerate Innovation | by Pinterest Engineering | Pinterest Engineering Blog | Sep, 2023

Pong Eksombatchai | Principal Engineer; Karthik Anantha Padmanabhan | Manager II, Engineering

Pinterest’s mission is to bring everyone the inspiration to create a life they love. We rely on an extensive suite of AI-powered products to connect over 460M users to hundreds of billions of Pins, resulting in hundreds of millions of ML inferences per second and hundreds of thousands of ML training jobs per month, run by just a couple of hundred ML engineers.

In 2021, ML was siloed at Pinterest, with 10+ different ML frameworks relying on different deep learning frameworks, framework versions, and boilerplate logic to connect with our ML platform. This was a major bottleneck for ML innovation at Pinterest because the amount of engineering resources each ML team spent maintaining its own ML stack was immense, and there was limited knowledge sharing across teams.

To fix these problems we introduced MLEnv, a standardized ML engine now leveraged by 95% of ML jobs at Pinterest (starting from <5% in 2021). Since launching our platform we have:

  • Observed a 300% increase in the number of training jobs, a world-class Net Promoter Score (NPS) of 88 for MLEnv, and a 43% increase in ML Platform NPS
  • Shifted the paradigm for ML innovations and delivered aggregate gains in Pinner engagement on the order of mid-double-digit percentages

Growth of MLEnv jobs as a share of all Pinterest ML jobs over time. MLEnv started in Q3 2021, and by Q1 2023 almost all Pinterest ML jobs were MLEnv jobs.

When we started working on the project, ML development at Pinterest was in a siloed state where each team owned most of its own unique ML stack. With tooling becoming standardized and popular ML libraries offering more or less the same functionality, maintaining multiple ML stacks in a company at Pinterest’s scale is suboptimal for ML productivity and innovation. Both ML and ML platform engineers felt the full brunt of this problem.

For ML Engineers, this would mean:

  • Having to maintain their own environment, including work to ensure code quality and maintainability, the runtime environment, and the CI/CD pipeline. Questions that the team has to answer and continuously revisit include how to enable unit/integration testing, how to ensure consistency between the training and serving environments, what coding best practices to enforce, etc.
  • Handling integrations to leverage tools and frameworks that are critical for developer velocity. Heavy engineering work is needed for basic quality-of-life functionality. For example, the project needs to integrate with MLflow to track training runs, with Pinterest’s internal ML training and serving platforms to train and serve models at scale, etc.
  • Enabling advanced ML capabilities to properly develop state-of-the-art ML at scale. ML has seen an explosion of innovations in recent years, especially with the prominence of large language models and generative AI, and is much more complicated than just training a model on one GPU and serving it on CPU. Teams need to spend an inordinate amount of time and resources reinventing the wheel for different platforms to enable distributed training, re-implement state-of-the-art algorithms on TensorFlow, optimize serving, etc.
  • Worst of all, everything is done in a silo. Each team repeats a lot of work to maintain its own environment and handle various integrations. All the effort put into enabling advanced ML capabilities can only be applied to an individual project because each project has a unique ML stack.

The diagram summarizes the pillars that are crucial for ML productivity and for ML to work at scale. Teams spent substantial resources and repeated effort maintaining their own ML stacks, and struggled to maintain or enable every functionality in the pillars because of how much effort each one requires.

For Platform Engineers, this would mean:

  • Major struggles in the creation and adoption of platform tools, which severely limited the value that platform teams could add for ML engineers. It is very difficult for platform engineers to build good standardized tools that fit diverse ML stacks. The platform team also needs to work closely with ML stacks one by one in order to integrate offerings from ML Platform. Tools like a distributed training platform, automated hyperparameter tuning, etc. took much longer than needed since the work had to be repeated for every team.
  • Having to build expertise in both TensorFlow and PyTorch stretched ML platform engineering resources to the limit. The nuances of the underlying deep learning framework need to be considered in order to build a high-performance ML system. The platform team spent several times the effort needed because it had to support multiple deep learning frameworks and versions (PyTorch vs. TensorFlow vs. TensorFlow2).
  • Inability to drive software and hardware upgrades. Individual teams were very far behind on ML-related software upgrades even though each upgrade brings a lot of new functionality. Rather than the upgrade process being handled by platform engineers, most teams ended up using very old versions of TensorFlow, CUDA, etc. because of how cumbersome the upgrade process usually is. Similarly, it is also very difficult to drive hardware upgrades, which limits Pinterest’s ability to take advantage of the latest NVIDIA accelerators. Hardware upgrades usually require months of collaboration with various client teams to bring lagging software versions up to date.

MLEnv architecture diagram with major components

In mid-2021, we gained alignment from various ML stakeholders at Pinterest and built the ML Environment (MLEnv), a full-stack ML developer framework that aims to make ML engineers more productive by abstracting away technical complexities that are irrelevant to ML modeling. MLEnv directly addresses the issues described in the previous section and provides four major components for ML developers.

Code Runtime and Build Environment

MLEnv provides a standardized code runtime and build environment for its users. MLEnv maintains a monorepo (single code repository) for all ML projects; a single shared environment, built with Docker, in which training and serving for all ML projects are executed; and a CI/CD pipeline that gives customers powerful components that are otherwise not easily available, such as GPU unit tests and ML trainer integration tests. Platform engineers do the heavy lifting of setting these up once so that every ML project at Pinterest can easily reuse them.

ML Dev Toolbox

MLEnv provides ML developers with the ML Dev Toolbox, a set of commonly used tools that help them be more productive in training and deploying models. Many are standard third-party tools such as MLflow, TensorBoard, and profilers, while others are internal tools and frameworks built by our ML Platform team, such as our model deployment pipeline, ML serving platform, and ML training platform.

The toolbox allows ML engineers to use dev velocity tools through an interface and skip integrations, which are usually very time consuming. One tool to highlight is the training launcher CLI, which makes the transition between local development and training the model at scale on Kubernetes through our internal training platform seamless. Together, the tools create a streamlined ML development experience in which our engineers can quickly iterate on their ideas, use various tools to debug, scale training, and deploy the model for inference.
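
To make the launcher idea concrete, here is a minimal sketch of a CLI that takes one training entry point and either runs it locally or produces a job spec for a cluster. The post does not describe the real launcher’s interface, so the flag names, the `build_plan` helper, the job-spec fields, and the `shared-ml-image` name are all hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # All flags here are illustrative assumptions, not the real MLEnv CLI.
    parser = argparse.ArgumentParser(description="Hypothetical training launcher")
    parser.add_argument("entry_point", help="Python module that contains the training code")
    parser.add_argument("--target", choices=["local", "cluster"], default="local")
    parser.add_argument("--gpus", type=int, default=1, help="GPUs to request on the cluster")
    return parser


def build_plan(argv):
    """Turn CLI arguments into a description of what would be launched."""
    args = build_parser().parse_args(argv)
    if args.target == "local":
        # Local development: run the module directly on this machine.
        return {"target": "local", "command": ["python", "-m", args.entry_point]}
    # Cluster run: same entry point, wrapped in a (hypothetical) job spec
    # that a training platform could schedule on Kubernetes.
    return {
        "target": "cluster",
        "job_spec": {
            "module": args.entry_point,
            "gpus": args.gpus,
            "image": "shared-ml-image",
        },
    }


local_plan = build_plan(["my_project.train"])
cluster_plan = build_plan(["my_project.train", "--target", "cluster", "--gpus", "8"])
```

The point of the design is that the training code does not change between the two plans; only the `--target` flag does.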

Advanced Functionalities

MLEnv gives customers access to advanced functionalities that were previously available only to the team developing them because of our earlier siloed state. ML projects now have access to a portfolio of training techniques that help speed up their training, such as distributed training and mixed precision training, and to libraries such as Accelerate, DeepSpeed, etc. Similarly, on the serving side, ML projects have access to highly optimized ML components for online serving as well as newer technologies such as GPU serving for recommender models.
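
As an illustration of why mixed precision training usually comes packaged with library support, the sketch below simulates loss scaling, a standard companion technique: very small gradients round to zero in half precision unless the loss is scaled up first. This is a standard-library simulation for intuition only, not MLEnv code; the function names and the scale factor of 1024 are assumptions chosen for the example.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def fp16_gradient(true_gradient: float, loss_scale: float = 1.0) -> float:
    """Simulate a gradient stored in fp16, then unscaled in higher precision."""
    # Multiplying the loss by `loss_scale` multiplies every gradient by it too.
    stored = to_fp16(true_gradient * loss_scale)
    # Unscaling happens in higher precision before the optimizer step.
    return stored / loss_scale


tiny_gradient = 1e-8  # smaller than the smallest positive fp16 value (about 6e-8)
without_scaling = fp16_gradient(tiny_gradient)
with_scaling = fp16_gradient(tiny_gradient, loss_scale=1024.0)

print(without_scaling)  # the gradient underflows to zero
print(with_scaling)     # close to the true gradient
```

In practice a deep learning library handles the scaling and unscaling automatically, which is one reason such capabilities are easier to adopt when they are offered once through a shared environment.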

Native Deep Learning Library

With the previous three components combined, ML developers can focus on the interesting part, which is the logic to train their model. We took extra care not to add any abstraction to the modeling logic, which would pollute the experience of working with well-functioning deep learning libraries such as TensorFlow2 and PyTorch. In our framework, ML engineers have full control over the dataset loading, model architecture, and training loop, implemented using native deep learning libraries, while having access to the complementary components outlined above.
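
The division of responsibility described here can be sketched as follows. To keep the example runnable anywhere, plain Python stands in for PyTorch or TensorFlow2, and `RunTracker` is a hypothetical stand-in for a platform-provided tool; neither reflects actual MLEnv interfaces. The user owns the data loading, the model, and the training loop, and the platform is touched only through one logging call.

```python
from typing import Dict, List, Tuple


class RunTracker:
    """Hypothetical stand-in for a platform-provided experiment tracker."""

    def __init__(self) -> None:
        self.metrics: List[Dict[str, float]] = []

    def log(self, step: int, loss: float) -> None:
        self.metrics.append({"step": float(step), "loss": loss})


def load_dataset() -> List[Tuple[float, float]]:
    """User-owned data loading: five points on the line y = 2x + 1."""
    return [(i / 4, 2 * (i / 4) + 1) for i in range(5)]


class LinearModel:
    """User-owned model architecture: y = w * x + b."""

    def __init__(self) -> None:
        self.w = 0.0
        self.b = 0.0

    def predict(self, x: float) -> float:
        return self.w * x + self.b


def train(model: LinearModel, data, tracker: RunTracker,
          epochs: int = 200, lr: float = 0.5) -> LinearModel:
    """User-owned training loop: full-batch gradient descent on mean squared error."""
    n = len(data)
    for step in range(epochs):
        errors = [model.predict(x) - y for x, y in data]
        loss = sum(e * e for e in errors) / n
        grad_w = 2 * sum(e * x for e, (x, _) in zip(errors, data)) / n
        grad_b = 2 * sum(errors) / n
        model.w -= lr * grad_w
        model.b -= lr * grad_b
        tracker.log(step, loss)  # the only platform touchpoint in the loop
    return model


tracker = RunTracker()
model = train(LinearModel(), load_dataset(), tracker)
print(round(model.w, 3), round(model.b, 3))
```

Nothing in `train` is wrapped or hidden by the framework, which is the property the paragraph above emphasizes.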

After MLEnv general availability in late 2021, we entered a very interesting period with rapid advancements in ML modeling and the ML platform at Pinterest, which resulted in huge improvements in recommendation quality and in our ability to serve more inspiring content to our Pinners.

ML Development Velocity

The direct impact of MLEnv is a massive improvement in the dev velocity of ML engineers at Pinterest. The ability to offload most of the ML boilerplate engineering work, access to a complete set of useful ML tools through an easy-to-use interface, and easy access to advanced ML capabilities are game changers in developing and deploying state-of-the-art ML models.

ML developers are very satisfied with the new tooling. MLEnv maintains an NPS of 88, which is world-class, and is a key contributor to improving ML Platform NPS by 43%. In one of the organizations that we work with, the NPS improved by 93 points once MLEnv had been fully rolled out.
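
For readers unfamiliar with the metric, NPS is conventionally computed from 0 to 10 survey ratings as the percentage of promoters (9 or 10) minus the percentage of detractors (0 through 6), so it ranges from -100 to 100. The sketch below shows that arithmetic; the survey responses are made up purely to illustrate one way a score of 88 can arise and are not Pinterest data.

```python
def net_promoter_score(ratings):
    """NPS = percentage of promoters (9-10) minus percentage of detractors (0-6)."""
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)


# Illustrative responses only: 90 promoters, 8 passives, 2 detractors.
example_ratings = [10] * 90 + [8] * 8 + [5] * 2
example_nps = net_promoter_score(example_ratings)
```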

Teams are also much more productive as a result. We see a multiple-fold growth in the number of ML jobs (i.e., offline experiments) that each team runs even though the number of ML engineers is roughly the same. They can now also take models to online experimentation in days rather than months, resulting in a multiple-fold improvement in the number of online ML experiments.

Explosion in the number of ML jobs over time due to developer velocity improvements

ML Platform 2.0

MLEnv made the ML Platform team much more productive by allowing the team to focus on a single ML environment. The ML Platform team can now build standardized tools and cutting-edge ML capabilities, and drive adoption through a single integration with MLEnv.

An example on the ML training platform side is Training Compute Platform (TCP), our in-house distributed training platform. Before MLEnv, the team struggled to maintain the platform because it had to support diverse ML environments with different deep learning framework libraries and setups. The team also struggled with adoption because it had to onboard client teams one by one, each with different needs of the platform. With MLEnv, however, the team was able to greatly reduce maintenance overhead by narrowing down to a single unified environment while gaining explosive growth in the number of jobs on the platform. With the much reduced maintenance overhead, the team was able to focus on natural extensions to TCP. More advanced functionalities like distributed training, automated hyperparameter tuning, and distributed data loading through Ray became easy for the team to implement, and are released through MLEnv for client teams to adopt and use with minimal effort.