We’re releasing Safety Gym, a suite of environments and tools for measuring progress towards reinforcement learning agents that respect safety constraints while training. We also provide a standardized method of comparing algorithms by how well they avoid costly mistakes while learning. If deep reinforcement learning is applied to the real world, whether in robotics or internet-based tasks, it will be important to have algorithms that are safe even while learning—like a self-driving car that can learn to avoid accidents without actually having to experience them.


Exploration is risky

Reinforcement learning agents need to explore their environments in order to learn optimal behaviors. Essentially, they operate on the principle of trial and error: they try things out, see what works or doesn’t work, and then increase the likelihood of good behaviors and decrease the likelihood of bad behaviors. However, exploration is fundamentally risky: agents might try dangerous behaviors that lead to unacceptable errors. This is the “safe exploration” problem in a nutshell.

Consider an example of an autonomous robot arm in a factory using reinforcement learning (RL) to learn how to assemble widgets. At the start of RL training, the robot might try flailing randomly, since it doesn’t know what to do yet. This poses a safety risk to humans who might be working nearby, since they could get hit.

For restricted examples like the robot arm, we can imagine simple ways to ensure that humans aren’t harmed by just keeping them out of harm’s way: shutting down the robot whenever a human gets too close, or putting a barrier around the robot. But for general RL systems that operate under a wider range of conditions, simple physical interventions won’t always be possible, and we will need to consider other approaches to safe exploration.

Constrained reinforcement learning

The first step towards making progress on a problem like safe exploration is to quantify it: figure out what can be measured, and how going up or down on those metrics gets us closer to the desired outcome. Another way to say it is that we need to pick a formalism for the safe exploration problem. A formalism allows us to design algorithms that achieve our goals.

While there are several options, there is not yet a universal consensus in the field of safe exploration research about the right formalism. We spent some time thinking about it, and the formalism we think makes the most sense to adopt is constrained reinforcement learning.

Constrained RL is like normal RL, but in addition to a reward function that the agent wants to maximize, environments have cost functions that the agent needs to constrain. For example, consider an agent controlling a self-driving car. We would want to reward this agent for getting from point A to point B as fast as possible. But naturally, we would also want to constrain the driving behavior to match traffic safety standards.
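In code, the constrained setup simply adds a cost signal to the usual interaction loop. Below is a minimal sketch; the `ToyDrivingEnv` and its reward/cost numbers are invented for illustration and are not Safety Gym’s API:

```python
# Minimal sketch of a constrained RL interaction loop.
# ToyDrivingEnv is a made-up stand-in: driving fast (action=1) earns
# more reward per step, but also incurs a safety cost.

class ToyDrivingEnv:
    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # trivial observation

    def step(self, action):
        self.t += 1
        reward = 2.0 if action == 1 else 1.0   # faster driving pays more
        cost = 1.0 if action == 1 else 0.0     # ...but is riskier
        done = self.t >= self.horizon
        return 0.0, reward, done, {"cost": cost}

def run_episode(env, policy):
    """Roll out one episode, tracking return and cost separately."""
    obs, done = env.reset(), False
    ep_ret, ep_cost = 0.0, 0.0
    while not done:
        obs, r, done, info = env.step(policy(obs))
        ep_ret += r
        ep_cost += info.get("cost", 0.0)
    return ep_ret, ep_cost

# A reckless policy that always speeds: high return, high cost.
ret, cost = run_episode(ToyDrivingEnv(), policy=lambda obs: 1)
print(ret, cost)  # 20.0 10.0
```

A constrained RL algorithm would accept some loss in `ret` in exchange for keeping the episodic `cost` below a specified limit, rather than folding both into a single reward number up front.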

We think constrained RL may turn out to be more useful than normal RL for ensuring that agents satisfy safety requirements. A big problem with normal RL is that everything about the agent’s eventual behavior is described by the reward function, but reward design is fundamentally hard. A key part of the challenge comes from picking trade-offs between competing objectives, such as task performance and satisfying safety requirements. In constrained RL, we don’t have to pick trade-offs—instead, we pick outcomes, and let algorithms figure out the trade-offs that get us the outcomes we want.

We can use the self-driving car case to sketch what this means in practice. Suppose the car earns some amount of money for every trip it completes, and has to pay a fine for every collision.

In normal RL, you would pick the collision fine at the beginning of training and keep it fixed forever. The problem here is that if the pay-per-trip is high enough, the agent may not care whether it gets in lots of collisions (as long as it can still complete its trips). In fact, it may even be advantageous to drive recklessly and risk those collisions in order to get the pay. We have seen this before when training unconstrained RL agents.

By contrast, in constrained RL you would pick the acceptable collision rate at the beginning of training, and adjust the collision fine until the agent is meeting that requirement. If the car is getting in too many fender-benders, you raise the fine until that behavior is no longer incentivized.
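This fine-adjustment can be automated. One common scheme (the idea behind the Lagrangian methods evaluated later in this post) treats the fine as a multiplier that is nudged up whenever the measured cost exceeds the limit and down otherwise. A minimal sketch, with made-up step sizes and collision numbers:

```python
def update_penalty(lam, measured_cost, cost_limit, lr=0.05):
    """One Lagrangian-style update of the cost penalty coefficient.

    Raise the penalty when the agent exceeds the cost limit; lower it
    (but never below zero) when the agent is within the limit.
    """
    return max(0.0, lam + lr * (measured_cost - cost_limit))

# If the car averages 3 collisions per thousand trips against a limit
# of 1, the fine ratchets up over successive policy updates:
lam = 0.0
for _ in range(10):
    lam = update_penalty(lam, measured_cost=3.0, cost_limit=1.0)
print(round(lam, 2))  # 1.0 after ten updates of +0.1 each
```

Once the agent’s collision rate drops below the limit, the same rule automatically eases the penalty back down, so the trade-off is discovered during training rather than fixed in advance.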

Safety Gym

To study constrained RL for safe exploration, we developed a new set of environments and tools called Safety Gym. By comparison to existing environments for constrained RL, Safety Gym environments are richer and feature a wider range of difficulty and complexity.

In all Safety Gym environments, a robot has to navigate through a cluttered environment to achieve a task. There are three pre-made robots (Point, Car, and Doggo), three main tasks (Goal, Button, and Push), and two levels of difficulty for each task. We give an overview of the robot-task combinations below, but make sure to check out the paper for details.

In these videos, we show how an agent without constraints tries to solve these environments. Every time the robot does something unsafe—which here means running into clutter—a red warning light flashes around the agent, and the agent incurs a cost (separate from the task reward). Because these agents are unconstrained, they often wind up behaving unsafely while trying to maximize reward.

Point is a simple robot constrained to the 2D plane, with one actuator for turning and another for moving forward or backward. Point has a small front-facing square that helps with the Push task.

Goal: Move to a series of goal positions.

Button: Press a series of goal buttons.

Push: Move a box to a series of goal positions.

Car has two independently-driven parallel wheels and a free-rolling rear wheel. For this robot, turning and moving forward or backward require coordinating both of the actuators.

Doggo is a quadruped with bilateral symmetry. Each of its four legs has two controls at the hip, for azimuth and elevation relative to the torso, and one in the knee, controlling angle. A uniform random policy keeps the robot from falling over and generates travel.


To help make Safety Gym useful out-of-the-box, we evaluated some standard RL and constrained RL algorithms on the Safety Gym benchmark suite: PPO, TRPO, Lagrangian penalized versions of PPO and TRPO, and Constrained Policy Optimization (CPO).

Our preliminary results demonstrate the wide range of difficulty of Safety Gym environments: the simplest environments are easy to solve and allow fast iteration, while the hardest environments may be too challenging for current techniques. We also found that Lagrangian methods were surprisingly better than CPO, overturning a previous result in the field.

Below, we show learning curves for average episodic return and average episodic sum of costs. In our paper, we describe how to use these and a third metric (the average cost over training) to compare algorithms and measure progress.
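Given logged training data, these metrics are simple aggregates. A sketch (the function and variable names here are ours, not taken from the Safety Gym code):

```python
def evaluate_run(episode_returns, episode_costs, episode_lengths):
    """Compute three comparison metrics from training logs:
    average episodic return, average episodic sum of costs, and the
    average cost over training (total cost / total environment steps)."""
    n = len(episode_returns)
    avg_return = sum(episode_returns) / n
    avg_ep_cost = sum(episode_costs) / n
    cost_rate = sum(episode_costs) / sum(episode_lengths)
    return avg_return, avg_ep_cost, cost_rate

# Three logged episodes of 100 steps each:
metrics = evaluate_run(
    episode_returns=[10.0, 12.0, 14.0],
    episode_costs=[5.0, 3.0, 1.0],
    episode_lengths=[100, 100, 100],
)
print(metrics)  # (12.0, 3.0, 0.03)
```

The third number matters because two agents can end training with the same final policy while one of them racked up far more cost along the way—exactly the difference safe exploration is about.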

Return and cost trade off against each other meaningfully

To facilitate reproducibility and future work, we’re also releasing the code for the algorithms we used in these experiments as the Safety Starter Agents repo.

Open problems

There is still a lot of work to do on refining algorithms for constrained RL, and combining them with other problem settings and safety techniques. There are three things we are most interested in at the moment:

  1. Improving performance on the current Safety Gym environments.
  2. Using Safety Gym tools to investigate safe transfer learning and distributional shift problems.
  3. Combining constrained RL with implicit specifications (like human preferences) for rewards and costs.

Our expectation is that, in the same way we today measure the accuracy or performance of systems at a given task, we’ll eventually measure the “safety” of systems as well. Such measures could feasibly be integrated into assessment schemes that developers use to test their systems, and could potentially be used by the government to create standards for safety. We also hope that systems like Safety Gym can make it easier for AI developers to collaborate on safety across the AI sector via work on open, shared systems.

If you’re excited to work on safe exploration problems with us, we’re hiring!


Imagine a seesaw with a flamingo on one side and a grizzly bear on the other. How would you ever stabilize them? That is how most digital marketers feel when they ask me to help balance business-first decisions and brand safety. What does that mean? Simply put, it’s the natural and growing conflict between the need to increase profits or market share and the need to ensure that marketing and sales efforts don’t erode the positive attitudes of prospects and customers toward the organization. Simpler yet, it’s the balancing of opportunity and risk in digital marketing and sales.

Balancing out these strategic and operational issues can appear complicated at first glance. But the uncomplicated place where I tend to start with anyone who calls me is understanding the specific growth or market challenges facing the organization and defining digital policy and practices around sentiment analysis.

Brands from any and all verticals use sentiment analysis to understand prospect and customer reactions, opinions and behaviors toward products or services. But while the analysis methodology has long been used to measure the latest social media campaign, it can be used as the foundation for your broader marketing and sales efforts, telling you exactly how far and fast you can push your efforts without damaging your brand. So why isn’t everyone jumping on the bandwagon? Should you take the leap? Let’s examine some of the intricacies of sentiment analysis to ensure you can proceed with eyes wide open.

The challenge of quantifying reputational risk 

It is straightforward to tie a one-off, large-scale event to brand and reputation impact. Consider a news story about a data breach or an accessibility lawsuit impacting your organization. Obviously, we can calculate the loss of revenue, the cost of recovery, and the potential legal liability. Weighed against the cost of mitigation, these figures give a clear understanding of the risk/benefit scenario and support a business decision on the most logical path forward. What is much harder to measure is how broadly, and for how long, the news stories will continue to cause trust issues and ill will with prospects and customers.
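That one-off calculation is simple expected-value arithmetic. A sketch with invented figures (all dollar amounts and probabilities below are illustrative, not from any real engagement):

```python
def incident_risk(prob, revenue_loss, recovery_cost, legal_liability):
    """Expected loss from an incident: probability times total impact."""
    return prob * (revenue_loss + recovery_cost + legal_liability)

# Invented figures: a 5% annual chance of a breach costing $2M in lost
# revenue, $500k in recovery, and $1.5M in legal exposure, weighed
# against a $150k mitigation program.
expected_loss = incident_risk(0.05, 2_000_000, 500_000, 1_500_000)
mitigation_cost = 150_000
print(expected_loss, expected_loss > mitigation_cost)  # 200000.0 True
```

When the expected loss exceeds the mitigation cost, as here, the investment is easy to justify; it is the lingering reputational tail that this arithmetic cannot capture.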

What I’ve found to be successful is to gather all (or as many as possible) mentions of the organization across any and all channels (e.g., news, social media, TV, radio, customer service recordings, customer surveys, user purchasing history, etc.) and use a text and data analytics engine to measure sentiment. That means identifying and categorizing opinions expressed in a piece of text in order to determine whether the attitude toward the organization is positive, negative, or neutral.

By tracking organizational reputation (and brand) in key demographics and markets, we can develop a solid set of sentiments that helps us track risks to hard-to-measure assets such as influence, trust, and leadership. This approach allows us to quantify a reputational baseline. Against that baseline, we can measure trends over time or at specific events, and use an agile methodology to test how aggressively we can market and sell before we get close to a decline in influence, trust, or leadership. In other words, we can tell how far we can push before we encounter brand risk and start to negatively impact our reputation.
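To make the mechanics concrete, here is a deliberately crude lexicon-based scorer. The word lists, sample mentions, and scoring rule are invented for illustration; a production analytics engine would use trained models rather than keyword matching:

```python
# Toy lexicon-based sentiment baseline (illustrative word lists only).
POSITIVE = {"great", "love", "trust", "excellent", "recommend"}
NEGATIVE = {"breach", "scam", "terrible", "lawsuit", "avoid"}

def score_mention(text):
    """Classify one mention as 'positive', 'negative', or 'neutral'."""
    words = set(text.lower().split())
    balance = len(words & POSITIVE) - len(words & NEGATIVE)
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"

def sentiment_baseline(mentions):
    """Fraction of mentions in each class -- the reputational baseline
    that later campaigns and events are measured against."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for m in mentions:
        counts[score_mention(m)] += 1
    total = len(mentions)
    return {k: v / total for k, v in counts.items()}

mentions = [
    "I love this brand and trust their products",
    "Their data breach was terrible, avoid them",
    "The store opens at nine",
    "Excellent service, would recommend",
]
print(sentiment_baseline(mentions))
# {'positive': 0.5, 'negative': 0.25, 'neutral': 0.25}
```

Re-running the same aggregation weekly, or around a specific campaign or incident, turns the baseline into the trend line described above.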

Getting the full picture

Creating a picture of your organization’s reputational risk goes beyond understanding how the entity is viewed in the marketplace. It requires the identification and quantification of the reputation of your products as well as those of your suppliers. That means understanding your entire digital ecosystem and measuring its brand risk in the context of your organization, products, and services. For example, I have a client that was involved in the AWS data exposure incident earlier this year. While the AWS relationship with my client wasn’t known well publicly, it still had a (marginal) negative impact on the brand.

Each vendor, agency or independent consultant is part of your ecosystem. So are boards of directors (past and present), brand ambassadors and influencers, and anyone else who touches your brand. You should map them all out and, based on a matrix of prioritization, determine who should be included in your full-picture analysis. After all, there is risk associated with each entity. Conversely, if any one of those entities is seen favorably, you can also benefit from such awareness and sentiment.

Managing and capitalizing on event-based risks 

Let’s continue this discussion with my AWS example. Understanding that there was a small but real brand risk, we decided with leadership to proactively reach out to users, so that as news of the AWS breach began to spread, users were already informed of what the organization knew about the incident and what it was doing to ensure consumer data was protected. The reputational risk measurement indicated that we managed to contain the negative fallout for the organization’s brand. It also indicated to executives how much effort to put into communicating around AWS and the incident in the future. Lastly, it allowed us to collectively understand what kind of risk we might face with AWS going forward and whether there was a return on investment (ROI) to be gained by moving to a different hosting environment.

The same approach we used to assess the AWS incident risk and mitigate it with a good response plan could be applied in a number of other scenarios to understand the marketing and sales options for your organization. Consider for a moment the latest YouTube advertising scandal. Your organization could perform the same analysis used for the AWS sentiment analysis to understand the impact on competitors and other companies advertising on YouTube. Based on the negative brand impact (if any), you could better understand the type of risk your business could incur and proceed to advertise on YouTube or, conversely, stop advertising in that channel.

Will you keep your finger on the pulse of brand safety?

By using sentiment analysis, you can keep your finger on the pulse of your brand safety risk and dial your digital marketing and sales activity up or down as appropriate, thereby delivering on the business’ bottom line. You can also minimize your exposure to brand-damaging events. With a measured approach, you can best balance your opportunity and risk and develop a better approach to marketing and sales. Moreover, you can develop the type of digital policies that will unleash creativity and innovation in the organization while keeping the business safe.

Opinions expressed in this article are those of the guest author and not necessarily Marketing Land.

About The Author

Kristina is a digital policy innovator. For over two decades, she has worked with some of the most high-profile companies in the world and has helped them see policies as opportunities to free the organization from uncertainty, risk and internal confusion. Kristina has a BA in international studies and an MBA in international business from the Dominican University of California and is certified as both a change management practitioner (APMG International) and a project management professional (Project Management Institute). Her book, The Power of Digital Policy, was published in 2019.