September 7 and 8, 2017 marked the first-ever RISECamp at UC Berkeley. RISECamp was a two-day bootcamp focused on sharing the work coming out of the RISELab. The acronym RISE is explained on the Lab’s website:
Real-time Intelligence with Secure Execution
A big difference between RISELab and the notable AMPLab (where Apache Spark was born) is the shift in focus from batch processing to real-time decision making at scale. This transition signifies a trend in how academia and industry are approaching state-of-the-art machine learning systems. While batch is suitable for a wide variety of problems, real-time decision systems are capable of learning on the fly from new observations as they flow into the system. These types of systems also come a step closer to true AI systems than applying a model trained offline to new observations does.
In addition to the focus on building systems capable of making real-time decisions (leveraging Reinforcement Learning), the camp also addressed other challenges facing many organizations that attempt to integrate advanced AI systems at scale. These challenges include easily distributing code across clusters of machines, managing data lineage and governance, serving predictions at scale, and the inherent security required when building advanced systems dependent on sensitive data.
The camp was broken into five main sections that address each of these challenges. In this post, I will summarize these sections for context as well as explain the tools being built at the RISELab covered during camp.
- Reinforcement Learning
- Prediction serving: Scoring at scale
- Ease of development: Make ML easy for non-CS
- Using data in context
- Security: Authorization of IoT devices
The RISECamp Stack
If you have done any type of machine learning work, then these topics should resonate with you. These topics are being addressed by some of the largest open source projects coming from academia and industry. While many existing technologies try to solve these problems, the RISELab has developed a stack of technologies that can be used together to create an end-to-end solution for deploying an AI system.
The RISECamp Stack:
- Ray: distribute existing RL and ML code across a cluster with minimal changes
- Clipper: prediction serving; supports a variety of frameworks; horizontally scalable; online scoring for real-time decisions
- PyWren: serverless architecture; Python API; no need to manage a cluster
- Ground: data context service
- Wave: decentralized authorization using blockchain, focused on IoT devices
Before jumping into more detail on each of these components, I’ll provide a short background on Reinforcement Learning and how it compares to the more familiar supervised and unsupervised learning methods.
Supervised learning is arguably the most popular form of machine learning, so it provides a good base for comparison. It is used for problems where some ground truth is already known for existing data. Data used for these types of problems will have some attribute that can be identified as the target. For a classic example, in order for a supervised learning system to predict whether a loan applicant should receive a loan, the historical data needs a field indicating whether each borrower paid back the loan or defaulted. We can translate this into a binary classification problem where we predict whether a new applicant will pay back the loan or default, based on the historical data and the target labels. Supervised learning typically falls into two major groups: classification and regression. Classification is the term used when the target being predicted takes discrete values (e.g. pay off loan or default). Regression problems, on the other hand, attempt to predict a target that is a continuous value (e.g. predicting the price of a home).
In contrast to Supervised learning, Unsupervised learning starts with no ground truth. This means for all the historical data, there are no labels available relevant to the problem at hand. A primary example of Unsupervised learning is a technique called clustering. Clustering works by taking data and grouping/clustering/segmenting the data into groups that were calculated to be similar by the clustering algorithm fit to the data set. A concrete example of this is taking all the historical data on customers for a business and clustering them into groups based on some dimensions of their behavior.
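To make the clustering idea concrete, here is a minimal sketch of k-means on one-dimensional data, written in plain Python. The customer spend figures and the starting centroids are made up for illustration, and a real project would more likely reach for a library implementation.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Cluster 1-D values around the given starting centroids."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, a in zip(values, assignments) if a == k]
            if members:  # leave an empty cluster's centroid where it is
                centroids[k] = sum(members) / len(members)
    return centroids, assignments


# Hypothetical monthly spend per customer, in dollars.
monthly_spend = [12, 15, 14, 90, 95, 100]
centroids, groups = kmeans_1d(monthly_spend, centroids=[0.0, 50.0])
```

No labels are supplied anywhere: the two groups (low spenders and high spenders) emerge purely from how close the values are to each other.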
Though Reinforcement Learning (RL) is a distinct third approach in machine learning, it is closer to supervised than to unsupervised learning. Despite that similarity, RL is focused on sequential decision making, unlike traditional supervised learning, which makes a decision for a single point in time. A common RL application is robotics, where an agent is interacting with its environment and needs to make decisions in real time. Many robots used for repetitive tasks (e.g. in a manufacturing process) do not need reinforcement learning because they can be explicitly programmed to handle their task, since the environment they operate in is extremely limited. Reinforcement learning shines in robotics when the robot (in RL terminology: the agent) has a changing environment or needs to make decisions on tasks without being explicitly programmed. To frame this type of use case more formally as a reinforcement learning problem, we need to define some core concepts.
- Agent: either a simulated actor like an opponent in a video game, or a physical device like a robot or self driving car
- State: a description of the environment the agent inhabits at any point in time. This description is dependent on the inputs fed to the agent in the system.
- Action: after evaluating the state, the action is what the agent decides to do. The scope of the action is defined by what the agent has the ability to control. In the example of a self driving car, this could be the amount of acceleration and the degree to which the wheels should turn.
- Reward: This component of a RL problem is what links it closer to supervised learning. A reward is then fed into the system based on the actions the agent has taken. Coming back to a self driving car again, the reward could be based on driving safely, smoothly, or getting to a destination in a reasonable amount of time (we’d like all three).
- Environment: This is the world the agent operates in. For a simulated agent in a video game, this consists of the constraints imposed by the game, which dictate which actions are possible. In the self driving car example, the environment is the real world: the streets that the car will be driving on.
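The concepts above fit together in a simple loop: the agent observes a state, picks an action, and the environment answers with a new state and a reward. The following is a minimal sketch of that loop; the `Corridor` environment and the `always_right` policy are hypothetical toys invented for this post, not part of any RISELab tool.

```python
class Corridor:
    """Environment: a line of cells; the episode ends at the last cell."""

    def __init__(self, length=5):
        self.goal = length - 1
        self.position = 0  # the state is just the agent's current cell

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (state, reward, done)."""
        self.position = max(0, min(self.goal, self.position + action))
        done = self.position == self.goal
        reward = 1.0 if done else 0.0  # reward only arrives at the goal
        return self.position, reward, done


def always_right(state):
    """A fixed, hand-written policy; a learning agent would improve on this."""
    return +1


env = Corridor()
state, done, total_reward, steps = env.position, False, 0.0, 0
while not done:
    action = always_right(state)            # agent picks an action from the state
    state, reward, done = env.step(action)  # environment returns state and reward
    total_reward += reward
    steps += 1
```

Note that the reward is zero for every step except the last one, which is exactly the situation that makes reinforcement learning different from labeling each decision individually.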
It could be argued that since the reward is known, we have ground truth, and reinforcement learning problems could be handled using supervised learning techniques. Although supervised learning can have some success, it falls short in updating models in real time with new training data.
Continuous learning is another subset of machine learning, one that updates the model as observations are misclassified in order to improve performance over time. It also falls short for this kind of decision making, because there are times when a decision can’t be scored as a success or failure until many other decisions have been made. An example is an agent playing chess (or Go), where many moves must be made before the final label (a win or loss of the match) is known. The best approach to problems that fit this model is reinforcement learning, because it adjusts probabilities for all of the decisions made rather than for a single decision.
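One common way to spread a delayed outcome back over every decision in a sequence is the discounted return, where each step is credited with its own reward plus a discounted share of everything that follows. The sketch below is a generic illustration of that idea, not a description of any specific algorithm shown at the camp; the reward values and the discount factor `gamma` are chosen only to keep the arithmetic easy to follow.

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t = r_t + gamma * G_(t+1) for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three moves with no immediate feedback, then a win worth 1.0 at the end.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
credit = discounted_returns(episode_rewards, gamma=0.5)
```

Every move ends up with a non-zero value even though only the final move was rewarded, and moves closer to the win receive more of the credit.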
Ray makes it incredibly easy to distribute Python functions across a cluster of machines. This is accomplished by using the Ray Python package to decorate functions or classes, allowing them to be executed in parallel using a very simple syntax.
An example from the Ray tutorial shows how to convert an existing python function to one that runs in parallel on a cluster:
    import ray

    ray.init()  # start Ray on the local machine

    # A regular Python function.
    def regular_function():
        return 1

    # A Ray remote function.
    @ray.remote
    def remote_function():
        return 1
Adding the @ray.remote decorator before remote_function() enables it to be called in a special way that allows parallel processing. Calling this function to run in parallel uses this syntax:
    for _ in range(4):
        remote_function.remote()
That’s it! Each call to .remote() returns immediately and schedules the function as a task, so the four invocations in the loop can run in parallel rather than one after another. This is a very simple function with no arguments, but if any arguments are needed they are passed to .remote() just as they would be passed to the plain function when calling it without Ray.
That covers how to distribute a function across cores or across nodes in a cluster. Getting the returned values back from the function is just as easy with Ray. The following code snippet, also from the Ray tutorial, compares traditional function returns with Ray remote function returns:

    >>> regular_function()
    1
    >>> remote_function.remote()
    ObjectID(1c80d6937802cd7786ad25e50caf2f023c95e350)
    >>> ray.get(remote_function.remote())
    1
This simple example shows that calling a remote function returns an object ID rather than the result itself; the ID acts as a future for a value that may still be computing. Passing that ID to ray.get() blocks until the task finishes and returns the actual value.
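If the future pattern is new to you, the Python standard library offers a close analogy that runs without installing anything. The sketch below uses concurrent.futures rather than Ray, so it only parallelizes across threads on one machine, but the shape of the code is the same: submit work, get a placeholder back right away, and ask for the value when you need it.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns immediately with a Future, much as .remote()
    # returns an object ID in Ray.
    futures = [pool.submit(square, n) for n in range(4)]
    # result() blocks until the value is ready, playing the role of ray.get().
    values = [f.result() for f in futures]
```

The main difference is scope: Ray can place the same kind of task on any node of a cluster, while this executor stays inside a single Python process.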
This only scratches the surface of what Ray offers, but it should trigger some ideas about the potential use cases this library could handle. In the context of RISECamp, Ray is the utility used to simulate reinforcement learning policies on any number of agents in parallel to speed up the simulations. Many other use cases, like hyperparameter optimization, could also easily benefit from a distributed processing system like Ray.
So you used Ray to simulate many policies for a given RL problem and found the optimal policy; now what? This is the age-old question for machine learning, where a model is built by data scientists and needs to find its way into a production environment in order to bring true value to the system, application, or organization. This is an area where even advanced data science teams can struggle for a number of reasons. Clipper offers a framework for deploying models, created from any of the popular open source machine learning libraries, that can serve predictions with low latency. Check out the repository for Clipper for a quickstart guide.
Clipper is somewhat early in its development, but has a very strong approach to the challenge of model serving. IBM has developed a product called Watson Machine Learning (announced here) to handle this challenge. Watson Machine Learning offers a unified interface that can be driven via code or UI to deploy models from a number of frameworks (SPSS, SparkML, Scikit-Learn, etc.). With a single tool to deploy any type of model, data scientists have the flexibility to use any machine learning library they want, and application developers can pick up these models and embed them in systems and applications for real-time decisions.
Just like we saw with Ray, there are many use cases where a data scientist needs to distribute their processes across many machines. PyWren is another Python package coming from the RISELab that focuses on this problem, with a different approach and goal in mind. Unlike Ray, where the true benefits are realized using a cluster of machines, PyWren opts for a serverless architecture by pushing Python functions to be executed on AWS Lambda. PyWren provides a general purpose container to be used in Lambda that has the most popular data science Python libraries available with no additional customization needed. Similar to Ray, PyWren accomplishes this in a very clean way that can easily be integrated into existing Python codebases.
To see it in action, take a look at this code snippet which comes from the PyWren tutorial:
    import pywren

    def square(param):
        return param * param

    param_list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

    pwex = pywren.default_executor()
    futures = pwex.map(square, param_list)
    results_with_pywren_map = [f.result() for f in futures]
If you have used map() in Python, this syntax should be very familiar. The difference between a traditional map and the PyWren map is that PyWren maps the function across any number of Lambdas. This enables a cost-efficient and horizontally scalable solution for distributing operations that can be run in parallel.
Since this code example is trivial, I’ll go into more detail on a specific use case this could cover. Today, many data teams leverage object storage like Cloud Object Storage or Amazon S3 to store data sets. Teams can use these object storage services for different purposes, but a common pattern is to treat a container/bucket as a sink where new data sets are written as part of some data pipeline. As the number of these files grows, it can be a slow process to apply a function to each file, especially if the function is applied sequentially using something like a loop to iterate over all the files. PyWren shines for this use case by leveraging AWS Lambda to cleanly map functions to all files in a given container/bucket. Using a very simple syntax, PyWren triggers Lambda to apply the plain Python function to each file.
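Here is a rough sketch of what that per-file pattern could look like. To keep it runnable anywhere, the bucket is faked with an in-memory dictionary and the work is dispatched with Python's built-in map; the object keys, the file contents, and the `count_rows` function are all hypothetical. In a real job the function would download the object from the bucket itself instead of reading a local dictionary.

```python
# Hypothetical stand-in for a bucket: object key -> file contents.
fake_bucket = {
    "logs/2017-09-07.csv": "a,b\n1,2\n3,4\n",
    "logs/2017-09-08.csv": "a,b\n5,6\n",
}


def count_rows(key):
    """Count the data rows (excluding the header) in one object."""
    body = fake_bucket[key]  # a real job would fetch the object here
    return key, len(body.splitlines()) - 1


keys = sorted(fake_bucket)
# Built-in map applies the function to one key at a time. With PyWren
# configured, the pwex.map(...) and f.result() calls from the tutorial
# snippet would take its place and fan the work out to Lambda.
row_counts = dict(map(count_rows, keys))
```

The point of the pattern is that `count_rows` knows nothing about how it is dispatched, so moving from a local loop to a serverless fan-out should only change the line that does the mapping.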
So far, I’ve covered many aspects of machine learning but haven’t described much about how to handle the most core component: the data. This is where Ground comes into play for the RISELab. The introductory explanation of Ground provides its ABCs for data context:
The ABCs of Data Context for Ground
- [A]pplication Context: Describes how raw bits are interpreted for use.
- [B]ehavioral Context: Information about how data was created and used by real people or systems.
- [C]hange Over Time: The version history of the other two forms of data context.
By considering the different contexts that are important for data management, Ground provides a nice full picture view of what is needed to manage large data catalogs.
Last, but not least, is the component of the RISELab stack that addresses the challenge of handling security and authorization of IoT devices: Wave.
Wave is the component that is most outside of my expertise, but also the most impressive component from my perspective. I assume there is a correlation :-). Wave provides a novel way of providing authorization for control of devices built on namespaces, public keys, and blockchain. If you take nothing else from this blog, accept the buzzword summary that we are using the cloud to train deep reinforcement learning AI systems authorized with blockchain. Take a look at the Wave exercises from RISECamp here.
In summary, RISECamp was an amazing experience where the attendees were lucky to get a sneak peek at the next generation of open source projects coming from UC Berkeley. I was very impressed by RISELab’s ability to target some of the most common challenges facing machine learning engineers in a modular and complementary way. I am very much looking forward to following each of the projects described here, and I encourage you to do the same.