Learn how we efficiently build pipelines for our ML-based security detections by leveraging Snowpark
Today we have what I call "modern" conveniences for developing machine learning (ML) solutions, but do they really make my day job that much easier? Fifteen years ago, when I started using machine learning to help secure data and machines, developing solutions for various cybersecurity tasks (detection, governance, DLP), I would launch Weka 3, do feature engineering, and compare decision trees and linear SVMs to see which performed better while my Dell Inspiron 1501 overheated. So sure, fast forward to 2022: we have Jupyter Notebooks, unlimited resources in the cloud, easy-to-use deep learning frameworks, specialized hardware, and readily available datasets and pre-trained models. While all these conveniences have helped, I found it could still be challenging to deploy arbitrarily complex deep learning cybersecurity models that perform inference on large datasets. That is, until Anvilogic started leveraging Snowflake’s Snowpark for Python, which made running my models, however complex, on big data straightforward and low-friction.
Snowflake offers us tremendous flexibility to leverage the ML tools best suited for our task and to deploy at scale. In particular, we heavily use two features in Snowflake to apply our deep learning models against customer logs.
- We use the Snowpark API for ELT in our data pipelines. Our Data Science team finds the API to be intuitive and productive. These data pipelines are either launched from our platform via the client-side Snowpark Python connector or run as stored procedures invoked by scheduled tasks on Snowflake.
- We containerize ML models with Python UDFs by leveraging the integrated Anaconda repository and package manager. The ability to isolate code with very specific requirements (e.g. libraries, programming language) is a key differentiator because we can seamlessly run our data pipelines with Snowpark and score with our ML models encapsulated in UDFs without fear of dependency issues.
Both features lend themselves to easy, repeatable development and deployment. I’ll describe in detail how we use these features through the following example of a productionized ML detection.
The Problem: Developing ML Models for Detecting Malicious PowerShell
It is crucial for a Security Operations Center (SOC) to monitor all PowerShell invocations. The flexibility of PowerShell enables adversaries to execute malicious code to progress their attacks. This type of attack can be difficult to detect and is certainly a vector seen often by Anvilogic. Machine learning offers a way to detect these attacks using text classification methods that can identify malicious code, even if it is not an exact copy of the attacks on which the model was trained. We have found that using a set of Indicators of Compromise (IOCs) consisting of substrings that frequently occur in attacks (e.g., iex) provides a meaningful feature set for a model. However, the challenge with detecting malicious PowerShell is that the same features observed in attacks are also present in normal administrative PowerShell scripts. Therefore, we leverage deep learning to differentiate between the presence, ratio, and amounts of these features in attack and non-attack samples.
Like any good data scientist, we spent some time trying to find a model that performed well (i.e., detected threats and did not generate too many false positives) and is computationally efficient. We found that a deep feedforward neural network with leaky rectified linear units and batch normalization best satisfied our criteria, particularly in terms of being amenable to class weighting strategies and efficient inference. We coded this model with the Keras API in TensorFlow. Now, this is all gravy when it lives in your Jupyter Notebook where you control the environment, but is it actually deployable in production?
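To make this concrete, here is a minimal sketch of that style of network in Keras; the layer sizes, metrics, and class weights are illustrative rather than our production values.

```python
# A minimal sketch of the architecture (layer sizes, metrics, and class weights are
# illustrative, not our production values).
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features: int) -> keras.Model:
    """Deep feedforward network with leaky ReLUs and batch normalization."""
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(64),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Dense(32),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Dense(1, activation="sigmoid"),  # probability the command line is malicious
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss="binary_crossentropy",
        metrics=[keras.metrics.Precision(), keras.metrics.Recall()],
    )
    return model

# Class weighting lets us penalize missed attacks more heavily than false positives:
# model = build_model(n_features=20)
# model.fit(X_train, y_train, class_weight={0: 1.0, 1: 5.0}, epochs=20)
```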
Snowflake provides features for us to continually experiment with, train, and execute our models in a low-friction way. The key is leveraging Snowflake features to effectively containerize complex computations (like ML model inference) as UDFs with their own dependencies. Therefore, the software used to experiment in my Jupyter Notebook, to train the model, and to make predictions is all the same. There will be no surprises when we go to production with any model because we can effectively unit test the code. And even if we decided to go in a totally different direction (perhaps some recurrent neural network programmed with PyTorch), Snowpark Python UDFs would enable us to lift and ship new models with different requirements without rewriting our data pipelines.
We also use Snowpark as our primary platform for ML data pipelines. We create these pipelines as Snowpark Python stored procedures and use tasks to run our Snowpark Python code on a desired schedule. A lot of time and effort is required to develop an effective model, and we will describe how embedding our models within Snowpark data pipelines enables production-grade model inference that is not time-intensive to develop and deploy.
ML Ops for Snowflake: The Anvilogic Approach
Like many ML projects, ours starts with a Jupyter Notebook. Snowpark is a great tool for helping to query and transform our data because it can return the data as Pandas DataFrames. These DataFrames are then rendered in a useful way for visualization within a notebook so that the data scientist and cybersecurity analyst can quickly view the data. Anvilogic has Docker containers that run a version of Jupyter Notebooks with some additional code for Snowflake authentication. We found that this pattern helped our internal users get started quickly without getting bogged down in the details of session creation. Our internal Snowflake authentication library distributes config files with sections for each of our environments. A simple Python module that addresses the specific authentication requirements in your environment can make end users more productive. In our case, our data scientists can create notebooks that our cybersecurity analysts can use to evaluate results and hunt attacks.
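A hypothetical sketch of such a helper follows; the config path and section layout are illustrative, not our actual internal library.

```python
# session_helper.py -- hypothetical sketch of a session-creation helper.
# The config path and section names are illustrative.
import configparser
import os
from snowflake.snowpark import Session

def create_session(env: str = "dev", config_path: str = "~/.anvilogic/snowflake.cfg") -> Session:
    """Build a Snowpark Session from the config section for the requested environment."""
    cfg = configparser.ConfigParser()
    cfg.read(os.path.expanduser(config_path))
    # Each section holds standard connection parameters (account, user, role, warehouse, ...)
    return Session.builder.configs(dict(cfg[env])).create()

# In a notebook:
# session = create_session("prod")
# session.table("powershell_events").limit(10).to_pandas()
```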
Using the above pattern, I developed a couple of notebooks and found an effective model for detecting malicious PowerShell. This notebook is littered with all the code: preprocessing, model generation, and hyperparameter tuning. But we are not going to run a notebook as production code. We also need to perform unit testing on preprocessing routines so we do not encounter any surprises in production. How do we make the leap from research to operations? Fortunately, we have found patterns with Snowpark that allow us to graduate code from these notebooks and merge it into production pipelines. This process doesn’t require a laborious rewrite of my notebook contents either.
Model Inference Using Snowpark Python UDFs
Since preprocessing code for ML models sometimes has complex logic that is difficult to capture in SQL or even Snowpark functionality, the Anvilogic team created a repository where we develop these preprocessing routines as Python modules. The modules expose a handler method we reference when creating the UDF via SnowSQL. This simple structure makes it fairly easy to deploy arbitrarily complex functions that are effectively containerized so that they can be called easily using SQL or Snowpark.
For our PowerShell detection, we have a special tokenizer, embedded in a UDF, that extracts meaningful features (IOCs) from the command line. This logic requires roughly 20 lines of code using standard Python libraries, and it is critical that features are extracted correctly every time because otherwise the model results will be erroneous (e.g., many false positives or missed attacks). Therefore, it’s important to also unit test the code.
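To illustrate the pattern (this is a simplified sketch, not our production IOC list), the handler module and its unit tests might look like this:

```python
# powershell_preproc/udfs.py -- simplified sketch; the real IOC set is much larger.
import re
from typing import List

IOC_PATTERNS = {
    "iex": re.compile(r"\biex\b", re.IGNORECASE),
    "encodedcommand": re.compile(r"-enc(odedcommand)?\b", re.IGNORECASE),
    "downloadstring": re.compile(r"downloadstring", re.IGNORECASE),
    "bypass": re.compile(r"-exec(utionpolicy)?\s+bypass", re.IGNORECASE),
}

def tokenize_handler(command_line: str) -> List[str]:
    """UDF handler: return the IOC features present in a PowerShell command line."""
    if not command_line:
        return []
    return [name for name, pattern in IOC_PATTERNS.items() if pattern.search(command_line)]


# tests/test_udfs.py -- plain pytest-style tests against the handler
def test_finds_encoded_command():
    assert "encodedcommand" in tokenize_handler("powershell.exe -EncodedCommand SQBFAFgA")

def test_empty_command_line():
    assert tokenize_handler("") == []
```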
Writing our Python libraries and deploying them to Snowflake stages as zip files satisfied all our development requirements. Poetry is the tool we use to manage dependencies and package code. In our build process, we stage our source distribution tarballs on an internal stage using SnowSQL. A Python UDF in Snowflake is essentially a function, and there is no significant difference between testing a handler function and any other function you would normally test. Tests are easy to code since the types passed to your Python function are standard Python types like booleans, integers, and dictionaries. The final piece is registering the UDF. There are different options to do this in a script or programmatically. We have a shell script that uses SnowSQL to first put the code (as a zip file) on an internal stage and then execute SQL to register the function. This keeps the SQL statement concise and readable while making it clear exactly which function in the source code provides the capability.
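As a sketch of what that registration amounts to (stage, file, and handler names are illustrative; our build actually runs the equivalent PUT and CREATE FUNCTION statements through SnowSQL), the programmatic route looks like this:

```python
# Sketch of programmatic UDF registration (illustrative names).
from snowflake.snowpark import Session

def register_tokenizer_udf(session: Session) -> None:
    # Stage the packaged module produced by our build
    session.file.put(
        "dist/powershell_preproc.zip", "@ml_code_stage", auto_compress=False, overwrite=True
    )
    # Register the UDF against the staged zip and its handler function
    session.sql("""
        CREATE OR REPLACE FUNCTION powershell_ioc_tokenize(command_line STRING)
          RETURNS ARRAY
          LANGUAGE PYTHON
          RUNTIME_VERSION = '3.8'
          HANDLER = 'powershell_preproc.udfs.tokenize_handler'
          IMPORTS = ('@ml_code_stage/powershell_preproc.zip')
    """).collect()
```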
Now that our code has been “containerized”, we can call the UDF from Snowpark using SQL statements or the call_udf function on a Snowpark DataFrame. Other consumers can use the UDF directly in SQL (via the connector, the UI, or SnowSQL). In the case of Snowpark, isolating all of the UDF’s dependencies ensures that however you invoke a Snowpark data pipeline, it will not have a dependency conflict with the libraries the UDF requires. Our code became more readable as we isolated different functionalities into module hierarchies (e.g., filtering.udfs), and we were less likely to write excessively long script files containing a lot of disparate functionality.
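For example, with the hypothetical powershell_ioc_tokenize UDF sketched above, the two invocation styles look like this:

```python
# `session` is a Snowpark Session; table and column names are illustrative.
from snowflake.snowpark import functions as F

# From a Snowpark DataFrame:
events = session.table("powershell_events")
tokenized = events.with_column(
    "ioc_tokens", F.call_udf("powershell_ioc_tokenize", F.col("command_line"))
)

# Or directly in SQL, e.g. for analysts working in the UI or SnowSQL:
tokenized_sql = session.sql(
    "SELECT command_line, powershell_ioc_tokenize(command_line) AS ioc_tokens "
    "FROM powershell_events"
)
```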
From SageMaker to Snowflake: Deploying ML models
There is certainly no shortage of ways to perform ML inference within Snowpark. For the Anvilogic team, the UDF pattern offers a way to encapsulate our functionality so data scientists can use their preferred tooling while not introducing dependency conflicts into other models and pipelines. Python UDFs offer the ability to specify dependencies and which files should be loaded into the runtime environment of the handler code. By leveraging both capabilities, we can make ML deployment into Snowpark easy.
We created an internal stage for model files. These model files can be any serialized format from a supported framework (e.g., ONNX protocol buffer files). When generating the SQL statement for UDF creation, we use the imports parameter to add the model to the UDF runtime. Our code has a small utility function that makes it easy to find and load these models in the UDF. Since the models are trained in SageMaker (more on this later), the utility resolves the folder Snowflake loads imports into and then loads the model from it.
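A minimal sketch of that utility, assuming the SavedModel is staged as a zip archive and relying on Snowflake's documented snowflake_import_directory option (module, file, and path names are illustrative):

```python
# su.py -- sketch of our model-loading utility (names are illustrative).
import os
import sys
import zipfile

IMPORT_DIRECTORY_NAME = "snowflake_import_directory"

def import_dir() -> str:
    """Resolve the folder Snowflake extracts staged IMPORTS into at UDF runtime."""
    return sys._xoptions[IMPORT_DIRECTORY_NAME]

def load_model(archive_name: str = "powershell_model.zip"):
    """Unzip the archived SavedModel and load it with TensorFlow."""
    import tensorflow as tf
    extract_to = "/tmp/powershell_model"
    with zipfile.ZipFile(os.path.join(import_dir(), archive_name)) as zf:
        zf.extractall(extract_to)
    return tf.keras.models.load_model(extract_to)
```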
Now, in the UDF handler's module, I load the model at global scope with our utility (called su) and then use the model in the handler. You might notice that I’m using TensorFlow in the example below; it’s important to follow Snowflake’s best practices and use predict_on_batch to get the best performance in your handler.
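Putting the pieces together, a sketch of the scoring UDF's module might look like the following; the handler name, feature layout, and vocabulary are illustrative.

```python
# powershell_scoring/udfs.py -- sketch of the scoring UDF handler (illustrative names).
import numpy as np
import su  # the model-loading utility sketched above

# Load the model once at module import time so every row scored by this UDF reuses it
MODEL = su.load_model("powershell_model.zip")
IOC_VOCAB = ["iex", "encodedcommand", "downloadstring", "bypass"]  # abbreviated

def score_handler(ioc_tokens: list) -> float:
    """Return the model's malicious-probability score for one command line's IOC tokens."""
    features = np.array(
        [[1.0 if ioc in ioc_tokens else 0.0 for ioc in IOC_VOCAB]], dtype="float32"
    )
    # predict_on_batch avoids the per-call overhead of model.predict inside the UDF
    return float(MODEL.predict_on_batch(features)[0][0])
```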
You may have concerns that deep learning models will not run well on warehouse hardware. We have not had performance problems yet with our deep learning models, and much of our code runs efficiently on XS warehouse sizes. We should note a few things. Our deep learning models are significantly shorter and narrower than many of the models you may read about, like transformers (even size-optimized ones like ALBERT) and computer vision models. I would argue that many of our future deep neural networks (and perhaps ones you may train) will be moderately sized. Using a runtime such as ONNX Runtime or TensorFlow that can leverage CPU linear-algebra optimizations should provide fast enough inference on Snowflake warehouses for deep learning models. If this is not the case, I recommend leveraging specialized hardware on your cloud of choice and using external functions for inference.
Make it run
Choosing Snowpark for your data pipelines allows you to perform data manipulation and ML inference in an intuitive and productive way. Now you need to make it run! Historically, I have found other platforms too inflexible for deploying more sophisticated ML detections. Fortunately, with Snowflake, there is a simple way to get our jobs to run.
First, we use stored procedures in Snowpark for our data pipelines. The stored procedures are simply Python functions in a module that take a session as the first parameter, followed by zero or more other parameters. We encapsulate our entire pipeline in this function, which is archived in a zip file along with any auxiliary functions we may require. I also include a main guard in these files so that, if I ever need to run the script in a container in the future, I can write a script that reuses the handler code.
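A sketch of one such handler, with illustrative table names, UDF names, and thresholds:

```python
# detection_pipeline.py -- sketch of a Snowpark stored procedure handler (illustrative names).
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F

def detect_malicious_powershell(session: Session, lookback_hours: int = 24) -> str:
    """Score recent PowerShell events with our UDFs and append detections to a results table."""
    events = session.table("powershell_events").filter(
        F.col("event_time") >= F.dateadd("hour", F.lit(-lookback_hours), F.current_timestamp())
    )
    scored = events.with_column(
        "score",
        F.call_udf(
            "powershell_ioc_score",
            F.call_udf("powershell_ioc_tokenize", F.col("command_line")),
        ),
    )
    detections = scored.filter(F.col("score") > 0.9)
    n = detections.count()
    detections.write.save_as_table("powershell_detections", mode="append")
    return f"wrote {n} detections"

if __name__ == "__main__":
    # Main guard: reuse the same handler outside Snowflake (e.g., from a container)
    from session_helper import create_session  # hypothetical; see the notebook helper above
    detect_malicious_powershell(create_session("dev"))
```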
We run our detection logic as stored procedures inside tasks, which give you options for when to invoke them, including cron-like schedules. Cron-like scheduling is how we run our tasks because it matches how often the jobs need to run. Tasks can also run when new data is present in a table; that strategy does not fit our jobs well, but combining a task with a table stream lets a stored procedure run only when there is new work. Finally, tasks can be invoked manually, which lets you rerun them after an outage or when data is backfilled; in those situations, you can rerun analytics with SnowSQL scripting.
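As an illustration (the task name, warehouse, and schedule are examples), the scheduling boils down to SQL along these lines, run once as setup:

```python
# One-time setup (illustrative names): schedule the stored procedure with a cron-like task.
# `session` is a Snowpark Session; we run the equivalent SQL from SnowSQL as well.
setup_statements = [
    """
    CREATE OR REPLACE TASK run_powershell_detection
      WAREHOUSE = detection_wh
      SCHEDULE = 'USING CRON 0 * * * * UTC'  -- hourly
    AS
      CALL detect_malicious_powershell(24)
    """,
    # Tasks are created suspended, so resume it to start the schedule
    "ALTER TASK run_powershell_detection RESUME",
]
for stmt in setup_statements:
    session.sql(stmt).collect()

# Manual rerun after an outage or a backfill:
# session.sql("EXECUTE TASK run_powershell_detection").collect()
```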
Each task run records an entry retrievable through the task_history table function. Failed runs are recorded there as well; we periodically query this function for failed tasks and alert their owners. Setting up a monitoring capability for tasks and stored procedures is straightforward and easy to integrate into existing monitoring applications.
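A sketch of that periodic check (the time window and the alerting hook are illustrative):

```python
# Periodic check for failed task runs over the last 24 hours (illustrative window).
# `session` is a Snowpark Session.
failed_runs = session.sql("""
    SELECT name, scheduled_time, error_message
    FROM TABLE(information_schema.task_history(
        scheduled_time_range_start => DATEADD('hour', -24, CURRENT_TIMESTAMP()),
        result_limit => 1000))
    WHERE state = 'FAILED'
""").collect()

for run in failed_runs:
    # Hook this into whatever alerting integration you already use; we page the task owner
    print(f"Task {run['NAME']} failed at {run['SCHEDULED_TIME']}: {run['ERROR_MESSAGE']}")
```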
Preliminary Assessments: ML Ops, Training and Inferencing
We have been running our detection in Snowflake for over a month, and so far we have had no operational issues. Shortly after deploying this detection, we found malicious activity for one of our customers. The quick return on investment is attributable to the reliability and performance of Snowflake and to identifying best practices for deploying ML-based analytics.
Every week we retrain this model. We found that scheduling a Snowpark Python stored procedure with a task gave us the flexibility to build our training sets easily. We use internal stages to store known examples of malicious activity. We query our tables for events with PowerShell to form our negative dataset (or, more accurately, unlabeled dataset). Snowpark allows us to load both data sources as DataFrames, enabling us to use a standard set of operations to transform the data. Our training set is saved to an external stage, with features and labels serialized using the TensorFlow Example protocol buffer. This serialization format is customized for our use case, but Snowpark is highly flexible, and you could easily save your training datasets in JSON, CSV, or Parquet.
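A simplified sketch of that training-set build, with illustrative stage, table, and column names, and Parquet output in place of our TF Example serialization:

```python
# Sketch of the training-set stored procedure (illustrative names; we serialize to
# TF Example protocol buffers in production, Parquet is shown here for simplicity).
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from snowflake.snowpark.types import StringType

def build_training_set(session: Session) -> str:
    # Known-malicious command lines staged as JSON files on an internal stage
    malicious = (
        session.read.json("@malicious_powershell_stage")
        .select(F.col("$1")["command_line"].cast(StringType()).alias("command_line"))
        .with_column("label", F.lit(1))
    )
    # Recent customer PowerShell events form the negative (really, unlabeled) set
    unlabeled = (
        session.table("powershell_events")
        .select("command_line")
        .with_column("label", F.lit(0))
    )
    training = malicious.union_all_by_name(unlabeled)
    # Export to the external stage our SageMaker training job reads from
    training.write.copy_into_location(
        "@training_data_stage/powershell/", file_format_type="parquet", overwrite=True
    )
    return "training set exported"
```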
External stages allow you to leverage reactive patterns for training ML models. We are currently working on automating the training process by leveraging S3 notifications to create new training jobs on SageMaker. It is technically feasible to train an ML model entirely in a stored procedure, but for our use case it’s prudent to take advantage of Snowflake’s capabilities to easily and efficiently export data. Our SageMaker training jobs not only train a deep neural network but also perform hyperparameter optimization. We use KerasTuner, which provides different algorithms for discovering optimal network structures and hyperparameters. We optimize the network’s width and depth and the learning rate using the Hyperband algorithm. To do this efficiently, we leverage GPU-enabled SageMaker instances for our training jobs.
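A sketch of that search with KerasTuner's Hyperband (the ranges and feature count are illustrative):

```python
# Hyperband search over depth, width, and learning rate (illustrative ranges).
import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 20  # placeholder for the size of our IOC feature vector

def build_tunable_model(hp: kt.HyperParameters) -> keras.Model:
    model = keras.Sequential([layers.Input(shape=(N_FEATURES,))])
    for i in range(hp.Int("depth", 1, 4)):
        model.add(layers.Dense(hp.Int(f"width_{i}", 16, 128, step=16)))
        model.add(layers.BatchNormalization())
        model.add(layers.LeakyReLU())
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(
        optimizer=keras.optimizers.Adam(hp.Float("learning_rate", 1e-4, 1e-2, sampling="log")),
        loss="binary_crossentropy",
        metrics=[keras.metrics.AUC(name="auc")],
    )
    return model

tuner = kt.Hyperband(build_tunable_model, objective=kt.Objective("val_auc", "max"), max_epochs=30)
# tuner.search(train_ds, validation_data=val_ds)
# best_model = tuner.get_best_models(num_models=1)[0]
```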
Once the training job is complete, we export our model in the TensorFlow SavedModel format. This model is loaded to a Snowflake stage that our UDF will load from on its next invocation; therefore, we do not need to recreate the UDF or update our ML preprocessing, post-processing, or inference code. The SavedModel format describes the model’s operations and parameters (e.g., weights, biases) in a hardware-agnostic way, so we can train on GPUs in AWS and confidently run inference on CPUs in Snowflake warehouses.
Final thoughts
As you can glean from our usage of Snowflake, it is a reliable and performant big data platform on which to build analytical pipelines. We have found that Snowpark enables us to efficiently build pipelines for our ML-based security detections. Critically, we use UDFs to “containerize” our ML preprocessing and inferencing, which enables us to isolate dependencies and logic for critical operations, test them thoroughly, and make them available to our pipelines, as well as for ad hoc usage in notebooks, SQL, and SnowSQL. We use stored procedures and tasks to run and monitor our analytics. External stages and dataset creation within tasks facilitate training in the cloud. Leveraging portable model formats, like SavedModel or ONNX, enables us to use these models within Snowflake UDFs for inference. As we learn more, I look forward to sharing with you again, and I hope to hear and learn how others make Snowpark and Snowflake work for them.