Scaling ThreatMapper: Detecting Threats on 100k Servers, 1000s of Cloud Accounts, 2500 K8s Clusters, and Beyond

June 1, 2023

Deepfence Authors: Thomas Legris, Ramanan Ravikumar, Manan Vaghasiya, Shyam Krishnaswamy

ThreatMapper, the fastest growing open-source CNAPP platform, hunts for threats in production platforms and ranks them by their risk of exploit. It uncovers vulnerable software components, exposed secrets, malware, and deviations from standard security and compliance configurations. ThreatMapper combines agent-based inspection with agentless monitoring for the widest possible threat detection coverage.

Since the launch of the open source platform eighteen months ago, ThreatMapper has seen massive adoption across a wide variety of public, private and hybrid clouds, bare-metal servers, serverless environments like AWS Fargate, and even Raspberry Pi devices. ThreatMapper adds runtime context, such as network flows, to the thousands of scan results to build the ThreatGraph: a rich visualization of the most meaningful and threatening attack paths. This has the potential to reduce the threats found by up to 97%, helping users prioritize remediation of the 3% of threats that are actually exploitable. Some of our users have already installed ThreatMapper on Kubernetes clusters spanning 2,500 Kubernetes nodes, around 20,000 pods and up to 50,000 containers, gaining critical security observability into their risk posture and the ability to respond to threats at runtime.

Since ThreatMapper provides unprecedented visibility across the entire infrastructure, we asked ourselves – how can we meet the demands of users who want to cover 100,000 Kubernetes nodes? Or 100,000 regular EC2 servers? Will it hold if we push the boundary to 200,000 nodes or servers? We went back to the drawing board to take a close look at our technology stack. This post explores those architectural changes and how they have helped Deepfence scale its open source CNAPP to cater to organizations of all sizes without spending an additional dollar on compute. No one should have to pay for fundamental building blocks of cloud security like cloud asset inventory, cloud misconfiguration checks, vulnerability management, sensitive secrets and malware scanning. 

Current Architecture

ThreatMapper consists of two components: the Deepfence Sensors and the Deepfence Console. The sensors are simple, lightweight eBPF probes deployed onto Kubernetes clusters, Docker hosts, and bare-metal servers, that is, any environment where users deploy their application workloads. They collect the relevant metadata and send it to the Deepfence Console. The Deepfence Console aggregates the metadata from the sensors and maps it to threats such as vulnerabilities, exposed secrets, malware, and security misconfigurations to build the ThreatGraph.

Figure 1 shows a simple representation of the architecture:

Figure 1 - Old Architecture

As we can see above, the Deepfence sensors collect metadata about processes, containers, and network connections, and send it to the Deepfence Console using a REST API. A persistent WebSocket connection carries control information from the Deepfence Console to the sensors and is used to trigger scans. Once a sensor receives a trigger, it gathers the relevant data about the installed packages and the programs that are currently running, and sends it back to the Deepfence Console using the REST API. The scans are then performed on the Deepfence Console. If a SIEM is configured, Celery jobs are spawned to send the scan results to the SIEM tools asynchronously.

As we started to push the boundaries of scale, we observed:

  1. Visualize real-time changes to the infrastructure – The Deepfence Sensors use a REST API to send data about network connections, processes, containers and pods to the Deepfence Console from the servers on which they are deployed. This data is kept in memory and processed to build a network connection graph: which servers exchange network traffic with each other, and which exchange traffic with the external world. The network connection graph is aggregated with data from various cloud services (load balancers, IAM roles, storage volumes, security groups, etc.) into logically interconnected entities to build the topology graph of the entire infrastructure. This aggregation is performed in memory, using dedicated queues to process the data. To accommodate increasing scale, we had to constantly tune this aggregation model by modifying the in-memory processing queues. Our experiments began to show the limits of this model, and hence the need to move to a graph-centric database that could build the graph using its native query model. Further, control information between the Deepfence Console and the sensors travelled over WebSockets, which at scale required additional resources on the Deepfence Console.
  2. Prioritize the threats detected – The Deepfence Console performs aggregate queries on Elasticsearch to fetch the results of various scans, and overlays them on the topology graph to build a graphical depiction of the top attack paths in the infrastructure: the ThreatGraph. At scale, this required additional memory and CPU, as much of the relevant data had to be kept in memory for the duration of the operation.
  3. Backend API – A large part of our backend codebase was written in Python. This was extremely useful in the early stages of development, as we were able to use external application frameworks like Flask and other third-party libraries to rapidly iterate and deploy the platform. However, as the size of the infrastructure that we monitor increased, we had to deploy additional CPU and memory to handle our workflows. We wanted to move to a language that had a robust standard library and support for various application frameworks, and that would also help us process our workflows in parallel.
  4. Integration into existing tool sets – ThreatMapper has native support for integrating with various notification and SIEM channels. In our internal simulations, we noticed that beyond a certain scale Celery, the asynchronous task scheduler that handled these integrations and notifications, began to place a heavy load on our resources.
  5. Workflow Automation – ThreatMapper provides extensive API support that the community uses to automate, extend and integrate its functionality into their existing workflows. We wanted to provide the APIs as library modules, so that users could write their customizations in languages like Python or Golang.

New Architecture

We combined the lessons from the previous architecture with our future requirements to build the new architecture, depicted in Figure 2:

Figure 2 - New Architecture

As we can see above, the first major design decision that we took was to move towards a Graph Database (DB) as our primary datastore. Choosing a Graph Database as the foundation for all our operations was a strategic decision rooted in the fundamental understanding of the nature of users' infrastructure. In the world of cybersecurity, applications, network, data and identities are not isolated entities, but an interconnected web of information. Users' infrastructure, comprising a multitude of compute or cloud services, is inherently a graph. Nodes represent services while edges signify various relationships - be it a compute service tied to a particular user, or communication between different applications, deployed as pods or containers. Embracing a Graph DB is akin to reflecting the organic, interconnected structure of the digital environment within which we operate.

Choosing Neo4j, an open-source Graph DB, further enriches our capabilities, as it embodies the ethos of collaboration, continuous enhancement, and transparency; crucial in the ever-evolving landscape of cybersecurity. Neo4j's robustness and adaptability equipped us to handle myriad use cases and capabilities, bolstering the future potential of our CNAPP solution. The graph-centric data model of Neo4j enables us to intuitively organize and retrieve data, thereby improving efficiency and scalability. With Neo4j, we can explore the limitless potential of interconnected data, and draw powerful insights that empower our users to fortify their security posture. This is a testament to our commitment to harness advanced technologies, to deliver a sophisticated, dynamic, and user-centric cybersecurity solution.

Putting it all together: the Deepfence sensors collect metadata about processes, containers, and network connections, and send it to the Deepfence Console using a REST API. The response to that REST API call now carries any control information from the Deepfence Console to the sensor, such as scan triggers. Once a sensor receives a trigger, it gathers the relevant data about the installed packages and the programs that are currently running, and sends it back to the Deepfence Console using the REST API. The scans are then performed on the Deepfence Console. If a SIEM is configured, worker jobs are spawned to send the scan results to the SIEM tools.
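
To make the change in control flow concrete, here is a minimal in-memory sketch of delivering control information in the response to a sensor's report, rather than over a persistent connection. It is an illustration only, not ThreatMapper's actual code: the `Console` class, its method names, and the message fields are all hypothetical.

```python
from collections import defaultdict, deque


class Console:
    """Toy stand-in for the console's report endpoint (all names hypothetical)."""

    def __init__(self):
        # Pending control actions, keyed by sensor ID.
        self._pending = defaultdict(deque)
        # Metadata reports received so far.
        self.reports = []

    def queue_action(self, sensor_id, action):
        """Store an action (e.g. a scan trigger) until the sensor next reports."""
        self._pending[sensor_id].append(action)

    def handle_report(self, sensor_id, metadata):
        """Handle one report and build the response body.

        The response carries every action queued for this sensor, so the
        console does not need to hold a connection open to reach it.
        """
        self.reports.append((sensor_id, metadata))
        actions = list(self._pending.pop(sensor_id, ()))
        return {"status": "ok", "actions": actions}


console = Console()
console.queue_action("sensor-1", {"type": "start_vulnerability_scan"})

first = console.handle_report("sensor-1", {"processes": 42})
second = console.handle_report("sensor-1", {"processes": 43})
print(first["actions"])   # the queued scan trigger
print(second["actions"])  # empty: the action was already delivered
```

The trade-off in a design like this is latency: an action waits until the sensor's next report, so the reporting interval bounds how quickly a scan can start.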

Since the focus of the new architecture was to build for scale:

  1. Visualize real-time changes to the infrastructure – The Deepfence Sensors use a REST API to send data about network connections, processes, containers and pods to the Deepfence Console from the servers on which they are deployed. This data is now sent to Kafka topics. Individual Kafka workers use a “divide and conquer” strategy on the Kafka topics to identify which servers exchange network traffic with each other, and which exchange traffic with the external world. This data is saved into Neo4j using a relationship-centric data model, and we now use Cypher queries to compute the network connection graph. The data from various cloud services (load balancers, IAM roles, storage volumes, security groups, etc.) is also saved into Neo4j as a graph-based data model. Cypher queries over the network connection graph and the cloud service data generate the topology graph, using less CPU and memory than the earlier design. We also removed the WebSockets that carried control information between the Deepfence Console and the sensors. This information is now sent as part of the response to the REST API call from the Deepfence sensors. This further reduces resource usage, as we no longer need to maintain persistent connections with every sensor connected to the Deepfence Console.
  2. Prioritize the threats detected – Since the entire infrastructure is stored as a graph model within Neo4j, the results of the various scans (vulnerabilities, secrets, malware and cloud service configurations) are now attached as attributes to the graph model. This lets us use simple Cypher queries to compute a graphical depiction of the top attack paths in the infrastructure: the ThreatGraph. A ThreatGraph that previously required 100% of one core and 2GB of memory is now computed using 50% of one core and approximately 500MB of memory, a reduction of 50% in CPU and roughly 75% in memory.
  3. Backend API – We moved our entire backend codebase from Python to Golang. For almost all our use cases, we were able to identify application frameworks and native libraries that helped us reduce our resource usage. As we increased the size of the infrastructure that we monitor, we were able to distribute the workloads and process our workflows in parallel.
  4. Integration into existing tool sets – ThreatMapper has native support for integrating with various notification and SIEM channels. With the move from Python to Golang, we were able to handle the notifications and integrations using the native asynchronous worker queue support available in Golang.
  5. Workflow Automation – We moved all existing APIs to the OpenAPI specification. This allowed us to build libraries that users can now add to their automation workflows using programming languages like Python or Golang. We used this ourselves to build a CLI for ThreatMapper.
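
As a rough illustration of point 2, the sketch below attaches scan results to nodes as attributes and then walks the connection graph from the internet to every node that has findings. It is plain in-memory Python standing in for a graph query, not ThreatMapper's implementation; the node names, attribute names, and counts are invented for the example.

```python
from collections import deque

# Hypothetical topology: who connects to whom (directed edges).
EDGES = {
    "internet": ["load-balancer"],
    "load-balancer": ["web-pod"],
    "web-pod": ["api-pod", "cache"],
    "api-pod": ["db-host"],
    "cache": [],
    "db-host": [],
}

# Scan results attached to nodes as attributes (made-up counts).
NODE_ATTRS = {
    "web-pod": {"vulnerabilities": 3},
    "api-pod": {"vulnerabilities": 0},
    "cache": {"vulnerabilities": 0},
    "db-host": {"vulnerabilities": 5, "secrets": 1},
}

# A comparable Cypher query, using hypothetical labels and properties,
# might look like:
#   MATCH p = (:Node {name: 'internet'})-[:CONNECTS*1..5]->(t:Node)
#   WHERE t.vulnerabilities > 0
#   RETURN p


def findings(node):
    """Total number of scan findings recorded on a node."""
    return sum(NODE_ATTRS.get(node, {}).values())


def attack_paths(source="internet", max_hops=5):
    """Shortest path from source to each node with findings, worst first."""
    paths = {}
    seen = {source}
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node != source and findings(node) > 0:
            paths[node] = path
        if len(path) - 1 >= max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return sorted(paths.values(), key=lambda p: findings(p[-1]), reverse=True)


for path in attack_paths():
    print(" -> ".join(path), f"({findings(path[-1])} findings)")
```

Because the findings live on the same nodes as the topology, a single traversal answers both "is it reachable?" and "is it worth attacking?", which is the property the graph model is meant to provide.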

Results

With the Deepfence Console deployed on a single EC2 server with 16 cores and 32GB of memory, we were previously able to handle up to 7,000 servers. We can now scale to 100,000 Kubernetes nodes or even 100,000 EC2 servers.

Figure 3

For ease of navigation at scale, we also provide a tabular representation of the infrastructure.

Figure 4
Figure 5

The new CLI, built as part of the revamped architecture, in action: vulnerability scans while 100,000 servers are being monitored for threats.

Figure 6

We now deploy the Deepfence Console on a three-node Kubernetes cluster, where each node has 8 CPU cores and 32GB of memory.

While our single-node deployment manages 100,000 servers, we have taken it a step further. By adding compute resources, we scaled our console horizontally and tripled that capacity. Today, we are monitoring 300,000 servers, a significant leap from our previous capacity of 40,000 servers. This underscores the scalability of our solution, which is ready to grow as your needs expand.

Figure 7
Figure 8

Conclusion

In conclusion, scaling a security tool like ThreatMapper to monitor hundreds of thousands of servers is no small feat. This exploration of our revamped architecture has demonstrated the capability of our technology to meet the needs of large-scale infrastructures. By transforming our backend and leveraging modern technologies like Golang, Neo4j, and Kafka, we've made ThreatMapper an even more powerful, efficient, and scalable CNAPP solution. In a few weeks, you will see a V2-tagged release in production, with an updated UI that reflects these architecture changes, in an enterprise-grade launch, all within our open source platform ThreatMapper! Stay tuned.

Not only does this provide immense value for organizations managing vast numbers of servers, but it also empowers users with greater insight into and control over their infrastructure security.

Remember, it's not just about spotting vulnerabilities; it's about understanding the risk context and focusing remediation efforts on those threats that could truly impact your infrastructure. With ThreatMapper, you'll navigate your cybersecurity landscape with unparalleled vision and confidence, regardless of your infrastructure's scale. This new architecture also enables the community to implement various use cases easily. Key among those on our roadmap are:

  • We’re introducing the ability to calculate the potential ‘blast radius’ for any given cloud misconfiguration or vulnerability. This means that discerning how a CVE can be exploited, and how the compromise can then spread laterally, becomes as simple as running a Cypher query;
  • The interplay of scan results and runtime alerts in ThreatStryker will be streamlined and more effective, all through a simple graph query.
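
To give a feel for the ‘blast radius’ idea, here is a minimal sketch that treats it as reachability: starting from one compromised node, which other nodes can be reached over observed connections, and in how many hops? This is an assumption about how such a calculation could be framed, written as plain Python over invented data rather than as a Cypher query against a real ThreatMapper graph.

```python
from collections import deque

# Hypothetical connection graph (directed: source -> destinations).
CONNECTIONS = {
    "web-pod": ["api-pod", "cache"],
    "api-pod": ["db-host", "queue"],
    "cache": [],
    "queue": ["worker"],
    "db-host": [],
    "worker": [],
    "batch-host": ["db-host"],
}


def blast_radius(start, max_hops=None):
    """Map every node reachable from start to its hop distance.

    max_hops, if given, limits how far lateral movement is followed.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if max_hops is not None and distance[node] >= max_hops:
            continue  # stop expanding at the hop limit
        for nxt in CONNECTIONS.get(node, []):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    del distance[start]  # the compromised node itself is not counted
    return distance


print(blast_radius("web-pod"))
print(blast_radius("web-pod", max_hops=1))
```

Note that `batch-host` never appears in the result for `web-pod`: it connects to `db-host`, but nothing reachable from `web-pod` connects to it, so edge direction matters when estimating lateral spread.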

Though our shift from Elasticsearch momentarily hinders free text search capabilities, it paves the way for an exciting integration - a self-hosted Security Specific Small Language Model (all in open-source)! This feature will not only function as a search interface but also enable a deeper correlation of the alerts that we detect in the infrastructure. We're turning temporary limitations into stepping stones for significant enhancements, staying true to our commitment to constant innovation and user-centric development. In the upcoming series of blog posts, we will also discuss the ability to monitor changes in security configurations of cloud accounts, without reaching the API limits of the cloud providers.

Join us on this journey to redefine cybersecurity at scale. Embrace the power of ThreatMapper to secure your infrastructure and to transform your cybersecurity posture from reactive to proactive. Make cybersecurity your strength, not your bottleneck.

As always, we welcome interested users to join our community Slack for deeper technical discussions. The code is also always available on GitHub to download and hack on while we tag a formal V2 release over the next few weeks. We always welcome valuable contributions from the community. Finally, if you are interested in working on challenging scale problems involving graphs, and hacking into the exciting world of eBPF, let’s talk!