On Friday, critical operations for thousands of organizations came to a standstill, not as the result of a cyberattack, but due to a defect in a widely deployed EDR product. There's been a great deal of focus in the media on "how" this happened, and while that's important to understand, the primary question coming from most security teams in those organizations is "how do I set things up in the future so I can get back up as fast as possible and make sure my organization is safe?"
We believe security is a subset of operational resilience, the ability of an organization to maintain its core functions and operate in the face of adverse events. Security leaders need help to both mitigate cybersecurity threats and simultaneously lead organizations through incidents when things go wrong. From our experience, this means reducing the number of single points of failure that exist in your environment by building multiple layers of defense, using open and flexible solutions, and partnering with experts who can help.
There is no one-size-fits-all approach to resilience, but there are some best practices that can help organizations.
- Adopt open solutions. To be flexible and adapt quickly in the face of something unexpected, we recommend open security solutions that allow you to pivot and experiment with multiple vendors and technology providers as needed. This approach allows you to incrementally evolve as you introduce new technologies into your environment.
- Ensure multi-layered security is in place. No single security solution can protect you from all types of threats. We recommend a defense-in-depth strategy that covers endpoints, networks, cloud, identity, email and data. A multi-layered security approach helps you cope with disruptions, including outages. Over the weekend, for example, some of our customers at Secureworks® chose to temporarily disable their third-party endpoint protection software. These customers relied on other security controls brought together using the principles of extended detection and response to ensure they were protected.
- Partner with an MDR provider. Resilience also requires having the right skills, resources, and expertise to monitor, analyze, and respond to threats when you are unable to do so by yourself. Many organizations struggle with the shortage of security talent, the complexity of security tools, and the volume of alerts and incidents, especially in a crisis. This is why working with an MDR provider can be a valuable and cost-effective way to enhance your security resilience. An MDR provider augments your security team with experienced and certified analysts, hunters, and responders who can provide 24/7 coverage, threat intelligence, and incident response services. By working with a partner, you can focus on your core business objectives, while your partner supports your security.
By following these best practices, security teams can be prepared to both defend against threats and support business continuity and resilience in critical moments.
Many of our customers, partners and prospects have asked about the Secureworks endpoint agent and the process we use to ensure integrity and resilience. We do not claim any process is foolproof; however, we want to use this opportunity to highlight the best practices we've learned and follow. Below is our process for ensuring the software quality of our Taegis™ endpoint agent.
How do we reduce risk during software development and deployment at Secureworks?
Our quality assurance process includes internal testing, beta testing with customers, gradual deployment, and best practices to avoid running untested code.
For software updates to the Taegis agent, we recommend that customers participate in programs that help them test how new versions of the agent will perform in their environment. This includes beta and preview tiers (which customers must opt into). We always do, and always will, start with gradual production rollouts that begin with less than 1% of customer endpoints.
The Taegis agent focuses on collecting telemetry and responding to customer-driven requests, such as isolating hosts. This improves both our resiliency and our security. We have never had the need to push down "content updates" that modify the kernel driver. This cloud-first approach supports our ability to detect historical threats after telemetry has been collected even if they were unknown at the time.
When response actions or other limited configuration changes must be sent to the agent, they are first sent to a user-level application where they can be safely parsed and validated before being passed to the kernel driver. Most of these changes are performed on a small scale as directed by the customer or their MDR provider. Secureworks treats any wide-scale configuration change like a software update: it is tested in the same way and then rolled out slowly in phases, starting at less than 1%. We have never sent pre-compiled code/content updates to our endpoint agents. We focus on collecting all potentially relevant telemetry from the agent and performing our detections in the cloud where we can combine endpoint data with our other telemetry.
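To illustrate the idea of parsing and validating a request in user mode before anything reaches the kernel driver, here is a minimal Python sketch. It is not the Taegis agent's implementation: the JSON message format, the field names (`action`, `host_id`), the allowed action names, the size limit, and the `send_to_driver` callback are all assumptions made up for this example.

```python
import json

# Hypothetical limits and action names, chosen only for this illustration.
MAX_MESSAGE_BYTES = 4096
ALLOWED_ACTIONS = {"isolate_host", "release_host"}


class ValidationError(Exception):
    """Raised when a message must not be forwarded to the driver."""


def parse_and_validate(raw: bytes) -> dict:
    """Parse an untrusted message and return it only if every check passes."""
    if len(raw) > MAX_MESSAGE_BYTES:
        raise ValidationError("message too large")
    try:
        msg = json.loads(raw.decode("utf-8"))
    except (ValueError, RecursionError) as exc:
        # ValueError covers both invalid UTF-8 and invalid JSON.
        raise ValidationError("malformed message") from exc
    if not isinstance(msg, dict) or set(msg) != {"action", "host_id"}:
        raise ValidationError("unexpected message shape")
    action, host_id = msg["action"], msg["host_id"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValidationError("unknown action")
    if not isinstance(host_id, str) or not 0 < len(host_id) <= 64:
        raise ValidationError("invalid host_id")
    # Return a fresh dict so only the validated fields are passed on.
    return {"action": action, "host_id": host_id}


def handle_message(raw: bytes, send_to_driver) -> bool:
    """Forward a message to the driver only after it validates in user mode."""
    try:
        msg = parse_and_validate(raw)
    except ValidationError:
        return False  # rejected here; the driver never sees it
    send_to_driver(msg)
    return True
```

The point of the design is that a malformed or unexpected message fails in an ordinary user-level process, where the failure can be handled and logged, rather than in kernel code where a parsing fault can take down the whole machine.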
Our Software Update Process:
- Automated Testing: Every update to the Taegis agent has a rigorous automated test procedure against many versions of Windows to ensure the build is successful, the agent is installed properly, communication is established with the backend, and the agent can be uninstalled. All agent changes go through this process.
- Internal Rollout: Each release candidate is run internally by our Engineering teams before it is released to the Beta Tier.
- Beta Tier: Initial customer testing on a controlled set of machines where customers have opted in. This also includes a wider Secureworks deployment on our own systems. We recommend that all Taegis agent customers utilize test machines in this tier to assist in early detection and resolution of potential issues.
- Preview Tier: Expanded testing to a wider set of machines for customers that would like to test new versions of the agent after it has been tested elsewhere. If customers are not comfortable with the Beta Tier, we recommend they use some test machines within the Preview Tier to assist in early detection and resolution of potential issues.
- Production Rollout: Gradual deployment, starting at less than 1% of endpoints and manually increased in increments if no issues are observed after several days. This limits the impact of any problem and allows quick adjustment if needed.
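
The gradual production rollout described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how Secureworks selects endpoints: the hash-based bucketing, the stage sizes in `STAGES_BP`, and the function names are all hypothetical. The sketch shows one common way to pick a small, stable slice of endpoints and to hold a rollout when issues are observed.

```python
import hashlib

# Hypothetical stage sizes in basis points (1 bp = 0.01% of endpoints).
# The first stage, 50 bp, is 0.5%, i.e. less than 1%.
STAGES_BP = [50, 500, 2500, 10_000]


def bucket(endpoint_id: str, release: str) -> int:
    """Map an endpoint to a stable bucket in the range 0..9999 for a release."""
    digest = hashlib.sha256(f"{release}:{endpoint_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def is_eligible(endpoint_id: str, release: str, stage_bp: int) -> bool:
    """An endpoint receives the release once the stage size exceeds its bucket."""
    return bucket(endpoint_id, release) < stage_bp


def next_stage(index: int, issues_observed: bool) -> int:
    """Hold the current stage if issues were observed, otherwise move up one."""
    if issues_observed:
        return index
    return min(index + 1, len(STAGES_BP) - 1)
```

Because each endpoint's bucket is fixed for a given release, widening the stage only adds endpoints: every machine that received the update at 0.5% is still included at 5%. In this sketch the decision to call `next_stage` would remain a manual one, matching the manually increased rollout described above.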