Bar Talk: The Buy vs. Build Dilemma

Two MSSP security analysts walk into a bar…

“Hey man. How’s things at your new company?”

“Lousy. I work all day and all night and never seem to get ahead of things.”

“Why’s that?”

“Our SIEM just spits out a firehose of alerts all day long and I get to clean up the mess. It’s really wearing me down.”

“Why don’t you guys do something about it? We used to have the same problem, but we bought a management platform from ATA that filters out all the false positives and redundant alerts and then gives us our work priorities. It makes it way easier for us to do our jobs. In fact, I should say we can finally do our jobs – we were never supposed to be ‘human alert filterers.’”

“Well, that’s what I am. Apparently our development team is working on some sort of management platform like the one you describe. They keep telling us it’s coming, but at this point I’m beginning to wonder.”

“So, what are you doing in the meantime?”

“Well, we’re constantly looking for new analysts, but we’re having a really hard time finding people who know what they’re doing. So our team is pretty much overwhelmed all of the time. We try to reduce the alert volume by narrowing criteria and shutting off some product features, which helps a little. On the other hand, we’re creating holes in our defense by doing that, so it’s a bit of a gamble.”

“Why don’t you just buy a management platform like we did? Our SIEM can spit out bad alerts all day long, but I never see them. I just get real events to investigate.”

“I don’t know – my boss says we need something built specifically for our environment, so our development guys are working on it.”

“That’s crazy – that’s like saying you need to build your own firewall because your operations are so unique that commercial ones won’t do. Buy vs. build is a no-brainer in this case.”

“Another no-brainer is my career path…I’m updating my resume. Once I hit one year, I’m gone. You think your company could use me?”

“We’re not looking for new people at the moment. Once we got ATA installed, we found we had enough people to do the job. You’re smart to start looking though, because MSSPs that keep throwing more bodies at incident response aren’t going to be around very long.”

“What makes you say that?”

“Think about it – while you’re spending tons of money on people to manage alert volume, we can spend that money on things our clients actually care about. Your business model is based on finding and paying bodies, ours is based on delivering services. Who do you think wins in the end?”

“Well, we’re not totally backwards. We’re installing some automation tools that are supposed to help us deal with all these alerts.”

“Yeah, but when your basic process is broken, automation just means you’re doing bad things faster. Wasting time on useless alerts is a bad thing, even if you’re doing it quickly.”

“Thanks a lot. I still have five months before I hit a year at this place, and every day feels like a week by the time I go to bed at night. Sure you can’t use someone like me?”

“Sorry…we’re all set. But I can make tonight go faster for you: Hey bartender…two boilermakers!”

Are These the Good Old Days for MSSPs?

Twenty years from now, MSSPs (or whatever it is they evolve into) are going to look back on this time as the “Good Old Days,” when the confluence of the breach epidemic, the security skills shortage, and enterprise IT’s desire to migrate to OpEx models was driving clients to their doorsteps. They won’t be the Good Old Days for everyone, though. Only for those MSSPs that were able to establish a level of operational excellence that, for many, remains elusive today.

Let me explain. When enterprises seek to outsource security services, they are often looking to offload problems onto someone else. Incident response is perhaps the most prominent example of this – enterprise security teams don’t have the time or resources to effectively analyze Everest-sized piles of security alerts, so they offload this problem onto MSSPs.

This is a mixed blessing for MSSPs. On the one hand, business opportunities are plentiful. On the other, when clients offload their cost and complexity problems on you, you’d better be able to achieve a level of operational excellence that inoculates you against those same problems. Otherwise, your clients’ ills will make you sick.

In a sector that has historically thrived on technology hype, the prevalent challenge for incident response teams today, ironically, is the rather boring operations game. They don’t need more overhyped technology; they need better strategy and processes (supported by technology) that enable them to tame the alert overload issue.

Historically, MSSPs and enterprises have taken similar approaches to this problem – hiring more analysts to process the ever-growing volume of alerts. Enterprises have determined this is not a sustainable model, hence their desire to offload alert analysis to MSSPs. MSSPs, in turn, need to find a way to turn this operations nightmare into a sustainable business process. Security orchestration technology can help to some extent, by decreasing the amount of time it takes analysts to investigate and process alerts. This still preserves all of the wasted motion involved with investigating false positives, though, so it is more a way to accelerate a broken process than it is to implement a good process.

The only way to truly fix the alert analysis process is to achieve the obvious – automating alert analysis in a way that makes it unnecessary for a human to ever touch a false positive. MSSPs that accomplish this will be the ones lucky enough to look back on these times as the Good Old Days. Those that don’t…not so much.

Zero Trust is Here – Nobody Trusts You

When we think about the endless procession of breaches in the news, it’s only natural for IT security pros to think “Phew. Glad that wasn’t me.” But the problem is, it is you. Breaches are not events-in-a-vacuum that only impact the breached – to the outside world, they are an indictment of the entire system of collecting, storing and protecting data. Even if you’ve never been breached, your customers have little trust that you won’t be in the future, simply because it seems everyone gets breached sooner or later. It’s like being an honest used car salesman – customers are going to assume you’re crooked anyway.

One by one, breaches are changing ordinary people’s behavior in many ways. A friend of mine (an ordinary not-terribly-technical kind of guy) told me he recently stopped participating in an annual U.S. Labor Department survey that he’d done since he was a teen in the 1970s. This survey asks some pretty personal questions about finances, social habits and the like, and tracks how those evolve over time. He was considered an extremely valuable data source, due to his nearly four decades of participation.

Last year, though, it suddenly occurred to him that this intensely personal information could be horribly embarrassing if it ever got out, so when the survey rep tried to schedule him for this year, he asked what protections were in place for his data. The rep could not explain anything beyond “all of us are sworn to confidentiality,” so he said he would not participate again until someone with some technical knowledge could describe in plain English how his data was being protected. The rep promised to set up a call with such a person, but never did. Instead the rep called back in a few weeks and offered him more money to participate. (He declined, showing that you can’t buy trust.)

What does all this mean? It means that we all face an enormous challenge in restoring trust among the general public. How do we do that? We have to start by looking at ourselves – if we can’t trust our own infrastructure and processes, how on earth can we expect the outside world to trust us? The Cisco 2017 Security Capabilities Benchmark Study found that organizations can investigate only 56% of the security alerts they receive on a given day. When you can’t even look at 44% of the security alerts coming your way, you can’t, by definition, trust your security infrastructure and operations. And if you can’t trust yourself, how do you expect anybody to trust you?

Too Much to Do? You’re a Security Risk

Most major security breaches are not caused by a brilliant attack; they’re caused by a run-of-the-mill attack against a poor defense. The recent Equifax breach is a perfect case in point – the cause was an unpatched web application vulnerability…and the patch was readily available.

I’ll take a wild guess that the Equifax patch issue was not a function of laziness or stupidity. It was more likely a function of security personnel having too much to do, so patching that particular application was somewhere down the “to do” list, giving the attackers time to siphon off 143 million consumer records. Humans having too much to do is the single biggest problem facing the cybersecurity industry. I don’t care how good you are – if you’re given too much to do, you’ll be reduced to mediocrity or worse.

According to researcher K. Anders Ericsson, the optimal amount of time for employees to spend on highly focused work is 4.5 hours per day. After that, performance degrades. The average cybersecurity pro is expected to spend at least double that time in highly focused work – which is how things like Equifax happen.

In the context of threat detection and response, incident response teams have way too much to do, thanks to the thousands of alerts bombarding them every day. And yet, while the solution to this problem would seem to be obvious – reduce the number of useless alerts – the industry’s response has been to avoid the root cause. Rather than focusing on new strategies for reducing alert overload, we see the introduction of additional threat detection technologies (artificial intelligence (AI) being the shiny object du jour), technologies designed to automate the “busywork” of incident response teams, and, when all else fails, throwing more bodies at the problem.

Unfortunately, these approaches are making the problem worse, not better. AI is actually increasing the alert load, automating alert processing is simply increasing the velocity of wasteful activity, and more bodies means higher operations costs for no material benefit. In the world of business clichés, this would qualify as “throwing good money after bad.”

The Equifax breach is bad. But don’t expect it to change anything. As long as people have too much to do, there will be openings in defenses that make it relatively easy for hackers to do their deplorable deeds. Reducing the alert load is critical if we are ever to reduce the likelihood of another Equifax breach.

To Have and Have Not

“I changed jobs and they didn’t have ATA, so I was back to working with a SIEM spitting out a gazillion alerts. It was like going to a company that didn’t have email.” 

--ATA user after moving to an MSSP without ATA

The rise and fall of technology comes at such a rapid pace these days that we often make the mistake of thinking “things have always been this way” when, really, they haven’t. For example, those of us of “a certain age” remember when email replaced faxes and the U.S. Postal Service. We also remember when executives didn’t know how to type (the PC forced them to learn), and when researching a subject required a trip to the library. We even remember when, if you were lucky, your office had a few clunky mobile phones that employees could share when on business trips.

And yet today, we take the PC, email, internet and smartphones for granted. But when they were first rocketing up the adoption curve, these technologies created brief moments of technological disparity where the business world became a landscape of “haves” and “have nots.” (The existence of a typing pool, for example, was a clear indicator of a “have not” during the PC’s ascent.)

As an employee, moving between such companies could mean either rocketing into the future, or lapsing into the past. As our customer said in the quote that starts this post, imagine if you’d fully adopted email at one company, and then moved to another company that still relied on fax and physical letters? Most likely, you’d become disillusioned with your “have not” environment in fairly short order.

In the security world, incident responders are experiencing this same phenomenon. They move from a company that has implemented technology (like, ahem, ATA) that cleanses false positives from its SIEM to one that has not, and it’s like lapsing back into the days of shared mobile phones and typing pools.

In the world of technology, there is a very short fuse for being a “have not.” If you don’t adopt the technology of the “haves” very quickly, it will lead to profound competitive issues as your competition becomes more efficient and effective across every dimension – from operations to employee morale. What’s your situation? Do you work for a “have” or a “have not”?

Lucy and Ethel: Harbingers of SIEM Fatigue

HR company Kronos and advisory firm Future Workplace surveyed more than 600 HR managers earlier this year and identified the top three reasons for employee burnout. “Unfair compensation” finished first (41% rated it no. 1) with “unreasonable workload” and “too much after-hours work” finishing in a tie for second (32%).

While SIEMs certainly cannot be blamed for unfair compensation, they are a prime offender when it comes to the second-place finishers. The torrent of alerts spewing from SIEMs presents incident response teams with an unmanageable workload that, thanks to the advent of mobile technology, follows team members 24x7. Worse yet, much of this workload is soul-crushing as incident responders chase mountains of fruitless false positives.

Combine this “SIEM fatigue” problem with the oft-cited cybersecurity skills shortage, and you have a toxic brew of too few employees chasing too many security events. SIEM fatigue opens other issues as well – such as highly compensated senior security personnel being forced into entry-level event analysis activities, simply because “somebody has to do it.”

Some of us “of a certain age” remember the classic I Love Lucy episode where Lucy and Ethel get jobs as candy wrappers. The candy conveyor is the perfect metaphor for SIEMs today – as the conveyor speeds up, they can’t keep up with the pace of wrapping and start ignoring pieces of candy, or doing things to hide the fact that they can’t keep up (translation: eat the candy). Alas, this is what happens when one gives humans more work than is humanly possible.

There’s more to the alert overload problem than security vulnerability and skyrocketing operations costs. There’s also employee burnout in a time of an acute skills shortage. The solution to burnout is obvious – give employees the tools to reduce the number of alerts to only those that matter, and their work becomes far more manageable and interesting. Lucy and Ethel may have burned out, but your employees don’t have to! 

Oh, My Aching Headcount!

Whether you’re an enterprise or an MSSP, you’re battling today’s acute shortage of cyber-security skills. Depending on whose numbers you believe, there’s something along the lines of 1 million open cyber-security jobs in the world today. Gartner analyst Earl Perkins summarizes the problem best: there is a 0% unemployment rate in cyber security.

According to a survey conducted by Enterprise Strategy Group (ESG) and the Information Systems Security Association (ISSA), 33% of respondents said their biggest shortage of cyber-security skills is in security analysis and investigations. Additional ESG research found that 54% of survey respondents believe their cyber-security analytics and operations skill levels are inappropriate, and 57% feel they’re under-manned and under-skilled in cyber-security analytics and operations.

The age-old cure for any skills shortage is to outsource, and make staffing someone else’s headache. In the cyber-security market, this means turning to MSSPs to augment or replace internal security functions. Considering the data above, it’s not surprising that event analysis and investigation is one of the prime areas of outsourcing for enterprise security organizations.

While outsourcing this function certainly shifts the burden of hiring onto the MSSP, security remains a shared function. We all remember the Target breach, where an outsourced team in India successfully identified the attack, but sent the information to the client as one of hundreds of routine “malware.binary” alerts, which caused the internal security team to overlook the threat. Even though the outsourced team caught the threat, they still included so many other similar-yet-not-important events that the Target internal team could not discern the catastrophic from the trivial. Did the outsourced team do its job? Technically, yes. But practically, no – the client was breached.

Target is no different from any other enterprise – in a world where security incident response teams are inundated by alerts, most of which are unremarkable, it is unreasonable to expect human beings to separate the needle from the haystack with anything approaching a high degree of proficiency. For MSSPs, the stakes of the game are higher. Their entire business is predicated on keeping clients secure. Every alert ignored is a potential lost client and a damaged reputation, so their only option is to increase headcount in an attempt to match the ever-growing flood of alerts. This headcount amounts to serious money that cannot be invested in other parts of the business. We call this “Alert Tyranny,” where MSSP business models are autocratically determined by the need to process alerts. 

There’s been a lot of hype about how automation – particularly security orchestration systems – will rescue beleaguered incident response teams and curb headcount growth. But instead we’re seeing the manifestation of Bill Gates’ two rules of automation:

  • The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency.
  • The second rule is that automation applied to an inefficient operation will magnify the inefficiency.

When it comes to alert overload, automation is not solving the problem. Instead, it is magnifying the inefficiency. Processing more “non-events” does not enable SOC operators to break out of Alert Tyranny, because human beings must perform the analysis and investigation. As a result, automation simply increases the velocity of nonproductive activity, and Alert Tyranny remains in power.

The only way to tackle Alert Tyranny and the headcount beast is to fix the process – and that means dramatically reducing the number of pointless alerts people must analyze. This would not only decrease headcount requirements; it would also make security orchestration systems more effective, since actual threats could be introduced to the orchestration system with much greater accuracy and speed. How does one reduce the number of pointless security alerts? Check out our earlier blog post, The Boy Who Cried “Alert!”

MSSPs: Breaking the Shackles of Alert Tyranny

Earlier this year, Gartner predicted a fundamental shift in enterprise security priorities from attack prevention to detection and response. Gartner also said that because the industry has been fixated on prevention for so long, there is an acute skills shortage in the area of detection and response. “Skill sets are scarce and, therefore, remain at a premium, leading organizations to seek external help from security consultants, managed security service providers (MSSPs) and outsourcers,” Gartner said.

This, obviously, opens a major opportunity for MSSPs. It also brings with it a huge challenge, because the onus is on MSSPs to accumulate the right skills, and to deploy people in a way where those skills are put to optimal use. The latter half of this equation is a vexing challenge today, because many employees are not in the position to use their skills to maximum effect. Nowhere is this more apparent than in incident response teams, where employees must wade through mountains of alerts, most of them false positives, before they can actually do something of value for clients.

A recent Cloud Security Alliance survey found that SOCs report a 110:1 ratio of anomalous events detected to actual threats. In other words, less than 1% of the events being flagged merit attention, and the problem is only getting worse.
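The arithmetic behind that claim can be sketched in a few lines of Python. This is only an illustration: it assumes the 110:1 figure means 110 flagged events for each actual threat, and the function name and the 10,000-event daily volume are made up for the example, not taken from the survey.

```python
def threat_share(events_per_threat: float) -> float:
    """Fraction of flagged events that are actual threats,
    assuming one real threat per `events_per_threat` flagged events."""
    return 1.0 / events_per_threat


share = threat_share(110)            # the 110:1 ratio cited above
daily_events = 10_000                # hypothetical daily volume, for illustration only
expected_threats = daily_events * share

print(f"{share:.2%} of flagged events are actual threats")
print(f"roughly {expected_threats:.0f} of {daily_events} flagged events merit attention")
```

Under that reading, the share of events that merit attention comes out a little above 0.9%, which is consistent with “less than 1%.”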

The approach du jour to this issue is to “throw more bodies at the problem.” But this brute force approach ties the hands of operations managers, because they are forced to make a choice between expense and effectiveness. They can either add headcount for manual alert investigation, which is expensive; or they can tweak infrastructure to generate fewer alerts, which creates vulnerabilities and increases the likelihood of bad things happening to clients. This “Alert Tyranny” approach to operations impairs financial performance and the ability to deliver high quality services to clients.

Don’t expect the alert overload problem to be solved by the old forms of prevention, folks. Prevention products are fundamentally geared to detecting and alerting on anomalies, so they will always be the source of the problem, not the solution. Likewise, automating detection and response with security orchestration is a good idea – but as long as there is a 110:1 ratio of anomalous events to actual threats, this amounts to automating a process that’s a waste of time more than 99% of the time. It would be far more productive to actually fix the process.

The opportunity could not be clearer – enterprises cannot keep up with their detection and response requirements and must rely on MSSPs. They are, in effect, handing off the alert overload problem to MSSPs. MSSPs have a choice – they can pay armies of incident response people for the privilege of taking on their clients’ collective alert-overload pain; or they can be like today’s innovators and create smart new ways to reduce or eliminate the alert overload problem, which will free them to be more effective, profitable and, ultimately, indispensable to clients.

The Boy Who Cried "Alert!"

Later, he saw a REAL wolf prowling about his flock. Alarmed, he leaped to his feet and sang out as loudly as he could, "Wolf! Wolf!" But the villagers thought he was trying to fool them again, and so they didn't come.

Amazing how little changes over 2000 years. Aesop captured the danger of false positives in “The Boy Who Cried Wolf,” and yet here we are today, still dealing with the problem. Only today it’s not a mischievous little scamp playing tricks on the villagers. It’s our mischievous security infrastructure generating thousands of false-positive alerts, obscuring the smaller population of legitimate threats.

How do we solve the alert-overload problem? First, we have to stop living Einstein’s definition of insanity: doing the same thing over and over again and expecting a different result. Instead we should follow the wisdom of Jerry Seinfeld in his classic “The Opposite” episode: “If every instinct you have is wrong, then the opposite would have to be right.”

Jerry has it exactly right. The way to slay the alert-overload beast is to change our approach from “opt-in” to the opposite: “opt-out.” The traditional SIEM model sets security parameters that opt in anomalous behavior for analysis. Let’s do the opposite in a purpose-built platform and opt out all the normal behavior. If you remove everything that is normal, you are left with only legitimate threats to investigate.

This makes perfect sense when you consider that “the abnormal” is wholly unpredictable. Previously unseen threats, combined with workplace trends that promote anomalous but innocent behavior − like mobile, inter-enterprise collaboration, telecommuting and globalization − have made it impossible to accurately define parameters for threats without also generating masses of false positives.

Now, let’s look at the opposite: the normal, which is predictable. Big data and machine learning make it possible to establish an accurate baseline of “normalcy,” which makes it possible to opt-out false positives before they enter the incident-response process. For example, if Susan is downloading files at midnight, opt-in systems will generate alerts because this is defined as abnormal behavior. The opt-out system, however, would know that this is normal behavior because Susan is a new mom working during off-hours, and would discard the false-positive alerts.
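For readers who think in code, here is a minimal Python sketch of the opt-out idea. It is not how any particular product works: a simple frequency count stands in for the big data and machine learning mentioned above, and every name in it (Event, build_baseline, opt_out, the min_count threshold) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user: str
    action: str
    hour: int  # hour of day, 0-23 (0 = midnight)


# Hypothetical baseline: (user, action) -> hours at which that behavior is routine
Baseline = dict[tuple[str, str], set[int]]


def build_baseline(history: list[Event], min_count: int = 3) -> Baseline:
    """Treat a (user, action, hour) pattern as normal once it has been
    seen at least `min_count` times. A stand-in for a learned model."""
    counts: dict[tuple[str, str, int], int] = {}
    for e in history:
        key = (e.user, e.action, e.hour)
        counts[key] = counts.get(key, 0) + 1
    baseline: Baseline = {}
    for (user, action, hour), n in counts.items():
        if n >= min_count:
            baseline.setdefault((user, action), set()).add(hour)
    return baseline


def opt_out(events: list[Event], baseline: Baseline) -> list[Event]:
    """Discard events that match the baseline; only the rest reach analysts."""
    return [e for e in events
            if e.hour not in baseline.get((e.user, e.action), set())]


# Susan routinely downloads files at midnight; Bob does so mid-afternoon.
history = [Event("susan", "download", 0)] * 5 + [Event("bob", "download", 14)] * 5
baseline = build_baseline(history)

incoming = [Event("susan", "download", 0), Event("bob", "download", 0)]
remaining = opt_out(incoming, baseline)
print(remaining)
```

In this toy run, Susan’s midnight download is discarded as normal, while Bob’s midnight download, which has no history behind it, is the only event left for an analyst to look at.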

We live in a time where we can do amazing things with data. Unfortunately, technology often outpaces process, so we wind up with too much data and too little information. In security, this phenomenon manifests itself in the alert-overload problem. It’s time to end the broken process (opting-in suspicious behavior) and replace it with the opposite.

Too bad the Boy Who Cried Wolf didn’t think of this – he’d have more sheep in his flock.