5 Things You Should Know About Big Data

Rob Sobers
Published April 24, 2012
Last updated January 17, 2023

Big data is a very hot topic, and with the Splunk IPO last week seeing a 1999-style spike, the bandwagon is overflowing.  We’re poised to see many businesses pivoting into the big data space or simply slapping a big data sticker on their products, accurate or not, just to ride the wave.

This post aims to help educate you with a few byte-sized big data concepts (not just trivia) so that you can distinguish the substance from the hype.

1. Big data is distributed data

Big data is a nebulous term with many different definitions.  The key thing to remember is that in this day and age, big data is distributed data.  This means the data is so massive it cannot be stored or processed by a single node.

The days of buying a single big iron server from IBM or Sun to handle all your business intelligence needs are long gone.  It’s been proven by Google, Amazon, Facebook, and others that the way to scale fast and affordably is to use commodity hardware to distribute the storage and processing of our massive data streams across several nodes, adding and removing nodes as needed.

2. You’re going to hear the words “Hadoop” and “MapReduce”

What is Hadoop?   It is an open source platform for consolidating, combining and understanding large-scale data in order to make better business decisions. Hadoop is the technology powering many (but not all) big data analytics infrastructures.

There are two key parts to Hadoop:

  • HDFS (Hadoop Distributed File System), which lets you store data across multiple nodes.
  • MapReduce, which lets you process data in parallel across multiple nodes.

Although Hadoop is one of the most popular solutions for crunching big data, there are plenty of others.  Big data can’t be shoehorned into one flavor of technology.  The important characteristic is that you’re able to draw insights from large quantities of data, independent of specific technologies.

3. You can understand MapReduce without a degree from Stanford

The best plain English explanation of MapReduce I’ve encountered (paraphrasing):

We want to count all the books in the library.  You count up shelf #1.  I count up shelf #2.  That’s map. Now we get together and add our individual counts.  That’s reduce.
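If you'd rather see the library analogy as code, here is a minimal Python sketch.  It is only an illustration of the idea, not Hadoop's actual API: the shelves, the book titles, and the function names count_shelf and add_counts are all made up for this example, and everything runs on one machine.

```python
from functools import reduce

# Hypothetical library: each shelf is just a list of book titles.
shelves = {
    "shelf_1": ["Dune", "Emma", "Ulysses"],
    "shelf_2": ["Beloved", "Hamlet"],
}

def count_shelf(books):
    """Map step: one person counts the books on a single shelf."""
    return len(books)

def add_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

# Map: every shelf is counted independently of the others,
# which is why this step could be spread across many workers.
shelf_counts = list(map(count_shelf, shelves.values()))

# Reduce: add the per-shelf counts together into a single total.
total_books = reduce(add_counts, shelf_counts, 0)

print(shelf_counts, total_books)
```

In a real cluster, each shelf would live on a different node and the map step would run where the data sits; only the small per-shelf counts would travel over the network to be reduced.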

For a deeper understanding, Wikipedia is a good place to start.

4. Distributed data generation is fueling big data growth

We have data problems so big that they take large-scale distributed computing to solve because the creation of the data is also large-scale and distributed.  Most of us walk around carrying devices that are constantly pulsing all sorts of data into the cloud and beyond: our locations, our photos, our tweets, our status updates, our connections, even our heartbeats.

For every human-generated piece of data there’s likely associated machine-generated data.  And then there’s the metadata.  The data is abundant and it’s extremely valuable.

5. Machine learning is…awesome!

One of the key differentiators in big data analytics is the set of machine learning algorithms used to answer interesting questions and derive value from the 0s and 1s we’re furiously chewing up and spitting back out.

Some pretty cool examples:

  • Nest – a beautiful thermostat that learns how hot or cold you like your house so you never have to adjust it again (not technically big data, but fun nonetheless)
  • Gmail’s Bayesian spam filter – no more tempting emails from that pesky Nigerian prince!
  • Amazon’s product recommendations – sure, I’ll take a JavaScript book, a pair of Asics, and season 1 of Game of Thrones.  How do they know me so well?!
  • Varonis’ access control recommendations – ratchet down access based on highly accurate analytics.

If you’re interested in learning more about big data, join our webinar this Wednesday on Mastering Big Data.

photo credit: http://fav.me/d4vqn4w
