Mar 19

A Small Introduction to Big Data

Here at InSITE, early March means one thing: South by Southwest. Over the last week, a few dozen InSITE Fellows from around the country descended upon Austin, TX to geek out with the finest that the tech community has to offer. It was my first experience at SXSW, and needless to say, it was overwhelming, in the best possible way.

The SXSW organizers (thankfully) sort the chaos of the week’s events into a few tracks: series of events, speeches, and panels focused on a particular industry, trend, or topic. Branding and Marketing is one track; Government and Policy is another; and Startup Village, a third. Given my own interests in AI, data intelligence, and data analytics, I was determined to soak up as much as I could of Intelligent Future, a track that “embodies the realm of future possibilities where intelligence is embedded in every aspect of life with the goal for technology to empower and enable new possibilities.” Call me a futurist.

As I landed in Austin, I was reviewing the sessions I’d planned for the next few days, a schedule flush with such ambitiously titled events that they can only be appreciated in list format:

  • Big Data and AI: Sci-Fi vs Everyday Applications
  • Dirt, Drones and Data: The Future of Farming
  • AI and Your Shopping Habit
  • The Real Score: Analytics and Big Data in Sports
  • Will AI Augment or Destroy Humanity?
  • BFD (Big Fat Data) Revolution

After sitting through a handful of sessions, I realized that the ones I was attending fell on the far ends of a spectrum. On one end were sessions like Injecting Machine Learning Everywhere, led by a statistical physics PhD from Wolfram Research (a company renowned for its progress in machine learning and AI), which seemed to cater to an audience intimately familiar with the domain. The presentation itself gave an overview of machine learning techniques being used in the field today, as well as a demo of a new programming language developed by Wolfram to help run statistical queries.

On the other end of the spectrum were sessions such as Big Data Will Choose the Next US President. I sat in this session for 10 minutes before hearing the word “data” at all, and it quickly became evident that two-thirds of the panelists didn’t actually know what “Big Data” really meant. It was an interesting session, but in my opinion, probably should have been titled Digital Advertising is a Thing Political Campaigns Care About.

Most of the other sessions I attended fell close to either end of this same spectrum, leading me to conclude that most of the audience could be categorized into one of two buckets: (a) data scientists and PhDs, or (b) everyone else. As an MBA student interested in Big Data, I shouldn’t have been surprised, mostly because Big Data is one of the hottest areas for MBAs interested in tech (and everyone knows that MBAs are the last to hear about every hot trend in tech).

But whether or not there’s a position for MBAs in Big Data roles, it’s hard to deny the influence that Big Data and machine learning are having across the economy. Therefore, I felt it made sense to provide a brief introduction to Big Data.


What makes data big?

Not all data can be considered “big,” and the point of reference matters. For a small restaurant, a list of all credit card receipts received over the last three years might seem like a dauntingly large data set, but it may number only in the low hundreds of thousands of data points. Compare this to a clickstream service that monitors time and position stamps for every mouse movement from every user on a webpage, which could easily produce several million data points in just a few days.

To be considered truly “big,” data must satisfy three conditions, often referred to as the 3 V’s of Big Data:

  1. Volume: data must exist at scale, typically defined as at least several terabytes of data
  2. Variety: data must exist in many forms, including structured (organized, with interpretable structure, labeling, and relationship across the data) and unstructured text, numerical, date, and other data.
  3. Velocity: data stream is continuous and rapid, requiring analytical methodologies that can be completed in fractions of a second

A fourth V, veracity, is sometimes included as well, and refers to the uncertainty inherent in imprecise data that can result in reliability or predictability issues.

If you’re looking for an easy definition, it’s usually safe to assume that a data set qualifies as “big data” if it’s big enough to break or crash the tool that a non-professional would use to analyze it (like Excel).
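One rough way to make that rule of thumb concrete is to compare a data set's row count with the row limit of a single Excel worksheet. The sketch below is only a proxy: the function name and the sample row counts are made up for illustration, and in practice a spreadsheet can become unusably slow well before it reaches the hard limit.

```python
# Modern Excel (.xlsx) worksheets hold at most 1,048,576 rows.
EXCEL_MAX_ROWS = 1_048_576


def fits_in_one_sheet(row_count, header_rows=1):
    """Return True if the data plus its header rows fit in one worksheet.

    Hypothetical helper: a hard-limit check only, not a measure of
    whether Excel would stay responsive at that size.
    """
    return row_count + header_rows <= EXCEL_MAX_ROWS


# Illustrative (invented) sizes in the spirit of the examples above.
restaurant_receipts = 300_000      # a few years of card receipts
clickstream_events = 5_000_000     # a few days of mouse-movement events

print(fits_in_one_sheet(restaurant_receipts))  # receipts fit in one sheet
print(fits_in_one_sheet(clickstream_events))   # clickstream does not
```

By this crude test the restaurant's receipts are still spreadsheet-sized, while the clickstream log has already outgrown the tool.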


How it’s handled

Due to the inherent size of data in this field, multiple servers are often required to generate enough computing power to analyze big data at the speed and precision needed in industry. Through a process known as parallelization, dozens, hundreds, or even thousands of servers can be run in tandem to organize, query, and return data from large sets. A number of tools already exist that enable parallelization, including Hadoop, Spark, and Hive.
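The idea behind parallelization can be sketched on a single machine: split the data into chunks, process each chunk independently, then merge the partial results. The example below is a minimal illustration, not how Hadoop or Spark are actually implemented. Threads stand in for separate servers, and the function names and sample lines are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """'Map' step: count the words in one chunk of lines."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, workers=4):
    """Split lines into chunks, count each chunk concurrently, then merge."""
    # Ceiling division so every line lands in exactly one chunk.
    chunk_size = max(1, -(-len(lines) // workers))
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # 'Reduce' step: add each worker's partial counts to the total.
        for partial in pool.map(count_words, chunks):
            total.update(partial)
    return total


sample = ["big data big", "data is big", "small data"]
print(parallel_word_count(sample, workers=2))
```

Because each chunk is counted without reference to the others, the same pattern scales from two threads to thousands of machines. The hard parts in a real cluster are moving the data around and recovering when a machine fails.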

Once a user has access to the raw data, a huge variety of statistical methods can be applied to draw insights from the data or make future predictions. With smaller data sets, linear or multiple regression methods are often used to create models that are both predictive and interpretable. But with large enough data sets, more sophisticated models (e.g., ridge and lasso regressions, decision trees, random forests, and neural networks) can be used to create much more powerful predictive models (although this likely comes at the cost of interpretability). Building predictive models that can instantaneously adjust to new inputs is at the heart of the data analytics and intelligence industries, and being able to do this quickly and reliably has created a number of applications that are used throughout your everyday life.

  • Google uses sophisticated algorithms to predict search queries, match queries to indexed strings of text, and return the best results based not only on your search query, but also on individual characteristics gleaned from your Gmail account, browsing behavior, purchasing history, etc.
  • Traffic grids are often optimized by applying real-time traffic pattern data to algorithms that predict which intersections will be busiest or most prone to accidents.
  • New York City’s police department cross-references data in the police database with facial monitoring software and data collected from tens of thousands of CCTV cameras, radiation sensors, license plate detectors, and public data streams to monitor the entire city for known threats and potential terrorist activity.
  • Netflix uses viewing patterns across tens of millions of streaming customers to offer hyper-personalized recommendations.
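To make the earlier contrast between plain linear regression and ridge regression a little more concrete, here is a minimal single-feature sketch. The data points and the penalty value are invented for illustration, and the helper is hypothetical. With one feature and an unpenalized intercept, ridge regression has a simple closed form: the usual least-squares slope, with the penalty added to the denominator.

```python
def fit_line(xs, ys, ridge_penalty=0.0):
    """Fit y = intercept + slope * x by least squares.

    ridge_penalty = 0 gives ordinary linear regression; a positive
    value shrinks the slope toward zero (single-feature ridge).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + ridge_penalty)
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Made-up data, roughly y = 2x + 1 with a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

ols_intercept, ols_slope = fit_line(xs, ys)
ridge_intercept, ridge_slope = fit_line(xs, ys, ridge_penalty=5.0)

print(f"ordinary: y = {ols_intercept:.2f} + {ols_slope:.2f}x")
print(f"ridge:    y = {ridge_intercept:.2f} + {ridge_slope:.2f}x")
```

The ridge slope comes out smaller than the ordinary one. That deliberate shrinkage trades a little accuracy on the data in hand for stability on new data, and it is also where the coefficients start to become harder to read at face value.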

The examples above represent some of the most prevalent and easily identifiable use-cases for Big Data in today’s society. As access to big data becomes cheaper, it will also become more ubiquitous, and there’s already a push to develop friendlier user interfaces that will allow non-data scientists to analyze patterns and trends across huge data sets.

So even though Big Data might not be well understood in each and every corner of the economy today, you can bet that’s not going to stop it from taking over the world.
