The term “Big Data” has stepped out of the IT realm and found its way into mainstream media and publications within the last two years. Good examples of this are:
The term probably only reaches back to 2008, when a company called Cloudera was formed; it made its first distribution available in March 2009. Before that, the term might only have been used within internal departments of Google, Yahoo or Facebook.
Staff there were tasked with analyzing the vast amount of (user-)generated data to improve search results, or to improve how users interact with the website by analyzing the interactions between users. Instead of transforming the data into third normal form (3NF) or a star schema, they left the data in raw form and tried to analyze it using a new approach. This required investing first in a new and different infrastructure capable of holding all the data in raw form, and then in new software capable of analyzing it.
In this post I want to focus on this exact point in time and provide some groundwork on why other companies might also want to invest in an analytics infrastructure different from the one they already have.
Structured data is typically held in a data warehouse, where it has been transformed from 3NF into a star schema, while unstructured data is stored in raw form on the Hadoop Distributed File System (HDFS). This difference is also known as “schema-on-write” versus “schema-on-read”. Instead of being transformed into a structured form, the data resides in raw form, and the schema (e.g. to join the data) is created only at run time (query time). To increase query performance, the data is spread across multiple nodes; this distribution allows the query processing to be parallelized.
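To make the idea of schema-on-read over distributed data more concrete, here is a minimal Python sketch. It is not how Hadoop itself is implemented: the three in-memory lists merely stand in for nodes, and the log format, the field names (`user`, `action`, `target`) and all function names are assumptions made up for this illustration. The raw lines are stored untouched, the schema is applied only when the query runs, and each "node" computes a partial result that is merged at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Raw, untransformed log lines, split across three hypothetical "nodes".
NODES = [
    ["alice,search,hadoop", "bob,click,page1"],
    ["alice,click,page2", "carol,search,hive"],
    ["bob,search,oracle"],
]


def parse(line):
    """Apply the assumed schema (user, action, target) at read time."""
    user, action, target = line.split(",")
    return {"user": user, "action": action, "target": target}


def count_actions(lines):
    """Per-node partial result: number of events per action."""
    return Counter(parse(line)["action"] for line in lines)


def query(nodes):
    """Run the per-node work in parallel and merge the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(count_actions, nodes))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


result = query(NODES)
print(result)
```

Note that nothing about the stored lines would have to change if a later query wanted to interpret them differently; only `parse` would.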
For databases in general, the schema needs to be known for each record that is inserted, e.g. for an INSERT INTO statement. In Hadoop anything can be stored, but the query (SELECT) is then responsible for bringing back consistent results. [Certain information discovery products such as Endeca might sit in between, as they store the data in a key-value data store.]
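The contrast can be sketched in a few lines of Python. This is only an illustration under assumptions: SQLite (through Python's built-in `sqlite3` module) stands in for a relational database, a plain list of strings stands in for raw storage, and the table, column and function names are invented for the example. The database refuses a record that does not match the schema, whereas the raw store accepts everything and leaves it to the query to decide which records are usable.

```python
import json
import sqlite3

# Schema-on-write: the table definition must exist before any INSERT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor TEXT NOT NULL, page TEXT NOT NULL)")
conn.execute("INSERT INTO visits (visitor, page) VALUES (?, ?)", ("alice", "home"))


def insert_rejected():
    """Return True if a record with an unknown column is refused."""
    try:
        conn.execute(
            "INSERT INTO visits (visitor, device) VALUES (?, ?)", ("bob", "phone")
        )
    except sqlite3.OperationalError:
        return True
    return False


# Schema-on-read: anything can be appended to the raw store ...
raw_store = [
    '{"visitor": "alice", "page": "home"}',
    '{"visitor": "bob", "device": "phone"}',
    "not even json",
]


def pages_by_visitor(lines):
    """... so the query decides which records fit the schema it expects."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip records that cannot be parsed at all
        if isinstance(record, dict) and "visitor" in record and "page" in record:
            rows.append((record["visitor"], record["page"]))
    return rows


print(insert_rejected())
print(pages_by_visitor(raw_store))
```

The cost of the flexibility is visible in `pages_by_visitor`: the consistency checks that the database performs once at insert time have to be repeated by every query that reads the raw data.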
To query and analyze the data stored in HDFS, an additional layer on top of the storage was needed to provide an interface to the data:
From this perspective, the two approaches, a database for structured data and a Hadoop file system for unstructured data, complement each other:
Oracle also offers hardware and software combined, in the form of the Big Data Appliance and the Oracle Exadata Machine:
Using Oracle Big Data SQL and the connectors between both machines, one can combine both data sources to perform analysis over structured data from the data warehouse and unstructured data stored in the Hadoop Distributed File System: