Thursday, February 19, 2015

Big Unstructured Data vs. Structured Relational Data – FIGHT!

This is one of those things where you need to completely understand one before attempting to understand the other. Hence, let’s start with Structured Relational Data.

Structured Relational Data

The last time you fired up Microsoft Excel (I am guessing it was for the BI homework of creating the fact table for the transcript) and filled in information in those cells formed by neat little rows and columns, what you created was, in essence, structured relational data. Imagine this at an industrial scale and you have database management systems such as Oracle Database, IBM DB2, and Microsoft SQL Server.
Here is the bookish definition of a relational database (I talk about the database and not just the data because all structured data is stored in a database of some sort, be it a table in an Excel workbook or Hilton's customer relationship management system running atop Oracle):
A relational database is a digital database whose organization is based on the relational model of data, as proposed by E.F. Codd in 1970. This model organizes data into one or more tables (or "relations") of rows and columns, with a unique key for each row. Generally, each entity type described in a database has its own table, the rows representing instances of that entity and the columns representing the attribute values describing each instance. Because each row in a table has its own unique key, rows in other tables that are related to it can be linked to it by storing the original row's unique key as an attribute of the secondary row (where it is known as a "foreign key"). Codd showed that data relationships of arbitrary complexity can be represented using this simple set of concepts.
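
To make the keys-and-tables idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (customer, reservation, and so on) are made up purely for illustration and are not from any real system mentioned above.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # an in-memory relational database
    cur = conn.cursor()

    # Each entity type gets its own table; every row has a unique (primary) key.
    cur.execute("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        )""")

    # A related table links back by storing the customer's key as a foreign key.
    cur.execute("""
        CREATE TABLE reservation (
            reservation_id INTEGER PRIMARY KEY,
            customer_id    INTEGER NOT NULL REFERENCES customer(customer_id),
            room_type      TEXT,
            check_in       TEXT
        )""")

    cur.execute("INSERT INTO customer VALUES (1, 'Jane Doe')")
    cur.execute("INSERT INTO reservation VALUES (100, 1, 'Suite', '2015-02-19')")

    # A join follows the key relationship to put the pieces back together.
    for row in cur.execute("""
            SELECT c.name, r.room_type, r.check_in
            FROM reservation r JOIN customer c ON c.customer_id = r.customer_id"""):
        print(row)  # ('Jane Doe', 'Suite', '2015-02-19')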


Big Unstructured Data 

Coming to Big Unstructured Data – simply put, it is everything that structured relational data is NOT.
Here is the bookish definition of unstructured data:
Unstructured data (or unstructured information) refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional computer programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.
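
To see why that is painful for traditional tools, here is a toy sketch (not any standard library or product, just illustrative Python with a made-up tweet) of what it takes to pull a date and an amount out of free-form text, versus simply reading a column in a table:

    import re

    # The useful facts are buried in free-form prose and have to be teased out
    # with brittle pattern matching instead of being read from a named column.
    tweet = "Checked in at the Hilton on 02/19/2015 - room 1204 was great, paid $189!"

    dates = re.findall(r"\d{2}/\d{2}/\d{4}", tweet)
    amounts = re.findall(r"\$\d+(?:\.\d{2})?", tweet)

    print(dates)    # ['02/19/2015']
    print(amounts)  # ['$189']
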
IBM has done a fantastic job of defining Big Unstructured Data using four characteristics, also called the Four V's of Big Data. They are:

Volume – Scale of Data
Velocity – Analysis of Streaming Data
Variety – Different Forms of Data
Veracity – Uncertainty of Data

The Three types of Data – Data, Data and Data

Here is an interesting video explaining the three types of data, i.e. structured, unstructured and semi-structured. BEFORE you watch the video, here is a mind exercise for you – count how many times the word ‘data’ is spoken in the video.
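
While you are at it, here is a quick taste of the third category. Semi-structured data tags its own fields but has no rigid schema; JSON is the classic example. The records below are made up purely for illustration:

    import json

    # Two JSON records: each one labels its own fields, but there is no fixed
    # schema -- the second record simply has attributes the first one lacks.
    records = [
        '{"user": "alice", "followers": 120}',
        '{"user": "bob", "followers": 85, "location": "Pune", "verified": true}',
    ]

    for raw in records:
        doc = json.loads(raw)
        # Fields must be probed for, not assumed, since the "columns" vary per record.
        print(doc["user"], doc.get("location", "unknown"))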

Present Day Scenario

If it wasn't already clear from the above, unstructured Big Data is growing at an exponential rate. As per an infographic by IBM, by 2015 structured data would account for only 20% of all data, whereas unstructured data from sources such as VoIP, social media, and sensors and devices would account for four times as much.

Where does data warehousing fit in all this?

The short answer is “Use the best tool for the job”. What I mean by this is that the traditional data warehouses are not going anywhere anytime soon. Data warehouses have had staying power because the concept of a central data repository—fed by dozens or hundreds of databases, applications, and other source systems—continues to be the best, most efficient way for companies to get an enterprise-wide view of their customers, supply chains, sales, and operations.

That said, data warehouse and big data environments can come together in an integrated and very complementary way. Consider a scenario that plays to Hadoop's strengths: a high-tech company might extract data from its social networking page and cross-reference it with data from the data warehouse to update a client's social network circle of friends. The environment might also use Hadoop to quickly "score" that person's social influence. That score is then provisioned back to the data warehouse so that, for example, a campaign manager can view the person's influence score and re-segment him/her as a result.
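
To give a flavour of that "scoring" step, here is a hedged sketch of a Hadoop Streaming job in Python. The input layout (user_id, tab, interaction_type) and the weights are assumptions made up for this example, not the format of any real social network feed; the two scripts would be handed to the standard Hadoop Streaming jar as the mapper and reducer.

    # mapper.py -- emits (user_id, weight) for every interaction it reads.
    import sys

    WEIGHTS = {"share": 3, "comment": 2, "like": 1}  # made-up influence weights

    for line in sys.stdin:
        user_id, interaction = line.rstrip("\n").split("\t")
        print(f"{user_id}\t{WEIGHTS.get(interaction, 0)}")

    # reducer.py -- Hadoop sorts mapper output by key, so all of a user's weights
    # arrive together; we sum them into an influence score that can then be
    # bulk-loaded back into the data warehouse.
    import sys

    current_user, total = None, 0
    for line in sys.stdin:
        user_id, weight = line.rstrip("\n").split("\t")
        if user_id != current_user:
            if current_user is not None:
                print(f"{current_user}\t{total}")
            current_user, total = user_id, 0
        total += int(weight)
    if current_user is not None:
        print(f"{current_user}\t{total}")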



The point here is that each system is doing what it is best designed to do. Although every rule has an exception, big data and data warehouse technologies are optimized for different purposes. Again, the goal is to use these solutions for what they were designed to do. In other words: use the best tool for the job.



Limitations of Data Warehousing


Limitations of traditional Data Warehouses

These are built on those neat little rows and columns we spoke of earlier. Throw an audio file or a tweet at one and it goes for a toss – it simply doesn't know how to interpret it. On top of that, a traditional warehouse can't handle more than a few terabytes of data efficiently, and all those ETL transformations inherently introduce latency.

Limitations of the fancy Big Data Warehouses

These warehouses eat millions of tweets for breakfast and a gazillion Facebook profiles for lunch. However, ask one to crunch hardcore reports and it struggles, because it is so bulky and lethargic given all the unstructured data it has gobbled up. That's where the traditional data warehouse reigns supreme. And good luck querying in Hadoop – the SQL support is very limited.



Future of Data Warehousing

A white paper by Oracle explored the top 10 trends in data warehousing and I believe it pretty much sums up where Data Warehousing is headed.
  1. The “datafication” of the enterprise requires more adept data warehouses - Mobile devices, social media traffic, networked sensors (i.e. the Internet of Things), and other sources are generating an exponentially growing stream of data. IT teams are responding by adding new capabilities to data warehouses so they can handle new types of data, more data, and do so faster than ever before.
  2. Physical and logical consolidation help reduce costs - The answer to datafication isn't simply to invest more money in these systems. In other words, 10x the data shouldn't translate into 10x the cost. So expanding data warehouses must be consolidated through a blend of virtualization, compression, multi-tenant databases, and servers engineered to handle much higher data volumes and workloads.
  3. Hadoop optimizes data warehouse environments - The open source Hadoop framework, with its distributed file system (HDFS) and parallel MapReduce paradigm, excels at processing enormous data sets. This makes Hadoop a great companion to "standard" data warehouses and explains why a growing number of data warehouse administrators are now using Hadoop to take on some of the heaviest workloads.
  4. Customer experience (CX) strategies use real-time analytics to improve marketing campaigns - Data warehouses play a pivotal role in CX initiatives because they house the data used to establish a comprehensive, 360-degree view of your customer base. A data warehouse of customer information can be used for sentiment analysis, personalization, marketing automation, sales, and customer service.
  5. Engineered systems are becoming a preferred approach for large scale information management - If one is not careful, a data warehouse can become a complex assortment of disconnected pieces: servers, storage, database software, and other components. It doesn't have to be that way. Engineered systems such as Oracle Big Data Appliance and Oracle Exadata Database Machine are preconfigured and optimized for specific kinds of workloads, delivering the highest levels of performance without the pain of integration and configuration.
  6. On-demand analytics environments meet the growing demand for rapid prototyping and information discovery - Akin to cloud computing's software-as-a-service model, the concept of "analytics as a service" is a technical breakthrough that allows administrators to provide "sandboxes" in a data warehouse environment for use in support of new analytics projects.
  7. Data compression enables higher-volume, higher-value analytics - The best way to counter non-stop data expansion is data compression. The organization's data may be growing at 10x, but advanced compression methods can keep pace, enabling companies to capture and store more valuable data without 10x the cost and 10x the pain (a rough sketch of the arithmetic follows this list).
  8. In-database analytics simplify analysis - Ideally, a data warehouse will have a range of ready-to-use tools (native SQL, R integration, and data mining algorithms, for example) to kick-start and expedite data analysis. Such in-database analytics capabilities minimize the need to move data back and forth between systems and applications for analysis, resulting in highly streamlined and optimized data discovery.
  9. In-memory technologies supercharge performance - The emergence of in-memory database architecture brings sports-car-like performance to data warehouses. The term "in-memory" refers to the ability to process large data sets in system RAM, accelerating number-crunching and the reporting of actionable information.
  10. Data warehouses are more critical than ever to business operations - While it's true that data warehouses have been around for years, their significance keeps increasing because they hold a firm's most valued assets: prized information on clients and business performance. Moreover, organizations are finding new applications for data warehouses; healthcare providers, for example, are using enterprise DW/BI solutions to enhance patient care and streamline processes.
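
Speaking of compression (trend 7 above), here is a rough, back-of-the-envelope sketch using Python's standard zlib module rather than any warehouse-specific codec. Real warehouses compress columns and blocks, but the arithmetic is the same idea: repetitive, low-cardinality data shrinks dramatically. The sample values are made up for illustration.

    import zlib

    # One million repetitive "rows" -- the kind of low-cardinality data that
    # columnar warehouse compression thrives on.
    raw = b"2015-02-19,US,COMPLETED\n" * 1_000_000

    packed = zlib.compress(raw, 6)

    print(f"raw:        {len(raw):>12,} bytes")
    print(f"compressed: {len(packed):>12,} bytes")
    print(f"ratio:      {len(raw) / len(packed):.1f}x")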
