Thursday, February 19, 2015

Big Unstructured Data v/s Structured Relational Data – FIGHT!

This is one of those things where you need to completely understand one before attempting to understand the other. Hence, let’s start with Structured Relational Data.

Structured Relational Data

The last time you fired up Microsoft Excel (I am guessing it was for the BI homework of creating the fact table for the transcript) and filled in information in those cells formed by neat little rows and columns, what you created was, in essence, structured relational data. Imagine this at an industrial scale and you have database management systems such as Oracle Database, IBM DB2, Microsoft SQL Server, etc.
Here is the bookish definition of a relational database (I talk about the database and not just the data because all structured data is stored in a database of some sort, be it a table in an Excel workbook or Hilton's customer relationship management system running atop Oracle):
A relational database is a digital database whose organization is based on the relational model of data, as proposed by E.F. Codd in 1970. This model organizes data into one or more tables (or "relations") of rows and columns, with a unique key for each row. Generally, each entity type described in a database has its own table, the rows representing instances of that entity and the columns representing the attribute values describing each instance. Because each row in a table has its own unique key, rows in other tables that are related to it can be linked to it by storing the original row's unique key as an attribute of the secondary row (where it is known as a "foreign key"). Codd showed that data relationships of arbitrary complexity can be represented using this simple set of concepts.
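The primary-key/foreign-key linkage Codd describes is easy to see in a few lines of SQL. Here is a minimal sketch using Python's built-in sqlite3 module; the table and column names (a hotel-style customer/booking pair) are invented for illustration:

```python
import sqlite3

# In-memory database: two related tables, linked by a unique key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- unique key for each row
        name        TEXT NOT NULL
    );
    CREATE TABLE booking (
        booking_id  INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),  -- foreign key
        nights      INTEGER
    );
""")
conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO booking VALUES (100, 1, 3)")

# The stored foreign key lets us join the two tables back together.
row = conn.execute("""
    SELECT c.name, b.nights
    FROM booking b JOIN customer c ON b.customer_id = c.customer_id
""").fetchone()
print(row)  # ('Alice', 3)
```

The booking row never stores the customer's name, only their key; the relationship is reconstructed at query time, which is exactly the "simple set of concepts" Codd was referring to.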


Big Unstructured Data 

Coming to Big Unstructured Data – simply put, it is everything that structured relational data is NOT.
Here is the bookish definition of unstructured data:
Unstructured data (or unstructured information) refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional computer programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.
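The definition's point that unstructured text still contains dates, numbers and facts, yet with "irregularities and ambiguities", can be demonstrated with Python's re module. The sample sentence and the patterns are invented for illustration:

```python
import re

# A free-form sentence: no schema, no columns, yet it carries facts.
text = "Acme shipped 1,250 units on 2015-02-19, up 8% from January."

# Naive pattern-matching pulls out some facts...
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
numbers = re.findall(r"\d+(?:,\d{3})*%?", text)

print(dates)
# ...but it is ambiguous: the "numbers" list also picks up the
# fragments of the date, with no field names to tell them apart.
print(numbers)
```

A database column named `ship_date` carries its meaning in its schema; free text makes the program guess, which is exactly why unstructured data is hard for traditional tools.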
IBM has done a fantastic job of defining Big Unstructured Data using four characteristics, also called the FOUR V’s of Big Data. They are:

Volume – Scale of Data

Velocity – Analysis of Streaming Data

Variety – Different Forms of Data

Veracity – Uncertainty of Data

The Three types of Data – Data, Data and Data

Here is an interesting video explaining the three types of data, i.e. structured, unstructured and semi-structured. BEFORE you watch the video, here is a mind exercise for you – count how many times the word ‘data’ is spoken in the video.






Present Day Scenario

If it wasn't already clear from the above, unstructured Big Data is growing at an exponential rate. As per the infographic below by IBM, by 2015 structured data would account for only 20% of all data, whereas unstructured data such as VoIP, social media, and sensor and device data would account for four times as much.





Where does data warehousing fit in all this?

The short answer is “Use the best tool for the job”. What I mean by this is that the traditional data warehouses are not going anywhere anytime soon. Data warehouses have had staying power because the concept of a central data repository—fed by dozens or hundreds of databases, applications, and other source systems—continues to be the best, most efficient way for companies to get an enterprise-wide view of their customers, supply chains, sales, and operations.

That said, data warehouse and big data environments can come together in an integrated and very complementary way. Consider a scenario that plays to Hadoop's speed: a high-tech company might want to extract data from its social networking page and cross-reference it with data from the data warehouse to update a client’s social network circle of friends. The environment might also use Hadoop to quickly “score” that person’s social influence. That data is then provisioned back to the data warehouse so that, for example, a campaign manager can view the person’s influence score and re-segment him/her as a result.
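A toy illustration of the kind of "scoring" job described above, written as a map/shuffle/reduce in plain Python. This is not Hadoop code – the user names, the edge data, and the follower-count scoring rule are all invented – but the three phases mirror how a MapReduce job would partition the work across a cluster:

```python
from collections import defaultdict

# One (follower, followed) edge per record, as it might arrive
# from a social-network extract.
edges = [
    ("bob", "alice"), ("carol", "alice"),
    ("dave", "alice"), ("alice", "bob"),
]

def map_phase(edge):
    follower, followed = edge
    yield (followed, 1)          # emit one "influence point" per follower

def shuffle(pairs):
    groups = defaultdict(list)   # group values by key, like Hadoop's shuffle
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {user: sum(points) for user, points in groups.items()}

pairs = [p for edge in edges for p in map_phase(edge)]
scores = reduce_phase(shuffle(pairs))
print(scores)  # {'alice': 3, 'bob': 1}
```

On a real cluster each phase runs in parallel over millions of edges; the scores dictionary is the small, structured result that would be provisioned back into the warehouse.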



The point here is that each system is doing what it is best designed to do. Although every rule has an exception, big data and data warehouse technologies are optimized for different purposes. Again, the goal is to use these solutions for what they were designed to do. In other words: use the best tool for the job.



Limitations of Data Warehousing


Limitations of traditional Data Warehouses

These are based on those neat rows and columns we spoke of earlier. Throw an audio file or a tweet at one and it goes for a toss – it simply doesn't know how to interpret it. Traditional warehouses also can’t handle more than a few terabytes of data efficiently, and all those ETL transformations inherently introduce latency.
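The ETL latency mentioned above comes from the batch nature of extract-transform-load: data is copied out of the source, cleaned, and loaded into the warehouse some time after the fact. A minimal sketch of the pattern, using sqlite3 in place of real systems (the table names and the dollar-sign cleaning rule are invented):

```python
import sqlite3
from datetime import datetime

source = sqlite3.connect(":memory:")       # stand-in for an OLTP system
warehouse = sqlite3.connect(":memory:")    # stand-in for the warehouse

source.execute("CREATE TABLE sales (amount TEXT)")
source.executemany("INSERT INTO sales VALUES (?)", [("$10.50",), ("$3.25",)])
warehouse.execute("CREATE TABLE fact_sales (amount REAL, loaded_at TEXT)")

# Extract -> Transform (strip '$', cast to float) -> Load, as one batch.
# Everything that happens in the source between batches is invisible
# to the warehouse until the next run: that gap is the latency.
rows = source.execute("SELECT amount FROM sales").fetchall()
clean = [(float(a.lstrip("$")), datetime.utcnow().isoformat()) for (a,) in rows]
warehouse.executemany("INSERT INTO fact_sales VALUES (?, ?)", clean)

total = warehouse.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
print(total)  # 13.75
```

Even in this toy form the structure of the problem is visible: the warehouse only ever sees the world as of the last completed batch.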

Limitations of the fancy Big Data Warehouses

These warehouses eat millions of tweets for breakfast and a gazillion Facebook profiles for lunch. However, ask one to crunch hardcore reports and it struggles, bulky and lethargic from all the unstructured data it has gobbled up. That’s where the traditional data warehouse reigns supreme. And good luck querying in Hadoop – the SQL support is very limited.



Future of Data Warehousing

A white paper by Oracle explored the top 10 trends in data warehousing and I believe it pretty much sums up where Data Warehousing is headed.
  1. The “datafication” of the enterprise requires more adept data warehouses - Mobile devices, social media traffic, networked sensors (i.e. the Internet of Things), and other sources are generating an exponentially growing stream of data. IT teams are responding by adding new capabilities to data warehouses so they can handle new types of data, more data, and do so faster than ever before.
  2. Physical and logical consolidation help reduce costs - The answer to datafication isn’t simply to invest more money in these systems. In other words, 10x the data shouldn’t translate into 10x the cost. So expanding data warehouses must be consolidated, through a blend of virtualization, compression, multi-tenant databases, and servers that are engineered to handle much higher data volumes and workloads.
  3. Hadoop optimizes data warehouse environments - The open source Hadoop program, given its distributed file system (HDFS) and parallel MapReduce paradigm, excels at processing enormous data sets. This makes Hadoop a great companion to “standard” data warehouses and explains why a growing number of data warehouse administrators are now using Hadoop to balance some of the heaviest workloads.
  4. Customer experience (CX) strategies use real-time analytics to improve marketing campaigns - Data warehouses play a pivotal role in CX initiatives because they house the data used to establish a comprehensive, 360-degree view of your customer base. A data warehouse of customer information can be used for sentiment analysis, personalization, marketing automation, sales, and customer service.
  5. Engineered systems are becoming a preferred approach for large scale information management - If one is not careful, data warehouses can become a complex association of disconnected pieces—servers, storage, database software, and other components—but not necessarily. Engineered systems such as Oracle Big Data Appliance and Oracle Exadata Database Machine are preconfigured and optimized for specific kinds of workloads, delivering the highest levels of performance without the pain of integration and configuration.
  6. On-demand analytics environments meet the growing demand for rapid prototyping and information discovery - Akin to cloud computing’s software-as-a-service model, the concept of “analytics as a service” is a technical breakthrough that allows administrators to provide “sandboxes” in a data warehouse environment in support of new analytics projects.
  7. Data compression enables higher-volume, higher-value analytics - The best way to counter non-stop data expansion is data compression. An organization’s data may be growing at 10x, but advanced compression methods can match that growth, enabling companies to capture and store more valuable data without 10x the cost and 10x the pain.
  8. In-database analytics simplify analysis - Ideally, a data warehouse will have a range of ready-to-use tools—native SQL, R integration, and data mining algorithms, for example–to kick start and expedite data analysis. Such in-database analytics capabilities minimize the need to move data back and forth between systems and applications for analysis, resulting in highly streamlined and optimized data discovery.
  9. In-memory technologies supercharge performance - The emergence of in-memory database architecture brings sports car-like performance to data warehouses. The term in memory refers to the ability to process large data sets in system RAM, accelerating number-crunching and reporting of actionable information.
  10. Data warehouses are more critical than ever to business operations - While it’s true that data warehouses have been around for years, their significance keeps increasing because they hold a firm’s most valued assets: prized information on clients and business performance. Moreover, organizations are finding new applications for data warehouses; healthcare providers, for example, are using enterprise DW/BI solutions to enhance patient care and streamline processes.
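Trend 7's claim that compression can keep pace with data growth is easy to see with Python's built-in zlib module. Warehouse-style column data is highly repetitive, so it compresses dramatically; the sample record format below is invented for illustration:

```python
import zlib

# 10,000 copies of a typical warehouse row: date, region, status.
rows = ("2015-02-19,US-EAST,COMPLETED\n" * 10_000).encode()

compressed = zlib.compress(rows, level=9)
ratio = len(rows) / len(compressed)
print(f"{len(rows)} bytes -> {len(compressed)} bytes ({ratio:.0f}x smaller)")
```

Real columnar warehouses use more specialized encodings (run-length, dictionary) on top of this idea, but the principle is the same: redundancy in the data pays for the storage.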










Tuesday, February 3, 2015

Business Intelligence and Analytics Platforms - Demystified

Ever walked into a Walmart and wondered why eggs are placed close to milk, or bread near cereal, or for that matter how they decide what items should be placed on their shelves and where? It is obvious that these decisions are not based merely on gut feeling. They are getting help – these decisions are driven by Business Intelligence.


Simply put, it is the ability to make intelligent decisions based on facts. In the Walmart example above, every time you go to a checkout counter, the attendant scans the items, generates a bill, and you pay for it. Now imagine this happening 10,000 times per second. Every such transaction is recorded and is, in essence, a fact. Walmart feeds this data into magic software and out come the results, in the form: if a customer buys A and B, then he/she buys C. Walmart then conveniently places the yogurt right next to the milk and eggs. This is a case of market basket analysis, which enables Walmart to make intelligent product placement decisions.
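The "if a customer buys A and B, then C" pattern falls out of counting how often items appear together across baskets. A minimal market-basket sketch in Python – the sample baskets and the milk/eggs/yogurt rule are invented for illustration:

```python
# Each set is one checkout transaction.
baskets = [
    {"milk", "eggs", "yogurt"},
    {"milk", "eggs", "yogurt", "bread"},
    {"milk", "eggs"},
    {"bread", "cereal"},
]

antecedent = {"milk", "eggs"}   # "if a customer buys A and B..."
consequent = "yogurt"           # "...then he/she buys C"

# Confidence: of the baskets containing the antecedent, what fraction
# also contain the consequent?
with_ab = [b for b in baskets if antecedent <= b]
with_abc = [b for b in with_ab if consequent in b]
confidence = len(with_abc) / len(with_ab)
print(f"confidence = {confidence:.2f}")  # 0.67
```

Real retailers run this over millions of baskets and thousands of candidate rules (the Apriori family of algorithms), but a rule with high enough confidence is precisely the signal that tells them to shelve the yogurt next to the milk and eggs.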

So what are the benefits? To put it into perspective: in 2009, Amazon’s revenue was $24.5 billion. A staggering $5 billion of that came from “recommended” products – nearly 20% of their total revenue! All of this is made possible by Business Intelligence and Analytics platforms.




Before we delve deeper into BI platforms, here is the textbook definition of BI:

Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes. BI technologies are capable of handling large amounts of unstructured data to help identify, develop and otherwise create new strategic business opportunities. The goal of BI is to allow for the easy interpretation of these large volumes of data. Identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability

Below is a pictorial representation of a BI implementation example:





Business Intelligence and Analytics Platforms

Parameters for comparison

  • Intuitive
    • This is a measure of how easy it is for a new user to start using the application and generate meaningful results.
    • This is also a measure of the learning curve. The steeper the curve, the lower the points awarded.
  • Cost to implement
    • This is an important factor since it could often be the deciding factor in the selection of the BI platform especially for small companies
    • Another aspect that this parameter implicitly accounts for is the value for money. Higher points if more features are provided at a lower cost.
    • Extra points if there are different versions available to cater to different market segments.
  • Dashboards
    • This is a platform's ability to enable its users to create rich, visual and interactive dashboards and data visualizations within a short time with minimal technical know-how.
    • Extra points for the ability to have dashboards in the cloud and delivery through mobile devices.
  • Support
    • This is a measure of the customer experience during the sales process, after-sales support and implementation support.
    • This also implicitly measures how often upgrades and updates are issued, how easy or difficult it is to apply them, and how much time it takes (i.e. any downtime the client might face).
  • Security
    • In light of recent security breaches and the huge financial implications they bring, security is as important as, if not more important than, the other parameters.
    • Severe point penalties for any vulnerabilities that could compromise data security and integrity.



Comparative Analysis

Below are some of the BI & Analytics offerings from different vendors. For the purpose of this discussion, I have considered offerings from vendors that belong to the Leaders quadrant of Gartner’s Magic Quadrant:




Tableau


Strengths - Tableau's biggest strength is that it offers an extremely intuitive, interactive data exploration experience, and many competitors have tried to follow in its footsteps. Tableau has carved out a huge market share with its ability to meet dominant and mainstream buying requirements – ease of use, breadth of use, and enabling business users to perform more complex types of analysis without extensive skills or IT assistance – and this competitive differentiation continues to increase its momentum, even though it operates in an increasingly crowded market in which most other vendors view it as a target.

Areas of improvement - In spite of its strengths, many clients do not use Tableau as their primary BI platform. With traditional vendors investing heavily in data discovery capabilities, Tableau's dominance could come under threat. Another aspect is a poor after-sales experience, which could keep potential buyers away. Its enterprise features, such as metadata management and BI infrastructure, are also below average.



Qlik - Qlikview

Strengths - Market leader in data discovery. A self-contained BI platform, based on an in-memory associative search engine. But its biggest, boldest move is to address the need for a BI platform standard that can fulfill both business users' requirements for ease of use and IT's requirements for enterprise features relating to re-usability, data governance and control, and scalability.

Areas of improvement - Not enterprise ready. Below-average metadata management, BI infrastructure and embeddable analytics. Concerns about its capability for managing security and administering large numbers of named users.


MicroStrategy

Strengths - The key strength of MicroStrategy is its ability to store reports and dashboards in the cloud. It also supports big data and Hadoop. It is a very user-friendly tool for non-technical users, who can build reports with simple drag-and-drop functionality. Another distinguishing feature is its mobile experience, including access to data in offline mode.

Areas of improvement - One area needing improvement is its very rigid data structures, which mean data processing must be done before the data can be used for analysis. It also does not support predictive analysis, which is disappointing for many users.


SAS

Strengths - SAS's core strength is its advanced analytical techniques, such as data mining, predictive modeling, simulation and optimization. It offers industry- and domain-specific advanced analytics and supports extremely large volumes of data.

Areas of improvement - Higher-than-normal complexity makes it among the most difficult platforms to use and implement. It needs significant improvement in reporting, dashboards, OLAP, interactive visualization and other traditional BI functionality.


IBM

Strengths - Amazing sales and product strategy coupled with support from IBM Global Services and a global presence. Capability to support larger deployments. Radical new approach to data discovery with the Watson Analytics offering. Simplified licensing model. Innovative features such as natural language query.

Areas of improvement - Significantly high cost of procurement and implementation. Clients have expressed frustration with IBM's sales and contracting experience and with high numbers of audits. At 6.2 days, the time taken to generate a report is much higher than the industry average of 4.3 days. The move to "smart discovery", bypassing the traditional data discovery progression, may result in technical challenges for customers.



So how do they stack up against each other?

Here is how I rate the different platforms across the parameters discussed above:



Below is a composite graphical representation of the individual factor scores as well as the weighted totals:




My Recommendation

Many vendors do many things right, and each has its niche. However, if I were to pick one, it would undoubtedly be Tableau, and customer feedback echoes the same sentiment. Tableau checks all the important boxes – ease of implementation and a super-intuitive interface – particularly with its core differentiator: making a range of types of analysis (from simple to complex) accessible and easy for the ordinary business user, whom Tableau effectively transforms into a "data superhero."



References:
http://www.statisticbrain.com/wal-mart-company-statistics/
http://ecr-all.org/files/I.Liiv_Gaining_Shopper_Insights_Using_Market_Basket_Analysis.pdf
http://en.wikipedia.org/wiki/Business_intelligence
http://www.gartner.com/technology/reprints.do?id=1-1QLGACN&ct=140210&st=sb
http://www.tableau.com/new-features/8.3
http://public.dhe.ibm.com/common/ssi/ecm/yt/en/ytw03250caen/YTW03250CAEN.PDF
http://www-01.ibm.com/software/analytics/cognos/solutions.html
http://www.microstrategy.com/us/analytics
http://www.qlik.com/us/explore/resources
http://www.sas.com/en_us/software/business-intelligence.html