Hadapt White Paper


Technical Overview
March 2013



Introduction

Today, business leaders see big data as a valuable competitive advantage. High-volume, disparate data – particularly internet- and social media-based data – is increasingly important for enterprises as they seek to glean insights about their globally dispersed workforces and customers. Yet one principal challenge remains: how to derive timely and meaningful value from the growing masses of structured and unstructured data, when traditional platforms lack flexibility and demand significant capital expenditure. Hadapt is ushering data analytics into the future with its Adaptive Analytical Platform, an extensible, interactive SQL interface to Apache Hadoop.

Hadapt has developed the industry's only big data analytic platform natively integrating SQL with Apache Hadoop. The unification of these traditionally segregated platforms enables customers to analyze all of their data (structured, semi-structured and unstructured) in a single platform—no connectors, complexities or rigid structure.

While most businesses are heavily invested in effective, well-established data warehouses, operational data stores, data marts, and business intelligence tools to mine and analyze structured data, many still do not have a clear approach for analyzing unstructured and semi-structured data such as audio, clickstreams, graphics, log files, raw text, social media messages, and video. Organizations understand that legacy relational database approaches are neither cost-effective nor efficient for unlocking the wealth of information in unstructured data. Yet companies continue to devote the majority of their resources to managing the smallest portion of their data (structured), while moving the majority of their data (unstructured) into special-purpose databases or content management systems with little or no analytic capability.

In recent years, Hadoop has earned a place in many data processing infrastructures as a primary tool for refining and processing unstructured data, due to its cost effectiveness, integration with a robust distributed file system, and virtually limitless scalability. Hadoop is an open-source software platform for building reliable, scalable clusters that can cost-effectively store, process, and analyze large volumes of data with high throughput. The platform provides analysis through MapReduce (MR) and storage through a distributed, shared-nothing file system, the Hadoop Distributed File System (HDFS), which stores data on the compute nodes themselves, providing very high compute capacity across the cluster.

As companies embrace Hadoop, they tend to use HDFS as a cost-effective storage strategy for long-term retention of data in virtually any format. Hadoop enables companies to store their data without first determining what that data is or what they want to do with it. This flexibility and delayed determination differs from the approach of databases, which require a rigorous understanding of the data and the anticipated analyses prior to storing it.

Because Hadoop provides MapReduce for processing and analyzing data in HDFS, businesses are also looking for value in their stored data, such as new insights about their customers, products, and operations. While some use MapReduce for analytical workloads, Hadoop is more commonly used to process data and save the results back into HDFS for further analysis in Hadoop or a downstream system. Powerful but unfamiliar to most developers, MapReduce lacks integration with the traditional analysis tools deployed in the enterprise, which are primarily based on a SQL interface. A popular approach, therefore, is to store the raw data in HDFS; use MapReduce to normalize, transform, summarize, and derive the raw data into something more valuable and suitable for SQL-based analysis; and then export that data into an MPP database to take advantage of SQL-based tools and interactive query speeds. What emerges is a tiered data strategy:

• Raw unstructured data is captured in source systems and moved to HDFS for archival and future analysis
• Summarized or derived data is created using MapReduce and stored in HDFS
• Interactive reporting data is stored in a separate online analytical system

One of the fundamental tenets of Hadoop, however, is that it is more efficient to move processing to the data than to move data to the processing. Today, every shared-nothing massively parallel processing (MPP) analytical database defies this tenet. MPP databases force companies to move their data out of their Hadoop cluster into a different silo for the preferred SQL-based processing. More importantly, once the summarized or derived data has moved to the secondary analytic system, the provenance of the data has been lost, and analyzing the relationship between the raw data and the summarized or derived data becomes at best impractical, and at worst impossible. In many types of analysis, the ability to perform root cause analysis linked to the underlying data is critical.

Although most databases provide indexing and query optimization techniques to make data analysis more efficient, MapReduce does not. MapReduce takes a bulk approach to processing: all data is scanned for every query. The alternative in Hadoop is to work from summary or sample data. MapReduce is used to generate this smaller data set, and subsequent analysis of it is more efficient, with the upfront cost amortized across many queries. If the analysis is only to be performed once, the upfront cost may not pay off. And while it is practical to begin with summarized or derived data, this level of analysis is only a starting point: analysts need to be able to trace back to the fine-grained data in order to understand root cause.
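To make the summarize-then-export pattern concrete, here is a minimal sketch of a Hadoop MapReduce job of the kind described above: it reduces raw web-server log lines stored in HDFS to per-URL hit counts and writes the derived data back to HDFS. The input/output paths and the log format are illustrative assumptions, not details from any particular deployment.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogSummarizer {
      public static class UrlMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] fields = line.toString().split(" ");
          if (fields.length > 6) {               // crude combined-log-format check
            ctx.write(new Text(fields[6]), ONE); // field 6 holds the request URL
          }
        }
      }

      public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text url, Iterable<LongWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          long total = 0;
          for (LongWritable c : counts) total += c.get();
          ctx.write(url, new LongWritable(total)); // derived row: url <tab> hits
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log-summarizer");
        job.setJarByClass(LogSummarizer.class);
        job.setMapperClass(UrlMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path("/raw/weblogs"));        // raw tier
        FileOutputFormat.setOutputPath(job, new Path("/derived/url_hits")); // derived tier
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Run against the raw tier, a job like this produces exactly the kind of summarized data that, in the tiered strategy, would next be exported to an MPP database for interactive SQL access.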

Why Hadapt?

Hadapt is a complete analytical solution for enterprise environments, featuring a SQL interface and Hadoop's ability to work with unstructured, semi-structured, and structured data. Hadapt is a unique, compelling alternative to the current limitations of MapReduce and the tiered data strategy described above. Hadapt's single, elastic infrastructure:

• Manages, secures, processes, and analyzes all tiers of data (unstructured, semi-structured, and structured) in a single system: raw data, summarized/derived data, and interactive report data
• Provides all types of analysis: SQL, MapReduce, and full text search (these methods can be used singly or combined)
• Provides all tiers of service-level agreement performance: interactive queries and batch analytical workloads that perform faster than the alternatives or are otherwise impossible without Hadapt
• Enables true root cause analysis: raw data and summarized/derived data co-exist in the same platform, enabling companies to iterate between summarized views of their information and to drill into the finest level of granularity in their source data

With Hadapt's use of MapReduce, data analysis can go beyond what was traditionally possible with SQL. For example, your business may have a list of product part numbers in a structured SQL database and customer comments about their performance in unstructured form. Before Hadapt, you could leverage that information only with a great deal of effort, using a combination of platforms with attendant cost, lock-in, and flexibility issues. Hadapt, however, enables you to correlate the specific part listed in the structured data with the heat problem reported in the unstructured customer comments, rapidly providing the insight you need into the business problem. Without that knowledge, the problem may become critical, affecting large numbers of products and customers and damaging revenue and reputation.

Thus, Hadapt allows you to combine the structured data analysis of SQL with the best-in-class unstructured data analysis capabilities of Hadoop for insights that can help you build new services and create new revenue streams. In addition, the platform enables you to combine data, without cumbersome connectors, and to apply your business intelligence infrastructure, tools, and skills to all of it. You'll have one system to manage and analyze multiple data sources and structures at scale, using the analytical tools you already know and use today, cost-effectively leveraging the infrastructure you have and eliminating the need for scarce Hadoop and MapReduce expertise. The solution offers high scalability and a lower cost structure through a single platform built on a cluster of commodity hardware.
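As a concrete illustration of the part-number scenario above, the hypothetical sketch below correlates parts with heat complaints in a single SQL query issued over JDBC. The table names, the JDBC URL form, and especially the full-text predicate (text_contains) are placeholders; this paper does not specify Hadapt's actual search syntax.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HeatComplaints {
      public static void main(String[] args) throws Exception {
        String url = "jdbc:hadapt://cluster-host:5432/analytics"; // assumed URL form
        try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT p.part_number, COUNT(*) AS complaints " +
                 "FROM parts p JOIN comments c ON c.part_number = p.part_number " +
                 "WHERE text_contains(c.body, 'overheat') " + // placeholder predicate
                 "GROUP BY p.part_number ORDER BY complaints DESC")) {
          while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
          }
        }
      }
    }

The point is the shape of the query: a relational table (parts) and a corpus of free text (comments) are joined and filtered in one statement, rather than across two systems linked by a connector.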

Big Data Requires New Solutions

Shared-nothing MPP analytical platforms, designed to overcome the scalability problems of big data, were originally architected without taking Hadoop into consideration. To deal with the problem of analyzing structured and unstructured data together, and with the emergence of Hadoop as a new standard in unstructured data processing, new versions from the MPP analytic platform vendors now offer special "connector" code to facilitate the shipping of data back and forth between their platforms and Hadoop. Unfortunately, these "bolt-on" solutions are far from optimal.


Relational SQL databases process structured data very efficiently, and Hadoop is similarly powerful for processing unstructured data. But neither system can both process structured and unstructured data together efficiently and integrate natively with SQL. Using connectors to Hadoop, whereby data is shipped back and forth between Hadoop and the analytical platform, requires separate and distinct infrastructures, with their own costs, increased data silos, performance degradation during transfer, the complexity of maintaining multiple systems, and difficulty operating in any type of cloud deployment.

Clouds are well known for drastic performance fluctuations due to the multi-tenant nature of their environments. Virtual machines in the cloud are generally supposed to be disposable – that is, they should be able to be started and stopped at any time. Unfortunately, MPP analytical platforms plan and optimize analysis in advance and are unable to adapt on the fly during query execution when performance fluctuates or virtual machines involved in processing are stopped.

A superior architectural approach bypasses the connector concept and focuses on tightly integrating the analytical platform into Hadoop, so that the combined system can handle both structured and unstructured query processing. As you consider new computing models, this solution positions the business to adapt in an optimized fashion to unpredictable cloud environments, while improving scalability beyond second-generation analytical platforms.

Hadapt: Built for Performance and Agility with Big Data

The Hadapt platform is a hybrid analytical system designed to tackle massive amounts of data at high speed and is optimized for virtualization. It can handle complex analytical workloads across structured, semi-structured, or unstructured data, including:

• Ad targeting and analysis
• Call detail records analysis
• Clickstream analysis
• eDiscovery
• Fraud detection
• Graph analysis
• Large-scale log file analysis

The platform is optimized to address performance issues in alternative platforms, including:

• Hadoop deployments requiring significant performance gains and a more complete SQL interface
• Structured database applications where lower total cost of operation and infinite scalability are required
• Integration with existing business intelligence tools and workflows
• Big data applications running in virtualized environments (public, private or hybrid clouds)
• Integrated full text search

Hadapt fully integrates with Hadoop, enhancing it with optimized structured data storage and querying capabilities. Petabyte-size datasets and thousands of machines are no longer a problem: impressive performance and scalability are built into the Hadapt platform at every level. Hadapt leverages the MapReduce distributed computing framework for fast analysis of massive datasets, provides rich SQL support, and offers the ability to work with all of your data within one platform. The platform ensures consistent performance with cloud-ready fault tolerance, load balancing, and data replication.


The main components of the Hadapt platform include:

• Flexible Query Interface—Allows queries to be submitted via a MapReduce API and SQL, either directly or through JDBC/ODBC drivers
• Cost-Based Query Optimizer—Implements patent-pending Split Query Execution technology to translate incoming queries into an efficient combination of SQL and MapReduce, executed in parallel by single-node database systems and Hadoop, respectively; generates MapReduce-free plans for many incoming queries to deliver interactive performance
• Adaptive Query Execution™—Provides automatic load balancing and query fault tolerance, and makes Hadapt perfect for large clusters and the cloud
• Hadapt Development Kit™—Facilitates the development and consumption of advanced analytics; empowers business analysts to become data scientists
• Data Loader—Partitions data into small chunks and coordinates their parallel load into the storage layer, while replicating each chunk for fault tolerance and performance
• Data Manager—Stores metadata on schema, data, and chunk distribution; also handles data replication, backups, recovery, and rebalancing
• Hybrid Storage Engine—Combines a standard distributed file system (HDFS) with a DBMS layer optimized for structured data, as well as a full text index store optimized for unstructured data

Flexible Query Interface

The Hadapt platform offers a highly flexible query interface. Data may be queried using SQL and MapReduce individually or in combination. For queries that access data stored exclusively in relational storage, SQL is the recommended query language: its declarative nature makes end users far more productive when writing queries, while giving the query optimizer more flexibility to improve performance. MapReduce remains a valid interface and, in fact, allows data analysts to implement even the most advanced analytical algorithms. Embedding SQL inside MapReduce, or MapReduce inside SQL, provides ultimate flexibility.

Queries can be executed directly via the command line, via a web interface, or remotely through the JDBC/ODBC drivers shipped with Hadapt. The JDBC/ODBC drivers are critical for the customer-facing business intelligence tools that work with database software and aid in visualization, query generation, result dashboards, and advanced data analysis. These tools are an important part of the analytical data management stack. By supporting a standards-compliant but enriched version of SQL, Hadapt enables these tools to work much more flexibly in Hadoop deployments than would be possible in strictly MapReduce or SQL environments.
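The sketch below shows what querying "remotely through the JDBC/ODBC drivers" looks like in practice, using the same standard JDBC pattern BI tools rely on. The driver class name, URL form, and table are assumptions for illustration; the paper states only that JDBC/ODBC drivers ship with Hadapt.

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class JdbcSmokeTest {
      public static void main(String[] args) throws Exception {
        Class.forName("com.hadapt.jdbc.Driver"); // hypothetical driver class name
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hadapt://cluster-host:5432/analytics", "analyst", "secret")) {
          DatabaseMetaData md = conn.getMetaData(); // BI tools introspect schemas via this API
          System.out.println("Connected to: " + md.getDatabaseProductName());
          try (PreparedStatement ps = conn.prepareStatement(
                   "SELECT region, SUM(revenue) FROM sales WHERE year = ? GROUP BY region")) {
            ps.setInt(1, 2012);
            try (ResultSet rs = ps.executeQuery()) {
              while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getBigDecimal(2));
              }
            }
          }
        }
      }
    }

Because the interface is standard JDBC against standards-compliant SQL, the same client code works whether the optimizer answers the query from the relational layer alone or schedules MapReduce work behind the scenes.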

Cost-Based Query Optimizer

Hadapt's SQL interface accepts queries and analyzes them during the query planning phase, drawing on a cost model and decades of research on relational query optimization. It takes into account data partitioning and distribution, indexes, and statistics about the data to create an initial query plan. Hadapt's patent-pending Split Query Execution technology ensures that the amount of work done inside the database management system (DBMS) layer is maximized for performance benefits. Hadoop performs any work that cannot be pushed down to the underlying DBMS layer. In many cases, the generated query plan is MapReduce-free and delivers interactive performance.
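How Hadapt splits any specific plan is not disclosed in this paper; the hypothetical sketch below simply annotates one query with the general pattern behind split execution: push per-partition work down into the node-local DBMSs, and involve the Hadoop layer only where rows must move between nodes.

    public class SplitQueryExample {
      public static void main(String[] args) {
        String query =
            "SELECT c.region, COUNT(*) AS return_count " +
            "FROM orders o JOIN customers c ON o.cust_id = c.cust_id " +
            "WHERE o.status = 'RETURNED' " +
            "GROUP BY c.region";
        // A plausible split for this query:
        // 1. Pushed to each node-local DBMS: the status filter, the join (when
        //    both tables are co-partitioned on cust_id), and a partial count
        //    per region, so each node scans only its own data.
        // 2. Left to the Hadoop layer: repartitioning the small per-node partial
        //    aggregates by region so each region is finalized on one node, or,
        //    for a plan this simple, a MapReduce-free transfer that keeps the
        //    response time interactive.
        System.out.println(query);
      }
    }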

Adaptive Query Execution™

As more and more data is added to a shared-nothing MPP analytical platform, the number of machines (or "nodes") across which the data is partitioned also increases. It is nearly impossible to obtain homogeneous performance across hundreds or thousands of nodes, even if each node runs on identical hardware or in an identical virtual machine. Component failures, disk fragmentation on individual nodes, software configuration errors, and concurrent queries all reduce the homogeneity of cluster performance. The problem is much worse in cloud environments, where concurrent activities performed by different virtual machines located on the same physical machine, or sharing the same network, can cause massive variation in performance.

If the work needed to execute a query is divided equally among the nodes in a shared-nothing cluster, the time required to complete the query will be approximately the time it takes the slowest node to complete its assigned task. A node with degraded performance thus has a disproportionate effect on total query time: if 99 nodes finish their partitions in one minute but one degraded node needs ten, the query takes ten minutes. Because Hadapt was designed from the outset to work optimally in cloud environments, it adaptively adjusts the query plan and the allocation of query processing tasks to worker nodes on the fly, during query processing, delivering much improved query speeds and increased node utilization.

In the cloud, virtual machines are designed to be highly disposable. For example, many users of public clouds will kill and request new virtual machines when a system monitor indicates that a particular virtual machine is seeing reduced performance due to "stolen cycles" or other resource-hogging by other tenants on the same physical hardware. Virtual machines can also fail for more traditional reasons, and these failures become more frequent as the number of nodes involved in processing increases. Adaptive Query Execution allows queries to survive, and route around, both kinds of node loss.

Hadapt Development Kit™

The Hadapt Development Kit (HDK) extends Hadapt's rich native analytic capabilities by allowing end users (analysts, developers, data scientists, etc.) to produce advanced analytic packages by writing Java programs and declaring them as Hadapt SQL functions. The HDK also facilitates the consumption of advanced analytics by leveraging SQL and standard communication protocols such as ODBC and JDBC. Business analysts can invoke HDK packages via SQL and BI tools, without requiring programming skills or a vendor-specific tools extension. Not only does the HDK enable users to create their own algorithms, it also enables organizations to leverage the burgeoning Hadoop ecosystem of advanced algorithms (machine learning, sentiment analysis, etc.) via SQL. By and large, the HDK empowers business analysts to become data scientists.
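The HDK's actual Java API is not shown in this paper, so the class below is purely illustrative: a small, self-contained sentiment scorer of the kind an analyst might write in plain Java and then, per the description above, declare as a Hadapt SQL function so it can be invoked from SQL and BI tools (for example, as something like sentiment(comment_text)).

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class SentimentScore {
      // Toy word lists; a real package would use a proper sentiment model.
      private static final Set<String> POSITIVE =
          new HashSet<>(Arrays.asList("great", "fast", "reliable"));
      private static final Set<String> NEGATIVE =
          new HashSet<>(Arrays.asList("hot", "slow", "broken"));

      /** Returns a score in [-1, 1]: (positive - negative words) / total words. */
      public static double evaluate(String text) {
        if (text == null || text.isEmpty()) return 0.0;
        String[] words = text.toLowerCase().split("\\W+");
        int pos = 0, neg = 0;
        for (String w : words) {
          if (POSITIVE.contains(w)) pos++;
          if (NEGATIVE.contains(w)) neg++;
        }
        return words.length == 0 ? 0.0 : (pos - neg) / (double) words.length;
      }

      public static void main(String[] args) {
        System.out.println(evaluate("The drive runs hot and slow")); // negative score
      }
    }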


Data Loader

Hadapt loads data using all machines in parallel, achieving high throughput even on extremely large data sets. For maximum query performance and fault tolerance, data is partitioned into small chunks that are replicated across the cluster.
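This is not Hadapt's loader, but the following minimal sketch illustrates the general technique the description implies: hash each row on a key into one of many small chunks, so that chunks can then be loaded, and replicated, in parallel.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class ChunkPartitioner {
      // Hash rows on their key into a fixed number of small chunks.
      public static List<List<String>> partition(List<String> rows, int numChunks) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < numChunks; i++) chunks.add(new ArrayList<String>());
        for (String row : rows) {
          String key = row.split(",")[0]; // assume the key is the first CSV field
          int chunk = (key.hashCode() & 0x7fffffff) % numChunks; // stable, non-negative
          chunks.get(chunk).add(row);
        }
        return chunks;
      }

      public static void main(String[] args) {
        List<String> rows = Arrays.asList("42,widget", "17,gadget", "42,sprocket");
        // Rows sharing key 42 always land in the same chunk; in a real loader each
        // chunk would be written to the storage layer and replicated in parallel.
        System.out.println(partition(rows, 4));
      }
    }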

Data Manager

The Data Manager stores metadata relating to the schema, the data, and the chunk distribution. This information helps the Cost-Based Query Optimizer choose the best execution plan. The Data Manager also handles data replication, backups, recovery, and chunk rebalancing across the cluster.


Hybrid Storage Engine

A DBMS engine and an Apache SOLR™-based full text index store, installed on each node, complement a standard distributed file system (HDFS). The DBMS layer is optimized for structured data, while the combination of HDFS and the full text index store is optimized for unstructured and semi-structured data. HDFS also serves as the landing zone for raw data sets of any structure.
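The sketch below is a hypothetical illustration of the routing idea behind such a hybrid engine: structured rows go to the relational layer, free text goes to the full-text index, and everything lands in HDFS in its raw form first. The store interfaces are placeholders; the paper does not describe Hadapt's internal storage APIs.

    public class HybridRouter {
      interface RelationalStore { void insertRow(String table, String[] columns); }
      interface TextIndex      { void indexDocument(String id, String body); }
      interface RawStore       { void append(String path, byte[] bytes); }

      private final RelationalStore dbms;
      private final TextIndex solr;
      private final RawStore hdfs;

      HybridRouter(RelationalStore dbms, TextIndex solr, RawStore hdfs) {
        this.dbms = dbms; this.solr = solr; this.hdfs = hdfs;
      }

      void ingest(String id, String record) {
        hdfs.append("/raw/landing/" + id, record.getBytes()); // raw tier always lands in HDFS
        String[] fields = record.split("\t");
        if (fields.length > 1) {
          dbms.insertRow("events", fields); // tabular rows: node-local DBMS layer
        } else {
          solr.indexDocument(id, record);   // free text: SOLR-based full text index
        }
      }

      public static void main(String[] args) {
        HybridRouter r = new HybridRouter(
            (t, cols) -> System.out.println("DBMS <- " + t + ": " + cols.length + " cols"),
            (i, b)    -> System.out.println("SOLR <- " + i),
            (p, b)    -> System.out.println("HDFS <- " + p));
        r.ingest("rec1", "42\twidget\tOK");           // tab-separated: routed to the DBMS
        r.ingest("rec2", "the unit overheats daily"); // free text: routed to the index
      }
    }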

Hadapt: A Cost-Effective Solution Enabling Business Growth

Whatever the size, type, or stage of your business, Hadapt can enable new market, product, and service insights derived from high-value data analysis. The Hadapt solution is ideally suited to meet current business needs around big data by marrying the familiarity of SQL data queries with the power of Hadoop-based processing of unstructured data – in a single platform with scale, cost-effectiveness, and reliability.

Today in retail and most other enterprises, the information and signals relating to customer behavior are dispersed over many resources and appear in many formats. Deriving intelligence from these resources requires big data analysis along with flexible, scalable infrastructure. With one system to manage and analyze multiple data sources and structures while leveraging familiar analytical tools, Hadapt can deliver the understanding of your customers required to optimize advertising spending, basket packaging, discounting, placement, purchasing behavior, stock levels, and supply chain.

In banking, big data analysis can be applied to customer service and purchasing needs, as well as risk warehousing and log file analysis. Hadapt enables interactive reporting and long-term retention of risk data on a cost-effective platform. Hadapt is also cost-effective and efficient at streamlining discovery of root cause in application and service issues. IT no longer has to log into multiple systems to discover root cause; with Hadapt, IT can pre-aggregate all log data and then perform automatic detection and filtering to make it easy for the help desk to determine root cause.

Insurance, telecommunications, and ad network companies all share the same needs for customer behavior analysis. Telecommunications companies, for example, can use Hadapt to analyze customer call detail records to understand usage, whether to identify new service opportunities or to refine service bundling and packaging. Insurance companies, meanwhile, operate under strict regulatory constraints around the retention of electronic records; Hadapt allows companies with this kind of legal mandate to maintain electronic documents in their original formats, with search and extraction capabilities for legal case management systems.

In addition to these market opportunities, Hadapt anticipates and enables new technology trends within the enterprise around tiered data storage and analysis and cloud computing. In the tiered model, raw data is retained while some data is summarized, transformed, and derived for retention and analysis, and some data is analyzed as is. The organization can then adopt a mixed data processing model in which an increasing share of the high-value workloads previously managed in costly analytical databases and data warehouses moves to Hadapt, where it is more cost-effective and easier to scale, enabling the organization to derive valuable insights from more information more quickly.


Not only does Hadapt help enterprises with efficient, high-value structured and unstructured data analysis, but it also (optionally) enables more cost-effective, efficient virtualized and cloud computing models. The cloud computing model, including both public and private clouds, has emerged as another environment in which to perform analysis. The advantage of running analyses in the cloud is that resources are virtualized and cost-effective. The cloud increases utilization of the underlying hardware and is a flexible deployment option; resources needed for a particular analysis are allocated only for the duration of the analysis and can be released once the analysis is finished.

This on-demand allocation and release of resources is a huge shift away from the traditional approach of buying hardware devoted exclusively to the analytical database. Such appliances sit idle during quiet periods, while significant capital expenditure is allocated to over-provision the platform to handle peak loads. In the enterprise or in a public/private cloud scenario, Hadapt makes the transition to big data analysis easier and more efficient than is possible with other approaches.


Hadapt's Adaptive Analytical Platform enables organizations to perform complex analytics on structured and unstructured data, all in one cloud-optimized system.

Conclusion

Faster and better decision-making based on more relevant information is driving enterprise productivity, lower IT costs, and faster time to market. One of the barriers to these analytic performance and cost improvements has been the lack of a platform designed specifically to harness and leverage big data to solve business problems faster, with more accurate and timely insights. Hadapt offers businesses a clear, cost-effective path to analyzing and storing structured and unstructured data in the enterprise and the cloud, scaling with the enterprise.

Hadapt runs on commodity hardware or on public/private clouds, providing a low total cost of ownership. More importantly, Hadapt's architecture also accelerates analysis of the structured data stored in a Hadoop cluster, overcoming major performance problems of native Hadoop implementations. While Hadoop itself offers almost no SQL support, Hadapt provides SQL queries that perform orders of magnitude faster than Hadoop and Hive, as well as integrated full text search. Furthermore, Hadapt's SQL is derived from the SQL support of its underlying relational DBMS and is richer and more standards-compliant than that of open-source offerings, permitting streamlined integration with existing business intelligence tools and workflows.

Hadapt's patent-pending technology automatically splits query execution between the platform's Hadoop and RDBMS layers, providing an optimized system that leverages the scalability of Hadoop and the speed of relational database technology. In addition to these speed and scalability advantages, Hadapt provides the ability to perform analysis, including root cause analysis, across all of your data sets within one platform. The alternative "bolt-on" approach to Hadoop requires two disparate systems while introducing delays, creating unintended data silos, and increasing total cost of operation.

Hadapt's Adaptive Analytical Platform™ is a complete solution for data warehousing and analysis needs. The performance and cost-effectiveness of Hadapt, together with its enterprise and cloud centricity, enable businesses to turn big data into knowledge and to offer new services with confidence, now and in the future.

ABOUT HADAPT

Hadapt has developed the industry's only Big Data analytic platform natively integrating SQL with Apache Hadoop. The unification of these traditionally segregated platforms enables customers to analyze all of their data (structured, semi-structured and unstructured) in a single platform—no connectors, complexities or rigid structure. The company is headquartered in Cambridge, MA.

614 Massachusetts Avenue, 4th Floor
Cambridge, MA 02139

(617) 539-6110
[email protected]

hadapt.com
