1. What Is Cloudera?
Cloudera is revolutionizing enterprise data management by offering the first unified Platform for Big Data: The Enterprise Data Hub. Cloudera offers enterprises one place to store, process, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data.
Founded in 2008, Cloudera was the first provider and supporter of Apache Hadoop for the enterprise, and it remains the leading one. Cloudera also offers software for business-critical data challenges including storage, access, management, analysis, security, and search.
Customer success is Cloudera's highest priority. We’ve enabled long-term, successful deployments for hundreds of customers, with petabytes of data collectively under management, across diverse industries.
2. What Is An Enterprise Data Hub?
An enterprise data hub is one place to store all your data, for as long as desired or required, in its original fidelity; integrated with existing infrastructure and tools; with the flexibility to run a variety of enterprise workloads -- including batch processing, interactive SQL, enterprise search, and advanced analytics -- together with the robust security, governance, data protection, and management that enterprises require. With an enterprise data hub, leading organizations are changing the way they think about data, transforming it from a cost into an asset.
3. What Are Some Common Use Cases For An Enterprise Data Hub?
A Hadoop-based enterprise data hub allows you to process and access more data than ever before, so it has many near-term (operational) as well as long-term (strategic) use cases across multiple industries. Generally, enterprise data hub use cases fall into these broad categories:
Transformation and enrichment: Transform and process large amounts of data more quickly, reliably, and affordably (for loading into the data warehouse, for example).
Active archive: Get access to data that would otherwise be taken offline (typically to tape) due to the high cost of actively managing it.
Self-service exploratory BI: Allow users to explore data, with full security, using traditional interactive business intelligence tools via SQL and keyword search.
Advanced analytics: Rather than making them examine samples of data, or snapshots from short time periods, let users combine all historical data, in its full fidelity, for comprehensive analyses.
4. What Makes Cloudera's Products Unique?
Cloudera’s platform has several differentiating attributes that make it unique, including:
Differences from commercial alternatives: Cloudera offers differentiating capabilities such as production-grade interactive SQL and Search on Hadoop; comprehensive system management with rolling upgrades, automated disaster recovery, centralized security, proactive health checks, and multi-cluster management; and simplified data management with granular auditing and access control capabilities.
Differences from stock Apache Hadoop: Although Cloudera's platform contains the same code that can be found in the “upstream” Hadoop ecosystem projects, Cloudera ships new bug fixes and stable features to users of its platform on a regular (quarterly) basis, and contributes them to the upstream code base as well. Thus, Cloudera customers get predictable and regular access to platform improvements, along with the assurances of rigorous testing and upstream compatibility.
5. Is Cloudera's Platform Open Source?
The core of Cloudera’s platform, CDH, is open source (Apache License), so users always have the option to move their data to an alternative -- and thus Cloudera must continually earn your business based on merit. In fact, Cloudera is an open source leader in Big Data, with its employees collectively contributing more code to the Hadoop ecosystem than those of any other company.
Cloudera complements this open core with closed-source management software that provides key enterprise functionality requested by customers, such as support for rolling upgrades, auditing management, and disaster recovery. That software, however, does not store or process data, so lock-in is not an issue.
6. Explain About Cloudera Search?
Cloudera Search provides near real-time access to data stored in or ingested into Hadoop and HBase. Search provides near real-time indexing, batch indexing, full-text exploration and navigated drill-down, as well as a simple, full-text interface that requires no SQL or programming skills. Fully integrated in the data-processing platform, Search uses the flexible, scalable, and robust storage system included with CDH. This eliminates the need to move large data sets across infrastructures to perform business tasks.
7. What Are The Security Guidelines Of Impala?
The following are the major steps to harden a cluster running Impala against accidents and mistakes, or against malicious attackers trying to access sensitive data:
Secure the root account. The root user can tamper with the impalad daemon, read and write the data files in HDFS, log into other user accounts, and access other system services that are beyond the control of Impala.
Restrict membership in the sudoers list (in the /etc/sudoers file). The users who can run the sudo command can do many of the same things as the root user.
Ensure the Hadoop ownership and permissions for Impala data files are restricted.
Ensure the Hadoop ownership and permissions for Impala log files are restricted.
Ensure that the Impala web UI (available by default on port 25000 on each Impala node) is password-protected.
Create a policy file that specifies which Impala privileges are available to users in particular Hadoop groups (which by default map to Linux OS groups). Create the associated Linux groups using the groupadd command if necessary.
The Impala authorization feature makes use of the HDFS file ownership and permissions mechanism; for background information, see the CDH HDFS Permissions Guide. Set up users and assign them to groups at the OS level, corresponding to the different categories of users with different access levels for various databases, tables, and HDFS locations (URIs). Create the associated Linux users using the useradd command if necessary, and add them to the appropriate groups with the usermod command.
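The OS-level setup described in the last step can be sketched as a small dry-run script. This is a minimal illustration, not an official procedure: the group and user names are hypothetical, and the script only builds the groupadd, useradd, and usermod command lines so that they can be reviewed before someone runs them as root.

```python
from typing import Dict, List


def build_account_commands(groups: Dict[str, List[str]]) -> List[List[str]]:
    """Build groupadd/useradd/usermod commands for a group -> users mapping.

    Nothing is executed here; the caller decides whether to run the commands.
    """
    commands: List[List[str]] = []
    seen_users = set()
    for group, users in groups.items():
        commands.append(["groupadd", group])
        for user in users:
            if user not in seen_users:
                # Create each Linux user only once, even if it is in several groups.
                commands.append(["useradd", user])
                seen_users.add(user)
            # -a -G appends a supplementary group without dropping existing ones.
            commands.append(["usermod", "-a", "-G", group, user])
    return commands


# Hypothetical access categories, for illustration only.
example_groups = {"analysts": ["alice", "bob"], "etl": ["bob"]}
for command in build_account_commands(example_groups):
    print(" ".join(command))
```

Keeping the commands as data makes it easy to review them, or to skip accounts that already exist, before anything touches the system.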
8. Explain About Data Encryption?
Data encryption and key management provide a critical layer of protection against potential threats by malicious actors on the network or in the datacenter. Encryption and key management are also requirements for meeting key compliance initiatives and ensuring the integrity of your enterprise data. The following Cloudera Navigator components enable compliance groups to manage encryption:
Cloudera Navigator Encrypt transparently encrypts and secures data at rest without requiring changes to your applications and ensures there is minimal performance lag in the encryption or decryption process.
Cloudera Navigator Key Trustee Server is an enterprise-grade virtual safe-deposit box that stores and manages cryptographic keys and other security artifacts.
Cloudera Navigator Key HSM allows Cloudera Navigator Key Trustee Server to seamlessly integrate with a hardware security module (HSM).
Cloudera Navigator data management and data encryption components can be installed independently.
9. Why Do Customers Choose Cloudera?
Cloudera was the first commercial provider of Hadoop-related software and services and has the most customers with enterprise requirements, and the most experience supporting them, in the industry. Cloudera’s combined offering of differentiated software (open and closed source), support, training, professional services, and indemnity brings customers the greatest business value, in the shortest amount of time, at the lowest TCO.
10. What Is Hadoop?
The Hadoop project, which Doug Cutting (now Cloudera's Chief Architect) co-founded in 2006, is an effort to create open source implementations of internal systems used by Web-scale companies such as Google, Yahoo!, and Facebook to manage and process massive data volumes. Hadoop, combined with related ecosystem projects, enables distributed, parallel processing of huge amounts of data across industry-standard servers (with storage and processing occurring on the same machines), and it scales out as more servers are added.
11. What Is Hadoop's Role In An Enterprise Data Hub?
Hadoop has evolved into a stable, scalable, flexible core for next-generation data management -- yet on its own, it lacks some critical capabilities when deployed as the center of an enterprise data hub. For example, it lacks a comprehensive security model across the entire ecosystem of projects. Hadoop was also built for batch-mode data processing workloads, which limits it to an ancillary position in the data center. (Rather, a central enterprise data hub must have real-time capability.) And Hadoop doesn’t support the range of industry-standard interfaces for query and search applications, among others, that business users require.
12. Why Do I Need A Cloudera Enterprise Subscription?
Cloudera Enterprise subscriptions, which include access to differentiated system and data management software, 8x5 or 24x7 support, and indemnity, are an essential ingredient in any sustainable deployment of an enterprise data hub.
13. What Does Cloudera’s Open Source Leadership Mean For Customers?
Open source benefits, such as freedom from lock-in, are tangible and time-tested. That said, they are just table stakes when deploying an enterprise data hub based on open source software such as Hadoop.
Cloudera also leads the way in ensuring that customer needs for performance, availability, security, and recoverability are met by new features in the Apache code base, and then ships and supports those features for customers in our platform. To make that goal possible, Cloudera employs more ecosystem committers, establishes more successful new ecosystem projects, and contributes more code to that ecosystem than any other vendor.
14. Do Cloudera’s Products Work With My Existing Data Management Infrastructure?
The Cloudera Connect Partner Program, more than 700 companies strong, is designed to champion partner advancement and solution development for the Big Data ecosystem. As the Hadoop vendor with the most partners and the only Hadoop provider with a technology certification program, Cloudera ensures consistency, reliability, and tight integration with enterprise environments.
15. Explain Impala Security?
Impala includes a fine-grained authorization framework for Hadoop, based on the Sentry open source project. Sentry authorization was added in Impala 1.1.0. Together with the Kerberos authentication framework, Sentry takes Hadoop security to a new level needed for the requirements of highly regulated industries such as healthcare, financial services, and government. Impala also includes an auditing capability: Impala generates the audit data, the Cloudera Navigator product consolidates the audit data from all nodes in the cluster, and Cloudera Manager lets you filter, visualize, and produce reports.
The security features are divided into these broad categories:
Authorization: Which users are allowed to access which resources, and what operations are they allowed to perform? Impala relies on the open source Sentry project for authorization. By default (when authorization is not enabled), Impala does all read and write operations with the privileges of the impala user, which is suitable for a development/test environment but not for a secure production environment. When authorization is enabled, Impala uses the OS user ID of the user who runs impala-shell or other client program, and associates various privileges with each user.
Authentication: How does Impala verify the identity of the user to confirm that they really are allowed to exercise the privileges assigned to that user? Impala relies on the Kerberos subsystem for authentication.
Auditing: What operations were attempted, and did they succeed or not? This feature provides a way to look back and diagnose whether attempts were made to perform unauthorized operations. You use this information to track down suspicious activity, and to see where changes are needed in authorization policies. The audit data produced by this feature is consolidated by the Cloudera Navigator product and then presented in a user-friendly form by the Cloudera Manager product.
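To make the authorization category more concrete, the sketch below renders a file-based Sentry policy. It assumes the INI-style layout with [groups] and [roles] sections and privilege strings of the form server=...->db=...->table=...->action=...; the server, database, group, and role names are all hypothetical, and the exact syntax should be checked against the Sentry documentation for the release in use.

```python
from typing import Dict, List


def render_policy(groups: Dict[str, List[str]], roles: Dict[str, List[str]]) -> str:
    """Render a policy file: groups map to roles, roles map to privileges."""
    lines = ["[groups]"]
    for group, role_names in groups.items():
        # Each Hadoop (by default, Linux) group is granted one or more roles.
        lines.append(f"{group} = {', '.join(role_names)}")
    lines.append("")
    lines.append("[roles]")
    for role, privileges in roles.items():
        # Each role carries one or more privilege strings.
        lines.append(f"{role} = {', '.join(privileges)}")
    return "\n".join(lines) + "\n"


# Hypothetical names, for illustration only.
policy_text = render_policy(
    groups={"analysts": ["read_sales"]},
    roles={"read_sales": ["server=server1->db=sales->table=*->action=SELECT"]},
)
print(policy_text)
```

Generating the file from a single mapping keeps the group names in the policy aligned with the groups that actually exist at the OS level.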
16. Explain About Configuring Encryption?
The goal of encryption is to ensure that only authorized users can view, use, or contribute to a data set. These security controls add another layer of protection against potential threats by end-users, administrators, and other malicious actors on the network. Data protection can be applied at a number of levels within Hadoop:
OS Filesystem-level - Encryption can be applied at the Linux operating system filesystem level to cover all files in a volume. An example of this approach is Cloudera Navigator Encrypt (formerly Gazzang zNcrypt) which is available for Cloudera customers licensed for Cloudera Navigator. Navigator Encrypt operates at the Linux volume level, so it can encrypt cluster data inside and outside HDFS, such as temp/spill files, configuration files and metadata databases (to be used only for data related to a CDH cluster). Navigator Encrypt must be used with Cloudera Navigator Key Trustee Server (formerly Gazzang zTrustee).
Network-level - Encryption can be applied to encrypt data just before it gets sent across a network and to decrypt it just after receipt. In Hadoop, this means coverage for data sent from client user interfaces as well as service-to-service communication like remote procedure calls (RPCs). This protection uses industry-standard protocols such as TLS/SSL.
DFS-level - Encryption applied by the HDFS client software. HDFS Transparent Encryption operates at the HDFS folder level, allowing you to encrypt some folders and leave others unencrypted. HDFS transparent encryption cannot encrypt any data outside HDFS. To ensure reliable key storage (so that data is not lost), use Cloudera Navigator Key Trustee Server; the default Java keystore can be used for test purposes.
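As a rough illustration of the DFS-level option, the sketch below builds the commands typically used to set up an encrypted HDFS folder (an "encryption zone" in HDFS terminology): create a key, create the directory, then mark the directory as encrypted with that key. The key name and path are hypothetical, the commands are only assembled rather than executed, and a key provider such as Key Trustee Server is assumed to be configured already.

```python
from typing import List


def encryption_zone_commands(key_name: str, zone_path: str) -> List[List[str]]:
    """Return the command lines for creating an encrypted HDFS folder."""
    return [
        # Create the encryption key in the configured key provider.
        ["hadoop", "key", "create", key_name],
        # The folder must exist (and be empty) before it is made a zone.
        ["hdfs", "dfs", "-mkdir", "-p", zone_path],
        # Files written under this folder are then encrypted with the key.
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", zone_path],
    ]


# Hypothetical key name and path, for illustration only.
for command in encryption_zone_commands("sales_key", "/data/sales_secure"):
    print(" ".join(command))
```

Folders outside the zone are left unencrypted, which matches the folder-by-folder behavior described above.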
17. What Are Cloudera's Products?
Cloudera’s platform, which is designed to specifically address customer opportunities and challenges in Big Data, is available in the form of free/unsupported products (CDH or Cloudera Express, for those interested solely in a free Hadoop distribution), or as supported, enterprise-class software (Cloudera Enterprise - in Basic, Flex, and Data Hub editions) in the form of an annual subscription. All the integration work is done for you, and the entire solution is thoroughly tested for enterprise requirements and fully documented.
18. Why Does Open Source Matter For Customers?
Open source licensing and development offers customers powerful benefits, including freedom from lock-in, free no-obligation evaluation, rapid innovation on a global scale, and community-driven development. Freedom from lock-in is particularly important for customers where components that store and process data are involved.
19. How To Configure TLS Encryption For Cloudera Manager?
When you configure authentication and authorization on a cluster, Cloudera Manager Server sends sensitive information over the network to cluster hosts, such as Kerberos keytabs and configuration files that contain passwords. To secure this transfer, you must configure TLS encryption between Cloudera Manager Server and all cluster hosts.
TLS encryption is also used to secure client connections to the Cloudera Manager Admin Interface, using HTTPS.
Cloudera Manager also supports TLS authentication. Without certificate authentication, a malicious user can add a host to Cloudera Manager by installing the Cloudera Manager Agent software and configuring it to communicate with Cloudera Manager Server. To prevent this, you must install certificates on each agent host and configure Cloudera Manager Server to trust those certificates.
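The idea behind certificate authentication can be illustrated with Python's standard ssl module. This is a generic sketch of a TLS server that refuses peers without a trusted certificate, not Cloudera Manager's actual configuration, which is set through its own administration settings and agent configuration; the function name and file parameters are hypothetical.

```python
import ssl
from typing import Optional


def server_context_requiring_client_certs(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    trusted_ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a server-side TLS context that demands a client certificate."""
    # Purpose.CLIENT_AUTH creates a context for the server end of a connection.
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Reject any peer that does not present a certificate from a trusted CA.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if trusted_ca_file:
        # CA certificate(s) used to validate the certificates that peers present.
        ctx.load_verify_locations(cafile=trusted_ca_file)
    if cert_file:
        # The server's own certificate and private key.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


# Built without files here so the sketch runs anywhere; real use needs all three.
context = server_context_requiring_client_certs()
print(context.verify_mode)
```

With a context like this, a host that merely installs the client software cannot complete the TLS handshake unless it also holds a certificate the server trusts.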
20. Explain About Cloudera Data Management?
Data management activities include auditing access to data residing in HDFS and Hive metastores, reviewing and updating metadata, and discovering the lineage of data objects.
Cloudera Navigator is a fully integrated data-management and security system for the Hadoop platform. Cloudera Navigator enables a broad range of stakeholders to work with data at scale:
Compliance groups must track and protect access to sensitive data. They must be prepared for an audit, track who accesses data and what they do with it, and ensure that sensitive data is governed and protected.
Hadoop administrators and DBAs are responsible for boosting user productivity and cluster performance. They want to see how data is being used and how it can be optimized for future workloads.