Published: 4/27/2012 Last Revised: 5/10/2012
I’m trying to learn all I can about Oracle’s Big Data offerings. Installing the Big Data Connectors from Oracle looks like a good place to start. Here’s how I installed them.
The Oracle Direct Connector for Hadoop Distributed File System (ODC HDFS) and the Oracle Loader for Hadoop (OLH) seemed like good choices for getting started with Big Data.
Download the two connector files from the Oracle Big Data Connectors download page. On second thought, get them all if you plan to do more with Oracle’s Big Data Connectors in the future.
Before we go much further, I’m assuming you have one machine with Oracle to play with and another with Cloudera’s distribution of Apache Hadoop. If not, and you’re like me (not hardware blessed), you should consider installing Oracle VM VirtualBox on your desktop. I have 8 GB of RAM on a Windows 7 box, and I created 2 Linux VMs.
VM1 – RDBMS on OEL6u2
- Oracle Enterprise Linux 6 Update 2, 64 bit (OEL6U2_64)
- Oracle’s RDBMS 11g Release 2 (11.2.0) Enterprise Edition
The VM Settings on VM1:
- Base Memory: 3072 MB
- 4 Processors (1 will work fine for learning)
- Hard Disk: 30 G – Dynamic (to save space, static would be faster)
- Network: Host-Only Adapter
- No Firewall and No Security (SELINUX=disabled)
For the 2nd VM I used Cloudera’s pre-built instance, which you can download from Cloudera’s site. VMware and KVM images are available in addition to the VirtualBox one.
VM2 – Cloudera Hadoop
- Cloudera’s VM Instance
VM Settings on VM2:
- OS: Linux – Red Hat (64 bit)
- Base Memory: 2048 MB
- Network: Host-Only Adapter
On VM1 I used the oracle-rdbms-server-11gR2-preinstall package announced on Oracle’s Linux blog, and it made the install go quickly.
Make sure both machines can communicate with each other and with your host machine. Let’s give them names here for communication purposes: the RDBMS server can be “odbs” (for Oracle Database Server), and the Cloudera VM can be “cloud-alocal”. If you stick with the Host-Only network and are on Windows, I think it will be a 192.168.56.x network. You assign names to each VM, and the host will be 192.168.56.1. If you have any trouble with host-only, you can try the bridged adapter instead, making sure you pick the same adapter for both machines and use DHCP or static addresses. I like Host-Only since it isolates my VMs from the rest of the network.
I’m also assuming you’ve done all the basic database checks to see that sqlplus logins work and your listener is up, la, la, la… If you’ve lived in a cave for a while, it’s a good time to brush up on the 11g install.
If you haven’t looked at it already, you may want to look at Oracle Support Master Note 1416116.1, “Master Note for Big Data Appliance Integrated Software”. It seems to have good info on the Oracle Big Data Connectors.
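One simple way to make the names resolve is a pair of hosts-file entries on each VM. Here is a minimal sketch; the two addresses are hypothetical placeholders in the 192.168.56.x range, so substitute whatever your VMs actually received. It only prints the entries so you can review them before appending them as root.

```shell
# Hypothetical addresses; substitute the ones your VMs actually use.
ODBS_IP="192.168.56.101"
CLOUD_IP="192.168.56.102"

# Print one hosts-file line per VM so the entries can be reviewed first.
hosts_entries() {
  printf '%s\t%s\n' "$ODBS_IP" "odbs"
  printf '%s\t%s\n' "$CLOUD_IP" "cloud-alocal"
}

hosts_entries
# To apply on each VM (as root):  hosts_entries >> /etc/hosts
# Then check reachability:        ping -c 1 cloud-alocal
```

Once the entries are in place on both VMs, a ping in each direction is a quick sanity check.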
On odbs we need to install only the basic Hadoop software. In preparation for installing Cloudera Hadoop 0.20.2, we first need to install Java. I used jdk-6u31-linux-x64-rpm.bin, downloaded from Oracle’s Java download page, changed the permissions (chmod o+x), and ran the binary. When it finished, the JDK was installed under /usr/java/jdk1.6.0_31. I used the path /usr/java/latest, which is a link to the latest install, for all my paths later in this installation so I won’t have to go back after upgrading the JDK at some point.
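A quick check that the link is in place can save some head scratching later. This is only a sketch, assuming the JDK installer created /usr/java/latest as described above:

```shell
# Path used throughout this install; assumed to be created by the JDK rpm.
JAVA_HOME=/usr/java/latest

if [ -x "$JAVA_HOME/bin/java" ]; then
  ls -l "$JAVA_HOME"               # shows which JDK directory the link points to
  "$JAVA_HOME/bin/java" -version   # confirms the JVM actually runs
else
  echo "no JDK found under $JAVA_HOME" >&2
fi
```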
Still on odbs, now with Oracle Java installed, we are ready to install Cloudera Hadoop.
As root, create a user hdfs:
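A minimal sketch of this step, assuming the standard useradd and passwd tools. It only reports whether the account exists and prints the commands to run as root; it does not create anything itself. Later steps run sudo as hdfs, so the account presumably also needs sudo rights, which is my assumption and not something this check covers.

```shell
# Report whether the hdfs account exists; creating it requires root.
if id hdfs >/dev/null 2>&1; then
  status="present"
else
  status="missing"
  echo "as root run: useradd -m hdfs && passwd hdfs"
fi
echo "hdfs user is $status"
```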
On a side note, Cloudera has free training videos and some good documentation on their website. You might take a look around while you’re there. Below is one way to install Cloudera Hadoop; see the CDH3 Installation Guide for others if this doesn’t work for you.
1) Download the package:
Red Hat / CentOS 5 at:
Red Hat / CentOS 6 at:
2) Use yum
sudo yum --nogpgcheck localinstall cdh3-repository-1.0-1.noarch.rpm
Get the GPG Key:
You may want to set up alternatives for the Hadoop configuration. There’s a good document on this at Cloudera’s site. /etc/alternatives shows all the alternatives set up. Cloudera sets up one called hadoop-0.20-conf for this version of the install. Run
$ ls /etc/alternatives
to see the full list.
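Copy the empty configuration to a new directory for your cluster and set its permissions:
$ sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.my_cluster
$ sudo chmod -R 755 /etc/hadoop-0.20/conf.my_cluster
Enter the password when sudo prompts for it.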
You might want to add a profile if you haven’t already. Add /usr/sbin to the PATH for the next step.
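A minimal sketch of the profile lines, assuming a Bourne-style shell and that they go in the hdfs user’s ~/.bash_profile (the file name is my assumption). It adds /usr/sbin only when it is not already on the PATH:

```shell
# Append /usr/sbin to PATH only if it is not already present.
case ":$PATH:" in
  *:/usr/sbin:*) ;;                  # already there, leave PATH alone
  *) PATH="$PATH:/usr/sbin" ;;
esac
export PATH

# The alternatives command should now be found.
command -v alternatives || echo "alternatives not found on PATH" >&2
```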
sudo alternatives --display hadoop-0.20-conf
[sudo] password for hdfs:
hadoop-0.20-conf - status is auto.
link currently points to /etc/hadoop-0.20/conf.empty
/etc/hadoop-0.20/conf.empty - priority 10
Current `best' version is /etc/hadoop-0.20/conf.empty.
You can see that /etc/hadoop-0.20/conf.empty is the currently selected alternative. Let’s change it to the new one you created.
sudo alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.my_cluster 50
sudo alternatives --config hadoop-0.20-conf
There are 2 programs which provide ‘hadoop-0.20-conf’.
*+ 2 /etc/hadoop-0.20/conf.my_cluster
Select your new alternative if it is not already selected.
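Now point Hadoop at the Cloudera VM by setting fs.default.name in the core-site.xml file of your new configuration directory:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://cloud-alocal:8020</value>
  </property>
</configuration>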
The default fs.default.name port is 8020, and cloud-alocal is the Cloudera VM we set up earlier (or whatever Hadoop node or cluster you are targeting).
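To confirm that the setting landed where Hadoop will read it, something like the following can pull the value back out of the file. It is a rough sketch: fs_default_name is a hypothetical helper of mine, the file path is assumed from the alternatives setup above, and the sed pattern expects the name and value elements to sit next to each other as in the snippet above.

```shell
# Print the value of fs.default.name from a core-site.xml file.
fs_default_name() {
  tr -d '\n' < "$1" |
    sed -n 's:.*<name>fs\.default\.name</name>[[:space:]]*<value>\([^<]*\)</value>.*:\1:p'
}

# Path assumed from the alternatives setup; adjust if yours differs.
CONF_FILE="${CONF_FILE:-/etc/hadoop-0.20/conf/core-site.xml}"
if [ -r "$CONF_FILE" ]; then
  fs_default_name "$CONF_FILE"
else
  echo "cannot read $CONF_FILE" >&2
fi
```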
While you’re sudo’d to root, make changes to the following files as well.
Edit the /etc/hadoop/conf/hadoop-env.sh file; the relevant part is shown below:
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
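The setting this comment refers to is JAVA_HOME. Assuming the /usr/java/latest link from the Java install earlier, the line in hadoop-env.sh would look something like this:

```shell
# Assumed value, matching the /usr/java/latest link used earlier.
export JAVA_HOME=/usr/java/latest
```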
List the /user directory in HDFS to check the connection:
$ hadoop fs -ls /user
Found 2 items
drwxr-xr-x - cloudera supergroup 0 2012-04-05 14:00 /user/cloudera
drwxr-xr-x - hue supergroup 0 2012-03-14 20:22 /user/hive
Try a few commands:
$ touch test_src.lst
$ hadoop fs -copyFromLocal test_src.lst /user/scott/data/test.lst
Try hadoop fs to get the usage:
$ hadoop fs
So now you have a file in the Hadoop file system, and if you search for that full directory path and file you won’t see it on the OS file system.
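You can see the difference for yourself by asking both file systems about the same path. A small sketch, assuming the file was copied to /user/scott/data/test.lst as above:

```shell
HDFS_PATH=/user/scott/data/test.lst

# Ask HDFS about the file (only if the hadoop client is on the PATH).
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -ls "$HDFS_PATH"
else
  echo "hadoop client not found on PATH" >&2
fi

# Ask the local file system about the same path.
if [ -e "$HDFS_PATH" ]; then
  local_state="present"
else
  local_state="absent"
fi
echo "local file system: $HDFS_PATH is $local_state"
```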
Let’s put the connector code in place:
Go to the download directory for the zipped connector files:
# set DIRECTHDFS_HOME to install directory for DirectHDFS
Oracle Direct HDFS Release 18.104.22.168.0 - Production
Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved.
Usage: $HADOOP_HOME/bin/hadoop jar orahdfs.jar oracle.hadoop.hdfs.exttab.HdfsStream <locationPath>
chmod 770 $ORACLE_BASE/external_dir
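The chmod above assumes the directory already exists. Here is a minimal sketch of creating it first, assuming ORACLE_BASE is set in the oracle user’s environment (/u01/app/oracle in my case). The fallback value is a hypothetical scratch location so the snippet can be tried safely elsewhere.

```shell
# ORACLE_BASE normally comes from the oracle user's environment; the
# fallback is a hypothetical scratch location for trying this out.
: "${ORACLE_BASE:=${TMPDIR:-/tmp}/oracle_base_demo}"
EXT_DIR="$ORACLE_BASE/external_dir"

mkdir -p "$EXT_DIR"    # create the directory if it is missing
chmod 770 "$EXT_DIR"   # owner and group get full access, others none
ls -ld "$EXT_DIR"
```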
create or replace directory external_dir as '/u01/app/oracle/external_dir';
create or replace directory hdfs_bin_path as '/u01/app/oracle/product/11.2.0/dbhome_1/orahdfs-22.214.171.124.0/bin';
create user hdfsuser identified by hdfsuser;
grant create session to hdfsuser;
grant create table to hdfsuser;
grant execute on sys.utl_file to hdfsuser;
grant read, write on directory external_dir to hdfsuser;
grant read, execute on directory hdfs_bin_path to hdfsuser;
Make sure all statements executed correctly.
Now you’re all done.
Guess you need some examples to make this worth the effort. Here’s an external table example.
Revised: 5/8/2012 – Added notes about alternative config
Revised: 5/10/2012 – added permissions for conf.my_cluster