Creating an External Table with the Oracle Big Data Connector for Hadoop Data File Storage

This is a follow-up to the installation post below; it assumes you've completed everything described there.

Here's a complete example of creating an external table over a CSV file stored in the Hadoop Distributed File System (HDFS).

First, let's log in as hdfsuser and create the external table:

drop table hdfsuser.emp_ext_tab1;

CREATE TABLE hdfsuser.emp_ext_tab1
( empno    NUMBER(4,0),
  ename    VARCHAR2(10),
  job      VARCHAR2(9),
  mgr      NUMBER(4,0),
  hiredate DATE,
  sal      NUMBER(7,2),
  comm     NUMBER(7,2),
  deptno   VARCHAR2(4)
)
ORGANIZATION EXTERNAL
( TYPE oracle_loader
  DEFAULT DIRECTORY "EXTERNAL_DIR"
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
    FIELDS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY '"' LDRTRIM
    MISSING FIELD VALUES ARE NULL
    ( empno    integer external(4),
      ename    char(10),
      job      char(9),
      mgr      integer external(4),
      hiredate char(11) date_format date mask "DD-Mon-YYYY",
      sal      decimal external(7),
      comm     decimal external(7),
      deptno   integer external(4)
    )
  )
  LOCATION ('hdfsuser_emp')
);

OK, let's go through this:
DEFAULT DIRECTORY "EXTERNAL_DIR" is where the location, log, and bad files will be written.
PREPROCESSOR HDFS_BIN_PATH:hdfs_stream names the directory object and the preprocessor script, hdfs_stream, that we added when installing the HDFS connector.
LOCATION ('hdfsuser_emp') is the location file that will be written when we run the Hadoop external table publish command.
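If a select against the table ever comes back short, the access driver's log and bad files end up in that default directory. A quick way to check them, assuming the $ORACLE_BASE/external_dir location used in the install post (the file names are generated by the access driver, so wildcards are easiest):

$ cd $ORACLE_BASE/external_dir
$ ls -ltr           # the hdfsuser_emp location file plus any generated *.log / *.bad files
$ tail *.log        # messages from the most recent load attempts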

I added the drop table because I don't use external tables very often and usually have to correct something, so I drop and recreate the table as I tweak the code.

Create a CSV file from scott.emp.
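One way to build the file is to spool it straight out of SQL*Plus. A minimal sketch, assuming the stock scott/tiger demo schema is unlocked on odbs; the date format matches the sample rows and the external table's date mask:

$ sqlplus -s scott/tiger <<'EOF' > emp_table.csv
set pagesize 0 heading off feedback off trimout on
select empno||',"'||ename||'","'||job||'",'||mgr||',"'||
       to_char(hiredate,'DD-Mon-YYYY')||'",'||sal||','||comm||','||deptno
  from emp;
exit
EOF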

Example data from scott.emp, placed in the emp_table.csv file:
7369,"SMITH","CLERK",7902,"17-Dec-1980",800,,20
7499,"ALLEN","SALESMAN",7698,"20-Feb-1981",1600,300,30
7521,"WARD","SALESMAN",7698,"22-Feb-1981",1250,500,30
7566,"JONES","MANAGER",7839,"02-Apr-1981",2975,,20
7654,"MARTIN","SALESMAN",7698,"28-Sep-1981",1250,1400,30
7698,"BLAKE","MANAGER",7839,"01-May-1981",2850,,30
7782,"CLARK","MANAGER",7839,"09-Jun-1981",2450,,10
7788,"SCOTT","ANALYST",7566,"19-Apr-1987",3000,,20
7839,"KING","PRESIDENT",,"17-Nov-1981",5000,,10
7844,"TURNER","SALESMAN",7698,"08-Sep-1981",1500,0,30
7876,"ADAMS","CLERK",7788,"23-May-1987",1100,,20
7900,"JAMES","CLERK",7698,"03-Dec-1981",950,,30
7902,"FORD","ANALYST",7566,"03-Dec-1981",3000,,20
7934,"MILLER","CLERK",7782,"23-Jan-1982",1300,,10

Copy the file to HDFS:
hadoop fs -copyFromLocal emp_table.csv /user/hdfsuser/data/emp_table.csv

Hadoop will create the full path if it does not already exist.
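A quick sanity check after the copy (same paths as above):

$ hadoop fs -ls /user/hdfsuser/data
$ hadoop fs -cat /user/hdfsuser/data/emp_table.csv | head -3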

Open http://hadoopnamenode:50070 and browse the file system to find the emp_table.csv file; hadoopnamenode is cloud-alocal in the install example below.

Publish the HDFS Data Paths:

You have to use the long command found in the Oracle Big Data Connectors User’s Guide. There may be a few things to watch out for:

Beware of the oracle.hadoop.hdfs.exttab.connection.url value; a good reference for the thin-driver URL formats is http://www.orafaq.com/wiki/JDBC#Thin_driver
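For reference, the two thin-driver forms you would expect to work here look like this; odbs and orcl are the host and SID/service names used in this example, so adjust them to yours:

jdbc:oracle:thin:@odbs:1521:orcl        # host:port:SID
jdbc:oracle:thin:@//odbs:1521/orcl      # host:port/service_name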

$ $HADOOP_HOME/hadoop jar \
> $DIRECTHDFS_HOME/jlib/orahdfs.jar oracle.hadoop.hdfs.exttab.ExternalTable \
> -D oracle.hadoop.hdfs.exttab.tableName=EMP_EXT_TAB1 \
> -D oracle.hadoop.hdfs.exttab.connection.url=jdbc:oracle:thin:@odbs:1521:orcl \
> -D oracle.hadoop.hdfs.exttab.connection.user=hdfsuser \
> -D oracle.hadoop.hdfs.exttab.datasetPaths=hdfs:/user/hdfsuser/data/emp_table.csv -publish
[Enter Database Password:]
Oracle Direct HDFS Release 1.0.0.0.0 - Production

Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved.

EMP_EXT_TAB1 publish operation succeeded

Details:

EXTERNAL_DIR:hdfsuser_emp file created with the following paths
hdfs://cloud-alocal.enkitec.com/user/hdfsuser/data/emp_table.csv

Note that if you cut and paste this command, remove the ">" continuation prompts. Now if you look in $ORACLE_BASE/external_dir you'll find the hdfsuser_emp location file specified in the external table command.

[oracle@odbs external_dir]$ more hdfsuser_emp
CompressionCodec=
hdfs://cloud-alocal.enkitec.com/user/hdfsuser/data/emp_table.csv

If you run a select against the external table now, you should get the same results as "select * from scott.emp":

SQL> select * from emp_ext_tab1;

EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPT
---------- ---------- --------- ---------- --------- ---------- ---------- ----
7369 SMITH CLERK 7902 17-DEC-80 800 20
7499 ALLEN SALESMAN 7698 20-FEB-81 1600 300 30
7521 WARD SALESMAN 7698 22-FEB-81 1250 500 30
7566 JONES MANAGER 7839 02-APR-81 2975 20
7654 MARTIN SALESMAN 7698 28-SEP-81 1250 1400 30
7698 BLAKE MANAGER 7839 01-MAY-81 2850 30
7782 CLARK MANAGER 7839 09-JUN-81 2450 10
7788 SCOTT ANALYST 7566 19-APR-87 3000 20
7839 KING PRESIDENT 17-NOV-81 5000 10
7844 TURNER SALESMAN 7698 08-SEP-81 1500 0 30
7876 ADAMS CLERK 7788 23-MAY-87 1100 20

EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPT
---------- ---------- --------- ---------- --------- ---------- ---------- ----
7900 JAMES CLERK 7698 03-DEC-81 950 30
7902 FORD ANALYST 7566 03-DEC-81 3000 20
7934 MILLER CLERK 7782 23-JAN-82 1300 10

14 rows selected.

All Done!

Installing Oracle’s Big Data Connectors for HDFS and SQL*Loader

Published: 4/27/2012 Last Revised: 5/10/2012

I'm trying to learn all I can about Oracle's Big Data offerings, and installing Oracle's Big Data Connectors looked like a good place to start. Here's how I installed them.

The Oracle Direct Connector for Hadoop Distributed File System (ODC HDFS) and the Oracle Loader for Hadoop (ODC OLH) seemed like good choices for getting started with Big Data.

Download the two connector files from the Oracle Big Data Connectors download page. On second thought, grab them all if you plan to do more with Oracle's Big Data Connectors in the future.

Before we go too much further, I'm assuming you have a machine with Oracle to play with and another with Cloudera's Apache Hadoop distribution. If not, and you're like me (not hardware blessed), you should consider putting Oracle VirtualBox on your desktop. I have 8 GB of RAM on a Windows 7 box, and I created two Linux VMs.

VM1 – RDBMS on OEL6u2

  • Oracle Enterprise Linux 6 Update 2, 64-bit (OEL6U2_64)
  • Oracle’s RDBMS 11g Enterprise 11.2.0.3

The VM Settings on VM1:

  • Base Memory: 3072 MB
  • 4 Processors (1 will work fine for learning)
  • Hard Disk: 30 GB, dynamically allocated (to save space; static would be faster)
  • Network: Host-Only Adapter
  • No Firewall and No Security (SELINUX=disabled)

For the 2nd VM I used Cloudera's pre-built instance, which you can download from Cloudera's site. There are VMware and KVM virtual machine downloads in addition to the VirtualBox one.

VM2 – Cloudera Hadoop 

  • Cloudera’s VM Instance

VM Settings on VM2:

  • OS:Linux – Red Hat (64 bit)
  • Base Memory: 2048 MB
  • Network: Host-Only Adapter

I used the oracle-rdbms-server-11gR2-preinstall package announced on the Oracle Linux blog, and it made the database install go quickly.

Make sure both machines can communicate with each other and with your host machine. Let's give them names here for communication purposes: the RDBMS server can be "odbs" for Oracle Database Server, and the Cloudera VM can be "cloud-alocal". If you stick with the Host-Only network and are on Windows, it will likely be a 192.168.56.x network; you assign names to each VM, and the host will be 192.168.56.1. If you have any trouble with host-only networking, try the bridged adapter instead, making sure you pick the same adapter for both machines and use DHCP or static addresses. I like Host-Only since it isolates my VMs from the rest of the network.
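The simplest way to make those names resolve on a host-only network is an /etc/hosts entry on each VM (and optionally on the Windows host). A sketch with assumed addresses; use whatever your host-only network actually handed out:

# /etc/hosts on both VMs
192.168.56.101   odbs
192.168.56.102   cloud-alocal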

I'm also assuming you've done all the basic database checks to see that sqlplus logins work, your listener is up, la, la, la...   If you've lived in a cave for a while, it's a good time to brush up on the 11g install.

If you haven't looked at it already, you may want to look at Oracle Support Master Note 1416116.1, "Master Note for Big Data Appliance Integrated Software". It has good info on the Oracle Big Data Connectors.

On odbs we need to install only the basic Hadoop software. In preparation for installing Cloudera Hadoop 0.20.2, we need to install Java. I used jdk-6u31-linux-x64-rpm.bin downloaded from the Oracle Java download page, changed the permissions (chmod o+x), and ran the binary. When it finished, the JDK was installed under /usr/java/jdk1.6.0_31. I used the path /usr/java/latest, which is a symlink to the latest install, for all paths later in this installation so I won't have to come back after upgrading the JDK at some point.
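For reference, the whole thing boils down to something like this (file name from the download above; run it as root or via sudo):

$ chmod o+x jdk-6u31-linux-x64-rpm.bin
$ sudo ./jdk-6u31-linux-x64-rpm.bin
$ ls -l /usr/java/latest      # symlink maintained by the JDK rpm, here pointing at jdk1.6.0_31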

Still on odbs, now with Oracle Java installed, we are ready to install Cloudera Hadoop.

As root, create a user hdfs:

useradd  -d /home/hadoop -m hdfs

/home/hdfs works too, if you prefer.  If you name the user hadoop, the install will switch to the user hdfs.  Add hdfs to the wheel group and make sure the %wheel line is uncommented in /etc/sudoers.
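A sketch of those two steps, assuming the stock OEL6 sudoers layout:

# as root: add hdfs to wheel, then enable the wheel group in sudoers
usermod -aG wheel hdfs
visudo        # uncomment the line:  %wheel  ALL=(ALL)  ALL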
Now log in as user hdfs (still on odbs).

On a side note Cloudera has free training videos and some good documentation on their website.  You might take a look around while you’re there.  Below is one way to install the Cloudera Hadoop, see the CDH3 Installation Guide for others if this doesn’t work for you.

1) Download the repository package:

Red Hat / CentOS 5:

http://archive.cloudera.com/redhat/cdh/cdh3-repository-1.0-1.noarch.rpm

Red Hat / CentOS 6:

http://archive.cloudera.com/redhat/6/x86_64/cdh/cdh3-repository-1.0-1.noarch.rpm

2) Then install it with yum:

sudo yum --nogpgcheck localinstall cdh3-repository-1.0-1.noarch.rpm

Get the GPG key for your release (OEL5 or OEL6) from the Cloudera archive and import it with rpm.
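A hedged sketch, assuming the keys sit alongside the repository paths listed above (check the CDH3 Installation Guide if they have moved):

# OEL5 / Red Hat 5
$ sudo rpm --import http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
# OEL6 / Red Hat 6
$ sudo rpm --import http://archive.cloudera.com/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera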

Check your yum repository now; try:

$ yum search hadoop

then install

$ sudo yum install hadoop-0.20

When it is finished check the hadoop install:

$ hadoop version

It should print the Hadoop version number.

You may want to set up alternatives for the hadoop configuration. There’s a good document on this at Cloudera’s Site. /etc/alternatives shows all the alternatives set up. Cloudera sets up one called hadoop-0.20 for this version of the install.

$ ls /etc/alternatives

to see the full list.

$ sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.my_cluster
sudo chmod -R 755 /etc/hadoop-0.20/conf.my_cluster

Enter the password.

You might want to add a .bash_profile for the hdfs user if you haven't already, and add /usr/sbin to the PATH for the next step, as sketched below.
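Something along these lines in the hdfs user's ~/.bash_profile is enough (the alternatives command lives in /usr/sbin):

# appended to ~/.bash_profile for the hdfs user
export PATH=$PATH:/usr/sbin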

sudo alternatives --display hadoop-0.20-conf
[sudo] password for hdfs:
hadoop-0.20-conf - status is auto.
link currently points to /etc/hadoop-0.20/conf.empty
/etc/hadoop-0.20/conf.empty - priority 10
Current `best' version is /etc/hadoop-0.20/conf.empty.

You can see that /etc/hadoop-0.20/conf.empty is the currently selected alternative. Let's change it to the new one you created.

sudo alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.my_cluster 50

sudo alternatives --config hadoop-0.20-conf

There are 2 programs which provide ‘hadoop-0.20-conf’.

Selection Command
-----------------------------------------------
1 /etc/hadoop-0.20/conf.empty
*+ 2 /etc/hadoop-0.20/conf.my_cluster

Select your new alternative if it is not already selected.
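To confirm the switch took, re-run the display command; the link should now point at the new directory:

$ sudo alternatives --display hadoop-0.20-conf
# expect: link currently points to /etc/hadoop-0.20/conf.my_cluster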

sudo to root

sudo su -

Edit the file /etc/hadoop/conf/core-site.xml
Add:

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://cloud-alocal:8020</value>
    </property>
</configuration>


The default fs.default.name port is 8020, and cloud-alocal is the Cloudera VM we created (or whatever target Hadoop node or cluster you're using).

While you're sudo'd to root, make the changes to the following files as well. The added or changed lines are shown below.

Edit the /etc/hadoop/conf/hadoop-env.sh file; the edited version is shown below:

-----------

# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use.  Required.
export JAVA_HOME=/usr/java/latest
# Extra Java CLASSPATH elements.  Optional.
export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1
export HADOOP_CLASSPATH="$ORACLE_HOME/jdbc/lib/ojdbc6.jar:$ORACLE_HOME/oraloader-1.1.0.0.1/jlib/*:$HADOOP_CLASSPATH"

---------------------------------
Note: If you have multiple ORACLE_HOMEs, you might prefer to use ORACLE_BASE or /etc as the install location for the ODC HDFS and ODC OLH pieces. I'd prefer a standard place where I never have to involve a system admin; they're busy enough.

Exit from root and log in as oracle on odbs. Let's add a few variables to .bash_profile, somewhere after ORACLE_HOME gets set:

export HADOOP_HOME=/usr/bin
export DIRECTHDFS_HOME=$ORACLE_HOME/orahdfs-1.0.0.0.0
export HADOOP_CLASSPATH="$ORACLE_HOME/jdbc/lib/ojdbc6.jar:$ORACLE_HOME/jlib/oraclepki.jar"
Log out and log back in (or keep re-executing .bash_profile every time you create a new session; it's easier to log out and back in). Then check Hadoop connectivity to cloud-alocal:

$ hadoop fs -ls /user

Found 2 items
drwxr-xr-x   - cloudera supergroup          0 2012-04-05 14:00 /user/cloudera
drwxr-xr-x   - hue      supergroup          0 2012-03-14 20:22 /user/hive

Try a few commands:

$ touch test_src.lst
$ hadoop fs -copyFromLocal test_src.lst /user/scott/data/test.lst

Try hadoop fs to get the usage:

$ hadoop fs

Open a browser and go to http://cloud-alocal:50070, click "Browse the file system", and navigate to /user/scott/data to see your file:

/user/scott/data/test.lst


So now you have a file in the Hadoop file system; if you look for that full directory path and file on the local OS file system, you won't find it.
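A quick way to convince yourself of that (same path as above):

$ ls /user/scott/data/test.lst              # fails -- not on the local file system
$ hadoop fs -ls /user/scott/data/test.lst   # succeeds -- the file lives in HDFS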

Let’s put the connector code in place:

Go to the download directory for the zipped connector files:

unzip orahdfs-1.0.0.0.0.zip -d $ORACLE_HOME
unzip oraloader-1.1.0.0.1.x86_64.zip -d $ORACLE_HOME

Modify hdfs_stream in $ORACLE_HOME/orahdfs-1.0.0.0.0/bin; the edited lines are shown below:

------------------

#
export HADOOP_HOME=/usr
export HADOOP_CONF_DIR=/etc/hadoop/conf
#
# set DIRECTHDFS_HOME to install directory for DirectHDFS
#
export DIRECTHDFS_HOME=$ORACLE_HOME/orahdfs-1.0.0.0.0
export ORAHDFS_JAR=${DIRECTHDFS_HOME}/jlib/orahdfs.jar

----------------------
A correct run should show the usage info:

$ ./hdfs_stream
Oracle Direct HDFS Release 1.0.0.0.0 - Production
Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved.
Usage: $HADOOP_HOME/bin/hadoop jar orahdfs.jar oracle.hadoop.hdfs.exttab.HdfsStream <locationPath>

Otherwise check in $DIRECTHDFS_HOME/log for the log files and debug your changes.
Create the external directory on odbs and set its permissions:
mkdir $ORACLE_BASE/external_dir
chmod 770 $ORACLE_BASE/external_dir

From sqlplus execute the following:

sqlplus /nolog
connect / as sysdba

create or replace directory external_dir as '/u01/app/oracle/external_dir';
create or replace directory hdfs_bin_path as '/u01/app/oracle/product/11.2.0/dbhome_1/orahdfs-1.0.0.0.0/bin';
create user hdfsuser identified by hdfsuser;
grant create session to hdfsuser;
grant create table to hdfsuser;
grant execute on sys.utl_file to hdfsuser;
grant read, write on directory external_dir to hdfsuser;
grant read, execute on directory hdfs_bin_path to hdfsuser;

Make sure all statements executed correctly.
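A quick way to double-check the directory objects before moving on; a sketch, run as sysdba:

sqlplus -s / as sysdba <<'EOF'
select directory_name, directory_path from dba_directories
 where directory_name in ('EXTERNAL_DIR','HDFS_BIN_PATH');
exit
EOF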

Now you’re all done.

Guess you need some examples to make this worth the effort; the external table walkthrough above is one.

Revised: 5/8/2012 – Added notes about alternative config
Revised: 5/10/2012 – added permissions for conf.my_cluster