Exadata Storage Index Aging – Part 3 – Analysis on Test 2B

Continued from previous post.

Based on the observations below, it looks as if offloading is not capped at 8 columns in the storage index array. Instead, the 8-column parameter (_cell_storage_index_columns = 8) appears to act as a minimum for retention. If columns 1-10 have all been offloaded, yet a post-test check still shows a similar but different set of columns retained (look closely at 1, 19, 20, and 21), then the total number of indexed columns must have risen above the 8 in the storage index array. I attribute this to the fact that the array entries are created per 1 MB storage region of each segment, and the In Line (IL) tests show the same percentage of offload as the earlier IL test.

This is good news for end users: there is some elastic capacity in the storage index array above the 8 columns specified. Although more than these specific columns were offloaded, once activity died down only 8 remained active. So 8 seems to be a minimum, not a maximum.

But I still need to understand why columns 1, 19, 20, and 21 seem to persist longer. Was it previous activity, something else in the data, or another possibility entirely?
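For anyone following along at home, the usual way to see whether a predicate benefited from a storage index is the session statistic 'cell physical IO bytes saved by storage index'. A minimal check (not the exact harness used in these tests) looks like this:

select sn.name, ms.value
  from v$mystat ms
  join v$statname sn on sn.statistic# = ms.statistic#
 where sn.name = 'cell physical IO bytes saved by storage index';

Run it before and after the test query; if the value grows, the storage index skipped I/O for the predicate in between.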

—————————————————————————————————————————

Observations
Looking at the data from Test 2B, here are some summary observations of the changes:

Baseline Test:
8 offloaded columns retained from previous actions. Columns (in order of testing): 1, 7, 8, 9, 10, 19, 20, 21

Test Columns 1-10:
All ten columns get storage index offloads, but column 1 is offloaded on first try and has not been aged out of the storage index array.

IL Test – All Columns – After test on columns 1-10
Offloads not aged out in order of column testing: 10, 9, 8, 7, 1, 21, 20, 19

Test Columns 2-11:
All ten columns get storage index offloads, all aged out (have to offload after 1st query)

IL Test – All Columns – After test on columns 2-11
Offloads not aged out in order of column testing: 11, 10, 9, 8, 1, 21, 20, 19

Test Columns 3-12:
All ten columns get storage index offloads, all aged out (have to offload after 1st query)

IL Test – All Columns – After test on columns 3-12
Offloads not aged out in order of column testing: 12, 11, 10, 9, 1, 21, 20, 19

Test Columns 4-13:
All ten columns get storage index offloads, all aged out (have to offload after 1st query)

IL Test – All Columns – After test on columns 4-13
Offloads not aged out in order of column testing: 13, 12, 11, 10, 1, 21, 20, 19

Test Columns 5-14:
All ten columns get storage index offloads, all aged out (have to offload after 1st query)

IL Test – All Columns – After test on columns 5-14
Offloads not aged out in order of column testing: 14, 13, 12, 11, 1, 21, 20, 19

Test Columns 6-15:
All ten columns get storage index offloads, all aged out (have to offload after 1st query)

IL Test – All Columns – After test on columns 6-15
Offloads not aged out in order of column testing: 15, 14, 13, 12, 1, 21, 20, 19

Test Columns 7-16:
All ten columns get storage index offloads, all aged out (have to offload after 1st query)

IL Test – All Columns – After test on columns 7-16
Offloads not aged out in order of column testing: 16, 15, 14, 13, 1, 21, 20, 19

Test Columns 8-17:
All ten columns get storage index offloads, all aged out (have to offload after 1st query)

IL Test – All Columns – After test on columns 8-17
Offloads not aged out in order of column testing: 17, 16, 15, 14, 1, 21, 20, 19

Test Columns 9-18:
All ten columns get storage index offloads, all aged out (have to offload after 1st query)

IL Test – All Columns – After test on columns 9-18
Offloads not aged out in order of column testing: 18, 17, 16, 15, 1, 21, 20, 19

Test Columns 10-19:
All ten columns get storage index offloads, all aged out (have to offload after 1st query), except 19

IL Test – All Columns – After test on columns 10-19
Offloads not aged out in order of column testing: 19, 18, 17, 16, 15, 1, 21, 20

Test Columns 11-20:
All ten columns get storage index offloads, all aged out (have to offload after 1st query), except 19, and 20

IL Test – All Columns – After test on columns 11-20
Offloads not aged out in order of column testing: 20, 19, 18, 17, 16, 15, 1, 21

Test Columns 12-21:
All ten columns get storage index offloads, all aged out (have to offload after 1st query), except 19, 20, and 21

IL Test – All Columns – After test on columns 12-21
Offloads not aged out in order of column testing: 21, 20, 19, 18, 17, 16, 15, 1

Exadata Storage Index Aging – Part 2B

This is the third blog in a series looking at Storage Cell Storage Index Predicate Offloads, continued from the previous post. The follow-up analysis is here: Analysis for Test 2B.

For Test 2B, this corrects the columns with very little offload so that they get between 60 and 100 percent offload from the storage index. See the previous two blogs for more background information.

Results:
Now we have offload for each test segment and 8 columns offloaded for each In Line (IL) test, as expected. We can even see a pattern of when columns are released (aged out) after repeated queries on the 8 columns.

The pattern I see is that after 10 queries, columns 1 and 7-21 seemed to keep their offload status without aging out. But what is going on with column 1 that keeps it from aging out for the entire test? And what about columns 2-6, which seem to age out fairly quickly after their query runs with a storage index offload?

I see some patterns developing, but I may need more eyes, information, or tests to figure this out. I will continue when I can.

Let me hear your ideas.
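If you want to poke at the aging behavior yourself, here is a rough sketch of the kind of repeated-query loop described above. The table and column names are placeholders, it assumes smart scans (direct path reads) are actually happening, and it assumes you can query v$mystat and v$statname directly; it is not the exact test script used here.

set serveroutput on
declare
  l_before number;
  l_after  number;
  l_dummy  number;
begin
  for i in 1 .. 10 loop
    select ms.value into l_before
      from v$mystat ms join v$statname sn on sn.statistic# = ms.statistic#
     where sn.name = 'cell physical IO bytes saved by storage index';

    -- placeholder table/column; predicate on the column under test
    select /*+ full(t) */ count(*) into l_dummy
      from test_tab t
     where col_01 = 42;

    select ms.value into l_after
      from v$mystat ms join v$statname sn on sn.statistic# = ms.statistic#
     where sn.name = 'cell physical IO bytes saved by storage index';

    dbms_output.put_line('run '||i||': bytes saved by storage index = '||(l_after - l_before));
  end loop;
end;
/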

Here’s the test sequence again:

Test 2 – Test Plan

Here’s the full output in Excel format, this time run to completion:
dcox_exadata_storage_index_aging_test2b

Bad tasting Pie

The material looks great for “Oracle R”. So, I thought I’d get a few books and try a quick lab for myself.

The books covered data loading, data manipulation, statistics (stuff I’ve either forgotten or never learned), some plotting, and a host of other useful things to be discovered.

So, to focus my learning, I quickly eliminated data loading and data manipulation, since I have tons of data and already have a rich PL/SQL language to do that anyway. So maybe some data retrieval from the database and a plot.

At several of my recent clients, auditing has been a top priority, so how about a pie chart showing the status of all users?

I’m using the default user “RQUSER” and have played around with a few of the demos. So far I like the “RStudio” app.

and find:


> ore.ls()
[1] "DF1" "DF2" "IRIS" "IRIS_TABLE"

So, no visibility to dba_users, what I’d normally pick.

Create local maybe?

GRANT SELECT ANY TABLE TO rquser
/
create synonym users for sys.dba_users
/

And no luck: local synonyms don’t show up in the R tools, and even with SELECT ANY TABLE granted, ore.ls() still doesn’t show dba_users. So, more reading.

How about creating a local table so I can move past this and have some pie charts?


create table users as select * from dba_users;

Progress:
> ore.ls()
[1] "DF1" "DF2" "IRIS" "IRIS_TABLE" "USERS"

Now for my pie: I’d like to show a count grouped by account status. Well, I can’t do that directly.

I type:
> head(USERS)

and it shows my table correctly, but I don’t see any clear functions to get the grouping done. I think they are there; I just haven’t discovered them yet. I do see some good filtering, but I’m used to writing a good query and being done with it. So, moving on. Next plan: create a table with the right columns. I need my pie.


drop table users;
create table users as select account_status, count(*) status_count from dba_users group by account_status;

> ore.ls()
[1] "DF1" "DF2" "IRIS" "IRIS_TABLE" "USERS"

Progress...
head(USERS)
ACCOUNT_STATUS STATUS_COUNT
1 LOCKED 3
2 OPEN 7
3 EXPIRED & LOCKED 7

Now for Pie:
pie(USERS$STATUS_COUNT,labels=USERS$ACCOUNT_STATUS)
Error in seq.int(x[i], x[i + 1], length.out = n) :
'length.out' must be a non-negative number

Oops, Houston we have a problem.

So what’s going on here?

Let’s try something different, using one of the more normal examples (without the Oracle part).


> slices <- c(3,7,7)
> lbls <- c("LOCKED","OPEN","EXPIRED & LOCKED")
> pie(slices,labels=lbls)

Success with the data going in directly:
Rplot

So what’s up?

I know that Oracle R uses data frames to store some of its objects and that the R packages are overloaded, so let’s check the classes.


> class(slices)
[1] "numeric"
> class(USERS$STATUS_COUNT)
[1] "ore.numeric"
attr(,"package")
[1] "OREbase"
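Given that class mismatch, one likely fix (a sketch I have not verified here) is to pull the ORE columns down to ordinary local R vectors with ore.pull() before handing them to pie():

# pull the ore.numeric / ore.character columns into plain local vectors
local_counts <- ore.pull(USERS$STATUS_COUNT)    # should come back as class "numeric"
local_labels <- ore.pull(USERS$ACCOUNT_STATUS)  # character (or factor)
pie(local_counts, labels = local_labels)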

Trying plot() instead:

plot(USERS$STATUS_COUNT,USERS$ACCOUNT_STATUS)

Rplot_PLOT

That worked, though it uses values instead of labels for each number. I’m sure there’s a way to change that, but the important thing is that it accepted my values with no errors and produced a plot.

OK, well lots of learning, but no sweet pie. Too much for Christmas anyway.

Installing Oracle R

I’m trying to learn R, particularly Oracle R. I’m setting it up on a Windows 7 laptop first to see what can be learned; then possibly some bigger machines later.

Pulled a getting started doc from Oracle’s website http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/index.html
“ore-trng1-gettingstarted-1501628.pdf”. The whole series of training documents is helpful for getting started.

Adding a user by running “rquser.sql” (provided by Oracle) went fine as well, but I prefer a less cryptic script, so I replaced some of the code in rquser.sql for the next time I run it:

-- create RQUSER user
create user &&new_user_name identified by &&new_password
default tablespace &&new_default_tablespace
temporary tablespace &&new_temp_tablespace
quota &&new_quota on &&new_default_tablespace;

-- grant privileges
grant create session, rqrole
to &&new_user_name;

-- BUG: the following grant does not work when granted to RQROLE
grant execute on rqsys.rqGroupEvalImpl to &&new_user_name;

alter session set current_schema = &&new_user_name;

Not sure why you’d grant quota to anything other than the default tablespace on Oracle XE.

Next, I started the two R clients. The first was the generic 64-bit R, where I ran a few of the demo packages. Then I tried RStudio and compared the graphics presentation between the two. It caught me off guard when RStudio asked me what tool to use to open the demo file; of course RStudio, and it worked.

I did get the error:

Error: cannot change value of locked binding for 'all'

Going back into the database, I found I had to grant execute on SYS.DBMS_SYSTEM to RQSYS for rqsys.libPath to work.
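In SQL form (run as SYSDBA), that is simply:

grant execute on sys.dbms_system to rqsys;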

Overall the instructions provided by Oracle are good. They worked perfectly where I followed them.

Now on to more.

Think I might learn Oracle R

With the Big Data explosion happening, coming, or whatever it’s doing, I think it’s time for some new skills. I think I might learn Oracle R. I’ve got some basic Hadoop knowledge, but I realize the next-gen geeks may be supermen with the combined skills of statisticians, hardware and software engineers, business analysts, network engineers, system engineers, and more. Time to learn some more skills and tricks.

I’ve seen the tech world make some drastic changes before. Mainframe to VAX VMS, then Unix, then the rise of the Personal Computer, phone, pad. Central Computing to Client Server… Oracle 5,6,7,8i,9i,10g,11g, 12… Central Data Center to Matrix workers. …makes my brain hurt thinking all the times I’ve had to learn a new technology just to keep current. Pondering some “Expanding/Contracting Theories of Computing” … too deep. Whatever is happening, I’m hearing more rumors of some solid changes toward Big Data, so it’s that time again and I’d better get ready.

Christmas just passed, and I attempted to teach my boys the many important things like “Baby Jesus”, and it’s complicated. Some change and some stay the same. There are things that stay the same like God and His adult son, but technology changes. So with my laptop, Doritos, and a “red-hot smokin’ wife” at my side, it’s time to grow again. Next up, “ORACLE R”.

Hadoop HDFS Permissions

Updated 4/10/13

It looks like the good doc on permissions is the HDFS Permissions Guide. The original link has changed or is obsolete, but a Google search for “hdfs permissions guide” should turn up a good reference.

My simple view is that the user permissions work the same as Unix, without all the stickiness (no sticky bits, etc.). For group permissions, the supergroup is shown; this group belongs to the owner of the namenode process.
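For example, the familiar command forms carry over (the paths are just examples):

hadoop fs -ls /user                                       # shows owner, group (often "supergroup"), and rwx bits
hadoop fs -chmod 750 /user/hdfsuser/data                  # same mode syntax as unix chmod
hadoop fs -chown hdfsuser:supergroup /user/hdfsuser/data  # owner:group, same as chown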

Seems pretty direct.

Creating an External Table with the Oracle Big Data Connector for Hadoop Data File Storage

This is a follow-up to the installation blog. It assumes you’ve done everything there for the installation.

Here’s a complete example for creating an external table with a csv file on the Hadoop File System.

First, let’s log in as hdfsuser and create the external table:

drop table hdfsuser.emp_ext_tab1;

CREATE TABLE hdfsuser.emp_ext_tab1
( empno number(4,0),
ename VARCHAR2(10),
job VARCHAR2(9),
mgr number(4,0),
HIREDATE date,
sal number(7,2),
comm number(7,2),
deptno varchar2(4)

)
ORGANIZATION EXTERNAL
( TYPE oracle_loader
DEFAULT DIRECTORY "EXTERNAL_DIR"
ACCESS PARAMETERS
(
RECORDS DELIMITED BY NEWLINE
PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"' LDRTRIM
MISSING FIELD VALUES ARE NULL
(empno integer external(4) ,
ename char(10) ,
job char(9) ,
mgr integer external(4) ,
HIREDATE char(10) date_format date mask "MM/DD/YYYY" ,
sal decimal external(7) ,
comm decimal external(7),
deptno integer external(4)
)
)
LOCATION ( 'hdfsuser_emp')
);

OK, let’s go through this:
DEFAULT DIRECTORY “EXTERNAL_DIR” is where the location, log, and bad files will be written.
PREPROCESSOR HDFS_BIN_PATH:hdfs_stream is the directory and name of the hdfs_stream preprocessor that we added with the install of the HDFS connector.
LOCATION ( ‘hdfsuser_emp’) is where our location file will be written when we run the Hadoop external table publish command.

I added the drop table because I don’t use many external tables and usually have to correct something, so I drop and recreate the table as I tweak the code.

Create a csv file from scott.emp.

example data from scott.emp placed in the emp_table.csv file:
7369,"SMITH","CLERK",7902,"17-Dec-1980",800,,20
7499,"ALLEN","SALESMAN",7698,"20-Feb-1981",1600,300,30
7521,"WARD","SALESMAN",7698,"22-Feb-1981",1250,500,30
7566,"JONES","MANAGER",7839,"02-Apr-1981",2975,,20
7654,"MARTIN","SALESMAN",7698,"28-Sep-1981",1250,1400,30
7698,"BLAKE","MANAGER",7839,"01-May-1981",2850,,30
7782,"CLARK","MANAGER",7839,"09-Jun-1981",2450,,10
7788,"SCOTT","ANALYST",7566,"19-Apr-1987",3000,,20
7839,"KING","PRESIDENT",,"17-Nov-1981",5000,,10
7844,"TURNER","SALESMAN",7698,"08-Sep-1981",1500,0,30
7876,"ADAMS","CLERK",7788,"23-May-1987",1100,,20
7900,"JAMES","CLERK",7698,"03-Dec-1981",950,,30
7902,"FORD","ANALYST",7566,"03-Dec-1981",3000,,20
7934,"MILLER","CLERK",7782,"23-Jan-1982",1300,,10

Copy the file to the hdfs:
hadoop fs -copyFromLocal emp_table.csv /user/hdfsuser/data/emp_table.csv

Hadoop will create the full path if it does not already exist.
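You can also confirm the copy from the command line; the listing should show emp_table.csv:

$ hadoop fs -ls /user/hdfsuser/data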

Browse the file system from the name node web UI (http://hadoopnamenode:50070, as in the install post) to find the emp_table.csv file; hadoopnamenode is cloud-alocal in the install example.

Publish the HDFS Data Paths:

You have to use the long command found in the Oracle Big Data Connectors User’s Guide. There may be a few things to watch out for:

Beware of the oracle.hadoop.hdfs.exttab.connection.url setting; a good reference for what it should look like is: http://www.orafaq.com/wiki/JDBC#Thin_driver

$ $HADOOP_HOME/hadoop jar \
> $DIRECTHDFS_HOME/jlib/orahdfs.jar oracle.hadoop.hdfs.exttab.ExternalTable \
> -D oracle.hadoop.hdfs.exttab.tableName=EMP_EXT_TAB1 \
> -D oracle.hadoop.hdfs.exttab.connection.url=jdbc:oracle:thin:@odbs:1521:orcl \
> -D oracle.hadoop.hdfs.exttab.connection.user=hdfsuser \
> -D oracle.hadoop.hdfs.exttab.datasetPaths=hdfs:/user/hdfsuser/data/emp_table.csv -publish
[Enter Database Password:]
Oracle Direct HDFS Release 1.0.0.0.0 - Production

Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved.

EMP_EXT_TAB1 publish operation succeeded

Details:

EXTERNAL_DIR:hdfsuser_emp file created with the following paths
hdfs://cloud-alocal.enkitec.com/user/hdfsuser/data/emp_table.csv

Note that if you cut and paste the command, the “>” continuation prompts need to be removed. Now if you look in $ORACLE_BASE/external_dir you’ll find the file hdfsuser_emp specified in the external table command.

[oracle@odbs external_dir]$ more hdfsuser_emp
CompressionCodec=
hdfs://cloud-alocal.enkitec.com/user/hdfsuser/data/emp_table.csv

If you run your external table select now, you should get the same results as if you ran “select * from scott.emp”:

SQL> select * from emp_ext_tab1;

EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPT
---------- ---------- --------- ---------- --------- ---------- ---------- ----
7369 SMITH CLERK 7902 17-DEC-80 800 20
7499 ALLEN SALESMAN 7698 20-FEB-81 1600 300 30
7521 WARD SALESMAN 7698 22-FEB-81 1250 500 30
7566 JONES MANAGER 7839 02-APR-81 2975 20
7654 MARTIN SALESMAN 7698 28-SEP-81 1250 1400 30
7698 BLAKE MANAGER 7839 01-MAY-81 2850 30
7782 CLARK MANAGER 7839 09-JUN-81 2450 10
7788 SCOTT ANALYST 7566 19-APR-87 3000 20
7839 KING PRESIDENT 17-NOV-81 5000 10
7844 TURNER SALESMAN 7698 08-SEP-81 1500 0 30
7876 ADAMS CLERK 7788 23-MAY-87 1100 20

EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPT
---------- ---------- --------- ---------- --------- ---------- ---------- ----
7900 JAMES CLERK 7698 03-DEC-81 950 30
7902 FORD ANALYST 7566 03-DEC-81 3000 20
7934 MILLER CLERK 7782 23-JAN-82 1300 10

14 rows selected.

All Done!

Installing Oracle’s Big Data Connectors for HDFS and SQL*Loader

Published: 4/27/2012 Last Revised: 5/10/2012

I’m trying to learn all that I can on Oracle’s Big Data.  Installing the Big Data Connectors from Oracle looks like a good place to start.  Here’s how I installed them.

The Oracle Direct Connector for Hadoop Distributed File System (ODC HDFS) and the Oracle Loader for Hadoop (ODC OLH) seemed like good choices to get started on the Big Data.

Download the two connector files from the Oracle Big Data Connector download page. On second thought, get them all if you plan to do more in the future with Oracle’s Big Data Connectors.

Before we go too much further, I’m assuming you have a machine with Oracle to play with and another with Cloudera’s Apache Hadoop. If not, and you’re like me (not hardware blessed), you should consider getting Oracle VirtualBox on your desktop. I have 8 GB of RAM on a Windows 7 box. I created two Linux VMs.

VM1 – RDBMS on OEL6u2

  • Oracle Enterprise Linux 6 Upgrade 2 64 bit (OEL6U2_64)
  • Oracle’s RDBMS 11g Enterprise 11.2.0.3

The VM Settings on VM1:

  • Base Memory: 3072 MB
  • 4 Processors (1 will work fine for learning)
  • Hard Disk: 30 G – Dynamic (to save space, static would be faster)
  • Network: Host-Only Adapter
  • No Firewall and No Security (SELINUX=disabled)

For the 2nd VM I used Cloudera’s pre-built instance, which you can download from Cloudera’s site. There are also VMware and KVM virtual machine downloads in addition to VirtualBox.

VM2 – Cloudera Hadoop 

  • Cloudera’s VM Instance

VM Settings on VM2:

  • OS:Linux – Red Hat (64 bit)
  • Base Memory: 2048 MB
  • Network: Host-Only Adapter

I used the oracle-rdbms-server-11gR2-preinstall package announced on Oracle’s Linux blog, and it made the install go quickly.

Make sure both machines can communicate with each other and with your host machine. Let’s give them names here for communication purposes: the RDBMS server can be “odbs” for Oracle Database Server, and the Cloudera VM can be “cloud-alocal”. If you stick with the host-only network and are on Windows, I think it will be a 192.168.56.x network. You assign names to each VM, and the host will be 192.168.56.1. If you have any trouble with host-only, you may try the bridged adapter network, making sure you pick the same adapter for both machines and use DHCP or static addresses. I like host-only since it isolates my VMs from the rest of the network.
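If name resolution between the VMs gives you trouble, a couple of /etc/hosts entries on each machine (and optionally the Windows host) is the simplest fix; the addresses below are only examples of what the host-only adapter might hand out:

# /etc/hosts on both VMs -- example addresses on the default host-only network
192.168.56.101   odbs
192.168.56.102   cloud-alocal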

I’m also assuming you’ve done all the basic database checks to see that sqlplus logins work and your listener is up, la, la, la… If you’ve lived in a cave for a while, it’s a good time to brush up on the 11g install.

If you haven’t looked at it already, you may want to look at Oracle Support Master Note 1416116.1, “Master Note for Big Data Appliance Integrated Software.” It seems to have good info for the Oracle Big Data Connectors.

On odbs we need to install only the basic Hadoop software. In preparation for loading Cloudera Hadoop 0.20.2, we need to install the Java software. I used the jdk-6u31-linux-x64-rpm.bin downloaded from Oracle at the Java download page, changed the permissions (chmod o+x), and ran the binary. When it finished, the JDK was installed under /usr/java/jdk1.6.0_31. I used the path /usr/java/latest, which is a pointer to the latest install, for all my paths later in this installation so I won’t have to go back after upgrading the JDK at some point later.

Still on odbs, now with Oracle Java installed, we are ready to install Cloudera Hadoop.

As root, create a user hdfs:

useradd  -d /home/hadoop -m hdfs

/home/hdfs works too, if you so choose. If you name the user hadoop, the install will switch to the user hdfs. Add hdfs to the wheel group and make sure the %wheel entry is uncommented (active) in /etc/sudoers so the new user can sudo.
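Something like this, run as root, covers the wheel setup (assuming a stock /etc/sudoers):

usermod -aG wheel hdfs
visudo            # uncomment the line:  %wheel  ALL=(ALL)  ALL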
Now log in as user hdfs (still on odbs).

On a side note, Cloudera has free training videos and some good documentation on their website; you might take a look around while you’re there. Below is one way to install Cloudera Hadoop; see the CDH3 Installation Guide for others if this doesn’t work for you.

1) Download the repository package:

Red Hat / CentOS 5:

http://archive.cloudera.com/redhat/cdh/cdh3-repository-1.0-1.noarch.rpm

Red Hat / CentOS 6:

http://archive.cloudera.com/redhat/6/x86_64/cdh/cdh3-repository-1.0-1.noarch.rpm

2) Install it with yum:

sudo yum --nogpgcheck localinstall cdh3-repository-1.0-1.noarch.rpm

Get the GPG key: the Cloudera archive has separate key links for OEL5 and OEL6; import the one that matches your release with rpm --import.

Check your yum repository now:
try:

$ yum search hadoop

then install

$ sudo yum install hadoop-0.20

When it is finished check the hadoop install:

$ hadoop version

Should render a version #

You may want to set up alternatives for the hadoop configuration. There’s a good document on this at Cloudera’s Site. /etc/alternatives shows all the alternatives set up. Cloudera sets up one called hadoop-0.20 for this version of the install.

$ ls /etc/alternatives

to see the full list.

$ sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.my_cluster
sudo chmod -R 755 /etc/hadoop-0.20/conf.my_cluster

Enter the password.

You might want to add a .bash_profile for the hdfs user if you haven’t already, and add /usr/sbin to the PATH for the next step (that’s where the alternatives command lives).
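Something along these lines in the hdfs user’s ~/.bash_profile does it:

# the alternatives command lives in /usr/sbin
export PATH=$PATH:/usr/sbin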

sudo alternatives --display hadoop-0.20-conf
[sudo] password for hdfs:
hadoop-0.20-conf - status is auto.
link currently points to /etc/hadoop-0.20/conf.empty
/etc/hadoop-0.20/conf.empty - priority 10
Current `best' version is /etc/hadoop-0.20/conf.empty.

and you can see that the /etc/hadoop-0.20/conf.empty is the currently selected alternative. Let’s change it to the new one you created.

sudo alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.my_cluster 50

sudo alternatives --config hadoop-0.20-conf

There are 2 programs which provide 'hadoop-0.20-conf'.

  Selection    Command
-----------------------------------------------
   1           /etc/hadoop-0.20/conf.empty
*+ 2           /etc/hadoop-0.20/conf.my_cluster

Select your new alternative if it is not already selected.

sudo to root

sudo su -

Edit the file /etc/hadoop/conf/core-site.xml
Add:

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://cloud-alocal:8020</value>
    </property>
</configuration>


The default fs.default.name port is 8020, and cloud-alocal is the instructional VM we created (or whatever target Hadoop node or cluster you are using).

While you’re sudo’d to root, make the changes to the following files as well.

The lines to add or edit are the export lines shown below.

Edit the /etc/hadoop/conf/hadoop-env.sh file as shown below:

———–

# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use.  Required.
export JAVA_HOME=/usr/java/latest
# Extra Java CLASSPATH elements.  Optional.
export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1
export HADOOP_CLASSPATH="$ORACLE_HOME/jdbc/lib/ojdbc6.jar:$ORACLE_HOME/oraloader-1.1.0.0.1/jlib/*:$HADOOP_CLASSPATH"

———————————
Note: If you have multiple ORACLE_HOMEs, you might prefer to use ORACLE_BASE or /etc as the install location for the ODC HDFS or ODC OLH installs. I’d prefer a standard place where I never have to involve a system admin; they’re busy enough.

Exit from root and log in as oracle on odbs. Let’s add a few variables to .bash_profile, somewhere after ORACLE_HOME gets set:

export HADOOP_HOME=/usr/bin
export DIRECTHDFS_HOME=$ORACLE_HOME/orahdfs-1.0.0.0.0
export HADOOP_CLASSPATH="$ORACLE_HOME/jdbc/lib/ojdbc6.jar:$ORACLE_HOME/jlib/oraclepki.jar"
Log out and log back in, or keep executing the .bash_profile every time you create a new session (it’s easier to log out and in).

Check the Hadoop connectivity to cloud-alocal:

$ hadoop fs -ls /user

Found 2 items
drwxr-xr-x   - cloudera supergroup          0 2012-04-05 14:00 /user/cloudera
drwxr-xr-x   - hue      supergroup          0 2012-03-14 20:22 /user/hive

Try a few commands:

$ touch test_src.lst
$ hadoop fs -copyFromLocal test_src.lst /user/scott/data/test.lst

Try hadoop fs to get the usage:

$ hadoop fs

Open a browser, go to http://cloud-alocal:50070, click “Browse the filesystem”, and navigate to /user/scott/data to see your file:

/user/scott/data/test.lst

Test.lst

So now you have a file in the Hadoop file system, and if you search for that full directory path and file you won’t see it on the OS file system.

Let’s put the connector code in place:

Go to the download directory for the zipped connector files:

unzip orahdfs-1.0.0.0.0.zip -d $ORACLE_HOME
unzip oraloader-1.1.0.0.1.x86_64.zip -d $ORACLE_HOME

Modify hdfs_stream at $ORACLE_HOME/orahdfs-1.0.0.0.0/bin
The lines to edit are the export settings shown below:
——————
#
export HADOOP_HOME=/usr
export HADOOP_CONF_DIR=/etc/hadoop/conf
#
# set DIRECTHDFS_HOME to install directory for DirectHDFS
#
export DIRECTHDFS_HOME=$ORACLE_HOME/orahdfs-1.0.0.0.0
export ORAHDFS_JAR=${DIRECTHDFS_HOME}/jlib/orahdfs.jar

———————-
A correct run should show the usage info:

$ ./hdfs_stream
Oracle Direct HDFS Release 1.0.0.0.0 - Production
Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved.
Usage: $HADOOP_HOME/bin/hadoop jar orahdfs.jar oracle.hadoop.hdfs.exttab.HdfsStream <locationPath>

Otherwise, check $DIRECTHDFS_HOME/log for the log files and debug your changes.

Create the external directory on odbs, and set up the following grants as well:
mkdir $ORACLE_BASE/external_dir
chmod 770 $ORACLE_BASE/external_dir

From sqlplus execute the following:

sqlplus /nolog
connect / as sysdba

create or replace directory external_dir as '/u01/app/oracle/external_dir';
create or replace directory hdfs_bin_path as '/u01/app/oracle/product/11.2.0/dbhome_1/orahdfs-1.0.0.0.0/bin';
create user hdfsuser identified by hdfsuser;
grant create session to hdfsuser;
grant create table to hdfsuser;
grant execute on sys.utl_file to hdfsuser;
grant read, write on directory external_dir to hdfsuser;
grant read, execute on directory hdfs_bin_path to hdfsuser;

Make sure all statements executed correctly.

Now you’re all done.

Guess you need some examples to make this worth the effort. Here’s an external table example.

Revised: 5/8/2012 – Added notes about alternative config
Revised: 5/10/2012 – added permissions for conf.my_cluster