Monday, October 22, 2012

Hadoop single node setup on Ubuntu

I have been trying to figure out how to use Hadoop and HDFS for a while, but the information on the official site is scattered and out of date. Here are some notes I took.

  1. A clean Ubuntu 10.04 LTS install.
  2. Download the Hadoop package from here.
    1. Hadoop's versioning scheme is very confusing:
      • 1.0.x is the stable series
      • 1.1.x is the beta series
      • 2.x.x is the alpha series
      • 0.23.x is similar to 2.x.x but is missing NameNode HA
      • I ignored everything else starting with 0.2x and just used 1.0.4 directly.
    2. Download the KEYS file from the root directory.
    3. Download hadoop_1.0.4-1_i386.deb (or its x64 version) and its .asc file from the hadoop-1.0.4 folder.
  3. Check the integrity
    1. run `gpg --import KEYS`
    2. run `gpg --verify hadoop_1.0.4-1_i386.deb.asc`
    3. You should see
      mac@mac-ubuntu:~/projects/hadoop$ gpg --verify hadoop_1.0.4-1_i386.deb.asc
      gpg: Signature made Thu 04 Oct 2012 01:04:55 PM PDT using RSA key ID ECB31663
      gpg: Good signature from "Matthew Foley (CODE SIGNING KEY) <mattf@apache.org>"
      gpg: WARNING: This key is not certified with a trusted signature!
      gpg:          There is no indication that the signature belongs to the owner.
      Primary key fingerprint: 7854 36A7 8258 6B71 829C  67A0 4169 AA27 ECB3 1663
    4. That's good enough for me.
  4. Install it
    • sudo dpkg -i hadoop_1.0.4-1_i386.deb
  5. Patch it
    • I found some issues when the .deb ran on Ubuntu, so I patched it.
    • Add the following lines to /usr/sbin/hadoop-daemon.sh at line 81:
      export USER=`whoami`

      # For the MapReduce daemons, write logs under a per-user directory
      # and use the invoking user as the daemon identity string.
      if [ "$command" == "jobtracker" ] || [ "$command" == "tasktracker" ]; then
        export HADOOP_LOG_DIR=/var/log/hadoop/$USER
        export HADOOP_IDENT_STRING="$USER"
      fi
  6. Configure it
    • It comes with a handy setup script; just run it.
    • `sudo hadoop-setup-single-node.sh --default`
    • The NameNode, DataNode, and TaskTracker should be running now, but the JobTracker is not.
  7. Fix it (the JobTracker needs a /mapred directory in HDFS owned by the mapred user)
    1. `sudo -u hdfs hadoop fs -mkdir /mapred`
    2. `sudo -u hdfs hadoop fs -chown mapred /mapred`
    3. `sudo /etc/init.d/hadoop-jobtracker restart`
  8. Test it
    1. `sudo hadoop-validate-setup.sh --user=hdfs`
    2. It runs three test cases; you should see all of them pass.
I think this is the cleanest way to install Hadoop on Ubuntu.

Tuesday, October 02, 2012

TrueCrypt

TrueCrypt is a nice tool that encrypts your data into a virtual disk, which is actually just a file residing in your regular file system. That file can be put in your Dropbox folder, so your data can be stored in the "cloud" securely.

AES encryption/decryption


  • Encryption
    • openssl enc -e -in original_file -out original_file.aes -aes256 -k password
  • Decryption
    • openssl enc -d -in original_file.aes -out original_file.out -aes256 -k password

AES output size: original file size + 1, rounded up to a multiple of 16 bytes (the PKCS padding always adds at least one byte), then add 16 bytes for the "Salted__" header that openssl prepends.
Example 1: a 117-byte file
117 + 1, padded to a 16-byte boundary => 128 bytes
128 + 16 (header) = 144 bytes

Example 2: a 127-byte file
127 + 1, padded to a 16-byte boundary => 128 bytes
128 + 16 (header) = 144 bytes

Example 3: a 128-byte file
128 + 1, padded to a 16-byte boundary => 144 bytes
144 + 16 (header) = 160 bytes

The output size is independent of the password length.
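
The same arithmetic as a quick sketch in Java (the class and method names are mine):

    public class AesSize {
        // Expected output size of `openssl enc -aes256 -k password`:
        // PKCS padding always adds at least 1 byte and rounds the
        // plaintext up to a 16-byte boundary, and openssl prepends a
        // 16-byte "Salted__" header (8-byte magic + 8-byte salt).
        static long encryptedSize(long plainBytes) {
            long padded = (plainBytes / 16 + 1) * 16; // (n + 1) rounded up to 16
            return padded + 16;                       // plus the salt header
        }

        public static void main(String[] args) {
            System.out.println(encryptedSize(117)); // 144
            System.out.println(encryptedSize(127)); // 144
            System.out.println(encryptedSize(128)); // 160
        }
    }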


Monday, October 01, 2012

Lucene

http://lucene.apache.org/core/3_6_1/demo.html

CLASSPATH

  • OK
    • export CLASSPATH=/home/mac/xxxx/xxx/xxx.jar:/home/mac/yyyy/yyy/yyy.jar
    • export CLASSPATH=/home/mac/xxxx/xxx/*:/home/mac/yyyy/yyy/*
  • Not OK
    • export CLASSPATH=/home/mac/xxxx/xxx/*.jar:/home/mac/yyyy/yyy/*.jar
    • export CLASSPATH=/home/mac/xxxx/xxx/:/home/mac/yyyy/yyy/
    • The JVM expands a bare * to all the jars in that directory, but *.jar is not a recognized wildcard form, and a plain directory entry only picks up .class files, never jars (the snippet below prints what actually got resolved).
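
If in doubt, a trivial sketch to print what the JVM resolved (the class name is mine; the launcher should have expanded any wildcard entries before main runs):

    public class ShowClasspath {
        public static void main(String[] args) {
            // java.class.path reflects the classpath after the launcher
            // has expanded any dir/* entries into the jars they matched.
            System.out.println(System.getProperty("java.class.path"));
        }
    }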

http://lucene.apache.org/core/3_6_1/demo2.html

  • You need to detect the document language and switch to the matching analyzer.
Create Index
  • Open a directory to put the index files in (dir)
  • Create an Analyzer (analyzer)
  • Create an IndexWriterConfig (iwc)
  • Apply any settings to the IndexWriterConfig
  • Use dir and iwc to create an IndexWriter (writer)
  • Add documents
    • Create a Document (doc)
    • Add several fields
      • Create a Field (pathField)
        • Field pathField = new Field("path", file.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
        • pathField.setIndexOptions(IndexOptions.DOCS_ONLY);
        • doc.add(pathField);
      • Create a NumericField (modifiedField)
        • NumericField modifiedField = new NumericField("modified");
        • modifiedField.setLongValue(file.lastModified());
        • doc.add(modifiedField);
      • Add the real content by reading the actual file
        • doc.add(new Field("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));
  • Add / Update index
    • Add: writer.addDocument(doc);
    • Update: writer.updateDocument(new Term("path", file.getPath()), doc);
  • Close the writer (a full sketch of these steps follows this list)
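
Putting the indexing steps together, a minimal sketch against the Lucene 3.6 API (the input file and index path are placeholders of mine):

    import java.io.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.index.FieldInfo.IndexOptions;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.store.*;
    import org.apache.lucene.util.Version;

    public class CreateIndex {
        public static void main(String[] args) throws Exception {
            File file = new File("docs/sample.txt");             // placeholder input
            Directory dir = FSDirectory.open(new File("index")); // index location

            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, analyzer);
            iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
            IndexWriter writer = new IndexWriter(dir, iwc);

            Document doc = new Document();

            // path: stored, not analyzed, docs-only postings
            Field pathField = new Field("path", file.getPath(),
                    Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
            pathField.setIndexOptions(IndexOptions.DOCS_ONLY);
            doc.add(pathField);

            // last-modified time as a numeric field
            NumericField modifiedField = new NumericField("modified");
            modifiedField.setLongValue(file.lastModified());
            doc.add(modifiedField);

            // contents: tokenized from a reader, not stored
            FileInputStream fis = new FileInputStream(file);
            doc.add(new Field("contents",
                    new BufferedReader(new InputStreamReader(fis, "UTF-8"))));

            // updateDocument replaces any older doc with the same path
            writer.updateDocument(new Term("path", file.getPath()), doc);
            writer.close();
        }
    }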
Search
  • Open an IndexReader on the index directory (reader)
  • Create an IndexSearcher from the reader (searcher)
  • Create an Analyzer (analyzer)
  • Create a QueryParser (parser) // need to indicate which field will be searched
    • QueryParser parser = new QueryParser(Version.LUCENE_31, field, analyzer);
  • Create a Query (query)
    • Query query = parser.parse(line);
  • Do the search
    • Normal search
      • searcher.search(query, null, 100); // get the top 100 hits (null is the filter)
    • Search with paging
      • There is simple sample code in the demo source (a minimal search sketch follows below).
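
And the search side as one minimal sketch, again against the Lucene 3.6 API (index path and query string are placeholders of mine):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class SearchIndex {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.open(new File("index")));
            IndexSearcher searcher = new IndexSearcher(reader);

            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
            QueryParser parser = new QueryParser(Version.LUCENE_36, "contents", analyzer);
            Query query = parser.parse("hello world"); // placeholder query

            TopDocs results = searcher.search(query, null, 100); // top 100 hits, no filter
            for (ScoreDoc sd : results.scoreDocs) {
                Document d = searcher.doc(sd.doc);
                System.out.println(d.get("path") + "  score=" + sd.score);
            }

            reader.close(); // the searcher does not own the reader here
        }
    }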