

Hadoop-based Services for Windows Azure includes several samples you can use for learning and testing. In this video, developer Brad Sarsfield demonstrates two different ways to upload data to Hadoop-based Services for Windows Azure. After he uploads the data, he uses the WordCount sample (included) to run a MapReduce program on the uploaded data.

See Also

  • More Videos about Hadoop Services on Windows and Windows Azure
  • Apache Hadoop Services on Windows - wiki Homepage
  • Microsoft's Big Data channel on YouTube

    Transcript

    Hello, my name is Brad Sarsfield; I'm a developer on the Hadoop Services for Windows and Windows Azure team.
    Today I'm going to show you two different ways to upload data into a Hadoop cluster on Windows Azure. Once the data is uploaded to my cluster, I'll use one of the samples included with Hadoop Services on Windows Azure to run a word count MapReduce job against the new data in my cluster.

    To upload the data, I have several options: I can use the Interactive JavaScript Console, secure FTPS, Azure Blob storage, Amazon S3, or import data from the Azure Data Market. Let's start with the JavaScript Interactive Console, which I can access from the Hadoop Services on Azure web portal.

    Upload data using the JavaScript Console

    1. The first thing I need to do is create a directory for my data on HDFS inside my cluster. I name the directory example/data.
    2. To select a file from my local hard drive, I use fs.put().
      The name of the local file I am uploading is DaVinci.txt, and the HDFS destination is my newly created example/data folder.
    3. Click Upload. That's it!
    4. To make sure the file uploaded properly, run the #ls command to see a directory listing. (A programmatic sketch of these same steps follows below.)
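
    Although the console is the quickest route, the same upload can be done programmatically. Below is a minimal Java sketch using the Hadoop FileSystem API, assuming it runs on the headnode (or anywhere with the Hadoop client libraries and the cluster's configuration on the classpath); the class name and paths are illustrative and are not taken from the video.

      // Minimal sketch: programmatic equivalent of the console steps above
      // (create example/data, upload DaVinci.txt, then list the directory).
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class UploadToHdfs {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();      // loads core-site.xml if it is on the classpath
              FileSystem fs = FileSystem.get(conf);

              Path target = new Path("example/data");        // same directory created in step 1
              fs.mkdirs(target);

              // Equivalent of fs.put(): copy the local file into HDFS.
              fs.copyFromLocalFile(new Path("DaVinci.txt"), new Path("example/data/DaVinci.txt"));

              // Equivalent of #ls: list the directory to confirm the upload.
              for (FileStatus status : fs.listStatus(target)) {
                  System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
              }
          }
      }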

    Upload data using secure FTPS

    Another way to upload data into HDFS on Windows Azure is via secure FTP. The FTP server runs on the headnode inside Windows Azure. We chose secure FTP because regular FTP puts your credentials over the wire in cleartext. Another security requirement is that the FTP password must be MD5-hashed.

    1. By default, the FTPS port is closed. To open the port, select the Open Ports tile.
    2. Toggle the FTPS port.
      In the background, this opens up the port on my Hadoop cluster's headnode and allows me to upload files via FTPS.
      The FTPS client that we recommend is called curl, and we make it easy for you to use curl and to MD5-hash your password by including a sample PowerShell script. The sample script is included here with the WordCount sample.
    3. To use the script, I download it to my local box and fill in the appropriate cluster name, my username, and my password. Now, when I open PowerShell and call the script, it uploads the specified DaVinci text file to the example/data directory in HDFS. (A code sketch of what the script does is shown below.)
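
    For readers who want to see conceptually what the script does, here is a minimal Java sketch of the same two steps: MD5-hash the password, then upload the file over FTPS. It assumes the Apache Commons Net library; the host, port, credentials, and class name are placeholders rather than values from the video, and the sample PowerShell script described above (which drives curl) remains the intended route.

      // Sketch only: hash the password and upload DaVinci.txt over FTPS.
      import java.io.FileInputStream;
      import java.io.InputStream;
      import java.security.MessageDigest;

      import org.apache.commons.net.ftp.FTP;
      import org.apache.commons.net.ftp.FTPSClient;

      public class FtpsUploadSketch {
          // MD5-hash the password, since the cluster's FTPS endpoint expects the hashed form.
          static String md5Hex(String s) throws Exception {
              byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
              StringBuilder hex = new StringBuilder();
              for (byte b : digest) {
                  hex.append(String.format("%02x", b));
              }
              return hex.toString();
          }

          public static void main(String[] args) throws Exception {
              String host = "<your-cluster>.cloudapp.net";   // placeholder cluster address
              int port = 21;                                  // placeholder: use the FTPS port opened in the portal
              String user = "<username>";                     // placeholder credentials
              String hashedPassword = md5Hex("<password>");

              FTPSClient ftps = new FTPSClient();             // explicit FTPS (AUTH TLS)
              ftps.connect(host, port);
              try {
                  ftps.login(user, hashedPassword);
                  ftps.setFileType(FTP.BINARY_FILE_TYPE);
                  ftps.enterLocalPassiveMode();
                  try (InputStream in = new FileInputStream("DaVinci.txt")) {
                      ftps.storeFile("example/data/DaVinci.txt", in);
                  }
                  ftps.logout();
              } finally {
                  ftps.disconnect();
              }
          }
      }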

    And there you have the second way to upload data to your Hadoop cluster on Windows Azure. Now it's time to deploy the WordCount job.

    Deploy the WordCount job

    Hadoop on Azure comes with samples I can use for learning and testing. Today I'm going to use the WordCount Java MapReduce sample to count the occurrences of each word in my DaVinci text file. I am going to run this example on my cluster using the Hadoop examples JAR.
    1. On the WordCount Sample page, select Deploy to your cluster.
      The Job Template is prepopulated with the appropriate job name; it attaches the JAR file and sets the parameters that are going to be passed into the job. It even displays the command that will run on the headnode.
      So for my job, the parameters are:
      1. wordcount, to indicate we are running the wordcount example in the hadoop-examples JAR
      2. The input file is DaVinci.txt and the output file is DaVinciAllTopWords

      Based on my parameters, the Final Command that will be executed is constructed below.

    2. I click Execute Job. My DaVinci text file is sent to the headnode, where each mapper evaluates a single line and the reducer sums the counts for each word.

      Each map process reads a line from the file and then parses all of the words.
      The output of the map is a key-value pair for each word and the number 1.
      The reducers then sum up the counts for each word from all of the map outputs and, in turn, emit each word and its total occurrences as the final output. (A sketch of this mapper and reducer appears after this list.)

      The Job Page displays status. The Standard Errors section contains messages from Hadoop: things like status, statistics, and informational messages. The Output section contains messages generated by the WordCount Java code.
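
    The Final Command shown in the portal is typically the standard Hadoop examples invocation, roughly hadoop jar hadoop-examples-<version>.jar wordcount <input> <output>. For reference, below is a condensed Java sketch in the shape of the canonical WordCount mapper and reducer described above; the sample actually shipped with the cluster may differ in detail, the class names are illustrative, and the job/driver setup is omitted because the portal supplies the command.

      // Condensed WordCount sketch (org.apache.hadoop.mapreduce API).
      import java.io.IOException;
      import java.util.StringTokenizer;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;

      public class WordCountSketch {

          // Each map call receives one line of the input and emits (word, 1) for every token.
          public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
              private static final IntWritable ONE = new IntWritable(1);
              private final Text word = new Text();

              @Override
              protected void map(LongWritable key, Text value, Context context)
                      throws IOException, InterruptedException {
                  StringTokenizer tokens = new StringTokenizer(value.toString());
                  while (tokens.hasMoreTokens()) {
                      word.set(tokens.nextToken());
                      context.write(word, ONE);
                  }
              }
          }

          // The reducer receives all counts for one word and emits the word with its total.
          public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
              @Override
              protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                      throws IOException, InterruptedException {
                  int sum = 0;
                  for (IntWritable count : values) {
                      sum += count.get();
                  }
                  context.write(key, new IntWritable(sum));
              }
          }
      }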

    My job completed successfully. I see that a new file called DaVinciAllTopWords has been created.

    I used two different methods to upload data to Hadoop Services on Windows Azure and then ran the WordCount job on that data.

    Thank you for viewing this tutorial. I hope you found it helpful.