Thursday, 5 September 2013

A FAQ for JSR

When I first came across JSRs (Java Specification Requests), I had some difficulty understanding them, and I had several questions too. So, at JavaRanch/CodeRanch (where I am a moderator in some of the forums), I created a FAQ about JSRs, which is now publicly available: JSR FAQ

Swing TreeTable Example with root using JXTreeTable

In my earlier post, I blogged about using different objects with a HAS-A relationship to be displayed in a tree table. However, I made the root node invisible and used a dummy object for the root.

In this post, I am going to use an object to represent the root and also display it. I am going to continue from where I left off by introducing an Organization object. This Organization object will contain a List<Department> object. Each Department object will in turn contain a List<Employee> object. This is the same as in my earlier post. I think this reflects a real-world scenario of an organization with several departments, each department having many employees.

The Department and Employee classes remain the same. The Organization class looks as follows:

import java.util.List;

public class Organization {

    private String name;
    private List<Department> departmentList;

    public Organization(String name, List<Department> departmentList) {
        this.name = name;
        this.departmentList = departmentList;
    }

    public List<Department> getDepartmentList() {
        return departmentList;
    }

    public String getName() {
        return name;
    }
}

Changes to the tree-table model are along expected lines. Instead of using a dummy object for the root, I pass the Organization object itself as the root, like this (I have a different class name now):
public MyTreeTableModel(Organization organization) {
    super(organization);
}

The getValueAt() method has an extra if condition to account for one more object in the hierarchy:
@Override
public Object getValueAt(Object node, int column) {
    if (node instanceof Organization) {
        // the organization has a single value (its name), shown in the first column
        if (0 == column) {
            return ((Organization) node).getName();
        }
    }
    if (node instanceof Department) {
        Department dept = (Department) node;
        switch (column) {
            case 0:
                return dept.getId();
            case 1:
                return dept.getName();
        }
    } else if (node instanceof Employee) {
        Employee emp = (Employee) node;
        switch (column) {
            case 0:
                return emp.getId();
            case 1:
                return emp.getName();
            case 2:
                return emp.getDoj();
            case 3:
                return emp.getPhoto();
        }
    }
    return null;
}
As the Organization has only one value to be displayed, I use the first column to do it.

In the same way, the getChild() and getChildCount() methods have an extra condition:
@Override
public Object getChild(Object parent, int index) {
    if (parent instanceof Organization) {
        Organization org = (Organization) parent;
        return org.getDepartmentList().get(index);
    } else {
        Department dept = (Department) parent;
        return dept.getEmployeeList().get(index);
    }
}

@Override
public int getChildCount(Object parent) {
    if (parent instanceof Organization) {
        Organization org = (Organization) parent;
        return org.getDepartmentList().size();
    } else {
        Department dept = (Department) parent;
        return dept.getEmployeeList().size();
    }
}


Now, as I have 2 levels of hierarchy, the parent could be either Organization or Department. So, I do an instanceof check and return the corresponding child (or the child count).

Same goes for the getIndexOfChild() method. This method implementation is slightly more complex. Based on the parent, I do a cast and get the exact parent and child objects. Then, I call the appropriate method to get the list from which I can get the index of the child.

@Override
public int getIndexOfChild(Object parent, Object child) {
    if(parent instanceof Organization) {
        Organization org = (Organization) parent;
        Department dept = (Department) child;
        return org.getDepartmentList().indexOf(dept);
    }
    else {
        Department dept = (Department) parent;
        Employee emp = (Employee) child;
        return dept.getEmployeeList().indexOf(emp);
    }
}
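
The instanceof dispatch used in getChild(), getChildCount() and getIndexOfChild() can be tried out without SwingX. The following is only a minimal sketch under some assumptions: it uses the JDK's javax.swing.tree.TreeModel instead of the SwingX tree-table model, and Org, Dept, Emp and OrgTreeModelSketch are hypothetical, trimmed-down stand-ins for the classes in this post.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.swing.event.TreeModelListener;
import javax.swing.tree.TreeModel;
import javax.swing.tree.TreePath;

// Hypothetical, trimmed-down stand-ins for Employee, Department and Organization.
class Emp {
    final String name;
    Emp(String name) { this.name = name; }
}

class Dept {
    final String name;
    final List<Emp> employees;
    Dept(String name, List<Emp> employees) { this.name = name; this.employees = employees; }
}

class Org {
    final String name;
    final List<Dept> departments;
    Org(String name, List<Dept> departments) { this.name = name; this.departments = departments; }
}

// Same instanceof dispatch as in the post, written against the JDK's TreeModel.
class OrgTreeModelSketch implements TreeModel {
    private final Org root;

    OrgTreeModelSketch(Org root) { this.root = root; }

    // Departments for an Org, employees for a Dept, nothing for an Emp.
    private List<?> childrenOf(Object node) {
        if (node instanceof Org) {
            return ((Org) node).departments;
        }
        if (node instanceof Dept) {
            return ((Dept) node).employees;
        }
        return Collections.emptyList();
    }

    @Override public Object getRoot() { return root; }
    @Override public Object getChild(Object parent, int index) { return childrenOf(parent).get(index); }
    @Override public int getChildCount(Object parent) { return childrenOf(parent).size(); }
    @Override public boolean isLeaf(Object node) { return node instanceof Emp; }
    @Override public int getIndexOfChild(Object parent, Object child) { return childrenOf(parent).indexOf(child); }

    // Editing and listeners are not needed for a read-only walk.
    @Override public void valueForPathChanged(TreePath path, Object newValue) { }
    @Override public void addTreeModelListener(TreeModelListener l) { }
    @Override public void removeTreeModelListener(TreeModelListener l) { }

    // Counts the given node plus everything below it, using only the model methods.
    static int countNodes(TreeModel model, Object node) {
        int count = 1;
        for (int i = 0; i < model.getChildCount(node); i++) {
            count += countNodes(model, model.getChild(node, i));
        }
        return count;
    }

    // Builds the same sample data as the GUI code in this post.
    static OrgTreeModelSketch sample() {
        Dept sales = new Dept("Sales",
                Arrays.asList(new Emp("Kiran"), new Emp("Prabhu"), new Emp("Murugavel")));
        Dept production = new Dept("Production",
                Arrays.asList(new Emp("Deiveegan"), new Emp("Saravanan")));
        return new OrgTreeModelSketch(new Org("ABC XYZ Corporation", Arrays.asList(sales, production)));
    }
}
```

With the sample data, countNodes(model, model.getRoot()) visits 8 nodes: the organization, 2 departments and 5 employees. Note that indexOf() relies on equals(), so with classes that do not override equals() the child is located by object identity.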

While using this model in our GUI, we need to make the root visible. This can be done by calling treeTable.setRootVisible(true). And, we will be building the Organization object in addition to what we did earlier. The complete GUI code looks like:

import java.util.*;
import javax.swing.*;
import org.jdesktop.swingx.JXTreeTable;

public class TreeTableMainWithRoot extends JFrame {

    private JXTreeTable treeTable;
    
    public TreeTableMainWithRoot() {
        //sample doj
        final Date doj = Calendar.getInstance().getTime();
        List<Department> departmentList = new ArrayList<Department>();

        List<Employee> empList1 = new ArrayList<Employee>();
        empList1.add(new Employee(1, "Kiran", doj, "emp1.jpg"));
        empList1.add(new Employee(2, "Prabhu", doj, "emp2.jpg"));
        empList1.add(new Employee(3, "Murugavel", doj, "emp1.jpg"));
        departmentList.add(new Department(1, "Sales", empList1));

        List<Employee> empList2 = new ArrayList<Employee>();
        empList2.add(new Employee(4, "Deiveegan", doj, "emp2.jpg"));
        empList2.add(new Employee(5, "Saravanan", doj, "emp1.jpg"));
        departmentList.add(new Department(2, "Production", empList2));
        
        Organization organization = new Organization("ABC XYZ Corporation", departmentList);
        
        MyTreeTableModel myTreeTableModel = new MyTreeTableModel(organization);
        treeTable = new JXTreeTable(myTreeTableModel);
        treeTable.setAutoResizeMode(JTable.AUTO_RESIZE_OFF);
        treeTable.setRootVisible(true);
        treeTable.getColumnModel().getColumn(3).setCellRenderer(new PhotoRenderer());
        treeTable.setRowHeight(50);

        add(new JScrollPane(treeTable));

        setTitle("JXTreeTable Example");
        setDefaultCloseOperation(EXIT_ON_CLOSE);
        pack();
        setVisible(true);
    }

    public static void main(String[] args) {
        SwingUtilities.invokeLater(new Runnable() {
            @Override
            public void run() {
                new TreeTableMainWithRoot();
            }
        });
    }
}

When we run the code and expand all nodes, we get an output like:

The complete source code is available in the form of a Maven project in my github repo.

Tuesday, 20 August 2013

Swing TreeTable Example using JXTreeTable

This post provides an example using JXTreeTable which is a tree-table component in SwingX. SwingX is an open source Swing extension toolkit from SwingLabs.

Most of the tree-table examples that we find have only one type of entity; showing a 'folder' view is the most common. In that case, the file and the folder share some common traits and have an IS-A relationship (from an OOP perspective). However, in a real scenario, we may have to deal with objects that are not directly related.

For example, think of a Department object which has a list of Employee objects. These may share a relationship at the db level, but it will be a HAS-A relationship. So, in a tree-table component, the parent is one type of object and child (leaf) is another. This leads to handling of different conditions in most methods. Added to this, we might have a completely different object represent the root component - say, an Organization object.

However, for the moment, let us deal with an example where the root is not shown. This example displays the departments in an organization. Under each department, there may be any number of employees.

Let us first create two classes representing the employee and department:

public class Employee {

    private int id;
    private String name;
    private Date doj;
    private String photo;

    public Employee(int id, String name, Date doj, String photo) {
        this.id = id;
        ...
    }
    //setters and getters not shown for brevity
}

public class Department {

    private int id;
    private String name;
    private List<Employee> employeeList;

    public Department(int id, String name, List<Employee> empList) {
        this.id = id;        
        ...
    }

    public List<Employee> getEmployeeList() {
        return employeeList;
    }

    public void setEmployeeList(List<Employee> employeeList) {
        this.employeeList = employeeList;
    }

    //other setters and getters
}

Let us now write the tree-table model. We need to extend org.jdesktop.swingx.treetable.AbstractTreeTableModel (an abstract implementation of org.jdesktop.swingx.treetable.TreeTableModel) and override its methods:

import java.util.List;
import org.jdesktop.swingx.treetable.AbstractTreeTableModel;

public class NoRootTreeTableModel extends AbstractTreeTableModel {
    private final static String[] COLUMN_NAMES = {"Id", "Name", "Doj", "Photo"};
    
    private List<Department> departmentList;

    public NoRootTreeTableModel(List<Department> departmentList) {
        super(new Object());
        this.departmentList = departmentList;
    }

    @Override
    public int getColumnCount() {
        return COLUMN_NAMES.length;
    }

    @Override
    public String getColumnName(int column) {
        return COLUMN_NAMES[column];
    }
    
    @Override
    public boolean isCellEditable(Object node, int column) {
        return false;
    }

    @Override
    public boolean isLeaf(Object node) {
        return node instanceof Employee;
    }
    ...

}

The tree-table is going to show a list of departments. So, in our implementation of the model, we declare a constructor that takes a List<Department>. The getColumnCount(), getColumnName() and isCellEditable() implementations are the same as what we do for a JTable. The isLeaf() method is implemented for trees. This method should return a boolean to indicate whether the node is a leaf. In our case, the employee objects should be displayed as leaves, so we simply do an instanceof check for Employee. Note the usage of a dummy object to indicate the root.

Let us continue with other methods:

    @Override
    public int getChildCount(Object parent) {
        if (parent instanceof Department) {
            Department dept = (Department) parent;
            return dept.getEmployeeList().size();
        }
        return departmentList.size();
    }

    @Override
    public Object getChild(Object parent, int index) {
        if (parent instanceof Department) {
            Department dept = (Department) parent;
            return dept.getEmployeeList().get(index);
        }
        return departmentList.get(index);
    }

The getChildCount() and getChild() methods are similar to what we do for a JTree model. For the root, getChildCount() should return the number of departments. If the parent is itself a department, it should return the number of employees in that department. So, we handle this with an if condition.

In the same way, in the getChild() implementation, we check whether the parent passed in is an instance of Department. If so, we return an Employee object by getting it from the department's employee list at the given index. If not, the parent is the root, and the Department at that index is returned. Note that the getChild() and getChildCount() methods are closely related.

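Next is the getIndexOfChild() method, which is a bit tricky. It should return the index of an Employee object within a Department object. As the parent could also be the dummy root, the type is checked first so that a root parent does not cause a ClassCastException:

    @Override
    public int getIndexOfChild(Object parent, Object child) {
        if (parent instanceof Department) {
            Department dept = (Department) parent;
            return dept.getEmployeeList().indexOf(child);
        }
        // the parent is the dummy root, whose children are the departments
        return departmentList.indexOf(child);
    }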

Finally, the most important method:

    @Override
    public Object getValueAt(Object node, int column) {
        if (node instanceof Department) {
            Department dept = (Department) node;
            switch (column) {
                case 0:
                    return dept.getId();
                case 1:
                    return dept.getName();
            }
        } else if (node instanceof Employee) {
            Employee emp = (Employee) node;
            switch (column) {
                case 0:
                    return emp.getId();
                case 1:
                    return emp.getName();
                case 2:
                    return emp.getDoj();
                case 3:
                    return emp.getPhoto();
            }
        }
        return null;
    }

The getValueAt() method is the one that returns the value of every cell in a JTable. Note that the JXTreeTable extends JTable, so correct implementation of this method is important.

In our case, the node might be a Department or Employee. So, an instanceof check is first applied on the node. Then, based on the column number, the corresponding method is called (this is similar to what we do for JTable). Note that the Department has only 2 columns of data to display and the Employee 4. The final "return null" statement takes care of the missing columns.

Let us now use this model by building our GUI (I have used the current time for the doj of all the records as this is just a demonstration):

import java.util.*;
import javax.swing.*;
import org.jdesktop.swingx.JXTreeTable;

public class TreeTableTest extends JFrame {

    private JXTreeTable treeTable;

    public TreeTableTest() {
        //sample doj
        final Date doj = Calendar.getInstance().getTime();        
        List<Department> departmentList = new ArrayList<Department>();

        //create and add the first department with its list of Employee objects
        List<Employee> empList1 = new ArrayList<Employee>();
        empList1.add(new Employee(1, "Kiran", doj, "emp1.jpg"));
        empList1.add(new Employee(2, "Prabhu", doj, "emp2.jpg"));
        empList1.add(new Employee(3, "Murugavel", doj, "emp1.jpg"));        
        departmentList.add(new Department(1, "Sales", empList1));

        //create and add the second department with its list of Employee objects
        List<Employee> empList2 = new ArrayList<Employee>();
        empList2.add(new Employee(4, "Deiveegan", doj, "emp2.jpg"));
        empList2.add(new Employee(5, "Saravanan", doj, "emp1.jpg"));
        departmentList.add(new Department(2, "Production", empList2));
        
        //we use a no root model
        NoRootTreeTableModel noRootTreeTableModel = new NoRootTreeTableModel(departmentList);
        treeTable = new JXTreeTable(noRootTreeTableModel);
        treeTable.setAutoResizeMode(JTable.AUTO_RESIZE_OFF);        
        treeTable.setRootVisible(false);  // hide the root

        add(new JScrollPane(treeTable));

        setTitle("JXTreeTable Example");
        setDefaultCloseOperation(EXIT_ON_CLOSE);
        pack();
        setVisible(true);
    }

    public static void main(String[] args) {
        SwingUtilities.invokeLater(new Runnable() {
            @Override
            public void run() {        
                new TreeTableTest();
            }
        });
    }
}

When we run the example and expand both the department nodes, we get an output like:

Note that the doj and the photo column for department row(s) are empty. This is a practical example of a tree-table.

Edit: At the request of some readers, I am providing code for showing an image. The Employee class has a member 'photo' which holds the name of the employee's image file. Currently, it is shown only as a text value. To show the photo as an actual image, we need to write a renderer. This is similar to what we do for JTable:

public class PhotoRenderer extends JLabel
                           implements TableCellRenderer {
    public Component getTableCellRendererComponent(
                            JTable table, Object photo,
                            boolean isSelected, boolean hasFocus,
                            int row, int column) {
        if(photo != null) {
            ImageIcon imageIcon = ...;
            setIcon(imageIcon);
        }
        else {
            setIcon(null);
        }
        return this;
    }
}

This renderer extends JLabel as we can set an icon for a JLabel. So, we can freely call the setIcon method of the JLabel to set the icon.
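
The renderer above leaves the construction of the ImageIcon open. If the images are larger than a table row, one possible approach is to scale them before wrapping them in an icon. The following is only a sketch under some assumptions: the image has already been loaded as a BufferedImage (for example with ImageIO.read), and PhotoScalingSketch and scaleToHeight are hypothetical names that are not part of the original code or of SwingX.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper; not part of the original post or of SwingX.
final class PhotoScalingSketch {

    private PhotoScalingSketch() { }

    // Returns a copy of src scaled to targetHeight pixels, keeping the aspect ratio.
    static BufferedImage scaleToHeight(BufferedImage src, int targetHeight) {
        int targetWidth = Math.max(1,
                Math.round(src.getWidth() * (float) targetHeight / src.getHeight()));
        BufferedImage scaled =
                new BufferedImage(targetWidth, targetHeight, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = scaled.createGraphics();
        try {
            // bilinear interpolation gives smoother results when shrinking photos
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(src, 0, 0, targetWidth, targetHeight, null);
        } finally {
            g.dispose();
        }
        return scaled;
    }
}
```

Inside a renderer, the result could be used as setIcon(new ImageIcon(PhotoScalingSketch.scaleToHeight(image, table.getRowHeight()))). Scaling on every paint is wasteful, so in a real application the scaled images would probably be cached.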

Next, we need to set this renderer to the particular column, like:
treeTable.getColumnModel().getColumn(3).setCellRenderer(new PhotoRenderer());

As the photo column is the 4th column, I am using the index 3 to get the TableColumn and set the renderer. When I run the code, I get the following output (I used some dummy images for photos):



Although the image is successfully displayed, it is not sufficient as we are not able to see the full image. To correct this, we need to increase the 'row height' of the table, like:
treeTable.setRowHeight(50);

Now, when I run the program, I get the following output:


The complete source code is available in the form of a Maven project in my github repo.

In a subsequent post, I have shown a similar example with the root visible (for example, a root represented by an Organization).

Friday, 7 June 2013

Handling Small Files in Hadoop MapReduce with CombineFileInputFormat

This post explains the small files problem that I faced with my Hadoop MapReduce program and how I solved it.

Problem:

The small files problem is well known in Hadoop. If the input consists of a large number of small files, each smaller than the default block size of 64 MB, MapReduce (MR) does not process them efficiently.

I wrote an MR program and ran it; the input consisted of 1000 such small files. It simply created 1000 mappers, one for each file. The MR program took longer to complete than I had expected.

I thought of improving this and made some changes, which resulted in a drastic improvement in performance. My MR ran in about a quarter of the original time.

Explanation:

Let me explain the solution before I proceed to the code. The default input format in MR is TextInputFormat, which extends FileInputFormat, the abstract base class for file-based InputFormat implementations. Every InputFormat implementation should return a List<InputSplit> from its getSplits() method. Hadoop internally calls size() on this list and creates that many mappers, i.e. 1 mapper for 1 split.

In my case, the default FileInputFormat behaviour simply created 1000 splits (as all 1000 files were smaller than 64 MB), resulting in 1000 mappers being run, which was inefficient.

The solution is to use CombineFileInputFormat. This class is really efficient, as it also takes rack and node locality into account. However, it is an abstract class. Though it does most of the work, some implementation is left to us (which is not easy to write).

I read one superb implementation in this blog post. When I used this approach, my MR ran in less time, but it created only 1 mapper (yes, only 1 for the entire MR run). This is not ideal either. Let's say we have a total of 600 MB of data in 100 files: we would ideally need 10 splits, each dealing with up to 64 MB of data. Note that the data for one split (64 MB) might be spread across multiple files.

Key Idea: 

So, I wondered why this happened and browsed through the CombineFileInputFormat.java source (wow, one of the perks of using open source software). I found that, in the getSplits() implementation, if the maxSplitSize is zero, it creates only one split! And the default is 0, as you can see in the variable declaration.
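
The effect of the max split size on the number of mappers can be illustrated with some simple arithmetic. The following is only a simplified model and not Hadoop's actual algorithm: it ignores node and rack locality and just packs whole files into a split until a size limit is reached. SplitCountSketch and its methods are hypothetical names used for illustration.

```java
// Simplified model of split counting; NOT Hadoop's actual algorithm
// (node and rack locality are ignored, and files are never cut in the middle).
final class SplitCountSketch {

    private SplitCountSketch() { }

    // One split per small file, as with the default behaviour described in this post.
    static int splitsPerFile(long[] fileSizes) {
        return fileSizes.length;
    }

    // Adds files to the current split until it reaches maxSplitSize bytes.
    // A maxSplitSize of 0 is treated as "no limit", which yields a single split.
    static int combinedSplits(long[] fileSizes, long maxSplitSize) {
        if (fileSizes.length == 0) {
            return 0;
        }
        if (maxSplitSize == 0) {
            return 1;
        }
        int splits = 0;
        long current = 0;
        for (long size : fileSizes) {
            current += size;
            if (current >= maxSplitSize) {
                // the split is full: close it and start a new one
                splits++;
                current = 0;
            }
        }
        if (current > 0) {
            splits++; // the last, partially filled split
        }
        return splits;
    }
}
```

For 100 files of 6 MB each (600 MB in total), this model gives 100 splits with one split per file, 1 split with a limit of 0, and 10 splits with a limit of 64 MB, which matches the numbers discussed above.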

Implementation:

So, the trick lies in setting this to a different value in your implementation, like:

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;

public class MyCombineFileInputFormat extends CombineFileInputFormat<LongWritable, Text> {
    public MyCombineFileInputFormat() {
        //this is the most important line!
        //setting the max split size to 64 MB
        setMaxSplitSize(67108864);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit is,
            TaskAttemptContext tac) throws IOException {
        return new CombineFileRecordReader<LongWritable, Text>((CombineFileSplit) is,
                tac, MyRecordReader.class);
    }
    }
}

And, the record reader can just reuse the LineRecordReader like this:

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class MyRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader lineRecordReader;

    public MyRecordReader(CombineFileSplit split, TaskAttemptContext context,
            Integer index) throws IOException {
        FileSplit filesplit = new FileSplit(split.getPath(index), split.getOffset(index),
                split.getLength(index), split.getLocations());
        lineRecordReader = new LineRecordReader();
        lineRecordReader.initialize(filesplit, context);
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext tac)
            throws IOException, InterruptedException {
        // nothing to do: the wrapped LineRecordReader is initialized in the constructor
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        return lineRecordReader.nextKeyValue();
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return lineRecordReader.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return lineRecordReader.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return lineRecordReader.getProgress();
    }

    @Override
    public void close() throws IOException {
        lineRecordReader.close();
    }

}

Once this was done, the number of mappers created was ideal and this drastically reduced the overall time taken for my MR program.