Monday, November 29, 2010

Design - Sometimes Naive Is Best

I am working on a project where I have to read in a data file and store the data in a database. The datafile is straightforward comma separated values with headers. i.e. something like:

header1, header2, header3
value1a, value2a, value3a
value1b, value2b, value3b

There are a fixed and well-defined set of possible values for the headers. The database table I am storing this in has one column per header type. The only caveat is that there is no guarantee as to the order of the headers, or that each possible header will exist in any given file.

Obviously, this is a fairly straightforward problem to solve. However, there are nearly 100 possible header values, so if the code looks ugly, it'll look really ugly. As I am always trying to improve my coding, I toyed with various approaches to the problem to come up with the cleanest approach.

Database Access
Since we use JPA/Hibernate on other projects, it seemed natural to use that on this project. To that end, I have a class like:
  1 @Entity 
  2 public class DataRecord {
  3   private Long mId; 
  4   private String mValue1; 
  5   private String mValue2; 
  6   // rest of instance variables go here 
  8   // getters and setters go here 
  9 }
Given that this class is going to be useful elsewhere in my code and is consistent with our other projects, I took the existence of this class to be one of the design constraints. The question is how to best populate these objects based on the file.

File Parsing
I'm not going to go into the details of how I parse the file. For simplicity assume that I have used the String.split method with the appropriate regular expression so that I have variables
String[] headers; // an array of the headers specified in the file
String[][] lines; // an array of lines where each line is  
                  // an array of the values specified in the file

My initial desire is to create a map of header name to DataRecord setter method. i.e. something like:
recordMap = {"header1" => setValue1, "header2" => setValue2, /* etc. */ }
and then the parsing code could look something like:
  1 for(String header : headers)  setMethods[ctr++] = recordMap.get(header);
  2 for (String[] line : lines) { 
  3   DataRecord record = new DataRecord(); 
  4   ctr = 0; 
  5   for (String value : line) { 
  6     setMethods[ctr++].call(record, value); 
  7   } 
  8 }
Unfortunately, functions are not first class objects in Java and as of this writing look like they won't be until Java 8. And yes, using Java is another of my design constraints.

So what are my options for accomplishing something like this in Java? Well, I could store the name of the setter method in the recordMap object and then use reflection to accomplish what I want. While reflection has a performance hit, it probably would be negligible compared to the file and database I/O that is already happening, not to mention the reflection that Hibernate is already using under the covers. However, using reflection like this feels like an inappropriate use of the technology - it is turning method calls that could be typesafe into dynamic calls.

Anonymous Inner Classes
If I am not going to use reflection, I could do this the "Java" way, using interfaces and anonymous classes in the place of method pointers.  I.e. something like:
public interface FieldSetter {
  void setField(DataRecord record, String value); 
Then the recordMap initialization would look something like:
private static Map<String, FieldSetter> recordMap = 
  new HashMap<String, FieldSetter>(); 
static { 
  recordMap.put("header1", new FieldSetter() {
    public void setField(DataRecord record, String value) {
  /* repeat for every other possible header */ 
The parsing code can then be written almost as shown in the pseudo code in the approach section above. This works, but Yuck that sure is an ugly way to initialize the recordMap object.

So how else could this be done? Well, it turns out that the all of the possible legal values for headers happen to conform to the Java spec for identifier names. This means I could have an enumeration like so:
public enum Headers { header1, header2, header3, /* etc */ }
The code that reads the header line would look something like:
Headers[] headerEnums = new Headers[headers.length] 
int ctr = 0; 
for(String headerName : headers) { 
  headerEnums[ctr++] = Headers.valueOf(headerName); 
The parsing code would then have a giant switch statement in the inner for loop:
  1 for (String[] line : lines) {  
  2   DataRecord record = new DataRecord() 
  3   ctr = 0;  
  4   for (String value : line) {  
  5     switch (headers(ctr++) {  
  6       case header1:  record.setValue1(value); break;
  7       case header2:  record.setValue2(value); break;
  8       // etc  
  9     }  
 10   }  
 11 }
This isn't great, but actually ends up being fewer lines and more readable than the interface/anonymous method approach. However, it feels dirty to require that the enum values, which are supposed to be names that are semantically meaningful to the program, have to have exactly the same name as the headers in the data input files. We could have the Headers enum have a constructor which takes the name - thereby decoupling the enum name from the file name. But now we have to implement our own valueOf that handles that name, plus more than doubling the size of the enum declartion, since it would now look like:
public enum Headers { 
  HEADER1("header1"), HEADER2("header2")....;
  /* insert constructor and valueOf code here */ 
Which means with the enum approach, I can choose between dirty and a little bit ugly, or clean and uglier.  Ugh.

"Naive" Approach
So which of these approaches should I choose? Well, first, how would I have solved this problem back before I knew as much about design?
  1 for (String[] line : lines) {  
  2   DataRecord record = new DataRecord() 
  3   for (int i = 0; i < line.length; i++) {
  4     String value = line[i]; 
  5     String header = headers[i]; 
  6     if ("header1".equals(header)) record.setValue1(value);
  7     else if ("header2".equals(header)) record.setValue2(value);
  8     /* etc. */ 
  9   } 
 10 }
How "clean" is this approach? We specify each header value exactly once - in the if statement. We've co-located the header value with the behavior just like my original idea of a recordMap object. This code isn't any larger than the other approaches. There is nothing sneaky or clever happening so the code should be easy to understand, maintain, and update by other people (or myself) down the line. All in all, I think we have a winner.

Final Thoughts
Besides reiterating the obvious moral that sometimes being clever can be bad, I wanted to address a couple other things.

take some time now to save time in the futureFirst, why did I spend so much time and effort on something that very easily could've been a homework assignment back in school? Well, maybe it was foolish, but I'd like to offer up this one defense. In the "real world", unlike in school, code lives on and will have to be maintained. Taking some time now to think about clean design could very well pay itself back in saved time in the future.

Second, what about performance?  All of those string comparisons in the final solution bug me as being slow. If we have 50 headers, and its the last one that matches, that's 50 string comparisons that have been made - not particularly efficient. However, as mentioned up above when discussing reflection, no matter how slow this is, it is almost certainly orders of magnitude quicker than the file I/O and database accesses. So this fear for performance is almost certainly unfounded.  Should profiling determine that this is a bottleneck, the String[] headers array can have all of its strings interned, which would allow us to convert all of the equals checks to the much quicker == check.

Third, what about code style? Why didn't I put curly braces around the body of each if statement (lines 6 and 7)? Sure, with one statement it isn't required, but it is generally considered good style to always use curly braces. Partly it was for purposes of (slightly) shrinking the size of this blog post that is already too long. The other is that I think in this particular situation, there were going to be nearly 100 possible header values, and thus nearly 100 else if clauses. Given this, I think the code might actually be more legible and understandable if it only takes up 100 lines, rather than 200 or 300 lines (depending on whether you put your else on the same line as the previous closing curly).

Fourth, I do want to reiterate the main point. Just because you can solve a problem using a fancy technique or design pattern doesn't mean you should. Your job as a developer to weigh the pros and cons and come up with the best solution for your particular problem.

No comments: