Saturday, August 24, 2013

Writing a Java Regular Expression Without Reading the ***** Manual


Writing and maintaining regular expressions is part of every developer's routine work. And hey, we usually can't stand it. It's annoying, the syntax is impossible to memorize, and overall it is an experience one wants to leave behind as quickly as possible in order to move on to the actual problem at hand. I wonder what would happen if we used Planning Poker to estimate an RE problem: how far would the estimate be from the time it really took?
You see, when we need to write a new RE we go through the following steps: 
  1. Visit the Pattern Javadoc for a quick recap of the syntax.
  2. Describe the RE in English. It goes something like: "start with 4 digits, followed by spaces, then the string "DUR", then again some spaces, and finally one digit".
  3. Translate the English description to RE syntax: \d{4}\s+DUR\s+\d (written as "\\d{4}\\s+DUR\\s+\\d" in a Java string literal).
  4. Come up with examples. So here it will be something like: "1234 DUR 9".
  5. Write a test validating the examples, thinking about edge cases, and making sure the RE is valid.
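Steps 3 to 5 can be sketched in plain Java as follows. This is only a minimal illustration using java.util.regex; the class name DurPatternSketch and the extra negative examples are our own assumptions, not part of the original workflow.

```java
import java.util.regex.Pattern;

class DurPatternSketch {
    // Step 3: the English description translated to RE syntax.
    // Backslashes are doubled because this is a Java string literal.
    static final Pattern DUR = Pattern.compile("\\d{4}\\s+DUR\\s+\\d");

    // matches() requires the whole input to match, not just a part of it.
    static boolean matches(String input) {
        return DUR.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Steps 4 and 5: examples, including a few edge cases.
        System.out.println(matches("1234 DUR 9"));   // true
        System.out.println(matches("4423   DUR 1")); // true
        System.out.println(matches("123 DUR 9"));    // false: only 3 digits
        System.out.println(matches("1234DUR 9"));    // false: no space before DUR
        System.out.println(matches("1234 DUR 99"));  // false: two trailing digits
    }
}
```

Even for such a small expression, the translation and the test are separate manual steps, which is exactly the overhead described above.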
The situation is even worse when one needs to change an existing regular expression. Here we need to translate the RE syntax back to English, apply the changes and translate it back to RE syntax. This is again followed by examples and testing.
We are not alone in facing this problem. Several solutions exist to help ease the process (e.g. txt2re). The problems with these solutions are:
  • They always require leaving the IDE.
  • They usually don't help with understanding an existing RE, but rather only help create new ones.
So what do we suggest? We present you with the Regular Expression Wizard, a new approach to writing and maintaining Java regular expressions. This is a Java-based project that aims to help you write REs fluently using the Wizard Design Pattern.
How simple can it get? Let's write the RE from our previous example using the new wizard. Just create a wizard object, and then, using static methods, slowly build your own RE, followed by examples for testing.
    RE_Wizard re = new RE_Wizard();
    String dur = re.start().
            a_character_described_as(a_digit).exactly(4L).then().
            a_character_described_as(a_whitespace_character).once_or_more().then().
            a_fixed_string("DUR").then().
            a_character_described_as(a_whitespace_character).once_or_more().then().
            a_character_described_as(a_digit).
            for_example("1234 DUR 9").for_example("4423   DUR 1").the_end();

Here there is no need for step 1 (syntax recap), step 3 (using the syntax) or step 5 (writing a test). Note that if a stated example does not match the regular expression, an ExampleDoesNotMatchRegularExpression exception will be thrown. All you need to do is describe the RE in English and come up with some examples. The best part comes later on, when you need to change it: again, you do not need to deal with weird syntax. You only need to know English.
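To give a feel for how such a wizard can validate its examples, here is a deliberately tiny sketch of the idea. It is NOT the RE_Wizard code or its API: the class TinyRegexBuilder, its method names, and the use of IllegalArgumentException are all hypothetical, chosen only to illustrate a fluent builder that checks its examples when the expression is finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical mini-builder for illustration only; not the RE_Wizard API.
class TinyRegexBuilder {
    private final StringBuilder regex = new StringBuilder();
    private final List<String> examples = new ArrayList<>();

    // Exactly 'count' digits.
    TinyRegexBuilder digits(int count) {
        regex.append("\\d{").append(count).append('}');
        return this;
    }

    // One or more whitespace characters.
    TinyRegexBuilder whitespace() {
        regex.append("\\s+");
        return this;
    }

    // A fixed string, quoted so that special characters are taken literally.
    TinyRegexBuilder literal(String text) {
        regex.append(Pattern.quote(text));
        return this;
    }

    // Remember an example that the finished expression must match.
    TinyRegexBuilder forExample(String example) {
        examples.add(example);
        return this;
    }

    // Compile the expression and fail fast if any example does not match.
    String build() {
        String result = regex.toString();
        Pattern compiled = Pattern.compile(result);
        for (String example : examples) {
            if (!compiled.matcher(example).matches()) {
                throw new IllegalArgumentException(
                        "Example \"" + example + "\" does not match " + result);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String dur = new TinyRegexBuilder()
                .digits(4).whitespace().literal("DUR").whitespace().digits(1)
                .forExample("1234 DUR 9")
                .forExample("4423   DUR 1")
                .build();
        System.out.println(dur);
    }
}
```

The point of the design is that the examples travel with the expression, so a later change that breaks one of them fails immediately instead of in some distant test.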

Let us take another example. Mkyong wrote a post on "10 Java Regular Expression Examples You Should Know". We took the one for creating a regular expression for time in a 24-hour format. 

    //([01]?[0-9]|2[0-3]):[0-5][0-9]
    RE_Wizard re = new RE_Wizard();
    String timeRE = re.start().start_group().
            any_character_in("01").no_more_then(1L).then().
            any_character_in_the_range("0","9").
            or().
            a_fixed_string("2").then().
            any_character_in_the_range("0","3").then().
            close_group().
            then().
            a_fixed_string(":").then().
            any_character_in_the_range("0","5").then().
            any_character_in_the_range("0","9").then().
            for_example("06:58").
            for_example("6:45").
            for_example("23:12").
            the_end();
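For comparison, the raw expression from the comment at the top of the listing can be checked directly with java.util.regex. This is a small sketch under our own assumptions: the class name TimePatternSketch and the two invalid inputs are not taken from the original post.

```java
import java.util.regex.Pattern;

class TimePatternSketch {
    // 24-hour time: hours 0-23 (leading zero optional), minutes 00-59.
    static final Pattern TIME_24H =
            Pattern.compile("([01]?[0-9]|2[0-3]):[0-5][0-9]");

    static boolean isValidTime(String input) {
        return TIME_24H.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidTime("06:58")); // true
        System.out.println(isValidTime("6:45"));  // true
        System.out.println(isValidTime("23:12")); // true
        System.out.println(isValidTime("24:00")); // false: hour out of range
        System.out.println(isValidTime("12:60")); // false: minute out of range
    }
}
```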

So where can you get hold of this? The wizard code can be found at https://github.com/azarian/wizards. Use it, share it, send us feedback, and forget about losing time writing REs.

Disclaimers
  • We did not implement all of the Java regular expression syntax, mostly due to time limitations. If anyone wishes to contribute, it will be highly appreciated.
  • We do not include instructions on how to use the builder. We hope it is straightforward. If it is not, then we are missing the point, so please let us know.

Saturday, May 25, 2013

The Dev-QA Delicate Relationship

The success of your product is directly influenced by the ability of your QA and Dev teams to work well together. The two are even more tightly coupled in the agile world, where QA and Dev work and deliver as part of the same team. Symbiosis between QA and Dev will shorten delivery time, create a more robust product, and increase overall team member satisfaction.

Saying the above is obvious. However, failing to understand the relationship between QA and Dev will take your product and team in the opposite direction. There is a delicate relationship between the two, and a certain tension that must be confronted rather than overlooked. Most of you have probably felt it in your workplace. You hear a QA's question thrown into the air, followed by a smug reply that basically tells him he will never understand since he didn't write the code. Or the other way around: a developer asks a question about the product and the QA gives him a look that says "you really need to get out of your little world. There is a whole cosmos waiting for you..."

There are several symptoms/causes that can help you identify the level of tension in your workplace:

Domain knowledge is mostly in the QA hands

In this situation the developer works in a vacuum. He understands enough to accomplish his tasks, but not enough to make his code reusable. He cannot foresee new advances in the field of interest. He is like a blindfolded ox plowing down a long corridor.

Lack of respect

You all know it is there, and on both sides. "This feature was written with so many bugs, my grandma would have written it better", or maybe "How dare he open this bug? It just shows me how little he understands..." Each side digs its own trench while blaming the other side for every single problem the earth has ever encountered.

Over Testing

There seems to be a tendency to retest the entire product after each change (something that proper automated sanity tests, rather than manual checks, should make unnecessary). Checks are too strict. This slows down product improvement and frustrates developers.

Under Testing

Features are written under pressure and, as such, are tested under pressure. Not all edge cases are simulated. This may frustrate the QA side, since they are the ones who signed off on the feature.

Who's the Boss?

Developers sometimes see QA as their personal assistants. They might ask the QA to complete tasks that are not directly related to QA, mostly in order to save "expensive" developer time.

Who is to blame?

In places where QA is held responsible for product quality, every bug that shipped with the product has the potential to start a new fire. Who is to blame?

What can we do as managers to help reduce this tension?

  • Cross-functional teams. Put them in the same team and make the entire team responsible for the product. As we said before, this is already happening in the agile era.
  • Let them do each other's job. Let the QA do some Dev, in the form of writing scripts or anything else that will make them understand that bugs are inevitable. Let developers do some QA so they will understand the horror of saying: "I tested it and it is ready for shipment".
  • The layer of team managers should originate from Dev and QA both, thus giving the management a broader perspective.
  • Management must have excellent interpersonal relations and be aware of the tension, confronting it when necessary.

Sunday, May 12, 2013

Gambling in Software


I want to tell you about a meeting we had a few days ago. It reminded me of “The Jack Story” (which was part of an old stage routine of Danny Thomas many years ago).

Here’s how it goes:
A traveling salesman gets stuck one night on a lonely country road with a flat tire and no jack. He starts walking toward a gas station about a mile away, and as he walks, he talks to himself. "How much can he charge me for a jack?" he wonders. "Fifty dollars sounds reasonable. But it's the middle of the night, so maybe there's an after-hours fee. Probably another five dollars. Wait.... He'll probably figure I got no place else to go for the jack. Fifty dollars more."
He goes on walking and thinking, and the price and the anger keep rising. Finally, he gets to the gas station and is greeted cheerfully by the owner: "What can I do for you, sir?" But the salesman will have none of it. "You got the nerve to talk to me, you robber," he says. "You can take your stinkin' jack and shove it..."

The meeting was about a new feature requested by one of our customers. The feature was quite clear and we started talking about how we should implement it. At some point one of the participants claimed that if they need this feature they will surely need another related feature. A third guy immediately followed with: "if this is the case then we should also implement this feature…". This routine continued for a few rounds until everybody was convinced that the feature was too big and should be rejected.

It seems that, more often than one might think, we follow "The Jack Story" while building software. Fortunately, our story ends well. When we got back to the customer and explained to him why we must reject the feature, he stated that none of our guesses were true and that he really only needs the original request. This time we got lucky. No extra work was done and we did not lose any customers.

But it got me thinking. Did we do something wrong?

Now the typical agile practitioner would argue that we simply should not have added new requests to the original user story. Well... obviously my colleagues and I know this argument. We also know that the customer often does not fully understand what he really needs. Moreover (perhaps not in this case), any company sometimes needs to be ahead of the market instead of following it.
Actually, as software engineers we often do more than we are explicitly asked to (over doing). We enhance existing features. We build our code to be more generic and powerful than we currently need. We basically gamble on future needs, and I deliberately use the verb 'gamble' and not the verb 'guess' because there are definite rewards for good bets. Naturally, 'Over Doing' also relates to a person's character. Some will choose the 'Over Doing' approach more often than others. But everyone does it at some level.

Usually, wherever there is a gamble there are measures and statistics. They are needed in order to track our gamble and measure the profit. This is also the case, for example, in software estimation, which in essence also involves gambling. We continuously review our past estimations in order to improve our future ones. But this is not the case with 'Over Doing'. We never mark which part of our work is mandatory for now and which is a gamble on a future need. As a direct result, we never come back to check if we were right.

So I answered myself: No, we did not do anything wrong. We should continue to gamble on future needs. But we also must find a way to document and review our gambling. It will enable us to estimate the profit of our gamble and help us improve future ones, avoid over engineering where it is not needed, and insist on generic code where we see future opportunities.

 

Epilogue


A key tool for a manager is metrics. We already know that traditional metrics in software engineering often do little for a project's success. You can read about other metrics we suggested in Effective Unit Testing - Not All Code is Created Equal. In the agile era we are on a quest for new metrics: new things to measure which might help us navigate our project to a safe shore. This post tries to suggest one such alternative metric which might be useful.


Wednesday, March 13, 2013

The Inner Software Model and the End User


When I build software I always do it aligned with a model. The model evolves with the software and in many cases defines the boundaries of what can and can't be done (that is, without modifying or breaking it). A good model is one which is simple to understand yet powerful enough to allow the introduction of new features.

A good model makes me happy. If it was developed by me, it will be the first thing I show when presenting my work. If it is someone else's, it will be the first thing that makes me appreciate their work. Actually, I think so highly of a good model that I have made the mistake of asking my users to learn it too.

Users obviously view the world through their own eyes. Where you might recognize several use cases as the same one, your users might see them as completely different cases.
It seems I am not the only one taking this approach. Remember the first days of Android OS. One of the first things they were proud of was: "Everything is an application". Indeed, to a software engineer, the fact that every piece of functionality on top of the operating system is modeled as an application is simple yet powerful. But as a user I always shifted uncomfortably in my chair when pressing the applications button and finding the Phone application there. You see, as a user I have a phone device with phone-related functions, and I have the applications, which are an extension to the phone. Finding the phone icon and the contacts icon in the applications section confused me, especially in the early days of Android, when the phone application shortcut was permanent. iPhone OS took a different approach, in which some of the device functionality was presented to the user as OS features (e.g. Siri).

Another example is the JavaScript language and Object Oriented Paradigm. In this example the user is the JavaScript developer trying to use it as an Object Oriented language. Again you have a powerful simple model (everything is a function) which enables you to implement any Object Oriented principle. However, each concept requires a special usage of the model (Hint: want to define a class? use a function).
Java, on the other hand, takes a different approach. One example that comes to mind is the enums introduced in Java 5. Although one can easily implement an enum by hand (see http://www.javacamp.org/designPattern/enum.html), they still decided to include it in the language.
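The difference between the two approaches can be sketched in a few lines of Java. The names below (OldStyleColor, Color, EnumSketch) are our own and serve only as an illustration: the first class builds an enum out of the general model (classes and constants), while the second uses the dedicated language construct.

```java
// The general model: a hand-written "typesafe enum" built from a plain class.
final class OldStyleColor {
    static final OldStyleColor RED = new OldStyleColor("RED");
    static final OldStyleColor GREEN = new OldStyleColor("GREEN");

    private final String name;

    // Private constructor: no instances besides the constants above.
    private OldStyleColor(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return name;
    }
}

// The domain-oriented construct added in Java 5.
enum Color { RED, GREEN }

class EnumSketch {
    // Language-level enums can be used directly in a switch statement.
    static String describe(Color color) {
        switch (color) {
            case RED:
                return "stop";
            case GREEN:
                return "go";
            default:
                throw new AssertionError("unknown color: " + color);
        }
    }

    public static void main(String[] args) {
        System.out.println(OldStyleColor.RED);         // prints RED
        System.out.println(describe(Color.GREEN));     // prints go
        System.out.println(Color.valueOf("RED"));      // prints RED
    }
}
```

Both versions are type safe, but the built-in form speaks the user's language: the developer declares an enum and gets switch support and valueOf for free instead of assembling them from the underlying model.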

What is the correct approach? Taking the first approach, in which the model is generic and is also exposed to the user, is cheaper and easier to develop. Yet it will produce less friendly software. So I believe the key consideration here is: who is your user, and will he be able to learn and adapt?

Recently I have started to adopt a hybrid approach. I expose both the general powerful model to the advanced user and a simple domain oriented interface for the average user.

To sum things up, I highly recommend (especially to developers) paying attention to the difference between the users' point of view and the model. Moreover, to decide on the correct approach, consider both the nature of your users and your resources.

Monday, February 11, 2013

Solr and Lucene Fuzzy Search - A closer look

What is Fuzzy Search?
In a sentence, Fuzzy Search allows one to submit a query against an index and get results that are close to the desired query but do not necessarily match it exactly. Lucene (and as such Solr) offers a very effective way (from 4.0) of quickly evaluating such fuzzy queries.
Lucene Fuzzy Search
Lucene (and as such Solr) supports fuzzy searches based on the Levenshtein Distance, or Edit Distance, algorithm. To do a fuzzy search, use the tilde symbol, "~", at the end of a single-word term. For example, to search for a term similar in spelling to "roam" use the fuzzy search:
roam~
This search will find terms like foam and roams.
Starting with Lucene 1.9, an additional (optional) parameter can specify the required similarity. The value is between 0 and 1; the closer it is to 1, the higher the similarity a term needs in order to match. For example:
roam~0.8
The default that is used if the parameter is not given is 0.5.
Under the Hood
The Levenshtein distance between two words is the minimal number of insertions, deletions or substitutions needed to transform one word into the other. Then, given an arbitrary query, how does Lucene find all the terms in the index that are within the distance allowed by the required similarity? Well... this depends on the Solr/Lucene version you are using.
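To make the definition concrete, here is a minimal sketch of the classic dynamic-programming computation of the distance between two words. It is an illustration only and is not Lucene's implementation; the class name LevenshteinSketch is our own.

```java
class LevenshteinSketch {
    // Minimal number of insertions, deletions or substitutions
    // needed to turn 'a' into 'b', computed row by row.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of 'a' into b[0..j) takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning a[0..i) into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,       // insertion
                                 prev[j] + 1),          // deletion
                        prev[j - 1] + substitution);    // substitution or match
            }
            // The current row becomes the previous row of the next iteration.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("roam", "foam"));      // 1
        System.out.println(distance("roam", "roams"));     // 1
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```

Running such a computation against every term in a large index for every query is exactly the cost that makes the naive approach unattractive.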

You can take a look at the warning that appears in the Lucene 3.2.0 Javadoc:
Warning: this query is not very scalable with its default prefix length of 0 - in this case, *every* term will be enumerated and cause an edit score calculation.
Moreover, prior to the 4.0 release, Lucene computed this distance, for each query, against EACH term in the index. You really don't want to use this. So my advice to you is to upgrade - the sooner the better.

Lucene 4.0 took a very different approach to fuzzy search. The search now works with FuzzyQuery. The underlying implementation changed drastically in 4.0, which led to a significant improvement in complexity. The current implementation uses a Levenshtein automaton, based on the work of Klaus U. Schulz and Stoyan Mihov, "Fast string correction with Levenshtein automata". To make a very long story short, this paper shows how to recognize the set of all words V in an index whose Levenshtein distance from the query does not exceed a given distance d, which is exactly what one wants from Fuzzy Search. For a deeper look see here and here.
Conclusion
So from 4.0 onward one can use Fuzzy Search on very large indexes and feel comfortable about it. Of course there are other ways to look for similar values in a query, such as:

Tuesday, January 1, 2013

Executing a Command Line Executable From Java

In this post we'll deal with a common need of Java developers: executing and managing an external process from within Java. Since this task is quite common, we set out to find a Java library to help us accomplish it.
The requirements from such a library are:
  1. Execute the process asynchronously. 
  2. Ability to abort the process execution.
  3. Ability to wait for process completion.
  4. Notifications on process output.
  5. Ability to kill the process in case it hangs.
  6. Get the process exit code.
The native JDK does not help much. Fortunately, we have Apache Commons Exec. It makes things much easier, but is still not as straightforward as we had hoped, so we wrote a small wrapper on top of it.
Here is the method signature we expose:
public static Future<Long> runProcess(final CommandLine commandline, final ProcessExecutorHandler handler, final long watchdogTimeout) throws IOException;
  1. It returns a Future<Long>. This covers requirements 1, 2, 3 and 6.
  2. An instance of ProcessExecutorHandler is passed to the function. This instance is a listener for any process output. This covers requirement 4.
  3. Last but not least, you supply a timeout. If the process execution takes longer than this timeout, the process is assumed to have hung and is killed. In that case the exit code reported for the process will be -999.
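How a caller consumes such a Future<Long> can be sketched with the JDK alone. In the sketch below, fakeRunProcess is a hypothetical stand-in for runProcess that merely sleeps and returns a made-up exit code, so it says nothing about how the real wrapper reacts to cancellation; the class and method names are assumptions for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FutureContractSketch {
    // Hypothetical stand-in for runProcess: pretends to run a process that
    // takes runMillis to finish and then exits with exitCode.
    static Future<Long> fakeRunProcess(final long runMillis, final long exitCode) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Callable<Long> task = () -> {
            Thread.sleep(runMillis);
            return exitCode;
        };
        Future<Long> result = executor.submit(task);
        executor.shutdown(); // the already submitted task still runs
        return result;
    }

    // Requirements 3 and 6: block until completion and read the exit code.
    static long waitForExitCode(Future<Long> future) {
        try {
            return future.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        // Requirement 1: the call returns immediately, the work runs elsewhere.
        Future<Long> quick = fakeRunProcess(50, 0L);
        System.out.println("exit code: " + waitForExitCode(quick));

        // Requirement 2: abort a task we no longer want to wait for.
        Future<Long> slow = fakeRunProcess(60_000, 0L);
        slow.cancel(true);
        System.out.println("cancelled: " + slow.isCancelled());
    }
}
```

Returning a Future keeps the method signature small: waiting, polling, aborting and reading the result are all expressed through one standard JDK type.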
That's it! Here is the method implementation. Enjoy.


import org.apache.commons.exec.*;
import org.apache.commons.exec.Executor;
import java.io.IOException;
import java.util.concurrent.*;


public class ProcessExecutor {
    // Exit value reported when the watchdog had to kill the process.
    public static final Long WATCHDOG_EXIST_VALUE = -999L;

    public static Future<Long> runProcess(final CommandLine commandline, final ProcessExecutorHandler handler, final long watchdogTimeout) throws IOException{

        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Long> result =  executor.submit(new ProcessCallable(watchdogTimeout, handler, commandline));
        executor.shutdown();
        return result;


    }

    private static class ProcessCallable implements Callable<Long>{


        private long watchdogTimeout;
        private ProcessExecutorHandler handler;
        private CommandLine commandline;

        private ProcessCallable(long watchdogTimeout, ProcessExecutorHandler handler, CommandLine commandline) {
            this.watchdogTimeout = watchdogTimeout;
            this.handler = handler;
            this.commandline = commandline;
        }

        @Override
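        public Long call() throws Exception {
            Executor executor = new DefaultExecutor();
            // Make sure the child process is destroyed if the JVM shuts down.
            executor.setProcessDestroyer(new ShutdownHookProcessDestroyer());
            // The watchdog kills the process once it runs longer than the timeout.
            ExecuteWatchdog watchDog = new ExecuteWatchdog(watchdogTimeout);
            executor.setWatchdog(watchDog);
            // Forward stdout and stderr, line by line, to the handler.
            executor.setStreamHandler(new PumpStreamHandler(new MyLogOutputStream(handler, true), new MyLogOutputStream(handler, false)));
            Long exitValue;
            try {
                exitValue = Long.valueOf(executor.execute(commandline));
            } catch (ExecuteException e) {
                // Thrown when execution fails or the exit value counts as a failure.
                exitValue = Long.valueOf(e.getExitValue());
            }
            if (watchDog.killedProcess()) {
                // The process ran past the timeout and was killed by the watchdog.
                exitValue = WATCHDOG_EXIST_VALUE;
            }
            return exitValue;
        }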

    }

    // Passes every line the process writes on to the handler.
    private static class MyLogOutputStream extends LogOutputStream {

        private final ProcessExecutorHandler handler;
        // true: lines come from standard output; false: from standard error.
        private final boolean forwardToStandardOutput;

        private MyLogOutputStream(ProcessExecutorHandler handler, boolean forwardToStandardOutput) {
            this.handler = handler;
            this.forwardToStandardOutput = forwardToStandardOutput;
        }

        @Override
        protected void processLine(String line, int level) {
            if (forwardToStandardOutput) {
                handler.onStandardOutput(line);
            } else {
                handler.onStandardError(line);
            }
        }
    }


}
// ProcessExecutorHandler.java: listener notified of each line the process writes.
public interface ProcessExecutorHandler {
    public void onStandardOutput(String msg);
    public void onStandardError(String msg);

}