taT4Py | Recursively Search Regex Patterns

[UPDATE: 09/28/2014]

I have mainly used Python for parsing, validating and transforming text as needed. Had I done the same with a shell script, I would have ended up writing a variety of regular expressions.

Getting Started

Well, Python is no different: to cook up regular expressions, import the re module and get started.

import re

So far, I have been able to use the patterns largely the same way as I would with grep or sed. Usually, I end up writing multiple search patterns as the script evolves. In Python, I find it intuitive to keep them in a dictionary of compiled search patterns (RegexObject). Here is the one I wrote to style unified differences:

### Raw Strings Keep The Backslashes Intact For The Regex Engine
regexDict = {
    'HEADER': re.compile(r"^@@ ([+-][0-9]+(,[0-9]+)? ?){1,2} @@$"),
    'ADD':    re.compile(r"^\+"),
    'DEL':    re.compile(r"^-")
}
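
As a quick check, the patterns can be tried against a few sample lines. This is only a minimal sketch: the sample lines and the helper name classify are made up for illustration and are not part of the original script.

```python
import re

regexDict = {
    'HEADER': re.compile(r"^@@ ([+-][0-9]+(,[0-9]+)? ?){1,2} @@$"),
    'ADD':    re.compile(r"^\+"),
    'DEL':    re.compile(r"^-"),
}

def classify(line):
    """Return the keys whose pattern matches the line (hypothetical helper)."""
    return sorted(key for key, regex in regexDict.items() if regex.search(line))

# Made-up sample lines in unified diff style
samples = ["@@ -1,3 +1,4 @@", "+added line", "-removed line", " context line"]

for line in samples:
    print(repr(line), classify(line))
```

A context line (one that starts with a space) matches none of the patterns, which is the case the recursive function has to handle when it runs out of keys.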

Search Recursively

The dictionary above has only 3 key-value pairs, so writing an if-else construct would be easy. But let us say such a dictionary is created dynamically and can hold any number of key-value pairs.
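
For comparison, a hand-written construct for three fixed patterns might look like the sketch below. The function name fn__ifElseSearch and the returned labels are assumptions for illustration; the original if-else code is not shown in this post.

```python
import re

regexDict = {
    'HEADER': re.compile(r"^@@ ([+-][0-9]+(,[0-9]+)? ?){1,2} @@$"),
    'ADD':    re.compile(r"^\+"),
    'DEL':    re.compile(r"^-"),
}

def fn__ifElseSearch(data):
    """Hypothetical if-elif chain: one hard-coded branch per known pattern."""
    if regexDict['HEADER'].search(data):
        return 'HEADER'
    elif regexDict['ADD'].search(data):
        return 'ADD'
    elif regexDict['DEL'].search(data):
        return 'DEL'
    else:
        return None

for line in ["@@ -1,3 +1,4 @@", "+added line", " context line"]:
    print(fn__ifElseSearch(line), repr(line))
```

Every new pattern needs another branch here, which is why this shape stops being practical once the dictionary is built at run time.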

All you need to figure out is whether the data matches a particular search pattern or not. If it does, print the data, transform it, and so on. In this post, we will go one step further and redesign the if-else construct used to style unified differences, as follows.

with open(inputFile, 'r') as fileObj:

    ### Using Slice To Ignore First 2 Lines
    for line in fileObj.read().splitlines()[2:]:

        ### Pass A Fresh List Of Keys Each Time, The Function Consumes It
        fn__recurSearch(list(regexDict.keys()), line)

The for statement invokes the function with two arguments: a list of the dictionary keys (the parameter is named iterator, but it has to be a list, since the function calls pop(0) on it) and the line itself as a string. Let us look at the function definition:

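def fn__recurSearch(iterator, data):

    ### No Keys Left, None Of The Patterns Matched
    if not iterator:
        print(data)
        return

    ### Take The Next Key Off The Front Of The List
    key = iterator.pop(0)

    matchObj = regexDict[key].search(data)

    ### No Match, Try The Remaining Patterns
    if not matchObj:
        fn__recurSearch(iterator, data)
        return

    ### Matched The Pattern Stored Under key
    print(data)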

The function looks straightforward; however, it simply dumps the file as is to STDOUT, because both the matched and the unmatched branch print the line unchanged. If you have paid attention, you will have noticed that we have not added the logic to style unified differences! Well, that is an exercise left for you; otherwise I will try to cover it next time.
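
One way to make the result of the search usable by the caller is to return the matching key instead of printing. The sketch below is a variant under stated assumptions, not the code from this post: the name fn__recurMatch is hypothetical, and it slices the key list rather than calling pop(0), so the caller's list is left untouched.

```python
import re

regexDict = {
    'HEADER': re.compile(r"^@@ ([+-][0-9]+(,[0-9]+)? ?){1,2} @@$"),
    'ADD':    re.compile(r"^\+"),
    'DEL':    re.compile(r"^-"),
}

def fn__recurMatch(keys, data):
    """Return the first key in keys whose pattern matches data, else None."""
    if not keys:                      # ran out of patterns: nothing matched
        return None
    key = keys[0]
    if regexDict[key].search(data):   # this pattern matches: stop here
        return key
    return fn__recurMatch(keys[1:], data)   # otherwise try the rest

keys = list(regexDict)
for line in ["@@ -1,3 +1,4 @@", "+added line", " context line"]:
    print(fn__recurMatch(keys, line), repr(line))
```

Each pattern costs one stack frame, so Python's default recursion limit (about 1000 frames) caps how many patterns a dynamically built dictionary can hold with this approach.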

[UPDATE: 09/28/2014]

Let us add the logic to style unified differences, as follows:

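### trHTMLCode And fn__applyStyle Come From The Rest Of The Script (Not Shown Here)
def fn__recurSearch(iterator, data):

    global codeChunkList

    ### No Keys Left, None Of The Patterns Matched
    if not iterator:
        ##print(data)
        codeChunkList.append(fn__applyStyle(data))
        return

    key = iterator.pop(0)

    matchObj = regexDict[key].search(data)

    ### No Match, Try The Remaining Patterns
    if not matchObj:
        fn__recurSearch(iterator, data)
        return

    ### New Hunk Header, Print The Previous Chunk And Start Over
    if key == 'HEADER':
        if codeChunkList:
            print(trHTMLCode % '<br />\n'.join(codeChunkList))
            codeChunkList = []
        return

    ##print(data)
    codeChunkList.append(fn__applyStyle(data))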

Indeed, the recursive function keeps the logic short and simple. The output produced by the new design is the same as that of the old one.
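
To try the whole flow end to end, here is a self-contained sketch. Several parts are assumptions, since the post does not show them: the trHTMLCode template, the body of fn__applyStyle, the helper fn__flushChunk, the rowList that collects rows instead of printing them, and the sample diff lines are all hypothetical stand-ins. Note that a chunk is emitted only when the next header arrives, so the sketch flushes once more after the loop to emit the last hunk.

```python
import html
import re

regexDict = {
    'HEADER': re.compile(r"^@@ ([+-][0-9]+(,[0-9]+)? ?){1,2} @@$"),
    'ADD':    re.compile(r"^\+"),
    'DEL':    re.compile(r"^-"),
}

trHTMLCode = '<tr><td>%s</td></tr>'   # hypothetical row template
codeChunkList = []                    # styled lines of the current hunk
rowList = []                          # finished rows (collected, not printed)

def fn__applyStyle(data):
    """Hypothetical styling: colour a line by its leading character."""
    color = {'+': 'green', '-': 'red'}.get(data[:1], 'black')
    return '<span style="color: %s">%s</span>' % (color, html.escape(data))

def fn__flushChunk():
    """Emit the pending hunk, if any, as one table row."""
    global codeChunkList
    if codeChunkList:
        rowList.append(trHTMLCode % '<br />\n'.join(codeChunkList))
        codeChunkList = []

def fn__recurSearch(keys, data):
    if not keys:                      # no pattern matched: context line
        codeChunkList.append(fn__applyStyle(data))
        return
    key = keys.pop(0)
    if not regexDict[key].search(data):
        fn__recurSearch(keys, data)   # try the remaining patterns
        return
    if key == 'HEADER':               # new hunk: emit the previous one
        fn__flushChunk()
        return
    codeChunkList.append(fn__applyStyle(data))

# Made-up unified diff with two hunks
diffLines = [
    "--- a.txt", "+++ b.txt",
    "@@ -1,2 +1,2 @@", " same", "-old", "+new",
    "@@ -9 +9 @@", "-x < y", "+x > y",
]

for line in diffLines[2:]:            # skip the two file header lines
    fn__recurSearch(list(regexDict), line)
fn__flushChunk()                      # the last hunk has no following header

print('\n'.join(rowList))
```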

WAIT, there’s more..

How about styling context differences? This approach can have multiple applications, depending on your problem scenario. Give it a try, and feel free to share your thoughts.
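
As a possible starting point for that exercise, a pattern dictionary for context differences (the output of diff -c) might look like the sketch below. The key names, the helper fn__firstMatch and the sample lines are assumptions for illustration; only the line prefixes follow the context diff format, and the styling itself is still left to you.

```python
import re

contextRegexDict = {
    'HUNK':   re.compile(r"^\*{15}$"),                           # hunk separator
    'FROM':   re.compile(r"^\*\*\* [0-9]+(,[0-9]+)? \*\*\*\*$"), # old file range
    'TO':     re.compile(r"^--- [0-9]+(,[0-9]+)? ----$"),        # new file range
    'CHANGE': re.compile(r"^! "),
    'ADD':    re.compile(r"^\+ "),
    'DEL':    re.compile(r"^- "),
}

def fn__firstMatch(keys, data):
    """Recursively return the first key whose pattern matches, else None."""
    if not keys:
        return None
    if contextRegexDict[keys[0]].search(data):
        return keys[0]
    return fn__firstMatch(keys[1:], data)

# Made-up lines in context diff style
samples = ["*" * 15, "*** 1,3 ****", "! changed", "- removed",
           "--- 1,4 ----", "+ added", "  context"]

for line in samples:
    print(fn__firstMatch(list(contextRegexDict), line), repr(line))
```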
