taT4Py | Recursively Search Regex Patterns

[UPDATE: 09/28/2014]

I have mainly used Python for parsing, validating and transforming text. Had I done the same work in a shell script, I would have ended up writing a variety of regular expressions.

Getting Started

Well, Python is no different: to cook up regular expressions, one must import the re module and get started.

import re

So far, I have been able to use the patterns exactly the same way as I would with grep or sed. Usually, I end up writing multiple search patterns as the script evolves. In Python, I find it intuitive to keep them in a dictionary of compiled search patterns (RegexObject). Here is the one I wrote to style unified differences.

regexDict = {
    'HEADER': re.compile(r"^@@ ([+-][0-9]+(,[0-9]+)? ?){1,2} @@$"),
    'ADD':    re.compile(r"^\+"),
    'DEL':    re.compile(r"^-")
}
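To see what the dictionary gives you, here is a minimal sketch (written for Python 3) that reports which key matches a given line. The helper name classify and the sample lines are my own, for illustration only.

```python
import re

# Same three patterns as above, written as raw strings.
regexDict = {
    'HEADER': re.compile(r"^@@ ([+-][0-9]+(,[0-9]+)? ?){1,2} @@$"),
    'ADD':    re.compile(r"^\+"),
    'DEL':    re.compile(r"^-"),
}

def classify(line):
    """Return the first key whose pattern matches line, or None (hypothetical helper)."""
    for key, pattern in regexDict.items():
        if pattern.search(line):
            return key
    return None

for sample in ["@@ -1,6 +1,9 @@", "+typeset -i sum=0", "-echo", " while read num"]:
    print(classify(sample), repr(sample))
```

Note that the "---" and "+++" file header lines of a unified diff would also match DEL and ADD, which is one reason to skip the first two lines of the diff before classifying.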

Search Recursively

Looking at the above dictionary, there are only 3 key-value pairs, so writing an if-else construct would be easy. But let us say such a dictionary is created dynamically and can have any number of key-value pairs.

All you need to figure out is whether the data matches a particular search pattern or not. If it does, print the data, or transform it, and so on. In this post, we will go one step further and redesign the if-else construct used to style unified differences, as follows.

with open(inputFile, 'r') as fileObj:

    ### Using Slice To Ignore First 2 Lines
    for line in fileObj.read().splitlines()[2:]:

        ### Pass A Fresh List Of Keys, Since The Function Pops From It
        fn__recurSearch(list(regexDict.keys()), line)

The for-statement invokes the function with two arguments: the first is a list of dictionary keys (a new list for every line, because the function consumes it) and the second is a string. Let us look at the function definition below.

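def fn__recurSearch(iterator, data):

    ### No Keys Left: None Of The Patterns Matched
    if not iterator:
        print(data)
        return

    ### Take The Next Key Off The Front Of The List
    key = iterator.pop(0)

    matchObj = regexDict[key].search(data)

    ### No Match: Try The Remaining Keys
    if not matchObj:
        fn__recurSearch(iterator, data)
        return

    print(data)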

The function looks straightforward; however, it simply dumps the file as-is to STDOUT, because both the matching and the non-matching branch print the data. If you have paid attention, you will have noticed that we have not yet added the logic to style unified differences! Well, that is an exercise left for you; otherwise I will try to cover it next time.
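The recursion above is really a "find the first matching key" search. As a rough sketch (Python 3, with hypothetical names patterns, first_match and first_match_loop that are not part of the original script), here is a variant that returns the matching key instead of printing, next to an equivalent loop.

```python
import re

# A reduced pattern dictionary, for illustration only.
patterns = {
    'ADD': re.compile(r"^\+"),
    'DEL': re.compile(r"^-"),
}

def first_match(keys, data):
    """Recursive search: return the first key whose pattern matches data, else None."""
    if not keys:
        return None
    head, rest = keys[0], keys[1:]   # slicing leaves the caller's list untouched
    if patterns[head].search(data):
        return head
    return first_match(rest, data)

def first_match_loop(keys, data):
    """Loop-based equivalent of first_match."""
    return next((k for k in keys if patterns[k].search(data)), None)

keys = list(patterns)
for line in ["+added", "-removed", " context"]:
    print(first_match(keys, line), first_match_loop(keys, line))
```

Returning the key, rather than printing inside the function, leaves the decision about what to do with a match to the caller.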

[UPDATE: 09/28/2014]

Let us add the logic to style unified differences, as follows.

def fn__recurSearch(iterator, data):

    global codeChunkList

    ### No Keys Left: A Context Line, Collect It
    if not iterator:
        codeChunkList.append(fn__applyStyle(data))
        return

    key = iterator.pop(0)

    matchObj = regexDict[key].search(data)

    ### No Match: Try The Remaining Keys
    if not matchObj:
        fn__recurSearch(iterator, data)
        return

    ### A Header Starts A New Hunk: Print The Collected Chunk As One Row
    if key == 'HEADER':
        if codeChunkList:
            print(trHTMLCode % '<br />\n'.join(codeChunkList))
            codeChunkList = []
        return

    ### An Added Or Deleted Line, Collect It
    codeChunkList.append(fn__applyStyle(data))

Indeed, the recursive function keeps the logic short and simple. The output produced by the new design is the same as that of the old one.
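The function relies on fn__applyStyle, trHTMLCode and codeChunkList, which are not shown in this post. Below is a self-contained sketch in Python 3 with guessed stand-ins for them (apply_style, the trHTMLCode template, the STYLE table and style_unified are all my assumptions). It uses a plain loop rather than the recursion. Since a chunk is only emitted when the next header arrives, the sketch also flushes the last chunk after the loop, which the real script presumably does somewhere as well.

```python
import re

regexDict = {
    'HEADER': re.compile(r"^@@ ([+-][0-9]+(,[0-9]+)? ?){1,2} @@$"),
    'ADD':    re.compile(r"^\+"),
    'DEL':    re.compile(r"^-"),
}

# Assumed helpers: the post does not show these, so the definitions are guesses.
trHTMLCode = "<tr>\n  <td>\n%s\n  </td>\n</tr>"
STYLE = {'ADD': 'green', 'DEL': 'red'}

def apply_style(line):
    """Wrap added/deleted lines in a coloured span and drop the one-character marker."""
    for key, colour in STYLE.items():
        if regexDict[key].search(line):
            return "<span style='color: %s'>%s</span>" % (colour, line[1:])
    return line[1:]  # context line: drop the leading space

def style_unified(diff_lines):
    """Return one <tr> block per hunk (the two file header lines already removed)."""
    rows, chunk = [], []
    for line in diff_lines:
        if regexDict['HEADER'].search(line):
            if chunk:
                rows.append(trHTMLCode % '<br />\n'.join(chunk))
                chunk = []
            continue
        chunk.append(apply_style(line))
    if chunk:  # flush the last hunk, which no header follows
        rows.append(trHTMLCode % '<br />\n'.join(chunk))
    return rows

sample = [
    "@@ -1,3 +1,4 @@",
    "+typeset -i sum=0",
    " while read num",
    "-echo",
    "@@ -9,2 +12,2 @@",
    " 5",
    "+echo done",
]
rows = style_unified(sample)
print("\n".join(rows))
```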

WAIT, there’s more..

How about styling context differences? This approach can have multiple applications, depending on your problem scenario. Give it a try, and feel free to share your thoughts.


code4Py | Style Unified Differences

As described on a recently created page, the following diff command output, representing unified differences, needed to be styled:

[vagrant@localhost python]$ diff -u A B
--- A   2014-08-20 20:13:30.315009258 +0000
+++ B   2014-08-20 20:13:39.021009349 +0000
@@ -1,6 +1,9 @@
+typeset -i sum=0
+
 while read num
 do
   printf "%d " ${num}
+  sum=sum+${num}
 done <<EOF
 1
 2
@@ -9,5 +12,4 @@
 5
 EOF

-echo
-
+echo; echo "Sum: ${sum}"
[vagrant@localhost python]$

Source Code (GitHub Gist)
I have written a Python script that generates the following HTML output.

<tr>
  <td>
    <span style='color: green'>typeset -i sum=0</span><br />
    <span style='color: green'></span><br />
    while read num<br />
    do<br />
      printf "%d " ${num}<br />
    <span style='color: green'>  sum=sum+${num}</span><br />
    done <<EOF<br />
    1<br />
    2
  </td>
</tr>
<tr>
  <td>
    5<br />
    EOF<br />
    <br />
    <span style='color: red'>echo </span><br />
    <span style='color: red'></span><br />
    <span style='color: green'>echo; echo "Sum: ${sum}"</span>
  </td>
</tr>

Style Output
This output tabulates the differences in N rows and a single column. Once embedded in the table element of an HTML document, it can be rendered by a web browser according to the CSS properties (if defined).
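As a small sketch of that embedding step (Python 3; to_table and safe_cell_text are hypothetical helpers, not taken from the gist), the rows can be joined inside a table element. The sketch also escapes the diff text first, since a line such as done <<EOF contains characters that a browser could otherwise read as markup.

```python
import html

def to_table(rows):
    """Join pre-built <tr> blocks into a minimal HTML table (hypothetical helper)."""
    return "<table>\n%s\n</table>" % "\n".join(rows)

def safe_cell_text(line):
    """Escape '<', '>' and '&' so that diff text is shown literally."""
    return html.escape(line, quote=False)

row = "<tr>\n  <td>\n    %s\n  </td>\n</tr>" % safe_cell_text("done <<EOF")
print(to_table([row]))
```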

code4Py | Style Context Differences

As described on a recently created page, the following diff command output, representing context differences, needed to be styled:

[vagrant@localhost python]$ diff -c A B
*** A   2014-08-20 20:13:30.315009258 +0000
--- B   2014-08-20 20:13:39.021009349 +0000
***************
*** 1,6 ****
--- 1,9 ----
+ typeset -i sum=0
+
  while read num
  do
    printf "%d " ${num}
+   sum=sum+${num}
  done <<EOF
  1
  2
***************
*** 9,13 ****
  5
  EOF

! echo
!
--- 12,15 ----
  5
  EOF

! echo; echo "Sum: ${sum}"
[vagrant@localhost python]$

Source Code (GitHub Gist)
I have written a Python script that generates the following HTML output.

<tr>
  <td>

  </td>
  <td>
    <span style='color: green'> typeset -i sum=0</span><br />
    <span style='color: green'> </span><br />
     while read num<br />
     do<br />
       printf "%d " ${num}<br />
    <span style='color: green'>   sum=sum+${num}</span><br />
     done <<EOF<br />
     1<br />
     2
  </td>
</tr>
<tr>
  <td>
     5<br />
     EOF<br />
     <br />
    <span style='color: blue'> echo </span><br />
    <span style='color: blue'> </span>
  </td>
  <td>
     5<br />
     EOF<br />
     <br />
    <span style='color: blue'> echo; echo "Sum: ${sum}"</span>
  </td>
</tr>

Style Output
This output tabulates the differences in N rows and 2 columns. Once embedded in the table element of an HTML document, it can be rendered by a web browser according to the CSS properties (if defined).
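The two-column layout follows from how a context diff is organised: each hunk has an old section (after the "*** n,m ****" line) and a new section (after the "--- n,m ----" line). The sketch below (Python 3) is my own rough take on splitting hunks into the two columns, not the code from the gist. The pattern names, the colour table (red for deletions is a guess) and split_hunks are assumptions, and unlike the output above it drops both marker characters.

```python
import re

HUNK_SEP  = re.compile(r"^\*{15}$")
OLD_RANGE = re.compile(r"^\*\*\* [0-9]+(,[0-9]+)? \*\*\*\*$")
NEW_RANGE = re.compile(r"^--- [0-9]+(,[0-9]+)? ----$")
COLOURS = {'+': 'green', '-': 'red', '!': 'blue'}

def style(line):
    """Colour a hunk line by its marker and drop the two-character prefix."""
    marker, text = line[:1], line[2:]
    colour = COLOURS.get(marker)
    if colour is None:
        return text
    return "<span style='color: %s'>%s</span>" % (colour, text)

def split_hunks(lines):
    """Return a list of (old_column, new_column) pairs, one pair per hunk."""
    hunks, side = [], None
    for line in lines:
        if HUNK_SEP.match(line):
            hunks.append(([], []))
            side = None
        elif OLD_RANGE.match(line):
            side = 0
        elif NEW_RANGE.match(line):
            side = 1
        elif side is not None and hunks:
            hunks[-1][side].append(style(line))
    return hunks

sample = [
    "***************",
    "*** 1,2 ****",
    "--- 1,3 ----",
    "+ typeset -i sum=0",
    "  while read num",
    "  do",
    "***************",
    "*** 9,10 ****",
    "  EOF",
    "! echo",
    "--- 12,13 ----",
    "  EOF",
    '! echo; echo "Sum: ${sum}"',
]
hunks = split_hunks(sample)
for old, new in hunks:
    print(old, new)
```

Each pair can then be rendered as one table row with two cells, the old column on the left and the new column on the right.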

taT4Py | Convert AutoSys Job Attributes into Python Dictionary

[UPDATE: 09/28/2014]

If you ever look at the definition of a specific AutoSys job, you will find that it contains attribute-value pairs, one per line, delimited by a colon ':'. I thought it would be cool to parse the job definition by creating a Python dictionary from the attribute-value pairs.

Let us take a look at a sample job definition:

$> cat sample_jil
insert_job: A0001
command: echo "Hi"
condition: s(B0001, 03\:00) & v(SRVR) = "UP"
std_out_file: >/home/nvarun/outfile
std_err_file: >/home/nvarun/errfile
group: NV
$>

Getting Started

To convert this into a Python dictionary, I originally ran the following command (it can be ignored; see the update below):

$> sed "s/^\([^:]*\):\(.*\)$/'\1':'\2'/" sample_jil > sample_pydict
$> cat sample_pydict
'insert_job':' A0001'
'command':' echo "Hi"'
'condition':' s(B0001, 03\:00) & v(SRVR) = "UP"'
'std_out_file':' >/home/nvarun/outfile'
'std_err_file':' >/home/nvarun/errfile'
'group':' NV'
$>

We are halfway through. To complete the conversion, write the following steps in a Python script to populate the dictionary.

jobDefn = {}
with open('sample_pydict', 'r') as f:
    for line in f.read().splitlines():
        colon = line.find(':')     ## position of the first colon only
        key = line[:colon]         ## line[:colon].replace("'", "") would drop the quotes
        val = line[colon+1:].strip()
        jobDefn[key] = val
print(jobDefn)

[UPDATE: 09/28/2014]

As per Antonio’s comments, one can simplify the code as follows and skip the sed step altogether:

jobDefn = {}
with open('sample_jil', 'r') as f:    ## read the job definition directly, no sed needed
    for line in f:
        key, val = line.split(':')    ## fails if the value itself contains ':'
        jobDefn[key] = val.strip()
print(jobDefn)

However, a value might itself contain ':' (the condition attribute in the sample does). In that case, invoking split() as above throws ValueError: too many values to unpack, and you should switch back to the earlier solution.
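One way to keep the short form and still cope with colons inside values is to split on the first colon only. Here is a minimal sketch in Python 3 using str.partition (the function name parse_jil and the in-memory sample list are my own; reading from a file would work the same way).

```python
def parse_jil(lines):
    """Build a dict from 'attribute: value' lines, splitting on the first colon only."""
    job = {}
    for line in lines:
        key, sep, val = line.partition(':')
        if not sep:  # skip lines without a colon, such as blank lines
            continue
        job[key.strip()] = val.strip()
    return job

sample = [
    'insert_job: A0001',
    'command: echo "Hi"',
    r'condition: s(B0001, 03\:00) & v(SRVR) = "UP"',
    'std_out_file: >/home/nvarun/outfile',
]
jobDefn = parse_jil(sample)
print(jobDefn['condition'])
```

line.split(':', 1) would do much the same, though it still raises ValueError on a line that has no colon at all.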

Summary

  1. Using f.read() reads the input file in one go, and invoking splitlines() splits the input into a list of lines for the for-statement to iterate over.
  2. The for-statement iterates over each line. The first version finds the position of the first colon and extracts key and value by slicing; the updated version invokes split() to determine key and value.
  3. At the end of the loop, the dictionary object jobDefn is printed.

Hope this helps.

taT4Py | Extract words from Input String and Operate Functions

Working on AIX servers with limited grep features sometimes makes particular scenarios difficult. For instance, I want to split the lines (read from STDIN or a file) into words, precisely. However, without the grep -o option, I am clueless about how to get the desired results. For the past few months, I have been investing time in learning Python and using its features to complement the text processing in the shell scripts I often write for automating tasks and creating productivity tools.

Code Snippet #1

import sys, re
for line in sys.stdin.readlines():
    listofwords = [word for word in re.split(r'\W', line) if word]
    print(listofwords)

Looking at the above code snippet, four lines of code did the trick. The features and constructs Python provides for scenarios like this make it look really cool. Let us understand it quickly, before we operate functions on those words.

  1. Imports the sys module for environment-related capabilities and the re module for regular expression capabilities.
  2. sys.stdin.readlines() reads the input from STDIN until EOF is reached.
    • After that, the for loop iterates over the input lines, one by one.
  3. The construct on the right-hand side of the assignment operator is a list comprehension, which here does the same job as the built-in function [1], filter().
    • re.split() generates [2] a list of words, based on the pattern given as the first argument.
    • The for clause iterates over the generated list, and the if clause does the filtering.
    • This makes sure there are no empty strings in the resulting list.
    • filter(None, re.split('\W', line)) can be used as an alternative; it drops empty strings by default, as they evaluate to false.
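The equivalence described in the steps above can be checked with a short sketch (Python 3, on a sample line of my own). It also shows re.findall() with the pattern \w+, which is not used in this post but is probably the closest match to what grep -o would give.

```python
import re

line = "Hi there, how are you?\n"

by_comprehension = [word for word in re.split(r'\W', line) if word]
by_filter = list(filter(None, re.split(r'\W', line)))  # list() is needed on Python 3
by_findall = re.findall(r'\w+', line)                  # collects the words directly

print(by_comprehension)
print(by_comprehension == by_filter == by_findall)
```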

Code Snippet #2

What if the line contains the word “Hi” and I want to replace all occurrences of “Hi” with “Hey” while the list is being generated using the above approach? To make it possible, the code snippet below invokes the string method replace() on each word.

import sys, re
for line in sys.stdin.readlines():
    listofwords = [word.replace('Hi', 'Hey') for word in re.split(r'\W', line) if word]
    print(listofwords)

This might look like a not-so-useful variant; however, it gives one the idea to explore further and try various possibilities. Hope this helps.

In case you feel any correction is required, let me know and I will make it.
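One possibility worth trying: replace() works on substrings, so a word that merely contains "Hi" is changed too. The sketch below (Python 3; replace_word and the sample sentence are my own) contrasts that with replacing whole words only.

```python
import re

def replace_word(words, old, new):
    """Replace only the words equal to old; other words are left alone."""
    return [new if word == old else word for word in words]

words = [w for w in re.split(r'\W', "Hi Hilda, say Hi") if w]

substring_based = [w.replace('Hi', 'Hey') for w in words]
whole_word = replace_word(words, 'Hi', 'Hey')

print(substring_based)
print(whole_word)
```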

References

[1] http://docs.python.org/2/library/functions.html#filter
[2] http://docs.python.org/2/library/re.html#re.split