taT4Py | Extract words from Input String and Operate Functions

Working on AIX servers, where grep has a limited feature set, can make some tasks difficult. For instance, I want to split lines (read from STDIN or a file) into individual words. Without the grep -o option, however, I was clueless about how to get the desired result. For the past few months I have been investing time in learning Python and using its features to complement the text-processing parts of the shell scripts I often write to automate tasks and build productivity tools.

Code Snippet #1

import sys, re
for line in sys.stdin.readlines():
    # Split on non-word characters and drop the empty strings left behind
    listofwords = [word for word in re.split(r'\W', line) if word]
    print(listofwords)

Four lines of code did the trick. The features and constructs Python provides make a task like this look really simple. Let us walk through the snippet quickly before operating functions on those words.

  1. The import statement brings in the sys module for environment-related capabilities and the re module for regular expressions.
  2. sys.stdin.readlines() reads input from STDIN until EOF is reached.
    • The for loop then iterates over the input lines, one by one.
  3. The construct on the right-hand side of the assignment operator is a list comprehension, which here does the same job as the built-in function filter() [1].
    • re.split() generates a list of words [2], splitting the line on the pattern given as its first argument.
    • The for clause iterates over the generated list and the if clause does the filtering.
    • This makes sure there are no empty strings in the resulting list.
    • filter(None, re.split(r'\W', line)) can be used as an alternative; it takes care of empty strings by default, because they evaluate to false.
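
The equivalence described in the list above can be checked with a short sketch. The sample line and the variable names are illustrative assumptions, not part of the original script. filter() is wrapped in list() so that the comparison also holds on Python 3, where filter() returns an iterator. re.findall() with r'\w+' is shown as a further alternative that collects the words directly.

```python
import re

# Hypothetical sample line, chosen only for illustration.
line = "Hi there, AIX-server: grep -o is missing!\n"

# List comprehension from Code Snippet #1: split on non-word characters,
# then drop the empty strings left between adjacent separators.
by_comprehension = [word for word in re.split(r'\W', line) if word]

# filter(None, ...) drops every falsy item, which here means empty strings.
by_filter = list(filter(None, re.split(r'\W', line)))

# re.findall gathers runs of word characters directly, so no filtering is needed.
by_findall = re.findall(r'\w+', line)

print(by_comprehension)
```

All three variables should hold the same list of words for this input, so which form to use is mostly a matter of taste.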

Code Snippet #2

What if the line contains the word “Hi” and I want to replace all occurrences of “Hi” with “Hey” while the list is being generated? To make that possible, the code snippet below imports the additional module string and invokes string.replace() on each word. Note that string.replace() exists only in Python 2; the string method word.replace('Hi', 'Hey') does the same and also works on Python 3.

import sys, re, string
for line in sys.stdin.readlines():
    # Replace 'Hi' with 'Hey' in every non-empty word (string.replace is Python 2 only)
    listofwords = [string.replace(word, 'Hi', 'Hey') for word in re.split(r'\W', line) if word]
    print(listofwords)
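
The same pattern extends to any function that takes one word and returns a value. The following is a minimal sketch, not the original code: the helper name apply_to_words and the sample line are hypothetical, and it uses the str.replace method rather than string.replace() so that it also runs on Python 3.

```python
import re

def apply_to_words(line, func):
    """Split line into words and return func applied to each word.

    apply_to_words is a hypothetical helper introduced for illustration.
    """
    return [func(word) for word in re.split(r'\W', line) if word]

# Hypothetical sample line.
line = "Hi all, Hi again\n"

# Replace 'Hi' with 'Hey' inside each word, as in Code Snippet #2.
replaced = apply_to_words(line, lambda word: word.replace('Hi', 'Hey'))

# Any other one-argument function can be passed instead, e.g. str.upper.
shouted = apply_to_words(line, str.upper)

print(replaced)
print(shouted)
```

Keep in mind that replace() works on substrings, so a word such as "Hint" would be changed as well; comparing the whole word against "Hi" avoids that if only exact matches should be replaced.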

This might look like a not-so-useful variant; however, it gives an idea of how to explore further and try various possibilities yourself. Hope this helps.

In case you feel any correction is required, let me know and I will do the needful.


[1] http://docs.python.org/2/library/functions.html#filter
[2] http://docs.python.org/2/library/re.html#re.split

