grinning like a fool

Back in 2007 I wrote a small grep-replacement tool called grin. I wanted a quick tool for searching through source code. grep is a great tool, but it is geared towards a slightly different use case. Or rather 15 other use cases. It's a very flexible and general tool. I wanted something that was configured just for me, that my fingers could just type out without thinking and have it do what I meant it to.

The first thing on my list was recursive directory searching by default rather than reading from stdin. Next, I wanted to skip binary files. I am a Python programmer, so grepping my code for an identifier will usually show both the .py source file and its compiled .pyc file in the list.

The last major improvement was to exclude certain specially named directories from the recursion. This was in the bad old days when I was using SVN regularly. For those who don't remember or were lucky enough to skip non-distributed version control entirely, SVN makes a .svn/ directory in each of your directories. Under this directory, SVN stores a plaintext copy of each of your files. The default grep will recurse into this directory and search through all of these copies in addition to your real files. This made the default grep pretty useless for tasks like checking whether I had successfully changed all uses of a renamed method: the old unmodified versions were helpfully kept there by SVN and showed up as matches.
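The recursion rules described so far can be sketched in a few lines of Python. This is only an illustration of the idea, not grin's actual code: the function name `iter_source_files` and the default sets `SKIP_DIRS` and `SKIP_EXTS` are hypothetical, and grin's real defaults may differ.

```python
import os

# Hypothetical defaults for illustration; grin's real defaults may differ.
SKIP_DIRS = {".svn", ".git", ".hg"}
SKIP_EXTS = {".pyc", ".pyo", ".so", ".o"}


def iter_source_files(root, skip_dirs=SKIP_DIRS, skip_exts=SKIP_EXTS):
    """Yield file paths under root, pruning skipped directories and extensions."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Assigning to dirnames[:] in place tells os.walk not to descend
        # into the directories that were filtered out.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_dirs)
        for name in sorted(filenames):
            if os.path.splitext(name)[1] in skip_exts:
                continue
            yield os.path.join(dirpath, name)
```

Pruning `dirnames` in place is what keeps the walk from ever entering a `.svn/` directory, so the stale copies stored there are never opened.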

I also wanted grin to be a useful set of components that could be used as a library for other searching tasks. The first thing I wrote was a small class to recurse through directories applying the configured rules. It classifies files into groups like text, binary, and symbolic link, which the rest of grin uses to decide whether each file should be searched or skipped. I quickly realized that I could easily repurpose this component for finding files instead of searching through them. I exposed this as the grind script, which applies the same recursion rules and options as grin but matches a glob pattern against the filename and prints the matching paths, similar to find.
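A rough sketch of that classify-then-match idea follows. The names `classify` and `find_files` are made up for this example, and the NUL-byte check is an assumed heuristic for spotting binary files, not necessarily the test grin uses.

```python
import fnmatch
import os


def classify(path, sniff_bytes=4096):
    """Return 'link', 'binary', or 'text' for a path (a rough heuristic)."""
    if os.path.islink(path):
        return "link"
    with open(path, "rb") as f:
        chunk = f.read(sniff_bytes)
    # Assumption: a NUL byte in the first chunk marks the file as binary.
    return "binary" if b"\0" in chunk else "text"


def find_files(root, pattern="*", kinds=("text",)):
    """Yield paths under root whose basename matches the glob pattern
    and whose classification is one of the requested kinds."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if fnmatch.fnmatch(name, pattern) and classify(path) in kinds:
                yield path
```

The same classifier serves both jobs: a searcher would open only the files reported as text, while a finder like the one above just prints the paths that pass the filter.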

One of the neater things I could do with this flexibility was to preprocess files before passing them through the core regex search. I wrote a script that configures the file recognizer to pass through only Python files. It tokenizes the Python sources and classifies each token as a comment, a string, or other Python code. It then reassembles a text stream in which only the requested token types are present and the others are replaced in place with spaces. This lets you find, for example, all mentions of an identifier in actual code but not in the surrounding comments or strings.
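The masking step can be approximated with the standard library's `tokenize` module. The sketch below is not the script's real implementation: `kind_of`, `keep_only`, and `search` are hypothetical names, and on Python 3.12 and later the pieces of an f-string arrive as separate token types that this simplified classifier treats as code.

```python
import io
import re
import tokenize


def kind_of(tok):
    """Classify a token as 'comment', 'string', or 'code'."""
    if tok.type == tokenize.COMMENT:
        return "comment"
    if tok.type == tokenize.STRING:
        return "string"
    return "code"


def keep_only(source, wanted):
    """Return source with every token whose kind is not in `wanted`
    replaced by spaces; line lengths and line endings are preserved."""
    lines = io.StringIO(source).readlines()
    bodies = [line.rstrip("\r\n") for line in lines]
    # Start from fully blanked lines, then copy the wanted tokens back in.
    out = [list(" " * len(body)) for body in bodies]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Skip unwanted kinds and whitespace-only tokens (NEWLINE, INDENT, ...).
        if kind_of(tok) not in wanted or not tok.string.strip():
            continue
        (srow, scol), (erow, ecol) = tok.start, tok.end
        for row in range(srow, erow + 1):  # a token may span several lines
            body = bodies[row - 1]
            lo = scol if row == srow else 0
            hi = ecol if row == erow else len(body)
            out[row - 1][lo:hi] = body[lo:hi]
    return "".join(
        "".join(chars) + line[len(body):]
        for chars, body, line in zip(out, bodies, lines)
    )


def search(source, pattern, wanted):
    """Return (line number, original line) pairs whose masked text matches."""
    regex = re.compile(pattern)
    masked = keep_only(source, wanted).split("\n")
    original = source.split("\n")
    return [
        (num, orig)
        for num, (mask, orig) in enumerate(zip(masked, original), 1)
        if regex.search(mask)
    ]
```

Because the masked text keeps every character at its original line and column, the regex runs on the masked copy while the reported line is taken from the untouched source.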

This is an example of looking for the string "grep" in the grin source code, restricting the results to just the Python code and not the strings or comments:

$ -i --python-code grep
  187 : class GrepText(object):
  291 :     def do_grep(self, fp):
  321 :             (block_line_count, block_context) = self.do_grep_block(block,
  342 :     def do_grep_block(self, block, line_num_offset):
  476 :     def grep_a_file(self, filename, opener=open):
  501 :             unique_context = self.do_grep(f)
 1028 :         g = GrepText(regex, args)
 1031 :             report = g.grep_a_file(filename, opener=openers[kind])

We can see a single, lonely comment:

$ -i --comments grep
  985 :     # something we want to grep.

But several places in the docstrings:

$ -i --strings grep
  188 :     """ Grep a single file for a regex by iterating over the lines in a file.
  292 :         """ Do a full grep.
  343 :         """ Grep a single block of file content.
  420 :             The name of the file being grepped, if one exists. If not provided,
  477 :         """ Grep a single file that actually exists on the file system.
  491 :             The grep results as text.
  634 :                 It should should be grepped for the pattern and the matching
  637 :                 The file is binary and should be either ignored or grepped
  641 :                 The file is gzip-compressed and should be grepped while
  924 :     """ Generate the filenames to grep.

In short, grin is grep configured the way I like it. It's not a speed demon compared to grep, it's not as general as grep, but it doesn't make me think, a desirable quality in a tool that I reach for dozens of times a day.
