Thursday, October 16, 2008

Verilog RTL Decommenter

We're transferring a bit of soft IP to a customer, and decided to remove all the comments from the RTL files. Our IP is protected by an NDA, so we decided against obfuscation as we felt this may cause unnecessary hassle if we're asked to help debug the IP integration. We did decide to remove comments so that any stray profanity, "FIXME"s and "This is an ugly, ugly hack but..."s are not presented to the customer. It was also an opportunity to include a copyright header to the RTL file, too.

It fell to me to script the removal of the comments. Being a bit of a python fan, I went searching for some pythonic regexp-based comment remover. I found a C decommenter here, but it needed a few modifications to work with verilog comments which I present below.

#! /usr/bin/env python

# remove_comments.py
import re

def remove_comments(text):
""" remove c-style comments.
text: blob of text with comments (can include newlines)
returns: text with comments removed
"""

pattern = r"""
## --------- COMMENT ---------
/\* ## Start of /* ... */ comment
[^*]*\*+ ## Non-* followed by 1-or-more *'s
( ## group 1
[^/*][^*]*\*+ ##
)* ## 0-or-more things which don't start with /
## but do end with '*'
/ ## End of /* ... */ comment
| ## -OR-
//[^\n]* ## // comment to end of line
| ## -OR- various things which aren't comments:
( ## group 2
## ------ " ... " STRING ------
" ## Start of " ... " string
( ##
\\. ## Escaped char
| ## -OR-
[^"\\] ## Non "\ characters
)* ##
" ## End of " ... " string
| ## -OR-
##
## ------ ANYTHING ELSE -------
. ## Anything other char
[^/"'\\]* ## Chars which doesn't start a comment, string
) ## or escape
"""

regex = re.compile(pattern, re.VERBOSE|re.MULTILINE|re.DOTALL)
noncomments = [m.group(2) for m in regex.finditer(text) if m.group(2)]

return "".join(noncomments)


copyright = """// --------------------------------------------------------------
//
// My Company Inc. - Confidential Information
// Copyright 2005-2008
//
// --------------------------------------------------------------"""

if __name__ == '__main__':
import sys
filename = sys.argv[1]
code_w_comments = open(filename).read()
code_wo_comments = remove_comments(code_w_comments)

#fh = open(filename+".nocomments", "w")
#fh.write(code_wo_comments)
#fh.close()

print copyright
print code_wo_comments


First of all, I added a bit to the regexp to spot one-line comments that start with // - as mentioned in the perl FAQ - see the emphasised section in the above code.

I also got rid of the single quote string matching section of the regexp because verilog doesn't have such strings. It was also accidentally matching the code between two number specifiers which prevented the removal of the comments in what it thought was a string. For example, the comment below would not be removed:
assign a = 1'b0;
// Some comment
assign b = 1'b1;

The regexp itself saves two groups; group 1 is comment group and group 2 is a non-comment group. Printing group 2 is the thing to do if you want the comments removed. If the regexp matches a comment, then group 1 is text and group 2 is empty - printing group 2 effectively "removes" the comment. If the regexp matches a non-comment, then group 2 is text we want to keep, so we print it.

This decommenter script is used as part of an overall script which prepares our code for handover. The RTL is exported from our CVS directory, decommented and tar.gz'd - ready for secure FTPing to our customer...