Archive for August, 2008

Batch Search and Replace

Saturday, August 30th, 2008

Batch search and replace across multiple files seems to come up a lot. It’s good to know a quick and simple way to do this on text files.

Problem

Suppose you have 100 XML files and you want to add an attribute to one of the elements.

Current XML: <document author="mark">

Desired XML: <document date="2008-08-08" author="mark">

Any XML or even text editor can do search and replace on this easily through the GUI. But if you have 100 XML files that need the same thing done, you need a quick way to do this in batch.

Solution

We’ll write a shell script to read in all 100 XML files, do a search and replace to add the new attribute, and create a new version of each modified XML file in a new directory.

To run the shell script, you need either:

Here is the batch search and replace shell script:

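# Write a modified copy of each XML file into tmp/
for x in *.xml
do
  # Quote the file name so names containing spaces still work
  sed -e 's/<document/<document date="2008-08-08"/g' < "$x" > "tmp/$x"
done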

Make sure the directory tmp/ already exists; that is where all your modified files will go.
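
As a variation on the loop above, here is a minimal sketch that creates the output directory itself and takes the date as an argument. The function name add_date_attr and its two arguments are made up for illustration; they are not part of the original script.

```shell
# Hypothetical helper, not part of the original script.
# Usage: add_date_attr DATE OUTDIR
add_date_attr() {
  date_value=$1   # e.g. 2008-08-08; must not contain /, & or backslashes
  out_dir=$2
  mkdir -p "$out_dir" || return 1   # create the output directory if needed
  for x in *.xml; do
    [ -f "$x" ] || continue         # skip the literal "*.xml" when nothing matches
    sed -e "s/<document/<document date=\"$date_value\"/g" < "$x" > "$out_dir/$x"
  done
}
```

Calling add_date_attr 2008-08-08 tmp from the directory that holds the XML files should behave like the loop above, without the need to create tmp/ first.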

That’s it. Just a simple for loop and a search and replace command. Next time you need to change something across multiple documents, write a simple script instead of doing it manually.

Generate a Concordance from an XML File

Thursday, August 21st, 2008

A concordance is a list of all the words in a document, and their respective word counts.

For example, if you had the following sentences: I like XML. I like computers. Do you like XML?

A concordance for those sentences would look like this:

3 like
2 I
2 XML
1 you
1 computers
1 do

Concordances are especially useful for finding the words used most often when building glossaries or multilingual dictionaries.

However, it is nontrivial to generate a concordance from an XML file, because element names, attribute names, and attribute values are all plain text words that would skew the results. I came up with a way to generate a concordance from an XML document with a short shell script that uses only GNU/Linux command line tools.

To run the concordance script, you need either:

Here is the concordance shell script:

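# Strip XML tags (each tag must open and close on the same line)
sed -e 's/<[^>]*>//g' < inputfile.xml |
# Turn carriage returns and spaces into newlines: one word per line
tr '\r ' '\012\012' |
# Delete everything except letters, digits, apostrophes, hyphens, newlines
tr -dc "a-zA-Z0-9'\012-" |
# Drop blank lines and stand-alone numbers (tr cannot match a regex)
grep -v '^[0-9]*$' |
sort -f | uniq -ic | sort -nr > outputfile.txt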

inputfile.xml is your XML file, and outputfile.txt is the concordance file created by the script.

The script does a number of things. First it removes all the XML markup, stripping the tags to leave plain text. Next it splits the text into one word per line: spaces and Windows carriage returns become newline characters, punctuation and other special characters are deleted, and stand-alone numbers are dropped. At this point, every line has one word on it. Then it does the actual work of building the concordance: it sorts the lines without regard to case, collapses duplicate lines into a single line prefixed with its count, and sorts the result in descending numerical order.
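
To make the two phases easier to see, here is a rough sketch that splits the pipeline into two functions reading from standard input. The names xml_words and concordance are my own, and the sketch assumes every tag opens and closes on the same line.

```shell
# Hypothetical helpers; the names are illustrative only.

# xml_words: read XML on stdin, print one word per line.
xml_words() {
  sed -e 's/<[^>]*>//g' |        # strip tags that open and close on one line
  tr '\r\t ' '\n\n\n' |          # carriage returns, tabs, spaces become newlines
  tr -dc "a-zA-Z0-9'\n-" |       # delete punctuation and other characters
  grep -v '^[0-9]*$'             # drop blank lines and stand-alone numbers
}

# concordance: read XML on stdin, print "count word" lines, highest count first.
concordance() {
  xml_words | sort -f | uniq -ic | sort -nr
}
```

Running xml_words < inputfile.xml on its own is a handy way to check the word list before counting; concordance < inputfile.xml > outputfile.txt then produces the finished file.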

It’s actually not that complicated. It just uses a few GNU command line tools to process the data, and strings them all together to form a script that takes an XML document and builds a concordance.

The concordance file generated is plain text, but you can import it into Microsoft Excel or any other spreadsheet program by using spaces to delimit the cells. In a lot of business settings, a plain text file won’t do; but that same data in an Excel spreadsheet now becomes business data.
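
One wrinkle with a space-delimited import is that uniq -c pads each count with leading spaces, which can turn into empty cells. A possible workaround, sketched here, is to rewrite the concordance as comma-separated values first. The function name concordance_to_csv and the header row are my own additions, not part of the original script.

```shell
# Hypothetical helper: convert "count word" lines on stdin to CSV on stdout.
# Words produced by the concordance script contain no commas, so no quoting
# is needed.
concordance_to_csv() {
  awk 'BEGIN   { print "word,count" }   # header row (an assumption)
       NF == 2 { print $2 "," $1 }'     # swap columns: word first, then count
}
```

For example, concordance_to_csv < outputfile.txt > outputfile.csv writes a file that most spreadsheet programs should open directly.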