Generate a Concordance from an XML File

August 21st, 2008

A concordance is a list of all the words in a document, along with their respective word counts.

For example, if you had the following sentences: I like XML. I like computers. Do you like XML?

A concordance for those sentences would look like this:

3 like
2 I
2 XML
1 you
1 computers
1 Do

Concordances are especially useful for finding the words used most often when building glossaries or multilingual dictionaries.

However, it is nontrivial to generate a concordance from an XML file, because XML element names, attribute names, and attribute values are all plain-text words that will skew the results. I came up with a way to generate a concordance from an XML document using only the GNU/Linux command line, in the form of a short shell script.
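To make the skew concrete, here is a minimal sketch that counts the word "note" in a one-line XML fragment, first naively and then after stripping the tags. The fragment and the variable names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical one-line XML fragment; "note" appears once as text and once as an attribute value.
xml='<p class="note">A note about class sizes.</p>'

# Naive approach: split on every non-letter, so markup words are counted as content.
naive=$(printf '%s\n' "$xml" | tr -cs 'a-zA-Z' '\n' | grep -c -i -x 'note')

# Strip the tags first, then split the remaining text the same way.
stripped=$(printf '%s\n' "$xml" | sed -e 's/<[^>]*>/ /g' | tr -cs 'a-zA-Z' '\n' | grep -c -i -x 'note')

echo "naive: $naive, stripped: $stripped"
```

The naive count comes out as 2 and the stripped count as 1, because the attribute value is no longer mistaken for document text.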

To run the concordance script, you need a shell with the standard GNU text utilities, such as sed, tr, sort, and uniq.

Here is the concordance shell script:

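# Replace each XML tag with a space (assumes no tag spans several lines)
sed -e 's/<[^>]*>/ /g' < inputfile.xml |
# Delete everything except letters, digits, apostrophes, hyphens, and whitespace
tr -dc "a-zA-Z0-9' \r\n-" |
# Put one word on each line; carriage returns also become newlines
tr ' \r' '\n\n' |
# Drop empty lines and stand-alone numbers (tr cannot match regular expressions)
grep -v '^[0-9]*$' |
# Count case-insensitively, most frequent first
sort -f | uniq -ic | sort -nr > outputfile.txt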

inputfile.xml is your XML file, and outputfile.txt is the concordance file created by the script.

The script does a number of things. First it removes all the XML markup by stripping the tags, which leaves plain text. Next it deletes punctuation, discards stand-alone numbers, and converts spaces and Windows carriage returns into standard newline characters. At this point, every line has one word on it. Finally it builds the concordance: it sorts the lines, collapses duplicate lines into a single line prefixed with its count, and sorts the result in reverse numerical order so the most frequent words come first.
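The stages described above can be tried on a small input. The following is a sketch that wraps them in a function reading from standard input. The function name `concordance` and the sample XML are hypothetical. It drops stand-alone numbers with grep, which is my assumption about the intent, since tr cannot match patterns. `LC_ALL=C` is set only to keep the sort order predictable.

```shell
#!/bin/sh
# Hypothetical helper: reads XML on stdin, writes the concordance to stdout.
concordance() {
    sed -e 's/<[^>]*>/ /g' |         # replace each tag with a space
    tr -dc "a-zA-Z0-9' \r\n-" |      # delete punctuation and other non-word characters
    tr ' \r' '\n\n' |                # one word per line
    grep -v '^[0-9]*$' |             # drop empty lines and stand-alone numbers
    LC_ALL=C sort -f |               # sort, ignoring case
    LC_ALL=C uniq -ic |              # collapse duplicates, prefix each word with its count
    LC_ALL=C sort -nr                # most frequent words first
}

# The example sentences from the top of the post, wrapped in made-up markup.
printf '%s\n' \
    '<doc id="42">' \
    '<p class="a">I like XML.</p>' \
    '<p>I like computers. Do you like XML?</p>' \
    '</doc>' | concordance
```

The first line of output should be the count 3 followed by "like", and neither "doc", "class", nor "42" should appear, since they only occur inside tags. With real files, the same function could be called as `concordance < inputfile.xml > outputfile.txt`.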

It’s actually not that complicated. The script just strings a few GNU command-line tools together into a pipeline that takes an XML document and builds a concordance.

The concordance file generated is plain text, but you can import it into Microsoft Excel or any other spreadsheet program by using spaces to delimit the cells. In a lot of business settings, a plain text file won't do, but that same data in an Excel spreadsheet becomes business data.
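If the leading spaces that uniq pads its counts with get in the way of the import, one option is to convert the concordance to CSV first. This is a small sketch, not part of the script above. The function name `concordance_to_csv` and the header row are my own choices, and it assumes every word is free of commas, which holds here because the script removes punctuation.

```shell
#!/bin/sh
# Hypothetical helper: turns lines like "      3 like" into "3,like".
concordance_to_csv() {
    echo 'count,word'              # header row for the spreadsheet
    awk '{ print $1 "," $2 }'      # awk ignores the leading padding when splitting fields
}

# Two sample lines in the format that uniq -c produces.
printf '%s\n' '      3 like' '      2 XML' | concordance_to_csv

# With a real concordance file:
#   concordance_to_csv < outputfile.txt > outputfile.csv
```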
