I was recently asked to comment on the evolution of XML for a story about what the Government Printing Office is doing in migrating data to XML, and about how APIs (application programming interfaces) can help agencies extract data from their systems using XML (Extensible Markup Language).

It reminded me of a conversation I had with GPO managers about 15 years ago, in which I advocated XML because it would give them an “author once — use many” capability that would stand the test of time. And it has!

If they had really implemented XML for electronic publishing back then, they would have had the Federal Digital System (FDsys) long before now and their data would already be in XML!

One can read about FDsys at the GPO web site, which says that FDsys provides metadata about federal government publications in standard XML formats.

Wikipedia explains XML for starters and provides a simple example. The rest is so many details, but the essential idea is to describe the information so it can be read and understood by computers.
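To make that concrete, here is a minimal sketch of what XML markup looks like and of a computer “reading” it, using Python’s standard library. The element names and content are invented for illustration, not taken from any GPO schema:

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical XML document: tags describe what each piece of text means.
note = """<note>
  <to>Tove</to>
  <from>Jani</from>
  <body>Don't forget the meeting!</body>
</note>"""

# A program can now work with the structure, not just raw characters.
root = ET.fromstring(note)
recipient = root.findtext("to")
print(recipient)
```

The point is that the same text, once marked up, is unambiguous to software: any XML-aware tool can find the `<to>` element without guessing.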

Here’s an example of a table from the Web:

DATE               ORDER NUMBER   DETAILS        STATUS               PRICE
February 5, 2012   #1406395       View Details   Payment authorized   $92.40
February 1, 2012   #1390927       View Details   Payment authorized   $0.00
January 16, 2012   #1339622       View Details   Payment authorized   $68.48
January 14, 2012   #1333605       View Details   Payment authorized   $68.48

Run together as plain text, this is hard to understand unless we put it into a spreadsheet or mark it up, for example with RDF (Resource Description Framework, a richer vocabulary from the Semantic Web built on XML), to deliver the data on the Web. XML markup describes the structure while still leaving the content as text.
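As a sketch of what that markup could look like, the order rows above can be expressed in XML with Python’s standard library. The element and attribute names (`orders`, `order`, `date`, `status`, `price`) are my own invention for illustration:

```python
import xml.etree.ElementTree as ET

# Two of the rows from the order table above.
rows = [
    ("February 5, 2012", "#1406395", "Payment authorized", "$92.40"),
    ("February 1, 2012", "#1390927", "Payment authorized", "$0.00"),
]

# Wrap each row in elements that name its parts.
root = ET.Element("orders")
for date, number, status, price in rows:
    order = ET.SubElement(root, "order", number=number)
    ET.SubElement(order, "date").text = date
    ET.SubElement(order, "status").text = status
    ET.SubElement(order, "price").text = price

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The result is still plain text, but each value now carries a label a computer can act on: a program can pull every `price` without knowing anything about column positions.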

“Author once” means you type the information in once, as in a word processor; “use many” means that because it is XML underneath, the same file can produce print documents, CDs/DVDs, and Web and mobile content with different stylesheets and programming code.
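A minimal sketch of that idea: one XML source rendered two different ways. In practice this would be done with stylesheets such as XSLT; here plain Python stands in, and the document content is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Authored once: a single XML source document (hypothetical content).
doc = ET.fromstring(
    "<article><title>FDsys Update</title>"
    "<body>GPO migrates data to XML.</body></article>"
)

# Used many: rendering 1, plain text for print.
text_out = doc.findtext("title").upper() + "\n" + doc.findtext("body")
print(text_out)

# Used many: rendering 2, HTML for the Web.
html_out = f"<h1>{doc.findtext('title')}</h1><p>{doc.findtext('body')}</p>"
print(html_out)
```

The content is authored exactly once; only the rendering code changes per output channel, which is the whole appeal of keeping the source in XML.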

So I could take large documents in Word, PDF, etc., copy and import them into software I used at the time, called Folio Views/LivePublish/NextPage, and easily print them, put them on CD/DVD, and publish them to the Web from the same file!

Fast forward to the present day and we can do so much more with large collections of documents, especially the PDF files that so many use to “store” documents but that really limit the “author once — use many” capability and semantic search.

To illustrate this, I have worked with Chuck Rehberg, CTO of Trigent Software, who has an amazing tool called the Semantic Insights Research Assistant (SIRA). Chuck and I teamed up for a tutorial on this at the Knowledge Management Conference last year and, more recently, for a presentation to the Defense Department’s deputy chief management office.

The architecture and details for building a simple report (like a high school term paper) and an ontology (a formal representation of knowledge in XML) of the content are shown elsewhere.

Tools like his can uncover valuable information, “reading” thousands of documents — web pages, a collection of PDF files, a database, your desktop, really just about anything. The stuff you need to know is handed back to you in concise, plain-English reports — faster, better, and cheaper than an expert using a keyword search engine.

These core tools are built on more than six years of R&D in natural language processing and semantic technology, resulting in seven patents, granted and pending, and the world’s fastest, most scalable rules-based engine. The tools recognize synonyms in context, identify multiple ways of saying the same thing, and consult detailed domain-specific knowledge, including generalizations, specializations, relationships, and instances in a given domain.

Chuck and I teamed up again for the LandWarNet 2011 Conference, where I prepared the Defense and Veterans Brain Injury Newsletters in PDF in my wiki (built using MindTouch), so they were structured documents with XML underneath that provides RSS feeds and APIs for other uses.

Chuck was able to demonstrate his semantic searching and reporting (slides and video) over that corpus of PDF documents. He has since been able to do the same with much larger corpora of documents, like Semantic PubMed and on the Internet.

All of this provides a deeper rationale for why GPO and many other organizations need to complete the migration of their data to XML (and wiki-like technologies) and provide the ability to mine this information using semantic search.