DITA on your Local Disk: Metadata, the File System and "Common"

Peter Fournier

This series of articles explores the how and why of using DITA XML in the file system: in the early stages of DITA adoption, it saves you time and money. You cannot afford not to use DITA.

This article explains how you can get one of the principal benefits of Component Content Management Systems (CCMS's) or Content Management Systems (CMS's), finding content to link to or reuse, right on your file system.

What is "Metadata"?

In the previous articles Organizing your Suite and The "Common" Concept we described an overall strategy for organizing DITA files in your file system. What we did not explain is that the organization of your files, together with the file names, mimics one of the principal benefits of CCMS's and CMS's: finding content that can be reused. This is done through the "metadata" embedded in the directory and file names in your file system.

Of course this raises the question, "What is Metadata?" In DITA there is no easy answer to that question. Metadata can be about content or structure; it can be layered (metadata in DITAMAPs overrides metadata in DITA files) or not layered; it can be about hard facts like the DATE or soft facts like AUDIENCE; and, finally, it can appear in elements like <metadata> or in attributes like @product.

So, a very brief general definition is this: metadata is data about the data or content (for more information go to Wikipedia: Metadata).

In the file system we have described so far, metadata is captured in the names we give to directories and files and how we arrange them in the file system hierarchy.
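As a rough illustration of that idea, the sketch below reads the metadata back out of a path. The path itself is hypothetical: it combines names used elsewhere in this article ("Release 1.0", "Product-XYZ", "XYZ-User-Guide/020-Install.dita") into one assumed layout, and the function name is an invention for this example.

```python
from pathlib import PurePosixPath


def metadata_from_path(path):
    """Split a suite-relative path into the metadata its names carry."""
    p = PurePosixPath(path)
    return {
        # Each directory name on the way down is one piece of metadata.
        "directories": list(p.parts[:-1]),
        # The file name (without extension) describes the topic itself.
        "file": p.stem,
        # The extension tells us what kind of file it is.
        "type": p.suffix.lstrip("."),
    }


# Hypothetical path, assembled from names that appear in this article.
example = metadata_from_path(
    "Release-1.0/Product-XYZ/XYZ-User-Guide/020-Install.dita"
)
print(example)
```

Nothing had to be typed into a form to make this information available: naming and placing the file was enough.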

Metadata in your File System

Consider the following graphic.

Obviously there is a lot of information, or metadata, already captured in the file system through the naming of directories and files. Notice as well that this metadata in the naming of directories and files can be used by ordinary authors to find content. Again, it's obvious.

In contrast, many (most?) CCMS's and CMS's expect exactly this kind of information to be captured in DITA elements or attributes inside the DITA or DITAMAP files, or in the database on check-in. They need this information to make finding the content easier. If we count all the metadata entries already captured in the file and directory names, suppose the same information had to be entered in a database attribute form, and compare the two, we can conclude that:

  • capturing the metadata in the file system requires 16 steps (naming the directories and files)
  • capturing the metadata in a database entry form or in the DITA files requires 26 steps

The reason these numbers are so different is that if you name a directory in the file system "Release 1.0", it is named, it's done, it's one step. In many CCMS's and CMS's, "Release 1.0" would have to be attached to every file below that level.

Those numbers, 26 and 16, are of the same order of magnitude. However, if we add one more file to the suite, say "030-Configuration.dita" under "Product-XYZ", we need one (1) step in the file system but we need seven (7) steps in the database. In more complex schemes the difference in the number of steps can only increase.
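The way the two counts grow can be sketched in a few lines. The tree below is hypothetical and much smaller than the one in the graphic, so it does not reproduce the 16, 26 or 7 quoted above. It assumes that the file system costs one step per directory or file named, while per-file tagging costs one step per metadata value attached to each file (every ancestor name plus the file's own name).

```python
from pathlib import PurePosixPath


def file_system_steps(paths):
    """One step for every distinct directory or file that has to be named."""
    named = set()
    for path in paths:
        parts = PurePosixPath(path).parts
        # Every prefix of the path is a directory (or the file) named once.
        for depth in range(1, len(parts) + 1):
            named.add(parts[:depth])
    return len(named)


def per_file_steps(paths):
    """One step for every metadata value attached to every file."""
    return sum(len(PurePosixPath(path).parts) for path in paths)


# Hypothetical suite: the names come from this article, the layout is assumed.
suite = [
    "Suite/Release-1.0/Product-XYZ/XYZ-User-Guide/010-Overview.dita",
    "Suite/Release-1.0/Product-XYZ/XYZ-User-Guide/020-Install.dita",
]
bigger = suite + ["Suite/Release-1.0/Product-XYZ/030-Configuration.dita"]

fs_delta = file_system_steps(bigger) - file_system_steps(suite)
db_delta = per_file_steps(bigger) - per_file_steps(suite)
print("extra steps in the file system:", fs_delta)
print("extra steps with per-file tagging:", db_delta)
```

In this small tree the new file costs one step in the file system and four with per-file tagging. The file system cost of a new file stays at one however deep the hierarchy is, while the per-file cost grows with the depth.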

"Common", Metadata and your File System

For any particular suite of information products, all reusable chunks of information should reside in the "Common" directory. Once there, and once they have been edited to remove CONREFs and XREFs, they can be used in the product-specific information anywhere in the suite, and they will be!

One of the great benefits of reusable chunks in "Common" in the file system is that an author or a team can organize it in any way they wish and they can change their minds about that organization at any time. For example, suppose that in the early stages of a transition to DITA the team decides to create a directory for warnings. They might decide that the appropriate organization would be to create a folder structure like this:

  • Common/
    • Warnings/
      • 100-Kilos.dita,
      • Class-A.dita,
      • Goggles.dita,
      • Optical-Radiation.dita,
      • Wrist-Strap.dita.

Later on they decide that the appropriate warnings would be easier to find (and therefore faster to find and more accurate when CONREF'd into a topic) if they rearranged things like this:

  • Common/
    • Warnings/
      • Electrical/
        • Ground-Required.dita,
        • Wrist-Strap.dita,
        • 10000-Volts.dita,
      • Hardware/
        • 100-Kilos.dita,
        • 1000-Kilos-Use-ForkLift.dita,
        • 2000-Kilos-Use-ForkLift-HeavyDuty.dita,
      • Temperature/
        • Burn.dita,
        • Hot.dita,
        • Warm.dita,
      • Optical/
        • Optical-Radiation.dita,
        • Goggles.dita,
        • Class-A.dita.

Note: Although this level of reorganization is generally not recommended for DITA in the file system because of broken links, the Samalander Link Fixer module can make the process almost painless.

That level of reorganization in a CCMS or a CMS can be problematic. Furthermore, if the assumption is that topics are flagged with metadata to facilitate finding in the CCMS or CMS, that implies getting to the reusable topic or chunk normally involves some sort of search interface: find this please ... warning AND optical AND goggles. Navigating the file system seems easier for basic searching.
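To make the comparison concrete, here is a minimal sketch that runs the "warning AND optical AND goggles" query against nothing but directory and file names. The listing mirrors the reorganized "Common" directory described above. The `find` function and its word-prefix matching rule are assumptions made for this example, not a feature of any particular tool.

```python
from pathlib import PurePosixPath

# The reorganized "Common" directory, written out as plain paths.
COMMON = [
    "Common/Warnings/Electrical/Ground-Required.dita",
    "Common/Warnings/Electrical/Wrist-Strap.dita",
    "Common/Warnings/Electrical/10000-Volts.dita",
    "Common/Warnings/Hardware/100-Kilos.dita",
    "Common/Warnings/Hardware/1000-Kilos-Use-ForkLift.dita",
    "Common/Warnings/Hardware/2000-Kilos-Use-ForkLift-HeavyDuty.dita",
    "Common/Warnings/Temperature/Burn.dita",
    "Common/Warnings/Temperature/Hot.dita",
    "Common/Warnings/Temperature/Warm.dita",
    "Common/Warnings/Optical/Optical-Radiation.dita",
    "Common/Warnings/Optical/Goggles.dita",
    "Common/Warnings/Optical/Class-A.dita",
]


def find(paths, *terms):
    """Return the paths whose names contain every term (a simple AND search)."""
    wanted = [term.lower() for term in terms]
    hits = []
    for path in paths:
        # Break the names into lowercase words: "Optical-Radiation" -> optical, radiation.
        words = set()
        for part in PurePosixPath(path).with_suffix("").parts:
            words.update(part.lower().split("-"))
        # A term matches when some word starts with it ("warning" matches "warnings").
        if all(any(word.startswith(term) for word in words) for term in wanted):
            hits.append(path)
    return hits


print(find(COMMON, "warning", "optical", "goggles"))
```

The query returns the single path Common/Warnings/Optical/Goggles.dita, which is the same place an author reaches by opening three folders.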

Some Metadata does not Belong in the File System Hierarchy

By now you might be getting the impression that we are trying to say that CCMS's and CMS's are not necessary for implementing DITA. In the early stages of implementing DITA, or for a basic implementation, a CCMS or a CMS is definitely not needed. But this article about metadata and the file system does point to cases where a CCMS or a CMS will be needed. What might these cases be? The first and most important concerns what metadata belongs where.

Metadata and Where it Belongs

In the graphic near the top of this article we present three classes of metadata: Category, Subject and Role.

Of these three, Category and Subject metadata are almost always easily captured in directory naming conventions. Role is quite different. In our graphic we could envision situations in which the role is better classified as two or more roles of equal importance.

For example the topic/file XYZ-User-Guide/020-Install.dita might actually be a topic that includes @novice-install, @expert-install, @staff-install and @contractor-install metadata flagging content for 4 x 3 x 2 = 24 different installation manuals. This kind of metadata belongs inside the DITA and/or the DITAMAP files and must be attached to specific elements inside the files. Furthermore, managing the production of the required subset of 24 possible manuals will be a challenge, as will controlling the review and quality assurance processes! This kind of problem needs some serious assistance, currently available only from some CCMS's and CMS's.
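The 4 x 3 x 2 arithmetic can be checked by enumerating the combinations. The four install flags come from the example above. The article does not name the factors behind the 3 and the 2, so the sketch uses neutral placeholders (B1..B3 and C1..C2) for them.

```python
from itertools import product

# The four flags named in the example above.
install_flags = [
    "novice-install",
    "expert-install",
    "staff-install",
    "contractor-install",
]
# Placeholders only: the article does not say what these two factors are.
second_factor = ["B1", "B2", "B3"]
third_factor = ["C1", "C2"]

# Every manual is one choice from each factor.
manuals = list(product(install_flags, second_factor, third_factor))
print(len(manuals))
for manual in manuals[:3]:
    print(manual)
```

Each tuple stands for one possible manual. A directory hierarchy can express only one path per file, which is why this kind of flagging has to live inside the DITA or DITAMAP files instead.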

Writers, Metadata, Reuse and Training

A common objection to the adoption of DITA, especially in small groups in large corporations and among independent technical documentation contractors, is the training overhead required to implement DITA in a cost-effective way. Not to worry! In Samalander's experience the training overhead required to implement reuse in DITA is close to zero, especially if the implementation is done in the file system.

Why might this be? Well, it's simple really: boredom. The basic file system architecture described in Organizing your Suite, in The "Common" Concept, and in this article makes reuse the best and easiest benefit of DITA. Because reuse relieves boredom, technical writers tend to adopt it as an escape from the one aspect of their job that drives them crazy: doing the same thing over and over and over again.

Show the writers how to do a CONREF once and they will adopt the technique enthusiastically, even to the extent of designing a metadata hierarchy that is optimized for their particular work environment and training each other. After all, CONREF is not a difficult concept, and the authoring tools are all optimized to help the writer make CONREFs. Writers who figure out CONREFs on their own naturally feel a sense of ownership and accomplishment: they have saved money and time and reduced the repetitiveness of their jobs.
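For readers who have not yet seen a CONREF, the sketch below shows the idea in miniature: an empty element in one topic points at an element in "Common", and a processor pulls the content in. This is a deliberately simplified stand-in, not how a DITA processor is implemented. The topic contents, the ids, and the `resolve_conrefs` function are hypothetical; files are held in a dictionary keyed by the literal conref path, where real conref paths are resolved relative to the referencing file.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical reusable warning, as it might be stored in "Common".
COMMON_FILES = {
    "Common/Warnings/Optical/Goggles.dita": """
<topic id="goggles">
  <title>Goggles</title>
  <body>
    <note id="wear-goggles" type="warning">Wear safety goggles.</note>
  </body>
</topic>""",
}

# Hypothetical product topic that reuses the warning by reference.
TOPIC = """
<task id="install">
  <title>Install</title>
  <taskbody>
    <context>
      <note conref="Common/Warnings/Optical/Goggles.dita#goggles/wear-goggles"/>
    </context>
  </taskbody>
</task>"""


def resolve_conrefs(root, files):
    """Replace each element that has @conref with a copy of its target."""
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            conref = child.get("conref")
            if conref is None:
                continue
            # A conref has the form file#topic-id/element-id.
            path, fragment = conref.split("#")
            topic_id, element_id = fragment.split("/")
            source = ET.fromstring(files[path].strip())
            if source.get("id") != topic_id:
                raise ValueError("topic id not found: " + topic_id)
            target = next(e for e in source.iter() if e.get("id") == element_id)
            replacement = copy.deepcopy(target)
            replacement.tail = child.tail  # keep the surrounding whitespace
            parent[index] = replacement


root = ET.fromstring(TOPIC.strip())
resolve_conrefs(root, COMMON_FILES)
note = root.find("./taskbody/context/note")
print(note.text)
```

The writer types the warning once, in "Common", and every topic that references it picks up the same text. Nested conrefs and error handling are left out to keep the sketch short.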

The key is to make it easy for the writers to design and implement their own solutions. DITA plus metadata in the file system does exactly that: makes the writer's job easier by reducing the number of repetitive tasks, and the design is under their control.

From a management or independent contractor point of view, metadata captured in the file system, plus writer self-training, is a seriously attractive option. It reduces the cost of piloting DITA to the absolute minimum (Samalander modules are inexpensive!), leads to writer self-training, and achieves the earliest and most important benefit of DITA: reuse. It also provides the data needed to accurately estimate the RoI of a deeper deployment of DITA and a DITA-optimized CCMS or CMS.

The fact that implementing DITA in the file system will generate massive reuse benefits also points to the solution to another problem related to introducing a DITA-optimized CCMS or CMS in larger corporations: justifying the selection of a database-driven solution. DITA in the file system clarifies the set of required features in a CCMS or CMS in a way that is simply not available in any other methodology: no amount of planning or information architecting can compare. And it keeps your writers happy: they have a significant and important role to play in the selection of a CCMS or CMS, if it is required.

Summary

This article has explained

  • what metadata is,
  • how metadata relates to the file system,
  • how metadata in the file system makes the writer's job easier,
  • how not all metadata belongs in the file system, and
  • how all this relates to CCMS's and CMS's.

Other articles in this series will explore how all of this relates to the successful introduction of DITA to a corporation or to an independent contractor business model, and to the concept of piloting in the calculation of RoI.

The next article, The SUITE ROOT Concept, helps DITA publishing, metrics, and reuse make more sense.