Laridian® DocAnalyzer

User's Guide

 

October 2012

 

Overview

Creating HTML documents to use with Laridian BookBuilder can be a challenging task. Computers read tags with ease, but we need a little help. Laridian DocAnalyzer can help you solve problems you might encounter while building a book, point out trouble spots in your HTML, and give you some good statistical information about the book you're creating. Whether you're publishing your own work or formatting an already published work to use with Laridian's many book readers, DocAnalyzer can be a big help.

System Requirements

DocAnalyzer requires Windows 2000, XP, Vista, Windows 7, or later. If you are running BookBuilder you meet the system requirements.

DocAnalyzer produces an HTML document as its output, so you will want a web browser to view it. Any current web browser will do (e.g., Internet Explorer, Firefox, or Google Chrome).

In A Nutshell

When you create a book yourself and run it through BookBuilder, you have total control over the original file and it's unlikely you'll insert anything into the file that is problematic. If you do, it's easy to find since it's your book. If, on the other hand, you download a book from the Web, or (like us at Laridian) you receive a file from a third party (the author or publisher), you really don't know what's in the file. If it's large, there could be small things hiding there that would be difficult to see if you had to depend on just reading the file on your computer screen.

Files you receive from an author or publisher will most likely use a tagging scheme that is not compatible with BookBuilder. Bible references might look like “4003016” instead of John 3:16. Headings may be delimited with a tag like <p class="heading-1">. Bold and italics might be marked as “...this is [BO]bold[RO] and this is [IT]italic[RO]...”. While you might be able to discover most of these by just examining the file, DocAnalyzer will look at every single character in every single line and find every single occurrence of these odd tags. Thus it is the source of your first “to-do list” when coming up with a strategy for tagging a file.

There are two times when you use DocAnalyzer. First, you will want to run DocAnalyzer on a document you have downloaded from the Web or received from an author or publisher prior to beginning to tag that document. Your goal in doing this is to come up with a full list of:

  1. tags or attributes that are not recognized by BookBuilder and therefore need to be removed or changed,
  2. tags that serve as clues for the location of your headings (for the table of contents) or synchronization points (<pb_sync> tags),
  3. characters that are not recognized by BookBuilder,
  4. tags that might be missing their corresponding closing tags (e.g., there are more <ol> tags than </ol> tags).

Second, you can use DocAnalyzer during the tagging process to verify that you haven't left any unrecognized tags in the document or inserted any illegal characters.

When you run DocAnalyzer, it produces an HTML document named original document name.docanalyzer.htm in the same directory as your HTML document. This is referred to as the output file throughout the rest of this documentation. The output file contains a wealth of information, including a statistical analysis of every character in your HTML file, a list of all the tags used, and warnings about possible problems with your HTML file.

About this Documentation

This document has two sections. The first details the options and operation of DocAnalyzer, which is fairly straightforward. The second offers practical help in using the information in the DocAnalyzer output file.

DocAnalyzer Options and Operations

Installing DocAnalyzer

When you installed BookBuilder you also installed DocAnalyzer, which is one of three programs that come with BookBuilder. Generally DocAnalyzer will be found in the directory C:\Program Files (x86)\Laridian on 64-bit computers or C:\Program Files\Laridian on 32-bit computers. It will also appear in your Start Menu under Laridian\BookBuilder Professional or Laridian\BookBuilder Standard.

Running DocAnalyzer

Double-click on the DocAnalyzer program, or start it from the Start menu. The window that appears has a number of options to set.

Begin by clicking the Browse button in the upper right-hand corner of the window. A second window will appear to navigate to your HTML file. Click on your HTML file and click Open. You can only analyze one HTML file at a time.

Usually the other options can be left at their defaults. Just click Go! and DocAnalyzer will go to work. You can see it working in the status field at the bottom of the window. When you see Done! in the status field, you can click Exit or browse to another HTML document to analyze.

Note: DocAnalyzer overwrites any previously created output file, or any file named original document name.docanalyzer.htm. If you wish to keep previous output, it is recommended that you rename the file (but keep the extension .htm) or move it out of the directory containing your original file.
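
If you rerun DocAnalyzer often, renaming the previous output by hand gets tedious. The following Python sketch shows one way to do it. It is not part of DocAnalyzer; the function name keep_previous_output and the label argument are our own inventions for illustration, and you pass in the path of the output file yourself.

```python
from pathlib import Path

def keep_previous_output(output_path, label):
    """Rename an existing output file so the next run does not overwrite it.

    Hypothetical helper: the new name keeps the .htm extension and
    inserts the given label in front of it.
    """
    output_path = Path(output_path)
    if not output_path.exists():
        return None  # nothing to preserve
    backup = output_path.with_name(output_path.stem + "." + label + ".htm")
    output_path.rename(backup)
    return backup
```

Calling keep_previous_output with a path ending in mybook.docanalyzer.htm and the label "run1" would rename that file to mybook.docanalyzer.run1.htm in the same directory.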

When you are finished running DocAnalyzer, close it by selecting the Exit! button.

Options

The options in DocAnalyzer can usually be left in their default mode, but in some cases you may want to change the options.

Tags

DocAnalyzer assumes the file you are starting from is some kind of tagged file that conforms to HTML or XML tagging standards. Tags in these languages look like this:

start-delimiter tag-name attribute-name = value end-delimiter

Where the start-delimiter is the less-than character (<); tag-name is the name of the tag, like “p” for “paragraph” or “ul” for “unordered list”; attribute-name is the name of a tag attribute, like “align” to change the alignment of a paragraph; value is the value of the attribute, like “center” to center a paragraph on the screen; and end-delimiter is a greater-than sign (>). A tag can have zero or more attribute/value pairs, and attribute values can be in quotes (and must be, if they contain a space). Here are some examples of HTML tags:

<a href="bible:John 3:16">
<b>
<meta name=pb_title value="The Holy Bible" />
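
To make the syntax concrete, here is a minimal Python sketch that splits a tag into its name and its attribute/value pairs. It is only an illustration of the syntax described above, not DocAnalyzer's own code; the name parse_tag and the regular expression are assumptions of ours.

```python
import re

# Hypothetical pattern: an attribute name, optionally followed by
# "=" and either a quoted or an unquoted value.
ATTR_RE = re.compile(r'([^\s=]+)(?:\s*=\s*("[^"]*"|[^\s"]+))?')

def parse_tag(tag, start="<", end=">"):
    """Return (tag_name, {attribute: value}) for a single tag string."""
    # Strip the delimiters, then separate the tag name from the rest.
    inner = tag[len(start):len(tag) - len(end)].strip()
    name, _, rest = inner.partition(" ")
    attrs = {}
    for match in ATTR_RE.finditer(rest):
        attr, value = match.group(1), match.group(2)
        # Attributes without a value are stored as None.
        attrs[attr] = value.strip('"') if value is not None else None
    return name, attrs

print(parse_tag('<a href="bible:John 3:16">'))
print(parse_tag('<meta name=pb_title value="The Holy Bible" />'))
```

In this sketch the trailing forward slash of the meta tag comes back as an attribute with no value, simply because it is a token inside the delimiters.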

By default, DocAnalyzer is configured to analyze HTML or XML files, but if your file is in some other format where the tags coincidentally follow this syntax, you can change the start and end characters to support your file format.

Your file may contain tags that are not supported by BookBuilder, so you will need to find and change those tags. When you check the Analyze Tags checkbox, DocAnalyzer will create a list of all the tags it finds in your document. If your document isn't tagged or uses some very unusual tagging format that DocAnalyzer can't distinguish from the surrounding text, you might want to disable the option to analyze tags.

As mentioned above, HTML and XML tags start with < and end with >, and BookBuilder requires these kinds of tags. But if your document is tagged with other beginning and ending characters, you may change these to reflect those characters. For example, we've seen books where square brackets (“[” and “]”) were used to delimit tags. As long as the tags in your document have some known beginning and ending sequence, you can use DocAnalyzer to come up with a list of all the tags in your document.

We've seen files where each paragraph started with a pound-sign (#) and a letter describing the type of paragraph it was (for example, “L” for left-justified, “R” for right-justified, “I” for indented, etc.). DocAnalyzer isn't going to help you analyze this type of file (unless it also happens to contain HTML-like tags for bold or italic text).

While there are limits, there are also some creative uses of this functionality that might be helpful. For example, if your book contains a number of parenthesized elements (like Strong's numbers or cross-references), you could set the start and end tag characters to “(” and “)” and DocAnalyzer will produce a list of everything it finds in parentheses in your file.
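
The idea of changing the start and end characters can be sketched in a few lines of Python. This is an approximation for illustration only; the function name count_delimited is hypothetical, the sample text is made up, and DocAnalyzer's actual matching rules may differ.

```python
import re
from collections import Counter

def count_delimited(text, start="<", end=">"):
    """Count every run of text found between the start and end delimiters."""
    # re.escape lets characters such as "(" or "[" be used as delimiters.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return Counter(m.group(1) for m in re.finditer(pattern, text, re.DOTALL))

# Square-bracket tags, as in the [BO]bold[RO] example.
sample = "this is [BO]bold[RO] and this is [IT]italic[RO]"
print(count_delimited(sample, "[", "]"))

# Parenthesized elements (sample text invented for this sketch).
print(count_delimited("love (G26) and faith (G4102), love (G26)", "(", ")"))
```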

Character Entities

“Character entities” refers to the way that special characters are represented in HTML. For instance, to display & in the browser, the character entity &amp; is used. While BookBuilder will recognize all the standard character entities in HTML, our PocketBible program may not support all character entities on every platform. With that in mind, it's helpful to have a list of every character entity in your document so you can verify there is nothing there that will trip up PocketBible.

In the Character Entities box you may opt to not analyze character entities. Uncheck the Analyze Character Entities checkbox to skip this analysis.

HTML character entities begin with an ampersand (&) and end with a semicolon (;), and are generally fewer than 10 characters long. You may change these options in the Character Entities box. As with the tags, enter the characters that start and end the character entities and change the Max Length as desired. For example, we've seen files from publishers where special characters simply appeared in parentheses, like “(emdash)” or “(ellipsis)”. In this case you could replace the start and end characters with “(” and “)” and DocAnalyzer will list all the special characters it finds named between parentheses.
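
Here is a rough Python sketch of how a start character, an end character, and a maximum length can work together to pick out character entities. The function name count_entities is hypothetical, and we assume here that the maximum length applies to the text between the delimiters and that entities contain no spaces; DocAnalyzer may apply the Max Length setting differently.

```python
import re
from collections import Counter

def count_entities(text, start="&", end=";", max_length=10):
    """Count entities: start, 1..max_length non-space characters, end."""
    # The body may not contain whitespace or either delimiter.
    body = r"[^\s" + re.escape(start) + re.escape(end) + "]{1," + str(max_length) + "}"
    pattern = re.escape(start) + body + re.escape(end)
    return Counter(m.group(0) for m in re.finditer(pattern, text))

# Standard HTML entities.
print(count_entities("Tom &amp; Jerry said &ldquo;hi&rdquo; &amp; left"))

# Publisher-style names in parentheses; the longer parenthetical
# remark contains spaces and is therefore not counted.
text = "a pause (emdash) then (ellipsis) and (emdash) (this is just a remark)"
print(count_entities(text, "(", ")"))
```

The length limit is what keeps an ordinary parenthetical sentence from being mistaken for a special character name.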

Using DocAnalyzer to Improve Your Book

The DocAnalyzer Output File

The DocAnalyzer output is an HTML file that can be viewed in any web browser. To open the file, simply double-click it and your default web browser will open it. If Windows prompts you to pick a program to open the file, choose an installed web browser like Internet Explorer, Firefox, or Google Chrome. The output file can be rather large, depending on the HTML file given to DocAnalyzer, so it may take a moment to load in your web browser.

Note: If you run DocAnalyzer multiple times, and view the results in a web browser, you will have to click the refresh button on the web browser tool bar (usually looks like a circular arrow) to see the new file. Browsers do not automatically update when the underlying HTML changes.

You may also view the output in a text editor, though it will be considerably more difficult to read there, so it is suggested you use a web browser.

File Information

The first part of the output file is the name of the file processed and the file length.

Character information

DocAnalyzer (and BookBuilder) expect the files you give them to contain 7-bit ASCII characters only (don't worry if you don't know what that means — just keep reading to see how to fix these problems). These are those characters with ASCII values of 127 or less when you view them in the Character Count Table. In fact, it is precisely this limitation that makes DocAnalyzer helpful. If you see character values greater than 127 in the Character Count Table, you need to find those characters, figure out what they are, and replace them with a character entity that BookBuilder will recognize.

Character Count Table

Following the file information is some important character information: a table with a count of each character in the file. The first column of the table is the ASCII value. This is the number the computer uses to represent the character. For instance, capital A has an ASCII value of 65 while lowercase a has an ASCII value of 97. Laridian's various readers will only accept characters with an ASCII value less than 128. To represent characters with an ASCII value greater than 127, such as accented characters and certain punctuation marks, character entities are used.

The third column in the character count table is the number of times a character appears in the HTML file, whether it is in a tag or outside a tag.

The characters are listed in ASCII numerical order. Some of the rows in the table may be colored green. This indicates that these characters are beyond ASCII 127 and will cause problems with BookBuilder. See below for tips on searching out these characters.
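
If you are curious how such a table can be built, the following Python sketch counts each byte value in a file's contents and marks the values above 127. It is an illustration under our own assumptions, not DocAnalyzer's code; the name count_bytes and the shape of the rows are hypothetical.

```python
from collections import Counter

def count_bytes(data):
    """Return (value, count, is_problem) rows in numerical order."""
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    return [(value, counts[value], value > 127) for value in sorted(counts)]

# Two ordinary letters plus byte 148, a Windows-1252 right double quote.
for value, count, is_problem in count_bytes(b"Aa\x94a"):
    note = "  <-- above 127" if is_problem else ""
    print(value, count, note)
```

To try it on a real file, read the file in binary mode (for example with open(path, "rb")) so the bytes are counted exactly as they are stored.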

UTF8 Character Count Table

ASCII values from 0 to 255 can be contained in one byte. DocAnalyzer reads your file one byte at a time, interpreting each byte as one ASCII character. However, some files contain characters whose values are greater than 255 and thus require more than one byte in the file. The most common way to encode these characters is called UTF8. In UTF8, character values greater than 127 are encoded in two or more bytes, and each byte has a value greater than 127. The problem DocAnalyzer has when looking at a byte that is greater than 127 is that it doesn't know if it represents a single ASCII character (using, for example, Windows-1252 character encoding) or if it is part of a UTF8 character that spans many bytes.

When DocAnalyzer finds bytes with values greater than 127, it records them in the Character Count Table but it also attempts to interpret them as UTF8. When it is successful, it keeps a count of how many of each of those characters it finds and displays them in the UTF8 Character Count Table.

The first column in the table gives the Unicode value, similar to the ASCII value above. The second column displays the character, and the third column gives the count. If there are no suspect characters in your HTML file, this table will not appear.

The characters listed in the UTF8 Character Count Table are not going to be recognized by BookBuilder. You need to search them out in your original document and change them to character entities or replace them with a recognized character. For example, the fancy quotation marks used “here” will show up either as ASCII 147 and 148 or UTF8 8220 and 8221. They can either be represented as &ldquo; and &rdquo; (their equivalent character entities) or simply replaced with straight quotation marks like "these". See below for tips on searching out these characters.
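
Both fixes for fancy quotation marks can be sketched in Python as follows. The mapping and the function name replace_fancy_quotes are our own, chosen for illustration; extend the mapping with whatever characters your UTF8 Character Count Table reports.

```python
# Hypothetical mapping from problem characters to character entities.
REPLACEMENTS = {
    "\u201c": "&ldquo;",  # left double quotation mark, Unicode 8220
    "\u201d": "&rdquo;",  # right double quotation mark, Unicode 8221
}

def replace_fancy_quotes(text, straight=False):
    """Replace fancy double quotes with entities or with straight quotes."""
    for char, entity in REPLACEMENTS.items():
        text = text.replace(char, '"' if straight else entity)
    return text

sample = "quotation marks used \u201chere\u201d"
print(replace_fancy_quotes(sample))                 # entity form
print(replace_fancy_quotes(sample, straight=True))  # straight-quote form
```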

If DocAnalyzer is not able to translate every byte in your file that is greater than 127 into a UTF8 character, it will display a warning message: “Some high-ASCII values do not appear to be UTF8; the following table may not be accurate.” This simply means that the UTF8 Character Count Table may not include all the characters represented by bytes greater than 127. You may have to fix the characters listed in the UTF8 Character Count Table, then run DocAnalyzer again to see what problems remain to be fixed.
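
The approach described in this section, trying to read high bytes as UTF8 and noting when that fails, can be approximated with the Python sketch below. It is our own illustration and not DocAnalyzer's algorithm; the function name count_utf8_characters and its return values are assumptions.

```python
from collections import Counter

def count_utf8_characters(data):
    """Return (Counter of non-ASCII Unicode values, all_decoded flag)."""
    counts = Counter()
    all_decoded = True
    i = 0
    while i < len(data):
        if data[i] < 128:  # plain 7-bit ASCII, nothing to interpret
            i += 1
            continue
        # Try the 2-, 3- and 4-byte sequences that UTF-8 allows.
        for length in (2, 3, 4):
            try:
                char = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                continue
            counts[ord(char)] += 1
            i += length
            break
        else:
            # A high byte that is not part of a valid UTF-8 sequence.
            all_decoded = False
            i += 1
    return counts, all_decoded

print(count_utf8_characters("\u201c\u03a9\u201d".encode("utf-8")))
print(count_utf8_characters(b"caf\xe9"))  # Windows-1252 byte, not UTF8
```

When the second value comes back False, the situation corresponds to the warning above: some bytes greater than 127 could not be read as UTF8, so the counts may be incomplete.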

Count of Character Entities Table

This table lists every character entity used in your HTML file and the number of times it appears. If your HTML file has no character entities, the table will not appear, though the title “Count of character entities” will.

Tag Information

Count of Tags Table (a.k.a. Tags Table)

The first table in the Tag Information section is the Count of tags table, referred to in this documentation as the Tags table. This table lists every tag used in your document in alphabetical order. The tag delimiters (usually < and >) are not shown. Ending tags usually begin with a forward slash (/) and are listed near the top of the table. See below for how to use this information effectively.

Count of Tag Attributes Table (a.k.a. Attributes Table)

The next table, referred to as the Attributes table, again lists every tag in the first column. The second column lists each attribute of the tag. An attribute gives more information about the tag for the reader to use. For instance, DocAnalyzer may have counted 17 meta tags in the Tags table. In the Attributes table you will see the tag meta listed twice in the first column: once with the attribute “content” and once with the attribute “name”, because meta tags often have two attributes. Some tags, such as the italics (<i>) tag, don't have any attributes. Tags with no attributes are not listed in this table.

Note: HTML comments are usually tagged <!-- comment information -->. In the output file you will find a count for the tag !-- in the Tags table, and in the Attributes table each comment will appear as an attribute.

Note: If you end tags with a forward slash (/) before the >, as with pb_sync tags, the forward slash will appear as an attribute. This is normal.

Count of Tag Attribute Values Table (a.k.a. Values Table)

This table, referred to as the Values table, will most likely be the largest table in the output. Each tag is listed in the first column, each attribute in the second column, and each value of that attribute in the third column. The final column is a count of how many times that value is used for that attribute with that tag.
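
A table of this kind can be approximated with a short Python sketch that counts each combination of tag, attribute, and value. The names TAG_RE, ATTR_RE, and count_attribute_values are hypothetical, and the patterns are simplified (for instance, they assume no attribute value contains a > character).

```python
import re
from collections import Counter

# Simplified patterns for illustration only.
TAG_RE = re.compile(r"<([A-Za-z][\w-]*)([^>]*)>")
ATTR_RE = re.compile(r'([^\s=/]+)\s*=\s*("[^"]*"|[^\s">]+)')

def count_attribute_values(html):
    """Count (tag, attribute, value) combinations found in the text."""
    counts = Counter()
    for tag in TAG_RE.finditer(html):
        name = tag.group(1).lower()
        for attr in ATTR_RE.finditer(tag.group(2)):
            value = attr.group(2).strip('"')
            counts[(name, attr.group(1).lower(), value)] += 1
    return counts

html = ('<p class="heading-1">A</p><p class="heading-1">B</p>'
        '<p class="body" align=center>C</p>')
for (tag, attribute, value), count in sorted(count_attribute_values(html).items()):
    print(tag, attribute, value, count)
```

A value such as heading-1 that shows up with its own count is exactly the kind of clue that helps you locate headings in a publisher's file.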

Using the DocAnalyzer Output File

Searching for Bad Characters

BookBuilder does not allow characters with an ASCII value above 127, but many text editors do. Those pesky right double quotes (”) and left single quotes (‘) that look so nice in your editor can wreak havoc on your beautiful HTML file as it goes through BookBuilder. DocAnalyzer's Character Count Table and UTF8 Character Count Table can help you get a handle on things.

Inspect the Character Count Table for any cells colored green. These are characters that are higher than ASCII 127 and are not allowed in BookBuilder. Finding them may be a harder task.

Our suggestion is to download a Unicode Text Editor, such as BabelPad (free download at http://www.babelstone.co.uk/Software/BabelPad.html) to help find problem characters. There is usually an option to Save As ASCII with HTML character entities which will solve your problems in a snap. Or you can use the Find/Replace feature to find those characters listed in DocAnalyzer and replace them. In any case, BookBuilder only allows 7-bit ASCII files, and DocAnalyzer is simply telling you that the file contains something out of that range.

The output file can be a bit deceptive when your original document contains characters encoded in UTF8. For instance, let's say you have a file with an Omega character saved in UTF8. DocAnalyzer analyzes the HTML file looking at each byte, assuming the file is an ASCII file. Therefore, when DocAnalyzer comes across a Unicode character like Omega (Ω) with a Unicode value of 937 it actually thinks it is 2 characters (ASCII 206 and ASCII 169) and will display them in the Character Count table. If you searched for these characters you may not find them since your text editor may correctly interpret them as the UTF8 character they represent.

The UTF8 Character Count Table is more helpful in this case. In our example it lists just one UTF8 character: the Omega. That's the character you need to look for in your original document and replace with the character entity &#937;.
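
You can check the Omega example yourself with a few lines of Python. The sketch also shows one way to turn every character above ASCII 127 into a numeric character entity, using Python's built-in xmlcharrefreplace error handler. The function name to_ascii_with_entities is our own.

```python
omega = "\u03a9"  # Greek capital letter Omega, Unicode value 937

# The two bytes a byte-by-byte analysis sees for a UTF-8 Omega.
utf8_bytes = list(omega.encode("utf-8"))
print(utf8_bytes)  # [206, 169]

def to_ascii_with_entities(text):
    """Replace each non-ASCII character with a numeric character entity."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_ascii_with_entities("Alpha and " + omega))  # Alpha and &#937;
```

Numeric entities are always available this way. Whether a named entity would be preferable for a given character is a separate decision.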

In any case, green in the DocAnalyzer output means something is wrong, and it needs to be fixed before you can run the file through BookBuilder.

Using Tag Information

Balancing Tags

BookBuilder requires both an opening and a closing tag for most of the HTML tags used, and the process will fail when it can't find them. DocAnalyzer lists each tag in the Tags table, which can help you make sure your tags are balanced (every start tag has an end tag).

Inspect the count of each tag in the Tags table and make sure the corresponding end tag has the same count. For instance, if your document has 4 <li> tags then it should have 4 </li> tags as well. Unfortunately DocAnalyzer can't help you find where missing tags ought to go, but it will get you pointed in the right direction.
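
The comparison of start-tag and end-tag counts can also be automated. The Python sketch below is one possible approach and is not a DocAnalyzer feature. The function name unbalanced_tags is hypothetical, and the default ignore list is only our guess at tags that normally have no closing tag; adjust it to match the tags in your book.

```python
import re
from collections import Counter

def unbalanced_tags(html, ignore=("br", "meta")):
    """Return {tag: (start_count, end_count)} for tags whose counts differ."""
    counts = Counter(
        m.group(1).lower() for m in re.finditer(r"<(/?[A-Za-z][\w-]*)", html)
    )
    problems = {}
    names = {name.lstrip("/") for name in counts}
    for name in sorted(names):
        if name in ignore:
            continue
        opened, closed = counts[name], counts["/" + name]
        if opened != closed:
            problems[name] = (opened, closed)
    return problems

# The second <li> is never closed.
print(unbalanced_tags("<ol><li>one</li><li>two</ol><br>"))
```

Like the Tags table itself, this only tells you which tags are out of balance, not where the missing tag belongs.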

Clues for Problems

DocAnalyzer can clue you in to potential problems through the counts of certain tags. For instance, the King James Version has 31,102 verses, so if you're doing a verse-by-verse commentary you should have 31,102 <pb_sync> tags with the attribute type equal to word. If you only have 200, you've probably missed something. Inspecting the counts of tags, especially those specific to BookBuilder, can help you find mistakes that might not cause an error in BookBuilder.
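
A count like this can be double-checked with a small Python sketch that counts tags having a particular attribute value and compares the result with the number you expect. The function name count_tags_with_value is hypothetical, the pattern is a simplification, and the sample markup below is invented for the example.

```python
import re

def count_tags_with_value(html, tag, attribute, value):
    """Count tags of the given name whose attribute equals the given value."""
    pattern = re.compile(
        r"<" + re.escape(tag) + r"\b[^>]*?\b" + re.escape(attribute)
        + r'\s*=\s*"?' + re.escape(value) + r'"?(?=[\s/>])',
        re.IGNORECASE,
    )
    return len(pattern.findall(html))

EXPECTED_VERSES = 31102  # verses in the King James Version

# Invented sample: two tags with type equal to word, one with another type.
sample = ('<pb_sync type=word value="a" /><pb_sync type="word" value="b" />'
          '<pb_sync type=date value="c" />')
found = count_tags_with_value(sample, "pb_sync", "type", "word")
if found != EXPECTED_VERSES:
    print("Expected", EXPECTED_VERSES, "tags but found", found)
```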

Contents

Overview

DocAnalyzer Options and Operations

Using DocAnalyzer to Improve Your Book

Copyright © 2012 by Laridian, Inc. All Rights Reserved. Laridian, PocketBible, and MyBible are registered trademarks of Laridian, Inc. BookBuilder, VerseLinker, and DocAnalyzer are trademarks of Laridian, Inc. Other marks are the property of their respective owners.