wiki:BookIndexingAlgorithm
Last modified 10 years ago Last modified on 04/14/08 02:27:41

Book Indexing Algorithm

See NewBookIndexingAlgorithm for revised info.

The items indexed by the following procedure are NavNodes, SMIL IDs, and text SRCs. Relationships are established between SMIL IDs and NavNodes (one to many), and text SRCs and SMIL IDs (one to one).

Call stack (first call at the top):

  • Dtb::open (AmisCore?/dtb/Dtb.cpp)
  • Dtb::processNcc (AmisCore?/dtb/Dtb.cpp)
  • ResolveSmilDataVisitor::resolve (AmisCore?/dtb/nav/NavVisitor.cpp)

Walk the navigation tree and list file.smil#id for each NavPoint and PageTarget This is useful for knowing the range of each NavPoint/PageTarget.

Data structures in ResolveSmilDataVisitor

  • NodeRefMap mpMap : the map of NavNodes to SMIL IDs (one ID to many nodes)
  • StringMap mpTextMap : a map of text SRCs and the SMIL files they occur in
  • StdStringList mBigIdList : all the IDs in all the SMIL files (listed as file.smil#id)

Steps

For the function ResolveSmilDataVisitor::resolve(...) in AmisCore?/dtb/nav/NavVisitor.cpp

Iterate over the list of SMIL files in the spine of the book.

Side process: Get all the nav nodes for this SMIL file

DAISY 2.02 books don't include the audio files in the NCC. We want the audio file references.

While iterating through the spine:

  • First use the WhoRefersToThisSmilFile visitor to get a NavNodeList of the NavNodes who refer to this SMIL ID.
  • Then include the list in the QuickDataSmilFileReader so that the audio data gets picked up and filled in as the file is parsed.

Back to the main loop

Use QuickDataSmilFileReader on each SMIL file. The following things occur during the parsing:

  • Make a list of all SMIL IDs (StdStringList mpList)
  • Make a map of all text SRC's and the SMIL IDs (going from the SRC to the ID)
  • If given a non-NULL NavNodeList: fill in the audio references, using the content field of the NavNode to find out what SMIL element it points to

After the data has been gathered

Visit the NavMap and the PageList and work out the SMIL ID ranges for each node. Use mBigIdList plus the content field for each node to calculate the range. E.g.:

mBigIdList[node->content] until and not including mBigIdList[next_node->content]

Store the results in mpMap, going from the SMIL ID to the nodes that point to it. Many nodes can point to a single SMIL ID (a page number announcement is part of the section it's in, and also part of the page it's in).

So what about text searching?

During the parse done by QuickDataSmilFileReader, a map was created of all the text SRCs and the SMIL IDs. When AMIS searches, this happens:

  • know the starting point (a text ID)
  • search the text file
  • find out the ID of the element in the text file containing the found text
  • match up this ID with using the map described above
  • load the SMIL file for that text ID

And why collect node ranges too?

We collect node ranges so that we know how long a section is, where it begins and ends (good for sidebar highlighting if you go backwards by phrase through the book), and what pages it contains.

What's the problem

The problem is that this process is very slow. The following improvements are suggested:

  • Only collect data for the current section or current SMIL file
  • Add to the data as the book gets played or if the current data set is insufficient for, say, locating a text reference

Note

Right now (11 April), relative text searching is broken. See #36. This is completely unrelated to what is described here.