Added by Martin Mueller, last edited by Martin Mueller on Feb 23, 2008



The following is designed to move the discussion of the 'digital page' or 500 word arbitrary chunk to a practical level. I'll do it in a question and answer manner, hoping that this is a good way to define things collaboratively. I'll use the term 'chunk' here because it may be the simplest term to use.

If in the process of working through the technical details of this, a better solution emerges, we should not hesitate to adopt it.

I've put this on the wiki as a child of Analytics with the title "500 word chunks".

  1. What is a chunk?

A chunk is a text block of approximately 500 words created by dividing a <div> element at appropriate paragraph or comparable element boundaries. It is the lowest unit used for analytical operations.

  1. What is the purpose of such chunks?

The purpose of such arbitrary chunks is to create ready-made units of manageable and roughly comparable size that will support a variety of user operations. There are other techniques that would allow users to create customized text spans for a variety of purposes. Whether it will be possible to create such customizable procedures within MONK I is an open question. The creation of fixed chunks is not intended as a fully satisfactory alternative for all purposes, but as a first step it is designed as a good enough way of dealing with the complexities of the diverse mid-level structures of XML documents.

  1. How do you create chunks?

The earliest point to create a chunk is after tokenization and morphosyntactic tagging, at which time a text is an XML document in which every word is enclosed in a <w> element and multiply accounted for.

Chunks could be created before ingestion into the data store or as part of the ingestion. Chunks could be marked in the XML document with <milestone unit="chunk"/> empty elements, in which case the ingest process follows that segmentation. Alternatively, and perhaps preferably, the chunking could be done during ingestion.
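As a rough illustration of the milestone option, the following sketch inserts <milestone unit="chunk"/> elements into a div before ingestion. It assumes the chunk start positions are already known as indices into the div's child elements; the function name mark_chunks and the sample markup are hypothetical, not part of any MONK tool.

```python
import xml.etree.ElementTree as ET


def mark_chunks(div, chunk_starts):
    """Insert an empty <milestone unit="chunk"/> before each child index
    listed in chunk_starts (indices refer to the div's original children)."""
    # Work from the highest index down so earlier insertions
    # do not shift the positions that are still to be processed.
    for index in sorted(chunk_starts, reverse=True):
        div.insert(index, ET.Element("milestone", {"unit": "chunk"}))
    return div


# Hypothetical three-paragraph div; chunks start at paragraphs 0 and 2.
div = ET.fromstring(
    "<div><p><w>a</w></p><p><w>b</w></p><p><w>c</w></p></div>"
)
mark_chunks(div, [0, 2])
print(ET.tostring(div, encoding="unicode"))
```

An ingest process that follows this segmentation would then only need to start a new chunk whenever it meets a milestone.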

Either way, one can envisage a process in which a counter starts from the beginning of a div, identifies a point 500 words down the line, and chooses the element boundary that exceeds or falls short of that point by the fewest words.
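The counting process just described might look roughly like the sketch below. It assumes the word count of each boundary element in a div (for instance each <p>) has already been computed; the function name chunk_boundaries is hypothetical. Two choices here are assumptions rather than anything decided above: a tie goes to the longer chunk, and whatever is left at the end of the div becomes a final, possibly short, chunk.

```python
from typing import List


def chunk_boundaries(block_lengths: List[int], target: int = 500) -> List[int]:
    """Return the exclusive end index of each chunk, given the word
    counts of the successive boundary elements in one div."""
    ends = []
    running = 0  # words collected in the chunk currently being built
    for i, n in enumerate(block_lengths):
        overshoot = running + n - target
        if overshoot < 0:
            # Still short of the target: keep collecting.
            running += n
            continue
        undershoot = target - running
        if running > 0 and undershoot < overshoot:
            # Closing before this block lands nearer the target.
            ends.append(i)
            running = n
        else:
            # Closing after this block lands nearer the target (or ties).
            ends.append(i + 1)
            running = 0
    if running > 0:
        # Assumption: leftover words form a last, shorter chunk.
        ends.append(len(block_lengths))
    return ends


# Paragraph lengths 200, 250, 120, 300, 180, 90 give chunks of
# 450, 420 and 270 words.
print(chunk_boundaries([200, 250, 120, 300, 180, 90]))  # [2, 4, 6]
```

How to treat a short remainder (keep it, or merge it into the preceding chunk) is left open here, as it is in the discussion above.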

  1. What are the appropriate elements to serve as chunk boundaries?

The <p> element is by far the most common boundary. Other boundary elements are <l>, <lg>, <sp>, and <said>. Questions will arise with regard to <q> or <quote> elements that contain long passages.

  1. What <div> elements are chunked?

The answer to that question will vary with the nature of the document and requires curatorial knowledge. In the case of fiction, the chapter level is the obvious target for chunking. With plays, one might choose the act rather than the scene. Scenes vary much more widely in length than acts, and not all plays are divided into acts and scenes.

Dividing prose may be a more complicated business, and curators will need to look at texts or at least batches of them. In the world of sermons or witchcraft treatises, there are relatively short texts, where perhaps the most sensible decision is to divide the body into chunks.

  1. What kinds of IDs do chunks have?

A chunk is a child of sorts of some div, which in the data store has a hierarchical ID consisting of the workID and the numbers of the relevant divs. Chunks have unique IDs that consist of the divID and a running number.

It is probably desirable to express the word occurrence ID as a running number of the chunk. This would involve a process of translating the wordIDs from the morphadorned ingest file into a word occurrence ID that can be part of a MONK wide citation scheme. Such a project wide "chapter and verse" scheme has practical advantages in any collaborative setting.
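To make the ID scheme concrete, here is a minimal sketch of how such hierarchical IDs could be composed. The dot separator, the function names, and the sample work ID are all assumptions for illustration; the actual format of IDs in the data store is not specified here.

```python
from typing import Sequence


def make_div_id(work_id: str, div_numbers: Sequence[int], sep: str = ".") -> str:
    """workID followed by the numbers of the relevant divs."""
    return sep.join([work_id, *(str(n) for n in div_numbers)])


def make_chunk_id(div_id: str, chunk_number: int, sep: str = ".") -> str:
    """divID followed by the running number of the chunk within the div."""
    return f"{div_id}{sep}{chunk_number}"


def make_word_id(chunk_id: str, word_number: int, sep: str = ".") -> str:
    """chunkID followed by the running number of the word within the chunk."""
    return f"{chunk_id}{sep}{word_number}"


# Hypothetical work "work0001", second div, fifth sub-div,
# third chunk, 417th word of that chunk.
div_id = make_div_id("work0001", [2, 5])
chunk_id = make_chunk_id(div_id, 3)
print(make_word_id(chunk_id, 417))  # work0001.2.5.3.417
```

With a single separator the ID alone does not say where the div numbers end and the chunk number begins, so a real scheme would probably want a distinct marker for the chunk level.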

There remains the tricky question of how to assign word IDs for paratext.

  1. Chunks and screen display

From a display perspective, 500 word chunks are too long to fit on a single screen in most cases. A chunk that fits on a screen would have to be ~ 200 words. One problem with a 200 word chunk is that the relative variance of chunk size would increase, since paragraph lengths, and hence the distance to the nearest boundary, stay the same. If we assume that most 500 word chunks sit within 450-550 words, 200 word chunks would sit within 150-250 words. At the margins, those chunks are no longer of approximately equal size.
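The arithmetic behind this point can be spelled out with a small calculation. It takes the 50 word tolerance assumed above at face value; the function name size_spread is hypothetical.

```python
def size_spread(target: int, tolerance: int):
    """Return the tolerance as a fraction of the target size and the
    ratio of the largest to the smallest chunk within that tolerance."""
    smallest = target - tolerance
    largest = target + tolerance
    return tolerance / target, largest / smallest


for target in (500, 200):
    relative, ratio = size_spread(target, 50)
    print(f"{target} words: +/-{relative:.0%}, largest/smallest = {ratio:.2f}")
# 500 words: +/-10%, largest/smallest = 1.22
# 200 words: +/-25%, largest/smallest = 1.67
```

Under that assumption the largest 200 word chunk is two thirds bigger than the smallest, against less than a quarter for 500 word chunks.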

A possible solution to this is to split chunks arbitrarily in the middle, a division that serves only a display function and has no analytical purpose.

  1. Chunks and paratext

The relationship of chunks to paratext is complicated. Remember the reason for the division into main text and paratext. Many texts contain non-trivial amounts of stuff that does not quite belong and may interfere with analysis. The division into main text and paratext is a crude and automatic procedure for removing most of the stuff that does not belong (and typically some stuff that does belong) on the theory that the resulting "main text" makes for a better target of analysis. The procedure groups a lot of diverse stuff under the heading of "paratext," with no attempt to distinguish its many kinds. Thus "paratext" will not be by itself a useful target of analysis. If you don't like the distinction between main text and paratext, you can ignore it in your analysis.

The content of the elements that make up paratext should not be counted in the construction of chunks. Thus chunks are based on main text word counts. If you ignore the distinction, the chunks could be more variable in length. For instance, a <note> element could consist of 750 words. That would be rare, but it does happen.
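A main text word count of this kind could be sketched as follows. The code assumes every word sits in a <w> element and treats only <note> as paratext, because that is the only paratext element named above; the real list of paratext elements would be longer, and the names PARATEXT_TAGS and main_text_word_count are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: only <note> is listed; the full paratext list is not given here.
PARATEXT_TAGS = {"note"}


def main_text_word_count(elem, paratext_tags=PARATEXT_TAGS):
    """Count <w> elements under elem, skipping any paratext subtree."""
    if elem.tag in paratext_tags:
        return 0
    own = 1 if elem.tag == "w" else 0
    return own + sum(main_text_word_count(child, paratext_tags) for child in elem)


p = ET.fromstring(
    "<p><w>The</w><w>text</w>"
    "<note><w>an</w><w>aside</w><w>here</w></note>"
    "<w>resumes</w></p>"
)
print(main_text_word_count(p))   # 3 main text words
print(len(list(p.iter("w"))))    # 6 words if the distinction is ignored
```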

  1. Chunks and floatingText elements

The <floatingText> element is theoretically possible as a child of a <p> element but will in most cases occur at the level of the <p> element as a direct child of a <div>. The content of a <floatingText> element can vary greatly in length. It might be a three-line letter or it might be a 30 page inserted narrative with its own hierarchy of divs and paragraphs.

My hunch is that for the purposes of chunking the structure of the floatingText element needs to be flattened.
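One possible reading of "flattened" is that the chunker ignores the internal hierarchy of the <floatingText> and simply sees its boundary elements in document order alongside those of the surrounding div. The sketch below implements that reading only; the function name flatten_blocks is hypothetical and the set of boundary tags is taken from the list of boundary elements given earlier.

```python
import xml.etree.ElementTree as ET

# Boundary elements named earlier on this page.
BOUNDARY_TAGS = {"p", "l", "lg", "sp", "said"}


def flatten_blocks(elem, boundary_tags=BOUNDARY_TAGS):
    """Yield boundary elements under elem in document order, passing
    through wrappers such as <floatingText>, <body> or nested <div>
    without descending into a boundary element once it is found."""
    for child in elem:
        if child.tag in boundary_tags:
            yield child
        else:
            yield from flatten_blocks(child, boundary_tags)


div = ET.fromstring(
    "<div><p/>"
    "<floatingText><body><div><p/><lg><l/><l/></lg></div></body></floatingText>"
    "<p/></div>"
)
print([block.tag for block in flatten_blocks(div)])  # ['p', 'p', 'lg', 'p']
```

Whether a long inserted narrative should instead be chunked on its own is a separate question that this sketch does not settle.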

  1. Can you look for content in some elements regardless of chunks?

Elements differ considerably in their semantic constraints. Some elements are virtually meaningless except as indicators of some place in a hierarchy. A <div> or <p> element tells you nothing about what is inside it. Other elements are very expressive. An <l> element tells you that its content is verse. An <sp> or <said> element identifies its content as a spoken utterance. Some type attributes have a similar expressiveness. For instance, <div type="letter"> and <floatingText type="letter"> identify their content as a very specific form of written language, with conventions of its own that separate it sharply from the surrounding text and often bring it closer to spoken language.

There is a quite limited number of elements whose content is clearly marked in this way (perhaps I have enumerated most of them already). But users will want to look for stuff in those elements and ask, for instance, what words are distinctly more common in poetry than in prose.

The 500 word chunks will not draw those distinctions any more than divs do, but these are distinctions users will want to make.
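A query of that kind could be served by counting words according to the nearest expressive ancestor element rather than by chunk. The sketch below assumes words sit in <w> elements and compares only <l> (verse) with <p> (prose); the function name counts_by_container is hypothetical, and lowercasing the word forms is a simplification.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def counts_by_container(root, container_tags=("l", "p")):
    """Count word forms separately for each container tag, assigning
    every <w> to its nearest enclosing container."""
    counts = {tag: Counter() for tag in container_tags}

    def walk(elem, current):
        if elem.tag in counts:
            current = elem.tag
        if elem.tag == "w" and current is not None and elem.text:
            counts[current][elem.text.lower()] += 1
        for child in elem:
            walk(child, current)

    walk(root, None)
    return counts


doc = ET.fromstring(
    "<div><p><w>the</w><w>sea</w></p>"
    "<lg><l><w>O</w><w>sea</w></l><l><w>o</w><w>wind</w></l></lg></div>"
)
counts = counts_by_container(doc)
print(counts["l"]["o"], counts["p"]["o"])  # 2 0
```

Comparing the two counters, suitably normalized, would then show which words are more common in verse than in prose.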

  1. Is it necessary to precompute features in chunks?

There are two ways of thinking about precomputing stuff in chunks. On the one hand, you can think of the chunk as a big enough unit to make counts meaningful. But you can also think of chunks as units that obviate the need to precompute everything. Because chunks will be of approximately equal size, they may make good text samples that will support exploratory operations in an on-the-fly manner.