NIH-NCI Tobacco-Documents Project
at The University of Georgia
The Tobacco Documents Corpus: Archiving the Industry
As part of the 1998 Master Settlement Agreement between the Attorneys General of 46
states and the seven major U.S. tobacco industry organizations, these organizations,
companies such as RJ Reynolds and Philip Morris, agreed to release industry internal
documents to the public by putting them on company web sites. In all, more than 30
million pages of documents have been released (Legacy). Don Rubin realized that this
set of documents represented a previously unseen opportunity to look into industry
communication, as it covered the complete range of internal industry activity, from
research and manufacturing to marketing and sales, and not just texts intended
for public release. With this in mind, Dr. Rubin assembled a team of researchers in
linguistics (Bill Kretzschmar, Doug Biber, Roger Shuy), rhetorical analysis (Rod
Hart), and tobacco litigation (Bert Hirshhorn, Michael Cummings, Monique Muggli), and
in 2001 was awarded funding by the National Cancer Institute for a rhetorical
analysis of "deception" in the Tobacco Industry Documents.
The focus of this study is to assemble and analyse a corpus consisting of sets of
"manipulated" or "changed" documents, related either as successive drafts of the same
text, or by topic, but having differing audiences. That is, we want to see how texts
change with revision on their way to public release, or with a change in audience
from industry internal to public. However, early on we realized that there was no
suitable reference corpus with which we could compare our findings to determine if
they were a result of deceptive strategies, or simply the norm for this genre of
text. We had no way to judge what the norm of tobacco communication might be. To
remedy this, we decided to add another corpus to the project. As well as the corpus
of manipulated documents, which we call the rhetorical corpus, we are constructing a
representative corpus of tobacco documents, a reference corpus, by sampling the
documents released by the tobacco industry prior to July 1999.
Currently, our work on the rhetorical corpus continues. As you might imagine, it has
been difficult to find the types of document sets we want, but we have been able to
assemble about 50 sets. Our archivists are continuing the search. Our reference
corpus, however, has been assembled and is nearly ready to be released, and it is
this corpus which will be of the most use to the greater academic community. So, in
the next few minutes, I would like to briefly introduce you to the reference corpus
by explaining the process we have gone through in making it, and pointing out some of
the more interesting aspects. Our hope is that by understanding the process, one can
better use the product.
There are three major parts to the process we are using to complete the corpus:
sampling, archiving, and description. The first part of the process, the sampling of
the tobacco documents, was necessary as we didn't have the resources to include the
millions of documents which have now been released by the tobacco industry. We began
our sampling by narrowing the scope of our study to only those documents found in
what is known as the "digital snapshot", which represents the state of the tobacco
industry web sites as of July 1999. We also added the Bliley collection, a set of
documents subpoenaed and released by the Commerce Committee: in all, approximately
3.4 million documents. We preferred the fixed boundaries of these two document
sets, as well as the fact that they are now indexed and searchable online at a single
site (www.tobaccodocuments.org). We also have no reason to suspect that these
document sets, which are a large portion of the total (about 3/4), are
materially different from the entire set of tobacco documents.
The next step in our sampling was to draw a limited sample in order to determine the
types and proportions of texts in the "study set", that is, the snapshot plus the
Bliley documents. We decided to examine 0.01 percent of the study set or about 340
documents. In order to do this we divided the snapshot documents into six groups
based on the decade of origin. To these six groups we added the Bliley documents as
an additional group, making seven. The decade group 1950, because of the low number
of documents in the earlier decades, contains all the dated documents from 1900
through 1959. The group 19xx contains all the undated snapshot documents.
Link to [Page 1: Sampling Targets for Limited Sample]
For each decade group, we searched the online index and determined the number of
documents in the study set. From this we determined the proportion of the study set
the group represented, and the number of documents needed from the group for the
limited sample. We then randomly selected documents from the decade groups. We
deviated from our proportional targets only to take at least ten documents from each
of our groups, which meant selecting a few additional documents from the undated and
Bliley sets, and ending up with 349 documents.
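The proportional allocation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the group sizes below are invented so that they total 3.4 million, and the function name limited_sample_targets is hypothetical. Only the 0.01 percent rate and the floor of ten documents per group come from the description above.

```python
def limited_sample_targets(group_sizes, fraction=0.0001, minimum=10):
    """Proportional target per group, raised to `minimum` where it falls short."""
    targets = {}
    for group, size in group_sizes.items():
        proportional = round(size * fraction)  # 0.01 percent of the group
        targets[group] = max(proportional, minimum)
    return targets


# Hypothetical group sizes, chosen only so that they total 3.4 million.
# These are NOT the project's actual counts.
group_sizes = {
    "1950": 150_000,
    "1960": 300_000,
    "1970": 700_000,
    "1980": 1_100_000,
    "1990": 1_000_000,
    "19xx": 80_000,
    "Bliley": 70_000,
}

targets = limited_sample_targets(group_sizes)
for group, n in targets.items():
    print(f"{group:>6}: {n}")
print("total:", sum(targets.values()))
```

With these invented sizes the two smallest groups are raised to ten documents each, which is the same kind of adjustment that moved the real total from about 340 to 349.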
Once selected, the documents were then classified by three primary criteria: Source
(industry internal vs. industry external), Audience (industry internal vs. industry
external), and Addressee (named individual(s) vs. unnamed individuals). We also
attempted to classify the documents by their Public Health Significance (significant
vs. not significant), but the tobacco documents expert who carried out this task found that almost
all of the documents were significant to public health, so this distinction made
little difference for us. These criteria are clearly motivated by our intended
rhetorical analysis. We also considered the status of each document with respect to
several secondary classifications: whether it consisted of a form (like an invoice),
whether it consisted of an image with fewer than 50 words of text, whether it was
primarily written in English, whether it showed evidence of editing, whether it
contained marginalia, and whether the document was short (i.e. consisted of fewer
than 50 words of running text excluding replicated or standardized prose).
Link to [Page 2: Distribution by Classification Categories]
Although there are many interesting findings here, the two most notable are that
most documents have an industry internal audience (which is different from other
corpora), and that the tendencies were generally consistent across all seven of our
document groups. So, there did not appear to be any great deviation over time or
between the snapshot and Bliley documents.
Once we examined the classification data, we determined that for the purposes of our
study we should only collect documents generated by the tobacco industry, in English,
and with original prose of sufficient length to be rhetorically significant (which we
determined to be longer than 50 words). Eliminating documents which did not conform
to these criteria reduced the 349 documents collected for the limited sample to 202.
Link to [Page 3: Usable Documents from the Core Sample]
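One way to picture the classification record and the usability filter is the sketch below. The class name DocumentRecord, its field names, and the sample records are all hypothetical. The sketch also assumes that "generated by the tobacco industry" corresponds to an industry internal Source, which the text implies but does not state outright.

```python
from dataclasses import dataclass

MIN_WORDS = 50  # "longer than 50 words" in the text


@dataclass
class DocumentRecord:
    doc_id: str
    source_internal: bool    # Source: industry internal (True) or external
    audience_internal: bool  # Audience: industry internal (True) or external
    addressee_named: bool    # Addressee: named (True) or unnamed
    in_english: bool         # primarily written in English
    original_words: int      # running text, excluding standardized prose


def is_usable(doc):
    """Industry-generated, English, and more than MIN_WORDS of original prose."""
    return doc.source_internal and doc.in_english and doc.original_words > MIN_WORDS


# Hypothetical records for illustration only.
sample = [
    DocumentRecord("a", True, True, True, True, 420),
    DocumentRecord("b", False, True, False, True, 900),   # external source
    DocumentRecord("c", True, True, False, True, 35),     # too short
    DocumentRecord("d", True, False, False, False, 300),  # not in English
]
usable = [doc.doc_id for doc in sample if is_usable(doc)]
print(usable)
```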
What this meant is that the same sampling procedure could not be used for the
reference sample. Thus, the next step in sampling was to use the data from the
limited sample to develop quotas for sampling the reference corpus. This was done by
cross tabulation of the data from the two remaining major classifications:
internal/external audience and named/unnamed audience. This yielded four document
types based on audience (which you can see in figure 4), and the quotas for sampling
based on sets of 202 documents. We determined that we had the funding to compile a
corpus of approximately 500,000 words, and from the limited sample determined the
average number of words per document to be approximately 600. Thus, we expected that
4 sets would be needed (which was the case as we ended up with about 520,000 words).
Link to [Page 4: Relative Frequencies by Classification (Sampling Quotas)]
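The cross tabulation and the estimate of four sets can be reproduced with a short sketch. The classified list is made-up data and the function names are hypothetical. The figures passed to sets_needed (500,000 words, about 600 words per document, sets of 202) are the ones given above.

```python
from collections import Counter


def audience_quotas(classified):
    """Count documents in each (audience, addressee) cell."""
    return Counter(classified)


def sets_needed(target_words, words_per_doc, docs_per_set):
    """Number of quota sets, rounded to the nearest whole set."""
    return round(target_words / words_per_doc / docs_per_set)


# Hypothetical classifications, not the project's data.
classified = [
    ("internal", "named"), ("internal", "named"), ("internal", "unnamed"),
    ("internal", "unnamed"), ("internal", "unnamed"), ("external", "unnamed"),
]
quotas = audience_quotas(classified)
for cell, count in sorted(quotas.items()):
    print(cell, count)

# Figures from the text: 500,000 words, about 600 words per document, sets of 202.
n_sets = sets_needed(500_000, 600, 202)
print("sets needed:", n_sets)
```

The arithmetic is simply 500,000 / 600, or roughly 833 documents, which is a little over four sets of 202.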
The reference corpus documents were then gathered by decade group. For each
document, a year from the decade group and a month were randomly selected. Using the
online index, this year and month were searched, and from the list of documents
returned a document was randomly selected and accepted or rejected according to the
quotas for the decade group. This was repeated until all quotas were filled.
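The accept-or-reject loop for one decade group might look like the following sketch. The online index is replaced by a stand-in function (fake_search), the classification by another (fake_classify), and the quota labels and numbers are invented. Skipping a document that has already been accepted and capping the number of draws are assumptions added here, not steps reported above.

```python
import random
from collections import Counter


def fill_quotas(quotas, years, search_index, classify, rng, max_draws=100_000):
    """Draw random (year, month) pairs and documents until every quota is filled."""
    remaining = dict(quotas)
    accepted = []
    for _ in range(max_draws):
        if not any(remaining.values()):
            break  # all quotas filled
        year = rng.choice(years)
        month = rng.randint(1, 12)
        hits = search_index(year, month)
        if not hits:
            continue
        doc = rng.choice(hits)
        doc_type = classify(doc)
        # Accept only unseen documents whose type still has room in its quota.
        if doc not in accepted and remaining.get(doc_type, 0) > 0:
            remaining[doc_type] -= 1
            accepted.append(doc)
    return accepted


def fake_search(year, month):
    """Stand-in for the online index: twenty made-up document ids per month."""
    return [f"{year}-{month:02d}-{i}" for i in range(20)]


def fake_classify(doc):
    """Stand-in for hand classification, based on the made-up serial number."""
    serial = int(doc.rsplit("-", 1)[1])
    return "internal-named" if serial % 2 == 0 else "internal-unnamed"


quotas = {"internal-named": 5, "internal-unnamed": 3}  # hypothetical quotas
accepted = fill_quotas(quotas, list(range(1970, 1980)), fake_search,
                       fake_classify, random.Random(0))
print(len(accepted), "documents accepted")
```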
Following these quotas, almost 97 percent of the documents collected for the
reference corpus had an internal audience. In order to make clearer comparisons
between internal and external audience documents, we decided to create a supplement
to the reference corpus of 100 external audience documents. For this we followed the
same procedures used for the reference corpus, but added the external audience
requirement. This supplemental corpus will be made available with, but separate
from, the reference corpus.
In beginning the second part of the process, archiving the documents, we immediately
encountered obstacles. The first is that the documents are stored by the tobacco
industry online as digital images, generally tif or pdf, or on paper in
repositories. Thus, there are none of the plain ascii text files needed for computer
analysis. To compound the problem, the image quality is generally poor, which
precludes the use of scanning and OCR software as the reliability is too low for
computer analysis. Another problem was the document complexity. We quickly learned
that there are no fixed or normal formats for tobacco industry documents. In fact,
there is no clear definition of what constitutes a document, so many sets of pages
labeled as "documents" actually consist of multiple documents, or parts of multiple
documents. Added to this is the regular occurrence of stamps, filing directives,
marginalia and editing marks. And finally, tobacco industry documents come with a
great deal of non-document indexing data, or metadata, which we wanted to maintain
with the archive.
Link to [Page 5: A Sample Document: ]
The solution to these problems, as anyone in this room knows, is hand keying the data
and encoding it using XML. Which we have done. XML provides us with an easy means
to maintain both the structure of the document and the metadata associated with it.
Although we investigated using existing XML tag sets, TEI in particular, we chose to
devise a set of tags specific to our project. The main reason for this is to
simplify the task of data entry and recovery. In terms of document structure, TEI
does offer a very flexible and configurable set of tags which could be adapted to our
purposes. However, we consider these tags to be difficult to use in practice because
their very flexibility makes them complex, and complexity leads to error in
keyboarding. In the same manner, our primary interest is in archiving rhetorically
significant text and events rather than the typesetting conventions used to represent
them. Thus although TEI includes a full set of tags to indicate typographical
conventions, use of these tags for our purposes might lead to ambiguities. For
example, italics, boldface, and underlining have all been found to denote emphasis in
the document set, which is rhetorically significant; however, they have also been
found to denote titles, headings, names, quotations, formulas, and standard text,
which may be of little value for our analysis. To counter these issues, we have
devised a set of 44 simple tags (that is, tags with explicit names and few
attributes) which accommodates the structural complexity of the original documents,
and which reflects the purpose of our study. We should make it clear, however, that
we do support the idea of having standardized tags, but we feel that this can be
accomplished more efficiently with an XSLT transformation after data entry.
Link to [Page 6: Sample XML]
Now moving to XSLT. The problem one has with richly encoded XML files is that they
contain too much data. That is, most linguistic and rhetorical analyses, both
traditional and computer assisted, are done on uncoded ascii text, and in many cases,
only on portions of that text. To extract data from our XML files, we employ XSLT in
a straightforward manner. We have embedded the expat XSLT engine into several Python
scripts. This allows us to quickly assemble a sub-corpus of ascii text from the
larger corpora, according to the needs of the research. This is done by simply
modifying a template XSL stylesheet and rerunning the transformation. That is, with
our scripts the XML files are parsed, desired tag content is selected, and the
selected content is assembled and written to an ASCII text file for later analysis.
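The parse, select, and assemble steps can be illustrated without an XSLT engine. The sketch below uses Python's standard xml.etree.ElementTree module in place of XSLT, so it shows the idea rather than the project's scripts. The tag names (document, metadata, body, paragraph, marginalia) are hypothetical and are not the project's 44-tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using invented tag names.
SAMPLE = """<document id="d1">
  <metadata><date>1975-03</date></metadata>
  <body>
    <paragraph>First paragraph of running text.</paragraph>
    <marginalia>handwritten note</marginalia>
    <paragraph>Second paragraph.</paragraph>
  </body>
</document>"""


def extract_text(xml_string, wanted_tags):
    """Return the text of every element whose tag is in `wanted_tags`, one per line."""
    root = ET.fromstring(xml_string)
    pieces = []
    for element in root.iter():
        if element.tag in wanted_tags:
            # Join nested text and collapse runs of whitespace.
            text = " ".join("".join(element.itertext()).split())
            if text:
                pieces.append(text)
    return "\n".join(pieces)


text = extract_text(SAMPLE, {"paragraph"})
print(text)
```

Changing the set of wanted tags plays the role that editing the template stylesheet plays in the process described above. Note that if one wanted tag were nested inside another, this simple version would emit the inner text twice.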
There are, however, two notable differences between the standard Web use of XSLT and
that of our project. The first is data permanence. The output of most of our XSLT
processing is ascii text which is written to file for later analysis rather than HTML
sent onto the Internet. The other difference is that the XSLT output is not solely
determined by the XSL stylesheet. For ease and speed in processing, some general
document and tag selection is done by regular expressions in the Python script prior
to calling the expat program.
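A regular-expression pass of this kind could be as simple as the sketch below, which picks out files by a date pattern before any XML parsing is done. The file names, the date tag, and the function name preselect are hypothetical, and the actual patterns used in the project are not given here.

```python
import re


def preselect(raw_documents, pattern):
    """Return names of documents whose raw XML matches `pattern`."""
    compiled = re.compile(pattern)
    return [name for name, raw in raw_documents.items() if compiled.search(raw)]


# Hypothetical file names and a hypothetical <date> tag.
raw_documents = {
    "doc001.xml": "<document><date>1964-07</date><body>text</body></document>",
    "doc002.xml": "<document><date>1972-11</date><body>text</body></document>",
    "doc003.xml": "<document><date>1979-01</date><body>text</body></document>",
}

# Keep only documents dated in the 1970s; only these would go on to be transformed.
seventies = preselect(raw_documents, r"<date>197\d")
print(seventies)
```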
As an example of how we are using XSLT to generate text sub-corpora, and how we will
be allowing others to do it online in the future, I have written a simple HTML/CGI
interface that mimics the process we use. Basically the CGI script reads the form
data, generates a stylesheet, and performs the transformation. The output is text.
I have colored it for demonstration purposes. (The server username is 'bob' and the
password is 'robert'.)
Link to [Server]
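The stylesheet-generation step of such a script can be sketched as follows. The function build_stylesheet and the tag names passed to it are hypothetical. The generated XSLT 1.0 stylesheet emits the text of each selected element on its own line and suppresses everything else. Running the transformation itself needs an XSLT processor, which is not shown. The sketch only checks that the generated stylesheet is well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def build_stylesheet(tags):
    """Build an XSLT 1.0 stylesheet that outputs the text of the given elements."""
    for tag in tags:
        # Reject anything that is not a plain element name (form data is untrusted).
        if not NAME.fullmatch(tag):
            raise ValueError(f"not a plain element name: {tag!r}")
    templates = "".join(
        f'  <xsl:template match="{tag}">\n'
        '    <xsl:value-of select="normalize-space(.)"/>\n'
        '    <xsl:text>&#10;</xsl:text>\n'
        '  </xsl:template>\n'
        for tag in tags
    )
    return (
        f'<xsl:stylesheet version="1.0" xmlns:xsl="{XSL_NS}">\n'
        '  <xsl:output method="text"/>\n'
        '  <!-- suppress text that is not inside a selected element -->\n'
        '  <xsl:template match="text()"/>\n'
        f'{templates}'
        '</xsl:stylesheet>\n'
    )


# Hypothetical tag names, as they might arrive from the HTML form.
stylesheet = build_stylesheet(["paragraph", "heading"])
root = ET.fromstring(stylesheet)  # raises ParseError if not well-formed
print(stylesheet)
```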
The final stage in the process of corpus assembly is describing the content of the
reference corpus. This is now in the works. We are proceeding with this in two
ways: by comparing and contrasting the reference corpus with other well known corpora
of comparable size (that is, external), such as the Brown Corpus and Freiburg-Brown
Corpus, and by comparing and contrasting major metadata divisions within the corpus
itself, such as between decade group, major classifications, and industry source
(that is, internal). In the remaining few minutes I would like to show you some
initial findings.
The first is a very simple statistic, but a very telling one: type-token ratio. Here
the reference corpus is compared to the Brown and Frown corpora. For the comparison
I have sampled every other word from the Brown and Frown corpora to reduce the size
to that of the reference corpus. What one sees is that the Brown and Frown are very
similar, which one would expect from corpora of this size. However, the reference
corpus on the average falls 10,000 types short of the other two.
Link to [Page 7: Type-Token Ratio]
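For readers who want to repeat this comparison, the calculation is easy to sketch. The tokenizer below (lower-cased runs of letters and apostrophes) is an assumption, since the tokenization used for the figure is not described here, and the sample sentence is invented.

```python
import re


def tokenize(text):
    """Very simple tokenizer: lower-cased runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def every_other_word(tokens):
    """Halve a corpus by keeping every second token, starting with the first."""
    return tokens[::2]


def type_count(tokens):
    """Number of distinct word forms (types)."""
    return len(set(tokens))


def type_token_ratio(tokens):
    """Types divided by tokens; 0.0 for an empty list."""
    return type_count(tokens) / len(tokens) if tokens else 0.0


tokens = tokenize("The cat sat on the mat. The end.")
halved = every_other_word(tokens)
print(len(tokens), type_count(tokens), type_token_ratio(tokens))
print(len(halved), type_count(halved), type_token_ratio(halved))
```

Because the ratio depends heavily on corpus length, reducing the larger corpora to the size of the reference corpus before counting types is what makes the comparison meaningful.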
The next series of graphs deals with frequencies of words and collocates in the
decade groups as compared to the reference corpus as a whole. The statistic used
returns a z-score based on the proportions of an item in each group. Here the
z-scores are graphed over the decade groups. A negative score indicates that the
occurrence of a word or collocate is lower than expected during that period. A
positive score indicates higher than expected. This allows one to trace the history
of a word or collocate over time. In these graphs, all the items have at least one
decade group in which the magnitude of their z-score exceeds 1.96.
Link to [Page 8: SEX]
Link to [Page 9: GENDER]
Link to [Page 10: CANCER]
Link to [Page 11: AMONG]
Link to [Page 12: MARKETING]
Link to [Page 13: GROUP 1]
Link to [Page 14: GROUP 2]
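The exact statistic behind these graphs is not spelled out here. One common choice, offered purely as an assumption, is a one-sample z-test that compares an item's proportion in a decade group with its proportion in the whole corpus. The function name and the counts below are invented for illustration.

```python
import math


def proportion_z(group_count, group_size, corpus_count, corpus_size):
    """z-score of an item's proportion in one group against the whole corpus."""
    expected = corpus_count / corpus_size  # proportion in the whole corpus
    observed = group_count / group_size    # proportion in the decade group
    if not 0 < expected < 1:
        raise ValueError("corpus proportion must be strictly between 0 and 1")
    standard_error = math.sqrt(expected * (1 - expected) / group_size)
    return (observed - expected) / standard_error


# Invented counts: a word occurring 100 times in 500,000 words overall,
# and 30 times in a decade group of 100,000 words.
z_high = proportion_z(30, 100_000, 100, 500_000)
z_flat = proportion_z(20, 100_000, 100, 500_000)
print(round(z_high, 2), round(z_flat, 2))
```

Under this assumption, a score whose magnitude exceeds 1.96 corresponds to the conventional 5 percent significance level for a two-tailed test.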
Well, my time is up. I hope to have piqued your interest in our corpora and in our
methods. If you would like further information, you can find it, even my email
address, on our website.
NIH-NCI Tobacco-Documents Project at the University of
Georgia (Grant # 1 RO1 CA87490-01).