NIH-NCI Tobacco-Documents Project
at The University of Georgia


The Tobacco Documents Corpus: Archiving the Industry


As part of the 1998 Master Settlement Agreement between the Attorneys General of 46 
states and the seven major U.S. tobacco industry organizations, these organizations, 
companies such as RJ Reynolds and Philip Morris, agreed to release industry internal 
documents to the public by putting them on company web sites.  In all, more than 30 
million pages of documents have been released (Legacy).  Don Rubin realized that this 
set of documents represented a previously unseen opportunity to look into industry 
communication, as it covered the complete range of internal industry activity, from 
research and manufacturing to marketing and sales.  It was not just texts intended 
for public release.  With this in mind Dr. Rubin assembled a team of researchers in 
linguistics (Bill Kretzschmar, Doug Biber, Roger Shuy), rhetorical analysis (Rod 
Hart), and tobacco litigation (Bert Hirshhorn, Michael Cummings, Monique Muggli), and 
in 2001 was awarded funding by the National Cancer Institute for a rhetorical 
analysis of "deception" in the Tobacco Industry Documents.

The focus of this study is to assemble and analyse a corpus consisting of sets of 
"manipulated" or "changed" documents, related either as successive drafts of the same 
text, or by topic, but having differing audiences.  That is, we want to see how texts 
change with revision on their way to public release, or with a change in audience 
from industry internal to public.  However, early on we realized that there was no 
suitable reference corpus with which we could compare our findings to determine if 
they were a result of deceptive strategies, or simply the norm for this genre of 
text.  We had no way to judge what the norm of tobacco communication might be.  To 
remedy this, we decided to add another corpus to the project.  As well as the corpus 
of manipulated documents, which we call the rhetorical corpus, we are constructing a 
representative corpus of tobacco documents, a reference corpus, by sampling the 
documents released by the tobacco industry prior to July 1999.

Currently, our work on the rhetorical corpus continues.  As you might imagine, it has 
been difficult to find the types of document sets we want, but we have been able to 
assemble about 50 sets.  Our archivists are continuing the search.  Our reference 
corpus, however, has been assembled and is nearly ready to be released, and it is 
this corpus which will be of the most use to the greater academic community.  So, in 
the next few minutes, I would like to briefly introduce you to the reference corpus 
by explaining the process we have gone through in making it, and pointing out some of 
the more interesting aspects.  Our hope is that by understanding the process, one can 
better use the product.

There are three major parts to the process we are using to complete the corpus: 
sampling, archiving, and description.  The first part of the process, the sampling of 
the tobacco documents, was necessary as we didn't have the resources to include the 
millions of documents which have now been released by the tobacco industry.  We began 
our sampling by narrowing the scope of our study to only those documents found in 
what is known as the "digital snapshot", which represents the state of the tobacco 
industry web sites as of July 1999.  We also added the Bliley collection, a set of 
documents subpoenaed and released by the Commerce Committee: in all, approximately 
3.4 million documents.  We preferred the fixed boundaries of these two document 
sets, as well as the fact that they are now indexed and searchable online at a single 
site (www.tobaccodocuments.org).  We also have no reason to suspect that these 
document sets, which are a large portion of the total (about 3/4), are 
materially different from the entire set of tobacco documents.

The next step in our sampling was to draw a limited sample in order to determine the 
types and proportions of texts in the "study set", that is, the snapshot plus the 
Bliley documents.  We decided to examine 0.01 percent of the study set or about 340 
documents.  In order to do this we divided the snapshot documents into six groups 
based on the decade of origin.  To these six groups we added the Bliley documents as 
an additional group, making seven.  The decade group 1950, because of the low number 
of documents in the earlier decades, contains all the dated documents from 1900 
through 1959.  The group 19xx contains all the undated snapshot documents.

Link to [Page 1: Sampling Targets for Limited Sample]

For each decade group, we searched the online index and determined the number of 
documents in the study set.  From this we determined the proportion of the study set 
the group represented, and the number of documents needed from the group for the 
limited sample.  We then randomly selected documents from the decade groups.  We 
deviated from our proportional targets only to take at least ten documents from each 
of our groups, which meant selecting a few additional documents from the undated and 
Bliley sets, and ending up with 349 documents. 
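
The allocation just described can be sketched in a few lines of Python.  This is only an illustration: the group counts below are invented to sum to 3.4 million and are not the project's actual figures, and the function name and parameters are hypothetical.

```python
def allocate_sample(group_sizes, fraction=0.0001, minimum=10):
    """Return a per-group sample target proportional to group size.

    Any group whose proportional share falls below `minimum` is raised
    to `minimum`, so the final total can exceed the nominal target.
    """
    total = sum(group_sizes.values())
    target_total = round(total * fraction)  # 0.01 percent of the study set
    targets = {}
    for group, size in group_sizes.items():
        proportional = round(target_total * size / total)
        targets[group] = max(proportional, minimum)
    return targets


# Hypothetical counts for the six snapshot groups plus Bliley (assumed values).
EXAMPLE_SIZES = {
    "1950": 150_000,
    "1960": 350_000,
    "1970": 700_000,
    "1980": 1_000_000,
    "1990": 1_050_000,
    "19xx": 80_000,
    "Bliley": 70_000,
}

if __name__ == "__main__":
    for group, n in allocate_sample(EXAMPLE_SIZES).items():
        print(group, n)
```

With these invented counts the two smallest groups are lifted to the floor of ten, which is why the total ends up slightly above the nominal 340.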

Once selected, the documents were then classified by three primary criteria: Source 
(industry internal vs. industry external), Audience (industry internal vs. industry 
external), and Addressee (named individual(s) vs. unnamed individuals).  We also 
attempted to classify the documents by their Public Health Significance (significant 
vs. not significant), but the TD expert who carried out this task found that almost 
all of the documents were significant to public health, so this distinction made 
little difference for us.  These criteria are clearly motivated by our intended 
rhetorical analysis.  We also considered the status of each document with respect to 
several secondary classifications: whether it consisted of a form (like an invoice), 
whether it consisted of an image with fewer than 50 words of text, whether it was 
primarily written in English, whether it showed evidence of editing, whether it 
contained marginalia, and whether the document was short (i.e. consisted of fewer 
than 50 words of running text excluding replicated or standardized prose).

Link to [Page 2: Distribution by Classification Categories]

Although there are many interesting findings here, the two most notable are that 
most documents have an industry internal audience (which is different from other 
corpora), and that the tendencies were generally consistent across all seven of our 
document groups.  So, there did not appear to be any great deviation over time or 
between the snapshot and Bliley documents.

Once we examined the classification data, we determined that for the purposes of our 
study we should only collect documents generated by the tobacco industry, in English, 
and with original prose of sufficient length to be rhetorically significant (which we 
determined to be longer than 50 words).  Eliminating documents which did not conform 
to these criteria reduced the 349 documents collected for the limited sample to 202.  
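
As a rough sketch, the three usability criteria amount to a simple filter over the classified documents.  The record type and field names here are hypothetical and are not the project's actual data format.

```python
from dataclasses import dataclass

MIN_WORDS = 50  # threshold stated in the talk: longer than 50 words


@dataclass
class SampledDocument:
    """One classified document from the limited sample (hypothetical fields)."""
    doc_id: str
    industry_source: bool      # generated by the tobacco industry
    in_english: bool           # primarily written in English
    original_word_count: int   # running text, excluding standardized prose


def is_usable(doc):
    """Apply the three criteria: industry source, English, long enough."""
    return (
        doc.industry_source
        and doc.in_english
        and doc.original_word_count > MIN_WORDS
    )


def usable_documents(docs):
    """Keep only the documents that satisfy all three criteria."""
    return [doc for doc in docs if is_usable(doc)]
```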

Link to [Page 3: Usable Documents from the Core Sample]

What this meant is that the same sampling procedure could not be used for the 
reference sample.  Thus, the next step in sampling was to use the data from the 
limited sample to develop quotas for sampling the reference corpus.  This was done by 
cross tabulation of the data from the two remaining major classifications: 
internal/external audience and named/unnamed addressee.  This yielded four document 
types based on audience (which you can see in figure 4), and the quotas for sampling 
based on sets of 202 documents.  We determined that we had the funding to compile a 
corpus of approximately 500,000 words, and from the limited sample determined the 
average number of words per document to be approximately 600.  Thus, we expected that 
4 sets would be needed (which was the case as we ended up with about 520,000 words).
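
To make the arithmetic concrete, here is a minimal Python sketch of the cross tabulation and of the estimate of how many sets of 202 documents are needed.  The dictionary keys and function names are assumptions made for the illustration.

```python
from collections import Counter


def audience_quotas(documents):
    """Cross-tabulate audience by addressee and count documents per cell.

    Each document is a dict with the (hypothetical) keys "audience"
    ("internal" or "external") and "addressee" ("named" or "unnamed").
    """
    return Counter((d["audience"], d["addressee"]) for d in documents)


def sets_needed(target_words, words_per_doc, docs_per_set):
    """Estimate how many quota sets reach the target corpus size."""
    docs_needed = target_words / words_per_doc
    return round(docs_needed / docs_per_set)


if __name__ == "__main__":
    # 500,000 words at about 600 words per document, in sets of 202.
    print(sets_needed(500_000, 600, 202))
```

About 833 documents are needed for 500,000 words, and 833 divided by 202 is a little over 4, which matches the four sets reported.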

Link to [Page 4: Relative Frequencies by Classification (Sampling Quotas)]

The reference corpus documents were then gathered by decade group.  For each 
document, a year from the decade group and a month were randomly selected.  Using the 
online index, this year and month were searched, and from the list of documents 
returned a document was randomly selected and accepted or rejected according to the 
quotas for the decade group.  This was repeated until all quotas were filled.  
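
The selection loop for one decade group might look like the following Python sketch.  The `search` and `classify` callables stand in for the online index lookup and the manual classification step; both are hypothetical placeholders, and skipping already-accepted documents is an assumption the talk does not state.

```python
import random


def fill_quotas(years, quotas, search, classify, rng=None, max_tries=100_000):
    """Draw random documents until every quota for a decade group is met.

    years    -- sequence of years in the decade group
    quotas   -- dict mapping document type to the number still wanted
    search   -- callable (year, month) -> list of document ids (placeholder)
    classify -- callable (doc_id) -> document type (placeholder)
    """
    rng = rng or random.Random()
    remaining = dict(quotas)
    accepted = []
    seen = set()
    tries = 0
    while any(remaining.values()) and tries < max_tries:
        tries += 1
        year = rng.choice(years)        # random year from the decade group
        month = rng.randint(1, 12)      # random month
        hits = search(year, month)
        if not hits:
            continue
        doc = rng.choice(hits)          # random document from the result list
        if doc in seen:
            continue                    # assumed: no document is taken twice
        doc_type = classify(doc)
        if remaining.get(doc_type, 0) > 0:
            remaining[doc_type] -= 1    # accept: the quota still has room
            accepted.append(doc)
            seen.add(doc)
    return accepted
```

The `max_tries` cap is only a safety net for the sketch, so that an unfillable quota cannot loop forever.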

Following these quotas, almost 97 percent of the documents collected for the 
reference corpus had an internal audience.  In order to make clearer comparisons 
between internal and external audience documents, we decided to create a supplement 
to the reference corpus of 100 external audience documents.  For this we followed the 
same procedures used for the reference corpus, but added the external audience 
requirement.  This supplemental corpus will be made available with, but separate 
from, the reference corpus.

In beginning the second part of the process, archiving the documents, we immediately 
encountered obstacles.  The first is that the documents are stored by the tobacco 
industry online as digital images, generally tif or pdf, or on paper in 
repositories.  Thus, there are none of the plain ascii text files needed for computer 
analysis.  To compound the problem, the image quality is generally poor, which 
precludes the use of scanning and OCR software as the reliability is too low for 
computer analysis.  Another problem was the document complexity.  We quickly learned 
that there are no fixed or normal formats for tobacco industry documents.  In fact, 
there is no clear definition of what constitutes a document, so many sets of pages 
labeled as a "document" actually consist of multiple documents, or parts of multiple 
documents.  Added to this is the regular occurrence of stamps, filing directives, 
marginalia and editing marks.  And finally, tobacco industry documents come with a 
great deal of non-document indexing data, or metadata, which we wanted to maintain 
with the archive.

Link to [Page 5: A Sample Document: ]

The solution to these problems, as anyone in this room knows, is hand keying the data 
and encoding it using XML.  Which we have done.  XML provides us with an easy means 
to maintain both the structure of the document and the metadata associated with it.  
Although we investigated using existing XML tag sets, TEI in particular, we chose to 
devise a set of tags specific to our project.  The main reason for this is to 
simplify the task of data entry and recovery.  In terms of document structure, TEI 
does offer a very flexible and configurable set of tags which could be adapted to our 
purposes.  However, we consider these tags to be difficult to use in practice because 
their very flexibility makes them complex, and complexity leads to error in 
keyboarding.  In the same manner, our primary interest is in archiving rhetorically 
significant text and events rather than the typesetting conventions used to represent 
them. Thus although TEI includes a full set of tags to indicate typographical 
conventions, use of these tags for our purposes might lead to ambiguities. For 
example, italics, boldface, and underlining have all been found to denote emphasis in 
the document set, which is rhetorically significant; however, they have also been 
found to denote titles, headings, names, quotations, formulas, and standard text, 
which may be of little value for our analysis.  To counter these issues, we have 
devised a set of 44 simple tags (that is, tags with explicit names and few 
attributes) which accommodates the structural complexity of the original documents, 
and which reflects the purpose of our study.  We should make it clear, however, that 
we do support the idea of having standardized tags, but we feel that this can be 
accomplished more efficiently with an XSLT transformation after data entry. 

Link to [Page 6: Sample XML]

Now moving to XSLT.  The problem one has with richly encoded XML files is that they 
contain too much data.  That is, most linguistic and rhetorical analyses, both 
traditional and computer assisted, are done on uncoded ascii text, and in many cases, 
only on portions of that text.  To extract data from our XML files, we employ XSLT in 
a straightforward manner. We have embedded the expat XSLT engine into several Python 
scripts. This allows us to quickly assemble a sub-corpus of ascii text from the 
larger corpora, according to the needs of the research.  This is done by simply 
modifying a template XSL stylesheet and rerunning the transformation. That is, with 
our scripts the XML files are parsed, desired tag content is selected, and the 
selected content is assembled and written to an ASCII text file for later analysis.  
There are, however, two notable differences between the standard Web use of XSLT and 
that of our project. The first is data permanence. The output of most of our XSLT 
processing is ascii text which is written to file for later analysis rather than HTML 
sent onto the Internet. The other difference is that the XSLT output is not solely 
determined by the XSL stylesheet. For ease and speed in processing, some general 
document and tag selection is done by regular expressions in the Python script prior 
to calling the expat program.  
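
The overall shape of this pipeline (regular-expression preselection, parsing, tag selection, plain-text output) can be sketched with the Python standard library alone.  Note that this sketch substitutes `xml.etree.ElementTree` for the XSLT step, so it is a stand-in rather than our actual scripts, and the tag names `document`, `meta`, `paragraph` and `marginalia` are invented for the example rather than taken from our tag set.

```python
import re
import xml.etree.ElementTree as ET


def extract_text(xml_strings, wanted_tags, prefilter=None):
    """Collect the text of selected tags from a series of XML documents.

    xml_strings -- iterable of XML documents as strings
    wanted_tags -- set of tag names whose text content should be kept
    prefilter   -- optional regular expression; documents whose raw text
                   does not match are skipped before any parsing is done
    """
    pattern = re.compile(prefilter) if prefilter else None
    chunks = []
    for xml_text in xml_strings:
        if pattern and not pattern.search(xml_text):
            continue                      # cheap document selection by regex
        root = ET.fromstring(xml_text)    # parse the XML
        for element in root.iter():
            if element.tag in wanted_tags:
                # Join nested text and normalize whitespace.
                text = " ".join("".join(element.itertext()).split())
                if text:
                    chunks.append(text)
    return "\n".join(chunks)


# Invented sample document; the tag names are hypothetical.
SAMPLE = """<document decade="1970">
  <meta><title>Memo</title></meta>
  <paragraph>Sales rose in the third quarter.</paragraph>
  <marginalia>see J.S.</marginalia>
  <paragraph>Further tests are planned.</paragraph>
</document>"""

if __name__ == "__main__":
    print(extract_text([SAMPLE], {"paragraph"}, prefilter=r'decade="1970"'))
```

If one wanted tag can be nested inside another, the inner text would be collected twice; a real stylesheet or script would need to guard against that.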

As an example of how we are using XSLT to generate text sub-corpora, and how we will 
be allowing others to do it online in the future, I have written a simple HTML/CGI 
interface that mimics the process we use.  Basically the CGI script reads the form 
data, generates a stylesheet, and performs the transformation.  The output is text.  
I have colored it for demonstration purposes.  (The server username is 'bob' and the
password is 'robert'.)
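
The stylesheet-generation step can be illustrated with a short Python function that builds an XSLT 1.0 stylesheet from a list of tag names, as a form handler might.  This is a sketch under assumptions, not the CGI script itself: the function name, the name check, and the example tag names are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Conservative pattern for tag names arriving from untrusted form data.
TAG_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_.-]*$")

TEMPLATE = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="{select}">
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
"""


def build_stylesheet(tags):
    """Return an XSLT stylesheet that outputs the text of the given tags."""
    tags = list(tags)
    if not tags:
        raise ValueError("at least one tag name is required")
    for tag in tags:
        if not TAG_NAME.match(tag):
            raise ValueError(f"not a plain tag name: {tag!r}")
    select = " | ".join(f"//{tag}" for tag in tags)
    sheet = TEMPLATE.format(select=select)
    ET.fromstring(sheet)  # fail early if the result is not well-formed XML
    return sheet


if __name__ == "__main__":
    print(build_stylesheet(["paragraph", "heading"]))
```

Checking the tag names before they are placed in the `select` expression matters here because the names come from a web form.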

Link to [Server]

The final stage in the process of corpus assembly is describing the content of the 
reference corpus.  This is now in the works.  We are proceeding with this in two 
ways: by comparing and contrasting the reference corpus with other well known corpora 
of comparable size (that is, external), such as the Brown Corpus and Freiburg-Brown 
Corpus, and by comparing and contrasting major metadata divisions within the corpus 
itself, such as between decade group, major classifications, and industry source 
(that is, internal).  In the remaining few minutes I would like to show you some 
initial findings.  

The first is a very simple statistic, but a very telling one: type-token ratio.  Here 
the reference corpus is compared to the Brown and Frown corpora.  For the comparison 
I have sampled every other word from the Brown and Frown corpora to reduce the size 
to that of the reference corpus.  What one sees is that the Brown and Frown are very 
similar, which one would expect from corpora of this size.  However, the reference 
corpus on the average falls 10,000 types short of the other two.
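
For anyone who wants to reproduce this kind of comparison, a minimal Python sketch of the type and token counts, including the every-other-word reduction, follows.  The tokenizer is a deliberately crude assumption and not the one used for the figures in the talk.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def every_other(tokens):
    """Keep every second token, roughly halving the size of a corpus."""
    return tokens[::2]


def type_token_counts(tokens):
    """Return (number of types, number of tokens, type-token ratio)."""
    types = set(tokens)
    n = len(tokens)
    ratio = len(types) / n if n else 0.0
    return len(types), n, ratio


if __name__ == "__main__":
    tokens = tokenize("The cat saw the other cat. The dog didn't.")
    print(type_token_counts(tokens))
    print(type_token_counts(every_other(tokens)))
```

Because the type-token ratio depends strongly on corpus size, the corpora have to be brought to a comparable number of tokens before the counts mean anything.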

Link to [Page 7: Type-Token Ratio]

The next series of graphs deals with frequencies of words and collocates in the 
decade groups as compared to the reference corpus as a whole.  The statistic used 
returns a z-score based on the proportions of an item in each group.  Here the 
z-scores are graphed over the decade groups.  A negative score indicates that the 
occurrence of a word or collocate is lower than expected during that period.  A 
positive score indicates higher than expected.  This allows one to trace the history 
of a word or collocate over time.  In these graphs, all the items have at least one 
decade group in which their z-score exceeds ±1.96.
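
One common way to obtain such a z-score is to compare the proportion of an item in a decade group with its proportion in the corpus as a whole.  The Python sketch below uses that formulation as an assumption; it is not necessarily the exact statistic behind the graphs, and the counts in the example are invented.

```python
import math


def proportion_z(count_in_group, group_size, count_in_corpus, corpus_size):
    """z-score of an item's proportion in a group against the whole corpus.

    Uses the corpus-wide proportion p as the expected value and
    sqrt(p * (1 - p) / group_size) as the standard error (an assumed
    formulation, chosen for illustration).
    """
    p = count_in_corpus / corpus_size
    if p <= 0.0 or p >= 1.0:
        raise ValueError("corpus proportion must lie strictly between 0 and 1")
    p_group = count_in_group / group_size
    standard_error = math.sqrt(p * (1 - p) / group_size)
    return (p_group - p) / standard_error


if __name__ == "__main__":
    # Invented counts: 30 hits in a 10,000-word group, 100 in 100,000 overall.
    z = proportion_z(30, 10_000, 100, 100_000)
    print(round(z, 2), "significant" if abs(z) > 1.96 else "not significant")
```

The 1.96 cutoff corresponds to the conventional two-tailed 5 percent level for a standard normal distribution.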

Link to [Page 8: SEX]
Link to [Page 9: GENDER]
Link to [Page 10: CANCER]
Link to [Page 11: AMONG]
Link to [Page 12: MARKETING]
Link to [Page 13: GROUP 1]
Link to [Page 14: GROUP 2]

Well, my time is up.  I hope to have piqued your interest in our corpora and in our 
methods.  If you would like further information, you can find it, even my email 
address, on our website.


NIH-NCI Tobacco-Documents Project at the University of Georgia (Grant # 1 RO1 CA87490-01).