Notes on writing research papers

Department of Computer Science and Engineering, University of California at San Diego
CSE 254, Spring 2007

Content. The criterion "well-chosen starting point for new research" is important. Do not "reinvent the wheel," and make your new contributions as general as possible. In particular, you should discuss broad algorithmic ideas separately from application-specific heuristics. You should be creative but skeptical about reusing previous work from other fields. For example, there is no strong reason to expect a particular weighting method that happens to be useful in traditional information retrieval, such as tf/idf, to be especially useful for real-time web search.

Experiments. When designing and discussing an experiment, use appropriate independent and dependent variables. Always think about and discuss explicitly the issue of statistical significance for experimental comparisons, even if your conclusion is that you cannot evaluate significance quantitatively. In particular, when you report a mean, you should also report the corresponding standard deviation s. Then you can use the standard error, s·n^-0.5 where n is the number of measurements averaged, to say whether the difference between two means is statistically significant, i.e. reproducible.
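The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the measurements are hypothetical, the function names are made up for this example, and combining the two standard errors as the square root of the sum of their squares assumes that the two sets of measurements are independent.

```python
import math
import statistics


def summarize(measurements):
    """Return (mean, sample standard deviation s, standard error s * n**-0.5)."""
    n = len(measurements)
    mean = statistics.mean(measurements)
    s = statistics.stdev(measurements)  # sample standard deviation (n - 1 denominator)
    return mean, s, s / math.sqrt(n)


def difference_in_standard_errors(a, b):
    """Difference between two means, in units of their combined standard error.

    Assumes the two sets of measurements are independent.
    """
    mean_a, _, se_a = summarize(a)
    mean_b, _, se_b = summarize(b)
    combined = math.sqrt(se_a ** 2 + se_b ** 2)
    return (mean_a - mean_b) / combined


# Hypothetical running times in seconds for two methods, five runs each.
method_a = [10.0, 12.0, 11.0, 13.0, 9.0]
method_b = [14.0, 15.0, 13.0, 16.0, 12.0]

mean_a, s_a, se_a = summarize(method_a)
mean_b, s_b, se_b = summarize(method_b)
ratio = difference_in_standard_errors(method_a, method_b)

print(f"method A: mean {mean_a:.2f}, s {s_a:.2f}, standard error {se_a:.2f}")
print(f"method B: mean {mean_b:.2f}, s {s_b:.2f}, standard error {se_b:.2f}")
print(f"difference between means: {ratio:.2f} combined standard errors")
```

A common rule of thumb, not stated in these notes, is that a difference of more than about two combined standard errors is unlikely to be due to chance alone. With few measurements, a formal test such as a t-test is more reliable than this rule.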

Abstract. Remember that when a paper is published, many people read the abstract but never read the paper. Make the abstract as useful as possible to these readers. In particular, an abstract should be as specific and concrete as possible, while remaining less technical than the whole paper. For example, do not write "Certain modifications improve the pages found vs. pages searched performance ratio, while some do not." Instead, describe briefly but explicitly the methods that do work, and your most important experiments, results, and lessons learned. Only if you still have space should you mention methods that do not work. Typical good abstracts are 150 to 200 words long.

Introduction. Avoid exaggerated claims, and claims that may not be true or cannot be proved. Avoid unnecessary metaphors. For example, do not write "The web is a sea of information, containing trillions of pages..." or "As the web grows, search engines that provide exhaustive indexing become less effective for searches in a specific genre."

Organization. Some conferences, journals, or research areas have the convention that all papers use the same section titles. Otherwise, including in most of computer science, you should use informative titles, not clichés like "Problem Statement" or single words like "Results." Do not use identical sentences in the abstract and elsewhere in a paper, and do not repeat the same arguments anywhere. It is fine for the concluding section of a paper to be very short.

Write in the present tense as much as possible, and organize the description of your work logically. Avoid writing in the past tense, and avoid any hint of chronological organization. For example, do not write "... changing it to fit our needs proved difficult ... In the end, we decided to ..." Present your work in a mostly impersonal way. Use "we" and "our" when convenient, but not continuously.

General writing. Avoid non sequiturs such as "It is important to have an accurate web crawler to maintain currency of online indexes." Currency means recency here, and accuracy and recency are not obviously linked. In addition, it is not obvious what "accurate" means concerning a web crawler. The sentence should be expanded into an argument that explains the link between accuracy and currency, as well as the intended meaning of these terms.

Be precise and clear in all descriptions, and use simple declarative sentences as much as possible. Precision and clarity lead you step by step to new insights. A straightforward list of observations, in a logical order, is an excellent way to organize most technical descriptions and analyses.

Do not use a comma to concatenate two sentences that should be separate. Avoid category errors such as "In a web crawler, the search task traverses links..." Avoid weak jokes and irrelevant allusions such as "Search for the Holy Grail." Learn once and for all to avoid common spelling mistakes such as "can not" and "effect" where "affect" is correct. ("Affect" is usually a verb while "effect" is usually a noun.) Avoid footnotes, and also avoid phrases and sentences in parentheses.

The best book on the basics of good writing is The Elements of Style by William Strunk Jr. and E. B. White, Macmillan, New York, third edition, 1979. The full text of the 1918 edition is out of copyright and available online. After you have mastered the mechanics of writing, the next challenge is to develop a sense of style. The book Clear and Simple as the Truth: Writing Classic Prose by Francis-Noel Thomas and Mark Turner is a wonderful treatise on the topic of writing style. Be sure to explore the authors' online guide to good writing.

Mathematical writing. Instead of using a multiletter identifier such as NDocs, use a single letter identifier with an explanation that is a full sentence such as "... where n is the number of documents in the training corpus." Use the simplest and most sober possible notation: avoid non-Roman letters, unusual symbols, and boldface as much as possible. When defining a novel concept, use function arguments instead of subscripts and superscripts. Follow standards in notation. For example, write "z = log xy" not "z = log ( x*y )". Never use the same letter or symbol with two different meanings.

Whenever possible, give an entire equation instead of just a formula, i.e. a fragment of an equation. Equations and their surrounding sentences should be written to flow naturally, with as little punctuation as possible. For example, do not write this:

"The weight of each keyword in a document d, is defined as: w_kd = f_kd x idf_k where f_kd is the frequency of a keyword k in a document d (the term frequency)."

Instead, write this:

"The weight of a term k in a document d is w(k,d) = f(k,d)t(k) where f(k,d) is the number of times k appears in d and t(k) is the inverse document frequency of k."

An equation should be displayed, i.e. centered on its own line, if it is long or if you need to give it a number in order to refer to it later. Whether displayed or not, each equation should be part of a complete sentence.
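As an illustration, the recommended sentence above could be typeset in LaTeX with a displayed, numbered equation roughly as follows. This is a sketch only. The label name eq:weight is an arbitrary choice made for this example.

```latex
The weight of a term $k$ in a document $d$ is
\begin{equation}
  w(k,d) = f(k,d)\,t(k)
  \label{eq:weight}
\end{equation}
where $f(k,d)$ is the number of times $k$ appears in $d$ and $t(k)$ is the
inverse document frequency of $k$.
```

Later text can then refer to the equation by number with \eqref{eq:weight}, which requires the amsmath package, or with (\ref{eq:weight}). The displayed equation is still part of a complete sentence, with no colon before it.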

Figures. Use diagrams, charts, and tables when useful, with full labels and captions. Charts should not contain "chartjunk," i.e. unnecessary background lines or shading. The labels on axes and plotted points and curves should be informative. On axes, units should be made clear and numerical values should be well-chosen round numbers. The origin should be at zero whenever possible.

It is difficult to produce charts that are readable yet compact with Excel, which is not recommended. Instead, Matlab and gnuplot are recommended. Xgfe is a good graphical frontend for gnuplot. Gnuplot can write charts in eps format for inclusion in LaTeX documents. Note that Xgfe generates command line input that you can capture, edit, and run directly inside gnuplot. Doing so is useful for fine-tuning figures and for overcoming minor bugs in Xgfe such as xlabel misspelled as xlable. When generating output files with gnuplot, you must delete by hand any existing file with the same name as the desired gnuplot output file, to overcome a gnuplot file-handling bug.

Bibliography. References are like comments in software: adding them at the end misses the point. As you progress through a project, you should be looking for related published work that can give you ideas and save you work. As you find these references, you should cite them immediately in project planning documents, notes, and paper drafts. References should be complete. As an absolute minimum, each reference should contain the precise, correct title, the last names and initials of at least the first three authors, the exact title of the journal or proceedings or book, the year of publication, and correct page numbers. BibTeX makes all this easy.