Pandoc Github



The timely publication of scientific results is essential for dynamic advances in science. The ubiquitous availability of computers which are connected to a global network made the rapid and low-cost distribution of information through electronic channels possible. New concepts, such as Open Access publishing and preprint servers, are currently changing the traditional print media business towards a community-driven peer production. However, the cost of scientific literature generation, which is charged to readers, authors or sponsors, is still high. The main active participants in the authoring and evaluation of scientific manuscripts are volunteers, and the cost for online publishing infrastructure is close to negligible. A major time and cost factor is the formatting of manuscripts in the production stage. In this article we demonstrate the feasibility of writing scientific manuscripts in plain markdown (MD) text files, which can be easily converted into common publication formats, such as PDF, HTML or EPUB, using pandoc. The simple syntax of markdown assures the long-term readability of raw files and facilitates the development of software and workflows. We show the implementation of typical elements of scientific manuscripts – formulas, tables, code blocks and citations – and present tools for editing, collaborative writing and version control. We give an example of how to prepare a manuscript with distinct output formats: a DOCX file for submission to a journal, and a LaTeX/PDF version for deposition as a PeerJ preprint. Further, we implemented new features for supporting ‘semantic web’ applications, such as the ‘journal article tag suite’ (JATS) and the ‘citation typing ontology’ (CiTO) standard. Reducing the work spent on manuscript formatting translates directly to time and cost savings for writers, publishers, readers and sponsors. Therefore, the adoption of the MD format contributes to the agile production of open science literature.
Pandoc Scholar is freely available from https://github.com/pandoc-scholar.

Keywords: open science, document formats, markdown, latex, publishing, typesetting

Agile development of science depends on the continuous exchange of information between researchers (Woelfle, Olliaro, and Todd 2011). In the past, physical copies of scientific works had to be produced and distributed. Therefore, publishers needed to invest considerable resources for typesetting and printing. Since the journals were mainly financed by their subscribers, their editors not only had to decide on the scientific quality of a submitted manuscript, but also on the potential interest to their readers. The availability of globally connected computers enabled the rapid exchange of information at low cost. Yochai Benkler (2006) predicts important changes in the information production economy, which are based on three observations:

  1. A nonmarket motivation in areas such as education, arts, science, politics and theology.
  2. The actual rise of nonmarket production, made possible through networked individuals and coordinate effects.
  3. The emergence of large-scale peer production, e.g. of software and encyclopedias.

Immaterial goods such as knowledge and culture are not lost when consumed or shared – they are ‘nonrival’ – and they enable a networked information economy, which is not commercially driven (Benkler 2006).

Preprints and e-prints

In some areas of science a preprint culture, i.e. a paper-based exchange system of research ideas and results, already existed when Paul Ginsparg in 1991 initiated a server for the distribution of electronic preprints – ‘e-prints’ – about high-energy particle theory at the Los Alamos National Laboratory (LANL), USA (Ginsparg 1994). Later, the LANL server moved with Ginsparg to Cornell University, USA, and was renamed as arXiv (Butler 2001). Currently, arXiv (https://arxiv.org/) publishes e-prints related to physics, mathematics, computer science, quantitative biology, quantitative finance and statistics. Just a few years after the start of the first preprint servers, their important contribution to scientific communication was evident (Ginsparg 1994; Youngen 1998; C. Brown 2001). In 2014, arXiv reached the impressive number of 1 million e-prints (Van Noorden 2014).

In more conservative areas, such as chemistry and biology, accepting publication prior to peer review took more time (C. Brown 2003). A preprint server for life sciences (http://biorxiv.org/) was launched by the Cold Spring Harbor Laboratory, USA, in 2013 (Callaway 2013). PeerJ preprints (https://peerj.com/preprints/), started in the same year, accepts manuscripts from biological sciences, medical sciences, health sciences and computer sciences.

The terms ‘preprints’ and ‘e-prints’ are used synonymously, since the physical distribution of preprints has become obsolete. A major drawback of preprint publishing is the sometimes restrictive policies of scientific publishers. The SHERPA/RoMEO project provides information about the copyright policies and self-archiving options of individual publishers (http://www.sherpa.ac.uk/romeo/).

Open Access

The term ‘Open Access’ (OA) was introduced in 2002 by the Budapest Open Access Initiative and was defined as:

“Barrier-free access to online works and other resources. OA literature is digital, online, free of charge (gratis OA), and free of needless copyright and licensing restrictions (libre OA).” (Suber 2012)

Frustrated by the difficulty of accessing even digitized scientific literature, three scientists founded the Public Library of Science (PLoS). In 2003, PLoS Biology was published as the first fully Open Access journal for biology (P. O. Brown, Eisen, and Varmus 2003; M. Eisen 2003).

Thanks to the great success of OA publishing, many conventional print publishers now offer a so-called ‘Open Access option’, i.e. to make accepted articles free to read for an additional payment by the authors. The copyright in these hybrid models might remain with the publisher, whilst fully OA journals usually provide a liberal license, such as the Creative Commons Attribution 4.0 International (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/).

OA literature is only one component of a more general open philosophy, which also includes the access to scholarships, software, and data (Willinsky 2005). Interestingly, there are several different ‘schools of thought’ on how to understand and define Open Science, as well as the position that any science is open by definition, because of its objective to make generated knowledge public (Fecher and Friesike 2014).

Cost of journal article production

In a recent study, the article processing charges (APCs) for research intensive universities in the USA and Canada were estimated to be about 1,800 USD for fully OA journals and 3,000 USD for hybrid OA journals (Solomon and Björk 2016). PeerJ (https://peerj.com/), an OA journal for biological and computer sciences launched in 2013, drastically reduced the publishing cost, offering its members a life-time publishing plan for a small registration fee (Van Noorden 2012); alternatively the authors can choose to pay an APC of 1,095 USD, which may be cheaper, if multiple co-authors participate.

Examples such as the Journal of Statistical Software (JSS, https://www.jstatsoft.org/) and eLife (https://elifesciences.org/) demonstrate the possibility of completely community-supported OA publications. Fig. 1 compares the APCs of different OA publishing business models.

JSS and eLife are peer-reviewed and indexed by Thomson Reuters. Both journals are located in the Q1 quality quartile in all their registered subject categories of the Scimago Journal & Country Rank (http://www.scimagojr.com/), demonstrating that high-quality publications can be produced without charging the scientific authors or readers.

Figure 1. Article processing charges (APCs) that authors have to pay under different Open Access (OA) publishing models. Data from Solomon and Björk (2016) and journal web pages.

In 2009, a study was carried out concerning the “Economic Implications of Alternative Scholarly Publishing Models”, which demonstrates an overall societal benefit of using OA publishing models (Houghton et al. 2009). In the same report, the real publication costs are evaluated. The relative costs of an article for the publisher are represented in Fig. 2.

Figure 2. Estimated publishing cost for a ‘hybrid’ journal (conventional with Open Access option). Data from Houghton et al. (2009).

Conventional publishers justify their high subscription or APC prices with the added value they provide, e.g. journalism (stated in the graphics as ‘non-article processing’). However, stakeholder profits, which could be as high as 50%, must also be considered, since they are withdrawn from the science budget (Van Noorden 2013).

Generally, the production costs of an article can be roughly divided into commercial and academic/technical costs (Fig. 2). For nonmarket production, the commercial costs such as margins/profits, management etc. can be drastically reduced. Hardware and services for hosting an editorial system, such as Open Journal Systems of the Public Knowledge Project (https://pkp.sfu.ca/ojs/), can be provided by public institutions. Employed scholars can perform editor and reviewer activities without additional cost for the journals. Nevertheless, ‘article processing’, which includes the manuscript handling during peer review and production, represents the most expensive part.

Therefore, we investigated a strategy for the efficient formatting of scientific manuscripts.

Current standard publishing formats

Generally speaking, a scientific manuscript is composed of contents and formatting. While the content, i.e. text, figures, tables, citations etc., may remain the same between different publishing forms and journal styles, the formatting can be very different. Most publishers require the formatting of submitted manuscripts in a certain format. Ignoring this Guide for Authors, e.g. by submitting a manuscript with a different reference style, gives a negative impression with a journal’s editorial staff. Too carelessly prepared manuscripts can even provoke a straight ‘desk-reject’ (Volmer and Stokes 2016).

Currently, DOC(X), LaTeX and/or PDF are the file formats most frequently used on journal submission platforms. But even if the content of a submitted manuscript might be accepted during the peer review ‘as is’, the format still needs to be adjusted to the particular publication style in the production stage. For the electronic distribution and archiving of scientific works, which is gaining more and more importance, additional formats (EPUB, (X)HTML, JATS) need to be generated. Tab. 1 lists the file formats which are currently the most relevant ones for scientific publishing.

Although the content elements of documents, such as title, author, abstract, text, figures, tables, etc., remain the same, the syntax of the file formats is rather different. Tab. 2 demonstrates some simple examples of differences in different markup languages.

Documents with the commonly used Office Open XML (DOCX Microsoft Word files) and OpenDocument (ODT LibreOffice) file formats can be opened in a standard text editor after unzipping. However, content and formatting information is distributed into various folders and files. Practically speaking, those file formats require the use of special word processing software.

From a writer’s perspective, the use of What You See Is What You Get (WYSIWYG) programs such as Microsoft Word, WPS Office or LibreOffice might be convenient, because the formatting of the document is directly visible. But the complicated syntax specifications often result in problems when using different software versions and for collaborative writing. Simple conversions between file formats can be difficult or impossible. In a worst-case scenario, ‘old’ files cannot be opened any more for lack of compatible software.

In some parts of the scientific community therefore LaTeX, a typesetting program in plain text format, is very popular. With LaTeX, documents with highest typographic quality can be produced. However, the source files are cluttered with LaTeX commands and the source text can be complicated to read. Causes of compilation errors in LaTeX are sometimes difficult to find. Therefore, LaTeX is not very user friendly, especially for casual writers or beginners.

Table 1. Current standard formats for scientific publishing.

| Type | Description | Use | Syntax | Reference |
|:-----|:------------|:----|:-------|:----------|
| DOCX | Office Open XML | WYSIWYG editing | XML, ZIP | (Ngo 2006) |
| ODT | OpenDocument | WYSIWYG editing | XML, ZIP | (Brauer et al. 2005) |
| PDF | portable document | print replacement | PDF | (International Organization for Standardization 2013) |
| EPUB | electronic publishing | e-books | HTML5, ZIP | (Eikebrokk, Dahl, and Kessel 2014) |
| JATS | journal article tag suite | journal publishing | XML | (National Information Standards Organization 2012) |
| LaTeX | typesetting system | high-quality print | TeX | (Lamport 1994) |
| HTML | hypertext markup | websites | (X)HTML | (Raggett et al. 1999; Hickson et al. 2014) |
| MD | Markdown | lightweight markup | plain text MD | (Ovadia 2014; Leonard 2016) |

Table 2. Examples for formatting elements and their implementations in different markup languages.

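| Element | Markdown | LaTeX | HTML |
|:--------|:---------|:------|:-----|
| structure | | | |
| section | `# Intro` | `\section{Intro}` | `<h1>Intro</h1>` |
| subsection | `## History` | `\subsection{History}` | `<h2>History</h2>` |
| text style | | | |
| bold | `**text**` | `\textbf{text}` | `<b>text</b>` |
| italics | `*text*` | `\textit{text}` | `<i>text</i>` |
| links | | | |
| HTTP link | `<https://arxiv.org>` | `\usepackage{url} \url{https://arxiv.org}` | `<a href='https://arxiv.org'></a>` |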

In academic publishing, it is additionally desirable to create different output formats from the same source text:

  • For the publishing of a book, with a print version in PDF and an electronic version in EPUB.
  • For the distribution of a seminar script, with an online version in HTML and a print version in PDF.
  • For submitting a journal manuscript for peer-review in DOCX, as well as a preprint version with another journal style in PDF.
  • For archiving and exchanging article data using the Journal Article Tag Suite (JATS) (National Information Standards Organization 2012), a standardized format developed by the NLM.

Some of the tasks can be performed e.g. with LaTeX, but an integrated solution remains a challenge. Several programs for the conversion between document formats exist, such as the e-book library program calibre (http://calibre-ebook.com/). But the results of such conversions are often not satisfactory and require substantial manual corrections.

Therefore, we were looking for a solution that enables the creation of scientific manuscripts in a simple format, with the subsequent generation of multiple output formats. The need for hybrid publishing has been recognized outside of science (DPT Collective 2015; Kielhorn 2011), but the requirements specific to scientific publishing have not been addressed so far. Therefore, we investigated the possibility to generate multiple publication formats from a simple manuscript source file.

Markdown was originally developed by John Gruber in collaboration with Aaron Swartz, with the goal to simplify the writing of HTML documents http://daringfireball.net/projects/markdown/. Instead of coding a file in HTML syntax, the content of a document is written in plain text and annotated with simple tags which define the formatting. Subsequently, the Markdown (MD) files are parsed to generate the final HTML document. With this concept, the source file remains easily readable and the author can focus on the contents rather than formatting. Despite its original focus on the web, the MD format has been proven to be well suited for academic writing (Ovadia 2014). In particular, pandoc-flavored MD (http://pandoc.org/) adds several extensions which facilitate the authoring of academic documents and their conversion into multiple output formats. Tab. 2 demonstrates the simplicity of MD compared to other markup languages. Fig. 3 illustrates the generation of various formatted documents from a manuscript in pandoc MD. Some relevant functions for scientific texts are explained below in more detail.

Figure 3. Workflow for the generation of multiple document formats with pandoc. The markdown (MD) file contains the manuscript text with formatting tags, and can also refer to external files such as images or reference databases. The pandoc processor converts the MD file to the desired output formats. Documents, citations etc. can be defined in style files or templates.

The usability of a text editor is important for the author, either writing alone or with several co-authors. In this section we present software and strategies for different scenarios. Fig. 4 summarizes various options for local or networked editing of MD files.

Figure 4. Markdown files can be edited on local devices or on cloud drives. A local or remote git repository enables advanced version control.

Markdown editors

Due to MD’s simple syntax, basically any text editor is suitable for editing markdown files. The formatting tags are written in plain text and are easy to remember. Therefore, the author is not distracted by looking around for layout options with the mouse. For several popular text editors, such as vim (http://www.vim.org/), GNU Emacs (https://www.gnu.org/software/emacs/), atom (https://atom.io/) or geany (http://www.geany.org/), plugins provide additional functionality for markdown editing, e.g. syntax highlighting, command helpers, live preview or structure browsing.

Various dedicated markdown editors have been published as well. Many of those are cross-platform compatible, such as Abricotine (http://abricotine.brrd.fr/), ghostwriter (https://github.com/wereturtle/ghostwriter) and CuteMarkEd (https://cloose.github.io/CuteMarkEd/).

The lightweight format is also ideal for writing on mobile devices. Numerous applications are available on the App stores for Android and iOS systems. The programs Swype and Dragon (http://www.nuance.com/) facilitate the input of text on such devices by guessing words from gestures and speech recognition (dictation).

Fig. 5 shows the editing of a markdown file, using the cross-platform editor Atom with several markdown plugins.

Figure 5. Document directory tree, editing window and HTML preview using the Atom editor.

Online editing and collaborative writing

Storing manuscripts on network drives (The Cloud) has become popular for several reasons:

  • Protection against data loss.
  • Synchronization of documents between several devices.
  • Collaborative editing options.

Markdown files on a Google Drive (https://drive.google.com) for instance can be edited online with StackEdit (https://stackedit.io). Fig. 6 demonstrates the online editing of a markdown file on an ownCloud (https://owncloud.com/) installation. OwnCloud is an Open Source software platform, which allows the set-up of a file server on personal webspace. The functionality of an ownCloud installation can be enhanced by installing plugins.

Figure 6. Direct online editing of this manuscript with live preview using the ownCloud Markdown Editor plugin by Robin Appelman.

Even mathematical formulas are rendered correctly in the HTML live preview window of the ownCloud markdown plugin (Fig. 6).

The collaboration and authoring platform Authorea (https://www.authorea.com/) also supports markdown as one of multiple possible input formats. This can be beneficial for collaborations in which one or more authors are not familiar with markdown syntax.

Document versioning and change control

Programmers, especially when working in distributed teams, rely on version control systems to manage changes of code. Currently, Git (https://git-scm.com/), which is also used e.g. for the development of the Linux kernel, is one of the most employed software solutions for versioning. Git allows the parallel work of collaborators and has an efficient merging and conflict resolution system. A Git repository may be used by a single local author to keep track of changes, or by a team with a remote repository, e.g. on github (https://github.com/) or bitbucket (https://bitbucket.org/). Because of the plain text format of markdown, Git can be used for version control and distributed writing. For the writing of the present article, the co-authors (Germany and Mexico) used a remote Git repository on bitbucket. The plain text syntax of markdown facilitates the visualization of differences of document versions, as shown in Fig. 7.

Figure 7. Version control and collaborative editing using a git repository on bitbucket.
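Why plain text lends itself to change tracking can be illustrated with a small sketch. It uses Python's standard difflib module rather than Git itself, and the two manuscript versions and the file labels are made up for this illustration.

```python
import difflib

# Two hypothetical versions of the same manuscript paragraph.
old = (
    "Markdown files can be edited on local devices.\n"
    "Git enables version control.\n"
)
new = (
    "Markdown files can be edited on local devices or on cloud drives.\n"
    "Git enables version control.\n"
)

# A unified diff marks removed lines with "-" and added lines with "+";
# unchanged context lines start with a space.
diff = list(
    difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile="manuscript.md (old)",
        tofile="manuscript.md (new)",
    )
)
print("".join(diff))
```

Because every sentence is stored as readable text, the changed line shows up directly in the diff. A zipped XML format such as DOCX would first have to be unpacked.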

In the following section, we demonstrate the potential for typesetting scientific manuscripts with pandoc using examples for typical document elements, such as tables, figures, formulas, code listings and references. A brief introduction is given by Dominici (2014). The complete Pandoc User’s Manual is available at http://pandoc.org/MANUAL.html.

Tables

There are several options to write tables in markdown. The most flexible alternative, which was also used for this article, is the pipe table. The contents of different cells are separated by pipe symbols (|), as in the following example:

| Left | Center | Right | Default |
|:-----|:------:|------:|---------|
| LLL  | CCC    | RRR   | DDD     |
The headings and the alignment of the cells are given in the first two lines. The cell width is variable. The pandoc parameter --columns=NUM can be used to define the length of lines in characters. If contents do not fit, they will be wrapped.
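When table data already exist in a program, the two header lines can be generated instead of typed. The following is a minimal sketch; the function name pipe_table and the alignment keywords are chosen for this illustration and are not part of pandoc.

```python
def pipe_table(headers, aligns, rows):
    """Render a pipe table as markdown text.

    `aligns` holds one of "left", "center", "right" or "default" per column.
    Colons in the second line encode the alignment of each column.
    """
    rules = {"left": ":---", "center": ":---:", "right": "---:", "default": "----"}
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join(rules[a] for a in aligns) + "|",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


table = pipe_table(
    ["Left", "Center", "Right", "Default"],
    ["left", "center", "right", "default"],
    [["LLL", "CCC", "RRR", "DDD"]],
)
print(table)
```

The sketch does not pad cells to equal width. This is acceptable because the cell width in pipe tables is variable.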

Complex tables, e.g. tables featuring multiple headers or those containing cells spanning multiple rows or columns, are currently not representable in markdown format. However, it is possible to embed LaTeX and HTML tables into the document. These format-specific tables will only be included in the output if a document of the respective format is produced. This method can be extended to apply any kind of format-specific typographic functionality which would otherwise be unavailable in markdown syntax.

Figures and images

Images are inserted using the syntax ![alt text](image location/name).

The alt text is used e.g. in HTML output. Image dimensions can be defined in braces directly after the image, e.g. {height=30%}.

As well, an identifier for the figure can be defined with #, resulting e.g. in the image attributes {#figure1 height=30%}.

A paragraph containing only an image is interpreted as a figure. The alt text is then output as the figure’s caption.
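The pieces of an image element (alt text, location, identifier and dimensions) can be assembled as in the sketch below. The helper name image_markdown and the file name fig1.png are hypothetical; only the resulting markdown syntax follows the description above.

```python
def image_markdown(alt, path, identifier=None, **dimensions):
    """Build a markdown image with optional attributes in braces."""
    attrs = []
    if identifier:
        attrs.append("#" + identifier)  # identifier is prefixed with '#'
    attrs.extend(f"{key}={value}" for key, value in dimensions.items())
    suffix = "{" + " ".join(attrs) + "}" if attrs else ""
    return f"![{alt}]({path}){suffix}"


# Hypothetical image file; placed alone in a paragraph it becomes a figure
# and the alt text becomes the caption.
figure = image_markdown("Figure 1", "fig1.png", identifier="figure1", height="30%")
print(figure)
```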

Symbols

Scientific texts often require special characters, e.g. Greek letters, mathematical and physical symbols etc.

The Unicode standard, developed and maintained by the Unicode Consortium, enables the use of characters across languages and computer platforms. Its UTF-8 encoding is defined in RFC 3629 of the Network Working Group (Yergeau 2003) and in the ISO standard ISO/IEC 10646:2014 (International Organization for Standardization 2014). Specifications of Unicode and code charts are provided on the Unicode homepage (http://www.unicode.org/).

In pandoc markdown documents, Unicode characters such as °, α, ä, Å can be inserted directly and are passed on to the different output documents. The correct processing of MD with UTF-8 encoding to LaTeX/PDF output requires the use of the --latex-engine=xelatex option and the use of an appropriate font. The Times-like XITS font (https://github.com/khaledhosny/xits-math), suitable for high quality typesetting of scientific texts, can be set in the LaTeX template.

To facilitate the input of specific characters, so-called mnemonics can be enabled in some editors (e.g. in atom by the character-table package). For example, the 2-character mnemonic ‘:u’ gives ‘ü’ (diaeresis), and ‘D*’ gives the Greek Δ. The possible character mnemonics and character sets are listed in RFC 1345 (http://www.faqs.org/rfcs/rfc1345.html; Simonsen 1992).

Formulas

Formulas are written in LaTeX mode using the delimiters $. E.g. the formula for calculating the standard deviation (s) of a random sampling would be written as:

$s=\sqrt{\frac{1}{N-1}\sum_{i=1}^N(x_i-\overline{x})^{2}}$

with $x_i$ the individual observations, $\overline{x}$ the sample mean and $N$ the total number of samples.
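As a numerical check of the formula, the sketch below evaluates it term by term for a small set of made-up observations and compares the result with the sample standard deviation from Python's statistics module.

```python
import math
import statistics


def sample_std(values):
    """Standard deviation s with the 1/(N-1) normalisation of the formula."""
    n = len(values)                       # N, the total number of samples
    mean = sum(values) / n                # the sample mean
    squared = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared / (n - 1))


# Illustrative observations x_i (not data from the article).
observations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s = sample_std(observations)

# statistics.stdev uses the same N-1 normalisation.
assert math.isclose(s, statistics.stdev(observations))
```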

Pandoc parses formulas into internal structures and allows conversion into formats other than LaTeX. This allows for format-specific formula representation and enables computational analysis of the formulas (Corbí and Burgos 2015).

Code listings

Verbatim code blocks are delimited by three tilde symbols (~~~), placed on the lines before and after the code.

Typesetting inline code is possible by enclosing text between backticks.

Other document elements

These examples are only a short demonstration of the capacities of pandoc concerning scientific documents. For more detailed information, we refer to the official manual (http://pandoc.org/MANUAL.html).

The efficient organization and typesetting of citations and bibliographies is crucial for academic writing. Pandoc supports various strategies for managing references. For processing the citations and the creation of the bibliography, the command line parameter --filter pandoc-citeproc is used, with variables for the reference database and the bibliography style. The bibliography will be located automatically at the header # References or # Bibliography.

Reference databases

Pandoc is able to process all mainstream literature database formats, such as RIS, BIB, etc. However, for maintaining compatibility with LaTeX/BibTeX, the use of BIB databases is recommended. The database used can either be defined in the YAML metablock of the MD file (see below) or be passed as a parameter when calling pandoc.

Inserting citations

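For inserting a reference, the database key is given within square brackets, and indicated by an ‘@’. Several references are separated by semicolons, and it is also possible to add information, such as a page. For example, a citation of the form [@KEY1; @KEY2, 57 ff.], with KEY1 and KEY2 standing for the database keys of the two works,

gives (Suber 2012; Benkler 2006, 57 ff.).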
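The citation syntax can be generated programmatically, e.g. when references are collected by a script. In this sketch the function name cite and the keys suber2012 and benkler2006 are hypothetical. The actual keys depend on the reference database.

```python
def cite(*references):
    """Build a citation from (key, locator) pairs; locator may be None."""
    parts = []
    for key, locator in references:
        # Additional information such as a page follows the key after a comma.
        parts.append(f"@{key}, {locator}" if locator else f"@{key}")
    # Several references inside one bracket are separated by semicolons.
    return "[" + "; ".join(parts) + "]"


citation = cite(("suber2012", None), ("benkler2006", "57 ff."))
print(citation)
```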

Styles

The Citation Style Language (CSL, http://citationstyles.org/) is used for the citations and bibliographies. This file format is supported e.g. by the reference management programs Mendeley (https://www.mendeley.com/), Papers (http://papersapp.com/) and Zotero (https://www.zotero.org/). CSL styles for particular journals can be found in the Zotero style repository (https://www.zotero.org/styles). The bibliography style that pandoc should use for the target document can be chosen in the YAML block of the markdown document or can be passed in as a command line option. The latter is preferable, because distinct bibliography styles may be used for different documents.

Creation of LaTeX natbib citations

For citations in scientific manuscripts written in LaTeX, the natbib package is widely used. To create a LaTeX output file with natbib citations, pandoc simply has to be run with the --natbib option, but without the --filter pandoc-citeproc parameter.
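The two ways of processing citations differ only in a few command line parameters. The sketch below assembles the argument lists without running pandoc. The function name and all file names (md-article.md, bibliography.bib, plos.csl) are assumptions made for this illustration. The flags are those named in the text plus the pandoc options --output, --bibliography and --csl. Later pandoc releases provide citation processing through a built-in --citeproc option instead of the pandoc-citeproc filter.

```python
def pandoc_command(source, output, bibliography, csl=None, natbib=False):
    """Assemble a pandoc call as an argument list (nothing is executed)."""
    cmd = ["pandoc", source, "--output", output, "--bibliography", bibliography]
    if natbib:
        # LaTeX output with natbib citations: no citeproc filter.
        cmd.append("--natbib")
    else:
        # Citations and bibliography are rendered by the citeproc filter.
        cmd += ["--filter", "pandoc-citeproc"]
        if csl:
            cmd += ["--csl", csl]
    return cmd


# Hypothetical file names for a DOCX and a LaTeX/natbib version.
docx_cmd = pandoc_command("md-article.md", "md-article.docx",
                          "bibliography.bib", csl="plos.csl")
natbib_cmd = pandoc_command("md-article.md", "md-article.tex",
                            "bibliography.bib", natbib=True)
print(" ".join(docx_cmd))
print(" ".join(natbib_cmd))
```

Passing the CSL file as an argument rather than fixing it in the YAML block keeps the same source file usable for several journal styles.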

Database of cited references

To share the bibliography for a certain manuscript with co-authors or the publisher’s production team, it is often desirable to generate a subset of a larger database, which only contains the cited references. If LaTeX output was generated with the --natbib option, the compilation of the file with LaTeX gives an AUX file (in the example named md-article.aux), from which the cited references can subsequently be extracted using BibTool (https://github.com/ge-ne/bibtool).

In this example, the article database will be called bibshort.bib.

For the direct creation of an article specific BIB database without using LaTeX, we wrote a simple Perl script called mdbibexport (https://github.com/robert-winkler/mdbibexport).
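The idea behind such a reduced database can be sketched in a few lines of Python. This is neither BibTool nor the mdbibexport script, only an illustration under simplifying assumptions: cited keys are read from the \citation{...} lines that LaTeX writes to the AUX file, and every BibTeX entry is assumed to start with ‘@’ at the beginning of a line. The function names and the sample data are hypothetical.

```python
import re


def cited_keys_from_aux(aux_text):
    """Collect citation keys from \\citation{...} lines of an AUX file."""
    keys = set()
    for match in re.finditer(r"\\citation\{([^}]*)\}", aux_text):
        keys.update(k.strip() for k in match.group(1).split(",") if k.strip())
    return keys


def filter_bibtex(bib_text, keys):
    """Keep only the BibTeX entries whose key is in `keys`."""
    # Split in front of every '@' at a line start (simplifying assumption).
    entries = re.split(r"^(?=@)", bib_text, flags=re.MULTILINE)
    kept = []
    for entry in entries:
        header = re.match(r"@\w+\s*\{\s*([^,\s]+)\s*,", entry)
        if header and header.group(1) in keys:
            kept.append(entry.strip())
    return "\n\n".join(kept)


aux = "\\relax\n\\citation{suber2012}\n\\citation{benkler2006,suber2012}\n"
bib = (
    "@book{suber2012,\n  title = {Open Access}\n}\n\n"
    "@book{benkler2006,\n  title = {The Wealth of Networks}\n}\n\n"
    "@article{unused2000,\n  title = {Not cited}\n}\n"
)
subset = filter_bibtex(bib, cited_keys_from_aux(aux))
print(subset)
```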

Bourne (2005) argues that journals should be effectively equivalent to biological databases: both provide data which can be referenced by unique identifiers like DOI or e.g. gene IDs. Applying the semantic-web ideas of Berners-Lee and Hendler (2001) to this domain can make this vision a reality. Here we show how metadata can be specified in markdown. We propose conventions, and demonstrate their suitability to enable interlinked and semantically enriched journal articles.

Document information such as title, authors, abstract etc. can be defined in a metadata block written in YAML syntax. YAML (“YAML Ain’t Markup Language”, http://yaml.org/) is a data serialization standard in simple, human readable format. Variables defined in the YAML section are processed by pandoc and integrated into the generated documents. The YAML metadata block is recognized by three hyphens (---) at the beginning, and three hyphens or dots (...) at the end, e.g.:
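How the delimiters separate metadata from manuscript text can be sketched as follows. The function is a simplified stand-in for pandoc's own parser: it only handles a block at the very start of the file and returns the raw lines without interpreting the YAML. The title and author in the example are placeholders.

```python
def split_metadata_block(markdown_text):
    """Split a document into (metadata lines, body lines).

    The block must open with '---' on the first line and close with
    '---' or '...'; otherwise the whole text is treated as body.
    """
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return [], lines
    for index in range(1, len(lines)):
        if lines[index].strip() in ("---", "..."):
            return lines[1:index], lines[index + 1:]
    return [], lines  # no closing delimiter found


document = (
    "---\n"
    "title: Example manuscript\n"
    "author: Jane Doe\n"
    "...\n"
    "\n"
    "# Introduction\n"
)
metadata, body = split_metadata_block(document)
print(metadata)
```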

The public availability of all relevant information is a central aspect of Open Science. Analogous to article contents, data should be accessible via default tools. We believe that this principle must also be applied to article metadata. Thus, we created a custom pandoc writer that emits the article’s data as JSON–LD (Lanthaler and Gütl 2012), allowing for informational and navigational queries of the journal’s data with standard tools of the semantic web. The above YAML information would be output as:

This format allows processing of the information by standard data processing software and browsers.

Flexible metadata authoring

We developed a method to allow writers the flexible specification of authors and their respective affiliations. Author names can be given as a string, via the key of a single-element object, or explicitly as a name attribute of an object. Affiliations can be specified directly as properties of the author object, or separately in the institute object.

Additional information, e.g. email addresses or identifiers like ORCID (Haak et al. 2012), can be added as additional values:
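The three ways of naming an author can be reduced to one common structure. The sketch below shows one possible reading of these rules, operating on the data as it would look after the YAML has been parsed. It is not the implementation used in our tools, and the names, the institute key and the email address are placeholders.

```python
def normalize_author(entry):
    """Return an author as a dict with an explicit 'name' attribute."""
    if isinstance(entry, str):
        # Form 1: the name is given as a plain string.
        return {"name": entry}
    if "name" in entry:
        # Form 3: the name is an explicit attribute of the object.
        return dict(entry)
    if len(entry) == 1:
        # Form 2: the name is the key of a single-element object.
        name, details = next(iter(entry.items()))
        author = {"name": name}
        author.update(details or {})
        return author
    raise ValueError("unrecognised author specification")


authors = [
    "Jane Doe",
    {"John Doe": {"institute": "fosg", "email": "john.doe@example.org"}},
    {"name": "Max Mustermann", "institute": "fosg"},
]
normalized = [normalize_author(a) for a in authors]
print(normalized)
```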

JATS support

The journal article tag suite (JATS) was developed by the NLM and standardized by ANSI/NISO as an archiving and exchange format of journal articles and the associated metadata (National Information Standards Organization 2012), including data of the type shown above. The pandoc-jats writer by Martin Fenner is a plugin usable with pandoc to produce JATS-formatted output. The writer was adapted to be compatible with our metadata authoring method, allowing for simple generation of files which contain the relevant metadata.

Citation types

Writers can add information about the reason a citation is given. This might help reviewers and readers, and can simplify the search for relevant literature. We developed an extended citation syntax that integrates seamlessly into markdown and can be used to add complementary information to citations. Our method is based on CiTO, the Citation Typing Ontology (Shotton 2010), which specifies a vocabulary for the motivation when citing a resource. The type of a citations can be added to a markdown citation using @CITO_PROPERTY:KEY, where CITO_PROPERTY is a supported CiTO property, and KEY is the usual citation key. Our tool extracts that information and includes it in the generated linked data output. A general CiTO property (cites) is used, if no CiTO property is found in a citation key.

The work at hand will always be the subject of the generated semantic subject-predicate-object triples. Some CiTO predicates cannot be used meaningfully under this condition. Focusing on author convenience, we use this fact to allow shortening of properties where sensible. For example, if the authors of a biological paper include a reference to the paper describing a method that was used in their work, this relation can be described by the uses_method_in property of the CiTO ontology. The inverse property, provides_method_for, would always be nonsensical in this context, as implied by causality, and is therefore not supported by our tool. This allows us to introduce an abbreviation (method) for the former property, as any ambiguity has been eliminated. Users of western blotting might hence write @method_in:towbin_1979 or even just @method:towbin_1979, where towbin_1979 is the citation identifier of the paper by Towbin, Staehelin, and Gordon (1979) that describes the method.
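The extraction of the property from a typed citation can be sketched in a few lines of Python. This is a minimal illustration rather than the filter shipped with our tool: the regular expression, the alias table, and the function name are assumptions, and the sketch assumes that the citation key itself contains no colon.

```python
import re

# Only the abbreviations mentioned in the text; a real table would be longer.
ALIASES = {
    "method": "uses_method_in",
    "method_in": "uses_method_in",
}
DEFAULT_PROPERTY = "cites"

# Optional "PROPERTY:" prefix, followed by the citation key.
CITATION = re.compile(r"@(?:(?P<prop>[A-Za-z_]+):)?(?P<key>[\w-]+)")


def parse_citation(token):
    """Split '@PROPERTY:KEY' into (CiTO property, citation key)."""
    match = CITATION.fullmatch(token)
    if match is None:
        raise ValueError(f"not a citation: {token!r}")
    prop = match.group("prop") or DEFAULT_PROPERTY
    # Expand abbreviations; unknown names pass through unchanged.
    return ALIASES.get(prop, prop), match.group("key")


for token in ("@method:towbin_1979", "@method_in:towbin_1979", "@towbin_1979"):
    print(token, "->", parse_citation(token))
```

Both abbreviated forms resolve to the same property, and a plain citation falls back to the general property cites, as described above.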

Scientific manuscripts have to be submitted in a format defined by the journal or publisher. At the moment, DOCX is the most common file format for manuscript submission. Some publishers also accept or require LaTeX or ODT formats. In addition to the general style of the manuscript – organization of sections, fonts, etc. – the citation style of the journal must also be followed. Often, the same manuscript has to be prepared for different journals, e.g. if the manuscript was rejected by a journal and has to be formatted for another one, or if a preprint of the paper is submitted to an archive that requires a different document format from that of the targeted peer-reviewed journal.
In this example, we want to create a manuscript for a PLoS journal in DOCX and ODT format for WYSIWYG word processors. Further, a version in LaTeX/PDF should be produced for PeerJ submission and archiving at the PeerJ preprint server.

The examples for DOCX/ODT are kept relatively simple, to show the proof of principle and to provide a plain document as a starting point for developing one's own templates. Nevertheless, the generated documents should be suitable for submission after little manual editing. For specific journals it may be necessary to create more sophisticated templates or to copy and paste the generic DOCX/ODT output into the publisher’s template.

Development of a DOCX/ ODT template

A first DOCX document with bibliography in PLoS format is created with pandoc DOCX output:

The parameters -S -s generate a typographically correct (dashes, non-breaking spaces etc.) stand-alone document. A bibliography with the PLoS style is created by the citeproc filter setting --csl=plos.csl --filter pandoc-citeproc.

The document settings and styles of the resulting file pandoc-manuscript.docx can be adjusted, and the file can then be reused as a document template (--reference-docx=pandoc-manuscript.docx).

It is also possible to directly re-use a previous output file as template (i.e. template and output file have the same file name):
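The Python sketch below assembles the command line from the options described in this section. The source file name manuscript.md and the function name are hypothetical. The options match the pandoc version listed in Tab. 3; in later pandoc releases -S was replaced by the smart extension, --reference-docx by --reference-doc, and the pandoc-citeproc filter by the --citeproc option, so the list would need adjusting there.

```python
import shlex


def docx_command(source, output, csl="plos.csl", reference_docx=None):
    """Build the argument list for a DOCX run (pandoc 1.x style options)."""
    cmd = ["pandoc", "-S", "-s", f"--csl={csl}", "--filter", "pandoc-citeproc"]
    if reference_docx is not None:
        # Reuse an existing DOCX file as the style template.
        cmd.append(f"--reference-docx={reference_docx}")
    cmd += ["-o", output, source]
    return cmd


# First run: no template yet.
first = docx_command("manuscript.md", "pandoc-manuscript.docx")

# Later runs: template and output file share the same name.
rerun = docx_command(
    "manuscript.md",
    "pandoc-manuscript.docx",
    reference_docx="pandoc-manuscript.docx",
)
print(shlex.join(first))
print(shlex.join(rerun))
```

The lists can be passed to `subprocess.run` unchanged once pandoc and the CSL file are available.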

In this way, the template can be incrementally adjusted to the desired document formatting. The final document may later be employed as a pandoc template for other manuscripts with the same specifications. In that case, the first pandoc run with the template fills the contents of the new manuscript into the provided DOCX template. A page with DOCX manuscript formatting of this article is shown in Fig. 8.

Figure 8. Opening a pandoc-generated DOCX in Microsoft Office 365.

The same procedure can be applied with an ODT formatted document.

Development of a TeX/PDF template

The default pandoc LaTeX template can be written into a separate file by:

This template can be adjusted, e.g. by defining Unicode encoding (see above), by including particular packages or setting document options (line numbering, font size). The template can then be used with the pandoc parameter --template=pandoc-peerj.latex.
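The two steps, exporting the default template and then rendering with the adjusted copy, could be scripted as in the following Python sketch. The function names and the file name manuscript.md are hypothetical. The sketch relies on pandoc's -D option, which prints the default template for a format to standard output.

```python
import shutil
import subprocess


def default_template_command(fmt="latex"):
    """Command that prints pandoc's default template for `fmt` to stdout."""
    return ["pandoc", "-D", fmt]


def save_default_template(path="pandoc-peerj.latex", fmt="latex"):
    """Write the default template to `path`; requires pandoc on the PATH."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found")
    result = subprocess.run(
        default_template_command(fmt),
        capture_output=True,
        text=True,
        check=True,
    )
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)


def pdf_command(source, output, template="pandoc-peerj.latex"):
    """Command that renders `source` with the adjusted template."""
    return ["pandoc", "-s", f"--template={template}", "-o", output, source]


print(pdf_command("manuscript.md", "manuscript.pdf"))
```

Calling `save_default_template()` once creates the file to edit; `pdf_command` then yields the arguments for each subsequent run.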

The templates used for this document are included as Supplemental Material (see section Software and code availability below).

Styles for HTML and EPUB

The style for HTML and EPUB formats can be defined in .css stylesheets. The Supplemental Material contains a simple example .css file for modifying the HTML output, which can be used with the pandoc parameter -c pandoc.css.

The commands necessary to produce the document in specific formats or styles can be defined in a simple Makefile. An example Makefile is included in the source code of this preprint. The desired output file format can be chosen when calling make. For example, make outfile.pdf produces this preprint in PDF format. Calling make without any option creates all listed document types. A Makefile producing DOCX, ODT, JATS, PDF, LaTeX, HTML and EPUB files of this document is provided as Supplemental Material.
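The idea behind such a Makefile, one rule per output type selected by the target's file extension, is sketched below in Python. This is not the Makefile from the Supplemental Material: the option sets, the source name manuscript.md, and the function names are assumptions, and JATS is left out because it needs a separate writer.

```python
import os

# Options shared by all targets (pandoc 1.x style, as used in this article).
COMMON = ["-S", "-s", "--filter", "pandoc-citeproc"]

# Extra options per output extension; file names follow the text above.
EXTRA = {
    ".docx": ["--reference-docx=pandoc-manuscript.docx"],
    ".odt": [],
    ".tex": ["--template=pandoc-peerj.latex"],
    ".pdf": ["--template=pandoc-peerj.latex"],
    ".html": ["-c", "pandoc.css"],
    ".epub": [],
}


def build_command(target, source="manuscript.md"):
    """Return the pandoc argument list for one target, like `make TARGET`."""
    ext = os.path.splitext(target)[1]
    if ext not in EXTRA:
        raise ValueError(f"no rule for target {target!r}")
    return ["pandoc", *COMMON, *EXTRA[ext], "-o", target, source]


def build_all(stem="outfile", source="manuscript.md"):
    """Return the commands for every known target, like a bare `make`."""
    return [build_command(stem + ext, source) for ext in EXTRA]


for command in build_all():
    print(" ".join(command))
```

A real Makefile additionally tracks file timestamps, so unchanged targets are not rebuilt; the sketch only shows how a target name maps to a command.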

Cross-platform compatibility

The make process was tested on Windows 10 and on 64-bit Linux. All documents – DOCX, ODT, JATS, LaTeX, PDF, EPUB and HTML – were generated successfully, which demonstrates the cross-platform compatibility of the workflow.

Following the trend toward peer production, the formatting of scientific content must become more efficient. Markdown/pandoc has the potential to play a key role in the transition from proprietary to community-driven academic production. Important research tools, such as the statistical computing and graphics language R (R Core Team 2014) and the Jupyter notebook project (Kluyver et al. 2016), have already adopted the MD syntax (e.g. http://rmarkdown.rstudio.com/). The software for writing manuscripts in MD is mature enough to be used by academic writers. Therefore, publishers should also consider integrating the MD format into their editorial platforms.

Authoring scientific manuscripts in markdown (MD) format is straightforward, and manual formatting is reduced to a minimum. The simple syntax of MD facilitates document editing and collaborative writing. MD can be converted rapidly to multiple formats such as DOCX, LaTeX, PDF, EPUB and HTML using pandoc, and templates enable the automated generation of documents according to specific journal styles.

The additional features we implemented facilitate the correct indexing of meta information of journal articles according to the ‘semantic web’ philosophy.

Altogether, the MD format supports the agile writing and fast production of scientific literature. The associated time and cost reduction especially favours community-driven publication strategies.

Acknowledgements

We cordially thank Dr. Gerd Neugebauer for his help in creating a subset of a BibTeX database using BibTool, as well as Dr. Ricardo A. Chávez Montes, Prof. Magnus Palmblad and Martin Fenner for comments on the manuscript. Warm thanks also go to Anubhav Kumar and Jennifer König for proofreading.

Software and code availability

The relevant software used for creating this manuscript is cited according to Smith, Katz, and Niemeyer (2016) and listed in Tab. 3. Since unique identifiers are missing for most software projects, we only refer to the project homepages or software repositories:

Table 3. Relevant software used for this article.

| Software | Use | Authors | Version | Release | Homepage/repository |
|---|---|---|---|---|---|
| pandoc | universal markup converter | John MacFarlane | 1.16.0.2 | 16/01/13 | http://www.pandoc.org |
| pandoc-citeproc | library for CSL citations with pandoc | John MacFarlane, Andrea Rossato | 0.9.1 | 16/03/19 | https://github.com/jgm/pandoc-citeproc |
| pandoc-jats | creation of JATS files with pandoc | Martin Fenner | 0.9 | 15/04/26 | https://github.com/mfenner/pandoc-jats |
| ownCloud | personal cloud software | ownCloud GmbH, Community | 9.1.1 | 16/09/20 | https://owncloud.org/ |
| Markdown Editor | plugin for ownCloud | Robin Appelman | 0.1 | 16/03/08 | https://github.com/icewind1991/files_markdown |
| BibTool | BibTeX database tool | Gerd Neugebauer | 2.63 | 16/01/16 | https://github.com/ge-ne/bibtool |

The software created as part of this article, pandoc-scholar, is suitable for general use and has been published at https://github.com/pandoc-scholar/pandoc-scholar, DOI: 10.5281/zenodo.376761. The source code of this manuscript, as well as the templates and pandoc Makefile, have been deposited to https://github.com/robert-winkler/scientific-articles-markdown/.

Drawings for document types, devices and applications have been adopted from Calibre http://calibre-ebook.com/, openclipart https://openclipart.org/ and the GNOME Theme Faenza https://code.google.com/archive/p/faenza-icon-theme/.

References

Benkler, Yochai. 2006. The Wealth of Networks: How Social Production Transforms Markets and Freedom. New Haven, CT, USA: Yale University Press.

Berners-Lee, Tim, and James Hendler. 2001. “Publishing on the Semantic Web.” Nature 410 (6832): 1023–4. doi:10.1038/35074206.

Bourne, Philip. 2005. “Will a Biological Database Be Different from a Biological Journal?” PLOS Computational Biology 1 (3): e34. doi:10.1371/journal.pcbi.0010034.

Brauer, Michael, Patrick Durusau, Gary Edwards, David Faure, Tom Magliery, and Daniel Vogelheim. 2005. “Open Document Format for Office Applications (OpenDocument) V1.0.” OASIS.

Brown, Cecelia. 2001. “The E-Volution of Preprints in the Scholarly Communication of Physicists and Astronomers.” J. Am. Soc. Inf. Sci. 52 (3): 187–200. doi:10.1002/1097-4571(2000)9999:9999<::AID-ASI1586>3.0.CO;2-D.

———. 2003. “The Role of Electronic Preprints in Chemical Communication: Analysis of Citation, Usage, and Acceptance in the Journal Literature.” J. Am. Soc. Inf. Sci. 54 (5): 362–71. doi:10.1002/asi.10223.

Brown, Patrick O, Michael B Eisen, and Harold E Varmus. 2003. “Why PLoS Became a Publisher.” PLoS Biol 1 (1). doi:10.1371/journal.pbio.0000036.

Butler, Declan. 2001. “Los Alamos Loses Physics Archive as Preprint Pioneer Heads East.” Nature 412 (6842): 3–4. doi:10.1038/35083708.

Callaway, Ewen. 2013. “Preprints Come to Life.” Nature News 503 (7475): 180. doi:10.1038/503180a.

Corbí, Alberto, and Daniel Burgos. 2015. “Semi-Automated Correction Tools for Mathematics-Based Exercises in MOOC Environments.” International Journal of Interactive Multimedia and Artificial Intelligence 3 (3): 89–95. doi:10.9781/ijimai.2015.3312.

Dominici, Massimiliano. 2014. “An Overview of Pandoc.” TUGboat 35 (1): 44–50.

DPT Collective. 2015. “From Print to Ebooks: A Hybrid Publishing Toolkit for the Arts.” In, edited by Joe Monk, Miriam Rasch, Florian Cramer, and Amy Wu. Institute of Network Cultures.

Eikebrokk, Trude, Tor Arne Dahl, and Siri Kessel. 2014. “EPUB as Publication Format in Open Access Journals: Tools and Workflow.” Code4Lib, no. 24 (April).

Eisen, Michael. 2003. “Publish and Be Praised.” The Guardian, October.

Fecher, Benedikt, and Sascha Friesike. 2014. “Open Science: One Term, Five Schools of Thought.” In Opening Science, edited by Sönke Bartling and Sascha Friesike, 17–47. Springer International Publishing.

Ginsparg, Paul. 1994. “First Steps Towards Electronic Research Communication.” Computers in Physics 8 (4): 390–96. doi:10.1063/1.4823313.

Haak, Laurel L., Martin Fenner, Laura Paglione, Ed Pentz, and Howard Ratner. 2012. “ORCID: A System to Uniquely Identify Researchers.” Learned Publishing 25 (4): 259–64. doi:10.1087/20120404.

Hickson, Ian, Robin Berjon, Steve Faulkner, Travis Leithead, Erika Doyle Navara, Edward O’Connor, Silvia Pfeiffer, et al. 2014. “HTML5.” W3C Recommendation. W3C.

Houghton, John, Bruce Rasmussen, Peter Sheehan, Charles Oppenheim, Anne Morris, Claire Creaser, Helen Greenwood, Mark Summers, and Adrian Gourlay. 2009. “Economic Implications of Alternative Scholarly Publishing Models: Exploring the Costs and Benefits.” http://www.webarchive.org.uk/wayback/archive/20140614041628/http://www.jisc.ac.uk/publications/reports/2009/economicpublishingmodelsfinalreport.aspx#downloads.

International Organization for Standardization. 2013. “ISO 32000-1:2008 - Document Management – Portable Document Format – Part 1: PDF 1.7.” ISO. http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=51502.

———. 2014. “ISO/IEC 10646:2014 - Information Technology – Universal Coded Character Set (UCS).” ISO. http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=63182.

Kielhorn, Axel. 2011. “Multi-Target Publishing-Generating EPub, PDF, and More, from Markdown Using Pandoc.” TUGboat-TeX Users Group 32 (3): 272.

Kluyver, Thomas, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, et al. 2016. “Jupyter Notebooks—a Publishing Format for Reproducible Computational Workflows.” In Positioning and Power in Academic Publishing: Players, Agents and Agendas, 87–90. doi:10.3233/978-1-61499-649-1-87.

Lamport, Leslie. 1994. LaTeX: A Document Preparation System. 2 edition. Reading, Mass: Addison-Wesley Professional.

Lanthaler, Markus, and Christian Gütl. 2012. “On Using JSON-LD to Create Evolvable RESTful Services.” In Proceedings of the Third International Workshop on RESTful Design, 25–32. ACM.

Leonard, Sean. 2016. “Guidance on Markdown: Design Philosophies, Stability Strategies, and Select Registrations.” RFC. RFC Editor; Internet Request for Comments.

National Information Standards Organization. 2012. “JATS: Journal Article Tag Suite.” ANSI/NISO Z39.96.

Ngo, Tom. 2006. “OFFICE OPEN XML OVERVIEW ECMA TC45.” Ecma International.

Ovadia, Steven. 2014. “Markdown for Librarians and Academics.” Behavioral & Social Sciences Librarian 33 (2): 120–24. doi:10.1080/01639269.2014.904696.

R Core Team. 2014. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.

Raggett, Dave, Arnaud Le Hors, and Ian Jacobs. 1999. “HTML 4.01 Specification.” W3C Recommendation. W3C.

Shotton, David. 2010. “CiTO, the Citation Typing Ontology.” Journal of Biomedical Semantics 1 (1): S6. doi:10.1186/2041-1480-1-S1-S6.

Simonsen, K. 1992. “Character Mnemonics & Character Sets.” RFC. Rationel Almen Planlaegning; Internet Request for Comments.

Smith, Arfon M., Daniel S. Katz, and Kyle E. Niemeyer. 2016. “Software Citation Principles.” Edited by Silvio Peroni. PeerJ Computer Science 2 (September): e86. doi:10.7717/peerj-cs.86.

Solomon, David, and Bo-Christer Björk. 2016. “Article Processing Charges for Open Access Publication - the Situation for Research Intensive Universities in the USA and Canada.” PeerJ 4 (July): e2264. doi:10.7717/peerj.2264.

Suber, Peter. 2012. Open Access. Cambridge, Mass: The MIT Press.

Towbin, H., T. Staehelin, and J. Gordon. 1979. “Electrophoretic Transfer of Proteins from Polyacrylamide Gels to Nitrocellulose Sheets: Procedure and Some Applications.” Proceedings of the National Academy of Sciences 76 (9): 4350–4. http://www.pnas.org/content/76/9/4350.

Van Noorden, Richard. 2012. “Journal Offers Flat Fee for ‘All You Can Publish’.” Nature News 486 (7402): 166. doi:10.1038/486166a.

———. 2013. “Open Access: The True Cost of Science Publishing.” Nature 495 (7442): 426–29. doi:10.1038/495426a.

———. 2014. “The ArXiv Preprint Server Hits 1 Million Articles.” Nature News. doi:10.1038/nature.2014.16643.

Volmer, Dietrich A., and Caroline S. Stokes. 2016. “How to Prepare a Manuscript Fit-for-Purpose for Submission and Avoid Getting a ‘Desk-Reject’.” Rapid Commun. Mass Spectrom., January, n/a–n/a. doi:10.1002/rcm.7746.

Willinsky, John. 2005. “The Unacknowledged Convergence of Open Source, Open Access, and Open Science.” First Monday 10 (8). doi:10.5210/fm.v10i8.1265.

Woelfle, Michael, Piero Olliaro, and Matthew H. Todd. 2011. “Open Science Is a Research Accelerator.” Nat Chem 3 (10): 745–48. doi:10.1038/nchem.1149.

Yergeau, F. 2003. “UTF-8, a Transformation Format of ISO 10646.” RFC. Alis Technologies.

Youngen, Gregory K. 1998. “Citation Patterns to Traditional and Electronic Preprints in the Published Literature.” Coll. Res. Libr. 59 (5): 448–56. doi:10.5860/crl.59.5.448.

If you need to convert files from one markup format into another, pandoc is your Swiss Army knife. Pandoc can convert between the following formats:

(← = conversion from; → = conversion to; ↔︎ = conversion from and to)

Lightweight markup formats

↔︎ Markdown (including CommonMark and GitHub-flavored Markdown)
↔︎ reStructuredText
→ AsciiDoc
↔︎ Emacs Org-Mode
↔︎ Emacs Muse
↔︎ Textile
← txt2tags

HTML formats

↔︎ (X)HTML 4
↔︎ HTML5

Ebooks

↔︎ EPUB version 2 or 3
↔︎ FictionBook2

Documentation formats

→ GNU TexInfo
↔︎ Haddock markup

Roff formats

↔︎ roff man
→ roff ms

TeX formats

↔︎ LaTeX
→ ConTeXt

XML formats

↔︎ DocBook version 4 or 5
↔︎ JATS
→ TEI Simple

Outline formats

↔︎ OPML

Bibliography formats

↔︎ BibTeX
↔︎ BibLaTeX
↔︎ CSL JSON
↔︎ CSL YAML

Word processor formats

↔︎ Microsoft Word docx
↔︎ OpenOffice/LibreOffice ODT
→ OpenDocument XML
→ Microsoft PowerPoint

Interactive notebook formats

↔︎ Jupyter notebook (ipynb)

Page layout formats

→ InDesign ICML

Wiki markup formats

↔︎ MediaWiki markup
↔︎ DokuWiki markup
← TikiWiki markup
← TWiki markup
← Vimwiki markup
→ XWiki markup
→ ZimWiki markup
↔︎ Jira wiki markup

Slide show formats

→ LaTeX Beamer
→ Slidy
→ reveal.js
→ Slideous
→ S5
→ DZSlides

Data formats

← CSV tables

Custom formats

→ custom writers can be written in Lua.

PDF

→ via pdflatex, lualatex, xelatex, latexmk, tectonic, wkhtmltopdf, weasyprint, prince, context, or pdfroff.

Pandoc understands a number of useful markdown syntax extensions, including document metadata (title, author, date); footnotes; tables; definition lists; superscript and subscript; strikeout; enhanced ordered lists (start number and numbering style are significant); running example lists; delimited code blocks with syntax highlighting; smart quotes, dashes, and ellipses; markdown inside HTML blocks; and inline LaTeX. If strict markdown compatibility is desired, all of these extensions can be turned off.

LaTeX math (and even macros) can be used in markdown documents. Several different methods of rendering math in HTML are provided, including MathJax and translation to MathML. LaTeX math is converted (as needed by the output format) to unicode, native Word equation objects, MathML, or roff eqn.

Pandoc includes a powerful system for automatic citations and bibliographies. This means that you can write a citation like

[see @doe99, pp. 33-35; also @smith04, ch. 1]

and pandoc will convert it into a properly formatted citation using any of hundreds of CSL styles (including footnote styles, numerical styles, and author-date styles), and add a properly formatted bibliography at the end of the document. The bibliographic data may be in BibTeX, BibLaTeX, CSL JSON, or CSL YAML format. Citations work in every output format.

There are many ways to customize pandoc to fit your needs, including a template system and a powerful system for writing filters.

Pandoc includes a Haskell library and a standalone command-line program. The library includes separate modules for each input and output format, so adding a new input or output format just requires adding a new module.

Pandoc is free software, released under the GPL. Copyright 2006–2020 John MacFarlane.




