Molecular Epidemiology - Q&A

Resources and Training

1. What WGS resources are currently available through the CoEs?

Under the Whole Genome Sequencing tab of our website, there are four modules and four webinars. The modules are short videos that introduce key concepts of whole genome and next generation sequencing. The webinars are recorded live online seminars that go in-depth into molecular epidemiology. There are also ongoing live learning series, where small groups of participants meet for five sessions to discuss outbreak scenarios with WGS analysis. We are also available for one-on-one assistance.

2. Can you recommend a good primer reference on WGS basics and language? Could you recommend a good bioinformatics course or training for epidemiologists?

I would recommend viewing the webinars and modules provided by ELC training efforts and the Food Core Center of Excellence.

3. You indicated that live webinars will be available online but it would be great if the PowerPoint slides could be made available as well.

See the “Webinar” tab on the side menu for access to the slides and the recording.

Cost

4. Can you speak to the cost per genome analyzed?

From Heather Carleton’s presentation at the regional PulseNet meeting:

Case: Shiga Toxin-Producing E. coli Cost Savings by Moving from Traditional Isolate Characterization to WGS (Materials only)

Characterization of a Shiga toxin-producing E. coli isolate	Current testing costs	ID + characterization by WGS (MiSeq)	ID + characterization by WGS (NextSeq)
Identification	$60
Serotyping	$159
PCR Virulence Profile – 4 targets	$10
PFGE	$30
MLVA	$15
AST	$30
WGS		$123	$60
Total	$304	$123	$60
Cost savings %		59%	80%

Annual cost savings based on # uploads to PulseNet in 2014: $(2239+3614)*(274-123)=$884,000

This just covers materials. There would be significant savings in labor if all of those tests were replaced with one WGS experiment.
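The arithmetic behind the table and the annual estimate can be checked with a short script. This is only a sketch: the variable and function names are made up for illustration, and the numbers are copied from the figures above. Note that the annual formula in the source uses $274 as the traditional per-isolate cost rather than the $304 table total; the source does not explain the difference, so the script keeps $274 as given.

```python
# Materials-only costs per STEC isolate (USD), copied from the table above.
traditional_costs = {
    "Identification": 60,
    "Serotyping": 159,
    "PCR virulence profile (4 targets)": 10,
    "PFGE": 30,
    "MLVA": 15,
    "AST": 30,
}
wgs_costs = {"MiSeq": 123, "NextSeq": 60}

traditional_total = sum(traditional_costs.values())


def percent_savings(old_cost, new_cost):
    """Savings expressed as a percentage of the old cost."""
    return 100 * (old_cost - new_cost) / old_cost


print(f"Traditional total: ${traditional_total}")
for platform, cost in wgs_costs.items():
    savings = percent_savings(traditional_total, cost)
    print(f"{platform}: ${cost} per isolate, {savings:.1f}% savings")

# Annual estimate, using the numbers exactly as they appear in the source
# formula. The $274 baseline is taken from that formula as-is (it is not the
# $304 table total, and the source does not say why).
uploads_2014 = 2239 + 3614
baseline_in_formula = 274
annual_savings = uploads_2014 * (baseline_in_formula - wgs_costs["MiSeq"])
print(f"Annual materials savings: ${annual_savings:,}")
```

The table's 59% and 80% correspond to the computed percentages with the decimal part dropped, and the annual figure matches the quoted $884,000 after rounding to the nearest thousand.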

Other informal analyses of the costs of bacterial whole-genome sequencing have produced estimates of about $250 per isolate, including labor and overhead. It may be possible to lower these costs in high-volume settings to perhaps $200 or less with the use of more automation and higher-throughput sequencing technology.

It should be noted that the trend over the past 10 years has been for sequencing costs to decrease and automation to increase. Assuming these trends continue, sequencing costs should continue to decrease.

5. While WGS and variants therein are certainly promising, the costs of both testing and hiring staff who are trained in bioinformatics need to be considered. Can you comment on the ways that departments of health, or even direct care health-care systems, can approach this to determine cost-benefit of bringing this type of technology online?

Question 4 contains a figure from a recent PulseNet regional meeting demonstrating the savings in cost of materials. There will also be a cost savings in labor if WGS is used to completely replace the tests listed, especially if you automate library preparation. As for bioinformatics skills, many labs currently use BioNumerics for PFGE analysis, and the plans PulseNet has for WGS analysis in BioNumerics are similar, using wgMLST schemes. There will be some additional training, but there should be minimal new hires required if BioNumerics is used. So, for routine analysis the cost should be less and the benefit greater.

CDC’s AMD program is putting out funding to state and local health departments for workforce development. Up to this point, most of those funds have gone towards training laboratory staff, but it is now starting to address needs in the epidemiology staff as well. Also, while the move to genomics will be cost-saving in the long run, there will be transition costs, such as those for workforce development, and certain legacy costs, such as the need to maintain older technologies during an overlap period.

Roll Out and Timing

6. You mentioned that all of these new sequencing techniques would "be coming to public health lab near you soon!" Two questions: realistically how soon do you think that will be (e.g. current wait times for E. coli isolates can be months) and do you think commercial labs would ever have those capabilities or would this be solely the work of state health labs/CDC?

As for when public health labs will have access to wgMLST schemes in BioNumerics for E. coli, Salmonella, Listeria, and Campylobacter, the latest estimates I have seen suggest that they will start arriving at a public health lab near you this summer. Ten public health labs are already using wgMLST in real time for Listeria analysis as part of the Listeria WGS pilot project. Applied Maths, the creator of BioNumerics, already has wgMLST schemes available for commercial use, but running the analysis on their calculation engine is a fee-for-service offering of about $10 per isolate (http://www.applied-maths.com/applications/wgmlst).

With TB, CDC is already contracting with several state labs to do sequencing on selected isolates for the entire country. Next year, they hope to be moving to universal sequencing. Similarly, the influenza program is funding three state labs to do sequencing for the entire country, although the uses for that data are mainly at the national and international levels, to improve vaccine strain selection. Several other programs, including viral hepatitis, Legionnaires' disease, streptococcal pathogens, meningitis pathogens, viral vaccine-preventable diseases, GC, and HIV, have already started to roll out their protocols to several states.

7. Can you describe the timescales required to prep samples prior to sequencing?

Starting with a streaked isolate on Monday morning, you would extract DNA and begin library preparation, finishing on Tuesday and loading the instrument early afternoon. If necessary, you can squeeze in the entire extraction and library preparation in one day but that would require more than 8 hours.

8. What is the expected turnaround time (TAT) for analyses conducted at CDC?

During the transition period (i.e. the period of time until all states have access to the calculation engine), CDC will prioritize the Listeria analyses. The goal is “real time” surveillance. Generally, it should take <24 hours for completed analyses from the time the sequence data is shared with CDC. States should have the ability to conduct sequence analyses for E. coli, Campylobacter, and Salmonella by the time these databases are rolled out.

9. What is the expected turnaround time (TAT) for labs that are WGS certified?

The target TAT (from receipt in the PulseNet lab to submission of sequence to CDC) is 7 business days for Listeria WGS. Achieving this TAT may require using lower capacity cartridges (i.e. 2 x 150bp, Micro, Nano) or sequencing historic or non-PulseNet isolates together on the same cartridge. If there are concerns about meeting TAT, please contact the PulseNet NGS lab (pulsenetngslab [at] cdc.gov) to help optimize runs. An additional option is to send the isolate to CDC or your area lab for sequencing. If you choose this option, please reach out to either lab ahead of time to make arrangements.

10. What are the expectations for labs that are not certified for WGS?

The non-certified labs need to send their Listeria isolates to their area lab or CDC. In either case, contact the lab ahead of time to make arrangements. The overall TAT (receipt of the isolate at the non-certified lab to sequencing data at area lab or CDC) should not exceed 10 business days.
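For planning purposes, the 7- and 10-business-day targets can be turned into calendar due dates. The sketch below is an illustration only: the function name is hypothetical, it counts Monday through Friday as business days, it ignores public holidays, and it assumes the day of receipt is day 0. None of these conventions are specified in the answers above.

```python
from datetime import date, timedelta


def add_business_days(start, n_days):
    """Return the date n_days business days (Mon-Fri) after start.

    Assumptions: the start date itself is day 0, and holidays are ignored.
    """
    current = start
    remaining = n_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


received = date(2018, 3, 5)  # a Monday, chosen only as an example
certified_due = add_business_days(received, 7)       # certified-lab target
non_certified_due = add_business_days(received, 10)  # non-certified target
print("Certified lab target:", certified_due.isoformat())
print("Non-certified lab target:", non_certified_due.isoformat())
```

A lab that needs holiday handling would have to supply its own holiday calendar and skip those dates in the loop.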

11. When will SEDRIC be able to import allele code data?

The process of adapting SEDRIC to allele code data is just getting started. They can’t say when SEDRIC will be adapted to import the allele codes, but it will almost certainly take less than 6 months.

12. What is the hang-up with all states having access to the calculation engine? When will it be available to all?

The WGS and PFGE national databases need to be combined in BioNumerics 7.6 so that the states don’t need to maintain licenses for both versions 6 and 7 and PulseNet Central does not need to provide support for both. This process has turned out to be complex and we are actively working to resolve issues. We don’t have a specific time when this will be fixed other than “soon”. This also needs to happen before release of the E. coli, Campylobacter, and Salmonella databases to the pilot labs or the Listeria production database to the remaining labs.

13. Does PulseNet Central plan to open analysis certification for WGS for Listeria to all WGS certified labs?

The answer is not yet – once the databases are combined, then analysis certification can begin.

Data Storage

14. Loading and storing data on NCBI is fine and often required for academic research, but in public health work, what bioethical or privacy issues would doing so impose? And how (and if) would sequence results be shared with healthcare providers, laboratory, residents (like how AST results are being shared now)?

Very limited metadata (meaning data associated with a whole genome sequence, such as patient residence, age, etc.) will be included with the submissions of the genome sequences to the NCBI SRA database. In some cases, such as cases of rare diseases, the public metadata may not even include the state of residence or occurrence. The vast majority (or sometimes all) of the metadata associated with the isolates will stay within the public health network, so bioethical or privacy issues should be minimal. As for sharing results, it won’t differ much from how things are shared now. Serotype, antibiotic resistance, virulence data, etc., which were created using other assays, are already shared. Although the assay that produces the result will change, the way the result is shared will not (as long as the WGS is CLIA validated). For surveillance activities for outbreaks, it will be very similar to the way PFGE is used right now. Phylogenetic trees will be created to visualize relatedness and clustering of isolates.

15. You spoke of how WGS can be used as a public health tool in the future. To what extent will storage of this massive information be limiting and a barrier to this scenario?

Routine sequencing as part of the PulseNet program includes immediate upload of data to the NCBI Sequence Read Archive (SRA), which then serves as the data storage. Public health labs will have no reason to store data locally for E. coli, Salmonella, Listeria, and Campylobacter. This would be the bulk of sequencing data for most public health labs. Of note, most of the analysis we do is on the assembled genomes, which are much smaller than the raw data.

While there is no technical reason to keep raw data in-house after it has been uploaded to the SRA, there are certain legal issues that haven’t been sorted out yet. Foodborne disease outbreaks, for example, occasionally result in lawsuits, and in those cases, laboratory and epidemiologic data are usually subject to release. Will it be enough if the original, raw data is not available, but the slightly processed data are available on NCBI? This still needs to be resolved.

A single MiSeq run for 24-36 isolates will take up 10-15 GB of hard drive space. This is about 500 MB per genome in raw data. Thus, a public health lab that wishes to sequence non-PulseNet organisms would need about 1 TB of data storage for every 2,000 non-PulseNet organisms. If the labs embrace cloud storage for anonymous sequencing data, the monthly cost for 1 TB is less than $50.
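The storage estimate above is simple enough to script for other isolate counts. The following is a rough sketch with hypothetical function names; it assumes decimal units (1 TB = 1,000,000 MB), takes the 500 MB per genome figure at face value, and uses $50 per TB per month as an upper bound, since the text only says the cost is less than that.

```python
MB_PER_GENOME = 500           # raw data per genome, from the text above
MB_PER_TB = 1_000_000         # decimal units (assumption)
MAX_USD_PER_TB_MONTH = 50     # upper bound quoted in the text


def raw_storage_tb(n_isolates, mb_per_genome=MB_PER_GENOME):
    """Raw-data storage needed for n_isolates, in TB."""
    return n_isolates * mb_per_genome / MB_PER_TB


def max_monthly_cloud_cost_usd(n_isolates):
    """Upper bound on monthly cloud cost, scaling linearly with storage."""
    return raw_storage_tb(n_isolates) * MAX_USD_PER_TB_MONTH


def per_genome_range_mb(run_gb=(10, 15), isolates=(24, 36)):
    """Smallest and largest per-genome size implied by one MiSeq run."""
    low = min(run_gb) * 1000 / max(isolates)
    high = max(run_gb) * 1000 / min(isolates)
    return low, high


for n in (500, 2000, 10000):
    print(f"{n} isolates: {raw_storage_tb(n):.2f} TB, "
          f"at most ${max_monthly_cloud_cost_usd(n):.0f} per month")

low_mb, high_mb = per_genome_range_mb()
print(f"Per-genome raw data implied by one run: {low_mb:.0f}-{high_mb:.0f} MB")
```

The per-run figures imply a per-genome range that brackets the 500 MB working value, which is why the sketch treats 500 MB as a default rather than a fixed constant.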

PFGE

16. Will certification and proficiency testing continue for Listeria PFGE after January 15, 2018?

Certification and proficiency testing (PT) for Listeria PFGE will end on January 15th.

17. Will PFGE pattern naming continue for Listeria PFGE after January 15, 2018?

For those states that continue to do PFGE and submit patterns to the national database after January 15th, patterns will only be named upon request. To request naming, send an email to the PulseNet inbox (pulsenet [at] cdc.gov) with the subject "Request to name Listeria PFGE patterns".

18. Do I need to combine my PFGE database with my WGS database (if a pilot state) before January 15th?

No, you do not need to combine the PFGE and WGS databases for Listeria before January 15th. PulseNet Central and Applied Maths are working on scripts to combine your databases and upgrade your PFGE database to BN 7.6. Stay tuned for further guidance on how to combine databases.

19. It was announced that L. monocytogenes PFGE will be completely replaced by WGS. What other pathogens will be transitioned to WGS-only in the near future?

The transition to WGS should be complete for the major PulseNet pathogens (e.g. Salmonella enterica, Campylobacter jejuni/coli, pathogenic E. coli, Shigella, Listeria monocytogenes) in 2018. Other pathogens (e.g. Vibrio spp., Yersinia spp., Cronobacter sakazakii) should follow in 2019.

20. I would like PFGE patterns for my Listeria isolate(s) – what should I do?

There are two options: 1) run the isolate(s) by PFGE in your lab or 2) submit the isolate(s) to CDC and request PFGE to be run at CDC by emailing the PulseNet inbox (pulsenet [at] cdc.gov) with the subject "Request to PFGE Listeria". CDC will retain capacity for Listeria PFGE and perform the analysis on outbreak-related isolates for international comparisons and may be able to assist states with specific projects and other needs, if requested; however, if you are not certified to perform WGS in your lab, we highly encourage you to get certified as soon as possible.

Analysis

21. Do I have to upgrade to BioNumerics 7.6 before January 15, 2018?

No, you do not need to upgrade to BioNumerics 7.6 before January 15th. In the transition period as we get the combined national databases into production, please share your metadata using Excel templates (available on SharePoint).

22. What's the difference between wgSNP vs high quality SNP?

hqSNP analysis is just another name for wgSNP analysis. Both look for SNPs within total genomic DNA minus the masked regions.

23. There are different methods for assessing the relatedness of isolates using WGS data. For example, we have wgMLST, cgMLST, CFSAN high quality SNP (hqSNP) pipeline, BioNumerics hqSNP, and Lyve-Set hqSNP pipeline. What are the differences between them and which one should I use?

These methods all produce similar results when properly controlled. Each has its pros and cons, and which you choose depends on your situation. A hqSNP analysis is highly specific, and with multiple pipelines available can be run with little upfront development. Most of the genome is potentially included in the analysis, including non-coding regions. However, analyses are data-intensive and not suitable for a distributed testing network such as PulseNet. Ad-hoc analyses must be well controlled, which must include a preliminary sequence analysis to inform the choice of reference genomes and removal of confounding SNPs, such as those due to mobile elements. MLST analyses are easier to standardize, and once a sequence is initially analyzed, little computing power is needed to conduct comparison studies. For nomenclature and global comparisons, cgMLST provides the most stability, while wgMLST provides the most resolution. References are built into the definition tables, and mobile elements (or other potentially confounding sequences) are not scored (unless defined in the scheme). However, cg/wgMLST analyses require considerable up-front development, validation, and curation. If cg/wgMLST analyses are available, they are the logical choice for many public health applications. The hqSNP analyses can then be used as a secondary method to answer specific epidemiological questions. For fully centralized systems, primary hqSNP analysis is an option.
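To illustrate why MLST-style comparisons need little computing power once alleles have been called, here is a minimal sketch of a pairwise allele-difference count. It is not the BioNumerics implementation or any PulseNet scheme: the locus names, allele numbers, and the choice to skip loci missing in either isolate are all assumptions made for the example.

```python
def allele_distance(profile_a, profile_b, loci=None):
    """Count loci at which two isolates carry different alleles.

    profile_a / profile_b map locus name -> allele number (None = not called).
    loci restricts the comparison, e.g. to a core-genome subset; by default
    every locus present in both profiles is compared.
    Loci missing or uncalled in either isolate are skipped (an assumption).
    """
    if loci is None:
        loci = profile_a.keys() & profile_b.keys()
    differences = 0
    for locus in loci:
        a = profile_a.get(locus)
        b = profile_b.get(locus)
        if a is None or b is None:
            continue
        if a != b:
            differences += 1
    return differences


# Hypothetical allele profiles: three core loci and one accessory locus.
isolate_1 = {"core_001": 4, "core_002": 7, "core_003": 1, "acc_101": 2}
isolate_2 = {"core_001": 4, "core_002": 9, "core_003": 1, "acc_101": 5}
core_loci = {"core_001", "core_002", "core_003"}

wg_distance = allele_distance(isolate_1, isolate_2)             # all shared loci
cg_distance = allele_distance(isolate_1, isolate_2, core_loci)  # core loci only
print("wgMLST-style allele differences:", wg_distance)
print("cgMLST-style allele differences:", cg_distance)
```

In this toy case the whole-genome comparison sees one more difference than the core-only comparison, which mirrors the point above that wgMLST offers more resolution while cgMLST relies on a more stable set of loci.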

24. Still regarding the different methods for assessing the relatedness of isolates using WGS, will there be a standard method in the near future, or should we know how to use all of them?

PulseNet is adopting standard cgMLST and wgMLST analyses, with on-demand hqSNP analyses for identified clusters. This same approach is being pursued by PulseNet International. In the U.S., PulseNet analyses will be conducted through BioNumerics. Analyses conducted at NCBI, including those of GenomeTrakr, utilize both wgMLST and hqSNP analyses.

25. How will we be able to detect clusters of listeriosis if PFGE is not run routinely and we don’t have access to the national Listeria wgMLST database?

PulseNet CDC will continue to perform cluster searches in the Listeria wgMLST database. If a cluster is detected, an outbreak code will be assigned and this information will be posted on the PulseNet SharePoint site. It is imperative that you check these posts to see if your lab has isolates that are involved in an ongoing investigation. Additionally, you should know the trends for Listeria prevalence in your state. If you detect an increase in the number of listeriosis cases (despite not knowing if they have matching sequences/PFGE patterns), do not hesitate to contact PulseNet (pulsenet [at] cdc.gov, subject "Increase in Listeria cases") with your concerns.

NCBI has developed the Pathogen Detection Pipeline, which is a great tool for viewing your sequence data on NCBI and seeing how it relates to other sequences (within 50 SNPs). All PulseNet Listeria sequences and minimal metadata are uploaded in real time to NCBI and the trees are updated daily on the website. Please look on SharePoint in the Library of PulseNet Documents under BioNumerics Training for directions on how to utilize this tool for surveillance.

26. How will the advancement of culture-independent diagnostic tests (CIDT) impact sequencing?

CIDT will reduce the number of isolates that are sent to public health labs. This will require the clinics, etc., to send the sample matrix directly to the public health lab for isolation in order to perform WGS. Public health labs will require additional funding for the materials and labor required for isolation. Although there is active research ongoing to directly sequence from clinical specimens without the need for isolation, this technology is not ready yet and needs significant development.

Interpretation

27. Should we routinely resample patients when we are first trying to establish SNP pipeline/interpretation with a new organism? What are the minimal criteria for establishing an evolutionary time clock for a new species--you mentioned the importance of tight epi correlations, thinking more of this in terms of resampling colonized patients?

There is definitely some value in resampling patients to help establish the rate of change within patients (as well as potentially hypervariable genes). While defining the mutational rates for different taxa is typically not a trivial research undertaking, the type of data generated through resampling (of both patients and environments where an organism persists) will always be helpful for interpretation of SNP or allele differences between isolates.

28. What are the risks of SNP analysis when there are only a handful of sequenced isolates--i.e. operating on that initial steep downward slope of establishing the core genome?

One main issue with SNP analyses when only a few sequenced isolates are available is that SNP calling really works best when done with a closely related reference genome (see JC Kwong, J Clin Micro 2016, Figure 3). With few available genomes, there is a higher risk that no reference genome that is closely related to the isolates of interest is available. For cgMLST, a good number of isolates is needed to identify the core genes that should be included in the cgMLST scheme, so design of a cgMLST scheme would be difficult if only a few sequenced isolates of a species or group (e.g., serotype) are available.

29. When sequencing, how do you distinguish between integral genes vs mobile elements?

This is based on sequence similarity. There are databases of known genes and mobile elements which are used to identify different regions within a genome sequence.

30. What should I do if my clinical isolates match an anonymous submission?

Anonymous submission is not currently possible.

Communication

31. What do the numbers assigned to each isolate mean?

Numbers assigned to isolates are usually identifiers. There are identifiers for the original isolate in the local public health department, identifiers for PulseNet WGS, BioSample identifiers at NCBI, and Sequence Read Archive identifiers at NCBI. Soon there will also be identifiers as to where the isolate is located in a composite phylogenetic tree made from wgMLST data. When looking at the phylogenetic tree, the identifier will most often be the state public health lab ID (for trees built at the state public health lab) or the PulseNet WGS ID (for trees built at the CDC).

32. How is/will the rollout be communicated with the states?

The roll out has been presented on SharePoint and discussed on AMD/CARB calls.

33. How will the analyses be shared with the states?

Analyses will be shared as they currently are, via trees, cluster codes and other information on SharePoint. Based on feedback from the states, additional analyses will be provided. Kelley Hise described these on the AMD/CARB call on January 11. States will also have the option of using the NCBI Pathogen Detection Pipeline, with directions in SharePoint. However, please remember that sequences will not be available for analysis until at least 24 hours after their upload to NCBI.

34. How will State DOHs be able to access WGS data from other states for comparison against their own isolates? Or will this always be centralized within CDC?

States will be able to conduct these analyses themselves through BioNumerics 7.6, using data from PulseNet participants in all 50 states in the PulseNet Central database. Currently, this capability is limited to CDC and 10 pilot states, but full local control will be rolled out to all the states when database modifications are completed. Bioinformatics and high-performance computing will remain centralized at CDC or some other shared facility.

35. Is it possible to make an anonymous submission of WGS to NCBI to prevent liability issues?

We are not aware of a mechanism for submitting anonymous data to NCBI. Account registration is required. VoluntaryNet (VolNet; a program of the University of Georgia) was set up to enable the food industry to anonymously query PFGE profiles from their production against all profiles in the PulseNet national database; it has not yet been set up to accept WGS data. Industry data is anonymized before submission to the VolNet database and only the sender knows the link between the anonymized identifier and their submission. VolNet sends out reports with the results of all queries of the PulseNet database to all participants in VolNet. PulseNet may also ask VolNet for industry matches to clinical isolates in an ongoing outbreak investigation. If a match between a clinical isolate and an anonymously submitted food or environmental isolate is found, a message is sent to all VolNet members. The member with the isolate that matches the outbreak strain may then voluntarily choose to be identified, or may choose to not identify themselves.

Pathogen-specific

36. Since TB is a unique disease where there is recent vs old exposure, especially for prevalent strains spanning transmission over 10-20 years, how can WGS help refute or confirm transmission in absence of epi data? Could you speak to how many SNPs mean relatedness in absence of epi data for TB in comparison to E. coli or other organisms that have point source transmission?

One of the main uses of MTB sequencing is helping to identify those cases likely to be due to recent transmission. The older typing technologies (MIRU/VNTR and spoligotyping) have been shown to be very useful in this regard over the past decade. Cases with closely related isolates are much more likely to be due to recent transmission. The older technologies, however, had much more limited resolution compared with whole-genome sequencing. Experience with whole-genome sequencing has already demonstrated the usefulness of MTB sequencing, both in identifying cases likely to be due to recent transmission and, in some cases, further dividing those cases into smaller groups that could help in narrowing possible transmission settings. As with any other pathogen in public health practice, sequencing by itself rarely if ever gives a definitive response. Solid epidemiologic data and follow-through are needed as much as they were before sequencing. Sequencing provides one more tool for TB programs.

One challenge that is somewhat unique to MTB is its slow rate of mutation. On average, MTB generates a single SNP every two years over its entire 4.4-million-base genome (0.5 SNPs/year, or ~0.1 SNP/Mb/year). But TB tends to spread relatively slowly (relative to many other pathogens, that is), so even with such a low mutation rate, MTB sequencing is proving very useful.
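The rate conversion quoted above can be reproduced in a few lines. This sketch assumes a constant, clock-like rate along a single lineage, which is a simplification; the function names are hypothetical, and the output is an average expectation, not a threshold for deciding relatedness.

```python
GENOME_SIZE_MB = 4.4              # MTB genome size in megabases, from the text
SNPS_PER_GENOME_PER_YEAR = 0.5    # one SNP every two years, from the text


def snps_per_mb_per_year():
    """Genome-wide rate converted to a per-megabase rate."""
    return SNPS_PER_GENOME_PER_YEAR / GENOME_SIZE_MB


def expected_snps(years):
    """Average number of SNPs accumulated along one lineage over `years`.

    Assumes a constant rate; real lineages vary around this average.
    """
    return SNPS_PER_GENOME_PER_YEAR * years


print(f"Per-Mb rate: {snps_per_mb_per_year():.2f} SNPs/Mb/year")
for years in (2, 10, 20):
    print(f"{years} years: about {expected_snps(years):.0f} SNPs expected")
```

Even over the 10 to 20 year spans mentioned in the question, the expected number of SNPs along a lineage stays small, which is one reason epidemiologic context remains essential when interpreting MTB sequence differences.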

37. For pathogens like TB - to what degree is the epi, for example relationships between cases, needed to interpret the WGS - do you need strong epi data to interpret the WGS results?

In public health, molecular data should always be interpreted taking into account the epidemiologic context. As with any isolated piece of epidemiologic data, molecular data are rarely conclusive by themselves and can sometimes be misleading. It is always best to look at all of the available data in order to make the most accurate inferences.

38. Could you speak about this type of genome sequencing for Zika virus? Is it at all practical? Do we know how stable the virus is?

The Zika genome is about 10,000 to 11,000 nucleotides long, so whole-genome sequencing is definitely possible, even with Sanger sequencing. However, while sequencing has been useful in understanding the virus’s emergence in the Americas, it’s not being performed routinely at this point, and may have limited utility in standard public health practice. In addition, many case-patients present after or towards the end of the relatively short viremic phase, making isolation or sequencing of the virus more challenging.

39. You spoke a lot about this technique's application in bacteriology, but do you see any applications/utility in viral outbreaks?

NGS is very useful for a number of viral pathogens. In fact, because viral genomes are relatively small and often mutate very quickly, routine sequencing (i.e., Sanger sequencing), including whole-genome sequencing, of viral pathogens has been done since the 1990s. The Global Polio Eradication Initiative, for example, has been routinely sequencing all wild poliovirus isolates since the early 2000s, and the data has been extremely useful in understanding the spread of the virus, the emergence of vaccine-derived polioviruses, and in monitoring progress towards eradication. Sequencing is similarly being used for other viral vaccine-preventable diseases, such as measles and rubella. CaliciNet is now replacing Sanger sequencing with NGS in some sites. For hepatitis C, the CDC’s “GHOST” (Global Hepatitis Outbreak Sequencing Technology) system has shown how useful sequencing can be in investigating hospital-based outbreaks and is now being applied in the community setting. During the waning phase of the West Africa Ebola outbreak, NGS was used to identify the source of outlier cases. There are many other examples in addition to these.

40. What is the role of WGS related to fungal outbreaks, like mucor? What do we know about thresholds for fungi? How do fungal genomes compare (in size) to bacterial genomes?

The genomes of eukaryotes such as fungi are generally more complicated than those of prokaryotes (bacteria). Whereas bacterial genomes are typically in the millions of bases, eukaryotic genomes range from the tens of millions of bases (typical of many fungi) into the billions, and both groups vary considerably in size. In addition, the fact that many eukaryotic genomes are diploid makes the sequencing and assembly more complex. With those caveats, NGS has proved useful in certain fungal outbreaks. For example, sequencing of Candida auris is giving us a better picture of its recent emergence throughout the world. NGS has also been used to look at isolates from coccidioidomycosis (“Valley Fever”) cases that were acquired recently in the Pacific Northwest, outside of the usual endemic zone of the fungus. In summary, sequencing of fungi is more complex than sequencing of bacteria, but clearly has a role in public health.

Answers written by:

John Besser (CDC/OID/NCEZID)
Matthew Wise (CDC/OID/NCEZID)
Heather Carleton (CDC/OID/NCEZID)
Amanda Conrad (CDC/OID/NCEZID)
Peter Gerner-Smidt (CDC/OID/NCEZID)
Eija Trees (CDC/OID/NCEZID)
Joel Sevinsky (CDPHE)
Gregory Armstrong (CDC/OID/NCEZID)
Martin Wiedmann (Cornell University)
Renato Orsi (Cornell University)
Genevieve Sullivan (Cornell University)