BigQuery and Data Analysis

Data tables used in the Mitelman Database are stored within the Google BigQuery data warehouse. BigQuery allows for a high-level of interactivity and analysis to be performed on the data, providing high performance and low cost in a cloud environment. The following sections provide details for the Mitelman database content hosted in BigQuery, as well as giving some examples for how you can extract meaningful insights from the Mitelman data.

About BigQuery and the Mitelman Database Tables

The Mitelman data are organized in a collection of tables within the BigQuery environment. Each table uses rows and columns to represent related data. Specific values can be obtained using a query consisting of the table and column names.

Table	Data Type	Table Description
AuthorReference	LITERATURE	Data was compiled from the original publications used in the Mitelman database. Author names were manually curated for each assigned reference number. The order of the names in the list of authors for the given reference is recorded.
CytoBands_hg38	CYTOGENETIC	Data was extracted from the UCSC database. Start and end coordinates of each cytoband were obtained for all chromosomes based on genome build hg38.
CytoConverted	CYTOGENETIC	Data was processed using karyotype nomenclature from the CytoConverterSample table. Chromosomal imbalances and associated genomic coordinates were generated by the CytoConverter software. More details: see https://github.com/isb-cgc/ISB-CGC-CytoConverter/
CytoConvertedLog	METADATA	Data was generated by the CytoConverter software. Warnings and errors were returned from the software while processing karyotype nomenclature from the CytoConverterSample table. More details: see https://github.com/isb-cgc/ISB-CGC-CytoConverter/
Cytogen	CLINICAL DATA	Data was compiled from the original publications used in the Mitelman database. Patient demographics and disease data were manually curated for each assigned reference and case number.
CytogenInv	CYTOGENETIC	Data was compiled from the original publications used in the Mitelman database. Karyotype nomenclature data were manually curated for each assigned reference, case, and investigation number.
CytogenInvValid	CYTOGENETIC	Data was generated using the CytogenInv table. Karyotype nomenclature was processed using a syntax checker that validated the string for use with the CytoConverter software.
KaryAbnorm	CYTOGENETIC	Data was generated using the CytogenInv table. Individual karyotype nomenclature abnormalities are maintained for searching purposes.
KaryBit	CYTOGENETIC	Data was generated using the CytogenInv table. Includes all elements of karyotype nomenclature when split on commas.
KaryBreak	CYTOGENETIC	Data was generated using the CytogenInv table. Breakpoints were extracted from the karyotype nomenclature.
KaryClone	CYTOGENETIC	Data was generated using the CytogenInv table. Contains separated clones from karyotype nomenclature.
Koder	CLINICAL SUPPLEMENT	Data was compiled from the original publications used in the Mitelman database. Various fields of data were assigned a unique code to be used as a reference in other database tables.
MolBiolClinAssoc	CYTOGENETIC	Data was compiled from the original publications used in the Mitelman database. Gene fusions associated with karyotype nomenclature were manually curated for each reference.
MolBiolSpecMorph	CYTOGENETIC	Data was generated using the MolBiolClinAssoc table. Rare tumor types, special morphologies that are associated with gene fusion's karyotype nomenclature were manually curated for each reference.
MolClinAbnorm	CYTOGENETIC	Data was generated using the MolBiolClinAssoc table. Karyotype nomenclature was split into individual abnormalities.
MolClinBreak	CYTOGENETIC	Data was generated using the MolBiolClinAssoc table. Breakpoints were extracted from the karyotype nomenclature.
MolClinGene	GENE FUSION	Data was compiled from the original publications used in the Mitelman database. Gene fusions were manually curated for each reference.
RecurrentData	GENE FUSION	Data was aggregated from the separated chromosomal abnormalities. Frequency counts of each structural abnormality are recorded.
RecurrentNumData	GENE FUSION	Data was aggregated from the separated chromosomal abnormalities. Frequency counts of each numerical abnormality are recorded.
Reference	LITERATURE	Data was compiled from the original publications used in the Mitelman database. Information associated with the reviewed publication is recorded and assigned a reference number.

To learn more about the data and computational capabilities of ISB-CGC:
https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/About-ISB-CGC.html

See how ISB-CGC uses BigQuery tables to host dozens of other NCI cancer datasets:
https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/BigQuery.html

Learn about BigQuery direct from Google:
https://cloud.google.com/bigquery/?hl=en

Examples and Tutorials for How to Use BigQuery

Queries are constructed using the GoogleSQL dialect, which is an SQL-like language that provides powerful statements in an easy-to-understand format.

SQL uses a simple format for creating queries:


                                        SELECT column

                                        FROM table

                                        WHERE conditions

This simple construct can be used to build many queries. Typically, data are spread across multiple tables. In that case you would join the tables:


                                        SELECT column1, column2

                                        FROM table1

                                        JOIN table2

                                        ON table1value = table2value

                                        WHERE conditions

A more complex example from the notebook Mitelman Fusions in TCGA:


                                            SELECT g.Gene, count(g.Gene) AS Count, m.Morph, k.Benamning, m.Topo

                                            FROM `mitelman-db.prod.MolClinGene` g

                                            JOIN `mitelman-db.prod.MolBiolClinAssoc` m

                                            ON m.RefNo = g.RefNo AND m.InvNo = g.InvNo

                                            JOIN `mitelman-db.prod.Koder` k

                                            ON k.Kod = m.Topo AND k.kodTyp = "TOP"

                                            WHERE g.Gene LIKE "%::%"

                                            AND m.Morph = "3111"

                                            AND m.Topo = "0602”

                                            GROUP BY g.Gene, m.Morph, k.Benamning, m.Topo

                                            ORDER BY Count DESC LIMIT 10

This query returns the top 10 prostate adenocarcinoma gene fusions from the Mitelman database using these criteria:

topography of prostate
morphology of adenocarcinoma
gene name that includes a double colon

Because the returned values exist in three different tables, the query uses RefNo and InvNo to join the tables. Counts of the number of each fusion occurrence are also determined.

The inclusion of Mitelman data in BigQuery presents the opportunity to perform complex data analysis. This capability can be extended further with the use of the R or Python programming languages in Jupyter-type notebooks.

ISB-CGC hosts multiple Mitelman-based notebooks in a GitHub repository:
https://github.com/isb-cgc/Community-Notebooks/tree/master/MitelmanDB

Mitelman Fusions in TCGA
This notebook explores common gene fusions found in prostate cancer and uses that information to create a random forest classifier using TCGA gene expression data to predict the Gleason Grade in prostate cancer cases.
Mitelman Cytogenetics Subsets
Grouping similar cases is always a useful way of looking at cytogenetic data. This notebook explores a variety of methods for creating subsets relevant to cytogenetic research.
Correlations Between Mitelman and TCGA Datasets
With an ever-increasing number of databases in existence, comparing data from multiple databases is always going to be a useful tool. This notebook demonstrates how to compute correlations between the Mitelman Database and TCGA.
Exploring and Comparing Mitelman DB CytoConverter and TCGA Datasets
This notebook examines common chromosomal aberrations and utilizes CytoConverter coordinates to compare with TCGA data.

Find additional information on notebooks at these links:

All ISB-CGC notebooks:
https://github.com/isb-cgc/Community-Notebooks

Documentation for notebooks:
https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/MitelmanDBNotebooks.html

A notebook demonstrating the basics of working with ISB-CGC BigQuery tables:
https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/Quick_Start_Guide_to_ISB_CGC.ipynb

BigQuery and Data Analysis

About BigQuery and the Mitelman Database Tables

Mitelman Database Tables

Documentation and Links

Examples and Tutorials for How to Use BigQuery

SQL Query Examples

Data Analysis Notebooks