BigQuery and Data Analysis

Data tables used in the Mitelman Database are stored within the Google BigQuery data warehouse. BigQuery allows for a high-level of interactivity and analysis to be performed on the data, providing high performance and low cost in a cloud environment. The following sections provide details for the Mitelman database content hosted in BigQuery, as well as giving some examples for how you can extract meaningful insights from the Mitelman data.

About BigQuery and the Mitelman Database Tables

The Mitelman data are organized in a collection of tables within the BigQuery environment. Each table uses rows and columns to represent related data. Specific values can be obtained using a query consisting of the table and column names.

To learn more about the data and computational capabilities of ISB-CGC:
https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/About-ISB-CGC.html
See how ISB-CGC uses BigQuery tables to host dozens of other NCI cancer datasets:
https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/BigQuery.html
Learn about BigQuery direct from Google:
https://cloud.google.com/bigquery/?hl=en

Examples and Tutorials for How to Use BigQuery

Queries are constructed using the GoogleSQL dialect, which is an SQL-like language that provides powerful statements in an easy-to-understand format.

SQL uses a simple format for creating queries:
SELECT column
FROM table
WHERE conditions
This simple construct can be used to build many queries. Typically, data are spread across multiple tables. In that case you would join the tables:
SELECT column1, column2
FROM table1
JOIN table2
ON table1value = table2value
WHERE conditions
A more complex example from the notebook Mitelman Fusions in TCGA:
SELECT g.Gene, count(g.Gene) AS Count, m.Morph, k.Benamning, m.Topo
FROM `mitelman-db.prod.MolClinGene` g
JOIN `mitelman-db.prod.MolBiolClinAssoc` m
ON m.RefNo = g.RefNo AND m.InvNo = g.InvNo
JOIN `mitelman-db.prod.Koder` k
ON k.Kod = m.Topo AND k.kodTyp = "TOP"
WHERE g.Gene LIKE "%::%"
AND m.Morph = "3111"
AND m.Topo = "0602”
GROUP BY g.Gene, m.Morph, k.Benamning, m.Topo
ORDER BY Count DESC LIMIT 10
This query returns the top 10 prostate adenocarcinoma gene fusions from the Mitelman database using these criteria:
  • topography of prostate
  • morphology of adenocarcinoma
  • gene name that includes a double colon
Because the returned values exist in three different tables, the query uses RefNo and InvNo to join the tables. Counts of the number of each fusion occurrence are also determined.

The inclusion of Mitelman data in BigQuery presents the opportunity to perform complex data analysis. This capability can be extended further with the use of the R or Python programming languages in Jupyter-type notebooks.
ISB-CGC hosts multiple Mitelman-based notebooks in a GitHub repository:
https://github.com/isb-cgc/Community-Notebooks/tree/master/MitelmanDB
  • Mitelman Fusions in TCGA
    This notebook explores common gene fusions found in prostate cancer and uses that information to create a random forest classifier using TCGA gene expression data to predict the Gleason Grade in prostate cancer cases.
  • Mitelman Cytogenetics Subsets
    Grouping similar cases is always a useful way of looking at cytogenetic data. This notebook explores a variety of methods for creating subsets relevant to cytogenetic research.
  • Correlations Between Mitelman and TCGA Datasets
    With an ever-increasing number of databases in existence, comparing data from multiple databases is always going to be a useful tool. This notebook demonstrates how to compute correlations between the Mitelman Database and TCGA.
  • Exploring and Comparing Mitelman DB CytoConverter and TCGA Datasets
    This notebook examines common chromosomal aberrations and utilizes CytoConverter coordinates to compare with TCGA data.
Find additional information on notebooks at these links: