BigQuery and Data Analysis
Data tables used in the Mitelman Database are stored within the Google BigQuery data warehouse. BigQuery allows
for a high-level of interactivity and analysis to be performed on the data, providing high performance and low
cost in a cloud environment. The following sections provide details for the Mitelman database content hosted in
BigQuery, as well as giving some examples for how you can extract meaningful insights from the Mitelman data.
About BigQuery and the Mitelman Database Tables
The Mitelman data are organized in a collection of tables within the BigQuery environment. Each table uses
rows and columns to represent related data. Specific values can be obtained using a query consisting of the
table and column names.
Table | Data Type | Table Description |
---|---|---|
AuthorReference | LITERATURE | Data was compiled from the original publications used in the Mitelman database. Author names were manually curated for each assigned reference number. The order of the names in the list of authors for the given reference is recorded. |
CytoBands_hg38 | CYTOGENETIC | Data was extracted from the UCSC database. Start and end coordinates of each cytoband were obtained for all chromosomes based on genome build hg38. |
CytoConverted | CYTOGENETIC | Data was processed using karyotype nomenclature from the CytoConverterSample table. Chromosomal imbalances and associated genomic coordinates were generated by the CytoConverter software. More details: see https://github.com/isb-cgc/ISB-CGC-CytoConverter/ |
CytoConvertedLog | METADATA | Data was generated by the CytoConverter software. Warnings and errors were returned from the software while processing karyotype nomenclature from the CytoConverterSample table. More details: see https://github.com/isb-cgc/ISB-CGC-CytoConverter/ |
Cytogen | CLINICAL DATA | Data was compiled from the original publications used in the Mitelman database. Patient demographics and disease data were manually curated for each assigned reference and case number. |
CytogenInv | CYTOGENETIC | Data was compiled from the original publications used in the Mitelman database. Karyotype nomenclature data were manually curated for each assigned reference, case, and investigation number. |
CytogenInvValid | CYTOGENETIC | Data was generated using the CytogenInv table. Karyotype nomenclature was processed using a syntax checker that validated the string for use with the CytoConverter software. |
KaryAbnorm | CYTOGENETIC | Data was generated using the CytogenInv table. Individual karyotype nomenclature abnormalities are maintained for searching purposes. |
KaryBit | CYTOGENETIC | Data was generated using the CytogenInv table. Includes all elements of karyotype nomenclature when split on commas. |
KaryBreak | CYTOGENETIC | Data was generated using the CytogenInv table. Breakpoints were extracted from the karyotype nomenclature. |
KaryClone | CYTOGENETIC | Data was generated using the CytogenInv table. Contains separated clones from karyotype nomenclature. |
Koder | CLINICAL SUPPLEMENT | Data was compiled from the original publications used in the Mitelman database. Various fields of data were assigned a unique code to be used as a reference in other database tables. |
MolBiolClinAssoc | CYTOGENETIC | Data was compiled from the original publications used in the Mitelman database. Gene fusions associated with karyotype nomenclature were manually curated for each reference. |
MolClinAbnorm | CYTOGENETIC | Data was generated using the MolBiolClinAssoc table. Karyotype nomenclature was split into individual abnormalities. |
MolClinBreak | CYTOGENETIC | Data was generated using the MolBiolClinAssoc table. Breakpoints were extracted from the karyotype nomenclature. |
MolClinGene | GENE FUSION | Data was compiled from the original publications used in the Mitelman database. Gene fusions were manually curated for each reference. |
RecurrentData | GENE FUSION | Data was aggregated from the separated chromosomal abnormalities. Frequency counts of each structural abnormality are recorded. |
RecurrentNumData | GENE FUSION | Data was aggregated from the separated chromosomal abnormalities. Frequency counts of each numerical abnormality are recorded. |
Reference | LITERATURE | Data was compiled from the original publications used in the Mitelman database. Information associated with the reviewed publication is recorded and assigned a reference number. |
To learn more about the data and computational capabilities of ISB-CGC:
https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/About-ISB-CGC.html
https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/About-ISB-CGC.html
See how ISB-CGC uses BigQuery tables to host dozens of other NCI cancer datasets:
https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/BigQuery.html
https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/BigQuery.html
Learn about BigQuery direct from Google:
https://cloud.google.com/bigquery/?hl=en
https://cloud.google.com/bigquery/?hl=en
Examples and Tutorials for How to Use BigQuery
Queries are constructed using the GoogleSQL dialect, which is an SQL-like language that provides powerful
statements in an easy-to-understand format.
SQL uses a simple format for creating queries:
SELECT column
FROM table
WHERE conditions
This simple construct can be used to build many queries. Typically, data are spread
across multiple tables. In that case you would join the tables:
SELECT column1, column2
FROM table1
JOIN table2
ON table1value = table2value
WHERE conditions
A more complex example from the notebook Mitelman Fusions in TCGA:
SELECT g.Gene, count(g.Gene) AS Count, m.Morph, k.Benamning, m.Topo
FROM `mitelman-db.prod.MolClinGene` g
JOIN `mitelman-db.prod.MolBiolClinAssoc` m
ON m.RefNo = g.RefNo AND m.InvNo = g.InvNo
JOIN `mitelman-db.prod.Koder` k
ON k.Kod = m.Topo AND k.kodTyp = "TOP"
WHERE g.Gene LIKE "%::%"
AND m.Morph = "3111"
AND m.Topo = "0602”
GROUP BY g.Gene, m.Morph, k.Benamning, m.Topo
ORDER BY Count DESC LIMIT 10
This query returns the top 10 prostate adenocarcinoma gene fusions from the
Mitelman
database using these criteria:
- topography of prostate
- morphology of adenocarcinoma
- gene name that includes a double colon
Because the returned values exist in three different tables, the query uses RefNo
and InvNo
to join the tables. Counts of the number of each fusion occurrence are also
determined.
The inclusion of Mitelman data in BigQuery presents the opportunity to perform
complex data analysis. This capability can be extended further with the use of the R or
Python programming languages in Jupyter-type notebooks.
ISB-CGC hosts multiple Mitelman-based notebooks in a GitHub repository:
https://github.com/isb-cgc/Community-Notebooks/tree/master/MitelmanDB
https://github.com/isb-cgc/Community-Notebooks/tree/master/MitelmanDB
-
Mitelman Fusions in TCGA
This notebook explores common gene fusions found in prostate cancer and uses that information to create a random forest classifier using TCGA gene expression data to predict the Gleason Grade in prostate cancer cases.
-
Mitelman Cytogenetics Subsets
Grouping similar cases is always a useful way of looking at cytogenetic data. This notebook explores a variety of methods for creating subsets relevant to cytogenetic research.
-
Correlations Between Mitelman and TCGA Datasets
With an ever-increasing number of databases in existence, comparing data from multiple databases is always going to be a useful tool. This notebook demonstrates how to compute correlations between the Mitelman Database and TCGA.
-
Exploring and Comparing Mitelman DB CytoConverter and TCGA Datasets
This notebook examines common chromosomal aberrations and utilizes CytoConverter coordinates to compare with TCGA data.
Find additional information on notebooks at these links:
All ISB-CGC notebooks:
https://github.com/isb-cgc/Community-Notebooks
https://github.com/isb-cgc/Community-Notebooks
Documentation for notebooks:
https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/MitelmanDBNotebooks.html
https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/MitelmanDBNotebooks.html
A notebook demonstrating the basics of working with ISB-CGC BigQuery
tables:
https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/Quick_Start_Guide_to_ISB_CGC.ipynb
https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/Quick_Start_Guide_to_ISB_CGC.ipynb