At GlaxoSmithKline we have created a world-leading data and computational environment to enable large scale scientific experiments that exploit GSK’s unique access to data.
Our focus is on bringing data, analytics & science together into solutions for our scientists to develop medicines for patients.
This data and computational environment supports GSK R&D across a broad range of pharmaceutical areas including genetics, functional genomics, clinical, biopharma and others.
The Data & Compute Delivery (DCD) Data Engineering team is a crucial component of the environment and is responsible for delivering the data pipelines that populate and maintain data for scientific use in HPCs, the Cloud and the R&D Information Platform (RDIP).
We are looking for a passionate and enthusiastic individual who will contribute to the strategy for data movement across a variety of scientific areas by working closely with those involved in the generation, handling and consumption of such data, including Data & Computational Science (DCS), R&D Tech, various vendors and the larger R&D organization.
The data engineer needs to be able to apply technologies in a DataOps environment to solve big data problems and to develop innovative big data solutions based on defined business requirements.
The successful candidate must be able to learn and work independently, lead or assist with pipeline development efforts and collaborate effectively with co-workers.
This role will provide YOU the opportunity to lead key activities to progress YOUR career. These responsibilities include some of the following:
Participate in data teams supporting the implementation of pipelines that advance R&D strategy and conceptual data flows
Partner with principal data engineers and metadata leads to translate conceptual data models into physical databases / tables optimized for data analytics in RDIP, using established environments and tools
Assist in the design, build, test and maintenance of data acquisition and processing pipelines, including but not limited to the creation / maintenance of appropriate artifacts
Ensure the preservation of data integrity from source to target state including but not limited to the acquisition of appropriate metadata and the incorporation of appropriate QC checks into the pipelines
Support the use and growth of the Data Engineering DataOps environment including development and maintenance of related DataOps / DevOps infrastructure
Provide Tier 3 support for production pipelines
Support DCS and broader R&D in self-service / exploratory efforts
Work with R&D and Tech to support DataOps enhancements, and onboard these tools or enhancements
Ensure the quality, consistency and availability of guidance documentation for end users of the tools to support high-quality outputs
Support GxP readiness as it relates to the data pipelines and address associated gaps
Basic Qualifications:
We are looking for professionals with these required skills to achieve our goals :
Computer Science, Bioinformatics, or related degree; 1+ years' experience in big data technologies, data movement, data wrangling or data / dev ops systems and tools
Experience with data movement and data pipelines
Experience with Big Data technologies (ideally the Cloudera stack, including HDFS, Hive, Impala and Spark), Cloud-based offerings (Microsoft Azure, GCP, AWS, etc.) and corresponding tools.
Preferred Qualifications:
If you have the following characteristics, it would be a plus :
Proven ability to contribute to development projects.
Strong interpersonal skills and effective communication of complex concepts to stakeholders with a wide range of expertise.
Familiarity with open source software, bioinformatics tools and languages such as SQL, R, Perl, Python, Java, and ETL tools.
Experience with data movement and management in the Pharmaceutical industry or related scientific fields.
Background and experience in LIMS, Next Generation Sequencing (NGS) workflows, Cloud computing and HPC systems.
Understanding of diverse omics data types from different sources, including RNA-Seq, DNA-Seq, ChIP-Seq, WES, WGS, ATAC-Seq, microbiome, proteomic and metabolomic data.
Familiarity with data mining, machine learning and artificial intelligence techniques