Job Title: Data Engineer
Job Number: 22148
Location: Cambridge, MA
We are seeking a talented data engineer to contribute to the design and development of the software platform supporting our Kidney Genome Atlas (KGA). The KGA is a multi-omics and clinical data repository for tens of thousands of patients with various forms of kidney disease. The selected individual will assist with the development and monitoring of data processing pipelines, data infrastructure, and data integration in the KGA. This position will also be responsible for assisting with the management of data from clinical trials and the integration of our clinical trial data into the KGA.
- Identify, design, and implement internal process improvements: automating manual processes, optimizing data delivery, redesigning infrastructure, and refactoring code for greater scalability.
- Build the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources including clinical informatics systems using SQL, NoSQL and AWS ‘big data’ technologies.
- Create data tools that give computational research team members efficient access to data through APIs and client libraries.
- Build analytics tools that use the data pipelines to provide actionable insights into next-gen sequencing activities, data ingestion, and other key business performance metrics.
- Experience building and optimizing 'big data' data pipelines, architectures and data sets
- Working knowledge of genetic/genomic and bioinformatics tools and technologies
- 5+ years of experience in a data engineering or data processing role, and a Bachelor’s or higher degree in Computer Science, Statistics, Informatics, Information Systems or another quantitative field
- Experience with big data tools: Hadoop, Spark, Kafka, etc.
- Experience with relational SQL and NoSQL databases
- Experience with data pipeline and workflow management tools: Nextflow, Azkaban, Luigi, Airflow, etc.
- Experience with AWS cloud services: EC2, EMR, RDS, Redshift
- Experience with object-oriented or functional programming and scripting languages: Python, Java, C++, Scala, etc.
- Experience with biological data (DNA sequences, RNAseq, proteomics, microscopy images, etc.)