Systems Engineer

  • Next Gen Cyber LLC
  • Krieger School of Arts and Sciences, Homewood Campus, Baltimore, MD
  • Apr 29, 2018
Full time Risk Management Technology R&D Systems Development Knowledge Management

Job Description

Essential Duties & Responsibilities:

Systems Engineering and Oversight

  • Design, organize, test and implement cutting-edge hardware designs
  • Document systems so that users can easily find useful information and other IT staff can perform routine tasks and provide backup.
  • Provide stable solutions for HHPC use.
  • Oversee maintenance of the HHPC community’s technical infrastructure.
  • Maintain HHPC and related clusters.
  • Plan and make purchases to meet the needs of the HHPC community.
  • Maintain job scheduling and storage allocation systems and policies in accordance with the HHPC Steering Committee to ensure fair allocation of shared resources.
  • Maintain extensive monitoring systems to facilitate quick, proactive responses to routine failures, and to provide comprehensive performance data logging.
  • May provide general system administration backup for other facilities or research groups.
  • Ensure solutions released to the community are stable and usable.
  • Ensure resources meet the community’s needs and are highly available to the group with limited interruption.
  • Optimize clusters frequently so that they meet the needs of the community and remain highly available with limited interruption.
  • Meet the community’s infrastructure needs effectively so that infrastructure remains highly available with limited interruption.
  • Make purchases and purchase requests on time and in accordance with departmental and university policy and procedure.

 

Project Management and Outreach

  • Understand HPC technical needs. Work closely with the facility’s faculty steering committee to shape policies, and ensure that these policies are successfully implemented.
  • Conceive, initiate, define, plan, organize and execute project plans
  • Develop close ties with participating faculty and their research groups in order to maintain awareness of their computing needs. Facilitate community building among the facility’s users to encourage sharing of solutions.
  • Learn from previous experiences when developing new projects.
  • Create and maintain a stable, secure operating system and software environment, which continues to meet users’ evolving research needs.

 

Technological Research

  • Plan the retirement of aging systems.
  • Develop custom tools where necessary, and contribute useful creations back to open source development efforts where appropriate.
  • Research new technologies that could be beneficial to HPC.
  • Test and vet new technology in support of HPC efforts.
  • Work with vendors to procure prototypes and demo units.
  • Participate in purchasing additions to existing clusters.

 

Training/Education

  • Continuously evaluate new tools and technologies for use in existing and future clusters.
  • Attend department and University-sponsored training to increase knowledge, improve skills, and learn new skills. May substitute University training for supervisor-approved, job-related commercial course offerings.

 

Internal and External Contacts


 

This position may interact with an array of departmental and central administrative offices, faculty, staff, researchers, and students, and with numerous external vendors for the purpose of accomplishing HPC technology goals. Works routinely with University faculty, administrators, students, and researchers. Collaborates regularly with professional colleagues from the central IT@JH organization, and from other academic departments.

Qualifications

Bachelor's degree. Five years of related experience. Additional education may substitute for required experience, and additional related experience may substitute for required education, to the extent permitted by the JHU equivalency formula. Master’s degree preferred. Formal training in computational science or engineering is a big plus.

 

Required Experience:

  • Minimum of 5 years' experience managing Linux servers.
  • Experience as a high-level Linux Systems Administrator.
  • Experience managing mission critical services in a 24x7x365 environment
  • Experience administering High Performance Computing cluster schedulers (e.g., Maui, Slurm, Moab)
  • In-depth knowledge of TCP/IP networking and related protocols
  • Excellent scripting skills (Python, Perl, shell)

 

Special Knowledge and Skill:

  • Knowledge of job scheduling software (e.g., OpenPBS/Torque, Maui, LSF, SLURM, Moab).
  • Advanced knowledge of Linux, Apache, MySQL, PHP/Python/Perl (LAMP) technology/toolkits.
  • Apply expert knowledge of Unix/Linux systems administration, including all aspects of management, monitoring, performance analysis, and integration in complex heterogeneous environments
  • Use configuration and hardware management tools (e.g., xCAT, Puppet, IPMI) to help maintain large-scale Linux clusters, supercomputers, storage systems, and smaller systems
  • Understanding of HPC hardware and software technologies.
  • Understanding of large data storage systems.
  • Develop, debug and utilize programs to automate system management tasks and user workflows
  • Understanding of networking, including high-speed networks (Ethernet/InfiniBand).
  • Monitor and optimize services and performance (file systems, network interconnects) using tools such as Nagios and Ganglia.
  • Administer management servers for infrastructure (file servers, monitoring, etc.)
  • Solve escalated systems-related issues, coordinate with vendors to isolate hardware problems, and install firmware or software patches as necessary
  • Provide in-depth system analysis, problem resolution, and design and implementation of system enhancements. This includes both functional and performance issues.
  • Working autonomously, design, implement, and maintain the security and monitoring infrastructure for the HHPC
  • Independently research and make technical recommendations regarding the HHPC’s cybersecurity policies, practices, system development, and architecture
  • Respond to security alerts and tickets as required
  • Must have the ability to multi-task and prioritize.
  • Must be adaptable and able to meet conflicting deadlines.
  • Exceptional organizational skills.
  • Must have excellent oral and written communication and interpersonal skills for customer service, training, evangelism of new technologies, negotiation, and persuasion.
  • Knowledge of networking principles as they apply to cluster computing including protocols, routers and firewalls.
  • Ability to apply techniques to maintain a consistent operating system image across a large number of homogeneous nodes (e.g. PXEboot, CF Node, NFS Root).
  • Ability to meet the physical requirements of the position.
  • Produce effective and thorough technical documentation
  • Maintain individual components of the HHPC computing environment to ensure compliance with Johns Hopkins University security standards and practices
  • Provide on-call and off-hours support as assigned.
  • Other duties as assigned by the supervisor.

     

Preferred Qualifications

Preferred Experience

  • Experience architecting and managing HPC clusters
  • Expert-level knowledge of configuration management and monitoring tools (Puppet, Nagios, xCAT, etc.)
  • Experience with open source software compilation and Apache administration
  • Experience with open source software development and the open source community.

NOTE: The successful candidate(s) for this position will be subject to a pre-employment background check.

Salary

$67,114 - $92,219