Posted: 1 day ago
On-site | Full Time
This role is responsible for managing and maintaining complex, distributed big data ecosystems, ensuring the reliability, scalability, and security of large-scale production infrastructure. Key responsibilities include automating processes, optimizing workflows, troubleshooting production issues, and driving system improvements across multiple business verticals.
Responsibilities:
● Manage, maintain, and support incremental changes to Linux/Unix environments.
● Lead on-call rotations and incident responses, conducting root cause analysis and driving postmortem processes.
● Design and implement automation systems for managing big data infrastructure, including provisioning, scaling, upgrades, and patching clusters.
● Troubleshoot and resolve complex production issues while identifying root causes and implementing mitigating strategies.
● Design and review scalable and reliable system architectures.
● Collaborate with teams to optimize overall system/cluster performance.
● Enforce security standards across systems and infrastructure.
● Set technical direction, drive standardization, and operate independently.
● Ensure availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning.
● Respond to, analyze, and resolve system outages and disruptions, and implement measures to prevent similar incidents from recurring.
● Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience.
● Monitor and optimize system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning.
● Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle.
● Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities.
● Develop and enforce SRE best practices and principles.
● Align across functional teams on priorities and deliverables.
● Drive automation to enhance operational efficiency.
● Adopt new technologies as the need arises and define architectural recommendations for new tech stacks.
Requirements:
● Over 4 years of experience managing and maintaining distributed big data ecosystems.
● Strong expertise in Linux, including IP networking, iptables, and IPsec.
● Proficiency in scripting/programming with languages like Perl, Golang, or Python.
● Hands-on experience with the Hadoop stack (HDFS, HBase, Airflow, YARN, Ranger, Kafka, Pinot).
● Familiarity with open-source configuration management and deployment tools such as Puppet, Salt, Chef, or Ansible.
● Solid understanding of networking, open-source technologies, and related tools.
● Excellent communication and collaboration skills.
● DevOps tools: SaltStack, Ansible, Docker, Git.
● SRE logging and monitoring tools: ELK Stack, Grafana, Prometheus, OpenTSDB, OpenTelemetry.
● Experience managing infrastructure on public cloud platforms (AWS, Azure, GCP).
● Experience in designing and reviewing system architectures for scalability and reliability.
● Experience with observability tools to visualize and alert on system performance.
● Experience with petabyte-scale data migrations and large-scale cluster upgrades.
PhonePe
Location: Bengaluru, Karnataka, India
Salary: Not disclosed