Accelerate Insights Using AWS SDK Instrumentation (NFX302)

Title

AWS re:Invent 2022 - Accelerate insights using AWS SDK instrumentation (NFX302)

Summary

  • Scott Pack and Nick Seow from Netflix's Cloud Infrastructure Security Team presented their work on identifying AWS resource dependencies for applications and their associated IAM roles.
  • Netflix struggled to limit the IAM blast radius and ran into the scaling limits of a single account, which created a need to migrate application identities to different accounts.
  • They needed a solution that provided broad service coverage, captured both management and data events, and included resource indicators without requiring application code modifications.
  • CloudTrail, S3 access logs, IAM Access Advisor, and IAM policies were considered but found insufficient for Netflix's needs.
  • They turned to client-side visibility, specifically AWS SDK instrumentation, to collect the necessary data.
  • Client-side monitoring (CSM) and HTTP proxies were evaluated, but both had limitations.
  • They explored modifying the SDK or using runtime-enabled intercepts, with the latter being more appealing.
  • Netflix's Java-heavy environment and immutable infrastructure influenced their approach to SDK instrumentation.
  • They discovered global interceptors in the AWS SDK for Java and the history recorder in botocore (the library underlying Boto3 for Python), which allowed them to intercept and record API calls without requiring any changes from application owners.
  • The collected data was minified and deduplicated on-instance before being sent to a centralized logging platform.
  • Off-instance, the data was reformatted for uniformity and stored in a relational database for querying.
  • The solution provided real-time insights into resource interactions and dependencies, enabling better management of IAM roles and resource policies.
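
The Python side of the interception and on-instance deduplication steps above can be sketched as follows. This is a minimal illustration, not Netflix's implementation: the class name `ApiCallRecorder`, the `_resource_hint` heuristic, and the output record shape are assumptions made for the example. It relies on botocore's global history recorder, which calls `emit(event_type, payload, source)` on each registered handler and records an `API_CALL` event carrying the service, operation, and request parameters.

```python
import json
from collections import Counter


class ApiCallRecorder:
    """Collects a minified, deduplicated view of SDK calls.

    Provides the emit(event_type, payload, source) method that botocore
    invokes on registered history handlers. (Hypothetical class name.)
    """

    def __init__(self):
        self.counts = Counter()

    def emit(self, event_type, payload, source):
        # Only API_CALL events are kept; HTTP-level events are ignored here.
        if event_type != "API_CALL":
            return
        key = (
            payload["service"],
            payload["operation"],
            self._resource_hint(payload.get("params") or {}),
        )
        # Deduplicate: identical calls collapse into one counted entry.
        self.counts[key] += 1

    @staticmethod
    def _resource_hint(params):
        # Assumed heuristic: keep one well-known identifying parameter and
        # drop everything else (object keys, bodies, payloads).
        for name in ("Bucket", "TableName", "QueueUrl", "TopicArn"):
            if name in params:
                return f"{name}={params[name]}"
        return ""

    def flush(self):
        """Return one JSON line per distinct call and reset the counters."""
        lines = [
            json.dumps(
                {"service": s, "operation": o, "resource": r, "count": n},
                sort_keys=True,
            )
            for (s, o, r), n in sorted(self.counts.items())
        ]
        self.counts.clear()
        return lines


def install(recorder):
    """Attach the recorder to botocore's global history recorder."""
    # Imported lazily so the rest of the sketch runs without botocore.
    from botocore.history import get_global_history_recorder

    history = get_global_history_recorder()
    history.add_handler(recorder)
    history.enable()
```

Calling `install(ApiCallRecorder())` once at process start would be enough for every Boto3 client in that process to feed the recorder. The lines returned by a periodic `flush()` are what an agent might ship to a central logging platform. How the hook gets loaded without touching application code (for example through a shared base library) is left out of this sketch.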

Insights

  • Netflix's approach to SDK instrumentation is a novel way to gain visibility into application behavior and dependencies on AWS resources without altering application code.
  • The use of global interceptors and history recorders can be a powerful tool for organizations looking to understand and manage their cloud infrastructure more effectively.
  • The strategy of piggybacking on existing libraries and infrastructure for deployment facilitated rapid adoption and minimized the need for application owners to make changes.
  • The ability to inspect both requests and responses opens up possibilities for real-time permissions troubleshooting and understanding operational dependencies.
  • This method of data collection and analysis can significantly reduce how often "we don't know" is the answer to questions about resource interactions, leading to more informed decision-making and policy crafting.
  • The approach taken by Netflix can be replicated using AWS primitives, making it accessible to other AWS customers who may face similar challenges.
  • The insights gained from this project have the potential to improve security posture, operational efficiency, and application robustness for organizations operating at scale in the cloud.
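
To make the off-instance step concrete, here is a minimal sketch of normalizing collected records into a relational table and asking which resources a given role touches. It uses SQLite from the Python standard library purely for illustration. The table layout, the record fields (`role`, `service`, `operation`, `resource`, `count`), and the role names are assumptions, not details from the talk.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the (hypothetical) api_calls table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS api_calls (
               role      TEXT NOT NULL,
               service   TEXT NOT NULL,
               operation TEXT NOT NULL,
               resource  TEXT NOT NULL,
               calls     INTEGER NOT NULL
           )"""
    )
    return conn


def normalize(record):
    """Reformat one collected record into a uniform row."""
    # Lowercase the service name so records from different SDKs line up.
    return (
        record["role"],
        record["service"].lower(),
        record["operation"],
        record.get("resource", ""),
        int(record.get("count", 1)),
    )


def ingest(conn, records):
    """Insert a batch of collected records."""
    conn.executemany(
        "INSERT INTO api_calls (role, service, operation, resource, calls) "
        "VALUES (?, ?, ?, ?, ?)",
        [normalize(r) for r in records],
    )
    conn.commit()


def resources_for_role(conn, role):
    """Return (service, resource, total_calls) rows for one role."""
    cursor = conn.execute(
        "SELECT service, resource, SUM(calls) FROM api_calls "
        "WHERE role = ? AND resource != '' "
        "GROUP BY service, resource ORDER BY service, resource",
        (role,),
    )
    return cursor.fetchall()
```

For example, after ingesting records for a role named `app-a`, `resources_for_role(conn, "app-a")` lists every service and resource pair that role was observed calling. That is the kind of answer needed when scoping an IAM policy or planning an account migration.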