17th November 2022
Managing and securing external connectivity can be challenging and expensive when an organization’s workload is split between many isolated accounts.
Let’s consider a use case where an organization has dev, prod, and shared workloads deployed on private subnets in 3 isolated AWS accounts. Some of these workloads must be able to fetch data from the public internet, access AWS resources using Virtual Private Cloud (VPC) endpoints, and possibly communicate with each other.
To cover the above requirements we need to perform a few steps in each account.
Deploy a NAT Gateway
As a best practice, we should deploy one NAT Gateway (NGW) for each Availability Zone (AZ) to increase our system availability.
Apart from adding some management overhead and complexity to our landing zone creation, the main downside here is the cost of this solution.
In the us-east-1 region, each NGW costs 0.045 USD per hour plus 0.045 USD per GB of processed traffic. Leaving traffic aside, each account will cost ~100 USD/month (with 3 AZs).
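The ~100 USD figure can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes a 730-hour month (a common averaging convention, not stated in the text) and ignores per-GB traffic charges; the function name is illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def ngw_monthly_cost(hourly_usd: float = 0.045, az_count: int = 3) -> float:
    """Fixed monthly cost of running one NAT Gateway per AZ, traffic excluded."""
    return round(hourly_usd * HOURS_PER_MONTH * az_count, 2)


# One account with a NAT Gateway in each of 3 AZs
print(ngw_monthly_cost())
```

The result lands just under 100 USD, which is where the rounded per-account figure comes from.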
Create the Required VPC Endpoints to Keep the Communication with AWS Services Internal without Flowing through the NAT Gateway
As above, deploying the same VPC endpoints multiple times adds overhead and is not cost-effective. Each endpoint deployed across 3 AZs costs ~22 USD/month, without considering traffic. We will incur this fee in every account where we deploy the endpoints.
Create VPC Peerings when Required to Establish the Internal Communication Between VPCs
This step introduces little overhead when there are very few accounts and the network topology is stable. It may become a huge issue if the number of accounts in your network grows or the interconnection requirements change more frequently than expected. We will introduce the benefits of using a centralized Transit Gateway (TGW) over VPC peering shortly.
Manage the Network Security
From a security point of view, we will need to replicate in each account whatever policies apply to outbound connections. This includes firewalls, proxies, and so on.
The security team will need to trace and act on several different accounts to ensure that the security policies are correctly applied.
Introducing a centralized outbound network account may alleviate some of the outlined issues.
This account will act as:
- The main organization network router
- The centralized exit point for all the organization’s outbound connections
In 2018 AWS introduced Transit Gateways as a way to centralize VPCs and on-premises networks through a central hub.
Some of the advantages of using a TGW compared to VPC peering are as follows:
- A TGW accepts attachments from many VPCs, Direct Connect, and Site-to-Site VPN at the same time, and can route traffic between all the attachments. A VPC peering connection, by contrast, links exactly two VPCs and is not transitive: peering VPC A and VPC C to VPC B does not allow A to communicate with C.
- The same VPN connection can be reused for multiple VPCs.
- A TGW supports up to thousands of attached networks.
- A TGW route table per attachment allows fine-grained routing configurations.
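To make the scaling argument in the list above concrete, the sketch below compares the number of connections needed for full connectivity. It assumes every VPC must reach every other VPC and that a hub needs one attachment per VPC; the function names are illustrative only.

```python
def full_mesh_peerings(vpc_count: int) -> int:
    """Peering connections needed so that every VPC can reach every other VPC.

    Peering is not transitive, so each pair of VPCs needs its own connection.
    """
    return vpc_count * (vpc_count - 1) // 2


def hub_attachments(vpc_count: int) -> int:
    """Attachments needed when every VPC connects to one central hub."""
    return vpc_count


for n in (3, 10, 50):
    print(f"{n} VPCs: {full_mesh_peerings(n)} peerings vs {hub_attachments(n)} hub attachments")
```

With 3 VPCs the two approaches need the same number of connections, but the mesh grows quadratically while the hub grows linearly.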
The following diagram describes the proposed architecture for a centralized outbound account, using a Transit Gateway as the main network router plus TGW peering attachments from the spoke accounts.
- The private subnets route table in DEV and PROD (#A – #B) accounts routes any external traffic (0.0.0.0/0) through a local TGW (#PT – #DT).
- The TGW attachments route tables (#C and #D) will send any non-local request through the peering attachments between #PT/#DT and #OT. This will make traffic flow into the OUTBOUND account.
- Both the peering attachments from DEV and PROD use a custom TGW route table (#E) that routes all the traffic to the OUTBOUND VPC that is directly attached to the TGW as well. At this point, the traffic directed to a local resource in the outbound VPC (a VPC endpoint for example) is locally managed while the rest flows through the NAT Gateway.
- The TGW attachment between the OUTBOUND VPC and the TGW allows the traffic to flow back to the DEV and the PROD accounts.
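The routing decisions described above all follow the same rule: the most specific matching route wins. The sketch below simulates that lookup with the standard library. The CIDR ranges and target names are hypothetical placeholders, not values taken from the sample deployment.

```python
import ipaddress
from typing import Dict, Optional


def next_hop(route_table: Dict[str, str], destination: str) -> Optional[str]:
    """Return the target of the most specific route that matches destination."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table.items()
        if dest in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # Longest prefix match: the route with the largest prefix length wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Hypothetical private subnet route table in the DEV account
DEV_ROUTES = {"10.20.0.0/16": "local", "0.0.0.0/0": "dev-tgw"}

# Hypothetical private subnet route table in the OUTBOUND account
OUTBOUND_ROUTES = {
    "10.10.0.0/16": "local",         # e.g. a VPC endpoint ENI
    "10.20.0.0/16": "outbound-tgw",  # return path to DEV
    "0.0.0.0/0": "nat-gateway",
}

print(next_hop(DEV_ROUTES, "93.184.216.34"))       # dev-tgw
print(next_hop(OUTBOUND_ROUTES, "10.10.3.7"))      # local
print(next_hop(OUTBOUND_ROUTES, "93.184.216.34"))  # nat-gateway
```

A public destination leaves DEV through the TGW, and once in the OUTBOUND VPC it either stays local (VPC endpoint) or exits through the NAT Gateway.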
Check this GitHub repo for a sample Terraform deployment.
The VPC Endpoint’s DNS resolution deserves some more attention.
When we create a VPC endpoint, AWS adds a dedicated Elastic Network Interface (ENI), which receives a private IP address from the Classless Inter-Domain Routing (CIDR) range of the subnet where the endpoint is deployed.
The private DNS resolution (if enabled in the endpoint configuration) is managed by AWS behind the scenes using a hidden Route 53 private hosted zone. In our example, an endpoint deployed in a private subnet of the OUTBOUND account can have its private IP resolved only from within that account’s VPC. Resolution will not work from the PROD and DEV accounts.
To solve this issue we can perform the following steps:
- Disable the private DNS resolution in the VPC endpoint configuration
- Create a Route 53 private hosted zone in the OUTBOUND account.
For example for s3:
- Add an alias record that points to the VPC endpoint
- Associate the hosted zone with the PROD and DEV VPCs. Keep in mind that this cannot be done from the AWS console; it must be performed using the AWS CLI, APIs, or SDKs. That said, if you are working with Terraform, this use case is covered.
In this way, the service endpoint URL will be resolved in the PROD and DEV accounts and the request will be correctly routed to the private IP in the OUTBOUND VPC.
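The cross-account association step can be sketched with the Route 53 API as exposed by boto3. This is an illustration under assumptions, not the author's implementation: the function name is made up, the two clients are expected to be boto3 Route 53 clients created with credentials for the OUTBOUND and spoke accounts respectively, and all IDs are supplied by the caller.

```python
def share_hosted_zone(outbound_route53, spoke_route53, hosted_zone_id, vpc_id, vpc_region):
    """Associate a private hosted zone owned by one account with a VPC in another.

    outbound_route53: Route 53 client using OUTBOUND account credentials (assumption)
    spoke_route53:    Route 53 client using PROD or DEV account credentials (assumption)
    """
    vpc = {"VPCRegion": vpc_region, "VPCId": vpc_id}

    # 1. The zone owner authorizes the spoke VPC to be associated.
    outbound_route53.create_vpc_association_authorization(
        HostedZoneId=hosted_zone_id, VPC=vpc
    )
    # 2. The VPC owner performs the association.
    spoke_route53.associate_vpc_with_hosted_zone(
        HostedZoneId=hosted_zone_id, VPC=vpc
    )
    # 3. The authorization is no longer needed once the association exists.
    outbound_route53.delete_vpc_association_authorization(
        HostedZoneId=hosted_zone_id, VPC=vpc
    )
```

In practice the two clients would typically come from separate sessions, for example `boto3.Session(profile_name=...).client("route53")` with one profile per account. The function would be called once for the PROD VPC and once for the DEV VPC.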
What About Internal Routing Between VPCs?
As noted earlier, the Transit Gateway in the outbound account acts as the main network router.
All the traffic originating from the organization’s VPCs flows through it.
As a consequence, we can use the TGW route tables to allow or deny connections between the attached VPCs.
In the above diagram, a connection originating from the PROD VPC flows through the peering attachment and is redirected to the DEV VPC (see the routes in the outbound account subnets’ route tables).
Adding a blackhole route to the TGW route table associated with the peering attachments can prevent this flow. For example:
|10.10.0.0/16||VPC Outbound TGW Attachment|
|0.0.0.0/0||VPC Outbound TGW Attachment|
The clear benefit is the ability to manage the org network topology acting on a single resource.
Note that this configuration is done in the OUTBOUND account and does not need any action in the PROD or DEV account.
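As a rough sketch of how such blackhole routes could be created programmatically, the function below uses the EC2 `create_transit_gateway_route` call as exposed by boto3. The function name, the route table ID, and the CIDR ranges are hypothetical; the client is assumed to be a boto3 EC2 client created with OUTBOUND account credentials.

```python
def blackhole_spoke_cidrs(ec2_client, tgw_route_table_id, spoke_cidrs):
    """Add one blackhole route per spoke CIDR to a TGW route table.

    Traffic matching a blackhole route is dropped, so spoke-to-spoke
    connections stop while the default route keeps internet access working.
    """
    for cidr in spoke_cidrs:
        ec2_client.create_transit_gateway_route(
            DestinationCidrBlock=cidr,
            TransitGatewayRouteTableId=tgw_route_table_id,
            Blackhole=True,
        )


# Hypothetical usage: drop traffic addressed to the DEV and PROD ranges
# blackhole_spoke_cidrs(ec2, "tgw-rtb-EXAMPLE", ["10.20.0.0/16", "10.30.0.0/16"])
```

Because a blackhole route for a spoke CIDR is more specific than 0.0.0.0/0, it takes precedence for spoke-to-spoke traffic without affecting the default route.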
What About Costs?
One of the benefits of this solution is the reduction of the costs that come from reusing the same NAT Gateway and VPC endpoints deployments.
Let’s consider an organization with 3 accounts whose workloads are deployed in private subnets and need external connectivity plus 3 VPC endpoints (S3, DynamoDB, and SSM for Session Manager).
Without Outbound Account
Without the outbound account, we have a total cost of ~498 USD/month:
Deploying NGWs in 3 AZs in each account will cost ~300 USD/month (100 USD * 3 accounts).
Deploying the 3 endpoints in 3 AZs in each account will cost ~198 USD/month (22 USD * 3 endpoints * 3 accounts).
With Outbound Account
With the outbound account, we have a total cost of ~276 USD/month:
We will deploy the NGWs only in the OUTBOUND account, in 3 AZs, for a total of ~100 USD/month.
We will deploy the VPC endpoints only in the OUTBOUND account, in 3 AZs, for a total of ~66 USD/month (22 USD * 3 endpoints).
We must also add the cost of the 3 TGW attachments.
As of today, each attachment costs 36.50 USD/month, for a total of ~110 USD/month.
Important: In both solutions we are not calculating data transfer costs, which vary depending on many different factors. Data flowing through a TGW attachment will incur a cost of ~0.02 USD per GB.
As we can see the savings will increase with more accounts joining the network. It’s possible that a huge amount of data flowing through the TGWs will reduce the savings. That said even without any savings we will obtain a better architecture for the same price.
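The comparison can be generalized to any number of accounts. The sketch below reuses the rounded monthly figures from this section and, following the estimate above, assumes one TGW attachment per account. The break-even estimate is a simplification that treats the TGW per-GB charge as the only extra traffic cost of the centralized design.

```python
NGW_MONTHLY_USD = 100.0             # NAT Gateways in 3 AZs
ENDPOINT_MONTHLY_USD = 22.0         # one VPC endpoint in 3 AZs
TGW_ATTACHMENT_MONTHLY_USD = 36.50  # one TGW attachment
TGW_DATA_USD_PER_GB = 0.02          # TGW data processing


def cost_without_hub(accounts: int, endpoints: int) -> float:
    """Every account runs its own NAT Gateways and VPC endpoints."""
    return accounts * (NGW_MONTHLY_USD + endpoints * ENDPOINT_MONTHLY_USD)


def cost_with_hub(accounts: int, endpoints: int) -> float:
    """One shared set of NAT Gateways and endpoints plus one attachment per account."""
    shared = NGW_MONTHLY_USD + endpoints * ENDPOINT_MONTHLY_USD
    return shared + accounts * TGW_ATTACHMENT_MONTHLY_USD


def break_even_gb(accounts: int, endpoints: int) -> int:
    """Monthly GB through the TGW at which the fixed-cost savings are used up."""
    savings = cost_without_hub(accounts, endpoints) - cost_with_hub(accounts, endpoints)
    return round(savings / TGW_DATA_USD_PER_GB)


for n in (3, 5, 10):
    print(n, cost_without_hub(n, 3), cost_with_hub(n, 3), break_even_gb(n, 3))
```

With 3 accounts and 3 endpoints this gives 498 USD versus 275.50 USD per month, and the fixed-cost savings are consumed only after roughly 11,000 GB of monthly traffic through the TGW.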
As we saw, introducing a network hub account has several benefits.
Working with a centralized network router gives better visibility over the organization’s network and simplifies the setup of spoke accounts that will require less management overhead.
The network routing is managed on a dedicated account that can be provisioned with the required permissions to make it available only to a dedicated team.
Since every network connection can be inspected, allowed, and logged in a single point, applying the organization’s network security policies is easier and more effective.
In a future post, we will analyze how we can use tools like Network ACLs, Security Groups, and AWS Network Firewall to control the traffic flowing through the OUTBOUND account.