Managing and securing external connectivity can be challenging and expensive when an organization's workload is split between many isolated accounts.
Let's consider a use case where an organization has dev, prod, and shared workloads deployed on private subnets in 3 isolated AWS accounts. Some of these workloads must be able to fetch data from the public internet, access AWS resources using Virtual Private Cloud (VPC) endpoints, and possibly communicate with each other.
To cover the above requirements we need to perform a few steps in each account.
As a best practice, we should deploy one NAT Gateway (NGW) for each Availability Zone (AZ) to increase our system availability.
Apart from adding some management overhead and complexity to our landing zone creation, the main downside here is the cost of this solution.
In the us-east-1 region, each NGW costs 0.045 USD per hour + 0.045 USD per GB of processed traffic. Leaving traffic aside, each account will cost ~100 USD/month (considering 3 AZs).
As above, deploying VPC endpoints multiple times introduces overhead and is not cost-effective. Each endpoint deployed on 3 AZs costs ~22 USD/month without considering the traffic. We will incur this fee in every account in which we deploy the endpoints.
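The monthly NGW figure quoted above can be reproduced with a few lines of Python. This is only a rough sketch: the 730 hours per month is an assumption (a common averaging convention), and the function name is made up for illustration.

```python
# Rough NAT Gateway cost estimate for a single account (us-east-1 rates
# taken from the text above).
HOURS_PER_MONTH = 730      # assumption: average number of hours in a month
NGW_HOURLY_USD = 0.045     # per NAT Gateway, per hour
NGW_PER_GB_USD = 0.045     # per GB of traffic through the NAT Gateway


def ngw_monthly_cost(az_count: int, gb_processed: float = 0.0) -> float:
    """Estimate the monthly cost of one NAT Gateway per AZ, in USD."""
    hourly_part = az_count * NGW_HOURLY_USD * HOURS_PER_MONTH
    traffic_part = gb_processed * NGW_PER_GB_USD
    return round(hourly_part + traffic_part, 2)


if __name__ == "__main__":
    # 3 AZs, traffic ignored: close to the ~100 USD/month quoted above.
    print(ngw_monthly_cost(3))
    # Same deployment with a hypothetical 1000 GB of monthly traffic.
    print(ngw_monthly_cost(3, gb_processed=1000))
```

Keep in mind that this cost is paid again in every account that deploys its own NAT Gateways.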
This aspect introduces little overhead with very few accounts and a fairly stable network topology. It may become a huge issue if the number of accounts in your network grows and the interconnection requirements change more frequently than expected. We will introduce the benefits of using a centralized Transit Gateway (TGW) over VPC peering shortly.
From a security point of view, in each account, we will need to replicate every policy that applies to outbound connections. This includes firewalls, proxies, etc.
The security team will need to trace and act on several different accounts to ensure that the security policies are correctly applied.
Introducing a centralized outbound network account may alleviate some of the outlined issues.
This account will act as
In 2018 AWS introduced Transit Gateways as a way to centralize VPCs and on-premises networks through a central hub.
Some of the advantages of using a TGW compared to VPC peering are as follows:
In the following diagram, I have described the proposed architecture for creating a centralized outbound account using Transit Gateway as the main account network router + TGW peering attachments from spoke accounts.
Check this GitHub repo for a Terraform sample deployment.
The VPC Endpoint's DNS resolution deserves some more attention.
When we create a VPC endpoint, AWS adds a dedicated Elastic Network Interface (ENI), which receives a private IP address from the Classless Inter-Domain Routing (CIDR) range of the subnet where the endpoint is deployed.
The private IP DNS resolution (if enabled in the endpoint configuration) is managed by AWS behind the scenes using a hidden Route 53 private hosted zone. In our example, deploying an endpoint in the OUTBOUND account private subnet allows the endpoint's private IP to be resolved only within that account. It will not be resolved in the PROD and DEV accounts.
To solve this issue we can perform the following steps:
In this way, the service endpoint URL will be resolved in the PROD and DEV accounts and the request will be correctly routed to the private IP in the OUTBOUND VPC.
As we said before the Transit Gateway in the outbound account acts as the main network router.
All the traffic originating from the organization's VPCs flows through it.
As a consequence, we can use the TGW route table to allow or deny connections between the attached VPCs.
In the above diagram, a connection originating from the PROD VPC flows through the peering attachment and is redirected to the DEV VPC (see the routes in the outbound account subnets' route tables).
Adding a blackhole route in the TGW route table associated with the peering attachments can prevent this flow. For example
The clear benefit is the ability to manage the org network topology acting on a single resource.
Note that this configuration is done in the OUTBOUND account and does not need any action in the PROD or DEV account.
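To illustrate how a blackhole route changes the outcome of a routing decision, here is a small, self-contained simulation in Python. It is not AWS code: the CIDR ranges, attachment names, and the `resolve` helper are all hypothetical, and the lookup simply picks the most specific matching route, which is how a route table is expected to behave.

```python
import ipaddress
from dataclasses import dataclass

BLACKHOLE = "blackhole"


@dataclass(frozen=True)
class Route:
    cidr: str     # destination CIDR block
    target: str   # attachment name, or BLACKHOLE to drop the traffic


def resolve(route_table: list[Route], destination_ip: str) -> str:
    """Return the target of the most specific route matching the destination."""
    dest = ipaddress.ip_address(destination_ip)
    matches = [r for r in route_table if dest in ipaddress.ip_network(r.cidr)]
    if not matches:
        return "no-route"
    # Longest prefix wins: a /16 route overrides the 0.0.0.0/0 default.
    best = max(matches, key=lambda r: ipaddress.ip_network(r.cidr).prefixlen)
    return best.target


# Hypothetical route table associated with the PROD attachment.
# 10.1.0.0/16 is assumed to be the DEV VPC CIDR.
prod_routes = [
    Route("0.0.0.0/0", "outbound-vpc-attachment"),
    Route("10.1.0.0/16", BLACKHOLE),
]

if __name__ == "__main__":
    print(resolve(prod_routes, "10.1.4.20"))      # destination inside DEV
    print(resolve(prod_routes, "93.184.216.34"))  # public internet address
```

In this sketch, traffic from PROD to any address in the assumed DEV range is dropped, while internet-bound traffic still follows the default route toward the outbound VPC.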
One of the benefits of this solution is the reduction of the costs that come from reusing the same NAT Gateway and VPC endpoints deployments.
Let’s consider an organization with 3 accounts whose workloads are deployed in private subnets and need external connectivity + 3 VPC endpoints (S3, DynamoDB, and SSM for Session Manager).
Without the outbound account we have a total cost of ~498 USD/month:
NAT Gateway: deploying 3 NGWs in 3 AZs in each account will cost ~300 USD/month (100 USD * 3 accounts).
VPC endpoints: deploying 3 endpoints in 3 AZs in each account will cost ~198 USD/month (22 USD * 3 endpoints * 3 accounts).
With the outbound account we have a total of ~276 USD/month:
NAT Gateway: we deploy the NGWs only in the OUTBOUND account, in 3 AZs, for a total of ~100 USD/month.
VPC endpoints: we deploy the endpoints only in the OUTBOUND account, in 3 AZs, for a total of ~66 USD/month (22 USD * 3 endpoints).
TGW attachments: we must also account for the cost of the 3 TGW attachments. As of today, each attachment costs 36.50 USD/month, for a total of ~110 USD/month.
Important: in both solutions we are not calculating data transfer costs, which may vary depending on many different aspects. Data flowing through a TGW attachment will incur a cost of ~0.02 USD per GB.
As we can see the savings will increase with more accounts joining the network. It's possible that a huge amount of data flowing through the TGWs will reduce the savings. That said even without any savings we will obtain a better architecture for the same price.
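The comparison can be generalized to any number of accounts with a short Python sketch. It reuses the rounded monthly figures from this post, ignores data transfer costs, and assumes one TGW attachment per workload account, as in the calculation above. The function names are illustrative only.

```python
# Monthly figures (USD) taken from the estimates in this post.
NGW_PER_ACCOUNT = 100       # 3 NAT Gateways, one per AZ
ENDPOINT_3_AZ = 22          # one VPC endpoint deployed on 3 AZs
TGW_ATTACHMENT = 36.50      # one Transit Gateway attachment


def decentralized_cost(accounts: int, endpoints: int) -> float:
    """Every account deploys its own NAT Gateways and VPC endpoints."""
    return accounts * (NGW_PER_ACCOUNT + ENDPOINT_3_AZ * endpoints)


def centralized_cost(accounts: int, endpoints: int) -> float:
    """One shared deployment in the outbound account plus one TGW
    attachment per workload account (data transfer not included)."""
    shared = NGW_PER_ACCOUNT + ENDPOINT_3_AZ * endpoints
    return shared + TGW_ATTACHMENT * accounts


if __name__ == "__main__":
    for n in (3, 5, 10):
        without_hub = decentralized_cost(n, endpoints=3)
        with_hub = centralized_cost(n, endpoints=3)
        print(f"{n} accounts: {without_hub} vs {with_hub} USD/month")
```

With 3 accounts and 3 endpoints this gives 498 USD versus 275.5 USD per month, matching the totals above. The gap widens as accounts are added because each new account only adds one attachment instead of a full set of NAT Gateways and endpoints.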
As we saw, introducing a network hub account has several benefits.
Working with a centralized network router gives better visibility over the organization's network and simplifies the setup of spoke accounts that will require less management overhead.
The network routing is managed on a dedicated account that can be provisioned with the required permissions to make it available only to a dedicated team.
Since every network connection can be inspected, allowed, and logged in a single point, applying the organization's network security policies is easier and more effective.
In a future post, we will analyze how we can use tools like Network ACLs, Security Groups, and the new VPC Firewall to control the traffic flowing through the OUTBOUND account.