Hello everyone, and welcome to the Sprkl tips & tools series. In each installment of this interview series, we host a prominent developer and explore topics that bring value to the developer community.
Sprkl is a Personal Observability platform that provides personalized feedback on code changes while coding in the IDE. We help developers ship correct and efficient code while spending less time on debugging and frustrating rework. Powered by OpenTelemetry, Sprkl instruments every code change and analyzes it upon execution.
This time we interviewed Rohan Verghese, a backend software developer at Amazon. Rohan shared tips on improving developer productivity in complex distributed environments, and we even got a look at how Amazon’s R&D teams work. We hope you’ll gain some value from it too. 🙂
My name is Rohan Verghese. I’ve been a software developer for about 16 years now. I’m currently at Amazon, where I have worked in the supply chain area for about a year. Before Amazon, I worked at Bally Interactive, an online sports betting platform, in their third-party feed team. Before that, I spent ten years at a very small startup that made a hedge fund administration product. By very small, I mean that I was the only backend developer for the first five years!
Amazon is pretty famous for its “two-pizza” teams. Each team is 6-10 engineers (i.e., can be fed by two pizzas) and owns one or more entire services end-to-end. Data science teams are also organized along similar lines, tackling individual projects. However, I’m not sure there’s a traditional R&D structure.
It depends on how you look at it. The actual code is probably a three or so. It very rarely does anything tricky. But your service will probably call six other services, and multiple services will call yours. So organizing all that and making sure it’s resilient and highly scalable with all the different AWS constructs and options is where the complexity lies.
To sum up, the basic code is relatively simple. However, the complexity lies in figuring out how your service fits into the greater web of services and communicating with other services.
Ironically, previous jobs had more complex code simply because the domain was more complex. There were days when I cursed the crazy people who invented cryptocurrencies.
It’s usually left up to the individual team. Most teams I’ve seen follow a basic Scrum-like process with two-week sprints: sprint planning at the start and regular standups during the sprint.
The only major day-to-day difference between Amazon and other companies I’ve worked at is that Amazon likes writing documents. People write documents for almost everything. For pretty much every design decision and meeting, an engineer will write a design document about the subject, with all the pros and cons listed out. You spend the first 10-15 minutes of the meeting reading the document, then discuss it and make a decision.
On the whole, I like this culture of writing documents, but there’s no denying it is a pretty big culture shock when you’re first introduced to it.
Some teams who deploy to production may perform automated canary roll-outs. This means the code changes will go live only on a few instances at first and automatically revert if the metrics say something went wrong. But those only work when you have a lot of instances, and they are not trivial to implement. It would be an interesting step to make our deployments to production even safer, but we have yet to feel the need for it. Simpler health checks have always been enough for us, as the test coverage catches most breaking changes.
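To make the canary roll-out idea concrete, here is a minimal Python sketch of the promote-or-revert decision. It is an illustration only: the names `InstanceMetrics` and `canary_decision`, and the threshold values, are hypothetical and do not describe Amazon's actual deployment tooling.

```python
from dataclasses import dataclass


@dataclass
class InstanceMetrics:
    """Request and error counts collected from a group of instances."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid division by zero when no traffic has been seen yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: InstanceMetrics,
                    canary: InstanceMetrics,
                    max_increase: float = 0.01,
                    min_requests: int = 100) -> str:
    """Return "wait", "rollback", or "promote" for a canary deployment.

    max_increase and min_requests are assumed thresholds for illustration.
    """
    # Too little canary traffic to judge: keep waiting.
    if canary.requests < min_requests:
        return "wait"
    # The canary is noticeably worse than the old version: revert.
    if canary.error_rate > baseline.error_rate + max_increase:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    old_fleet = InstanceMetrics(requests=10_000, errors=50)
    new_canary = InstanceMetrics(requests=500, errors=25)
    print(canary_decision(old_fleet, new_canary))
```

The `min_requests` guard reflects the point made above: with only a handful of instances or little traffic, the comparison is too noisy to act on.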
In my opinion, the biggest blind spot is in structuring data. Because of scalability concerns, most teams use non-relational data sources like DynamoDB or DocumentDB (which is MongoDB-compatible). But because your data is unstructured with these data sources, it’s very easy to use a data shape that is not quite right. That ends up making dealing with your data more difficult than it needs to be.
I find that the structure imposed by relational databases (normal forms, relationships, and constraints) leads you to shape your data in a superior, more correct fashion, which in turn often makes your code a lot simpler and more straightforward.
In distributed software development, many teams start the delivery process with the top-level API that other services will call; in other words, with the code. However, I think you should always start at the bottom, with the data that will be stored and/or operated on by your system.
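As a small illustration of how relational constraints push data into a correct shape, the following Python sketch uses the standard-library `sqlite3` module. The `warehouse` and `shipment` tables are hypothetical examples, not a schema from the interview.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a small, constrained schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE warehouse ("
        " id INTEGER PRIMARY KEY,"
        " code TEXT NOT NULL UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE shipment ("
        " id INTEGER PRIMARY KEY,"
        " warehouse_id INTEGER NOT NULL REFERENCES warehouse(id),"
        " quantity INTEGER NOT NULL CHECK (quantity > 0))"
    )
    return conn


def try_insert_shipment(conn: sqlite3.Connection,
                        warehouse_id: int, quantity: int) -> bool:
    """Insert a shipment; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO shipment (warehouse_id, quantity) VALUES (?, ?)",
                (warehouse_id, quantity),
            )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    db = make_db()
    with db:
        db.execute("INSERT INTO warehouse (code) VALUES ('SEA')")
    print(try_insert_shipment(db, 1, 5))   # valid row
    print(try_insert_shipment(db, 99, 5))  # unknown warehouse
    print(try_insert_shipment(db, 1, 0))   # non-positive quantity
```

In a schemaless store, the second and third rows would be accepted silently, and every piece of code that reads the data would have to defend against them.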
All of them? At the very least you should have good unit tests that run quickly on every local build. Ideally, there would be automated integration tests that run against a test environment as part of CI.
Finally, in production, it’s often a good idea to have “canaries,” sample data, and workflows that your system processes regularly so that there’s always some activity. Then if activity drops to zero, you know there’s a serious problem.
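The "activity drops to zero" check can be sketched in a few lines of Python. This is a simplified illustration: the function name `activity_stalled` and the five-minute window are assumptions, and a real system would more likely raise the alarm from its metrics platform than from application code.

```python
import time
from typing import Iterable, Optional


def activity_stalled(event_times: Iterable[float],
                     now: Optional[float] = None,
                     window_seconds: float = 300.0) -> bool:
    """Return True if no canary event was processed in the last window.

    event_times holds Unix timestamps of processed canary workflows.
    """
    if now is None:
        now = time.time()
    cutoff = now - window_seconds
    # Since canaries are submitted regularly, silence means trouble.
    return not any(t >= cutoff for t in event_times)


if __name__ == "__main__":
    current = 1_000_000.0
    recent = [current - 600, current - 120]
    stale = [current - 900, current - 600]
    print(activity_stalled(recent, now=current))
    print(activity_stalled(stale, now=current))
```

Because the canary traffic is synthetic and regular, the check does not depend on real customers happening to be active at the time.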
In distributed systems, you need to rely on the pillars of observability: metrics, logs, and traces. Good use of metrics will alert you to potential issues early. Traces narrow down where in the system the problem is occurring. These two pillars are the most important for diagnosing problems with resources like memory, network, or dependencies.
The most important tool to debug logical issues is logging. Therefore, it’s imperative to have good logging of inputs to your service and return values from your service’s dependencies.
To be honest, I don’t think there’s any real substitute for good logging. If you know all the inputs that caused the problem, it’s easier to write test cases that reproduce it.
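One way to log inputs and return values consistently is a small decorator around each call to a dependency. The sketch below uses Python's standard `logging` and `functools` modules. `log_call` and `lookup_inventory` are hypothetical names, and the in-memory dictionary stands in for a real downstream service.

```python
import functools
import logging

logger = logging.getLogger("service")


def log_call(func):
    """Log the arguments, return value, and any exception of func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record the inputs that triggered the failure, then re-raise.
            logger.exception("error in %s args=%r kwargs=%r",
                             func.__name__, args, kwargs)
            raise
        logger.info("return %s -> %r", func.__name__, result)
        return result
    return wrapper


@log_call
def lookup_inventory(sku: str, warehouse: str) -> int:
    """Hypothetical dependency call, faked with a local dictionary."""
    stock = {("ABC-1", "SEA"): 12}
    return stock.get((sku, warehouse), 0)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    lookup_inventory("ABC-1", warehouse="SEA")
```

With the inputs captured in the log, reproducing a production bug becomes a matter of copying the logged arguments into a test case. In practice you would also redact sensitive fields before logging them.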
I’m sure productivity is measured at some level at Amazon, but it’s above my level.
I suppose you could look at throughput (number of tasks completed per unit time) or latency (average length of time it takes to complete a task). But most of these measurements get confounded by the fact that every task is different.
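For reference, the two measurements mentioned above can be computed as in the following Python sketch. The `Task` record and the use of days as the time unit are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Task:
    """A completed task with start and finish times, measured in days."""
    started: float
    finished: float


def throughput(tasks: Sequence[Task], period_days: float) -> float:
    """Number of tasks completed per day over the given period."""
    return len(tasks) / period_days


def average_latency(tasks: Sequence[Task]) -> float:
    """Mean time in days from starting a task to finishing it."""
    return sum(t.finished - t.started for t in tasks) / len(tasks)


if __name__ == "__main__":
    done = [Task(0, 2), Task(1, 5), Task(3, 6)]
    print(throughput(done, period_days=10))
    print(average_latency(done))
```

Both numbers treat every task as interchangeable, which is exactly the confounding factor described above: a one-line fix and a multi-week migration each count as one task.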
At a meta-level, a productive team meets its deadlines (set by the team, not others). At the very least, such a team knows how productive they are, and there is no external interference driving down their intrinsic productivity. On the other hand, an unproductive team is constantly slipping, which is very often a sign that external factors are interfering.
If you want to give Sprkl a try, get started here.