Tip #1: Set your baselines and define success.
As you approach your peak periods, it’s important to understand what “good” looks like. What is normal, acceptable, or concerning about the performance of your services?
It’s a good idea to determine some baselines and KPIs so that you’re fully aware of what your platform looks like under different load profiles. Under stress, your services will likely perform differently, but how different can they be before it has a noticeable business impact? Understand what parts of the system are particularly at risk and will act as indicators for further degradation.
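One way to make "how different can they be" concrete is to record a baseline and an agreed tolerance for each KPI, then label every observation against them. The following is a minimal sketch in Python, not a New Relic feature; the KPI names, baseline values, and tolerances are hypothetical and should come from your own data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    """A baseline value plus the relative drift the business will tolerate."""
    name: str
    baseline: float
    tolerance: float              # 0.25 means up to 25% worse than baseline is acceptable
    higher_is_worse: bool = True  # True for latency or error rate, False for throughput


def classify_kpi(kpi: Kpi, observed: float) -> str:
    """Return 'normal', 'acceptable', or 'concerning' for an observed value."""
    if kpi.baseline == 0:
        raise ValueError(f"{kpi.name}: baseline must be non-zero")
    # Relative drift from baseline, signed so that positive always means worse.
    drift = (observed - kpi.baseline) / kpi.baseline
    if not kpi.higher_is_worse:
        drift = -drift
    if drift <= 0:
        return "normal"
    if drift <= kpi.tolerance:
        return "acceptable"
    return "concerning"


# Hypothetical baselines for illustration only.
CHECKOUT_P95_SECONDS = Kpi("checkout p95 response time (s)", baseline=0.8, tolerance=0.25)
ORDERS_PER_MINUTE = Kpi("orders per minute", baseline=120.0, tolerance=0.10,
                        higher_is_worse=False)
```

With these assumed numbers, a checkout p95 of 0.9 seconds is still within tolerance, while a drop to 100 orders per minute falls outside it and would count as concerning.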
Make sure that all of your processes, and the operational agreements covering the level of service you provide to customers, are appropriate. Check out our observability maturity guide to service level management.
Tip #2: Plan well in advance.
Ideally, your planning should be mapped out at least six weeks in advance of the big day. With online sales starting earlier every year, this plan needs to be plotted out ASAP.
What new features will you release? What bugs will you fix first? What type of tests will you run, and when? You will also want to identify current baselines for performance and availability at the application, infrastructure, service, and frontend levels.
You should identify key performance indicators (KPIs) at each level, beginning with measuring performance, error/crash rates, and throughput. You should also create and verify application/infrastructure maps, as well as key transaction maps. Service owners should create and verify an incoming/outgoing call map. Similarly, mobile app owners should identify API calls to view internal and external dependencies.
Next, you need to set your goals and expectations for the big day. What are your availability goals? How much traffic do you expect to receive? How much cloud and infrastructure capacity do you need to put in place (both planned and dynamic) so you can scale to meet those expectations? Identify any existing or potential issues that could get in the way of meeting your goals.
With all this in mind, establish a timeline, document a detailed play-by-play for getting everything ready, and have teams in place to cover the entire season. Make sure that resources are available around the clock and handoffs are scheduled in advance.
Tip #3: Know the current state of your environment.
Your load testing and game day activities might lead you to decide to reconfigure your environment to better serve peak events. For example, your Black Friday posture might require more databases to be brought online, more virtual machines (VMs) added to your cluster, and more customer service operators staffing the online chat.
When the time comes, how can you quickly and confidently know that your platform is in the correct posture for business to commence? Did Rupesh remember to scale up that cluster? Is the customer service team on a coffee break? Is the waiting room enabled?
A “current state” dashboard is an essential tool for quickly understanding the overall state of your environment. You can include pertinent data from across your platform services, such as cluster sizes, databases, queue lengths, feature toggles, and active customer service representatives. Keep your charts concise, clear, and easy to read, with low cognitive load. The billboard chart type works well because you can set thresholds and highlight values in red when they are not what you expect. You can even use the if() NRQL syntax to simplify the data display, giving you simple thumbs-up or thumbs-down indicators:
FROM SystemSample SELECT if(uniqueCount(hostname) > 20, '👍', '👎') AS 'App Cluster' SINCE 5 minutes ago
You might need different state dashboards for different “postures,” with thresholds and indicators set differently for low traffic, normal operation, and peak events. Learn more about how you can customize and use dashboards in New Relic.
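The idea of per-posture thresholds can also be sketched outside of a dashboard. The Python below is an illustration under stated assumptions, not a New Relic API: the posture names, check names, and threshold values are hypothetical, and the observed values would in practice come from your own monitoring data.

```python
from typing import Callable, Dict, Optional

# Each posture maps a check name to a predicate over the observed value.
# All names and thresholds here are hypothetical.
Posture = Dict[str, Callable[[float], bool]]

POSTURES: Dict[str, Posture] = {
    "normal": {
        "app_cluster_hosts": lambda n: n >= 8,
        "chat_agents_online": lambda n: n >= 3,
        "waiting_room_enabled": lambda v: v == 0,
    },
    "peak": {
        "app_cluster_hosts": lambda n: n > 20,
        "chat_agents_online": lambda n: n >= 10,
        "waiting_room_enabled": lambda v: v == 1,
    },
}


def check_posture(posture: str, observed: Dict[str, float]) -> Dict[str, str]:
    """Return a thumbs up or down per check; missing data counts as a failure."""
    results: Dict[str, str] = {}
    for name, is_ok in POSTURES[posture].items():
        value: Optional[float] = observed.get(name)
        results[name] = "👍" if value is not None and is_ok(value) else "👎"
    return results
```

Treating missing data as a failure is a deliberate choice in this sketch: on the day, an indicator that cannot be confirmed should draw attention rather than pass silently.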
Tip #4: Classify issues by business impact.
Black Friday events tend to have a lot more eyes on the data than everyday operations, and those eyes are inevitably more business-focused. How do you ensure that others will understand the data and visualizations, and the impact of issues that arise? If this audience can’t quickly determine the severity of an issue, you may waste time fixing irrelevant issues and miss the ones that really make a difference to your business.
One approach is to classify your issues by business impact. Agree on the classification across your services so that everyone is clear on how important a given issue is, and what its effect on the business is, without having to fully understand the issue itself. Just knowing that a dependent service is suffering a “SEV2 incident” provides clarity at times of stress. Don’t underestimate the value of this classification. Read more about alerting strategies.
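A shared classification is easiest to apply consistently when the rules are written down as explicitly as code. Here is a minimal Python sketch of one possible scheme; the four levels, the 25% threshold, and the input questions are assumptions for illustration, and your own agreed definitions should replace them.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means higher business impact."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


def classify_incident(blocks_purchases: bool,
                      customers_affected_pct: float,
                      has_workaround: bool) -> Severity:
    """Map a few business-facing questions to an agreed severity level."""
    if not 0 <= customers_affected_pct <= 100:
        raise ValueError("customers_affected_pct must be between 0 and 100")
    widespread = customers_affected_pct >= 25  # hypothetical threshold
    if blocks_purchases and widespread:
        return Severity.SEV1
    if blocks_purchases or widespread:
        return Severity.SEV2
    if not has_workaround:
        return Severity.SEV3
    return Severity.SEV4
```

Under these assumed rules, a payment outage hitting 40% of customers is a SEV1, while a broken product rating widget with a workaround is a SEV4.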
Tip #5: Set up your command center.
As you solidify your plans, assemble your cross-functional teams. Identify which team members from marketing, fulfillment, web and mobile operations, and other key functions will be involved. Assign clear roles and responsibilities to each person. Who will make the mission-critical decisions and course corrections? Who will execute which tasks? How and where will collaboration occur? Who’s in charge if incidents occur? Keep in mind that successfully navigating Black Friday and Cyber Monday is a team sport, so always think and act like a team.
When you set up the dashboards, consider high-level, “at a glance” dashboards covering both business and technology data.
Make sure the dashboards identify the contacts who own the services. Consider including links to runbooks and team contacts on relevant command center dashboards. This saves you from needing to look up the information at crucial times.
Many companies codify these decisions (who does what, and when) in runbooks published in their Observability Center of Excellence. To give your efforts a physical focus, you may want to set up a network operations center (NOC) to house key team members, monitors with shared dashboards, and other resources.
Tip #6: Be aware of changes.
It’s important to ensure you can execute your plan, so your biggest days may not be the best time to roll out risky experiments or deal with unnecessary chaos. Be sure your plan specifies what you can and can’t do. Proper instrumentation and visibility into your software and systems enable you to innovate more confidently, but there’s no need to be reckless about it. Of course, you don’t want to leave money on the table, but you really, really don’t want to break things on Black Friday.
That’s why many companies build a timeline that includes a feature or code freeze. In order to minimize last-minute surprises, you need to specify a hard cutoff date when new features can’t be incorporated into your systems. You’ll need another date when you won’t accept any new code changes, even bug fixes. This allows QA teams to confidently verify that key customer journeys avoid any roadblocks, and helps ensure you deliver the highest quality digital customer experience when it matters most.
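The two cutoff dates described above can be expressed as a small policy check, for example as a gate in a deployment pipeline. This is a sketch only: the dates are hypothetical, and the date on which the freeze lifts is an added assumption that the text does not specify.

```python
from datetime import date

# Hypothetical dates for illustration; set these from your own timeline.
FEATURE_FREEZE = date(2024, 11, 1)   # no new features from this date
CODE_FREEZE = date(2024, 11, 15)     # no code changes at all, even bug fixes
FREEZE_ENDS = date(2024, 12, 3)      # assumption: normal releases resume here


def change_allowed(change_type: str, on: date) -> bool:
    """Return True if a 'feature' or 'bugfix' change may ship on the given date."""
    if change_type not in {"feature", "bugfix"}:
        raise ValueError(f"unknown change type: {change_type!r}")
    if on >= FREEZE_ENDS:
        return True
    if on >= CODE_FREEZE:
        return False
    if on >= FEATURE_FREEZE:
        # Between the two cutoffs only bug fixes are accepted.
        return change_type == "bugfix"
    return True
```

A real policy would likely also need an explicit, audited exception path for emergency fixes, which this sketch leaves out.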
Deployments might be inevitable, especially in microservice architectures, or where you rely heavily on third parties. Use deployment markers or custom events to understand if and when the landscape changes. New Relic Lookout is great for investigating real-time performance changes in your environment. Be sure you understand how to use this tool before you need it!
Tip #7: Stay focused on the big picture.
With millions of dollars potentially on the line and multiple things happening at once, how do you stay focused on factors such as conversion rate, order counts, payment success rates, and Apdex for key transactions?
Building and sharing real-time business analytics dashboards gives everyone visibility into user flow and performance across web, mobile, and infrastructure by tracking key metrics and user satisfaction scores. Create dashboards for business outcomes and order processes, and set up high-density views of how your apps and infrastructure are performing, as well as the quality of the digital customer experience you’re delivering.
Craft your dashboards carefully and ensure they are consistent in layout and structure across your services so that they are familiar and easy for everyone to use. A pyramid dashboard strategy helps summarize information and allows drilling down for details when necessary. Also, use workloads to help curate operational views of the stack pertinent to the peak event.
These dashboards should be placed prominently in the NOC so that everyone is working from the same single source of truth and can quickly figure out if a leaky funnel is due to a site performance issue, a third-party service outage, or some other cause.
Consider installing the Wall Status Board application. This application is designed to give you an at-a-glance status of your environment and indicate recent historical activity, a great addition to any operations center.
Tip #8: Monitor your golden journey.
Synthetic monitoring is a powerful tool for monitoring availability. For your Black Friday event, ensure that you’re keeping an eye on the really important “golden journeys” or essential actions of your customers, which have key business impacts. Some examples include searching for products, adding items to a basket, checkout, and payment.
Make sure you don’t unnecessarily complicate these customer journeys with features that don’t affect your primary business operations. For example, does it matter if customers can’t rate the product, as long as they can add it to their basket and pay for it? Monitor these ancillary features separately and be sure to categorize your journeys based on business impact.
Tip #9: Optimize the payment process.
Great news! Your customer clicked the buy button—now your job is done, and it’s time to book the revenue, right? Sadly, no. In fact, you’re just getting started.
Both the payment step and the payment success rate are critical to your success on Black Friday. Even after a customer hits the purchase button, the transaction (along with your revenue) remains susceptible to payment gateway, payment processor, and buyer errors, as well as slow networks and a host of other issues. That’s why it makes sense to track the magnitude and ratio of both successful and failed orders over time to determine trends well in advance of the big days. Similarly, monitoring payment methods can help you identify which ones have the highest success rates, so you can feature those options most prominently.
Look at it this way: Imagine if you could collect revenue from 100% of your customers’ purchase attempts on Black Friday. How much extra revenue would you book in just that one day?
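To put rough numbers on both ideas, you can compute the success rate per payment method and the total value of failed attempts from your order data. The Python below is a minimal sketch; the record layout (method, success flag, order value) is an assumption, not a New Relic schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed record layout: (payment_method, succeeded, order_value)
Attempt = Tuple[str, bool, float]


def success_rate_by_method(attempts: Iterable[Attempt]) -> List[Tuple[str, float]]:
    """Return (method, success rate) pairs, highest success rate first."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for method, succeeded, _value in attempts:
        totals[method] += 1
        if succeeded:
            successes[method] += 1
    rates = [(method, successes[method] / totals[method]) for method in totals]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)


def value_of_failed_attempts(attempts: Iterable[Attempt]) -> float:
    """Sum the order value of every failed purchase attempt."""
    return sum(value for _method, succeeded, value in attempts if not succeeded)
```

Note that the failed-attempt total is an upper bound on lost revenue: a customer who retries and succeeds appears once as a failure and once as a success, so deduplicating by order or customer would give a tighter estimate.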
To quickly determine the health of your payment service alongside all of its dependencies, use workloads in New Relic. To assess customer satisfaction and funnel performance, use our browser analyzer and customer journey apps. For deeper insight into the health of applications, websites, and mobile apps, use APM, browser monitoring, and mobile monitoring—all within the New Relic observability platform. These insights help you determine possible reasons why payments failed, making it easier to troubleshoot the issues and quickly improve your payment success rate.
Start by marking your checkout transaction as a key transaction leading up to Black Friday, and set an alert to notify you of any issues. If your payment provider can accept $0 transactions, test the checkout process at regular intervals using synthetic monitoring (see previous tip #8). The goal is to avoid hearing about problems via customer complaints on Twitter. Key metrics to track are payment success rate, payment gateway response time, and third-party payment provider response time.
Next, take a close look at your cart abandonment rate. How much of that is due to app performance problems versus other variables, such as issues with external payment providers or other third-party services? With real-time performance dashboards powered by New Relic, you can answer these questions right away and fix problems quickly.
Tip #10: Stay flexible.
You can test and plan for every scenario you can imagine, but you never know what real shoppers will do when the big day finally arrives. For example, you might expect shopper traffic to thin out around 2 am, but a popular promotion or performance issues with a competitor’s site could cause your site to be flooded with users until 3 am or 4 am, which could cut into the window available to conduct routine housekeeping tasks.
During November and December, you want to continually assess KPIs and traffic trends to identify any opportunity for improvement. If it’s a choice between releasing that shiny new bit of functionality or fixing an underlying infrastructure problem that’s causing a half-second delay in response time, you probably want to save the value-add stuff for a quieter moment.
To be sure, it helps to define priorities ahead of time in the planning phase rather than in the heat of the moment. What you choose to prioritize, of course, depends on your particular situation. Some companies prioritize anything that will capture more profit, whether uptime, conversion rates, or even high-margin products and high-value customers.
Ultimately, the key is to be prepared to cope with rapidly changing plans and priorities to keep things humming in real time when it matters most—to stay agile and resilient instead of locking down your systems so tightly that they become brittle, more likely to break, and more difficult to recover. Keep your skills up with the NRQL Lessons tutorial.
Tip #11: Set yourself up for future success.
Just because you did everything right and enjoyed a successful Black Friday and Cyber Monday doesn’t mean your work is done. You still need to maximize the rest of the holiday shopping season—and take advantage of other big days throughout the year.
To leverage the lessons, hold an evidence-based, blameless postmortem as soon as possible. Instead of relying on opinions and resorting to finger-pointing, use your performance observability data to analyze what worked and what didn’t. Identify innovation and performance improvement goals, refine your processes, measure, and repeat. Then record these outcomes in your Center of Excellence, so everyone can easily find them when this time comes around again.
Remember: E-commerce is a 365-days-a-year business. While there’s even more at stake during peak season, every day is critical for online retailers. Successful companies need a flexible, “always-on” approach to development, testing, and monitoring.
Get ready for the holiday season sales blitz by putting these tips into practice using New Relic, and ensure your web properties deliver for your business beyond the holiday season. If you don’t already have a New Relic account, sign up here.