All systems are go

Previous Incidents

[Resolved] Outage

This incident lasted 40 minutes.
  • Speedyrails Datacenters
  • Miami
  • Vancouver
  • Toronto
  • Montreal
  • DNS
Tue, 6 Nov 2018
04:12:54 UTC

We are investigating an outage in our facilities. We will share more details shortly.

04:42:18 UTC

We have identified the root cause of this incident, and service has been restored to all affected customers.

04:53:20 UTC

We can confirm all services are now restored. Our team continues to investigate the root cause of this incident, and we will publish an update within 48 hours.

[Resolved] Intel processor security issue

This incident lasted 1 week, 6 days, 17 hours, and 36 minutes.
Wed, 15 Aug 2018
22:15:50 UTC

Yesterday Intel announced a new set of vulnerabilities known as L1 Terminal Fault or Foreshadow. It was discovered that memory present in the L1 data cache of an Intel CPU core may be exposed to a malicious process that’s executing on the CPU core.
Software vendors are working on updates to mitigate this vulnerability. As soon as they become available, we will update our systems.
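
Customers who want to confirm that a patched kernel is in place can check the mitigation status the kernel itself reports. The snippet below is an illustrative sketch only (not an official Speedyrails tool) and assumes a Linux system new enough to expose the /sys/devices/system/cpu/vulnerabilities/ interface (kernel 4.19+, or a distribution backport of the L1TF patches):

    #!/usr/bin/env python3
    # Illustrative check of the kernel's reported L1TF (Foreshadow) mitigation.
    # Assumes Linux exposes /sys/devices/system/cpu/vulnerabilities/.
    from pathlib import Path

    L1TF = Path("/sys/devices/system/cpu/vulnerabilities/l1tf")

    def l1tf_status() -> str:
        try:
            return L1TF.read_text().strip()
        except FileNotFoundError:
            return "unknown (kernel does not expose vulnerability status)"

    if __name__ == "__main__":
        # A patched kernel typically reports something like
        # "Mitigation: PTE Inversion; VMX: ...".
        print(f"L1TF: {l1tf_status()}")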

Wed, 29 Aug 2018
15:52:49 UTC

We have been updating all of our servers with the patches released by the various operating system providers.

[Resolved] Vancouver network packet loss

This incident lasted 4 hours and 15 minutes.
Tue, 7 Aug 2018
21:59:23 UTC

We are investigating elevated packet loss in our Vancouver network.

22:20:11 UTC

Cogeco Peer 1 has confirmed they are investigating an issue in their network. We are waiting for an update and will post details as soon as we receive them.

22:32:49 UTC

While we wait for an update from Cogeco Peer 1, we can confirm the network is back to normal at this moment. We continue monitoring this issue and will provide an update as soon as we have more information.

Wed, 8 Aug 2018
02:15:20 UTC

This incident has been resolved. Cogeco Peer 1 will provide a Reason For Outage report within the next 2-3 business days, as their engineers are still investigating the root cause of the issue. We will publish a postmortem report with all the details as soon as we receive them.

Sun, 30 Sep 2018
14:30:25 UTC

Summary

On August 7, 2018, at 1:26pm PDT, the Network Operations Center (NOC) began receiving monitoring alerts for devices missing polls in the Vancouver Data Center. The networking team identified a potential issue with an aggregate switch and implemented a reroute of traffic through a redundant aggregate switch. This resolved the majority of the issues. The team continued their investigation and determined that a layer 2 traffic loop was occurring through a segment of the network. Once this had been identified, mitigating actions were implemented to normalize the network.

Details

As a result of the incident on August 7, 2018, customers would have experienced varying degrees of connectivity issues. The incident was caused by an improper configuration of a customer’s new L2 circuit solution, which came to light when the customer activated their connectivity. The NOC’s initial investigation focused on the 16th floor aggregate switch as a possible cause of the issue, so the decision was made to re-route traffic through the redundant aggregate switch on the 21st floor. This action resolved a majority of the reported customer connectivity issues. The networking team then continued their investigation and determined that a layer 2 traffic loop was occurring through a segment of the network. Once this had been identified as the cause, it was mitigated by deactivating the associated layer 2 tunnels. As a result of this final step, connectivity was restored for the remaining affected customers.
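
For context, a layer 2 loop is disruptive because switches flood broadcast and unknown-unicast frames out of every port except the one a frame arrived on, and Ethernet frames carry no TTL, so a physical or tunnelled loop lets copies circulate indefinitely instead of dying out at the network edge. The toy simulation below (a hypothetical sketch, not Speedyrails or Cogeco Peer 1 tooling) illustrates the difference between a loop-free and a looped topology:

    # Toy illustration (hypothetical, not production tooling): switches flood a
    # broadcast frame out of every port except the ingress port. With no loop,
    # the flood dies out at the edge; with a loop, copies circulate forever.

    def flood(links, start, hops):
        """Count in-flight copies of one broadcast frame after each flooding round.

        links: adjacency dict mapping switch name -> set of neighbouring switches.
        """
        frames = [(start, None)]  # (current switch, switch the frame came from)
        counts = []
        for _ in range(hops):
            frames = [(nbr, sw)
                      for sw, came_from in frames
                      for nbr in links[sw]
                      if nbr != came_from]  # flood out all ports except ingress
            counts.append(len(frames))
        return counts

    loop_free = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}            # simple chain
    looped    = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}  # physical ring

    print(flood(loop_free, "A", 8))  # [1, 1, 0, 0, 0, 0, 0, 0] -- the frame dies out
    print(flood(looped, "A", 8))     # [2, 2, 2, 2, 2, 2, 2, 2] -- copies never stop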

Event Timeline

PDT Time Zone:

• 13:26:00 - NOC receives alerts for network devices missing monitoring polls. Investigation begins into a possible issue with van-hc16e-agg-1.

• 13:29:00 - Devices begin responding to polls. NOC receives one customer report of a brief connectivity issue during the event.

• 14:37:00 - NOC receives additional alerts for network devices missing monitoring polls.

• 14:45:00 - Attempts to access van-hc16e-agg-1 are unsuccessful, so a local console is attached. A reboot is evaluated as a possible way to normalize the network disruption.

• 15:06:00 - Issue is escalated to Network Development Engineering for further investigation.

• 15:13:00 - Traffic for downstream devices is migrated from van-hc16e-agg-1 to van-21e-agg-1 by failing over the Redundant Trunk Group (RTG), in preparation for the possible switch reboot. This action resolves the majority of reported customer connectivity issues.

• 16:16:00 - Further investigation identifies a layer 2 traffic loop caused by a newly provisioned network solution. The layer 2 solution is deactivated, and the remaining affected customers report that connectivity is fully restored.

• 16:30:00 - Previous changes to the RTG are reverted.

Resolution

Downstream customer traffic was rerouted from van-hc16e-agg-1 to van-hc21e-agg-1, which resolved connectivity for the majority of the affected customers. The remaining customers' issues were resolved when the layer 2 tunnels were deactivated.

Mitigation Plan

In the future, the networking team will lab-test these types of customized network solutions prior to deployment into production. Network design configurations will also be peer-reviewed to ensure accuracy and optimization before provisioning.

No further notices from the past 90 days.