Network-level and application-level emergency security mechanisms

Background

The recent v2 DAO proposal originally included an Emergency Security Developer Multisig (ESDM). The original language is reproduced below for reference:

Community discussion around the v1 DAO proposal highlighted the need for a better emergency security mechanism within the Threshold DAO. In emergency situations, the 2-day vote delay and the 10-day vote period in the StakerDAO are too long for a proper response to be executed, which might result in the exploitation of our network. This developer multisig is intended to serve as a critical fallback security mechanism for the Threshold network.

The v2 DAO proposal calls for the creation of a 3-of-4 Emergency Security Developer Multisig with two technical representatives appointed from each of the Keep and NuCypher teams. This multisig holds the power to propose network changes immediately after a vulnerability is found, and only in emergency situations. These proposals can be immediately executed by the elected council.

The proposal was ambiguous as to whether the scope of the ESDM applied to application-level contracts or only to network-level Ownable contracts.

@Agoristen raised the concern that if the tBTC v2 contracts are subject to the ESDM, there's a risk of malicious seizure of users' BTC. Given tBTC's premise as a non-custodial, censorship-resistant, trust-minimized bridge, this could compromise the primary raison d'être of tBTC.

The ensuing discussions across the forum, chat, and community calls made it clear that the mechanics of the ESDM required additional debate and a dedicated proposal, separate from the rest of the (otherwise uncontroversial) DAO v2 proposal. As a result, the ESDM aspect was removed and tabled for further discussion so that the rest of the DAO v2 proposal could proceed to a community snapshot.

Current Status

To proceed with the discussion on emergency security fallbacks, let's explicitly distinguish between:

  1. Network-level emergency security mechanisms for DAO governance contracts, T staking contracts, NU/KEEP<>T vending machine contracts, token grant contracts, etc., and;
  2. Application-level emergency security mechanisms for tBTC v2 contracts, PRE contracts, random beacon contracts, etc.

For (1) network-level security, my baseline assumption is that there will be an ESDM, since it's not possible to safely upgrade a contract in a timely manner in public solely via a DAO proposal, as @tux pointed out in Discord. Whether the ESDM has the power to directly upgrade contracts, or only to temporarily pause contracts while a formal DAO proposal is being voted on and approved, needs further consideration.

For (2) application-level security, let's start by assuming that, depending on the power of the network-level ESDM, each application may wish to impose more restrictive security mechanisms than what the network provides.

For example, tBTC may wish to limit upgradeability such that taking control of BTC is impossible, or to opt out entirely via immutable, non-ownable contracts (although this is unlikely, as v2 is taking an incremental approach to product development).

Next Steps

I believe these are the network-level security questions that need to be answered prior to Threshold launching:

  1. Is there a satisfactory alternative to a network-level ESDM?
  2. Assuming there isnā€™t, should the ESDM be able to upgrade contracts directly or only temporarily pause contracts while the DAO approves a security fix? Are there potential security vulnerabilities where pausing is not sufficient?
  3. If the ESDM can only pause contracts, what are the parameters (timeline, etc.)?
  4. What is the composition of the ESDM (members, quorum, etc.)?

There are additional application-level security questions that need to be answered prior to tBTC v2 (and other applications) launching, which means we have additional time to consider them:

  1. How can we preserve the raison d'être of tBTC along with an incremental approach to product development post-launch and the ability to fix security vulnerabilities?
  2. Is there a way to implement an ESDM with limited powers (specifically, no ability to maliciously take control of users' BTC) while preserving the ability to fix potential security vulnerabilities?
  3. Does a limited ESDM simply mean removing the ability to directly upgrade contracts in favor of only temporarily pausing contracts while the DAO approves a security fix?
  4. If so, can we require applications to inherit network-level security after all (if we similarly decide on a pause-only ESDM at the network-level)?

I suspect we may uncover additional questions as the discussion evolves!

A. Background & Premises

The following are several "truths" on which I've based this proposal. Hopefully, if there's disagreement, these premises will serve as a way to reach a common fundamental understanding and add more structure to the conversation. I've also labelled each section for easier reference later on.

A1. Risk is not homogenous.

Broadly speaking, there are two types of risk that the ESDM addresses: exploits and collusion.

A2. Risk is not static.

Exploit risk > collusion risk when t ≈ 0, whereas
exploit risk < collusion risk when t ≈ ∞.
In simple English: exploits are more likely early, when contracts are new, and collusion is more likely later, when enough value has accrued to be worth colluding over.

This chart helps visualize this concept (using arbitrary units):

A3. There is a natural tradeoff between exploit and collusion risk.

Unless a novel solution arises (this would be awesome), broadly speaking, methods to reduce exploit risk rely on increasing centralization and methods to reduce collusion risk rely on increasing decentralization.

A4. There are some exceptions/alternatives to the methods mentioned above.

For example: a pause function can allow a centralized group to meaningfully prevent exploits while a long-term fix is implemented by a broader decentralized group. This capability has the effect of 'defanging' the power of the central group. However, this solution still has two notable drawbacks:

  • While a pause function cannot be used to steal, it can be used maliciously to sabotage. In this case, the attacker is more likely to be a competitor than the typical black-hat hacker.

  • Third parties, such as regulators and other government organizations (especially corrupt ones), can pressure the semi-centralized group with threats of economic liability or violence to pause these contracts for any number of unreasonable, irrational, or unjust reasons. For any ESDM with this pause power, this risk will exist.

Our solution should at least aim to alleviate that risk for the multi-sig.

A5. We're working with a limited toolset (as far as I'm aware).

The solution set involves some combination of:

  • Contract Security Parameters = [Upgradable, Pausable, Non-upgradable]

  • Contract Security Parameter Duration = [Infinite, Standardized & Finite, Custom & Finite]

  • Contract Ownership = [Network-level, Application-level]

Please let me know if any of those need elaboration or are incomplete.
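
One way to make the parameter space above concrete is to model it as enumerations. This is only a sketch of the solution set as listed; all names are illustrative and not drawn from any actual Threshold contract:

```python
from dataclasses import dataclass
from enum import Enum

class SecurityParameter(Enum):
    UPGRADABLE = "upgradable"            # emergency group can replace contract logic
    PAUSABLE = "pausable"                # emergency group can halt functions, not change them
    NON_UPGRADABLE = "non-upgradable"    # no emergency powers at all

class Duration(Enum):
    INFINITE = "infinite"                # the power never expires
    STANDARDIZED_FINITE = "standardized" # same expiry schedule for every contract
    CUSTOM_FINITE = "custom"             # expiry chosen per contract

class Ownership(Enum):
    NETWORK_LEVEL = "network"            # owned at the Threshold network level
    APPLICATION_LEVEL = "application"    # owned by the application (e.g. tBTC)

@dataclass
class SecurityConfig:
    """One point in the solution space: parameter x duration x ownership."""
    parameter: SecurityParameter
    duration: Duration
    ownership: Ownership

# Example: a pausable contract whose pause power expires on a
# network-wide standardized schedule.
config = SecurityConfig(SecurityParameter.PAUSABLE,
                        Duration.STANDARDIZED_FINITE,
                        Ownership.NETWORK_LEVEL)
```

Framing it this way makes it easier to see that any proposal is just a choice (or time-indexed sequence of choices) from this small product space.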

B. Goals

B1. Our goal should be to reduce net risk.

This can be accomplished by either:

  • Reducing exploit risk
  • Reducing collusion risk
  • Reducing both risks

B2. The optimal solution is most likely one that transitions over time.

Since the utility of reducing exploit risk is much higher early on and diminishes over time, and the utility of reducing collusion risk is lower early on and increases over time, the optimal solution is most likely one that is not static.

B3. A network-level solution is ideal.

In line with the concerns raised by @tux , while there should still be consideration for security mechanisms on the application-level, the optimal solution is one that is easily applied and thus cannot be easily avoided.

C. Proposal

By default, after deployment contracts are:

  • Upgradable and pausable by the ESDM for 1 week
    • We should be able to fix early exploits quickly, however, after 1 week, significant value may start to accrue.
    • I'm very willing to consider other lengths for this; just ballparking.
  • Pausable by the ESDM for 1 year
    • This allows for significant peer review and live environment testing. I find it difficult to imagine an exploit is not found after 1 year in this industry.
    • As I mentioned above, this 'defangs' the collusion risk (no more risk of centralized theft) while allowing for emergency powers.
    • This does expose one risk I can think of: an attacker sitting on an exploit, knowing the pause power will expire.
  • Non-upgradable 1 year after deployment
    • This eliminates any collusion risk, even the fringe cases described above in [A4].
    • If the contract is exploited after this period, we're screwed; but if we pair this solution with a reasonable and reputable bug-bounty program, we should be able to cover all of our bases.

D. Some closing thoughts & loose ends

D1. I think more needs to be said about how application-level security mechanisms would interact with a network-level mechanism, if we were to have one like that presented in [C].

D2. I've opted for the Contract Security Parameter Duration to be standardized & finite. It might make sense to do custom & finite; I don't know - I haven't fully explored the pros and cons of either yet.

D3. It would be helpful to hear feedback from the core-team developers on this. If possible, it would be very powerful if this could be codified as a default mechanism, rather than just generally agreed upon.

D4. This is a first attempt, so naturally, I expect pushback and feedback. Our goal, however, isn't just to debate game theory and contract security; it's to make an actionable plan. In other words, let's try to keep comments on point and solution-oriented.

Thanks @jakelynch!

For any ESDM with this pause power, this risk will exist.

While external pressure to deploy an unjustified pause is possible, I'm not convinced this is an existential risk. Yes, it would cause short-term frustration, but the DAO could overrule the pause and revoke the ESDM to prevent repeat abuse. If the worst case is that funds are temporarily inaccessible but not lost, I'm not sure removing pause power is a desirable or necessary goal.

C. Proposal

I like the idea of a phased approach from Upgradeable and pausable to Pausable. This could potentially pair well with a guarded launch that limits deposits during the initial phase.

Non-upgradable 1 year after deployment

I'm less comfortable with this, especially setting arbitrary timelines up front. It ties us to a specific path of action regardless of future conditions.

I find it difficult to imagine an exploit is not found after 1 year in this industry.

I'm not so sure: Uncovering a Four Year Old Bug

You make a convincing argument :sweat:

I'm personally comfortable with striking the non-upgradable aspect. The transition to non-upgradable shouldn't be codified; it really should be a judgment call, made if we get to a point where we're comfortable with it. The marginal utility of transitioning from pausable to non-upgradable might very well be negative and not worth doing.

Preface: My goal is exclusively to challenge and address the lack of security in the proposal. Nothing more, nothing less.

A: Background & Premises
Regarding the "truths" preface, I'll assume you mean industry-standard definitions and not your personal truths, since countless security-professional hours have gone into defining and honing the lexicon. That said, I'm not going to respond with "truths" but with the current and accepted best risk-mitigation practices.

A1: Risk is not homogenous
Agreed. However, the ESDM currently addresses only two types of risk. A more risk-averse view is needed. At the end of the day, this is about risk management and being as risk-averse as possible, which is achieved by measuring said risk.

A2: Risk is not static
Risk can be (and usually is) both static and dynamic, because the attack surface should include all assets, components, threats, and risks that are within our control.

A3: There is a natural tradeoff between exploit and collusion risk
A4. There are some exceptions/alternatives to the methods mentioned above
I believe this is where we fundamentally disagree. Mitigating risk isn't about finding a bulletproof solution. Everyone uses a firewall on their network to mitigate risk. Is it perfect? No. Does it help? Absolutely. Will novel vulnerabilities arise? Without a doubt. The goal is to find a reasonable security measure that outweighs the risk in question.

B1: Our goal should be to reduce net risk
The goal IS NOT reducing net risk, and it's dangerous to approach risk this way. Net risk can only be measured post-mitigation, when treating the individual risks as their own entities. Additionally, you have only addressed known risks. This is where defense in depth (DiD) and measuring risk appropriately come into play.

B2. The optimal solution is most likely one that transitions over time
Risk DOES NOT diminish over time. I don't know how you can state this as a fact or where you got this. Technology, knowledge, general advancements, tooling, greater exposure, funding, and poor, fast-paced engineering are just a few of the variables explaining why risk in fact becomes a bigger problem over time. Just because you get past the arbitrary 1-year mark doesn't mean you're in the clear. This is especially true for complex systems with zero-days. 5-10-year-old Sudo, GRUB2, kernel-level, floppy-disk, and (not to mention) Windows exploits are found all the time - now think about the crypto industry and how new it is. When a fundamental flaw is found, you're going to wish there were a security measure in place.

Even 802.11 WEP was introduced in 1997 and was disclosed as insecure in 2001 - today we would laugh at RC4 with 24-bit IVs. But what did the IEEE do? Introduced 64-bit and 128-bit keys to strengthen it. All can be cracked with ease. To mitigate the actual risk, due diligence was performed and a new protocol was developed - not a band-aid. During this negligence, many people were compromised. To quote Tux: "The greatest threat to your contracts in tBTC are security bugs." And bugs don't always present themselves instantly.

"I find it difficult to imagine an exploit is not found after 1 year in this industry." A response like this is incredibly risky, the exact opposite of mitigating risk, and just plain irresponsible. Just because X is true does not mean Y is false. Which is to say, just because you can't imagine it happening doesn't mean it cannot or will not happen. In fact, metrics point toward a completely different result.

I like data. Here is a short list I curated, without much research, of logic/protocol vulnerabilities exploited within the last 24 months:

  • AKRO - Found: 2017 / Exploit: 2020
  • COVER - Found: 2018 / Exploit: 2020
  • ALPHA - Found: 2018 / Exploit: 2021
  • POLY - Found: 2018 / Exploit: 2021
  • SUSHI - Found: 2018 / Exploit: 2021
  • BUNNY - Found: 2018 / Exploit: 2021
  • BAL - Found: 2018 / Exploit: 2020
  • PAID - Found: 2019-20 / Exploit: 2021
  • CREAM - Found: 2018 / Exploit: 2021,2021
  • RUNE - Found: 2018 / Exploit: 2021,2021

Miscellaneous crypto institution exploits where age > 1 year:

  • Mt. Gox - Found: 2010 / Exploit: 2011, 2014
  • BitCash - Found: 2011 / Exploit: 2013
  • Vicurex - Found: 2011 / Exploit: 2013
  • 796 - Found: 2013 / Exploit: 2015
  • LocalBitcoins - Found: 2012 / Exploit: 2015
  • BitStamp - Found: 2011 / Exploit: 2015
  • BTER - Found: 2012 / Exploit: 2015
  • GateCoin - Found: 2013 / Exploit: 2016
  • Bithumb - Found: 2014 / Exploit: 2017,2018
  • Youbit - Found: 2014 / Exploit: 2017
  • NiceHash - Found: 2014 / Exploit: 2017
  • Bitgrail - Found: 2014 / Exploit: 2018
  • CoinSecure - Found: 2014 / Exploit: 2018
  • Bitcoin Gold - Found: 2009 / Exploit: 2018
  • Bithumb - Found: 2014 / Exploit: 2018
  • Zaif - Found: 2014 / Exploit: 2018
  • QuadrigaCX - Found: 2014 / Exploit: 2018
  • Parity - Found: 2015 / Exploit: 2017,2017

These incidents demonstrate that the technology is still in its infancy and that attack vectors, known or unknown, are likely to go unforeseen. So how do we mitigate risk? Even security audits do not guarantee safety.

DeFi Has Accounted for Over 75% of Crypto Hacks in 2021, and many of these do not appear to involve newly introduced changes. Offensive security for DeFi is becoming very popular and, of course, lucrative.

B3. A network-level solution is ideal
Your wording of "the optimal solution is one that is easily applied and thus cannot be easily avoided" only illustrates to me a failure to take security seriously. What I hear the proposal to be is: "what's easiest to do? Because that will be hard to avoid, because it takes little effort." At the end of the day, this is a corner-cutting mentality. A response like this regarding threats and the mitigation of any risk is dangerous and should be rejected.

All of that said, no, my communication intent is not FUD, as someone said on the call (09/09/2021) this morning. Rather, this is a professional's concern and assessment. There are likely SEVERAL components the team has not evaluated. The tempting response to that problem is to kneel to Occam's razor. However, putting DiD and the appropriate security measures in place will help with risk management and mitigation. The overarching goal is not obtaining absolute certainty regarding risk. We will never obtain absolute certainty, so we had best identify and mitigate risks in as forward-thinking and future-proof a way as possible.

I appreciate your comments, but I'm not entirely sure what you are suggesting. The developers are taking security very seriously and following best practices ('evidence' of this is the two audits that are being conducted). As I understand it, the purpose of this thread is to design a mechanism.

How do you recommend we address it?

Here are some protocols that donā€™t have admin keys:

Uniswap v2 is the most successful DeFi project in existence and it did not need admin keys to get there. Not only does v2 have no admin keys, the smart contracts are immutable. No upgrades, no alterations. It's the holy grail of DeFi and has been running for over a year without incident.

Curve does have an Emergency DAO, with 9 members, which has the ability to shut down all smart contract functions except withdrawals. Beyond that, the smart contracts are not upgradable or alterable.

Aave v2 has a group with veto powers. Other changes are voted on in their DAO.

Liquity does not have any governance whatsoever and is fully immutable and non-upgradable.

tBTC has been running successfully for a year with no major security vulnerabilities. It cannot be upgraded.

Risk mitigation can be accomplished through heavy audits, a graduated supply cap, and the code proving itself over time. Code that does not change gets battle-tested, and the more it is adopted, the more scrutinized it will be.

I will not argue about networks or applications in general. I care about a trustless BTC on Ethereum. That is an application which needs very few improvements, and therefore upgrades can be very slow. Governance votes with months-long time locks should be a given. An emergency pause functionality that ensures users are able to safely withdraw all funds would be acceptable; however I believe, in the true spirit of the original tBTC vision, the emergency pause functionality should eventually be deprecated and disabled.

tBTC should never be compromised with admin keys. The goal is not reducing risk; it's decentralization, permissionlessness, and trustlessness. Otherwise you might as well re-create the traditional financial system with gatekeepers, KYC, and tight regulations. Reducing risk is secondary to these goals. Traditional finance sucks for that exact reason: it has sacrificed freedom in the name of security.

A trustless BTC should ideally be ungovernable too. Simply an algorithmic bridge between pure BTC and Ethereum.

I suggest that we create sub-topics for each application, or at least tBTC.

however I believe, in the true spirit of the original tBTC vision, the emergency pause functionality should eventually be deprecated and disabled.

Do you view a social expectation (versus coded) of future deprecation of the emergency pause functionality as sufficient?

Do you want to kick off a tBTC-specific thread and move the discussion there?

@maclane On further thought, I suggest that we exclude discussion of tBTC v2 from this topic; then whoever is passionate about adding the ESDM to tBTC v2 can create a topic where they argue in favor of that. I'm unsure the idea even has enough traction at this point to be worth its own discussion.

To answer your specific question: as long as an emergency pause functionality presents no risk to users of tBTC, both solutions are sufficient, but code is always preferable.