Skip to content
  • Florian Westphal's avatar
    netfilter: conntrack: allow insertion of clashing entries · 6a757c07
    Florian Westphal authored
    
    
    This patch further relaxes the need to drop an skb due to a clash with
    an existing conntrack entry.
    
    Current clash resolution handles the case where the clash occurs between
    two identical entries (distinct nf_conn objects with same tuples), i.e.:
    
                        Original                        Reply
    existing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353
    clashing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353
    
    ... existing handling will discard the unconfirmed clashing entry and
    makes skb->_nfct point to the existing one.  The skb can then be
    processed normally just as if the clash would not have existed in the
    first place.
    
    For other clashes, the skb needs to be dropped.
    This frequently happens with DNS resolvers that send A and AAAA queries
    back-to-back when NAT rules are present that cause packets to get
    different DNAT transformations applied, for example:
    
    -m statistics --mode random ... -j DNAT --dnat-to 10.0.0.6:5353
    -m statistics --mode random ... -j DNAT --dnat-to 10.0.0.7:5353
    
    In this case the A or AAAA query is dropped which incurs a costly
    delay during name resolution.
    
    This patch also allows this collision type:
                           Original                   Reply
    existing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353
    clashing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.7:5353
    
    In this case, clash is in original direction -- the reply direction
    is still unique.
    
    The change makes it so that when the 2nd colliding packet is received,
    the clashing conntrack is tagged with new IPS_NAT_CLASH_BIT, gets a fixed
    1 second timeout and is inserted in the reply direction only.
    
    The entry is hidden from 'conntrack -L', it will time out quickly
    and it can be early dropped because it will never progress to the
    ASSURED state.
    
    To avoid special-casing the delete code path to special case
    the ORIGINAL hlist_nulls node, a new helper, "hlist_nulls_add_fake", is
    added so hlist_nulls_del() will work.
    
    Example:
    
          CPU A:                               CPU B:
    1.  10.2.3.4:42 -> 10.8.8.8:53 (A)
    2.                                         10.2.3.4:42 -> 10.8.8.8:53 (AAAA)
    3.  Apply DNAT, reply changed to 10.0.0.6
    4.                                         10.2.3.4:42 -> 10.8.8.8:53 (AAAA)
    5.                                         Apply DNAT, reply changed to 10.0.0.7
    6. confirm/commit to conntrack table, no collisions
    7.                                         commit clashing entry
    
    Reply comes in:
    
    10.2.3.4:42 <- 10.0.0.6:5353 (A)
     -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42
    10.2.3.4:42 <- 10.0.0.7:5353 (AAAA)
     -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42
        The conntrack entry is deleted from table, as it has the NAT_CLASH
        bit set.
    
    In case of a retransmit from ORIGINAL dir, all further packets will get
    the DNAT transformation to 10.0.0.6.
    
    I tried to come up with other solutions but they all have worse
    problems.
    
    Alternatives considered were:
    1.  Confirm ct entries at allocation time, not in postrouting.
     a. will cause uneccesarry work when the skb that creates the
        conntrack is dropped by ruleset.
     b. in case nat is applied, ct entry would need to be moved in
        the table, which requires another spinlock pair to be taken.
     c. breaks the 'unconfirmed entry is private to cpu' assumption:
        we would need to guard all nfct->ext allocation requests with
        ct->lock spinlock.
    
    2. Make the unconfirmed list a hash table instead of a pcpu list.
       Shares drawback c) of the first alternative.
    
    3. Document this is expected and force users to rearrange their
       ruleset (e.g. by using "-m cluster" instead of "-m statistics").
       nft has the 'jhash' expression which can be used instead of 'numgen'.
    
       Major drawback: doesn't fix what I consider a bug, not very realistic
       and I believe its reasonable to have the existing rulesets to 'just
       work'.
    
    4. Document this is expected and force users to steer problematic
       packets to the same CPU -- this would serialize the "allocate new
       conntrack entry/nat table evaluation/perform nat/confirm entry", so
       no race can occur.  Similar drawback to 3.
    
    Another advantage of this patch compared to 1) and 2) is that there are
    no changes to the hot path; things are handled in the udp tracker and
    the clash resolution path.
    
    Cc: rcu@vger.kernel.org
    Cc: "Paul E. McKenney" <paulmck@kernel.org>
    Cc: Josh Triplett <josh@joshtriplett.org>
    Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
    Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
    Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
    6a757c07