From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  7 07:55:51 2012
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 81D001065740;
	Sun,  7 Oct 2012 07:55:51 +0000 (UTC) (envelope-from avg@FreeBSD.org)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 711318FC08;
	Sun,  7 Oct 2012 07:55:49 +0000 (UTC)
Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua
	[212.40.38.100])
	by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id KAA12998;
	Sun, 07 Oct 2012 10:55:48 +0300 (EEST)
	(envelope-from avg@FreeBSD.org)
Received: from localhost ([127.0.0.1])
	by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD))
	id 1TKliG-0007sk-7B; Sun, 07 Oct 2012 10:55:48 +0300
Message-ID: <50713582.9040600@FreeBSD.org>
Date: Sun, 07 Oct 2012 10:55:46 +0300
From: Andriy Gapon <avg@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
	rv:15.0) Gecko/20120913 Thunderbird/15.0.1
MIME-Version: 1.0
To: Pawel Jakub Dawidek <pjd@FreeBSD.org>,
	"freebsd-fs@freebsd.org" <freebsd-fs@FreeBSD.org>
X-Enigmail-Version: 1.4.3
Content-Type: text/plain; charset=X-VIET-VPS
Content-Transfer-Encoding: 7bit
Cc: 
Subject: zfs_remove: delete_now case
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 07 Oct 2012 07:55:51 -0000


It seems that the delete_now path is never taken in zfs_remove().
This is probably good in the bug-cancels-bug way...

On FreeBSD VOP_REMOVE is always called with vp referenced.  zfs_remove
doesn't take advantage of the VOP_REMOVE interface.  It ignores the vp argument and
instead looks up the directory entry id by name again (a small performance hit here)
and then uses zfs_zget, which adds another reference to the entry's vnode.
Thus, the reference count of the vnode is never less than two, so
may_delete_now and delete_now are always false.
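
To make the counting argument concrete, here is a minimal user-space sketch.
It is not the kernel code: toy_vnode, toy_zget and the other toy_* names are
hypothetical, and it assumes that the may_delete_now check reduces to "the use
count is exactly one".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a vnode; only the use count matters here. */
struct toy_vnode {
	int usecount;
};

/* Models zfs_zget(): looking the znode up again takes one more reference. */
void
toy_zget(struct toy_vnode *vp)
{
	vp->usecount++;
}

/*
 * Models the check in zfs_remove() (assumption: immediate deletion is
 * considered only when the caller holds the sole reference).
 */
bool
toy_may_delete_now(const struct toy_vnode *vp)
{
	return (vp->usecount == 1);
}

/*
 * Models the FreeBSD call sequence: VFS already holds a reference when it
 * calls VOP_REMOVE, and the re-lookup by name adds a second one.
 */
bool
toy_remove_may_delete_now(struct toy_vnode *vp)
{
	assert(vp->usecount >= 1);	/* VOP_REMOVE contract: vp is referenced */
	toy_zget(vp);
	return (toy_may_delete_now(vp));
}
```

With any starting use count of one or more, toy_remove_may_delete_now()
returns false, which is the "never taken" behaviour described above.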

Why is this good?  Because FreeBSD VFS doesn't support direct destruction (or
corruption) of the vnode in VOP_REMOVE.  It expects to still have a valid vnode
with a valid reference count after VOP_REMOVE and then calls vput/vrele on it.
But the code in the delete_now branch does some nasty things.  It directly
decrements the use count and it directly destroys the underlying znode (which is
fine in Solaris but not in FreeBSD).
But FreeBSD VFS wouldn't even have a chance to panic on the damaged vnode
because the ZFS code would panic sooner, in zfs_znode_delete -> zfs_znode_free ->
ASSERT(ZTOV(zp) == NULL) (a FreeBSD-specific assertion).

I think that we should make zfs_remove code less confusing and more FreeBSD
friendly.  We should explicitly rely on zfs_inactive doing the right thing after
VOP_REMOVE and drop all the "direct action" code.

What do you think?
-- 
Andriy Gapon

From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  7 09:20:10 2012
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 7EEF6106564A;
	Sun,  7 Oct 2012 09:20:10 +0000 (UTC)
	(envelope-from pawel@dawidek.net)
Received: from mail.dawidek.net (garage.dawidek.net [91.121.88.72])
	by mx1.freebsd.org (Postfix) with ESMTP id 434BC8FC08;
	Sun,  7 Oct 2012 09:20:09 +0000 (UTC)
Received: from localhost (89-73-195-149.dynamic.chello.pl [89.73.195.149])
	by mail.dawidek.net (Postfix) with ESMTPSA id 3AB7A8E5;
	Sun,  7 Oct 2012 11:19:00 +0200 (CEST)
Date: Sun, 7 Oct 2012 11:20:37 +0200
From: Pawel Jakub Dawidek <pjd@FreeBSD.org>
To: Andriy Gapon <avg@FreeBSD.org>
Message-ID: <20121007092037.GB28611@garage.freebsd.pl>
References: <50713582.9040600@FreeBSD.org>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="BwCQnh7xodEAoBMC"
Content-Disposition: inline
In-Reply-To: <50713582.9040600@FreeBSD.org>
X-OS: FreeBSD 10.0-CURRENT amd64
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@FreeBSD.org>
Subject: Re: zfs_remove: delete_now case
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 07 Oct 2012 09:20:10 -0000


--BwCQnh7xodEAoBMC
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: 7bit

On Sun, Oct 07, 2012 at 10:55:46AM +0300, Andriy Gapon wrote:
>
> It seems that the delete_now path is never taken in zfs_remove().
> This is probably good in the bug-cancels-bug way...
>
> On FreeBSD VOP_REMOVE is always called with vp being referenced.  zfs_remove
> doesn't take advantage of VOP_REMOVE interface.  It ignores the vp argument and
> instead re-looks up a directory entry id by name (a small performance hit here)
> and then uses zfs_zget, which adds another reference to the entry's vnode.
> Thus, a reference count of the vnode is always not less than two.  So
> may_delete_now and delete_now are always false.
>
> Why this is good?  Because FreeBSD VFS doesn't support direct destruction (or
> corruption) of the vnode in VOP_REMOVE.  It expects to still have a valid vnode
> with a valid reference count after VOP_REMOVE and then calls vput/vrele on it.
> But the code in the delete_now branch does some nasty things.  It directly
> decrements the use count and it directly destroys the underlying znode (which is
> fine in Solaris but not in FreeBSD).
> But FreeBSD VFS wouldn't even have a chance to panic on the damaged vnode
> because ZFS code would sooner panic in zfs_znode_delete -> zfs_znode_free ->
> ASSERT(ZTOV(zp) == NULL) [a FreeBSD-specific assertion).
>
> I think that we should make zfs_remove code less confusing and more FreeBSD
> friendly.  We should explicitly rely on zfs_inactive doing the right thing after
> VOP_REMOVE and drop all the "direct action" code.
>
> What do you think?

I'm fully aware of this code path being dead on FreeBSD. It is left
there only to minimize diff against vendor, so I'd prefer not to remove
it. Surrounding the code with '#ifdef sun' or similar is ok, I think.

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
FreeBSD committer                         http://www.FreeBSD.org
Am I Evil? Yes, I Am!                     http://tupytaj.pl

--BwCQnh7xodEAoBMC
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iEYEARECAAYFAlBxSWUACgkQForvXbEpPzT7kgCdF6+4xV+L7aF17HtZFxFVBL+v
gjUAniJwvbeAzHRvlHUEBdxjh+6roDiy
=UJy4
-----END PGP SIGNATURE-----

--BwCQnh7xodEAoBMC--

From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  7 09:34:14 2012
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 48A121065670;
	Sun,  7 Oct 2012 09:34:14 +0000 (UTC) (envelope-from avg@FreeBSD.org)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 4E8CD8FC0C;
	Sun,  7 Oct 2012 09:34:12 +0000 (UTC)
Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua
	[212.40.38.100])
	by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id MAA13438;
	Sun, 07 Oct 2012 12:34:11 +0300 (EEST)
	(envelope-from avg@FreeBSD.org)
Received: from localhost ([127.0.0.1])
	by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD))
	id 1TKnFT-0007yG-6X; Sun, 07 Oct 2012 12:34:11 +0300
Message-ID: <50714C91.4080407@FreeBSD.org>
Date: Sun, 07 Oct 2012 12:34:09 +0300
From: Andriy Gapon <avg@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
	rv:15.0) Gecko/20120913 Thunderbird/15.0.1
MIME-Version: 1.0
To: Pawel Jakub Dawidek <pjd@FreeBSD.org>
References: <50713582.9040600@FreeBSD.org>
	<20121007092037.GB28611@garage.freebsd.pl>
In-Reply-To: <20121007092037.GB28611@garage.freebsd.pl>
X-Enigmail-Version: 1.4.3
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@FreeBSD.org>
Subject: Re: zfs_remove: delete_now case
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 07 Oct 2012 09:34:14 -0000

on 07/10/2012 12:20 Pawel Jakub Dawidek said the following:
> I'm fully aware of this code path being dead on FreeBSD. It is left there
> only to minimize diff against vendor, so I'd prefer not to remove it.
> Surrounding the code with '#ifdef sun' or similar is ok, I think.

Ah, very good!  It just wasn't clear from the code.
I'll try to make the ifdef patch.
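
Roughly, the shape of such a guard might look like the sketch below.  This is
a user-space illustration only, not the actual patch: toy_path and
toy_choose_path are made-up names, and the use count check stands in for the
real may_delete_now logic.

```c
#include <assert.h>

/* Hypothetical outcome of a remove: destroy now, or leave it for later. */
enum toy_path {
	TOY_DELETE_NOW,
	TOY_DEFER_TO_INACTIVE
};

enum toy_path
toy_choose_path(int usecount)
{
	assert(usecount >= 1);	/* the caller always holds a reference */
#ifdef sun
	/* Vendor (Solaris) behaviour: destroy the znode right away. */
	if (usecount == 1)
		return (TOY_DELETE_NOW);
#endif
	/* Elsewhere the branch is compiled out; cleanup is left to zfs_inactive. */
	return (TOY_DEFER_TO_INACTIVE);
}
```

The vendor branch stays in the file, so the diff against upstream remains
small, but a build without `sun` defined never compiles it.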

-- 
Andriy Gapon

From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  7 15:32:12 2012
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id E45B2106566C;
	Sun,  7 Oct 2012 15:32:12 +0000 (UTC) (envelope-from avg@FreeBSD.org)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
	by mx1.freebsd.org (Postfix) with ESMTP id CC9908FC16;
	Sun,  7 Oct 2012 15:32:11 +0000 (UTC)
Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua
	[212.40.38.100])
	by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id SAA15154;
	Sun, 07 Oct 2012 18:32:05 +0300 (EEST)
	(envelope-from avg@FreeBSD.org)
Received: from localhost ([127.0.0.1])
	by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD))
	id 1TKspp-0008Hl-Jh; Sun, 07 Oct 2012 18:32:05 +0300
Message-ID: <5071A071.1020800@FreeBSD.org>
Date: Sun, 07 Oct 2012 18:32:01 +0300
From: Andriy Gapon <avg@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
	rv:15.0) Gecko/20120913 Thunderbird/15.0.1
MIME-Version: 1.0
To: "Justin T. Gibbs" <gibbs@scsiguy.com>,
	Pawel Jakub Dawidek <pjd@FreeBSD.org>,
	Konstantin Belousov <kib@FreeBSD.org>
References: <76CBA055-021F-458D-8978-E9A973D9B783@scsiguy.com>
	<506EB43B.8050204@FreeBSD.org>
In-Reply-To: <506EB43B.8050204@FreeBSD.org>
X-Enigmail-Version: 1.4.3
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: fs@FreeBSD.org
Subject: Re: ZFS: Deadlock during vnode recycling
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 07 Oct 2012 15:32:13 -0000

In fact here is a real patch that I would like to propose:
http://people.freebsd.org/~avg/zfs-getnewvnode_reserve.diff

The patch incorporates kib's patch for extending the VFS API, which helps to avoid
entering the vnode inactive/reclaim path from getnewvnode.

The patch should fix the problem reported in this thread and the problem from
"panic: _sx_xlock_hard: recursed on non-recursive sx zfsvfs->z_hold_mtx ..."
thread that runs in parallel.

Reviews and testing are welcome.

Here is a draft of a commit message that also provides some additional details:

zfs: overhaul zfs_freebsd_reclaim and zfs_zget...

now that we do not need to fear recursion from getnewvnode into
zfs_inactive and zfs_freebsd_reclaim.
This removes the need for the delayed destruction of znodes via taskqueue,
thus making the znode/vnode state machine a bit simpler.
Also, try to make zfs_zget saner with respect to doomed vnodes to avoid
a deadlock when zfs_zget is called from zfs_freebsd_recycle.

To do: pass locking flags parameter to zfs_zget, so that the zfs-vfs glue
code doesn't have to re-lock a vnode but could ask for proper locking
from the very start.


The patch also drops some redundant interlock acquisitions, since both
vop_inactive and vop_reclaim are called with exclusive vnode lock held.
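
For readers who have not followed the other thread, the idea of reserving
vnodes up front can be illustrated with a toy user-space model.  Everything
below is hypothetical (toy_vfs, toy_reserve and so on are not the real VFS
API, and the real patch may work differently in detail); it only shows why
doing the recycling before a filesystem lock is taken avoids re-entering the
filesystem with that lock held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names are the real VFS API. */
struct toy_vfs {
	int  free_vnodes;		/* vnodes available without recycling */
	int  reserved;			/* vnodes set aside for the caller */
	bool fs_lock_held;		/* stands in for a ZFS-internal lock */
	int  reclaims_under_lock;	/* recycling that ran with the lock held */
};

/* Recycling a vnode calls back into the filesystem (inactive/reclaim). */
void
toy_recycle_one(struct toy_vfs *vfs)
{
	if (vfs->fs_lock_held)
		vfs->reclaims_under_lock++;	/* the recursion/deadlock case */
	vfs->free_vnodes++;
}

/* Set aside `count` vnodes up front, recycling now if the pool is short. */
void
toy_reserve(struct toy_vfs *vfs, int count)
{
	assert(count >= 0);
	while (vfs->free_vnodes < count)
		toy_recycle_one(vfs);
	vfs->free_vnodes -= count;
	vfs->reserved += count;
}

/* Return any unused part of the reservation to the pool. */
void
toy_drop_reserve(struct toy_vfs *vfs)
{
	vfs->free_vnodes += vfs->reserved;
	vfs->reserved = 0;
}

/* Allocate a vnode: use the reservation if there is one, else maybe recycle. */
void
toy_getnewvnode(struct toy_vfs *vfs)
{
	if (vfs->reserved > 0) {
		vfs->reserved--;
		return;
	}
	if (vfs->free_vnodes == 0)
		toy_recycle_one(vfs);
	vfs->free_vnodes--;
}

/* A zget-like path that allocates a vnode under the filesystem lock. */
void
toy_zget(struct toy_vfs *vfs, bool use_reserve)
{
	if (use_reserve)
		toy_reserve(vfs, 1);	/* any recycling happens here, unlocked */
	vfs->fs_lock_held = true;
	toy_getnewvnode(vfs);
	vfs->fs_lock_held = false;
	if (use_reserve)
		toy_drop_reserve(vfs);
}
```

Starting from an empty pool, toy_zget() without a reservation recycles while
the lock is held, whereas the reserving variant does all of its recycling
before the lock is taken.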
-- 
Andriy Gapon

From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  7 18:34:56 2012
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 0B687106566B;
	Sun,  7 Oct 2012 18:34:56 +0000 (UTC)
	(envelope-from gibbs@scsiguy.com)
Received: from aslan.scsiguy.com (www.scsiguy.com [70.89.174.89])
	by mx1.freebsd.org (Postfix) with ESMTP id A9F388FC14;
	Sun,  7 Oct 2012 18:34:55 +0000 (UTC)
Received: from macbook.scsiguy.com (macbook.scsiguy.com [192.168.0.99])
	(authenticated bits=0)
	by aslan.scsiguy.com (8.14.5/8.14.5) with ESMTP id q97IYnnY084421
	(version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NO);
	Sun, 7 Oct 2012 12:34:49 -0600 (MDT)
	(envelope-from gibbs@scsiguy.com)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\))
From: "Justin T. Gibbs" <gibbs@scsiguy.com>
In-Reply-To: <20121003085326.GC1386@garage.freebsd.pl>
Date: Sun, 7 Oct 2012 12:34:52 -0600
Content-Transfer-Encoding: quoted-printable
Message-Id: <BBD74405-D867-4FE3-AB56-7D2EC026859D@scsiguy.com>
References: <505DE715.8020806@FreeBSD.org>
	<DA42C8E9-BFFF-4C5A-9E14-1D50EAEFA669@scsiguy.com>
	<20121003085326.GC1386@garage.freebsd.pl>
To: Pawel Jakub Dawidek <pjd@freebsd.org>
X-Mailer: Apple Mail (2.1499)
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
	(aslan.scsiguy.com [192.168.0.4]);
	Sun, 07 Oct 2012 12:34:49 -0600 (MDT)
Cc: freebsd-fs@freebsd.org, Andriy Gapon <avg@freebsd.org>
Subject: Re: zfs: allow to mount root from a pool not in zpool.cache
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 07 Oct 2012 18:34:56 -0000

On Oct 3, 2012, at 2:53 AM, Pawel Jakub Dawidek <pjd@freebsd.org> wrote:

> On Sat, Sep 22, 2012 at 10:59:56PM -0600, Justin T. Gibbs wrote:
>> On Sep 22, 2012, at 10:28 AM, Andriy Gapon <avg@freebsd.org> wrote:
>>
>>>
>>> Currently FreeBSD ZFS kernel code doesn't allow to mount root filesystem on a
>>> pool that is not listed in zpool.cache as only pools from the cache are known to
>>> ZFS at that time.
>>
>> I've for some time been of the opinion that FreeBSD should only use
>> the cache file for ZFS pools created from non-GEOM objects (i.e.
>> files).  GEOM tasting should be used to make the kernel aware of
>> all pools whether they be imported on the system, partial, or
>> foreign.  Even for pools created by files, the user land utilities
>> should do nothing more than ask the kernel to "taste them".  This
>> would remove code duplicated in user land for this task (code that
>> must be re-executed in kernel space for validation reasons anyway)
>> and also help solve problems we've encountered at Spectra with races
>> in fault event processing, spare management, and device arrival and
>> departures.
>>
>> So I'm excited by your work in this area and would encourage you
>> to "think larger" than just trying to integrate root pool discovery
>> with GEOM.  Spectra may even be able to help in this work sometime
>> in the near future.
>
> GEOM tasting would most likely require rewriting the code heavily.
> Also note that you can have pools in your system that do match your
> hostid, but that the user decided to keep exported; such pools should
> not be configured automatically. Not a huge problem probably as there is pool
> status somewhere in the metadata that we can use to see if the pool is
> exported or not.

This topic came up during ZFS day last week.  It turns out that the OS-X
port of ZFS already does this and Don Brady said he's happy to share
patches.

I don't see any reason why this type of solution cannot be upstreamed
as well.  On Illumos, user land can maintain the cache file and use it
to ask the in-kernel ZFS code to "taste" devices, minimizing the amount
of divergence.

--
Justin

From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  7 18:43:59 2012
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 5A0901065670;
	Sun,  7 Oct 2012 18:43:59 +0000 (UTC)
	(envelope-from gibbs@scsiguy.com)
Received: from aslan.scsiguy.com (ns1.scsiguy.com [70.89.174.89])
	by mx1.freebsd.org (Postfix) with ESMTP id 255A38FC08;
	Sun,  7 Oct 2012 18:43:58 +0000 (UTC)
Received: from macbook.scsiguy.com (macbook.scsiguy.com [192.168.0.99])
	(authenticated bits=0)
	by aslan.scsiguy.com (8.14.5/8.14.5) with ESMTP id q97Ihw7r084465
	(version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NO);
	Sun, 7 Oct 2012 12:43:58 -0600 (MDT)
	(envelope-from gibbs@scsiguy.com)
Content-Type: text/plain; charset=iso-8859-1
Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\))
From: "Justin T. Gibbs" <gibbs@scsiguy.com>
In-Reply-To: <5071A071.1020800@FreeBSD.org>
Date: Sun, 7 Oct 2012 12:44:02 -0600
Content-Transfer-Encoding: 7bit
Message-Id: <97D56E7F-1284-4FB1-8C83-9EE04FE4F59F@scsiguy.com>
References: <76CBA055-021F-458D-8978-E9A973D9B783@scsiguy.com>
	<506EB43B.8050204@FreeBSD.org> <5071A071.1020800@FreeBSD.org>
To: Andriy Gapon <avg@FreeBSD.org>
X-Mailer: Apple Mail (2.1499)
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
	(aslan.scsiguy.com [192.168.0.4]);
	Sun, 07 Oct 2012 12:43:58 -0600 (MDT)
Cc: Konstantin Belousov <kib@FreeBSD.org>,
	Pawel Jakub Dawidek <pjd@FreeBSD.org>, fs@FreeBSD.org
Subject: Re: ZFS: Deadlock during vnode recycling
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 07 Oct 2012 18:43:59 -0000

On Oct 7, 2012, at 9:32 AM, Andriy Gapon <avg@FreeBSD.org> wrote:

> In fact here is a real patch that I would like to propose:
> http://people.freebsd.org/~avg/zfs-getnewvnode_reserve.diff

OS-X has these same types of problems and I talked with Don Brady
of the OS-X ZFS port about them during ZFS day.  It sounds like he
explicitly pre-allocates vnodes in these code paths instead of
relying on a reserve pool.  I plan to review his work since I expect
he's found and fixed problems we don't even know we have yet.

My only complaint with this patch is that it doesn't include stats
counters for these rare conditions so that I can validate that the
code is exercised during a test suite.  Can you merge in the kstat
portion of the change I proposed?

--
Justin


From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  7 20:17:07 2012
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 394DD1065673
	for <freebsd-fs@freebsd.org>; Sun,  7 Oct 2012 20:17:07 +0000 (UTC)
	(envelope-from rmacklem@uoguelph.ca)
Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca
	[131.104.91.44])
	by mx1.freebsd.org (Postfix) with ESMTP id AE6AE8FC0C
	for <freebsd-fs@freebsd.org>; Sun,  7 Oct 2012 20:17:06 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Ap4EALzhcVCDaFvO/2dsb2JhbABFDoYDuhiCIAEBBSMERwszEQUBEwIEVQaIGKYnkVyOOgGCEoESA45uhn2QLoIyV4FAOw
X-IronPort-AV: E=Sophos;i="4.80,548,1344225600"; d="c'?scan'208";a="182460992"
Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca)
	([131.104.91.206])
	by esa-jnhn-pri.mail.uoguelph.ca with ESMTP; 07 Oct 2012 16:17:05 -0400
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1])
	by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 5D4FBB3F36;
	Sun,  7 Oct 2012 16:17:05 -0400 (EDT)
Date: Sun, 7 Oct 2012 16:17:05 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Piete Brooks <Piete.Brooks@cl.cam.ac.uk>
Message-ID: <2071960851.1864186.1349641025365.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <E1TKsFj-0007ai-9s@mta0.cl.cam.ac.uk>
MIME-Version: 1.0
Content-Type: multipart/mixed; 
	boundary="----=_Part_1864185_1441828033.1349641025360"
X-Originating-IP: [172.17.91.201]
X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - IE7 (Win)/6.0.10_GA_2692)
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Cc: FS List <freebsd-fs@freebsd.org>,
	Ilias Marinos <ilias.marinos@cl.cam.ac.uk>,
	Brooks Davis <brooks@csl.sri.com>, Herbert Poeckl <freebsdml@ist.tugraz.at>
Subject: Kerberized NFS/gssd credential cache issue
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 07 Oct 2012 20:17:07 -0000

------=_Part_1864185_1441828033.1349641025360
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Piete Brooks wrote:
> I initially took the priorities to be sorted, but it seems that all just add
> one to the score. Is this as planned, or should "++" become "|= 1 << N" so
> that the most important one always wins?
My intent was that they all count the same, because I don't know if
one is more important than another. (A "more important" one could add
+N, if we collectively decide what is "more important".)

I hope you don't mind, but I thought if this is going to be discussed,
it should be on a mailing list, so I've replaced some of the cc's with
freebsd-fs@. (I took out the ones I believe will be reading the list.)

Everyone, a discussion has been going on w.r.t. an NFS over Kerberos
issue, where the gssd can't find the Kerberos credentials cache file
because it assumes it uses a name /tmp/krb5cc_<N>, where <N> is the
effective uid. Some setups of sshd use different naming, usually a
random suffix appended to the above, to differentiate between login
sessions, so the credentials cache can be destroyed upon logout.

The Linux gssd does a search of directories, using various heuristics
to try and guess which file is the most appropriate one.
I've coded a function that does something similar. Since I am not
a Kerberos wizard, I don't know how appropriate the heuristics are.
I have attached testcc.c, which is the function plus a simple main()
to test it. (Once tested, this function would be used in the gssd to
select a credentials cache file.)
The current code does the following:
- Searches a directory for files that satisfy the following:
  - has "krb5cc_" as a substring of the file's name
  - is a regular file
  - is owned by the uid
  - has a valid tgt in it
  For each file that satisfies the above, I generate a "rating",
  which is an attempt at heuristically guessing the most
  appropriate file, when there is more than one file matching the
  above:
  - add one to the rating for each of
    - not a cross-realm tgt
    - the principal without realm is the same name as
       getpwuid(uid)->pw_name
    - if the realm for the client principal is the preferred realm
      (the preferred realm and "krb5cc_" substring are arguments
       and I was assuming the preferred realm will usually be the
       default realm)
  Each of these currently counts one towards the rating.

  If multiple files matching the above get the same rating, it uses
  the one whose tgt expires latest.
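
For anyone who wants to reason about the rating without opening the
attachment, the selection step can be sketched as follows.  This is not the
code from testcc.c: struct cc_candidate, cc_rating and cc_select are
hypothetical names, and the sketch assumes the per-file checks (name match,
regular file, ownership, valid tgt) have already been done and their results
stored in the struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical summary of one credential cache file that passed the filters. */
struct cc_candidate {
	const char *path;
	bool cross_realm_tgt;		/* tgt is a cross-realm tgt */
	bool principal_matches_pwname;	/* principal (sans realm) == pw_name */
	bool preferred_realm;		/* client realm is the preferred realm */
	time_t tgt_endtime;		/* when the tgt expires */
};

/* Each satisfied criterion currently counts one towards the rating. */
int
cc_rating(const struct cc_candidate *c)
{
	int rating = 0;

	assert(c != NULL);
	if (!c->cross_realm_tgt)
		rating++;
	if (c->principal_matches_pwname)
		rating++;
	if (c->preferred_realm)
		rating++;
	return (rating);
}

/* Pick the highest rating; on a tie, prefer the tgt that expires later. */
const struct cc_candidate *
cc_select(const struct cc_candidate *cands, size_t n)
{
	const struct cc_candidate *best = NULL;
	int best_rating = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		int r = cc_rating(&cands[i]);

		if (best == NULL || r > best_rating ||
		    (r == best_rating &&
		    cands[i].tgt_endtime > best->tgt_endtime)) {
			best = &cands[i];
			best_rating = r;
		}
	}
	return (best);	/* NULL when there are no candidates */
}
```

Giving one criterion more weight would only mean adding a larger increment in
cc_rating(); the tie-break on expiry time would stay the same.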

So, Kerberos wizards...
Should there be other criteria for selecting the file?
Should some of the rating checks count for more than others?
(They currently each count as 1, although some could count for more.)

Personally, I don't like the idea that a uid has multiple credential
cache files, since there is no definitive way to select the "correct one"
to authenticate a "uid", but it seems unavoidable.

Thanks in advance for any comments, rick

------=_Part_1864185_1441828033.1349641025360--

