MongoDB Hands-on Notes 4: Querying Across a Lot of Data

Last time we got a simple RESTful API up and running; next, let's fill the database with some data.

As the saying goes, an apple a day keeps the PhD away. Suppose Xiao Ming eats one apple per day. To track how many apples he has eaten in total as of any given day, I keep one record per day in the applesIAte collection of the logs db.

Suppose Xiao Ming is destined never to do a PhD. Remarkably, he skips milk entirely as a baby, eats an apple on his very first day, and lives on apples from then on (that hardly sounds filling, but please don't dwell on the details). He spends every day working hard to build the motherland, lives to 100, and on the night of his 100th birthday dies with a smile, overwhelmed by the achievements of national construction (and by having eaten too many apples).

Based on these assumptions, I wrote a script recording how many apples our selfless, tenacious Xiao Ming ate over his extraordinary life. It uses the insert_many() method, which is considerably faster when writing many documents at once.

# MongoClient actually lives in pymongo; no Flask involved in this script
from pymongo import MongoClient

mongo = MongoClient(host="localhost", port=27999)
db = mongo.logs
db.authenticate("readwriter", "123456")  # pymongo 3.x-era auth

# One document per day: by day N, Xiao Ming has eaten N apples
apples = [{
    "day": day,
    "apple": day
} for day in range(1, 100*365+1)]
db.applesIAte.insert_many(apples)
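
Just to illustrate how much the batching helps, here is a rough timing sketch (the collection names benchOneByOne and benchBulk are made up for this test, and the absolute numbers will of course vary by machine):

import time
from pymongo import MongoClient

mongo = MongoClient(host="localhost", port=27999)
db = mongo.logs
db.authenticate("readwriter", "123456")  # pymongo 3.x-era auth

docs = [{"day": day, "apple": day} for day in range(1, 36501)]

start = time.time()
for doc in docs:
    db.benchOneByOne.insert_one(dict(doc))  # copy: insert_* adds _id in place
print("insert_one loop: %.2f s" % (time.time() - start))

start = time.time()
db.benchBulk.insert_many([dict(doc) for doc in docs])  # one batched call
print("insert_many:     %.2f s" % (time.time() - start))

insert_one pays one network round trip per document, while insert_many sends the documents in large batches, which is where the speedup comes from.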

Now let's take a look at Xiao Ming's glorious life.

> db.applesIAte.stats()
{
        "ns" : "logs.applesIAte",
        "count" : 36500,
        "size" : 1752064,
        "avgObjSize" : 48,
        "numExtents" : 5,
        "storageSize" : 2793472,
        "lastExtentSize" : 2097152,
        "paddingFactor" : 1,
        "paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only.",
        "userFlags" : 1,
        "capped" : false,
        "nindexes" : 1,
        "totalIndexSize" : 1193696,
        "indexSizes" : {
                "_id_" : 1193696
        },
        "ok" : 1
}

So Xiao Ming lived 36,500 days (forgive me for ignoring leap years). Remarkable!

Before we review his glorious life (i.e., how many apples he ate), let's pull a little trick first, so that we can measure exactly how fast our review (query) of his life runs (the man lived a hundred years, after all!).

> db.setProfilingLevel(2)
{ "was" : 0, "slowms" : 100, "ok" : 1 }

The command above means that from now on, every act of reviewing Xiao Ming's life (every query) will be recorded along with its execution details. Level 2 profiles all operations; level 1 only those slower than slowms (100 ms here); level 0 turns profiling off. The was: 0 in the reply is simply the previous level.
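
The same switch can be flipped from Python too; a sketch (pymongo 3.x era; the profile command needs sufficient privileges, so it may have to run as a more privileged user than readwriter):

from pymongo import MongoClient

mongo = MongoClient(host="localhost", port=27999)
db = mongo.logs
db.authenticate("readwriter", "123456")

print(db.command("profile", 2))     # same effect as db.setProfilingLevel(2)
print(db.command({"profile": -1}))  # -1 changes nothing, just reads back the setting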

Next, let's see how many apples Xiao Ming had eaten by day 10,000. (findOne({day: 10000}) would also work here; it doesn't change the conclusion.)

> db.applesIAte.find({day: 10000})
{ "_id" : ObjectId("56a9c39cabf8322aa0828dec"), "day" : 10000, "apple" : 10000 }

Clearly he has eaten 10,000 of them. Awesome! Now let's see how long it took us to discover that.

> db.system.profile.find()
{
    "op":"query",
    "ns":"logs.applesIAte",
    "query":{
        "day":10000
    },
    "ntoreturn":0,
    "ntoskip":0,
    "nscanned":0,
    "nscannedObjects":36500,
    "keyUpdates":0,
    "writeConflicts":0,
    "numYield":285,
    "locks":{
        "Global":{
            "acquireCount":{
                "r":NumberLong(572)
            }
        },
        "MMAPV1Journal":{
            "acquireCount":{
                "r":NumberLong(286)
            }
        },
        "Database":{
            "acquireCount":{
                "r":NumberLong(286)
            }
        },
        "Collection":{
            "acquireCount":{
                "R":NumberLong(286)
            }
        }
    },
    "nreturned":1,
    "responseLength":62,
    "millis":20,
    "execStats":{
        "stage":"COLLSCAN",
        "filter":{
            "day":{
                "$eq":10000
            }
        },
        "nReturned":1,
        "executionTimeMillisEstimate":10,
        "works":36502,
        "advanced":1,
        "needTime":36500,
        "needFetch":0,
        "saveState":285,
        "restoreState":285,
        "isEOF":1,
        "invalidates":0,
        "direction":"forward",
        "docsExamined":36500
    },
    "ts":    ISODate("2016-01-28T07:53:57.857    Z"),
    "client":"127.0.0.1",
    "allUsers":[
        {
            "user":"boss",
            "db":"admin"
        },
        {
            "user":"readwriter",
            "db":"logs"
        }
    ],
    "user":"readwriter@logs"
}

Almost all of that big blob is useless to us; we only need two fields: "millis":20 and "docsExamined":36500. That is, the query took 20 ms and had to examine 36,500 documents to find this one record. (Slow!)
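
Incidentally, if these two numbers are all you want, you don't even need the profiler: explain() reports them per query. A sketch (field names follow the 3.0-era explain format):

from pymongo import MongoClient

mongo = MongoClient(host="localhost", port=27999)
db = mongo.logs
db.authenticate("readwriter", "123456")

plan = db.applesIAte.find({"day": 10000}).explain()
stats = plan["executionStats"]
print("millis: %s, docsExamined: %s"
      % (stats["executionTimeMillis"], stats["totalDocsExamined"]))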

If Xiao Ming were not an ordinary man but an immortal above vulgar interests, living ten thousand years, or even chatting and laughing with mammoths a hundred thousand years ago (still with no intention of doing a PhD), our records would keep piling up and queries would get slower and slower.

Adding an index on day solves this problem nicely. The argument {day: 1} requests an ascending index, matching the way day grows.
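
From Python the equivalent would be (a sketch; pymongo.ASCENDING is just this same 1):

import pymongo
from pymongo import MongoClient

mongo = MongoClient(host="localhost", port=27999)
db = mongo.logs
db.authenticate("readwriter", "123456")

db.applesIAte.create_index([("day", pymongo.ASCENDING)])  # names it "day_1"

In the mongo shell: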

> db.applesIAte.createIndex({day: 1})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
}
> db.applesIAte.stats()
{
        "ns" : "logs.applesIAte",
        "count" : 36500,
        "size" : 1752064,
        "avgObjSize" : 48,
        "numExtents" : 5,
        "storageSize" : 2793472,
        "lastExtentSize" : 2097152,
        "paddingFactor" : 1,
        "paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only.",
        "userFlags" : 1,
        "capped" : false,
        "nindexes" : 2,
        "totalIndexSize" : 2117584,
        "indexSizes" : {
                "_id_" : 1193696,
                "day_1" : 923888
        },
        "ok" : 1
}

From the stats() output, the index on day is indeed in place (note day_1 under indexSizes). So how does it perform? This time let's profile a query for day 20,000:

{
    "op":"query",
    "ns":"logs.applesIAte",
    "query":{
        "day":20000
    },
    "ntoreturn":0,
    "ntoskip":0,
    "nscanned":1,
    "nscannedObjects":1,
    "keyUpdates":0,
    "writeConflicts":0,
    "numYield":0,
    "locks":{
        "Global":{
            "acquireCount":{
                "r":NumberLong(2)
            }
        },
        "MMAPV1Journal":{
            "acquireCount":{
                "r":NumberLong(1)
            }
        },
        "Database":{
            "acquireCount":{
                "r":NumberLong(1)
            }
        },
        "Collection":{
            "acquireCount":{
                "R":NumberLong(1)
            }
        }
    },
    "nreturned":1,
    "responseLength":62,
    "millis":0,
    "execStats":{
        "stage":"FETCH",
        "nReturned":1,
        "executionTimeMillisEstimate":0,
        "works":2,
        "advanced":1,
        "needTime":0,
        "needFetch":0,
        "saveState":0,
        "restoreState":0,
        "isEOF":1,
        "invalidates":0,
        "docsExamined":1,
        "alreadyHasObj":0,
        "inputStage":{
            "stage":"IXSCAN",
            "nReturned":1,
            "executionTimeMillisEstimate":0,
            "works":2,
            "advanced":1,
            "needTime":0,
            "needFetch":0,
            "saveState":0,
            "restoreState":0,
            "isEOF":1,
            "invalidates":0,
            "keyPattern":{
                "day":1
            },
            "indexName":"day_1",
            "isMultiKey":false,
            "direction":"forward",
            "indexBounds":{
                "day":[
                    "[20000.0, 20000.0]"
                ]
            },
            "keysExamined":1,
            "dupsTested":0,
            "dupsDropped":0,
            "seenInvalidated":0,
            "matchTested":0
        }
    },
    "ts":    ISODate("2016-01-28T08:05:46.904    Z"),
    "client":"127.0.0.1",
    "allUsers":[
        {
            "user":"boss",
            "db":"admin"
        },
        {
            "user":"readwriter",
            "db":"logs"
        }
    ],
    "user":"readwriter@logs"
}

And the result: "docsExamined":1 and "millis":0. We examined exactly one document to learn how many apples Xiao Ming had eaten by day 20,000, and the time spent rounds down to zero! So once Xiao Ming becomes an immortal, we must index day if we ever want to know how many apples he ate to dodge that PhD XD.
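
One more trick while we're at it (my own aside, not part of the original experiment): if the query projects only day and excludes _id, the day_1 index covers the query entirely and even the FETCH stage disappears. And since apples eaten happens to equal the day number, the index alone answers our question:

from pymongo import MongoClient

mongo = MongoClient(host="localhost", port=27999)
db = mongo.logs
db.authenticate("readwriter", "123456")

plan = db.applesIAte.find({"day": 20000}, {"_id": 0, "day": 1}).explain()
print(plan["executionStats"]["totalDocsExamined"])  # 0: no documents fetched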

Postscript: I recently read a book whose author is even more playful than 鸟哥 (of Linux fame); the whole thing reads like a light novel in the style of 《春物》. I tried to 'study' that style in this note, and having tried it, it does feel a bit over the top XD

MongoDB Hands-on Notes 3: A Simple RESTful API to Read and Write the Database

In the previous note we set up the two users of the logs db and tried writing one simple document into the gen1 collection. Using the same method, I wrote two more documents, so gen1 now holds three.

> db
logs
> db.gen1.find()
{ "_id" : ObjectId("56a0daabcf03a917cb662aef"), "time" : 1, "value" : 1 }
{ "_id" : ObjectId("56a0dab3cf03a917cb662af0"), "time" : 2, "value" : 2 }
{ "_id" : ObjectId("56a0dabacf03a917cb662af1"), "time" : 3, "value" : 3 }

Writing JS statements at the command line to operate MongoDB certainly works, but many situations call for another language to read the database. JS may lack wheels that others have already built; for instance, if I want to do scientific computing in Python with data stored in MongoDB, I need a Python driver to talk to the database.

There are already plenty of wheels for this, mongoEngine and pymongo among them. But as the code grows, maintenance gets troublesome: database access and computation tend to end up mixed together, and the structure loses clarity.
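
For reference, using pymongo directly is simple enough; a minimal sketch reading the same gen1 collection:

from pymongo import MongoClient

mongo = MongoClient(host="localhost", port=27999)
db = mongo.logs
db.authenticate("readwriter", "123456")  # pymongo 3.x-era auth

for doc in db.gen1.find():
    print(doc)  # the three documents shown above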

So I wondered whether the two parts could be separated and connected by something in between. Then I learned about the rather fancy (I guess?) RESTful style, and figured each part could be split out and glued together over HTTP.

So I used Flask and Flask-RESTful to build an extremely simple RESTful API. Its only features are reading the latest document from the gen1 collection of the logs db and writing new documents into it.

The API's authentication uses the simplest basic HTTP authentication. The auth part of the request header is merely base64-encoded (that can hardly be called encryption), so it is crude and offers no security whatsoever. A MongoClient is created on every request; from quick testing, I haven't noticed any performance impact so far.

# Create a MongoClient
mongo = MongoClient(host="localhost", port=27999)
# Authenticate against the logs db with the credentials from the request
mongo["logs"].authenticate(request.authorization.username, request.authorization.password)

Below is all of the code, only about 50 lines.

simple_restful.py

from flask import Flask, request
from flask_restful import Api, Resource
from bson.json_util import loads, dumps
# MongoClient actually lives in pymongo
from pymongo import MongoClient
from pymongo.errors import OperationFailure


class Data(Resource):
    def get(self):
        mongo = MongoClient(host="localhost", port=27999)
        try:
            # authenticate with the credentials from the basic-auth header
            mongo["logs"].authenticate(request.authorization.username, request.authorization.password)
            # newest document = largest _id (ObjectIds grow with time)
            last_log = mongo["logs"]["gen1"].find().sort("_id", -1)[0]
            return {
                "err": "False",
                "message": "Successfully get data",
                "result": {"data": dumps(last_log)}
            }
        except OperationFailure as err:
            return {
                "err": "True",
                "message": "Failed getting data",
                "result": err.details
            }

    def post(self):
        mongo = MongoClient(host="localhost", port=27999)
        try:
            mongo["logs"].authenticate(request.authorization.username, request.authorization.password)
            # request body looks like {"data": {...document...}}
            payload = loads(request.data)
            result = mongo["logs"]["gen1"].insert_one(payload["data"])
            return {
                "err": "False",
                "message": "Successfully post data",
                "result": {"data": dumps(payload["data"]), "ack": str(result.acknowledged)}
            }
        except OperationFailure as err:
            return {
                "err": "True",
                "message": "Failed post data",
                "result": err.details
            }

app = Flask(__name__)
api = Api(app)
api.add_resource(Data, '/data')
app.run(debug=True)

Let's give it a quick test with curl.

>curl -v -H "Content-Type: application/json" --user readwriter:123456 http://localhost:5000/data
* Adding handle: conn: 0x140cf10
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x140cf10) send_pipe: 1, recv_pipe: 0
* About to connect() to localhost port 5000 (#0)
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 5000 (#0)
* Server auth using Basic with user 'readwriter'
> GET /data HTTP/1.1
> Authorization: Basic cmVhZHdyaXRlcjoxMjM0NTY=
> User-Agent: curl/7.33.0
> Host: localhost:5000
> Accept: */*
> Content-Type: application/json
>
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Content-Type: application/json
< Content-Length: 189
< Server: Werkzeug/0.11.3 Python/2.7.10
< Date: Thu, 21 Jan 2016 13:59:24 GMT
<
{
    "err": "False",
    "message": "Successfully get data",
    "result": {
        "data": "{\"_id\": {\"$oid\": \"56a0dabacf03a917cb662af1\"}, \"value\":
3.0, \"time\": 3.0}"
    }
}
* Closing connection 0

It did read back the latest document. Now let's try writing one.

C:\Users\Juiceyang\Desktop>curl -v -H "Content-Type: application/json" --user readwriter:123456 http://localhost:5000/data -X POST -d "{\"data\":{\"time\":4,\"value\":4}}"
* Adding handle: conn: 0x85cf10
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x85cf10) send_pipe: 1, recv_pipe: 0
* About to connect() to localhost port 5000 (#0)
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 5000 (#0)
* Server auth using Basic with user 'readwriter'
> POST /data HTTP/1.1
> Authorization: Basic cmVhZHdyaXRlcjoxMjM0NTY=
> User-Agent: curl/7.33.0
> Host: localhost:5000
> Accept: */*
> Content-Type: application/json
> Content-Length: 29
>
* upload completely sent off: 29 out of 29 bytes
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Content-Type: application/json
< Content-Length: 210
< Server: Werkzeug/0.11.3 Python/2.7.10
< Date: Thu, 21 Jan 2016 14:02:58 GMT
<
{
    "err": "False",
    "message": "Successfully post data",
    "result": {
        "ack": "True",
        "data": "{\"_id\": {\"$oid\": \"56a0e512612bf10dcce366d7\"}, \"value\":
4, \"time\": 4}"
    }
}
* Closing connection 0

Judging from the response, the write succeeded as well.