Tales of Two Cities: Using Social Media to Understand Idiosyncratic Lifestyles in Distinctive Metropolitan Areas

Tianran Hu, Eric Bigelow, Jiebo Luo, Fellow, IEEE, and Henry Kautz, Member, IEEE

Authors are with the Computer Science Department, University of Rochester. E-mail: {thu,jluo,kautz}@cs.rochester.edu, [email protected]

arXiv:1701.06236v1 [cs.SI] 22 Jan 2017

Abstract: Lifestyles are a valuable model for understanding individuals' physical and mental lives, comparing social groups, and making recommendations for improving people's lives. In this paper, we examine and compare lifestyle behaviors of people living in cities of different sizes, utilizing freely available social media data as a large-scale, low-cost alternative to traditional survey methods. We use the Greater New York City area as a representative for large cities, and the Greater Rochester area as a representative for smaller cities in the United States. We employed matrix factor analysis as an unsupervised method to extract salient mobility and work-rest patterns for a large population of users within each metropolitan area. We discovered interesting human behavior patterns at both a larger scale and a finer granularity than is present in previous literature, some of which allow us to quantitatively compare the behaviors of individuals living in big cities to those living in small cities. We believe that our social media-based approach to lifestyle analysis represents a powerful tool for social computing in the big data age.

Index Terms: Lifestyles, Urban Computing.

1 INTRODUCTION

We take lifestyle to be the way in which a person or group lives, including the interests, opinions, behaviors, and behavioral orientations. Understanding lifestyle is key to gaining insight into the physical and mental aspects of individuals, social groups and cultures. Health, for example, is highly related to one's lifestyle [1], [2]. Cultural boundaries can be discovered from people's ways of living such as pace of life, eating and drinking habits and so on [3], [4]. Researchers have also discovered correlations between health and individuals' daily movements as estimated from cellphone GPS tags on social media [5].

In this work, we study the differences of lifestyles in cities of different sizes. A popular stereotype is that life in big cities is fast-paced, high-pressure, and consistently exciting, while life in small cities is calmer and less varied due to a lower population density and a more limited selection of recreational venues.

We select the Greater New York City area (NYC) as being representative, for our purposes, of big cities in the US. For smaller cities, we select the Greater Rochester area (ROC) as representative for two main reasons: First, the size of Rochester (0.2 million) is close to the median size (0.16 million) of cities in the US, approximately 40 times smaller than NYC. Second, these two areas are located close to each other (both in the north-eastern US). Geographic closeness generally leads to similarity of climate and culture, which helps eliminate confounding factors that may lead to differences in lifestyle behaviors unrelated to city size.

In contrast to traditional research investigating lifestyle patterns, where data collection methods include questionnaires and telephone interviewing [6], [7], [8], we leverage data from social media to make inferences about people's lifestyles. The wide adoption of social media brings researchers a new opportunity of studying natural, unconstrained human behavior at very large scales. Foursquare is one of the most popular Location Based Social Networks (LBSNs), holding 5 billion check-in records for 55 million users worldwide (footnote 1). This offers us a rich data source for conducting mobility, behavior and lifestyle studies.

Footnote 1. https://foursquare.com/about

We consider temporal and spatial lifestyle in this work. The temporal dimension of a person's lifestyle is assumed to correlate with his/her work-rest ratio in daily activities. In the primary literature on circadian topology (CT), people are classified into one of three categories: morning-types, evening-types, and neither-types [9]. In the CT literature, individuals are modeled by just one of these types. In our present work, work-rest behavioral patterns are instead considered to be a weighted combination of all three temporal lifestyles: "Night Owl", "Early Bird" and "Intermediate".

To avoid assigning a person to a lifestyle in an arbitrary or qualitative fashion, we employ non-negative matrix factorization (NMF) to discover three latent patterns of temporal activity. The extracted patterns offer precise definitions of activity levels associated with specific lifestyles and align with our assumptions about human work-rest habits. A spatial dimension is used to describe lifestyles according to locational behavior. For example, one primitive lifestyle pattern is defined by frequent visits to POIs (points of interest) such as bars and music venues, while another is defined by visits to parks, art galleries and museums. We then apply a clustering method to group these primitive latent patterns into more complex lifestyles that are representative of a group of individuals (e.g. students or stay-at-home parents). We find significant variance between the distribution of lifestyles in NYC and ROC. Additionally, we use third-order tensor decomposition to

the morningness-eveningness questionnaire (MEQ) of [12] and variations of it [13]. In the work of Horne et al., Morning-type subjects (MTs) are found to wake at a mean of 7:24am, Neither-type subjects (NTs) at 8:07am, and Evening-type subjects (ETs) at 9:18am; mean bed times for the three types are 11:26pm, 11:30pm, and 1:05am, respectively.
These specific find composite patterns across both spatial and temporal times vary in different studies, leading to differing assertions dimensions. We extract clearly identifiable patterns of behav- about how much of the population is a member of each CT ior, for example high school students posting during school type [14], [15]. hours, andfor collegestudents frequentlyvisiting or livingon Randler finds a significant positive correlation between campus.Thismethodofferspromiseasanefficientwayofex- “morningness”tendenciesofpeopleandsatisfactioninlife[7], tractingcomplicatedpatternsacrossmultiplehigh-dimensional and Monk [16] find that MT individuals appear to have more spaces. regular lifestyle than ETs. A positive correlation between The main contributions of this work are: eveningness and depression level is reported by Hasler et 1. Use of open-source geo-tagged social media data for an- al. [17]. A thorough review of contemporary CT literature is alyzing lifestyle patterns as a low-cost, large-scale alternative available in [15]. to traditional survey methods. 2. Application of matrix factor analysis to extract persistent 2.2 SocialMediaAnalytics andsalienthumanmobilityandwork-restpatternsoveralarge population of users. In recent years researchers have successfully utilized social 3.ApplicationofCPtensordecompositiontodiscovercom- mediainresearchventuresrelatedtolifestyleanalysis.Noulas posite spatial-temporal lifestyle patterns which are useful for et al. of [18] use Foursquare data to discover the behavioral understanding fluctuations in people’s activity across different habits of residents in London. The work presented in this time ranges and locations. paper is strongly inspired by this research: we contribute 4. Confirming intuitive knowledge and previous research in stackedplotssimilartothoseofNoulasetal.,representingthe human activity patterns with quantitative, unsupervised data relativevisitfrequenciesofthemostfrequentPOIs,comparing analysis. 
between NYC (6) and ROC (5) and between weekends and 5.Sheddinglightonthedifferencesandsimilaritiesbetween weekdayswithinthesecities.Basedonthecontentsoftweets, life in big cities and life in smaller cities, quantitatively Sadiek et al. build a language model to detect the health confirming many of the common perceptions about life in big condition of individuals [19]. By relating a user’s health and small cities. For example, life in big cities is more work- level with other attributes such as environmental features of focused, while it is more home-focused in smaller cities; life places where the user spends tags as estimated from his tweet in large cities is also more fast-paced and diverse. Further- geotags, they estimate the influence of lifestyle on health more, we have discovered fine-grained lifestyle descriptions conditions [5]. Eating and drinking habit is also a key point that previous small-scale survey-based studies have failed to to understanding human life. In [20] Abbar et al. find out the illuminate. For example, we extracted three types of temporal namesoffoodinpeople’stweetsandusethemtoestimatethe lifestyles, and report the activity level of each lifestyle along caloric values people possibly take. time quantitatively. Cranshaw et al. [21] construct a metric called Location Entropy to measure the diversity of a POI. Sang et al. discuss people’s movement session patterns [22] based on China’s 2 RELATED WORK LBSN data. The eating and drinking habits of different coun- 2.1 SociologyandCronobiology tries and regions are investigated in [4] based on Foursquare Lifestyle is well studied in sociology. The work of [8] data. They find that geographic closeness usually leads to suggests that lifestyles such as residential location, mode closeness in eating and drinking habits. Wu et al. reported options, destination choices, and trip timing are constrained an approach on modeling temporal dynamic in [23]. Their by household considerations. 
Gender difference in lifestyles work showed that besides user-item factors temporal factors also attracts interest of many researchers. Budesa et al. study are as important in social media popularity prediction. Other the influence of gender on perceived healthy and unhealthy aspectsoflifestylesuchaspaceoflifeandpowerdistance,are lifestyles, finding that gender is not an important determinant discussed in [3]. They estimate each index related to life via ofindividualperceptionsabouthealth[6].Merrittetal.in[10] tweetscollectedallovertheworld.In[24],Golderetal.found suggest that men and women have no significant difference that the negative affect (NA) tweets sent in winter is higher in motor ability in daily activities. Finally, a study [11] on thanthosesentinsummer.Similarresultsarereportedin[25], universitystudentsfindsthatfemalestudentsarehealthierdue in which the weather influence on human sentiment is studied to less alcohol consumption and more healthy habits. using tweets. Tensor decomposition was applied in [26]. In Much work has been done on human work-rest habits in this work, Zheng Yu et. al decomposed third-order tensors to chronobiology and Circadian Topology (CT). The traditional extractnoise-locationcompoundpatternsinanurbanarea.We method of studying how work-rest patterns relate to aspects employ similar method in our work to find temporal-spatial of physical and mental well-being has been learned through compound patterns of lifestyles. 3 3 DATA SET AND PREPROCESSING NYC ROC FoursquareCheck-ins 233,046 49,744 The large number of self-reported location records and wide Foursquarevenues 99,466 13,483 ExtendedCheck-ins 1,028,016 971,660 geographic coverage make Foursquare a valuable data source Total#ofusers 12,960 10,576 for analyzing behavioral tendencies across groups of indi- Maleusers 1,690 1,491 viduals. 
However, directly collection of users’ check-ins is Femaleusers 1,803 2,017 a nontrivial task due to the strict limits on Foursquare data TABLE1:Numberofusersinourdataset,bycityandgender. download rates. As an alternative, many researchers collect Weonlyassigngenderlabelstouserswhenahighconfidence Foursquare data through other social media sources that con- in gender classification is achieved, so the genders of many nect with Foursquare such as Twitter [18]. If a user links his users are considered unknown. or her Foursquare account with a Twitter account, when s/he performsacheck-inonFoursquare,ageo-taggedtweetwillbe postedautomatically.Thistweetcontainsalinktothewebpage opdw73Wf1PAcrafra3.Oohe1dett,metaov4sIahgce.es8yeoonecnT3sgLlartu.ovtleyitPweeefoewaceOngc-evtoctsuaoeeIoaransetdtrgkisnyideg,fb2wlarseete3bowp,whd3empSeease,ert0utewaeasRuu4ctsursseO6hstedbeheidfsyCgFteraestoneo.hsttumxeuhiiFosndscrAroeflspourumNqrtoastuorucsnseYisgavhsqtoshCriee&hutbeelohc,ayvlkdeScereaEreshwro.nttenaooaidiFccltntulehoiksrc9arciorv-iotn9saliaineltp,taMilsa4exsner,g6aFopcmefwom6votredeorieurpiudicnimeraolfishetsse,edsqel,,4trcueHsao9Fwkar’rf,oo-or7hoilecumu4neeu.hrn4srsesteIesdq,ncuPftukrs6hOoeotae-0thanimrrIcn0neeess. probablity of a POI's check−in volume exceeds a given amount −3−2−1010101010 ●●●●●●●●●●●●●●●●●FE●●●●●●●●●ox●●●●●●ut●●●●●●e●●●r●●●ns●●●●●●q●●●d●●●●●●ae●●●●●●u●●●d●●●●●●r ●●●de●●●●●●●●●a ●●●d●●●t●●●●●●aa●●●●●●t●●●a●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ●● may assign both American Restaurant and Restaurant to a ● Twitter data ● ●● single venue. 100 101 102 1●●0●3 1●●0●●4● 105 DuetothesparsityofdirectFoursquarecheck-ins,wechose amount of check−ins to extend these activity records by applying a method used Fig. 1: Plot of Complementary Cumulative Distribution Func- in [19]: for each geo-tagged tweet located within a small tions for the numbers of check-in amount of POIs. 
This plot distance (30 meters) of a POI, we count it as a check-in reports the probabilities of a POI’s check-in volume exceeds from this POI. Through this process, we extend the number a given amount in three data sets. of check-ins to 1,028,016 for NYC and 971,660 for ROC. In order to study gender effect on lifestyles, we employ the API of genderize.io to assign gender tags to each user [20]. that the extended data set not only preserves the long-tailed Genderize.iogivestheprobabilityofanindividualbeingeither characteristic, but also shortens the gap between the original maleorfemalegivenhisorherusername.Wefirstfilteroutthe Foursquare data and its subset that is extracted from tweets. userswhosendlessthan10tweetsduringoursamplingperiod, and obtain 12,960 users in NYC and 10,576 users in ROC. We then feed the handles for these users’ Twitter accounts 0.08 into the genderize.io API, and filter out gender tags with 0.07 low confidence (probability < 0.8). From this, we aggregate a total of 3,493 male- and 3,508 female-labeled users (see 0.06 American Restaurant Tchreooaanbwstloeeenvn1eatsrb),.l[eO2ga7itvhc]e,cenurprrwatohcofiyerlkeocfhoinoamfsuoprprlmearxepadipttiyircootnaeo,cdfha,unthswdeeerspseergoxmeficnlleuedtsheiporvideculstsyu,irnueagssnedtd[w2tte8hhe]ee;t Percentage of Checki-‐ins000...000345 BOCHGDUoaioynffinrffmmi evceer eee r (sSpihtryoiv pa te) genderize.io API to assign gender labels. 0.02 Music Venue Two key points should be verified to ensure the quality of Supermarket our data set: 0.01 1) Only 20% to 25% Foursquare accounts are linked with 0 Twitter [29].Check-inscollectedfromtweetsformasubsetof Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec alltheFoursquarerecords.Weneedtoensurethatourdataset Fig. 2: Percentages of check-ins from top 10 visited POI should have a similar distribution to the original Foursquare categories. 
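The check-in extension step described in this section (a geo-tagged tweet within 30 meters of a POI counts as a check-in) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the tuple layouts, and the choice of assigning each tweet to its single nearest POI within the radius are assumptions made here, and a brute-force scan is used where a spatial index would be needed at the scale of the real data.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def extend_checkins(tweets, pois, radius_m=30.0):
    """Turn geo-tagged tweets into inferred check-ins.

    tweets: iterable of (user_id, lat, lon)
    pois:   iterable of (poi_id, lat, lon)
    Each tweet is assigned to the nearest POI within radius_m (assumption:
    nearest-POI tie-breaking; the paper only states the 30 m threshold).
    Returns a list of (user_id, poi_id) pairs.
    """
    pois = list(pois)
    checkins = []
    for user_id, lat, lon in tweets:
        best_id, best_dist = None, radius_m
        for poi_id, plat, plon in pois:
            dist = haversine_m(lat, lon, plat, plon)
            if dist <= best_dist:
                best_id, best_dist = poi_id, dist
        if best_id is not None:
            checkins.append((user_id, best_id))
    return checkins
```

The inferred pairs can then be appended to the direct Foursquare check-ins before any counting is done.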
data. To validate the applicability of our extension method, we plot the probability of the amount of visits as a function of the amount of POIs, as Noulas et al. did in [18], in Fig. 1. It shows

2) In order to obtain data sets of comparable size for the two cities, we collect tweets in NYC for a one-month period, and in ROC for a one-year period. The length of the time period for tweet collection is therefore different in the NYC and ROC data sets. Tweets from NYC were posted during June 2012, while tweets from ROC were posted from July 2012 to June 2013. In Fig. 2, we plot the percentage of check-ins from the top 10 most frequent check-in categories. Note that we eliminate duplicate categories. For example, we omit "Restaurant" (ranked 3rd) since we have "American Restaurant" in first place. The portions of check-ins from most categories remain at stable levels throughout the year. This observation implies that the distribution in one month can approximately represent the remaining months of a year. One exception is the category of University, which shows a decrease from May to August. This coincides with the summer break of universities.

4 LIFESTYLE DIFFERENCE AT CITY LEVEL

[Figure 3 plot: box plot; x-axis: POI categories (Bar, Church, Drugstore, Gas Station, Grocery, Park, Restaurant, Supermarket); y-axis: returning rate, ticks from 2 to 12]

Fig. 3: Box plot of visiting frequency of Bar, Church, Drugstore, Gas Station, Grocery, Park, Restaurant and Supermarket, aggregated over both cities.

4.1 Visiting Frequency of POIs

The visiting frequency of a location is defined as the number of visits (check-ins) divided by the number of unique visitors. In other words, visiting frequency is the average number of visits per visitor to a location. This metric measures the degree of relevance of a POI to people's daily life. The higher the visiting frequency is, the more relevant the POI is to a person's lifestyle. "Home", for example, as one of the most important locations in individuals' lives, has a very high visiting frequency. Most of the check-ins at home are performed by family members or friends, so the visiting frequency of home is very high. On the contrary, a public location such as a bar usually has a lower visiting frequency.

Regarding visiting frequency, we have two interesting observations:

• Each POI category has a specific range of visiting frequency, which is clearly indicative of differing functionality between different POIs in people's daily life.

• Some categories show different ranges of visiting frequency in cities with different sizes. This helps us to examine the different lifestyles at the city level.

Fig. 4: Comparison between visiting frequencies of 9 POI categories in big cities and small cities. The yellow boxes are the frequencies for NYC and the green ones are for ROC.

4.1.1 Visiting frequency range of POI categories

We plot the visiting frequency of several popular POI categories as a box plot in Figure 3. This plot shows that categories that are highly related to daily life are visited repeatedly, e.g. Church, Grocery, Drugstore and Supermarket all have a high visiting frequency. The median visiting frequency across all categories is approximately 4, though some show very high visiting frequencies. For example, the visiting frequency for some churches reaches 12, indicating a high prevalence of Church in some people's lives. As we expected, the highest visiting frequencies appear in Home (with a median of approximately 20 for both cities) and Office (with a median of approximately 10 for both cities), as shown in Figure 4. These two are the most frequently visited locations for most people. The visiting frequencies of Bar and Restaurant are much lower, with a median around 2.

4.1.2 Difference in visiting frequencies between NYC and ROC

In this section, we compare the visiting frequencies of categories in big cities and small cities (see Figure 4). It is interesting that the visiting frequencies for some categories in small cities are larger than those in big cities, such as Restaurant, Bar, Supermarket and Drugstore.
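The visiting frequency defined in Section 4.1 (check-ins at a location divided by its number of unique visitors) can be computed in a few lines. The sketch below is illustrative only; the function name and the `(user_id, poi_id)` input layout are assumptions made here, not something the paper specifies.

```python
from collections import defaultdict


def visiting_frequency(checkins):
    """Average number of visits per unique visitor for each POI.

    checkins: iterable of (user_id, poi_id) pairs, one per check-in.
    Returns {poi_id: total visits / number of distinct visitors}.
    """
    visits = defaultdict(int)    # total check-ins per POI
    visitors = defaultdict(set)  # distinct users per POI
    for user_id, poi_id in checkins:
        visits[poi_id] += 1
        visitors[poi_id].add(user_id)
    return {poi: visits[poi] / len(visitors[poi]) for poi in visits}


# Usage: a home visited repeatedly by two people versus a bar visited once each.
example = [("u1", "home"), ("u1", "home"), ("u1", "home"), ("u2", "home"),
           ("u1", "bar"), ("u2", "bar")]
print(visiting_frequency(example))
```

Grouping the resulting values by POI category would give the per-category distributions that a box plot summarizes.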
This may imply a higher regularity of life in smaller cities; in other words, people in smaller cities are more localized, with stricter routines. In big cities, people have more options to go to when eating (Restaurant), having fun (Bar) and purchasing daily necessities (Supermarket and Drugstore). Therefore, these places are generally visited less frequently in larger cities. For other categories such as Home, Office and Church, the visiting frequencies of the two types of cities are roughly the same. This makes obvious sense because the lifestyles of working, returning home and religion should be similar under the same cultural atmosphere.

[Figure 5 plots: stacked plots; x-axis: hours of a day, 7am to 6am; y-axis: percentage of check-ins. Panel (a), Rochester weekdays, legend: High School, Café, Bar, Coffee Shop, College Classroom, Home (private), College Academic Building, American Restaurant, Office, College Cafeteria. Second panel legend: Pizza Place, General Entertainment, College Cafeteria, Café, Hotel, Office, Coffee Shop, Home (private), Bar, American Restaurant]

4.2 Basic mobility patterns in big cities and small cities

It is interesting to study the fluctuation of residents' activity over time in terms of occurrence at POIs. We plot the 10 most popular POI categories on weekdays and weekends separately for ROC and NYC. On weekends, the mobility patterns of the two cities are similar (Figure 5b and Figure 6b). The total check-in amount climbs rapidly to a high level around 10am and 12pm in ROC and NYC, respectively. The activity levels remain constant until 9pm, when a peak of check-ins appears in both cities, indicating a sudden increase of mobility on weekend nights. After the peak, the activity level in ROC moves down quickly, while it remains at a high level through 2am in NYC. The 3 most frequently visited POI categories on weekends for both cities are Bar, American Restaurant, Home.
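The data behind stacked plots of this kind (the share of check-ins per hour and category, separately for weekdays and weekends) can be tabulated as sketched below. The function name, the input layout, and the normalization by all check-ins of the same day type are assumptions made for illustration; the paper does not state how its percentages are normalized.

```python
from collections import Counter, defaultdict
from datetime import datetime


def hourly_category_share(checkins, top_n=10):
    """Share of check-ins per (hour, category), split by weekday and weekend.

    checkins: iterable of (timestamp: datetime, category: str).
    Only the top_n most frequent categories of each day type are kept.
    Returns {"weekday": {hour: {category: share}}, "weekend": {...}}.
    """
    by_type = {"weekday": [], "weekend": []}
    for ts, category in checkins:
        day_type = "weekend" if ts.weekday() >= 5 else "weekday"  # Sat=5, Sun=6
        by_type[day_type].append((ts.hour, category))

    result = {}
    for day_type, rows in by_type.items():
        total = len(rows)
        top = {c for c, _ in Counter(c for _, c in rows).most_common(top_n)}
        counts = Counter((h, c) for h, c in rows if c in top)
        shares = defaultdict(dict)
        for (hour, category), n in counts.items():
            shares[hour][category] = n / total
        result[day_type] = dict(shares)
    return result
```

Each inner dictionary corresponds to one vertical slice of a stacked plot for a given hour.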
Obvious divergence is present between the weekday mobility patterns of the two cities (Figure 5a and Figure 6a). In big cities, there are three peaks during a day, appearing around 8am, 1pm and 9pm. A similar pattern also appears in London according to [18], indicating roughly the same mobility pattern in London and NYC. Among the three peaks, the highest one is at night, which implies that night is the most active period in large cities. In contrast, there is only one peak during a day, at 10am, in smaller cities, and the check-in amount drops significantly after that. This reveals that during weekday nights, people in small cities are not as active as those in large cities. During weekday nights, people in small cities prefer visiting Home, American Restaurant and Cafeteria, while in large cities Bar is much more popular at night, indicating that people in large cities are more prone to indulge in copious recreation during the weekends.

[Figure 5, panel (b): Rochester weekends; x-axis: hours of a day, 7am to 6am; y-axis: percentage of check-ins]

Fig. 5: Stacked plot of the 10 most popular categories over weekdays and weekends in Rochester. Categories are listed in the order of increasing probability from the top down. The width of each band indicates the percentage of a POI category for a given time of day.

5 MINING LIFESTYLES WITH MATRIX AND TENSOR DECOMPOSITION

5.1 Matrix Decomposition

The activities of a user, $a$, can be described with an $M$-dimensional vector. For temporal patterns we set $M$ to 24, and the values in the vector are the activities the user performed in each hour, i.e. the number of check-ins. When we examine spatial patterns latent in individuals' activities, $M$ is set to be equal to the number of POI categories, and each element indicates the amount of check-ins the person performed in a single POI category. We refer to vector $a$ as an "activity vector" of a user.

We assume that a person's activities are determined by the lifestyle(s) that person lives. Formally,

$$a = w \times L$$

where $L$ is a $k$ by $M$ matrix, recording $k$ latent lifestyles, and $w$ is a coefficient vector of $k$ dimensions, indicating the user's preference for each lifestyle.

To uncover and compare lifestyles that are commonly followed in different cities, first we assemble the activity vectors of residents into a single matrix for each city. We define

$$A_{roc} = (a_1, a_2, \ldots, a_{N_{roc}})^T$$

where $a_i$ indicates the activity vector of a resident, and $N_{roc}$ is the number of samples we collected from the Greater Rochester area. Similarly,

$$A_{nyc} = (a_1, a_2, \ldots, a_{N_{nyc}})^T$$

where $N_{nyc}$ is the number of samples we collected from the Greater New York area. Second, we concatenate $A_{roc}$ and $A_{nyc}$ to obtain a complete matrix

$$A = (A_{roc}, A_{nyc})^T$$

where $A$ is a $(N_{roc} + N_{nyc})$ by $M$ matrix.

[Figure 6, panel (a): New York City weekdays; x-axis: hours of a day, 7am to 6am; y-axis: percentage of check-ins; legend: Café, Pizza Place, Hotel, Residential Building, Building, Home (private), Bar, Office, American Restaurant, Coffee Shop]

[Figure 7 plot: x-axis: hours of a day, 0 to 22; y-axis: normalized weight on each hour; series: The Intermediate, Night Owl, Early Bird]

Fig. 7: Active time ranges of night owls, early birds and intermediates over weekdays.

At a finer granularity, $W$ consists of four even smaller matrices: $W = (W_{roc\,male}, W_{roc\,female}, W_{nyc\,male}, W_{nyc\,female})^T$. For a particular group of people, i.e. a component matrix, the degree of preference for a lifestyle is defined as the average of the coefficients of people in the group for the lifestyle.
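To make the factorization of a stacked activity matrix $A$ into a coefficient matrix $W$ and a lifestyle matrix $L$ concrete, here is a dependency-free sketch of non-negative matrix factorization together with the group-level averaging of coefficients. It uses the classic multiplicative-update rule as the solver, which is an assumption made here: the paper does not say which NMF solver was used, and all function names are hypothetical.

```python
import random


def matmul(X, Y):
    """Multiply two dense matrices given as lists of rows."""
    Yt = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in Yt] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def nmf(A, k, iters=500, eps=1e-9, seed=0):
    """Factor a non-negative N x M matrix A into W (N x k) and L (k x M), both >= 0.

    Multiplicative updates (assumed solver) decrease 0.5 * ||A - W L||_F^2.
    """
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    L = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # L <- L * (W^T A) / (W^T W L), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), L)
        L = [[L[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (A L^T) / (W L L^T), element-wise
        Lt = transpose(L)
        num, den = matmul(A, Lt), matmul(W, matmul(L, Lt))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return W, L


def frobenius_error(A, W, L):
    """Frobenius norm of the residual A - W L."""
    WL = matmul(W, L)
    return sum((a - b) ** 2 for ra, rb in zip(A, WL) for a, b in zip(ra, rb)) ** 0.5


def group_preference(W, rows):
    """Average coefficient per lifestyle over the users (row indices) of one group."""
    return [sum(W[r][i] for r in rows) / len(rows) for i in range(len(W[0]))]
```

With rows of `W` ordered by city and gender, `group_preference(W, rows)` applied to the row indices of one group yields that group's average weight on each lifestyle.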
[Figure 6, panel (b): New York City weekends; x-axis: hours of a day, 7am to 6am; y-axis: percentage of check-ins; legend: Café, Pizza Place, Residential Building, Hotel, Building, Home (private), American Restaurant, Coffee Shop, Bar, Office]

Fig. 6: Stacked plot of the 10 most popular categories over weekdays and weekends in New York City. Categories are listed in the order of increasing probability from the top down. The width of each band indicates the percentage of a POI category for a given time of day.

Third, we decompose $A$ into two matrices $W$ and $L$. $W$ is an $N$ by $k$ coefficient matrix, with $N = N_{roc} + N_{nyc}$, while $L$ is the lifestyle matrix we explained above. Since non-negative matrix factorization (NMF) usually leads to interpretable results [2], we applied it to complete the decomposition. Formally, we solve the following optimization problem:

$$\min_{W,L} \frac{1}{2} \|A - WL\|_F^2 \quad \text{s.t. } L \ge 0,\ W \ge 0$$

where $A \in \mathbb{R}^{(N_{roc}+N_{nyc}) \times M}$, $W \in \mathbb{R}^{(N_{roc}+N_{nyc}) \times k}$, $L \in \mathbb{R}^{k \times M}$. $\|X\|_F = (\sum_{i,j} |X_{ij}|^2)^{1/2}$ is the Frobenius norm, and $L \ge 0$ (or $W \ge 0$) requires that all components of $L$ (or $W$) be nonnegative. $L$ uncovers the lifestyles that people follow, while $W$ provides information about individuals' preferences across these lifestyles.

After decomposition, we split $W$ into smaller matrices, each of which records a sample of lifestyles for various groups of people. On a city level, $W$ is split into two smaller matrices, $W = (W_{roc}, W_{nyc})^T$, where $W_{roc} \in \mathbb{R}^{N_{roc} \times k}$ and $W_{nyc} \in \mathbb{R}^{N_{nyc} \times k}$.

For example, the preference for the $i$th lifestyle of residents of New York City is calculated by averaging the $i$th column of matrix $W_{nyc}$.

In the following sections, we report the temporal and spatial lifestyles found from people's activities, and compare the preferences for lifestyles in the two cities.

5.2 Third-Order Tensor Decomposition

User activities may be analyzed across multiple dimensions simultaneously using higher-order tensors. Tensors are a natural way to aggregate data across multiple factors. Vectors and matrices are special cases of tensors: for each vector $v$ of dimensionality $D$, it is true that $v \in \mathbb{R}^{D}$, and for each matrix $M$ of dimensionality $D_1$ by $D_2$, $M \in \mathbb{R}^{D_1 \times D_2}$. Tensors generalize this to data structures of arbitrary order, where vectors are of order 1 and matrices of order 2. A tensor $T$ of order $d$ may be concisely described as $T \in \mathbb{R}^{D_1 \times D_2 \times \ldots \times D_d}$. To learn temporal-spatial patterns of human activities, for example, we can aggregate the data into a third-order tensor. In such a tensor, a person's activities are recorded as a matrix whose dimensions are POI categories and hours of a day. Decomposition of the tensor produces multidimensional knowledge on lifestyles. In Fig. 8, we illustrate the tensor decomposition process.

The most commonly used technique for tensor decomposition is CANDECOMP/PARAFAC (CP) decomposition [30]. This algorithm decomposes a tensor of order $d$ into $d$ separate matrices, each of dimensionality $k \times D_j$, where $k$ is the number of components selected a priori and $D_j$ is the dimension for the tensor's $j$th order.

Fig. 8: Visualization of tensor cube $T$, decomposed into component matrices $W$, $L_M$, and $L_P$.

the work presented here, we found no significant differences in mean component weights across these demographics.

The full tensors $T_i = \{t_1, t_2, \ldots, t_N\}$ that we consider concatenate the matrices $t$ across all users. In this work, we present an analysis of two third-order tensors $T_1$ and $T_2$, such that $T_i \in \mathbb{R}^{N \times M \times P}$. $N$ indexes check-in counts by user id and $P$ indexes by category. Only the 100 categories with the highest number of check-ins are used, so $P = 100$ for both $T_1$ and $T_2$. $T_1$ indexes by times of day as well, so $M = 24$. $T_2$ indexes instead by days of the week, so $M = 7$.

The first tensor, $T_1 \in \mathbb{R}^{N \times 24 \times 100}$, will allow us to examine joint spatial-temporal lifestyles, indicative of users' locational behavior at various times of the day. Trivially, we might find components of user check-in at bars and pubs, with greater weight assigned to night hours than to the morning or afternoon.
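As an illustration of the tensor layout described here (users by hours by top categories) and of what a CP model with component matrices $W$, $L_M$ and $L_P$ represents, the sketch below builds such a tensor from raw check-ins and reconstructs a tensor from given factors as a sum of rank-one terms. It does not fit the factors. The function names, the input tuples, and the alphabetical tie-break among equally frequent categories are assumptions made for this example.

```python
from collections import Counter


def build_tensor(checkins, num_slots=24, top_p=100):
    """Aggregate check-ins into a user x time-slot x category count tensor.

    checkins: iterable of (user_id, slot, category), slot in range(num_slots)
              (hour of day for M = 24, or day of week for M = 7).
    Keeps only the top_p most frequent categories.
    Returns (tensor, users, categories) with tensor[n][m][p] as nested lists.
    """
    checkins = list(checkins)
    counts = Counter(cat for _, _, cat in checkins)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_p]
    categories = [cat for cat, _ in ranked]
    cat_index = {cat: p for p, cat in enumerate(categories)}
    users = sorted({user for user, _, _ in checkins})
    user_index = {user: n for n, user in enumerate(users)}
    tensor = [[[0] * len(categories) for _ in range(num_slots)] for _ in users]
    for user, slot, cat in checkins:
        if cat in cat_index:
            tensor[user_index[user]][slot][cat_index[cat]] += 1
    return tensor, users, categories


def cp_reconstruct(W, L_M, L_P):
    """CP model: T_hat[n][m][p] = sum_r W[n][r] * L_M[r][m] * L_P[r][p].

    W is N x k, L_M is k x M, L_P is k x P (k components).
    """
    k = len(L_M)
    return [[[sum(W[n][r] * L_M[r][m] * L_P[r][p] for r in range(k))
              for p in range(len(L_P[0]))]
             for m in range(len(L_M[0]))]
            for n in range(len(W))]
```

Calling `build_tensor` with `num_slots=7` and day-of-week slots would give the second tensor layout instead of the hourly one.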
The second tensor, $T_2 \in \mathbb{R}^{N \times 7 \times 100}$, will be conducive to locational lifestyles with distinct trends across the work week and through the weekend. For example, we might see lifestyles of individuals visiting restaurants and entertainment venues later in the week, with less weight assigned to Monday, Tuesday, and Wednesday.

The formal optimization problem for this decomposition is:

$$\min_{W, L_M, L_P} \|T - W (L_M \odot L_P)^\top\|$$

In this equation, $\odot$ represents the Khatri-Rao product. The Khatri-Rao product may be considered as a column-wise Kronecker product $\otimes$ between two matrices with equal numbers of columns $A = [a_1, a_2, a_3]$ and $B = [b_1, b_2, b_3]$, where $A \odot B = [a_1 \otimes b_1, a_2 \otimes b_2, a_3 \otimes b_3]$. While $A$ and $B$ both have 3 columns here, this can be generalized to any number of columns. Assuming that $A \in \mathbb{R}^{M \times K}$ and $B \in \mathbb{R}^{N \times K}$, the Khatri-Rao product matrix will be of dimensionality $MN \times K$.

To solve the optimization problem for CP decomposition, we use the alternating least-squares (ALS) algorithm, originally proposed in [31], [32]. The scikit-tensor toolkit (footnote 2) [...] As the ALS algorithm is subject to getting trapped in local minima, ALS is not guaranteed to find an optimal solution, and results may depend heavily on initialization. In our experience, both singular vector and random initializations converged to similar decompositions with similar error rates when using a termination condition of $10^{-5}$ error improvement between iterations.

Footnote 2. https://github.com/mnick/scikit-tensor

Similar to our matrix decomposition methodology, we assume that individuals' check-in activity may be decomposed into a weighted combination of lifestyle factors stored in a matrix:

$$t = w (L_M \odot L_P)^\top$$

where $L_M \in \mathbb{R}^{k \times M}$ and $L_P \in \mathbb{R}^{k \times P}$, each recording $k$ latent lifestyles, and where again $w \in \mathbb{R}^{k}$ is a coefficient vector for a single user. $L_M$ reveals the first dimensional characteristics of each lifestyle component, and $L_P$ the second dimensional characteristics. As with our matrix decomposition framework, we consider weight matrix $W$ as a concatenation of four smaller matrices according to city and gender. However, in

6 TEMPORAL ASPECTS OF LIFESTYLE

[Figure 9 plot: x-axis: hours of a day, 0 to 22; y-axis: normalized weight on each hour; series: The Intermediate, Night Owl, Early Bird]

Fig. 9: Active time ranges of night owls, early birds and intermediates over weekends.

Analogous to ETs, MTs, and NTs in the circadian topology literature, we classify people's work and rest habits into three categories: night owls, people who tend to stay up until late at night; early birds, people who usually get up early in the morning and go to bed early in the evening; and intermediates, people who have schedules between night owls and early birds [9].

Interestingly but not surprisingly, our approach provides support for these three common temporal lifestyles. Moreover, we are able to provide a precise description of activity level along time of day for each lifestyle. We study weekdays and weekends separately to gain a better understanding of people's lives.

Let $A_{weekday}$ be a $(N_{roc} + N_{nyc})$ by $M$ matrix, where $M$ equals 24. A component $a_{ij}$ in the matrix denotes the $i$th user's total number of check-ins during the $j$th hour of weekdays. Similarly, $A_{weekend}$, also a $(N_{roc} + N_{nyc})$ by $M$ matrix, records the activities of users on weekends. We set $k$ to 3 to align with the number of predefined categories, and then employ matrix decomposition on $A_{weekday}$ and $A_{weekend}$, respectively. The results are $L_{weekday}$ and $W_{weekday}$ for $A_{weekday}$; $L_{weekend}$ and $W_{weekend}$ for $A_{weekend}$. We first plot the result matrices $L_{weekday}$ and $L_{weekend}$ in Fig. 7 and Fig. 9.

[Figure 10 plot: x-axis: city and gender (ROC_male, ROC_female, NYC_male, NYC_female); y-axis: average weight on each temporal lifestyle; series: Early Bird, The Intermediate, Night Owl]

Fig. 10: Average weight on night owls, early birds and intermediates of male and female residents of a small and big city over weekends.

Early birds: On weekdays, early birds' days start around 7 am, and they are most active around noon.

TABLE 2: Comparison between our method and traditional methods

                    MEQ                  Our results
Early Bird
  Get up            5:00am-7:45am        6:00am-8:00am
  Most active       5:00am-10:00am       7:00am-2:00pm
  Go to bed         8:00pm-10:15pm       8:00pm-10:00am
Intermediate
  Get up            7:45am-9:45am        8:00am-10:00am
  Most active       10:00am-5:00pm       2:00pm-8:00pm
  Go to bed         10:00pm-12:30am      10:00pm-12:00am
Night Owl
  Get up            9:45am-12pm          10:00am-12:00pm
  Most active       5:00pm-5:00am        9:00pm-1:00am
  Go to bed         12:00am-3:00am       3:00am-6:00am

the active level grows faster after being activated (11 am) and remains high till the peak of night (11 pm).

The results extracted through data agree with the time [...] time range of each lifestyle decomposed from our data using NMF. We define these three time ranges as follows: starting from the early morning (5 am), the time range of the first ~15% of activities is defined as "get up", the next ~70% is defined as "most active" and the final ~15% is defined as "go to bed". These percentages are not exact, and a small amount of activity is present between the "go to bed" and "get up" time ranges. For most time ranges, our results generally agree with those from previous work, e.g. the "get up" time range and the "go to bed" time range for the Early Bird and Intermediate lifestyles. All the "most active" ranges in our findings are later than the previous assessments, and the "go to bed" range for Night Owl is much later than that of the evening-type in previous work.
After that, Our explanation for these differences is twofold: Firstly, we their activities decrease, and then vanish gradually in the believethatasageneraltrendpeople’sactivitiesshiftalotinto night around 10 pm. The distribution of activity over time night in modern times when compared with the year when in weekend is similar for early birds, except the increase and the seminal previous work was done (1976) [9]. Secondly, decrease of activities in weekend is faster, which leads to a individuals’behaviorsinourmodelaremodeledbyaweighted sharper peak around 12 pm. combination of “lifestyles”, whereas in the MEQ paradigm, Night Owls: For night owls, we observe two active periods an individual is assigned to a single, discrete “type”. As in a day. Their activities start from 10 am for both weekdays a consequence, our “lifestyle” patterns should capture more and weekends. We observe the first small peak appears at 2 distinctivework-restactivitiesasingleindividualmightfollow pm, but is not comparable to their active level during night. as a subset of all his or her behaviors, whereas each MEQ After the inactive daytime, the activities of night owl rocket “type” should capture the aggregate work-rest patterns for all from 10 pm and achieve maximum at late night (1 am in of an individual’s behaviors. For example, an individual in weekdays and 2 am weekends). Their activities vanish in the our model might be a “night owl” on the weekends and an early morning (6 am). “early bird” on the weekdays, while in the MEQ model this Intermediates: People who are neither early birds nor night individualwouldbeconsideredaseitheramorning-typeoran owls usually start their day in the late morning (10 am). On evening-type. weekdays,theactivelevelincreasesgraduallyintheafternoon W and W indicate the preference of each user weekday weekend and reaches the peak around 10 pm, and rapidly decreases to to three lifestyles on weekday and weekend, respectively. For zero around 2 am. 
On weekends, they are more active during eachmatrix,wefirstsplititto4smallermatricesaccordingto afternoon than on weekdays. Instead of a gradual increase, user’scity(ROCorNYC)andgender.Secondly,wecalculated 9 theaveragepreferenceofeachgrouptoeachlifestyle.Weplot a spatial activity matrix A in this case. A is a (N + N ) roc nyc the results in Fig. 10 and Fig. 11. by M matrix, where M is the number of categories of POIs. The decomposition generates two result matrices L and W. L records the spatial lifestyles that are lived by the people in different cities, W contains the information of the preference of each resident to these spatial lifestyles. k is empirically Early Bird 0.50 TNhigeh Itn Otewrmlediate set to 5 to achieve a good tradeoff between granularity and interpretability. mporal Lifestyle 0.400.45 FanodWreeaascrsehipgponarttatethrnneamwlifeeesltitosytletthhseeextpotarpattc5etrewnd.efiWgrohemtedctahcneatdseeagntoasreiiensthToeafbPclelOea3Isr. Weight on each Te 0.300.35 cTcCoaaotnkelnlegeeogcpretiaieotLstneaarbbnr.eetTowCnhoeeielslneaigtssheeaRaPnecOsoieImdxceamanmtcoeepngloHemr.aieolTlsb,hiwCleiitotyht-owinppoaarttkthheiirrndengdeeoSnwfpapeccaiogetlthleaetrnegndde. Average 0.25 stotudeexnetrsc.isPea,ttwerhnerseevtehnedteospcrtihbreeseacpaatettgeornrieosf pareeopGleymwh,oYloigkea 0.20 Studio and Athletics & Sports. For pattern ten, the top three weightedcategoriesareTrainStation,SubwayandTrain.This 0.15 is a typical movement pattern of people who commute a lot. ROC_male ROC_female NYC_male NYC_female It’s natural to see one’s behaviors as a combination of City & Gender several lifestyles. For example, for a college student we may Fig. 11: Average weight on night owls, early birds and find he/her lives first lifestyle in Table 3 (College lifestyle) as intermediates of male and female residents of a small and well as the third (Bar & Pub lifestyle) and the seventh (Sport big city over weekdays. lifestyle) with different weights. 
By “weight” we mean the importanceofacertainlifestyletoone’sdailylife.RowsinW, indicatedbyw,aresuchweightvectors.w isa15dimensional i i Generally speaking, the average preference to night owl vector, indicating the quantified preferences of i user to 15 th lifestyleonweekendsissignificantlyhigherthanonweekdays. spatial lifestyles. In order to gain a group level understanding Correspondingly, the average weight of early bird and the oflifestyles,weemployaclusteringmethodonws.Thecenter i intermediate style is significantly lower on weekends. This of each cluster denotes the mean lifestyle combination for a indicates that people in both cities are more willing to stay group of individuals. Moreover, by analyzing the component active late on weekends, while getting up early on weekdays. of a group we are able to determine the tastes for lives of People live in big cities usually have higher preference to residentscitiesofdifferentsize.Wesetthenumberofclusters the night owl type for both males and females, while people to 5 empirically. tend to have higher weights on the early bird type in small cities. This observation suggests that big cities are more active than small cities during night. We did not observe significantdifferencebetweentwogendersontheirpreference to temporal lifestyles in both two cities. This agrees with previous work [10], in which the authors verify that daily activity levels generally are not biased by gender. 6.1 SpatialAspectsofLifestyles Individuals preferences towards specific locations are another important indicator to their lifestyles. These lifestyles can be described as combinations of several specific POI categories. 
For example, we observed the co-occurrence of Home, Gro- cery and Gas Station in many people’s visiting records; we could have the feeling that the people performed this pattern are “home-originated”, since the places they visited are quite relatedtodailylife.Anothercommonlyobservedcombination is Bar, Pub and Music Venue; people following this pattern clearly tend to have much excitement (alcohol) in their daily Fig.12:Componentsandcorrespondingpercentageoflifestyle life. These movement patterns are conducive to understanding 1 and the people in this lifestyle. individuals’ lifestyle preferences. We also employ the NMF method to detect these hidden patterns. Instead of a temporal activity matrix, we decompose We plot the components of two of groups of people and 10 HiddenPatterns 1stcategory 2ndcategory 3rdcategory 4thcategory 5thcategory 1,College ResidenceHall Co-workingSpace Lab RecCenter WineBar 2,Restaurant AmericanRestaurant GroceryStore Supermarket FastFood Diner 3,Bar&Pub Bar MusicVenue nightclub Lounge RockClub 4,Office Office Co-workingSpace Building Conf.Room Bar 5,Home&Grocery Home(private) Supermarket GroceryStore Drugstore Church 6,Entertainment Arts&Entertainment BaseballStadium Bar BurgerJoint ConcertHall 7,Sports Gym YogaStudio Athletics&Sports Spa FitnessCenter 8,Park&Outdoor Park Neighborhood ScenicLookout Plaza Beach 9,Hotel&Bar Hotel Lounge CocktailBar RoofDeck Airport 10,Commute TrainStation Subway Train Platform BusStation TABLE 3: 15 Hidden patterns with their assigned names and top 5 weighted categories of POIs of each pattern. the percentages of males and females from both cities. Note multiple theoretically distinct lifestyle patterns. With higher that all ratios are normalized by the number of users in each k, we run into issues of redundancy, where multiple highly city. For the people who are in the first group (Fig. 12), home similarlifestylepatternsareextracted,andlowinterpretability, is the absolute center of their life. 
Additionally, this group where some extracted patterns include very disparate behav- is comprised more of people from small cities (56%) than iors. For both T and T , we tested a wide range of possible 1 2 of those from large cities (44%). People in the second group k values. (Fig. 13) tend to visit office and entertainment venues more Individualswithfewercheck-insthansomethresholdhwere often. People in large cities (73%) prefer this lifestyle more pruned to remove outlier noise. In our experiments, we found thaninsmallcities(27%).Theseresultssuggestthatforpeople very little difference between many components when h = 5 in small cities, home is a prominent location in life; while in fromwhenh=30.However,withh=5,somepatternsemerge large cities, people tend to spend more time at office. more apparently, and components are more distinct overall. Fig. 14: Component weights L by hour for tensor T ; 5 out M 1 of k=12 total components are shown. Component labels are addedaposteriori,andweightsshownarenormalizedbymin- maxnormalization.Eachcurveintheupperpartrepresentsthe trend of a POI category along 24 hours of a day. Fig.13:Componentsandcorrespondingpercentageoflifestyle 2 and the people in this lifestyle. 7.1 Time-of-Day&Location T offers the most interpretable results of third-order tensor 1 7 COMPOSITE ASPECTS OF LIFESTYLES decomposition, most clearly when k = 12, shown in Fig. 14. Some lifestyle patterns may be seen a combination of in- Across a wide range of values for k, we see a few distinct dividuals’ daily, weekly, and spatial habits. In this section, lifestyle patterns emerge. One component assigns highest we consider the analysis of tensors T and T . Each tensor weight to the Arts & Entertainment category and significantly 1 2 T ∈ RN×M×P may be factorized into any number of compo- lowerforallothers,withtime-of-daybeginningaround10am, i nents k = [2,min{N,M,P}] . 
Recall that for both T and T , peaking at 9pm, and tailing off in the hours following mid- 1 2 N indexes check-in counts by user id and P=100 indexes by night. This matches the intuitive assumption that individuals category. For T , M =24 for indexing by time of day, and for usually visit these sort of venues later in the day, primarily 1 T , M =7 for days in a week. around evening and night times. The next component assigns 2 There is a significant trade off that exists with choosing highestweighttotheHighSchoolcategory,significantlylower tuning parameter k: with a smaller k, fewer lifestyle patterns for all others, with time-of-day peaking significantly at 7am, may be identified, and some components may be mixtures of and some additional weight for the hours of 8am - 3pm. This